<!--
-SPDX-FileCopyrightText: 2024 Lady <https://www.ladys.computer/about/#lady>
+SPDX-FileCopyrightText: 2024, 2025 Lady <https://www.ladys.computer/about/#lady>
SPDX-License-Identifier: CC0-1.0
-->
# 💄📝 Les·M·L
The first line of any 💄📝 Les·M·L document should be the string
`#!lesml`.
-
-Following the shebang, document metadata may be provided in the [Record
- Jar][draft-phillips-record-jar-01] format.
+A language tag may follow this, beginning with `@` and terminated with
+ `$`, like so:
+`#!lesml@en$`.
+Regardless of whether a language tag is present, the shebang line may
+ be terminated by a space‐separated list of properties of the form
+ `key=value`.
+Only one property is currently permitted: `profile`, whose value should
+ be a U·R·I and is translated to the `@data-lesml-profile` attribute
+ on the resulting `<html:article>` element.
+
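As a rough illustration of the shebang line described above, the following Python sketch splits it into a language tag and a property map. The function name `parse_shebang`, the regular expression, and the assumption that properties are separated by single spaces are all hypothetical; the actual parser is the X·S·L·T in this repository and may differ in details.

```python
import re

# Hypothetical grammar for the shebang line, inferred from the prose above:
# "#!lesml", an optional "@tag$", then zero or more " key=value" properties.
SHEBANG = re.compile(
    r"^#!lesml"
    r"(?:@(?P<lang>[^$\s]*)\$)?"
    r"(?P<props>(?: [^\s=]+=\S*)*)\s*$"
)


def parse_shebang(line):
    """Return {'lang': ..., 'props': {...}} or None if the line is not a shebang."""
    match = SHEBANG.match(line)
    if match is None:
        return None
    # Split each "key=value" on the first "=" only, so values may contain "=".
    props = dict(p.split("=", 1) for p in match.group("props").split())
    return {"lang": match.group("lang"), "props": props}


result = parse_shebang("#!lesml@en$ profile=https://example.com/profile")
print(result)
```

This sketch does not check that `profile` is the only property given; a stricter version could reject any other key.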
+Following the shebang line, document metadata may be provided in the
+ [Record Jar][draft-phillips-record-jar-01] format.
The body of the document begins after the last line which begins with
the string `%%`, or after the shebang line if none exists.
+Multiple documents can be catenated into a single file; a new document
+ is begun on any line which starts with `#!lesml` or `##`.
+Documents in the latter case inherit the latest preceding `#!lesml`
+ declaration.
+`##` may be followed by other text; this is treated as an interdocument
+ comment.
+
Documents are broken into paragraphs by blank lines.
-Non·empty paragraphs are classified as follows :—
+Empty paragraphs are ignored.
-- If the paragraph consists of only the characters
- `#*-=_~⁂─━┄┅┈┉╌╍═╴╶╸╺☙❧` plus any amount of white·space, then it is
- considered to be a section break (`<html:hr>`).
+If every line in the paragraph begins with (optional white·space
+ followed by) `»` it is quoted (`<html:blockquote>`); if every line
+ begins with `]` it is bracketed.
+The lines, minus this leading prefix, are then re‐analysed.
+Bracketed paragraphs which end quotes are treated as captions
+ (`<html:figcaption>`); otherwise, they are footers (`<html:footer>`).
-- If every line in the paragraph begins with at least one space, then
- it is considered to be a quoted paragraph (`<html:blockquote>`).
- There is only one level of paragraph quoting; quoted paragraphs may
- not be quoted again.
+Non·empty paragraphs are classified as follows :—
-- Otherwise, the paragraph is unquoted.
+- If the paragraph consists of only the following section‐break
+ characters, plus any amount of white·space, then it is
+ considered to be a section break (`<html:hr>`).
-After this classification, each quoted or unquoted paragraph is further
+ The section break characters are :—
+
+ | Character | Codepoint | Unicode Name |
+ | --------- | --------- | ------------ |
+ | `*` | `U+002A` | `ASTERISK` |
+ | `-` | `U+002D` | `HYPHEN-MINUS` |
+ | `.` | `U+002E` | `FULL STOP` |
+ | `=` | `U+003D` | `EQUALS SIGN` |
+ | `_` | `U+005F` | `LOW LINE` |
+ | `~` | `U+007E` | `TILDE` |
+ | `·` | `U+00B7` | `MIDDLE DOT` |
+ | `․` | `U+2024` | `ONE DOT LEADER` |
+ | `‥` | `U+2025` | `TWO DOT LEADER` |
+ | `…` | `U+2026` | `HORIZONTAL ELLIPSIS` |
+ | `⁂` | `U+2042` | `ASTERISM` |
+ | `⋯` | `U+22EF` | `MIDLINE HORIZONTAL ELLIPSIS` |
+ | `─` | `U+2500` | `BOX DRAWINGS LIGHT HORIZONTAL` |
+ | `━` | `U+2501` | `BOX DRAWINGS HEAVY HORIZONTAL` |
+ | `┄` | `U+2504` | `BOX DRAWINGS LIGHT TRIPLE DASH HORIZONTAL` |
+ | `┅` | `U+2505` | `BOX DRAWINGS HEAVY TRIPLE DASH HORIZONTAL` |
+ | `┈` | `U+2508` | `BOX DRAWINGS LIGHT QUADRUPLE DASH HORIZONTAL` |
+ | `┉` | `U+2509` | `BOX DRAWINGS HEAVY QUADRUPLE DASH HORIZONTAL` |
+ | `╌` | `U+254C` | `BOX DRAWINGS LIGHT DOUBLE DASH HORIZONTAL` |
+ | `╍` | `U+254D` | `BOX DRAWINGS HEAVY DOUBLE DASH HORIZONTAL` |
+ | `═` | `U+2550` | `BOX DRAWINGS DOUBLE HORIZONTAL` |
+ | `╴` | `U+2574` | `BOX DRAWINGS LIGHT LEFT` |
+ | `╶` | `U+2576` | `BOX DRAWINGS LIGHT RIGHT` |
+ | `╸` | `U+2578` | `BOX DRAWINGS HEAVY LEFT` |
+ | `╺` | `U+257A` | `BOX DRAWINGS HEAVY RIGHT` |
+ | `☙` | `U+2619` | `REVERSED ROTATED FLORAL HEART BULLET` |
+ | `❧` | `U+2767` | `ROTATED FLORAL HEART BULLET` |
+ | ` ` | `U+3000` | `IDEOGRAPHIC SPACE` |
+ | `・` | `U+30FB` | `KATAKANA MIDDLE DOT` |
+ | `*` | `U+FF0A` | `FULLWIDTH ASTERISK` |
+ | `-` | `U+FF0D` | `FULLWIDTH HYPHEN-MINUS` |
+ | `.` | `U+FF0E` | `FULLWIDTH FULL STOP` |
+ | `=` | `U+FF1D` | `FULLWIDTH EQUALS SIGN` |
+ | `_` | `U+FF3F` | `FULLWIDTH LOW LINE` |
+ | `~` | `U+FF5E` | `FULLWIDTH TILDE` |
+
+- If every line in the paragraph begins with zero or more white·space
+ characters followed by `|`, it is a “preformatted” paragraph and
+ white·space is not collapsed (`<html:pre>`).
+
+- Otherwise, the paragraph is ordinary.
+
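The three-way classification above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `SECTION_BREAK_CODEPOINTS` and `classify` are hypothetical, Python's `str.isspace` stands in for the notion of white·space, and a paragraph with no non-white·space characters is treated as empty rather than as a section break.

```python
# Codepoints copied from the section-break table above (35 entries).
SECTION_BREAK_CODEPOINTS = [
    0x002A, 0x002D, 0x002E, 0x003D, 0x005F, 0x007E, 0x00B7,
    0x2024, 0x2025, 0x2026, 0x2042, 0x22EF,
    0x2500, 0x2501, 0x2504, 0x2505, 0x2508, 0x2509, 0x254C, 0x254D, 0x2550,
    0x2574, 0x2576, 0x2578, 0x257A,
    0x2619, 0x2767, 0x3000, 0x30FB,
    0xFF0A, 0xFF0D, 0xFF0E, 0xFF1D, 0xFF3F, 0xFF5E,
]
SECTION_BREAK_CHARS = {chr(cp) for cp in SECTION_BREAK_CODEPOINTS}


def classify(paragraph):
    """Classify a non-empty paragraph as section-break, preformatted, or ordinary."""
    # Section break: only break characters and white-space, with at least
    # one character left after stripping white-space (an assumption).
    if paragraph.strip() and all(
        ch in SECTION_BREAK_CHARS or ch.isspace() for ch in paragraph
    ):
        return "section-break"
    # Preformatted: every line is optional white-space followed by "|".
    lines = paragraph.splitlines()
    if lines and all(line.lstrip().startswith("|") for line in lines):
        return "preformatted"
    return "ordinary"


for sample in ["* * *", "| a\n  | b", "Hello, world."]:
    print(repr(sample), classify(sample))
```

Quoted and bracketed prefixes are assumed to have been removed before this step, as described earlier.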
+After this classification, each ordinary paragraph is further
classified by type based on its first character (which must be
- followed by white·space to be recognized) :—
+ followed by white·space or a pilcrow, or else be the only thing on the
+ line) :—
+
+- If the paragraph is preformatted, it is an ordinary paragraph.
- If the paragraph begins with `⁌`, it is a chapter heading
(`<html:h1>`).
- If the paragraph begins with `⚠︎`, it is a warning note
(`<html:div role="note" class="warn">`).
-- If the paragraph begins with `⋯`, it is a continuation paragraph
- (`<html:div class="continuation">`).
- Continuation paragraphs may be used to continue a preceding list item
- or quote.
- Note, however, that an unquoted paragraph cannot continue a quoted
- one, or vice·versa.
+- If the paragraph begins with `#`, it is a comment.
+ Comments produce X·M·L comment nodes and can be used to break up list
+ items into separate lists.
+
+- If the paragraph begins with `⋯`, it is a continuation paragraph.
+ Continuation paragraphs may be used to continue a preceding div or
+ list item.
+ If there is no such preceding div or list item, they will attach to
+ adjacent heading elements to form heading groups (`<html:hgroup>`).
+ Otherwise, they will be treated as ordinary paragraphs.
- Otherwise, it is an ordinary paragraph.
-Following this sigil (if any, including trailing white·space) there may
- be a `¶` followed by zero or more non·white·space characters.
+Following this sigil (if any) there may be a `¶` followed by zero or
+ more non·white·space characters.
The characters following the `¶` give the identifier for the paragraph,
which is expected to be unique within a document.
+This may be suffixed with a language tag beginning with `@` and
+ terminated with `$`.
The remaining characters in a paragraph form its contents.
Markup within paragraphs is delimited with·out exception by pairs of
characters, with the following precedence :—
+- The characters `⌦` and `⌫` indicate inline comments.
+ A single character `⌧` may be used to indicate an “empty” comment
+ (consisting of `U+034F COMBINING GRAPHEME JOINER` for X·M·L
+ compatibility).
+
+- The characters `{@` and `"}` indicate attribute specifications.
+ The attribute specification must contain at least one `="` which
+ separates the key of the attribute from the value.
+ Attributes attach to the previous element or text node, with
+ white·space‐only text nodes after elements ignored; if there is no
+ such previous element or text node, an empty text node is used
+ instead.
+ Multiple attributes can be given in sequence using multiple
+ specifications.
+ Text nodes with attributes are wrapped in `<html:span>`.
+
- The characters `{🔗` and `>}` indicate a hyperlink to a U·R·L
(`<html:a>`).
The hyperlink must contain at least one `<`; the content before the
- The characters `⸨` and `⸩` indicate parenthetical content
(`<html:small>`).
-- The characters `☞︎` and `☜︎` indicate strong importance
- (`<html:strong>`).
-
-- The characters `⹐` and `⹑` indicate emphasis (`<html:em>`).
+- The characters `` ` `` and `´` indicate code (`<html:code>`).
- The characters `⟪` and `⟫` indicate titles (`<html:cite>`).
+- The characters `⸶` and `⸷` indicate names (`<html:u class="name">`).
+
- The characters `⟨` and `⟩` indicate offset text (`<html:i>`).
- This may be followed by a `@`, a language tag, and a `$` to provide
- the language of the text.
- The characters `⦃` and `⦄` indicate keyword highlighting
(`<html:b>`).
-- The characters `` ` `` and `´` indicate code (`<html:code>`).
+- The characters `☞︎` and `☜︎` indicate strong importance
+ (`<html:strong>`).
+
+- The characters `⹐` and `⹑` indicate emphasis (`<html:em>`).
Once the tree is built as above, it is remediated into its final form
by the following steps :—
-- Successive quoted paragraphs are joined into one quote.
- If the final quoted paragraph is an ordinary paragraph which begins
- with `—` and a space, the quote is wrapped in a `<html:figure>`
- and the final paragraph becomes its `<html:figcaption>`.
-
- Continuation paragraphs are joined with the preceding list items or
- quotes.
+ divs.
- List items of a higher level are nested in preceding list items, when
present.
- Successive list items of the same level and class are joined into
a single list.
+- Linebreaks in preformatted paragraphs are replaced with `<html:br>`.
+
Finally, any character can be escaped by instead providing its Unicode
- codepoint in the form `<U+NNNN>`, where `NNNN` is one or more
+ codepoint in the form `{U+NNNN}`, where `NNNN` is one or more
hexadecimal digits.
Multiple codepoints may be provided separated by periods, as in
- `<U+WWWW.ZZZZ>`
+ `{U+WWWW.ZZZZ}`.
+Due to limitations in X·S·L·T, characters cannot be escaped in
+ attributes (including link targets).
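
For readers who want to see the escape syntax in action, here is a minimal Python sketch of decoding `{U+NNNN}` sequences. The function name `unescape` and the regular expression are assumptions for illustration, as is the acceptance of lowercase hexadecimal digits; the real handling lives in the X·S·L·T parser.

```python
import re

# One or more hexadecimal codepoints, separated by periods, inside "{U+...}".
ESCAPE = re.compile(r"\{U\+([0-9A-Fa-f]+(?:\.[0-9A-Fa-f]+)*)\}")


def unescape(text):
    """Replace each {U+NNNN} or {U+WWWW.ZZZZ} escape with its characters."""

    def replace(match):
        # chr() raises ValueError for values above U+10FFFF.
        return "".join(chr(int(h, 16)) for h in match.group(1).split("."))

    return ESCAPE.sub(replace, text)


print(unescape("{U+41}{U+42.43}"))
```

Text that merely resembles an escape, such as `{U+}` or `{U+XYZ}`, is left unchanged by this sketch.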
## Usage
-💄📝 Les·M·L is designed for usage with [⛩️📰 书社][Shushe].
+💄📝 Les·M·L is designed for usage with [⛩📰 书社][Shushe].
Simply include the `parser.xslt` provided by this repository to
- ⛩️📰 书社 as an additional parser, and `magic` as an additional
+ ⛩📰 书社 as an additional parser, and `magic` as an additional
magic file.
## License