X-Git-Url: https://git.ladys.computer/LesML/blobdiff_plain/c4b39347e3aa1d302af09b2e2c66df449c33f663..refs/heads/current:/README.markdown?ds=inline

diff --git a/README.markdown b/README.markdown
index 6a21246..ac2d0dc 100644
--- a/README.markdown
+++ b/README.markdown
@@ -1,5 +1,5 @@
 <!--
-SPDX-FileCopyrightText: 2024 Lady <https://www.ladys.computer/about/#lady>
+SPDX-FileCopyrightText: 2024, 2025 Lady <https://www.ladys.computer/about/#lady>
 SPDX-License-Identifier: CC0-1.0
 -->
 # 💄📝 Les·M·L
@@ -28,29 +28,96 @@ It is implemented as an X·S·L·T transformation from a
 
 The first line of any 💄📝 Les·M·L document should be the string
  `#!lesml`.
-
-Following the shebang, document metadata may be provided in the [Record
- Jar][draft-phillips-record-jar-01] format.
+A language tag may follow this, beginning with `@` and terminated with
+ `$`, like so:
+`#!lesml@en$`.
+Regardless of whether a language tag is present, the shebang line may
+ be terminated by a space‐separated list of properties of the form
+ `key=value`.
+Only one property is currently permitted: `profile`, whose value should
+ be a U·R·I and is translated to the `@data-lesml-profile` attribute
+ on the resulting `<html:article>` element.
+
+Following the shebang line, document metadata may be provided in the
+ [Record Jar][draft-phillips-record-jar-01] format.
 The body of the document begins after the last line which begins with
  the string `%%`, or after the shebang line if none exists.
 
+Multiple documents can be catenated into a single file; a new document
+ is begun on any line which starts with `#!lesml` or `##`.
+Documents in the later case inherit the latest preceding `#!lesml`
+ declaration.
+`##` may be followed by other text; this is treated as an interdocument
+ comment.
+
 Documents are broken into paragraphs by blank lines.
-Non·empty paragraphs are classified as follows :—
+Empty paragraphs are ignored.
 
-- If the paragraph consists of only the characters
-  `#*-=_~⁂─━┄┅┈┉╌╍═╴╶╸╺☙❧` plus any amount of white·space, then it is
-  considered to be a section break (`<html:hr>`).
+If every line in the paragraph begins with (optional white·space
+ followed by) `»` it is quoted (`<html:blockquote>`); if every line
+ begins with `]` it is bracketed.
+The lines, minus this leading, are then re‐analysed.
+Bracketed paragraphs which end quotes are treated as captions
+ (`<html:figcaption>`); otherwise, they are footers (`<html:footer>`).
 
-- If every line in the paragraph begins with at least one space, then
-  it is considered to be a quoted paragraph (`<html:blockquote>`).
-  There is only one level of paragraph quoting; quoted paragraphs may
-  not be quoted again.
+Non·empty paragraphs are classified as follows :—
 
-- Otherwise, the paragraph is unquoted.
+- If the paragraph consists of only the following section‐break
+  characters, plus any amount of white·space, then it is
+  considered to be a section break (`<html:hr>`).
 
-After this classification, each quoted or unquoted paragraph is further
+  The section break characters are :—
+
+  | Character | Codepoint | Unicode Name |
+  | --------- | --------- | ------------ |
+  | `*` | `U+002A` | `ASTERISK` |
+  | `-` | `U+002D` | `HYPHEN-MINUS` |
+  | `.` | `U+002E` | `FULL STOP` |
+  | `=` | `U+003D` | `EQUALS SIGN` |
+  | `_` | `U+005F` | `LOW LINE` |
+  | `~` | `U+007E` | `TILDE` |
+  | `·` | `U+00B7` | `MIDDLE DOT` |
+  | `․` | `U+2024` | `ONE DOT LEADER` |
+  | `‥` | `U+2025` | `TWO DOT LEADER` |
+  | `…` | `U+2026` | `HORIZONTAL ELLIPSIS` |
+  | `⁂` | `U+2042` | `ASTERISM` |
+  | `⋯` | `U+22EF` | `MIDLINE HORIZONTAL ELLIPSIS` |
+  | `─` | `U+2500` | `BOX DRAWINGS LIGHT HORIZONTAL` |
+  | `━` | `U+2501` | `BOX DRAWINGS HEAVY HORIZONTAL` |
+  | `┄` | `U+2504` | `BOX DRAWINGS LIGHT TRIPLE DASH HORIZONTAL` |
+  | `┅` | `U+2505` | `BOX DRAWINGS HEAVY TRIPLE DASH HORIZONTAL` |
+  | `┈` | `U+2508` | `BOX DRAWINGS LIGHT QUADRUPLE DASH HORIZONTAL` |
+  | `┉` | `U+2509` | `BOX DRAWINGS HEAVY QUADRUPLE DASH HORIZONTAL` |
+  | `╌` | `U+254C` | `BOX DRAWINGS LIGHT DOUBLE DASH HORIZONTAL` |
+  | `╍` | `U+254D` | `BOX DRAWINGS HEAVY DOUBLE DASH HORIZONTAL` |
+  | `═` | `U+2550` | `BOX DRAWINGS DOUBLE HORIZONTAL` |
+  | `╴` | `U+2574` | `BOX DRAWINGS LIGHT LEFT` |
+  | `╶` | `U+2576` | `BOX DRAWINGS LIGHT RIGHT` |
+  | `╸` | `U+2578` | `BOX DRAWINGS HEAVY LEFT` |
+  | `╺` | `U+257A` | `BOX DRAWINGS HEAVY RIGHT` |
+  | `☙` | `U+2619` | `REVERSED ROTATED FLORAL HEART BULLET` |
+  | `❧` | `U+2767` | `ROTATED FLORAL HEART BULLET` |
+  | `　` | `U+3000` | `IDEOGRAPHIC SPACE` |
+  | `・` | `U+30FB` | `KATAKANA MIDDLE DOT` |
+  | `＊` | `U+FF0A` | `FULLWIDTH ASTERISK` |
+  | `－` | `U+FF0D` | `FULLWIDTH HYPHEN-MINUS` |
+  | `．` | `U+FF0E` | `FULLWIDTH FULL STOP` |
+  | `＝` | `U+FF1D` | `FULLWIDTH EQUALS SIGN` |
+  | `＿` | `U+FF3F` | `FULLWIDTH LOW LINE` |
+  | `～` | `U+FF5E` | `FULLWIDTH TILDE` |
+
+- If every line in the paragraph begins with zero or more white·space
+  characters followed by `|`, it is a “preformatted” paragraph and
+  white·space is not collapsed (`<html:pre>`).
+
+- Otherwise, the paragraph is ordinary.
+
+After this classification, each ordinary paragraph is further
  classified by type based on its first character (which is must be
- followed by white·space to be recognized) :—
+ followed by white·space, a pilcrow, or else the only thing on the
+ line) :—
+
+- If the paragraph is preformatted, it is an ordinary paragraph.
 
 - If the paragraph begins with `⁌`, it is a chapter heading
   (`<html:h1>`).
@@ -102,24 +169,46 @@ After this classification, each quoted or unquoted paragraph is further
 - If the paragraph begins with `⚠︎`, it is a warning note
   (`<html:div role="note" class="warn">`).
 
-- If the paragraph begins with `⋯`, it is a continuation paragraph
-  (`<html:div class="continuation">`).
-  Continuation paragraphs may be used to continue a preceding list item
-  or quote.
-  Note, however, that an unquoted paragraph cannot continue a quoted
-  one, or vice·versa.
+- If the paragraph begins with `#`, it is a comment.
+  Comments produce X·M·L comment nodes and can be used to break up list
+  items into separate lists.
+
+- If the paragraph begins with `⋯`, it is a continuation paragraph.
+  Continuation paragraphs may be used to continue a preceding div or
+  list item.
+  If there is no such preceding div or list item, they will attach to
+  adjacent heading elements to form heading groups (`<html:hgroup>`).
+  Otherwise, they will be treated as ordinary paragraphs.
 
 - Otherwise, it is an ordinary paragraph.
 
-Following this sigil (if any, including trailing white·space) there may
- be a `¶` followed by zero or more non·white·space characters.
+Following this sigil (if any) there may be a `¶` followed by zero or
+ more non·white·space characters.
 The characters following the `¶` give the identifier for the paragraph,
  which is expected to be unique within a document.
+This may be suffixed with a language tag beginning with `@` and
+ terminated with `$`.
 
 The remaining characters in a paragraph form its contents.
 Markup within paragraphs is delimited with·out exception by pairs of
  characters, with the following precedence :—
 
+- The characters `⌦` and `⌫` indicate inline comments.
+  A single character `⌧` may be used to indicate an “empty” comment
+  (consisting of `U+034F COMBINING GRAPHEME JOINER` for X·M·L
+  compatibility).
+
+- The characters `{@` and `"}` indicate attribute specifications.
+  The attribute specification must contain at least one `="` which
+  separates the key of the attribute from the value.
+  Attributes attach to the previous element or text node, with
+  white·space‐only text nodes after elements ignored; if there is no
+  such previous element or text node, an empty text node is used
+  instead.
+  Multiple attributes can be given in sequence using multiple
+  specifications.
+  Text nodes with attributes are wrapped in `<html:span>`.
+
 - The characters `{🔗` and `>}` indicate a hyperlink to a U·R·L
   (`<html:a>`).
   The hyperlink must contain at least one `<`; the content before the
@@ -137,32 +226,27 @@ Markup within paragraphs is delimited with·out exception by pairs of
 - The characters `⸨` and `⸩` indicate parenthetical content
   (`<html:small>`).
 
-- The characters `☞︎` and `☜︎` indicate strong importance
-  (`<html:strong>`).
-
-- The characters `⹐` and `⹑` indicate emphasis (`<html:em>`).
+- The characters `` ` `` and `´` indicate code (`<html:code>`).
 
 - The characters `⟪` and `⟫` indicate titles (`<html:cite>`).
 
+- The characters `⸶` and `⸷` indicate names (`<html:u class="name">`).
+
 - The characters `⟨` and `⟩` indicate offset text (`<html:i>`).
-  This may be followed by a `@`, a language tag, and a `$` to provide
-  the language of the text.
 
 - The characters `⦃` and `⦄` indicate keyword highlighting
   (`<html:b>`).
 
-- The characters `` ` `` and `´` indicate code (`<html:code>`).
+- The characters `☞︎` and `☜︎` indicate strong importance
+  (`<html:strong>`).
+
+- The characters `⹐` and `⹑` indicate emphasis (`<html:em>`).
 
 Once the tree is built as above, it is remediated into its final form
  by the following steps :—
 
-- Successive quoted paragraphs are joined into one quote.
-  If the final quoted paragraph is an ordinary paragraph which begins
-  with `—` and a space, the quote is wrapped in a `<html:figure>`
-  and the final paragraph becomes its `<html:figcaption>`.
-
 - Continuation paragraphs are joined with the preceding list items or
-  quotes.
+  divs.
 
 - List items of a higher level are nested in preceding list items, when
   present.
@@ -170,17 +254,21 @@ Once the tree is built as above, it is remediated into its final form
 - Successive list items of the same level and class are joined into a
   single list.
 
+- Linebreaks in preformatted paragraphs are replaced with `<html:br>`.
+
 Finally, any character can be escaped by instead providing its Unicode
- codepoint in the form `<U+NNNN>`, where `NNNN` is one or more
+ codepoint in the form `{U+NNNN}`, where `NNNN` is one or more
  hexadecimal digits.
 Multiple codepoints may be provided separated by periods, as in
- `<U+WWWW.ZZZZ>`
+ `{U+WWWW.ZZZZ}`.
+Due to limitations in X·S·L·T, characters cannot be escaped in
+ attributes (including link targets).
 
 ## Usage
 
-💄📝 Les·M·L is designed for usage with [⛩️📰 书社][Shushe].
+💄📝 Les·M·L is designed for usage with [⛩📰 书社][Shushe].
 Simply include the `parser.xslt` provided by this repository to
- ⛩️📰 书社 as an additional parser, and `magic` as an additional
+ ⛩📰 书社 as an additional parser, and `magic` as an additional
  magic file.
 
 ## License