SPDX-FileCopyrightText: 2024, 2025 Lady <https://www.ladys.computer/about/#lady>
SPDX-License-Identifier: CC0-1.0

<b>Ladys simple markup language.</b>
 
💄📝 Les·M·L is a document markup language designed with two goals in
  mind :—

1. It must be trivial to parse, even with limited tooling such as that

2. It must be sophisticated enough to handle longform hypertext
     documents and associated metadata.
 
It is implemented as an X·S·L·T transformation from a
  `<html:script type="text/lesml">` element into H·T·M·L

<i>Les·M·L</i> is an abbreviation of the phrase “Ladys Extremely Simple
 
The first line of any 💄📝 Les·M·L document should be the string

A language tag may follow this, beginning with `@` and terminated with

Regardless of whether a language tag is present, the shebang line may
  be terminated by a space‐separated list of properties of the form

Only one property is currently permitted: `profile`, whose value should
  be a U·R·I and is translated to the `@data-lesml-profile` attribute
  on the resulting `<html:article>` element.

Following the shebang line, document metadata may be provided in the
  [Record Jar][draft-phillips-record-jar-01] format.
The body of the document begins after the last line which begins with
  the string `%%`, or after the shebang line if none exists.
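
The rule for locating the body can be sketched in Python.
This is only an illustration, not the X·S·L·T implementation: the function name `body_start` is hypothetical, and it assumes the document has already been split into lines with the shebang line first.

```python
def body_start(lines: list[str]) -> int:
    """Return the index of the first body line of one document.

    Assumes lines[0] is the shebang line. The body starts after the
    last line beginning with "%%", or after the shebang line if no
    such line exists.
    """
    last_marker = 0  # index of the shebang line
    for index, line in enumerate(lines):
        if index > 0 and line.startswith("%%"):
            last_marker = index
    return last_marker + 1
```

Read literally, the rule means that any later line beginning with `%%` moves the start of the body, and the sketch follows that reading.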
 
Multiple documents can be catenated into a single file; a new document
  is begun on any line which starts with `#!lesml` or `##`.
Documents in the latter case inherit the latest preceding `#!lesml`

`##` may be followed by other text; this is treated as an interdocument
 
Documents are broken into paragraphs by blank lines.
Empty paragraphs are ignored.
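
As a rough Python sketch of this step (the name `split_paragraphs` is hypothetical, and treating white·space‐only lines as blank is an assumption, not something stated above):

```python
def split_paragraphs(body: str) -> list[list[str]]:
    """Split a document body into paragraphs, each a list of lines."""
    paragraphs: list[list[str]] = []
    current: list[str] = []
    for line in body.splitlines():
        if line.strip() == "":
            # Assumption: a line holding only white-space counts as blank.
            if current:
                paragraphs.append(current)
                current = []
        else:
            current.append(line)
    if current:
        paragraphs.append(current)
    return paragraphs
```

Because a paragraph is only recorded once it holds at least one line, runs of blank lines never produce empty paragraphs.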
 
If every line in the paragraph begins with (optional white·space
  followed by) `»` it is quoted (`<html:blockquote>`); if every line
  begins with `]` it is bracketed.
The lines, minus this leading marker, are then re‐analysed.
Bracketed paragraphs which end quotes are treated as captions
  (`<html:figcaption>`); otherwise, they are footers (`<html:footer>`).
 
Non·empty paragraphs are classified as follows :—

- If the paragraph consists of only the following section‐break
    characters, plus any amount of white·space, then it is
    considered to be a section break (`<html:hr>`).

  The section break characters are :—

  | Character | Codepoint | Unicode Name |
  | --------- | --------- | ------------ |
  | `*` | `U+002A` | `ASTERISK` |
  | `-` | `U+002D` | `HYPHEN-MINUS` |
  | `.` | `U+002E` | `FULL STOP` |
  | `=` | `U+003D` | `EQUALS SIGN` |
  | `_` | `U+005F` | `LOW LINE` |
  | `~` | `U+007E` | `TILDE` |
  | `·` | `U+00B7` | `MIDDLE DOT` |
  | `․` | `U+2024` | `ONE DOT LEADER` |
  | `‥` | `U+2025` | `TWO DOT LEADER` |
  | `…` | `U+2026` | `HORIZONTAL ELLIPSIS` |
  | `⁂` | `U+2042` | `ASTERISM` |
  | `⋯` | `U+22EF` | `MIDLINE HORIZONTAL ELLIPSIS` |
  | `─` | `U+2500` | `BOX DRAWINGS LIGHT HORIZONTAL` |
  | `━` | `U+2501` | `BOX DRAWINGS HEAVY HORIZONTAL` |
  | `┄` | `U+2504` | `BOX DRAWINGS LIGHT TRIPLE DASH HORIZONTAL` |
  | `┅` | `U+2505` | `BOX DRAWINGS HEAVY TRIPLE DASH HORIZONTAL` |
  | `┈` | `U+2508` | `BOX DRAWINGS LIGHT QUADRUPLE DASH HORIZONTAL` |
  | `┉` | `U+2509` | `BOX DRAWINGS HEAVY QUADRUPLE DASH HORIZONTAL` |
  | `╌` | `U+254C` | `BOX DRAWINGS LIGHT DOUBLE DASH HORIZONTAL` |
  | `╍` | `U+254D` | `BOX DRAWINGS HEAVY DOUBLE DASH HORIZONTAL` |
  | `═` | `U+2550` | `BOX DRAWINGS DOUBLE HORIZONTAL` |
  | `╴` | `U+2574` | `BOX DRAWINGS LIGHT LEFT` |
  | `╶` | `U+2576` | `BOX DRAWINGS LIGHT RIGHT` |
  | `╸` | `U+2578` | `BOX DRAWINGS HEAVY LEFT` |
  | `╺` | `U+257A` | `BOX DRAWINGS HEAVY RIGHT` |
  | `☙` | `U+2619` | `REVERSED ROTATED FLORAL HEART BULLET` |
  | `❧` | `U+2767` | `ROTATED FLORAL HEART BULLET` |
  | `　` | `U+3000` | `IDEOGRAPHIC SPACE` |
  | `・` | `U+30FB` | `KATAKANA MIDDLE DOT` |
  | `＊` | `U+FF0A` | `FULLWIDTH ASTERISK` |
  | `－` | `U+FF0D` | `FULLWIDTH HYPHEN-MINUS` |
  | `．` | `U+FF0E` | `FULLWIDTH FULL STOP` |
  | `＝` | `U+FF1D` | `FULLWIDTH EQUALS SIGN` |
  | `＿` | `U+FF3F` | `FULLWIDTH LOW LINE` |
  | `～` | `U+FF5E` | `FULLWIDTH TILDE` |

- If every line in the paragraph begins with zero or more white·space
    characters followed by `|`, it is a “preformatted” paragraph and
    white·space is not collapsed (`<html:pre>`).

- Otherwise, the paragraph is ordinary.
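
The three‐way classification above can be sketched in Python as follows.
The name `classify` and the returned labels are hypothetical, and white·space is taken to mean whatever `str.isspace` accepts, which may be broader than what the X·S·L·T parser recognizes.

```python
# Codepoints copied from the table of section-break characters.
SECTION_BREAK_CHARS = {chr(cp) for cp in (
    0x002A, 0x002D, 0x002E, 0x003D, 0x005F, 0x007E, 0x00B7,
    0x2024, 0x2025, 0x2026, 0x2042, 0x22EF,
    0x2500, 0x2501, 0x2504, 0x2505, 0x2508, 0x2509,
    0x254C, 0x254D, 0x2550, 0x2574, 0x2576, 0x2578, 0x257A,
    0x2619, 0x2767, 0x3000, 0x30FB,
    0xFF0A, 0xFF0D, 0xFF0E, 0xFF1D, 0xFF3F, 0xFF5E,
)}

def classify(lines: list[str]) -> str:
    """Classify a non-empty paragraph given as a list of lines."""
    text = "\n".join(lines)
    has_break_char = any(ch in SECTION_BREAK_CHARS for ch in text)
    only_break_chars = all(
        ch in SECTION_BREAK_CHARS or ch.isspace() for ch in text
    )
    if has_break_char and only_break_chars:
        return "section-break"
    if all(line.lstrip().startswith("|") for line in lines):
        return "preformatted"
    return "ordinary"
```

For example, `classify(["* * *"])` gives `"section-break"`, while a paragraph in which only some lines start with `|` falls through to `"ordinary"`.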
 
After this classification, each ordinary paragraph is further
  classified by type based on its first character (which must be
  followed by white·space, a pilcrow, or else the only thing on the

- If the paragraph is preformatted, it is an ordinary paragraph.

- If the paragraph begins with `⁌`, it is a chapter heading

- If the paragraph begins with `§`, it is a section heading

- If the paragraph begins with `❦`, it is a subsection heading

- If the paragraph begins with `✠`, it is a subsubsection heading

- If the paragraph begins with `•` or `🔢`, it is a primary unordered
    or ordered list item (`<html:li class="unordered" data-level="1">`
    or `<html:li class="ordered" data-level="1">`).

- If the paragraph begins with `◦` or `🔠`, it is a secondary unordered
    or ordered list item (`<html:li class="unordered" data-level="2">`
    or `<html:li class="ordered" data-level="2">`).
  Secondary list items are considered to be nested inside of primary
    list items which precede them.

- If the paragraph begins with `▪` or `🔡`, it is a tertiary unordered
    or ordered list item (`<html:li class="unordered" data-level="3">`
    or `<html:li class="ordered" data-level="3">`).
  Tertiary list items are considered to be nested inside of primary
    and secondary list items which precede them.

- If the paragraph begins with `⁃` or `🔣`, it is a quaternary
    unordered or ordered list item
    (`<html:li class="unordered" data-level="4">` or
    `<html:li class="ordered" data-level="4">`).
  Quaternary list items are considered to be nested inside of primary,
    secondary, and tertiary list items which precede them.

- If the paragraph begins with `※`, it is an ordinary note
    (`<html:div role="note" class="note">`).

- If the paragraph begins with `☡`, it is a cautionary note
    (`<html:div role="note" class="caution">`).

- If the paragraph begins with `🛈`, it is an informative note
    (`<html:div role="note" class="info">`).

- If the paragraph begins with `⯑`, it is a questioning note
    (`<html:div role="note" class="query">`).

- If the paragraph begins with `⚠︎`, it is a warning note
    (`<html:div role="note" class="warn">`).

- If the paragraph begins with `#`, it is a comment.
  Comments produce X·M·L comment nodes and can be used to break up list
    items into separate lists.

- If the paragraph begins with `⋯`, it is a continuation paragraph.
  Continuation paragraphs may be used to continue a preceding div or
    list item.
  If there is no such preceding div or list item, they will attach to
    adjacent heading elements to form heading groups (`<html:hgroup>`).
  Otherwise, they will be treated as ordinary paragraphs.

- Otherwise, it is an ordinary paragraph.
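
A lookup table is one way to picture the sigil rules above.
The Python below is a sketch, not the parser itself: `classify_sigil` and the label strings are hypothetical, the warning sigil is assumed to be `U+26A0` followed by `U+FE0E`, and a sigil is assumed to count only when followed by white·space, by `¶`, or by nothing at all.

```python
SIGILS = {
    "⁌": "chapter-heading",
    "§": "section-heading",
    "❦": "subsection-heading",
    "✠": "subsubsection-heading",
    "•": "li-unordered-1", "🔢": "li-ordered-1",
    "◦": "li-unordered-2", "🔠": "li-ordered-2",
    "▪": "li-unordered-3", "🔡": "li-ordered-3",
    "⁃": "li-unordered-4", "🔣": "li-ordered-4",
    "※": "note",
    "☡": "note-caution",
    "🛈": "note-info",
    "⯑": "note-query",
    "\u26A0\uFE0E": "note-warn",  # assumed: warning sign + text selector
    "#": "comment",
    "⋯": "continuation",
}

def classify_sigil(text: str) -> tuple[str, str]:
    """Return (kind, rest) for the text of an ordinary paragraph."""
    for sigil, kind in SIGILS.items():
        if text.startswith(sigil):
            rest = text[len(sigil):]
            # The sigil only counts when it stands apart from the content.
            if rest == "" or rest[0].isspace() or rest[0] == "¶":
                return kind, rest
    return "ordinary", text
```

So `classify_sigil("§ Intro")` yields `("section-heading", " Intro")`, while `"#comment"` stays ordinary because nothing separates the `#` from the text.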
 
Following this sigil (if any) there may be a `¶` followed by zero or
  more non·white·space characters.
The characters following the `¶` give the identifier for the paragraph,
  which is expected to be unique within a document.
This may be suffixed with a language tag beginning with `@` and

The remaining characters in a paragraph form its contents.
Markup within paragraphs is delimited with·out exception by pairs of
  characters, with the following precedence :—
 
- The characters `⌦` and `⌫` indicate inline comments.
  A single character `⌧` may be used to indicate an “empty” comment
    (consisting of `U+034F COMBINING GRAPHEME JOINER` for X·M·L

- The characters `{@` and `"}` indicate attribute specifications.
  The attribute specification must contain at least one `="` which
    separates the key of the attribute from the value.
  Attributes attach to the previous element or text node, with
    white·space‐only text nodes after elements ignored; if there is no
    such previous element or text node, an empty text node is used

  Multiple attributes can be given in sequence using multiple

  Text nodes with attributes are wrapped in `<html:span>`.

- The characters `{🔗` and `>}` indicate a hyperlink to a U·R·L

  The hyperlink must contain at least one `<`; the content before the
    last `<` gives the text of the link, and the content after gives
    the U·R·L that the link points to.
  If no text is given, the U·R·L will be used instead.

- The characters `⸠` and `⸡` indicate a strikethru (`<html:s>`).

- The characters `⸤` and `⸥` indicate underlining (`<html:u>`).

- The characters `⟦` and `⟧` indicate an inline note
    (`<html:small role="note">`).

- The characters `⸨` and `⸩` indicate parenthetical content

- The characters `` ` `` and `´` indicate code (`<html:code>`).

- The characters `⟪` and `⟫` indicate titles (`<html:cite>`).

- The characters `⸶` and `⸷` indicate names (`<html:u class="name">`).

- The characters `⟨` and `⟩` indicate offset text (`<html:i>`).

- The characters `⦃` and `⦄` indicate keyword highlighting

- The characters `☞︎` and `☜︎` indicate strong importance

- The characters `⹐` and `⹑` indicate emphasis (`<html:em>`).
 
Once the tree is built as above, it is remediated into its final form
  by the following steps :—

- Continuation paragraphs are joined with the preceding list items or

- List items of a higher level are nested in preceding list items, when

- Successive list items of the same level and class are joined into

- Linebreaks in preformatted paragraphs are replaced with `<html:br>`.
 
Finally, any character can be escaped by instead providing its Unicode
  codepoint in the form `{U+NNNN}`, where `NNNN` is one or more

Multiple codepoints may be provided separated by periods, as in

Due to limitations in X·S·L·T, characters cannot be escaped in
  attributes (including link targets).
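
For illustration, the single‐codepoint form `{U+NNNN}` could be expanded as in the Python sketch below.
The name `unescape` is hypothetical, `NNNN` is assumed to be hexadecimal digits in either case, and the multiple‐codepoint form is deliberately not handled because its exact shape is not shown here.

```python
import re

# Assumption: NNNN is one or more hexadecimal digits.
ESCAPE = re.compile(r"\{U\+([0-9A-Fa-f]+)\}")

def _expand(match: re.Match) -> str:
    codepoint = int(match.group(1), 16)
    # Leave the escape untouched if it is outside the Unicode range.
    return chr(codepoint) if codepoint <= 0x10FFFF else match.group(0)

def unescape(text: str) -> str:
    """Replace each {U+NNNN} escape with the character it names."""
    return ESCAPE.sub(_expand, text)
```

An escape such as `{U+00B6}` becomes `¶`, which is one way to place a literal pilcrow where it would otherwise be read as markup.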
 
💄📝 Les·M·L is designed for usage with [⛩📰 书社][Shushe].
Simply include the `parser.xslt` provided by this repository to
  ⛩📰 书社 as an additional parser, and `magic` as an additional

This repository conforms to [REUSE][].

The parser is licensed under the terms of the <cite>Mozilla Public
  License, version 2.0</cite>.
 
[REUSE]: <https://reuse.software/spec/>
[Shushe]: <https://git.ladys.computer/Shushe/>
[draft-phillips-record-jar-01]: <https://datatracker.ietf.org/doc/html/draft-phillips-record-jar-01>