<!--
SPDX-FileCopyrightText: 2024, 2025 Lady <https://www.ladys.computer/about/#lady>
SPDX-License-Identifier: CC0-1.0
-->
# 💄📝 Les·M·L

<b>Ladys simple markup language.</b>

💄📝 Les·M·L is a document markup language designed with two goals in
  mind :⁠—

1. It must be trivial to parse, even with limited tooling such as that
     provided by X·S·L·T.

2. It must be sophisticated enough to handle longform hypertext
     documents and associated metadata.

It is implemented as an X·S·L·T transformation from a
  `<html:script type="text/lesml">` element into H·T·M·L
  (`parser.xslt`).

## Nomenclature

<i>Les·M·L</i> is an abbreviation of the phrase “Ladys Extremely Simple
  Markup Language”.

## Markup Syntax

The first line of any 💄📝 Les·M·L document should be the string
  `#!lesml`.
A language tag may follow this, beginning with `@` and terminated with
  `$`, like so:
`#!lesml@en$`.
Regardless of whether a language tag is present, the shebang line may
  be terminated by a space‐separated list of properties of the form
  `key=value`.
Only one property is currently permitted: `profile`, whose value should
  be a U·R·I and is translated to the `@data-lesml-profile` attribute
  on the resulting `<html:article>` element.

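The shebang rules above can be sketched in Python.
This is an illustration only, not the X·S·L·T implementation: the pattern `SHEBANG` and the function `parse_shebang` are hypothetical names, and the exact white·space handling is an assumption.
The sketch does not check that `profile` is the only property.

```python
import re

# Assumed shape: "#!lesml", an optional "@lang$" tag, then optional
# space-separated key=value properties.
SHEBANG = re.compile(
    r"#!lesml(?:@(?P<lang>[^$]*)\$)?(?P<props>(?: +[^ =]+=[^ ]*)*) *"
)


def parse_shebang(line):
    """Return (language, properties) for a shebang line, or None."""
    match = SHEBANG.fullmatch(line)
    if match is None:
        return None
    # Each property is split at its first "=" into a key and a value.
    props = dict(item.split("=", 1) for item in match.group("props").split())
    return match.group("lang"), props
```

Under these assumptions, `parse_shebang("#!lesml@en$")` gives the language `"en"` and no properties, and a line without a language tag gives `None` for the language.
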
Following the shebang line, document metadata may be provided in the
  [Record Jar][draft-phillips-record-jar-01] format.
The body of the document begins after the last line which begins with
  the string `%%`, or after the shebang line if none exists.

Multiple documents can be catenated into a single file; a new document
  is begun on any line which starts with `#!lesml` or `##`.
Documents in the latter case inherit the latest preceding `#!lesml`
  declaration.
`##` may be followed by other text; this is treated as an interdocument
  comment.

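A minimal Python sketch of splitting a catenated file into documents, and then a document into metadata and body.
The function names are hypothetical, the metadata lines are returned unparsed, and lines before the first `#!lesml` are simply dropped (an assumption).

```python
def split_documents(text):
    """Split catenated text into (shebang, lines) pairs."""
    documents = []
    shebang = None
    for line in text.split("\n"):
        if line.startswith("#!lesml"):
            shebang = line
            documents.append((shebang, []))
        elif line.startswith("##"):
            # Any text after "##" is an interdocument comment; the new
            # document inherits the latest preceding shebang.
            documents.append((shebang, []))
        elif documents:
            documents[-1][1].append(line)
    return documents


def split_metadata(lines):
    """Split a document's lines into (metadata, body) at the last "%%" line."""
    marks = [i for i, line in enumerate(lines) if line.startswith("%%")]
    if not marks:
        # No "%%" line: the body starts right after the shebang line.
        return [], lines
    return lines[:marks[-1]], lines[marks[-1] + 1:]
```
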
Documents are broken into paragraphs by blank lines.
Empty paragraphs are ignored.

If every line in the paragraph begins with (optional white·space
  followed by) `»` it is quoted (`<html:blockquote>`); if every line
  begins with `]` it is bracketed.
The lines, minus this leading marker, are then re‐analysed.
Bracketed paragraphs which end quotes are treated as captions
  (`<html:figcaption>`); otherwise, they are footers (`<html:footer>`).

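The paragraph splitting and marker stripping described above might look like the following Python sketch.
It assumes that a blank line is one that is empty or holds only spaces and tabs, and that optional white·space is allowed before `]` as well as before `»`; neither detail is confirmed by this document.
The function names are hypothetical.

```python
def split_paragraphs(lines):
    """Group lines into paragraphs, dropping blank lines between them."""
    paragraphs, current = [], []
    for line in lines:
        if line.strip(" \t"):
            current.append(line)
        elif current:
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)
    return paragraphs


def unwrap(paragraph):
    """Peel leading quote and bracket markers off every line.

    Returns (wrappers, lines), with wrappers listed outermost first.
    """
    wrappers = []
    while paragraph:
        stripped = [line.lstrip(" \t") for line in paragraph]
        marker = stripped[0][:1]
        if marker not in ("»", "]"):
            break
        if not all(line.startswith(marker) for line in stripped):
            break
        wrappers.append("quote" if marker == "»" else "bracket")
        # Drop the marker, then re-analyse what is left.
        paragraph = [line[1:] for line in stripped]
    return wrappers, paragraph
```
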
Non·empty paragraphs (which, to be clear, may still result in empty
  `<html:p>` elements) are classified as follows :⁠—

- If the paragraph consists of only the following section‐break
    characters, plus any amount of white·space, then it is
    considered to be a section break (`<html:hr>`).

  The section break characters are :⁠—

  | Character | Codepoint | Unicode Name |
  | --------- | --------- | ------------ |
  | `*` | `U+002A` | `ASTERISK` |
  | `-` | `U+002D` | `HYPHEN-MINUS` |
  | `.` | `U+002E` | `FULL STOP` |
  | `=` | `U+003D` | `EQUALS SIGN` |
  | `_` | `U+005F` | `LOW LINE` |
  | `~` | `U+007E` | `TILDE` |
  | `·` | `U+00B7` | `MIDDLE DOT` |
  | `․` | `U+2024` | `ONE DOT LEADER` |
  | `‥` | `U+2025` | `TWO DOT LEADER` |
  | `…` | `U+2026` | `HORIZONTAL ELLIPSIS` |
  | `⁂` | `U+2042` | `ASTERISM` |
  | `⋯` | `U+22EF` | `MIDLINE HORIZONTAL ELLIPSIS` |
  | `─` | `U+2500` | `BOX DRAWINGS LIGHT HORIZONTAL` |
  | `━` | `U+2501` | `BOX DRAWINGS HEAVY HORIZONTAL` |
  | `┄` | `U+2504` | `BOX DRAWINGS LIGHT TRIPLE DASH HORIZONTAL` |
  | `┅` | `U+2505` | `BOX DRAWINGS HEAVY TRIPLE DASH HORIZONTAL` |
  | `┈` | `U+2508` | `BOX DRAWINGS LIGHT QUADRUPLE DASH HORIZONTAL` |
  | `┉` | `U+2509` | `BOX DRAWINGS HEAVY QUADRUPLE DASH HORIZONTAL` |
  | `╌` | `U+254C` | `BOX DRAWINGS LIGHT DOUBLE DASH HORIZONTAL` |
  | `╍` | `U+254D` | `BOX DRAWINGS HEAVY DOUBLE DASH HORIZONTAL` |
  | `═` | `U+2550` | `BOX DRAWINGS DOUBLE HORIZONTAL` |
  | `╴` | `U+2574` | `BOX DRAWINGS LIGHT LEFT` |
  | `╶` | `U+2576` | `BOX DRAWINGS LIGHT RIGHT` |
  | `╸` | `U+2578` | `BOX DRAWINGS HEAVY LEFT` |
  | `╺` | `U+257A` | `BOX DRAWINGS HEAVY RIGHT` |
  | `☙` | `U+2619` | `REVERSED ROTATED FLORAL HEART BULLET` |
  | `❧` | `U+2767` | `ROTATED FLORAL HEART BULLET` |
  | `　` | `U+3000` | `IDEOGRAPHIC SPACE` |
  | `・` | `U+30FB` | `KATAKANA MIDDLE DOT` |
  | `＊` | `U+FF0A` | `FULLWIDTH ASTERISK` |
  | `－` | `U+FF0D` | `FULLWIDTH HYPHEN-MINUS` |
  | `．` | `U+FF0E` | `FULLWIDTH FULL STOP` |
  | `＝` | `U+FF1D` | `FULLWIDTH EQUALS SIGN` |
  | `＿` | `U+FF3F` | `FULLWIDTH LOW LINE` |
  | `～` | `U+FF5E` | `FULLWIDTH TILDE` |

- If every line in the paragraph begins with zero or more white·space
    characters followed by `|`, it is a “preformatted” paragraph and
    white·space is not collapsed (`<html:pre>`).

- Otherwise, the paragraph is ordinary.

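The three‐way classification above can be sketched in Python as follows.
The codepoints are copied from the table; the names `SECTION_BREAK_CHARS` and `classify_block` are hypothetical, and the sketch assumes it is only ever given non·empty paragraphs.

```python
# Codepoints of the section-break characters, as listed in the table.
SECTION_BREAK_CHARS = {chr(cp) for cp in (
    0x002A, 0x002D, 0x002E, 0x003D, 0x005F, 0x007E, 0x00B7,
    0x2024, 0x2025, 0x2026, 0x2042, 0x22EF,
    0x2500, 0x2501, 0x2504, 0x2505, 0x2508, 0x2509, 0x254C, 0x254D, 0x2550,
    0x2574, 0x2576, 0x2578, 0x257A, 0x2619, 0x2767,
    0x3000, 0x30FB, 0xFF0A, 0xFF0D, 0xFF0E, 0xFF1D, 0xFF3F, 0xFF5E,
)}


def classify_block(paragraph):
    """Classify a non-empty paragraph, given as a list of lines."""
    text = "".join(paragraph)
    if all(ch in SECTION_BREAK_CHARS or ch.isspace() for ch in text):
        return "section-break"
    if all(line.lstrip().startswith("|") for line in paragraph):
        return "preformatted"
    return "ordinary"
```

For example, `classify_block(["* * *"])` returns `"section-break"`, while a paragraph in which only some lines begin with `|` falls through to `"ordinary"`.
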
After this classification, each ordinary paragraph is further
  classified by type based on its first character (which must be
  followed by white·space or a pilcrow, or else be the only thing on
  the line) :⁠—

- If the paragraph is preformatted, it is an ordinary paragraph.

- If the paragraph begins with `⁌`, it is a chapter heading
    (`<html:h1>`).

- If the paragraph begins with `§`, it is a section heading
    (`<html:h2>`).

- If the paragraph begins with `❦`, it is a subsection heading
    (`<html:h3>`).

- If the paragraph begins with `✠`, it is a subsubsection heading
    (`<html:h4>`).

- If the paragraph begins with `•` or `🔢`, it is a primary unordered
    or ordered list item (`<html:li class="unordered" aria-level="1">`
    or `<html:li class="ordered" aria-level="1">`).

- If the paragraph begins with `◦` or `🔠`, it is a secondary unordered
    or ordered list item (`<html:li class="unordered" aria-level="2">`
    or `<html:li class="ordered" aria-level="2">`).
  Secondary list items are considered to be nested inside of primary
    list items which precede them.

- If the paragraph begins with `▪` or `🔡`, it is a tertiary unordered
    or ordered list item (`<html:li class="unordered" aria-level="3">`
    or `<html:li class="ordered" aria-level="3">`).
  Tertiary list items are considered to be nested inside of primary
    and secondary list items which precede them.

- If the paragraph begins with `⁃` or `🔣`, it is a quaternary
    unordered or ordered list item
    (`<html:li class="unordered" aria-level="4">` or
    `<html:li class="ordered" aria-level="4">`).
  Quaternary list items are considered to be nested inside of primary,
    secondary, and tertiary list items which precede them.

- If the paragraph begins with `※`, it is an ordinary note
    (`<html:section role="note" class="note">`).

- If the paragraph begins with `☡`, it is a cautionary note
    (`<html:section role="note" class="caution">`).

- If the paragraph begins with `⯑`, it is a questioning note
    (`<html:section role="note" class="query">`).

- If the paragraph begins with `@`, it is an abstract
    (`<html:section role="doc-abstract">`).

- If the paragraph begins with `🛈`, it is an (informative) tip
    (`<html:section role="doc-tip">`).

- If the paragraph begins with `⚠︎`, it is a (warning) notice
    (`<html:section role="doc-notice">`).

- If the paragraph begins with `^`, it is a footnote
    (`<html:li class="ordered footnote" aria-level="1">`).
  Footnotes are ignored unless their first paragraph has an i·d
    (specified with `¶`) which is referenced by one or more footnote
    references.
  Footnotes are treated as level 1 ordered list items, so they can
    contain nested lists.

  Footnotes are removed from the normal document flow and placed in a
    footer (`<html:section role="doc-endnotes">`) in order of first
    reference.
  It is recommended that the i·d¦s you choose are kept stable, so that
    links to footnotes do not break.

- If the paragraph begins with `#`, it is a comment.
  Comments produce X·M·L comment nodes and can be used to break up list
    items into separate lists.

- If the paragraph begins with `⋯`, it is a continuation paragraph.
  Continuation paragraphs may be used to continue a preceding note,
    footnote, or list item.
  If there is no such preceding note, footnote, or list item, they will
    attach to adjacent heading elements to form heading groups
    (`<html:hgroup>`).
  Otherwise, they will be treated as ordinary paragraphs.

- Otherwise, it is an ordinary paragraph.

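The sigil test above, including the rule that a sigil only counts when followed by white·space or a pilcrow (or when it stands alone), can be sketched in Python.
The table below is deliberately partial, and the labels and the names `SIGILS` and `classify_sigil` are hypothetical.

```python
# A partial sigil table; the remaining sigils would be added the same way.
SIGILS = {
    "⁌": "chapter heading",
    "§": "section heading",
    "❦": "subsection heading",
    "✠": "subsubsection heading",
    "•": "unordered list item, level 1",
    "🔢": "ordered list item, level 1",
    "※": "ordinary note",
    "☡": "cautionary note",
    "@": "abstract",
    "^": "footnote",
    "#": "comment",
    "⋯": "continuation",
}


def classify_sigil(first_line):
    """Return (kind, rest) for the first line of an ordinary paragraph."""
    for sigil, kind in SIGILS.items():
        if first_line.startswith(sigil):
            rest = first_line[len(sigil):]
            # The sigil must be alone on the line, or be followed by
            # white-space or a pilcrow.
            if rest == "" or rest[0].isspace() or rest[0] == "¶":
                return kind, rest
    return "ordinary paragraph", first_line
```

With this sketch, `§ Introduction` is a section heading, but `§5 applies` stays an ordinary paragraph because the sigil is followed directly by a digit.
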
Following this sigil (if any) there may be a `¶` followed by zero or
  more non·white·space characters.
The characters following the `¶` give the identifier for the paragraph,
  which is expected to be unique within a document.
This may be suffixed with a language tag beginning with `@` and
  terminated with `$`.

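A small Python sketch of reading the identifier and optional language tag.
It assumes the identifier itself cannot contain `@`, and that the `¶` comes immediately at the start of the text handed to it; both are guesses rather than documented behaviour, and `ANCHOR` and `parse_anchor` are hypothetical names.

```python
import re

# "¶", an identifier without white-space or "@", then an optional "@lang$".
ANCHOR = re.compile(r"¶(?P<id>[^\s@]*)(?:@(?P<lang>[^\s$]*)\$)?")


def parse_anchor(rest):
    """Return (identifier, language, contents) for text following a sigil."""
    match = ANCHOR.match(rest)
    if match is None:
        return None, None, rest
    contents = rest[match.end():].lstrip()
    return match.group("id"), match.group("lang"), contents
```

A bare `¶` yields an empty identifier and empty contents, which corresponds to the "empty first paragraph" case.
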
When a paragraph produces an `<html:p>` element “wrapped in” another
  kind of element (e·g, a blockquote, section, or list item), the
  identifier and language of the first paragraph are applied to the
  wrapping element.
If the first paragraph has no other contents, it is deleted.
To apply the identifier or language to the `<html:p>` element itself,
  and not its wrapper, one can simply make the first paragraph empty
  (using a literal `¶` with no other contents).
This paragraph will be dropped, but the following paragraphs will still
  be processed as non·initial.

The remaining characters in a paragraph form its contents.
Markup within paragraphs is delimited with·out exception by pairs of
  characters, with the following precedence :⁠—

- The characters `⌦` and `⌫` indicate inline comments.
  A single character `⌧` may be used to indicate an “empty” comment
    (consisting of `U+034F COMBINING GRAPHEME JOINER` for X·M·L
    compatibility).

- The characters `{@` and `"}` indicate attribute specifications.
  The attribute specification must contain at least one `="` which
    separates the key of the attribute from the value.
  Attributes attach to the previous element or text node, with
    white·space‐only text nodes after elements ignored; if there is no
    such previous element or text node, an empty text node is used
    instead.
  Multiple attributes can be given in sequence using multiple
    specifications.
  Text nodes with attributes are wrapped in `<html:span>`.

- The characters `{🔗` and `>}` indicate a hyperlink to a U·R·L
    (`<html:a>`).
  The hyperlink must contain at least one `<`; the content before the
    last `<` gives the text of the link, and the content after gives
    the U·R·L that the link points to.
  If no text is given, the U·R·L will be used instead.

- The characters `⸠` and `⸡` indicate a strikethru (`<html:s>`).

- The characters `⸤` and `⸥` indicate underlining (`<html:u>`).

- The characters `⟦` and `⟧` indicate an inline note
    (`<html:small role="note">`).

- The characters `⸨` and `⸩` indicate parenthetical content
    (`<html:small>`).

- The characters `` ` `` and `´` indicate code (`<html:code>`).

- The characters `⟪` and `⟫` indicate titles (`<html:cite>`).

- The characters `⸶` and `⸷` indicate names (`<html:u class="name">`).

- The characters `⟨` and `⟩` indicate offset text (`<html:i>`).

- The characters `⦃` and `⦄` indicate keyword highlighting
    (`<html:b>`).

- The characters `☞︎` and `☜︎` indicate strong importance
    (`<html:strong>`).

- The characters `⹐` and `⹑` indicate emphasis (`<html:em>`).

- The characters `^` and `.` indicate a footnote reference
    (`<html:a role="doc-noteref">`).
  The characters between these sigils must match the i·d of the first
    paragraph of some footnote in the same document.

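As one worked case from the list above, the hyperlink rule (split at the last `<`, fall back to the U·R·L when no text is given) can be sketched in Python.
The function name `parse_link` is hypothetical, and the sketch does not trim white·space around the link text, since this document does not say whether the parser does.

```python
def parse_link(markup):
    """Split a complete "{🔗…>}" span into (text, url), or return None."""
    if not (markup.startswith("{🔗") and markup.endswith(">}")):
        return None
    inner = markup[len("{🔗"):-len(">}")]
    # Everything before the last "<" is the link text; the rest is the URL.
    text, separator, url = inner.rpartition("<")
    if not separator:
        return None  # a hyperlink must contain at least one "<"
    return (text or url), url
```

Because the split happens at the last `<`, link text may itself contain `<` characters.
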
Once the tree is built as above, it is remediated into its final form
  by the following steps :⁠—

- Continuation paragraphs are joined with the preceding list items or
    sections.

- List items of a higher level are nested in preceding list items, when
    present.
  List items of a level greater than 1 can also be nested in preceding
    sections (notes, abstracts, ⁊·c…).

- Successive list items of the same level and class are joined into
    a single list.

- Linebreaks in preformatted paragraphs are replaced with `<html:br>`.

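The list‐joining step can be illustrated with a short Python sketch.
It covers only that one step (nesting and continuation paragraphs are ignored), and it represents each list item as a hypothetical `(class, level, text)` tuple rather than as a tree node.

```python
from itertools import groupby


def join_lists(items):
    """Join successive items that share a class and level into one list."""
    return [
        (key, [text for _, _, text in group])
        # groupby only merges adjacent items, so an item of another class
        # or level in between starts a new list.
        for key, group in groupby(items, key=lambda item: item[:2])
    ]
```

Two unordered items separated by an ordered one therefore end up in two separate unordered lists.
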
Finally, any character can be escaped by instead providing its Unicode
  codepoint in the form `{U+NNNN}`, where `NNNN` is one or more
  hexadecimal digits.
Multiple codepoints may be provided separated by periods, as in
  `{U+WWWW.ZZZZ}`.
Due to limitations in X·S·L·T, characters cannot be escaped in
  attributes (including link targets).

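The escape syntax can be sketched in Python as a single substitution.
The names `ESCAPE` and `unescape` are hypothetical, accepting lowercase hexadecimal digits is an assumption, and a codepoint above `U+10FFFF` would raise a `ValueError` here.

```python
import re

# "{U+", one or more hexadecimal codepoints separated by periods, then "}".
ESCAPE = re.compile(r"\{U\+([0-9A-Fa-f]+(?:\.[0-9A-Fa-f]+)*)\}")


def unescape(text):
    """Replace every {U+NNNN} escape with the characters it names."""
    def replace(match):
        codepoints = match.group(1).split(".")
        return "".join(chr(int(cp, 16)) for cp in codepoints)
    return ESCAPE.sub(replace, text)
```

For instance, `unescape("{U+48.49}")` returns `"HI"`.
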
## Usage

💄📝 Les·M·L is designed for usage with [⛩📰 书社][Shushe].
Simply include the `parser.xslt` provided by this repository to
  ⛩📰 书社 as an additional parser, and `magic` as an additional
  magic file.

## License

This repository conforms to [REUSE][].

The parser is licensed under the terms of the <cite>Mozilla Public
  License, version 2.0</cite>.

[REUSE]: <https://reuse.software/spec/>
[Shushe]: <https://git.ladys.computer/Shushe/>
[draft-phillips-record-jar-01]: <https://datatracker.ietf.org/doc/html/draft-phillips-record-jar-01>