Make LesML:split a function

diff --git a/README.markdown b/README.markdown

index 6a21246144cd05dee95300e77368fd9994a3bce5..99ff8badb8b838699e8e8ba0928287fbd636050c 100644 (file)
--- a/README.markdown
+++ b/README.markdown
@@ -1,5 +1,5 @@
  <!--
-SPDX-FileCopyrightText: 2024 Lady <https://www.ladys.computer/about/#lady>
+SPDX-FileCopyrightText: 2024, 2025 Lady <https://www.ladys.computer/about/#lady>
  SPDX-License-Identifier: CC0-1.0
  -->
  # 💄📝 Les·M·L
@@ -28,29 +28,97 @@ It is implemented as an X·S·L·T transformation from a
  
  The first line of any 💄📝 Les·M·L document should be the string
    `#!lesml`.
-
-Following the shebang, document metadata may be provided in the [Record
-  Jar][draft-phillips-record-jar-01] format.
+A language tag may follow this, beginning with `@` and terminated with
+  `$`, like so:
+`#!lesml@en$`.
+Regardless of whether a language tag is present, the shebang line may
+  be terminated by a space‐separated list of properties of the form
+  `key=value`.
+Only one property is currently permitted: `profile`, whose value should
+  be a U·R·I and is translated to the `@data-lesml-profile` attribute
+  on the resulting `<html:article>` element.
+
+Following the shebang line, document metadata may be provided in the
+  [Record Jar][draft-phillips-record-jar-01] format.
  The body of the document begins after the last line which begins with
    the string `%%`, or after the shebang line if none exists.
  
+Multiple documents can be catenated into a single file; a new document
+  is begun on any line which starts with `#!lesml` or `##`.
+Documents in the latter case inherit the latest preceding `#!lesml`
+  declaration.
+`##` may be followed by other text; this is treated as an interdocument
+  comment.
+
  Documents are broken into paragraphs by blank lines.
-Non·empty paragraphs are classified as follows :⁠—
+Empty paragraphs are ignored.
  
-- If the paragraph consists of only the characters
-    `#*-=_~⁂─━┄┅┈┉╌╍═╴╶╸╺☙❧` plus any amount of white·space, then it is
-    considered to be a section break (`<html:hr>`).
+If every line in the paragraph begins with (optional white·space
+  followed by) `»` it is quoted (`<html:blockquote>`); if every line
+  begins with `]` it is bracketed.
+The lines, minus this leading, are then re‐analysed.
+Bracketed paragraphs which end quotes are treated as captions
+  (`<html:figcaption>`); otherwise, they are footers (`<html:footer>`).
  
-- If every line in the paragraph begins with at least one space, then
-    it is considered to be a quoted paragraph (`<html:blockquote>`).
-  There is only one level of paragraph quoting; quoted paragraphs may
-    not be quoted again.
+Non·empty paragraphs (which, to be clear, may still result in empty
+  `<html:p>` elements) are classified as follows :⁠—
  
-- Otherwise, the paragraph is unquoted.
+- If the paragraph consists of only the following section‐break
+    characters, plus any amount of white·space, then it is
+    considered to be a section break (`<html:hr>`).
  
-After this classification, each quoted or unquoted paragraph is further
-  classified by type based on its first character (which is must be
-   followed by white·space to be recognized) :⁠—
+  The section break characters are :⁠—
+
+  | Character | Codepoint | Unicode Name |
+  | --------- | --------- | ------------ |
+  | `*` | `U+002A` | `ASTERISK` |
+  | `-` | `U+002D` | `HYPHEN-MINUS` |
+  | `.` | `U+002E` | `FULL STOP` |
+  | `=` | `U+003D` | `EQUALS SIGN` |
+  | `_` | `U+005F` | `LOW LINE` |
+  | `~` | `U+007E` | `TILDE` |
+  | `·` | `U+00B7` | `MIDDLE DOT` |
+  | `․` | `U+2024` | `ONE DOT LEADER` |
+  | `‥` | `U+2025` | `TWO DOT LEADER` |
+  | `…` | `U+2026` | `HORIZONTAL ELLIPSIS` |
+  | `⁂` | `U+2042` | `ASTERISM` |
+  | `⋯` | `U+22EF` | `MIDLINE HORIZONTAL ELLIPSIS` |
+  | `─` | `U+2500` | `BOX DRAWINGS LIGHT HORIZONTAL` |
+  | `━` | `U+2501` | `BOX DRAWINGS HEAVY HORIZONTAL` |
+  | `┄` | `U+2504` | `BOX DRAWINGS LIGHT TRIPLE DASH HORIZONTAL` |
+  | `┅` | `U+2505` | `BOX DRAWINGS HEAVY TRIPLE DASH HORIZONTAL` |
+  | `┈` | `U+2508` | `BOX DRAWINGS LIGHT QUADRUPLE DASH HORIZONTAL` |
+  | `┉` | `U+2509` | `BOX DRAWINGS HEAVY QUADRUPLE DASH HORIZONTAL` |
+  | `╌` | `U+254C` | `BOX DRAWINGS LIGHT DOUBLE DASH HORIZONTAL` |
+  | `╍` | `U+254D` | `BOX DRAWINGS HEAVY DOUBLE DASH HORIZONTAL` |
+  | `═` | `U+2550` | `BOX DRAWINGS DOUBLE HORIZONTAL` |
+  | `╴` | `U+2574` | `BOX DRAWINGS LIGHT LEFT` |
+  | `╶` | `U+2576` | `BOX DRAWINGS LIGHT RIGHT` |
+  | `╸` | `U+2578` | `BOX DRAWINGS HEAVY LEFT` |
+  | `╺` | `U+257A` | `BOX DRAWINGS HEAVY RIGHT` |
+  | `☙` | `U+2619` | `REVERSED ROTATED FLORAL HEART BULLET` |
+  | `❧` | `U+2767` | `ROTATED FLORAL HEART BULLET` |
+  | `　` | `U+3000` | `IDEOGRAPHIC SPACE` |
+  | `・` | `U+30FB` | `KATAKANA MIDDLE DOT` |
+  | `＊` | `U+FF0A` | `FULLWIDTH ASTERISK` |
+  | `－` | `U+FF0D` | `FULLWIDTH HYPHEN-MINUS` |
+  | `．` | `U+FF0E` | `FULLWIDTH FULL STOP` |
+  | `＝` | `U+FF1D` | `FULLWIDTH EQUALS SIGN` |
+  | `＿` | `U+FF3F` | `FULLWIDTH LOW LINE` |
+  | `～` | `U+FF5E` | `FULLWIDTH TILDE` |
+
+- If every line in the paragraph begins with zero or more white·space
+    characters followed by `|`, it is a “preformatted” paragraph and
+    white·space is not collapsed (`<html:pre>`).
+
+- Otherwise, the paragraph is ordinary.
+
+After this classification, each ordinary paragraph is further
+  classified by type based on its first character (which must be
+  followed by white·space or a pilcrow, or else be the only thing on
+  the line) :⁠—
+
+- If the paragraph is preformatted, it is an ordinary paragraph.
  
  - If the paragraph begins with `⁌`, it is a chapter heading
      (`<html:h1>`).
@@ -65,61 +133,112 @@ After this classification, each quoted or unquoted paragraph is further
      (`<html:h4>`).
  
  - If the paragraph begins with `•` or `🔢`, it is a primary unordered
-    or ordered list item (`<html:li class="unordered" data-level="1">`
-    or `<html:li class="ordered" data-level="1">`).
+    or ordered list item (`<html:li class="unordered" aria-level="1">`
+    or `<html:li class="ordered" aria-level="1">`).
  
  - If the paragraph begins with `◦` or `🔠`, it is a secondary unordered
-    or ordered list item (`<html:li class="unordered" data-level="2">`
-    or `<html:li class="ordered" data-level="2">`).
+    or ordered list item (`<html:li class="unordered" aria-level="2">`
+    or `<html:li class="ordered" aria-level="2">`).
    Secondary list items are considered to be nested inside of primary
      list items which precede them.
  
  - If the paragraph begins with `▪` or `🔡`, it is a tertiary unordered
-    or ordered list item (`<html:li class="unordered" data-level="3">`
-    or `<html:li class="ordered" data-level="3">`).
+    or ordered list item (`<html:li class="unordered" aria-level="3">`
+    or `<html:li class="ordered" aria-level="3">`).
    Tertiary list items are considered to be nested inside of primary
      and secondary list items which precede them.
  
  - If the paragraph begins with `⁃` or `🔣`, it is a quaternary
      unordered or ordered list item
-    (`<html:li class="unordered" data-level="4">` or
-    `<html:li class="ordered" data-level="4">`).
+    (`<html:li class="unordered" aria-level="4">` or
+    `<html:li class="ordered" aria-level="4">`).
    Quaternary list items are considered to be nested inside of primary,
      secondary, and tertiary list items which precede them.
  
  - If the paragraph begins with `※`, it is an ordinary note
-    (`<html:div role="note" class="note">`).
+    (`<html:section role="note" class="note">`).
  
  - If the paragraph begins with `☡`, it is a cautionary note
-    (`<html:div role="note" class="caution">`).
-
-- If the paragraph begins with `🛈`, it is an informative note
-    (`<html:div role="note" class="info">`).
+    (`<html:section role="note" class="caution">`).
  
  - If the paragraph begins with `⯑`, it is a questioning note
-    (`<html:div role="note" class="query">`).
-
-- If the paragraph begins with `⚠︎`, it is a warning note
-    (`<html:div role="note" class="warn">`).
-
-- If the paragraph begins with `⋯`, it is a continuation paragraph
-    (`<html:div class="continuation">`).
-  Continuation paragraphs may be used to continue a preceding list item
-    or quote.
-  Note, however, that an unquoted paragraph cannot continue a quoted
-    one, or vice·versa.
+    (`<html:section role="note" class="query">`).
+
+- If the paragraph begins with `@`, it is an abstract
+    (`<html:section role="doc-abstract">`).
+
+- If the paragraph begins with `🛈`, it is an (informative) tip
+    (`<html:section role="doc-tip">`).
+
+- If the paragraph begins with `⚠︎`, it is a (warning) notice
+    (`<html:section role="doc-notice">`).
+
+- If the paragraph begins with `^`, it is a footnote
+    (`<html:li class="ordered footnote" aria-level="1">`).
+  Footnotes are ignored unless their first paragraph has an i·d
+    (specified with `¶`) which is referenced by one or more footnote
+    references.
+  Footnotes are treated as level 1 ordered list items, so they can
+    contain nested lists.
+
+  Footnotes are removed from the normal document flow and placed in a
+    footer (`<html:section role="doc-endnotes">`) in order of first
+    reference.
+  It is recommended that the i·d¦s you choose are kept stable, so that
+    links to footnotes do not break.
+
+- If the paragraph begins with `#`, it is a comment.
+  Comments produce X·M·L comment nodes and can be used to break up list
+    items into separate lists.
+
+- If the paragraph begins with `⋯`, it is a continuation paragraph.
+  Continuation paragraphs may be used to continue a preceding note,
+    footnote, or list item.
+  If there is no such preceding note, footnote, or list item, they will
+    attach to adjacent heading elements to form heading groups
+    (`<html:hgroup>`).
+  Otherwise, they will be treated as ordinary paragraphs.
  
  - Otherwise, it is an ordinary paragraph.
  
-Following this sigil (if any, including trailing white·space) there may
-  be a `¶` followed by zero or more non·white·space characters.
+Following this sigil (if any) there may be a `¶` followed by zero or
+  more non·white·space characters.
  The characters following the `¶` give the identifier for the paragraph,
    which is expected to be unique within a document.
+This may be suffixed with a language tag beginning with `@` and
+  terminated with `$`.
+
+When a paragraph produces an `<html:p>` element “wrapped in” another
+  kind of element (e·g, a blockquote, section, or list item), the
+  identifier and language of the first paragraph are applied to the
+  wrapping element.
+If the first paragraph has no other contents, it is deleted.
+To apply the identifier or language to the `<html:p>` element itself,
+  and not its wrapper, one can simply make the first paragraph empty
+  (using a literal `¶` with no other contents).
+This paragraph will be dropped, but the following paragraphs will still
+  be processed as non·initial.
  
  The remaining characters in a paragraph form its contents.
  Markup within paragraphs is delimited with·out exception by pairs of
    characters, with the following precedence :⁠—
  
+- The characters `⌦` and `⌫` indicate inline comments.
+  A single character `⌧` may be used to indicate an “empty” comment
+    (consisting of `U+034F COMBINING GRAPHEME JOINER` for X·M·L
+    compatibility).
+
+- The characters `{@` and `"}` indicate attribute specifications.
+  The attribute specification must contain at least one `="` which
+    separates the key of the attribute from the value.
+  Attributes attach to the previous element or text node, with
+    white·space‐only text nodes after elements ignored; if there is no
+    such previous element or text node, an empty text node is used
+    instead.
+  Multiple attributes can be given in sequence using multiple
+    specifications.
+  Text nodes with attributes are wrapped in `<html:span>`.
+
  - The characters `{🔗` and `>}` indicate a hyperlink to a U·R·L
      (`<html:a>`).
    The hyperlink must contain at least one `<`; the content before the
@@ -137,50 +256,56 @@ Markup within paragraphs is delimited with·out exception by pairs of
  - The characters `⸨` and `⸩` indicate parenthetical content
      (`<html:small>`).
  
-- The characters `☞︎` and `☜︎` indicate strong importance
-    (`<html:strong>`).
-
-- The characters `⹐` and `⹑` indicate emphasis (`<html:em>`).
+- The characters `` ` `` and `´` indicate code (`<html:code>`).
  
  - The characters `⟪` and `⟫` indicate titles (`<html:cite>`).
  
+- The characters `⸶` and `⸷` indicate names (`<html:u class="name">`).
+
  - The characters `⟨` and `⟩` indicate offset text (`<html:i>`).
-  This may be followed by a `@`, a language tag, and a `$` to provide
-    the language of the text.
  
  - The characters `⦃` and `⦄` indicate keyword highlighting
      (`<html:b>`).
  
-- The characters `` ` `` and `´` indicate code (`<html:code>`).
+- The characters `☞︎` and `☜︎` indicate strong importance
+    (`<html:strong>`).
+
+- The characters `⹐` and `⹑` indicate emphasis (`<html:em>`).
+
+- The characters `^` and `.` indicate a footnote reference
+    (`<html:a role="doc-noteref">`).
+  The characters between these sigils must match the i·d of the first
+    paragraph of some footnote in the same document.
  
  Once the tree is built as above, it is remediated into its final form
    by the following steps :⁠—
  
-- Successive quoted paragraphs are joined into one quote.
-  If the final quoted paragraph is an ordinary paragraph which begins
-    with `—` and a space, the quote is wrapped in a `<html:figure>`
-    and the final paragraph becomes its `<html:figcaption>`.
-
  - Continuation paragraphs are joined with the preceding list items or
-    quotes.
+    sections.
  
  - List items of a higher level are nested in preceding list items, when
      present.
+  List items of a level greater than 1 can also be nested in preceding
+    sections (notes, abstracts, ⁊·c…).
  
  - Successive list items of the same level and class are joined into
      a single list.
  
+- Linebreaks in preformatted paragraphs are replaced with `<html:br>`.
+
  Finally, any character can be escaped by instead providing its Unicode
-  codepoint in the form `<U+NNNN>`, where `NNNN` is one or more
+  codepoint in the form `{U+NNNN}`, where `NNNN` is one or more
    hexadecimal digits.
  Multiple codepoints may be provided separated by periods, as in
-  `<U+WWWW.ZZZZ>`
+  `{U+WWWW.ZZZZ}`.
+Due to limitations in X·S·L·T, characters cannot be escaped in
+  attributes (including link targets).
  
  ## Usage
  
-💄📝 Les·M·L is designed for usage with [⛩️📰 书社][Shushe].
+💄📝 Les·M·L is designed for usage with [⛩📰 书社][Shushe].
  Simply include the `parser.xslt` provided by this repository to
-  ⛩️📰 书社 as an additional parser, and `magic` as an additional
+  ⛩📰 书社 as an additional parser, and `magic` as an additional
    magic file.
  
  ## License
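
The new `{U+NNNN}` escape syntax described in this change can be illustrated with a short sketch. The following Python is not part of the repository (the actual implementation is an X·S·L·T transformation); the name `decode_escapes` is hypothetical, and two behaviours are assumptions not stated in the README: lower-case hexadecimal digits are accepted, and an escape naming an invalid codepoint is left untouched.

```python
import re

# Matches {U+NNNN} or {U+WWWW.ZZZZ}: one or more hexadecimal codepoints
# separated by periods. Accepting lower-case digits is an assumption.
ESCAPE = re.compile(r"\{U\+([0-9A-Fa-f]+(?:\.[0-9A-Fa-f]+)*)\}")


def decode_escapes(text: str) -> str:
    """Replace each codepoint escape in `text` with the characters it names."""

    def replace(match: re.Match) -> str:
        try:
            return "".join(chr(int(h, 16)) for h in match.group(1).split("."))
        except (ValueError, OverflowError):
            # Assumption: escapes naming a codepoint outside the Unicode
            # range are passed through unchanged rather than rejected.
            return match.group(0)

    return ESCAPE.sub(replace, text)
```

Under these assumptions, `decode_escapes("{U+0041.0301}")` yields `A` followed by a combining acute accent, while the old `<U+0041>` form is passed through unchanged.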