Quotes and brackets as multiparagraph divisions

[LesML] / README.markdown
diff --git a/README.markdown b/README.markdown

index ea08cb475ec25539acc378775955ddc2b0d9da4b..1643af28f7db26d85b1471e309883924a17bd696 100644 (file)
--- a/README.markdown
+++ b/README.markdown
@@ -1,5 +1,5 @@
  <!--
-SPDX-FileCopyrightText: 2024 Lady <https://www.ladys.computer/about/#lady>
+SPDX-FileCopyrightText: 2024, 2025 Lady <https://www.ladys.computer/about/#lady>
  SPDX-License-Identifier: CC0-1.0
  -->
  # 💄📝 Les·M·L
@@ -28,29 +28,96 @@ It is implemented as an X·S·L·T transformation from a
  
  The first line of any 💄📝 Les·M·L document should be the string
    `#!lesml`.
-
-Following the shebang, document metadata may be provided in the [Record
-  Jar][draft-phillips-record-jar-01] format.
+A language tag may follow this, beginning with `@` and terminated with
+  `$`, like so:
+`#!lesml@en$`.
+Regardless of whether a language tag is present, the shebang line may
+  be terminated by a space‐separated list of properties of the form
+  `key=value`.
+Only one property is currently permitted: `profile`, whose value should
+  be a U·R·I and is translated to the `@data-lesml-profile` attribute
+  on the resulting `<html:article>` element.
+
+Following the shebang line, document metadata may be provided in the
+  [Record Jar][draft-phillips-record-jar-01] format.
  The body of the document begins after the last line which begins with
    the string `%%`, or after the shebang line if none exists.
  
+Multiple documents can be catenated into a single file; a new document
+  is begun on any line which starts with `#!lesml` or `##`.
+Documents in the latter case inherit the latest preceding `#!lesml`
+  declaration.
+`##` may be followed by other text; this is treated as an interdocument
+  comment.
+
  Documents are broken into paragraphs by blank lines.
-Non·empty paragraphs are classified as follows :⁠—
+Empty paragraphs are ignored.
  
-- If the paragraph consists of only the characters
-    `#*-=_~⁂─━┄┅┈┉╌╍═╴╶╸╺☙❧` plus any amount of white·space, then it is
-    considered to be a section break (`<html:hr>`).
+If every line in the paragraph begins with (optional white·space
+  followed by) `»` it is quoted (`<html:blockquote>`); if every line
+  begins with `]` it is bracketed.
+The lines, minus this leading, are then re‐analysed.
+Bracketed paragraphs which end quotes are treated as captions
+  (`<html:figcaption>`); otherwise, they are footers (`<html:footer>`).
  
-- If every line in the paragraph begins with at least one space, then
-    it is considered to be a quoted paragraph (`<html:blockquote>`).
-  There is only one level of paragraph quoting; quoted paragraphs may
-    not be quoted again.
+Non·empty paragraphs are classified as follows :⁠—
  
-- Otherwise, the paragraph is unquoted.
+- If the paragraph consists of only the following section‐break
+    characters, plus any amount of white·space, then it is
+    considered to be a section break (`<html:hr>`).
  
-After this classification, each quoted or unquoted paragraph is further
+  The section break characters are :⁠—
+
+  | Character | Codepoint | Unicode Name |
+  | --------- | --------- | ------------ |
+  | `*` | `U+002A` | `ASTERISK` |
+  | `-` | `U+002D` | `HYPHEN-MINUS` |
+  | `.` | `U+002E` | `FULL STOP` |
+  | `=` | `U+003D` | `EQUALS SIGN` |
+  | `_` | `U+005F` | `LOW LINE` |
+  | `~` | `U+007E` | `TILDE` |
+  | `·` | `U+00B7` | `MIDDLE DOT` |
+  | `․` | `U+2024` | `ONE DOT LEADER` |
+  | `‥` | `U+2025` | `TWO DOT LEADER` |
+  | `…` | `U+2026` | `HORIZONTAL ELLIPSIS` |
+  | `⁂` | `U+2042` | `ASTERISM` |
+  | `⋯` | `U+22EF` | `MIDLINE HORIZONTAL ELLIPSIS` |
+  | `─` | `U+2500` | `BOX DRAWINGS LIGHT HORIZONTAL` |
+  | `━` | `U+2501` | `BOX DRAWINGS HEAVY HORIZONTAL` |
+  | `┄` | `U+2504` | `BOX DRAWINGS LIGHT TRIPLE DASH HORIZONTAL` |
+  | `┅` | `U+2505` | `BOX DRAWINGS HEAVY TRIPLE DASH HORIZONTAL` |
+  | `┈` | `U+2508` | `BOX DRAWINGS LIGHT QUADRUPLE DASH HORIZONTAL` |
+  | `┉` | `U+2509` | `BOX DRAWINGS HEAVY QUADRUPLE DASH HORIZONTAL` |
+  | `╌` | `U+254C` | `BOX DRAWINGS LIGHT DOUBLE DASH HORIZONTAL` |
+  | `╍` | `U+254D` | `BOX DRAWINGS HEAVY DOUBLE DASH HORIZONTAL` |
+  | `═` | `U+2550` | `BOX DRAWINGS DOUBLE HORIZONTAL` |
+  | `╴` | `U+2574` | `BOX DRAWINGS LIGHT LEFT` |
+  | `╶` | `U+2576` | `BOX DRAWINGS LIGHT RIGHT` |
+  | `╸` | `U+2578` | `BOX DRAWINGS HEAVY LEFT` |
+  | `╺` | `U+257A` | `BOX DRAWINGS HEAVY RIGHT` |
+  | `☙` | `U+2619` | `REVERSED ROTATED FLORAL HEART BULLET` |
+  | `❧` | `U+2767` | `ROTATED FLORAL HEART BULLET` |
+  | `　` | `U+3000` | `IDEOGRAPHIC SPACE` |
+  | `・` | `U+30FB` | `KATAKANA MIDDLE DOT` |
+  | `＊` | `U+FF0A` | `FULLWIDTH ASTERISK` |
+  | `－` | `U+FF0D` | `FULLWIDTH HYPHEN-MINUS` |
+  | `．` | `U+FF0E` | `FULLWIDTH FULL STOP` |
+  | `＝` | `U+FF1D` | `FULLWIDTH EQUALS SIGN` |
+  | `＿` | `U+FF3F` | `FULLWIDTH LOW LINE` |
+  | `～` | `U+FF5E` | `FULLWIDTH TILDE` |
+
+- If every line in the paragraph begins with zero or more white·space
+    characters followed by `|`, it is a “preformatted” paragraph and
+    white·space is not collapsed (`<html:pre>`).
+
+- Otherwise, the paragraph is ordinary.
+
+After this classification, each ordinary paragraph is further
    classified by type based on its first character (which must be
-   followed by white·space to be recognized) :⁠—
+  followed by white·space, a pilcrow, or else the only thing on the
+  line) :⁠—
+
+- If the paragraph is preformatted, it is an ordinary paragraph.
  
  - If the paragraph begins with `⁌`, it is a chapter heading
      (`<html:h1>`).
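
The line-level rules added in the hunk above (the extended shebang line, and the `»` / `]` markers for quoted and bracketed paragraphs) can be illustrated with a small Python sketch. This is a reader's aid only, not the project's X·S·L·T implementation. The names `parse_shebang` and `strip_division` are hypothetical, and several details are assumptions where the README text is silent: each property is taken to be preceded by a single space, white·space is allowed before `]` as well as before `»`, and only the marker and the white·space before it are stripped.

```python
import re

# Hypothetical pattern: "#!lesml", an optional "@lang$" tag, then optional
# key=value properties, each assumed to be preceded by one space.
SHEBANG = re.compile(
    r"^#!lesml(?:@(?P<lang>[^$\s]*)\$)?(?P<props>(?: [^\s=]+=\S*)*)\s*$"
)


def parse_shebang(line):
    """Return (language, properties) for a shebang line, or None otherwise."""
    match = SHEBANG.match(line)
    if match is None:
        return None
    # Split each "key=value" item on its first "=" only, so values may contain "=".
    props = dict(item.split("=", 1) for item in match.group("props").split())
    return match.group("lang"), props


def strip_division(paragraph):
    """Classify a paragraph as a quote or bracket and strip its markers.

    Returns (kind, lines), where kind is "quote", "bracket", or None.
    """
    lines = paragraph.split("\n")
    for marker, kind in (("»", "quote"), ("]", "bracket")):
        if all(line.lstrip().startswith(marker) for line in lines):
            # Assumption: only the marker and the white-space before it go away;
            # the remaining text would then be re-analysed as a paragraph.
            return kind, [line.lstrip()[len(marker):] for line in lines]
    return None, lines
```

For instance, `parse_shebang("#!lesml@en$")` yields the language tag `en` with no properties, and `strip_division` reports a paragraph as quoted only when every one of its lines carries the `»` marker.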
@@ -58,10 +125,10 @@ After this classification, each quoted or unquoted paragraph is further
  - If the paragraph begins with `§`, it is a section heading
      (`<html:h2>`).
  
-- If the paragraph begins with `✠`, it is a subsection heading
+- If the paragraph begins with `❦`, it is a subsection heading
      (`<html:h3>`).
  
-- If the paragraph begins with `❦`, it is a subsubsection heading
+- If the paragraph begins with `✠`, it is a subsubsection heading
      (`<html:h4>`).
  
  - If the paragraph begins with `•` or `🔢`, it is a primary unordered
@@ -102,24 +169,44 @@ After this classification, each quoted or unquoted paragraph is further
  - If the paragraph begins with `⚠︎`, it is a warning note
      (`<html:div role="note" class="warn">`).
  
+- If the paragraph begins with `#`, it is a comment.
+  Comments produce X·M·L comment nodes and can be used to break up list
+    items into separate lists.
+
  - If the paragraph begins with `⋯`, it is a continuation paragraph
      (`<html:div class="continuation">`).
-  Continuation paragraphs may be used to continue a preceding list item
-    or quote.
-  Note, however, that an unquoted paragraph cannot continue a quoted
-    one, or vice·versa.
+  Continuation paragraphs may be used to continue a preceding div or
+    list item.
  
  - Otherwise, it is an ordinary paragraph.
  
-Following this sigil (if any, including trailing white·space) there may
-  be a `¶` followed by zero or more non·white·space characters.
+Following this sigil (if any) there may be a `¶` followed by zero or
+  more non·white·space characters.
  The characters following the `¶` give the identifier for the paragraph,
    which is expected to be unique within a document.
+This may be suffixed with a language tag beginning with `@` and
+  terminated with `$`.
  
  The remaining characters in a paragraph form its contents.
  Markup within paragraphs is delimited with·out exception by pairs of
    characters, with the following precedence :⁠—
  
+- The characters `⌦` and `⌫` indicate inline comments.
+  A single character `⌧` may be used to indicate an “empty” comment
+    (consisting of `U+034F COMBINING GRAPHEME JOINER` for X·M·L
+    compatibility).
+
+- The characters `{@` and `"}` indicate attribute specifications.
+  The attribute specification must contain at least one `="` which
+    separates the key of the attribute from the value.
+  Attributes attach to the previous element or text node, with
+    white·space‐only text nodes after elements ignored; if there is no
+    such previous element or text node, an empty text node is used
+    instead.
+  Multiple attributes can be given in sequence using multiple
+    specifications.
+  Text nodes with attributes are wrapped in `<html:span>`.
+
  - The characters `{🔗` and `>}` indicate a hyperlink to a U·R·L
      (`<html:a>`).
    The hyperlink must contain at least one `<`; the content before the
@@ -137,32 +224,27 @@ Markup within paragraphs is delimited with·out exception by pairs of
  - The characters `⸨` and `⸩` indicate parenthetical content
      (`<html:small>`).
  
-- The characters `☞︎` and `☜︎` indicate strong importance
-    (`<html:strong>`).
-
-- The characters `⹐` and `⹑` indicate emphasis (`<html:em>`).
+- The characters `` ` `` and `´` indicate code (`<html:code>`).
  
  - The characters `⟪` and `⟫` indicate titles (`<html:cite>`).
  
+- The characters `⸶` and `⸷` indicate names (`<html:u class="name">`).
+
  - The characters `⟨` and `⟩` indicate offset text (`<html:i>`).
-  This may be followed by a `@`, a language tag, and a `$` to provide
-    the language of the text.
  
  - The characters `⦃` and `⦄` indicate keyword highlighting
      (`<html:b>`).
  
-- The characters `` ` `` and `´` indicate code (`<html:code>`).
+- The characters `☞︎` and `☜︎` indicate strong importance
+    (`<html:strong>`).
+
+- The characters `⹐` and `⹑` indicate emphasis (`<html:em>`).
  
  Once the tree is built as above, it is remediated into its final form
    by the following steps :⁠—
  
-- Successive quoted paragraphs are joined into one quote.
-  If the final quoted paragraph is an ordinary paragraph which begins
-    with `—` and a space, the quote is wrapped in a `<html:figure>`
-    and the final paragraph becomes its `<html:figcaption>`.
-
  - Continuation paragraphs are joined with the preceding list items or
-    quotes.
+    divs.
  
  - List items of a higher level are nested in preceding list items, when
      present.
@@ -170,17 +252,21 @@ Once the tree is built as above, it is remediated into its final form
  - Successive list items of the same level and class are joined into
      a single list.
  
+- Linebreaks in preformatted paragraphs are replaced with `<html:br>`.
+
  Finally, any character can be escaped by instead providing its Unicode
-  codepoint in the form `<U+NNNN>`, where `NNNN` is one or more
+  codepoint in the form `{U+NNNN}`, where `NNNN` is one or more
    hexadecimal digits.
  Multiple codepoints may be provided separated by periods, as in
-  `<U+WWWW.ZZZZ>`
+  `{U+WWWW.ZZZZ}`.
+Due to limitations in X·S·L·T, characters cannot be escaped in
+  attributes (including link targets).
  
  ## Usage
  
-💄📝 Les·M·L is designed for usage with [⛩️📰 书社][Shushe].
+💄📝 Les·M·L is designed for usage with [⛩📰 书社][Shushe].
  Simply include the `parser.xslt` provided by this repository to
-  ⛩️📰 书社 as an additional parser, and `magic` as an additional
+  ⛩📰 书社 as an additional parser, and `magic` as an additional
    magic file.
  
  ## License
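
The `{U+NNNN}` escape form introduced in the last hunk of this diff can be illustrated with a short Python sketch. It is a reader's aid under stated assumptions, not the project's X·S·L·T implementation: the name `unescape` is hypothetical, lowercase hexadecimal digits are accepted although the README does not say whether they are allowed, and codepoints beyond the Unicode range are not handled (Python's `chr` would raise `ValueError`).

```python
import re

# One or more hexadecimal codepoints separated by periods, wrapped in {U+...}.
ESCAPE = re.compile(r"\{U\+([0-9A-Fa-f]+(?:\.[0-9A-Fa-f]+)*)\}")


def unescape(text):
    """Replace each {U+NNNN} or {U+WWWW.ZZZZ} escape with the characters it names."""

    def replace(match):
        # Convert every period-separated hexadecimal codepoint to its character.
        return "".join(chr(int(code, 16)) for code in match.group(1).split("."))

    return ESCAPE.sub(replace, text)
```

Text that does not match the escape pattern is left untouched, so a malformed sequence such as `{U+ZZ}` passes through unchanged in this sketch.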