1 <!--
2 SPDX-FileCopyrightText: 2024, 2025 Lady <https://www.ladys.computer/about/#lady>
3 SPDX-License-Identifier: CC0-1.0
4 -->
5 # 💄📝 Les·M·L
6
7 <b>Lady’s simple markup language.</b>
8
9 💄📝 Les·M·L is a document markup language designed with two goals in
10 mind :⁠—
11
12 1. It must be trivial to parse, even with limited tooling such as that
13 provided by X·S·L·T.
14
15 2. It must be sophisticated enough to handle longform hypertext
16 documents and associated metadata.
17
18 It is implemented as an X·S·L·T transformation from a
19 `<html:script type="text/lesml">` element into H·T·M·L
20 (`parser.xslt`).
21
22 ## Nomenclature
23
24 <i>Les·M·L</i> is an abbreviation of the phrase “Lady’s Extremely Simple
25 Markup Language”.
26
27 ## Markup Syntax
28
29 The first line of any 💄📝 Les·M·L document should be the string
30 `#!lesml`.
31 A language tag may follow this, beginning with `@` and terminated with
32 `$`, like so:
33 `#!lesml@en$`.
34 Regardless of whether a language tag is present, the shebang line may
35 be terminated by a space‐separated list of properties of the form
36 `key=value`.
37 Only one property is currently permitted: `profile`, whose value should
38 be a U·R·I and is translated to the `@data-lesml-profile` attribute
39 on the resulting `<html:article>` element.
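
As a rough illustration of the shebang rules above, here is a minimal Python sketch. The function name `parse_shebang` and the pattern `SHEBANG` are hypothetical and not part of `parser.xslt`; the sketch also assumes that each property is preceded by exactly one space.

```python
import re

# Hypothetical pattern for the shebang line; the exact white-space handling
# is an assumption, not something the README specifies.
SHEBANG = re.compile(
    r"^#!lesml"                       # required prefix
    r"(?:@(?P<lang>[^$]*)\$)?"        # optional language tag, `@` ... `$`
    r"(?P<props>(?: [^ =]+=[^ ]*)*)$"  # optional space-separated key=value list
)

def parse_shebang(line):
    """Return the language tag and properties of a shebang line, or None."""
    m = SHEBANG.match(line)
    if m is None:
        return None
    # Split each property on its first `=` only, so values may contain `=`.
    props = dict(p.split("=", 1) for p in m.group("props").split())
    return {"lang": m.group("lang"), "props": props}
```

A real implementation would additionally reject any property other than `profile`; this sketch accepts any key.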
40
41 Following the shebang line, document metadata may be provided in the
42 [Record Jar][draft-phillips-record-jar-01] format.
43 The body of the document begins after the last line which begins with
44 the string `%%`, or after the shebang line if none exists.
45
46 Multiple documents can be catenated into a single file; a new document
47 is begun on any line which starts with `#!lesml` or `##`.
48 Documents in the latter case inherit the latest preceding `#!lesml`
49 declaration.
50 `##` may be followed by other text; this is treated as an interdocument
51 comment.
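
The splitting rule can be sketched in Python as follows. The name `split_documents` and the dictionary layout are hypothetical, and the sketch assumes that any lines before the first `#!lesml` line are simply dropped, which the text above does not specify.

```python
def split_documents(text):
    """Split a catenated file into documents, tracking the shebang in force."""
    docs = []
    shebang = None   # latest `#!lesml` declaration seen so far
    current = None   # document currently being collected
    for line in text.splitlines():
        if line.startswith("#!lesml"):
            # A new declaration starts a new document.
            shebang = line
            current = {"shebang": shebang, "lines": []}
            docs.append(current)
        elif line.startswith("##"):
            # `##` starts a new document which inherits the latest
            # declaration; the rest of the line is a comment and is dropped.
            current = {"shebang": shebang, "lines": []}
            docs.append(current)
        elif current is not None:
            current["lines"].append(line)
    return docs
```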
52
53 Documents are broken into paragraphs by blank lines.
54 Empty paragraphs are ignored.
55
56 If every line in the paragraph begins with (optional white·space
57 followed by) `»` it is quoted (`<html:blockquote>`); if every line
58 begins with `]` it is bracketed.
59 The lines, minus this leading marker, are then re‐analysed.
60 Bracketed paragraphs which end quotes are treated as captions
61 (`<html:figcaption>`); otherwise, they are footers (`<html:footer>`).
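
A minimal Python sketch of the “every line begins with a marker” test follows. The helper `strip_prefix` is hypothetical, and it assumes that optional leading white·space is permitted before `]` in the same way as before `»`.

```python
def strip_prefix(lines, marker):
    """Return the lines minus the marker if every line has it, else None."""
    stripped = [line.lstrip() for line in lines]
    if lines and all(s.startswith(marker) for s in stripped):
        # Remove only the marker; what remains is re-analysed by the caller.
        return [s[len(marker):] for s in stripped]
    return None
```

For example, `strip_prefix(lines, "»")` would be tried first for quotes and `strip_prefix(lines, "]")` for bracketed paragraphs, with the returned lines fed back into paragraph analysis.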
62
63 Non·empty paragraphs (which, to be clear, may still result in empty
64 `<html:p>` elements) are classified as follows :⁠—
65
66 - If the paragraph consists of only the following section‐break
67 characters, plus any amount of white·space, then it is
68 considered to be a section break (`<html:hr>`).
69
70 The section break characters are :⁠—
71
72 | Character | Codepoint | Unicode Name |
73 | --------- | --------- | ------------ |
74 | `*` | `U+002A` | `ASTERISK` |
75 | `-` | `U+002D` | `HYPHEN-MINUS` |
76 | `.` | `U+002E` | `FULL STOP` |
77 | `=` | `U+003D` | `EQUALS SIGN` |
78 | `_` | `U+005F` | `LOW LINE` |
79 | `~` | `U+007E` | `TILDE` |
80 | `·` | `U+00B7` | `MIDDLE DOT` |
81 | `․` | `U+2024` | `ONE DOT LEADER` |
82 | `‥` | `U+2025` | `TWO DOT LEADER` |
83 | `…` | `U+2026` | `HORIZONTAL ELLIPSIS` |
84 | `⁂` | `U+2042` | `ASTERISM` |
85 | `⋯` | `U+22EF` | `MIDLINE HORIZONTAL ELLIPSIS` |
86 | `─` | `U+2500` | `BOX DRAWINGS LIGHT HORIZONTAL` |
87 | `━` | `U+2501` | `BOX DRAWINGS HEAVY HORIZONTAL` |
88 | `┄` | `U+2504` | `BOX DRAWINGS LIGHT TRIPLE DASH HORIZONTAL` |
89 | `┅` | `U+2505` | `BOX DRAWINGS HEAVY TRIPLE DASH HORIZONTAL` |
90 | `┈` | `U+2508` | `BOX DRAWINGS LIGHT QUADRUPLE DASH HORIZONTAL` |
91 | `┉` | `U+2509` | `BOX DRAWINGS HEAVY QUADRUPLE DASH HORIZONTAL` |
92 | `╌` | `U+254C` | `BOX DRAWINGS LIGHT DOUBLE DASH HORIZONTAL` |
93 | `╍` | `U+254D` | `BOX DRAWINGS HEAVY DOUBLE DASH HORIZONTAL` |
94 | `═` | `U+2550` | `BOX DRAWINGS DOUBLE HORIZONTAL` |
95 | `╴` | `U+2574` | `BOX DRAWINGS LIGHT LEFT` |
96 | `╶` | `U+2576` | `BOX DRAWINGS LIGHT RIGHT` |
97 | `╸` | `U+2578` | `BOX DRAWINGS HEAVY LEFT` |
98 | `╺` | `U+257A` | `BOX DRAWINGS HEAVY RIGHT` |
99 | `☙` | `U+2619` | `REVERSED ROTATED FLORAL HEART BULLET` |
100 | `❧` | `U+2767` | `ROTATED FLORAL HEART BULLET` |
101 | `　` | `U+3000` | `IDEOGRAPHIC SPACE` |
102 | `・` | `U+30FB` | `KATAKANA MIDDLE DOT` |
103 | `＊` | `U+FF0A` | `FULLWIDTH ASTERISK` |
104 | `－` | `U+FF0D` | `FULLWIDTH HYPHEN-MINUS` |
105 | `．` | `U+FF0E` | `FULLWIDTH FULL STOP` |
106 | `＝` | `U+FF1D` | `FULLWIDTH EQUALS SIGN` |
107 | `＿` | `U+FF3F` | `FULLWIDTH LOW LINE` |
108 | `～` | `U+FF5E` | `FULLWIDTH TILDE` |
109
110 - If every line in the paragraph begins with zero or more white·space
111 characters followed by `|`, it is a “preformatted” paragraph and
112 white·space is not collapsed (`<html:pre>`).
113
114 - Otherwise, the paragraph is ordinary.
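
The three‐way classification above might be sketched in Python like this. The names `SECTION_BREAK_CHARS` and `classify_paragraph` are hypothetical, and Python’s `str.isspace()` is used as a stand‐in for white·space, which may not match what the X·S·L·T parser treats as white·space.

```python
# The section-break characters from the table above, given by codepoint.
SECTION_BREAK_CHARS = frozenset(
    "*-.=_~\u00B7\u2024\u2025\u2026\u2042\u22EF"
    "\u2500\u2501\u2504\u2505\u2508\u2509\u254C\u254D\u2550"
    "\u2574\u2576\u2578\u257A\u2619\u2767\u3000\u30FB"
    "\uFF0A\uFF0D\uFF0E\uFF1D\uFF3F\uFF5E"
)

def classify_paragraph(lines):
    """Classify a non-empty paragraph as break, preformatted, or ordinary."""
    text = "".join(lines)
    only_break = all(ch in SECTION_BREAK_CHARS or ch.isspace() for ch in text)
    if only_break and any(ch in SECTION_BREAK_CHARS for ch in text):
        return "break"
    # Preformatted: every line starts with optional white-space then `|`.
    if all(line.lstrip().startswith("|") for line in lines):
        return "preformatted"
    return "ordinary"
```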
115
116 After this classification, each ordinary paragraph is further
117 classified by type based on its first character (which must be
118 followed by white·space or a pilcrow, or else be the only thing on
119 the line) :⁠—
120
121 - If the paragraph is preformatted, it is an ordinary paragraph.
122
123 - If the paragraph begins with `⁌`, it is a chapter heading
124 (`<html:h1>`).
125
126 - If the paragraph begins with `§`, it is a section heading
127 (`<html:h2>`).
128
129 - If the paragraph begins with `❦`, it is a subsection heading
130 (`<html:h3>`).
131
132 - If the paragraph begins with `✠`, it is a subsubsection heading
133 (`<html:h4>`).
134
135 - If the paragraph begins with `•` or `🔢`, it is a primary unordered
136 or ordered list item (`<html:li class="unordered" aria-level="1">`
137 or `<html:li class="ordered" aria-level="1">`).
138
139 - If the paragraph begins with `◦` or `🔠`, it is a secondary unordered
140 or ordered list item (`<html:li class="unordered" aria-level="2">`
141 or `<html:li class="ordered" aria-level="2">`).
142 Secondary list items are considered to be nested inside of primary
143 list items which precede them.
144
145 - If the paragraph begins with `▪` or `🔡`, it is a tertiary unordered
146 or ordered list item (`<html:li class="unordered" aria-level="3">`
147 or `<html:li class="ordered" aria-level="3">`).
148 Tertiary list items are considered to be nested inside of primary
149 and secondary list items which precede them.
150
151 - If the paragraph begins with `⁃` or `🔣`, it is a quaternary
152 unordered or ordered list item
153 (`<html:li class="unordered" aria-level="4">` or
154 `<html:li class="ordered" aria-level="4">`).
155 Quaternary list items are considered to be nested inside of primary,
156 secondary, and tertiary list items which precede them.
157
158 - If the paragraph begins with `※`, it is an ordinary note
159 (`<html:section role="note" class="note">`).
160
161 - If the paragraph begins with `☡`, it is a cautionary note
162 (`<html:section role="note" class="caution">`).
163
164 - If the paragraph begins with `⯑`, it is a questioning note
165 (`<html:section role="note" class="query">`).
166
167 - If the paragraph begins with `@`, it is an abstract
168 (`<html:section role="doc-abstract">`).
169
170 - If the paragraph begins with `🛈`, it is an (informative) tip
171 (`<html:section role="doc-tip">`).
172
173 - If the paragraph begins with `⚠︎`, it is a (warning) notice
174 (`<html:section role="doc-notice">`).
175
176 - If the paragraph begins with `^`, it is a footnote
177 (`<html:li class="ordered footnote" aria-level="1">`).
178 Footnotes are ignored unless their first paragraph has an i·d
179 (specified with `¶`) which is referenced by one or more footnote
180 references.
181 Footnotes are treated as level 1 ordered list items, so they can
182 contain nested lists.
183
184 Footnotes are removed from the normal document flow and placed in a
185 footer (`<html:section role="doc-endnotes">`) in order of first
186 reference.
187 It is recommended that the i·d¦s you choose are kept stable, so that
188 links to footnotes do not break.
189
190 - If the paragraph begins with `#`, it is a comment.
191 Comments produce X·M·L comment nodes and can be used to break up list
192 items into separate lists.
193
194 - If the paragraph begins with `⋯`, it is a continuation paragraph.
195 Continuation paragraphs may be used to continue a preceding note,
196 footnote, or list item.
197 If there is no such preceding note, footnote, or list item, they will
198 attach to adjacent heading elements to form heading groups
199 (`<html:hgroup>`).
200 Otherwise, they will be treated as ordinary paragraphs.
201
202 - Otherwise, it is an ordinary paragraph.
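
To illustrate how the sigil rules could be applied, here is a small Python sketch covering only a subset of the sigils listed above. The table `SIGILS`, the function `paragraph_type`, and the returned labels are hypothetical names chosen for this example.

```python
# A subset of the sigils above, mapped to illustrative labels.
SIGILS = {
    "⁌": "chapter heading",
    "§": "section heading",
    "❦": "subsection heading",
    "✠": "subsubsection heading",
    "•": "unordered list item, level 1",
    "※": "note",
    "@": "abstract",
    "^": "footnote",
    "#": "comment",
    "⋯": "continuation",
}

def paragraph_type(text):
    """Return the type of an ordinary paragraph based on its leading sigil."""
    for sigil, kind in SIGILS.items():
        if text.startswith(sigil):
            rest = text[len(sigil):]
            # The sigil must stand alone, or be followed by white-space or `¶`.
            if rest == "" or rest[0].isspace() or rest[0] == "¶":
                return kind
    return "ordinary"
```

Note that a sigil immediately followed by other text, as in `§Intro`, does not count and leaves the paragraph ordinary.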
203
204 Following this sigil (if any) there may be a `¶` followed by zero or
205 more non·white·space characters.
206 The characters following the `¶` give the identifier for the paragraph,
207 which is expected to be unique within a document.
208 This may be suffixed with a language tag beginning with `@` and
209 terminated with `$`.
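
The identifier syntax can be sketched with a regular expression in Python. The names `ID_PATTERN` and `parse_identifier` are hypothetical; the sketch assumes the text passed in has already had its sigil removed, and that a trailing `@…$` is always read as a language tag.

```python
import re

# `¶`, then the identifier, then an optional `@lang$` suffix, ending at
# white-space or the end of the text.
ID_PATTERN = re.compile(r"¶(?P<id>\S*?)(?:@(?P<lang>[^\s$]*)\$)?(?=\s|$)")

def parse_identifier(rest):
    """Return (identifier, language, remaining text) for a paragraph body."""
    body = rest.lstrip()
    m = ID_PATTERN.match(body)
    if m is None:
        return None, None, rest
    return m.group("id"), m.group("lang"), body[m.end():].lstrip()
```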
210
211 When a paragraph produces an `<html:p>` element “wrapped in” another
212 kind of element (e·g, a blockquote, section, or list item), the
213 identifier and language of the first paragraph are applied to the
214 wrapping element.
215 If the first paragraph has no other contents, it is deleted.
216 To apply the identifier or language to the `<html:p>` element itself,
217 and not its wrapper, one can simply make the first paragraph empty
218 (using a literal `¶` with no other contents).
219 This paragraph will be dropped, but the following paragraphs will still
220 be processed as non·initial.
221
222 The remaining characters in a paragraph form its contents.
223 Markup within paragraphs is delimited with·out exception by pairs of
224 characters, with the following precedence :⁠—
225
226 - The characters `⌦` and `⌫` indicate inline comments.
227 A single character `⌧` may be used to indicate an “empty” comment
228 (consisting of `U+034F COMBINING GRAPHEME JOINER` for X·M·L
229 compatibility).
230
231 - The characters `{@` and `"}` indicate attribute specifications.
232 The attribute specification must contain at least one `="` which
233 separates the key of the attribute from the value.
234 Attributes attach to the previous element or text node, with
235 white·space‐only text nodes after elements ignored; if there is no
236 such previous element or text node, an empty text node is used
237 instead.
238 Multiple attributes can be given in sequence using multiple
239 specifications.
240 Text nodes with attributes are wrapped in `<html:span>`.
241
242 - The characters `{🔗` and `>}` indicate a hyperlink to a U·R·L
243 (`<html:a>`).
244 The hyperlink must contain at least one `<`; the content before the
245 last `<` gives the text of the link, and the content after gives
246 the U·R·L that the link points to.
247 If no text is given, the U·R·L will be used instead.
248
249 - The characters `⸠` and `⸡` indicate a strikethru (`<html:s>`).
250
251 - The characters `⸤` and `⸥` indicate underlining (`<html:u>`).
252
253 - The characters `⟦` and `⟧` indicate an inline note
254 (`<html:small role="note">`).
255
256 - The characters `⸨` and `⸩` indicate parenthetical content
257 (`<html:small>`).
258
259 - The characters `` ` `` and `´` indicate code (`<html:code>`).
260
261 - The characters `⟪` and `⟫` indicate titles (`<html:cite>`).
262
263 - The characters `⸶` and `⸷` indicate names (`<html:u class="name">`).
264
265 - The characters `⟨` and `⟩` indicate offset text (`<html:i>`).
266
267 - The characters `⦃` and `⦄` indicate keyword highlighting
268 (`<html:b>`).
269
270 - The characters `☞︎` and `☜︎` indicate strong importance
271 (`<html:strong>`).
272
273 - The characters `⹐` and `⹑` indicate emphasis (`<html:em>`).
274
275 - The characters `^` and `.` indicate a footnote reference
276 (`<html:a role="doc-noteref">`).
277 The characters between these sigils must match the i·d of the first
278 paragraph of some footnote in the same document.
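
As one example of the inline rules, the hyperlink form can be split on its last `<` as described above. This Python sketch uses a hypothetical helper `parse_link`, which receives only the content between `{🔗` and `>}`.

```python
def parse_link(inner):
    """Split link content into (text, url); return None if there is no `<`."""
    # rpartition splits on the LAST `<`, so the link text may itself contain `<`.
    text, sep, url = inner.rpartition("<")
    if not sep:
        return None
    # If no text is given, the URL doubles as the link text.
    return (text or url, url)
```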
279
280 Once the tree is built as above, it is remediated into its final form
281 by the following steps :⁠—
282
283 - Continuation paragraphs are joined with the preceding list items or
284 sections.
285
286 - List items of a higher level are nested in preceding list items, when
287 present.
288 List items of a level greater than 1 can also be nested in preceding
289 sections (notes, abstracts, ⁊·c…).
290
291 - Successive list items of the same level and class are joined into
292 a single list.
293
294 - Linebreaks in preformatted paragraphs are replaced with `<html:br>`.
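
The step which joins successive list items can be sketched with `itertools.groupby` in Python. This is a simplification which ignores nesting and continuation paragraphs; the function `group_list_items` and the tuple layout are hypothetical.

```python
from itertools import groupby

def group_list_items(items):
    """Join runs of adjacent (level, cls, text) items into single lists."""
    return [
        {"level": level, "class": cls, "items": [text for _, _, text in run]}
        # groupby only merges ADJACENT items with an equal key, so a change
        # of level or class starts a new list.
        for (level, cls), run in groupby(items, key=lambda it: (it[0], it[1]))
    ]
```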
295
296 Finally, any character can be escaped by instead providing its Unicode
297 codepoint in the form `{U+NNNN}`, where `NNNN` is one or more
298 hexadecimal digits.
299 Multiple codepoints may be provided separated by periods, as in
300 `{U+WWWW.ZZZZ}`.
301 Due to limitations in X·S·L·T, characters cannot be escaped in
302 attributes (including link targets).
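
Decoding these escapes is straightforward to sketch in Python. The names `ESCAPE` and `unescape` are hypothetical, and accepting lowercase hexadecimal digits is an assumption of this sketch.

```python
import re

# `{U+NNNN}` or `{U+WWWW.ZZZZ}`: hexadecimal codepoints separated by periods.
ESCAPE = re.compile(r"\{U\+([0-9A-Fa-f]+(?:\.[0-9A-Fa-f]+)*)\}")

def unescape(text):
    """Replace each codepoint escape with the characters it names."""
    return ESCAPE.sub(
        lambda m: "".join(chr(int(h, 16)) for h in m.group(1).split(".")),
        text,
    )
```

Values above `10FFFF` are not valid codepoints and make `chr` raise `ValueError`; this sketch does not handle that case.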
303
304 ## Usage
305
306 💄📝 Les·M·L is designed for usage with [⛩📰 书社][Shushe].
307 Simply include the `parser.xslt` provided by this repository to
308 ⛩📰 书社 as an additional parser, and `magic` as an additional
309 magic file.
310
311 ## License
312
313 This repository conforms to [REUSE][].
314
315 The parser is licensed under the terms of the <cite>Mozilla Public
316 License, version 2.0</cite>.
317
318 [REUSE]: <https://reuse.software/spec/>
319 [Shushe]: <https://git.ladys.computer/Shushe/>
320 [draft-phillips-record-jar-01]: <https://datatracker.ietf.org/doc/html/draft-phillips-record-jar-01>