<!--
SPDX-FileCopyrightText: 2024 Lady <https://www.ladys.computer/about/#lady>
SPDX-License-Identifier: CC0-1.0
-->
# 💄📝 Les·M·L

<b>Ladys simple markup language.</b>

💄📝 Les·M·L is a document markup language designed with two goals in
mind :⁠—

1. It must be trivial to parse, even with limited tooling such as that
   provided by X·S·L·T.

2. It must be sophisticated enough to handle longform hypertext
   documents and associated metadata.

It is implemented as an X·S·L·T transformation from a
`<html:script type="text/lesml">` element into H·T·M·L
(`parser.xslt`).

## Nomenclature

<i>Les·M·L</i> is an abbreviation of the phrase “Ladys Extremely Simple
Markup Language”.

## Markup Syntax

The first line of any 💄📝 Les·M·L document should be the string
`#!lesml`.
A language tag may follow this, beginning with `@` and terminated with
`$`, like so:
`#!lesml@en$`.
Regardless of whether a language tag is present, the shebang line may
be terminated by a space‐separated list of properties of the form
`key=value`.
Only one property is currently permitted: `profile`, whose value should
be a U·R·I and is translated to the `@data-lesml-profile` attribute
on the resulting `<html:article>` element.

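
As a rough illustration of the shebang rules above, the following Python sketch splits a shebang line into its language tag and its properties.
It is not taken from `parser.xslt`: the name `parse_shebang`, the regular expression, and the assumption that a single space precedes the property list are all hypothetical, and the real parser may treat edge cases differently.

```python
import re

# Hypothetical pattern: one reading of the prose above, not a transcription
# of parser.xslt. Assumes the property list (if any) starts with a space.
SHEBANG = re.compile(r"^#!lesml(?:@(?P<lang>[^$]*)\$)?(?P<rest>(?: .*)?)$")


def parse_shebang(line):
    """Return (language tag or None, dict of properties) for a shebang line."""
    match = SHEBANG.match(line)
    if match is None:
        raise ValueError("not a Les·M·L shebang line")
    properties = {}
    for item in match.group("rest").split():
        # Split at the first "=" only, so values may themselves contain "=".
        key, _, value = item.partition("=")
        properties[key] = value
    # Only `profile` is currently permitted; this sketch does not enforce that.
    return match.group("lang"), properties


print(parse_shebang("#!lesml@en$ profile=https://example.com/p"))
```
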
Following the shebang line, document metadata may be provided in the
[Record Jar][draft-phillips-record-jar-01] format.
The body of the document begins after the last line which begins with
the string `%%`, or after the shebang line if none exists.

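
The rule for where the body begins can be sketched as follows.
This is an illustrative reading only: `split_document` is a hypothetical name, the Record Jar metadata is returned as raw lines rather than parsed, and line endings are assumed to be `\n`.

```python
def split_document(text):
    """Split a document into (shebang line, metadata lines, body lines)."""
    lines = text.split("\n")
    shebang, rest = lines[0], lines[1:]
    # Index of the last line that begins with "%%", or -1 when there is none.
    last_marker = max(
        (index for index, line in enumerate(rest) if line.startswith("%%")),
        default=-1,
    )
    # Everything up to and including that line is metadata; the rest is body.
    return shebang, rest[:last_marker + 1], rest[last_marker + 1:]


shebang, metadata, body = split_document("#!lesml\nTitle: X\n%%\nHello\n\nWorld")
print(metadata, body)
```
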
Documents are broken into paragraphs by blank lines.
Empty paragraphs are ignored.
Non·empty paragraphs are classified as follows :⁠—

- If the paragraph consists of only the following section‐break
  characters, plus any amount of white·space, then it is
  considered to be a section break (`<html:hr>`).

  The section break characters are :⁠—

  | Character | Codepoint | Unicode Name |
  | --------- | --------- | ------------ |
  | `#` | `U+0023` | `NUMBER SIGN` |
  | `*` | `U+002A` | `ASTERISK` |
  | `-` | `U+002D` | `HYPHEN-MINUS` |
  | `.` | `U+002E` | `FULL STOP` |
  | `=` | `U+003D` | `EQUALS SIGN` |
  | `_` | `U+005F` | `LOW LINE` |
  | `~` | `U+007E` | `TILDE` |
  | `·` | `U+00B7` | `MIDDLE DOT` |
  | `․` | `U+2024` | `ONE DOT LEADER` |
  | `‥` | `U+2025` | `TWO DOT LEADER` |
  | `…` | `U+2026` | `HORIZONTAL ELLIPSIS` |
  | `⁂` | `U+2042` | `ASTERISM` |
  | `⋯` | `U+22EF` | `MIDLINE HORIZONTAL ELLIPSIS` |
  | `─` | `U+2500` | `BOX DRAWINGS LIGHT HORIZONTAL` |
  | `━` | `U+2501` | `BOX DRAWINGS HEAVY HORIZONTAL` |
  | `┄` | `U+2504` | `BOX DRAWINGS LIGHT TRIPLE DASH HORIZONTAL` |
  | `┅` | `U+2505` | `BOX DRAWINGS HEAVY TRIPLE DASH HORIZONTAL` |
  | `┈` | `U+2508` | `BOX DRAWINGS LIGHT QUADRUPLE DASH HORIZONTAL` |
  | `┉` | `U+2509` | `BOX DRAWINGS HEAVY QUADRUPLE DASH HORIZONTAL` |
  | `╌` | `U+254C` | `BOX DRAWINGS LIGHT DOUBLE DASH HORIZONTAL` |
  | `╍` | `U+254D` | `BOX DRAWINGS HEAVY DOUBLE DASH HORIZONTAL` |
  | `═` | `U+2550` | `BOX DRAWINGS DOUBLE HORIZONTAL` |
  | `╴` | `U+2574` | `BOX DRAWINGS LIGHT LEFT` |
  | `╶` | `U+2576` | `BOX DRAWINGS LIGHT RIGHT` |
  | `╸` | `U+2578` | `BOX DRAWINGS HEAVY LEFT` |
  | `╺` | `U+257A` | `BOX DRAWINGS HEAVY RIGHT` |
  | `☙` | `U+2619` | `REVERSED ROTATED FLORAL HEART BULLET` |
  | `❧` | `U+2767` | `ROTATED FLORAL HEART BULLET` |
  | `　` | `U+3000` | `IDEOGRAPHIC SPACE` |
  | `・` | `U+30FB` | `KATAKANA MIDDLE DOT` |
  | `＊` | `U+FF0A` | `FULLWIDTH ASTERISK` |
  | `－` | `U+FF0D` | `FULLWIDTH HYPHEN-MINUS` |
  | `．` | `U+FF0E` | `FULLWIDTH FULL STOP` |
  | `＝` | `U+FF1D` | `FULLWIDTH EQUALS SIGN` |
  | `＿` | `U+FF3F` | `FULLWIDTH LOW LINE` |
  | `～` | `U+FF5E` | `FULLWIDTH TILDE` |

- If every line in the paragraph begins with at least one space, then
  it is considered to be a quoted paragraph (`<html:blockquote>`).
  There is only one level of paragraph quoting; quoted paragraphs may
  not be quoted again.

- Otherwise, the paragraph is unquoted.

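
The paragraph splitting and first‐stage classification described above might be sketched like this.
The function names are hypothetical, the set of section‐break characters is copied from the codepoint column of the table, and treating a line that contains only white·space as blank is an assumption rather than something the text states.

```python
# Section-break characters, from the codepoint column of the table above.
SECTION_BREAK_CHARS = frozenset(map(chr, [
    0x23, 0x2A, 0x2D, 0x2E, 0x3D, 0x5F, 0x7E, 0xB7,
    0x2024, 0x2025, 0x2026, 0x2042, 0x22EF,
    0x2500, 0x2501, 0x2504, 0x2505, 0x2508, 0x2509, 0x254C, 0x254D, 0x2550,
    0x2574, 0x2576, 0x2578, 0x257A, 0x2619, 0x2767,
    0x3000, 0x30FB, 0xFF0A, 0xFF0D, 0xFF0E, 0xFF1D, 0xFF3F, 0xFF5E,
]))


def split_paragraphs(body):
    """Break body text into paragraphs (lists of lines) at blank lines."""
    paragraphs, current = [], []
    for line in body.split("\n"):
        if line.strip():  # assumption: white-space-only lines count as blank
            current.append(line)
        elif current:
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)
    return paragraphs  # empty paragraphs never get added, so they are ignored


def classify(paragraph):
    """Return 'section-break', 'quoted', or 'unquoted' for a list of lines."""
    text = "".join(paragraph)
    if all(ch in SECTION_BREAK_CHARS or ch.isspace() for ch in text):
        return "section-break"
    if all(line.startswith(" ") for line in paragraph):
        return "quoted"
    return "unquoted"


for paragraph in split_paragraphs("Hello\nworld\n\n* * *\n\n  quoted line\n"):
    print(classify(paragraph), paragraph)
```
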
After this classification, each quoted or unquoted paragraph is further
classified by type based on its first character (which must be
followed by white·space to be recognized) :⁠—

- If the paragraph begins with `⁌`, it is a chapter heading
  (`<html:h1>`).

- If the paragraph begins with `§`, it is a section heading
  (`<html:h2>`).

- If the paragraph begins with `❦`, it is a subsection heading
  (`<html:h3>`).

- If the paragraph begins with `✠`, it is a subsubsection heading
  (`<html:h4>`).

- If the paragraph begins with `•` or `🔢`, it is a primary unordered
  or ordered list item (`<html:li class="unordered" data-level="1">`
  or `<html:li class="ordered" data-level="1">`).

- If the paragraph begins with `◦` or `🔠`, it is a secondary unordered
  or ordered list item (`<html:li class="unordered" data-level="2">`
  or `<html:li class="ordered" data-level="2">`).
  Secondary list items are considered to be nested inside of primary
  list items which precede them.

- If the paragraph begins with `▪` or `🔡`, it is a tertiary unordered
  or ordered list item (`<html:li class="unordered" data-level="3">`
  or `<html:li class="ordered" data-level="3">`).
  Tertiary list items are considered to be nested inside of primary
  and secondary list items which precede them.

- If the paragraph begins with `⁃` or `🔣`, it is a quaternary
  unordered or ordered list item
  (`<html:li class="unordered" data-level="4">` or
  `<html:li class="ordered" data-level="4">`).
  Quaternary list items are considered to be nested inside of primary,
  secondary, and tertiary list items which precede them.

- If the paragraph begins with `※`, it is an ordinary note
  (`<html:div role="note" class="note">`).

- If the paragraph begins with `☡`, it is a cautionary note
  (`<html:div role="note" class="caution">`).

- If the paragraph begins with `🛈`, it is an informative note
  (`<html:div role="note" class="info">`).

- If the paragraph begins with `⯑`, it is a questioning note
  (`<html:div role="note" class="query">`).

- If the paragraph begins with `⚠︎`, it is a warning note
  (`<html:div role="note" class="warn">`).

- If the paragraph begins with `⋯`, it is a continuation paragraph
  (`<html:div class="continuation">`).
  Continuation paragraphs may be used to continue a preceding list item
  or quote.
  Note, however, that an unquoted paragraph cannot continue a quoted
  one, or vice·versa.

- Otherwise, it is an ordinary paragraph.

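
One way to picture the sigil lookup is as a table keyed on the leading sigil.
The sketch below is illustrative only: `SIGILS` and `classify_sigil` are hypothetical names, the tuples are just a convenient encoding of the list above, and the warning sigil is assumed to be U+26A0 followed by the text variation selector U+FE0E, as it appears in this document.

```python
import re

# Sigil -> (kind, detail), transcribed from the list above.
SIGILS = {
    "⁌": ("heading", 1),
    "§": ("heading", 2),
    "❦": ("heading", 3),
    "✠": ("heading", 4),
    "•": ("unordered", 1), "🔢": ("ordered", 1),
    "◦": ("unordered", 2), "🔠": ("ordered", 2),
    "▪": ("unordered", 3), "🔡": ("ordered", 3),
    "⁃": ("unordered", 4), "🔣": ("ordered", 4),
    "※": ("note", "note"),
    "☡": ("note", "caution"),
    "🛈": ("note", "info"),
    "⯑": ("note", "query"),
    "\u26A0\uFE0E": ("note", "warn"),  # assumed: warning sign + U+FE0E
    "⋯": ("continuation", None),
}


def classify_sigil(text):
    """Return (kind, detail, remainder) for the text of one paragraph."""
    # A sigil only counts when it is followed by white-space.
    match = re.match(r"\s*(\S+)\s+(.*)", text, re.S)
    if match and match.group(1) in SIGILS:
        kind, detail = SIGILS[match.group(1)]
        return kind, detail, match.group(2)
    return "paragraph", None, text


print(classify_sigil("§ A section heading"))
```
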
Following this sigil (if any, including trailing white·space) there may
be a `¶` followed by zero or more non·white·space characters.
The characters following the `¶` give the identifier for the paragraph,
which is expected to be unique within a document.

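
A small sketch of extracting the identifier, assuming the sigil and its trailing white·space have already been removed.
The name `split_identifier` is hypothetical, and discarding the white·space after the identifier is an assumption.

```python
import re


def split_identifier(remainder):
    """Return (identifier or None, contents) for sigil-stripped text."""
    match = re.match(r"¶(\S*)\s*(.*)", remainder, re.S)
    if match is None:
        return None, remainder
    # group(1) may be empty: "zero or more" characters are allowed.
    return match.group(1), match.group(2)


print(split_identifier("¶intro Some text"))
```
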
The remaining characters in a paragraph form its contents.
Markup within paragraphs is delimited with·out exception by pairs of
characters, with the following precedence :⁠—

- The characters `{🔗` and `>}` indicate a hyperlink to a U·R·L
  (`<html:a>`).
  The hyperlink must contain at least one `<`; the content before the
  last `<` gives the text of the link, and the content after gives
  the U·R·L that the link points to.
  If no text is given, the U·R·L will be used instead.

- The characters `⸠` and `⸡` indicate a strikethru (`<html:s>`).

- The characters `⸤` and `⸥` indicate underlining (`<html:u>`).

- The characters `⟦` and `⟧` indicate an inline note
  (`<html:small role="note">`).

- The characters `⸨` and `⸩` indicate parenthetical content
  (`<html:small>`).

- The characters `☞︎` and `☜︎` indicate strong importance
  (`<html:strong>`).

- The characters `⹐` and `⹑` indicate emphasis (`<html:em>`).

- The characters `⟪` and `⟫` indicate titles (`<html:cite>`).

- The characters `⟨` and `⟩` indicate offset text (`<html:i>`).
  This may be followed by a `@`, a language tag, and a `$` to provide
  the language of the text.

- The characters `⦃` and `⦄` indicate keyword highlighting
  (`<html:b>`).

- The characters `` ` `` and `´` indicate code (`<html:code>`).

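
The hyperlink rule (text before the last `<`, U·R·L after it, U·R·L reused when no text is given) can be sketched as below.
The name `find_links` and the non‐greedy regular expression are assumptions for illustration; the sketch only extracts links and does not handle the other inline markup or nesting.

```python
import re

# Hypothetical pattern: the shortest span from "{🔗" to the next ">}".
LINK = re.compile(r"\{🔗(.*?)>\}", re.S)


def find_links(contents):
    """Return a list of (text, url) for each hyperlink span in contents."""
    links = []
    for match in LINK.finditer(contents):
        # Split at the LAST "<": text comes before it, the URL after it.
        text, separator, url = match.group(1).rpartition("<")
        if separator:  # a hyperlink must contain at least one "<"
            links.append((text or url, url))
    return links


print(find_links("See {🔗the spec<https://example.com/a>} for details."))
```
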
Once the tree is built as above, it is remediated into its final form
by the following steps :⁠—

- Successive quoted paragraphs are joined into one quote.
  If the final quoted paragraph is an ordinary paragraph which begins
  with `—` and a space, the quote is wrapped in a `<html:figure>`
  and the final paragraph becomes its `<html:figcaption>`.

- Continuation paragraphs are joined with the preceding list items or
  quotes.

- List items of a higher level are nested in preceding list items of a
  lower level, when present.

- Successive list items of the same level and class are joined into
  a single list.

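
The last step, joining successive list items, might look like the following sketch.
It covers only that one step and assumes nesting has already been resolved, so that the input is a flat run of sibling items; the tuple layout `(list_class, level, text)` and the name `group_list_items` are hypothetical.

```python
from itertools import groupby


def group_list_items(items):
    """Join successive (list_class, level, text) items sharing class and level."""
    return [
        (list_class, level, [text for _, _, text in run])
        for (list_class, level), run in groupby(items, key=lambda item: item[:2])
    ]


items = [
    ("unordered", 1, "a"),
    ("unordered", 1, "b"),
    ("ordered", 1, "c"),
    ("unordered", 1, "d"),
]
print(group_list_items(items))
```

Because `groupby` only merges adjacent items, the final unordered item starts a new list rather than rejoining the first one.
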
Finally, any character can be escaped by instead providing its Unicode
codepoint in the form `<U+NNNN>`, where `NNNN` is one or more
hexadecimal digits.
Multiple codepoints may be provided separated by periods, as in
`<U+WWWW.ZZZZ>`.

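
Decoding such escapes could be sketched as follows.
The name `decode_escapes` is hypothetical, and accepting lower‐case hexadecimal digits is an assumption; the text does not say whether `parser.xslt` does.

```python
import re

# "<U+" then one or more hex codepoints separated by periods, then ">".
ESCAPE = re.compile(r"<U\+([0-9A-Fa-f]+(?:\.[0-9A-Fa-f]+)*)>")


def decode_escapes(text):
    """Replace each <U+NNNN> or <U+WWWW.ZZZZ> escape with its characters."""
    def replace(match):
        codes = match.group(1).split(".")
        # chr() raises ValueError for values beyond U+10FFFF.
        return "".join(chr(int(code, 16)) for code in codes)
    return ESCAPE.sub(replace, text)


print(decode_escapes("Section sign: <U+00A7>"))
```
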
## Usage

💄📝 Les·M·L is designed for usage with [⛩📰 书社][Shushe].
Simply include the `parser.xslt` provided by this repository to
⛩📰 书社 as an additional parser, and `magic` as an additional
magic file.

## License

This repository conforms to [REUSE][].

The parser is licensed under the terms of the <cite>Mozilla Public
License, version 2.0</cite>.

[REUSE]: <https://reuse.software/spec/>
[Shushe]: <https://git.ladys.computer/Shushe/>
[draft-phillips-record-jar-01]: <https://datatracker.ietf.org/doc/html/draft-phillips-record-jar-01>