Support parsed metadata

diff --git a/README.markdown b/README.markdown

index 37cc7f7cd7ff8b5f00284bde7c735ce46629bdb6..0b4c4b4ef135c7375368a9f78e859964998e4cf2 100644
--- a/README.markdown
+++ b/README.markdown
@@ -1,5 +1,5 @@
  <!--
-SPDX-FileCopyrightText: 2024 Lady <https://www.ladys.computer/about/#lady>
+SPDX-FileCopyrightText: 2024, 2025 Lady <https://www.ladys.computer/about/#lady>
  SPDX-License-Identifier: CC0-1.0
  -->
  # ⛩📰 书社
@@ -188,6 +188,7 @@ This document uses a few name·space prefixes, with the following
  |    `exsl:` | `http://exslt.org/common`                     |
  | `exslstr:` | `http://exslt.org/strings`                    |
  |    `html:` | `http://www.w3.org/1999/xhtml`                |
+|     `rdf:` | `http://www.w3.org/1999/02/22-rdf-syntax-ns#` |
  |     `svg:` | `http://www.w3.org/2000/svg`                  |
  |   `xlink:` | `http://www.w3.org/1999/xlink`                |
  |    `xslt:` | `http://www.w3.org/1999/XSL/Transform`        |
@@ -244,6 +245,14 @@ The following additional variables can be used to control the behaviour
    Multiple include directories can be provided, so long as the same
      file subpath doesn’t exist in more than one of them.
  
+- **`DATADIR`:**
+  If set to the location of a directory, ⛩📰 书社 will run a two‐stage build.
+  In the first stage, only files in `SRCDIR` which match `FINDDATARULES` (see below) will be built, with files in `DATADIR` serving as includes.
+  In the second stage, the remaining files in `SRCDIR` will be built, with the files built during the first stage, in addition to any files in `INCLUDEDIR`, serving as includes.
+  Files built during the first stage are copied into `DESTDIR` alongside those from the second stage when installing.
+
+  This functionality is intended for sites where the bulk of the site can be built from a few data files which are expensive to create.
+
  - **`BUILDDIR`:**
    The location of the (temporary) build directory (default: `build`).
    `make clean` will delete this, and it is recommended that it not be
@@ -299,6 +308,50 @@ The following additional variables can be used to control the behaviour
      default, to enable additional rules without overriding the existing
      ones.
  
+- **`DATAOPTS`:**
+  Additional options to use when calling Make during the first stage of a two‐stage build using `DATADIR`.
+
+  This can be used to override variables which are only applicable during the second stage.
+  Note that when supplying this variable on the shell, it will need to be double‐quoted.
+
+- **`DATAEXT`:**
+  A list of file extensions which signify “data” files during a two‐stage build using `DATADIR`.
+
+- **`FINDDATARULES`:**
+  Rules to use with `find` when searching for data files.
+  By default, these rules are derived from `DATAEXT`.
+
+- **`EXTRAFINDDATARULES`:**
+  The value of this variable is appended to `FINDDATARULES` by
+    default, to enable additional rules without overriding the existing
+    ones.
+
+- **`FINDFILTERONLY`:**
+  A semicolon‐separated list of regular expressions; unless the list is empty, the path of each source or include must match at least one of them (default: empty).
+
+- **`FINDFILTEROUT`:**
+  A semicolon‐separated list of regular expressions, each of which matches paths that should _not_ be considered sources or includes (default: empty).
+
+- **`FINDINCLUDEFILTERONLY`:**
+  A semicolon‐separated list of regular expressions; unless the list is empty, the path of each include must match at least one of them (default: empty).
+
+  Note that only paths which already match `FINDFILTERONLY` are considered.
+
+- **`FINDINCLUDEFILTEROUT`:**
+  A semicolon‐separated list of regular expressions, each of which matches paths that should _not_ be considered includes, but may still be considered sources (default: empty).
+
+- **`FINDFILTERONLYEXTENDED`:**
+  If non·empty, `FINDFILTERONLY` is an extended regular expression; otherwise, it is basic (default: empty).
+
+- **`FINDFILTEROUTEXTENDED`:**
+  If non·empty, `FINDFILTEROUT` is an extended regular expression; otherwise, it is basic (default: matches `FINDFILTERONLYEXTENDED`).
+
+- **`FINDINCLUDEFILTERONLYEXTENDED`:**
+  If non·empty, `FINDINCLUDEFILTERONLY` is an extended regular expression; otherwise, it is basic (default: matches `FINDFILTERONLYEXTENDED`).
+
+- **`FINDINCLUDEFILTEROUTEXTENDED`:**
+  If non·empty, `FINDINCLUDEFILTEROUT` is an extended regular expression; otherwise, it is basic (default: `1` if either `FINDFILTEROUTEXTENDED` or `FINDINCLUDEFILTERONLYEXTENDED` is non·empty).
+
  - **`PARSERS`:**
    A white·space‐separated list of parsers to use (default:
      `$(THISDIR)/parsers/*.xslt`).
@@ -307,6 +360,15 @@ The following additional variables can be used to control the behaviour
    The value of this variable is appended to `PARSERS` by default, to
      enable additional parsers without overriding the existing ones.
  
+- **`PARSERLIBS`:**
+  A white·space‐separated list of parser dependencies (default:
+    `$(THISDIR)/lib/split.xslt`).
+
+- **`EXTRAPARSERLIBS`:**
+  The value of this variable is appended to `PARSERLIBS` by default, to
+    enable additional parser dependencies without overriding the
+    existing ones.
+
  - **`TRANSFORMS`:**
    A white·space‐separated list of transforms to use (default:
      `$(THISDIR)/transforms/*.xslt`).
@@ -315,6 +377,15 @@ The following additional variables can be used to control the behaviour
    The value of this variable is appended to `TRANSFORMS` by default, to
      enable additional transforms without overriding the existing ones.
  
+- **`TRANSFORMLIBS`:**
+  A white·space‐separated list of transform dependencies (default:
+    `$(THISDIR)/lib/serialize.xslt`).
+
+- **`EXTRATRANSFORMLIBS`:**
+  The value of this variable is appended to `TRANSFORMLIBS` by default,
+    to enable additional transform dependencies without overriding the
+    existing ones.
+
  - **`XMLTYPES`:**
    A white·space‐separated list of media types or media type suffixes to
      consider X·M·L (default: `application/xml text/xml +xml`).
@@ -477,6 +548,18 @@ These include :⁠—
  - A `@书社:media-type` attribute, giving the identified media type of
      the plaintext node.
  
+### Parsed metadata
+
+It is possible to extract metadata from a document at the same time as
+  it is being parsed.
+This is done by creating result elements in the `书社:about` mode;
+  these should be R·D·F property elements which apply to the conceptual
+  entity that is the document being parsed.
+
+During transformation, metadata for the file with identifier `$FILE`
+  can be read from the children of
+  `$书社:about//*[@rdf:about=$FILE]/nie:interpretedAs/*`.
+
  ## Output Redirection
  
  By default, ⛩📰 书社 installs files to the same location in `DESTDIR`
@@ -547,6 +630,8 @@ Transforms are used to convert X·M·L files into their final output,
      media types into the appropriate H·T·M·L elements, and deletes
      `<html:style>` elements from the body of the document and moves
      them to the head.
+  This conversion happens during the finalization phase, after the main
+    transformation.
  
  - **`transforms/metadata.xslt`:**
    Provides basic `<html:head>` metadata.
@@ -564,7 +649,7 @@ Transforms are used to convert X·M·L files into their final output,
  - **`transforms/serialization.xslt`:**
    Replaces `<书社:serialize-xml>` elements with the (escaped)
      serialized X·M·L of their contents.
-  This replacement happens during the application phase, after most
+  This replacement happens during the finalization phase, after most
      other transformations have taken place.
  
    If a `@with-namespaces` attribute is provided, any name·space nodes
@@ -618,6 +703,25 @@ The following params are made available globally in parsers and
  - **`THISREV`:**
    The value of the `THISREV` variable (if present).
  
+In transforms, the following params are additionally available :⁠—
+
+- **`书社:about`:**
+  R·D·F metadata about all of the documents ⛩📰 书社 knows about.
+  Use `$书社:about//*[@rdf:about=$IDENTIFIER]` to get the metadata for
+    the current document.
+
+- **`书社:source`:**
+  The parsed source document being transformed, prior to any expansion.
+
+- **`书社:expansion`:**
+  The document after all embeds have been expanded.
+  Unavailable during the `书社:expand` stage.
+
+- **`书社:result`:**
+  The document after the main set of transformations has been applied.
+  Only available during the `书社:finalize` stage, where it is used to
+    apply output wrapping and other clean·up.
+
  ## Output Wrapping
  
  Provided at least one toplevel result element belongs to the H·T·M·L
@@ -690,8 +794,8 @@ It is especially useful in combination with output wrapping.
  
  In both cases, attributes from various sources are combined with
    white·space between them.
-Attribute application takes place after all ordinary transforms have
-  completed.
+Attribute application takes place after each stage of the
+  transformation, including after the initial embedding phase.
  
  Both elements ignore attributes in the `xml:` name·space, except for
    `@xml:lang`, which ignores all but the first definition (including