A Plain Text Mark-Up-Processing Chain

mh@tin-pot.net
2015-10-22

1 Overview

This text presents a concept of how CommonMark syntax could be extended by other, similarly “human-oriented”, “plain text” mark-up notations like

The Z Notation “e-mail mark-up” (plain text mark-up for formal specifications), or
ASCIImath (plain text mark-up for formulas), or
Mermaid (plain text mark-up for graphs),

and how this “extended” syntax could be used not only in processing plain text files (like in the “conventional” use case for CommonMark) but also how the “extended” CommonMark syntax could be used

in a generic XML/SGML authoring/validating/transforming environment

like for authoring and processing of DocBook documents, for example.

[Note 1: The Z Notation is one of the more popular formal specification languages: it is based on ordinary mathematics (Zermelo-Fraenkel set theory, probably the reason for the name “Z”) and is formally defined in ISO/IEC 13568:2002. This standard also specifies the “human-oriented” plain text mark-up mentioned here, intended primarily for e-mail and similar circumstances. The Z Notation standard is publicly available at no cost (as a zipped PDF file), and the Corrigendum 1 is too—but the standard is most certainly not an easy-to-read “introduction” to the Z Notation!]

[Note 2: The designation “XML/SGML”, or even “XML/XHTML/SGML/HTML”, emphasizes that there are differences between XML and SGML. But luckily it turns out that these differences are mostly irrelevant for the concept presented here: the same concept and all of the same mark-up processor implementations can be used to process XML as well as SGML documents.]

1.1 Design questions and constraints

Given the requirement (obvious, in my opinion) that CommonMark syntax and whatever other syntaxes are to be made available will each be processed by a specific, separate processor (each one a separate process, where the next one is processing the output of the previous one), the obvious design questions and decisions are:

How do we interface the processing of “plain text” mark-up (performed by mark-up processors like cmark etc) with the rest of the authoring environment: both at the input and the output end of our “plain text” mark-up process?
How do we combine the processors for the various “plain text” mark-up syntaxes so that each one of them can do its job, without interfering with the other syntaxes and their processors?

Some goals for and important qualities of an implementation (comprising the modification of existing tools and/or the implementation of new ones) are obvious:

We want—of course—to spend as little effort as possible to implement the whole thing (albeit there is no money to waste nor gain here anyway …), and
we want to introduce only minimal modifications which are really required into existing mark-up processors, without disrupting their current behaviour.
The implementation has to be general: agnostic about the number or nature of syntaxes and processors employed, which is a pre-requisite for
another important design goal: adaptability. It is obviously important to strive for an implementation that permits the easy incorporation of additional “foreign” syntaxes and the pertaining mark-up processors, with as little change to the existing tools as possible, or the replacement of a specific mark-up processor with a new one for the same mark-up syntax, and so on.
Of course the implementation has to be platform-independent if possible at all (ideally using only features of Standard C or Standard C++), or at least be easy to port (say by providing a thin abstraction layer above platform-dependent APIs).
One “soft” goal, but one important to me, is that whatever “syntax extensions” of the current CommonMark specification are introduced should have a “natural” feel and ought to fit in without accidental conflicts (with existing mark-up, or with existing styles of authoring CommonMark documents). The best way to “extend” the CommonMark specification would of course be no extension at all, but to instead use the existing notations where the specification only mentions “implementation-dependent” behaviour.

The plain text mark-up-processing chain, or “mark-up chain” concept is my answer to these (and other) requirements and design questions and constraints.

1.2 The purpose and use of a “mark-up chain”

A key idea here and one major goal is that an author would write inside his XML/SGML document plain text in (a mixture of) “human-oriented” mark-up like CommonMark as the unparsed character content of element (instances) of some specific XML/SGML element type, let’s call them “container elements”. All other content, structure, and (XML/SGML) mark-up of the document should need no change whatsoever.

The job of the “mark-up chain”, which is really no more than a bunch of “filters” (ie processes transforming input text to output text) chained together in a pipeline for the processing of mark-up, is then

to extract these plain text fragments from the container elements inside the XML/SGML document,
process the “plain text” mark-up according to CommonMark (and other) rules,
replace these container elements with the XML/SGML fragments (sequences of element instances) resulting from the processing.

The output of this “mark-up chain” would then be again an XML/SGML document: without any embedded “marked-up plain text”, but with

the elements generated by transforming these plain text fragments
together with the unchanged XML/SGML content from the input document around them.

As we will see, the same process implemented in the “mark-up chain” can be just as well used to generate XML/SGML/HTML output from “conventional” plain text input files—the same transformation that cmark and other processors accomplish already.

But with an important twist: we will see that the proposed “mark-up chain” provides a robust, flexible and extendable way to augment CommonMark syntax with “foreign syntaxes” processed by other tools, and this “extended syntax” can be used when processing “conventional” plain text input files just as well as in the case of XML/SGML input documents.

2 The XML/SGML environment

In a usage scenario for the proposed “mark-up chain” an author would write his structured document (or his typescript)

either directly into an XML/SGML document (probably using an XML/SGML editing/validating tool like Oxygen XML),
or would prefer to use CommonMark notation as far as possible, and to write his original typescript as a simple plain text file in CommonMark syntax, interspersed (in some way) with text in other mark-up syntaxes like the ones mentioned earlier.

The input containing the marked-up “plain text” in this scenario would in the first alternative not consist of an entire plain text file, but the marked-up plain text would rather be encapsulated in XML/SGML element instances, embedded into a complete XML/SGML document.

To treat both cases the same, the plain text file’s content in the second case could be embedded into the root element of an XML/SGML document and then presented to the mark-up tool chain.

Similarly, the output at the end of processing the “human-oriented” plain text would not necessarily always be an (X)HTML document, created entirely by the tool chain, but could basically be any XML/SGML document of a type relevant for the whole authoring process at hand (think of DocBook for example). And some or most of the content of this document is not generated by the processors in our mark-up tool chain, but created in some other authoring tool, presented as the complete input XML/SGML document to our tool chain, and passed through all mark-up processors and into the final output XML document of our tool chain.

3 Input to tools in the chain

So both the “initial” input to and the “final” output from our tool chain are XML or SGML documents, passed in from and out to the rest of the authoring environment.

If the author prefers to prepare his typescript as a simple text file, we can trivially wrap this plain text file’s content inside an element designated to “transport” such marked-up plain text, and proceed using this as our “initial” input XML document, which would look like this:

<?xml version="1.0" encoding="UTF-8"?>
<mark-up mode="vertical" notation="CommonMark">
<![CDATA[## Your _Markdown_ Input ##
The _Markdown_ input appears <em>unscathed</em> inside the
<mark-up> element here!]]></mark-up>

Otherwise, we would receive an XML/SGML document already containing such <mark-up> elements where the plain text to process can be found inside.

Every mark-up processor (for CommonMark, for Z Notation, etc) used in the chain would thus process only these <mark-up> element instances and replace them with the result from transforming their (plain text) input into XML/SGML elements (a single one, or a sequence of elements representing, for example, multiple paragraphs).

Everything outside such <mark-up> elements is just passed through by each mark-up processor.
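
The wrapping step described at the start of this section can be sketched in a few lines. This is only an illustration in Python, not a proposed implementation: the function name wrap_plain_text is made up here, and splitting a literal “]]>” across two CDATA sections is my assumption about how such input should be protected.

```python
from xml.sax.saxutils import quoteattr

def wrap_plain_text(text, notation="CommonMark", mode="vertical"):
    """Wrap plain text in a <mark-up> container element (hypothetical helper)."""
    # "]]>" may not occur inside a CDATA section, so split it across two sections.
    safe = text.replace("]]>", "]]]]><![CDATA[>")
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f"<mark-up mode={quoteattr(mode)} notation={quoteattr(notation)}>\n"
        f"<![CDATA[{safe}]]></mark-up>\n"
    )
```

Parsing the result with any XML parser gives back the original text as the character content of the <mark-up> element, preceded by the single line break that follows the start tag.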

4 Output from tools in the chain

We already have seen that each mark-up processor outputs a well-formed XML document and passes it to the input of the next processor in the chain.

So at the end of a chain like this

cat input.xml | cmark | zml | asciimath > output.xml

we have an output XML document where (hopefully) all the <mark-up> elements are gone and have been replaced by the result of interpreting their plain text content, and this interpretation result itself is represented simply as a sequence of the appropriate XML/SGML elements, specific to the mark-up processor rsp. the syntax it transforms.

Depending on the document type used in the authoring environment, another step may be required to transform these output fragments into the appropriate elements of the target document type: for example, the current CommonMark DTD uses the element type name <paragraph> for ordinary paragraphs of text, but HTML uses <P>, while DocBook uses <Para>. (A HTML or XML DOCTYPE declaration may also need to be inserted into the output document, and similar “final touches”.)

Furthermore, the content model for the CommonMark <paragraph> element type encompasses only other elements (notably the <text> element), while HTML’s <P> as well as DocBook’s <Para> elements have #PCDATA in their content models: the character content appears immediately inside these paragraph elements, while in CommonMark it is nested inside <text> (among other) elements.

To illustrate the differences, here are the respective parameter entity declarations and element declarations for the “just a simple paragraph of text” element type in the CommonMark DTD, in the ISO/IEC 15445:2000 HTML DTD, and in the DocBook 3.1 DTD:

In the CommonMark DTD:

<!ENTITY % inline
'text|softbreak|linebreak|code|inline_html|emph|strong|link|image'>
<!ELEMENT paragraph (%inline;)*>

In the ISO HTML DTD:

<!-- Character level elements and text strings -->
<!ENTITY % text "#PCDATA | %physical.styles; | %logical.styles;
                                             | %special;
                                             | %form.fields;" >
<!ELEMENT P - O (%text;)+ >

In the DocBook 3.1 DTD:

<!ENTITY % para.char.mix "#PCDATA
                          |%xref.char.class;  |%gen.char.class;
                          |%link.char.class;  |%tech.char.class;
                          |%base.char.class;  |%docinfo.char.class;
                          |%other.char.class; |%inlineobj.char.class;
                          |%synop.class;
                          |%ndxterm.class;
                          %local.para.char.mix;">
<!ELEMENT Para - O ((%para.char.mix; | %para.mix;)+)>

Transformations of this kind are best done using an XSLT step, and are not really in the scope of our “mark-up” chain. So we’ll mostly ignore them from now on.
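
To make the difference between the content models concrete, here is a minimal sketch in Python (rather than XSLT) of the two operations involved: renaming elements and hoisting the content of <text> elements into their parent. The rename table and the function names are assumptions for illustration only, and the sketch ignores the XML namespace that a real CommonMark XML document may carry.

```python
import xml.etree.ElementTree as ET

RENAME = {"paragraph": "P", "emph": "EM"}   # hypothetical target names

def unwrap(parent, tag):
    """Remove all <tag> elements below parent, keeping their character content."""
    prev = None
    for child in list(parent):
        if child.tag == tag:
            chunk = (child.text or "") + (child.tail or "")
            if prev is None:
                parent.text = (parent.text or "") + chunk
            else:
                prev.tail = (prev.tail or "") + chunk
            parent.remove(child)
        else:
            unwrap(child, tag)
            prev = child

def to_target_doctype(root):
    """Hoist <text> content into the parent, then rename the elements."""
    unwrap(root, "text")
    for element in root.iter():
        element.tag = RENAME.get(element.tag, element.tag)
    return root
```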

5 The mark-up processors in the chain

An instance of this kind of “mark-up chain” could be a pipeline composed of text filter processes, each processing only one kind of “human-oriented” plain text mark-up:

cat authored.xml | markdown | znotation | ... > final.xml

where markdown and znotation stand for some kind of processor for CommonMark rsp Z Notation syntax—we’ll come to the details about these processors later.

While this is a slightly simplified representation, a major goal of the “mark-up chain” concept is in fact to meet the requirement that all mark-up processors are filters (ie separate processes) which can be composed as needed into a pipeline of this kind.

We already have seen two more elements of the chain:

The final.xml document could be the output not of a mark-up processor, but an XSLT step (where xsl here stands for the XSLT processor of choice, together with the required options and XSLT files):
```
cat authored.xml | cmark | ... | xsl > final.xml
```
Processing an initial plain text file needs a wrapping step to place the plain text inside an XML/SGML document. This wrapping can trivially be done by a tiny tool, let’s call it txtin:
```
cat plain.txt | txtin | cmark | ... | xsl > final.xml
```

5.1 Avoiding XML/SGML parsing

We assumed so far that all tools in the chain are connected through text streams, and pass XML/SGML documents from one step in the pipeline to the next.

But this would imply that each of these tools ought to be capable of parsing and processing whole XML/SGML documents; and even if the required parser could be shared among the tools (in a DLL rsp shared library), I find the idea that each tool drags around a parser like this a horrible idea indeed.

We would be much better off if the document content could be passed along in

a format which is simple to parse and generate, but it obviously needs to represent faithfully the whole “intended” content (disregarding differences in XML/SGML mark-up) of the XML/SGML document which we feed into the pipeline in the first place, notably those parts of the content which are neither recognized nor used in the mark-up chain but simply passed through.
And what is the “intended” content of XML/SGML documents anyway— disregarding the format of representation and mark-up?

Luckily, there are already answers to both questions:

The “intended” content of an SGML document is the so-called “Element Structure Information Set” (ESIS, ISO 8879), and that of an XML document is called the “XML Information Set”. Since XML is kind-of a specialization of SGML, we can treat both concepts the same (or at least assume that we can in the following).
A nice representation of this “document content”, or ESIS, as an easy-to-parse text format does already exist too: it is the native output format of the widely used XML/SGML parsers SP (nsgmls) and OpenSP (onsgmls), both going back to code written and published by James Clark. The format description fits on a single page, and we need only a fragment of it.

If we settle for this format, each tool can quite easily filter out the plain text of its concern by reading the input stream in this format line by line, watching out for column one in each line, and following these rules:

lines that start with neither A nor ( are simply copied to the output,
for lines starting with A (attribute definition), remember the attribute value they specify,
for lines starting with ( (element start tag), check if this is a container element containing plain text in the right syntax:
- If not so, copy the remembered A lines and this ( line to the output, relax and go on reading more input;
- If so, start using the character content that follows until the corresponding ) line (element end tag) to this ( occurs, process the character content into the output, and after that: relax and go on reading more input.
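
The rules above can be sketched as a small filter loop. This is a Python illustration under several assumptions of mine: the container element is named mark-up and selected by its notation attribute (as proposed elsewhere in this text), its content is pure character data, input lines arrive without their line terminator, and process is a hypothetical callback that turns the collected raw character data into new ESIS lines.

```python
def filter_esis(lines, notation, process):
    """Copy ESIS lines through, replacing matching <mark-up> containers."""
    out, attrs = [], []
    it = iter(lines)
    for line in it:
        if line.startswith("A"):
            attrs.append(line)                 # remember until the start tag
        elif line.startswith("("):
            wanted = (line[1:] == "mark-up"
                      and f"Anotation CDATA {notation}" in attrs)
            if not wanted:
                out.extend(attrs + [line])     # not ours: pass it through
                attrs = []
                continue
            attrs = []
            text = []
            for inner in it:                   # collect up to the end tag
                if inner.startswith(")"):
                    break
                if inner.startswith("-"):
                    text.append(inner[1:])
            out.extend(process("".join(text)))
        else:
            out.append(line)                   # everything else is copied
    return out
```

Note that the character data handed to process is still in its escaped form; decoding is a separate step.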

The format and meaning of lines of concern is determined by a flag character in column 1:

“A”: An attribute value in the form
```
Aattr CDATA value
```
There are other forms, but the only other form occurring in the output from parsing an XML file without a DTD using nsgmls is
```
Aattr IMPLIED value
```
which can be ignored and passed on.
“(”: An element start tag, the element name (GI) follows, starting in column 2.
“-”: Character content, which follows, starting in column 2, encoded in UTF-8. The only “escape-encoding” used for it is similar to the convention followed eg in C for string literals:
- “\n” represents a record end, ie the end of a line.
- “\\” represents a backslash (U+005C: REVERSE SOLIDUS).
- “\ddd”, where ddd are three octal digits, represents the character with the given number (up to the number 511, corresponding to U+01FF LATIN SMALL LETTER O WITH STROKE AND ACUTE, but usually it is just a control character in the range U+0000 .. U+001F). The character \012 (U+000A: LINE FEED, the SGML record start) usually follows \n immediately, and can be ignored.
- “\#n;” where n is a decimal number, represents the character with the given number, in our case: Unicode code point. (This should never be needed nor occur in UTF-8 encoded text.)
“)”: An element end tag, again the element name (GI) follows, starting in column 2. When reading well-formed documents with balanced start and end tags, the name can be ignored.
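
Decoding the character content of a “-” line is mostly a matter of one regular expression. The following Python sketch assumes the escape conventions exactly as listed above, and it drops every \012, not only one that directly follows \n; the function name decode_data is made up for this illustration.

```python
import re

_ESCAPE = re.compile(r"\\(\\|n|[0-7]{3}|#[0-9]+;)")

def decode_data(data):
    """Decode the character data of a '-' line (without the leading '-')."""
    def replace(match):
        esc = match.group(1)
        if esc == "\\":
            return "\\"
        if esc == "n":
            return "\n"
        if esc.startswith("#"):
            return chr(int(esc[1:-1]))             # \#n; with decimal n
        code = int(esc, 8)                         # \ddd with octal ddd
        return "" if code == 0o12 else chr(code)   # the \012 is ignored
    return _ESCAPE.sub(replace, data)
```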

So generating output containing new elements (as a result of processing the plain text mark-up) is equally trivial:

First output the attributes, line by line, using
```
Aattr CDATA value
```
Output the start tag:
```
(name
```
Output the character content (if any):
```
-text including\n\012to represent line breaks
```
and enclosed other elements (if any).
Output the end tag:
```
)name
```
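
Put together, the output steps above amount to very little code. Here is a Python sketch, assuming that attribute values use the same escapes as character content; emit_element and encode_data are names invented for this example.

```python
def encode_data(text):
    """Escape character data for a '-' or 'A' line."""
    out = []
    for ch in text:
        if ch == "\\":
            out.append("\\\\")
        elif ch == "\n":
            out.append("\\n\\012")            # line break as \n\012
        elif ord(ch) < 0x20:
            out.append("\\%03o" % ord(ch))    # other control characters
        else:
            out.append(ch)
    return "".join(out)

def emit_element(name, attrs, text=""):
    """Return the ESIS lines for one element with optional character content."""
    lines = [f"A{key} CDATA {encode_data(value)}" for key, value in attrs.items()]
    lines.append("(" + name)
    if text:
        lines.append("-" + encode_data(text))
    lines.append(")" + name)
    return lines
```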

The use of this format has one additional advantage: we can start the mark-up chain with any XML/SGML document by parsing it first and feeding the parser output into the following tools.

Our txtin tool would produce output in this format, and it is easy to see that one consequence of this is that this little tool would be even more trivial to implement.
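
How little txtin has to do can be seen in this Python sketch. It assumes the <mark-up> container element and its mode and notation attributes as proposed in this text, and it only escapes backslashes and line breaks; other control characters are passed through unchanged for brevity.

```python
def txtin(text, notation="CommonMark"):
    """Wrap a whole plain text file into one <mark-up> element in ESIS form."""
    data = text.replace("\\", "\\\\").replace("\n", "\\n\\012")
    lines = [
        "Amode CDATA vertical",
        f"Anotation CDATA {notation}",
        "(mark-up",
        "-" + data,
        ")mark-up",
    ]
    return "\n".join(lines) + "\n"
```

A command-line tool would simply apply this function to everything read from standard input and write the result to standard output.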

To summarize our possible inputs into the mark-up chain discussed so far:

SGML with a DTD: gets parsed and validated;
XML with a DTD: gets parsed and validated;
XML without a DTD: is assumed to be well-formed and gets parsed;
TXT (plain text in UTF-8): gets wrapped into ESIS by txtin.

While a full-blown parser like nsgmls can parse (but obviously not validate) XML input without a DTD, it would be really “overkill” in this use. One could alternatively use a tool, say xmlin, specifically developed for the purpose of converting well-formed XML into the internal format.

To validate XML input against an XML Schema, one could use the SAX2Print sample program from the Xerces XML parser library. If validation was successful, it outputs a well-formed XML document (without referencing a DTD obviously), which can be fed directly into xmlin.

Note also that we need to piece together an XML (or SGML) document at the end of our chain from the internal ESIS representation: this too is trivial to implement in a dedicated tool, say xmlout. As long as the required XML transformation done in the XSLT step is simple enough it could be done inside xmlout too. For example

renaming elements and
deleting or inserting specific start/end tags (to unwrap or wrap element content)

is certainly easy to implement, and would suffice to transform <paragraph> into <P> rsp <Para> as described in the example above.
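
A sketch of such an xmlout tool, including the two simple transformations just mentioned, could look as follows in Python. The RENAME and UNWRAP tables are hypothetical configuration, and only CDATA attribute lines are turned into attributes; everything else about the ESIS input is assumed to be as described above.

```python
import re
from xml.sax.saxutils import escape, quoteattr

RENAME = {"paragraph": "P"}   # hypothetical: element names to replace
UNWRAP = {"text"}             # hypothetical: elements whose tags are dropped

def _decode(data):
    def replace(match):
        esc = match.group(1)
        if esc == "\\":
            return "\\"
        if esc == "n":
            return "\n"
        return "" if esc == "012" else chr(int(esc, 8))
    return re.sub(r"\\(\\|n|[0-7]{3})", replace, data)

def xmlout(lines):
    """Re-format ESIS lines as XML text, renaming and unwrapping on the way."""
    out, attrs = [], []
    for line in lines:
        flag, rest = line[:1], line[1:]
        if flag == "A":
            name, kind, *value = rest.split(" ", 2)
            if kind == "CDATA":
                attrs.append((name, _decode(value[0] if value else "")))
        elif flag == "(":
            if rest not in UNWRAP:
                pairs = "".join(f" {n}={quoteattr(v)}" for n, v in attrs)
                out.append("<" + RENAME.get(rest, rest) + pairs + ">")
            attrs = []
        elif flag == ")":
            if rest not in UNWRAP:
                out.append("</" + RENAME.get(rest, rest) + ">")
        elif flag == "-":
            out.append(escape(_decode(rest)))   # here < becomes &lt; etc
    return "".join(out)
```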

5.2 A generic “adapter” for existing mark-up processors

If an existing mark-up processor can not be modified to adhere to the conventions established in this concept (reading from stdin the internal ESIS representation format, and writing the processed ESIS to stdout again etc)—be it that no source code is available, or no developer to do these modifications—one could still use the existing processor in the “mark-up” chain as long as it exhibits at least in some invocation this simple behaviour:

It reads plain text in the specific mark-up syntax, either from a file named in a command-line argument, or from stdin;
it produces a sequence of XML elements (native element types or “common” ones like XHTML element types), either into a file named in a command-line argument, or to stdout.

To use such an existing mark-up processor in the “mark-up chain”, a special “adapter” process could feed plain text into it, and fetch the output XML element sequence from it.

All of the behaviour required to function in the “mark-up chain” would be implemented in the “adapter” processor itself:

The parsing and filtering of the internal ESIS representation coming in from the processor ahead in the chain, as well as
converting and inserting the element sequence produced by the existing mark-up processor into the ESIS stream going out to the following processor in the chain.

The “adapter” processor would be linked into the chain in the exact same way as any other mark-up processor, namely through standard input and standard output and the internal ESIS representation transmitted through these channels, and would invoke the existing mark-up processor “privately” just for the transformation of plain text fragments into XML element sequences.

Apart from the one-time effort to implement such an “adapter” processor, some performance cost would also be incurred in using it:

the existing mark-up processor would have to be started anew for each plain text fragment to process.

Furthermore,

if there is no system support for executing a child process and connecting to it through a pair of pipes,
or if the “adapter” processor is to be implemented using the standard C library only,
or if the existing mark-up processor simply does not use stdin and/or stdout in the manner required

the adapter process would have to communicate with the existing mark-up processor not through pipes, but through temporary files.
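
The two ways an “adapter” could talk to an existing processor can be sketched like this in Python. The command is whatever invokes the existing processor and is entirely hypothetical here; the sketch assumes the processor writes its XML element sequence to stdout and, in the second variant, accepts the input file name as its last argument.

```python
import os
import subprocess
import tempfile

def run_via_pipes(command, plain_text):
    """Start the processor once, feed it plain text on stdin, collect stdout."""
    done = subprocess.run(command, input=plain_text, capture_output=True,
                          encoding="utf-8", check=True)
    return done.stdout

def run_via_temp_file(command, plain_text):
    """Same, but hand the plain text over in a temporary file."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                     encoding="utf-8") as tmp:
        tmp.write(plain_text)
    try:
        done = subprocess.run(command + [tmp.name], capture_output=True,
                              encoding="utf-8", check=True)
        return done.stdout
    finally:
        os.unlink(tmp.name)           # always clean up the temporary file
```

Either function would be called once per plain text fragment, which is exactly the performance cost mentioned above.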

All in all it is certainly better to adapt an existing mark-up processor to be used directly in the “mark-up chain”, but an “adapter” processor of this kind could in fact provide a work-around in some cases.

And implementing the “adapter” processor would be a nice little development experiment, too …

6 Required new tools and changes to existing tools

Developing new or adapting existing tools for use in the “mark-up chain” should be pretty easy. Each tool must:

read the internal ESIS representation format from stdin,
pass it through or filter out character data for processing,
generate start tags, character data, end tags to output.

Note that character data output into the internal format does not need to and in fact must not be “XML-escaped”, just as character data in the input need not and must not be “XML-decoded” (replacing &lt; by < etc). Line endings and control characters must however be converted (to and from \n\012 rsp \ddd octal escapes) when reading character data input and writing character data output.

The txtin tool does little more than copying input to output, while the xmlout tool basically re-formats the internal format to proper XML (but must introduce entity references for < etc).

The xmlin tool should be able to rely on an XML parser library (presumably following the SAX model), and does little more than re-formatting too.

The hard part, parsing (and validating if needed against a DTD or an XML schema), can be done by simply using existing parsers like nsgmls or Xerces.

To directly use the cmark parser (or similar implementations like discount or sundown) in this chain, it has to parse and generate the internal ESIS representation format, in the way described above. One obvious way to implement this would be to add a new format for the -t option, say esis, and to re-use most of the implementation of the -t xml format; I would guess that this could be done in one day (a couple of hours). The alternative is to develop a separate program that only uses the existing API of the reference implementation: feeding plain text extracted from container elements in the input stream into cmark_parse_document(), and generating output from the parsed document in the internal ESIS format through a modified copy of cmark_render_xml(), say cmark_render_esis().
For other mark-up processing tools, adding an option to parse and generate the internal ESIS format might take more or less effort, but is certainly not a huge task.
The txtin tool is really trivial to implement.
The xmlout tool is really trivial to implement without transformation capabilities; with these capabilities (and options to insert appropriate DOCTYPE declarations etc in the output) the effort increases, but again this too can certainly be done in a couple of hours.
The xmlin tool is rather trivial to implement, using a non-validating XML parsing library like expat (by James Clark, too), or tinyxml2, or libxml etc.

[Note 3: I already have a prototype implementation using expat, which took me probably less than an hour to implement (using expat for the first time): it produces the same output as nsgmls for well-formed XML documents without a DTD. The only (unproblematic) exceptions are that nsgmls likes to output Aattr IMPLIED val lines for attributes missing in the element but which already had a value definition somewhere before, and that the order in which A lines appear may differ between parsers.]
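
For illustration only (this is not the prototype mentioned in Note 3), here is what an xmlin tool might look like when written against the expat binding in Python’s standard library. Sorting the attributes by name and escaping only backslashes and line breaks are simplifications chosen for this sketch.

```python
import xml.parsers.expat

def xmlin(xml_bytes):
    """Convert a well-formed XML document into ESIS lines."""
    lines, pending = [], []

    def enc(text):
        return text.replace("\\", "\\\\").replace("\n", "\\n\\012")

    def flush():
        # expat may deliver one run of text in several pieces
        if pending:
            lines.append("-" + enc("".join(pending)))
            pending.clear()

    def start(name, attrs):
        flush()
        for key in sorted(attrs):
            lines.append(f"A{key} CDATA {enc(attrs[key])}")
        lines.append("(" + name)

    def end(name):
        flush()
        lines.append(")" + name)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = pending.append
    parser.Parse(xml_bytes, True)
    return lines
```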

7 Embedding “foreign” syntax into CommonMark texts

One important question was avoided in the concept so far: What does the CommonMark text look like if “foreign” mark-up syntaxes are to be mixed in? While this is a topic open for debate, as the order of tools in the presented pipeline indicates, the “foreign” syntaxes would be processed after the (modified) cmark filter has done its job: the other tools act as post-processors.

[Note 4: How the different syntaxes are delimited and recognized exactly is still TBD and the topic of discussion.]

The design presented in this section provides

a mechanism to implement different forms of such delimiters (using a configuration file), and
the rules how to package the raw text content of “foreign” syntax fragments into “container elements” for consumption by a post-processor (proposing an element type name and attributes, as well as rules how these attributes are to be set and interpreted).

[Note 5: In the “mark-up chain” concept as presented here, these two items have to be provided (ie settled upon, documented, and implemented) in any implementation tailored for any specific rules and conventions on how mixing “extended” syntax into CommonMark documents looks like.]

The following mechanism could be implemented in the new -t esis processing mode of a modified cmark (or in any free-standing CommonMark processor, of course).

7.1 Configurable delimiters

The parser is provided with a configuration file containing patterns to match against input lines. These patterns come in pairs, where the first pattern specifies the form of a “mark-up block start” line, and the second the form of the corresponding “mark-up block end” line.

The configuration file specifying these patterns could look like this:

"^+-- @ ---$" "^---$" "Z Notation"
"C*" "C Programming Language"

The meaning of these two lines for the handling of “foreign syntaxes” is:

The first line shown here contains three strings, the first two specify a pair of patterns intended to detect
- the opening line for a “Z Schema definition paragraph” in the “e-mail mark-up” of the Z Notation, and
- the corresponding closing line pattern,
- while the third string in the line names the “notation” contained in blocks delimited by matching lines: Z Notation in this case.
In the simple example pattern syntax used here, every character stands for itself except “@” (COMMERCIAL AT), which matches any non-empty string not containing the following literal character, a SPACE in this case.
The second line specifies (this time using a glob pattern) one set of info strings usable in starting lines for fenced code blocks. The matching info strings will be recognized as indicating the “notation” C Programming Language.

[Note 6: The exact syntax for expressing these two kinds of patterns, and which algorithm(s) to use for matching, is TBD.]
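
Since the pattern syntax is still open, the following Python sketch is only one possible reading of the example configuration: a leading “^” and a trailing “$” are treated as anchors, “@” captures a non-empty run of characters up to the literal character that follows it, three strings on a line define a block delimiter pair, and two strings define a glob for info strings. All function names are invented for this illustration.

```python
import fnmatch
import re
import shlex

def compile_line_pattern(pattern):
    """Translate the simple pattern syntax into a regular expression."""
    core = pattern[1:] if pattern.startswith("^") else pattern
    core = core[:-1] if core.endswith("$") else core
    parts = []
    for i, ch in enumerate(core):
        if ch == "@":
            stop = core[i + 1:i + 2]          # the following literal character
            parts.append(f"([^{re.escape(stop)}]+)" if stop else "(.+)")
        else:
            parts.append(re.escape(ch))       # every other character is literal
    return re.compile("^" + "".join(parts) + "$")

def load_config(text):
    """Return (block patterns, info string globs) from the configuration text."""
    blocks, infos = [], []
    for line in text.splitlines():
        fields = shlex.split(line)
        if len(fields) == 3:
            start, end, notation = fields
            blocks.append((compile_line_pattern(start),
                           compile_line_pattern(end), notation))
        elif len(fields) == 2:
            infos.append((fields[0], fields[1]))
    return blocks, infos

def notation_for_info(infos, info_string):
    """Find the notation configured for a fenced code block info string."""
    for glob, notation in infos:
        if fnmatch.fnmatchcase(info_string, glob):
            return notation
    return None
```

With the two configuration lines shown above, a line like “+-- Name ---” would match the first start pattern with “Name” captured as the label.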

Because fenced code blocks with info strings are part of the core CommonMark syntax, and recognizing such fenced code blocks is already implemented in cmark, all we need to implement is a modified treatment of fenced code blocks with an info string. No other modification of cmark’s behaviour is required, and no more configuration set-up is needed to “send” the pertaining code block content to an appropriate and available post-processor.

[Note 7: There are stricter requirements than “three or more tilde characters” imposed on a block-ending fenced line by the CommonMark specification anyway, which are hard to express in a pattern—luckily no such pattern is needed.]

When a plain text input line is encountered that matches a configured “mark-up block start” line, the CommonMark processor has to “package” the raw text content of the block starting there into a “container element”, and send out this container element into the output for the consumption by the post-processors which come later in the chain.

7.2 Packaging plain text for post-processing

The proposed element type name, or general identifier (GI) in SGML terminology, for the container elements is mark-up, and three attributes are defined for it. The (SGML) element and attribute list declarations are:

<!ELEMENT mark-up - - CDATA >
<!ATTLIST mark-up
          notation CDATA                   #REQUIRED
          mode     (vertical | horizontal) #IMPLIED
          label    CDATA                   #IMPLIED>

The same declarations apply in XML, but without the “- -” in the element declaration (and, since XML has no CDATA declared content, with (#PCDATA) as the content model).

[Note 8: In the SGML element declaration, “- -” specifies that neither the start tag <mark-up> nor the end tag </mark-up> can be omitted in a marked-up SGML document.]

The attributes of <mark-up> have the following meaning:

the “mode” of this <mark-up> element: “vertical” in this case (it would be “horizontal” for in-line “foreign” syntax, and yes, this is TeX jargon ;-),
the “notation” (ie kind of syntax) in which the plain text content is written (this is defined in the configuration and depends on the form and/or the label of this block),
the “label” extracted using the pattern match, if any.

7.3 Rules for handling CommonMark “mark-up blocks”

A CommonMark processor should handle blocks in “foreign” syntax for the “mark-up chain” process according to the following rules, using the patterns and notation names defined in the configuration file introduced above.

One can call “mark-up blocks” both forms of blocks which require post-processing:

fenced code blocks with “known” info strings and
blocks in “foreign mark-up syntax” delimited by configurable lines

and both are processed and packaged in the same way, except for the different ways of detecting the starting fenced code block line rsp the “mark-up block start” line, namely:

during regular parsing in the case of a fenced code block with an info string: the info string is matched against the configured patterns (for info strings) to find a defined notation for this form of info string; but
by matching the (start of) the input line against a list of configurable patterns in the case of “foreign mark-up syntax” blocks: if a match is found, the notation again comes from the configuration line, and the pattern match also determines the label to use (if any) for this block.

and vice versa for finding the corresponding ending fenced code block line rsp “mark-up block end” line.

The actions the CommonMark processor needs to take to process a “foreign mark-up syntax” block—when encountering a plain text input line which matches a configured pattern—are these:

The parser closes the currently open block-level element (if any);
an (optional) label is extracted from this line using the pattern match;
a start tag for a <mark-up> element is generated into the output, using attribute values
- for the mode attribute the value vertical (because this is the default value for mode, the mode attribute can be omitted),
- for the label attribute the CDATA substring found in the “mark-up block start” line pattern match,
- for the notation the CDATA string defined in the configuration file;
the “block start” line itself is copied as character data into the output;
all plain text up to the corresponding “mark-up block end” line is not processed in any way, but taken “literally”, and output as character content (substituting \n\012 for EOL etc as usual, but not entity-encoding < for example);
when the corresponding “mark-up block end” line occurs in the plain text input, the end tag for this <mark-up> element instance is output after copying the “block end” line into the output.

[Note 9: The first step (closing the currently open block-level element) is only needed if there is no blank line preceding the “foreign mark-up start” line matched.

It would probably be better to require this blank line to be there, for in that case there is never an open block-level element that needs closing when a “foreign mark-up start” line is detected. Furthermore, the parser would only need to match each first non-empty line after (a sequence of) blank lines, instead of every single input line.]

[Note 10: The reason for copying the starting and ending plain text lines of a “foreign mark-up block” into the <mark-up> element’s content is simple: the post-processor might actually need to see them to decide what the raw plain text between them means. This is in fact the case for marked-up blocks in the Z Notation (like the one shown in the example below), where the form of the starting line determines the syntactic category of (most) such blocks.]
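The packaging steps listed above can be sketched as follows. This is a minimal illustration under several assumptions: the patterns `START` and `END` and the function name `package_blocks` are made up here (in the design they would come from the configuration file), the output is written as XML-like text lines rather than in the internal ESIS representation, and the first step (closing an open block-level element) is left out because the sketch does no CommonMark parsing at all.

```python
import re

# Assumed patterns for one foreign notation (a Z schema paragraph).
START = re.compile(r"^\+-- *(\w+) *-+$")
END = re.compile(r"^-+$")
NOTATION = "Z Notation"


def package_blocks(lines):
    """Return output lines; foreign blocks are wrapped in <mark-up> elements."""
    out = []
    in_block = False
    for line in lines:
        if not in_block:
            m = START.match(line)
            if m:
                # Start tag: label from the match, notation from the config.
                out.append('<mark-up mode="vertical" label="%s" notation="%s">'
                           % (m.group(1), NOTATION))
                out.append(line)  # the block start line itself is copied
                in_block = True
            else:
                out.append(line)  # left to regular CommonMark parsing
        else:
            out.append(line)  # literal content, not processed or encoded
            if END.match(line):
                out.append("</mark-up>")  # end tag after the block end line
                in_block = False
    # An unterminated block is not handled in this sketch.
    return out
```

Everything between the start and end line passes through untouched, which is the point of the design: only the post-processor for this notation ever interprets it.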

Processing fenced code blocks labeled with a “known” info string (meaning: one for which there is a match in the configuration file) follows the same steps, except that

the value used for the label attribute is the info string, and
neither the starting fenced code block line nor the ending one is copied into the output character data.

Note that both these transformations need only be implemented once in a CommonMark processor, and after that pretty much any syntax for starting and ending blocks of “extended” syntax can be defined in the configuration file and recognized by pattern matching (of an expressive enough kind).

Note also that the “plain text input” for the CommonMark processor is itself the character data content of a <mark-up> element in the ESIS input stream.

The CommonMark processor will see only one such <mark-up> element if the input into the “mark-up chain” is an XML-wrapped plain text file (in which case this one element contains the complete plain text content of the input); but may see many such elements otherwise (when processing a parsed XML/SGML document).

Since the CommonMark processor packages all plain text fragments recognized as “foreign” notations into the appropriate <mark-up> elements, the post-processors in the tail of the processing chain need not be concerned with any “foreign” or CommonMark syntax—they just need to filter “their” plain text from the appropriate <mark-up> elements.

7.4 Examples of packaged “foreign syntax blocks”

Here is how mark-up using the configured “extended syntax” (using the configuration file shown above) would be transformed into the ESIS output of the CommonMark processor:

A fenced code block with an info string "C" like this one:

~~~ C
int foo();
~~~

would be transformed into a <mark-up> container element instance in the output stream like that one (written here in XML format, not in the internal ESIS representation):

<mark-up mode="vertical" label="C">
<![CDATA[~~~ C
int foo();
~~~]]></mark-up>

And a “foreign mark-up syntax block” (in the example configuration: a “Z Schema definition paragraph”) in the CommonMark input plain text looking like this:

+-- NAME ---
    DeclPart
|--
    Predicate
---

would be transformed into that <mark-up> element:

<mark-up mode="vertical"
         label="NAME"
         notation="Z Notation">
<![CDATA[+-- NAME ---
    DeclPart
|--
    Predicate
---]]></mark-up>

[Note 11: The CDATA section mark-up is shown here simply to illustrate that the plain text inside the <mark-up> element is not to be parsed in any way. But then the whole XML/SGML mark-up shown is just for illustration: there are no CDATA sections in the ESIS and hence in the internal format to represent the ESIS—once more the internal format is simpler than the XML mark-up!]

7.5 Filter processing by tools in the mark-up chain

It should be obvious how the CommonMark processor at the head of the chain and every post-processor later on can watch out for the <mark-up> elements it needs to handle:

A “syntax highlighting” post-processor would look for such elements that have a label attribute it recognizes: maybe simply from a list of available programming languages, like “C”, “Java”, “C++”, etc. Or it recognizes additional information in the label, too: for example, a label “C(indent=4)” could additionally specify how to lay out each highlighted line of code in the rendered output.

Since the form of the “mark-up block start” and “end” line is irrelevant in this case, the post-processor can simply skip over both of these lines in the plain text character content of the <mark-up> element.

A “Z Notation e-mail mark-up transforming” post-processor would have to watch for <mark-up> elements with a notation attribute specifying Z Notation.

It could, but need not, use the label attribute to find there the name of the schema written in the “mark-up block start” line (in this specific form of Z Notation paragraph), but could also just parse the whole plain text contained in the <mark-up> element and ignore the label attribute.
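How a post-processor picks “its” elements can be sketched with Python's standard `xml.etree.ElementTree` module. Note the assumptions: the sketch reads the XML form used for illustration in this text (not the internal ESIS representation), and the wrapper element `<doc>` and the function name `select_markup` are made up for this example.

```python
import xml.etree.ElementTree as ET


def select_markup(root, notation=None, labels=None):
    """Return the <mark-up> elements this post-processor is responsible for."""
    selected = []
    for elem in root.iter("mark-up"):
        if notation is not None and elem.get("notation") != notation:
            continue  # some other notation: leave it for another tool
        if labels is not None and elem.get("label") not in labels:
            continue  # label not recognized by this tool
        selected.append(elem)
    return selected


DOC = """<doc>
<mark-up mode="vertical" label="C"><![CDATA[int foo();]]></mark-up>
<mark-up mode="vertical" label="NAME" notation="Z Notation"><![CDATA[+-- NAME ---
    DeclPart
|--
    Predicate
---]]></mark-up>
</doc>"""

root = ET.fromstring(DOC)
z_blocks = select_markup(root, notation="Z Notation")
code_blocks = select_markup(root, labels={"C", "Java", "C++"})
```

A Z Notation post-processor would work on `z_blocks` and parse each element's text; a syntax highlighter would work on `code_blocks`. Neither needs to know anything about CommonMark.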

[Note 12: As the first example above demonstrates, the label attribute of the <mark-up> element could be used to transmit processing parameters and options written in the CommonMark source to the designated post-processor, which is a nice feature.

On the other hand: the label attribute’s value transmitted this way is never checked for validity, apart from the fact that it matches the appropriate pattern in the configuration file.

As a consequence, every post-processor must be able to cope with any value (that matches the configured pattern) in the label attribute, and must generate a formally valid—though possibly nonsensical—element sequence as output.

This is because the apparent alternatives, namely

passing the <mark-up> element along in the chain,
wrapping the character content of the <mark-up> element in another element similar to HTML’s <PRE>,
or “unpacking” the character content and inserting it as character data into the output stream,

are all not viable if you think about it.]
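A post-processor that copes with arbitrary label values could parse them defensively, as in the following sketch. The label syntax assumed here (a name, optionally followed by comma-separated `key=value` options in parentheses) is only extrapolated from the “C(indent=4)” example above; the names `parse_label` and `indent_width` are hypothetical.

```python
import re

# A name without spaces or parentheses, then optional "(...)" options.
LABEL = re.compile(r"^\s*([^\s()]+)\s*(?:\((.*)\))?\s*$")


def parse_label(label):
    """Split a label like 'C(indent=4)' into (name, options); never raises."""
    m = LABEL.match(label or "")
    if not m:
        return None, {}  # unusable label: caller falls back to defaults
    name, raw = m.group(1), m.group(2) or ""
    options = {}
    for item in raw.split(","):
        key, sep, value = item.partition("=")
        if sep and key.strip():
            options[key.strip()] = value.strip()
    return name, options


def indent_width(options, default=0):
    """Read the assumed 'indent' option, falling back on bad input."""
    try:
        return max(0, int(options.get("indent", default)))
    except ValueError:
        return default
```

With this approach a label such as “C(indent=lots)” still yields a usable name and a default indentation, so the post-processor can always emit a formally valid element sequence.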

7.6 Properties of the “syntax extension” design

This model of handling and post-processing “foreign” or “extended” syntax in CommonMark typescripts has several advantages:

The CommonMark specification needs no modification or extension,
the cmark processor needs to be modified only once (for handling labeled code blocks, and for matching input lines against the configured patterns), and after that
introducing another “foreign syntax” together with a processor for it requires only adding one line in the configuration file (and inserting the processor for this kind of “plain text” mark-up into the “mark-up chain” of course!)
none of the mark-up processors in the chain after cmark needs to know anything about CommonMark or Markdown or any other typescript syntax (in fact, they could be used alone and without cmark just the same to process plain text fragments in any XML/SGML document!),
it can be easily extended to accommodate in-line “foreign” syntax fragments in the exact same way. The main difficulty would be to decide on a delimiter syntax which is both flexible and unobtrusive. One idea would be to use a non-ASCII character as in
```
... quote the ´C`int foo();` declaration ...
```
(that is, prefixing a U+00B4 ACUTE ACCENT, followed by a label, to a code span) and then transform that into
```
... quote the <mark-up mode="horizontal" label="C">
<![CDATA[int foo();]]></mark-up> declaration ...
```
This would offer exactly the same range of possibilities as the info string in fenced code blocks does; and it seems natural to equate the info strings of fenced code blocks with the prefixed strings for code spans (in this delimiter style or a different one).

Note that (after a one-time modification of the CommonMark processor’s code span handling) this would require no configuration settings if the configuration file already has an info string pattern indicating the “foreign” notation used here. Introducing a new “foreign” syntax for both block and in-line use is thus one and the same step.

Note also that the only difference in the generated <mark-up> element compared to the code block case (aside from the obviously missing “code block start” and “end” lines) is the attribute mode, which indicates that this is an in-line span (horizontal) and not a separate block (vertical). The “right” post-processor therefore already receives the packaged character data; it might or might not make a difference in processing whether the <mark-up> element represents a block or an inline span (as indicated by the mode attribute), but everything else is the same for both sorts of “foreign” mark-up fragments.
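The in-line transformation could look roughly like the following sketch. It is an assumption-laden illustration, not a proposal for the actual cmark change: the regular expression only covers code spans delimited by single backticks, the label is not checked against the configured info string patterns, and the function name `package_inline` is made up.

```python
import re
from xml.sax.saxutils import quoteattr

# U+00B4 ACUTE ACCENT, a label, then a code span in single backticks.
INLINE = re.compile("\u00b4([^`\\s]+)`([^`]*)`")


def package_inline(text):
    """Wrap labeled code spans in <mark-up mode="horizontal"> elements."""
    def repl(m):
        label, code = m.group(1), m.group(2)
        # CDATA is used only to mirror the illustration above; content
        # containing "]]>" is not handled in this sketch.
        return ('<mark-up mode="horizontal" label=%s><![CDATA[%s]]></mark-up>'
                % (quoteattr(label), code))
    return INLINE.sub(repl, text)
```

Ordinary code spans without the prefix are left alone, so existing CommonMark input keeps its meaning.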

[Note 13: Placing the txtin tool in front of the “mark-up chain” in order to implement a process to transform a “conventional” plain text input in CommonMark notation into the target document would provide processing of “foreign syntax” embedded in plain text input files too.]

8 Conclusion

I am convinced that the approach presented as the “mark-up chain” provides a useful and viable extension to the way Markdown and similar “human-oriented” plain text syntaxes can be used.

8.1 Effort

I expect that the

required modifications to existing mark-up processors like cmark, sundown, zhtml etc, as well as the
additional tools for use in the chain (txtin, xmlin, xmlout)

can be implemented with moderate effort and without disrupting the existing implementations.

8.2 Gains

Using these modified or new tools in the proposed way would bring huge gains for the flexible use of CommonMark processing:

the ability to extend CommonMark syntax with other “foreign” syntaxes
in a robust and modular way (each tool handles just one syntax),
with minimal configuration effort (adding one configuration line for each “foreign syntax” extension)

on the one hand, and beyond that

the ability to use cmark and similar tools to generate “structured” (ie XML/SGML/HTML) documents from plain text input files remains, but is extended by
the ability to use cmark and/or similar tools to transform input XML/SGML/HTML documents by replacing plain text contained as element character content in designated and marked-up “container elements” (the <mark-up> elements presented above) with elements (or element sequences) generated from and for these plain text fragments.

Together, these extend the application of CommonMark into new areas like the XML/SGML authoring scenario sketched at the beginning: one could use “extended” CommonMark for authoring eg DocBook documents, while retaining the complete XML/SGML authoring tool chain. Editing, validating, and transforming (into HTML, PDF, eBook, whatever) can all be done as before, with one additional step introduced: transforming an XML/SGML document with “plain text” mark-up in it into an XML/SGML output document without such plain text, but conforming to the target document type (ie the DocBook DTD or XML Schema, or whatever the target XML/SGML document type may be).

CC BY-SA 4.0 license applies

$Date: 2015-10-22 05:08:40 +0200 (Do, 22 Okt 2015) $