A Plain Text Mark-Up-Processing Chain

mh@tin-pot.net
2015-10-22



1 Overview

This text presents a concept of how CommonMark syntax could be extended by other, similar “human-oriented”, “plain text” mark-up notations, like the Z Notation mark-up (see Note 1) or AsciiMath,

and how this “extended” syntax could be used not only in processing plain text files (as in the “conventional” use case for CommonMark), but also inside complete XML/SGML documents,

for authoring and processing of DocBook documents, for example.

[Note 1: The Z Notation is one of the more popular formal specification languages: it is based on ordinary mathematics (Zermelo-Fraenkel set theory, probably the reason for the name “Z”) and is formally defined in ISO/IEC 13568:2002. This standard also specifies the “human-oriented” plain text mark-up mentioned here, intended primarily for e-mail and similar circumstances. The Z Notation standard is publicly available at no cost (as a zipped PDF file), and the Corrigendum 1 is too—but the standard is most certainly not an easy-to-read “introduction” to the Z Notation!]

[Note 2: The designation “XML/SGML”, or even “XML/XHTML/SGML/HTML”, emphasizes that there are differences between XML and SGML. But luckily it turns out that these differences are mostly irrelevant for the concept presented here: the same concept and all of the same mark-up processor implementations can be used to process XML as well as SGML documents.]

1.1 Design questions and constraints

Given the—obvious, in my opinion—requirement that CommonMark syntax and whatever other syntaxes are to be made available will be processed each by a specific, separate processor (each one a separate process, where the next one is processing the output of the previous one), the obvious design questions and decisions are:

  1. How do we interface the processing of “plain text” mark-up—performed by mark-up processors like cmark etc—with the rest of the authoring environment: both at the input and the output end of our “plain text” mark-up process?

  2. How do we combine the processors for the various “plain text” mark-up syntaxes so that each one of them can do its job, without interfering with the other syntaxes and their processors?

Some goals for and important qualities of an implementation (comprising the modification of existing tools and/or the implementation of new ones) are obvious:

  1. We want—of course—to spend as little effort as possible to implement the whole thing (although there is no money to waste nor gain here anyway …), and

  2. we want to introduce into existing mark-up processors only those minimal modifications which are really required, without disrupting their current behaviour.

  3. The implementation has to be general: agnostic about the number or nature of syntaxes and processors employed, which is a pre-requisite for

  4. another important design goal: adaptability. It is obviously important to strive for an implementation that permits the easy incorporation of additional “foreign” syntaxes and the pertaining mark-up processors, with as little change to the existing tools as possible, or the replacement of a specific mark-up processor with a new one for the same mark-up syntax, and so on.

  5. Of course the implementation has to be platform-independent if possible at all (ideally using only features of Standard C or Standard C++), or at least be easy to port (say by providing a thin abstraction layer above platform-dependent APIs).

  6. One “soft” goal—but important to me—is that whatever “syntax extensions” of the current CommonMark specification are introduced should have a “natural” feel and ought to fit in without accidental conflicts (with existing mark-up, or with existing styles of authoring CommonMark documents). The best way to “extend” the CommonMark specification would of course be no extension at all, but to instead use the existing notations where the specification only mentions “implementation-dependent” behaviour.

The plain text mark-up-processing chain, or “mark-up chain” concept is my answer to these (and other) requirements and design questions and constraints.

1.2 The purpose and use of a “mark-up chain”

A key idea here and one major goal is that an author would write, inside his XML/SGML document, plain text in (a mixture of) “human-oriented” mark-up like CommonMark as the unparsed character content of element (instances) of some specific XML/SGML element type, let’s call them “container elements”. All other content, structure, and (XML/SGML) mark-up of the document should need no change whatsoever.

The job of the “mark-up chain”, which is really no more than a bunch of “filters” (ie processes transforming input text to output text) chained together in a pipeline for the processing of mark-up, is then to transform the plain text content of these container elements into proper XML/SGML elements, while passing everything else through unchanged.

The output of this “mark-up chain” would then be again an XML/SGML document: without any embedded “marked-up plain text”, but with the result of interpreting that plain text in its place, represented as regular XML/SGML elements.

As we will see, the same process implemented in the “mark-up chain” can just as well be used to generate XML/SGML/HTML output from “conventional” plain text input files—the same transformation that cmark and other processors accomplish already.

But with an important twist: we will see that the proposed “mark-up chain” provides a robust, flexible and extendable way to augment CommonMark syntax with “foreign syntaxes” processed by other tools, and this “extended syntax” can be used when processing “conventional” plain text input files just as well as in the case of XML/SGML input documents.

2 The XML/SGML environment

In a usage scenario for the proposed “mark-up chain” an author would write his structured document (or his typescript) either as a complete XML/SGML document with marked-up plain text embedded in it, or as a simple plain text file.

The input containing the marked-up “plain text” in this scenario would in the first alternative not consist of an entire plain text file, but the marked-up plain text would rather be encapsulated in XML/SGML element instances, embedded into a complete XML/SGML document.

To treat both cases the same, the plain text file’s content in the second case could be embedded into the root element of an XML/SGML document and then presented to the mark-up tool chain.

Similarly, the output at the end of processing the “human-oriented” plain text would not necessarily always be an (X)HTML document, created entirely by the tool chain, but could basically be any XML/SGML document of a type relevant for the whole authoring process at hand (think of DocBook for example). And some or most of the content of this document is not generated by the processors in our mark-up tool chain, but created in some other authoring tool, presented as part of the complete input XML/SGML document to our tool chain, and passed through all mark-up processors and into the final output XML document of our tool chain.

3 Input to tools in the chain

So both the “initial” input to and the “final” output from our tool chain are XML or SGML documents, passed in from and out to the rest of the authoring environment.

If the author prefers to prepare his typescript as a simple text file, we can trivially wrap this plain text file’s content inside an element designated to “transport” such marked-up plain text, and proceed using this as our “initial” input XML document, which would look like this:

<?xml version="1.0" encoding="UTF-8"?>
<mark-up mode="vertical" notation="CommonMark">
<![CDATA[## Your _Markdown_ Input ##
The _Markdown_ input appears <em>unscathed</em> inside the
<mark-up> element here!]]></mark-up>

Otherwise, we would receive an XML/SGML document already containing such <mark-up> elements where the plain text to process can be found inside.

Every mark-up processor (for CommonMark, for Z Notation, etc) used in the chain would thus process only these <mark-up> element instances and replace them with the result from transforming their (plain text) input into XML/SGML elements (a single one, or a sequence of elements representing, for example, multiple paragraphs).

Everything outside such <mark-up> elements is just passed through by each mark-up processor.

4 Output from tools in the chain

We already have seen that each mark-up processor outputs a well-formed XML document and passes it to the input of the next processor in the chain.

So at the end of a chain like this

cat input.xml | cmark | zml | asciimath > output.xml

we have an output XML document where (hopefully) all the <mark-up> elements are gone and have been replaced by the result of interpreting their plain text content, and this interpretation result itself is represented simply as a sequence of the appropriate XML/SGML elements, specific to the mark-up processor resp. the syntax it transforms.

Depending on the document type used in the authoring environment, another step may be required to transform these output fragments into the appropriate elements of the target document type: for example, the current CommonMark DTD uses the element type name <paragraph> for ordinary paragraphs of text, but HTML uses <P>, while DocBook uses <Para>. (An HTML or XML DOCTYPE declaration may also need to be inserted into the output document, and similar “final touches” applied.)

Furthermore, the content model for the CommonMark <paragraph> element type encompasses only other elements (notably the <text> element), while HTML’s <P> as well as DocBook’s <Para> elements have #PCDATA in their content models: the character content appears immediately inside these paragraph elements, while in CommonMark it is nested inside <text> (among other) elements.

To illustrate the differences, here are the respective parameter entity declarations and element declarations for the “just a simple paragraph of text” element type in the CommonMark DTD, in the ISO/IEC 15445:2000 HTML DTD, and in the DocBook 3.1 DTD:

In the CommonMark DTD:

<!ENTITY % inline
'text|softbreak|linebreak|code|inline_html|emph|strong|link|image'>
<!ELEMENT paragraph (%inline;)*>

In the ISO HTML DTD:

<!-- Character level elements and text strings -->
<!ENTITY % text "#PCDATA | %physical.styles; | %logical.styles;
                                             | %special;
                                             | %form.fields;" >
<!ELEMENT P - O (%text;)+ >

In the DocBook 3.1 DTD:

<!ENTITY % para.char.mix "#PCDATA
                          |%xref.char.class;  |%gen.char.class;
                          |%link.char.class;  |%tech.char.class;
                          |%base.char.class;  |%docinfo.char.class;
                          |%other.char.class; |%inlineobj.char.class;
                          |%synop.class;
                          |%ndxterm.class;
                          %local.para.char.mix;">
<!ELEMENT Para - O ((%para.char.mix; | %para.mix;)+)>

Transformations of this kind are best done using an XSLT step, and are not really in the scope of our “mark-up” chain. So we’ll mostly ignore them from now on.

5 The mark-up processors in the chain

An instance of this kind of “mark-up chain” could be a pipeline composed of text filter processes, each processing only one kind of “human-oriented” plain text mark-up:

cat authored.xml | markdown | znotation | ... > final.xml

where markdown and znotation stand for some kind of processor for CommonMark resp. Z Notation syntax—we’ll come to the details about these processors later.

While this is a slightly simplified representation, a major goal of the “mark-up chain” concept is in fact to meet the requirement that all mark-up processors are filters (ie separate processes) which can be composed as needed into a pipeline of this kind.

We already have seen two more elements of the chain: a little tool (called txtin below) which wraps plain text input into an XML document at the head of the chain, and a final step at its end which assembles a proper XML/SGML document again from the chain’s output.

5.1 Avoiding XML/SGML parsing

We assumed so far that all tools in the chain are connected through text streams, and pass XML/SGML documents from one step in the pipeline to the next.

But this would imply that each of these tools ought to be capable of parsing and processing whole XML/SGML documents; and even if the required parser could be shared among the tools (in a DLL resp. shared library), I find the idea that each tool drags around a parser like this horrible indeed.

We would be much better off if the document content could be passed along in some pre-parsed, line-oriented plain text format instead. This raises two questions: which format to use, and how to get an XML/SGML document into this format in the first place.

Luckily, there are already answers to both questions: the ESIS (Element Structure Information Set) of a document is exactly the content we need to pass along, and the output format of James Clark’s nsgmls parser is an established, line-oriented plain text representation of it.

If we settle for this format, each tool can quite easily filter out the plain text of its concern by reading the input stream in this format line by line, watching out for column one in each line, and following these rules:

  1. lines that start with neither A nor ( are simply copied to the output,

  2. for lines starting with A (attribute definition), remember the attribute value they specify,

  3. for lines starting with ( (element start tag), check if this is a container element containing plain text in the right syntax: if it is, extract and process the element’s character content; otherwise pass it through unchanged.

The format and meaning of lines of concern is determined by a flag character in column 1:

  1. “A”: An attribute value in the form

    Aattr CDATA value
    

    There are other forms, but the only other form occurring in the output from parsing an XML file without a DTD using nsgmls is

    Aattr IMPLIED value
    

    which can be ignored and passed on.

  2. “(”: An element start tag, the element name (GI) follows, starting in column 2.

  3. “-”: Character content, which follows, starting in column 2, encoded in UTF-8. The only “escape-encoding” used for it is similar to the convention followed eg in C for string literals: a backslash introduces an escape sequence, eg “\\” for a literal backslash, and “\n\012” for a line break in the content.

  4. “)”: An element end tag, again the element name (GI) follows, starting in column 2. When reading well-formed documents with balanced start and end tags, the name can be ignored.

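For illustration, under the rules just given, the internal-format stream for the little wrapped document from section 3 would look roughly like this (a sketch; a real parser like nsgmls may order the A lines differently):

Amode CDATA vertical
Anotation CDATA CommonMark
(mark-up
-## Your _Markdown_ Input ##\n\012The _Markdown_ input appears <em>unscathed</em> inside the\n\012<mark-up> element here!
)mark-up
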
So generating output containing new elements (as a result of processing the plain text mark-up) is equally trivial:

  1. First output the attributes, line by line, using

    Aattr CDATA value
    
  2. Output the start tag:

    (name
    
  3. Output the character content (if any):

    -text including\n\012to represent line breaks
    

    and other enclosed elements (if any).

  4. Output the end tag:

    )name
    

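To make the output side concrete, here is a minimal sketch in C of a routine emitting a complete <mark-up> element in this format; the function names are invented for this illustration, and only the escapes described above are handled:

    #include <stdio.h>

    /* Emit one ESIS data line: flag character '-' in column 1,
     * line breaks encoded as "\n\012", backslashes doubled. */
    static void esis_data(FILE *out, const char *text)
    {
        const char *p;
        fputc('-', out);
        for (p = text; *p; ++p) {
            if (*p == '\n')
                fputs("\\n\\012", out);   /* encoded line break */
            else if (*p == '\\')
                fputs("\\\\", out);       /* literal backslash  */
            else
                fputc(*p, out);
        }
        fputc('\n', out);
    }

    /* Emit a complete <mark-up> element: attributes first, then
     * the start tag, the character content, and the end tag. */
    static void esis_markup(FILE *out, const char *notation,
                            const char *mode, const char *content)
    {
        fprintf(out, "Anotation CDATA %s\n", notation);
        fprintf(out, "Amode CDATA %s\n", mode);
        fputs("(mark-up\n", out);
        esis_data(out, content);
        fputs(")mark-up\n", out);
    }

    int main(void)
    {
        esis_markup(stdout, "Z Notation", "vertical",
                    "+-- NAME ---\n    DeclPart\n---");
        return 0;
    }
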
The use of this format has one additional advantage: we can start the mark-up chain with any XML/SGML document by parsing it first and feeding the parser output into the following tools.

Our txtin tool would produce output in this format directly, and one consequence of this is that this little tool becomes even more trivial to implement.

To summarize our possible inputs into the mark-up chain discussed so far:

  1. SGML with a DTD: gets parsed and validated;
  2. XML with a DTD: gets parsed and validated;
  3. XML without a DTD: is assumed to be well-formed and gets parsed;
  4. TXT (plain text in UTF-8): gets wrapped into ESIS by txtin.

While a full-blown parser like nsgmls can parse (but obviously not validate) XML input without a DTD, it would really be “overkill” in this use. One could alternatively use a tool specifically developed for this purpose to convert well-formed XML into the internal format, say xmlin.

To validate XML input against an XML Schema, one could use the SAX2Print sample program from the Xerces XML parser library. If validation was successful, it outputs a well-formed XML document (without referencing a DTD obviously), which can be fed directly into xmlin.
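
Putting the pieces together, the head of the chain for the four kinds of input could then look like this (sketched command lines, in the order of the list above; cmark and znotation stand for the mark-up processors as before):

nsgmls document.sgml   | cmark | znotation | xmlout > final.xml
nsgmls document.xml    | cmark | znotation | xmlout > final.xml
xmlin < document.xml   | cmark | znotation | xmlout > final.xml
txtin < typescript.txt | cmark | znotation | xmlout > final.xml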

Note also that we need to piece together an XML (or SGML) document at the end of our chain from the internal ESIS representation: this too is trivial to implement in a dedicated tool, say xmlout. As long as the required XML transformation done in the XSLT step is simple enough it could be done inside xmlout too. For example, a simple renaming of element type names according to a configurable mapping is certainly easy to implement, and would suffice to transform <paragraph> into <P> resp. <Para> as described in the example above.
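
A sketch of such a renaming step, as it could be built into xmlout, follows; the mapping table is invented for this illustration, and all lines of the internal format other than start and end tags pass through untouched:

    #include <stdio.h>
    #include <string.h>

    /* Map a CommonMark DTD element name to a target document type
     * name (here: DocBook); unknown names pass through unchanged. */
    static const char *rename_gi(const char *gi)
    {
        static const struct { const char *from, *to; } map[] = {
            { "paragraph", "Para" },
        };
        size_t i;
        for (i = 0; i < sizeof map / sizeof map[0]; ++i)
            if (strcmp(gi, map[i].from) == 0)
                return map[i].to;
        return gi;
    }

    int main(void)
    {
        char line[8192], gi[256];
        while (fgets(line, sizeof line, stdin)) {
            if ((line[0] == '(' || line[0] == ')')
                    && sscanf(line + 1, "%255s", gi) == 1)
                printf("%c%s\n", line[0], rename_gi(gi));
            else
                fputs(line, stdout);    /* everything else: copy */
        }
        return 0;
    }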

5.2 A generic “adapter” for existing mark-up processors

If an existing mark-up processor can not be modified to adhere to the conventions established in this concept (reading the internal ESIS representation format from stdin, and writing the processed ESIS to stdout again etc)—be it that no source code is available, or no developer to do these modifications—one could still use the existing processor in the “mark-up” chain as long as it exhibits, at least in some invocation, this simple behaviour:

  1. It reads plain text in the specific mark-up syntax, either from a file named in a command-line argument, or from stdin;

  2. it produces a sequence of XML elements (native element types or “common” ones like XHTML element types), either into a file named in a command-line argument, or to stdout.

To use such an existing mark-up processor in the “mark-up chain”, a special “adapter” process could feed plain text into it, and fetch the output XML element sequence from it.

All of the behaviour required to function in the “mark-up chain” would be implemented in the “adapter” processor itself:

  1. The parsing and filtering of the internal ESIS representation coming in from the processor ahead in the chain, as well as

  2. converting and inserting the element sequence produced by the existing mark-up processor into the ESIS stream going out to the following processor in the chain.

The “adapter” processor would be linked into the chain in the exact same way as any other mark-up processor, namely through standard input and standard output and the internal ESIS representation transmitted through these channels, and would invoke the existing mark-up processor “privately” just for the transformation of plain text fragments into XML element sequences.

Apart from the one-time effort to implement such an “adapter” processor, some performance cost would also be incurred in using it: the existing mark-up processor would have to be started as a separate process for each plain text fragment to be transformed.

Furthermore, if the existing mark-up processor can only read from and write to named files, the adapter process would have to communicate with it not through pipes, but through temporary files.

All in all it is certainly better to adapt an existing mark-up processor to be used directly in the “mark-up chain”, but an “adapter” processor of this kind could in fact provide a work-around in some cases.

And implementing the “adapter” processor would be a nice little development experiment, too …
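
As a starting point, the adapter’s “private” invocation of the wrapped tool could look like this minimal sketch (standard C only; the tool name markdown and the file names are placeholders, and a real adapter would convert the XML output into the internal ESIS format instead of copying it):

    #include <stdio.h>
    #include <stdlib.h>

    /* Run the existing processor over one plain text fragment,
     * communicating through temporary files. */
    static int run_processor(const char *fragment)
    {
        FILE *f = fopen("fragment.txt", "w");
        int c;
        if (f == NULL)
            return -1;
        fputs(fragment, f);
        fclose(f);

        /* invoke the existing tool; the command is a placeholder */
        if (system("markdown fragment.txt > fragment.xml") != 0)
            return -1;

        f = fopen("fragment.xml", "r");
        if (f == NULL)
            return -1;
        while ((c = fgetc(f)) != EOF)
            putchar(c);    /* ... ESIS conversion would go here ... */
        fclose(f);
        return 0;
    }

    int main(void)
    {
        return run_processor("# Hello\n") == 0 ? 0 : 1;
    }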

6 Required new tools and changes to existing tools

Developing new or adapting existing tools for use in the “mark-up chain” should be pretty easy. Each tool must read the internal ESIS representation from standard input, transform (only) the <mark-up> elements which are of its concern, and write the resulting ESIS representation to standard output again.

Note that character data output into the internal format does not need to and in fact must not be “XML-escaped”, just as character data in the input need not and must not be “XML-decoded” (replacing < by &lt; etc). Line endings and control characters must however be converted (to and from \n\012 resp. \077) from character data input and to character data output.
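
As a sketch of the decoding direction, here is how a tool could turn one data line of the internal format back into plain text (only the escapes used in this text are handled; the function name is made up):

    #include <stdio.h>
    #include <string.h>

    /* Decode the content of one '-' line: "\\" is a backslash,
     * "\n\012" is a line break. */
    static void esis_decode(const char *p, FILE *out)
    {
        while (*p) {
            if (p[0] == '\\' && p[1] == '\\') {
                fputc('\\', out);
                p += 2;
            } else if (p[0] == '\\' && p[1] == 'n') {
                fputc('\n', out);
                p += 2;
                if (strncmp(p, "\\012", 4) == 0)
                    p += 4;    /* skip the "\012" half of the pair */
            } else {
                fputc(*p++, out);
            }
        }
    }

    int main(void)
    {
        esis_decode("line one\\n\\012line two", stdout);
        fputc('\n', stdout);
        return 0;
    }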

The txtin tool does little more than copying input to output, while the xmlout tool basically re-formats the internal format to proper XML (but must introduce entity references for < etc).

The xmlin tool should be able to rely on an XML parser library (presumably following the SAX model), and does little more than re-formatting too.

The hard part—parsing (and validating if needed against a DTD or an XML schema)—can be done by simply using existing parsers like nsgmls or Xerces.

  1. To directly use the cmark parser (or similar implementations like discount or sundown) in this chain, it has to parse and generate the internal ESIS representation format, in the way described above. One obvious way to implement this would be adding a new format for the -t option, say esis, re-using most of the implementation of the -t xml format. I would guess that this could be done in one day (a couple of hours). Alternatively, one could develop a separate program that only uses the existing API of the reference implementation: feeding plain text extracted from container elements in the input stream into cmark_parse_document(), and generating output from the parsed document in the internal ESIS format through a modified copy of cmark_render_xml(), say cmark_render_esis() (see the sketch below, after Note 3).

  2. For other mark-up processing tools, adding an option to parse and generate the internal ESIS format might take more or less effort, but is certainly not a huge task.

  3. The txtin tool is really trivial to implement.

  4. The xmlout tool is really trivial to implement without transformation capabilities; with these capabilities (and options to insert appropriate DOCTYPE declarations etc in the output) the effort increases, but again this too can certainly be done in a couple of hours.

  5. The xmlin tool is rather trivial to implement, using a non-validating XML parsing library like expat (by James Clark, too), or tinyxml2, or libxml etc.

[Note 3: I already have a prototype implementation using expat, which took me probably less than an hour to implement (using expat for the first time): it produces the same output as nsgmls for well-formed XML documents without a DTD (with the—unproblematic—exception that nsgmls likes to output Aattr IMPLIED val lines for attributes missing in the element but which already had a value definition somewhere before; and that the order in which A lines appear may differ between parsers).]
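
As a starting point for the free-standing variant mentioned in item 1 above, a sketch using the existing cmark API could look like this (it still renders XML; the hypothetical cmark_render_esis() would be dropped in for cmark_render_xml() once written):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cmark.h>

    int main(void)
    {
        /* The plain text of one container element; a real tool
         * would extract it from the incoming ESIS stream. */
        static char buf[1 << 20];
        size_t len = fread(buf, 1, sizeof buf, stdin);

        cmark_node *doc = cmark_parse_document(buf, len,
                                               CMARK_OPT_DEFAULT);
        char *out = cmark_render_xml(doc, CMARK_OPT_DEFAULT);
        /* ... here the hypothetical cmark_render_esis(doc,
         * CMARK_OPT_DEFAULT) would be called instead ... */
        fputs(out, stdout);

        free(out);
        cmark_node_free(doc);
        return 0;
    }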

7 Embedding “foreign” syntax into CommonMark texts

One important question was avoided in the concept so far: How is the CommonMark text supposed to look if “foreign” mark-up syntaxes are to be mixed in? While this is a topic open for debate, as the order of tools in the presented pipeline indicates, the “foreign” syntaxes would be processed after the (modified) cmark filter has done its job: the other tools act as post-processors.

[Note 4: How the different syntaxes are delimited and recognized exactly is still TBD and the topic of discussion.]

The design presented in this section provides two things: a configurable way to delimit and recognize blocks of “foreign” syntax in CommonMark text, and a uniform way to package the recognized plain text for the post-processors in the chain.

[Note 5: In the “mark-up chain” concept as presented here, these two items have to be provided (ie settled upon, documented, and implemented) in any implementation tailored for specific rules and conventions on what mixing “extended” syntax into CommonMark documents looks like.]

The following mechanism could be implemented in the new -t esis processing mode of a modified cmark (or in any free-standing CommonMark processor, of course).

7.1 Configurable delimiters

The parser is provided with a configuration file containing patterns to match against input lines. These patterns come in pairs, where the first pattern specifies the form of a “mark-up block start” line, and the second the form of the corresponding “mark-up block end” line.

The configuration file specifying these patterns could look like this:

"^+-- @ ---$" "^---$" "Z Notation"
"C*" "C Programming Language"

The meaning of these two lines for the handling of “foreign syntaxes” is:

  1. The first line shown here contains three strings: the first two specify a pair of patterns intended to detect the “mark-up block start” and “mark-up block end” lines of a block written in the notation named by the third string (Z Notation in this case).

    In the simple example pattern syntax used here, every character stands for itself except “@” (COMMERCIAL AT), which matches any non-empty string not containing the following literal character, a SPACE in this case.

  2. The second line specifies (this time using a glob pattern) one set of info strings usable in starting lines for fenced code blocks. The matching info strings will be recognized as indicating the “notation” C Programming Language.

[Note 6: The exact syntax for expressing these two kinds of patterns, and which algorithm(s) to use for matching, is TBD.]
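
Just as an illustration of how simple such a matcher can be, here is a sketch in C for the “@” pattern syntax used in the first configuration line; it assumes the “^” and “$” anchors have already been stripped by the configuration reader, and always matches whole lines:

    #include <stdio.h>
    #include <string.h>

    /* Match `line` against `pat`: every character stands for itself,
     * except '@', which matches a non-empty run of characters not
     * containing the literal character following it; the run is
     * copied into label[]. Returns 1 on a whole-line match. */
    static int match_line(const char *pat, const char *line,
                          char *label, size_t labelsz)
    {
        size_t n = 0;
        label[0] = '\0';
        while (*pat) {
            if (*pat == '@') {
                char stop = pat[1];
                if (*line == '\0' || *line == stop)
                    return 0;          /* '@' must match something */
                while (*line && *line != stop) {
                    if (n + 1 < labelsz)
                        label[n++] = *line;
                    ++line;
                }
                label[n] = '\0';
                ++pat;
            } else if (*pat++ != *line++) {
                return 0;
            }
        }
        return *line == '\0';
    }

    int main(void)
    {
        char label[64];
        if (match_line("+-- @ ---", "+-- NAME ---",
                       label, sizeof label))
            printf("label: %s\n", label);  /* prints "label: NAME" */
        return 0;
    }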

Because fenced code blocks with info strings are part of the core CommonMark syntax, and recognizing such fenced code blocks is already implemented in cmark, all we need to implement is a modified treatment of fenced code blocks with an info string—no other modification of cmark’s behaviour is required, and no more configuration set-up to “send” the pertaining code block content to an appropriate and available post-processor is needed.

[Note 7: There are stricter requirements than “three or more tilde characters” imposed on a block-ending fenced line by the CommonMark specification anyway, which are hard to express in a pattern—luckily no such pattern is needed.]

When a plain text input line is encountered that matches a configured “mark-up block start” line, the CommonMark processor has to “package” the raw text content of the block starting there into a “container element”, and send out this container element into the output for the consumption by the post-processors which come later in the chain.

7.2 Packaging plain text for post-processing

The proposed element type name, or general identifier (GI) in SGML terminology, for the container elements is mark-up, and three attributes are defined for it. The (SGML) element and attribute list declarations are:

<!ELEMENT mark-up - - CDATA >
<!ATTLIST mark-up
          notation CDATA                   #REQUIRED
          mode     (vertical | horizontal) #IMPLIED
          label    CDATA                   #IMPLIED>

The same declarations apply in XML, but without the “- -” in the element declaration.

[Note 8: In the SGML element declaration, “- -” specifies that neither the start tag <mark-up> nor the end tag </mark-up> can be omitted in a marked-up SGML document.]

The attributes of <mark-up> have the following meaning:

  1. the “mode” of this <mark-up> element: “vertical” for a block of “foreign” syntax (it would be “horizontal” for in-line “foreign” syntax; and yes, this is TeX jargon ;-),

  2. the “notation” (ie kind of syntax) in which the plain text content is written (this is defined in the configuration and depends on the form and/or the label of this block),

  3. the “label” extracted using the pattern match, if any;

7.3 Rules for handling CommonMark “mark-up blocks”

A CommonMark processor should handle blocks in “foreign” syntax for the “mark-up chain” process according to the following rules, using the patterns and notation names defined in the configuration file introduced above.

One can call “mark-up blocks” both forms of blocks which require post-processing: fenced code blocks bearing a “known” info string, and blocks delimited by a configured pair of “mark-up block start” and “mark-up block end” lines;

and both are processed and packaged in the same way, except for the different ways of detecting the starting fenced code block line resp. the “mark-up block start” line, namely: by matching the info string of the opening code fence line against a configured (glob) pattern in the one case, and by matching the whole input line against a configured “mark-up block start” pattern in the other;

and vice versa for finding the corresponding ending fenced code block line resp. “mark-up block end” line.

The actions the CommonMark processor needs to take to process a “foreign mark-up syntax” block—when encountering a plain text input line which matches a configured pattern—are these:

  1. The parser closes the currently open block-level element (if any);

  2. an (optional) label is extracted from this line using the pattern match;

  3. a start tag for a <mark-up> element is generated into the output, using attribute values derived from the configuration and the pattern match: the configured notation, the extracted label (if any), and the mode “vertical”;

  4. the “block start” line itself is copied as character data into the output;

  5. all plain text up to the corresponding “mark-up block end” line is not processed in any way, but taken “literally”, and output as character content (substituting \n\012 for EOL etc as usual, but not entity-encoding < for example);

  6. when the corresponding “mark-up block end” line occurs in the plain text input, the end tag for this <mark-up> element instance is output after copying the “block end” line into the output.

[Note 9: The first step (closing the currently open block-level element) is only needed if there is no blank line preceding the “foreign mark-up start” line matched. It would probably be better to require this blank line to be there, for in that case there is never an open block-level element that needs closing when a “foreign mark-up start” line is detected. Furthermore, the parser would only need to match each first non-empty line after (a sequence of) blank lines, instead of every single input line.]

[Note 10: The reason for copying the starting and ending plain text lines of a “foreign mark-up block” into the <mark-up> element’s content is simple: the post-processor might actually need to see them to decide what the raw plain text between them means. This is in fact the case for marked-up blocks in the Z Notation (like the one shown in the example below), where the form of the starting line determines the syntactic category of (most) such blocks.]

Processing fenced code blocks labeled with a “known” info string (meaning: one for which there is a match in the configuration file) follows the same steps, except that the block starts with the opening code fence line whose info string matches, and ends with the corresponding closing code fence line as defined by the CommonMark specification (see Note 7).

Note that both these transformations need only be implemented once in a CommonMark processor, and after that pretty much any syntax for starting and ending blocks of “extended” syntax can be defined in the configuration file and recognized by pattern matching (of an expressive enough kind).

Note also that the “plain text input” for the CommonMark processor is itself the character data content of a <mark-up> element in the ESIS input stream.

The CommonMark processor will see only one such <mark-up> element if the input into the “mark-up chain” is an XML-wrapped plain text file (in which case this one element contains the complete plain text content of the input); but may see many such elements otherwise (when processing a parsed XML/SGML document).

Since the CommonMark processor packages all plain text fragments recognized as “foreign” notations into the appropriate <mark-up> elements, the post-processors in the tail of the processing chain need not be concerned with any “foreign” or CommonMark syntax—they just need to filter “their” plain text from the appropriate <mark-up> elements.

7.4 Examples of packaged “foreign syntax blocks”

Here is how mark-up using the configured “extended syntax” (using the configuration file shown above) would be transformed into the ESIS output of the CommonMark processor:

A fenced code block with an info string "C" like this one:

~~~ C
int foo();
~~~

would be transformed into a <mark-up> container element instance in the output stream like that one (written here in XML format, not in the internal ESIS representation):

<mark-up mode="vertical"
         label="C"
         notation="C Programming Language">
<![CDATA[~~~ C
int foo();
~~~]]></mark-up>

And a “foreign mark-up syntax block” (in the example configuration: a “Z Schema definition paragraph”) in the CommonMark input plain text looking like this:

+-- NAME ---
    DeclPart
|--
    Predicate
---

would be transformed into that <mark-up> element:

<mark-up mode="vertical"
         label="NAME"
         notation="Z Notation">
<![CDATA[+-- NAME ---
    DeclPart
|--
    Predicate
---]]></mark-up>

[Note 11: The CDATA section mark-up is shown here simply to illustrate that the plain text inside the <mark-up> element is not to be parsed in any way. But then the whole XML/SGML mark-up shown is just for illustration: there are no CDATA sections in the ESIS and hence in the internal format to represent the ESIS—once more the internal format is simpler than the XML mark-up!]
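
For comparison, derived from the format rules given earlier, the same element would be transmitted in the internal format as these six lines (the order of the A lines may vary):

Amode CDATA vertical
Alabel CDATA NAME
Anotation CDATA Z Notation
(mark-up
-+-- NAME ---\n\012    DeclPart\n\012|--\n\012    Predicate\n\012---
)mark-up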

7.5 Filter processing by tools in the mark-up chain

It should be obvious how the CommonMark processor at the head of the chain and every post-processor later on can watch out for the <mark-up> elements it needs to handle: each filters the incoming ESIS stream for <mark-up> start tags whose notation attribute names a notation it implements, and passes everything else through unchanged.
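
A skeleton of such a post-processor might look like the following sketch: it holds back A lines until it knows whether the element is “its” <mark-up> element, collects the character data if so, and otherwise passes everything through. The function process_and_emit() and the output element name zed are placeholders:

    #include <stdio.h>
    #include <string.h>

    #define MINE "Anotation CDATA Z Notation\n"

    /* Placeholder: a real filter would parse `text` in its own
     * notation and emit the result as ESIS elements. */
    static void process_and_emit(const char *text)
    {
        printf("(zed\n-%s\n)zed\n", text);
    }

    int main(void)
    {
        char line[8192], pending[8192] = "";
        static char text[65536];
        size_t len = 0;
        int mine = 0, match = 0;

        while (fgets(line, sizeof line, stdin)) {
            if (mine) {                       /* inside our element   */
                if (line[0] == '-') {         /* collect content      */
                    size_t l = strlen(line + 1);
                    if (len + l < sizeof text) {
                        memcpy(text + len, line + 1, l);
                        len += l;
                    }
                } else if (line[0] == ')') {  /* end tag: emit result */
                    text[len] = '\0';
                    if (len > 0 && text[len - 1] == '\n')
                        text[len - 1] = '\0';
                    process_and_emit(text);
                    mine = 0;
                }
                continue;
            }
            if (line[0] == 'A') {             /* hold attributes back */
                if (strcmp(line, MINE) == 0)
                    match = 1;
                strncat(pending, line,
                        sizeof pending - strlen(pending) - 1);
                continue;
            }
            if (match && strcmp(line, "(mark-up\n") == 0) {
                mine = 1;                     /* swallow this element */
                len = 0;
            } else {
                fputs(pending, stdout);       /* not ours: pass on    */
                fputs(line, stdout);
            }
            pending[0] = '\0';
            match = 0;
        }
        return 0;
    }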

[Note 12: As the first example above demonstrates, the label attribute of the <mark-up> element could be used to transmit processing parameters and options written in the CommonMark source to the designated post-processor, which is a nice feature.

On the other hand: the label attribute’s value transmitted this way is never checked for validity, apart from the fact that it matches the appropriate pattern in the configuration file.

As a consequence, every post-processor must be able to cope with any value (that matches the configured pattern) in the label attribute, and must generate a formally valid—though possibly nonsensical—element sequence as output.

This is because the apparent alternatives, namely

are all not viable if you think about it.]

7.6 Properties of the “syntax extension” design

This model of handling and post-processing “foreign” or “extended” syntax in CommonMark typescripts has several advantages:

  1. The CommonMark specification needs no modification or extension,

  2. the cmark processor needs to be modified only once (for handling labeled code blocks, and for matching input lines against the configured patterns), and after that

  3. introducing another “foreign syntax” together with a processor for it requires only adding one line in the configuration file (and inserting the processor for this kind of “plain text” mark-up into the “mark-up chain” of course!)

  4. none of the mark-up processors in the chain after cmark needs to know anything about CommonMark or Markdown or any other typescript syntax (in fact, they could be used alone and without cmark just the same to process plain text fragments in any XML/SGML document!),

  5. it can be easily extended to accommodate in-line “foreign” syntax fragments in the exact same way. The main difficulty would be to decide on a delimiter syntax which is both flexible and unobtrusive. One idea would be to use a non-ASCII character as in

    ... quote the ´C`int foo();` declaration ...
    

    (that is, prefixing a U+00B4 ACUTE ACCENT, followed by a label, to a code span) and then transform that into

    ... quote the <mark-up mode="horizontal" label="C">
    <![CDATA[int foo();]]></mark-up> declaration ...
    

    This would offer completely the same range of possibilities as the info string in fenced code blocks does; and it seems natural to equate the info strings of fenced code blocks with the prefixed strings for code spans (in this delimiter style or a different one).

    Note that—after a one-time modification of the CommonMark processor’s code span handling—this would require no configuration settings if the configuration file already has an info string pattern indicating the “foreign” notation used here—thus introducing a new “foreign” syntax for both block and in-line use is one and the same step.

    Note also that the only difference in the generated <mark-up> element compared to the code block case (aside from the obviously missing “code block start” and “end” lines) is the attribute mode, which indicates that this is an in-line span (horizontal) and not a separate block (vertical), so the “right” post-processor already receives the packaged character data—it might or might not make a difference in processing whether the <mark-up> element represents a block or an in-line span (as indicated by the mode attribute), but everything else is the same for both sorts of “foreign” mark-up fragments.

[Note 13: Placing the txtin tool in front of the “mark-up chain” in order to implement a process to transform a “conventional” plain text input in CommonMark notation into the target document would provide processing of “foreign syntax” embedded in plain text input files too.]

8 Conclusion

I am convinced that the approach presented as the “mark-up chain” provides a useful and viable extension to the way Markdown and similar “human-oriented” plain text syntaxes can be used.

8.1 Effort

I expect that the required modifications to the cmark processor and the new little tools (txtin, xmlin, xmlout, and possibly an “adapter” processor)

can be implemented with moderate effort and without disrupting the existing implementations.

8.2 Gains

Using these modified or new tools in the proposed way would bring huge gains for the flexible use of CommonMark processing: it would provide a robust and extensible way to mix “foreign” syntaxes into the processing of “conventional” plain text input files

on the one hand, and beyond that

extend the application of CommonMark into new areas like the XML/SGML authoring scenario sketched at the beginning: one could use “extended” CommonMark for authoring eg DocBook documents, while retaining the complete authoring XML/SGML tool chain: editing, validating, transforming (into HTML, PDF, eBook, whatever) can all be done as before, with one additional step introduced: transform an XML/SGML document with “plain text” mark-up in it into an XML/SGML output document without such plain text, but conforming to the target document type (ie the DocBook DTD resp. XML Schema or whatever the target XML/SGML document type may be).


© 2015 tin-pot.net. CC BY-SA 4.0 license applies.

$Date: 2015-10-22 05:08:40 +0200 (Do, 22 Okt 2015) $