mh@tin-pot.net
2015-10-22
This text presents a concept how CommonMark syntax could be extended by other, similar “human-oriented”, “plain text” mark-up notations like
and how this “extended” syntax could be used not only in processing plain text files (like in the “conventional” use case for CommonMark) but also how the “extended” CommonMark syntax could be used
like for authoring and processing of DocBook documents, for example.
[Note 1: The Z Notation is one of the more popular formal specification languages: it is based on ordinary mathematics (Zermelo-Fraenkel set theory, probably the reason for the name “Z”) and is formally defined in ISO/IEC 13568:2002. This standard also specifies the “human-oriented” plain text mark-up mentioned here, intended primarily for e-mail and similar circumstances. The Z Notation standard is publicly available at no cost (as a zipped PDF file), and the Corrigendum 1 is too—but the standard is most certainly not an easy-to-read “introduction” to the Z Notation!]
[Note 2: The designation “XML/SGML”, or even “XML/XHTML/SGML/HTML”, emphasizes that there are differences between XML and SGML. But luckily it turns out that these differences are mostly irrelevant for the concept presented here: the same concept and all of the same mark-up processor implementations can be used to process XML as well as SGML documents.]
Given the requirement (obvious, in my opinion) that CommonMark syntax and whatever other syntaxes are to be made available will each be processed by a specific, separate processor (each one a separate process, where the next one is processing the output of the previous one), the obvious design questions and decisions are:
How do we interface the processing of “plain text” mark-up (performed by mark-up processors like cmark etc) with the rest of the authoring environment: both at the input and the output end of our “plain text” mark-up process?
How do we combine the processors for the various “plain text” mark-up syntaxes so that each one of them can do its job, without interfering with the other syntaxes and their processors?
Some goals for and important qualities of an implementation (comprising the modification of existing tools and/or the implementation of new ones) are obvious:
We want—of course—to spend as little effort as possible to implement the whole thing (albeit there is no money to waste nor gain here anyway …), and
we want to introduce only minimal modifications which are really required into existing mark-up processors, without disrupting their current behaviour.
The implementation has to be general: agnostic about the number or nature of syntaxes and processors employed, which is a pre-requisite for
another important design goal: adaptability. It is obviously important to strive for an implementation that permits the easy incorporation of additional “foreign” syntaxes and the pertaining mark-up processors, with as little change to the existing tools as possible, or the replacement of a specific mark-up processor with a new one for the same mark-up syntax, and so on.
Of course the implementation has to be platform-independent if possible at all (ideally using only features of Standard C or Standard C++), or at least be easy to port (say by providing a thin abstraction layer above platform-dependent APIs).
One “soft” goal (but important to me) is that whatever “syntax extensions” of the current CommonMark specification are introduced should have a “natural” feel and ought to fit in without accidental conflicts (with existing mark-up, or with existing styles of authoring CommonMark documents). The best way to “extend” the CommonMark specification would of course be no extension at all, but to instead use the existing notations where the specification only mentions “implementation-dependent” behaviour.
The plain text mark-up-processing chain, or “mark-up chain” concept is my answer to these (and other) requirements and design questions and constraints.
A key idea here and one major goal is that an author would write inside his XML/SGML document plain text in (a mixture of) “human-oriented” mark-up like CommonMark as the unparsed character content of element (instances) of some specific XML/SGML element type, let’s call them “container elements”. All other content, structure, and (XML/SGML) mark-up of the document should need no change whatsoever.
The job of the “mark-up chain”, which is really no more than a bunch of “filters” (ie processes transforming input text to output text) chained together in a pipeline for the processing of mark-up, is then
to extract these plain text fragments from the container elements inside the XML/SGML document,
process the “plain text” mark-up according to CommonMark (and other) rules,
replace these container elements with the XML/SGML fragments (sequences of element instances) resulting from the processing.
The output of this “mark-up chain” would then be again an XML/SGML document: without any embedded “marked-up plain text”, but with
the elements generated by transforming these plain text fragments
together with the unchanged XML/SGML content from the input document around them.
As we will see, the same process implemented in the “mark-up chain” can be just as well used to generate XML/SGML/HTML output from “conventional” plain text input files: the same transformation that cmark and other processors accomplish already.
But with an important twist: we will see that the proposed “mark-up chain” provides a robust, flexible and extendable way to augment CommonMark syntax with “foreign syntaxes” processed by other tools, and this “extended syntax” can be used when processing “conventional” plain text input files just as well as in the case of XML/SGML input documents.
In a usage scenario for the proposed “mark-up chain” an author would write his structured document (or his typescript)
either directly into an XML/SGML document (probably using an XML/SGML editing/validating tool like Oxygen XML),
or would prefer to use CommonMark notation as far as possible, and to write his original typescript as a simple plain text file in CommonMark syntax, interspersed (in some way) with text in other mark-up syntaxes like the ones mentioned earlier.
The input containing the marked-up “plain text” in this scenario would in the first alternative not consist of an entire plain text file, but the marked-up plain text would rather be encapsulated in XML/SGML element instances, embedded into a complete XML/SGML document.
To treat both cases the same, the plain text file’s content in the second case could be embedded into the root element of an XML/SGML document and then presented to the mark-up tool chain.
Similarly, the output at the end of processing the “human-oriented” plain text would not necessarily always be an (X)HTML document, created entirely by the tool chain, but could basically be any XML/SGML document of a type relevant for the whole authoring process at hand (think of DocBook for example). And some or most of the content of this document is not generated by the processors in our mark-up tool chain, but created in some other authoring tool, presented as the complete input XML/SGML document to our tool chain, and passed through all mark-up processors and into the final output XML document of our tool chain.
So both the “initial” input to and the “final” output from our tool chain are XML or SGML documents, passed in from and out to the rest of the authoring environment.
If the author prefers to prepare his typescript as a simple text file, we can trivially wrap this plain text file’s content inside an element designated to “transport” such marked-up plain text, and proceed using this as our “initial” input XML document, which would look like this:
<?xml version="1.0" encoding="UTF-8"?>
<mark-up mode="vertical" notation="CommonMark">
<![CDATA[## Your _Markdown_ Input ##
The _Markdown_ input appears <em>unscathed</em> inside the
<mark-up> element here!]]></mark-up>
Otherwise, we would receive an XML/SGML document already containing such <mark-up> elements where the plain text to process can be found inside.
Every mark-up processor (for CommonMark, for Z Notation, etc) used in the chain would thus process only these <mark-up> element instances and replace them with the result from transforming their (plain text) input into XML/SGML elements (a single one, or a sequence of elements representing, for example, multiple paragraphs).
Everything outside such <mark-up> elements is just passed through by each mark-up processor.
We already have seen that each mark-up processor outputs a well-formed XML document and passes it to the input of the next processor in the chain.
So at the end of a chain like this
cat input.xml | cmark | zml | asciimath > output.xml
we have an output XML document where (hopefully) all the <mark-up> elements are gone and have been replaced by the result of interpreting their plain text content, and this interpretation result itself is represented simply as a sequence of the appropriate XML/SGML elements, specific to the mark-up processor or the syntax it transforms.
Depending on the document type used in the authoring environment, another step may be required to transform these output fragments into the appropriate elements of the target document type: for example, the current CommonMark DTD uses the element type name <paragraph> for ordinary paragraphs of text, but HTML uses <P>, while DocBook uses <Para>. (An HTML or XML DOCTYPE declaration may also need to be inserted into the output document, and similar “final touches”.)
Furthermore, the content model for the CommonMark <paragraph> element type encompasses only other elements (notably the <text> element), while HTML’s <P> as well as DocBook’s <Para> elements have #PCDATA in their content models: the character content appears immediately inside these paragraph elements, while in CommonMark it is nested inside <text> (among other) elements.
To illustrate the differences, here are the respective parameter entity declarations and element declarations for the “just a simple paragraph of text” element type in the CommonMark DTD, in the ISO/IEC 15445:2000 HTML DTD, and in the DocBook 3.1 DTD:
In the CommonMark DTD:
<!ENTITY % inline
'text|softbreak|linebreak|code|inline_html|emph|strong|link|image'>
<!ELEMENT paragraph (%inline;)*>
In the ISO HTML DTD:
<!-- Character level elements and text strings -->
<!ENTITY % text "#PCDATA | %physical.styles; | %logical.styles;
| %special;
| %form.fields;" >
<!ELEMENT P - O (%text;)+ >
In the DocBook 3.1 DTD:
<!ENTITY % para.char.mix "#PCDATA
|%xref.char.class; |%gen.char.class;
|%link.char.class; |%tech.char.class;
|%base.char.class; |%docinfo.char.class;
|%other.char.class; |%inlineobj.char.class;
|%synop.class;
|%ndxterm.class;
%local.para.char.mix;">
<!ELEMENT Para - O ((%para.char.mix; | %para.mix;)+)>
Transformations of this kind are best done using an XSLT step, and are not really in the scope of our “mark-up” chain. So we’ll mostly ignore them from now on.
An instance of this kind of “mark-up chain” could be a pipeline composed of text filter processes, each processing only one kind of “human- oriented” plain text mark-up:
cat authored.xml | markdown | znotation | ... > final.xml
where markdown and znotation stand for some kind of processor for CommonMark and Z Notation syntax, respectively; we’ll come to the details about these processors later.
While this is a slightly simplified representation, a major goal of the “mark-up chain” concept is in fact to meet the requirement that all mark-up processors are filters (ie separate processes) which can be composed as needed into a pipeline of this kind.
We already have seen two more elements of the chain:
The final.xml document could be the output not of a mark-up processor, but of an XSLT step (where xsl here stands for the XSLT processor of choice, together with the required options and XSLT files):
cat authored.xml | cmark | ... | xsl > final.xml
Processing an initial plain text file needs a wrapping step to place the plain text inside an XML/SGML document. This wrapping can trivially be done by a tiny tool, let’s call it txtin:
cat plain.txt | txtin | cmark | ... | xsl > final.xml
We assumed so far that all tools in the chain are connected through text streams, and pass XML/SGML documents from one step in the pipeline to the next.
But this would imply that each of these tools ought to be capable of parsing and processing whole XML/SGML documents; and even if the required parser could be shared among the tools (in a DLL or shared library), I find the idea that each tool drags around a parser like this horrible indeed.
We would be much better off if the document content could be passed along in
a format which is simple to parse and generate, but it obviously needs to represent faithfully the whole “intended” content (disregarding differences in XML/SGML mark-up) of the XML/SGML document which we feed into the pipeline in the first place, notably those parts of the content which are neither recognized nor used in the mark-up chain but simply passed through.
And what is the “intended” content of XML/SGML documents anyway— disregarding the format of representation and mark-up?
Luckily, there are already answers to both questions:
The “intended” content of an SGML document is the so-called “Element Structure Information Set” (ESIS, ISO 8879), and that of an XML document is called the “XML Information Set”. Since XML is kind-of a specialization of SGML, we can treat both concepts the same (or at least assume that we can in the following).
A nice representation of this “document content”, or ESIS, as an easy-to-parse text format does already exist too: it is the native output format of the widely used XML/SGML parsers SP (nsgmls) and OpenSP (onsgmls), both written and published by James Clark. The format description fits on a single page, and we need only a fragment of it.
If we settle for this format, each tool can quite easily filter out the plain text of its concern by reading the input stream in this format line by line, watching out for column one in each line, and following these rules:
lines that start with neither A nor ( are simply copied to the output,
for lines starting with A (attribute definition), remember the attribute value they specify,
for lines starting with ( (element start tag), check if this is a container element containing plain text in the right syntax:
If not so, copy the remembered A lines and this ( line to the output, relax and go on reading more input;
If so, start using the character content that follows until the corresponding ) line (element end tag) to this ( occurs, process the character content into the output, and after that: relax and go on reading more input.
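The rules above can be sketched as a small line filter. This is only an illustration under stated assumptions: the function name filter_esis and the process callback are hypothetical, container elements are taken to be <mark-up> elements selected by their notation attribute, and (since their content is plain character data) the first ) line after the start tag is taken as the matching end tag.

```python
def filter_esis(lines, notation, process):
    """Copy ESIS lines, replacing the container elements this tool handles.

    lines    -- iterable of ESIS lines (without trailing newline)
    notation -- value of the 'notation' attribute this tool is responsible for
    process  -- hypothetical callback: raw character data -> list of ESIS lines
    """
    out = []
    pending = []                      # remembered 'A' lines for the next start tag
    it = iter(lines)
    for line in it:
        if line.startswith("A"):
            pending.append(line)
        elif line.startswith("("):
            # Collect the attribute values remembered for this start tag.
            attrs = {}
            for a in pending:
                name, _, rest = a[1:].partition(" ")
                _kind, _, value = rest.partition(" ")
                attrs[name] = value
            if line[1:] == "mark-up" and attrs.get("notation") == notation:
                # Our container: gather character data up to the end tag.
                data = []
                for inner in it:
                    if inner.startswith(")"):
                        break
                    if inner.startswith("-"):
                        data.append(inner[1:])
                out.extend(process("".join(data)))
            else:
                # Not ours: pass the attribute lines and the start tag through.
                out.extend(pending)
                out.append(line)
            pending = []
        else:
            out.append(line)          # everything else is copied unchanged
    return out
```

The callback receives the character data still in its escaped form; decoding it is a separate concern.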
The format and meaning of the lines of concern is determined by a flag character in column 1:
“A”: An attribute value in the form
Aattr CDATA value
There are other forms, but the only other form occurring in the output from parsing an XML file without a DTD using nsgmls is
Aattr IMPLIED value
which can be ignored and passed on.
“(”: An element start tag; the element name (GI) follows, starting in column 2.
“-”: Character content, which follows, starting in column 2, encoded in UTF-8. The only “escape-encoding” used for it is similar to the convention followed eg in C for string literals:
“\n” represents EOL (the record end, ie the end of a line).
“\\” represents a backslash (U+005C: REVERSE SOLIDUS).
“\ddd”, where ddd are three octal digits, represents the character with the given number (up to the number 511, corresponding to U+01FF LATIN SMALL LETTER O WITH STROKE AND ACUTE, but usually it is just a control character in the range U+0000 .. U+001F). The character \012 (octal 12, ie U+000A: LINE FEED, emitted as the record start) usually follows \n immediately, and can be ignored.
“\#n;” where n is a decimal number, represents the character with the given number, in our case: the Unicode code point. (This should never be needed nor occur in UTF-8 encoded text.)
“)”: An element end tag; again the element name (GI) follows, starting in column 2. When reading well-formed documents with balanced start and end tags, the name can be ignored.
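As an illustration of the escape conventions just listed, here is a minimal decoding sketch. The function name decode_data is hypothetical, and dropping every \012 is the simplification suggested in the text, not a complete treatment of the format.

```python
import re

# One escape: backslash followed by '\', 'n', three octal digits, or '#n;'.
_ESCAPE = re.compile(r"\\(\\|n|[0-7]{3}|#[0-9]+;)")

def decode_data(data):
    """Turn the escaped character data of a '-' line into plain text."""
    def repl(match):
        esc = match.group(1)
        if esc == "\\":
            return "\\"
        if esc == "n":
            return "\n"
        if esc.startswith("#"):
            return chr(int(esc[1:-1]))      # decimal code point
        if esc == "012":
            return ""                       # record start: ignored
        return chr(int(esc, 8))             # octal character number
    return _ESCAPE.sub(repl, data)
```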
So generating output containing new elements (as a result of processing the plain text mark-up) is equally trivial:
First output the attributes, line by line, using
Aattr CDATA value
Output the start tag:
(name
Output the character content (if any):
-text including\n\012to represent line breaks
and enclosed other elements (if any).
Output the end tag:
)name
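The four output steps could look like this in code. It is a sketch only: emit_element and encode_data are hypothetical names, all attributes are written as CDATA, and applying the same escaping to attribute values as to character data is an assumption.

```python
def encode_data(text):
    """Escape plain text for use in a '-' line of the internal format."""
    out = []
    for ch in text:
        if ch == "\\":
            out.append("\\\\")
        elif ch == "\n":
            out.append("\\n\\012")          # line break, as described above
        elif ord(ch) < 32:
            out.append("\\%03o" % ord(ch))  # other control characters, octal
        else:
            out.append(ch)
    return "".join(out)

def emit_element(name, attrs, text=""):
    """Return the ESIS lines for one element with character content only."""
    lines = ["A%s CDATA %s" % (key, encode_data(value))
             for key, value in attrs.items()]
    lines.append("(" + name)
    if text:
        lines.append("-" + encode_data(text))
    lines.append(")" + name)
    return lines
```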
The use of this format has one additional advantage: we can start the mark-up chain with any XML/SGML document by parsing it first and feeding the parser output into the following tools.
Our txtin tool would produce output in this format, and it is easy to see that one consequence of this is that this little tool would be even more trivial to implement.
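To show how little such a tool has to do, here is a sketch of its core. The function name and the choice of attribute values (mode vertical, notation CommonMark, mirroring the XML wrapper shown earlier) are assumptions for illustration.

```python
def txtin(text, notation="CommonMark"):
    """Wrap a whole plain text file into one <mark-up> element (ESIS lines)."""
    def escape(ch):
        if ch == "\\":
            return "\\\\"
        if ch == "\n":
            return "\\n\\012"
        if ord(ch) < 32:
            return "\\%03o" % ord(ch)       # remaining control characters
        return ch

    return [
        "Amode CDATA vertical",
        "Anotation CDATA " + notation,
        "(mark-up",
        "-" + "".join(escape(ch) for ch in text),
        ")mark-up",
    ]
```

A command-line wrapper would simply read all of standard input, call this function, and print the resulting lines.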
To summarize our possible inputs into the mark-up chain discussed so far:
txtin.
While a full-blown parser like nsgmls can parse (but obviously not validate) XML input without a DTD, it would really be “overkill” in this use. One could alternatively use a tool specifically developed for converting well-formed XML into the internal format, say xmlin.
To validate XML input against an XML Schema, one could use the SAX2Print sample program from the Xerces XML parser library. If validation was successful, it outputs a well-formed XML document (without referencing a DTD obviously), which can be fed directly into xmlin.
Note also that we need to piece together an XML (or SGML) document at the end of our chain from the internal ESIS representation: this too is trivial to implement in a dedicated tool, say xmlout. As long as the required XML transformation done in the XSLT step is simple enough it could be done inside xmlout too. For example
is certainly easy to implement, and would suffice to transform <paragraph> into <P> or <Para> as described in the example above.
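A sketch of such an xmlout step follows, assuming that the simple transformation in question is a plain renaming of element types given as a mapping. The function name, the rename parameter and the omission of any XML declaration or DOCTYPE are all illustrative choices.

```python
import re
from xml.sax.saxutils import escape, quoteattr

_ESC = re.compile(r"\\(\\|n|[0-7]{3}|#[0-9]+;)")

def _decode(data):
    """Undo the escape-encoding of the internal format."""
    def repl(m):
        e = m.group(1)
        if e == "\\":
            return "\\"
        if e == "n":
            return "\n"
        if e == "012":
            return ""                        # record start: ignored
        return chr(int(e[1:-1])) if e[0] == "#" else chr(int(e, 8))
    return _ESC.sub(repl, data)

def xmlout(lines, rename=None):
    """Serialize ESIS lines as XML text, renaming element types on the way."""
    rename = rename or {}
    out, attrs = [], []
    for line in lines:
        flag, rest = line[:1], line[1:]
        if flag == "A":
            name, _, spec = rest.partition(" ")
            kind, _, value = spec.partition(" ")
            if kind != "IMPLIED":            # implied attributes are dropped
                attrs.append(" %s=%s" % (name, quoteattr(_decode(value))))
        elif flag == "(":
            out.append("<%s%s>" % (rename.get(rest, rest), "".join(attrs)))
            attrs = []
        elif flag == ")":
            out.append("</%s>" % rename.get(rest, rest))
        elif flag == "-":
            out.append(escape(_decode(rest)))   # introduces &lt; &gt; &amp;
    return "".join(out)
```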
If an existing mark-up processor cannot be modified to adhere to the conventions established in this concept (reading from stdin the internal ESIS representation format, and writing the processed ESIS to stdout again etc), be it that no source code is available, or no developer to do these modifications, one could still use the existing processor in the “mark-up” chain as long as it exhibits at least in some invocation this simple behaviour:
It reads plain text in the specific mark-up syntax, either from a file named in a command-line argument, or from stdin;
it produces a sequence of XML elements (native element types or “common” ones like XHTML element types), either into a file named in a command-line argument, or to stdout.
To use such an existing mark-up processor in the “mark-up chain”, a special “adapter” process could feed plain text into it, and fetch the output XML element sequence from it.
All of the behaviour required to function in the “mark-up chain” would be implemented in the “adapter” processor itself:
The parsing and filtering of the internal ESIS representation coming in from the processor ahead in the chain, as well as
converting and inserting the element sequence produced by the existing mark-up processor into the ESIS stream going out to the following processor in the chain.
The “adapter” processor would be linked into the chain in the exact same way as any other mark-up processor, namely through standard input and standard output and the internal ESIS representation transmitted through these channels, and would invoke the existing mark-up processor “privately” just for the transformation of plain text fragments into XML element sequences.
Apart from the one-time effort to implement such an “adapter” processor, some performance cost would also be incurred in using it:
Furthermore,
if there is no system support for executing a child process and connecting to it through a pair of pipes,
or if the “adapter” processor is to be implemented using the standard C library only,
or if the existing mark-up processor simply does not use stdin and/or stdout in the manner required,
the adapter process would have to communicate with the existing mark-up processor not through pipes, but through temporary files.
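The two ways an adapter could talk to an existing processor can be sketched as follows. Both function names are hypothetical, and the convention that the external tool takes the input and output file names as its last two arguments is an assumption made only for this example.

```python
import os
import subprocess
import tempfile

def run_via_pipes(cmd, text):
    """Feed plain text to the tool's stdin and return what it writes to stdout."""
    done = subprocess.run(cmd, input=text.encode("utf-8"),
                          stdout=subprocess.PIPE, check=True)
    return done.stdout.decode("utf-8")

def run_via_tempfiles(cmd, text):
    """Same job through temporary files; file names are appended to cmd."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "in.txt")
        dst = os.path.join(tmp, "out.xml")
        with open(src, "w", encoding="utf-8") as f:
            f.write(text)
        subprocess.run(cmd + [src, dst], check=True)
        with open(dst, encoding="utf-8") as f:
            return f.read()
```

Either function returns the XML element sequence as text, which the adapter would then still have to convert into the internal format before inserting it into the outgoing stream.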
All in all it is certainly better to adapt an existing mark-up processor to be used directly in the “mark-up chain”, but an “adapter” processor of this kind could in fact provide a work-around in some cases.
And implementing the “adapter” processor would be a nice little development experiment, too …
Developing new or adapting existing tools for use in the “mark-up chain” should be pretty easy. Each tool must:
stdin,
Note that character data output into the internal format does not need to and in fact must not be “XML-escaped”, just as character data in the input need not and must not be “XML-decoded” (replacing &lt; by < etc). Line endings and control characters must however be converted (to and from \n\012 and \077, respectively) from character data input and to character data output.
The txtin tool does little more than copying input to output, while the xmlout tool basically re-formats the internal format to proper XML (but must introduce entity references for < etc).
The xmlin tool should be able to rely on an XML parser library (presumably following the SAX model), and does little more than re-formatting too.
The hard part, parsing (and validating if needed against a DTD or an XML schema), can be done by simply using existing parsers like nsgmls or xerces.
To directly use the cmark parser (or similar implementations like discount or sundown) in this chain, it has to parse and generate the internal ESIS representation format, in the way described above. One obvious way to implement this would be to add a new format for the -t option, say esis, and re-use most of the implementation of the -t xml format; I would guess that this could be done in one day (a couple of hours). Alternatively, one could develop a separate program that only uses the existing API of the reference implementation (feeding plain text extracted from container elements in the input stream into cmark_parse_document(), and generating output from the parsed document in the internal ESIS format through a modified copy of cmark_render_xml(), say cmark_render_esis()).
For other mark-up processing tools, adding an option to parse and generate the internal ESIS format might take more or less effort, but is certainly not a huge task.
The txtin tool is really trivial to implement.
The xmlout tool is really trivial to implement without transformation capabilities; with these capabilities (and options to insert appropriate DOCTYPE declarations etc in the output) the effort increases, but again this too can certainly be done in a couple of hours.
The xmlin tool is rather trivial to implement, using a non-validating XML parsing library like expat (by James Clark, too), or tinyxml2, or libxml etc.
[Note 3: I already have a prototype implementation using expat, which took me probably less than an hour to implement (using expat for the first time): it produces the same output as nsgmls for well-formed XML documents without a DTD (with the unproblematic exception that nsgmls likes to output Aattr IMPLIED val lines for attributes missing in the element but which already had a value definition somewhere before; and that the order in which A lines appear may differ between parsers).]
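This is not the prototype mentioned in Note 3, only a sketch of the same idea using the expat binding in the Python standard library (xml.parsers.expat). The function name xmlin follows the text; writing every attribute as CDATA and escaping attribute values like character data are assumptions.

```python
import xml.parsers.expat

def xmlin(xml_text):
    """Convert well-formed XML text into lines of the internal ESIS format."""
    out = []

    def esc(text):
        pieces = []
        for ch in text:
            if ch == "\\":
                pieces.append("\\\\")
            elif ch == "\n":
                pieces.append("\\n\\012")
            elif ord(ch) < 32:
                pieces.append("\\%03o" % ord(ch))
            else:
                pieces.append(ch)
        return "".join(pieces)

    def start(name, attrs):
        for key, value in attrs.items():
            out.append("A%s CDATA %s" % (key, esc(value)))
        out.append("(" + name)

    def end(name):
        out.append(")" + name)

    def chars(data):
        out.append("-" + esc(data))

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True            # deliver adjacent text as one chunk
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return out
```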
One important question was avoided in the concept so far: What does the CommonMark text look like if “foreign” mark-up syntaxes are to be mixed in? While this is a topic open for debate, as the order of tools in the presented pipeline indicates, the “foreign” syntaxes would be processed after the (modified) cmark filter has done its job: the other tools act as post-processors.
[Note 4: How the different syntaxes are delimited and recognized exactly is still TBD and the topic of discussion.]
The design presented in this section provides
a mechanism to implement different forms of such delimiters (using a configuration file), and
the rules how to package the raw text content of “foreign” syntax fragments into “container elements” for consumption by a post- processor (proposing an element type name and attributes, as well as rules how these attributes are to be set and interpreted).
[Note 5: In the “mark-up chain” concept as presented here, these two items have to be provided (ie settled upon, documented, and implemented) in any implementation tailored to specific rules and conventions for mixing “extended” syntax into CommonMark documents.]
The following mechanism could be implemented in the new -t esis processing mode of a modified cmark (or in any free-standing CommonMark processor, of course).
The parser is provided with a configuration file containing patterns to match against input lines. These patterns come in pairs, where the first pattern specifies the form of a “mark-up block start” line, and the second the form of the corresponding “mark-up block end” line.
The configuration file specifying these patterns could look like this:
"^+-- @ ---$" "^---$" "Z Notation"
"C*" "C Programming Language"
The meaning of these two lines for the handling of “foreign syntaxes” is:
The first line shown here contains three strings, the first two specify a pair of patterns intended to detect
the opening line for a “Z Schema definition paragraph” in the “e-mail mark-up” of the Z Notation, and
the corresponding closing line pattern,
while the third string in the line names the “notation” contained in blocks delimited by matching lines: Z Notation in this case.
In the simple example pattern syntax used here, every character stands for itself except “@” (COMMERCIAL AT), which matches any non-empty string not containing the following literal character, a SPACE in this case.
The second line specifies (this time using a glob pattern) one set of info strings usable in starting lines for fenced code blocks. The matching info strings will be recognized as indicating the “notation” C Programming Language.
[Note 6: The exact syntax for expressing these two kinds of patterns, and which algorithm(s) to use for matching, is TBD.]
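Since the pattern syntax is still open, the following is only one possible reading of the example configuration, written as a sketch. Assumptions made here: a leading ^ and a trailing $ are anchors, @ captures the label, lines with three strings define block delimiters, and lines with two strings define info-string globs (matched with fnmatch). All function names are hypothetical.

```python
import fnmatch
import re
import shlex

def compile_block_pattern(pattern):
    """Translate the simple '@' pattern syntax into a regular expression."""
    anchored_start = pattern.startswith("^")
    anchored_end = pattern.endswith("$")
    body = pattern[1 if anchored_start else 0:
                   len(pattern) - (1 if anchored_end else 0)]
    parts = []
    for i, ch in enumerate(body):
        if ch == "@":
            # '@' matches a non-empty string without the next literal character.
            stop = body[i + 1] if i + 1 < len(body) else None
            parts.append("([^%s]+)" % re.escape(stop) if stop else "(.+)")
        else:
            parts.append(re.escape(ch))
    return re.compile(("^" if anchored_start else "") + "".join(parts)
                      + ("$" if anchored_end else ""))

def load_config(text):
    """Return (block delimiter patterns, info-string globs) from the config."""
    blocks, infos = [], []
    for line in text.splitlines():
        fields = shlex.split(line)
        if len(fields) == 3:
            start, end, notation = fields
            blocks.append((compile_block_pattern(start),
                           compile_block_pattern(end), notation))
        elif len(fields) == 2:
            infos.append((fields[0], fields[1]))
    return blocks, infos

def notation_for_info(infos, info_string):
    """Look up the notation configured for a fenced code block info string."""
    for glob, notation in infos:
        if fnmatch.fnmatchcase(info_string, glob):
            return notation
    return None
```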
Because fenced code blocks with info strings are part of the core CommonMark syntax, and recognizing such fenced code blocks is already implemented in cmark, all we need to implement is a modified treatment of fenced code blocks with an info string: no other modification of cmark’s behaviour is required, and no more configuration set-up to “send” the pertaining code block content to an appropriate and available post-processor is needed.
[Note 7: There are stricter requirements than “three or more tilde characters” imposed on a block-ending fenced line by the CommonMark specification anyway, which are hard to express in a pattern—luckily no such pattern is needed.]
When a plain text input line is encountered that matches a configured “mark-up block start” line, the CommonMark processor has to “package” the raw text content of the block starting there into a “container element”, and send out this container element into the output for the consumption by the post-processors which come later in the chain.
The proposed element type name, or general identifier (GI) in SGML terminology, for the container elements is mark-up, and three attributes are defined for it. The (SGML) element and attribute list declarations are:
<!ELEMENT mark-up - - CDATA >
<!ATTLIST mark-up
notation CDATA #REQUIRED
mode (vertical | horizontal) #IMPLIED
label CDATA #IMPLIED>
The same declarations apply in XML, but without the “- -” in the element declaration.
[Note 8: In the SGML element declaration, “- -” specifies that neither the start tag <mark-up> nor the end tag </mark-up> can be omitted in a marked-up SGML document.]
The attributes of <mark-up> have the following meaning:
the “mode” of this <mark-up> element: “vertical” in this case (it would be “horizontal” for in-line “foreign” syntax, and yes, this is TeX jargon ;-)
the “notation” (ie kind of syntax) in which the plain text content is written (this is defined in the configuration and depends on the form and/or the label of this block),
the “label” extracted using the pattern match, if any;
A CommonMark processor should handle blocks in “foreign” syntax for the “mark-up chain” process according to the following rules, using the patterns and notation names defined in the configuration file introduced above.
One can call “mark-up blocks” both forms of blocks which require post- processing:
fenced code blocks with “known” info strings and
blocks in “foreign mark-up syntax” delimited by configurable lines
and both are processed and packaged in the same way, except for the different ways of detecting the starting fenced code block line or the “mark-up block start” line, namely:
during regular parsing in the case of a fenced code block with an info string: the info string is matched against the configured patterns (for info strings) to find a defined notation for this form of info string; but
by matching (the start of) the input line against a list of configurable patterns in the case of “foreign mark-up syntax” blocks: if a match is found, the notation again comes from the configuration line, and the pattern match also determines the label to use (if any) for this block.
and vice versa for finding the corresponding ending fenced code block line or “mark-up block end” line.
The actions the CommonMark processor needs to take to process a “foreign mark-up syntax” block—when encountering a plain text input line which matches a configured pattern—are these:
The parser closes the currently open block-level element (if any);
an (optional) label is extracted from this line using the pattern match;
a start tag for a <mark-up> element is generated into the output, using attribute values
for the mode attribute the value vertical (because this is the default value for mode, the mode attribute can be omitted),
for the label attribute the CDATA substring found in the “mark-up block start” line pattern match,
for the notation the CDATA string defined in the configuration file;
the “block start” line itself is copied as character data into the output;
all plain text up to the corresponding “mark-up block end” line is not processed in any way, but taken “literally”, and output as character content (substituting \n\012 for EOL etc as usual, but not entity-encoding < for example);
when the corresponding “mark-up block end” line occurs in the plain text input, the end tag for this <mark-up> element instance is output after copying the “block end” line into the output.
[Note 9: The first step (closing the currently open block-level element) is only needed if there is no blank line preceding the “foreign mark-up start” line matched.
It would probably be better to require this blank line to be there, for in that case there is never an open block-level element that needs closing when a “foreign mark-up start” line is detected. Furthermore, the parser would only need to match each first non-empty line after (a sequence of) blank lines, instead of every single input line.]
[Note 10: The reason for copying the starting and ending plain text lines of a “foreign mark-up block” into the <mark-up> element’s content is simple: the post-processor might actually need to see them to decide what the raw plain text between them means. This is in fact the case for marked-up blocks in the Z Notation (like the one shown in the example below), where the form of the starting line determines the syntactic category of (most) such blocks.]
Processing fenced code blocks labeled with a “known” info string (meaning: one for which there is a match in the configuration file) follows the same steps, except that
the value used for the label attribute is the info string, and
neither the starting fenced code block line nor the ending one are copied into the output character data.
Note that both these transformations need only be implemented once in a CommonMark processor, and after that pretty much any syntax for starting and ending blocks of “extended” syntax can be defined in the configuration file and recognized by pattern matching (of an expressive enough kind).
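The packaging steps for a “foreign mark-up syntax” block can be sketched like this. It is an illustration only: the function name and its return shape are hypothetical, real CommonMark parsing of the ordinary lines is left out entirely, and the delimiters are passed in as already compiled regular expressions.

```python
import re

def package_blocks(lines, start_re, end_re, notation):
    """Split input lines into ordinary text and packaged <mark-up> elements.

    Returns a list of ("text", line) and ("esis", [ESIS lines]) items.
    """
    def esc(text):
        return text.replace("\\", "\\\\").replace("\n", "\\n\\012")

    result = []
    block = None                 # raw lines of the block currently open
    label = None
    for line in lines:
        if block is None:
            m = start_re.match(line)
            if m:
                block = [line]   # the "block start" line is copied too
                label = m.group(1) if m.groups() else None
            else:
                result.append(("text", line))
        else:
            block.append(line)   # taken literally, including the end line
            if end_re.match(line):
                esis = ["Amode CDATA vertical"]
                if label:
                    esis.append("Alabel CDATA " + esc(label))
                esis += ["Anotation CDATA " + notation, "(mark-up",
                         "-" + esc("\n".join(block)), ")mark-up"]
                result.append(("esis", esis))
                block = None
    if block is not None:        # unterminated block: hand it back as text
        result.extend(("text", line) for line in block)
    return result
```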
Note also that the “plain text input” for the CommonMark processor is
itself the character data content of a <mark-up> element in the ESIS
input stream.
The CommonMark processor will see only one such <mark-up> element
if the input into the “mark-up chain” is an XML-wrapped plain text
file (in which case this one element contains the complete plain text
content of the input); but may see many such elements otherwise (when
processing a parsed XML/SGML document).
Since the CommonMark processor packages all plain text fragments
recognized as “foreign” notations into the appropriate <mark-up>
elements, the post-processors in the tail of the processing chain need
not be concerned with any “foreign” or CommonMark syntax: they
just need to filter “their” plain text from the appropriate <mark-up>
elements.
Here is how mark-up using the configured “extended syntax” (using the configuration file shown above) would be transformed into the ESIS output of the CommonMark processor:
A fenced code block with an info string "C" like this one:
~~~ C
int foo();
~~~
would be transformed into a <mark-up> container element instance in
the output stream like this one (written here in XML format, not in the
internal ESIS representation):
<mark-up mode="vertical" label="C">
<![CDATA[~~~ C
int foo();
~~~]]></mark-up>
And a “foreign mark-up syntax block” (in the example configuration: a “Z Schema definition paragraph”) in the CommonMark input plain text looking like this:
+-- NAME ---
DeclPart
|--
Predicate
---
would be transformed into this <mark-up> element:
<mark-up mode="vertical"
label="NAME"
notation="Z Notation">
<![CDATA[+-- NAME ---
DeclPart
|--
Predicate
---]]></mark-up>
[Note 11: The CDATA section mark-up is shown here simply to illustrate
that the plain text inside the <mark-up> element is not to be parsed
in any way. But then the whole XML/SGML mark-up shown is just for
illustration: there are no CDATA sections in the ESIS and hence in the
internal format to represent the ESIS. Once more the internal format is
simpler than the XML mark-up!]
It should be obvious how the CommonMark processor at the head of the
chain and every post-processor later on can watch out for the <mark-up>
elements it needs to handle:
A “syntax highlighting” post-processor would look for such elements
that have a label attribute it recognizes: maybe simply a list of
available programming languages, like “C”, “Java”, “C++”, etc. Or it
recognizes additional information in the label, too: for example, a
label “C(indent=4)” could additionally specify how to lay out each
highlighted line of code in the rendered output.
Since the form of the “mark-up block start” and “end” line is
irrelevant in this case, the post-processor can simply skip over
both of these lines in the plain text character content of the
<mark-up> element.
A “Z Notation e-mail mark-up transforming” post-processor would
have to watch for <mark-up> elements with a notation attribute
specifying Z Notation.
It could, but need not, use the label attribute to find there the
name of the schema written in the “mark-up block start” line (in
this specific form of Z Notation paragraph), but could also just
parse the whole plain text contained in the <mark-up> element and
ignore the label attribute.
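How a post-processor might pick out “its” elements can be sketched with Python's standard `xml.etree.ElementTree` module. This is an illustration that works on the XML form of the examples rather than on the internal ESIS representation; the helper names, the `KNOWN_LANGUAGES` set and the `<doc>` wrapper element are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

KNOWN_LANGUAGES = {"C", "C++", "Java"}  # assumed list, for illustration only


def is_highlightable(attrs):
    """True if the label names a known language, ignoring a "(...)" suffix."""
    return attrs.get("label", "").split("(", 1)[0].strip() in KNOWN_LANGUAGES


def is_z_notation(attrs):
    """True if the notation attribute designates the Z Notation."""
    return attrs.get("notation") == "Z Notation"


def claim_mark_up(root, wanted):
    """Return the <mark-up> elements under root that this post-processor handles."""
    return [el for el in root.iter("mark-up") if wanted(el.attrib)]


def inner_lines(element):
    """Content lines without the copied block start and end lines."""
    return (element.text or "").split("\n")[1:-1]


# The two example elements from above, wrapped in a hypothetical <doc> root.
SAMPLE = """<doc>
<mark-up mode="vertical" label="C"><![CDATA[~~~ C
int foo();
~~~]]></mark-up>
<mark-up mode="vertical" label="NAME" notation="Z Notation"><![CDATA[+-- NAME ---
DeclPart
|--
Predicate
---]]></mark-up>
</doc>"""

root = ET.fromstring(SAMPLE)
for el in claim_mark_up(root, is_highlightable):
    print(el.get("label"), inner_lines(el))
```

A highlighter would work on `inner_lines(el)`, while a Z Notation processor would parse the full `el.text`, start and end lines included.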
[Note 12: As the first example above demonstrates, the label
attribute of the <mark-up> element could be used to transmit
processing parameters and options written in the CommonMark source
to the designated post-processor, which is a nice feature.
On the other hand: the label attribute’s value transmitted this way
is never checked for validity, apart from the fact that it matches
the appropriate pattern in the configuration file.
As a consequence, every post-processor must be able to cope with
any value (that matches the configured pattern) in the label
attribute, and must generate a formally valid (though possibly
nonsensical) element sequence as output.
This is because the apparent alternatives, namely
passing the <mark-up> element along in the chain,
wrapping the character content of the <mark-up> element in
another element similar to HTML’s <PRE>,
or “unpacking” the character content and inserting it as character data into the output stream,
are all not viable if you think about it.]
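A post-processor that has to cope with arbitrary label values could parse them defensively, along the following lines. The function name and the comma-separated `key=value` option syntax are assumptions made for this sketch; the text above only gives “C(indent=4)” as an example and does not define a grammar for labels.

```python
import re

# A name, optionally followed by one parenthesized option list.
LABEL_RE = re.compile(r"^(?P<name>[^()]+?)\s*(?:\((?P<options>[^()]*)\))?$")


def parse_label(label):
    """Split a label such as "C(indent=4)" into a name and an options dict.

    Never raises: whatever cannot be understood is ignored, so the caller
    can always fall back to plain output that is still formally valid.
    """
    text = label.strip()
    match = LABEL_RE.match(text)
    if not match:
        return text, {}
    options = {}
    for item in (match.group("options") or "").split(","):
        key, sep, value = item.partition("=")
        if sep and key.strip():
            options[key.strip()] = value.strip()
    return match.group("name").strip(), options
```

With this approach a malformed label degrades to “no options” instead of aborting the chain, in line with the requirement that the output stay formally valid.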
This model of handling and post-processing “foreign” or “extended” syntax in CommonMark typescripts has several advantages:
The CommonMark specification needs no modification or extension,
the cmark processor needs to be modified only once (for handling
labeled code blocks, and for matching input lines against the
configured patterns), and after that
introducing another “foreign syntax” together with a processor for it requires only adding one line in the configuration file (and inserting the processor for this kind of “plain text” mark-up into the “mark-up chain” of course!)
none of the mark-up processors in the chain after cmark needs to
know anything about CommonMark or Markdown or any
other typescript syntax (in fact, they could be used alone and
without cmark just the same to process plain text fragments in
any XML/SGML document!),
it can be easily extended to accommodate in-line “foreign” syntax fragments in the exact same way. The main difficulty would be to decide on a delimiter syntax which is both flexible and unobtrusive. One idea would be to use a non-ASCII character as in
... quote the ´C`int foo();` declaration ...
(that is, prefixing a U+00B4 ACUTE ACCENT, followed by a label, to a code span) and then transform that into
... quote the <mark-up mode="horizontal" label="C">
<![CDATA[int foo();]]></mark-up> declaration ...
This would offer exactly the same range of possibilities as the info string in fenced code blocks does; and it seems natural to equate the info strings of fenced code blocks with the prefixed strings for code spans (in this delimiter style or a different one).
Note that, after a one-time modification of the CommonMark processor’s code span handling, this would require no configuration settings if the configuration file already has an info string pattern indicating the “foreign” notation used here. Introducing a new “foreign” syntax for both block and in-line use is thus one and the same step.
Note also that the only difference in the generated <mark-up>
element compared to the code block case (aside from the obviously
missing “code block start” and “end” lines) is the attribute mode,
which indicates that this is an in-line span (horizontal) and
not a separate block (vertical), so the “right” post-processor
does already receive the packaged character data. It might or might
not make a difference in processing whether the <mark-up> element
represents a block or an inline span (as indicated by the mode
attribute), but everything else is the same for both sorts of
“foreign” mark-up fragments.
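The in-line case can be sketched as a single regular-expression substitution. This is an illustration only: it assumes the U+00B4 delimiter style suggested above, handles only single-backtick code spans, accepts any label without consulting a configuration file, and assumes that the span text does not itself contain `]]>`. The function name is hypothetical.

```python
import re
from xml.sax.saxutils import quoteattr

# U+00B4 ACUTE ACCENT, then a label, then a single-backtick code span.
INLINE_RE = re.compile("\u00b4(?P<label>[^`\u00b4\\s]+)`(?P<code>[^`]+)`")


def package_inline_spans(text):
    """Rewrite labeled code spans as horizontal <mark-up> elements (XML form)."""

    def replace(match):
        # quoteattr() adds the surrounding quotes and escapes the label.
        label = quoteattr(match.group("label"))
        return '<mark-up mode="horizontal" label={}><![CDATA[{}]]></mark-up>'.format(
            label, match.group("code")
        )

    return INLINE_RE.sub(replace, text)
```

Ordinary code spans without the prefix are left untouched, so existing CommonMark text keeps its meaning.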
[Note 13: Placing the txtin tool in front of the “mark-up chain”
in order to implement a process to transform a “conventional” plain
text input in CommonMark notation into the target document would
provide processing of “foreign syntax” embedded in plain text input
files too.]
I am convinced that the approach presented as the “mark-up chain” provides a useful and viable extension to the way Markdown and similar “human-oriented” plain text syntaxes can be used.
I expect that the
required modifications to existing mark-up processors like cmark,
sundown, zhtml etc, as well as the
additional tools for use in the chain (txtin, xmlin, xmlout),
can be implemented with moderate effort and without disrupting the existing implementations.
Using these modified or new tools in the proposed way would bring huge gains for the flexible use of CommonMark processing:
the ability to extend CommonMark syntax with other “foreign” syntaxes
in a robust and modular way (each tool handles just one syntax),
with minimal configuration effort (adding one configuration line for each “foreign syntax” extension)
on the one hand, and beyond that
the ability to use cmark and similar tools to generate
“structured” (ie XML/SGML/HTML) documents from plain text input
files remains, but is extended by
the ability to use cmark and/or similar tools to transform input
XML/SGML/HTML documents by replacing plain text contained as element
character content in designated and marked-up “container elements”
(the <mark-up> elements presented above) with elements (or element
sequences) generated from and for these plain text fragments
This extends the application of CommonMark into new areas like the XML/SGML authoring scenario sketched at the beginning: One could use “extended” CommonMark for authoring eg DocBook documents, while retaining the complete XML/SGML authoring tool chain: editing, validating, transforming (into HTML, PDF, eBook, whatever) can all be done as before, with one additional step introduced: transform an XML/SGML document with “plain text” mark-up in it into an XML/SGML output document without such plain text, but conforming to the target document type (ie the DocBook DTD or XML Schema, or whatever the target XML/SGML document type may be).
© 2015 tin-pot.net | CC BY-SA 4.0 license applies |