xm-tool: regex for xml
Processing documents with xml basically takes three phases: the first converts any kind of document into an xml document, the second converts one flavour of xml document into another, and the last converts an xml document into an output format. The general xml tools take care of the last two steps. In the last step they can add printer control information based on the xml markups, and in the transformation step they expect a text with lots of xml markup, where they can combine and erase tags along with their text body and insert tags based on the markups.
However, the general xml tools have trouble adding new tags based on the text content: their markup combining and insertion rules match the markup content and not the body text, so splitting a body text into parts is not easily done with xml tools. This is the domain of general text-oriented regex machines like the one built into perl.
The xm-tool project provides perl scripts for the middle and last steps too, but they are not as far developed as the dedicated tools from the xml arena. In general, the output of xm-tool processing is again an xml-type or plaintext-type format, nothing comparable to xml::fo and the like. The xml docbook processors can print to postscript or troff as well. A similar thing could be done for xm-tool, but nobody has taken the effort since one can use docbook as an intermediate xml format.
Many processing snippets in the xm-tool project expect a very simplified form of xml markup rules. To make that clear we speak of "xm markup rules"; note the missing "l", since we do not build another language type but just put some additional meta-info into the text. In its simplest form the following rules apply:
From these xm-rules we can derive the simplest forms of a markup-matching regex. To match any markup (excluding plaintext areas), one can just use
<[^<>]*>
This walks through all markups in a document, no matter which type. That can be useful for the last step in document processing, when converting an xm-text into some output format that needs control characters other than xml tags; see the css-html example within the xm-tool project.
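As a minimal sketch of what walking through all markups can look like, the following perl snippet lists every markup in a text and then strips them to leave plain body text. The helper names `list_markups` and `strip_markups` are hypothetical and not part of xm-tool; the sample input is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return every markup found in an xm-text.
sub list_markups {
    my ($text) = @_;
    # In list context, a /g match without capture groups returns all matches.
    return $text =~ /<[^<>]*>/g;
}

# Hypothetical helper: drop all markups and keep only the body text.
sub strip_markups {
    my ($text) = @_;
    $text =~ s{<[^<>]*>}{}g;
    return $text;
}

my $sample = '<para>Hello <em>xm</em> world</para>';
print join(" ", list_markups($sample)), "\n";   # <para> <em> </em> </para>
print strip_markups($sample), "\n";             # Hello xm world
```

A real output converter would replace each markup with the control characters of the target format instead of deleting it.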
Walking through all markups is generally not very useful; instead we usually want to match just a specific set of markups. This can be achieved with the following
<(MYMARK|OTHERMARK|DIFFERENT)\b([^<>]*)>
Note the use of the zero-width "\b" match, so that this regex does not match a tag that looks like "<DIFFERENTIAL ...". Furthermore, note that we put a second capture group around the arguments that might live in this markup. From here it is easy to write a simple rewrite rule for some markup:
s{<(para)\b([^<>]*)>} {<p align="justify"$2>}gs;
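The following sketch applies such a rewrite rule to two made-up inputs, to show both the argument pass-through and the effect of "\b". The wrapper name `justify_paras` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the rewrite rule for <para> markups.
sub justify_paras {
    my ($text) = @_;
    # $2 carries the original arguments over into the new markup.
    $text =~ s{<(para)\b([^<>]*)>}{<p align="justify"$2>}gs;
    return $text;
}

# The opening markup is rewritten and keeps its arguments.
print justify_paras('<para id="x">text</para>'), "\n";
# "\b" keeps the rule from touching a longer markup name.
print justify_paras('<paragraph>text</paragraph>'), "\n";
```

Note that this rule only rewrites the opening markup; the closing "</para>" stays as it is and would need a rule of its own.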
The real usefulness of xm-tool only shows once we also bind the enclosed text. If we can assume that a markup does not nest within another markup of the same name, here is a rule to walk through all toplevel two-sided markups within a document.
s{<(\w+)\b([^<>]*)> ((?:.(?!</?\1\b))*.) (</\1>) } { print "markup=",$1," args=",$2," enclosed=",$3," final=",$4 ; "" }gsex;
s{<(mytitle)\b([^<>]*)> ((?:.(?!</?mytitle\b))*.) (</mytitle>) } { "<sect1><title".$2.">".$3 ."</title></sect1>" }gsex;
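A minimal runnable sketch of both ideas follows, assuming closing markups are written as "</name>". The helper names `list_toplevel` and `titles_to_sect1` are hypothetical, and the rewrite returns the built string as the value of the /e block so that it becomes the replacement.

```perl
use strict;
use warnings;

# Hypothetical helper: collect [name, args, body] for each toplevel markup.
sub list_toplevel {
    my ($text) = @_;
    my @found;
    while ($text =~ m{<(\w+)\b([^<>]*)> ((?:.(?!</?\1\b))*.) (</\1>)}gsx) {
        push @found, [$1, $2, $3];
    }
    return @found;
}

# Hypothetical helper: turn <mytitle> elements into docbook-style titles.
sub titles_to_sect1 {
    my ($text) = @_;
    # With /e the last expression in the block is the replacement text.
    $text =~ s{<(mytitle)\b([^<>]*)>
               ((?:.(?!</?mytitle\b))*.)
               (</mytitle>)
              }{ "<sect1><title" . $2 . ">" . $3 . "</title></sect1>" }gsex;
    return $text;
}

for my $m (list_toplevel('<a x="1">one<b>two</b></a><c>three</c>')) {
    print "markup=$m->[0] args=$m->[1] enclosed=$m->[2]\n";
}
print titles_to_sect1('<mytitle id="t1">Intro</mytitle>'), "\n";
```

The inner "<b>" element is reported as part of the enclosed text of "<a>", not as a markup of its own, which is what "toplevel" means here.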
The preceding text made quite a few assumptions about the xm-rules for markups, which are sometimes inconvenient. They aim at making it easier to write a regex that matches the opening markup, but at the expense of not being able to catch many xml-type expressions that contain chars like ":" or "-" within the markup name, because "\b" also matches between a word character and "-" or ":". The following two markups can not be distinguished (!!) with the simplified regex above: "<title hello>" and "<title-text>...".
A more modern xm-markup rule can be written a bit differently: it drops the \b-rule that separates the markup name from the markup attributes. Instead we require that some whitespace separates the two parts, or nothing at all if no attributes are given in this markup. See here:
s{<(my-title)(\s[^<>]*)?> ((?:.(?!</?my-title[\s>]))*.) (</my-title>) } { "<sect1><title".$2.">".$3 ."</title></sect1>" }gsex;
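To make the difference between the two separator rules concrete, this small sketch tests both against the two markups that the "\b" form cannot tell apart. The helper names `matches_word_boundary` and `matches_whitespace` are hypothetical.

```perl
use strict;
use warnings;

# Older rule: "\b" separates the markup name from its attributes.
sub matches_word_boundary {
    my ($tag) = @_;
    return $tag =~ m{<(title)\b([^<>]*)>} ? 1 : 0;
}

# Newer rule: whitespace (or nothing at all) follows the markup name.
sub matches_whitespace {
    my ($tag) = @_;
    return $tag =~ m{<(title)(\s[^<>]*)?>} ? 1 : 0;
}

for my $tag ('<title hello>', '<title-text>') {
    print "$tag: boundary rule=", matches_word_boundary($tag),
          " whitespace rule=", matches_whitespace($tag), "\n";
}
```

The boundary rule matches both markups, while the whitespace rule matches only "<title hello>".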
If you look closer at the rules above, you will notice that the body text can not be zero-width. One can of course work around that by adding some dummy markup in there, like an xml comment, so that the regex will still match. If one wants to generalize this, however, a more complex regex should be used in that place. Here the final ">" of the opening markup is only checked with a lookahead and then becomes the first character of the captured body:
s{<(mytitle)\b([^<>]*)(?=>) ((?:.(?!</?mytitle\b))*.) (</mytitle>) } { "<sect1><title".$2.$3 ."</title></sect1>" }gsex;
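The effect of the "(?=>)" lookahead on the body capture can be checked with a short sketch. The helper name `capture_body` is hypothetical; it returns the third capture group, or undef when nothing matches.

```perl
use strict;
use warnings;

# Hypothetical helper: return the body capture of the lookahead variant.
sub capture_body {
    my ($text) = @_;
    return $text =~ m{<(mytitle)\b([^<>]*)(?=>)
                      ((?:.(?!</?mytitle\b))*.)
                      (</mytitle>)}sx ? $3 : undef;
}

# The ">" is not consumed by the opening part, so it leads the body capture.
print capture_body('<mytitle></mytitle>'), "\n";        # >
print capture_body('<mytitle>Intro</mytitle>'), "\n";   # >Intro
```

Because the capture already starts with ">", a replacement built from it should append $3 directly after the attributes and not add another ">".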
If you look closer, then this rule contradicts the extension in the section about modern xm-tag names. To combine them both into one regex rule you need to express the attribute match with an alternative:
s{<(my-title)((?:\s[^<>]*)?)(?=>) ((?:.(?!</?my-title[\s>]))*.) (</my-title>) } { "<sect1><title".$2.$3 ."</title></sect1>" }gsex;
If you know, on the other hand, that the my-title parts can not be nested anyway, the regex can be simplified again and still match both extensions. In the following example we show another regex variant that matches a zero-width body. This is the recommended form in xm-tool now, as it is quite readable.
s{<(my-title)(\s[^<>]*)?> ((?:[^<]|<(?!/?my-title[\s>]))*) (</my-title(?:\s[^<>]*)?>) } { "<sect1><title".$2.">".$3 ."</title></sect1>" }gsex;
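As a closing sketch, the recommended form can be wrapped in a small function and tried on an empty body, a body with nested markup, and a similarly named markup that must stay untouched. The function name `my_titles_to_sect1` is hypothetical, and the `//` defined-or operator (perl 5.10 or later) is an addition here to avoid an "uninitialized" warning when no attributes are present.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the recommended my-title rewrite rule.
sub my_titles_to_sect1 {
    my ($text) = @_;
    $text =~ s{<(my-title)(\s[^<>]*)?>
               ((?:[^<]|<(?!/?my-title[\s>]))*)
               (</my-title(?:\s[^<>]*)?>)
              }{ "<sect1><title" . ($2 // "") . ">" . $3 . "</title></sect1>" }gsex;
    return $text;
}

print my_titles_to_sect1('<my-title></my-title>'), "\n";
print my_titles_to_sect1('<my-title id="a">A <em>b</em></my-title>'), "\n";
print my_titles_to_sect1('<my-title-extra>x</my-title-extra>'), "\n";
```

The first call shows that a zero-width body now matches, and the last call shows that "<my-title-extra>" is left alone because neither whitespace nor ">" follows the markup name.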