using regex for xml

Processing documents with xml basically takes three phases: first, convert any kind of document into an xml document; second, convert one flavour of xml document into another; and last, convert an xml document into an output format. The general xml tools take care of the last two steps. In the last step they can add printer control information based on the xml markups. In the transformation step they expect a text with lots of xml markup, where they can combine tags, erase tags together with their text body, and insert tags based on the existing markups.

However, the general xml tools have trouble adding new tags based on the text content. Their rules for combining and inserting markup match the markup content and not the body text, so splitting a body text into parts is not easily done with xml tools. This is the domain of general text-oriented regex machines like the one built into perl.
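
As a minimal sketch of what "adding tags based on the text content" can look like in perl, the rule below wraps anything that looks like an ISO-style date in a markup. The sample text and the <date> markup name are assumptions made up for this illustration; they are not taken from xm-tool.

```perl
use strict;
use warnings;

# Hypothetical body text; in real use this would be one plaintext part.
my $text = 'Released on 2001-05-17 as version 0.3.';

# Match on the text content itself and wrap each hit in a new markup.
# "<date>" is an illustrative markup name only.
(my $out = $text) =~ s!\b(\d{4}-\d\d-\d\d)\b!<date>$1</date>!g;

print $out, "\n";
```

The rule looks only at the body text, which is exactly the kind of match that markup-oriented transformation rules do not offer.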

The xm-tool project provides perl scripts for the inner and last step too, but they are not as far developed as the dedicated tools from the xml arena. In general, the output format of xm-tool processing is again an xml-type or plaintext-type format, nothing comparable to xml::fo and the like. The xml docbook processors can print to postscript or troff as well. A similar thing could be done for xm-tool, but nobody has taken the effort since one can use docbook as an intermediate xml format.

simplified markup regex

Many processing snippets in the xm-tool project expect a very simplified form of xml markup rules. To make that clear we speak of "xm markup rules", and note the missing "l": we do not build another language-type but just put some additional meta-info into the text. In its simplest form the following rules apply:

text parts do not contain "<" or ">", and within a markup itself no "<" or ">" can be found either. These two chars occur alternating through the whole text and each of them separates plaintext from markuptext.
different markup types can only be differentiated by their first char, and different markups only by their name. The usual <br/> can not be differentiated from its <br> cousin, so writing two markups in place of such a single one is recommended.
the end of a markup name can be detected with a \b
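
The first rule can be checked mechanically. The following is a minimal sketch, assuming the whole text is held in one string; the helper name follows_xm_rules is made up for this example and does not come from xm-tool.

```perl
use strict;
use warnings;

# Returns 1 if "<" and ">" strictly alternate: plaintext first, then
# zero or more (markup, plaintext) pairs up to the end of the string.
sub follows_xm_rules {
    my ($text) = @_;
    return $text =~ m{\A [^<>]* (?: <[^<>]*> [^<>]* )* \z}x ? 1 : 0;
}

print follows_xm_rules('<para>some <b>bold</b> text</para>'), "\n";
print follows_xm_rules('if a < b then <b>stop</b>'), "\n";
```

The second sample fails the check because of the stray "<" in the plaintext; such a char would have to be escaped before the simplified rules can be relied upon.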

From these xm-rules we can derive the simplest forms of a markup-matching regex. To match any markup (excluding plaintext areas) one can just use <[^<>]*>. This will walk through all markups in a document, no matter which type. It is useful for the last step in document processing, when converting an xm-text into some output format that needs control-characters other than xml-tags. See the css-html example within the xm-tool project.
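
A short sketch of this "match any markup" regex at work, using a made-up sample text: it first collects all markups and then strips them to get a plaintext rendering.

```perl
use strict;
use warnings;

# Hypothetical sample text that follows the simplified xm rules.
my $text = '<para>Hello <b>world</b></para>';

# In list context with /g, a regex without part-getters returns
# every matched substring, here every markup.
my @markups = $text =~ m{<[^<>]*>}g;

# Removing every markup leaves only the plaintext parts.
(my $plain = $text) =~ s{<[^<>]*>}{}g;

print join(" ", @markups), "\n";
print $plain, "\n";
```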

Walking through all markups is generally not that useful; instead we want to match just a specific set of markups. This can be achieved with <(MYMARK|OTHERMARK|DIFFERENT)\b([^<>]*)> and note the use of the zero-width "\b", so that this regex does not match a tag that looks like "<DIFFERENTIAL ...". Furthermore, note that we put a second part-getter around the arguments that might live in this markup. From here it is easy to write a simple rewrite rule for some markup:
s{<(para)\b([^<>]*)>} {<p align="justify"$2>}gs;
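This example also shows the convention of always using both the "g" and "s" modifiers for tag-transformation rules: "g" rewrites every occurrence in the text, and "s" lets "." match newlines too, which matters as soon as a rule matches body text with ".".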
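
To see the rewrite rule and the effect of "\b", here is a small sketch on a made-up text. The second rule for the end-markup is an assumed companion rule added for this illustration; the text above only gives the rule for the intro-markup.

```perl
use strict;
use warnings;

# Hypothetical input; "<paragraph>" must stay untouched because of \b.
my $text = join "\n",
    '<para>one</para>',
    '<para class="x">two</para>',
    '<paragraph>three</paragraph>';

my $out = $text;

# The rewrite rule from the text: keep the arguments in $2.
$out =~ s{<(para)\b([^<>]*)>} {<p align="justify"$2>}gs;

# Assumed companion rule for the end-markup.
$out =~ s{</para\b[^<>]*>} {</p>}gs;

print $out, "\n";
```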

The actual usefulness of the xm-tool will however not show before we also bind the enclosed text. If we can assume that a markup does not nest within a markup of the same name, here is a rule to walk through all toplevel two-sided markups within a document.
s{<(\w+)\b([^<>]*)> ((?:.(?!</?\1\b))*.) (</\1>) } { print "markup=",$1," args=",$2," enclosed=",$3," final=",$4 ; "" }gsex;
Note that we used \1 to backreference the markup-name. The inner text is matched as anything that does not look like a markup with the name of the first match, and it ends at an end-markup with the name of the first match. Just memorize the inner-text-match regex that you see above, and paste it into your own rules. A typical home-made rule would look like
s{<(mytitle)\b([^<>]*)> ((?:.(?!</?mytitle\b))*.) (</mytitle>) } { "<sect1><title".$2.">".$3 ."</title></sect1>" }gsex;
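
The following sketch runs both kinds of rule on made-up input, assuming the end-markup is written as </name>. It collects the four part-getters in an array instead of printing them, and the rewrite rule ends with the new string as its last expression, since under the "e" modifier the value of the last expression becomes the replacement.

```perl
use strict;
use warnings;

# Hypothetical sample; the note body spans two lines, so "s" matters.
my $text = qq{<title id="t1">Intro</title> some text <note>first\nsecond</note>};

# Walk through all toplevel two-sided markups and record the parts.
my @seen;
(my $rest = $text) =~
    s{<(\w+)\b([^<>]*)> ((?:.(?!</?\1\b))*.) (</\1>) }
     { push @seen, "$1|$2|$3|$4"; "" }gsex;

# A home-made rewrite rule for one specific markup.
my $doc = '<mytitle id="a">Hello</mytitle>';
(my $conv = $doc) =~
    s{<(mytitle)\b([^<>]*)> ((?:.(?!</?mytitle\b))*.) (</mytitle>) }
     { "<sect1><title".$2.">".$3."</title></sect1>" }gsex;

print "$_\n" for @seen;
print $conv, "\n";
```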

modernish xml markup regex

The preceding text made quite some assumptions in the xm-rules for markups which are sometimes inconvenient. They aim at making it easier to write a regex for matching the intro-markup, but at the expense of not being able to catch many xml-type expressions that contain chars like ":" or "-" within the markup-name. The following two markups can not be distinguished (!!) with the simplified regex above: "<title hello>" and "<title-text>...".
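
The ambiguity can be reproduced with a short sketch. The sample text and the <h1> target markup are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical input with two different markups sharing a name prefix.
my $text = '<title hello>A</title> <title-text>B</title-text>';

# Simplified rule: the name "title" ends at a word boundary. Since "-"
# is not a word char, "\b" also matches in front of "-text".
(my $out = $text) =~ s{<(title)\b([^<>]*)>}{<h1$2>}gs;

print $out, "\n";
```

Both intro-markups get rewritten, so the unrelated <title-text> ends up as <h1-text>.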




 A modernish xm-markup rule can be written a bit differently: it
 drops the \b-rule that separates the markup-name
 from the markup-attributes. Instead we claim that some
 whitespace separates the two parts, or nothing at all
 if no attributes are given in this markup. See here:
    s{<(my-title)((?:\s[^<>]*)?>)
                 ((?:.(?!</?my-title[\s>]))*.)
                    (</my-title>) }
    { "<sect1><title".$2.$3
        ."</title></sect1>" }gsex;
 
 Note here that the ">" is now contained in the
 attribute part-getter of the intro-markup, so
 $2 will never be zero-width. The same fact is used
 to express the end of any markup-name in the inner part,
 at the bodytext part-getter, where [\s>] takes the place of \b.
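
A runnable sketch of such a modernish rule, written so that the attribute part-getter includes the closing ">". The sample text is made up; it contains a <my-title-note> markup to show that a longer name is no longer caught.

```perl
use strict;
use warnings;

# Hypothetical input: one real my-title and one markup with a longer name.
my $text = '<my-title id="x">Hello</my-title><my-title-note>n</my-title-note>';

# $2 holds the optional attributes plus the closing ">", so the
# rewrite-clause does not add a ">" of its own.
(my $out = $text) =~
    s{<(my-title)((?:\s[^<>]*)?>)
         ((?:.(?!</?my-title[\s>]))*.)
            (</my-title>) }
     { "<sect1><title".$2.$3."</title></sect1>" }gsex;

print $out, "\n";
```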


zero-width body-text 


 If you look closer at the above markups then you will notice
 that the bodytext can not be zero-width. One can of course
 work around that by adding some dummy-markup in there, like
 an xml-comment, so that the regex will still match. If one
 wants to generalize that however, a more complex regex
 should be used in that place:
    s{<(mytitle)\b([^<>]*)(?=>)
                 ((?:.(?!</?mytitle\b))*.)
                    (</mytitle>) }
    { "<sect1><title".$2.$3
        ."</title></sect1>" }gsex;
 
 Here we went the other way round: we used a zero-width
 lookahead rule, (?=>), so that in the end the
 closing ">" becomes part of the bodytext, always.
 The first char of the body-text is the ">", which must
 be accounted for in the rewrite-clause of the regex-subst.
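
A sketch of the lookahead variant on made-up input, with one empty and one non-empty markup. Since $3 always starts with the ">" of the intro-markup, the rewrite-clause here joins $2 and $3 directly and adds no ">" of its own.

```perl
use strict;
use warnings;

# Hypothetical input: the first mytitle has a zero-width body.
my $text = '<mytitle></mytitle><mytitle id="a">Hi</mytitle>';

# (?=>) checks for the ">" without consuming it, so the ">" becomes
# the first char of the bodytext in $3.
(my $out = $text) =~
    s{<(mytitle)\b([^<>]*)(?=>)
         ((?:.(?!</?mytitle\b))*.)
            (</mytitle>) }
     { "<sect1><title".$2.$3."</title></sect1>" }gsex;

print $out, "\n";
```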



 If you look closer then this rule contradicts the extension
 in the section about modernish xm-tag names. To combine them
 both into one regex-rule you'll need to express the
 attribute-match as an optional part
    s{<(my-title)((?:\s[^<>]*)?)(?=>)
                 ((?:.(?!</?my-title[\s>]))*.)
                    (</my-title>) }
    { "<sect1><title".$2.$3
        ."</title></sect1>" }gsex;
 

 However, combining such regexes makes the regex itself not
 only complex but slow in processing too.



 If you know, on the other hand, that the my-title parts
 can not be nested anyway, the regex can be
 simplified again and still match both extensions. In
 the following example we show another regex variant
 that matches a zero-width body. This is the recommended
 form in xm-tool now, as it is quite readable.
    s{<(my-title)(\s[^<>]*)?>
                 ((?:[^<]|<(?!/?my-title[\s>]))*)
                    (</my-title(?:\s[^<>]*)?>) }
    { "<sect1><title".$2.">".$3
        ."</title></sect1>" }gsex;
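
A runnable sketch of this form on made-up input, covering an empty body and a body with an inner markup. One assumption is added here: $2 is undefined when the markup has no attributes, so the sketch replaces it with an empty string to stay quiet under "use warnings".

```perl
use strict;
use warnings;

# Hypothetical input: an empty my-title and one with an inner <b> markup.
my $text = '<my-title></my-title> <my-title id="a">x <b>y</b></my-title>';

# The body is any run of non-"<" chars, or a "<" that does not start
# a my-title markup; it may be empty because of the trailing "*".
(my $out = $text) =~
    s{<(my-title)(\s[^<>]*)?>
         ((?:[^<]|<(?!/?my-title[\s>]))*)
            (</my-title(?:\s[^<>]*)?>) }
     { my $attrs = defined $2 ? $2 : "";
       "<sect1><title$attrs>$3</title></sect1>" }gsex;

print $out, "\n";
```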