By Patrick Gundlach

New XML and XPath parser

Categories: Development, speedata Publisher

Version 4.15.8 of the speedata Publisher contains a completely rewritten XML and XPath parser. Currently this is opt-in, but it will be the default in version 5 of the speedata Publisher.

The old (current) implementation is a combination of an ad-hoc XML parser and an XPath parser, both written in pure Lua. These parsers work fine in practice, but they are too fault tolerant and do not give enough information when an error occurs. With some errors these parsers get into a bad state. Many, many issues related to these two parsers have been fixed since development started.

What is an XPath parser anyway?

Wikipedia writes: “XPath (XML Path Language) is an expression language designed to support the query or transformation of XML documents.”

Say you have an XML document with products from your database. You can now use XPath to ask for details about the elements and attributes. For example “give me all articles that have the attribute ’new’” or “how many articles are in a certain article group”. You can also use XPath functions to transform the results, for example by replacing text.

So for database publishing software, XPath is a very useful tool that makes it easier to decide how to lay out the data in the PDF.

The new XML parser

This XML parser is actually not new: it has been part of the speedata Publisher for quite some time, but was not advertised. It is written in Go (like most of the non-layout functionality) and is based on the XML parser from the standard library, with a small patch that reports the current input position for better error messages.

For later processing, the resulting document is written to a (big) Lua table.
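
To make the idea concrete, here is a minimal sketch in Go of reading XML with the standard library decoder, building a nested structure, and attaching a position to errors. It is not the Publisher's code: the patch mentioned above is not shown, and the `Node` type and the `parse` function are assumptions made up for this illustration (the `Node` tree only plays the role of the Lua table). The stock decoder already exposes a byte offset via `InputOffset()`, which is what the sketch uses.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// Node is a hypothetical element type that plays the role of the Lua table.
type Node struct {
	Name     string
	Attrs    map[string]string
	Children []any // each child is either a *Node or a string
}

// parse builds a Node tree from the input. On malformed XML it returns an
// error that includes the byte offset the decoder had reached.
func parse(input string) (*Node, error) {
	dec := xml.NewDecoder(strings.NewReader(input))
	root := &Node{Name: "#document"}
	stack := []*Node{root}
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return root, nil
		}
		if err != nil {
			return nil, fmt.Errorf("near byte %d: %w", dec.InputOffset(), err)
		}
		cur := stack[len(stack)-1]
		switch t := tok.(type) {
		case xml.StartElement:
			n := &Node{Name: t.Name.Local, Attrs: map[string]string{}}
			for _, a := range t.Attr {
				n.Attrs[a.Name.Local] = a.Value
			}
			cur.Children = append(cur.Children, n)
			stack = append(stack, n)
		case xml.EndElement:
			// The decoder guarantees matched tags, so the stack never underflows.
			stack = stack[:len(stack)-1]
		case xml.CharData:
			cur.Children = append(cur.Children, string(t))
		}
	}
}

func main() {
	doc, err := parse(`<data><article new="yes">Chair</article></data>`)
	if err != nil {
		fmt.Println(err)
		return
	}
	data := doc.Children[0].(*Node)
	fmt.Println(data.Name, len(data.Children))

	// A mismatched end tag produces an error that carries a position.
	_, err = parse(`<data><article></data>`)
	fmt.Println(err)
}
```

A stack of open elements is enough here because the decoder itself rejects mismatched or missing end tags.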

The new XPath parser

The new XPath parser is written in Lua and works on the resulting table from the XML parser.

It is strictly derived from the formal grammar provided by the specification. For example, the grammar starts with:

[1]  XPath      ::=  Expr
[2]  Expr       ::=  ExprSingle ("," ExprSingle)*
[3]  ExprSingle ::=  ForExpr
                      | QuantifiedExpr
                      | IfExpr
                      | OrExpr
[4]  ForExpr    ::=  SimpleForClause "return" ExprSingle
...

This means that an XPath is an Expr, and an Expr is an ExprSingle followed by zero or more repetitions of “,” and another ExprSingle. ExprSingle is either a ForExpr, a QuantifiedExpr, an IfExpr or an OrExpr. The ForExpr is a SimpleForClause, followed by the keyword “return” and an ExprSingle.

The old XPath implementation works mostly by matching patterns, which has difficulties with, for example, nested paired parentheses or strings. The new implementation is much more robust because the parser knows exactly which tokens to expect.

The parser is split into three stages:

  1. Split the input into tokens, for example the input sequence ‘-123’ is split into two tokens, a minus sign and a number.
  2. Apply the grammar rules and return an anonymous function that evaluates the input. See below for an example.
  3. Execute the function from 2. with the current context.
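
The first stage can be sketched in a few lines of Go. This is only an illustration for a tiny subset of XPath (integers, + and -, and $variables), not the tokenizer of the Publisher, which is written in Lua. The `Token` type, its kinds and the `tokenize` function are assumptions made for this example.

```go
package main

import "fmt"

// Token is a hypothetical token type; Kind is "operator", "number" or "variable".
type Token struct {
	Kind  string
	Value string
}

func isDigit(c byte) bool { return c >= '0' && c <= '9' }

func isLetter(c byte) bool {
	return c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z' || c == '_'
}

// tokenize splits a small subset of XPath into tokens.
func tokenize(input string) ([]Token, error) {
	var toks []Token
	i := 0
	for i < len(input) {
		c := input[i]
		switch {
		case c == ' ' || c == '\t' || c == '\n':
			i++ // whitespace separates tokens but is not a token itself
		case c == '+' || c == '-':
			toks = append(toks, Token{Kind: "operator", Value: string(c)})
			i++
		case isDigit(c):
			start := i
			for i < len(input) && isDigit(input[i]) {
				i++
			}
			toks = append(toks, Token{Kind: "number", Value: input[start:i]})
		case c == '$':
			start := i + 1
			i = start
			for i < len(input) && (isLetter(input[i]) || isDigit(input[i])) {
				i++
			}
			if i == start {
				return nil, fmt.Errorf("variable name expected at position %d", start)
			}
			toks = append(toks, Token{Kind: "variable", Value: input[start:i]})
		default:
			return nil, fmt.Errorf("unexpected character %q at position %d", c, i)
		}
	}
	return toks, nil
}

func main() {
	for _, input := range []string{"-123", "3 + $variable"} {
		toks, err := tokenize(input)
		fmt.Println(toks, err)
	}
}
```

Note that the minus sign is emitted as its own token. Whether it is a unary minus or a subtraction is decided later by the grammar rules, not by the tokenizer.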

Step two needs a bit more explanation:

Say you have the XPath expression "3 + $variable". The relevant rule is

[12] AdditiveExpr ::= MultiplicativeExpr
                      ( ("+" | "-") MultiplicativeExpr )*

An AdditiveExpr is a MultiplicativeExpr, optionally followed by any number of + or - operators, each followed by another MultiplicativeExpr. In this case I use the following (pseudo) code:

parseAdditiveExpr = function (tl tokenlist) {
    // result is a function
    result = parseMultiplicativeExpr(tl)

    // zero or more "+" / "-" operands, as in the grammar rule;
    // op, leftHandSide and rightHandSide are local to each iteration
    while tl.nextTokenIs('+','-') {
        op = tl.readToken()
        leftHandSide = result
        rightHandSide = parseMultiplicativeExpr(tl)

        // ctx is a context with a current state
        result = function(ctx context) {
            if op == "+" {
                return leftHandSide(ctx)
                       + rightHandSide(ctx)
            } else {
                return leftHandSide(ctx)
                       - rightHandSide(ctx)
            }
        }
    }
    return result
}

The result of calling parseAdditiveExpr() with the token list as an argument is a function that expects an XPath context as an argument. This function can be called over and over again with different contexts and always returns the result for the current state.
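
The same idea can be written as a small runnable program. The sketch below uses Go purely for illustration (the actual XPath parser is written in Lua), and it makes several simplifying assumptions: tokens are separated by whitespace, operands are only numbers or $variables, and an undefined variable evaluates to 0. The names `Context`, `EvalFunc`, `tokenList`, `parseOperand` and `compile` are invented for this example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Context carries the current state, here only the variable values.
type Context struct{ Vars map[string]float64 }

// EvalFunc is the kind of function every parse step returns.
type EvalFunc func(ctx *Context) float64

// tokenList is a minimal stand-in for the tokenizer output.
type tokenList struct {
	toks []string
	pos  int
}

func (tl *tokenList) nextTokenIs(options ...string) bool {
	for _, o := range options {
		if tl.pos < len(tl.toks) && tl.toks[tl.pos] == o {
			return true
		}
	}
	return false
}

// readToken returns the next token, or "" at the end of the input.
func (tl *tokenList) readToken() string {
	if tl.pos >= len(tl.toks) {
		return ""
	}
	tl.pos++
	return tl.toks[tl.pos-1]
}

// parseOperand stands in for MultiplicativeExpr: a number or a $variable.
// Simplification: an undefined variable evaluates to 0 instead of an error.
func parseOperand(tl *tokenList) (EvalFunc, error) {
	tok := tl.readToken()
	if strings.HasPrefix(tok, "$") {
		name := tok[1:]
		return func(ctx *Context) float64 { return ctx.Vars[name] }, nil
	}
	num, err := strconv.ParseFloat(tok, 64)
	if err != nil {
		return nil, fmt.Errorf("number or variable expected, got %q", tok)
	}
	return func(*Context) float64 { return num }, nil
}

// parseAdditiveExpr returns a closure that evaluates the parsed expression.
func parseAdditiveExpr(tl *tokenList) (EvalFunc, error) {
	result, err := parseOperand(tl)
	if err != nil {
		return nil, err
	}
	for tl.nextTokenIs("+", "-") {
		op, lhs := tl.readToken(), result
		rhs, err := parseOperand(tl)
		if err != nil {
			return nil, err
		}
		// The operator is chosen once at parse time, not on every evaluation.
		if op == "+" {
			result = func(ctx *Context) float64 { return lhs(ctx) + rhs(ctx) }
		} else {
			result = func(ctx *Context) float64 { return lhs(ctx) - rhs(ctx) }
		}
	}
	return result, nil
}

// compile splits on whitespace (a simplification) and parses the tokens.
func compile(expr string) (EvalFunc, error) {
	tl := &tokenList{toks: strings.Fields(expr)}
	f, err := parseAdditiveExpr(tl)
	if err != nil {
		return nil, err
	}
	if tl.pos < len(tl.toks) {
		return nil, fmt.Errorf("unexpected token %q", tl.toks[tl.pos])
	}
	return f, nil
}

func main() {
	f, err := compile("3 + $variable")
	if err != nil {
		fmt.Println(err)
		return
	}
	// The same function is reused with two different contexts.
	fmt.Println(f(&Context{Vars: map[string]float64{"variable": 4}}))
	fmt.Println(f(&Context{Vars: map[string]float64{"variable": 10}}))
}
```

Because each closure captures its own left and right operand functions, a chain such as "10 - 2 - 3" is evaluated from left to right.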

As an optimization step I can now save these evaluation functions as a replacement for the original input string ("3 + $variable") and skip the first two steps in the XPath parsing process. Since the speedata Publisher often evaluates these XPath expressions more than once, this gives a small speedup during layout processing.
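
One way to picture this optimization is a lookup table from the expression string to its evaluation function. The following Go sketch is only an assumption about how such a cache could look, not the Publisher's implementation. `Cache`, `NewCache`, `Eval`, `Context` and `EvalFunc` are hypothetical names, and the compiler in `main` is a stub that stands in for stages one and two.

```go
package main

import "fmt"

// Context and EvalFunc are hypothetical stand-ins for the parser's types.
type Context struct{ Vars map[string]float64 }
type EvalFunc func(ctx *Context) float64

// Cache keeps one evaluation function per expression string.
type Cache struct {
	compile  func(expr string) (EvalFunc, error) // stages 1 and 2
	compiled map[string]EvalFunc
}

// NewCache creates an empty cache around the given compile function.
func NewCache(compile func(expr string) (EvalFunc, error)) *Cache {
	return &Cache{compile: compile, compiled: map[string]EvalFunc{}}
}

// Eval compiles expr on first use; afterwards only stage 3 is executed.
func (c *Cache) Eval(expr string, ctx *Context) (float64, error) {
	f, ok := c.compiled[expr]
	if !ok {
		var err error
		if f, err = c.compile(expr); err != nil {
			return 0, err
		}
		c.compiled[expr] = f
	}
	return f(ctx), nil
}

func main() {
	calls := 0
	// Stub compiler: counts its calls and always "parses" 3 + $variable.
	stub := func(expr string) (EvalFunc, error) {
		calls++
		return func(ctx *Context) float64 { return 3 + ctx.Vars["variable"] }, nil
	}
	cache := NewCache(stub)
	for _, v := range []float64{1, 2, 3} {
		ctx := &Context{Vars: map[string]float64{"variable": v}}
		result, _ := cache.Eval("3 + $variable", ctx)
		fmt.Println(result)
	}
	fmt.Println("compile calls:", calls)
}
```

The cache works because the evaluation function does not contain any context itself. Everything that changes between evaluations is passed in as an argument.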

Limitations

There are of course some limitations which will hopefully be resolved.

  • Unicode awareness: Lua has no built-in Unicode library, so some parts rely on the Go library through bridging function calls.
  • The same goes for regular expressions. Lua has no “real” regular expressions and the speedata Publisher uses bridging functions to Go to solve these problems.
  • No collations.
  • Implementation is still work in progress. The new parser already provides more XPath functionality than the old (current) parser, so there is no disadvantage in using the new parser.
  • No calculation on dimensions: the old XPath parser allows something like "3in + 12pt" which will not work anymore. There will be functions that allow unit calculations (sd:unit-add(...) or sd:dimexpr() - this has not yet been decided).

What is new?

The new XML module provides a few enhancements over the current implementation:

  • custom XPath functions: you can now define your own functions in the layout XML.
  • speedup: On selected layouts there is a 30% speedup. Don’t expect that much speedup on your layout because there are a lot of things that can slow down PDF generation (big images for example).
  • robustness: parsing is done very close to the official grammar, so the parser is always in a clear state.
  • error checking with line numbers: error messages now contain the line number in the layout XML file.
  • more functionality: there are already more XPath functions implemented compared to the current XPath parser.

How to activate the new parsers?

To use the opt-in new parsers, just run sp with

sp --xpath lxpath

or put

xpath=lxpath

into the configuration file.

The plan is to make this the default in version 5 of the speedata Publisher.

Source code

The source code is included in the speedata Publisher distribution and (for the XPath parser) available as a standalone Lua file on GitHub.