Parallel XSLT Processing of Large XML Documents - XML Prague 2015

Parallel XSLT Processing of Large Documents
Jakub Maly, Barclays
@j_maly
jakub@maly.cz
XML Prague 2015

Reminder on streaming…
 Can now process huge documents in bounded memory
 A whole new area where XSLT is now applicable
 With trade-offs
 stylesheet must follow streamability rules
 limited XPath
 XSLT 3.0 only, only in commercial products
 Large documents take a long time to process
 processing time is dominated by the time required to parse the input

Motivation
 Simple input XML structure, 700 MB in size
 Simple XSLT
 Takes 35 s to process…
<ProteinEntry id="CCMQR">
<header>
<uid>CCMQR</uid>
<accession>A00003</accession>
<created_date>17-Mar-1987</created_date>
<seq-rev_date>17-Mar-1987</seq-rev_date>
<txt-rev_date>03-Mar-2000</txt-rev_date>
</header>
<protein>
<name>cytochrome c</name>
</protein>
...
</ProteinEntry>

Why so long?
 I/O is not a problem (SSDs are fast enough)
 We are using streaming, so memory consumption is constant (bounded)
 The processor runs at 100%
 but on just one of the cores…

Space for optimization?
 Multi-core machines are ubiquitous
 XSLT processor should use all cores if possible
 Parsing + processing in multiple threads
 and then merge the outputs

Trade-offs
 One processor thread can’t see data processed by other threads
 The document has to consist of fairly independent “records”
 that can be processed separately
 As in streaming, we can’t “go back”
 and crutches like accumulators won’t work
 And sometimes we can’t even “go up” (out of the record)

Requirements #1 (input)
 The document has a well-defined structure (schema)
 A major part of the content is in a sequence of nodes of certain types (we will call these core types)
 Core types and their ancestors are not recursive.
 Contents of core types are reasonably independent.
 We expect that processing each record takes a similar amount of time
 Input can be read by multiple threads from random positions

Requirements #2 (stylesheet)
 Streamable
 Explicitly marked templates for core nodes
 Paths in those templates are absolute and use only the child axis and element names
 alternatively: provide a schema
 Only the core node and its subtree can be accessed by XPath
match="/ProteinDatabase/ProteinEntry"
pxsl:core="yes"
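
The path restriction above (absolute, child axis only, element names only) can be checked mechanically. The following is a minimal Python sketch of such a check, not the validation used by pXSLT; the function names and the simplified ASCII name syntax are assumptions made for illustration.

```python
import re

# One path step: "/" followed by an element name, optionally with a prefix.
# Assumption: names are limited to a simple ASCII subset of XML names.
_NAME = r"[A-Za-z_][\w.\-]*"
_STEP = rf"/(?:{_NAME}:)?{_NAME}"
_CORE_PATTERN = re.compile(rf"(?:{_STEP})+")


def is_valid_core_pattern(pattern: str) -> bool:
    """True if the pattern is absolute and built only from child steps
    naming elements (no '//', predicates, wildcards, attributes or
    explicit axes)."""
    return _CORE_PATTERN.fullmatch(pattern.strip()) is not None


def core_path_steps(pattern: str) -> list:
    """Return the element names from the root down to the core node."""
    if not is_valid_core_pattern(pattern):
        raise ValueError(f"not a valid core pattern: {pattern!r}")
    return pattern.strip().lstrip("/").split("/")
```

With the pattern from the slide, `core_path_steps("/ProteinDatabase/ProteinEntry")` yields the two element names, which is all a splitter needs to locate the core nodes without evaluating XPath.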

Special cases
 If we know more about the structure, we can access more data safely, e.g.
 If all core nodes are children of one node
 We can read from “intro” in all threads

Special cases #2
 If not all core nodes are children of one node
 Maybe we could choose a different layer of nodes as core nodes

Parsing problems
 Possible issues when splitting the document
 comments, PIs, CDATA
 Solutions
 report error
 preprocessing
 with a “fast” XML parser
 non-XML-aware
 ?
<ProteinEntry>
...

</ProteinEntry>
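
To illustrate why comments, PIs and CDATA are a problem for a non-XML-aware splitter, here is a small Python sketch. It compares a naive text search for record start tags with a variant that first masks comments, CDATA sections and processing instructions. The function names are hypothetical, and the sketch assumes the text fits in memory and has no DOCTYPE internal subset; a real preprocessor would have to scan the file incrementally.

```python
import re

# Regions in which "<ProteinEntry" is character data, not markup.
_OPAQUE = re.compile(r"<!--.*?-->|<!\[CDATA\[.*?\]\]>|<\?.*?\?>", re.DOTALL)


def naive_record_starts(text, name):
    """Offsets of every textual occurrence of a start tag for `name`."""
    tag = re.compile(rf"<{re.escape(name)}(?=[\s>/])")
    return [m.start() for m in tag.finditer(text)]


def safe_record_starts(text, name):
    """Same search, but with comments, CDATA and PIs blanked out first.
    Blanking with spaces keeps all offsets valid for the original text."""
    masked = _OPAQUE.sub(lambda m: " " * len(m.group()), text)
    return naive_record_starts(masked, name)
```

On a document where a comment or a CDATA section happens to contain `<ProteinEntry>`, the naive search reports extra split points that would produce portions that are not well-formed, while the masked search reports only the real record starts.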

Side-effect problem
 Parallelization can produce unexpected results
 Side-effects defined by the language, e.g. xsl:message
 Could be buffered/concatenated
 Others
 Vendor-specific extensions
 User extensions
 Solutions?
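
The “buffered/concatenated” idea for language-defined side effects such as xsl:message can be sketched as follows: each worker collects its messages instead of emitting them, and the messages are concatenated in portion order once all workers finish. This is a Python illustration only; `process_portion` is a hypothetical stand-in for running the stylesheet on one portion.

```python
from concurrent.futures import ThreadPoolExecutor


def process_portion(index, records):
    """Hypothetical worker: returns (output, messages) and emits nothing
    directly, so the interleaving of threads cannot affect message order."""
    output, messages = [], []
    for record in records:
        messages.append(f"portion {index}: processing {record}")
        output.append(record.upper())
    return output, messages


def run_buffered(portions, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, not completion order.
        results = list(pool.map(process_portion, range(len(portions)), portions))
    outputs = [item for out, _ in results for item in out]
    messages = [msg for _, msgs in results for msg in msgs]
    return outputs, messages
```

The message sequence is then the same as in a single-threaded run. This does not help with vendor-specific or user extensions whose side effects happen outside the processor's control.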

Experimental implementation
 Thin wrapper around Saxon EE 9.6, written in Java
1. Split the document into portions of roughly the same size
2. Turn each portion into well-formed XML (by adding a small prefix/suffix)
3. Run an instance of Saxon on each portion
4. Merge the results when all threads finish
https://github.com/j-maly/pXSLT
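
The four steps above can be sketched in a few lines of Python. This is an illustration of the idea, not the pXSLT code: the function names are invented, `transform` is a stand-in for the Saxon run, and the sketch assumes that all core nodes are children of the root element and that record start tags never occur inside comments, CDATA or PIs. Python threads are used only to show the structure; they would not give the CPU speed-up that Java threads do.

```python
import re
import xml.etree.ElementTree as ET
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor


def split_portions(text, record, parts):
    """Step 1: cut the record sequence into at most `parts` portions of
    roughly equal size, moving each cut forward to the next record start."""
    starts = [m.start() for m in re.finditer(rf"<{re.escape(record)}(?=[\s>/])", text)]
    if not starts:
        return []
    first, end = starts[0], text.rfind("</")  # end = closing tag of the root
    cuts = {first}
    for i in range(1, parts):
        target = first + i * (end - first) // parts
        j = bisect_left(starts, target)
        if j < len(starts):
            cuts.add(starts[j])
    bounds = sorted(cuts) + [end]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


def transform(xml_portion):
    """Stand-in for step 3 (the source runs Saxon here): one line per record."""
    root = ET.fromstring(xml_portion)
    return "".join(f"{rec.get('id')}\n" for rec in root)


def run(text, record, root, parts=4):
    portions = split_portions(text, record, parts)
    # Step 2: add a small prefix/suffix so each portion is well-formed.
    wrapped = [f"<{root}>{p}</{root}>" for p in portions]
    with ThreadPoolExecutor(max_workers=parts) as pool:
        results = list(pool.map(transform, wrapped))  # step 3, in portion order
    return "".join(results)  # step 4: merge
```

Because cuts are moved to record boundaries, the portions concatenate back to the original record sequence, and merging the per-portion outputs in order gives the same result as a sequential run for stylesheets that satisfy the requirements.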

Use Case
 RUIAN = DB of geographical, municipal information, XML
 Prague = 614 MB of data
 Simple format
 Records for streets, buildings, …
 Task: split the large file into individual records (each in one XML file)
 Takes 42 minutes in Saxon EE

Conclusion
 Processing in multiple threads provides measurable speed-up
 Imposes additional limitations on the stylesheet and input
 The described approach makes sense only for large documents
 (for documents that fit into memory, other solutions are already available, e.g. saxon:threads)
https://github.com/j-maly/pXSLT
