Parallel XSLT Processing
of Large Documents
Jakub Maly, Barclays
@j_maly
jakub@maly.cz XML Prague 2015
Reminder on streaming…
 Can now process huge documents in bounded memory
 A whole new area where XSLT is now applicable
 With trade-offs
 stylesheet must follow streamability rules
 limited XPath
 XSLT 3.0 only, only in commercial products
 Large documents take a long time to process
 processing time dominated by the time required to parse the input
Motivation
 Simple input XML structure, 700 MB in size
 Simple XSLT
 Takes 35 s to process…
<ProteinEntry id="CCMQR">
<header>
<uid>CCMQR</uid>
<accession>A00003</accession>
<created_date>17-Mar-1987</created_date>
<seq-rev_date>17-Mar-1987</seq-rev_date>
<txt-rev_date>03-Mar-2000</txt-rev_date>
</header>
<protein>
<name>cytochrome c</name>
</protein>
...
</ProteinEntry>
Why so long?
 I/O is not a problem (SSDs are fast enough)
 We are using streaming, so memory consumption is constant (bounded)
 Processor runs at 100%
 but on just one of the cores…
Space for optimization?
 Multi-core machines are ubiquitous
 XSLT processor should use all cores if possible
 Parsing + processing in multiple threads
 and then merge the outputs
Results
Trade-offs
 One processor thread can’t see data processed by other threads
 The document has to consist of fairly independent “records”
 can be processed separately
 As in streaming, we can’t “go back”
 and crutches like accumulators won’t work
 And sometimes can’t even “go up” (out of the record)
Requirements #1 (input)
 The document has a well-defined structure (schema)
 A major part of the content is in a sequence of nodes
of certain types (we will call these core types)
 Core types and their ancestors are not recursive.
 Contents of core types are reasonably independent.
 We expect that processing each record takes a similar amount of time
 Input can be read by multiple threads from random positions
Requirements #2 (stylesheet)
 Streamable
 Explicitly marked templates for core nodes
 Paths in those templates are absolute and use only child axis
and element names
 alternatively: provide schema
 Only the core node and its subtree can be accessed by XPath
match="/ProteinDatabase/ProteinEntry"
pxsl:core="yes"
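The path restriction above (absolute, child axis only, element names only) can be checked mechanically. A minimal sketch in Python, not taken from pXSLT: the function name `is_valid_core_path` and the decision to allow prefixed names are assumptions made for illustration.

```python
import re

# Assumption: a step is a plain element name, optionally with a namespace prefix.
_NAME = r"[A-Za-z_][\w.\-]*"
# One or more "/name" steps and nothing else: no "//", predicates, "@", "*" or axes.
_CORE_PATH = re.compile(rf"(?:/(?:{_NAME}:)?{_NAME})+")


def is_valid_core_path(match: str) -> bool:
    """Return True if `match` is absolute and uses only child steps with element names."""
    return _CORE_PATH.fullmatch(match.strip()) is not None


if __name__ == "__main__":
    for path in ("/ProteinDatabase/ProteinEntry", "//ProteinEntry", "/a/b[1]"):
        print(path, is_valid_core_path(path))
```

A path that passes this check tells a splitter exactly which ancestors have to be re-created around each portion, which is presumably why the restriction exists.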
Special cases
 If we know more about the structure, we can
access more data safely, e.g.
 If all core nodes are children of one node
 We can read from “intro” in all threads
Special cases #2
 If the core nodes are not all children of one node
 Maybe we could choose a different layer of nodes as core nodes
Parsing problems
 Possible issues when splitting the document
 comments, PIs, CDATA
 Solutions
 report error
 preprocessing
 with “fast” XML parser
 non XML-aware
 ?
<ProteinEntry>
...
<!--
</ProteinEntry>
<ProteinEntry>
...
-->
</ProteinEntry>
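The "non XML-aware" preprocessing option can be sketched as a linear scan that skips comments, PIs and CDATA sections and reports only the record start tags found outside them. This is an illustrative Python sketch, not the pXSLT code; the name `safe_record_starts` is hypothetical, and a DOCTYPE with an internal subset is assumed not to occur.

```python
def safe_record_starts(text: str, record_tag: str) -> list[int]:
    """Offsets of `<record_tag` start tags that lie outside comments, PIs and CDATA."""
    linear = (("<!--", "-->"), ("<![CDATA[", "]]>"), ("<?", "?>"))
    open_tag = "<" + record_tag
    starts = []
    i = 0
    while True:
        i = text.find("<", i)
        if i == -1:
            break
        for opener, closer in linear:
            if text.startswith(opener, i):
                end = text.find(closer, i + len(opener))
                if end == -1:
                    # Unterminated construct: nothing after it is a safe split point.
                    return starts
                i = end + len(closer)  # jump past the whole construct
                break
        else:
            if text.startswith(open_tag, i):
                # Require a delimiter so that e.g. <ProteinEntryList> is not matched.
                after = text[i + len(open_tag): i + len(open_tag) + 1]
                if after in (">", "/", " ", "\t", "\r", "\n"):
                    starts.append(i)
            i += 1
    return starts
```

On the example from the slide, the `<ProteinEntry>` inside the comment is skipped, so a splitter that cuts only at the returned offsets never lands in the middle of a comment.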
Side-effect problem
 Parallelization can produce unexpected results
 Side-effects defined by the language, e.g. xsl:message
 Could be buffered/concatenated
 Others
 Vendor-specific extensions
 User extensions
 Solutions?
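The "buffered/concatenated" idea for language-defined side effects can be sketched as follows: each worker collects its messages instead of emitting them, and the caller concatenates them in input order. This Python sketch uses a stand-in `process_portion` (hypothetical, it only upper-cases its input) in place of a real transformation.

```python
from concurrent.futures import ThreadPoolExecutor


def process_portion(portion: str) -> tuple[str, list[str]]:
    """Stand-in for one transformation run; returns (result, buffered messages)."""
    messages = [f"processing {portion}"]  # buffered instead of being emitted immediately
    return portion.upper(), messages


def run_buffered(portions: list[str], workers: int = 4) -> tuple[str, list[str]]:
    """Process portions concurrently, then merge results and messages in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, regardless of completion order.
        results = list(pool.map(process_portion, portions))
    output = "".join(result for result, _ in results)
    messages = [m for _, msgs in results for m in msgs]
    return output, messages
```

This only covers side effects the wrapper can intercept; vendor-specific and user extensions remain an open question, as the slide notes.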
Experimental implementation
 Thin wrapper around Saxon EE 9.6, written in Java
1. Split the document into portions of roughly the same size
2. Turn each portion into well-formed XML (by adding a small prefix/suffix)
3. Run an instance of Saxon on each portion
4. Merge the results when all threads finish
https://github.com/j-maly/pXSLT
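The four steps above can be sketched in a few lines of Python. This is an illustration of the split / wrap / run / merge structure only, not the pXSLT implementation: `xml.etree.ElementTree` stands in for Saxon, the "stylesheet" just lists record ids, and all function names are hypothetical. It also assumes the record tag never appears inside comments or CDATA.

```python
import re
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def split_portions(body: str, record_tag: str, n: int) -> list[str]:
    """Step 1: cut `body` (records only, no root tags) into at most n portions."""
    starts = [m.start() for m in re.finditer(rf"<{re.escape(record_tag)}[\s>/]", body)]
    if not starts:
        return [body]
    target = len(body) / n
    cuts = [0]
    for k in range(1, n):
        # First record start at or after the ideal cut position.
        nxt = next((s for s in starts if s >= k * target and s > cuts[-1]), None)
        if nxt is not None:
            cuts.append(nxt)
    cuts.append(len(body))
    return [body[a:b] for a, b in zip(cuts, cuts[1:])]


def transform(portion: str, prefix: str, suffix: str) -> str:
    """Steps 2 and 3: make the portion well-formed, then process it."""
    root = ET.fromstring(prefix + portion + suffix)
    return "".join(f"{record.get('id')}\n" for record in root)  # stand-in stylesheet


def parallel_transform(body: str, record_tag: str, prefix: str, suffix: str, n: int = 4) -> str:
    """Step 4: run all portions concurrently and merge the outputs in document order."""
    portions = split_portions(body, record_tag, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        return "".join(pool.map(lambda p: transform(p, prefix, suffix), portions))
```

For example, `parallel_transform(body, "ProteinEntry", "<ProteinDatabase>", "</ProteinDatabase>")` wraps every portion in the root element before parsing it. Note that CPython threads do not run CPU-bound work in parallel, so an actual speed-up would need processes (or, as in the original, a JVM); the sketch only shows the data flow.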
Use Case
 RUIAN = DB of geographical, municipal information, XML
 Prague = 614 MB of data
 Simple format
 Records for streets, buildings, …
 Task: split the large file into
individual records
(each in one XML file)
 Takes 42 minutes in Saxon EE
Conclusion
 Processing in multiple threads provides measurable speed-up
 Imposes additional limitations on the stylesheet and input
 Described approach makes sense only for large documents
 (for documents that fit into memory, other solutions are already
available, e.g. saxon:threads)
https://github.com/j-maly/pXSLT


Editor's Notes

  1. So we are using streaming mode, but we don’t support “real” streaming scenarios.
  2. Remember, we need to avoid parsing the whole document in one thread, because the time to do that can dominate the time of the whole transformation. One option is a lightweight preprocessing XML parser (one that does not really deal with attributes, namespaces, entities, etc.). Another is simpler preprocessing: comments, PIs and CDATA sections are all linear constructs, so we just need to make sure we don’t end up in the middle of one when splitting the document.