XML::XParent
Another way to store XML elements...
Marco Masetti (grubert) - email@example.com firstname.lastname@example.org
Ways of storing XML files
• Plain files, simple scripts to perform XPath queries
  – Trivial; very limited scalability, search and element handling
• DBMS as BLOBs (text)
  – Limited search features, performance and scalability. No inherent element handling.
• DBMS with XML support
  – Document oriented. Not supported by all DBMSs. Features differ between products.
• Native XML databases (Tamino, BaseX, eXist, ...)
  – OK... but then I would need something else to talk about...
• Custom DBMS schemas
  – Data oriented, element handling is trivial, scales very well
Custom DBMS schemas
• Structure mapping:
  – The database schema is designed from knowledge of the XML Schema or DTD
• Model mapping:
  – One fixed database schema for all XML documents, with no help from DTDs or XML Schemas
Structure-mapping schema: XML::RDB!
• Perl module to convert XML files into RDB schemas, and to populate and unpopulate them. You end up with one table per XML element type.
• Pros:
  ● Does what it is meant to do
  ● Quite fast
  ● Works with XML Schemas too
  ● Could eventually treat value types properly
• Cons:
  ● Inherent hierarchical structure is lost
  ● Not good if the XML files belong to different schemas
  ● Does only what it is meant to do...
  ● Not very well maintained...
  ● SQL schemas can easily become unreadable...
Model-mapping schema: XParent!
• XParent is a very simple DBMS schema that can be used to store XML elements
• Does not require the XML schema (schema-oblivious)
• Highly normalized
• Cons: values are stored as text
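To make the idea of one fixed schema for any document concrete, here is a minimal sketch using Python's sqlite3 module as a stand-in data store. Only the table names (LabelPath, Element, Data, DataPath) and the columns Path, PathID, Cid, Pid and Did appear in these slides; every other column, and the helper names `make_store` and `demo`, are assumptions made for illustration.

```python
import sqlite3

# Assumed layout: table names and the Path, PathID, Cid, Pid, Did columns come
# from the slides; ID, Len, Ord and Value are guesses for illustration.
SCHEMA = """
CREATE TABLE LabelPath (ID INTEGER PRIMARY KEY, Len INTEGER, Path TEXT);
CREATE TABLE Element   (Did INTEGER PRIMARY KEY, PathID INTEGER, Ord INTEGER);
CREATE TABLE Data      (Did INTEGER, PathID INTEGER, Ord INTEGER, Value TEXT);
CREATE TABLE DataPath  (Pid INTEGER, Cid INTEGER);
"""


def make_store():
    """Create an empty in-memory store with the sketched schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def demo():
    """Store <a><b>hello</b></a> without knowing anything about its schema."""
    conn = make_store()
    # Two distinct label paths, one per element type position.
    conn.executemany("INSERT INTO LabelPath VALUES (?, ?, ?)",
                     [(1, 1, "/a"), (2, 2, "/a/b")])
    # Two elements, each pointing at its label path.
    conn.executemany("INSERT INTO Element VALUES (?, ?, ?)",
                     [(1, 1, 1), (2, 2, 1)])
    # Element 1 (a) is the parent of element 2 (b).
    conn.execute("INSERT INTO DataPath VALUES (1, 2)")
    # The text value is stored as plain text, whatever its real type.
    conn.execute("INSERT INTO Data VALUES (2, 2, 1, 'hello')")
    return conn
```

A different document with different tags would go into the same four tables; only the rows in LabelPath would change.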
The XML::XParent module
• Perl module to handle XML documents on an XParent schema
• Can load any XML file into the same SQL schema
• Plugins can be registered for custom logic on elements
• Provides utilities to:
  ● Create the XParent schema for SQLite and PostgreSQL
  ● Parse and load an XML file (xparent-parse.pl)
  ● Query the XParent schema (xparent-search.pl)
• Classes:
  ● XML::XParent::Parser: XML parser based on XML::Twig
  ● XML::XParent::Parser::Plugin: base interface class to be implemented by any plugin
  ● XML::XParent::Schema: base class (interface) to the XParent schema
  ● XML::XParent::Elem: class that describes an XML element
XML::XParent::Schema drivers
• The XML::XParent::Schema class implements the Driver/Interface pattern, so custom drivers can be implemented for specific data stores
• 2 generic drivers implemented so far:
  – XML::XParent::Schema::DBIx: driver implementation based on DBIx::Class
    ● All the advantages of an ORM (but who cares?)
    ● Quite slow!
  – XML::XParent::Schema::DBI: driver implementation based on DBI
    ● Direct integration with the data store
    ● Much faster...
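The Driver/Interface pattern mentioned above can be sketched generically as follows. This is not the module's Perl code: the class names `SchemaDriver`, `ListDriver` and `Schema`, the method names, and the `DRIVERS` registry are all hypothetical, chosen only to show a front-end class delegating to a driver selected by name.

```python
from abc import ABC, abstractmethod


class SchemaDriver(ABC):
    """Hypothetical driver interface; the method names are illustrative only."""

    @abstractmethod
    def add_element(self, path: str, value: str) -> None:
        """Store one element."""

    @abstractmethod
    def paths(self) -> list[str]:
        """Return the paths of all stored elements."""


class ListDriver(SchemaDriver):
    """Toy driver that keeps everything in a Python list."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, str]] = []

    def add_element(self, path: str, value: str) -> None:
        self._rows.append((path, value))

    def paths(self) -> list[str]:
        return [path for path, _ in self._rows]


# Registry mapping a driver name to its class (an assumption of this sketch).
DRIVERS = {"List": ListDriver}


class Schema:
    """Front end: callers pick a driver by name and never touch it directly."""

    def __init__(self, driver: str) -> None:
        self._driver = DRIVERS[driver]()

    def add_element(self, path: str, value: str) -> None:
        self._driver.add_element(path, value)

    def paths(self) -> list[str]:
        return self._driver.paths()
```

Adding a driver for another data store would then mean writing one more `SchemaDriver` subclass and registering it, with no change to calling code.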
The quest for speed...
● Tests performed on my laptop:
  ● CPU0: Intel(R) Core(TM) i5 CPU M 540 @ 2.53GHz stepping 05
  ● CPU1: Intel(R) Core(TM) i5 CPU M 540 @ 2.53GHz stepping 05
● Reference XML file:
  ● Size: 45 MB
  ● XML elements: ~600,000
● Reference DBMS: PostgreSQL 8.4.13
● Parsing of the reference file with the DBIx driver:
  ● perl xparent-parse.pl i <ref.xml> driver DBIx
  ● Execution time: > 3000 mins!!!
● Parsing of the reference file with the DBI driver:
  ● perl xparent-parse.pl i <ref.xml> driver DBI
  ● Execution time: ~400 mins
...But then...
● I realized loading times were diverging!
● I realized there was a stupid error in the implementation of the algorithm...
[Chart: execution time in minutes, log scale (DBIx / DBI). Ref. impl.: 3000 / 400; Algo patched: 177 / 28]
...But then...
● I realized that records in the Data and DataPath tables are not referenced by any other record...
● They do not need to be inserted one at a time...
● => Bulk Loading!!!
● ...given N elements, how many records do we have in the DataPath table?
Bulk Loading
• Saves a lot of time storing data:
  DBI: bulk loading of 1000000 records
    All at once:      50.462398 wallclock seconds
    Chunks of 1000:   31.157044 wallclock seconds
    Chunks of 2000:   27.747248 wallclock seconds
    Chunks of 5000:   28.209256 wallclock seconds
    Chunks of 10000:  26.334099 wallclock seconds
• Distinct inserts of 1000000 records:
    Elapsed time: 250.563282 wallclock seconds
[Chart: execution time in minutes, log scale (DBIx / DBI). Ref. impl.: 3000 / 400; Algo patched: 177 / 28; Bulk loading: 98 / 16]
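The chunked loading measured above can be sketched like this. The slides used Perl DBI against PostgreSQL; this sketch swaps in Python's sqlite3 module, so it illustrates the technique only and will not reproduce the timings. The function names and the default chunk size are assumptions.

```python
import sqlite3


def chunks(rows, size):
    """Yield successive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def bulk_load(conn, rows, chunk_size=2000):
    """Insert rows chunk by chunk, one multi-row call and one commit per chunk."""
    for chunk in chunks(rows, chunk_size):
        conn.executemany("INSERT INTO DataPath (Pid, Cid) VALUES (?, ?)", chunk)
        conn.commit()


def demo_bulk(n=10000):
    """Load n parent/child pairs and return the resulting row count."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE DataPath (Pid INTEGER, Cid INTEGER)")
    rows = [(i, i + 1) for i in range(n)]
    bulk_load(conn, rows)
    return conn.execute("SELECT COUNT(*) FROM DataPath").fetchone()[0]
```

This works for DataPath and Data precisely because no other record refers to their rows, so nothing needs the generated keys back before the next insert.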
...But then...
• For each element we have to check whether its path already exists...
• Much better to cache the paths in a hash than to go back and forth to the DB...
[Chart: execution time in minutes, log scale (DBIx / DBI). Ref. impl.: 3000 / 400; Algo patched: 177 / 28; Bulk loading: 98 / 16; Cached paths: 41 / 12]
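A minimal sketch of the path cache idea, assuming a LabelPath table with an auto-assigned ID and a Path column (the `PathCache` class and its `db_lookups` counter are hypothetical names for this illustration). Each distinct path costs one round trip to the database; every repeat is answered from the dictionary.

```python
import sqlite3


class PathCache:
    """Remember path -> ID pairs so each distinct path hits the DB only once."""

    def __init__(self, conn):
        self.conn = conn
        self.ids = {}          # path string -> LabelPath ID
        self.db_lookups = 0    # how many times we actually went to the DB

    def path_id(self, path):
        if path in self.ids:
            return self.ids[path]
        self.db_lookups += 1
        row = self.conn.execute(
            "SELECT ID FROM LabelPath WHERE Path = ?", (path,)
        ).fetchone()
        if row is None:
            # Unknown path: insert it and use the key SQLite assigned.
            cur = self.conn.execute(
                "INSERT INTO LabelPath (Path) VALUES (?)", (path,)
            )
            path_id = cur.lastrowid
        else:
            path_id = row[0]
        self.ids[path] = path_id
        return path_id


def demo_cache():
    """Resolve five paths (two distinct) and report IDs and DB round trips."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE LabelPath (ID INTEGER PRIMARY KEY, Path TEXT)")
    cache = PathCache(conn)
    ids = [cache.path_id(p) for p in ["/a", "/a/b", "/a/b", "/a", "/a/b"]]
    return ids, cache.db_lookups
```

XML documents usually have far fewer distinct paths than elements, which is why the cache pays off.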
...But then...
• Added some indexes:
  • CREATE INDEX LabelPath_Path ON LabelPath (Path);
  • CREATE INDEX Element_PathID ON Element (PathID);
  • CREATE INDEX DataPath_Cid ON DataPath (Cid);
  • CREATE INDEX DataPath_Pid ON DataPath (Pid);
  • CREATE INDEX Data_Did ON Data (Did);
[Chart: execution time in minutes, log scale (DBIx / DBI). Ref. impl.: 3000 / 400; Algo patched: 177 / 28; Bulk loading: 98 / 16; Cached paths: 41 / 12; + Indexes: 29 / 8]
...But then...
• Realized I could "compact" records...

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <Mpeg7 xmlns="http://www.mpeg7.org/2001/MPEG-7_Schema"
           xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance">
      <DescriptionUnit xsi:type="DescriptorCollectionType">
        <Descriptor size="5" xsi:type="DominantColorType">
          <ColorSpace type="HSV" colorReferenceFlag="false"/>
          <SpatialCoherency>0</SpatialCoherency>
          <Values>
            <Percentage>2</Percentage>
            <Index>10 6 0</Index>
          </Values>
          <Values>
            <Percentage>15</Percentage>
            <Index>6 16 9</Index>
          </Values>
          <Values>
            <Percentage>3</Percentage>
            <Index>7 18 4</Index>
          </Values>
        </Descriptor>
      </DescriptionUnit>
    </Mpeg7>

• Saves another 20%-30%...
• Needs some logic at query time (experimental)...
To cut a very long story short...

Time (mins) to load ~600,000 XML elements

         Reference   Algo      Bulk      Cached   Indexes   Compact
         impl.       patched   loading   paths
  DBIx   > 3000      177       98        41       29        22
  DBI    ~400        28        16        12       8         6

● ...and we still have to do:
  ● Code profiling...
  ● Specific DBMS techniques...
  ● Use MapReduce to split jobs among several workers...
About retrieval...
• At first I tried implementing an XPath-to-SQL translator
• Found it very, very hard...
• ...and almost useless
• ...use the power of SQL to express what you want!
• XML::XParent provides an API (get_elem) to query for the set of elements whose paths match a given SQL regex. The API returns a set of XML::XParent::Elem objects.
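The kind of query behind path-based retrieval can be sketched as a join between Element and LabelPath filtered by a regular expression on the path. This is not the module's get_elem implementation: `find_elements`, `open_store` and the sample rows are hypothetical, and because SQLite has no built-in REGEXP function the sketch registers one from Python (PostgreSQL offers regex matching natively).

```python
import re
import sqlite3


def _regexp(pattern, value):
    """Back SQLite's `X REGEXP Y` operator, which calls regexp(Y, X)."""
    if value is None:
        return 0
    return 1 if re.search(pattern, value) else 0


def open_store():
    """Build a tiny in-memory store with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("REGEXP", 2, _regexp)
    conn.executescript("""
        CREATE TABLE LabelPath (ID INTEGER PRIMARY KEY, Path TEXT);
        CREATE TABLE Element   (Did INTEGER PRIMARY KEY, PathID INTEGER);
        INSERT INTO LabelPath VALUES
            (1, '/Mpeg7'),
            (2, '/Mpeg7/DescriptionUnit'),
            (3, '/Mpeg7/DescriptionUnit/Descriptor'),
            (4, '/Mpeg7/DescriptionUnit/Descriptor/Values');
        INSERT INTO Element VALUES (1, 1), (2, 2), (3, 3), (4, 4), (5, 4);
    """)
    return conn


def find_elements(conn, path_regex):
    """Return (Did, Path) for every element whose path matches the regex."""
    sql = """
        SELECT e.Did, lp.Path
        FROM Element e
        JOIN LabelPath lp ON lp.ID = e.PathID
        WHERE lp.Path REGEXP ?
        ORDER BY e.Did
    """
    return conn.execute(sql, (path_regex,)).fetchall()
```

For example, `find_elements(open_store(), r"/Values$")` selects the two `Values` elements of the sample data, with no XPath translation involved.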
XML::XParent utilities: how to use them
• Configure parameters in the xparent.yml file:
    schema_params:
        dbi:Pg:dbname=xparent
        # dbi:SQLite:xparent.db
        grubert
        grubert
        AutoCommit: 1
    #plugins:
    #   SLMS::Redis::ParserPlugin:
    #       tag: MovingRegion
• To load an XML file:
    perl xparent-parse.pl i <input file>
        driver <the Schema driver to use>
        [config_file <the config file>]
        [verbose]
        [clean]
        [compact]
• To query the XParent data store:
    perl xparent-search.pl path <path regex>
        driver <the Schema driver to use>
        [config_file <the config file>]
• To clean the data store:
    perl xparent-clean.pl driver <the Schema driver to use>
        [config_file <the config file>]