XML::XParent
Another way to store XML elements...

             Marco Masetti(grubert) - masetti@linux.it
                                     grubert65@gmail.com
Ways of storing XML files
• Plain files, simple scripts to perform XPath
  queries
 – Trivial, but very limited scalability, search and element handling
• DBMS as BLOBs (text)
 – Limited search features, performance and scalability. No
   inherent element handling.
• DBMS with XML support
 – Document oriented. Not supported by all. Different features
   provided.
• Native XML databases (Tamino, BaseX, eXist, ...)
 – OK… but then I would need something else to talk about…
• Custom DBMS schemas
 – Data oriented, trivial element handling, scales very well
Custom DBMS schemas

• Structure mapping:
 – the design of the database schema is based on the
   understanding of XML Schema or DTDs

• Model mapping:
 – A fixed database schema for all XML documents,
   without the assistance of DTDs or XML Schemas
Structure-mapping schema: XML::RDB!
• Perl module that converts XML files into RDB schemas, then
  populates and unpopulates them. You end up with one table
  per XML element type.
• Pros:
  ● Does what it is meant to do
  ● Quite fast
  ● Works with XML Schemas too
  ● Could eventually treat value types properly
• Cons:
  ● Inherent hierarchical structure is lost
  ● Not good if the XML files belong to different schemas
  ● Does only what it is meant to do...
  ● Not very well maintained...
  ● SQL schemas can easily become unreadable...
Model-mapping schema: XParent !

• XParent is a very simple DBMS schema that can be
  used to store XML elements
• Does not require the XML schema (Schema-oblivious)
• Highly normalized
• Cons:
  ● Values are stored as text
XParent: how it works...
Sample XML document:

  <?xml version="1.0" encoding="ISO-8859-1"?>
  <Mpeg7 xmlns="http://www.mpeg7.org/2001/MPEG-7_Schema"
         xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance">
    <DescriptionUnit xsi:type="DescriptorCollectionType">
      <Descriptor size="5" xsi:type="DominantColorType">
        <ColorSpace type="HSV" colorReferenceFlag="false"/>
        <SpatialCoherency>0</SpatialCoherency>
        <Values>
          <Percentage>2</Percentage>
          <Index>10 6 0</Index>
        </Values>
        <Values>
          <Percentage>15</Percentage>
          <Index>6 16 9</Index>
        </Values>
        <Values>
          <Percentage>3</Percentage>
          <Index>7 18 4</Index>
        </Values>
      </Descriptor>
    </DescriptionUnit>
  </Mpeg7>

Table LabelPath
  id | len | path
 ----+-----+------------------------------------------------------------------
   1 |   4 | /Mpeg7/DescriptionUnit/Descriptor/ColorSpace
   2 |   5 | /Mpeg7/DescriptionUnit/Descriptor/ColorSpace/@colorReferenceFlag
   3 |   5 | /Mpeg7/DescriptionUnit/Descriptor/ColorSpace/@type

Table Element
  did | pathid | ordinal
 -----+--------+---------
    1 |      1 |       1
    2 |      2 |       1
    3 |      3 |       2

Table Data
  did | pathid | ordinal | value
 -----+--------+---------+-------
    2 |      2 |       1 | false
    3 |      3 |       2 | HSV

Table DataPath
  pid | cid
 -----+-----
    1 |   2
    1 |   3
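The four tables can be tried out with a minimal sketch like the one below. It uses Python's sqlite3 module rather than the Perl module itself; table and column names come from the slide, while the column types and the join query are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column types are assumed; the slide only shows column names.
conn.executescript("""
CREATE TABLE LabelPath (id INTEGER PRIMARY KEY, len INTEGER, path TEXT);
CREATE TABLE Element   (did INTEGER PRIMARY KEY, pathid INTEGER, ordinal INTEGER);
CREATE TABLE Data      (did INTEGER, pathid INTEGER, ordinal INTEGER, value TEXT);
CREATE TABLE DataPath  (pid INTEGER, cid INTEGER);
""")

base = "/Mpeg7/DescriptionUnit/Descriptor/ColorSpace"

# Same rows as in the tables above.
conn.executemany("INSERT INTO LabelPath VALUES (?, ?, ?)", [
    (1, 4, base),
    (2, 5, base + "/@colorReferenceFlag"),
    (3, 5, base + "/@type"),
])
conn.executemany("INSERT INTO Element VALUES (?, ?, ?)",
                 [(1, 1, 1), (2, 2, 1), (3, 3, 2)])
conn.executemany("INSERT INTO Data VALUES (?, ?, ?, ?)",
                 [(2, 2, 1, "false"), (3, 3, 2, "HSV")])
conn.executemany("INSERT INTO DataPath VALUES (?, ?)", [(1, 2), (1, 3)])

# Fetch the children of element 1 (ColorSpace) together with their values.
rows = conn.execute("""
    SELECT lp.path, d.value
    FROM DataPath dp
    JOIN Data d       ON d.did = dp.cid
    JOIN LabelPath lp ON lp.id = d.pathid
    WHERE dp.pid = 1
    ORDER BY d.ordinal
""").fetchall()

for path, value in rows:
    print(path, "=", value)
```

DataPath carries the parent/child links, so the hierarchy is walked with plain joins and no knowledge of the XML schema is needed.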
The XML::XParent module
• Perl module to handle XML documents on an XParent
  schema
• Can load any XML file into the same SQL schema
• Plugins can be registered for custom logic on elements
• Provides utilities to:
  ● Create the XParent schema for SQLite and PostgreSQL
  ● Parse and load an XML file ( xparent-parse.pl )
  ● Query the XParent schema ( xparent-search.pl )
• Classes:
  ● XML::XParent::Parser: XML parser based on XML::Twig
  ● XML::XParent::Parser::Plugin: base interface class to
    be implemented by any plugin
  ● XML::XParent::Schema: base class (interface) to the
    XParent schema
  ● XML::XParent::Elem: class that describes an XML
    element
XML::XParent::Schema drivers

• The XML::XParent::Schema class implements the
  Driver/Interface pattern, so custom drivers can
  be implemented for specific data stores
• 2 generic drivers implemented so far:
  ● XML::XParent::Schema::DBIx: driver implementation based on
    DBIx::Class
    – All the advantages of an ORM (but who cares?)
    – Quite slow!
  ● XML::XParent::Schema::DBI: driver implementation
    based on DBI
    – Direct integration with the data store
    – Much faster...
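The Driver/Interface idea can be illustrated with a short Python sketch. It is not the Perl implementation: the `register`/`create` helpers, the `store_path` method and the in-memory driver are all hypothetical names chosen for this example.

```python
from abc import ABC, abstractmethod


class Schema(ABC):
    """Interface every driver implements (method name is hypothetical)."""

    _drivers = {}

    @classmethod
    def register(cls, name):
        """Class decorator that makes a driver selectable by name."""
        def decorator(driver_cls):
            cls._drivers[name] = driver_cls
            return driver_cls
        return decorator

    @classmethod
    def create(cls, name, **params):
        """Instantiate the driver registered under `name`."""
        if name not in cls._drivers:
            raise ValueError(f"unknown driver: {name}")
        return cls._drivers[name](**params)

    @abstractmethod
    def store_path(self, path):
        """Return the id of a label path, storing it if needed."""


@Schema.register("Memory")
class MemorySchema(Schema):
    """Toy driver that keeps label paths in a dict instead of a DBMS."""

    def __init__(self):
        self.paths = {}

    def store_path(self, path):
        # New paths get the next free id; known paths keep their id.
        return self.paths.setdefault(path, len(self.paths) + 1)


schema = Schema.create("Memory")
first = schema.store_path("/Mpeg7")
second = schema.store_path("/Mpeg7/DescriptionUnit")
again = schema.store_path("/Mpeg7")
print(first, second, again)
```

Calling code only ever sees the `Schema` interface, which is what lets a command-line switch such as `--driver` pick the backend at run time.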
The quest for speed...

● Tests performed on my laptop:
    ● CPU0: Intel(R) Core(TM) i5 CPU M 540 @ 2.53GHz stepping 05
    ● CPU1: Intel(R) Core(TM) i5 CPU M 540 @ 2.53GHz stepping 05
● Reference XML file:
    ● Size: 45 MB
    ● XML elements: ~600,000
● Reference DBMS: PostgreSQL 8.4.13
● Parsing of the reference file with the DBIx driver:
    ● perl xparent-parse.pl -i <ref.xml> --driver DBIx
    ● Execution time: > 3000 mins !!!
● Parsing of the reference file with the DBI driver:
    ● perl xparent-parse.pl -i <ref.xml> --driver DBI
    ● Execution time: ~ 400 mins.
...But then...

  ● I realized loading times were divergent!
  ● I realized there was a stupid error in the implementation of
    the algorithm...
[Chart: Exec Time (log t), in minutes, DBIx / DBI. Ref. Impl.: 3000 / 400; Algo patched: 177 / 28]
...But then...

● I realized that records in the Data and DataPath tables are not
  referenced by any other record...
● They do not need to be inserted one at a time...
● => Bulk Loading!!!
● ...given N elements, how many records do we have in the
  DataPath table?
Bulk Loading
• Saves a lot of time storing data:
  --- DBI: Bulk loading of 1000000 records ---
  All in once:     50.462398 wallclock seconds
  Chunks of 1000:  31.157044 wallclock seconds
  Chunks of 2000:  27.747248 wallclock seconds
  Chunks of 5000:  28.209256 wallclock seconds
  Chunks of 10000: 26.334099 wallclock seconds
• Distinct inserts of 1000000 records:
  Elapsed time: 250.563282 wallclock seconds

[Chart: Exec Time (log t), in minutes, DBIx / DBI. Ref. Impl.: 3000 / 400; Algo patched: 177 / 28; Bulk Loading: 98 / 16]
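Chunked bulk loading can be sketched as follows. This is a Python/sqlite3 illustration, not the module's DBI code: the `bulk_insert` helper and the chunk size are assumptions, and the real drivers may use a different mechanism. The sample data is a simple chain of elements, where every element except the root has exactly one parent, so N elements yield N - 1 DataPath records.

```python
import sqlite3
from itertools import islice


def bulk_insert(conn, rows, chunk_size=1000):
    """Insert (pid, cid) pairs in chunks, one transaction per chunk."""
    iterator = iter(rows)
    total = 0
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        with conn:  # commits once per chunk instead of once per row
            conn.executemany(
                "INSERT INTO DataPath (pid, cid) VALUES (?, ?)", chunk
            )
        total += len(chunk)
    return total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DataPath (pid INTEGER, cid INTEGER)")

# A chain of N elements: element i is the parent of element i + 1.
n_elements = 2500
inserted = bulk_insert(conn, ((i, i + 1) for i in range(1, n_elements)))

count = conn.execute("SELECT COUNT(*) FROM DataPath").fetchone()[0]
print(inserted, count)
```

Because nothing references these rows, they can be buffered in memory and flushed in batches without worrying about ids being needed by later inserts.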
...But then...
    • For each element we have to check whether its path
      already exists...
    • Much better to cache it in a hash than to go back
      and forth to the DB...
[Chart: Exec Time (log t), in minutes, DBIx / DBI. Ref. Impl.: 3000 / 400; Algo patched: 177 / 28; Bulk Loading: 98 / 16; Cached Paths: 41 / 12]
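The path cache can be sketched in a few lines. This is a Python illustration with a dict standing in for the Perl hash; the `PathCache` class and its `db_lookups` counter are hypothetical, and the `len` value is assumed to be the number of steps in the path, which matches the LabelPath sample rows.

```python
import sqlite3


class PathCache:
    """Maps label paths to ids, hitting the database only on a cache miss."""

    def __init__(self, conn):
        self.conn = conn
        self.ids = {}
        self.db_lookups = 0  # counts how often we had to go to the DB

    def path_id(self, path):
        cached = self.ids.get(path)
        if cached is not None:
            return cached
        self.db_lookups += 1
        row = self.conn.execute(
            "SELECT id FROM LabelPath WHERE path = ?", (path,)
        ).fetchone()
        if row is None:
            # Path not stored yet: insert it and use the generated id.
            cursor = self.conn.execute(
                "INSERT INTO LabelPath (len, path) VALUES (?, ?)",
                (path.count("/"), path),
            )
            path_id = cursor.lastrowid
        else:
            path_id = row[0]
        self.ids[path] = path_id
        return path_id


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE LabelPath (id INTEGER PRIMARY KEY, len INTEGER, path TEXT)"
)

cache = PathCache(conn)
paths = ["/Mpeg7", "/Mpeg7/DescriptionUnit", "/Mpeg7", "/Mpeg7/DescriptionUnit"]
ids = [cache.path_id(p) for p in paths]
print(ids, cache.db_lookups)
```

XML documents usually contain far fewer distinct paths than elements, so most lookups are served from memory.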
...But then...
• Added some indexes:
    CREATE INDEX LabelPath_Path ON LabelPath (Path);
    CREATE INDEX Element_PathID ON Element (PathID);
    CREATE INDEX DataPath_Cid ON DataPath (Cid);
    CREATE INDEX DataPath_Pid ON DataPath (Pid);
    CREATE INDEX Data_Did ON Data (Did);
[Chart: Exec Time (log t), in minutes, DBIx / DBI. Ref. Impl.: 3000 / 400; Algo patched: 177 / 28; Bulk Loading: 98 / 16; Cached Paths: 41 / 12; + Indexes: 29 / 8]
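Whether a lookup actually benefits from one of these indexes can be checked by asking the database for its query plan. The sketch below does this for the LabelPath index using SQLite's EXPLAIN QUERY PLAN from Python; the reference tests ran on PostgreSQL, whose planner and EXPLAIN output differ, so this only illustrates the idea. The helper name is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE LabelPath (id INTEGER PRIMARY KEY, len INTEGER, path TEXT)"
)


def plan_for_path_lookup(conn):
    """Return SQLite's textual plan for looking up a path id."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT id FROM LabelPath WHERE path = ?",
        ("/Mpeg7",),
    ).fetchall()
    # The human-readable description is the fourth column of each row.
    return " ".join(row[3] for row in rows)


before = plan_for_path_lookup(conn)
conn.execute("CREATE INDEX LabelPath_Path ON LabelPath (path)")
after = plan_for_path_lookup(conn)

print("before:", before)
print("after: ", after)
```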
...But then...
• Realized I could “compact” records...
  <?xml version="1.0" encoding="ISO-8859-1"?>
  <Mpeg7 xmlns="http://www.mpeg7.org/2001/MPEG-7_Schema"
         xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance">
    <DescriptionUnit xsi:type="DescriptorCollectionType">
      <Descriptor size="5" xsi:type="DominantColorType">
        <ColorSpace type="HSV" colorReferenceFlag="false"/>
        <SpatialCoherency>0</SpatialCoherency>
        <Values>
          <Percentage>2</Percentage>
          <Index>10 6 0</Index>
        </Values>
        <Values>
          <Percentage>15</Percentage>
          <Index>6 16 9</Index>
        </Values>
        <Values>
          <Percentage>3</Percentage>
          <Index>7 18 4</Index>
        </Values>
      </Descriptor>
    </DescriptionUnit>
  </Mpeg7>



Saves another 20%-30%...
Needs some logic at query time (experimental)...
To cut a very long story short...
      Time (mins) to load ~600,000 XML elements

             Reference   Algo      Bulk      Cached   Indexes   Compact
                         patched   loading   paths
      DBIx   > 3000      177       98        41       29        22
      DBI    ~400        28        16        12       8         6




● ...and we still have to do:
    ● Code profiling...
    ● Specific DBMS techniques...
    ● Use MapReduce to split jobs among several
      workers...
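The gains in the table can be turned into speed-up factors with a few lines of arithmetic. The sketch below takes the figures as given, treating "> 3000" as 3000 (so the DBIx factor is a lower bound) and "~400" as 400; the helper names are made up for this example.

```python
steps = ["Reference", "Algo patched", "Bulk loading",
         "Cached paths", "Indexes", "Compact"]

# Minutes from the table; "> 3000" and "~400" are taken at face value.
minutes = {
    "DBIx": [3000, 177, 98, 41, 29, 22],
    "DBI": [400, 28, 16, 12, 8, 6],
}


def overall_speedup(times):
    """Ratio between the first and the last measurement."""
    return times[0] / times[-1]


def step_speedups(times):
    """Speed-up contributed by each optimization step."""
    return [round(before / after, 2) for before, after in zip(times, times[1:])]


dbi_overall = round(overall_speedup(minutes["DBI"]), 1)
dbix_overall = round(overall_speedup(minutes["DBIx"]), 1)
dbi_steps = step_speedups(minutes["DBI"])

for name, factor in zip(steps[1:], dbi_steps):
    print(f"DBI, {name}: x{factor}")
print("DBI overall:", dbi_overall, "DBIx overall (at least):", dbix_overall)
```

By far the largest single factor comes from fixing the algorithm; the later steps each contribute a factor between roughly 1.3 and 1.8 for the DBI driver.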
About retrieval...

• At first I tried implementing an XPath-to-SQL
  translator
• Found it very, very hard...
• ...and almost useless
• ...use the power of SQL to express what you
  want!
• XML::XParent provides an API (get_elem) to
  query for a set of elements whose paths match
  a given SQL regex. The API returns a set of
  XML::XParent::Elem objects.
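The kind of query behind such an API can be sketched with Python and sqlite3. This is not the signature or implementation of get_elem: the `find_elems` function and the `Elem` tuple are hypothetical, and a SQL LIKE pattern stands in for the "SQL regex" (PostgreSQL also offers real regular-expression operators).

```python
import sqlite3
from typing import NamedTuple, Optional


class Elem(NamedTuple):
    """Minimal stand-in for an element object (fields are assumptions)."""
    did: int
    path: str
    ordinal: int
    value: Optional[str]


def find_elems(conn, path_pattern):
    """Return all elements whose label path matches a LIKE pattern."""
    sql = """
        SELECT e.did, lp.path, e.ordinal, d.value
        FROM Element e
        JOIN LabelPath lp ON lp.id = e.pathid
        LEFT JOIN Data d  ON d.did = e.did
        WHERE lp.path LIKE ?
        ORDER BY e.did
    """
    return [Elem(*row) for row in conn.execute(sql, (path_pattern,))]


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE LabelPath (id INTEGER PRIMARY KEY, len INTEGER, path TEXT);
CREATE TABLE Element   (did INTEGER PRIMARY KEY, pathid INTEGER, ordinal INTEGER);
CREATE TABLE Data      (did INTEGER, pathid INTEGER, ordinal INTEGER, value TEXT);
""")

base = "/Mpeg7/DescriptionUnit/Descriptor/ColorSpace"
conn.executemany("INSERT INTO LabelPath VALUES (?, ?, ?)", [
    (1, 4, base),
    (2, 5, base + "/@colorReferenceFlag"),
    (3, 5, base + "/@type"),
])
conn.executemany("INSERT INTO Element VALUES (?, ?, ?)",
                 [(1, 1, 1), (2, 2, 1), (3, 3, 2)])
conn.executemany("INSERT INTO Data VALUES (?, ?, ?, ?)",
                 [(2, 2, 1, "false"), (3, 3, 2, "HSV")])

# All attributes of ColorSpace elements.
attributes = find_elems(conn, "%/ColorSpace/@%")
for elem in attributes:
    print(elem.path, "=", elem.value)
```

The LEFT JOIN keeps elements that have no row in Data, such as the ColorSpace element itself, and returns them with an empty value.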
XML::XParent utilities: how to use them
• Configure parameters in the xparent.yml file:
    ---
    schema_params:
        - 'dbi:Pg:dbname=xparent'
    #    - 'dbi:SQLite:xparent.db'
        - grubert
        - grubert
        -
            AutoCommit: 1
    #plugins:
    #    'SLMS::Redis::ParserPlugin':
    #        'tag': 'MovingRegion'
• To load an XML file:
    perl xparent-parse.pl
        -i <input file>
        --driver <the Schema driver to use>
        [--config_file <the config file>]
        [--verbose]
        [--clean]
        [--compact]
• To query the XParent data store:
    perl xparent-search.pl
        --path <path regex>
        --driver <the Schema driver to use>
        [--config_file <the config file>]
• To clean the data store:
    perl xparent-clean.pl
        --driver <the Schema driver to use>
        [--config_file <the config file>]
Contribute!

https://github.com/grubert65/XParent-Perl.git
Thank You !!!!
