XML::XParent - Yet another way to store XML files

XParent is a simple SQL schema for storing XML elements. XML::XParent is a Perl module that provides an API to store XML files in, and retrieve XML elements from, an XParent data store.

Transcript

  • 1. XML::XParent: Another way to store XML elements...
      Marco Masetti (grubert) - masetti@linux.it, grubert65@gmail.com
  • 2. Ways of storing XML files
      • Plain files, simple scripts to perform XPath queries
          – Trivial; very limited scalability, search and element handling
      • DBMS as BLOBs (text)
          – Limited search features, performance and scalability. No inherent element handling.
      • DBMS with XML support
          – Document oriented. Not supported by all DBMSs. Features differ between products.
      • Native XML databases (Tamino, BaseX, eXist, ...)
          – OK... but then I would need something else to talk about...
      • Custom DBMS schemas
          – Data oriented, element handling is trivial, scale very well
  • 3. Custom DBMS schemas
      • Structure mapping:
          – The database schema is designed from an understanding of the XML Schema or DTD
      • Model mapping:
          – A fixed database schema for all XML documents, without the assistance of DTDs or XML Schemas
  • 4. Structure-mapping schema: XML::RDB!
      • Perl module that converts XML files into RDB schemas, and populates and unpopulates them. You end up with one table per XML element type.
      • Pros:
          ● Does what it is meant to do
          ● Quite fast
          ● Works with XML Schemas too
          ● Could eventually treat value types properly
      • Cons:
          ● Inherent hierarchical structure is lost
          ● Not good if the XML files belong to different schemas
          ● Does only what it is meant to do...
          ● Not very well maintained...
          ● SQL schemas can easily become unreadable...
  • 5. Model-mapping schema: XParent!
      • XParent is a very simple DBMS schema that can be used to store XML elements
      • Does not require the XML schema (schema-oblivious)
      • Highly normalized
      • Cons:
          ● Values are stored as text
  • 6. XParent: how it works...
        <?xml version="1.0" encoding="ISO-8859-1"?>
        <Mpeg7 xmlns="http://www.mpeg7.org/2001/MPEG-7_Schema"
               xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance">
          <DescriptionUnit xsi:type="DescriptorCollectionType">
            <Descriptor size="5" xsi:type="DominantColorType">
              <ColorSpace type="HSV" colorReferenceFlag="false"/>
              <SpatialCoherency>0</SpatialCoherency>
              <Values>
                <Percentage>2</Percentage>
                <Index>10 6 0</Index>
              </Values>
              <Values>
                <Percentage>15</Percentage>
                <Index>6 16 9</Index>
              </Values>
              <Values>
                <Percentage>3</Percentage>
                <Index>7 18 4</Index>
              </Values>
            </Descriptor>
          </DescriptionUnit>
        </Mpeg7>

      Table LabelPath
         id | len | path
        ----+-----+------------------------------------------------------------------
          1 |   4 | /Mpeg7/DescriptionUnit/Descriptor/ColorSpace
          2 |   5 | /Mpeg7/DescriptionUnit/Descriptor/ColorSpace/@colorReferenceFlag
          3 |   5 | /Mpeg7/DescriptionUnit/Descriptor/ColorSpace/@type

      Table Element
         did | pathid | ordinal
        -----+--------+---------
           1 |      1 |       1
           2 |      2 |       1
           3 |      3 |       2

      Table Data
         did | pathid | ordinal | value
        -----+--------+---------+-------
           2 |      2 |       1 | false
           3 |      3 |       2 | HSV

      Table DataPath
         pid | cid
        -----+-----
           1 |   2
           1 |   3
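The four tables above can be tried out with a small sketch. It is written in Python with the standard `sqlite3` module purely for illustration and is not the XML::XParent Perl API. The column names come from the slide; the column types, the helper names `build_example_store` and `child_values`, and the join used to walk from a parent element to its children are assumptions.

```python
import sqlite3

# Column names follow the slide; the column types are assumed.
SCHEMA = """
CREATE TABLE LabelPath (id INTEGER PRIMARY KEY, len INTEGER, path TEXT);
CREATE TABLE Element   (did INTEGER PRIMARY KEY, pathid INTEGER, ordinal INTEGER);
CREATE TABLE Data      (did INTEGER, pathid INTEGER, ordinal INTEGER, value TEXT);
CREATE TABLE DataPath  (pid INTEGER, cid INTEGER);
"""

COLOR_SPACE = "/Mpeg7/DescriptionUnit/Descriptor/ColorSpace"


def build_example_store():
    """Create an in-memory store holding exactly the rows shown on the slide."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO LabelPath (id, len, path) VALUES (?, ?, ?)",
        [
            (1, 4, COLOR_SPACE),
            (2, 5, COLOR_SPACE + "/@colorReferenceFlag"),
            (3, 5, COLOR_SPACE + "/@type"),
        ],
    )
    conn.executemany(
        "INSERT INTO Element (did, pathid, ordinal) VALUES (?, ?, ?)",
        [(1, 1, 1), (2, 2, 1), (3, 3, 2)],
    )
    conn.executemany(
        "INSERT INTO Data (did, pathid, ordinal, value) VALUES (?, ?, ?, ?)",
        [(2, 2, 1, "false"), (3, 3, 2, "HSV")],
    )
    conn.executemany(
        "INSERT INTO DataPath (pid, cid) VALUES (?, ?)",
        [(1, 2), (1, 3)],
    )
    conn.commit()
    return conn


def child_values(conn, parent_path):
    """Return (child path, value) pairs for the children of nodes at parent_path."""
    sql = """
        SELECT child_lp.path, d.value
        FROM LabelPath AS parent_lp
        JOIN Element   AS parent_el ON parent_el.pathid = parent_lp.id
        JOIN DataPath  AS dp        ON dp.pid = parent_el.did
        JOIN Data      AS d         ON d.did = dp.cid
        JOIN LabelPath AS child_lp  ON child_lp.id = d.pathid
        WHERE parent_lp.path = ?
        ORDER BY child_lp.path
    """
    return conn.execute(sql, (parent_path,)).fetchall()


if __name__ == "__main__":
    store = build_example_store()
    for path, value in child_values(store, COLOR_SPACE):
        print(path, "=", value)
```

LabelPath holds each distinct path once, Data holds the text values, and DataPath links a parent node id to its child node ids, so reading the attributes of ColorSpace is a join across the four tables.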
  • 7. The XML::XParent module
      • Perl module to handle XML documents on an XParent schema
      • Can load any XML file into the same SQL schema
      • Plugins can be registered for custom logic on elements
      • Provides utilities to:
          ● Create the XParent schema for SQLite and PostgreSQL
          ● Parse and load an XML file (xparent-parse.pl)
          ● Query the XParent schema (xparent-search.pl)
      • Classes:
          ● XML::XParent::Parser: XML parser based on XML::Twig
          ● XML::XParent::Parser::Plugin: base interface class to be implemented by any plugin
          ● XML::XParent::Schema: base class (interface) to the XParent schema
          ● XML::XParent::Elem: class that describes an XML element
  • 8. XML::XParent::Schema drivers
      • The XML::XParent::Schema class implements the Driver/Interface pattern, so custom drivers can be implemented for specific data stores
      • 2 generic drivers implemented so far:
          ● XML::XParent::Schema::DBIx: driver implementation based on DBIx::Class
              ● All the advantages of an ORM (but who cares?)
              ● Quite slow!
          ● XML::XParent::Schema::DBI: driver implementation based on DBI
              ● Direct integration with the data store
              ● Much faster...
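The Driver/Interface idea can be sketched in a few lines. This is Python for illustration only; the class names `SchemaDriver` and `MemoryDriver` and the methods `store` and `find` are hypothetical and do not reflect the actual XML::XParent::Schema interface.

```python
import re
from abc import ABC, abstractmethod


class SchemaDriver(ABC):
    """Hypothetical driver interface; the method names are illustrative only."""

    @abstractmethod
    def store(self, path, value):
        """Persist one element, identified by its path, with its text value."""

    @abstractmethod
    def find(self, path_regex):
        """Return the stored (path, value) pairs whose path matches path_regex."""


class MemoryDriver(SchemaDriver):
    """Toy driver that keeps everything in a Python list."""

    def __init__(self):
        self.rows = []

    def store(self, path, value):
        self.rows.append((path, value))

    def find(self, path_regex):
        pattern = re.compile(path_regex)
        return [row for row in self.rows if pattern.search(row[0])]


def load(driver, elems):
    """Client code depends only on the interface, never on a concrete driver."""
    for path, value in elems:
        driver.store(path, value)
    return driver


if __name__ == "__main__":
    driver = load(MemoryDriver(), [("/a/b", "1"), ("/a/c", "2")])
    print(driver.find(r"/b$"))
```

A second driver backed by a real data store would implement the same two methods, and `load` would not need to change.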
  • 9. The quest for speed...
      ● Tests performed on my laptop:
          ● CPU0: Intel(R) Core(TM) i5 CPU M 540 @ 2.53GHz stepping 05
          ● CPU1: Intel(R) Core(TM) i5 CPU M 540 @ 2.53GHz stepping 05
      ● Reference XML file:
          ● Size: 45 MB
          ● XML elements: ~600,000
      ● Reference DBMS: PostgreSQL 8.4.13
      ● Parsing of the reference file with the DBIx driver:
          ● perl xparent-parse.pl -i <ref.xml> --driver DBIx
          ● Execution time: > 3000 mins!!!
      ● Parsing of the reference file with the DBI driver:
          ● perl xparent-parse.pl -i <ref.xml> --driver DBI
          ● Execution time: ~400 mins.
  • 10. ...But then...
      ● I realized loading times were divergent!
      ● I realized there was a stupid error in the implementation of the algorithm...
      [Chart: execution time in minutes (log scale), DBIx / DBI. Ref. impl.: 3000 / 400; Algo patched: 177 / 28]
  • 11. ...But then...
      ● I realized that records in the Data and DataPath tables are not referenced by anything else...
      ● They do not need to be inserted one at a time...
      ● => Bulk loading!!!
      ● ...given N elements, how many records do we have in the DataPath table?
  • 12. Bulk Loading
      • Saves a lot of time storing data:
            --- DBI: Bulk loading of 1000000 records ---
            All in once:     50.462398 wallclock seconds
            Chunks of 1000:  31.157044 wallclock seconds
            Chunks of 2000:  27.747248 wallclock seconds
            Chunks of 5000:  28.209256 wallclock seconds
            Chunks of 10000: 26.334099 wallclock seconds
      • Distinct inserts of 1000000 records:
            Elapsed time: 250.563282 wallclock seconds
      [Chart: execution time in minutes (log scale), DBIx / DBI. Ref. impl.: 3000 / 400; Algo patched: 177 / 28; Bulk loading: 98 / 16]
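A minimal sketch of chunked bulk loading, assuming Python's `sqlite3` as a stand-in for the PostgreSQL/DBI setup used in the talk. The helper names `make_store`, `chunks` and `bulk_load` are hypothetical, and the sketch does not reproduce the timings above; it only shows the shape of the technique: many rows per statement batch, one transaction per chunk.

```python
import sqlite3
from itertools import islice


def make_store():
    """In-memory store with only the DataPath table (column types assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE DataPath (pid INTEGER, cid INTEGER)")
    return conn


def chunks(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def bulk_load(conn, rows, chunk_size=1000):
    """Insert (pid, cid) rows in chunks, committing once per chunk."""
    total = 0
    for chunk in chunks(rows, chunk_size):
        with conn:  # commits on success, rolls back if the chunk fails
            conn.executemany(
                "INSERT INTO DataPath (pid, cid) VALUES (?, ?)", chunk
            )
        total += len(chunk)
    return total


if __name__ == "__main__":
    store = make_store()
    # Generated rows stand in for parent/child pairs collected while parsing.
    loaded = bulk_load(store, ((1, cid) for cid in range(2, 5002)), chunk_size=2000)
    print("rows loaded:", loaded)
```

Because nothing else references these rows, they can be buffered during parsing and flushed in chunks; the chunk size is a tuning knob, as the measurements above suggest.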
  • 13. ...But then...
      • For each element we have to check whether its path already exists...
      • Much better to cache paths in a hash than to go back and forth to the DB...
      [Chart: execution time in minutes (log scale), DBIx / DBI. Ref. impl.: 3000 / 400; Algo patched: 177 / 28; Bulk loading: 98 / 16; Cached paths: 41 / 12]
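The path cache can be sketched as a dictionary in front of the LabelPath table. This is an illustrative Python/`sqlite3` version, not the module's code: the class name `PathCache`, its `path_id` method and the `db_lookups` counter are hypothetical, and computing `len` as the number of `/` separators is an assumption that happens to match the sample rows on slide 6.

```python
import sqlite3


class PathCache:
    """Resolve a path to its LabelPath id, touching the DB once per distinct path."""

    def __init__(self, conn):
        self.conn = conn
        self.ids = {}        # path -> LabelPath.id
        self.db_lookups = 0  # how many times we actually had to query the DB

    def path_id(self, path):
        if path in self.ids:
            return self.ids[path]
        self.db_lookups += 1
        row = self.conn.execute(
            "SELECT id FROM LabelPath WHERE path = ?", (path,)
        ).fetchone()
        if row is None:
            # Assumed rule: len is the number of steps in the path.
            cursor = self.conn.execute(
                "INSERT INTO LabelPath (len, path) VALUES (?, ?)",
                (path.count("/"), path),
            )
            path_id = cursor.lastrowid
        else:
            path_id = row[0]
        self.ids[path] = path_id
        return path_id


def make_store():
    """In-memory store with only the LabelPath table (column types assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE LabelPath (id INTEGER PRIMARY KEY, len INTEGER, path TEXT)"
    )
    return conn


if __name__ == "__main__":
    cache = PathCache(make_store())
    for p in ["/Mpeg7", "/Mpeg7/DescriptionUnit", "/Mpeg7", "/Mpeg7"]:
        cache.path_id(p)
    print("DB lookups:", cache.db_lookups)
```

Documents usually contain far fewer distinct paths than elements, so most lookups are answered from memory.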
  • 14. ...But then...
      • Added some indexes:
          • CREATE INDEX LabelPath_Path ON LabelPath (Path);
          • CREATE INDEX Element_PathID ON Element (PathID);
          • CREATE INDEX DataPath_Cid ON DataPath (Cid);
          • CREATE INDEX DataPath_Pid ON DataPath (Pid);
          • CREATE INDEX Data_Did ON Data (Did);
      [Chart: execution time in minutes (log scale), DBIx / DBI. Ref. impl.: 3000 / 400; Algo patched: 177 / 28; Bulk loading: 98 / 16; Cached paths: 41 / 12; + Indexes: 29 / 8]
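One way to check that an index is actually used is to look at the query plan before and after creating it. The sketch below does this with SQLite's `EXPLAIN QUERY PLAN` through Python's `sqlite3`; the talk used PostgreSQL, where the equivalent tool is `EXPLAIN`. The index statements are the ones listed above; the table definitions, the helper names and the sample lookup query are assumptions.

```python
import sqlite3

# Table definitions are assumed; column names follow slide 6.
SCHEMA = """
CREATE TABLE LabelPath (id INTEGER PRIMARY KEY, len INTEGER, path TEXT);
CREATE TABLE Element   (did INTEGER, pathid INTEGER, ordinal INTEGER);
CREATE TABLE Data      (did INTEGER, pathid INTEGER, ordinal INTEGER, value TEXT);
CREATE TABLE DataPath  (pid INTEGER, cid INTEGER);
"""

# Index statements as listed on the slide.
INDEXES = [
    "CREATE INDEX LabelPath_Path ON LabelPath (Path)",
    "CREATE INDEX Element_PathID ON Element (PathID)",
    "CREATE INDEX DataPath_Cid ON DataPath (Cid)",
    "CREATE INDEX DataPath_Pid ON DataPath (Pid)",
    "CREATE INDEX Data_Did ON Data (Did)",
]

# Assumed example of the per-element lookup that the path index speeds up.
LOOKUP = "SELECT id FROM LabelPath WHERE path = ?"


def make_store():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def query_plan(conn, sql, params=()):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


if __name__ == "__main__":
    store = make_store()
    print("before:", query_plan(store, LOOKUP, ("/Mpeg7",)))
    for ddl in INDEXES:
        store.execute(ddl)
    print("after: ", query_plan(store, LOOKUP, ("/Mpeg7",)))
```

The exact wording of the plan varies between SQLite versions, but after the indexes exist the plan for the path lookup should name `LabelPath_Path` instead of reporting a full table scan.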
  • 15. ...But then...
      • Realized I could “compact” records...
            <?xml version="1.0" encoding="ISO-8859-1"?>
            <Mpeg7 xmlns="http://www.mpeg7.org/2001/MPEG-7_Schema"
                   xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance">
              <DescriptionUnit xsi:type="DescriptorCollectionType">
                <Descriptor size="5" xsi:type="DominantColorType">
                  <ColorSpace type="HSV" colorReferenceFlag="false"/>
                  <SpatialCoherency>0</SpatialCoherency>
                  <Values>
                    <Percentage>2</Percentage>
                    <Index>10 6 0</Index>
                  </Values>
                  <Values>
                    <Percentage>15</Percentage>
                    <Index>6 16 9</Index>
                  </Values>
                  <Values>
                    <Percentage>3</Percentage>
                    <Index>7 18 4</Index>
                  </Values>
                </Descriptor>
              </DescriptionUnit>
            </Mpeg7>
      • Saves another 20%-30%...
      • Needs some logic at query time (experimental)...
  • 16. To cut a very long story short...
      Time (mins) to load ~600,000 XML elems

             | Reference | Algo patched | Bulk loading | Cached paths | Indexes | Compact
        -----+-----------+--------------+--------------+--------------+---------+---------
        DBIx |    > 3000 |          177 |           98 |           41 |      29 |      22
        DBI  |      ~400 |           28 |           16 |           12 |       8 |       6

      ● ...and we still have to do:
          ● Code profiling...
          ● Specific DBMS techniques...
          ● Use MapReduce to split jobs among several workers...
  • 17. About retrieval...
      • At first I tried implementing an XPath-to-SQL translator
      • Found it very, very hard...
      • ...and almost useless
      • ...use the power of SQL to express what you want!
      • XML::XParent provides an API (get_elem) to query for a set of elements whose paths match a given SQL regex. The API returns a set of XML::XParent::Elem objects.
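The retrieval idea, selecting elements whose path matches a regular expression, can be sketched directly in SQL. The example uses Python's `sqlite3` and registers a `REGEXP` function, since SQLite does not ship one; PostgreSQL has its own regex operators. The function `find_elems` is a hypothetical stand-in, not the real `get_elem`, and it returns plain tuples rather than XML::XParent::Elem objects. Table contents are the sample rows from slide 6; column types are assumed.

```python
import re
import sqlite3

SCHEMA = """
CREATE TABLE LabelPath (id INTEGER PRIMARY KEY, len INTEGER, path TEXT);
CREATE TABLE Element   (did INTEGER PRIMARY KEY, pathid INTEGER, ordinal INTEGER);
CREATE TABLE Data      (did INTEGER, pathid INTEGER, ordinal INTEGER, value TEXT);
"""

BASE = "/Mpeg7/DescriptionUnit/Descriptor/ColorSpace"


def regexp(pattern, text):
    """SQLite evaluates `X REGEXP Y` as regexp(Y, X), so the pattern comes first."""
    return int(text is not None and re.search(pattern, text) is not None)


def build_example_store():
    conn = sqlite3.connect(":memory:")
    conn.create_function("REGEXP", 2, regexp)
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO LabelPath (id, len, path) VALUES (?, ?, ?)",
        [(1, 4, BASE), (2, 5, BASE + "/@colorReferenceFlag"), (3, 5, BASE + "/@type")],
    )
    conn.executemany(
        "INSERT INTO Element (did, pathid, ordinal) VALUES (?, ?, ?)",
        [(1, 1, 1), (2, 2, 1), (3, 3, 2)],
    )
    conn.executemany(
        "INSERT INTO Data (did, pathid, ordinal, value) VALUES (?, ?, ?, ?)",
        [(2, 2, 1, "false"), (3, 3, 2, "HSV")],
    )
    conn.commit()
    return conn


def find_elems(conn, path_regex):
    """Return (did, path, value) for every element whose path matches path_regex."""
    sql = """
        SELECT e.did, lp.path, d.value
        FROM Element AS e
        JOIN LabelPath AS lp ON lp.id = e.pathid
        LEFT JOIN Data AS d ON d.did = e.did
        WHERE lp.path REGEXP ?
        ORDER BY e.did
    """
    return conn.execute(sql, (path_regex,)).fetchall()


if __name__ == "__main__":
    store = build_example_store()
    for row in find_elems(store, r"/ColorSpace/@"):
        print(row)
```

Because every element row carries its full path through LabelPath, a path pattern becomes an ordinary WHERE clause, which is the point the slide makes about using SQL directly instead of translating XPath.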
  • 18. XML::XParent utilities: how to use them
      • Configure parameters in the xparent.yml file:
            ---
            schema_params:
                - dbi:Pg:dbname=xparent
            #    - dbi:SQLite:xparent.db
                - grubert
                - grubert
                -
                    AutoCommit: 1
            #plugins:
            #    SLMS::Redis::ParserPlugin:
            #        tag: MovingRegion
      • To load an XML file:
            perl xparent-parse.pl -i <input file> --driver <the Schema driver to use> [--config_file <the config file>] [--verbose] [--clean] [--compact]
      • To query the XParent data store:
            perl xparent-search.pl --path <path regex> --driver <the Schema driver to use> [--config_file <the config file>]
      • To clean the data store:
            perl xparent-clean.pl --driver <the Schema driver to use> [--config_file <the config file>]
  • 19. Contribute! https://github.com/grubert65/XParent-Perl.git
  • 20. Thank You !!!!
