XML Amsterdam - Creating structure in unstructured data

Presentation used for XML Amsterdam 2013

Published in: Technology
  • See also OOW 2010, S317428: Building Really Scalable XML Applications with Oracle XML DB and Oracle Text – Nipun Agarwal, Oracle
  • XML Amsterdam - Creating structure in unstructured data

    1. 1. Creating Structure in Unstructured Data What is possible, today…? Marco Gralike
    2. 2. “Big Data” = XML ?
    3. 3. Challenges are! Ahum, the problems are!
    4. 4. WikiPedia • One string of XML data with structured and unstructured data sections • Language: English • Size : 42,15 GB • Pages : 12.961.997 • Date : 21 Dec 2012
    5. 5. Adventures into the unknown…?
    6. 6. Setup • VirtualBox VM – OEL 5U8 (64) – 8 GB RAM • LaCie Little Big Disk – RAID 0 – Thunderbolt • Database – SGA 4 GB – PGA 2 GB
    7. 7. My new LaCie LBD is really fast
    8. 8. Defeat?! - 1.000.000 pages only
    9. 9. Status of Technology used
    10. 10. XML - Where are we…? Gartner
    11. 11. Oracle & XML timeline (axis: performance; years 1998, 2001, 2004, 2007, 2013). Milestones shown: XML APIs • XML Storage & Repository • XQuery • Binary Storage & Indexing • XQuery-Update, XQuery Full-Text, XQJ, Big-Data
    12. 12. Achieved…?
    13. 13. Horizon…? (Oracle NoSQL, XMLDB) • JSONiq • Zorba • JSON support • In-Memory
    14. 14. Building (streaming) Bridges
    15. 15. Oracle XML DB • NO cost option • C (native, embedded in the kernel) • XML / XQuery Standards • Code maintained by Oracle • JSON / In-Memory
    16. 16. Diagram labels: DB • XQuery • XMLType Abstraction • XQuery Rewrite • SQL Execution • Relational Access Methods • Procedural XQuery Pushdown • Streaming XPath Evaluation • XVM (use “no query rewrite”) • XMLIndex • Storage: Object-Relational, Binary XML, Relational, Secure Files. Source: S317428: Building Really Scalable XML Applications with Oracle XML DB and Oracle Text
    17. 17. What are we talking about?
    18. 18. WikiPedia • Structured & Unstructured bits and pieces • A lot of “unbounded” elements • Not a lot of restrictions • The valuable content is in the element “text”
    19. 19. How do we get this Structured?
    20. 20. Strings = small & defined (12c?) Ename  pointer += 100;
    21. 21. <string1/><string2/><string3/>
    22. 22. Flexible, Humans No Design Patterns
    23. 23. <small/><verybigggr/><bigger/>
    24. 24. <verybigggr> <empno>1</empno><ename>Marco</ename> <empno>2</empno> </verybigggr> <small/><verybigggr/><bigger/>
    25. 25. We need options!
    26. 26. “XMLType” Container • In Memory (document) • CLOB (document) • Object Relational (data) • Binary XML (data)
    27. 27. XMLType • In Memory (document) • XOB • XML Schema
    28. 28. XMLType • Binary XML Securefile (document/content) • Post Parse • LOB Index
    29. 29. XMLType • Object Relational (content) • Fully Shredded • Indexes
    30. 30. Something else to Realize !
    31. 31. “What is the fastest way to get this stuff in the database…?”
    32. 32. “…it depends…”
    33. 33. “So what is the fastest way to get XML in the database… … and useful in my case…?”
    34. 34. Garbage IN – Garbage OUT
    35. 35. WikiPedia • SQL*Loader • Parallel or Direct • Securefile LOB Column • 2.5 hours And no (performant) way to get the details out… a.k.a. “completely useless”
    36. 36. WikiPedia • SQL*Loader • Parallel or Direct • Securefile Binary XML • …2.5 hours ???
    37. 37. XML Parsing • SAX - Simple API for XML • DOM - Document Object Model
    38. 38. Diagram, from fast insert performance to fast select performance: CLOB • XMLType CLOB • XMLType Binary XML • XMLType Object Relational • (domain) indexes
    39. 39. So let’s pick an XMLType storage method…
    40. 40. XMLType • Binary XML Securefile (document/content) • Post Parse • LOB Index
    41. 41. Driving access on CONTENT needs • BTree Index • Function based Index (XPath) • Unstructured XMLIndex • Structured XMLIndex • Oracle XML Text Index • BTree Index
    42. 42. Structured Data
    43. 43. Structured XMLIndex (SXI) • CONTENT TABLE(s) • Based on XMLTABLE syntax • XMLTABLE construct can be nested – VIRTUAL column alias • Can be maintained manually • Secondary indexes possible (diagram labels: Structured XMLIndex f(x), Content Tables)
    44. 44. Describe CONTENT TABLE • A “regular” heap table with columns… • Ideal for secondary indexes, if needed.
    45. 45. Semi-Structured Data
    46. 46. Unstructured XMLIndex (UXI) • PATH TABLE • Use Path Subsetting – Full Blown XMLIndex can be BIG • Token Tables (XDB.X$......) – Query re-write on Tokens – Fuzzy Searches, // – Optimizer Statistics • Can be maintained manually – Recorded in Pending Table • Secondary indexes possible (diagram labels: Unstructured XMLIndex f(x), Path Table)
    47. 47. Describe PATH TABLE
    48. 48. What’s hidden…
    49. 49. Binary XML – No Index
    50. 50. Binary XML + XMLIndex (SXI)
    51. 51. Binary XML + XMLIndex + Sec.Ind.
    52. 52. Binary XML + XMLIndex + Sec.Ind.
    53. 53. Un-Structured Data
    54. 54. XML Full Text Index • Based on Oracle Text Index, XQuery Full Text • XML Namespace Aware • XML Semantic aware full text search – Full-Text Selection Expression – contains text – Logical Full Text Operators – ftor, ftand, ftMildNot – Context Aware full text search
    55. 55. Balanced Design • Inserts, Updates & Deletes – XML Future Changes – Index Maintenance • Selects – In Memory – Via Indexes • XML Validation – Strict, Lazy – Client Side Possibilities
    56. 56. Reward • Optimal performance • Outperforming XML • Proper design will give a performance increase over XML handling… …proper design is still key…
    57. 57. References • Oracle XML DB – http://www.oracle.com/pls/db112/homepage • XML DB FAQ Thread – http://forums.oracle.com/forums/thread.jspa?threadID=410714 • Personal Blog – http://www.xmldb.nl – http://technology.amis.nl
    58. 58. References • Daniela Florescu, Oracle Corporation: Advances in XML and XQuery • Sam Idicula, Oracle XML DB Development Team: Binary XML Storage and Query Processing in Oracle • Jinyu Wang, Scott Brewton: Making XML Technology Easier to Use • Joel Spolsky, Joel on Software: Back to Basics
    59. 59. References Oracle XML DB Main page material • Oracle XML DB : Best Practices to Get Optimal Performance out of XML Queries (PDF) • Oracle XML DB : Choosing the Best XMLType Storage Option for Your Use Case (PDF) • A Request for Comments for the Oracle Binary XML Format
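
Slide 37 contrasts SAX and DOM parsing. The difference can be sketched in a few lines of Python, assuming a hypothetical miniature of a Wikipedia-style dump. The element names in `SAMPLE` and the helpers `PageHandler`, `stream_pages` and `dom_pages` are illustrative and not taken from the presentation, and this says nothing about how Oracle XML DB parses XML internally. SAX streams through the document and keeps one page in memory at a time; DOM builds the whole tree first.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical miniature of a Wikipedia-style dump: structured fields
# (title, id) next to an unstructured free-text element.
SAMPLE = """<mediawiki>
  <page><title>Amsterdam</title><id>1</id>
    <revision><text>Capital of the Netherlands.</text></revision></page>
  <page><title>XML</title><id>2</id>
    <revision><text>Extensible Markup Language.</text></revision></page>
</mediawiki>"""


class PageHandler(xml.sax.ContentHandler):
    """Streams through the document, holding only the current page."""

    def __init__(self, on_page):
        super().__init__()
        self.on_page = on_page  # callback invoked once per finished page
        self.page = None        # fields of the page being read
        self.buffer = []        # character data of the current element

    def startElement(self, name, attrs):
        if name == "page":
            self.page = {}
        self.buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so collect them.
        self.buffer.append(content)

    def endElement(self, name):
        if self.page is None:
            return
        if name in ("title", "id", "text"):
            self.page[name] = "".join(self.buffer).strip()
        elif name == "page":
            self.on_page(self.page)
            self.page = None
        self.buffer = []


def stream_pages(xml_text):
    """SAX: collect (id, title, text) rows without building a full tree."""
    rows = []
    handler = PageHandler(
        lambda p: rows.append((int(p["id"]), p["title"], p["text"]))
    )
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return rows


def dom_pages(xml_text):
    """DOM: parse everything into memory first, then navigate the tree."""
    doc = minidom.parseString(xml_text)
    rows = []
    for page in doc.getElementsByTagName("page"):
        def first(tag):
            return page.getElementsByTagName(tag)[0].firstChild.data.strip()
        rows.append((int(first("id")), first("title"), first("text")))
    return rows


if __name__ == "__main__":
    for row in stream_pages(SAMPLE):
        print(row)
```

Both functions return the same rows for this sample. For a dump of the size mentioned on slide 4, only the streaming variant avoids holding the entire document in memory, which is the trade-off the SAX versus DOM slide points at.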
