Creating Structure in                 Unstructured Data                What is possible, today…?Marco Gralike
“Big Data” = XML ?
Challenges are!Ahum, the problems are!
WikiPedia• One string of XML data with  structured and unstructured  data sections• Language: English• Size      : 42,15 G...
Adventures intothe unknown…?
Setup• VirtualBox VM  – OEL 5U8 (64)  – 8 GB RAM• LaCie Little Big Disk  – RAID 0  – Thunderbolt• Database  – SGA    4GB  ...
My new LaCie LBD is really fast - 
Defeat?! - 1.000.000 pages only
Status of Technology used
XML - Where are we…?Gartner
Achieved…?
On the Horizon!• JSoniq• Zorba
Building (streaming) Bridges
Oracle XML DB      • NO cost option      • C (native / embedded kernel)      • (XQuery) Standards      • Code maintained b...
XQuery                                           XMLType Abstraction                               DB XQuery              ...
So about what are we talking ?
WikiPedia• Structured & Unstructured  bits and pieces• A lot of “unbounded”  elements• Not a lot of restrictions• The bit ...
How do we get this Structured?
Strings = small & defined (12c?)   Ename  pointer += 100;
<string1/><string2/><string3/>
Flexible, HumansNo Design Patterns
<small/><verybigggr/><bigger/>
<verybigggr>       <empno>1</empno><ename>Marco</ename>       <empno>2</empno></verybigggr> <small/><verybigggr/><bigger/>
We need options!
“XMLType” Container  In Memory            CLOB  (document)        (document)Object Relational   Binary XML     (data)     ...
XMLType      In Memory      (document)XOB          XML Schema
XMLType   Binary XML Securefile    (document/content)Post Parse        LOB Index
XMLType        Object Relational           (content)Fully Shredded        Indexes
Something else to Realize !
“What is the fastest way to get this    stuff in the database…?”
“…it depends…”
“So what is the fastest way to get    XML in the database…?”
“…it depends…”
“So what is the fastest way to get XML           in the database…    … and   useful in my case…?”
Garbage IN – Garbage OUT
WikiPedia•   SQL*Loader•   Parallel or Direct•   Securefile LOB Column•   2.5 hoursAnd no (performant) wayto get the detai...
WikiPedia•   SQL*Loader•   Parallel or Direct•   Securefile Binary XML•   …2.5 hours ???
XML Parsing• SAX   - Simple API for XML• DOM   - Document Object Module
fastinsert performance   CLOB                               XMLType                                CLOB                   ...
XML Partitioning• Object Relational Partitioning  – Equi-Partitioning since version Oracle 11.1.0.7.0• Binary XML Partitio...
XMLType   Binary XML Securefile    (document/content)Post Parse        LOB Index
Driving access on CONTENT                                                   BTre                                          ...
Structured Data
Structured XMLIndex (SXI)• CONTENT TABLE(s)• Based on XMLTABLE syntax        Structured                                  X...
Describe CONTENT TABLE• A “regular” heap table with columns…• Ideal for secondary indexes, if needed.
CONTENT TABLE(s) Structured XMLIndex    f (x)  Content  Tables
Semi-Structured Data
Unstructured XMLIndex (UXI)• PATH TABLE• Use Path Subsetting                 Unstructured   – Full Blown XMLIndex can be B...
Describe PATH TABLE
What’s hidden…
PATH TABLEUnstructured XMLIndex    f (x) Path Table
Binary XML – No Index
Binary XML + XMLIndex (SXI)
Binary XML + XMLIndex + Sec.Ind.
Binary XML + XMLIndex + Sec.Ind.
Un-Structured Data
XML Full Tekst Index• Based on Oracle Text Index, XQuery Full Text• XML Namespace Aware• XML Semantic aware full text sear...
Balanced Design• Inserts, Updates & Deletes  – XML Future Changes  – Index Maintenance           In Memory   On Disk• Sele...
Reward• Optimal performance• Out performing XML• Proper design will give  performance increase over  XML handling……proper ...
ReferencesOracle XML DB  – http://www.oracle.com/pls/db112/homepageXML DB FAQ Thread  – http://forums.oracle.com/forums/th...
ReferencesDaniela Florescu, Oracle Corporation  Advances in XML and XQuerySam Idicula, Oracle XML DB Development Team  Bin...
ReferencesOracle XML DB Main page material• Oracle XML DB : Best Practices to Get Optimal  Performance out of XML Queries ...
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Hotsos 2013 - Creating Structure in Unstructured Data
Upcoming SlideShare
Loading in...5
×

Hotsos 2013 - Creating Structure in Unstructured Data

5,311

Published on

Presentation used during the Hotsos Symposium in 2013, Irving, Texas.

Published in: Technology
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
5,311
On Slideshare
0
From Embeds
0
Number of Embeds
10
Actions
Shares
0
Downloads
0
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide
  • See also OOW 2010, S317428: Building Really Scalable XML Applications with Oracle XML DB and Oracle Text – Nipun Agarwal, Oracle
  • Hotsos 2013 - Creating Structure in Unstructured Data

    1. 1. Creating Structure in Unstructured Data What is possible, today…?Marco Gralike
    2. 2. “Big Data” = XML ?
    3. 3. Challenges are!Ahum, the problems are!
    4. 4. WikiPedia• One string of XML data with structured and unstructured data sections• Language: English• Size : 42,15 GB• Pages : 12.961.997• Date : 21 Dec 2012
    5. 5. Adventures intothe unknown…?
    6. 6. Setup• VirtualBox VM – OEL 5U8 (64) – 8 GB RAM• LaCie Little Big Disk – RAID 0 – Thunderbolt• Database – SGA 4GB – PGA 2GB
    7. 7. My new LaCie LBD is really fast - 
    8. 8. Defeat?! - 1.000.000 pages only
    9. 9. Status of Technology used
    10. 10. XML - Where are we…?Gartner
    11. 11. Achieved…?
    12. 12. On the Horizon!• JSoniq• Zorba
    13. 13. Building (streaming) Bridges
    14. 14. Oracle XML DB • NO cost option • C (native / embedded kernel) • (XQuery) Standards • Code maintained by Oracle
    15. 15. XQuery XMLType Abstraction DB XQuery Procedural XQuery XQuery Rewrite Pushdown XVM (use “no query rewrite”) Relational Streaming XPath DOM Tree Evaluation Model Access SQL Execution Methods XMLIndex Object-Relational Binary XML Relational Storage Secure FilesSource: S317428: Building Really Scalable XML Applications with Oracle XML DB and Oracle Text
    16. 16. So about what are we talking ?
    17. 17. WikiPedia• Structured & Unstructured bits and pieces• A lot of “unbounded” elements• Not a lot of restrictions• The bit with value is in element “tekst”
    18. 18. How do we get this Structured?
    19. 19. Strings = small & defined (12c?) Ename  pointer += 100;
    20. 20. <string1/><string2/><string3/>
    21. 21. Flexible, HumansNo Design Patterns
    22. 22. <small/><verybigggr/><bigger/>
    23. 23. <verybigggr> <empno>1</empno><ename>Marco</ename> <empno>2</empno></verybigggr> <small/><verybigggr/><bigger/>
    24. 24. We need options!
    25. 25. “XMLType” Container In Memory CLOB (document) (document)Object Relational Binary XML (data) (data)
    26. 26. XMLType In Memory (document)XOB XML Schema
    27. 27. XMLType Binary XML Securefile (document/content)Post Parse LOB Index
    28. 28. XMLType Object Relational (content)Fully Shredded Indexes
    29. 29. Something else to Realize !
    30. 30. “What is the fastest way to get this stuff in the database…?”
    31. 31. “…it depends…”
    32. 32. “So what is the fastest way to get XML in the database…?”
    33. 33. “…it depends…”
    34. 34. “So what is the fastest way to get XML in the database… … and useful in my case…?”
    35. 35. Garbage IN – Garbage OUT
    36. 36. WikiPedia• SQL*Loader• Parallel or Direct• Securefile LOB Column• 2.5 hoursAnd no (performant) wayto get the details out…a.k.a “completely useless”
    37. 37. WikiPedia• SQL*Loader• Parallel or Direct• Securefile Binary XML• …2.5 hours ???
    38. 38. XML Parsing• SAX - Simple API for XML• DOM - Document Object Module
    39. 39. fastinsert performance CLOB XMLType CLOB (domain) indexes XMLType Binary XML XMLType Object Relational fast select performance
    40. 40. XML Partitioning• Object Relational Partitioning – Equi-Partitioning since version Oracle 11.1.0.7.0• Binary XML Partitioning – Range, List, Hash• Local partitioned XMLIndex – LOCAL keyword in XMLIndex create syntax• Partition Key on virtual column (Binary XML)• Partition Key on column (Object Relational)
    41. 41. XMLType Binary XML Securefile (document/content)Post Parse LOB Index
    42. 42. Driving access on CONTENT BTre e Index bookstore Function based Index (XPath) book whitepapertitle author author chapter title author id paragraph Unstructured Structured XMLIndex XMLIndex content structured content BTree Oracle XML Index Text Index
    43. 43. Structured Data
    44. 44. Structured XMLIndex (SXI)• CONTENT TABLE(s)• Based on XMLTABLE syntax Structured XMLIndex• XMLTable construct can be f (x) nested: – VIRTUAL column alias• Can be maintained manually• Secondary indexes possible Content Tables
    45. 45. Describe CONTENT TABLE• A “regular” heap table with columns…• Ideal for secondary indexes, if needed.
    46. 46. CONTENT TABLE(s) Structured XMLIndex f (x) Content Tables
    47. 47. Semi-Structured Data
    48. 48. Unstructured XMLIndex (UXI)• PATH TABLE• Use Path Subsetting Unstructured – Full Blown XMLIndex can be BIG XMLIndex f (x)• Token Tables (XDB.X$......) – Query re-write on Tokens – Fuzzy Searches, // – Optimizer Statistics• Can be maintained manually – Recorded in Pending Table Path Table• Secondary indexes possible
    49. 49. Describe PATH TABLE
    50. 50. What’s hidden…
    51. 51. PATH TABLEUnstructured XMLIndex f (x) Path Table
    52. 52. Binary XML – No Index
    53. 53. Binary XML + XMLIndex (SXI)
    54. 54. Binary XML + XMLIndex + Sec.Ind.
    55. 55. Binary XML + XMLIndex + Sec.Ind.
    56. 56. Un-Structured Data
    57. 57. XML Full Tekst Index• Based on Oracle Text Index, XQuery Full Text• XML Namespace Aware• XML Semantic aware full text search – Full-Tekst Selection Expression – contains text – Logical Full Text Operator – ftor, ftand, ftMildNot – Context Aware full text search
    58. 58. Balanced Design• Inserts, Updates & Deletes – XML Future Changes – Index Maintenance In Memory On Disk• Selects – In Memory – Via Indexes• XML Validation – Strict, Lazy – Client Side Possibilities
    59. 59. Reward• Optimal performance• Out performing XML• Proper design will give performance increase over XML handling……proper design is still key…
    60. 60. ReferencesOracle XML DB – http://www.oracle.com/pls/db112/homepageXML DB FAQ Thread – http://forums.oracle.com/forums/thread.jspa?thr eadID=410714Personal Blog – http://www.xmldb.nl – http://technology.amis.nl
    61. 61. ReferencesDaniela Florescu, Oracle Corporation Advances in XML and XQuerySam Idicula, Oracle XML DB Development Team Binary XML Storage and Query Processing in OracleJinyu Wang, Scott Brewton Making XML Technology Easier to UseJoel Spolsky - Joel on Software Back to Basics
    62. 62. ReferencesOracle XML DB Main page material• Oracle XML DB : Best Practices to Get Optimal Performance out of XML Queries (PDF)• Oracle XML DB : Choosing the Best XMLType Storage Option for Your Use Case (PDF)• A Request for Comments for the Oracle Binary XML Format

    ×