Access Path Selection
for
XML Query Processing
Attila Barta
IBM Extreme Blue RTP & University of Toronto
abarta@us.ibm.com
atibarta@cs.toronto.edu
August 8, 2003 Access Path Selection for XML QP 2
Outline
1. Introduction to XML Query Processing
2. XML Query Optimization from a Back-End
Perspective
3. Schema issues in XML Query Optimization
4. ToXop
5. Access Path Selection in ToXop
6. Future Work
Motivation
XML is here to stay - E-commerce, B2B and
Information exchange generate a large amount of
XML data
All this data needs to be queried - XQuery is the
near-standard XML query language
There are more than 39 XQuery implementations
recorded on the W3C site
XQuery Implementations
The implementations are mainly in memory on top of
a DOM parser
Systems like Xperanto and LegoDB translate XQuery into
an underlying query language (SQL)
Systems like Timber, Tukwila and ToX provide native
implementations of XQuery
Only the native implementations support query
optimization
Question: how can we perform query optimization in
the XML context? Can we still use the relational
approach?
The Relational Optimization Technique
The SQL query is decomposed into an internal
representation based on Relational Algebra (RA) - an
operator tree with RA operators as nodes
The access path selection is performed - that is
choosing the cheapest access to the tables (choosing
among FileScan, IndexScan, etc.)
The join order is computed
Question: can we use the same technique for optimizing
XML queries?
XML Query Algebra?
There are many proposals - all but two are APIs or
Calculus and only one, TAX, is implemented
The Tree Algebra for XML (TAX):
• part of the Timber project from the University of Michigan
• the database is a collection of XML documents
• a set of operators that mirrors the RA, plus XML-specific
operators like copy and paste
• the operators work on collections of trees
• specific to TAX: pattern trees and witness trees
An XQuery query and its Pattern Trees
for $x in document("file: /catalog.xml")//item,
$y in document("file: /parts.xml")//part,
$z in document("file: /supplier.xml")//supplier
let $a := $x/part_no
where $a = $y/part_no
and $x/supplier_no = $z/supplier_no
and $z/city = "Munich"
return
<result> { $a } { $x/price } { $y/description }{ $z/name } </result>
Access Path Selection for XQuery
In the Relational model a Scan operator operates on tuples
In the XML model a Scan operator operates on witness trees
generated by pattern trees
In order to evaluate the cost of a Scan operator we have to
evaluate the cost of evaluating the corresponding pattern trees
There are two paradigms for evaluating pattern trees (and
XPaths in general): node-at-a-time and set-at-a-time
In the node-at-a-time the nodes are processed as they are
scanned into the parser
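A minimal sketch of the node-at-a-time idea, assuming a streaming parser (Python's standard `xml.etree.ElementTree.iterparse`) and a single parent/child step such as `supplier/branch`. The function name `stream_match_child` is hypothetical and this is not the evaluator of any system named in these slides; it only illustrates that matches are decided while nodes arrive from the parser.

```python
import io
import xml.etree.ElementTree as ET


def stream_match_child(xml_text, parent_tag, child_tag):
    """Return the text of every child_tag element whose parent is parent_tag.

    Nodes are examined one at a time, in the order the parser reports them.
    """
    open_tags = []  # tags of the elements that are currently open
    matches = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            open_tags.append(elem.tag)
        else:
            open_tags.pop()  # after the pop, the top of the stack is the parent
            if elem.tag == child_tag and open_tags and open_tags[-1] == parent_tag:
                matches.append((elem.text or "").strip())
    return matches


doc = (
    "<suppliers><supplier>Magna"
    "<branch>Toronto</branch><branch>Montreal</branch>"
    "</supplier></suppliers>"
)
print(stream_match_child(doc, "supplier", "branch"))
```

The stack of open tags is the only state kept, which is why this style suits streaming input.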
Set-at-a-time
Region-algebra-like encoding:
• [term, DocID, StartPos, EndPos, LevelNum] - elements
• [term, DocID, StartPos, LevelNum] - string values
For the XPath expression '//a/b' (docno, begin, end and level stand
for DocID, StartPos, EndPos and LevelNum; levels are assumed to
grow with depth):
SELECT *
FROM Elements e1, Elements e2
WHERE e1.term = 'a' AND e2.term = 'b' AND e1.docno = e2.docno
AND e1.begin < e2.begin AND e2.end < e1.end
AND e2.level = e1.level + 1
This approach is also known as structural join and it is used in
many systems, including Timber and Niagara
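The containment test behind a structural join can be sketched in a few lines. This is a naive nested-loop version, assuming the region encoding above and assuming that level numbers grow with depth (so a child sits one level below its parent). The names `Element` and `structural_join` are hypothetical. Real systems typically use merge-based algorithms over sorted inputs, which this sketch does not attempt.

```python
from collections import namedtuple

# One row of the region encoding: [term, DocID, StartPos, EndPos, LevelNum]
Element = namedtuple("Element", "term doc start end level")


def structural_join(elements, parent_term, child_term):
    """Return (parent, child) pairs that satisfy parent_term/child_term."""
    parents = [e for e in elements if e.term == parent_term]
    children = [e for e in elements if e.term == child_term]
    return [
        (p, c)
        for p in parents
        for c in children
        if p.doc == c.doc              # same document
        and p.start < c.start          # child starts inside the parent region
        and c.end < p.end              # child ends inside the parent region
        and c.level == p.level + 1     # direct child, not a deeper descendant
    ]


# Encoding of <a><b/><c><b/></c></a> in document 1
elements = [
    Element("a", 1, 1, 8, 1),
    Element("b", 1, 2, 3, 2),
    Element("c", 1, 4, 7, 2),
    Element("b", 1, 5, 6, 3),
]
pairs = structural_join(elements, "a", "b")
```

Dropping the level condition turns the child step into a descendant step ('//a//b').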
Outline
1. Introduction to XML Query Processing
2. XML Query Optimization from a Back-End
Perspective
3. Schema issues in XML Query Optimization
4. ToXop
5. Access Path Selection in ToXop
6. Future Work
The Back-End
Most of the XQuery processing systems are built for
one particular back-end
The Toronto XML Server (ToX) supports multiple
back-ends - this enables access path selection
The existing ToX back-ends: flat files, relational, ToXin
ToXin
Proposed by Rizzolo and Mendelzon
ToXin is an index structure that allows backward and
forward navigation on an XML document
ToXin captures the entire content of a document,
thus it can be used as a back end
ToXin mirrors the structure of the document. Thus,
for each element in a distinct path, there is a ToXin
node to represent it
A Sample XML Document
<!-- Supplier.xml : Suppliers and branches -->
<suppliers>
<supplier> Magna
<branch> Toronto </branch>
<branch> Montreal </branch>
<branch> Detroit </branch>
</supplier>
<supplier> ABB
<branch> Zurich </branch>
<branch> Stockholm </branch>
<branch> Toulouse </branch>
<branch> Haifa </branch>
<branch> New York </branch>
<branch> Kyoto </branch>
<branch> Sydney </branch>
<branch> Hong Kong </branch>
</supplier>
<supplier> Demag
<branch> Munich </branch>
<branch> Koln </branch>
</supplier>
</suppliers>
[Figure: A tree representation for the sample XML document]
[Figure: A ToXin tree for the sample document]
[Figure: ToXin instance and value tables]
ToXin Encoding
Bottom-up evaluation: start evaluating the predicates on the
leaf value table and proceed upwards
Outline
1. Introduction to XML Query Processing
2. XML Query Optimization from a Back-End
Perspective
3. Schema issues in XML Query Optimization
4. ToXop
5. Access Path Selection in ToXop
6. Future Work
Schema in Relational Query Optimization
Schema information and data statistics are essential for query
optimization
In relational systems, schema information is read from
the system catalog; data statistics, also stored in the system
catalog, are collected periodically from the database
The schema information is used to check the correctness of the
query and to infer type information, while the data statistics are
used to compute the query plan
Although it seems straightforward to use the same approach for
XML databases, there are two impediments: the semantics of
schema in the XML context and which statistics to collect for an
XML document
Schema issues in XML Query Optimization
In a relational system the database schema mirrors the data
structure
In XML documents, the schema reflects the validity of a
document and not the existing structure of the document
An example:
• a DTD element definition: 'a/b*'
• an XML document: "<a><b/><b/></a>"
• an XPath expression: "/a/b/b/b"
• it may seem that any document valid for the given DTD should
also satisfy the XPath, but this is not the case
Existing schema: the schema that reflects the existing structure
of the document, not the valid one
Augmented ToXin Trees
ToXin trees reflect the existing structure of the
documents, thus a ToXin tree is an existing schema
Augmented ToXin Trees (aTree):
ToXin trees + statistical information = catalog
aTree statistical information:
• NCARD, the cardinality of an element: the number of instances of
this element
• ICARD, the number of distinct values for an element
• fan-out (Fout): the average number of sub-element instances for
each sub-element
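The three statistics can be gathered in a single pass over a document. The sketch below is an illustration only, not the ToX implementation. It assumes that statistics are keyed by the distinct path of each element, and that Fout of a path is its NCARD divided by the NCARD of its parent path. The function name `collect_statistics` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def collect_statistics(xml_text):
    """Return (ncard, icard, fout) dictionaries keyed by distinct element path."""
    root = ET.fromstring(xml_text)
    ncard = defaultdict(int)    # path -> number of element instances
    values = defaultdict(set)   # path -> distinct non-empty text values

    def visit(elem, path):
        ncard[path] += 1
        text = (elem.text or "").strip()
        if text:
            values[path].add(text)
        for child in elem:
            visit(child, path + "/" + child.tag)

    visit(root, "/" + root.tag)

    icard = {path: len(vals) for path, vals in values.items()}
    fout = {}
    for path, count in ncard.items():
        parent = path.rsplit("/", 1)[0]
        if parent in ncard:  # the root path has no parent entry
            fout[path] = count / ncard[parent]
    return dict(ncard), icard, fout


doc = (
    "<suppliers>"
    "<supplier>Magna<branch>Toronto</branch><branch>Detroit</branch></supplier>"
    "<supplier>Demag<branch>Munich</branch><branch>Toronto</branch></supplier>"
    "</suppliers>"
)
ncard, icard, fout = collect_statistics(doc)
```

On this small document the branch path has four instances but only three distinct values, which is the kind of gap between NCARD and ICARD that selectivity estimates depend on.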
Outline
1. Introduction to XML Query Processing
2. XML Query Optimization from a Back-End
Perspective
3. Schema issues in XML Query Optimization
4. ToXop
5. Access Path Selection in ToXop
6. Future Work
ToXop
ToX is an extensible native XML database - many different
components, with the same functionality, can coexist at the
same time
ToXop is one of the query optimization modules in ToX
ToXop is inspired by OPT++ and Volcano - it has two sets of
operators, logical operators and physical operators, and an open
optimization technique (inspired by OPT++), which permits
different optimization strategies to be plugged in
The logical operators are the TAX operators, while the physical
operators are back-end specific
ToXop can accommodate any query algebra - however, it was
designed with TAX in mind
Outline
1. Introduction to XML Query Processing
2. XML Query Optimization from a Back-End
Perspective
3. Schema issues in XML Query Optimization
4. ToXop
5. Access Path Selection in ToXop
6. Future Work
TurboXPath
Joint work with Vanja Josifovski, IBM Research – Almaden
Characteristics:
• natively supports a "core" XPath: child ('/') and descendant ('//')
axes, and predicates ('[]') restricted to the Boolean 'and' and 'or'
operators; uses XALAN for the rest of the predicates
• uses a definition file: concatenates XPath expressions in order
to extract multiple results
• output: tuples or XML
• works with "recursive" documents
• works in streaming environments
• iterator model implementation
Usage: DB2 XML Wrapper, XML Cutter (part of DB2 XML
Extender), DB2 SQLX.
TurboXPath Definition File
CREATE NICKNAME CUSTOMER_I
(name VARCHAR(16) OPTIONS(XPATH './/name'),
address VARCHAR(30) OPTIONS(XPATH './/addr/@street'),
cid VARCHAR(16) OPTIONS(XPATH '@cid', ID 'Y'))
FOR SERVER xml_customer OPTIONS(XPATH '//customer');
CREATE NICKNAME ORDER_I
(amount VARCHAR(20) OPTIONS(XPATH './amount'),
date VARCHAR(10) OPTIONS(XPATH './date'),
oid VARCHAR(16) OPTIONS(ID 'Y'),
cid VARCHAR(16) OPTIONS(PARENT_LINK 'Y'))
FOR SERVER xml_customer OPTIONS(XPATH '//order');
PARENT 'CUSTOMER_I'
TurboXPath in ToXop Context
Observations:
• TurboXPath is a Scan operator
• a TurboXPath parse tree corresponds to a TAX pattern tree
In the ToXop environment TurboXPath:
• takes as arguments a pattern tree and an XML document and
outputs the corresponding witness trees for the given pattern
tree and document
• can be viewed as a FileScan operator
augmented with selection and projection - the selections and
projections are passed to TurboXPath through the pattern
tree
ToxinScan
ToxinScan provides access to the document through a representation
of the document, the augmented ToXin tree
ToxinScan takes a pattern tree as a parameter, thus it has
projection and selection embedded within
ToxinScan evaluates the pattern tree against the ToXin tree -
the results are matched ToXin trees
Matched ToXin trees (mTrees) are those parts of a ToXin tree
that satisfy the given pattern tree; their nodes are adorned
with the corresponding selection predicates from the pattern
tree
ToxinScan Optimization
The goal of the ToxinScan is to evaluate the mTrees. An mTree
can be evaluated in many different ways, yielding different costs
The optimization process consists of:
• selecting the right direction
• selecting the right order
Terminology
Def: in the context of an mTree, given a node n and a set of
predicates S attached to the node n, we call node selectivity
factor 'F' the expected fraction of instances of the node n that
satisfy the predicate set S
Def: in the context of an mTree, assume a node p and a node c,
such that c is a child node of p. We call parent selectivity of the
(child) node c the fraction of p's instances that are
selected after evaluating the path expression that stems from
the parent p and contains the (child) node c
Def: we call joint cost of two path expressions that stem from
the same root the cost of evaluating the first path using a
bottom-up evaluation plus the cost of evaluating the second
path using a top-down evaluation
ToxinScan Optimization Heuristics
The heuristics are based on a uniform distribution assumption
for node instances and employ the following properties
Property 1: in the case of a uniform distribution, for an mTree
rooted in node a with nodes b and c as children, if node b has
a lower selectivity than node c, then:
• the parent selectivity of node b is lower than the parent
selectivity of node c
• C_bac < C_cab (joint costs: b bottom-up then c top-down, versus
c bottom-up then b top-down)
Property 2: in the case of a uniform distribution, for an mTree
rooted in node a with nodes b and c as children, if node b has
a lower parent selectivity than node c, then the cost of
evaluating c top-down is less than the cost of evaluating c
bottom-up
Algorithm for Access Order Selection
First, we sort the children according to parent
selectivity
Second, we evaluate the path with the lowest
selectivity using a bottom-up evaluation
Next, we evaluate all the other paths, in the
selectivity order, using a top-down evaluation
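The three steps above can be sketched as a small planning function. This is an illustration under stated assumptions, not the ToXop code: `ChildPath` and `select_access_order` are hypothetical names, and the parent selectivities are taken as given numbers rather than estimated from the aTree statistics.

```python
from dataclasses import dataclass


@dataclass
class ChildPath:
    name: str
    parent_selectivity: float  # fraction of parent instances the path selects


def select_access_order(children):
    """Return a list of (path name, evaluation direction) in evaluation order."""
    # Step 1: sort the child paths by parent selectivity, lowest first.
    ordered = sorted(children, key=lambda child: child.parent_selectivity)
    plan = []
    for position, child in enumerate(ordered):
        # Step 2: the lowest-selectivity path is evaluated bottom-up.
        # Step 3: every remaining path is evaluated top-down, in sorted order.
        direction = "bottom-up" if position == 0 else "top-down"
        plan.append((child.name, direction))
    return plan


plan = select_access_order([
    ChildPath("c", 0.5),
    ChildPath("b", 0.1),
    ChildPath("d", 0.3),
])
```

Here the path through b is evaluated first and bottom-up, so the top-down evaluations of d and c only start from the parent instances that survived it.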
[Figure: An example of access order selection]
Outline
1. Introduction to XML Query Processing
2. XML Query Optimization from a Back-End
Perspective
3. Schema issues in XML Query Optimization
4. ToXop
5. Access Path Selection in ToXop
6. Future Work
The Road Ahead
The ToXop framework and the ToXop access method selection
are fully implemented
The next step is to implement an Execution Engine in order to
perform tests and run benchmarks
We plan to implement a back-end using structural joins in the
Timber manner and compare our baseline with the Timber
baseline. Then, we will compare ToXop optimized results with the
baseline in order to measure the speedup and thus to compare with
the Timber reported performance
We plan to test ToXop on structured documents, the DBLP
collection; deeply nested data, the EBOC medical data; and the
XMARK benchmark
It is our belief that Timber performs better with certain types of
documents while ToXop performs better with other types
This is not the end,
this is just the beginning!
Thank you for your attention!

More Related Content

What's hot

8 query processing and optimization
8 query processing and optimization8 query processing and optimization
8 query processing and optimizationKumar
 
Sedna XML Database: Query Parser & Optimizing Rewriter
Sedna XML Database: Query Parser & Optimizing RewriterSedna XML Database: Query Parser & Optimizing Rewriter
Sedna XML Database: Query Parser & Optimizing Rewriter
Ivan Shcheklein
 
MySQL 8.0: What Is New in Optimizer and Executor?
MySQL 8.0: What Is New in Optimizer and Executor?MySQL 8.0: What Is New in Optimizer and Executor?
MySQL 8.0: What Is New in Optimizer and Executor?
Norvald Ryeng
 
The life of a query (oracle edition)
The life of a query (oracle edition)The life of a query (oracle edition)
The life of a query (oracle edition)maclean liu
 
XQuery Triggers in Native XML Database Sedna
XQuery Triggers in Native XML Database SednaXQuery Triggers in Native XML Database Sedna
XQuery Triggers in Native XML Database Sednamaria.grineva
 
Unit II - LINEAR DATA STRUCTURES
Unit II -  LINEAR DATA STRUCTURESUnit II -  LINEAR DATA STRUCTURES
Unit II - LINEAR DATA STRUCTURES
Usha Mahalingam
 
Sedna XML Database System: Internal Representation
Sedna XML Database System: Internal RepresentationSedna XML Database System: Internal Representation
Sedna XML Database System: Internal Representation
Ivan Shcheklein
 
Introduction To R Language
Introduction To R LanguageIntroduction To R Language
Introduction To R Language
Gaurang Dobariya
 
The map interface (the java™ tutorials collections interfaces)
The map interface (the java™ tutorials   collections   interfaces)The map interface (the java™ tutorials   collections   interfaces)
The map interface (the java™ tutorials collections interfaces)
charan kumar
 
Xml query language and navigation
Xml query language and navigationXml query language and navigation
Xml query language and navigationRaghu nath
 
Semantic RDF based integration framework for heterogeneous XML data sources
Semantic RDF based integration framework for heterogeneous XML data sourcesSemantic RDF based integration framework for heterogeneous XML data sources
Semantic RDF based integration framework for heterogeneous XML data sources
Deniz Kılınç
 
Chapter15
Chapter15Chapter15
Chapter15
gourab87
 
Apache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurApache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathur
Siddharth Mathur
 
R Programming Tutorial for Beginners - -TIB Academy
R Programming Tutorial for Beginners - -TIB AcademyR Programming Tutorial for Beginners - -TIB Academy
R Programming Tutorial for Beginners - -TIB Academy
rajkamaltibacademy
 
Free Based DSLs for Distributed Compute Engines
Free Based DSLs for Distributed Compute EnginesFree Based DSLs for Distributed Compute Engines
Free Based DSLs for Distributed Compute Engines
Joydeep Banik Roy
 
Query evaluation and optimization
Query evaluation and optimizationQuery evaluation and optimization
Query evaluation and optimization
lavanya marichamy
 
A Workshop on R
A Workshop on RA Workshop on R
A Workshop on R
Ajay Ohri
 
Overview of query evaluation
Overview of query evaluationOverview of query evaluation
Overview of query evaluationavniS
 
Elag 2012 - Under the hood of 3TU.Datacentrum.
Elag 2012 - Under the hood of 3TU.Datacentrum.Elag 2012 - Under the hood of 3TU.Datacentrum.
Elag 2012 - Under the hood of 3TU.Datacentrum.
Egbert Gramsbergen
 

What's hot (20)

8 query processing and optimization
8 query processing and optimization8 query processing and optimization
8 query processing and optimization
 
Sedna XML Database: Query Parser & Optimizing Rewriter
Sedna XML Database: Query Parser & Optimizing RewriterSedna XML Database: Query Parser & Optimizing Rewriter
Sedna XML Database: Query Parser & Optimizing Rewriter
 
MySQL 8.0: What Is New in Optimizer and Executor?
MySQL 8.0: What Is New in Optimizer and Executor?MySQL 8.0: What Is New in Optimizer and Executor?
MySQL 8.0: What Is New in Optimizer and Executor?
 
The life of a query (oracle edition)
The life of a query (oracle edition)The life of a query (oracle edition)
The life of a query (oracle edition)
 
XQuery Triggers in Native XML Database Sedna
XQuery Triggers in Native XML Database SednaXQuery Triggers in Native XML Database Sedna
XQuery Triggers in Native XML Database Sedna
 
Unit II - LINEAR DATA STRUCTURES
Unit II -  LINEAR DATA STRUCTURESUnit II -  LINEAR DATA STRUCTURES
Unit II - LINEAR DATA STRUCTURES
 
Sedna XML Database System: Internal Representation
Sedna XML Database System: Internal RepresentationSedna XML Database System: Internal Representation
Sedna XML Database System: Internal Representation
 
Introduction To R Language
Introduction To R LanguageIntroduction To R Language
Introduction To R Language
 
The map interface (the java™ tutorials collections interfaces)
The map interface (the java™ tutorials   collections   interfaces)The map interface (the java™ tutorials   collections   interfaces)
The map interface (the java™ tutorials collections interfaces)
 
Xml query language and navigation
Xml query language and navigationXml query language and navigation
Xml query language and navigation
 
Semantic RDF based integration framework for heterogeneous XML data sources
Semantic RDF based integration framework for heterogeneous XML data sourcesSemantic RDF based integration framework for heterogeneous XML data sources
Semantic RDF based integration framework for heterogeneous XML data sources
 
Chapter15
Chapter15Chapter15
Chapter15
 
Apache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurApache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathur
 
Query compiler
Query compilerQuery compiler
Query compiler
 
R Programming Tutorial for Beginners - -TIB Academy
R Programming Tutorial for Beginners - -TIB AcademyR Programming Tutorial for Beginners - -TIB Academy
R Programming Tutorial for Beginners - -TIB Academy
 
Free Based DSLs for Distributed Compute Engines
Free Based DSLs for Distributed Compute EnginesFree Based DSLs for Distributed Compute Engines
Free Based DSLs for Distributed Compute Engines
 
Query evaluation and optimization
Query evaluation and optimizationQuery evaluation and optimization
Query evaluation and optimization
 
A Workshop on R
A Workshop on RA Workshop on R
A Workshop on R
 
Overview of query evaluation
Overview of query evaluationOverview of query evaluation
Overview of query evaluation
 
Elag 2012 - Under the hood of 3TU.Datacentrum.
Elag 2012 - Under the hood of 3TU.Datacentrum.Elag 2012 - Under the hood of 3TU.Datacentrum.
Elag 2012 - Under the hood of 3TU.Datacentrum.
 

Viewers also liked

Cultura y cultura organizacional
Cultura y cultura organizacionalCultura y cultura organizacional
Cultura y cultura organizacional
Javier Jose
 
Congreso
CongresoCongreso
Congreso
gongop
 
Connect
ConnectConnect
Connect
apj sr
 
Publish and Share a Scenario (7 slides)
Publish and Share a Scenario (7 slides) Publish and Share a Scenario (7 slides)
Publish and Share a Scenario (7 slides)
Trefis
 
The Nature and Future of the Relation Between Neoliberalism And Non-Governmen...
The Nature and Future of the Relation Between Neoliberalism And Non-Governmen...The Nature and Future of the Relation Between Neoliberalism And Non-Governmen...
The Nature and Future of the Relation Between Neoliberalism And Non-Governmen...
inventionjournals
 
অদ্ভুত ২০ কারণেও আপনার প্রেম হয়ে যেতে পারে
অদ্ভুত ২০ কারণেও আপনার প্রেম হয়ে যেতে পারেঅদ্ভুত ২০ কারণেও আপনার প্রেম হয়ে যেতে পারে
অদ্ভুত ২০ কারণেও আপনার প্রেম হয়ে যেতে পারে
Beauty World
 
スポーツまちづくりの理念構造と定義
スポーツまちづくりの理念構造と定義スポーツまちづくりの理念構造と定義
スポーツまちづくりの理念構造と定義
Atsushi TAKAOKA
 
Resulteset
ResultesetResulteset
Trabajo Aragon
Trabajo AragonTrabajo Aragon
Trabajo AragonSaramusica
 
What is the purpose of film openings
What is the purpose of film openingsWhat is the purpose of film openings
What is the purpose of film openings
Bruna Martins
 
Informe a3
Informe a3Informe a3
Informe a3sidokar
 
Habilidades digitales
Habilidades digitalesHabilidades digitales
Habilidades digitales
mauro montoya cantera
 
Webquest de espanol
Webquest de espanolWebquest de espanol
Webquest de espanolbellelaufer
 
HUF ppt2011
HUF ppt2011HUF ppt2011
HUF ppt2011
Alex Ferrel
 
Presentación Hlp
Presentación HlpPresentación Hlp
Presentación Hlphlizarragap
 
Presentacion empresa
Presentacion empresaPresentacion empresa
Presentacion empresa
Aumenta tu Trafico
 

Viewers also liked (20)

Cultura y cultura organizacional
Cultura y cultura organizacionalCultura y cultura organizacional
Cultura y cultura organizacional
 
art%3A10.1186%2F1756-0500-6-299
art%3A10.1186%2F1756-0500-6-299art%3A10.1186%2F1756-0500-6-299
art%3A10.1186%2F1756-0500-6-299
 
Congreso
CongresoCongreso
Congreso
 
Connect
ConnectConnect
Connect
 
Publish and Share a Scenario (7 slides)
Publish and Share a Scenario (7 slides) Publish and Share a Scenario (7 slides)
Publish and Share a Scenario (7 slides)
 
The Nature and Future of the Relation Between Neoliberalism And Non-Governmen...
The Nature and Future of the Relation Between Neoliberalism And Non-Governmen...The Nature and Future of the Relation Between Neoliberalism And Non-Governmen...
The Nature and Future of the Relation Between Neoliberalism And Non-Governmen...
 
অদ্ভুত ২০ কারণেও আপনার প্রেম হয়ে যেতে পারে
অদ্ভুত ২০ কারণেও আপনার প্রেম হয়ে যেতে পারেঅদ্ভুত ২০ কারণেও আপনার প্রেম হয়ে যেতে পারে
অদ্ভুত ২০ কারণেও আপনার প্রেম হয়ে যেতে পারে
 
スポーツまちづくりの理念構造と定義
スポーツまちづくりの理念構造と定義スポーツまちづくりの理念構造と定義
スポーツまちづくりの理念構造と定義
 
Resulteset
ResultesetResulteset
Resulteset
 
Trabajo Aragon
Trabajo AragonTrabajo Aragon
Trabajo Aragon
 
What is the purpose of film openings
What is the purpose of film openingsWhat is the purpose of film openings
What is the purpose of film openings
 
Actividad de afianzamiento fransica y la muerte
Actividad de afianzamiento fransica y la muerteActividad de afianzamiento fransica y la muerte
Actividad de afianzamiento fransica y la muerte
 
Informe a3
Informe a3Informe a3
Informe a3
 
Habilidades digitales
Habilidades digitalesHabilidades digitales
Habilidades digitales
 
Etapas de la vida
Etapas de la vidaEtapas de la vida
Etapas de la vida
 
(97 2003.)
(97 2003.)(97 2003.)
(97 2003.)
 
Webquest de espanol
Webquest de espanolWebquest de espanol
Webquest de espanol
 
HUF ppt2011
HUF ppt2011HUF ppt2011
HUF ppt2011
 
Presentación Hlp
Presentación HlpPresentación Hlp
Presentación Hlp
 
Presentacion empresa
Presentacion empresaPresentacion empresa
Presentacion empresa
 

Similar to ibm_research_aug_8_03

SQL/XML on Oracle
SQL/XML on OracleSQL/XML on Oracle
SQL/XML on Oracle
torp42
 
D0373024030
D0373024030D0373024030
D0373024030
theijes
 
XPath - XML Path Language
XPath - XML Path LanguageXPath - XML Path Language
XPath - XML Path Language
yht4ever
 
PostgreSQL and XML
PostgreSQL and XMLPostgreSQL and XML
PostgreSQL and XML
Peter Eisentraut
 
eXtensible Markup Language (XML)
eXtensible Markup Language (XML)eXtensible Markup Language (XML)
eXtensible Markup Language (XML)
Serhii Kartashov
 
Bitmap Indexes for Relational XML Twig Query Processing
Bitmap Indexes for Relational XML Twig Query ProcessingBitmap Indexes for Relational XML Twig Query Processing
Bitmap Indexes for Relational XML Twig Query Processing
Kyong-Ha Lee
 
Effective Data Retrieval in XML using TreeMatch Algorithm
Effective Data Retrieval in XML using TreeMatch AlgorithmEffective Data Retrieval in XML using TreeMatch Algorithm
Effective Data Retrieval in XML using TreeMatch Algorithm
IRJET Journal
 
Environment Canada's Data Management Service
Environment Canada's Data Management ServiceEnvironment Canada's Data Management Service
Environment Canada's Data Management Service
Safe Software
 
Real World Experience With Oracle Xml Database 11g An Oracle Ace’s Perspectiv...
Real World Experience With Oracle Xml Database 11g An Oracle Ace’s Perspectiv...Real World Experience With Oracle Xml Database 11g An Oracle Ace’s Perspectiv...
Real World Experience With Oracle Xml Database 11g An Oracle Ace’s Perspectiv...
Marco Gralike
 
Using Java to implement RESTful Web Services: JAX-RS
Using Java to implement RESTful Web Services: JAX-RSUsing Java to implement RESTful Web Services: JAX-RS
Using Java to implement RESTful Web Services: JAX-RSKatrien Verbert
 
transforming xml using xsl and xslt
transforming xml using xsl and xslttransforming xml using xsl and xslt
transforming xml using xsl and xslt
Hemant Suthar
 
Processing XML with Java
Processing XML with JavaProcessing XML with Java
Processing XML with Java
BG Java EE Course
 
Boost Your Environment With XMLDB - UKOUG 2008 - Marco Gralike
Boost Your Environment With XMLDB - UKOUG 2008 - Marco GralikeBoost Your Environment With XMLDB - UKOUG 2008 - Marco Gralike
Boost Your Environment With XMLDB - UKOUG 2008 - Marco GralikeMarco Gralike
 
EXPath: the packaging system and the webapp framework
EXPath: the packaging system and the webapp frameworkEXPath: the packaging system and the webapp framework
EXPath: the packaging system and the webapp frameworkFlorent Georges
 
OPP2010 (Brussels) - Programming with XML in PL/SQL - Part 1
OPP2010 (Brussels) - Programming with XML in PL/SQL - Part 1OPP2010 (Brussels) - Programming with XML in PL/SQL - Part 1
OPP2010 (Brussels) - Programming with XML in PL/SQL - Part 1Marco Gralike
 
Learning XSLT
Learning XSLTLearning XSLT
Learning XSLT
Overdue Books LLC
 
A Standard Data Format for Computational Chemistry: CSX
A Standard Data Format for Computational Chemistry: CSXA Standard Data Format for Computational Chemistry: CSX
A Standard Data Format for Computational Chemistry: CSX
Stuart Chalk
 

Similar to ibm_research_aug_8_03 (20)

SQL/XML on Oracle
SQL/XML on OracleSQL/XML on Oracle
SQL/XML on Oracle
 
D0373024030
D0373024030D0373024030
D0373024030
 
XPath - XML Path Language
XPath - XML Path LanguageXPath - XML Path Language
XPath - XML Path Language
 
PostgreSQL and XML
PostgreSQL and XMLPostgreSQL and XML
PostgreSQL and XML
 
eXtensible Markup Language (XML)
eXtensible Markup Language (XML)eXtensible Markup Language (XML)
eXtensible Markup Language (XML)
 
Bitmap Indexes for Relational XML Twig Query Processing
Bitmap Indexes for Relational XML Twig Query ProcessingBitmap Indexes for Relational XML Twig Query Processing
Bitmap Indexes for Relational XML Twig Query Processing
 
Effective Data Retrieval in XML using TreeMatch Algorithm
Effective Data Retrieval in XML using TreeMatch AlgorithmEffective Data Retrieval in XML using TreeMatch Algorithm
Effective Data Retrieval in XML using TreeMatch Algorithm
 
Environment Canada's Data Management Service
Environment Canada's Data Management ServiceEnvironment Canada's Data Management Service
Environment Canada's Data Management Service
 
Java XML Parsing
Java XML ParsingJava XML Parsing
Java XML Parsing
 
Real World Experience With Oracle Xml Database 11g An Oracle Ace’s Perspectiv...
Real World Experience With Oracle Xml Database 11g An Oracle Ace’s Perspectiv...Real World Experience With Oracle Xml Database 11g An Oracle Ace’s Perspectiv...
Real World Experience With Oracle Xml Database 11g An Oracle Ace’s Perspectiv...
 
Using Java to implement RESTful Web Services: JAX-RS
Using Java to implement RESTful Web Services: JAX-RSUsing Java to implement RESTful Web Services: JAX-RS
Using Java to implement RESTful Web Services: JAX-RS
 
OAXAL
OAXALOAXAL
OAXAL
 
transforming xml using xsl and xslt
transforming xml using xsl and xslttransforming xml using xsl and xslt
transforming xml using xsl and xslt
 
Processing XML with Java
Processing XML with JavaProcessing XML with Java
Processing XML with Java
 
Boost Your Environment With XMLDB - UKOUG 2008 - Marco Gralike
Boost Your Environment With XMLDB - UKOUG 2008 - Marco GralikeBoost Your Environment With XMLDB - UKOUG 2008 - Marco Gralike
Boost Your Environment With XMLDB - UKOUG 2008 - Marco Gralike
 
Day2 xslt x_path_xquery
Day2 xslt x_path_xqueryDay2 xslt x_path_xquery
Day2 xslt x_path_xquery
 
EXPath: the packaging system and the webapp framework
EXPath: the packaging system and the webapp frameworkEXPath: the packaging system and the webapp framework
EXPath: the packaging system and the webapp framework
 
OPP2010 (Brussels) - Programming with XML in PL/SQL - Part 1
OPP2010 (Brussels) - Programming with XML in PL/SQL - Part 1OPP2010 (Brussels) - Programming with XML in PL/SQL - Part 1
OPP2010 (Brussels) - Programming with XML in PL/SQL - Part 1
 
Learning XSLT
Learning XSLTLearning XSLT
Learning XSLT
 
A Standard Data Format for Computational Chemistry: CSX
A Standard Data Format for Computational Chemistry: CSXA Standard Data Format for Computational Chemistry: CSX
A Standard Data Format for Computational Chemistry: CSX
 

ibm_research_aug_8_03

  • 1. Access Path Selection for XML Query Processing Attila Barta IBM Extreme Blue RTP & University of Toronto abarta@us.ibm.com atibarta@cs.toronto.edu
  • 2. August 8, 2003 Access Path Selection for XML QP 2 Outline 1. Introduction to XML Query Processing 2. XML Query Optimization from a Back-End Perspective 3. Schema issues in XML Query Optimization 4. ToXop 5. Access Path Selection in ToXop 6. Future Work
  • 3. August 8, 2003 Access Path Selection for XML QP 3 Motivation XML is here to stay - E-commerce, B2B and Information exchange generate a large amount of XML data All these data needs to be queried - XQuery is the nearly standard XML query language There are more than 39 XQuery implementations recorded on the W3C site
  • 4. XQuery Implementations: Most implementations run in memory on top of a DOM parser. Systems like Xperanto and LegoDB render XQuery into an underlying query language (SQL). Systems like Timber, Tukwila, and ToX provide native implementations of XQuery. Only the native implementations support query optimization. Question: how can we perform query optimization in the XML context? Can we still use the relational approach?
  • 5. The Relational Optimization Technique: The SQL query is decomposed into an internal representation based on Relational Algebra (RA), an operator tree with RA operators as nodes. Access path selection is then performed, that is, choosing the cheapest access to the tables (among FileScan, IndexScan, etc.). The join order is computed. Question: can we use the same technique for optimizing XML queries?
  • 6. XML Query Algebra? There are many proposals; all but two are APIs or calculi, and only one, TAX, is implemented. The Tree Algebra for XML (TAX): is part of the Timber project from the University of Michigan; treats the database as a collection of XML documents; has a set of operators that mirrors the RA, plus XML-specific operators like copy and paste; has operators that work on collections of trees; introduces pattern trees and witness trees, which are specific to TAX.
  • 7. An XQuery query and its Pattern Trees: for $x in document("file: /catalog.xml")//item, $y in document("file: /parts.xml")//part, $z in document("file: /supplier.xml")//supplier let $a := $x/part_no where $a = $y/part_no and $x/supplier_no = $z/supplier_no and $z/city = "Munich" return <result> { $a } { $x/price } { $y/description }{ $z/name } </result>
  • 8. Access Path Selection for XQuery: In the relational model, a Scan operator operates on tuples. In the XML model, a Scan operator operates on witness trees generated by pattern trees. To estimate the cost of a Scan operator, we have to estimate the cost of evaluating the corresponding pattern trees. There are two paradigms for evaluating pattern trees (and XPaths in general): node-at-a-time and set-at-a-time. In node-at-a-time evaluation, the nodes are processed as they are scanned by the parser.
  • 9. Set-at-a-time: Region-algebra-like encoding: [term, DocID, StartPos, EndPos, LevelNum] for elements; [term, DocID, StartPos, LevelNum] for string values. For the XPath expression '//a/b': SELECT * FROM Elements e1, Elements e2 WHERE e1.term = 'a' AND e2.term = 'b' AND e1.docno = e2.docno AND e1.begin < e2.begin AND e2.end < e1.end AND e2.level = e1.level + 1. This approach is also known as a structural join, and it is used in many systems, including Timber and Niagara.
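The containment test behind the '//a/b' query can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Element` record, its field names, and the sample rows are hypothetical; the level is assumed to grow with depth (a child sits one level below its parent); and the naive nested loop stands in for the merge- or stack-based structural join algorithms that real systems use.

```python
from typing import NamedTuple


class Element(NamedTuple):
    """One row of a hypothetical Elements table (region encoding)."""
    term: str    # element name
    doc_id: int  # document identifier
    start: int   # start position of the element
    end: int     # end position of the element
    level: int   # nesting depth, root = 0 (assumed convention)


def child_join(elements, parent_term, child_term):
    """Return (parent, child) pairs matching //parent_term/child_term."""
    parents = [e for e in elements if e.term == parent_term]
    children = [e for e in elements if e.term == child_term]
    return [
        (p, c)
        for p in parents
        for c in children
        if p.doc_id == c.doc_id
        and p.start < c.start and c.end < p.end   # region containment
        and c.level == p.level + 1                # direct child only
    ]


# Hand-made encoding of <a><b/><c><b/></c></a> stored as document 1.
ELEMENTS = [
    Element("a", 1, 1, 8, 0),
    Element("b", 1, 2, 3, 1),
    Element("c", 1, 4, 7, 1),
    Element("b", 1, 5, 6, 2),
]

matches = child_join(ELEMENTS, "a", "b")
```

Only the first `b` qualifies: the second one is contained in `a` but sits two levels below it, so the level condition rejects it.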
  • 10. Outline 1. Introduction to XML Query Processing 2. XML Query Optimization from a Back-End Perspective 3. Schema issues in XML Query Optimization 4. ToXop 5. Access Path Selection in ToXop 6. Future Work
  • 11. The Back-End: Most XQuery processing systems are built for one particular back-end. The Toronto XML Server (ToX) supports multiple back-ends, which enables access path selection. The existing ToX back-ends: flat files, relational, ToXin.
  • 12. ToXin: Proposed by Rizzolo and Mendelzon. ToXin is an index structure that allows backward and forward navigation on an XML document. ToXin captures the entire content of a document, so it can be used as a back-end. ToXin mirrors the structure of the document: for each element in a distinct path, there is a ToXin node to represent it.
  • 13. A Sample XML Document: <!-- Supplier.xml : Suppliers and branches --> <suppliers> <supplier> Magna <branch> Toronto </branch> <branch> Montreal </branch> <branch> Detroit </branch> </supplier> <supplier> ABB <branch> Zurich </branch> <branch> Stockholm </branch> <branch> Toulouse </branch> <branch> Haifa </branch> <branch> New York </branch> <branch> Kyoto </branch> <branch> Sydney </branch> <branch> Hong Kong </branch> </supplier> <supplier> Demag <branch> Munich </branch> <branch> Koln </branch> </supplier> </suppliers>
  • 14. A Tree representation for the Sample XML Document
  • 15. A ToXin Tree for the Sample Document
  • 16. ToXin Instance and Value Tables
  • 17. ToXin Encoding: Bottom-up evaluation: start by evaluating the predicates on the leaf value table and proceed upwards.
  • 18. Outline 1. Introduction to XML Query Processing 2. XML Query Optimization from a Back-End Perspective 3. Schema issues in XML Query Optimization 4. ToXop 5. Access Path Selection in ToXop 6. Future Work
  • 19. Schema in Relational Query Optimization: Schema information and data statistics are essential for query optimization. In relational systems, schema information is read from the system catalog; data statistics, also stored in the system catalog, are collected periodically from the database. The schema information is used to check the correctness of the query and to infer type information, while the data statistics are used to compute the query plan. Although it seems straightforward to use the same approach for XML databases, there are two impediments: the semantics of schema in the XML context, and which statistics to collect for an XML document.
  • 20. Schema issues in XML Query Optimization: In a relational system, the database schema mirrors the data structure. In XML documents, the schema describes which documents are valid, not the structure that actually exists in a given document. An example: a DTD element definition 'a/b*'; an XML document "<a><b/><b/></a>"; an XPath expression "/a/b/b/b". It might appear that any document valid for the given DTD should satisfy the XPath too, but this is not the case. Existing Schema: the schema that reflects the existing structure of the document and not the valid one.
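The gap between a valid document and its existing structure can be checked directly on the slide's example. The sketch below uses Python's standard `xml.etree.ElementTree`; note that its path expressions are relative to the root element, so `b` stands for `/a/b` and `b/b/b` for `/a/b/b/b`. No DTD validation is performed here; the document is simply taken from the slide as given.

```python
import xml.etree.ElementTree as ET

# The document from the slide, taken as given (no DTD validation here).
doc = ET.fromstring("<a><b/><b/></a>")

# Paths are relative to the root element <a>.
children = doc.findall("b")      # corresponds to /a/b
nested = doc.findall("b/b/b")    # corresponds to /a/b/b/b

print(len(children), len(nested))  # prints: 2 0
```

The document has two `b` children, yet the nested path selects nothing, so an optimizer that wants to know which paths actually occur has to look at the document rather than at the schema.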
  • 21. Augmented ToXin Trees: ToXin trees reflect the existing structure of the documents, thus a ToXin tree is an existing schema. Augmented ToXin Trees (aTree): ToXin trees + statistical information = catalog. aTree statistical information: NCARD, the cardinality of an element (the number of instances of this element); ICARD, the number of distinct values for an element; fan-out (Fout), the average number of sub-element instances for each sub-element.
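As a rough illustration of how such statistics could be gathered, the sketch below walks a shortened version of the sample supplier document and records NCARD, ICARD, and fan-out per distinct path. The function name `collect_stats`, the per-path dictionary layout, and the reading of fan-out as "instances of a path divided by instances of its parent path" are assumptions made for this example, not the ToX implementation.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Shortened version of the sample document (two suppliers only).
SAMPLE = """
<suppliers>
  <supplier> Magna
    <branch> Toronto </branch> <branch> Montreal </branch> <branch> Detroit </branch>
  </supplier>
  <supplier> Demag
    <branch> Munich </branch> <branch> Koln </branch>
  </supplier>
</suppliers>
"""


def collect_stats(root):
    """Return {path: (NCARD, ICARD, fan-out)} for every distinct path."""
    instances = defaultdict(int)   # path -> number of element instances
    values = defaultdict(set)      # path -> distinct leading text values

    def walk(node, path):
        instances[path] += 1
        text = (node.text or "").strip()
        if text:
            values[path].add(text)
        for child in node:
            walk(child, path + "/" + child.tag)

    walk(root, "/" + root.tag)

    stats = {}
    for path, ncard in instances.items():
        parent = path.rsplit("/", 1)[0]
        # Assumed definition: average instances of this path per parent instance.
        fan_out = ncard / instances[parent] if parent in instances else None
        stats[path] = (ncard, len(values[path]), fan_out)
    return stats


stats = collect_stats(ET.fromstring(SAMPLE))
```

For this input, `stats["/suppliers/supplier/branch"]` is `(5, 5, 2.5)`: five branch instances, five distinct values, and an average of 2.5 branches per supplier.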
  • 22. Outline 1. Introduction to XML Query Processing 2. XML Query Optimization from a Back-End Perspective 3. Schema issues in XML Query Optimization 4. ToXop 5. Access Path Selection in ToXop 6. Future Work
  • 23. ToXop: ToX is an expandable native XML database: many different components with the same functionality can coexist at the same time. ToXop is one of the query optimization modules in ToX. ToXop is inspired by OPT++ and Volcano: it has two sets of operators, logical and physical, and an open optimization technique (inspired by OPT++) that permits different optimization strategies to be plugged in. The logical operators are the TAX operators, while the physical operators are back-end specific. ToXop can accommodate any query algebra; however, it was designed with TAX in mind.
  • 25. Outline 1. Introduction to XML Query Processing 2. XML Query Optimization from a Back-End Perspective 3. Schema issues in XML Query Optimization 4. ToXop 5. Access Path Selection in ToXop 6. Future Work
  • 26. TurboXPath: Joint work with Vanja Josifovski, IBM Research – Almaden. Characteristics: natively supports a "core" XPath, namely the child ('/') and descendant ('//') axes and predicates ('[]') restricted to the Boolean 'and' and 'or' operators, and uses XALAN for the rest of the predicates; uses a definition file that concatenates XPath expressions in order to extract multiple results; outputs tuples or XML; works with "recursive" documents; works in streaming environments; implements the iterator model. Usage: DB2 XML Wrapper, XML Cutter (part of DB2 XML Extender), DB2 SQLX.
  • 27. TurboXPath Definition File: CREATE NICKNAME CUSTOMER_I (name VARCHAR(16) OPTIONS(XPATH './/name'), address VARCHAR(30) OPTIONS(XPATH './/addr/@street'), cid VARCHAR(16) OPTIONS(XPATH '@cid', ID 'Y')) FOR SERVER xml_customer OPTIONS(XPATH '//customer'); CREATE NICKNAME ORDER_I (amount VARCHAR(20) OPTIONS(XPATH './amount'), date VARCHAR(10) OPTIONS(XPATH './date'), oid VARCHAR(16) OPTIONS(ID 'Y'), cid VARCHAR(16) OPTIONS(PARENT_LINK 'Y')) FOR SERVER xml_customer OPTIONS(XPATH '//order'); PARENT 'CUSTOMER_I'
  • 30. TurboXPath in ToXop Context: Observations: TurboXPath is a Scan operator; TurboXPath parse tree ≈ TAX pattern tree. In the ToXop environment, TurboXPath: takes as arguments a pattern tree and an XML document, and outputs the corresponding witness trees for the given pattern tree and document; can be viewed as a FileScan operator augmented with selection and projection, where the selections and projections are passed to TurboXPath through the pattern tree.
  • 31. ToxinScan: ToxinScan provides access to the document through a representation of the document, the augmented ToXin tree. ToxinScan takes a pattern tree as a parameter, so it has projection and selection embedded. ToxinScan evaluates the pattern tree against the ToXin tree; the results are matched ToXin trees. A matched ToXin tree (mTree) consists of those parts of a ToXin tree that satisfy the given pattern tree, with the nodes adorned with the corresponding selection predicates from the pattern tree.
  • 33. ToxinScan Optimization: The goal of ToxinScan is to evaluate the mTrees. An mTree can be evaluated in many different ways, yielding different costs. The optimization process consists of: selecting the right direction; selecting the right order.
  • 34. Terminology: Def: in the context of an mTree, given a node n and a set of predicates S attached to n, the node selectivity factor 'F' is the expected fraction of instances of n that satisfy the predicate set S. Def: in the context of an mTree, let c be a child node of a node p. The parent selectivity of the (child) node c is the fraction of p's instances that are selected after evaluating the path expression that stems from p and contains c. Def: the joint cost of two path expressions that stem from the same root is the cost of evaluating the first path bottom-up plus the cost of evaluating the second path top-down.
  • 35. ToxinScan Optimization Heuristics: The heuristics are based on a uniform distribution assumption for node instances and employ the following properties. Property 1: in the case of a uniform distribution, for an mTree rooted in node a with nodes b and c as children, if node b has a lower selectivity than node c, then: the parent selectivity of node b is lower than the parent selectivity of node c; C_bac < C_cab. Property 2: in the case of a uniform distribution, for an mTree rooted in node a with nodes b and c as children, if node b has a lower parent selectivity than node c, then the cost of evaluating c top-down is less than the cost of evaluating c bottom-up.
  • 36. Algorithm for Access Order Selection: First, we sort the children according to parent selectivity. Second, we evaluate the path with the lowest selectivity using a bottom-up evaluation. Next, we evaluate all the other paths, in selectivity order, using a top-down evaluation.
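The three steps of the access order selection can be sketched as a small planning function. This is a minimal sketch, not the ToXop code: the function name `plan_access_order`, the dictionary input, and the sample selectivity values are hypothetical, and "lowest selectivity" is read as the smallest fraction of parent instances selected.

```python
def plan_access_order(parent_selectivity):
    """Return [(child, direction), ...] in evaluation order.

    parent_selectivity maps each child path to the fraction of parent
    instances it selects (smaller fraction = more selective).
    """
    # Step 1: sort the children by parent selectivity, lowest first.
    ordered = sorted(parent_selectivity, key=parent_selectivity.get)
    # Step 2: the first (most selective) path is evaluated bottom-up.
    # Step 3: all remaining paths are evaluated top-down, in sorted order.
    return [
        (child, "bottom-up" if i == 0 else "top-down")
        for i, child in enumerate(ordered)
    ]


# Hypothetical selectivities for three child paths of one mTree node.
plan = plan_access_order({"price": 0.6, "city": 0.05, "name": 0.3})
print(plan)
# [('city', 'bottom-up'), ('name', 'top-down'), ('price', 'top-down')]
```

The bottom-up pass on the most selective path narrows the set of parent instances first, so each later top-down pass only has to visit the parents that survived.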
  • 37. An Example of Access Order Selection
  • 38. Outline 1. Introduction to XML Query Processing 2. XML Query Optimization from a Back-End Perspective 3. Schema issues in XML Query Optimization 4. ToXop 5. Access Path Selection in ToXop 6. Future Work
  • 39. The Road Ahead: The ToXop framework and the ToXop access method selection are fully implemented. The next step is to implement an Execution Engine in order to perform tests and run benchmarks. We plan to implement a back-end using structural joins in the Timber manner and compare our baseline with the Timber baseline. Then we will compare ToXop-optimized results with the baseline in order to measure the speedup, and thus compare with Timber's reported performance. We plan to test ToXop on structured documents (the DBLP collection), deeply nested data (the EBOC medical data), and the XMARK benchmark. It is our belief that Timber performs better with certain types of documents while ToXop performs better with other types.
  • 40. This is not the end, this is just the beginning! Thank you for your attention!