ibm_research_aug_8_03

Access Path Selection
for
XML Query Processing
Attila Barta
IBM Extreme Blue RTP & University of Toronto
abarta@us.ibm.com
atibarta@cs.toronto.edu

August 8, 2003 Access Path Selection for XML QP
Outline
1. Introduction to XML Query Processing
2. XML Query Optimization from a Back-End
Perspective
3. Schema issues in XML Query Optimization
4. ToXop
5. Access Path Selection in ToXop
6. Future Work

Motivation
XML is here to stay - e-commerce, B2B and
information exchange generate large amounts of
XML data
All this data needs to be queried - XQuery is
close to being the standard XML query language
More than 39 XQuery implementations are
listed on the W3C site

XQuery Implementations
Most implementations work in memory on top of
a DOM parser
Systems like XPERANTO and LegoDB translate XQuery into
an underlying query language (SQL)
Systems like Timber, Tukwila and ToX provide native
implementations of XQuery
Only the native implementations support query
optimization
Question: how can we perform query optimization in
the XML context? Can we still use the relational
approach?

The Relational Optimization Technique
The SQL query is decomposed into an internal
representation based on Relational Algebra (RA) - an
operator tree with RA operators as nodes
Access path selection is performed - that is,
choosing the cheapest way to access each table (choosing
among FileScan, IndexScan, etc.)
The join order is computed
Question: can we use the same technique for optimizing
XML queries?
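
The access path selection step above can be sketched as follows. This is a minimal illustration under a toy cost model: the statistics, the cost formulas and the names `TableStats`, `file_scan_cost` and `index_scan_cost` are assumptions made for the example, not those of any particular relational system.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    pages: int         # number of data pages in the table
    rows: int          # number of tuples in the table
    index_height: int  # levels of a (hypothetical) B-tree index


def file_scan_cost(t: TableStats) -> float:
    # A full scan reads every page once.
    return float(t.pages)


def index_scan_cost(t: TableStats, selectivity: float) -> float:
    # Toy model: descend the index, then fetch one page per matching row.
    return t.index_height + selectivity * t.rows


def cheapest_access_path(t: TableStats, selectivity: float) -> str:
    # Cost every candidate access path and keep the cheapest one.
    costs = {
        "FileScan": file_scan_cost(t),
        "IndexScan": index_scan_cost(t, selectivity),
    }
    return min(costs, key=costs.get)


stats = TableStats(pages=1_000, rows=100_000, index_height=3)
print(cheapest_access_path(stats, 0.001))  # highly selective predicate
print(cheapest_access_path(stats, 0.5))    # unselective predicate
```

Under this model the index wins only when the predicate keeps few rows; otherwise the full scan is cheaper.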

XML Query Algebra?
There are many proposals - all but two are APIs or
calculi, and only one, TAX, is implemented
The Tree Algebra for XML (TAX):
 part of the Timber project from U of Michigan
 the database is a collection of XML documents
 a set of operators that mirrors RA, plus XML-specific
operators such as copy and paste
 the operators work on collections of trees
 Specific to TAX: pattern trees, witness trees

An XQuery query and its Pattern Trees
for $x in document("file: /catalog.xml")//item,
$y in document("file: /parts.xml")//part,
$z in document("file: /supplier.xml")//supplier
let $a := $x/part_no
where $a = $y/part_no
and $x/supplier_no = $z/supplier_no
and $z/city = "Munich"
return
<result> { $a } { $x/price } { $y/description }{ $z/name } </result>

Access Path Selection for XQuery
In the Relational model a Scan operator operates on tuples
In the XML model a Scan operator operates on witness trees
generated by pattern trees
In order to evaluate the cost of a Scan operator we have to
evaluate the cost of evaluating the corresponding pattern trees
There are two paradigms for evaluating pattern trees (and
XPaths in general): node-at-a-time and set-at-a-time
In the node-at-a-time paradigm, nodes are processed as the
parser scans them

Set-at-a-time
Region-algebra-like encoding:
 [term, DocID, StartPos, EndPos, LevelNum] - elements
 [term, DocID, StartPos, LevelNum] - string values
For the XPath expression ‘//a/b’:
SELECT *
FROM Elements e1, Elements e2
WHERE e1.term = 'a' AND e2.term = 'b' AND e1.docno = e2.docno
AND e1.begin < e2.begin AND e2.end < e1.end
AND e2.level = e1.level + 1
This approach is also known as structural join and it is used in
many systems, including Timber and Niagara
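
The containment test behind a structural join can be sketched in a few lines. This is a naive nested-loop version for illustration only; it is not the join algorithm used by Timber or Niagara, and the `Element` tuple and the sample encoding are assumptions for the example. A child is taken to be exactly one level deeper than its parent.

```python
from typing import NamedTuple


class Element(NamedTuple):
    term: str
    docno: int
    begin: int
    end: int
    level: int


def child_join(elements, parent_term, child_term):
    """Return (parent, child) pairs for the path //parent_term/child_term."""
    parents = [e for e in elements if e.term == parent_term]
    children = [e for e in elements if e.term == child_term]
    return [
        (p, c)
        for p in parents
        for c in children
        if p.docno == c.docno                    # same document
        and p.begin < c.begin and c.end < p.end  # c lies inside p's region
        and c.level == p.level + 1               # direct child, not a deeper descendant
    ]


# Hypothetical region encoding of <a><b/><c><b/></c></a>
elements = [
    Element("a", 1, 1, 8, 1),
    Element("b", 1, 2, 3, 2),
    Element("c", 1, 4, 7, 2),
    Element("b", 1, 5, 6, 3),
]

pairs = child_join(elements, "a", "b")
for parent, child in pairs:
    print(parent.term, child.term, child.begin, child.end)
```

Only the first `b` qualifies: the second one is contained in `a` but sits two levels below it.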

The Back-End
Most of the XQuery processing systems are built for
one particular back-end
The Toronto XML Server (ToX) supports multiple
back-ends - this enables access path selection
The existing ToX back-ends: flat files, relational, ToXin

ToXin
Proposed by Rizzolo and Mendelzon
ToXin is an index structure that allows backward and
forward navigation on an XML document
ToXin captures the entire content of a document,
thus it can be used as a back end
ToXin mirrors the structure of the document. Thus,
for each element in a distinct path, there is a ToXin
node to represent it

A Sample XML Document

<suppliers>
<supplier> Magna
<branch> Toronto </branch>
<branch> Montreal </branch>
<branch> Detroit </branch>
</supplier>
<supplier> ABB
<branch> Zurich </branch>
<branch> Stockholm </branch>
<branch> Toulouse </branch>
<branch> Haifa </branch>
<branch> New York </branch>
<branch> Kyoto </branch>
<branch> Sydney </branch>
<branch> Hong Kong </branch>
</supplier>
<supplier> Demag
<branch> Munich </branch>
<branch> Koln </branch>
</supplier>
</suppliers>

A Tree representation for the Sample XML Document

A ToXin Tree for the Sample Document

ToXin Instance and Value Tables

ToXin Encoding
Bottom-up evaluation: start evaluating the predicates on the
leaf value table and proceed upwards
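
Bottom-up evaluation can be illustrated on the sample document. The table layout below is a simplifying assumption made for this sketch (one value table per path, plus an instance table that maps each branch instance to its parent supplier instance); the real ToXin tables are richer. The function name `suppliers_with_branch` is hypothetical.

```python
# Content of the sample document: supplier name -> branch cities.
DOCUMENT = {
    "Magna": ["Toronto", "Montreal", "Detroit"],
    "ABB": ["Zurich", "Stockholm", "Toulouse", "Haifa",
            "New York", "Kyoto", "Sydney", "Hong Kong"],
    "Demag": ["Munich", "Koln"],
}

supplier_values = {}  # value table: supplier instance id -> text value
branch_values = {}    # value table: branch instance id -> text value
branch_parent = {}    # instance table: branch instance id -> supplier instance id

for supplier_id, (name, branches) in enumerate(DOCUMENT.items(), start=1):
    supplier_values[supplier_id] = name
    for city in branches:
        branch_id = len(branch_values) + 1
        branch_values[branch_id] = city
        branch_parent[branch_id] = supplier_id


def suppliers_with_branch(city):
    # Step 1: evaluate the predicate on the leaf value table.
    matching = [b for b, value in branch_values.items() if value == city]
    # Step 2: proceed upwards through the instance table to the parents.
    parents = {branch_parent[b] for b in matching}
    # Step 3: read the values of the selected parents.
    return sorted(supplier_values[s] for s in parents)


print(suppliers_with_branch("Munich"))
```

Only the branch instances that satisfy the predicate are followed upwards, so no supplier is visited unless one of its branches matches.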

Schema in Relational Query Optimization
Schema information and data statistics are essential for query
optimization
In relational systems, schema information is read from
the system catalog; data statistics, also stored in the system
catalog, are collected periodically from the database
The schema information is used to check the correctness of the
query and to infer type information, while the data statistics are
used to compute the query plan
Although it seems straightforward to use the same approach for
XML databases, there are two impediments: the semantics of
a schema in the XML context, and which statistics to collect for an
XML document

Schema issues in XML Query Optimization
In a relational system the database schema mirrors the data
structure
In XML documents, the schema reflects the validity of a
document and not the existing structure of the document
An example:
 a DTD element definition: ‘a/b*’
 an XML document: “<a><b/><b/></a>”
 an XPath expression: “/a/b/b/b”
 at first glance, any document valid for the given DTD should
satisfy the XPath too - but this is not the case: the document
above is valid, yet the XPath selects nothing in it
Existing Schema: the schema that reflects the existing structure
of the document and not the valid one
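
The example can be checked with Python's standard `xml.etree.ElementTree` module. This is only a sketch of the observation above: ElementTree does not accept absolute paths on an element, so `/a/b/b/b` is written relative to the root element `a`.

```python
import xml.etree.ElementTree as ET

# The sample document from the example: two b elements directly under a.
root = ET.fromstring("<a><b/><b/></a>")

# /a/b : the structure that actually exists in the document.
children = root.findall("./b")

# /a/b/b/b : a path that does not exist in this document.
deep = root.findall("./b/b/b")

print(len(children), len(deep))
```

The first path matches both `b` elements, while the second matches nothing, which is the information an existing schema would expose to the optimizer.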

Augmented ToXin Trees
ToXin trees reflect the existing structure of the
documents, thus a ToXin tree is an existing schema
Augmented ToXin Trees (aTree):
ToXin trees + statistical information = catalog
aTree statistical information:
 NCARD, cardinality of an element - number of instances for
this element
 ICARD, number of distinct values for an element
 fan-out (Fout) – average number of sub-element instances for
each sub-element
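
As an illustration, the three statistics can be computed for the sample suppliers document with Python's standard library. This is a sketch under assumptions: statistics are keyed by root-to-element path, a value is the stripped leading text of an element, and the helper names `collect_stats` and `fan_out` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DOC = (
    "<suppliers>"
    "<supplier>Magna<branch>Toronto</branch><branch>Montreal</branch>"
    "<branch>Detroit</branch></supplier>"
    "<supplier>ABB<branch>Zurich</branch><branch>Stockholm</branch>"
    "<branch>Toulouse</branch><branch>Haifa</branch><branch>New York</branch>"
    "<branch>Kyoto</branch><branch>Sydney</branch><branch>Hong Kong</branch>"
    "</supplier>"
    "<supplier>Demag<branch>Munich</branch><branch>Koln</branch></supplier>"
    "</suppliers>"
)


def collect_stats(root):
    ncard = defaultdict(int)   # path -> number of element instances (NCARD)
    values = defaultdict(set)  # path -> distinct text values seen

    def visit(node, path):
        ncard[path] += 1
        text = (node.text or "").strip()
        if text:
            values[path].add(text)
        for child in node:
            visit(child, path + "/" + child.tag)

    visit(root, "/" + root.tag)
    icard = {path: len(vals) for path, vals in values.items()}  # ICARD
    return dict(ncard), icard


def fan_out(ncard, parent_path, child_tag):
    # Average number of child instances per parent instance.
    return ncard[parent_path + "/" + child_tag] / ncard[parent_path]


ncard, icard = collect_stats(ET.fromstring(DOC))
print(ncard["/suppliers/supplier"], ncard["/suppliers/supplier/branch"])
print(icard["/suppliers/supplier/branch"])
print(fan_out(ncard, "/suppliers/supplier", "branch"))
```

For this document there are 3 supplier instances and 13 branch instances, so the fan-out from supplier to branch is 13/3.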

ToXop
ToX is an extensible native XML database - many different
components with the same functionality can coexist at the
same time
ToXop is one of the query optimization modules in ToX
ToXop is inspired by OPT++ and Volcano - it has two sets of
operators, logical operators and physical operators, and an open
optimization technique (inspired by OPT++), which permits
different optimization strategies to be plugged in
The logical operators are the TAX operators, while the physical
operators are back-end specific
ToXop can accommodate any query algebra - however, it was
designed with TAX in mind

TurboXPath
Joint work with Vanja Josifovski, IBM Research – Almaden
Characteristics:
 natively supports a “core” XPath: child (‘/’) and descendant (‘//’)
axes, predicates (‘[]’) restricted to the Boolean ‘and’ and ‘or’
operators; uses XALAN for the rest of the predicates
 uses a definition file: concatenates XPath expressions in order
to extract multiple results
 output: tuples or XML
 works with “recursive” documents
 works in streaming environments
 iterator model implementation
Usage: DB2 XML Wrapper, XML Cutter (part of DB2 XML
Extender), DB2 SQLX.

TurboXPath Definition File
CREATE NICKNAME CUSTOMER_I
(name VARCHAR(16) OPTIONS(XPATH './/name'),
address VARCHAR(30) OPTIONS(XPATH './/addr/@street'),
cid VARCHAR(16) OPTIONS(XPATH '@cid', ID 'Y'))
FOR SERVER xml_customer OPTIONS(XPATH '//customer');
CREATE NICKNAME ORDER_I
(amount VARCHAR(20) OPTIONS(XPATH './amount'),
date VARCHAR(10) OPTIONS(XPATH './date'),
oid VARCHAR(16) OPTIONS(ID 'Y'),
cid VARCHAR(16) OPTIONS(PARENT_LINK 'Y'))
FOR SERVER xml_customer OPTIONS(XPATH '//order', PARENT 'CUSTOMER_I');

TurboXPath in ToXop Context
Observations:
 TurboXPath is a Scan operator
 a TurboXPath parse tree corresponds to a TAX pattern tree
In the ToXop environment TurboXPath:
 takes as argument a pattern tree and an XML document and
outputs the corresponding witness trees for the given pattern
tree and document
 TurboXPath can be viewed as a FileScan operator
augmented with selection and projection - the selections and
projections are passed to TurboXPath through the pattern
tree

ToxinScan
ToxinScan accesses the document through a representation
of the document, the augmented ToXin tree
ToxinScan takes a pattern tree as parameter, thus it has
projection and selection embedded within
ToxinScan evaluates the pattern tree against the ToXin tree -
the results are matched ToXin trees
Matched ToXin trees (mTrees) are those parts of a ToXin tree
that satisfy the given pattern tree; their nodes are adorned
with the corresponding selection predicates from the pattern
tree

ToxinScan Optimization
The goal of ToxinScan is to evaluate the mTrees. An mTree
can be evaluated in many different ways, yielding different costs
The optimization process consists of:
 selecting the right evaluation direction
 selecting the right evaluation order

Terminology
Def: in the context of an mTree, given a node n and a set of
predicates S attached to the node n, the node selectivity
factor ‘F’ is the expected fraction of instances of the node n that
satisfy the predicate set S
Def: in the context of an mTree, assume a node p and a node c,
such that c is a child node of p. The parent selectivity of the
(child) node c is the fraction of the node p’s instances that are
selected after evaluating the path expression that stems from
the parent p and contains the (child) node c
Def: the joint cost of two path expressions that stem from
the same root is the cost of evaluating the first path using a
bottom-up evaluation plus the cost of evaluating the second
path using a top-down evaluation

ToxinScan Optimization Heuristics
The heuristics are based on a uniform distribution assumption
for node instances and employ the following properties
Property 1: in the case of a uniform distribution, for an mTree
rooted in node a with nodes b and c as children, if node b has
a lower selectivity than node c, then:
 the parent selectivity of node b is lower than the parent
selectivity of node c
 C_bac < C_cab, that is, the joint cost of evaluating b bottom-up
and then c top-down is lower than the joint cost of the reverse order
Property 2: in the case of a uniform distribution, for an mTree
rooted in node a with nodes b and c as children, if node b has
a lower parent selectivity than node c, then the cost of
evaluating c top-down is less than the cost of evaluating c
bottom-up

Algorithm for Access Order Selection
First, we sort the children according to parent
selectivity
Second, we evaluate the path with the lowest
selectivity using a bottom-up evaluation
Next, we evaluate all the other paths, in the
selectivity order, using a top-down evaluation
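
The three steps above can be sketched as follows. This is a minimal illustration, not the ToXop implementation: the `ChildPath` type, the child names and the selectivity values are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ChildPath:
    name: str
    parent_selectivity: float  # fraction of parent instances this path keeps


def access_order(children):
    """Return (path name, evaluation direction) pairs in evaluation order."""
    # Step 1: sort the children by parent selectivity, lowest first.
    ordered = sorted(children, key=lambda c: c.parent_selectivity)
    plan = []
    for position, child in enumerate(ordered):
        # Step 2: the lowest-selectivity path is evaluated bottom-up.
        # Step 3: all remaining paths are evaluated top-down, in order.
        direction = "bottom-up" if position == 0 else "top-down"
        plan.append((child.name, direction))
    return plan


plan = access_order([
    ChildPath("price", 0.40),
    ChildPath("city", 0.05),
    ChildPath("name", 0.90),
])
for name, direction in plan:
    print(name, direction)
```

Here `city` keeps the smallest fraction of parent instances, so it is evaluated first and bottom-up; `price` and `name` are then evaluated top-down against the parents that survived.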

An Example of Access Order Selection

The Road Ahead
The ToXop framework and the ToXop access method selection
are fully implemented
The next step is to implement an Execution Engine in order to
perform tests and run benchmarks
We plan to implement a back-end using structural joins in the
Timber manner and compare our baseline with the Timber
baseline. Then we will compare ToXop-optimized results with the
baseline in order to measure the speedup and compare it with the
performance reported for Timber
We plan to test ToXop on structured documents (the DBLP
collection), deeply nested data (the EBOC medical data) and the
XMark benchmark
It is our belief that Timber performs better with certain types of
documents while ToXop performs better with other types

This is not the end,
this is just the beginning!
Thank you for your attention!
