3. Introduction
What is XML
XML is a markup language much like HTML
XML was designed to describe data
XML tags are not predefined. You must define your
own tags
XML uses a Document Type Definition (DTD) or
an XML Schema to describe the data
5. Introduction
(continue)
Database management systems are
increasingly being called upon to manage
semi-structured data: data with an irregular
or changing organization
Semi-structured data is often represented as
a graph (tree structure)
Evaluating queries over semi-structured data
involves navigating paths through this
relationship structure
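A minimal sketch of such path navigation, using Python's standard `xml.etree.ElementTree` module. The `catalog` document and its tag names are hypothetical and serve only as an illustration of irregular structure.

```python
import xml.etree.ElementTree as ET

catalog = ET.fromstring(
    "<catalog>"
    "<book><title>XML Basics</title><price>30</price></book>"
    "<book><title>Indexing</title></book>"   # no <price>: irregular structure
    "</catalog>"
)

# A path query walks the tree from the root: catalog -> book -> title.
titles = [t.text for t in catalog.findall("./book/title")]

# Only books that actually contain a <price> child are matched.
priced = [b.findtext("title") for b in catalog.findall("./book[price]")]
```

Without an index, each such query has to traverse the tree, which is the cost the indexing techniques in the following slides try to reduce.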
10. Background
Position-based indexing
Queries are processed by manipulating the
ranges of offsets of words, elements, or
attributes.
Path-based indexing
The locations of words are expressed in terms
of structural elements, and the paths in the
tree structure are used to process queries.
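The difference between the two approaches can be sketched roughly as follows. The sample offsets, element names, and helper names (`words_in_element`, `paths_containing`) are hypothetical and chosen only for illustration; real systems use far more compact structures.

```python
from collections import defaultdict

# Position-based: every word and element is recorded by offsets
# into the document text (hypothetical sample values).
element_ranges = {"title": (0, 20), "body": (21, 80)}   # element -> (start, end)
word_offsets = {"xml": [5, 40], "index": [12, 60]}       # word -> offsets

def words_in_element(word, element):
    """Return offsets of `word` that fall inside the element's offset range."""
    start, end = element_ranges[element]
    return [pos for pos in word_offsets.get(word, []) if start <= pos <= end]

# Path-based: every word is recorded under the path of the element holding it.
path_index = defaultdict(set)
for path, text in [("/doc/title", "xml index"), ("/doc/body", "xml data")]:
    for word in text.split():
        path_index[word].add(path)

def paths_containing(word, prefix="/"):
    """Return the paths under `prefix` whose text contains `word`."""
    return sorted(p for p in path_index.get(word, ()) if p.startswith(prefix))
```

In the first case a containment query is a comparison of offset ranges; in the second it is a match on path prefixes.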
11. Background
(continue)
BitCube: a three-dimensional indexing
for XML documents
According to this technique, documents
can be represented hierarchically by
their XML elements. XML documents are
represented and indexed in three dimensions.
12. Background
(continue)
Content and Structure in indexing and
ranking XML
Index structures with ranking support are
needed for fast access to relevant parts of
large document collections. An analysis
reveals that ranking parameters related to
both the content and the structure of data
are poorly supported by most known XML
indexes.
13. Background
(continue)
Ctree
It provides an indexing structure that is based
on two levels: a path summary and detailed
element-level relationships. The first one, the
path summary, is a tree that is
abstracted from the original data.
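The two levels can be pictured with a small sketch. This is not Ctree's actual data structure; it only assumes that the path summary groups elements by their distinct label path and that the element level records each element's parent. The function name `build_levels` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

def build_levels(root):
    """Return (path_summary, parent_of) for an ElementTree root.

    path_summary: distinct label path -> list of element ids on that path
    parent_of:    element id -> parent element id (None for the root)
    """
    path_summary, parent_of = {}, {}
    next_id = 0
    stack = [(root, "", None)]
    while stack:
        elem, parent_path, parent_id = stack.pop()
        elem_id = next_id
        next_id += 1
        path = f"{parent_path}/{elem.tag}"
        path_summary.setdefault(path, []).append(elem_id)
        parent_of[elem_id] = parent_id
        # Push children in reverse so they are visited in document order.
        for child in reversed(list(elem)):
            stack.append((child, path, elem_id))
    return path_summary, parent_of

doc = ET.fromstring("<lib><book><title/></book><book><title/><year/></book></lib>")
summary, parents = build_levels(doc)
```

The summary has one entry per distinct path, however many elements share it, while `parents` keeps the detailed relationships between individual elements.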
14. Background
(continue)
Indexing for XML Siblings:
Given the importance of XPath-based query access,
Grust proposed an R-tree index, referred to here as a
whole-tree index (WI). Such an index, however, has a very
high cost for the following-sibling and preceding-sibling
axes. To address this problem, this method develops a
family of index structures, referred to as split-tree
indexes (SI), in which (i) the XML data is horizontally
split by a simple yet efficient criterion, and (ii) the
split value is associated with the tree labeling.
15. Background
(continue)
High-performance XML Storage/Retrieval
System.
The basic idea of this technique is to allocate a field
ID to each text data item of an XML element and to
register it in the structure index and the text index. The
structure index manages the hierarchical structure of
each field, and the text index manages the field IDs
and document IDs in which query words appear. The
structure index is one big data tree that represents
the overlapped structure of the documents.
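A rough sketch of the two indexes, under simplifying assumptions: element paths stand in for the structure tree, field IDs are assigned in order of first appearance, and the names `structure_index`, `text_index`, `register`, and `search` are hypothetical rather than taken from the system described.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

structure_index = {}            # element path -> field ID (shared by all documents)
text_index = defaultdict(set)   # word -> {(field ID, document ID)}

def register(doc_id, xml_text):
    """Index one document: assign field IDs to paths, then post each word."""
    def walk(elem, parent_path):
        path = f"{parent_path}/{elem.tag}"
        text = (elem.text or "").strip()
        if text:
            # Reuse the field ID if another document already has this path.
            field_id = structure_index.setdefault(path, len(structure_index))
            for word in text.lower().split():
                text_index[word].add((field_id, doc_id))
        for child in elem:
            walk(child, path)
    walk(ET.fromstring(xml_text), "")

def search(word, path):
    """Return IDs of documents where `word` occurs in the field at `path`."""
    field_id = structure_index.get(path)
    return sorted(d for f, d in text_index.get(word.lower(), ()) if f == field_id)

register(1, "<book><title>XML indexing</title><author>Lee</author></book>")
register(2, "<book><title>Database systems</title><note>XML</note></book>")
```

Because both documents share the path `/book/title`, they share one field ID, which is how a single structure index can cover the overlapped structure of many documents.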
16. Background
(continue)
Indexing documents for queries on
structure, content, and attributes
It explains position-based indexing and
path-based indexing for accessing XML
documents by content, structure, or
attributes.
17. Background
(continue)
Extensible index technique
An extensible index technique is proposed to
express position information between nodes
in an XML document. It is an efficient index
technique that simplifies the comparison
applied to a search query and minimizes the
reconstruction of the index structure caused
by update operations. In addition, an
extensible index technique with deferred
update is proposed.
18. Problem Statement
Support of element addressing
Index size becomes very large
Doc. ID should include NodeId (XPath) + offset
XPaths are long
Support of typed data
Integer, float, simple types of XML schema
Requires classical indexes for certain
elements
20. Problem Statement
(continue)
Evaluation criteria
Keyword search
Per node or per document
Descendant/ancestor search
By element scan
By join algorithm
By graph traversal
By OID comparison
By B-tree traversal
Identifiers
Update
Incremental
Index size
Entry number
Entry size
21. Problem Statement
(continue)
When an indexing structure uses absolute
addresses to pinpoint where data resides,
an update causes a re-computation.
If the update frequency is high, the cost of
reconstruction is unbearable.
Support for updating the indexes is not
considered in most of the indexing
techniques.
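The re-computation problem can be illustrated with a small sketch. It assumes, purely for illustration, that each node is labeled with its consecutive position in document order; the function name and the sample labels are hypothetical.

```python
def insert_with_absolute_positions(labels, position, name):
    """Insert `name` at `position`; return the names whose label had to change."""
    # Every node at or after the insertion point must shift by one.
    changed = [n for n, pos in labels.items() if pos >= position]
    for n in changed:
        labels[n] += 1
    labels[name] = position
    return changed

labels = {"book": 1, "title": 2, "chapter": 3, "section": 4}
changed = insert_with_absolute_positions(labels, 2, "author")
```

A single insertion near the start of the document relabels almost every node, and each relabeled node means an index entry to rewrite.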
22. Problem Statement
(continue)
Updates are an issue in any such
labeling scheme. It is conceivable that a
complete re-labeling could be required
for each update.
The existing techniques do not support
the storage of multiple documents at
the same time.
23. Proposed Technique
An XML document instance is a plain-text file that
uses markup delimiters (tags) to define the logical
structure of a document in a hierarchical fashion.
Robert Korfhage proposed three purposes of indexing
in IR, which can best take advantage of structured
documents.
To permit easy location of documents by topic;
To define topic areas and hence relate one document
to another;
To predict the relevance of a given document to a given
information need.
24. Proposed Technique
(continue)
The current structured query and indexing
models for XML have not fulfilled these
requirements.
The ideal system seems to be one that will
provide efficient and comprehensive indexing
of document content and structure, and be
able to predict the degree of relevance that
each matching document has to a
particular query.
25. Proposed Technique
(continue)
There is a node corresponding to each
element, with child nodes for sub-elements.
However, all attributes of an
element node are grouped together into
a single node, which is then stored as a
child node of that element node.
The content of an element node, if any,
is pulled out into a separate child node.
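A minimal sketch of this node model, built on Python's `xml.etree.ElementTree`. The `Node` class, its `kind` values, and the sample document are hypothetical names introduced for illustration; mixed content is left out of this sketch.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                 # "element", "attributes", or "content"
    name: str = ""
    value: object = None
    children: list = field(default_factory=list)

def build(elem):
    """Convert an ElementTree element into the node model described above."""
    node = Node("element", elem.tag)
    if elem.attrib:
        # All attributes share one child node.
        node.children.append(Node("attributes", value=dict(elem.attrib)))
    text = (elem.text or "").strip()
    if text:
        # Element content becomes its own child node.
        node.children.append(Node("content", value=text))
    for child in elem:
        node.children.append(build(child))
    return node

tree = build(ET.fromstring('<book id="b1" lang="en"><title>XML</title></book>'))
```

Here `<book>` gets one attributes child holding both `id` and `lang`, and `<title>` gets one content child holding `XML`.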
27. Proposed Technique
(continue)
S1 and S2 are start labels, E1 and E2
are end labels, and L1 and L2 are level
labels in these formulae.
We address the update issue by leaving
gaps between successive label values.
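The formulae themselves are not reproduced in this text, so the sketch below assumes the usual region-encoding tests: node 1 is an ancestor of node 2 when S1 < S2 and E2 < E1, and its parent when, in addition, L2 = L1 + 1. The gap size `GAP`, the `Label` class, and the sample tree are hypothetical.

```python
from dataclasses import dataclass

GAP = 10  # hypothetical spacing between successive label values

@dataclass
class Label:
    start: int
    end: int
    level: int

def is_ancestor(a, d):
    """a is an ancestor of d if a's interval strictly contains d's."""
    return a.start < d.start and d.end < a.end

def is_parent(p, c):
    """p is the parent of c if it is an ancestor exactly one level up."""
    return is_ancestor(p, c) and c.level == p.level + 1

def label_tree(tree, level=0, counter=None):
    """Label a nested (name, [children]) tuple in one depth-first pass."""
    counter = counter if counter is not None else [0]
    counter[0] += GAP
    name, children = tree
    label = Label(counter[0], 0, level)
    labels = {name: label}
    for child in children:
        labels.update(label_tree(child, level + 1, counter))
    counter[0] += GAP
    label.end = counter[0]
    return labels

labels = label_tree(("book", [("title", []), ("chapter", [("section", [])])]))
# A new node inserted between <title> and <chapter> can take unused values
# from the gap, so no existing label has to change.
labels["author"] = Label(32, 38, 1)
```

The gaps only postpone relabeling: once a gap is used up, the labels around it have to be reassigned.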
28. Results and discussions
(continue)
System architecture
Data Parser
The Data Parser takes an XML document as
input and produces a parse tree as output.
The Data Manager takes each node of the tree,
marks its indices, and stores it in the
database.
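The flow from parser to database could look roughly like the sketch below. It uses Python's `xml.etree.ElementTree` and an in-memory SQLite database as stand-ins; the table layout, column names, and consecutive label values are assumptions made for illustration, not the system's actual schema.

```python
import sqlite3
import xml.etree.ElementTree as ET

def parse(xml_text):
    """Data Parser: XML text in, parse tree out."""
    return ET.fromstring(xml_text)

def store(conn, doc_id, root):
    """Data Manager: label every node and store it as one row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node "
        "(doc INTEGER, tag TEXT, start_label INTEGER, end_label INTEGER, level INTEGER)"
    )
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1
        start = counter
        for child in elem:
            visit(child, level + 1)
        counter += 1
        conn.execute("INSERT INTO node VALUES (?, ?, ?, ?, ?)",
                     (doc_id, elem.tag, start, counter, level))

    visit(root, 0)
    conn.commit()

conn = sqlite3.connect(":memory:")
store(conn, 1, parse("<book><title/><chapter><section/></chapter></book>"))
rows = conn.execute(
    "SELECT tag, start_label, end_label, level FROM node ORDER BY start_label"
).fetchall()
```

Storing the document number in every row is what lets several documents share one table.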
29. Results and discussions
(continue)
If the node is of mixed type, with
multiple content parts interspersed with
sub-elements, each content part is
pulled out into a separate child node.
All processing instructions, comments,
and such are simply ignored
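The handling of mixed content can be sketched as follows, again with `xml.etree.ElementTree`. The function name `content_parts` and the sample paragraph are hypothetical; the sketch relies on the parser's default behavior of discarding comments and processing instructions.

```python
import xml.etree.ElementTree as ET

def content_parts(elem):
    """List an element's children in document order, with each text run
    of mixed content as its own ("content", text) entry."""
    parts = []
    if elem.text and elem.text.strip():
        parts.append(("content", elem.text.strip()))
    for child in elem:
        parts.append(("element", child.tag))
        # Text following a sub-element is stored in that child's `tail`.
        if child.tail and child.tail.strip():
            parts.append(("content", child.tail.strip()))
    return parts

# ET.fromstring drops comments and processing instructions by default.
para = ET.fromstring("<p>Index <b>XML</b> data <!-- note --><i>fast</i> now</p>")
parts = content_parts(para)
```

Each text run between sub-elements ends up as a separate entry, and the comment leaves no trace in the result.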
30. Conclusion and future directions
Reconstruction of the index file due to a partial update is
a problem that XML database applications inevitably
have to face.
We have developed an indexing system that
is based on two indexing techniques, the
extensible index technique and relative
region coordinate based indexing of XML
documents, combined with our own proposed
scheme, which assigns a level number to each
node of an XML document and a document
number to each document.
31. Conclusion & future directions
(continue)
The costly update of the index structure
is successfully avoided, as the index
structure remains unaffected after new
nodes are added.
Parent-child and ancestor-descendant
relationships can be found easily for
efficient retrieval.
32. Conclusion & future directions
(continue)
All processing instructions, comments,
and such are currently simply ignored. In
the future, yet another child node of the
element node could be created to hold all
such data.
An index that is efficient for both
update and retrieval may not be available.
33. Conclusion & future directions
(continue)
One alternative is to build two
separate indices, one that is suitable
when updates are frequent and another
that is better at query processing.
In this case, a transformation
mechanism between the indexing
structures needs to be developed.