Extracting Data from XML

Document Models
• Representation in Memory
– XML is a text-based way to represent documents
– Once read into memory, an XML document is represented
as a tree
• DOM (Document Object Model)
– Originally designed for handling HTML in browsers
– The XML DOM is a separate specification, but it is supported
by all modern browsers and a host of other applications
and libraries
• XDM (XQuery and XPath Data Model)
– more powerful than the DOM; includes support for objects
with types described by W3C XML Schema

A Sample DOM Tree
<entry id="armstrong-john">
  <title>Armstrong, John</title>
  <body>
    <p>, an English physician and poet, was born
    in <born>1715</born> in the parish of Castleton in
    Roxburghshire, where his father and brother were
    clergymen; and having completed his education at the
    University of Edinburgh, took his degree in physics, Feb.
    4, 1732, with much reputation.
    </p>
  </body>
</entry>

[Figure: the sample entry drawn as a DOM tree, labelled with Element Nodes, Text Nodes, and Attribute Properties]

DOM Node Types
• The in-memory representation of any XML item in a DOM
tree, such as an element, an attribute, or a piece of text,
is called a node
– Document Node: represents the entire document
– DocumentFragment Node: used for holding part of a
document, such as a buffer for copy and paste
– Element Node: represents a single XML element and its
contents
– Attr (Attribute) Node: represents a single attribute with its
name, value, and possibly, type information
– Text node: the textual content of an element
– DocumentType, CDATASection, Notation, Entity, and
Comment Nodes: These are for more advanced use of the
DOM.
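
The node types above can be inspected from a program. The following is a minimal sketch using Python's standard xml.dom.minidom module, which is only one of many DOM implementations. The abridged sample text, the <p> wrapper, and the helper name kind are assumptions made for illustration.

```python
from xml.dom import Node, minidom

# Abridged version of the sample entry. The <p> wrapper is an assumption
# taken from the XPath examples, which address /entry/body/p/born.
SAMPLE = (
    '<entry id="armstrong-john">'
    '<title>Armstrong, John</title>'
    '<body><p>an English physician and poet, was born in '
    '<born>1715</born> in the parish of Castleton</p></body>'
    '</entry>'
)

# Map the numeric nodeType constants to the names used on the slide.
NODE_TYPE_NAMES = {
    Node.DOCUMENT_NODE: "Document",
    Node.ELEMENT_NODE: "Element",
    Node.ATTRIBUTE_NODE: "Attr",
    Node.TEXT_NODE: "Text",
}


def kind(node):
    """Return a readable name for a DOM node's type (hypothetical helper)."""
    return NODE_TYPE_NAMES.get(node.nodeType, "other")


doc = minidom.parseString(SAMPLE)        # Document node: the whole document
entry = doc.documentElement              # Element node: <entry>
id_attr = entry.getAttributeNode("id")   # Attr node: id="armstrong-john"
title_text = entry.getElementsByTagName("title")[0].firstChild  # Text node

kinds = [kind(doc), kind(entry), kind(id_attr), kind(title_text)]
print(kinds)

# Mixed content: the children of <p> alternate between text and elements.
p = entry.getElementsByTagName("p")[0]
p_child_kinds = [kind(child) for child in p.childNodes]
print(p_child_kinds)
```

Note how the <p> element holds three children: a text node, the <born> element, and another text node.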

DOM Node Lists
• When a program accesses parts of a DOM tree,
it gets back a node list
• A node list is simply a list of nodes
• You can iterate through the list one item at a
time
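// JavaScript; copy the live node list first so removing nodes does not disturb the loop
for (const e of Array.from(document.getElementsByTagName("p"))) {
    if (e.getAttribute("role") === "introduction") {
        e.parentNode.removeChild(e);
    }
}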
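
The same loop can be tried outside a browser. This is a sketch using Python's standard xml.dom.minidom module; the small input document is invented for illustration.

```python
from xml.dom import minidom

# Hypothetical input: two introductory paragraphs among ordinary ones.
doc = minidom.parseString(
    '<doc>'
    '<p role="introduction">Welcome.</p>'
    '<p>First point.</p>'
    '<p role="introduction">Another intro.</p>'
    '<p>Second point.</p>'
    '</doc>'
)

# Iterate over a copy of the node list, so that removing nodes from the
# tree cannot affect the list being walked.
for p in list(doc.getElementsByTagName("p")):
    if p.getAttribute("role") == "introduction":
        p.parentNode.removeChild(p)

# Collect the text of the paragraphs that are left.
remaining = [p.firstChild.data for p in doc.getElementsByTagName("p")]
print(remaining)
```

Removal goes through the parent node because a DOM node is detached from the tree by its parent, not by the document object.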

XPath Language
• The XML Path Language, XPath, is used to point into XML
documents and select parts of them for later use
• Designed to be embedded in and used by other languages,
such as XSLT, XQuery, XLink, XPointer, and XML Schema, and called from
Python, PHP, Perl, C, C++, JavaScript, Java, Scala, etc.

XPath Basics
• How to use?
– pass an XPath expression and one or more XML documents to
an XPath engine
– the XPath engine evaluates the expression and gives you back
the result
– Can do it by including XPath in another language such as
XQuery or XSLT
• The result of evaluating an XPath expression is a node list
• Example: how do you get at John Armstrong's date of birth?
/entry/body/p/born
If there were a whole book with lots of entries, and you just
wanted the one for John Armstrong, then
/book/entry[@id = "armstrong-john"]/body/p/born
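
Passing an expression to an engine can be sketched with Python's standard xml.etree.ElementTree module, which supports only a subset of XPath. Its paths are evaluated relative to the element they are called on, so the leading /book step is implied here. The second entry and its values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A hypothetical "book" with two entries; only the first comes from the slides.
BOOK = """
<book>
  <entry id="armstrong-john">
    <title>Armstrong, John</title>
    <body><p>was born in <born>1715</born> in Castleton</p></body>
  </entry>
  <entry id="example-person">
    <title>Person, Example</title>
    <body><p>was born in <born>1800</born></p></body>
  </entry>
</book>
"""

book = ET.fromstring(BOOK)  # the root element, <book>

# Counterpart of /book/entry[@id = "armstrong-john"]/body/p/born,
# written relative to the <book> element.
born = book.find("entry[@id='armstrong-john']/body/p/born")
print(born.text)

# Without the predicate, the path selects a <born> element in every entry.
all_born = [b.text for b in book.findall("entry/body/p/born")]
print(all_born)
```

A full XPath engine would also accept the absolute form of the expression exactly as written on the slide.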

Understanding Context
• You can “navigate” through the document and evaluate an
XPath expression against any node in the tree
• When an environment enables navigation and
evaluation, the node you evaluate against becomes the context item
• The XPath context item can be set in three places
– explicitly, by setting the initial context to the document node using /
– before evaluation, by the language or environment embedding
XPath (for example with xsl:for-each and xsl:apply-templates)
– inside a predicate, where XPath itself sets the context
(example: [@id = "armstrong-john"])
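
The effect of the context item can be sketched in Python with the standard xml.etree.ElementTree module, where the host language sets the context by choosing which element a relative path is evaluated against (roughly what xsl:for-each does in XSLT). The document and the name RELATIVE_PATH are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-entry book, reduced to the elements the path needs.
book = ET.fromstring(
    '<book>'
    '<entry id="armstrong-john"><body><p><born>1715</born></p></body></entry>'
    '<entry id="example-person"><body><p><born>1800</born></p></body></entry>'
    '</book>'
)

RELATIVE_PATH = "body/p/born"

# The host language sets the context: each <entry> in turn becomes the
# element against which the same relative path is evaluated.
births = {}
for entry in book.findall("entry"):
    births[entry.get("id")] = entry.find(RELATIVE_PATH).text
print(births)

# With <book> as the context, the same path selects nothing, because
# <book> has no <body> child.
from_book = book.find(RELATIVE_PATH)
print(from_book)
```

The expression text never changes; only the node it is evaluated against does.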

XPath Node Types and Node Tests
• XML documents can also contain node types such as
processing instructions and comments
• The following table lists all the different node tests we
can use in XPath
Node Test                 Node Types Matched
node()                    Any node at all.
text()                    A text node.
processing-instruction()  A processing instruction.
comment()                 A comment node (if the XML processor didn’t remove them
                          from the input!).
prefix:name               This is called a QName, and matches any element with the
                          same namespace URI as that to which the prefix is bound
                          and the same “local name” as name.
                          Examples: svg:circle, html:div
name                      An element with the given name (entry, body, and so on).
@attr                     An attribute of the given name (id, href, and so on).
*                         Any element.
element(name, type)       An element of the given name (use * for any), and of the
                          given schema type, for example, xs:decimal.
                          Examples: entry/p/element(*, xs:gYear) finds any
                          element declared with XSD to have content of type
                          xs:gYear; entry/p/element(born, *) is essentially the
                          same as entry/p/born.
attribute(name, type)     Same as element() but for attributes.
• You can use any of these node tests in a path
• For example, /entry/title/text() would match the text node inside the <title>
element
• The parentheses in text() are needed to show it’s a test for a text node, and not
a search for an element called text.
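
A few of these node tests can be tried with Python's standard xml.etree.ElementTree module. This is a sketch under the assumption that a subset is enough: the module's paths return elements only, so text and attributes are read through .text and .get() rather than through text() or @attr. The sample document and the prefixes h and s are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document mixing two namespaces.
DOC = """
<html:div xmlns:html="http://www.w3.org/1999/xhtml"
          xmlns:svg="http://www.w3.org/2000/svg">
  <html:p id="intro">A circle:</html:p>
  <svg:circle r="5"/>
  <svg:rect width="2" height="3"/>
</html:div>
"""

root = ET.fromstring(DOC)

# Prefixes are bound to namespace URIs outside the expression. They need
# not match the prefixes used in the document; only the URIs matter.
NS = {
    "h": "http://www.w3.org/1999/xhtml",
    "s": "http://www.w3.org/2000/svg",
}

circles = root.findall("s:circle", NS)       # prefix:name test
children = root.findall("*")                 # * matches any child element
intro = root.find("h:p[@id='intro']", NS)    # name test plus attribute predicate

print(len(circles), len(children))
print(intro.text)        # stands in for .../text()
print(intro.get("id"))   # stands in for .../@id
```

Because matching is done on the namespace URI and local name, the path prefix h finds elements that the document itself writes with the prefix html.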

XPath Predicates
• You can apply a predicate to any node list; the result is those
nodes for which the predicate evaluates to true
• Predicates can be combined, as in
/book/chapter[position() = 2]/p[@class = 'footnote'][position() = 1]
– finds all <chapter> elements in a book, then uses a predicate to pick
out only the second chapter
– then finds all the <p> children of that chapter, and uses a predicate
to filter out all except those elements that have a class attribute
with the value footnote
– finally, it uses another predicate to choose only the first node in the
list, i.e. the first footnote paragraph
• The expression inside the predicate can actually be any XPath
expression
/entries/entry[body/p/born = /entries/entry/@born]
– finds all the entry elements that contain a born element whose
value is equal to the born attribute on some entry element
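
The step-by-step filtering of combined predicates can be sketched in Python with the standard xml.etree.ElementTree module. Its documentation requires a position predicate to follow a tag name directly, so in this sketch the final [position() = 1] is applied in Python to the already filtered list. The sample book is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical book: the second chapter has one ordinary paragraph
# followed by two footnote paragraphs.
book = ET.fromstring(
    '<book>'
    '<chapter><p class="footnote">c1 note</p></chapter>'
    '<chapter>'
    '<p>body text</p>'
    '<p class="footnote">c2 first note</p>'
    '<p class="footnote">c2 second note</p>'
    '</chapter>'
    '</book>'
)

# Step 1: /book/chapter[position() = 2]
second_chapter = book.find("chapter[2]")

# Step 2: .../p[@class = 'footnote'] keeps only the footnote paragraphs.
footnotes = second_chapter.findall("p[@class='footnote']")

# Step 3: ...[position() = 1] picks the first node of the filtered list.
first_footnote = footnotes[0]
print(first_footnote.text)
```

The first footnote is the second <p> in its chapter, which shows that each predicate counts positions within the list left by the previous one.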

• Positional Predicates
– the simplest predicate is just a number, like [17], which selects only the seventeenth
node
/book/chapter[2] selects the second chapter and is the same as
/book/chapter[position() = 2]
– a positional predicate is shorthand for a filter with a Boolean expression inside
• The Context in Predicates
– Every XPath expression is evaluated in a context
– Both the step (/) and the predicate change the context
– In the context of a <def> element in any document (current() is supplied by XSLT, not by XPath itself)
//*[@use = current()/@id]
finds every element with an attribute called use whose value is equal to
the id attribute on the current <def> element
//*[@use = @id]
finds every element having use and id attributes with the same value
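
The difference between the two expressions can be sketched in Python. The standard xml.etree.ElementTree module has no current() function and cannot compare two attributes inside a predicate, so this sketch keeps the outer context in an ordinary variable and does the second comparison in Python. The document and the variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with definitions, references, and one element
# whose use and id attributes happen to be equal.
doc = ET.fromstring(
    '<doc>'
    '<def id="a"/>'
    '<def id="b"/>'
    '<ref use="a"/>'
    '<ref use="b"/>'
    '<ref use="a"/>'
    '<item id="x" use="x"/>'
    '</doc>'
)

# The <def> element that plays the role of current() in XSLT.
current = doc.find("def[@id='a']")
current_id = current.get("id")

# Counterpart of //*[@use = current()/@id]: the id comes from the outer
# context, not from the element being tested.
uses_of_current = doc.findall(f".//*[@use='{current_id}']")
print([e.tag for e in uses_of_current])

# Counterpart of //*[@use = @id]: both attributes are read from the
# element being tested, because the predicate changed the context.
self_matching = [
    e for e in doc.iter()
    if e.get("use") is not None and e.get("use") == e.get("id")
]
print([e.tag for e in self_matching])
```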

XPath Steps and Axes
• An XPath axis is a direction, and a step moves you along the
axis
Shorthand  Full Name      Meaning
name       child::        The default axis; /a/child::b is the same as /a/b and
                          matches <b> elements that are children of <a>.
//         descendant::   descendant::born is true if there’s at least one
                          born element anywhere in the tree beneath the
                          context node.
                          a//b matches all b elements anywhere under a in
                          the tree; a leading // searches the whole document.
                          (Strictly, // abbreviates /descendant-or-self::node()/.)
@          attribute::    Matches an attribute of the context node, with the
                          given name; for example, @href.
.          self::         Matches the context node. For example, self::p is
                          true if the context node is an element named “p.”
..         parent::       The parent of the current node.
(none)     following::    Elements anywhere in the document after the
                          current node.
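
Several of these axes can be tried with Python's standard xml.etree.ElementTree module. This is a sketch limited to the axes that module's XPath subset offers (child, descendant via //, self via ., parent via ..); it has no following:: axis. The tiny document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical tree: one <b> directly under <a>, one nested inside <c>.
root = ET.fromstring(
    '<a>'
    '<b href="one.html">first</b>'
    '<c><b href="two.html">second</b></c>'
    '</a>'
)

# child:: (the default axis): only <b> elements directly under <a>.
child_b = [b.text for b in root.findall("b")]

# Descendants via //: <b> elements anywhere beneath <a>.
descendant_b = [b.text for b in root.findall(".//b")]

# attribute:: is reached through get() in this library.
hrefs = [b.get("href") for b in root.findall(".//b")]

# parent:: via "..": the parent of each <b>, in document order.
parents = [p.tag for p in root.findall(".//b/..")]

# self:: via ".": the context element itself.
self_elem = root.find(".")

print(child_b, descendant_b, hrefs, parents, self_elem.tag)
```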

Topics and Key Points
• How XML is stored in memory
– XML is usually stored in trees, using DOM, XDM, or some
other data model
• What is XPath?
– XPath is an expression language used primarily for
finding items in XML trees.
• Is XPath a programming language?
– Although XPath is a complete language, it is designed to
be “hosted” in another environment, such as XSLT, a
Web browser, XQuery, or Java.
• XPath and Namespaces
– You generally have to bind a prefix to a namespace URI
outside of XPath and use expressions like
/h:html/h:body/h:div to match elements with an
associated namespace
• Can XPath change the document, or return elements without
their children, or make new elements?
– No. Use XQuery or XSLT for that
• When should I program with the DOM?
– The DOM API is low-level; use XPath, XQuery, or XSLT in
preference to direct access of the DOM.
