3. Semistructured Data
Another data model, based on trees.
Motivation: flexible representation of data.
◦ Often, data comes from multiple sources
with differences in notation, meaning, etc.
Motivation: sharing of documents among
systems and databases.
4. Graphs of Semistructured Data
Nodes = objects.
Labels on arcs (attributes, relationships).
Atomic values at leaf nodes (nodes with no
arcs out).
Flexibility: no restriction on:
◦ Labels out of a node.
◦ Number of successors with a given label.
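As a rough illustration of the graph model above, a node can be sketched as either an atomic value (a leaf) or a list of labeled arcs. The data is borrowed from the bars example used later in these slides; the helper names `successors` and `is_leaf` are made up for this sketch.

```python
# A node is either an atomic value (a leaf) or a list of (label, child) arcs.
# A list of pairs, rather than a dict, lets one node have several successors
# with the same label, which the model explicitly allows.
joes_bar = [
    ("name", "Joe's Bar"),
    ("beer", [("name", "Bud"), ("price", 2.50)]),
    ("beer", [("name", "Miller"), ("price", 3.00)]),
]


def is_leaf(node):
    """A leaf holds an atomic value and has no arcs out."""
    return not isinstance(node, list)


def successors(node, label):
    """Return every child reached from node by an arc with the given label."""
    if is_leaf(node):
        return []
    return [child for arc_label, child in node if arc_label == label]


beers = successors(joes_bar, "beer")
beer_names = [successors(beer, "name")[0] for beer in beers]
```

Nothing constrains which labels leave a node or how many times a label repeats, which is the flexibility the slide describes.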
5.
6.
7. XML
XML = Extensible Markup Language.
While HTML uses tags for formatting
(e.g., “italic”), XML uses tags for
semantics (e.g., “this is an address”).
Key idea: create tag sets for a domain
(e.g., genomics), and translate all data into
properly tagged XML documents.
8. HTML and XML
XML stands for Extensible Markup Language.
HTML is used to mark up text so it can be displayed to users; XML is used to mark up data so it can be processed by computers.
HTML describes both structure (e.g., <p>, <h2>, <tr>, <td>) and appearance (e.g., <br>, <font>, <i>); XML describes only content, or “meaning”.
HTML uses a fixed, unchangeable set of tags; in XML, you make up your own tags.
9. HTML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
Abiteboul, Buneman, Suciu
<br> Morgan Kaufmann, 1999
10. XML
<bibliography>
<book> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
…
</bibliography>
XML describes the content
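Because the tags name the content, a program can pull out fields by name instead of guessing from layout. Here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. It contains only the one book shown on the slide; the elided entries are left out and the truncated title is kept as written.

```python
import xml.etree.ElementTree as ET

# The single <book> entry from the slide; the elided books are not reconstructed.
doc = """<bibliography>
  <book> <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
</bibliography>"""

root = ET.fromstring(doc)
book = root.find("book")

# Select fields by tag name; strip the padding spaces used on the slide.
authors = [author.text.strip() for author in book.findall("author")]
publisher = book.findtext("publisher").strip()
year = int(book.findtext("year"))
```

The same query against the HTML version would have to rely on the position of <i> and <br> tags, which say nothing about which text is an author and which is a publisher.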
11.
12. Well-Formed and Valid XML
Well-Formed XML allows you to invent
your own tags.
◦ Similar to labels in semistructured data.
Valid XML involves a DTD (Document
Type Definition), a grammar for tags.
13. Well-Formed XML
Start the document with a declaration,
surrounded by <?xml … ?> .
Normal declaration is:
<?xml version = "1.0" standalone = "yes" ?>
◦ “Standalone” = “no DTD provided.”
Balance of document is a root tag
surrounding nested tags.
14. Tags
Tags, as in HTML, are normally matched
pairs, as <FOO> … </FOO> .
Tags may be nested arbitrarily.
XML tags are case sensitive.
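These rules can be checked mechanically. A small sketch, assuming Python's standard `xml.etree.ElementTree` parser, which rejects any input that is not well-formed; the helper name `is_well_formed` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


nested_ok = is_well_formed("<FOO><BAR>x</BAR></FOO>")    # matched, properly nested
case_mismatch = is_well_formed("<FOO>x</foo>")           # FOO and foo are different tags
bad_nesting = is_well_formed("<FOO><BAR>x</FOO></BAR>")  # closing tags out of order
```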
15. XML and Semistructured Data
Well-Formed XML with nested tags is
exactly the same idea as trees of
semistructured data.
We shall see that XML also enables
non-tree structures, as does the
semistructured data model.
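16. Example
The <BARS> XML document is:
[Figure: tree of the BARS document. The root BARS has BAR children. The first BAR has a NAME (Joe’s Bar) and two BEER children, one with NAME Bud and PRICE 2.50, the other with NAME Miller and PRICE 3.00. The remaining BAR nodes are elided (. . .).]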
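The bars data named in this example (Joe’s Bar selling Bud at 2.50 and Miller at 3.00) can be written out as a well-formed document and read back as a tree. The sketch below assumes the tag names BARS, BAR, BEER, NAME, and PRICE, and includes only the one bar whose details are given. It uses Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Only Joe's Bar is spelled out in the example; other bars are not invented here.
bars_xml = """<?xml version="1.0" standalone="yes" ?>
<BARS>
  <BAR><NAME>Joe's Bar</NAME>
    <BEER><NAME>Bud</NAME><PRICE>2.50</PRICE></BEER>
    <BEER><NAME>Miller</NAME><PRICE>3.00</PRICE></BEER>
  </BAR>
</BARS>"""

root = ET.fromstring(bars_xml)
bar = root.find("BAR")
bar_name = bar.findtext("NAME")

# Walk the BEER subtrees and collect name -> price.
prices = {
    beer.findtext("NAME"): float(beer.findtext("PRICE"))
    for beer in bar.findall("BEER")
}
```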
17. XML Hierarchical (Tree) Data Model (contd.)
The basic object in XML is the XML
document.
There are two main structuring concepts
that are used to construct an XML
document:
◦ Elements
◦ Attributes
Attributes in XML provide additional
information that describes elements.
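To make the distinction concrete, here is a hypothetical variant of a bibliography entry in which the year and publisher are attributes of the element rather than nested elements. The attribute names are assumptions made for this sketch, which uses Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: year and publisher describe the book as attributes,
# while the title remains a nested element.
book = ET.fromstring(
    '<book year="1995" publisher="Addison Wesley">'
    "<title>Foundations of Databases</title>"
    "</book>"
)

attributes = dict(book.attrib)             # name/value pairs on the start tag
year = book.get("year")                    # attribute values are always strings
title = book.findtext("title")             # element content is reached via the tree
child_tags = [child.tag for child in book]
```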
18. XML Hierarchical (Tree) Data Model (contd.)
As in HTML, elements are identified in a document by
their start tag and end tag.
◦ The tag names are enclosed between angle brackets
<…>, and end tags are further identified by a
slash </…>.
Complex elements are constructed from other elements
hierarchically, whereas simple elements contain data
values.
It is straightforward to see the correspondence between
the XML textual representation and the tree structure.
◦ In the tree representation, internal nodes represent
complex elements, whereas leaf nodes represent
simple elements.
◦ That is why the XML model is called a tree model or
a hierarchical model.
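The correspondence between text and tree can be sketched in a few lines: an element with child elements is complex (an internal node), and an element with only text is simple (a leaf). The function name `describe` is made up for this illustration, which uses Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET


def describe(element, depth=0):
    """Return one indented line per element, marking it complex or simple."""
    kind = "complex" if len(element) > 0 else "simple"
    lines = ["  " * depth + f"{element.tag} ({kind})"]
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines


root = ET.fromstring("<BEER><NAME>Bud</NAME><PRICE>2.50</PRICE></BEER>")
outline = describe(root)
# outline holds: "BEER (complex)", "  NAME (simple)", "  PRICE (simple)"
```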
19. XML Hierarchical (Tree) Data Model (contd.)
It is possible to characterize three main types of XML documents:
1. Data-centric XML documents:
These documents have many small data items that follow
a specific structure, and hence may be extracted from a
structured database. They are formatted as XML
documents in order to exchange them or display them
over the Web.
2. Document-centric XML documents:
These are documents with large amounts of text, such as
news articles or books. There are few or no structured data
elements in these documents.
3. Hybrid XML documents:
These documents may have parts that contain structured
data and other parts that are predominantly textual or
unstructured.
23. DTD Elements
The description of an element consists of
its name (tag), and a parenthesized
description of any nested tags.
◦ Includes order of subtags and their
multiplicity.
Leaves (text elements) have #PCDATA
(Parsed Character DATA ) in place of
nested tags.
24. Example: DTD
<!DOCTYPE BARS [
<!ELEMENT BARS (BAR*)>
<!ELEMENT BAR (NAME, BEER+)>
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT BEER (NAME, PRICE)>
<!ELEMENT PRICE (#PCDATA)>
]>
A BARS object has zero or more BARs nested within.
A BAR has one NAME and one or more BEER subobjects.
A BEER has a NAME and a PRICE.
NAME and PRICE are text.
25. Element Descriptions
Subtags must appear in the order shown.
A tag may be followed by a symbol to
indicate its multiplicity.
◦ * = zero or more.
◦ + = one or more.
◦ ? = zero or one.
Symbol | can connect alternative sequences
of tags.
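Element descriptions behave much like regular expressions over the sequence of child tags. The sketch below is not a DTD validator. It hand-translates the content models of the BARS DTD into Python regexes (an assumption of this illustration) and checks a parsed tree against them. The names `CONTENT_MODELS`, `TEXT_ONLY`, and `conforms` are invented for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models: each child tag is written as "TAG " so that
# the DTD symbols *, + and ? carry over directly to regex operators.
CONTENT_MODELS = {
    "BARS": r"(BAR )*",       # (BAR*)
    "BAR": r"NAME (BEER )+",  # (NAME, BEER+)
    "BEER": r"NAME PRICE ",   # (NAME, PRICE)
}
TEXT_ONLY = {"NAME", "PRICE"}  # declared as (#PCDATA)


def conforms(element):
    """Check that children appear in the order and multiplicity declared."""
    children = list(element)
    if element.tag in TEXT_ONLY:
        return not children  # text elements may not contain subtags
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return False  # undeclared tag
    sequence = "".join(child.tag + " " for child in children)
    if re.fullmatch(model, sequence) is None:
        return False
    return all(conforms(child) for child in children)


valid = ET.fromstring(
    "<BARS><BAR><NAME>Joe's Bar</NAME>"
    "<BEER><NAME>Bud</NAME><PRICE>2.50</PRICE></BEER></BAR></BARS>"
)
no_beer = ET.fromstring("<BARS><BAR><NAME>Joe's Bar</NAME></BAR></BARS>")
wrong_order = ET.fromstring("<BEER><PRICE>2.50</PRICE><NAME>Bud</NAME></BEER>")
```

Here `conforms(valid)` holds. `no_beer` fails because BEER+ requires at least one BEER, and `wrong_order` fails because subtags must appear in the declared order.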
26.
27. XML Schema
In XML format
Element names and types associated locally
Includes primitive data types (integers, strings,
dates, etc.)
Supports value-based constraints (integers >
100)
User-definable structured types
Inheritance (extension or restriction)
Foreign keys
Element-type reference constraints