SlideShare a Scribd company logo
1 of 34
Data warehouse design from XML sources
Matte0 Golfarelli Stefano Rizzi
DEIS - University of Bologna DEIS - University of Bologna
Viale Risorgimento, 2 Viale Risorgimento, 2
40136 Bologna - Italy 40136 Bologna - Italy
+39-051-642862 +39-051-2093542
[email protected][email protected]
Boris Vrdoljak
FER - University of Zagreb
Unska 3
10000 Zagreb - Croatia
+385-(O)l -6129756
[email protected]
ABSTRACT
A large amount of data needed in decision-making processes is
stored in the XML data format, which is widely used for e-
commerce and Internet-based information exchange. Thus, as
more organizations view the web as an integral part of their
communication and business, the importance of integrating
XML
data in data warehousing environments is becoming increasingly
high. In this paper we show how the design of a data mart can
be
carried out starting directly from an XML source. Two main
issues arise: on the one hand, since XML models semi-
structured
data, not all the information needed for design can be safely
derived; on the other, different approaches for representing
relationships in XML DTDs and Schemas are possible, each
with
different expressive power. After discussing these issues, we
propose a semi-automatic approach for building the conceptual
schema for a data mart starting from the XML sources.
Keywords
Data warehouse design, Data warehousing and the web, XML
1. INTRODUCTION
A large amount of data needed in decision-making processes is
stored in the XML (Extensible Markup Language) data format.
The structure of XML, composed of nested custom-defined tags
that can describe the meaning of the content itself, makes it
usable
as a semantic-preserving data exchange format on the web. As
the
Internet has evolved into a global platform for e-commerce and
information exchange, the interest in XML has been growing
and
large volumes of XML data already exist.
XML can be considered as a particular standard syntax for the
exchange of semi-structured data [l]. One common feature of
semi-structured data models is the lack of schema, so that the
data
is self-describing. However, XML documents can be associated
with and validated against either a Document Type Definition
(DTD) or an XML Schema, both of which allow the structure of
XML documents to be described and their contents to be
constrained. DTDs are defined as a part of the XML 1.0
Permission to make digital or hard copies of all or part of this
work for
personal or classroom use is granted without fee provided that
copies
are not made or distributed for profit or commercial advantage
and that
copies bear this notice and the full citation on the first page. To
copy
otherwise, or republish, to post on servers or to redistribute to
lists.
requires prior specific permission and/or a fee.
DOLAP ‘01, November 9,2001, Atlanta, Georgia, USA.
Copyright 2001 ACM l-581 l3-437-l/Ol/OOl I . ..$5.00.
Specification [15], while XML Schemas have recently become
a
W3C Recommendation [16]. XML Schemas considerably
extend
the capabilities of DTDs, especially from the point of view of
data
typing and constraining. With DTDs or Schemas, the
applications
exchanging data can agree about the meaning of the tags and, in
that case, XML reaches its full potential.
During the recent years, the enterprises have been asking for
support in the process of extracting useful, concise and handy
information for decision-making out of the tons of data stored
in
their expensive and complex information systems.
Consequently,
a substantial effort has been made in the academic world to
devise
methodologies for designing data warehousing systems capable
of
seamlessly integrating information from different sources, in
particular from heterogeneous databases. Now, as more
organizations view the web as an integral part of their
communication and business, the importance of integrating
XML
data in data warehousing environments is becoming increasingly
high. Some commercial tools now support data extraction from
XML sources to feed the warehouse, but both the warehouse
schema and the logical mapping between the source and the
target
schemas must be defined by the designer.
In this paper we show how multidimensional design for data
warehouses can be carried out starting directly from an XML
source. Two main issues arise: on the one hand, different
approaches for representing relationships in XML DTDs and
Schemas are possible, each achieving a different expressive
power; on the other, since XML models semi-structured data,
not
all the information needed for design can be safely derived.
Thus,
our contribution in this work is twofold: first, we propose a
warehouse-oriented review and comparison of the approaches
for
structuring XML documents; then, we outline an algorithm in
which the problem of correctly inferring the needed information
is
solved by querying the source XML documents and, if
necessary,
by asking the designer’s help.
One alternative approach to design from XML sources consists
in
first translating them into an equivalent relational schema, then
starting from the latter to design the warehouse. Some
approaches
for translating XML documents into a relational database can be
found in the literature, both leaning on the DTD [10][13] or not
[5], but insufficient emphasis is given to the problem of
determining the cardinality of relationships, which instead has a
primary role in multidimensional design.
The paper is structured as follows. In Section 2 the basics of
multidimensional modeling and design are given, while in
Section
3 the design alternatives for modeling relationships in both
DTDs
4 0
and XML Schemas are reviewed and discussed. In Section 4 our
approach to multidimensional design is presented with reference
to a case study, and in Section 5 the conclusions are drawn.
2. MULTIDIMENSIONAL MODELING AND
DESIGN
It is now widely recognized that an accurate conceptual design
is
the necessary foundation for building a data warehouse which is
both well-documented and fully satisfies requirements. In order
to
be independent of the specific issues involved in logical and
physical modeling, the approach proposed here is referred to the
conceptual level, from which the logical schemas of the data
marts
can be easily derived.
2.1 Conceptual Modeling
Several conceptual models for data warehouses were devised in
the literature [2][3][6][7][8][14]; they mainly differ for the
graphical representation of concepts, with small differences in
expressive power. In this paper we will adopt the Dimensional
Fact Model (DFM) [7], which represents the data mart by
means
of a set offact schemas.
In the following we will briefly discuss the DFM representation
of
the main concepts of the multidimensional model with reference
to the fact schema CLICK, which describes the analysis of the
web site traffic. The reason for choosing this example is that,
due
to the significant role now played by the web in attracting new
clients and supporting sales, analyzing the web server traffic
may
be crucial for improving the enterprise business. In this context,
multidimensional modeling allows many unpredictable complex
queries to be answered, such as:
. What is the trend for the most and the least accessed pages?
. Is there a relationship between business events (for instance,
sale promotions in an e-commerce site) and the number of
accesses?
For such an analysis, we need information about the hostname
or
IP address of the computer requesting the file, the date and time
of
the request, and the URL of the file being requested.
m o n t h 0
day of week
date (
h o l i d a y
d o m a i n
(category I
n a t i o n ) CLICK URL f i l e t y p e
0 n n o . o f c l i c k s A 0
host
( h o s t n a m e /
IP address)
hour ()
Figure 1. Fact schema for click-stream analysis
In the fact schema shown in Figure 1, the fact CLICK, focus of
interest for the decision-making process, is associated to the
measures which describe it, i.e. no. of clicks, and to the
dimensions determining its minimum granularity, namely host,
date, hour, and URL. Each dimension is the root of a hierarchy
which determines how the fact may be aggregated and selected
significantly for the decision-making process; each hierarchy
includes a set of attributes linked by functional dependencies.
For
instance, the URL of the file being requested determines the file
type, and the hostname or IP address of the computer requesting
the file determines its domain. The values for the domain
attribute
can be extracted from the hostname suffix that is indicating
either
the category (for instance, “.com” for commercial companies)
or
the nation.
Within the DFM, as in all the other conceptual models, a strong
relevance is given to functional dependencies, since they
represent
many-to-one relationships between attributes which enable
flexible aggregation of data in OLAP queries [9]. Thus, the
main
problem in building a conceptual data mart schema is to identify
those relationships in the business domain.
2.2 Conceptual Design
In most approaches to design of data marts, the conceptual
schema
is built starting from the (logical or conceptual) schema of the
operational sources [2][7][8]. The common core of these
approaches consists in navigating the functional dependencies
in
the source schema in order to determine the hierarchies for the
fact. In particular, the methodology to build a fact schema from
an
E/R schema proposed in [7] consists of the following steps:
1. Choosing facts.
2. For each fact:
2.1 Building the attribute tree.
2.2 Rearranging the attribute tree.
2.3 Defining dimensions and measures
For briefly illustrating the methodology, we will use the E/R
diagram describing the web site traffic shown in Figure 2.
Facts typically correspond to events occurring dynamically in
the
enterprise world. On the E/R schema a fact may be represented
either by an entity F or by an n-ary relationship R. In our
example,
the fact of primary interest is represented by entity CLICK.
Given a portion of interest of a source schema and an entity F
belonging to it, we call attribute tree the tree such that:
. each vertex corresponds to an attribute - simple or compound
- of the schema;
. the root corresponds to the identifier of F,
. for each vertex V, the corresponding attribute functionally
determines all the attributes corresponding to the descendants
of v.
The attribute tree for F may be constructed automatically by
navigating, starting from F, the functional dependencies
expressed
by many-to-one relationships between entities in the source
schema. Each entity E analyzed is represented in the attribute
tree
by: (1) a node corresponding to the identifier of E, (2) a child
node for each of the non-identifier attributes of E; (3) a child
node
for each entity G connected to E by a many-to-one relationship.
41
time d a t e
(IN
r-5U R LC A T E G O R Y
Figure 2. E/R schema for web site traffk
The resulting attribute tree for the web site traffic analysis
elements (i.e. sub-elements) and attributes. Both elements and
example is shown in Figure 3. Starting from the CLICK entity,
attributes are allowed to have values. Document structures can
be
new vertices are added by following many-to-one relationships.
nested to any level of complexity; between the opening and
The URL category attribute could not be included in the
attribute closing tags for an element, any number of additional
elements or
tree because there is a many-to-many relationship between URL
textual data may be present. Attributes are included, with their
and URL category. The nation attribute is included twice in the
respective values, within the element’s opening declaration. An
attribute tree because it can be added to the attribute tree by
XML document that contains data about the web site traffic is
navigating from both the host and the site entities. shown in
Figure 4.
d a t e time <webTraffic>
< c l i c k >
< h o s t hostId=“ares.csr.unibo.it”>
<nation>italy</nation>
</host>
<date>23-MAY-ZOOl</date>
<time>16:43:25</time>
curl urlID="BL0023">
<site siteID="www.hr">
<nation>croatia</nation>
</site>
Figure 3. Attribute tree for fact CLZCK
Once the attribute tree has been built, some uninteresting nodes
may be dropped from it. The time attribute, that represents the
exact time of a click, is replaced with a coarser hour attribute.
<fileType>shtml</fileType>
<urlCategory>catalog</urlCategory>
</url>
</click>
<click>
Finally, dimensions and measures must be selected among the
children of the root. In our example, the attributes chosen as
dimensions are host, date, hour and URL; number of clicks,
determined by counting the clicks from the same host to the
same
URL on a given date and hour, is chosen as a measure.
. .
</click>
. . .
</webTraffic>
F i g u r e 4 . A n X M L d o c u m e n t d e s c r i b i n g t h
e w e b s i t e t r a f f i c
Some further minor arrangements must be made in order to
obtain
the fact schema in Figure 1; in particular, the date dimension is
enriched by building a hierarchy which includes attributes
month,
day of week, and holiday. Besides, the host category and nation
optional attributes are replaced by attribute domain, which
indicates either the category or the nation of the host.
An XML document is valid if it has an associated schema, such
as
a DTD or an XML Schema, and if it conforms to the constraints
expressed in that schema. Since our methodology for conceptual
design is based on detecting many-to-one relationships, in the
following we will focus on the way those relationships can be
e x p r e s s e d i n t h e D T D a n d t h e X M L S c h e m
a .
3. MODELING RELATIONSHIPS IN XML
An XML document consists of nested element structures,
starting
with a root element. Each element may contain component
3.1 Relationships in DTDs
A DTD defines elements and attributes allowed in an XML
document, and the nesting and occurrences of each element. The
42
structure of an XML document is constrained using element-
type
and attribute-list declarations. Element-type declarations
specify
which sub-elements can appear as children of the element;
attribute-list declarations specify the name, type, and possibly
default value of each attribute associated with a given element
type. Among the different attribute types, types ID, IDREF and
IDREFS have particular relevance for our approach: the ID type
defines a unique identifier for the element; the IDREF type
means
that the attribute’s value is some other element’s identifier;
IDREFS means that its value is a list of identifiers. IDREF(S)
must match the value of some ID attribute in the document [ 11.
3. I .2 Modeling relationships by ID and IDREF(S)
Another way to specify relationships between elements in DTDs
is by means of ID and IDREF(S) attributes. The way these
attributes operate resembles the key and foreign key mechanism
used in relational databases, with some important differences
and
limitations. For instance, if we take the last part of the DTD
from
Figure 5 and let the relationships of the url element be defined
by
ID and IDREF(S) attributes rather than by sub-elements, we
obtain the DTD in Figure 6.
3.1. I Modeling relationships by sub-elements
<!ELEMENT webTraffic ( c l i c k * , f i l e T y p e + ,
urlCategory+) >
Relationships can be specified in DTDs by sub-elements that
may
have different cafdinalities. The optional character following a
child element name or list in the element-type declarations
determines whether the element or the content particles in the
list
may appear one or more (+), zero or more (*), or zero or one
times (?); the default cardinality is exactly one.
Figure 5 presents a DTD according to which the XML document
presented in Figure 4 is valid. Element webTraffic is defined as
a
document element, thus becomes the root of XML documents. A
webTraffic element may have many click elements, while in an
url
element the site sub-element must occur exactly once, followed
by
onefiJeType and many urlCategory sub-elements. A host
element
may have either a category or a nation element.
<!DOCTYPE webTraffic [
<!ELEMEIW webTraffic ( c l i c k * ) >
<!ELEMENT c l i c k ( h o s t , d a t e , t i m e , url) >
<!ELEMENT h o s t ( c a t e g o r y 1 n a t i o n ) >
<!ATTLIST h o s t
hostId ID #REQUIRED>
<!ELEMENT c a t e g o r y (#PCDATA)>
<!ELEMENT d a t e (#PCDATA)>
<!ELEMENT t i m e (#PCDATA)>
<!ELEMENT url ( s i t e , fileType,
urlCategory+) >
<!ATTLIST u r l
urlId ID #REQUIRED>
< ! ELEMENT site (nation) >
<!ATTLIST s i t e
siteId ID #REQUIRED>
< ! ELEMENT nation (#PCDATA) >
<!ELEMENT fileType (#PCDATA)>
<!ELEMENT urlcategory (#PCDATA)>
I>
Figure 5. A DTD where relationships are specified by sub-
e l e m e n t s
If a one-to-one or one-to-many relationship must be represented
in
XML, sub-elements with the above mentioned cardinalities can
be
used without loss of information. However, given a DTD, we
can
follow only one direction of a relationship. For instance,
according to the DTD in Figure 5, an url element may have
many
urlCutegory sub-elements, but it is not possible to find out,
from
the DTD, whether an URL category can refer to many URLs.
Only by having some knowledge about the domain described by
the DTD, we can conclude that the latter is the case.
. . .
<!ELEMENT u r l ( s i t e )
<!ATTLIST url
urlId ID #REQUIRED
fileTypeRef I D R E F # R E Q U I R E D
urlCategoryRef IDREFS # R E Q U I R E D >
<!ELEMENT fileType EMPTY>
<!ATTLIST fileType
typeID ID #REQUIRED
typeDescription CDATA #IMPLIED>
<!ELEMENT urlcategory EMPTY>
<!ATTLIST urlcategory
urlCategoryId ID #REQUIRED
urlCateqorvDesc CDATA #IMPLIED>
Figure 6. A DTD where relationships are specified by
IDREF(S)
In this example we assume that the$ZeTypeRefattribute of the
url
element references the fireType element. On the other hand, an
instance of urlCategoryRef references many instances of
urZCategory. Therefore, for one URL there is exactly one file
type
while there may be several categories. The problem is that,
using
IDREF(S), the participating elements cannot be constrained to
be
of a certain element type. For instance, jZeTypeRef could also
contain a reference to an urZCutegory element, while obviously
we would like to constrain such references tofiZeType elements
only. Further, though the value of an ID attribute is unique
within
the whole document, element types are not required to have an
ID,
and even if an element type has an ID, its usage may be
optional.
For these reasons, there is no means to actually constrain the
allowed relationships by the ID/IDREF(S) mechanism.
3.2 Relationships in XML Schemas
XML Schemas give more accurate representation of the XML
structure constraints than DTDs; in particular, the cardinal&y
can
be specified in more detail by using the minoccurs and
maxoccurs attributes. An XML Schema consists of type
definitions, which can be derived from each other, and element
declarations. The possibility of separating an element
declaration
from the definition of its type enables the sharing and reusing
of
simple and composite types. Further, besides ID and IDREF(S)
attributes, XML Schemas support the use of key and keyRef
elements for defining keys and their references.
The two different ways of specifying relationships in XML
Schemas, i.e. sub-elements and key/keyRef mechanisms, will be
described in the following.
43
3.2.1 Modeling relationships by sub-elements
Figure 7 presents a part of an XML Schema that defines the
same
structure and constraints as the DTD in Figure 5. Since, for
defining elements, attributes minOccurs and maxOccurs default
to
1, and only in the case of the urlCategory element the
maxOccurs
attribute is set to “unbounded”, an w-1 element will consist of
one
site, one$leType and many urlcategory elements. We will not
further discuss this solution since, from the point of view of
cardinality modeling, it is essentially equivalent to the solution
in
Section3.1.1.
< e l e m e n t name=“click”>
<complexType>
<sequence>
<element name="host"
type="hostType"/>
<element name="date" type="date"/>
<element name="time" type="time"/>
<element name="url" type="urlType"/>
</sequence>
</complexType>
</element>
. .
<complexType name="urlType">
<sequence>
<element name="site" type="siteType"/>
<element name="fileType"
type="string"/>
<element name="urlCategory"
type="string"
maxOccurs="unbounded"/>
</sequence>
<attribute name="urlID" type="string"/>
</complexType>
. . .
F i g u r e 7 . A n X M L S c h e m a w h e r e r e l a t i o n
s h i p s a r e s p e c i f i e d
b y s u b - e l e m e n t s
3 . 2 . 2 Modeling relationships by key and keyRef
elements
In addition to ID and IDREF(S) attributes, XML Schemas
introduce more powerful and flexible mechanisms, similar to
the
relational concept of foreign key. The key element is used to
indicate that every attribute or element value must be unique
within a certain scope and not null. By using keyRef elements,
keys can be referenced.
For the purpose of describing the declaration of keys and their
references, an element namedfileTypes (represented in Figure
8)
is added as a sub-element to the click element from the XML
document in Figure 4. The fileTypes element consists of type
elements with z’ypeZD attributes. Figure 9 presents an
evolution of
the XML Schema in Figure 7, according to which every click
element also contains a fileTypes element. This element has a
composite type named fileTypesType (its definition is not
presented in Figure 9), according to which the XML document
in
Figure 8 is valid.
XML Schemas allow to specify the scope for each key by means
of an XPath expression [17]. In our example, the key element
is
namedfileTypeKey. The typeID attribute from Figure 8 is
specified as the key by means of the selector and thefield sub-
elements of the key attribute in Figure 9. The xpath attribute of
the
selector element contains an XPath expression,$leTypes/type,
that
selects the type elements that are sub-elements of the fileTypes
element. The xpath attribute of the field element contains the
@typeZD expression, that specifies the typeZD attribute of the
type
element as the key. Further, thefileType element, that is a sub-
element of w-1, is declared as a keyReJ this means that, for
every
value ofjZeType, afileTypeKey with the same value must
exist.
<fileTypes>
<type typeID="html">
hypertext markup language</type>
<type typeID="gif">
graphic interchange format</type>
. . .
</fileTypes>
Figure 8. ThefiZeTypes element as a sub-element of click
<element name="click">
<complexType>
. . .
<element name="url" type="urlType"/>
<element name="fileTypes"
type="fileTypesType"/>
. . .
</complexType>
<key name="fileTypeKey">
<selector xpath="fileTypes/type"/>
<field xpath="@typeID"/>
</key>
ckeyref name="fileTypeRef"
refer="fileTypeKey">
<selector xpath=“url”/>
<field xpath="fileType"/>
i/keyref>
</element>
Figure 9. An XML Schema where relationships are specified
by key/keyRef
In conclusion, the key/keyRef mechanism may be applied to any
element and attribute content, and the scope of the constraint
can
be specified, while an ID is a type of attribute whose scope is
fixed to be the whole document. Furthermore, combinations of
element and attribute content can also serve as keys and
references in XML Schemas.
4. CONCEPTUAL DESIGN FROM XML
SOURCES
In the previous section, we showed different approaches for
representing relationships in DTDs and XML Schemas. Three
of
them are suitable for specifying relationships: sub-elements in
DTDs, sub-elements and key/keyRef in Schemas; though their
expressive power is different, in the context of this paper they
may be considered to be equivalent since with reference to
multidimensional design they allow the same knowledge to be
captured. We do not consider the fourth approach, ID/IDREF(S)
4 4
in DTDs, since it is not precise and useful enough in
constraining
relationships.
In this section we propose a semi-automatic approach for
building
the conceptual schema of a data mart starting from the XML
sources. Of the three above-mentioned approaches, we have
chosen sub-elements in DTDs for the presentation of our
methodology, since Schemas are still not as widely used as
DTDs;
however, the methodology is essentially the same when dealing
w i t h S c h e m a s .
Starting with the assumption that the XML document has a DTD
and conforms to it, the methodology consists of the following
steps:
1. Simplifying the DTD.
2 . Creating a DTD graph.
3. Choosing facts.
4. For each fact:
4.1 Building the attribute tree from the DTD graph.
4.2 Rearranging the attribute tree.
4.3 Defining dimensions and measures.
In the following paragraphs we will describe steps from (1) to
(4.1) referring to the web site traffic example; once the attribute
tree is built, steps 4.2 and 4.3 are identical, respectively, to
steps
2.2 and 2.3 described in Section 2.3.
Simvlifvina t h e D T D
The sub-elements in DTDs may have been declared in a
complicated and redundant way. However, those details of a
DTD
can be simplified [ 131. The transformations for simplifying a
DTD
include converting a nested definition into a flat representation:
f o r i n s t a n c e , i n t h e w e b s i t e t r a f f i c e x a m
p l e ,
host(category /nation) i s t r a n s f o r m e d i n t o
host(category?,nation?). Further, the sub-elements having the
same name are grouped, and many unary operators are reduced
to
a single unary operator. Finally, all “+” operators are
transformed
intO “*” operators.
Creating a DTD waph
After simplifying the DTD, a DTD graph representing its
structure
can be created as described in [lo] and [13]; its vertices
correspond to elements, attributes and operators in the DTD.
Attributes and sub-elements are not distinguished in the graph
since, in our methodology, they are considered as equivalent
nesting mechanisms. The DTD graph for the DTD in Figure 5 is
given in Figure 10.
Definina facts
The designer chooses one or more vertices of the DTD graph as
facts; each of them becomes the root of a fact schema. In our
example, we choose the click vertex as the only interesting fact.
Building the attribute tree
The vertices of the attribute tree are a subset of the element and
attribute vertices of the DTD graph. The algorithm to build the
attribute tree is sketched in Figure Il.
webTraffic
1
click
host d a t e t i m e url
Figure 10. DTD graph for web site traffic analysis
root=newVertex (F) ;
// newVertex(<vertex>) r e t u r n s a n e w v e r t e x
// o f t h e a t t r i b u t e t r e e
// c o r r e s p o n d i n g t o < v e r t e x >
// o f t h e D T D g r a p h
expand(F,root) ;
expand(E,V) :
// E i s t h e c u r r e n t DTD vertex,
// V i s t h e c u r r e n t a t t r i b u t e t r e e v e r t e x
{ f o r e a c h c h i l d W o f E d o
if W i s e l e m e n t o r a t t r i b u t e d o
{ next=newVertex(W);
addchild (V, next) ;
// a d d s c h i l d W t o V
expand (W, next ) ;
else
if W="?" do
expand (W, V) ;
for each parent Z of E such that
Z i s n o t a d o c u m e n t e l e m e n t d o
if Z=“3” or Z=“*” do
expand(Z,V) ;
e l s e
if n o t toMany(E,Z) do
if askDesignerToOne (E, Z) do
{ next=newVertex(Z);
addChild(V,next) ;
expand (Z, next) ;
1
Figure 11. Algorithm for building the attribute tree
The attribute tree is initialized with the fact vertex F; then, it is
enlarged by recursively navigating the functional dependencies
between the vertices of the DTD graph. Each vertex V inserted
in
the attribute tree is expanded as follows (procedure expand):
1. For each vertex W that is a child of V in the DTD graph:
When examining relationships in the same direction as in the
DTD graph, the cardinality information is expressed either
explicitly by “?” and “*” vertices or implicitly by their
4 5
2 . For each vertex Z that is a parent of V in the DTD graph:
absence. If W corresponds to an element or attribute in the since
there is no need for the existence of both host and hostld
DTD, it is added to the attribute tree as a child of V; if W is a
vertices, only host should be left, the same logic should be
applied
“ ? ” operator, its child is added to the attribute tree as a child
of for urlld and siteld. Finally, the time attribute is replaced
with the
V; if W is a “*” operator, no vertex is added. hour attribute.
When examining relationships in this direction, vertices
corresponding to “*” and “?,, operators are skipped since they
only express the cardinality in the opposite direction. Since
the DTD yields no further information about the relationship
cardinality, it is necessary to examine the actual data by
querying the XML documents conforming to the DTD. This is
done by procedure checkToMany, which counts the number
of distinct values of Z corresponding to each value of E. If a -
to-many relationship is detected, Z is not included in the
attribute tree. Otherwise, we still cannot be sure that the
cardinality of the relationship from E to Z is -to-one. In this
case, only the designer can tell, leaning on her knowledge of
the business domain, whether the actual cardinality is -to-one
or -to-many (procedure askDesignerToOne). Only in the
first case, Z is added to the attribute tree. The reason why
document elements are not considered is that they have only
one instance within XML documents, thus they have no
interest for aggregation and should not be modeled in the data
m a r t .
In the following some general considerations on the proposed
approach are reported.
The problem of checking cardinalities in XML documents is
related to that of discovering functional dependencies in
relational
databases, which was widely investigated in the literature on
relational theory and data mining [11][12]. In our case, the
situation is much simpler since no inference is necessary, so it
comes down to properly querying the data with an XML query
language supporting aggregate queries. For instance, in W3C
XQuery [18] the use of the distinct function is proposed for
that
purpose, while the use of the group-by function is proposed in
[4].
The main question arising is how many XML documents we
must
see to reasonably confirm our presumption that the cardinality
is
-to-one.
Clearly, the semi-structured nature of XML sources increases
the
level of uncertainty on the structure of data as compared to E/R
sources, thus making recourses to the designer’s experience
more
frequent. In our algorithm, we chose to ask questions
interactively
during the tree building phase in order to avoid unnecessary
queries on the documents. An alternative solution consists in
first
building the tree by emphasizing all the uncertain relationships,
then handing on the complete tree to the designer in order to
have
it rearranged (step 4.2) with specific attention to uncertain
relationships, which could be dropped together with their
subtree.
While this solution allows the designer to have a wider point of
view on the tree, it may be less efficient since a vertex, dropped
later, might have been (uselessly) expanded by querying the
XML
d o c u m e n t s .
d a t e t i m e u r l l d
host category Q P P
url f i l e t y p e
0
n a t i o n site
s i t e l d
0
F i g u r e 1 2 . A t t r i b u t e t r e e d e r i v e d f r o m t h
e D T D g r a p h
In our example, no uncertain relationships are navigated. Vertex
urlCategory is not added to the attribute tree because it is a
child
of a -*I’ vertex in the DTD graph. The resulting attribute tree
for
the web site traffic analysis example is given in Figure 12.
Some
further arrangements should be made to this tree: for instance,
As a matter of fact, the problem of inferring the relationship
cardinalities is present also when the source to be used for
design
is a relational schema [7]. In fact, the presence in a relation R
of a
foreign key F referencing the primary key K of a relation S,
implies that F functionally determines K and, consequently, all
the
other attributes of K, but tells us nothing about the number of
distinct tuples of R related to each tuple of S. Thus, in
principle, in
order to guess the uncertain cardinalities we should query the
database as in the case of XML sources. On the other hand, in
the
relational case this issue is much less relevant than in the XML
case. In fact, while the designer of an XML document chooses
the
webTraffic
7
”
category
F i g u r e 1 3 . A n o t h e r p o s s i b l e D T D g r a p h d
e s c r i b i n g t h e w e b s i t e traffk
4 6
direction of each link without considering the cardinality of the
relationship to be modeled, the designer of a relational schema
is
constrained (by the need to satisfy the first normal form) to
represent each relationship in the -to-one direction. Thus, in
general, the relationship from S to R is one-to-many, hence, not
interesting for multidimensional modeling; the only case in
which
it might be interesting is when the foreign key mechanism has
been used by the designer to model a one-to-one relationship -
but
this is not very frequent.
As seen in Section 3, several DTDs representing the same
subject
may be designed; for each of them, the resulting attribute tree
may
look different. For instance, the attribute tree for the DTD graph
in
Figure 13 is presented in Figure 14. Having click as a fact,
navigating from hostld to host and from urlld to url entails
analyzing the data to check the uncertain relationship. After
inverting hostld with host and urlld with url (it can be done
since
they are related by a one-to-one relationship), the resulting
attribute tree becomes the same as the one in Figure 12.
h o s t c a t e g o r y
date t i m e
nation
hostld
r-
urlld url t i l e t y p e
A n 0
c l i c k
siteld
site () 0
0 nation
F i g u r e 1 4 . A t t r i b u t e t r e e f o r t h e D T D g r a
p h i n F i g u r e 1 3
5. CONCLUSIONS
In this paper we described a semi-automatic approach to
conceptual design of a data mart from an XML source. We
showed how the semi-structured nature of the source increases
the
level of uncertainty on the structure of data as compared to
structured sources such as database schemas, thus requiring
access
to the source documents and, possibly, the designer’s help in
order
to detect -to-one relationships. The approach was described with
reference to the case in which the sources are constrained by a
DTD using sub-elements, but it can be adopted equivalently
when
XML Schemas a r e c o n s i d e r e d .
Using XML sources for feeding data warehouse systems will
become a standard in the next few years. Adopting a technique
to
derive the data mart schema directly from the XML sources is
not
the only possible approach: the data mart schema may also be
designed “manually”, meaning that facts, measures and
hierarchies are determined starting from the user requirements
and
the logical connection with the source schemas is established
only
a posteriori. On the other hand, the main problem with this
solution is that, very often, the requirements expressed by the
users cannot be fully supported by the existing data; besides,
the
process of mapping each requirement back to the source schema
m a y b e v e r y c o m p l e x .
6. REFERENCES
VI
PI
[31
[41
PI
[61
[71
PI
PI
Abiteboul, S., Buneman, P., and Suciu, D. Data on the Web:
F r o m R e l a t i o n s t o S e m i s t r u c t u r e d D a t a a
n d X M L . M o r g a n
Kaufman Publishers, 2000.
Cabibbo, L., and Torlone, R. A logical approach to
multidimensional databases. In Proc. EDBT, 1998.
Datta, A., and Thomas, H. A conceptual model and algebra
for on-line analytical processing in data warehouses. In Proc.
W I T S , 1 9 9 7 .
D e u t s c h , A . , F e m a n d e z , M . , F l o r e s c u , D .
, L e v y , A . , a n d
Suciu, D. A Query Language for XML. In Proc. 8th World
Wide Web Conference, 1999.
Florescu, D., and Kossmann, D. Storing and Querying XML
Data using an RDBMS. IEEE Data Engineering Bulletin 22,
3, 1999.
Franconi, E., and Sattler, U. A data warehouse conceptual
model for multidimensional aggregation. In Proc. DMDW,
1999.
Golfarelli, M., Maio, D., and Rizzi, S. The Dimensional Fact
Model: a conceptual model for data warehouses. Int. Jour. of
Cooperative Inf. Systems 7,2&3, 1998.
Hiisemann, B., Lechtenborger, J., and Vossen, G. Conceptual
data warehouse design. In Proc. DMDW, 2000.
Kimball, R. The data warehouse toolkit. John Wiley & Sons,
1996.
[lo] Lee, D., and Chu, W.W. Constraints-preserving
T r a n s f o r m a t i o n f r o m X M L D o c u m e n t T y p
e D e f i n i t i o n t o
Relational Schema. In Proc. 19” ER (Salt Lake City), 2000.
[ 111 Mannila, H., and Raiha, K.J. Algorithms for inferring
functional dependencies. Data & Knowledge Engineering,
12, 1, 1994.
[ 121 Savnik, I., and Flach, P. Bottom-up induction of
functional
dependencies from relations. In Piatesky-Shapiro (ed.),
Knowledge Discovery in Databases, AAAI, 1993.
[ 131 Shanmugasundaram, J., et al. Relational Databases for
Querying XML Documents: Limitations and Opportunities.
In Proc. 25” VLDB (Edinburgh), 1999.
[ 141 Vassiliadis, P. Modeling multidimensional databases,
cubes
and cube operations. In Proc. 10th SSDBM Conf. (Capri,
Italy), 1998.
[ 151 World Wide Web Consortium (W3C). XML 1 .O
Specification. http://www.w3,org/TR/2000/REC-xml-
20001006.
[16] World Wide Web Consortium (W3C). XML Schema.
http:Nwww.w3.org/XML/Schema.
[17] World Wide Web Consortium (W3C). Xpath Specification
1 .O. http:llwww.w3.orglTRJxpath.
[18] World Wide Web Consortium (W3C). XQuery 1.0: An
XML
Query Language (Working Draft),
http:llwww.w3.orglTIUxqueryl.
47

More Related Content

Similar to Data warehouse design from XML sources

Physical Database Requirements.pdf
Physical Database Requirements.pdfPhysical Database Requirements.pdf
Physical Database Requirements.pdfseifusisay06
 
Mastering data-modeling-for-master-data-domains
Mastering data-modeling-for-master-data-domainsMastering data-modeling-for-master-data-domains
Mastering data-modeling-for-master-data-domainsChanukya Mekala
 
BI Architecture in support of data quality
BI Architecture in support of data qualityBI Architecture in support of data quality
BI Architecture in support of data qualityTom Breur
 
IRJET- Behaviour of Hybrid Fibre Reinforced Sintered Fly Ash Aggregate Concre...
IRJET- Behaviour of Hybrid Fibre Reinforced Sintered Fly Ash Aggregate Concre...IRJET- Behaviour of Hybrid Fibre Reinforced Sintered Fly Ash Aggregate Concre...
IRJET- Behaviour of Hybrid Fibre Reinforced Sintered Fly Ash Aggregate Concre...IRJET Journal
 
IRJET- SVM-based Web Content Mining with Leaf Classification Unit From DOM-Tree
IRJET- SVM-based Web Content Mining with Leaf Classification Unit From DOM-TreeIRJET- SVM-based Web Content Mining with Leaf Classification Unit From DOM-Tree
IRJET- SVM-based Web Content Mining with Leaf Classification Unit From DOM-TreeIRJET Journal
 
Discussion post· The proper implementation of a database is es.docx
Discussion post· The proper implementation of a database is es.docxDiscussion post· The proper implementation of a database is es.docx
Discussion post· The proper implementation of a database is es.docxmadlynplamondon
 
Catalog-based Conversion from Relational Database into XML Schema (XSD)
Catalog-based Conversion from Relational Database into XML Schema (XSD)Catalog-based Conversion from Relational Database into XML Schema (XSD)
Catalog-based Conversion from Relational Database into XML Schema (XSD)CSCJournals
 
Oudg cross model datum access
Oudg cross model datum accessOudg cross model datum access
Oudg cross model datum accesscsandit
 
OUDG : Cross Model Datum Access with Semantic Preservation for Legacy Databases
OUDG : Cross Model Datum Access with Semantic Preservation for Legacy Databases OUDG : Cross Model Datum Access with Semantic Preservation for Legacy Databases
OUDG : Cross Model Datum Access with Semantic Preservation for Legacy Databases csandit
 
A relational model of data for large shared data banks
A relational model of data for large shared data banksA relational model of data for large shared data banks
A relational model of data for large shared data banksSammy Alvarez
 
Xml data clustering an overview
Xml data clustering an overviewXml data clustering an overview
Xml data clustering an overviewunyil96
 
P REFIX - BASED L ABELING A NNOTATION FOR E FFECTIVE XML F RAGMENTATION
P REFIX - BASED  L ABELING  A NNOTATION FOR  E FFECTIVE  XML F RAGMENTATIONP REFIX - BASED  L ABELING  A NNOTATION FOR  E FFECTIVE  XML F RAGMENTATION
P REFIX - BASED L ABELING A NNOTATION FOR E FFECTIVE XML F RAGMENTATIONijcsit
 
Intake 38 data access 4
Intake 38 data access 4Intake 38 data access 4
Intake 38 data access 4Mahmoud Ouf
 
Formal Models and Algorithms for XML Data Interoperability
Formal Models and Algorithms for XML Data InteroperabilityFormal Models and Algorithms for XML Data Interoperability
Formal Models and Algorithms for XML Data InteroperabilityThomas Lee
 
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQLA STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQLijscai
 
A Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQLA Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQLIJSCAI Journal
 
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQLA STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQLijscai
 
A Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQLA Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQLIJSCAI Journal
 

Similar to Data warehouse design from XML sources (20)

Unit 3 WEB TECHNOLOGIES
Unit 3 WEB TECHNOLOGIES Unit 3 WEB TECHNOLOGIES
Unit 3 WEB TECHNOLOGIES
 
Physical Database Requirements.pdf
Physical Database Requirements.pdfPhysical Database Requirements.pdf
Physical Database Requirements.pdf
 
Mastering data-modeling-for-master-data-domains
Mastering data-modeling-for-master-data-domainsMastering data-modeling-for-master-data-domains
Mastering data-modeling-for-master-data-domains
 
BI Architecture in support of data quality
BI Architecture in support of data qualityBI Architecture in support of data quality
BI Architecture in support of data quality
 
IRJET- Behaviour of Hybrid Fibre Reinforced Sintered Fly Ash Aggregate Concre...
IRJET- Behaviour of Hybrid Fibre Reinforced Sintered Fly Ash Aggregate Concre...IRJET- Behaviour of Hybrid Fibre Reinforced Sintered Fly Ash Aggregate Concre...
IRJET- Behaviour of Hybrid Fibre Reinforced Sintered Fly Ash Aggregate Concre...
 
IRJET- SVM-based Web Content Mining with Leaf Classification Unit From DOM-Tree
IRJET- SVM-based Web Content Mining with Leaf Classification Unit From DOM-TreeIRJET- SVM-based Web Content Mining with Leaf Classification Unit From DOM-Tree
IRJET- SVM-based Web Content Mining with Leaf Classification Unit From DOM-Tree
 
Discussion post· The proper implementation of a database is es.docx
Discussion post· The proper implementation of a database is es.docxDiscussion post· The proper implementation of a database is es.docx
Discussion post· The proper implementation of a database is es.docx
 
Catalog-based Conversion from Relational Database into XML Schema (XSD)
Catalog-based Conversion from Relational Database into XML Schema (XSD)Catalog-based Conversion from Relational Database into XML Schema (XSD)
Catalog-based Conversion from Relational Database into XML Schema (XSD)
 
Oudg cross model datum access
Oudg cross model datum accessOudg cross model datum access
Oudg cross model datum access
 
OUDG : Cross Model Datum Access with Semantic Preservation for Legacy Databases
OUDG : Cross Model Datum Access with Semantic Preservation for Legacy Databases OUDG : Cross Model Datum Access with Semantic Preservation for Legacy Databases
OUDG : Cross Model Datum Access with Semantic Preservation for Legacy Databases
 
Data Modeling.docx
Data Modeling.docxData Modeling.docx
Data Modeling.docx
 
A relational model of data for large shared data banks
A relational model of data for large shared data banksA relational model of data for large shared data banks
A relational model of data for large shared data banks
 
Xml data clustering an overview
Xml data clustering an overviewXml data clustering an overview
Xml data clustering an overview
 
P REFIX - BASED L ABELING A NNOTATION FOR E FFECTIVE XML F RAGMENTATION
P REFIX - BASED  L ABELING  A NNOTATION FOR  E FFECTIVE  XML F RAGMENTATIONP REFIX - BASED  L ABELING  A NNOTATION FOR  E FFECTIVE  XML F RAGMENTATION
P REFIX - BASED L ABELING A NNOTATION FOR E FFECTIVE XML F RAGMENTATION
 
Intake 38 data access 4
Intake 38 data access 4Intake 38 data access 4
Intake 38 data access 4
 
Formal Models and Algorithms for XML Data Interoperability
Formal Models and Algorithms for XML Data InteroperabilityFormal Models and Algorithms for XML Data Interoperability
Formal Models and Algorithms for XML Data Interoperability
 
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQLA STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL
 
A Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQLA Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQL
 
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQLA STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL
 
A Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQLA Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQL
 

More from whittemorelucilla

Database reports provide us with the ability to further analyze ou.docx
Database reports provide us with the ability to further analyze ou.docxDatabase reports provide us with the ability to further analyze ou.docx
Database reports provide us with the ability to further analyze ou.docxwhittemorelucilla
 
DataInformationKnowledge1.  Discuss the relationship between.docx
DataInformationKnowledge1.  Discuss the relationship between.docxDataInformationKnowledge1.  Discuss the relationship between.docx
DataInformationKnowledge1.  Discuss the relationship between.docxwhittemorelucilla
 
DataHole 12 Score6757555455555455575775655565656555655656556566643.docx
DataHole 12 Score6757555455555455575775655565656555655656556566643.docxDataHole 12 Score6757555455555455575775655565656555655656556566643.docx
DataHole 12 Score6757555455555455575775655565656555655656556566643.docxwhittemorelucilla
 
DataDestination PalletsTotal CasesCases redCases whiteCases organi.docx
DataDestination PalletsTotal CasesCases redCases whiteCases organi.docxDataDestination PalletsTotal CasesCases redCases whiteCases organi.docx
DataDestination PalletsTotal CasesCases redCases whiteCases organi.docxwhittemorelucilla
 
DataIllinois Tool WorksConsolidated Statement of Income($ in milli.docx
DataIllinois Tool WorksConsolidated Statement of Income($ in milli.docxDataIllinois Tool WorksConsolidated Statement of Income($ in milli.docx
DataIllinois Tool WorksConsolidated Statement of Income($ in milli.docxwhittemorelucilla
 
DataIDSalaryCompa-ratioMidpoint AgePerformance RatingServiceGender.docx
DataIDSalaryCompa-ratioMidpoint AgePerformance RatingServiceGender.docxDataIDSalaryCompa-ratioMidpoint AgePerformance RatingServiceGender.docx
DataIDSalaryCompa-ratioMidpoint AgePerformance RatingServiceGender.docxwhittemorelucilla
 
DataCity1997 Median Price1997 Change1998 Forecast1993-98 Annualize.docx
DataCity1997 Median Price1997 Change1998 Forecast1993-98 Annualize.docxDataCity1997 Median Price1997 Change1998 Forecast1993-98 Annualize.docx
DataCity1997 Median Price1997 Change1998 Forecast1993-98 Annualize.docxwhittemorelucilla
 
DataClientRoom QualityFood QualityService Quality1GPG2GGG3GGG4GPG5.docx
DataClientRoom QualityFood QualityService Quality1GPG2GGG3GGG4GPG5.docxDataClientRoom QualityFood QualityService Quality1GPG2GGG3GGG4GPG5.docx
DataClientRoom QualityFood QualityService Quality1GPG2GGG3GGG4GPG5.docxwhittemorelucilla
 
Database Project CharterBusiness CaseKhalia HartUnive.docx
Database Project CharterBusiness CaseKhalia HartUnive.docxDatabase Project CharterBusiness CaseKhalia HartUnive.docx
Database Project CharterBusiness CaseKhalia HartUnive.docxwhittemorelucilla
 
Databases selected Multiple databases...Full Text (1223 .docx
Databases selected Multiple databases...Full Text (1223  .docxDatabases selected Multiple databases...Full Text (1223  .docx
Databases selected Multiple databases...Full Text (1223 .docxwhittemorelucilla
 
Database SystemsDesign, Implementation, and ManagementCo.docx
Database SystemsDesign, Implementation, and ManagementCo.docxDatabase SystemsDesign, Implementation, and ManagementCo.docx
Database SystemsDesign, Implementation, and ManagementCo.docxwhittemorelucilla
 
DATABASE SYSTEMS DEVELOPMENT & IMPLEMENTATION PLAN1DATABASE SYS.docx
DATABASE SYSTEMS DEVELOPMENT & IMPLEMENTATION PLAN1DATABASE SYS.docxDATABASE SYSTEMS DEVELOPMENT & IMPLEMENTATION PLAN1DATABASE SYS.docx
DATABASE SYSTEMS DEVELOPMENT & IMPLEMENTATION PLAN1DATABASE SYS.docxwhittemorelucilla
 
Database Security Assessment Transcript You are a contracting office.docx
Database Security Assessment Transcript You are a contracting office.docxDatabase Security Assessment Transcript You are a contracting office.docx
Database Security Assessment Transcript You are a contracting office.docxwhittemorelucilla
 
Database Design Mid Term ExamSpring 2020Name ________________.docx
Database Design Mid Term ExamSpring 2020Name ________________.docxDatabase Design Mid Term ExamSpring 2020Name ________________.docx
Database Design Mid Term ExamSpring 2020Name ________________.docxwhittemorelucilla
 
Database Justification MemoCreate a 1-page memo for the .docx
Database Justification MemoCreate a 1-page memo for the .docxDatabase Justification MemoCreate a 1-page memo for the .docx
Database Justification MemoCreate a 1-page memo for the .docxwhittemorelucilla
 
Database Concept Maphttpwikieducator.orgCCNCCCN.docx
Database Concept Maphttpwikieducator.orgCCNCCCN.docxDatabase Concept Maphttpwikieducator.orgCCNCCCN.docx
Database Concept Maphttpwikieducator.orgCCNCCCN.docxwhittemorelucilla
 
Database Dump Script(Details of project in file)Mac1) O.docx
Database Dump Script(Details of project in file)Mac1) O.docxDatabase Dump Script(Details of project in file)Mac1) O.docx
Database Dump Script(Details of project in file)Mac1) O.docxwhittemorelucilla
 
Database Design 1. What is a data model A. method of sto.docx
Database Design 1.  What is a data model A. method of sto.docxDatabase Design 1.  What is a data model A. method of sto.docx
Database Design 1. What is a data model A. method of sto.docxwhittemorelucilla
 
DataAGEGENDERETHNICMAJORSEMHOUSEGPAHRSNEWSPAPTVHRSSLEEPWEIGHTHEIGH.docx
DataAGEGENDERETHNICMAJORSEMHOUSEGPAHRSNEWSPAPTVHRSSLEEPWEIGHTHEIGH.docxDataAGEGENDERETHNICMAJORSEMHOUSEGPAHRSNEWSPAPTVHRSSLEEPWEIGHTHEIGH.docx
DataAGEGENDERETHNICMAJORSEMHOUSEGPAHRSNEWSPAPTVHRSSLEEPWEIGHTHEIGH.docxwhittemorelucilla
 

More from whittemorelucilla (20)

Database reports provide us with the ability to further analyze ou.docx
Database reports provide us with the ability to further analyze ou.docxDatabase reports provide us with the ability to further analyze ou.docx
Database reports provide us with the ability to further analyze ou.docx
 
DataInformationKnowledge1.  Discuss the relationship between.docx
DataInformationKnowledge1.  Discuss the relationship between.docxDataInformationKnowledge1.  Discuss the relationship between.docx
DataInformationKnowledge1.  Discuss the relationship between.docx
 
DataHole 12 Score6757555455555455575775655565656555655656556566643.docx
DataHole 12 Score6757555455555455575775655565656555655656556566643.docxDataHole 12 Score6757555455555455575775655565656555655656556566643.docx
DataHole 12 Score6757555455555455575775655565656555655656556566643.docx
 
DataDestination PalletsTotal CasesCases redCases whiteCases organi.docx
DataDestination PalletsTotal CasesCases redCases whiteCases organi.docxDataDestination PalletsTotal CasesCases redCases whiteCases organi.docx
DataDestination PalletsTotal CasesCases redCases whiteCases organi.docx
 
DataIllinois Tool WorksConsolidated Statement of Income($ in milli.docx
DataIllinois Tool WorksConsolidated Statement of Income($ in milli.docxDataIllinois Tool WorksConsolidated Statement of Income($ in milli.docx
DataIllinois Tool WorksConsolidated Statement of Income($ in milli.docx
 
DataIDSalaryCompa-ratioMidpoint AgePerformance RatingServiceGender.docx
DataIDSalaryCompa-ratioMidpoint AgePerformance RatingServiceGender.docxDataIDSalaryCompa-ratioMidpoint AgePerformance RatingServiceGender.docx
DataIDSalaryCompa-ratioMidpoint AgePerformance RatingServiceGender.docx
 
DataCity1997 Median Price1997 Change1998 Forecast1993-98 Annualize.docx
DataCity1997 Median Price1997 Change1998 Forecast1993-98 Annualize.docxDataCity1997 Median Price1997 Change1998 Forecast1993-98 Annualize.docx
DataCity1997 Median Price1997 Change1998 Forecast1993-98 Annualize.docx
 
DataClientRoom QualityFood QualityService Quality1GPG2GGG3GGG4GPG5.docx
DataClientRoom QualityFood QualityService Quality1GPG2GGG3GGG4GPG5.docxDataClientRoom QualityFood QualityService Quality1GPG2GGG3GGG4GPG5.docx
DataClientRoom QualityFood QualityService Quality1GPG2GGG3GGG4GPG5.docx
 
Database Project CharterBusiness CaseKhalia HartUnive.docx
Database Project CharterBusiness CaseKhalia HartUnive.docxDatabase Project CharterBusiness CaseKhalia HartUnive.docx
Database Project CharterBusiness CaseKhalia HartUnive.docx
 
Databases selected Multiple databases...Full Text (1223 .docx
Databases selected Multiple databases...Full Text (1223  .docxDatabases selected Multiple databases...Full Text (1223  .docx
Databases selected Multiple databases...Full Text (1223 .docx
 
Database SystemsDesign, Implementation, and ManagementCo.docx
Database SystemsDesign, Implementation, and ManagementCo.docxDatabase SystemsDesign, Implementation, and ManagementCo.docx
Database SystemsDesign, Implementation, and ManagementCo.docx
 
DATABASE SYSTEMS DEVELOPMENT & IMPLEMENTATION PLAN1DATABASE SYS.docx
DATABASE SYSTEMS DEVELOPMENT & IMPLEMENTATION PLAN1DATABASE SYS.docxDATABASE SYSTEMS DEVELOPMENT & IMPLEMENTATION PLAN1DATABASE SYS.docx
DATABASE SYSTEMS DEVELOPMENT & IMPLEMENTATION PLAN1DATABASE SYS.docx
 
Database Security Assessment Transcript You are a contracting office.docx
Database Security Assessment Transcript You are a contracting office.docxDatabase Security Assessment Transcript You are a contracting office.docx
Database Security Assessment Transcript You are a contracting office.docx
 
Data.docx
Data.docxData.docx
Data.docx
 
Database Design Mid Term ExamSpring 2020Name ________________.docx
Database Design Mid Term ExamSpring 2020Name ________________.docxDatabase Design Mid Term ExamSpring 2020Name ________________.docx
Database Design Mid Term ExamSpring 2020Name ________________.docx
 
Database Justification MemoCreate a 1-page memo for the .docx
Database Justification MemoCreate a 1-page memo for the .docxDatabase Justification MemoCreate a 1-page memo for the .docx
Database Justification MemoCreate a 1-page memo for the .docx
 
Database Concept Maphttpwikieducator.orgCCNCCCN.docx
Database Concept Maphttpwikieducator.orgCCNCCCN.docxDatabase Concept Maphttpwikieducator.orgCCNCCCN.docx
Database Concept Maphttpwikieducator.orgCCNCCCN.docx
 
Database Dump Script(Details of project in file)Mac1) O.docx
Database Dump Script(Details of project in file)Mac1) O.docxDatabase Dump Script(Details of project in file)Mac1) O.docx
Database Dump Script(Details of project in file)Mac1) O.docx
 
Database Design 1. What is a data model A. method of sto.docx
Database Design 1.  What is a data model A. method of sto.docxDatabase Design 1.  What is a data model A. method of sto.docx
Database Design 1. What is a data model A. method of sto.docx
 
DataAGEGENDERETHNICMAJORSEMHOUSEGPAHRSNEWSPAPTVHRSSLEEPWEIGHTHEIGH.docx
DataAGEGENDERETHNICMAJORSEMHOUSEGPAHRSNEWSPAPTVHRSSLEEPWEIGHTHEIGH.docxDataAGEGENDERETHNICMAJORSEMHOUSEGPAHRSNEWSPAPTVHRSSLEEPWEIGHTHEIGH.docx
DataAGEGENDERETHNICMAJORSEMHOUSEGPAHRSNEWSPAPTVHRSSLEEPWEIGHTHEIGH.docx
 

Recently uploaded

POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...jaredbarbolino94
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerunnathinaik
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfUjwalaBharambe
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxsocialsciencegdgrohi
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaVirag Sontakke
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...M56BOOKSTORE PRODUCT/SERVICE
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfadityarao40181
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxAvyJaneVismanos
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 

Recently uploaded (20)

POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developer
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of India
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdf
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptx
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 

Data warehouse design from XML sources

  • 1. Data warehouse design from XML sources Matte0 Golfarelli Stefano Rizzi DEIS - University of Bologna DEIS - University of Bologna Viale Risorgimento, 2 Viale Risorgimento, 2 40136 Bologna - Italy 40136 Bologna - Italy +39-051-642862 +39-051-2093542 [email protected][email protected] Boris Vrdoljak FER - University of Zagreb Unska 3 10000 Zagreb - Croatia +385-(O)l -6129756 [email protected] ABSTRACT A large amount of data needed in decision-making processes is stored in the XML data format, which is widely used for e- commerce and Internet-based information exchange. Thus, as more organizations view the web as an integral part of their communication and business, the importance of integrating XML data in data warehousing environments is becoming increasingly high. In this paper we show how the design of a data mart can be carried out starting directly from an XML source. Two main issues arise: on the one hand, since XML models semi- structured
  • 2. data, not all the information needed for design can be safely derived; on the other, different approaches for representing relationships in XML DTDs and Schemas are possible, each with different expressive power. After discussing these issues, we propose a semi-automatic approach for building the conceptual schema for a data mart starting from the XML sources. Keywords Data warehouse design, Data warehousing and the web, XML 1. INTRODUCTION A large amount of data needed in decision-making processes is stored in the XML (Extensible Markup Language) data format. The structure of XML, composed of nested custom-defined tags that can describe the meaning of the content itself, makes it usable as a semantic-preserving data exchange format on the web. As the Internet has evolved into a global platform for e-commerce and information exchange, the interest in XML has been growing and large volumes of XML data already exist. XML can be considered as a particular standard syntax for the exchange of semi-structured data [l]. One common feature of semi-structured data models is the lack of schema, so that the data is self-describing. However, XML documents can be associated with and validated against either a Document Type Definition (DTD) or an XML Schema, both of which allow the structure of XML documents to be described and their contents to be constrained. DTDs are defined as a part of the XML 1.0 Permission to make digital or hard copies of all or part of this work for
  • 3. personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists. requires prior specific permission and/or a fee. DOLAP ‘01, November 9,2001, Atlanta, Georgia, USA. Copyright 2001 ACM l-581 l3-437-l/Ol/OOl I . ..$5.00. Specification [15], while XML Schemas have recently become a W3C Recommendation [16]. XML Schemas considerably extend the capabilities of DTDs, especially from the point of view of data typing and constraining. With DTDs or Schemas, the applications exchanging data can agree about the meaning of the tags and, in that case, XML reaches its full potential. During the recent years, the enterprises have been asking for support in the process of extracting useful, concise and handy information for decision-making out of the tons of data stored in their expensive and complex information systems. Consequently, a substantial effort has been made in the academic world to devise methodologies for designing data warehousing systems capable of seamlessly integrating information from different sources, in particular from heterogeneous databases. Now, as more organizations view the web as an integral part of their
  • 4. communication and business, the importance of integrating XML data in data warehousing environments is becoming increasingly high. Some commercial tools now support data extraction from XML sources to feed the warehouse, but both the warehouse schema and the logical mapping between the source and the target schemas must be defined by the designer. In this paper we show how multidimensional design for data warehouses can be carried out starting directly from an XML source. Two main issues arise: on the one hand, different approaches for representing relationships in XML DTDs and Schemas are possible, each achieving a different expressive power; on the other, since XML models semi-structured data, not all the information needed for design can be safely derived. Thus, our contribution in this work is twofold: first, we propose a warehouse-oriented review and comparison of the approaches for structuring XML documents; then, we outline an algorithm in which the problem of correctly inferring the needed information is solved by querying the source XML documents and, if necessary, by asking the designer’s help. One alternative approach to design from XML sources consists in first translating them into an equivalent relational schema, then starting from the latter to design the warehouse. Some approaches for translating XML documents into a relational database can be found in the literature, both leaning on the DTD [10][13] or not [5], but insufficient emphasis is given to the problem of
  • 5. determining the cardinality of relationships, which instead has a primary role in multidimensional design. The paper is structured as follows. In Section 2 the basics of multidimensional modeling and design are given, while in Section 3 the design alternatives for modeling relationships in both DTDs 4 0 and XML Schemas are reviewed and discussed. In Section 4 our approach to multidimensional design is presented with reference to a case study, and in Section 5 the conclusions are drawn. 2. MULTIDIMENSIONAL MODELING AND DESIGN It is now widely recognized that an accurate conceptual design is the necessary foundation for building a data warehouse which is both well-documented and fully satisfies requirements. In order to be independent of the specific issues involved in logical and physical modeling, the approach proposed here is referred to the conceptual level, from which the logical schemas of the data marts can be easily derived. 2.1 Conceptual Modeling Several conceptual models for data warehouses were devised in the literature [2][3][6][7][8][14]; they mainly differ for the graphical representation of concepts, with small differences in expressive power. In this paper we will adopt the Dimensional Fact Model (DFM) [7], which represents the data mart by
  • 6. means of a set offact schemas. In the following we will briefly discuss the DFM representation of the main concepts of the multidimensional model with reference to the fact schema CLICK, which describes the analysis of the web site traffic. The reason for choosing this example is that, due to the significant role now played by the web in attracting new clients and supporting sales, analyzing the web server traffic may be crucial for improving the enterprise business. In this context, multidimensional modeling allows many unpredictable complex queries to be answered, such as: . What is the trend for the most and the least accessed pages? . Is there a relationship between business events (for instance, sale promotions in an e-commerce site) and the number of accesses? For such an analysis, we need information about the hostname or IP address of the computer requesting the file, the date and time of the request, and the URL of the file being requested. m o n t h 0 day of week date ( h o l i d a y d o m a i n
  • 7. (category I n a t i o n ) CLICK URL f i l e t y p e 0 n n o . o f c l i c k s A 0 host ( h o s t n a m e / IP address) hour () Figure 1. Fact schema for click-stream analysis In the fact schema shown in Figure 1, the fact CLICK, focus of interest for the decision-making process, is associated to the measures which describe it, i.e. no. of clicks, and to the dimensions determining its minimum granularity, namely host, date, hour, and URL. Each dimension is the root of a hierarchy which determines how the fact may be aggregated and selected significantly for the decision-making process; each hierarchy includes a set of attributes linked by functional dependencies. For instance, the URL of the file being requested determines the file type, and the hostname or IP address of the computer requesting the file determines its domain. The values for the domain attribute can be extracted from the hostname suffix that is indicating either the category (for instance, “.com” for commercial companies) or the nation. Within the DFM, as in all the other conceptual models, a strong relevance is given to functional dependencies, since they represent many-to-one relationships between attributes which enable
  • 8. flexible aggregation of data in OLAP queries [9]. Thus, the main problem in building a conceptual data mart schema is to identify those relationships in the business domain. 2.2 Conceptual Design In most approaches to design of data marts, the conceptual schema is built starting from the (logical or conceptual) schema of the operational sources [2][7][8]. The common core of these approaches consists in navigating the functional dependencies in the source schema in order to determine the hierarchies for the fact. In particular, the methodology to build a fact schema from an E/R schema proposed in [7] consists of the following steps: 1. Choosing facts. 2. For each fact: 2.1 Building the attribute tree. 2.2 Rearranging the attribute tree. 2.3 Defining dimensions and measures For briefly illustrating the methodology, we will use the E/R diagram describing the web site traffic shown in Figure 2. Facts typically correspond to events occurring dynamically in the enterprise world. On the E/R schema a fact may be represented either by an entity F or by an n-ary relationship R. In our example, the fact of primary interest is represented by entity CLICK.
  • 9. Given a portion of interest of a source schema and an entity F belonging to it, we call attribute tree the tree such that: . each vertex corresponds to an attribute - simple or compound - of the schema; . the root corresponds to the identifier of F, . for each vertex V, the corresponding attribute functionally determines all the attributes corresponding to the descendants of v. The attribute tree for F may be constructed automatically by navigating, starting from F, the functional dependencies expressed by many-to-one relationships between entities in the source schema. Each entity E analyzed is represented in the attribute tree by: (1) a node corresponding to the identifier of E, (2) a child node for each of the non-identifier attributes of E; (3) a child node for each entity G connected to E by a many-to-one relationship. 41 time d a t e (IN r-5U R LC A T E G O R Y Figure 2. E/R schema for web site traffk The resulting attribute tree for the web site traffic analysis
  • 10. elements (i.e. sub-elements) and attributes. Both elements and example is shown in Figure 3. Starting from the CLICK entity, attributes are allowed to have values. Document structures can be new vertices are added by following many-to-one relationships. nested to any level of complexity; between the opening and The URL category attribute could not be included in the attribute closing tags for an element, any number of additional elements or tree because there is a many-to-many relationship between URL textual data may be present. Attributes are included, with their and URL category. The nation attribute is included twice in the respective values, within the element’s opening declaration. An attribute tree because it can be added to the attribute tree by XML document that contains data about the web site traffic is navigating from both the host and the site entities. shown in Figure 4. d a t e time <webTraffic> < c l i c k > < h o s t hostId=“ares.csr.unibo.it”> <nation>italy</nation> </host> <date>23-MAY-ZOOl</date> <time>16:43:25</time> curl urlID="BL0023"> <site siteID="www.hr"> <nation>croatia</nation> </site> Figure 3. Attribute tree for fact CLZCK
  • 11. Once the attribute tree has been built, some uninteresting nodes may be dropped from it. The time attribute, that represents the exact time of a click, is replaced with a coarser hour attribute. <fileType>shtml</fileType> <urlCategory>catalog</urlCategory> </url> </click> <click> Finally, dimensions and measures must be selected among the children of the root. In our example, the attributes chosen as dimensions are host, date, hour and URL; number of clicks, determined by counting the clicks from the same host to the same URL on a given date and hour, is chosen as a measure. . . </click> . . . </webTraffic> F i g u r e 4 . A n X M L d o c u m e n t d e s c r i b i n g t h e w e b s i t e t r a f f i c Some further minor arrangements must be made in order to obtain the fact schema in Figure 1; in particular, the date dimension is enriched by building a hierarchy which includes attributes month, day of week, and holiday. Besides, the host category and nation optional attributes are replaced by attribute domain, which indicates either the category or the nation of the host.
  • 12. An XML document is valid if it has an associated schema, such as a DTD or an XML Schema, and if it conforms to the constraints expressed in that schema. Since our methodology for conceptual design is based on detecting many-to-one relationships, in the following we will focus on the way those relationships can be e x p r e s s e d i n t h e D T D a n d t h e X M L S c h e m a . 3. MODELING RELATIONSHIPS IN XML An XML document consists of nested element structures, starting with a root element. Each element may contain component 3.1 Relationships in DTDs A DTD defines elements and attributes allowed in an XML document, and the nesting and occurrences of each element. The 42 structure of an XML document is constrained using element- type and attribute-list declarations. Element-type declarations specify which sub-elements can appear as children of the element; attribute-list declarations specify the name, type, and possibly default value of each attribute associated with a given element type. Among the different attribute types, types ID, IDREF and IDREFS have particular relevance for our approach: the ID type defines a unique identifier for the element; the IDREF type means that the attribute’s value is some other element’s identifier; IDREFS means that its value is a list of identifiers. IDREF(S) must match the value of some ID attribute in the document [ 11.
  • 13. 3. I .2 Modeling relationships by ID and IDREF(S) Another way to specify relationships between elements in DTDs is by means of ID and IDREF(S) attributes. The way these attributes operate resembles the key and foreign key mechanism used in relational databases, with some important differences and limitations. For instance, if we take the last part of the DTD from Figure 5 and let the relationships of the url element be defined by ID and IDREF(S) attributes rather than by sub-elements, we obtain the DTD in Figure 6. 3.1. I Modeling relationships by sub-elements <!ELEMENT webTraffic ( c l i c k * , f i l e T y p e + , urlCategory+) > Relationships can be specified in DTDs by sub-elements that may have different cafdinalities. The optional character following a child element name or list in the element-type declarations determines whether the element or the content particles in the list may appear one or more (+), zero or more (*), or zero or one times (?); the default cardinality is exactly one. Figure 5 presents a DTD according to which the XML document presented in Figure 4 is valid. Element webTraffic is defined as a document element, thus becomes the root of XML documents. A webTraffic element may have many click elements, while in an url element the site sub-element must occur exactly once, followed by onefiJeType and many urlCategory sub-elements. A host
  • 14. element may have either a category or a nation element. <!DOCTYPE webTraffic [ <!ELEMEIW webTraffic ( c l i c k * ) > <!ELEMENT c l i c k ( h o s t , d a t e , t i m e , url) > <!ELEMENT h o s t ( c a t e g o r y 1 n a t i o n ) > <!ATTLIST h o s t hostId ID #REQUIRED> <!ELEMENT c a t e g o r y (#PCDATA)> <!ELEMENT d a t e (#PCDATA)> <!ELEMENT t i m e (#PCDATA)> <!ELEMENT url ( s i t e , fileType, urlCategory+) > <!ATTLIST u r l urlId ID #REQUIRED> < ! ELEMENT site (nation) > <!ATTLIST s i t e siteId ID #REQUIRED> < ! ELEMENT nation (#PCDATA) > <!ELEMENT fileType (#PCDATA)> <!ELEMENT urlcategory (#PCDATA)> I> Figure 5. A DTD where relationships are specified by sub- e l e m e n t s If a one-to-one or one-to-many relationship must be represented in XML, sub-elements with the above mentioned cardinalities can be
  • 15. used without loss of information. However, given a DTD, we can follow only one direction of a relationship. For instance, according to the DTD in Figure 5, an url element may have many urlCutegory sub-elements, but it is not possible to find out, from the DTD, whether an URL category can refer to many URLs. Only by having some knowledge about the domain described by the DTD, we can conclude that the latter is the case. . . . <!ELEMENT u r l ( s i t e ) <!ATTLIST url urlId ID #REQUIRED fileTypeRef I D R E F # R E Q U I R E D urlCategoryRef IDREFS # R E Q U I R E D > <!ELEMENT fileType EMPTY> <!ATTLIST fileType typeID ID #REQUIRED typeDescription CDATA #IMPLIED> <!ELEMENT urlcategory EMPTY> <!ATTLIST urlcategory urlCategoryId ID #REQUIRED urlCateqorvDesc CDATA #IMPLIED> Figure 6. A DTD where relationships are specified by IDREF(S) In this example we assume that the$ZeTypeRefattribute of the url
  • 16. element references the fireType element. On the other hand, an instance of urlCategoryRef references many instances of urZCategory. Therefore, for one URL there is exactly one file type while there may be several categories. The problem is that, using IDREF(S), the participating elements cannot be constrained to be of a certain element type. For instance, jZeTypeRef could also contain a reference to an urZCutegory element, while obviously we would like to constrain such references tofiZeType elements only. Further, though the value of an ID attribute is unique within the whole document, element types are not required to have an ID, and even if an element type has an ID, its usage may be optional. For these reasons, there is no means to actually constrain the allowed relationships by the ID/IDREF(S) mechanism. 3.2 Relationships in XML Schemas XML Schemas give more accurate representation of the XML structure constraints than DTDs; in particular, the cardinal&y can be specified in more detail by using the minoccurs and maxoccurs attributes. An XML Schema consists of type definitions, which can be derived from each other, and element declarations. The possibility of separating an element declaration from the definition of its type enables the sharing and reusing of simple and composite types. Further, besides ID and IDREF(S) attributes, XML Schemas support the use of key and keyRef elements for defining keys and their references. The two different ways of specifying relationships in XML
  • 17. Schemas, i.e. sub-elements and key/keyRef mechanisms, will be described in the following. 43 3.2.1 Modeling relationships by sub-elements Figure 7 presents a part of an XML Schema that defines the same structure and constraints as the DTD in Figure 5. Since, for defining elements, attributes minOccurs and maxOccurs default to 1, and only in the case of the urlCategory element the maxOccurs attribute is set to “unbounded”, an w-1 element will consist of one site, one$leType and many urlcategory elements. We will not further discuss this solution since, from the point of view of cardinality modeling, it is essentially equivalent to the solution in Section3.1.1. < e l e m e n t name=“click”> <complexType> <sequence> <element name="host" type="hostType"/> <element name="date" type="date"/> <element name="time" type="time"/> <element name="url" type="urlType"/> </sequence> </complexType>
  • 18. </element> . . <complexType name="urlType"> <sequence> <element name="site" type="siteType"/> <element name="fileType" type="string"/> <element name="urlCategory" type="string" maxOccurs="unbounded"/> </sequence> <attribute name="urlID" type="string"/> </complexType> . . . F i g u r e 7 . A n X M L S c h e m a w h e r e r e l a t i o n s h i p s a r e s p e c i f i e d b y s u b - e l e m e n t s 3 . 2 . 2 Modeling relationships by key and keyRef elements In addition to ID and IDREF(S) attributes, XML Schemas introduce more powerful and flexible mechanisms, similar to the relational concept of foreign key. The key element is used to indicate that every attribute or element value must be unique within a certain scope and not null. By using keyRef elements, keys can be referenced. For the purpose of describing the declaration of keys and their
  • 19. references, an element namedfileTypes (represented in Figure 8) is added as a sub-element to the click element from the XML document in Figure 4. The fileTypes element consists of type elements with z’ypeZD attributes. Figure 9 presents an evolution of the XML Schema in Figure 7, according to which every click element also contains a fileTypes element. This element has a composite type named fileTypesType (its definition is not presented in Figure 9), according to which the XML document in Figure 8 is valid. XML Schemas allow to specify the scope for each key by means of an XPath expression [17]. In our example, the key element is namedfileTypeKey. The typeID attribute from Figure 8 is specified as the key by means of the selector and thefield sub- elements of the key attribute in Figure 9. The xpath attribute of the selector element contains an XPath expression,$leTypes/type, that selects the type elements that are sub-elements of the fileTypes element. The xpath attribute of the field element contains the @typeZD expression, that specifies the typeZD attribute of the type element as the key. Further, thefileType element, that is a sub- element of w-1, is declared as a keyReJ this means that, for every value ofjZeType, afileTypeKey with the same value must exist. <fileTypes> <type typeID="html">
  • 20. hypertext markup language</type> <type typeID="gif"> graphic interchange format</type> . . . </fileTypes> Figure 8. ThefiZeTypes element as a sub-element of click <element name="click"> <complexType> . . . <element name="url" type="urlType"/> <element name="fileTypes" type="fileTypesType"/> . . . </complexType> <key name="fileTypeKey"> <selector xpath="fileTypes/type"/> <field xpath="@typeID"/> </key> ckeyref name="fileTypeRef" refer="fileTypeKey"> <selector xpath=“url”/> <field xpath="fileType"/> i/keyref> </element>
  • 21. Figure 9. An XML Schema where relationships are specified by key/keyRef In conclusion, the key/keyRef mechanism may be applied to any element and attribute content, and the scope of the constraint can be specified, while an ID is a type of attribute whose scope is fixed to be the whole document. Furthermore, combinations of element and attribute content can also serve as keys and references in XML Schemas. 4. CONCEPTUAL DESIGN FROM XML SOURCES In the previous section, we showed different approaches for representing relationships in DTDs and XML Schemas. Three of them are suitable for specifying relationships: sub-elements in DTDs, sub-elements and key/keyRef in Schemas; though their expressive power is different, in the context of this paper they may be considered to be equivalent since with reference to multidimensional design they allow the same knowledge to be captured. We do not consider the fourth approach, ID/IDREF(S) 4 4 in DTDs, since it is not precise and useful enough in constraining relationships. In this section we propose a semi-automatic approach for building the conceptual schema of a data mart starting from the XML sources. Of the three above-mentioned approaches, we have chosen sub-elements in DTDs for the presentation of our
  • 22. methodology, since Schemas are still not as widely used as DTDs; however, the methodology is essentially the same when dealing w i t h S c h e m a s . Starting with the assumption that the XML document has a DTD and conforms to it, the methodology consists of the following steps: 1. Simplifying the DTD. 2 . Creating a DTD graph. 3. Choosing facts. 4. For each fact: 4.1 Building the attribute tree from the DTD graph. 4.2 Rearranging the attribute tree. 4.3 Defining dimensions and measures. In the following paragraphs we will describe steps from (1) to (4.1) referring to the web site traffic example; once the attribute tree is built, steps 4.2 and 4.3 are identical, respectively, to steps 2.2 and 2.3 described in Section 2.3. Simvlifvina t h e D T D The sub-elements in DTDs may have been declared in a complicated and redundant way. However, those details of a DTD can be simplified [ 131. The transformations for simplifying a DTD include converting a nested definition into a flat representation:
  • 23. f o r i n s t a n c e , i n t h e w e b s i t e t r a f f i c e x a m p l e , host(category /nation) i s t r a n s f o r m e d i n t o host(category?,nation?). Further, the sub-elements having the same name are grouped, and many unary operators are reduced to a single unary operator. Finally, all “+” operators are transformed intO “*” operators. Creating a DTD waph After simplifying the DTD, a DTD graph representing its structure can be created as described in [lo] and [13]; its vertices correspond to elements, attributes and operators in the DTD. Attributes and sub-elements are not distinguished in the graph since, in our methodology, they are considered as equivalent nesting mechanisms. The DTD graph for the DTD in Figure 5 is given in Figure 10. Definina facts The designer chooses one or more vertices of the DTD graph as facts; each of them becomes the root of a fact schema. In our example, we choose the click vertex as the only interesting fact. Building the attribute tree The vertices of the attribute tree are a subset of the element and attribute vertices of the DTD graph. The algorithm to build the attribute tree is sketched in Figure Il. webTraffic 1 click
  • 24. host d a t e t i m e url Figure 10. DTD graph for web site traffic analysis root=newVertex (F) ; // newVertex(<vertex>) r e t u r n s a n e w v e r t e x // o f t h e a t t r i b u t e t r e e // c o r r e s p o n d i n g t o < v e r t e x > // o f t h e D T D g r a p h expand(F,root) ; expand(E,V) : // E i s t h e c u r r e n t DTD vertex, // V i s t h e c u r r e n t a t t r i b u t e t r e e v e r t e x { f o r e a c h c h i l d W o f E d o if W i s e l e m e n t o r a t t r i b u t e d o { next=newVertex(W); addchild (V, next) ; // a d d s c h i l d W t o V expand (W, next ) ; else if W="?" do expand (W, V) ; for each parent Z of E such that Z i s n o t a d o c u m e n t e l e m e n t d o if Z=“3” or Z=“*” do expand(Z,V) ; e l s e if n o t toMany(E,Z) do
  • 25. if askDesignerToOne (E, Z) do { next=newVertex(Z); addChild(V,next) ; expand (Z, next) ; 1 Figure 11. Algorithm for building the attribute tree The attribute tree is initialized with the fact vertex F; then, it is enlarged by recursively navigating the functional dependencies between the vertices of the DTD graph. Each vertex V inserted in the attribute tree is expanded as follows (procedure expand): 1. For each vertex W that is a child of V in the DTD graph: When examining relationships in the same direction as in the DTD graph, the cardinality information is expressed either explicitly by “?” and “*” vertices or implicitly by their 4 5 2 . For each vertex Z that is a parent of V in the DTD graph: absence. If W corresponds to an element or attribute in the since there is no need for the existence of both host and hostld DTD, it is added to the attribute tree as a child of V; if W is a vertices, only host should be left, the same logic should be applied “ ? ” operator, its child is added to the attribute tree as a child of for urlld and siteld. Finally, the time attribute is replaced with the
  • 26. V; if W is a “*” operator, no vertex is added. hour attribute. When examining relationships in this direction, vertices corresponding to “*” and “?,, operators are skipped since they only express the cardinality in the opposite direction. Since the DTD yields no further information about the relationship cardinality, it is necessary to examine the actual data by querying the XML documents conforming to the DTD. This is done by procedure checkToMany, which counts the number of distinct values of Z corresponding to each value of E. If a - to-many relationship is detected, Z is not included in the attribute tree. Otherwise, we still cannot be sure that the cardinality of the relationship from E to Z is -to-one. In this case, only the designer can tell, leaning on her knowledge of the business domain, whether the actual cardinality is -to-one or -to-many (procedure askDesignerToOne). Only in the first case, Z is added to the attribute tree. The reason why document elements are not considered is that they have only one instance within XML documents, thus they have no interest for aggregation and should not be modeled in the data m a r t . In the following some general considerations on the proposed approach are reported. The problem of checking cardinalities in XML documents is related to that of discovering functional dependencies in relational databases, which was widely investigated in the literature on relational theory and data mining [11][12]. In our case, the situation is much simpler since no inference is necessary, so it comes down to properly querying the data with an XML query language supporting aggregate queries. For instance, in W3C XQuery [18] the use of the distinct function is proposed for that purpose, while the use of the group-by function is proposed in
  • 27. [4]. The main question arising is how many XML documents we must see to reasonably confirm our presumption that the cardinality is -to-one. Clearly, the semi-structured nature of XML sources increases the level of uncertainty on the structure of data as compared to E/R sources, thus making recourses to the designer’s experience more frequent. In our algorithm, we chose to ask questions interactively during the tree building phase in order to avoid unnecessary queries on the documents. An alternative solution consists in first building the tree by emphasizing all the uncertain relationships, then handing on the complete tree to the designer in order to have it rearranged (step 4.2) with specific attention to uncertain relationships, which could be dropped together with their subtree. While this solution allows the designer to have a wider point of view on the tree, it may be less efficient since a vertex, dropped later, might have been (uselessly) expanded by querying the XML d o c u m e n t s . d a t e t i m e u r l l d host category Q P P url f i l e t y p e 0
  • 28. n a t i o n site s i t e l d 0 F i g u r e 1 2 . A t t r i b u t e t r e e d e r i v e d f r o m t h e D T D g r a p h In our example, no uncertain relationships are navigated. Vertex urlCategory is not added to the attribute tree because it is a child of a -*I’ vertex in the DTD graph. The resulting attribute tree for the web site traffic analysis example is given in Figure 12. Some further arrangements should be made to this tree: for instance, As a matter of fact, the problem of inferring the relationship cardinalities is present also when the source to be used for design is a relational schema [7]. In fact, the presence in a relation R of a foreign key F referencing the primary key K of a relation S, implies that F functionally determines K and, consequently, all the other attributes of K, but tells us nothing about the number of distinct tuples of R related to each tuple of S. Thus, in principle, in order to guess the uncertain cardinalities we should query the database as in the case of XML sources. On the other hand, in the relational case this issue is much less relevant than in the XML case. In fact, while the designer of an XML document chooses the webTraffic
  • 29. 7 ” category F i g u r e 1 3 . A n o t h e r p o s s i b l e D T D g r a p h d e s c r i b i n g t h e w e b s i t e traffk 4 6 direction of each link without considering the cardinality of the relationship to be modeled, the designer of a relational schema is constrained (by the need to satisfy the first normal form) to represent each relationship in the -to-one direction. Thus, in general, the relationship from S to R is one-to-many, hence, not interesting for multidimensional modeling; the only case in which it might be interesting is when the foreign key mechanism has been used by the designer to model a one-to-one relationship - but this is not very frequent. As seen in Section 3, several DTDs representing the same subject may be designed; for each of them, the resulting attribute tree may look different. For instance, the attribute tree for the DTD graph in Figure 13 is presented in Figure 14. Having click as a fact, navigating from hostld to host and from urlld to url entails analyzing the data to check the uncertain relationship. After inverting hostld with host and urlld with url (it can be done
  • 30. since they are related by a one-to-one relationship), the resulting attribute tree becomes the same as the one in Figure 12. h o s t c a t e g o r y date t i m e nation hostld r- urlld url t i l e t y p e A n 0 c l i c k siteld site () 0 0 nation F i g u r e 1 4 . A t t r i b u t e t r e e f o r t h e D T D g r a p h i n F i g u r e 1 3 5. CONCLUSIONS In this paper we described a semi-automatic approach to conceptual design of a data mart from an XML source. We showed how the semi-structured nature of the source increases the level of uncertainty on the structure of data as compared to structured sources such as database schemas, thus requiring access to the source documents and, possibly, the designer’s help in order
  • 31. to detect -to-one relationships. The approach was described with reference to the case in which the sources are constrained by a DTD using sub-elements, but it can be adopted equivalently when XML Schemas a r e c o n s i d e r e d . Using XML sources for feeding data warehouse systems will become a standard in the next few years. Adopting a technique to derive the data mart schema directly from the XML sources is not the only possible approach: the data mart schema may also be designed “manually”, meaning that facts, measures and hierarchies are determined starting from the user requirements and the logical connection with the source schemas is established only a posteriori. On the other hand, the main problem with this solution is that, very often, the requirements expressed by the users cannot be fully supported by the existing data; besides, the process of mapping each requirement back to the source schema m a y b e v e r y c o m p l e x . 6. REFERENCES VI PI [31 [41 PI [61
  • 32. [71 PI PI Abiteboul, S., Buneman, P., and Suciu, D. Data on the Web: F r o m R e l a t i o n s t o S e m i s t r u c t u r e d D a t a a n d X M L . M o r g a n Kaufman Publishers, 2000. Cabibbo, L., and Torlone, R. A logical approach to multidimensional databases. In Proc. EDBT, 1998. Datta, A., and Thomas, H. A conceptual model and algebra for on-line analytical processing in data warehouses. In Proc. W I T S , 1 9 9 7 . D e u t s c h , A . , F e m a n d e z , M . , F l o r e s c u , D . , L e v y , A . , a n d Suciu, D. A Query Language for XML. In Proc. 8th World Wide Web Conference, 1999. Florescu, D., and Kossmann, D. Storing and Querying XML Data using an RDBMS. IEEE Data Engineering Bulletin 22, 3, 1999. Franconi, E., and Sattler, U. A data warehouse conceptual model for multidimensional aggregation. In Proc. DMDW, 1999. Golfarelli, M., Maio, D., and Rizzi, S. The Dimensional Fact Model: a conceptual model for data warehouses. Int. Jour. of Cooperative Inf. Systems 7,2&3, 1998. Hiisemann, B., Lechtenborger, J., and Vossen, G. Conceptual
  • 33. data warehouse design. In Proc. DMDW, 2000. Kimball, R. The data warehouse toolkit. John Wiley & Sons, 1996. [lo] Lee, D., and Chu, W.W. Constraints-preserving T r a n s f o r m a t i o n f r o m X M L D o c u m e n t T y p e D e f i n i t i o n t o Relational Schema. In Proc. 19” ER (Salt Lake City), 2000. [ 111 Mannila, H., and Raiha, K.J. Algorithms for inferring functional dependencies. Data & Knowledge Engineering, 12, 1, 1994. [ 121 Savnik, I., and Flach, P. Bottom-up induction of functional dependencies from relations. In Piatesky-Shapiro (ed.), Knowledge Discovery in Databases, AAAI, 1993. [ 131 Shanmugasundaram, J., et al. Relational Databases for Querying XML Documents: Limitations and Opportunities. In Proc. 25” VLDB (Edinburgh), 1999. [ 141 Vassiliadis, P. Modeling multidimensional databases, cubes and cube operations. In Proc. 10th SSDBM Conf. (Capri, Italy), 1998. [ 151 World Wide Web Consortium (W3C). XML 1 .O Specification. http://www.w3,org/TR/2000/REC-xml- 20001006. [16] World Wide Web Consortium (W3C). XML Schema. http:Nwww.w3.org/XML/Schema. [17] World Wide Web Consortium (W3C). Xpath Specification
  • 34. 1 .O. http:llwww.w3.orglTRJxpath. [18] World Wide Web Consortium (W3C). XQuery 1.0: An XML Query Language (Working Draft), http:llwww.w3.orglTIUxqueryl. 47