Clustering Homogenous XML Documents (CS501 Final Report) (1)

Page | 1
PORTLAND STATE UNIVERSITY
Clustering Homogenous
XML Documents
Evaluation of current clustering methodologies
and the proposal of signature-based clustering
By
Abdussalam Alawini
8/12/2011
Supervised by:
Prof. David Maier

Page | 2
Table of Contents
Abstract.........................................................................................................................................................3
1. Introduction ..........................................................................................................................................3
1.1. Problem Statement.......................................................................................................................4
2. Background ...........................................................................................................................................6
2.1. XML Data Model ...........................................................................................................................6
2.2. XML Structural Clustering .............................................................................................................6
2.3. Important definitions....................................................................................................................6
2.3.1. Structured Product Labeling (SPL) ........................................................................................6
2.3.2. Homogenous XML documents..............................................................................................6
2.3.3. Heterogeneous XML documents ..........................................................................................7
3. XML Structural Clustering Approaches.................................................................................................7
3.1. Tree-edit Distance.........................................................................................................................7
3.2. Graph-based Clustering ................................................................................................................8
3.3. Tree-based Clustering.................................................................................................................10
4. Why a New Approach is needed for Clustering Homogenous XML Documents................................14
4.1. Losing important structure information.....................................................................................14
Tree-edit distance ...............................................................................................................................14
Graph-based clustering.......................................................................................................................15
Tree-based Clustering.........................................................................................................................16
4.2. Expensive computation time ......................................................................................................17
Tree-edit distance ...............................................................................................................................17
4.3. Poor quality Clusters for homogenous XML documents ............................................................17
S-graph-based Clustering....................................................................................................................17
Tree-based Clustering.........................................................................................................................18
5. Proposed Abstract Solution: Signature-based Clustering...................................................................18
5.1 Stage 1: Extracting XML Structural Summary ...................................................................................19
5.2 Stage 2: Performing initial signature-based clustering.....................................................................20
5.3 Stage 3: Refining clusters based on user input.................................................................................21
Appendix A: XML Summary Extractor Code................................................................................................24

Page | 3
Abstract
Huge amount of information is being stored as XML documents due to the standardization of XML as
information exchange language over the internet. This means that more applications need to process
XML documents for different purposes. One of the important applications is grouping XML documents
that have a single XML Schema or DTD based on their structure, especially for XML documents with ill-
defined XML Schema or DTD. Mining such structural information may provide important clues about the
meaning of the data enclosed in these documents. This paper evaluates the application of current XML
structural clustering approaches (tree-edit, graph-based and tree-based clustering approaches) on
homogenous XML documents. The results suggest that these approaches are not efficient for clustering
this kind of XML document, hence new approach has to be developed for clustering XML documents
that are based on single Schema or DTD. This paper also provides a detailed description of an abstract
design of signature-based clustering approach that is developed for clustering homogenous XML
documents.
1. Introduction
Extendible Markup Language (XML) is the internet standard for exchanging structured data over the web
[1]. Its powerful ability for managing, displaying and organizing data as well as providing interoperability
and enabling automatic processing of Web resources drive many real-world applications to quickly
adapted [1]. The major advantage of XML over HTML is that XML can describe the content and structure
of the data it encompasses while HTML focuses only on presenting instructions to web browsers on how
the data it encompasses should be displayed [1]. As a result, more information is being stored in more
structurally rich documents, from HTML to XML [2].
With more applications adapting XML as a standard for transferring structured data over the web, the
need to automatically process those documents for information extraction, similarity clustering, and
search applications is significantly increasing [2]. XML data processing and management as well as
content mining have been a very popular research area. However, the focus on mining XML documents
based on their structure information is still emerging. Due to the fact that XML documents’ well
organized structure may provide very important clues on what the content of the data might be, more
researchers are focusing on mining XML documents structural information.
Grouping XML documents with similar structure is an important functionality for many applications such
as XML query processing and information extraction [2]. The focus of many of XML clustering
approaches has been on clustering XML documents from different sources [3][4][6][7][8][9]; for
example, documents with different XML Schema, different DTD or no schema at all. In contrast,
clustering XML documents based on single Schema or DTD has not received strong attention. Clustering
these homogenous documents is very important for many applications especially when the XML Schema
for a group of documents is ill-defined.
In this research, I first present the main three approaches for clustering XML documents based on their
structure. Then I show that these approaches are not efficient for clustering homogenous XML

Page | 4
documents, and that a new approach is needed especially for clustering those kinds of documents.
Finally, we present an abstract design for a novel homogenous clustering approach called signature-
based clustering. This paper is organized as following: section 1 introduces the paper and describe the
problem statement, section 2 gives some important background information related to the paper topic,
section 3 presents three different clustering approaches for heterogeneous XML documents, section 4
discusses why these approaches are not appropriate for clustering XML documents that based on a
single Schema or DTD, section 5 discusses signature-based clustering as a solution, and finally future
work and research conclusion are presented in section 6.
1.1.Problem Statement
The first question that comes to your mind when you hear “clustering XML documents that are based on
single Schema or DTD” is that why we need to cluster such documents if all these documents have
similar structure rules? The simple answer is that some XML Schemas or DTDs may be ill-defined and
allow for XML documents to contain data that is not well structured. Mining XML structural information
for such documents may give important hints on how these documents could be better structured.
To better understand this problem, we will discuss this issue in a real-life application; Structured Product
Labeling (SPL) for FDA drug establishing registration and drug listing [15]. SPL documents contain two
major parts: product data elements and content of labeling. The first part is very well structured and
each data item is assigned to one element. These elements contain essential drug/device information
such as drug name, ingredients…etc. The second part contains the content of product labeling along
with additional machine readable information. Elements in this part contains sections such as adverse
reactions, over dosage, etc. that can be found in regular drug labels. In contrast to the first part, these
sections don’t encompass structured data. The way elements are used in these sections is very similar to
the way HTML elements are used; each component has a text element which contains the whole textual
content of a section. These text elements do not contain any descriptive elements; all elements are
mainly used for browsing proposes. The only elements allowed inside the text element are formatting
elements such as font, color, table, list, etc.
Some of the sections are very important such as adverse reactions and over dosage. For example, a
simple task of listing all the adverse reactions by body system or ingredients cannot be performed in
such unstructured data. Having this sensitive information in human readable format (with no machine
readable descriptive elements) deprives any SPL tools from using XML technologies to retrieve or check
the content of SPL sections. Without the descriptive elements inside labeling sections, it is very hard to
automate any task for processing the data within product labeling sections.
However, elements (which contain no descriptive information) that construct the text element may
contain some important structural information that may be exploited as clues for understanding the
meaning of the data stored in these elements. For instance, the text tag of adverse reactions section
may contain several paragraph elements that represent a structural division such that each paragraph is
dedicated for each body system reaction or for each ingredient affect. . Drug manufactures may be
using similar standards for describing product labeling information within SPL product labeling sections.

Page | 5
Thus, mining structural information of these sections may reveal some patterns that can be useful for
understanding the meaning of the text in these
Figure 1 shows adverse reactions section (a) with its corresponding XML structure (b) as an example of
how structure information could potentially lead to content understanding. Adverse reactions section in
this example is organized human body system such as skin, renal toxicity etc. The structure of this
document could lead to understand text content. As we can see in figure 1 (b) the text element consists
of several paragraph elements; each of which contain a body system part embedded in content
element. Understanding this structure will certainly lead to reveal the content meaning.
In perusing solution for this problem, I first evaluated the most common, and successful, approaches for
clustering XML documents. Unfortunately, and to the best of my knowledge, all these approaches are
developed for XML documents that are based on different XML Schema or DTD. Thus, a new approach
to cluster homogenous XML documents is a necessity.

Page | 6
2. Background
2.1.XML Data Model
Unlike structured databases, XML documents are considered to be semi-structured data that have
flexible structure. Document Type Description (DTD) as well as XML Schema are used to contain the
structure of an XML document. DTD makes use of regular languages to precisely describe a document
structure. XML documents may have no document structure container such as DTD or Schema. In such a
case, XML documents have to be a well formed document that must confirm to XML grammar. In a well-
formed document, all tags must be well parenthesized and the documents must contain a single root
[4].
XML documents can be represented by ordered labeled trees. The root of the documents is the root
node of the tree and tag names corresponds to the labels of the tree nodes. In XML tree, each node may
have any number of children nodes [3].
2.2.XML Structural Clustering
Information extraction, search applications, similarity grouping are some of the applications that require
the use of data mining techniques on XML documents. However, until recent time, mining techniques
were applied to XML documents without considering their structural information. XML documents were
mined based on their content which is treated as bag of text. When excluding structural information,
important data relations and logical information that can be used as meaning clues will be lost [2].
XML structural clustering is one of the most common methods of utilizing XML structural information. It
is defined as the process of grouping XML documents (or sections of documents) based on their
structural similarity. As an example, in information extraction applications, structural information can
help in sorting a large number of different sites’ pages into sets that are comparable. Then software
application can group the set of documents that have similar structure which will help the information
extraction algorithm to produce more accurate results.
2.3. Important definitions
2.3.1. Structured Product Labeling (SPL)
Structured Product Labeling (SPL) is a document markup standard, approved by health level 7
(HL7), that is used by FDA as mechanism of exchanging product information. SPL defines the
content of a drug labeling in an XML format. SPL documents have to adhere to SPL schema that
defines the document structure and format [4].
2.3.2. Homogenous XML documents
Homogenous XML documents are XML documents that adhere to a single XML Schema or DTD
rules. For example, FDA has a database of about 10,000 SPL file that are all based on one single
XML Schema called SPL Schema.

Page | 7
2.3.3. Heterogeneous XML documents
Heterogeneous XML documents are those documents that are based on different XML Schema
or DTD rules. For example, RSS feed XML documents that are generated from different news
agency may have different XML Schemas
3. XML Structural Clustering Approaches
In this section, I provide a detailed explanation of how the three major XML structural clustering
approaches, namely: Tree-edit distance, graph-based clustering, and tree-based clustering approaches,
perform structural clustering on XML documents. By understanding how these methods work, we will be
able to study and analyze their shortcomings and limitations on clustering homogenous XML
documents.
3.1. Tree-edit Distance
In order to define tree-edit distance, we first need to define tree-edit sequence. Tree-edit sequence is a
sequence of tree editing operations such as insert-node, delete-node, replace-node and move sub-tree
that transforms one rooted ordered tree to another [6]. To better understand tree-edit sequence, figure
2 presents an example of transforming XML tree ‫ݐ‬ଵ ‫ݐ ݋ݐ‬ଶ by performing delete, insert and replace
operations on ‫ݐ‬ଵ.
Figure 2: Tree-edit sequence operations - Source [6]
If we have two rooted ordered labeled trees ‫ݐ‬ଵand ‫ݐ‬ଶ,and we have a cost model that assign a fixed cost
for every tree edit operation, then tree edit distance between these two trees “is the minimum cost
between the costs of all possible tree edit sequences that transform ‫ݐ‬ଵto ‫ݐ‬ଶ” [6].
There are several different tree-edit distance algorithms that are all based on dynamic programming
techniques including Selkow’s algorithm [7], Zhang’s algorithm [8], Chawathe’s algorithm [9], and
Chawathe’s algorithm II [10]. All of these algorithms have a common issue which is the high
computational cost that is required to calculate the edit-distance between any two XML trees.

Page | 8
Dalamagasa, T et. al. suggest an efficient tree-edit based algorithm for clustering XML documents by
structure [6]. This algorithm makes use of structural summaries which is a compact tree representation
that reduces node repetition and nesting of the actual XML document tree structure. Figure 3 shows an
example of extracting a structural summary of original XML tree. Tree ܶଷrepresents a structural
summary of ܶଵ after reducing node nesting (ܶଶ) as well as reducing node repetition (ܶଷ). The use of
structural summaries, as claimed by the authors, do not affect clustering results and reduces the time
needed to go over the entire XML tree which will increase efficiency and reduce computation time.
Figure 3: XML tree nesting and reputation reduction – Source [6]
Figure 4 illustrates the steps of Dalamagasa algorithm. Initially, each tree in the data set is considered to
be a single cluster. Next structural summaries are created by applying nesting and repetition reduction.
Then process of calculating structural distance between clusters begins. Structural distance between any
two trees ‫ݐ‬ଵܽ݊݀ ‫ݐ‬ଶ is defined as the tree-edit distance between two structural summaries divided by the
cost to delete all nodes from ‫ݐ‬ଵ and insert all nodes from ‫ݐ‬ଶ [6]. The clusters with the minimum distance
are merged and the distance between the new cluster and the remaining clusters is recalculated. The
process of merging clusters and recalculating distance between them continuous until clusters reach
some of level of defined quality.
Figure 4: Tree-edit approach for cluster XML document by structure – source [6]
3.2.Graph-based Clustering
For data with no apparent or semi-structured structure, labeled graph are commonly used to represent
the structure of such data [12]. A graph data structure is a set of ordered pairs. Each pair consists of
nodes or vertices and arcs or edges. A graph is denoted by (N,E) where N is the set of nodes and E is the
set of relations where (x,y) ϵ E iff x points to y.

Page | 9
S-GRACE algorithm [11] and Shanmugasundaram et. al. algorithm [13] are two XML Structural clustering
algorithms that utilizes graph for representing XML documents structure and then perform clustering
based on the new graph representation. In this paper, we will focus on S-GRACE algorithm [11] due to
the fact that the structure graphs are extracted from XML documents and not from documents’ DTD as
it is the case in Shanmugasundaram et. al. algorithm [13].
To understand how S-GRACE algorithm works, we first need to define structure graph (s-graph). The s-
graph for a set of XML documents C (denoted by sg(C)) is a directed graph (N,E) where N is the set of all
the elements and attributes in the documents in C and (x,y) ϵ N iff x is a parent element of element y or
y is an attribute of element x in some document in C [11]. To illustrate how s-graph can capture a set of
XML documents structure, Figure 5 presents an example of two XML documents represented by a single
s-graph. The labels in this graph represent the set of nodes in either doc1 or doc2, while the directed
arrows indicate parent-child/element-attribute relationships between nodes in either doc1 or doc 2.
Figure 5: S-graph representation of doc1 and 2 – source [11]
S-graph clustering approach is implemented in two steps: in step one, structural information is extracted
and encoded. In this step, all XML documents are scanned and their corresponding s-graph is computed
then encoded in a data structure called SG. Each SG record consists of two fields: a bit string encoding of
s-graph and a set of XML documents ids of all the documents whose s-graphs are represented by this bit
string [11]. Figure 6 illustrates an example of three XML documents and how they are encoded within SG
data structure. In this example, we can see that doc1 and doc2 are represented by the same bit string as
they have similar structure. Doc3 has its own s-graph as it defers from doc1 and doc2.

Page | 10
Figure 6: SG data structure – Source [11]
Step two involves the creation of clusters based on the structural information of the XML documents
that have been encoded as s-graphs in the previous step. Since s-graphs are encoded as bit strings, the
problem has changed from clustering XML documents to clustering smaller sets of bit strings which
increases efficiency and reduces computation time. For performing clustering over s-graphs, S-GRACE
algorithm is used.
S-GRACE clustering algorithm is a representative categorical algorithm that is considered to be a
hierarchical clustering algorithm on XML documents. It applies ROCK which is a suitable algorithm for
clustering categorical data such as s-graphs. What makes S-GRACE algorithm superior to pure tree-edit
distance algorithms is that it considers common neighbors metric along with distance metric between
clusters. In order to compute common neighbors between any two s-graphs, the distance between each
pair is computed and stored in a matrix called DIST. If the distance between any two s-graph is less than
a threshold Θ, then they are considered to be neighbors. Using the distance matrix (DIST), S-GRACE can
calculate the common neighbors metric. So if two s-graphs share a large number of neighbors, then they
will be merged in one cluster even if they are not close in distance. Link property is used to merge
appropriate clusters’ pairs. Link is defined as following: “for any pair of clusters ‫ܥ‬௜, ‫ܥ‬௝, link[‫ܥ‬௜, ‫ܥ‬௝] is the
number of cross links between elements in ‫ܥ‬௜, ܽ݊݀ ‫ܥ‬௝” [11].
3.3.Tree-based Clustering
Tree-based clustering approach is based on the notion of XML cluster representative. Each group of
similar XML documents (XML cluster) will be represented by XML tree that summarizes the most
relevant features of a group of XML documents within a cluster. The most important advantage of
cluster representatives is the compact representation of a group of XML files while retaining the
specifics of all the documents within the cluster it represents. As a result of utilizing cluster
representatives, the overall computation of XML clusters will be reduced due to the fact that the
similarity search is now performed on a much smaller tree (i.e. clusters representatives) and not on the
much larger original XML trees [14].
XRep [14] is one of the algorithms that utilizes the notion of tree-based clustering. This algorithm adapts
a hierarchical approach in clustering XML documents which has proven to result in high quality clusters
[14]. Figure 7 shows an abstract of XRep hierarchical algorithm. First, each XML tree in the set S of XML

Page | 11
trees (‫ݐ‬ଵ,,,‫ݐ‬௡) is considered as a single tree cluster ‫ܥ‬௜ that is represented by the original XML tree. Then
tree-edit distance between each pair of clusters representation is computed and stored in distance
matrix Md. The distance between each pair of clusters is computed by the following formula:
݀௝ሺ‫ݐ‬ଵ, ‫ݐ‬ଶሻ ൌ 1 െ
|‫݄ݐܽ݌‬ሺ‫ݐ‬ଵሻ ∩ ‫݄ݐܽ݌‬ሺ‫ݐ‬ଶሻ|
max ሼ|‫݄ݐܽ݌‬ሺ‫ݐ‬ଵሻ|, |‫݄ݐܽ݌‬ሺ‫ݐ‬ଶሻ|ሽ
Where ‫݄ݐܽ݌‬ሺ‫ݐ‬௜ሻ is the set of paths in ‫ݐ‬௜ , ‫݄ݐܽ݌‬ሺ‫ݐ‬௜ሻ ∩ ‫݄ݐܽ݌‬ሺ‫ݐ‬௝ሻ is the set of common paths between
‫ݐ‬௜ ܽ݊݀ ‫ݐ‬௝ [14]. It is important to notice that the distance is measured based on the similarity on paths
and that duplicated paths are eliminated. Next, the algorithm goes on a loop where the following steps
are performed until an optimal clusters partition is reached:
• Choose pair of clusters, ‫ܥ‬௜ and ‫ܥ‬௝, that has minimal distance
• Compute the tree representation r of cluster C which is the union of ‫ܥ‬௜ and ‫ܥ‬௝
• Update the set of clusters to reflect the new merge between ‫ܥ‬௜ and ‫ܥ‬௝ and update the distance
matrix ‫ܯ‬ௗ
Figure 7: XRep algorithm – Soruce [14]
Computing XML representatives for clusters with two or more trees is the most challenging process in
XRep algorithm. Any approach used to compute cluster summary tree has to be able to include the most
relevant specifications of all documents that a cluster includes while maintaining a compact
representation tree. XML tree matching and merging is the heuristic approach that the notion of cluster
representative relies on. An optimal matching tree is initially built for a given set of XML documents.
Structural resemblances that characterize the original documents are used to build the optimal tree of a
given cluster. Then a merge tree that includes all the characteristics of a cluster is built. The goal of this
tree is to include substructures that are not recurring across the cluster documents. Then the process of
creating a cluster representative starts by pruning the merge tree while keeping the optimal tree as a
measuring level to the lower-bound until the representation tree is extracted.
The pruning approach for computing cluster representatives is conducted in three stages: optimal tree
construction stage, computing merge tree stage and the pruning of the merge tree stage. Before we
explain the first stage, two important notions have to be defined: strong matching and matching tree.
For any two given nodes (u, v), where u ∈ ‫ݐ‬௜, v ∈ ‫ݐ‬௝ we say that these nodes have strong matching if

Page | 12
node u and v share both the same tag names and depth level. Tree t is a matching tree, denoted by t=
match (‫ݐ‬௜, ‫ݐ‬௝ሻ, if the following conditions hold:
1. Two mappings f1 : t →‫ݐ‬௜ and f2 : t →‫ݐ‬௝ associating nodes and edges in t with a sub-tree in ‫ݐ‬௜ and
‫ݐ‬௝ are exists
2. nodes in the matching tree t have strong matching with nodes in ‫ݐ‬௜ ܽ݊݀ ‫ݐ‬௝
3. The root of the matching tree is the same root of ‫ݐ‬௜ ܽ݊݀ ‫ݐ‬௝ as well as any edge in the matching
tree
Can be found in ‫ݐ‬௜ ܽ݊݀ ‫ݐ‬௝
Figure 8: (a) Strong matching, (b) multiple matching, (c) matching trees – Source [14]
In stage 1, the process of XML tree matching is used to construct the optimal tree of a cluster. This
process consists of three parts: matching detection, selection of best matching and the construction of
the optimal tree. A matrix ‫ܯ‬௠ of |ܸ௧೔
X ܸ௧ೕ
| is built where ܸ௧೔
is the number of nodes in XML tree ‫ݐ‬௜ and
ܸ௧ೕ
is the number of nodes in XML tree ‫ݐ‬௜. ‫ܯ‬௠ rows represent nodes of ‫ݐ‬௜ and its columns represent
nodes of ‫ݐ‬௜ (i.e. element (i,j) of ‫ܯ‬௠ corresponds to nodes ‫ݒ‬௜ ∈ܸ௧೔
and ‫ݓ‬௝ ∈ܸ௧ೕ
) and stores a weight
‫ݓ‬௠(‫ݒ‬௜, ‫ݓ‬௝) that represents the matching between ‫ݒ‬௜ and ‫ݓ‬௝. If ‫ݓ‬௠(‫ݒ‬௜, ‫ݓ‬௝) = 0 then there’s no match
between ‫ݒ‬௜ and ‫ݓ‬௝. If ‫ݓ‬௠(‫ݒ‬௜, ‫ݓ‬௝) = 1 then a strong matching exists. Figure 8 show an example strong (a)
and multiple matching (b) with their corresponding matching trees (c).

Page | 13
Figure 9: Matching and selection matrixes – Source [14]
For selecting the best matching, the weight of each node is calculated by taking into account the
matching of children nodes as well. By doing so, the issue of multiple matching can be overcame by just
considering nodes with high matching weights. A marking vector is used to track the overall best
matching. A marking vector ܸ௠ଵ = {j∗‫ݒ‬ଵ, . . . , j∗‫ݒ‬௡}, whose generic i-th entry indicates the node of ‫ݐ‬ଶ
with which ‫ݒ‬௜ ∈ܸ௧భ
has the best matching. ܸ௠ଵ [i] is set to −1 for the node ‫ݒ‬௜ ∈ܸ௧భ
with no matching
[14]. An example of best matching matrix with its marking vectors is shown in Figure 9 (b)(c).The final
part in stage one is the construction of optimal matching tree which can be simply done by exploiting
the marking vectors by removing all nodes with no matching. Figure 10 (a) shows the resulted optimal
matching tree of the two XML trees in Figure 9 (a).
Figure 10: (a) Optimal matching tree, (b) Merge tree, (c) Cluster representative – Source [14]

Page | 14
In stage two, the computation of a merge tree is done. Merge trees resemble the idea of optimized
union. In its simplest form, a merge tree could be the union of two trees. However, to avoid duplicate
nodes, optimal matching tree, computed in stage 2, must be used to avoid adding duplicated nodes. A
merge tree is simply the optimal tree plus the nodes that has no matching in either tree plus the nodes
that have multiple matching. Going back to Figure 10 (b), we see that the merge tree is the optimal tree
plus nodes with no matching from t1 and t2 (8, 11 from t1 and 9, 10, 11 from t2), plus nodes with
multiple matchings from t1 and t2 (nodes 2, 5 from t1 and 8 from t2).
Now with optimal matching tree and merge tree are ready, the final stage of creating a cluster
representative is ready to begin. In this stage, nodes are removed from the leaf level of the merge tree.
Each time a node is removed, the distance between the refined merge tree and the original trees in the
cluster is calculated. The process of removing nodes stops when the distance between the refined
merge tree and the original XML trees on the cluster cannot be further decreased. At this point, the
cluster representative is the final version of the refined merge tree.
Cluster representative is used to compare the cluster with other clusters and merge clusters that have
similar structure as described in XRep algorithm above. If any new cluster added to the cluster, the
cluster representative has to be updated to reflect the new merged cluster(s).
4. Why a New Approach is needed for Clustering Homogenous XML
Documents
4.1. Losing important structure information
Tree-edit distance
One of the major advantages discussed in algorithm [6] is the introduction of structural summaries
to calculate tree-edit distances between XML trees. The use of structural summaries improves tree-
edit algorithm efficiency by decreasing computation time as the algorithm will perform pair-wise
comparisons on much smaller trees. Structural summaries also provide more accurate clustering
results due to the removal of nesting and repetition in original trees. However, all these claims make
sense only when tree-edit algorithm is applied to a set of heterogynous XML documents.
Dalamagasa,T. et. all state that “A tree edit algorithm will output a large distance between two XML
documents which are based on the same DTD, with one of the two being quite long due to many
repeated elements.” [6] As we explained in the problem statement section above, our problem is to
find structural similarities in homogenous XML documents- XML documents that share the same
XML Schema (or DTD). By removing nodes repetition and nesting, homogenous XML documents will
lose essential and very important structural information that could significantly help in the process
of clustering those documents.
To better understand this issue, we will apply the structural summaries notion on selected section of
actual SPL documents. Figure 11 shows adverse reactions sections for two different SPL documents.
Adverse effects information for “SPL Doc 1” is stored with no specific structure in one paragraph tag.

Page | 15
For “SPL Doc 2”, there are seven paragraphs under the text tag which could potentially mean that
adverse effects section may contain important structural hints.
Figure 11: Adverse effects sections from two different SPL documents
However, when we create the structural summary for “SPL Doc 1” (the same as the original tree as
there’s no repetition or nesting within this document) and “SPL Doc 2”, we notice that the later will
have exactly the same structural summary of “SPL Doc 1” (Figure 12 shows the trees of SPL doc 1
and 2 along with the resulted structural summary of “SPL doc 2”). As a result, the two documents
will be later added to the same cluster by the clustering algorithm even though SPL Doc 2 has more
structural information that makes it structurally different from “SPL Doc 1”. Important structural
information of doc 2 was lost in the node repetition reduction, which is a part of structural summary
building process, was applied to this document tree.
Figure 12: Applying reputation reduction in SPL Doc 2
Graph-based clustering
The first stage in S-GRACE algorithm is the extraction of s-graphs from XML trees. S-graph encodes
nodes’ parent-child relationships of all nodes in an XML tree as edges. All Edges repetition is not

Page | 16
considered in the SG data structure. The aim of repetition removal is to make s-graphs as small as
possible for efficient pair-wise comparison that will be later performed on these s-graphs. However,
and similar to tree-edit approach issue, the removal of edge repetition will result in removal of
sensitive structural information that significantly affects clustering homogenous XML documents.
To illustrate this problem, we extract the s-graphs of the two SPL documents presented in Figure 11.
Figure 13 shows the corresponding SG that encodes the s-graph of SPL 1 and SPL 2.
Figure 13: SG for SPL 1 and SPL 2
As we can see in Figure 13, the final result is that SPL 1 and SPL 2 are assigned to the same cluster.
Even though, as we explained on applying tree-edit distance approach above, the two SPL
documents have different structure and should not be in the same cluster.
Tree-based Clustering
One of the initial stages of XRep algorithm is the computation of the distance matrix ‫ܯ‬ௗthat records
the distance between each pair of clusters (initially each cluster represents a single XML trees). The
distance formula used to compute the distance is ݀௝ሺ‫ݐ‬ଵ, ‫ݐ‬ଶሻ ൌ 1 െ
|௣௔௧௛ሺ௧భሻ∩ ௣௔௧௛ሺ௧మሻ|
୫ୟ୶ ሼ|௣௔௧௛ሺ௧భሻ|,|௣௔௧௛ሺ௧మሻ|ሽ
(refer
back to section 3.3 for complete formula definition). The set of paths ‫݄ݐܽ݌‬ሺ‫ݐ‬௜ሻ represents tree ‫ݐ‬௜
paths without any path repetition. Moreover, the set of common shared paths (‫݄ݐܽ݌‬ሺ‫ݐ‬ଵሻ ∩
‫݄ݐܽ݌‬ሺ‫ݐ‬ଶሻ) of ‫ݐ‬ଵܽ݊݀ ‫ݐ‬ଶ, by set definition, doesn’t include repeated paths. As a result, redundant
paths will not be considered which will cause losing sensitive structural information that lead to
poor clustering for homogenous XML documents.
Let us use the same SPL documents example in Figure 11 to illustrate repetition removal issue with
tree-based clustering. Figure 14 shows the paths for each SPL document. In the second part of
Figure 14, the set of paths for each tree are presented. We can notice that SPL doc 1 and SPL doc 2
have the same set of paths even though SPL doc 1 has different structure. Finally, Figure 14 presents
the interaction of the two SPL documents which results in keeping all paths with removing any
repeated paths. Now let us calculate the distance between SPL doc 1 & 2 to see how the XRep
algorithm is going to cluster them. DIST=1-(4/max(4,4)) => 1-4/4 => 1-1 = zero. So these two trees
match exactly and they are merged in one cluster again even though they should have been
assigned to different clusters.

Page | 17
Figure 14: Clusters distance calculation using tree-based algorithm
4.2. Expensive computation time
Tree-edit distance
In tree-edit distance approach, each XML tree has to be compared to all the other documents in the
set of documents that needs to be clustered. Mathematically, [N * (N – 1) / 2] comparisons have to
be performed to calculate tree-edit distance between all documents in a set of N documents. The
FDA database contains over 12,000 SPL documents and the number is increasing periodically [5].
This means that to cluster SPL documents using tree-edit approach, we approximately need to
perform [10000*9999/2] = 49,995,000 comparison operation. As stated in [6], the average time to
compare original SPL tree (structural summaries are not appropriate for homogenous documents as
discussed above) is 700 milliseconds (ms). To find the time it takes to compare all these documents
we first multiply the number of comparisons operations with the time it takes to perform single
comparison operation (49,995,000 * 700ms) which is equal to 34996500000 ms. This means that it
will take about 13.5 months to finish all SPL pair-wise comparisons
4.3. Poor quality Clusters for homogenous XML documents
S-graph-based Clustering
S-GRACE algorithm was developed to find similarities in XML documents from different sources
(documents with different DTDs or XML Schemas). It measures similarities by calculating how many
common edges a pair of s-graphs may have. Thus, it considers that two s-graphs should be placed in
two different clusters, if they have few overlapping edges (which means that fewer path expressions
can be answered by both XML trees). However, this notion is valid for only heterogeneous XML trees

Page | 18
and not homogenous ones. In measuring distance between any two given s-graphs, if we consider
that similarity is only based on how many common edges both are sharing, and keeping in mind that
edges are representing a node parent-child relationships and not full XML paths (no repetition nor
nesting is represented in s-graphs), then documents that adhere to one XML Schema or DTD will
mostly be place in a one cluster.
The results of the experiment that I’ve performed on a sample of SPL files support this finding. S-
GRACE was manually tested on a selected set of SPL documents. To simplify this experiment, I’ve
focused only on adverse effects section of the SPL document. Out of 20 SPL documents, only three
clusters were generated. 18 of which were assigned to the first cluster that contains all documents
with the following edges: Component Section, Section text, text paragraph. The second
cluster contains one SPL document that has same edges as cluster one with the extra edge text
table. The third cluster also has only one SPL document. This cluster contains similar edges as cluster
one along with Section excerpt, excerpt highlight, and highlight text edges (see Figure 15).
After computing distance between each pair of clusters, cluster 1 and 2 was merged in one cluster.
Even though the 18 SPL documents in the first cluster have different structures, S-GRACE algorithm
has assigned them all to the same cluster just because they share the same set of nodes.
Figure 15: SG data structure before and after merging similar SPL clusters
Tree-based Clustering
Taking off path repetition in computing distance metric will result in computing major structural
differences that mainly come from differences in XML Schema or DTD that define these documents.
So, if two XML trees represent XML documents from the same XML Schema or DTD, then the
distance between these trees will be very close and will be assigned to the same cluster. This issue
will result in few clusters that do not actually contain significant structural differences for
homogenous XML documents.
5. Proposed Abstract Solution: Signature-based Clustering
As discussed in the above sections, current clustering approaches are designed for XML documents that
are based on different XML Schemas or DTDs. In calculating heterogeneous XML documents’ clusters,
the structural information that is considered irrelevant could be very sensitive information that may
have significant impact on the quality of homogenous XML documents clusters. Thus, for XML
documents with a single schema, as the case with SPL documents, different approaches have to be
implemented.

Page | 19
In this section, I propose an abstract design of novel approach for clustering homogenous XML
documents. I call this approach signature-based clustering (SBC). SBC approach is a staged approach that
utilizes human inputs to refine the resulted clusters. Initially, structural summaries of each XML
document will be extracted in the first stage. These summaries do not include irrelevant XML elements
as well as the text content of each element. These summaries are then stored in a data structure called
SS, that contains three fields: a reference ID, file name of the actual XML document (XML document
name) and a field that contain the structural summary stored as an XML document. In stage two, quick
and poor quality clusters that are based on summaries of XML documents’ structure generated in stage
one are calculated and presented to the user with some samples of the actual XML files that each cluster
represents. The user decides which clusters are relevant and which are not and provide some
description that will be later used to refine clusters. In stage three, irrelevant clusters will be removed.
Then each remained cluster will be further refined by mining the text content of the actual documents
within this cluster for more clustering hints based on user inputs. Figure 16 show a detailed illustration
model for SBC approach.
Figure 16: Signature-base Clustering Model
5.1 Stage 1: Extracting XML Structural Summary
In order to perform quick clustering over a large set of XML documents, it is better to perform clustering
process over XML tree summaries and not the original trees. However, in extracting structural
summaries, neither node repetition nor nesting is removed. Only irrelevant elements and element text
content are removed (XQuery code used to extract structural summaries from SPL documents can be
reviewed in appendix A). The following figure shows a pseudo-code for stage one.

Page | 20
Figure 17: Stage one: Extracting XML Structural Summary
Next, structural summaries are stored in a data structure called SS. The importance of this data structure
is to link each summary to its original XML document. Each record is SS consists of three fields: doc_id
which is used as a reference to XML document and its structural summary, doc_name which stores the
original XML document name, and finally struct_summ in which structural summary is stored as XML
document. Figure 18 shows an example of SS data structure with one record for books.xml document
and its summary.
Figure 18: SS data structure
5.2 Stage 2: Performing initial signature-based clustering
In this stage, a preliminary clustering is performed over structural summaries that are stored in SS. The
clustering approach used here is signature-based clustering that is aimed to reduce the number of pair-
wise comparisons between structural summaries. For each summary, a signature is calculated and
immediately searched on the signature tables. In case the signature was found, the current structure
summary is assigned to the group of summaries that this signature represents. If it was not found, a new
entry in the signature table is created for this new signature and the current summary will be the first
document to be assigned to this group. The signature table is a data structure that contains all unique
signature discovered after scanned all XML documents in the set of documents to be clustered. Each
record consists of sign_id field that contains a reference id to the signature record, signature field which
contains the actual signature, and the sign_grp which stores the set of documents (doc_ids from SS file)
with the same signature.

Page | 21
Figure 19: Stage two - Signature-based structural clustering
The resulted clusters are presented to the user at the end of stage two. Along with the resulted clusters,
some examples of original XML documents of each cluster are presented to the user for better
evaluation as well as to facilitate decision making process on what clusters make sense and what are
not.
5.3 Stage 3: Refining clusters based on user input
As we previously discussed in the problem statement section, clustering process for homogenous XML
documents depends on the application. The user feedback is very important to help SBC algorithm learn
what the user is looking for. The user inputs consists of two parts: the first part is a simple yes or no
answer to the question of “Is this cluster make sense?” The second part is that user can refine some of
the clusters by adding key words or patterns that will assist the algorithm in optimizing clusters’ quality.
Stage three consists of two steps: in the first step, user inputs are used for filtering all the irrelevant
clusters from the signature table. In step two, the second part of the user inputs, key words and/or
patterns, is utilized for mining the text content of the original XML files. This process combines structural
and textual mining for tuning clusters quality. Accepted clusters from stage two may be split, joint or
even deleted as a result of the tuning process in stage three. The final clusters are then presented to the
user at the end of this stage. Figure 20 shows the pseudo-code that summarizes stage three.
Figure 20: Stage three: Refining clusters based on user input

Page | 22
6. Conclusion and future work
The main aim of this research is to shade the light on the problem of clustering homogenous XML
documents and provide guidelines for how the solution of this problem could be implemented.
Signature calculation algorithm is the essence of signature-based clustering that defines the quality of
produced clusters. As a continuation of this work, an effective signature calculating algorithm should be
implemented. Next, signature-based methodology should be evaluated using SPL documents and
validated by some manually inspected samples.
To the best of my knowledge, there is no clustering methodology that clusters homogenous XML
documents based on their structural information. I have shown that the conventional clustering
algorithms don not produce accurate results when tested with XML documents that based on single
XML Schema or DTD. Thus, a new approach for clustering homogenous XML documents must be
implemented using different techniques and methods from the ones that are used in clustering
heterogeneous XML documents. In our research, we suggest a schema design for a methodology,
signature-based clustering, that is aimed to solve this issue. This new approach is a semi-automated
approach that utilizes human inputs to refine the quality of computer-generated clusters.

Page | 23
References
[1] Hunter, D. et al. (2007). Beginning XML. Wiley Publication. 4th
edition. pp. 4-20.
[2] Buttler, D. (ND). A Short Survey of Document Structure Similarity Algorithms. Lawrence Livermore
National Laborator.
[3] Nierman, A. and Jagadush, H. (ND). Evaluating Structural Similarity in XML Documents
[4] Candillier, L. et al. (2008). Mining XML documents.
[5] FDA. (Feb. 2011)
http://www.fda.gov/ForIndustry/DataStandards/StructuredProductLabeling/default.htm
[6] Dalamagasa, T. et. al. (2004). A methodology for clustering XML documents by structure. Elsevier.
Information Systems 31 (2006) 187–228
[7] S.M. Selkow, The tree-to-tree editing problem, Inform. Process. Lett. 6 (1977) 184–186.
[8] K. Zhang, D. Shasha, Simple fast algorithms for the editing distance between trees and related
problems, SIAM J. Comput. 18
(1989) 1245–1262.
[9] S.S. Chawathe, A. Rajaraman, H. Garcia-Molina, J. Widom, Change Detection in Hierarchically
Structured Information, in:
Proceedings of the ACM SIGMOD Conference, USA, 1996, pp. 493–504.
[10] S.S. Chawathe, Comparing hierarchical data in external memory, in: Proceedings of the VLDB
Conference, Edinburgh, Scotland,
UK, 1999, pp. 90–101.
[11] Wang, L; Cheung, DWL; Mamoulis, N; Yiu, SM. (2004). An efficient and scalable algorithm for
clustering XML documents by structure. IEEE Transactions on Knowledge & Data Engineering. v. 16 n. 1,
p. 82-96. Retrieved from http://hdl.handle.net/10722/43670
[12] Buneman, Peter. Davidson, Susan. Fernandez, Mary. Suciu, Dan. (1997) Database Theory: Adding
structure to unstructured data. Springer Berlin / Heidelberg. Pp. 336-350 v. 1186. Retrieved from
http://dx.doi.org/10.1007/3-540-62222-5_55
[13] J. Shanmugasundaram, K. Tufte, G. He, C. Zhang, D. DeWitt, and J. Naughton, (1999). Relational
Databases for Querying XML Documents: Limitations and Opportunities,” Proc. 25th Int’l Conf. Very
Large Data Bases, pp. 302-314.
[14] Gianni Costa, Giuseppe Manco, Riccardo Ortale, and Andrea Tagarelli. (April 2004). A Tree-based
Approach to Clustering XML Documents by Structure
[15] FDA, “SPL Implementation Guide for FDA Drug Establishment Registration and Drug Listing v2.0”,
ND.
(http://www.fda.gov/downloads/ForIndustry/DataStandards/StructuredProductLabeling/UCM162024.p
df)

Page | 24
Appendix A: XML Summary Extractor Code

Clustering Homogenous XML Documents (CS501 Final Report) (1)

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Clustering Homogenous XML Documents (CS501 Final Report) (1)

Similar to Clustering Homogenous XML Documents (CS501 Final Report) (1) (20)

Clustering Homogenous XML Documents (CS501 Final Report) (1)