Parallel Frequent Tree Mining
CSE 721 – Introduction to Parallel Computing
Project – Report
Qian Zhu
Shirish Tatikonda
Introduction
Recent research in data mining has progressed from mining frequent itemsets to more general and structured patterns such as trees and graphs. Itemset mining can be thought of as a special case of graph mining. Data is often stored in the form of graphs in applications such as the World Wide Web, telecommunications, social networks, and bioinformatics, and graphs in these areas are generally massive in size. Mining for special structures such as graphs and trees is referred to as Pattern Mining. However, graphs in general have undesirable theoretical properties with regard to algorithmic complexity. In terms of complexity theory, no efficient algorithm is currently known to determine whether one graph is isomorphic to a subgraph of another. In fact, the problem of Graph Isomorphism is in NP, and the problem of Subgraph Isomorphism is NP-Complete. Furthermore, no efficient algorithm is known for systematically enumerating the subgraphs of a given graph, a common facet of data mining algorithms. One could therefore expect that general graphs pose serious efficiency problems.
Fortunately, many practical databases do not consist of graphs that require exponential computations. Many applications deal with simpler structures in which the number of cycles is limited, or the graphs may even be acyclic. The latter case is especially interesting, e.g., when the graphs are trees, because many very efficient algorithms are known for this class of graphs. In this project, we consider a particular pattern, tree structures. A tree is a minimally connected and maximally acyclic graph. Specifically, we address the problem of Mining Frequent Induced Subtrees. Mining frequent subtrees from databases is a budding research area with many practical applications in areas such as computer networks, Web mining, bioinformatics, and XML document mining. Since trees are complex structures compared to traditional transactions consisting of items, mining for frequent subtrees is a more complex and challenging task than mining for frequent itemsets. Most existing tree mining algorithms borrow techniques from the traditional and more mature area of itemset mining.
Motivation
In areas such as bioinformatics, Web mining, and chemical compound analysis, data is often represented in semi-structured form. In its most generic form, structured data can be represented using graphs. Applications like XML deal with restricted structures such as trees. Mining frequent trees from a given dataset has several interesting applications. For example, it has been used in developing efficient algorithms for network multicast routing. A few researchers have applied frequent tree mining algorithms to Internet movie descriptions and obtained the common structures present in the movie documentation. Another interesting application is classifying XML documents using frequent subtree structures.
Recently, there has been growing interest in mining databases of labeled trees, partly due to the increasing popularity of XML in databases. In [1], Zaki presented an algorithm, TreeMiner, to discover all frequent embedded subtrees, i.e., subtrees that preserve ancestor-descendant relationships, in a forest or a database of rooted ordered trees. The algorithm was extended further in [2] to build a structural classifier for XML data. In [3], Asai et al. presented an algorithm, FREQT, to discover frequent rooted ordered subtrees. For mining rooted unordered subtrees, Asai et al. in [3] and Chi et al. in [4] both proposed algorithms based on enumeration tree growing. Because multiple ordered trees can correspond to the same unordered tree, both studies define similar canonical forms for rooted unordered trees. In [5], Chi et al. studied the problem of indexing and mining free trees and developed an Apriori-like algorithm, FreeTreeMiner, to mine all frequent free subtrees. Another very efficient algorithm for mining free trees is Gaston, proposed by Nijssen et al. [6]. It is important to note that comparing these algorithms is not straightforward, because the structures they mine differ. For example, TreeMiner [1] mines rooted induced and embedded subtrees, whereas Gaston [6] mines free trees.
Objective
In this project, we focus on the problem of mining frequent induced subtrees in a large database of rooted, labeled trees. We first propose a serial version of the mining algorithm. We then exploit the scope for parallelism in the serial algorithm to develop a parallel version. To the best of our knowledge, no parallel tree mining algorithm exists. Furthermore, we evaluate the proposed serial algorithm against the TreeMiner algorithm, even though, as noted in the previous section, a direct comparison of these two algorithms is not entirely fair. Finally, we show the performance differences between the parallel and serial versions of our mining algorithm.
Generic Approach
Let D denote a database where each transaction s ∈ D is a labeled rooted unordered tree. For a
given pattern t, which is a rooted unordered tree, we say t occurs in a transaction s if there exists
at least one subtree of s that is isomorphic to t. The occurrence δt(s) of t in s is the number of
distinct subtrees of s that are isomorphic to t. Let σt(s) = 1 if δt(s) > 0, and 0 otherwise. We say s
supports pattern t if σt(s) is 1, and we define the support of a pattern t as supp(t) = Σ_{s ∈ D} σt(s). A
pattern t is called frequent if its support is greater than or equal to a minimum support (minsup)
specified by a user. The frequent subtree mining problem is to find all frequent subtrees in a
given database. One nice property of frequent trees is the Apriori property, given below:
Apriori Property: Any subtree of a frequent tree is also frequent, and any supertree of an
infrequent tree is also infrequent.
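As a tiny numeric illustration of these definitions (with hypothetical occurrence counts, not data from our experiments): if a pattern t occurs δt(s) = 2, 0, 1, 3, 0 times in five transactions, then σt(s) = 1, 0, 1, 1, 0 and supp(t) = 3. In code:

```cpp
#include <vector>

// delta[i] holds the occurrence count delta_t(s_i) of pattern t in
// transaction s_i; the support supp(t) counts the transactions in which
// that count is positive, i.e. it sums sigma_t(s) over the database.
int supportFromOccurrences(const std::vector<int>& delta) {
    int supp = 0;
    for (int d : delta)
        if (d > 0) ++supp;  // sigma_t(s) = 1 iff delta_t(s) > 0
    return supp;
}
```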
The Apriori property reduces the search space drastically, so most existing algorithms employ this search-space pruning strategy. In general, most tree mining algorithms start with a seed pattern. At every step, the pattern is extended by one edge to create a set of candidate patterns. A scan of the dataset is then performed to find the frequent patterns among the candidates. These frequent patterns are in turn extended by one edge to create a new set of candidate patterns. Algorithms differ in the way they extend and enumerate the candidate patterns. For example, TreeMiner generates candidate patterns using a vertical layout of equivalence classes, while Gaston employs a pattern-growth approach like gSpan [8].
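The generic level-wise loop just described can be sketched as follows. This is an illustrative skeleton, not any of the cited miners' actual code: patterns are stood in for by strings and the subtree-isomorphism test by a substring check, purely to keep the seed/extend/scan/prune structure visible.

```cpp
#include <string>
#include <vector>

// Stand-in pattern type; a real tree miner would use tree-structured
// patterns and a subtree-isomorphism test instead of substring search.
using Pattern = std::string;

// Support: the number of transactions containing the pattern.
int support(const Pattern& p, const std::vector<std::string>& db) {
    int s = 0;
    for (const auto& t : db)
        if (t.find(p) != std::string::npos) ++s;
    return s;
}

// Generic level-wise mining: start from seed patterns, extend each frequent
// pattern by one "edge" (here, one label), scan the database to keep only
// frequent candidates, and repeat until no candidate survives.
std::vector<Pattern> mineFrequent(const std::vector<std::string>& db,
                                  const std::vector<char>& labels, int minsup) {
    std::vector<Pattern> frequent, level;
    for (char l : labels) {                          // seed patterns
        Pattern p(1, l);
        if (support(p, db) >= minsup) level.push_back(p);
    }
    while (!level.empty()) {
        frequent.insert(frequent.end(), level.begin(), level.end());
        std::vector<Pattern> next;
        for (const auto& p : level)                  // extend each frequent pattern
            for (char l : labels) {
                Pattern c = p + l;                   // one-step extension
                if (support(c, db) >= minsup)        // dataset scan
                    next.push_back(c);
            }
        level = std::move(next);                     // frequent level seeds the next
    }
    return frequent;
}
```

The Apriori pruning is implicit here: a candidate is ever generated only by extending a pattern that was itself frequent, so supertrees of infrequent patterns are never explored.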
Our Approach
In our method, we represent each tree as a Prufer sequence instead of using a traditional tree or graph representation. Prufer sequences provide a bijection between ordered trees with n nodes and sequences of length n-1. Prufer sequences were first proposed by Heinz Prufer in 1918 for counting the number of labeled free trees on n nodes. A simple algorithm to construct the Prufer sequence of a tree with vertices {1, 2, ..., n} is as follows. We start with a Prufer sequence of length 0. At each step, we remove the leaf with the smallest label and append its parent to the Prufer sequence constructed so far. This process is repeated until only two vertices remain, i.e., for n-2 iterations. The resulting sequence is (p1, p2, ..., pn-2), where pi is the parent of the leaf removed at step i. [7] extended this algorithm to obtain a sequence of length n-1 by continuing the process until only one vertex remains. Since the size of a Prufer sequence is linear in the tree size, the storage complexity of our approach is linear in the database size. Because labels in database trees can occur multiple times, the Prufer sequence above is not sufficient to uniquely represent a tree. To uniquely represent a labeled database tree, we store the following sequences: the Labeled Prufer Sequence (LPS), the Numbered Prufer Sequence (NPS), and the tree's Label Sequence (LS). Note that, although the LPS can be constructed from the NPS and LS, we distinguish it for purposes of exposition. Hence, any labeled tree can be uniquely represented using its LS and NPS. It is worth mentioning that this method of constructing Prufer sequences follows a post-order (PO) traversal. An example database tree and its Prufer sequence representation are shown below.
[Figure: (a) a database tree with 10 nodes, each annotated with its label and post-order number (PON); (b) its Prufer sequence representation:
    LPS: A A C B A B D A B -
    NPS: 3 3 4 10 9 7 8 9 10 -
    LS:  B D A C E C B D A B
    PON: 1 2 3 4 5 6 7 8 9 10]
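The extended (length n-1) construction described above can be sketched as follows. This is an illustrative reimplementation, not our project code; it assumes the vertices are numbered 1..n in post order (so a node's number is always smaller than its parent's, and the root is n).

```cpp
#include <functional>
#include <queue>
#include <vector>

// Build the length n-1 Prufer sequence (the NPS) of a rooted tree whose
// vertices are numbered 1..n in post order. parent[v] is v's parent, with
// parent[root] = 0. At each step, the smallest-numbered remaining leaf is
// removed and its parent appended, until only the root remains.
std::vector<int> pruferSequence(const std::vector<int>& parent, int n) {
    std::vector<int> degree(n + 1, 0);               // children not yet removed
    for (int v = 1; v <= n; ++v)
        if (parent[v] != 0) ++degree[parent[v]];
    std::priority_queue<int, std::vector<int>, std::greater<int>> leaves;
    for (int v = 1; v <= n; ++v)
        if (degree[v] == 0) leaves.push(v);          // initial leaves
    std::vector<int> seq;
    while ((int)seq.size() < n - 1) {
        int leaf = leaves.top(); leaves.pop();       // smallest-numbered leaf
        int p = parent[leaf];
        seq.push_back(p);
        if (--degree[p] == 0) leaves.push(p);        // p may now be a leaf
    }
    return seq;
}
```

On the example tree above (parents in post-order numbers: 1→3, 2→3, 3→4, 4→10, 5→9, 6→7, 7→8, 8→9, 9→10, root 10), this yields 3 3 4 10 9 7 8 9 10, matching the NPS in the figure; replacing each number with its node label gives the LPS, and the LS simply lists the labels in post order.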
We define the Left Most Path (LMP) as the path from the root to the leftmost leaf node. In the example tree above, the LMP, in post-order numbers (PON), is 10-4-3-1. Since the Prufer sequence is based on a PO traversal, adding an edge on the LMP corresponds to extending the Prufer sequence on the left-hand side. For example, consider the tree without the edge B-A (1-3). In that tree, the LMP is 10-4-3-2. Adding the edge B-A on the LMP then corresponds to prefixing the Prufer sequence with the information of the new edge.
Our pattern growth mechanism relies on growth from the Left Most Path. Like any pattern-growth algorithm, we start with individual nodes as seed patterns. At each step, we extend the pattern by adding one edge on the LMP. It can be proved that this mechanism is able to generate every possible subtree with the given labels. Such a traversal of the search space is illustrated below, where for simplicity we assume that the set of possible labels is {A, B, C}.
[Figure: the enumeration search space for the label set {A, B, C}. The first level contains the single-node seed patterns A, B, and C; each pattern is extended by one edge on the LMP with every possible label to form the next level, and the extensions of the same pattern form an equivalence class.]
It is easy to see that this search space is exponential, so traversing it can be computationally very expensive. We therefore adopt an embedding-based growth approach; i.e., we grow edges based on the embeddings of the pattern in the database trees. With this method, only edges that actually occur in the database are considered as candidates. Because of space constraints, we do not discuss this in detail. Once the candidate patterns are generated, the support of each candidate is calculated by scanning the dataset. If the calculated support is at least minsup, the pattern is flagged as frequent and output. The frequent subtrees at one level serve as the seed patterns for the candidates at the next level.
In general, a database consists of a large number of trees, and hence processing the candidate patterns serially can be quite expensive. Equivalence classes offer the best opportunity for parallelism: two different equivalence classes can clearly be processed in parallel. Furthermore, counting the support of a pattern involves scanning all the database trees, so the support counting step can also be easily parallelized. In this report, we present results for parallelizing across equivalence classes only. One can use OpenMP or any other thread-based method to parallelize across equivalence classes; we implemented our parallel algorithms using POSIX threads. We adopt a worker-based method of partitioning the work across threads. We first add all the tasks (mining each equivalence class) to a job queue, from which a worker (an idle thread) picks up a job and mines the corresponding equivalence class. We chose this approach because it offers better load balancing across threads: we do not know in advance which equivalence classes are more expensive to mine than others.
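The worker scheme can be sketched as follows. This is a hypothetical sketch, not our actual implementation: equivalence-class mining tasks sit in a shared queue guarded by a mutex, each idle thread repeatedly takes the next task, and mineClass() stands in for the real per-class mining routine.

```cpp
#include <pthread.h>
#include <atomic>
#include <vector>

// Stand-in for mining one equivalence class; here it just counts the
// classes processed so the sketch has an observable result.
static std::atomic<int> classesMined{0};
static void mineClass(int classId) { classesMined.fetch_add(1); }

struct JobQueue {
    std::vector<int> jobs;                // equivalence-class ids to mine
    size_t next = 0;
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    int take() {                          // -1 signals "no work left"
        pthread_mutex_lock(&lock);
        int job = (next < jobs.size()) ? jobs[next++] : -1;
        pthread_mutex_unlock(&lock);
        return job;
    }
};

static void* worker(void* arg) {
    JobQueue* q = static_cast<JobQueue*>(arg);
    for (int job = q->take(); job != -1; job = q->take())
        mineClass(job);                   // cheap classes finish early and the
    return nullptr;                       // thread immediately grabs more work
}

void mineAllClasses(int numClasses, int numThreads) {
    JobQueue q;
    for (int c = 0; c < numClasses; ++c) q.jobs.push_back(c);
    std::vector<pthread_t> threads(numThreads);
    for (auto& t : threads) pthread_create(&t, nullptr, worker, &q);
    for (auto& t : threads) pthread_join(t, nullptr);
}
```

This dynamic scheduling balances load precisely because the per-class cost is unknown in advance; a static partition of classes across threads could leave most threads idle while one works through an expensive class.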
In the next section, we present the experimental evaluation of our serial and parallel algorithms.
Experiments and Results
We have conducted all the experiments on Altix Shared Memory system. We have implemented
our algorithms using C++ and POSIX threads. We have performed experiments for datasets with
varying sizes (in terms of # of trees - 500K, 1M, 2M, 3M, 4M) and at varying support levels (50,
100, 200, 300, 400, 500). Furthermore, we conducted these experiments using different number of
processes (1-8). Due to the space constraint, we present only part of our results.
[Figure: scalability with respect to dataset size (support = 200). Execution time (sec) versus number of trees (500K, 1M, 2M, 3M, 4M), for the serial version and for 2, 4, and 8 processors.]
[Figure: scalability with respect to number of processors. Execution time (sec) versus number of processors (1-8).]
The first graph shows that our approach scales with increasing dataset size (in terms of number of trees): execution time grows more slowly than the dataset size. However, execution time actually increases as we increase the number of processors. This might be because the heap is shared across the threads, so simultaneous calls to malloc() from different threads are serialized, thereby increasing the execution time.
We have also analyzed the memory footprint of the serial and parallel versions of the algorithm. We observed that the serial version took approximately 8 KB (dataset size = 2M, support = 200), whereas the parallel version took around 1300 MB. We observed similar memory usage (1300 MB) even when the number of processors is 1. The pthreads library allocates a certain amount of memory for each thread at creation time, which might explain the huge difference in memory footprint. We are still in the process of analyzing this trend.
[Figure: comparison of execution time of the serial and parallel algorithms (number of trees = 2M). Execution time (sec) versus support (50, 100, 200, 300, 400, 500), for the serial version and for 2, 4, and 8 processors.]
From the above graph, we observe the general trend that execution time decreases as the minimum support increases. This is due to the Apriori-style pruning of the search space: when the minimum support is high, large sections of the search space are pruned in the early stages.
Comparison with TreeMiner
[Figure: execution time (sec) of TreeMiner versus our serial approach, for dataset sizes of 500K, 1M, 2M, 3M, and 4M trees.]
The above graph illustrates the difference in execution time between our approach and TreeMiner. Our approach beats TreeMiner at all dataset sizes by a large margin. It should be noted, however, that this comparison is not entirely fair, since TreeMiner mines both induced and embedded subtrees whereas our approach mines only induced subtrees. We are currently investigating methods to incorporate embedded subtree mining into our approach; given how large the gap is, we still expect our approach to outperform TreeMiner.
Conclusions and Future Work
In this project, we have designed and developed a novel algorithm to mine induced subtrees from a database of rooted, labeled trees. We presented strategies to parallelize such a mining algorithm. Our approach is novel and performs better than the state-of-the-art algorithms we compared against.
We want to evaluate our approach more closely to determine the reasons for the unexpected trends shown by the parallel version. We also plan to parallelize the available serial version of TreeMiner and compare it against our parallel version. Furthermore, we want to evaluate the performance of Gaston against our algorithms.
References
[1] M. J. Zaki. Efficiently mining frequent trees in a forest. In Proc. of the 2002 Int. Conf. on Knowledge Discovery and Data Mining, 2002.
[2] M. J. Zaki and C. C. Aggarwal. XRules: An effective structural classifier for XML data. In Proc. of the 2003 Int. Conf. on Knowledge Discovery and Data Mining, 2003.
[3] T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa. Efficient substructure discovery from large semi-structured data. In Proc. of the 2nd SIAM Int. Conf. on Data Mining, 2002.
[4] Y. Chi, Y. Yang, and R. R. Muntz. Mining frequent rooted trees and free trees using canonical forms. Technical Report CSD-TR No. 030043, ftp://ftp.cs.ucla.edu/tech-report/2003-reports/030043.pdf, UCLA, 2003.
[5] Y. Chi, Y. Yang, and R. R. Muntz. Indexing and mining free trees. In Proc. of the 2003 IEEE Int. Conf. on Data Mining, 2003.
[6] S. Nijssen and J. N. Kok. A quickstart in frequent structure mining can make a difference. In Proc. of the 2004 Int. Conf. on Knowledge Discovery and Data Mining, 2004.
[7] P. Rao and B. Moon. PRIX: Indexing and querying XML using Prufer sequences. In Proc. of the 2004 Int. Conf. on Data Engineering, 2004.
[8] X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. In Proc. of the 2002 Int. Conf. on Data Mining (ICDM'02), Maebashi, Japan, December 2002.