Outline
Motivations
State of the Art
Parallel CLOSET+
Ending Remarks
Parallel CLOSET+ Algorithm for Finding
Frequent Closed Itemsets
Tayfun Şen
M.Sc. Thesis Defense
Department of Computer Engineering
Middle East Technical University
June 29, 2009
Tayfun Şen, Parallel CLOSET+ for Finding Frequent Closed Itemsets
Outline
Motivations
    Problem: Information Pollution
    Recent Advancements in Data and Computing
State of the Art
    Data Mining
    Parallel Computing
Parallel CLOSET+
    The CLOSET+ Algorithm
    Parallelization
    Test Results
    Conclusion
Ending Remarks
    Demo
    References
    Q&A
Problem: Information Pollution
Computerization
Internetization (is that a word?)
As a result, the amount of data accessible to ordinary people through everyday devices is growing exponentially.
Image titled “Listening Post” from flickr, licensed CC BY-NC-ND 2.0. Source:
http://www.flickr.com/photos/fenchurch/427814801/
Recent Advancements in Data and Computing
New sources providing huge amounts of data (open governance, open APIs, crowdsourcing, ...)
Increased computing power (grids, cheaper clusters, cloud computing, ...)
Google servers circa 1999.
Image taken from flickr, licensed CC BY-2.0. Source:
http://en.wikipedia.org/wiki/File:Google’s First Production Serve
Image taken from Wikipedia, licensed CC BY-3.0. Source:
http://en.wikipedia.org/wiki/File:Athlon64x2-6400plus.jpg
Data Mining
Data mining enables the transition from data to knowledge:
Data ⇒ Information ⇒ Knowledge
A relatively young research area, but quite active nonetheless.
Many popular applications in the wild (e-commerce, finance, biotechnology, counter-terrorism, etc.)
flickr.com interestingness, amazon.com suggestions, and Google Flu Trends are some of the well-known implementations.
Data Mining - A Top-Down Approach
KDD: Knowledge Discovery in Databases
(data preprocessing and result interpretation are included)
Data Mining
(sometimes used interchangeably with KDD)
Association Rule Mining
(beer and baby diapers?)
Frequent Itemset Mining
(self-explanatory)
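To make the bottom level of this hierarchy concrete, here is a minimal brute-force sketch in Python (the thesis implementation itself is in C++). It uses the five baskets of the example database from the FP-tree slides and assumes an absolute minimum support of 3. The names `support`, `frequent_itemsets`, and `closed_itemsets` are illustrative and not taken from the thesis. An itemset is treated as closed when no proper superset has the same support.

```python
from itertools import combinations

# Example baskets (same data as the example database in the FP-tree slides).
DB = [
    frozenset("acfmp"),
    frozenset("acdfmp"),
    frozenset("abcfgm"),
    frozenset("bfi"),
    frozenset("bcnp"),
]
MIN_SUPPORT = 3  # assumed absolute support threshold


def support(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for transaction in db if itemset <= transaction)


def frequent_itemsets(db, min_support):
    """Enumerate all itemsets with support >= min_support (exponential, illustration only)."""
    items = sorted(set().union(*db))
    found = {}
    for size in range(1, len(items) + 1):
        level = {}
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = support(candidate, db)
            if count >= min_support:
                level[candidate] = count
        if not level:  # nothing frequent at this size, so nothing larger can be frequent
            break
        found.update(level)
    return found


def closed_itemsets(frequent):
    """Keep only itemsets that have no proper superset with the same support."""
    return {
        itemset: count
        for itemset, count in frequent.items()
        if not any(itemset < other and count == other_count
                   for other, other_count in frequent.items())
    }


frequent = frequent_itemsets(DB, MIN_SUPPORT)
closed = closed_itemsets(frequent)
for itemset, count in sorted(closed.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print("".join(sorted(itemset)), count)
```

On this data the 18 frequent itemsets collapse to 5 closed ones (c, f, b, cp, and acfm), which is the kind of reduction that motivates mining closed itemsets instead of all frequent itemsets.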
Parallel Computing
To program multi-core CPUs (whatever happened to
frequency increases?)
To program systems with multiple computing resources
(shared memory or shared nothing)
Becoming increasingly important, since it is needed to take advantage of ever-increasing computing resources.
Parallel Programming Methods
Automatic parallelization (a lost war?)
Threads (too hard to manage?)
OpenMP (easier to use, but does not work on shared-nothing architectures)
MPI (flexible, but low-level and harder to use)
The CLOSET+ Algorithm
Developed by Wang, Han, and Pei [1]
A natural step in the evolution of data mining algorithms (sets ⇒ trees ⇒ graphs)
Uses a data structure called the FP-tree
Building FP-tree
Table: An Example Database, and the Pruned and Ordered DB
TID   Basket contents     Pruned & ordered items
001   a, c, f, m, p       f, c, a, m, p
002   a, c, d, f, m, p    f, c, a, m, p
003   a, b, c, f, g, m    f, c, a, b, m
004   b, f, i             f, b
005   b, c, n, p          c, b, p
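The step from basket contents to pruned and ordered items can be sketched as follows. This is a minimal Python illustration, not the thesis code. It assumes an absolute minimum support of 3 (the threshold that leaves exactly the items shown in the table) and orders items by descending support. The items c and f tie at a support of 4, so the hypothetical `tie_break` argument is used to put f first, as the table does.

```python
from collections import Counter

DB = {
    "001": ["a", "c", "f", "m", "p"],
    "002": ["a", "c", "d", "f", "m", "p"],
    "003": ["a", "b", "c", "f", "g", "m"],
    "004": ["b", "f", "i"],
    "005": ["b", "c", "n", "p"],
}
MIN_SUPPORT = 3  # assumption: matches the items that survive in the table


def build_f_list(db, min_support, tie_break=""):
    """Frequent items sorted by descending support; `tie_break` resolves equal counts."""
    counts = Counter(item for basket in db.values() for item in set(basket))
    frequent = [item for item, count in counts.items() if count >= min_support]

    def rank(item):
        pos = tie_break.find(item)
        return (-counts[item], pos if pos >= 0 else len(tie_break), item)

    return sorted(frequent, key=rank)


def prune_and_order(db, f_list):
    """Drop infrequent items and sort every basket by f-list position."""
    position = {item: index for index, item in enumerate(f_list)}
    return {
        tid: sorted((item for item in set(basket) if item in position), key=position.get)
        for tid, basket in db.items()
    }


f_list = build_f_list(DB, MIN_SUPPORT, tie_break="fc")
ordered = prune_and_order(DB, f_list)
for tid, items in ordered.items():
    print(tid, ", ".join(items))
```

Sorting every basket by the same global order is what lets transactions with common frequent items share a prefix in the tree.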
Figure: Building of FP-tree as each transaction is processed
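Since the figure itself is not reproduced here, the construction can be sketched in a few lines of Python. This is an illustrative sketch under the assumption that transactions arrive already pruned and ordered as in the example table. The class name `FPNode` and the helpers `insert` and `render` are made up for this example.

```python
class FPNode:
    """One FP-tree node: an item, its count, and links to parent and children."""

    def __init__(self, item=None, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def insert(root, ordered_items):
    """Walk down from the root, reusing shared prefixes and creating nodes as needed."""
    node = root
    for item in ordered_items:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, parent=node)
            node.children[item] = child
        child.count += 1
        node = child


def render(node, depth=0):
    """Return the tree as indented 'item:count' lines (children in alphabetical order)."""
    lines = []
    for item in sorted(node.children):
        child = node.children[item]
        lines.append("  " * depth + f"{item}:{child.count}")
        lines.extend(render(child, depth + 1))
    return lines


# Pruned and ordered transactions 001-005 from the example table.
ORDERED_DB = ["fcamp", "fcamp", "fcabm", "fb", "cbp"]

root = FPNode()
for transaction in ORDERED_DB:
    insert(root, transaction)
print("\n".join(render(root)))
```

The 20 item occurrences in the ordered database end up in 11 tree nodes, because transactions 001 to 003 share the prefix f, c, a.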
Mining FP-tree
Figure: Projected FP-tree for item p:3
Figure: FP-tree with side links
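A rough Python sketch of how side links are used to project the tree on an item, here p. It is an illustration under assumptions rather than the thesis implementation: the side links are modeled as a per-item list of nodes in a header table, the minimum support is assumed to be 3, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


class FPNode:
    def __init__(self, item=None, parent=None):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}


def build_tree(ordered_db):
    """Build an FP-tree plus a header table: item -> list of its nodes (the side links)."""
    root, header = FPNode(), defaultdict(list)
    for transaction in ordered_db:
        node = root
        for item in transaction:
            if item not in node.children:
                node.children[item] = FPNode(item, parent=node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header


def conditional_pattern_base(item, header):
    """Follow the side links of `item` and collect (prefix path, count) pairs."""
    base = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        base.append(("".join(reversed(path)), node.count))
    return base


def conditional_counts(base):
    """Support of each item inside the projected (conditional) database."""
    counts = Counter()
    for path, count in base:
        for item in path:
            counts[item] += count
    return counts


ORDERED_DB = ["fcamp", "fcamp", "fcabm", "fb", "cbp"]
MIN_SUPPORT = 3  # assumed absolute support threshold

root, header = build_tree(ORDERED_DB)
base = conditional_pattern_base("p", header)
counts = conditional_counts(base)
locally_frequent = {item: c for item, c in counts.items() if c >= MIN_SUPPORT}
print(base, locally_frequent)
```

In the projection on p only c stays frequent, and its count equals the support of p itself, so every transaction containing p also contains c. That is why the itemset cp with support 3 comes out of this projection.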
Parallelization
Used the Open MPI and Boost libraries.
Developed in C++.
Debugging is particularly tricky (new types of bugs, a huge number of interleavings, ...)
Item Count Merging
Figure: Merging of the local item counts
The local support counts are simply added together.
Next up, FP-tree and result tree merging.
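The item count merge can be sketched without any message passing. The Python snippet below simulates two workers with plain lists. The split of the example database into partitions is hypothetical, and the slides do not state which MPI calls carry out this step in the actual C++ implementation.

```python
from collections import Counter


def local_item_counts(partition):
    """Support count of each item within one worker's share of the database."""
    return Counter(item for basket in partition for item in set(basket))


def merge_counts(local_counts):
    """Global counts are the element-wise sum of all local counts."""
    total = Counter()
    for counts in local_counts:
        total.update(counts)  # Counter.update adds counts instead of replacing them
    return total


# Hypothetical split of the example database across two workers.
partitions = [
    ["acfmp", "acdfmp", "abcfgm"],  # worker 0
    ["bfi", "bcnp"],                # worker 1
]
merged = merge_counts(local_item_counts(p) for p in partitions)
print(sorted(merged.items()))
```

Because addition is associative and commutative, the order in which workers contribute does not matter, which makes this step a natural fit for a reduction.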
Figure: Merging of two FP-trees
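One way to picture the merge in code is shown below: a minimal Python sketch that represents each FP-tree as nested dictionaries and adds counts along matching paths. The representation and the function names are assumptions made for illustration. The sketch also assumes both local trees were built with the same global item order, which is presumably why item counts are merged first.

```python
def build_tree(ordered_db):
    """Represent an FP-tree as nested dicts: item -> [count, children]."""
    tree = {}
    for transaction in ordered_db:
        level = tree
        for item in transaction:
            entry = level.setdefault(item, [0, {}])
            entry[0] += 1
            level = entry[1]
    return tree


def merge_trees(target, other):
    """Merge `other` into `target` in place: add counts on shared paths, copy new branches."""
    for item, (count, children) in other.items():
        entry = target.setdefault(item, [0, {}])
        entry[0] += count
        merge_trees(entry[1], children)
    return target


# Hypothetical local trees of two workers, built from the ordered example transactions.
local_a = build_tree(["fcamp", "fcamp", "fcabm"])
local_b = build_tree(["fb", "cbp"])
merged = merge_trees(local_a, local_b)
print(merged["f"][0], merged["c"][0])
```

Merging the two local trees gives the same structure as building a single tree from all five transactions, so no information is lost by building trees locally.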
Result tree merging
Test Results
Tested with two types of data
A real dataset and a synthetic one generated using IBM’s
Quest dataset generator
No over-subscription was used (each core executes a single thread)
Figure: Execution on 1-4 Cores, Retail dataset (x-axis: support value (%), 40 to 95; y-axis: time (sec); curves for 1, 2, 3, and 4 cores)
Figure: Execution on 4-16 Cores, Retail dataset (x-axis: support value (%), 40 to 95; y-axis: time (sec); curves for 4, 8, 12, and 16 cores)
Conclusion
High speedup and efficiency for high to medium support values.
The main determinant of performance is communication overhead.
The FP-tree compresses the data that must be communicated, which makes parallel execution more worthwhile.
As the support threshold is lowered and the number of processors is increased, efficiency drops.
It is left to the application owner to find the optimum execution parameters (support value and number of processors).
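For reference, speedup and efficiency are used here in their usual sense: speedup is the serial time divided by the parallel time, and efficiency is speedup divided by the number of processors. The Python sketch below shows the arithmetic. The timings in it are hypothetical placeholders, not measurements from the thesis.

```python
def speedup(serial_time, parallel_time):
    """How many times faster the parallel run is than the single-core run."""
    return serial_time / parallel_time


def efficiency(serial_time, parallel_time, processors):
    """Speedup per processor; 1.0 would mean perfect scaling."""
    return speedup(serial_time, parallel_time) / processors


# Hypothetical timings in seconds (processors -> time), NOT results from the thesis.
timings = {1: 60.0, 2: 32.0, 4: 18.0, 8: 12.0}
for processors, seconds in timings.items():
    s = speedup(timings[1], seconds)
    e = efficiency(timings[1], seconds, processors)
    print(f"{processors} cores: speedup {s:.2f}, efficiency {e:.2f}")
```

An application owner looking for the optimum processor count could tabulate measured timings in this way and stop adding processors once efficiency falls below an acceptable level.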
Demo
A real-life demo on Nar
References
[1] Jianyong Wang, Jiawei Han, and Jian Pei. CLOSET+: searching for the best strategies for mining frequent closed itemsets. In KDD ’03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 236–245, New York, NY, USA, 2003.
Thanks for listening. Any questions?