M.Sc. Jury Defense

  1. 1. Parallel CLOSET+ Algorithm for Finding Frequent Closed Itemsets. Tayfun Şen, M.Sc. Thesis Defense, Department of Computer Engineering, Middle East Technical University, June 29, 2009.
  2. 2. Outline
       • Motivations: Problem: Information Pollution; Recent Advancements in Data and Computing
       • State of the Art: Data Mining; Parallel Computing
       • Parallel CLOSET+: The CLOSET+ Algorithm; Parallelization; Test Results; Conclusion
       • Ending Remarks: Demo; References; Q&A
  5. 5. Problem: Information Pollution
       • Computerization
       • Internetization (is that a word?)
       • As a result, the data accessible to ordinary people through everyday devices is increasing exponentially.
       Image titled “Listening Post” from flickr, licensed CC BY-NC-ND 2.0. Source: http://www.flickr.com/photos/fenchurch/427814801/
  7. 7. Recent Advancements in Data and Computing
       • Newer sources providing huge amounts of data (open governance, open APIs, crowdsourcing ...)
       • Increased computing power (grids, cheaper clusters, cloud computing ...)
  8. 8. Google servers circa 1996
  9. 9. Google servers circa 1999. Image taken from flickr, licensed CC BY-2.0. Source: http://en.wikipedia.org/wiki/File:Google’s First Production Serve
  10. 10. Image taken from Wikipedia, licensed CC BY-3.0. Source: http://en.wikipedia.org/wiki/File:Athlon64x2-6400plus.jpg
  14. 14. Data Mining
       • Data mining enables the transition from data to knowledge: Data ⇒ Information ⇒ Knowledge
       • A relatively young research area, but quite active nonetheless.
       • Many popular applications in the wild (e-commerce, finance, biotechnology, counter-terrorism, etc.)
       • flickr.com interestingness, amazon.com suggestions, and Google Flu Trends are some of the well-known implementations.
  18. 18. Data Mining - A Top Down Approach
       • KDD? Knowledge Discovery in Databases (data preprocessing and result interpretation are included)
       • Data Mining (sometimes used interchangeably with KDD)
       • Association Rule Mining (beer and baby diapers?)
       • Frequent Itemset Mining (self-explanatory)
  21. 21. Parallel Computing
       • To program multi-core CPUs (whatever happened to frequency increases?)
       • To program systems with multiple computing resources (shared memory or shared nothing)
       • Becoming really important, as it is needed to take advantage of ever-increasing computing resources.
  25. 25. Parallel Programming Methods
       • Automatic parallelization (a lost war?)
       • Threads (too hard to manage?)
       • OpenMP (easier to use, but does not work on shared-nothing architectures)
       • MPI (flexible, but low-level and harder)
  28. 28. The CLOSET+ Algorithm
       • Developed by Wang, Han et al. [1]
       • A natural step in the evolution of data mining algorithms (sets ⇒ trees ⇒ graphs)
       • Uses a data structure called the FP-tree
  29. 29. Building FP-tree
       Tables: An Example Database (left), Pruned and Ordered DB (right)
       TID   Basket contents     Pruned & ordered items
       001   a, c, f, m, p       f, c, a, m, p
       002   a, c, d, f, m, p    f, c, a, m, p
       003   a, b, c, f, g, m    f, c, a, b, m
       004   b, f, i             f, b
       005   b, c, n, p          c, b, p
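The pruning and ordering step behind the right-hand column can be sketched in a few lines of Python. This is a minimal illustration, not the thesis code (which was written in C++). The threshold `MIN_SUPPORT = 3` is an assumption that happens to be consistent with the table, and the item order `F_LIST` is read off the slide because several items have equal counts and the slide does not state a tie-breaking rule.

```python
from collections import Counter

# Example database from the slide: TID -> basket contents.
DATABASE = {
    "001": ["a", "c", "f", "m", "p"],
    "002": ["a", "c", "d", "f", "m", "p"],
    "003": ["a", "b", "c", "f", "g", "m"],
    "004": ["b", "f", "i"],
    "005": ["b", "c", "n", "p"],
}

MIN_SUPPORT = 3  # assumption: absolute threshold, not stated on the slide

# Item order read off the slide's right-hand column; ties between equally
# frequent items are broken this way purely to match the slide.
F_LIST = ["f", "c", "a", "b", "m", "p"]


def count_items(database):
    """Count in how many transactions each item occurs."""
    counts = Counter()
    for items in database.values():
        counts.update(set(items))
    return counts


def prune_and_order(database, min_support, f_list):
    """Drop infrequent items and sort the rest by their position in f_list."""
    counts = count_items(database)
    rank = {item: i for i, item in enumerate(f_list) if counts[item] >= min_support}
    return {
        tid: sorted((i for i in items if i in rank), key=rank.__getitem__)
        for tid, items in database.items()
    }


ordered_db = prune_and_order(DATABASE, MIN_SUPPORT, F_LIST)
for tid, items in ordered_db.items():
    print(tid, ", ".join(items))
```

Items d, g, i and n each occur once and are dropped; the remaining items keep the same relative order in every transaction, which is what lets transactions share prefixes in the FP-tree.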
  30. 30. Figure: Building of FP-tree as each transaction is processed
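Since the figure itself did not survive, the construction it depicts can be approximated with a small Python sketch. It inserts the pruned and ordered transactions from the example database one by one, sharing common prefixes and incrementing node counts. The class and function names (`FPNode`, `insert_transaction`, `build_fp_tree`, `dump`) are hypothetical, and side links are left out for brevity.

```python
class FPNode:
    """One FP-tree node: an item, its count, and its children keyed by item."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def insert_transaction(root, items):
    """Walk down from the root, reusing existing prefixes and bumping counts."""
    node = root
    for item in items:
        node = node.children.setdefault(item, FPNode(item))
        node.count += 1


def build_fp_tree(transactions):
    """Build an FP-tree from transactions that are already pruned and ordered."""
    root = FPNode()
    for items in transactions:
        insert_transaction(root, items)
    return root


def dump(node, depth=0):
    """Return the tree as indented 'item:count' lines (the root is omitted)."""
    lines = []
    for child in node.children.values():
        lines.append("  " * depth + f"{child.item}:{child.count}")
        lines.extend(dump(child, depth + 1))
    return lines


# Pruned and ordered transactions from the example database.
ORDERED_DB = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
]

tree = build_fp_tree(ORDERED_DB)
print("\n".join(dump(tree)))
```

Four of the five transactions start with f, so they all share the single `f:4` node; only the last transaction opens a second branch under the root.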
  31. 31. Mining FP-tree. Figure: Projected FP-tree for item p:3. Figure: FP-tree with side links.
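The projection for item p can be illustrated without the figure. In a real FP-tree the prefixes are gathered by following the side links for p; the sketch below takes a shortcut and reads them directly from the ordered transactions, which yields the same prefix paths for this small example. The function names are hypothetical, and a threshold of 3 is assumed as in the example table.

```python
from collections import Counter

# Pruned and ordered transactions from the example database.
ORDERED_DB = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
]


def conditional_pattern_base(ordered_db, item):
    """Collect the prefix that precedes `item` in each ordered transaction."""
    base = Counter()
    for items in ordered_db:
        if item in items:
            prefix = tuple(items[: items.index(item)])
            if prefix:
                base[prefix] += 1
    return base


def projected_item_counts(base):
    """Support of each item inside the projected (conditional) database."""
    counts = Counter()
    for prefix, n in base.items():
        for item in prefix:
            counts[item] += n
    return counts


base_p = conditional_pattern_base(ORDERED_DB, "p")
counts_p = projected_item_counts(base_p)
for prefix, n in base_p.items():
    print(", ".join(prefix), n)
print(dict(counts_p))
```

Item p has support 3, and within its projection only c also reaches 3. Because c appears in every transaction that contains p, the two can be reported together as the itemset {c, p} with support 3 rather than p alone.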
  34. 34. Parallelization
       • Used OpenMPI and Boost libraries.
       • Developed using C++
       • Debugging is particularly tricky (new types of bugs, huge number of interleavings ...)
  36. 36. Item Count Merging
       Figure: Merging of the local item counts
       • Simple adding of support counts.
       • Next up, FP-tree and result tree merging.
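Adding up local support counts can be sketched in a single process with Python's `Counter`. The slides do not show the MPI calls used in the thesis, so this only illustrates the arithmetic; the split of the example database across two workers is a made-up partition for illustration.

```python
from collections import Counter


def local_item_counts(transactions):
    """Count, on one worker, how many local transactions contain each item."""
    counts = Counter()
    for items in transactions:
        counts.update(set(items))
    return counts


def merge_item_counts(all_local_counts):
    """Global support of an item is the sum of its local supports."""
    return sum(all_local_counts, Counter())


# Hypothetical partition of the example database across two workers.
worker_0 = [
    ["a", "c", "f", "m", "p"],
    ["a", "c", "d", "f", "m", "p"],
    ["a", "b", "c", "f", "g", "m"],
]
worker_1 = [
    ["b", "f", "i"],
    ["b", "c", "n", "p"],
]

merged = merge_item_counts([local_item_counts(worker_0), local_item_counts(worker_1)])
print(sorted(merged.items()))
```

The merged counts are identical to counting over the whole database at once, so every worker can derive the same global item order before building its local FP-tree.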
  37. 37. Figure: Merging of two FP-trees
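One plausible way to merge two FP-trees is a recursive walk that adds counts on shared paths and grafts paths present in only one tree. The sketch below uses nested dictionaries of the form `{item: [count, children]}` instead of the thesis's C++ structures; the representation and the names `insert`, `build` and `merge_trees` are assumptions. It relies on both trees having been built with the same global item order.

```python
def insert(tree, items):
    """Insert one ordered transaction into a nested-dict FP-tree."""
    node = tree
    for item in items:
        entry = node.setdefault(item, [0, {}])
        entry[0] += 1
        node = entry[1]


def build(transactions):
    """Build a nested-dict FP-tree: {item: [count, children]}."""
    tree = {}
    for items in transactions:
        insert(tree, items)
    return tree


def merge_trees(target, other):
    """Fold `other` into `target` in place and return `target`.

    Counts are added where both trees share a path; branches that exist
    only in `other` are copied over.
    """
    for item, (count, children) in other.items():
        entry = target.setdefault(item, [0, {}])
        entry[0] += count
        merge_trees(entry[1], children)
    return target


# Hypothetical split of the pruned and ordered example transactions.
part_0 = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
]
part_1 = [
    ["f", "b"],
    ["c", "b", "p"],
]

merged_tree = merge_trees(build(part_0), build(part_1))
print(merged_tree["f"][0], merged_tree["c"][0])
```

Merging the two local trees gives the same tree as building one tree over all five transactions, which is the property a parallel version needs.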
  38. 38. Result tree merging
  41. 41. Test Results
       • Tested with two types of data
       • A real dataset and a synthetic one generated using IBM’s Quest dataset generator
       • No over-subscription (each core executes a single thread)
  42. 42. Figure: Execution on 1-4 Cores, Retail dataset. Figure: Execution on 4-16 Cores, Retail dataset. Both plots show time (sec) against support value (%), with one curve per core count (1, 2, 3, 4 cores in the first plot; 4, 8, 12, 16 cores in the second).
  47. 47. Conclusion
       • High speedup and efficiency for high to medium support values
       • The main determinant of performance is communication overhead
       • The FP-tree provides compressed communication, which increases the usefulness of parallel execution
       • As the support threshold is lowered and the number of processors is increased, efficiency drops
       • It is left to the application owner to find the optimum settings for execution (in terms of support values and number of processors)
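The slides use "speedup" and "efficiency" without defining them. Assuming the usual definitions, where T_1 is the run time on one core and T_p the run time on p cores (symbols introduced here, not taken from the slides), they can be written as:

```latex
S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p} = \frac{T_1}{p \, T_p}
```

Under these definitions, ideal scaling gives S(p) = p and E(p) = 1, and communication overhead that grows with p pushes E(p) below 1.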
  48. 48. Demo: A real-life demo on Nar
  49. 49. References
       [1] Jianyong Wang, Jiawei Han, and Jian Pei. CLOSET+: searching for the best strategies for mining frequent closed itemsets. In KDD ’03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 236–245, New York, NY, USA, 2003.
  50. 50. Thanks for listening. Any questions?
