M.Sc. Jury Defense


    M.Sc. Jury Defense - Presentation Transcript

    1. Parallel CLOSET+ Algorithm for Finding Frequent Closed Itemsets
       Tayfun Şen
       M.Sc. Thesis Defense, Department of Computer Engineering, Middle East Technical University
       June 29, 2009
    2. Outline
       - Motivations: Problem: Information Pollution; Recent Advancements in Data and Computing
       - State of the Art: Data Mining; Parallel Computing
       - Parallel CLOSET+: The CLOSET+ Algorithm; Parallelization; Test Results; Conclusion
       - Ending Remarks: Demo; References; Q&A
    5. Problem: Information Pollution
       - Computerization
       - Internetization (is that a word?)
       - As a result, the data accessible to ordinary people through everyday devices is increasing exponentially.
       Image titled “Listening Post” from flickr, licensed CC BY-NC-ND 2.0. Source: http://www.flickr.com/photos/fenchurch/427814801/
    7. Recent Advancements in Data and Computing
       - New sources providing huge amounts of data (open governance, open APIs, crowdsourcing ...)
       - Increased computing power (grids, cheaper clusters, cloud computing ...)
    8. Google servers circa 1996
    9. Google servers circa 1999. Image taken from flickr, licensed CC BY-2.0. Source: http://en.wikipedia.org/wiki/File:Google’s First Production Serve
    10. Image taken from Wikipedia, licensed CC BY-3.0. Source: http://en.wikipedia.org/wiki/File:Athlon64x2-6400plus.jpg
    14. Data Mining
        - Data mining enables the transition from data to knowledge: Data ⇒ Information ⇒ Knowledge
        - A relatively young research area, but quite active nonetheless.
        - Many popular applications in the wild (e-commerce, finance, biotechnology, counter-terrorism, etc.)
        - flickr.com interestingness, amazon.com suggestions, and Google Flu Trends are some of the well-known implementations.
    18. Data Mining - A Top Down Approach
        - KDD? Knowledge Discovery in Databases (data preprocessing and result interpretation are included)
        - Data Mining (sometimes used interchangeably with KDD)
        - Association Rule Mining (beer and baby diapers?)
        - Frequent Itemset Mining (self-explanatory)
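
To make "frequent itemset" and the "closed" restriction from the talk title concrete, here is a minimal brute-force sketch in Python. It is not CLOSET+ and not the thesis code (which was written in C++). It uses the example database from the "Building FP-tree" slide, and `MIN_SUPPORT = 3` is an assumption, since the slides do not state a threshold. An itemset is taken to be closed when no proper superset has the same support, which is the standard definition.

```python
from itertools import combinations

# Example database from the "Building FP-tree" slide of this talk.
DB = [
    {"a", "c", "f", "m", "p"},
    {"a", "c", "d", "f", "m", "p"},
    {"a", "b", "c", "f", "g", "m"},
    {"b", "f", "i"},
    {"b", "c", "n", "p"},
]
MIN_SUPPORT = 3  # assumption: absolute support threshold, not stated on the slides


def support(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for basket in db if itemset <= basket)


def frequent_itemsets(db, min_support):
    """Brute force: test every non-empty combination of items."""
    items = sorted(set().union(*db))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = support(candidate, db)
            if count >= min_support:
                result[candidate] = count
    return result


def closed_itemsets(frequent):
    """Keep itemsets that have no proper superset with the same support."""
    return {
        itemset: count
        for itemset, count in frequent.items()
        if not any(itemset < other and count == other_count
                   for other, other_count in frequent.items())
    }


frequent = frequent_itemsets(DB, MIN_SUPPORT)
closed = closed_itemsets(frequent)
for itemset, count in sorted(closed.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print("".join(sorted(itemset)), count)
```

Under these assumptions the database has 18 frequent itemsets but only 5 closed ones (c, f, b, cp and acfm), which is why mining only the closed itemsets gives a much smaller result without losing support information.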
    21. Parallel Computing
        - To program multi-core CPUs (whatever happened to frequency increases?)
        - To program systems with multiple computing resources (shared memory or shared nothing)
        - Becoming really important, as it is needed to take advantage of ever-increasing computing resources.
    25. Parallel Programming Methods
        - Automatic parallelization (a lost war?)
        - Threads (too hard to manage?)
        - OpenMP (easier to use, but does not work on shared-nothing architectures)
        - MPI (flexible, but low-level and harder)
    28. The CLOSET+ Algorithm
        - Developed by Wang, Han et al. [1]
        - A natural step in the evolution of data mining algorithms (sets ⇒ trees ⇒ graphs)
        - Uses a data structure called the FP-tree
    29. Building FP-tree
        Table: An Example Database (Basket contents) and the Pruned and Ordered DB
        TID | Basket contents  | Pruned & ordered items
        001 | a, c, f, m, p    | f, c, a, m, p
        002 | a, c, d, f, m, p | f, c, a, m, p
        003 | a, b, c, f, g, m | f, c, a, b, m
        004 | b, f, i          | f, b
        005 | b, c, n, p       | c, b, p
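
The step from the basket contents to the pruned and ordered items can be sketched as follows. This is an illustrative Python sketch, not the thesis implementation. `MIN_SUPPORT = 3` is an assumption that is consistent with the table (items seen once are dropped, items seen three times are kept), and `TIE_BREAK` is a hypothetical parameter that pins the order among items with equal counts to the order that appears on the slide.

```python
from collections import Counter

TRANSACTIONS = {
    "001": ["a", "c", "f", "m", "p"],
    "002": ["a", "c", "d", "f", "m", "p"],
    "003": ["a", "b", "c", "f", "g", "m"],
    "004": ["b", "f", "i"],
    "005": ["b", "c", "n", "p"],
}
MIN_SUPPORT = 3       # assumption: threshold consistent with the slide's table
TIE_BREAK = "fcabmp"  # assumption: order among equal counts, read off the slide


def prune_and_order(transactions, min_support, tie_break):
    """Drop infrequent items and sort each basket by descending global count."""
    counts = Counter(item for basket in transactions.values() for item in basket)
    frequent = [item for item, count in counts.items() if count >= min_support]
    frequent.sort(key=lambda item: (-counts[item], tie_break.index(item)))
    rank = {item: position for position, item in enumerate(frequent)}
    pruned = {
        tid: sorted((item for item in basket if item in rank), key=rank.get)
        for tid, basket in transactions.items()
    }
    return counts, pruned


counts, pruned = prune_and_order(TRANSACTIONS, MIN_SUPPORT, TIE_BREAK)
for tid, items in pruned.items():
    print(tid, ", ".join(items))
```

Sorting every basket by the same global order is what lets transactions that share frequent items also share a prefix in the tree.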
    30. Figure: Building of FP-tree as each transaction is processed
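
Since the figure itself did not survive in this transcript, the construction it illustrates can be sketched in Python. This is a minimal prefix-tree sketch under the assumption that the input is the pruned and ordered database from the previous slide. The class and method names (`FPNode`, `insert`, `dump`) are hypothetical and not taken from the thesis.

```python
class FPNode:
    """One node of a simplified FP-tree: an item, a count, and child nodes."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

    def insert(self, ordered_items):
        """Walk down from this node, creating nodes as needed and bumping counts."""
        node = self
        for item in ordered_items:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1

    def dump(self, depth=0):
        """Return the subtree as indented 'item:count' lines."""
        lines = []
        for child in self.children.values():
            lines.append("  " * depth + f"{child.item}:{child.count}")
            lines.extend(child.dump(depth + 1))
        return lines


# Pruned and ordered transactions 001 to 005 from the slide's table.
PRUNED_DB = [list("fcamp"), list("fcamp"), list("fcabm"), list("fb"), list("cbp")]

root = FPNode()
for transaction in PRUNED_DB:
    root.insert(transaction)
print("\n".join(root.dump()))
```

Transactions 001 and 002 follow exactly the same path, so they cost one branch with count 2 rather than two branches. This sharing is the compression that the conclusion of the talk refers to.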
    31. Mining FP-tree
        Figure: Projected FP-tree for item p:3
        Figure: FP-tree with side links
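
The two figures are missing from this transcript, so here is a hedged Python sketch of what they describe: side links chain together all nodes that carry the same item, and following them for item p yields the prefix paths from which p's projected FP-tree is built. The names (`build_tree`, `prefix_paths`, `side_links`) are hypothetical, `MIN_SUPPORT = 3` is an assumption, and the side links are stored as a plain list per item rather than as pointers between nodes.

```python
from collections import Counter, defaultdict


class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_tree(ordered_db):
    """Return the root and a header table mapping item -> its nodes (side links)."""
    root = FPNode(None, None)
    side_links = defaultdict(list)
    for transaction in ordered_db:
        node = root
        for item in transaction:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                side_links[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, side_links


def prefix_paths(item, side_links):
    """Collect (prefix path, count) for every node that carries `item`."""
    paths = []
    for node in side_links[item]:
        path, parent = [], node.parent
        while parent.item is not None:  # stop at the root
            path.append(parent.item)
            parent = parent.parent
        paths.append((path[::-1], node.count))
    return paths


PRUNED_DB = [list("fcamp"), list("fcamp"), list("fcabm"), list("fb"), list("cbp")]
MIN_SUPPORT = 3  # assumption: same threshold as in the pruning step

root, side_links = build_tree(PRUNED_DB)
paths = prefix_paths("p", side_links)

# Count items inside p's prefix paths and keep the locally frequent ones.
local_counts = Counter()
for path, count in paths:
    for item in path:
        local_counts[item] += count
projected_items = {i: c for i, c in local_counts.items() if c >= MIN_SUPPORT}

print(paths)
print(projected_items)
```

The path counts add up to 3, matching the "p:3" in the figure caption. Under the assumed threshold only c survives in p's projection, which corresponds to the itemset cp with support 3.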
    34. Parallelization
        - Used OpenMPI and Boost libraries.
        - Developed using C++
        - Debugging is particularly tricky (new types of bugs, huge number of interleavings ...)
    38. Item Count Merging
        Figure: Merging of the local item counts
        - Simple adding of support counts.
        - Next up, FP-tree and result tree merging.
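
"Simple adding of support counts" can be sketched as follows. The thesis used MPI for communication, but the exact calls are not given on the slides, so this Python sketch only simulates two workers with list slices. The 3/2 split of the example database and the function names are assumptions.

```python
from collections import Counter
from functools import reduce

DB = [
    ["a", "c", "f", "m", "p"],
    ["a", "c", "d", "f", "m", "p"],
    ["a", "b", "c", "f", "g", "m"],
    ["b", "f", "i"],
    ["b", "c", "n", "p"],
]


def local_item_counts(partition):
    """Support counts computed by one worker over its share of the database."""
    return Counter(item for basket in partition for item in basket)


def merge_counts(left, right):
    """Merging is plain addition of the per-item counts."""
    return left + right


# Assumption: two simulated workers, holding 3 and 2 transactions.
partitions = [DB[:3], DB[3:]]
local = [local_item_counts(partition) for partition in partitions]
merged = reduce(merge_counts, local, Counter())
print(sorted(merged.items()))
```

Because addition is associative and commutative, the merge gives the same totals regardless of how the database is split or in which order the workers report.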
    39. Figure: Merging of two FP-trees
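
The figure is not available here, so the following Python sketch shows one plausible way to merge two FP-trees that were built with the same global item order: walk both trees together, sum the counts of nodes that share a path, and copy over branches that exist only in one tree. This is an assumption about the technique, not the thesis code, and the names `merge_into` and `as_dict` are hypothetical.

```python
class FPNode:
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def build_tree(ordered_db):
    """Build a simplified FP-tree from already pruned and ordered transactions."""
    root = FPNode()
    for transaction in ordered_db:
        node = root
        for item in transaction:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root


def merge_into(target, source):
    """Add every path of `source` into `target`, summing counts of shared nodes."""
    for item, source_child in source.children.items():
        target_child = target.children.setdefault(item, FPNode(item))
        target_child.count += source_child.count
        merge_into(target_child, source_child)
    return target


def as_dict(node):
    """Nested-dict view of a tree, convenient for comparing two trees."""
    return {item: (child.count, as_dict(child))
            for item, child in node.children.items()}


PRUNED_DB = [list("fcamp"), list("fcamp"), list("fcabm"), list("fb"), list("cbp")]

# Assumption: two simulated workers, each building a local tree.
left = build_tree(PRUNED_DB[:3])
right = build_tree(PRUNED_DB[3:])
merged = merge_into(left, right)
print(as_dict(merged) == as_dict(build_tree(PRUNED_DB)))
```

The merged tree equals the tree built from the whole database only because both workers used the same item order, which is why the item counts have to be merged before the local trees are built.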
    40. Result tree merging
    43. Test Results
        - Tested with two types of data
        - A real dataset and a synthetic one generated using IBM’s Quest dataset generator
        - No over-subscription done (each core executes a single thread)
    44. Figure: Execution on 1-4 Cores, Retail dataset (series: 1, 2, 3 and 4 cores)
        Figure: Execution on 4-16 Cores, Retail dataset (series: 4, 8, 12 and 16 cores)
        Both plots show Time (sec) against Support value (%), with support values from 40 to 95.
    49. Conclusion
        - High speedup and efficiency for high to medium support values
        - The main determinant of performance is communication overhead
        - The FP-tree provides compressed communication, which increases the usefulness of parallel execution
        - As the support threshold is lowered and the number of processors is increased, efficiency decreases
        - It is left to the application owner to find the optimum settings for execution (in terms of support values and number of processors)
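
The slides use "speedup" and "efficiency" without defining them. The usual definitions are given here, with T_1 the run time on one core and T_p the run time on p cores. The second line is only an illustrative model, not a result from the thesis: it assumes the parallel run time splits into perfectly divided work plus a communication term C_p, a symbol introduced here for illustration.

```latex
S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p} = \frac{T_1}{p \, T_p}

% Illustrative assumption: T_p = T_1 / p + C_p
E(p) = \frac{T_1}{T_1 + p \, C_p} = \frac{1}{1 + p \, C_p / T_1}
```

Under this assumed model, efficiency stays close to 1 while p C_p is small compared with T_1 and falls as more processors are added or as communication grows faster than the work, which is consistent with the trend described in the conclusion.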
    50. Demo
        A real life demo on Nar
    51. References
        [1] Jianyong Wang, Jiawei Han, and Jian Pei. CLOSET+: searching for the best strategies for mining frequent closed itemsets. In KDD ’03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 236–245, New York, NY, USA, 2003.
    52. Thanks for listening. Any questions?
