  1. 1. CS 267: Applications of Parallel Computers Lecture 25: Data Mining Kathy Yelick Material based on lecture by Vipin Kumar and Mahesh Joshi http://www-users.cs.umn.edu/~mjoshi/hpdmtut/
  2. 2. Lecture Schedule <ul><li>12/3: 3 things </li></ul><ul><li>Projects and performance analysis </li></ul><ul><li>(N-body assignment observations) </li></ul><ul><li>Data Mining </li></ul><ul><li>HKN Review at 3:40 </li></ul><ul><li>12/5: The Future of Parallel Computing </li></ul><ul><li>David Bailey </li></ul><ul><li>12/13: CS267 Poster Session ( 2-4pm , Woz) </li></ul><ul><li>12/14: Final Papers due </li></ul>
  3. 3. N-Body Assignment <ul><li>Some observations on your N-Body assignments </li></ul><ul><ul><li>Problems and pitfalls to avoid in final project </li></ul></ul><ul><li>Performance analysis </li></ul><ul><ul><li>Micro-benchmarks are good </li></ul></ul><ul><ul><ul><li>To understand application performance, build up performance model from measured pieces, e.g., network performance </li></ul></ul></ul><ul><ul><li>Noise is expected, but quantifying it is also useful </li></ul></ul><ul><ul><ul><li>Means, alone, can be confusing </li></ul></ul></ul><ul><ul><ul><li>Median + variance is good </li></ul></ul></ul><ul><ul><li>Carefully select problem sizes </li></ul></ul><ul><ul><ul><li>Are they large enough to justify the # of processors? </li></ul></ul></ul><ul><ul><ul><li>What do real users want? </li></ul></ul></ul><ul><ul><ul><li>Can you vary the problem size in some reasonable way? </li></ul></ul></ul>
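The advice on means versus median plus variance can be illustrated with a minimal sketch. The timings below are hypothetical (four quiet runs and one run disturbed by OS noise), and `summarize` is an illustrative helper, not part of the course material.

```python
import statistics

def summarize(timings):
    """Return mean, median, and sample variance of a list of timings (seconds)."""
    return {
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "variance": statistics.variance(timings),
    }

# Hypothetical measurements: one outlier doubles the mean but leaves the median alone.
runs = [1.0, 1.0, 1.0, 1.0, 6.0]
summary = summarize(runs)
print(summary)
```

The mean alone suggests a 2-second run, while the median shows the typical run and the variance exposes the noise.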
  4. 4. N-Body Assignment <ul><li>Minor comments on N-Body Results </li></ul><ul><ul><li>Describe performance graphs – what is expected, what is surprising </li></ul></ul><ul><ul><li>Sanity check your numbers </li></ul></ul><ul><ul><ul><li>Are you getting more than a P-fold speedup on P processors? </li></ul></ul></ul><ul><ul><ul><li>Does the observed running time (“time command”) match the total? </li></ul></ul></ul><ul><ul><ul><li>What is your Mflops rate? Is it between 10 and 90% of HW peak? </li></ul></ul></ul><ul><ul><li>Be careful of different timers </li></ul></ul><ul><ul><ul><li>gettimeofday is wall-clock time (charged for OS and others) </li></ul></ul></ul><ul><ul><ul><li>clock is process time (Linux creates a process per thread) </li></ul></ul></ul><ul><ul><ul><li>RT clock on Cray is wall-clock time </li></ul></ul></ul><ul><ul><li>Check captions, titles, axes of figures/graphs </li></ul></ul><ul><ul><li>Run spell checker </li></ul></ul>
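The timer and speedup checks above can be sketched as follows. This is an illustration in Python rather than the C timers named on the slide: `time.perf_counter` stands in for a wall-clock timer and `time.process_time` for a process (CPU) timer. The helper names and the timings passed to `speedup_report` are hypothetical.

```python
import time

def timed(fn):
    """Run fn once and return (wall_seconds, cpu_seconds)."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - wall0, time.process_time() - cpu0

def speedup_report(t_serial, t_parallel, p):
    """Speedup and efficiency on p processors; flag more than p-fold speedup."""
    speedup = t_serial / t_parallel
    return {"speedup": speedup, "efficiency": speedup / p, "suspicious": speedup > p}

# Sleeping consumes wall-clock time but almost no CPU time,
# so the two timers disagree for the same piece of code.
wall, cpu = timed(lambda: time.sleep(0.05))

# Hypothetical timings: 8.0 s serial, 1.5 s on 4 processors (worth double-checking).
report = speedup_report(t_serial=8.0, t_parallel=1.5, p=4)
print(wall, cpu, report)
```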
  5. 5. Outline <ul><li>Overview of Data Mining </li></ul><ul><li>Serial Algorithms for Classification </li></ul><ul><li>Parallel Algorithms for Classification </li></ul><ul><li>Summary </li></ul>
  6. 6. Data Mining Overview <ul><li>What is Data Mining? </li></ul><ul><li>Data Mining Tasks </li></ul><ul><ul><li>Classification </li></ul></ul><ul><ul><li>Clustering </li></ul></ul><ul><ul><li>Association Rules and Sequential Patterns </li></ul></ul><ul><ul><li>Regression </li></ul></ul><ul><ul><li>Deviation Detection </li></ul></ul>
  7. 7. What is Data Mining? <ul><li>Several definitions: </li></ul><ul><ul><li>Search for valuable information in large volumes of data </li></ul></ul><ul><li>Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover useful rules </li></ul><ul><li>A step in the Knowledge Discovery in Databases (KDD) process </li></ul>
  8. 8. Knowledge Discovery Process <ul><li>Knowledge Discovery in Databases: identify valid, novel, useful, and understandable patterns in data </li></ul><ul><li>Figure (process pipeline): Operational Databases → [clean, collect, summarize] → Data warehouse → [data preparation] → Training Data → [data mining] → Model, patterns → [verification and evaluation] </li></ul>
  9. 9. Why Mine Data? <ul><li>Data collected and stored at enormous rate </li></ul><ul><ul><li>Remote sensor on a satellite </li></ul></ul><ul><ul><li>Telescope scanning the skies </li></ul></ul><ul><ul><li>Microarrays generating gene expressions </li></ul></ul><ul><ul><li>Scientific simulations </li></ul></ul><ul><li>Traditional techniques infeasible </li></ul><ul><li>Data mining for data reduction </li></ul><ul><ul><li>Cataloging, classifying, segmenting </li></ul></ul><ul><ul><li>Help scientists formulate hypotheses </li></ul></ul>
  10. 10. Data Mining Tasks <ul><li>Predictive Methods: Use some variables to predict unknown or future values of other variables </li></ul><ul><ul><li>Classification </li></ul></ul><ul><ul><li>Regression </li></ul></ul><ul><ul><li>Deviation Detection </li></ul></ul><ul><li>Descriptive Methods: Find human-interpretable patterns that describe data </li></ul><ul><ul><li>Clustering </li></ul></ul><ul><ul><li>Association Rule Discovery </li></ul></ul><ul><ul><li>Sequential Pattern Discovery </li></ul></ul>
  11. 11. Classification <ul><li>Given a collection of records (training set) </li></ul><ul><ul><li>Each record contains a set of attributes, one of which is the class </li></ul></ul><ul><li>Find a model for the class attribute as a function of the values of the other attributes </li></ul><ul><li>Goal: previously unseen records should be accurately assigned a class </li></ul><ul><ul><li>A test set is used to determine accuracy </li></ul></ul><ul><li>Examples: </li></ul><ul><ul><li>Direct marketing: targeted mailings based on buy/don’t buy class </li></ul></ul><ul><ul><li>Fraud detection: predict fraudulent use of credit cards, insurance, telephones, etc. </li></ul></ul><ul><ul><li>Sky survey cataloging: catalog objects as star/galaxy </li></ul></ul>
  12. 12. Classification Example: Sky Survey <ul><li>Approach </li></ul><ul><ul><li>Segment the image </li></ul></ul><ul><ul><li>Measure image attributes – 40 per object </li></ul></ul><ul><ul><li>Model the class (star/galaxy or stage) based on the attributes </li></ul></ul><ul><li>Currently 3K images </li></ul><ul><li>23Kx23K pixels </li></ul>Images from: http://aps.umn.edu
  13. 13. Clustering <ul><li>Given a set of data points: </li></ul><ul><ul><li>Each has a set of attributes </li></ul></ul><ul><ul><li>A similarity measure among them </li></ul></ul><ul><li>Find clusters such that: </li></ul><ul><ul><li>Points in one cluster are more similar to each other than to points in other clusters </li></ul></ul><ul><li>Similarity measures are problem specific: </li></ul><ul><ul><li>E.g., Euclidean distance for continuous data </li></ul></ul>
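As a minimal sketch of a similarity measure in use, the code below computes Euclidean distance and assigns each point to its nearest cluster center. The points, the centers, and the nearest-center assignment step are illustrative assumptions; the slide does not prescribe a particular clustering algorithm.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(points, centers):
    """Return, for each point, the index of the nearest center."""
    return [
        min(range(len(centers)), key=lambda c: euclidean(p, centers[c]))
        for p in points
    ]

# Hypothetical 2-D data: two visibly separate groups and one center per group.
points = [(0, 0), (0, 1), (5, 5), (6, 5)]
centers = [(0, 0), (5, 5)]
labels = assign(points, centers)
print(labels)
```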
  14. 14. Clustering Applications <ul><li>Market Segmentation: </li></ul><ul><ul><li>Divide market into distinct subsets </li></ul></ul><ul><li>Document clustering: </li></ul><ul><ul><li>Find groups of related documents, based on common keywords </li></ul></ul><ul><ul><li>Used in information retrieval </li></ul></ul><ul><li>Financial market analysis </li></ul><ul><ul><li>Find groups of companies with common stock behavior </li></ul></ul>
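  15. 15. Association Rule Discovery <ul><li>Given a set of records, each containing a set of items </li></ul><ul><ul><li>Produce dependency rules that predict occurrences of an item based on others </li></ul></ul><ul><li>Applications: </li></ul><ul><ul><li>Marketing, sales promotion and shelf management </li></ul></ul><ul><ul><li>Inventory management </li></ul></ul><ul><li>Example transactions (TID: Items): </li></ul><ul><ul><li>1: Bread, Coke, Milk </li></ul></ul><ul><ul><li>2: Beer, Bread </li></ul></ul><ul><ul><li>3: Beer, Coke, Diaper, Milk </li></ul></ul><ul><ul><li>4: Beer, Bread, Diaper, Milk </li></ul></ul><ul><ul><li>5: Coke, Diaper, Milk </li></ul></ul><ul><li>Rules: </li></ul><ul><ul><li>{Milk} → {Coke} </li></ul></ul><ul><ul><li>{Diaper, Milk} → {Beer} </li></ul></ul>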
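One common way to score such rules is by support and confidence. The slide does not name these measures, so treat them as an assumption here. The sketch below evaluates the two example rules on the five transactions from the slide; `count`, `support`, and `confidence` are illustrative helper names.

```python
# Transactions from the slide, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

def support(lhs, rhs):
    """Fraction of all transactions that contain both sides of the rule."""
    return count(lhs | rhs) / len(transactions)

def confidence(lhs, rhs):
    """Among transactions containing lhs, the fraction that also contain rhs."""
    return count(lhs | rhs) / count(lhs)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "->", sorted(rhs), support(lhs, rhs), confidence(lhs, rhs))
```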
  16. 16. Other Data Mining Problems <ul><li>Sequential Pattern Discovery </li></ul><ul><ul><li>Given a set of objects, each with a timeline of events </li></ul></ul><ul><ul><li>Find rules that predict sequential dependencies </li></ul></ul><ul><ul><li>Example: patterns in telecommunications alarm logs </li></ul></ul><ul><li>Regression: </li></ul><ul><ul><li>Predict the value of one variable given others </li></ul></ul><ul><ul><li>Assume a linear or non-linear model of dependence </li></ul></ul><ul><ul><li>Examples: </li></ul></ul><ul><ul><ul><li>Predict sales amounts based on advertising expenditures </li></ul></ul></ul><ul><ul><ul><li>Predict wind velocities based on temperature, pressure, etc. </li></ul></ul></ul><ul><li>Deviation Detection </li></ul><ul><ul><li>Discover most significant change in data from previous values </li></ul></ul>
  17. 17. Serial Algorithms for Classification <ul><li>Decision Tree Classifiers </li></ul><ul><ul><li>Overview of Decision Trees </li></ul></ul><ul><ul><li>Tree induction </li></ul></ul><ul><ul><li>Tree pruning </li></ul></ul><ul><li>Rule-based methods </li></ul><ul><li>Memory Based Reasoning </li></ul><ul><li>Neural networks </li></ul><ul><li>Genetic algorithms </li></ul><ul><li>Bayesian networks </li></ul><ul><li>Slide callout: Inexpensive; easy to interpret; easy to integrate into DBs </li></ul>
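  18. 18. Decision Tree Algorithms <ul><li>Many Algorithms </li></ul><ul><ul><li>Hunt’s Algorithm </li></ul></ul><ul><ul><li>CART </li></ul></ul><ul><ul><li>ID3, C4.5 </li></ul></ul><ul><ul><li>SLIQ, SPRINT </li></ul></ul><ul><li>Example training data (Tid: Refund, Marital, Income, Cheat): </li></ul><ul><ul><li>1: Yes, S, 125K, No </li></ul></ul><ul><ul><li>2: No, M, 100K, No </li></ul></ul><ul><ul><li>3: No, S, 70K, No </li></ul></ul><ul><ul><li>4: Yes, M, 120K, No </li></ul></ul><ul><ul><li>5: No, D, 95K, Yes </li></ul></ul><ul><ul><li>6: No, M, 60K, No </li></ul></ul><ul><ul><li>7: Yes, D, 220K, No </li></ul></ul><ul><ul><li>8: No, S, 85K, Yes </li></ul></ul><ul><ul><li>9: No, M, 75K, No </li></ul></ul><ul><ul><li>10: No, S, 90K, Yes </li></ul></ul><ul><li>Example decision tree: </li></ul><ul><ul><li>Refund = Yes: NO </li></ul></ul><ul><ul><li>Refund = No: test Marital </li></ul></ul><ul><ul><ul><li>Marital = M: NO </li></ul></ul></ul><ul><ul><ul><li>Marital = S,D: test Income </li></ul></ul></ul><ul><ul><ul><ul><li>Income <=80K: NO </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Income >80K: YES </li></ul></ul></ul></ul>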
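The example tree can be written as a small function and checked against the ten training records. This sketch assumes the tree reads Refund first, then Marital, then Income with an 80K threshold, which is how the node and edge labels on the slide appear to fit together. `classify` and `RECORDS` are illustrative names.

```python
# Training records from the slide: (tid, refund, marital, income in thousands, cheat).
RECORDS = [
    (1, "Yes", "S", 125, "No"),
    (2, "No", "M", 100, "No"),
    (3, "No", "S", 70, "No"),
    (4, "Yes", "M", 120, "No"),
    (5, "No", "D", 95, "Yes"),
    (6, "No", "M", 60, "No"),
    (7, "Yes", "D", 220, "No"),
    (8, "No", "S", 85, "Yes"),
    (9, "No", "M", 75, "No"),
    (10, "No", "S", 90, "Yes"),
]

def classify(refund, marital, income_k):
    """Walk the assumed tree: Refund, then Marital, then Income > 80K."""
    if refund == "Yes":
        return "No"
    if marital == "M":
        return "No"
    return "Yes" if income_k > 80 else "No"

predictions = [classify(r, m, inc) for _, r, m, inc, _ in RECORDS]
accuracy = sum(p == rec[4] for p, rec in zip(predictions, RECORDS)) / len(RECORDS)
print(accuracy)
```

Accuracy on the training set says little about unseen records, which is why the classification slide calls for a separate test set.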
  19. 19. Tree Induction <ul><li>Greedy strategy </li></ul><ul><ul><li>Split based on attribute that optimizes splitting criterion </li></ul></ul><ul><li>Two phases at each node in tree </li></ul><ul><ul><li>Split determining phase: </li></ul></ul><ul><ul><ul><li>Which attribute to split </li></ul></ul></ul><ul><ul><ul><li>How to split </li></ul></ul></ul><ul><ul><ul><ul><li>Two-way split of multi-valued attribute (Marital: S,D,M) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Continuous attributes: discretize in advance, cluster on the fly </li></ul></ul></ul></ul><ul><ul><li>Splitting phase </li></ul></ul><ul><ul><ul><li>Do the split and create child nodes </li></ul></ul></ul>
  20. 20. GINI Splitting Criterion <ul><li>Gini Index: </li></ul><ul><ul><li>$\mathrm{GINI}(t) = 1 - \sum_j [p(j \mid t)]^2$ </li></ul></ul><ul><ul><li>where $p(j \mid t)$ is the relative frequency of class $j$ at node $t$ </li></ul></ul><ul><li>Measures impurity of a node </li></ul><ul><ul><ul><li>Maximum ($1 - 1/n_c$, where $n_c$ is the number of classes) when records are equally distributed among the classes </li></ul></ul></ul><ul><ul><ul><li>Minimum (0) when all records belong to one class, implying most interesting information </li></ul></ul></ul><ul><li>Other criteria may be better, but similar evaluation </li></ul><ul><li>Examples (class counts C1, C2): </li></ul><ul><ul><li>C1 = 0, C2 = 6: Gini = 0.00 </li></ul></ul><ul><ul><li>C1 = 1, C2 = 5: Gini = 0.28 </li></ul></ul><ul><ul><li>C1 = 2, C2 = 4: Gini = 0.44 </li></ul></ul><ul><ul><li>C1 = 3, C2 = 3: Gini = 0.50 </li></ul></ul>
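The Gini index is easy to check numerically. The following minimal sketch takes a list of per-class record counts at a node; the function name `gini` and the count-list interface are illustrative choices.

```python
def gini(counts):
    """Gini index of a node given per-class record counts."""
    n = sum(counts)
    if n == 0:
        return 0.0  # treat an empty node as pure
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))     # all records in one class
print(gini([1, 5]))
print(gini([3, 3]))     # two classes, equally distributed
print(gini([2, 2, 2]))  # three classes, equally distributed
```

With equally distributed records the value equals 1 - 1/n_c: 0.5 for two classes and 2/3 for three.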
  21. 21. Splitting Based on GINI <ul><li>Used in CART, SLIQ, SPRINT </li></ul><ul><li>Criterion: Minimize GINI index of the split </li></ul><ul><li>When a node $p$ is split into $k$ partitions (children), the quality of the split is computed as </li></ul><ul><li>$\mathrm{GINI}_{split} = \sum_{j=1}^{k} \frac{n_j}{n} \, \mathrm{GINI}(j)$ </li></ul><ul><li>where $n_j$ = number of records at child $j$ </li></ul><ul><li>$n$ = number of records at node $p$ </li></ul><ul><li>To evaluate: </li></ul><ul><ul><li>Categorical attributes: compute counts of each class </li></ul></ul><ul><ul><li>Continuous attributes: sort and choose split (1 or more) </li></ul></ul>
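For a continuous attribute, "sort and choose split" can be sketched as one sort followed by a single sweep that maintains class counts on each side. The code below considers only one binary threshold, placed midway between adjacent distinct values, which is an assumption of this sketch. The data are the Income and Cheat columns of the decision-tree example, taken at the root node.

```python
from collections import Counter

def gini(counts):
    """Gini index from an iterable of per-class counts."""
    counts = list(counts)
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_threshold(values, labels):
    """Return (weighted Gini, threshold) of the best binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    left, right = Counter(), Counter(label for _, label in pairs)
    best = (float("inf"), None)
    for i in range(n - 1):
        value, label = pairs[i]
        left[label] += 1   # record i moves from the right side to the left side
        right[label] -= 1
        if value == pairs[i + 1][0]:
            continue       # no threshold fits between equal values
        n_left = i + 1
        score = (n_left / n) * gini(left.values()) \
            + ((n - n_left) / n) * gini(right.values())
        if score < best[0]:
            best = (score, (value + pairs[i + 1][0]) / 2)
    return best

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]  # in thousands
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
score, threshold = best_threshold(income, cheat)
print(score, threshold)
```

Because the class counts are updated incrementally, the sweep costs linear time after the sort, which is why the sort dominates.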
  22. 22. Splitting Based on INFO <ul><li>Information/Entropy: </li></ul><ul><li>$\mathrm{INFO}(t) = -\sum_j p(j \mid t) \log p(j \mid t)$, summing over classes $j$ </li></ul><ul><li>Information Gain </li></ul><ul><li>$\mathrm{GAIN}_{split} = \mathrm{INFO}(p) - \sum_{j=1}^{k} \frac{n_j}{n} \, \mathrm{INFO}(j)$ </li></ul><ul><li>Measures reduction in entropy; choose split to maximize </li></ul><ul><li>Used in ID3 and C4.5 </li></ul><ul><li>Problem: tends to prefer splits with a large number of partitions </li></ul><ul><ul><li>Variations avoid this </li></ul></ul><ul><li>Computation similar to GINI </li></ul>
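Here is a minimal sketch of entropy and information gain, assuming a base-2 logarithm (the slide does not state the base). It also illustrates the stated problem: splitting the decision-tree example on the record id, which yields one partition per record, scores a higher gain than splitting on Refund.

```python
import math
from collections import Counter

def info(labels):
    """Entropy (base 2, an assumption) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(parent, partitions):
    """Information gain of splitting parent into the given partitions."""
    n = len(parent)
    return info(parent) - sum(len(p) / n * info(p) for p in partitions)

# Cheat labels for Tid 1..10 and the Refund column of the same records.
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
refund = ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"]

by_refund = [
    [c for c, r in zip(cheat, refund) if r == "Yes"],
    [c for c, r in zip(cheat, refund) if r == "No"],
]
by_tid = [[c] for c in cheat]  # one partition per record

gain_refund = gain(cheat, by_refund)
gain_tid = gain(cheat, by_tid)
print(gain_refund, gain_tid)
```

Splitting on the record id drives every child's entropy to zero, so its gain equals the parent's entropy even though the split is useless for prediction.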
  23. 23. C4.5 Classification <ul><li>Simple depth-first construction of tree </li></ul><ul><li>Sorts continuous attributes at each node </li></ul><ul><li>Needs to fit data into memory </li></ul><ul><ul><li>To avoid out-of-core sort </li></ul></ul><ul><ul><li>Limits scalability </li></ul></ul>
  24. 24. SLIQ Classification <ul><li>Arrays of continuous attributes are pre-sorted </li></ul><ul><li>Classification tree is grown breadth-first </li></ul><ul><li>Class list structure maintains mapping: record id → node </li></ul><ul><li>Split determining phase: class list is referred to for computing best split for each attribute. (breadth-first) </li></ul><ul><li>Splitting phase: the list of the splitting attribute is used to update the leaf labels in the class list. (no physical splitting) </li></ul><ul><li>Problem: class list is frequently and randomly accessed </li></ul><ul><ul><li>Required to be in-memory for efficient performance </li></ul></ul>
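The class-list idea can be sketched as follows. This is a simplified illustration, not the SLIQ implementation. The class list is a dictionary from record id to class and current leaf, the Income list is presorted, and a split only relabels leaves in the class list. The 97.5 threshold and the leaf names "L" and "R" are hypothetical.

```python
from collections import Counter, defaultdict

cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]  # in thousands

# Class list: record id -> class label and current leaf (kept in memory).
class_list = {rid: {"cls": c, "leaf": "root"} for rid, c in enumerate(cheat, start=1)}

# Presorted attribute list for Income: (value, record id).
income_list = sorted(zip(income, range(1, 11)))

def apply_split(attr_list, leaf, threshold, left, right):
    """Splitting phase: relabel records of `leaf`; no data is physically moved."""
    for value, rid in attr_list:
        entry = class_list[rid]  # random access into the class list
        if entry["leaf"] == leaf:
            entry["leaf"] = left if value <= threshold else right

def leaf_histograms():
    """Per-leaf class counts, as needed when evaluating the next splits."""
    hist = defaultdict(Counter)
    for entry in class_list.values():
        hist[entry["leaf"]][entry["cls"]] += 1
    return hist

apply_split(income_list, "root", 97.5, "L", "R")
hist = leaf_histograms()
print(dict(hist))
```

The lookup `class_list[rid]` inside the scan is the frequent, random access that the slide identifies as the reason the class list must stay in memory.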
  25. 25. SLIQ Example
  26. 26. SPRINT <ul><li>Arrays of continuous attributes are presorted </li></ul><ul><ul><li>Sorted order is maintained during splits </li></ul></ul><ul><li>Classification tree is grown breadth-first </li></ul><ul><li>Attribute lists are physically split among nodes </li></ul><ul><li>Split determining phase is just a linear scan of lists at each node </li></ul><ul><li>Hashing scheme used in splitting phase </li></ul><ul><ul><li>Record IDs of the splitting attribute are hashed with the tree node </li></ul></ul><ul><ul><li>Remaining attribute arrays are split by querying this hash table </li></ul></ul><ul><li>Problem: Hash table is O(N) at root </li></ul>
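The hashing scheme in the splitting phase can be sketched as below. This simplified illustration assumes attribute lists of `(value, class, record id)` entries and a binary split. The `age` attribute, the four records, and the 90K threshold are hypothetical.

```python
def split_attribute_lists(attr_lists, split_attr, threshold):
    """Physically split presorted attribute lists between two child nodes."""
    # Scan the splitting attribute's list and hash each record id to its child.
    child_of = {
        rid: ("left" if value <= threshold else "right")
        for value, _, rid in attr_lists[split_attr]
    }
    children = {"left": {}, "right": {}}
    for name, entries in attr_lists.items():
        for side in children:
            children[side][name] = []
        # One in-order pass per list, so each child list stays sorted.
        for entry in entries:
            children[child_of[entry[2]]][name].append(entry)
    return children

attr_lists = {
    "income": [(60, "No", 2), (85, "Yes", 0), (95, "Yes", 3), (125, "No", 1)],
    "age": [(23, "Yes", 0), (30, "No", 1), (41, "No", 2), (52, "Yes", 3)],
}
children = split_attribute_lists(attr_lists, "income", 90)
print(children["left"]["age"], children["right"]["age"])
```

The `child_of` table holds one entry per record at the node being split, which is the O(N) cost at the root noted on the slide.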
  27. 27. Parallel Algorithms for Classification <ul><li>Driven by need to handle large data sets </li></ul><ul><ul><li>Larger aggregate memory on parallel machines </li></ul></ul><ul><ul><li>Scales on cluster architecture </li></ul></ul><ul><li>I/O time dominates </li></ul><ul><ul><li>Benefits (cost/performance) are more difficult to analyze than for a simple MFLOP-limited problem </li></ul></ul><ul><ul><li>I.e., buy disks for parallel bandwidth vs. processors + memory </li></ul></ul>
  28. 28. Parallel Tree Construction: Approach 1 <ul><li>First approach: partition data, data-parallel operations across nodes </li></ul><ul><ul><li>Global reduction per node </li></ul></ul><ul><ul><li>Large number of nodes is expensive </li></ul></ul>
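The data-parallel approach can be simulated in a single process. In this sketch the records of one tree node are partitioned across two hypothetical processors, each computes a local class histogram, and a sum over histograms stands in for the global reduction. A real implementation would use a message-passing all-reduce. The data are the Refund and Cheat columns of the decision-tree example.

```python
from collections import Counter
from functools import reduce

def local_counts(records):
    """Histogram of (attribute value, class) pairs on one processor."""
    return Counter(records)

def global_reduce(per_processor):
    """Stand-in for a global reduction: sum all local histograms."""
    return reduce(lambda a, b: a + b, per_processor, Counter())

# (refund, cheat) for Tid 1..10, partitioned across two simulated processors.
records = [
    ("Yes", "No"), ("No", "No"), ("No", "No"), ("Yes", "No"), ("No", "Yes"),
    ("No", "No"), ("Yes", "No"), ("No", "Yes"), ("No", "No"), ("No", "Yes"),
]
partitions = [records[:5], records[5:]]

totals = global_reduce([local_counts(part) for part in partitions])
print(totals)
```

One such reduction is needed for every tree node being split, so the communication cost grows with the number of nodes at deeper levels.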
  29. 29. Parallel Tree Construction: Approach 2 <ul><li>Task parallelism: exploit parallelism between nodes </li></ul><ul><ul><li>Load imbalance as the number of records per node varies </li></ul></ul><ul><ul><li>Locality: child/parent need same data </li></ul></ul>
  30. 30. Parallel Tree Construction: Hybrid Approach <ul><li>Switch from data to task parallelism (within a node to between nodes) when: </li></ul><ul><ul><li>Total communication cost >= Moving cost + Load balancing cost </li></ul></ul><ul><li>Splitting ensures: </li></ul><ul><ul><li>Communication cost <= 2 * Optimal-Communication-cost </li></ul></ul>
  31. 31. Continuous Data <ul><li>Parallel mining with continuous data adds: </li></ul><ul><ul><li>Parallel sort </li></ul></ul><ul><ul><ul><li>Essentially a transpose of data – all to all </li></ul></ul></ul><ul><ul><li>Parallel hashing </li></ul></ul><ul><ul><ul><li>Random small accesses </li></ul></ul></ul><ul><li>Both are very hard to do efficiently on current machines </li></ul>
  32. 32. Performance Results from ScalParC <ul><li>Parallel running time on Cray T3E </li></ul>
  33. 33. Performance Results from ScalParC <ul><li>Runtime with constant size per processor, also T3E </li></ul>