  • 1. CS 267: Applications of Parallel Computers Lecture 25: Data Mining Kathy Yelick Material based on lecture by Vipin Kumar and Mahesh Joshi http://www-users.cs.umn.edu/~mjoshi/hpdmtut/
  • 2. Lecture Schedule
    • 12/3: 3 things
    • Projects and performance analysis
    • (N-body assignment observations)
    • Data Mining
    • HKN Review at 3:40
    • 12/5: The Future of Parallel Computing
    • David Bailey
    • 12/13: CS267 Poster Session ( 2-4pm , Woz)
    • 12/14: Final Papers due
  • 3. N-Body Assignment
    • Some observations on your N-Body assignments
      • Problems and pitfalls to avoid in final project
    • Performance analysis
      • Micro-benchmarks are good
        • To understand application performance, build up performance model from measured pieces, e.g., network performance
      • Noise is expected, but quantifying it is also useful
        • Means, alone, can be confusing
        • Median + variance is good
      • Carefully select problem sizes
        • Are they large enough to justify the # of processors?
        • What do real users want?
        • Can you vary the problem size in some reasonable way?
  • 4. N-Body Assignment
    • Minor comments on N-Body Results
      • Describe performance graphs – what is expected, surprising
      • Sanity check your numbers
        • Are you getting more than a factor of P speedup on P processors?
        • Does the observed running time (“time command”) match total?
        • What is your Mflops rate? Is it between 10 and 90% of HW peak?
      • Be careful of different timers
        • Get-time-of-day is wall-clock time (charged for OS and others)
        • Clock is process time (Linux creates a process per thread)
        • RT clock on Cray is wall clock time
      • Check captions, titles, axes of figures/graphs
      • Run spell checker
  • 5. Outline
    • Overview of Data Mining
    • Serial Algorithms for Classification
    • Parallel Algorithms for Classification
    • Summary
  • 6. Data Mining Overview
    • What is Data Mining?
    • Data Mining Tasks
      • Classification
      • Clustering
      • Association Rules and Sequential Patterns
      • Regression
      • Deviation Detection
  • 7. What is Data Mining?
    • Several definitions:
      • Search for valuable information in large volumes of data
    • Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover useful rules
    • A step in the Knowledge Discovery in Databases (KDD) process
  • 8. Knowledge Discovery Process
    • Knowledge Discovery in Databases: identify valid, novel, useful, and understandable patterns in data
    [Process diagram: Operational Databases → (clean, collect, summarize) → Data warehouse → Data preparation → Training Data → Data mining → Model, patterns → Verification and evaluation]
  • 9. Why Mine Data?
    • Data collected and stored at enormous rate
      • Remote sensor on a satellite
      • Telescope scanning the skies
      • Microarrays generating gene expressions
      • Scientific simulations
    • Traditional techniques infeasible
    • Data mining for data reduction
      • Cataloging, classifying, segmenting
      • Help scientists formulate hypotheses
  • 10. Data Mining Tasks
    • Predictive Methods: Use some variables to predict unknown or future values of other variables
      • Classification
      • Regression
      • Deviation Detection
    • Descriptive Methods: Find human-interpretable patterns that describe data
      • Clustering
      • Association Rule Discovery
      • Sequential Pattern Discovery
  • 11. Classification
    • Given a collection of records (training set)
      • Each record contains a set of attributes, one of which is the class
    • Find a model for class attributes as a function of the values of other attributes
    • Goal: previously unseen records should be accurately assigned a class
      • A test set is used to determine accuracy
    • Examples:
      • Direct marketing: targeted mailings based on a buy/don’t-buy class
      • Fraud detection: predict fraudulent use of credit cards, insurance, telephones, etc.
      • Sky survey cataloging: catalog objects as star or galaxy
  • 12. Classification Example: Sky Survey
    • Approach
      • Segment the image
      • Measure image attributes – 40 per object
      • Model the class (star/galaxy or stage) based on the attributes
    • Currently 3K images
    • 23Kx23K pixels
    Images from: http://aps.umn.edu
  • 13. Clustering
    • Given a set of data points:
      • Each has a set of attributes
      • A similarity measure among them
    • Find clusters such that:
      • Points in one cluster are more similar to each other than to points in other clusters
    • Similarity measures are problem specific:
      • E.g., Euclidean distance for continuous data
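As a minimal illustration of such a similarity measure, the sketch below (with made-up points and cluster centers) assigns each point to the nearest of two centers by Euclidean distance:

```python
import math

def euclidean(a, b):
    # Euclidean distance between two equal-length attribute vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(points, centers):
    # Assign each point the index of its nearest center
    return [min(range(len(centers)), key=lambda i: euclidean(p, centers[i]))
            for p in points]

points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (4.9, 5.0)]
centers = [(0.0, 0.0), (5.0, 5.0)]
print(assign(points, centers))  # -> [0, 0, 1, 1]
```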
  • 14. Clustering Applications
    • Market Segmentation:
      • Divide market into distinct subsets
    • Document clustering:
      • Find group of related documents, based on common keywords
      • Used in information retrieval
    • Financial market analysis
      • Find groups of companies with common stock behavior
  • 15. Association Rule Discovery
    • Given a set of records, each containing set of items
      • Produce dependency rules that predict occurrences of an item based on others
    • Applications:
      • Marketing, sales promotion and shelf management
      • Inventory management
    TID | Items
    ----+--------------------------
     1  | Bread, Coke, Milk
     2  | Beer, Bread
     3  | Beer, Coke, Diaper, Milk
     4  | Beer, Bread, Diaper, Milk
     5  | Coke, Diaper, Milk
    Rules discovered: {Milk} → {Coke}, {Diaper, Milk} → {Beer}
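The two rules above can be checked directly on the example transactions; a small sketch computing rule support and confidence (standard definitions, not a specific algorithm from the slides):

```python
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def support(itemset):
    # Fraction of transactions containing every item in itemset
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions.values()) / len(transactions)

def confidence(lhs, rhs):
    # Estimated P(rhs | lhs): support of the union over support of the antecedent
    return support(set(lhs) | set(rhs)) / support(lhs)

print(confidence({"Milk"}, {"Coke"}))            # {Milk} -> {Coke}: 3 of 4 Milk baskets
print(confidence({"Diaper", "Milk"}, {"Beer"}))  # {Diaper, Milk} -> {Beer}: 2 of 3
```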
  • 16. Other Data Mining Problems
    • Sequential Pattern Discovery
      • Given a set of objects, each with a timeline of events
      • Find rules that predict sequential dependencies
      • Example: patterns in telecommunications alarm logs
    • Regression:
      • Predict the value of one variable given the values of others
      • Assume a linear or non-linear model of dependence
      • Examples:
        • Predict sales amounts based on advertising expenditures
        • Predict wind velocities based on temperature, pressure, etc.
    • Deviation Detection
      • Discover most significant change in data from previous values
  • 17. Serial Algorithms for Classification
    • Decision Tree Classifiers
      • Overview of Decision Trees
      • Tree induction
      • Tree pruning
    • Rule-based methods
    • Memory Based Reasoning
    • Neural networks
    • Genetic algorithms
    • Bayesian networks
    Decision trees: inexpensive, easy to interpret, easy to integrate into DBs
  • 18. Decision Tree Algorithms
    • Many Algorithms
      • Hunt’s Algorithm
      • CART
      • ID3, C4.5
      • SLIQ, SPRINT
    Decision tree (induced from the training data below):
      Refund = Yes → NO
      Refund = No, Marital = M → NO
      Refund = No, Marital = S or D, Income <= 80K → NO
      Refund = No, Marital = S or D, Income > 80K → YES
    Tid | Refund | Marital | Income | Cheat
     1  |  Yes   |   S     |  125K  |  No
     2  |  No    |   M     |  100K  |  No
     3  |  No    |   S     |   70K  |  No
     4  |  Yes   |   M     |  120K  |  No
     5  |  No    |   D     |   95K  |  Yes
     6  |  No    |   M     |   60K  |  No
     7  |  Yes   |   D     |  220K  |  No
     8  |  No    |   S     |   85K  |  Yes
     9  |  No    |   M     |   75K  |  No
    10  |  No    |   S     |   90K  |  Yes
  • 19. Tree Induction
    • Greedy strategy
      • Split based on attribute that optimizes splitting criterion
    • Two phases at each node in tree
      • Split determining phase:
        • Which attribute to split
        • How to split
          • Two-way split of multi-valued attribute (Marital: S,D,M)
          • Continuous attributes: discretize in advance, cluster on the fly
      • Splitting phase
        • Do the split and create child nodes
  • 20. GINI Splitting Criterion
    • Gini Index:
      • GINI(t) = 1 − Σ_j [ p(j | t) ]²
      • where p(j | t) is the relative frequency of class j at node t
    • Measures impurity of a node
        • Maximum (1 − 1/n_c) when records are equally distributed among the n_c classes
        • Minimum (0.0) when all records belong to one class, implying the most interesting information
    • Other criteria may be better, but their evaluation is similar
    C1 = 0, C2 = 6: Gini = 0.000
    C1 = 1, C2 = 5: Gini = 0.278
    C1 = 2, C2 = 4: Gini = 0.444
    C1 = 3, C2 = 3: Gini = 0.500
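The table of two-class examples follows directly from the definition; a minimal sketch computing GINI(t) from per-class counts:

```python
def gini(counts):
    # GINI(t) = 1 - sum_j p(j|t)^2, where p(j|t) is computed from class counts
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(gini(counts), 3))
# -> 0.0, 0.278, 0.444, 0.5
```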
  • 21. Splitting Based on GINI
    • Use in CART, SLIQ, SPRINT
    • Criterion: Minimize GINI index of the Split
    • When a node p is split into k partitions (children), the quality of the split is computed as
    • GINI_split = Σ_{j=1..k} (n_j / n) GINI(j)
    • where n_j = number of records at child j
    • and n = number of records at node p
    • To evaluate:
      • Categorical attributes: compute counts of each class
      • Continuous attributes: sort and choose split (1 or more)
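The weighted criterion above is a one-line extension of the node-level GINI; as an assumed illustration, the split on Refund in the earlier ten-record example sends 3 records (all non-cheaters) left and 7 records (3 cheaters, 4 non-cheaters) right:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    # GINI_split = sum over the k child partitions of (n_j / n) * GINI(j)
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

# Refund=Yes -> [0 cheat, 3 no-cheat]; Refund=No -> [3 cheat, 4 no-cheat]
print(round(gini_split([[0, 3], [3, 4]]), 3))  # ~0.343
```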
  • 22. Splitting Based on INFO
    • Information/Entropy:
    • INFO(t) = − Σ_j p(j | t) log p(j | t)
    • Information Gain
    • GAIN_split = INFO(p) − Σ_{j=1..k} (n_j / n) INFO(j)
    • Measures reduction in entropy; choose split to maximize
    • Used in ID3 and C4.5
    • Problem: tends to prefer splits with a large number of partitions
      • Variations avoid this
    • Computation similar to GINI
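To see that the computation mirrors GINI, here is the same Refund split (parent: 3 cheaters, 7 non-cheaters) evaluated with entropy and information gain, using log base 2 as an assumption:

```python
import math

def info(counts):
    # INFO(t) = -sum_j p(j|t) * log2 p(j|t), computed from class counts
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain(parent, children):
    # GAIN_split = INFO(parent) - sum_j (n_j / n) * INFO(j)
    n = sum(parent)
    return info(parent) - sum(sum(c) / n * info(c) for c in children)

# Same split as in the GINI example: choose the split maximizing this value
print(round(gain([3, 7], [[0, 3], [3, 4]]), 3))  # ~0.192
```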
  • 23. C4.5 Classification
    • Simple depth-first construction of tree
    • Sorts continuous attributes at each node
    • Needs to fit data into memory
      • To avoid out-of-core sort
      • Limits scalability
  • 24. SLIQ Classification
    • Arrays of continuous attributes are pre-sorted
    • Classification tree is grown breadth-first
    • Class list structure maintains mapping: record id  node
    • Split determining phase: class list is referred to for computing best split for each attribute. (breadth-first)
    • Splitting phase: the list of the splitting attribute is used to update the leaf labels in the class list. (no physical splitting)
    • Problem: class list is frequently and randomly accessed
      • Required to be in-memory for efficient performance
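A minimal sketch of the class-list idea (illustrative only, with a made-up attribute list; not the full SLIQ algorithm): one array maps each record id to its current leaf, and a split relabels records in place instead of physically partitioning the attribute lists.

```python
# Pre-sorted attribute list of (value, record id) pairs, as SLIQ assumes
income = [(60, 6), (70, 3), (75, 9), (85, 8), (90, 10), (95, 5),
          (100, 2), (120, 4), (125, 1), (220, 7)]

leaf_of = {rid: "root" for _, rid in income}   # the class list: rid -> leaf

def apply_split(attr_list, threshold, node, left, right):
    # Splitting phase: one scan of the splitting attribute's sorted list
    # updates leaf labels in the class list; no data is moved.
    for value, rid in attr_list:
        if leaf_of[rid] == node:
            leaf_of[rid] = left if value <= threshold else right

apply_split(income, 80, "root", "L", "R")
print(sorted(rid for rid in leaf_of if leaf_of[rid] == "L"))  # rids with income <= 80
```

The drawback noted on the slide shows up here: every scan of any attribute list probes `leaf_of` at random record ids, so the class list must stay in memory.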
  • 25. SLIQ Example
  • 26. SPRINT
    • Arrays of continuous attributes are presorted
      • Sorted order is maintained during splits
    • Classification tree is grown breadth-first
    • Attribute lists are physically split among nodes
    • Split determining phase is just a linear scan of the lists at each node
    • Hashing scheme used in splitting phase
      • IDs of the splitting attribute are hashed with the tree node
      • Remaining attribute arrays are split by querying this hash table
    • Problem: the hash table is O(N) at the root
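A simplified sketch of the hashing scheme (illustrative data, not the full SPRINT algorithm): the sorted list of the splitting attribute decides each record's child, a hash table records rid → child, and the remaining attribute lists are partitioned by probing it, which preserves their sorted order.

```python
# Two pre-sorted attribute lists of (value, record id) pairs
income = [(60, 6), (70, 3), (75, 9), (85, 8), (90, 10)]  # splitting attribute
age    = [(23, 10), (31, 3), (40, 8), (55, 6), (62, 9)]  # another attribute

# Hash the splitting attribute's rids with the child they fall into
side = {rid: ("left" if v <= 80 else "right") for v, rid in income}

# Split the remaining attribute list by querying the hash table; because we
# scan it in sorted order, each half stays sorted with no re-sort needed.
left_age  = [(v, r) for v, r in age if side[r] == "left"]
right_age = [(v, r) for v, r in age if side[r] == "right"]
print(left_age)   # [(31, 3), (55, 6), (62, 9)]
print(right_age)  # [(23, 10), (40, 8)]
```

The O(N) problem on the slide is visible too: at the root, `side` holds an entry for every record.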
  • 27. Parallel Algorithms for Classification
    • Driven by need to handle large data sets
      • Larger aggregate memory on parallel machines
      • Scales on cluster architecture
    • I/O time dominates
      • More difficult to analyze benefits (cost/performance) than simple MFLOP-limited problem
      • I.e., buy disks for parallel bandwidth vs. processors + memory
  • 28. Parallel Tree Construction: Approach 1
    • First approach: partition data, data-parallel operations across nodes
      • Global reduction per node
      • Large number of nodes is expensive
  • 29. Parallel Tree Construction: Approach 2
    • Task parallelism: exploit parallelism between nodes
      • Load imbalance as number of records vary
      • Locality: child/parent need same data
  • 30. Parallel Tree Construction: Hybrid Approach
    • Switch from data parallelism (within a tree node) to task parallelism (between nodes) when:
    • total communication cost >= moving cost + load-balancing cost
    • Splitting ensures:
      • Communication cost <= 2 * Optimal-Communication-cost
  • 31. Continuous Data
    • Continuous data adds two requirements to parallel mining algorithms:
      • Parallel sort
        • Essentially a transpose of data – all to all
      • Parallel hashing
        • Random small accesses
    • Both are very hard to do efficiently on current machines
  • 32. Performance Results from ScalParC
    • Parallel running time on Cray T3E
  • 33. Performance Results from ScalParC
    • Runtime with constant size per processor, also T3E