
  • CS 267: Applications of Parallel Computers Lecture 25: Data Mining Kathy Yelick Material based on lecture by Vipin Kumar and Mahesh Joshi http://www-users.cs.umn.edu/~mjoshi/hpdmtut/
  • Lecture Schedule
    • 12/3: 3 things
    • Projects and performance analysis
    • (N-body assignment observations)
    • Data Mining
    • HKN Review at 3:40
    • 12/5: The Future of Parallel Computing
    • David Bailey
    • 12/13: CS267 Poster Session (2-4pm, Woz)
    • 12/14: Final Papers due
  • N-Body Assignment
    • Some observations on your N-Body assignments
      • Problems and pitfalls to avoid in final project
    • Performance analysis
      • Micro-benchmarks are good
        • To understand application performance, build up performance model from measured pieces, e.g., network performance
      • Noise is expected, but quantifying it is also useful
        • Means, alone, can be confusing
        • Median + variance is good
      • Carefully select problem sizes
        • Are they large enough to justify the # of processors?
        • What do real users want?
        • Can you vary the problem size in some reasonable way?
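The point about means versus median and variance can be illustrated with a minimal Python sketch. The helper names `time_trials` and `summarize` and the timing values are hypothetical, chosen only for illustration; they are not part of the assignment.

```python
import statistics
import time

def time_trials(fn, trials=9):
    """Run fn() several times and return the wall-clock time of each run in seconds."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples

def summarize(samples):
    """Report mean, median, and sample variance so that noise is visible."""
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "variance": statistics.variance(samples),
    }

# Hypothetical timings (seconds); one run was disturbed by other activity.
example = [1.02, 0.98, 1.01, 0.99, 1.00, 3.50, 1.03]
stats = summarize(example)
print(stats)
```

In this made-up sample the single outlier pulls the mean well above 1.3 s while the median stays at 1.01 s, which is why reporting the mean alone can mislead.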
  • N-Body Assignment
    • Minor comments on N-Body Results
      • Describe performance graphs – what is expected, surprising
      • Sanity check your numbers
        • Are you getting more than a P-fold speedup on P processors?
        • Does the observed running time (“time command”) match total?
        • What is your Mflops rate? Is it between 10 and 90% of HW peak?
      • Be careful of different timers
        • gettimeofday is wall-clock time (you are charged for the OS and other processes)
        • clock is process time (Linux creates a process per thread)
        • RT clock on Cray is wall clock time
      • Check captions, titles, axes of figures/graphs
      • Run spell checker
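The difference between wall-clock and process timers, and the speedup sanity check, can be sketched in Python. This is an analogy only: `time.perf_counter` and `time.process_time` stand in for the C timers named on the slide, and `measure` and `check_speedup` are hypothetical helper names.

```python
import time

def measure(fn):
    """Return (wall-clock seconds, CPU seconds of this process) for one call of fn."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time of the current process only
    fn()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start

def check_speedup(t_serial, t_parallel, p):
    """Return the speedup and whether it passes the 'no more than P on P processors' check."""
    speedup = t_serial / t_parallel
    return speedup, speedup <= p

# Sleeping consumes wall-clock time but almost no CPU time.
wall, cpu = measure(lambda: time.sleep(0.2))
print(f"wall={wall:.3f}s cpu={cpu:.3f}s")

# Hypothetical timings: 100 s serial, 20 s on 4 processors looks suspicious.
print(check_speedup(100.0, 20.0, 4))
```

A speedup above P is not always a bug (cache effects can cause it), but it is a signal to recheck the measurements.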
  • Outline
    • Overview of Data Mining
    • Serial Algorithms for Classification
    • Parallel Algorithms for Classification
    • Summary
  • Data Mining Overview
    • What is Data Mining?
    • Data Mining Tasks
      • Classification
      • Clustering
      • Association Rules and Sequential Patterns
      • Regression
      • Deviation Detection
  • What is Data Mining?
    • Several definitions:
      • Search for valuable information in large volumes of data
      • Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover useful rules
      • A step in the Knowledge Discovery in Databases (KDD) process
  • Knowledge Discovery Process
    • Knowledge Discovery in Databases: identify valid, novel, useful, and understandable patterns in data
    [Figure: operational databases → (clean, collect, summarize) → data warehouse → (data preparation) → training data → (data mining) → model, patterns → (verification and evaluation)]
  • Why Mine Data?
    • Data collected and stored at enormous rate
      • Remote sensor on a satellite
      • Telescope scanning the skies
      • Microarrays generating gene expressions
      • Scientific simulations
    • Traditional techniques infeasible
    • Data mining for data reduction
      • Cataloging, classifying, segmenting
      • Help scientists formulate hypotheses
  • Data Mining Tasks
    • Predictive Methods: Use some variables to predict unknown or future values of other variables
      • Classification
      • Regression
      • Deviation Detection
    • Descriptive Methods: Find human-interpretable patterns that describe data
      • Clustering
      • Association Rule Discovery
      • Sequential Pattern Discovery
  • Classification
    • Given a collection of records (training set)
      • Each record contains a set of attributes, one of which is the class
    • Find a model for the class attribute as a function of the values of the other attributes
    • Goal: previously unseen records should be accurately assigned a class
      • A test set is used to determine accuracy
    • Examples:
      • Direct marketing: targeted mailings based on buy/don’t class
      • Fraud detection: predict fraudulent use of credit cards, insurance, telephones, etc.
      • Sky survey cataloging: classify objects as star or galaxy
  • Classification Example: Sky Survey
    • Approach
      • Segment the image
      • Measure image attributes – 40 per object
      • Model the class (star/galaxy or stage) based on the attributes
    • Currently 3K images
    • 23Kx23K pixels
    Images from: http://aps.umn.edu
  • Clustering
    • Given a set of data points:
      • Each has a set of attributes
      • A similarity measure among them
    • Find clusters such that:
      • Points in one cluster are more similar to each other than to points in other clusters
    • Similarity measures are problem specific:
      • E.g., Euclidean distance for continuous data
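As one concrete instance of clustering with a Euclidean similarity measure, here is a minimal k-means sketch. k-means is not named in the slides; it is used here only as a familiar example, and the points and starting centroids are made up.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
            for p in points]

def update(points, labels, old_centroids):
    """Move each centroid to the mean of its members (keep it if it has none)."""
    new = []
    for c, old in enumerate(old_centroids):
        members = [p for p, l in zip(points, labels) if l == c]
        if members:
            new.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
        else:
            new.append(old)
    return new

def kmeans(points, centroids, max_iters=100):
    """Alternate assignment and update steps until the labels stop changing."""
    centroids = list(centroids)
    labels = assign(points, centroids)
    for _ in range(max_iters):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Hypothetical 2-D data with two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```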
  • Clustering Applications
    • Market Segmentation:
      • Divide market into distinct subsets
    • Document clustering:
      • Find groups of related documents, based on common keywords
      • Used in information retrieval
    • Financial market analysis
      • Find groups of companies with common stock behavior
  • Association Rule Discovery
    • Given a set of records, each containing a set of items
      • Produce dependency rules that predict occurrences of an item based on others
    • Applications:
      • Marketing, sales promotion and shelf management
      • Inventory management
    Rules: {Milk} → {Coke}; {Diaper, Milk} → {Beer}
    TID  Items
    1    Bread, Coke, Milk
    2    Beer, Bread
    3    Beer, Coke, Diaper, Milk
    4    Beer, Bread, Diaper, Milk
    5    Coke, Diaper, Milk
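A small sketch of how such rules might be scored on the five transactions from this slide. Support and confidence are the standard measures for association rules, but the slide does not name them, so treat them as an assumption here; the function names are illustrative.

```python
# The five transactions from the slide, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

def support(lhs, rhs):
    """Fraction of all transactions that contain both sides of the rule."""
    return count(lhs | rhs) / len(transactions)

def confidence(lhs, rhs):
    """Among transactions containing lhs, the fraction that also contain rhs."""
    return count(lhs | rhs) / count(lhs)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "->", sorted(rhs),
          "support", support(lhs, rhs), "confidence", confidence(lhs, rhs))
```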
  • Other Data Mining Problems
    • Sequential Pattern Discovery
      • Given a set of objects, each with a timeline of events
      • Find rules that predict sequential dependencies
      • Example: patterns in telecommunications alarm logs
    • Regression:
      • Predict the value of one variable given others
      • Assume a linear or non-linear model of dependence
      • Examples:
        • Predict sales amounts based on advertising expenditures
        • Predict wind velocities based on temperature, pressure, etc.
    • Deviation Detection
      • Discover most significant change in data from previous values
  • Serial Algorithms for Classification
    • Decision Tree Classifiers
      • Overview of Decision Trees
      • Tree induction
      • Tree pruning
    • Rule-based methods
    • Memory Based Reasoning
    • Neural networks
    • Genetic algorithms
    • Bayesian networks
    Decision trees: inexpensive, easy to interpret, easy to integrate into DBs
  • Decision Tree Algorithms
    • Many Algorithms
      • Hunt’s Algorithm
      • CART
      • ID3, C4.5
      • SLIQ, SPRINT
    Training data:
    Tid  Refund  Marital  Income  Cheat
    1    Yes     S        125K    No
    2    No      M        100K    No
    3    No      S        70K     No
    4    Yes     M        120K    No
    5    No      D        95K     Yes
    6    No      M        60K     No
    7    Yes     D        220K    No
    8    No      S        85K     Yes
    9    No      M        75K     No
    10   No      S        90K     Yes
    Decision tree:
    Refund?
      Yes → NO
      No → Marital?
        M → NO
        S,D → Income?
          <=80K → NO
          >80K → YES
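To make the example tree concrete, the sketch below encodes it as a plain Python function and applies it to the ten training records. The arrangement of branches is reconstructed from the slide's figure labels, so treat it as an assumption; `classify` is a hypothetical name.

```python
# Training records from the slide: (tid, refund, marital, income in thousands, cheat).
records = [
    (1, "Yes", "S", 125, "No"),
    (2, "No", "M", 100, "No"),
    (3, "No", "S", 70, "No"),
    (4, "Yes", "M", 120, "No"),
    (5, "No", "D", 95, "Yes"),
    (6, "No", "M", 60, "No"),
    (7, "Yes", "D", 220, "No"),
    (8, "No", "S", 85, "Yes"),
    (9, "No", "M", 75, "No"),
    (10, "No", "S", 90, "Yes"),
]

def classify(refund, marital, income_k):
    """Walk the tree: Refund first, then Marital status, then Income."""
    if refund == "Yes":
        return "No"
    if marital == "M":
        return "No"
    # Single or divorced with no refund: decide on income.
    return "Yes" if income_k > 80 else "No"

predictions = [classify(refund, marital, income)
               for _tid, refund, marital, income, _cheat in records]
correct = sum(p == rec[4] for p, rec in zip(predictions, records))
print(f"{correct} of {len(records)} training records classified correctly")
```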
  • Tree Induction
    • Greedy strategy
      • Split based on attribute that optimizes splitting criterion
    • Two phases at each node in tree
      • Split determining phase:
        • Which attribute to split
        • How to split
          • Two-way split of multi-valued attribute (Marital: S,D,M)
          • Continuous attributes: discretize in advance, cluster on the fly
      • Splitting phase
        • Do the split and create child nodes
  • GINI Splitting Criterion
    • Gini Index:
      • $\mathrm{GINI}(t) = 1 - \sum_j [p(j \mid t)]^2$
      • where $p(j \mid t)$ is the relative frequency of class j at node t
    • Measures impurity of a node
      • Maximum ($1 - 1/n_c$, for $n_c$ classes) when records are equally distributed among the classes
      • Minimum (0) when all records belong to one class, implying most interesting information
    • Other criteria may be better, but similar evaluation
    Example class counts at a node and the resulting Gini index:
    C1  C2  Gini
    0   6   0.00
    1   5   0.28
    2   4   0.44
    3   3   0.50
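A minimal sketch of the Gini index as defined on this slide, evaluated on a few two-class count vectors. The function name `gini` and the choice to return 0 for an empty node are assumptions made for illustration.

```python
def gini(counts):
    """Gini index of a node given the number of records in each class."""
    n = sum(counts)
    if n == 0:
        return 0.0  # assumption: treat an empty node as pure
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Class counts (C1, C2) for four example nodes of six records each.
for counts in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(counts, round(gini(counts), 2))
```

With two classes the maximum is 1 - 1/2 = 0.5, reached by the evenly split node (3, 3).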
  • Splitting Based on GINI
    • Used in CART, SLIQ, SPRINT
    • Criterion: Minimize GINI index of the Split
    • When a node p is split into k partitions (children), the quality of the split is computed as
    • $\mathrm{GINI}_{split} = \sum_{j=1}^{k} \frac{n_j}{n}\,\mathrm{GINI}(j)$
    • where $n_j$ = number of records at child j
    • and n = number of records at node p
    • To evaluate:
      • Categorical attributes: compute counts of each class
      • Continuous attributes: sort and choose split (1 or more)
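The evaluation of a continuous attribute can be sketched as one sort followed by a linear sweep that maintains running class counts. The example uses income (in thousands) and the cheat label for ten records; trying midpoints between adjacent distinct values is an assumption of this sketch, not something the slide specifies.

```python
from collections import Counter

def gini(counts):
    """Gini index of a node given per-class record counts."""
    n = sum(counts)
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted Gini of a split; partitions is a list of per-class count lists."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

def best_continuous_split(values, labels):
    """Sort once, then sweep candidate thresholds while updating class counts."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    left, right = Counter(), Counter(labels)
    best_score, best_threshold = float("inf"), None
    for i in range(len(pairs) - 1):
        value, label = pairs[i]
        left[label] += 1
        right[label] -= 1
        next_value = pairs[i + 1][0]
        if value == next_value:
            continue  # no threshold can separate equal values
        score = gini_split([[left[c] for c in classes],
                            [right[c] for c in classes]])
        if score < best_score:
            best_score, best_threshold = score, (value + next_value) / 2
    return best_score, best_threshold

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
score, threshold = best_continuous_split(income, cheat)
print(score, threshold)
```

For categorical attributes the same `gini_split` applies directly to the per-value class counts, with no sort needed.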
  • Splitting Based on INFO
    • Information/Entropy:
    • $\mathrm{INFO}(t) = -\sum_{j} p(j \mid t) \log p(j \mid t)$
    • Information Gain
    • $\mathrm{GAIN}_{split} = \mathrm{INFO}(p) - \sum_{j=1}^{k} \frac{n_j}{n}\,\mathrm{INFO}(j)$
    • Measures reduction in entropy; choose split to maximize
    • Used in ID3 and C4.5
    • Problem: tends to prefer splits that produce a large number of partitions
      • Variations avoid this
    • Computation similar to GINI
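A short sketch of entropy and information gain on class counts. The logarithm base is not given on the slide; base 2 is assumed here, and the function names are illustrative. The last lines show the bias noted above: a split that gives every record its own partition achieves the largest possible gain.

```python
import math

def info(counts):
    """Entropy of a node from per-class record counts (log base 2 assumed)."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain(parent, children):
    """Reduction in entropy when parent is split into the given children."""
    n = sum(parent)
    return info(parent) - sum(sum(ch) / n * info(ch) for ch in children)

parent = [7, 3]                       # 7 records of one class, 3 of the other
two_way = [[3, 3], [4, 0]]            # a two-way split of the ten records
one_per_record = [[1, 0]] * 7 + [[0, 1]] * 3  # e.g. splitting on a unique ID

print(round(gain(parent, two_way), 4))
print(round(gain(parent, one_per_record), 4))  # equals info(parent)
```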
  • C4.5 Classification
    • Simple depth-first construction of tree
    • Sorts continuous attributes at each node
    • Needs to fit data into memory
      • To avoid out-of-core sort
      • Limits scalability
  • SLIQ Classification
    • Arrays of continuous attributes are pre-sorted
    • Classification tree is grown breadth-first
    • Class list structure maintains mapping: record id → node
    • Split determining phase: class list is referred to for computing best split for each attribute. (breadth-first)
    • Splitting phase: the list of the splitting attribute is used to update the leaf labels in the class list. (no physical splitting)
    • Problem: class list is frequently and randomly accessed
      • Required to be in-memory for efficient performance
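The roles of the attribute list and the class list can be sketched with plain Python containers. This is a simplified illustration under assumed data layouts (a sorted list of `(value, record id)` pairs and a dictionary for the class list), not the actual SLIQ implementation; the leaf names are hypothetical.

```python
from collections import defaultdict

# Pre-sorted attribute list for one continuous attribute: (value, record id).
income_list = sorted([(125, 1), (100, 2), (70, 3), (120, 4), (95, 5),
                      (60, 6), (220, 7), (85, 8), (75, 9), (90, 10)])

# Class list: record id -> [class label, current leaf]. All records start at the root.
labels = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
class_list = {rid: [label, "root"] for rid, label in enumerate(labels, start=1)}

def class_histograms(attribute_list, class_list):
    """Sequential scan of an attribute list; each entry probes the class list
    (a random access) to find the record's class and current leaf."""
    hist = defaultdict(lambda: defaultdict(int))
    for _value, rid in attribute_list:
        label, leaf = class_list[rid]
        hist[leaf][label] += 1
    return hist

def apply_split(attribute_list, class_list, leaf, threshold, left, right):
    """Splitting phase: only leaf labels in the class list change;
    the attribute list itself is not moved or partitioned."""
    for value, rid in attribute_list:
        if class_list[rid][1] == leaf:
            class_list[rid][1] = left if value <= threshold else right

apply_split(income_list, class_list, "root", 97.5, "L", "R")
hist = class_histograms(income_list, class_list)
print({leaf: dict(counts) for leaf, counts in hist.items()})
```

Every entry of every attribute list triggers a lookup in `class_list`, which is why that structure needs to stay in memory.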
  • SPRINT Classification
    • Arrays of continuous attributes are presorted
      • Sorted order is maintained during splits
    • Classification tree is grown breadth-first
    • Attribute lists are physically split among nodes
    • Split determining phase is just a linear scan of the lists at each node
    • Hashing scheme used in splitting phase
      • Record IDs from the splitting attribute's list are hashed to their tree node
      • Remaining attribute arrays are split by querying this hash table
    • Problems: Hash table is O(N) at root
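The hash-based splitting phase described above can be sketched as follows. The four records and the second attribute (`age`) are made up for illustration, and the attribute-list layout `(value, class, record id)` is an assumption of this sketch.

```python
def split_attribute_lists(lists, split_attr, threshold):
    """Split every attribute list between two children.

    lists maps attribute name -> list of (value, class, record id),
    each already sorted by value.
    """
    # Hash table built from the splitting attribute: record id -> child.
    child_of = {rid: ("left" if value <= threshold else "right")
                for value, _cls, rid in lists[split_attr]}
    children = {"left": {}, "right": {}}
    for name, entries in lists.items():
        for side in children:
            children[side][name] = []
        # Scanning in order and appending keeps each child's list sorted.
        for entry in entries:
            children[child_of[entry[2]]][name].append(entry)
    return children, child_of

# Hypothetical data: four records with two continuous attributes.
lists = {
    "income": [(70, "No", 2), (85, "Yes", 4), (95, "Yes", 3), (125, "No", 1)],
    "age": [(25, "Yes", 3), (30, "No", 1), (45, "No", 2), (50, "Yes", 4)],
}
children, child_of = split_attribute_lists(lists, "income", threshold=90)
print(children["left"]["age"])
print(children["right"]["age"])
```

The hash table holds one entry per record at the node being split, so at the root it has as many entries as the whole training set.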
  • Parallel Algorithms for Classification
    • Driven by need to handle large data sets
      • Larger aggregate memory on parallel machines
      • Scales on cluster architecture
    • I/O time dominates
      • More difficult to analyze benefits (cost/performance) than simple MFLOP-limited problem
      • I.e., buy disks for parallel Bandwidth vs. Processors+Memory
  • Parallel Tree Construction: Approach 1
    • First approach: partition data, data-parallel operations across nodes
      • Global reduction per node
      • Large number of nodes is expensive
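The data-parallel approach can be simulated serially: each "processor" counts classes over its own partition of the records, and a reduction sums the local counts. This sketch uses no real message passing; `global_reduce` is a stand-in for a global reduction, and the round-robin partitioning is an assumption.

```python
from collections import Counter
from functools import reduce

# (marital status, cheat) pairs for ten records at one tree node.
records = [("S", "No"), ("M", "No"), ("S", "No"), ("M", "No"), ("D", "Yes"),
           ("M", "No"), ("D", "No"), ("S", "Yes"), ("M", "No"), ("S", "Yes")]

def local_counts(partition):
    """Work done independently on each processor: count (value, class) pairs."""
    return Counter(partition)

def global_reduce(all_local):
    """Stand-in for a global reduction: element-wise sum of the local counts."""
    return reduce(lambda a, b: a + b, all_local, Counter())

P = 3  # number of simulated processors
partitions = [records[i::P] for i in range(P)]  # round-robin distribution
total = global_reduce(local_counts(part) for part in partitions)
print(dict(total))
```

One such reduction is needed for every tree node being expanded, which is why a large number of nodes makes this approach expensive.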
  • Parallel Tree Construction: Approach 2
    • Task parallelism: exploit parallelism between nodes
      • Load imbalance as the number of records varies
      • Locality: child/parent need same data
  • Parallel Tree Construction: Hybrid Approach
    • Switch from data to task parallelism (within a node to between nodes) when: total communication cost >= moving cost + load balancing cost
    • Splitting ensures:
      • Communication cost <= 2 * Optimal-Communication-cost
  • Continuous Data
    • Parallel mining with continuous data adds:
      • Parallel sort
        • Essentially a transpose of data – all to all
      • Parallel hashing
        • Random small accesses
    • Both are very hard to do efficiently on current machines
  • Performance Results from ScalParC
    • Parallel running time on Cray T3E
  • Performance Results from ScalParC
    • Runtime with constant size per processor, also T3E