Bellwether Analysis




                            Data Mining

            (with many slides due to Gehrke, Garofalakis,...
Introduction




TECS 2007, Data Mining   Bee-Chung Chen, Raghu Ramakrishnan, Jude Shavlik, Pradeep Tamma Ramakrishnan, Ya...
Definition
            Data mining is the exploration and analysis of large quantities of data in
               order to ...
Case Study: Bank
       • Business goal: Sell more home equity loans
       • Current models:
              – Customers wi...
Case Study: Bank (Contd.)

      1.         Select subset of customer records who have received
                 home equi...
Case Study: Bank (Contd.)

        2.       Find rules to predict whether a customer would
                 respond to hom...
Case Study: Bank (Contd.)
       3. Group customers into clusters and investigate
          clusters



                  ...
Case Study: Bank (Contd.)



           4. Evaluate results:
                  –      Many "uninteresting" clusters
      ...
Example: Bank
                                       (Contd.)



            Action:
            • New marketing campaign
...
Example Application: Fraud Detection



           • Industries: Health care, retail, credit card
             services, t...
Fraud Detection (Contd.)
       • Examples:
              – Auto insurance: Detect groups of people who stage accidents to...
Other Example Applications

           •    CPG: Promotion analysis
           •    Retail: Category management
          ...
What is a Data Mining Model?

           A data mining model is a description of a certain aspect
             of a datase...
Data Mining Methods




Overview
       • Several well-studied tasks
              – Classification
              – Clustering
              – Fre...
Classification
       Goal:
            Learn a function that assigns a record to one of several
            predefined cl...
Classification

         Example application: telemarketing




Classification (Contd.)

          • Decision trees are one approach to
            classification.
          • Other appr...
Classification Example

   •    Training database:                                               Age Car               Cla...
Classification Problem

       • If Y is categorical, the problem is a classification
         problem, and we use C inste...
Regression Problem
       • If Y is numerical, the problem is a regression problem.
       • Y is called the dependent var...
Regression Example

       •     Example training database                                      Age          Car          ...
Decision Trees




What are Decision Trees?



                                     Age                                         Minivan
     ...
Decision Trees
       • A decision tree T encodes d (a classifier or
         regression function) in form of a tree.
    ...
Internal Nodes
       • Each internal node has an associated splitting
         predicate. Most common are binary predicat...
Leaf Nodes

           Consider leaf node t:
           • Classification problem: Node t is labeled with
             one ...
Example

                                                                            Encoded classifier:
                 ...
Issues in Tree Construction

           •     Three algorithmic components:
                  –      Split Selection Metho...
Top-Down Tree Construction

         BuildTree(Node n, Training database D,
                    Split Selection Method S)
...
Split Selection Method
       •    Numerical Attribute: Find a split point that
            separates the (two) classes


...
Split Selection Method (Contd.)
  •    Categorical Attributes: How to group?

  Sport:                            Truck:  ...
Impurity-based Split Selection Methods

        •    Split selection method has two parts:
               –   Search space...
Data Access Method

       •    Goal: Scalable decision tree construction, using
            the complete training databas...
AVC-Sets
        Training Database                                                      AVC-Sets
              Age       ...
Motivation for Data Access Methods


                                                        Age
                         ...
RainForest Algorithms: RF-Hybrid
       First scan:

                                                                     ...
RainForest Algorithms: RF-Hybrid

   Second Scan:                                       Build AVC sets for children of the...
RainForest Algorithms: RF-Hybrid
       Third Scan:                                             As we expand the tree, we ...
RainForest Algorithms: RF-Hybrid
       Further optimization: While writing partitions, concurrently build AVC-groups of
 ...
CLUSTERING




Problem
       • Given points in a multidimensional space, group
         them into a small number of clusters, using
    ...
Clustering
         • Output: (k) groups of records called clusters, such that
           the records within a group are m...
Clustering (Contd.)

            • Requirements: Need to define "similarity"
              between records
            • I...
Approaches

           • Centroid-based: Assume we have k
             clusters, guess at the centers, assign points
     ...
Scalable Clustering Algorithms for Numeric
                            Attributes

                                       ...
Birch [ZRL96]
          Pre-cluster data points using "CF-tree" data structure




Clustering Feature (CF)




                         Allows incremental merging of clusters!


Points to Note
       • Basic algorithm works in a single pass to
         condense metric data using spherical
         s...
Market Basket Analysis:
                           Frequent Itemsets




Market Basket Analysis
       • Consider shopping cart filled with several items
       • Market basket analysis tries to ...
Market Basket Analysis

                                                          TID       CID       Date         Item   ...
Market Basket Analysis (Contd.)
   • Co-occurrences
          – 80% of all customers purchase items X, Y and Z
           ...
Confidence and Support

        We prune the set of all possible association rules
          using two interestingness mea...
Example
       • Example rule:                                       TID       CID       Date          Item       Qty
    ...
Exercise
       • Can you find all itemsets TID                             CID       Date         Item        Qty
       ...
Exercise
       • Can you find all                               TID        CID       Date         Item        Qty
       ...
Extensions

          • Imposing constraints
                 – Only find rules involving the dairy department
           ...
Market Basket Analysis: Applications

           • Sample Applications
                  –      Direct marketing
         ...
DBMS Support for DM




Why Integrate DM into a DBMS?



                         Copy                         Mine                 Models

      ...
Integration Objectives
           • Avoid isolation of                                   • Make it possible to add
       ...
SQL/MM: Data Mining
       • A collection of classes that provide a standard
         interface for invoking DM algorithms...
DATA MINING SUPPORT IN MICROSOFT
                      SQL SERVER *




       * Thanks to Surajit Chaudhuri for permissio...
Key Design Decisions

       • Adopt relational data representation
              – A Data Mining Model (DMM) as a "tabula...
DM Concepts to Support


       •    Representation of input (cases)
       •    Representation of models
       •    Spec...
What are "Cases"?


       • DM algorithms analyze "cases"
       • The "case" is the entity being categorized and classif...
Cases as Records: Examples


                                                                                           Ag...
Types of Columns
                                         Marital                                Product Purchases
       ...
More on Columns
       • Properties describe attributes
              – Can represent generalization hierarchy
       • Di...
Representing a DMM                                                  Age
                                                  ...
CREATE MINING MODEL
                                                                         Name of model

    CREATE MIN...
CREATE MINING MODEL
    CREATE MINING MODEL [Age Prediction]
    (
    [Customer ID] LONG        KEY,
    [Gender]        ...
Training a DMM
  • Training a DMM requires passing it "known" cases
  • Use an INSERT INTO in order to "insert" the data t...
INSERT INTO

           INSERT INTO [Age Prediction]
           (
           [Gender],[Hair Color], [Age]
           )
   ...
Executing Insert Into
       • The DMM is trained
              – The model can be retrained or incrementally refined
    ...
What are Predictions?


       • Predictions apply the trained model to estimate
         missing attributes in a data set...
Prediction Join


SELECT [Customers].[ID],
      MyDMM.[Age],
      PredictProbability(MyDMM.[Age])
FROM
  MyDMM PREDICTIO...
Exploratory Mining:
                         Combining OLAP and DM




Databases and Data Mining
       • What can database systems offer in the grand
         challenge of understanding and le...
Databases and Data Mining
       • What can database systems offer in the grand
         challenge of understanding and le...
Multidimensional Data Model

         • One fact table D=(X,M)
                – X=X1, X2, ... Dimension attributes
      ...
Multidimensional Data

                                                  Automobile
              3



                   ...
Cube Space

       • Cube space: C = E_X1 × E_X2 × … × E_Xd
       • Region: Hyper rectangle in cube space
              – c = (v1,...
OLAP Over Imprecise Data

      with Doug Burdick, Prasad Deshpande, T.S. Jayram, and
                        Shiv Vaithya...
Imprecise Data

                                                  Automobile
              3



                          ...
Querying Imprecise Facts

           Auto = F150
           Loc = MA
           SUM(Repair) = ???                         ...
Allocation (1)




                         Truck
                                                               FactID   ...
Allocation (2)

                                                    (Huh? Why 0.5 / 0.5?
                                 ...
Allocation (3)
       Auto = F150
       Loc = MA
       SUM(Repair) = 150                            Query the Extended D...
Allocation Policies
       • The procedure for assigning allocation weights
         is referred to as an allocation polic...
Allocation Policy: Count

                                                                                  Count (c1)    ...
Allocation Policy: Measure

                                                                                  Sales(c1)   ...
Allocation Policy Template
                  Count (c1)              Q(c1)           Sales(c1)
                           ...
What is a Good Allocation Policy?

  Query: COUNT                                     Truck

                             ...
Desideratum I: Consistency


                                   Truck                                    • Consistency
   ...
Desideratum II: Faithfulness

            Data Set 1                                Data Set 2                            ...
Results on Query Semantics
         • Evaluating queries over extended data model yields
           expected value of the ...
F150      Sierra
                                                                                                      Imp...
Query Semantics

         • Given all possible worlds together with their
           probabilities, queries are easily ans...
Exploratory Mining:
                              Prediction Cubes

                    with Bee-Chung Chen, Lei Chen, and Y...
The Idea

           • Build OLAP data cubes in which cell values represent
             decision/prediction behavior
    ...
Example (1/7): Regular OLAP

                                                                             Z: Dimensions Y:...
Example (2/7): Regular OLAP
Goal: Look for patterns of unusually                                           Z: Dimensions Y...
Example (3/7): Decision Analysis
                     Goal: Analyze a bank’s loan decision process
                      w...
Example (3/7): Decision Analysis

       •          Are there branches (and time windows) where
                  approval...
Example (4/7): Prediction Cubes

                                                                            1. Build a mo...
Example (5/7): Model-Similarity

   Given:                                                             Data table D
      ...
Example (6/7): Predictiveness

   Given:                                                            Data table D
         ...
Data Mining (with Many Slides due to Gehrke, Garofalakis, Rastogi)
Transcript of "Data Mining (with Many Slides due to Gehrke, Garofalakis, Rastogi)"

  1. 1. Bellwether Analysis Data Mining (with many slides due to Gehrke, Garofalakis, Rastogi) Raghu Ramakrishnan Yahoo! Research University of Wisconsin–Madison (on leave) TECS 2007 R. Ramakrishnan, Yahoo! Research
  2. 2. Introduction TECS 2007, Data Mining Bee-Chung Chen, Raghu Ramakrishnan, Jude Shavlik, Pradeep Tamma Ramakrishnan, Yahoo! Research R. 2
  3. 3. Definition
     Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
     • Valid: The patterns hold in general.
     • Novel: We did not know the pattern beforehand.
     • Useful: We can devise actions from the patterns.
     • Understandable: We can interpret and comprehend the patterns.
  4. 4. Case Study: Bank
     • Business goal: Sell more home equity loans
     • Current models:
       – Customers with college-age children use home equity loans to pay for tuition
       – Customers with variable income use home equity loans to even out their stream of income
     • Data:
       – Large data warehouse
       – Consolidates data from 42 operational data sources
  5. 5. Case Study: Bank (Contd.)
     1. Select the subset of customer records for customers who have received a home equity loan offer
        – Customers who declined
        – Customers who signed up

     Income     Number of Children   Average Checking Account Balance   …   Response
     $40,000    2                    $1500                                   Yes
     $75,000    0                    $5000                                   No
     $50,000    1                    $3000                                   No
     …          …                    …                                  …   …
  6. 6. Case Study: Bank (Contd.)
     2. Find rules to predict whether a customer would respond to a home equity loan offer
        IF (Salary < 40k) and (numChildren > 0) and (ageChild1 > 18 and ageChild1 < 22) THEN YES
        …
  7. 7. Case Study: Bank (Contd.)
     3. Group customers into clusters and investigate clusters
     [Figure: customers grouped into four clusters, Group 1 through Group 4]
  8. 8. Case Study: Bank (Contd.)
     4. Evaluate results:
        – Many "uninteresting" clusters
        – One interesting cluster! Customers with both business and personal accounts; unusually high percentage of likely respondents
  9. 9. Example: Bank (Contd.)
     Action:
     • New marketing campaign
     Result:
     • Acceptance rate for home equity offers more than doubled
  10. 10. Example Application: Fraud Detection
     • Industries: Health care, retail, credit card services, telecom, B2B relationships
     • Approach:
       – Use historical data to build models of fraudulent behavior
       – Deploy models to identify fraudulent instances
  11. 11. Fraud Detection (Contd.)
     • Examples:
       – Auto insurance: Detect groups of people who stage accidents to collect insurance
       – Medical insurance: Fraudulent claims
       – Money laundering: Detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
       – Telecom industry: Find calling patterns that deviate from a norm (origin and destination of the call, duration, time of day, day of week)
  12. 12. Other Example Applications
     • CPG: Promotion analysis
     • Retail: Category management
     • Telecom: Call usage analysis, churn
     • Healthcare: Claims analysis, fraud detection
     • Transportation/Distribution: Logistics management
     • Financial Services: Credit analysis, fraud detection
     • Data service providers: Value-added data analysis
  13. 13. What is a Data Mining Model?
     A data mining model is a description of a certain aspect of a dataset. It produces output values for an assigned set of inputs.
     Examples:
     • Clustering
     • Linear regression model
     • Classification model
     • Frequent itemsets and association rules
     • Support Vector Machines
  14. 14. Data Mining Methods
  15. 15. Overview
     • Several well-studied tasks
       – Classification
       – Clustering
       – Frequent Patterns
     • Many methods proposed for each
     • Focus in database and data mining community:
       – Scalability
       – Managing the process
       – Exploratory analysis
  16. 16. Classification
     Goal: Learn a function that assigns a record to one of several predefined classes.
     Requirements on the model:
       – High accuracy
       – Understandable by humans, interpretable
       – Fast construction for very large training databases
  17. 17. Classification
     Example application: telemarketing
  18. 18. Classification (Contd.)
     • Decision trees are one approach to classification.
     • Other approaches include:
       – Linear Discriminant Analysis
       – k-nearest neighbor methods
       – Logistic regression
       – Neural networks
       – Support Vector Machines
  19. 19. Classification Example
     • Training database:
       – Two predictor attributes: Age and Car-type (Sport, Minivan and Truck)
       – Age is ordered, Car-type is a categorical attribute
       – Class label indicates whether the person bought the product
       – Dependent attribute is categorical

     Age   Car   Class
     20    M     Yes
     30    M     Yes
     25    T     No
     30    S     Yes
     40    S     Yes
     20    T     No
     30    M     Yes
     25    M     Yes
     40    M     Yes
     20    S     No
  20. 20. Classification Problem
     • If Y is categorical, the problem is a classification problem, and we use C instead of Y. |dom(C)| = J, the number of classes.
     • C is the class label, d is called a classifier.
     • Let r be a record randomly drawn from P. Define the misclassification rate of d:
         RT(d,P) = P(d(r.X1, …, r.Xk) != r.C)
     • Problem definition: Given dataset D that is a random sample from probability distribution P, find classifier d such that RT(d,P) is minimized.
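RT(d,P) is defined over the unknown distribution P, so in practice it is estimated on a sample. The following is a minimal sketch of that empirical estimate, assuming the training table from the Classification Example slide; the names `misclassification_rate` and `always_yes` are hypothetical, and the constant classifier is only there to exercise the formula.

```python
from typing import Callable, Sequence, Tuple

Record = Tuple[int, str, str]  # (Age, Car, Class), as in the Classification Example slide

TRAINING: Sequence[Record] = [
    (20, "M", "Yes"), (30, "M", "Yes"), (25, "T", "No"), (30, "S", "Yes"),
    (40, "S", "Yes"), (20, "T", "No"), (30, "M", "Yes"), (25, "M", "Yes"),
    (40, "M", "Yes"), (20, "S", "No"),
]

def misclassification_rate(d: Callable[[int, str], str],
                           sample: Sequence[Record]) -> float:
    """Fraction of records r in the sample with d(r.X1, ..., r.Xk) != r.C."""
    errors = sum(1 for age, car, label in sample if d(age, car) != label)
    return errors / len(sample)

def always_yes(age: int, car: str) -> str:
    """Illustrative classifier (an assumption, not from the slides): predict Yes for everyone."""
    return "Yes"

print(misclassification_rate(always_yes, TRAINING))
```

Measuring the rate on the training sample itself tends to be optimistic; a held-out sample drawn from the same P gives a fairer estimate.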
  21. 21. Regression Problem
     • If Y is numerical, the problem is a regression problem.
     • Y is called the dependent variable, d is called a regression function.
     • Let r be a record randomly drawn from P. Define the mean squared error rate of d:
         RT(d,P) = E[(r.Y - d(r.X1, …, r.Xk))^2]
     • Problem definition: Given dataset D that is a random sample from probability distribution P, find regression function d such that RT(d,P) is minimized.
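As with classification, the expectation over P is usually approximated by an average over a sample. A small sketch of that estimate follows; `mean_squared_error` and the two-record sample are hypothetical and chosen only so the arithmetic is easy to follow.

```python
from typing import Callable, Sequence, Tuple

Record = Tuple[int, str, float]  # (Age, Car, Y), where Y is the numerical dependent variable

def mean_squared_error(d: Callable[[int, str], float],
                       sample: Sequence[Record]) -> float:
    """Sample average of (r.Y - d(r.X1, ..., r.Xk))^2."""
    return sum((y - d(age, car)) ** 2 for age, car, y in sample) / len(sample)

# Illustrative data (an assumption): two records with Y = 3 and Y = 4.
sample = [(20, "M", 3.0), (30, "S", 4.0)]

# A regression function that always predicts 0 has error (3^2 + 4^2) / 2.
print(mean_squared_error(lambda age, car: 0.0, sample))
```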
  22. 22. Regression Example
     • Example training database
       – Two predictor attributes: Age and Car-type (Sport, Minivan and Truck)
       – Spent indicates how much the person spent during a recent visit to the web site
       – Dependent attribute is numerical

     Age   Car   Spent
     20    M     $200
     30    M     $150
     25    T     $300
     30    S     $220
     40    S     $400
     20    T     $80
     30    M     $100
     25    M     $125
     40    M     $500
     20    S     $420
  23. 23. Decision Trees
  24. 24. What are Decision Trees?
     [Figure: a decision tree whose root splits on Age (<30 / >=30). The Age < 30 branch splits on Car Type (Minivan: YES; Sports, Truck: NO); the Age >= 30 branch is a YES leaf. A companion plot shows the same YES/NO regions over Age (0 to 60) and Car Type.]
  25. 25. Decision Trees
     • A decision tree T encodes d (a classifier or regression function) in the form of a tree.
     • A node t in T without children is called a leaf node. Otherwise t is called an internal node.
  26. 26. Internal Nodes
     • Each internal node has an associated splitting predicate. Most common are binary predicates. Example predicates:
       – Age <= 20
       – Profession in {student, teacher}
       – 5000*Age + 3*Salary - 10000 > 0
  27. 27. Leaf Nodes
     Consider leaf node t:
     • Classification problem: Node t is labeled with one class label c in dom(C)
     • Regression problem: Two choices
       – Piecewise constant model: t is labeled with a constant y in dom(Y).
       – Piecewise linear model: t is labeled with a linear model Y = y_t + Σ a_i X_i
  28. 28. Example
     Encoded classifier:
       If (age < 30 and carType = Minivan) Then YES
       If (age < 30 and (carType = Sports or carType = Truck)) Then NO
       If (age >= 30) Then YES
     [Figure: the corresponding tree. Root: Age (<30 / >=30); Age < 30 splits on Car Type (Minivan: YES; Sports, Truck: NO); Age >= 30: YES.]
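The three rules on this slide translate directly into nested conditionals, one per internal node of the tree. A sketch, assuming car types are passed as the strings "Minivan", "Sports" and "Truck" (the function name `classify` is hypothetical):

```python
def classify(age: int, car_type: str) -> str:
    """Walk the example tree: split on Age first, then on Car Type for Age < 30."""
    if age < 30:
        # Internal node "Car Type": Minivan goes to the YES leaf, Sports/Truck to NO.
        return "YES" if car_type == "Minivan" else "NO"
    # Leaf for Age >= 30.
    return "YES"

for age, car in [(20, "Minivan"), (25, "Truck"), (20, "Sports"), (40, "Sports")]:
    print(age, car, classify(age, car))
```

Applied to the ten records of the Classification Example training database, this tree reproduces every class label.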
  29. 29. Issues in Tree Construction
     • Three algorithmic components:
       – Split Selection Method
       – Pruning Method
       – Data Access Method
  30. 30. Top-Down Tree Construction
     BuildTree(Node n, Training database D, Split Selection Method S)
     [ (1) Apply S to D to find splitting criterion ]
     (1a) for each predictor attribute X
     (1b)   Call S.findSplit(AVC-set of X)
     (1c) endfor
     (1d) S.chooseBest();
     (2) if (n is not a leaf node) ...
     S: C4.5, CART, CHAID, FACT, ID3, GID3, QUEST, etc.
  31. 31. Split Selection Method
     • Numerical Attribute: Find a split point that separates the (two) classes
     [Figure: records plotted along the Age axis, with 30 and 35 marked; a legend distinguishes Yes and No records]
  32. 32. Split Selection Method (Contd.)
     • Categorical Attributes: How to group?
     [Figure: class distribution of the records for Sport, Truck and Minivan]
     Candidate groupings:
       – (Sport, Truck) vs. (Minivan)
       – (Sport) vs. (Truck, Minivan)
       – (Sport, Minivan) vs. (Truck)
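The three groupings listed here are all the ways to divide a three-value domain into two non-empty groups. A sketch of how such candidates could be enumerated for any categorical domain follows; `binary_groupings` is a hypothetical helper, and pinning one value to the left side is just a device to avoid listing each grouping twice.

```python
from itertools import combinations
from typing import Iterable, List, Set, Tuple

def binary_groupings(values: Iterable[str]) -> List[Tuple[Set[str], Set[str]]]:
    """All splits of the domain into two non-empty groups, ignoring left/right order."""
    ordered = sorted(values)
    first, rest = ordered[0], ordered[1:]
    groupings = []
    # Pin `first` on the left; add 0 .. len(rest)-1 other values so the right side is never empty.
    for extra_count in range(len(rest)):
        for extra in combinations(rest, extra_count):
            left = {first, *extra}
            right = set(ordered) - left
            groupings.append((left, right))
    return groupings

for left, right in binary_groupings(["Sport", "Truck", "Minivan"]):
    print(sorted(left), "vs.", sorted(right))
```

A domain with n values has 2^(n-1) - 1 such groupings, which is why exhaustive search becomes expensive for attributes with many categories.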
  33. 33. Impurity-based Split Selection Methods
     • Split selection method has two parts:
       – Search space of possible splitting criteria. Example: All splits of the form "age <= c".
       – Quality assessment of a splitting criterion
     • Need to quantify the quality of a split: Impurity function
     • Example impurity functions: Entropy, Gini index, chi-square index
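To make the quality-assessment step concrete, here is a sketch of two of the impurity functions named on this slide, entropy and the Gini index, together with the weighted impurity of a candidate split. The function names are hypothetical, the chi-square index is omitted, and the example counts assume the Car-type split "Minivan vs. (Sport, Truck)" on the training database from the Classification Example slide.

```python
from math import log2
from typing import Callable, Sequence

def gini(counts: Sequence[int]) -> float:
    """Gini index of a node given its per-class record counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts: Sequence[int]) -> float:
    """Entropy (in bits) of a node given its per-class record counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def split_impurity(partitions: Sequence[Sequence[int]],
                   impurity: Callable[[Sequence[int]], float] = gini) -> float:
    """Impurity of a split: per-partition impurity weighted by partition size."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * impurity(p) for p in partitions)

# (Yes, No) counts: Minivan has 5 Yes / 0 No; Sport and Truck together have 2 Yes / 3 No.
print(split_impurity([[5, 0], [2, 3]], gini))
print(split_impurity([[5, 0], [2, 3]], entropy))
```

A split selection method would evaluate this quantity for every candidate in its search space and keep the split with the lowest impurity.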
  34. 34. Data Access Method
     • Goal: Scalable decision tree construction, using the complete training database
  35. 35. AVC-Sets

     Training Database
     Age   Car   Class
     20    M     Yes
     30    M     Yes
     25    T     No
     30    S     Yes
     40    S     Yes
     20    T     No
     30    M     Yes
     25    M     Yes
     40    M     Yes
     20    S     No

     AVC-Sets
     Age   Yes   No
     20    1     2
     25    1     1
     30    3     0
     40    2     0

     Car       Yes   No
     Sport     2     1
     Truck     0     2
     Minivan   5     0
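As the tables on this slide show, the AVC-set of a predictor attribute records, for each attribute value, how many records fall in each class. A sketch that builds both AVC-sets in one scan of the training database follows; `build_avc_sets` and the nested-dictionary layout are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Training database from the slide: (Age, Car, Class); M = Minivan, S = Sport, T = Truck.
ROWS = [
    (20, "M", "Yes"), (30, "M", "Yes"), (25, "T", "No"), (30, "S", "Yes"),
    (40, "S", "Yes"), (20, "T", "No"), (30, "M", "Yes"), (25, "M", "Yes"),
    (40, "M", "Yes"), (20, "S", "No"),
]

def build_avc_sets(rows, attribute_names):
    """One scan over the rows; returns {attribute: {value: Counter(class -> count)}}."""
    avc = {name: defaultdict(Counter) for name in attribute_names}
    for *values, label in rows:
        for name, value in zip(attribute_names, values):
            avc[name][value][label] += 1
    return avc

avc = build_avc_sets(ROWS, ["Age", "Car"])
for name, table in avc.items():
    for value in sorted(table):
        print(name, value, "Yes:", table[value]["Yes"], "No:", table[value]["No"])
```

The AVC-sets depend on the number of distinct attribute values rather than the number of records, so they are typically far smaller than the training database.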
  36. 36. Motivation for Data Access Methods
     [Figure: the training database is split at the root on Age (<30 / >=30) into a left partition and a right partition]
     In principle, one pass over the training database for each node. Can we improve?
  37. 37. RainForest Algorithms: RF-Hybrid
     First scan: Build AVC-sets for the root
     [Figure: the database is scanned and the AVC-sets are held in main memory]
  38. 38. RainForest Algorithms: RF-Hybrid
     Second scan: Build AVC-sets for the children of the root
     [Figure: root split Age<30; the database is scanned and the AVC-sets are held in main memory]
  39. 39. RainForest Algorithms: RF-Hybrid
     Third scan: As we expand the tree, we run out of memory, and have to "spill" partitions to disk, and recursively read and process them later.
     [Figure: tree with splits Age<30, Sal<20k and Car==S in main memory; the database is written out as Partition 1 through Partition 4]
  40. 40. RainForest Algorithms: RF-Hybrid
     Further optimization: While writing partitions, concurrently build AVC-groups of as many nodes as possible in-memory. This should remind you of Hybrid Hash-Join!
     [Figure: tree with splits Age<30, Sal<20k and Car==S in main memory; the database is written out as Partition 1 through Partition 4]
  41. 41. CLUSTERING
  42. 42. Problem
     • Given points in a multidimensional space, group them into a small number of clusters, using some measure of "nearness"
       – E.g., Cluster documents by topic
       – E.g., Cluster users by similar interests
  43. 43. Clustering
     • Output: (k) groups of records called clusters, such that the records within a group are more similar to each other than to records in other groups
       – Representative points for each cluster
       – Labeling of each record with its cluster number
       – Other description of each cluster
     • This is unsupervised learning: No record labels are given to learn from
     • Usage:
       – Exploratory data mining
       – Preprocessing step (e.g., outlier detection)
  44. 44. Clustering (Contd.)
     • Requirements: Need to define "similarity" between records
     • Important: Use the "right" similarity (distance) function
       – Scale or normalize all attributes. Example: seconds, hours, days
       – Assign different weights to reflect importance of the attribute
       – Choose appropriate measure (e.g., L1, L2)
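The three pieces of advice on this slide (scale, weight, choose a measure) can be combined in a few lines. The sketch below assumes min-max scaling and a weighted L_p distance; both function names are hypothetical, and other scalings such as z-scores would serve equally well.

```python
from typing import List, Sequence

def min_max_scale(column: Sequence[float]) -> List[float]:
    """Rescale one attribute to [0, 1] so that its unit (seconds, hours, days) stops mattering."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def weighted_distance(a: Sequence[float], b: Sequence[float],
                      weights: Sequence[float], p: int = 2) -> float:
    """Weighted L_p distance: p=1 gives L1 (Manhattan), p=2 gives L2 (Euclidean)."""
    return sum(w * abs(x - y) ** p for x, y, w in zip(a, b, weights)) ** (1 / p)

print(min_max_scale([0, 5, 10]))
print(weighted_distance((0, 0), (3, 4), (1, 1), p=2))
print(weighted_distance((0, 0), (3, 4), (1, 1), p=1))
```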
  45. 45. Approaches
     • Centroid-based: Assume we have k clusters, guess at the centers, assign points to nearest center, e.g., K-means; over time, centroids shift
     • Hierarchical: Assume there is one cluster per point, and repeatedly merge nearby clusters using some distance threshold
     Scalability: Do this with the fewest number of passes over the data, ideally, sequentially
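The centroid-based approach described above can be sketched as a short loop: assign each point to its nearest center, then move each center to the mean of its points. This is a minimal K-means sketch, not a scalable implementation; the initial centers are supplied by the caller (the "guess"), and the fixed iteration count is an assumption, since the slide does not specify a stopping rule.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]

def k_means(points: Sequence[Point], centers: Sequence[Point],
            iterations: int = 10) -> Tuple[List[Point], List[List[Point]]]:
    """Alternate between assigning points to the nearest center and recomputing centroids."""
    centers = list(centers)
    clusters: List[List[Point]] = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Centroids shift to the mean of their assigned points; empty clusters keep their center.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, clusters = k_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

Each iteration is one pass over the data, which is why the number of iterations matters for the scalability concern raised on this slide.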
  46. 46. Scalable Clustering Algorithms for Numeric Attributes CLARANS DBSCAN BIRCH CLIQUE CURE ……. • Above algorithms can be used to cluster documents after reducing their dimensionality using SVD TECS 2007, Data Mining Bee-Chung Chen, Raghu Ramakrishnan, Jude Shavlik, Pradeep Tamma Ramakrishnan, Yahoo! Research R. 54
  47. Birch [ZRL96]
      Pre-cluster data points using the "CF-tree" data structure
  48. Clustering Feature (CF)
      Allows incremental merging of clusters!
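The slide's formula did not survive extraction. As a hedged sketch, the version below uses the definition from the BIRCH paper, CF = (N, LS, SS): the point count, the per-dimension linear sum, and the sum of squared norms. The class and method names are illustrative. Because all three components are sums, two summaries merge by simple addition.

```python
import math
from dataclasses import dataclass


@dataclass
class CF:
    n: int      # number of points summarized
    ls: tuple   # linear sum of the points, per dimension
    ss: float   # sum of squared norms of the points

    @classmethod
    def of_point(cls, p):
        return cls(1, tuple(p), sum(x * x for x in p))

    def merge(self, other):
        """CFs are additive, so merging never revisits the raw points."""
        ls = tuple(a + b for a, b in zip(self.ls, other.ls))
        return CF(self.n + other.n, ls, self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        """Root-mean-square distance of the summarized points from the centroid."""
        c2 = sum(x * x for x in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))


merged = CF.of_point((0.0, 0.0)).merge(CF.of_point((2.0, 0.0)))
print(merged, merged.centroid(), merged.radius())
```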
  49. Points to Note
      • Basic algorithm works in a single pass to condense metric data using spherical summaries
          – Can be incremental
      • Additional passes cluster CFs to detect non-spherical clusters
      • Approximates density function
      • Extensions to non-metric data
  50. Market Basket Analysis: Frequent Itemsets
  51. Market Basket Analysis
      • Consider a shopping cart filled with several items
      • Market basket analysis tries to answer the following questions:
          – Who makes purchases?
          – What do customers buy?
  52. Market Basket Analysis
      • Given:
          – A database of customer transactions
          – Each transaction is a set of items
      • Goal:
          – Extract rules

      TID  CID  Date    Item   Qty
      111  201  5/1/99  Pen    2
      111  201  5/1/99  Ink    1
      111  201  5/1/99  Milk   3
      111  201  5/1/99  Juice  6
      112  105  6/3/99  Pen    1
      112  105  6/3/99  Ink    1
      112  105  6/3/99  Milk   1
      113  106  6/5/99  Pen    1
      113  106  6/5/99  Milk   1
      114  201  7/1/99  Pen    2
      114  201  7/1/99  Ink    2
      114  201  7/1/99  Juice  4
  53. Market Basket Analysis (Contd.)
      • Co-occurrences
          – 80% of all customers purchase items X, Y and Z together.
      • Association rules
          – 60% of all customers who purchase X and Y also buy Z.
      • Sequential patterns
          – 60% of customers who first buy X also purchase Y within three weeks.
  54. Confidence and Support
      We prune the set of all possible association rules using two interestingness measures:
      • Confidence of a rule:
          – X => Y has confidence c if P(Y|X) = c
      • Support of a rule:
          – X => Y has support s if P(XY) = s
      We can also define
      • Support of a co-occurrence XY:
          – XY has support s if P(XY) = s
  55. Example
      (Transaction table as on slide 52: four transactions, TIDs 111 to 114.)
      • Example rule: {Pen} => {Milk}
          – Support: 75% (Pen and Milk appear together in 3 of the 4 transactions)
          – Confidence: 75% (Pen appears in all 4 transactions, Milk in 3 of them)
      • Another example: {Ink} => {Pen}
          – Support: 75% (Ink and Pen appear together in 3 of the 4 transactions)
          – Confidence: 100% (every transaction containing Ink also contains Pen)
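The two measures can be checked directly against the transaction table. This is a small sketch: the dictionary simply restates the four transactions (TID to item set), and the function names are illustrative. Computed this way, with support defined as P(XY), {Ink} => {Pen} has support 75% and confidence 100%.

```python
# The four transactions from the table, as TID -> set of items.
transactions = {
    111: {"Pen", "Ink", "Milk", "Juice"},
    112: {"Pen", "Ink", "Milk"},
    113: {"Pen", "Milk"},
    114: {"Pen", "Ink", "Juice"},
}


def support(itemset):
    """P(itemset): fraction of transactions containing every item in the set."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(lhs, rhs):
    """P(rhs | lhs) for the rule lhs => rhs."""
    return support(lhs | rhs) / support(lhs)


print(support({"Pen", "Milk"}), confidence({"Pen"}, {"Milk"}))
print(support({"Ink", "Pen"}), confidence({"Ink"}, {"Pen"}))
```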
  56. Exercise
      • Can you find all itemsets with support >= 75%?
      (Use the transaction table from slide 52.)
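One way to check an answer to this exercise is a brute-force enumeration. This is a sketch, not a scalable miner: it tests every candidate itemset and only borrows the Apriori observation that once no itemset of a given size is frequent, no larger one can be. The name `frequent_itemsets` is illustrative.

```python
from itertools import combinations

# One item set per transaction (TIDs 111 to 114).
baskets = [
    {"Pen", "Ink", "Milk", "Juice"},
    {"Pen", "Ink", "Milk"},
    {"Pen", "Milk"},
    {"Pen", "Ink", "Juice"},
]


def frequent_itemsets(baskets, min_support):
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            s = sum(1 for b in baskets if set(combo) <= b) / len(baskets)
            if s >= min_support:
                result[combo] = s
                found = True
        if not found:
            break  # nothing frequent at this size, so nothing larger is either
    return result


frequent = frequent_itemsets(baskets, 0.75)
for itemset, s in frequent.items():
    print(itemset, s)
```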
  57. Exercise
      • Can you find all association rules with support >= 50%?
      (Use the transaction table from slide 52.)
  58. Extensions
      • Imposing constraints
          – Only find rules involving the dairy department
          – Only find rules involving expensive products
          – Only find rules with “whiskey” on the right hand side
          – Only find rules with “milk” on the left hand side
          – Hierarchies on the items
          – Calendars (every Sunday, every 1st of the month)
  59. Market Basket Analysis: Applications
      • Sample Applications
          – Direct marketing
          – Fraud detection for medical insurance
          – Floor/shelf planning
          – Web site layout
          – Cross-selling
  60. DBMS Support for DM
  61. Why Integrate DM into a DBMS?
      [Figure: data is extracted from the DBMS and copied elsewhere, mined there, and turned into models. Open question on the figure: Consistency?]
  62. Integration Objectives
      Analysts (users):
      • Avoid isolation of querying from mining
          – Difficult to do "ad-hoc" mining
      • Provide simple programming approach to creating and using DM models
      DM Vendors:
      • Make it possible to add new models
      • Make it possible to add new, scalable algorithms
  63. SQL/MM: Data Mining
      • A collection of classes that provide a standard interface for invoking DM algorithms from SQL systems.
      • Four data models are supported:
          – Frequent itemsets, association rules
          – Clusters
          – Regression trees
          – Classification trees
  64. DATA MINING SUPPORT IN MICROSOFT SQL SERVER *
      * Thanks to Surajit Chaudhuri for permission to use/adapt his slides
  65. Key Design Decisions
      • Adopt relational data representation
          – A Data Mining Model (DMM) as a "tabular" object (externally; can be represented differently internally)
      • Language-based interface
          – Extension of SQL
          – Standard syntax
  66. DM Concepts to Support
      • Representation of input (cases)
      • Representation of models
      • Specification of training step
      • Specification of prediction step
      Should be independent of specific algorithms
  67. What are "Cases"?
      • DM algorithms analyze "cases"
      • The "case" is the entity being categorized and classified
      • Examples
          – Customer credit risk analysis: Case = Customer
          – Product profitability analysis: Case = Product
          – Promotion success analysis: Case = Promotion
      • Each case encapsulates all we know about the entity
  68. Cases as Records: Examples

      Cust ID  Age  Marital Status  Wealth
      1        35   M               380,000
      2        20   S               50,000
      3        57   M               470,000

      Age  Car  Class
      20   M    Yes
      30   M    Yes
      25   T    No
      30   S    Yes
      40   S    Yes
      20   T    No
      30   M    Yes
      25   M    Yes
      40   M    Yes
      20   S    No
  69. Types of Columns

      Cust ID  Age  Marital Status  Wealth   Product Purchases (nested table)
                                             Product  Quantity  Type
      1        35   M               380,000  TV       1         Appliance
                                             Coke     6         Drink
                                             Ham      3         Food

      • Keys: Columns that uniquely identify a case
      • Attributes: Columns that describe a case
          – Value: A state associated with the attribute in a specific case
          – Attribute Property: Columns that describe an attribute
              – Unique for a specific attribute value (TV is always an appliance)
          – Attribute Modifier: Columns that represent additional "meta" information for an attribute
              – Weight of a case, Certainty of prediction
  70. More on Columns
      • Properties describe attributes
          – Can represent generalization hierarchy
      • Distribution information associated with attributes
          – Discrete/Continuous
          – Nature of Continuous distributions
              • Normal, Log_Normal
          – Other Properties (e.g., ordered, not null)
  71. Representing a DMM
      Example model (decision tree):
          Age < 30
              Car Type = Minivan        => YES
              Car Type = Sports, Truck  => NO
          Age >= 30                     => YES
      • Specifying a Model
          – Columns to predict
          – Algorithm to use
          – Special parameters
      • Model is represented as a (nested) table
          – Specification = Create table
          – Training = Inserting data into the table
          – Predicting = Querying the table
  72. CREATE MINING MODEL

      CREATE MINING MODEL [Age Prediction]      -- name of model
      (
          [Gender]     TEXT   DISCRETE   ATTRIBUTE,
          [Hair Color] TEXT   DISCRETE   ATTRIBUTE,
          [Age]        DOUBLE CONTINUOUS ATTRIBUTE PREDICT
      )
      USING [Microsoft Decision Tree]           -- name of algorithm
  73. CREATE MINING MODEL

      CREATE MINING MODEL [Age Prediction]
      (
          [Customer ID]      LONG   KEY,
          [Gender]           TEXT   DISCRETE   ATTRIBUTE,
          [Age]              DOUBLE CONTINUOUS ATTRIBUTE PREDICT,
          [ProductPurchases] TABLE
          (
              [ProductName]  TEXT   KEY,
              [Quantity]     DOUBLE NORMAL CONTINUOUS,
              [ProductType]  TEXT   DISCRETE RELATED TO [ProductName]
          )
      )
      USING [Microsoft Decision Tree]

      Note that the ProductPurchases column is a nested table. SQL Server computes this field when data is "inserted".
  74. Training a DMM
      • Training a DMM requires passing it "known" cases
      • Use an INSERT INTO statement to "insert" the data into the DMM
          – The DMM will usually not retain the inserted data
          – Instead it will analyze the given cases and build the DMM content (decision tree, segmentation model)
      • Syntax:
          INSERT [INTO] <mining model name> [(columns list)] <source data query>
  75. INSERT INTO

      INSERT INTO [Age Prediction]
      (
          [Gender], [Hair Color], [Age]
      )
      OPENQUERY([Provider=MSOLESQL…,
          'SELECT [Gender], [Hair Color], [Age] FROM [Customers]'
      )
  76. Executing Insert Into
      • The DMM is trained
          – The model can be retrained or incrementally refined
      • Content (rules, trees, formulas) can be explored
      • Prediction queries can be executed
  77. What are Predictions?
      • Predictions apply the trained model to estimate missing attributes in a data set
      • Predictions = Queries
      • Specification:
          – Input data set
          – A trained DMM (think of it as a truth table, with one row per combination of predictor-attribute values; this is only conceptual)
          – Binding (mapping) information between the input data and the DMM
  78. Prediction Join

      SELECT [Customers].[ID], MyDMM.[Age], PredictProbability(MyDMM.[Age])
      FROM MyDMM PREDICTION JOIN [Customers]
      ON  MyDMM.[Gender] = [Customers].[Gender]
      AND MyDMM.[Hair Color] = [Customers].[Hair Color]
  79. Exploratory Mining: Combining OLAP and DM
  80. Databases and Data Mining
      • What can database systems offer in the grand challenge of understanding and learning from the flood of data we’ve unleashed?
          – The plumbing
          – Scalability
  81. Databases and Data Mining
      • What can database systems offer in the grand challenge of understanding and learning from the flood of data we’ve unleashed?
          – The plumbing
          – Scalability
          – Ideas!
              • Declarativeness
              • Compositionality
              • Ways to conceptualize your data
  82. Multidimensional Data Model
      • One fact table $D = (X, M)$
          – $X = X_1, X_2, \ldots$: dimension attributes
          – $M = M_1, M_2, \ldots$: measure attributes
      • Domain hierarchy for each dimension attribute:
          – Collection of domains $Hier(X_i) = (D_i^{(1)}, \ldots, D_i^{(t)})$
          – The extended domain: $E_{X_i} = \bigcup_{1 \le k \le t} D_i^{(k)}$
      • Value mapping function: $\gamma_{D_1 \to D_2}(x)$
          – e.g., $\gamma_{month \to year}(12/2005) = 2005$
          – Form the value hierarchy graph
          – Stored as dimension table attribute (e.g., week for a time value) or conversion functions (e.g., month, quarter)
  83. Multidimensional Data
      [Figure: facts p1 to p4 plotted as cells of a grid over the two dimension attributes, Automobile and Location]
      Dimension hierarchies (level number in parentheses):
          – Automobile: ALL (3); Category (2): Sedan, Truck; Model (1): Civic, Camry (Sedan), F150, Sierra (Truck)
          – Location: ALL (3); Region (2): East, West; State (1): MA, NY (East), TX, CA (West)

      FactID  Auto    Loc  Repair
      p1      F150    NY   100
      p2      Sierra  NY   500
      p3      F150    MA   100
      p4      Sierra  MA   200
  84. Cube Space
      • Cube space: $C = E_{X_1} \times E_{X_2} \times \cdots \times E_{X_d}$
      • Region: Hyper-rectangle in cube space
          – $c = (v_1, v_2, \ldots, v_d)$, $v_i \in E_{X_i}$
      • Region granularity:
          – $gran(c) = (d_1, d_2, \ldots, d_d)$, $d_i = Domain(c.v_i)$
      • Region coverage:
          – $coverage(c)$ = all facts in $c$
      • Region set: All regions with same granularity
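A small sketch of region coverage, using the automobile repair facts and hierarchies from the multidimensional data example. The `parent` map, the helper `rolls_up_to`, and the tuple layout are illustrative assumptions. A region is a pair of values, one per dimension, each taken from any level of that dimension's hierarchy.

```python
# Child -> parent links for both dimension hierarchies (ALL is the shared root name).
parent = {
    "Civic": "Sedan", "Camry": "Sedan", "F150": "Truck", "Sierra": "Truck",
    "Sedan": "ALL", "Truck": "ALL",
    "MA": "East", "NY": "East", "TX": "West", "CA": "West",
    "East": "ALL", "West": "ALL",
}

facts = [  # (FactID, Auto, Loc, Repair)
    ("p1", "F150", "NY", 100),
    ("p2", "Sierra", "NY", 500),
    ("p3", "F150", "MA", 100),
    ("p4", "Sierra", "MA", 200),
]


def rolls_up_to(value, ancestor):
    """True if `value` equals `ancestor` or maps to it by climbing the hierarchy."""
    while value != ancestor and value in parent:
        value = parent[value]
    return value == ancestor


def coverage(region):
    """All facts that fall inside the region (auto value, location value)."""
    auto, loc = region
    return [f for f in facts if rolls_up_to(f[1], auto) and rolls_up_to(f[2], loc)]


truck_east = coverage(("Truck", "East"))
print([f[0] for f in truck_east], sum(f[3] for f in truck_east))
print([f[0] for f in coverage(("F150", "ALL"))])
```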
  85. OLAP Over Imprecise Data
      with Doug Burdick, Prasad Deshpande, T.S. Jayram, and Shiv Vaithyanathan
      In VLDB 05, 06; joint work with IBM Almaden
  86. Imprecise Data
      [Figure: the grid from slide 83 with an additional fact p5 that spans both Truck models (F150 and Sierra) in MA]
      Dimension hierarchies (level number in parentheses):
          – Automobile: ALL (3); Category (2): Sedan, Truck; Model (1): Civic, Camry, F150, Sierra
          – Location: ALL (3); Region (2): East, West; State (1): MA, NY, TX, CA

      FactID  Auto    Loc  Repair
      p1      F150    NY   100
      p2      Sierra  NY   500
      p3      F150    MA   100
      p4      Sierra  MA   200
      p5      Truck   MA   100
  87. Querying Imprecise Facts
      Query: Auto = F150, Loc = MA, SUM(Repair) = ???
      How do we treat p5?

      FactID  Auto    Loc  Repair
      p1      F150    NY   100
      p2      Sierra  NY   500
      p3      F150    MA   100
      p4      Sierra  MA   200
      p5      Truck   MA   100

      [Figure: grid of Truck models (F150, Sierra) by East states (MA, NY); p5 spans both MA cells]
  88. Allocation (1)

      FactID  Auto    Loc  Repair
      p1      F150    NY   100
      p2      Sierra  NY   500
      p3      F150    MA   100
      p4      Sierra  MA   200
      p5      Truck   MA   100

      [Figure: grid of Truck models (F150, Sierra) by East states (MA, NY); p5 spans both MA cells]
  89. Allocation (2)
      (Huh? Why 0.5 / 0.5? Hold on to that thought)

      ID  FactID  Auto    Loc  Repair  Weight
      1   p1      F150    NY   100     1.0
      2   p2      Sierra  NY   500     1.0
      3   p3      F150    MA   100     1.0
      4   p4      Sierra  MA   200     1.0
      5   p5      F150    MA   100     0.5
      6   p5      Sierra  MA   100     0.5

      [Figure: the same grid, with a copy of p5 placed in each of the two MA cells]
  90. Allocation (3)
      Query: Auto = F150, Loc = MA, SUM(Repair) = 150
      Query the Extended Data Model!

      ID  FactID  Auto    Loc  Repair  Weight
      1   p1      F150    NY   100     1.0
      2   p2      Sierra  NY   500     1.0
      3   p3      F150    MA   100     1.0
      4   p4      Sierra  MA   200     1.0
      5   p5      F150    MA   100     0.5
      6   p5      Sierra  MA   100     0.5

      [Figure: the same grid, with a copy of p5 placed in each of the two MA cells]
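The query over the extended data model is a weighted aggregate: each row contributes its measure times its allocation weight. A minimal sketch, with the rows copied from the table and an illustrative function name:

```python
extended = [  # (FactID, Auto, Loc, Repair, Weight)
    ("p1", "F150", "NY", 100, 1.0),
    ("p2", "Sierra", "NY", 500, 1.0),
    ("p3", "F150", "MA", 100, 1.0),
    ("p4", "Sierra", "MA", 200, 1.0),
    ("p5", "F150", "MA", 100, 0.5),
    ("p5", "Sierra", "MA", 100, 0.5),
]


def weighted_sum(rows, auto, loc):
    """SUM(Repair) where every row counts in proportion to its allocation weight."""
    return sum(repair * w for _, a, l, repair, w in rows if a == auto and l == loc)


print(weighted_sum(extended, "F150", "MA"))    # 100 * 1.0 + 100 * 0.5
print(weighted_sum(extended, "Sierra", "MA"))  # 200 * 1.0 + 100 * 0.5
```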
  91. Allocation Policies
      • The procedure for assigning allocation weights is referred to as an allocation policy:
          – Each allocation policy uses different information to assign allocation weights
          – Reflects assumption about the correlation structure in the data
      • Leads to EM-style iterative algorithms for allocating imprecise facts, maximizing likelihood of observed data
  92. Allocation Policy: Count
      $p_{c_1,p_5} = \frac{Count(c_1)}{Count(c_1) + Count(c_2)} = \frac{2}{2 + 1}$
      $p_{c_2,p_5} = \frac{Count(c_2)}{Count(c_1) + Count(c_2)} = \frac{1}{2 + 1}$
      [Figure: grid of Truck models (F150, Sierra) by East states (MA, NY). Cell c1 = (F150, MA) holds the precise facts p3 and p6; cell c2 = (Sierra, MA) holds p4; the imprecise fact p5 spans c1 and c2; p1 and p2 lie in NY.]
  93. Allocation Policy: Measure
      $p_{c_1,p_5} = \frac{Sales(c_1)}{Sales(c_1) + Sales(c_2)} = \frac{700}{700 + 200}$
      $p_{c_2,p_5} = \frac{Sales(c_2)}{Sales(c_1) + Sales(c_2)} = \frac{200}{700 + 200}$

      ID  Sales
      p1  100
      p2  150
      p3  300
      p4  200
      p5  250
      p6  400

      [Figure: same grid as for the Count policy. Cell c1 = (F150, MA) holds p3 and p6 (Sales 300 + 400 = 700); cell c2 = (Sierra, MA) holds p4 (Sales 200); p5 spans c1 and c2.]
  94. Allocation Policy Template
      The Count and Measure policies are instances of one template, with $Q$ standing for Count or Sales:
      $p_{c_1,p_5} = \frac{Q(c_1)}{Q(c_1) + Q(c_2)}$, $p_{c_2,p_5} = \frac{Q(c_2)}{Q(c_1) + Q(c_2)}$
      In general, for an imprecise fact $r$ and a cell $c$ in its region:
      $p_{c,r} = \frac{Q(c)}{\sum_{c' \in region(r)} Q(c')} = \frac{Q(c)}{Q_{sum}(r)}$
      [Figure: grid of Truck models (F150, Sierra) by East states (MA, NY); the imprecise fact r spans cells c1 and c2 in MA]
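The template amounts to normalizing a per-cell quantity Q over the cells an imprecise fact could belong to. A minimal sketch, with an illustrative function name `allocate` and the Count and Sales figures from the two preceding policy slides:

```python
def allocate(q_values):
    """Template policy: weight of cell c is Q(c) divided by the sum of Q over the region."""
    total = sum(q_values.values())
    return {cell: q / total for cell, q in q_values.items()}


# Count policy: c1 holds two precise facts, c2 holds one.
count_weights = allocate({"c1": 2, "c2": 1})

# Measure policy: Sales(c1) = 700, Sales(c2) = 200.
sales_weights = allocate({"c1": 700, "c2": 200})

print(count_weights)
print(sales_weights)
```

Swapping in a different Q changes the assumption about how imprecise facts relate to the precise ones, without changing the mechanics.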
  95. What is a Good Allocation Policy?
      Query: COUNT
      We propose desiderata that enable appropriate definition of query semantics for imprecise data
      [Figure: grid of Truck models (F150, Sierra) by East states (MA, NY) with facts p1 to p5; p5 is imprecise]
  96. Desideratum I: Consistency
      • Consistency specifies the relationship between answers to related queries on a fixed data set
      [Figure: grid of Truck models (F150, Sierra) by East states (MA, NY) with facts p1 to p5]
  97. Desideratum II: Faithfulness
      • Faithfulness specifies the relationship between answers to a fixed query on related data sets
      [Figure: three related data sets (Data Set 1, Data Set 2, Data Set 3), each a grid of F150/Sierra by MA/NY containing p1 to p5, differing in how p5 is placed]
  98. Results on Query Semantics
      • Evaluating queries over extended data model yields expected value of the aggregation operator over all possible worlds
      • Efficient query evaluation algorithms available for SUM, COUNT; more expensive dynamic programming algorithm for AVERAGE
          – Consistency and faithfulness for SUM, COUNT are satisfied under appropriate conditions
          – (Bound-)Consistency does not hold for AVERAGE, but holds for E(SUM)/E(COUNT)
              • Weak form of faithfulness holds
          – Opinion pooling with LinOP: Similar to AVERAGE
  99. Imprecise facts lead to many possible worlds [Kripke63, …]
      [Figure: the data set with imprecise fact p5, and four possible worlds w1, w2, w3, w4, each a grid of F150/Sierra by MA/NY in which p5 is assigned to a single cell]
  100. Query Semantics
      • Given all possible worlds together with their probabilities, queries are easily answered using expected values
          – But number of possible worlds is exponential!
      • Allocation gives facts weighted assignments to possible completions, leading to an extended version of the data
          – Size increase is linear in number of (completions of) imprecise facts
          – Queries operate over this extended version
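To illustrate why the extended data gives the same answer as the possible-worlds view for SUM, here is a sketch that computes the expected SUM both ways for the repair example. It assumes each imprecise fact is completed independently with probability equal to its allocation weight; the data layout and function names are illustrative.

```python
from itertools import product

precise = [("F150", "MA", 100), ("Sierra", "MA", 200)]  # p3, p4
imprecise = [  # (Loc, Repair, {completion: allocation weight})
    ("MA", 100, {"F150": 0.5, "Sierra": 0.5}),          # p5
]


def expected_sum_worlds(auto, loc):
    """Enumerate every possible world: exponential in the number of imprecise facts."""
    base = sum(r for a, l, r in precise if (a, l) == (auto, loc))
    total = 0.0
    choices = [list(f[2].items()) for f in imprecise]
    for combo in product(*choices):  # one combo = one possible world
        prob, value = 1.0, base
        for (f_loc, repair, _), (model, w) in zip(imprecise, combo):
            prob *= w
            if (model, f_loc) == (auto, loc):
                value += repair
        total += prob * value
    return total


def sum_extended(auto, loc):
    """Same answer from the extended data: linear in the number of completions."""
    total = float(sum(r for a, l, r in precise if (a, l) == (auto, loc)))
    for f_loc, repair, alloc in imprecise:
        if f_loc == loc:
            total += repair * alloc.get(auto, 0.0)
    return total


print(expected_sum_worlds("F150", "MA"), sum_extended("F150", "MA"))
```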
  101. Exploratory Mining: Prediction Cubes
      with Bee-Chung Chen, Lei Chen, and Yi Lin
      In VLDB 05; EDAM Project
  102. The Idea
      • Build OLAP data cubes in which cell values represent decision/prediction behavior
          – In effect, build a tree for each cell/region in the cube; observe that this is not the same as a collection of trees used in an ensemble method!
          – The idea is simple, but it leads to promising data mining tools
          – Ultimate objective: Exploratory analysis of the entire space of "data mining choices"
              • Choice of algorithms, data conditioning parameters …
  103. Example (1/7): Regular OLAP
      Goal: Look for patterns of unusually high numbers of applications

      Z: Dimensions         Y: Measure
      Location   Time       # of App.
      …          …          …
      AL, USA    Dec, 04    2
      …          …          …
      WY, USA    Dec, 04    3

      Dimension hierarchies:
          – Location: All; Country (Japan, USA, Norway); State (AL, …, WY)
          – Time: All; Year (85, 86, …, 04); Month (Jan., 86, …, Dec., 86)
  104. Example (2/7): Regular OLAP
      Goal: Look for patterns of unusually high numbers of applications

      Z: Dimensions         Y: Measure
      Location   Time       # of App.
      …          …          …
      AL, USA    Dec, 04    2
      …          …          …
      WY, USA    Dec, 04    3

      [Figure: the measure shown as cubes at three granularities, from country by year, to country by month, to state by month. Roll up moves to coarser regions; drill down moves to finer regions.]
      Cell value: Number of loan applications
  105. Example (3/7): Decision Analysis
      Goal: Analyze a bank’s loan decision process w.r.t. two dimensions: Location and Time

      Fact table D
      Z: Dimensions         X: Predictors       Y: Class
      Location   Time       Race   Sex  …       Approval
      AL, USA    Dec, 04    White  M    …       Yes
      …          …          …      …    …       …
      WY, USA    Dec, 04    Black  F    …       No

      Cube subset $\sigma_Z(D)$ is used to build a model $h(X, \sigma_Z(D))$, e.g., a decision tree

      Dimension hierarchies:
          – Location: All; Country (Japan, USA, Norway); State (AL, …, WY)
          – Time: All; Year (85, 86, …, 04); Month (Jan., 86, …, Dec., 86)
  106. Example (3/7): Decision Analysis
      • Are there branches (and time windows) where approvals were closely tied to sensitive attributes (e.g., race)?
          – Suppose you partitioned the training data by location and time, chose the partition for a given branch and time window, and built a classifier. You could then ask, “Are the predictions of this classifier closely correlated with race?”
      • Are there branches and times with decision making reminiscent of 1950s Alabama?
          – Requires comparison of classifiers trained using different subsets of data.
  107. Example (4/7): Prediction Cubes
      1. Build a model using data from USA in Dec., 1985
      2. Evaluate that model
      Measure in a cell:
      • Accuracy of the model
      • Predictiveness of Race measured based on that model
      • Similarity between that model and a given model

      Data $\sigma_{[USA,\ Dec\ 04]}(D)$:
      Location   Time       Race   Sex  …       Approval
      AL, USA    Dec, 04    White  M    …       Y
      …          …          …      …    …       …
      WY, USA    Dec, 04    Black  F    …       N

      Model $h(X, \sigma_{[USA,\ Dec\ 04]}(D))$, e.g., a decision tree
      [Figure: prediction cube with rows for countries (CA, USA, …) and columns for months within 2004, 2003, …; each cell holds a measure between 0 and 1]
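A toy sketch of the two steps, build a model per cell and store an evaluation measure in the cell. Everything here is an assumption made for illustration: the rows are synthetic, a per-race majority rule stands in for the decision tree, and accuracy is measured on the cell's own data rather than on a separate test set.

```python
from collections import Counter, defaultdict

# Synthetic rows, invented for this sketch: (country, month, race, approval)
rows = [
    ("USA", "Dec 04", "White", "Yes"),
    ("USA", "Dec 04", "White", "Yes"),
    ("USA", "Dec 04", "Black", "No"),
    ("USA", "Dec 04", "Black", "No"),
    ("CA", "Dec 04", "White", "Yes"),
    ("CA", "Dec 04", "Black", "Yes"),
    ("CA", "Dec 04", "White", "No"),
    ("CA", "Dec 04", "Black", "No"),
]


def train(cell_rows):
    """Stand-in model: predict the majority approval seen for each race."""
    votes = defaultdict(Counter)
    for _, _, race, approval in cell_rows:
        votes[race][approval] += 1
    return {race: c.most_common(1)[0][0] for race, c in votes.items()}


def accuracy(model, cell_rows):
    hits = sum(1 for _, _, race, approval in cell_rows if model[race] == approval)
    return hits / len(cell_rows)


# Step 1: partition the data by cell at level [Country, Month].
cells = defaultdict(list)
for row in rows:
    cells[(row[0], row[1])].append(row)

# Step 2: build one model per cell and store its measure as the cell value.
cube = {cell: accuracy(train(data), data) for cell, data in cells.items()}
print(cube)
```

In this toy data the USA cell scores 1.0 because race alone determines the outcome there, while the CA cell scores 0.5; that contrast is the kind of signal a prediction cube is meant to surface.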
  108. Example (5/7): Model-Similarity
      Given:
          - Data table D
          - Target model h0(X)
          - Test set Δ w/o labels

      Data table D
      Location   Time       Race   Sex  …       Approval
      AL, USA    Dec, 04    White  M    …       Yes
      …          …          …      …    …       …
      WY, USA    Dec, 04    Black  F    …       No

      [Figure: for each cell at level [Country, Month], build a model from that cell's data, apply both it and h0(X) to the test set Δ, and store the similarity of their predictions as the cell value]
      Example finding: The loan decision process in USA during Dec 04 was similar to a discriminatory decision model
  109. Example (6/7): Predictiveness
      Given:
          - Data table D
          - Attributes V
          - Test set Δ w/o labels

      Data table D
      Location   Time       Race   Sex  …       Approval
      AL, USA    Dec, 04    White  M    …       Yes
      …          …          …      …    …       …
      WY, USA    Dec, 04    Black  F    …       No

      [Figure: for each cell at level [Country, Month], build two models, h(X) and h(X − V), apply both to the test set Δ, and store the difference between their predictions as the cell value: the predictiveness of V]
      Example finding: Race was an important predictor of loan approval decision in USA during Dec 04