Data Mining
Steps and Functionalities
Data Mining: A KDD Process
 Data mining: the core of the knowledge discovery process.
[Figure: KDD process flow. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection & Transformation → Task-relevant Data → Data Mining → Pattern Evaluation]
Steps of a KDD Process
 Data Cleaning
 Handles noisy, inconsistent, and incomplete data
 Missing values
 Noisy data
 Smoothed by binning, clustering, etc.
 Inconsistencies
 Corrected with tools and known functional dependencies
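
As an illustration of the cleaning step, the sketch below fills missing values with the mean of the observed values and smooths noisy values by equal-frequency bin means. The function names and the sample prices are assumptions made for this example; mean imputation is only one of several possible strategies.

```python
from statistics import mean

def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-frequency bins,
    and replace every value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

# Hypothetical sorted price data, grouped into bins of three values.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
print(fill_missing([4, None, 8]))
```

Each bin of three prices collapses to its mean, which removes small fluctuations while keeping the overall shape of the data.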
 Data Integration
 Schema Integration
 Entity Identification problem
 Redundancy
 Correlation Analysis
 Data Selection
 Select only the task-relevant data
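
Correlation analysis can flag redundant attributes after integration. The sketch below computes the Pearson correlation coefficient between two numeric attributes; the attribute names and values are hypothetical, and a coefficient close to +1 or -1 suggests that one attribute may be derivable from the other.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical attributes from two sources: one is a rescaled copy of the other.
annual_salary = [30.0, 45.0, 60.0, 75.0]
monthly_salary = [s / 12 for s in annual_salary]
r = pearson(annual_salary, monthly_salary)
print(round(r, 6))
```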
 Data Transformation
 Transform or consolidate data
 Smoothing, Normalization, Feature Construction
 Data Reduction - Compression
 Data Mining
 Intelligent methods are applied to extract patterns
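
Normalization, one of the transformations listed above, can be sketched in two common variants: min-max scaling and z-score standardization. The function names and the income values are assumptions for illustration; the z-score version here uses the population standard deviation.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center values on the mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical income attribute.
incomes = [12000, 35000, 73600, 98000]
print(min_max(incomes))
print(z_score(incomes))
```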
 Pattern Evaluation
 Interestingness Measures
 Knowledge Presentation
 Visualization
Data Mining Functionalities
 Descriptive
 Characterize general properties of the data
 Predictive
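 Performs inference on the current data to make predictions
 Mining
 Several kinds of patterns can be mined in parallel
 Patterns can be mined at various granularities (levels of abstraction)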
Data Mining Functionalities
 Concept/class description
 Association Analysis
 Classification and Prediction
 Cluster Analysis
 Outlier Analysis
 Evolution Analysis
Concept/Class Description
 Data can be associated with classes or concepts
 Classes of items: computers, printers
 Concepts of customers: bigSpenders vs. budgetSpenders
 Class/concept description
 Classes and concepts can be summarized in concise yet precise terms
 Data Characterization
 Data Discrimination
Data Characterization
 Summarization of the general characteristics of a target class of data
 Data for the class is collected and aggregated
 OLAP roll up operation
 Attribute Oriented Induction
 Results – Charts, cubes, rules
 Example
 Characteristics of Customers
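
A roll-up aggregates a measure to a coarser level of a dimension. The sketch below, assuming hypothetical sales rows with a city, a country, and an amount, rolls the data up from city level to country level by summing the measure.

```python
from collections import defaultdict

def roll_up(rows, key, measure):
    """Sum the measure over all rows that share the same value of key."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[measure]
    return dict(totals)

# Hypothetical fact rows at city granularity.
sales = [
    {"city": "Chicago", "country": "USA", "amount": 120},
    {"city": "New York", "country": "USA", "amount": 200},
    {"city": "Toronto", "country": "Canada", "amount": 90},
    {"city": "Vancouver", "country": "Canada", "amount": 60},
]
by_country = roll_up(sales, "country", "amount")
print(by_country)
```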
Data Discrimination
 Compares a target class with one or more contrasting classes
 Classes may be user-specified
 Examples:
 Products whose sales increased vs. decreased
 Regular shoppers vs. occasional shoppers
 Output includes comparative measures
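
One simple comparative measure is the share of each attribute value within the target class and within the contrasting class. The sketch below assumes hypothetical shopper records with a single age attribute; it is an illustration of the idea, not a full discrimination algorithm.

```python
from collections import Counter

def distribution(rows, attribute):
    """Relative frequency of each attribute value within one class."""
    counts = Counter(row[attribute] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def discriminate(target, contrast, attribute):
    """Pair up (target share, contrast share) for every attribute value."""
    t = distribution(target, attribute)
    c = distribution(contrast, attribute)
    return {v: (t.get(v, 0.0), c.get(v, 0.0)) for v in sorted(set(t) | set(c))}

# Hypothetical classes: regular vs. occasional shoppers.
regular = [{"age": "20-29"}, {"age": "20-29"}, {"age": "20-29"}, {"age": "40-49"}]
occasional = [{"age": "20-29"}, {"age": "40-49"}, {"age": "40-49"}, {"age": "40-49"}]
comparison = discriminate(regular, occasional, "age")
print(comparison)
```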
Association Analysis
 Discovery of association rules
 Form: X ⇒ Y
 Multi-dimensional
 age(X, “20…29”) ∧ income(X, “20K…25K”) ⇒ buys(X, “Laptop”)
 Single Dimensional
 buys(X, “Laptop”) ⇒ buys(X, “Software”)
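
Rules of the form X ⇒ Y are commonly scored by support and confidence. These two measures are not named on the slide, so treat them as an assumption about which interestingness measures are meant. The sketch below evaluates the single-dimensional rule on a small set of hypothetical baskets.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets.
baskets = [
    {"Laptop", "Software"},
    {"Laptop", "Software", "Printer"},
    {"Laptop"},
    {"Printer"},
]
rule_support = support(baskets, {"Laptop", "Software"})
rule_confidence = confidence(baskets, {"Laptop"}, {"Software"})
print(rule_support, rule_confidence)
```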
Classification and Prediction
 Classification
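 Finds models that describe and differentiate classes or concepts
 The model predicts the class of objects whose label is unknown
 Derived from training data (objects with known class labels)
 Models: rules, decision trees, neural networks, mathematical formulae
 Preceded by relevance analysis (to eliminate irrelevant attributes)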
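
As a minimal example of a rule-based model, the sketch below learns a one-attribute rule (often called 1R): for each attribute it maps every value to the majority class and keeps the attribute with the fewest training errors. The technique is chosen here for brevity and is not named in the slides; the attribute names and records are hypothetical.

```python
from collections import Counter, defaultdict

def one_rule(rows, attributes, label):
    """Return (attribute, value -> class rule, training errors) for the best single attribute."""
    best = None
    for attr in attributes:
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[attr]][row[label]] += 1
        # Each attribute value predicts its most frequent class.
        rule = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
        errors = sum(rule[row[attr]] != row[label] for row in rows)
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

# Hypothetical labeled training data.
training = [
    {"income": "high", "age": "young", "spender": "big"},
    {"income": "high", "age": "old", "spender": "big"},
    {"income": "low", "age": "young", "spender": "budget"},
    {"income": "low", "age": "old", "spender": "budget"},
]
attr, rule, errors = one_rule(training, ["age", "income"], "spender")
print(attr, rule, errors)
```

In this toy data the age attribute carries no information about the class, so the learner settles on income, which loosely mirrors what relevance analysis is meant to achieve.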
Classification and Prediction
 Prediction
 Derived model is used for prediction
 Data value prediction
 Class label prediction (Classification)
 Trend identification
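
Data value prediction can be illustrated with a least-squares line fitted to known points and then used to predict a new value. Linear regression is an assumed choice of model for this sketch, and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Apply the fitted line to a new x value."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical observations that lie exactly on y = 2x + 1.
model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(predict(model, 5))
```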
Cluster Analysis
 Unsupervised
 Class labels are missing in the training set
 Maximize Intra-class similarity
 Minimize Inter-class similarity
 Hierarchy of classes
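
A small k-means sketch on one-dimensional data shows how unlabeled points can be grouped so that members of a cluster are close to each other and far from other clusters. k-means is an assumed choice of algorithm, and the points and starting centroids are hypothetical.

```python
def k_means_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical unlabeled values with two obvious groups.
centers, groups = k_means_1d([1, 2, 3, 10, 11, 12], [1, 12])
print(centers, groups)
```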
Outlier Analysis
 Objects that do not comply with the general behavior of the data
 Noise Vs Rare events
 Fraud detection
 Statistical tests
 Deviation based methods
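
A basic statistical test for outliers flags values whose z-score exceeds a threshold. The sketch below assumes a threshold of 2.0 and hypothetical transaction amounts; in small samples a single extreme value inflates the standard deviation, so a stricter threshold such as 3.0 might miss it.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than threshold standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts with one suspicious entry.
amounts = [20, 22, 19, 21, 20, 23, 18, 500]
flagged = z_score_outliers(amounts)
print(flagged)
```

Whether a flagged value is noise or a rare event of interest, such as fraud, still has to be decided from context.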
Evolution Analysis
 Trend detection
 Time series data
 Involves other functionalities
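
Trend detection on time series data can be sketched with a moving average that smooths short-term fluctuations before the overall direction is compared. The window size, the function names, and the monthly sales figures are assumptions for this example.

```python
def moving_average(series, window):
    """Mean of each run of window consecutive values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

def trend(series, window):
    """Label the direction by comparing the first and last smoothed values."""
    smoothed = moving_average(series, window)
    if smoothed[-1] > smoothed[0]:
        return "upward"
    if smoothed[-1] < smoothed[0]:
        return "downward"
    return "flat"

# Hypothetical monthly sales figures.
monthly_sales = [10, 12, 11, 13, 15, 14, 16]
print(moving_average(monthly_sales, 3))
print(trend(monthly_sales, 3))
```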
