DATA MINING
     BY
     SARANYA




INTRODUCTION

• New buzzword, old idea.
• Inferring new information from already
  collected data.
• Traditionally the job of data analysts.
• Computers have changed this.
  It is far more efficient to comb through
  data using a machine than to eyeball
  statistical data.

DEFINITION
      “Data mining is the entire
process of applying computer-based
methodology, including new techniques
for knowledge discovery, from data.”


Two Main Components

Knowledge Discovery
     Concrete information gleaned from known
 data. Information you may not have known, but
 which is supported by recorded facts.

Knowledge Prediction
    Uses known data to forecast future trends,
 events, etc. (e.g., stock market predictions)
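
As a rough illustration of knowledge prediction, the sketch below fits a straight-line trend to past observations and extrapolates one step ahead. The function names and the price series are made up for this example; real market forecasting needs far richer models.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(xs, ys, next_x):
    """Extrapolate the fitted trend line to a new x value."""
    slope, intercept = fit_line(xs, ys)
    return slope * next_x + intercept


# Hypothetical closing prices for days 1..5 (illustrative numbers only).
days = [1, 2, 3, 4, 5]
prices = [10.0, 12.0, 14.0, 16.0, 18.0]
predicted_day_6 = forecast(days, prices, 6)
print(predicted_day_6)
```

Because the sample prices rise by exactly 2 per day, the extrapolated value for day 6 continues that trend.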


Uses of Data Mining
• AI/Machine Learning
  Combinatorial/Game Data Mining
  Good for analyzing winning strategies for games, and
  thus developing intelligent AI opponents (e.g., chess).
• Business Strategies
  Market Basket Analysis
  Identify customer demographics, preferences, and
  purchasing patterns.
• Risk Analysis
  Product Defect Analysis
  Analyze product defect rates for given plants and
  predict possible complications (read: lawsuits) down
  the line.
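
Market basket analysis can be sketched in a few lines: count how often each pair of items is bought together and keep the pairs that appear in a large enough share of baskets. The basket contents and the 0.5 threshold below are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Return {(item_a, item_b): fraction of baskets containing both items}."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair has one canonical ordering; set() drops repeats.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


# Hypothetical purchase data.
baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
    ["bread", "butter"],
]
support = pair_support(baskets)

# Keep pairs bought together in at least half of all baskets.
frequent_pairs = sorted(p for p, s in support.items() if s >= 0.5)
print(frequent_pairs)
```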

(Continued)
• User Behavior Validation
  Fraud Detection
  Cell phones: comparing phone activity
  with past calling records can help
  detect calls made on cloned phones.

  Credit cards: similarly, comparing new
  purchases with historical purchases can
  detect activity on stolen cards.
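
One simple way to compare a new purchase with historical purchases is a z-score check: flag the purchase if it sits too many standard deviations away from the cardholder's usual spending. This is only a sketch; the function name, the threshold of 3, and the sample amounts are assumptions, and production fraud systems use many more signals.

```python
from statistics import mean, stdev


def is_suspicious(history, amount, threshold=3.0):
    """Flag `amount` if it lies more than `threshold` sample standard
    deviations away from the mean of `history`."""
    mu = mean(history)
    sigma = stdev(history)  # requires at least two historical values
    if sigma == 0:
        # All past purchases identical: anything different stands out.
        return amount != mu
    return abs(amount - mu) / sigma > threshold


# Hypothetical purchase history for one card.
past_purchases = [20.0, 25.0, 22.0, 30.0, 18.0, 27.0]

print(is_suspicious(past_purchases, 24.0))   # typical amount
print(is_suspicious(past_purchases, 500.0))  # far outside the usual range
```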

Sources of Data for Mining


• Databases (most obvious)
• Text Documents
• Computer Simulations
• Social Networks
Data Mining Development




Database Processing vs. Data Mining Processing

             Database Processing      Data Mining Processing
   • Query   – Well defined           – Poorly defined
             – SQL                    – No precise query language
   • Data    – Operational data       – Not operational data
   • Output  – Precise                – Fuzzy
             – Subset of database     – Not a subset of database
Data Mining Models and Tasks




Basic Data Mining Tasks
• Classification maps data into predefined
  groups or classes
  – Supervised learning
  – Pattern recognition
  – Prediction
• Regression maps a data item
  to a real-valued prediction variable.
• Clustering groups similar data together into
  clusters.
  – Unsupervised learning
  – Segmentation
  – Partitioning
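
The difference between classification (supervised) and clustering (unsupervised) can be sketched on one-dimensional data. The first function assigns a predefined label by looking at the nearest labeled example; the second finds two groups with no labels at all, using a basic k-means loop with k = 2. Both functions and the sample values are illustrative assumptions, not a reference implementation.

```python
def nearest_neighbor_label(labeled_points, query):
    """Classification: return the label of the closest labeled example."""
    closest = min(labeled_points, key=lambda pair: abs(pair[0] - query))
    return closest[1]


def two_means(values, iterations=10):
    """Clustering: split 1-D values into two groups (k-means with k = 2)."""
    low, high = min(values), max(values)  # initial cluster centers
    if low == high:
        raise ValueError("need at least two distinct values")
    for _ in range(iterations):
        # Assign each value to the nearer center.
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        # Move each center to the mean of its group.
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return sorted(group_low), sorted(group_high)


# Supervised: the classes are given in advance.
labeled = [(1.0, "small"), (2.0, "small"), (9.0, "large"), (10.0, "large")]
predicted = nearest_neighbor_label(labeled, 8.2)

# Unsupervised: the groups are discovered from the data alone.
clusters = two_means([1.0, 9.5, 2.0, 9.0, 1.5, 10.0])
print(predicted, clusters)
```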
(cont’d)
• Summarization maps data into subsets with
  associated simple descriptions.
  – Characterization
  – Generalization
• Link Analysis uncovers relationships
  among data.
  – Affinity Analysis
  – Association Rules
  – Sequential Analysis determines sequential
    patterns.
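
For association rules, two standard measures are support (how often an itemset appears) and confidence (how often the rule's consequent appears when its antecedent does). A minimal sketch, assuming transactions are stored as Python sets and using made-up data:

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent:
    support(antecedent and consequent) / support(antecedent)."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # rule never applies in this data
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


# Hypothetical transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "butter"},
]
conf_bread_milk = confidence(transactions, {"bread"}, {"milk"})
conf_eggs_milk = confidence(transactions, {"eggs"}, {"milk"})
print(conf_bread_milk, conf_eggs_milk)
```

In this data every transaction with eggs also contains milk, so the rule eggs -> milk has the highest possible confidence, while bread -> milk holds in two of the three bread transactions.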

Data Mining Techniques
• Statistical
   – Point Estimation
   – Models Based on Summarization
   – Bayes' Theorem
   – Hypothesis Testing
   – Regression and Correlation
• Similarity Measures
• Decision Trees
• Neural Networks
   – Activation Functions
• Genetic Algorithms
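
Two of the techniques listed above are small enough to sketch directly: Bayes' theorem, which updates a probability after seeing evidence, and cosine similarity, one common similarity measure between numeric vectors. The function names, parameters, and probabilities below are assumptions for illustration.

```python
import math


def bayes_posterior(prior, likelihood, false_positive_rate):
    """P(H | E) = P(E | H) P(H) / [P(E | H) P(H) + P(E | not H) P(not H)]."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical numbers: 1% of transactions are fraudulent, an alert fires
# on 90% of fraudulent ones and on 5% of legitimate ones.
p_fraud_given_alert = bayes_posterior(0.01, 0.90, 0.05)

same_direction = cosine_similarity([1, 2], [2, 4])
orthogonal = cosine_similarity([1, 0], [0, 1])
print(round(p_fraud_given_alert, 3), same_direction, orthogonal)
```

Even with a fairly accurate alert, the posterior stays well below one half here because fraud is rare to begin with, which is why the prior matters.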
Challenges of Data Mining
• Scalability
• Dimensionality
• Complex and Heterogeneous Data
• Data Quality
• Data Ownership and Distribution
• Privacy Preservation
• Streaming Data


THANK YOU




