   Motivation: Why data mining?
   What is data mining?
   Data Mining: On what kind of data?
   Data mining functionality
   Are all the patterns interesting?
   Major issues in data mining
   Data explosion problem
     Automated data collection tools and mature database technology
      lead to tremendous amounts of data stored in databases, data
      warehouses, and other information repositories
   We are drowning in data, but starving for knowledge!
   Solution: data warehousing and data mining
     Data warehousing and on-line analytical processing (OLAP)
     Extraction of interesting knowledge (rules, regularities, patterns,
      constraints) from data in large databases
   1960s:
     Data collection, database creation, IMS and network DBMS
   1970s:
     Relational data model, relational DBMS implementation
   1980s:
     RDBMS, advanced data models (extended-relational, OO, deductive,
      etc.) and application-oriented DBMS (spatial, scientific, engineering,
      etc.)
   1990s—2000s:
     Data mining and data warehousing, multimedia databases, and Web
      databases
   Data mining (knowledge discovery in databases):
     Extraction of interesting (non-trivial, implicit, previously
      unknown and potentially useful) information or patterns from
      data in large databases
   Alternative names:
     Data mining: a misnomer?
     Knowledge discovery (mining) in databases (KDD), knowledge
      extraction, data/pattern analysis, data archeology, data
      dredging, information harvesting, business intelligence, etc.
   What is not data mining?
     (Deductive) query processing.
     Expert systems or small ML/statistical programs


   Database analysis and decision support
     Market analysis and management
        ▪ target marketing, customer relationship management, market
          basket analysis, cross-selling, market segmentation
     Risk analysis and management
       ▪ Forecasting, customer retention, improved underwriting, quality
         control, competitive analysis
     Fraud detection and management
   Other Applications
     Text mining (newsgroups, email, documents)
     Stream data mining
     Web mining.
     DNA data analysis

   Where are the data sources for analysis?
     Credit card transactions, loyalty cards, discount coupons, customer
      complaint calls, plus (public) lifestyle studies
   Target marketing
     Find clusters of “model” customers who share the same
      characteristics: interest, income level, spending habits, etc.
   Determine customer purchasing patterns over time
     Conversion of a single to a joint bank account: marriage, etc.
   Cross-market analysis
     Associations/correlations between product sales
     Prediction based on the association information

   Customer profiling
     data mining can tell you what types of customers buy what products
      (clustering or classification)
   Identifying customer requirements
     identifying the best products for different customers

     use prediction to find what factors will attract new customers
   Provides summary information
     various multidimensional summary reports

     statistical summary information (central tendency and variation
      of the data)
   Financial planning and asset evaluation
     cash flow analysis and prediction
     contingent claim analysis to evaluate assets
     cross-sectional and time series analysis (financial-ratio, trend
      analysis, etc.)
   Resource planning:
     summarize and compare the resources and spending
   Competition:
     monitor competitors and market directions
     group customers into classes and apply class-based pricing
     set pricing strategy in a highly competitive market



   Applications
     widely used in health care, retail, credit card services,
      telecommunications (phone card fraud), etc.
   Approach
     use historical data to build models of fraudulent behavior and use
      data mining to help identify similar instances
   Examples
     auto insurance: detect a group of people who stage accidents to
      collect on insurance
     money laundering: detect suspicious money transactions (US
      Treasury's Financial Crimes Enforcement Network)
     medical insurance: detect professional patients and rings of doctors
      and referrals
   Detecting inappropriate medical treatment
     The Australian Health Insurance Commission identified many cases in
      which blanket screening tests were requested (saving AUD 1M/yr).
   Detecting telephone fraud
     Telephone call model: destination of the call, duration, time of day or
      week. Analyze patterns that deviate from an expected norm.
     British Telecom identified discrete groups of callers with frequent
      intra-group calls, especially mobile phones, and broke a multimillion
      dollar fraud.
   Retail
     Analysts estimate that 38% of retail shrink is due to dishonest
      employees.



   Sports
     IBM Advanced Scout analyzed NBA game statistics (shots blocked,
      assists, and fouls) to gain a competitive advantage for the New York
      Knicks and the Miami Heat
   Astronomy
     JPL and the Palomar Observatory discovered 22 quasars with the
      help of data mining
   Internet Web Surf-Aid
     IBM Surf-Aid applies data mining algorithms to Web access logs for
      market-related pages to discover customer preferences and behavior,
      analyze the effectiveness of Web marketing, improve Web site
      organization, etc.

   Data mining: the core of the knowledge discovery process.

   [Figure: the knowledge discovery process: Databases → Data Cleaning →
    Data Integration → Data Warehouse → Selection → Task-relevant Data →
    Data Mining → Pattern Evaluation]
   Learning the application domain:
     relevant prior knowledge and goals of application
   Creating a target data set: data selection
   Data cleaning and preprocessing: (may take 60% of effort!)
   Data reduction and transformation:
     Find useful features, dimensionality/variable reduction, invariant
        representation.
   Choosing functions of data mining
       summarization, classification, regression, association, clustering.
   Choosing the mining algorithm(s)
   Data mining: search for patterns of interest
   Pattern evaluation and knowledge presentation
     visualization, transformation, removing redundant patterns, etc.
   Use of discovered knowledge


   Relational databases
   Data warehouses
   Transactional databases
   Advanced DB and information repositories
       Object-oriented and object-relational databases
       Spatial and temporal data
       Time-series data and stream data
       Text databases and multimedia databases
       Heterogeneous and legacy databases
       WWW
   Association rule mining:
     Finding frequent patterns, associations, correlations, or causal
      structures among sets of items or objects in transaction databases,
      relational databases, and other information repositories.
     Frequent pattern: pattern (set of items, sequence, etc.) that occurs
      frequently in a database
   Motivation: finding regularities in data
       What products were often purchased together? — Beer and diapers?!
       What are the subsequent purchases after buying a PC?
       What kinds of DNA are sensitive to this new drug?
       Can we automatically classify web documents?



   Transaction database:

       Transaction-id   Items bought
             10         A, B, C
             20         A, C
             30         A, D
             40         B, E, F

   Itemset X = {x1, …, xk}
   Find all the rules X → Y with minimum confidence and support
     support, s: probability that a transaction contains X ∪ Y
     confidence, c: conditional probability that a transaction
      having X also contains Y

   [Figure: Venn diagram of customers who buy beer, customers who buy
    diapers, and customers who buy both]

   Let min_support = 50%, min_conf = 50%:
       A → C (50%, 66.7%)
       C → A (50%, 100%)
   Min. support = 50%, min. confidence = 50%

       Transaction-id   Items bought         Frequent pattern   Support
             10         A, B, C                   {A}            75%
             20         A, C                      {B}            50%
             30         A, D                      {C}            50%
             40         B, E, F                   {A, C}         50%

   For rule A ⇒ C:
       support = support({A} ∪ {C}) = 50%
       confidence = support({A} ∪ {C}) / support({A}) = 66.7%
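The support and confidence computations above can be sketched directly. A minimal Python illustration, using the four-transaction example from this slide (the function names `support` and `confidence` are illustrative, not from a library):

```python
# Minimal sketch: support and confidence on the four-transaction example.
transactions = [
    {"A", "B", "C"},  # Tid 10
    {"A", "C"},       # Tid 20
    {"A", "D"},       # Tid 30
    {"B", "E", "F"},  # Tid 40
]

def support(itemset):
    # fraction of transactions that contain every item of `itemset`
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # conditional probability: support(lhs ∪ rhs) / support(lhs)
    return support(lhs | rhs) / support(lhs)

print(support({"A"}))                      # 0.75
print(support({"A", "C"}))                 # 0.5
print(round(confidence({"A"}, {"C"}), 3))  # 0.667: rule A => C
print(confidence({"C"}, {"A"}))            # 1.0:   rule C => A
```

These match the slide's rules A → C (50%, 66.7%) and C → A (50%, 100%).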
   Any subset of a frequent itemset must be frequent
     if {beer, diaper, nuts} is frequent, so is {beer, diaper}
     every transaction containing {beer, diaper, nuts} also contains {beer, diaper}
 Apriori pruning principle: if an itemset is infrequent, its supersets
  need not be generated or tested!
 Method:
   generate length-(k+1) candidate itemsets from length-k frequent
    itemsets, and
   test the candidates against the DB
 Performance studies show its efficiency and scalability
Example run (min. support count = 2):

   Database TDB
   Tid   Items
   10    A, C, D
   20    B, C, E
   30    A, B, C, E
   40    B, E

   1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
              L1: {A}:2, {B}:3, {C}:3, {E}:3

   C2 (self-join of L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
   2nd scan → counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
              L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

   C3 (self-join of L2): {B,C,E}
   3rd scan → L3: {B,C,E}:2
   Pseudo-code:
      Ck: candidate itemsets of size k
      Lk: frequent itemsets of size k

      L1 = {frequent items};
      for (k = 1; Lk != ∅; k++) do begin
          Ck+1 = candidates generated from Lk;
          for each transaction t in database do
              increment the count of all candidates in Ck+1
              that are contained in t;
          Lk+1 = candidates in Ck+1 with min_support;
      end
      return ∪k Lk;
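A runnable sketch of this pseudo-code, as an illustrative Python version (not an official implementation). Variable names mirror the slide's Ck and Lk; `min_sup` is assumed to be an absolute support count:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    # Sketch of the Apriori level-wise search from the pseudo-code above.
    transactions = [set(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})

    def count(cands):
        # absolute support count of each candidate tuple
        return {c: sum(set(c) <= t for t in transactions) for c in cands}

    # L1 = {frequent items}
    Lk = {c for c, n in count([(i,) for i in items]).items() if n >= min_sup}
    result = set(Lk)
    k = 1
    while Lk:
        # Ck+1: self-join Lk (tuples kept in sorted item order), then prune
        # any candidate with an infrequent k-subset (Apriori principle)
        Ck1 = set()
        for p in Lk:
            for q in Lk:
                if p[:k - 1] == q[:k - 1] and p[k - 1] < q[k - 1]:
                    cand = p + (q[k - 1],)
                    if all(s in Lk for s in combinations(cand, k)):
                        Ck1.add(cand)
        # Lk+1: candidates in Ck+1 with min support
        Lk = {c for c, n in count(Ck1).items() if n >= min_sup}
        result |= Lk
        k += 1
    return result

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
# Reproduces the example trace: A, B, C, E, AC, BC, BE, CE, BCE
print(sorted(apriori(tdb, 2)))
```

On the example database this yields exactly the itemsets found by the three scans in the trace.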
   How to generate candidates?
     Step 1: self-joining Lk
     Step 2: pruning
   Example of Candidate-generation
     L3={abc, abd, acd, ace, bcd}
     Self-joining: L3*L3
       ▪ abcd from abc and abd
       ▪ acde from acd and ace
     Pruning:
       ▪ acde is removed because ade is not in L3
     C4={abcd}


   Suppose the items in Lk-1 are listed in an order
   Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
   Step 2: pruning
    forall itemsets c in Ck do
       forall (k-1)-subsets s of c do
           if (s is not in Lk-1) then delete c from Ck


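The self-join and prune steps can be sketched in Python on the slide's example, L3 = {abc, abd, acd, ace, bcd}, with items kept in lexicographic order as assumed above:

```python
from itertools import combinations

# The slide's example: frequent 3-itemsets, items in lexicographic order.
L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")}
k = 3

# Step 1: self-join -- merge pairs that share their first k-1 items
joined = {p + (q[-1],)
          for p in L3 for q in L3
          if p[:k - 1] == q[:k - 1] and p[-1] < q[-1]}

# Step 2: prune -- drop candidates that have an infrequent k-subset
C4 = {c for c in joined if all(s in L3 for s in combinations(c, k))}

print(sorted(joined))  # abcd (from abc+abd) and acde (from acd+ace)
print(C4)              # only abcd survives: ade is not in L3
```

This reproduces the slide's result C4 = {abcd}, with acde removed by pruning.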
 Finding models (functions) that describe and
  distinguish classes or concepts for future prediction
 E.g., classify countries based on climate, or classify
  cars based on gas mileage
 Presentation: decision-tree, classification rule, neural
  network
 Prediction: Predict some unknown or missing
  numerical values
Training data:

   NAME   RANK             YEARS   TENURED
   Mike   Assistant Prof     3       no
   Mary   Assistant Prof     7       yes
   Bill   Professor          2       yes
   Jim    Associate Prof     7       yes
   Dave   Assistant Prof     6       no
   Anne   Associate Prof     3       no

Classification algorithms derive a classifier (model) from the training
data, e.g.:

   IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
The classifier is then applied to testing data (to estimate its accuracy)
and to unseen data:

   NAME      RANK             YEARS   TENURED
   Tom       Assistant Prof     2       no
   Merlisa   Associate Prof     7       no
   George    Professor          5       yes
   Joseph    Assistant Prof     7       yes

   Unseen data: (Jeff, Professor, 4). Tenured?
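As a sketch, the learned rule can be encoded and applied to the test tuples and the unseen tuple; the function name `tenured` is illustrative:

```python
# Sketch: the rule learned from the training data, applied to the test set.
def tenured(rank, years):
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

test_set = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]
correct = sum(tenured(r, y) == actual for _, r, y, actual in test_set)
print(f"accuracy on test data: {correct}/{len(test_set)}")  # 3/4

print(tenured("Professor", 4))  # unseen tuple (Jeff, Professor, 4) -> yes
```

Note that the rule misclassifies Merlisa (predicted yes, actual no), which is exactly why accuracy is estimated on held-out test data before the model is used on unseen tuples.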
Training set:

   age      income   student   credit_rating
   <=30     high     no        fair
   <=30     high     no        excellent
   31…40    high     no        fair
   >40      medium   no        fair
   >40      low      yes       fair
   >40      low      yes       excellent
   31…40    low      yes       excellent
   <=30     medium   no        fair
   <=30     low      yes       fair
   >40      medium   yes       fair
   <=30     medium   yes       excellent
   31…40    medium   no        excellent
   31…40    high     yes       fair
   >40      medium   no        excellent
Output: a decision tree

   age?
     <=30:    student?
                no:  no
                yes: yes
     31..40:  yes
     >40:     credit rating?
                excellent: no
                fair:      yes
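The tree can be read as nested conditionals. A small sketch (branch and leaf labels follow the slide; the function name `classify` is illustrative):

```python
# Nested-conditional sketch of the decision tree above.
def classify(age, student, credit_rating):
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    elif age == "31..40":
        return "yes"
    else:  # age == ">40"
        return "yes" if credit_rating == "fair" else "no"

print(classify("<=30", "yes", "fair"))     # yes
print(classify(">40", "no", "excellent"))  # no
print(classify("31..40", "no", "fair"))    # yes
```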
   Cluster analysis
     Class label is unknown: group data to form new classes, e.g., cluster
      houses to find distribution patterns
     Clustering is based on the principle of maximizing intra-class
      similarity and minimizing inter-class similarity
   Outlier analysis
     Outlier: a data object that does not comply with the general behavior
      of the data
     Outliers can be treated as noise or exceptions, but are quite useful in
      fraud detection and rare-event analysis

