SlideShare a Scribd company logo
1
   Motivation: Why data mining?
   What is data mining?
   Data Mining: On what kind of data?
   Data mining functionality
   Are all the patterns interesting?
   Major issues in data mining
                                         2
   Data explosion problem

     Automated data collection tools and mature database technology

      lead to tremendous amounts of data stored in databases, data
      warehouses and other information repositories
   We are drowning in data, but starving for knowledge!
   Solution: Data warehousing and data mining
     Data warehousing and on-line analytical processing

     Extraction of interesting knowledge (rules, regularities, patterns,

      constraints) from data in large databases

                                                                            3
   1960s:
     Data collection, database creation, IMS and network DBMS
   1970s:
     Relational data model, relational DBMS implementation
   1980s:
     RDBMS, advanced data models (extended-relational, OO, deductive,
      etc.) and application-oriented DBMS (spatial, scientific, engineering,
      etc.)
   1990s—2000s:
     Data mining and data warehousing, multimedia databases, and Web
      databases
                                                                               4
   Data mining (knowledge discovery in databases):
     Extraction of interesting (non-trivial, implicit, previously
      unknown and potentially useful) information or patterns from
      data in large databases
   Alternative names:
     Data mining: a misnomer?
     Knowledge discovery(mining) in databases (KDD), knowledge
      extraction, data/pattern analysis, data archeology, data
      dredging, information harvesting, business intelligence, etc.
   What is not data mining?
     (Deductive) query processing.
     Expert systems or small ML/statistical programs


                                                                      5
   Database analysis and decision support
     Market analysis and management
       ▪ target marketing, customer relation management, market
         basket analysis, cross selling, market segmentation
     Risk analysis and management
       ▪ Forecasting, customer retention, improved underwriting, quality
         control, competitive analysis
     Fraud detection and management
   Other Applications
     Text mining (news group, email, documents)
     Stream data mining
     Web mining.
     DNA data analysis

                                                                           6
   Where are the data sources for analysis?
     Credit card transactions, loyalty cards, discount coupons, customer
      complaint calls, plus (public) lifestyle studies
   Target marketing
     Find clusters of “model” customers who share the same
      characteristics: interest, income level, spending habits, etc.
   Determine customer purchasing patterns over time
     Conversion of single to a joint bank account: marriage, etc.
   Cross-market analysis
     Associations/co-relations between product sales
     Prediction based on the association information

                                                                            7
   Customer profiling
     data mining can tell you what types of customers buy what products
      (clustering or classification)
   Identifying customer requirements
     identifying the best products for different customers

     use prediction to find what factors will attract new customers
   Provides summary information
     various multidimensional summary reports

     statistical summary information (data central tendency and
      variation)
                                                                           8
   Finance planning and asset evaluation
     cash flow analysis and prediction
     contingent claim analysis to evaluate assets
     cross-sectional and time series analysis (financial-ratio, trend
      analysis, etc.)
   Resource planning:
     summarize and compare the resources and spending
   Competition:
     monitor competitors and market directions
     group customers into classes and a class-based pricing procedure
     set pricing strategy in a highly competitive market



                                                                         9
   Applications
     widely used in health care, retail, credit card services,
      telecommunications (phone card fraud), etc.
   Approach
     use historical data to build models of fraudulent behavior and use
      data mining to help identify similar instances
   Examples
     auto insurance: detect a group of people who stage accidents to
      collect on insurance
     money laundering: detect suspicious money transactions (US
      Treasury's Financial Crimes Enforcement Network)
     medical insurance: detect professional patients and ring of doctors
      and ring of references
                                                                            10
   Detecting inappropriate medical treatment
     Australian Health Insurance Commission identifies that in many cases
      blanket screening tests were requested (save Australian $1m/yr).
   Detecting telephone fraud
     Telephone call model: destination of the call, duration, time of day or
      week. Analyze patterns that deviate from an expected norm.
     British Telecom identified discrete groups of callers with frequent
      intra-group calls, especially mobile phones, and broke a multimillion
      dollar fraud.
   Retail
     Analysts estimate that 38% of retail shrink is due to dishonest
      employees.



                                                                                11
   Sports
     IBM Advanced Scout analyzed NBA game statistics (shots blocked,
      assists, and fouls) to gain competitive advantage for New York
      Knicks and Miami Heat
   Astronomy
     JPL and the Palomar Observatory discovered 22 quasars with the
      help of data mining
   Internet Web Surf-Aid
     IBM Surf-Aid applies data mining algorithms to Web access logs for
      market-related pages to discover customer preference and behavior
      pages, analyzing effectiveness of Web marketing, improving Web
      site organization, etc.

                                                                           12
Pattern Evaluation
   Data mining: the core of
    knowledge discovery
    process.                       Data Mining

                    Task-relevant Data


      Data                   Selection
      Warehouse
Data Cleaning

          Data Integration


        Databases
                                                               13
   Learning the application domain:
     relevant prior knowledge and goals of application
   Creating a target data set: data selection
   Data cleaning and preprocessing: (may take 60% of effort!)
   Data reduction and transformation:
     Find useful features, dimensionality/variable reduction, invariant
        representation.
   Choosing functions of data mining
       summarization, classification, regression, association, clustering.
   Choosing the mining algorithm(s)
   Data mining: search for patterns of interest
   Pattern evaluation and knowledge presentation
     visualization, transformation, removing redundant patterns, etc.
   Use of discovered knowledge


                                                                              14
   Relational databases
   Data warehouses
   Transactional databases
   Advanced DB and information repositories
       Object-oriented and object-relational databases
       Spatial and temporal data
       Time-series data and stream data
       Text databases and multimedia databases
       Heterogeneous and legacy databases
       WWW
                                                          15
   Association rule mining:
     Finding frequent patterns, associations, correlations, or causal
      structures among sets of items or objects in transaction databases,
      relational databases, and other information repositories.
     Frequent pattern: pattern (set of items, sequence, etc.) that occurs
      frequently in a database
   Motivation: finding regularities in data
       What products were often purchased together? — Beer and diapers?!
       What are the subsequent purchases after buying a PC?
       What kinds of DNA are sensitive to this new drug?
       Can we automatically classify web documents?



                                                                             16
Transaction-id           Items bought    Itemset X={x1, …, xk}
     10                     A, B, C      Find all the rules XY with min
     20                      A, C         confidence and support
     30                      A, D          support, s, probability that a
     40                     B, E, F          transaction contains X∪Y
                                           confidence, c, conditional probability
                                             that a transaction having X also
             Customer       Customer         contains Y.
             buys both      buys
                            diapers

                                            Let min_support = 50%,
                                            min_conf = 50%:
 Customer
                                               A  C (50%, 66.7%)
 buys beer                                     C  A (50%, 100%)
                                                                                 17
Transaction-id   Items bought   Min. support 50%
     10             A, B, C     Min. confidence 50%
     20              A, C
     30              A, D         Frequent pattern    Support
     40             B, E, F             {A}             75%
                                        {B}             50%
                                        {C}             50%
                                       {A, C}           50%


 For rule A ⇒ C:
    support = support({A}∪{C}) = 50%
    confidence = support({A}∪{C})/support({A}) = 66.6%

                                                                18
   Any subset of a frequent itemset must be frequent
     if {beer, diaper, nuts} is frequent, so is {beer, diaper}
     every transaction having {beer, diaper, nuts} also contains {beer, diaper}
 Apriori pruning principle: If there is any itemset which is infrequent, its
  superset should not be generated/tested!
 Method:
   generate length (k+1) candidate itemsets from length k frequent itemsets,
    and
   test the candidates against DB
 The performance studies show its efficiency and scalability



                                                                                   19
Itemset       sup
                                                                     Itemset       sup
                                            {A}         2
Database TDB                                                  L1         {A}         2
                                  C1        {B}         3
Tid        Items                                                         {B}         3
                                            {C}         3
10         A, C, D           1st scan                                    {C}         3
                                            {D}         1
                                                                         {E}         3
20         B, C, E                          {E}         3
30     A, B, C, E
40          B, E                 C2     Itemset    sup               C2        Itemset
                                         {A, B}     1
                                         {A, C}     2        2nd scan           {A, B}
 L2   Itemset        sup
                                         {A, E}     1                           {A, C}
       {A, C}            2
                                         {B, C}     2                           {A, E}
       {B, C}            2
                                         {B, E}     3                           {B, C}
       {B, E}            3
                                         {C, E}     2                           {B, E}
       {C, E}            2
                                                                                {C, E}

      C3     Itemset
                                 3rd scan         L3   Itemset     sup
             {B, C, E}                                 {B, C, E}    2
                                                                                         20
   Pseudo-code:
      Ck: Candidate itemset of size k
      Lk : frequent itemset of size k
      L1 = {frequent items};
      for (k = 1; Lk !=∅; k++) do begin
         Ck+1 = candidates generated from Lk;
        for each transaction t in database do
               increment the count of all candidates in Ck+1
           that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
        end
      return ∪k Lk;
                                                               21
   How to generate candidates?
     Step 1: self-joining Lk
     Step 2: pruning
   Example of Candidate-generation
     L3={abc, abd, acd, ace, bcd}
     Self-joining: L3*L3
       ▪ abcd from abc and abd
       ▪ acde from acd and ace
     Pruning:
       ▪ acde is removed because ade is not in L3
     C4={abcd}


                                                    22
   Suppose the items in Lk-1 are listed in an order
   Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
   Step 2: pruning
    forall itemsets c in Ck do
       forall (k-1)-subsets s of c do
           if (s is not in Lk-1) then delete c from Ck


                                                                           23
 Finding models (functions) that describe and
  distinguish classes or concepts for future prediction
 E.g., classify countries based on climate, or classify
  cars based on gas mileage
 Presentation: decision-tree, classification rule, neural
  network
 Prediction: Predict some unknown or missing
  numerical values
                                                             24
Classification
                                            Algorithms
              Training
                Data


NAM E   RANK           YEARS TENURED         Classifier
M ike   Assistant Prof   3      no           (Model)
M ary   Assistant Prof   7      yes
Bill    Professor        2      yes
Jim     Associate Prof   7      yes
                                       IF rank = ‘professor’
Dave    Assistant Prof   6      no
                                       OR years > 6
Anne    Associate Prof   3      no
                                       THEN tenured = ‘yes’
                                                            25
Classifier


                 Testing
                  Data                          Unseen Data

                                             (Jeff, Professor, 4)
NAM E      RANK           YEARS TENURED
Tom        Assistant Prof   2      no        Tenured?
M erlisa   Associate Prof   7      no
G eorge    Professor        5      yes
Joseph     Assistant Prof   7      yes
                                                                    26
age    income student credit_rating
           <=30    high       no fair
Training   <=30
           31…40
                   high
                   high
                              no excellent
                              no fair
set        >40     medium     no fair
           >40     low       yes fair
           >40     low       yes excellent
           31…40   low       yes excellent
           <=30    medium     no fair
           <=30    low       yes fair
           >40     medium    yes fair
           <=30    medium    yes excellent
           31…40   medium     no excellent
           31…40   high      yes fair
           >40     medium     no excellent

                                                   27
age?


        <=30          overcast
                       30..40     >40

     student?           yes        credit rating?


no              yes              excellent    fair

no              yes                 no        yes

                                                     28
   Cluster analysis
     Class label is unknown: Group data to form new classes, e.g., cluster houses
       to find distribution patterns
     Clustering based on the principle: maximizing the intra-class similarity and
       minimizing the interclass similarity
   Outlier analysis
     Outlier: a data object that does not comply with the general behavior of the
       data
     It can be considered as noise or exception but is quite useful in fraud
       detection, rare events analysis


                                                                                     29
30
31

More Related Content

What's hot

introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial
Salah Amean
 
Data mining
Data miningData mining
Data mining
Daminda Herath
 
Introduction-to-Knowledge Discovery in Database
Introduction-to-Knowledge Discovery in DatabaseIntroduction-to-Knowledge Discovery in Database
Introduction-to-Knowledge Discovery in Database
Kartik Kalpande Patil
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining Phi Jack
 
3 Data Mining Tasks
3  Data Mining Tasks3  Data Mining Tasks
3 Data Mining Tasks
Mahmoud Alfarra
 
Knowledge discovery thru data mining
Knowledge discovery thru data miningKnowledge discovery thru data mining
Knowledge discovery thru data mining
Devakumar Jain
 
Introduction to Data Mining
Introduction to Data Mining Introduction to Data Mining
Introduction to Data Mining
Sushil Kulkarni
 
What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)
Pratik Tambekar
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and work
Amr Abd El Latief
 
Basic Overview of Data Mining
Basic Overview of Data MiningBasic Overview of Data Mining
Basic Overview of Data Mining
Syracuse University
 
Data Mining
Data MiningData Mining
Data Mining
SHIKHA GAUTAM
 
Unit 3 part i Data mining
Unit 3 part i Data miningUnit 3 part i Data mining
Unit 3 part i Data mining
Dhilsath Fathima
 
Data mining
Data mining Data mining
Data mining
AthiraR23
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
Hoang Nguyen
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
AbcdDcba12
 
Data Mining: an Introduction
Data Mining: an IntroductionData Mining: an Introduction
Data Mining: an Introduction
Ali Abbasi
 
data mining and data warehousing
data mining and data warehousingdata mining and data warehousing
data mining and data warehousing
Sunny Gandhi
 
Knowledge discovery process
Knowledge discovery process Knowledge discovery process
Knowledge discovery process
Shuvra Ghosh
 
Data Mining: Future Trends and Applications
Data Mining: Future Trends and ApplicationsData Mining: Future Trends and Applications
Data Mining: Future Trends and Applications
IJMER
 

What's hot (20)

introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial
 
Data mining
Data miningData mining
Data mining
 
Introduction-to-Knowledge Discovery in Database
Introduction-to-Knowledge Discovery in DatabaseIntroduction-to-Knowledge Discovery in Database
Introduction-to-Knowledge Discovery in Database
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining
 
3 Data Mining Tasks
3  Data Mining Tasks3  Data Mining Tasks
3 Data Mining Tasks
 
Knowledge discovery thru data mining
Knowledge discovery thru data miningKnowledge discovery thru data mining
Knowledge discovery thru data mining
 
Introduction to Data Mining
Introduction to Data Mining Introduction to Data Mining
Introduction to Data Mining
 
What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and work
 
Basic Overview of Data Mining
Basic Overview of Data MiningBasic Overview of Data Mining
Basic Overview of Data Mining
 
Data Mining
Data MiningData Mining
Data Mining
 
Data mining 1
Data mining 1Data mining 1
Data mining 1
 
Unit 3 part i Data mining
Unit 3 part i Data miningUnit 3 part i Data mining
Unit 3 part i Data mining
 
Data mining
Data mining Data mining
Data mining
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Data Mining: an Introduction
Data Mining: an IntroductionData Mining: an Introduction
Data Mining: an Introduction
 
data mining and data warehousing
data mining and data warehousingdata mining and data warehousing
data mining and data warehousing
 
Knowledge discovery process
Knowledge discovery process Knowledge discovery process
Knowledge discovery process
 
Data Mining: Future Trends and Applications
Data Mining: Future Trends and ApplicationsData Mining: Future Trends and Applications
Data Mining: Future Trends and Applications
 

Similar to Data miningppt378

Introduction.ppt
Introduction.pptIntroduction.ppt
Introduction.ppt
bommaiah
 
Data warehouse and data mining
Data warehouse and data miningData warehouse and data mining
Data warehouse and data miningRohit Kumar
 
6 weeks summer training in data mining,jalandhar
6 weeks summer training in data mining,jalandhar6 weeks summer training in data mining,jalandhar
6 weeks summer training in data mining,jalandhar
deepikakaler1
 
6months industrial training in data mining,ludhiana
6months industrial training in data mining,ludhiana6months industrial training in data mining,ludhiana
6months industrial training in data mining,ludhiana
deepikakaler1
 
6months industrial training in data mining, jalandhar
6months industrial training in data mining, jalandhar6months industrial training in data mining, jalandhar
6months industrial training in data mining, jalandhar
deepikakaler1
 
6 weeks summer training in data mining,ludhiana
6 weeks summer training in data mining,ludhiana6 weeks summer training in data mining,ludhiana
6 weeks summer training in data mining,ludhiana
deepikakaler1
 
Data mining 1 - Introduction (cheat sheet - printable)
Data mining 1 - Introduction (cheat sheet - printable)Data mining 1 - Introduction (cheat sheet - printable)
Data mining 1 - Introduction (cheat sheet - printable)
yesheeka
 
Data mining final year project in ludhiana
Data mining final year project in ludhianaData mining final year project in ludhiana
Data mining final year project in ludhiana
deepikakaler1
 
Data mining final year project in jalandhar
Data mining final year project in jalandharData mining final year project in jalandhar
Data mining final year project in jalandhar
deepikakaler1
 
Introduction To Data Mining
Introduction To Data MiningIntroduction To Data Mining
Introduction To Data Miningdataminers.ir
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
Tony Nguyen
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
Young Alista
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
Harry Potter
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
James Wong
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
Fraboni Ec
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
Luis Goldster
 

Similar to Data miningppt378 (20)

Introduction
IntroductionIntroduction
Introduction
 
Introduction.ppt
Introduction.pptIntroduction.ppt
Introduction.ppt
 
Data warehouse and data mining
Data warehouse and data miningData warehouse and data mining
Data warehouse and data mining
 
Introduction data mining
Introduction data miningIntroduction data mining
Introduction data mining
 
6 weeks summer training in data mining,jalandhar
6 weeks summer training in data mining,jalandhar6 weeks summer training in data mining,jalandhar
6 weeks summer training in data mining,jalandhar
 
6months industrial training in data mining,ludhiana
6months industrial training in data mining,ludhiana6months industrial training in data mining,ludhiana
6months industrial training in data mining,ludhiana
 
6months industrial training in data mining, jalandhar
6months industrial training in data mining, jalandhar6months industrial training in data mining, jalandhar
6months industrial training in data mining, jalandhar
 
6 weeks summer training in data mining,ludhiana
6 weeks summer training in data mining,ludhiana6 weeks summer training in data mining,ludhiana
6 weeks summer training in data mining,ludhiana
 
D
DD
D
 
Data mining 1 - Introduction (cheat sheet - printable)
Data mining 1 - Introduction (cheat sheet - printable)Data mining 1 - Introduction (cheat sheet - printable)
Data mining 1 - Introduction (cheat sheet - printable)
 
Data mining final year project in ludhiana
Data mining final year project in ludhianaData mining final year project in ludhiana
Data mining final year project in ludhiana
 
Data mining final year project in jalandhar
Data mining final year project in jalandharData mining final year project in jalandhar
Data mining final year project in jalandhar
 
Introduction To Data Mining
Introduction To Data MiningIntroduction To Data Mining
Introduction To Data Mining
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining
Data miningData mining
Data mining
 

Recently uploaded

GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
Rohit Gautam
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 

Recently uploaded (20)

GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 

Data miningppt378

  • 1. 1
  • 2. Motivation: Why data mining?  What is data mining?  Data Mining: On what kind of data?  Data mining functionality  Are all the patterns interesting?  Major issues in data mining 2
  • 3. Data explosion problem  Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories  We are drowning in data, but starving for knowledge!  Solution: Data warehousing and data mining  Data warehousing and on-line analytical processing  Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases 3
  • 4. 1960s:  Data collection, database creation, IMS and network DBMS  1970s:  Relational data model, relational DBMS implementation  1980s:  RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)  1990s—2000s:  Data mining and data warehousing, multimedia databases, and Web databases 4
  • 5. Data mining (knowledge discovery in databases):  Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases  Alternative names:  Data mining: a misnomer?  Knowledge discovery(mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.  What is not data mining?  (Deductive) query processing.  Expert systems or small ML/statistical programs 5
  • 6. Database analysis and decision support  Market analysis and management ▪ target marketing, customer relation management, market basket analysis, cross selling, market segmentation  Risk analysis and management ▪ Forecasting, customer retention, improved underwriting, quality control, competitive analysis  Fraud detection and management  Other Applications  Text mining (news group, email, documents)  Stream data mining  Web mining.  DNA data analysis 6
  • 7. Where are the data sources for analysis?  Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies  Target marketing  Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.  Determine customer purchasing patterns over time  Conversion of single to a joint bank account: marriage, etc.  Cross-market analysis  Associations/co-relations between product sales  Prediction based on the association information 7
  • 8. Customer profiling  data mining can tell you what types of customers buy what products (clustering or classification)  Identifying customer requirements  identifying the best products for different customers  use prediction to find what factors will attract new customers  Provides summary information  various multidimensional summary reports  statistical summary information (data central tendency and variation) 8
  • 9. Finance planning and asset evaluation  cash flow analysis and prediction  contingent claim analysis to evaluate assets  cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)  Resource planning:  summarize and compare the resources and spending  Competition:  monitor competitors and market directions  group customers into classes and a class-based pricing procedure  set pricing strategy in a highly competitive market 9
  • 10. Applications  widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.  Approach  use historical data to build models of fraudulent behavior and use data mining to help identify similar instances  Examples  auto insurance: detect a group of people who stage accidents to collect on insurance  money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)  medical insurance: detect professional patients and ring of doctors and ring of references 10
  • 11. Detecting inappropriate medical treatment  Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr).  Detecting telephone fraud  Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.  British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.  Retail  Analysts estimate that 38% of retail shrink is due to dishonest employees. 11
  • 12. Sports  IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat  Astronomy  JPL and the Palomar Observatory discovered 22 quasars with the help of data mining  Internet Web Surf-Aid  IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc. 12
  • 13. Pattern Evaluation  Data mining: the core of knowledge discovery process. Data Mining Task-relevant Data Data Selection Warehouse Data Cleaning Data Integration Databases 13
  • 14. Learning the application domain:  relevant prior knowledge and goals of application  Creating a target data set: data selection  Data cleaning and preprocessing: (may take 60% of effort!)  Data reduction and transformation:  Find useful features, dimensionality/variable reduction, invariant representation.  Choosing functions of data mining  summarization, classification, regression, association, clustering.  Choosing the mining algorithm(s)  Data mining: search for patterns of interest  Pattern evaluation and knowledge presentation  visualization, transformation, removing redundant patterns, etc.  Use of discovered knowledge 14
  • 15. Relational databases  Data warehouses  Transactional databases  Advanced DB and information repositories  Object-oriented and object-relational databases  Spatial and temporal data  Time-series data and stream data  Text databases and multimedia databases  Heterogeneous and legacy databases  WWW 15
  • 16. Association rule mining:  Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.  Frequent pattern: pattern (set of items, sequence, etc.) that occurs frequently in a database  Motivation: finding regularities in data  What products were often purchased together? — Beer and diapers?!  What are the subsequent purchases after buying a PC?  What kinds of DNA are sensitive to this new drug?  Can we automatically classify web documents? 16
  • 17. Transaction-id Items bought  Itemset X={x1, …, xk} 10 A, B, C  Find all the rules XY with min 20 A, C confidence and support 30 A, D  support, s, probability that a 40 B, E, F transaction contains X∪Y  confidence, c, conditional probability that a transaction having X also Customer Customer contains Y. buys both buys diapers Let min_support = 50%, min_conf = 50%: Customer A  C (50%, 66.7%) buys beer C  A (50%, 100%) 17
  • 18. Transaction-id Items bought Min. support 50% 10 A, B, C Min. confidence 50% 20 A, C 30 A, D Frequent pattern Support 40 B, E, F {A} 75% {B} 50% {C} 50% {A, C} 50% For rule A ⇒ C: support = support({A}∪{C}) = 50% confidence = support({A}∪{C})/support({A}) = 66.6% 18
  • 19. Any subset of a frequent itemset must be frequent  if {beer, diaper, nuts} is frequent, so is {beer, diaper}  every transaction having {beer, diaper, nuts} also contains {beer, diaper}  Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested!  Method:  generate length (k+1) candidate itemsets from length k frequent itemsets, and  test the candidates against DB  The performance studies show its efficiency and scalability 19
  • 20. Itemset sup Itemset sup {A} 2 Database TDB L1 {A} 2 C1 {B} 3 Tid Items {B} 3 {C} 3 10 A, C, D 1st scan {C} 3 {D} 1 {E} 3 20 B, C, E {E} 3 30 A, B, C, E 40 B, E C2 Itemset sup C2 Itemset {A, B} 1 {A, C} 2 2nd scan {A, B} L2 Itemset sup {A, E} 1 {A, C} {A, C} 2 {B, C} 2 {A, E} {B, C} 2 {B, E} 3 {B, C} {B, E} 3 {C, E} 2 {B, E} {C, E} 2 {C, E} C3 Itemset 3rd scan L3 Itemset sup {B, C, E} {B, C, E} 2 20
  • 21. Pseudo-code: Ck: Candidate itemset of size k Lk : frequent itemset of size k L1 = {frequent items}; for (k = 1; Lk !=∅; k++) do begin Ck+1 = candidates generated from Lk; for each transaction t in database do increment the count of all candidates in Ck+1 that are contained in t Lk+1 = candidates in Ck+1 with min_support end return ∪k Lk; 21
  • 22. How to generate candidates?  Step 1: self-joining Lk  Step 2: pruning  Example of Candidate-generation  L3={abc, abd, acd, ace, bcd}  Self-joining: L3*L3 ▪ abcd from abc and abd ▪ acde from acd and ace  Pruning: ▪ acde is removed because ade is not in L3  C4={abcd} 22
  • 23. Suppose the items in Lk-1 are listed in an order  Step 1: self-joining Lk-1 insert into Ck select p.item1, p.item2, …, p.itemk-1, q.itemk-1 from Lk-1 p, Lk-1 q where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1  Step 2: pruning forall itemsets c in Ck do forall (k-1)-subsets s of c do if (s is not in Lk-1) then delete c from Ck 23
  • 24.  Finding models (functions) that describe and distinguish classes or concepts for future prediction  E.g., classify countries based on climate, or classify cars based on gas mileage  Presentation: decision-tree, classification rule, neural network  Prediction: Predict some unknown or missing numerical values 24
  • 25. Classification Algorithms Training Data NAM E RANK YEARS TENURED Classifier M ike Assistant Prof 3 no (Model) M ary Assistant Prof 7 yes Bill Professor 2 yes Jim Associate Prof 7 yes IF rank = ‘professor’ Dave Assistant Prof 6 no OR years > 6 Anne Associate Prof 3 no THEN tenured = ‘yes’ 25
  • 26. Classifier Testing Data Unseen Data (Jeff, Professor, 4) NAM E RANK YEARS TENURED Tom Assistant Prof 2 no Tenured? M erlisa Associate Prof 7 no G eorge Professor 5 yes Joseph Assistant Prof 7 yes 26
  • 27. age income student credit_rating <=30 high no fair Training <=30 31…40 high high no excellent no fair set >40 medium no fair >40 low yes fair >40 low yes excellent 31…40 low yes excellent <=30 medium no fair <=30 low yes fair >40 medium yes fair <=30 medium yes excellent 31…40 medium no excellent 31…40 high yes fair >40 medium no excellent 27
  • 28. age? <=30 overcast 30..40 >40 student? yes credit rating? no yes excellent fair no yes no yes 28
  • 29. Cluster analysis  Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns  Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity  Outlier analysis  Outlier: a data object that does not comply with the general behavior of the data  It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis 29
  • 30. 30
  • 31. 31