Data miningppt378
2. • Motivation: Why data mining?
   • What is data mining?
   • Data Mining: On what kind of data?
   • Data mining functionality
   • Are all the patterns interesting?
   • Major issues in data mining
3. • Data explosion problem
     ▪ Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
     ▪ We are drowning in data, but starving for knowledge!
   • Solution: data warehousing and data mining
     ▪ Data warehousing and on-line analytical processing
     ▪ Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
4. • 1960s: data collection, database creation, IMS and network DBMS
   • 1970s: relational data model, relational DBMS implementation
   • 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
   • 1990s-2000s: data mining and data warehousing, multimedia databases, and Web databases
5. • Data mining (knowledge discovery in databases): extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
   • Alternative names:
     ▪ Data mining: a misnomer?
     ▪ Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
   • What is not data mining?
     ▪ (Deductive) query processing
     ▪ Expert systems or small ML/statistical programs
6. • Database analysis and decision support
     ▪ Market analysis and management: target marketing, customer relation management, market basket analysis, cross-selling, market segmentation
     ▪ Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
     ▪ Fraud detection and management
   • Other applications
     ▪ Text mining (newsgroups, email, documents)
     ▪ Stream data mining
     ▪ Web mining
     ▪ DNA data analysis
7. • Where are the data sources for analysis?
     ▪ Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
   • Target marketing
     ▪ Find clusters of “model” customers who share the same characteristics: interests, income level, spending habits, etc.
   • Determine customer purchasing patterns over time
     ▪ Conversion of a single to a joint bank account: marriage, etc.
   • Cross-market analysis
     ▪ Associations/correlations between product sales
     ▪ Prediction based on the association information
8. • Customer profiling
     ▪ Data mining can tell you what types of customers buy what products (clustering or classification)
   • Identifying customer requirements
     ▪ Identify the best products for different customers
     ▪ Use prediction to find what factors will attract new customers
   • Providing summary information
     ▪ Various multidimensional summary reports
     ▪ Statistical summary information (data central tendency and variation)
9. • Finance planning and asset evaluation
     ▪ Cash flow analysis and prediction
     ▪ Contingent claim analysis to evaluate assets
     ▪ Cross-sectional and time-series analysis (financial ratios, trend analysis, etc.)
   • Resource planning
     ▪ Summarize and compare the resources and spending
   • Competition
     ▪ Monitor competitors and market directions
     ▪ Group customers into classes and apply a class-based pricing procedure
     ▪ Set pricing strategy in a highly competitive market
10. • Applications
      ▪ Widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
    • Approach
      ▪ Use historical data to build models of fraudulent behavior, then use data mining to help identify similar instances
    • Examples
      ▪ Auto insurance: detect groups of people who stage accidents to collect on insurance
      ▪ Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
      ▪ Medical insurance: detect professional patients and rings of doctors and referrals
11. • Detecting inappropriate medical treatment
      ▪ The Australian Health Insurance Commission found that in many cases blanket screening tests were requested (saving AU$1M/yr).
    • Detecting telephone fraud
      ▪ Telephone call model: destination of the call, duration, time of day or week; analyze patterns that deviate from an expected norm.
      ▪ British Telecom identified discrete groups of callers with frequent intra-group calls, especially on mobile phones, and broke a multimillion-dollar fraud.
    • Retail
      ▪ Analysts estimate that 38% of retail shrink is due to dishonest employees.
12. • Sports
      ▪ IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and the Miami Heat
    • Astronomy
      ▪ JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
    • Internet Web Surf-Aid
      ▪ IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
13. • Data mining: the core of the knowledge discovery process
    • [Figure: KDD process flow: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
14. • Learning the application domain: relevant prior knowledge and goals of the application
    • Creating a target data set: data selection
    • Data cleaning and preprocessing (may take 60% of the effort!)
    • Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
    • Choosing functions of data mining: summarization, classification, regression, association, clustering
    • Choosing the mining algorithm(s)
    • Data mining: search for patterns of interest
    • Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
    • Use of discovered knowledge
15. • Relational databases
    • Data warehouses
    • Transactional databases
    • Advanced DB and information repositories
      ▪ Object-oriented and object-relational databases
      ▪ Spatial and temporal data
      ▪ Time-series data and stream data
      ▪ Text databases and multimedia databases
      ▪ Heterogeneous and legacy databases
      ▪ WWW
16. • Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
    • Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database
    • Motivation: finding regularities in data
      ▪ What products were often purchased together? Beer and diapers?!
      ▪ What are the subsequent purchases after buying a PC?
      ▪ What kinds of DNA are sensitive to this new drug?
      ▪ Can we automatically classify web documents?
17. • Itemset X = {x1, …, xk}
    • Given the transaction database below, find all the rules X ⇒ Y with minimum support and confidence
      ▪ support, s: probability that a transaction contains X ∪ Y
      ▪ confidence, c: conditional probability that a transaction containing X also contains Y
      Transaction-id | Items bought
      10             | A, B, C
      20             | A, C
      30             | A, D
      40             | B, E, F
    • Let min_support = 50%, min_conf = 50%:
      ▪ A ⇒ C (support 50%, confidence 66.7%)
      ▪ C ⇒ A (support 50%, confidence 100%)
    • [Figure: Venn diagram of “customer buys beer”, “customer buys diapers”, and “customer buys both”]
18. • Min. support 50%, min. confidence 50%
      Transaction-id | Items bought
      10             | A, B, C
      20             | A, C
      30             | A, D
      40             | B, E, F
    • Frequent patterns:
      Frequent pattern | Support
      {A}              | 75%
      {B}              | 50%
      {C}              | 50%
      {A, C}           | 50%
    • For rule A ⇒ C:
      support = support({A} ∪ {C}) = 50%
      confidence = support({A} ∪ {C}) / support({A}) = 66.6%
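To make the support and confidence arithmetic on slides 17-18 concrete, here is a minimal Python sketch (not part of the original deck; the function names are illustrative) that recomputes the numbers for the toy transaction database:

```python
# Toy transaction database from slides 17-18 (transaction-id -> items bought)
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= items for items in transactions.values()) / len(transactions)

def confidence(lhs, rhs):
    """Conditional probability that a transaction containing `lhs` also contains `rhs`."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"A", "C"}))        # 0.5   -> support of A => C is 50%
print(confidence({"A"}, {"C"}))   # 0.666 -> confidence of A => C is about 66.7%
print(confidence({"C"}, {"A"}))   # 1.0   -> confidence of C => A is 100%
```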
19. • Any subset of a frequent itemset must be frequent
      ▪ If {beer, diaper, nuts} is frequent, so is {beer, diaper}
      ▪ Every transaction having {beer, diaper, nuts} also contains {beer, diaper}
    • Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested!
    • Method:
      ▪ Generate length-(k+1) candidate itemsets from length-k frequent itemsets, and
      ▪ Test the candidates against the DB
    • Performance studies show its efficiency and scalability
20. • Database TDB:
      Tid | Items
      10  | A, C, D
      20  | B, C, E
      30  | A, B, C, E
      40  | B, E
    • 1st scan: C1 = {A}:2, {B}:3, {C}:3, {D}:1, {E}:3; L1 (min support count 2) = {A}:2, {B}:3, {C}:3, {E}:3
    • C2 (self-join of L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
    • 2nd scan: C2 counts = {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2; L2 = {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
    • C3 = {B,C,E}; 3rd scan: L3 = {B,C,E}:2
21. • Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):
      L1 = {frequent items};
      for (k = 1; Lk != ∅; k++) do begin
          Ck+1 = candidates generated from Lk;
          for each transaction t in database do
              increment the count of all candidates in Ck+1 that are contained in t
          Lk+1 = candidates in Ck+1 with min_support
      end
      return ∪k Lk;
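A compact, runnable Python rendering of the loop above, offered as a sketch rather than the definitive implementation; the absolute min_support count of 2 and the reuse of the TDB from slide 20 are assumptions:

```python
from itertools import combinations

def apriori(transactions, min_support=2):
    """Frequent-itemset mining following the slide's pseudo-code.
    `transactions` is a list of item sets; `min_support` is an absolute count."""
    transactions = [frozenset(t) for t in transactions]

    def scan(candidates):
        # One pass over the database: count candidates, keep the frequent ones.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    # L1 = {frequent items}
    Lk = scan({frozenset([i]) for t in transactions for i in t})
    frequent, k = dict(Lk), 1
    while Lk:
        # Self-join Lk to build C(k+1) ...
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # ... then prune candidates that have an infrequent k-subset (Apriori property)
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        Lk = scan(candidates)
        frequent.update(Lk)
        k += 1
    return frequent

# Toy TDB from slide 20, min_support = 2 transactions (assumed)
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, count in sorted(apriori(tdb).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Running it reproduces the trace on slide 20: L1, L2, and the single frequent 3-itemset {B, C, E} with count 2.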
22. • How to generate candidates?
      ▪ Step 1: self-joining Lk
      ▪ Step 2: pruning
    • Example of candidate generation:
      ▪ L3 = {abc, abd, acd, ace, bcd}
      ▪ Self-joining L3*L3: abcd from abc and abd; acde from acd and ace
      ▪ Pruning: acde is removed because ade is not in L3
      ▪ C4 = {abcd}
23. • Suppose the items in Lk-1 are listed in an order
    • Step 1: self-joining Lk-1
      insert into Ck
      select p.item1, p.item2, …, p.itemk-1, q.itemk-1
      from Lk-1 p, Lk-1 q
      where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
    • Step 2: pruning
      forall itemsets c in Ck do
          forall (k-1)-subsets s of c do
              if (s is not in Lk-1) then delete c from Ck
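The same two steps as a Python sketch, assuming each frequent itemset is stored as a tuple with its items in sorted order (the function name is illustrative); it reproduces the L3 to C4 example from slide 22:

```python
from itertools import combinations

def generate_candidates(L_prev, k):
    """Build C_k from L_(k-1): self-join on the first k-2 items, then Apriori pruning.
    Each itemset in `L_prev` is a tuple whose items are kept in sorted order."""
    L_prev = sorted(L_prev)
    # Step 1: self-join -- p and q agree on their first k-2 items and p's last item < q's last item
    joined = [p + (q[k - 2],)
              for p in L_prev for q in L_prev
              if p[:k - 2] == q[:k - 2] and p[k - 2] < q[k - 2]]
    # Step 2: pruning -- drop any candidate with a (k-1)-subset that is not frequent
    frequent = set(L_prev)
    return [c for c in joined if all(s in frequent for s in combinations(c, k - 1))]

# Example from slide 22
L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")]
print(generate_candidates(L3, 4))   # [('a', 'b', 'c', 'd')]; acde is pruned because ade is not in L3
```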
24. • Finding models (functions) that describe and distinguish classes or concepts for future prediction
      ▪ E.g., classify countries based on climate, or classify cars based on gas mileage
    • Presentation: decision tree, classification rules, neural network
    • Prediction: predict some unknown or missing numerical values
25. • [Figure: Training Data → Classification Algorithms → Classifier (Model)]
    • Training data:
      NAME | RANK           | YEARS | TENURED
      Mike | Assistant Prof | 3     | no
      Mary | Assistant Prof | 7     | yes
      Bill | Professor      | 2     | yes
      Jim  | Associate Prof | 7     | yes
      Dave | Assistant Prof | 6     | no
      Anne | Associate Prof | 3     | no
    • Classifier (Model): IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
26. • [Figure: Classifier → Testing Data → Unseen Data]
    • Testing data:
      NAME    | RANK           | YEARS | TENURED
      Tom     | Assistant Prof | 2     | no
      Merlisa | Associate Prof | 7     | no
      George  | Professor      | 5     | yes
      Joseph  | Assistant Prof | 7     | yes
    • Unseen data: (Jeff, Professor, 4) → Tenured?
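A minimal Python sketch of slides 25-26: apply the learned rule to the testing data and then to the unseen tuple. The function name `tenured_rule` and the tuple layout are illustrative choices, not from the deck:

```python
# Rule learned on slide 25: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def tenured_rule(rank: str, years: int) -> str:
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from slide 26: (name, rank, years, actual tenured label)
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Accuracy on the testing data (Merlisa is misclassified: the rule says 'yes', the label is 'no')
correct = sum(tenured_rule(rank, years) == label for _, rank, years, label in testing_data)
print(f"accuracy = {correct}/{len(testing_data)}")   # accuracy = 3/4

# Classify the unseen tuple (Jeff, Professor, 4)
print(tenured_rule("Professor", 4))   # yes
```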
27. • Training set:
      age   | income | student | credit_rating
      <=30  | high   | no      | fair
      <=30  | high   | no      | excellent
      31…40 | high   | no      | fair
      >40   | medium | no      | fair
      >40   | low    | yes     | fair
      >40   | low    | yes     | excellent
      31…40 | low    | yes     | excellent
      <=30  | medium | no      | fair
      <=30  | low    | yes     | fair
      >40   | medium | yes     | fair
      <=30  | medium | yes     | excellent
      31…40 | medium | no      | excellent
      31…40 | high   | yes     | fair
      >40   | medium | no      | excellent
28. • Decision tree learned from the training set:
      age?
      ├─ <=30  → student?
      │          ├─ no  → no
      │          └─ yes → yes
      ├─ 31…40 → yes
      └─ >40   → credit rating?
                 ├─ excellent → no
                 └─ fair      → yes
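The tree above written as a small Python function, offered as a sketch only; the function name `predict` and the exact label strings are assumptions:

```python
def predict(age: str, student: str, credit_rating: str) -> str:
    """Walk the decision tree from slide 28 and return the 'yes'/'no' leaf."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31…40":
        return "yes"
    # age > 40: decide on credit rating
    return "no" if credit_rating == "excellent" else "yes"

print(predict("<=30", "yes", "fair"))      # yes
print(predict(">40", "no", "excellent"))   # no
```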
29. • Cluster analysis
      ▪ Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
      ▪ Clustering principle: maximize the intra-class similarity and minimize the inter-class similarity
    • Outlier analysis
      ▪ Outlier: a data object that does not comply with the general behavior of the data
      ▪ It can be considered noise or an exception, but is quite useful in fraud detection and rare-events analysis
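To illustrate the clustering principle above (maximize intra-class similarity, minimize inter-class similarity), here is a toy k-means sketch in plain Python; the choice of k-means, the 2-D points, and k = 2 are all assumptions for illustration and are not part of the deck:

```python
import random

def kmeans(points, k=2, iterations=20, seed=0):
    """Toy k-means: repeatedly assign points to the nearest center and
    recompute centers, which tightens clusters and separates them."""
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x, y in points:
            nearest = min(range(k),
                          key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
            clusters[nearest].append((x, y))
        centers = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
                   if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters, centers

# Two made-up groups of 2-D points
points = [(1, 1), (1.5, 2), (2, 1), (8, 8), (9, 9), (8.5, 7.5)]
clusters, centers = kmeans(points)
print(clusters)
print(centers)
```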