Data Mining  Chapter 26
Chapter 1.  Introduction <ul><li>Motivation: Why data mining? </li></ul><ul><li>What is data mining? </li></ul><ul><li>Dat...
Motivation:  “Necessity is the Mother of Invention” <ul><li>Data explosion problem   </li></ul><ul><ul><li>Automated data ...
Evolution of Database Technology <ul><li>1960s: </li></ul><ul><ul><li>Data collection, database creation, IMS and network ...
What Is Data Mining? <ul><li>Data mining (knowledge discovery in databases):  </li></ul><ul><ul><li>Extraction of interest...
Why Data Mining? — Potential Applications <ul><li>Database analysis and decision support </li></ul><ul><ul><li>Market anal...
Market Analysis and Management (1) <ul><li>Where are the data sources for analysis? </li></ul><ul><ul><li>Credit card tran...
Market Analysis and Management (2) <ul><li>Customer profiling </li></ul><ul><ul><li>data mining can tell you what types of...
Corporate Analysis and Risk Management <ul><li>Finance planning and asset evaluation </li></ul><ul><ul><li>cash flow analy...
Fraud Detection and Management (1) <ul><li>Applications </li></ul><ul><ul><li>widely used in health care, retail, credit c...
Fraud Detection and Management (2) <ul><li>Detecting inappropriate medical treatment </li></ul><ul><ul><li>Australian Heal...
Other Applications <ul><li>Sports </li></ul><ul><ul><li>IBM Advanced Scout analyzed NBA game statistics (shots blocked, as...
Data Mining: A KDD Process <ul><ul><li>Data mining: the core of knowledge discovery process. </li></ul></ul>Data Cleaning ...
Steps of a KDD Process   <ul><li>Learning the application domain: </li></ul><ul><ul><li>relevant prior knowledge and goals...
Data Mining: On What Kind of Data? <ul><li>Relational databases </li></ul><ul><li>Data warehouses </li></ul><ul><li>Transa...
Data Mining Functionalities
Association Rule Mining <ul><li>Association rule mining: </li></ul><ul><ul><li>Finding frequent patterns, associations, co...
Association Rule Mining (cont.) <ul><li>Itemset X={x 1 , …, x k } </li></ul><ul><li>Find all the rules  X  Y   with min c...
Mining Association Rules—an Example <ul><li>For rule  A      C : </li></ul><ul><ul><li>support = support({A}  {C}) = 50%...
Apriori: A Candidate Generation-and-test Approach <ul><li>Any subset of a frequent itemset must be frequent </li></ul><ul>...
The Apriori Algorithm — An Example Database TDB 1 st  scan C 1 L 1 L 2 C 2 C 2 2 nd  scan C 3 L 3 3 rd  scan B, E 40 A, B,...
The Apriori Algorithm <ul><li>Pseudo-code : </li></ul><ul><ul><ul><li>C k : Candidate itemset of size k </li></ul></ul></u...
Important Details of Apriori <ul><li>How to generate candidates? </li></ul><ul><ul><li>Step 1: self-joining  L k </li></ul...
How to Generate Candidates? <ul><li>Suppose the items in  L k-1  are listed in an order </li></ul><ul><li>Step 1: self-joi...
Classification and Prediction <ul><ul><li>Finding models (functions) that describe and distinguish classes or concepts for...
Classification Process: Model Construction Classification Algorithms IF rank = ‘professor’ OR years > 6 THEN tenured = ‘ye...
Classification Process: Use the Model in Prediction (Jeff, Professor, 4) Tenured? Classifier Testing Data Unseen Data
Decision Trees Training set
Output: A Decision Tree for “ buys_computer” age? overcast student? credit rating? no yes fair excellent <=30 >40 no no ye...
Cluster and outlier analysis <ul><li>Cluster analysis </li></ul><ul><ul><li>Class label is unknown: Group data to form new...
Clusters and Outliers
Upcoming SlideShare
Loading in...5
×

data mining.ppt

82,175

Published on

10 Comments
17 Likes
Statistics
Notes
No Downloads
Views
Total Views
82,175
On Slideshare
0
From Embeds
0
Number of Embeds
5
Actions
Shares
0
Downloads
3,192
Comments
10
Likes
17
Embeds 0
No embeds

No notes for slide

data mining.ppt

  1. 1. Data Mining Chapter 26
  2. 2. Chapter 1. Introduction <ul><li>Motivation: Why data mining? </li></ul><ul><li>What is data mining? </li></ul><ul><li>Data Mining: On what kind of data? </li></ul><ul><li>Data mining functionality </li></ul><ul><li>Are all the patterns interesting? </li></ul><ul><li>Major issues in data mining </li></ul>
  3. 3. Motivation: “Necessity is the Mother of Invention” <ul><li>Data explosion problem </li></ul><ul><ul><li>Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories </li></ul></ul><ul><li>We are drowning in data, but starving for knowledge! </li></ul><ul><li>Solution: Data warehousing and data mining </li></ul><ul><ul><li>Data warehousing and on-line analytical processing </li></ul></ul><ul><ul><li>Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases </li></ul></ul>
  4. 4. Evolution of Database Technology <ul><li>1960s: </li></ul><ul><ul><li>Data collection, database creation, IMS and network DBMS </li></ul></ul><ul><li>1970s: </li></ul><ul><ul><li>Relational data model, relational DBMS implementation </li></ul></ul><ul><li>1980s: </li></ul><ul><ul><li>RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.) </li></ul></ul><ul><li>1990s—2000s: </li></ul><ul><ul><li>Data mining and data warehousing, multimedia databases, and Web databases </li></ul></ul>
  5. 5. What Is Data Mining? <ul><li>Data mining (knowledge discovery in databases): </li></ul><ul><ul><li>Extraction of interesting ( non-trivial, implicit , previously unknown and potentially useful) information or patterns from data in large databases </li></ul></ul><ul><li>Alternative names: </li></ul><ul><ul><li>Data mining: a misnomer? </li></ul></ul><ul><ul><li>Knowledge discovery(mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. </li></ul></ul><ul><li>What is not data mining? </li></ul><ul><ul><li>(Deductive) query processing. </li></ul></ul><ul><ul><li>Expert systems or small ML/statistical programs </li></ul></ul>
  6. 6. Why Data Mining? — Potential Applications <ul><li>Database analysis and decision support </li></ul><ul><ul><li>Market analysis and management </li></ul></ul><ul><ul><ul><li>target marketing, customer relation management, market basket analysis, cross selling, market segmentation </li></ul></ul></ul><ul><ul><li>Risk analysis and management </li></ul></ul><ul><ul><ul><li>Forecasting, customer retention, improved underwriting, quality control, competitive analysis </li></ul></ul></ul><ul><ul><li>Fraud detection and management </li></ul></ul><ul><li>Other Applications </li></ul><ul><ul><li>Text mining (news group, email, documents) </li></ul></ul><ul><ul><li>Stream data mining </li></ul></ul><ul><ul><li>Web mining. </li></ul></ul><ul><ul><li>DNA data analysis </li></ul></ul>
  7. 7. Market Analysis and Management (1) <ul><li>Where are the data sources for analysis? </li></ul><ul><ul><li>Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies </li></ul></ul><ul><li>Target marketing </li></ul><ul><ul><li>Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc. </li></ul></ul><ul><li>Determine customer purchasing patterns over time </li></ul><ul><ul><li>Conversion of single to a joint bank account: marriage, etc. </li></ul></ul><ul><li>Cross-market analysis </li></ul><ul><ul><li>Associations/co-relations between product sales </li></ul></ul><ul><ul><li>Prediction based on the association information </li></ul></ul>
  8. 8. Market Analysis and Management (2) <ul><li>Customer profiling </li></ul><ul><ul><li>data mining can tell you what types of customers buy what products (clustering or classification) </li></ul></ul><ul><li>Identifying customer requirements </li></ul><ul><ul><li>identifying the best products for different customers </li></ul></ul><ul><ul><li>use prediction to find what factors will attract new customers </li></ul></ul><ul><li>Provides summary information </li></ul><ul><ul><li>various multidimensional summary reports </li></ul></ul><ul><ul><li>statistical summary information (data central tendency and variation) </li></ul></ul>
  9. 9. Corporate Analysis and Risk Management <ul><li>Finance planning and asset evaluation </li></ul><ul><ul><li>cash flow analysis and prediction </li></ul></ul><ul><ul><li>contingent claim analysis to evaluate assets </li></ul></ul><ul><ul><li>cross-sectional and time series analysis (financial-ratio, trend analysis, etc.) </li></ul></ul><ul><li>Resource planning: </li></ul><ul><ul><li>summarize and compare the resources and spending </li></ul></ul><ul><li>Competition: </li></ul><ul><ul><li>monitor competitors and market directions </li></ul></ul><ul><ul><li>group customers into classes and a class-based pricing procedure </li></ul></ul><ul><ul><li>set pricing strategy in a highly competitive market </li></ul></ul>
  10. 10. Fraud Detection and Management (1) <ul><li>Applications </li></ul><ul><ul><li>widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc. </li></ul></ul><ul><li>Approach </li></ul><ul><ul><li>use historical data to build models of fraudulent behavior and use data mining to help identify similar instances </li></ul></ul><ul><li>Examples </li></ul><ul><ul><li>auto insurance : detect a group of people who stage accidents to collect on insurance </li></ul></ul><ul><ul><li>money laundering : detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network) </li></ul></ul><ul><ul><li>medical insurance : detect professional patients and ring of doctors and ring of references </li></ul></ul>
  11. 11. Fraud Detection and Management (2) <ul><li>Detecting inappropriate medical treatment </li></ul><ul><ul><li>Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr). </li></ul></ul><ul><li>Detecting telephone fraud </li></ul><ul><ul><li>Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm. </li></ul></ul><ul><ul><li>British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud. </li></ul></ul><ul><li>Retail </li></ul><ul><ul><li>Analysts estimate that 38% of retail shrink is due to dishonest employees. </li></ul></ul>
  12. 12. Other Applications <ul><li>Sports </li></ul><ul><ul><li>IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat </li></ul></ul><ul><li>Astronomy </li></ul><ul><ul><li>JPL and the Palomar Observatory discovered 22 quasars with the help of data mining </li></ul></ul><ul><li>Internet Web Surf-Aid </li></ul><ul><ul><li>IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc. </li></ul></ul>
  13. 13. Data Mining: A KDD Process <ul><ul><li>Data mining: the core of knowledge discovery process. </li></ul></ul>Data Cleaning Data Integration Databases Data Warehouse Knowledge Task-relevant Data Selection Data Mining Pattern Evaluation
  14. 14. Steps of a KDD Process <ul><li>Learning the application domain: </li></ul><ul><ul><li>relevant prior knowledge and goals of application </li></ul></ul><ul><li>Creating a target data set: data selection </li></ul><ul><li>Data cleaning and preprocessing: (may take 60% of effort!) </li></ul><ul><li>Data reduction and transformation: </li></ul><ul><ul><li>Find useful features, dimensionality/variable reduction, invariant representation. </li></ul></ul><ul><li>Choosing functions of data mining </li></ul><ul><ul><li>summarization, classification, regression, association, clustering. </li></ul></ul><ul><li>Choosing the mining algorithm(s) </li></ul><ul><li>Data mining : search for patterns of interest </li></ul><ul><li>Pattern evaluation and knowledge presentation </li></ul><ul><ul><li>visualization, transformation, removing redundant patterns, etc. </li></ul></ul><ul><li>Use of discovered knowledge </li></ul>
  15. 15. Data Mining: On What Kind of Data? <ul><li>Relational databases </li></ul><ul><li>Data warehouses </li></ul><ul><li>Transactional databases </li></ul><ul><li>Advanced DB and information repositories </li></ul><ul><ul><li>Object-oriented and object-relational databases </li></ul></ul><ul><ul><li>Spatial and temporal data </li></ul></ul><ul><ul><li>Time-series data and stream data </li></ul></ul><ul><ul><li>Text databases and multimedia databases </li></ul></ul><ul><ul><li>Heterogeneous and legacy databases </li></ul></ul><ul><ul><li>WWW </li></ul></ul>
  16. 16. Data Mining Functionalities
  17. 17. Association Rule Mining <ul><li>Association rule mining: </li></ul><ul><ul><li>Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories. </li></ul></ul><ul><ul><li>Frequent pattern : pattern (set of items, sequence, etc.) that occurs frequently in a database </li></ul></ul><ul><li>Motivation: finding regularities in data </li></ul><ul><ul><li>What products were often purchased together? — Beer and diapers?! </li></ul></ul><ul><ul><li>What are the subsequent purchases after buying a PC? </li></ul></ul><ul><ul><li>What kinds of DNA are sensitive to this new drug? </li></ul></ul><ul><ul><li>Can we automatically classify web documents? </li></ul></ul>
  18. 18. Association Rule Mining (cont.) <ul><li>Itemset X={x 1 , …, x k } </li></ul><ul><li>Find all the rules X  Y with min confidence and support </li></ul><ul><ul><li>support , s , probability that a transaction contains X  Y </li></ul></ul><ul><ul><li>confidence , c, conditional probability that a transaction having X also contains Y . </li></ul></ul><ul><li>Let min_support = 50%, min_conf = 50%: </li></ul><ul><ul><li>A  C (50%, 66.7%) </li></ul></ul><ul><ul><li>C  A (50%, 100%) </li></ul></ul>Customer buys diapers Customer buys both Customer buys beer B, E, F 40 A, D 30 A, C 20 A, B, C 10 Items bought Transaction-id
  19. 19. Mining Association Rules—an Example <ul><li>For rule A  C : </li></ul><ul><ul><li>support = support({A}  {C}) = 50% </li></ul></ul><ul><ul><li>confidence = support({A}  {C})/support({A}) = 66.6% </li></ul></ul>Min. support 50% Min. confidence 50% B, E, F 40 A, D 30 A, C 20 A, B, C 10 Items bought Transaction-id 50% {A, C} 50% {C} 50% {B} 75% {A} Support Frequent pattern
  20. 20. Apriori: A Candidate Generation-and-test Approach <ul><li>Any subset of a frequent itemset must be frequent </li></ul><ul><ul><li>if {beer, diaper, nuts} is frequent, so is {beer, diaper} </li></ul></ul><ul><ul><li>every transaction having {beer, diaper, nuts} also contains {beer, diaper} </li></ul></ul><ul><li>Apriori pruning principle : If there is any itemset which is infrequent, its superset should not be generated/tested! </li></ul><ul><li>Method: </li></ul><ul><ul><li>generate length (k+1) candidate itemsets from length k frequent itemsets, and </li></ul></ul><ul><ul><li>test the candidates against DB </li></ul></ul><ul><li>The performance studies show its efficiency and scalability </li></ul>
  21. 21. The Apriori Algorithm — An Example Database TDB 1 st scan C 1 L 1 L 2 C 2 C 2 2 nd scan C 3 L 3 3 rd scan B, E 40 A, B, C, E 30 B, C, E 20 A, C, D 10 Items Tid 1 {D} 3 {E} 3 {C} 3 {B} 2 {A} sup Itemset 3 {E} 3 {C} 3 {B} 2 {A} sup Itemset {C, E} {B, E} {B, C} {A, E} {A, C} {A, B} Itemset 1 {A, B} 2 {A, C} 1 {A, E} 2 {B, C} 3 {B, E} 2 {C, E} sup Itemset 2 {A, C} 2 {B, C} 3 {B, E} 2 {C, E} sup Itemset {B, C, E} Itemset 2 {B, C, E} sup Itemset
  22. 22. The Apriori Algorithm <ul><li>Pseudo-code : </li></ul><ul><ul><ul><li>C k : Candidate itemset of size k </li></ul></ul></ul><ul><ul><ul><li>L k : frequent itemset of size k </li></ul></ul></ul><ul><ul><ul><li>L 1 = {frequent items}; </li></ul></ul></ul><ul><ul><ul><li>for ( k = 1; L k !=  ; k ++) do begin </li></ul></ul></ul><ul><ul><ul><li>C k+1 = candidates generated from L k ; </li></ul></ul></ul><ul><ul><ul><li>for each transaction t in database do </li></ul></ul></ul><ul><ul><ul><ul><li>increment the count of all candidates in C k+1 that are contained in t </li></ul></ul></ul></ul><ul><ul><ul><li>L k+1 = candidates in C k+1 with min_support </li></ul></ul></ul><ul><ul><ul><li>end </li></ul></ul></ul><ul><ul><ul><li>return  k L k ; </li></ul></ul></ul>
  23. 23. Important Details of Apriori <ul><li>How to generate candidates? </li></ul><ul><ul><li>Step 1: self-joining L k </li></ul></ul><ul><ul><li>Step 2: pruning </li></ul></ul><ul><li>Example of Candidate-generation </li></ul><ul><ul><li>L 3 = { abc, abd, acd, ace, bcd } </li></ul></ul><ul><ul><li>Self-joining: L 3 *L 3 </li></ul></ul><ul><ul><ul><li>abcd from abc and abd </li></ul></ul></ul><ul><ul><ul><li>acde from acd and ace </li></ul></ul></ul><ul><ul><li>Pruning: </li></ul></ul><ul><ul><ul><li>acde is removed because ade is not in L 3 </li></ul></ul></ul><ul><ul><li>C 4 ={ abcd } </li></ul></ul>
  24. 24. How to Generate Candidates? <ul><li>Suppose the items in L k-1 are listed in an order </li></ul><ul><li>Step 1: self-joining L k-1 </li></ul><ul><ul><li>insert into C k </li></ul></ul><ul><ul><li>select p.item 1 , p.item 2 , …, p.item k-1 , q.item k-1 </li></ul></ul><ul><ul><li>from L k-1 p, L k-1 q </li></ul></ul><ul><ul><li>where p.item 1 =q.item 1 , …, p.item k-2 =q.item k-2 , p.item k-1 < q.item k-1 </li></ul></ul><ul><li>Step 2: pruning </li></ul><ul><ul><li>forall itemsets c in C k do </li></ul></ul><ul><ul><ul><li>forall (k-1)-subsets s of c do </li></ul></ul></ul><ul><ul><ul><ul><li>if (s is not in L k-1 ) then delete c from C k </li></ul></ul></ul></ul>
  25. 25. Classification and Prediction <ul><ul><li>Finding models (functions) that describe and distinguish classes or concepts for future prediction </li></ul></ul><ul><ul><li>E.g., classify countries based on climate, or classify cars based on gas mileage </li></ul></ul><ul><ul><li>Presentation: decision-tree, classification rule, neural network </li></ul></ul><ul><ul><li>Prediction: Predict some unknown or missing numerical values </li></ul></ul>
  26. 26. Classification Process: Model Construction Classification Algorithms IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’ Training Data Classifier (Model)
  27. 27. Classification Process: Use the Model in Prediction (Jeff, Professor, 4) Tenured? Classifier Testing Data Unseen Data
  28. 28. Decision Trees Training set
  29. 29. Output: A Decision Tree for “ buys_computer” age? overcast student? credit rating? no yes fair excellent <=30 >40 no no yes yes yes 30..40
  30. 30. Cluster and outlier analysis <ul><li>Cluster analysis </li></ul><ul><ul><li>Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns </li></ul></ul><ul><ul><li>Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity </li></ul></ul><ul><li>Outlier analysis </li></ul><ul><ul><li>Outlier: a data object that does not comply with the general behavior of the data </li></ul></ul><ul><ul><li>It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis </li></ul></ul>
  31. 31. Clusters and Outliers
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×