Data Mining

Data Warehousing SoSe 2006 – Data Mining
Dr. Jens-Peter Dittrich, Institute of Information Systems, jens.dittrich@inf.ethz.ch, www.inf.ethz.ch/~jensdi
Based on tutorial slides by Gregory Piatetsky-Shapiro, KDnuggets.com
© 2006 KDnuggets

Outline
  • Introduction
  • Data Mining Tasks
  • Classification & Evaluation
  • Clustering
  • Application Examples

Trends Leading to the Data Flood
  • More data is generated:
      • Web, text, images, ...
      • Business transactions, calls, ...
      • Scientific data: astronomy, biology, etc.
  • More data is captured:
      • Storage technology is faster and cheaper
      • DBMSs can handle bigger databases

Largest Databases in 2005
Winter Corp. 2005 Commercial Database Survey (www.wintercorp.com/VLDB/2005_TopTen_Survey/TopTenWinners_2005.asp):
  1. Max Planck Inst. for Meteorology, 222 TB
  2. Yahoo, ~100 TB (largest data warehouse)
  3. AT&T, ~94 TB

Data Growth
In 2 years (2003 to 2005), the size of the largest database TRIPLED!
Data Growth Rate
  • Twice as much information was created in 2002 as in 1999 (~30% growth rate)
  • Other growth rate estimates are even higher
  • Very little data will ever be looked at by a human
  • Knowledge Discovery is NEEDED to make sense of and use the data

Knowledge Discovery Definition
Knowledge Discovery in Data is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
  — Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press, 1996

Statistics, Machine Learning and Related Fields
  • Statistics:
      • more theory-based
      • more focused on testing hypotheses
  • Machine learning:
      • more heuristic
      • focused on improving the performance of a learning agent
      • also looks at real-time learning and robotics – areas not part of data mining
  • Data Mining and Knowledge Discovery:
      • integrates theory and heuristics
      • focuses on the entire process of knowledge discovery, including data cleaning, learning, and the integration and visualization of results
  • Distinctions are fuzzy
[Figure: Data Mining and Knowledge Discovery shown at the overlap of Machine Learning, Statistics, Databases, and Visualization]

Knowledge Discovery Process
Process flow according to CRISP-DM (see www.crisp-dm.org for more information); continuous monitoring and improvement is an addition to CRISP.
[Figure: CRISP-DM process diagram with a Monitoring step added]

Historical Note: Many Names of Data Mining
  • Data Fishing, Data Dredging: 1960– (used by statisticians, as a pejorative)
  • Data Mining: 1990– (used in the DB community and business)
  • Knowledge Discovery in Databases: 1989– (used by the AI / Machine Learning community)
  • also Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, ...
  • Currently: Data Mining and Knowledge Discovery are used interchangeably
Some Definitions
  • Instance (also Item or Record): an example, described by a number of attributes, e.g. a day can be described by temperature, humidity and cloud status
  • Attribute (or Field): measures an aspect of the Instance, e.g. temperature
  • Class (Label): a grouping of instances, e.g. days good for playing

Data Mining Tasks

Major Data Mining Tasks
  • Classification: predicting an item class
  • Clustering: finding clusters in data
  • Associations: e.g. A & B & C occur frequently
  • Visualization: to facilitate human discovery
  • Summarization: describing a group
  • Deviation Detection: finding changes
  • Estimation: predicting a continuous value
  • Link Analysis: finding relationships
  • ...

Classification
Learn a method for predicting the instance class from pre-labeled (classified) instances.
Many approaches: Statistics, Decision Trees, Neural Networks, ...

Clustering
Find a "natural" grouping of instances given un-labeled data.

Association Rules & Frequent Itemsets
Transactions:
  TID  Produce
  1    MILK, BREAD, EGGS
  2    BREAD, SUGAR
  3    BREAD, CEREAL
  4    MILK, BREAD, SUGAR
  5    MILK, CEREAL
  6    BREAD, CEREAL
  7    MILK, CEREAL
  8    MILK, BREAD, CEREAL, EGGS
  9    MILK, BREAD, CEREAL
Frequent Itemsets:
  • Milk, Bread (4)
  • Bread, Cereal (3)
  • Milk, Bread, Cereal (2)
  • ...
Rules: Milk => Bread (66%)
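The itemset supports and the rule confidence above can be checked with a few lines of Python. This is only a minimal brute-force counting sketch over the nine transactions shown (not an Apriori implementation); the function and variable names are my own.

```python
# The nine transactions from the table above
transactions = [
    {"MILK", "BREAD", "EGGS"}, {"BREAD", "SUGAR"}, {"BREAD", "CEREAL"},
    {"MILK", "BREAD", "SUGAR"}, {"MILK", "CEREAL"}, {"BREAD", "CEREAL"},
    {"MILK", "CEREAL"}, {"MILK", "BREAD", "CEREAL", "EGGS"},
    {"MILK", "BREAD", "CEREAL"},
]

def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

print(support({"MILK", "BREAD"}))            # 4
print(support({"MILK", "BREAD", "CEREAL"}))  # 2

# Confidence of the rule Milk => Bread: support({Milk, Bread}) / support({Milk})
confidence = support({"MILK", "BREAD"}) / support({"MILK"})
print(f"Milk => Bread: {confidence:.0%}")    # 4/6, i.e. the ~66% on the slide
```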
Visualization & Data Mining
  • Visualizing the data to facilitate human discovery
  • Presenting the discovered results in a visually "nice" way

Summarization
  • Describe features of the selected group
  • Use natural language and graphics
  • Usually in combination with deviation detection or other methods
Example: "Average length of stay in this study area rose 45.7 percent, from 4.3 days to 6.2 days, because ..."

Data Mining Central Quest
Find true patterns and avoid overfitting (finding seemingly significant but really random patterns due to searching too many possibilities).

Classification Methods

Classification
Learn a method for predicting the instance class from pre-labeled (classified) instances.
Many approaches: Regression, Decision Trees, Bayesian, Neural Networks, ...
Given a set of points from known classes, what is the class of a new point?

Classification: Linear Regression
  • Linear regression: predict one class when w0 + w1 x + w2 y >= 0 and the other otherwise
  • Regression computes the weights wi from the data to minimize squared error, to "fit" the data
  • Not flexible enough
(A short code sketch of this decision rule follows below.)
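A minimal sketch of the linear decision rule w0 + w1 x + w2 y >= 0 above, fit by ordinary least squares. The toy two-class data set and the ±1 label encoding are my own illustration, not from the slides.

```python
import numpy as np

# Toy 2-D points with class labels encoded as +1 / -1
X = np.array([[1.0, 2.0], [2.0, 1.5], [1.5, 3.0],   # class -1
              [5.0, 6.0], [6.0, 5.5], [5.5, 7.0]])  # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

# Add a column of ones so w[0] plays the role of the intercept w0
A = np.hstack([np.ones((len(X), 1)), X])

# Least-squares fit: minimizes the squared error between A @ w and y
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def classify(point):
    """Apply the decision rule w0 + w1*x + w2*y >= 0."""
    return 1 if w[0] + w[1] * point[0] + w[2] * point[1] >= 0 else -1

print(classify([1.2, 2.2]))  # expected -1
print(classify([6.0, 6.0]))  # expected +1
```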
Regression for Classification
  • Any regression technique can be used for classification
  • Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class, and 0 for those that don't
  • Prediction: predict the class corresponding to the model with the largest output value (membership value)
  • For linear regression this is known as multi-response linear regression

Classification: Decision Trees
Example decision rule over two numeric attributes X and Y:
  if X > 5 then blue
  else if Y > 3 then blue
  else if X > 2 then green
  else blue
[Figure: 2-D plot partitioned by the thresholds X = 2, X = 5 and Y = 3]

Decision Tree
  • An internal node is a test on an attribute
  • A branch represents an outcome of the test, e.g., Color = red
  • A leaf node represents a class label or class label distribution
  • At each node, one attribute is chosen to split the training examples into distinct classes as much as possible
  • A new instance is classified by following a matching path to a leaf node

Weather Data: Play or not Play?
(Note: Outlook is the weather forecast; no relation to the Microsoft email program.)
  Outlook   Temperature  Humidity  Windy  Play?
  sunny     hot          high      false  No
  sunny     hot          high      true   No
  overcast  hot          high      false  Yes
  rain      mild         high      false  Yes
  rain      cool         normal    false  Yes
  rain      cool         normal    true   No
  overcast  cool         normal    true   Yes
  sunny     mild         high      false  No
  sunny     cool         normal    false  Yes
  rain      mild         normal    false  Yes
  sunny     mild         normal    true   Yes
  overcast  mild         high      true   Yes
  overcast  hot          normal    false  Yes
  rain      mild         high      true   No

Example Tree for "Play?"
  Outlook = sunny    -> test Humidity: high -> No, normal -> Yes
  Outlook = overcast -> Yes
  Outlook = rain     -> test Windy: true -> No, false -> Yes
(A code sketch of this tree follows below.)

Classification: Neural Nets
  • Can select more complex regions
  • Can be more accurate
  • Can also overfit the data – find patterns in random noise
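The example tree for "Play?" translates directly into nested conditionals; a minimal sketch, checked against a few rows of the weather table above (the function name and boolean encoding of Windy are my own).

```python
def play(outlook, humidity, windy):
    """Classify a day using the example decision tree for 'Play?'."""
    if outlook == "sunny":
        return "No" if humidity == "high" else "Yes"
    elif outlook == "overcast":
        return "Yes"
    else:  # rain
        return "No" if windy else "Yes"

# A few rows from the weather table
print(play("sunny", "high", False))     # No
print(play("overcast", "high", False))  # Yes
print(play("rain", "high", True))       # No
```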
Classification: Other Approaches
  • Naïve Bayes
  • Rules
  • Support Vector Machines
  • Genetic Algorithms
  • ...
See www.KDnuggets.com/software/

Evaluation

Evaluating Which Method Works Best for Classification
  • No model is uniformly the best
  • Dimensions for comparison:
      • speed of training
      • speed of model application
      • noise tolerance
      • explanation ability
  • Best results: hybrid, integrated models – a hybrid method will have higher accuracy

Comparison of Major Classification Approaches
  Approach         Train time  Run time  Noise tolerance  Can use prior knowledge  Accuracy on customer modelling  Understandable
  Decision Trees   fast        fast      poor             no                       medium                          medium
  Rules            med         fast      poor             no                       medium                          good
  Neural Networks  slow        fast      good             no                       good                            poor
  Bayesian         slow        fast      good             yes                      good                            good

Evaluation of Classification Models
  • How predictive is the model we learned?
  • Error on the training data is not a good indicator of performance on future data
      • the new data will probably not be exactly the same as the training data!
  • Overfitting – fitting the training data too precisely – usually leads to poor results on new data

Evaluation Issues
  • Possible evaluation measures:
      • Classification accuracy
      • Total cost/benefit – when different errors involve different costs (see the sketch below)
      • Lift and ROC (Receiver Operating Characteristic) curves
      • Error in numeric predictions
  • How reliable are the predicted results?
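Where different errors involve different costs, plain accuracy can be replaced by a total-cost measure computed from the confusion matrix. A minimal sketch; the confusion counts and the cost matrix below are invented for illustration only.

```python
import numpy as np

# Confusion matrix: rows = actual class, columns = predicted class
confusion = np.array([[90, 10],   # actual negative: 90 TN, 10 FP
                      [ 5, 95]])  # actual positive:  5 FN, 95 TP

# Cost matrix in the same layout: cost of predicting the column class
# when the row class is the truth. Here a false negative costs 5x a false positive.
cost = np.array([[0, 1],
                 [5, 0]])

accuracy   = np.trace(confusion) / confusion.sum()
total_cost = (confusion * cost).sum()

print(f"accuracy   = {accuracy:.2%}")  # 92.50%
print(f"total cost = {total_cost}")    # 10*1 + 5*5 = 35
```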
Classifier Error Rate
  • Natural performance measure for classification problems: error rate
      • Success: the instance's class is predicted correctly
      • Error: the instance's class is predicted incorrectly
      • Error rate: proportion of errors made over the whole set of instances
  • Training set error rate is way too optimistic!
      • you can find patterns even in random data

Evaluation on "LARGE" Data
If many (>1000) examples are available, including >100 examples from each class:
  • A simple evaluation will give useful results
  • Randomly split the data into training and test sets (usually 2/3 for train, 1/3 for test)
  • Build a classifier using the train set and evaluate it using the test set
(A code sketch of this split follows below.)

Classification Step 1: Split data into train and test sets
[Figure: data with known results divided into a training set and a testing set]

Classification Step 2: Build a model on the training set
[Figure: model builder applied to the training set]

Classification Step 3: Evaluate on the test set (re-train?)
[Figure: model predictions (Y/N) compared against the known results of the test set]

Unbalanced Data
  • Sometimes classes have very unequal frequency:
      • Attrition prediction: 97% stay, 3% attrite (in a month)
      • Medical diagnosis: 90% healthy, 10% disease
      • eCommerce: 99% don't buy, 1% buy
      • Security: >99.99% of Americans are not terrorists
  • Similar situation with multiple classes
  • A majority class classifier can be 97% correct, but useless
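A sketch of the random 2/3 / 1/3 train-test split and test-set error rate described above, using scikit-learn. The decision-tree learner and the iris data set are placeholders of my choosing; any classifier and data set could be substituted.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Randomly split: roughly 2/3 train, 1/3 test, as suggested in the slides
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)

train_error = 1 - model.score(X_train, y_train)  # usually far too optimistic
test_error  = 1 - model.score(X_test, y_test)    # the honest estimate

print(f"training error rate: {train_error:.3f}")
print(f"test error rate:     {test_error:.3f}")
```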
Handling Unbalanced Data – How?
If we have two classes that are very unbalanced, how can we evaluate our classifier method?

Balancing Unbalanced Data, 1
  • With two classes, a good approach is to build BALANCED train and test sets, and train the model on a balanced set:
      • randomly select the desired number of minority class instances
      • add an equal number of randomly selected majority class instances
  • How do we generalize "balancing" to multiple classes?
(A code sketch of this recipe follows below.)

Balancing Unbalanced Data, 2
  • Generalize "balancing" to multiple classes:
      • ensure that each class is represented with approximately equal proportions in train and test

A Note on Parameter Tuning
  • It is important that the test data is not used in any way to create the classifier
  • Some learning schemes operate in two stages:
      • Stage 1: builds the basic structure
      • Stage 2: optimizes parameter settings
  • The test data can't be used for parameter tuning!
  • The proper procedure uses three sets: training data, validation data, and test data
      • validation data is used to optimize parameters

Making the Most of the Data
  • Once evaluation is complete, all the data can be used to build the final classifier
  • Generally, the larger the training data, the better the classifier
  • The larger the test data, the more accurate the error estimate

Classification: Train, Validation, Test Split
[Figure: training set -> model builder; validation set -> evaluation and parameter tuning; final test set -> final evaluation of the final model]
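A sketch of the two-class balancing recipe above: keep the minority-class instances and draw an equal-sized random sample from the majority class. The NumPy index-based approach and the 0/1 label encoding are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def balanced_indices(y, minority_label=1):
    """Indices of a balanced sample: all minority instances
    plus an equal-sized random sample of the majority class."""
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    majority_sample = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, majority_sample])
    rng.shuffle(idx)
    return idx

# Example: 3% positives out of 1000 instances (as in the attrition example)
y = np.zeros(1000, dtype=int)
y[rng.choice(1000, size=30, replace=False)] = 1

idx = balanced_indices(y)
print(len(idx), y[idx].mean())   # 60 instances, 50% positives
```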
Cross-Validation
  • Cross-validation avoids overlapping test sets:
      • first step: the data is split into k subsets of equal size
      • second step: each subset in turn is used for testing and the remainder for training
  • This is called k-fold cross-validation
  • Often the subsets are stratified before the cross-validation is performed
  • The error estimates are averaged to yield an overall error estimate

Cross-Validation Example
  • Break up the data into groups of the same size
  • Hold aside one group for testing and use the rest to build the model
  • Repeat for each group
[Figure: in each round, one group is held out as the test set]

More on Cross-Validation
  • Standard method for evaluation: stratified ten-fold cross-validation
  • Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate
  • Stratification reduces the estimate's variance
  • Even better: repeated stratified cross-validation
      • e.g. ten-fold cross-validation is repeated ten times and the results are averaged (reduces the variance)
(A code sketch of stratified ten-fold cross-validation follows below.)

Direct Marketing Paradigm
  • Find the most likely prospects to contact
  • Not everybody needs to be contacted
  • The number of targets is usually much smaller than the number of prospects
  • Typical applications:
      • retailers, catalogues, direct mail (and e-mail)
      • customer acquisition, cross-sell, attrition prediction
      • ...

Direct Marketing Evaluation
  • Accuracy on the entire dataset is not the right measure
  • Approach:
      • develop a target model
      • score all prospects and rank them by decreasing score
      • select the top P% of prospects for action
  • How do we decide what is the best subset of prospects?

Model-Sorted List
Use a model to assign a score to each customer, sort the customers by decreasing score, and expect more targets (hits) near the top of the list.
  No   Score  Target  CustID  Age
  1    0.97   Y       1746    …
  2    0.95   N       1024    …
  3    0.94   Y       2478    …
  4    0.93   Y       3820    …
  5    0.92   N       4897    …
  …    …      …       …       …
  99   0.11   N       2734    …
  100  0.06   N       2422
3 hits in the top 5% of the list; if there are 15 targets overall, then the top 5 has 3/15 = 20% of the targets.
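A sketch of stratified ten-fold cross-validation as described above, again with scikit-learn; the learner and data set are placeholders of my choosing.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Stratified 10-fold: each fold keeps roughly the original class proportions
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)

# Average the per-fold estimates to get the overall estimate
print(f"mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```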
CPH (Cumulative Pct Hits)
  • Definition: CPH(P,M) = % of all targets in the first P% of the list scored by model M
  • CPH is frequently called Gains
  • 5% of a random list has 5% of the targets
[Figure: Cumulative % Hits vs. percent of list for a random list – a straight diagonal line]

CPH: Random List vs Model-Ranked List
5% of a random list has 5% of the targets, but 5% of the model-ranked list has 21% of the targets: CPH(5%, model) = 21%.
[Figure: Cumulative % Hits vs. percent of list, with the model curve above the random diagonal]

Lift
  • Lift(P,M) = CPH(P,M) / P
  • Lift (at 5%) = 21% / 5% = 4.2 times better than random
  • Note: some authors use "Lift" for what we call CPH
[Figure: Lift vs. percent of list, declining from about 4.2 toward 1]
(A code sketch of CPH and Lift follows below.)

Lift – a Measure of Model Quality
  • Lift helps us decide which models are better
  • If cost/benefit values are not available or changing, we can use Lift to select the better model
  • The model with the higher Lift curve will generally be better

Clustering
Unsupervised learning: finds a "natural" grouping of instances given un-labeled data.
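CPH and Lift can be computed directly from a model-scored list like the one above. A minimal sketch; the scores and target positions are made up so that the top 5% contains 3 of 15 targets, as in the model-sorted-list example.

```python
import numpy as np

def cph(scores, targets, pct):
    """Cumulative percent hits: share of all targets found in the
    top pct% of the list when sorted by decreasing score."""
    order = np.argsort(scores)[::-1]                    # best scores first
    top_n = max(1, int(round(len(scores) * pct / 100)))
    return targets[order][:top_n].sum() / targets.sum()

def lift(scores, targets, pct):
    return cph(scores, targets, pct) / (pct / 100)

# Made-up example: 100 prospects, 15 targets concentrated near the top scores
rng = np.random.default_rng(1)
scores = np.sort(rng.random(100))[::-1]
targets = np.zeros(100, dtype=int)
targets[[0, 2, 3, 7, 11, 15, 20, 28, 35, 41, 55, 60, 72, 80, 95]] = 1

print(f"CPH(5%)  = {cph(scores, targets, 5):.0%}")   # 3/15 = 20% here
print(f"Lift(5%) = {lift(scores, targets, 5):.1f}")  # 0.20 / 0.05 = 4.0
```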
Clustering Methods
  • Many different methods and algorithms:
      • for numeric and/or symbolic data
      • deterministic vs. probabilistic
      • exclusive vs. overlapping
      • hierarchical vs. flat
      • top-down vs. bottom-up

Clustering Evaluation
  • Manual inspection
  • Benchmarking on existing labels
  • Cluster quality measures:
      • distance measures
      • high similarity within a cluster, low across clusters

The Distance Function
  • Simplest case: one numeric attribute A
      • Distance(X,Y) = A(X) – A(Y)
  • Several numeric attributes:
      • Distance(X,Y) = Euclidean distance between X and Y
  • Nominal attributes: distance is set to 1 if the values are different, 0 if they are equal
  • Are all attributes equally important?
      • weighting the attributes might be necessary
(A code sketch of such a mixed distance function follows below.)

Simple Clustering: K-means
Works with numeric data only.
  1) Pick a number (K) of cluster centers (at random)
  2) Assign every item to its nearest cluster center (e.g. using Euclidean distance)
  3) Move each cluster center to the mean of its assigned items
  4) Repeat steps 2 and 3 until convergence (change in cluster assignments less than a threshold)

K-means example, step 1
[Figure: pick 3 initial cluster centers c1, c2, c3 at random]

K-means example, step 2
[Figure: assign each point to the closest cluster center]
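A sketch of the mixed-attribute distance described above: squared differences for numeric attributes, 0/1 mismatch for nominal ones, with optional per-attribute weights. The weighting scheme and example attribute order are my own illustration.

```python
import math

def distance(x, y, weights=None):
    """Distance between two instances given as lists of attribute values.
    Numeric attributes contribute squared differences (Euclidean style);
    nominal attributes contribute 1 if the values differ, 0 if equal."""
    weights = weights or [1.0] * len(x)
    total = 0.0
    for a, b, w in zip(x, y, weights):
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            total += w * (a - b) ** 2
        else:
            total += w * (0.0 if a == b else 1.0)
    return math.sqrt(total)

# Example instances: (temperature, humidity, outlook)
print(distance([75, 70, "sunny"], [68, 80, "rain"]))
print(distance([75, 70, "sunny"], [75, 70, "sunny"]))  # 0.0
```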
K-means example, step 3
[Figure: move each cluster center to the mean of its cluster]

K-means example, step 4a
[Figure: reassign points that are now closest to a different cluster center. Q: Which points are reassigned?]

K-means example, steps 4b–4c
[Figure: A: three points are reassigned]

K-means example, step 4d
[Figure: re-compute the cluster means]

K-means example, step 5
[Figure: move the cluster centers to the cluster means]
(A minimal K-means implementation follows below.)
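The K-means loop walked through above, written out as a minimal NumPy sketch: random initial centers, Euclidean assignment, mean update, and stopping when the assignments no longer change. The toy three-blob data set is my own.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-means: returns (centers, assignments)."""
    rng = np.random.default_rng(seed)
    # 1) pick K cluster centers at random from the data
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    assignments = np.full(len(points), -1)
    for _ in range(max_iter):
        # 2) assign every point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        # 4) stop when the cluster assignments no longer change
        if np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # 3) move each center to the mean of its assigned points
        for j in range(k):
            if np.any(assignments == j):
                centers[j] = points[assignments == j].mean(axis=0)
    return centers, assignments

# Toy data: three loose blobs in 2-D
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(c, 0.5, size=(30, 2))
                    for c in ([0, 0], [5, 5], [0, 5])])
centers, labels = kmeans(points, k=3)
print(centers)
```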
Data Mining Applications

Problems Suitable for Data Mining
  • require knowledge-based decisions
  • have a changing environment
  • have sub-optimal current methods
  • have accessible, sufficient, and relevant data
  • provide a high payoff for the right decisions!

Major Application Areas for Data Mining Solutions
  • Advertising
  • Bioinformatics
  • Customer Relationship Management (CRM)
  • Database Marketing
  • Fraud Detection
  • eCommerce
  • Health Care
  • Investment/Securities
  • Manufacturing, Process Control
  • Sports and Entertainment
  • Telecommunications
  • Web

Application: Search Engines
  • Before Google, web search engines used mainly the keywords on a page – results were easily subject to manipulation
  • Google's early success was partly due to its algorithm, which mainly uses links to the page
  • Google founders Sergey Brin and Larry Page were students at Stanford in the 1990s
  • Their research in databases and data mining led to Google

Microarrays: Classifying Leukemia
  • Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al., Science, v.286, 1999
  • 72 examples (38 train, 34 test), about 7,000 genes
  • ALL and AML are visually similar, but genetically very different
  • Best model: 97% accuracy, 1 error (sample suspected mislabelled)

Microarray Potential Applications
  • New and better molecular diagnostics
      • Jan 11, 2005: FDA approved the Roche Diagnostic AmpliChip, based on Affymetrix technology
  • New molecular targets for therapy
      • few new drugs, large pipeline, ...
  • Improved treatment outcome
      • partially depends on genetic signature
  • Fundamental biological discovery
      • finding and refining biological pathways
  • Personalized medicine?!
Application: Direct Marketing and CRM
  • Most major direct marketing companies are using modeling and data mining
  • Most financial companies are using customer modeling
  • Modeling is easier than changing customer behaviour
  • Example: Verizon Wireless reduced its customer attrition rate from 2% to 1.5%, saving many millions of dollars

Application: e-Commerce
  • Amazon.com recommendations
      • if you bought (or viewed) X, you are likely to buy Y
  • Netflix
      • if you liked "Monty Python and the Holy Grail", you get a recommendation for "This is Spinal Tap"
  • Comparison shopping
      • Froogle, mySimon, Yahoo Shopping, ...

Application: Security and Fraud Detection
  • Credit card fraud detection
      • over 20 million credit cards protected by neural networks (Fair, Isaac)
  • Securities fraud detection
      • NASDAQ KDD system
  • Phone fraud detection
      • AT&T, Bell Atlantic, British Telecom/MCI

Data Mining, Privacy, and Security
  • TIA: Terrorism (formerly Total) Information Awareness Program
      • the TIA program was closed by Congress in 2003 because of privacy concerns
  • However, in 2006 we learned that the NSA is analyzing US domestic call information to find potential terrorists
  • Invasion of privacy or needed intelligence?

Criticism of Analytic Approaches to Threat Detection
Data mining will
  • be ineffective – generate millions of false positives
  • and invade privacy
First, can data mining be effective?

Can Data Mining and Statistics be Effective for Threat Detection?
  • Criticism: databases have 5% errors, so analyzing 100 million suspects will generate 5 million false positives
  • Reality: analytical models correlate many items of information to reduce false positives
  • Example: identify one biased coin out of 1,000
      • after one throw of each coin, we cannot
      • after 30 throws, the biased coin will stand out with high probability
      • with a sufficient number of throws, we can identify 19 biased coins out of 100 million
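The biased-coin argument above can be checked with a short binomial calculation: after one throw nothing is distinguishable, but after 30 throws a strongly biased coin is flagged with high probability while fair coins rarely are. The bias level (90% heads) and the flagging threshold below are my own choices for illustration; the slides do not specify them.

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, threshold = 30, 24          # flag a coin that shows >= 24 heads in 30 throws
p_fair, p_biased = 0.5, 0.9    # assumed bias: 90% heads

false_alarm = binom_tail(n, threshold, p_fair)    # a fair coin gets flagged
detection   = binom_tail(n, threshold, p_biased)  # the biased coin gets flagged

print(f"P(fair coin flagged)   = {false_alarm:.5f}")   # roughly 0.0007
print(f"P(biased coin flagged) = {detection:.3f}")     # roughly 0.97
print(f"expected false alarms among 999 fair coins ~ {999 * false_alarm:.2f}")
```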
Another Approach: Link Analysis
Link analysis can find unusual patterns in the network structure.
[Figure: network graph with an unusual linkage pattern highlighted]

Analytic Technology Can Be Effective
  • Data mining is just one additional tool to help analysts
  • Combining multiple models and link analysis can reduce false positives
  • Today there are millions of false positives with manual analysis
  • Analytic technology has the potential to reduce the current high rate of false positives

Data Mining with Privacy
  • Data mining looks for patterns, not people!
  • Technical solutions can limit privacy invasion
      • replacing sensitive personal data with anonymized IDs
      • giving randomized outputs
      • multi-party computation – distributed data
      • ...
  • Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003

The Hype Curve for Data Mining and Knowledge Discovery
[Figure: hype curve from 1990 to 2005 – over-inflated expectations, disappointment, then growing acceptance and mainstreaming with rising expectations]

Summary
  • Data Mining and Knowledge Discovery are needed to deal with the flood of data
  • Knowledge Discovery is a process!
  • Avoid overfitting (finding random patterns by searching too many possibilities)

Additional Resources
  • www.KDnuggets.com – data mining software, jobs, courses, etc.
  • www.acm.org/sigkdd – ACM SIGKDD, the professional society for data mining