Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.

Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.

Like this presentation? Why not share!

- Model Validation, performance measu... by eurobasin 3149 views
- 6 scikit-learn - Data Tuesday 26 ... by Data Tuesday 997 views
- Cross-validation aggregation for fo... by Devon Barrow 1409 views
- Intro to Machine Learning by Micros... by microsoftventures 1741 views
- Validation by Sergey Makarevich 193 views
- Lecture7 cross validation by Stéphane Canu 439 views

No Downloads

Total views

1,040

On SlideShare

0

From Embeds

0

Number of Embeds

2

Shares

0

Downloads

54

Comments

0

Likes

1

No embeds

No notes for slide

- 1. DBM630: Data Mining and Data Warehousing MS.IT. Rangsit University Semester 2/2011 Lecture 8 Classification and Prediction Evaluation by Kritsada Sriphaew (sriphaew.k AT gmail.com)1
- 2. Topics Train, Test and Validation sets Evaluation on Large data Unbalanced data Evaluation on Small data Cross validation Bootstrap Comparing data mining schemes Significance test Lift Chart / ROC curve Numeric Prediction Evaluation 2 Data Warehousing and Data Mining by Kritsada Sriphaew
- 3. Evaluation in Classification Tasks How predictive is the model we learned? Error on the training data is not a good indicator of performance on future data Q: Why? A: Because new data will probably not be exactly the same as the training data! Overfitting – fitting the training data too precisely - usually leads to poor results on new data 3 Classification and Prediction: Evaluation
- 4. Model Selection and Bias-Variance Tradeoff Typical behavior of the test and training error, as model complexity is varied. High Bias Low Bias Low Variance High VariancePrediction Error Test Sample Training Sample Low Model Complexity High 4 Classification and Prediction: Evaluation
- 5. Classifier error rate Natural performance measure for classification problems: error rate Success: instance’s class is predicted correctly Error: instance’s class is predicted incorrectly Error rate: proportion of errors made over the whole set of instances Resubstitution error: error rate on training data (too optimistic way!) Accuracy Error rate (1-Accuracy) Generalization error: error rate 15 % data 13 % 85 % 87 % on test 2% improvement 2% error reduction 2.35% improvement rate 13.3% error reduction rate5 Classification and Prediction: Evaluation
- 6. Evaluation on LARGE data If many (thousands) of examples are available, including several hundred examples from each class, then how can we evaluate our classifier model? A simple evaluation is sufficient For example, randomly split data into training and test sets (usually 2/3 for train, 1/3 for test) Build a classifier using the train set and evaluate it using the test set.6 Classification and Prediction: Evaluation
- 7. Classification Step 1:Split data into train and test sets THE PAST Results Known + + Training set - - + Data Testing set 7 Classification and Prediction: Evaluation
- 8. Classification Step 2:Build a model on a training set THE PAST Results Known + + Training set - - + Data Model Builder Testing set8 Classification and Prediction: Evaluation
- 9. Classification Step 3:Evaluate on test set (and may be re-train) THE PAST Results Known + + Training set - - + Data Model Builder feedback Predictions + Y N - + Testing set - 9 Classification and Prediction: Evaluation
- 10. A note on parameter tuning It is important that the test data is not used in any way to create the classifier Some learning schemes operate in two stages: Stage 1: builds the basic structure Stage 2: optimizes parameter settings The test data can’t be used for parameter tuning! Proper procedure uses three sets: training data, validation data, and test data Validation data is used to optimize parameters 10 Classification and Prediction: Evaluation
- 11. Classification: Train, Validation, Test split Results Known + Training set Model + - - Builder + Data Evaluate Model Builder Predictions + - Y N + Validation set - + - Final Evaluation +Final Test Set Final Model - 11 Classification and Prediction: Evaluation
- 12. Unbalanced data Sometimes, classes have very unequal frequency Accommodation prediction: 97% stay, 3% don’t stay medical diagnosis: 90% healthy, 10% disease eCommerce: 99% don’t buy, 1% buy Security: >99.99% of Americans are not terrorists Similar situation with multiple classes Majority class classifier can be 97% correct, but useless Solution: With two classes, a good approach is to build BALANCED train and test sets, and train model on a balanced set randomly select desired number of minority class instances add equal number of randomly selected majority class That is, we ignore the effect of the number of instances for each class. 12 Classification and Prediction: Evaluation
- 13. Evaluation on SMALL data The holdout method reserves a certain amount for testing and uses the remainder for training Usually: one third for testing, the rest for training For “unbalanced” datasets, samples might not be representative Few or none instances of some classes Stratified sample: advanced version of balancing the data Make sure that each class is represented with approximately equal proportions in both subsets What if we have a small data set? The chosen 2/3 for training may not be representative. The chosen 1/3 for testing may not be representative. 13 Classification and Prediction: Evaluation
- 14. Repeated Holdout Method Holdout estimate can be made more reliable by repeating the process with different subsamples In each iteration, a certain proportion is randomly selected for training (possibly with stratification) The error rates on the different iterations are averaged to yield an overall error rate This is called the repeated holdout method Still not optimum: the different test sets overlap. Can we prevent overlapping? 14 Classification and Prediction: Evaluation
- 15. Cross-validation Cross-validation avoids overlapping test sets First step: data is split into k subsets of equal size Second step: each subset in turn is used for testing and the remainder for training This is called k-fold cross-validation Often the subsets are stratified before the cross- validation is performed The error estimates are averaged to yield an overall error estimate 15 Classification and Prediction: Evaluation
- 16. Cross-validation Example Break up data into groups of the same size (possibly with stratification) Hold aside one group for testing and use the rest to build model Test Train Repeat by another test data until 16 Classification and Prediction: Evaluation
- 17. More on cross-validation Standard method for evaluation: stratified ten-fold cross-validation Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate Stratification reduces the estimate’s variance Even better: repeated stratified cross-validation E.g. ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance) 17 Classification and Prediction: Evaluation
- 18. Leave-One-Out cross-validation Leave-One-Out: Remove one instance for testing and the other for training a particular form of cross-validation: Set the number of folds equal to the number of training instances i.e., for n training instances, build classifier n times Makes best use of the data Involves no random subsampling Very computationally expensive 18 Classification and Prediction: Evaluation
- 19. Leave-One-Out-CV and stratification Disadvantage of Leave-One-Out-CV: stratification is not possible It guarantees a non-stratified sample because there is only one instance in the test set! Extreme example: random dataset split equally into two classes Best inducer predicts majority class 50% accuracy on fresh data Leave-One-Out-CV estimate is 100% error! 19 Classification and Prediction: Evaluation
- 20. *The bootstrap CV uses sampling without replacement The same instance, once selected, can not be selected again for a particular training/test set The bootstrap uses sampling with replacement to form the training set Sample a dataset of n instances n times with replacement to form a new dataset of n instances Use this data as the training set Use the instances from the original dataset that do not occur in the new training set for testing 20 Classification and Prediction: Evaluation
- 21. Evaluating the Accuracy of a Classifier orPredictor Bootstrap method The training tuples are sampled uniformly with replacement Each time a tuple is selected, it is equally likely to be selected again and readded to the training set There are several bootstrap method – the commonly used one is .632 bootstrap which works as follows Given a data set of d tuples The data set is sampled d times, with replacement, resulting bootstrap sample of training set of d samples It is very likely that some of the original data tuples will occur more than once in this sample The data tuples that did not make it into the training set end up forming the test set Suppose we try this out several times – on average 63.2% of original data tuple will end up in the bootstrap, and the remaining 36.8% will form the test set 21 Classification and Prediction: Evaluation
- 22. *The 0.632 bootstrap Also called the 0.632 bootstrap For n instances, a particular instance has a probability of 1–1/n of not being picked Thus its probability of ending up in the test data is: n 1 1 e 1 0.368 n This means the training data will contain approximately 63.2% of the instances Note: e is an irrational constant approximately equal to 2.718281828 22 Classification and Prediction: Evaluation
- 23. *Estimating error with the bootstrap The error estimate on the test data will be very pessimistic Trained on just ~63% of the instances Therefore, combine it with the re-substitution error: err 0.632 errortest_instances 0.368 errortraining_instances The re-substitution error gets less weight than the error on the test data Repeat process several times with different replacement samples; average the results 23 Classification and Prediction: Evaluation
- 24. *More on the bootstrap Probably the best way of estimating performance for very small datasets However, it has some problems Consider the random dataset from above A perfect memorizer will achieve 0% resubstitution error and ~50% error on test data Bootstrap estimate for this classifier: err 0.632 50% 0.368 0% 31.6% True expected error: 50% 24 Classification and Prediction: Evaluation
- 25. Evaluating Two-class Classification(Lift Chart vs. ROC Curve) Information Retrieval or Search Engine An application to find a set of related documents given a set of keywords. Hard Decision vs. Soft Decision Focus on soft decision Multiclass Class probability (Ranking) Class by class evaluation Example: promotional mailout Situation 1: classifier predicts that 0.1% of all households will respond Situation 2: classifier predicts that 0.4% of the 100000 most promising households will respond 27 Classification and Prediction: Evaluation
- 26. Confusion Matrix (Two-class) Also called contingency table Actual Class Yes No True False Yes Predicted Positive Positive Class False True No Negative Negative28 Classification and Prediction: Evaluation
- 27. Measures in Information Retrieval precision: Percentage of retrieved documents that are relevant. TP recall: Percentage of relevant precision documents that are returned. TP FP F-measure: The combination TP recall measure of recall and precision TP FN Precision/recall curves have 2 recall precision F measure hyperbolic shape recall precision Summary measures: average precision at 20%, 50% and 80% recall (three-point average recall) 29 Classification and Prediction: Evaluation
- 28. Measures in Two-Class ClassificationFor Positive Class For Negative Class TP TN precision precision TP FP TN FN TP TN recall recall TP FN TN FP TP TN TP TN accuracy accuracy TP TN FP FN TP TN FP FN Usually, we focus only positive class FP (“True” cases or “Yes” cases), FP Rate TN FP therefore, only precision and recall TP of positive class are used for TP Rate recall performance comparison TP FN 30 Classification and Prediction: Evaluation
- 29. Confusion matrix: Example Actual Buys_computer Buys_computer Total Predict = yes = no Buys_computer=yes 6,954 46 7,000 Buys_computer=no 412 2,588 3,000 Total 7,366 2,634 10,000No. of tuple of class buys_computer=yesthat were labeled by a classifier as classbuys_computer=no: FN No. of tuple of class buys_computer=no that were labeled by a classifier as class buys_computer=yes: FP 31
- 30. Cumulative Gains Chart/Lift Chart/ROC curve They are visual aids for measuring model performance Cumulative Gains is a measure of the effectiveness of predictive model on TP Rate (%true responses) Lift is a measure of the effectiveness of a predictive model calculated as the ratio between the results obtained with and without the predictive model Cumulative gains and lift charts consist of a lift curve and a baseline. The greater the area between the lift curve and the baseline, the better the model. ROC is a measure of the effectiveness of a predictive model on %positive response against %negative response TP Rate against FP Rate. The greater the area under ROC curve, the better the model. 32 Classification and Prediction: Evaluation
- 31. Generating Charts Instances are sorted according to their predicted probability of being a true positive:33 Classification and Prediction: Evaluation
- 32. Cumulative Gains Chart The x-axis shows the percentage of samples The y-axis shows the percentage of positive responses (or true positive rate). This is a percentage of the total possible positive responses Baseline (overall response rate): If we sample X% of data then we will receive X% of the total positive responses. Lift Curve: Using the predictions of the response model, calculate the percentage of positive responses for the percent of customers contacted and map these points to create the lift curve. 34 Classification and Prediction: Evaluation
- 33. A Sample Cumulative Gains Chart 100%TP Rate 80%%positiveresponses 60% 40% Baseline: %positive_responses = %sample_size 20% 0 10% samples 100% samples 35 Classification and Prediction: Evaluation
- 34. Lift Chart The x-axis shows the percentage of samples The y-axis shows the ratio of true positives with model and without model To plot the chart: Calculate the points on the lift curve by determining the ratio between the result predicted by our model and the result using no model Example: For contacting 10% of customers, using no model we should get 10% of responders and using the given model we should get 30% of responders. The y-value of the lift curve at 10% is 30 / 10 = 3. 36 Classification and Prediction: Evaluation
- 35. A Sample Lift Chart 100% 80%Lift 60% 40% Baseline: Lift = %sample size 20% 0 10% samples 100% samples37 Classification and Prediction: Evaluation
- 36. ROC Curve ROC curves are similar to lift charts “ROC” stands for “receiver operating characteristic” Used in signal detection to show tradeoff between hit rate and false alarm rate over noisy channel Differences: y axis shows percentage of true positives in sample x axis shows percentage of false positives in sample (rather than sample size) 38 Classification and Prediction: Evaluation
- 37. A Sample ROC Curve 1000 respondsTrue PositiveRate 400 responds Baseline: TP Rate = FP Rate False Positive Rate 1000000-1000 mailouts 39 Classification and Prediction: Evaluation
- 38. Extending to Multiple-Class Classification PREDICTED CLASS C1 C2 Cn Sum Recall C1 n11 n12 n1n R1 n11/R1ACTUAL CLASS C2 n21 n22 n2n R2 n22 /R2 Cn nn1 nn2 nnn Rn nnn /Rn Sum P1 P2 Pn T R Preci n11 n22 nnn n11 +n22+..+nnn sion P1 P T P2 Pn 40 Classification and Prediction: Evaluation
- 39. Measures in Multiple-Class Classification nii precision (Ci ) Pi nii recall (Ci ) Ri n11 n22 nnn accuracy T41 Classification and Prediction: Evaluation
- 40. Numeric Prediction Evaluation Same strategies: independent test set, cross-validation, significance tests, etc. Difference: error measures or generalization error Actual target values: a1, a2,…, an Predicted target values: p1, p2,…, pn Most popular measure: mean-squared error p 1 a1 p2 a2 ... pn an 2 2 2 n Easy to manipulate mathematically 42 Classification and Prediction: Evaluation
- 41. Other Measures The root mean-squared (rms) error: p1 a1 p2 a2 ... pn an 2 2 2 n The mean absolute error is less sensitive to outliers than the mean-squared error: p1 a1 p2 a2 ... pn an n Sometimes relative error values are more appropriate (e. g. 10% for an error of 50 when predicting 500)43 Classification and Prediction: Evaluation
- 42. Improvement on the Mean Often we want to know how much the scheme improves on simply predicting the average The relative squared error is: p1 a1 p2 a2 ... pn an 2 2 2 a a a a ... a a 1 2 2 2 n 2 a is average The relative absolute error is: p1 a1 p2 a2 ... pn an a a1 a a2 ... a an 44 Classification and Prediction: Evaluation
- 43. The Correlation Coefficient Measures the statistical correlation between the predicted values and the actual values S PA (ai a) 2 S pSA SA i n 1 ( pi p)(ai a) ( pi p) 2 S PA i n 1 SP i n 1 Scale independent, between -1 and +1 Good performance leads to large values! 45 Classification and Prediction: Evaluation
- 44. Practice 1 A company wants to do a mail marketing campaign. It costs the company $1 for each item mailed. They have information on 100,000 customers. Create a cumulative gains and a lift chart from the following data. 46 Classification and Prediction: Evaluation
- 45. Results47 Classification and Prediction: Evaluation
- 46. Practice 2 Using the response model P(x)=Salary(x)/1000+Age(x) for customer x and the data table shown below, construct the cumulative gains, lift charts and ROC curves. Ties in ranking should be arbitrarily broken by assigning a higher rank to who appears first in the table. Customer Name Salary Age Actual P(x) Rank Response A 10000 39 N B 50000 21 Y C 65000 25 Y D 62000 30 Y E 67000 19 Y F 69000 48 N G 65000 12 Y H 64000 51 N I 71000 65 Y 48 J 73000 42 N Classification and Prediction: Evaluation
- 47. Customer Salary Age Actual P(x) Rank Ordered by P(x) Name Response Customer Actual P(x) Rank A 10000 39 N 49 10 Name Response B 50000 21 Y 71 9 I Y 136 1 C 65000 25 Y 90 6 F N 117 2 D 62000 30 Y 92 5 H N 115 3 E 67000 19 Y 86 7 J N 115 4 F 69000 48 N 117 2 D Y 92 5 G 65000 12 Y 77 8 C Y 90 6 H 64000 51 N 115 3 E Y 86 7 I 71000 65 Y 136 1 G Y 77 8 J 73000 42 N 115 4 B Y 71 9 A N 49 10 % Positive responsesCumulative Gains Chart Baseline Lift Chart Lift Baseline 100.00% 2.00 80.00% 1.50 60.00% 40.00% 1.00 20.00% 0.50 0.00% 0.00 70% 10% 20% 30% 40% 50% 60% 80% 90% 100% 100% 10% 20% 30% 40% 50% 60% 70% 80% 90% 49
- 48. Customer Actual #Sample TP FP TPR FPR Name Response I Y 1 1 0 16.67% 0.00% F N 2 1 1 16.67% 25.00% H N 3 1 2 16.67% 50.00% J N 4 1 3 16.67% 75.00% C Y 5 2 3 33.33% 75.00% E Y 6 3 3 50.00% 75.00% D Y 7 4 3 66.67% 75.00% G Y 8 5 3 83.33% 75.00% B Y 9 6 3 100.00% 75.00% A N 10 6 4 100.00% 100.00% ROC Curve 120.00% 100.00% 80.00% TPR 60.00% 40.00% 20.00% 0.00% 0.00% 50.00% 100.00% 150.00% 50 FPR
- 49. *Predicting performance Assume the estimated error rate is 25%. How close is this to the true error rate? Depends on the amount of test data Prediction is just like tossing a biased (!) coin “Head” is a “success”, “tail” is an “error” In statistics, a succession of independent events like this is called a Bernoulli process Statistical theory provides us with confidence intervals for the true underlying proportion! 51 Classification and Prediction: Evaluation
- 50. *Confidence intervals We can say: p lies within a certain specified interval with a certain specified confidence Example: S=750 successes in N=1000 trials Estimated success rate: 75% How close is this to true success rate p? Answer: with 80% confidence p[73.2,76.7] Another example: S=75 and N=100 Estimated success rate: 75% With 80% confidence p[69.1,80.1] 52 Classification and Prediction: Evaluation
- 51. *Mean and variance (also Mod 7) Mean and variance for a Bernoulli trial: p, p (1–p) Expected success rate f=S/N Mean and variance for f : p, p (1–p)/N For large enough N, f follows a Normal distribution c% confidence interval [–z X z] for random variable with 0 mean is given by: Pr[ z X z ] c With a symmetric distribution: Pr[ z X z] 1 2 Pr[ X z] 53 Classification and Prediction: Evaluation
- 52. *Confidence limits Pr[X z] z 0.1% 3.09 0.5% 2.58 1% 2.33 5% 1.65 10% 1.28 20% 0.84 –1 0 1 1.65 40% 0.25 Confidence limits for the normal distribution with 0 mean and a variance of 1: Thus: Pr[1.65 X 1.65] 90% To use this we have to reduce our random variable f to have 0 mean and unit variance 54 Classification and Prediction: Evaluation
- 53. *Transforming f f p Transformed value for f : p(1 p) / N (i.e. subtract the mean and divide by the standard deviation) f p Pr z z c p(1 p) / N Resulting equation: z2 f f2 z2 z2 Solving for p : p f z 1 2 2N N N 4N N 55 Classification and Prediction: Evaluation
- 54. *Examples f = 75%, N = 1000, c = 80% (so that z = 1.28): p [0.732,0.767] f = 75%, N = 100, c = 80% (so that z = 1.28): p [0.691,0.801] Note that normal distribution assumption is only valid for large N (i.e. N > 100) f = 75%, N = 10, c = 80% (so that z = 1.28): p [0.549,0.881] (should be taken with a grain of salt) 56 Classification and Prediction: Evaluation

No public clipboards found for this slide

Be the first to comment