
Evaluation and Credibility - Part 2


  1. 1. Tilani Gunawardena Machine Learning and Data Mining Evaluation and Credibility
  2. 2. Outline • Introduction • Train, Test and Validation sets • Evaluation on Large data / Unbalanced data • Evaluation on Small data – Cross validation – Bootstrap • Comparing data mining schemes – Significance test – Lift Chart / ROC curve • Numeric Prediction Evaluation
  3. 3. Model’s Evaluation in the KDD Process
  4. 4. How to Estimate the Metrics? • We can use: – Training data; – Independent test data; – Hold-out method; – k-fold cross-validation method; – Leave-one-out method; – Bootstrap method; – And many more…
  5. 5. Estimation with Training Data • The accuracy/error estimates on the training data are not good indicators of performance on future data. – Q: Why? – A: Because new data will probably not be exactly the same as the training data! • The accuracy/error estimates on the training data measure the classifier’s degree of overfitting. [Diagram: Training set → Classifier, evaluated on the same Training set]
  6. 6. Estimation with Independent Test Data • Estimation with independent test data is used when we have plenty of data and there is a natural way of forming training and test data. • For example: Quinlan in 1987 reported experiments in a medical domain for which the classifiers were trained on data from 1985 and tested on data from 1986. [Diagram: Training set → Classifier → Test set]
  7. 7. Hold-out Method • The hold-out method splits the data into training data and test data (usually 2/3 for training, 1/3 for testing). Then we build a classifier using the training data and test it using the test data. • The hold-out method is usually used when we have thousands of instances, including several hundred instances from each class. [Diagram: Data → Training set → Classifier → Test set]
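As a rough illustration of the hold-out method above, here is a minimal Python sketch (not from the slides) that shuffles the data and keeps 2/3 for training and 1/3 for testing; the example data and seed are made up.

```python
import random

def holdout_split(instances, train_fraction=2/3, seed=42):
    """Shuffle the data, then keep train_fraction for training and the rest for testing."""
    rng = random.Random(seed)
    shuffled = list(instances)          # copy so the original order is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Example with 9 labelled instances: 6 go to training, 3 to testing
data = [("instance_%d" % i, i % 2) for i in range(9)]
train, test = holdout_split(data)
print(len(train), "training instances,", len(test), "test instances")
```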
  8. 8. Classification: Train, Validation, Test Split [Diagram: data with known results is split into a Training set (fed to the Model Builder / Classifier Builder), a Validation set (used to evaluate and tune the classifier), and a Final Test Set (used for the final evaluation).] The test data can’t be used for parameter tuning!
  9. 9. k-Fold Cross-Validation • k-fold cross-validation avoids overlapping test sets: – First step: the data is split into k subsets of equal size; – Second step: each subset in turn is used for testing and the remainder for training. • The estimates are averaged to yield an overall estimate. [Diagram: each of the k folds is used once as the test set while the remaining folds form the training set]
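A minimal sketch of the k-fold procedure described above; `build_classifier` and `evaluate` are hypothetical placeholders for whatever learner and metric are being used.

```python
def k_fold_splits(instances, k=10):
    """Yield k (train, test) pairs; each fold is the test set exactly once."""
    folds = [instances[i::k] for i in range(k)]          # k roughly equal subsets
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# The k test-set estimates are averaged to yield the overall estimate, e.g.:
# scores = [evaluate(build_classifier(train), test)
#           for train, test in k_fold_splits(data, k=10)]
# overall_estimate = sum(scores) / len(scores)
```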
  10. 10. Example: collect data from the real world (photographs and labels)
  11. 11. Method 1: Training Process
  12. 12. Giving students the answers before giving them the exam
  13. 13. Method 2
  14. 14. Cross Validation Error
  15. 15. Method 3
  16. 16. If the world happens to be well represented by our dataset
  17. 17. CV • Model Selection • Evaluating our selection method
  18. 18. The Bootstrap • CV uses sampling without replacement – The same instance, once selected, can not be selected again for a particular training/test set • The bootstrap uses sampling with replacement to form the training set – Sample a dataset of n instances n times with replacement to form a new dataset of n instances – Use this data as the training set – Use the instances from the original dataset that don’t occur in the new training set for testing
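A minimal sketch of the sampling step described above, assuming the dataset fits in a Python list:

```python
import random

def bootstrap_sample(instances, seed=1):
    """Sample n instances n times with replacement; the unused instances form the test set."""
    rng = random.Random(seed)
    n = len(instances)
    picked = [rng.randrange(n) for _ in range(n)]
    picked_set = set(picked)
    train = [instances[i] for i in picked]                        # training set (may contain duplicates)
    test = [instances[i] for i in range(n) if i not in picked_set]  # instances never picked
    return train, test
```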
  19. 19. Example • Samples of the same size N (with replacement) • N=4, M=3 • N=150, M=5000 • This gives M=5000 means of random samples of X
  20. 20. The 0.632 bootstrap • Also called the 0.632 bootstrap – A particular instance has a probability of 1 − 1/n of not being picked – Thus its probability of ending up in the test data is: $\left(1 - \frac{1}{n}\right)^{n} \approx e^{-1} \approx 0.368$ – This means the training data will contain approximately 63.2% of the instances
  21. 21. Estimating error with the bootstrap • The error estimate on the test data will be very pessimistic – Trained on just ~63% of the instances • Therefore, combine it with the resubstitution error: $err = 0.632 \cdot e_{\text{test instances}} + 0.368 \cdot e_{\text{training instances}}$ • The resubstitution error gets less weight than the error on the test data • Repeat the process several times with different replacement samples; average the results
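The combination above translates directly into code; this small sketch also checks the worked example from the next slide (0.632 · 50% + 0.368 · 0% = 31.6%).

```python
def bootstrap_632_error(test_error, training_error):
    """0.632 bootstrap: weight the pessimistic test error by 0.632
    and the optimistic resubstitution error by 0.368."""
    return 0.632 * test_error + 0.368 * training_error

# Worked example from the next slide: a memorizing classifier on random data
print(bootstrap_632_error(test_error=0.50, training_error=0.0))   # 0.316
```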
  22. 22. More on the bootstrap • Probably the best way of estimating performance for very small datasets • However, it has some problems – Completely random dataset with two classes of equal size. The true error rate is 50% for any prediction rule. – Consider the random dataset from above: 0% resubstitution error and ~50% error on test data – Bootstrap estimate for this classifier: $err = 0.632 \cdot 50\% + 0.368 \cdot 0\% = 31.6\%$ – True expected error: 50%
  23. 23. • It is a straightforward way to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution
  24. 24. Evaluation Summary: • Use Train, Test, Validation sets for “LARGE” data • Balance “un-balanced” data • Use cross-validation for middle-sized/small data • Use the leave-one-out and bootstrap methods for small data • Don’t use test data for parameter tuning - use separate validation data
  25. 25. Agenda • Quantifying learner performance – Cross validation – Error vs. loss – Precision & recall • Model selection
  26. 26. Accuracy vs. Precision: Accuracy refers to the closeness of a measurement or estimate to the TRUE value. Precision (or variance) refers to the degree of agreement for a series of measurements.
  27. 27. Precision vs. Recall: Precision: percentage of retrieved documents that are relevant. Recall: percentage of relevant documents that are returned.
  28. 28. Scenario • We use a dataset with known classes to build a model • We use another dataset with known classes to evaluate the model (this dataset could be part of the original dataset) • We compare/count the predicted classes against the actual classes
  29. 29. Confusion Matrix • A confusion matrix shows the number of correct and incorrect predictions made by the classification model compared to the actual outcomes (target values) in the data • The matrix is NxN, where N is the number of target values (classes) • Performance of such models is commonly evaluated using the data in the matrix
  30. 30. Two Types of Error • False negative (“miss”), FN: the alarm doesn’t sound but the person is carrying metal • False positive (“false alarm”), FP: the alarm sounds but the person is not carrying metal
  31. 31. How to evaluate the Classifier’s Generalization Performance? Predicted class Actual class Pos Neg Pos TP FN Neg FP TN • Assume that we test a classifier on some test set and we derive at the end the following confusion matrix (Two-Class) • Also called contingency table P N
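A minimal sketch of how the TP/FN/FP/TN counts in this table could be obtained from lists of actual and predicted labels (the example labels below are made up):

```python
def two_class_confusion(actual, predicted, positive="pos"):
    """Count TP, FN, FP, TN for a two-class problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fn, fp, tn

actual    = ["pos", "pos", "neg", "neg", "pos", "neg"]
predicted = ["pos", "neg", "neg", "pos", "pos", "neg"]
print(two_class_confusion(actual, predicted))   # (2, 1, 1, 2)
```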
  32. 32. Measures in Two-Class Classification
  33. 33. Example: 1) How many images of Gerhard Schroeder are in the data set? 2) How many predictions of G Schroeder are there? 3) What is the probability that Hugo Chavez is classified correctly by our learning algorithm? 4) Your learning algorithm predicted/classified an image as Hugo Chavez. What is the probability he is actually Hugo Chavez? 5) Recall(“Hugo Chavez”) = 6) Precision(“Hugo Chavez”) = 7) Recall(“Colin Powell”) = 8) Precision(“Colin Powell”) = 9) Recall(“George W Bush”) = 10) Precision(“George W Bush”) =
  34. 34. 1) True Positive (“Tony Blair”) = 2) False Positive (“Tony Blair”) = 3) False Negative (“Tony Blair”) = 4) True Positive (“Donald Rumsfeld”) = 5) False Positive (“Donald Rumsfeld”) = 6) False Negative (“Donald Rumsfeld”) =
  35. 35. Metrics for Classifier’s Evaluation Predicted class Actual class Pos Neg Pos TP FN Neg FP TN • Accuracy = (TP+TN)/(P+N) • Error = (FP+FN)/(P+N) • Precision = TP/(TP+FP) • Recall/TP rate = TP/P • FP Rate = FP/N P N
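The formulas on this slide translate directly into code; the counts passed in below are arbitrary example values.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, error, precision, recall/TP rate and FP rate from confusion-matrix counts."""
    p, n = tp + fn, fp + tn              # actual positives and actual negatives
    return {
        "accuracy":  (tp + tn) / (p + n),
        "error":     (fp + fn) / (p + n),
        "precision": tp / (tp + fp),
        "recall":    tp / p,             # TP rate
        "fp_rate":   fp / n,
    }

print(classification_metrics(tp=60, fn=40, fp=20, tn=80))
```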
  36. 36. Example: 3 classifiers True Predicted pos neg pos 60 40 neg 20 80 True Predicted pos neg pos 70 30 neg 50 50 True Predicted pos neg pos 40 60 neg 30 70 Classifier 1 TPR = FPR = Classifier 2 TPR = FPR = Classifier 3 TPR = FPR =
  37. 37. Example: 3 classifiers True Predicted pos neg pos 60 40 neg 20 80 True Predicted pos neg pos 70 30 neg 50 50 True Predicted pos neg pos 40 60 neg 30 70 Classifier 1 TPR = 0.4 FPR = 0.3 Classifier 2 TPR = 0.7 FPR = 0.5 Classifier 3 TPR = 0.6 FPR = 0.2
  38. 38. Multiclass: Things to Notice • The total number of test examples of any class is the sum of the corresponding row (i.e., the TP + FN for that class) • The total number of FNs for a class is the sum of the values in the corresponding row (excluding the TP) • The total number of FPs for a class is the sum of the values in the corresponding column (excluding the TP) • The total number of TNs for a certain class is the sum of all columns and rows excluding that class’s column and row Predicted Actual A B C D E A TPA EAB EAC EAD EAE B EBA TPB EBC EBD EBE C ECA ECB TPC ECD ECE D EDA EDB EDC TPD EDE E EEA EEB EEC EED TPE
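A minimal sketch of the row/column bookkeeping described above, assuming the confusion matrix is stored as a list of rows (rows = actual class, columns = predicted class):

```python
def per_class_counts(matrix, labels, cls):
    """TP, FN, FP, TN for one class of a multi-class confusion matrix."""
    i = labels.index(cls)
    tp = matrix[i][i]
    fn = sum(matrix[i]) - tp                     # rest of the class's row
    fp = sum(row[i] for row in matrix) - tp      # rest of the class's column
    tn = sum(sum(row) for row in matrix) - tp - fn - fp
    return tp, fn, fp, tn
```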
  39. 39. Predicted Actual A B C D E A TPA EAB EAC EAD EAE B EBA TPB EBC EBD EBE C ECA ECB TPC ECD ECE D EDA EDB EDC TPD EDE E EEA EEB EEC EED TPE
  40. 40. Multi-class Predicted Act ual A B C A TPA EAB EAC B EBA TPB EBC C ECA ECB TPC Predicted class Actual class P N P TP FN N FP TN Predicted Actual A Not A A Not A Predicted Actual B Not B B Not B Predicted Actual C Not C C Not C
  41. 41. Multi-class Predicted Act ual A B C A TPA EAB EAC B EBA TPB EBC C ECA ECB TPC Predicted class Actual class P N P TP FN N FP TN Predicted Actual A Not A A TPA EAB + EAC Not A EBA + ECA TPB + EBC ECB + TPC Predicted Actual B Not B B TPB EBA + EBC Not B EAB+ ECB TPA + EAC ECA + TPC Predicted Actual C Not C C TPC ECA + ECB Not C EAC + EBC TPA + EAB EBA + TPB
  42. 42. Example: A B C A 25 5 2 B 3 32 4 C 1 0 15 Overall Accuracy: Precision A= Recall B= Predicted Actual
  43. 43. Example: A B C A 25 5 2 B 3 32 4 C 1 0 15 Overall Accuracy = (25+32+15)/(25+5+2+3+32+4+1+0+15) Precision A= 25/(25+3+1) Recall B= 32/(32+3+4)
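The same arithmetic as plain Python, with the matrix stored row by row (rows = actual class, columns = predicted class):

```python
m = [[25, 5, 2],    # actual A
     [3, 32, 4],    # actual B
     [1, 0, 15]]    # actual C

total = sum(sum(row) for row in m)                         # 87 test examples
overall_accuracy = (m[0][0] + m[1][1] + m[2][2]) / total   # 72/87 ≈ 0.83
precision_A = m[0][0] / (m[0][0] + m[1][0] + m[2][0])      # 25/29 ≈ 0.86
recall_B = m[1][1] / sum(m[1])                             # 32/39 ≈ 0.82
print(overall_accuracy, precision_A, recall_B)
```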
  44. 44. Counting the Costs • In practice, different types of classification errors often incur different costs • Examples: – Terrorist profiling • “Not a terrorist” correct 99.99% of the time – Loan decisions – Fault diagnosis – Promotional mailing
  45. 45. Cost Matrices Pos Neg Pos TP Cost FN Cost Neg FP Cost TN Cost Usually, TP Cost and TN Cost are set equal to 0 Hypothesized class True class
  46. 46. Lift Charts • In practice, decisions are usually made by comparing possible scenarios, taking into account different costs. • Example: • Promotional mail-out to 1,000,000 households. If we mail to all households, we get a 0.1% response rate (1,000 responses). • A data mining tool identifies – a subset of 100,000 households with a 0.4% response rate (400); or – a subset of 400,000 households with a 0.2% response rate (800) • Depending on the costs, we can make the final decision using lift charts! • A lift chart allows a visual comparison for measuring model performance
  47. 47. Generating a Lift Chart • Given a scheme that outputs probabilities, sort the instances in descending order according to the predicted probability • In a lift chart, the x-axis is the sample size and the y-axis is the number of true positives. Rank / Predicted Probability / Actual Class: 1 / 0.95 / Yes; 2 / 0.93 / Yes; 3 / 0.93 / No; 4 / 0.88 / Yes; ….. / …. / ….
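A minimal sketch of the sort-and-accumulate step; the fifth row is an assumed value added only to complete the truncated table above.

```python
ranked = [(0.95, "Yes"), (0.93, "Yes"), (0.93, "No"), (0.88, "Yes"), (0.80, "No")]
ranked.sort(key=lambda row: row[0], reverse=True)    # descending predicted probability

points = []            # (sample size, cumulative number of true positives)
true_positives = 0
for sample_size, (prob, actual) in enumerate(ranked, start=1):
    true_positives += (actual == "Yes")
    points.append((sample_size, true_positives))
print(points)          # [(1, 1), (2, 2), (3, 2), (4, 3), (5, 3)]
```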
  48. 48. Gains Chart
  49. 49. Example 01: Direct Marketing • A company wants to do a mail marketing campaign • It costs the company $1 for each item mailed • They have information on 100,000 customers • Create cumulative gains and lift charts from the following data • Overall Response Rate: If we assume we have no model other than the prediction of the overall response rate, then we can predict the number of positive responses as a fraction of the total customers contacted • Suppose the response rate is 20% • If all 100,000 customers are contacted, we will receive around 20,000 positive responses
  50. 50. Cost($) / Total Customers Contacted / Positive Responses: 100,000 / 100,000 / 20,000 • Prediction of Response Model: A response model predicts who will respond to a marketing campaign • If we have a response model, we can make more detailed predictions • For example, we use the response model to assign a score to all 100,000 customers and predict the results of contacting only the top 10,000 customers, the top 20,000 customers, etc. Cost($) / Total Customers Contacted / Positive Responses: 10,000 / 10,000 / 6,000; 20,000 / 20,000 / 10,000; 30,000 / 30,000 / 13,000; 40,000 / 40,000 / 15,800; 50,000 / 50,000 / 17,000; 60,000 / 60,000 / 18,000; 70,000 / 70,000 / 18,800; 80,000 / 80,000 / 19,400; 90,000 / 90,000 / 19,800; 100,000 / 100,000 / 20,000
  51. 51. Cumulative Gains Chart • The y-axis shows the percentage of positive responses. This is a percentage of the total possible positive responses (20,000, as the overall response rate shows) • The x-axis shows the percentage of customers contacted, which is a fraction of the 100,000 total customers • Baseline (overall response rate): If we contact X% of customers then we will receive X% of the total positive responses • Lift Curve: Using the predictions of the response model, calculate the percentage of positive responses for the percentage of customers contacted and map these points to create the lift curve
  52. 52. Cost($) / Total Customers Contacted / Positive Responses: 10,000 / 10,000 / 6,000; 20,000 / 20,000 / 10,000; 30,000 / 30,000 / 13,000; 40,000 / 40,000 / 15,800; 50,000 / 50,000 / 17,000; 60,000 / 60,000 / 18,000; 70,000 / 70,000 / 18,800; 80,000 / 80,000 / 19,400; 90,000 / 90,000 / 19,800; 100,000 / 100,000 / 20,000
  53. 53. Lift Chart • Shows the actual lift. • To plot the chart: Calculate the points on the lift curve by determining the ratio between the result predicted by our model and the result using no model. • Example: For contacting 10% of customers, using no model we should get 10% of responders and using the given model we should get 30% of responders. The y-value of the lift curve at 10% is 30 / 10 = 3
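Applying that ratio to the response-model table from Example 01 gives the full lift curve; a short sketch:

```python
total_responses = 20_000
cumulative_responses = [6_000, 10_000, 13_000, 15_800, 17_000,
                        18_000, 18_800, 19_400, 19_800, 20_000]  # top 10%, 20%, ..., 100%

for decile, responses in enumerate(cumulative_responses, start=1):
    pct_contacted = 10 * decile
    pct_responses = 100 * responses / total_responses
    lift = pct_responses / pct_contacted          # e.g. 30 / 10 = 3 at 10% contacted
    print(pct_contacted, "% contacted -> lift", round(lift, 2))
```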
  54. 54. Lift Chart Cumulative gains and lift charts are a graphical representation of the advantage of using a predictive model to choose which customers to contact
  55. 55. Example 2: • Using the response model P(x)=100-AGE(x) for customer x and the data table shown below, construct the cumulative gains and lift charts.
  56. 56. Calculate P(x) for each person x 1. Calculate P(x) for each person x 2. Order the people according to rank P(x) 3. Calculate the percentage of total responses for each cutoff point Response Rate = Number of Responses / Total Number of Responses [Table to fill in: Total Customers Contacted (2, 4, 6, 8, 10, 12, 14, 16, 18, 20), # of Responses, Response Rate]
  57. 57. Calculate P(x) for each person x 1. Calculate P(x) for each person x 2. Order the people according to rank P(x) 3. Calculate the percentage of total responses for each cutoff point Response Rate = Number of Responses / Total Number of Responses
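A minimal sketch of steps 1–3, using made-up ages and responses since the actual data table is only shown in the slides:

```python
# Hypothetical customers: (name, age, responded?)
customers = [("c1", 22, True), ("c2", 35, False), ("c3", 41, True),
             ("c4", 58, False), ("c5", 63, True)]

# 1. Calculate P(x) = 100 - AGE(x) for each person x
scored = [(100 - age, name, responded) for name, age, responded in customers]
# 2. Order the people by P(x), highest score first
scored.sort(reverse=True)
# 3. Percentage of total responses captured at each cutoff point
total_responses = sum(responded for _, _, responded in scored)
captured = 0
for score, name, responded in scored:
    captured += responded
    print(name, score, round(100 * captured / total_responses), "% of responses captured")
```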
  58. 58. Cumulative Gains vs Lift Chart The lift curve and the baseline have the same values for 10%-20% and 90%-100%.
  59. 59. ROC Curves • ROC curves are similar to lift charts – Stands for “receiver operating characteristic” – Used in signal detection to show tradeoff between hit rate and false alarm rate over noisy channel • Differences from gains chart: – x axis shows percentage of false positives in sample, rather than sample size
  60. 60. ROC Curve
  61. 61. [Figure: score distributions for non-diseased cases and diseased cases, with a decision threshold]
  62. 62. ROC Curves and Analysis True Predicted pos neg pos 60 40 neg 20 80 True Predicted pos neg pos 70 30 neg 50 50 True Predicted pos neg pos 40 60 neg 30 70 Classifier 1 TPr = 0.4 FPr = 0.3 Classifier 2 TPr = 0.7 FPr = 0.5 Classifier 3 TPr = 0.6 FPr = 0.2
  63. 63. ROC analysis • True Positive Rate – TPR = TP / (TP+FN) – also called sensitivity – true abnormals called abnormal by the observer • False Positive Rate – FPR = FP / (FP+TN) • Specificity (TNR)= TN / (TN+FP) – True normals called normal by the observer • FPR = 1 - specificity
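The definitions above as a small helper; each classifier (or threshold setting) yields one (FPR, TPR) point in ROC space.

```python
def roc_point(tp, fn, fp, tn):
    """Return the (FPR, TPR) coordinates of a classifier in ROC space."""
    tpr = tp / (tp + fn)    # sensitivity / recall
    fpr = fp / (fp + tn)    # 1 - specificity
    return fpr, tpr

# Sweeping the decision threshold and plotting each point traces the ROC curve,
# e.g. roc_point(60, 40, 20, 80) -> (0.2, 0.6)
print(roc_point(60, 40, 20, 80))
```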
  64. 64. Evaluating classifiers (via their ROC curves) Classifier A can’t distinguish between normal and abnormal. B is better but makes some mistakes. C makes very few mistakes.
  65. 65. “Perfect” means no false positives and no false negatives.
  66. 66. Quiz 4: 1) How many images of Gerhard Schroeder are in the data set? 2) How many predictions of G Schroeder are there? 3) What is the probability that Hugo Chavez is classified correctly by our learning algorithm? 4) Your learning algorithm predicted/classified an image as Hugo Chavez. What is the probability he is actually Hugo Chavez? 5) Recall(“Hugo Chavez”) = 6) Precision(“Hugo Chavez”) = 7) Recall(“Colin Powell”) = 8) Precision(“Colin Powell”) = 9) Recall(“George W Bush”) = 10) Precision(“George W Bush”) =
