
VSSML18. Evaluations

Evaluations.
VSSML18: 4th edition of the Valencian Summer School in Machine Learning.

Published in: Data & Analytics

  1. Valencian Summer School in Machine Learning, 4th edition. September 13-14, 2018
  2. BigML, Inc. Evaluations: Proving a Model Works. Poul Petersen, CIO, BigML, Inc.
  3. Why Evaluations
     • FACT: No model is perfect; they all make mistakes
       • Your data has mistakes
       • Models are "approximations"
     • Today you have seen models that predict:
       • Churn: how many people will churn that we didn't predict?
       • Diabetes: how many patients might have diabetes that we said were fine?
       • Home Prices: how accurate are the predicted prices?
     • You have also seen several different kinds of models
       • Decision Trees / Ensembles / Logistic Regression / Deepnets
     • Which one works best for your data?
  4. Easy, Right? Compare the Model's Prediction to the known value and look for mistakes!
     INTL MIN  INTL CALLS  INTL CHARGE  CUST SERV CALLS  CHURN  PREDICT CHURN
     8,7       4           2,35         1                False  False
     11,2      5           3,02         0                False  True
     12,7      6           3,43         4                True   True
     9,1       5           2,46         0                False  False
     11,2      2           3,02         1                False  False
     12,3      5           3,32         3                False  False
     13,1      6           3,54         4                False  False
     5,4       9           1,46         4                True   False
     13,8      4           3,73         1                False  False
  5. Evaluations Demo #1
  6. What Just Happened?
     • We started with the churn Datasource
     • Created a Dataset
     • Built a Model to predict churn
     • We used the Model to predict churn for each customer in the Dataset using a Batch Prediction
     • Downloaded the Batch Prediction as a CSV and looked for errors, that is, rows where the Prediction did not match the known true value for churn
     • The comparison was tedious!
       • Examining one line at a time
       • Hard to understand: we need some metrics!
  7. Evaluation Metrics
     • Imagine we have a model that can predict a person's dominant hand; that is, for any individual it predicts left / right
     • Define the positive class
       • This selection is arbitrary
       • It is the class you are interested in!
     • The negative class is the "other" class (or others)
     • For this example, we choose: left
  8. Evaluation Metrics
     • We choose the positive class: left
     • True Positive (TP): we predicted left and the correct answer was left
     • True Negative (TN): we predicted right and the correct answer was right
     • False Positive (FP): we predicted left but the correct answer was right
     • False Negative (FN): we predicted right but the correct answer was left
  9. Evaluation Metrics. Remember:
     • True Positive: correctly predicted the positive class
     • True Negative: correctly predicted the negative class
     • False Positive: incorrectly predicted the positive class
     • False Negative: incorrectly predicted the negative class
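The four counts above can be tallied directly from a list of predictions. A minimal Python sketch (the helper name and toy data are mine, not from the slides):

```python
def confusion_counts(y_true, y_pred, positive="left"):
    """Tally TP, FP, TN, FN for a chosen positive class (illustrative helper)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    return tp, fp, tn, fn

# Toy handedness data (hypothetical); the positive class is "left"
actual    = ["left", "left", "right", "right", "left"]
predicted = ["left", "right", "right", "left", "left"]
print(confusion_counts(actual, predicted))  # (2, 1, 1, 1)
```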
  10. Accuracy
     Accuracy = (TP + TN) / Total
     • "Percentage correct", like an exam
     • If Accuracy = 1 then no mistakes
     • If Accuracy = 0 then all mistakes
     • Intuitive but not always useful
     • Watch out for unbalanced classes!
       • Ex: 90% of people are right-handed and 10% are left-handed
       • A silly model which always predicts right-handed is 90% accurate
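The unbalanced-class trap on this slide is easy to reproduce (a minimal sketch with made-up data):

```python
# Hypothetical sample: 90% right-handed, 10% left-handed
actual = ["right"] * 9 + ["left"]
silly = ["right"] * 10  # a "silly" model that always predicts right-handed

accuracy = sum(a == p for a, p in zip(actual, silly)) / len(actual)
print(accuracy)  # 0.9 -- high accuracy, yet it never finds a left-hander
```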
  11. Accuracy example (positive class = Left, negative class = Right)
     Classified as Left Handed / Classified as Right Handed: TP = 0, FP = 0, TN = 7, FN = 3
     Accuracy = (TP + TN) / Total = (0 + 7) / 10 = 70%
  12. Precision
     Precision = TP / (TP + FP)
     • The "accuracy" or "purity" of the positive class
     • How well you did separating the positive class from the negative class
     • If Precision = 1 then there are no FP
       • You may have missed some left-handers, but of the ones you identified, all are left-handed. No mistakes.
     • If Precision = 0 then there are no TP
       • None of the left-handers you identified are actually left-handed. All mistakes.
  13. Precision example (positive class = Left, negative class = Right)
     TP = 2, FP = 2, TN = 5, FN = 1
     Precision = TP / (TP + FP) = 2 / 4 = 50%
  14. Recall
     Recall = TP / (TP + FN)
     • The percentage of the positive class correctly identified
     • A measure of how well you identified all of the positive class examples
     • If Recall = 1 then there are no FN → all left-handers identified
       • There may be FP, so precision could be < 1
     • If Recall = 0 then there are no TP → no left-handers identified
  15. Recall example (positive class = Left, negative class = Right)
     TP = 2, FP = 2, TN = 5, FN = 1
     Recall = TP / (TP + FN) = 2 / 3 ≈ 66%
  16. f-Measure
     f-Measure = (2 × Recall × Precision) / (Recall + Precision)
     • The harmonic mean of Recall and Precision
     • If f-measure = 1 then Recall = Precision = 1
     • If Precision OR Recall is small then the f-measure is small
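Plugging the slides' handedness confusion counts into these formulas reproduces the numbers on the example slides (a minimal Python sketch; variable names are mine):

```python
# Confusion counts from the handedness example on the slides
TP, FP, TN, FN = 2, 2, 5, 1

precision = TP / (TP + FP)                                  # 0.5
recall = TP / (TP + FN)                                     # 2/3, about 0.67
f_measure = 2 * recall * precision / (recall + precision)   # 4/7, about 0.57
print(precision, round(recall, 2), round(f_measure, 2))
```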
  17. f-Measure example (positive class = Left, negative class = Right)
     Classified as Left Handed / Classified as Right Handed
     R = 66%, P = 50%, f = 57%
  18. Phi Coefficient
     Phi = (TP × TN - FP × FN) / √[(TP + FP)(TP + FN)(TN + FP)(TN + FN)]
     • Returns a value between -1 and 1
     • Phi = -1: predictions are the opposite of reality
     • Phi = 0: no correlation between predictions and reality
     • Phi = 1: predictions are always correct
  19. Phi Coefficient example (positive class = Left, negative class = Right)
     Classified as Left Handed / Classified as Right Handed
     TP = 2, FP = 2, TN = 5, FN = 1
     Phi = 0.356
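The phi value on this slide follows directly from the formula (a minimal sketch using the same counts):

```python
from math import sqrt

# Same confusion counts as the handedness example
TP, FP, TN, FN = 2, 2, 5, 1
phi = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print(round(phi, 3))  # 0.356
```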
  20. Evaluations Demo #2
  21. What Just Happened?
     • Starting with the Diabetes Source, we created a Dataset and then a Model
     • Using both the Model and the original Dataset, we created an Evaluation
     • We reviewed the metrics provided by the Evaluation:
       • Confusion Matrix
       • Accuracy, Precision, Recall, f-measure and phi
     • This Model seemed to perform really, really well…
     Question: Can we trust this model?
  22. Evaluation Danger!
     • Never evaluate with the training data!
       • Many models are able to "memorize" the training data
       • This will result in overly optimistic evaluations!
  23. "Memorizing" Training Data
     Training:
     plasma glucose  bmi   diabetes pedigree  age  diabetes
     148             33,6  0,627             50   TRUE
     85              26,6  0,351             31   FALSE
     183             23,3  0,672             32   TRUE
     89              28,1  0,167             21   FALSE
     137             43,1  2,288             33   TRUE
     116             25,6  0,201             30   FALSE
     78              31    0,248             26   TRUE
     115             35,3  0,134             29   FALSE
     197             30,5  0,158             53   TRUE
     Evaluating:
     148             33,6  0,627             50   ?
     85              26,6  0,351             31   ?
     • Exactly the same values! Who needs a model?
     • What we want to know is how the model performs with values never seen at training:
       124           22    0,107             46   ?
  24. Evaluation Danger!
     • Never evaluate with the training data!
       • Many models are able to "memorize" the training data
       • This will result in overly optimistic evaluations!
     • If you only have one Dataset, use a train/test split
  25. Train / Test Split
     Train:
     plasma glucose  bmi   diabetes pedigree  age  diabetes
     148             33,6  0,627             50   TRUE
     183             23,3  0,672             32   TRUE
     89              28,1  0,167             21   FALSE
     78              31    0,248             26   TRUE
     115             35,3  0,134             29   FALSE
     197             30,5  0,158             53   TRUE
     Test:
     85              26,6  0,351             31   FALSE
     137             43,1  2,288             33   TRUE
     116             25,6  0,201             30   FALSE
     • These instances were never seen at training time
     • Better evaluation of how the model will perform with "new" data
  26. Train / Test Split (workflow diagram: the DATASET is split into a TRAIN SET and a TEST SET; the model built on the train set makes PREDICTIONS on the test set, which yield the evaluation METRICS)
  27. Evaluation Danger!
     • Never evaluate with the training data!
       • Many models are able to "memorize" the training data
       • This will result in overly optimistic evaluations!
     • If you only have one Dataset, use a train/test split
     • Even a train/test split may not be enough!
       • You might get a "lucky" split
       • The solution is to repeat the split several times (formally, to cross-validate)
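A random hold-out split, repeated with different seeds to guard against one "lucky" split, can be sketched in plain Python (the helper name and the 80/20 fraction are illustrative choices, not a BigML API):

```python
import random

def shuffle_split(rows, test_fraction=0.2, seed=None):
    """Hold out a random test fraction of the rows (a sketch of an 80/20 split)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

data = list(range(100))  # stand-in for dataset rows
for seed in (1, 2, 3):   # repeat to avoid trusting a single "lucky" split
    train, test = shuffle_split(data, test_fraction=0.2, seed=seed)
    # ...build the model on `train`, evaluate on `test`, then average the metrics
```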
  28. Evaluations Demo #3
  29. What Just Happened?
     • Starting with the Diabetes Dataset, we created a train/test split
     • We built a Model using the train set and evaluated it with the test set
     • The scores were much worse than before, showing the danger of evaluating with training data
     • Then we launched several other types of models and used the evaluation comparison tool to see which model algorithm performed the best
     Question: Couldn't we search for the best Model? STAY TUNED
  30. Evaluation
     • Never evaluate with the training data!
       • Many models are able to "memorize" the training data
       • This will result in overly optimistic evaluations!
     • If you only have one Dataset, use a train/test split
     • Even a train/test split may not be enough!
       • You might get a "lucky" split
       • The solution is to repeat several times (formally, to cross-validate)
     • Don't forget that accuracy can be misleading!
       • Mostly useless with unbalanced classes (left/right?)
       • Use weighting, operating points, and other tricks…
  31. Weighting
     Instance  Rate  Payment  Outcome  Predict  Confidence
     1         23 %  134      Paid     Paid     20 %
     2         23 %  134      Paid     Paid     25 %
     3         23 %  134      Paid     Paid     30 %
     ...       ...   ...      ...      ...      ...
     1000      23 %  134      Paid     Paid     99,5 %
     1001      23 %  134      Default  Paid     99,4 %
     Problem: Default is "more important", but occurs less often than Paid.
     Solution: Weights tell the model to treat instances of a specific class (in this case Default) with more importance.
  32. Operating Points
     • The default probability threshold is 50%
     • Changing the threshold can change the outcome for a specific class
     Rate    Payment  …  Actual Outcome  Probability PAID  @50%     @60%     @90%
     8,4 %   $456     …  PAID            95 %              PAID     PAID     PAID
     9,6 %   $134     …  PAID            87 %              PAID     PAID     DEFAULT
     18 %    $937     …  DEFAULT         36 %              DEFAULT  DEFAULT  DEFAULT
     21 %    $35      …  PAID            88 %              PAID     PAID     DEFAULT
     17,5 %  $1.044   …  DEFAULT         55 %              PAID     DEFAULT  DEFAULT
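Applying a threshold to the predicted probability reproduces the columns of this table (a minimal sketch; the function name is mine):

```python
def classify(prob_paid, threshold=0.5):
    """Predict PAID only when P(paid) clears the threshold (illustrative)."""
    return "PAID" if prob_paid >= threshold else "DEFAULT"

# P(paid) values from the slide's table
probs = [0.95, 0.87, 0.36, 0.88, 0.55]
for threshold in (0.5, 0.6, 0.9):
    print(threshold, [classify(p, threshold) for p in probs])
# At 0.5 the last loan is predicted PAID; at 0.6 it flips to DEFAULT;
# at 0.9 only the 0.95 loan is still predicted PAID.
```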
  33. Lending Club Dataset
     • Peer-to-peer lending service
     • As an investor, we want a way to identify loans that are a lower risk
     • Fortunately, the data for the outcome (paid or default) of past loans is available from Lending Club
     • Using this data, we can build a model to predict which loans are good or bad
     Instance  Rate   Payment  Outcome
     1         8,4 %  456      Paid
     2         9,6 %  134      Paid
     3         18 %   937      Default
     (workflow: MODEL + NEW LOANS → GOOD / BAD)
  34. Evaluations Demo #4
  35. What Just Happened?
     • We split the Lending Club data into training and test Datasets
     • We created a Model and an Evaluation
     • Looking at the Accuracy, we saw that the Model appeared to perform well, but only because of the unbalanced classes:
       • The resulting Model did well at predicting good loans
       • But bad loans are "more important"
     • We tried different weights to increase the Recall of bad loans:
       • objective balancing: equal consideration
       • class weights: bad = 1000, good = 1
     • Finally, we explored the impact of changing the probability threshold
     Wait: what about regressions?
  36. Regression: Fitting a Line (figure: data points and the fitted model line)
  37. Mean Absolute Error (figure: errors e1 … e7 between the data points and the fitted line)
     MAE = (|e1| + |e2| + … + |en|) / n
  38. Mean Squared Error (figure: the same errors e1 … e7)
     MSE = ((e1)² + (e2)² + … + (en)²) / n
  39. MSE versus MAE
     • For both MAE and MSE: smaller is better, but the values are unbounded
     • MSE penalizes large errors more heavily than MAE; the square root of MSE (RMSE) is always greater than or equal to MAE
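Both metrics are a few lines of Python to compute from a list of residuals (a minimal sketch with made-up error values):

```python
from math import sqrt

errors = [1.5, -2.0, 0.5, -0.5, 3.0]  # hypothetical residuals e_i
n = len(errors)
mae = sum(abs(e) for e in errors) / n   # 1.5
mse = sum(e * e for e in errors) / n    # 3.15
print(mae, mse, sqrt(mse) >= mae)       # RMSE is at least as large as MAE
```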
  40. R-Squared Error (figure: data points, the model line, and the mean line)
  41. R-Squared Error (figure: deviations v1 … v7 of the data points from the mean)
  42. R-Squared Error (figure: model errors e1 … e7 versus mean deviations v1 … v7)
     RSE = 1 - MSEmodel / MSEmean
  43. R-Squared Error
     RSE = 1 - MSEmodel / MSEmean
     • RSE: a measure of how much better the model is than always predicting the mean
     • RSE < 0: the model is worse than the mean (MSEmodel > MSEmean)
     • RSE = 0: the model is no better than the mean (MSEmodel = MSEmean)
     • RSE → 1: the model fits the data "perfectly" (MSEmodel = 0, or MSEmean >> MSEmodel)
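The RSE definition can be checked on a tiny made-up regression (a minimal sketch; the data values are mine):

```python
# Hypothetical observed values and model predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 6.5, 9.5]

n = len(y_true)
mean = sum(y_true) / n
mse_model = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
mse_mean = sum((t - mean) ** 2 for t in y_true) / n
rse = 1 - mse_model / mse_mean
print(round(rse, 2))  # 0.95 -- far better than always predicting the mean
```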
  44. Evaluations Demo #5
  45. What Just Happened?
     • We split the RedFin data into training and test Datasets
     • We created a Model and an Evaluation
     • We examined the Evaluation metrics
     Wait: what about Time Series?
  46. Independent Data
     Color   Mass  PPAP
     red     11    pen
     green   45    apple
     red     53    apple
     yellow  0     pen
     blue    2     pen
     green   422   pineapple
     yellow  555   pineapple
     blue    7     pen
     Discovering patterns:
     • Color = "red" → Mass < 100
     • PPAP = "pineapple" → Color ≠ "blue"
     • Color = "blue" → PPAP = "pen"
  47. Independent Data
     Color   Mass  PPAP
     green   45    apple
     blue    2     pen
     green   422   pineapple
     blue    7     pen
     yellow  0     pen
     yellow  9     pineapple
     red     555   apple
     red     11    pen
     Patterns still hold when the rows are re-arranged:
     • Color = "red" → Mass < 100
     • PPAP = "pineapple" → Color ≠ "blue"
     • Color = "blue" → PPAP = "pen"
  48. Dependent Data
     Year  Pineapple Harvest
     1986  50,74
     1987  22,03
     1988  50,69
     1989  40,38
     1990  29,80
     1991  9,90
     1992  73,93
     1993  22,95
     1994  139,09
     1995  115,17
     1996  193,88
     1997  175,31
     1998  223,41
     1999  295,03
     2000  450,53
     (chart: harvest in tons by year, showing the trend and the error)
  49. Dependent Data
     Year  Pineapple Harvest
     1986  139,09
     1987  175,31
     1988  9,91
     1989  22,95
     1990  450,53
     1991  73,93
     1992  40,38
     1993  22,03
     1994  295,03
     1995  50,74
     1996  29,8
     1997  223,41
     1998  115,17
     1999  193,88
     2000  50,69
     (chart: the same series with rows re-arranged)
     Rearranging disrupts the patterns.
  50. Random Train / Test Split
     Train:
     plasma glucose  bmi   diabetes pedigree  age  diabetes
     148             33,6  0,627             50   TRUE
     183             23,3  0,672             32   TRUE
     89              28,1  0,167             21   FALSE
     78              31    0,248             26   TRUE
     115             35,3  0,134             29   FALSE
     197             30,5  0,158             53   TRUE
     Test:
     85              26,6  0,351             31   FALSE
     137             43,1  2,288             33   TRUE
     116             25,6  0,201             30   FALSE
  51. Linear Train / Test Split
     Train:
     Year  Pineapple Harvest
     1986  50,74
     1987  22,03
     1988  50,69
     1989  40,38
     1990  29,80
     1991  9,90
     1992  73,93
     1993  22,95
     1994  139,09
     1995  115,17
     1996  193,88
     Test:
     1997  175,31
     1998  223,41
     1999  295,03
     2000  450,53
     (the model's Forecast for 1997-2000 is COMPARED with the held-out test values)
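For time-ordered data, the split keeps chronological order instead of shuffling. A minimal sketch (the 75% cut is an illustrative choice that happens to reproduce the slide's 1986-1996 / 1997-2000 split):

```python
years = list(range(1986, 2001))  # the slide's 15-year harvest series

# Linear split: train on the earliest years, test on the most recent ones,
# preserving chronological order rather than shuffling.
cut = int(len(years) * 0.75)
train_years, test_years = years[:cut], years[cut:]
print(train_years[0], train_years[-1])  # 1986 1996
print(test_years)                       # [1997, 1998, 1999, 2000]
```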
  52. Evaluation Demo #6
