# L2. Evaluating Machine Learning Algorithms I

Valencian Summer School 2015
Day 1
Lecture 2
Evaluating Machine Learning Algorithms I
Cèsar Ferri (UPV)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015

Published in: Data & Analytics

### L2. Evaluating Machine Learning Algorithms I

1. Evaluating Machine Learning Models I. Cèsar Ferri Ramírez, Universitat Politècnica de València
2. Outline
   - Machine Learning Tasks
   - Classification
     - Imbalanced problems, probabilistic classifiers, rankers
   - Regression
   - Unsupervised Learning
   - Lessons learned
3. Machine Learning Tasks
   - Supervised learning: the learner is given example inputs together with their desired outputs, and the goal is to learn a general rule that maps inputs to outputs.
     - Classification: the output is categorical
     - Regression: the output is numerical
   - Unsupervised learning: no labels are given to the learning algorithm, leaving it on its own to find structure in its input.
     - Clustering, association rules…
   - Reinforcement learning: a computer program interacts with a dynamic environment in which it must achieve a certain goal, such as driving a vehicle or playing a video game.
4. Evaluation of Classifiers
   - Classification: the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.
     - Spam filters
     - Face identification
     - Diagnosis of patients
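6. Split the data
   - Common solution: split the data into training and test sets.
     - Diagram: data → training / test; algorithms learn models from the training data; the models are evaluated on the test data; the best model is selected.
     - Error of a model h on a set S of n examples:

       $$error_S(h) = \frac{1}{n}\sum_{x \in S} (f(x) - h(x))^2$$

   - What if there is not much data available?
   - GOLDEN RULE: Never use the same example for training the model and evaluating it!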
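A minimal sketch of the hold-out split and the error formula above, using only the Python standard library. The helper names (`train_test_split`, `squared_error`), the 70/30 split, the toy data and the mean-predicting "model" are assumptions made for illustration; they are not taken from the slides.

```python
import random

def train_test_split(data, test_fraction=0.3, seed=0):
    """Shuffle a copy of the data and cut it into disjoint train and test lists."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def squared_error(h, examples):
    """error_S(h) = (1/n) * sum over (x, f(x)) in S of (f(x) - h(x))^2."""
    return sum((fx - h(x)) ** 2 for x, fx in examples) / len(examples)

# Toy data (made up): pairs (x, f(x)) with f(x) = 2x + 1.
data = [(x, 2 * x + 1) for x in range(10)]
train, test = train_test_split(data)

# A deliberately simple model: always predict the mean training target.
mean_target = sum(fx for _, fx in train) / len(train)

def h(x):
    return mean_target

# The model is fitted on `train` only and scored on `test` only (golden rule).
test_error = squared_error(h, test)
```

Because the two lists are disjoint, `test_error` is computed on examples the model never saw during training.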
7. Taking the most of the data
   - Too much training data: poor evaluation
   - Too much test data: poor training
   - Can we have more training data and more test data without breaking the golden rule?
     - Repeat the experiment!
       - Bootstrap: we draw n samples (with repetition) for training and test with the rest.
       - Cross validation: the data is split into n folds of equal size.
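The two resampling schemes can be sketched as index generators. This is an illustrative sketch only: the function names are hypothetical, and the cross-validation loop assumes the usual procedure in which each fold is used once for testing while the remaining folds are used for training.

```python
import random

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; every index is tested exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

def bootstrap_indices(n, seed=0):
    """Draw n indices with repetition for training; test on those never drawn."""
    rng = random.Random(seed)
    train_idx = [rng.randrange(n) for _ in range(n)]
    drawn = set(train_idx)
    test_idx = [i for i in range(n) if i not in drawn]
    return train_idx, test_idx

splits = list(k_fold_indices(10, 5))          # 5 folds of 2 examples each
boot_train, boot_test = bootstrap_indices(10)
```

In both schemes no example is ever used for training and testing within the same repetition, so the golden rule still holds.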
8. Overfitting?
   - What dataset do we use to estimate all the previous metrics?
     - If we use all the data both to train the models and to evaluate them, we get over-optimistic models:
       - Over-fitting
     - If we try to compensate by generalising the model (e.g., pruning a tree), we may get:
       - Under-fitting
     - How can we find a trade-off?
9. Confusion Matrix
   - Confusion (contingency) matrix:
     - We can observe how errors are distributed.

   | Pred. \ Actual | Buy | No Buy |
   |----------------|-----|--------|
   | Buy            | 4   | 1      |
   | No Buy         | 2   | 3      |
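10. Confusion Matrix
    - For two classes:

    | predicted \ actual | +                   | -                   |
    |--------------------|---------------------|---------------------|
    | +                  | TP (true positive)  | FP (false positive) |
    | -                  | FN (false negative) | TN (true negative)  |
    | total              | TP+FN               | FP+TN               |

    - $Accuracy = \frac{TP + TN}{N}$
    - $Error = 1 - Accuracy = \frac{FP + FN}{N}$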
11. Confusion Matrix
    - For two classes, with TP, FN, FP and TN as in the two-class confusion matrix:
      - TPRate, Sensitivity: $\frac{TP}{TP + FN}$
      - TNRate, Specificity: $\frac{TN}{FP + TN}$
12. Confusion Matrix
    - For two classes, with TP, FN, FP and TN as in the two-class confusion matrix:
      - TPRate, Sensitivity, Recall: $\frac{TP}{TP + FN}$
      - Positive predictive value (PPV), Precision: $\frac{TP}{TP + FP}$
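The two-class measures above can be computed directly from the four cells of the confusion matrix. The counts below are made up for illustration, and the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fn, fp, tn):
    """Return the two-class measures defined from the confusion matrix cells."""
    n = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / n,
        "error": (fp + fn) / n,
        "sensitivity": tp / (tp + fn),   # TP rate, recall
        "specificity": tn / (fp + tn),   # TN rate
        "precision": tp / (tp + fp),     # positive predictive value
    }

# Made-up counts: 50 actual positives, 50 actual negatives.
metrics = binary_metrics(tp=40, fn=10, fp=5, tn=45)
```

With these counts the accuracy is 0.85, the sensitivity 0.8 and the specificity 0.9, while the precision is 40/45.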
13. Information Retrieval
    - Common measures in IR: Precision and Recall
    - They are defined in terms of a set of retrieved documents and a set of relevant documents.
      - Precision: fraction of the retrieved documents that are relevant to the query
      - Recall: fraction of all relevant documents that are returned by the search
    - Both measures are usually combined into one (harmonic mean):

      $$F\text{-}measure = 2\,\frac{Precision \times Recall}{Precision + Recall} = \frac{2\,TP}{2\,TP + FP + FN}$$
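As a quick check that the harmonic-mean form and the count-based form of the F-measure agree, here is a small sketch with made-up counts (40 relevant documents retrieved, 5 irrelevant documents retrieved, 10 relevant documents missed).

```python
tp, fp, fn = 40, 5, 10   # made-up counts for illustration

precision = tp / (tp + fp)   # fraction of retrieved documents that are relevant
recall = tp / (tp + fn)      # fraction of relevant documents that are retrieved

# Harmonic mean of precision and recall.
f_from_rates = 2 * precision * recall / (precision + recall)

# Equivalent expression written directly in terms of the counts.
f_from_counts = 2 * tp / (2 * tp + fp + fn)
```

Both expressions give 80/95, roughly 0.84. The harmonic mean always lies between the two rates and is pulled towards the smaller one.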
14. Multiclass problems
    - Confusion (contingency) tables can be multiclass
    - Measures based on 2-class matrices are computed
      - 1 vs all (average N partial measures)
      - 1 vs 1 (average N*(N-1)/2 partial measures)
        - Weighted average?

    | predicted \ actual | low | medium | high |
    |--------------------|-----|--------|------|
    | low                | 20  | 0      | 13   |
    | medium             | 5   | 15     | 4    |
    | high               | 4   | 7      | 60   |
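The 1-vs-all reduction can be sketched on the low/medium/high matrix from this slide. The sketch assumes that rows are predictions and columns are actual classes, and it averages recall as one example of a partial measure; the function name `one_vs_all` is hypothetical.

```python
labels = ["low", "medium", "high"]
# Rows: predicted class; columns: actual class (assumed orientation).
matrix = [
    [20, 0, 13],
    [5, 15, 4],
    [4, 7, 60],
]

n = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / n

def one_vs_all(k):
    """Collapse the multiclass matrix into TP, FP, FN, TN for class k vs the rest."""
    tp = matrix[k][k]
    fp = sum(matrix[k]) - tp                  # predicted k, actually another class
    fn = sum(row[k] for row in matrix) - tp   # actually k, predicted another class
    tn = n - tp - fp - fn
    return tp, fp, fn, tn

# One partial measure per class (here recall), then a plain unweighted average.
recalls = []
for k in range(len(labels)):
    tp, fp, fn, tn = one_vs_all(k)
    recalls.append(tp / (tp + fn))
macro_recall = sum(recalls) / len(recalls)
```

A weighted average would instead multiply each per-class value by the proportion of instances of that class before summing.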
15. Imbalanced Datasets
    - In some cases there are large differences in the proportions of the classes
      - A naïve classifier that always predicts the majority class (ignoring the minority classes) obtains good performance
        - In a binary problem (+/-) with 1% of negative instances, the model "always positive" gets an accuracy of 99%.
    - Macro-accuracy: average of the accuracy per class

      $$macroacc(h) = \frac{1}{m}\left(\frac{hits_{class_1}}{total_{class_1}} + \frac{hits_{class_2}}{total_{class_2}} + \dots + \frac{hits_{class_m}}{total_{class_m}}\right)$$

      - The naïve classifier gets a macro-accuracy of 0.5
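The imbalanced example can be reproduced in a few lines. This sketch builds 99 positive and 1 negative instance (an assumed sample size that matches the 1% proportion) and scores the "always positive" classifier; the function names are hypothetical.

```python
def accuracy(actual, predicted):
    """Fraction of instances whose predicted class equals the actual class."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / len(actual)

def macro_accuracy(actual, predicted):
    """Average over classes of (hits in the class) / (total in the class)."""
    per_class = []
    for c in sorted(set(actual)):
        idx = [i for i, a in enumerate(actual) if a == c]
        hits = sum(1 for i in idx if predicted[i] == c)
        per_class.append(hits / len(idx))
    return sum(per_class) / len(per_class)

actual = ["+"] * 99 + ["-"] * 1
predicted = ["+"] * 100          # naive classifier: always predict the majority class

acc = accuracy(actual, predicted)              # 0.99
macro_acc = macro_accuracy(actual, predicted)  # (1.0 + 0.0) / 2 = 0.5
```

Accuracy rewards the naïve classifier, while macro-accuracy exposes that it never recognises the minority class.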
16. Soft classifiers
    - Crisp and soft classifiers:
      - A "hard" or "crisp" classifier predicts one class from a set of possible classes.
      - A "soft" or "scoring" (probabilistic) classifier also predicts a class, but accompanies each prediction with an estimate of its reliability (confidence).
        - Most learning methods can be adapted to generate this confidence value.
17. Soft classifiers
    - A special kind of soft classifier is a class probability estimator.
      - Instead of predicting "a", "b" or "c", it gives a probability estimation for "a", "b" or "c", i.e., "pa", "pb" and "pc".
        - Example:
          - Classifier 1: pa = 0.2, pb = 0.5 and pc = 0.3.
          - Classifier 2: pa = 0.3, pb = 0.4 and pc = 0.3.
      - Both predict "b", but classifier 1 is more confident.
18. Evaluating probabilistic classifiers
    - Probabilistic classifiers: classifiers that are able to predict a probability distribution over a set of classes
      - They provide a classification with a degree of certainty, which is useful for:
        - Combining classifiers
        - Cost-sensitive contexts
    - Mean Squared Error (Brier Score):

      $$MSE = \frac{1}{n}\sum_{i \in S}\sum_{j \in C} (f(i,j) - p(i,j))^2$$

    - Log Loss:

      $$Logloss = -\frac{1}{n}\sum_{i \in S}\sum_{j \in C} f(i,j) \cdot \log_2 p(i,j)$$

    - f(i,j) = 1 if instance i is of class j, 0 otherwise. p(i,j) is the estimated probability that instance i belongs to class j.
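A small sketch of both measures for class probability estimators. It scores two estimators that output (0.2, 0.5, 0.3) and (0.3, 0.4, 0.3) for classes a, b and c on a single instance. Two things are assumptions made for illustration: that the true class of that instance is "b", and that the logarithm is taken in base 2 (another base only rescales the value). The function names are hypothetical.

```python
import math

def brier_score(true_classes, probs, classes):
    """(1/n) * sum over instances i and classes j of (f(i,j) - p(i,j))^2."""
    total = 0.0
    for truth, p in zip(true_classes, probs):
        for c in classes:
            f = 1.0 if truth == c else 0.0
            total += (f - p[c]) ** 2
    return total / len(true_classes)

def log_loss(true_classes, probs):
    """-(1/n) * sum of f(i,j) * log2 p(i,j); only the true class has f = 1."""
    total = sum(math.log2(p[truth]) for truth, p in zip(true_classes, probs))
    return -total / len(true_classes)

classes = ["a", "b", "c"]
truth = ["b"]                                   # assumed true class
classifier_1 = [{"a": 0.2, "b": 0.5, "c": 0.3}]
classifier_2 = [{"a": 0.3, "b": 0.4, "c": 0.3}]

bs_1 = brier_score(truth, classifier_1, classes)   # 0.04 + 0.25 + 0.09 = 0.38
bs_2 = brier_score(truth, classifier_2, classes)   # 0.09 + 0.36 + 0.09 = 0.54
ll_1 = log_loss(truth, classifier_1)               # -log2(0.5) = 1.0
ll_2 = log_loss(truth, classifier_2)               # -log2(0.4), about 1.32
```

Under that assumption both measures are lower (better) for the more confident estimator. Had the true class been "a", the ordering would be reversed, since overconfidence in a wrong class is penalised.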
19. Evaluating probabilistic classifiers
    - MSE or Brier Score can be decomposed into two factors:
      - BS = CAL + REF
        - Calibration: measures the quality of the classifier scores with respect to class membership probabilities
        - Refinement: an aggregation of resolution and uncertainty, related to the area under the ROC curve
    - Calibration methods:
      - They try to transform classifier scores into class membership probabilities
        - Platt scaling, isotonic regression, PAVcal…
20. Evaluating probabilistic classifiers
    - Brier curves for analysing classifier performance
    - (Figures: non-calibrated Brier curve; PAV-calibrated Brier curve.)
21. Soft classifiers
    - "Rankers":
      - Whenever we have a probability estimator for a two-class problem:
        - pa = x, then pb = 1 - x.
      - Let's call one class 0 (neg) and the other class 1 (pos).
      - A ranker is a soft classifier that gives a value (score) monotonically related to the probability of class 1.
        - Examples:
          - Probability of a customer buying a product.
          - Probability of a message being spam…
22. Evaluation of Rankers
    - We can rank instances according to their estimated probability
      - CRM: you are interested in the top % of potential customers
    - Measures for ranking
      - AUC: Area Under the ROC Curve
      - Distances between the perfect ranking and the estimated ranking
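For a ranker, the AUC equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by comparing all positive/negative pairs; the scores and labels are made up, and the function name `auc` is hypothetical.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up scores, sorted from most to least likely positive.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

ranker_auc = auc(scores, labels)                 # 8 of 9 pairs are ordered correctly
perfect_auc = auc(scores, [1, 1, 1, 0, 0, 0])    # a perfect ranking gives 1.0
```

Only the order of the scores matters here, which is why the measure suits rankers whose scores are not calibrated probabilities.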
23. Regression
    - Regression: in this case the variable to be predicted is a continuous value.
      - Predicting the daily value of stocks in NASDAQ
      - Forecasting the number of docks available in a Valenbisi station in the next hour
      - Predicting the number of beers a retail company will sell next month
24. Evaluation of Regressors
    - Given a set S of n instances:
      - Mean Absolute Error: $MAE_S(h) = \frac{1}{n}\sum_{x \in S} |f(x) - h(x)|$
      - Mean Squared Error: $MSE_S(h) = \frac{1}{n}\sum_{x \in S} (f(x) - h(x))^2$
      - Root Mean Squared Error: $RMSE_S(h) = \sqrt{\frac{1}{n}\sum_{x \in S} (f(x) - h(x))^2}$
        - MSE is more sensitive to extreme values
        - RMSE and MAE are on the same scale as the actual values
25. Evaluation of Regressors
    - Example (values in millions of €):

    | Predicted value h(x) | Actual value f(x) | Error | Error² |
    |----------------------|-------------------|-------|--------|
    | 100                  | 102               | 2     | 4      |
    | 102                  | 110               | 8     | 64     |
    | 105                  | 95                | 10    | 100    |
    | 95                   | 75                | 20    | 400    |
    | 101                  | 103               | 2     | 4      |
    | 105                  | 110               | 5     | 25     |
    | 105                  | 98                | 7     | 49     |
    | 40                   | 32                | 8     | 64     |
    | 220                  | 215               | 5     | 25     |
    | 100                  | 103               | 3     | 9      |

    - MSE = 744/10 = 74.4
    - MAE = 70/10 = 7
    - RMSE = sqrt(744/10) = 8.63
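The example can be recomputed from the ten (predicted, actual) pairs in the table. The absolute errors sum to 70 and the squared errors to 744, which gives the values below; the variable names are chosen for illustration.

```python
import math

# (predicted h(x), actual f(x)) in millions of euros, copied from the table.
pairs = [
    (100, 102), (102, 110), (105, 95), (95, 75), (101, 103),
    (105, 110), (105, 98), (40, 32), (220, 215), (100, 103),
]

n = len(pairs)
abs_errors = [abs(actual - pred) for pred, actual in pairs]
sq_errors = [(actual - pred) ** 2 for pred, actual in pairs]

mae = sum(abs_errors) / n    # 70 / 10 = 7.0
mse = sum(sq_errors) / n     # 744 / 10 = 74.4
rmse = math.sqrt(mse)        # about 8.63
```

The single error of 20 contributes 400 of the 744 squared-error total, which illustrates why MSE and RMSE react more strongly to extreme values than MAE.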
26. Evaluation of Regressors
    - Sometimes relative error values are more appropriate:
      - 10% for an error of 50 when predicting 500
    - How much does the scheme improve on simply predicting the average?
      - Relative Mean Squared Error:

        $$RSE_S(h) = \frac{\sum_{x \in S} (f(x) - h(x))^2}{\sum_{x \in S} (f(x) - \bar{f})^2}$$

      - Relative Mean Absolute Error:

        $$RAE_S(h) = \frac{\sum_{x \in S} |f(x) - h(x)|}{\sum_{x \in S} |f(x) - \bar{f}|}$$

      - Here $\bar{f}$ is the average of the actual values f(x) over S.
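A sketch of both relative measures, taking the baseline to be the mean of the actual values (the usual definition, assumed here). The data reuses the ten predicted/actual pairs from the regression example; the function names are hypothetical.

```python
def relative_squared_error(actual, predicted):
    """Squared error of the model divided by that of always predicting the mean."""
    mean = sum(actual) / len(actual)
    num = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    den = sum((a - mean) ** 2 for a in actual)
    return num / den

def relative_absolute_error(actual, predicted):
    """Absolute error of the model divided by that of always predicting the mean."""
    mean = sum(actual) / len(actual)
    num = sum(abs(a - p) for a, p in zip(actual, predicted))
    den = sum(abs(a - mean) for a in actual)
    return num / den

actual = [102, 110, 95, 75, 103, 110, 98, 32, 215, 103]
predicted = [100, 102, 105, 95, 101, 105, 105, 40, 220, 100]

rse = relative_squared_error(actual, predicted)
rae = relative_absolute_error(actual, predicted)

# The baseline itself (always predicting the mean) scores exactly 1 on both.
baseline = [sum(actual) / len(actual)] * len(actual)
rse_baseline = relative_squared_error(actual, baseline)
rae_baseline = relative_absolute_error(actual, baseline)
```

Values below 1 mean the model improves on the average predictor; values above 1 mean it does worse.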
27. Evaluation of Regressors
    - A related measure is R² (coefficient of determination)
      - A number that indicates how well the data fit a statistical model
        - Sum of squares of residuals: $SSres_S(h) = \sum_{x \in S} (f(x) - h(x))^2$
        - Total sum of squares: $SStot_S(h) = \sum_{x \in S} (f(x) - \bar{f})^2$
        - Coefficient of determination: $R^2_S(h) = 1 - \frac{SSres_S(h)}{SStot_S(h)}$
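A minimal sketch of the coefficient of determination built from the two sums of squares. The four data points are made up, and the function name `r_squared` is hypothetical.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SSres / SStot, with SStot measured around the mean actual value."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]       # made-up values
predicted = [1.1, 1.9, 3.2, 3.8]    # made-up predictions

r2 = r_squared(actual, predicted)          # SSres = 0.10, SStot = 5.0
r2_mean = r_squared(actual, [2.5] * 4)     # always predicting the mean gives 0
r2_perfect = r_squared(actual, actual)     # perfect predictions give 1
```

Since SSres/SStot is the relative squared error with respect to the mean predictor, R² is simply one minus that ratio.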
28. Unsupervised Learning
    - Association rules: the task of discovering interesting relations between variables in databases.
    - Common metrics:
      - Support: estimates the popularity of a rule
      - Confidence: estimates the reliability of a rule
    - Rules are ordered according to a measure that combines both values
    - There is no train/test partition.
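The two metrics can be sketched with their usual definitions, assumed here: the support of a rule A → B is the fraction of transactions that contain both A and B, and its confidence is that support divided by the support of A. The shopping baskets below are made up, and the function names are hypothetical.

```python
# Made-up transactions for illustration.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "butter", "beer"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"bread", "butter"})           # 3 of 5 transactions
rule_confidence = confidence({"bread"}, {"butter"})   # 3 of the 4 bread baskets
```

Both values are computed on the full set of transactions, which matches the point that no train/test partition is used.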
29. Unsupervised Learning
    - Clustering: the task of grouping a set of objects in such a way that objects in the same group (cluster) are more similar to each other than to those in other clusters.
      - A difficult task to evaluate
    - Some evaluation measures based on distance:
      - Distance between the borders of clusters
      - Distance between the centers (centroids) of clusters
      - Radius and density of clusters
30. Lessons learned
    - Model evaluation is a fundamental phase in the knowledge discovery process.
    - In classification, we need to choose the proper metric for the feature and context we want to analyse.
    - There are several (and sometimes equivalent) measures for regression models.
    - Supervised models are easier to evaluate since we have an estimate of the ground truth.
31. To know more
    - Witten, I.H., Frank, E. (2005) "Data Mining: Practical Machine Learning Tools and Techniques", Morgan Kaufmann.
    - Flach, P.A. (2012) "Machine Learning: The Art and Science of Algorithms that Make Sense of Data", Cambridge University Press.
    - Hand, D.J. (1997) "Construction and Assessment of Classification Rules", Wiley.
    - Ferri, C., Hernández-Orallo, J., Modroiu, R. (2009) "An experimental comparison of performance measures for classification", Pattern Recognition Letters 30(1): 27-38.
    - Japkowicz, N., Shah, M. (2011) "Evaluating Learning Algorithms: A Classification Perspective", Cambridge University Press.
    - Bella, A., Ferri, C., Hernández-Orallo, J., Ramírez-Quintana, M.J. (2013) "On the effect of calibration in classifier combination", Applied Intelligence 38(4): 566-585.
    - Hernández-Orallo, J., Flach, P.A., Ferri, C. (2011) "Brier Curves: a New Cost-Based Visualisation of Classifier Performance", ICML 2011: 585-592.