Evaluation and Credibility
(How much should we believe in what was learned?)
Tilani Gunawardena
• Basic Algorithms
– 1R
– Naïve Bayes
– Decision Trees
– KNN
– Clustering (K-means, Hierarchical)
– Covering Algorithm
– Association Rule Mining

Outline
• Introduction
• Train, Test and Validation sets
• Evaluation on Large data / Unbalanced data
• Evaluation on Small data
– Cross-validation
– Bootstrap
• Comparing data mining schemes
– Significance test
– Lift Chart / ROC curve
• Numeric Prediction Evaluation
Introduction
• How predictive is the model we learned?
• Error on the training data is not a good indicator of performance on future data
– Q: Why?
– A: Because new data will probably not be exactly the same as the training data!
• Overfitting – fitting the training data too precisely – usually leads to poor results on new data

Model Selection and Bias-Variance Tradeoff
Evaluation issues
• Possible evaluation measures:
– Classification accuracy
– Total cost/benefit – when different errors involve different costs
– Lift and ROC curves
– Error in numeric predictions
• How reliable are the predicted results?
• What we are interested in is the likely future performance on new data, not the past performance on old data.
• Is the error rate on old data likely to be a good indicator of the error rate on new data?
Classifier error rate
• Natural performance measure for classification problems: error rate
– Success: instance’s class is predicted correctly
– Error: instance’s class is predicted incorrectly
– Error rate: proportion of errors made over the whole set of instances
– Error rate = 1 − Accuracy
• Training set error rate (resubstitution error) is way too optimistic!
– you can find patterns even in random data
Evaluation on “LARGE” data
• If many (thousands of) examples are available, including several hundred examples from each class, then how can we evaluate our classifier method/model?
• A simple evaluation is sufficient
– Ex: Randomly split data into training and test sets (usually 2/3 for train, 1/3 for test)
• Build a classifier using the train set and evaluate it using the test set.
• It is important that the test data is not used in any way to create the classifier.
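As a concrete illustration, here is a minimal sketch of this holdout procedure in plain Python. The dataset format (a list of (features, label) pairs) and the train_model/predict helpers are assumptions for illustration, not part of the original slides.

```python
import random

def holdout_split(dataset, train_fraction=2/3, seed=42):
    """Randomly split a list of (features, label) pairs into train and test sets."""
    rng = random.Random(seed)
    shuffled = dataset[:]          # copy so the original order is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def error_rate(model, test_set, predict):
    """Proportion of test instances whose class is predicted incorrectly."""
    errors = sum(1 for x, y in test_set if predict(model, x) != y)
    return errors / len(test_set)

# Usage (train_model and predict stand in for whatever learning scheme is evaluated):
# train, test = holdout_split(data)   # 2/3 for training, 1/3 for testing
# model = train_model(train)          # the test set is never seen here
# print("test error rate:", error_rate(model, test, predict))
```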
Classification Step 1: Split data into train and test sets
[Diagram: historical data with known results (labelled + / −) is divided into a training set and a testing set.]
Classification Step 2: Build a model on the training set
[Diagram: the training set (results known) is fed to a model builder; the testing set is held aside.]
Classification Step 3: Evaluate on the test set (re-train?)
[Diagram: the model built on the training set makes Y/N predictions on the testing set; the predictions are compared with the known results to evaluate the model.]
A note on parameter tuning
• It is important that the test data is not used in any way to create the classifier
• Some learning schemes operate in two stages:
– Stage 1: builds the basic structure
– Stage 2: optimizes parameter settings
• The test data can’t be used for parameter tuning!
• Proper procedure uses three sets: training data, validation data, and test data
– Training data is used by one or more learning schemes to come up with classifiers.
– Validation data is used to optimize the parameters of those classifiers, or to select a particular one.
– Test data is used to calculate the error rate of the final, optimized method.
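A hedged sketch of this three-set procedure, assuming data as (features, label) pairs and placeholder train_model(data, params) / predict(model, x) functions: the validation set picks the parameter setting, and the test set is touched only once at the end.

```python
import random

def three_way_split(dataset, fractions=(0.6, 0.2, 0.2), seed=7):
    """Split data into training, validation and test sets (proportions are illustrative)."""
    rng = random.Random(seed)
    shuffled = dataset[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * fractions[0])
    n_val = int(n * fractions[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def tune_and_evaluate(dataset, candidate_params, train_model, predict):
    train, val, test = three_way_split(dataset)

    def err(model, subset):
        return sum(predict(model, x) != y for x, y in subset) / len(subset)

    # Validation data picks the parameter setting; the test set is untouched here.
    best = min(candidate_params, key=lambda p: err(train_model(train, p), val))
    final_model = train_model(train + val, best)   # optionally retrain on train + validation
    return best, err(final_model, test)            # error of the final, optimized method
```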
Making the most of the data
• Once evaluation is complete, all the data can be
used to build the final classifier
• Generally, the larger the training data the better
the classifier (but returns begin to diminish once
a certain volume of training data is exceeded )
• The larger the test data the more accurate the
error estimate
Classification: Train, Validation, Test split
[Diagram: a model builder is trained on the training set and evaluated on the validation set to tune and select the model; the chosen final model is then evaluated once on a held-out final test set for the final evaluation.]
Unbalanced data
• Sometimes, classes have very unequal frequency
– Accommodation prediction: 97% stay, 3% don’t stay
(in a month)
– Medical diagnosis: 90% healthy, 10% disease
– eCommerce: 99% don’t buy, 1% buy
– Security: >99.99% of Americans are not terrorists
• Similar situation with multiple classes
• Majority class classifier can be 97% correct, but
useless
Handling unbalanced data – how?
• If we have two classes that are very unbalanced, then how can we evaluate our classifier method?
• With two classes, a good approach is to build BALANCED train and test sets, and train the model on a balanced set (see the sketch below)
– randomly select the desired number of minority class instances
– add an equal number of randomly selected majority class instances
• How do we generalize “balancing” to multiple classes?
– Ensure that each class is represented with approximately equal proportions in train and test
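A minimal sketch of the balancing step, assuming two-class data as (features, label) pairs; the minority label and helper names are illustrative, not from the original slides.

```python
import random

def balanced_sample(dataset, minority_label, n_per_class=None, seed=0):
    """Build a balanced subset: minority instances plus an equal number of
    randomly chosen majority instances."""
    rng = random.Random(seed)
    minority = [ex for ex in dataset if ex[1] == minority_label]
    majority = [ex for ex in dataset if ex[1] != minority_label]
    k = min(n_per_class or len(minority), len(minority))   # same count from each class
    sample = rng.sample(minority, k) + rng.sample(majority, k)
    rng.shuffle(sample)
    return sample
```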
Predicting performance
• Assume the estimated error rate is 25%. How
close is this to the true error rate?
– Depends on the amount of test data
• Prediction is just like tossing a biased coin
– “Head” is a “success”, “tail” is an “error”
• In statistics, a succession of independent
events like this is called a Bernoulli process
• Statistical theory provides us with confidence
intervals for the true underlying proportion!
Confidence intervals
• p: a true (but unknown) success rate
• We can say: p lies within a certain specified interval with a certain specified confidence
• Example: S = 750 successes in N = 1000 trials
– Estimated success rate: 75% (observed success rate is f = S/N)
– How close is this to the true success rate p?
• Answer: with 80% confidence, p ∈ [73.2%, 76.7%]
• Another example: S = 75 and N = 100
– Estimated success rate: 75%
– With 80% confidence, p ∈ [69.1%, 80.1%]
Mean and Variance
• Mean and variance for a Bernoulli trial: p, p(1–p)
• Expected success rate f = S/N
• Mean and variance for f: p, p(1–p)/N
• For large enough N, f follows a Normal distribution
• The c% confidence interval [–z ≤ X ≤ z] for a random variable X with 0 mean is given by:
  Pr[–z ≤ X ≤ z] = c
• With a symmetric distribution:
  Pr[–z ≤ X ≤ z] = 1 – 2 × Pr[X ≥ z]
Confidence limits
• Confidence limits for the normal distribution with 0 mean and a variance of 1:

  Pr[X ≥ z]   z
  0.1%        3.09
  0.5%        2.58
  1%          2.33
  5%          1.65
  10%         1.28
  20%         0.84
  40%         0.25

• Pr[X ≥ z] = 5% implies that there is a 5% chance that X lies more than 1.65 standard deviations above the mean.
• Thus: Pr[–1.65 ≤ X ≤ 1.65] = 90%
• To use this we have to reduce our random variable f to have 0 mean and unit variance
Transforming f
• Transformed value for f:
  (f – p) / √( p(1–p)/N )
  (i.e. subtract the mean and divide by the standard deviation)
• Resulting equation:
  Pr[ –z ≤ (f – p) / √( p(1–p)/N ) ≤ z ] = c
• Solving for p:
  p = ( f + z²/(2N) ± z·√( f/N – f²/N + z²/(4N²) ) ) / ( 1 + z²/N )
Examples
• f = 75%, N = 1000, c = 80% (so that z = 1.28): p ∈ [0.732, 0.767]
• f = 75%, N = 100, c = 80% (so that z = 1.28): p ∈ [0.691, 0.801]
• Note that the normal distribution assumption is only valid for large N (i.e. N > 100)
• f = 75%, N = 10, c = 80% (so that z = 1.28): p ∈ [0.549, 0.881]
  (should be taken with a grain of salt)
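These numbers can be reproduced by coding the formula from the previous slide directly; the sketch below assumes z has already been looked up from the table (z = 1.28 for 80% confidence).

```python
from math import sqrt

def confidence_interval(f, n, z):
    """Confidence interval for the true success rate p, given observed rate f on n trials."""
    centre = f + z * z / (2 * n)
    spread = z * sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

for n in (1000, 100, 10):
    low, high = confidence_interval(0.75, n, 1.28)
    print(f"N={n:4d}: p in [{low:.3f}, {high:.3f}]")
# N=1000: p in [0.732, 0.767]
# N= 100: p in [0.691, 0.801]
# N=  10: p in [0.549, 0.881]
```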
Evaluation on “small” data
• The holdout method reserves a certain amount for testing and uses the remainder for training
– Usually: one third for testing, the rest for training
• For “unbalanced” datasets, samples might not be representative
– Few or no instances of some classes
• Stratified sample: advanced version of balancing the data
– Make sure that each class is represented with approximately equal proportions in both subsets
Repeated holdout method
• Holdout estimate can be made more reliable
by repeating the process with different
subsamples
– In each iteration, a certain proportion is randomly
selected for training (possibly with stratification)
– The error rates on the different iterations are
averaged to yield an overall error rate
• This is called the repeated holdout method
Cross-validation
• Cross-validation avoids overlapping test sets
– First step: data is split into k subsets of equal size
– Second step: each subset in turn is used for testing and the remainder for training
• This is called k-fold cross-validation
• Often the subsets are stratified before the cross-validation is performed (stratified k-fold cross-validation)
• The error estimates are averaged to yield an overall error estimate
Cross-validation example:
— Break up the data into groups of the same size
— Hold aside one group for testing and use the rest to build the model
— Repeat, holding aside each group in turn
[Diagram: the data split into equal folds, with one fold marked Test and the remaining folds marked Train in each round.]
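A minimal sketch of (unstratified) k-fold cross-validation, again assuming placeholder train_model and predict functions and data as a list of (features, label) pairs.

```python
import random

def cross_val_error(dataset, train_model, predict, k=10, seed=1):
    """Average error rate over k folds; each instance is tested exactly once."""
    rng = random.Random(seed)
    shuffled = dataset[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]        # k roughly equal-sized subsets
    errors = []
    for i in range(k):
        test = folds[i]                               # this fold is held out for testing
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_model(train)
        errors.append(sum(predict(model, x) != y for x, y in test) / len(test))
    return sum(errors) / k                            # overall error estimate
```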
More on cross-validation
• Standard method for evaluation: stratified ten-
fold cross-validation
• Why 10?
– Extensive experiments have shown that this is the
best choice to get an accurate estimate
• Stratification reduces the estimate’s variance
• Even better: repeated stratified cross-validation
– E.g. ten-fold cross-validation is repeated ten times and
results are averaged (reduces the variance)
Leave-One-Out cross-validation (LOOCV)
• Simply n-fold cross-validation, where n is the number of instances in the dataset
• Leave-One-Out: remove one instance for testing and use the others for training
• A particular form of cross-validation:
– Set the number of folds to the number of training instances
– I.e., for n training instances, build the classifier n times
• Each classifier is judged by its correctness on the remaining instance (one or zero for success or failure)
• The results of all n judgments, one for each member of the dataset, are averaged, and that average represents the final error estimate.
• Makes the best use of the data
– The greatest possible amount of data is used for training in each case, which presumably increases the chance that the classifier is an accurate one.
– Involves no random subsampling
– Leave-one-out seems to offer a chance of squeezing the maximum out of a small dataset and getting as accurate an estimate as possible.
• Very computationally expensive
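Since leave-one-out is just k-fold cross-validation with k = n, the cross-validation sketch above already covers it; the version below only makes the cost explicit: the classifier is built n times.

```python
def leave_one_out_error(dataset, train_model, predict):
    """n-fold cross-validation with n = len(dataset): build the classifier n times."""
    errors = 0
    for i, (x, y) in enumerate(dataset):
        train = dataset[:i] + dataset[i + 1:]   # every instance except the held-out one
        model = train_model(train)
        errors += (predict(model, x) != y)      # one or zero: failure or success
    return errors / len(dataset)
```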
Leave-One-Out-CV and stratification
• Disadvantage of Leave-One-Out-CV: stratification is not
possible
– It guarantees a non-stratified sample because there is only one
instance in the test set!
The Bootstrap
• CV uses sampling without replacement
– The same instance, once selected, can not be selected again for a
particular training/test set
• The bootstrap uses sampling with replacement to form
the training set
– Sample a dataset of n instances n times with replacement to form
a new dataset of n instances
– Use this data as the training set
– Use the instances from the original
dataset that don’t occur in the new
training set for testing
The 0.632 bootstrap
• Also called the 0.632 bootstrap
– A particular instance has a probability of 1 – 1/n of not being picked in one draw
– Thus its probability of ending up in the test data (never being picked in n draws) is:
  (1 – 1/n)^n ≈ e^(–1) ≈ 0.368
– This means the training data will contain approximately 63.2% of the distinct instances
Estimating error with the bootstrap
• The error estimate on the test data will be very pessimistic
– Trained on just ~63% of the instances
• Therefore, combine it with the resubstitution error:
  err = 0.632 × e(test instances) + 0.368 × e(training instances)
• The resubstitution error gets less weight than the error on the test data
• Repeat the process several times with different replacement samples; average the results
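A hedged sketch of a single round of the 0.632 bootstrap estimate (in practice, repeat with different replacement samples and average the results); train_model and predict are the same placeholders as before.

```python
import random

def bootstrap_632_error(dataset, train_model, predict, seed=3):
    """One 0.632 bootstrap round: err = 0.632 * e(test) + 0.368 * e(train)."""
    rng = random.Random(seed)
    n = len(dataset)
    train = [rng.choice(dataset) for _ in range(n)]        # sample n instances with replacement
    chosen = set(id(ex) for ex in train)
    test = [ex for ex in dataset if id(ex) not in chosen]  # out-of-bag instances (~36.8%)
    model = train_model(train)

    def err(subset):
        return sum(predict(model, x) != y for x, y in subset) / len(subset)

    return 0.632 * err(test) + 0.368 * err(train)          # resubstitution error gets less weight
```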
More on the bootstrap
• Probably the best way of estimating performance for very small datasets
• However, it has some problems
– Consider a completely random dataset (recall that patterns can be found even in random data): a classifier that memorizes the training data has 0% resubstitution error and ~50% error on test data
– Bootstrap estimate for this classifier:
  err = 0.632 × 50% + 0.368 × 0% = 31.6%
– True expected error: 50%
Evaluation Summary:
• Use Train, Test, Validation sets for “LARGE”
data
• Balance “un-balanced” data
• Use Cross-validation for small data
• Don’t use test data for parameter tuning - use
separate validation data