Dr. Oner Celepcikay
CS 4319
Machine Learning
Week 6
Data Science Tool I – Classification Part II
Tree Induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.

Issues
- Determining how to split the records
  - How to specify the attribute test condition?
  - How to determine the best split?
- Determining when to stop splitting
Stopping Criteria for Tree Induction
- Stop expanding a node when all the records belong to the same class
- Stop expanding a node when all the records have similar attribute values
- Early termination (to be discussed later)
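The greedy split step can be sketched in a few lines of plain Python. The names `gini` and `best_split` are illustrative, not from the lecture; a full learner (e.g., CART) would loop this over every attribute and recurse until a stopping criterion fires:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, attr_index):
    """Greedy step: pick the threshold on one attribute that minimizes
    the weighted Gini impurity of the two resulting child nodes."""
    n = len(rows)
    best = (None, float("inf"))
    for threshold in sorted({r[attr_index] for r in rows}):
        left = [l for r, l in zip(rows, labels) if r[attr_index] <= threshold]
        right = [l for r, l in zip(rows, labels) if r[attr_index] > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best
```

On a toy attribute where the class flips at 2, `best_split` finds that threshold with weighted impurity 0, i.e., a perfectly pure split.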
Practical Issues of Classification
- Underfitting and overfitting
- Missing values
- Costs of classification
Underfitting and Overfitting
- Underfitting: the model is too simple, so both training and test errors are large
- Overfitting: the model is too complex, so training error keeps shrinking while test error grows
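A toy numeric illustration of the two failure modes. The 1-D data, the threshold rule (label 1 when x > 5), and the single noisy training label are invented for this sketch: a 1-NN model memorizes the training set and overfits, while a constant majority predictor is too simple and underfits:

```python
# True concept: label 1 when x > 5. The point x=4 is mislabeled noise.
X_train, y_train = [1, 2, 3, 4, 6, 7, 8], [0, 0, 0, 1, 1, 1, 1]
X_test,  y_test  = [0, 4.5, 5.5, 9],      [0, 0,   1,   1]

def error_rate(predict, X, y):
    """Fraction of points the predictor gets wrong."""
    return sum(predict(x) != t for x, t in zip(X, y)) / len(y)

def one_nn(x):
    """Overfit: memorize training data, answer with the nearest point's label."""
    return y_train[min(range(len(X_train)), key=lambda i: abs(X_train[i] - x))]

def majority(x):
    """Underfit: ignore x entirely, always predict the majority class."""
    return max(set(y_train), key=y_train.count)

# one_nn: zero training error but noticeable test error (overfitting).
# majority: large error on BOTH training and test sets (underfitting).
```

The memorizer scores perfectly on its own training data yet stumbles on the test point near the noisy label; the constant predictor is equally bad everywhere.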
Overfitting due to Noise
- The decision boundary is distorted by a noise point
- Bats and whales are misclassified as non-mammals instead of mammals
- Both humans and dolphins are misclassified as non-mammals because their Body_Temp, Gives_Birth, and Four-legged values are identical to mislabeled records in the training set
- Spiny anteaters represent an exceptional case (in the training set, every warm-blooded animal that does not give birth is a non-mammal)
- The decision tree fits the training data perfectly (training error = 0), but its error rate on the test data is 30%
Estimating Generalization Errors
- Re-substitution error: error on the training set, e(t)
- Generalization error: error on the test set, e'(t)
- Methods for estimating generalization error:
  - Optimistic approach: e'(t) = e(t)
  - Pessimistic approach: for each leaf node, e'(t) = e(t) + 0.5; total error e'(T) = e(T) + N × 0.5, where N is the number of leaf nodes
    - Example: for a tree with 30 leaf nodes and 10 errors on training (out of 1000 instances): training error = 10/1000 = 1%; generalization error = (10 + 30 × 0.5)/1000 = 2.5%
  - Reduced error pruning (REP): uses a validation data set to estimate generalization error
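The pessimistic estimate is a one-liner; the function name is mine, the formula and the worked numbers are the slide's:

```python
def pessimistic_error(train_errors, n_leaves, n_instances, penalty=0.5):
    """Pessimistic estimate e'(T) = (e(T) + N * penalty) / n_instances,
    charging a fixed penalty (default 0.5) per leaf node."""
    return (train_errors + n_leaves * penalty) / n_instances

# Slide example: 10 training errors, 30 leaves, 1000 instances.
# Training error = 10/1000 = 1%; pessimistic estimate = (10 + 15)/1000 = 2.5%.
```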
Occam’s Razor
- Given two models with similar generalization errors, prefer the simpler model over the more complex one
- For complex models, there is a greater chance that the fit was obtained accidentally from errors in the data
- Therefore, model complexity should be taken into account when evaluating a model
How to Address Overfitting
- Pre-pruning (early stopping rule): stop the algorithm before the tree is fully grown
  - Typical stopping conditions for a node:
    - Stop if all instances belong to the same class
    - Stop if all the attribute values are the same
  - More restrictive conditions:
    - Stop if the number of instances is less than some user-specified threshold
    - Stop if the class distribution of the instances is independent of the available features
    - Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)
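The node-level stopping conditions can be phrased as simple predicates. The function `should_stop` and its default threshold are illustrative (the class-distribution independence check, typically a chi-square test, is omitted for brevity):

```python
def should_stop(labels, rows, min_instances=5):
    """Pre-pruning checks from the slide, applied to one candidate node.
    labels: class labels at the node; rows: attribute tuples at the node."""
    if len(set(labels)) <= 1:                # all instances in the same class
        return True
    if len({tuple(r) for r in rows}) <= 1:   # all attribute values identical
        return True
    if len(rows) < min_instances:            # below user-specified threshold
        return True
    return False
```

A tree builder would call this before attempting a split and emit a leaf whenever it returns True.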
How to Address Overfitting…
- Post-pruning:
  - Grow the decision tree to its entirety
  - Trim the nodes of the decision tree in a bottom-up fashion
  - If the generalization error improves after trimming, replace the sub-tree with a leaf node
  - The class label of the new leaf node is determined by the majority class of the instances in the sub-tree
Example of Post-Pruning
- Training error (before splitting) = 10/30
- Pessimistic error (before splitting) = (10 + 0.5)/30 = 10.5/30
- Training error (after splitting) = 7/30
- Pessimistic error (after splitting) = (7 + 4 × 0.5)/30 = 9/30

Node before splitting: Class = Yes: 20, Class = No: 10, Error = ?

Leaf nodes after splitting:
  (Yes: 8, No: 4)  (Yes: 2, No: 5)  (Yes: 6, No: 1)  (Yes: 4, No: 0)

PRUNE OR DO NOT PRUNE?
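Plugging the example's counts into the pessimistic formula settles the prune-or-not question; the variable names are mine, the counts are from the slide:

```python
def pessimistic(errors, n_leaves, n, penalty=0.5):
    """Pessimistic error estimate: (e(T) + N * penalty) / n."""
    return (errors + n_leaves * penalty) / n

# Before splitting: one leaf labeled Yes over (20 Yes, 10 No) -> 10 errors.
before = pessimistic(10, 1, 30)      # 10.5/30 = 0.35
# After splitting into 4 leaves (8,4) (2,5) (6,1) (4,0) -> 4+2+1+0 = 7 errors.
after = pessimistic(7, 4, 30)        # 9/30 = 0.30
# Collapse the sub-tree only if splitting does not lower the estimate.
prune = after >= before
```

Here the split lowers the pessimistic estimate (9/30 < 10.5/30), so this sub-tree is kept.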
Handling Missing Attribute Values
Missing values affect decision tree construction in three different ways:
- They affect how impurity measures are computed
- They affect how instances with missing values are distributed to child nodes
- They affect how a test instance with a missing value is classified
Model Evaluation
- Metrics for performance evaluation: how to evaluate the performance of a model?
- Methods for performance evaluation: how to obtain reliable estimates?
- Methods for model comparison: how to compare the relative performance among competing models?
Metrics for Performance Evaluation
- Focus on the predictive capability of a model, rather than on how fast it classifies or builds models, scalability, etc.
- Confusion matrix:

                            PREDICTED CLASS
                            Class=Yes    Class=No
    ACTUAL     Class=Yes    a            b
    CLASS      Class=No     c            d

  a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Metrics for Performance Evaluation…
The most widely used metric is accuracy:

  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

                            PREDICTED CLASS
                            Class=Yes    Class=No
    ACTUAL     Class=Yes    a (TP)       b (FN)
    CLASS      Class=No     c (FP)       d (TN)
Limitation of Accuracy
- Consider a 2-class problem: 9990 examples of class 0 and 10 examples of class 1
- If the model predicts everything to be class 0, its accuracy is 9990/10000 = 99.9%
- Accuracy is misleading here because the model does not detect a single class 1 example
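Accuracy from the confusion-matrix cells, checked on the skewed example above (treating class 1 as the positive class); the helper name is mine:

```python
def accuracy(tp, fn, fp, tn):
    """(a + d) / (a + b + c + d): correct predictions over all predictions."""
    return (tp + tn) / (tp + fn + fp + tn)

# Predict everything as class 0 on the 9990-vs-10 data:
# TP=0, FN=10, FP=0, TN=9990 -> 99.9% accuracy, zero class-1 detections.
```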
Cost Matrix
C(i|j): cost of misclassifying a class j example as class i

                            PREDICTED CLASS
    C(i|j)                  Class=Yes     Class=No
    ACTUAL     Class=Yes    C(Yes|Yes)    C(No|Yes)
    CLASS      Class=No     C(Yes|No)     C(No|No)
Computing Cost of Classification

Cost matrix:
                      PREDICTED CLASS
    C(i|j)            +        -
    ACTUAL     +     -1      100
    CLASS      -      1        0

Model M1:
                      PREDICTED CLASS
                      +        -
    ACTUAL     +    150       40
    CLASS      -     60      250

  Accuracy = 80%, Cost = 3910

Model M2:
                      PREDICTED CLASS
                      +        -
    ACTUAL     +    250       45
    CLASS      -      5      200

  Accuracy = 90%, Cost = 4255
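Both cost figures can be reproduced by a cell-wise sum over the confusion counts. The matrix layout (rows = actual class, columns = predicted class) and the variable names are my own; the numbers are from the tables above:

```python
def total_cost(counts, costs):
    """Sum over cells: (number of class-i examples predicted as class j)
    times the cost of that outcome. Rows = actual, columns = predicted."""
    return sum(counts[i][j] * costs[i][j]
               for i in range(len(counts)) for j in range(len(counts[i])))

cost_matrix = [[-1, 100],   # actual +: C(+|+) = -1, C(-|+) = 100
               [ 1,   0]]   # actual -: C(+|-) =  1, C(-|-) =   0

m1 = [[150, 40], [60, 250]]   # cost 3910, accuracy 80%
m2 = [[250, 45], [ 5, 200]]   # cost 4255, accuracy 90%
```

Note that M2 is more accurate yet more costly, which is the point of the slide: accuracy and cost can rank models differently.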
Cost vs Accuracy

Count:
                            PREDICTED CLASS
                            Class=Yes    Class=No
    ACTUAL     Class=Yes    a            b
    CLASS      Class=No     c            d

Cost:
                            PREDICTED CLASS
                            Class=Yes    Class=No
    ACTUAL     Class=Yes    p            q
    CLASS      Class=No     q            p
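With the symmetric costs in the table (correct predictions cost p, errors cost q), total cost becomes a linear function of accuracy. A short derivation, writing N = a + b + c + d and Accuracy = (a + d)/N:

```latex
\begin{aligned}
\text{Cost} &= p\,(a + d) + q\,(b + c) \\
            &= p\,(a + d) + q\,\bigl(N - (a + d)\bigr) \\
            &= qN - (q - p)(a + d) \\
            &= N\,\bigl[q - (q - p)\,\text{Accuracy}\bigr]
\end{aligned}
```

So when q > p, minimizing cost is equivalent to maximizing accuracy; with asymmetric costs that equivalence breaks down, as the M1/M2 example shows.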
Methods for Performance Evaluation
How to obtain a reliable estimate of performance?

The performance of a model may depend on factors other than the learning algorithm:
- Class distribution
- Cost of misclassification
- Size of the training and test sets
Methods of Estimation
- Holdout: reserve 2/3 for training and 1/3 for testing
- Random subsampling: repeated holdout
- Cross validation: partition the data into k disjoint subsets
  - k-fold: train on k − 1 partitions, test on the remaining one
  - Leave-one-out: k = n
- Stratified sampling: oversampling vs. undersampling
- Bootstrap: sampling with replacement
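The k-fold scheme in the list can be sketched in a few lines; the generator name and the fixed shuffling seed are illustrative choices, not from the lecture:

```python
import random

def k_fold_splits(n, k, seed=0):
    """Partition indices 0..n-1 into k disjoint folds and yield
    (train, test) pairs, each fold serving as the test set exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # shuffle once, then slice into folds
    folds = [idx[i::k] for i in range(k)]
    for test in folds:
        train = [j for f in folds if f is not test for j in f]
        yield train, test

# Leave-one-out is the special case k = n (each test set is one instance).
```

Stratified variants would additionally balance the class distribution within each fold.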