Overview of Classifier Characteristics
[Figure: two-feature scatter plot (X vs. Y) with a K-Nearest Neighbors decision boundary]
• For K-Nearest Neighbors, training data is the model
• Fitting is fast—just store data
• Prediction can be slow—lots of distances to measure
• Decision boundary is flexible
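These trade-offs can be seen directly in scikit-learn; a minimal sketch (synthetic data, parameter values chosen only for illustration):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-feature classification problem
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)      # fast: essentially just stores the training data
y_pred = knn.predict(X_test)   # slower: computes distances to the stored points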
Overview of Classifier Characteristics
[Figure: logistic curve of predicted probability (0.0 to 1.0, midpoint 0.5) versus feature X]
$y_\beta(x) = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 x + \varepsilon)}}$
• For logistic regression, model is just parameters
• Fitting can be slow—must find best parameters
• Prediction is fast—calculate expected value
• Decision boundary is simple, less flexible
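A minimal sketch (synthetic data, default solver) of the same trade-off with scikit-learn's LogisticRegression:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LogisticRegression()
lr.fit(X_train, y_train)                  # slower: iteratively solves for the beta parameters
y_prob = lr.predict_proba(X_test)[:, 1]   # fast: evaluates 1 / (1 + exp(-(b0 + b1*x + ...)))
y_pred = (y_prob >= 0.5).astype(int)      # threshold at the 0.5 probability shown in the plot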
Introduction to Decision Trees
• Want to predict whether to play tennis based on temperature, humidity, wind, outlook
• Segment data based on features to predict result
• Trees that predict categorical results are decision trees
[Figure: decision tree diagram. Internal nodes test features and leaves hold predictions. The root node splits on Temperature >= Mild; one branch ends in a No Tennis leaf, the other splits on Humidity = Normal into Play Tennis and No Tennis leaves]
Regression Trees Predict Continuous Values
• Example: use slope and elevation in the Himalayas
• Predict average precipitation (continuous value)
• Values at leaves are averages of members
[Figure: regression tree. The root node splits on Elevation < 7900 ft, and one branch splits again on Slope < 2.5°; the leaves predict 13.67 in., 48.50 in., and 55.42 in. of precipitation]
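A minimal sketch of a regression tree in scikit-learn (the slope/elevation data here is synthetic and only meant to echo the figure's leaf values, not reproduce the actual example):
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
elevation = rng.uniform(0, 15000, 200)          # feet
slope = rng.uniform(0, 10, 200)                 # degrees
X = np.column_stack([elevation, slope])
# Assumed precipitation pattern (inches), loosely following the figure's leaf values
y = np.where(elevation < 7900, 13.67, np.where(slope < 2.5, 48.50, 55.42)) + rng.normal(0, 2, 200)

reg = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(reg.predict([[9000, 1.0]]))               # prediction is the average of the matching leaf's members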
Building a Decision Tree
• Select a feature and split the data into a binary tree
• Continue splitting with the available features
How Long to Keep Splitting?
Until:
• Leaf node(s) are pure (only one class remains)
• A maximum depth is reached
• A performance metric is achieved
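These stopping rules correspond to pre-pruning parameters in scikit-learn; a minimal sketch (the specific values are illustrative assumptions, not recommendations):
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(
    max_depth=5,                 # stop when a maximum depth is reached
    min_samples_leaf=10,         # keep at least this many samples in each leaf
    min_impurity_decrease=0.01,  # require at least this much impurity reduction to split
)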
Building the Best Decision Tree
• Use greedy search: find the best split at each step
• What defines the best split?
• One that maximizes the information gained from the split
• How is information gain defined?
[Figure: the Temperature >= Mild split from before, with leaves Play Tennis and No Tennis]
Splitting Based on Classification Error
[Figure: Temperature >= Mild splits the parent node (8 Yes, 4 No) into a No Tennis child (2 Yes, 2 No) and a Play Tennis child (6 Yes, 2 No)]
Classification error equation: $E(t) = 1 - \max_i \, p(i \mid t)$
Classification error before the split: $1 - \frac{8}{12} = 0.3333$
Classification error, left child: $1 - \frac{2}{4} = 0.5000$ (information lost on the small number of data points)
Classification error, right child: $1 - \frac{6}{8} = 0.2500$
Change in classification error: $0.3333 - \frac{4}{12} \times 0.5000 - \frac{8}{12} \times 0.2500 = 0$
• Using classification error, no further splits would occur
• Problem: end nodes are not homogeneous
• Try a different metric?
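The arithmetic above is easy to verify; a small sketch (plain Python, class counts taken from the example split):
def classification_error(counts):
    # E(t) = 1 - max_i p(i|t), where counts holds the class counts at node t
    total = sum(counts)
    return 1 - max(counts) / total

parent, left, right = [8, 4], [2, 2], [6, 2]
n = sum(parent)
change = (classification_error(parent)
          - (sum(left) / n) * classification_error(left)
          - (sum(right) / n) * classification_error(right))
print(round(change, 4))   # 0.0 -> the split shows no apparent improvement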
Splitting Based on Entropy
[Figure: the same Temperature >= Mild split: parent (8 Yes, 4 No), No Tennis child (2 Yes, 2 No), Play Tennis child (6 Yes, 2 No)]
Entropy equation: $H(t) = -\sum_{i=1}^{n} p(i \mid t) \log_2 p(i \mid t)$
Entropy before the split: $-\frac{8}{12} \log_2\!\left(\frac{8}{12}\right) - \frac{4}{12} \log_2\!\left(\frac{4}{12}\right) = 0.9183$
Entropy, left child: $-\frac{2}{4} \log_2\!\left(\frac{2}{4}\right) - \frac{2}{4} \log_2\!\left(\frac{2}{4}\right) = 1.0000$
Entropy, right child: $-\frac{6}{8} \log_2\!\left(\frac{6}{8}\right) - \frac{2}{8} \log_2\!\left(\frac{2}{8}\right) = 0.8113$
Change in entropy: $0.9183 - \frac{4}{12} \times 1.0000 - \frac{8}{12} \times 0.8113 = 0.0441$
• Splitting based on entropy allows further splits to occur
• Can eventually reach goal of homogeneous nodes
• Why does this work with entropy but not classification error?
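The same check for entropy; a small sketch (plain Python, same counts as above):
from math import log2

def entropy(counts):
    # H(t) = -sum_i p(i|t) * log2 p(i|t), skipping empty classes
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

parent, left, right = [8, 4], [2, 2], [6, 2]
n = sum(parent)
info_gain = (entropy(parent)
             - (sum(left) / n) * entropy(left)
             - (sum(right) / n) * entropy(right))
print(round(info_gain, 4))   # 0.0441 -> positive information gain, so splitting continues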
Classification Error vs Entropy
[Figure: classification error and cross entropy plotted against node purity from 0.0 to 1.0; both peak at 0.5, classification error as a flat tent and cross entropy as a curve]
$E(t) = 1 - \max_i \, p(i \mid t)$
$H(t) = -\sum_{i=1}^{n} p(i \mid t) \log_2 p(i \mid t)$
• Classification error is a flat function with maximum at center
• Center represents ambiguity—50/50 split
• Splitting metrics favor results that are furthest away from the center
• Entropy has the same maximum but is curved
• Curvature allows splitting to continue until nodes are pure
• How does this work?
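A quick numeric sketch (binary case, numpy only) of the flat-versus-curved shapes described above:
import numpy as np

p = np.linspace(0.1, 0.9, 9)                              # purity = P(class 1) at a node
class_error = 1 - np.maximum(p, 1 - p)                    # flat tent, maximum 0.5 at p = 0.5
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))    # curved, maximum 1.0 at p = 0.5
for purity, e, h in zip(p, class_error, entropy):
    print(f"p={purity:.1f}  error={e:.3f}  entropy={h:.3f}")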
Information Gained by Splitting
• With classification error, the function is flat
• Final average classification error can be identical to parent
• Resulting in premature stopping
• With entropy gain, the function has a "bulge"
• Allows average information of children to be less than parent
• Results in information gain and continued splitting
Decision Trees are High Variance
• Problem: decision trees tend to overfit
• Small changes in data greatly affect prediction—high variance
• Solution: prune trees
Pruning Decision Trees
• How to decide which leaves to prune?
• Solution: prune based on a classification error threshold
$E(t) = 1 - \max_i \, p(i \mid t)$
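scikit-learn does not expose an error-threshold pruning knob directly; its built-in post-pruning mechanism is cost-complexity pruning via ccp_alpha. A minimal sketch (the alpha value is an arbitrary illustration, and X_train, y_train are reused from the earlier sketches):
from sklearn.tree import DecisionTreeClassifier

# ccp_alpha=0.0 keeps the full, high-variance tree; a small positive value prunes it back
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
pruned_tree.fit(X_train, y_train)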
Strengths of Decision Trees
• Easy to interpret and implement—"if … then … else" logic
• Handle any data category—binary, ordinal, continuous
• No preprocessing or scaling required
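The "if … then … else" structure can be printed directly; a minimal sketch (using the bundled iris dataset purely for illustration):
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)
print(export_text(clf, feature_names=list(iris.feature_names)))   # nested if/else rules, one line per branch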
DecisionTreeClassifier: The Syntax
Import the class containing the classification method
from sklearn.tree import DecisionTreeClassifier
Create an instance of the class (criterion, max_features, and max_depth are the tree parameters)
DTC = DecisionTreeClassifier(criterion='gini', max_features=10, max_depth=5)
Fit the instance on the data and then predict the expected value
DTC = DTC.fit(X_train, y_train)
y_predict = DTC.predict(X_test)
Tune parameters with cross-validation. Use DecisionTreeRegressor for regression.
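A minimal sketch of the cross-validation tuning mentioned above (the grid values are illustrative assumptions, and X_train, y_train, X_test come from the snippet above):
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {'criterion': ['gini', 'entropy'],
              'max_depth': [3, 5, 10, None],
              'max_features': ['sqrt', None]}
grid = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)            # best combination found by cross-validation
y_predict = grid.predict(X_test)    # predictions from the refit best estimator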