Decision Trees
Legal Notices and Disclaimers
This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES,
EXPRESS OR IMPLIED, IN THIS SUMMARY.
Intel technologies’ features and benefits depend on system configuration and may require
enabled hardware, software or service activation. Performance varies depending on system
configuration. Check with your system manufacturer or retailer or learn more at intel.com.
This sample source code is released under the Intel Sample Source Code License Agreement.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2017, Intel Corporation. All rights reserved.
Overview of Classifier Characteristics
(Figure: data points in the X-Y plane with a flexible K-Nearest Neighbors decision boundary)
• For K-Nearest Neighbors, the training data is the model
• Fitting is fast—just store the data
• Prediction can be slow—lots of distances to measure
• Decision boundary is flexible
Overview of Classifier Characteristics
(Figure: logistic curve of probability, from 0.0 to 1.0, versus X, given by y_β(x) = 1 / (1 + e^−(β0 + β1·x + ε)))
• For logistic regression, the model is just the parameters
• Fitting can be slow—must find the best parameters
• Prediction is fast—calculate the expected value
• Decision boundary is simple, less flexible
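To make the recap concrete, here is a minimal, illustrative sketch in scikit-learn (the synthetic dataset and parameter values are assumptions, not part of the deck):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)    # "fitting" mostly just stores the data
logit = LogisticRegression(max_iter=1000).fit(X, y)    # fitting searches for the best parameters

knn.predict(X[:5])      # measures distances to all stored training points
logit.predict(X[:5])    # one dot product pushed through the sigmoid
logit.coef_             # the whole logistic regression "model" is these parameters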
Introduction to Decision Trees
• Want to predict whether to play tennis based on temperature, humidity, wind, and outlook
• Segment the data based on features to predict the result
• Trees that predict categorical results are decision trees
(Figure: example decision tree built one split at a time. Internal nodes test features, e.g. "Temperature: >= Mild" and "Humidity: = Normal"; leaves hold the predictions "Play Tennis" or "No Tennis".)
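To ground the tennis example in code, a minimal sketch (the toy rows, labels, and the max_depth value below are made up for illustration; the deck does not show the underlying dataset):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data mirroring the features named above.
data = pd.DataFrame({
    "Temperature": ["Hot", "Mild", "Cool", "Mild", "Hot", "Cool"],
    "Humidity":    ["High", "Normal", "Normal", "High", "High", "Normal"],
    "Wind":        ["Weak", "Weak", "Strong", "Strong", "Weak", "Weak"],
    "Outlook":     ["Sunny", "Overcast", "Rain", "Sunny", "Rain", "Overcast"],
    "PlayTennis":  ["No", "Yes", "Yes", "No", "No", "Yes"],
})

# scikit-learn trees expect numeric inputs, so one-hot encode the categories.
X = pd.get_dummies(data.drop(columns="PlayTennis"))
y = data["PlayTennis"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
tree.predict(X.iloc[:2])   # predicts the categorical label ("Yes" / "No")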
Regression Trees Predict Continuous Values
• Example: use slope and elevation in the Himalayas
• Predict average precipitation (a continuous value)
• Values at the leaves are averages of their members
(Figure: regression tree whose nodes test "Elevation: < 7900 ft." and "Slope: < 2.5º" and whose leaves hold average precipitation values of 13.67 in., 48.50 in., and 55.42 in.)
Regression Trees Predict Continuous Values
(Figure: regression tree fits to noisy one-dimensional data for max_depth=2 and max_depth=5; x ranges from 0 to 5, predictions from about -2.0 to 2.0)
Source: http://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
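A sketch along the lines of the scikit-learn example cited above, assuming the same noisy sine-style data: fit two regression trees of different depth and predict a continuous value.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)      # single feature on [0, 5]
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))            # add noise to every 5th point

shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)
deep = DecisionTreeRegressor(max_depth=5).fit(X, y)

X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_shallow = shallow.predict(X_test)           # a few piecewise-constant steps
y_deep = deep.predict(X_test)                 # more steps, follows the noise more closely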
Building a Decision Tree
• Select a feature and split the data into a binary tree
• Continue splitting with the available features
How Long to Keep Splitting?
Until:
• Leaf nodes are pure (only one class remains)
• A maximum depth is reached
• A performance metric is achieved
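These stopping rules map directly onto DecisionTreeClassifier hyperparameters; a minimal sketch (the specific values are illustrative assumptions, not recommendations):

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,                 # stop at a maximum depth
    min_samples_leaf=10,         # never create leaves smaller than this
    min_impurity_decrease=0.01,  # stop when a split no longer improves the metric enough
)
# With the defaults, splitting also stops on its own once a node is pure
# (it contains only one class).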
Building the Best Decision Tree
• Use greedy search: find the best split at each step
• What defines the best split?
• The one that maximizes the information gained from the split
• How is information gain defined?
(Figure: the "Temperature: >= Mild" split again, with "Play Tennis" and "No Tennis" leaves)
Splitting Based on Classification Error
(Figure: the "Temperature: >= Mild" split. Parent node: 8 Yes, 4 No. "No Tennis" branch: 2 Yes, 2 No. "Play Tennis" branch: 6 Yes, 2 No.)
Classification error equation: E(t) = 1 − max_i [p(i|t)]
Classification error before the split: 1 − 8/12 = 0.3333
Classification error, left side: 1 − 2/4 = 0.5000 (information is lost on a small number of data points)
Classification error, right side: 1 − 6/8 = 0.2500
Classification error change: 0.3333 − (4/12)·0.5000 − (8/12)·0.2500 = 0
• Using classification error, no further splits would occur
• Problem: the end nodes are not homogeneous
• Try a different metric?
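The arithmetic above can be reproduced in a few lines of Python (the helper function is written here just for illustration):

def classification_error(counts):
    """1 - max_i p(i|t) for a node, given its class counts."""
    total = sum(counts)
    return 1 - max(counts) / total

parent, left, right = [8, 4], [2, 2], [6, 2]          # Yes/No counts from the slide

e_parent = classification_error(parent)               # 0.3333
e_left = classification_error(left)                   # 0.5000
e_right = classification_error(right)                 # 0.2500
change = e_parent - (4 / 12) * e_left - (8 / 12) * e_right
print(round(change, 4))                               # 0.0 -> no apparent gain, splitting stalls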
Splitting Based on Entropy
(Figure: the same "Temperature: >= Mild" split. Parent node: 8 Yes, 4 No. "No Tennis" branch: 2 Yes, 2 No. "Play Tennis" branch: 6 Yes, 2 No.)
Entropy equation: H(t) = −Σ_{i=1..n} p(i|t) · log2[p(i|t)]
Entropy before the split: −(8/12)·log2(8/12) − (4/12)·log2(4/12) = 0.9183
Entropy, left side: −(2/4)·log2(2/4) − (2/4)·log2(2/4) = 1.0000
Entropy, right side: −(6/8)·log2(6/8) − (2/8)·log2(2/8) = 0.8113
Entropy change (information gain): 0.9183 − (4/12)·1.0000 − (8/12)·0.8113 = 0.0441
• Splitting based on entropy allows further splits to occur
• Can eventually reach the goal of homogeneous nodes
• Why does this work with entropy but not classification error?
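The same split evaluated with entropy, again as a small illustrative script:

from math import log2

def entropy(counts):
    """-sum_i p(i|t) * log2 p(i|t) for a node, given its class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

parent, left, right = [8, 4], [2, 2], [6, 2]          # Yes/No counts from the slide

h_parent = entropy(parent)                            # 0.9183
h_left = entropy(left)                                # 1.0000
h_right = entropy(right)                              # 0.8113
info_gain = h_parent - (4 / 12) * h_left - (8 / 12) * h_right
print(round(info_gain, 4))                            # 0.0441 -> positive gain, so keep splitting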
Classification Error vs Entropy
(Figure: classification error and cross entropy plotted against purity from 0.0 to 1.0; both peak at purity 0.5)
• Classification error, E(t) = 1 − max_i [p(i|t)], is a flat function with its maximum at the center
• The center represents ambiguity—a 50/50 split
• Splitting metrics favor results that are furthest away from the center
• Entropy, H(t) = −Σ_{i=1..n} p(i|t) · log2[p(i|t)], has the same maximum but is curved
• The curvature allows splitting to continue until the nodes are pure
• How does this work?
Information Gained by Splitting
• With classification error, the function is flat
• The final average classification error can be identical to the parent's
• This results in premature stopping
• With entropy, the function has a "bulge"
• The bulge allows the average information of the children to be less than the parent's
• This results in information gain and continued splitting
The Gini Index
(Figure: Gini index plotted alongside classification error and cross entropy against purity from 0.0 to 1.0)
Gini index equation: G(t) = 1 − Σ_{i=1..n} [p(i|t)]²
• In practice, the Gini index is often used for splitting
• The function is similar to entropy—it has a bulge
• It does not contain a logarithm
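To see why the curved measures keep splitting going, here is a small sketch that evaluates all three impurity functions for a two-class node as the fraction p of one class varies (a purely illustrative script):

import numpy as np

def misclassification(p):
    return 1 - np.maximum(p, 1 - p)

def cross_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)              # avoid log2(0)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def gini(p):
    return 1 - (p ** 2 + (1 - p) ** 2)

p = np.linspace(0.0, 1.0, 101)
curves = {"error": misclassification(p), "entropy": cross_entropy(p), "gini": gini(p)}
# All three peak at p = 0.5; entropy and Gini are curved ("bulged"), while
# classification error is piecewise linear, which is why averaging the child
# errors can exactly match the parent and stall further splitting.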
Decision Trees are High Variance
• Problem: decision trees tend to overfit
• Small changes in the data greatly affect the prediction (high variance)
• Solution: prune trees
Pruning Decision Trees
• How to decide which leaves to prune?
• Solution: prune based on a classification error threshold, E(t) = 1 − max_i [p(i|t)]
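The deck prunes on a classification error threshold; scikit-learn's built-in mechanism is cost-complexity pruning via the ccp_alpha parameter, which is a different but related technique. A sketch using an illustrative dataset:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate pruning strengths for this training set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Larger ccp_alpha prunes more aggressively; keep the value that scores best on held-out data.
scores = [
    (alpha, DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
            .fit(X_train, y_train)
            .score(X_test, y_test))
    for alpha in path.ccp_alphas
]
best_alpha, best_score = max(scores, key=lambda s: s[1])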
Strengths of Decision Trees
• Easy to interpret and implement—"if … then … else" logic
• Handle any data category—binary, ordinal, continuous
• No preprocessing or scaling required
DecisionTreeClassifier: The Syntax
Import the class containing the classification method
from sklearn.tree import DecisionTreeClassifier
Create an instance of the class
DTC = DecisionTreeClassifier(criterion='gini',
                             max_features=10, max_depth=5)  # tree parameters
Fit the instance on the data and then predict the expected value
DTC = DTC.fit(X_train, y_train)
y_predict = DTC.predict(X_test)
Tune parameters with cross-validation. Use DecisionTreeRegressor for regression.
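One common way to do that tuning is a grid search with cross-validation; a minimal sketch (the dataset and parameter grid below are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

param_grid = {"max_depth": [2, 3, 5, None], "criterion": ["gini", "entropy"]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

search.best_params_                  # best hyperparameters found by 5-fold cross-validation
best_tree = search.best_estimator_   # already refit on all of X, y with those parameters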