Decision Trees
Legal Notices and Disclaimers
This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES,
EXPRESS OR IMPLIED, IN THIS SUMMARY.
Intel technologies’ features and benefits depend on system configuration and may require
enabled hardware, software or service activation. Performance varies depending on system
configuration. Check with your system manufacturer or retailer or learn more at intel.com.
This sample source code is released under the Intel Sample Source Code License Agreement.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2017, Intel Corporation. All rights reserved.
Overview of Classifier Characteristics
[Figure: scatter plot of two-class training data on feature axes X and Y]
• For K-Nearest Neighbors,
training data is the model
• Fitting is fast—just store data
• Prediction can be slow—lots
of distances to measure
• Decision boundary is flexible
Overview of Classifier Characteristics
• For logistic regression, model is just parameters
• Fitting can be slow—must find best parameters
• Prediction is fast—calculate expected value
• Decision boundary is simple, less flexible

[Figure: logistic (sigmoid) curve of Probability (0.0 to 1.0, with 0.5 marked) versus X]

$y_\beta(x) = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 x + \varepsilon)}}$
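To make the fit/predict contrast concrete, here is a minimal sketch using scikit-learn's KNeighborsClassifier and LogisticRegression; the synthetic dataset and parameter values are illustrative assumptions, not part of the slides.

# Where the work happens: KNN has a cheap fit and a costly predict;
# logistic regression has a costlier fit and a cheap predict.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)                     # essentially just stores the training data
knn_predictions = knn.predict(X)  # measures many distances per query point

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X, y)                        # iteratively searches for the best parameters
logreg_predictions = logreg.predict(X)  # fast expected-value calculation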
Introduction to Decision Trees
• Want to predict whether to play tennis based on temperature, humidity, wind, outlook
• Segment data based on features to predict result
• Trees that predict categorical results are decision trees

Example tree (nodes are splits, leaves are predictions):
• Root node: Temperature >= Mild?
• If not, the branch ends in a "No Tennis" leaf
• If so, a second node splits on Humidity = Normal, leading to "Play Tennis" and "No Tennis" leaves
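Because a decision tree is just nested if/then/else logic (a strength noted later in the slides), the example tree above can be written roughly as follows; the branch directions are read off the diagram, and the temperature ordering Cool < Mild < Hot is an assumption.

def predict_tennis(temperature, humidity):
    # Illustrative translation of the example tree, not code from the slides.
    if temperature in ("Mild", "Hot"):   # root node: Temperature >= Mild?
        if humidity == "Normal":         # second node: Humidity = Normal?
            return "Play Tennis"
        return "No Tennis"
    return "No Tennis"

print(predict_tennis("Hot", "Normal"))   # Play Tennis
print(predict_tennis("Cool", "High"))    # No Tennis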
Regression Trees Predict Continuous Values
• Example: use slope and elevation in the Himalayas
• Predict average precipitation (a continuous value)
• Values at leaves are averages of their members

Example tree: nodes split on Elevation < 7900 ft. and Slope < 2.5°; the leaves predict 13.67 in., 48.50 in., and 55.42 in. of precipitation.

[Figure: regression-tree fits to noisy one-dimensional data for max_depth=2 and max_depth=5.
Source: http://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html]
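The plot referenced above comes from the linked scikit-learn example; a condensed sketch in the same spirit is shown below, where the noisy sine data follows that example and is not part of the slides.

# Fit regression trees of depth 2 and depth 5 to noisy sine data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))       # add noise to every fifth target

X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)  # few, coarse steps
deep = DecisionTreeRegressor(max_depth=5).fit(X, y)     # many steps, follows the noise
y_shallow, y_deep = shallow.predict(X_test), deep.predict(X_test)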
Building a Decision Tree
• Select a feature and split the data into a binary tree
• Continue splitting with the available features
How Long to Keep Splitting?
Until:
• Leaf node(s) are pure (only
one class remains)
• A maximum depth is
reached
• A performance metric is
achieved
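In scikit-learn these stopping conditions map roughly onto constructor arguments; the sketch below is illustrative and the particular values are assumptions.

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=5,                # stop when a maximum depth is reached
    min_samples_leaf=10,        # stop before leaves become too small
    min_impurity_decrease=1e-3  # stop when a split no longer improves the metric enough
)
# With the defaults, splitting continues until the leaves are pure or cannot be split.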
Building the Best Decision Tree
• Use greedy search: find the best split at each step
• What defines the best split?
• One that maximizes the information gained from the split
• How is information gain defined?

Example split: a node on Temperature >= Mild with a "No Tennis" leaf and a "Play Tennis" leaf.
Splitting Based on Classification Error
Example split: Temperature >= Mild. Parent node: 8 Yes, 4 No. Left ("No Tennis") child: 2 Yes, 2 No. Right ("Play Tennis") child: 6 Yes, 2 No.

Classification Error Equation
$E(t) = 1 - \max_i \left[ p(i \mid t) \right]$

Classification Error Before the Split
$1 - \frac{8}{12} = 0.3333$

Classification Error, Left Side
$1 - \frac{2}{4} = 0.5000$
(Information is lost on the small number of data points.)

Classification Error, Right Side
$1 - \frac{6}{8} = 0.2500$

Classification Error Change
$0.3333 - \frac{4}{12} \times 0.5000 - \frac{8}{12} \times 0.2500 = 0$

• Using classification error, no further splits would occur
• Problem: end nodes are not homogeneous
• Try a different metric?
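The arithmetic above can be reproduced with a few lines of Python; the helper function is illustrative and the counts are taken from the slide.

from fractions import Fraction

def classification_error(counts):
    # E(t) = 1 - max_i p(i|t) for a node with the given class counts
    return 1 - Fraction(max(counts), sum(counts))

parent, left, right = [8, 4], [2, 2], [6, 2]   # (Yes, No) counts
n = sum(parent)
e_parent = classification_error(parent)        # 1/3  (0.3333)
e_left = classification_error(left)            # 1/2  (0.5000)
e_right = classification_error(right)          # 1/4  (0.2500)
change = e_parent - Fraction(sum(left), n) * e_left - Fraction(sum(right), n) * e_right
print(change)                                  # 0 -> no improvement, so no further split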
Splitting Based on Entropy
Same split: Temperature >= Mild. Parent node: 8 Yes, 4 No. Left ("No Tennis") child: 2 Yes, 2 No. Right ("Play Tennis") child: 6 Yes, 2 No.

Entropy Equation
$H(t) = -\sum_{i=1}^{n} p(i \mid t) \log_2 \left[ p(i \mid t) \right]$

Entropy Before the Split
$-\frac{8}{12}\log_2\!\left(\frac{8}{12}\right) - \frac{4}{12}\log_2\!\left(\frac{4}{12}\right) = 0.9183$

Entropy, Left Side
$-\frac{2}{4}\log_2\!\left(\frac{2}{4}\right) - \frac{2}{4}\log_2\!\left(\frac{2}{4}\right) = 1.0000$

Entropy, Right Side
$-\frac{6}{8}\log_2\!\left(\frac{6}{8}\right) - \frac{2}{8}\log_2\!\left(\frac{2}{8}\right) = 0.8113$

Entropy Change
$0.9183 - \frac{4}{12} \times 1.0000 - \frac{8}{12} \times 0.8113 = 0.0441$

• Splitting based on entropy allows further splits to occur
• Can eventually reach the goal of homogeneous nodes
• Why does this work with entropy but not classification error?
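The same counts with the entropy criterion, again as a small illustrative sketch of the arithmetic above:

from math import log2

def entropy(counts):
    # H(t) = -sum_i p(i|t) * log2 p(i|t) for a node with the given class counts
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

parent, left, right = [8, 4], [2, 2], [6, 2]   # (Yes, No) counts
n = sum(parent)
h_parent, h_left, h_right = entropy(parent), entropy(left), entropy(right)
gain = h_parent - (sum(left) / n) * h_left - (sum(right) / n) * h_right
print(round(h_parent, 4), round(h_left, 4), round(h_right, 4), round(gain, 4))
# 0.9183 1.0 0.8113 0.0441 -> positive information gain, so the split is kept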
Classification Error vs Entropy
• Classification error is a flat (piecewise-linear) function with its maximum at the center
• The center represents ambiguity—a 50/50 split
• Splitting metrics favor results that are furthest away from the center

[Figure: classification error plotted against purity (0.0 to 1.0), peaking at 0.5]
$E(t) = 1 - \max_i \left[ p(i \mid t) \right]$

• Entropy has the same maximum but is curved
• Curvature allows splitting to continue until nodes are pure
• How does this work?

[Figure: cross entropy plotted against purity on the same axes, lying above the classification-error curve with the same peak at 0.5]
$H(t) = -\sum_{i=1}^{n} p(i \mid t) \log_2 \left[ p(i \mid t) \right]$
Information Gained by Splitting
• With classification error, the function is flat
• The weighted-average classification error of the children can be identical to the parent's
• This results in premature stopping

• With entropy, the function has a "bulge" (it is strictly concave)
• This allows the weighted-average entropy of the children to be less than the parent's
• The result is positive information gain and continued splitting
The Gini Index
• In practice, Gini index often used for splitting
• Function is similar to entropy—has bulge
• Does not contain logarithm

[Figure: Gini index plotted against purity alongside cross entropy and classification error]
$G(t) = 1 - \sum_{i=1}^{n} p(i \mid t)^2$
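For comparison with the entropy values computed earlier, a short illustrative sketch of the Gini index on the same nodes:

def gini(counts):
    # G(t) = 1 - sum_i p(i|t)^2 for a node with the given class counts
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

for name, counts in [("parent", [8, 4]), ("left", [2, 2]), ("right", [6, 2])]:
    print(name, round(gini(counts), 4))
# parent 0.4444, left 0.5, right 0.375: curved like entropy, but with no logarithm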
Decision Trees are High Variance
• Problem: decision trees tend to overfit
• Small changes in the data greatly affect predictions (high variance)
• Solution: prune trees
Pruning Decision Trees
• How to decide which leaves to prune?
• Solution: prune based on a classification error threshold

$E(t) = 1 - \max_i \left[ p(i \mid t) \right]$
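The slides describe pruning against a classification error threshold. scikit-learn's built-in mechanism is cost-complexity pruning via ccp_alpha, so the sketch below shows that related approach rather than the exact method on the slide; the iris data stands in for a real dataset.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = load_iris(return_X_y=True)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Larger ccp_alpha prunes more aggressively; a middle value is chosen only for illustration.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
print(full_tree.get_n_leaves(), "->", pruned_tree.get_n_leaves())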
Strengths of Decision Trees
• Easy to interpret and implement—"if … then … else" logic
• Handle any data category—binary, ordinal, continuous
• No preprocessing or scaling required
DecisionTreeClassifier: The Syntax

Import the class containing the classification method:
from sklearn.tree import DecisionTreeClassifier

Create an instance of the class (these are the tree parameters):
DTC = DecisionTreeClassifier(criterion='gini', max_features=10, max_depth=5)

Fit the instance on the data and then predict the expected value:
DTC = DTC.fit(X_train, y_train)
y_predict = DTC.predict(X_test)

Tune parameters with cross-validation. Use DecisionTreeRegressor for regression.
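One common way to tune these parameters with cross-validation is a grid search; the data and the parameter grid below are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

param_grid = {"max_depth": [2, 3, 5, None],
              "min_samples_leaf": [1, 5, 10],
              "criterion": ["gini", "entropy"]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))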