[Women in Data Science Meetup ATX] Decision Trees

Decision Trees
Nikolaos Vergos, Ph.D.
Senior Data Scientist, Accordion Health

Overview
 What are decision trees
 Building decision trees
 Purity Metrics
 Entropy
 Information Gain
 GINI index
 Stopping growth of a decision tree
 Ensemble learning

What are decision trees?
 A flowchart-like structure: a graph-based decision support model
 A non-parametric supervised learning technique that can be used for both
categorical (classification) and continuous (regression) output
 Visually engaging and very easy to interpret
 Excellent model for someone transitioning into the world of data science:
 Require little data preparation
 Able to handle multi-output problems

Surviving the Titanic
• Interconnected nodes act as a
series of questions / test
conditions
• Terminal nodes (leaves) show the
output metric
Source: http://www.kdnuggets.com/2016/09/decision-trees-disastrous-overview.html

Questions
 How does the algorithm choose which variables to include in the tree?
 How does the algorithm choose where variables should be located on the
tree?
 How does the algorithm decide to stop “growing” the tree?
 Growing an “optimal” decision tree for a training data set is a computationally
very hard problem (finding the optimal tree is known to be NP-complete)
 We can still grow a “good enough” tree – greedy algorithms have good
performance (they choose the immediately best option available at each step)
 Hunt’s algorithm: a greedy, recursive algorithm that leads to a locally optimal tree

Building a decision tree
 Recursively partition records into smaller and smaller subsets
 Partitioning decision depends on purity:
 Different variables and split options are evaluated to determine which split will
provide the greatest separation between classes
 Goal of a decision tree: to have nodes consisting entirely of members of a single
class
 The "impurity" of a node (the extent to which that node is imbalanced) should be
minimized.
 Several metrics quantify impurity
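
The recursive, greedy partitioning described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation used by any library: the function names (`gini`, `best_split`, `build_tree`, `predict`), the toy data, the `max_depth` default and the choice of Gini impurity as the metric are all hypothetical choices made for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Greedy step: try every (feature, threshold) pair and keep the one
    whose two children have the lowest weighted impurity."""
    best = None  # (weighted_impurity, feature_index, threshold)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] < t]
            right = [y for r, y in zip(rows, labels) if r[f] >= t]
            if not left or not right:
                continue  # not a real partition
            w = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or w < best[0]:
                best = (w, f, t)
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively partition the records until a node is pure,
    no split is possible, or max_depth is reached."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or depth == max_depth:
        return majority  # leaf: predict the majority class
    split = best_split(rows, labels)
    if split is None:
        return majority
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] < t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] >= t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left],
                           depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right],
                            depth + 1, max_depth),
    }

def predict(tree, row):
    """Follow the questions from the root down to a leaf."""
    while isinstance(tree, dict):
        go_left = row[tree["feature"]] < tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree

# Hypothetical toy data: one numeric feature, two classes.
rows = [[1], [2], [3], [4], [5], [6]]
labels = ["no", "no", "no", "yes", "yes", "yes"]
tree = build_tree(rows, labels)
print(tree)
```

On this toy data a single question (feature 0 < 4) already yields two pure leaves, so the recursion stops after one split.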

Entropy
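 Data Set: S, each member of which belongs to a class c1, c2, …, cn
 H(S) = −Σ p_i · log2(p_i), summed over i = 1 … n, where p_i is the fraction of S in class c_i
• H = 0: all elements are the same class
• H = 1: even split between two classes (for n classes the maximum is log2 n)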
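
A minimal sketch of the entropy calculation, assuming base-2 logarithms and class labels given as a plain Python list. The function name `entropy` and the example labels are illustrative, not taken from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

pure = ["a"] * 8                      # all elements in the same class
even_two = ["a"] * 4 + ["b"] * 4      # even split between two classes
even_four = ["a", "b", "c", "d"] * 2  # even split between four classes

print(entropy(pure), entropy(even_two), entropy(even_four))
```

The two-class cases reproduce the limits H = 0 and H = 1; with four equally represented classes the entropy rises to log2(4) = 2.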

Information Gain
 Stems from entropy
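 IG = H(parent) − weighted average of H(children), each child weighted by its share of the parent's records
[Figure: a parent node and two candidate splits, X < 4 and X < 3]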
Source: ACM-SIGKDD Meetup
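
Information gain can be worked through numerically as follows. This is a sketch under assumptions: the label lists are made-up data (the numbers in the original figure are not recoverable), and `entropy` and `information_gain` are illustrative helper names.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """H(parent) minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical parent node: 5 positives and 5 negatives.
parent = ["+"] * 5 + ["-"] * 5

# Two hypothetical candidate splits of the same parent.
split_a = [["+"] * 5, ["-"] * 5]                  # perfectly separates the classes
split_b = [["+"] * 4 + ["-"], ["+"] + ["-"] * 4]  # each child is 80/20

ig_a = information_gain(parent, split_a)
ig_b = information_gain(parent, split_b)
print(ig_a, ig_b)
```

The split with the higher gain (here `split_a`) is the one a greedy tree builder would choose.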

GINI Index
 Expected error rate:
 How often a randomly chosen element from the set would be incorrectly
labeled if it were labeled at random according to the distribution of labels in
the subset.
 GINI = 0: All elements are the same class (perfect separation, perfect purity)
 GINI = 0.5: Even split between two classes (equal representation)
 Similar process:
 Calculate GINI Gain for each potential split
 Choose split with the highest GINI Gain
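
The expected-error-rate definition above corresponds to the usual formula GINI = 1 − Σ p_i², where p_i is the fraction of elements in class i. A small sketch, with an illustrative function name and made-up labels:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: probability that a random element is mislabeled
    when labels are drawn from the node's own class distribution."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

pure = ["a"] * 6                   # one class only
even_two = ["a"] * 3 + ["b"] * 3   # even split between two classes
even_three = ["a", "b", "c"] * 2   # even split between three classes

print(gini(pure), gini(even_two), gini(even_three))
```

The 0.5 ceiling applies to two classes; with three equally represented classes the impurity is 1 − 3·(1/3)² = 2/3.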

GINI Gain
[Figure: a parent node and two candidate splits, X < 4 and X < 3]
GINI Gain = G(parent) − weighted average of G(children)
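
To make the comparison of candidate splits concrete, the sketch below evaluates the thresholds X < 4 and X < 3 on a small made-up data set. The data values and the helper names `gini` and `gini_gain` are assumptions for illustration; they are not the numbers from the original figure.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(xs, labels, threshold):
    """G(parent) minus the size-weighted average G of the two children
    produced by the test x < threshold."""
    left = [y for x, y in zip(xs, labels) if x < threshold]
    right = [y for x, y in zip(xs, labels) if x >= threshold]
    n = len(labels)
    weighted = sum(len(side) / n * gini(side) for side in (left, right) if side)
    return gini(labels) - weighted

# Hypothetical records: one numeric attribute X and a two-class label.
xs = [1, 2, 3, 4, 5, 6]
labels = ["a", "a", "a", "b", "b", "b"]

gains = {t: gini_gain(xs, labels, t) for t in (3, 4)}
best = max(gains, key=gains.get)  # threshold with the highest GINI Gain
print(gains, best)
```

Here X < 4 separates the classes completely (gain 0.5), while X < 3 leaves one impure child (gain 0.25), so X < 4 is selected.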

When to use which?
 Only ~ 2% performance difference
 Entropy might be a bit slower to compute (due to the logarithm)
 Gini for continuous attributes, Entropy for categorical
 Gini to minimize misclassification, Entropy for exploratory analysis
 Gini will tend to find the largest class, Entropy tends to find groups of classes
that make up ~50% of the data
 Default in scikit-learn: Gini. Entropy also available.
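
Switching between the two criteria in scikit-learn is a one-argument change. The sketch below assumes scikit-learn is installed and uses the bundled Iris data set purely as a convenient example; the data set, the 5-fold cross-validation and `random_state=0` are choices made here, not part of the talk.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
    # Mean accuracy over 5 cross-validation folds
    scores[criterion] = cross_val_score(clf, X, y, cv=5).mean()

print(scores)
print(DecisionTreeClassifier().criterion)  # the default criterion
```

On a small, clean data set like this the two criteria would be expected to score very similarly, which is in line with the rule of thumb above.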

When to stop growing?
 Pure leaves
 Pre-set depth of tree: the length of the longest path from the root to a leaf
 Number of cases in a node falls below a preset minimum
 Splitting criterion falls below a certain threshold

How about overfitting?
 Decision Trees are prone to overfitting
 Pre-pruning: set a minimum threshold on the gain, and stop when no split achieves
a gain above this threshold.
 Post–pruning: build the full tree, and then perform pruning as a post-processing
step
 Not supported in scikit-learn as of version 0.18 (cost-complexity pruning via the ccp_alpha parameter was added in version 0.22)
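
The stopping rules and pruning options above map onto constructor parameters of scikit-learn's `DecisionTreeClassifier`. The sketch below assumes a recent scikit-learn: `min_impurity_decrease` and `ccp_alpha` (cost-complexity post-pruning, added in version 0.22) are not available in the 0.18 release mentioned on this slide. The Iris data set and the specific parameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# No limits: grow until every leaf is pure.
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pre-pruning: stop early based on depth, node size and gain.
pre_pruned = DecisionTreeClassifier(
    max_depth=3,                 # pre-set depth of the tree
    min_samples_split=10,        # do not split nodes with fewer cases
    min_impurity_decrease=0.01,  # require a minimum gain per split
    random_state=0,
).fit(X, y)

# Post-pruning: grow the full tree, then prune by cost-complexity.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

for name, model in [("full", full), ("pre", pre_pruned), ("post", post_pruned)]:
    print(name, model.get_depth(), model.get_n_leaves())
```

Comparing depth and leaf counts shows how much each setting restrains growth; suitable values would normally be chosen by cross-validation rather than fixed by hand.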

Ensemble Learning
 Decision Trees can be weak learners with a tendency to overfit training data
 We can combine several weak learners into an overall strong learner
 Averaging methods for reducing variance
 Bagging (Bootstrap Aggregating): train each model on a random sample of the training set, drawn with replacement
 Random Forest Classifier: Build multiple decision trees and let them vote on how
to classify inputs (scikit-learn). Only a subset of features considered to split a
node.
 Boosting methods for reducing bias
 Base estimators (individual trees) are built sequentially; the subset creation is not
random and depends upon the performance of the previous models: each new
subset contains the elements that were (likely to be) misclassified by previous
models.
 AdaBoost, Gradient Boosting, XGBoost
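
The bagging idea can be sketched without any library: train several weak learners on bootstrap samples and let them vote. Everything below is an illustrative assumption (one-split "stumps" as the weak learner, a single numeric feature, 25 models, made-up data). It does not show the per-node feature subsampling of a random forest or the sequential reweighting used by boosting.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Weak learner: a one-split tree (x < t) chosen to minimise training errors."""
    best = None  # (errors, threshold, left_label, right_label)
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        if not left or not right:
            continue
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0]
        errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, l_lab, r_lab)
    if best is None:  # sample had a single x value: always predict the majority
        label = Counter(ys).most_common(1)[0][0]
        return (float("inf"), label, label)
    return best[1:]

def predict_stump(stump, x):
    t, l_lab, r_lab = stump
    return l_lab if x < t else r_lab

def bagging(xs, ys, n_models=25, seed=0):
    """Fit one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def predict_vote(models, x):
    """Majority vote over all models."""
    votes = Counter(predict_stump(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Hypothetical data: class 0 for x <= 5, class 1 for x > 5.
xs = list(range(1, 11))
ys = [0 if x <= 5 else 1 for x in xs]

models = bagging(xs, ys)
print(predict_vote(models, 2), predict_vote(models, 10))
```

Individual stumps place their threshold differently depending on which records their bootstrap sample happened to contain; the vote averages out that variation.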

References & Further Reading
 ACM-SIGKDD Meetup: Advanced Machine Learning with Python
 Kevin Markham: Introduction to Decision Trees (slides, PDF)
 Pang-Ning Tan et al.: “Introduction to Data Mining”, Chapter 4
 Scikit-learn documentation
 Analytics Vidhya: A Complete Tutorial on Tree Based Modeling from Scratch

Thank you for your time!
nvergos@gmail.com
@nvergos
Nikos Vergos
