Decision Trees
Nikolaos Vergos, Ph.D.
Senior Data Scientist, Accordion Health
Overview
• What are decision trees
• Building decision trees
• Purity metrics
• Entropy
• Information gain
• Gini index
• Stopping growth of a decision tree
• Ensemble learning
What are decision trees?
• A flowchart-like structure; a graph-based decision support model
• A non-parametric supervised learning technique that can be used for both categorical (classification) and continuous (regression) output
• Visually engaging and very easy to interpret
• Excellent model for someone transitioning into the world of data science
• Require little data preparation
• Able to handle multi-output problems
Surviving the Titanic
• Interconnected nodes act as a series of questions / test conditions
• Terminal nodes (leaves) show the output metric
Source: http://www.kdnuggets.com/2016/09/decision-trees-disastrous-overview.html
Questions
• How does the algorithm choose which variables to include in the tree?
• How does the algorithm choose where variables should be located in the tree?
• How does the algorithm decide to stop “growing” the tree?
• Growing an “optimal” decision tree for a training data set is computationally very hard (NP-complete in general)
• We can still grow a “good enough” tree: greedy algorithms, which choose the best option immediately available at each step, perform well in practice
• Hunt’s algorithm: a greedy, recursive algorithm that leads to a local optimum
Building a decision tree
• Recursively partition records into smaller and smaller subsets
• The partitioning decision depends on purity:
• Different variables and split options are evaluated to determine which split will provide the greatest separation between classes
• Goal of a decision tree: to have nodes consisting entirely of members of a single class
• The "impurity" of a node (the extent to which that node mixes members of different classes) should be minimized
• Several metrics quantify impurity
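The recursive, greedy partitioning described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than Hunt's algorithm as published or any library's implementation: the function names (`gini`, `best_split`, `grow`, `predict`), the use of Gini impurity as the purity metric, and the toy data are all assumptions made for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Greedy step: return the (feature, threshold) pair with the lowest
    weighted child impurity, or None if no split improves on the parent."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] < t]
            right = [y for r, y in zip(rows, labels) if r[f] >= t]
            if not left or not right:
                continue  # a split must send records to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def grow(rows, labels):
    """Recursively partition the records; leaves hold the majority class."""
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] < t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] >= t]
    return {
        "feature": f,
        "threshold": t,
        "left": grow([r for r, _ in left], [y for _, y in left]),
        "right": grow([r for r, _ in right], [y for _, y in right]),
    }

def predict(tree, row):
    """Follow the test conditions from the root down to a leaf."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] < tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Hypothetical toy data: one numeric feature, two classes.
rows = [[1], [2], [3], [4], [5], [6]]
labels = ["a", "a", "a", "b", "b", "b"]
tree = grow(rows, labels)
print(tree)
```

Because each step only looks for the best immediate split, the result is a local optimum; a split that looks poor now but enables a good one later is never explored.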
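Entropy
• Data set S, each member of which belongs to one of the classes c1, c2, …, cn
• H(S) = -Σ p_i log2(p_i), summed over i = 1, …, n, where p_i is the fraction of S belonging to class c_i
• H = 0: all elements belong to the same class
• H = 1: even split between two classes (for n equally represented classes the maximum is log2 n)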
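As a quick check of the two boundary cases, here is a small sketch of Shannon entropy, H = -Σ p_i log2(p_i), over a list of class labels. The function name `entropy` and the sample label lists are illustrative assumptions, not part of the original slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    # p * log2(1 / p) is the same term as -p * log2(p)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

pure = ["yes", "yes", "yes", "yes"]   # a single class
even = ["yes", "yes", "no", "no"]     # two classes, even split
four = ["a", "b", "c", "d"]           # four classes, even split

print(entropy(pure), entropy(even), entropy(four))
```

The four-class case shows why H = 1 marks the maximum only for binary problems: with n equally likely classes the entropy is log2 n.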
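Information Gain
• Stems from entropy
• Information gain = H(parent) - weighted average of H(children), with each child weighted by its share of the parent's records
[Figure: a parent node and two candidate splits, X < 4 and X < 3]
Source: ACM-SIGKDD Meetup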
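To make the calculation concrete, the sketch below computes the information gain of two candidate splits. The data points are hypothetical (the figure's actual values are not recoverable from the slide text); only the thresholds X < 4 and X < 3 echo the figure. The helper names `entropy`, `split_labels` and `information_gain` are assumptions for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def split_labels(xs, ys, threshold):
    """Partition the labels by the test condition x < threshold."""
    left = [y for x, y in zip(xs, ys) if x < threshold]
    right = [y for x, y in zip(xs, ys) if x >= threshold]
    return left, right

def information_gain(parent, children):
    """H(parent) minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

# Hypothetical data: feature X and a binary class label.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]

gain_x4 = information_gain(ys, split_labels(xs, ys, 4))
gain_x3 = information_gain(ys, split_labels(xs, ys, 3))
print(f"X < 4: {gain_x4:.3f}   X < 3: {gain_x3:.3f}")
```

On this toy data the split X < 4 separates the classes perfectly, so it recovers the parent's full entropy and would be the one chosen.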
Gini Index
• Expected error rate:
• How often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset
• Gini = 0: all elements belong to the same class (perfect separation, perfect purity)
• Gini = 0.5: even split between two classes (equal representation)
• Similar process:
• Calculate the Gini gain for each potential split
• Choose the split with the highest Gini gain
Gini Gain
[Figure: a parent node and two candidate splits, X < 4 and X < 3]
Gini gain = G(parent) - weighted average of G(children)
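The same process can be sketched with Gini impurity, G = 1 - Σ p_i². As before, the data values are hypothetical and only the thresholds X < 4 and X < 3 come from the figure labels; `gini` and `gini_gain` are names chosen for this example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: chance of mislabeling a random element when labels
    are drawn from the node's own class distribution."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(parent, children):
    """G(parent) minus the size-weighted average impurity of the children."""
    n = len(parent)
    weighted = sum(len(ch) / n * gini(ch) for ch in children)
    return gini(parent) - weighted

# Hypothetical data: feature X and a binary class label.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]

def children_for(threshold):
    """Labels on each side of the test condition x < threshold."""
    left = [y for x, y in zip(xs, ys) if x < threshold]
    right = [y for x, y in zip(xs, ys) if x >= threshold]
    return left, right

gain_x4 = gini_gain(ys, children_for(4))
gain_x3 = gini_gain(ys, children_for(3))
print(f"G(parent) = {gini(ys):.2f}   X < 4: {gain_x4:.2f}   X < 3: {gain_x3:.2f}")
```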
When to use which?
• Only about a 2% difference in performance
• Entropy might be a bit slower to compute (due to the logarithm)
• Gini for continuous attributes, entropy for categorical
• Gini to minimize misclassification, entropy for exploratory analysis
• Gini tends to find the largest class; entropy tends to find groups of classes that make up ~50% of the data
• Default in scikit-learn: Gini. Entropy is also available.
When to stop growing?
• Pure leaves
• Pre-set depth of the tree (depth: the length of the longest path from the root to a leaf)
• Number of cases in a node is less than a pre-set minimum
• Gain from the splitting criterion is less than a certain threshold
How about overfitting?
• Decision trees are prone to overfitting
• Pre-pruning: set a minimum threshold on the gain, and stop when no split achieves a gain above this threshold
• Post-pruning: build the full tree, and then perform pruning as a post-processing step
• Not supported in scikit-learn as of version 0.18
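The stopping rules listed above map onto constructor parameters of scikit-learn's `DecisionTreeClassifier`. The sketch below assumes a reasonably recent scikit-learn release (`min_impurity_decrease` and the `get_depth` / `get_n_leaves` helpers arrived after the 0.18 release mentioned in the slides); the iris dataset and the specific threshold values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# No limits: the tree grows until every leaf is pure.
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pre-pruned tree: each argument corresponds to one stopping rule.
pruned = DecisionTreeClassifier(
    max_depth=3,                 # pre-set depth of the tree
    min_samples_leaf=5,          # minimum number of cases in a leaf
    min_impurity_decrease=0.01,  # minimum gain required to split
    random_state=0,
).fit(X, y)

for name, model in [("full", full), ("pruned", pruned)]:
    print(name, "depth:", model.get_depth(), "leaves:", model.get_n_leaves())
```

Comparing the two on held-out data (not shown here) is the usual way to judge whether the extra depth of the full tree is fitting signal or noise.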
Ensemble Learning
• Decision trees can be weak learners with a tendency to overfit the training data
• We can combine several weak learners into an overall strong learner
• Averaging methods for reducing variance
• Bagging (Bootstrap Aggregating): train each learner on a random subset (bootstrap sample) of the training set
• Random Forest Classifier: build multiple decision trees and let them vote on how to classify inputs (scikit-learn). Only a subset of features is considered when splitting a node.
• Boosting methods for reducing bias
• Base estimators (individual trees) are built sequentially; the subset creation is not random and depends upon the performance of the previous models: every new subset contains the elements that were (likely to be) misclassified by previous models.
• AdaBoost, Gradient Boosting, XGBoost
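A short scikit-learn sketch can put the averaging and boosting families side by side. The dataset (iris), the number of estimators, and the 5-fold cross-validation setup are assumptions chosen to keep the example small; XGBoost is left out because it lives in a separate package.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    # Averaging: trees trained independently on bootstrap samples.
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, random_state=0
    ),
    # Bagging plus a random subset of features at each split.
    "random forest": RandomForestClassifier(
        n_estimators=50, max_features="sqrt", random_state=0
    ),
    # Boosting: trees built sequentially, each focusing on earlier errors.
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "gradient boosting": GradientBoostingClassifier(
        n_estimators=50, random_state=0
    ),
}

# Mean accuracy over 5 cross-validation folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name:>18}: {score:.3f}")
```

On a dataset this small and clean the differences are likely to be slight; the variance and bias reductions tend to matter more on noisier, higher-dimensional data.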
References & Further Reading
• ACM-SIGKDD Meetup: Advanced Machine Learning with Python
• Kevin Markham: Introduction to Decision Trees (slides, PDF)
• Pang-Ning Tan et al.: “Introduction to Data Mining”, Chapter 4
• scikit-learn documentation
• Analytics Vidhya: A Complete Tutorial on Tree Based Modeling from Scratch
Thank you for your time!
nvergos@gmail.com
@nvergos
Nikos Vergos

[Women in Data Science Meetup ATX] Decision Trees

  • 1. Decision Trees Nikolaos Vergos, Ph.D. Senior Data Scientist, Accordion Health
  • 2. Overview  What are decision trees  Building decision trees  Purity Metrics  Entropy  Information Gain  GINI index  Stopping growth of a decision tree  Ensemble learning
  • 3. What are decision trees?  A flowchart-like structure; graph decision support model  A non-parametric supervised learning technique that can be used for both categorical (classification) and continuous (regression) output  Visually engaging and very easy to interpret  Excellent model for someone transitioning into the world of data science:  Require little data preparation  Able to handle multi-output problems
  • 4. Surviving the Titanic
      - Interconnected nodes act as a series of questions / test conditions
      - Terminal nodes (leaves) show the output metric
      - Source: http://www.kdnuggets.com/2016/09/decision-trees-disastrous-overview.html
  • 5. Questions
      - How does the algorithm choose which variables to include in the tree?
      - How does the algorithm choose where variables should be located on the tree?
      - How does the algorithm decide to stop “growing” the tree?
      - Growing an “optimal” decision tree for a training data set is computationally a very hard problem
      - We can still grow a “good enough” tree: greedy algorithms perform well (they choose the immediately best option available at each step)
      - Hunt’s algorithm: a greedy, recursive algorithm that leads to a local optimum
  • 6. Building a decision tree
      - Recursively partition records into smaller and smaller subsets
      - Partitioning decision depends on purity:
          - Different variables and split options are evaluated to determine which split will provide the greatest separation between classes
      - Goal of a decision tree: to have nodes consisting entirely of members of a single class
      - The "impurity" of a node (the extent to which that node mixes several classes) should be minimized
      - Several metrics quantify impurity
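The recursive partitioning described on this slide can be sketched in a few lines of plain Python. This is a minimal illustration in the spirit of a greedy, Hunt-style procedure, not the implementation used by any particular library: the helper names (`gini`, `best_split`, `grow`, `predict`), the toy data, and the choice of Gini impurity with `<` threshold tests are all assumptions made for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(rows, labels):
    """Greedy step: return the (feature, threshold) pair that lowers the
    weighted impurity the most, or None if no split improves on the parent."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] < t]
            right = [y for r, y in zip(rows, labels) if r[f] >= t]
            if not left or not right:
                continue  # a split must send records to both children
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def grow(rows, labels, depth=0, max_depth=3):
    """Recursively partition the records; leaves hold the majority class."""
    split = None if depth >= max_depth else best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] < t]
    right = [i for i, r in enumerate(rows) if r[f] >= t]
    return {
        "feature": f,
        "threshold": t,
        "left": grow([rows[i] for i in left], [labels[i] for i in left], depth + 1, max_depth),
        "right": grow([rows[i] for i in right], [labels[i] for i in right], depth + 1, max_depth),
    }

def predict(tree, row):
    """Follow the test conditions from the root down to a leaf."""
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["feature"]] < tree["threshold"] else tree["right"]
    return tree

# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
rows = [(1, 0), (2, 1), (3, 0), (4, 1), (5, 0), (6, 1)]
labels = ["a", "a", "a", "b", "b", "b"]
tree = grow(rows, labels)
print(tree)
```

Because each call only picks the best split available at that node, the result is a local optimum: a different first split could lead to a different tree.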
  • 7. Entropy
      - Data set: S, each member of which belongs to a class c1, c2, …, cn
      - $H(S) = -\sum_{j=1}^{n} p_j \log_2 p_j$, where $p_j$ is the proportion of S in class $c_j$
      - H = 0: all elements are the same class
      - H = 1: even split between two classes
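As a small worked example, entropy can be computed directly from class proportions. The sketch below assumes the usual base-2 Shannon entropy, i.e. the negative sum over classes of p_j * log2(p_j), with p_j the proportion of class j; the function name `entropy` and the sample labels are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Base-2 Shannon entropy of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    proportions = [n / total for n in Counter(labels).values()]
    # Sum of -p * log2(p) over the classes that actually occur.
    return sum(-p * math.log2(p) for p in proportions)

pure = ["yes"] * 6                # a single class
even = ["yes"] * 3 + ["no"] * 3   # even split between two classes
print(entropy(pure), entropy(even))
```

With more than two classes the maximum is log2(n) rather than 1, so the H = 1 case refers to a binary problem.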
  • 8. Information Gain
      - Stems from entropy
      - Information gain = H(parent) - weighted average of H(children)
      - [Figure: a parent node with two candidate splits, X < 4 and X < 3. Source: ACM-SIGKDD Meetup]
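To make the weighted average concrete, here is a short sketch that scores two candidate thresholds, X < 4 and X < 3, on a made-up six-point data set. The thresholds echo the slide's figure, but the data values, the function names, and the weighting by child size (each child's entropy multiplied by its share of the parent's records) are assumptions of this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Base-2 Shannon entropy of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """H(parent) minus the size-weighted average of the children's entropies."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical data: one numeric feature X and a binary class label.
xs = [1, 2, 3, 4, 5, 6]
labels = ["a", "a", "a", "b", "b", "b"]

def split_at(threshold):
    """Partition the labels into the X < threshold and X >= threshold children."""
    left = [y for x, y in zip(xs, labels) if x < threshold]
    right = [y for x, y in zip(xs, labels) if x >= threshold]
    return [left, right]

gain_x4 = information_gain(labels, split_at(4))
gain_x3 = information_gain(labels, split_at(3))
print(round(gain_x4, 3), round(gain_x3, 3))
```

On this toy data X < 4 yields two pure children, so it recovers all of the parent's entropy and would be the split chosen.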
  • 9. GINI Index
      - Expected error rate:
          - How often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset
      - GINI = 0: all elements are the same class (perfect separation, perfect purity)
      - GINI = 0.5: even split between two classes (equal representation)
      - Similar process:
          - Calculate GINI Gain for each potential split
          - Choose the split with the highest GINI Gain
  • 10. GINI Gain
      - [Figure: a parent node with two candidate splits, X < 4 and X < 3]
      - GINI Gain = G(parent) - weighted average of G(children)
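The same procedure can be sketched for the Gini index. The example assumes the common definition G = 1 minus the sum of squared class proportions, and reuses a made-up six-point data set to compare the thresholds X < 3 and X < 4; all names and values here are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_gain(parent, children):
    """G(parent) minus the size-weighted average of the children's impurities."""
    total = len(parent)
    weighted = sum(len(child) / total * gini(child) for child in children)
    return gini(parent) - weighted

# Hypothetical data: one numeric feature X and a binary class label.
xs = [1, 2, 3, 4, 5, 6]
labels = ["a", "a", "a", "b", "b", "b"]

def split_at(threshold):
    """Partition the labels into the X < threshold and X >= threshold children."""
    left = [y for x, y in zip(xs, labels) if x < threshold]
    right = [y for x, y in zip(xs, labels) if x >= threshold]
    return [left, right]

# Score each candidate split and keep the one with the highest gain.
gains = {t: gini_gain(labels, split_at(t)) for t in (3, 4)}
best_threshold = max(gains, key=gains.get)
print(gains, best_threshold)
```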
  • 11. When to use which?
      - Only ~2% performance difference
      - Entropy might be a bit slower to compute (due to the logarithm)
      - Gini for continuous attributes, Entropy for categorical
      - Gini to minimize misclassification, Entropy for exploratory analysis
      - Gini will tend to find the largest class, Entropy tends to find groups of classes that make up ~50% of the data
      - Default in scikit-learn: Gini. Entropy also available.
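In scikit-learn the impurity measure is selected with the `criterion` parameter of `DecisionTreeClassifier`. The sketch below fits one tree per criterion on the bundled iris data; the data set, the 70/30 split, and the fixed `random_state` are arbitrary choices for illustration, and the scores will vary with the data and the library version.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit one tree per impurity criterion and record held-out accuracy.
scores = {}
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0)
    tree.fit(X_train, y_train)
    scores[criterion] = tree.score(X_test, y_test)
    print(criterion, round(scores[criterion], 3))

# The criterion used when none is specified.
default_criterion = DecisionTreeClassifier().criterion
print(default_criterion)
```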
  • 12. When to stop growing? How about overfitting?
      - Pure leaves
      - Pre-set depth of tree: the length of the longest path from the root to a leaf
      - Number of cases in a node less than the minimum number of cases set
      - Splitting criterion below a certain threshold
      - Decision trees are prone to overfitting
          - Pre-pruning: set a minimum threshold on the gain, and stop when no split achieves a gain above this threshold
          - Post-pruning: build the full tree, and then perform pruning as a post-processing step
              - Not supported in scikit-learn as of version 0.18
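These stopping rules map onto constructor parameters of scikit-learn's `DecisionTreeClassifier`. The following is a sketch with arbitrary parameter values, not a tuning recommendation. Note that `min_impurity_decrease` and `ccp_alpha` (cost-complexity post-pruning) come from scikit-learn releases newer than the 0.18 mentioned on the slide, so this example assumes a recent version.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unconstrained tree: grows until every leaf is pure.
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pre-pruning: stop early based on depth, node size, and impurity gain.
pre_pruned = DecisionTreeClassifier(
    max_depth=3,                 # pre-set depth of the tree
    min_samples_split=10,        # do not split nodes with fewer cases
    min_samples_leaf=5,          # every leaf keeps at least this many cases
    min_impurity_decrease=0.01,  # minimum gain required for a split
    random_state=0,
).fit(X, y)

# Post-pruning: grow the full tree, then prune by cost-complexity.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

for name, model in [("full", full), ("pre", pre_pruned), ("post", post_pruned)]:
    print(name, model.get_depth(), model.get_n_leaves())
```

Comparing depth and leaf counts across the three trees shows how much each rule restrains growth; the right values are normally chosen by cross-validation.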
  • 13. Ensemble Learning
      - Decision trees can be weak learners with a tendency to overfit training data
      - We can combine several weak learners into an overall strong learner
      - Averaging methods for reducing variance
          - Bagging (Bootstrap Aggregating): use random subsets of the training set
          - Random Forest Classifier: build multiple decision trees and let them vote on how to classify inputs (scikit-learn). Only a subset of features is considered when splitting a node.
      - Boosting methods for reducing bias
          - Base estimators (individual trees) are built sequentially; the subset creation is not random and depends on the performance of the previous models: every new subset contains the elements that were (likely to be) misclassified by previous models
          - AdaBoost, Gradient Boosting, XGBoost
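For a feel of how these ensembles are used in practice, the sketch below fits the scikit-learn versions of bagging, random forest, AdaBoost, and gradient boosting on the bundled breast cancer data. The data set, the 50 estimators, and the 70/30 split are arbitrary assumptions of this example; XGBoost lives in a separate library and is left out.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    # Averaging methods: trees trained independently on bootstrap samples.
    "bagging": BaggingClassifier(n_estimators=50, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    # Boosting methods: trees built sequentially, each focusing on earlier errors.
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "gradient boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Held-out accuracy for each ensemble.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(name, round(scores[name], 3))
```

`BaggingClassifier` and `AdaBoostClassifier` both default to decision-tree base estimators, which is why no base model is passed explicitly.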
  • 14. References & Further Reading
      - ACM-SIGKDD Meetup: Advanced Machine Learning with Python
      - Kevin Markham: Introduction to Decision Trees (slides, PDF)
      - Pang-Ning Tan et al.: “Introduction to Data Mining”, Chapter 4
      - Scikit-learn documentation
      - Analytics Vidhya: A Complete Tutorial on Tree Based Modeling from Scratch
  • 15. Thank you for your time! nvergos@gmail.com @nvergos Nikos Vergos

Editor's Notes

  1. Purity: an important concept in decision trees; it determines the decision we take at each step. Ensemble: how to combine trees together for better performance.
  2. Very flexible. Support all kinds of features. Mostly used for classification; good for binary and multi-class.
  3. Upside-down tree: root at the top (all of our data), thinning out in each step down to the terminal nodes (leaves).
  4. Several algorithms for decision trees: ID3, CART. Finding the optimal decision tree is NP-complete. A different beginning might lead to a different tree.
  5. Purity: for binary classification, we want as many data points as possible to belong to the same category.
  6. p: proportion in class j. Maximum entropy = total disorder. Zero entropy = total order. Evaluate entropy within a PARENT, then evaluate entropy in potential CHILDREN.
  7. How much more information we have gained with each potential split; which split reduces entropy (impurity) the most.
  8. Misclassification frequency.
  9. No actual difference between the two. Historically (different fields of science): Gini for numerical features, Entropy for categorical. In the textbook: can define everything from scratch. In real life: scikit-learn or R.
  10. Decision trees overfit: they learn the data set so [...]. Threshold: similar to a gradient descent optimizer, we can decide to stop when the IG/GG from node to node is too insignificant (convergence).