Valencian Summer School 2015
Day 1
Lecture 3
Decision Trees
Gonzalo Martínez (UAM)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
No machine learning algorithm dominates in every domain, but random forests are usually tough to beat by much. They also have some advantages over other models: little input preparation is needed, they perform implicit feature selection, they are fast to train, and the underlying model can be visualized. While it is easy to get started with random forests, a good understanding of the model is key to getting the most out of them.
This talk covers decision trees, from theory to their implementation in scikit-learn. An overview of ensemble methods and bagging follows, ending with an explanation and implementation of random forests and a comparison against other state-of-the-art models.
The talk takes a very practical approach, using examples and real cases to illustrate how to use both decision trees and random forests.
We will see how the simplicity of decision trees is a key advantage over other methods. Unlike black-box methods, or methods that are hard to represent in multivariate cases, decision trees can easily be visualized, analyzed, and debugged until we see that our model is behaving as expected. This exercise can increase our understanding of the data and the problem, while making our model perform as well as possible.
Random forests randomize and ensemble decision trees to increase their predictive power, while keeping most of their properties.
The main topics covered will include:
* What are decision trees?
* How are decision trees trained?
* Understanding and debugging decision trees
* Ensemble methods
* Bagging
* Random Forests
* When should decision trees and random forests be used?
* Python implementation with scikit-learn
* Analysis of performance
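Since the topics above culminate in a Python implementation with scikit-learn, here is a minimal sketch of that workflow; the synthetic dataset and hyperparameter values are illustrative, not taken from the talk:

```python
# Minimal sketch: one decision tree vs. a random forest on synthetic data.
# Dataset and hyperparameters are illustrative, not from the talk.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("single tree accuracy:", tree.score(X_te, y_te))
print("random forest accuracy:", forest.score(X_te, y_te))
```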
2. Outline
• What is a decision tree?
• History
• Decision tree learning algorithm
• Growing the tree
• Pruning the tree
• Capabilities
3. What is a decision tree?
• A hierarchical learning model that recursively partitions the space using decision rules
• Valid for both classification and regression
• Ok, but what is a decision tree?
4. Example for classification
• Labor negotiations: predict whether a collective agreement is good or bad
[Figure: a decision tree splitting on "wage increase 1st year" and "dental plan", with good/bad leaf nodes; the plot beside it shows the induced partition of the two attributes. Labels mark the root node, the internal nodes, and the leaf nodes.]
5. Example for regression
• Boston housing: predict house values in Boston by neighbourhood
[Figure: a regression tree splitting on average rooms (≤5.5 / >5.5) and average house year (≤75 / >75), with leaf predictions of 200000, 350000, and 500000; the plot beside it shows the induced partition of the two attributes.]
6. History
• Precursors: Expert-Based Systems (EBS). EBS = knowledge database + inference engine
• MYCIN: medical diagnosis system, 600 rules
• XCON: system for configuring VAX computers, 2500 rules (1982)
• The rules were created by experts, by hand!
• Knowledge acquisition has to be automated
• Substitute the expert by their archive of solved cases
7. History
• CHAID (CHi-squared Automatic Interaction Detector), Gordon V. Kass, 1980
• CART (Classification and Regression Trees), Breiman, Friedman, Olshen and Stone, 1984
• ID3 (Iterative Dichotomiser 3), Quinlan, 1986
• C4.5, Quinlan, 1993: based on ID3
8. Computational
• Consider two binary variables. How many ways can we split the space using a decision tree?
• Two possible splits and two possible assignments to the leaf nodes → at least 8 possible trees
9. Computational
• Under what conditions does someone wait in a restaurant?
• There are 2 × 2 × 2 × 2 × 3 × 3 × 2 × 2 × 4 × 4 = 9216 possible cases and two classes → 2^9216 possible hypotheses, and many more possible trees!
10. Computational
• It is just not feasible to find the optimal solution
• A bias has to be selected to build the models
• This is a general problem in Machine Learning
11. Computational
For decision trees, a greedy approach is generally taken:
• Build the tree step by step, instead of as a whole
• At each step, select the best split with respect to the training data (following a split criterion)
• Grow the tree until a stopping criterion is met
• Generally, prune the tree afterwards (following a pruning criterion) to avoid over-fitting
12. Basic Decision Tree Algorithm

trainTree(Dataset L)
1. T = growTree(L)
2. pruneTree(T, L)    # removes subtrees uncertain about their validity
3. return T

growTree(Dataset L)
1. T.s = findBestSplit(L)
2. if T.s == null return null
3. (L1, L2) = splitData(L, T.s)
4. T.left = growTree(L1)
5. T.right = growTree(L2)
6. return T
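For concreteness, here is a rough Python rendering of this skeleton. The `Node` class and the Gini-based exhaustive split search are illustrative choices of mine (Gini impurity is defined two slides below), leaves are created where the slide's version returns null, and pruning is omitted as in `growTree`:

```python
# Rough, illustrative Python rendering of the growTree skeleton above.
import numpy as np

class Node:
    """Either a split node (feature, threshold, children) or a leaf (label)."""
    def __init__(self, feature=None, threshold=None, left=None, right=None, label=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right, self.label = left, right, label

def gini(y):
    # Gini impurity: sum of f_i * (1 - f_i) over the class fractions f_i
    _, counts = np.unique(y, return_counts=True)
    f = counts / len(y)
    return float(np.sum(f * (1 - f)))

def find_best_split(X, y):
    # Exhaustive search over O(#attributes x #points) candidate splits
    best, best_gain = None, 0.0
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            p_l = left.mean()
            gain = gini(y) - p_l * gini(y[left]) - (1 - p_l) * gini(y[~left])
            if gain > best_gain:
                best, best_gain = (j, t), gain
    return best  # None when no split improves impurity (e.g. a pure node)

def grow_tree(X, y):
    split = find_best_split(X, y)
    if split is None:  # stopping criterion: make a majority-vote leaf
        values, counts = np.unique(y, return_counts=True)
        return Node(label=values[np.argmax(counts)])
    j, t = split
    left = X[:, j] <= t
    return Node(feature=j, threshold=t,
                left=grow_tree(X[left], y[left]),
                right=grow_tree(X[~left], y[~left]))
```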
13. Finding the best split

findBestSplit(Dataset L)
1. Try all possible splits
2. return best

[Figure: candidate axis-aligned splits on a small 2-D dataset.]
• There are not too many candidate splits: computationally feasible, O(#attributes × #points)
• But which one is best?
14. Split Criterion
• It should measure the impurity of a node. Gini impurity (CART):
  i(t) = \sum_{i=1}^{m} f_i (1 - f_i)
  where f_i is the fraction of instances of class i in node t
• The improvement of a split is the variation of impurity before and after the split:
  \Delta i(t, s) = i(t) - p_L \, i(t_L) - p_R \, i(t_R)
  where p_L (p_R) is the proportion of instances going to the left (right) node
16. Other splitting criteria
• Based on entropy:
  H(t) = -\sum_{i=1}^{m} f_i \log_2 f_i
• Information gain (used in ID3 and C4.5):
  IG(t, s) = H(t) - p_L \, H(t_L) - p_R \, H(t_R)
• Information gain ratio (C4.5):
  IGR(t, s) = IG(t, s) / H(s)
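As a sketch, these entropy-based criteria can be written as small NumPy functions that could replace `gini` in the earlier skeleton; the function names and the boolean-mask convention are mine:

```python
# Illustrative NumPy versions of the entropy-based splitting criteria.
import numpy as np

def entropy(y):
    # H(t) = -sum of f_i * log2(f_i) over the class fractions f_i
    _, counts = np.unique(y, return_counts=True)
    f = counts / len(y)
    return float(-np.sum(f * np.log2(f)))

def information_gain(y, left):
    # IG(t, s) = H(t) - p_L H(t_L) - p_R H(t_R); `left` is a boolean mask
    p_l = left.mean()
    return entropy(y) - p_l * entropy(y[left]) - (1 - p_l) * entropy(y[~left])

def gain_ratio(y, left):
    # IGR(t, s) = IG(t, s) / H(s), where H(s) is the entropy of the split itself
    p_l = left.mean()  # assumes a non-degenerate split: 0 < p_l < 1
    h_split = -(p_l * np.log2(p_l) + (1 - p_l) * np.log2(1 - p_l))
    return information_gain(y, left) / h_split
```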
17. Splitting criteria
• Any splitting criterion with this shape is good
[Figure: impurity as a function of the class fraction — a concave curve that is maximal at an even class mix and zero for pure nodes; this is for binary problems.]
18. Full grown tree
[Figure: a fully grown tree on the 2-D toy data, splitting repeatedly on X1 and X2 (thresholds between 1.5 and 4.5) with good/bad leaves, shown next to the resulting partition of the plane.]
• Are we happy with this tree?
• How about this one?
19. Stopping criteria
• All instances assigned to the node considered for splitting have the same class label
• No split is found to further partition the data
• The number of instances in each terminal node is smaller than a predefined threshold
• The impurity gain of the best split is below a given threshold
• The tree has reached a maximum depth
The last three criteria are also called pre-pruning
20. Pruning
• Another option is post-pruning (or simply pruning). It consists of:
  • Growing the tree as much as possible
  • Pruning it afterwards, substituting a subtree by a single leaf node if the error does not worsen significantly
  • Continuing this process until no more pruning is possible
• In effect we go back to smaller trees, but through a different path
• The idea of pruning is to avoid overfitting
21. Cost-complexity pruning (CART)
• Cost-complexity based pruning:
  R_\alpha(t) = R(t) + \alpha \cdot C(t)
• R(t) is the error of the decision tree rooted at node t
• C(t) is the number of leaf nodes below node t
• The parameter \alpha sets the relative weight between the accuracy and the complexity of the tree
22. Pruning CART
Let's say \alpha = 0.1:
• Unpruned subtree (5 leaf nodes, no training errors): R_\alpha(t) = 0 + 0.1 \cdot 5 = 0.5
• Pruned to a single leaf (error 1/5): R_\alpha(t) = 1/5 + 0.1 \cdot 1 = 0.3
Pruning gives the lower cost-complexity, so the subtree is replaced by a leaf.
[Figure: the example tree and the corresponding partition of the plane, before and after pruning.]
23. Cost-complexity pruning (CART)
• CART uses 10-fold cross-validation within the training data to estimate \alpha: iteratively, nine folds are used for training a tree and one for testing
• A tree is trained on nine folds and pruned using all possible values of \alpha (there are finitely many)
• Each of those pruned trees is then tested on the remaining fold
• The process is repeated 10 times, and the \alpha value that gives the best generalization accuracy is kept
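scikit-learn exposes this machinery directly: `cost_complexity_pruning_path` enumerates the finite set of effective alphas for a training set, and `ccp_alpha` applies the pruning. A sketch of selecting alpha by 10-fold cross-validation, on an illustrative dataset:

```python
# Sketch: choose the cost-complexity parameter alpha by cross-validation,
# in the spirit of the CART procedure. The dataset is illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# The effective alphas form a finite set, one per candidate pruned subtree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# 10-fold cross-validation over that set; keep the alpha generalizing best
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"ccp_alpha": list(path.ccp_alphas)},
                      cv=10)
search.fit(X, y)
print("best alpha:", search.best_params_["ccp_alpha"])
```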
24. Statistical pruning (C4.5)
• C4.5 estimates the error rate at the leaf nodes using the upper confidence bound (a parameter) of a normal distribution, instead of the training data directly
• The error estimate for a subtree is the weighted sum of the error estimates of all its leaves
• This estimate is higher when few data instances fall on a leaf
• Hence, leaf nodes with few instances tend to be pruned
25. Pruning (CART vs C4.5)
• CART pruning is slower, since it has to build 10 extra trees to estimate \alpha
• C4.5 pruning is faster; however, the algorithm does not propose a way to compute the confidence threshold
• The statistical grounds for C4.5 pruning are questionable
• Using cross-validation is safer
26. Missing values
• What can be done if a value is missing?
• Suppose the value of "Pat" is unknown for one instance
• The instance with the missing value falls through all three branches, but weighted
• The validity of the split is computed as before
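A toy sketch of this fractional weighting, with invented numbers: an instance whose split attribute is missing descends every branch, carrying a weight proportional to the fraction of training instances that took each branch:

```python
# Invented counts of training instances taking each branch of a hypothetical
# three-way split (e.g. the "Pat" attribute of the restaurant example).
branch_counts = {"None": 2, "Some": 4, "Full": 6}
total = sum(branch_counts.values())

# An instance with the value missing descends all three branches with these
# fractional weights; split statistics then use the weights as before.
weights = {branch: n / total for branch, n in branch_counts.items()}
print(weights)  # {'None': 0.166..., 'Some': 0.333..., 'Full': 0.5}
```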
27. Oblique splits
• The CART algorithm allows for oblique splits, i.e. splits that are not orthogonal to the attribute axes
• The algorithm searches for planes with good impurity reduction
• The tree-growing process becomes slower
• But the trees become more expressive and compact
[Figure: an oblique split testing N1 > N2, whose single diagonal boundary separates the + and − classes.]
28. Parameters
• Minimum number of instances necessary to split a node
• Pruning / no pruning
• Pruning confidence: how much to prune?
• For computational reasons, the number of nodes or the depth of the tree can be limited
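For reference, these knobs map onto scikit-learn's tree parameters roughly as follows; the values shown are illustrative, not recommendations:

```python
# Illustrative mapping of the slide's parameters onto scikit-learn.
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    min_samples_split=20,       # minimum instances needed to split a node
    max_depth=8,                # limit tree depth for computational reasons
    max_leaf_nodes=None,        # or limit the number of nodes instead
    min_impurity_decrease=0.0,  # pre-pruning: required impurity gain per split
    ccp_alpha=0.0,              # > 0 enables cost-complexity post-pruning
)
```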
29. Algorithm details

CART
• Splitting criterion: Gini, Twoing
• Pruning criterion: cross-validation post-pruning
• Other features: regression and classification; nominal and numeric attributes; missing values; oblique splits; grouping of nominal splits

ID3
• Splitting criterion: Information Gain (IG)
• Pruning criterion: pre-pruning
• Other features: classification only; nominal attributes only

C4.5
• Splitting criterion: Information Gain (IG), Information Gain Ratio (IGR)
• Pruning criterion: statistically based post-pruning
• Other features: classification only; nominal and numeric attributes; missing values; rule generator; multi-way node splits
30. Bad things about DT
• None! Well, maybe something…
• They do not handle complex interactions between attributes very well: a lack of expressive power
31. Bad things about DT
• The replication problem: the tree can end up with similar subtrees in mutually exclusive regions
32. Good things about DT
• Self-explanatory: easy for non-experts to understand, and can be converted to rules
• Handle both nominal and numeric attributes
• Can handle uninformative and redundant attributes
• Can handle missing values
• Nonparametric method: in principle, no predefined idea of the concept to learn
• Easy to tune: they do not have hundreds of parameters
33. Thanks for listening
[Closing slide: a joke decision tree — "Questions?" and "Correctly answered?" nodes, with every YES/NO path leading to "Go for a coffee".]