From decision trees to random forests
Decision tree learning
• Supervised learning
• From a set of measurements,
– learn a model
– to predict and understand a phenomenon
Example 1: wine taste preference
• From physicochemical properties (alcohol, acidity, etc.)
• Learn a model
• To predict wine taste preference (from 0 to 10)
• A decision tree can be interpreted as a set of if-then rules
• Can be applied to noisy data
• One of the most popular inductive learning methods
• Good results for real-life applications
Decision tree representation
• An inner node represents an attribute
• An edge represents a test on the attribute of the parent node
• A leaf represents one of the classes
• Construction of a decision tree
– Based on the training data
– Top-down strategy
• The classiﬁcation of an unknown input vector is done by
traversing the tree from the root node to a leaf node.
• A record enters the tree at the root node.
• At the root, a test is applied to determine which child node
the record will encounter next.
• This process is repeated until the record arrives at a leaf
• All the records that end up at a given leaf of the tree are
classiﬁed in the same way.
• There is a unique path from the root to each leaf.
• The path is a rule which is used to classify the records.
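The root-to-leaf traversal described above can be sketched in a few lines of Python (an illustrative sketch; the nested-dict tree encoding and the attribute names are assumptions, not part of the original example):

```python
# A minimal sketch of classifying a record by traversing a decision tree.
# An inner node is a dict mapping its splitting attribute to a dict of
# {attribute value: subtree}; a leaf is just a class label (a string).

def classify(tree, record):
    """Walk from the root to a leaf, following the record's attribute values."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[record[attribute]]
    return tree  # a leaf: the predicted class label

# Toy tree for the golf example below.
tree = {"outlook": {
    "sunny":    {"humidity": {"high": "no play", "normal": "play"}},
    "overcast": "play",
    "rainy":    {"windy": {True: "no play", False: "play"}},
}}

print(classify(tree, {"outlook": "rainy", "windy": True}))  # no play
```

Each root-to-leaf path the loop follows corresponds to one classification rule.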
• The data set has ﬁve attributes.
• There is a special attribute: the attribute class is the class label.
• The attributes temp (temperature) and humidity are numerical.
• The other attributes are categorical, that is, their values cannot be ordered.
• Based on the training data set, we want to ﬁnd a set of rules
to know what values of outlook, temperature, humidity and
wind, determine whether or not to play golf.
• RULE 1 If it is sunny and the humidity is not above 75%, then play.
• RULE 2 If it is sunny and the humidity is above 75%, then
do not play.
• RULE 3 If it is overcast, then play.
• RULE 4 If it is rainy and not windy, then play.
• RULE 5 If it is rainy and windy, then don't play.
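The five rules can be written directly as a function (a sketch; the function name and the True/False encoding of play vs. don't play are illustrative):

```python
def play_golf(outlook, humidity=None, windy=None):
    """Return True (play) or False (don't play) per RULES 1-5 above."""
    if outlook == "sunny":
        return humidity <= 75          # Rules 1 and 2
    if outlook == "overcast":
        return True                    # Rule 3
    if outlook == "rainy":
        return not windy               # Rules 4 and 5
    raise ValueError("unknown outlook")

print(play_golf("sunny", humidity=70))   # True: play
print(play_golf("rainy", windy=True))    # False: don't play
```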
• At every node there is an attribute associated with
the node called the splitting attribute
• Top-down traversal
– In our example, outlook is the splitting attribute at root.
– Since for the given record, outlook = rain, we move to the
rightmost child node of the root.
– At this node, the splitting attribute is windy and we ﬁnd
that for the record we want to classify, windy = true.
– Hence, we move to the left child node to conclude that
the class label is "no play".
Decision tree construction
• Identify the splitting attribute and splitting
criterion at every level of the tree
– Iterative Dichotomizer (ID3)
Iterative Dichotomizer (ID3)
• Quinlan (1986)
• Each node corresponds to a splitting attribute
• Each edge is a possible value of that attribute.
• At each node the splitting attribute is selected to be the
most informative among the attributes not yet considered in
the path from the root.
• Entropy is used to measure how informative a node is.
Splitting attribute selection
• The algorithm uses the criterion of information gain
to determine the goodness of a split.
– The attribute with the greatest information gain is taken
as the splitting attribute, and the data set is split on all
distinct values of that attribute.
• Example: 2 classes: C1, C2, pick A1 or A2
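One way to make the comparison concrete (a sketch with made-up class counts; entropy itself is defined formally on the next slide):

```python
from math import log2

def entropy(counts):
    """Entropy of a node, from per-class counts; empty classes contribute 0."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def information_gain(parent_counts, child_counts):
    """Gain = entropy(parent) - weighted average entropy of the children."""
    n = sum(parent_counts)
    remainder = sum(sum(child) / n * entropy(child) for child in child_counts)
    return entropy(parent_counts) - remainder

# Illustrative (C1, C2) counts: a parent of 12 records split by A1 vs. A2.
parent = (6, 6)
gain_a1 = information_gain(parent, [(6, 2), (0, 4)])  # skewed children
gain_a2 = information_gain(parent, [(3, 3), (3, 3)])  # uninformative split
print(gain_a1 > gain_a2)  # True: A1 is the more informative split
```

A2 leaves both children as mixed as the parent, so its gain is zero; A1 produces skewed children and is preferred.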
Entropy – General Case
• Impurity/Inhomogeneity measurement
• Suppose X takes n values, V1, V2,… Vn, and
P(X=V1)=p1, P(X=V2)=p2, … P(X=Vn)=pn
• What is the smallest number of bits, on average, per
symbol, needed to transmit the symbols drawn from
distribution of X? It’s
E(X) = – p1 log2 p1 – p2 log2 p2 – … – pn log2 pn
• E(X) = the entropy of X
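A direct transcription of the formula (a sketch; probabilities with p = 0 contribute 0 by the usual convention):

```python
from math import log2

def entropy(probs):
    """E(X) = -sum(p * log2(p)); terms with p = 0 contribute 0."""
    return sum(-p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit: a fair coin is maximally impure
print(entropy([1.0]))       # 0.0: a pure node needs no bits
```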
• Stop growing when the data split is not statistically significant, or
• Grow full tree then post-prune
• How to select the best tree?
– Measure performance over the training data
– Measure performance over a separate validation set
– MDL: minimize
• size(tree) + size(misclassiﬁcations(tree))
• Split data into training and validation set
• Do until further pruning is harmful
– Evaluate impact on validation set of pruning
each possible node
– Greedily remove the one that most improves
validation set accuracy
• Convert the tree to an equivalent set of rules
• Prune each rule independently
• Sort ﬁnal rules into desired
sequence for use
Issues in Decision Tree Learning
• How deep to grow?
• How to handle continuous attributes?
• How to choose an appropriate attribute selection measure?
• How to handle data with missing attribute values?
• How to handle attributes with diﬀerent costs?
• How to improve computational eﬃciency?
• ID3 has been extended to handle most of these.
The resulting system is C4.5 (http://cis-
How to grow a decision tree
• Split the rows in a given node into two sets with respect to impurity
– The smaller the impurity, the more skewed the distribution
– Compare the impurity of the parent with the impurity of the children
When to stop growing tree
• Build full tree or
• Apply stopping criterion - limit on:
– Tree depth, or
– Minimum number of points in a leaf
How to assign leaf
• The leaf value is assigned as follows:
– If the leaf contains only one point,
then its color represents the leaf
• Else the majority color is picked, or
the color distribution is stored
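The majority rule above can be sketched in one line (illustrative labels; "color" stands for the class of the points in the leaf):

```python
from collections import Counter

# Assign a leaf its value: the majority class ("color") among the
# training points that fall into that leaf.
def leaf_value(labels):
    return Counter(labels).most_common(1)[0][0]

print(leaf_value(["red", "blue", "red"]))  # red
```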
• The tree covers the whole area with rectangles, each predicting a point color
Decision tree scoring
• The model can predict a point color based
on its coordinates.
• The tree perfectly represents the training data (0%
training error), but it has also learned the noise!
Randomize #1- Bagging
• Each tree sees only sample of training data
and captures only a part of the information.
• Build multiple weak trees which vote
together to give the resulting prediction
– voting is based on a majority vote, or weighted voting
Bagging - boundary
• Bagging averages many trees, and produces
smoother decision boundaries.
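The bag-and-vote mechanics can be sketched as follows (illustrative: `train_tree` here is a stub that predicts its sample's majority label, standing in for a real weak tree learner):

```python
import random
from collections import Counter

def bootstrap(data, rng):
    return [rng.choice(data) for _ in data]  # sample with replacement

def bag(data, n_trees, train_tree, rng=None):
    """Train n_trees weak learners, each on its own bootstrap sample."""
    rng = rng or random.Random(0)
    return [train_tree(bootstrap(data, rng)) for _ in range(n_trees)]

def predict(trees, x):
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]  # majority vote

# Stub learner: ignores x and predicts the majority label of its sample.
def train_tree(sample):
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: majority

data = [(1, "a"), (2, "a"), (3, "b")]
trees = bag(data, n_trees=15, train_tree=train_tree)
print(predict(trees, 0))
```

Each tree sees a different bootstrap sample, so their individual errors partially cancel in the vote.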
Randomize #2 - Feature selection
Random forest - properties
• Reﬁnement of bagged trees; quite popular
• At each tree split, a random sample of m features is drawn,
and only those m features are considered for splitting.
• m=√p or log2(p), where p is the number of features
• For each tree grown on a bootstrap sample, the error rate
for observations left out of the bootstrap sample is
monitored. This is called the “out-of-bag” error rate.
• Random forests try to improve on bagging by "de-
correlating" the trees. Each tree has the same expectation,
so averaging de-correlated trees reduces the variance of the prediction.
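The per-split feature sampling with m = √p can be sketched as (illustrative function name):

```python
import math
import random

# At each split, out of p features only a random subset of
# m = round(sqrt(p)) indices is considered as candidate splitting attributes.
def candidate_features(p, rng):
    m = max(1, round(math.sqrt(p)))
    return rng.sample(range(p), m)  # m distinct feature indices

print(candidate_features(16, random.Random(0)))  # 4 of the 16 indices
```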
Advantages of Random Forest
• Independent trees, which can be built in parallel
• The model does not overﬁt easily
• Produces reasonable accuracy
• Brings additional tools to analyze the data: variable
importance, proximities, missing values
Out of bag points and validation
• Each tree is built over
a sample of the training points
• The remaining (out-of-bag) points
can be used for validation
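A sketch of how out-of-bag points arise from a bootstrap draw (illustrative):

```python
import random

# Indices never drawn into a tree's bootstrap sample are "out of bag"
# for that tree and can validate it without a separate hold-out set.
def oob_indices(n, rng):
    in_bag = {rng.randrange(n) for _ in range(n)}  # bootstrap draw of size n
    return [i for i in range(n) if i not in in_bag]

oob = oob_indices(10, random.Random(0))
print(oob)  # indices this tree never saw during training
```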