2. Contents
Introduction
Decision Tree in a nutshell
Four Ingredients in Decision Tree Construction
• Splitting rule
• Stopping rule
• Pruning rule
• Prediction
Some Algorithms for Decision Tree
Advantages and Disadvantages
3. Introduction
• The decision tree is one of the most widely used supervised
learning methods in machine learning
• Its big appeal is that the decision process is very much akin to
how we humans make decisions
• Therefore it is easy to understand and accept the results
coming from the tree-style decision process
• It is often used as a base model when constructing other
algorithms (e.g. Random Forest, Boosting)
4. Decision Tree in a nutshell
Decision Tree Construction Process
1 Recursively partition the feature space $X$ into subregions
$t_1, \dots, t_m$ based on a specific splitting rule
2 Stop splitting the feature space (or stop growing the tree) when
the stopping rule holds
3 Prune the grown tree based on the pruning rule
4 Make a (local) prediction on each subregion of the pruned tree
based on a specific estimation method
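As a concrete illustration, the following is a compact, self-contained Python
sketch of steps 1, 2 and 4 (pruning is treated on the later slides). The Gini
index, the thresholds, and all names here are illustrative assumptions, not
the only possible choices; each ingredient is examined in detail below.

    from collections import Counter

    def gini(labels):
        """Gini impurity of a list of class labels (one common impurity)."""
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def grow_tree(X, y, depth=0, max_depth=3, min_size=5):
        """Recursively partition the data; returns a nested-dict tree."""
        # Stopping rule: pure node, too few points, or maximum depth reached
        if len(set(y)) == 1 or len(y) <= min_size or depth >= max_depth:
            return {"leaf": True, "pred": Counter(y).most_common(1)[0][0]}
        # Splitting rule: greedily minimize the weighted child impurity
        best = None
        for j in range(len(X[0])):                   # candidate variables
            for c in sorted({row[j] for row in X}):  # candidate criteria
                L = [i for i, row in enumerate(X) if row[j] < c]
                R = [i for i, row in enumerate(X) if row[j] >= c]
                if not L or not R:
                    continue
                cost = (len(L) * gini([y[i] for i in L])
                        + len(R) * gini([y[i] for i in R])) / len(y)
                if best is None or cost < best[0]:
                    best = (cost, j, c, L, R)
        if best is None:  # no valid split exists: make the node a leaf
            return {"leaf": True, "pred": Counter(y).most_common(1)[0][0]}
        _, j, c, L, R = best
        # Local prediction happens in the leaves; internal nodes store the split
        return {"leaf": False, "var": j, "crit": c,
                "left": grow_tree([X[i] for i in L], [y[i] for i in L],
                                  depth + 1, max_depth, min_size),
                "right": grow_tree([X[i] for i in R], [y[i] for i in R],
                                   depth + 1, max_depth, min_size)}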
5. Four Ingredients in Decision Tree
Construction
• Splitting rule
  • How does the split work?
  • What is the criterion for splitting the feature space?
• Stopping rule
  • What kind of threshold is used for stopping?
• Pruning rule
  • Why prune the tree?
  • How does it work?
• Estimation method
  • What kind of estimation method is used?
  • How does it work?
6. Notation
• $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$: the dataset,
  • where $x^{(i)} = (x_1^{(i)}, \dots, x_d^{(i)}) \in X$ and $y^{(i)} \in Y$, where $Y = \{1, \dots, K\}$
• Let $t$ be a node, which is also identified with a subregion of $X$
  • $D(t) = \{(x^{(i)}, y^{(i)}) \in D : (x^{(i)}, y^{(i)}) \in t\}$: the set of data points in the node $t$
• Let $T$ be a tree
• $N = |D|$: the total number of data points
• $N(t) = |D(t)|$: the number of data points in the node $t$
• $N_j(t) = |\{(x^{(i)}, y^{(i)}) \in D(t) : y^{(i)} = j\}|$: the number of data points in the node $t$ with class label $j \in Y$
• $p(j, t) = \frac{N_j}{N} \cdot \frac{N_j(t)}{N_j} = \frac{N_j(t)}{N}$
• $p(t) = \sum_j p(j, t) = \frac{N(t)}{N}$
• $p(j|t) = \frac{p(j, t)}{p(t)} = \frac{N_j(t)}{N(t)}$
7. Splitting Rule
• For each node, determine a splitting variable $x_j$ and a splitting
criterion $c$
• For a continuous splitting variable, the splitting criterion $c$ is a
number
  • For example, if an observation $x^{(i)}$ is such that its splitting
  variable satisfies $x_j^{(i)} < c$,
  • then the tree assigns it to the left child node; otherwise the tree
  assigns it to the right child
• For a categorical variable, the splitting criterion divides the
range of the splitting variable into two parts
  • For example, let the splitting variable be $x_j \in \{1, 2, 3, 4\}$,
  • and let the splitting criterion be $\{1, 2, 4\}$
  • If $x_j^{(i)} \in \{1, 2, 4\}$, the tree assigns it to the left child; otherwise,
  the tree assigns it to the right child
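A small sketch of how a single observation is routed under these two kinds
of splits; the dictionary layout used to represent a split is an illustrative
assumption.

    def route(x, split):
        """Return 'left' or 'right' for observation x under the given split."""
        j = split["var"]                  # index of the splitting variable x_j
        if split["kind"] == "continuous":
            # numeric criterion c: left child if x_j < c, right child otherwise
            return "left" if x[j] < split["crit"] else "right"
        # categorical criterion: a subset of the variable's categories
        return "left" if x[j] in split["crit"] else "right"

    # The categorical example from this slide: x_j takes values in {1, 2, 3, 4}
    # and the splitting criterion is the subset {1, 2, 4}
    split = {"var": 0, "kind": "categorical", "crit": {1, 2, 4}}
    print(route((3,), split))  # 'right', since 3 is not in {1, 2, 4}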
8. Splitting Rule
• A split is determined based on impurity
• Impurity (or purity) is a measure of homogeneity for a given
node
• For each node, we select the splitting variable and splitting
criterion which minimize the weighted sum of impurities of the two
child nodes
• Impurity is calculated by an impurity function $\varphi$ which
satisfies the following conditions
9. Splitting Rule
• Definition 1. An impurity function $\varphi$ is a function
$\varphi(p_1, \dots, p_K)$ defined for $p_1, \dots, p_K$ with $p_j \geq 0$ for all $j$ and
$p_1 + \dots + p_K = 1$ such that
(i) $\varphi(p_1, \dots, p_K) \geq 0$
(ii) $\varphi(1/K, \dots, 1/K)$ is the maximum value of $\varphi$
(iii) $\varphi(p_1, \dots, p_K)$ is symmetric with regard to $p_1, \dots, p_K$
(iv) $\varphi(1, 0, \dots, 0) = \varphi(0, 1, \dots, 0) = \varphi(0, \dots, 0, 1) = 0$
• Definition 2. For a node $t$, its impurity (measure) $i(t)$ is
defined as
$$i(t) = \varphi(p(1|t), \dots, p(K|t))$$
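The slides leave $\varphi$ generic; two standard choices satisfying
Definition 1 are the Gini index and the entropy, sketched below with a
numerical check of the conditions for $K = 2$.

    import math

    def gini(p):
        """Gini index: phi(p_1, ..., p_K) = 1 - sum_j p_j^2."""
        return 1.0 - sum(pj ** 2 for pj in p)

    def entropy(p):
        """Entropy: phi(p_1, ..., p_K) = -sum_j p_j log p_j (0 log 0 := 0)."""
        return -sum(pj * math.log(pj) for pj in p if pj > 0)

    # Numerical check of Definition 1 for K = 2:
    print(gini([0.5, 0.5]))   # 0.5: maximal at (1/K, ..., 1/K)       (ii)
    print(gini([1.0, 0.0]))   # 0.0: vanishes at a pure node          (iv)
    print(gini([0.3, 0.7]) == gini([0.7, 0.3]))  # True: symmetric    (iii)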
11. Splitting Rule
• Definition 3. The decrease in impurity
  • Let $t$ be a node and let $s$ be a split of $t$ into two child nodes
  $t_L$ and $t_R$. Then
$$\Delta i(s, t) = i(t) - p_L i(t_L) - p_R i(t_R)$$
  • where $p_L = \frac{p(t_L)}{p(t)}$ and $p_R = \frac{p(t_R)}{p(t)}$, so that $p_L + p_R = 1$
• Then $\Delta i(s, t) \geq 0$ (see Proposition 4.4 in Breiman et al. (1984))
• Hence the splitting rule at $t$ is to take, among all possible
candidate splits, the split $s^*$ that decreases the impurity most:
$$s^* = \operatorname{argmax}_s \Delta i(s, t)$$
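A sketch of this rule at a single node, taking the Gini index as $\varphi$
(an assumption; any impurity function from Definition 1 would do): enumerate
candidate splits $s$ and return the one maximizing $\Delta i(s, t)$.

    from collections import Counter

    def gini(labels):
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def best_split(X, y):
        """Return (variable index, criterion, decrease) maximizing
        delta_i(s, t) = i(t) - p_L * i(t_L) - p_R * i(t_R)."""
        i_t, n = gini(y), len(y)
        best = (None, None, 0.0)
        for j in range(len(X[0])):
            for c in sorted({row[j] for row in X}):
                yL = [y[i] for i, row in enumerate(X) if row[j] < c]
                yR = [y[i] for i, row in enumerate(X) if row[j] >= c]
                if not yL or not yR:
                    continue
                # p_L, p_R: fractions of the node's points in each child
                delta = i_t - len(yL) / n * gini(yL) - len(yR) / n * gini(yR)
                if delta > best[2]:
                    best = (j, c, delta)
        return best

    X = [(2.0,), (1.0,), (3.5,), (4.0,)]
    y = [0, 0, 1, 1]
    print(best_split(X, y))  # (0, 3.5, 0.5): x_0 < 3.5 separates the classes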
13. Stopping Rule
• Stopping rules terminate further splitting
• For example (a one-predicate sketch follows this list):
  • All observations in a node belong to a single class
  • The number of observations in a node is small
  • The decrease of impurity is small
  • The depth of a node is larger than a given number
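These conditions combine naturally into a single predicate; a minimal sketch
with illustrative threshold values:

    def should_stop(y_t, depth, delta_i,
                    min_size=5, min_decrease=1e-3, max_depth=10):
        """True if any of the stopping conditions above holds for node t."""
        return (len(set(y_t)) == 1        # all observations in a single class
                or len(y_t) < min_size    # too few observations in the node
                or delta_i < min_decrease # impurity decrease is too small
                or depth >= max_depth)    # node is deeper than the limit

    print(should_stop([1, 1, 1, 1, 1], depth=2, delta_i=0.2))  # True (pure)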
14. Pruning Rule
• A tree with too many nodes will have a large prediction error
rate on new observations
• It is appropriate to prune away some branches of the tree to
obtain a good prediction error rate
• To determine the size of the tree, we estimate the prediction error
using a validation set or cross-validation
15. Pruning Rule
Pruning Process
• For a given tree $T$ and a positive number $\alpha$, cost-complexity
pruning is defined by
$$\text{cost-complexity}(\alpha) = \text{error rate of } T + \alpha|T|$$
(where $|T|$ is the number of terminal nodes of $T$)
• In general, the larger the tree (the larger $|T|$), the smaller the error
rate (see Proposition 1 in Hyeongin Choi (2017)). But
cost-complexity does not necessarily decrease as $|T|$ increases.
• For the grown tree $T_{\max}$, $T(\alpha)$ is the subtree which minimizes
cost-complexity($\alpha$)
• In general, the larger $\alpha$, the smaller $|T(\alpha)|$
16. Pruning Rule
Pruning Process
• One important property of $T(\alpha)$ is that if $\alpha_1 \leq \alpha_2$, then
$T(\alpha_1) \geq T(\alpha_2)$, which means $T(\alpha_2)$ is a subtree of $T(\alpha_1)$
• Let $T_1 = T(\alpha_1), T_2 = T(\alpha_2), \dots$; then we obtain the
following sequence of pruned subtrees
$$T_1 \geq T_2 \geq \dots \geq \{t_1\},$$
where $0 = \alpha_1 < \alpha_2 < \dots$
• For a given $\alpha$, we estimate the generalization error of $T(\alpha)$ using a
validation set or cross-validation
• Choose $\alpha^*$ (and the corresponding $T(\alpha^*)$) which minimizes the
(estimated) generalization error (see Hyeongin Choi (2017)
for details)
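scikit-learn implements this minimal cost-complexity pruning; below is a
sketch of choosing $\alpha^*$ by 5-fold cross-validation, where the dataset
and model settings are illustrative choices.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Grow T_max and obtain the critical values 0 = alpha_1 < alpha_2 < ...
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

    # For each alpha, estimate the generalization error of T(alpha) by CV
    # (here via accuracy; maximizing accuracy = minimizing error)
    scores = {a: cross_val_score(DecisionTreeClassifier(random_state=0,
                                                        ccp_alpha=a),
                                 X, y, cv=5).mean()
              for a in path.ccp_alphas}

    best_alpha = max(scores, key=scores.get)  # alpha* with lowest CV error
    print(best_alpha, scores[best_alpha])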
17. Estimation Method
• Once a tree is fixed, a prediction on each subregion (terminal
node) can be determined from that tree
• Since the estimation is carried out on each subregion, it is called
a local estimation
• For this local estimation, any estimation method can be
used. What kind of method to use depends on the
model assumptions.
• For example, for classification, majority vote or
logistic regression can be used, depending on the modeler's probability
assumptions
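A sketch of the simplest such local estimator, the majority vote, which
predicts $\operatorname{argmax}_j p(j|t)$ from the labels of the data points
in $D(t)$ (any other local model could be fitted on $D(t)$ instead):

    from collections import Counter

    def leaf_prediction(y_t):
        """Majority vote over the labels in node t:
        argmax_j p(j|t) = argmax_j N_j(t) / N(t)."""
        return Counter(y_t).most_common(1)[0][0]

    print(leaf_prediction([1, 2, 2, 2, 3]))  # 2, the most frequent class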
18. Some Algorithms for Decision Tree
CHAID (CHi-squared Automatic Interaction Detector)
• by J. A. Hartigan, 1975
• Employs the $\chi^2$ statistic as its impurity measure
• No pruning process; it stops growing at a certain size
CART (Classification And Regression Trees)
• by Breiman et al., 1984
• Only binary splits are performed
• Cost-complexity pruning is an important unique feature
C5.0 (successor of ID3 and C4.5)
• by J. Ross Quinlan, 1993
• Multiway splits are available
• For a categorical input variable, a node splits into as many children as there are categories
19. Advantages and Disadvantages
Advantages
• Easy to interpret and explain
• Trees can be displayed graphically
• Trees can easily handle both continuous and categorical
variables without the need to create dummy variables
Disadvantages
• Poor prediction accuracy compared to other models
• When the depth is large, not only accuracy but also
interpretability suffers
20. References
1 Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J., Classification
And Regression Trees, Chapman & Hall/CRC (1984)
2 Hyeongin Choi (2017), Lecture 9: Classification and Regression Tree
(CART), http://www.math.snu.ac.kr/~hichoi/machinelearning/lecturenotes/CART.pdf
3 Yongdai Kim (2017), Chapter 6. Decision Tree,
https://stat.snu.ac.kr/ydkim/courses/2017-1/addm/Chap6-DecisionTree.pdf