2. Decision Trees
• A decision tree is an approach to predictive analysis that can help you
make decisions. Suppose, for example, that you need to decide
whether to invest a certain amount of money in one of three business
projects: a food-truck business, a restaurant, or a bookstore.
• A business analyst has worked out the rate of failure or success for
each of these business ideas as percentages and the profit you’d
make in each case.
3. Business     Success Rate    Failure Rate
Food Truck      60 percent      40 percent
Restaurant      52 percent      48 percent
Bookstore       50 percent      50 percent

Business        Gain (USD)      Loss (USD)
Food Truck      20,000          -7,000
Restaurant      40,000          -21,000
Bookstore       6,000           -1,000
From past statistical data shown, you can construct a decision tree as shown below.
4. • Using such a decision tree to decide on a business venture begins with calculating
the expected value for each alternative: a number that helps you rank the
alternatives and select the best one.
• The expected value is calculated by weighting every possible outcome of a
decision by its probability. Calculating the expected value for the food-truck
business idea looks like this:
• Expected value of food-truck business = (60% × 20,000 USD) + (40% × -7,000 USD)
= 9,200 USD
• Expected value of restaurant business = (52% × 40,000 USD) + (48% × -21,000 USD)
= 10,720 USD
• Expected value of bookstore business = (50% × 6,000 USD) + (50% × -1,000 USD)
= 2,500 USD
• Therefore the expected value becomes one of the criteria you figure into your
business decision-making. In this example, the expected values of the three
alternatives might incline you to favor investing in the restaurant business.
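The expected-value arithmetic above can be checked with a short script (a sketch in Python, for illustration; the probabilities and payoffs are the ones from the table):

```python
# Expected value of each venture: sum of (probability x payoff) over outcomes.
ventures = {
    "food truck": [(0.60, 20_000), (0.40, -7_000)],
    "restaurant": [(0.52, 40_000), (0.48, -21_000)],
    "bookstore":  [(0.50, 6_000), (0.50, -1_000)],
}

def expected_value(outcomes):
    return sum(p * payoff for p, payoff in outcomes)

for name, outcomes in ventures.items():
    print(name, round(expected_value(outcomes)))   # 9200, 10720, 2500
```

The restaurant's 10,720 USD is the largest expected value, matching the conclusion on the slide.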
5. • Decision trees can also be used to visualize classification rules.
• A decision algorithm generates a decision tree that represents
classification rules.
• Decision Trees (DTs) are a supervised learning technique that predict
values of responses by learning decision rules derived from features.
They can be used in both a regression and a classification context. For
this reason they are sometimes also referred to as Classification And
Regression Trees (CART).
6. Why trees?
• Interpretable/intuitive, popular in medical applications because they mimic
the way a doctor thinks
• Model discrete outcomes nicely
• Can be very powerful, can be as complex as you need them
• C4.5 and CART both appear among the “top 10” data mining algorithms; decision trees are very popular
• Some real examples (from Russell & Norvig, Mitchell)
• BP’s GasOIL system for separating gas and oil on offshore platforms - decision
trees replaced a hand-designed rules system with 2500 rules. C4.5-based
system outperformed human experts and saved BP millions. (1986)
• learning to fly a Cessna on a flight simulator by watching human experts fly
the simulator (1992)
• can also learn to play tennis, analyze C-section risk, etc.
7. Decision Tree Types
• Classification tree analysis is when the predicted outcome is the class to which
the data belongs. Iterative Dichotomiser 3 (ID3), C4.5, (Quinlan, 1986)
• Regression tree analysis is when the predicted outcome can be considered a real
number (e.g. the price of a house, or a patient’s length of stay in a hospital).
• Classification And Regression Tree (CART) analysis is used to refer to both of the
above procedures, first introduced by (Breiman et al., 1984)
• CHi-squared Automatic Interaction Detector (CHAID). Performs multi-level splits
when computing classification trees. (Kass, G. V. 1980).
• A Random Forest classifier uses a number of decision trees, in order to improve
the classification rate.
• Boosting Trees can be used for regression-type and classification-type problems.
• Used in data mining (most are included in R, see rpart and party packages, and in
Weka, Waikato Environment for Knowledge Analysis)
8. How to build a decision tree:
• Start at the top of the tree.
• Grow it by “splitting” attributes one by one. To determine which
attribute to split, look at “node impurity.”
• Assign leaf nodes the majority vote in the leaf.
• When we get to the bottom, prune the tree to prevent overfitting
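The steps above (minus pruning) can be sketched as a tiny recursive learner. This is a Python sketch with our own helper names; node impurity is measured here with entropy, and leaves get the majority vote:

```python
from collections import Counter
from math import log2

def entropy(labels):
    # impurity of a node: Shannon entropy of its label distribution
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    # pick the split that minimizes the weighted entropy of the children
    def split_entropy(a):
        total = 0.0
        for v in set(r[a] for r in rows):
            idx = [i for i, r in enumerate(rows) if r[a] == v]
            total += len(idx) / len(rows) * entropy([labels[i] for i in idx])
        return total
    return min(attributes, key=split_entropy)

def grow(rows, labels, attributes):
    # stop when the node is pure or no attributes remain: leaf = majority vote
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attributes)
    children = {}
    for v in set(r[a] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[a] == v]
        children[v] = grow([rows[i] for i in idx], [labels[i] for i in idx],
                           [b for b in attributes if b != a])
    return (a, children)
```

For example, `grow([{"x": 0}, {"x": 0}, {"x": 1}], ["no", "no", "yes"], ["x"])` returns a node that splits on `x` with pure leaves.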
9. Decision Tree Representation
• Classification of instances by sorting them down the tree from the root
to some leaf node
• node ≈ test of some attribute
• branch ≈ one of the possible values for the attribute
• Decision trees represent a disjunction of conjunctions of constraints on
the attribute values of instances i.e., (... ∧ ... ∧ ... ) ∨ (... ∧ ... ∧ ... ) ∨ ...
• Equivalent to a set of if-then-rules
• each branch represents one if-then-rule
• if-part: conjunctions of attribute tests on the nodes
• then-part: classification of the branch
10. • This decision tree is equivalent to:
• if (Outlook = Sunny ) ∧ (Humidity = Normal ) then Yes;
• if (Outlook = Overcast ) then Yes;
• if (Outlook = Rain ) ∧ (Wind = Weak ) then Yes;
Each internal node: test one attribute Xi
Each branch from a node: selects one value for Xi
Each leaf node: predict Y (or P(Y|X ∈ leaf))
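The three rules above translate directly into code (a Python sketch using the attribute names of the classic PlayTennis example; anything not matched by a rule falls through to "No"):

```python
def will_play(outlook, humidity, wind):
    # each `if` is one branch (one if-then-rule) of the tree
    if outlook == "Sunny" and humidity == "Normal":
        return "Yes"
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Rain" and wind == "Weak":
        return "Yes"
    return "No"

print(will_play("Overcast", "High", "Strong"))   # Yes
```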
11. Example: Will the customer wait for a table?
(from Russell&Norvig)
• Here are the attributes:
13. • Here are two options for the first feature to split at the top of the tree.
Which one should we choose? Which one gives me the most information?
14. • What we need is a formula to compute information. Before we do
that, here's another example. Let's say we pick one of them (Patrons).
Maybe then we'll pick Hungry next, because it has a lot of
“information”:
16. • We want to define I so that it obeys all of the following:
• I(p) ≥ 0; the information of any event is non-negative.
• I(1) = 0; there is no information from events with probability 1.
• I(p1·p2) = I(p1) + I(p2); the information from two independent events
should be the sum of their information.
• I(p) is continuous; slight changes in probability correspond to slight
changes in information.
• Together these lead to:
• I(p^2) = 2·I(p), or generally I(p^n) = n·I(p); this means that
I(p) = I((p^(1/m))^m) = m·I(p^(1/m)), so (1/m)·I(p) = I(p^(1/m)).
The continuous function satisfying all of these properties is I(p) = -log(p).
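A quick numerical check (Python sketch) that I(p) = -log2(p) satisfies these requirements:

```python
from math import isclose, log2

def I(p):
    """Information (surprisal) of an event with probability p, in bits."""
    return -log2(p)

assert I(1.0) == 0.0                              # certain events carry no information
assert isclose(I(0.5 * 0.25), I(0.5) + I(0.25))   # independent events add
assert isclose(I(0.5 ** 3), 3 * I(0.5))           # I(p^n) = n I(p)
```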
17. Why we use log?
Why is information measured with logarithms instead of just by the total
number of states?
• Mostly because it makes it additive. It's true that if you really wanted to,
you could choose to measure information or entropy by the total number
of states (usually called the "multiplicity"), instead of by the log of the
multiplicity. But then it would be multiplicative instead of additive. If you
have 10 bits and then you obtain another 10 bits of information, then you
have 20 bits.
• Saying the same thing in terms of multiplicity: if you have 2^10 = 1024
states and then you add another 1024 independent states then you have
1024*1024 = 1048576 states (2^20) when they are combined. Multiplicity
is multiplicative instead of additive, which means that the numbers you
need in order to keep track of it get very large very quickly! This is really
inconvenient, hence why we usually stick with using information/entropy
as the unit instead of multiplicity.
18. • Shannon (1948, p. 349) explains the convenience of the use of a
logarithmic function in the definition of the entropies: it is practically
useful because many important parameters in engineering vary
linearly with the logarithm of the number of possibilities; it is intuitive
because we are used to measuring magnitudes by linear comparison with
units of measurement; and it is mathematically more suitable because
many limiting operations are simpler in terms of the logarithm than in
terms of the number of possibilities. In turn, the choice of a
logarithmic base amounts to a choice of a unit for measuring
information.
• If base 2 is used, the resulting unit is called the ‘bit’ (a contraction of
‘binary unit’).
• With these definitions, one bit is the amount of information obtained
when one of two equally likely alternatives is specified.
19. Example of Calculating Information
Coin Toss
• A fair coin has two equally likely outcomes, heads (0.5) and tails (0.5).
• So whether you get heads or tails, you gain 1 bit of information,
by the following formula:
• I(head) = -log2(0.5) = 1 bit
20. Another Example
• Balls in a bin
• The information you gain by choosing a ball from the bin is
calculated as follows:
• I(red ball) = -log2(4/9) = 1.1699 bits
• I(yellow ball) = -log2(2/9) = 2.1699 bits
• I(green ball) = -log2(3/9) = 1.58496 bits
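Both examples can be reproduced with one helper (a Python sketch; the bin holds 4 red, 2 yellow and 3 green balls, 9 in total):

```python
from math import log2

def surprisal(p):
    """Information gained from an outcome of probability p, in bits."""
    return -log2(p)

print(surprisal(1 / 2))            # coin: 1.0 bit
print(round(surprisal(4 / 9), 4))  # red ball: 1.1699 bits
print(round(surprisal(2 / 9), 4))  # yellow ball: 2.1699 bits
print(round(surprisal(3 / 9), 4))  # green ball: 1.585 bits
```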
22. How is the entropy equation derived?
• I = total information from N occurrences
• N = number of occurrences
• N·pi = approximate number of times the i-th outcome appears in N occurrences
• So the total information from N occurrences is I = Σi (N·pi)·(-log2 pi).
• When you compare the total information from N occurrences with the
entropy equation, the only thing that changes is the place of the N.
• Moving the N to the other side shows that I/N is the entropy,
H = -Σi pi·log2 pi. Therefore, entropy is the average (expected) amount of
information from a certain event.
23. Entropy
• The entropy of a variable is the "amount of information" contained in the
variable. This amount is determined not just by the number of different
values the variable can take on, but also by how likely each value is, just as
the information in an email is quantified not just by the number of words in
the email or the number of possible words in the language of the email.
• Informally, the amount of information in an email is proportional to the
amount of “surprise” its reading causes. For example, if an email is simply a
repeat of an earlier email, then it is not informative at all. On the other
hand, if say the email reveals the outcome of a cliff-hanger election, then it
is highly informative.
• Similarly, the information in a variable is tied to the amount of surprise that
value of the variable causes when revealed. Shannon’s entropy quantifies
the amount of information in a variable, thus providing the foundation for
a theory around the notion of information.
24. • Because entropy is a type of information, the easiest way to measure
information is in bits and bytes, rather than by the total number of possible
states they can represent.
• The basic unit of information is the bit, which represents 2 possible states.
If you have n bits, then that information represents 2^n possible states. For
example, a byte is 8 bits, therefore the number of states it represents is
2^8 = 256. This means that a byte can store any number between 0 and 255.
If you are given the total number of states, then you just take the log of
that number to get the amount of information, measured in bits:
log2(256) = 8.
• So entropy is defined as the log of the number of total microscopic states
corresponding to a particular macro state of thermodynamics. This is the
additional information you'd need to know in order to completely specify
the microstate, given knowledge of the macrostate.
25. Information and Entropy
• Let’s look at this example again…
• Calculating the entropy….
• In this example there are three outcomes possible when you choose the
ball: it can be either red, yellow, or green (n = 3).
• So the equation will be the following:
• Entropy = -(4/9)·log2(4/9) - (2/9)·log2(2/9) - (3/9)·log2(3/9) = 1.5305
• Therefore, you expect to gain 1.5305 bits of information each time you
choose a ball from the bin.
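The same calculation as a function (Python sketch); it also makes the range property 0 ≤ H ≤ log2(n) easy to check:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(-p * log2(p) for p in probs if p > 0)

print(round(entropy([4/9, 2/9, 3/9]), 4))                # 1.5305
assert entropy([1.0, 0.0, 0.0]) == 0.0                   # minimum: one certain outcome
assert abs(entropy([1/3, 1/3, 1/3]) - log2(3)) < 1e-12   # maximum: uniform
```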
26. Information and Entropy
• Equation for calculating the range of Entropy:
0 ≤ Entropy ≤ log(n), where n is number of outcomes
• Entropy 0 (minimum entropy) occurs when one of the probabilities is
1 and rest are 0’s
• Entropy log(n) (maximum entropy) occurs when all the probabilities
have equal values of 1/n.
27. Shannon’s Entropy
• According to Shannon (1948; see also Shannon and Weaver 1949), a
general communication system consists of five parts:
− A source S, which generates the message to be received at the destination.
− A transmitter T, which turns the message generated at the source into a
signal to be transmitted. In the cases in which the information is encoded,
encoding is also implemented by this system.
− A channel CH, that is, the medium used to transmit the signal from the
transmitter to the receiver.
− A receiver R, which reconstructs the message from the signal.
− A destination D, which receives the message.
30. Mutual Information
• H(S;D) is the mutual information: the average amount of information
generated at the source S and received at the destination D.
• E is the equivocation: the average amount of information generated
at S but not received at D.
• N is the noise: the average amount of information received at D but
not generated at S.
• As the diagram clearly shows, the mutual information can be
computed as:
H(S;D) = H(S) − E = H(D) − N
31. Example
• We could measure how much space it takes to store X. Note that this
definition only makes sense if X is a random variable. If X is a random
variable, it has some distribution, and we can calculate the amount of
memory it takes to store X on average.
For X being a uniformly random n-bit string:
H(X) = Σx p(x)·log2(1/p(x)) = Σx 2^(-n)·log2(2^n) = n
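A brute-force check of H(X) = n for small n (Python sketch):

```python
from math import log2

def uniform_entropy(n_bits):
    """Entropy of a uniform distribution over all 2**n_bits strings."""
    p = 2.0 ** -n_bits
    return sum(p * log2(1 / p) for _ in range(2 ** n_bits))

print(uniform_entropy(4))   # 4.0
print(uniform_entropy(8))   # 8.0
```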
36. Back to C4.5 (source material: Russell & Norvig, Mitchell, Quinlan)
• We consider a “test” split on attribute A at a branch.
• In S, we have #pos positives and #neg negatives. For each branch j,
we have #posj positives and #negj negatives.
41. Keep splitting until:
• no more examples left (no point trying to split)
• all examples have the same class
• no more attributes to split
• For the restaurant example, we get this:
42. • Actually, it turns out that the class labels for the data were themselves generated
from a tree. So to get the label for an example, they fed it into a tree, and got the
label from the leaf. That tree is here:
43. • But the one we found is simpler!
• Does that mean our algorithm isn't doing a good job?
• There are alternatives to H([p,1-p]); it is not the only impurity measure we
can use!
• One example is the Gini index 2p(1-p) used by CART.
• Another example is the value 1-max(p,1-p)
44. C4.5 uses information gain for splitting, and CART uses the Gini index. (CART only
has binary splits.)
All three are similar, but cross-entropy and the Gini index are differentiable, and
hence more amenable to numerical optimization. Either the Gini index or cross-
entropy should be used when growing the tree.
45. • Inductive bias: Shorter trees are preferred to longer trees. Trees that
place high information gain attributes close to the root are also
preferred.
• Why prefer shorter hypotheses?
• Occam’s Razor: Prefer the simplest hypothesis that fits the data!
• see Minimum Description Length Principle (Bayesian Learning)
• e.g., if there are two decision trees, one with 500 nodes and another
with 5 nodes, the second one should be preferred ⇒ better chance to
avoid overfitting
46. Gini Index
• The Gini index measures the probability that two items selected at random
from a population belong to the same class; this probability is 1 if the
population is pure.
• It works with a categorical target variable, such as “Success” or “Failure”.
• It performs only binary splits.
• The higher the value of Gini, the higher the homogeneity.
• CART (Classification and Regression Tree) uses the Gini method to create
binary splits.
47. Steps to Calculate Gini for a split
1. Calculate Gini for each sub-node, using the formula: sum of squares of the
probabilities of success and failure, p^2 + (1-p)^2.
2. Calculate Gini for the split as the weighted average of the Gini scores of
the nodes in that split.
48. Example:
• Let's say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X)
and Height (5 to 6 ft). 15 out of these 30 play cricket in their leisure time. Now, we want to create
a model to predict who will play cricket during the leisure period. In this problem, we need to
segregate students who play cricket in their leisure time based on the most significant input
variable among the three.
Split on Gender:
Gini for sub-node Female = (0.2)*(0.2)+(0.8)*(0.8) = 0.68
Gini for sub-node Male = (0.65)*(0.65)+(0.35)*(0.35) = 0.55
Weighted Gini for split on Gender = (10/30)*0.68+(20/30)*0.55 = 0.59
Similarly for the split on Class:
Gini for sub-node Class IX = (0.43)*(0.43)+(0.57)*(0.57) = 0.51
Gini for sub-node Class X = (0.56)*(0.56)+(0.44)*(0.44) = 0.51
Weighted Gini for split on Class = (14/30)*0.51+(16/30)*0.51 = 0.51
Above, you can see that the Gini score for the split on Gender is higher than for the split on Class;
hence, the node split will take place on Gender.
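The two candidate splits above, recomputed in a few lines (Python sketch; "Gini" here is the purity form p² + (1 − p)² used in the steps above):

```python
def gini_purity(p):
    # probability that two random draws from the node agree in class
    return p * p + (1 - p) * (1 - p)

def weighted_gini(groups):
    # groups: list of (group_size, proportion_playing) pairs
    n = sum(size for size, _ in groups)
    return sum(size / n * gini_purity(p) for size, p in groups)

gender = weighted_gini([(10, 0.2), (20, 0.65)])   # Female, Male
klass  = weighted_gini([(14, 0.43), (16, 0.56)])  # Class IX, Class X
print(round(gender, 2), round(klass, 2))          # 0.59 0.51
```

The higher score for Gender confirms that the split should be made on Gender.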
50. Overfitting
reasons for overfitting:
• noise in the data
• the number of training examples is too small to produce a representative
sample of the target function
how to avoid overfitting:
• stop growing the tree earlier, before it reaches the point where it perfectly
classifies the training data
• allow overfitting and then post-prune the tree (more successful in
practice!)
how to determine the perfect tree size:
• separate validation set to evaluate utility of post-pruning
• apply statistical test to estimate whether expanding (or pruning) produces
an improvement
51. Reduced error pruning
• Each of the decision nodes is considered a candidate for pruning
• Pruning a decision node consists of removing the subtree rooted at
the node, making it a leaf node and assigning it the most common
classification of the training examples affiliated with that node
• Nodes are removed only if the resulting tree performs no worse than
the original tree over the validation set
• Pruning starts with the node whose removal most increases accuracy
and continues until further pruning is harmful
52. Reduced Error Pruning
• Effect of reduced error pruning:
• Any node added due to coincidental regularities in the training set is likely to be pruned
53. Rule Post-Pruning
1. Infer the decision tree from the training set (Overfitting allowed!)
2. Convert the tree into a set of rules
3. Prune each rule by removing any preconditions whose removal
improves its estimated accuracy
4. Sort the pruned rules by their estimated accuracy
• One method to estimate rule accuracy is to use a separate validation
set
• Pruning rules is more precise than pruning the tree itself
54. Pruning
• Let's start with C4.5's pruning. C4.5 recursively makes choices as to whether
to prune at an attribute:
• Option 1: leave the tree as is
• Option 2: replace that part of the tree with a leaf corresponding to the most
frequent label in the data S going to that part of the tree
• Option 3: replace that part of the tree with one of its subtrees,
corresponding to the most common branch in the split
• To figure out which decision to make, C4.5 computes upper bounds
on the probability of error for each option.
• We'll see how to do that shortly.
• Prob. of error for Option 1 -> Upper Bound 1
• Prob. of error for Option 2 -> Upper Bound 2
• Prob. of error for Option 3 -> Upper Bound 3
55. • C4.5 chooses the option that has the lowest of these three upper
bounds. This ensures that the error rate is fairly low.
• e.g. which has the smallest upper bound:
• 1 incorrect out of 3
• 5 incorrect out of 17, or
• 9 incorrect out of 32?
• For each option, we count the number correct and the number
incorrect. We need upper confidence intervals on the proportion that
are incorrect. To calculate the upper bounds, calculate confidence
intervals on proportions.
56. Simple Example
• Flip a coin N times, with M heads. (Here N is the number of examples in the leaf, M is the
number incorrectly classified.) What is an upper bound for the probability p of heads for
the coin? Think visually about the binomial distribution, where we have N coin flips, and
how it changes as p changes:
• We want the upper bound to be as large as possible (the largest possible p;
it's an upper bound), but there still needs to be a reasonable probability of
getting as few errors as we got. In other words, we want the largest p for
which the probability of observing M or fewer heads out of N flips is still at
least α.
58. • We can calculate this numerically without a problem. So now if you give
me M and N, I can give you p.
• C4.5 uses α=0.25 by default.
• M, for a given branch, is how many misclassified examples are in the
branch.
• N, for a given branch, is just the number of examples in the branch, Sj.
• So we can calculate the upper bound on a branch, but it is still not clear
how to calculate the upper bound on a tree.
• Actually, we calculate an upper confidence bound on each branch of
the tree and average these bounds over the relative frequencies of
landing in each branch of the tree.
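A sketch of that upper confidence bound (pure Python; the function names are our own, not C4.5's): find the largest p for which observing M or fewer errors out of N still has probability at least α, by bisection on the binomial CDF.

```python
from math import comb

def binom_cdf(m, n, p):
    """P(X <= m) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m + 1))

def upper_bound(m, n, alpha=0.25):
    """Largest p with binom_cdf(m, n, p) >= alpha (C4.5 uses alpha = 0.25)."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(m, n, mid) >= alpha:   # CDF decreases as p grows
            lo = mid
        else:
            hi = mid
    return lo

# e.g. the bounds for 1/3, 5/17 and 9/32 incorrect from the earlier slide:
for m, n in [(1, 3), (5, 17), (9, 32)]:
    print(m, n, round(upper_bound(m, n), 3))
```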
59. Example
• Let's consider a dataset of 16 examples describing toys (from the
Kranf Site). We want to know if the toy is fun or not.
62. Post-pruning
• The aim of pruning is to discard parts of a classification model that describe
random variation in the training sample rather than true features of the
underlying domain.
• This makes the model more comprehensible to the user, and potentially
more accurate on new data that has not been used for training the
classifier.
• Statistical significance tests can be used to make pruning decisions in
classification models. Reduced-error pruning (Quinlan, 1987), a standard
algorithm for post-pruning decision trees, does not take statistical
significance into account, but it is known to be one of the fastest pruning
algorithms, producing trees that are both accurate and small.
63. • Reduced-error pruning generates smaller and more accurate decision
trees if pruning decisions are made using significance tests and the
significance level is chosen appropriately for each dataset.
• For a given amount of pruning, decision trees pruned using a
permutation test will be more accurate than those pruned using a
parametric test.
• If decision tree A is the result of pruning using a permutation test,
and decision tree B is the result of pruning using a parametric test,
and both trees have the same size, then A will be more accurate than
B on average.
64. Decision tree pruning and statistical
significance
• The figure below depicts an unpruned decision tree. We assume that a
class label has been attached to each node of the tree, for example,
by taking the majority class of the training instances reaching that
particular node. In the figure, there are two classes: A and B.
The tree can be used to predict the class of
a test instance by filtering it to the leaf
node corresponding to the instance’s
attribute values and assigning the class
label attached to that leaf.
65. • However, using an unpruned decision tree for classification
potentially “overfits” the training data. Consequently, it is advisable,
before the tree is put to use, to ascertain which parts truly reflect
effects present in the domain, and discard those that do not. This
process is called “pruning.”
• A general, fast, and easily applicable pruning method is “reduced-
error pruning” (Quinlan, 1987a). The idea is to hold out some of the
available instances—the “pruning set”—when the tree is built, and
prune the tree until the classification error on these independent
instances starts to increase. Because the instances in the pruning set
are not used for building the decision tree, they provide a less biased
estimate of its error rate on future instances than the training data.
67. • In each tree, the number of instances in the pruning data that are
misclassified by the individual nodes is given in parentheses. A
pruning operation involves replacing a subtree by a leaf. Reduced-
error pruning will perform this operation if it does not increase the
total number of classification errors. Traversing the tree in a bottom-
up fashion ensures that the result is the smallest pruned tree that has
minimum error on the pruning data.
68. • The CP (complexity parameter) is used to control tree growth. If the
cost of adding a variable is higher than the value of CP, then tree
growth stops.
# Base model
hr_base_model <- rpart(left ~ ., data = train, method = "class",
                       control = rpart.control(cp = 0))
summary(hr_base_model)
# Plot decision tree
plot(hr_base_model)
# Examine the complexity plot
printcp(hr_base_model)
plotcp(hr_base_model)  # optimal cp = 0.0084
69. #Postpruning
# Prune the hr_base_model based on the optimal cp value
hr_model_pruned <- prune(hr_base_model, cp = 0.0084 )
# Compute the accuracy of the pruned tree
test$pred <- predict(hr_model_pruned, test, type = "class")
accuracy_postprun <- mean(test$pred == test$left)
data.frame(base_accuracy, accuracy_preprun, accuracy_postprun)
The accuracy of the model on the test data is better when the tree is pruned, which
means that the pruned decision tree model generalizes well and is better suited for a
production environment. However, there are also other factors that can influence
decision tree model creation, such as building a tree on an unbalanced class. These
factors were not accounted for here, but it is very important for them to be examined
during a live model formulation.
70. Prepruning
• Prepruning is also known as early stopping. As the name suggests, the
criteria are set as parameter values while building the rpart model. Below are
some of the pre-pruning criteria that can be used. The tree stops growing when it
meets any of these pre-pruning criteria, or when it discovers pure classes.
• maxdepth: This parameter sets the maximum depth of a tree. Depth is
the length of the longest path from a root node to a leaf node. Setting this
parameter stops growing the tree when the depth equals the value set for
maxdepth.
• minsplit: The minimum number of records that must exist in a node for a split
to happen or be attempted. For example, if we set the minimum records in a split
to 5, then a node can be split further only when it contains more than 5 records.
• minbucket: The minimum number of records that can be present in a
terminal node. For example, setting the minimum records in a node to 5 means
that every terminal/leaf node should have at least five records. We should also
take care not to overfit the model by specifying this parameter. If it is set to
too small a value, like 1, we may run the risk of overfitting our model.
71. # Grow a tree with minsplit of 100 and max depth of 8
hr_model_preprun <- rpart(left ~ ., data = train, method = "class",
control = rpart.control(cp = 0, maxdepth = 8,
minsplit = 100))
# Compute the accuracy of the pruned tree
test$pred <- predict(hr_model_preprun, test, type = "class")
accuracy_preprun <- mean(test$pred == test$left)
82. How to split
• Order records according to one variable, say lot size
• Find midpoints between successive values
E.g. first midpoint is 14.4 (halfway between 14.0 and 14.8)
• Divide records into those with lotsize > 14.4 and those < 14.4
• After evaluating that split, try the next one, which is 15.4 (halfway
between 14.8 and 16.0)
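The midpoint enumeration above in code (Python sketch; 14.0, 14.8 and 16.0 are the first three lot sizes from the example):

```python
def split_candidates(values):
    """Midpoints between successive sorted values: the candidate split points."""
    v = sorted(values)
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

print(split_candidates([16.0, 14.0, 14.8]))   # [14.4, 15.4]
```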
83. Splitting criteria
• Regression: residual sum of squares
RSS = Σ_left (yi - yL*)^2 + Σ_right (yi - yR*)^2
where yL* = mean y-value for left node
and yR* = mean y-value for right node
• Classification: Gini criterion
Gini = NL · Σ_{k=1,…,K} pkL(1 - pkL) + NR · Σ_{k=1,…,K} pkR(1 - pkR)
where pkL = proportion of class k in left node
and pkR = proportion of class k in right node
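The regression criterion as code (Python sketch): the RSS of a candidate split is the sum of squared deviations from each node's mean.

```python
def rss(left_y, right_y):
    def node_rss(ys):
        # squared deviations from this node's mean prediction
        mean = sum(ys) / len(ys)
        return sum((y - mean) ** 2 for y in ys)
    return node_rss(left_y) + node_rss(right_y)

print(rss([1.0, 2.0, 3.0], [10.0, 12.0]))   # 2.0 + 2.0 = 4.0
```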
84. Note: Categorical Variables
• Examine all possible ways in which the categories can be split.
• E.g., categories A, B, C can be split 3 ways
{A} and {B, C}
{B} and {A, C}
{C} and {A, B}
• With many categories, # of splits becomes huge
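Enumerating those splits in code (Python sketch): pinning the first category to the left side counts each two-way split exactly once, giving 2^(k-1) - 1 splits for k categories.

```python
from itertools import combinations

def binary_splits(cats):
    """All ways to split categories into two nonempty groups."""
    first, rest = cats[0], cats[1:]
    splits = []
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            left = (first,) + extra
            right = tuple(c for c in cats if c not in left)
            if right:                       # skip the split with an empty side
                splits.append((left, right))
    return splits

print(len(binary_splits(("A", "B", "C"))))   # 3
```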
88. Applications of Decision Trees: XBox!
• Decision trees are in XBox
[J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, A. Blake. Real-Time Human Pose Recognition in Parts from Single Depth Images. CVPR 2011]
91. • Trained on million(s) of examples
• Results:
92. Decision Tree Classifier implementation in R
(http://dataaspirant.com/2017/02/03/decision-tree-classifier-
implementation-in-r/ )
• The R machine learning package caret (Classification And
REgression Training) holds tons of functions that help build
predictive models. It holds tools for data splitting, pre-processing,
feature selection, tuning, and supervised and unsupervised learning
algorithms, etc. It is similar to the sklearn library in Python.
• The installed caret package provides us direct access to various
functions for training our model with different machine learning
algorithms like kNN, SVM, decision trees, linear regression, etc.
93. Cars Evaluation Data Set
• The Cars Evaluation data set consists of 7 attributes: 6 feature attributes and 1
target attribute. All the attributes are categorical. We will try to build
a classifier for predicting the Class attribute. The target attribute is the 7th.
1 buying vhigh, high, med, low
2 maint vhigh, high, med, low
3 doors 2, 3, 4, 5, more
4 persons 2, 4, more
5 lug_boot small, med, big
6 safety low, med, high
7 Car Evaluation – Target Variable unacc, acc, good, vgood
• The goal is to model a classifier for evaluating the acceptability of a car using its given features.
95. head(car_df)
V1 V2 V3 V4 V5 V6 V7
1 vhigh vhigh 2 2 small low unacc
2 vhigh vhigh 2 2 small med unacc
3 vhigh vhigh 2 2 small high unacc
4 vhigh vhigh 2 2 med low unacc
5 vhigh vhigh 2 2 med med unacc
6 vhigh vhigh 2 2 med high unacc
Data Slicing
intrain <- createDataPartition(y = car_df$V7, p= 0.7, list = FALSE)
training <- car_df[intrain,]
testing <- car_df[-intrain,]
#check dimensions of train & test set
dim(training); dim(testing)
[1] 1211 7
[1] 517 7
96. • Preprocessing & Training
To check whether our data contains missing values or not, we can
use anyNA() method. Here, NA means Not Available.
anyNA(car_df)
[1] FALSE
summary(car_df)
V1 V2 V3 V4 V5 V6
high :432 high :432 2 :432 2 :576 big :576 high:576
low :432 low :432 3 :432 4 :576 med :576 low :576
med :432 med :432 4 :432 more:576 small:576 med :576
vhigh:432 vhigh:432 5more:432
V7
acc : 384
good : 69
unacc:1210
vgood: 65
97. • Training the Decision Tree classifier with criterion as information gain
Caret package provides train() method for training our data for various algorithms. We just need to pass
different parameter values for different algorithms. Before train() method, we will
first use trainControl() method. It controls the computational nuances of the train() method.
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(3333)
dtree_fit <- train(V7 ~., data = training, method = "rpart",
parms = list(split = "information"),
trControl=trctrl,
tuneLength = 10)
• We are setting 3 parameters of the trainControl() method. The “method” parameter holds the details
about the resampling method. We can set “method” with many values, like “boot”, “boot632”, “cv”,
“repeatedcv”, “LOOCV”, “LGOCV”, etc. Let's try to use repeatedcv, i.e., repeated cross-validation.
• The “number” parameter holds the number of resampling iterations. The “repeats” parameter
contains the complete sets of folds to compute for our repeated cross-validation. We are using
number = 10 and repeats = 3. This trainControl() method returns a list, which we are going to pass
on to our train() method.
(In the train() call above, the split parameter can also be “gini”.)
98. • Trained Decision Tree classifier results
We can check the result of our train() method by printing the dtree_fit variable. It shows us the
accuracy metrics for different values of cp. Here, cp is the complexity parameter of our dtree.
dtree_fit
CART
1211 samples
6 predictor
4 classes: 'acc', 'good', 'unacc', 'vgood'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 1091, 1090, 1091, 1089, 1089, 1089, ...
Resampling results across tuning parameters:
cp Accuracy Kappa
0.008928571 0.8483624 0.6791223
0.009615385 0.8467071 0.6745287
0.010989011 0.8365300 0.6487824
0.012362637 0.8266554 0.6253187
0.013736264 0.8219630 0.6128814
0.020604396 0.7961370 0.5540247
0.022893773 0.7980631 0.5600789
0.054945055 0.7746394 0.5307654
0.057692308 0.7724331 0.5305796
0.065934066 0.7322489 0.2893330
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.008928571.
100. • Prediction
Now, our model is trained with cp = 0.008928571. We are ready to
predict classes for our test set. We can use the predict() method. Let's try
to predict the target variable for the test set's 1st record.
testing[1,]
V1 V2 V3 V4 V5 V6 V7
6 vhigh vhigh 2 2 med high unacc
predict(dtree_fit, newdata = testing[1,])
[1] unacc
Levels: acc good unacc vgood
• For the 1st record of the testing data, the classifier predicts the class
variable as "unacc". Now it's time to predict the target variable for the
whole test set.
100
101. test_pred <- predict(dtree_fit, newdata = testing)
confusionMatrix(test_pred, testing$V7 ) #check accuracy
Reference
Prediction acc good unacc vgood
acc 84 8 27 2
good 7 6 0 1
unacc 17 5 336 0
vgood 7 1 0 16
Accuracy : 0.8549
95% CI : (0.8216, 0.8842)
No Information Rate : 0.7021
P-Value [Acc > NIR] : 3.563e-16
Kappa : 0.6839
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: acc Class: good Class: unacc Class: vgood
Sensitivity 0.7304 0.30000 0.9256 0.84211
Specificity 0.9080 0.98390 0.8571 0.98394
Pos Pred Value 0.6942 0.42857 0.9385 0.66667
Neg Pred Value 0.9217 0.97217 0.8302 0.99391
Prevalence 0.2224 0.03868 0.7021 0.03675
Detection Rate 0.1625 0.01161 0.6499 0.03095
Detection Prevalence 0.2340 0.02708 0.6925 0.04642
Balanced Accuracy 0.8192 0.64195 0.8914 0.91302
101
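As a sanity check, the overall accuracy and No Information Rate reported above can be recomputed by hand from the confusion matrix (rows are predictions, columns are the reference):

```r
# Confusion matrix from the slide: rows = Prediction, cols = Reference
cm <- matrix(c(84, 7, 17, 7,    # Reference: acc
               8, 6, 5, 1,      # Reference: good
               27, 0, 336, 0,   # Reference: unacc
               2, 1, 0, 16),    # Reference: vgood
             nrow = 4,
             dimnames = list(Prediction = c("acc", "good", "unacc", "vgood"),
                             Reference  = c("acc", "good", "unacc", "vgood")))

sum(diag(cm)) / sum(cm)     # overall accuracy: 442 / 517 = 0.8549
max(colSums(cm)) / sum(cm)  # No Information Rate: 363 / 517 = 0.7021
```

The No Information Rate is just the accuracy you would get by always predicting the most frequent reference class ("unacc").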
102. Training the Decision Tree classifier with criterion as gini index
Let's train a decision tree classifier using the Gini index as the splitting
criterion. Printing the fitted model will again show the accuracy metrics for
different values of cp, the complexity parameter of our decision tree.
set.seed(3333)
dtree_fit_gini <- train(V7 ~., data = training, method = "rpart",
parms = list(split = "gini"),
trControl=trctrl,
tuneLength = 10)
102
103. dtree_fit_gini
CART
1211 samples
6 predictor
4 classes: 'acc', 'good', 'unacc', 'vgood'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 1091, 1090, 1091, 1089, 1089, 1089, ...
Resampling results across tuning parameters:
cp Accuracy Kappa
0.01098901 0.8522395 0.6816530
0.01373626 0.8362745 0.6436379
0.01510989 0.8305029 0.6305745
0.01556777 0.8249840 0.6168644
0.01648352 0.8227709 0.6115286
0.01831502 0.8180553 0.6039963
0.02060440 0.8095423 0.5858712
0.02197802 0.8032220 0.5725628
0.06868132 0.7888755 0.5727260
0.09340659 0.7233582 0.2223118
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.01098901.
103
104. • Plot Decision Tree
We can visualize our decision tree by using the prp() function from the rpart.plot package.
prp(dtree_fit_gini$finalModel, box.palette = "Blues", tweak = 1.2)
104
105. Prediction
Now, our model is trained with cp = 0.01098901, and we are ready to predict classes for our test set. Let's predict the target
variable for the whole test set.
test_pred_gini <- predict(dtree_fit_gini, newdata = testing)
confusionMatrix(test_pred_gini, testing$V7 ) #check accuracy
Reference
Prediction acc good unacc vgood
acc 87 10 25 8
good 4 4 0 0
unacc 22 5 338 0
vgood 2 1 0 11
Overall Statistics
Accuracy : 0.8511
95% CI : (0.8174, 0.8806)
No Information Rate : 0.7021
P-Value [Acc > NIR] : 2.18e-15
Kappa : 0.6666
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: acc Class: good Class: unacc Class: vgood
Sensitivity 0.7565 0.200000 0.9311 0.57895
Specificity 0.8930 0.991952 0.8247 0.99398
Pos Pred Value 0.6692 0.500000 0.9260 0.78571
Neg Pred Value 0.9276 0.968566 0.8355 0.98410
Prevalence 0.2224 0.038685 0.7021 0.03675
Detection Rate 0.1683 0.007737 0.6538 0.02128
Detection Prevalence 0.2515 0.015474 0.7060 0.02708
Balanced Accuracy 0.8248 0.595976 0.8779 0.78646
In this case, our classifier with the Gini criterion is not giving
better results (accuracy 0.8511 vs. 0.8549 with the information criterion).
105
106. Regression Trees for Prediction
• Used with continuous outcome variable
• Procedure similar to classification tree
• Many splits attempted, choose the one that minimizes
impurity
106
107. Decision Tree - Regression
• A decision tree builds regression or classification models in the form of a
tree structure. It breaks a dataset down into smaller and smaller subsets
while, at the same time, an associated decision tree is incrementally
developed. The final result is a tree with decision nodes and leaf nodes.
A decision node (e.g., Outlook) has two or more branches (e.g., Sunny,
Overcast and Rainy), each representing a value of the attribute tested.
A leaf node (e.g., Hours Played) represents a decision on the numerical
target. The topmost decision node in a tree, which corresponds to the
best predictor, is called the root node. Decision trees can handle both
categorical and numerical data.
107
108. Advantages of trees
• Easy to use, understand
• Produce rules that are easy to interpret & implement
• Variable selection & reduction is automatic
• Do not require the assumptions of statistical models
• Can work without extensive handling of missing data
108
109. Disadvantages
• May not perform well where there is structure in the data that is not
well captured by horizontal or vertical splits
• Since the process deals with one variable at a time, there is no way to
capture interactions between variables
109
111. Decision Tree Algorithm
• The core algorithm for building decision trees, called ID3, was developed by
J. R. Quinlan; it employs a top-down, greedy search through the space
of possible branches with no backtracking. The ID3 algorithm can be
used to construct a decision tree for regression by replacing
Information Gain with Standard Deviation Reduction.
Standard Deviation
• A decision tree is built top-down from a root node and involves
partitioning the data into subsets that contain instances with similar
values (homogenous). We use standard deviation to calculate the
homogeneity of a numerical sample. If the numerical sample is
completely homogeneous its standard deviation is zero.
111
112. a) Standard deviation for one attribute:
• Standard Deviation (S) is used for tree building (branching).
• The Coefficient of Variation (CV) is used to decide when to stop
branching. We can use the Count (n) as well.
• The Average (Avg) is the value in the leaf nodes.
112
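These three quantities are easy to compute in base R. As an illustration, here they are for the classic "Hours Played" sample used in this tutorial (its standard deviation matches the 9.32 quoted on a later slide). Note that the tutorial uses the population standard deviation, while R's built-in sd() uses the sample (n - 1) form:

```r
# The "Hours Played" target values from the classic weather dataset
hours <- c(25, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30)

n   <- length(hours)                 # Count (n) = 14
avg <- mean(hours)                   # Average (Avg) - the leaf value
s   <- sqrt(mean((hours - avg)^2))   # population Standard Deviation (S) = 9.32
cv  <- s / avg * 100                 # Coefficient of Variation (CV) ~ 23%
```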
114. Standard Deviation Reduction
• The standard deviation reduction is based on the decrease in
standard deviation after a dataset is split on an attribute.
Constructing a decision tree is all about finding the attribute that returns
the highest standard deviation reduction (i.e., the most homogeneous
branches).
• Step 1: The standard deviation of the target is calculated.
Standard deviation (Hours Played) = 9.32
114
115. • Step 2: The dataset is then split on the different attributes. The
standard deviation for each branch is calculated. The resulting
standard deviation is subtracted from the standard deviation before
the split. The result is the standard deviation reduction.
115
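Steps 1 and 2 can be sketched as a small helper that computes the standard deviation reduction obtained by splitting a target y on a categorical attribute g (using the population form of the standard deviation, matching the tutorial's figures):

```r
# Population standard deviation
sd_pop <- function(x) sqrt(mean((x - mean(x))^2))

# Standard Deviation Reduction: S before the split minus the
# count-weighted standard deviations of the branches induced by g
sdr <- function(y, g) {
  s_after <- sum(tapply(y, g, function(v) (length(v) / length(y)) * sd_pop(v)))
  sd_pop(y) - s_after
}
```

The attribute with the largest sdr(y, g) is then chosen for the decision node: a perfect split (each branch homogeneous) reduces the weighted standard deviation to zero, giving the maximum possible SDR.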
116. • Step 3: The attribute with the largest standard deviation reduction is
chosen for the decision node.
• Step 4a: The dataset is divided based on the values of the selected
attribute. This process is run recursively on the non-leaf branches,
until all data is processed.
In practice, we need some termination
criteria: for example, when the coefficient
of variation (CV) for a branch becomes
smaller than a certain threshold (e.g.,
10%) and/or when too few instances
(n) remain in the branch (e.g., 3).
116
117. • Step 4b: "Overcast" subset does not need any further splitting
because its CV (8%) is less than the threshold (10%). The related leaf
node gets the average of the "Overcast" subset.
117
118. • Step 4c: However, the "Sunny" branch has a CV (28%) greater than the
threshold (10%), so it needs further splitting. We select "Windy" as
the best node after "Outlook" because it has the largest SDR.
118
119. • Because the number of data points for both branches (FALSE and
TRUE) is less than or equal to 3, we stop further branching and assign the
average of each branch to the related leaf node.
119
120. • Step 4d: Moreover, the "Rainy" branch has a CV (22%) which is more
than the threshold (10%), so this branch needs further splitting. We
select "Temp" as the best node because it has the largest SDR.
120
121. • Because the number of data points for all three branches (Cool, Hot
and Mild) is less than or equal to 3, we stop further branching and assign
the average of each branch to the related leaf node.
• When the number of instances is more than one at a leaf node we
calculate the average as the final value for the target.
121
122. Decision Trees for Regression-R Example
• Decision trees to predict whether the birth weights of infants will be low or not.
> library(MASS)
> library(rpart)
> head(birthwt)
low age lwt race smoke ptl ht ui ftv bwt
85 0 19 182 2 0 0 0 1 0 2523
86 0 33 155 3 0 0 0 0 3 2551
87 0 20 105 1 1 0 0 0 1 2557
88 0 21 108 1 1 0 0 1 2 2594
89 0 18 107 1 1 0 0 1 0 2600
91 0 21 124 3 0 0 0 0 0 2622
122
123. • low – indicator of whether the birth weight is less than 2.5kg
• age – mother’s age in years
• lwt – mother’s weight in pounds at last menstrual period
• race – mother’s race (1 = white, 2 = black, 3 = other)
• smoke – smoking status during pregnancy
• ptl – number of previous premature labours
• ht – history of hypertension
• ui – presence of uterine irritability
• ftv – number of physician visits during the first trimester
• bwt – birth weight in grams
• Let’s look at the distribution of infant weights:
hist(birthwt$bwt)
Most of the infants weigh between 2kg and 4kg.
123
124. • Let us look at the number of infants born with low weight.
table(birthwt$low)
0 1
130 59
This means that there are 130 infants weighing more than 2.5kg and 59
infants weighing less than 2.5kg. If we just guessed the most common
occurrence (> 2.5kg), our accuracy would be 130 / (130 + 59) = 68.78%.
Let’s see if we can improve upon this by building a prediction model.
124
125. Building the model
• In the dataset, all the variables are stored as numeric. Before we build our model, we need to convert the categorical
variables to factor.
cols <- c('low', 'race', 'smoke', 'ht', 'ui')
birthwt[cols] <- lapply(birthwt[cols], as.factor)
Next, let us split our dataset so that we have a training set and a testing set.
set.seed(1)
train <- sample(1:nrow(birthwt), 0.75 * nrow(birthwt))
Now, let us build the model. We will use the rpart function for this.
birthwtTree <- rpart(low ~ . - bwt, data = birthwt[train, ], method = 'class')
Since low = bwt <= 2.5, we exclude bwt from the model, and since it is a classification task, we specify method = 'class'.
Printing birthwtTree shows the fitted tree:
1) root 141 44 0 (0.6879433 0.3120567)
2) ptl< 0.5 117 30 0 (0.7435897 0.2564103)
4) lwt>=106 93 19 0 (0.7956989 0.2043011)
8) ht=0 86 15 0 (0.8255814 0.1744186) *
9) ht=1 7 3 1 (0.4285714 0.5714286) *
5) lwt< 106 24 11 0 (0.5416667 0.4583333)
10) age< 22.5 15 4 0 (0.7333333 0.2666667) *
11) age>=22.5 9 2 1 (0.2222222 0.7777778) *
3) ptl>=0.5 24 10 1 (0.4166667 0.5833333)
6) lwt>=131.5 7 2 0 (0.7142857 0.2857143) *
7) lwt< 131.5 17 5 1 (0.2941176 0.7058824) *
125
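The slides stop at the fitted tree. A sketch of the natural next step, not shown above, is to evaluate the tree on the held-out 25% of rows (the exact split, and hence the accuracy, may differ between R versions because the default sampling RNG changed in R 3.6):

```r
library(MASS)    # for the birthwt data
library(rpart)

# Same preparation as on the previous slides
cols <- c('low', 'race', 'smoke', 'ht', 'ui')
birthwt[cols] <- lapply(birthwt[cols], as.factor)

set.seed(1)
train <- sample(1:nrow(birthwt), 0.75 * nrow(birthwt))
birthwtTree <- rpart(low ~ . - bwt, data = birthwt[train, ], method = 'class')

# Predict on the rows not used for training and measure accuracy
birthwtPred <- predict(birthwtTree, birthwt[-train, ], type = 'class')
acc <- mean(birthwtPred == birthwt[-train, ]$low)
acc  # test-set accuracy, to compare against the 68.78% baseline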