DECISION TREES
https://ocw.mit.edu/courses/sloan-school-of-management/15-097-prediction-machine-learning-and-statistics-spring-2012/lecture-notes/MIT15_097S12_lec08.pdf
http://www.cogsys.wiai.uni-bamberg.de/teaching/ss05/ml/slides/cogsysII-3.pdf
1
Decision Trees
• A decision tree is an approach to predictive analysis that can help you
make decisions. Suppose, for example, that you need to decide
whether to invest a certain amount of money in one of three business
projects: a food-truck business, a restaurant, or a bookstore.
• A business analyst has worked out the rate of failure or success for
each of these business ideas as percentages and the profit you’d
make in each case.
2
Business Success Rate Failure Rate
Food Truck 60 percent 40 percent
Restaurant 52 percent 48 percent
Bookstore 50 percent 50 percent
Business Gain (USD) Loss (USD)
Food Truck 20,000 -7,000
Restaurant 40,000 -21,000
Bookstore 6,000 -1,000
From past statistical data shown, you can construct a decision tree as shown below.
3
• Using such a decision tree to decide on a business venture begins with calculating
the expected value for each alternative, a numerical score that helps you select
the best one.
• The expected value is calculated so that it accounts for all possible outcomes
of a decision. Calculating the expected value for the food-truck business idea looks
like this:
• Expected value of food-truck business = (60% × 20,000 USD) + (40% × -7,000 USD) = 9,200 USD
• Expected value of restaurant business = (52% × 40,000 USD) + (48% × -21,000 USD) = 10,720 USD
• Expected value of bookstore business = (50% × 6,000 USD) + (50% × -1,000 USD) = 2,500 USD
• Therefore the expected value becomes one of the criteria you figure into your
business decision-making. In this example, the expected values of the three
alternatives might incline you to favor investing in the restaurant business.
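The same expected values can be computed in R; this is a minimal sketch using the probabilities and payoffs from the tables above (the variable names are mine, for illustration):
p_success <- c(food_truck = 0.60, restaurant = 0.52, bookstore = 0.50)
gain <- c(food_truck = 20000, restaurant = 40000, bookstore = 6000)
loss <- c(food_truck = -7000, restaurant = -21000, bookstore = -1000)
# expected value = P(success) * gain + P(failure) * loss
p_success * gain + (1 - p_success) * loss   # 9200, 10720, 2500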
4
• Decision trees can also be used to visualize classification rules.
• A decision algorithm generates a decision tree that represents
classification rules.
• Decision Trees (DTs) are a supervised learning technique that predicts
values of responses by learning decision rules derived from features.
They can be used in both a regression and a classification context. For
this reason they are sometimes also referred to as Classification And
Regression Trees (CART).
5
Why trees?
• Interpretable/intuitive, popular in medical applications because they mimic
the way a doctor thinks
• Model discrete outcomes nicely
• Can be very powerful, can be as complex as you need them
• C4.5 and CART - from “top 10” - decision trees are very popular
• Some real examples (from Russell & Norvig, Mitchell)
• BP’s GasOIL system for separating gas and oil on offshore platforms - decision
trees replaced a hand-designed rules system with 2500 rules. C4.5-based
system outperformed human experts and saved BP millions. (1986)
• learning to fly a Cessna on a flight simulator by watching human experts fly
the simulator (1992)
• can also learn to play tennis, analyze C-section risk, etc.
6
Decision Tree Types
• Classification tree analysis is when the predicted outcome is the class to which
the data belongs. Iterative Dichotomiser 3 (ID3), C4.5, (Quinlan, 1986)
• Regression tree analysis is when the predicted outcome can be considered a real
number (e.g. the price of a house, or a patient’s length of stay in a hospital).
• Classification And Regression Tree (CART) analysis is used to refer to both of the
above procedures, first introduced by (Breiman et al., 1984)
• CHi-squared Automatic Interaction Detector (CHAID). Performs multi-level splits
when computing classification trees. (Kass, G. V. 1980).
• A Random Forest classifier uses a number of decision trees, in order to improve
the classification rate.
• Boosting Trees can be used for regression-type and classification-type problems.
• Used in data mining (most are included in R, see rpart and party packages, and in
Weka, Waikato Environment for Knowledge Analysis)
7
How to build a decision tree:
• Start at the top of the tree.
• Grow it by “splitting” attributes one by one. To determine which
attribute to split, look at “node impurity.”
• Assign leaf nodes the majority vote in the leaf.
• When we get to the bottom, prune the tree to prevent overfitting
8
Decision Tree Representation
• Classification of instances by sorting them down the tree from the root
to some leaf node
• node ≈ test of some attribute
• branch ≈ one of the possible values for the attribute
• Decision trees represent a disjunction of conjunctions of constraints on
the attribute values of instances i.e., (... ∧ ... ∧ ... ) ∨ (... ∧ ... ∧ ... ) ∨ ...
• Equivalent to a set of if-then-rules
• each branch represents one if-then-rule
• if-part: conjunctions of attribute tests on the nodes
• then-part: classification of the branch
9
• This decision tree is equivalent to:
• if (Outlook = Sunny ) ∧ (Humidity = Normal ) then Yes;
• if (Outlook = Overcast ) then Yes;
• if (Outlook = Rain ) ∧ (Wind = Weak ) then Yes;
Each internal node: test one attribute Xi
Each branch from a node: selects one value for Xi
Each leaf node: predict Y (or P(Y|X ∈ leaf))
10
Example: Will the customer wait for a table?
(from Russell&Norvig)
• Here are the attributes:
11
• Here are the examples:
12
• Here are two options for the first feature to split at the top of the tree.
Which one should we choose? Which one gives me the most information?
13
• What we need is a formula to compute information. Before we do
that, here's another example. Let's say we pick one of them (Patrons).
Maybe then we'll pick Hungry next, because it has a lot of
“information”:
14
Basics of Information Theory
I(P) = log2(1/P) = -log2(P), where P is the probability of an event.
15
• We want to define I so that it obeys all these things:
• I(p) ≥ 0; the information of any event is non-negative.
• I(1) = 0; there is no information from events with probability 1.
• I(p1 p2) = I(p1) + I(p2); the information from two independent events
should be the sum of their information.
• I(p) is continuous; slight changes in probability correspond to slight
changes in information.
• Together these lead to:
• I(p^2) = 2 I(p), and generally I(p^n) = n I(p); this means that
I(p) = I((p^(1/m))^m) = m I(p^(1/m)), so (1/m) I(p) = I(p^(1/m)).
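A short sketch of why these requirements force a logarithm (a standard argument, not from the original slides):
I(p^{n}) = n\,I(p) \quad\text{and}\quad I(p^{1/m}) = \tfrac{1}{m}\,I(p) \;\Rightarrow\; I(p^{n/m}) = \tfrac{n}{m}\,I(p),
and by continuity I(p^{a}) = a\,I(p) for every real a > 0. Writing any probability q as q = (1/2)^{-\log_2 q} gives
I(q) = -\log_2(q)\,I(1/2), and choosing the unit so that I(1/2) = 1 bit yields I(q) = -\log_2 q = \log_2(1/q).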
16
Why do we use logarithms?
Why is information measured with logarithms instead of just by the total
number of states?
• Mostly because it makes it additive. It's true that if you really wanted to,
you could choose to measure information or entropy by the total number
of states (usually called the "multiplicity"), instead of by the log of the
multiplicity. But then it would be multiplicative instead of additive. If you
have 10 bits and then you obtain another 10 bits of information, then you
have 20 bits.
• Saying the same thing in terms of multiplicity: if you have 2^10 = 1024
states and then you add another 1024 independent states then you have
1024*1024 = 1048576 states (2^20) when they are combined. Multiplicity
is multiplicative instead of additive, which means that the numbers you
need in order to keep track of it get very large very quickly! This is really
inconvenient, hence why we usually stick with using information/entropy
as the unit instead of multiplicity.
17
• Shannon (1948, p. 349) explains the convenience of the use of a
logarithmic function in the definition of the entropies: it is practically
useful because many important parameters in engineering vary
linearly with the logarithm of the number of possibilities; it is intuitive
because we are used to measuring magnitudes by linear comparison with
units of measurement; and it is mathematically more suitable because
many limiting operations in terms of the logarithm are simpler than in
terms of the number of possibilities. In turn, the choice of a
logarithmic base amounts to a choice of a unit for measuring
information.
• If the base 2 is used, the resulting unit is called a ‘bit’, a contraction of
‘binary unit’.
• With these definitions, one bit is the amount of information obtained
when one of two equally likely alternatives is specified.
18
Example of Calculating Information
Coin Toss
• A fair coin has two equally likely outcomes: heads (0.5) and tails (0.5).
• So whether you get heads or tails, you gain 1 bit of information,
through the following formula.
• I(head) = -log2(.5) = 1 bit
19
Another Example
• Balls in the bin
• The information you gain by choosing a ball from the bin is
calculated as follows.
• I(red ball) = - log2 (4/9) = 1.1699 bits
• I(yellow ball) = - log2 (2/9) = 2.1699 bits
• I(green ball) = - log2 (3/9) = 1.58496 bits
20
Entropy
21
How is the entropy equation derived?
I = total information from N occurrences
N = number of occurrences
N·p_i ≈ the expected number of times outcome i appears in N occurrences
• So when you compare the total information from N occurrences with the
entropy equation, the only thing that changes is the place of N.
• The N moves into a divisor, which means that Entropy = I/N. Therefore,
entropy is the average (expected) amount of information per occurrence
of an event.
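In symbols, with outcome probabilities p_i and base-2 logarithms, the argument on this slide is:
I = \sum_i (N p_i)\,\log_2\frac{1}{p_i}
\quad\Longrightarrow\quad
H = \frac{I}{N} = \sum_i p_i \log_2\frac{1}{p_i} = -\sum_i p_i \log_2 p_i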
22
Entropy
• The entropy of a variable is the "amount of information" contained in the
variable. This amount is determined not just by the number of different
values the variable can take on, but also by how likely each value is, just as
the information in an email is quantified by more than the number of words
in the email or the number of possible words in the language of the email.
• Informally, the amount of information in an email is proportional to the
amount of “surprise” its reading causes. For example, if an email is simply a
repeat of an earlier email, then it is not informative at all. On the other
hand, if say the email reveals the outcome of a cliff-hanger election, then it
is highly informative.
• Similarly, the information in a variable is tied to the amount of surprise that
value of the variable causes when revealed. Shannon’s entropy quantifies
the amount of information in a variable, thus providing the foundation for
a theory around the notion of information.
23
• Because entropy is a type of information, the easiest way to measure
information is in bits and bytes, rather than by the total number of possible
states they can represent.
• The basic unit of information is the bit, which represents 2 possible states.
If you have n bits, then that information represents 2^n possible states. For
example, a byte is 8 bits, therefore the number of states it represents is
2^8 = 256. This means that a byte can store any number between 0 and 255.
If you are given the total number of states, then you just take the log of
that number to get the amount of information, measured in bits:
log2(256) = 8.
• So entropy is defined as the log of the number of total microscopic states
corresponding to a particular macro state of thermodynamics. This is the
additional information you'd need to know in order to completely specify
the microstate, given knowledge of the macrostate.
24
Information and Entropy
• Let’s look at this example again…
• Calculating the entropy….
• In this example there are three possible outcomes when you choose a
ball: it can be red, yellow, or green (n = 3).
• So the equation is the following.
• Entropy = -(4/9) log2(4/9) - (2/9) log2(2/9) - (3/9) log2(3/9) = 1.5305
• Therefore, you expect to gain 1.5305 bits of information each time you
choose a ball from the bin.
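The same calculation in R, assuming the bin from the figure holds 4 red, 2 yellow, and 3 green balls:
p <- c(red = 4/9, yellow = 2/9, green = 3/9)
# self-information of each outcome, in bits
-log2(p)           # 1.1699, 2.1699, 1.5850
# entropy = expected self-information
-sum(p * log2(p))  # 1.5305 bits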
25
Information and Entropy
• Equation for calculating the range of Entropy:
0 ≤ Entropy ≤ log(n), where n is number of outcomes
• Entropy 0 (minimum entropy) occurs when one of the probabilities is
1 and rest are 0’s
• Entropy log(n) (maximum entropy) occurs when all the probabilities
have equal values of 1/n.
26
Shannon’s Entropy
• According to Shannon (1948; see also Shannon and Weaver 1949), a
general communication system consists of five parts:
− A source S, which generates the message to be received at the destination.
− A transmitter T, which turns the message generated at the source into a
signal to be transmitted. In the cases in which the information is encoded,
encoding is also implemented by this system.
− A channel CH, that is, the medium used to transmit the signal from the
transmitter to the receiver.
− A receiver R, which reconstructs the message from the signal.
− A destination D, which receives the message.
27
28
29
Mutual Information
• H(S;D) is the mutual information: the average amount of information
generated at the source S and received at the destination D.
• E is the equivocation: the average amount of information generated
at S but not received at D.
• N is the noise: the average amount of information received at D but
not generated at S.
• As the diagram clearly shows, the mutual information can be
computed as:
H(S;D) = H(S) − E = H(D) − N
30
Example
• We could measure how much space it takes to store X. Note that this
definition only makes sense if X is a random variable. If X is a random
variable, it has some distribution, and we can calculate the amount of
memory it takes to store X on average.
For X being a uniformly random n-bit string,
H(X) = \sum_x p(x)\log_2\frac{1}{p(x)} = \sum_x 2^{-n}\,\log_2 2^{n} = n
31
Conditional Entropy
• Let X and Y be two random variables. Then, expanding H(X, Y) gives
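(The expansion itself is an image on the original slide; the standard identity it depicts is:)
H(X,Y) = -\sum_{x,y} p(x,y)\log p(x,y)
       = -\sum_{x,y} p(x,y)\log\big(p(x)\,p(y\mid x)\big)
       = -\sum_{x} p(x)\log p(x) \;-\; \sum_{x,y} p(x,y)\log p(y\mid x)
       = H(X) + H(Y\mid X)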
32
Conditional Entropy
• Chain Rule: H(X, Y) = H(X) + H(Y|X).
• Chain Rule: H(X1, X2, ..., Xn) = H(X1) + H(X2|X1) + H(X3|X2, X1) + ... + H(Xn|Xn-1, ..., X1).
33
Entropy of a Joint Distribution
34
35
Back to C4.5 (source material:
Russell&Norvig,Mitchell,Quinlan)
• We consider a “test” split on attribute A at a branch.
• In S, we have #pos positives and #neg negatives. For each branch j,
we have #pos_j positives and #neg_j negatives.
36
37
38
39
Gain Ratio
The annotations on this slide's formula indicate that we want the information gain
(the numerator) to be large and the split information (the denominator) to be small.
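For reference, the standard C4.5 quantities the annotations refer to, for a split of S on attribute A with branches S_1, ..., S_J, are:
\mathrm{Gain}(S,A) = H(S) - \sum_{j=1}^{J}\frac{|S_j|}{|S|}\,H(S_j),
\qquad
\mathrm{SplitInfo}(S,A) = -\sum_{j=1}^{J}\frac{|S_j|}{|S|}\,\log_2\frac{|S_j|}{|S|},
\qquad
\mathrm{GainRatio}(S,A) = \frac{\mathrm{Gain}(S,A)}{\mathrm{SplitInfo}(S,A)}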
40
Keep splitting until:
• no more examples left (no point trying to split)
• all examples have the same class
• no more attributes to split
• For the restaurant example, we get this:
41
• Actually, it turns out that the class labels for the data were themselves generated
from a tree. So to get the label for an example, they fed it into a tree, and got the
label from the leaf. That tree is here:
42
• But the one we found is simpler!
• Does that mean our algorithm isn't doing a good job?
• There are alternatives to H([p, 1-p]); it is not the only impurity measure we
can use!
• One example is the Gini index 2p(1-p) used by CART.
• Another example is the misclassification error 1 - max(p, 1-p).
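A small R sketch comparing the three impurity measures for a two-class node with positive-class proportion p (the helper functions are illustrative, not from the original slides):
entropy_impurity <- function(p) ifelse(p %in% c(0, 1), 0, -p * log2(p) - (1 - p) * log2(1 - p))
gini_impurity    <- function(p) 2 * p * (1 - p)
misclass_error   <- function(p) 1 - pmax(p, 1 - p)
p <- seq(0, 1, by = 0.25)
rbind(entropy = entropy_impurity(p),
      gini    = gini_impurity(p),
      error   = misclass_error(p))
# all three are 0 for pure nodes (p = 0 or 1) and largest at p = 0.5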
43
C4.5 uses information gain for splitting, and CART uses the Gini index. (CART only
has binary splits.)
44
All three are similar, but cross-entropy and the Gini index are differentiable, and
hence more amenable to numerical optimization. Either the Gini index or cross-
entropy should be used when growing the tree.
• Inductive bias: Shorter trees are preferred to longer trees. Trees that
place high information gain attributes close to the root are also
preferred.
• Why prefer shorter hypotheses?
• Occam’s Razor: Prefer the simplest hypothesis that fits the data!
• see Minimum Description Length Principle (Bayesian Learning)
• e.g., if there are two decision trees, one with 500 nodes and another
with 5 nodes, the second one should be preferred ⇒ better chance to
avoid overfitting
45
Gini Index
• The Gini index says that if we select two items from a population at random,
then they must be of the same class; the probability of this is 1 if the
population is pure.
• It works with a categorical target variable such as “Success” or “Failure”.
• It performs only binary splits.
• The higher the value of Gini, the higher the homogeneity.
• CART (Classification and Regression Tree) uses Gini method to create
binary splits.
46
Steps to Calculate Gini for a split
1. Calculate Gini for the sub-nodes, using the formula: the sum of the squares of
the probabilities of success and failure, p^2 + (1-p)^2.
2. Calculate Gini for split using weighted Gini score of each node of
that split
47
Example:
• Let’s say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X),
and Height (5 to 6 ft). 15 out of these 30 play cricket in their leisure time. Now, we want to create a
model to predict who will play cricket during leisure time. In this problem, we need to segregate the
students who play cricket in their leisure time based on the most significant input variable
among the three.
Split on Gender:
Calculate, Gini for sub-node Female = (0.2)*(0.2)+(0.8)*(0.8)=0.68
Gini for sub-node Male = (0.65)*(0.65)+(0.35)*(0.35)=0.55
Calculate weighted Gini for Split Gender = (10/30)*0.68+(20/30)*0.55 = 0.59
Similar for Split on Class:
Gini for sub-node Class IX = (0.43)*(0.43)+(0.57)*(0.57)=0.51
Gini for sub-node Class X = (0.56)*(0.56)+(0.44)*(0.44)=0.51
Calculate weighted Gini for Split Class = (14/30)*0.51+(16/30)*0.51 = 0.51
Above, you can see that Gini score for Split on Gender is higher than Split on Class, hence, the node
split will take place on Gender.
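The same arithmetic in R, using the proportions from the example (a sketch, following the slide's convention that a higher score means a purer split):
gini_score <- function(p) p^2 + (1 - p)^2   # sum of squared class probabilities
# split on Gender: 10 girls (20% play cricket), 20 boys (65% play)
gini_gender <- (10/30) * gini_score(0.20) + (20/30) * gini_score(0.65)  # ~0.59
# split on Class: 14 in IX (43% play), 16 in X (56% play)
gini_class  <- (14/30) * gini_score(0.43) + (16/30) * gini_score(0.56)  # ~0.51
gini_gender > gini_class   # TRUE, so split on Gender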
48
Overfitting
49
Overfitting
reasons for overfitting:
• noise in the data
• number of training examples is too small to produce a representative
sample of the target function
how to avoid overfitting:
• stop the tree grow earlier, before it reaches the point where it perfectly
classifies the training data
• allow overfitting and then post-prune the tree (more successful in
practice!)
how to determine the perfect tree size:
• separate validation set to evaluate utility of post-pruning
• apply statistical test to estimate whether expanding (or pruning) produces
an improvement
50
Reduced error pruning
• Each of the decision nodes is considered a candidate for pruning
• Pruning a decision node consists of removing the subtree rooted at
the node, making it a leaf node and assigning the most common
classification of the training examples affiliated with that node
• Nodes are removed only if the resulting tree performs no worse than
the original tree over the validation set
• Pruning starts with the node whose removal most increases accuracy
and continues until further pruning is harmful
51
Reduced Error Pruning
• Effect of reduced error pruning:
• Any node added to coincidental regularities in the training set is likely to be pruned
52
Rule Post-Pruning
1. Infer the decision tree from the training set (Overfitting allowed!)
2. Convert the tree into a set of rules
3. Prune each rule by removing any preconditions that result in
improving its estimated accuracy
4. Sort the pruned rules by their estimated accuracy
• One method to estimate rule accuracy is to use a separate validation
set
• Pruning rules is more precise than pruning the tree itself
53
Pruning
• Let's start with C4.5's pruning. C4.5 recursively makes choices as to whether
to prune on an attribute:
• Option 1: leave the tree as is
• Option 2: replace that part of the tree with a leaf corresponding to the most
frequent label in the data S going to that part of the tree
• Option 3: replace that part of the tree with one of its subtrees,
corresponding to the most common branch in the split
• Demo: To figure out which decision to make, C4.5 computes upper bounds
on the probability of error for each option.
• We'll see how to do that shortly.
• Prob. of error for Option 1 → Upper Bound 1
• Prob. of error for Option 2 → Upper Bound 2
• Prob. of error for Option 3 → Upper Bound 3
54
• C4.5 chooses the option that has the lowest of these three upper
bounds. This ensures that the error rate is fairly low.
• e.g. which has the smallest upper bound:
• 1 incorrect out of 3
• 5 incorrect out of 17, or
• 9 incorrect out of 32?
• For each option, we count the number correct and the number
incorrect. We need upper confidence intervals on the proportion that
are incorrect. To calculate the upper bounds, calculate confidence
intervals on proportions.
55
Simple Example
• Flip a coin N times, with M heads. (Here N is the number of examples in the leaf, M is the
number incorrectly classified.) What is an upper bound for the probability p of heads for
the coin? Think visually about the binomial distribution, where we have N coin flips, and
how it changes as p changes:
• We want the upper bound to be as large as possible (the largest plausible p; it is
an upper bound), but under that p there must still be a reasonable probability of
getting as few errors as we got. In other words, we want:
56
57
• We can calculate this numerically without a problem. So now if you give
me M and N, I can give you p.
• C4.5 uses α=0.25 by default.
• M, for a given branch, is how many misclassified examples are in the
branch.
• N, for a given branch, is just the number of examples in the branch, Sj.
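One way to get such a bound in R is the one-sided exact binomial (Clopper-Pearson style) limit sketched below, applied to the three options listed a few slides back; this illustrates the idea rather than C4.5's exact formula, although C4.5 works from the same binomial relation with its default confidence factor of 0.25.
# largest error rate p for which observing at most M errors out of N
# still has probability at least CF (assumes M < N)
upper_bound <- function(M, N, CF = 0.25) {
  qbeta(1 - CF, M + 1, N - M)   # closed form of the exact binomial upper limit
}
upper_bound(1, 3)    # 1 incorrect out of 3
upper_bound(5, 17)   # 5 incorrect out of 17
upper_bound(9, 32)   # 9 incorrect out of 32 -- the smallest of the three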
• So we can calculate the upper bound on a branch, but it is still not clear
how to calculate the upper bound on a tree.
• Actually, we calculate an upper confidence bound on each branch of
the tree and average these bounds over the relative frequencies of landing
in each branch of the tree.
58
Example
• Let's consider a dataset of 16 examples describing toys (from the
Kranf Site). We want to know if the toy is fun or not.
59
60
61
Post-pruning
• The aim of pruning is to discard parts of a classification model that describe
random variation in the training sample rather than true features of the
underlying domain.
• This makes the model more comprehensible to the user, and potentially
more accurate on new data that has not been used for training the
classifier.
• Statistical significance tests can be used to make pruning decisions in
classification models. Reduced-error pruning (Quinlan, 1987), a standard
algorithm for post-pruning decision trees, does not take statistical
significance into account, but it is known to be one of the fastest pruning
algorithms, producing trees that are both accurate and small.
62
• Reduced-error pruning generates smaller and more accurate decision
trees if pruning decisions are made using significance tests and the
significance level is chosen appropriately for each dataset.
• For a given amount of pruning, decision trees pruned using a
permutation test will be more accurate than those pruned using a
parametric test.
• If decision tree A is the result of pruning using a permutation test,
and decision tree B is the result of pruning using a parametric test,
and both trees have the same size, then A will be more accurate than
B on average.
63
Decision tree pruning and statistical
significance
• Below figure depicts an unpruned decision tree. We assume that a
class label has been attached to each node of the tree, for example,
by taking the majority class of the training instances reaching that
particular node. In the figure, there are two classes: A and B.
The tree can be used to predict the class of
a test instance by filtering it to the leaf
node corresponding to the instance’s
attribute values and assigning the class
label attached to that leaf.
64
• However, using an unpruned decision tree for classification
potentially “overfits” the training data. Consequently, it is advisable,
before the tree is put to use, to ascertain which parts truly reflect
effects present in the domain, and discard those that do not. This
process is called “pruning.”
• A general, fast, and easily applicable pruning method is “reduced-
error pruning” (Quinlan, 1987a). The idea is to hold out some of the
available instances—the “pruning set”—when the tree is built, and
prune the tree until the classification error on these independent
instances starts to increase. Because the instances in the pruning set
are not used for building the decision tree, they provide a less biased
estimate of its error rate on future instances than the training data.
65
66
• In each tree, the number of instances in the pruning data that are
misclassified by the individual nodes are given in parentheses. A
pruning operation involves replacing a subtree by a leaf. Reduced-
error pruning will perform this operation if it does not increase the
total number of classification errors. Traversing the tree in a bottom-
up fashion ensures that the result is the smallest pruned tree that has
minimum error on the pruning data
67
• The CP (complexity parameter) is used to control tree growth. If the
cost of adding a variable is higher than the value of CP, then tree
growth stops.
#Base Model
# (assumes the rpart package is available and 'train'/'test' hold an HR dataset
# with the attrition outcome in column 'left')
library(rpart)
hr_base_model <- rpart(left ~ ., data = train, method = "class",
                       control = rpart.control(cp = 0))
summary(hr_base_model)
#Plot Decision Tree
plot(hr_base_model)
# Examine the complexity plot
printcp(hr_base_model)
plotcp(hr_base_model)   # the complexity plot suggests an optimal cp of about 0.0084
68
#Postpruning
# Prune the hr_base_model based on the optimal cp value
hr_model_pruned <- prune(hr_base_model, cp = 0.0084)
# Compute the accuracy of the pruned tree
test$pred <- predict(hr_model_pruned, test, type = "class")
accuracy_postprun <- mean(test$pred == test$left)
# base_accuracy and accuracy_preprun are assumed to have been computed earlier
data.frame(base_accuracy, accuracy_preprun, accuracy_postprun)
The accuracy of the model on the test data is better when the tree is pruned, which
means that the pruned decision tree model generalizes well and is more suited for a
production environment. However, there are also other factors that can influence
decision tree model creation, such as building a tree on an unbalanced class. These
factors were not accounted for here but it's very important for them to be examined
during a live model formulation.
69
Prepruning
• Prepruning is also known as early stopping. As the name suggests, the
criteria are set as parameter values while building the rpart model. Below are
some of the pre-pruning criteria that can be used. The tree stops growing when it
meets any of these pre-pruning criteria, or when it reaches pure classes.
• maxdepth: This parameter is used to set the maximum depth of a tree. Depth is
the length of the longest path from a Root node to a Leaf node. Setting this
parameter will stop growing the tree when the depth is equal the value set for
maxdepth.
• minsplit: It is the minimum number of records that must exist in a node for a split
to be attempted. For example, if we set the minimum records for a split to 5,
then a node is only considered for further splitting toward purity when it
contains at least 5 records.
• minbucket: It is the minimum number of records that can be present in a
Terminal node. For example, we set the minimum records in a node to 5, meaning
that every Terminal/Leaf node should have at least five records. We should also
take care of not overfitting the model by specifying this parameter. If it is set to a
too-small value, like 1, we may run the risk of overfitting our model.
70
# Grow a tree with minsplit of 100 and max depth of 8
hr_model_preprun <- rpart(left ~ ., data = train, method = "class",
control = rpart.control(cp = 0, maxdepth = 8,
minsplit = 100))
# Compute the accuracy of the pruned tree
test$pred <- predict(hr_model_preprun, test, type = "class")
accuracy_preprun <- mean(test$pred == test$left)
71
CART- Classification and Regression Trees
(Breiman, Friedman, Olshen, Stone, 1984)
72
CART decides which attributes to split and where to split them. In each leaf,
we're going to assign f(x) to be a constant.
73
74
75
https://www.quora.com/What-are-the-differences-between-ID3-C4-5-and-CART
Credit Risk Example
76
77
78
79
Example: Riding Mowers
• Goal: Classify 24 households as owning or not owning riding
mowers
• Predictors = Income, Lot Size
80
Income Lot_Size Ownership
60.0 18.4 owner
85.5 16.8 owner
64.8 21.6 owner
61.5 20.8 owner
87.0 23.6 owner
110.1 19.2 owner
108.0 17.6 owner
82.8 22.4 owner
69.0 20.0 owner
93.0 20.8 owner
51.0 22.0 owner
81.0 20.0 owner
75.0 19.6 non-owner
52.8 20.8 non-owner
64.8 17.2 non-owner
43.2 20.4 non-owner
84.0 17.6 non-owner
49.2 17.6 non-owner
59.4 16.0 non-owner
66.0 18.4 non-owner
47.4 16.4 non-owner
33.0 18.8 non-owner
51.0 14.0 non-owner
63.0 14.8 non-owner
81
How to split
• Order records according to one variable, say lot size
• Find midpoints between successive values
E.g. first midpoint is 14.4 (halfway between 14.0 and 14.8)
• Divide records into those with lotsize > 14.4 and those < 14.4
• After evaluating that split, try the next one, which is 15.4 (halfway
between 14.8 and 16.0)
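A quick sketch of generating the candidate split points in R, assuming the 24 records above are in a data frame called mowers:
lot <- sort(unique(mowers$Lot_Size))
# candidate split points: midpoints between successive distinct values
midpoints <- (head(lot, -1) + tail(lot, -1)) / 2
head(midpoints)   # 14.4 15.4 16.2 ...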
82
Splitting criteria
• Regression: residual sum of squares
RSS = \sum_{left} (y_i - y_L^*)^2 + \sum_{right} (y_i - y_R^*)^2
where y_L^* = mean y-value for the left node
y_R^* = mean y-value for the right node
• Classification: Gini criterion
Gini = N_L \sum_{k=1}^{K} p_{kL}(1 - p_{kL}) + N_R \sum_{k=1}^{K} p_{kR}(1 - p_{kR})
where p_{kL} = proportion of class k in the left node
p_{kR} = proportion of class k in the right node
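For the regression criterion, a minimal sketch of scoring one candidate split by RSS (x, y, and the threshold s are placeholders; the Gini version is analogous):
# residual sum of squares after splitting numeric predictor x at threshold s
rss_split <- function(x, y, s) {
  y_left  <- y[x <= s]
  y_right <- y[x >  s]
  sum((y_left - mean(y_left))^2) + sum((y_right - mean(y_right))^2)
}
# pick the candidate midpoint with the smallest RSS
# best_s <- midpoints[which.min(sapply(midpoints, function(s) rss_split(x, y, s)))]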
83
Note: Categorical Variables
• Examine all possible ways in which the categories can be split.
• E.g., categories A, B, C can be split 3 ways
{A} and {B, C}
{B} and {A, C}
{C} and {A, B}
• With many categories, # of splits becomes huge
84
The first split: Lot Size = 19,000
85
Second Split: Income = $84,000
86
After All Splits
87
Applications of Decision Trees: XBox!
• Decision trees are in XBox
[J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, A. Blake. Real-Time Human Pose
88
• Decision trees are in XBox: Classifying body parts
89
• Trained on million(s) of examples
90
• Trained on million(s) of examples
• Results:
91
Decision Tree Classifier implementation in R
(http://dataaspirant.com/2017/02/03/decision-tree-classifier-
implementation-in-r/ )
• The R machine learning package caret (Classification And
REgression Training) holds tons of functions that help to build
predictive models. It holds tools for data splitting, pre-processing,
feature selection, tuning, and supervised and unsupervised learning
algorithms, etc. It is similar to the sklearn library in Python.
• The installed caret package provides us direct access to various
functions for training our model with different machine learning
algorithms like Knn, SVM, decision tree, linear regression, etc.
92
Cars Evaluation Data Set
• The Cars Evaluation data set consists of 7 attributes, 6 as feature attributes and 1
as the target attribute. All the attributes are categorical. We will try to build
a classifier for predicting the Class attribute. The index of target attribute is 7th.
1 buying vhigh, high, med, low
2 maint vhigh, high, med,low
3 doors 2, 3, 4, 5 , more
4 persons 2, 4, more
5 lug_boot small, med, big.
6 safety low, med, high
7 Car Evaluation – Target Variable unacc, acc, good, vgood
• To model a classifier for evaluating the acceptability of car using its given features.
93
library(caret)
library(rpart.plot)
data_url <- c("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data")
download.file(url = data_url, destfile = "car.data")
car_df <- read.csv("car.data", sep = ',', header = FALSE)
str(car_df)
'data.frame': 1728 obs. of 7 variables:
$ V1: Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
$ V2: Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
$ V3: Factor w/ 4 levels "2","3","4","5more": 1 1 1 1 1 1 1 1 1 1 ...
$ V4: Factor w/ 3 levels "2","4","more": 1 1 1 1 1 1 1 1 1 2 ...
$ V5: Factor w/ 3 levels "big","med","small": 3 3 3 2 2 2 1 1 1 3 ...
$ V6: Factor w/ 3 levels "high","low","med": 2 3 1 2 3 1 2 3 1 2 ...
$ V7: Factor w/ 4 levels "acc","good","unacc",..: 3 3 3 3 3 3 3 3 3 3 ...
94
head(car_df)
V1 V2 V3 V4 V5 V6 V7
1 vhigh vhigh 2 2 small low unacc
2 vhigh vhigh 2 2 small med unacc
3 vhigh vhigh 2 2 small high unacc
4 vhigh vhigh 2 2 med low unacc
5 vhigh vhigh 2 2 med med unacc
6 vhigh vhigh 2 2 med high unacc
Data Slicing
intrain <- createDataPartition(y = car_df$V7, p= 0.7, list = FALSE)
training <- car_df[intrain,]
testing <- car_df[-intrain,]
#check dimensions of train & test set
dim(training); dim(testing)
[1] 1211 7
[1] 517 7
95
• Preprocessing & Training
To check whether our data contains missing values or not, we can
use anyNA() method. Here, NA means Not Available.
anyNA(car_df)
[1] FALSE
summary(car_df)
V1 V2 V3 V4 V5 V6
high :432 high :432 2 :432 2 :576 big :576 high:576
low :432 low :432 3 :432 4 :576 med :576 low :576
med :432 med :432 4 :432 more:576 small:576 med :576
vhigh:432 vhigh:432 5more:432
V7
acc : 384
good : 69
unacc:1210
vgood: 65
96
• Training the Decision Tree classifier with criterion as information gain
Caret package provides train() method for training our data for various algorithms. We just need to pass
different parameter values for different algorithms. Before train() method, we will
first use trainControl() method. It controls the computational nuances of the train() method.
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(3333)
dtree_fit <- train(V7 ~., data = training, method = "rpart",
parms = list(split = "information"),
trControl=trctrl,
tuneLength = 10)
• We are setting 3 parameters of trainControl() method. The “method” parameter holds the details
about resampling method. We can set “method” with many values like “boot”, “boot632”, “cv”,
“repeatedcv”, “LOOCV”, “LGOCV” etc. Let’s try to use repeatedcv i.e, repeated cross-validation.
• The “number” parameter holds the number of resampling iterations. The “repeats ” parameter
contains the complete sets of folds to compute for our repeated cross-validation. We are using setting
number =10 and repeats =3. This trainControl() methods returns a list. We are going to pass this on
our train() method.
97
(In the code above, the split criterion can also be set to “gini”.)
• Trained Decision Tree classifier results
We can check the result of our train() method by printing the dtree_fit variable. It shows the accuracy
metrics for different values of cp. Here, cp is the complexity parameter for our dtree.
dtree_fit
CART
1211 samples
6 predictor
4 classes: 'acc', 'good', 'unacc', 'vgood'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 1091, 1090, 1091, 1089, 1089, 1089, ...
Resampling results across tuning parameters:
cp Accuracy Kappa
0.008928571 0.8483624 0.6791223
0.009615385 0.8467071 0.6745287
0.010989011 0.8365300 0.6487824
0.012362637 0.8266554 0.6253187
0.013736264 0.8219630 0.6128814
0.020604396 0.7961370 0.5540247
0.022893773 0.7980631 0.5600789
0.054945055 0.7746394 0.5307654
0.057692308 0.7724331 0.5305796
0.065934066 0.7322489 0.2893330
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.008928571.
98
• Plot Decision Tree
prp(dtree_fit$finalModel, box.palette = "Reds", tweak = 1.2)
99
• Prediction
Now, our model is trained with cp = 0.008928571. We are ready to
predict classes for our test set. We can use predict() method. Let’s try
to predict target variable for test set’s 1st record.
testing[1,]
V1 V2 V3 V4 V5 V6 V7
6 vhigh vhigh 2 2 med high unacc
predict(dtree_fit, newdata = testing[1,])
[1] unacc
Levels: acc good unacc vgood
• For the 1st record of the testing data, the classifier predicts the class variable
as “unacc”. Now, it's time to predict the target variable for the whole test
set.
100
test_pred <- predict(dtree_fit, newdata = testing)
confusionMatrix(test_pred, testing$V7 ) #check accuracy
Reference
Prediction acc good unacc vgood
acc 84 8 27 2
good 7 6 0 1
unacc 17 5 336 0
vgood 7 1 0 16
Accuracy : 0.8549
95% CI : (0.8216, 0.8842)
No Information Rate : 0.7021
P-Value [Acc > NIR] : 3.563e-16
Kappa : 0.6839
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: acc Class: good Class: unacc Class: vgood
Sensitivity 0.7304 0.30000 0.9256 0.84211
Specificity 0.9080 0.98390 0.8571 0.98394
Pos Pred Value 0.6942 0.42857 0.9385 0.66667
Neg Pred Value 0.9217 0.97217 0.8302 0.99391
Prevalence 0.2224 0.03868 0.7021 0.03675
Detection Rate 0.1625 0.01161 0.6499 0.03095
Detection Prevalence 0.2340 0.02708 0.6925 0.04642
Balanced Accuracy 0.8192 0.64195 0.8914 0.91302
101
Training the Decision Tree classifier with criterion as gini index
Let's try to program a decision tree classifier using the gini index as the splitting
criterion. The output shows the accuracy metrics for different values of
cp. Here, cp is the complexity parameter for our dtree.
set.seed(3333)
dtree_fit_gini <- train(V7 ~., data = training, method = "rpart",
parms = list(split = "gini"),
trControl=trctrl,
tuneLength = 10)
102
dtree_fit_gini
CART
1211 samples
6 predictor
4 classes: 'acc', 'good', 'unacc', 'vgood'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 1091, 1090, 1091, 1089, 1089, 1089, ...
Resampling results across tuning parameters:
cp Accuracy Kappa
0.01098901 0.8522395 0.6816530
0.01373626 0.8362745 0.6436379
0.01510989 0.8305029 0.6305745
0.01556777 0.8249840 0.6168644
0.01648352 0.8227709 0.6115286
0.01831502 0.8180553 0.6039963
0.02060440 0.8095423 0.5858712
0.02197802 0.8032220 0.5725628
0.06868132 0.7888755 0.5727260
0.09340659 0.7233582 0.2223118
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.01098901.
103
• Plot Decision Tree
We can visualize our decision tree by using prp() method.
prp(dtree_fit_gini$finalModel, box.palette = "Blues", tweak = 1.2)
104
Prediction
Now, our model is trained with cp = 0.01098901. We are ready to predict classes for our test set. Now, it’s time to predict target
variable for the whole test set.
test_pred_gini <- predict(dtree_fit_gini, newdata = testing)
> confusionMatrix(test_pred_gini, testing$V7 ) #check accuracy
Reference
Prediction acc good unacc vgood
acc 87 10 25 8
good 4 4 0 0
unacc 22 5 338 0
vgood 2 1 0 11
Overall Statistics
Accuracy : 0.8511
95% CI : (0.8174, 0.8806)
No Information Rate : 0.7021
P-Value [Acc > NIR] : 2.18e-15
Kappa : 0.6666
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: acc Class: good Class: unacc Class: vgood
Sensitivity 0.7565 0.200000 0.9311 0.57895
Specificity 0.8930 0.991952 0.8247 0.99398
Pos Pred Value 0.6692 0.500000 0.9260 0.78571
Neg Pred Value 0.9276 0.968566 0.8355 0.98410
Prevalence 0.2224 0.038685 0.7021 0.03675
Detection Rate 0.1683 0.007737 0.6538 0.02128
Detection Prevalence 0.2515 0.015474 0.7060 0.02708
Balanced Accuracy 0.8248 0.595976 0.8779 0.78646
In this case, our classifier with criterion gini
index is not giving better results.
105
Regression Trees for Prediction
• Used with continuous outcome variable
• Procedure similar to classification tree
• Many splits attempted, choose the one that minimizes
impurity
106
Decision Tree - Regression
• A decision tree builds regression or classification models in the form of a
tree structure. It breaks down a dataset into smaller and smaller subsets
while at the same time an associated decision tree is incrementally
developed. The final result is a tree with decision nodes and leaf nodes.
A decision node (e.g., Outlook) has two or more branches (e.g., Sunny,
Overcast and Rainy), each representing a value of the attribute tested.
A leaf node (e.g., Hours Played) represents a decision on the numerical
target. The topmost decision node in a tree, which corresponds to the
best predictor, is called the root node. Decision trees can handle both
categorical and numerical data.
107
Advantages of trees
• Easy to use, understand
• Produce rules that are easy to interpret & implement
• Variable selection & reduction is automatic
• Do not require the assumptions of statistical models
• Can work without extensive handling of missing data
108
Disadvantages
• May not perform well where there is structure in the data that is not
well captured by horizontal or vertical splits
• Since the process deals with one variable at a time, no way to
capture interactions between variables
109
http://www.saedsayad.com/decision_tree_reg.htm
110
Decision Tree Algorithm
• The core algorithm for building decision trees, called ID3 and developed by J. R.
Quinlan, employs a top-down, greedy search through the space
of possible branches with no backtracking. The ID3 algorithm can be
used to construct a decision tree for regression by replacing
Information Gain with Standard Deviation Reduction.
Standard Deviation
• A decision tree is built top-down from a root node and involves
partitioning the data into subsets that contain instances with similar
values (homogenous). We use standard deviation to calculate the
homogeneity of a numerical sample. If the numerical sample is
completely homogeneous its standard deviation is zero.
111
a) Standard deviation for one attribute:
• Standard Deviation (S) is for tree building (branching).
• Coefficient of Deviation (CV) is used to decide when to stop
branching. We can use Count (n) as well.
• Average (Avg) is the value in the leaf nodes.
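For a branch containing n target values x_1, ..., x_n these quantities are (using the population form of the standard deviation, as in the source):
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
S = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2},
\qquad
CV = \frac{S}{\bar{x}} \times 100\%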
112
b) Standard deviation for two attributes (target and predictor):
113
Standard Deviation Reduction
• The standard deviation reduction is based on the decrease in
standard deviation after a dataset is split on an attribute.
Constructing a decision tree is all about finding the attribute that returns
the highest standard deviation reduction (i.e., the most homogeneous
branches).
• Step 1: The standard deviation of the target is calculated.
Standard deviation (Hours Played) = 9.32
114
• Step 2: The dataset is then split on the different attributes. The
standard deviation for each branch is calculated. The resulting
standard deviation is subtracted from the standard deviation before
the split. The result is the standard deviation reduction.
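In formula form, for a target T and an attribute X whose values c partition the data:
S(T, X) = \sum_{c \in X} P(c)\, S(c),
\qquad
SDR(T, X) = S(T) - S(T, X)
where S(T) is the standard deviation of the target before the split, P(c) is the fraction of records in branch c, and S(c) is the standard deviation of the target within that branch.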
115
• Step 3: The attribute with the largest standard deviation reduction is
chosen for the decision node.
• Step 4a: The dataset is divided based on the values of the selected
attribute. This process is run recursively on the non-leaf branches,
until all data is processed.
In practice, we need some termination
criteria. For example, when coefficient
of deviation (CV) for a branch becomes
smaller than a certain threshold (e.g.,
10%) and/or when too few instances
(n) remain in the branch (e.g., 3).
116
• Step 4b: "Overcast" subset does not need any further splitting
because its CV (8%) is less than the threshold (10%). The related leaf
node gets the average of the "Overcast" subset.
117
• Step 4c: However, the "Sunny" branch has a CV (28%) greater than the
threshold (10%), so it needs further splitting. We select "Windy" as
the best node after "Outlook" because it has the largest SDR.
118
• Because the number of data points for both branches (FALSE and
TRUE) is less than or equal to 3, we stop further branching and assign the
average of each branch to the related leaf node.
119
• Step 4d: Moreover, the "rainy" branch has a CV (22%) which is more
than the threshold (10%). This branch needs further splitting. We
select "Temp" as the best node because it has the largest SDR.
120
• Because the number of data points for all three branches (Cool, Hot
and Mild) is less than or equal to 3, we stop further branching and assign
the average of each branch to the related leaf node.
• When the number of instances is more than one at a leaf node we
calculate the average as the final value for the target.
121
Decision Trees for Regression-R Example
• Decision trees to predict whether the birth weights of infants will be low or not.
> library(MASS)
> library(rpart)
> head(birthwt)
low age lwt race smoke ptl ht ui ftv bwt
85 0 19 182 2 0 0 0 1 0 2523
86 0 33 155 3 0 0 0 0 3 2551
87 0 20 105 1 1 0 0 0 1 2557
88 0 21 108 1 1 0 0 1 2 2594
89 0 18 107 1 1 0 0 1 0 2600
91 0 21 124 3 0 0 0 0 0 2622
122
• low – indicator of whether the birth weight is less than 2.5kg
• age – mother’s age in year
• lwt – mother’s weight in pounds at last menstrual period
• race – mother’s race (1 = white, 2 = black, 3 = other)
• smoke – smoking status during pregnancy
• ptl – number of previous premature labours
• ht – history of hypertension
• ui – presence of uterine irritability
• ftv – number of physician visits during the first trimester
• bwt – birth weight in grams
• Let’s look at the distribution of infant weights:
hist(birthwt$bwt)
Most of the infants weigh between 2kg and 4kg.
123
• let us look at the number of infants born with low weight.
table(birthwt$low)
0 1
130 59
This means that there are 130 infants weighing more than 2.5kg and 59
infants weighing less than 2.5kg. If we just guessed the most common
occurrence (> 2.5kg), our accuracy would be 130 / (130 + 59) = 68.78%.
Let’s see if we can improve upon this by building a prediction model.
124
Building the model
• In the dataset, all the variables are stored as numeric. Before we build our model, we need to convert the categorical
variables to factor.
cols <- c('low', 'race', 'smoke', 'ht', 'ui')
birthwt[cols] <- lapply(birthwt[cols], as.factor)
Next, let us split our dataset so that we have a training set and a testing set.
set.seed(1)
train <- sample(1:nrow(birthwt), 0.75 * nrow(birthwt))
Now, let us build the model. We will use the rpart function for this.
birthwtTree <- rpart(low ~ . - bwt, data = birthwt[train, ], method = 'class')
Since low = bwt <= 2.5, we exclude bwt from the model, and since it is a classification task, we specify method = 'class'. Let’s
take a look at the tree.
1) root 141 44 0 (0.6879433 0.3120567)
2) ptl< 0.5 117 30 0 (0.7435897 0.2564103)
4) lwt>=106 93 19 0 (0.7956989 0.2043011)
8) ht=0 86 15 0 (0.8255814 0.1744186) *
9) ht=1 7 3 1 (0.4285714 0.5714286) *
5) lwt< 106 24 11 0 (0.5416667 0.4583333)
10) age< 22.5 15 4 0 (0.7333333 0.2666667) *
11) age>=22.5 9 2 1 (0.2222222 0.7777778) *
3) ptl>=0.5 24 10 1 (0.4166667 0.5833333)
6) lwt>=131.5 7 2 0 (0.7142857 0.2857143) *
7) lwt< 131.5 17 5 1 (0.2941176 0.7058824) *
125
plot(birthwtTree)
text(birthwtTree, pretty = 0)
126
• To predict the birth weight itself (a regression tree this time), we can run this code:
birthwtTree_reg <- rpart(bwt ~ . , data = birthwt[train, ])
plot(birthwtTree_reg)
text(birthwtTree_reg, pretty = 0)
1) root 141 79639400 2944.291
2) low=1 44 7755010 2073.614
4) age>=23.5 19 5147718 1859.000 *
5) age< 23.5 25 1067079 2236.720 *
3) low=0 97 23398620 3339.237
6) lwt< 109.5 19 2784704 2983.263 *
7) lwt>=109.5 78 17619810 3425.949
14) race=2,3 33 4872606 3252.061 *
15) race=1 45 11017640 3553.467
30) lwt>=123.5 33 4761066 3459.515 *
31) lwt< 123.5 12 5164248 3811.833 *
127
summary(birthwtTree_reg)
# make predictions
predictions <- predict(birthwtTree_reg, birthwt[,1:9])
# summarize accuracy
mse <- mean((birthwt$bwt - predictions)^2)
print(mse)
[1] 177777.4
birthwtTree_lm <- lm(bwt ~ . , data = birthwt[train, ])
summary(birthwtTree_lm)
# make predictions
predictions <- predict(birthwtTree_lm, birthwt[,1:9])
# summarize accuracy
mse <- mean((birthwt$bwt - predictions)^2)
print(mse)
[1] 178757.8
128
More Related Content

Similar to DECISION TREEScbhwbfhebfyuefyueye7yrue93e939euidhcn xcnxj.ppt

Interpreting Data Like a Pro - Dawn of the Data Age Lecture Series
Interpreting Data Like a Pro - Dawn of the Data Age Lecture SeriesInterpreting Data Like a Pro - Dawn of the Data Age Lecture Series
Interpreting Data Like a Pro - Dawn of the Data Age Lecture SeriesLuciano Pesci, PhD
 
1.0 Descriptive statistics.pdf
1.0 Descriptive statistics.pdf1.0 Descriptive statistics.pdf
1.0 Descriptive statistics.pdfthaersyam
 
Machine Learning part 3 - Introduction to data science
Machine Learning part 3 - Introduction to data science Machine Learning part 3 - Introduction to data science
Machine Learning part 3 - Introduction to data science Frank Kienle
 
CSA 3702 machine learning module 2
CSA 3702 machine learning module 2CSA 3702 machine learning module 2
CSA 3702 machine learning module 2Nandhini S
 
Random forest sgv_ai_talk_oct_2_2018
Random forest sgv_ai_talk_oct_2_2018Random forest sgv_ai_talk_oct_2_2018
Random forest sgv_ai_talk_oct_2_2018digitalzombie
 
Result analysis of mining fast frequent itemset using compacted data
Result analysis of mining fast frequent itemset using compacted dataResult analysis of mining fast frequent itemset using compacted data
Result analysis of mining fast frequent itemset using compacted dataijistjournal
 
Result Analysis of Mining Fast Frequent Itemset Using Compacted Data
Result Analysis of Mining Fast Frequent Itemset Using Compacted DataResult Analysis of Mining Fast Frequent Itemset Using Compacted Data
Result Analysis of Mining Fast Frequent Itemset Using Compacted Dataijistjournal
 
Practical deep learning for computer vision
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer visionEran Shlomo
 
Decision trees
Decision treesDecision trees
Decision treesNcib Lotfi
 
Marketing Research Approaches .docx
Marketing Research Approaches .docxMarketing Research Approaches .docx
Marketing Research Approaches .docxalfredacavx97
 
CS3114_09212011.ppt
CS3114_09212011.pptCS3114_09212011.ppt
CS3114_09212011.pptArumugam90
 
Lesson 6 measures of central tendency
Lesson 6 measures of central tendencyLesson 6 measures of central tendency
Lesson 6 measures of central tendencynurun2010
 

Similar to DECISION TREEScbhwbfhebfyuefyueye7yrue93e939euidhcn xcnxj.ppt (20)

Decision Tree.pptx
Decision Tree.pptxDecision Tree.pptx
Decision Tree.pptx
 
Interpreting Data Like a Pro - Dawn of the Data Age Lecture Series
Interpreting Data Like a Pro - Dawn of the Data Age Lecture SeriesInterpreting Data Like a Pro - Dawn of the Data Age Lecture Series
Interpreting Data Like a Pro - Dawn of the Data Age Lecture Series
 
Descriptive Analytics: Data Reduction
 Descriptive Analytics: Data Reduction Descriptive Analytics: Data Reduction
Descriptive Analytics: Data Reduction
 
Data analysis01 singlevariable
Data analysis01 singlevariableData analysis01 singlevariable
Data analysis01 singlevariable
 
Decision theory & decisiontrees
Decision theory & decisiontreesDecision theory & decisiontrees
Decision theory & decisiontrees
 
1.0 Descriptive statistics.pdf
1.0 Descriptive statistics.pdf1.0 Descriptive statistics.pdf
1.0 Descriptive statistics.pdf
 
Lecture_note1.pdf
Lecture_note1.pdfLecture_note1.pdf
Lecture_note1.pdf
 
Machine Learning part 3 - Introduction to data science
Machine Learning part 3 - Introduction to data science Machine Learning part 3 - Introduction to data science
Machine Learning part 3 - Introduction to data science
 
CSA 3702 machine learning module 2
CSA 3702 machine learning module 2CSA 3702 machine learning module 2
CSA 3702 machine learning module 2
 
Random forest sgv_ai_talk_oct_2_2018
Random forest sgv_ai_talk_oct_2_2018Random forest sgv_ai_talk_oct_2_2018
Random forest sgv_ai_talk_oct_2_2018
 
Result analysis of mining fast frequent itemset using compacted data
Result analysis of mining fast frequent itemset using compacted dataResult analysis of mining fast frequent itemset using compacted data
Result analysis of mining fast frequent itemset using compacted data
 
Result Analysis of Mining Fast Frequent Itemset Using Compacted Data
Result Analysis of Mining Fast Frequent Itemset Using Compacted DataResult Analysis of Mining Fast Frequent Itemset Using Compacted Data
Result Analysis of Mining Fast Frequent Itemset Using Compacted Data
 
algo 1.ppt
algo 1.pptalgo 1.ppt
algo 1.ppt
 
Practical deep learning for computer vision
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer vision
 
Decision trees
Decision treesDecision trees
Decision trees
 
Marketing Research Approaches .docx
Marketing Research Approaches .docxMarketing Research Approaches .docx
Marketing Research Approaches .docx
 
Hiding slides
Hiding slidesHiding slides
Hiding slides
 
CS3114_09212011.ppt
CS3114_09212011.pptCS3114_09212011.ppt
CS3114_09212011.ppt
 
Decision theory
Decision theoryDecision theory
Decision theory
 
Lesson 6 measures of central tendency
Lesson 6 measures of central tendencyLesson 6 measures of central tendency
Lesson 6 measures of central tendency
 

Recently uploaded

Call Girls In Safdarjung Enclave 24/7✡️9711147426✡️ Escorts Service
Call Girls In Safdarjung Enclave 24/7✡️9711147426✡️ Escorts ServiceCall Girls In Safdarjung Enclave 24/7✡️9711147426✡️ Escorts Service
Call Girls In Safdarjung Enclave 24/7✡️9711147426✡️ Escorts Servicejennyeacort
 
Call Us ✡️97111⇛47426⇛Call In girls Vasant Vihar༒(Delhi)
Call Us ✡️97111⇛47426⇛Call In girls Vasant Vihar༒(Delhi)Call Us ✡️97111⇛47426⇛Call In girls Vasant Vihar༒(Delhi)
Call Us ✡️97111⇛47426⇛Call In girls Vasant Vihar༒(Delhi)jennyeacort
 
Kala jadu for love marriage | Real amil baba | Famous amil baba | kala jadu n...
Kala jadu for love marriage | Real amil baba | Famous amil baba | kala jadu n...Kala jadu for love marriage | Real amil baba | Famous amil baba | kala jadu n...
Kala jadu for love marriage | Real amil baba | Famous amil baba | kala jadu n...babafaisel
 
call girls in Harsh Vihar (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Harsh Vihar (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Harsh Vihar (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Harsh Vihar (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
WAEC Carpentry and Joinery Past Questions
WAEC Carpentry and Joinery Past QuestionsWAEC Carpentry and Joinery Past Questions
WAEC Carpentry and Joinery Past QuestionsCharles Obaleagbon
 
Call Girls Bapu Nagar 7397865700 Ridhima Hire Me Full Night
Call Girls Bapu Nagar 7397865700 Ridhima Hire Me Full NightCall Girls Bapu Nagar 7397865700 Ridhima Hire Me Full Night
Call Girls Bapu Nagar 7397865700 Ridhima Hire Me Full Nightssuser7cb4ff
 
3D Printing And Designing Final Report.pdf
3D Printing And Designing Final Report.pdf3D Printing And Designing Final Report.pdf
3D Printing And Designing Final Report.pdfSwaraliBorhade
 
Abu Dhabi Call Girls O58993O4O2 Call Girls in Abu Dhabi`

  • 7. Decision Tree Types • Classification tree analysis is when the predicted outcome is the class to which the data belongs. Iterative Dichotomiser 3 (ID3), C4.5, (Quinlan, 1986) • Regression tree analysis is when the predicted outcome can be considered a real number (e.g. the price of a house, or a patient’s length of stay in a hospital). • Classification And Regression Tree (CART) analysis is used to refer to both of the above procedures, first introduced by (Breiman et al., 1984) • CHi-squared Automatic Interaction Detector (CHAID). Performs multi-level splits when computing classification trees. (Kass, G. V. 1980). • A Random Forest classifier uses a number of decision trees, in order to improve the classification rate. • Boosting Trees can be used for regression-type and classification-type problems. • Used in data mining (most are included in R, see rpart and party packages, and in Weka, Waikato Environment for Knowledge Analysis) 7
  • 8. How to build a decision tree: • Start at the top of the tree. • Grow it by “splitting” attributes one by one. To determine which attribute to split, look at “node impurity.” • Assign leaf nodes the majority vote in the leaf. • When we get to the bottom, prune the tree to prevent overfitting 8
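As a minimal sketch of this grow-then-prune recipe (an illustration only, using R's rpart package and the built-in iris data, not the exact algorithm discussed on the later slides):

# Grow a full tree by repeatedly splitting on the attribute that most reduces node impurity,
# then prune it back; each leaf predicts the majority class of the examples reaching it.
library(rpart)
full_tree <- rpart(Species ~ ., data = iris, method = "class",
                   control = rpart.control(cp = 0, minsplit = 2))
# Prune back to the complexity that minimizes cross-validated error
best_cp <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(full_tree, cp = best_cp)
print(pruned_tree)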
  • 9. Decision Tree Representation • Classification of instances by sorting them down the tree from the root to some leaf node • node ≈ test of some attribute • branch ≈ one of the possible values for the attribute • Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances, i.e., (... ∧ ... ∧ ... ) ∨ (... ∧ ... ∧ ... ) ∨ ... • Equivalent to a set of if-then rules • each branch represents one if-then rule • if-part: conjunction of the attribute tests on the nodes • then-part: classification of the branch 9
  • 10. • This decision tree is equivalent to: • if (Outlook = Sunny ) ∧ (Humidity = Normal ) then Yes; • if (Outlook = Overcast ) then Yes; • if (Outlook = Rain ) ∧ (Wind = Weak ) then Yes; Each internal node: test one attribute Xi Each branch from a node: selects one value for Xi Each leaf node: predict Y (or P(Y|X ∈ leaf)) 10
  • 11. Example: Will the customer wait for a table? (from Russell&Norvig) • Here are the attributes: 11
  • 12. • Here are the examples: 12
  • 13. • Here are two options for the first feature to split at the top of the tree. Which one should we choose? Which one gives me the most information? 13
  • 14. • What we need is a formula to compute information. Before we do that, here's another example. Let's say we pick one of them (Patrons). Maybe then we'll pick Hungry next, because it has a lot of “information”: 14
  • 15. Basics of Information Theory: the information of an event is I(p) = −log2(p), where p is the probability of the event. 15
  • 16. • We want to define I so that it obeys all these things: • I(p) ≥ 0 and I(1) = 0; the information of any event is non-negative, and there is no information from events with probability 1. • I(p1·p2) = I(p1) + I(p2); the information from two independent events should be the sum of their information. • I(p) is continuous; slight changes in probability correspond to slight changes in information. • Together these lead to: I(p^2) = 2·I(p), or generally I(p^n) = n·I(p); this means that I(p) = I((p^(1/m))^m) = m·I(p^(1/m)), so (1/m)·I(p) = I(p^(1/m)). 16
  • 17. Why we use log? Why is information measured with logarithms instead of just by the total number of states? • Mostly because it makes it additive. It's true that if you really wanted to, you could choose to measure information or entropy by the total number of states (usually called the "multiplicity"), instead of by the log of the multiplicity. But then it would be multiplicative instead of additive. If you have 10 bits and then you obtain another 10 bits of information, then you have 20 bits. • Saying the same thing in terms of multiplicity: if you have 2^10 = 1024 states and then you add another 1024 independent states then you have 1024*1024 = 1048576 states (2^20) when they are combined. Multiplicity is multiplicative instead of additive, which means that the numbers you need in order to keep track of it get very large very quickly! This is really inconvenient, hence why we usually stick with using information/entropy as the unit instead of multiplicity. 17
  • 18. • Shannon (1948, p. 349) explains the convenience of using a logarithmic function in the definition of the entropies: it is practically useful because many important parameters in engineering vary linearly with the logarithm of the number of possibilities; it is intuitive because we are used to measuring magnitudes by linear comparison with units of measurement; and it is mathematically more suitable because many limiting operations are simpler in terms of the logarithm than in terms of the number of possibilities. In turn, the choice of a logarithmic base amounts to a choice of a unit for measuring information. • If base 2 is used, the resulting unit is called the ‘bit’ – a contraction of binary unit. • With these definitions, one bit is the amount of information obtained when one of two equally likely alternatives is specified. 18
  • 19. Example of Calculating Information Coin Toss • There are two probabilities in fair coin, which are head(.5) and tail(.5). • So if you get either head or tail you will get 1 bit of information through following formula. • I(head) = -log2 (.5) = 1 bit 19
  • 20. Another Example • Balls in the bin • The information you will get by choosing a ball from the bin are calculated as following. • I(red ball) = - log2 (4/9) = 1.1699 bits • I(yellow ball) = - log2 (2/9) = 2.1699 bits • I(green ball) = - log2 (3/9) = 1.58496 bits 20
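These values are easy to check in base R (the 4/2/3 red/yellow/green counts come from the example above):

# Self-information I(p) = -log2(p) for each colour in the 9-ball bin
info <- function(p) -log2(p)
info(4/9)   # red:    1.169925 bits
info(2/9)   # yellow: 2.169925 bits
info(3/9)   # green:  1.584963 bits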
  • 22. How was the entropy equation derived? I = total information from N occurrences; N = number of occurrences; (N·pi) = approximate number of times outcome i comes out in N occurrences. • So when you compare the total information from N occurrences with the entropy equation, the only thing that changes is the place of N. • The N is moved to the right-hand side, which means that I/N is the entropy. Therefore, entropy is the average (expected) amount of information in a certain event. 22
  • 23. Entropy • The entropy of a variable is the "amount of information" contained in the variable. This amount is determined not just by the number of different values the variable can take on, just as the information in an email is quantified not just by the number of words in the email or the different possible words in the language of the email. • Informally, the amount of information in an email is proportional to the amount of “surprise” its reading causes. For example, if an email is simply a repeat of an earlier email, then it is not informative at all. On the other hand, if say the email reveals the outcome of a cliff-hanger election, then it is highly informative. • Similarly, the information in a variable is tied to the amount of surprise that value of the variable causes when revealed. Shannon’s entropy quantifies the amount of information in a variable, thus providing the foundation for a theory around the notion of information. 23
  • 24. • Because entropy is a type of information, the easiest way to measure information is in bits and bytes, rather than by the total number of possible states they can represent. • The basic unit of information is the bit, which represents 2 possible states. If you have n bits, then that information represents 2^n possible states. For example, a byte is 8 bits, therefore the number of states it represents is 2^8 = 256. This means that a byte can store any number between 0 and 255. If you are given the total number of states, then you just take the log of that number to get the amount of information, measured in bits: log2(256) = 8. • So entropy is defined as the log of the number of total microscopic states corresponding to a particular macro state of thermodynamics. This is the additional information you'd need to know in order to completely specify the microstate, given knowledge of the macrostate. 24
  • 25. Information and Entropy • Let's look at this example again… • Calculating the entropy…. • In this example there are three possible outcomes when you choose a ball: it can be either red, yellow, or green (n = 3). • So the equation is the following: Entropy = −(4/9) log2(4/9) − (2/9) log2(2/9) − (3/9) log2(3/9) = 1.5305 • Therefore, you are expected to get 1.5305 bits of information each time you choose a ball from the bin. 25
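The same bin's entropy can be verified in a couple of lines of R, as the probability-weighted average of the self-information values:

# Entropy H = -sum(p * log2(p)) for the 9-ball bin
p <- c(red = 4/9, yellow = 2/9, green = 3/9)
H <- -sum(p * log2(p))
H          # 1.530493 bits per draw
log2(3)    # 1.584963, the maximum possible entropy for 3 outcomes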
  • 26. Information and Entropy • Equation for calculating the range of Entropy: 0 ≤ Entropy ≤ log(n), where n is number of outcomes • Entropy 0 (minimum entropy) occurs when one of the probabilities is 1 and rest are 0’s • Entropy log(n) (maximum entropy) occurs when all the probabilities have equal values of 1/n. 26
  • 27. Shannon’s Entropy • According to Shannon (1948; see also Shannon and Weaver 1949), a general communication system consists of five parts: − A source S, which generates the message to be received at the destination. − A transmitter T, which turns the message generated at the source into a signal to be transmitted. In the cases in which the information is encoded, encoding is also implemented by this system. − A channel CH, that is, the medium used to transmit the signal from the transmitter to the receiver. − A receiver R, which reconstructs the message from the signal. − A destination D, which receives the message. 27
  • 28. 28
  • 29. 29
  • 30. Mutual Information • H(S;D) is the mutual information: the average amount of information generated at the source S and received at the destination D. • E is the equivocation: the average amount of information generated at S but not received at D. • N is the noise: the average amount of information received at D but not generated at S. • As the diagram clearly shows, the mutual information can be computed as: H(S;D) = H(S) − E = H(D) − N 30
  • 31. Example • We could measure how much space it takes to store X. Note that this definition only makes sense if X is a random variable. If X is a random variable, it has some distribution, and we can calculate the amount of memory it takes to store X on average. For X being a uniformly random n-bit string, H(X) = Σ_x p(x) log2(1/p(x)) = Σ_x 2^(−n) · log2(2^n) = n 31
  • 32. Conditional Entropy • Let X and Y be two random variables. Then, expanding H(X, Y) gives the chain rule H(X, Y) = H(X) + H(Y|X), stated on the next slide. 32
  • 33. Conditional Entropy • Chain Rule: H(X, Y) = H(X) + H(Y|X). • Chain Rule (general): H(X1, X2, ..., Xn) = H(X1) + H(X2|X1) + H(X3|X2, X1) + ... + H(Xn|Xn−1, ..., X1). 33
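A quick numerical check of the chain rule in R, using a small made-up joint distribution for two binary variables (the probabilities in pxy are arbitrary):

# Verify H(X,Y) = H(X) + H(Y|X) on a toy joint distribution
pxy <- matrix(c(0.30, 0.20,      # rows = values of X, columns = values of Y
                0.10, 0.40), nrow = 2, byrow = TRUE)
H <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

H_XY      <- H(pxy)                              # joint entropy H(X,Y)
H_X       <- H(rowSums(pxy))                     # marginal entropy H(X)
H_Y_given <- sum(rowSums(pxy) *                  # conditional entropy H(Y|X)
                 apply(pxy / rowSums(pxy), 1, H))
c(H_XY, H_X + H_Y_given)                         # both are ~1.846439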
  • 34. Entropy of a Joint Distribution 34
  • 35. 35
  • 36. Back to C4.5 (source material: Russell & Norvig, Mitchell, Quinlan) • We consider a “test” split on attribute A at a branch. • In S, we have #pos positives and #neg negatives. For each branch j, we have #pos_j positives and #neg_j negatives. 36
  • 37. 37
  • 38. 38
  • 39. 39
  • 40. Gain Ratio: GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A). We want the information gain in the numerator to be large and the split information in the denominator to be small. 40
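A small R helper, sketched here for illustration (the function names and the 3-way branch counts are made up), that computes the gain, the split information, and their ratio from per-branch (#pos, #neg) counts:

# Gain ratio of a candidate split, given (pos, neg) counts for each branch
entropy <- function(counts) {
  p <- counts / sum(counts); p <- p[p > 0]
  -sum(p * log2(p))
}
gain_ratio <- function(branches) {            # branches: list of c(pos, neg) per branch
  parent  <- Reduce(`+`, branches)
  weights <- sapply(branches, sum) / sum(parent)
  gain    <- entropy(parent) - sum(weights * sapply(branches, entropy))
  split_info <- entropy(sapply(branches, sum))   # entropy of the branch sizes themselves
  c(gain = gain, split_info = split_info, gain_ratio = gain / split_info)
}
# Hypothetical 3-way split of 12 examples (6 positives / 6 negatives overall)
gain_ratio(list(c(4, 0), c(2, 4), c(0, 2)))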
  • 41. Keep splitting until: • no more examples left (no point trying to split) • all examples have the same class • no more attributes to split • For the restaurant example, we get this: 41
  • 42. • Actually, it turns out that the class labels for the data were themselves generated from a tree. So to get the label for an example, they fed it into a tree, and got the label from the leaf. That tree is here: 42
  • 43. • But the one we found is simpler! • Does that mean our algorithm isn't doing a good job? • H([p, 1−p]) is not the only impurity measure we can use; there are alternatives. • One example is the Gini index 2p(1−p) used by CART. • Another example is the misclassification error 1 − max(p, 1−p). 43
  • 44. C4.5 uses information gain for splitting, and CART uses the Gini index. (CART only has binary splits.) All three measures (misclassification error, the Gini index, and cross-entropy) are similar, but cross-entropy and the Gini index are differentiable, and hence more amenable to numerical optimization; either the Gini index or cross-entropy should be used when growing the tree. 44
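The three impurity measures are easy to compare numerically for a two-class node; a short R sketch, where p is the fraction of class 1 at the node:

# Compare the three node-impurity measures for a two-class node
p <- seq(0.01, 0.99, by = 0.01)
misclass <- 1 - pmax(p, 1 - p)                        # misclassification error
gini     <- 2 * p * (1 - p)                           # Gini index
entropy  <- -(p * log2(p) + (1 - p) * log2(1 - p))    # cross-entropy H([p, 1-p])
matplot(p, cbind(misclass, gini, entropy), type = "l", lty = 1,
        ylab = "impurity", main = "All three peak at p = 0.5")
legend("topright", c("misclassification", "Gini", "entropy"), col = 1:3, lty = 1)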
  • 45. • Inductive bias: Shorter trees are preferred to longer trees. Trees that place high information gain attributes close to the root are also preferred. • Why prefer shorter hypotheses? • Occam’s Razor: Prefer the simplest hypothesis that fits the data! • see Minimum Description Length Principle (Bayesian Learning) • e.g., if there are two decision trees, one with 500 nodes and another with 5 nodes, the second one should be preferred ⇒ better chance to avoid overfitting 45
  • 46. Gini Index • The Gini index says: if we select two items from a population at random, they must be of the same class, and the probability of this is 1 if the population is pure. • It works with a categorical target variable, e.g. “Success” or “Failure”. • It performs only binary splits. • The higher the value of this Gini score, the higher the homogeneity. • CART (Classification and Regression Tree) uses the Gini method to create binary splits. 46
  • 47. Steps to Calculate Gini for a split 1. Calculate Gini for the sub-nodes, using the formula: sum of the squared probabilities of success and failure, p^2 + (1−p)^2. 2. Calculate Gini for the split using the weighted Gini score of each node of that split 47
  • 48. Example: • Let’s say we have a sample of 30 students with three variables Gender (Boy/ Girl), Class( IX/ X) and Height (5 to 6 ft). 15 out of these 30 play cricket in leisure time. Now, I want to create a model to predict who will play cricket during leisure period? In this problem, we need to segregate students who play cricket in their leisure time based on highly significant input variable among all three. Split on Gender: Calculate, Gini for sub-node Female = (0.2)*(0.2)+(0.8)*(0.8)=0.68 Gini for sub-node Male = (0.65)*(0.65)+(0.35)*(0.35)=0.55 Calculate weighted Gini for Split Gender = (10/30)*0.68+(20/30)*0.55 = 0.59 Similar for Split on Class: Gini for sub-node Class IX = (0.43)*(0.43)+(0.57)*(0.57)=0.51 Gini for sub-node Class X = (0.56)*(0.56)+(0.44)*(0.44)=0.51 Calculate weighted Gini for Split Class = (14/30)*0.51+(16/30)*0.51 = 0.51 Above, you can see that Gini score for Split on Gender is higher than Split on Class, hence, the node split will take place on Gender. 48
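The weighted Gini scores above can be reproduced in R; the counts below (2 of 10 girls and 13 of 20 boys playing, 6 of 14 in Class IX and 9 of 16 in Class X) are the ones implied by the proportions in the example, and the helper names are ours:

# Gini score of a node: p^2 + (1-p)^2, then weight each node by its share of the records
gini_node  <- function(p) p^2 + (1 - p)^2
gini_split <- function(p_nodes, n_nodes) sum(n_nodes / sum(n_nodes) * gini_node(p_nodes))

gini_split(c(2/10, 13/20), c(10, 20))   # split on Gender: ~0.59
gini_split(c(6/14, 9/16),  c(14, 16))   # split on Class:  ~0.51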
  • 50. Overfitting reasons for overfitting: • noise in the data • the number of training examples is too small to produce a representative sample of the target function how to avoid overfitting: • stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data • allow overfitting and then post-prune the tree (more successful in practice!) how to determine the perfect tree size: • use a separate validation set to evaluate the utility of post-pruning • apply a statistical test to estimate whether expanding (or pruning) produces an improvement 50
  • 51. Reduced error pruning • Each of the decision nodes is considered to be a candidate for pruning • Pruning a decision node consists of removing the subtree rooted at the node, making it a leaf node and assigning the most common classification of the training examples affiliated with that node • Nodes are removed only if the resulting tree performs no worse than the original tree over the validation set • Pruning starts with the node whose removal most increases accuracy and continues until further pruning is harmful 51
  • 52. Reduced Error Pruning • Effect of reduced error pruning: • Any node added to coincidental regularities in the training set is likely to be pruned 52
  • 53. Rule Post-Pruning 1. Infer the decision tree from the training set (Overfitting allowed!) 2. Convert the tree into a set of rules 3. Prune each rule by removing any preconditions that result in improving its estimated accuracy 4. Sort the pruned rules by their estimated accuracy • One method to estimate rule accuracy is to use a separate validation set • Pruning rules is more precise than pruning the tree itself 53
  • 54. Pruning • Let's start with C4.5's pruning. C4.5 recursively makes choices as to whether to prune on an attribute: • Option 1: leave the tree as is • Option 2: replace that part of the tree with a leaf corresponding to the most frequent label in the data S going to that part of the tree. • Option 3: replace that part of the tree with one of its subtrees, corresponding to the most common branch in the split • Demo: To figure out which decision to make, C4.5 computes upper bounds on the probability of error for each option. • We'll see how to do that shortly. • Prob. of error for Option 1 → Upper Bound 1 • Prob. of error for Option 2 → Upper Bound 2 • Prob. of error for Option 3 → Upper Bound 3 54
  • 55. • C4.5 chooses the option that has the lowest of these three upper bounds. This ensures that the error rate is fairly low. • e.g. which has the smallest upper bound: • 1 incorrect out of 3 • 5 incorrect out of 17, or • 9 incorrect out of 32? • For each option, we count the number correct and the number incorrect. We need upper confidence intervals on the proportion that are incorrect. To calculate the upper bounds, calculate confidence intervals on proportions. 55
  • 56. Simple Example • Flip a coin N times, with M heads. (Here N is the number of examples in the leaf, M is the number incorrectly classified.) What is an upper bound for the probability p of heads for the coin? Think visually about the binomial distribution, where we have N coin flips, and how it changes as p changes: • We want the upper bound to be as large as possible (largest possible p, it's an upper bound), but still there needs to be a probability to get as few errors as we got. In other words, we want: 56
  • 57. 57
  • 58. • We can calculate this numerically without a problem. So now if you give me M and N, I can give you p. • C4.5 uses α=0.25 by default. • M, for a given branch, is how many misclassified examples are in the branch. • N, for a given branch, is just the number of examples in the branch, Sj. • So we can calculate the upper bound on a branch, but it is still not clear how to calculate the upper bound on a tree. • Actually, we calculate an upper confidence bound on each branch on the tree and average it over the relative frequencies of landing in each branch of the tree. 58
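A sketch in R of the bound as described on these slides (the helper name upper_bound is ours, and C4.5's actual implementation differs in small details), checking the three branches from the earlier example:

# Largest p such that P(Binomial(N, p) <= M) >= alpha: an upper confidence bound on the error rate
upper_bound <- function(M, N, alpha = 0.25) {
  uniroot(function(p) pbinom(M, N, p) - alpha,
          interval = c(M / N, 1 - 1e-12))$root
}
upper_bound(1, 3)    # 1 incorrect out of 3   -> ~0.67
upper_bound(5, 17)   # 5 incorrect out of 17  -> ~0.40
upper_bound(9, 32)   # 9 incorrect out of 32  -> ~0.35  (the smallest of the three)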
  • 59. Example • Let's consider a dataset of 16 examples describing toys (from the Kranf Site). We want to know if the toy is fun or not. 59
  • 60. 60
  • 61. 61
  • 62. Post-pruning • The aim of pruning is to discard parts of a classification model that describe random variation in the training sample rather than true features of the underlying domain. • This makes the model more comprehensible to the user, and potentially more accurate on new data that has not been used for training the classifier. • Statistical significance tests can be used to make pruning decisions in classification models. Reduced-error pruning (Quinlan, 1987), a standard algorithm for post-pruning decision trees, does not take statistical significance into account, but it is known to be one of the fastest pruning algorithms, producing trees that are both accurate and small. 62
  • 63. • Reduced-error pruning generates smaller and more accurate decision trees if pruning decisions are made using significance tests and the significance level is chosen appropriately for each dataset. • For a given amount of pruning, decision trees pruned using a permutation test will be more accurate than those pruned using a parametric test. • If decision tree A is the result of pruning using a permutation test, and decision tree B is the result of pruning using a parametric test, and both trees have the same size, then A will be more accurate than B on average. 63
  • 64. Decision tree pruning and statistical significance • Below figure depicts an unpruned decision tree. We assume that a class label has been attached to each node of the tree, for example, by taking the majority class of the training instances reaching that particular node. In the figure, there are two classes: A and B. The tree can be used to predict the class of a test instance by filtering it to the leaf node corresponding to the instance’s attribute values and assigning the class label attached to that leaf. 64
  • 65. • However, using an unpruned decision tree for classification potentially “overfits” the training data. Consequently, it is advisable, before the tree is put to use, to ascertain which parts truly reflect effects present in the domain, and discard those that do not. This process is called “pruning.” • A general, fast, and easily applicable pruning method is “reduced- error pruning” (Quinlan, 1987a). The idea is to hold out some of the available instances—the “pruning set”—when the tree is built, and prune the tree until the classification error on these independent instances starts to increase. Because the instances in the pruning set are not used for building the decision tree, they provide a less biased estimate of its error rate on future instances than the training data. 65
  • 66. 66
  • 67. • In each tree, the number of instances in the pruning data that are misclassified by the individual nodes are given in parentheses. A pruning operation involves replacing a subtree by a leaf. Reduced-error pruning will perform this operation if it does not increase the total number of classification errors. Traversing the tree in a bottom-up fashion ensures that the result is the smallest pruned tree that has minimum error on the pruning data 67
  • 68. • The CP (complexity parameter) is used to control tree growth. If the cost of adding a variable is higher than the value of CP, then tree growth stops.
# Base model
hr_base_model <- rpart(left ~ ., data = train, method = "class",
                       control = rpart.control(cp = 0))
summary(hr_base_model)
# Plot decision tree
plot(hr_base_model)
# Examine the complexity plot
printcp(hr_base_model)
plotcp(hr_base_model)
# Optimal value read off the complexity plot: cp = 0.0084
  • 69. # Post-pruning: prune hr_base_model at the optimal cp value
hr_model_pruned <- prune(hr_base_model, cp = 0.0084)
# Compute the accuracy of the pruned tree on the test set
test$pred <- predict(hr_model_pruned, test, type = "class")
accuracy_postprun <- mean(test$pred == test$left)
data.frame(base_accuracy, accuracy_preprun, accuracy_postprun)
The accuracy of the model on the test data is better when the tree is pruned, which means that the pruned decision tree model generalizes well and is better suited for a production environment. However, other factors can also influence decision tree model creation, such as building a tree on an unbalanced class. These factors were not accounted for here, but it is very important to examine them during a live model formulation.
  • 70. Prepruning • Prepruning is also known as early stopping criteria. As the name suggests, the criteria are set as parameter values while building the rpart model. Below are some of the pre-pruning criteria that can be used. The tree stops growing when it meets any of these pre-pruning criteria, or it discovers the pure classes. • maxdepth: This parameter is used to set the maximum depth of a tree. Depth is the length of the longest path from a Root node to a Leaf node. Setting this parameter will stop growing the tree when the depth is equal the value set for maxdepth. • minsplit: It is the minimum number of records that must exist in a node for a split to happen or be attempted. For example, we set minimum records in a split to be 5; then, a node can be further split for achieving purity when the number of records in each split node is more than 5. • minbucket: It is the minimum number of records that can be present in a Terminal node. For example, we set the minimum records in a node to 5, meaning that every Terminal/Leaf node should have at least five records. We should also take care of not overfitting the model by specifying this parameter. If it is set to a too-small value, like 1, we may run the risk of overfitting our model. 70
  • 71. # Grow a tree with minsplit of 100 and maxdepth of 8
hr_model_preprun <- rpart(left ~ ., data = train, method = "class",
                          control = rpart.control(cp = 0, maxdepth = 8, minsplit = 100))
# Compute the accuracy of the pre-pruned tree
test$pred <- predict(hr_model_preprun, test, type = "class")
accuracy_preprun <- mean(test$pred == test$left)
  • 72. CART – Classification and Regression Trees (Breiman, Friedman, Olshen, Stone, 1984) 72
  • 73. CART decides which attributes to split and where to split them. In each leaf, we're going to assign f(x) to be a constant. 73
  • 74. 74
  • 77. 77
  • 78. 78
  • 79. 79
  • 80. Example: Riding Mowers • Goal: Classify 24 households as owning or not owning riding mowers • Predictors = Income, Lot Size 80
  • 81. Income  Lot_Size  Ownership
60.0    18.4   owner
85.5    16.8   owner
64.8    21.6   owner
61.5    20.8   owner
87.0    23.6   owner
110.1   19.2   owner
108.0   17.6   owner
82.8    22.4   owner
69.0    20.0   owner
93.0    20.8   owner
51.0    22.0   owner
81.0    20.0   owner
75.0    19.6   non-owner
52.8    20.8   non-owner
64.8    17.2   non-owner
43.2    20.4   non-owner
84.0    17.6   non-owner
49.2    17.6   non-owner
59.4    16.0   non-owner
66.0    18.4   non-owner
47.4    16.4   non-owner
33.0    18.8   non-owner
51.0    14.0   non-owner
63.0    14.8   non-owner
  • 82. How to split • Order records according to one variable, say lot size • Find midpoints between successive values E.g. first midpoint is 14.4 (halfway between 14.0 and 14.8) • Divide records into those with lotsize > 14.4 and those < 14.4 • After evaluating that split, try the next one, which is 15.4 (halfway between 14.8 and 16.0) 82
  • 83. Splitting criteria • Regression: residual sum of squares, RSS = Σ_left (y_i − y*_L)^2 + Σ_right (y_i − y*_R)^2, where y*_L = mean y-value for the left node and y*_R = mean y-value for the right node. • Classification: Gini criterion, Gini = N_L Σ_{k=1..K} p_kL (1 − p_kL) + N_R Σ_{k=1..K} p_kR (1 − p_kR), where p_kL = proportion of class k in the left node and p_kR = proportion of class k in the right node. 83
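Putting the last two slides together, a short R sketch that scans all midpoints of one predictor and scores each candidate split with the count-weighted Gini criterion above; it assumes the 24 records from the table sit in a data frame called mowers with columns Income, Lot_Size and Ownership, and the helper names are ours:

# Evaluate every candidate split point of one numeric predictor by the Gini criterion
gini_impurity <- function(y) { p <- table(y) / length(y); sum(p * (1 - p)) }

best_gini_split <- function(x, y) {
  xs   <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2          # midpoints between successive values
  score <- sapply(cuts, function(c)
    sum(x <= c) * gini_impurity(y[x <= c]) + sum(x > c) * gini_impurity(y[x > c]))
  data.frame(cut = cuts, gini = score)[which.min(score), ]
}

best_gini_split(mowers$Lot_Size, mowers$Ownership)   # should recover the Lot_Size = 19 split shown below
best_gini_split(mowers$Income,  mowers$Ownership)    # best Income threshold, for comparison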
  • 84. Note: Categorical Variables • Examine all possible ways in which the categories can be split. • E.g., categories A, B, C can be split 3 ways {A} and {B, C} {B} and {A, C} {C} and {A, B} • With many categories, # of splits becomes huge 84
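For a predictor with k categories there are 2^(k−1) − 1 distinct binary splits, which is why the count blows up quickly; a tiny R sketch that enumerates them:

# Enumerate all binary splits of a set of categories (2^(k-1) - 1 of them)
category_splits <- function(levels) {
  k <- length(levels)
  lapply(1:(2^(k - 1) - 1), function(i) {
    left <- levels[as.logical(intToBits(i))[1:k]]   # bit pattern of i picks the left group
    list(left = left, right = setdiff(levels, left))
  })
}
category_splits(c("A", "B", "C"))        # the 3 splits listed above
length(category_splits(letters[1:10]))   # 511 splits for 10 categories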
  • 85. The first split: Lot Size = 19,000 85
  • 86. Second Split: Income = $84,000 86
  • 88. Applications of Decision Trees: Xbox! • Decision trees are in Xbox [J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, A. Blake. Real-Time Human Pose Recognition in Parts from Single Depth Images, CVPR 2011] 88
  • 89. • Decision trees are in XBox: Classifying body parts 89
  • 90. • Trained on million(s) of examples 90
  • 91. • Trained on million(s) of examples • Results: 91
  • 92. Decision Tree Classifier implementation in R (http://dataaspirant.com/2017/02/03/decision-tree-classifier-implementation-in-r/ ) • The R machine learning caret package (Classification And REgression Training) holds tons of functions that help to build predictive models. It holds tools for data splitting, pre-processing, feature selection, tuning, and supervised/unsupervised learning algorithms, etc. It is similar to the sklearn library in Python. • The caret package provides direct access to various functions for training our model with different machine learning algorithms like kNN, SVM, decision trees, linear regression, etc. 92
  • 93. Cars Evaluation Data Set • The Cars Evaluation data set consists of 7 attributes: 6 feature attributes and 1 target attribute. All the attributes are categorical. We will try to build a classifier for predicting the Class attribute; the index of the target attribute is 7.
1 buying: vhigh, high, med, low
2 maint: vhigh, high, med, low
3 doors: 2, 3, 4, 5more
4 persons: 2, 4, more
5 lug_boot: small, med, big
6 safety: low, med, high
7 Car Evaluation (target variable): unacc, acc, good, vgood
• The goal is to model a classifier for evaluating the acceptability of a car given its features.
  • 94. library(caret)
library(rpart.plot)
data_url <- c("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data")
download.file(url = data_url, destfile = "car.data")
car_df <- read.csv("car.data", sep = ',', header = FALSE)
str(car_df)
'data.frame': 1728 obs. of 7 variables:
 $ V1: Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ V2: Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ V3: Factor w/ 4 levels "2","3","4","5more": 1 1 1 1 1 1 1 1 1 1 ...
 $ V4: Factor w/ 3 levels "2","4","more": 1 1 1 1 1 1 1 1 1 2 ...
 $ V5: Factor w/ 3 levels "big","med","small": 3 3 3 2 2 2 1 1 1 3 ...
 $ V6: Factor w/ 3 levels "high","low","med": 2 3 1 2 3 1 2 3 1 2 ...
 $ V7: Factor w/ 4 levels "acc","good","unacc",..: 3 3 3 3 3 3 3 3 3 3 ...
  • 95. head(car_df)
     V1    V2 V3 V4    V5   V6    V7
1 vhigh vhigh  2  2 small  low unacc
2 vhigh vhigh  2  2 small  med unacc
3 vhigh vhigh  2  2 small high unacc
4 vhigh vhigh  2  2   med  low unacc
5 vhigh vhigh  2  2   med  med unacc
6 vhigh vhigh  2  2   med high unacc
Data Slicing
intrain <- createDataPartition(y = car_df$V7, p = 0.7, list = FALSE)
training <- car_df[intrain,]
testing <- car_df[-intrain,]
# check dimensions of train & test set
dim(training); dim(testing)
[1] 1211    7
[1] 517    7
  • 96. • Preprocessing & Training: to check whether our data contains missing values or not, we can use the anyNA() method (NA means Not Available).
anyNA(car_df)
[1] FALSE
summary(car_df)
    V1          V2           V3          V4          V5          V6
 high :432   high :432   2    :432   2   :576   big  :576   high:576
 low  :432   low  :432   3    :432   4   :576   med  :576   low :576
 med  :432   med  :432   4    :432   more:576   small:576   med :576
 vhigh:432   vhigh:432   5more:432
    V7
 acc  : 384
 good :  69
 unacc:1210
 vgood:  65
  • 97. • Training the Decision Tree classifier with criterion as information gain. The caret package provides the train() method for training our data with various algorithms; we just need to pass different parameter values for different algorithms. Before train(), we first use the trainControl() method, which controls the computational nuances of the train() method.
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(3333)
dtree_fit <- train(V7 ~., data = training, method = "rpart",
                   parms = list(split = "information"),   # the splitting criterion; it can also be "gini"
                   trControl = trctrl, tuneLength = 10)
• We are setting 3 parameters of the trainControl() method. The "method" parameter holds the details about the resampling method; it can take values like "boot", "boot632", "cv", "repeatedcv", "LOOCV", "LGOCV", etc. Let's use repeatedcv, i.e., repeated cross-validation. • The "number" parameter holds the number of resampling iterations, and the "repeats" parameter contains the complete sets of folds to compute for our repeated cross-validation; we set number = 10 and repeats = 3. trainControl() returns a list, which we pass on to the train() method.
  • 98. • Trained Decision Tree classifier results We can check the result of our train() method by a print dtree_fit variable. It is showing us the accuracy metrics for different values of cp. Here, cp is a complexity parameter for our dtree. dtree_fit CART 1211 samples 6 predictor 4 classes: 'acc', 'good', 'unacc', 'vgood' No pre-processing Resampling: Cross-Validated (10 fold, repeated 3 times) Summary of sample sizes: 1091, 1090, 1091, 1089, 1089, 1089, ... Resampling results across tuning parameters: cp Accuracy Kappa 0.008928571 0.8483624 0.6791223 0.009615385 0.8467071 0.6745287 0.010989011 0.8365300 0.6487824 0.012362637 0.8266554 0.6253187 0.013736264 0.8219630 0.6128814 0.020604396 0.7961370 0.5540247 0.022893773 0.7980631 0.5600789 0.054945055 0.7746394 0.5307654 0.057692308 0.7724331 0.5305796 0.065934066 0.7322489 0.2893330 Accuracy was used to select the optimal model using the largest value. The final value used for the model was cp = 0.008928571. 98
  • 99. • Plot Decision Tree
prp(dtree_fit$finalModel, box.palette = "Reds", tweak = 1.2)
  • 100. • Prediction: now our model is trained with cp = 0.008928571, and we are ready to predict classes for our test set using the predict() method. Let's first predict the target variable for the test set's 1st record.
testing[1,]
     V1    V2 V3 V4  V5   V6    V7
6 vhigh vhigh  2  2 med high unacc
predict(dtree_fit, newdata = testing[1,])
[1] unacc
Levels: acc good unacc vgood
• For the 1st record of the testing data, the classifier predicts the class variable as "unacc". Now it's time to predict the target variable for the whole test set.
  • 101. test_pred <- predict(dtree_fit, newdata = testing) confusionMatrix(test_pred, testing$V7 ) #check accuracy Reference Prediction acc good unacc vgood acc 84 8 27 2 good 7 6 0 1 unacc 17 5 336 0 vgood 7 1 0 16 Accuracy : 0.8549 95% CI : (0.8216, 0.8842) No Information Rate : 0.7021 P-Value [Acc > NIR] : 3.563e-16 Kappa : 0.6839 Mcnemar's Test P-Value : NA Statistics by Class: Class: acc Class: good Class: unacc Class: vgood Sensitivity 0.7304 0.30000 0.9256 0.84211 Specificity 0.9080 0.98390 0.8571 0.98394 Pos Pred Value 0.6942 0.42857 0.9385 0.66667 Neg Pred Value 0.9217 0.97217 0.8302 0.99391 Prevalence 0.2224 0.03868 0.7021 0.03675 Detection Rate 0.1625 0.01161 0.6499 0.03095 Detection Prevalence 0.2340 0.02708 0.6925 0.04642 Balanced Accuracy 0.8192 0.64195 0.8914 0.91302 101
  • 102. Training the Decision Tree classifier with criterion as gini index. Let's program a decision tree classifier using the gini index as the splitting criterion. The output again shows the accuracy metrics for different values of the complexity parameter cp.
set.seed(3333)
dtree_fit_gini <- train(V7 ~., data = training, method = "rpart",
                        parms = list(split = "gini"),
                        trControl = trctrl, tuneLength = 10)
  • 103. dtree_fit_gini CART 1211 samples 6 predictor 4 classes: 'acc', 'good', 'unacc', 'vgood' No pre-processing Resampling: Cross-Validated (10 fold, repeated 3 times) Summary of sample sizes: 1091, 1090, 1091, 1089, 1089, 1089, ... Resampling results across tuning parameters: cp Accuracy Kappa 0.01098901 0.8522395 0.6816530 0.01373626 0.8362745 0.6436379 0.01510989 0.8305029 0.6305745 0.01556777 0.8249840 0.6168644 0.01648352 0.8227709 0.6115286 0.01831502 0.8180553 0.6039963 0.02060440 0.8095423 0.5858712 0.02197802 0.8032220 0.5725628 0.06868132 0.7888755 0.5727260 0.09340659 0.7233582 0.2223118 Accuracy was used to select the optimal model using the largest value. The final value used for the model was cp = 0.01098901. 103
  • 104. • Plot Decision Tree: we can visualize our decision tree by using the prp() method.
prp(dtree_fit_gini$finalModel, box.palette = "Blues", tweak = 1.2)
  • 105. Prediction Now, our model is trained with cp = 0.01098901. We are ready to predict classes for our test set. Now, it’s time to predict target variable for the whole test set. test_pred_gini <- predict(dtree_fit_gini, newdata = testing) > confusionMatrix(test_pred_gini, testing$V7 ) #check accuracy Reference Prediction acc good unacc vgood acc 87 10 25 8 good 4 4 0 0 unacc 22 5 338 0 vgood 2 1 0 11 Overall Statistics Accuracy : 0.8511 95% CI : (0.8174, 0.8806) No Information Rate : 0.7021 P-Value [Acc > NIR] : 2.18e-15 Kappa : 0.6666 Mcnemar's Test P-Value : NA Statistics by Class: Class: acc Class: good Class: unacc Class: vgood Sensitivity 0.7565 0.200000 0.9311 0.57895 Specificity 0.8930 0.991952 0.8247 0.99398 Pos Pred Value 0.6692 0.500000 0.9260 0.78571 Neg Pred Value 0.9276 0.968566 0.8355 0.98410 Prevalence 0.2224 0.038685 0.7021 0.03675 Detection Rate 0.1683 0.007737 0.6538 0.02128 Detection Prevalence 0.2515 0.015474 0.7060 0.02708 Balanced Accuracy 0.8248 0.595976 0.8779 0.78646 In this case, our classifier with criterion gini index is not giving better results. 105
  • 106. Regression Trees for Prediction • Used with continuous outcome variable • Procedure similar to classification tree • Many splits attempted, choose the one that minimizes impurity 106
  • 107. Decision Tree – Regression • A decision tree builds regression or classification models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast and Rainy), each representing values for the attribute tested. A leaf node (e.g., Hours Played) represents a decision on the numerical target. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data. 107
  • 108. Advantages of trees • Easy to use, understand • Produce rules that are easy to interpret & implement • Variable selection & reduction is automatic • Do not require the assumptions of statistical models • Can work without extensive handling of missing data 108
  • 109. Disadvantages • May not perform well where there is structure in the data that is not well captured by horizontal or vertical splits • Since the process deals with one variable at a time, no way to capture interactions between variables 109
  • 111. Decision Tree Algorithm • The core algorithm for building decision trees, called ID3 and due to J. R. Quinlan, employs a top-down, greedy search through the space of possible branches with no backtracking. The ID3 algorithm can be used to construct a decision tree for regression by replacing Information Gain with Standard Deviation Reduction. Standard Deviation • A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). We use the standard deviation to calculate the homogeneity of a numerical sample. If the numerical sample is completely homogeneous, its standard deviation is zero. 111
  • 112. a) Standard deviation for one attribute: • Standard Deviation (S) is for tree building (branching). • Coefficient of Deviation (CV) is used to decide when to stop branching. We can use Count (n) as well. • Average (Avg) is the value in the leaf nodes. 112
  • 113. b) Standard deviation for two attributes (target and predictor): 113
  • 114. Standard Deviation Reduction • The standard deviation reduction is based on the decrease in standard deviation after a dataset is split on an attribute. Constructing a decision tree is all about finding attribute that returns the highest standard deviation reduction (i.e., the most homogeneous branches). • Step 1: The standard deviation of the target is calculated. Standard deviation (Hours Played) = 9.32 114
  • 115. • Step 2: The dataset is then split on the different attributes. The standard deviation for each branch is calculated. The resulting standard deviation is subtracted from the standard deviation before the split. The result is the standard deviation reduction. 115
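A generic helper for this step, sketched in R; the hours and outlook vectors below are made-up stand-ins, since the actual weather data set appears only in the figures:

# Standard deviation reduction for splitting a numeric target on one attribute.
# Like the slides, this uses the population SD (divide by n), not R's sample sd().
pop_sd <- function(y) sqrt(mean((y - mean(y))^2))

sdr <- function(y, attribute) {
  weights  <- table(attribute) / length(y)
  sd_after <- sum(weights * tapply(y, attribute, pop_sd))
  pop_sd(y) - sd_after            # before-split SD minus the weighted after-split SD
}

# Hypothetical example: hours played, grouped by a made-up Outlook column
hours   <- c(25, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30)
outlook <- c("sunny","sunny","overcast","rainy","rainy","rainy","overcast",
             "sunny","sunny","rainy","sunny","overcast","overcast","rainy")
sdr(hours, outlook)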
  • 116. • Step 3: The attribute with the largest standard deviation reduction is chosen for the decision node. • Step 4a: The dataset is divided based on the values of the selected attribute. This process is run recursively on the non-leaf branches, until all data is processed. In practice, we need some termination criteria. For example, when coefficient of deviation (CV) for a branch becomes smaller than a certain threshold (e.g., 10%) and/or when too few instances (n) remain in the branch (e.g., 3). 116
  • 117. • Step 4b: "Overcast" subset does not need any further splitting because its CV (8%) is less than the threshold (10%). The related leaf node gets the average of the "Overcast" subset. 117
  • 118. • Step 4c: However, the "Sunny" branch has a CV (28%) greater than the threshold (10%), so it needs further splitting. We select "Windy" as the best node after "Outlook" because it has the largest SDR. 118
  • 119. • Because the number of data points for both branches (FALSE and TRUE) is equal or less than 3 we stop further branching and assign the average of each branch to the related leaf node. 119
  • 120. • Step 4d: Moreover, the "rainy" branch has a CV (22%), which is more than the threshold (10%), so this branch needs further splitting. We select "Temp" as the best node because it has the largest SDR. 120
  • 121. • Because the number of data points for all three branches (Cool, Hot and Mild) is equal or less than 3 we stop further branching and assign the average of each branch to the related leaf node. • When the number of instances is more than one at a leaf node we calculate the average as the final value for the target. 121
  • 122. Decision Trees for Regression – R Example • We use decision trees to predict whether the birth weights of infants will be low or not.
> library(MASS)
> library(rpart)
> head(birthwt)
   low age lwt race smoke ptl ht ui ftv  bwt
85   0  19 182    2     0   0  0  1   0 2523
86   0  33 155    3     0   0  0  0   3 2551
87   0  20 105    1     1   0  0  0   1 2557
88   0  21 108    1     1   0  0  1   2 2594
89   0  18 107    1     1   0  0  1   0 2600
91   0  21 124    3     0   0  0  0   0 2622
  • 123. • low – indicator of whether the birth weight is less than 2.5 kg • age – mother's age in years • lwt – mother's weight in pounds at last menstrual period • race – mother's race (1 = white, 2 = black, 3 = other) • smoke – smoking status during pregnancy • ptl – number of previous premature labours • ht – history of hypertension • ui – presence of uterine irritability • ftv – number of physician visits during the first trimester • bwt – birth weight in grams • Let's look at the distribution of infant weights:
hist(birthwt$bwt)
Most of the infants weigh between 2 kg and 4 kg. 123
  • 124. • Let us look at the number of infants born with low weight.
table(birthwt$low)
  0   1
130  59
This means that there are 130 infants weighing more than 2.5 kg and 59 infants weighing less than 2.5 kg. If we just guessed the most common case (> 2.5 kg), our accuracy would be 130 / (130 + 59) = 68.78%. Let's see if we can improve upon this by building a prediction model. 124
  • 125. Building the model • In the dataset, all the variables are stored as numeric. Before we build our model, we need to convert the categorical variables to factors.
cols <- c('low', 'race', 'smoke', 'ht', 'ui')
birthwt[cols] <- lapply(birthwt[cols], as.factor)
Next, let us split our dataset into a training set and a testing set.
set.seed(1)
train <- sample(1:nrow(birthwt), 0.75 * nrow(birthwt))
Now, let us build the model using the rpart function.
birthwtTree <- rpart(low ~ . - bwt, data = birthwt[train, ], method = 'class')
Since low is just an indicator for bwt <= 2.5 kg, we exclude bwt from the model, and since it is a classification task, we specify method = 'class'. Let's take a look at the tree.
1) root 141 44 0 (0.6879433 0.3120567)
  2) ptl< 0.5 117 30 0 (0.7435897 0.2564103)
    4) lwt>=106 93 19 0 (0.7956989 0.2043011)
      8) ht=0 86 15 0 (0.8255814 0.1744186) *
      9) ht=1 7 3 1 (0.4285714 0.5714286) *
    5) lwt< 106 24 11 0 (0.5416667 0.4583333)
      10) age< 22.5 15 4 0 (0.7333333 0.2666667) *
      11) age>=22.5 9 2 1 (0.2222222 0.7777778) *
  3) ptl>=0.5 24 10 1 (0.4166667 0.5833333)
    6) lwt>=131.5 7 2 0 (0.7142857 0.2857143) *
    7) lwt< 131.5 17 5 1 (0.2941176 0.7058824) *
  • 127. • To make a prediction of the birth weight itself, we can run this code:
birthwtTree_reg <- rpart(bwt ~ . , data = birthwt[train, ])
plot(birthwtTree_reg)
text(birthwtTree_reg, pretty = 0)
1) root 141 79639400 2944.291
  2) low=1 44 7755010 2073.614
    4) age>=23.5 19 5147718 1859.000 *
    5) age< 23.5 25 1067079 2236.720 *
  3) low=0 97 23398620 3339.237
    6) lwt< 109.5 19 2784704 2983.263 *
    7) lwt>=109.5 78 17619810 3425.949
      14) race=2,3 33 4872606 3252.061 *
      15) race=1 45 11017640 3553.467
        30) lwt>=123.5 33 4761066 3459.515 *
        31) lwt< 123.5 12 5164248 3811.833 *
  • 128. summary(birthwtTree_reg)
# make predictions
predictions <- predict(birthwtTree_reg, birthwt[,1:9])
# summarize accuracy
mse <- mean((birthwt$bwt - predictions)^2)
print(mse)
[1] 177777.4
birthwtTree_lm <- lm(bwt ~ . , data = birthwt[train, ])
summary(birthwtTree_lm)
# make predictions
predictions <- predict(birthwtTree_lm, birthwt[,1:9])
# summarize accuracy
mse <- mean((birthwt$bwt - predictions)^2)
print(mse)
[1] 178757.8
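One caveat with this comparison: both MSE values are computed over all 189 rows, including the 75% used for training. A fairer check would score only the held-out rows; a sketch reusing the objects defined above:

# Compare the regression tree and the linear model on the held-out 25% only
test_rows <- setdiff(1:nrow(birthwt), train)

mse_tree <- mean((birthwt$bwt[test_rows] -
                  predict(birthwtTree_reg, birthwt[test_rows, ]))^2)
mse_lm   <- mean((birthwt$bwt[test_rows] -
                  predict(birthwtTree_lm,  birthwt[test_rows, ]))^2)
c(tree = mse_tree, lm = mse_lm)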