Deep Into The Random Forest
Peter Schindler
M3R Project
June 10, 2014
Abstract
A random forest is an ensemble of decision trees that combine their results to form
an enhanced prediction tool. The classification and regression tree methodology is
explained and the bagging predictor is introduced. Adding an additional layer of
randomness to bagging, one can create a random forest. The random forest is applied to
a real-life dataset and its performance is measured. The results are then compared to the
performance of various other classification tools, such as the generalized linear model
and artificial neural networks.
Keywords. Classification and Regression Trees, Bagging, Random Forest
Acknowledgements
In this short paragraph I would like to express my gratitude to all the people that
made a difference in my undergraduate studies at Imperial.
Firstly, I would like to thank Dr Axel Gandy for being my project supervisor. In
my short but instructive encounters with him, he gave me the guidance every young
mathematician dreams of. If I were to come to Imperial just for the experience of
learning from a professor like Mr Gandy, it was worth it.
I would also like to thank Ricardo Monti, a good friend who always had time for
me whenever I needed help with a problem.
I am also grateful to Dr Tony Bellotti, who taught me two very interesting and
useful credit scoring courses that I enjoyed a lot.
On a more personal note, I would like to thank my two grandmothers Kati and
Saci, who were great help for me during the revision periods leading up to exams.
Without them, this preparation would have been very different and much harder.
Last but not least, I would like to say a big thank you to my mother Marianna.
She has always been there for me throughout my life, during the highs and lows, and
has permanently provided me with the love and support a son could ever wish for. A
big part of who and where I am today in life is due to her hard work. Thank you.
Details
Name: Peter Schindler
CID Number: 00694136
Name of Supervisor: Dr Axel Gandy
Email Address: peter.schindler11@imperial.ac.uk
Home Address: 7 Rue Saint Honore, Versailles, 78000, France
Plagiarism Statement
This is my own unaided work unless stated otherwise.
Contents
1 Introduction
2 Decision Trees
  2.1 Building a Tree Classifier
  2.2 The Impurity Measure
  2.3 The Gini Index
  2.4 Tree Structure and Pruning
  2.5 Classification and Regression Trees
3 Random Forests
  3.1 The Algorithm
  3.2 Bootstrapping and Bagging
  3.3 The Random Forest
4 Applying the Trees to the Admissions Dataset
  4.1 Describing the Data
  4.2 Variable Transformations
  4.3 Measure of Performance
  4.4 Results
5 Other Prediction Methods
  5.1 Generalized Linear Model
  5.2 Linear Discriminant Analysis
  5.3 Quadratic Discriminant Analysis
  5.4 Multilayer Perceptron
  5.5 Comparing the Predictors
6 Conclusion
1 Introduction
A random forest is a machine learning algorithm mostly used for prediction problems. It
is made up of an ensemble of many individual tree predictors. A tree predictor is a simple
sequence of decisions, where each decision depends on the outcome of the previous one. This
is the origin of the name, as such a sequence of decisions is easily represented by a tree.
Tree predictors are also commonly referred to as decision trees. When applying the random
forest algorithm, many decision trees are generated and the predictions of these individual
trees are aggregated to give the overall prediction made by the random forest. A random
forest is conceptually quite simple, but turns out to be an extremely powerful prediction
tool.
Decision trees represent the pillars of a random forest. They were presented formally for
the first time in Breiman et al. (1984). This gave a foundation of the theory behind these
trees and showed how they can be applied in various classification and regression prob-
lems. Building on these tree methods, the notion of bagging was introduced in Breiman
(1996). Bagging also drew inspiration from the bootstrapping methodology presented in
Efron and Tibshirani (1993). Adding an additional layer of randomness (discussed later)
to bagging, the random forest was created. Random forests were introduced
in Breiman (2001). They are the outcome of a combination of ideas taken from Amit and
Geman (1997) and Breiman (1996). The paper Breiman (2001) has had an outstanding
amount of success since its publication, as over 12,000 research papers cite it as a reference.
In Section 2. we will introduce the classification and regression tree methodology. A
simple example, followed by a general algorithm for building a tree will be given. The usage
of impurity measures and tree pruning will be explained in detail. Two-class problems will
be of particular interest to us in Sections 4 and 5, and so all theoretical explanations will
be given for the two-class case; however, all methods generalize easily to the
n-class case.
Having established the foundation of the decision tree methodology, random forests will
be explored in depth in Section 3. We give an explanation of how bootstrap samples allow
us to construct a bagging predictor. We show why bagging is superior in performance to
a single tree predictor. Then we go deeper into the random forest and explain why it
outperforms its ancestor. The out-of-bag error estimate will
be presented and we will show how it is used to obtain a measure called the variable
importance.
In Section 4, we apply the single decision tree, the bagging predictor, as well as the
random forest to the dataset of our interest called the Admissions Dataset. Performance
of each predictor will be assessed and compared.
The purpose of Section 5. will be to put the performance of the random forest into
perspective by comparing it to the performance of other predictors. We experiment with
other methods such as the generalized linear model, linear and quadratic discriminant analysis and
the multilayer perceptron.
Concluding remarks and modern-day applications of random forests are given in Section
6.
2 Decision Trees
Applying a decision tree to a dataset essentially means splitting the predictor variable space
into a set of partitions and fitting a simple model to each one. These simple models could
take various forms, but typically they are constants. The advantage of this tree-based
method is that it is conceptually simple and easy to interpret, yet at the same
time it can be a very accurate predictor. A further advantage of this method is
that it can handle all types of data: discrete, continuous, binary, categorical and missing
data. In this section, we will only look at the most common tree-based method, called
Classification And Regression Trees (CART), but several similar methods such as ID3 or
C4.5 exist. More details about these alternative methods can be found in Mitchell (1997,
Chapter 3). This section is mainly based on Breiman et al. (1984) and Hastie et al. (2001,
Chapter 9.2).
2.1 Building a Tree Classifier
Let us look at a simple two-class problem, where the response $Y$ can take values in $\{0, 1\}$
and there are two predictor variables $X_1$ and $X_2$, each taking a value on an interval of unit
length. Using a specific split selection criterion (discussed later) that gives us the optimal
split, we choose a variable $X_i$ and a split point $s \in [0, 1]$ which we use to split the space
into two regions. We then repeat this binary splitting process until some stopping criterion
is met. For example, applying the following four-step partitioning process:
1. splitting $X_1$ at $t_1$, followed by
2. splitting the region $X_1 < t_1$ at $X_2 = t_2$, followed by
3. splitting the region $X_1 \ge t_1$ at $X_1 = t_3$, followed by
4. splitting the region $X_1 > t_3$ at $X_2 = t_4$,
we obtain five regions R1, R2, R3, R4 and R5 that can be plotted as shown in Figure 1a.
(Hastie et al., 2001, Page 268). The classification model associated to this partitioning
that predicts the class of an observation $(X_{i1}, X_{i2})$ is:
\[
\text{class}(X_{i1}, X_{i2}) = \sum_{m=1}^{5} c_m \, \mathbf{1}\!\left((X_{i1}, X_{i2}) \in R_m\right)
\]
where $c_m$ is the predicted class of all observations that end up in the region $R_m$ and $\mathbf{1}$
is the indicator function. This gives rise to the binary decision tree shown in Figure 1b
(Hastie et al., 2001, Page 268). As we can see, this tree is easily interpretable and gives a
clear guide on how to classify observations with predictor variables $X_{i1}$ and $X_{i2}$.
Now we look at an algorithm that describes how to build a decision tree that can cope
with training data that contains more than two predictor variables. This algorithm was
inspired by the one in the Bellotti (2013) Credit Scoring 1 course. We define the recursive
algorithm called GrowTree in the following way:
Figure 1: (a) A particular partition of the unit square using binary splits. (b) Tree representation of the partitioning in Figure 1a.
Algorithm 1. Taking as input a given node $t$, a dataset $D$ and a set of predictor variables
$S_X$, we call the function GrowTree($t$, $D$, $S_X$):
• IF the stopping criterion is not met, then do:
(1) Let $S$ be the set of all possible binary splits across all variables $X$ in $S_X$.
Select $\hat{s} = \arg\max_{s \in S} \Delta I(s, t)$, where $\Delta I(s, t)$ represents the decrease of an impurity measure $I$ when applying the split $s$ to the node $t$.
(2) Let $\hat{X}$ be the variable which $\hat{s}$ splits. Two sub-datasets are obtained, one having
the characteristic $\hat{X} = \hat{X}_1$ and the other one having $\hat{X} = \hat{X}_2$.
(3) Assign $\hat{X}$ to node $t$.
(4) For $i = 1, 2$, do:
(i) Create a new $i$th branch from $t$ linked to a new node $t^*_i$, labelled $\hat{X}_i$.
(ii) Call GrowTree($t^*_i$, $D[\hat{X} = \hat{X}_i]$, $S_X$).
• ELSE report the partition as an output.
Note that the initial input for this algorithm is the root node $t_0$ containing the entire
initial dataset $D_0$ with the set of all predictor variables $S^0_X$.
Hence the entire construction of the decision tree boils down to three things:
1. Selecting the impurity measure I,
2. Choosing a stopping criterion, which decides when to declare a node terminal or to
keep on splitting it,
3. Having a rule that assigns a prediction (a class label or a numerical value) to every
terminal node.
2.2 The Impurity Measure
Definition 1. An impurity function is a function $\phi$ defined on the set of $n$-tuples of numbers
$(p_1, \dots, p_n)$ satisfying $p_i \ge 0$ and $\sum_i p_i = 1$, having the properties:
i. $\phi$ achieves its maximum only at the point $(\tfrac{1}{n}, \tfrac{1}{n}, \dots, \tfrac{1}{n})$,
ii. $\phi$ achieves its minimum only at the points $(1,0,\dots,0)$, $(0,1,0,\dots,0)$, ..., $(0,\dots,0,1)$,
iii. $\phi$ is a symmetric function of $p_1, \dots, p_n$.
Definition 2. For a node $t$, let $p(i|t)$ denote the proportion of class $i$ observations present
in node $t$. So given an impurity function $\phi$, we define the impurity measure $i(t)$ of a node
as:
\[
i(t) = \phi\big(p(1|t), p(2|t), \dots, p(n|t)\big)
\]
Finally, let us suppose that all the splitting has been done and a final structure of the
tree is obtained. We denote the set of terminal nodes of this tree by $\tilde{T}$.
Definition 3. For a node $t$, we set $I(t) = i(t)p(t)$, where $i(t)$ is as defined in Definition
2 and $p(t)$ is the probability that any given observation falls into node $t$. We define the
impurity of a tree $T$ by:
\[
I(T) = \sum_{t \in \tilde{T}} I(t)
\]
Suppose a split $s$ splits node $t$ into nodes $t_1$ and $t_2$, creating a new tree $T'$. The impurity
of this new tree will now become:
\[
I(T') = I(T) - I(t) + I(t_1) + I(t_2)
\]
Hence the decrease in the tree impurity is:
\[
\Delta I(s, t) = I(T) - I(T') = I(t) - I(t_1) - I(t_2)
\]
which only depends on the node $t$ and the split $s$.
When building a good decision tree, our aim is to minimize the tree impurity (whilst
avoiding over-fitting!). So when we are deciding which split to choose from a set of possible
splits for a node $t$, we choose the split $\hat{s}$ that maximizes $\Delta I(s, t)$. Now let $I(t_i) = i(t_i)p(t)p_i$
for $i = 1, 2$, where $p_i$ denotes the probability that an observation falls into node $t_i$ given
that it is in node $t$, and so $p_1 + p_2 = 1$. Then we can write:
\begin{align*}
\Delta I(s, t) &= i(t)p(t) - i(t_1)p(t)p_1 - i(t_2)p(t)p_2 \\
&= p(t)\big(i(t) - i(t_1)p_1 - i(t_2)p_2\big) \\
&= p(t)\, \Delta i(s, t)
\end{align*}
So as $\Delta I(s, t)$ only differs from $\Delta i(s, t)$ by the constant factor $p(t)$, we conclude that the
same $\hat{s}$ maximizes both expressions. In the next part we will see how we can find
this optimal split.
Figure 2: Plot of the Gini index in the case of a two-class problem.
2.3 The Gini Index
There are various impurity measures one could choose for I, like the Misclassification
Error or the Cross-Entropy, but we will make use of the most common measure called the
Gini index. Further details about the other measures can be found in Hastie et al. (2001,
Chapter 9.2). For a two-class problem, where $p(1|t)$ is as defined in Definition 2, the
Gini index is given by:
\[
i(t) = 2\,p(1|t)\big(1 - p(1|t)\big)
\]
Figure 2. shows us what the Gini index looks like by plotting p(1|t) on the x-axis against
i(t) on the y-axis. This plot illustrates that the Gini index satisfies the properties of an
impurity function given in Definition 1: it is maximized at $p(1|t) = p(2|t) = 0.5$, minimized
at $(p(1|t), p(2|t)) = (0, 1)$ and $(p(1|t), p(2|t)) = (1, 0)$, and it is clearly symmetric in
$p(1|t)$ and $p(2|t)$.
Now let us give an example of how the Gini index is used to find the optimal split
for a node. Let us consider a dataset called the Admissions Dataset, which contains
information about applicants and their application outcome for a particular MSc course in
2012-2013. We have a continuous predictor variable called "Days to Deadline" (D2D) that
represents the number of days before the deadline an applicant applied to the course. The
response variable is a binary variable Y taking the value of 1 if the applicant accepted the
offer and turned up for the course, and a value of 0 if they did not.
Figure 3. is a histogram plot that represents D2D categorized by month, but it still
gives us a good visualization of how this variable behaves. Note that the deadline is the
15th of July 2013. The height of each box indicates the number of students that were
given an offer in the specified month. In green, we have the number of students that
accepted the offer, and in red the number of students that did not. Looking at the proportions
of green and red more closely, we might notice a pattern. As we get closer to
the deadline, the proportion of students accepting the offer tends to increase. Especially
Figure 3: Histogram plot representing the D2D predictor variable.
from March onwards, about 50-70% of the students accept the offer, whereas before March
this acceptance rate is closer to 20%. Obviously, this analysis is very superficial, so let
us see how the Gini index is used to find the optimal split point that maximizes discrimi-
nation. We make use of the following algorithm:
Algorithm 2.
1. Let the values taken by the continuous variable D2D be $d_1, d_2, \dots, d_{101}$ and let their
corresponding binary outcomes be $y_1, y_2, \dots, y_{101}$ with $y_j \in \{0, 1\}$.
2. Reorder the observations so that $d_1 \le d_2 \le \dots \le d_{101}$, with correspondingly reordered
responses $y_1, y_2, \dots, y_{101}$. Let $t$ denote the root node containing all these observations.
3. For $j = 1, \dots, 101$:
i. Take $d_j$ to be the split point. $d_j$ splits $t$ into node $t_1$, containing all observations
with D2D $< d_j$, and node $t_2$, containing all other observations.
ii. Compute $p(1|t_1)$ and $p(1|t_2)$.
iii. Using the formula of the Gini index, compute $i(t)$, $i(t_1)$ and $i(t_2)$.
iv. Let $|t_i|$ denote the number of observations in node $t_i$. Then $p_1 = |t_1|/|t|$
and $p_2 = |t_2|/|t|$.
v. Compute $\Delta i(d_j, t) = i(t) - i(t_1)p_1 - i(t_2)p_2$.
4. Choose the split point $\hat{d}_j$ that maximizes $\Delta i(d_j, t)$, i.e. $\Delta i(\hat{d}_j, t) \ge \Delta i(d_j, t)$ for all
$d_j$ with $j = 1, \dots, 101$.
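For illustration, a minimal R sketch of this split search could look as follows. It assumes a data frame called admissions with the continuous column D2D and the binary response Y; these object names, and the helper functions gini and best_gini_split, are placeholders introduced here rather than anything defined in the text.

# Gini index of a node, given the vector of binary responses it contains
gini <- function(y) {
  p <- mean(y == 1)
  2 * p * (1 - p)
}

# Search for the split point of a continuous variable x that maximizes the
# decrease in Gini impurity, following Algorithm 2
best_gini_split <- function(x, y) {
  candidates <- sort(unique(x))
  decrease <- sapply(candidates, function(d) {
    left  <- y[x <  d]    # node t1: observations with x < d
    right <- y[x >= d]    # node t2: all other observations
    if (length(left) == 0 || length(right) == 0) return(0)
    p1 <- length(left)  / length(y)
    p2 <- length(right) / length(y)
    gini(y) - p1 * gini(left) - p2 * gini(right)
  })
  list(split = candidates[which.max(decrease)], decrease = max(decrease))
}

# Hypothetical usage on the Admissions Dataset:
# best_gini_split(admissions$D2D, admissions$Y)   # should suggest a split near 138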
Figure 4: Plot illustrating the difference of Gini indexes over the range of possible splits.
Applying Algorithm 2 to this Admissions Dataset example and plotting the split points
$d_j$ on the x-axis against the corresponding $\Delta i(d_j, t)$ on the y-axis, we obtain the plot shown
in Figure 4. Looking at this figure, we deduce that the split point that maximizes discrimination
is $\hat{d}_j = 138$. This method could then be repeatedly applied to the two intervals
obtained in order to carry out further splitting and achieve more discrimination if it is
required by the problem at hand.
In the above example, we saw how the Gini index can be used to find the optimal split
for any continuous predictor variable. For a discrete predictor variable, the algorithm is
very similar, but it needs slight modifications. Obviously, if the variable is not continuous,
we will not be able to pick a split point dj. Instead, we will divide the values of the discrete
variable into categories and choose the best division to be the split. Let us look at another
example involving the Admissions Dataset. "Fee Status" is a discrete variable describing
the type of tuition fee $d$ an applicant pays, with $d \in \{\text{Home}, \text{EU}, \text{Overseas}\}$. As we are
only interested in binary splits, we will consider all possible categorizations $c$ of these three
fee statuses into two categories:
• Category1: {Home, EU} / Category2: {Overseas}
• Category1: {Home, Overseas} / Category2: {EU}
• Category1: {Overseas, EU} / Category2: {Home}
Then in step 3 of Algorithm 2, we let $t_1$ be the node containing all observations with
$d \in$ Category1 and $t_2$ the node that contains the observations with $d \in$ Category2. The
remainder of step 3 is carried out in the same way. Step 3 is then repeated in a loop for
all possible binary categorizations $c$ (in this case 3 loops) and the categorization $\hat{c}$ that
satisfies $\Delta i(\hat{c}, t) \ge \Delta i(c, t)$ for all $c$ is chosen as the optimal split.
2.4 Tree Structure and Pruning
Once we know how to obtain the optimal split for a node, the next question we have to
consider is how to achieve the right tree size and structure. How many times shall we
partition our dataset before stopping? Or, put differently, how deep shall we grow our
tree? It is obvious that with very large trees we risk over-fitting, whereas trees that are too
shallow might not pick up all the patterns hidden in the data.
An intuitive solution to this problem is to only split a node if the decrease in the value
of the impurity is significant. Putting this mathematically, for a fixed threshold $\theta > 0$, we
declare a node $t$ terminal if, for the set of all possible splits $S$, we have $\max_{s \in S} \Delta i(s, t) < \theta$.
However, this approach is quite short-sighted, as an apparently worthless split could lead
to very worthy splits below it.
The preferred method in the literature is called tree pruning. We start by growing a
very large (but not necessarily full) tree $T_0$, where splitting at a node stops either when
the node observations are all in the same class or when the number of observations $\eta$ in
a node falls below a pre-specified threshold (generally we take $\eta = 5$). We then apply a
method called cost-complexity pruning, which will reduce the size of the tree in order to
eliminate the over-fitting present in the large tree.
As before, $\tilde{T}$ denotes the set of terminal nodes of a tree $T$. We define the cost-complexity
criterion for a tree $T$ of size $|\tilde{T}|$ as follows:
\[
C_\alpha(T) = I(T) + \alpha|\tilde{T}|
\]
where $\alpha \ge 0$ is called the cost-complexity tuning parameter. The aim is to find an $\alpha$ such
that $C_\alpha(T)$ is as small as possible. $\alpha$ indicates the balance between how
well a tree fits the data and the complexity of the tree. Taking the extreme case
$\alpha = 0$, the cost-complexity criterion tells us that the tree $T_0$ is the best. In general, lower
values of $\alpha$ give larger trees, whereas higher values of $\alpha$ result in smaller trees. Now the
question remains: how do we find the value of $\alpha$ that creates the optimal-sized tree?
There are many different techniques for choosing $\alpha$, but the most common and computationally
efficient one is called the One Standard Error Rule. More details about the
other methods can be found in Breiman et al. (1984). From the dataset at hand, we can
estimate the optimal value $\hat{\alpha} = \arg\min_\alpha C_\alpha(T)$. However, we will not
choose this value of $\hat{\alpha}$, for one simple reason. Figure 5 shows a typical curve of what
plotting $\alpha$ on the x-axis against $C_\alpha(T)$ on the y-axis looks like. Initially, there is a
rapid decrease, followed by a long, flat valley with only minor up-down variations, which
in fact are just noise. $\hat{\alpha}$ is situated at the deepest point of this valley. The dashed line in
Figure 5 indicates the upper limit of the one standard error band around $\hat{\alpha}$, and so we see that all
the points in this valley are well within the $\pm 1$ standard error range. As Figure 5 was
generated for a particular training set, it might well be that for a different training set the
value of $\hat{\alpha}$ will be different. Hence we conclude that the value of $\hat{\alpha}$ is unstable within this
valley. To provide a satisfying solution to this problem, the one standard error rule was
created: choose the greatest value of $\alpha$ for pruning which is still within this
$\pm 1$ standard error range, i.e. the maximal $\alpha$ such that
\[
C_\alpha(T) \le C_{\hat{\alpha}}(T) + SE(\hat{\alpha}).
\]
Hence, using Figure 5, we choose $\alpha = 0.037$ and the tree is pruned to size
4 (i.e. only four terminal nodes).
Figure 5: Plot of the tuning parameter against the cost-complexity criterion.
The one standard error rule chooses the simplest tree whose accuracy is comparable to
the one obtained when pruning with $\hat{\alpha}$. Note that this method gives only a suboptimal $\alpha$, but it
is easy to compute and it gives satisfactory results when applied to real-life data.
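The R package 'rpart', which we use for the experiments of Section 4.4, reports the cross-validated error for a grid of pruning parameters through its cptable (rpart's CP value is a rescaled version of α), so the one standard error rule can be applied in a few lines. The following is only a sketch under the assumption that fit is an already-fitted rpart object; the column names are those documented for cptable.

library(rpart)

# fit is an rpart tree grown on some training data, e.g.
# fit <- rpart(Y ~ ., data = train, method = "anova")

cp <- as.data.frame(fit$cptable)              # columns: CP, nsplit, rel error, xerror, xstd
best <- which.min(cp$xerror)                  # row corresponding to alpha-hat
threshold <- cp$xerror[best] + cp$xstd[best]  # upper limit of the one-SE band

# One standard error rule: the simplest tree whose cross-validated error
# is still within one standard error of the minimum
chosen <- min(which(cp$xerror <= threshold))
pruned <- prune(fit, cp = cp$CP[chosen])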
2.5 Classification and Regression Trees
To complete this section, we need to specify a rule that assigns a prediction to each terminal
node. A classification tree is a predictor that assigns a class label to each terminal node,
whereas a regression tree is a predictor that assigns a numerical value to it. It turns out
that both in classification and regression, the node assignment rule is very simple.
Let $t$ be a terminal node and let $y_1, y_2, \dots, y_m$ be the classes of all the observations
that fall into $t$. Remember that $p(j|t)$ denotes the proportion of class $j$
observations that fall into node $t$. For a classification tree, we designate the class of the
node $t$ to be the class that is present in the largest proportion within that node. Putting
it mathematically, $j_t$ is chosen to be the class of $t$ if
\[
p(j_t|t) = \max_{j} p(j|t)
\]
In Section 4., we will be interested in a specific two-class problem. We would like to
predict whether an applicant $i$ will attend the MSc course (i.e. response variable $y = 1$) or
not (i.e. $y = 0$). Using the Admissions Dataset as a training set, we grow a decision
tree. We then put a new applicant down this tree; the applicant follows the decisions made
at each node until falling into a terminal node. Instead of predicting the class
an applicant might belong to, it would be more valuable to predict the probability of an
applicant belonging to a certain class. An appealing node assignment rule could be the
following: assign the probability $p$ to a terminal node $t$, where $p$ is such that:
\[
p = \frac{1}{m}\sum_{i=1}^{m} y_i \tag{1}
\]
where $m$ is the number of observations within the terminal node $t$ and $\sum_{i=1}^{m} y_i$ is simply a
count of the number of observations that are of class 1. So if an applicant belonging to class
1 means that this applicant will attend the MSc course, then $p$ represents the probability
that any new applicant that falls into the terminal node t will attend the course. As p is
not a class label, but is a numerical value (in this case a probability), a decision tree with
such a node assignment rule would be called a regression tree. We will make use of such
regression trees in Section 4.
Finally it should be noted that many other types of decision trees exist. In this section
we assigned a constant to each terminal node, however it is also possible to fit a simple
model to each terminal node. Amongst others, a simple model of choice could be a logistic
regression model. More details about logistic regression trees can be found in Chan and
Loh (2004).
3 Random Forests
Ensemble learning methods are techniques that generate many predictors and combine
their outputs to form an enhanced aggregate predictor. The random forest is a typical
example of an ensemble learning method: it combines results from multiple independent
and identically distributed decision trees to create a very powerful prediction tool.
3.1 The Algorithm
Let $D_0$ be the original dataset containing $n$ observations. Let $S^0_X$ denote the set of all
predictor variables of $D_0$ and let $m$ be the size of $S^0_X$. A random forest can be grown in
the following way:
Algorithm 3.
1. Fix the parameters $R \in \mathbb{N}^+$ and $m_{try} \in \mathbb{N}^+$ with $m_{try} \le m$.
2. Draw R bootstrap samples with replacement from $D_0$.
3. For each of these R samples, grow an unpruned decision tree. This ensemble of trees
is our random forest. Grow each tree by making the following adjustment: at each node, instead of choosing
the optimal split amongst all predictor variables, randomly sample a subset of $m_{try}$
predictor variables from $S^0_X$ and pick the optimal split amongst the possible splits on
this subset of variables.
4. Predict the outcome of a new observation by aggregating the predictions of all the
trees. For classification, this aggregation means selecting the class that obtained the
majority of votes; for regression it means taking the average of the outcomes.
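For reference, the R package 'randomForest' (used for the experiments of Section 4.4.2) implements exactly these steps. A minimal sketch follows, in which train and newdata are placeholder data frames and the parameter values are arbitrary examples.

library(randomForest)

# Grow a forest of R = 1000 unpruned trees, sampling mtry = 2 candidate
# variables at each node (step 3 of Algorithm 3)
rf <- randomForest(Y ~ ., data = train, ntree = 1000, mtry = 2)

# Step 4: predictions are aggregated over all trees automatically
# (majority vote for classification, average for regression)
pred <- predict(rf, newdata = newdata)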
Reading this algorithm, we realize that the random forest has two ingredients in addition
to being a simple collection of decision trees:
i. the randomness created by the bootstrapping, and
ii. the randomness introduced at the level of the nodes, when choosing an optimal split
on a random subset of predictor variables rather than from the entire set.
In what follows, we will look into the meaning of these two points in order to understand
the random forest in greater depth.
3.2 Bootstrapping and Bagging
Before the random forest was developed, a very similar technique called bagging existed.
Bagging is basically the random forest without the additional layer of randomness given
in point ii. above. Bagging was first introduced in Breiman (1996) and we will base our
explanation of bagging on this paper.
3.2.1 What is a bootstrap sample?
Suppose we have $n$ independent and identically distributed observations $x_1, x_2, \dots, x_n$ that
come from an unknown distribution $F$. A bootstrap sample $x^*_1, x^*_2, \dots, x^*_n$ is obtained
by randomly sampling $n$ times, with replacement, from the original set of observations
$x_1, x_2, \dots, x_n$. In this bootstrap set-up, the bootstrap sample does not come from the
distribution $F$, but from the empirical distribution $\hat{F}_n$, which places an equal mass of $1/n$
on each of $x_1, x_2, \dots, x_n$. Note that the reason why we sample with replacement is to make sure
that every time we randomly sample an observation, it comes from the correct distribution
$\hat{F}_n$; sampling without replacement would mean we are no longer sampling from $\hat{F}_n$. $\hat{F}_n$ is also called
the bootstrap approximation of $F$ and, using Monte Carlo integration methods, one can
show that as $n \to \infty$, $\hat{F}_n$ converges in probability to $F$ (Efron and Tibshirani, 1993).
Generally, bootstrap samples are used to estimate confidence intervals for statistics of
interest, however we will not go into detail on that. For more on the theory and applications
of bootstrapping, consult Efron and Tibshirani (1993).
3.2.2 Bagging: Bootstrap aggregating
Suppose we have a training dataset $D$ coming from an unknown distribution $F$. $D$ consists
of observations $\{(x_i, y_i),\ i = 1, \dots, n\}$, where $x_i$ represents the characteristics of an observation
$i$ and $y_i$ its response. Let $\text{tree}(x, D)$ denote a single tree predictor that was built
using the training dataset $D$ and predicts an output $\hat{y}$ for an input $x$. Let $\{D_k\}$ represent
a sequence of datasets, each containing $n$ observations drawn from the distribution $F$.
One can define a new classifier $\text{tree}_A$ which uses an aggregate of the datasets $\{D_k\}$ to
make predictions. For a response representing a class label $c \in \{0, 1\}$, the most common
aggregating method is voting. Let $n_c(x)$ denote the number of trees that classify an input
$x$ to class $c$, i.e. $n_c(x) = \sum_k \mathbf{1}(\text{tree}(x, D_k) = c)$. Then let the aggregate predictor classify
$x$ to the class that got the most votes:
\[
\text{tree}_A(x, F) = \text{tree}_A(x, \{D_k\}) = \arg\max_c n_c(x)
\]
The only inconvenience with this approach is that it is too idealistic, as most of the time
we do not have the luxury to generate a sequence of datasets {Dk} which all come from the
same distribution F. This is where bootstrap samples can be used to create an imitation
of this ideal set-up and get an accurate estimate of $\text{tree}_A(x, \{D_k\})$. For $r = 1, \dots, R$, let
$\{D^{(r)}\}$ be a set of $R$ bootstrap samples obtained from $D$. Then, in the previous formulae,
replace $\{D_k\}$ by the set $\{D^{(r)}\}$ in order to obtain a very good bootstrap approximation of
$\text{tree}_A(x, F)$. Using bootstrap samples to train tree predictors, whose outputs are aggre-
gated to give a prediction, is what we will refer to as the bagging predictor. (Note that
in the literature, bagging has a wider meaning: the aggregated predictors do not need to
necessarily be trees, they could be any type of predictor.)
If the response is not a class label but a numerical value, then instead of taking the vote
of the most popular class, it seems natural to let the bagging prediction be the average of
all the predictions of the various tree components:
\[
\text{tree}_A(x, \{D^{(r)}\}) = \frac{1}{R}\sum_{r=1}^{R} \text{tree}(x, D^{(r)})
\]
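This bagging predictor is easy to sketch by hand in R, using 'rpart' trees as the base predictors. The sketch below is only illustrative: the data frame train, its response column Y, the number of bootstrap samples and the choice of an essentially unpruned tree (cp = 0) are all placeholder assumptions.

library(rpart)

R <- 100   # number of bootstrap samples

# Grow one (essentially unpruned) tree per bootstrap sample D^(r)
bagged_trees <- lapply(1:R, function(r) {
  boot_idx <- sample(nrow(train), replace = TRUE)    # bootstrap sample of D
  rpart(Y ~ ., data = train[boot_idx, ], method = "anova",
        control = rpart.control(cp = 0))             # cp = 0: no cost-complexity pruning
})

# Regression-style aggregation: average the predictions of the R trees
bagging_predict <- function(newdata) {
  preds <- sapply(bagged_trees, predict, newdata = newdata)  # n x R matrix
  if (is.matrix(preds)) rowMeans(preds) else mean(preds)
}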
3.2.3 Why does bagging improve prediction accuracy?
In this subsection we will show why the bagging predictor gives lower error rates than the
single decision tree. We will only look at the regression case; how it is done for classification
can be found in Breiman (1996).
Using the set-up from the two previous subsections, the bagging predictor can also be
written as:
\[
\text{tree}_A(x, F) = E_D(\text{tree}(x, D))
\]
Now let $Y, X$ denote observations coming from the distribution $F$, independent of $D$. We
define the following two mean squared errors:
1. The average prediction error of a single tree:
\[
e_t = E_D\big(E_{Y,X}(Y - \text{tree}(X, D))^2\big)
\]
2. The error of the aggregated predictor:
\[
e_A = E_{Y,X}(Y - \text{tree}_A(X, F))^2
\]
Then, with the help of the inequality $E^2(W) \le E(W^2)$, the following computations can be
carried out:
\begin{align*}
e_t &= E_D\big(E_{Y,X}(Y - \text{tree}(X, D))^2\big) \\
&= E_D\big(E_{Y,X}(Y^2 - 2Y\,\text{tree}(X, D) + \text{tree}^2(X, D))\big) \\
&= E_D\big(E_{Y,X}(Y^2) - 2E_{Y,X}(Y\,\text{tree}(X, D)) + E_{Y,X}(\text{tree}^2(X, D))\big) \\
&= E_D\big(E_{Y,X}(Y^2)\big) - 2E_D\big(E_{Y,X}(Y\,\text{tree}(X, D))\big) + E_D\big(E_{Y,X}(\text{tree}^2(X, D))\big) \\
&= E_{Y,X}\big(E_D(Y^2)\big) - 2E_{Y,X}\big(Y E_D(\text{tree}(X, D))\big) + E_{Y,X}\big(E_D(\text{tree}^2(X, D))\big) \\
&\ge E_{Y,X}(Y^2) - 2E_{Y,X}\big(Y E_D(\text{tree}(X, D))\big) + E_{Y,X}\big(E^2_D(\text{tree}(X, D))\big) \\
&= E_{Y,X}(Y^2) - 2E_{Y,X}(Y\,\text{tree}_A(X, F)) + E_{Y,X}(\text{tree}^2_A(X, F)) \\
&= E_{Y,X}\big(Y^2 - 2Y\,\text{tree}_A(X, F) + \text{tree}^2_A(X, F)\big) \\
&= E_{Y,X}(Y - \text{tree}_A(X, F))^2 \\
&= e_A
\end{align*}
This is a nice result; however, the bagging prediction is not $\text{tree}_A(X, F)$ but $\text{tree}_A(X, \hat{F}_n)$,
where $\hat{F}_n$ is the bootstrap approximation of $F$. If the number of observations is large
enough to consider $\hat{F}_n$ representative of $F$, then $\text{tree}_A(X, \hat{F}_n)$ becomes a sufficiently good
estimate of $\text{tree}_A(X, F)$ and hence the improved performance obtained by bagging will still
be present.
3.3 The Random Forest
As mentioned previously, the main difference between bagging and random forests is the
way they split the nodes of their trees. In bagging, each node is split using the optimal
split amongst all variables, which is just the standard CART technique discussed in Section
2. In a random forest, each node is split taking the best split amongst a randomly chosen
subset of all predictor variables. This seems to be a somewhat counterintuitive approach,
however numerous empirical studies provide convincing evidence that random forests are
more accurate than bagging, or a single decision tree for that matter.
3.3.1 Theorems
In this subsection we are going to state some interesting theorems about random forests
that can be helpful in explaining the reason why they produce enhanced performances. We
keep the same notation used previously in this section. The following theorems were taken
from Breiman (2001).
Theorem 1. As the number of trees goes to infinity,
\[
E_{X,Y}\Big(Y - \frac{1}{R}\sum_{r=1}^{R} \text{tree}(X, D^{(r)})\Big)^2 \;\to\; E_{X,Y}\big(Y - E_D(\text{tree}(X, D))\big)^2
\]
almost surely.
The theorem can be proven using the Strong Law of Large Numbers (Breiman, 2001).
This limiting result justifies why random forests do not over-fit as more trees are added
to the forest. Hence we do not have to worry about how big we grow our forest, as the error
measures will not be affected. An illustration of this will be given in Section 4.4.2.
Now we look at a result that explains why the random forest performs better than the
individual trees it employs. We define the following two error measures:
1. The generalization error of the forest:
\[
e_f = E_{X,Y}\big(Y - E_D(\text{tree}(X, D))\big)^2
\]
2. The average generalization error of a tree:
\[
e_t = E_D\big(E_{X,Y}(Y - \text{tree}(X, D))^2\big)
\]
Theorem 2. If for all $D^{(r)}$, $E_{X,Y}(Y) = E_{X,Y}(h(X, D^{(r)}))$, then
\[
e_f \le \rho\, e_t
\]
where $\rho$ is the weighted correlation between the two residuals $Y - \text{tree}(X, D^{(i)})$ and
$Y - \text{tree}(X, D^{(j)})$, where $D^{(i)}$ and $D^{(j)}$ are independent.
The proof of this theorem can be found in Breiman (2001). This is an interesting result,
as it shows that the random forest can decrease the average error of the trees it employs
by a factor of $\rho$. If one can manage to reduce this correlation $\rho$, then very accurate forests
can be obtained.
This correlation reduction can be achieved by the random predictor variable selection
present at each node. The magnitude of this correlation reduction will depend on the
number mtry of randomly selected variables at each node. So one might wonder what
the best value is for $m_{try}$. There is no theoretical result that gives us an optimal value;
however, if $m$ denotes the size of the set of all predictor variables, empirical results suggest
that $\lfloor\sqrt{m}\rfloor$ is a good choice for $m_{try}$. Nevertheless, $m_{try}$ still remains a parameter to tune, so
one will have to try multiple values as part of the tuning process. Examples of how
different values of $m_{try}$ affect the accuracy of the random forest can be found in Section 4.4.2.
It should also be noted, that in the particular case when mtry = m, the random forest is
simply the bagging predictor. One might expect the random forest to perform better than
the bagging predictor, as the latter does not contain the layer of randomness introduced
by random predictor variable selection to reduce the correlation ⇢.
3.3.2 Out-of-bag error estimates
A nice thing about performance evaluation for random forests is that it can also be done
without using a separate test set or any cross-validation methods, because it is possible to
obtain an internal error estimate: the out-of-bag (OOB) error estimate.
Let $R$ be the total number of trees in the forest. Each tree $r$ is constructed using a
different bootstrap sample $D^{(r)}$ coming from a training dataset $D$. We know that when
sampling $D^{(r)}$, the observations are sampled randomly with replacement, hence any observation
$i$ might appear several times or not at all in any particular bootstrap
sample $D^{(r)}$. It can easily be proved (Orbanz, 2014) that, on average, about one third of
the dataset $D$ is not present within a particular $D^{(r)}$. This set of left-out observations is
what we call the OOB data of the rth tree. We then put each observation of this OOB
data down the rth tree and obtain a prediction for each one of them. Repeating this for
all R trees means that for each observation i, we gather about R/3 predictions.
For classification, let $\hat{y}_i$ be the class that obtained the majority of votes from these $R/3$
predictions and let $y_i$ represent the true class of observation $i$. The OOB error estimate
is defined as:
\[
e_{oob} = \frac{\sum_{i=1}^{n} \mathbf{1}(\hat{y}_i \ne y_i)}{n}
\]
In the case of regression, $\hat{y}_i$ will simply be the average of the $R/3$ predictions and so
the OOB error is defined to be the following mean squared error:
\[
e_{oob} = \frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n}
\]
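In the 'randomForest' package these OOB predictions are computed automatically while the forest is grown. A short sketch of how the regression OOB error above could be read off a fitted forest follows; rf and train$Y are placeholder names, and the components used (predicted, mse, ntree) are those documented for regression forests, which is worth verifying against the package documentation.

# rf$predicted holds, for each training observation, the prediction aggregated
# only over the trees for which that observation was out-of-bag
oob_pred <- rf$predicted
e_oob <- mean((train$Y - oob_pred)^2)   # OOB mean squared error

# For a regression forest, rf$mse[rf$ntree] should report the same quantity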
In Breiman (2001), it is shown that the OOB error estimates tend to overestimate the
true error rates, and hence they cannot be relied on for error rate assessment. Instead,
these OOB estimates can serve as a guide in the tuning of random forests and can be used to
monitor correlation and generalization error. In the next section we will see how OOB error estimates
are used to estimate variable importance.
3.3.3 Variable importance
Using the OOB error estimates, we can obtain a measure called the variable importance
measure. As indicated by its name, it measures how useful each predictor variable is for the
given prediction task. The way in which the variable importance is calculated is explained
by the following algorithm:
Algorithm 4. Given a dataset $D$, grow a random forest and calculate $e_{oob}$. Let $X_1, X_2, \dots, X_m$
be the set of all the predictor variables.
For $i = 1, \dots, m$, do:
1. For variable $X_i$, randomly permute its values amongst each other to obtain a new
variable $X^p_i$.
2. Define a new dataset called $D^p_i$, which is the same as our original dataset $D$, with
the only exception that it contains the predictor variable $X^p_i$ instead of $X_i$.
3. Compute the OOB error for $D^p_i$ and call it $e^{(i)}_{oob}$. By executing this permutation, the
$i$th variable loses all the information that it contained, and so with the loss of this
piece of information, $e^{(i)}_{oob}$ can be expected to be higher than $e_{oob}$.
4. Compute the variable importance of $X_i$, which is defined as the percent increase of
$e^{(i)}_{oob}$ compared to $e_{oob}$. More formally:
\[
\text{importance}(X_i) = 100 \times \left(\frac{e^{(i)}_{oob}}{e_{oob}} - 1\right) \%
\]
The higher the importance measure of a predictor variable, the more essential this vari-
able is for the prediction task at hand. We will see an example of how variable importance
can be used in Section 4.
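In practice, the 'randomForest' package computes a closely related permutation measure when the forest is grown with importance = TRUE; its %IncMSE column is a normalized version of the percentage increase defined above rather than exactly Algorithm 4. A brief sketch, with train as a placeholder data frame:

library(randomForest)

rf <- randomForest(Y ~ ., data = train, ntree = 1000, mtry = 2,
                   importance = TRUE)   # compute permutation importance

importance(rf)    # %IncMSE column: permutation importance of each variable
varImpPlot(rf)    # graphical summary of the same measures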
3.3.4 Additional features of Random Forests
Random forests have many other advantageous characteristics that we did not discuss. One
of their important features is computational efficiency: for example, running on a dataset of
50,000 observations and 100 predictor variables, a random forest of 100 trees can be grown
in 11 minutes on an 800 MHz machine (Breiman and Cutler, 2014).
What is more, a random forest does not over-fit and is well suited to handling missing
data and outliers. It also offers experimental methods for detecting variable interactions
and it gives a way of calculating proximities between pairs of observations. To find out
more about random forests, visit Breiman and Cutler (2014).
4 Applying the Trees to the Admissions Dataset
4.1 Describing the Data
The Admissions Dataset is a dataset that contains information about applicants that
received an offer for a particular MSc course held in 2012-2013. This dataset contains
101 applicants (i.e. observations). We have a binary response variable $Y$ and a set of
predictor variables $S^0_X$. A summary of these variables is given in the following:
1. Response ($Y$): for an applicant $i$ with outcome $y_i$,
\[
y_i = \begin{cases} 1 & \text{if applicant } i \text{ accepted the offer and turned up for the course,} \\ 0 & \text{if applicant } i \text{ rejected the offer.} \end{cases}
\]
2. Citizenship (C): the nationality of the applicant. Countries with few applicants and
the same geographical location have been grouped together.
3. Fee status(FS): either home, EU or overseas.
4. Application date(AD): self-explanatory.
5. Gender(G): male or female.
6. Location of last degree(LLD): a categorical variable indicating where the applicant
obtained their last degree. This could be a specific university (e.g. Imperial) or a region
(e.g. EU).
7. Year of last degree(Y LD): self-explanatory.
Our aim in Sections 4 and 5 is to find out how accurate different models are in predicting
turn-up rates of future applicants that are made an offer. In other words, given an applicant
$i$ with characteristics $x_i$, we want to know how accurate a given model is in predicting the
probability $p_i$ of this applicant accepting the offer.
4.2 Variable Transformations
Considered in their raw form, not all predictor variables are ready to be used. C, FS, G
and LLD are all categorical variables that can be put directly in the model, however AD
and Y LD cannot be. These two are just a set of dates that R considers as categorical
variables having a lot of categories, one category per date.
Intuitively, as these variables contain a notion of time, we will first want to make them
continuous. AD can easily be transformed to create a new predictor variable called "Days
To Deadline" (D2D), a continuous variable indicating how many days before the
deadline the given applicant applied. A similar transformation can be applied to YLD.
We create a new predictor variable called "Years Since Last Degree" (YLD*), which simply
gives the number of years that have passed since the given applicant's last degree.
Examining the continuous variable YLD*, we notice that for most (around
80%) of the applicants this characteristic is just 0, which means that they did not take a
gap in their studies. So it might be beneficial to further transform this variable into
a new categorical variable called "Gap Year" (GY), where one category contains all the
applicants that took at least one gap year and the other contains those that
did not.
We might want to apply an analogous transformation to D2D. Using Algorithm 2. in
Section 2, we split this continuous variable into two periods: early, for the applicants
that applied earlier than 138 days before the deadline; and late, for the applicants that
applied later than that. We call this variable "Application Period" (AP).
Creating these last two transformed variables seemed natural given the problem
at hand, so we will use the variables GY and AP as replacements for YLD and AD.
4.3 Measure of Performance
4.3.1 Mean Squared Error
In order to get estimates of $p_i$, we will make use of trees that have the terminal node
assignment rule given in equation (1) of Section 2.5. It should be noted that, even though
the response variable $Y$ represents a class label, it can also be considered a probability:
1 meaning the applicant definitely accepts the offer and 0 meaning definitely not. Consequently,
in order to assess how accurate a certain predictor is for a given observation $i$,
we simply consider how far the prediction $\hat{y}_i$ is from the true response $y_i$, i.e. $y_i - \hat{y}_i$.
Hence the common measure that we will employ to assess the performance of a predictor on a test set
containing $n$ observations will be the Mean Squared Error (MSE), defined as:
\[
MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\]
4.3.2 The Jackknife
Testing a model on the same set that we used for training is not a good idea. This is
because our model might fit the training data very well, yet perform poorly when it is
used for prediction on an unseen dataset. This would
be a case of over-fitting. A smart way to avoid this is to use a test set
that is independent of the training set. A common way this can be done is by
using cross-validation techniques. A K-fold cross-validation works the following way: the
original dataset $D_0$ is randomly split into $K$ (roughly) equal-sized sub-datasets $\{D_k\}$; we
then loop over $k$, where $D_k$ is kept out to be the independent test set and the
others are put back together to form the training set. In every loop, the model is trained
using the training set and performance is assessed using the test set. Each loop gives
us a performance measure, and these are then averaged to form the overall cross-validation
performance measure. Cross-validation is a very good method, as it allows us to obtain
good estimates of the true performance measure; however, for higher values of $K$, the
algorithm might get computationally inefficient.
In our case, however, the number of observations (101) is relatively small, so it might be a
good strategy to use a high value for $K$, as it will not cause any time-efficiency problems
and it might produce more reliable performance measures. So for our example, we are
going to use a special version of cross-validation by taking $K = |D_0|$. This is called the
Jackknife. The following algorithm gives a detailed description of how the Jackknife is
used to obtain the MSE performance measure:
Algorithm 5. Let $D_0$ be a set of observations $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$:
1. For $i = 1, \dots, n$, do:
(a) Let the single observation $(x_i, y_i)$ be the test set, and let the $n - 1$ other observations
form the training set.
(b) Train the model using the training set.
(c) Use this model to predict the response of the test set, i.e. compute $\hat{y}_i$.
2. From step 1 we get the $\hat{y}_i$ for $i = 1, \dots, n$, so using the formula given above, we can
compute $MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$.
This algorithm provides an accurate way of assessing the performance of a predictor, and
it also allows us to compare the performances of various different predictors.
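A sketch of Algorithm 5 in R, shown here with a random forest as the model under assessment; the data frame name admissions, its response column Y and the forest parameters are placeholder assumptions.

library(randomForest)

n <- nrow(admissions)
yhat <- numeric(n)

for (i in 1:n) {
  train_i <- admissions[-i, ]                       # leave observation i out
  model_i <- randomForest(Y ~ ., data = train_i,
                          ntree = 1000, mtry = 2)
  yhat[i] <- predict(model_i, newdata = admissions[i, , drop = FALSE])
}

mse <- mean((admissions$Y - yhat)^2)   # Jackknife (leave-one-out) MSE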
4.4 Results
In this part we present the results we obtained by experimenting with tree-based predictors.
We are going to use the MSE as a measure of performance.
Before doing anything, however, it is important to compute the MSE of the null model,
so that we have a baseline value to which we can compare the results given by the more
involved models. The null model is the simplest of predictors: it assigns the same
constant to every new observation, without taking into consideration any of the predictor
variables. In our case, this constant will just be the average value of the response variable
Y, which is about 0.347 (101 applicants got an offer, 35 accepted it). The null model gives
an MSE of 0.2293. The null model is an extremely basic model, and so we definitely would
expect to get better measures of performance from our tree-based predictors.
4.4.1 A single tree
In this subsection, we will use the R package ’rpart’ to carry out all our computations.
Let us start by obtaining the MSE of a single unpruned decision tree. An example of such
a tree (one taken from the many grown by the Jackknife algorithm) is drawn in Figure 6.
The MSE of this unpruned tree is 0.2216. This is somewhat better than the null model,
but for a 'proper' model like this, we would expect much better results. However, this tree
has not been pruned yet, and so it is most likely over-fitting the data. By looking at the
cost-complexity tuning parameter plot in Figure 7, we choose the value of $\alpha = 0.074$ for
pruning. This gives us the tree in Figure 8 and a much-improved MSE of 0.1984.
This is quite a significant performance gain, and it is a good example illustrating how
dangerous over-fitting can be.
4.4.2 The entire forest
In this subsection, we discuss the results generated by random forests. Our findings are
obtained by using the R package ’randomForest’. Random forests are very user-friendly,
as they only have two (main) parameters to tune: the number of trees R in the forest and
the number of predictor variables mtry selected randomly at each node. As it turns out,
the number of trees in the forest only plays a stabilization role: by increasing
their number, the variance of the performance measure is reduced and it stabilizes around
a certain value. Figure 9 illustrates this stabilization. So we fix R = 1000, as this is
sufficiently large and the forest still remains computationally efficient.
Now let us move on to tuning mtry. A multitude of experiments displayed in Breiman
(2001) have revealed that a value of $m_{try} = \lfloor\sqrt{m}\rfloor$ generally gives the best results, where
$m$ is the total number of predictor variables. Note that in our case $\lfloor\sqrt{6}\rfloor = 2$, so let us see
whether this rule of thumb is confirmed and we get the best results for an mtry of 2. Table 1
gives the MSEs (and the OOB error estimates for comparison) given by random forests
for different values of mtry. As we can see, we get considerable improvements in MSE
compared to those produced by the single decision tree in Section 4.4.1. This confirms the
theory discussed in Section 3. We also notice that the rule of thumb holds, as mtry = 2
gives among the lowest MSEs. Taking mtry = 1 gives a decent result, but not nearly as
good as the others.
Figure 6: An unpruned decision tree.
Legend: At each node, the split applied is specified. Observations that have the indicated
characteristics go down the left branch, the others go down the right one.
C=acghi → C ∈ {China, France, otherEU, UK, USA}
AP<0.5 → AP = {early}
LLD=bde → LLD ∈ {EU, Imperial, Overseas}
LLD=cdefh → LLD ∈ {EuropeOther, Imperial, Overseas, Oxbridge, UK}
FS=d → FS = {Overseas}
Figure 7: Plot suggesting the value to take for the cost-complexity parameter.
Figure 8: The pruned decision tree.
Figure 9: Plot illustrating the stabilization of the performance measure as the number of
trees in the forest is increased (taking mtry = 2).
mtry   MSE (×10⁻²)   e_OOB
1      20.20         20.31
2      19.38         19.54
3      19.37         19.60
4      19.47         19.63
5      19.49         19.63
6      19.68         19.88
Table 1: MSE and OOB errors produced by random forests for different values of mtry.
Table 2: The variable importance measures of the different predictor variables.
The reason for this might be the following: if mtry = 1, then at each node one predictor
variable is selected at random; if this selected variable can only be split in useless ways,
the accuracy of the forest will be negatively affected. This problem seems to fade away for
increased values of mtry. It should also be noted that when taking mtry = 6, our random
forest is simply the bagging predictor presented in Section 3.2, as this means that no
random predictor variable selection is done at the nodes. Hence we can conclude that in
this case, random forests perform better than the
bagging predictor.
Now let us move on to the importance of the individual predictor variables.
These are computed using the method explained in Section 3.3.3 and the results are given
in Table 2. Higher values of %IncMSE indicate a higher importance of the variable for the
prediction task at hand. It becomes immediately obvious that the predictor variables C
and AP are considerably more important than the others. So we might wonder why not
make the set of variables included in the model a parameter to tune.
An interesting thought is to only include in the model the predictor variables
that Table 2 points out to be important. So we try growing a random forest only including
variables C and AP. The results we get are summarized in Table 3. (Note: the
performance of many other combinations of predictor variables has been measured, and
even though some others give improved results, the lowest MSE was given
by the combination C-AP.)
Table 3 reveals remarkable results: we find a decrease in the MSE of these forests that is
quite significant.
mtry   MSE (×10⁻²)   e_OOB
1      18.41         18.49
2      18.87         19.01
Table 3: Performance measures of random forests involving only the 'important' predictor
variables C and AP.
Why might this be the case? Previously, when the model had
all six predictor variables as input, the random predictor variable selection at each node
allowed the possibility that a set containing only 'unimportant' variables was picked, which
consequently could lead to a 'bad' split. These few bad splits, especially if they occurred
further up the trees, led to less than ideal performance. When we experimented by feeding
only the important variables into the model, we could be confident that this problem would
not occur, as every split is most likely quite a good one. This might be the reason for the
improved model performance.
5 Other Prediction Methods
Previously we saw that, for the Admissions Dataset, the random forest outperforms the
simple tree predictor. Nevertheless, it would be interesting to see how the performance of
the random forest compares with the performance of other popular classification techniques.
The methods we try in this section are the Generalized Linear Model, Linear and Quadratic
Discriminant Analysis and the Multilayer Perceptron. We will not go into the theory behind
these methods, but just explain how they can be tuned correctly to obtain the best
performance.
5.1 Generalized Linear Model
The Generalized Linear Model (GLM) is probably the most widely used statistical tool
and is always a good starting point in any classification problem. The inbuilt R function
’glm’ allows us to use this tool, and the only parameters that need to be specified are the
predictor variables that are included in the model and the link function.
As the response variable of the Admissions Dataset is binary, we chose to fit a binomial
GLM with its canonical link, the logit function. Other choices of link function
are also possible, but this one appears to be the most natural.
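A sketch of the corresponding call follows; the data frame name admissions is a placeholder, and variable selection is discussed next.

# Binomial GLM (logistic regression) with the canonical logit link
fit_glm <- glm(Y ~ ., data = admissions, family = binomial(link = "logit"))

# Predicted probabilities of an applicant accepting the offer
p_hat <- predict(fit_glm, type = "response")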
Now let us consider the selection process of the predictor variables. The GLM allows
us to include any number of predictor variables, however as we will see, choosing the right
ones is crucial in order to get optimized performance. To find a good set of variables, we
are going to apply an algorithm called forward stepwise selection. This algorithm is similar
to the one in Bellotti (2013) and it works the following way:
Algorithm 6. Let $S^0_X$ denote the set of all predictor variables and $S_i$ the set of variables
selected after step $i$ of the process. We choose the AIC as our selection measure. This is a
measure very similar to the cost-complexity criterion of Section 2.4: the lower the AIC, the
better the model. For two models $M_1$ and $M_2$, let $\Delta AIC(M_1, M_2) = AIC(M_1) - AIC(M_2)$.
1. Start with $S_0 = \emptyset$.
2. While STOP has not been applied, do:
(a) Set $\hat{j} = \arg\min_{j \in S^0_X \setminus S_i} \Delta AIC(S_i \cup \{j\}, S_i)$
(b) IF $\Delta AIC(S_i \cup \{\hat{j}\}, S_i) < 0$:
i. Set $S_{i+1} = S_i \cup \{\hat{j}\}$
ii. Let $i = i + 1$
iii. IF $S_i = S^0_X$, then STOP.
(c) ELSE STOP
3. Return $S_i$. This will be the set of predictor variables we feed into the model.

Variable j   Step 1   Step 2   Step 3
FS           -1.8     -1.6      1.6
G             3.5      2        1.9
GY            0.4     -0.3      0.6
LLD           5.7      9.9     11
AP           -6.7
C            -6.4     -9.1
Table 4: The values of $\Delta AIC(S_i \cup \{j\}, S_i)$ for all variables $j$ at each step. The $\Delta AIC$ of
the variable selected at each step is highlighted in green. The algorithm stops at step 3, as
all values are positive.
Now let us apply this algorithm to the Admissions Dataset. The AIC of the model not
containing any predictor variables is 131.5. The steps of the forward stepwise selection are
given in Table 4. From this table, we deduce that including only the two variables AP
and C will achieve the best performance (not necessarily maximized performance, as not
all combinations of predictor variables were tried). Doing this, we obtain MSE = 0.1831.
Note that the inclusion of various interaction terms was also tested, but no improvement in
performance was found. The results found for the various classifiers will be discussed in
Section 5.5.
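The same kind of AIC-driven forward search can also be reproduced with R's built-in step function; the following is only a sketch under the same placeholder names, not the exact procedure used above.

# Start from the null model and add one variable at a time while the AIC decreases
null_model <- glm(Y ~ 1, data = admissions, family = binomial)

fit_forward <- step(null_model,
                    scope = ~ C + FS + G + LLD + AP + GY,   # candidate variables
                    direction = "forward")

summary(fit_forward)   # expected to retain AP and C, matching Table 4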
5.2 Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) is a fairly basic classifier that makes simple assumptions
about the nature of the data. It assumes that the density within each class is a
Gaussian distribution, with all classes having the same covariance structure. It then makes use of a
linear decision boundary to achieve classification. Further details on the theory of linear
discriminant analysis can be found in Hastie et al. (2001, Chapter 4).
A nice thing about this prediction tool is that it does not require any specific tuning
and so it can be applied directly to the dataset. Using the R function 'lda', we get
MSE = 0.1946.
5.3 Quadratic Discriminant Analysis
Quadratic Discriminant Analysis (QDA) is very similar to linear discriminant analysis,
with the exception that it allows the covariance matrices of the classes to be different.
It uses a quadratic decision boundary to do the classification. A more detailed explanation
of the theory of quadratic discriminant analysis and how it compares to its linear version
can be found in Williams (2009).
Applying the R function ’qda’ to the Admissions Dataset, we find MSE = 0.1968.
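Both classifiers are provided by the 'MASS' package; a minimal sketch of the calls follows, with train and test as placeholder data frames and assuming the class labels are 0 and 1.

library(MASS)

fit_lda <- lda(Y ~ ., data = train)
fit_qda <- qda(Y ~ ., data = train)

# predict() returns the predicted class and the class posterior probabilities;
# the posterior probability of class 1 plays the role of the probability p of
# accepting the offer, as in Section 2.5
p_lda <- predict(fit_lda, newdata = test)$posterior[, "1"]
p_qda <- predict(fit_qda, newdata = test)$posterior[, "1"]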
5.4 Multilayer Perceptron
As a last prediction technique, let us look at an artificial neural network method called the
Multilayer Perceptron (MLP). The MLP is a collection of neurons connected in an acyclic
manner. Each neuron is a simple decision unit. At one end there are the input nodes, at
the other end the output nodes and in the middle the hidden nodes. The structure of the
MLP is fixed, but the weights between neurons are adjustable and are optimized during
the training process. Training is done by back-propagation and the process of prediction is
carried out using forward-propagation. The MLP is a highly sophisticated method and to
get a more in-depth understanding of its theoretical background, consult Mitchell (1997,
Chapter 4). For our purposes, we will just make use of the R package ’nnet’ to get our
results.
The training of this predictor starts by randomly selecting a starting vector, if no such
vector is specified. The outputs differ significantly depending on this starting vector, so to
eliminate this unnecessary source of randomness, we will fix it to be the vector of zeros, which
tends to be the general rule of thumb. This will allow us to compare more easily the results
given by different tunings of the model.
One of the nice things about the MLP is that one does not have to consider predictor
variable selection. All the predictor variables can be included in the model, whether they are
useful or not, because during the training process the network learns which variables are
important and increases the weights on them, while the less useful ones are weighted down.
To achieve good performance, some parameters need to be correctly tuned. The three
main parameters that we will tune are (there are others, but we will just leave them in
their default setting):
1. the number of hidden nodes N,
2. the maximum number of iterations imax,
3. the decay parameter λ.
λ        MSE (×10⁻²)
1        22.93
0.1      17.44
0.01     22.49
0.001    24.61
0.0001   32.30

Table 5: Performance measures of the MLP for different values of λ, for fixed N = 10
and imax = 200.
Figure 10: (a) MSEs given by the MLP for different values of N, for fixed λ = 0.1 and
imax = 200. (b) MSEs given by the MLP for different values of imax, for fixed λ = 0.1 and
N = 5. (c) MSEs given by the MLP for different values of λ, for fixed imax = 100 and N = 5.
We start with a model with parameters N = 10, imax = 200 and λ = 1. We do the
tuning by optimizing one parameter at a time, keeping the other two fixed. Note that this
tuning will not result in the optimal tuning (only a suboptimal one), because we do not
try all possible combinations of these three parameters (as there are infinitely many!).
Let us start by getting the magnitude of λ right. The results of this optimization are
shown in Table 5. This table reveals that choosing λ = 0.1 results in extremely good
performance, the best so far.
Let us see whether we can get a lower MSE by tuning the other parameters. Figure
10a. plots the results obtained for different numbers of hidden nodes N, fixing imax = 200
and λ = 0.1. From this figure, we deduce that taking N = 5 is the most favourable choice.
Now let us go on to finding a good value for imax. Figure 10b. illustrates the results
obtained for different values of imax, fixing N = 5 and λ = 0.1. This figure suggests
imax = 100.
Finally, let us revisit the tuning of λ. We saw previously that a good value of λ is of
order 0.1, so let us do a more detailed search for values of λ of this magnitude. The results
of this assessment are given in Figure 10c., which recommends the selection of λ = 0.098.
Hence, to conclude, we can say that when the parameters are tuned to N = 5, imax = 100 and
λ = 0.098, the MLP produces a remarkably good performance measure of MSE = 0.1620.
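A sketch of the final tuned network, fitted with 'nnet', is given below. The data frame name, the 0/1 coding of Y and the explicit zero starting vector are assumptions consistent with the description above; the weight count is that of a network with one logistic output unit and no skip-layer connections.

library(nnet)
# Tuned MLP: N = 5 hidden nodes, imax = 100 iterations, decay lambda = 0.098.
X     <- model.matrix(~ FS + G + GY + LLD + AP + C, data = admissions)[, -1]
n_wts <- (ncol(X) + 1) * 5 + (5 + 1) * 1    # number of weights of a p-5-1 network
fit <- nnet(x = X, y = admissions$Y,
            size = 5, decay = 0.098, maxit = 100,
            Wts = rep(0, n_wts),            # fixed zero starting vector
            trace = FALSE)                  # logistic output, so predictions lie in [0, 1]
p_hat <- predict(fit, X)                    # predicted turn-up probabilities

Wrapping this fit inside the jackknife loop of Section 4.3 should reproduce, at least approximately, the MSE quoted above.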
Predictor        MSE (×10⁻²)
Null             22.93
CART             19.84
Bagging          18.87
Random Forest    18.41
GLM              18.31
LDA              19.46
QDA              19.68
MLP              16.20

Table 6: Summary of the results given by the different predictors.
5.5 Comparing the Predictors
In Sections 4. and 5., we applied various predictors to the Admissions Dataset and found
interesting results. A summary of these results can be found in Table 6.
To begin with, it should be noted that the Admissions Dataset is relatively small, as it
has only 101 observations and 6 predictor variables. This might have a substantial influence
on how the different models perform on it. All our models do much better than the null
model, which is a reassuring sign. What is more, we notice that 'simpler' models like LDA,
QDA or CART tend not to give the best performance. This might mean that the Admissions
Dataset has some hidden patterns that these simple models do not pick up on very well.
We can see that the theoretical results given in Section
3. are confirmed, as bagging performs better than CART, and random forests perform
better than bagging. The GLM gives surprisingly good results, slightly outperforming the
random forest. The reason for this might lie in the nature of the dataset, which is a
two-class problem with a binary outcome variable. It could be that the transformation done
by the GLM’s logit link function is very appropriate for this case. Finally, we note that
the MLP gives outstanding results. This model, once correctly tuned, can train itself to a
level where it can pick up patterns that its other competitors do not.
If we were given a larger Admissions Dataset, with more predictor variables or more
observations, these results would most likely change. We would expect the MSEs of these
predictors to decrease, as more information usually allows more accurate predictions to be
made. We might also find the spread of these MSEs to increase. In my opinion, the MLP
would still retain its superiority; however, the random forest could overtake the GLM, as
the GLM might lose the advantage that it gained from the relatively small size of the
Admissions Dataset.
6 Conclusion
Random forests are conceptually fairly simple, yet can be very good prediction tools.
The random forest significantly outperforms its ancestors, the single decision tree and the
bagging predictor. The general belief was that these forests could not compete with
sophisticated methods like boosting or arcing algorithms, but the findings in Breiman (2001)
give substantial evidence to the contrary: random forests perform at least as well as, and
occasionally better than, these predictors.
In Section 4. we applied random forests to the Admissions Dataset and found that they
gave quite good results; however, they were largely outperformed by the multilayer perceptron
in Section 5. The reason for this might be that the nature of the dataset was not entirely
adequate to allow the random forest to perform at its best. I believe that if the number of
observations and predictor variables (especially the important ones) were increased, then the
random forest would produce very accurate predictions. However, with the current Admissions
Dataset, one would prefer employing the multilayer perceptron for the given prediction task,
as it picks up patterns in the data that the random forest misses.
The random forest is a tool that can be used for a very wide range of prediction
problems. These days it is widely used in the field of bioinformatics (Qi, 2012), but it
can also be found in real-time pose recognition systems (Shotton et al., 2013) or in facial
recognition devices (Fanelli et al., 2013). The random forest is a new and exciting method
that inspires many statisticians and computer scientists. An increasing number of articles and
books will be written about it, and so random forests will eventually become a fundamental
part of the machine learning literature.
As a final word, it should be noted that even though random forests (and other prediction
methods) can be very good for analysing various datasets, they do not provide a replacement
for human intelligence and knowledge of the data. The output of these predictors should not
be taken as the absolute truth, but rather as an intelligent computer-generated guess that may
be useful in leading to a deeper understanding of the problem at hand.
References
Amit, Y. and D. Geman (1997). Shape quantization and recognition with randomized
trees. Neural Computation 9(7), 1545–1588.
Bellotti, T. (2013). Credit scoring 1. Lecture Notes.
Breiman, L. (1996). Bagging predictors. Machine Learning 24(2), 123–140.
Breiman, L. (2001). Random forests. Machine Learning 45, 5–32.
Breiman, L. and A. Cutler (2014). Random forests. http://www.stat.berkeley.edu/
~breiman/RandomForests/cc_home.htm. Retrieved 9 June 2014.
Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone (1984). Classification and
Regression Trees. Belmont, CA: Wadsworth International Group.
Chan, K.-Y. and W.-Y. Loh (2004). An algorithm for building accurate and comprehensible
logistic regression trees. Journal of Computational and Graphical Statistics 13(4), 826–
852.
Efron, B. and R. Tibshirani (1993). An Introduction to the Bootstrap. Chapman &
Hall/CRC.
Fanelli, G., M. Dantone, J. Gall, A. Fossati, and L. J. V. Gool (2013). Random forests for
real time 3d face analysis. International Journal of Computer Vision 101(3), 437–458.
Hastie, T., R. Tibshirani, and J. Friedman (2001). The Elements of Statistical Learning.
Springer.
Mitchell, T. (1997). Machine Learning. McGraw Hill.
Orbanz, P. (2014). Bagging and random forests. http://stat.columbia.edu/~porbanz/
teaching/W4240/slides_lecture10.pdf. Retrieved 8 June 2014.
Qi, Y. (2012). Ensemble Machine Learning. Springer.
Shotton, J., T. Sharp, A. Kipman, A. W. Fitzgibbon, M. Finocchio, A. Blake, M. Cook,
and R. Moore (2013). Real-time human pose recognition in parts from single depth
images. Commun. ACM 56(1), 116–124.
Williams, C. (2009). Classification using linear discriminant analysis and quadratic dis-
criminant analysis. http://www.cs.colostate.edu/~anderson/cs545/assignments/
solutionsGoodExamples/assignment3Williams.pdf. Retrieved 8 June 2014.
More Related Content

What's hot

A NEW PERSPECTIVE OF PARAMODULATION COMPLEXITY BY SOLVING 100 SLIDING BLOCK P...
A NEW PERSPECTIVE OF PARAMODULATION COMPLEXITY BY SOLVING 100 SLIDING BLOCK P...A NEW PERSPECTIVE OF PARAMODULATION COMPLEXITY BY SOLVING 100 SLIDING BLOCK P...
A NEW PERSPECTIVE OF PARAMODULATION COMPLEXITY BY SOLVING 100 SLIDING BLOCK P...
ijaia
 
CART – Classification & Regression Trees
CART – Classification & Regression TreesCART – Classification & Regression Trees
CART – Classification & Regression Trees
Hemant Chetwani
 
Spss tutorial-cluster-analysis
Spss tutorial-cluster-analysisSpss tutorial-cluster-analysis
Spss tutorial-cluster-analysis
Animesh Kumar
 
Malhotra20
Malhotra20Malhotra20
Decision Trees for Classification: A Machine Learning Algorithm
Decision Trees for Classification: A Machine Learning AlgorithmDecision Trees for Classification: A Machine Learning Algorithm
Decision Trees for Classification: A Machine Learning Algorithm
Palin analytics
 
Classification and regression trees (cart)
Classification and regression trees (cart)Classification and regression trees (cart)
Classification and regression trees (cart)
Learnbay Datascience
 
Comparision of methods for combination of multiple classifiers that predict b...
Comparision of methods for combination of multiple classifiers that predict b...Comparision of methods for combination of multiple classifiers that predict b...
Comparision of methods for combination of multiple classifiers that predict b...
IJERA Editor
 
FUZZY ROUGH INFORMATION MEASURES AND THEIR APPLICATIONS
FUZZY ROUGH INFORMATION MEASURES AND THEIR APPLICATIONSFUZZY ROUGH INFORMATION MEASURES AND THEIR APPLICATIONS
FUZZY ROUGH INFORMATION MEASURES AND THEIR APPLICATIONS
ijcsity
 

What's hot (8)

A NEW PERSPECTIVE OF PARAMODULATION COMPLEXITY BY SOLVING 100 SLIDING BLOCK P...
A NEW PERSPECTIVE OF PARAMODULATION COMPLEXITY BY SOLVING 100 SLIDING BLOCK P...A NEW PERSPECTIVE OF PARAMODULATION COMPLEXITY BY SOLVING 100 SLIDING BLOCK P...
A NEW PERSPECTIVE OF PARAMODULATION COMPLEXITY BY SOLVING 100 SLIDING BLOCK P...
 
CART – Classification & Regression Trees
CART – Classification & Regression TreesCART – Classification & Regression Trees
CART – Classification & Regression Trees
 
Spss tutorial-cluster-analysis
Spss tutorial-cluster-analysisSpss tutorial-cluster-analysis
Spss tutorial-cluster-analysis
 
Malhotra20
Malhotra20Malhotra20
Malhotra20
 
Decision Trees for Classification: A Machine Learning Algorithm
Decision Trees for Classification: A Machine Learning AlgorithmDecision Trees for Classification: A Machine Learning Algorithm
Decision Trees for Classification: A Machine Learning Algorithm
 
Classification and regression trees (cart)
Classification and regression trees (cart)Classification and regression trees (cart)
Classification and regression trees (cart)
 
Comparision of methods for combination of multiple classifiers that predict b...
Comparision of methods for combination of multiple classifiers that predict b...Comparision of methods for combination of multiple classifiers that predict b...
Comparision of methods for combination of multiple classifiers that predict b...
 
FUZZY ROUGH INFORMATION MEASURES AND THEIR APPLICATIONS
FUZZY ROUGH INFORMATION MEASURES AND THEIR APPLICATIONSFUZZY ROUGH INFORMATION MEASURES AND THEIR APPLICATIONS
FUZZY ROUGH INFORMATION MEASURES AND THEIR APPLICATIONS
 

Similar to M3R.FINAL

Supervised learning (2)
Supervised learning (2)Supervised learning (2)
Supervised learning (2)
AlexAman1
 
Thesis. A comparison between some generative and discriminative classifiers.
Thesis. A comparison between some generative and discriminative classifiers.Thesis. A comparison between some generative and discriminative classifiers.
Thesis. A comparison between some generative and discriminative classifiers.
Pedro Ernesto Alonso
 
Undergraduated Thesis
Undergraduated ThesisUndergraduated Thesis
Undergraduated ThesisVictor Li
 
Solutions to Statistical infeence by George Casella
Solutions to Statistical infeence by George CasellaSolutions to Statistical infeence by George Casella
Solutions to Statistical infeence by George Casella
JamesR0510
 
Compiled Report
Compiled ReportCompiled Report
Compiled ReportSam McStay
 
Seminar- Robust Regression Methods
Seminar- Robust Regression MethodsSeminar- Robust Regression Methods
Seminar- Robust Regression MethodsSumon Sdb
 
Pankaj_thesis.pdf
Pankaj_thesis.pdfPankaj_thesis.pdf
Pankaj_thesis.pdf
krishnaprakashyadav
 
New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)
New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)
New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)Denis Zuev
 
An Introduction To Mathematical Modelling
An Introduction To Mathematical ModellingAn Introduction To Mathematical Modelling
An Introduction To Mathematical Modelling
Joe Osborn
 
Missing Data Problems in Machine Learning
Missing Data Problems in Machine LearningMissing Data Problems in Machine Learning
Missing Data Problems in Machine Learningbutest
 
An intro to applied multi stat with r by everitt et al
An intro to applied multi stat with r by everitt et alAn intro to applied multi stat with r by everitt et al
An intro to applied multi stat with r by everitt et al
Razzaqe
 
An introductiontoappliedmultivariateanalysiswithr everit
An introductiontoappliedmultivariateanalysiswithr everitAn introductiontoappliedmultivariateanalysiswithr everit
An introductiontoappliedmultivariateanalysiswithr everitFredy Gomez Gutierrez
 
My own Machine Learning project - Breast Cancer Prediction
My own Machine Learning project - Breast Cancer PredictionMy own Machine Learning project - Breast Cancer Prediction
My own Machine Learning project - Breast Cancer Prediction
Gabriele Mineo
 
Nguyễn Nho Vĩnh
Nguyễn Nho VĩnhNguyễn Nho Vĩnh
Nguyễn Nho Vĩnh
Nguyễn Nho Vĩnh
 
Senior_Thesis_Evan_Oman
Senior_Thesis_Evan_OmanSenior_Thesis_Evan_Oman
Senior_Thesis_Evan_OmanEvan Oman
 

Similar to M3R.FINAL (20)

Supervised learning (2)
Supervised learning (2)Supervised learning (2)
Supervised learning (2)
 
forest-cover-type
forest-cover-typeforest-cover-type
forest-cover-type
 
Thesis. A comparison between some generative and discriminative classifiers.
Thesis. A comparison between some generative and discriminative classifiers.Thesis. A comparison between some generative and discriminative classifiers.
Thesis. A comparison between some generative and discriminative classifiers.
 
Undergraduated Thesis
Undergraduated ThesisUndergraduated Thesis
Undergraduated Thesis
 
Solutions to Statistical infeence by George Casella
Solutions to Statistical infeence by George CasellaSolutions to Statistical infeence by George Casella
Solutions to Statistical infeence by George Casella
 
Compiled Report
Compiled ReportCompiled Report
Compiled Report
 
Seminar- Robust Regression Methods
Seminar- Robust Regression MethodsSeminar- Robust Regression Methods
Seminar- Robust Regression Methods
 
Pankaj_thesis.pdf
Pankaj_thesis.pdfPankaj_thesis.pdf
Pankaj_thesis.pdf
 
New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)
New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)
New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)
 
An Introduction To Mathematical Modelling
An Introduction To Mathematical ModellingAn Introduction To Mathematical Modelling
An Introduction To Mathematical Modelling
 
Missing Data Problems in Machine Learning
Missing Data Problems in Machine LearningMissing Data Problems in Machine Learning
Missing Data Problems in Machine Learning
 
Thesis 2015
Thesis 2015Thesis 2015
Thesis 2015
 
final_report_template
final_report_templatefinal_report_template
final_report_template
 
An intro to applied multi stat with r by everitt et al
An intro to applied multi stat with r by everitt et alAn intro to applied multi stat with r by everitt et al
An intro to applied multi stat with r by everitt et al
 
An introductiontoappliedmultivariateanalysiswithr everit
An introductiontoappliedmultivariateanalysiswithr everitAn introductiontoappliedmultivariateanalysiswithr everit
An introductiontoappliedmultivariateanalysiswithr everit
 
Longintro
LongintroLongintro
Longintro
 
main
mainmain
main
 
My own Machine Learning project - Breast Cancer Prediction
My own Machine Learning project - Breast Cancer PredictionMy own Machine Learning project - Breast Cancer Prediction
My own Machine Learning project - Breast Cancer Prediction
 
Nguyễn Nho Vĩnh
Nguyễn Nho VĩnhNguyễn Nho Vĩnh
Nguyễn Nho Vĩnh
 
Senior_Thesis_Evan_Oman
Senior_Thesis_Evan_OmanSenior_Thesis_Evan_Oman
Senior_Thesis_Evan_Oman
 

M3R.FINAL

  • 1. Deep Into The Random Forest Peter Schindler M3R Project June 10, 2014
  • 2.
  • 3. Abstract A random forest is an ensemble of decision trees that combine their results to form an enhanced prediction tool. The classification and regression tree methodology is explained. The bagging predictor is introduced. Adding an additional layer of ran- domness to bagging, one can create a random forest. The random forest is applied to a real-life dataset and results of performances are computed. These are then compared to the performance measures given by various other classification tools like the general linear model or artificial neural networks. Keywords. Classification and Regression Trees, Bagging, Random Forest Acknowledgements In this short paragraph I would like to express my gratitude to all the people that made a di↵erence in my undergraduate studies at Imperial. Firstly, I would like to thank Dr Axel Gandy for being my project supervisor. In my short but instructive encounters with him, he gave me the guidance every young mathematician dreams of. If I were to come to Imperial just for the experience of learning from a professor like Mr Gandy, it was worth it. I would also like to thank Ricardo Monti, a good friend who always had time for me whenever I needed help with a problem. I am also grateful to Dr Tony Bellotti, who taught me two very interesting and useful credit scoring courses that I enjoyed a lot. On a more personal note, I would like to thank my two grandmothers Kati and Saci, who were great help for me during the revision periods leading up to exams. Without them, this preparation would have been very di↵erent and much harder. Last but not least, I would like to say a big thank you to my mother Marianna. She has always been there for me throughout my life, during the highs and lows, and has permanently provided me with the love and support a son could ever wish for. A big part of who and where I am today in life is due to her hard work. Thank you. Details Name: Peter Schindler CID Number: 00694136 Name of Supervisor: Dr Axel Gandy Email Address: peter.schindler11@imperial.ac.uk Home Address: 7 Rue Saint Honore, Versailles, 78000, France Plagiarism Statement This is my own unaided work unless stated otherwise. 3
  • 4.
  • 5. Contents 1 Introduction 1 2 Decision Trees 2 2.1 Building a Tree Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2.2 The Impurity Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.3 The Gini Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.4 Tree Structure and Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.5 Classification and Regression Trees . . . . . . . . . . . . . . . . . . . . . . . 9 3 Random Forests 10 3.1 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3.2 Bootstrapping and Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.3 The Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 4 Applying the Trees to the Admissions Dataset 16 4.1 Describing the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 4.2 Variable Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 4.3 Measure of Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 5 Other Prediction Methods 23 5.1 Generalized Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 5.2 Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 24 5.3 Quadratic Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . 25 5.4 Multilayer Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 5.5 Comparing the Predictors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 6 Conclusion 27 5
  • 6.
  • 7. 1 Introduction A random forest is a machine learning algorithm mostly used for prediction problems. It is made up by an ensemble of many individual tree predictors. A tree predictor is a simple sequence of decision, where the next decision depends on the current one. This is what originates the name, as these sequences of decisions can be easily represented by a tree. Tree predictors are also commonly referred to as decision trees. When applying the random forests algorithm, many decision trees are generated and the predictions of these individual trees are aggregated to give the overall prediction made by the random forest. A random forest is conceptually quite simple, but turns out to be an extremely powerful prediction tool. Decision trees represent the pillars of a random forest. They were presented formally for the first time in Breiman et al. (1984). This gave a foundation of the theory behind these trees and showed how they can be applied in various classification and regression prob- lems. Building on these tree methods, the notion of bagging was introduced in Breiman (1996). Bagging was also a result of inspiration taken from the bootstrapping methodology presented in Efron and Tibshirani (1993). Adding an additional layer of randomness (dis- cussed later) to bagging, the random forest was created. Random forests were introduced in Breiman (2001). They are the outcome of a combination of ideas taken from Amit and Geman (1997) and Breiman (1996). The paper Breiman (2001) has had an outstanding amount of success since its publication, as over 12,000 research papers cite it as a reference. In Section 2. we will introduce the classification and regression tree methodology. A simple example, followed by a general algorithm for building a tree will be given. The usage of impurity measures and tree pruning will be explained in detail. Two-class problems will be of particular interest to us in Sections 4. and 5., and so all theoretical explanations will be done for the two-class case, however all methods can be generalized with ease to the n-class case. Having established the foundation of the decision tree methodology, random forests will be explored in depth in Section 3. We give an explanation of how bootstrap samples allow us to construct a bagging predictor. We show why bagging is superior in performance to a single tree predictor. Then we go deep into the random forest and give an explanation why the random forests outperforms his ancestor. The out-of-bag error estimate will be presented and we will show how it is used to obtain a measure called the variable importance. In Section 4, we apply the single decision tree, the bagging predictor, as well as the random forest to the dataset of our interest called the Admissions Dataset. Performance of each predictor will be assessed and compared. The purpose of Section 5. will be to put the performance of the random forest into perspective by comparing it to performances given by other predictors. We experiment with other methods like the general linear model, linear and quadratic discriminant analysis and the multilayer perceptron. Concluding remarks and modern-day applications of random forests are given in Section 6. 1
  • 8. 2 Decision Trees Applying a decision tree to a dataset, basically means splitting the predictor variable space into a set of partitions and fitting a simple model to each one. These simple models could be various things, but typically they are constants. The advantage of this tree-based method is that it is conceptually simple, easy to interpret and understand, yet at the same time it can be a very accurate predictor. An additional good thing about this method is that it can handle all types of data: discrete, continuous, binary, categorical and missing data. In this section, we will only look at a the most common tree-based method called Classification And Regression Tree (CART), but multiple similar methods such as ID3 or C4.5 exist. More details about these alternative methods can be found in Mitchell (1997, Chapter 3). This section is mainly based on Breiman et al. (1984) and Hastie et al. (2001, Chapter 9.2). 2.1 Building a Tree Classifier Let us look at a simple two-class problem, where the response Y can take values in {0, 1} and there are two predictor variables X1 and X2, each taking a value on an interval of unit length. Using a specific split selection criterion (discussed later) that gives us the optimal split, we choose a variable Xi and a split point s 2 [0; 1] which we use to split the space into two regions. We then repeat this binary splitting process until some stopping criteria is applied. For example, applying the following four-step partitioning process: 1. splitting X1 at t1, followed by 2. splitting the region X1 < t1 at X2 = t2, followed by 3. splitting the region X1 t1 at X1 = t3, followed by 4. splitting the region X1 > t3 at X2 = t4 we obtain five regions R1, R2, R3, R4 and R5 that can be plotted as shown in Figure 1a. (Hastie et al., 2001, Page 268). The classification model associated to this partitioning that predicts the class of an observation (Xi1, Xi2) is: class(Xi1, Xi2) = 5X m=1 cm1((Xi1, Xi2) 2 Rm) where cm is the predicted class of all observations that end up in the region Rm and 1 is the indicator function. This gives rise to the binary decision tree shown in Figure 1b (Hastie et al., 2001, Page 268). As we can see, this tree is easily interpretable and is a clear guide on how to classify various observation with predictor variables Xi1 and Xi2. Now we look at an algorithm that describes how to build a decision tree that can cope with training data that contains more than two predictor variables. This algorithm was inspired by the one in the Bellotti (2013) Credit Scoring 1 course. We define the recursive algorithm called GrowTree in the following way: 2
  • 9. (a) A particular partition of the unit square using binary splits. (b) Tree representation of the partition- ing in Figure 1a. Figure 1 Algorithm 1. Taking as input a given node t, a dataset D and a set of predictor variables SX, we call the function GrowTree(t, D, SX): • IF the stopping criteria is not met, then do: (1) Let S be the set of all possible binary splits across all variables X in SX. Select ˆs = argmaxs2S I(s, t), where I(s, t) represents the decrease of an im- purity measure I when applying the split s to the node t. (2) Let ˆX be the variable which ˆs splits. Two sub-datasets are obtained, one having the characteristic ˆX = ˆX1 and the other one having ˆX = ˆX2. (3) Assign ˆX to node t. (4) For i = 1 ! 2, do: (i) Create a new ith branch from t linked to a new node t? i , labelled ˆXi. (ii) Call GrowTree(t? i , D[ ˆX = ˆXi], SX). • ELSE report the partition as an output Note that the initial input for this algorithm is the root node t0 containing the entire initial dataset D0 with the set of all predictor variables S0 X. Hence the entire construction of the decision tree boils down to three things: 1. Selecting the impurity measure I, 2. Choosing a stopping criteria, which decides when to declare a node terminal or to keep on splitting it, 3. Having a rule that assigns a prediction (a class label or a numerical value) to every terminal node. 3
  • 10. 2.2 The Impurity Measure Definition 1. An impurity function is a function defined on a set of n-tuple of numbers (p1, ..., pn) satisfying pi 0 and P i pi = 1 for i = 1 ! n, having properties: i. achieves its maximum only at the point ( 1 n , 1 n , ..., 1 n ) ii. achieves its minimum only at the points (1,0,...,0), (0,1,0,...,0), ..., (0,...,0,1) iii. is a symmetric function of p1, ..., pn Definition 2. For a node t, let p(i|t) denote the proportion of class i observations present in node t. So given an impurity function , we define the impurity measure i(t) of a node as: i(t) = (p(1|t), p(2|t), ..., p(n|t)) Finally, let us suppose that all the splitting has been done and a final structure of the tree is obtained. We denote the set of terminal nodes of this tree by ˜T. Definition 3. For a node t, we set I(t) = i(t)p(t), where i(t) is as defined in Definition 2. and p(t) is the probability that any given observation falls into node t. We define the impurity of a tree T by: I(T) = X t2 ˜T I(t) Suppose a split s splits node t into nodes t1 and t2, creating a new tree T0. The impurity of this new tree will now become: I(T0 ) = I(T) I(t) + I(t1) + I(t2) Hence the decrease in the tree impurity is: I(s, t) = I(T) I(T0 ) = I(t) I(t1) I(t2) which only depends on the node t and the split s. When building a good decision tree, our aim is to minimize the tree impurity (whilst avoiding over-fitting!). So when we are deciding which split to choose from a set of possible splits for a node t, we choose the split ˆs that maximizes I(s, t). Now let I(ti) = i(ti)p(t)pi for i = 1 ! 2, where pi denotes the probability that an observation falls into node ti given that it is in node t, and so p1 + p2 = 1. Then we can write: I(s, t) = i(t)p(t) i(t1)p(t)p1 i(t2)p(t)p2 = p(t)(i(t) i(t1)p1 i(t2)p2) = p(t) i(s, t) So as I(s, t) only di↵ers from i(s, t) by a constant factor p(t), we conclude that the same ˆs maximizes both these expressions. In the next part we will see how we can find this optimal split. 4
  • 11. Figure 2: Plot of the Gini index in the case of a two-class problem. 2.3 The Gini Index There are various impurity measures one could choose for I, like the Misclassification Error or the Cross-Entropy, but we will make use of the most common measure called the Gini index. Further details about the other measures can be found in Hastie et al. (2001, Chapter 9.2). For a two-class problem, where the p(1|t) is as defined in Definition 2., the Gini index is given by: i(t) = 2p(1|t)(1 p(1|t)) Figure 2. shows us what the Gini index looks like by plotting p(1|t) on the x-axis against i(t) on the y-axis. This plot illustrates that the Gini index satisfies the properties of an impurity function given in Definition 1., as it is maximized at p(1|t) = p(2|t) = 0.5, mini- mized at (p(1|t), p(2|t)) = (0, 1) and (p(1|t), p(2|t)) = (1, 0) and p(1|t) and p(2|t) are clearly symmetric. Now let us give an example of how the Gini index is used to find the optimal split for a node. Let us consider a dataset set called the Admission Dataset which contains information about applicants and their application outcome for a particular Msc course in 2012-2013. We have a continuous predictor variable called ”Days to Deadline”(D2D) that represents the number of days before the deadline an applicant applied to the course. The response variable is a binary variable Y taking the value of 1 if the applicant accepted the o↵er and turned up for the course, and a value of 0 if he did not. Figure 3. is a histogram plot that represents D2D categorized by month, but it still gives us a good visualization of how this variable behaves. Note that the deadline is the 15th of July 2013. The height of each box indicates the number of students that where given an o↵er in the specified month. In green, we have the number of students that accepted the o↵er, and in red the number of students that did not. Looking at the propor- tions of green and red colors more closely, we might notice a pattern. As we get closer to the deadline, the proportion of students accepting the o↵er tends to increase. Especially 5
  • 12. Figure 3: Histogram plot representing the D2D predictor variable. from March onwards, about 50-70% of the students accept the o↵er, whereas before March, this acceptance rate is more around 20%. Obviously, this analysis is very superficial, so let us see how the Gini index is used to find the optimal split point that maximizes discrimi- nation. We make use of the following algorithm: Algorithm 2. 1. Let the values taken by the continuous variable D2D be d1,d2, ...,d101 and let their corresponding binary outcomes be y1,y2,...,y101 with yj 2 {0, 1}. 2. Order D2D in increasing order d1, d2, ..., d101 with corresponding responses y1, y2, ..., y101. In other words we now have d1  d2  ...  d101. Let t denote the root node contain- ing all these observations. 3. For j = 1 ! 101: i. Take dj to be the split point. dj splits t into node t1 containing all observations with D2D < dj, and node t2 containing all other observations. ii. Compute p(1|t1) and p(1|t2) iii. Using the formula of Gini index, compute i(t), i(t1) and i(t1). iv. Let |ti| denote the number of observations that are in the node ti. Then p1 = |t1| |t| and p2 = |t2| |t| . v. Compute i(dj, t) = i(t) i(t1)p1 i(t2)p2. 4. Choose the split point ˆdj that maximizes i(dj, t). I.e. i( ˆdj, t) i(dj, t) for all dj with j = 1 ! 101. 6
  • 13. Figure 4: Plot illustrating the di↵erence of Gini indexes over the range of possible splits. Applying Algorithm 2. to this Admission Dataset example and plotting the split points dj on the x-axis against the corresponding i(dj, t) on the y-axis, we obtain the plot shown in Figure 4. Looking at this figure, we deduce that the split point that maximizes discrim- ination is ˆdj=138. This method could then be repeatedly applied to the two intervals obtained in order to carry out further splitting and achieve more discrimination if it is required by the problem at hand. In the above example, we saw how the Gini index can be used to find the optimal split for any continuous predictor variable. For a discrete predictor variable, the algorithm is very similar, but it needs slight modifications. Obviously, if the variable is not continuous, we will not be able to pick a split point dj. Instead, we will divide the values of the discrete variable into categories and choose the best division to be the split. Let us look at another example involving the Admissions Dataset. ”Fee Status” is a discrete variable describing the type of tuition fee d an applicant pays, with d 2 {Home, EU, Overseas}. As we are only interested in binary splits, we will consider all possible categorizations c into two categories of these three fee statuses: • Category1: {Home, EU} / Category2: {Overseas} • Category1: {Home, Overseas} / Category2: {EU} • Category1: {Overseas, EU} / Category2: {Home} Then in step 3 of Algorithm 2., we let t1 be the node containing all observations with d 2 Category1 and t2 the node that contains the observations with d 2 Category2. The remainder of step 3 is carried out in the same way. Step 3 is then repeated in a loop for all possible binary categorizations c (in this case 3 loops) and the categorization ˆc that satisfies i(ˆc, t) i(c, t) is chosen as the optimal split. 2.4 Tree Structure and Pruning Once we know how to obtain the optimal split for a node, the next question we have to consider is how to achieve the right tree size and structure. How many times shall we 7
  • 14. partition our dataset before stopping? Or, put di↵erently, how deep shall we grow our tree? It is obvious, that with very large trees we risk to achieve over-fitting, whereas too shallow trees might not pick up on all the patterns hidden in the data. An intuitive solution to this problem is to only split a node if the decrease in the value of the impurity is significant. Putting this mathematically, for a fixed threshold ✓ > 0, we declare a node t terminal if for the set of all possible splits S, we have maxs2S i(s, t) < ✓. However, this approach is quite short-sighted, as an apparently worthless split could lead to very worthy splits below it. The preferred method in the literature is called tree pruning. We start by growing a very large (but not necessarily full) tree T0, where splitting at a node stops either when the node observations are all in the same class or when the number of observations ⌘ in a node falls bellow a pre-specified threshold (generally we take ⌘ = 5). We then apply a method called cost-complexity pruning, which will reduce the size of the tree in order to eliminate the over-fitting present in the large tree. As before, ˜T denotes the set of terminal nodes of a tree T. We define the cost-complexity criterion for a tree T of size | ˜T| as follows: C↵(T) = I(T) + ↵| ˜T| where ↵ 0 is called the cost-complexity tuning parameter. The aim is to find an ↵ such that C↵(T) is as small as possible. ↵ is the parameter indicating the balance between how well a tree fits the data and the complexity of the tree. Taking the extreme case when ↵ = 0, the cost-complexity criterion tells us that the tree T0 is the best. In general, lower values of ↵ cause larger trees, whereas higher values of ↵ result in smaller trees. Now the question remains, how to find the value of ↵ that creates the optimal-sized tree? There are many di↵erent techniques for choosing ↵, but the most common and com- putationally e cient one is called the One Standard Error Rule. More details about the other methods can be found in Breiman et al. (1984). From the dataset at hand, we can estimate the optimal value for ↵ that satisfies ˆ↵ = argmin↵ C↵(T). However we will not choose this value of ˆ↵ for one simple reason. Figure 5. shows a typical curve of what plotting ↵ on the x-axis against C↵(T) on the y-axis would look like. Initially, there is a rapid decrease followed by a long, flat valley with only minor up-down variations, which in fact is just noise. ˆ↵ is situated at the deepest point in this valley. The dashed line in Figure 5. indicates the upper limit of the 1 standard error of ˆ↵ and so we see that all the points in this valley are well within the ±1 standard error range. As Figure 5. was generated for a particular training set, it might well be that for a di↵erent training set, the value of ˆ↵ will be di↵erent. Hence we conclude that the value of ˆ↵ is unstable within this valley. To provide a satisfying solution to this problem, the one standard error rule was created, which is to choose the greatest value of ↵ for pruning our tree which is within this ±1 standard error range, i.e. max↵ such that: C↵(T)  Cˆ↵(T) + SE(ˆ↵) Hence using Figure 5., we choose 0.037 to be the value of ↵ and the tree is pruned to size 4 (i.e. only four terminal nodes). 8
  • 15. Figure 5: Plot of the tuning parameter against the cost-complexity criterion. The one standard error rule chooses the simplest tree whose accuracy is comparable to the one when using ˆ↵ to prune. Note that this method gives only a suboptimal ↵, but it is easy to compute and it gives satisfactory results when applied to real-life data. 2.5 Classification and Regression Trees To complete this section, we need to specify a rule that assigns a prediction to each terminal node. A classification tree is a predictor that assigns a class label to each terminal node, whereas a regression tree is a predictor that assigns a numerical value to it. It turns out that both in classification and regression, the node assignment rule can is very simple. Let t be a terminal node and let y1, y2, ..., ym be the classes of all the observations that fall into that fall into t. Remember that p(j|t) denotes the proportion of class j observations that fall into node t. For a classification tree, we designate the class of the node t to be the class that is present in the largest proportion within that node. Putting it mathematically, jt is chosen to be the class of t if p(jt|t) = max j p(j|t) In Section 4., we will be interested in a specific two-class problem. We would like to predict whether an applicant i will attend the Msc course (i.e. response variable y = 1) or not (i.e. y = 0). Using the Admission Dataset as a training set, we give rise to a decision tree. Then we put a new applicant down this tree, it follows the decisions that are made at each node, until this applicant falls into a terminal node. Instead of predicting the class an applicant might belong to, it would be more valuable to predict the probability of an applicant belonging to a certain class. An appealing node assignment rule could be the following: assign the probability p to a terminal node t where p is such that: p = 1 m mX i=1 yi (1) 9
  • 16. where m is the number of observations within the terminal node t and Pm i=1 yi is simply a count of the number of observations that are of class 1. So if an applicant belonging to class 1 means that this applicant will attend the Msc course, then p represents the probability that any new applicant that falls into the terminal node t will attend the course. As p is not a class label, but is a numerical value (in this case a probability), a decision tree with such a node assignment rule would be called a regression tree. We will make use of such regression trees in Section 4. Finally it should be noted that many other types of decision trees exist. In this section we assigned a constant to each terminal node, however it is also possible to fit a simple model to each terminal node. Amongst others, a simple model of choice could be a logistic regression model. More details about logistic regression trees can be found in Chan and W.-Y.Loh (2004). 3 Random Forests Ensemble learning methods are techniques that generate many predictors and combine their outputs to form an enhanced aggregate predictor. The random forest is a typical example of an ensemble learning method: it combines results from multiple independent and identically distributed decision trees to create a very powerful prediction tool. 3.1 The Algorithm Let D0 be the original dataset containing n observations. Let S0 X denote the set of all predictor variables of D0 and let m be the size of S0 X. A random forest can be grown in the following way: Algorithm 3. 1. Fix the parameters R 2 N+ and mtry 2 N+ with mtry  m. 2. Draw R bootstrap samples with replacement from D0. 3. For each of these R samples, grow an unpruned decision tree. This ensemble of trees is our random forest. Grow each tree by making the following adjustment: at each node, instead of choosing the optimal split amongst all predictor variables, randomly sample a subset of mtry predictor variables from S0 X and pick the optimal split amongst the possible splits on this subset of variables. 4. Predict the outcome of a new observation by aggregating the predictions of all the trees. For classification, this aggregation means selecting the class that obtained the majority of votes; for regression it means taking the average of the outcomes. Reading this algorithm, we realize that the random forest has two ingredients in addi- tion to just a simple collecting of decision trees: i. the randomness created by the bootstrapping, and 10
  • 17. ii. the randomness introduced at the level of the nodes, when choosing an optimal split on a random subset of predictor variables rather than from the entire set. In what follows, we will look into the meaning of these two points in order to understand the random forest in greater depth. 3.2 Bootstrapping and Bagging Before the random forest was developed, a very similar technique called bagging existed. Bagging is basically the random forest without the additional layer of randomness given in point ii. above. Bagging was first introduced in Breiman (1996) and we will base our explanation of bagging on this paper. 3.2.1 What is a bootstrap sample? Suppose we have n independent and identically distributed observations x1, x2, ..., xn that come from an unknown distribution F. A bootstrap sample x⇤ 1, x⇤ 2, ..., x⇤ n is obtained by randomly sampling n times, with replacement, from the original set of observations x1, x2, ..., xn. In this bootstrap set-up, the bootstrap sample does not come from the dis- tribution F, but from an empirical distribution ˆFn, which is just x1, x2, ..., xn, each with an equal mass of 1 n . Note that the reason why we sample with replacement, is to make sure that every time we randomly sample an observation, it comes from the correct distribution ˆFn. Sampling without replacement would lead to not sampling from ˆFn. ˆFn is also called the bootstrap approximation of F and using Monte Carlo integration methods, one can show that as n ! 1, ˆFn converges in probability to F (Efron and Tibshirani, 1993). Generally, bootstrap samples are used to estimate confidence intervals for statistics of interest, however we will not go into detail on that. For more on the theory and applications of bootstrapping, consult Efron and Tibshirani (1993). 3.2.2 Bagging: Bootstrap aggregating Suppose we have a training dataset D coming from an unknown distribution F. D consists of observations {(xi, yi), i = 1 ! n}, where xi represents the characteristics of an obser- vation i and yi its response. Let tree(x, D) denote a single tree predictor, that was built using the training data set D, and predicts an output ˆy for an input x. Let {Dk} represent a sequence of datasets, each containing n observations and underlying the distribution F. One can define a new classifier treeA which uses an aggregate of the datasets {Dk} to make predictions. For a response representing a class label c 2 {0, 1}, the most common aggregating method is voting. Let nc(x) denote the number of trees that classified an input x to class c, i.e. nc(x) = P k 1(tree(x, Dk) = c). Then let the aggregate predictor classify x to the class that got the most votes: treeA(x, F) = treeA(x, {Dk}) = argmaxc nc(x) 11
  • 18. The only inconvenience with this approach is that it is too idealistic, as most of the time we do not have the luxury to generate a sequence of datasets {Dk} which all come from the same distribution F. This is where bootstrap samples can be used to create an imitation of this ideal set-up to get an accurate estimate of treeA(x, {Dk}). For r = 1 ! R, let {D(r)} be a set of R bootstrap samples obtained from D. Then in the previous formulae, replace {Dk} by the set {D(r)} in order to obtain a very good bootstrap approximation of treeA(x, F). Using bootstrap samples to train tree predictors, whose outputs are aggre- gated to give a prediction, is what we will refer to as the bagging predictor. (Note that in the literature, bagging has a wider meaning: the aggregated predictors do not need to necessarily be trees, they could be any type of predictor.) If the response is not a class label but a numerical value, then instead of taking the vote of the most popular class, it seems natural to let the bagging prediction be the average of all the predictions of the various tree components: treeA(x, {Dr}) = 1 R RX r=1 tree(x, D(r) ) 3.2.3 Why does bagging improve prediction accuracy? In this subsection we will show why the bagging predictor gives lower error rates than the single decision tree. We will only look at the regression case; how it is done for classification can be found in Breiman (1996). Using the set-up from the two previous subsections, the bagging predictor can also be written as: treeA(x, F) = ED(tree(x, D)) Now let Y, X denote observations coming from the distribution F, independent of D. We define the following two mean squared errors: 1. The average prediction error of a single tree: et = ED(EY,X(Y tree(X, D))2 ) 2. The error of the aggregated predictor: eA = EY,X(Y treeA(X, F))2 Then with help of the inequality E2(W) E(W2), the following computations can be 12
  • 19. carried out: et = ED(EY,X(Y tree(X, D))2 ) = ED(EY,X(Y 2 2Y tree(X, D) + tree2 (X, D))) = ED(EY,X(Y 2 ) 2EY,X(Y tree(X, D)) + EY,X(tree2 (X, D))) = ED(EY,X(Y 2 )) 2ED(EY,X(Y tree(X, D))) + ED(EY,X(tree2 (X, D))) = EY,X(ED(Y 2 )) 2EY,X(Y ED(tree(X, D))) + EY,X(ED(tree2 (X, D))) EY,X(Y 2 ) 2EY,X(Y ED(tree(X, D))) + EY,X(E2 D(tree(X, D))) = EY,X(Y 2 ) 2EY,X(Y treeA(X, F)) + EY,X(tree2 A(X, F)) = EY,X(Y 2 2Y treeA(X, F) + tree2 A(X, F)) = EY,X(Y treeA(X, F))2 = eA This is a nice result, however the bagging prediction is not treeA(X, F) but treeA(X, ˆFn), where ˆFn is the bootstrap approximation of F. If the number of observations is large enough to consider ˆFn representative of F, then treeA(X, ˆFn) becomes a su ciently good estimate of treeA(X, F) and hence the improved performance obtained by bagging will still be present. 3.3 The Random Forest As mentioned previously, the main di↵erence between bagging and random forests is the way they split the nodes of their trees. In bagging, each node is split using the optimal split amongst all variables, which is just the standard CART technique discussed in Section 2. In a random forest, each node is split taking the best split amongst a randomly chosen subset of all predictor variables. This seems to be a somewhat counterintuitive approach, however numerous empirical studies provide convincing evidence that random forests are more accurate than bagging, or a single decision tree for that matter. 3.3.1 Theorems In this subsection we are going to state some interesting theorems about random forests that can be helpful in explaining the reason why they produce enhanced performances. We keep the same notation used previously this section. The following theorems were taken from Breiman (2001). Theorem 1. As the number of trees goes to infinity, EX,Y (Y 1 R RX r=1 tree(X, D(r) ))2 ! EX,Y (Y ED(tree(X, D)))2 almost surely. 13
  • 20. The theorem can be proven using the Strong Law of Large Numbers (Breiman, 2001). This limiting result justifies why random forests do not over-fit as more trees are added to the forest. Hence we do not have to worry how big we grow our forest, as the error measures will not be a↵ected. An illustration of this will be given in Section 4.4.2. Now we look at a result that explains why the random forest performs better than the individual trees it employs. We define the following two error measures: 1. The generalization error of the forest: ef = EX,Y (Y ED(tree(X, D)))2 2. The average generalization error of a tree: et = ED(EX,Y (Y tree(X, D))2 ) Theorem 2. If for all D(r), EX,Y (Y ) = EX,Y (h(X, D(r))), then ef  ⇢ et where ⇢ is the is the weighted correlation between the two residuals Y tree(X, D(i)) and Y tree(X, D(j)), where D(i) and D(j) are independent. The proof of this theorem can be found in Breiman (2001). This is an interesting result, as it shows that the random forest can decrease the average error of the trees it employs by a factor of ⇢. If one can manage to reduce this correlation ⇢, than very accurate forests can be obtained. This correlation reduction can be achieved by the random predictor variable selection present at each node. The magnitude of this correlation reduction will depend on the number mtry of randomly selected variables at each node. So one might wonder what the best value is for mtry. There is no theoretical result that gives us an optimal value, however if m denotes the size of the set of all predictor variables, empirical results suggest that ⌅p m ⇧ is a good choice for mtry. However mtry still remains a parameter to tune, so one will have to try multiple values as part of the tuning process. Examples of how the di↵erent values of mtry a↵ect accuracy of the random forest can be found in Section 4.4.2. It should also be noted, that in the particular case when mtry = m, the random forest is simply the bagging predictor. One might expect the random forest to perform better than the bagging predictor, as the latter does not contain the layer of randomness introduced by random predictor variable selection to reduce the correlation ⇢. 3.3.2 Out-of-bag error estimates A nice thing about performance evaluation for random forests, is that it can also be done without using a separate test set or any cross-validation methods, because it is possible to obtain an internal error estimate: the out-of-bag (OOB) error estimate. 14
  • 21. Let R be the total number of trees in the forest. Each tree r is constructed using a di↵erent bootstrap sample D(r) coming from a training dataset D. We know that when sampling D(r), the observations are sampled randomly with replacement, hence any obser- vation i might appear a repeated number of times or not at all in any particular bootstrap sample D(r). It can easily be proved (Orbanz, 2014), that on average, about one third of the dataset D is not present within a particular D(r). This set of left-out observations is what we call the OOB data of the rth tree. We then put each observation of this OOB data down the rth tree and obtain a prediction for each one of them. Repeating this for all R trees means that for each observation i, we gather about R/3 predictions. For classification, let ˆyi be the class that obtained the majority vote of from these R/3 predictions and let yi be represent the true class of observation i. The OOB error estimate is defined as: eoob = Pn i=1 1(ˆyi 6= yi) n In the case of regression, ˆyi will simply be the average of the R/3 predictions and so the OOB is simply defined to be the following mean squared error: eoob = Pn i=1(ˆyi yi)2 n In Breiman (2001), it is shown that the OOB error estimates tend to overestimate the true error rates, and hence they cannot be relied on for error rate assessment. Instead, these OOB estimates can serve as guide in the tuning of random forests and can monitor correlation and generalized errors. In the next section we will see how OOB error estimates are used to estimate variable importance. 3.3.3 Variable importance Using the OOB error estimates, we can obtain a measure called the variable importance measure. As indicated by its name, it measures how useful each predictor variable is for the given prediction task. The way in which the variable importance calculated is explained by the following algorithm: Algorithm 4. Given a dataset D, grow a random forest and calculate eoob. Let X1, X2, ..., Xm be the set of all the predictor variables. For i = 1 ! m, do: 1. For variable Xi, randomly permute its values amongst each other to obtain a new variable Xp i . 2. Define a new dataset called Dp i , which is the same as our original dataset D, with the only exception that it contains the predictor variable Xp i instead of Xi. 3. Compute the OOB error for Dp i and call it e (i) oob. By executing this permutation, the ith variable loses all the information that it contained, and so with the loss of this piece of information, e (i) oob can be expected to be higher than eoob. 15
  • 22. 4. Compute the variable importance of Xi, which is defined as the percent increase of e (i) oob compared to eoob. More formally: importance(Xi) = 100 ⇥ e (i) oob eoob 1 ! % The higher the importance measure of a predictor variable, the more essential this vari- able is for the prediction task at hand. We will see an example of how variable importance can be used in Section 4. 3.3.4 Additional features of Random Forests Random forests have many other advantageous characteristics that we did not discuss. One of its important features is that it is a computationally very e cient predictor. For example, running on a dataset of 50,000 observations and 100 predictor variables, a random forest of 100 trees can be grown in 11 minutes on a 800Mhz machine (Breiman and Cutler, 2014). What is more, a random forest never does over-fitting and is optimal for handling missing data and outliers. It also o↵ers experimental methods for detecting variable interactions and it gives a way of calculating proximities between pairs of observations. To find out more about random forests, visit Breiman and Cutler (2014). 4 Applying the Trees to the Admissions Dataset 4.1 Describing the Data The Admissions Dataset is a dataset that contains information about applicants that received an o↵er for a particular Msc course held in 2012-2013. This dataset contains 101 applicants (i.e. observations). We have a binary response variable Y and a set of predictor variables S0 X. A summary of these variables is given in the following: 1. Response(Y ): for an applicant i with outcome yi, yi = ( 1 if applicant i accepted the o↵er and turned up for the course, 0 if applicant i rejected the o↵er. 2. Citizenship(C): the nationality of the applicant. Countries with few applicants and same geographical location have been grouped together. 3. Fee status(FS): either home, EU or overseas. 4. Application date(AD): self-explanatory. 5. Gender(G): male or female. 6. Location of last degree(LLD): a categorical variable indicating where the applicant obtained his last degree. This could be a specific university (eg. Imperial) or a region (eg. EU). 16
4 Applying the Trees to the Admissions Dataset

4.1 Describing the Data

The Admissions Dataset contains information about applicants who received an offer for a particular MSc course held in 2012-2013. It contains 101 applicants (i.e. observations). We have a binary response variable Y and a set of predictor variables S^0_X. A summary of these variables is given in the following:

1. Response (Y): for an applicant i with outcome y_i,

y_i = \begin{cases} 1 & \text{if applicant } i \text{ accepted the offer and turned up for the course,} \\ 0 & \text{if applicant } i \text{ rejected the offer.} \end{cases}

2. Citizenship (C): the nationality of the applicant. Countries with few applicants and the same geographical location have been grouped together.

3. Fee status (FS): either home, EU or overseas.

4. Application date (AD): self-explanatory.

5. Gender (G): male or female.

6. Location of last degree (LLD): a categorical variable indicating where the applicant obtained his or her last degree. This could be a specific university (e.g. Imperial) or a region (e.g. EU).

7. Year of last degree (YLD): self-explanatory.

Our aim in Sections 4 and 5 is to find out how accurate different models are in predicting the turn-up rates of future applicants who are made an offer. In other words, given an applicant i with characteristics x_i, we want to know how accurate a given model is in predicting the probability p_i of this applicant accepting the offer.

4.2 Variable Transformations

Considered in their raw form, not all predictor variables are ready to be used. C, FS, G and LLD are categorical variables that can be put directly into the model, but AD and YLD cannot be. These two are just sets of dates that R treats as categorical variables with a great many categories, one per date. Intuitively, as these variables contain a notion of time, we first want to make them continuous. AD can easily be transformed to create a new predictor variable called "Days To Deadline" (D2D), a continuous variable indicating how many days before the deadline the given applicant applied. A similar transformation can be applied to YLD: we create a new predictor variable called "Years Since Last Degree" (YLD*), which simply gives the number of years that have passed since the applicant's last degree. Examining the continuous variable YLD*, we notice that for most (around 80%) of the applicants this characteristic is 0, meaning that they took no gap in their studies. So it may be beneficial to transform this variable further into a new categorical variable called "Gap Year" (GY), where one category contains all the applicants who took at least one gap year and the other those who did not. We apply an analogous transformation to D2D. Using Algorithm 2 of Section 2, we split this continuous variable into two periods: early, for applicants who applied more than 138 days before the deadline, and late, for those who applied within 138 days of it. We call this variable "Application Period" (AP). Creating these two transformed variables seemed natural given the problem at hand, so we will use GY and AP as replacements for YLD and AD; a sketch of these transformations is given below.
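A minimal sketch of these transformations, assuming a data frame 'admissions' with hypothetical raw column names app_date and year_last_degree and an illustrative deadline date (none of which are taken from the actual dataset):

# Hypothetical raw columns and deadline; the real names and date differ.
deadline <- as.Date("2012-09-01")

# Days To Deadline: a continuous variable derived from the application date.
admissions$D2D <- as.numeric(deadline - as.Date(admissions$app_date))

# Gap Year: whether at least one year passed since the last degree.
admissions$GY <- factor(ifelse(2012 - admissions$year_last_degree >= 1,
                               "gap", "no_gap"))

# Application Period: split D2D at the 138-day threshold found with Algorithm 2.
admissions$AP <- factor(ifelse(admissions$D2D > 138, "early", "late"))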
4.3 Measure of Performance

4.3.1 Mean Squared Error

In order to get estimates of p_i, we will make use of trees with the terminal node assignment rule given in equation (1) of Section 2.5. It should be noted that even though the response variable Y represents a class label, it can also be considered a probability: 1 means the applicant definitely accepts the offer and 0 means definitely not. Consequently, in order to assess how accurate a predictor is for a given observation i, we simply consider how far the prediction ŷ_i is from the true response y_i, i.e. y_i − ŷ_i. Hence the common measure that we will employ to assess the performance of a predictor on a test set containing n observations is the Mean Squared Error (MSE), defined as:

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

4.3.2 The Jackknife

Testing a model on the same set that was used for training is not a good idea: it may well be that our model fits the training data very well, yet performs poorly when used for prediction on an unseen dataset. This would be a case of over-fitting. A sensible way to avoid this is to use a test set that is independent of the training set, commonly by means of cross-validation. K-fold cross-validation works as follows: the original dataset D_0 is randomly split into K (roughly) equal-sized sub-datasets {D_k}, and we then loop over k, keeping D_k out as the independent test set while the others are put back together to form the training set. In every loop, the model is trained on the training set and its performance is assessed on the test set. Each loop gives a performance measure, and these are averaged to form the overall cross-validation performance measure. Cross-validation is a very good method, as it allows us to obtain good estimates of the true performance measure; however, for higher values of K the algorithm can become computationally inefficient. In our case the number of observations (101) is relatively small, so it is a good strategy to use a high value of K: it will not cause any time-efficiency problems and it should produce more reliable performance measures. For our example we are therefore going to use a special version of cross-validation, taking K = |D_0|. This is called the Jackknife. The following algorithm describes in detail how the Jackknife is used to obtain the MSE performance measure; an illustrative implementation follows the algorithm.

Algorithm 5. Let D_0 be a set of observations {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}:

1. For i = 1, ..., n, do:

(a) Let the single observation (x_i, y_i) be the test set, and let the n − 1 other observations form the training set.

(b) Train the model using the training set.

(c) Use this model to predict the response of the test set, i.e. compute ŷ_i.

2. Step 1 gives us ŷ_i for i = 1, ..., n, so using the formula above we can compute \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.

This algorithm provides an accurate way of assessing the performance of a predictor, and it also allows us to compare the performances of various different predictors.
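A minimal sketch of Algorithm 5 for a generic model-fitting function is given below; fit_fun and the data frame name are placeholders, not the exact code behind the results of Section 4.4.

# Leave-one-out (jackknife) estimate of the MSE of a generic fitting function.
# fit_fun(train) must return a fitted model that predict() accepts.
jackknife_mse <- function(data, fit_fun) {
  n    <- nrow(data)
  yhat <- numeric(n)
  for (i in seq_len(n)) {
    fit     <- fit_fun(data[-i, ])                 # train on the other n - 1 observations
    yhat[i] <- predict(fit, newdata = data[i, ])   # predict the left-out observation
  }
  mean((data$Y - yhat)^2)
}

# Example use with a regression tree (assuming a data frame 'admissions' with response Y):
# library(rpart)
# jackknife_mse(admissions, function(d) rpart(Y ~ ., data = d, method = "anova"))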
4.4 Results

In this part we present the results obtained by experimenting with tree-based predictors, using the MSE as the measure of performance. Before doing anything else, however, it is important to compute the MSE of the null model, so that we have a baseline against which to compare the results given by the more involved models. The null model is the simplest of predictors: it assigns the same constant to every new observation, without taking any of the predictor variables into account. In our case this constant is just the average value of the response variable Y, which is about 0.347 (101 applicants received an offer, 35 accepted it). The null model gives an MSE of 0.2293. As it is an extremely basic model, we would certainly expect better performance measures from our tree-based predictors.

4.4.1 A single tree

In this subsection we use the R package 'rpart' to carry out all our computations. Let us start by obtaining the MSE of a single unpruned decision tree. An example of such a tree (one of the many grown by the Jackknife algorithm) is drawn in Figure 6. The MSE of this unpruned tree is 0.2216. This is somewhat better than the null model, but for a 'proper' model like this we would expect much better results. However, this tree has not yet been pruned, so it is most likely over-fitting the data. Looking at the cost-complexity tuning parameter plot in Figure 7, we choose the value α = 0.074 for pruning. This gives us the tree in Figure 8 and a much-improved MSE of 0.1984. This is quite a significant performance gain, and it is a good illustration of how dangerous over-fitting can be.

4.4.2 The entire forest

In this subsection we discuss the results generated by random forests, obtained using the R package 'randomForest'. Random forests are very user-friendly, as they have only two main parameters to tune: the number of trees R in the forest and the number of predictor variables mtry selected at random at each node. As it turns out, the number of trees plays only a stabilizing role: as their number increases, the variance of the performance measure is reduced and it settles around a certain value. Figure 9 illustrates this stabilization. We therefore fix R = 1000, as this is sufficiently large while the forest remains computationally efficient. Now let us turn to the tuning of mtry. A multitude of experiments reported in Breiman (2001) have shown that a value of mtry = ⌊√m⌋ generally gives the best results, where m is the total number of predictor variables. In our case ⌊√6⌋ = 2, so let us see whether this rule of thumb is confirmed and we get the best results for an mtry of 2. Table 1 gives the MSEs (and, for comparison, the OOB error estimates) produced by random forests for different values of mtry; a sketch of the corresponding fitting calls follows below. As we can see, we get considerable improvements in MSE compared to those produced by the single decision tree in Section 4.4.1, which confirms the theory discussed in Section 3. We also notice that the rule of thumb holds, as mtry = 2 gives among the lowest MSEs. Taking mtry = 1 gives a decent result, but not nearly as good as the others.
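For concreteness, the calls below sketch how the pruned tree of Section 4.4.1 and the forests behind Table 1 might be fitted; the formula, the data frame name and the exact value passed to the cp argument are illustrative rather than a record of the code that produced the reported numbers.

library(rpart)
library(randomForest)

# Single tree: grow, inspect the cost-complexity table, then prune (Section 4.4.1).
tree <- rpart(Y ~ C + FS + G + LLD + GY + AP, data = admissions, method = "anova")
printcp(tree)                        # helps choose the complexity parameter
pruned <- prune(tree, cp = 0.074)    # the value read off Figure 7

# Random forests of R = 1000 trees for each candidate mtry (Section 4.4.2).
for (m in 1:6) {
  rf <- randomForest(Y ~ C + FS + G + LLD + GY + AP, data = admissions,
                     ntree = 1000, mtry = m)
  cat("mtry =", m, " OOB MSE =", tail(rf$mse, 1), "\n")
}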
Figure 6: An unpruned decision tree. Legend: at each node, the split applied is specified; observations with the indicated characteristics go down the left branch, the others go down the right one.
  C = acghi → C ∈ {China, France, otherEU, UK, USA}
  AP < 0.5 → AP = {early}
  LLD = bde → LLD ∈ {EU, Imperial, Overseas}
  LLD = cdefh → LLD ∈ {EuropeOther, Imperial, Overseas, Oxbridge, UK}
  FS = d → FS = {Overseas}

Figure 7: Plot suggesting the value to take for the cost-complexity parameter.
Figure 8: The pruned decision tree.

Figure 9: Plot illustrating the stabilization of the performance measure as the number of trees in the forest is increased (taking mtry = 2).
mtry   MSE (×10^-2)   e_OOB
1      20.20          20.31
2      19.38          19.54
3      19.37          19.60
4      19.47          19.63
5      19.49          19.63
6      19.68          19.88

Table 1: MSE and OOB errors produced by random forests for different values of mtry.

Table 2: The variable importance measures of the different predictor variables.

The reason for this might be the following: if mtry = 1, then at each node a single predictor variable is selected at random, and if this variable can only be split in unhelpful ways, the accuracy of the forest is negatively affected. This problem seems to disappear for larger values of mtry. It should also be noted that when taking mtry = 6, our random forest is simply the bagging predictor presented in Section 3.2, as no random predictor-variable selection is done at the nodes. Hence we can conclude that, in this case, random forests perform better than the bagging predictor.

Now let us look at the importance of the individual predictor variables. These are computed using the method explained in Section 3.3.3, and the results are given in Table 2. Higher values of %IncMSE indicate a higher importance of the variable for the prediction task at hand. It becomes immediately obvious that the predictor variables C and AP are considerably more important than the others. This suggests treating the set of variables included in the model as a further parameter to tune: an interesting idea is to include only the predictor variables that Table 2 identifies as important. So we try growing a random forest including only the variables C and AP. The results are summarized in Table 3. (Note: the performance of many other combinations of predictor variables was also measured; some give improved results, but the lowest MSE was obtained with the combination C and AP.) Table 3 reveals a remarkable result: we find a decrease in the MSE of these forests that is quite significant.
mtry   MSE (×10^-2)   e_OOB
1      18.41          18.49
2      18.87          19.01

Table 3: Performance measures of random forests involving only the 'important' predictor variables C and AP.

Why might this be the case? Previously, when the model had all six predictor variables as input, the random predictor-variable selection at each node allowed the possibility that a set containing only 'unimportant' variables was picked, which could then lead to a 'bad' split. A few such bad splits, especially if they occurred near the top of the trees, led to less than ideal performance. When we fed only the important variables into the model, we could be confident that this problem would not occur, as every split is likely to be quite a good one. This may be the reason behind the improved model performance.

5 Other Prediction Methods

We saw previously that, for the Admissions Dataset, the random forest outperforms the simple tree predictor. In spite of this, it is interesting to see how the performance of the random forest compares with that of other popular classification techniques. The methods we try in this section are the Generalized Linear Model, Linear and Quadratic Discriminant Analysis, and the Multilayer Perceptron. We will not go into the theory behind these methods, but only explain how they can be tuned correctly to obtain the best possible performance.

5.1 Generalized Linear Model

The Generalized Linear Model (GLM) is probably the most widely used statistical tool and is always a good starting point in any classification problem. The built-in R function 'glm' gives access to this tool, and the only choices that need to be specified are the predictor variables included in the model and the link function. As the response variable of the Admissions Dataset is binary, we fit a binomial GLM with its canonical link, the logit function. Other link functions would also be possible, but this one appears to be the most natural choice.

Now let us consider the selection of the predictor variables. The GLM allows us to include any number of predictor variables, but as we will see, choosing the right ones is crucial for good performance. To find a good set of variables, we apply an algorithm called forward stepwise selection. It is similar to the one in Bellotti (2013) and works as follows:

Algorithm 6. Let S^0_X denote the set of all predictor variables and S_i the set of variables selected so far. We choose the AIC as our performance measure; it is very similar in spirit to the cost-complexity criterion of Section 2.4: the lower the AIC, the better the model.
Variable j   ΔAIC Step 1   ΔAIC Step 2   ΔAIC Step 3
FS           -1.8          -1.6          1.6
G            3.5           2             1.9
GY           0.4           -0.3          0.6
LLD          5.7           9.9           11
AP           -6.7*         —             —
C            -6.4          -9.1*         —

Table 4: The values of ΔAIC(S_i ∪ {j}, S_i) for all variables j at each step; the value of the variable selected at each step is marked with an asterisk. The algorithm stops at step 3, as all values are positive.

For two models M1 and M2, let ΔAIC(M1, M2) = AIC(M1) − AIC(M2).

1. Start with S_0 = ∅.

2. While STOP has not been applied, do:

(a) Set ĵ = argmin_{j ∈ S^0_X \ S_i} ΔAIC(S_i ∪ {j}, S_i).

(b) IF ΔAIC(S_i ∪ {ĵ}, S_i) < 0:
    i. Set S_{i+1} = S_i ∪ {ĵ}.
    ii. Let i = i + 1.
    iii. IF S_i = S^0_X, then STOP.

(c) ELSE STOP.

3. Return S_i. This is the set of predictor variables we will feed into the model.

Now let us apply this algorithm to the Admissions Dataset. The AIC of the model containing no predictor variables is 131.5. The steps of the forward stepwise selection are given in Table 4. From this table we deduce that including only the two variables AP and C achieves the best performance (though not necessarily the optimal performance, as not all combinations of predictor variables were tried). Doing this, we obtain MSE = 0.1831; an illustrative fit is sketched below. Note that the inclusion of various interaction terms was also tested, but no improvement in performance was found. The results found for the various classifiers will be discussed in Section 5.5.
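A minimal sketch of such a fit, assuming the placeholder data frame 'admissions' used earlier and the variables selected by Algorithm 6:

# Binomial GLM with the canonical logit link, using the selected variables AP and C.
fit <- glm(Y ~ AP + C, family = binomial(link = "logit"), data = admissions)

AIC(fit)                                                        # the criterion of Algorithm 6
p_hat <- predict(fit, newdata = admissions, type = "response")  # acceptance probabilities

Base R's step() function can automate a similar AIC-based forward search, although writing the selection out as in Algorithm 6 keeps the criterion explicit.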
5.2 Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is a fairly basic classifier that makes simple assumptions about the nature of the data: it assumes that the density within each class is Gaussian and that all classes share the same covariance structure, and it then uses a linear decision boundary to perform the classification. Further details on the theory of linear discriminant analysis can be found in Hastie et al. (2001, Chapter 4). A nice property of this prediction tool is that it does not require any specific tuning, so it can be applied directly to the dataset. Using the R function 'lda', we get MSE = 0.1946.

5.3 Quadratic Discriminant Analysis

Quadratic Discriminant Analysis (QDA) is very similar to linear discriminant analysis, except that it allows the covariance matrices of the classes to differ, and it uses a quadratic decision boundary for the classification. A more detailed explanation of the theory of quadratic discriminant analysis, and of how it compares to its linear counterpart, can be found in Williams (2009). Applying the R function 'qda' to the Admissions Dataset, we find MSE = 0.1968.

5.4 Multilayer Perceptron

As a final prediction technique, let us look at an artificial neural network method called the Multilayer Perceptron (MLP). The MLP is a collection of neurons connected in an acyclic manner; each neuron is a simple decision unit. At one end are the input nodes, at the other end the output nodes, and in between the hidden nodes. The structure of the MLP is fixed, but the weights between neurons are adjustable and are optimized during the training process. Training is done by back-propagation and prediction by forward-propagation. The MLP is a highly sophisticated method; for a more in-depth treatment of its theoretical background, consult Mitchell (1997, Chapter 4). For our purposes, we simply use the R package 'nnet' to obtain our results.

Training of this predictor starts from a randomly chosen starting vector of weights if none is specified. The outputs differ noticeably depending on this starting vector, so to eliminate this unnecessary source of randomness we fix it to be the zero vector. This allows us to compare more easily the results given by different tunings of the model.

One of the nice things about the MLP is that one does not have to consider predictor-variable selection. All variables can be included in the model, whether they are useful or not, because during training the network learns which variables are important and increases their weights, while spotting the less useful ones and weighting them down.

To achieve good performance, some parameters need to be tuned correctly. The three main parameters that we tune are the following (there are others, but we leave them at their default settings); an illustrative tuning loop is sketched after the list:

1. the number of hidden nodes N,

2. the maximum number of iterations i_max,

3. the decay parameter λ.
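The sketch below shows what such a decay search with 'nnet' might look like, using the starting configuration adopted at the beginning of the tuning in this section (N = 10, i_max = 200); the data frame name is a placeholder and, for brevity, the in-sample MSE is printed rather than the jackknife MSE actually reported in Table 5.

library(nnet)

set.seed(1)   # alternatively, fix the starting weights explicitly via the Wts argument
for (lambda in c(1, 0.1, 0.01, 0.001, 0.0001)) {
  fit <- nnet(Y ~ C + FS + G + LLD + GY + AP, data = admissions,
              size = 10, maxit = 200, decay = lambda, trace = FALSE)
  # With the default logistic output unit, fitted values stay in [0, 1].
  cat("decay =", lambda, " in-sample MSE =",
      round(mean((admissions$Y - fitted(fit))^2), 4), "\n")
}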
λ        MSE (×10^-2)
1        22.93
0.1      17.44
0.01     22.49
0.001    24.61
0.0001   32.30

Table 5: Performance measures of the MLP for different values of λ, for fixed N = 10 and i_max = 200.

Figure 10: (a) MSEs given by the MLP for different values of N, for fixed λ = 0.1 and i_max = 200. (b) MSEs given by the MLP for different values of i_max, for fixed λ = 0.1 and N = 5. (c) MSEs given by the MLP for different values of λ, for fixed i_max = 100 and N = 5.

We start with a model with parameters N = 10, i_max = 200 and λ = 1, and tune by optimizing one parameter at a time while keeping the other two fixed. Note that this will not result in the optimal tuning (just a suboptimal one), because we do not try all possible combinations of the three parameters (there are infinitely many!). Let us start by getting the order of magnitude of λ right. The results of this search are shown in Table 5, which reveals that choosing λ = 0.1 gives extremely good performance, the best so far. Let us see whether we can obtain a lower MSE by tuning the other parameters. Figure 10a plots the results obtained for different numbers of hidden nodes N, fixing i_max = 200 and λ = 0.1; from this figure we deduce that N = 5 is the most favourable choice. Figure 10b illustrates the results obtained for different values of i_max, fixing N = 5 and λ = 0.1, and suggests i_max = 100. Finally, let us revisit the tuning of λ. We saw previously that a good value of λ is of order 0.1, so we perform a more detailed search around this magnitude; the results, given in Figure 10c, recommend λ = 0.098. Hence, tuning the parameters to N = 5, i_max = 100 and λ = 0.098, the MLP produces an impressive performance measure of MSE = 0.1620.
Predictor        MSE (×10^-2)
Null             22.93
CART             19.84
Bagging          18.87
Random Forest    18.41
GLM              18.31
LDA              19.46
QDA              19.68
MLP              16.20

Table 6: Summary of the results given by the different predictors.

5.5 Comparing the Predictors

In Sections 4 and 5 we applied various different predictors to the Admissions Dataset and found interesting results; a summary is given in Table 6. To begin with, it should be noted that the Admissions Dataset is relatively small, with only 101 observations and 6 predictor variables, which may have a substantial influence on how the different models perform. All our models do much better than the null model, which is a reassuring sign. What is more, we notice that 'simpler' models like LDA, QDA or CART tend not to give the best performance. This might mean that the Admissions Dataset contains some hidden patterns that these simple models do not pick up very well. We can also see that the theoretical results of Section 3 are confirmed: bagging performs better than CART, and random forests perform better than bagging. The GLM gives surprisingly good results, slightly outperforming the random forest. The reason for this might lie in the nature of the dataset, a two-class problem with a binary outcome variable; it could be that the transformation performed by the GLM's logit link function is particularly appropriate in this case. Finally, we note that the MLP gives outstanding results. This model, once correctly tuned, can train itself to a level where it picks up patterns that its competitors miss.

If we had a larger Admissions Dataset, with more predictor variables or more observations, these results would most likely change. We would expect the MSEs of these predictors to decrease, as more information usually allows more accurate predictions, and we might also find the spread of these MSEs to increase. In my opinion, the MLP would retain its superiority, but the random forest could overtake the GLM, as the GLM might lose the advantage it gained from the relatively small size of the Admissions Dataset.

6 Conclusion

Random forests are conceptually fairly simple, yet they can be very good prediction tools. The random forest significantly outperforms its ancestors, the single decision tree and the bagging predictor.
The general belief was that these forests could not compete with sophisticated methods like boosting or arcing algorithms, but the findings in Breiman (2001) give substantial evidence to the contrary: random forests perform at least as well as, and occasionally better than, these predictors.

In Section 4 we applied random forests to the Admissions Dataset and found that they gave quite good results; however, they were clearly outperformed by the multilayer perceptron in Section 5. The reason for this might be that the nature of the dataset did not allow the random forest to perform at its best. I believe that if the number of observations and predictor variables (especially the important ones) were increased, the random forest would produce extremely fine predictions. With the current Admissions Dataset, however, one would prefer to employ the multilayer perceptron for the given prediction task, as it picks up patterns of the data that random forests miss.

The random forest is a tool that can be used for a very wide range of prediction problems. These days it is used extensively in bioinformatics (Qi, 2012), but it can also be found in real-time pose recognition systems (Shotton et al., 2013) and in facial recognition devices (Fanelli et al., 2013). The random forest is a new and exciting method that inspires many statisticians and computer scientists. An increasing number of articles and books will be written about it, and random forests will eventually become a fundamental part of the machine learning literature.

As a final word, it should be noted that even though random forests (and other prediction methods) can be very good for analysing various datasets, they are not a replacement for human intelligence and knowledge of the data. The output of these predictors should not be taken as the absolute truth, but as an intelligent, computer-generated guess that may lead to a deeper understanding of the problem at hand.
References

Amit, Y. and D. Geman (1997). Shape quantization and recognition with randomized trees. Neural Computation 9(7), 1545–1588.

Bellotti, T. (2013). Credit scoring 1. Lecture Notes.

Breiman, L. (1996). Bagging predictors. Machine Learning 24(2), 123–140.

Breiman, L. (2001). Random forests. Machine Learning 45, 5–32.

Breiman, L. and A. Cutler (2014). Random forests. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm. Retrieved 9 June 2014.

Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone (1984). Classification and Regression Trees. Belmont, CA: Wadsworth International Group.

Chan, K.-Y. and W.-Y. Loh (2004). An algorithm for building accurate and comprehensible logistic regression trees. Journal of Computational and Graphical Statistics 13(4), 826–852.

Efron, B. and R. Tibshirani (1993). An Introduction to the Bootstrap. Chapman & Hall/CRC.

Fanelli, G., M. Dantone, J. Gall, A. Fossati, and L. J. V. Gool (2013). Random forests for real time 3d face analysis. International Journal of Computer Vision 101(3), 437–458.

Hastie, T., R. Tibshirani, and J. Friedman (2001). The Elements of Statistical Learning. Springer.

Mitchell, T. (1997). Machine Learning. McGraw Hill.

Orbanz, P. (2014). Bagging and random forests. http://stat.columbia.edu/~porbanz/teaching/W4240/slides_lecture10.pdf. Retrieved 8 June 2014.

Qi, Y. (2012). Ensemble Machine Learning. Springer.

Shotton, J., T. Sharp, A. Kipman, A. W. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore (2013). Real-time human pose recognition in parts from single depth images. Commun. ACM 56(1), 116–124.

Williams, C. (2009). Classification using linear discriminant analysis and quadratic discriminant analysis. http://www.cs.colostate.edu/~anderson/cs545/assignments/solutionsGoodExamples/assignment3Williams.pdf. Retrieved 8 June 2014.