This is an attempt at a practical deconstruction of one of the oldest algorithmic ("Black Box") techniques in Data Science. This broad class of non-parametric techniques has come to dominate a statistical field that has recently been popularly rechristened as AI.
2. The Parable of the Blind Men Examining the Elephant
The earliest versions of the parable of the blind men and the elephant are found in Buddhist, Hindu and Jain texts, where they illustrate the limits of perception and the importance of complete context. The parable has several Indian variations, but broadly goes as follows:
A group of blind men heard that a strange animal, called an elephant, had
been brought to the town, but none of them were aware of its shape and
form. Out of curiosity, they said: "We must inspect and know it by touch, of
which we are capable". So, they sought it out, and when they found it they
groped about it. The first person, whose hand landed on the trunk, said,
"This being is like a thick snake". For another, whose hand reached its ear, it seemed like a kind of fan. Another person, whose hand was upon its leg, said the elephant is a pillar, like a tree-trunk. The blind man who placed his hand upon its side said the elephant "is a wall". Another
who felt its tail, described it as a rope. The last felt its tusk, stating the
elephant is that which is hard, smooth and like a spear.
3. Like a collection of blind men each examining only a portion of the Whole Elephant, so Random Forests employs a collection of partial understandings of a Sampled Dataset, then aggregates those to achieve a collective understanding of the Whole Dataset (from which the sample was drawn) that exceeds the insights of a single “fully-sighted” analysis
• Such ensemble statistical learning techniques, in which a final model is derived by aggregating a collection of competing, often “partially blinded,” models, had their first explicit statement in the 1996 Bagging Predictors paper by Leo Breiman (subsequently the author of the seminal Random Forests paper, 2001).
• The power of ensemble thinking was given popular attention in the 2004 book by James Surowiecki, The Wisdom of Crowds.
• The fundamental mathematics: “the square of a dataset’s average is always smaller than or equal to the average of the individually-squared datapoints”:
ExpectedValue(Xi^2) >= [ExpectedValue(Xi)]^2    NOTE: the second term is (Mean of Xi)^2
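The inequality above (equivalently, Var(X) = E[X^2] - (E[X])^2 >= 0) can be checked numerically; a minimal sketch with arbitrary random data:

```python
import random

# Empirically check that the average of squared datapoints is never
# smaller than the square of the average (i.e., variance is non-negative).
random.seed(0)
for _ in range(1000):
    xs = [random.uniform(-10, 10) for _ in range(50)]
    mean = sum(xs) / len(xs)
    mean_of_squares = sum(x * x for x in xs) / len(xs)
    assert mean_of_squares >= mean ** 2
```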
4. Motivation
• Random Forests, a non-parametric ensemble technique, was unveiled by Leo Breiman in his 2001 Random Forests paper. That same year he published a manifesto provocatively titled Statistical Modeling: The Two Cultures, intended as a wake-up call to an academic statistical community that he felt was languishing in an old paradigm that “led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems.” Random Forests was his capstone contribution (he died soon afterwards) to a newer approach he labelled “Algorithmic Modeling.”
• Algorithmic Models (often referred to as “Black Boxes”) have proven highly successful at predictive accuracy, at the cost of considerable opaqueness as to why they work or even what exactly they do.
• Decades of work have been devoted to unraveling the mysteries of Black Box statistical techniques, Breiman’s Random Forests being a prime target. This presentation is yet another such attempt. Here we set the modest goal of simply gaining a better understanding of what the technique actually does, without evaluating why and under what circumstances it might work well. Our approach is to construct a toy example:
• 1) of low dimensionality, and
• 2) in a discrete Universe where predictors (the X feature vector) define a small hypercube
5. Random Forests as a Monte Carlo Simulation
• Monte Carlo simulations are often used to empirically describe distributions which either have no
closed-form solution or a solution intractably difficult to specify. The strategy of this presentation is to treat Random Forests as if it were a simulation and, using a simplified version of the algorithm and a low-dimensional example, to exhaustively specify the Universe it is attempting to analyze. We can then deterministically examine what Random Forests is doing.
• “Randomness” in Random Forests is a shortcut technique that high dimensionality and predictors
(features) measured as continuous variables make necessary. What if we removed those traits?
• Indeed, in 1995 Tin Kam Ho of Bell Labs, who coined the term (as “random decision forests”) and introduced the use of feature subspaces to decouple trees (the key innovation used in Breiman’s version), viewed randomness as simply an enabling technique, not something fundamental. She writes:
Our method to create multiple trees is to construct trees in randomly selected subspaces of the feature space. For a given feature space of m [binary] dimensions there are 2^m subspaces in which a decision tree can be constructed. The use of randomness in selecting components of the feature space vector is merely a convenient way to explore the possibilities. [italics added]
6. Defining a “Discrete” Universe: Levels of Measurement
• Four levels of data measurement are commonly recognized:
• Nominal or Categorical (e.g., Race: White, Black, Hispanic, etc.)
• Ordinal (e.g., Small, Medium, Large)
• Interval (e.g., 1, 2, 3, 4, etc., where it is assumed that the distances between numbers are the same)
• Ratio (interval-level data with a true zero, so that any two ratios of numbers can be compared; 2/1 = 4/2)
• Most statistical analyses (e.g., Regression or Logistic Regression) make an often-unspoken
assumption that data are at least interval.
• Even when expressed as a number (e.g., a FICO score) the metric may not be truly interval level. Yet often
such measures are treated as if they were interval. FICO scores are, in fact, only ordinal but are typically
entered into models designed to employ interval-level data.
• For our purposes, a second data measurement distinction is also important
• Continuous (with infinite granularity between any two Interval data points), and
• Discrete (with a finite region between (or around) data points such that within each region, no
matter the distance, all characteristics are identical). Within an ordinal-level region, say “Small,”
all associated attributes (say, height of men choosing to select Small shirt sizes) are deemed to be
representable as a single metric. Of course, measured attributes (e.g., heights) of Small shirt-
wearers will differ, but again we choose to represent the attribute with a single summary metric.
7. Defining a “Discrete” Universe (continued)
• Our uncomfortable definition of a discretely measured feature variable warrants more discussion.
• In statistical practice (e.g., Credit modeling) one of the most common pre-processing steps is to group
predictors (feature variables) into discrete ordinal groupings (e.g., 850<=FICO, 750<=FICO<850,
700<=FICO<750, 675<=FICO<700, FICO<675). Notice that we have created 5 ordinal-measured groups,
say A B C D E, and that each need not contain the same range of FICO points. As mentioned earlier,
often these are (wrongly) labelled numerically 1 2 3 4 5 and entered into (e.g.) a Logistic Regression.
• More correctly these groupings should be treated as Categorical variables and if entered into a
Regression or Logistic Regression they should be coded as Dummy variables. But importantly, we have
now clearly created discrete variables. Our analysis uses grouped scores as predictors and our results
cannot say anything about different effects of values more granular than at the group level. The same is
true of all men choosing to wear Small shirt sizes. We observe only their shirt size and can model their
predicted height only as a single estimate.
• At a more philosophical level, all real datasets might be argued to contain, at best, ordinal-level discrete
predictors. All observed data points are measured discretely (at some finite level of precision) and must
be said to represent a grouping of unobserved data below that level of precision. Continuity, with an
infinite number of points lying between each observation no matter how granular, is a mathematical
convenience. What we measure we measure discretely.
• Accordingly, this presentation models predictors as ordinal discrete variables. One easy way to
conceptualize this is to imagine that we start with ordered predictors (e.g., FICO scores) that we group.
In our toy example we have 3 predictors that we have grouped into 3 levels each.
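The grouping step described above can be made concrete; a minimal sketch using the FICO cut-points from the text (the A-E labels are illustrative):

```python
def fico_group(score: int) -> str:
    """Map a raw FICO score to one of 5 ordinal groups using the
    cut-points from the text: 850<=FICO, 750-849, 700-749, 675-699, <675."""
    if score >= 850:
        return "A"
    if score >= 750:
        return "B"
    if score >= 700:
        return "C"
    if score >= 675:
        return "D"
    return "E"

# The resulting labels are ordinal: the groups are ordered but the
# FICO-point ranges they span need not be equal.
assert fico_group(760) == "B"
assert fico_group(660) == "E"
```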
9. Our Toy Universe
• Consider a 3-variable, 3-levels-per-variable (2 cut-points) example. It looks a bit like a small Rubik’s cube. Below is a cutaway of its top, middle and bottom levels, each containing 9 cells, for a total of 27.
10. Each of the 27 cells making up the Universe can be identified in 2 ways
1. By its position spatially along each of the 3 axes (e.g., A=1, B=1, C=1)
2. Equivalently by its cell letter (e.g., k)
It is important to note that we have so far discussed (and illustrated above) exclusively Predictor Variables/Features (A, B, C) and the 3 levels of each (e.g., A0, A1, and A2). Any individual cell (say, t) is associated only with its location in the Predictor grid (t is found at location A2, B0, C1). Nothing has been said about dependent variable values (the target Y values, which Random Forests is ultimately meant to predict). This is purposeful. It is instructive to demonstrate that without observing Y values a Random Forest-like algorithm can be run as an Unsupervised algorithm that partitions the cube in specific ways. If the Universe is small enough and the algorithm simple enough, we can then enumerate all possible partitionings that might be used in a Supervised run (with observed Y values). Some subset of these partitionings the Supervised algorithm will use, but that subset is dependent upon the specific Y values, which we will observe later as a second step.
         B=0                     B=1                     B=2
C=2  (c) (l) (u)        C=2  (f) (o) (x)        C=2  (i) (r) (aa)
C=1  (b) (k) (t)        C=1  (e) (n) (w)        C=1  (h) (q) (z)
C=0  (a) (j) (s)        C=0  (d) (m) (v)        C=0  (g) (p) (y)
     A=0 A=1 A=2             A=0 A=1 A=2             A=0 A=1 A=2
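The two equivalent cell identifications can be encoded directly; a sketch that labels the 27 cells a..z plus "aa" in the order used in the grids above (index = 9*A + 3*B + C):

```python
from itertools import product

# Map each cell letter to its (A, B, C) grid coordinates.
labels = [chr(ord('a') + i) for i in range(26)] + ['aa']
cell = {lab: coords
        for lab, coords in zip(labels, product(range(3), repeat=3))}

assert cell['a'] == (0, 0, 0)
assert cell['k'] == (1, 0, 1)
assert cell['t'] == (2, 0, 1)   # t is found at A2, B0, C1 (per the text)
assert cell['aa'] == (2, 2, 2)
```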
11. A Random Forest-like Unsupervised Algorithm
• Canonical Random Forest is a Supervised algorithm; our Unsupervised version is only “Random Forest-like.”
• Creating a diversified forest requires each tree growth to be decoupled from all others. Canonical Random
Forest does this in two ways:
• It uses only bootstrapped subsets of the full Training Set to grow each tree.
• Importantly, at each decision-point (node) only a subset of available variables are selected to determine daughter nodes. This is Ho’s “feature
subspaces selection.”
• How our Random Forest-like algorithm differs from Breiman’s algorithm:
• As with many other analyses of Random Forest, our Unsupervised algorithm uses the full Training Set each time a tree is built.
• Breiman uses CART as the base tree-split method. CART allows only binary splits (only 2 daughter nodes per parent node split). We allow
multiple daughter splits of a parent Predictor variable (as is possible in ID3 and CHAID); in fact, our algorithm insists that if a Predictor (in our
example, the A or B or C dimension) is selected as a node splitter, all possible levels are made daughters.
• Breiman’s algorithm grows each tree to its maximum depth. Our approach grows only two levels (vs. the 3 levels that a maximum 3-attribute model would allow). This is done for two reasons. First, a fully grown model without bootstrapping would necessarily interpolate the Training Set (which Breiman’s method typically does not do). And second, because we have only 3 predictors, fully grown trees are not very decoupled from one another. Stopping short of full growth is motivated by the same considerations as “pruning” non-random trees.
12. A Random Forest-like Unsupervised Algorithm (cont’d)
• In our simple unsupervised random forest-like algorithm the critical “feature subset selection
parameter”, often referred to as mtry, is set at a maximum of 2 (out of the full 3). This implies that at
the root level the splitter is chosen randomly with equal probability as either A, B, or C and 3 daughter
nodes are created (e.g., if A is selected at the root level, the next level consists of A0, A1, A2).
• At each of the 3 daughter nodes the same procedure is repeated. Each daughter’s selection is
independent of her sisters. Each chooses a split variable between the 2 available variables. One
variable, the one used in the root split, is not available (A in our example) so each sister in our example
chooses (with equal probability) between B and C. Choosing independently leads to an enumeration of 2^3 = 8 permutations. This example is for the case where the choice of root splitter is A. Because both B and C could instead be root splitters, the total number of tree growths, each equally likely, is 3 x 8 = 24.
24 EQUALLY-LIKELY TREE NAVIGATIONS

A is the root splitter     B is the root splitter     C is the root splitter
(daughters' splitters)     (daughters' splitters)     (daughters' splitters)
BBB   CCC                  AAA   CCC                  AAA   BBB
BBC   CCB                  AAC   CCA                  AAB   BBA
BCB   CBC                  ACA   CAC                  ABA   BAB
BCC   CBB                  ACC   CAA                  ABB   BAA
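The enumeration above can be generated mechanically; a sketch of the procedure (root splitter chosen from 3 variables, each of the 3 daughters independently choosing one of the 2 remaining variables):

```python
from itertools import product

# Enumerate the 24 equally-likely tree navigations of the Toy Example.
variables = ['A', 'B', 'C']
navigations = []
for root in variables:
    remaining = [v for v in variables if v != root]
    # Each of the 3 daughter nodes independently picks one of the
    # 2 variables not used at the root.
    for daughters in product(remaining, repeat=3):
        navigations.append((root, daughters))

assert len(navigations) == 24            # 3 roots x 2^3 daughter choices
assert ('A', ('C', 'C', 'B')) in navigations
```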
13. The 8 unique partitionings when the root partition is A
For each of the 3 beginning root partition choices, our unsupervised random forest-like algorithm divides the cube
into 8 unique collections of 1x3 rectangles. There are 18 such possible rectangles for each of the 3 beginning root
partition choices. Below are the results of the algorithm’s navigations when A is the root partition choice. Two
analogous navigation results charts can be constructed for the root attribute choices of B and C.
[Figure: the 8 partitionings A(BBB), A(BBC), A(BCB), A(BCC), A(CCC), A(CCB), A(CBC), A(CBB), each drawn as 3x3 grids (A0-A2 by C0-C2, one panel per B level). The 1x3 rectangles produced aggregate the cells: along C — (a,b,c), (j,k,l), (s,t,u), (d,e,f), (m,n,o), (v,w,x), (g,h,i), (p,q,r), (y,z,aa); along B — (c,f,i), (b,e,h), (a,d,g), (l,o,r), (k,n,q), (j,m,p), (u,x,aa), (t,w,z), (s,v,y).]
14. The 8 unique partitionings when the root partition is A (cont.)
Note that each cell (FOR EXAMPLE, CELL n, the middle-most cell) appears in 2 and only 2 unique 1x3 rectangles
among the 8 navigations when the root partition is A. In the 4 navigations bordered in blue below, n appears in
the brown rectangle (m,n,o). In the 4 navigations bordered in red below, n appears in the dark brown rectangle
(k,n,q).
[Figure: the same 8 partitionings, with the 4 navigations in which n falls in rectangle (m,n,o) bordered in blue and the 4 in which n falls in rectangle (k,n,q) bordered in red.]
15. The 8 unique partitionings when the root partition is A (cont.)
Analogously, CELL a (the bottom/left-most cell) also appears in 2 and only 2 unique 1x3 rectangles when the root partition is A. In the 4 navigations bordered in green below (the top row of navigations) Cell a appears in the red rectangle (a,b,c). In the 4 navigations bordered in brown below (the bottom row of navigations) Cell a appears in the pink rectangle (a,d,g). This pattern is true for all 27 cells.
[Figure: the same 8 partitionings, with the top row of navigations (those containing rectangle (a,b,c)) bordered in green and the bottom row (those containing rectangle (a,d,g)) bordered in brown.]
16. Random Forests “self-averages”, or Regularizes
Each of our algorithm’s 24 equally-likely tree navigations reduces the granularity of the cube by a factor of
3, from 27 to 9. Each of the original 27 cells finds itself included in exactly two unique 1x3 rectangular
aggregates for each root splitting choice. For example, Cell “a” is in Aggregate (a,b,c) and in Aggregate
(a,d,g) when A is the root split choice (see previous slide). When B is the root split choice (not shown), Cell “a” appears again in Aggregate (a,b,c) and in a new Aggregate (a,j,s), each appearing in 4 of the 8 navigations. When C is the root split choice, Cell “a” appears in the two Aggregates (a,j,s) and (a,d,g).
Continuing with Cell a as our example and enumerating the rectangles into
which Cell a appears in each of the 24 equally-possible navigations we get:
Now recall that reducing the granularity of the cube means that each cell is
now represented by a rectangle which averages the member cells equally
i.e., vector (a,b,c) is the average of a, b, and c. Therefore the 24-navigation
average is [8*(a+b+c) + 8*(a+d+g) + 8*(a+j+s)] /72.
Rearranging = [24(a) + 8(b) + 8(c) + 8(d) + 8(g) + 8(j) + 8(s)] / 72
Simplifying = 1/3*(a) + 1/9*(b) + 1/9*(c) + 1/9*(d) + 1/9*(g) + 1/9*(j) + 1/9*(s)
This is our Unsupervised RF-like algorithm’s estimate for the “value” of Cell a in our Toy Example
                             Root-Level Split
rectangle    count of occurrences      A      B      C
(a,b,c)               8                4      4      -
(a,d,g)               8                4      -      4
(a,j,s)               8                -      4      4
sums                 24                8      8      8
17. Random Forests “self-averages”, or Regularizes (continued)
[Figure: Cell “a” shown with its six 1x3-rectangle neighbors: b and c (along C), d and g (along B), j and s (along A).]
Because each of the 24 navigations is equally likely, our ensemble space for Cell “a” consists of equal parts of Aggregate (a,b,c), Aggregate (a,d,g) and Aggregate (a,j,s). Any attributes (e.g., some Target values Y) later assigned to the 27 original cells would also be averaged over the ensemble space. Note that because the original cell “a” appears in each of the Aggregates, it gets 3 times the weight that each of the other cells gets. If the original 27 cells are later all populated with Y values, our Unsupervised algorithm can be seen as simply regularizing those target Y values. Each regularized estimate equals 3 times the original cell’s Y value plus the Y values of the 6 cells within the remainder of the 1x3 rectangles lying along each dimension, the sum then divided by 9. Regularized Cell “a” = (original cells g+d+c+b+j+s+3a)/9. This formula is true for all 27 cells in our Toy Example.
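The self-averaging formula can be verified over the whole cube; a sketch in which each cell's regularized value is the average of the three 1x3-line means through it (Y values here are an arbitrary 1..27 assignment, not the randomized assignment used later in the text):

```python
from itertools import product

# Arbitrary Y assignment: Y = 9*A + 3*B + C + 1 gives values 1..27.
Y = {p: 9 * p[0] + 3 * p[1] + p[2] + 1 for p in product(range(3), repeat=3)}

def regularized(p):
    """Average of the three 1x3-line means through cell p."""
    a, b, c = p
    lines = [[(i, b, c) for i in range(3)],   # vary A
             [(a, i, c) for i in range(3)],   # vary B
             [(a, b, i) for i in range(3)]]   # vary C
    return sum(sum(Y[q] for q in line) / 3 for line in lines) / 3

# For cell "a" = (0,0,0) this equals (3*Y[a] + Y[b]+Y[c]+Y[d]+Y[g]+Y[j]+Y[s])/9,
# since the cell itself appears once in each of its three lines.
p = (0, 0, 0)
neighbors = [(0, 0, 1), (0, 0, 2), (0, 1, 0), (0, 2, 0), (1, 0, 0), (2, 0, 0)]
expected = (3 * Y[p] + sum(Y[q] for q in neighbors)) / 9
assert abs(regularized(p) - expected) < 1e-12
```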
19. Practical considerations even in a Universe defined as
Discrete: Why real-world analyses require randomness as a simulation expedient
• Even accepting the notion that any set of ordered predictors (some containing hundreds or even
thousands of ordered observations, e.g., Annual Income) can be theoretically conceptualized as
simply Discrete regions, our ability to enumerate all possible outcomes of a Random Forest-like
algorithm as we have done in this presentation is severely limited. Even grouping each ordered
Predictor into a small number of categories does not help when the number of Predictors
(dimensions) is even just modestly large.
• For example, even for an intrinsically 3-level set of ordinal predictors (Small, Medium, and Large), a 25-attribute Universe has 3^25 ≈ 8.5*10^11 (nearly a Trillion) unique X-vector grid cell locations and nearly a Trillion Y-predictions to make.
• Further complications ensue from using a tree algorithm that restricts node splitting to just two
daughters, a binary splitter. A single attribute of more than 2 levels (say Small, Medium, and Large)
can be used more than once within a single navigation path, say first as “Small or Medium” vs.
“Large” then somewhere further down the path the first daughter can be split “Small” vs. “Medium.”
Indeed, making this apparently small change of using a binary splitter in our Toy Example increases the enumeration of complete tree growths to an unmanageable 86,400 (see Appendix).
20. A Final Complication: “Holes” in the Discrete Universe
• Typically, a sample drawn from a large universe will have “holes” in its Predictor Space; by this is meant that
not all predictor (X- vector) instantiations will have Target Y values represented in the sample; and there are
usually many such holes. Recall our Trillion cell example. A large majority of cell locations will likely not have
associated Y values.
• This is a potential problem even for our Toy Universe because if there are holes then the simple
Regularization formula suggested earlier will not work. Recall that formula:
Each regularized estimate equals 3 times the original cell’s Y value plus the Y values of the 6 cells within the remainder of the
1x3 rectangles lying along each dimension, the sum then divided by 9).
• Luckily, the basic approach of averaging the three Aggregates our algorithm defines for each Regularized Y-value does continue to hold. But now the process has 2 steps. The first step is to find an average value for each of the three Aggregates using only those cells that have values. Some 1x3 rectangular Aggregates may have no values at all (including no original value for the cell for which we seek a Regularized estimate). The denominator for each Aggregate average calculation is the count of non-missing cells. The second step is to average the three Aggregate Averages. Again, only non-empty Aggregate Averages are summed, and the denominator in the final averaging is a count of only the non-empty Aggregate Averages.
• The next slide compares the Regularized estimates derived earlier from our randomized 1-27 Y-
value full assignment case to the case where 6 cell locations were arbitrarily assigned to be holes
and contained no original values. Those original locations set as holes are colored in blue in the
second graph. Only the middle-most (Cell “n”) Regularized value in the second chart (=17.00)
differs substantially from its value in the first chart. This cell has holes in all but one location
among all three of its Aggregates; only original cell location “e” is non-blank with original value 17.
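The two-step averaging with holes can be sketched directly; the values below are hypothetical, arranged to echo the Cell “n” case where only one cell (value 17) is populated across all three Aggregates:

```python
def regularize_with_holes(aggregates):
    """Two-step average over three 1x3 Aggregates.
    aggregates: three lists of cell values, with None marking a hole.
    Step 1: average each Aggregate over its non-missing cells only.
    Step 2: average the non-empty Aggregate averages."""
    agg_means = []
    for agg in aggregates:
        present = [v for v in agg if v is not None]
        if present:                      # Aggregates with no values are skipped
            agg_means.append(sum(present) / len(present))
    return sum(agg_means) / len(agg_means) if agg_means else None

# Hypothetical illustration: the target cell itself is a hole in all three
# of its Aggregates, and one Aggregate is entirely empty.
estimate = regularize_with_holes([[None, 17.0, None],
                                  [None, None, None],
                                  [None, 17.0, 17.0]])
assert estimate == 17.0
```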
22. The Optimization Step
• Most discussions of Random Forests spend considerable space describing how the base tree algorithm makes its split decision at each node (e.g., examining Gini impurity or Information Gain). In our enumeration approach this optimization step (where we attempt to separate signal from noise) is almost an afterthought. We know all possible partitions into which our simple cube CAN be split. It is only left to determine, given our data, which specific navigations WILL be chosen.
• In our toy example this step is easy. Recall that at the root level 1 of our 3 attributes is randomly
excluded from consideration (this is the effect of setting mtry=2). Our data (the 6-hole case) show that if A is not the excluded choice (which is the case 2/3 of the time) then Attribute A offers the most information gain. At the second level the 3 daughters’ choices lead to the full navigation choice A(CCB). When A is excluded (1/3 of the time) C offers the best information gain, and after examining the optimal daughters’ choices the full navigation choice is C(BAA). The optimal Regularized estimates are simply (2/3)*A(CCB) estimates + (1/3)*C(BAA) estimates.
Optimized Regularization Estimates weighting 2/3 A(CCB) and 1/3 C(BAA)
C=2 8.00 7.50 19.67 C=2 8.00 7.50 15.67 C=2 8.00 7.50 23.44
C=1 12.67 14.94 21.33 C=1 12.67 14.94 17.33 C=1 12.67 14.94 25.11
C=0 15.67 19.89 20.56 C=0 11.61 15.83 12.50 C=0 14.11 18.33 22.78
A=0 A=1 A=2 A=0 A=1 A=2 A=0 A=1 A=2
B=0 B=1 B=2
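The final weighting step can be sketched as a probability-weighted blend of the two selected navigations; the single-cell values below are placeholders, not taken from the table:

```python
def blend(est_accb, est_cbaa, w1=2/3, w2=1/3):
    """Combine per-cell estimates from two navigations with the
    probabilities that each navigation is the one the data selects."""
    return {cell: w1 * est_accb[cell] + w2 * est_cbaa[cell]
            for cell in est_accb}

# Hypothetical single-cell illustration:
out = blend({'a': 15.0}, {'a': 17.0})
assert abs(out['a'] - (2/3 * 15.0 + 1/3 * 17.0)) < 1e-12
```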
24. The Kaggle Titanic Disaster Dataset
• Kaggle is a website hosting hundreds of Machine Learning datasets and contests that allow practitioners to benchmark the predictive power of their models against peers.
• Kaggle recommends that beginners use the Titanic Disaster dataset as their first contest. It is a well-specified, familiar and real example whose goal (predicting who among the ship’s passengers will survive given attributes known about them from the ship’s manifest) is intuitive. The details of the beginners’ contest (including a video) can be found here: kaggle.com/competitions/titanic .
• Up until now our analysis has used Regression Trees as our base learners. That is, we assumed the Target (Y) values we were attempting to estimate were real, interval-level variables. The Titanic problem is a (binary) classification problem: either the passenger survived or did not (coded 1 or 0). Nevertheless, we will continue to use Regression Trees as base learners for our Titanic analysis, with each tree’s estimates (as well as their aggregations) falling between 0 and 1. As a last step, each aggregated estimate >.5 will then be converted to a “survive” prediction, otherwise “not survive”.
25. The Kaggle Titanic Disaster Dataset (continued)
• Excluding the record identifier and binary Target value (Y), the dataset contains 10 raw candidate
predictor fields for each of 891 Training set records: Pclass (1st,2nd, or 3rd class accommodations),
Name, Sex, Age, SibSp (count of siblings traveling with passenger), Parch (count of parents and
children traveling with passenger), Ticket (alphanumerical ticket designator), Fare (cost of passage
in £), Cabin (alphanumerical cabin designator), Embark (S,C,Q one of 3 possible embarkation sites).
• For our modeling exercise only Pclass, Sex, Age, and a newly constructed variable Family_Size (= SibSp + Parch) were ultimately used.*
• Our initial task was to first pre-process each of these 4 attributes into 3 levels (where possible).
Pclass broke naturally into 3 levels. Sex broke naturally into only 2. The raw interval-level attribute
Age contained many records with “missing values” which were initially treated as a separate level
with the non-missing values bifurcated at the cut-point providing the highest information gain (0-6
years of age, inclusive vs. >6 years of age). However, it was determined that there was no statistical
reason to treat “missing value” records differently from aged > 6 records and these two initial levels
were combined leaving Age a 2-level predictor (Aged 0-6, inclusive vs NOT Aged 0-6, inclusive)
• The integer, ordinal-level Family_Size attribute was cut into 3 levels using the optimal information
gain metric into 2-4 members (small family); >4 members (big family); 1 member (traveling alone).
*Before it was decided to exclude the Embark attribute it was discovered that 2 records lacked values for that attribute and those
records were excluded from the model building, reducing the Training set record count to 889.
26. Approaching our RF-like analysis from a different angle
• Our pre-processed Titanic predictor “universe” comprises 36 cells (=3x2x2x3) and we could proceed
with a RF-like analysis in the same manner as we did with our 27 cell Toy Example. The pre-
processed Titanic universe appears only slightly more complex than our Toy Example (4-dimensional
rather than 3), but the manual tedium required grows exponentially. There is a better way.
• Note that up until this current “Real Data” section all but the last slide (Slide 22) focused exclusively
on the unsupervised version of our proposed RF-like algorithm; only Slide 22 talked about
optimization (“separating signal from noise”). As noted earlier, this was purposeful. We were able to
extract an explicit regularization formula for our Toy Example (without holes, Slide 17). Higher
dimensionality and the presence of “holes” complicate deriving any generalized statement, but at
least our example provides a sense for what canonical random forest might be doing spatially (it
averages over a set of somehow-related, rectangularly-defined “neighbors” and ignores completely
all other cells). This gives intuitive support to those attributing RF’s success to its relationship to:
• “Nearest Neighbors”: Lin, Yi and Jeon, Yongho. Random Forests and Adaptive Nearest Neighbors. Journal of the American Statistical Association, Vol. 101 (2006); or the Totally randomized trees model of Geurts, Pierre, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning 63.1 (2006), pp. 3-42;
• “Implicit Regularization”: Mentch, Lucas and Zhou, Siyu. Randomization as Regularization: A Degrees of Freedom Explanation for Random Forest Success. Journal of Machine Learning Research 21 (2020), pp. 1-36.
27. Approaching our RF-like analysis from a different angle:
Starting at the optimization step
• Notice that for all the work we did to explicitly define exactly what all 24 possible tree navigations would look like in our Toy Example, we ended up evaluating only 2. But we could have known in advance what those 2 would be, if not explicitly, then by way of a generic description.
• Knowing that at the root level an optimization would pick the best attribute to split on given the choices available, our first question is “would that first-best attribute, whether A or B or C, be available?” There are 3 equally likely 2-attribute subsets at the Toy Example’s root level: AB, AC, BC. Note that no matter which of the 3 variables turns out to be the optimal one, it will be available 2 out of 3 times.
• Recall our RF-like model’s approach of selecting, at any level, from among only a subset of attributes
not previously used in the current branch. Is this “restricted choice” constraint relevant at levels
deeper than the root level? In the Toy Example the answer is “no.” At the second level only 2
attributes are available for each daughter node to independently choose among and, although those
independent choices may differ, they will be optimal for each daughter.
• Because we stop one level short of exhaustive navigation (recall, to avoid interpolating the Training Set), we are done. This is why the final estimates in our Toy Example involved simply weighting the estimates of two navigations: (2/3)*A(CCB) estimates + (1/3)*C(BAA) estimates. The last navigation covers the 1 time out of 3 that A (which turns out to be the optimal choice) is not available at the root level.
28. A Generic Outline of the Optimization Step for our Titanic
Disaster Data Set
• A mapping of the relevant tree navigations for the Titanic predictor dataset we have constructed is
more complex than that for the Toy Example, principally because we require a 3-deep branching
structure and have 4 attributes to consider rather than 3. We will continue, however, to choose only
2 from among currently available (unused) attributes at every level (i.e., mtry=2).
• At the root level, 2 of the available 4 are randomly selected, resulting in 6 possible combinations: (AB), (AC), (AD), (BC), (BD), (CD). Note that no matter which of the 4 variables turns out to be the optimal one, it will be available 3 out of 6 times. For the 1/2 of the time that the first-best optimizing attribute (as measured by information gain) is not among the chosen couplet, the second-best attribute is available 2 out of those 3 times; the second-best solution is therefore available for 1/3 of the total draws (1/2 * 2/3). The third-best solution is chosen only when the 2 superior choices are unavailable, 1/6 of the time.
• For the Titanic predictor set as we have constructed it, the root level choices are:
• Sex (First-Best), weighted 1/2
• PClass (Second-Best), weighted 1/3
• Family_Size (Third-Best), weighted 1/6
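These root-level weights can be checked by brute-force enumeration. A minimal sketch, in which ranks 1 through 4 stand in for Sex, PClass, Family_Size, and the fourth attribute (ordered by information gain), and the best-ranked member of each couplet is assumed to win the split:

```python
from itertools import combinations
from fractions import Fraction
from collections import Counter

# Ranks 1..4: 1 = Sex (First-Best), 2 = PClass, 3 = Family_Size, 4 = last.
ranks = [1, 2, 3, 4]
couplets = list(combinations(ranks, 2))    # 6 equally likely mtry=2 draws

# In each couplet the best-ranked (lowest-numbered) attribute is selected.
wins = Counter(min(c) for c in couplets)
weights = {r: Fraction(n, len(couplets)) for r, n in wins.items()}

print(weights[1], weights[2], weights[3])  # 1/2 1/3 1/6
```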
29. A Generic Outline of the Optimization Step for our Titanic
Disaster Data Set (continued)
• For each of the 3 choices of Root Level attribute, the 2 or 3 daughter nodes each independently
choose the best attribute from among those remaining. If, for example, A is taken to be the Root
Level splitter, it is no longer available at Level 2, whose choices are B, C, or D. Taken 2 at a time, the
possible couplets are (BC), (BD), (CD); 2/3 of the time the First-Best splitter is available. At deeper
levels the optimal attribute is always chosen. The weightings of each of the 20 navigations are below.
Level 1 (ROOT) First Best Choice (Sex) 1/2
Level 2 Daughter 1 Daughter 2 Weighting
tree 1 First Best First Best 2/9
tree 2 First Best Second Best 1/9
tree 3 Second Best First Best 1/9
tree 4 Second Best Second Best 1/18
Level 1 (ROOT) 2nd-Best Choice (PClass) 1/3
Level 2 Daughter 1 Daughter 2 Daughter 3 Weighting
tree 5 First Best First Best First Best 8/81
tree 6 First Best First Best Second Best 4/81
tree 7 First Best Second Best First Best 4/81
tree 8 First Best Second Best Second Best 2/81
tree 9 Second Best First Best First Best 4/81
tree 10 Second Best First Best Second Best 2/81
tree 11 Second Best Second Best First Best 2/81
tree 12 Second Best Second Best Second Best 1/81
Level 1 (ROOT) 3rd-Best (Family_Size) 1/6
Level 2 Daughter 1 Daughter 2 Daughter 3 Weighting
tree 13 First Best First Best First Best 4/81
tree 14 First Best First Best Second Best 2/81
tree 15 First Best Second Best First Best 2/81
tree 16 First Best Second Best Second Best 1/81
tree 17 Second Best First Best First Best 2/81
tree 18 Second Best First Best Second Best 1/81
tree 19 Second Best Second Best First Best 1/81
tree 20 Second Best Second Best Second Best 1/162
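The 20 navigation weights above can be reproduced (and sanity-checked: they must sum to 1) by multiplying each root weight by independent per-daughter probabilities of 2/3 for a First-Best and 1/3 for a Second-Best choice. A sketch:

```python
from fractions import Fraction
from itertools import product

# Per-daughter Level 2 probabilities (mtry=2 from 3 remaining attributes).
p = {"First Best": Fraction(2, 3), "Second Best": Fraction(1, 3)}

# Root-level weight and number of daughter nodes for each root choice.
roots = {"Sex": (Fraction(1, 2), 2),
         "PClass": (Fraction(1, 3), 3),
         "Family_Size": (Fraction(1, 6), 3)}

tree_weights = []
for root_weight, n_daughters in roots.values():
    # Each daughter chooses independently, so take the Cartesian product.
    for combo in product(p.values(), repeat=n_daughters):
        w = root_weight
        for daughter_prob in combo:
            w *= daughter_prob
        tree_weights.append(w)

print(len(tree_weights), sum(tree_weights))  # 20 trees, weights sum to 1
```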
30. Predictions for the Titanic Contest using our RF-like algorithm
• The weighted sum of the first 5 trees (out of 20) is displayed in separate Male and Female charts on
this and the next slide. An Average (normalized) metric of >=.5 implies a prediction of "Survived",
otherwise "Not." These first 5 trees carry nearly 60% of the total weight of a complete RF-like
algorithm's weighted opinion, so the classification they imply is unlikely to differ from one based on all 20.
Sex Pclass Young_Child? Family_Size Trees 1&2 (weight 1/3) Trees 3&4 (weight 1/6) Tree 5 (weight 8/81) Average (normalized)
Male 3rd NOT Young_Child Family_2_3_4 0.123 0.123 0.327 0.157
Male 3rd Family_Alone 0.123 0.123 0.210 0.137
Male 3rd Family_Big 0.123 0.123 0.050 0.111
Male 3rd Young_Child Family_2_3_4 1.000 0.429 0.933 0.830
Male 3rd Family_Alone H 0.667 0.429 1.000 0.656
Male 3rd Family_Big 0.111 0.429 0.143 0.205
Male 2nd NOT Young_Child Family_2_3_4 0.090 0.090 0.547 0.165
Male 2nd Family_Alone 0.090 0.090 0.346 0.132
Male 2nd Family_Big 0.090 0.090 1.000 0.240
Male 2nd Young_Child Family_2_3_4 1.000 1.000 1.000 1.000
Male 2nd Family_Alone H 0.667 1.000 0.346 0.707
Male 2nd Family_Big H 0.111 1.000 1.000 0.505
Male 1st NOT Young_Child Family_2_3_4 0.358 0.358 0.735 0.420
Male 1st Family_Alone 0.358 0.358 0.523 0.385
Male 1st Family_Big 0.358 0.358 0.667 0.409
Male 1st Young_Child Family_2_3_4 1.000 1.000 0.667 0.945
Male 1st Family_Alone H 0.667 1.000 0.523 0.736
Male 1st Family_Big H 0.111 1.000 0.667 0.450
MALES
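The Average (normalized) column can be reproduced as a weighted mean of the three displayed tree estimates, renormalized because the displayed trees carry only part (roughly 60%) of the total weight. A sketch, checked against the first Male row of the chart:

```python
from fractions import Fraction

# Column weights from the chart header: Trees 1&2, Trees 3&4, Tree 5.
col_weights = [Fraction(1, 3), Fraction(1, 6), Fraction(8, 81)]

def normalized_average(estimates):
    # Weighted mean of the per-column survival estimates, renormalized
    # over the weight actually displayed (1/3 + 1/6 + 8/81).
    total = sum(col_weights)
    return float(sum(w * e for w, e in zip(col_weights, estimates)) / total)

# Male, 3rd class, NOT Young_Child, Family_2_3_4: 0.123, 0.123, 0.327.
avg = normalized_average([0.123, 0.123, 0.327])
print(round(avg, 3))   # 0.157, matching the chart; < .5 => predict "Not"
```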
31. Predictions for the Titanic Contest using our RF-like algorithm (cont’d)
• Generally, most females Survived, and most males did not. Predicted exceptions are highlighted in
Red and Green. A complete decision rule would read “Predict young males (“children”) in 2nd Class
survive along with male children in 3rd and 1st Classes not in Big Families; all other males perish. All
females survive except those in 3rd Class traveling in Big Families and female children in 1st Class.”
Sex Pclass Young_Child? Family_Size Trees 1&3 (weight 1/3) Trees 2&4 (weight 1/6) Tree 5 (weight 8/81) Average (normalized)
Female 3rd NOT Young_Child Family_2_3_4 0.561 0.561 0.327 0.522
Female 3rd Family_Alone 0.617 0.617 0.210 0.550
Female 3rd Family_Big 0.111 0.111 0.050 0.101
Female 3rd Young_Child Family_2_3_4 0.561 0.561 0.933 0.622
Female 3rd Family_Alone 0.617 0.617 1.000 0.680
Female 3rd Family_Big 0.111 0.111 0.143 0.116
Female 2nd NOT Young_Child Family_2_3_4 0.914 0.929 0.547 0.858
Female 2nd Family_Alone 0.914 0.906 0.346 0.818
Female 2nd Family_Big 0.914 1.000 1.000 0.952
Female 2nd Young_Child Family_2_3_4 1.000 0.929 1.000 0.980
Female 2nd Family_Alone H 1.000 0.906 0.346 0.866
Female 2nd Family_Big H 1.000 1.000 1.000 1.000
Female 1st NOT Young_Child Family_2_3_4 0.978 0.964 0.735 0.934
Female 1st Family_Alone 0.978 0.969 0.523 0.900
Female 1st Family_Big 0.978 1.000 0.667 0.933
Female 1st Young_Child Family_2_3_4 0.000 0.964 0.667 0.378
Female 1st Family_Alone H 0.000 0.969 0.523 0.356
Female 1st Family_Big H 0.000 1.000 0.667 0.388
FEMALES
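The complete decision rule quoted above is compact enough to express as a single function; the argument names and encodings below are illustrative, not the dataset's actual codings:

```python
def predict_survived(sex, pclass, young_child, family_big):
    """Derived Rules Statement from the weighted tree charts.
    sex: 'Male' or 'Female'; pclass: 1, 2 or 3; booleans for the rest."""
    if sex == "Male":
        # Only young male children survive: all those in 2nd Class, plus
        # those in 1st and 3rd Classes not travelling in Big Families.
        return young_child and (pclass == 2 or not family_big)
    # Females survive except 3rd-Class Big Families and 1st-Class children.
    return not ((pclass == 3 and family_big) or (pclass == 1 and young_child))

# Example: an adult male travelling alone in 1st Class is predicted to perish.
print(predict_survived("Male", 1, False, False))  # False
```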
32. Performance of our RF-like algorithm on Titanic test data
• Although not the focus of this presentation, it is instructive to benchmark the performance of our
simple “derived Rules Statement” (based on our RF-like model) against models submitted by others.
• Kaggle provides a mechanism by which one can submit a file of predictions for a Test set of passengers
that is provided without labels (i.e., “Survived or not”). Kaggle will return a performance evaluation
summarized as the % of Test records predicted correctly.
• Simply guessing that "all Females survive and all Males do not" results in a score of 76.555%. Our
Rules Statement gives a modestly better score of 77.751%. A "great" score is >80% (estimated to
place one in the top 6% of legitimate submissions). Improvements derive from tedious feature-
engineering. In contrast, our data prep involved no feature-engineering except for the creation of a
Family_Size construct (= SibSp+Parch+1). Also, we only coarsely aggregated our 4 employed predictors
into no more than 3 levels. Our goal was to uncover the mechanism behind a RF-like model rather
than simply (and blindly) "get a good score." Interestingly, the canonical Random Forest example
featured in the Titanic tutorial gives a score of 77.511%, consistent with (in fact, slightly worse
than) our own RF-like set of derived Decision Rules; again, both without any heavy feature-engineering.
Reference: https://www.kaggle.com/code/carlmcbrideellis/titanic-leaderboard-a-score-0-8-is-great
33. APPENDIX: The problem of using binary splitters
• Treat our toy "3-variable, 2-split points each" set-up as a simple 6-variable set of binary cut-points.
• Splitting on one of the 6 top-level cut-points results in a pair of child nodes. Independently
splitting on both right and left sides of this first pair involves a choice from among 5 possible Level
2 splits per child. The independent selection per child means that the number of possible
combinations of paths at Level 2 is 5² (=25). By Level 2 we have made 2 choices, the choice of top-level cut
and the choice of second-level pair: one path out of 6*25 (=150). The left branch of each second-
level pair now selects from its own 4 third-level branch-pair options; so too does the right branch,
independently. This independence implies that there are 4² (=16) third-level choices. So, by Level 3
we will have chosen one path from among 16*150 (=2,400) possibilities. Proceeding accordingly,
at Level 4 each of these Level 3 possibilities chooses independent left and right pairs from among 3
remaining cut-points, so 3² (=9) total possibilities per each of the 2,400 Level 3 nodes (=21,600). At Level 5
a factor of 2² (=4) is multiplied in, but at Level 6 there is only one possible splitting option, and the factor is 1²
(=1). The final number of possible paths for our "simple" 27-discrete-cell toy example is 86,400
(=4*21,600).
• In summary, there are 6 * 5² * 4² * 3² * 2² * 1² = 86,400 equally likely configurations. This is the
universe from which a machine-assisted Monte Carlo simulation would sample.
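The product above is quick to verify mechanically; a minimal sketch:

```python
from math import prod

# Per-level independent options: 6 top-level cut-points, then squared
# left/right choices among 5, 4, 3, 2, 1 remaining cut-points per pair.
level_options = [6] + [k * k for k in range(5, 0, -1)]   # [6, 25, 16, 9, 4, 1]
print(prod(level_options))  # 86400 equally likely configurations
```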