Random Forests
without the Randomness
June 16, 2023
Kirk Monteverde
kirk@quantifyrisk.com
The Parable
of the Blind Men
examining the
Elephant
The earliest versions of the parable of the blind men and
the elephant are found in Buddhist, Hindu and Jain
texts, which discuss the limits of perception and the
importance of complete context. The parable has
several Indian variations, but broadly goes as follows:
A group of blind men heard that a strange animal, called an elephant, had
been brought to the town, but none of them were aware of its shape and
form. Out of curiosity, they said: "We must inspect and know it by touch, of
which we are capable". So, they sought it out, and when they found it they
groped about it. The first person, whose hand landed on the trunk, said,
"This being is like a thick snake". For another one whose hand reached its
ear, it seemed like a kind of fan. As for another person, whose hand was
upon its leg, said, the elephant is a pillar like a tree-trunk. The blind man
who placed his hand upon its side said the elephant, "is a wall". Another
who felt its tail, described it as a rope. The last felt its tusk, stating the
elephant is that which is hard, smooth and like a spear.
Like a collection of blind
men each examining
only a portion of the
Whole Elephant, so
Random Forests
employs a collection of
partial understandings
of a Sampled Dataset,
then aggregates those
to achieve a collective
understanding of the
Whole Dataset (from
which the sample was
drawn) that exceeds the
insights of a single
“fully-sighted” analysis
• Such ensemble statistical learning techniques, in which a final model is
derived by aggregating a collection of competing, often “partially
blinded,” models, had their first explicit statement in the 1996 Bagging
Predictors paper by Leo Breiman (the subsequent author of the seminal
Random Forests paper, 2001).
• The power of ensemble thinking was given popular attention in the 2004
book by James Surowiecki, The Wisdom of Crowds.
• The fundamental mathematics: “the square of a dataset’s average is always
smaller than or equal to the average of the individually-squared datapoints”:
ExpectedValue(Xi^2) >= [ExpectedValue(Xi)]^2 (NOTE: the second term is (Mean of Xi)^2)
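For readers who want a quick sanity check, here is a minimal NumPy sketch of the inequality just stated; the datapoints are arbitrary, chosen only for illustration:

```python
import numpy as np

# Jensen's inequality for the square: mean of squares >= square of the mean.
x = np.array([1.0, 4.0, 4.0, 9.0, 12.0])    # arbitrary illustrative datapoints

mean_of_squares = np.mean(x ** 2)
square_of_mean = np.mean(x) ** 2

print(mean_of_squares, square_of_mean)      # 51.6  36.0
assert mean_of_squares >= square_of_mean
```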
Motivation
• Random Forests, a non-parametric ensemble technique, was unveiled by Leo Breiman in a 2001 paper
he provocatively titled Statistical Modeling: The Two Cultures. He intended it as a wake-up call to the
academic statistical community that he felt was languishing in an old paradigm that “led to irrelevant
theory, questionable conclusions, and has kept statisticians from working on a large range of interesting
current problems.” Random Forest was his capstone contribution (he died soon afterwards) to a newer
approach he labelled “Algorithmic Modeling.”
• Algorithmic Models (often referred to as “Black Boxes”) have proven highly successful at predictive
accuracy, at the cost of considerable opaqueness as to why they work or even what exactly they do.
• Decades of work has been devoted to unraveling the mysteries of Black Box statistical techniques,
Breiman’s Random Forests being a prime target. This presentation is yet another such attempt. Here
we set the modest goal of simply gaining a better understanding of what the technique actually does,
without evaluating why and under what circumstances it might work well. Our approach is to construct
a toy example:
• 1) of low dimensionality, and
• 2) in a discrete Universe where predictors (the X feature vector) define a small hypercube
Random Forests as a Monte Carlo Simulation
• Monte Carlo simulations are often used to empirically describe distributions which either have no
closed-form solution or have a solution intractably difficult to specify. The strategy of this presentation is
to treat Random Forests as if it were a simulation and, using a simplified version and a small dimensional
example, to exhaustively specify the Universe it is attempting to analyze. We can then deterministically
examine what Random Forests is doing.
• “Randomness” in Random Forests is a shortcut technique that high dimensionality and predictors
(features) measured as continuous variables make necessary. What if we removed those traits?
• Indeed, in 1995 Tin Kam Ho of Bell Labs, who first coined the term “Random Forest” and introduced the use
of feature subspaces to decouple trees (the key innovation used in Breiman’s version), viewed
randomness as simply an enabling technique, not something fundamental. She writes:
Our method to create multiple trees is to construct trees in randomly selected subspaces of the feature space.
For a given feature space of m [binary] dimensions there are 2^m subspaces in which a decision tree can be
constructed. The use of randomness in selecting components of the feature space vector is merely a
convenient way to explore the possibilities. [italics added]
Defining a “Discrete” Universe: Levels of Measurement
• Four levels of data measurement are commonly recognized:
• Nominal or Categorical (e.g., Race: White, Black, Hispanic, etc.)
• Ordinal (e.g., Small, Medium, Large)
• Interval (e.g., 1, 2, 3, 4, etc. where it is assumed that distance between numbers are the same)
• Ratio (interval-level data with a true zero, so that any two ratios of numbers can be compared; 2/1 = 4/2)
• Most statistical analyses (e.g., Regression or Logistic Regression) make an often-unspoken
assumption that data are at least interval.
• Even when expressed as a number (e.g., a FICO score) the metric may not be truly interval level. Yet often
such measures are treated as if they were interval. FICO scores are, in fact, only ordinal but are typically
entered into models designed to employ interval-level data.
• For our purposes, a second data measurement distinction is also important
• Continuous (with infinite granularity between any two Interval data points), and
• Discrete (with a finite region between (or around) data points such that within each region, no
matter the distance, all characteristics are identical). Within an ordinal-level region, say “Small,”
all associated attributes (say, height of men choosing to select Small shirt sizes) are deemed to be
representable as a single metric. Of course, measured attributes (e.g., heights) of Small shirt-
wearers will differ, but again we choose to represent the attribute with a single summary metric.
Defining a “Discrete” Universe (continued)
• Our uncomfortable definition of a discretely measured feature variable warrants more discussion.
• In statistical practice (e.g., Credit modeling) one of the most common pre-processing steps is to group
predictors (feature variables) into discrete ordinal groupings (e.g., 850<=FICO, 750<=FICO<850,
700<=FICO<750, 675<=FICO<700, FICO<675). Notice that we have created 5 ordinal-measured groups,
say A B C D E, and that each need not contain the same range of FICO points. As mentioned earlier,
often these are (wrongly) labelled numerically 1 2 3 4 5 and entered into (e.g.) a Logistic Regression.
• More correctly these groupings should be treated as Categorical variables and if entered into a
Regression or Logistic Regression they should be coded as Dummy variables. But importantly, we have
now clearly created discrete variables. Our analysis uses grouped scores as predictors and our results
cannot say anything about different effects of values more granular than at the group level. The same is
true of all men choosing to wear Small shirt sizes. We observe only their shirt size and can model their
predicted height only as a single estimate.
• At a more philosophical level, all real datasets might be argued to contain, at best, ordinal-level discrete
predictors. All observed data points are measured discretely (at some finite level of precision) and must
be said to represent a grouping of unobserved data below that level of precision. Continuity, with an
infinite number of points lying between each observation no matter how granular, is a mathematical
convenience. What we measure we measure discretely.
• Accordingly, this presentation models predictors as ordinal discrete variables. One easy way to
conceptualize this is to imagine that we start with ordered predictors (e.g., FICO scores) that we group.
In our toy example we have 3 predictors that we have grouped into 3 levels each.
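As a concrete illustration of the kind of grouping described above, the sketch below bins a handful of made-up FICO scores into the five ordinal bands A-E from the earlier slide; np.digitize is used here only as one convenient way to apply the cut-points, not as anything prescribed by the deck:

```python
import numpy as np

# Hypothetical FICO scores grouped into the five ordinal bands mentioned earlier:
# FICO < 675 | 675-699 | 700-749 | 750-849 | >= 850  (labelled E, D, C, B, A)
scores = np.array([612, 688, 703, 748, 805, 850])   # made-up values
cut_points = [675, 700, 750, 850]

# np.digitize returns 0..4, the index of the ordinal band.
band_index = np.digitize(scores, cut_points)
bands = np.array(list("EDCBA"))[band_index]
print(list(zip(scores.tolist(), bands.tolist())))
# [(612, 'E'), (688, 'D'), (703, 'C'), (748, 'C'), (805, 'B'), (850, 'A')]
```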
Our Toy Universe:
Exploring it with an Unsupervised
Random Forest Algorithm
Our Toy Universe
• Consider a 3-variable, 3-levels-per-variable (2 cut-points)
example. It looks a bit like a small Rubik’s cube. Below is a
cutaway of its top, middle and bottom levels, each containing 9
cells for a total of 27.
Each of the 27 cells making up the Universe can be identified in 2 ways
1. By its position spatially along each of the 3 axes (e.g., A=1, B=1, C=1)
2. Equivalently by its cell letter (e.g., k)
It is important to note that we have so far only discussed (and illustrated above) exclusively Predictor Variables/Features (A,
B, C) and the 3 levels of each (e.g., A0, A1, and A2). Any individual cell (say, t) is associated only with its location in the
Predictor grid (t is found at location A2, B0, C1). Nothing has been said about dependent variable values (the target Y values,
which Random Forests is ultimately meant to predict). This is purposeful. It is instructive to demonstrate that without
observing Y values a Random Forest-like algorithm can be run as an Unsupervised algorithm that partitions the cube in
specific ways. If the Universe is small enough and algorithm simple enough, we can then enumerate all possible partitionings
that might be used in a Supervised run (with observed Y values). Some subset of these partitionings the Supervised
algorithm will use, but that subset is dependent upon the specific Y values, which we will observe later as a second step.
C=2 (c) (l) (u) C=2 (f) (o) (x) C=2 (i) (r) (aa)
C=1 (b) (k) (t) C=1 (e) (n) (w) C=1 (h) (q) (z)
C=0 (a) (j) (s) C=0 (d) (m) (v) C=0 (g) (p) (y)
A=0 A=1 A=2 A=0 A=1 A=2 A=0 A=1 A=2
B=0 B=1 B=2
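A small Python sketch that enumerates the same 27 cells, using the letter-to-coordinate convention read off the cutaway above (C varies fastest, then B, then A); the letter names are the deck’s, the helper code is our own illustration:

```python
import string

# Enumerate the 27 cells of the toy universe: 3 predictors (A, B, C),
# 3 levels each.  Letters run a..z, then "aa" for the 27th cell,
# with C varying fastest, then B, then A (matching the cutaway charts).
labels = list(string.ascii_lowercase) + ["aa"]

cells = {}          # letter -> (A, B, C)
coords = {}         # (A, B, C) -> letter
for a in range(3):
    for b in range(3):
        for c in range(3):
            letter = labels[9 * a + 3 * b + c]
            cells[letter] = (a, b, c)
            coords[(a, b, c)] = letter

print(cells["t"])         # (2, 0, 1)  -- cell t sits at A=2, B=0, C=1
print(coords[(1, 1, 1)])  # 'n'        -- the middle-most cell
```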
A Random Forest-like Unsupervised Algorithm
• Canonical Random Forest is a Supervised algorithm; our Unsupervised version is only “Random Forest-like.”
• Creating a diversified forest requires each tree growth to be decoupled from all others. Canonical Random
Forest does this in two ways:
• It uses only Boot-strapped subsets of the full Training Set to grow each tree.
• Importantly, at each decision-point (node) only a subset of available variables are selected to determine daughter nodes. This is Ho’s “feature
subspaces selection.”
• How our Random Forest-like algorithm differs from Breiman’s algorithm:
• As with many other analyses of Random Forest, our Unsupervised algorithm uses the full Training Set each time a tree is built.
• Breiman uses CART as the base tree-split method. CART allows only binary splits (only 2 daughter nodes per parent node split). We allow
multiple daughter splits of a parent Predictor variable (as is possible in ID3 and CHAID); in fact, our algorithm insists that if a Predictor (in our
example, the A or B or C dimension) is selected as a node splitter, all possible levels are made daughters.
• Breiman’s algorithm grows each tree to its maximum depth. Our approach grows only two levels (vs. the 3-
levels that a maximum 3-attribute model would allow). This is done for two reasons. First, a full growth
model without boot-strapping would necessarily interpolate the Training Set (which Breiman’s method
typically does not do). And second, because we have only 3 predictors, fully grown tree growths do not
produce very decoupled trees. Stopping short of full growth is motivated by the same considerations as
“pruning” non-random trees.
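A minimal sketch of one draw of this Unsupervised, two-level procedure (the function name is ours; this is a toy illustration, not Breiman’s or Ho’s code):

```python
import random

ATTRS = ["A", "B", "C"]

def draw_navigation(rng: random.Random):
    """One two-level 'tree navigation': a root splitter plus one splitter
    chosen independently by each of its three daughter nodes (mtry = 2)."""
    # Root: sample 2 of the 3 attributes, then split on one of them.
    # Unsupervised, so there is no "better" choice -- pick at random.
    root = rng.choice(rng.sample(ATTRS, 2))
    remaining = [a for a in ATTRS if a != root]
    # Each daughter independently picks between the two unused attributes.
    daughters = tuple(rng.choice(remaining) for _ in range(3))
    return root, daughters

rng = random.Random(0)
print(draw_navigation(rng))   # root splitter and the three daughters' splitters
```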
A Random Forest-like Unsupervised Algorithm (cont’d)
• In our simple unsupervised random forest-like algorithm the critical “feature subset selection
parameter”, often referred to as mtry, is set at a maximum of 2 (out of the full 3). This implies that at
the root level the splitter is chosen randomly with equal probability as either A, B, or C and 3 daughter
nodes are created (e.g., if A is selected at the root level, the next level consists of A0, A1, A2).
• At each of the 3 daughter nodes the same procedure is repeated. Each daughter’s selection is
independent of her sisters. Each chooses a split variable between the 2 available variables. One
variable, the one used in the root split, is not available (A in our example) so each sister in our example
chooses (with equal probability) between B and C. Choosing independently leads to an enumeration of
2^3 = 8 permutations. This example is for the case where the choice of root splitter is A. Because both B
and C could instead be root splitters, the total number of tree growths, each equally likely, is 24.
24 EQUALLY-LIKELY TREE NAVIGATIONS (root splitter, then the three daughters' splitters)

A is the root splitter: BBB, BBC, BCB, BCC, CCC, CCB, CBC, CBB
B is the root splitter: AAA, AAC, ACA, ACC, CCC, CCA, CAC, CAA
C is the root splitter: AAA, AAB, ABA, ABB, BBB, BBA, BAB, BAA
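The 24 navigations can be enumerated directly; the short sketch below (our own illustration) reproduces the table above:

```python
from itertools import product

ATTRS = ["A", "B", "C"]

# Enumerate the 24 equally likely two-level navigations: one of 3 root
# splitters, then each of the 3 daughters independently picks one of the
# 2 remaining attributes (2**3 = 8 combinations per root).
navigations = []
for root in ATTRS:
    remaining = [a for a in ATTRS if a != root]
    for daughters in product(remaining, repeat=3):
        navigations.append((root, daughters))

print(len(navigations))        # 24
print(navigations[:2])         # [('A', ('B', 'B', 'B')), ('A', ('B', 'B', 'C'))]
```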
The 8 unique partitionings when the root partition is A
For each of the 3 beginning root partition choices, our unsupervised random forest-like algorithm divides the cube
into 8 unique collections of 1x3 rectangles. There are 18 such possible rectangles for each of the 3 beginning root
partition choices. Below are the results of the algorithm’s navigations when A is the root partition choice. Two
analogous navigation results charts can be constructed for the root attribute choices of B and C.
[Charts: the 8 navigations with A as the root splitter — A(BBB), A(BBC), A(BCB), A(BCC), A(CCC), A(CCB), A(CBC), A(CBB) — each dividing the cube into nine 1x3 rectangles.
The 18 distinct rectangles that appear across these navigations are:
varying C (A and B fixed): (a,b,c), (d,e,f), (g,h,i), (j,k,l), (m,n,o), (p,q,r), (s,t,u), (v,w,x), (y,z,aa)
varying B (A and C fixed): (a,d,g), (b,e,h), (c,f,i), (j,m,p), (k,n,q), (l,o,r), (s,v,y), (t,w,z), (u,x,aa)]
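A sketch (our own helper, not the deck’s code) that reproduces any one of these partitionings, returning the nine 1x3 rectangles for a navigation such as A(BBB):

```python
import string

LABELS = list(string.ascii_lowercase) + ["aa"]
AX = {"A": 0, "B": 1, "C": 2}                      # axis index per attribute

def letter(a, b, c):
    return LABELS[9 * a + 3 * b + c]

def leaves(root, daughters):
    """Return the nine 1x3 rectangles produced by a two-level navigation,
    e.g. root='A', daughters=('B','B','C') is the deck's A(BBC)."""
    rects = []
    for root_level, d_attr in zip(range(3), daughters):
        free = next(x for x in "ABC" if x not in (root, d_attr))
        for d_level in range(3):
            members = []
            for free_level in range(3):
                coord = [None, None, None]
                coord[AX[root]] = root_level
                coord[AX[d_attr]] = d_level
                coord[AX[free]] = free_level
                members.append(letter(*coord))
            rects.append(tuple(members))
    return rects

print(leaves("A", ("B", "B", "B"))[:3])   # [('a','b','c'), ('d','e','f'), ('g','h','i')]
```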
The 8 unique partitionings when the root partition is A (cont.)
Note that each cell (FOR EXAMPLE, CELL n, the middle-most cell) appears in 2 and only 2 unique 1X3 rectangles
among the 8 navigations when the root partition is A. In the 4 navigations bordered in blue below, n appears in
the brown rectangle (m,n,o). In the 4 navigations bordered in red below, n appears in the dark brown rectangle
(k,n,q).
[Charts: the same 8 navigations A(BBB) … A(CBB) as on the previous slide; 4 of them (bordered in blue on the slide) place cell n in rectangle (m,n,o), the other 4 (bordered in red) place it in rectangle (k,n,q).]
The 8 unique partitionings when the root partition is A (cont.)
Analogously, CELL a (the bottom/left-most cell) also appears in 2 and only 2 unique 1X3 rectangles when the
root partition is A. In the 4 navigations bordered in green below (the top row of navigations) Cell a appears in
the red rectangle (a,b,c). In the 4 navigations bordered in brown below (the bottom row of navigations) Cell a
appears in the pink rectangle (a,d,g). This pattern is true for all 27 cells.
[Charts: the same 8 navigations again; the 4 bordered in green on the slide (the top row) place cell a in rectangle (a,b,c), the 4 bordered in brown (the bottom row) place it in rectangle (a,d,g).]
Random Forests “self-averages”, or Regularizes
Each of our algorithm’s 24 equally-likely tree navigations reduces the granularity of the cube by a factor of
3, from 27 to 9. Each of the original 27 cells finds itself included in exactly two unique 1x3 rectangular
aggregates for each root splitting choice. For example, Cell “a” is in Aggregate (a,b,c) and in Aggregate
(a,d,g) when A is the root split choice (see previous slide). When B is the root split choice (not shown) Cell
“a” appears again in Aggregate (a,b,c) and in a new Aggregate (a,j,s), again in 4 of the 8 navigations each. When C
is the root split choice, Cell “a” appears in the two Aggregates (a,j,s) and (a,d,g).
Continuing with Cell a as our example and enumerating the rectangles into
which Cell a appears in each of the 24 equally-possible navigations we get:
Now recall that reducing the granularity of the cube means that each cell is
now represented by a rectangle which averages the member cells equally
i.e., vector (a,b,c) is the average of a, b, and c. Therefore the 24-navigation
average is [8*(a+b+c) + 8*(a+d+g) + 8*(a+j+s)] /72.
Rearranging = [24(a) + 8(b) + 8(c) + 8(d) + 8(g) + 8(j) + 8(s)] /72
Simplifying = 1/3*(a) + 1/9*(b) + 1/9*(c) + 1/9*(d) + 1/9*(g) + 1/9*(j) + 1/9*(s)
This is our Unsupervised RF-like algorithm’s estimate for the “value” of Cell a in our Toy Example
Count of rectangle occurrences among the 24 navigations, by Root Level Split:

Rectangle   Root = A   Root = B   Root = C   Total
(a,b,c)         4          4          -        8
(a,d,g)         4          -          4        8
(a,j,s)         -          4          4        8
Sums            8          8          8       24
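The same bookkeeping can be automated. The sketch below (again our own illustration) enumerates all 24 navigations and accumulates, for cell a, the weight received by each member of its containing rectangle; it reproduces the 1/3 and 1/9 weights just derived:

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product
import string

LABELS = list(string.ascii_lowercase) + ["aa"]
AX = {"A": 0, "B": 1, "C": 2}

def rectangle_containing(cell, root, daughters):
    """The 1x3 rectangle that holds `cell` (a coordinate triple) under the
    two-level navigation (root splitter, daughters' splitters)."""
    d_attr = daughters[cell[AX[root]]]
    free = next(x for x in "ABC" if x not in (root, d_attr))
    members = []
    for level in range(3):
        coord = list(cell)
        coord[AX[free]] = level
        members.append(LABELS[9 * coord[0] + 3 * coord[1] + coord[2]])
    return tuple(sorted(members))

# Average cell "a" = (0,0,0) over the 24 equally likely navigations,
# each rectangle contributing the equal-weight mean of its 3 member cells.
weights = defaultdict(Fraction)
for root in "ABC":
    remaining = [x for x in "ABC" if x != root]
    for daughters in product(remaining, repeat=3):
        for member in rectangle_containing((0, 0, 0), root, daughters):
            weights[member] += Fraction(1, 24 * 3)

print(dict(weights))
# a gets weight 1/3; b, c, d, g, j, s each get 1/9 -- matching the slide.
```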
Random Forests “self-averages”, or Regularizes (continued)
[Diagram: cell a together with its six axis neighbours b, c, d, g, j, s.]
Because each of the 24 navigations are equally-likely, our ensemble space for
Cell “a” consists of equal parts of Aggregate (a,b,c,), Aggregate (a,d,g) and
Aggregate (a,j,s). Any attributes (e.g., some Target values Y) later assigned to
the 27 original cells would also be averaged over the ensemble space. Note
that because the original cell “a” appears in each of the Aggregates, it gets 3
times the weight that each of the other cells gets. If the original 27 cells are
later all populated with Y values, our Unsupervised algorithm can be seen as
simply regularizing those target Y values. Each regularized estimate equals 3
times the original cell’s Y value plus the Y values of the 6 cells within the
remainder of the 1x3 rectangles lying along each dimension, the sum then
divided by 9. Regularized Cell “a” = (original cells g+d+c+b+j+s+3a)/9.
This formula is true for all 27 cells in our Toy Example.
Numbers 1-27 randomly assigned to cells as Target Y values:

        B=0 (Bottom)              B=1 (Middle)              B=2 (Top)
C=2    5    3   16         C=2   11    4   15         C=2    8   12   26
C=1    2    7   21         C=1   17    1   25         C=1   19   14   27
C=0   22   20   23         C=0    6    9   13         C=0   10   18   24
      A=0  A=1  A=2              A=0  A=1  A=2              A=0  A=1  A=2

Randomly-assigned Y values after Regularization by our RF-like algorithm:

        B=0 (Bottom)                    B=1 (Middle)                    B=2 (Top)
C=2   8 5/9   8 1/9  15 2/3      C=2   9 7/9   7      15 5/9      C=2  11 8/9  12 1/9  20
C=1  10 7/9   9 1/9  18 1/9      C=1  12 7/9   8 7/9  18 7/9      C=1  15      14      23 1/3
C=0  14 2/3  15 7/9  20 5/9      C=0  11 1/9   9 8/9  15 2/3      C=0  14 1/9  15 8/9  21
      A=0     A=1     A=2              A=0     A=1     A=2              A=0     A=1     A=2

Regularization by our Random Forest-like Algorithm reduces variance from 60 2/3 to 19.
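The regularization can be checked numerically. The NumPy sketch below applies the (3*own value + 6 axis neighbours)/9 rule to the 1-27 assignment transcribed from the first chart and prints the resulting variances (the variance convention used on the slide is not stated, so the printed figure may differ slightly from the quoted 19):

```python
import numpy as np

# Target Y values 1..27 randomly assigned to the 27 cells, indexed [A, B, C]
# (transcribed from the deck's Bottom / Middle / Top cutaway charts).
Y = np.empty((3, 3, 3))
grid = {  # (B, C) rows of the three panels; columns are A = 0, 1, 2
    (0, 2): [5, 3, 16],  (0, 1): [2, 7, 21],   (0, 0): [22, 20, 23],
    (1, 2): [11, 4, 15], (1, 1): [17, 1, 25],  (1, 0): [6, 9, 13],
    (2, 2): [8, 12, 26], (2, 1): [19, 14, 27], (2, 0): [10, 18, 24],
}
for (b, c), row in grid.items():
    for a, y in enumerate(row):
        Y[a, b, c] = y

# Regularized cell = (sum of its three axis-aligned 1x3 rectangles) / 9
# = (3*own value + 6 axis neighbours) / 9, as derived on the earlier slide.
R = (Y.sum(axis=0, keepdims=True)
     + Y.sum(axis=1, keepdims=True)
     + Y.sum(axis=2, keepdims=True)) / 9.0

print(R[0, 0, 0])        # 14.666...  (cell "a", matching the chart's 14 2/3)
print(Y.var(), R.var())  # the variance shrinks markedly, as the deck reports
```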
Practical considerations even in a Universe defined as
Discrete: Why real-world analyses require randomness as a simulation expedient
• Even accepting the notion that any set of ordered predictors (some containing hundreds or even
thousands of ordered observations, e.g., Annual Income) can be theoretically conceptualized as
simply Discrete regions, our ability to enumerate all possible outcomes of a Random Forest-like
algorithm as we have done in this presentation is severely limited. Even grouping each ordered
Predictor into a small number of categories does not help when the number of Predictors
(dimensions) is even just modestly large.
• For example, even for an intrinsically 3-level set of ordinal predictors (Small, Medium, and Large), a
25-attribute Universe has 3^25 ≈ 8.5*10^11 (nearly a Trillion) unique X-vector grid cell locations and
nearly a Trillion Y-predictions to make.
• Further complications ensue from using a tree algorithm that restricts node splitting to just two
daughters, a binary splitter. A single attribute of more than 2 levels (say Small, Medium, and Large)
can be used more than once within a single navigation path, say first as “Small or Medium” vs.
“Large” then somewhere further down the path the first daughter can be split “Small” vs. “Medium.”
Indeed, making this apparently small change in using a binary splitter in our Toy Example increases
the enumerations of complete tree growths to an unmanageable 86,400 (see Appendix).
A Final Complication: “Holes” in the Discrete Universe
• Typically, a sample drawn from a large universe will have “holes” in its Predictor Space; by this is meant that
not all predictor (X- vector) instantiations will have Target Y values represented in the sample; and there are
usually many such holes. Recall our Trillion cell example. A large majority of cell locations will likely not have
associated Y values.
• This is a potential problem even for our Toy Universe because if there are holes then the simple
Regularization formula suggested earlier will not work. Recall that formula:
Each regularized estimate equals 3 times the original cell’s Y value plus the Y values of the 6 cells within the remainder of the
1x3 rectangles lying along each dimension, the sum then divided by 9.
• Luckily, the basic approach of averaging the three Aggregates our algorithm defines for each
Regularized Y-value does continue to hold. But now the process has two steps. The first step is to find
an average value for each of the three Aggregates using only those cells that have values. Some
1x3 rectangular Aggregates may have no values (including no original value for the cell for which
we seek a Regularized estimate). The denominator for each Aggregate average calculation is the
count of non-missing cells. The second step is to average the three Aggregate Averages. Again,
only non-missing Aggregate Averages are summed, and the denominator in the final averaging is a
count of only the non-missing Aggregate Averages.
• The next slide compares the Regularized estimates derived earlier from our randomized 1-27 Y-
value full assignment case to the case where 6 cell locations were arbitrarily assigned to be holes
and contained no original values. Those original locations set as holes are colored in blue in the
second graph. Only the middle-most (Cell “n”) Regularized value in the second chart (=17.00)
differs substantially from its value in the first chart. This cell has holes in all but one location
among all three of its Aggregates; only original cell location “e” is non-blank with original value 17.
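A minimal sketch of this two-step averaging, with missing cells represented as NaN (the function name and the NaN convention are our own):

```python
import numpy as np

def regularize_with_holes(Y):
    """Two-step averaging when some cells are missing (NaN).
    Step 1: average each of a cell's three 1x3 rectangles over its
            non-missing members.
    Step 2: average the non-missing rectangle averages."""
    R = np.full(Y.shape, np.nan)
    for idx in np.ndindex(Y.shape):
        rect_means = []
        for axis in range(3):
            slicer = list(idx)
            slicer[axis] = slice(None)          # the 1x3 rectangle along `axis`
            rect = Y[tuple(slicer)]
            if not np.all(np.isnan(rect)):
                rect_means.append(np.nanmean(rect))
        if rect_means:
            R[idx] = np.mean(rect_means)
    return R

# With no holes this reduces to the earlier (3*own + 6 neighbours)/9 formula.
Y = np.arange(1.0, 28.0).reshape(3, 3, 3)
assert np.allclose(regularize_with_holes(Y),
                   (Y.sum(0, keepdims=True) + Y.sum(1, keepdims=True)
                    + Y.sum(2, keepdims=True)) / 9.0)
```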
Regularization of Universe with Initial Holes (second chart). Variance: 15.7

Randomly-assigned Y values after Regularization by our RF-like algorithm (no holes, repeated from the earlier slide):

        B=0 (Bottom)                    B=1 (Middle)                    B=2 (Top)
C=2   8 5/9   8 1/9  15 2/3      C=2   9 7/9   7      15 5/9      C=2  11 8/9  12 1/9  20
C=1  10 7/9   9 1/9  18 1/9      C=1  12 7/9   8 7/9  18 7/9      C=1  15      14      23 1/3
C=0  14 2/3  15 7/9  20 5/9      C=0  11 1/9   9 8/9  15 2/3      C=0  14 1/9  15 8/9  21
      A=0     A=1     A=2              A=0     A=1     A=2              A=0     A=1     A=2

Regularized values when 6 cells' original Y-values are missing (those cells were shown in blue on the original slide):

        B=0 (Bottom)               B=1 (Middle)               B=2 (Top)
C=2    8.56   9.00  15.67   C=2  10.78  10.25  15.33   C=2  11.89  12.61  20.00
C=1   11.28  11.50  18.50   C=1  13.67  17.00  18.33   C=1  16.00  19.00  24.22
C=0   14.67  17.39  20.56   C=0  11.17  14.25  14.50   C=0  14.11  17.11  21.00
      A=0    A=1    A=2          A=0    A=1    A=2          A=0    A=1    A=2
The Optimization Step
• Most discussions of Random Forests spend considerable space describing how the base tree
algorithm makes its split decision at each node (e.g., examining Gini or Information Value). In our
enumeration approach this optimization step (where we attempt to separate signal from noise)
is almost an afterthought. We know all possible partitions into which our simple cube CAN be
split. It is only left to determine, given our data, which specific navigations WILL be chosen.
• In our toy example this step is easy. Recall that at the root level 1 of our 3 attributes is randomly
excluded from consideration (this is the effect of setting mtry=2). Our data (the 6-hole case) shows
that if A is not the excluded choice (which is 2/3 of the time) then the A Attribute offers the most
information gain. At the second level the 3 daughters’ choices leads to the full navigation choice
A(CCB). When A is excluded (1/3 of the time) C offers the best information gain and after
examining the optimal daughters’ choices the full navigation choice is C(BAA). The optimal
Regularized estimates are simply the (2/3)*A(CCB)estimates + (1/3)*C(BAA)estimates.
Optimized Regularization Estimates weighting 2/3 A(CCB) and 1/3 C(BAA)

        B=0                        B=1                        B=2
C=2    8.00   7.50  19.67   C=2   8.00   7.50  15.67   C=2   8.00   7.50  23.44
C=1   12.67  14.94  21.33   C=1  12.67  14.94  17.33   C=1  12.67  14.94  25.11
C=0   15.67  19.89  20.56   C=0  11.61  15.83  12.50   C=0  14.11  18.33  22.78
      A=0    A=1    A=2          A=0    A=1    A=2          A=0    A=1    A=2
An Example using Real Data:
The Kaggle Titanic Disaster Dataset
The Kaggle Titanic Disaster Dataset
• Kaggle is a website hosting hundreds of Machine Learning datasets and featuring contests that
allow practitioners to benchmark the predictive power of their models against peers.
• Kaggle recommends that beginners use the Titanic Disaster dataset as their first contest. It is a well-
specified, familiar and real example whose goal (predicting who among the ship’s passengers will
survive given attributes known about them from the ship’s manifest) is intuitive. The details of the
beginners’ contest (including a video) can be found here: kaggle.com/competitions/titanic .
• Up until now our analysis has used Regression Trees as our base learners. That is, we assumed the
Target (Y) values we were attempting to estimate were real, interval level variables. The Titanic
problem is a (binary) classification problem, either the passenger survived or did not (coded 1 or 0).
Nevertheless, we will continue to use Regression Trees as base learners for our Titanic analysis, with
each tree’s estimates (as well as their aggregations) falling between 1 and 0. As a last step, each
aggregated estimate >.5 will then be converted to a “survive” prediction, otherwise, “not survive”.
The Kaggle Titanic Disaster Dataset (continued)
• Excluding the record identifier and binary Target value (Y), the dataset contains 10 raw candidate
predictor fields for each of 891 Training set records: Pclass (1st,2nd, or 3rd class accommodations),
Name, Sex, Age, SibSp (count of siblings traveling with passenger), Parch (count of parents and
children traveling with passenger), Ticket (alphanumerical ticket designator), Fare (cost of passage
in £), Cabin (alphanumerical cabin designator), Embark (S,C,Q one of 3 possible embarkation sites).
• For our modeling exercise only Pclass, Sex, Age, and a newly constructed variable Family_Size
(= SibSp + Parch + 1) were ultimately used.*
• Our initial task was to first pre-process each of these 4 attributes into 3 levels (where possible).
Pclass broke naturally into 3 levels. Sex broke naturally into only 2. The raw interval-level attribute
Age contained many records with “missing values” which were initially treated as a separate level
with the non-missing values bifurcated at the cut-point providing the highest information gain (0-6
years of age, inclusive vs. >6 years of age). However, it was determined that there was no statistical
reason to treat “missing value” records differently from aged > 6 records and these two initial levels
were combined leaving Age a 2-level predictor (Aged 0-6, inclusive vs NOT Aged 0-6, inclusive)
• The integer, ordinal-level Family_Size attribute was cut into 3 levels using the optimal information
gain metric into 2-4 members (small family); >4 members (big family); 1 member (traveling alone).
*Before it was decided to exclude the Embark attribute it was discovered that 2 records lacked values for that attribute and those
records were excluded from the model building, reducing the Training set record count to 889.
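For concreteness, a pandas sketch of this pre-processing, assuming the standard Kaggle train.csv column names (Pclass, Sex, Age, SibSp, Parch, Embarked); the cut-points follow the choices described above, while the filename and the new column names are our own:

```python
import pandas as pd

# Sketch of the pre-processing described above (assumes Kaggle's train.csv).
df = pd.read_csv("train.csv")

# Drop the two records with a missing Embarked value (per the footnote).
df = df.dropna(subset=["Embarked"])

# Age: 2 levels -- young child (0-6 inclusive) vs everyone else.
# Missing ages compare as False and so fall into the "not young child" level,
# matching the deck's decision to merge missing ages with the >6 group.
df["Young_Child"] = df["Age"] <= 6

# Family_Size = SibSp + Parch + 1, cut into 3 ordinal levels.
family_size = df["SibSp"] + df["Parch"] + 1
df["Family_Level"] = pd.cut(family_size, bins=[0, 1, 4, 99],
                            labels=["Alone", "Small_2_4", "Big_gt4"])

print(df[["Pclass", "Sex", "Young_Child", "Family_Level"]].head())
```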
Approaching our RF-like analysis from a different angle
• Our pre-processed Titanic predictor “universe” comprises 36 cells (=3x2x2x3) and we could proceed
with a RF-like analysis in the same manner as we did with our 27 cell Toy Example. The pre-
processed Titanic universe appears only slightly more complex than our Toy Example (4-dimensional
rather than 3), but the manual tedium required grows exponentially. There is a better way.
• Note that up until this current “Real Data” section all but the last slide (Slide 22) focused exclusively
on the unsupervised version of our proposed RF-like algorithm; only Slide 22 talked about
optimization (“separating signal from noise”). As noted earlier, this was purposeful. We were able to
extract an explicit regularization formula for our Toy Example (without holes, Slide 17). Higher
dimensionality and the presence of “holes” complicate deriving any generalized statement, but at
least our example provides a sense for what canonical random forest might be doing spatially (it
averages over a set of somehow-related, rectangularly-defined “neighbors” and ignores completely
all other cells). This gives intuitive support for those who see RF’s success as stemming from its relationship to:
• “Nearest Neighbors”: Lin, Yi and Jeon, Yongho, Random Forests and Adaptive Nearest Neighbors, Journal of
the American Statistical Association, Vol. 101 (2006); or the totally randomized trees model of Geurts,
Pierre, Damien Ernst, and Louis Wehenkel, Extremely Randomized Trees, Machine Learning 63.1
(2006), pp. 3-42;
• “Implicit Regularization”: Mentch, Lucas and Zhou, Siyu, Randomization as Regularization: A Degrees of
Freedom Explanation for Random Forest Success, Journal of Machine Learning Research 21 (2020), pp. 1-36.
Approaching our RF-like analysis from a different angle:
Starting at the optimization step
• Notice that for all the work we did to explicitly define exactly what all 24 possible tree navigations
would look like in our Toy Example, we ended up only evaluating 2. But we could have known in
advance what those 2 would be, if not explicitly, then by way of a generic description.
• Knowing that at the root level an optimization would pick the best attribute to split on given the
choices available, our first question is: would that first-best attribute, either A or B or C, be available?
There are 3 equally likely 2-attribute subsets at the Toy Example’s root level: AB, AC, BC. Note that no
matter which of the 3 variables turns out to be the optimal one, it will be available 2 out of 3 times.
• Recall our RF-like model’s approach of selecting, at any level, from among only a subset of attributes
not previously used in the current branch. Is this “restricted choice” constraint relevant at levels
deeper than the root level? In the Toy Example the answer is “no.” At the second level only 2
attributes are available for each daughter node to independently choose among and, although those
independent choices may differ, they will be optimal for each daughter.
• Because we stop one level short of exhaustive navigation (recall, to avoid interpolating the Training
Set), we are done. This is why the final estimates in our Toy Example involved simply weighting the
estimates of two navigations: (2/3)*A(CCB)estimates + (1/3)*C(BAA)estimates. The last navigation
results in the 1 time out of 3 that A (which turns out to be the optimal) is not available at the root level.
A Generic Outline of the Optimization Step for our Titanic
Disaster Data Set
• A mapping of the relevant tree navigations for the Titanic predictor dataset we have constructed is
more complex than that for the Toy Example, principally because we require a 3-deep branching
structure and have 4 attributes to consider rather than 3. We will continue, however, to choose only
2 from among currently available (unused) attributes at every level (i.e., mtry=2).
• At the root level, 2 of the available 4 are randomly selected resulting in 6 possible combinations:
(AB), (AC), (AD), (BC), (BD), (CD). Note that no matter which of the 4 variables turns out to be the
optimal one, it will be available 3 out of 6 times. For the ½ of the times that the first-best optimizing
attribute (as measured by information gain) is not among the chosen couplet, then the second-best
attribute is available 2 out of those 3 times; the second-best solution is therefore available for 1/3 of
the total draws (1/2 * 2/3). The third-best solution is the chosen only when the 2 superior choices
are unavailable, 1/6 of time.
• For the Titanic predictor set as we have constructed it, the root level choices are:
• Sex (First-Best), weighted 1/2
• PClass (Second-Best), weighted 1/3
• Family_Size (Third-Best), weighted 1/6
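These root-level weights follow from simple counting over the 6 couplets; a small sketch (our own check) confirms them:

```python
from fractions import Fraction
from itertools import combinations

# With 4 candidate attributes and mtry = 2, enumerate the 6 possible root-level
# couplets and count how often the best available choice is the 1st-, 2nd- or
# 3rd-best attribute overall (ranked here as 1 = best ... 4 = worst).
ranked = [1, 2, 3, 4]
couplets = list(combinations(ranked, 2))
counts = {1: 0, 2: 0, 3: 0}
for pair in couplets:
    counts[min(pair)] += 1          # the split uses the better of the two

for rank, n in counts.items():
    print(rank, Fraction(n, len(couplets)))
# 1 -> 1/2, 2 -> 1/3, 3 -> 1/6  (the root-level weights used above)
```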
A Generic Outline of the Optimization Step for our Titanic
Disaster Data Set (continued)
• For each of the 3 choices of Root Level attribute, the 2 or 3 daughter nodes each independently
chooses the best attribute from among those remaining. If, for example, A is taken to be the Root
Level splitter it is no longer available at Level 2 whose choices are B,C, or D. Taken 2 at a time, the
possible couplets are (BC), (BD), (CD). 2/3 of the time the First-Best splitter is available. At deeper
levels the optimal attribute is always chosen. The weightings of each of 20 navigations are below.
Level 1 (ROOT) First Best Choice (Sex) 1/2
Level 2 Daughter 1 Daughter 2 Weighting
tree 1 First Best First Best 2/9
tree 2 First Best Second Best 1/9
tree 3 Second Best First Best 1/9
tree 4 Second Best Second Best 1/18
Level 1 (ROOT) 2nd-Best Choice (PClass) 1/3
Level 2 Daughter 1 Daughter 2 Daughter 3 Weighting
tree 5 First Best First Best First Best 8/81
tree 6 First Best First Best Second Best 4/81
tree 7 First Best Second Best First Best 4/81
tree 8 First Best Second Best Second Best 2/81
tree 9 Second Best First Best First Best 4/81
tree 10 Second Best First Best Second Best 2/81
tree 11 Second Best Second Best First Best 2/81
tree 12 Second Best Second Best Second Best 1/81
Level 1 (ROOT) 3rd-Best (Family_Size) 1/6
Level 2 Daughter 1 Daughter 2 Daughter 3 Weighting
tree 13 First Best First Best First Best 4/81
tree 14 First Best First Best Second Best 2/81
tree 15 First Best Second Best First Best 2/81
tree 16 First Best Second Best Second Best 1/81
tree 17 Second Best First Best First Best 2/81
tree 18 Second Best First Best Second Best 1/81
tree 19 Second Best Second Best First Best 1/81
tree 20 Second Best Second Best Second Best 1/162
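A short sketch (our own check) that regenerates the 20 navigations and their weights and confirms the weights sum to 1:

```python
from fractions import Fraction
from itertools import product

# Enumerate the 20 weighted navigations in the table above: three possible
# root splitters (with the root-level weights just derived), then each
# daughter independently gets its first-best splitter with probability 2/3
# and its second-best with probability 1/3.
roots = [("Sex", Fraction(1, 2), 2),          # (name, root weight, #daughters)
         ("Pclass", Fraction(1, 3), 3),
         ("Family_Size", Fraction(1, 6), 3)]
daughter_probs = {"First Best": Fraction(2, 3), "Second Best": Fraction(1, 3)}

trees = []
for name, w_root, n_daughters in roots:
    for choices in product(daughter_probs, repeat=n_daughters):
        w = w_root
        for c in choices:
            w *= daughter_probs[c]
        trees.append((name, choices, w))

print(len(trees))                              # 20
print(sum(w for _, _, w in trees))             # 1  (the weights are a full mix)
print(trees[0])     # ('Sex', ('First Best', 'First Best'), Fraction(2, 9))
```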
Predictions for the Titanic Contest using our RF-like algorithm
• The weighted sum of the first 5 trees (out of 20) is displayed in separate Male and Female charts on
this and the next slide. An Average (normalized) metric of >=.5 implies a prediction of “Survived”,
otherwise “Not.” These first 5 trees represent nearly 60% of the weightings of a complete RF-like
algorithm’s weighted opinion and are unlikely to be different from a classification based on all 20.
Sex Pclass Young_Child? Family_Size Trees 1&2 Trees 3&4 Tree5 Average
weight 1/3 weight 1/6 weight 8/81 (normalized)
Male 3rd NOT Young_Child Family_2_3_4 0.123 0.123 0.327 0.157
Male 3rd Family_Alone 0.123 0.123 0.210 0.137
Male 3rd Family_Big 0.123 0.123 0.050 0.111
Male 3rd Young_Child Family_2_3_4 1.000 0.429 0.933 0.830
Male 3rd Family_Alone H 0.667 0.429 1.000 0.656
Male 3rd Family_Big 0.111 0.429 0.143 0.205
Male 2nd NOT Young_Child Family_2_3_4 0.090 0.090 0.547 0.165
Male 2nd Family_Alone 0.090 0.090 0.346 0.132
Male 2nd Family_Big 0.090 0.090 1.000 0.240
Male 2nd Young_Child Family_2_3_4 1.000 1.000 1.000 1.000
Male 2nd Family_Alone H 0.667 1.000 0.346 0.707
Male 2nd Family_Big H 0.111 1.000 1.000 0.505
Male 1st NOT Young_Child Family_2_3_4 0.358 0.358 0.735 0.420
Male 1st Family_Alone 0.358 0.358 0.523 0.385
Male 1st Family_Big 0.358 0.358 0.667 0.409
Male 1st Young_Child Family_2_3_4 1.000 1.000 0.667 0.945
Male 1st Family_Alone H 0.667 1.000 0.523 0.736
Male 1st Family_Big H 0.111 1.000 0.667 0.450
MALES
Predictions for the Titanic Contest using our RF-like algorithm (cont’d)
• Generally, most females Survived, and most males did not. Predicted exceptions are highlighted in
Red and Green. A complete decision rule would read “Predict young males (“children”) in 2nd Class
survive along with male children in 3rd and 1st Classes not in Big Families; all other males perish. All
females survive except those in 3rd Class traveling in Big Families and female children in 1st Class.”
Sex Pclass Young_Child? Family_Size Trees 1&3 Trees 2&4 Tree5 Average
weight 1/3 weight 1/6 weight 8/81 (normalized)
Female 3rd NOT Young_Child Family_2_3_4 0.561 0.561 0.327 0.522
Female 3rd Family_Alone 0.617 0.617 0.210 0.550
Female 3rd Family_Big 0.111 0.111 0.050 0.101
Female 3rd Young_Child Family_2_3_4 0.561 0.561 0.933 0.622
Female 3rd Family_Alone 0.617 0.617 1.000 0.680
Female 3rd Family_Big 0.111 0.111 0.143 0.116
Female 2nd NOT Young_Child Family_2_3_4 0.914 0.929 0.547 0.858
Female 2nd Family_Alone 0.914 0.906 0.346 0.818
Female 2nd Family_Big 0.914 1.000 1.000 0.952
Female 2nd Young_Child Family_2_3_4 1.000 0.929 1.000 0.980
Female 2nd Family_Alone H 1.000 0.906 0.346 0.866
Female 2nd Family_Big H 1.000 1.000 1.000 1.000
Female 1st NOT Young_Child Family_2_3_4 0.978 0.964 0.735 0.934
Female 1st Family_Alone 0.978 0.969 0.523 0.900
Female 1st Family_Big 0.978 1.000 0.667 0.933
Female 1st Young_Child Family_2_3_4 0.000 0.964 0.667 0.378
Female 1st Family_Alone H 0.000 0.969 0.523 0.356
Female 1st Family_Big H 0.000 1.000 0.667 0.388
FEMALES
Performance of our RF-like algorithm on Titanic test data
• Although not the focus of this presentation, it is instructive to benchmark the performance of our
simple “derived Rules Statement” (based on our RF-like model) against models submitted by others.
• Kaggle provides a mechanism by which one can submit a file of predictions for a Test set of passengers
that is provided without labels (i.e., “Survived or not”). Kaggle will return a performance evaluation
summarized as the % of Test records predicted correctly.
• Simply guessing that “all Females survive and all Males do not” results in a score of 76.555%. Our
Rules Statement gives an only modestly better score of 77.751%. A “great” score is >80% (estimated to
place one in the top 6% of legitimate submissions). Improvements derive from tedious “feature-
engineering.” In contrast, our data prep involved no feature-engineering except for the creation of a
Family_Size construct (= SibSp + Parch + 1). Also, we only coarsely aggregated our 4 employed predictors
into no more than 3 levels. Our goal was to uncover the mechanism behind a RF-like model rather
than simply (and blindly) “get a good score.” Interestingly, the canonical Random Forest example
featured in the Titanic tutorial gives a score of 77.511%, consistent with (actually, slightly worse than)
our own RF-like set of derived Decision Rules; again, both without any heavy feature-
engineering. Reference: https://www.kaggle.com/code/carlmcbrideellis/titanic-leaderboard-a-score-0-8-is-great.
APPENDIX: The problem of using binary splitters
• Treat our toy “3-variable, 2 split points each” set-up as a simple 6-variable set of binary cut-points.
• Splitting on one of among 6 Top-level cut-points results in a pair of child nodes. Independently
splitting on both right and left sides of this first pair involves a choice from among 5 possible Level
2 splits per child. The independent selection per child means that the number of possible
paths at Level 2 is 5^2 (=25). By Level 2 we have made 2 choices, the choice of Top-level cut
and the choice of second-level pair; one path out of 6*25 (=150). The left branch of each second-
level pair now selects from its own 4 third-level branch pair options; so too does the right branch,
independently. This independence implies that there are 4^2 (=16) third-level choices. So, by Level 3
we will have chosen one path from among 16*150 (=2,400) possibilities. Proceeding accordingly,
at Level 4 each of these Level 3 possibilities chooses independent left and right pairs from among 3
remaining cut-points, so 3^2 (=9) total possibilities per each of the 2,400 Level 3 paths (=21,600). At Level 5, 2^2
(=4) is multiplied in, but at Level 6 there is only one possible splitting option, and the factor is 1^2
(=1). The final number of possible paths for our “simple” 27-discrete-cell toy example is 86,400
(=4*21,600).
• In summary, there are 6 * 5^2 * 4^2 * 3^2 * 2^2 * 1^2 = 86,400 equally likely configurations. This is the
universe from which a machine-assisted Monte Carlo simulation would sample.
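The product can be checked in one line (a sketch of the arithmetic only):

```python
import math

# Number of distinct full binary-splitter tree paths in the appendix's
# accounting: 6 top-level cut-points, then at each deeper level the left and
# right children independently choose among the remaining cut-points.
levels = [6] + [k * k for k in (5, 4, 3, 2, 1)]   # 6, 25, 16, 9, 4, 1
print(levels, math.prod(levels))                  # ... 86400
```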

More Related Content

Similar to Random Forests without the Randomness June 16_2023.pptx

PRML Chapter 2
PRML Chapter 2PRML Chapter 2
PRML Chapter 2Sunwoo Kim
 
Master of Computer Application (MCA) – Semester 4 MC0079
Master of Computer Application (MCA) – Semester 4  MC0079Master of Computer Application (MCA) – Semester 4  MC0079
Master of Computer Application (MCA) – Semester 4 MC0079Aravind NC
 
Conditional Correlation 2009
Conditional Correlation 2009Conditional Correlation 2009
Conditional Correlation 2009yamanote
 
SPATIAL POINT PATTERNS
SPATIAL POINT PATTERNSSPATIAL POINT PATTERNS
SPATIAL POINT PATTERNSLiemNguyenDuy
 
statistics - Populations and Samples.pdf
statistics - Populations and Samples.pdfstatistics - Populations and Samples.pdf
statistics - Populations and Samples.pdfkobra22
 
OBC | Complexity science and the role of mathematical modeling
OBC | Complexity science and the role of mathematical modelingOBC | Complexity science and the role of mathematical modeling
OBC | Complexity science and the role of mathematical modelingOut of The Box Seminar
 
Chap_10_Object_Recognition.pdf
Chap_10_Object_Recognition.pdfChap_10_Object_Recognition.pdf
Chap_10_Object_Recognition.pdfSsdSsd5
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learningKnoldus Inc.
 
Digital Image Processing.pptx
Digital Image Processing.pptxDigital Image Processing.pptx
Digital Image Processing.pptxMukhtiarKhan5
 
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalA Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalBhaskar Mitra
 
Stuart russell and peter norvig artificial intelligence - a modern approach...
Stuart russell and peter norvig   artificial intelligence - a modern approach...Stuart russell and peter norvig   artificial intelligence - a modern approach...
Stuart russell and peter norvig artificial intelligence - a modern approach...Lê Anh Đạt
 
Conceptual Spaces for Cognitive Architectures: A Lingua Franca for Different ...
Conceptual Spaces for Cognitive Architectures: A Lingua Franca for Different ...Conceptual Spaces for Cognitive Architectures: A Lingua Franca for Different ...
Conceptual Spaces for Cognitive Architectures: A Lingua Franca for Different ...Antonio Lieto
 
SVM Based Identification of Psychological Personality Using Handwritten Text
SVM Based Identification of Psychological Personality Using Handwritten Text SVM Based Identification of Psychological Personality Using Handwritten Text
SVM Based Identification of Psychological Personality Using Handwritten Text IJERA Editor
 
Identifying classes and objects ooad
Identifying classes and objects ooadIdentifying classes and objects ooad
Identifying classes and objects ooadMelba Rosalind
 

Similar to Random Forests without the Randomness June 16_2023.pptx (20)

Large Deviations: An Introduction
Large Deviations: An IntroductionLarge Deviations: An Introduction
Large Deviations: An Introduction
 
PRML Chapter 2
PRML Chapter 2PRML Chapter 2
PRML Chapter 2
 
Master of Computer Application (MCA) – Semester 4 MC0079
Master of Computer Application (MCA) – Semester 4  MC0079Master of Computer Application (MCA) – Semester 4  MC0079
Master of Computer Application (MCA) – Semester 4 MC0079
 
Data analysis05 clustering
Data analysis05 clusteringData analysis05 clustering
Data analysis05 clustering
 
Conditional Correlation 2009
Conditional Correlation 2009Conditional Correlation 2009
Conditional Correlation 2009
 
SPATIAL POINT PATTERNS
SPATIAL POINT PATTERNSSPATIAL POINT PATTERNS
SPATIAL POINT PATTERNS
 
statistics - Populations and Samples.pdf
statistics - Populations and Samples.pdfstatistics - Populations and Samples.pdf
statistics - Populations and Samples.pdf
 
What is a histogram
What is a histogramWhat is a histogram
What is a histogram
 
OBC | Complexity science and the role of mathematical modeling
OBC | Complexity science and the role of mathematical modelingOBC | Complexity science and the role of mathematical modeling
OBC | Complexity science and the role of mathematical modeling
 
Chap_10_Object_Recognition.pdf
Chap_10_Object_Recognition.pdfChap_10_Object_Recognition.pdf
Chap_10_Object_Recognition.pdf
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
 
Digital Image Processing.pptx
Digital Image Processing.pptxDigital Image Processing.pptx
Digital Image Processing.pptx
 
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalA Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information Retrieval
 
Class9_PCA_final.ppt
Class9_PCA_final.pptClass9_PCA_final.ppt
Class9_PCA_final.ppt
 
Dissertation Paper
Dissertation PaperDissertation Paper
Dissertation Paper
 
Stuart russell and peter norvig artificial intelligence - a modern approach...
Stuart russell and peter norvig   artificial intelligence - a modern approach...Stuart russell and peter norvig   artificial intelligence - a modern approach...
Stuart russell and peter norvig artificial intelligence - a modern approach...
 
Cluster Analysis.pptx
Cluster Analysis.pptxCluster Analysis.pptx
Cluster Analysis.pptx
 
Conceptual Spaces for Cognitive Architectures: A Lingua Franca for Different ...
Conceptual Spaces for Cognitive Architectures: A Lingua Franca for Different ...Conceptual Spaces for Cognitive Architectures: A Lingua Franca for Different ...
Conceptual Spaces for Cognitive Architectures: A Lingua Franca for Different ...
 
SVM Based Identification of Psychological Personality Using Handwritten Text
SVM Based Identification of Psychological Personality Using Handwritten Text SVM Based Identification of Psychological Personality Using Handwritten Text
SVM Based Identification of Psychological Personality Using Handwritten Text
 
Identifying classes and objects ooad
Identifying classes and objects ooadIdentifying classes and objects ooad
Identifying classes and objects ooad
 

Recently uploaded

(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 

Recently uploaded (20)

(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
• 5. Random Forests as a Monte Carlo Simulation
• Monte Carlo simulations are often used to empirically describe distributions which either have no closed-form solution or have a solution intractably difficult to specify. The strategy of this presentation is to treat Random Forests as if it were a simulation and, using a simplified version and a low-dimensional example, to exhaustively specify the Universe it is attempting to analyze. We can then deterministically examine what Random Forests is doing.
• "Randomness" in Random Forests is a shortcut technique made necessary by high dimensionality and by predictors (features) measured as continuous variables. What if we removed those traits?
• Indeed, in 1995 Tin Kam Ho of Bell Labs, who first coined the term "Random Forest" and introduced the use of feature subspaces to decouple trees (the key innovation used in Breiman's version), viewed randomness as simply an enabling technique, not something fundamental. She writes: Our method to create multiple trees is to construct trees in randomly selected subspaces of the feature space. For a given feature space of m [binary] dimensions there are 2^m subspaces in which a decision tree can be constructed. The use of randomness in selecting components of the feature space vector is merely a convenient way to explore the possibilities. [italics added]
• 6. Defining a "Discrete" Universe: Levels of Measurement
• Four levels of data measurement are commonly recognized:
• Nominal or Categorical (e.g., Race: White, Black, Hispanic, etc.)
• Ordinal (e.g., Small, Medium, Large)
• Interval (e.g., 1, 2, 3, 4, etc., where it is assumed that the distances between numbers are the same)
• Ratio (interval-level data that have a true zero, so that any two ratios of numbers can be compared; 2/1 = 4/2)
• Most statistical analyses (e.g., Regression or Logistic Regression) make an often-unspoken assumption that data are at least interval.
• Even when expressed as a number (e.g., a FICO score) the metric may not be truly interval level. Yet often such measures are treated as if they were interval. FICO scores are, in fact, only ordinal but are typically entered into models designed to employ interval-level data.
• For our purposes, a second data measurement distinction is also important:
• Continuous (with infinite granularity between any two interval data points), and
• Discrete (with a finite region between (or around) data points such that within each region, no matter the distance, all characteristics are identical). Within an ordinal-level region, say "Small," all associated attributes (say, heights of men choosing to select Small shirt sizes) are deemed to be representable as a single metric. Of course, measured attributes (e.g., heights) of Small shirt-wearers will differ, but again we choose to represent the attribute with a single summary metric.
  • 7. Defining a “Discrete” Universe (continued) • Our uncomfortable definition of a discretely measured feature variable warrants more discussion. • In statistical practice (e.g., Credit modeling) one of the most common pre-processing steps is to group predictors (feature variables) into discrete ordinal groupings (e.g., 850<=FICO, 750<=FICO<850, 700<=FICO<750, 675<=FICO<700, FICO<675). Notice that we have created 5 ordinal-measured groups, say A B C D E, and that each need not contain the same range of FICO points. As mentioned earlier, often these are (wrongly) labelled numerically 1 2 3 4 5 and entered into (e.g.) a Logistic Regression. • More correctly these groupings should be treated as Categorical variables and if entered into a Regression or Logistic Regression they should be coded as Dummy variables. But importantly, we have now clearly created discrete variables. Our analysis uses grouped scores as predictors and our results cannot say anything about different effects of values more granular than at the group level. The same is true of all men choosing to wear Small shirt sizes. We observe only their shirt size and can model their predicted height only as a single estimate. • At a more philosophical level, all real datasets might be argued to contain, at best, ordinal-level discrete predictors. All observed data points are measured discretely (at some finite level of precision) and must be said to represent a grouping of unobserved data below that level of precision. Continuity, with an infinite number of points lying between each observation no matter how granular, is a mathematical convenience. What we measure we measure discretely. • Accordingly, this presentation models predictors as ordinal discrete variables. One easy way to conceptualize this is to imagine that we start with ordered predictors (e.g., FICO scores) that we group. In our toy example we have 3 predictors that we have grouped into 3 levels each.
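The grouping-then-dummy-coding step described above can be sketched in a few lines of Python. This is an illustrative sketch only: the column name "fico" and the sample scores are hypothetical, while the five cut-points follow the slide's example.

```python
# A minimal sketch (not from the presentation) of the pre-processing step described above:
# grouping a continuous score into ordinal bands and then dummy-coding the bands.
import pandas as pd

df = pd.DataFrame({"fico": [812, 761, 723, 688, 641, 855]})   # hypothetical sample scores

# Ordered bands E < D < C < B < A, matching the five groups on the slide.
df["fico_band"] = pd.cut(
    df["fico"],
    bins=[-float("inf"), 675, 700, 750, 850, float("inf")],
    labels=["E", "D", "C", "B", "A"],
    right=False,            # left-closed intervals: 675 <= FICO < 700 is band "D", etc.
)

# Treated correctly as categorical: one dummy column per band (drop one to avoid collinearity).
dummies = pd.get_dummies(df["fico_band"], prefix="fico", drop_first=True)
print(df.join(dummies))
```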
  • 8. Our Toy Universe: Exploring it with an Unsupervised Random Forest Algorithm
• 9. Our Toy Universe
• Consider a 3-variable, 3-levels-per-variable (2 cut-points per variable) example. It looks a bit like a small Rubik's cube. Below is a cutaway of its top, middle and bottom levels, each containing 9 cells for a total of 27.
• 10. Each of the 27 cells making up the Universe can be identified in 2 ways:
1. By its position spatially along each of the 3 axes (e.g., A=1, B=1, C=1)
2. Equivalently by its cell letter (e.g., k)
It is important to note that we have so far only discussed (and illustrated below) exclusively Predictor Variables/Features (A, B, C) and the 3 levels of each (e.g., A0, A1, and A2). Any individual cell (say, t) is associated only with its location in the Predictor grid (t is found at location A2, B0, C1). Nothing has been said about dependent variable values (the target Y values, which Random Forests is ultimately meant to predict). This is purposeful. It is instructive to demonstrate that without observing Y values a Random Forest-like algorithm can be run as an Unsupervised algorithm that partitions the cube in specific ways. If the Universe is small enough and the algorithm simple enough, we can then enumerate all possible partitionings that might be used in a Supervised run (with observed Y values). Some subset of these partitionings the Supervised algorithm will use, but that subset is dependent upon the specific Y values, which we will observe later as a second step.

The cutaway of the cube (cell letters a through aa):

        B=0                 B=1                 B=2
C=2  (c) (l) (u)       (f) (o) (x)       (i) (r) (aa)
C=1  (b) (k) (t)       (e) (n) (w)       (h) (q) (z)
C=0  (a) (j) (s)       (d) (m) (v)       (g) (p) (y)
     A=0 A=1 A=2       A=0 A=1 A=2       A=0 A=1 A=2
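As a quick cross-check of this layout, a minimal sketch that places the 27 cell letters at their (A, B, C) coordinates; the letter ordering — fastest along C, then B, then A — is read directly off the cutaway above.

```python
# A sketch writing down the 27-cell toy universe exactly as labelled on the slide:
# cell letters a..z, aa placed at coordinates (A, B, C), each in {0, 1, 2}.
from itertools import product

letters = [chr(ord("a") + i) for i in range(26)] + ["aa"]

# Letters run fastest along C, then B, then A (a = A0 B0 C0, b = A0 B0 C1, ...).
cells = {letters[9 * a + 3 * b + c]: (a, b, c) for a, b, c in product(range(3), repeat=3)}

print(len(cells))     # 27 cells
print(cells["t"])     # (2, 0, 1): cell t sits at A=2, B=0, C=1, as stated on the slide
print(cells["n"])     # (1, 1, 1): the middle-most cell discussed on later slides
```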
• 11. A Random Forest-like Unsupervised Algorithm
• Canonical Random Forest is a Supervised algorithm; our Unsupervised version is only "Random Forest-like."
• Creating a diversified forest requires each tree growth to be decoupled from all others. Canonical Random Forest does this in two ways:
• It uses only bootstrapped subsets of the full Training Set to grow each tree.
• Importantly, at each decision point (node) only a subset of the available variables is selected to determine daughter nodes. This is Ho's "feature subspace selection."
• How our Random Forest-like algorithm differs from Breiman's algorithm:
• As with many other analyses of Random Forest, our Unsupervised algorithm uses the full Training Set each time a tree is built.
• Breiman uses CART as the base tree-split method. CART allows only binary splits (only 2 daughter nodes per parent node split). We allow multiple daughter splits of a parent Predictor variable (as is possible in ID3 and CHAID); in fact, our algorithm insists that if a Predictor (in our example, the A or B or C dimension) is selected as a node splitter, all possible levels are made daughters.
• Breiman's algorithm grows each tree to its maximum depth. Our approach grows only two levels (vs. the 3 levels that a maximum 3-attribute model would allow). This is done for two reasons. First, a full-growth model without bootstrapping would necessarily interpolate the Training Set (which Breiman's method typically does not do). And second, because we have only 3 predictors, fully grown trees do not produce very decoupled trees. Stopping short of full growth is motivated by the same considerations as "pruning" non-random trees.
• 12. A Random Forest-like Unsupervised Algorithm (cont'd)
• In our simple unsupervised random forest-like algorithm the critical "feature subset selection" parameter, often referred to as mtry, is set at a maximum of 2 (out of the full 3). This implies that at the root level the splitter is chosen randomly with equal probability as either A, B, or C and 3 daughter nodes are created (e.g., if A is selected at the root level, the next level consists of A0, A1, A2).
• At each of the 3 daughter nodes the same procedure is repeated. Each daughter's selection is independent of her sisters. Each chooses a split variable between the 2 available variables. One variable, the one used in the root split, is not available (A in our example), so each sister in our example chooses (with equal probability) between B and C. Choosing independently leads to an enumeration of 2^3 = 8 permutations. This example is for the case where the choice of root splitter is A. Because both B and C could instead be root splitters, the total number of tree growths, each equally likely, is 24.

24 EQUALLY-LIKELY TREE NAVIGATIONS (daughters' splitters, by root splitter):

  A is the root splitter:  BBB  CCC  BBC  CCB  BCB  CBC  BCC  CBB
  B is the root splitter:  AAA  CCC  AAC  CCA  ACA  CAC  ACC  CAA
  C is the root splitter:  AAA  BBB  AAB  BBA  ABA  BAB  ABB  BAA
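The same 24 labels can be generated mechanically; a small sketch (the label notation A(BCB), etc., follows the slides):

```python
# A sketch enumerating the 24 equally likely tree "navigations" described above:
# pick a root splitter, then each of its 3 daughters independently picks one of the
# 2 remaining variables.
from itertools import product

variables = ["A", "B", "C"]
navigations = []
for root in variables:
    remaining = [v for v in variables if v != root]
    for daughters in product(remaining, repeat=3):      # 2**3 = 8 per root choice
        navigations.append(f"{root}({''.join(daughters)})")

print(len(navigations))      # 24
print(navigations[:8])       # the 8 navigations with A as the root splitter
```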
• 13. The 8 unique partitionings when the root partition is A
For each of the 3 beginning root partition choices, our unsupervised random forest-like algorithm divides the cube into 8 unique collections of 1x3 rectangles. There are 18 such possible rectangles for each of the 3 beginning root partition choices. Below are the results of the algorithm's navigations when A is the root partition choice. Two analogous navigation results charts can be constructed for the root attribute choices of B and C.

[The slide's chart shows the 8 navigations A(BBB), A(BBC), A(BCB), A(BCC), A(CCC), A(CCB), A(CBC), A(CBB). The 1x3 rectangles each navigation aggregates are determined by each daughter's split choice:]

  A=0 daughter splits on B → (a,b,c), (d,e,f), (g,h,i);  splits on C → (a,d,g), (b,e,h), (c,f,i)
  A=1 daughter splits on B → (j,k,l), (m,n,o), (p,q,r);  splits on C → (j,m,p), (k,n,q), (l,o,r)
  A=2 daughter splits on B → (s,t,u), (v,w,x), (y,z,aa); splits on C → (s,v,y), (t,w,z), (u,x,aa)
• 14. The 8 unique partitionings when the root partition is A (cont.)
Note that each cell (FOR EXAMPLE, CELL n, the middle-most cell) appears in 2 and only 2 unique 1x3 rectangles among the 8 navigations when the root partition is A. In the 4 navigations whose A=1 daughter splits on B (bordered in blue on the original slide), n appears in the rectangle (m,n,o). In the 4 navigations whose A=1 daughter splits on C (bordered in red on the original slide), n appears in the rectangle (k,n,q).
• 15. The 8 unique partitionings when the root partition is A (cont.)
Analogously, CELL a (the bottom/left-most cell) also appears in 2 and only 2 unique 1x3 rectangles when the root partition is A. In the 4 navigations whose A=0 daughter splits on B (the top row of navigations on the slide), Cell a appears in the rectangle (a,b,c). In the 4 navigations whose A=0 daughter splits on C (the bottom row of navigations on the slide), Cell a appears in the rectangle (a,d,g). This pattern is true for all 27 cells.
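A short sketch can confirm this "exactly two rectangles per root choice, three overall, eight times each" pattern by brute force over all 24 navigations; cell a is used here, but any cell behaves the same way.

```python
# A sketch verifying the claim above by enumerating all 24 navigations: after a 2-level
# navigation, the leaf holding a cell is the 1x3 rectangle that varies along the one
# dimension left unsplit on the cell's path.
from collections import Counter
from itertools import product

AX = {"A": 0, "B": 1, "C": 2}
letters = [chr(ord("a") + i) for i in range(26)] + ["aa"]
name = {(a, b, c): letters[9 * a + 3 * b + c]
        for a in range(3) for b in range(3) for c in range(3)}

def leaf_rectangle(cell, root, daughter_splits):
    # daughter_splits[v] = variable chosen by the daughter node for root-level value v
    second = daughter_splits[cell[AX[root]]]
    free_axis = AX[({"A", "B", "C"} - {root, second}).pop()]
    return tuple(sorted(name[cell[:free_axis] + (v,) + cell[free_axis + 1:]] for v in range(3)))

cell_a = (0, 0, 0)
counts = Counter()
for root in "ABC":
    others = [v for v in "ABC" if v != root]
    for picks in product(others, repeat=3):              # 8 navigations per root choice
        counts[leaf_rectangle(cell_a, root, list(picks))] += 1

print(counts)   # the three rectangles (a,b,c), (a,d,g), (a,j,s), each appearing 8 times
```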
• 16. Random Forests "self-averages", or Regularizes
Each of our algorithm's 24 equally-likely tree navigations reduces the granularity of the cube by a factor of 3, from 27 to 9. Each of the original 27 cells finds itself included in exactly two unique 1x3 rectangular aggregates for each root splitting choice. For example, Cell "a" is in Aggregate (a,b,c) and in Aggregate (a,d,g) when A is the root split choice (see previous slide). When B is the root split choice (not shown), Cell "a" appears again in Aggregate (a,b,c) and in a new Aggregate (a,j,s), again in 4 of the 8 navigations each. When C is the root split choice, Cell "a" appears uniquely in the two Aggregates (a,j,s) and (a,d,g). Continuing with Cell a as our example and enumerating the rectangles into which Cell a falls in each of the 24 equally-possible navigations we get:

  Rectangle   Count of occurrences   Root split A   Root split B   Root split C
  (a,b,c)              8                  4              4              0
  (a,d,g)              8                  4              0              4
  (a,j,s)              8                  0              4              4
  Sums                24                  8              8              8

Now recall that reducing the granularity of the cube means that each cell is now represented by a rectangle which averages the member cells equally, i.e., Aggregate (a,b,c) is the average of a, b, and c. Therefore the 24-navigation average is
  [8*(a+b+c) + 8*(a+d+g) + 8*(a+j+s)] / 72
Rearranging = [24(a) + 8(b) + 8(c) + 8(d) + 8(g) + 8(j) + 8(s)] / 72
Simplifying = 1/3*(a) + 1/9*(b) + 1/9*(c) + 1/9*(d) + 1/9*(g) + 1/9*(j) + 1/9*(s)
This is our Unsupervised RF-like algorithm's estimate for the "value" of Cell a in our Toy Example.
• 17. Random Forests "self-averages", or Regularizes (continued)
[Slide diagram: Cell a together with its six axis-aligned neighbours b, c, d, g, j, s.]
Because each of the 24 navigations is equally likely, our ensemble space for Cell "a" consists of equal parts of Aggregate (a,b,c), Aggregate (a,d,g) and Aggregate (a,j,s). Any attributes (e.g., some Target values Y) later assigned to the 27 original cells would also be averaged over the ensemble space. Note that because the original cell "a" appears in each of the Aggregates, it gets 3 times the weight that each of the other cells gets. If the original 27 cells are later all populated with Y values, our Unsupervised algorithm can be seen as simply regularizing those target Y values. Each regularized estimate equals 3 times the original cell's Y value, plus the Y values of the 6 cells within the remainder of the 1x3 rectangles lying along each dimension, the sum then divided by 9. Regularized Cell "a" = (original cells g+d+c+b+j+s + 3a)/9. This formula is true for all 27 cells in our Toy Example.
• 18. Regularization by our Random Forest-like Algorithm reduces variance from 60 2/3 to 19

Numbers 1-27 randomly assigned to cells as Target Y values:

        B=0 (Bottom)        B=1 (Middle)        B=2 (Top)
C=2     5   3  16          11   4  15           8  12  26
C=1     2   7  21          17   1  25          19  14  27
C=0    22  20  23           6   9  13          10  18  24
       A=0 A=1 A=2         A=0 A=1 A=2         A=0 A=1 A=2

Randomly-assigned Y values after Regularization by our RF-like algorithm:

        B=0 (Bottom)             B=1 (Middle)             B=2 (Top)
C=2   8 5/9   8 1/9  15 2/3    9 7/9   7      15 5/9    11 8/9  12 1/9  20
C=1  10 7/9   9 1/9  18 1/9   12 7/9   8 7/9  18 7/9    15      14      23 1/3
C=0  14 2/3  15 7/9  20 5/9   11 1/9   9 8/9  15 2/3    14 1/9  15 8/9  21
      A=0     A=1     A=2      A=0     A=1     A=2       A=0     A=1     A=2
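The whole regularization can be reproduced in a few lines of NumPy; a sketch, with the 1-27 assignment transcribed from the chart above and the array indexed as Y[A, B, C]:

```python
# A sketch reproducing the slide's regularization: each cell's estimate is the average
# of the three 1x3 aggregate means through it, i.e. (3*cell + its six axis-aligned
# neighbours) / 9.
import numpy as np

Y = np.zeros((3, 3, 3))                                 # indexed Y[A, B, C]
Y[:, 0, :] = [[22, 2, 5], [20, 7, 3], [23, 21, 16]]     # B=0 panel: rows A=0,1,2; cols C=0,1,2
Y[:, 1, :] = [[6, 17, 11], [9, 1, 4], [13, 25, 15]]     # B=1 panel
Y[:, 2, :] = [[10, 19, 8], [18, 14, 12], [24, 27, 26]]  # B=2 panel

# Summing along each axis gives, for every cell, the sum of its three 1x3 aggregates;
# dividing by 9 is exactly the slide's formula.
reg = (Y.sum(axis=0, keepdims=True)
       + Y.sum(axis=1, keepdims=True)
       + Y.sum(axis=2, keepdims=True)) / 9.0

print(reg[0, 0, 0])        # cell a -> 14.666... = 14 2/3, matching the chart
print(Y.var(), reg.var())  # ~60.67 -> ~18.5 with NumPy's (population) variance;
                           # the slide quotes the reduction as 60 2/3 to 19
```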
• 19. Practical considerations even in a Universe defined as Discrete: Why real-world analyses require randomness as a simulation expedient
• Even accepting the notion that any set of ordered predictors (some containing hundreds or even thousands of ordered observations, e.g., Annual Income) can be theoretically conceptualized as simply Discrete regions, our ability to enumerate all possible outcomes of a Random Forest-like algorithm as we have done in this presentation is severely limited. Even grouping each ordered Predictor into a small number of categories does not help when the number of Predictors (dimensions) is even just modestly large.
• For example, even for an intrinsically 3-level set of ordinal predictors (Small, Medium, and Large), a 25-attribute Universe has 3^25 ≈ 8.5*10^11 (nearly a Trillion) unique X-vector grid cell locations and nearly a Trillion Y-predictions to make.
• Further complications ensue from using a tree algorithm that restricts node splitting to just two daughters, a binary splitter. A single attribute of more than 2 levels (say Small, Medium, and Large) can be used more than once within a single navigation path, say first as "Small or Medium" vs. "Large" and then, somewhere further down the path, the first daughter can be split "Small" vs. "Medium." Indeed, making this apparently small change to a binary splitter in our Toy Example increases the enumeration of complete tree growths to an unmanageable 86,400 (see Appendix).
• 20. A Final Complication: "Holes" in the Discrete Universe
• Typically, a sample drawn from a large universe will have "holes" in its Predictor Space; by this is meant that not all predictor (X-vector) instantiations will have Target Y values represented in the sample; and there are usually many such holes. Recall our Trillion-cell example. A large majority of cell locations will likely not have associated Y values.
• This is a potential problem even for our Toy Universe because if there are holes then the simple Regularization formula suggested earlier will not work. Recall that formula: each regularized estimate equals 3 times the original cell's Y value, plus the Y values of the 6 cells within the remainder of the 1x3 rectangles lying along each dimension, the sum then divided by 9.
• Luckily, the basic approach of averaging the three Aggregates our algorithm defines for each Regularized Y-value does continue to hold. But now the process has two steps. The first step is to find an average value for each of the three Aggregates using only those cells that have values. Some 1x3 rectangular Aggregates may have no values (including no original value for the cell for which we seek a Regularized estimate). The denominator for each Aggregate average calculation is the count of non-missing cells. The second step is to average the three Aggregate Averages. Again, only non-empty Aggregate Averages are summed, and the denominator in the final averaging is a count of only the non-empty Aggregate Averages.
• The next slide compares the Regularized estimates derived earlier from our randomized 1-27 Y-value full-assignment case to the case where 6 cell locations were arbitrarily assigned to be holes and contained no original values. Those original locations set as holes are colored in blue in the second chart of the original slide. Only the middle-most (Cell "n") Regularized value in the second chart (= 17.00) differs substantially from its value in the first chart. This cell has holes in all but one location among all three of its Aggregates; only original cell location "e" is non-blank, with original value 17.
• 21. Regularization of Universe with Initial Holes (second chart). Variance: 15.7

Randomly-assigned Y values after Regularization by our RF-like algorithm (no holes, repeated from the previous chart for comparison):

        B=0 (Bottom)             B=1 (Middle)             B=2 (Top)
C=2   8 5/9   8 1/9  15 2/3    9 7/9   7      15 5/9    11 8/9  12 1/9  20
C=1  10 7/9   9 1/9  18 1/9   12 7/9   8 7/9  18 7/9    15      14      23 1/3
C=0  14 2/3  15 7/9  20 5/9   11 1/9   9 8/9  15 2/3    14 1/9  15 8/9  21
      A=0     A=1     A=2      A=0     A=1     A=2       A=0     A=1     A=2

Randomly-assigned Y values after Regularization by our RF-like algorithm (6 cells with original Y-values missing, shown in blue on the original slide):

        B=0 (Bottom)             B=1 (Middle)             B=2 (Top)
C=2   8.56    9.00   15.67    10.78   10.25   15.33     11.89   12.61   20.00
C=1  11.28   11.50   18.50    13.67   17.00   18.33     16.00   19.00   24.22
C=0  14.67   17.39   20.56    11.17   14.25   14.50     14.11   17.11   21.00
      A=0     A=1     A=2      A=0     A=1     A=2       A=0     A=1     A=2
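The two-step averaging can be sketched with NumPy's NaN-aware means. Which six cells were blanked out is indicated only by colour on the original slide and is not recoverable here; from the previous slide's description (every aggregate of cell n empty except for cell e) the holes are assumed below to be cells k, m, n, o, q and w, and under that assumption the sketch reproduces the chart's values.

```python
# A sketch of the two-step regularization with holes: step 1 averages each non-empty
# 1x3 aggregate ignoring NaNs, step 2 averages whichever aggregate means are available.
# The hole locations are an ASSUMPTION inferred from the text, not read off the slide.
import numpy as np

Y = np.zeros((3, 3, 3))                                 # indexed Y[A, B, C]
Y[:, 0, :] = [[22, 2, 5], [20, 7, 3], [23, 21, 16]]
Y[:, 1, :] = [[6, 17, 11], [9, 1, 4], [13, 25, 15]]
Y[:, 2, :] = [[10, 19, 8], [18, 14, 12], [24, 27, 26]]

holes = {"k": (1, 0, 1), "m": (1, 1, 0), "n": (1, 1, 1),
         "o": (1, 1, 2), "q": (1, 2, 1), "w": (2, 1, 1)}    # assumed hole cells
for a, b, c in holes.values():
    Y[a, b, c] = np.nan

# Step 1: per-cell mean of each axis-aligned aggregate, ignoring missing cells
# (an all-NaN aggregate yields NaN plus a harmless NumPy warning).
agg = [np.nanmean(Y, axis=ax, keepdims=True) * np.ones_like(Y) for ax in range(3)]
# Step 2: average the available aggregate means.
reg = np.nanmean(np.stack(agg), axis=0)

print(round(reg[1, 1, 1], 2))   # cell n -> 17.0, as on the slide
print(round(reg[0, 1, 1], 2))   # cell e -> 13.67
print(round(reg[1, 0, 1], 2))   # cell k -> 11.5
```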
• 22. The Optimization Step
• Most discussions of Random Forests spend considerable space describing how the base tree algorithm makes its split decision at each node (e.g., examining Gini or Information Gain). In our enumeration approach this optimization step (where we attempt to separate signal from noise) is almost an afterthought. We know all possible partitions into which our simple cube CAN be split. It is only left to determine, given our data, which specific navigations WILL be chosen.
• In our toy example this step is easy. Recall that at the root level 1 of our 3 attributes is randomly excluded from consideration (this is the effect of setting mtry=2). Our data (the 6-hole case) show that if A is not the excluded choice (which is 2/3 of the time) then the A Attribute offers the most information gain. At the second level the 3 daughters' choices lead to the full navigation choice A(CCB). When A is excluded (1/3 of the time), C offers the best information gain and, after examining the optimal daughters' choices, the full navigation choice is C(BAA). The optimal Regularized estimates are simply (2/3)*A(CCB) estimates + (1/3)*C(BAA) estimates.

Optimized Regularization Estimates, weighting 2/3 A(CCB) and 1/3 C(BAA):

        B=0                      B=1                      B=2
C=2   8.00   7.50   19.67      8.00   7.50   15.67      8.00   7.50   23.44
C=1  12.67  14.94   21.33     12.67  14.94   17.33     12.67  14.94   25.11
C=0  15.67  19.89   20.56     11.61  15.83   12.50     14.11  18.33   22.78
      A=0    A=1    A=2        A=0    A=1    A=2        A=0    A=1    A=2
  • 23. An Example using Real Data: The Kaggle Titanic Disaster Dataset
• 24. The Kaggle Titanic Disaster Dataset
• Kaggle is a website hosting hundreds of Machine Learning datasets and featuring contests that allow practitioners to benchmark the predictive power of their models against peers.
• Kaggle recommends that beginners use the Titanic Disaster dataset as their first contest. It is a well-specified, familiar and real example whose goal (predicting who among the ship's passengers will survive, given attributes known about them from the ship's manifest) is intuitive. The details of the beginners' contest (including a video) can be found here: kaggle.com/competitions/titanic .
• Up until now our analysis has used Regression Trees as our base learners. That is, we assumed the Target (Y) values we were attempting to estimate were real, interval-level variables. The Titanic problem is a (binary) classification problem: either the passenger survived or did not (coded 1 or 0). Nevertheless, we will continue to use Regression Trees as base learners for our Titanic analysis, with each tree's estimates (as well as their aggregations) falling between 0 and 1. As a last step, each aggregated estimate >.5 will then be converted to a "survive" prediction, otherwise "not survive".
• 25. The Kaggle Titanic Disaster Dataset (continued)
• Excluding the record identifier and binary Target value (Y), the dataset contains 10 raw candidate predictor fields for each of 891 Training set records: Pclass (1st, 2nd, or 3rd class accommodations), Name, Sex, Age, SibSp (count of siblings traveling with passenger), Parch (count of parents and children traveling with passenger), Ticket (alphanumerical ticket designator), Fare (cost of passage in £), Cabin (alphanumerical cabin designator), Embark (S, C, Q: one of 3 possible embarkation sites).
• For our modeling exercise only Pclass, Sex, Age, and a newly constructed variable Family_Size (= SibSp + Parch + 1) were ultimately used.*
• Our initial task was to first pre-process each of these 4 attributes into 3 levels (where possible). Pclass broke naturally into 3 levels. Sex broke naturally into only 2. The raw interval-level attribute Age contained many records with "missing values" which were initially treated as a separate level, with the non-missing values bifurcated at the cut-point providing the highest information gain (0-6 years of age, inclusive, vs. >6 years of age). However, it was determined that there was no statistical reason to treat "missing value" records differently from aged >6 records, and these two initial levels were combined, leaving Age a 2-level predictor (Aged 0-6, inclusive, vs. NOT Aged 0-6, inclusive).
• The integer, ordinal-level Family_Size attribute was cut into 3 levels using the optimal information gain metric: 2-4 members (small family); >4 members (big family); 1 member (traveling alone).
*Before it was decided to exclude the Embark attribute it was discovered that 2 records lacked values for that attribute and those records were excluded from the model building, reducing the Training set record count to 889.
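A sketch of this pre-processing in pandas, assuming the standard column names in Kaggle's train.csv (Pclass, Sex, Age, SibSp, Parch, Embarked); the level names mirror those used in the prediction charts later in the deck:

```python
# A sketch (not the author's code) of the coarse pre-processing described above.
import pandas as pd

df = pd.read_csv("train.csv")              # Kaggle Titanic training file, 891 rows
df = df.dropna(subset=["Embarked"])        # drop the 2 records noted in the footnote -> 889

# Age: 2 levels, with missing ages grouped into "NOT aged 0-6" (NaN comparisons are False).
df["Young_Child"] = df["Age"] <= 6

# Family_Size = SibSp + Parch + 1, cut into alone / 2-4 members / >4 members.
df["Family_Size"] = pd.cut(df["SibSp"] + df["Parch"] + 1,
                           bins=[0, 1, 4, 100],
                           labels=["Family_Alone", "Family_2_3_4", "Family_Big"])

features = df[["Pclass", "Sex", "Young_Child", "Family_Size"]]
print(features.head())
```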
• 26. Approaching our RF-like analysis from a different angle
• Our pre-processed Titanic predictor "universe" comprises 36 cells (= 3x2x2x3) and we could proceed with an RF-like analysis in the same manner as we did with our 27-cell Toy Example. The pre-processed Titanic universe appears only slightly more complex than our Toy Example (4-dimensional rather than 3), but the manual tedium required grows exponentially. There is a better way.
• Note that up until this current "Real Data" section all but the last slide (Slide 22) focused exclusively on the unsupervised version of our proposed RF-like algorithm; only Slide 22 talked about optimization ("separating signal from noise"). As noted earlier, this was purposeful. We were able to extract an explicit regularization formula for our Toy Example (without holes, Slide 17). Higher dimensionality and the presence of "holes" complicate deriving any generalized statement, but at least our example provides a sense for what canonical random forest might be doing spatially (it averages over a set of somehow-related, rectangularly-defined "neighbors" and ignores completely all other cells). This gives intuitive support for those attributing RF's success to its relationship to:
• "Nearest Neighbors": Lin, Yi and Jeon, Yongho. Random Forests and Adaptive Nearest Neighbors. Journal of the American Statistical Association, Vol. 101 (2006); or the Totally randomized trees model of Geurts, Pierre, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning 63.1 (2006), p. 3-42;
• "Implicit Regularization": Mentch, Lucas and Zhou, Siyu. Randomization as Regularization: A Degrees of Freedom Explanation for Random Forest Success. Journal of Machine Learning Research 21 (2020), p. 1-36.
• 27. Approaching our RF-like analysis from a different angle: Starting at the optimization step
• Notice that for all the work we did to explicitly define exactly what all 24 possible tree navigations would look like in our Toy Example, we ended up only evaluating 2. But we could have known in advance what those 2 would be, if not explicitly, then by way of a generic description.
• Knowing that at the root level an optimization would pick the best attribute to split on given the choices available, our first question is: "would that first-best attribute, either A or B or C, be available?" There are 3 equally likely 2-attribute subsets at the Toy Example's root level: AB, AC, BC. Note that no matter which of the 3 variables turns out to be the optimal one, it will be available 2 out of 3 times.
• Recall our RF-like model's approach of selecting, at any level, from among only a subset of attributes not previously used in the current branch. Is this "restricted choice" constraint relevant at levels deeper than the root level? In the Toy Example the answer is "no." At the second level only 2 attributes are available for each daughter node to independently choose among and, although those independent choices may differ, they will be optimal for each daughter.
• Because we stop one level short of exhaustive navigation (recall, to avoid interpolating the Training Set), we are done. This is why the final estimates in our Toy Example involved simply weighting the estimates of two navigations: (2/3)*A(CCB) estimates + (1/3)*C(BAA) estimates. The last navigation results from the 1 time out of 3 that A (which turns out to be the optimal) is not available at the root level.
• 28. A Generic Outline of the Optimization Step for our Titanic Disaster Data Set
• A mapping of the relevant tree navigations for the Titanic predictor dataset we have constructed is more complex than that for the Toy Example, principally because we require a 3-deep branching structure and have 4 attributes to consider rather than 3. We will continue, however, to choose only 2 from among the currently available (unused) attributes at every level (i.e., mtry=2).
• At the root level, 2 of the available 4 are randomly selected, resulting in 6 possible combinations: (AB), (AC), (AD), (BC), (BD), (CD). Note that no matter which of the 4 variables turns out to be the optimal one, it will be available 3 out of 6 times. For the 1/2 of the time that the first-best optimizing attribute (as measured by information gain) is not among the chosen couplet, the second-best attribute is available 2 out of those 3 times; the second-best solution is therefore available for 1/3 of the total draws (1/2 * 2/3). The third-best solution is chosen only when the 2 superior choices are unavailable, 1/6 of the time. A sketch of this bookkeeping follows the list below.
• For the Titanic predictor set as we have constructed it, the root level choices are:
• Sex (First-Best), weighted 1/2
• PClass (Second-Best), weighted 1/3
• Family_Size (Third-Best), weighted 1/6
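A small sketch of the root-level bookkeeping (the ranking Sex > PClass > Family_Size is from the slide; Age is assumed to be the fourth-best splitter, since it is the one attribute not listed):

```python
# A sketch checking the quoted root-level weights: with mtry=2 of 4 attributes, how often
# is each attribute the best one offered to the optimizer?
from itertools import combinations

ranked = ["Sex", "PClass", "Family_Size", "Age"]     # best first; Age's rank is assumed
pairs = list(combinations(ranked, 2))                # the 6 equally likely mtry=2 draws

wins = {name: 0 for name in ranked}
for pair in pairs:
    wins[min(pair, key=ranked.index)] += 1           # the optimizer takes the best one offered

print({k: f"{v}/{len(pairs)}" for k, v in wins.items()})
# {'Sex': '3/6', 'PClass': '2/6', 'Family_Size': '1/6', 'Age': '0/6'}
```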
• 29. A Generic Outline of the Optimization Step for our Titanic Disaster Data Set (continued)
• For each of the 3 choices of Root Level attribute, the 2 or 3 daughter nodes each independently chooses the best attribute from among those remaining. If, for example, A is taken to be the Root Level splitter it is no longer available at Level 2, whose choices are B, C, or D. Taken 2 at a time, the possible couplets are (BC), (BD), (CD). 2/3 of the time the First-Best splitter is available. At deeper levels the optimal attribute is always chosen. The weightings of each of the 20 navigations are below (see the sketch after the tables for the arithmetic).

Level 1 (ROOT): First-Best Choice (Sex), weight 1/2
  Level 2     Daughter 1     Daughter 2     Weighting
  tree 1      First Best     First Best     2/9
  tree 2      First Best     Second Best    1/9
  tree 3      Second Best    First Best     1/9
  tree 4      Second Best    Second Best    1/18

Level 1 (ROOT): Second-Best Choice (PClass), weight 1/3
  Level 2     Daughter 1     Daughter 2     Daughter 3     Weighting
  tree 5      First Best     First Best     First Best     8/81
  tree 6      First Best     First Best     Second Best    4/81
  tree 7      First Best     Second Best    First Best     4/81
  tree 8      First Best     Second Best    Second Best    2/81
  tree 9      Second Best    First Best     First Best     4/81
  tree 10     Second Best    First Best     Second Best    2/81
  tree 11     Second Best    Second Best    First Best     2/81
  tree 12     Second Best    Second Best    Second Best    1/81

Level 1 (ROOT): Third-Best Choice (Family_Size), weight 1/6
  Level 2     Daughter 1     Daughter 2     Daughter 3     Weighting
  tree 13     First Best     First Best     First Best     4/81
  tree 14     First Best     First Best     Second Best    2/81
  tree 15     First Best     Second Best    First Best     2/81
  tree 16     First Best     Second Best    Second Best    1/81
  tree 17     Second Best    First Best     First Best     2/81
  tree 18     Second Best    First Best     Second Best    1/81
  tree 19     Second Best    Second Best    First Best     1/81
  tree 20     Second Best    Second Best    Second Best    1/162
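The weights in these tables follow one rule: root weight times, per daughter, 2/3 for its first-best available splitter or 1/3 for its second-best. A short sketch reproducing them:

```python
# A sketch reproducing the 20 tree weights above: root weight x (2/3 or 1/3 per daughter).
from fractions import Fraction
from itertools import product

root_weight = {"Sex": Fraction(1, 2), "PClass": Fraction(1, 3), "Family_Size": Fraction(1, 6)}
n_daughters = {"Sex": 2, "PClass": 3, "Family_Size": 3}   # Sex has 2 levels, the others 3
p = {"First Best": Fraction(2, 3), "Second Best": Fraction(1, 3)}

weights = []
for root in root_weight:
    for picks in product(p, repeat=n_daughters[root]):    # daughters choose independently
        w = root_weight[root]
        for choice in picks:
            w *= p[choice]
        weights.append(w)

print(len(weights), sum(weights))          # 20 trees, weights summing to 1
print([str(w) for w in weights[:4]])       # ['2/9', '1/9', '1/9', '1/18'] -- trees 1-4
```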
• 30. Predictions for the Titanic Contest using our RF-like algorithm
• The weighted sum of the first 5 trees (out of 20) is displayed in separate Male and Female charts on this and the next slide. An Average (normalized) metric of >= .5 implies a prediction of "Survived", otherwise "Not." These first 5 trees represent nearly 60% of the weightings of a complete RF-like algorithm's weighted opinion and are unlikely to be different from a classification based on all 20.

MALES ("H" flags are preserved as they appear in the original chart)
  Sex   Pclass  Young_Child?     Family_Size        Trees 1&2     Trees 3&4     Tree 5         Average
                                                    (weight 1/3)  (weight 1/6)  (weight 8/81)  (normalized)
  Male  3rd     NOT Young_Child  Family_2_3_4       0.123         0.123         0.327          0.157
  Male  3rd     NOT Young_Child  Family_Alone       0.123         0.123         0.210          0.137
  Male  3rd     NOT Young_Child  Family_Big         0.123         0.123         0.050          0.111
  Male  3rd     Young_Child      Family_2_3_4       1.000         0.429         0.933          0.830
  Male  3rd     Young_Child      Family_Alone (H)   0.667         0.429         1.000          0.656
  Male  3rd     Young_Child      Family_Big         0.111         0.429         0.143          0.205
  Male  2nd     NOT Young_Child  Family_2_3_4       0.090         0.090         0.547          0.165
  Male  2nd     NOT Young_Child  Family_Alone       0.090         0.090         0.346          0.132
  Male  2nd     NOT Young_Child  Family_Big         0.090         0.090         1.000          0.240
  Male  2nd     Young_Child      Family_2_3_4       1.000         1.000         1.000          1.000
  Male  2nd     Young_Child      Family_Alone (H)   0.667         1.000         0.346          0.707
  Male  2nd     Young_Child      Family_Big (H)     0.111         1.000         1.000          0.505
  Male  1st     NOT Young_Child  Family_2_3_4       0.358         0.358         0.735          0.420
  Male  1st     NOT Young_Child  Family_Alone       0.358         0.358         0.523          0.385
  Male  1st     NOT Young_Child  Family_Big         0.358         0.358         0.667          0.409
  Male  1st     Young_Child      Family_2_3_4       1.000         1.000         0.667          0.945
  Male  1st     Young_Child      Family_Alone (H)   0.667         1.000         0.523          0.736
  Male  1st     Young_Child      Family_Big (H)     0.111         1.000         0.667          0.450
• 31. Predictions for the Titanic Contest using our RF-like algorithm (cont'd)
• Generally, most females Survived, and most males did not. Predicted exceptions are highlighted in Red and Green on the original chart. A complete decision rule would read: "Predict young males ('children') in 2nd Class survive, along with male children in 3rd and 1st Classes not in Big Families; all other males perish. All females survive except those in 3rd Class traveling in Big Families and female children in 1st Class."

FEMALES ("H" flags are preserved as they appear in the original chart)
  Sex     Pclass  Young_Child?     Family_Size        Trees 1&3     Trees 2&4     Tree 5         Average
                                                      (weight 1/3)  (weight 1/6)  (weight 8/81)  (normalized)
  Female  3rd     NOT Young_Child  Family_2_3_4       0.561         0.561         0.327          0.522
  Female  3rd     NOT Young_Child  Family_Alone       0.617         0.617         0.210          0.550
  Female  3rd     NOT Young_Child  Family_Big         0.111         0.111         0.050          0.101
  Female  3rd     Young_Child      Family_2_3_4       0.561         0.561         0.933          0.622
  Female  3rd     Young_Child      Family_Alone       0.617         0.617         1.000          0.680
  Female  3rd     Young_Child      Family_Big         0.111         0.111         0.143          0.116
  Female  2nd     NOT Young_Child  Family_2_3_4       0.914         0.929         0.547          0.858
  Female  2nd     NOT Young_Child  Family_Alone       0.914         0.906         0.346          0.818
  Female  2nd     NOT Young_Child  Family_Big         0.914         1.000         1.000          0.952
  Female  2nd     Young_Child      Family_2_3_4       1.000         0.929         1.000          0.980
  Female  2nd     Young_Child      Family_Alone (H)   1.000         0.906         0.346          0.866
  Female  2nd     Young_Child      Family_Big (H)     1.000         1.000         1.000          1.000
  Female  1st     NOT Young_Child  Family_2_3_4       0.978         0.964         0.735          0.934
  Female  1st     NOT Young_Child  Family_Alone       0.978         0.969         0.523          0.900
  Female  1st     NOT Young_Child  Family_Big         0.978         1.000         0.667          0.933
  Female  1st     Young_Child      Family_2_3_4       0.000         0.964         0.667          0.378
  Female  1st     Young_Child      Family_Alone (H)   0.000         0.969         0.523          0.356
  Female  1st     Young_Child      Family_Big (H)     0.000         1.000         0.667          0.388
• 32. Performance of our RF-like algorithm on Titanic test data
• Although not the focus of this presentation, it is instructive to benchmark the performance of our simple "derived Rules Statement" (based on our RF-like model) against models submitted by others.
• Kaggle provides a mechanism by which one can submit a file of predictions for a Test set of passengers that is provided without labels (i.e., "Survived or not"). Kaggle will return a performance evaluation summarized as the % of Test records predicted correctly.
• Simply guessing that "all Females survive and all Males do not" results in a score of 76.555%. Our Rules Statement gives an only modestly better score, 77.751%. A "great" score is >80% (estimated to place one in the top 6% of legitimate submissions). Improvements derive from tedious "feature-engineering." In contrast, our data prep involved no feature-engineering except for the creation of a Family_Size construct (= SibSp + Parch + 1). Also, we only coarsely aggregated our 4 employed predictors into no more than 3 levels each. Our goal was to uncover the mechanism behind a RF-like model rather than simply (and blindly) "get a good score." Interestingly, the canonical Random Forest example featured in the Titanic tutorial gives a score of 77.511%, consistent with (actually, slightly worse than) our own RF-like set of derived Decision Rules; again, both without any heavy feature-engineering. Reference: https://www.kaggle.com/code/carlmcbrideellis/titanic-leaderboard-a-score-0-8-is-great.
• 33. APPENDIX: The problem of using binary splitters
• Treat our toy "3-variable, 2-split-points each" set-up as a simple 6-variable set of binary cut-points.
• Splitting on one of the 6 top-level cut-points results in a pair of child nodes. Independently splitting on both the right and left sides of this first pair involves a choice from among 5 possible Level 2 splits per child. The independent selection per child means that the number of possible path combinations at Level 2 is 5^2 (= 25). By Level 2 we have made 2 choices, the choice of top-level cut and the choice of second-level pair: one path out of 6*25 (= 150). The left branch of each second-level pair now selects from its own 4 third-level branch-pair options; so too does the right branch, independently. This independence implies that there are 4^2 (= 16) third-level choices. So, by Level 3 we will have chosen one path from among 16*150 (= 2,400) possibilities. Proceeding accordingly, at Level 4 each of these Level 3 possibilities chooses independent left and right pairs from among 3 remaining cut-points, so 3^2 (= 9) total possibilities per each of the 2,400 Level 3 nodes (= 21,600). At Level 5 a factor of 2^2 (= 4) is multiplied in, but at Level 6 there is only one possible splitting option and the factor is 1^2 (= 1). The final number of possible paths for our "simple" 27-discrete-cell toy example is 86,400 (= 4*21,600).
• In summary, there are 6 * 5^2 * 4^2 * 3^2 * 2^2 * 1^2 = 86,400 equally likely configurations. This is the universe from which a machine-assisted Monte Carlo simulation would sample.
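As a one-line check of this arithmetic (a sketch, nothing more):

```python
# The appendix count: 6 top-level cut-points, then 5^2, 4^2, 3^2, 2^2, 1^2 independent
# left/right pair choices at the successive levels.
total = 6 * 5**2 * 4**2 * 3**2 * 2**2 * 1**2
print(total)   # 86400
```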