A practical deconstruction of one of the oldest algorithmic ("BlackBox") techniques in Data Science. This broad class of non-parametric techniques has come to dominate a statistical field that has recently been popularly rechristened as AI.

- 1. Random Forests without the Randomness. September 2023 (revision 1, corrected). Kirk Monteverde, kirk@quantifyrisk.com
- 2. The Parable of the Blind Men Examining the Elephant. The earliest versions of the parable of the blind men and the elephant are found in Buddhist, Hindu and Jain texts, which discuss the limits of perception and the importance of complete context. The parable has several Indian variations, but broadly goes as follows: A group of blind men heard that a strange animal, called an elephant, had been brought to the town, but none of them were aware of its shape and form. Out of curiosity, they said: "We must inspect and know it by touch, of which we are capable". So, they sought it out, and when they found it they groped about it. The first person, whose hand landed on the trunk, said, "This being is like a thick snake". For another one, whose hand reached its ear, it seemed like a kind of fan. Another person, whose hand was upon its leg, said the elephant is a pillar like a tree-trunk. The blind man who placed his hand upon its side said the elephant "is a wall". Another who felt its tail described it as a rope. The last felt its tusk, stating the elephant is that which is hard, smooth and like a spear.
- 3. Like a collection of blind men each examining only a portion of the Whole Elephant, Random Forests employs a collection of partial understandings of a Sampled Dataset, then aggregates those to achieve a collective understanding of the Whole Dataset (from which the sample was drawn) that exceeds the insights of any single "fully-sighted" analysis. Ensemble Learning vs. Crowd Wisdom • Such ensemble statistical learning techniques, in which a final model is derived by aggregating a collection of competing, often "partially blinded," models, had their first explicit statement in the 1996 Bagging Predictors paper by Leo Breiman (the subsequent author of the seminal Random Forests paper, 2001). • In 2004 James Surowiecki's book, The Wisdom of Crowds, sparked public excitement over the power of collective decision-making, a phenomenon often erroneously conflated with ensemble learning. • They are not the same thing. Ensembling has only to do with reducing variance. Crowd wisdom involves both a variance and a bias reduction, bias being a measure of how far the crowd's collective estimate is, on average, from truth. Our blind men may concoct a collective image in their minds' eyes of what an elephant looks like that could still be far from how a true elephant appears. (As it turns out, in the ending of the actual parable they couldn't even do that; the men simply ended up vehemently arguing with one another.)
- 4. The Problem of Bias • The mean of a crowd's guesses is not always a very good guess of truth. Our "best guess" given data, the mean, may be wildly different from truth. This is "bias." For example, ask a class of grammar school children who have just learned to use commas to express large numbers (e.g., 10,000,000) to write down a guess of the number of atoms in a grain of sand on a small index card, and you are likely to get guesses all of which underestimate truth (which is, in order of magnitude, nearing the number of stars in the known universe: > 1 followed by 23 zeros). The best guess of truth is likely that from the student with the patience and fine motor skills to write down the biggest number (but which student is that?). A full understanding of the Wisdom of Crowds requires an explanation of why Crowds reduce not only variance but also bias. It is perhaps best viewed, at least as a starting point, through the prism of Condorcet's Jury Theorem, which requires each group member to share some probability > .5 of being right (the choice is binary); the smaller that probability, the larger the group offering a collective decision must be in order that the group decision is (nearly certainly) correct. Ensembling has more modest aims. • Much early excitement surrounding Random Forests was due to speculation that it not only reduced variance but might also reduce bias! But it does not (see Hastie et al., The Elements of Statistical Learning, 2009, p. 601 and Mentch & Zhou, Randomization as Regularization, 2020, p. 5). So, although Random Forests (and perhaps ensemble learning generally) seem, at first blush, to capture statistically the popular notion of the Wisdom of Crowds, they may best be understood as employing only variance reduction. But that alone can substantially improve prediction.
- 5. Motivation • Random Forests, a non-parametric ensemble technique, was unveiled by Leo Breiman in 2001, the same year he published a paper provocatively titled Statistical Modeling: The Two Cultures. He intended that paper as a wake-up call to the academic statistical community, which he felt was languishing in an old paradigm that "led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems." Random Forest was his capstone contribution (he died soon afterwards) to a newer approach he labelled "Algorithmic Modeling." • Algorithmic Models (often referred to as "Black Boxes") have proven highly successful at predictive accuracy, at the cost of considerable opaqueness as to why they work or even what they exactly do. • Decades of work have been devoted to unraveling the mysteries of Black Box statistical techniques, Breiman's Random Forests being a prime target. This presentation is yet another such attempt. Here we set the modest goal of simply gaining a better understanding of what the technique actually does, without evaluating why and under what circumstances it might work well. Our approach is to construct a toy example: • 1) of low dimensionality, and • 2) in a discrete Universe where predictors (the X feature vector) define a small hypercube
- 6. Random Forests as a Monte Carlo Simulation • Monte Carlo simulations are often used to empirically describe distributions which either have no closed-form solution or have a solution intractably difficult to specify. The strategy of this presentation is to treat Random Forests as if it were a simulation and, using a simplified version of the algorithm and a low-dimensional example, to exhaustively specify the Universe it is attempting to analyze. We can then deterministically examine what Random Forests is doing. • "Randomness" in Random Forests is a shortcut technique that high dimensionality and predictors (features) measured as continuous variables make necessary. What if we removed those traits? • Indeed, in 1995 Tin Kam Ho of Bell Labs, who first coined the term "Random Forest" and introduced the use of feature subspaces to decouple trees (the key innovation used in Breiman's version), viewed randomness as simply an enabling technique, not something fundamental. She writes: Our method to create multiple trees is to construct trees in randomly selected subspaces of the feature space. For a given feature space of m [binary] dimensions there are 2^m subspaces in which a decision tree can be constructed. The use of randomness in selecting components of the feature space vector is merely a convenient way to explore the possibilities. [italics added]
- 7. Defining a "Discrete" Universe: Levels of Measurement • Four levels of data measurement are commonly recognized: • Nominal or Categorical (e.g., Race: White, Black, Hispanic, etc.) • Ordinal (e.g., Small, Medium, Large) • Interval (e.g., 1, 2, 3, 4, etc., where it is assumed that distances between numbers are the same) • Ratio (interval-level data that have a true zero and no negative numbers; ratio-level data allow one to make statements like "this number is twice as big as that number"). • Most statistical analyses (e.g., Regression or Logistic Regression) make an often-unspoken assumption that data are at least interval. • Even when expressed as a number (e.g., a FICO score) the metric may not be truly interval level. Yet often such measures are treated as if they were interval. FICO scores are, in fact, only ordinal but are typically entered into models designed to employ interval-level data. • For our purposes, a second data measurement distinction is also important: • Continuous: with infinite granularity between any two Interval data points, and • Discrete: with a finite region between (or around) data points such that within each region, no matter the size of the region, all characteristics are considered identical.
- 8. Defining a “Discrete” Universe (continued) • In statistical practice (e.g., Credit modeling) one of the most common pre-processing steps is to group predictors (feature variables) into discrete ordinal groupings (e.g., 850<=FICO, 750<=FICO<850, 700<=FICO<750, 675<=FICO<700, FICO<675). Notice that we have created 5 ordinal-measured groups, say A B C D E, and that each need not contain the same range of FICO points. As mentioned earlier, often these are (wrongly) labelled numerically 1 2 3 4 5 and entered into (e.g.) a Logistic Regression. • More correctly these groupings should be treated as Categorical variables and if entered into a Regression or Logistic Regression they should be coded as Dummy variables. But importantly, we have now clearly created discrete variables. Our analysis uses grouped scores as predictors and our results cannot say anything about different effects of values more granular than at the group level. • At a more philosophical level, all real datasets might be argued to contain, at best, ordinal-level discrete predictors. All observed data points are measured discretely (at some finite level of precision) and must be said to represent a grouping of unobserved data below that level of precision. Continuity, with an infinite number of points lying between each observation no matter how granular, is a mathematical convenience. What we measure we measure discretely. • Accordingly, this presentation models predictors as ordinal discrete variables. One easy way to conceptualize this is to imagine that we start with ordered predictors (e.g., FICO scores) that we group. In our toy example we have 3 predictors that we have grouped into 3 levels each.
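The pre-processing step described above can be sketched in a few lines. This is a minimal illustration, not production code: the cut-points mirror the slide's FICO example, and the function name and group labels are ours.

```python
# Group an ordered predictor (hypothetical FICO scores) into discrete
# ordinal bins, per the pre-processing step described in the text.

def fico_group(score):
    """Map a raw FICO score to one of five ordinal groups A-E.

    Cut-points follow the slide's example:
    850<=FICO, 750<=FICO<850, 700<=FICO<750, 675<=FICO<700, FICO<675.
    """
    if score >= 850:
        return "A"
    if score >= 750:
        return "B"
    if score >= 700:
        return "C"
    if score >= 675:
        return "D"
    return "E"

scores = [812, 760, 710, 690, 640, 850]
groups = [fico_group(s) for s in scores]
print(groups)  # ordinal labels, not interval numbers
```

As the slide warns, the resulting labels are ordinal: they may be ordered, but arithmetic on them (as if they were 1–5 interval values) is not justified.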
- 9. Our Toy Universe: Exploring it with an Unsupervised Random Forest Algorithm
- 10. Our Toy Universe • Consider a 3-variable, 3-levels-per-variable (2 cut-points) example. It looks a bit like a small Rubik's cube. Below is a cutaway of its top, middle and bottom levels, each containing 9 cells for a total of 27.
- 11. Each of the 27 cells making up the Universe can be identified in 2 ways: 1. By its position spatially along each of the 3 axes (e.g., A=1, B=1, C=1) 2. Equivalently by its cell letter (e.g., k). It is important to note that we have so far only discussed (and illustrated) exclusively Predictor Variables/Features (A, B, C) and the 3 levels of each (e.g., A0, A1, and A2). Any individual cell (say, t) is associated only with its location in the Predictor grid (t is found at location A2, B0, C1). Nothing has been said about dependent variable values (the target Y values, which Random Forests is ultimately meant to predict). This is purposeful. It is instructive to demonstrate that without observing Y values a Random Forest-like algorithm can be run as an Unsupervised algorithm that partitions the cube in specific ways. If the Universe is small enough and the algorithm simple enough, we can then enumerate all possible partitionings that might be used in a Supervised run (with observed Y values). Some subset of these partitionings the Supervised algorithm will use, but that subset is dependent upon the specific Y values which we will observe later as a second step.

             B=0                 B=1                 B=2
    C=2   (c) (l) (u)         (f) (o) (x)         (i) (r) (aa)
    C=1   (b) (k) (t)         (e) (n) (w)         (h) (q) (z)
    C=0   (a) (j) (s)         (d) (m) (v)         (g) (p) (y)
          A=0 A=1 A=2         A=0 A=1 A=2         A=0 A=1 A=2
- 12. A Random Forest-like Unsupervised Algorithm • Canonical Random Forest is a Supervised algorithm; our Unsupervised version is only "Random Forest-like." • Creating a diversified forest requires each tree growth to be decoupled from all others. Canonical Random Forest does this in two ways: • It uses only bootstrapped subsets of the full Training Set to grow each tree. • Importantly, at each decision-point (node) only a subset of available variables is selected to determine daughter nodes. This is Ho's "feature subspace selection." • How our Random Forest-like algorithm differs from Breiman's algorithm: • As with many other analyses of Random Forest, our Unsupervised algorithm uses the full Training Set each time a tree is built. • Breiman uses CART as the base tree-split method. CART allows only binary splits (only 2 daughter nodes per parent node split). We allow multiple daughter splits of a parent Predictor variable (as is possible in ID3 and CHAID); in fact, our algorithm insists that if a Predictor (in our example, the A or B or C dimension) is selected as a node splitter, all possible levels are made daughters. • Breiman's algorithm grows each tree to its maximum depth. Our approach grows to only two levels (vs. the 3 levels that a maximum 3-attribute model would allow). This is done primarily to avoid interpolating the input data (simply getting out what one puts in) in the case where all X-variable combinations are populated with Y (target) values. We relax this assumption later in Appendix B to show that, in the presence of "holes" (X-variable combinations without target values), while interpolation does occur for "non-holes," estimation for the "holes" is unaffected when growth is taken to the maximum.
- 13. A Random Forest-like Unsupervised Algorithm (cont'd) • In our simple unsupervised random forest-like algorithm the critical "feature subset selection parameter", often referred to as mtry, is set at a maximum of 2 (out of the full 3). This implies that at the root level the splitter is chosen randomly with equal probability as either A, B, or C and 3 daughter nodes are created (e.g., if A is selected at the root level, the next level consists of A0, A1, A2). • At each of the 3 daughter nodes the same procedure is repeated. Each daughter's selection is independent of her sisters. Each chooses a split variable between the 2 available variables. One variable, the one used in the root split, is not available (A in our example), so each sister in our example chooses (with equal probability) between B and C. Choosing independently leads to an enumeration of 2^3 = 8 permutations. This example is for the case where the choice of root splitter is A. Because both B and C could instead be root splitters, the total number of tree growths, each equally likely, is 24.

    24 EQUALLY-LIKELY TREE NAVIGATIONS (root splitter, then the three daughters' splitters):
    A is the root splitter: A(BBB), A(BBC), A(BCB), A(BCC), A(CBB), A(CBC), A(CCB), A(CCC)
    B is the root splitter: B(AAA), B(AAC), B(ACA), B(ACC), B(CAA), B(CAC), B(CCA), B(CCC)
    C is the root splitter: C(AAA), C(AAB), C(ABA), C(ABB), C(BAA), C(BAB), C(BBA), C(BBB)
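The enumeration above can be reproduced mechanically. A minimal sketch (variable names ours): pick a root attribute, then let each of its 3 daughters independently pick one of the 2 remaining attributes, giving 2^3 = 8 permutations per root and 24 in total.

```python
# Enumerate the 24 equally-likely tree navigations of the toy cube.
from itertools import product

attributes = ["A", "B", "C"]
navigations = []
for root in attributes:
    # mtry-style restriction: the root attribute is not reused by daughters.
    remaining = [a for a in attributes if a != root]
    # Each of the 3 daughter nodes independently chooses a splitter.
    for daughters in product(remaining, repeat=3):
        navigations.append((root, daughters))

print(len(navigations))  # 24
```

Each element, e.g. `("A", ("C", "C", "B"))`, corresponds to the slide's A(CCB) notation.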
- 14. The 8 unique partitionings when the root partition is A. For each of the 3 beginning root partition choices, our unsupervised random forest-like algorithm divides the cube into 8 unique collections of 1x3 rectangles. There are 18 such possible rectangles for each of the 3 beginning root partition choices. The slide's charts show the results of the algorithm's navigations when A is the root partition choice, A(BBB) through A(CCC). The 18 rectangles they aggregate are those running along the C-direction, (a,b,c), (d,e,f), (g,h,i), (j,k,l), (m,n,o), (p,q,r), (s,t,u), (v,w,x), (y,z,aa), and those running along the B-direction, (a,d,g), (b,e,h), (c,f,i), (j,m,p), (k,n,q), (l,o,r), (s,v,y), (t,w,z), (u,x,aa). Two analogous navigation results charts can be constructed for the root attribute choices of B and C.
- 15. The 8 unique partitionings when the root partition is A (cont.) Note that each cell (for example, Cell n, the middle-most cell) appears in 2 and only 2 unique 1x3 rectangles among the 8 navigations when the root partition is A. In the 4 navigations in which the A1 daughter splits on B (A(BBB), A(BBC), A(CBB), A(CBC)), n appears in the rectangle (m,n,o). In the 4 navigations in which the A1 daughter splits on C (A(BCB), A(BCC), A(CCB), A(CCC)), n appears in the rectangle (k,n,q).
- 16. The 8 unique partitionings when the root partition is A (cont.) Analogously, Cell a (the bottom/left-most cell) also appears in 2 and only 2 unique 1x3 rectangles when the root partition is A. In the 4 navigations in which the A0 daughter splits on B (the top row of the slide's charts: A(BBB), A(BBC), A(BCB), A(BCC)), Cell a appears in the rectangle (a,b,c). In the 4 navigations in which the A0 daughter splits on C (the bottom row: A(CCC), A(CCB), A(CBC), A(CBB)), Cell a appears in the rectangle (a,d,g). This pattern is true for all 27 cells.
- 17. Random Forests "self-averages", or Regularizes. Each of our algorithm's 24 equally-likely tree navigations reduces the granularity of the cube by a factor of 3, from 27 to 9. Each of the original 27 cells finds itself included in exactly two unique 1x3 rectangular aggregates for each root splitting choice. For example, Cell "a" is in Aggregate (a,b,c) and in Aggregate (a,d,g) when A is the root split choice (see previous slide), each in 4 of the 8 navigations. When B is the root split choice (not shown), Cell "a" appears again in Aggregate (a,b,c) and in a new Aggregate (a,j,s). When C is the root split choice, Cell "a" appears uniquely in the two Aggregates (a,j,s) and (a,d,g). Continuing with Cell a as our example and counting the rectangles in which Cell a appears across the 24 equally-possible navigations:

    Rectangle   A root   B root   C root   total
    (a,b,c)        4        4        -        8
    (a,d,g)        4        -        4        8
    (a,j,s)        -        4        4        8
    total          8        8        8       24

Now recall that reducing the granularity of the cube means that each cell is now represented by a rectangle which averages the member cells equally, i.e., Aggregate (a,b,c) is the average of a, b, and c. Therefore the 24-navigation average is [8*(a+b+c) + 8*(a+d+g) + 8*(a+j+s)] / 72. Rearranging = [24(a) + 8(b) + 8(c) + 8(d) + 8(g) + 8(j) + 8(s)] / 72. Simplifying = 1/3*(a) + 1/9*(b) + 1/9*(c) + 1/9*(d) + 1/9*(g) + 1/9*(j) + 1/9*(s). This is our Unsupervised RF-like algorithm's estimate for the "value" of Cell a in our Toy Example.
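The algebra above is easy to verify with exact arithmetic. A small sketch (the Y values below are arbitrary illustrative numbers, not from the slides): averaging the three aggregates, each weighted by its 8 occurrences among the 24 navigations, reproduces the 1/3 weight on Cell a and 1/9 on each of its six axis-neighbors.

```python
# Verify: [8*avg(a,b,c) + 8*avg(a,d,g) + 8*avg(a,j,s)] / 24
#         == (1/3)*a + (1/9)*(b + c + d + g + j + s)
from fractions import Fraction as F

def avg(*cells):
    """Equal-weight average of the cells in one 1x3 aggregate."""
    return sum(cells, F(0)) / len(cells)

# Arbitrary illustrative Y values for cells a, b, c, d, g, j, s.
a, b, c, d, g, j, s = map(F, [10, 2, 4, 6, 8, 3, 5])

navigation_average = (8 * avg(a, b, c) + 8 * avg(a, d, g) + 8 * avg(a, j, s)) / 24
closed_form = F(1, 3) * a + F(1, 9) * (b + c + d + g + j + s)
assert navigation_average == closed_form
print(navigation_average)
```

Using `Fraction` keeps the check exact; with floats the equality would only hold approximately.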
- 18. Random Forests "self-averages" (continued). Because each of the 24 navigations is equally-likely, our ensemble space for Cell "a" consists of equal parts of Aggregate (a,b,c), Aggregate (a,d,g) and Aggregate (a,j,s). Any attributes (e.g., some Target values Y) later assigned to the 27 original cells would also be averaged over the ensemble space. Note that because the original cell "a" appears in each of the Aggregates, it gets 3 times the weight that each of the other cells gets. If the original 27 cells are later all populated with Y values, our Unsupervised algorithm can be seen as simply regularizing those target Y values. Each regularized estimate equals 3 times the original cell's Y value, plus the Y values of the 6 cells within the remainder of the 1x3 rectangles lying along each dimension, the sum then divided by 9. It is important to note that the particular "geometry" of our Toy Example (an aggregation of rectangles) is specific to our assumption of stopping one level short of full growth. In any dimensionality making the same assumption, the geometry would be the same. In 4 dimensions (a 3x3x3x3 structure) we average four 1x3 rectangles, giving the cell being valued four times the weight of each of the other 8 cells (two neighbors along each of the 4 dimensions) in the weighted sum. That weighted sum is then divided by 12.
- 19. Numbers 1-27 randomly assigned to cells as Target Y values:

             B=0 (Bottom)        B=1 (Middle)        B=2 (Top)
    C=2    5   3  16           11   4  15            8  12  26
    C=1    2   7  21           17   1  25           19  14  27
    C=0   22  20  23            6   9  13           10  18  24
          A=0 A=1 A=2          A=0 A=1 A=2          A=0 A=1 A=2

Randomly-assigned Y values after Regularization by our RF-like algorithm:

              B=0 (Bottom)                 B=1 (Middle)                B=2 (Top)
    C=2    8 5/9   8 1/9  15 2/3      9 7/9   7      15 5/9     11 8/9  12 1/9  20
    C=1   10 7/9   9 1/9  18 1/9     12 7/9   8 7/9  18 7/9     15      14      23 1/3
    C=0   14 2/3  15 7/9  20 5/9     11 1/9   9 8/9  15 2/3     14 1/9  15 8/9  21
          A=0     A=1     A=2        A=0     A=1     A=2        A=0     A=1     A=2

Regularization by our Random Forest-like Algorithm reduces variance from 60 2/3 to 19.
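The regularized grid on this slide can be re-derived directly from the random 1-27 assignment using the formula from the previous slide: (3 times the cell's own Y value, plus its 6 axis-neighbors) / 9. A minimal sketch, with the grid transcribed from the slide (indexing convention ours):

```python
# Re-derive slide 19's regularized grid from its random 1-27 assignment.
# grid[b][c][a] holds the Y value at (A=a, B=b, C=c); the inner lists run
# C=0..C=2 within each B-panel, A=0..A=2 within each row.
from fractions import Fraction as F

grid = [
    [[22, 20, 23], [2, 7, 21], [5, 3, 16]],     # B=0 panel
    [[6, 9, 13], [17, 1, 25], [11, 4, 15]],     # B=1 panel
    [[10, 18, 24], [19, 14, 27], [8, 12, 26]],  # B=2 panel
]

def regularize(a, b, c):
    """(3 * own value + sum of the 6 axis-neighbors) / 9."""
    total = 3 * grid[b][c][a]
    total += sum(grid[b][c][x] for x in range(3) if x != a)  # along A
    total += sum(grid[y][c][a] for y in range(3) if y != b)  # along B
    total += sum(grid[b][z][a] for z in range(3) if z != c)  # along C
    return F(total, 9)

reg = [[[regularize(a, b, c) for a in range(3)] for c in range(3)]
       for b in range(3)]
print(reg[0][0][0])  # cell (A=0,B=0,C=0): 44/3, i.e. 14 2/3 as on the slide
```

Because regularization is a symmetric weighted average, the overall mean (14) is preserved while the spread shrinks, which is the variance reduction the slide reports.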
- 20. Choice of stopping point as equivalent to Pruning • Assume we perform a simple tree navigation of discrete data to full depth. No matter how many unique navigations there are, the final leaf nodes will always be the same; each leaf will contain a single data point and reproduce whatever was input. The tree is said to "interpolate" the input data. • If we were to prune the tree arbitrarily by removing the final (deepest) split of any specific navigation path, it would produce a leaf that separates the dataset subsets using one less variable than the total number of variables, ignoring the last. One such path from our Toy Example results in the 9 subsets of [A0, B0, all C's], [A0, B1, all C's], [A0, B2, all C's], [A1, B0, all C's], [A1, B1, all C's], [A1, B2, all C's], [A2, B0, all C's], [A2, B1, all C's], [A2, B2, all C's]. This example is illustrated below left on the slide. • Among the 24 unique full navigations possible in our Toy Example, 8 ignore C-variable distinctions, 8 ignore B-variable distinctions and 8 ignore A-variable distinctions (see Slide 13). This gives the "stop splitting one level short of full growth" (or "prune away the final split in each path") approach the specific estimation geometry of aggregating rectangles. The geometry is different if we prune even further (see Appendix C for an example).
- 21. Practical considerations even in a Universe defined as Discrete: why real-world analyses require randomness as a simulation expedient • Even accepting the notion that any set of ordered predictors (some containing hundreds or even thousands of ordered observations, e.g., Annual Income) can be theoretically conceptualized as simply Discrete regions, our ability to enumerate all possible outcomes of a Random Forest-like algorithm as we have done in this presentation is severely limited. Even grouping each ordered Predictor into a small number of categories does not help when the number of Predictors (dimensions) is even just modestly large. • For example, even for an intrinsically 3-level set of ordinal predictors (Small, Medium, and Large), a 25-attribute Universe has 3^25 (approximately 8.5*10^11, nearly a trillion) unique X-vector grid cell locations and nearly a trillion Y-predictions to make. • Further complications ensue from using a tree algorithm that restricts node splitting to just two daughters, a binary splitter. A single attribute of more than 2 levels (say Small, Medium, and Large) can be used more than once within a single navigation path, say first as "Small or Medium" vs. "Large" and then, somewhere further down the path, the first daughter can be split "Small" vs. "Medium." Indeed, making this apparently small change to a binary splitter in our Toy Example increases the enumeration of complete tree growths to an unmanageable 86,400 (see Appendix A).
- 22. A Final Complication: "Holes" in the Discrete Universe • Typically, a sample drawn from a large universe will have "holes" in its Predictor Space; by this is meant that not all predictor (X-vector) combinations will have Target Y values represented in the sample, and there are usually many such holes. Recall our trillion-cell example: a large majority of cell locations will likely not have associated Y values. This is a potential problem even for our Toy Universe, because if there are holes then the simple Regularization formula suggested earlier will not work. • Luckily, the basic approach of averaging the three Aggregates our algorithm defines for each Regularized Y-value does continue to hold. But now the process has 2 steps. The first step is to find an average value for each of the three Aggregates using only those cells that have values. The 3 Aggregates are then summed and divided by 3. The next slide compares the Regularized estimates derived earlier from our randomized 1-27 Y-value full assignment case to the case where 6 cell locations were arbitrarily assigned to be holes and contained no original values. Those original locations set as holes are colored in blue in the second chart. • Some 1x3 constituent rectangular Aggregates may be missing ALL 3 values. For example, consider the blue value 12.74 at cell location A=1, B=0, C=1 on the next slide. The 1st of 3 Aggregates averages the original values directly above and below that cell: (20+3)/2 = 11.5. The 2nd Aggregate averages the values directly to the right and left: (2+21)/2 = 11.5 (coincidentally). The 3rd, however, is troublesome; all cells along the "look-through" dimension are missing original Y-values. For an estimate of this 3rd Aggregate we first determine which attribute values the 3 cells missing Y values share, here C=1 and A=1. We then average all populated cells where C=1 (5 cells, avg = 17.2); next those cells where A=1 (4 cells, avg = 13.25). Then we find the average of those 2 numbers (= 15.23).
The estimated Regularized cell value is (11.5 + 11.5 + 15.23)/3 = 12.74. • Appendix B explores the nature and estimation of holes more fully, using a "Sociology Example." That appendix also comments on the effects of growing a Random Forest to full growth in the presence of holes.
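The two-step hole estimate can be checked with a few lines of arithmetic. A minimal sketch for cell (A=1, B=0, C=1): the neighbor values and the two marginal averages (17.2 over the five populated C=1 cells, 13.25 over the four populated A=1 cells) are taken directly from the slide, not recomputed here.

```python
# Worked check of the two-step hole estimate for cell (A=1, B=0, C=1).
agg_along_c = (20 + 3) / 2    # 1st aggregate: cells directly above/below
agg_along_a = (2 + 21) / 2    # 2nd aggregate: cells directly left/right
# 3rd aggregate: all three cells along the look-through (B) dimension are
# holes, so average the two marginal averages the slide reports.
agg_along_b = (17.2 + 13.25) / 2

estimate = (agg_along_c + agg_along_a + agg_along_b) / 3
print(round(estimate, 2))  # 12.74
```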
- 23. Regularization of Universe with Initial Holes (second chart). Variance: 14.2

Randomly-assigned Y values after Regularization by our RF-like algorithm (no holes):

              B=0 (Bottom)                 B=1 (Middle)                B=2 (Top)
    C=2    8 5/9   8 1/9  15 2/3      9 7/9   7      15 5/9     11 8/9  12 1/9  20
    C=1   10 7/9   9 1/9  18 1/9     12 7/9   8 7/9  18 7/9     15      14      23 1/3
    C=0   14 2/3  15 7/9  20 5/9     11 1/9   9 8/9  15 2/3     14 1/9  15 8/9  21
          A=0     A=1     A=2        A=0     A=1     A=2        A=0     A=1     A=2

Regularization with 6 cells' original Y-values missing (those holes are shown in blue on the slide):

              B=0 (Bottom)              B=1 (Middle)              B=2 (Top)
    C=2    8.56   9.00  15.67       10.78  11.11  15.33       11.89  12.61  20.00
    C=1   11.28  12.74  18.50       13.67  15.02  18.33       16.00  17.74  24.22
    C=0   14.67  17.39  20.56       11.17  13.78  14.50       14.11  17.11  21.00
          A=0    A=1    A=2         A=0    A=1    A=2          A=0    A=1    A=2
- 24. The Optimization Step • Most discussions of Random Forests spend considerable space describing how the base tree algorithm makes its split decision at each node (e.g., examining Gini or Information Value). In our enumeration approach this optimization step (where we attempt to separate signal from noise) is almost an afterthought. We know all possible partitions into which our simple cube CAN be split. It is only left to determine, given our data, which specific navigations WILL be chosen. • In our Toy Example this step is easy. Recall that at the root level, 1 of our 3 attributes is randomly excluded from consideration (this is the effect of setting mtry=2). Our data (the 6-hole case) shows that if A is not the excluded choice (which is 2/3 of the time) then the A Attribute offers the most information gain. At the second level the 3 daughters' choices lead to the full navigation choice A(CCB). When A is excluded (1/3 of the time), C offers the best information gain and, after examining the optimal daughters' choices, the full navigation choice is C(BAA). The optimal Regularized estimates are simply the (2/3)*A(CCB) estimates + (1/3)*C(BAA) estimates (see Appendix B for fuller detail).

Optimized Regularization Estimates, weighting 2/3 A(CCB) and 1/3 C(BAA):

              B=0                       B=1                       B=2
    C=2    8.00   7.50  19.67        8.00   7.50  15.67        8.00   7.50  23.45
    C=1   12.67  14.57  21.33       12.67  14.57  17.33       12.67  14.57  25.11
    C=0   15.67  19.89  20.56       11.61  15.83  12.50       14.22  18.44  22.89
          A=0    A=1    A=2         A=0    A=1    A=2          A=0    A=1    A=2
- 25. An Example using Real Data: The Kaggle Titanic Disaster Dataset
- 26. The Kaggle Titanic Disaster Dataset • Kaggle is a website hosting thousands of Machine Learning datasets, featuring contests between practitioners that allow them to benchmark the predictive power of their models against peers. • Kaggle recommends that beginners use the Titanic Disaster dataset as their first contest. It is a well-specified, familiar and real example whose goal (predicting who among the ship's passengers will survive, given attributes known about them from the ship's manifest) is intuitive. The details of the beginners' contest (including a video) can be found here: kaggle.com/competitions/titanic . • Up until now our analysis has used Regression Trees as our base learners. That is, we assumed the Target (Y) values we were attempting to estimate were real, interval-level variables. The Titanic problem is a (binary) classification problem: either the passenger survived or did not (coded 1 or 0). Nevertheless, we will continue to use Regression Trees as base learners for our Titanic analysis, with each tree's estimates (as well as their aggregations) falling between 0 and 1. As a last step, each aggregated estimate >.5 will then be converted to a "survive" prediction, otherwise, "not survive".
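That last conversion step is a simple threshold rule. A minimal sketch (the function name and the scores below are ours, purely illustrative, not model output):

```python
# Convert aggregated regression-tree estimates into class predictions:
# estimates strictly above 0.5 become "survive", otherwise "not survive".
def classify(score, threshold=0.5):
    return "survive" if score > threshold else "not survive"

ensemble_scores = [0.82, 0.49, 0.50, 0.61]  # illustrative aggregated estimates
print([classify(s) for s in ensemble_scores])
```

Note the strict inequality: an aggregated estimate of exactly 0.5 falls to "not survive" under the rule as stated on the slide.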
- 27. The Kaggle Titanic Disaster Dataset (continued) • Excluding the record identifier and binary Target value (Y), the dataset contains 10 raw candidate predictor fields for each of 891 Training set records: Pclass (1st, 2nd, or 3rd class accommodations), Name, Sex, Age, SibSp (count of siblings traveling with passenger), Parch (count of parents and children traveling with passenger), Ticket (alphanumeric ticket designator), Fare (cost of passage in £), Cabin (alphanumeric cabin designator), Embark (S, C, Q: one of 3 possible embarkation sites). • For our modeling exercise only Pclass, Sex, Age, and a newly constructed variable Family_Size (= SibSp + Parch + 1) were ultimately used.* • Our initial task was to pre-process each of these 4 attributes into 3 levels (where possible). Pclass broke naturally into 3 levels. Sex broke naturally into only 2. The raw interval-level attribute Age contained many records with "missing values," which were initially treated as a separate level, with the non-missing values bifurcated at the cut-point providing the highest information gain (0-6 years of age, inclusive, vs. >6 years of age). However, it was determined that there was no statistical reason to treat "missing value" records differently from aged >6 records, and these two initial levels were combined, leaving Age a 2-level predictor (Aged 0-6 inclusive vs. NOT Aged 0-6 inclusive). • The integer, ordinal-level Family_Size attribute was cut into 3 levels using the optimal information gain metric: 2-4 members (small family); >4 members (big family); 1 member (traveling alone). *Before it was decided to exclude the Embark attribute it was discovered that 2 records lacked values for that attribute, and those records were excluded from the model building, reducing the Training set record count to 889.
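The pre-processing described above can be sketched as two small functions. The cut-points follow the text (Age 0-6 inclusive vs. everything else including missing; Family_Size = SibSp + Parch + 1 cut into alone / small / big); the function names and level labels are ours.

```python
# Discretize the Titanic predictors Age and Family_Size into the ordinal
# levels described on the slide.

def age_level(age):
    """2-level Age: young child (0-6 incl.) vs. everyone else, where
    "everyone else" includes records with missing Age (None)."""
    return "0-6" if age is not None and age <= 6 else "not 0-6"

def family_size_level(sibsp, parch):
    """3-level Family_Size from the raw SibSp and Parch counts."""
    size = sibsp + parch + 1  # the passenger plus accompanying family
    if size == 1:
        return "alone"
    if size <= 4:
        return "small"  # 2-4 members
    return "big"        # >4 members

# e.g. a passenger travelling with spouse and two children, age unknown:
print(age_level(None), family_size_level(1, 2))
```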
- 28. Approaching our RF-like analysis from a different angle • Our pre-processed Titanic predictor "universe" comprises 36 cells (= 3x2x2x3) and we could proceed with a RF-like analysis in the same manner as we did with our 27-cell Toy Example. The pre-processed Titanic universe appears only slightly more complex than our Toy Example (4-dimensional rather than 3), but the manual tedium required grows exponentially. There is a better way. • Note that up until this current "Real Data" section all but the last slide (Slide 24) focused exclusively on the unsupervised version of our proposed RF-like algorithm; only Slide 24 talked about optimization ("separating signal from noise"). As noted earlier, this was purposeful. We were able to extract an explicit regularization formula for our Toy Example (without holes, Slide 18). Higher dimensionality and the presence of "holes" complicate deriving any generalized statement, but at least our example provides a sense of what a canonical Random Forest might be doing spatially (it averages over a set of somehow-related, rectangularly-defined "neighbors" and completely ignores all other cells). This gives intuitive support to those who attribute RF's success to its relationship to: • "Nearest Neighbors": Lin, Yi and Jeon, Yongho. Random Forests and Adaptive Nearest Neighbors. Journal of the American Statistical Association, Vol. 101 (2006); or the Totally randomized trees model of Geurts, Pierre, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning 63.1 (2006), p. 3-42; • "Implicit Regularization": Mentch, Lucas and Zhou, Siyu. Randomization as Regularization: A Degrees of Freedom Explanation for Random Forest Success. Journal of Machine Learning Research 21 (2020), p. 1-36.
- 29. Approaching our RF-like analysis from a different angle: Starting at the optimization step • Notice that for all the work we did to explicitly define exactly what all 24 possible tree navigations would look like in our Toy Example, we ended up evaluating only 2. But we could have known in advance what those 2 would be, if not explicitly, then by way of a generic description. • Knowing that at the root level an optimization would pick the best attribute to split on given the choices available, our first question is "would that first-best attribute, either A or B or C, be available?" There are 3 equally likely 2-attribute subsets at the Toy Example's root level: AB, AC, BC. Note that no matter which of the 3 variables turns out to be the optimal one, it will be available 2 out of 3 times. • Recall our RF-like model's approach of selecting, at any level, from among only a subset of attributes not previously used in the current branch. Is this "restricted choice" constraint relevant at levels deeper than the root level? In the Toy Example the answer is "no." At the second level only 2 attributes are available for each daughter node to independently choose between and, although those independent choices may differ, they will be optimal for each daughter. • Because we stop one level short of exhaustive navigation (recall, to avoid interpolating the Training Set), we are done. This is why the final estimates in our Toy Example involved simply weighting the estimates of two navigations: (2/3)*A(CCB) estimates + (1/3)*C(BAA) estimates. The last navigation covers the 1 time out of 3 that A (which turns out to be the optimal root attribute) is not available at the root level.
- 30. A Generic Outline of the Optimization Step for our Titanic Disaster Data Set • A mapping of the relevant tree navigations for the Titanic predictor dataset we have constructed is more complex than that for the Toy Example, principally because we require a 3-deep branching structure and have 4 attributes to consider rather than 3. We will continue, however, to choose no more than 2 from among currently available (unused) attributes at every level (i.e., mtry=2). • At the root level, 2 of the available 4 are randomly selected, resulting in 6 possible combinations: (AB), (AC), (AD), (BC), (BD), (CD). Note that no matter which of the 4 variables turns out to be the optimal one, it will be available 3 out of 6 times. For the 1/2 of the time that the first-best optimizing attribute (as measured by information gain) is not among the chosen couplet, the second-best attribute is available 2 out of those 3 times; the second-best solution is therefore available for 1/3 of the total draws (1/2 * 2/3). The third-best solution is chosen only when the 2 superior choices are unavailable, 1/6 of the time. • For the Titanic predictor set as we have constructed it, the root level choices are: • Sex (First-Best), weighted 1/2 • PClass (Second-Best), weighted 1/3 • Family_Size (Third-Best), weighted 1/6
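The 1/2, 1/3, 1/6 weights above can be verified by brute-force enumeration of the six possible mtry=2 draws; a small sketch (the best-to-worst ranking is the one stated on the slide, the code itself is illustrative):

```python
from itertools import combinations

# With mtry=2 of 4 attributes there are 6 equally likely root-level draws.
# An attribute "wins" a draw when it is present and no better-ranked
# attribute is, so its weight is (winning draws) / (total draws).
ranked = ["Sex", "PClass", "Family_Size", "Age"]   # best-to-worst per the slides
subsets = list(combinations(ranked, 2))            # the 6 possible couplets

def weight(attr):
    better = ranked[:ranked.index(attr)]
    wins = sum(1 for s in subsets
               if attr in s and all(b not in s for b in better))
    return wins / len(subsets)

weights = {a: weight(a) for a in ranked}
```

As expected, the fourth-ranked attribute never wins a root-level draw: whichever couplet contains it also contains something better.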
- 31. A Generic Outline of the Optimization Step for our Titanic Disaster Data Set (continued) • For each of the 3 choices of Root Level attribute, the 2 or 3 daughter nodes each independently choose the best attribute from among those remaining. If, for example, A is taken to be the Root Level splitter, it is no longer available at Level 2, whose choices are B, C, or D. Taken 2 at a time, the possible couplets are (BC), (BD), (CD); 2/3 of the time the First-Best splitter is available. At deeper levels, the optimal attribute is always chosen. The weightings of each of the 20 navigations are below.

Level 1 (Root) = First-Best Choice (Sex), weight 1/2:

| Tree | Daughter 1 | Daughter 2 | Weighting |
|---|---|---|---|
| tree 1 | First Best | First Best | 2/9 |
| tree 2 | First Best | Second Best | 1/9 |
| tree 3 | Second Best | First Best | 1/9 |
| tree 4 | Second Best | Second Best | 1/18 |

Level 1 (Root) = Second-Best Choice (PClass), weight 1/3:

| Tree | Daughter 1 | Daughter 2 | Daughter 3 | Weighting |
|---|---|---|---|---|
| tree 5 | First Best | First Best | First Best | 8/81 |
| tree 6 | First Best | First Best | Second Best | 4/81 |
| tree 7 | First Best | Second Best | First Best | 4/81 |
| tree 8 | First Best | Second Best | Second Best | 2/81 |
| tree 9 | Second Best | First Best | First Best | 4/81 |
| tree 10 | Second Best | First Best | Second Best | 2/81 |
| tree 11 | Second Best | Second Best | First Best | 2/81 |
| tree 12 | Second Best | Second Best | Second Best | 1/81 |

Level 1 (Root) = Third-Best Choice (Family_Size), weight 1/6:

| Tree | Daughter 1 | Daughter 2 | Daughter 3 | Weighting |
|---|---|---|---|---|
| tree 13 | First Best | First Best | First Best | 4/81 |
| tree 14 | First Best | First Best | Second Best | 2/81 |
| tree 15 | First Best | Second Best | First Best | 2/81 |
| tree 16 | First Best | Second Best | Second Best | 1/81 |
| tree 17 | Second Best | First Best | First Best | 2/81 |
| tree 18 | Second Best | First Best | Second Best | 1/81 |
| tree 19 | Second Best | Second Best | First Best | 1/81 |
| tree 20 | Second Best | Second Best | Second Best | 1/162 |
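The 20 navigation weights can be reproduced mechanically: each tree's weight is its root weight times, for every daughter, 2/3 when that daughter's first-best splitter is available or 1/3 when only its second-best is. A sketch using exact fractions (the variable names are ours):

```python
from fractions import Fraction
from itertools import product

# Root-level weights and daughter counts, per the slides: Sex splits into
# 2 daughters, PClass and Family_Size into 3 each.
roots = {"Sex": (Fraction(1, 2), 2),
         "PClass": (Fraction(1, 3), 3),
         "Family_Size": (Fraction(1, 6), 3)}
F, S = Fraction(2, 3), Fraction(1, 3)   # first-/second-best availability odds

weights = []
for root, (root_w, daughters) in roots.items():
    # each daughter independently lands on its first- or second-best splitter
    for combo in product([F, S], repeat=daughters):
        tree_w = root_w
        for c in combo:
            tree_w *= c
        weights.append(tree_w)
```

The weights sum to exactly 1, confirming that the 20 navigations exhaust the probability mass of the mtry=2 draws.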
- 32. Predictions for the Titanic Contest using our RF-like algorithm • The weighted sum of the first 5 trees (out of 20) is displayed in separate Male and Female charts on this and the next slide. An Average (normalized) metric of >= .5 implies a prediction of "Survived", otherwise "Not." These first 5 trees represent nearly 60% of the weight of a complete RF-like algorithm's weighted opinion and are unlikely to yield a classification different from one based on all 20.

MALES

| Sex | Pclass | Young_Child? | Family_Size | H | Trees 1&2 (weight 1/3) | Trees 3&4 (weight 1/6) | Tree 5 (weight 8/81) | Average (normalized) |
|---|---|---|---|---|---|---|---|---|
| Male | 3rd | NOT Young_Child | Family_2_3_4 | | 0.123 | 0.123 | 0.327 | 0.157 |
| Male | 3rd | NOT Young_Child | Family_Alone | | 0.123 | 0.123 | 0.210 | 0.137 |
| Male | 3rd | NOT Young_Child | Family_Big | | 0.123 | 0.123 | 0.050 | 0.111 |
| Male | 3rd | Young_Child | Family_2_3_4 | | 1.000 | 0.429 | 0.933 | 0.830 |
| Male | 3rd | Young_Child | Family_Alone | H | 0.667 | 0.429 | 1.000 | 0.656 |
| Male | 3rd | Young_Child | Family_Big | | 0.111 | 0.429 | 0.143 | 0.205 |
| Male | 2nd | NOT Young_Child | Family_2_3_4 | | 0.090 | 0.090 | 0.547 | 0.165 |
| Male | 2nd | NOT Young_Child | Family_Alone | | 0.090 | 0.090 | 0.346 | 0.132 |
| Male | 2nd | NOT Young_Child | Family_Big | | 0.090 | 0.090 | 1.000 | 0.240 |
| Male | 2nd | Young_Child | Family_2_3_4 | | 1.000 | 1.000 | 1.000 | 1.000 |
| Male | 2nd | Young_Child | Family_Alone | H | 0.667 | 1.000 | 0.346 | 0.707 |
| Male | 2nd | Young_Child | Family_Big | H | 0.111 | 1.000 | 1.000 | 0.505 |
| Male | 1st | NOT Young_Child | Family_2_3_4 | | 0.358 | 0.358 | 0.735 | 0.420 |
| Male | 1st | NOT Young_Child | Family_Alone | | 0.358 | 0.358 | 0.523 | 0.385 |
| Male | 1st | NOT Young_Child | Family_Big | | 0.358 | 0.358 | 0.667 | 0.409 |
| Male | 1st | Young_Child | Family_2_3_4 | | 1.000 | 1.000 | 0.667 | 0.945 |
| Male | 1st | Young_Child | Family_Alone | H | 0.667 | 1.000 | 0.523 | 0.736 |
| Male | 1st | Young_Child | Family_Big | H | 0.111 | 1.000 | 0.667 | 0.450 |
- 33. Predictions for the Titanic Contest using our RF-like algorithm (cont'd) • Generally, most females Survived, and most males did not. Predicted exceptions (highlighted in red and green in the original charts) are captured in the decision rule. A complete decision rule would read: "Predict that young males ('children') in 2nd Class survive, along with male children in 3rd and 1st Classes not in Big Families; all other males perish. All females survive except those in 3rd Class traveling in Big Families and female children in 1st Class."

FEMALES

| Sex | Pclass | Young_Child? | Family_Size | H | Trees 1&3 (weight 1/3) | Trees 2&4 (weight 1/6) | Tree 5 (weight 8/81) | Average (normalized) |
|---|---|---|---|---|---|---|---|---|
| Female | 3rd | NOT Young_Child | Family_2_3_4 | | 0.561 | 0.561 | 0.327 | 0.522 |
| Female | 3rd | NOT Young_Child | Family_Alone | | 0.617 | 0.617 | 0.210 | 0.550 |
| Female | 3rd | NOT Young_Child | Family_Big | | 0.111 | 0.111 | 0.050 | 0.101 |
| Female | 3rd | Young_Child | Family_2_3_4 | | 0.561 | 0.561 | 0.933 | 0.622 |
| Female | 3rd | Young_Child | Family_Alone | | 0.617 | 0.617 | 1.000 | 0.680 |
| Female | 3rd | Young_Child | Family_Big | | 0.111 | 0.111 | 0.143 | 0.116 |
| Female | 2nd | NOT Young_Child | Family_2_3_4 | | 0.914 | 0.929 | 0.547 | 0.858 |
| Female | 2nd | NOT Young_Child | Family_Alone | | 0.914 | 0.906 | 0.346 | 0.818 |
| Female | 2nd | NOT Young_Child | Family_Big | | 0.914 | 1.000 | 1.000 | 0.952 |
| Female | 2nd | Young_Child | Family_2_3_4 | | 1.000 | 0.929 | 1.000 | 0.980 |
| Female | 2nd | Young_Child | Family_Alone | H | 1.000 | 0.906 | 0.346 | 0.866 |
| Female | 2nd | Young_Child | Family_Big | H | 1.000 | 1.000 | 1.000 | 1.000 |
| Female | 1st | NOT Young_Child | Family_2_3_4 | | 0.978 | 0.964 | 0.735 | 0.934 |
| Female | 1st | NOT Young_Child | Family_Alone | | 0.978 | 0.969 | 0.523 | 0.900 |
| Female | 1st | NOT Young_Child | Family_Big | | 0.978 | 1.000 | 0.667 | 0.933 |
| Female | 1st | Young_Child | Family_2_3_4 | | 0.000 | 0.964 | 0.667 | 0.378 |
| Female | 1st | Young_Child | Family_Alone | H | 0.000 | 0.969 | 0.523 | 0.356 |
| Female | 1st | Young_Child | Family_Big | H | 0.000 | 1.000 | 0.667 | 0.388 |
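The Average (normalized) column in these charts is a weighted mean of the displayed tree estimates, divided by the total weight in play; e.g., the "Male, 3rd Class, Young_Child, Family_2_3_4" row of the previous chart reproduces as follows (a sketch; the helper name is ours):

```python
# Normalized weighted average of a cell's tree estimates, as in the charts:
# sum(estimate * weight) / sum(weight) over the trees actually displayed.
def aggregate(estimates_and_weights):
    num = sum(e * w for e, w in estimates_and_weights)
    den = sum(w for _, w in estimates_and_weights)
    return num / den

# (estimate, weight) pairs: trees 1&2 at 1/3, trees 3&4 at 1/6, tree 5 at 8/81
row = [(1.000, 1/3), (0.429, 1/6), (0.933, 8/81)]
score = aggregate(row)
prediction = "Survived" if score >= 0.5 else "Not"
```

Because the denominator uses only the weight of the 5 displayed trees (about 60% of the total), the aggregation is already normalized without waiting for the remaining 15.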
- 34. Performance of our RF-like algorithm on Titanic test data • Although not the focus of this presentation, it is instructive to benchmark the performance of our simple "derived Rules Statement" (based on our RF-like model) against models submitted by others. • Kaggle provides a mechanism by which one can submit a file of predictions for a Test set of passengers that is provided without labels (i.e., "Survived or not"). Kaggle will return a performance evaluation summarized as the % of Test records predicted correctly. • Simply guessing that "all Females survive, and all Males do not" results in a score of 76.555%. Our Rules Statement gives an only modestly better score, 77.751%. A "great" score is >80% (estimated to place one in the top 6% of legitimate submissions). Improvements derive from tedious "feature-engineering." In contrast, our data prep involved no feature-engineering except for the creation of a Family_Size construct (= SibSp + Parch + 1). Also, we only coarsely aggregated our 4 employed predictors into no more than 3 levels each. Our goal was to uncover the mechanism behind a RF-like model rather than simply (and blindly) "get a good score." Interestingly, the canonical Random Forest example featured in the Titanic tutorial gives a score of 77.511%, consistent with (actually, slightly worse than) our own RF-like set of derived Decision Rules; again, both without any heavy feature-engineering. Reference: https://www.kaggle.com/code/carlmcbrideellis/titanic-leaderboard-a-score-0-8-is-great.
- 35. APPENDICES APPENDIX A: The problem of using binary splitters APPENDIX B: A “Sociological Example” APPENDIX C: Aggregate “Geometries” when pruning more than one level
- 36. APPENDIX A: The problem of using binary splitters • Treat our toy "3-variable, 2 split-points each" set-up as a simple 6-variable set of binary cut-points. • Splitting on one of the 6 top-level cut-points results in a pair of child nodes. Independently splitting on both right and left sides of this first pair involves a choice from among 5 possible Level 2 splits per child. The independent selection per child means that the number of possible path combinations at Level 2 is 5² (= 25). By Level 2 we have made 2 choices, the choice of top-level cut and the choice of second-level pair: one path out of 6·25 (= 150). The left branch of each second-level pair now selects from its own 4 third-level branch-pair options; so too does the right branch, independently. This independence implies that there are 4² (= 16) third-level choices. So, by Level 3 we will have chosen one path from among 16·150 (= 2,400) possibilities. Proceeding accordingly, at Level 4 each of these Level 3 possibilities chooses independent left and right pairs from among 3 remaining cut-points, so 3² (= 9) total possibilities per each of the 2,400 Level 3 nodes (= 21,600). At Level 5, 2² (= 4) is multiplied in, but at Level 6 there is only one possible splitting option, a factor of 1² (= 1). The final number of possible paths for our "simple" 27-discrete-cell toy example is 86,400 (= 4 · 21,600). • In summary, there are 6 · 5² · 4² · 3² · 2² · 1² = 86,400 equally likely configurations. This is the universe from which a machine-assisted Monte Carlo simulation would sample.
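The count can be confirmed with a two-line product: the root contributes a bare factor of 6, and every deeper level contributes (remaining cut-points)² because the left and right children choose independently:

```python
# Count the equally likely binary-split navigations of the 6-cut-point toy
# universe: 6 root choices, then (remaining)^2 per level for the two
# independently splitting children.
total = 6
for remaining in (5, 4, 3, 2, 1):
    total *= remaining ** 2
```

This is the 86,400-configuration universe a machine-assisted Monte Carlo simulation would sample from.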
- 37. Appendix B: A "Sociological Example" using our Toy Universe cube with holes (bottom chart of Slide 23) • This Appendix puts fictitious, somewhat awkward "sociological" labels on the X-variables of our Toy Universe with holes: • Dimension A is GENDER: A0 Genetic Female; A1 Genetic Male; A2 Do Not Choose to Identify with Either • Dimension B is FAMILY STATUS: B0 Have Had Children; B1 Currently Pregnant; B2 Have No Children • Dimension C is RACE: C0 White; C1 Black; C2 Race Other than White or Black • The purpose of this exercise is twofold: • First, to illustrate that holes can be present in the data either because certain X-variable instantiations are impossible even in the Full Universe or, more likely, are possible in the Universe but simply not represented among the datapoints in our particular sample. • Second, to illustrate that if one were content with interpolating in the final aggregate tree those input datapoints that are fully defined in the Training Sample (i.e., leaving Y-value estimates for non-holes un-regularized), then growing a Random Forest to full depth is fine, because estimates of "hole" Y-values do not change.
- 38. Why might a dataset have "holes"? • Below is reproduced the chart at the top of Slide 19, wherein the numbers 1 through 27 were randomly assigned as Y-values to our Toy Example's 27-cell cube, but here the cell locations identified as missing Y-values later on Slide 23 ("holes") are shown in parentheses with their letter label replacing the missing Y-values. Sociological labels have replaced the original A, B, C abstract dimensions. • Imagine that the 21 displayed Y-values are measures of some interval-level metric, say some measure of average "happiness" rounded to an integer conveniently matching the values we saw on Slide 19. The data are from a survey of some modest number of respondents. Each respondent is simply asked the 3 sociological questions, GENDER, FAMILY STATUS, and RACE (each with 3 supposedly mutually exclusive and exhaustive choices) and a panel of questions that result in their individual "happiness" score. All respondents fall within exactly 1 of the 27 cells based on the sociological questions, and their individual happiness score is averaged in with all others sharing their sociological cell. These are the table's Y-values.

Have Had Children:

| | Female | Male | Neither |
|---|---|---|---|
| White | 5 | 3 | 16 |
| Black | 2 | (k) | 21 |
| Other | 22 | 20 | 23 |

Currently Pregnant:

| | Female | Male | Neither |
|---|---|---|---|
| White | 11 | (o) | 15 |
| Black | 17 | (n) | (w) |
| Other | 6 | (m) | 13 |

Have No Children:

| | Female | Male | Neither |
|---|---|---|---|
| White | 8 | 12 | 26 |
| Black | 19 | (q) | 27 |
| Other | 10 | 18 | 24 |
- 39. Why might a dataset have "holes"? (continued) • There are 6 sociological cells where there were no respondents and therefore no average "happiness" score: • Male, Black, with Children • Male, Black, Pregnant • Male, White, Pregnant • Male, race Other than White or Black, Pregnant • Neither Female nor Male, Black, Pregnant • Male, Black, no Children • The cell descriptions in red (the Male, Pregnant combinations) are combinations that cannot exist in the Universe. A Genetic Male cannot be pregnant (the anomaly may have resulted from a careless wording of the answer choice, which might better have read "Are you or your partner currently pregnant?"). • The second major source of missing values may be said to result from sampling inadequacies. No Male Blacks were included among the respondents (descriptions in blue). This is an example of a far more common source of missing values, wherein the particular combination of interest is possible in the Universe but is not among the Training Set samples. Often such missing values represent X-value combinations that are rare in the Universe (perhaps the orange example above) as opposed to our contrived example here of not sampling any Male Blacks.
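Holes of either kind are found the same way mechanically: compare the full X-grid against the observed combinations. A sketch with illustrative level names (not the survey's exact wording):

```python
from itertools import product

# Find "holes": X-combinations in the full 3x3x3 grid with no respondents.
GENDER = ["Female", "Male", "Neither"]
FAMILY = ["Children", "Pregnant", "NoChild"]
RACE = ["White", "Black", "Other"]

full_grid = set(product(GENDER, FAMILY, RACE))
# The 6 cells with no respondents, per the slide (impossible combos plus
# the sampling gap of no Black male respondents)
missing = {
    ("Male", "Children", "Black"),
    ("Male", "Pregnant", "Black"),
    ("Male", "Pregnant", "White"),
    ("Male", "Pregnant", "Other"),
    ("Neither", "Pregnant", "Black"),
    ("Male", "NoChild", "Black"),
}
observed = full_grid - missing
holes = sorted(full_grid - observed)
```

Note the set difference cannot distinguish impossible-in-the-Universe combos from mere sampling gaps; that distinction requires domain knowledge, which is the slide's point.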
- 40. In Slide 24 we discussed an Optimization Step for our Toy Example which combined selected Unsupervised Navigations to calculate a Supervised Random Forest prediction for each of our 27 cells. The Supervised version combined the predictions of the A(CCB) navigation and the C(BAA) navigation, weighting the first with twice the weight of the second. On this and the next page are presented full tree navigations using our Sociological labels, now growing each tree to maximum depth. Recall, in contrast, that the Supervised Optimization on Slide 24 stopped one split short of full growth. We then compare the Supervised Result derived here to that of Slide 24 to demonstrate that estimates for the dataset "holes" are unchanged when growing to full depth and that the non-holes are interpolated.

[Figure: Tree Navigation to Maximum Depth for A(CCB). Root split on GENDER: Female (count 9, avg 11.11), Male (count 4, avg 13.25), Neither (count 8, avg 20.63). The second level splits the Female and Male nodes by RACE and the Neither node by FAMILY STATUS; the empty Male/Black node (count 0) takes the all-Male estimate 13.25. The third level grows to single-respondent leaves (cells a through aa); hole cells inherit their parent-node estimates, e.g. k, n, q = 13.25; o = 7.50; m = 19.00; w = 14.00.]
- 41. The last line in each table is the cell whose value is being estimated. The line above that gives the estimates for that cell. Estimates for non-holes (i.e., X-variable combos present in the Training Set) are interpolations of the input data. Estimates for hole X-variable combos inherit the value of the next higher level; if that too is missing, the estimate is inherited from, again, the next highest level. This is the case for the cells k, n and q. Importantly, for hole estimates the order of variable splits matters. For cells k, n, and q, if Race precedes Gender in the navigation, the cells take the value of all Black respondents, 17.20 (figure below). In contrast, if Gender precedes Race (figure on the last slide) these cells take the value of all Males, 13.25.

[Figure: Tree Navigation to Maximum Depth for C(BAA). Root split on RACE: Other (count 8, avg 17.00), Black (count 5, avg 17.20), White (count 8, avg 12.00). The second level splits the Other node by FAMILY STATUS and the Black and White nodes by GENDER; the empty Black/Male node (count 0) takes the all-Black estimate 17.20. The third level grows to single-respondent leaves; hole cells inherit their parent-node estimates, e.g. k, n, q = 17.20.]
- 42. The Optimization Step • Below is reproduced a version of the chart from Slide 24 with Sociological labels replacing the A, B, C dimensions and the two constituent trees grown to full depth. Data "holes" are shown in bold, and their Supervised Random Forest Y-values are estimated by forming a cell-wise weighted sum [(2 * the cell value in the A(CCB) navigation + 1 * the cell value in the C(BAA) navigation) / 3]. Because the observed Y-values have not changed, the choice from among the 24 possible Unsupervised navigations and their weightings in the Supervised version remain the same. • Note that the hole values here are the same as the corresponding values on Slide 24, but that the non-hole values are equal to their input values. The only way to avoid this interpolation is to stop short of full growth (or, as in Breiman's original version, to use bootstrapped samples). Growing trees to full depth does not affect the estimates of hole Y-values.

Have Had Children:

| | Female | Male | Neither |
|---|---|---|---|
| White | 5 | 3 | 16 |
| Black | 2 | **14.57** | 21 |
| Other | 22 | 20 | 23 |

Currently Pregnant:

| | Female | Male | Neither |
|---|---|---|---|
| White | 11 | **7.50** | 15 |
| Black | 17 | **14.57** | **17.33** |
| Other | 6 | **15.83** | 13 |

Have No Children:

| | Female | Male | Neither |
|---|---|---|---|
| White | 8 | 12 | 26 |
| Black | 19 | **14.57** | 27 |
| Other | 10 | 18 | 24 |
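The cell-wise blend for hole estimates is simple enough to state as a one-liner; e.g., cells k, n, and q combine their A(CCB) inheritance (13.25, the all-Male average) with their C(BAA) inheritance (17.20, the all-Black average), giving the 14.57 shown in the chart (the function name is ours):

```python
# The Supervised blend of the two constituent navigations' estimates,
# with A(CCB) given twice the weight of C(BAA), per Slide 24.
def blended_hole_estimate(a_ccb, c_baa):
    return (2 * a_ccb + 1 * c_baa) / 3

# Cells k, n, q: 13.25 under A(CCB), 17.20 under C(BAA)
est = round(blended_hole_estimate(13.25, 17.20), 2)
```

The same formula reproduces the other bold values, e.g. cell m blends 19.00 and 9.50 into 15.83.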
- 43. Appendix C: Aggregate "Geometries" when pruning more than one level • Typically, the individual tree navigations comprising a Random Forest ensemble are grown to full depth. • Appendix B has demonstrated that the estimated Y-values of "holes" in a Predictor Space (holes being those X-vector combos without Y-values in the original sample) are unaffected when a tree is grown to full depth. These unobserved combos are usually the values we are most interested in. • But, although atypical, one need not necessarily fully grow a tree. One can stop short (or prune more). This Appendix offers an example of the different Aggregate Geometries that result when this is done.
- 44. Examples of Geometries when pruning more than the last split (or stopping before the tree is fully grown) • Now, for our 3x3x3 Toy Example, prune the last 2 levels (or stop after the first split). This means that there are only 3 unique full navigations (equally) available to form an unsupervised Random Forest ensemble: one that splits the data into the 3 levels of the A variable, one using only the B variable, and one using only the C variable. To determine the estimated value of any specific cell in this ensemble requires averaging 3 3x3 "rectangles" (squares now) rather than averaging the 3 1x3 rectangles of our original Toy Example. Below are illustrated the weightings given the 27 original cells, when all have Y-target values (no "holes"), for two cell-estimation examples: an estimate for the middle-most cell "n" and one for the bottom left-most cell "a". • (For the "a" cell these weightings give averages analogous to those that result in the formula: 1/3*(a) + 1/9*(b) + 1/9*(c) + 1/9*(d) + 1/9*(g) + 1/9*(b) + 1/9*(j) + 1/9*(s) of Slide 17.) • Characterizing formulaically just how these weightings are derived is not straightforward (it would likely involve a city-block distance metric). Furthermore, the extension to more than 3 dimensions is more complex than what we see for the 3-variable case. But the point is that it can be done, even if only in a tabular, non-formulaic form. This suggests that Random Forests is a sort of Monte Carlo method used to estimate some otherwise intractably difficult, but not impossible, deterministic algorithm. Random Forests' estimation logic is not intrinsically "random."

WEIGHTINGS FOR THE 3-SQUARE ENSEMBLE WITHOUT HOLES ("holes" require a 2-step process; see Slide 22). Each cell's weight counts how many of the 3 axis-aligned squares through the target cell contain it, shown layer by layer:

Estimating the middle-most cell, n:

| Layer 1 | Layer 2 | Layer 3 |
|---|---|---|
| 0 1 0 | 1 2 1 | 0 1 0 |
| 1 2 1 | 2 3 2 | 1 2 1 |
| 0 1 0 | 1 2 1 | 0 1 0 |

Estimating the bottom left-most cell, a:

| Layer 1 | Layer 2 | Layer 3 |
|---|---|---|
| 2 1 1 | 2 1 1 | 3 2 2 |
| 1 0 0 | 1 0 0 | 2 1 1 |
| 1 0 0 | 1 0 0 | 2 1 1 |