34. From trees to forests Leo Breiman promoted random forests. Idea: Using tree averaging as a means of obtaining good rules. The base trees are simple and randomized.Breiman’s ideas were decisively inﬂuenced by Amit and Geman (1997, geometric feature selection). Ho (1998, random subspace method). Dietterich (2000, random split selection approach). stat.berkeley.edu/users/breiman/RandomForests G. Biau (UPMC) 34 / 106
35. From trees to forests Leo Breiman promoted random forests. Idea: Using tree averaging as a means of obtaining good rules. The base trees are simple and randomized.Breiman’s ideas were decisively inﬂuenced by Amit and Geman (1997, geometric feature selection). Ho (1998, random subspace method). Dietterich (2000, random split selection approach). stat.berkeley.edu/users/breiman/RandomForests G. Biau (UPMC) 34 / 106
36. From trees to forests Leo Breiman promoted random forests. Idea: Using tree averaging as a means of obtaining good rules. The base trees are simple and randomized.Breiman’s ideas were decisively inﬂuenced by Amit and Geman (1997, geometric feature selection). Ho (1998, random subspace method). Dietterich (2000, random split selection approach). stat.berkeley.edu/users/breiman/RandomForests G. Biau (UPMC) 34 / 106
37. From trees to forests Leo Breiman promoted random forests. Idea: Using tree averaging as a means of obtaining good rules. The base trees are simple and randomized.Breiman’s ideas were decisively inﬂuenced by Amit and Geman (1997, geometric feature selection). Ho (1998, random subspace method). Dietterich (2000, random split selection approach). stat.berkeley.edu/users/breiman/RandomForests G. Biau (UPMC) 34 / 106
38. From trees to forests Leo Breiman promoted random forests. Idea: Using tree averaging as a means of obtaining good rules. The base trees are simple and randomized.Breiman’s ideas were decisively inﬂuenced by Amit and Geman (1997, geometric feature selection). Ho (1998, random subspace method). Dietterich (2000, random split selection approach). stat.berkeley.edu/users/breiman/RandomForests G. Biau (UPMC) 34 / 106
39. Random forests They have emerged as serious competitors to state of the art methods. They are fast and easy to implement, produce highly accurate predictions and can handle a very large number of input variables without overﬁtting. In fact, forests are among the most accurate general-purpose learners available. The algorithm is difﬁcult to analyze and its mathematical properties remain to date largely unknown. Most theoretical studies have concentrated on isolated parts or stylized versions of the procedure. G. Biau (UPMC) 35 / 106
40. Random forests They have emerged as serious competitors to state of the art methods. They are fast and easy to implement, produce highly accurate predictions and can handle a very large number of input variables without overﬁtting. In fact, forests are among the most accurate general-purpose learners available. The algorithm is difﬁcult to analyze and its mathematical properties remain to date largely unknown. Most theoretical studies have concentrated on isolated parts or stylized versions of the procedure. G. Biau (UPMC) 35 / 106
41. Random forests They have emerged as serious competitors to state of the art methods. They are fast and easy to implement, produce highly accurate predictions and can handle a very large number of input variables without overﬁtting. In fact, forests are among the most accurate general-purpose learners available. The algorithm is difﬁcult to analyze and its mathematical properties remain to date largely unknown. Most theoretical studies have concentrated on isolated parts or stylized versions of the procedure. G. Biau (UPMC) 35 / 106
42. Random forests They have emerged as serious competitors to state of the art methods. They are fast and easy to implement, produce highly accurate predictions and can handle a very large number of input variables without overﬁtting. In fact, forests are among the most accurate general-purpose learners available. The algorithm is difﬁcult to analyze and its mathematical properties remain to date largely unknown. Most theoretical studies have concentrated on isolated parts or stylized versions of the procedure. G. Biau (UPMC) 35 / 106
43. Random forests They have emerged as serious competitors to state of the art methods. They are fast and easy to implement, produce highly accurate predictions and can handle a very large number of input variables without overﬁtting. In fact, forests are among the most accurate general-purpose learners available. The algorithm is difﬁcult to analyze and its mathematical properties remain to date largely unknown. Most theoretical studies have concentrated on isolated parts or stylized versions of the procedure. G. Biau (UPMC) 35 / 106
44. Key-references Breiman (2000, 2001, 2004). Deﬁnition, experiments and intuitions. Lin and Jeon (2006). Link with layered nearest neighbors. Biau, Devroye and Lugosi (2008). Consistency results for stylized versions. Biau (2010). Sparsity and random forests. G. Biau (UPMC) 36 / 106
45. Key-references Breiman (2000, 2001, 2004). Deﬁnition, experiments and intuitions. Lin and Jeon (2006). Link with layered nearest neighbors. Biau, Devroye and Lugosi (2008). Consistency results for stylized versions. Biau (2010). Sparsity and random forests. G. Biau (UPMC) 36 / 106
46. Key-references Breiman (2000, 2001, 2004). Deﬁnition, experiments and intuitions. Lin and Jeon (2006). Link with layered nearest neighbors. Biau, Devroye and Lugosi (2008). Consistency results for stylized versions. Biau (2010). Sparsity and random forests. G. Biau (UPMC) 36 / 106
47. Key-references Breiman (2000, 2001, 2004). Deﬁnition, experiments and intuitions. Lin and Jeon (2006). Link with layered nearest neighbors. Biau, Devroye and Lugosi (2008). Consistency results for stylized versions. Biau (2010). Sparsity and random forests. G. Biau (UPMC) 36 / 106
48. Outline1 Setting2 A random forests model3 Layered nearest neighbors and random forests G. Biau (UPMC) 37 / 106
49. Three basic ingredients1-Randomization and no-pruning For each tree, select at random, at each node, a small group of input coordinates to split. Calculate the best split based on these features and cut. The tree is grown to maximum size, without pruning. G. Biau (UPMC) 38 / 106
50. Three basic ingredients1-Randomization and no-pruning For each tree, select at random, at each node, a small group of input coordinates to split. Calculate the best split based on these features and cut. The tree is grown to maximum size, without pruning. G. Biau (UPMC) 38 / 106
51. Three basic ingredients1-Randomization and no-pruning For each tree, select at random, at each node, a small group of input coordinates to split. Calculate the best split based on these features and cut. The tree is grown to maximum size, without pruning. G. Biau (UPMC) 38 / 106
52. Three basic ingredients2-Aggregation Final predictions are obtained by aggregating over the ensemble. It is fast and easily parallelizable. G. Biau (UPMC) 39 / 106
53. Three basic ingredients2-Aggregation Final predictions are obtained by aggregating over the ensemble. It is fast and easily parallelizable. G. Biau (UPMC) 39 / 106
54. Three basic ingredients3-Bagging The subspace randomization scheme is blended with bagging. Breiman (1996). Bühlmann and Yu (2002). Biau, Cérou and Guyader (2010). G. Biau (UPMC) 40 / 106
55. Three basic ingredients3-Bagging The subspace randomization scheme is blended with bagging. Breiman (1996). Bühlmann and Yu (2002). Biau, Cérou and Guyader (2010). G. Biau (UPMC) 40 / 106
56. Mathematical framework A training sample: Dn = {(X1 , Y1 ), . . . , (Xn , Yn )} i.i.d. [0, 1]d × R-valued random variables. A generic pair: (X, Y ) satisfying EY 2 < ∞. Our mission: For ﬁxed x ∈ [0, 1]d , estimate the regression function r (x) = E[Y |X = x] using the data. Quality criterion: E[rn (X) − r (X)]2 . G. Biau (UPMC) 41 / 106
57. Mathematical framework A training sample: Dn = {(X1 , Y1 ), . . . , (Xn , Yn )} i.i.d. [0, 1]d × R-valued random variables. A generic pair: (X, Y ) satisfying EY 2 < ∞. Our mission: For ﬁxed x ∈ [0, 1]d , estimate the regression function r (x) = E[Y |X = x] using the data. Quality criterion: E[rn (X) − r (X)]2 . G. Biau (UPMC) 41 / 106
58. Mathematical framework A training sample: Dn = {(X1 , Y1 ), . . . , (Xn , Yn )} i.i.d. [0, 1]d × R-valued random variables. A generic pair: (X, Y ) satisfying EY 2 < ∞. Our mission: For ﬁxed x ∈ [0, 1]d , estimate the regression function r (x) = E[Y |X = x] using the data. Quality criterion: E[rn (X) − r (X)]2 . G. Biau (UPMC) 41 / 106
59. Mathematical framework A training sample: Dn = {(X1 , Y1 ), . . . , (Xn , Yn )} i.i.d. [0, 1]d × R-valued random variables. A generic pair: (X, Y ) satisfying EY 2 < ∞. Our mission: For ﬁxed x ∈ [0, 1]d , estimate the regression function r (x) = E[Y |X = x] using the data. Quality criterion: E[rn (X) − r (X)]2 . G. Biau (UPMC) 41 / 106
60. The model A random forest is a collection of randomized base regression trees {rn (x, Θm , Dn ), m ≥ 1}. These random trees are combined to form the aggregated regression estimate ¯n (X, Dn ) = EΘ [rn (X, Θ, Dn )] . r Θ is assumed to be independent of X and the training sample Dn . However, we allow Θ to be based on a test sample, independent of, but distributed as, Dn . G. Biau (UPMC) 42 / 106
61. The model A random forest is a collection of randomized base regression trees {rn (x, Θm , Dn ), m ≥ 1}. These random trees are combined to form the aggregated regression estimate ¯n (X, Dn ) = EΘ [rn (X, Θ, Dn )] . r Θ is assumed to be independent of X and the training sample Dn . However, we allow Θ to be based on a test sample, independent of, but distributed as, Dn . G. Biau (UPMC) 42 / 106
62. The model A random forest is a collection of randomized base regression trees {rn (x, Θm , Dn ), m ≥ 1}. These random trees are combined to form the aggregated regression estimate ¯n (X, Dn ) = EΘ [rn (X, Θ, Dn )] . r Θ is assumed to be independent of X and the training sample Dn . However, we allow Θ to be based on a test sample, independent of, but distributed as, Dn . G. Biau (UPMC) 42 / 106
63. The model A random forest is a collection of randomized base regression trees {rn (x, Θm , Dn ), m ≥ 1}. These random trees are combined to form the aggregated regression estimate ¯n (X, Dn ) = EΘ [rn (X, Θ, Dn )] . r Θ is assumed to be independent of X and the training sample Dn . However, we allow Θ to be based on a test sample, independent of, but distributed as, Dn . G. Biau (UPMC) 42 / 106
64. The procedure Fix kn ≥ 2 and repeat the following procedure log2 kn times: 1 At each node, a coordinate of X = (X (1) , . . . , X (d) ) is selected, with the j-th feature having a probability pnj ∈ (0, 1) of being selected. 2 At each node, once the coordinate is selected, the split is at the midpoint of the chosen side. Thus n i=1 Yi 1[Xi ∈An (X,Θ)] ¯n (X) = EΘ r n 1En (X,Θ) , i=1 1[Xi ∈An (X,Θ)] where n En (X, Θ) = 1[Xi ∈An (X,Θ)] = 0 . i=1 G. Biau (UPMC) 43 / 106
65. The procedure Fix kn ≥ 2 and repeat the following procedure log2 kn times: 1 At each node, a coordinate of X = (X (1) , . . . , X (d) ) is selected, with the j-th feature having a probability pnj ∈ (0, 1) of being selected. 2 At each node, once the coordinate is selected, the split is at the midpoint of the chosen side. Thus n i=1 Yi 1[Xi ∈An (X,Θ)] ¯n (X) = EΘ r n 1En (X,Θ) , i=1 1[Xi ∈An (X,Θ)] where n En (X, Θ) = 1[Xi ∈An (X,Θ)] = 0 . i=1 G. Biau (UPMC) 43 / 106
66. The procedure Fix kn ≥ 2 and repeat the following procedure log2 kn times: 1 At each node, a coordinate of X = (X (1) , . . . , X (d) ) is selected, with the j-th feature having a probability pnj ∈ (0, 1) of being selected. 2 At each node, once the coordinate is selected, the split is at the midpoint of the chosen side. Thus n i=1 Yi 1[Xi ∈An (X,Θ)] ¯n (X) = EΘ r n 1En (X,Θ) , i=1 1[Xi ∈An (X,Θ)] where n En (X, Θ) = 1[Xi ∈An (X,Θ)] = 0 . i=1 G. Biau (UPMC) 43 / 106
67. Binary trees G. Biau (UPMC) 44 / 106
68. Binary trees G. Biau (UPMC) 45 / 106
69. Binary trees G. Biau (UPMC) 46 / 106
70. Binary trees G. Biau (UPMC) 47 / 106
71. Binary trees G. Biau (UPMC) 48 / 106
72. General comments Each individual tree has exactly 2 log2 kn (≈ kn ) terminal nodes, and each leaf has Lebesgue measure 2− log2 kn (≈ 1/kn ). If X has uniform distribution on [0, 1]d , there will be on average about n/kn observations per terminal node. The choice kn = n induces a very small number of cases in the ﬁnal leaves.This scheme is close to what the original randomforests algorithm does. G. Biau (UPMC) 49 / 106
73. General comments Each individual tree has exactly 2 log2 kn (≈ kn ) terminal nodes, and each leaf has Lebesgue measure 2− log2 kn (≈ 1/kn ). If X has uniform distribution on [0, 1]d , there will be on average about n/kn observations per terminal node. The choice kn = n induces a very small number of cases in the ﬁnal leaves.This scheme is close to what the original randomforests algorithm does. G. Biau (UPMC) 49 / 106
74. General comments Each individual tree has exactly 2 log2 kn (≈ kn ) terminal nodes, and each leaf has Lebesgue measure 2− log2 kn (≈ 1/kn ). If X has uniform distribution on [0, 1]d , there will be on average about n/kn observations per terminal node. The choice kn = n induces a very small number of cases in the ﬁnal leaves.This scheme is close to what the original randomforests algorithm does. G. Biau (UPMC) 49 / 106
75. General comments Each individual tree has exactly 2 log2 kn (≈ kn ) terminal nodes, and each leaf has Lebesgue measure 2− log2 kn (≈ 1/kn ). If X has uniform distribution on [0, 1]d , there will be on average about n/kn observations per terminal node. The choice kn = n induces a very small number of cases in the ﬁnal leaves.This scheme is close to what the original randomforests algorithm does. G. Biau (UPMC) 49 / 106
76. ConsistencyTheoremAssume that the distribution of X has support on [0, 1]d . Then therandom forests estimate ¯n is consistent whenever pnj log kn → ∞ for rall j = 1, . . . , d and kn /n → 0 as n → ∞. In the purely random model, pnj = 1/d, independently of n and j, and consistency is ensured as long as kn → ∞ and kn /n → 0. This is however a radically simpliﬁed version of the random forests used in practice. A more in-depth analysis is needed. G. Biau (UPMC) 50 / 106
77. ConsistencyTheoremAssume that the distribution of X has support on [0, 1]d . Then therandom forests estimate ¯n is consistent whenever pnj log kn → ∞ for rall j = 1, . . . , d and kn /n → 0 as n → ∞. In the purely random model, pnj = 1/d, independently of n and j, and consistency is ensured as long as kn → ∞ and kn /n → 0. This is however a radically simpliﬁed version of the random forests used in practice. A more in-depth analysis is needed. G. Biau (UPMC) 50 / 106
78. ConsistencyTheoremAssume that the distribution of X has support on [0, 1]d . Then therandom forests estimate ¯n is consistent whenever pnj log kn → ∞ for rall j = 1, . . . , d and kn /n → 0 as n → ∞. In the purely random model, pnj = 1/d, independently of n and j, and consistency is ensured as long as kn → ∞ and kn /n → 0. This is however a radically simpliﬁed version of the random forests used in practice. A more in-depth analysis is needed. G. Biau (UPMC) 50 / 106
79. ConsistencyTheoremAssume that the distribution of X has support on [0, 1]d . Then therandom forests estimate ¯n is consistent whenever pnj log kn → ∞ for rall j = 1, . . . , d and kn /n → 0 as n → ∞. In the purely random model, pnj = 1/d, independently of n and j, and consistency is ensured as long as kn → ∞ and kn /n → 0. This is however a radically simpliﬁed version of the random forests used in practice. A more in-depth analysis is needed. G. Biau (UPMC) 50 / 106
80. Sparsity There is empirical evidence that many signals in high-dimensional spaces admit a sparse representation. Images wavelet coefﬁcients. High-throughput technologies. Sparse estimation is playing an increasingly important role in the statistics and machine learning communities. Several methods have recently been developed in both ﬁelds, which rely upon the notion of sparsity. G. Biau (UPMC) 51 / 106
81. Sparsity There is empirical evidence that many signals in high-dimensional spaces admit a sparse representation. Images wavelet coefﬁcients. High-throughput technologies. Sparse estimation is playing an increasingly important role in the statistics and machine learning communities. Several methods have recently been developed in both ﬁelds, which rely upon the notion of sparsity. G. Biau (UPMC) 51 / 106
82. Sparsity There is empirical evidence that many signals in high-dimensional spaces admit a sparse representation. Images wavelet coefﬁcients. High-throughput technologies. Sparse estimation is playing an increasingly important role in the statistics and machine learning communities. Several methods have recently been developed in both ﬁelds, which rely upon the notion of sparsity. G. Biau (UPMC) 51 / 106
83. Sparsity There is empirical evidence that many signals in high-dimensional spaces admit a sparse representation. Images wavelet coefﬁcients. High-throughput technologies. Sparse estimation is playing an increasingly important role in the statistics and machine learning communities. Several methods have recently been developed in both ﬁelds, which rely upon the notion of sparsity. G. Biau (UPMC) 51 / 106
84. Sparsity There is empirical evidence that many signals in high-dimensional spaces admit a sparse representation. Images wavelet coefﬁcients. High-throughput technologies. Sparse estimation is playing an increasingly important role in the statistics and machine learning communities. Several methods have recently been developed in both ﬁelds, which rely upon the notion of sparsity. G. Biau (UPMC) 51 / 106
85. Our vision The regression function r (X) = E[Y |X] depends in fact only on a nonempty subset S (for Strong) of the d features. In other words, letting XS = (Xj : j ∈ S) and S = Card S, we have r (X) = E[Y |XS ]. In the dimension reduction scenario we have in mind, the ambient dimension d can be very large, much larger than n. As such, the value S characterizes the sparsity of the model: The smaller S, the sparser r . G. Biau (UPMC) 52 / 106
86. Our vision The regression function r (X) = E[Y |X] depends in fact only on a nonempty subset S (for Strong) of the d features. In other words, letting XS = (Xj : j ∈ S) and S = Card S, we have r (X) = E[Y |XS ]. In the dimension reduction scenario we have in mind, the ambient dimension d can be very large, much larger than n. As such, the value S characterizes the sparsity of the model: The smaller S, the sparser r . G. Biau (UPMC) 52 / 106
87. Our vision The regression function r (X) = E[Y |X] depends in fact only on a nonempty subset S (for Strong) of the d features. In other words, letting XS = (Xj : j ∈ S) and S = Card S, we have r (X) = E[Y |XS ]. In the dimension reduction scenario we have in mind, the ambient dimension d can be very large, much larger than n. As such, the value S characterizes the sparsity of the model: The smaller S, the sparser r . G. Biau (UPMC) 52 / 106
88. Our vision The regression function r (X) = E[Y |X] depends in fact only on a nonempty subset S (for Strong) of the d features. In other words, letting XS = (Xj : j ∈ S) and S = Card S, we have r (X) = E[Y |XS ]. In the dimension reduction scenario we have in mind, the ambient dimension d can be very large, much larger than n. As such, the value S characterizes the sparsity of the model: The smaller S, the sparser r . G. Biau (UPMC) 52 / 106
89. Sparsity and random forests Ideally, pnj = 1/S for j ∈ S. To stick to reality, we will rather require that pnj = (1/S)(1 + ξnj ). Such a randomization mechanism may be designed on the basis of a test sample.Action plan E [¯n (X) − r (X)]2 = E [¯n (X) − ˜n (X)]2 + E [˜n (X) − r (X)]2 , r r r rwhere n ˜n (X) = r EΘ [Wni (X, Θ)] r (Xi ). i=1 G. Biau (UPMC) 53 / 106
90. Sparsity and random forests Ideally, pnj = 1/S for j ∈ S. To stick to reality, we will rather require that pnj = (1/S)(1 + ξnj ). Such a randomization mechanism may be designed on the basis of a test sample.Action plan E [¯n (X) − r (X)]2 = E [¯n (X) − ˜n (X)]2 + E [˜n (X) − r (X)]2 , r r r rwhere n ˜n (X) = r EΘ [Wni (X, Θ)] r (Xi ). i=1 G. Biau (UPMC) 53 / 106
91. Sparsity and random forests Ideally, pnj = 1/S for j ∈ S. To stick to reality, we will rather require that pnj = (1/S)(1 + ξnj ). Such a randomization mechanism may be designed on the basis of a test sample.Action plan E [¯n (X) − r (X)]2 = E [¯n (X) − ˜n (X)]2 + E [˜n (X) − r (X)]2 , r r r rwhere n ˜n (X) = r EΘ [Wni (X, Θ)] r (Xi ). i=1 G. Biau (UPMC) 53 / 106
92. Sparsity and random forests Ideally, pnj = 1/S for j ∈ S. To stick to reality, we will rather require that pnj = (1/S)(1 + ξnj ). Such a randomization mechanism may be designed on the basis of a test sample.Action plan E [¯n (X) − r (X)]2 = E [¯n (X) − ˜n (X)]2 + E [˜n (X) − r (X)]2 , r r r rwhere n ˜n (X) = r EΘ [Wni (X, Θ)] r (Xi ). i=1 G. Biau (UPMC) 53 / 106
93. VariancePropositionAssume that X is uniformly distributed on [0, 1]d and, for all x ∈ Rd , σ 2 (x) = V[Y | X = x] ≤ σ 2for some positive constant σ 2 . Then, if pnj = (1/S)(1 + ξnj ) for j ∈ S, S/2d S2 kn E [¯n (X) − ˜n (X)]2 ≤ Cσ 2 r r (1 + ξn ) , S−1 n(log kn )S/2dwhere S/2d 288 π log 2 C= . π 16 G. Biau (UPMC) 54 / 106
94. BiasPropositionAssume that X is uniformly distributed on [0, 1]d and r is L-Lipschitz.Then, if pnj = (1/S)(1 + ξnj ) for j ∈ S, 2SL2 E [˜n (X) − r (X)]2 ≤ r 0.75 + sup r 2 (x) e−n/2kn . (1+γn ) kn S log 2 x∈[0,1]d The rate at which the bias decreases to 0 depends on the number of strong variables, not on d. kn −(0.75/(S log 2))(1+γn ) = o(kn −2/d ) as soon as S ≤ 0.54d . The term e−n/2kn prevents the extreme choice kn = n. G. Biau (UPMC) 55 / 106
95. BiasPropositionAssume that X is uniformly distributed on [0, 1]d and r is L-Lipschitz.Then, if pnj = (1/S)(1 + ξnj ) for j ∈ S, 2SL2 E [˜n (X) − r (X)]2 ≤ r 0.75 + sup r 2 (x) e−n/2kn . (1+γn ) kn S log 2 x∈[0,1]d The rate at which the bias decreases to 0 depends on the number of strong variables, not on d. kn −(0.75/(S log 2))(1+γn ) = o(kn −2/d ) as soon as S ≤ 0.54d . The term e−n/2kn prevents the extreme choice kn = n. G. Biau (UPMC) 55 / 106
96. BiasPropositionAssume that X is uniformly distributed on [0, 1]d and r is L-Lipschitz.Then, if pnj = (1/S)(1 + ξnj ) for j ∈ S, 2SL2 E [˜n (X) − r (X)]2 ≤ r 0.75 + sup r 2 (x) e−n/2kn . (1+γn ) kn S log 2 x∈[0,1]d The rate at which the bias decreases to 0 depends on the number of strong variables, not on d. kn −(0.75/(S log 2))(1+γn ) = o(kn −2/d ) as soon as S ≤ 0.54d . The term e−n/2kn prevents the extreme choice kn = n. G. Biau (UPMC) 55 / 106
97. BiasPropositionAssume that X is uniformly distributed on [0, 1]d and r is L-Lipschitz.Then, if pnj = (1/S)(1 + ξnj ) for j ∈ S, 2SL2 E [˜n (X) − r (X)]2 ≤ r 0.75 + sup r 2 (x) e−n/2kn . (1+γn ) kn S log 2 x∈[0,1]d The rate at which the bias decreases to 0 depends on the number of strong variables, not on d. kn −(0.75/(S log 2))(1+γn ) = o(kn −2/d ) as soon as S ≤ 0.54d . The term e−n/2kn prevents the extreme choice kn = n. G. Biau (UPMC) 55 / 106
98. Main resultTheoremIf pnj = (1/S)(1 + ξnj ) for j ∈ S, with ξnj log n → 0 as n → ∞, then forthe choice 1/(1+ S0.752 ) L2 log 1/(1+ S0.752 ) kn ∝ n log , Ξwe have E [¯n (X) − r (X)]2 r lim sup sup 0.75 ≤ Λ. n→∞ (X,Y )∈F 2S ln 2 S ln 2+0.75 −0.75 ΞL 0.75 n S log 2+0.75Take-home message −0.75The rate n S log 2+0.75 is strictly faster than the usual minimax raten−2/(d+2) as soon as S ≤ 0.54d . G. Biau (UPMC) 56 / 106
99. Main resultTheoremIf pnj = (1/S)(1 + ξnj ) for j ∈ S, with ξnj log n → 0 as n → ∞, then forthe choice 1/(1+ S0.752 ) L2 log 1/(1+ S0.752 ) kn ∝ n log , Ξwe have E [¯n (X) − r (X)]2 r lim sup sup 0.75 ≤ Λ. n→∞ (X,Y )∈F 2S ln 2 S ln 2+0.75 −0.75 ΞL 0.75 n S log 2+0.75Take-home message −0.75The rate n S log 2+0.75 is strictly faster than the usual minimax raten−2/(d+2) as soon as S ≤ 0.54d . G. Biau (UPMC) 56 / 106
101. Discussion — Bagging The optimal parameter kn depends on the unknown distribution of (X, Y ). To correct this situation, adaptive choices of kn should preserve the rate of convergence of the estimate. Another route we may follow is to analyze the effect of bagging. Biau, Cérou and Guyader (2010). G. Biau (UPMC) 58 / 106
102. Discussion — Bagging The optimal parameter kn depends on the unknown distribution of (X, Y ). To correct this situation, adaptive choices of kn should preserve the rate of convergence of the estimate. Another route we may follow is to analyze the effect of bagging. Biau, Cérou and Guyader (2010). G. Biau (UPMC) 58 / 106
103. Discussion — Bagging The optimal parameter kn depends on the unknown distribution of (X, Y ). To correct this situation, adaptive choices of kn should preserve the rate of convergence of the estimate. Another route we may follow is to analyze the effect of bagging. Biau, Cérou and Guyader (2010). G. Biau (UPMC) 58 / 106
104. Discussion — Bagging The optimal parameter kn depends on the unknown distribution of (X, Y ). To correct this situation, adaptive choices of kn should preserve the rate of convergence of the estimate. Another route we may follow is to analyze the effect of bagging. Biau, Cérou and Guyader (2010). G. Biau (UPMC) 58 / 106
105. ?G. Biau (UPMC) 59 / 106
106. ?G. Biau (UPMC) 60 / 106
107. G. Biau (UPMC) 61 / 106
108. ?G. Biau (UPMC) 62 / 106
109. ?G. Biau (UPMC) 63 / 106
110. ?G. Biau (UPMC) 64 / 106
111. G. Biau (UPMC) 65 / 106
112. ?G. Biau (UPMC) 66 / 106
113. ?G. Biau (UPMC) 67 / 106
114. ?G. Biau (UPMC) 68 / 106
115. G. Biau (UPMC) 69 / 106
116. Discussion — Choosing the pnj ’sImaginary scenarioThe following splitting scheme is iteratively repeated at each node: 1 Select at random Mn candidate coordinates to split on. 2 If the selection is all weak, then choose one at random to split on. 3 If there is more than one strong variable elected, choose one at random and cut. G. Biau (UPMC) 70 / 106
117. Discussion — Choosing the pnj ’sImaginary scenarioThe following splitting scheme is iteratively repeated at each node: 1 Select at random Mn candidate coordinates to split on. 2 If the selection is all weak, then choose one at random to split on. 3 If there is more than one strong variable elected, choose one at random and cut. G. Biau (UPMC) 70 / 106
118. Discussion — Choosing the pnj ’sImaginary scenarioThe following splitting scheme is iteratively repeated at each node: 1 Select at random Mn candidate coordinates to split on. 2 If the selection is all weak, then choose one at random to split on. 3 If there is more than one strong variable elected, choose one at random and cut. G. Biau (UPMC) 70 / 106
119. Discussion — Choosing the pnj ’sImaginary scenarioThe following splitting scheme is iteratively repeated at each node: 1 Select at random Mn candidate coordinates to split on. 2 If the selection is all weak, then choose one at random to split on. 3 If there is more than one strong variable elected, choose one at random and cut. G. Biau (UPMC) 70 / 106
120. Discussion — Choosing the pnj ’s Each coordinate in S will be cut with the “ideal” probability Mn 1 S pn = 1− 1− . S d The parameter Mn should satisfy Mn S 1− log n → 0 as n → ∞. d This is true as soon as Mn Mn → ∞ and → ∞ as n → ∞. log n G. Biau (UPMC) 71 / 106
121. Discussion — Choosing the pnj ’s Each coordinate in S will be cut with the “ideal” probability Mn 1 S pn = 1− 1− . S d The parameter Mn should satisfy Mn S 1− log n → 0 as n → ∞. d This is true as soon as Mn Mn → ∞ and → ∞ as n → ∞. log n G. Biau (UPMC) 71 / 106
122. Discussion — Choosing the pnj ’s Each coordinate in S will be cut with the “ideal” probability Mn 1 S pn = 1− 1− . S d The parameter Mn should satisfy Mn S 1− log n → 0 as n → ∞. d This is true as soon as Mn Mn → ∞ and → ∞ as n → ∞. log n G. Biau (UPMC) 71 / 106
123. Assumptions We have at hand an independent test set Dn . The model is linear: Y = aj X (j) + ε. j∈S For a ﬁxed node A = d Aj , ﬁx a coordinate j and look at the j=1 weighted conditional variance V[Y |X (j) ∈ Aj ] P(X (j) ∈ Aj ). If j ∈ S, then the best split is at the midpoint of the node, with a variance decrease equal to aj2 /16 > 0. If j ∈ W, the decrease of the variance is always 0, whatever the location of the split. G. Biau (UPMC) 72 / 106
124. Assumptions We have at hand an independent test set Dn . The model is linear: Y = aj X (j) + ε. j∈S For a ﬁxed node A = d Aj , ﬁx a coordinate j and look at the j=1 weighted conditional variance V[Y |X (j) ∈ Aj ] P(X (j) ∈ Aj ). If j ∈ S, then the best split is at the midpoint of the node, with a variance decrease equal to aj2 /16 > 0. If j ∈ W, the decrease of the variance is always 0, whatever the location of the split. G. Biau (UPMC) 72 / 106
125. Assumptions We have at hand an independent test set Dn . The model is linear: Y = aj X (j) + ε. j∈S For a ﬁxed node A = d Aj , ﬁx a coordinate j and look at the j=1 weighted conditional variance V[Y |X (j) ∈ Aj ] P(X (j) ∈ Aj ). If j ∈ S, then the best split is at the midpoint of the node, with a variance decrease equal to aj2 /16 > 0. If j ∈ W, the decrease of the variance is always 0, whatever the location of the split. G. Biau (UPMC) 72 / 106
126. Assumptions We have at hand an independent test set Dn . The model is linear: Y = aj X (j) + ε. j∈S For a ﬁxed node A = d Aj , ﬁx a coordinate j and look at the j=1 weighted conditional variance V[Y |X (j) ∈ Aj ] P(X (j) ∈ Aj ). If j ∈ S, then the best split is at the midpoint of the node, with a variance decrease equal to aj2 /16 > 0. If j ∈ W, the decrease of the variance is always 0, whatever the location of the split. G. Biau (UPMC) 72 / 106
127. Assumptions We have at hand an independent test set Dn . The model is linear: Y = aj X (j) + ε. j∈S For a ﬁxed node A = d Aj , ﬁx a coordinate j and look at the j=1 weighted conditional variance V[Y |X (j) ∈ Aj ] P(X (j) ∈ Aj ). If j ∈ S, then the best split is at the midpoint of the node, with a variance decrease equal to aj2 /16 > 0. If j ∈ W, the decrease of the variance is always 0, whatever the location of the split. G. Biau (UPMC) 72 / 106
128. Assumptions We have at hand an independent test set Dn . The model is linear: Y = aj X (j) + ε. j∈S For a ﬁxed node A = d Aj , ﬁx a coordinate j and look at the j=1 weighted conditional variance V[Y |X (j) ∈ Aj ] P(X (j) ∈ Aj ). If j ∈ S, then the best split is at the midpoint of the node, with a variance decrease equal to aj2 /16 > 0. If j ∈ W, the decrease of the variance is always 0, whatever the location of the split. G. Biau (UPMC) 72 / 106
129. Discussion — Choosing the pnj ’sNear-reality scenarioThe following splitting scheme is iteratively repeated at each node: 1 Select at random Mn candidate coordinates to split on. 2 For each of the Mn elected coordinates, calculate the best split. 3 Select the coordinate which outputs the best within-node sum of squares decrease, and cut.ConclusionFor j ∈ S, 1 pnj ≈ 1 + ξnj , Swhere ξnj → 0 and satisﬁes the constraint ξnj log n → 0 as n tends toinﬁnity, provided kn log n/n → 0, Mn → ∞ and Mn / log n → ∞. G. Biau (UPMC) 73 / 106
130. Discussion — Choosing the pnj ’sNear-reality scenarioThe following splitting scheme is iteratively repeated at each node: 1 Select at random Mn candidate coordinates to split on. 2 For each of the Mn elected coordinates, calculate the best split. 3 Select the coordinate which outputs the best within-node sum of squares decrease, and cut.ConclusionFor j ∈ S, 1 pnj ≈ 1 + ξnj , Swhere ξnj → 0 and satisﬁes the constraint ξnj log n → 0 as n tends toinﬁnity, provided kn log n/n → 0, Mn → ∞ and Mn / log n → ∞. G. Biau (UPMC) 73 / 106
131. Discussion — Choosing the pnj ’sNear-reality scenarioThe following splitting scheme is iteratively repeated at each node: 1 Select at random Mn candidate coordinates to split on. 2 For each of the Mn elected coordinates, calculate the best split. 3 Select the coordinate which outputs the best within-node sum of squares decrease, and cut.ConclusionFor j ∈ S, 1 pnj ≈ 1 + ξnj , Swhere ξnj → 0 and satisﬁes the constraint ξnj log n → 0 as n tends toinﬁnity, provided kn log n/n → 0, Mn → ∞ and Mn / log n → ∞. G. Biau (UPMC) 73 / 106
132. Discussion — Choosing the pnj ’sNear-reality scenarioThe following splitting scheme is iteratively repeated at each node: 1 Select at random Mn candidate coordinates to split on. 2 For each of the Mn elected coordinates, calculate the best split. 3 Select the coordinate which outputs the best within-node sum of squares decrease, and cut.ConclusionFor j ∈ S, 1 pnj ≈ 1 + ξnj , Swhere ξnj → 0 and satisﬁes the constraint ξnj log n → 0 as n tends toinﬁnity, provided kn log n/n → 0, Mn → ∞ and Mn / log n → ∞. G. Biau (UPMC) 73 / 106
133. Discussion — Choosing the pnj ’sNear-reality scenarioThe following splitting scheme is iteratively repeated at each node: 1 Select at random Mn candidate coordinates to split on. 2 For each of the Mn elected coordinates, calculate the best split. 3 Select the coordinate which outputs the best within-node sum of squares decrease, and cut.ConclusionFor j ∈ S, 1 pnj ≈ 1 + ξnj , Swhere ξnj → 0 and satisﬁes the constraint ξnj log n → 0 as n tends toinﬁnity, provided kn log n/n → 0, Mn → ∞ and Mn / log n → ∞. G. Biau (UPMC) 73 / 106
134. Outline1 Setting2 A random forests model3 Layered nearest neighbors and random forests G. Biau (UPMC) 74 / 106
135. ?G. Biau (UPMC) 75 / 106
136. ?G. Biau (UPMC) 76 / 106
137. ?G. Biau (UPMC) 77 / 106
138. ?G. Biau (UPMC) 78 / 106
139. ?G. Biau (UPMC) 79 / 106
140. G. Biau (UPMC) 80 / 106
141. ?G. Biau (UPMC) 81 / 106
142. ?G. Biau (UPMC) 82 / 106
143. ?G. Biau (UPMC) 83 / 106
144. ?G. Biau (UPMC) 84 / 106
145. ?G. Biau (UPMC) 85 / 106
146. G. Biau (UPMC) 86 / 106
147. ?G. Biau (UPMC) 87 / 106
148. ?G. Biau (UPMC) 88 / 106
149. ?G. Biau (UPMC) 89 / 106
150. ?G. Biau (UPMC) 90 / 106
151. ?G. Biau (UPMC) 91 / 106
152. ?G. Biau (UPMC) 92 / 106
153. ?G. Biau (UPMC) 93 / 106
154. ?G. Biau (UPMC) 94 / 106
155. ?G. Biau (UPMC) 95 / 106
156. ?G. Biau (UPMC) 96 / 106
157. ?G. Biau (UPMC) 97 / 106
158. ?G. Biau (UPMC) 98 / 106
159. Layered Nearest NeighborsDeﬁnitionLet X1 , . . . , Xn be a sample of i.i.d. random vectors in Rd , d ≥ 2. Anobservation Xi is said to be a LNN of a point x if the hyperrectangledeﬁned by x and Xi contains no other data points. x empty G. Biau (UPMC) 99 / 106
160. What is known about Ln (x)? ... a lot when X1 , . . . , Xn are uniformly distributed over [0, 1]d . For example, 2d (log n)d−1 ELn (x) = + O (log n)d−2 (d − 1)! and (d − 1)! Ln (x) → 1 in probability as n → ∞. 2d (log n)d−1 This is the problem of maxima in random vectors (Barndorff-Nielsen and Sobel, 1966). G. Biau (UPMC) 100 / 106
161. What is known about Ln (x)? ... a lot when X1 , . . . , Xn are uniformly distributed over [0, 1]d . For example, 2d (log n)d−1 ELn (x) = + O (log n)d−2 (d − 1)! and (d − 1)! Ln (x) → 1 in probability as n → ∞. 2d (log n)d−1 This is the problem of maxima in random vectors (Barndorff-Nielsen and Sobel, 1966). G. Biau (UPMC) 100 / 106
162. What is known about Ln (x)? ... a lot when X1 , . . . , Xn are uniformly distributed over [0, 1]d . For example, 2d (log n)d−1 ELn (x) = + O (log n)d−2 (d − 1)! and (d − 1)! Ln (x) → 1 in probability as n → ∞. 2d (log n)d−1 This is the problem of maxima in random vectors (Barndorff-Nielsen and Sobel, 1966). G. Biau (UPMC) 100 / 106
163. Two results (Biau and Devroye, 2010)ModelX1 , . . . , Xn are independently distributed according to some probabilitydensity f (with probability measure µ).TheoremFor µ-almost all x ∈ Rd , one has Ln (x) → ∞ in probability as n → ∞.TheoremSuppose that f is λ-almost everywhere continuous. Then (d − 1)! ELn (x) → 1 as n → ∞, 2d (log n)d−1at µ-almost all x ∈ Rd . G. Biau (UPMC) 101 / 106
164. Two results (Biau and Devroye, 2010)ModelX1 , . . . , Xn are independently distributed according to some probabilitydensity f (with probability measure µ).TheoremFor µ-almost all x ∈ Rd , one has Ln (x) → ∞ in probability as n → ∞.TheoremSuppose that f is λ-almost everywhere continuous. Then (d − 1)! ELn (x) → 1 as n → ∞, 2d (log n)d−1at µ-almost all x ∈ Rd . G. Biau (UPMC) 101 / 106
165. Two results (Biau and Devroye, 2010)ModelX1 , . . . , Xn are independently distributed according to some probabilitydensity f (with probability measure µ).TheoremFor µ-almost all x ∈ Rd , one has Ln (x) → ∞ in probability as n → ∞.TheoremSuppose that f is λ-almost everywhere continuous. Then (d − 1)! ELn (x) → 1 as n → ∞, 2d (log n)d−1at µ-almost all x ∈ Rd . G. Biau (UPMC) 101 / 106
166. LNN regression estimationModel(X, Y ), (X1 , Y1 ), . . . , (Xn , Yn ) are i.i.d. random vectors of Rd × R.Moreover, |Y | is bounded and X has a density.The regression function r (x) = E[Y |X = x] may be estimated by n 1 rn (x) = Yi 1[Xi ∈Ln (x)] . Ln (x) i=1 1 No smoothing parameter. 2 A scale-invariant estimate. 3 Intimately connected to Breiman’s random forests. G. Biau (UPMC) 102 / 106
167. LNN regression estimationModel(X, Y ), (X1 , Y1 ), . . . , (Xn , Yn ) are i.i.d. random vectors of Rd × R.Moreover, |Y | is bounded and X has a density.The regression function r (x) = E[Y |X = x] may be estimated by n 1 rn (x) = Yi 1[Xi ∈Ln (x)] . Ln (x) i=1 1 No smoothing parameter. 2 A scale-invariant estimate. 3 Intimately connected to Breiman’s random forests. G. Biau (UPMC) 102 / 106
168. LNN regression estimationModel(X, Y ), (X1 , Y1 ), . . . , (Xn , Yn ) are i.i.d. random vectors of Rd × R.Moreover, |Y | is bounded and X has a density.The regression function r (x) = E[Y |X = x] may be estimated by n 1 rn (x) = Yi 1[Xi ∈Ln (x)] . Ln (x) i=1 1 No smoothing parameter. 2 A scale-invariant estimate. 3 Intimately connected to Breiman’s random forests. G. Biau (UPMC) 102 / 106
169. LNN regression estimationModel(X, Y ), (X1 , Y1 ), . . . , (Xn , Yn ) are i.i.d. random vectors of Rd × R.Moreover, |Y | is bounded and X has a density.The regression function r (x) = E[Y |X = x] may be estimated by n 1 rn (x) = Yi 1[Xi ∈Ln (x)] . Ln (x) i=1 1 No smoothing parameter. 2 A scale-invariant estimate. 3 Intimately connected to Breiman’s random forests. G. Biau (UPMC) 102 / 106
170. LNN regression estimationModel(X, Y ), (X1 , Y1 ), . . . , (Xn , Yn ) are i.i.d. random vectors of Rd × R.Moreover, |Y | is bounded and X has a density.The regression function r (x) = E[Y |X = x] may be estimated by n 1 rn (x) = Yi 1[Xi ∈Ln (x)] . Ln (x) i=1 1 No smoothing parameter. 2 A scale-invariant estimate. 3 Intimately connected to Breiman’s random forests. G. Biau (UPMC) 102 / 106
171. ConsistencyTheorem (Pointwise Lp -consistency)Assume that the regression function r is λ-almost everywherecontinuous and that Y is bounded. Then, for µ-almost all x ∈ Rd andall p ≥ 1, E |rn (x) − r (x)|p → 0 as n → ∞.Theorem (Gobal Lp -consistency)Under the same conditions, for all p ≥ 1, E |rn (X) − r (X)|p → 0 as n → ∞. 1 No universal consistency result with respect to r is possible. 2 The results do not impose any condition on the density. 3 They are also scale-free. G. Biau (UPMC) 103 / 106
172. ConsistencyTheorem (Pointwise Lp -consistency)Assume that the regression function r is λ-almost everywherecontinuous and that Y is bounded. Then, for µ-almost all x ∈ Rd andall p ≥ 1, E |rn (x) − r (x)|p → 0 as n → ∞.Theorem (Gobal Lp -consistency)Under the same conditions, for all p ≥ 1, E |rn (X) − r (X)|p → 0 as n → ∞. 1 No universal consistency result with respect to r is possible. 2 The results do not impose any condition on the density. 3 They are also scale-free. G. Biau (UPMC) 103 / 106
173. ConsistencyTheorem (Pointwise Lp -consistency)Assume that the regression function r is λ-almost everywherecontinuous and that Y is bounded. Then, for µ-almost all x ∈ Rd andall p ≥ 1, E |rn (x) − r (x)|p → 0 as n → ∞.Theorem (Gobal Lp -consistency)Under the same conditions, for all p ≥ 1, E |rn (X) − r (X)|p → 0 as n → ∞. 1 No universal consistency result with respect to r is possible. 2 The results do not impose any condition on the density. 3 They are also scale-free. G. Biau (UPMC) 103 / 106
174. ConsistencyTheorem (Pointwise Lp -consistency)Assume that the regression function r is λ-almost everywherecontinuous and that Y is bounded. Then, for µ-almost all x ∈ Rd andall p ≥ 1, E |rn (x) − r (x)|p → 0 as n → ∞.Theorem (Gobal Lp -consistency)Under the same conditions, for all p ≥ 1, E |rn (X) − r (X)|p → 0 as n → ∞. 1 No universal consistency result with respect to r is possible. 2 The results do not impose any condition on the density. 3 They are also scale-free. G. Biau (UPMC) 103 / 106
175. ConsistencyTheorem (Pointwise Lp -consistency)Assume that the regression function r is λ-almost everywherecontinuous and that Y is bounded. Then, for µ-almost all x ∈ Rd andall p ≥ 1, E |rn (x) − r (x)|p → 0 as n → ∞.Theorem (Gobal Lp -consistency)Under the same conditions, for all p ≥ 1, E |rn (X) − r (X)|p → 0 as n → ∞. 1 No universal consistency result with respect to r is possible. 2 The results do not impose any condition on the density. 3 They are also scale-free. G. Biau (UPMC) 103 / 106
176. Back to random forestsA random forest can be viewed as a weighted LNN regression estimate n ¯n (x) = r Yi Wni (x), i=1where the weights concentrate on the LNN and satisfy n Wni (x) = 1. i=1 G. Biau (UPMC) 104 / 106
177. Non-adaptive strategiesConsider the non-adaptive random forests estimate n ¯n (x) = r Yi Wni (x). i=1where the weights concentrate on the LNN.PropositionFor any x ∈ Rd , assume that σ 2 = V[Y |X = x] is independent of x.Then σ2 E [¯n (x) − r (x)]2 ≥ r . ELn (x) G. Biau (UPMC) 105 / 106
178. Non-adaptive strategiesConsider the non-adaptive random forests estimate n ¯n (x) = r Yi Wni (x). i=1where the weights concentrate on the LNN.PropositionFor any x ∈ Rd , assume that σ 2 = V[Y |X = x] is independent of x.Then σ2 E [¯n (x) − r (x)]2 ≥ r . ELn (x) G. Biau (UPMC) 105 / 106
179. Rate of convergenceAt µ-almost all x, when f is λ-almost everywhere continuous, σ 2 (d − 1)! E [¯n (x) − r (x)]2 r . 2d (log n)d−1Improving the rate of convergence 1 Stop as soon as a future rectangle split would cause a sub-rectangle to have fewer than kn points. 2 Resort to bagging and randomize using random subsamples. G. Biau (UPMC) 106 / 106
180. Rate of convergenceAt µ-almost all x, when f is λ-almost everywhere continuous, σ 2 (d − 1)! E [¯n (x) − r (x)]2 r . 2d (log n)d−1Improving the rate of convergence 1 Stop as soon as a future rectangle split would cause a sub-rectangle to have fewer than kn points. 2 Resort to bagging and randomize using random subsamples. G. Biau (UPMC) 106 / 106
181. Rate of convergenceAt µ-almost all x, when f is λ-almost everywhere continuous, σ 2 (d − 1)! E [¯n (x) − r (x)]2 r . 2d (log n)d−1Improving the rate of convergence 1 Stop as soon as a future rectangle split would cause a sub-rectangle to have fewer than kn points. 2 Resort to bagging and randomize using random subsamples. G. Biau (UPMC) 106 / 106
Be the first to comment