Transcript

  • 1. Random forests. Gérard Biau (UPMC), Groupe de travail prévision, May 2011.
  • 2-3. Outline: 1. Setting; 2. A random forests model; 3. Layered nearest neighbors and random forests.
  • 4-5. Leo Breiman (1928-2005). [Figure slides.]
  • 6-8. Classification. [Figures: a labelled point cloud and a query point marked "?" to be classified.]
  • 9-11. Regression. [Figures: sample points labelled with numerical responses; a query point, first marked "?", is assigned the value 2.25.]
  • 12-33. Trees. [Figures: step-by-step construction of a tree-based partition of the feature space and the prediction for a query point marked "?".]
  • 34-38. From trees to forests. Leo Breiman promoted random forests. Idea: use tree averaging as a means of obtaining good rules. The base trees are simple and randomized. Breiman's ideas were decisively influenced by Amit and Geman (1997, geometric feature selection), Ho (1998, random subspace method), and Dietterich (2000, random split selection approach). stat.berkeley.edu/users/breiman/RandomForests
  • 39-43. Random forests. They have emerged as serious competitors to state-of-the-art methods. They are fast and easy to implement, produce highly accurate predictions, and can handle a very large number of input variables without overfitting. In fact, forests are among the most accurate general-purpose learners available. The algorithm is difficult to analyze, and its mathematical properties remain to date largely unknown. Most theoretical studies have concentrated on isolated parts or stylized versions of the procedure.
  • 44-47. Key references. Breiman (2000, 2001, 2004): definition, experiments and intuitions. Lin and Jeon (2006): link with layered nearest neighbors. Biau, Devroye and Lugosi (2008): consistency results for stylized versions. Biau (2010): sparsity and random forests.
  • 48. Outline: 1. Setting; 2. A random forests model; 3. Layered nearest neighbors and random forests.
  • 49-51. Three basic ingredients, 1: Randomization and no pruning. For each tree, select at random, at each node, a small group of input coordinates to split on. Calculate the best split based on these features and cut. The tree is grown to maximum size, without pruning.
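To make the randomization and no-pruning step concrete, here is a minimal Python sketch (not the code behind the talk): at every node a small random subset of coordinates is drawn, the best split among them is kept, and the tree is grown until no further split is possible. The names `grow_tree`, `predict_tree`, `mtry` and `min_leaf` are illustrative choices, not part of the original material.

```python
# Minimal sketch of a randomized, unpruned base regression tree.
import numpy as np

def grow_tree(X, y, mtry, rng, min_leaf=1):
    """Recursively grow an unpruned regression tree on (X, y)."""
    n, d = X.shape
    if n <= min_leaf or np.all(y == y[0]):
        return {"leaf": True, "value": float(np.mean(y))}
    features = rng.choice(d, size=min(mtry, d), replace=False)   # random coordinate subset
    best = None
    for j in features:
        for s in np.unique(X[:, j])[:-1]:                        # candidate thresholds
            left = X[:, j] <= s
            if left.sum() == 0 or left.sum() == n:
                continue
            # within-node sum of squares after the split
            sse = ((y[left] - y[left].mean()) ** 2).sum() + \
                  ((y[~left] - y[~left].mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, s, left)
    if best is None:                                             # no valid split: make a leaf
        return {"leaf": True, "value": float(np.mean(y))}
    _, j, s, left = best
    return {"leaf": False, "feature": j, "threshold": s,
            "left": grow_tree(X[left], y[left], mtry, rng, min_leaf),
            "right": grow_tree(X[~left], y[~left], mtry, rng, min_leaf)}

def predict_tree(tree, x):
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["value"]
```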
  • 52-53. Three basic ingredients, 2: Aggregation. Final predictions are obtained by aggregating over the ensemble. It is fast and easily parallelizable.
  • 54-55. Three basic ingredients, 3: Bagging. The subspace randomization scheme is blended with bagging: Breiman (1996), Bühlmann and Yu (2002), Biau, Cérou and Guyader (2010).
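The three ingredients can be combined in a few lines; the sketch below uses scikit-learn's `DecisionTreeRegressor` as a stand-in base learner (bootstrap resampling for bagging, `max_features` for per-node feature subsampling, averaging for aggregation). It is an illustration of the recipe, not Breiman's reference implementation.

```python
# Bagging + random subspaces + aggregation, sketched with scikit-learn trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_forest(X, y, n_trees=100, max_features="sqrt", seed=0):
    rng = np.random.default_rng(seed)
    forest, n = [], len(X)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)                  # bootstrap sample (bagging)
        tree = DecisionTreeRegressor(max_features=max_features,   # feature subsampling at each node
                                     random_state=int(rng.integers(2**31)))
        tree.fit(X[idx], y[idx])                          # grown without pruning by default
        forest.append(tree)
    return forest

def predict_forest(forest, X):
    # Aggregation: average the individual tree predictions (embarrassingly parallel).
    return np.mean([t.predict(X) for t in forest], axis=0)

# toy usage
X = np.random.default_rng(1).uniform(size=(200, 5))
y = X[:, 0] + 2 * X[:, 1] + 0.1 * np.random.default_rng(2).normal(size=200)
forest = fit_forest(X, y)
print(predict_forest(forest, X[:3]))
```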
  • 56-59. Mathematical framework. A training sample: $\mathcal{D}_n = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$, i.i.d. $[0,1]^d \times \mathbb{R}$-valued random variables. A generic pair $(X, Y)$ satisfying $\mathbb{E}Y^2 < \infty$. Our mission: for fixed $\mathbf{x} \in [0,1]^d$, estimate the regression function $r(\mathbf{x}) = \mathbb{E}[Y \mid X = \mathbf{x}]$ using the data. Quality criterion: $\mathbb{E}[r_n(X) - r(X)]^2$.
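The quality criterion can be approximated by Monte Carlo on a synthetic model where $r$ is known; the sketch below does this with a 1-nearest-neighbour rule standing in for $r_n$ (all names and parameter values are illustrative).

```python
# Monte Carlo approximation of the L2 risk E[(r_n(X) - r(X))^2] on a synthetic model.
import numpy as np

rng = np.random.default_rng(0)
d, n, n_test = 3, 500, 10_000
r = lambda x: np.sin(2 * np.pi * x[..., 0]) + x[..., 1]   # known regression function

X_train = rng.uniform(size=(n, d))
Y_train = r(X_train) + 0.1 * rng.normal(size=n)

def r_n(x):
    # placeholder estimator: 1-nearest-neighbour prediction
    i = np.argmin(np.sum((X_train - x) ** 2, axis=1))
    return Y_train[i]

X_test = rng.uniform(size=(n_test, d))                    # fresh draws of X
risk = np.mean([(r_n(x) - r(x)) ** 2 for x in X_test])
print(f"estimated L2 risk: {risk:.4f}")
```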
  • 60-63. The model. A random forest is a collection of randomized base regression trees $\{r_n(\mathbf{x}, \Theta_m, \mathcal{D}_n),\ m \ge 1\}$. These random trees are combined to form the aggregated regression estimate $\bar{r}_n(X, \mathcal{D}_n) = \mathbb{E}_\Theta[r_n(X, \Theta, \mathcal{D}_n)]$. $\Theta$ is assumed to be independent of $X$ and the training sample $\mathcal{D}_n$. However, we allow $\Theta$ to be based on a test sample, independent of, but distributed as, $\mathcal{D}_n$.
  • 64-66. The procedure. Fix $k_n \ge 2$ and repeat the following procedure $\lceil \log_2 k_n \rceil$ times: (1) at each node, a coordinate of $X = (X^{(1)}, \dots, X^{(d)})$ is selected, the $j$-th feature having a probability $p_{nj} \in (0,1)$ of being selected; (2) once the coordinate is selected, the split is at the midpoint of the chosen side. Thus, writing $A_n(X,\Theta)$ for the cell of the partition containing $X$,
    $$\bar{r}_n(X) = \mathbb{E}_\Theta\!\left[\frac{\sum_{i=1}^n Y_i \mathbf{1}_{[X_i \in A_n(X,\Theta)]}}{\sum_{i=1}^n \mathbf{1}_{[X_i \in A_n(X,\Theta)]}}\, \mathbf{1}_{\mathcal{E}_n(X,\Theta)}\right],$$
    where $\mathcal{E}_n(X,\Theta) = \left[\sum_{i=1}^n \mathbf{1}_{[X_i \in A_n(X,\Theta)]} \ne 0\right]$.
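A direct transcription of this procedure might look as follows, under the slides' assumptions (data on $[0,1]^d$, midpoint splits, user-supplied probabilities $p_{nj}$, prediction 0 on empty cells); the function names and the Monte Carlo size over $\Theta$ are illustrative choices.

```python
# Purely random base tree: the cell of x is refined ceil(log2(k_n)) times,
# each time cutting a coordinate drawn with probability p_nj at its midpoint;
# the forest estimate averages the cell means over many draws of Theta.
import numpy as np

def cell_of(x, X, depth, p, rng):
    """Boolean mask of the training points lying in A_n(x, Theta)."""
    lower, upper = np.zeros(len(x)), np.ones(len(x))
    mask = np.ones(len(X), dtype=bool)
    for _ in range(depth):
        j = rng.choice(len(p), p=p)              # coordinate selected with probability p_nj
        mid = (lower[j] + upper[j]) / 2.0        # midpoint split
        if x[j] <= mid:
            upper[j] = mid
            mask &= X[:, j] <= mid
        else:
            lower[j] = mid
            mask &= X[:, j] > mid
    return mask

def forest_estimate(x, X, Y, k_n, p, n_theta=200, seed=0):
    rng = np.random.default_rng(seed)
    depth = int(np.ceil(np.log2(k_n)))
    preds = []
    for _ in range(n_theta):                     # Monte Carlo over Theta
        mask = cell_of(x, X, depth, p, rng)
        preds.append(Y[mask].mean() if mask.any() else 0.0)   # 0 when the event E_n fails
    return float(np.mean(preds))

# toy usage on uniform data
rng = np.random.default_rng(1)
d, n = 2, 2000
X = rng.uniform(size=(n, d)); Y = X[:, 0] + rng.normal(scale=0.1, size=n)
print(forest_estimate(np.array([0.3, 0.7]), X, Y, k_n=32, p=[0.5, 0.5]))
```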
  • 67-71. Binary trees. [Figures illustrating the binary trees and the associated recursive partitions built by the procedure.]
  • 72-75. General comments. Each individual tree has exactly $2^{\lceil \log_2 k_n \rceil}$ ($\approx k_n$) terminal nodes, and each leaf has Lebesgue measure $2^{-\lceil \log_2 k_n \rceil}$ ($\approx 1/k_n$). If $X$ has the uniform distribution on $[0,1]^d$, there will be on average about $n/k_n$ observations per terminal node. The choice $k_n = n$ induces a very small number of cases in the final leaves. This scheme is close to what the original random forests algorithm does.
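A quick numeric check of these counts, for illustrative values of $n$ and $k_n$:

```python
# Leaf counts quoted on the slide, evaluated for illustrative n and k_n.
import math

n, k_n = 100_000, 1_000
depth = math.ceil(math.log2(k_n))          # number of split rounds
leaves = 2 ** depth                        # exactly 2^ceil(log2 k_n) terminal nodes
leaf_measure = 2.0 ** (-depth)             # Lebesgue measure of each leaf
print(depth, leaves, leaf_measure, n / leaves)   # 10, 1024, ~0.000977, ~97.7 points per leaf
```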
  • 76-79. Consistency. Theorem: Assume that the distribution of $X$ has support on $[0,1]^d$. Then the random forests estimate $\bar{r}_n$ is consistent whenever $p_{nj} \log k_n \to \infty$ for all $j = 1, \dots, d$ and $k_n / n \to 0$ as $n \to \infty$. In the purely random model, $p_{nj} = 1/d$, independently of $n$ and $j$, and consistency is ensured as long as $k_n \to \infty$ and $k_n / n \to 0$. This is, however, a radically simplified version of the random forests used in practice; a more in-depth analysis is needed.
  • 80-84. Sparsity. There is empirical evidence that many signals in high-dimensional spaces admit a sparse representation: wavelet coefficients of images, high-throughput technologies. Sparse estimation is playing an increasingly important role in the statistics and machine learning communities, and several methods relying on the notion of sparsity have recently been developed in both fields.
  • 85-88. Our vision. The regression function $r(X) = \mathbb{E}[Y \mid X]$ depends in fact only on a nonempty subset $\mathcal{S}$ (for Strong) of the $d$ features. In other words, letting $X_{\mathcal{S}} = (X_j : j \in \mathcal{S})$ and $S = \operatorname{Card}\,\mathcal{S}$, we have $r(X) = \mathbb{E}[Y \mid X_{\mathcal{S}}]$. In the dimension-reduction scenario we have in mind, the ambient dimension $d$ can be very large, much larger than $n$. As such, the value $S$ characterizes the sparsity of the model: the smaller $S$, the sparser $r$.
  • 89-92. Sparsity and random forests. Ideally, $p_{nj} = 1/S$ for $j \in \mathcal{S}$. To stick to reality, we will rather require that $p_{nj} = (1/S)(1 + \xi_{nj})$. Such a randomization mechanism may be designed on the basis of a test sample. Action plan:
    $$\mathbb{E}[\bar{r}_n(X) - r(X)]^2 = \mathbb{E}[\bar{r}_n(X) - \tilde{r}_n(X)]^2 + \mathbb{E}[\tilde{r}_n(X) - r(X)]^2,$$
    where $\tilde{r}_n(X) = \sum_{i=1}^n \mathbb{E}_\Theta[W_{ni}(X, \Theta)]\, r(X_i)$.
  • 93. Variance. Proposition: Assume that $X$ is uniformly distributed on $[0,1]^d$ and, for all $\mathbf{x} \in \mathbb{R}^d$, $\sigma^2(\mathbf{x}) = \mathbb{V}[Y \mid X = \mathbf{x}] \le \sigma^2$ for some positive constant $\sigma^2$. Then, if $p_{nj} = (1/S)(1 + \xi_{nj})$ for $j \in \mathcal{S}$,
    $$\mathbb{E}[\bar{r}_n(X) - \tilde{r}_n(X)]^2 \le C \sigma^2 \left(\frac{S^2}{S-1}\right)^{S/2d} (1 + \xi_n)\, \frac{k_n}{n (\log k_n)^{S/2d}},$$
    where $C = \frac{288}{\pi} \left(\frac{\pi \log 2}{16}\right)^{S/2d}$.
  • 94-97. Bias. Proposition: Assume that $X$ is uniformly distributed on $[0,1]^d$ and $r$ is $L$-Lipschitz. Then, if $p_{nj} = (1/S)(1 + \xi_{nj})$ for $j \in \mathcal{S}$,
    $$\mathbb{E}[\tilde{r}_n(X) - r(X)]^2 \le \frac{2 S L^2}{k_n^{\frac{0.75}{S \log 2}(1 + \gamma_n)}} + \left(\sup_{\mathbf{x} \in [0,1]^d} r^2(\mathbf{x})\right) e^{-n/(2k_n)}.$$
    The rate at which the bias decreases to 0 depends on the number of strong variables, not on $d$: $k_n^{-\frac{0.75}{S \log 2}(1 + \gamma_n)} = o(k_n^{-2/d})$ as soon as $S \le 0.54d$. The term $e^{-n/(2k_n)}$ prevents the extreme choice $k_n = n$.
  • 98-99. Main result. Theorem: If $p_{nj} = (1/S)(1 + \xi_{nj})$ for $j \in \mathcal{S}$, with $\xi_{nj} \log n \to 0$ as $n \to \infty$, then for a choice of $k_n$ proportional to $n^{1/\left(1 + \frac{0.75}{S \log 2}\right)}$ (up to a factor depending on $L$, $\Xi$ and logarithmic terms), we have
    $$\limsup_{n \to \infty}\ \sup_{(X,Y) \in \mathcal{F}}\ \frac{\mathbb{E}[\bar{r}_n(X) - r(X)]^2}{\Xi^{\frac{0.75}{S \log 2 + 0.75}}\, L^{\frac{2 S \log 2}{S \log 2 + 0.75}}\, n^{-\frac{0.75}{S \log 2 + 0.75}}} \le \Lambda.$$
    Take-home message: the rate $n^{-\frac{0.75}{S \log 2 + 0.75}}$ is strictly faster than the usual minimax rate $n^{-2/(d+2)}$ as soon as $S \le 0.54d$.
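A small numeric comparison of the two rate exponents, for illustrative values of $d$ and $S$:

```python
# Compare the forest exponent 0.75 / (S*log(2) + 0.75) with the minimax
# exponent 2 / (d + 2) for a few illustrative (d, S) pairs.
import math

def forest_exponent(S):
    return 0.75 / (S * math.log(2) + 0.75)

def minimax_exponent(d):
    return 2.0 / (d + 2)

for d, S in [(20, 5), (50, 10), (100, 20)]:
    fe, me = forest_exponent(S), minimax_exponent(d)
    print(f"d={d:3d} S={S:3d}  forest rate n^-{fe:.3f}  minimax rate n^-{me:.3f}  "
          f"faster: {fe > me}  (S <= 0.54 d: {S <= 0.54 * d})")
```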
  • 100. Dimension reduction. [Figure: a curve plotted against S, from 2 to 100, on a vertical scale from 0 to 0.4, with the value 0.0196 marked.]
  • 101-104. Discussion: Bagging. The optimal parameter $k_n$ depends on the unknown distribution of $(X, Y)$. To correct this situation, adaptive choices of $k_n$ should preserve the rate of convergence of the estimate. Another route we may follow is to analyze the effect of bagging: Biau, Cérou and Guyader (2010).
  • 105-115. [Figure-only slides; no text.]
  • 116-119. Discussion: Choosing the $p_{nj}$'s. Imaginary scenario. The following splitting scheme is iteratively repeated at each node: (1) select at random $M_n$ candidate coordinates to split on; (2) if the selection is all weak, choose one at random to split on; (3) if more than one strong variable is elected, choose one at random and cut.
  • 120-122. Discussion: Choosing the $p_{nj}$'s. Each coordinate in $\mathcal{S}$ will be cut with the "ideal" probability
    $$p_n = \frac{1}{S}\left[1 - \left(1 - \frac{S}{d}\right)^{M_n}\right].$$
    The parameter $M_n$ should satisfy $\left(1 - \frac{S}{d}\right)^{M_n} \log n \to 0$ as $n \to \infty$, which is true as soon as $M_n \to \infty$ and $M_n / \log n \to \infty$ as $n \to \infty$.
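A quick evaluation of this probability and of the quantity $(1 - S/d)^{M_n} \log n$ that must vanish, for illustrative values of $d$, $S$, $n$ and $M_n$:

```python
# Evaluate p_n = (1/S) * (1 - (1 - S/d)^M_n) and the residual (1 - S/d)^M_n * log(n).
import math

def p_n(S, d, M_n):
    return (1.0 / S) * (1.0 - (1.0 - S / d) ** M_n)

d, S, n = 100, 10, 10_000
for M_n in [5, 20, 80]:
    residual = (1.0 - S / d) ** M_n * math.log(n)
    print(f"M_n={M_n:3d}  p_n={p_n(S, d, M_n):.4f}  (1/S={1/S:.4f})  "
          f"(1-S/d)^M_n * log n = {residual:.4f}")
```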
  • 123-128. Assumptions. We have at hand an independent test set $\mathcal{D}_n'$ (independent of, but distributed as, $\mathcal{D}_n$). The model is linear: $Y = \sum_{j \in \mathcal{S}} a_j X^{(j)} + \varepsilon$. For a fixed node $A = \prod_{j=1}^d A_j$, fix a coordinate $j$ and look at the weighted conditional variance $\mathbb{V}[Y \mid X^{(j)} \in A_j]\, \mathbb{P}(X^{(j)} \in A_j)$. If $j \in \mathcal{S}$, then the best split is at the midpoint of the node, with a variance decrease equal to $a_j^2/16 > 0$. If $j \in \mathcal{W}$, the decrease of the variance is always 0, whatever the location of the split.
  • 129-133. Discussion: Choosing the $p_{nj}$'s. Near-reality scenario. The following splitting scheme is iteratively repeated at each node: (1) select at random $M_n$ candidate coordinates to split on; (2) for each of the $M_n$ elected coordinates, calculate the best split; (3) select the coordinate which yields the best within-node sum-of-squares decrease, and cut. Conclusion: for $j \in \mathcal{S}$, $p_{nj} \approx \frac{1}{S}(1 + \xi_{nj})$, where $\xi_{nj} \to 0$ and satisfies the constraint $\xi_{nj} \log n \to 0$ as $n$ tends to infinity, provided $k_n \log n / n \to 0$, $M_n \to \infty$ and $M_n / \log n \to \infty$.
  • 134. Outline: 1. Setting; 2. A random forests model; 3. Layered nearest neighbors and random forests.
  • 135-158. [Figure-only slides; no text.]
  • 159. Layered nearest neighbors. Definition: Let $X_1, \dots, X_n$ be a sample of i.i.d. random vectors in $\mathbb{R}^d$, $d \ge 2$. An observation $X_i$ is said to be a layered nearest neighbor (LNN) of a point $\mathbf{x}$ if the hyperrectangle defined by $\mathbf{x}$ and $X_i$ contains no other data points. [Figure: a point $\mathbf{x}$ with an empty hyperrectangle.]
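The definition translates directly into a brute-force check; in the sketch below (the function name `layered_nearest_neighbors` is an illustrative choice), each $X_i$ is tested for whether the hyperrectangle spanned by $\mathbf{x}$ and $X_i$ contains another sample point.

```python
# Brute-force O(n^2 d) computation of the layered nearest neighbors of x.
import numpy as np

def layered_nearest_neighbors(x, X):
    """Return the indices of the LNN of x among the rows of X."""
    lnn = []
    for i, xi in enumerate(X):
        lo, hi = np.minimum(x, xi), np.maximum(x, xi)   # hyperrectangle defined by x and X_i
        inside = np.all((X >= lo) & (X <= hi), axis=1)
        inside[i] = False                               # ignore X_i itself
        if not inside.any():                            # rectangle contains no other data point
            lnn.append(i)
    return lnn

# toy usage: LNN of the center of the unit square
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
print(layered_nearest_neighbors(np.array([0.5, 0.5]), X))
```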
  • 160-162. What is known about $L_n(\mathbf{x})$, the number of LNN of $\mathbf{x}$? ... a lot when $X_1, \dots, X_n$ are uniformly distributed over $[0,1]^d$. For example,
    $$\mathbb{E}L_n(\mathbf{x}) = \frac{2^d (\log n)^{d-1}}{(d-1)!} + O\!\left((\log n)^{d-2}\right)$$
    and
    $$\frac{(d-1)!\, L_n(\mathbf{x})}{2^d (\log n)^{d-1}} \to 1 \quad \text{in probability as } n \to \infty.$$
    This is the problem of maxima in random vectors (Barndorff-Nielsen and Sobel, 1966).
  • 163-165. Two results (Biau and Devroye, 2010). Model: $X_1, \dots, X_n$ are independently distributed according to some probability density $f$ (with probability measure $\mu$). Theorem: For $\mu$-almost all $\mathbf{x} \in \mathbb{R}^d$, one has $L_n(\mathbf{x}) \to \infty$ in probability as $n \to \infty$. Theorem: Suppose that $f$ is $\lambda$-almost everywhere continuous. Then
    $$\frac{(d-1)!\, \mathbb{E}L_n(\mathbf{x})}{2^d (\log n)^{d-1}} \to 1 \quad \text{as } n \to \infty,$$
    at $\mu$-almost all $\mathbf{x} \in \mathbb{R}^d$.
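A small simulation can be compared with this asymptotic mean; at these modest sample sizes the lower-order terms are still visible, so only rough agreement should be expected (the helper `count_lnn` and all values are illustrative).

```python
# Monte Carlo estimate of E L_n(x) for uniform data versus 2^d (log n)^(d-1) / (d-1)!.
import math
import numpy as np

def count_lnn(x, X):
    count = 0
    for i, xi in enumerate(X):
        lo, hi = np.minimum(x, xi), np.maximum(x, xi)
        inside = np.all((X >= lo) & (X <= hi), axis=1)
        inside[i] = False
        count += not inside.any()                 # X_i is an LNN of x
    return count

d, n, reps = 2, 1000, 50
rng = np.random.default_rng(0)
x = np.full(d, 0.5)
mc = np.mean([count_lnn(x, rng.uniform(size=(n, d))) for _ in range(reps)])
asym = 2 ** d * math.log(n) ** (d - 1) / math.factorial(d - 1)
print(f"Monte Carlo E L_n(x) ~ {mc:.1f}   asymptotic 2^d (log n)^(d-1)/(d-1)! = {asym:.1f}")
```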
  • 166-170. LNN regression estimation. Model: $(X, Y), (X_1, Y_1), \dots, (X_n, Y_n)$ are i.i.d. random vectors of $\mathbb{R}^d \times \mathbb{R}$; moreover, $|Y|$ is bounded and $X$ has a density. The regression function $r(\mathbf{x}) = \mathbb{E}[Y \mid X = \mathbf{x}]$ may be estimated by
    $$r_n(\mathbf{x}) = \frac{1}{L_n(\mathbf{x})} \sum_{i=1}^n Y_i \mathbf{1}_{[X_i \in L_n(\mathbf{x})]}.$$
    (1) No smoothing parameter. (2) A scale-invariant estimate. (3) Intimately connected to Breiman's random forests.
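The estimate itself is a plain average of the responses over the LNN; a self-contained sketch (illustrative names, brute-force LNN search):

```python
# LNN regression estimate r_n(x): average Y_i over the layered nearest neighbors of x.
import numpy as np

def lnn_regression(x, X, Y):
    """Average the responses of the layered nearest neighbors of x."""
    responses = []
    for i, xi in enumerate(X):
        lo, hi = np.minimum(x, xi), np.maximum(x, xi)
        inside = np.all((X >= lo) & (X <= hi), axis=1)
        inside[i] = False
        if not inside.any():                 # X_i is an LNN of x
            responses.append(Y[i])
    return float(np.mean(responses)) if responses else 0.0

# toy usage: r(x) = x1 + x2 on uniform data
rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 2))
Y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=2000)
x = np.array([0.3, 0.6])
print(lnn_regression(x, X, Y), "vs true r(x) =", x.sum())
```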
  • 171-175. Consistency. Theorem (pointwise $L^p$-consistency): Assume that the regression function $r$ is $\lambda$-almost everywhere continuous and that $Y$ is bounded. Then, for $\mu$-almost all $\mathbf{x} \in \mathbb{R}^d$ and all $p \ge 1$, $\mathbb{E}|r_n(\mathbf{x}) - r(\mathbf{x})|^p \to 0$ as $n \to \infty$. Theorem (global $L^p$-consistency): Under the same conditions, for all $p \ge 1$, $\mathbb{E}|r_n(X) - r(X)|^p \to 0$ as $n \to \infty$. (1) No universal consistency result with respect to $r$ is possible. (2) The results do not impose any condition on the density. (3) They are also scale-free.
  • 176. Back to random forests. A random forest can be viewed as a weighted LNN regression estimate $\bar{r}_n(\mathbf{x}) = \sum_{i=1}^n Y_i W_{ni}(\mathbf{x})$, where the weights concentrate on the LNN and satisfy $\sum_{i=1}^n W_{ni}(\mathbf{x}) = 1$.
  • 177-178. Non-adaptive strategies. Consider the non-adaptive random forests estimate $\bar{r}_n(\mathbf{x}) = \sum_{i=1}^n Y_i W_{ni}(\mathbf{x})$, where the weights concentrate on the LNN. Proposition: For any $\mathbf{x} \in \mathbb{R}^d$, assume that $\sigma^2 = \mathbb{V}[Y \mid X = \mathbf{x}]$ is independent of $\mathbf{x}$. Then
    $$\mathbb{E}[\bar{r}_n(\mathbf{x}) - r(\mathbf{x})]^2 \ge \frac{\sigma^2}{\mathbb{E}L_n(\mathbf{x})}.$$
  • 179-181. Rate of convergence. At $\mu$-almost all $\mathbf{x}$, when $f$ is $\lambda$-almost everywhere continuous,
    $$\mathbb{E}[\bar{r}_n(\mathbf{x}) - r(\mathbf{x})]^2 \gtrsim \frac{\sigma^2 (d-1)!}{2^d (\log n)^{d-1}}.$$
    Improving the rate of convergence: (1) stop as soon as a future rectangle split would cause a sub-rectangle to have fewer than $k_n$ points; (2) resort to bagging and randomize using random subsamples.