Transcript of "Présentation G.Biau Random Forests"

1. Random forests. Gérard Biau, Groupe de travail prévision, May 2011.
2. Outline
   1 Setting
   2 A random forests model
   3 Layered nearest neighbors and random forests
4. Leo Breiman (1928–2005).
6. Classification (figure slides: a labeled point cloud and a new query point “?” to be classified).
9. Regression (figure slides: training points with numeric responses, a query point “?”, and its predicted value).
12. Trees (figure slides illustrating decision trees built on the data and the prediction for a query point “?”).
34. From trees to forests
   Leo Breiman promoted random forests.
   Idea: using tree averaging as a means of obtaining good rules.
   The base trees are simple and randomized.
   Breiman's ideas were decisively influenced by:
   Amit and Geman (1997, geometric feature selection),
   Ho (1998, random subspace method),
   Dietterich (2000, random split selection approach).
   stat.berkeley.edu/users/breiman/RandomForests
39. Random forests
   They have emerged as serious competitors to state-of-the-art methods.
   They are fast and easy to implement, produce highly accurate predictions, and can handle a very large number of input variables without overfitting.
   In fact, forests are among the most accurate general-purpose learners available.
   The algorithm is difficult to analyze and its mathematical properties remain to date largely unknown.
   Most theoretical studies have concentrated on isolated parts or stylized versions of the procedure.
44. Key references
   Breiman (2000, 2001, 2004): definition, experiments and intuitions.
   Lin and Jeon (2006): link with layered nearest neighbors.
   Biau, Devroye and Lugosi (2008): consistency results for stylized versions.
   Biau (2010): sparsity and random forests.
48. Outline
   1 Setting
   2 A random forests model
   3 Layered nearest neighbors and random forests
49. Three basic ingredients: 1. Randomization and no-pruning
   For each tree, select at random, at each node, a small group of input coordinates to split on.
   Calculate the best split based on these features and cut.
   The tree is grown to maximum size, without pruning.
52. Three basic ingredients: 2. Aggregation
   Final predictions are obtained by aggregating over the ensemble.
   It is fast and easily parallelizable.
54. Three basic ingredients: 3. Bagging
   The subspace randomization scheme is blended with bagging.
   Breiman (1996); Bühlmann and Yu (2002); Biau, Cérou and Guyader (2010).
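The slides do not tie these ingredients to any particular implementation; purely as an illustration, here is how they map onto scikit-learn's RandomForestRegressor (the library choice, parameter values and toy data below are mine, not from the talk):

```python
# Minimal sketch (not from the talk): the three ingredients in scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 10))                            # features in [0, 1]^d
y = X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.normal(size=500)   # sparse signal + noise

forest = RandomForestRegressor(
    n_estimators=200,     # aggregation: predictions are averaged over 200 trees
    max_features="sqrt",  # randomization: random subset of coordinates at each node
    max_depth=None,       # no pruning: trees are grown to maximum size
    bootstrap=True,       # bagging: each tree sees a bootstrap sample of the data
    random_state=0,
)
forest.fit(X, y)
print(forest.predict(X[:3]))                               # aggregated predictions
```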
56. Mathematical framework
   A training sample: D_n = {(X_1, Y_1), ..., (X_n, Y_n)}, i.i.d. [0, 1]^d × R-valued random variables.
   A generic pair: (X, Y) satisfying E[Y²] < ∞.
   Our mission: for fixed x ∈ [0, 1]^d, estimate the regression function r(x) = E[Y | X = x] using the data.
   Quality criterion: E[r_n(X) − r(X)]².
60. The model
   A random forest is a collection of randomized base regression trees {r_n(x, Θ_m, D_n), m ≥ 1}.
   These random trees are combined to form the aggregated regression estimate
   $$\bar{r}_n(\mathbf{X}, \mathcal{D}_n) = \mathbb{E}_{\Theta}\big[r_n(\mathbf{X}, \Theta, \mathcal{D}_n)\big].$$
   Θ is assumed to be independent of X and the training sample D_n.
   However, we allow Θ to be based on a test sample, independent of, but distributed as, D_n.
64. The procedure
   Fix k_n ≥ 2 and repeat the following procedure ⌈log₂ k_n⌉ times:
   1 At each node, a coordinate of X = (X^{(1)}, ..., X^{(d)}) is selected, the j-th feature having probability p_{nj} ∈ (0, 1) of being selected.
   2 At each node, once the coordinate is selected, the split is at the midpoint of the chosen side.
   Thus, writing A_n(X, Θ) for the cell of the resulting random tree that contains X,
   $$\bar{r}_n(\mathbf{X}) = \mathbb{E}_{\Theta}\!\left[\frac{\sum_{i=1}^{n} Y_i\, \mathbf{1}_{[\mathbf{X}_i \in A_n(\mathbf{X},\Theta)]}}{\sum_{i=1}^{n} \mathbf{1}_{[\mathbf{X}_i \in A_n(\mathbf{X},\Theta)]}}\, \mathbf{1}_{\mathcal{E}_n(\mathbf{X},\Theta)}\right],$$
   where
   $$\mathcal{E}_n(\mathbf{X},\Theta) = \left[\,\sum_{i=1}^{n} \mathbf{1}_{[\mathbf{X}_i \in A_n(\mathbf{X},\Theta)]} \neq 0\right].$$
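A small sketch of the tree-growing procedure just described, assuming X is supported on [0, 1]^d; the function names (grow_cells, tree_predict) and the brute-force prediction step are illustrative, not from the talk:

```python
# Sketch of one randomized tree of the model: at each node a coordinate j is
# drawn with probability p[j] and the cell is split at the midpoint of that
# side; the splitting is repeated ceil(log2(k_n)) times.
import math
import numpy as np

def grow_cells(d, k_n, p, rng):
    """Return the leaf cells of one random tree, each cell a list of d intervals."""
    cells = [[(0.0, 1.0)] * d]
    for _ in range(math.ceil(math.log2(k_n))):
        new_cells = []
        for cell in cells:
            j = rng.choice(d, p=p)                 # coordinate chosen with prob p[j]
            low, high = cell[j]
            mid = 0.5 * (low + high)               # midpoint split of the chosen side
            left, right = list(cell), list(cell)
            left[j], right[j] = (low, mid), (mid, high)
            new_cells += [left, right]
        cells = new_cells
    return cells

def tree_predict(x, X, Y, cells):
    """Average the Y_i whose X_i fall in the cell containing x (0 if the cell is empty)."""
    for cell in cells:
        if all(lo <= xj < hi or (hi == 1.0 and xj == 1.0) for xj, (lo, hi) in zip(x, cell)):
            in_cell = np.all([(X[:, j] >= lo) & (X[:, j] < hi)
                              for j, (lo, hi) in enumerate(cell)], axis=0)
            return Y[in_cell].mean() if in_cell.any() else 0.0
    return 0.0
```

The forest estimate r̄_n is then obtained by averaging tree_predict over many independent draws of the cells, which plays the role of the expectation over Θ; the empty-cell convention corresponds to the indicator 1_{E_n(X,Θ)} above.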
67. Binary trees (figure slides: the recursive midpoint partitions of [0, 1]^d and the associated binary trees).
72. General comments
   Each individual tree has exactly 2^⌈log₂ k_n⌉ (≈ k_n) terminal nodes, and each leaf has Lebesgue measure 2^{−⌈log₂ k_n⌉} (≈ 1/k_n).
   If X has uniform distribution on [0, 1]^d, there will be on average about n/k_n observations per terminal node.
   The choice k_n = n induces a very small number of cases in the final leaves. This scheme is close to what the original random forests algorithm does.
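A quick numerical reading of these comments (illustrative values; nothing here is from the slides):

```python
# For n observations and parameter k_n: 2**ceil(log2(k_n)) leaves per tree,
# each of Lebesgue measure 2**(-ceil(log2(k_n))), hence about n/k_n points per leaf.
import math

n, k_n = 10_000, 100
leaves = 2 ** math.ceil(math.log2(k_n))
print(leaves, 1 / leaves, n / k_n)   # 128 leaves, measure 1/128, ~100 points per leaf
```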
76. Consistency
   Theorem. Assume that the distribution of X has support on [0, 1]^d. Then the random forests estimate r̄_n is consistent whenever p_{nj} log k_n → ∞ for all j = 1, ..., d and k_n/n → 0 as n → ∞.
   In the purely random model, p_{nj} = 1/d, independently of n and j, and consistency is ensured as long as k_n → ∞ and k_n/n → 0.
   This is however a radically simplified version of the random forests used in practice. A more in-depth analysis is needed.
80. Sparsity
   There is empirical evidence that many signals in high-dimensional spaces admit a sparse representation: image wavelet coefficients, high-throughput technologies.
   Sparse estimation is playing an increasingly important role in the statistics and machine learning communities.
   Several methods relying upon the notion of sparsity have recently been developed in both fields.
85. Our vision
   The regression function r(X) = E[Y | X] depends in fact only on a nonempty subset 𝒮 (for Strong) of the d features.
   In other words, letting X_𝒮 = (X_j : j ∈ 𝒮) and S = Card 𝒮, we have r(X) = E[Y | X_𝒮].
   In the dimension reduction scenario we have in mind, the ambient dimension d can be very large, much larger than n.
   As such, the value S characterizes the sparsity of the model: the smaller S, the sparser r.
89. Sparsity and random forests
   Ideally, p_{nj} = 1/S for j ∈ 𝒮.
   To stick to reality, we will rather require that p_{nj} = (1/S)(1 + ξ_{nj}).
   Such a randomization mechanism may be designed on the basis of a test sample.
   Action plan:
   $$\mathbb{E}\,[\bar{r}_n(\mathbf{X}) - r(\mathbf{X})]^2 = \mathbb{E}\,[\bar{r}_n(\mathbf{X}) - \tilde{r}_n(\mathbf{X})]^2 + \mathbb{E}\,[\tilde{r}_n(\mathbf{X}) - r(\mathbf{X})]^2,$$
   where
   $$\tilde{r}_n(\mathbf{X}) = \sum_{i=1}^{n} \mathbb{E}_{\Theta}[W_{ni}(\mathbf{X},\Theta)]\, r(\mathbf{X}_i).$$
93. Variance
   Proposition. Assume that X is uniformly distributed on [0, 1]^d and, for all x ∈ R^d, σ²(x) = V[Y | X = x] ≤ σ² for some positive constant σ². Then, if p_{nj} = (1/S)(1 + ξ_{nj}) for j ∈ 𝒮,
   $$\mathbb{E}\,[\bar{r}_n(\mathbf{X}) - \tilde{r}_n(\mathbf{X})]^2 \le C\sigma^2 \left(\frac{S^2}{S-1}\right)^{S/2d} (1+\xi_n)\, \frac{k_n}{n\,(\log k_n)^{S/2d}},$$
   where
   $$C = \frac{288}{\pi}\left(\frac{\pi \log 2}{16}\right)^{S/2d}.$$
94. Bias
   Proposition. Assume that X is uniformly distributed on [0, 1]^d and r is L-Lipschitz. Then, if p_{nj} = (1/S)(1 + ξ_{nj}) for j ∈ 𝒮,
   $$\mathbb{E}\,[\tilde{r}_n(\mathbf{X}) - r(\mathbf{X})]^2 \le \frac{2SL^2}{k_n^{\frac{0.75}{S\log 2}(1+\gamma_n)}} + \left[\sup_{\mathbf{x}\in[0,1]^d} r^2(\mathbf{x})\right] e^{-n/2k_n}.$$
   The rate at which the bias decreases to 0 depends on the number of strong variables, not on d.
   $k_n^{-\frac{0.75}{S\log 2}(1+\gamma_n)} = o\big(k_n^{-2/d}\big)$ as soon as S ≤ 0.54 d.
   The term $e^{-n/2k_n}$ prevents the extreme choice k_n = n.
98. Main result
   Theorem. If p_{nj} = (1/S)(1 + ξ_{nj}) for j ∈ 𝒮, with ξ_{nj} log n → 0 as n → ∞, then for the choice
   $$k_n \propto \left(\frac{L^2}{\Xi}\right)^{1/\big(1+\frac{0.75}{S\log 2}\big)} n^{1/\big(1+\frac{0.75}{S\log 2}\big)},$$
   we have
   $$\limsup_{n\to\infty}\; \sup_{(\mathbf{X},Y)\in\mathcal{F}}\; \frac{\mathbb{E}\,[\bar{r}_n(\mathbf{X}) - r(\mathbf{X})]^2}{\Xi^{\frac{0.75}{S\ln 2 + 0.75}}\, L^{\frac{2S\ln 2}{S\ln 2 + 0.75}}\; n^{\frac{-0.75}{S\log 2 + 0.75}}} \le \Lambda.$$
   Take-home message: the rate $n^{-\frac{0.75}{S\log 2 + 0.75}}$ is strictly faster than the usual minimax rate $n^{-2/(d+2)}$ as soon as S ≤ 0.54 d.
100. Dimension reduction (figure: curve plotted against the number of strong variables S, for S from 2 to 100; values decrease from about 0.4 down to 0.0196).
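As a quick sanity check on the take-home message (an illustration; the value d = 100 is an arbitrary choice of mine):

```python
# Compare the sparse rate exponent 0.75 / (S*log(2) + 0.75) with the usual
# minimax exponent 2 / (d + 2); the sparse rate wins roughly when S <= 0.54 d.
import math

d = 100
minimax = 2 / (d + 2)
for S in (2, 5, 10, 54, 60):
    sparse = 0.75 / (S * math.log(2) + 0.75)
    print(f"S={S:3d}: sparse exponent {sparse:.4f}  vs  minimax {minimax:.4f}  "
          f"(sparse faster: {sparse > minimax})")
```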
101. Discussion — Bagging
   The optimal parameter k_n depends on the unknown distribution of (X, Y).
   To correct this situation, adaptive choices of k_n should preserve the rate of convergence of the estimate.
   Another route we may follow is to analyze the effect of bagging.
   Biau, Cérou and Guyader (2010).
116. Discussion — Choosing the p_{nj}'s
   Imaginary scenario. The following splitting scheme is iteratively repeated at each node:
   1 Select at random M_n candidate coordinates to split on.
   2 If the selection is all weak, then choose one at random to split on.
   3 If there is more than one strong variable elected, choose one at random and cut.
120. Discussion — Choosing the p_{nj}'s
   Each coordinate in 𝒮 will be cut with the “ideal” probability
   $$p_n = \frac{1}{S}\left[1 - \left(1 - \frac{S}{d}\right)^{M_n}\right].$$
   The parameter M_n should satisfy
   $$\left(1 - \frac{S}{d}\right)^{M_n} \log n \to 0 \quad\text{as } n \to \infty.$$
   This is true as soon as
   $$M_n \to \infty \quad\text{and}\quad \frac{M_n}{\log n} \to \infty \quad\text{as } n \to \infty.$$
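A tiny numeric illustration (values of d, S and M_n chosen by me) of how fast this “ideal” probability approaches 1/S:

```python
# p_n = (1/S) * (1 - (1 - S/d)**M_n) tends to 1/S as the number of candidate
# coordinates M_n grows.
d, S = 100, 10
for M_n in (1, 5, 20, 50, 100):
    p_n = (1 / S) * (1 - (1 - S / d) ** M_n)
    print(f"M_n={M_n:3d}: p_n = {p_n:.4f}   (target 1/S = {1 / S:.4f})")
```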
123. Assumptions
   We have at hand an independent test set D_n.
   The model is linear:
   $$Y = \sum_{j \in \mathcal{S}} a_j X^{(j)} + \varepsilon.$$
   For a fixed node A = ∏_{j=1}^{d} A_j, fix a coordinate j and look at the weighted conditional variance V[Y | X^{(j)} ∈ A_j] · P(X^{(j)} ∈ A_j).
   If j ∈ 𝒮, then the best split is at the midpoint of the node, with a variance decrease equal to a_j²/16 > 0.
   If j ∈ 𝒲 (weak features), the decrease of the variance is always 0, whatever the location of the split.
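For intuition, the a_j²/16 figure follows from a one-line variance computation in the special case where the node is the whole cube and X^{(j)} is uniform on [0, 1] (spelled out here for completeness; it is not on the slides):
$$\mathbb{V}\big[a_j X^{(j)}\big] = \frac{a_j^2}{12}, \qquad \tfrac12\,\mathbb{V}\big[a_j X^{(j)} \mid X^{(j)} < \tfrac12\big] + \tfrac12\,\mathbb{V}\big[a_j X^{(j)} \mid X^{(j)} \ge \tfrac12\big] = \frac{a_j^2}{48},$$
so the decrease in weighted variance after a midpoint split is $a_j^2/12 - a_j^2/48 = a_j^2/16$.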
129. Discussion — Choosing the p_{nj}'s
   Near-reality scenario. The following splitting scheme is iteratively repeated at each node:
   1 Select at random M_n candidate coordinates to split on.
   2 For each of the M_n elected coordinates, calculate the best split.
   3 Select the coordinate which outputs the best within-node sum-of-squares decrease, and cut.
   Conclusion. For j ∈ 𝒮,
   $$p_{nj} \approx \frac{1}{S}\,(1 + \xi_{nj}),$$
   where ξ_{nj} → 0 and satisfies the constraint ξ_{nj} log n → 0 as n tends to infinity, provided k_n log n / n → 0, M_n → ∞ and M_n / log n → ∞.
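A sketch of the near-reality splitting rule (illustrative code; function and variable names are mine): among M_n randomly drawn candidate coordinates, pick the one whose best split gives the largest within-node sum-of-squares decrease:

```python
# CART-style split selection over a random subset of M_n coordinates.
import numpy as np

def best_split(X, Y, M_n, rng):
    """Return (coordinate, threshold, sum-of-squares decrease) of the best split."""
    n, d = X.shape
    candidates = rng.choice(d, size=min(M_n, d), replace=False)
    parent_sse = ((Y - Y.mean()) ** 2).sum()
    best = (None, None, -np.inf)
    for j in candidates:
        order = np.argsort(X[:, j])
        xs, ys = X[order, j], Y[order]
        for i in range(1, n):                       # thresholds between consecutive points
            left, right = ys[:i], ys[i:]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            decrease = parent_sse - sse
            if decrease > best[2]:
                best = (j, 0.5 * (xs[i - 1] + xs[i]), decrease)
    return best
```

Under the linear model above, a strong coordinate in the candidate set wins this comparison in expectation (positive variance decrease a_j²/16 versus 0 for weak coordinates), which is what drives p_{nj} ≈ (1/S)(1 + ξ_{nj}).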
134. Outline
   1 Setting
   2 A random forests model
   3 Layered nearest neighbors and random forests
159. Layered Nearest Neighbors
   Definition. Let X_1, ..., X_n be a sample of i.i.d. random vectors in R^d, d ≥ 2. An observation X_i is said to be a LNN of a point x if the hyperrectangle defined by x and X_i contains no other data points.
   (Figure: a query point x and the empty hyperrectangle it defines together with one of its LNNs.)
160. What is known about L_n(x)?
   ... a lot when X_1, ..., X_n are uniformly distributed over [0, 1]^d. (Here L_n(x) denotes the number of LNNs of x.) For example,
   $$\mathbb{E}\, L_n(\mathbf{x}) = \frac{2^d (\log n)^{d-1}}{(d-1)!} + O\!\big((\log n)^{d-2}\big)$$
   and
   $$\frac{(d-1)!}{2^d (\log n)^{d-1}}\, L_n(\mathbf{x}) \to 1 \quad\text{in probability as } n \to \infty.$$
   This is the problem of maxima in random vectors (Barndorff-Nielsen and Sobel, 1966).
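A small sketch (illustrative code, not from the talk) of the LNN computation, together with a crude Monte Carlo check of the formula for E L_n(x) under uniform data:

```python
import math
import numpy as np

def lnn_mask(x, X):
    """Boolean mask of the layered nearest neighbours of x among the rows of X:
    X[i] is an LNN of x if the hyperrectangle with corners x and X[i] contains
    no other sample point."""
    lo, hi = np.minimum(x, X), np.maximum(x, X)       # rectangle corners, one per point
    mask = np.ones(len(X), dtype=bool)
    for i in range(len(X)):
        inside = np.all((X >= lo[i]) & (X <= hi[i]), axis=1)
        inside[i] = False                             # ignore X[i] itself
        mask[i] = not inside.any()
    return mask

# Crude check of E L_n(x) ~ 2^d (log n)^(d-1) / (d-1)! for uniform samples (d = 2).
rng = np.random.default_rng(0)
d, n, x = 2, 1000, np.array([0.5, 0.5])
counts = [lnn_mask(x, rng.uniform(size=(n, d))).sum() for _ in range(20)]
print(np.mean(counts), 2**d * math.log(n) ** (d - 1) / math.factorial(d - 1))
```

(The agreement is only rough at such sample sizes, since the formula is asymptotic.)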
163. Two results (Biau and Devroye, 2010)
   Model. X_1, ..., X_n are independently distributed according to some probability density f (with probability measure µ).
   Theorem. For µ-almost all x ∈ R^d, one has L_n(x) → ∞ in probability as n → ∞.
   Theorem. Suppose that f is λ-almost everywhere continuous. Then
   $$\frac{(d-1)!}{2^d (\log n)^{d-1}}\, \mathbb{E}\, L_n(\mathbf{x}) \to 1 \quad\text{as } n \to \infty,$$
   at µ-almost all x ∈ R^d.
166. LNN regression estimation
   Model. (X, Y), (X_1, Y_1), ..., (X_n, Y_n) are i.i.d. random vectors of R^d × R. Moreover, |Y| is bounded and X has a density.
   The regression function r(x) = E[Y | X = x] may be estimated by
   $$r_n(\mathbf{x}) = \frac{1}{L_n(\mathbf{x})} \sum_{i=1}^{n} Y_i\, \mathbf{1}_{[\mathbf{X}_i \in \mathcal{L}_n(\mathbf{x})]},$$
   where ℒ_n(x) is the set of LNNs of x and L_n(x) its cardinality.
   1 No smoothing parameter.
   2 A scale-invariant estimate.
   3 Intimately connected to Breiman's random forests.
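A self-contained sketch of this estimate (illustrative code and toy data, not from the talk); note that it has no smoothing parameter and is scale-invariant, since the LNN set is unchanged by increasing transformations of each coordinate:

```python
# LNN regression: r_n(x) is the average response over the layered nearest
# neighbours of x, i.e. the points whose rectangle with x contains no other point.
import numpy as np

def lnn_regress(x, X, Y):
    lo, hi = np.minimum(x, X), np.maximum(x, X)
    is_lnn = np.array([
        not np.any(np.all((X >= lo[i]) & (X <= hi[i]), axis=1) & (np.arange(len(X)) != i))
        for i in range(len(X))
    ])
    return Y[is_lnn].mean() if is_lnn.any() else 0.0

rng = np.random.default_rng(1)
X = rng.uniform(size=(800, 2))
Y = np.sin(4 * X[:, 0]) + X[:, 1] + 0.1 * rng.normal(size=800)
x = np.array([0.3, 0.7])
print(lnn_regress(x, X, Y), np.sin(4 * 0.3) + 0.7)   # estimate vs true r(x)
```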
171. Consistency
   Theorem (Pointwise L^p-consistency). Assume that the regression function r is λ-almost everywhere continuous and that Y is bounded. Then, for µ-almost all x ∈ R^d and all p ≥ 1,
   $$\mathbb{E}\,|r_n(\mathbf{x}) - r(\mathbf{x})|^p \to 0 \quad\text{as } n \to \infty.$$
   Theorem (Global L^p-consistency). Under the same conditions, for all p ≥ 1,
   $$\mathbb{E}\,|r_n(\mathbf{X}) - r(\mathbf{X})|^p \to 0 \quad\text{as } n \to \infty.$$
   1 No universal consistency result with respect to r is possible.
   2 The results do not impose any condition on the density.
   3 They are also scale-free.
176. Back to random forests
   A random forest can be viewed as a weighted LNN regression estimate
   $$\bar{r}_n(\mathbf{x}) = \sum_{i=1}^{n} Y_i\, W_{ni}(\mathbf{x}),$$
   where the weights concentrate on the LNNs and satisfy
   $$\sum_{i=1}^{n} W_{ni}(\mathbf{x}) = 1.$$
177. Non-adaptive strategies
   Consider the non-adaptive random forests estimate
   $$\bar{r}_n(\mathbf{x}) = \sum_{i=1}^{n} Y_i\, W_{ni}(\mathbf{x}),$$
   where the weights concentrate on the LNNs (non-adaptive: the weights do not depend on the responses Y_i).
   Proposition. For any x ∈ R^d, assume that σ² = V[Y | X = x] is independent of x. Then
   $$\mathbb{E}\,[\bar{r}_n(\mathbf{x}) - r(\mathbf{x})]^2 \ge \frac{\sigma^2}{\mathbb{E}\, L_n(\mathbf{x})}.$$
179. Rate of convergence
   At µ-almost all x, when f is λ-almost everywhere continuous,
   $$\mathbb{E}\,[\bar{r}_n(\mathbf{x}) - r(\mathbf{x})]^2 \gtrsim \frac{\sigma^2\, (d-1)!}{2^d (\log n)^{d-1}}.$$
   Improving the rate of convergence:
   1 Stop as soon as a future rectangle split would cause a sub-rectangle to have fewer than k_n points.
   2 Resort to bagging and randomize using random subsamples.
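As a closing illustration (the mapping is mine; the talk does not mention any library), the two fixes above correspond to standard knobs of common implementations, e.g. in scikit-learn:

```python
# Approximate counterparts of the two fixes: a minimum leaf size (stop splitting
# before leaves fall below k_n points) and subsampled bagging.
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(
    n_estimators=500,
    min_samples_leaf=5,   # roughly: do not create sub-rectangles with fewer than k_n points
    bootstrap=True,
    max_samples=0.5,      # each tree is fit on a random half of the sample
    random_state=0,
)
```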