Consistency of Random Forests
Hoang N.V.
hoangnvvnua@gmail.com
Department of Computer Science
FITA – Viet Nam Institute of Agriculture
Seminar IT R&D, HANU
Ha Noi, December 2015
Machine Learning: what is it?
“true”
Parametric
Non-parametric
Supervised problems: not too difficult
Unsupervised problems: very difficult
Find a parameter which minimizes the loss function
Supervised Learning
Given a learning set ℒ_n = {(𝒙_1, 𝑦_1), …, (𝒙_n, 𝑦_n)}
L is a loss function
Classification: zero-one loss
Regression: 𝕃1, 𝕃2 loss
Bias-variance tradeoff
If the model is too simple, the solution is biased and does not fit the data.
If the model is too complex, it is very sensitive to small changes in the data.
[Hastie et al., 2005]
Ensemble Methods
Bagging
[Random Forest]
Tree Predictor
A tree is grown from the learning set ℒ_n by recursive splitting:
pick an internal node A to split
pick the best split in A
split A into two child nodes (A_L and A_R)
replace A by A_L and A_R in the current partition and repeat
A splitting scheme induces a partition Λ of the feature space into
non-overlapping rectangles P_1, …, P_ℓ.
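A minimal sketch of this recursive splitting scheme (illustrative code, not from the talk; grow_tree and best_split are hypothetical helper names, and the “best split” is taken here to be the usual squared-error criterion):

```python
import numpy as np

def best_split(X, y, feature):
    """Best threshold on one feature by total within-child squared error."""
    order = np.argsort(X[:, feature])
    xs, ys = X[order, feature], y[order]
    best_cut, best_loss = None, np.inf
    for i in range(1, len(ys)):
        if xs[i] == xs[i - 1]:
            continue
        left, right = ys[:i], ys[i:]
        loss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if loss < best_loss:
            best_cut, best_loss = (xs[i] + xs[i - 1]) / 2, loss
    return best_cut, best_loss

def grow_tree(X, y, min_leaf=5):
    """Recursive splitting; a leaf stores the mean response of its cell."""
    if len(y) < 2 * min_leaf:
        return {"leaf": y.mean()}
    splits = [(f, *best_split(X, y, f)) for f in range(X.shape[1])]
    feature, threshold, _ = min(splits, key=lambda s: s[2])
    if threshold is None:
        return {"leaf": y.mean()}
    mask = X[:, feature] <= threshold
    return {"feature": feature, "threshold": threshold,
            "left": grow_tree(X[mask], y[mask], min_leaf),
            "right": grow_tree(X[~mask], y[~mask], min_leaf)}
```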
Predicting Rule
Given the partition Λ built from ℒ_n, the tree predicts at a point 𝒙 using the
training examples of ℒ_n that fall in the same cell of Λ as 𝒙: their average
response for regression, the majority class for classification.
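A companion sketch of this rule, routing a query point down the (hypothetical) tree structure produced by grow_tree above:

```python
# Route a query point to its cell of the partition and return the value stored
# in that leaf (here, the mean response of the cell).
def predict_tree(tree, x):
    while "leaf" not in tree:
        branch = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]

# Example usage (made-up data):
# X = np.random.rand(200, 5); y = X[:, 0] + 0.1 * np.random.randn(200)
# tree = grow_tree(X, y)
# print(predict_tree(tree, X[0]), y[0])
```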
Training Methods
ID3 (Iterative Dichotomiser 3)
C4.5
CART (Classification and Regression Tree)
CHAID
MARS
Conditional Inference Tree
Forest = Aggregation of trees
Aggregating Rule
Grow m randomized trees from the same learning set ℒ_n and combine their
predictions: average the tree outputs for regression, take a majority vote
for classification.
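Written out in the notation used later in the deck (a standard reconstruction, not the slide’s own formula):

```latex
% Aggregating rule: average the m randomized trees for regression,
% majority-vote them for classification.
H_m(\mathbf{x};\mathcal{L}_n) = \frac{1}{m}\sum_{i=1}^{m} T(\mathbf{x};\Theta_i,\mathcal{L}_n)
  \quad\text{(regression)}, \qquad
H_m(\mathbf{x};\mathcal{L}_n) = \operatorname*{arg\,max}_{c}\sum_{i=1}^{m}
  \mathbf{1}\{T(\mathbf{x};\Theta_i,\mathcal{L}_n)=c\} \quad\text{(classification)}.
```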
Grow different trees from same learning set ℒ 𝑛
Sampling with replacement [Breiman, 1994]
Random subspace sampling [Ho, 1995 & 1998]
Random output sampling [Breiman, 1998]
Randomized C4.5 [Dietterich, 1998]
Purely random forest [Breiman, 2000]
Extremely randomized trees [Geurts, 2006]
Grow different trees from same learning set ℒ 𝑛
Sampling with replacement - random subspace [Breiman, 2001]
Sampling with replacement - weighted subspace [Amaratunga, 2008; Xu,
2008; Wu, 2012]
Sampling with replacement - random subspace and regularized [Deng,
2012]
Sampling with replacement - random subspace and guided-regularized
[Deng, 2013]
Sampling with replacement - random subspace and random split
position selection [Saïp Ciss, 2014]
Some RF extensions
quantile estimation [Meinshausen, 2006]
survival analysis [Ishwaran et al., 2008]
ranking [Clemencon et al., 2013]
online learning [Denil et al., 2013; Lakshminarayanan et al., 2014]
genome-wide association (GWA) problems [Yang et al., 2013; Botta et al., 2014]
What is a good learner?
[What is friendly with my data?]
What is good in high-dimensional settings
Breiman, 2001
Wu et al., 2012
Deng, 2012
Deng, 2013
Saïp Ciss, 2014
Simulation Experiment
[Figure: simulated functions y1–y4 plotted over the unit interval, with the regions
of the input space labelled A and B along the horizontal axis]
Random Forest [Breiman, 2001]
WSRF [Wu, 2012]
Random Uniform Forest [Saïp Ciss, 2014]
RRF [Deng, 2012]
GRRF [Deng, 2013]
Simulation Experiment
Random Forest [Breiman, 2001]
WSRF [Wu, 2012]
Random Uniform Forest [Saïp Ciss, 2014]
RRF [Deng, 2012]
GRRF [Deng, 2013]
GRRF with AUC [Deng, 2013]
GRRF with ER [Deng, 2013]
Simulation Experiment
Multiple Class Tree
[Figure: an example tree whose leaf nodes carry the class labels A–E]
Random Forest [Breiman, 2001]
WSRF [Wu, 2012]
Random Uniform Forest [Saïp Ciss, 2014]
RRF [Deng, 2012]
GRRF [Deng, 2013]
GRRF with ER [Deng, 2013]
What is a good learner?
[Nothing you do will convince me]
[I need rigorous theoretical guarantees]
Asymptotic statistics and learning theory
[go beyond experimental results]
Machine Learning: what is it?
Parametric
Non-parametric
Supervised problems: not too difficult
Unsupervised problems: very difficult
Find a parameter which minimizes the loss function
“right”
The pattern learnt from ℒ_n is “true”, isn’t it? How much should I believe it?
Is this procedure friendly with my data?
What is the best possible procedure for my problem?
What if our assumptions are wrong?
“efficient”
How many observations do I need in order to achieve a “believed” pattern?
How many computations do I need?
Assumption: there are some patterns
Learning Theory [Vapnik, 1999]
asymptotic theory
necessary and sufficient conditions
the best possible
Supervised Learning
Given a learning set ℒ_n = {(𝒙_1, 𝑦_1), …, (𝒙_n, 𝑦_n)}
L is a loss function
Classification: zero-one loss
Regression: 𝕃1, 𝕃2 loss
Supervised Learning
[Diagram, after Vapnik: a Generator draws 𝒙, a Supervisor returns the response 𝑦,
and the Learning Machine outputs its own answer 𝑦′ for the same 𝒙]
Two different goals
imitate (prediction accuracy)
identify (interpretability)
What is the best predictor?
What is the best predictor?
The Bayes model φ_B is the best possible model; its error Err(φ_B) is the
residual error.
For a model φ_ℒ built from any learning set ℒ, Err(φ_B) ≤ Err(φ_ℒ).
In theory, φ_B is available when P(X, Y) is known.
What is the best predictor?
If L is the zero-one loss function, the Bayes model is
φ_B(𝒙) = arg max_y P(Y = y | X = 𝒙):
in classification, the best possible classifier consists in systematically
predicting the most likely class y ∈ {c_1, …, c_J} given X = 𝒙.
What is the best predictor?
If L is the squared error loss function, the Bayes model is φ_B(𝒙) = 𝔼[Y | X = 𝒙]:
in regression, the best possible regressor consists in systematically
predicting the average value of Y given X = 𝒙.
Given a learning algorithm 𝒜 and a loss function L, let φ_n = 𝒜(ℒ_n) denote the
model trained from the learning set ℒ_n.
The learning algorithm 𝒜 is consistent in L if and only if
Err(φ_n) ⟶ Err(φ_B) in probability as n ⟶ ∞.
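As a display, with the error functional spelled out (standard notation; the slide itself only sketches the symbols):

```latex
% Consistency of a learning algorithm A for a loss L:
% the risk of the trained model converges in probability to the Bayes risk.
\varphi_n = \mathcal{A}(\mathcal{L}_n), \qquad
\operatorname{Err}(\varphi) = \mathbb{E}_{(X,Y)}\bigl[L\bigl(Y,\varphi(X)\bigr)\bigr], \qquad
\operatorname{Err}(\varphi_n) \xrightarrow[n\to\infty]{\ \mathbb{P}\ } \operatorname{Err}(\varphi_B).
```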
Random Forests are consistent, aren’t they?
Generalized Random Forest
A randomized learning algorithm 𝒜(Θ, ℒ_n) takes an extra random parameter Θ.
Θ is used to sample the training set or to select the candidate directions
or positions for splitting.
Classically, Θ is independent of the dataset and thus unrelated to the
particular problem.
In some new variants of RF, Θ depends on the dataset.
Generalized Random Forest
Bagging procedure:
draw i.i.d. copies Θ_1, …, Θ_m of Θ
grow a randomized tree 𝒜(Θ_i, ℒ_n) from ℒ_n for each Θ_i
aggregate the m trees grown with Θ_1, …, Θ_m into the forest predictor H_m(·; ℒ_n)
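A compact sketch of this bagging procedure (an illustration only: it subsamples features once per tree, closer to Ho’s random subspace method than to Breiman’s per-node selection, and uses scikit-learn trees as the base learner):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_forest(X, y, m=100, mtry=None, rng=None):
    """Grow m randomized trees; Theta_i = (bootstrap rows, feature subset) per tree."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    mtry = mtry if mtry is not None else max(1, p // 3)
    forest = []
    for _ in range(m):
        rows = rng.integers(0, n, size=n)                # sampling with replacement
        cols = rng.choice(p, size=mtry, replace=False)   # random feature subspace
        tree = DecisionTreeRegressor().fit(X[np.ix_(rows, cols)], y[rows])
        forest.append((cols, tree))
    return forest

def predict_forest(forest, X):
    """Aggregating rule: average the m individual tree predictions."""
    return np.mean([tree.predict(X[:, cols]) for cols, tree in forest], axis=0)
```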
Consistency of Random Forests
lim_{n→∞} Err(H_m(·; ℒ_n)) = Err(φ_B)?
Problem
- m is finite ⇒ the predictor depends on the particular trees that form the forest
- the structure of a tree depends on Θ_i and on the learning set
⟹ a finite forest is actually a subtle combination of randomness and
data-dependent structures
⟹ finite-forest predictions can be difficult to interpret (random prediction or not)
- non-asymptotic rate of convergence
Challenges
H_∞(𝒙; ℒ_n) = 𝔼_Θ[T(𝒙; Θ, ℒ_n)]
lim_{m→∞} H_m(𝒙; ℒ_n) = H_∞(𝒙; ℒ_n)
Problem
- is the infinite forest really better than a finite forest?
- what is a good m? (rate of convergence)
Challenge
Consistency of Random Forests
Review some recent results
Strength, Correlation and Err [Breiman, 2001]
Each classifier is a randomized tree 𝒜(Θ_i, ℒ_n), with the Θ_i i.i.d.
Theorem 2.3 An upper bound for the generalization error is given by
PE* ≤ ρ(1 − s²)/s²,
where ρ is the mean value of the correlation and s is the strength of the
set of classifiers.
RF and Adaptive Nearest Neighbors [Lin et al., 2006]
A tree 𝒜(Θ_i, ℒ_n) predicts at 𝒙 by averaging the responses of the training
points that fall in the same leaf L(Θ_i, 𝒙) as 𝒙:
T(𝒙; Θ_i, ℒ_n) = (1/#{j : 𝒙_j ∈ L(Θ_i, 𝒙)}) Σ_{j : 𝒙_j ∈ L(Θ_i, 𝒙)} y_j = Σ_{j=1}^{n} w_jΛ_i(𝒙) y_j
Averaging over Θ, the infinite forest is again a weighted average,
H_∞(𝒙; ℒ_n) = Σ_{j=1}^{n} w_j(𝒙) y_j with w_j(𝒙) = 𝔼_Θ[w_jΛ(𝒙)]:
a random forest acts as a (potentially adaptive) weighted nearest-neighbour scheme.
Non-adaptive if the weights w_j do not depend on the y_i’s of the learning set.
The terminal node size k should be made to increase with the sample size 𝑛.
Therefore, growing large trees (k being a small constant) does not always
give the best performance.
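A quick numerical check of this weighted-neighbour view (illustrative; it uses scikit-learn’s RandomForestRegressor as a stand-in forest and recovers the implicit weights from the leaf assignments):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((300, 5))
y = X[:, 0] + 0.1 * rng.standard_normal(300)

# bootstrap=False so every tree's leaf value is the mean of the original responses in that leaf
forest = RandomForestRegressor(n_estimators=50, max_features=0.6,
                               bootstrap=False, random_state=0).fit(X, y)

x0 = rng.random((1, 5))
train_leaves = forest.apply(X)    # (n_samples, n_trees): leaf index of each training point
query_leaves = forest.apply(x0)   # (1, n_trees): leaf index of the query point

# w_{j,Lambda_i}(x0) = 1/leaf-size if x_j shares x0's leaf in tree i, else 0;
# the forest weight w_j(x0) averages these over the trees.
same_leaf = train_leaves == query_leaves
weights = (same_leaf / same_leaf.sum(axis=0)).mean(axis=1)

print(np.allclose(weights @ y, forest.predict(x0)[0]))  # True: the forest is this weighted average
```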
Biau et al., 2008
Given a learning set ℒ_n = {(𝒙_1, 𝑦_1), …, (𝒙_n, 𝑦_n)} of ℝ^d × {0, 1}
Binary classifier φ_n trained from ℒ_n: ℝ^d ⟶ {0, 1}
Err(φ_n) = P(φ_n(X) ≠ Y)
The Bayes classifier φ_B(𝒙) = 𝕀{P(Y = 1 | X = 𝒙) > 1/2} minimizes this error.
A sequence {φ_n} of classifiers is consistent for a certain distribution of
(X, Y) if Err(φ_n) ⟶ Err(φ_B) in probability.
Assume that the sequence {T_n} of randomized classifiers is consistent for
a certain distribution of (X, Y). Then the voting classifier H_m (for any value
of m) and the averaged classifier H_∞ are also consistent.
Biau et al., 2008
Growing Trees
A node A is selected at random.
The split feature j is selected uniformly at random from {1, …, p}.
Finally, the selected node is split along the randomly chosen feature at a random location.
* Recursive node splits do not depend on the labels y_1, …, y_n
Theorem 2 Assume that the distribution of X is supported on [0, 1]^d. Then the
purely random forest classifier H_∞ is consistent whenever k ⟶ ∞ and
k/n ⟶ 0 as n ⟶ ∞.
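A sketch of the purely random partitioning analysed here (labels are never consulted; the function name and the box representation are just for illustration):

```python
import numpy as np

def grow_purely_random_tree(n_leaves=8, d=2, rng=None):
    """Partition [0, 1]^d into n_leaves boxes using label-free random splits."""
    rng = np.random.default_rng(rng)
    leaves = [[(0.0, 1.0)] * d]
    while len(leaves) < n_leaves:
        leaf = leaves.pop(rng.integers(len(leaves)))   # pick a leaf at random
        j = rng.integers(d)                            # pick a coordinate at random
        lo, hi = leaf[j]
        cut = rng.uniform(lo, hi)                      # pick a cut position at random
        left, right = list(leaf), list(leaf)
        left[j], right[j] = (lo, cut), (cut, hi)
        leaves += [left, right]
    return leaves   # each leaf would then predict by majority vote of the labels it contains
```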
Biau et al., 2008
Growing Trees
Each tree is built by 𝒜 on a random subsample of ℒ_n in which every observation
is retained with probability q_n.
Theorem 6 Let {T_Λ} be a sequence of classifiers that is consistent for the
distribution of (X, Y). Consider the bagging classifiers H_m and H_∞ built with
parameter q_n. If n q_n ⟶ ∞ as n ⟶ ∞, then both classifiers are consistent.
Biau, 2012
Growing Trees
At each node, coordinate j is selected with probability p_nj ∈ (0, 1),
and the split is at the midpoint of the chosen side.
Theorem 1 Assume that the distribution of X has support on [0, 1]^d. Then the
random forests estimate H_∞(𝒙; ℒ_n) is consistent whenever p_nj log k_n ⟶ ∞
for all j = 1, …, p and k_n/n ⟶ 0 as n ⟶ ∞.
Biau, 2012
In sparse settings: assume that X is uniformly distributed on [0, 1]^p and that
p_nj = (1/S)(1 + ξ_nj) for j ∈ 𝒮, the set of the S informative features.
Estimation Error (variance)
The estimation-error (variance) term of H_∞(𝒙; ℒ_n) satisfies
variance ≤ C σ² (S²/(S − 1))^{S/(2p)} (1 + ξ_n) k_n / (n (log k_n)^{S/(2p)})
If a < p_nj < b for some constants a, b ∈ (0, 1), then
1 + ξ_n ≤ ((S − 1)/(S² a(1 − b)))^{S/(2p)}
Biau, 2012
In sparse settings: assume that X is uniformly distributed on [0, 1]^p, that
φ_B(𝒙_𝒮) is L-Lipschitz on [0, 1]^S, and that p_nj = (1/S)(1 + ξ_nj) for j ∈ 𝒮.
Approximation Error (bias²)
𝔼[H_∞(𝒙; ℒ_n) − φ_B(𝒙)]² ≤ 2 S L² k_n^{−(0.75/(S log 2))(1 + γ_n)}
+ [sup_{𝒙∈[0,1]^p} φ_B²(𝒙)] e^{−n/(2 k_n)}
where γ_n = min_{j∈𝒮} ξ_nj tends to 0 as n tends to infinity.
Finite and infinite RFs [Scornet, 2014]
The finite forest H_m(𝒙; ℒ_n) averages m trees grown from ℒ_n with parameters
Θ_1, …, Θ_m; the infinite forest is H_∞(𝒙; ℒ_n) = 𝔼_Θ[T(𝒙; Θ, ℒ_n)].
Theorem 3.1 Conditionally on ℒ_n, almost surely, for all 𝒙 ∈ [0, 1]^p, we have
H_m(𝒙; ℒ_n) ⟶ H_∞(𝒙; ℒ_n) as m ⟶ ∞.
Finite and infinite RFs [Scornet, 2014]
Assumption (H) One has Y = m(X) + ε, where ε is a centered Gaussian noise with
finite variance σ², independent of X, and ‖m‖_∞ = sup_{𝒙∈[0,1]^p} |m(𝒙)| < ∞.
Theorem 3.3 Assume (H) is satisfied. Then, for all m, n ∈ ℕ*,
Err(H_m(𝒙; ℒ_n)) = Err(H_∞(𝒙; ℒ_n)) + (1/m) 𝔼_{X,ℒ_n}[𝕍_Θ[T_Λ(𝒙; Θ, ℒ_n)]]
and 0 ≤ Err(H_m) − Err(H_∞) ≤ (8/m)(‖m‖_∞² + σ²(1 + 4 log n)).
⇒ if m ≥ 8(‖m‖_∞² + σ²)/ε + 32 σ² log n / ε, then Err(H_m) − Err(H_∞) ≤ ε.
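A back-of-the-envelope use of the reconstructed bound above, with made-up values of ‖m‖_∞, σ² and ε (purely illustrative):

```python
import math

def trees_needed(m_inf, sigma2, n, eps):
    """m >= 8(||m||_inf^2 + sigma^2)/eps + 32*sigma^2*log(n)/eps  (reconstructed bound)."""
    return math.ceil(8 * (m_inf ** 2 + sigma2) / eps + 32 * sigma2 * math.log(n) / eps)

print(trees_needed(m_inf=1.0, sigma2=0.25, n=10_000, eps=0.01))  # -> 8369 trees
```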
RF and Additive regression model [Scornet et al., 2015]
Growing Trees
Each tree is grown on a_n points drawn from ℒ_n without replacement.
Assume that A is the selected node and that it contains more than one observation.
Select uniformly, without replacement, a subset ℳ_try ⊂ {1, …, p}, |ℳ_try| = m_try.
Select the best split in A by optimizing the CART-split criterion along the coordinates in ℳ_try.
Cut the cell A according to the best split. Call A_L and A_R the two resulting cells.
Replace A by A_L and A_R in the current partition and repeat.
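A sketch of one node split in this scheme: draw m_try candidate coordinates and optimise the empirical CART (variance-reduction) criterion over them (illustrative helper with hypothetical names):

```python
import numpy as np

def cart_split(X, y, mtry, rng=None):
    """Return (feature, threshold) maximising the empirical variance reduction."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    m_try = rng.choice(p, size=mtry, replace=False)   # candidate coordinates
    best, best_gain = None, -np.inf
    parent_sse = ((y - y.mean()) ** 2).sum()
    for j in m_try:
        for cut in np.unique(X[:, j])[:-1]:
            left = y[X[:, j] <= cut]
            right = y[X[:, j] > cut]
            gain = parent_sse - ((left - left.mean()) ** 2).sum() \
                              - ((right - right.mean()) ** 2).sum()
            if gain > best_gain:
                best, best_gain = (j, cut), gain
    return best
```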
RF and Additive regression model [Scornet et al., 2015]
Assumption (H1) Additive regression model: Y = Σ_{j=1}^{p} m_j(X^{(j)}) + ε.
Theorem 3.1 Assume that (H1) is satisfied. Then, provided a_n ⟶ ∞ and
t_n (log a_n)^9 / a_n ⟶ 0, the random forest estimate is consistent.
Theorem 3.2 Assume that (H1) and (H2) are satisfied and let t_n = a_n.
Then, provided a_n ⟶ ∞, t_n ⟶ ∞ and a_n log n / n ⟶ 0, the random forest
estimate is consistent.
RF and Additive regression model [Scornet et al., 2015]
Denote by j_{1,n}(X), …, j_{k,n}(X) the first k cut directions used to construct
the cell containing X, with j_{q,n}(X) = ∞ if the cell has been cut strictly
fewer than q times.
Theorem 3.2 Assume that (H1) is satisfied. Let k ∈ ℕ* and ξ > 0. Assume that
there is no interval [a, b] and no j ∈ {1, …, S} such that m_j is constant on
[a, b]. Then, with probability 1 − ξ, for all n large enough, we have, for all
1 ≤ q ≤ k, j_{q,n}(X) ∈ {1, …, S}.
[Wager, 2015]
A partition Λ is (α, k)-valid if it can be generated by a recursive partitioning
scheme in which each child node contains at least a fraction α of the data points
in its parent node, for some 0 < α < 0.5, and each terminal node contains at least
k training examples, for some k ∈ ℕ.
Given a dataset X, let 𝒱_{α,k}(X) denote the set of (α, k)-valid partitions.
T_Λ: [0,1]^p → ℝ, T_Λ(𝒙) = (1/|{i : 𝒙_i ∈ L(𝒙)}|) Σ_{i : 𝒙_i ∈ L(𝒙)} y_i (called a valid tree)
T_Λ*: [0,1]^p → ℝ, T_Λ*(𝒙) = 𝔼[Y | X ∈ L(𝒙)] (called the partition-optimal tree)
Question: can we treat T_Λ as a good approximation to the partition-optimal tree
T_Λ* supported on the same partition Λ?
[Wager, 2015]
Given a learning set ℒ_n of [0, 1]^p × [−M/2, M/2] with X ~ U([0, 1]^p).
Theorem 1 Given parameters n, p, k such that
lim_{n→∞} (log n log p)/k = 0 and p = Ω(n),
then
lim_{n,p,k→∞} ℙ[ sup_{𝒙∈[0,1]^p, Λ∈𝒱_{α,k}} |T_Λ(𝒙) − T_Λ*(𝒙)| ≤ 6M √(log n log p / (k log((1 − α)^{−1}))) ] = 1.
[Wager, 2015]
Growing Trees (guess-and-check)
Select a currently un-split node A containing at least 2k training examples.
Pick a candidate splitting variable j ∈ {1, …, p} uniformly at random.
Pick the splitting point θ̂ with the minimum squared error ℓ(θ̂).
If either there has already been a successful split along variable j for some other node, or
ℓ(θ̂) ≥ 36 M² log n log p / (k log((1 − α)^{−1})),
the split succeeds and we cut the node A at θ̂ along the j-th variable; if not, we do
not split the node A this time.
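The acceptance test above, written out as it reads on the slide (a literal transcription with hypothetical values; the exact role of ℓ in the paper may differ):

```python
import math

def split_accepted(loss, n, p, k, M, alpha, already_split_on_j):
    """Accept if ell(theta_hat) >= 36 M^2 log(n) log(p) / (k log((1-alpha)^-1)),
    or if variable j has already produced a successful split elsewhere."""
    threshold = 36 * M ** 2 * math.log(n) * math.log(p) / (k * math.log(1.0 / (1.0 - alpha)))
    return already_split_on_j or loss >= threshold

print(split_accepted(loss=0.9, n=5_000, p=200, k=100, M=1.0, alpha=0.2,
                     already_split_on_j=False))   # False: 0.9 is below the threshold (~72.8)
```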
[Wager, 2015]
In sparse settings
Assumption H1 There exist a set 𝒮 of signal features, a constant β > 0 and sign
variables σ_j ∈ {±1} such that, for all j ∈ 𝒮 and all 𝒙 ∈ [0, 1]^p,
𝔼[Y | X^{(−j)} = 𝒙^{(−j)}, X^{(j)} > 1/2] − 𝔼[Y | X^{(−j)} = 𝒙^{(−j)}, X^{(j)} ≤ 1/2] ≥ β σ_j
Assumption H2 The conditional mean 𝔼[Y | X = 𝒙] is Lipschitz-continuous in the
signal coordinates 𝒙^{(𝒮)}.
Theorem 2 Under the conditions of Theorem 1, suppose that the assumptions of the
sparse setting hold; then the guess-and-check forest is consistent.
[Wager, 2015]
H_{Λ_1^B}: [0, 1]^p → ℝ, H_{Λ_1^B}(𝒙) = (1/B) Σ_{b=1}^{B} T_{Λ_b}(𝒙) (called a valid forest)
H*_{Λ_1^B}: [0, 1]^p → ℝ, H*_{Λ_1^B}(𝒙) = (1/B) Σ_{b=1}^{B} T*_{Λ_b}(𝒙) (called the partition-optimal forest)
Theorem 4
lim_{n,p,k→∞} ℙ[ sup_{H∈ℋ_{α,k}} |(1/n) Σ_{i=1}^{n} (y_i − H(𝒙_i))² − 𝔼[(Y − H(X))²]| ≤ 11 M² log n log p / (k log((1 − α)^{−1})) ] = 1
References
B. Efron. Estimation and accuracy after model selection. Journal of the American Statistical Association, 2013.
B. Lakshminarayanan et al. Mondrian forests: Efficient online random forests. arXiv:1406.2673, 2014.
B. Xu et al. Classifying very high-dimensional data with random forests built from small subspaces. International
Journal of Data Warehousing and Mining (IJDWM), 8(2):44–63, 2012.
D. Amaratunga et al. Enriched random forests. Bioinformatics (Oxford, England), 24(18): 2010–2014, 2008
doi:10.1093/bioinformatics/btn356
E. Scornet. On the asymptotics of random forests. arXiv:1409.2090, 2014
E. Scornet et al. Consistency of random forests. The Annals of Statistics, 43(4):1716–1741, 2015. doi:10.1214/15-AOS1321.
http://projecteuclid.org/euclid.aos/1434546220.
G. Biau et al. Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research,
9:2015–2033, 2008.
G. Biau. Analysis of a random forests model. Journal of Machine Learning Research, 13:1063–1095, 2012.
H. Deng et al. Feature Selection via Regularized Trees, The 2012 International Joint Conference on Neural Networks
(IJCNN), IEEE, 2012.
H. Deng et al. Gene Selection with Guided Regularized Random Forest , Pattern Recognition, 46.12 (2013): 3483-3489
H. Ishwaran et al. Random survival forests. The Annals of Applied Statistics, 2:841–860, 2008.
L. Breiman. Bagging predictors. Technical Report No. 421, Statistics Department, UC Berkeley, 1994.
References
L. Breiman. Randomizing outputs to increase prediction accuracy, Technical Report 518, Statistics Department,
UC Berkeley, 1998.
L. Breiman. Some infinite theory for predictor ensembles. Technical Report 577, Statistics Department, UC Berkeley, 2000.
L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.
M. Denil et al. Consistency of online random forests. International Conference on Machine Learning (ICML) 2013.
N. Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7:983–999, 2006.
P. Geurts et al. Extremely randomized trees. Machine Learning, 63(1):3-42, 2006.
Q. Wu et al. SNP selection and classification of genome-wide SNP data using stratified sampling random forests.
IEEE Transactions on NanoBioscience, 11(3):216–227, 2012.
T. Dietterich. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging,
Boosting and Randomization, Machine Learning 1-22, 1998
T.K. Ho. Random decision forests. In Proceedings of the Third International Conference on Document Analysis and
Recognition, 1:278–282, 14–16 Aug 1995. doi: 10.1109/ICDAR.1995.598994
T.K. Ho. The random subspace method for constructing decision forests, IEEE Trans. on Pattern Analysis and Machine
Intelligence, 20(8):832-844, 1998
Saïp Ciss. Random Uniform Forests. 2015. <hal-01104340v2>
S. Clémençon et al. Ranking forests. Journal of Machine Learning Research, 14:39–73, 2013.
References
S. Wager. Asymptotic theory for random forests. arXiv:1405.0352, 2014
S. Wager. Uniform convergence of random forests via adaptive concentration. arXiv:1503.06388, 2015
Y. Lin and Y. Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association,
101:578–590, 2006.
V. N. Vapnik. An overview of Statistical Learning Theory. IEEE Trans. on Neural Networks, 10(5):988-999, 1999.
RF and Additive regression model [Scornet et al., 2015]
Appendix: for the proofs, one considers the indicator that a point X′ falls in the
same cell as X in the random tree designed with ℒ_n and the random parameter Θ,
where X′ is an independent copy of X, independent of X_1, …, X_n.
