PaloBoost: An Overfitting-robust TreeBoost with Out-of-Bag Sample Regularization Technique
https://arxiv.org/abs/1807.08383
Yubin Park and Joyce Ho
Motivation
• Stochastic Gradient TreeBoost is a powerful algorithm
• Many winning solutions in various data science challenges
• Fast training speed compared to other algorithms
• Can handle various data types, including missing values
• BUT, the algorithm can be even more powerful IF…
• It is less sensitive to hyperparameters, such as learning rate, tree depth, etc.
• And at the same time, it provides robust performance without overfitting
• The question is: Can we build such an algorithm?
Out-of-Bag Samples? (1)
• Out-of-Bag (OOB) samples
  • The samples that are not included in the subsampling process
  • Available in many subsampling-based algorithms such as Random Forests
• In Stochastic Gradient TreeBoost,
  • OOB samples are often used to estimate cross-validation errors
  • They can be used for early stopping
  • However, the OOB errors do not affect the actual training process much
• We think OOB samples can provide more value than this
[Figure: Stochastic Gradient TreeBoost over boosting iterations. Each subsample splits the data into In-Bag and OOB parts; the OOB samples only monitor OOB errors and do not impact the overall training process (→ under-utilized), so multiple iterations are needed to determine the best hyperparameters.]
Out-of-Bag Samples? (2)
• PaloBoost utilizes the under-utilized out-of-bag samples
• Gradient-Aware Pruning
  • Prune a decision tree if its gradient steps do not decrease OOB errors
• Adaptive Learning Rates
  • Find the optimal learning rates for the gradient steps w.r.t. OOB errors
[Figure: PaloBoost over boosting iterations. In-Bag samples act as training set #1 and OOB samples as validation / training set #2, providing immediate feedback on the hyperparameters; different learning rates and tree depths at each stage → fully utilized.]
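Gradient-Aware Pruning can be sketched for the squared-loss case: a leaf's gradient step is zeroed out whenever applying it would not reduce that leaf's OOB error. This is our own minimal illustration, not the bonsai-dt implementation, and all names are illustrative:

```python
import numpy as np

def gradient_aware_pruning(gamma, residual_oob, leaf_oob, n_leaves):
    """Zero out a leaf's step if it does not reduce OOB squared error.

    gamma        : per-leaf estimates, shape (n_leaves,)
    residual_oob : OOB residuals y_i - F_{m-1}(x_i)
    leaf_oob     : leaf index of each OOB sample
    """
    pruned = gamma.copy()
    for j in range(n_leaves):
        r = residual_oob[leaf_oob == j]
        if r.size == 0:
            continue  # no OOB evidence for this leaf; leave it alone
        err_before = np.mean(r ** 2)
        err_after = np.mean((r - gamma[j]) ** 2)
        if err_after >= err_before:  # the step hurts OOB error -> prune
            pruned[j] = 0.0
    return pruned
```

For example, a leaf whose estimate overshoots its OOB residuals badly (say gamma = 10 against residuals of 1) gets pruned, while a well-calibrated leaf is kept.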
PaloBoost: Adaptive Learning Rate
• PaloBoost sets different learning rates for each tree leaf
• This allows closed-form solutions for the optimal learning rates
• Bernoulli distribution:
• Gaussian distribution:
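The closed-form expressions on this slide were images in the original deck. For the Gaussian (squared-loss) case the per-leaf rate can be derived directly: minimizing the leaf's OOB error sum_i (z_i − ν·γ_j)² over ν gives ν* = sum(z_i) / (n·γ_j), clipped to [0, ν_max]. A minimal sketch under that derivation (names are ours; the paper's exact expressions may differ, and the Bernoulli case is not shown):

```python
import numpy as np

def adaptive_rate_gaussian(gamma_j, residual_oob_leaf, nu_max):
    """Closed-form per-leaf learning rate for squared loss.

    Minimizes sum_i (z_i - nu * gamma_j)^2 over the leaf's OOB
    residuals z_i: nu* = sum(z_i) / (n * gamma_j), clipped to [0, nu_max].
    """
    z = np.asarray(residual_oob_leaf, dtype=float)
    if z.size == 0 or gamma_j == 0.0:
        return 0.0  # no OOB evidence or degenerate leaf: take no step
    nu_star = z.sum() / (z.size * gamma_j)
    return float(np.clip(nu_star, 0.0, nu_max))
```

Note how the clipping realizes "slow down": when the in-bag estimate gamma_j overshoots the OOB residuals, the optimal rate shrinks toward zero.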
PaloBoost: Algorithm
For more information, please see the full paper.

ALGORITHM: PaloBoost
F_0 = argmin_γ Σ_{i=1}^{N} L(y_i, γ)
for m = 1 to M do:
    {y_i, x_i}_1^{N'} = Subsample({y_i, x_i}_1^{N}, rate = q)
    z_i = −[∂L(y_i, F(x_i)) / ∂F(x_i)]_{F = F_{m−1}}, for i = 1, …, N'
    {R_jm}_1^{J} = RegressionTree({z_i, x_i}_1^{N'})
    γ_jm = argmin_γ Σ_{x_i ∈ R_jm} L(y_i, F_{m−1}(x_i) + γ), for j = 1, …, J
    {R_jm, γ_jm}_1^{J'} = Gradient-Aware-Pruning({R_jm, γ_jm}_1^{J}, ν_max)
    ν_jm = Learning-Rate-Adjustment(γ_jm, ν_max), for j = 1, …, J'
    F_m(x) = F_{m−1}(x) + Σ_{j=1}^{J'} ν_jm γ_jm 1(x ∈ R_jm)
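One boosting stage of this algorithm can be sketched for squared loss. This toy combines the subsample, negative-gradient fit, and OOB-based per-leaf rate into a single update; `fit_tree` is an assumed helper (it returns a leaf-assignment function and per-leaf estimates), and every name here is illustrative rather than the bonsai-dt API:

```python
import numpy as np

def paloboost_iteration(F_prev, X, y, fit_tree, nu_max, q, rng):
    """One PaloBoost stage for squared loss (simplified sketch).

    F_prev   : callable mapping X -> current predictions F_{m-1}
    fit_tree : callable (X, z) -> (leaf_of, gamma), where leaf_of maps
               samples to leaf ids and gamma holds per-leaf estimates
    """
    n = len(y)
    in_bag = rng.choice(n, size=int(q * n), replace=False)
    oob = np.setdiff1d(np.arange(n), in_bag)

    z = y - F_prev(X)  # negative gradient = residuals for squared loss
    leaf_of, gamma = fit_tree(X[in_bag], z[in_bag])

    # Per-leaf learning rates from OOB residuals (pruning = rate of 0)
    nu = np.zeros_like(gamma)
    leaf_oob = leaf_of(X[oob])
    for j in range(len(gamma)):
        zj = z[oob][leaf_oob == j]
        if zj.size and gamma[j] != 0.0:
            nu[j] = np.clip(zj.sum() / (zj.size * gamma[j]), 0.0, nu_max)

    def F_new(Xq):
        # F_m(x) = F_{m-1}(x) + sum_j nu_j * gamma_j * 1(x in R_j)
        leaves = leaf_of(Xq)
        return F_prev(Xq) + nu[leaves] * gamma[leaves]
    return F_new
```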
The feature importance in PaloBoost combines the node estimate, the region coverage, and the region-specific learning rate. The importance of a feature f_k is calculated as:

Imp(f_k) = Σ_{m=1}^{M} [ Σ_{j=1}^{J_m} ν_jm |R_jm| |γ_jm| 1(f_k ∈ Rules_jm) ] / [ Σ_{j=1}^{J_m} |R_jm| ]   (14)

where |R_jm| represents the coverage of the region R_jm, and Rules_jm denotes the logical rules that define the region R_jm. For example, if Rules_jm is of the form (x_0 ≤ 0.5 AND x_1 ≤ 1.0 AND x_3 ≤ 1), then 1(f_1 ∈ Rules_jm) will be one, as x_1 appears in the rules. With our feature importance formula, if a feature is used to define a region, we multiply the size of the region (|R_jm|) by the effective absolute node estimate, ν_jm |γ_jm|. Therefore, if any of the three quantities (coverage of the region, the leaf estimate, or the adaptive learning rate) is very small, the feature importance contribution is minimal.
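Eq. (14) is easy to sketch directly. In this illustration each boosting stage is a list of leaves carrying (learning rate, estimate, coverage, features used in the leaf's rules); the data layout and names are ours, not bonsai-dt's:

```python
def feature_importance(k, leaves_per_stage):
    """Eq. (14): per stage, the rate- and coverage-weighted absolute
    estimates of leaves whose rules use feature k, normalized by the
    stage's total coverage, summed over all stages.

    leaves_per_stage: list of stages; each stage is a list of tuples
    (nu, gamma, coverage, features_in_rules).
    """
    total = 0.0
    for stage in leaves_per_stage:
        num = sum(nu * cov * abs(gamma)
                  for nu, gamma, cov, feats in stage if k in feats)
        den = sum(cov for _, _, cov, _ in stage)
        if den > 0:
            total += num / den
    return total
```

A feature's contribution vanishes whenever any of the three factors (rate, coverage, |estimate|) is near zero, which is exactly how noisy features end up with near-zero importance in the experiments below.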
Experimental Results: Predictive Performance
Friedman's simulated data from "Multivariate Adaptive Regression Splines"
Robust to overfitting
Experimental Results: OOB Regularizations
Gradient-Aware Pruning (left) and Adaptive Learning Rate (right)
PaloBoost knows when to "slow down"
Experimental Results: Feature Importances
x5–x9 are noisy features; in theory, they should have zero importance.
PaloBoost clearly identifies noisy features
Discussions
• PaloBoost automatically adapts to mitigate overfitting by knowing when it needs to
“slow down”
• Moreover, PaloBoost’s importance estimates can accurately capture the true
importances
• Future improvements:
• Removing low learning rate trees?
• Combining with learning rate schedule approaches?
• Reducing the impact of “max” learning rate and tree depth?
Software
• Available in the Bonsai-DT framework
• Project Website: https://yubin-park.github.io/bonsai-dt/
• PaloBoost: https://github.com/yubin-park/bonsai-dt/blob/master/bonsai/ensemble/paloboost.py
• Script for the Experimental Results: https://github.com/yubin-park/bonsai-dt/blob/master/research/paloboost.ipynb
• Paper link: https://arxiv.org/abs/1807.08383
