XGBoost: A Scalable Tree Boosting System
Simon Lia-Jonassen
Motivation
• Used by the majority of the 2015 Kaggle challenge winning solutions, ahead of deep neural nets, the second most popular method.
• Also used by the 10 best teams in KDDCup’15.
• Applies to classification, regression and learning-to-rank tasks.
• Usually outperforms alternatives in an out-of-the-box setting.
• Combines a good theoretical foundation with a highly efficient implementation.
• So, how does it work?
Decision Tree Boosting
The model is an additive ensemble of trees: $\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$, where $K$ is the number of trees and each $f_k$ is a tree function that maps the instance features $x_i$ to one of its leaf weights.
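As a minimal sketch of this additive model (the stub trees, thresholds and weights below are made up for illustration, not taken from the slides):

```python
import numpy as np

# Each "tree" f_k maps an instance's features to one of its leaf weights.
def tree_1(x):
    return 0.5 if x[0] < 2.0 else -0.3   # illustrative split on feature 0

def tree_2(x):
    return 0.1 if x[1] < 1.0 else 0.4    # illustrative split on feature 1

trees = [tree_1, tree_2]                  # K = 2 trees

def predict(x):
    # y_hat(x) = sum_k f_k(x): the prediction is the sum of the leaf weights.
    return sum(f(x) for f in trees)

print(predict(np.array([1.5, 2.0])))      # 0.5 + 0.4 = 0.9
```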
Regularized Learning Objective
$\mathcal{L} = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k)$: the first term is the prediction loss, the second a complexity penalty with $\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2$, where $T$ is the number of leaves and the last term is L2 regularization on the leaf weights.
Regularized Learning Objective
By the additive definition, $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$, so a second-order Taylor expansion of the loss at iteration $t$ gives
$\mathcal{L}^{(t)} \simeq \sum_i \big[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \big] + \Omega(f_t)$
where $g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ is the first-order gradient of the loss function and $h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ is the second-order gradient. For squared-error loss, for example, $g_i = 2(\hat{y}_i^{(t-1)} - y_i)$ and $h_i = 2$.
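These per-instance pairs $(g_i, h_i)$ are exactly what XGBoost's custom-objective hook expects; a hedged sketch for squared-error loss, where the data and parameter values are invented for illustration:

```python
import numpy as np
import xgboost as xgb

def squared_error_obj(preds, dtrain):
    """Return per-instance g_i and h_i for the loss l = (y_i - y_hat_i)^2."""
    y = dtrain.get_label()
    grad = 2.0 * (preds - y)          # g_i: first-order gradient of the loss
    hess = 2.0 * np.ones_like(preds)  # h_i: second-order gradient of the loss
    return grad, hess

# Illustrative data; any small regression set would do.
X = np.random.rand(100, 3)
y = X @ np.array([1.0, -2.0, 0.5])
dtrain = xgb.DMatrix(X, label=y)

booster = xgb.train({"max_depth": 3, "eta": 0.3}, dtrain,
                    num_boost_round=10, obj=squared_error_obj)
```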
Regularized Learning Objective
By expansion (dropping constant terms and writing out $\Omega$), the objective summed for each instance is
$\tilde{\mathcal{L}}^{(t)} = \sum_i \big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \big] + \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$
and, regrouped for each leaf $j$ over each instance in the leaf ($I_j = \{\, i \mid q(x_i) = j \,\}$),
$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T} \Big[ \big(\textstyle\sum_{i \in I_j} g_i\big) w_j + \tfrac{1}{2} \big(\textstyle\sum_{i \in I_j} h_i + \lambda\big) w_j^2 \Big] + \gamma T$
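A small numeric check of this regrouping, with made-up gradients, leaf assignments and leaf weights (none of these values come from the slides):

```python
import numpy as np

g = np.array([0.4, -0.2, 0.3, 0.1])   # g_i for each instance
h = np.array([1.0, 1.0, 1.0, 1.0])    # h_i for each instance
leaf = np.array([0, 0, 1, 1])          # q(x_i): the leaf each instance falls into
w = np.array([0.25, -0.4])             # candidate leaf weights w_j (T = 2 leaves)
lam = 1.0

# Per-instance form: sum_i [ g_i * f_t(x_i) + 0.5 * h_i * f_t(x_i)^2 ]
per_instance = np.sum(g * w[leaf] + 0.5 * h * w[leaf] ** 2)

# Per-leaf form: sum_j [ G_j * w_j + 0.5 * (H_j + lambda) * w_j^2 ]
G = np.array([g[leaf == j].sum() for j in range(2)])
H = np.array([h[leaf == j].sum() for j in range(2)])
per_leaf = np.sum(G * w + 0.5 * (H + lam) * w ** 2)

# The two forms agree once the L2 term 0.5 * lambda * sum_j w_j^2 is added
# on the per-instance side (the gamma * T term is identical in both).
assert np.isclose(per_instance + 0.5 * lam * np.sum(w ** 2), per_leaf)
```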
Regularized Learning Objective
Optimal leaf weight for a fixed tree structure: $w_j^{*} = -\dfrac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$.
By substitution, the score of a structure $q$ is $\tilde{\mathcal{L}}^{(t)}(q) = -\tfrac{1}{2} \sum_{j=1}^{T} \dfrac{\big(\sum_{i \in I_j} g_i\big)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$.
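A tiny worked example of both formulas for a single leaf, with made-up gradient statistics (the lambda and gamma values are illustrative):

```python
import numpy as np

lam, gamma = 1.0, 0.1             # lambda and gamma, chosen only for illustration

# Gradient statistics of the instances that fall into one leaf.
g = np.array([0.4, -0.2, 0.3])    # first-order gradients g_i
h = np.array([1.0, 1.0, 1.0])     # second-order gradients h_i
G, H = g.sum(), h.sum()

w_star = -G / (H + lam)                 # optimal weight for this leaf
leaf_term = -0.5 * G ** 2 / (H + lam)   # this leaf's contribution to the score

# For a tree with T such leaves the structure score adds gamma * T; here T = 1.
print(w_star, leaf_term + gamma * 1)
```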
Gradient Tree Boosting
The gain of splitting a node with instance set $I$ into left and right children $I_L$ and $I_R$ is
$\mathcal{L}_{split} = \tfrac{1}{2} \Big[ \dfrac{\big(\sum_{i \in I_L} g_i\big)^2}{\sum_{i \in I_L} h_i + \lambda} + \dfrac{\big(\sum_{i \in I_R} g_i\big)^2}{\sum_{i \in I_R} h_i + \lambda} - \dfrac{\big(\sum_{i \in I} g_i\big)^2}{\sum_{i \in I} h_i + \lambda} \Big] - \gamma$
i.e. the left split score plus the right split score, minus the score before we split, minus the split penalty $\gamma$.
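A sketch of how this gain drives split finding: the exact greedy approach scans the sorted values of a feature and evaluates every candidate threshold with the formula above (the feature values, gradients and regularization constants below are invented):

```python
import numpy as np

def split_gain(g, h, left_mask, lam=1.0, gamma=0.1):
    """Gain of splitting instance set I into I_L (left_mask) and I_R (the rest)."""
    def score(gs, hs):
        return gs.sum() ** 2 / (hs.sum() + lam)
    left = score(g[left_mask], h[left_mask])      # left split
    right = score(g[~left_mask], h[~left_mask])   # right split
    before = score(g, h)                          # before we split
    return 0.5 * (left + right - before) - gamma  # minus the split penalty

# Exhaustive (exact greedy) search over one feature.
x = np.array([1.0, 2.0, 3.0, 4.0])        # feature values, already sorted
g = np.array([0.5, 0.4, -0.3, -0.6])      # gradients g_i
h = np.ones_like(g)                        # hessians h_i

best_gain, best_t = max((split_gain(g, h, x < t), t) for t in x[1:])
print("best gain %.3f at threshold %.1f" % (best_gain, best_t))
```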
Gradient Tree Boosting
Optimizations
• Shrinkage
  - More trees
• Column subsampling
  - Prevents over-fitting
• Approximate split finding
  - Faster AUC convergence
• Sparsity-aware split finding
  - Visit only non-missing values
• Cache-aware parallel column block access
  - Fewer misses on large datasets
• Block compression and sharding
  - Faster I/O for out-of-core computation
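Several of these optimizations surface directly as training parameters; below is a hedged sketch of where they show up in the Python API (the data and parameter values are illustrative, not recommendations from the slides):

```python
import numpy as np
import xgboost as xgb

# Sparsity-aware split finding: missing entries get a learned default direction,
# so data with NaNs can be passed in directly.
X = np.array([[1.0, np.nan], [2.0, 0.5], [np.nan, 1.5], [3.0, 2.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
dtrain = xgb.DMatrix(X, label=y, missing=np.nan)

params = {
    "eta": 0.1,               # shrinkage: smaller steps, usually paired with more trees
    "colsample_bytree": 0.8,  # column subsampling, helps prevent over-fitting
    "tree_method": "approx",  # approximate split finding on quantile sketches
    "max_depth": 3,
    "objective": "reg:squarederror",
}

# Cache-aware column-block access, block compression and sharding are internal to
# the implementation; out-of-core training is reached through external-memory data
# loading rather than a flag here.
booster = xgb.train(params, dtrain, num_boost_round=50)
```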
Further reading
• The paper:
  https://arxiv.org/pdf/1603.02754.pdf
• XGBoost tutorial:
  http://xgboost.readthedocs.io/en/latest/model.html
• A great deck of slides:
  https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
• A simple usage example:
  https://www.kaggle.com/kevalm/xgboost-implementation-on-iris-dataset-python
• DataCamp mini-course:
  https://campus.datacamp.com/courses/extreme-gradient-boosting-with-xgboost
