Introduction to Boosted Trees
Tianqi Chen
Oct. 22 2014
Outline
• Review of key concepts of supervised learning
• Regression Tree and Ensemble (What are we Learning)
• Gradient Boosting (How do we Learn)
• Summary
Elements in Supervised Learning
• Notations: x_i ∈ R^d denotes the i-th training example
• Model: how to make the prediction ŷ_i given x_i
 Linear model: ŷ_i = Σ_j w_j x_ij (includes both linear and logistic regression)
 The prediction score ŷ_i can have different interpretations
depending on the task
 Linear regression: ŷ_i is the predicted score
 Logistic regression: 1 / (1 + exp(−ŷ_i)) is the predicted probability
of the instance being positive
 Others… for example, in ranking ŷ_i can be the rank score
• Parameters: the things we need to learn from the data
 Linear model: Θ = { w_j | j = 1, …, d }
Elements continued: Objective Function
• The objective function that appears everywhere: Obj(Θ) = L(Θ) + Ω(Θ)
• Loss on training data: L = Σ_i l(y_i, ŷ_i)
 Square loss: l(y_i, ŷ_i) = (y_i − ŷ_i)²
 Logistic loss: l(y_i, ŷ_i) = y_i ln(1 + e^(−ŷ_i)) + (1 − y_i) ln(1 + e^(ŷ_i))
• Regularization: how complicated is the model?
 L2 norm: Ω(w) = λ ||w||²
 L1 norm (lasso): Ω(w) = λ ||w||₁
Training loss measures how
well the model fits the training data;
regularization measures the
complexity of the model.
Putting known knowledge into context
• Ridge regression:
 Linear model, square loss, L2 regularization
• Lasso:
 Linear model, square loss, L1 regularization
• Logistic regression:
 Linear model, logistic loss, L2 regularization
• The conceptual separation between model, parameters, and
objective also gives you engineering benefits.
 Think of how you can implement SGD for both ridge regression
and logistic regression (a minimal sketch follows below)
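As a minimal sketch of this point (not code from the slides; the function names and hyper-parameters are illustrative), the same SGD routine serves both models once the loss gradient is factored out:

```python
import numpy as np

def grad_square(y, score):
    """Gradient of the square loss (y - score)^2 with respect to the score."""
    return 2.0 * (score - y)

def grad_logistic(y, score):
    """Gradient of the logistic loss with respect to the score, labels y in {0, 1}."""
    return 1.0 / (1.0 + np.exp(-score)) - y

def sgd(X, y, grad_loss, lam=1.0, lr=0.01, epochs=10):
    """One SGD routine for a linear model; only the loss gradient changes
    between ridge regression and L2-regularized logistic regression."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in np.random.permutation(n):
            score = X[i] @ w
            # per-example gradient of  l(y_i, x_i . w) + (lam / n) * ||w||^2
            w -= lr * (grad_loss(y[i], score) * X[i] + 2.0 * lam / n * w)
    return w

# ridge regression:    w = sgd(X, y, grad_square)
# logistic regression: w = sgd(X, y, grad_logistic)
```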
Objective and Bias Variance Trade-off
• Why do we want two components in the objective?
• Optimizing the training loss encourages predictive models
 Fitting the training data well at least gets you close to the training
distribution, which is hopefully close to the underlying distribution
• Optimizing the regularization encourages simple models
 Simpler models tend to have smaller variance in future
predictions, making the predictions stable
Training loss measures how
well the model fits the training data;
regularization measures the
complexity of the model.
Outline
• Review of key concepts of supervised learning
• Regression Tree and Ensemble (What are we Learning)
• Gradient Boosting (How do we Learn)
• Summary
Regression Tree (CART)
• Regression tree (also known as classification and regression
tree, CART):
 Decision rules are the same as in a decision tree
 Each leaf contains one score
• Example: does the person like computer games?
 Input: age, gender, occupation, …
 Tree: split on "age < 15?"; if Y, split on "is male?" (Y → +2, N → +0.1); if N → −1
 The prediction score lives in each leaf
Regression Tree Ensemble
 tree1: split on "age < 15?"; Y → split on "is male?" (Y → +2, N → +0.1); N → −1
 tree2: split on "uses computer daily?"; Y → +0.9, N → −0.9
 Example: f(first instance) = 2 + 0.9 = 2.9, f(second instance) = −1 + 0.9 = −0.1
• The prediction for an instance is the sum of the scores predicted by each tree
(a toy sketch follows below)
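A toy sketch (not from the slides) that mirrors the two example trees; the nested-dict tree format is a hypothetical choice for illustration only:

```python
# Toy tree format: internal nodes test "x[feature] < threshold" ("yes" branch),
# leaves hold a score. The two trees mirror the slide's example ensemble.
tree1 = {"feature": "age", "threshold": 15,
         "yes": {"feature": "is_male", "threshold": 0.5,
                 "yes": {"leaf": 0.1},      # age < 15 and not male
                 "no":  {"leaf": 2.0}},     # age < 15 and male
         "no": {"leaf": -1.0}}              # age >= 15
tree2 = {"feature": "uses_computer_daily", "threshold": 0.5,
         "yes": {"leaf": -0.9},             # does not use the computer daily
         "no":  {"leaf": 0.9}}              # uses the computer daily

def tree_score(node, x):
    """Walk one regression tree down to a leaf and return its score."""
    while "leaf" not in node:
        node = node["yes"] if x[node["feature"]] < node["threshold"] else node["no"]
    return node["leaf"]

def ensemble_predict(trees, x):
    """The ensemble prediction is simply the sum of the per-tree scores."""
    return sum(tree_score(t, x) for t in trees)

boy = {"age": 10, "is_male": 1, "uses_computer_daily": 1}
print(ensemble_predict([tree1, tree2], boy))   # 2.0 + 0.9 = 2.9
```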
Tree Ensemble methods
• Very widely used; look for GBM, random forest, …
 Almost half of data mining competitions are won using some
variant of tree ensemble methods
• Invariant to scaling of the inputs, so you do not need to do careful
feature normalization.
• Learn higher-order interactions between features.
• Scalable, and used in industry
Put into context: Model and Parameters
• Model: assuming we have K trees, ŷ_i = Σ_{k=1}^K f_k(x_i), with f_k ∈ F,
the space of functions containing all regression trees
Think: a regression tree is a function that maps the attributes to a score
• Parameters
 Include the structure of each tree, and the scores in the leaves
 Or simply use the functions as parameters: Θ = {f_1, f_2, …, f_K}
 Instead of learning weights in R^d, we are learning functions (trees)
Learning a tree on a single variable
• How can we learn functions?
• Define the objective (loss, regularization), and optimize it!!
• Example:
 Consider a regression tree on a single input t (time)
 I want to predict whether I like romantic music at time t
 Tree: split on t < 2011/03/01, then on t < 2010/03/20, with leaf scores
0.2, 1.2, and 1.0
 Equivalently, the model is a regression tree that splits on time:
a piecewise step function over time
Learning a step function
• Things we need to learn: the splitting positions, and the height in each segment
• Objective for a single-variable regression tree (step function)
 Training loss: how well does the function fit the points?
 Regularization: how do we define the complexity of the function?
 Number of splitting points, L2 norm of the heights of the segments?
Learning a step function (visually)
Coming back: Objective for Tree Ensemble
• Model: assuming we have K trees, ŷ_i = Σ_{k=1}^K f_k(x_i)
• Objective: Obj = Σ_i l(y_i, ŷ_i) + Σ_k Ω(f_k)
 (training loss + complexity of the trees)
• Possible ways to define Ω(f)?
 Number of nodes in the tree, depth
 L2 norm of the leaf weights
 … detailed later
Objective vs Heuristic
• When you talk about (decision) trees, it is usually heuristics
 Split by information gain
 Prune the tree
 Maximum depth
 Smooth the leaf values
• Most heuristics map well to objectives; taking the formal
(objective) view lets us know what we are learning
 Information gain -> training loss
 Pruning -> regularization defined by #nodes
 Max depth -> constraint on the function space
 Smoothing leaf values -> L2 regularization on leaf weights
Regression Tree is not just for regression!
• Regression tree ensemble defines how you make the
prediction score, it can be used for
 Classification, Regression, Ranking….
 ….
• It all depends on how you define the objective function!
• So far we have learned:
 Using square loss
 will result in the common gradient boosted machine (GBM)
 Using logistic loss
 will result in LogitBoost
Take Home Message for this section
• Bias-variance tradeoff is everywhere
• The loss + regularization objective pattern applies to
regression tree learning (function learning)
• We want predictive and simple functions
• This defines what we want to learn (objective, model).
• But how do we learn it?
 Next section
Outline
• Review of key concepts of supervised learning
• Regression Tree and Ensemble (What are we Learning)
• Gradient Boosting (How do we Learn)
• Summary
So How do we Learn?
• Objective: Obj = Σ_i l(y_i, ŷ_i) + Σ_k Ω(f_k)
• We cannot use methods such as SGD to find f (since f are
trees, instead of just numerical vectors)
• Solution: additive training (boosting)
 Start from a constant prediction, and add a new function each time:
 ŷ_i^(t) = ŷ_i^(t−1) + f_t(x_i), where ŷ_i^(t) is the model at training round t,
f_t(x_i) is the new function, and ŷ_i^(t−1) keeps the functions added in previous rounds
Additive Training
• How do we decide which f to add?
 Optimize the objective!!
• The prediction at round t is ŷ_i^(t) = ŷ_i^(t−1) + f_t(x_i);
f_t is what we need to decide in round t, and the goal is to find the f_t that minimizes the objective
• Consider square loss: after expanding it, the term (y_i − ŷ_i^(t−1)),
usually called the residual from the previous round, appears (see the expansion below)
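For reference, the formulas on this slide (rendered as images in the original) take the standard additive-training form:

```latex
\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)

\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\!\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \mathrm{const}

% with square loss:
\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} \left[ 2\left(\hat{y}_i^{(t-1)} - y_i\right) f_t(x_i) + f_t(x_i)^2 \right] + \Omega(f_t) + \mathrm{const}
```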
Taylor Expansion Approximation of Loss
• Goal: minimize Obj^(t); it still seems complicated except for the case of square loss
• Take a second-order Taylor expansion of the objective (see below)
 Recall the Taylor expansion f(x + Δx) ≈ f(x) + f′(x)Δx + ½ f″(x)Δx²
 Define g_i and h_i as the first and second derivatives of the loss
• If you are not comfortable with this, think of square loss
• Compare what we get to the previous slide
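In symbols, the standard second-order expansion, with g_i and h_i as defined on the slide:

```latex
g_i = \partial_{\hat{y}^{(t-1)}}\, l\!\left(y_i, \hat{y}^{(t-1)}\right), \qquad
h_i = \partial^{2}_{\hat{y}^{(t-1)}}\, l\!\left(y_i, \hat{y}^{(t-1)}\right)

\mathrm{Obj}^{(t)} \simeq \sum_{i=1}^{n} \left[ l\!\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t)

% for square loss: g_i = 2(\hat{y}_i^{(t-1)} - y_i), \quad h_i = 2
```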
Our New Goal
• Objective, with constants removed:
 Obj^(t) ≈ Σ_i [ g_i f_t(x_i) + ½ h_i f_t(x_i)² ] + Ω(f_t),
 where g_i and h_i are the first- and second-order gradient statistics of the loss
• Why spend so much effort deriving the objective, why not
just grow trees heuristically?
 Theoretical benefit: we know what we are learning, and about convergence
 Engineering benefit: recall the elements of supervised learning
 g_i and h_i come from the definition of the loss function
 The learning of the function f_t depends on the objective only via g_i and h_i
 Think of how you can separate the modules of your code when you
are asked to implement boosted trees for both square loss and
logistic loss (a sketch follows below)
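As a hedged illustration of that modularity (assumed structure, not the slides' code): the tree learner only ever consumes (g_i, h_i), so swapping the loss touches one small module:

```python
import numpy as np

# The tree learner only ever consumes the per-instance gradient statistics
# (g_i, h_i); swapping the loss is therefore isolated to one small module.

def square_loss_grads(y, pred):
    """g_i, h_i for square loss l = (y - pred)^2."""
    return 2.0 * (pred - y), 2.0 * np.ones_like(pred)

def logistic_loss_grads(y, pred):
    """g_i, h_i for logistic loss with labels y in {0, 1}; pred is the raw score."""
    p = 1.0 / (1.0 + np.exp(-pred))
    return p - y, p * (1.0 - p)

# e.g. g, h = square_loss_grads(y, pred)  or  logistic_loss_grads(y, pred);
# the same tree-growing code (see the split-finding sketch later) uses either.
```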
Refine the definition of a tree
• We define a tree by a vector of scores in its leaves, and a leaf index
mapping function that maps an instance to a leaf (formalized below)
• Example (the same tree as before, splitting on "age < 15?" and then "is male?"):
 the three leaves carry weights w1 = +2, w2 = 0.1, w3 = −1
 q(·) returns the leaf index an instance falls into (e.g. q(x) = 1 or q(x) = 3)
 q encodes the structure of the tree, and w is the vector of leaf weights
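In symbols, the standard form of this definition (the rendered formula did not survive extraction), where T is the number of leaves:

```latex
f_t(x) = w_{q(x)}, \qquad w \in \mathbb{R}^{T}, \qquad q : \mathbb{R}^{d} \to \{1, 2, \dots, T\}
```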
Define the Complexity of a Tree
• Define the complexity in terms of the number of leaves and the L2 norm
of the leaf scores (this is not the only possible definition; see below)
• In the example tree (w1 = +2, w2 = 0.1, w3 = −1) there are three leaves
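The complexity used here, in its standard form, together with its value on the example tree:

```latex
\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2}
\quad\Rightarrow\quad
\Omega = 3\gamma + \tfrac{1}{2}\lambda\,(4 + 0.01 + 1) \ \text{ for the example tree}
```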
Revisit the Objectives
• Define the instance set in leaf j as I_j = { i : q(x_i) = j }
• Regroup the objective by leaf (see the derivation below)
• This is a sum of T independent quadratic functions of the leaf weights w_j
The Structure Score
• Two facts about a single-variable quadratic function: G·x + ½·H·x² is
minimized at x* = −G/H, with minimum value −½·G²/H (for H > 0)
• Let us define G_j = Σ_{i∈I_j} g_i and H_j = Σ_{i∈I_j} h_i
• Assume the structure of the tree (q(x)) is fixed; then the optimal
weight in each leaf, and the resulting objective value, are as written below.
This measures how good a tree structure is!
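Written out, the regrouped objective and its closed-form solution take the standard form:

```latex
\mathrm{Obj}^{(t)} \simeq \sum_{j=1}^{T} \left[ G_j w_j + \tfrac{1}{2}\left(H_j + \lambda\right) w_j^{2} \right] + \gamma T,
\qquad G_j = \sum_{i \in I_j} g_i,\quad H_j = \sum_{i \in I_j} h_i

w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad
\mathrm{Obj}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T
```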
The Structure Score Calculation
• Using the same example tree (split on "age < 15?", then "is male?"):
each instance i = 1, …, 5 carries its gradient statistics (g_i, h_i);
the instances falling into the same leaf are grouped, their g_i and h_i are
summed into G_j and H_j per leaf, and the structure score Obj* is computed from those sums
• The smaller the score is, the better the structure is
Searching Algorithm for Single Tree
• Enumerate the possible tree structures q
• Calculate the structure score for each q, using the scoring equation
• Find the best tree structure, and use the optimal leaf weights
• But… there can be infinitely many possible tree structures…
Greedy Learning of the Tree
• In practice, we grow the tree greedily
 Start from a tree with depth 0
 For each leaf node of the tree, try to add a split. The change of
objective after adding the split is the gain (written out below):
the score of the left child, plus the score of the right child,
minus the score if we do not split, minus the complexity cost of
introducing an additional leaf
 Remaining question: how do we find the best split?
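The gain of a candidate split, in its standard form:

```latex
\mathrm{Gain} = \frac{1}{2} \left[ \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda} \right] - \gamma
```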
Efficient Finding of the Best Split
• What is the gain of a split rule x_j < a? Say x_j is age
• All we need are the sums of g and h on each side of the split point a;
then we evaluate the gain formula above
• A left-to-right linear scan over the instances sorted by the feature value
is enough to decide the best split along that feature
 (Figure: instances sorted by age, carrying (g1, h1), (g4, h4), (g2, h2), (g5, h5), (g3, h3),
with a candidate split point a)
An Algorithm for Split Finding
• For each node, enumerate over all features
 For each feature, sort the instances by feature value
 Use a linear scan to decide the best split along that feature
(a per-feature sketch follows this list)
 Take the best split solution over all the features
• Time complexity of growing a tree of depth K
 It is O(n d K log n): for each level we need O(n log n) time to sort;
there are d features, and we need to do it for K levels
 This can be further optimized (e.g. by approximation or by caching
the sorted features)
 Can scale to very large datasets
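A minimal sketch of the per-feature linear scan (assumptions: one dense feature, the gain formula above, and illustrative parameter names lam and gamma):

```python
import numpy as np

def best_split_one_feature(x, g, h, lam=1.0, gamma=0.0):
    """Exact greedy split search along one feature.

    x, g, h: 1-D arrays of feature values and gradient statistics.
    Returns (best_gain, best_threshold); the gain can be negative when
    no split beats the complexity cost gamma.
    """
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()

    def score(gs, hs):
        return gs * gs / (hs + lam)

    GL = HL = 0.0
    best_gain, best_thr = -np.inf, None
    for i in range(len(x) - 1):
        GL += g[i]
        HL += h[i]
        if x[i] == x[i + 1]:          # cannot split between equal feature values
            continue
        GR, HR = G - GL, H - HL
        gain = 0.5 * (score(GL, HL) + score(GR, HR) - score(G, H)) - gamma
        if gain > best_gain:
            best_gain, best_thr = gain, 0.5 * (x[i] + x[i + 1])
    return best_gain, best_thr
```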
What about Categorical Variables?
• Some tree learning algorithms handle categorical and
continuous variables separately
 We can easily use the scoring formula we derived to score splits
based on categorical variables.
• Actually, it is not necessary to handle categorical variables separately.
 We can encode the categorical variables into a numerical vector
using one-hot encoding: allocate a vector whose length is the number
of categories (a small sketch follows below)
 The vector will be sparse if there are many categories, so the
learning algorithm should preferably handle sparse data
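A tiny illustrative sketch of one-hot encoding (not from the slides; a real pipeline would build a sparse matrix rather than a dense one):

```python
import numpy as np

def one_hot(values):
    """Encode a categorical column as an indicator matrix (dense here for brevity)."""
    categories = sorted(set(values))
    index = {c: j for j, c in enumerate(categories)}
    out = np.zeros((len(values), len(categories)))
    for i, v in enumerate(values):
        out[i, index[v]] = 1.0
    return out, categories

X_cat, cats = one_hot(["red", "green", "red", "blue"])
# X_cat has one column per category; each row has exactly one non-zero entry.
```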
Pruning and Regularization
• Recall the gain of a split: it can be negative!
 When the training loss reduction is smaller than the regularization
 Trade-off between simplicity and predictiveness
• Pre-stopping
 Stop splitting if the best split has negative gain
 But maybe a split can benefit future splits…
• Post-pruning
 Grow a tree to maximum depth, then recursively prune all the leaf
splits with negative gain
Recap: Boosted Tree Algorithm
• Add a new tree in each iteration
• At the beginning of each iteration, calculate the gradient statistics g_i and h_i
• Use the statistics to greedily grow a tree f_t
• Add f_t to the model: ŷ_i^(t) = ŷ_i^(t−1) + f_t(x_i)
 Usually, instead, we do ŷ_i^(t) = ŷ_i^(t−1) + ε f_t(x_i)
 ε is called the step-size or shrinkage, and is usually set around 0.1
 This means we do not do the full optimization in each step and
reserve some capacity for future rounds; this helps prevent overfitting
(a toy end-to-end sketch follows below)
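Putting the pieces together, a hedged end-to-end sketch of the loop; to stay self-contained it fits a depth-0 "tree" (a single leaf with the optimal weight), where a real implementation would grow a full tree with the split search sketched above:

```python
import numpy as np

def fit_leaf(g, h, lam=1.0):
    """Depth-0 'tree': a single leaf with the optimal weight w* = -G / (H + lambda).
    A real implementation would grow a full tree using the split search above."""
    return -g.sum() / (h.sum() + lam)

def boost(X, y, grad_fn, rounds=10, eps=0.1):
    """Additive training: each round fits a new function to (g, h) and adds
    it to the model with shrinkage eps."""
    pred = np.zeros_like(y, dtype=float)
    model = []
    for _ in range(rounds):
        g, h = grad_fn(y, pred)   # gradient statistics at the current prediction
        w = fit_leaf(g, h)        # new function f_t (here just one leaf weight)
        model.append(w)
        pred += eps * w           # shrinkage: do not fully optimize each step
    return model, pred

# e.g. with the loss module defined earlier:
# model, pred = boost(X, y, square_loss_grads)
```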
Outline
• Review of key concepts of supervised learning
• Regression Tree and Ensemble (What are we Learning)
• Gradient Boosting (How do we Learn)
• Summary
Questions to check if you really get it
• How can we build a boosted tree classifier for a weighted
regression problem, such that each instance has an
importance weight?
• Back to the time series problem: if I want to learn step
functions over time, are there other ways to learn the time
splits, other than the top-down split approach?
Questions to check if you really get it
• How can we build a boosted tree classifier for a weighted
regression problem, such that each instance has an
importance weight?
 Define the objective, calculate g_i and h_i, and feed them to the same tree
learning algorithm we already have for the un-weighted version
 Again, think of the separation of model and objective, and how the
theory can help better organize the machine learning toolkit
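As one concrete way to see it (a standard derivation, not spelled out on the slide): with a per-instance weight a_i on square loss, only the gradient statistics change:

```latex
l_i = a_i \left(y_i - \hat{y}_i\right)^{2}
\;\Rightarrow\;
g_i = 2\, a_i \left(\hat{y}_i^{(t-1)} - y_i\right), \qquad h_i = 2\, a_i
```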
Questions to check if you really get it
• Time series problem
• All that is important is the structure score of the splits
 Top-down greedy, same as trees
 Bottom-up greedy: start with each individual point as its own group, and
greedily merge neighbors
 Dynamic programming: can find the optimal solution in this case
Summary
• The separation between model, objective, parameters can be
helpful for us to understand and customize learning models
• The bias-variance trade-off applies everywhere, including
learning in functional space
• We can be formal about what we learn and how we learn.
Clear understanding of theory can be used to guide cleaner
implementation.
Reference
• Greedy Function Approximation: A Gradient Boosting Machine. J.H. Friedman
 The first paper about gradient boosting
• Stochastic Gradient Boosting. J.H. Friedman
 Introduces the bagging trick to gradient boosting
• The Elements of Statistical Learning. T. Hastie, R. Tibshirani, and J.H. Friedman
 Contains a chapter about gradient boosting
• Additive Logistic Regression: A Statistical View of Boosting. J.H. Friedman, T. Hastie, and R. Tibshirani
 Uses second-order statistics for tree splitting, which is closer to the view presented in these slides
• Learning Nonlinear Functions Using Regularized Greedy Forest. R. Johnson and T. Zhang
 Proposes a fully corrective step, as well as regularizing the tree complexity. The regularization trick
is closely related to the view presented in these slides
• Software implementing the model described in these slides: https://github.com/tqchen/xgboost
