SlideShare a Scribd company logo
XGBoost: A Scalable Tree Boosting System
Simon Lia-Jonassen
Motivation
 Used by majority of winning solutions on
Kaggle, 2nd most popular method after DNN.
 Also used by 10 best teams in KDDCup’15.
 Applies to classification, regression and
learning-to-rank tasks.
 Usually outperforms alternatives in an
out-of-the-box setting.
 Combines a good theoretical foundation and
a highly efficient implementation.
 So, how does it work?
Decision Tree Boosting
Number of trees Tree function,
maps to a set of leaf weights
Instance features
Regularized Learning Objective
Prediction loss Complexity penalty
Number of leaves L2 regularization on
leaves weights
Regularized Learning Objective
First order gradient
of the loss function
Second order gradient
of the loss function
By additive definition
Where:
However, for example:
Regularized Learning Objective
By expansion:
For each
instance
For each leaf For each
instance
in the leaf
Regularized Learning Objective
Optimal leaf weight for a fixed structure:
By substitution:
Gradient Tree Boosting
Before
we split
Left
split
Right
split
Split
penalty
Gradient Tree Boosting
Optimizations
 Shrinkage
 More trees
 Column subsampling
 Prevents over-fitting
 Approximate split finding
 Faster AUC convergence
 Sparsity-aware split finding
 Visit only non-missing values
 Cache-aware parallel column block
access
 Fewer misses on large datasets
 Block compression and sharding
 Faster I/O for out-of-core computation
Optimizations
 Shrinkage
 More trees
 Column subsampling
 Prevents over-fitting
 Approximate split finding
 Faster AUC convergence
 Sparsity-aware split finding
 Visit only non-missing values
 Cache-aware parallel column block
access
 Fewer misses on large datasets
 Block compression and sharding
 Faster I/O for out-of-core computation
Optimizations
 Shrinkage
 More trees
 Column subsampling
 Prevents over-fitting
 Approximate split finding
 Faster AUC convergence
 Sparsity-aware split finding
 Visit only non-missing values
 Cache-aware parallel column block
access
 Fewer misses on large datasets
 Block compression and sharding
 Faster I/O for out-of-core computation
Optimizations
 Shrinkage
 More trees
 Column subsampling
 Prevents over-fitting
 Approximate split finding
 Faster AUC convergence
 Sparsity-aware split finding
 Visit only non-missing values
 Cache-aware parallel column block
access
 Fewer misses on large datasets
 Block compression and sharding
 Faster I/O for out-of-core computation
Optimizations
 Shrinkage
 More trees
 Column subsampling
 Prevents over-fitting
 Approximate split finding
 Faster AUC convergence
 Sparsity-aware split finding
 Visit only non-missing values
 Cache-aware parallel column block
access
 Fewer misses on large datasets
 Block compression and sharding
 Faster I/O for out-of-core computation
Optimizations
Further reading
 The paper:
 https://arxiv.org/pdf/1603.02754.pdf
 XGBoost tutorial:
 http://xgboost.readthedocs.io/en/latest/model.html
 A great deck of slides:
 https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
 A simple usage example:
 https://www.kaggle.com/kevalm/xgboost-implementation-on-iris-dataset-python
 DataCamp mini-course:
 https://campus.datacamp.com/courses/extreme-gradient-boosting-with-xgboost

More Related Content

What's hot

Demystifying Xgboost
Demystifying XgboostDemystifying Xgboost
Demystifying Xgboost
halifaxchester
 
Xgboost
XgboostXgboost
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive models
Gabriel Moreira
 
Winning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangWinning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen Zhang
Vivian S. Zhang
 
decision tree regression
decision tree regressiondecision tree regression
decision tree regression
Akhilesh Joshi
 
Kaggleのテクニック
KaggleのテクニックKaggleのテクニック
Kaggleのテクニック
Yasunori Ozaki
 
科学と機械学習のあいだ:変量の設計・変換・選択・交互作用・線形性
科学と機械学習のあいだ:変量の設計・変換・選択・交互作用・線形性科学と機械学習のあいだ:変量の設計・変換・選択・交互作用・線形性
科学と機械学習のあいだ:変量の設計・変換・選択・交互作用・線形性
Ichigaku Takigawa
 
実践多クラス分類 Kaggle Ottoから学んだこと
実践多クラス分類 Kaggle Ottoから学んだこと実践多クラス分類 Kaggle Ottoから学んだこと
実践多クラス分類 Kaggle Ottoから学んだこと
nishio
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
HJ van Veen
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine Learning
safa cimenli
 
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Parth Khare
 
Machine Learning ppt
Machine Learning pptMachine Learning ppt
Machine Learning ppt
Student Conscious Club
 
Overview of tree algorithms from decision tree to xgboost
Overview of tree algorithms from decision tree to xgboostOverview of tree algorithms from decision tree to xgboost
Overview of tree algorithms from decision tree to xgboost
Takami Sato
 
Lecture6 - C4.5
Lecture6 - C4.5Lecture6 - C4.5
Lecture6 - C4.5
Albert Orriols-Puig
 
Automatic Machine Learning, AutoML
Automatic Machine Learning, AutoMLAutomatic Machine Learning, AutoML
Automatic Machine Learning, AutoML
Himadri Mishra
 
AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)
Joaquin Vanschoren
 
Understanding Bagging and Boosting
Understanding Bagging and BoostingUnderstanding Bagging and Boosting
Understanding Bagging and Boosting
Mohit Rajput
 
Unified Approach to Interpret Machine Learning Model: SHAP + LIME
Unified Approach to Interpret Machine Learning Model: SHAP + LIMEUnified Approach to Interpret Machine Learning Model: SHAP + LIME
Unified Approach to Interpret Machine Learning Model: SHAP + LIME
Databricks
 
勾配降下法の 最適化アルゴリズム
勾配降下法の最適化アルゴリズム勾配降下法の最適化アルゴリズム
勾配降下法の 最適化アルゴリズム
nishio
 
Boosting - An Ensemble Machine Learning Method
Boosting - An Ensemble Machine Learning MethodBoosting - An Ensemble Machine Learning Method
Boosting - An Ensemble Machine Learning Method
Kirkwood Donavin
 

What's hot (20)

Demystifying Xgboost
Demystifying XgboostDemystifying Xgboost
Demystifying Xgboost
 
Xgboost
XgboostXgboost
Xgboost
 
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive models
 
Winning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangWinning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen Zhang
 
decision tree regression
decision tree regressiondecision tree regression
decision tree regression
 
Kaggleのテクニック
KaggleのテクニックKaggleのテクニック
Kaggleのテクニック
 
科学と機械学習のあいだ:変量の設計・変換・選択・交互作用・線形性
科学と機械学習のあいだ:変量の設計・変換・選択・交互作用・線形性科学と機械学習のあいだ:変量の設計・変換・選択・交互作用・線形性
科学と機械学習のあいだ:変量の設計・変換・選択・交互作用・線形性
 
実践多クラス分類 Kaggle Ottoから学んだこと
実践多クラス分類 Kaggle Ottoから学んだこと実践多クラス分類 Kaggle Ottoから学んだこと
実践多クラス分類 Kaggle Ottoから学んだこと
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine Learning
 
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
 
Machine Learning ppt
Machine Learning pptMachine Learning ppt
Machine Learning ppt
 
Overview of tree algorithms from decision tree to xgboost
Overview of tree algorithms from decision tree to xgboostOverview of tree algorithms from decision tree to xgboost
Overview of tree algorithms from decision tree to xgboost
 
Lecture6 - C4.5
Lecture6 - C4.5Lecture6 - C4.5
Lecture6 - C4.5
 
Automatic Machine Learning, AutoML
Automatic Machine Learning, AutoMLAutomatic Machine Learning, AutoML
Automatic Machine Learning, AutoML
 
AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)
 
Understanding Bagging and Boosting
Understanding Bagging and BoostingUnderstanding Bagging and Boosting
Understanding Bagging and Boosting
 
Unified Approach to Interpret Machine Learning Model: SHAP + LIME
Unified Approach to Interpret Machine Learning Model: SHAP + LIMEUnified Approach to Interpret Machine Learning Model: SHAP + LIME
Unified Approach to Interpret Machine Learning Model: SHAP + LIME
 
勾配降下法の 最適化アルゴリズム
勾配降下法の最適化アルゴリズム勾配降下法の最適化アルゴリズム
勾配降下法の 最適化アルゴリズム
 
Boosting - An Ensemble Machine Learning Method
Boosting - An Ensemble Machine Learning MethodBoosting - An Ensemble Machine Learning Method
Boosting - An Ensemble Machine Learning Method
 

Similar to Xgboost: A Scalable Tree Boosting System - Explained

Machine Learning Unit-5 Decesion Trees & Random Forest.pdf
Machine Learning Unit-5 Decesion Trees & Random Forest.pdfMachine Learning Unit-5 Decesion Trees & Random Forest.pdf
Machine Learning Unit-5 Decesion Trees & Random Forest.pdf
AdityaSoraut
 
Data analysis01 singlevariable
Data analysis01 singlevariableData analysis01 singlevariable
Data analysis01 singlevariable
Universidade de São Paulo
 
Optimising Xapian
Optimising XapianOptimising Xapian
Optimising Xapian
Richard Boulton
 
Chap8 basic cluster_analysis
Chap8 basic cluster_analysisChap8 basic cluster_analysis
Chap8 basic cluster_analysis
guru_prasadg
 
Data Mining in Market Research
Data Mining in Market ResearchData Mining in Market Research
Data Mining in Market Research
butest
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Research
kevinlan
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Research
jim
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests
Derek Kane
 
Module III - Classification Decision tree (1).pptx
Module III - Classification Decision tree (1).pptxModule III - Classification Decision tree (1).pptx
Module III - Classification Decision tree (1).pptx
Shivakrishnan18
 
Performance By Design
Performance By DesignPerformance By Design
Performance By Design
Guy Harrison
 
Ml7 bagging
Ml7 baggingMl7 bagging
Ml7 bagging
ankit_ppt
 
2 extreme performance - smaller is better
2   extreme performance - smaller is better2   extreme performance - smaller is better
2 extreme performance - smaller is better
sqlserver.co.il
 
Performance By Design
Performance By DesignPerformance By Design
Performance By Design
Guy Harrison
 
Bank loan purchase modeling
Bank loan purchase modelingBank loan purchase modeling
Bank loan purchase modeling
Saleesh Satheeshchandran
 
Cs437 lecture 1-6
Cs437 lecture 1-6Cs437 lecture 1-6
Cs437 lecture 1-6
Aneeb_Khawar
 
Database optimization
Database optimizationDatabase optimization
Database optimization
EsraaAlattar1
 
Introduction to cart_2009
Introduction to cart_2009Introduction to cart_2009
Introduction to cart_2009
Matthew Magistrado
 
Introduction to RandomForests 2004
Introduction to RandomForests 2004Introduction to RandomForests 2004
Introduction to RandomForests 2004
Salford Systems
 
[Women in Data Science Meetup ATX] Decision Trees
[Women in Data Science Meetup ATX] Decision Trees [Women in Data Science Meetup ATX] Decision Trees
[Women in Data Science Meetup ATX] Decision Trees
Nikolaos Vergos
 
decisiontrees (3).ppt
decisiontrees (3).pptdecisiontrees (3).ppt
decisiontrees (3).ppt
LvlShivaNagendra
 

Similar to Xgboost: A Scalable Tree Boosting System - Explained (20)

Machine Learning Unit-5 Decesion Trees & Random Forest.pdf
Machine Learning Unit-5 Decesion Trees & Random Forest.pdfMachine Learning Unit-5 Decesion Trees & Random Forest.pdf
Machine Learning Unit-5 Decesion Trees & Random Forest.pdf
 
Data analysis01 singlevariable
Data analysis01 singlevariableData analysis01 singlevariable
Data analysis01 singlevariable
 
Optimising Xapian
Optimising XapianOptimising Xapian
Optimising Xapian
 
Chap8 basic cluster_analysis
Chap8 basic cluster_analysisChap8 basic cluster_analysis
Chap8 basic cluster_analysis
 
Data Mining in Market Research
Data Mining in Market ResearchData Mining in Market Research
Data Mining in Market Research
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Research
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Research
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests
 
Module III - Classification Decision tree (1).pptx
Module III - Classification Decision tree (1).pptxModule III - Classification Decision tree (1).pptx
Module III - Classification Decision tree (1).pptx
 
Performance By Design
Performance By DesignPerformance By Design
Performance By Design
 
Ml7 bagging
Ml7 baggingMl7 bagging
Ml7 bagging
 
2 extreme performance - smaller is better
2   extreme performance - smaller is better2   extreme performance - smaller is better
2 extreme performance - smaller is better
 
Performance By Design
Performance By DesignPerformance By Design
Performance By Design
 
Bank loan purchase modeling
Bank loan purchase modelingBank loan purchase modeling
Bank loan purchase modeling
 
Cs437 lecture 1-6
Cs437 lecture 1-6Cs437 lecture 1-6
Cs437 lecture 1-6
 
Database optimization
Database optimizationDatabase optimization
Database optimization
 
Introduction to cart_2009
Introduction to cart_2009Introduction to cart_2009
Introduction to cart_2009
 
Introduction to RandomForests 2004
Introduction to RandomForests 2004Introduction to RandomForests 2004
Introduction to RandomForests 2004
 
[Women in Data Science Meetup ATX] Decision Trees
[Women in Data Science Meetup ATX] Decision Trees [Women in Data Science Meetup ATX] Decision Trees
[Women in Data Science Meetup ATX] Decision Trees
 
decisiontrees (3).ppt
decisiontrees (3).pptdecisiontrees (3).ppt
decisiontrees (3).ppt
 

More from Simon Lia-Jonassen

HyperLogLog and friends
HyperLogLog and friendsHyperLogLog and friends
HyperLogLog and friends
Simon Lia-Jonassen
 
No more bad news!
No more bad news!No more bad news!
No more bad news!
Simon Lia-Jonassen
 
Chatbots are coming!
Chatbots are coming!Chatbots are coming!
Chatbots are coming!
Simon Lia-Jonassen
 
Large-Scale Real-Time Data Management for Engagement and Monetization
Large-Scale Real-Time Data Management for Engagement and MonetizationLarge-Scale Real-Time Data Management for Engagement and Monetization
Large-Scale Real-Time Data Management for Engagement and Monetization
Simon Lia-Jonassen
 
Efficient Query Processing in Web Search Engines
Efficient Query Processing in Web Search EnginesEfficient Query Processing in Web Search Engines
Efficient Query Processing in Web Search Engines
Simon Lia-Jonassen
 
Leveraging Big Data and Real-Time Analytics at Cxense
Leveraging Big Data and Real-Time Analytics at CxenseLeveraging Big Data and Real-Time Analytics at Cxense
Leveraging Big Data and Real-Time Analytics at Cxense
Simon Lia-Jonassen
 
Yet another intro to Apache Spark
Yet another intro to Apache SparkYet another intro to Apache Spark
Yet another intro to Apache Spark
Simon Lia-Jonassen
 
Efficient Query Processing in Distributed Search Engines
Efficient Query Processing in Distributed Search EnginesEfficient Query Processing in Distributed Search Engines
Efficient Query Processing in Distributed Search Engines
Simon Lia-Jonassen
 
What should be done to IR algorithms to meet current, and possible future, ha...
What should be done to IR algorithms to meet current, and possible future, ha...What should be done to IR algorithms to meet current, and possible future, ha...
What should be done to IR algorithms to meet current, and possible future, ha...
Simon Lia-Jonassen
 

More from Simon Lia-Jonassen (9)

HyperLogLog and friends
HyperLogLog and friendsHyperLogLog and friends
HyperLogLog and friends
 
No more bad news!
No more bad news!No more bad news!
No more bad news!
 
Chatbots are coming!
Chatbots are coming!Chatbots are coming!
Chatbots are coming!
 
Large-Scale Real-Time Data Management for Engagement and Monetization
Large-Scale Real-Time Data Management for Engagement and MonetizationLarge-Scale Real-Time Data Management for Engagement and Monetization
Large-Scale Real-Time Data Management for Engagement and Monetization
 
Efficient Query Processing in Web Search Engines
Efficient Query Processing in Web Search EnginesEfficient Query Processing in Web Search Engines
Efficient Query Processing in Web Search Engines
 
Leveraging Big Data and Real-Time Analytics at Cxense
Leveraging Big Data and Real-Time Analytics at CxenseLeveraging Big Data and Real-Time Analytics at Cxense
Leveraging Big Data and Real-Time Analytics at Cxense
 
Yet another intro to Apache Spark
Yet another intro to Apache SparkYet another intro to Apache Spark
Yet another intro to Apache Spark
 
Efficient Query Processing in Distributed Search Engines
Efficient Query Processing in Distributed Search EnginesEfficient Query Processing in Distributed Search Engines
Efficient Query Processing in Distributed Search Engines
 
What should be done to IR algorithms to meet current, and possible future, ha...
What should be done to IR algorithms to meet current, and possible future, ha...What should be done to IR algorithms to meet current, and possible future, ha...
What should be done to IR algorithms to meet current, and possible future, ha...
 

Recently uploaded

Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStrDeep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
saastr
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
Hiike
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
saastr
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Wask
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
Shinana2
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Jeffrey Haguewood
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
HarisZaheer8
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Tatiana Kojar
 
Operating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptxOperating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptx
Pravash Chandra Das
 
Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!
GDSC PJATK
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
SitimaJohn
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 

Recently uploaded (20)

Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStrDeep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
 
Operating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptxOperating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptx
 
Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 

Xgboost: A Scalable Tree Boosting System - Explained

  • 1. XGBoost: A Scalable Tree Boosting System Simon Lia-Jonassen
  • 2. Motivation  Used by majority of winning solutions on Kaggle, 2nd most popular method after DNN.  Also used by 10 best teams in KDDCup’15.  Applies to classification, regression and learning-to-rank tasks.  Usually outperforms alternatives in an out-of-the-box setting.  Combines a good theoretical foundation and a highly efficient implementation.  So, how does it work?
  • 3. Decision Tree Boosting Number of trees Tree function, maps to a set of leaf weights Instance features
  • 4. Regularized Learning Objective Prediction loss Complexity penalty Number of leaves L2 regularization on leaves weights
  • 5. Regularized Learning Objective First order gradient of the loss function Second order gradient of the loss function By additive definition Where: However, for example:
  • 6. Regularized Learning Objective By expansion: For each instance For each leaf For each instance in the leaf
  • 7. Regularized Learning Objective Optimal leaf weight for a fixed structure: By substitution:
  • 8. Gradient Tree Boosting Before we split Left split Right split Split penalty
  • 10. Optimizations  Shrinkage  More trees  Column subsampling  Prevents over-fitting  Approximate split finding  Faster AUC convergence  Sparsity-aware split finding  Visit only non-missing values  Cache-aware parallel column block access  Fewer misses on large datasets  Block compression and sharding  Faster I/O for out-of-core computation
  • 11. Optimizations  Shrinkage  More trees  Column subsampling  Prevents over-fitting  Approximate split finding  Faster AUC convergence  Sparsity-aware split finding  Visit only non-missing values  Cache-aware parallel column block access  Fewer misses on large datasets  Block compression and sharding  Faster I/O for out-of-core computation
  • 12. Optimizations  Shrinkage  More trees  Column subsampling  Prevents over-fitting  Approximate split finding  Faster AUC convergence  Sparsity-aware split finding  Visit only non-missing values  Cache-aware parallel column block access  Fewer misses on large datasets  Block compression and sharding  Faster I/O for out-of-core computation
  • 13. Optimizations  Shrinkage  More trees  Column subsampling  Prevents over-fitting  Approximate split finding  Faster AUC convergence  Sparsity-aware split finding  Visit only non-missing values  Cache-aware parallel column block access  Fewer misses on large datasets  Block compression and sharding  Faster I/O for out-of-core computation
  • 14. Optimizations  Shrinkage  More trees  Column subsampling  Prevents over-fitting  Approximate split finding  Faster AUC convergence  Sparsity-aware split finding  Visit only non-missing values  Cache-aware parallel column block access  Fewer misses on large datasets  Block compression and sharding  Faster I/O for out-of-core computation
  • 16. Further reading  The paper:  https://arxiv.org/pdf/1603.02754.pdf  XGBoost tutorial:  http://xgboost.readthedocs.io/en/latest/model.html  A great deck of slides:  https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf  A simple usage example:  https://www.kaggle.com/kevalm/xgboost-implementation-on-iris-dataset-python  DataCamp mini-course:  https://campus.datacamp.com/courses/extreme-gradient-boosting-with-xgboost