These are the slides from my talk at the FULokoja Ingressive meetup.
XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework. In prediction problems involving unstructured data (images, text, etc.), artificial neural networks tend to outperform all other algorithms or frameworks. However, when it comes to small-to-medium structured/tabular data, decision-tree-based algorithms are considered best-in-class right now. XGBoost offers the best combination of prediction performance and processing time compared to other algorithms.
Tong is a data scientist at Supstat Inc. and a master's student in Data Mining. He has been an active R programmer and developer for five years. He is the author of the R package for XGBoost, one of the most popular and contest-winning tools on kaggle.com today.
Agenda:
Introduction to XGBoost
Real World Application
Model Specification
Parameter Introduction
Advanced Features
Kaggle Winning Solution
Our fall 12-week Data Science bootcamp starts on September 21st, 2015. Apply now to get a spot!
If you are hiring Data Scientists, call us at (1)888-752-7585 or reach info@nycdatascience.com to share your openings and set up interviews with our excellent students.
---------------------------------------------------------------
Come join our meetup and learn how easily you can use R for advanced machine learning. In this meetup, we will demonstrate how to understand and use XGBoost for Kaggle competitions. Tong is in Canada and will join us remotely through Google Hangouts.
---------------------------------------------------------------
Speaker Bio:
Tong is a data scientist at Supstat Inc. and a master's student in Data Mining. He has been an active R programmer and developer for five years. He is the author of the R package for XGBoost, one of the most popular and contest-winning tools on kaggle.com today.
Prerequisites (if any): R, calculus
Preparation: A laptop with R installed. Windows users might need to have RTools installed as well.
Agenda:
Introduction to XGBoost
Real World Application
Model Specification
Parameter Introduction
Advanced Features
Kaggle Winning Solution
Event arrangement:
6:45pm Doors open. Come early to network, grab a beer and settle in.
7:00-9:00pm XGBoost Demo
Reference:
https://github.com/dmlc/xgboost
These slides explain the distinction between bagging and boosting through the lens of the bias-variance trade-off, followed by some lesser-known aspects of supervised learning: the effect of the tree split metric on feature importance, the effect of the decision threshold on classification accuracy, and how to adjust a model's threshold for classification.
Note: the limitations of the accuracy metric (baseline accuracy), alternative metrics, their use cases, and their advantages and limitations are also briefly discussed.
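As a concrete illustration of the threshold-adjustment idea, here is a minimal sketch (not code from the talk; the dataset and classifier are stand-ins) showing how accuracy and F1 shift as the decision threshold moves away from the default 0.5 on an imbalanced problem:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# Imbalanced toy data: ~90% negatives, ~10% positives
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

# Compare the default 0.5 threshold with lower thresholds for the rare class
for threshold in (0.5, 0.3, 0.1):
    pred = (proba >= threshold).astype(int)
    print(threshold, accuracy_score(y_test, pred), f1_score(y_test, pred))

Accuracy alone can look high at any threshold simply because the baseline (always predicting the majority class) is already 90%, which is exactly the limitation noted above.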
This is a presentation about Gradient Boosted Trees which starts from the basics of data mining, builds up to ensemble methods like bagging and boosting, and then works towards Gradient Boosted Trees.
This presentation introduces the Boosting ensemble method for machine learning. Its objective is to compare Boosting to the Random Forest ensemble method, explain the difference between AdaBoost and Gradient Boosting, and annotate the pseudo-code for each algorithm for classification and regression, respectively. Kirkwood gave this instructional demonstration while applying to be a Data Scientist in Residence at Galvanize, Boulder.
No machine learning algorithm dominates in every domain, but random forests are usually tough to beat by much, and they have some advantages compared to other models: little input preparation is needed, they perform implicit feature selection, they are fast to train, and the model can be visualized. While it is easy to get started with random forests, a good understanding of the model is key to getting the most out of them.
This talk will cover decision trees, from theory to their implementation in scikit-learn. An overview of ensemble methods and bagging will follow, ending with an explanation and implementation of random forests and a comparison with other state-of-the-art models.
The talk will have a very practical approach, using examples and real cases to illustrate how to use both decision trees and random forests.
We will see how the simplicity of decision trees is a key advantage compared to other methods. Unlike black-box methods, or methods that are hard to represent in multivariate cases, decision trees can easily be visualized, analyzed, and debugged until we see that our model is behaving as expected. This exercise can increase our understanding of the data and the problem, while making our model perform in the best possible way.
Random forests randomize and ensemble decision trees to increase their predictive power, while keeping most of their properties.
The main topics covered will include:
* What are decision trees?
* How are decision trees trained?
* Understanding and debugging decision trees
* Ensemble methods
* Bagging
* Random Forests
* When should decision trees and random forests be used?
* Python implementation with scikit-learn
* Analysis of performance
Feature Engineering - Getting most out of data for predictive models (by Gabriel Moreira)
How should data be preprocessed for use in machine learning algorithms? How do you identify the most predictive attributes of a dataset? What features can be generated to improve the accuracy of a model?
Feature Engineering is the process of extracting and selecting, from raw data, features that can be used effectively in predictive models. As the quality of the features greatly influences the quality of the results, knowing the main techniques and pitfalls will help you to succeed in the use of machine learning in your projects.
In this talk, we will present methods and techniques that allow us to extract the maximum potential from the features of a dataset, increasing the flexibility, simplicity, and accuracy of the models. Topics include the analysis of feature distributions and correlations, and the transformation of numeric attributes (such as scaling, normalization, log-based transformation, binning), categorical attributes (such as one-hot encoding, feature hashing), temporal attributes (date/time), and free-text attributes (text vectorization, topic modeling).
Python, scikit-learn, and Spark SQL examples will be presented, along with how to use domain knowledge and intuition to select and generate features relevant to predictive models.
Random Forest Classifier in Machine Learning (by Palin Analytics)
Random Forest is a supervised learning ensemble algorithm. Ensemble algorithms are those which combine more than one algorithm, of the same or different kind, for classifying objects...
Gradient Boosted Regression Trees in scikit-learn (by DataRobot)
Slides of the talk "Gradient Boosted Regression Trees in scikit-learn" by Peter Prettenhofer and Gilles Louppe held at PyData London 2014.
Abstract:
This talk describes Gradient Boosted Regression Trees (GBRT), a powerful statistical learning technique with applications in a variety of areas, ranging from web page ranking to environmental niche modeling. GBRT is a key ingredient of many winning solutions in data-mining competitions such as the Netflix Prize, the GE Flight Quest, or the Heritage Health Prize.
I will give a brief introduction to the GBRT model and regression trees, focusing on intuition rather than mathematical formulas. The majority of the talk will be dedicated to an in-depth discussion of how to apply GBRT in practice using scikit-learn. We will cover important topics such as regularization, model tuning, and model interpretation that should significantly improve your score on Kaggle.
In this presentation, we approach a two-class classification problem. We try to find a plane that separates the classes in the feature space, also called a hyperplane. If we can't find a hyperplane, we can be creative in two ways: 1) we soften what we mean by "separate", and 2) we enrich and enlarge the feature space so that separation is possible.
Production model lifecycle management, 2016-09 (by Greg Makowski)
This talk covers the various stages of building data mining models, putting them into production, and eventually replacing them. A common theme throughout is three attributes of predictive models: accuracy, generalization, and description. I assert you can have it all, and that having all three is important for managing the lifecycle. A subtle point is that this is a step towards developing embedded, automated data mining systems which can figure out for themselves when they need to be updated.
Augmenting Machine Learning with Databricks Labs AutoML Toolkit (by Databricks)
Instead of better understanding and optimizing their machine learning models, data scientists spend a majority of their time training and iterating through different models, even in cases where the data is reliable and clean. Important aspects of creating an ML model include (but are not limited to) data preparation, feature engineering, identifying the correct models, training (and continuing to train), and optimizing the models. This process can be (and often is) laborious and time-consuming. In this session, we will explore this process and then show how the AutoML Toolkit (from Databricks Labs) can significantly simplify and optimize machine learning. We will demonstrate all of this on financial loan risk data, with code snippets and notebooks that will be free to download.
Using R in Kaggle Competitions.
Kaggle has been the most popular data science platform, linking close to half a million data scientists worldwide. Learn how to get yourself a decent ranking in Kaggle competitions with R programming, eXtreme Gradient Boosting (XGBoost), and a laptop: great machine learning tools for all levels to get started and learn. Find out how to perform feature engineering, tune XGB models, set up sizable cross-validation, and perform model ensembles.
Tensors Are All You Need: Faster Inference with HummingbirdDatabricks
The ever-increasing interest around deep learning and neural networks has led to a vast increase in processing frameworks like TensorFlow and PyTorch. These libraries are built around the idea of a computational graph that models the dataflow of individual units. Because tensors are their basic computational unit, these frameworks can run efficiently on hardware accelerators (e.g. GPUs). Traditional machine learning (ML) models such as linear regressions and decision trees in scikit-learn cannot currently be run on GPUs, missing out on the potential accelerations that deep learning and neural networks enjoy.
In this talk, we'll show how you can use Hummingbird to achieve up to 1000x speedup in inference on GPUs by converting your traditional ML models to tensor-based models (PyTorch and TVM). https://github.com/microsoft/hummingbird
This talk is for intermediate audiences that use traditional machine learning and want to speed up inference with these models. After watching the talk, the audience should be able to use ~5 lines of code to convert their traditional models to tensor-based models and try them out on GPUs, as sketched below.
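A rough sketch of that conversion flow, based on the Hummingbird project README (exact API details may differ across versions; the model and data here are stand-ins):

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from hummingbird.ml import convert

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
skl_model = RandomForestClassifier(n_estimators=50).fit(X, y)

hb_model = convert(skl_model, "pytorch")  # traditional model -> tensor-based model
# hb_model.to('cuda')                     # move to GPU when one is available
preds = hb_model.predict(X)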
Outline:
Introduction of what ML inference is (and why it’s different than training)
Motivation: Tensor-based DNN frameworks allow inference on GPU, but “traditional” ML frameworks do not
Why “traditional” ML methods are important
Introduction of what Hummingbird does and its main benefits
Deep dive on how traditional ML models are built
Brief intro on how the Hummingbird converter works
Example of how Hummingbird can convert a tree model into a tensor-based model
Other models
Demo
Status
Q&A
Feature Engineering - Getting most out of data for predictive models - TDC 2017 (by Gabriel Moreira)
How should data be preprocessed for use in machine learning algorithms? How do you identify the most predictive attributes of a dataset? What features can be generated to improve the accuracy of a model?
Feature Engineering is the process of extracting and selecting, from raw data, features that can be used effectively in predictive models. As the quality of the features greatly influences the quality of the results, knowing the main techniques and pitfalls will help you to succeed in the use of machine learning in your projects.
In this talk, we will present methods and techniques that allow us to extract the maximum potential from the features of a dataset, increasing the flexibility, simplicity, and accuracy of the models. Topics include the analysis of feature distributions and correlations, and the transformation of numeric attributes (such as scaling, normalization, log-based transformation, binning), categorical attributes (such as one-hot encoding, feature hashing), temporal attributes (date/time), and free-text attributes (text vectorization, topic modeling).
Python, scikit-learn, and Spark SQL examples will be presented, along with how to use domain knowledge and intuition to select and generate features relevant to predictive models.
Machine learning lets you make better business decisions by uncovering patterns in your consumer behavior data that are hard for the human eye to spot. You can also use it to automate routine, expensive human tasks that were previously not doable by computers. In the business-to-business (B2B) space, if your competitors can make wiser business decisions based on data and automate more business operations while you still base your decisions on guesswork and lack automation, you will lose out on business productivity. In this introductory machine learning tech talk, you will learn how to use machine learning even if you do not have deep technical expertise in this technology.
Topics covered:
1. What is machine learning
2. What is a typical ML application architecture
3. How to start ML development, with free resource links
4. Key decision factors in ML technology selection depending on use case scenarios
The concept of the talk is as follows: to give a general idea of the user segmentation task in a DMP project and how solving this problem helps our business; to explain how we use AutoML to solve this task and describe its components; and to give insights into the techniques we apply to make our pipeline fast and stable on huge datasets.
Scaling out logistic regression with Spark (by Barak Gitsis)
Large-scale multinomial logistic regression with Spark. Contains animated GIFs, an analysis of L-BFGS, real-world Spark configurations, and the SimilarWeb categorization algorithm.
The Power of AutoML and How Does it Work (by Ivo Andreev)
Automated ML is an approach to minimize the need for data science effort by enabling domain experts to build ML models without deep knowledge of algorithms, mathematics, or programming. The mechanism works by allowing end users to simply provide data, and the system automatically does the rest by determining the approach to perform the particular ML task. At first this may sound discouraging to those aiming at the "sexiest job of the 21st century": the data scientists. However, AutoML should be considered a democratization of ML, rather than automatic data science.
In this session we will talk about how AutoML works, how it is implemented by Microsoft, and how it could improve the productivity of even professional data scientists.
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt (by Eugene Yan Ziyou)
Our team achieved 85th position out of 3,514 at the very popular Kaggle Otto Product Classification Challenge. Here's an overview of how we did it, as well as some techniques we learnt from fellow Kagglers during and after the competition.
Similar to XGBoost: the algorithm that wins every competition (20)
Adjusting primitives for graph: SHORT REPORT / NOTES (by Subhajit Sahu)
Graph algorithms, like PageRank, operate on graph representations; Compressed Sparse Row (CSR) is an adjacency-list-based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... (by Subhajit Sahu)
Abstract: Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (by John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, i.e. those with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
XGBoost: the algorithm that wins every competition
1. XGBoost: the algorithm that wins every competition
Poznań University of Technology; April 28th, 2016
meet.ml #1 - Applied Big Data and Machine Learning
By Jarosław Szymczak
5. Some recent competitions from Kaggle
CTR – click-through rate (~ equal to the probability of a click)
Features:
6. Some recent competitions from Kaggle
Sales prediction for a store on a certain date
Features:
• store id
• date
• school and state holidays
• store type
• assortment
• promotions
• competition
7. Solutions evaluation
[Chart: logarithmic loss for a positive example, plotted against the predicted probability from 0 to 1]
• Root Mean Square Percentage Error (RMSPE)
• Binary logarithmic loss (loss is capped at ~35 for a single observation)
• Multi-class logarithmic loss
22. Another boosting approach
What if, instead of reweighting examples, we made some corrections to prediction errors directly?
Note: the residual is the gradient of a single observation's error contribution in one of the most common evaluation measures for regression: root mean squared error (RMSE).
We add a new model, NEW, to the one we already have (call it M1), so:
y_pred = M1(X) + NEW(X)
NEW(X) ≈ y − M1(X), our residual
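To spell out the note above: for the squared-error loss L(y, ŷ) = ½(y − ŷ)², the derivative with respect to the prediction is ∂L/∂ŷ = ŷ − y, so the negative gradient of each observation's error contribution is exactly its residual y − ŷ. Fitting the new model to the residuals is therefore a gradient step in function space.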
24. Gradient boosting
Fit model to initial data: this can be the same algorithm as for the further steps, or something very simple (like uniform probabilities, or the average target in regression).
Fit pseudo-residuals: this works for any loss function that aggregates the error from examples (e.g. log-loss or RMSE, but not AUC) and whose gradient you can calculate at the example level (this gradient is called the pseudo-residual).
Finalization: sum up all the models.
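A minimal sketch of this loop for RMSE, where the pseudo-residual is simply y minus the current prediction (illustrative only; scikit-learn trees stand in for XGBoost's own tree-building algorithm):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

eta = 0.1                                # shrinkage / learning rate
prediction = np.full(len(y), y.mean())   # very simple initial model: average target
trees = []

for _ in range(100):
    residuals = y - prediction           # pseudo-residuals for squared error
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    prediction += eta * tree.predict(X)  # add the new model's correction
    trees.append(tree)

# Final model = initial guess + sum of all the (shrunken) trees
print("train RMSE:", np.sqrt(np.mean((y - prediction) ** 2)))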
26. XGBoost, by Tianqi Chen
eXtreme Gradient Boosting
• used for classification, regression, and ranking, with custom loss functions
• interfaces for Python and R; can be executed on YARN
• custom tree-building algorithm
27. XGBoost – handling the features
Numeric values
• for each numeric value, XGBoost finds the best available split (it is always a binary split)
• the algorithm is designed to work with numeric values only
Nominal values
• need to be converted to numeric ones
• the classic way is to perform one-hot encoding / get dummies (for all values)
• for variables with large cardinality, some other, more sophisticated methods may need to be used
Missing values
• XGBoost handles missing values separately
• a missing value is always added to the branch in which it would minimize the loss (so it is treated as either a very large or a very small value)
• this is a great advantage over, e.g., the scikit-learn gradient boosting implementation
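A small sketch of these cases (hypothetical toy data; the column names are made up): nominal values are one-hot encoded, while missing values are passed through as NaN so XGBoost can route them itself.

import numpy as np
import pandas as pd
import xgboost as xgb

df = pd.DataFrame({
    "price": [9.5, np.nan, 12.0, 7.25],        # numeric, with a missing value
    "store_type": ["a", "b", "a", "c"],        # nominal, must become numeric
})
X = pd.get_dummies(df, columns=["store_type"], dtype=float)  # one-hot encoding
y = np.array([0, 1, 0, 1])

# NaN entries are marked as missing; XGBoost sends them to the branch
# that minimizes the loss during training
dtrain = xgb.DMatrix(X, label=y, missing=np.nan)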
28. XGBoost – loss functions
Used as optimization objective
• logloss (optimized in a complicated manner) for objectives binary:logistic and reg:logistic (they only differ by evaluation measure; all the rest is the same)
• softmax (which is logloss transformed to handle multiple classes) for objective multi:softmax
• RMSE (root mean squared error) for objective reg:linear
Only measured
• AUC – area under the ROC curve
• MAE – mean absolute error
• error / merror – error rate for binary / multi-class classification
Custom
• only measured: simply implement the evaluation function
• used as optimization objective: provided that you have a measure that is an aggregation of element-wise losses and is differentiable, you create a function for the element-wise gradient (first derivative of the loss with respect to the prediction), then the hessian (second derivative); that's it, the algorithm will approximate your loss function with a Taylor series (the pre-defined loss functions are implemented in the same manner)
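A sketch of the custom-objective mechanics in the classic Python API, reimplementing plain squared error (whose element-wise gradient is pred − label and hessian is 1); the data is synthetic and the parameters are illustrative:

import numpy as np
import xgboost as xgb

def squared_error_obj(preds, dtrain):
    labels = dtrain.get_label()
    grad = preds - labels          # element-wise gradient of 1/2 * (pred - label)^2
    hess = np.ones_like(preds)     # element-wise second derivative
    return grad, hess

def rmse_eval(preds, dtrain):      # a purely "measured" custom evaluation function
    labels = dtrain.get_label()
    return "rmse", float(np.sqrt(np.mean((preds - labels) ** 2)))

rng = np.random.RandomState(0)
dtrain = xgb.DMatrix(rng.rand(100, 5), label=rng.rand(100))
bst = xgb.train({"max_depth": 3}, dtrain, num_boost_round=20,
                obj=squared_error_obj, feval=rmse_eval)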
29. XGBoost – setup for a Python environment
Installing XGBoost in your environment:
• Check out this wonderful tutorial to install Docker with the Kaggle Python stack: http://goo.gl/wSCkmC
• Follow http://xgboost.readthedocs.org/en/latest/build.html (expect potential troubles)
• Or clone https://github.com/dmlc/xgboost at revision 219e58d453daf26d6717e2266b83fca0071f3129 and follow the instructions in it (it is of course an older version)
31. XGBoost – simple use example (direct)
Link to script: https://goo.gl/vwQeXs
Annotations from the script:
• for direct use we need to specify the number of classes
• scikit-learn conventional names are used
• to account for the importance of examples, we can assign weights to them in the DMatrix (not done here)
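In the same spirit as the linked script, a minimal direct-use sketch (synthetic data; parameter values are illustrative):

import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
X, y = rng.rand(300, 4), rng.randint(0, 3, 300)  # three classes

# DMatrix also accepts weight= to assign importance to examples (not done here)
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "multi:softmax",
          "num_class": 3,           # for direct use we must specify this
          "max_depth": 4, "eta": 0.1}
bst = xgb.train(params, dtrain, num_boost_round=50)
pred = bst.predict(dtrain)          # class labels, since the objective is softmax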
32. XGBoost – feature importance
Link to script: http://pastebin.com/uPK5aNkf
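With a trained booster bst (as in the previous sketch), the importances can be inspected roughly like this (plot_importance needs matplotlib installed; newer versions also offer get_score with other importance types):

import xgboost as xgb

scores = bst.get_fscore()   # number of splits per feature
print(scores)
xgb.plot_importance(bst)    # horizontal bar chart of the same scores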
33. XGBoost – simple use example (in scikit-learn style)
Link to script: https://goo.gl/IPxKh5
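And the scikit-learn-style equivalent (again synthetic data; hyperparameters are illustrative):

import numpy as np
from xgboost import XGBClassifier

rng = np.random.RandomState(0)
X, y = rng.rand(300, 4), rng.randint(0, 2, 300)

clf = XGBClassifier(n_estimators=100, max_depth=4, learning_rate=0.1)
clf.fit(X, y)                       # conventional scikit-learn names
proba = clf.predict_proba(X)[:, 1]  # probability of the positive class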
34. XGBoost – most common parameters for the tree booster
subsample
• ratio of instances to take
• example values: 0.6, 0.9
colsample_bytree
• ratio of columns used for whole-tree creation
• example values: 0.1, …, 0.9
colsample_bylevel
• ratio of columns sampled for each split
• example values: 0.1, …, 0.9
eta
• how fast the algorithm will learn (shrinkage)
• example values: 0.01, 0.05, 0.1
max_depth
• maximum number of consecutive splits
• example values: 1, …, 15 (for a dozen or so or more, it needs to be set together with the regularization parameters)
min_child_weight
• minimum weight of children in a leaf; needs to be adapted for each measure
• example values: 1 (for linear regression it would be one example; for classification it is the gradient of the pseudo-residual)
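Collected into a single configuration dictionary, these might look as follows (values taken from the slide's example ranges, not tuned recommendations):

params = {
    "eta": 0.05,               # shrinkage: how fast the algorithm learns
    "max_depth": 6,            # maximum number of consecutive splits
    "min_child_weight": 1,     # minimum child weight in a leaf
    "subsample": 0.9,          # ratio of instances to take
    "colsample_bytree": 0.7,   # ratio of columns used for the whole tree
    "colsample_bylevel": 0.7,  # ratio of columns sampled for each split
}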
35. XGBoost – regularization parameters for the tree booster
alpha
• L1 norm (simple average) of the weights, added to the whole objective function
lambda
• L2 norm (root of the average of squares) of the weights, added as a penalty to the objective function
gamma
• L0 norm: multiplied by the number of leaves in a tree, it is used to decide whether to make a split
Note: lambda appears in a denominator because of various transformations; in the original objective it is a "classic" L2 penalty.
36. XGBoost – some cool things not necessarily present elsewhere
Watchlist
• you can create a small validation set to prevent overfitting, without the need to guess how many iterations would be enough
Warm start
• thanks to the nature of gradient boosting, you can take some already-working solution and start the "improvement" from it
Easy cut-off at any round when classifying
• simple but powerful while tuning the parameters; allows you to recover from an overfitted model in an easy way
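A sketch of the watchlist and warm-start features (synthetic data; early_stopping_rounds is one way to avoid guessing the iteration count):

import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
X, y = rng.rand(500, 5), rng.randint(0, 2, 500)
dtrain = xgb.DMatrix(X[:400], label=y[:400])
dvalid = xgb.DMatrix(X[400:], label=y[400:])
params = {"objective": "binary:logistic", "eta": 0.1, "max_depth": 3}

# Watchlist: monitor a small validation set while boosting
watchlist = [(dtrain, "train"), (dvalid, "valid")]
bst = xgb.train(params, dtrain, num_boost_round=1000,
                evals=watchlist, early_stopping_rounds=50)

# Warm start: continue the "improvement" from an already-working model
bst2 = xgb.train(params, dtrain, num_boost_round=100, xgb_model=bst)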
38. Winning solutions
"The winning solution consists of over 20 xgboost models that each need about two hours to train when running three models in parallel on my laptop. So I think it could be done within 24 hours. Most of the models individually achieve a very competitive (top 3 leaderboard) score."
by Gert Jacobusse
"I did quite a bit of manual feature engineering and my models are entirely based on xgboost. Feature engineering is a combination of 'brutal force' (trying different transformations, etc. that I know) and 'heuristic' (trying to think about drivers of the target in real world settings). One thing I learned recently was entropy based features, which were useful in my model."
by Owen Zhang
40. Even deeper dive
• XGBoost:
  • https://github.com/dmlc/xgboost/blob/master/demo/README.md
• More info about machine learning:
  • Uczenie maszynowe i sieci neuronowe (Machine Learning and Neural Networks), Krawiec K., Stefanowski J., Wydawnictwo PP, Poznań, 2003 (2nd edition, 2004)
  • The Elements of Statistical Learning: Data Mining, Inference, and Prediction, by Trevor Hastie, Robert Tibshirani, Jerome Friedman (available online for free)
• scikit-learn and pandas official pages:
  • http://scikit-learn.org/
  • http://pandas.pydata.org/
• Kaggle blog and forum:
  • http://blog.kaggle.com/ (No Free Hunch)
  • https://www.datacamp.com/courses/kaggle-python-tutorial-on-machine-learning
  • https://www.datacamp.com/courses/intro-to-python-for-data-science
• Python basics:
  • https://www.coursera.org/learn/interactive-python-1
  • https://www.coursera.org/learn/interactive-python-2