XGBoost
AGATHA LAU
25/4/2024
What is the XGBoost Algorithm?
XGBoost stands for Extreme Gradient Boosting, which is a scalable, distributed gradient-
boosted decision tree (GBDT) machine learning library.
It provides parallel tree boosting and is the leading machine learning library for
regression, classification, and ranking problems.
XGBoost builds upon:
◦ Supervised Machine Learning.
◦ Decision Trees.
◦ Ensemble Learning.
◦ Gradient Boosting.
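As a brief illustration, here is a minimal sketch of training the library through its scikit-learn-style Python API; the synthetic data and the hyperparameter values are placeholders for illustration, not recommendations.

# Minimal sketch: train an XGBoost regressor on synthetic data and score it.
# Assumes the xgboost and scikit-learn packages; the data and hyperparameter
# values below are illustrative placeholders only.
import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                                        # features
y = X[:, 0] * 3.0 + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)   # label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBRegressor(
    n_estimators=100,   # number of boosted trees
    max_depth=3,        # depth of each tree (the weak learner)
    learning_rate=0.1,  # shrinkage applied to each tree's contribution
)
model.fit(X_train, y_train)
print("test R^2:", model.score(X_test, y_test))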
Supervised Machine Learning
Supervised machine learning uses algorithms to train a model to find patterns in a
dataset with labels and features and then uses the trained model to predict the labels
on a new dataset’s features.
Figure 1: Supervised learning trains a model to find patterns in a dataset and uses the trained model to predict labels for a new dataset.
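A tiny sketch of this fit-then-predict workflow, using scikit-learn purely as an assumed stand-in for any supervised learner; the data values are invented.

# Supervised learning workflow: fit on labelled data (features + labels),
# then predict labels for new, unseen feature rows.
from sklearn.linear_model import LinearRegression

X_train = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0]]    # features with known labels
y_train = [5.0, 4.0, 11.0]                        # labels

model = LinearRegression().fit(X_train, y_train)  # find patterns in the dataset
X_new = [[4.0, 3.0]]                              # a new dataset's features
print(model.predict(X_new))                       # predicted label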
Decision Trees
Decision trees create a model that predicts the label by evaluating a tree of if-then-else
true/false feature questions, and estimating the minimum number of questions needed
to assess the probability of making a correct decision.
Figure 2: A decision tree is used to estimate a house price (the label) based on the size and number of bedrooms (the features).
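To mirror Figure 2, here is a short sketch of a small decision tree estimating a house price from size and number of bedrooms; the sizes and prices are invented, and scikit-learn is assumed only as a convenient implementation.

# Sketch matching Figure 2: a shallow decision tree predicting house price
# (the label) from size and bedrooms (the features). Data is made up.
from sklearn.tree import DecisionTreeRegressor, export_text

X = [[70, 2], [90, 3], [120, 3], [150, 4], [200, 5]]   # [size_m2, bedrooms]
y = [150_000, 200_000, 260_000, 320_000, 450_000]      # price (label)

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["size_m2", "bedrooms"]))  # the if-then-else questions
print(tree.predict([[100, 3]]))                                  # estimate for a new house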
Ensemble Learning
Gradient-boosted decision trees (GBDT) is a decision tree ensemble learning algorithm, similar to random forest, used for classification and regression.
Ensemble models in Machine Learning combine the decisions from multiple models to
improve the overall performance.
Both random forest and GBDT build a model consisting of multiple decision trees. The
difference is in how the trees are built and combined.
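A brief sketch of that difference, using the scikit-learn implementations as assumed reference models: the random forest grows its trees independently and averages them, while the gradient-boosted model grows them sequentially and sums them.

# Both ensembles consist of many decision trees; the difference is how the
# trees are built (independently vs. sequentially) and combined (averaged
# vs. summed). Synthetic data for illustration only.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
gbdt = GradientBoostingRegressor(n_estimators=100, random_state=0).fit(X, y)

print("random forest R^2:", rf.score(X, y))
print("GBDT R^2:", gbdt.score(X, y))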
Ensemble Methods
The principle behind ensembles is the idea of the "wisdom of the crowd": the collective predictions
of many diverse models are better than the predictions made by any single model.
In ensemble learning theory, the base models are called weak learners; they can be used as building
blocks for designing more complex models by combining several of them.
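As a concrete (assumed) example of a weak learner, a depth-one decision tree, a so-called stump, is cheap to train and only modestly accurate on its own, which is exactly what makes it a useful building block.

# A decision stump (depth-1 tree) as a weak learner: a simple model that is
# only somewhat better than guessing by itself, but easy to combine into an
# ensemble. Synthetic data for illustration.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
full_tree = DecisionTreeClassifier().fit(X, y)
print("stump accuracy:", stump.score(X, y))          # above chance, weaker than the full tree
print("full tree accuracy:", full_tree.score(X, y))  # fits the training data closely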
What is Bagging?
Bagging combines homogeneous weak learners: it trains them independently of each other, in parallel, and
combines them following some kind of deterministic averaging process.
Diagram: samples are drawn from the dataset, a weak model is trained on each sample, and the weak models are combined into an ensemble.
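A minimal sketch of the diagram above, assuming scikit-learn's BaggingRegressor as the implementation: it draws bootstrap samples from the dataset, fits one weak model (by default a decision tree) per sample, and averages their predictions.

# Bagging sketch: each weak model is trained on a bootstrap sample drawn from
# the dataset, and the ensemble averages their predictions. scikit-learn's
# BaggingRegressor (default base model: a decision tree) is assumed here.
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)

bagger = BaggingRegressor(
    n_estimators=50,   # number of bootstrap samples / weak models
    bootstrap=True,    # sample the training set with replacement
    random_state=0,
).fit(X, y)
print(bagger.predict(X[:3]))   # averaged predictions of the 50 weak models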
What is Boosting?
Boosting combines homogeneous weak learners: it trains them sequentially in an additive way (each base model depends on the
previous ones) and combines them following a deterministic strategy.
Diagram: a weak model is fit to the dataset and produces an answer; the training dataset is updated based on the previous results, the next weak model is fit, and the process repeats to give the final model.
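One classic boosting algorithm that follows this diagram is AdaBoost, sketched below with scikit-learn: it updates the training set by reweighting it after each round so the next weak model focuses on the previous mistakes. AdaBoost is used here purely as an illustration of boosting; it is not the gradient-based scheme XGBoost uses.

# Boosting sketch with AdaBoost: weak models are trained one after another,
# and the sample weights are updated after each round so that misclassified
# points get more attention from the next weak model.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

booster = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print("boosted accuracy:", booster.score(X, y))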
What is Gradient Boosting?
Gradient boosting is a technique for building an ensemble of weak models such that the predictions of
the ensemble minimize a loss function. Because we are combining the predictions of multiple models, we do not
optimize the model parameters directly; instead we optimize the boosted model's predictions by following the gradients of the loss.
Diagram: each weak model in turn is fit to the dataset and produces an answer; successive weak models refine the previous answers until the final prediction is produced.
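The diagram above can be written out directly for squared-error loss, where the negative gradient is simply the residual: each new weak model is fit to the residuals of the ensemble built so far. A rough sketch, assuming shallow scikit-learn trees as the weak models:

# Hand-rolled gradient boosting sketch for squared-error loss: the negative
# gradient of the loss is the residual (y - prediction), so each new tree is
# fit to the residuals of the ensemble built so far and added with a small
# learning rate.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=400)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start from a constant model
trees = []

for _ in range(100):
    residuals = y - prediction                       # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)    # additive update of the ensemble
    trees.append(tree)

print("training MSE:", np.mean((y - prediction) ** 2))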
Three Steps to Gradient Boosting
1. Loss function optimized. A differentiable loss function is chosen to measure the difference between the model's predictions and the actual target values in the training data. This loss function quantifies the error of the model's predictions and serves as the basis for optimizing the model's parameters. Gradient descent optimization is used to minimize this loss function: it iteratively adjusts the model parameters (e.g., weights in decision trees) in the direction that reduces the loss, guided by the gradients of the loss function with respect to the parameters.
2. Weak learner / decision tree. A weak learner, often a decision tree, is used to make predictions. A weak learner is a simple model that performs slightly better than random guessing. Decision trees are commonly used as weak learners in gradient boosting due to their simplicity and ability to capture complex relationships in the data. Each decision tree is trained to predict the residuals (errors) of the ensemble generated by the previous trees.
3. Trees added sequentially. The decision trees are added to the ensemble sequentially, with each new tree trained to correct the errors made by the ensemble of previously trained trees. After training each tree, its predictions are combined with the predictions of the existing ensemble to reduce the overall error. This process is repeated iteratively until a predefined number of trees have been added or until a certain level of performance is achieved.
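In the generic gradient boosting notation (the standard formulation, not any XGBoost-specific variant), the three steps for rounds m = 1, ..., M can be written as:

r_{im} = -\left[ \frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)} \right]_{F = F_{m-1}}
    (1) pseudo-residuals from the differentiable loss L

h_m \approx \arg\min_h \sum_i \bigl( r_{im} - h(x_i) \bigr)^2
    (2) fit a weak learner (a shallow decision tree) to them

F_m(x) = F_{m-1}(x) + \nu \, h_m(x)
    (3) add the new tree to the ensemble, scaled by the learning rate \nu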
Summary
◦ Defined Bagging and Boosting.
◦ Learned the Role of a Weak Learner.
◦ Defined Loss Function.
◦ Greedy Algorithm.
◦ Three Steps in Gradient Boosting.
◦ Learned about Ensembles.


Editor's Notes

  • #4 XGBoost – What Is It and Why Does It Matter? (nvidia.com)
  • #7 With ensemble learning, we ensemble groups of weak learners. These weak learners are models that can be used as the building blocks for designing more complex models. The weak learner in XGBoost is a decision tree. These base models, or weak learners, don't perform well by themselves, either because they have high bias or because they have too much variance. In order to achieve a bias-variance trade-off, the weak learners are combined to create ensembles that achieve better performance. Most of the time we use a homogeneous base learner, which simply means we use the same type of model throughout the ensemble.
  • #8 Bagging is used when you want to reduce variance. The idea is to create several subsets of data from the training samples, chosen randomly. Each subset is used to train a decision tree. As a result, we end up with an ensemble of different models. When bagging with decision trees, we are less concerned about the individual trees overfitting the training data. The individual decision trees can grow deeply; the trees are not pruned. These trees will have both high variance and low bias. In our example, several base models are trained on different bootstrap samples, and the ensemble model averages the results of the weak learners.
  • #10 Boosting is another ensemble technique used to create a collection of models. In this technique, models are learned sequentially, with the earlier models fitting simple models to the data and then analyzing the data for errors. For bagging, we ran each model independently and then aggregated the outputs at the end without any preference for a particular model. With boosting, in contrast, we fit consecutive trees at every step. The goal is to solve the net error from the prior tree. Unlike bagging, boosting is all about teamwork. With each decision tree, the weak learner dictates which features the next model will focus on. When an input is misclassified, its weight is increased, so that the next decision tree is more likely to classify it correctly.
  • #12 Basic Boosting: Boosting is a general approach where models are built sequentially, and each new model tries to correct the mistakes of the previous ones. It starts with a simple model and focuses on the examples that were incorrectly predicted by the previous models. Boosting doesn't specify how the errors are corrected or how the models are updated. Gradient Boosting: Gradient boosting is a specific type of boosting where the errors made by the previous models are corrected using gradient descent optimization. Instead of just focusing on the misclassified examples, gradient boosting calculates the gradient of the loss function with respect to the predictions of the model, and adjusts the model parameters in the direction that reduces the loss. This allows gradient boosting to optimize the model parameters more efficiently and make better use of the information in the training data.
  • #13 This means that in gradient boosting, the first step is to choose a loss function that measures how well the model's predictions match the actual data. This loss function must be differentiable, which means that it should have a well-defined derivative at every point. Differentiability is important because gradient boosting relies on gradient descent optimization, which involves iteratively adjusting the model parameters (such as the weights of features in decision trees) to minimize the loss function. Gradient descent calculates the direction and magnitude of the steepest decrease in the loss function (the gradient) and updates the model parameters accordingly.
  • #14 Gradient boosting involves three core steps. The first step is that a loss function must be optimized, and that loss function must be differentiable. A loss function measures how well a machine learning model fits the data of a certain phenomenon. The second step in gradient boosting is to use a weak learner; in gradient boosting, the weak learner is a decision tree. The third step is combining many weak learners in an additive fashion. Decision trees are added one at a time, and a gradient descent procedure is used to minimize the loss when adding trees.
  • #15 Let's summarize what was covered in this section. With bagging, the idea is to create several subsets of data from training samples chosen randomly. Each subset of the data is used to train a decision tree. With boosting, the idea is to fit consecutive trees at every step. The goal is to solve the net error from the prior tree. Unlike bagging, boosting is all about teamwork. In most boosting models, the decision tree is the weak learner of choice. A weak learner is a model whose performance is a little better than guessing. With these weak learners, we want the tree depth to be shallow to avoid overfitting. In this section, a loss function was defined. A loss function is a measure of how well the ML model fits the data of a certain phenomenon. Different loss functions may be used depending on the type of problem. A greedy algorithm makes the best decision at every step in the process without regard to the entire problem; XGBoost is a greedy algorithm. The first step is that the loss function must be optimized. The second step is to use a weak learner, and for most gradient boosters in XGBoost that weak learner is a decision tree. The third and final step is combining many weak learners in an additive fashion. In this section, ensemble models were also defined. Ensemble learning uses several ML models to build a more efficient learning algorithm and improve the accuracy of the prediction. With this approach, multiple weak learners are combined to improve our results.
  • #16 Fundamental Concepts: For every prediction model, there is a residual. The purpose of optimization in machine learning is to reduce this residual, i.e., to make the squared error as low as possible. Bias is the error between the predicted value and the real value. High bias means the learning algorithm is missing important trends in the features: it underfits and is too general. Variance is a type of error that occurs due to a model's sensitivity to small fluctuations in the training set. High variance results in a loss of generalization: the model overfits and is too specific. Realistically, we cannot avoid bias and variance altogether.
  • #17 https://www.linkedin.com/posts/mohsin-anwer-074b5a119_ai-ml-datascience-activity-7101935623887818752-qnEp/ The bias of the predictor's function class is the difference between the expected value of the predictor at x and the expected value of the target at x. The variance of the predictor's function class is the expected difference between the value of the predictor estimated on a randomly sampled data set and the expected value of the predictor. A standard measure for the error in a prediction is the mean squared error. Bias-variance decomposition: the mean squared error decomposes into a bias term and a variance term. High bias causes underfitting; simply stated, this means missing real relationships between the features of the data set and the target. In contrast, high variance causes overfitting, which may be thought of as introducing false relationships due to increased noise between the data set features and the target. Thus, overfitting gives rise to the model appearing to be a good predictor on the training data while underperforming on new, previously unseen data (it does not generalize). In the end, the ultimate goal of any ML algorithm is to find the right balance between bias and variance (the bias-variance trade-off). This balance is key to finding the most generalizable model.
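In standard notation, the decomposition referred to above (with f the true function, \hat f the trained predictor, and \sigma^2 the irreducible noise in y = f(x) + \varepsilon) is:

\mathbb{E}\bigl[(y - \hat f(x))^2\bigr]
  = \underbrace{\bigl(\mathbb{E}[\hat f(x)] - f(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\Bigl[\bigl(\hat f(x) - \mathbb{E}[\hat f(x)]\bigr)^2\Bigr]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}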