This presentation is an attempt to explain random forest and gradient boosting methods in layman's terms, with many real-life examples related to the concepts
4. ANATOMY OF A DECISION TREE
• Trees that predict categorical results are called decision trees
• At each node, a certain set of rules must be satisfied
• The output at each node is a Boolean (True/False)
• Splitting is the process of dividing a node into two or more sub-nodes
• The root node represents the entire population
• When a sub-node splits into further sub-nodes, it is a decision node
• Nodes that do not split are called terminal nodes (leaf nodes)
(Figure: ROOT NODE → DECISION NODE → LEAF NODE)
Decision tree for a regression dataset
X[i] :- Input variables in the dataset
MSE :- Mean squared error of the samples in a node
Samples :- Total number of samples in a node
Value :- Average value of the output variable over the samples in a node
30-03-2019
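The node attributes above can be seen directly on a fitted tree. A minimal sketch using scikit-learn (assumed to be installed); the toy data is made up for illustration:

```python
# Sketch: fit a small regression tree and inspect the node attributes
# described above (MSE per node, samples, value).
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.2])

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

print(tree.tree_.node_count)   # total nodes: root + decision + leaf nodes
print(export_text(tree))       # the rule, samples and value at each node
```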
5. DECISION TREES FOR CLASSIFICATION
Predict whether or not to play tennis based on
Temperature, Humidity, Wind and Outlook
• A good decision tree is one that makes correct predictions on unseen data
• The split at each node is made based on the Gini score
• The best split is the one that yields the lowest Gini score
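A minimal sketch of how the Gini score of a candidate split is computed; the "play tennis" labels and the Outlook grouping are made up for illustration:

```python
# Sketch: Gini impurity of a candidate split, as used to pick the best
# split in a classification tree.
def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return 1.0 - sum((cnt / n) ** 2 for cnt in counts.values())

def split_gini(left, right):
    """Weighted Gini impurity of a split; lower is better."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# "Play tennis" labels after a hypothetical split on Outlook:
left = ["yes", "yes", "yes", "no"]   # e.g. Outlook = sunny
right = ["no", "no", "yes"]          # e.g. Outlook = rainy
print(split_gini(left, right))       # weighted impurity of this split
```

A pure node (only one class) has a Gini score of 0, which is why pure leaves stop splitting.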
6. DECISION TREE FOR REGRESSION
• Regression trees predict continuous values
• The value at a leaf is the average of all the samples in that leaf
• The best split at each node is based on MSE or the weighted average of the standard deviations
Predict the average precipitation based on the
Slope and Elevation of the Himalayan region
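The MSE-based split can be sketched as a search over candidate thresholds; the elevation/precipitation numbers below are made up for illustration:

```python
# Sketch: choosing the best split threshold on one feature by minimising
# the weighted MSE of the two resulting nodes.
def mse(values):
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(xs, ys):
    """Return (threshold, weighted MSE) of the best binary split."""
    best = (None, float("inf"))
    for t in sorted(set(xs))[1:]:          # candidate thresholds
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        score = (len(left) * mse(left) + len(right) * mse(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

xs = [100, 200, 300, 1000, 1100, 1200]   # elevation in metres (made up)
ys = [2.0, 2.2, 2.1, 6.0, 6.3, 6.1]      # precipitation (made up)
print(best_split(xs, ys))                # split lands between 300 and 1000
```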
7. BEST SPLIT BASED ON STANDARD DEVIATION
Weighted standard deviation
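A sketch of the weighted standard deviation criterion (standard deviation reduction); the numbers are illustrative:

```python
# Sketch: weighted standard deviation of a candidate split. A good split
# produces children whose weighted spread is much lower than the parent's.
import math

def std(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def weighted_std(left, right):
    """Sample-weighted average of the two child standard deviations."""
    n = len(left) + len(right)
    return (len(left) / n) * std(left) + (len(right) / n) * std(right)

parent = [10, 12, 30, 33, 35]
left, right = [10, 12], [30, 33, 35]
print(std(parent))                 # spread before the split
print(weighted_std(left, right))   # much lower after the split
```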
8. HOW LONG TO KEEP SPLITTING?
• Until:
• Leaf nodes are pure – only one class remains
• A maximum depth is reached
• A performance metric is achieved
• Problem:
• Decision trees tend to overfit
• Small changes in the data greatly affect the predictions
• Solution:
• Prune the trees
• Restrict the tree from growing to its fullest
• Maintain a minimum number of samples in leaf nodes
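The stopping criteria above map directly onto tree hyperparameters. A sketch using scikit-learn (assumed available) on synthetic data:

```python
# Sketch: restricting tree growth with a maximum depth and a minimum
# number of samples per leaf, versus letting the tree grow fully.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

full = DecisionTreeRegressor(random_state=0).fit(X, y)
pruned = DecisionTreeRegressor(
    max_depth=4,           # stop at a maximum depth
    min_samples_leaf=10,   # keep a minimum number of samples per leaf
    random_state=0,
).fit(X, y)

print(full.get_depth(), pruned.get_depth())  # the restricted tree is far shallower
```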
9. Pros and Cons of Classification and Regression Trees
Advantages
• Simple to understand, interpret and visualise
• Can handle both numerical and categorical data
• Require less effort in data preparation
• Non-linear relationships between parameters won't affect tree performance
• Implicitly perform feature selection
Disadvantages
• Prone to creating overly complex trees that lack generalisation capability
• Unstable: small variations in the data result in a completely different tree
• Create biased trees if some classes dominate
• No guarantee of returning the globally optimal decision tree
Lower the variance of individual trees with ensemble methods such as Bagging and Boosting
10. ANALOGY OF ENSEMBLE LEARNING
Decision Tree 1, Decision Tree 2, Decision Tree 3
Predicted outputs :- 2.6, 2.95, 3.2
Averaged prediction :- 2.91
Desired output :- 2.85
11. RANDOM FOREST METHOD
Training dataset → Bootstrap sample 1, Bootstrap sample 2, …, Bootstrap sample k
Each bootstrap sample :- In-bag samples (~2/3) and Out-of-bag samples (~1/3)
One tree per bootstrap sample → Prediction 1, Prediction 2, …, Prediction k
Final output :- Average of the k predictions
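A sketch of this workflow with scikit-learn (assumed available): each tree sees a bootstrap sample, and the held-out out-of-bag rows give a free validation score. The data is synthetic.

```python
# Sketch: random forest regressor with out-of-bag evaluation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = X[:, 0] * 2 + np.sin(X[:, 1]) + rng.normal(0, 0.1, size=300)

forest = RandomForestRegressor(
    n_estimators=100,   # k trees, one per bootstrap sample
    oob_score=True,     # evaluate each tree on its out-of-bag rows
    random_state=0,
).fit(X, y)

print(forest.oob_score_)      # R^2 measured on out-of-bag samples only
print(forest.predict(X[:1]))  # the average of the k tree predictions
```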
13. PSEUDO CODE FOR RANDOM FOREST METHOD
1. Randomly select “k” features from the total “m” features
• k < m
2. Among the “k” features, calculate the node “d” using the best split point
3. Split the node into daughter nodes using the best split
4. Repeat steps 1 to 3 until a predefined number of nodes is reached
5. Build a forest by repeating steps 1 to 4 “n” times to create “n” trees
6. Take the test features and use the rules of each randomly created tree to predict the output
7. Count the votes for each predicted target
8. The predicted target with the most votes is the final prediction
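The pseudo-code above can be sketched in a few lines, using scikit-learn trees as the building block (its `max_features` option performs the "k of m features per split" step internally). The data and parameters are illustrative:

```python
# Sketch of the pseudo-code: bootstrap sampling plus random feature
# selection per split, then majority voting across the trees.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest(X, y, n_trees=25, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):                        # steps 1-5: build n trees
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample
        t = DecisionTreeClassifier(max_features="sqrt",  # k < m features/split
                                   random_state=0).fit(X[idx], y[idx])
        trees.append(t)
    return trees

def predict(trees, X):
    votes = np.stack([t.predict(X) for t in trees])  # steps 6-7: collect votes
    # step 8: the most-voted class wins
    return np.array([np.bincount(col).argmax() for col in votes.T.astype(int)])

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
forest = random_forest(X, y)
print(predict(forest, X[:5]))
```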
14. OVERFITTING – HIGH VARIANCE
• High variance
• The outcome can vary even with the tiniest changes in the input
• High-variance models do not generalise well to new data
• An analogy for high variance: “PHYSICAL BALANCE”
• If you are balancing on one foot while standing on solid ground, you’re not likely to fall over.
• But what if there are suddenly 100 mph wind gusts? I bet you’d fall over.
• That’s because your ability to balance on one leg is highly dependent on the factors in your environment.
• If even one thing changes, it could completely throw you off!
• Likewise, if we change any factor in a model’s training data, we could completely change its outcome.
• Such a model is not stable, and therefore not a model we would want to base decisions on.
Don’t fall, lil guy!!
15. APPLICATIONS OF RANDOM FOREST
1. Banking
• To identify loyal and fraudulent customers
• The growth of a bank largely depends on loyal customers
• To identify customers who are not profitable for the bank
• The bank won’t approve loans to such customers once they are identified
2. Medicine
• To identify a disease by analysing a patient’s medical records
• To identify the correct combination of components to validate a medicine
3. E-commerce
• To identify the likelihood of a customer liking a recommended product
16. GRADIENT BOOSTING METHOD
Create a decision tree on the known response values → Make predictions → Calculate the errors (residuals) → Fit a new tree using the errors as response values → Combine the new tree with the trees from previous iterations → repeat
• Tuning parameters:
1. Number of trees
2. Maximum depth of each tree
3. Maximum features at each split
4. Learning rate
5. Minimum samples in a leaf
• Builds decision trees sequentially
• More weight is given to mispredicted values at each stage of training
• Builds more accurate models, as the final output combines the predictions of all the decision trees
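The loop above can be sketched directly: each new tree is fit to the residuals of the current ensemble and combined with a learning rate. Assumes scikit-learn; the data is synthetic.

```python
# Sketch of the gradient boosting loop on a toy regression problem.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) * 3 + rng.normal(0, 0.1, size=200)

n_trees, lr, max_depth = 50, 0.1, 2   # tuning parameters from the slide
F = np.full_like(y, y.mean())         # initial prediction: the mean
trees = []
for _ in range(n_trees):
    residuals = y - F                 # errors of the current model
    t = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
    F += lr * t.predict(X)            # combine with previous iterations
    trees.append(t)

print(np.mean((y - F) ** 2))          # training MSE shrinks as trees are added
```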
18. PSEUDO CODE FOR GRADIENT BOOSTING METHOD
1. Initialise the approximation function: F0(x) = (1/n) ∗ Σ_{i=1..n} y_i (the mean, for squared-error loss)
2. For m = 1 to M do:
• Calculate the pseudo-responses: r_i = y_i − F_{m−1}(x_i)
• Fit a regression tree h_m to the pseudo-responses using the training set
• Calculate the step size γ_m using a line search
• Update the model: F_m(x) = F_{m−1}(x) + γ_m ∗ h_m(x)
3. End of the algorithm: F_M(x) is the final output
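The line-search step has a closed form for squared-error loss: the γ minimising Σ(r − γ·h)² is Σ(r·h)/Σ(h²). A small numeric sketch, with made-up residuals and tree outputs:

```python
# Sketch of "calculate the step size using the line search" for
# squared-error loss.
import numpy as np

residuals = np.array([2.0, -1.0, 0.5, -0.5])  # pseudo-responses r_i (made up)
h = np.array([1.8, -0.9, 0.6, -0.6])          # new tree's outputs h_m(x_i)

# Closed-form line search: gamma minimising sum((r - gamma*h)^2)
gamma = np.dot(residuals, h) / np.dot(h, h)
print(gamma)

# Sanity check: the closed form beats nearby step sizes
loss = lambda g: np.sum((residuals - g * h) ** 2)
assert loss(gamma) <= min(loss(gamma - 0.1), loss(gamma + 0.1))
```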
19. AN EXAMPLE BASED ON GRADIENT BOOSTING
Predict the age of a person based on whether they play video games, enjoy gardening and
their preference in wearing hats
Objective :- Minimize Squared Error
20. LOSS FUNCTION :- SQUARED ERROR
F0 = (1/n) ∗ Σ_{k=1..n} Age_k
PseudoResidual0 = Age − F0
F1 = F0 + gamma0 ∗ h0
SSE = Σ_{k=1..n} (Age_k − F1)²
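These formulas can be walked through numerically. The ages, the young/old leaf grouping standing in for the first tree h0, and gamma0 = 1 are all made up for illustration:

```python
# Sketch: one gradient boosting step on a toy age dataset. F0 is the
# mean age, the pseudo-residuals are Age - F0, and adding h0 cuts the SSE.
import numpy as np

age = np.array([13.0, 14.0, 15.0, 25.0, 35.0, 68.0])

F0 = age.mean()        # F0 = (1/n) * sum(Age)
residual0 = age - F0   # PseudoResidual0 = Age - F0

# Stand-in for the first tree h0: two hypothetical leaves (young vs. old),
# each predicting the mean residual of its group.
h0 = np.where(age < 20,
              residual0[age < 20].mean(),
              residual0[age >= 20].mean())

gamma0 = 1.0
F1 = F0 + gamma0 * h0  # F1 = F0 + gamma0 * h0

sse = lambda pred: np.sum((age - pred) ** 2)
print(sse(F0), sse(F1))  # the SSE drops after one boosting step
```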
Good morning everyone! My name is Shreyas. Now I’ll be giving a presentation on the topic “”
The analogy of ensemble methods can be described by comparing their workflow with the popular show “Who Wants to Be a Millionaire?”. There are three lifelines in this show, as shown. At each stage of training, we build decision trees, and each of them gives an output. Each tree is a weak learner, as its predicted output is only somewhat better than random guessing. Here we combine a set of weak learners into a strong learner by averaging their outputs. The probability of getting a correct answer from a single friend is lower than that of the answer we get from polling the audience.
Data splitting divides the total dataset into training and testing sets. The training set is further divided into bootstrap samples.
With each iteration, the predictions of the combined model become stronger than those of the previous one.