Get to know in detail the terminology of Random Forest, the types of algorithms used in its workflow, and the advantages and disadvantages of its predecessors.
Thanks for your time. If you enjoyed this short article, there are tons of topics in advanced analytics, data science, and machine learning available in my Medium repo: https://medium.com/@bobrupakroy
2. Random Forest
To understand the Random Forest, let's first understand the ensemble model.
An ensemble model is a collection of outputs from multiple models, combined for more accurate predictive modeling.
Ensemble models are in high demand because multiple models can be implemented with little time and effort to achieve high prediction accuracy.
A decision tree is a branching method of one or more if-then-else statements on the predictors.
* It is very useful for data exploration, as it breaks down the dataset into smaller and smaller subsets by association.
Rupak Roy
3. Single Decision Tree
In decision trees the measure that drives the branching of the tree is Information Gain.
Information Gain = Entropy of the parent node – Entropy of the split (children)
Entropy is a measure of how disorganized the system is.
Entropy ranges from 0 to 1. A pure node has an entropy of 0, while a maximally impure node has an entropy of 1.
* The core algorithm for building decision trees is known as ID3, by J. R. Quinlan.
* It uses a top-down approach and can be used to build classification and regression decision trees.
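The Information Gain formula above can be sketched in a few lines of Python. This is an illustrative sketch, not from the original slides; note that the children's entropies are weighted by their share of the parent's records, which is the standard way the "entropy of the split" is computed:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent node minus the weighted entropy of the split."""
    total = len(parent)
    split_entropy = sum(len(c) / total * entropy(c) for c in children)
    return entropy(parent) - split_entropy

print(entropy(["yes"] * 4))        # pure node -> 0.0
print(entropy(["yes", "no"]))      # maximally impure node -> 1.0
print(information_gain(["yes", "yes", "no", "no"],
                       [["yes", "yes"], ["no", "no"]]))  # perfect split -> 1.0
```

A split that separates the classes perfectly recovers the full parent entropy as gain, which is why ID3 greedily picks the attribute with the highest Information Gain.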
4. Decision Making in a Regression Tree
As we know, the main aim in a regression tree is to reduce the standard deviation, while in a classification tree the main aim is to reduce entropy.
Random Forest is well suited to numerical values, and since a random forest is a collection of decision trees, let's first understand how numerical values work in a single decision tree.
For numerical targets, the decision tree, i.e. the regression tree, uses standard deviation scores to do the splitting. The attribute with the largest standard deviation reduction is chosen for the next decision node (a node that can be further split). A branch with a standard deviation greater than 0 usually needs further splitting.
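Standard deviation reduction mirrors Information Gain, with the population standard deviation in place of entropy. A minimal sketch, with hypothetical target values, comparing a good candidate split against a poor one:

```python
import statistics

def sd_reduction(target, groups):
    """Parent standard deviation minus the weighted SD of the split groups."""
    total = len(target)
    weighted = sum(len(g) / total * statistics.pstdev(g) for g in groups)
    return statistics.pstdev(target) - weighted

# Hypothetical numerical target, split two ways by two candidate attributes.
parent = [25, 30, 46, 45, 52, 23, 43, 35]
split_a = [[25, 30, 23, 35], [46, 45, 52, 43]]   # good: similar values grouped
split_b = [[25, 46, 52, 43], [30, 45, 23, 35]]   # poor: values still mixed

# The attribute behind split_a yields the larger reduction, so it is chosen.
print(sd_reduction(parent, split_a) > sd_reduction(parent, split_b))  # True
```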
5. Decision Making in a Regression Tree
* A stop/pruning criterion, typically size based, is provided to stop the tree from growing further, since unchecked growth leads to overfitting problems.
* The process of splitting the decision nodes runs recursively until it reaches the terminal/leaf nodes (nodes that cannot be further split).
When the number of instances at a leaf node is more than one, we calculate their average as the final value for the target.
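The recursive splitting with size- and depth-based stopping can be sketched as a tiny one-feature regression tree. This is a simplified illustration (a single numeric predictor, thresholds tried at the observed values), not the exact algorithm from the slides:

```python
import statistics

def build_tree(xs, ys, min_size=2, max_depth=3, depth=0):
    """Split where the weighted standard deviation is smallest; stop when
    the node is small, deep, or pure, and predict the leaf average."""
    if depth >= max_depth or len(ys) < min_size or statistics.pstdev(ys) == 0:
        return {"leaf": statistics.mean(ys)}          # average at the leaf
    best = None
    for t in sorted(set(xs))[1:]:                     # candidate thresholds
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        weighted = (len(left) * statistics.pstdev(left)
                    + len(right) * statistics.pstdev(right)) / len(ys)
        if best is None or weighted < best[0]:
            best = (weighted, t)
    if best is None:                                  # no usable split left
        return {"leaf": statistics.mean(ys)}
    t = best[1]
    lx, ly = zip(*[(x, y) for x, y in zip(xs, ys) if x < t])
    rx, ry = zip(*[(x, y) for x, y in zip(xs, ys) if x >= t])
    return {"split": t,
            "left": build_tree(list(lx), list(ly), min_size, max_depth, depth + 1),
            "right": build_tree(list(rx), list(ry), min_size, max_depth, depth + 1)}

def predict(tree, x):
    while "leaf" not in tree:
        tree = tree["left"] if x < tree["split"] else tree["right"]
    return tree["leaf"]

tree = build_tree([1, 2, 3, 10, 11, 12], [1, 1, 1, 9, 9, 9])
print(predict(tree, 2), predict(tree, 11))   # 1 9
```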
6. Decision Tree Algorithms
ID3, or Iterative Dichotomiser 3, is one of the earliest decision tree algorithms, developed by J. R. Quinlan in 1986.
C4.5 is the next version, also developed by J. R. Quinlan, optimized for continuous and discrete features, with some improvement on the overfitting problem through a bottom-up approach known as pruning.
CART, or Classification & Regression Trees:
* The CART implementation is similar to C4.5; it prunes the tree by imposing a complexity penalty based on the number of leaves in the tree.
* CART uses the Gini method to create binary splits. It is the most commonly used decision tree algorithm.
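The Gini method that CART uses plays the same role as entropy in ID3. A minimal sketch of the impurity measure (my illustration, not from the slides):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions.
    0 for a pure node; 0.5 at worst for a two-class node."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

print(gini(["yes"] * 4))                 # pure node -> 0.0
print(gini(["yes", "yes", "no", "no"]))  # 50/50 node -> 0.5
```

CART evaluates candidate binary splits by the weighted Gini of the two children, exactly as entropy is weighted in the Information Gain calculation.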
7. Advantages of a Single DT
* It is a non-parametric method, i.e. it is independent of the type and size of the underlying population, so it can be used even when the sample size is low. It is therefore very fast and easy to understand and implement.
* It can handle outliers and missing values, therefore requires less data preparation than other machine learning methods, and can be used for both continuous and categorical data types.
Now let's focus on the disadvantages of a decision tree to arrive at a solution.
8. Disadvantages of a Single DT
* As we know, decision trees are easily prone to overfitting, which therefore needs to be controlled by pruning techniques.
* They split continuous numerical variables on ranges of values rather than the actual values, hence they are sometimes not very effective for estimating continuous targets.
* The robustness to outliers and skewness comes at the cost of throwing away some of the information in the dataset.
* When an input variable has too many possible values, those values need to be aggregated into groups, otherwise the result is too many splits, which may lead to poor predictive performance.
These disadvantages of a single decision tree have given rise to ensemble methods.
9. Ensemble Methods
* A collection of several models, in this case a collection of decision trees, is used in order to increase predictive power, and the final score is obtained by aggregating their outputs.
* This is known as an ensemble method in machine learning.
Random Forest, Bagging, and Boosting are the most popular ensemble methods, and they work for both continuous and categorical target variables.
However, the basic functionality remains the same, i.e. the original concept of creating a tree using entropy & information gain.
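The aggregation step can be sketched in a few lines; the prediction values below are hypothetical outputs of three separately trained trees for one record:

```python
import statistics
from collections import Counter

# Hypothetical predictions from three trees for the same record.
tree_predictions = [42.0, 45.0, 39.0]

# Regression: aggregate by averaging the trees' outputs.
final_score = statistics.mean(tree_predictions)
print(final_score)  # 42.0

# Classification: aggregate by majority vote.
votes = ["spam", "spam", "ham"]
final_class = Counter(votes).most_common(1)[0][0]
print(final_class)  # spam
```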
10. Random Forest in Brief
* The goal of a random forest is to improve prediction accuracy by using a collection of un-pruned decision trees combined with rule-based stopping criteria.
So let's understand the goals of random forest in detail.
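The core idea, training each tree on a bootstrap sample and averaging, can be sketched in pure Python. This is a deliberately simplified illustration: one-split stumps stand in for full un-pruned trees, and there is no random feature subsetting, so it shows the bagging half of a random forest under those assumptions:

```python
import random
import statistics

def fit_stump(xs, ys):
    """One-split regression stump: pick the threshold with the smallest
    weighted standard deviation and predict the mean of each side."""
    if len(set(xs)) < 2:                      # degenerate sample: constant model
        mean = statistics.mean(ys)
        return lambda x: mean
    best = None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        score = (len(left) * statistics.pstdev(left)
                 + len(right) * statistics.pstdev(right)) / len(ys)
        if best is None or score < best[0]:
            best = (score, t, statistics.mean(left), statistics.mean(right))
    _, t, lo, hi = best
    return lambda x: lo if x < t else hi

def random_forest(xs, ys, n_trees=25, seed=0):
    """Bagging: fit each tree (here a stump) on a bootstrap sample drawn
    with replacement, then average the trees' predictions."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(ys)) for _ in range(len(ys))]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: statistics.mean(s(x) for s in stumps)

xs = [1, 2, 3, 4, 10, 11, 12, 13]
ys = [5, 5, 6, 5, 20, 21, 20, 19]
model = random_forest(xs, ys)
print(model(2) < model(12))   # True: low-x region predicts low values
```

Because each tree sees a slightly different sample, their individual errors partly cancel when averaged, which is where the accuracy improvement comes from.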