Random Forest Classification is a machine learning technique that aggregates the outcomes of many decision tree classifiers to improve predictive accuracy. It models the relationship between a categorical target variable and one or more independent variables.
4. Terminologies
▪ Target variable: usually denoted by Y, this is the variable being predicted. It is also called the dependent variable, output variable, response variable or outcome variable (e.g., the one highlighted in the red box in the table below).
▪ Predictor: sometimes called an independent variable, this is a variable used to predict the target variable (e.g., the variables highlighted in the green box in the table below).
The predictors highlighted in the green box are the attributes on which the target variable highlighted in the red box (i.e., Churn) depends.
Contract       | Tenure | Internet Service | Churn
Month-to-month | 2      | DSL              | Yes
Two-year       | 72     | Fibre Optic      | No
Month-to-month | 29     | Fibre Optic      | Yes
One-year       | 12     | DSL              | No
Month-to-month | 30     | DSL              | No
5. Terminologies (Continued...)
Feature Importance:
• Feature importance values are used to check the impact of each influencer (predictor) on the target variable.
• The Random Forest Classification algorithm gives an estimate of which variables are important in the classification.
• For instance, when predicting which customers are prone to churn, feature importance identifies which factors determine the rate of attrition (churn).
Target Variable: Churn
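The slides do not say how the importance values are computed. As one possible illustration, the sketch below uses permutation importance: shuffle one predictor column and measure how much accuracy drops. This is a swapped-in technique, not necessarily what the tool does. The rows come from the sample table in the Terminologies section, while `toy_model` is a hypothetical stand-in rule rather than a trained forest.

```python
import random

# Rows (Contract, Tenure, Internet Service, Churn) from the sample table.
ROWS = [
    ("Month-to-month", 2, "DSL", "Yes"),
    ("Two-year", 72, "Fibre Optic", "No"),
    ("Month-to-month", 29, "Fibre Optic", "Yes"),
    ("One-year", 12, "DSL", "No"),
    ("Month-to-month", 30, "DSL", "No"),
]
FEATURES = ["Contract", "Tenure", "Internet Service"]


def toy_model(contract, tenure, internet):
    """Hypothetical stand-in for a trained classifier (not a real forest)."""
    return "Yes" if contract == "Month-to-month" and tenure < 30 else "No"


def accuracy(rows):
    """Fraction of rows whose predicted class matches the actual class."""
    hits = sum(toy_model(*row[:3]) == row[3] for row in rows)
    return hits / len(rows)


def permutation_importance(rows, feature_index, repeats=200, seed=0):
    """Average drop in accuracy after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = accuracy(rows)
    total_drop = 0.0
    for _ in range(repeats):
        column = [row[feature_index] for row in rows]
        rng.shuffle(column)
        shuffled = [
            row[:feature_index] + (value,) + row[feature_index + 1:]
            for row, value in zip(rows, column)
        ]
        total_drop += baseline - accuracy(shuffled)
    return total_drop / repeats


importances = {
    name: permutation_importance(ROWS, index)
    for index, name in enumerate(FEATURES)
}
for name, value in importances.items():
    print(f"{name}: {value:.3f}")
```

Because the stand-in rule never looks at Internet Service, shuffling that column cannot change any prediction, so its importance is zero, while Contract and Tenure both show a positive drop.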
6. Introduction
• Objective:
– It is a statistical technique to explore the relationship between two or more variables (Xi and Y).
• Benefit:
– Random Forest Classification output helps identify the important factors (Xi) impacting the dependent variable (Y) and the nature of the relationship between each of these factors and the dependent variable.
• Model:
– The Random Forest Classification model constructs many trees, each tree votes for a class, and the model outputs the most popular class as the prediction result.
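The voting step described under "Model" can be sketched as a simple majority count. This is a minimal illustration, assuming each tree has already produced a class label. The function name `forest_predict` and the vote list are hypothetical.

```python
from collections import Counter


def forest_predict(tree_votes):
    """Return the most popular class among the trees' votes and its vote share."""
    counts = Counter(tree_votes)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(tree_votes)


# Hypothetical votes from a forest of 20 trees for one customer.
votes = ["Yes"] * 13 + ["No"] * 7
predicted_class, vote_share = forest_predict(votes)
print(predicted_class, vote_share)
```

In this sketch a tie is resolved in favour of the class that appears first in the vote list. A real implementation may break ties differently, or may average per-tree class probabilities instead of counting hard votes.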
7. Example: Random Forest Classification
Let’s conduct the Random Forest Classification analysis on the independent variables Contract, Tenure, Internet Service, Tech Support and Online Security, and the target variable Churn, as shown below:
Churn | Contract       | Tenure | Internet Service | Tech Support        | Online Security
Yes   | Month-to-month | 2      | DSL              | No internet service | Yes
No    | Two-year       | 72     | Fibre optic      | No                  | No
Yes   | Month-to-month | 29     | Fibre optic      | No                  | No Internet Service
No    | One-year       | 12     | DSL              | Yes                 | No
Yes   | Month-to-month | 30     | DSL              | Yes                 | No
Target Variable (Y): Churn
Independent variables (Xi): Contract, Tenure, Internet Service, Tech Support, Online Security
Classification Evaluation Metric
Accuracy             | 78.6%
Classification Error | 21.4%
Model is an excellent fit as Accuracy > 75%
• Classification Accuracy:
○ A crucial criterion for assessing model performance.
○ A model with prediction accuracy > 75% is considered useful.
• Classification Error = 100% - Accuracy = 100% - 78.6% = 21.4%
○ Indicates that there is a 21.4% chance of a record being misclassified.
8. Standard Input/Tuning Parameters & Sample UI
Step 1: Select the target variable from the list of columns: Contract, Churn, Online Security, Tenure, Tech Support, Internet Service.
Step 2: Select the predictor variable(s) from the same list. More than one predictor can be selected.
Step 3: Set the tuning parameters. By default, these parameters should be set to the values mentioned:
● Number of Trees = 20 (range for no. of trees: 1-128)
● Depth of Trees = 20 (range for max depth: 1-30)
Step 4: Display the output window containing the following:
● Scatter Plot
● Dimension Contribution
● Dimension Counts By Percentage
● Average Measures by Target Classes
Note:
▪ The selection of predictors depends on business knowledge and on the correlation between the target variable and the predictors.
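The defaults and ranges listed in Step 3 could be enforced with a small validation layer like the one sketched below. The class name `ForestParams` and its field names are hypothetical and are not taken from the product. Only the default values (20 and 20) and the ranges (1-128 and 1-30) come from the slide.

```python
from dataclasses import dataclass

# Allowed ranges as listed in Step 3.
TREE_RANGE = (1, 128)
DEPTH_RANGE = (1, 30)


def _check(name, value, bounds):
    """Raise ValueError when a parameter falls outside its allowed range."""
    low, high = bounds
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside the allowed range {low}-{high}")


@dataclass(frozen=True)
class ForestParams:
    """Hypothetical container for the two tuning parameters."""
    num_trees: int = 20   # default from the slide
    max_depth: int = 20   # default from the slide

    def __post_init__(self):
        _check("num_trees", self.num_trees, TREE_RANGE)
        _check("max_depth", self.max_depth, DEPTH_RANGE)


defaults = ForestParams()
custom = ForestParams(num_trees=64, max_depth=10)
print(defaults, custom)
```

Validating at construction time means an out-of-range value such as `ForestParams(num_trees=200)` fails immediately with a `ValueError`, before any model is trained.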
9. Sample Output: 1. Interpretation
Influencer’s Importance (Target Variable: Churn)
The influencer’s importance chart is used to show the impact of each predictor on the target variable.
10. Sample Output: 2. Model Summary
● Accuracy: Shows the goodness of fit of the model. It lies between 0 and 100, and the closer the value is to 100, the better the model.
● Precision: Proportion of records predicted as a class that actually belong to that class. Generally, higher precision (>70%) indicates that confidence in the predicted class is high.
● Recall/Sensitivity/Hit Rate: Proportion of actual positives that were predicted correctly. Generally, higher recall (>70%) indicates that the model finds most of the records that actually belong to the class.
Class Wise Precision and Recall
Class | Precision | Recall
No    | 79.91%    | 94.23%
Yes   | 70.78%    | 37.1%
Accuracy: 78.6%
Actual versus Predicted Class
           | Predicted No | Predicted Yes
Actual No  | 3503         | 195
Actual Yes | 880          | 507
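To make the definitions concrete, the sketch below derives precision, recall, accuracy and classification error from the counts in the "Actual versus Predicted Class" table. The dictionary layout and function names are illustrative choices. Values computed from these counts land close to, though not exactly on, the percentages reported in the precision and recall table, so the two tables may come from slightly different runs.

```python
# Counts from the "Actual versus Predicted Class" table,
# keyed by (actual class, predicted class).
CONFUSION = {
    ("No", "No"): 3503, ("No", "Yes"): 195,
    ("Yes", "No"): 880, ("Yes", "Yes"): 507,
}
CLASSES = ["No", "Yes"]


def precision(label):
    """Correct predictions of `label` divided by all predictions of `label`."""
    predicted = sum(CONFUSION[(actual, label)] for actual in CLASSES)
    return CONFUSION[(label, label)] / predicted


def recall(label):
    """Correct predictions of `label` divided by all actual `label` records."""
    actual = sum(CONFUSION[(label, predicted)] for predicted in CLASSES)
    return CONFUSION[(label, label)] / actual


def accuracy():
    """Correctly classified records divided by all records."""
    correct = sum(CONFUSION[(label, label)] for label in CLASSES)
    return correct / sum(CONFUSION.values())


for label in CLASSES:
    print(f"{label}: precision={precision(label):.2%}, recall={recall(label):.2%}")
print(f"accuracy={accuracy():.2%}, error={1 - accuracy():.2%}")
```

Note how the "Yes" class has acceptable precision but low recall: most predicted churners really do churn, yet the majority of actual churners are missed.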
11. Sample Output: 3. Predicted Class & Probability
Churn | Contract       | Online Security     | Tech Support        | Tenure | Internet Service | Monthly Charges | Probability | Predicted Churn
No    | Month-to-month | No                  | No                  | 3      | Fibre optic      | 90.4            | 0.72        | Yes
No    | Two year       | No internet service | No internet service | 8      | No               | 19.5            | 0.91        | No
No    | One year       | No                  | No                  | 60     | Fibre optic      | 100.5           | 0.77        | No
No    | Two year       | No internet service | No internet service | 66     | No               | 20.55           | 0.93        | No
No    | One year       | Yes                 | Yes                 | 27     | DSL              | 81.7            | 0.92        | No
No    | Month-to-month | No                  | No                  | 12     | Fibre optic      | 79.95           | 0.69        | Yes
The data output will contain a predicted class column along with the probability of the prediction.
12. Interpretation of Important Model Summary Statistics
Accuracy:
• Accuracy > 75% indicates that the model fits the provided data well and the predicted values are reasonably accurate.
• Accuracy < 75% indicates that the model does not fit the provided data well and the predicted values are likely to be inaccurate, with a high chance of error.
Precision:
• Proportion of records predicted as a class that actually belong to that class. Generally, higher precision (>70%) indicates that confidence in the predicted class is high.
Recall:
• Proportion of actual positives that were predicted correctly. Generally, higher recall (>70%) indicates that the model finds most of the records that actually belong to the class.
Feature Importance:
• Feature importance values are used to check the impact of each influencer (predictor) on the target variable.
13. Interpretation of Plots: Scatter Plot
● This plot is used to see the quality of classification by the model: the less overlap among the classes in the plot, the better the classification.
● We can also visually analyze how a particular class is assigned.
● Scatter plots give an overview of the input data, allowing a user to see general trends for the attributes.
● The graph is plotted against measures within the data.
[Figure: scatter plot of the measures Monthly Charges and Tenure, with points marked by class (No, Yes)]
14. Interpretation of Plots: Dimension Contribution
● This plot is used to display how dimension values are distributed for each class of the target variable.
● For instance, the plot shows how the values of Contract (Month-to-month, One year, Two year) are distributed within each class of the response (Yes, No), that is, the count of each target class (Yes, No) for each Contract value.
15. Interpretation of plots: Dimension Counts by Percentage
● This plot is used to visually analyze how dimension counts are distributed across the target variable classes.
● For instance, the plot shows the churn status, making it possible to see whether a particular target class has relatively more records of a particular dimension value.
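The counts behind the Dimension Contribution and Dimension Counts by Percentage plots can be sketched with a simple tally. The pairs below come from the sample table in the Terminologies section. Computing the percentage as the share of each contract type within a churn class is an assumption about how the plot is normalised.

```python
from collections import Counter

# (Contract, Churn) pairs from the sample table in the Terminologies section.
PAIRS = [
    ("Month-to-month", "Yes"),
    ("Two-year", "No"),
    ("Month-to-month", "Yes"),
    ("One-year", "No"),
    ("Month-to-month", "No"),
]

# Count of records per (dimension value, target class).
counts = Counter(PAIRS)
# Total number of records in each target class.
class_totals = Counter(churn for _, churn in PAIRS)

# Share of each contract type within each churn class, in percent
# (assumed normalisation: percentages sum to 100 within a class).
percentages = {
    (contract, churn): 100 * n / class_totals[churn]
    for (contract, churn), n in counts.items()
}

for (contract, churn), n in sorted(counts.items()):
    share = percentages[(contract, churn)]
    print(f"Churn={churn:<3} {contract:<15} count={n} share={share:.1f}%")
```

In this small sample every churned customer is on a Month-to-month contract, which is the kind of pattern these two plots are meant to surface.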
16. Interpretation of Plots: Average Measures by Target Class
● This plot is used to visually analyze how average measures are distributed across target variable classes.
● For instance, the plot above shows how average Tenure is distributed within each Churn status.
[Figure: Avg(Tenure) and Avg(Monthly Charges) plotted for each Churn class]
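The averages shown in this plot amount to a group-by on the target class. A minimal sketch follows, using the Tenure and Churn columns from the sample table in the Terminologies section. The variable names are illustrative.

```python
from collections import defaultdict
from statistics import mean

# (Tenure, Churn) pairs from the sample table in the Terminologies section.
ROWS = [(2, "Yes"), (72, "No"), (29, "Yes"), (12, "No"), (30, "No")]

# Collect the tenure values belonging to each churn class.
tenure_by_class = defaultdict(list)
for tenure, churn in ROWS:
    tenure_by_class[churn].append(tenure)

# Average tenure per churn class.
avg_tenure = {churn: mean(values) for churn, values in tenure_by_class.items()}
print(avg_tenure)
```

In this sample, customers who churned have a noticeably lower average tenure than those who stayed.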
17. Limitations
● Minimum sample size should be at least 20 cases per independent variable.
● Random Forests can be computationally intensive, so training may be slow on large datasets.
● The main limitation of random forest is that a large number of trees can make the algorithm too slow and ineffective for real-time predictions.
● The model provides very little control over its internal workings.
● Target/independent variables should be normally distributed.
18. Limitations (Continued…)
● A normal distribution is an arrangement of a data set in which most values cluster in the middle of the range and the rest taper off symmetrically towards the extremes. It looks like a bell curve, as shown in figure 1.
● Outliers in the data (target as well as independent variables) can affect the analysis, hence outliers need to be removed.
● Outliers are observations lying outside the overall pattern of the distribution, as shown in figure 2.
Figure 1
Figure 2
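The slides do not specify how outliers should be identified. One common rule of thumb, shown here purely as an assumption, is to flag values lying more than 1.5 times the interquartile range beyond the quartiles. The first six values below are the Monthly Charges from the predicted-class sample output. The final value (950.0) is a hypothetical outlier added for illustration.

```python
from statistics import quantiles

# Monthly Charges from the sample output, plus one hypothetical extreme value.
charges = [90.4, 19.5, 100.5, 20.55, 81.7, 79.95, 950.0]


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences using the interquartile range rule."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


lower, upper = iqr_bounds(charges)
outliers = [value for value in charges if value < lower or value > upper]
cleaned = [value for value in charges if lower <= value <= upper]
print(outliers, cleaned)
```

The multiplier `k=1.5` is a convention rather than a requirement. Whether a flagged value should actually be removed remains a judgement that depends on business knowledge of the data.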
19. Business Use Case 1
• Business Problem: Predict loan default
• Based on historical data related to credit card payments, loan payments, existing loan status and job status, we want to classify the customers into defaulters and non-defaulters.
• Input Data:
• Predictor/independent variables:
• Home ownership status
• Existing loan status
• Occupation
• Account Balance
• Target/dependent variable:
• Default Status
• Business Benefit:
• The predictive model will help us identify whether a customer is likely to fail to repay the loan depending on certain factors, which would lead to easier identification of risky customers and help the bank avert the risk of delinquencies.
20. Business Use Case 2
• Business Problem: Predict quality of Red Wine
• The data is the result of an analysis to determine the quality of red wine based on the chemicals it contains.
• Input Data:
• Predictor/independent variables:
• Citric Acid
• Density
• Residual Sugar
• Chlorides
• Target/dependent variable:
• Quality_Category
• Business Benefit:
• Using random forest classification, we can determine the quality of red wine (high, low) based upon
its influential chemical attributes.
21. Want to Learn More?
Get in touch with us at support@Smarten.com, and do check out the Learning section on Smarten.com.
September 2021