This Edureka Random Forest tutorial covers the basics of the Random Forest machine learning algorithm. It is suited to beginners as well as professionals who want to learn or brush up on Data Science concepts and work through random forest analysis with examples. Below are the topics covered in this tutorial:
1) Introduction to Classification
2) Why Random Forest?
3) What is Random Forest?
4) Random Forest Use Cases
5) How Random Forest Works?
6) Demo in R: Diabetes Prevention Use Case
You can also take the complete structured training; details are here: https://goo.gl/AfxwBc
www.edureka.co/data-science Edureka’s Data Science Certification Training
What Will You Learn Today?
1) Introduction
2) Why Random Forest?
3) What is Random Forest?
4) Random Forest - Example
5) How Random Forest Works?
6) Demo In R: Diabetes Prevention Use Case
Introduction To Classification
Classification is the problem of identifying which of a set of categories a new observation belongs to.
It is a supervised learning problem: the classifier is given a set of already classified examples and learns from them to assign categories to new, unseen examples.
Example: Assigning a given email to the "spam" or "non-spam" category.
Is this A or B?
Types Of Classifiers
Decision Tree
• A decision tree builds classification models in the form of a tree structure.
• It breaks down a dataset into smaller and smaller subsets.
Random Forest
• Random Forest is an ensemble classifier made using many decision tree models.
• Ensemble models combine the results from different models.
Naïve Bayes
• It is a classification technique based on Bayes' Theorem with an assumption of independence among attributes.
Use Case - Credit Risk Detection
To minimize loss, the bank needs a decision rule to predict which applicants should be approved for a loan.
An applicant’s demographic (income, debts, credit history) and socio-economic profiles are considered.
Data science can help banks recognize behavior patterns and provide a complete view of individual customers.
What Is Random Forest?
Random Forest is a versatile algorithm capable of performing both
i) Regression
ii) Classification
It is a type of ensemble learning method.
It is a commonly used predictive modelling and machine learning technique.
Random Forest - Example
Let’s say you want to decide whether or not to watch “Edge of Tomorrow”.
You can decide in one of two ways:
(i) You can ask your best friend.
(ii) You can ask a bunch of friends.
To figure out whether you will like “Edge of Tomorrow”, your best friend will check a few things, such as:
(i) Whether you like Adventure and Action
(ii) Whether you like Emily Blunt
In doing so, your best friend builds a decision tree.
[Figure: "Ask best friend" decision tree. Decision nodes: "Genre - Adventure", "Cast - Emily Blunt", "Is Emily Blunt main lead?". Branches: Yes / No. Leaves: Like / Don’t Like.]
To get more accurate recommendations, you ask a bunch of friends, say #Friend1, #Friend2 and #Friend3, and consider their votes.
Each of them may consider movies of a different genre before deciding.
The majority of the votes decides the final outcome.
In this way you build a random forest out of a group of friends.
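The voting idea above can be sketched in a few lines of R. This is only an illustration: the three friend rules, the `movie` fields and their values are hypothetical and are not taken from the slides. Each friend acts as a tiny hand-written tree, and the final answer is the majority vote.

```r
# Each friend is a tiny hand-written "tree". The rules are hypothetical.
friend1 <- function(movie) if (movie$genre == "Action") "Like" else "Don't Like"
friend2 <- function(movie) if (movie$lead == "Tom Cruise") "Like" else "Don't Like"
friend3 <- function(movie) if (movie$sci_fi) "Don't Like" else "Like"

# Made-up description of the movie we are asking about.
movie <- list(genre = "Action", lead = "Tom Cruise", sci_fi = TRUE)

# Collect one vote per friend and count them.
votes <- c(friend1(movie), friend2(movie), friend3(movie))
tally <- table(votes)

# The class with the most votes is the final decision.
decision <- names(tally)[which.max(tally)]
print(decision)
```

Here two friends vote "Like" and one votes "Don't Like", so the majority decision is "Like".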
[Figure: one decision tree each for Friend 1, Friend 2 and Friend 3. Node labels appearing in the trees: Top Gun, Action movies, Godzilla, Far and Away, Oblivion, Tom Cruise. Branches: Yes / No. Leaves: Like / Don’t Like.]
Random Forest Use Cases
Banking
Identification of risky loan applicants by their probability of defaulting on payments.
Medicine
Identification of at-risk patients and disease trends.
Land Use (remote sensing)
Identification of areas of similar land use.
Marketing
Identifying customer churn.
Random Forest Algorithm
i. Randomly select m features from T, where m ≪ T
ii. For node d, calculate the best split point among the m features
iii. Split the node into two daughter nodes using the best split
iv. Repeat the first three steps until n number of nodes has been reached
v. Build your forest by repeating steps i–iv D times
T: number of features
D: number of trees to be constructed
V: output, the class with the highest vote
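Two parts of the procedure above are easy to sketch in base R: the random draw of m features out of T (step i) and the final vote V. This is not a full implementation. The split search in steps ii and iii is left out, the value of m is an assumption, and the per-tree votes are made up for illustration.

```r
set.seed(42)  # fixed seed so the sketch is repeatable

# Feature names borrowed from the weather example in this tutorial.
features <- c("Outlook", "Humidity", "Wind")
T_total  <- length(features)  # T: number of features
m        <- 2                 # m: features tried per split (assumed value)
D        <- 5                 # D: number of trees to be constructed

# Step i: draw m of the T features at random, without replacement.
# A real forest repeats this draw at every node; here it is done once per tree.
feature_draws <- lapply(seq_len(D), function(tree) sample(features, m))

# V: the class with the highest vote among the D trees.
# These votes are made up; in a real forest each tree produces its own.
tree_votes <- c("Play", "No Play", "Play", "Play", "No Play")
vote_table <- table(tree_votes)
V <- names(vote_table)[which.max(vote_table)]
print(V)
```

With three "Play" votes against two "No Play" votes, V is "Play".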
How Random Forest Works?
Let’s take an example.
We have a dataset consisting of:
• Weather information for the last 14 days
• Whether or not a match was played on each of those days
Using random forest, we need to predict whether the game will happen under these weather conditions:
Outlook = Rain
Humidity = High
Wind = Weak
Play = ?
The first step in random forest is to divide the data into smaller subsets.
The subsets need not be distinct; some of them may overlap.
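One common way to obtain overlapping subsets is to sample the rows with replacement (bootstrap sampling). The slides do not say how the subsets are drawn, so treat the following as a sketch under that assumption. The day labels D1 to D14 follow the naming used in the tutorial's figure, and the number of subsets is illustrative.

```r
set.seed(1)  # fixed seed so the sketch is repeatable

days      <- paste0("D", 1:14)  # the 14 days of weather data
n_subsets <- 3                  # one subset per tree (illustrative)

# Draw each subset by sampling days with replacement, so the same day
# can appear in several subsets and even twice within one subset.
subsets <- lapply(seq_len(n_subsets), function(i) {
  sample(days, size = length(days), replace = TRUE)
})

# Days that the first two subsets have in common.
overlap <- intersect(subsets[[1]], subsets[[2]])
print(sort(unique(subsets[[1]])))
print(overlap)
```

Each tree is then grown on its own subset, which is what makes the trees differ from one another.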
[Figure: three decision trees, built on the subsets {D1, D2, D3}, {D3, D4, D5, D6} and {D7, D8, D9}. The trees split on Overcast, Wind and Humidity, with leaves Play / No Play.]
Features of Random Forest
Among the most accurate learning algorithms
Works well for both classification and regression problems
Runs efficiently on large databases
Requires almost no input preparation
Performs implicit feature selection
Trees can easily be grown in parallel
Offers methods for balancing error in unbalanced data sets
What if we could predict the occurrence of diabetes and take appropriate measures beforehand to prevent it?
Sure! Let me take you through the steps to predict the vulnerable patients.
Demo
Data Acquisition
Divide dataset
Implement model
Visualize
Model Validation
The doctor gets the following data from the patient’s medical history.
We will divide the entire dataset into two subsets:
• Training dataset -> to train the model
• Testing dataset -> to validate the model and make predictions
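A split like the one described above can be sketched in base R. The patient data itself is not shown in this tutorial, so the block builds a synthetic stand-in. Only `glucose_conc`, `is_diabetic` and `diabet_test` are names that appear in the tutorial; `diabet`, `diabet_train`, the other columns and the 70/30 ratio are assumptions made for illustration.

```r
set.seed(123)  # fixed seed so the split is repeatable

# Synthetic stand-in for the patient data (the real dataset is not shown).
n <- 200
diabet <- data.frame(
  glucose_conc = round(rnorm(n, mean = 120, sd = 30)),
  bmi          = round(rnorm(n, mean = 32, sd = 6), 1),  # hypothetical column
  age          = sample(21:80, n, replace = TRUE)        # hypothetical column
)
# Made-up labelling rule so that the outcome depends on glucose_conc.
diabet$is_diabetic <- factor(
  ifelse(diabet$glucose_conc + rnorm(n, sd = 25) > 130, "YES", "NO")
)

# Pick 70% of the row indices at random for training (assumed ratio).
train_idx    <- sample(seq_len(n), size = floor(0.7 * n))
diabet_train <- diabet[train_idx, ]   # used to train the model
diabet_test  <- diabet[-train_idx, ]  # held out for validation

print(dim(diabet_train))
print(dim(diabet_test))
```

Because the test rows are never seen during training, the accuracy measured on them is a fairer estimate of how the model behaves on new patients.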
Before we create the random forest, let’s find the best mtry value using the following commands.
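The commands from the original slide are not reproduced in this text. One plausible way to search for mtry is `tuneRF()` from the `randomForest` package, sketched here on synthetic data. The data, the columns other than `glucose_conc` and `is_diabetic`, and all tuning settings are assumptions, not the tutorial's own values.

```r
library(randomForest)
set.seed(123)

# Synthetic stand-in for the training data (the real dataset is not shown).
n <- 300
diabet_train <- data.frame(
  glucose_conc = rnorm(n, 120, 30),
  bmi          = rnorm(n, 32, 6),                   # hypothetical column
  age          = sample(21:80, n, replace = TRUE),  # hypothetical column
  bp           = rnorm(n, 70, 12)                   # hypothetical column
)
diabet_train$is_diabetic <- factor(
  ifelse(diabet_train$glucose_conc + rnorm(n, 0, 25) > 130, "YES", "NO")
)

predictors <- setdiff(names(diabet_train), "is_diabetic")

# tuneRF grows small forests for several mtry values and records the
# out-of-bag (OOB) error of each. The settings below are illustrative.
tuned <- tuneRF(
  x = diabet_train[, predictors],
  y = diabet_train$is_diabetic,
  ntreeTry   = 200,   # trees per trial forest
  stepFactor = 2,     # mtry is halved or doubled at each step
  improve    = 0.01,  # minimum relative improvement to keep searching
  trace = FALSE, plot = FALSE
)

# Pick the mtry value with the lowest OOB error.
best_mtry <- tuned[which.min(tuned[, "OOBError"]), "mtry"]
print(tuned)
print(best_mtry)
```

The returned matrix has one row per mtry value tried, so the choice can be checked by eye before fitting the final forest.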
Here, we implement random forest in R using the following commands.
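The original commands are not reproduced in this text, so the following is a sketch of how the model might be fitted with `randomForest()`. The object name `diabet_forest` and the columns `glucose_conc` and `is_diabetic` come from the tutorial. The synthetic data, the other columns and the values of `ntree` and `mtry` are assumptions.

```r
library(randomForest)
set.seed(123)

# Synthetic stand-in for the training data (the real dataset is not shown).
n <- 300
diabet_train <- data.frame(
  glucose_conc = rnorm(n, 120, 30),
  bmi          = rnorm(n, 32, 6),                   # hypothetical column
  age          = sample(21:80, n, replace = TRUE),  # hypothetical column
  bp           = rnorm(n, 70, 12)                   # hypothetical column
)
diabet_train$is_diabetic <- factor(
  ifelse(diabet_train$glucose_conc + rnorm(n, 0, 25) > 130, "YES", "NO")
)

# Fit a classification forest: is_diabetic is predicted from all other columns.
diabet_forest <- randomForest(
  is_diabetic ~ .,
  data       = diabet_train,
  ntree      = 500,  # number of trees (assumed value)
  mtry       = 2,    # features tried per split (assumed; use the tuned value in practice)
  importance = TRUE  # keep variable importance measures for later plotting
)

# Printing the model shows the OOB error estimate and a confusion matrix.
print(diabet_forest)
```

Setting `importance = TRUE` costs a little extra time but makes the permutation-based importance available alongside the Gini-based one.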
Let’s see which variables are most important for our model. To plot them, we can use the following commands.
As per the MeanDecreaseGini value, glucose_conc is the most important variable in the model.
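The plotting commands themselves are missing from this text. In the `randomForest` package, variable importance is usually inspected with `importance()` and `varImpPlot()`, as in this sketch on synthetic data. The data and the columns other than `glucose_conc` and `is_diabetic` are assumptions, so the ranking it prints says nothing about the real patient data.

```r
library(randomForest)
set.seed(123)

# Synthetic stand-in for the training data (the real dataset is not shown).
n <- 300
diabet_train <- data.frame(
  glucose_conc = rnorm(n, 120, 30),
  bmi          = rnorm(n, 32, 6),                   # hypothetical column
  age          = sample(21:80, n, replace = TRUE),  # hypothetical column
  bp           = rnorm(n, 70, 12)                   # hypothetical column
)
diabet_train$is_diabetic <- factor(
  ifelse(diabet_train$glucose_conc + rnorm(n, 0, 25) > 130, "YES", "NO")
)

diabet_forest <- randomForest(is_diabetic ~ ., data = diabet_train,
                              ntree = 500, importance = TRUE)

# Importance table: one row per variable, including a MeanDecreaseGini column.
imp <- importance(diabet_forest)

# Sort variables from most to least important by MeanDecreaseGini.
imp_sorted <- imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]
print(imp_sorted)

# Dot chart of the importance measures.
varImpPlot(diabet_forest)
```

A larger MeanDecreaseGini means that splits on that variable reduce node impurity more, averaged over all trees.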
Now we can use our model to predict the output for our testing dataset, using the following code.
# Predict the class (YES/NO) of every record in the testing dataset
pred1_diabet <- predict(diabet_forest, newdata = diabet_test, type = "class")
# Print the predicted classes
pred1_diabet
In the output for our testing dataset:
“YES” means the patient is predicted to be vulnerable to diabetes.
“NO” means the patient is predicted not to be vulnerable to diabetes.
We can create a confusion matrix for the model using the caret library to see how good our model is.
library(caret)
# Cross-tabulate predicted against actual classes and summarize the result
confusionMatrix(table(pred1_diabet, diabet_test$is_diabetic))
Accuracy = 79.66%
The accuracy (or overall success rate) is the proportion of records that the model has classified correctly. A good model should have a high accuracy score.
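As a worked illustration of the definition above, accuracy can be computed by hand from a confusion matrix: the correct predictions sit on the diagonal. The ten labels below are made up for the example and are unrelated to the 79.66% figure reported for the demo.

```r
# Made-up actual and predicted labels for ten records.
actual <- factor(c("YES", "NO", "NO",  "YES", "NO", "YES", "NO", "NO", "YES", "NO"),
                 levels = c("NO", "YES"))
predicted <- factor(c("YES", "NO", "YES", "YES", "NO", "NO",  "NO", "NO", "YES", "NO"),
                    levels = c("NO", "YES"))

# Confusion matrix: rows are predictions, columns are the true classes.
conf_mat <- table(Predicted = predicted, Actual = actual)
print(conf_mat)

# Accuracy = correctly classified records / all records.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)
```

Eight of the ten made-up records are classified correctly, so the accuracy is 0.8.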