Random Forest
Definition
⚫Random forest (or random forests) is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes predicted by the individual trees.
⚫The term comes from "random decision forests," first proposed by Tin Kam Ho of Bell Labs in 1995.
⚫The method combines Breiman's "bagging" idea with random selection of features.
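The "mode of the classes" aggregation can be sketched in a few lines of Python (a toy illustration, not from the slides; `forest_predict` is a hypothetical name):

```python
from collections import Counter

def forest_predict(tree_votes):
    """Return the class predicted by the most trees (the mode)."""
    return Counter(tree_votes).most_common(1)[0][0]

# Five hypothetical trees vote on one sample:
print(forest_predict(["spam", "ham", "spam", "spam", "ham"]))  # prints spam
```

In a real forest each vote would come from pushing the sample down one fully grown tree, as described in the algorithm below.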
DS JAGLI
Decision trees
⚫Decision trees are the individual learners that are combined. They are one of the most popular learning methods, commonly used for data exploration.
⚫One type of decision tree is CART: the classification and regression tree.
⚫CART uses greedy, top-down, binary, recursive partitioning that divides the feature space into sets of disjoint rectangular regions.
⚫Regions should be pure with respect to the response variable.
⚫A simple model is fit in each region: a majority vote for classification, a constant value for regression.
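The greedy partitioning step can be sketched as a brute-force search for the best binary split on a single numeric feature (a minimal illustration assuming Gini impurity as the purity measure; the function names are hypothetical):

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    """Greedy CART-style step: try every threshold on one feature and
    keep the split with the lowest weighted impurity of the two regions."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# A perfectly separable toy feature: threshold 2 yields two pure regions.
print(best_split([1, 2, 3, 4], [0, 0, 1, 1]))  # (2, 0.0)
```

Applying this search recursively to each resulting region produces the disjoint rectangular partition described above.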
Decision trees involve greedy, recursive partitioning.
⚫Simple dataset with two predictors
⚫Greedy, recursive partitioning along TI
and PE
Algorithm
Each tree is constructed using the following algorithm:
1. Let the number of training cases be N, and the number
of variables in the classifier be M.
2. A number m of input variables is used to determine the decision at each node of the tree; m should be much less than M.
3. Choose the training set for this tree by drawing N times with replacement from all N available training cases (i.e. take a bootstrap sample). Use the remaining cases to estimate the error of the tree by predicting their classes.
4. For each node of the tree, randomly choose m
variables on which to base the decision at that node.
Calculate the best split based on these m variables in
the training set.
5. Each tree is fully grown and not pruned (as may be
done in constructing a normal tree classifier).
For prediction, a new sample is pushed down each tree and assigned the label of the training samples in the terminal node it ends up in. This procedure is iterated over all trees in the ensemble, and the majority vote (or average, for regression) over all trees is reported as the random forest prediction.
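Steps 3 and 4 above, bootstrap sampling and the per-node random feature subset, can be sketched as follows (a toy illustration; the function names are hypothetical):

```python
import random

def bootstrap_sample(n, rng):
    """Step 3: draw n case indices with replacement; the unpicked cases
    form the out-of-bag (OOB) set used to estimate the tree's error."""
    picks = [rng.randrange(n) for _ in range(n)]
    oob = [i for i in range(n) if i not in set(picks)]
    return picks, oob

def node_features(all_features, m, rng):
    """Step 4: at each node, consider only m randomly chosen variables."""
    return rng.sample(all_features, m)

rng = random.Random(0)
in_bag, oob = bootstrap_sample(10, rng)
# Every case is either in-bag or out-of-bag for this tree, never both:
assert not set(in_bag) & set(oob)
```

Each tree in the forest repeats this pair of randomizations, which is what decorrelates the trees and makes their combined vote effective.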
Algorithm flow chart
⚫For computer scientists: (flowchart figure not reproduced in this text extract)
Random Forest – practical considerations
⚫Splits are chosen according to a purity measure:
⚫ e.g. squared error (regression); Gini index or deviance (classification)
⚫How to select the number of trees?
⚫Build trees until the error no longer decreases
⚫How to select m?
⚫Try the recommended default (√M for classification, M/3 for regression), half of it, and twice it, and pick the best.
Features and Advantages
The advantages of random forest are:
⚫ It is one of the most accurate learning
algorithms available. For many data sets, it
produces a highly accurate classifier.
⚫ It runs efficiently on large databases.
⚫ It can handle thousands of input variables
without variable deletion.
⚫ It gives estimates of which variables are important in the classification.
⚫ It generates an internal unbiased estimate of
the generalization error as the forest building
progresses.
⚫ It has an effective method for estimating
missing data and maintains accuracy when a
large proportion of the data are missing.
Features and Advantages
⚫ It has methods for balancing error in data sets with unbalanced class populations.
⚫ Generated forests can be saved for future use
on other data.
⚫ Prototypes are computed that give
information about the relation between the
variables and the classification.
⚫ It computes proximities between pairs of cases
that can be used in clustering, locating outliers,
or (by scaling) give interesting views of the
data.
⚫ The capabilities of the above can be extended
to unlabeled data, leading to unsupervised
clustering, data views and outlier detection.
⚫ It offers an experimental method for detecting
variable interactions.
Disadvantages
⚫Random forests have been observed to
overfit for some datasets with noisy
classification/regression tasks.
⚫For data including categorical variables with different numbers of levels, random forests are biased in favor of attributes with more levels. The variable importance scores from random forests are therefore not reliable for this type of data.
RF - Additional information
Estimating the test error:
⚫While growing forest, estimate test error
from training samples
⚫For each tree grown, about one third of the samples (≈36.8% for large N) are not selected in the bootstrap; these are called out-of-bag (OOB) samples
⚫Using OOB samples as input to the
corresponding tree, predictions are made
as if they were novel test samples
⚫Through book-keeping, a majority vote (classification) or average (regression) is computed over all OOB samples from all trees.
⚫This estimated test error is very accurate.
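The OOB fraction follows from the bootstrap itself: each case is missed by a single draw with probability 1 − 1/N, so the chance of being left out of all N draws approaches 1/e ≈ 36.8%. A quick check:

```python
import math

# Probability a given case is never picked in N draws with replacement
N = 1000
p_oob = (1 - 1 / N) ** N
print(round(p_oob, 4))          # close to 1/e
print(round(1 / math.e, 4))     # 0.3679
```

This is why, for large training sets, each tree sees roughly two thirds of the cases and is tested on the remaining third.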
RF - Additional information
Estimating the importance of each predictor:
⚫Denote by ê the OOB estimate of the loss when using the original training set D.
⚫For each predictor xp, where p∈{1,..,k}:
⚫Randomly permute the pth predictor to generate a new set of samples D' = {(y1,x'1),…,(yN,x'N)}
⚫Compute the OOB estimate êp of the prediction error with the new samples
⚫A measure of the importance of predictor xp is êp − ê, the increase in error due to random permutation of the pth predictor
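The permutation-importance recipe above can be sketched directly (a toy illustration with a stand-in predictor; all names are hypothetical and misclassification rate stands in for the OOB loss):

```python
import random

def error_rate(pred, truth):
    """Misclassification loss, standing in for the OOB error estimate."""
    return sum(p != t for p, t in zip(pred, truth)) / len(truth)

def permutation_importance(predict, X, y, col, rng):
    """Increase in error after randomly permuting one predictor column."""
    base = error_rate(predict(X), y)
    Xp = [row[:] for row in X]      # copy so the original data is untouched
    vals = [row[col] for row in Xp]
    rng.shuffle(vals)               # break the column's link to y
    for row, v in zip(Xp, vals):
        row[col] = v
    return error_rate(predict(Xp), y) - base

# Stand-in "forest" that only ever looks at column 0:
def predict(X):
    return [1 if row[0] > 0 else 0 for row in X]

X = [[1, 5], [-1, 5], [1, -5], [-1, -5]]
y = [1, 0, 1, 0]
rng = random.Random(0)
# Column 1 is ignored by the predictor, so permuting it changes nothing:
print(permutation_importance(predict, X, y, 1, rng))  # 0.0
```

A genuinely informative predictor would instead show a positive increase in error when permuted, which is exactly the êp − ê quantity above.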
Conclusions & summary:
⚫ Fast fast fast!
⚫ RF is fast to build. Even faster to predict!
⚫ Practically speaking, not needing cross-validation for model selection alone can speed training by 10x–100x or more.
⚫ Fully parallelizable … to go even faster!
⚫ Automatic predictor selection from large number of
candidates
⚫ Resistance to overtraining (overfitting)
⚫ Ability to handle data without preprocessing
⚫ data does not need to be rescaled, transformed, or modified
⚫ resistant to outliers
⚫ automatic handling of missing values
⚫ Sample proximities can be used to generate tree-based clusters (cluster identification)
Thank you!
