Machine Learning with Python
Decision Trees
 A supervised learning predictive model that uses a set of binary rules to calculate a target value
 Used for either classification (categorical variables) or regression (continuous variables)
 In this method, the given data is split into two or more homogeneous sets based on the most significant input variable
 The tree-generating algorithm determines:
- which variable to split on at a node
- whether to stop or to split again
- which class to assign to each terminal node
Decision Trees
Sno  Species Name  Body_Temp     Gives_Birth  Type
1    SP101         Cold-Blooded  No           Non-Mammal
2    SP102         Warm-Blooded  No           Non-Mammal
3    SP103         Cold-Blooded  No           Non-Mammal
4    SP104         Warm-Blooded  Yes          Mammal
5    SP105         Warm-Blooded  No           Non-Mammal
6    SP106         Warm-Blooded  No           Non-Mammal
7    SP107         Warm-Blooded  Yes          Mammal
8    SP108         Cold-Blooded  No           Non-Mammal
9    SP109         Cold-Blooded  No           Non-Mammal
10   SP110         Warm-Blooded  Yes          Mammal
11   SP111         Warm-Blooded  Yes          Mammal
12   SP112         Warm-Blooded  No           Non-Mammal
13   SP113         Cold-Blooded  No           Non-Mammal
14   SP114         Warm-Blooded  Yes          Mammal
Decision Trees
The decision tree will be created as follows:
1. Splits are made to partition the data into purer subsets. When the split is made on Body Temperature, all cold-blooded species are found to be non-mammals. So when Body Temperature is "Cold-Blooded", we get a pure subset in which every species is a "Non-Mammal".
2. For "warm-blooded" species we still have both types present, mammals and non-mammals, so the algorithm next splits on the values of Gives_Birth. This time, when the value is "Yes" all Type values are "Mammal", and when the value is "No" all Type values are "Non-Mammal".
3. Nodes represent attributes, while edges represent values.
Decision Tree – An Example
Body Temperature
 ├─ Warm → Gives Birth
 │          ├─ Yes → Mammal
 │          └─ No → Non-Mammal
 └─ Cold → Non-Mammal
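To make this concrete, here is a minimal scikit-learn sketch that fits a tree to the species table above (an illustration, assuming pandas and scikit-learn are installed; note the learned tree may be shallower than the diagram, since in this toy table Gives_Birth alone already separates the classes).

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The 14-row species table from the earlier slide.
data = pd.DataFrame({
    "Body_Temp":   ["Cold-Blooded", "Warm-Blooded", "Cold-Blooded", "Warm-Blooded",
                    "Warm-Blooded", "Warm-Blooded", "Warm-Blooded", "Cold-Blooded",
                    "Cold-Blooded", "Warm-Blooded", "Warm-Blooded", "Warm-Blooded",
                    "Cold-Blooded", "Warm-Blooded"],
    "Gives_Birth": ["No", "No", "No", "Yes", "No", "No", "Yes",
                    "No", "No", "Yes", "Yes", "No", "No", "Yes"],
    "Type":        ["Non-Mammal", "Non-Mammal", "Non-Mammal", "Mammal",
                    "Non-Mammal", "Non-Mammal", "Mammal", "Non-Mammal",
                    "Non-Mammal", "Mammal", "Mammal", "Non-Mammal",
                    "Non-Mammal", "Mammal"],
})

# Binary-encode the categorical predictors, then fit and print the tree.
X = pd.get_dummies(data[["Body_Temp", "Gives_Birth"]], drop_first=True)
y = data["Type"]
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```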
Measures Used for Split
 Gini Index
 Entropy
 Information Gain
Gini
Gini Index: a measure of the inequality of a distribution. If we select two items from a population at random, they must be of the same class, and the probability of this is 1 if the population is pure.
 It works with a categorical target variable ("Success" or "Failure").
 It performs only binary splits.
 The higher the value of Gini, the higher the homogeneity.
 CART (Classification and Regression Trees) uses the Gini method to create binary splits.
The Gini measure is calculated as:
Gini = Σ_j P(j)², where P(j) is the probability of class j at the node. (The Gini impurity used by CART is 1 − Σ_j P(j)².)
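A minimal sketch of this calculation in Python, using only the standard library:

```python
from collections import Counter

def gini_measure(labels):
    """Gini measure as defined above: the sum of squared class probabilities.
    Equals 1.0 for a pure node; the CART impurity is 1 minus this value."""
    n = len(labels)
    return sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_measure(["M", "M", "M"]))       # 1.0  (pure node)
print(gini_measure(["M", "N", "M", "N"]))  # 0.5  (evenly mixed node)
```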
Entropy
Entropy is a way to measure impurity:
Entropy = −Σ_j P(j) log2 P(j)
A less impure node requires less information to describe it, and a more impure node requires more. If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided, it has an entropy of one.
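A matching sketch in Python, checking the two boundary cases just described:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits: zero for a pure node, one for a 50/50 binary split."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["M", "M", "M"]))       # 0.0 (pure node; may print as -0.0)
print(entropy(["M", "N", "M", "N"]))  # 1.0 (equally divided node)
```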
Information Gain
Information Gain is simply a mathematical way to capture the amount of information one gains (or the reduction in randomness) by splitting on a particular attribute.
In a decision tree algorithm, we start at the tree root and split the data on the feature that results in the largest information gain (IG). In other words, IG tells us how important a given attribute is.
The Information Gain (IG) can be defined as follows:
IG(D_p, f) = I(D_p) − (N_left / N_p) · I(D_left) − (N_right / N_p) · I(D_right)
where I is the impurity measure (entropy or the Gini index), D_p, D_left and D_right are the datasets of the parent, left and right child nodes, and N_p, N_left and N_right are the number of samples in each.
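A sketch putting the pieces together, reusing the entropy definition above and the Body_Temp split from the species table (5 mammals and 9 non-mammals overall):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right, impurity=entropy):
    """IG = I(parent) minus the size-weighted impurities of the two children."""
    n = len(parent)
    return (impurity(parent)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))

# Splitting the 14 species on Body_Temp:
parent = ["M"] * 5 + ["N"] * 9   # whole node: 5 mammals, 9 non-mammals
left = ["N"] * 5                 # Cold-Blooded branch: all non-mammals
right = ["M"] * 5 + ["N"] * 4    # Warm-Blooded branch: still mixed
print(information_gain(parent, left, right))  # ≈ 0.30 bits
```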
Boolean Operators with Decision Trees
A AND B

A
 ├─ T → B
 │      ├─ T → +
 │      └─ F → -
 └─ F → -
Boolean Operators with Decision Trees
A OR B

A
 ├─ T → +
 └─ F → B
        ├─ T → +
        └─ F → -
Boolean Operators with Decision Trees
A XOR B

A
 ├─ T → B
 │      ├─ T → -
 │      └─ F → +
 └─ F → B
        ├─ T → +
        └─ F → -
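XOR is the interesting case: no single test separates the classes, so the tree needs two levels. A quick sketch (assuming scikit-learn) that fits a tree to the four truth-table rows:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 0], [0, 1], [1, 0], [1, 1]]  # columns: A, B
y = [0, 1, 1, 0]                      # labels: A XOR B

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
# Should print a depth-2 tree: one split on A, then one split on B per branch.
print(export_text(clf, feature_names=["A", "B"]))
```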
Pros and Cons of Decision Trees
Advantages of Decision Trees:
• Very interpretable and hence easy to understand
• Can be used to identify the most significant variables in your dataset
Disadvantages:
• The model has a very high chance of overfitting
Avoiding Overfitting in Decision Trees
• Overfitting is the key challenge with decision trees.
• If no limit is set, in the worst case the tree will end up putting each observation into its own leaf node.
• It can be avoided by setting constraints on tree size.
Parameters for Decision Trees
The following parameters are used for limiting tree size:
 Minimum samples for a node split: defines the minimum number of observations a node must contain to be considered for splitting.
 Minimum samples for a terminal node (leaf): defines the minimum samples (or observations) required in a terminal node or leaf. Generally, lower values should be chosen for imbalanced-class problems, because the regions in which the minority class is in the majority will be very small.
 Maximum depth of tree: higher depth can cause overfitting.
 Maximum number of terminal nodes.
 Maximum features to consider for a split: as a rule of thumb, the square root of the total number of features works well, but values up to 30-40% of the total number of features are worth checking. (A sketch of the corresponding scikit-learn parameter names follows this list.)
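A hedged sketch mapping these constraints to their scikit-learn names; the numeric values below are illustrative, not recommendations:

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    min_samples_split=20,  # minimum samples for a node split
    min_samples_leaf=10,   # minimum samples for a terminal node (leaf)
    max_depth=5,           # maximum depth of tree
    max_leaf_nodes=32,     # maximum number of terminal nodes
    max_features="sqrt",   # maximum features to consider for a split
    random_state=0,
)
# clf.fit(X_train, y_train)  # X_train / y_train are placeholders for your data
```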
Validation
 When creating a predictive model, we'd like to get an accurate sense of its ability to generalize to unseen data before actually going out and using it on unseen data.
 A complex model may fit the training data extremely closely but fail to generalize to new, unseen data. We can get a better sense of a model's expected performance on unseen data by setting a portion of our training data aside when creating a model, and then using that set-aside data to evaluate the model's performance.
 This technique of setting aside some of the training data to assess a model's ability to generalize is known as validation.
 Holdout validation and cross-validation are two common methods for assessing a model before using it on test data.
Holdout Validation
Holdout validation involves splitting the training data into two parts, a training set and a
validation set, building a model with the training set and then assessing performance with
the validation set.
Some Disadvantages:
 Reserving a portion of the training data for a holdout set means you aren't using all the data at your disposal to build your model. This can lead to suboptimal performance, especially in situations where you don't have much data to work with.
 If you use the same holdout validation set to assess too many different models, you may end up finding a model that fits the validation set well by chance but won't necessarily generalize well to unseen data. (A minimal sketch follows.)
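A minimal holdout-validation sketch, assuming scikit-learn and using the iris dataset as a stand-in for your own data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 30% of the training data as a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))
```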
Cross-Validation
Cross-validation is the process of repeatedly holding out a sample that the model is not trained on, and testing the model on that held-out sample.
To do 10-fold cross-validation:
 Divide the training data equally into 10 sets.
 Train the model on 9 sets, while holding back the 10th set.
 Test the model on the held-back set and save the results.
 Repeat the process 9 more times, changing which set is held back.
 Average the results.
The result is a more reliable estimate of the performance of the algorithm on new data. It is more accurate because the algorithm is trained and evaluated multiple times on different data. The choice of k must allow the size of each test partition to be large enough to be a reasonable sample of the problem, whilst allowing enough repetitions of the train-test evaluation to provide a fair estimate of the algorithm's performance on unseen data. For modest-sized datasets in the thousands or tens of thousands of records, k values of 3, 5 and 10 are common. (A short sketch follows.)
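A short k-fold cross-validation sketch, assuming scikit-learn and again using iris as placeholder data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10-fold cross-validation: train on 9 folds, test on the 10th, repeat.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```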
Ensembling
Ensembling is the process of combining the results of multiple models to solve a given prediction or classification problem.
The three most popular methods for combining the predictions from different models are:
Bagging: Building multiple models (typically of the same type) from different subsamples
of the training dataset. A limitation of bagging is that the same greedy algorithm is used
to create each tree, meaning that it is likely that the same or very similar split points will
be chosen in each tree making the different trees very similar (trees will be correlated).
This, in turn, makes their predictions similar. We can force the decision trees to be
different by limiting the features that the greedy algorithm can evaluate at each split point
when creating the tree. This is called the Random Forest algorithm.
Boosting: Building multiple models (typically of the same type) each of which learns to fix
the prediction errors of a prior model in the sequence of models.
Voting: Building multiple models (typically of differing types) and combining their predictions with simple statistics (such as the mean, or a majority vote). A hedged sketch of all three strategies follows.
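A sketch of how the three strategies look in scikit-learn, using gradient boosting as the boosting example (the estimator choices and counts are illustrative assumptions):

```python
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Bagging: many trees on bootstrap subsamples of the training data.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100)

# Boosting: each model learns to fix the errors of the models before it.
boosting = GradientBoostingClassifier(n_estimators=100)

# Voting: different model types combined by majority vote.
voting = VotingClassifier(
    estimators=[("tree", DecisionTreeClassifier()),
                ("logreg", LogisticRegression(max_iter=1000))],
    voting="hard")
# Each can then be fit with .fit(X, y) like any other scikit-learn model.
```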
Random Forests
Random Forests is an ensemble (bagging) modelling technique that works by building a large number of decision trees. It takes a subset of observations and a subset of variables to build each decision tree, and each tree plays a role in determining the final outcome.
A random forest builds many trees. For each tree, the training data is selected randomly with replacement (a bootstrap sample), and a random subset of the independent variables is considered when forming splits. Each tree gives its own output, and the final output is a function of the individual tree outputs.
The function can be:
 Mean of the individual probability outcomes
 Vote frequency
The important parameters when using RandomForestClassifier in Python are:
n_estimators - sets the number of trees in the random forest.
max_depth - determines when the splitting of each decision tree stops.
min_samples_split - the minimum number of observations a node must hold before it can be split.
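A minimal RandomForestClassifier sketch with these parameters, assuming scikit-learn; the values are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=200,     # number of trees in the forest
    max_depth=5,          # limit on the depth of each tree
    min_samples_split=4,  # minimum observations required to split a node
    random_state=0)
print("mean CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```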
An Example
Let's say we have to decide whether a cricket player is Good or Bad. We can have data such as "average runs scored" against various variables like "Opposition Played", "Country Played", "Matches Won", "Matches Lost" and many others. We can use a decision tree, which splits on only certain attributes, to determine whether this player is Good or Bad. But we can improve our prediction if we build a number of decision trees which split on different combinations of variables, and then use the outcome of each tree to decide our final classification. This will greatly improve on the performance we get from a single tree. Another example is the judgements delivered by a Supreme Court, which usually employs a bench of judges. Each judge gives their own judgment, and the outcome that gets the maximum votes becomes the final judgement.
An Example
Suppose you want to watch a movie 'X'. You ask your friend Amar if you should watch this movie.
Amar asks you a series of questions:
Which movies have you watched before? Which of these movies did you like or dislike? Is X a romantic movie? Does Salman Khan act in movie X?
... and several other questions. Finally, Amar tells you either Yes or No.
Amar is a decision tree for your movie preference.
But Amar may not give you an accurate answer. To be more sure, you ask a couple of friends, Rita, Suneet and Megha, the same question. They each vote on the movie X. (This is an ensemble model, a random forest in this case.)
If you have a similar circle of friends, they may all have the exact same process of questions, so to avoid them all giving the exact same answer, you'll want to give each of them a different sample from your list of movies. You decide to cut up your list and place it in a bag, then randomly draw from the bag, tell your friend whether or not you enjoyed that movie, and then place that sample back in the bag. This means you'll be randomly drawing a subsample from your original list with replacement (bootstrapping your original data).
And you make sure that each of your friends asks a different, random set of questions.
Uses – Random Forest
Variable Selection: One of the best use cases for random forests is feature selection. A byproduct of trying lots of decision tree variations is that you can examine which variables work best or worst in each tree (see the sketch below).
Classification: Random forests are also great for classification, and can be used to make predictions for categories with multiple possible values.
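A minimal sketch of variable selection via impurity-based feature importances, assuming scikit-learn and using the iris data as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(data.data, data.target)

# One impurity-based importance per feature; the values sum to 1.
for name, importance in zip(data.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```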