Classification Tree
Earning is in Learning
Data science and AI Certification Course
Visit: Learnbay.co
Enriching training and learning session…
§ Training Checklist
– Sitting arrangement F2F
– Quality over Quantity
– Everyone to have their own machines for hands-on practice
– Illuminated and happy glowing training room (no candle light dinner ambience)
– Anyone wanting to step out, feel free
– Feel free to ask for breaks
– Feel free to ask the same question again till you understand
– Let me know if you want me to skip Practice Exercises in between the session
– Brief side-talks are okay
– I don’t speak to walls, respect each other
[Diagram: Involvement, Content, Duration; Enriching Training]
Classification Tree
CART
Learning Objectives
§ What is Classification Technique?
§ CHAID, CART, C4.5 Intro
§ Gini Gain Computation
§ Why are Classification Tree algorithms Recursive?
§ What is pre-pruning and post-pruning in Classification Tree?
§ What is Loss?
§ What is Validation? What is Cross-Validation?
§ Why should you avoid over-fitting?
§ Performance Measure
Analytics that are actually used
What is Classification?
The action or process of classifying something
according to shared qualities or characteristics.
Defining Characteristics of each animal classification
§ Mammals – Mammals are vertebrates (backboned animals). Mammals are warm-blooded and have hair. Mammals are able to move around using limbs
§ Birds – Birds are warm-blooded vertebrates, having a body covered with feathers, forelimbs modified into wings, scaly legs, a beak, and no teeth, and bearing young ones in a hard-shelled egg
§ Insects – any of the small invertebrate animals which typically have a well-defined head, thorax, and abdomen, only three pairs of legs, and typically one or two pairs of wings
§ Amphibians – any cold-blooded vertebrate that lives on land but breeds in water
§ Reptiles – class of cold-blooded air-breathing vertebrates with a completely ossified skeleton and a body usually covered with scales or horny plates
§ Fish – a limbless cold-blooded vertebrate animal with gills and fins, living wholly in water
Why Classify?
To Explain (Profile)
Explaining in the classification world is called Profiling
or
To Predict (Classify)
Predicting the class of new records is called Classifying
Win Back Campaign Classification Analysis
Legend: Dud = Dud Accounts (inactive for a long period); W.B. = Win Back; W.B.% = W.B. / Dud. The percentage after each count is that node's share of the root total.
Node types labelled on the slide: Root Node, Internal Node, Leaf Node / Terminal Node.
Total (Root Node): Dud 10,000 (100%); W.B. 3,500 (100%); W.B.% 35.0%
– Inactive <6 Mths: Dud 4,000 (40%); W.B. 2,100 (60%); W.B.% 52.5%
  – LienChrg >5K: Dud 1,550 (16%); W.B. 421 (12%); W.B.% 27.2%
    – AccTypeSAL = TRUE: Dud 275 (3%); W.B. 70 (2%); W.B.% 25.5%
    – AccTypeSAL = FALSE: Dud 1,275 (13%); W.B. 351 (10%); W.B.% 27.5%
      – Gender = Male: Dud 540 (5%); W.B. 300 (9%); W.B.% 55.6%
      – Gender = Female: Dud 735 (7%); W.B. 51 (1%); W.B.% 6.9%
  – LienChrg 1K to 5K: Dud 1,250 (13%); W.B. 601 (17%); W.B.% 48.1%
    – Gender = Female: Dud 450 (5%); W.B. 129 (4%); W.B.% 28.7%
    – Gender = Male: Dud 800 (8%); W.B. 472 (13%); W.B.% 59.0%
      – CntTxnsLastActiveMth <…: Dud 250 (3%); W.B. 35 (1%); W.B.% 14.0%
      – CntTxnsLastActiveMth >=10: Dud 550 (6%); W.B. 437 (12%); W.B.% 79.5%
  – LienChrg <1K: Dud 1,200 (12%); W.B. 1,078 (31%); W.B.% 89.8%
– Inactive 6-12 Mths: Dud 2,574 (26%); W.B. 921 (26%); W.B.% 35.8%
  – AccBalance <1000: Dud 1,234 (12%); W.B. 152 (4%); W.B.% 12.3%
  – AccBalance >=1000: Dud 1,340 (13%); W.B. 769 (22%); W.B.% 57.4%
    – CntTxnsLastActiveMth <10: Dud 311 (3%); W.B. 85 (2%); W.B.% 27.3%
    – CntTxnsLastActiveMth >=10: Dud 1,029 (10%); W.B. 684 (20%); W.B.% 66.5%
– Inactive >12 Mths: Dud 3,426 (34%); W.B. 479 (14%); W.B.% 14.0%
Main issues of classification tree learning
§ Choosing the splitting criterion
– Impurity based criteria
– Information gain
– Statistical measures of association
§ Binary or multiway splits
– Multiway split
– Binary split
§ Finding the right sized tree
– Pre-pruning
– Post-pruning
Popular Classification Techniques
§ CHAID – CHi-squared Automatic Interaction Detector. The “Chi-squared” part of the name arises because the technique essentially involves automatically constructing many cross-tabs and working out the statistical significance of the proportions. The most significant relationships are used to control the structure of a tree diagram
– CHAID is a non-binary decision tree; Recursive Partitioning Algorithm
– Continuous variables must be grouped into a finite number of bins to create categories.
§ CLASSIFICATION AND REGRESSION TREES (CART) are binary decision trees, which split on a single variable at each node.
– The CART algorithm recursively goes through an exhaustive search of all variables and split values to find the optimal splitting rule for each node.
§ C4.5 builds decision trees from a set of training data using the concept of information entropy
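To make "information entropy" concrete, here is a minimal sketch of entropy and information gain for a candidate split. The labels are hypothetical toy data, and the helper names `entropy` and `information_gain` are illustrative only. C4.5 itself normalises this quantity into a gain ratio, which is not shown here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical toy labels (1 = responder, 0 = non-responder)
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left = [1, 1, 1, 0]
right = [1, 0, 0, 0]

gain = information_gain(parent, [left, right])
print(round(entropy(parent), 3), round(gain, 3))
```

A split that leaves the children as mixed as the parent has a gain of zero; the purer the children, the larger the gain.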
CART
CART | Splitting Criteria
§ CART uses the Gini Index as the measure of impurity
§ Gini of a Node: $Gini(t) = 1 - \sum_j [p(j \mid t)]^2$
(NOTE: p( j | t) is the relative frequency of class j at node t).
§ Gini of Split Node is computed as the Weighted Avg Gini of each Node at Split Node level: $Gini(split) = \sum_i \frac{n_i}{n} Gini(i)$
n_i = number of records at child i,
n = total number of records in parent node
§ Gini Gain = Gini(t) – Gini(split)
www.cs.kent.edu/~jin/DM07/ClassificationDecisionTree.ppt
Gini calculations
Split on Gender (N = number of records, T = records with Target = 1):
Root Node (N: 10; T: 4) → M (N: 6; T: 3), F (N: 4; T: 1)
Cust_ID Gender Occupation Age Target
1 M Sal 22 1
2 M Sal 22 0
3 M Self-Emp 23 1
4 M Self-Emp 23 0
5 M Self-Emp 24 1
6 M Self-Emp 24 0
7 F Sal 25 1
8 F Sal 25 0
9 F Sal 26 0
10 F Self-Emp 26 0
Node | Gini Computation Formula | Gini Index
Overall | 1 - ( (4/10)^2 + (6/10)^2 ) | 0.48
Gender = M | 1 - ( (3/6)^2 + (3/6)^2 ) | 0.50
Gender = F | 1 - ( (1/4)^2 + (3/4)^2 ) | 0.375
Gender (split) | (6/10) * 0.50 + (4/10) * 0.375 | 0.45
Gini Gain | Gini (Overall) – Gini (Gender) | 0.03
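The Gender calculation can be checked with a few lines of Python. This is a sketch using only the Gender and Target columns of the 10-customer table; the helper names `gini` and `gini_split` are illustrative, not part of any library.

```python
from collections import Counter

def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(groups):
    """Weighted average Gini of the child nodes of a split."""
    n = sum(len(g) for g in groups)
    return sum(len(g) / n * gini(g) for g in groups)

# (Gender, Target) pairs from the 10-customer table
records = [("M", 1), ("M", 0), ("M", 1), ("M", 0), ("M", 1), ("M", 0),
           ("F", 1), ("F", 0), ("F", 0), ("F", 0)]
targets = [t for _, t in records]

by_gender = {}
for gender, target in records:
    by_gender.setdefault(gender, []).append(target)

gini_overall = gini(targets)
gini_gender = gini_split(list(by_gender.values()))
gini_gain = gini_overall - gini_gender
print(round(gini_overall, 3), round(gini_gender, 3), round(gini_gain, 3))
```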
Gini calculations
Split on Occupation (N = number of records, T = records with Target = 1):
Root Node (N: 10; T: 4) → Sal (N: 5; T: 2), Self-Emp (N: 5; T: 2)
Node | Gini Computation Formula | Gini Index
Overall | 1 - ( (4/10)^2 + (6/10)^2 ) | 0.48
Occ = Sal | 1 - ( (2/5)^2 + (3/5)^2 ) | 0.48
Occ = Self-Emp | 1 - ( (2/5)^2 + (3/5)^2 ) | 0.48
Occupation (split) | (5/10) * 0.48 + (5/10) * 0.48 | 0.48
Gini Gain | Gini (Overall) – Gini (Occupation) | 0.0
Age <=22 <=23 <=24 <=25
Gini(Left) 0.5 0.5 0.5 0.5
Gini(Right) 0.47 0.44 0.38 0
GiniSplit 0.48 0.47 0.45 0.40
GiniGain 0.0 0.01 0.03 0.08
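For a continuous variable such as Age, CART tries each candidate threshold and keeps the one with the largest Gini gain. The sketch below repeats that search on the Age and Target columns of the 10-customer table; the variable names are illustrative, and the binary `gini` helper assumes a 0/1 target.

```python
def gini(labels):
    """Gini index for a 0/1 target; an empty node is treated as pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # share of Target = 1
    return 1.0 - (p ** 2 + (1 - p) ** 2)

ages    = [22, 22, 23, 23, 24, 24, 25, 25, 26, 26]
targets = [1, 0, 1, 0, 1, 0, 1, 0, 0, 0]

parent_gini = gini(targets)
results = {}
# The largest age is skipped: splitting there would leave the right side empty
for threshold in sorted(set(ages))[:-1]:
    left = [t for a, t in zip(ages, targets) if a <= threshold]
    right = [t for a, t in zip(ages, targets) if a > threshold]
    split = (len(left) * gini(left) + len(right) * gini(right)) / len(targets)
    results[threshold] = parent_gini - split

best_threshold = max(results, key=results.get)
for threshold, gain in results.items():
    print(f"Age <= {threshold}: Gini gain = {gain:.3f}")
print("Best split: Age <=", best_threshold)
```

On this data the Age split beats both Gender (gain 0.03) and Occupation (gain 0.0), so it would be chosen at the root.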
Exercise… Compute Gini Gain
Root Node (N: 100; T: 40) → M (N: 25; T: 10), F (N: 75; T: 30)
Visits > 3: Y / N
Sampling…
## Creating Development and Validation Sample
import pandas as pd
## Sampling code (not run here):
## dummy_df = pd.read_csv("/home/utkarsh/Desktop/bank.csv", na_values=['NA'])
## x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)
CTDF_dev = pd.read_csv("datafile/DEV_SAMPLE.csv", sep=",", header=0)
CTDF_holdout = pd.read_csv("datafile/HOLDOUT_SAMPLE.csv", sep=",", header=0)
Note: Separate Dev & Val samples are provided, so we will import them directly rather than use the sampling code.
Decision Tree code to build CART Tree
## Importing the packages needed for CART
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
## Optional imports, only needed to draw the tree:
# import matplotlib.pyplot as plt
# from io import StringIO  # sklearn.externals.six was removed from newer scikit-learn versions
# from IPython.display import Image
# from sklearn.tree import export_graphviz
# import pydotplus
## Calling the Decision Tree function to build the tree
model_dt = DecisionTreeClassifier(max_depth=8, criterion="gini",
                                  min_samples_split=100, min_samples_leaf=10)
Decision Tree control arguments
§ min_samples_split: the minimum number of observations that must exist in a node in order for a split to be attempted.
§ min_samples_leaf: the minimum number of observations in any terminal leaf node. scikit-learn treats the two settings independently (defaults: min_samples_split = 2, min_samples_leaf = 1). The rule that derives one from the other (minsplit = minbucket*3, or minbucket = minsplit/3) belongs to R's rpart and does not apply to DecisionTreeClassifier.
§ max_depth: the maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
§ criterion: the function to measure the quality of a split. It can be “gini” for the Gini impurity and “entropy” for the information gain.
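A small sketch of how these control arguments constrain the fitted tree. It assumes scikit-learn is installed and uses synthetic data from `make_classification` as a stand-in for the course dataset, which is not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data standing in for the course's bank dataset
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)

model_dt = DecisionTreeClassifier(max_depth=8, criterion="gini",
                                  min_samples_split=100, min_samples_leaf=10,
                                  random_state=0)
model_dt.fit(X, y)

tree = model_dt.tree_
is_leaf = tree.children_left == -1           # leaves have no children
leaf_sizes = tree.n_node_samples[is_leaf]
split_sizes = tree.n_node_samples[~is_leaf]  # nodes that were actually split

print("depth:", model_dt.get_depth())
print("leaves:", model_dt.get_n_leaves())
print("smallest leaf:", leaf_sizes.min())
print("smallest split node:", split_sizes.min())
```

Whatever the data, the depth should not exceed 8, no leaf should hold fewer than 10 records, and no node with fewer than 100 records should have been split.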
Loss, Mis-Classification Error and Response Rate
§ Loss is the number of cases mis-classified in a given node
§ Mis-Classification Error is the ratio of the total number of cases mis-classified to the total number of cases
– We are interested in the mis-classification error for the full tree
§ Response Rate is the ratio of the number of responders (Target = 1) to the total number of cases
– We are interested in finding nodes where the response rate is very high
Example tree:
Root Node: # Obs: 14,000; # Target = 1: 1,235; # Target = 0: 12,765
– Split on Holding Period >= 10 (Y / N):
  – # Obs: 9,182; # Target = 1: 443; # Target = 0: 8,739
  – # Obs: 4,818; # Target = 1: 792; # Target = 0: 4,026
    – Split on ABC > X:
      – # Obs: 600; # Target = 1: 400; # Target = 0: 200
      – # Obs: 4,218; # Target = 1: 392; # Target = 0: 3,826
What is the mis-classification error for the above tree?
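One way to work the question through, as a sketch: it assumes each terminal node predicts its majority class, so the loss in a node is the count of the minority class. The three terminal nodes are the ones with 9,182, 600 and 4,218 observations.

```python
# (Target = 1, Target = 0) counts in the three terminal nodes of the example tree
leaves = [(443, 8739), (400, 200), (392, 3826)]

total_cases = sum(ones + zeros for ones, zeros in leaves)

# Loss per leaf: cases that fall outside the leaf's majority class
losses = [min(ones, zeros) for ones, zeros in leaves]
misclassification_error = sum(losses) / total_cases

# Response rate per leaf: share of Target = 1
response_rates = [ones / (ones + zeros) for ones, zeros in leaves]

print("losses:", losses)
print("mis-classification error:", round(misclassification_error, 4))
print("response rates:", [round(r, 3) for r in response_rates])
```

The 600-record node is the one of interest for targeting: its response rate is far above the root's 1,235 / 14,000.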
Plotting the Classification Tree
Let us export the output to PDF format to have a clear view of the tree
Concepts | Greedy Algorithm
Make 31 Paise using any combination of the coins shown (25, 10, 5 and 1 paise in the solutions below)
Optimal solution with the fewest coins: 25 + 5 + 1
What if the 5 paise coin is not there?
Optimal solution with the fewest coins: 10 * 3 + 1
Greedy Algorithm solution: 25 + 1 * 6
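The coin example can be reproduced with a short sketch. The function names are illustrative; `optimal_change` uses dynamic programming as one possible way to find the true minimum. CART is greedy in the same sense: it takes the best split at each node without looking ahead, so the finished tree is not guaranteed to be the best possible tree.

```python
def greedy_change(amount, coins):
    """Always take the largest coin that still fits."""
    used = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            used.append(coin)
            amount -= coin
    return used

def optimal_change(amount, coins):
    """Dynamic programming: fewest coins for every amount up to the target."""
    best = {0: []}
    for value in range(1, amount + 1):
        options = [best[value - c] + [c] for c in coins
                   if c <= value and (value - c) in best]
        if options:
            best[value] = min(options, key=len)
    return best.get(amount)

with_five = greedy_change(31, [25, 10, 5, 1])
greedy_without_five = greedy_change(31, [25, 10, 1])
optimal_without_five = optimal_change(31, [25, 10, 1])

print(with_five)
print(greedy_without_five, len(greedy_without_five), "coins")
print(sorted(optimal_without_five, reverse=True), len(optimal_without_five), "coins")
```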
Concepts | Cross Validation
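§ Cross Validation is built into R's rpart implementation of CART; in scikit-learn it is run separately (for example with cross_val_score)
§ Method to see how well the model performs on unseen data
§ Typically the number of folds is set to 10 (the xval parameter in rpart; cv = 10 in scikit-learn)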
KFoldCV P1 P2 P3 P4 P5 P6 P7 P8 P9 P10
Fold1 Train Train Train Train Train Train Train Train Train Test
Fold2 Train Train Train Train Train Train Train Train Test Train
Fold3 Train Train Train Train Train Train Train Test Train Train
Fold4 Train Train Train Train Train Train Test Train Train Train
Fold5 Train Train Train Train Train Test Train Train Train Train
Fold6 Train Train Train Train Test Train Train Train Train Train
Fold7 Train Train Train Test Train Train Train Train Train Train
Fold8 Train Train Test Train Train Train Train Train Train Train
Fold9 Train Test Train Train Train Train Train Train Train Train
Fold10 Test Train Train Train Train Train Train Train Train Train
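The fold layout in the table can be generated with a few lines of plain Python. This is a sketch with a hypothetical helper `k_fold_indices`; it does not shuffle the records, and it walks the parts from first to last whereas the table starts with P10 as the test part, which makes no difference to the idea.

```python
def k_fold_indices(n_records, k=10):
    """Yield (train_idx, test_idx) pairs; each part is the test set exactly once."""
    indices = list(range(n_records))
    # Spread any remainder over the first few folds so sizes differ by at most 1
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(25, k=10))
for fold_no, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold{fold_no}: test={test_idx}, train size={len(train_idx)}")
```

The model is fitted once per fold on the Train parts and scored on the Test part; the ten scores are then averaged.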
Concepts | Over-fitting
[Chart: Accuracy (0% to 100%) against Tree Size (No. of Nodes, 0 to 100), with one curve for Training Data and one for Test Data]
§ If you grow the tree too large you will run the risk of over-fitting
§ Classification model may not work well on unseen data
How do we avoid Over-fitting?
Stopping Rule: don’t expand a node if the impurity reduction of the best split is below some threshold
Pruning: grow a very large tree and merge back nodes
Concepts | Parsimony Principle & Re-substitution Error
§ Parsimony principle is basic to all science and tells us to choose the simplest scientific explanation that fits the evidence.
§ Resubstitution Error: It measures what fraction of the cases in a node is classified incorrectly if we assign every case to the majority class in that node. It always favours a larger tree
§ To counterbalance the resubstitution error we need a penalty component that favours a smaller tree
Example sub-tree (each node: cases; mis-classified cases; majority class), with splits on SCR <334 (Y / N) and Gender: M,O:
Sub-tree Node: 530; 113; 0
– Node 14: 122; 10; 0
– Node 15: 408; 103; 0
  – Node 30: 388; 90; 0
  – Node 31: 20; 7; 1
Re (pruned) = 113 / 530
Re (leaves) = 107 / 530
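Cost Complexity Pruning
§ “cost-complexity” – a measure of avg. error reduced per leaf
§ Calculate the number of errors for each node if collapsed to a leaf
§ Compare to the errors in the leaves, taking into account the extra nodes used
Example sub-tree (each node: cases; mis-classified cases; majority class), with splits on SCR <334 (Y / N) and Gender: M,O:
Sub-tree Node: 530; 113; 0
– Node 14: 122; 10; 0
– Node 15: 408; 103; 0
  – Node 30: 388; 90; 0
  – Node 31: 20; 7; 1
Break-even penalty $\alpha$ per leaf (pruned sub-tree: 1 leaf; full sub-tree: 3 leaves):
$Re(pruned) + 1 \cdot \alpha = Re(leaves) + 3 \cdot \alpha$
$113/530 + 1 \cdot \alpha = 107/530 + 3 \cdot \alpha$
$\alpha = 3/530 \approx 0.0057$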
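The break-even penalty can be checked with exact fractions. This sketch takes the error counts from the sub-tree example (Node 14, Node 30 and Node 31 are the leaves); the variable names are illustrative.

```python
from fractions import Fraction

n_cases = 530
errors_if_pruned = 113      # sub-tree collapsed to a single leaf
leaf_errors = [10, 90, 7]   # Node 14, Node 30, Node 31

re_pruned = Fraction(errors_if_pruned, n_cases)
re_leaves = Fraction(sum(leaf_errors), n_cases)
extra_leaves = len(leaf_errors) - 1

# Solve Re(pruned) + 1*alpha = Re(leaves) + 3*alpha for alpha
alpha = (re_pruned - re_leaves) / extra_leaves
print(alpha, round(float(alpha), 4))
```

For any per-leaf penalty larger than this value the single-leaf version has the lower penalised error, so the sub-tree would be pruned. scikit-learn exposes a related penalty as the `ccp_alpha` argument of `DecisionTreeClassifier`.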
Pruning
§ Pruning removes branches whose average error reduction per leaf (the cost complexity) does not justify the extra leaves in a Decision Tree.
§ In practice it is a trial-and-error process: the depth of the tree or the number of nodes is reduced while checking that accuracy holds up, so that the tree does not over-fit.
§ Practically, we grow a tree structure and then refine it under chosen constraints to improve the performance and accuracy of the Decision Tree classifier
http://stats.stackexchange.com/questions/92547/r-rpart-cross-validation-and-1-se-rule-why-is-the-column-in-cptable-called-xst
https://stats.stackexchange.com/questions/13471/how-to-choose-the-number-of-splits-in-rpart
Pruned Classification Tree
Model Evaluation
Various measures to see the model performance
§ Error Matrix
§ Gini Coefficient
§ AUC
§ KS
§ Lift Chart
https://www.youtube.com/watch?v=OAl6eAyP-yo
Demo of Rattle interface to build model and generate various model evaluation measures
Confusion Matrix…
Area Under Curve
Sensitivity = True Positive Rate
= True Positive / Total Positive
= a / (a + b)
Specificity = True Negative / Total Negative
= d / (c + d)
False Positive Rate = 1 - Specificity
Classification Matrix:
 | Predicted Y | Predicted N
Actual Y | a | b
Actual N | c | d
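A quick numeric illustration of these ratios, using hypothetical counts for a, b, c and d (they are not taken from the course data):

```python
# Hypothetical counts laid out as in the classification matrix
a, b = 80, 20    # Actual Y: predicted Y, predicted N
c, d = 30, 170   # Actual N: predicted Y, predicted N

sensitivity = a / (a + b)              # true positive rate
specificity = d / (c + d)              # true negative rate
false_positive_rate = 1 - specificity

print("Sensitivity:", round(sensitivity, 3))
print("Specificity:", round(specificity, 3))
print("False Positive Rate:", round(false_positive_rate, 3))
```

Each probability cut-off gives one (False Positive Rate, Sensitivity) pair; plotting these pairs over all cut-offs traces the ROC curve, and AUC is the area under that curve.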
Thank you!!!
Visit: Learnbay.co

More Related Content

Similar to Classification Tree - Cart

www1.cs.columbia.edu
www1.cs.columbia.eduwww1.cs.columbia.edu
www1.cs.columbia.edu
butest
 
Lecture4.pptx
Lecture4.pptxLecture4.pptx
Lecture4.pptx
yasir149288
 
Coaching Development Teams: Teach A Man To Fish
Coaching Development Teams: Teach A Man To FishCoaching Development Teams: Teach A Man To Fish
Coaching Development Teams: Teach A Man To Fish
Lorna Mitchell
 
Barga Data Science lecture 4
Barga Data Science lecture 4Barga Data Science lecture 4
Barga Data Science lecture 4
Roger Barga
 
Clean Code 2
Clean Code 2Clean Code 2
Clean Code 2
Fredrik Wendt
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
StampedeCon
 
Classification: Basic Concepts and Decision Trees
Classification: Basic Concepts and Decision TreesClassification: Basic Concepts and Decision Trees
Classification: Basic Concepts and Decision Trees
sathish sak
 
BAS 250 Lecture 8
BAS 250 Lecture 8BAS 250 Lecture 8
BAS 250 Lecture 8
Wake Tech BAS
 
Data Mining Concepts 15061
Data Mining Concepts 15061Data Mining Concepts 15061
Data Mining Concepts 15061
badirh
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining Concepts
dataminers.ir
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining Concepts
Dung Nguyen
 
B4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearningB4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearning
Hoa Le
 
Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013
Sri Ambati
 
University Course Timetabling by using Multi Objective Genetic Algortihms
University Course Timetabling by using Multi Objective Genetic AlgortihmsUniversity Course Timetabling by using Multi Objective Genetic Algortihms
University Course Timetabling by using Multi Objective Genetic Algortihms
Halil Kaşkavalcı
 
Decision trees
Decision treesDecision trees
Decision trees
Ncib Lotfi
 
Managing Data: storage, decisions and classification
Managing Data: storage, decisions and classificationManaging Data: storage, decisions and classification
Managing Data: storage, decisions and classification
Edward Blurock
 
clustering and dataset
clustering and dataset clustering and dataset
clustering and dataset
PiyushGoyal59383
 
CS632_Lecture_15_updated.pptx
CS632_Lecture_15_updated.pptxCS632_Lecture_15_updated.pptx
CS632_Lecture_15_updated.pptx
MuhammadAbubakar114879
 
Classification & Clustering.pptx
Classification & Clustering.pptxClassification & Clustering.pptx
Classification & Clustering.pptx
ImXaib
 
A Modern Introduction to Decision Tree Ensembles
A Modern Introduction to Decision Tree EnsemblesA Modern Introduction to Decision Tree Ensembles
A Modern Introduction to Decision Tree Ensembles
Ichigaku Takigawa
 

Similar to Classification Tree - Cart (20)

www1.cs.columbia.edu
www1.cs.columbia.eduwww1.cs.columbia.edu
www1.cs.columbia.edu
 
Lecture4.pptx
Lecture4.pptxLecture4.pptx
Lecture4.pptx
 
Coaching Development Teams: Teach A Man To Fish
Coaching Development Teams: Teach A Man To FishCoaching Development Teams: Teach A Man To Fish
Coaching Development Teams: Teach A Man To Fish
 
Barga Data Science lecture 4
Barga Data Science lecture 4Barga Data Science lecture 4
Barga Data Science lecture 4
 
Clean Code 2
Clean Code 2Clean Code 2
Clean Code 2
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
 
Classification: Basic Concepts and Decision Trees
Classification: Basic Concepts and Decision TreesClassification: Basic Concepts and Decision Trees
Classification: Basic Concepts and Decision Trees
 
BAS 250 Lecture 8
BAS 250 Lecture 8BAS 250 Lecture 8
BAS 250 Lecture 8
 
Data Mining Concepts 15061
Data Mining Concepts 15061Data Mining Concepts 15061
Data Mining Concepts 15061
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining Concepts
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining Concepts
 
B4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearningB4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearning
 
Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013
 
University Course Timetabling by using Multi Objective Genetic Algortihms
University Course Timetabling by using Multi Objective Genetic AlgortihmsUniversity Course Timetabling by using Multi Objective Genetic Algortihms
University Course Timetabling by using Multi Objective Genetic Algortihms
 
Decision trees
Decision treesDecision trees
Decision trees
 
Managing Data: storage, decisions and classification
Managing Data: storage, decisions and classificationManaging Data: storage, decisions and classification
Managing Data: storage, decisions and classification
 
clustering and dataset
clustering and dataset clustering and dataset
clustering and dataset
 
CS632_Lecture_15_updated.pptx
CS632_Lecture_15_updated.pptxCS632_Lecture_15_updated.pptx
CS632_Lecture_15_updated.pptx
 
Classification & Clustering.pptx
Classification & Clustering.pptxClassification & Clustering.pptx
Classification & Clustering.pptx
 
A Modern Introduction to Decision Tree Ensembles
A Modern Introduction to Decision Tree EnsemblesA Modern Introduction to Decision Tree Ensembles
A Modern Introduction to Decision Tree Ensembles
 

More from Learnbay Datascience

Top data science projects
Top data science projectsTop data science projects
Top data science projects
Learnbay Datascience
 
Python my SQL - create table
Python my SQL - create tablePython my SQL - create table
Python my SQL - create table
Learnbay Datascience
 
Python my SQL - create database
Python my SQL - create databasePython my SQL - create database
Python my SQL - create database
Learnbay Datascience
 
Python my sql database connection
Python my sql   database connectionPython my sql   database connection
Python my sql database connection
Learnbay Datascience
 
Python - mySOL
Python - mySOLPython - mySOL
Python - mySOL
Learnbay Datascience
 
AI - Issues and Terminology
AI - Issues and TerminologyAI - Issues and Terminology
AI - Issues and Terminology
Learnbay Datascience
 
AI - Fuzzy Logic Systems
AI - Fuzzy Logic SystemsAI - Fuzzy Logic Systems
AI - Fuzzy Logic Systems
Learnbay Datascience
 
AI - working of an ns
AI - working of an nsAI - working of an ns
AI - working of an ns
Learnbay Datascience
 
Artificial Intelligence- Neural Networks
Artificial Intelligence- Neural NetworksArtificial Intelligence- Neural Networks
Artificial Intelligence- Neural Networks
Learnbay Datascience
 
AI - Robotics
AI - RoboticsAI - Robotics
AI - Robotics
Learnbay Datascience
 
Applications of expert system
Applications of expert systemApplications of expert system
Applications of expert system
Learnbay Datascience
 
Components of expert systems
Components of expert systemsComponents of expert systems
Components of expert systems
Learnbay Datascience
 
Artificial intelligence - expert systems
 Artificial intelligence - expert systems Artificial intelligence - expert systems
Artificial intelligence - expert systems
Learnbay Datascience
 
AI - natural language processing
AI - natural language processingAI - natural language processing
AI - natural language processing
Learnbay Datascience
 
Ai popular search algorithms
Ai   popular search algorithmsAi   popular search algorithms
Ai popular search algorithms
Learnbay Datascience
 
AI - Agents & Environments
AI - Agents & EnvironmentsAI - Agents & Environments
AI - Agents & Environments
Learnbay Datascience
 
Artificial intelligence - research areas
Artificial intelligence - research areasArtificial intelligence - research areas
Artificial intelligence - research areas
Learnbay Datascience
 
Artificial intelligence composed
Artificial intelligence composedArtificial intelligence composed
Artificial intelligence composed
Learnbay Datascience
 
Artificial intelligence intelligent systems
Artificial intelligence   intelligent systemsArtificial intelligence   intelligent systems
Artificial intelligence intelligent systems
Learnbay Datascience
 
Applications of ai
Applications of aiApplications of ai
Applications of ai
Learnbay Datascience
 

More from Learnbay Datascience (20)

Top data science projects
Top data science projectsTop data science projects
Top data science projects
 
Python my SQL - create table
Python my SQL - create tablePython my SQL - create table
Python my SQL - create table
 
Python my SQL - create database
Python my SQL - create databasePython my SQL - create database
Python my SQL - create database
 
Python my sql database connection
Python my sql   database connectionPython my sql   database connection
Python my sql database connection
 
Python - mySOL
Python - mySOLPython - mySOL
Python - mySOL
 
AI - Issues and Terminology
AI - Issues and TerminologyAI - Issues and Terminology
AI - Issues and Terminology
 
AI - Fuzzy Logic Systems
AI - Fuzzy Logic SystemsAI - Fuzzy Logic Systems
AI - Fuzzy Logic Systems
 
AI - working of an ns
AI - working of an nsAI - working of an ns
AI - working of an ns
 
Artificial Intelligence- Neural Networks
Artificial Intelligence- Neural NetworksArtificial Intelligence- Neural Networks
Artificial Intelligence- Neural Networks
 
AI - Robotics
AI - RoboticsAI - Robotics
AI - Robotics
 
Applications of expert system
Applications of expert systemApplications of expert system
Applications of expert system
 
Components of expert systems
Components of expert systemsComponents of expert systems
Components of expert systems
 
Artificial intelligence - expert systems
 Artificial intelligence - expert systems Artificial intelligence - expert systems
Artificial intelligence - expert systems
 
AI - natural language processing
AI - natural language processingAI - natural language processing
AI - natural language processing
 
Ai popular search algorithms
Ai   popular search algorithmsAi   popular search algorithms
Ai popular search algorithms
 
AI - Agents & Environments
AI - Agents & EnvironmentsAI - Agents & Environments
AI - Agents & Environments
 
Artificial intelligence - research areas
Artificial intelligence - research areasArtificial intelligence - research areas
Artificial intelligence - research areas
 
Artificial intelligence composed
Artificial intelligence composedArtificial intelligence composed
Artificial intelligence composed
 
Artificial intelligence intelligent systems
Artificial intelligence   intelligent systemsArtificial intelligence   intelligent systems
Artificial intelligence intelligent systems
 
Applications of ai
Applications of aiApplications of ai
Applications of ai
 

Recently uploaded

FinalSD_MathematicsGrade7_Session2_Unida.pptx
FinalSD_MathematicsGrade7_Session2_Unida.pptxFinalSD_MathematicsGrade7_Session2_Unida.pptx
FinalSD_MathematicsGrade7_Session2_Unida.pptx
JennySularte1
 
Andreas Schleicher presents PISA 2022 Volume III - Creative Thinking - 18 Jun...
Andreas Schleicher presents PISA 2022 Volume III - Creative Thinking - 18 Jun...Andreas Schleicher presents PISA 2022 Volume III - Creative Thinking - 18 Jun...
Andreas Schleicher presents PISA 2022 Volume III - Creative Thinking - 18 Jun...
EduSkills OECD
 
欧洲杯下注-欧洲杯下注押注官网-欧洲杯下注押注网站|【​网址​🎉ac44.net🎉​】
欧洲杯下注-欧洲杯下注押注官网-欧洲杯下注押注网站|【​网址​🎉ac44.net🎉​】欧洲杯下注-欧洲杯下注押注官网-欧洲杯下注押注网站|【​网址​🎉ac44.net🎉​】
欧洲杯下注-欧洲杯下注押注官网-欧洲杯下注押注网站|【​网址​🎉ac44.net🎉​】
andagarcia212
 
Simple-Present-Tense xxxxxxxxxxxxxxxxxxx
Simple-Present-Tense xxxxxxxxxxxxxxxxxxxSimple-Present-Tense xxxxxxxxxxxxxxxxxxx
Simple-Present-Tense xxxxxxxxxxxxxxxxxxx
RandolphRadicy
 
Observational Learning
Observational Learning Observational Learning
Observational Learning
sanamushtaq922
 
Creative Restart 2024: Mike Martin - Finding a way around “no”
Creative Restart 2024: Mike Martin - Finding a way around “no”Creative Restart 2024: Mike Martin - Finding a way around “no”
Creative Restart 2024: Mike Martin - Finding a way around “no”
Taste
 
Educational Technology in the Health Sciences
Educational Technology in the Health SciencesEducational Technology in the Health Sciences
Educational Technology in the Health Sciences
Iris Thiele Isip-Tan
 
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.ppt
Level 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.pptLevel 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.ppt
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.ppt
Henry Hollis
 
78 Microsoft-Publisher - Sirin Sultana Bora.pptx
78 Microsoft-Publisher - Sirin Sultana Bora.pptx78 Microsoft-Publisher - Sirin Sultana Bora.pptx
78 Microsoft-Publisher - Sirin Sultana Bora.pptx
Kalna College
 
Diversity Quiz Prelims by Quiz Club, IIT Kanpur
Diversity Quiz Prelims by Quiz Club, IIT KanpurDiversity Quiz Prelims by Quiz Club, IIT Kanpur
Diversity Quiz Prelims by Quiz Club, IIT Kanpur
Quiz Club IIT Kanpur
 
How to Setup Default Value for a Field in Odoo 17
How to Setup Default Value for a Field in Odoo 17How to Setup Default Value for a Field in Odoo 17
How to Setup Default Value for a Field in Odoo 17
Celine George
 
Accounting for Restricted Grants When and How To Record Properly
Accounting for Restricted Grants  When and How To Record ProperlyAccounting for Restricted Grants  When and How To Record Properly
Accounting for Restricted Grants When and How To Record Properly
TechSoup
 
A Visual Guide to 1 Samuel | A Tale of Two Hearts
A Visual Guide to 1 Samuel | A Tale of Two HeartsA Visual Guide to 1 Samuel | A Tale of Two Hearts
A Visual Guide to 1 Samuel | A Tale of Two Hearts
Steve Thomason
 
CapTechTalks Webinar Slides June 2024 Donovan Wright.pptx
CapTechTalks Webinar Slides June 2024 Donovan Wright.pptxCapTechTalks Webinar Slides June 2024 Donovan Wright.pptx
CapTechTalks Webinar Slides June 2024 Donovan Wright.pptx
CapitolTechU
 
Bonku-Babus-Friend by Sathyajith Ray (9)
Bonku-Babus-Friend by Sathyajith Ray  (9)Bonku-Babus-Friend by Sathyajith Ray  (9)
Bonku-Babus-Friend by Sathyajith Ray (9)
nitinpv4ai
 
RESULTS OF THE EVALUATION QUESTIONNAIRE.pptx
RESULTS OF THE EVALUATION QUESTIONNAIRE.pptxRESULTS OF THE EVALUATION QUESTIONNAIRE.pptx
RESULTS OF THE EVALUATION QUESTIONNAIRE.pptx
zuzanka
 
A Free 200-Page eBook ~ Brain and Mind Exercise.pptx
A Free 200-Page eBook ~ Brain and Mind Exercise.pptxA Free 200-Page eBook ~ Brain and Mind Exercise.pptx
A Free 200-Page eBook ~ Brain and Mind Exercise.pptx
OH TEIK BIN
 
HYPERTENSION - SLIDE SHARE PRESENTATION.
HYPERTENSION - SLIDE SHARE PRESENTATION.HYPERTENSION - SLIDE SHARE PRESENTATION.
HYPERTENSION - SLIDE SHARE PRESENTATION.
deepaannamalai16
 
Brand Guideline of Bashundhara A4 Paper - 2024
Brand Guideline of Bashundhara A4 Paper - 2024Brand Guideline of Bashundhara A4 Paper - 2024
Brand Guideline of Bashundhara A4 Paper - 2024
khabri85
 
skeleton System.pdf (skeleton system wow)
skeleton System.pdf (skeleton system wow)skeleton System.pdf (skeleton system wow)
skeleton System.pdf (skeleton system wow)
Mohammad Al-Dhahabi
 

Recently uploaded (20)

FinalSD_MathematicsGrade7_Session2_Unida.pptx
FinalSD_MathematicsGrade7_Session2_Unida.pptxFinalSD_MathematicsGrade7_Session2_Unida.pptx
FinalSD_MathematicsGrade7_Session2_Unida.pptx
 
Andreas Schleicher presents PISA 2022 Volume III - Creative Thinking - 18 Jun...
Andreas Schleicher presents PISA 2022 Volume III - Creative Thinking - 18 Jun...Andreas Schleicher presents PISA 2022 Volume III - Creative Thinking - 18 Jun...
Andreas Schleicher presents PISA 2022 Volume III - Creative Thinking - 18 Jun...
 
欧洲杯下注-欧洲杯下注押注官网-欧洲杯下注押注网站|【​网址​🎉ac44.net🎉​】
欧洲杯下注-欧洲杯下注押注官网-欧洲杯下注押注网站|【​网址​🎉ac44.net🎉​】欧洲杯下注-欧洲杯下注押注官网-欧洲杯下注押注网站|【​网址​🎉ac44.net🎉​】
欧洲杯下注-欧洲杯下注押注官网-欧洲杯下注押注网站|【​网址​🎉ac44.net🎉​】
 
Simple-Present-Tense xxxxxxxxxxxxxxxxxxx
Simple-Present-Tense xxxxxxxxxxxxxxxxxxxSimple-Present-Tense xxxxxxxxxxxxxxxxxxx
Simple-Present-Tense xxxxxxxxxxxxxxxxxxx
 
Observational Learning
Observational Learning Observational Learning
Observational Learning
 
Creative Restart 2024: Mike Martin - Finding a way around “no”

Classification Tree - Cart

  • 1. Classification Tree Earning is in Learning Data science and AI Certification Course Visit: Learnbay.co
  • 2. Enriching training and learning session… § Training Checklist – Sitting arrangement F2F – Quality over Quantity – Everyone to have their own machines for hands-on practice – Illuminated and happy glowing training room (no candle light dinner ambience) – Anyone wanting to step-out, feel free – Feel free to ask for breaks – Feel free to ask same question again till you understand – Let me know if you want me to skip Practice Exercises in between the session – Brief side-talks are okay – I don’t speak to walls, respect each other Involvement Content Duration Enriching Training
  • 4. Learning Objectives § What is Classification Technique? § CHAID, CART, C4.5 Intro § Gini Gain Computation § Why are Classification Tree algorithms Recursive? § What is pre-pruning and post-pruning in Classification Tree? § What is Loss? § What is Validation? What is Cross-Validation? § Why you should avoid over-fitting? § Performance Measure
  • 5. Analytics that are actually used
  • 6. What is Classification? The action or process of classifying something according to shared qualities or characteristics.
  • 7. Defining Characteristics of each animal classification § Mammals – Mammals are vertebrates (backboned animals). Mammals are warm-blooded and have hair. Mammals are able to move around using limbs § Birds – Birds are warm-blooded vertebrates, having a body covered with feathers, forelimbs modified into wings, scaly legs, a beak, and no teeth, and bearing young ones in a hard-shelled egg § Insects – any of the small invertebrate animals which typically have a well-defined head, thorax, and abdomen, only three pairs of legs, and typically one or two pairs of wings § Amphibians – any cold-blooded vertebrate that lives on land but breeds in water § Reptiles – class of cold-blooded air-breathing vertebrates with completely ossified skeleton and a body usually covered with scales or horny plates § Fish – a limbless cold-blooded vertebrate animal with gills and fins, living wholly in water
  • 8. Why Classify? To Explain (Profile): explaining in the classification world is called Profiling. Or To Predict (Classify): predicting the class of new records is called Classifying.
  • 9. Win Back Campaign Classification Analysis [Figure: classification tree with a root node, internal nodes and leaf (terminal) nodes. Root: Dud 10,000, W.B. 3,500, W.B.% 35.0%. First split on inactivity period: Inactive < 6 Mths (Dud 3,426, W.B. 479, W.B.% 14.0%), Inactive 6-12 Mths (Dud 4,000, W.B. 2,100, W.B.% 52.5%), Inactive > 12 Mths (Dud 2,574, W.B. 921, W.B.% 35.8%). Lower-level splits use LienChrg (<1K, 1K to 5K, >5K), AccBalance (<1000, >=1000), AccTypeSAL (TRUE, FALSE), Gender (Male, Female) and CntTxnsLastActiveMth (<10, >=10).] Legend: Dud = Dud Accounts (inactive for a long period); W.B. = Win Back
  • 10. Main issues of classification tree learning § Choosing the splitting criterion – Impurity based criteria – Information gain – Statistical measures of association § Binary or multiway splits – Multiway split – Binary split § Finding the right sized tree – Pre-pruning – Post-pruning
  • 11. Popular Classification Techniques § CHAID - CHi-squared Automatic Interaction Detector. The “Chi-squared” part of the name arises because the technique essentially involves automatically constructing many cross-tabs, and working out statistical significance of the proportions. The most significant relationships are used to control the structure of a tree diagram – CHAID is a non-binary decision tree; Recursive Partitioning Algorithm – Continuous variables must be grouped into a finite number of bins to create categories. § CLASSIFICATION AND REGRESSION TREES (CART) are binary decision trees, which split a single variable at each node. – The CART algorithm recursively goes through an exhaustive search of all variables and split values to find the optimal splitting rule for each node. § C4.5 builds decision trees from a set of training data using the concept of information entropy
  • 14. K2Analytics.co.in CART | Splitting Criteria § CART uses the Gini Index as measure of impurity § Gini of a Node: $Gini(t) = 1 - \sum_j [p(j \mid t)]^2$ (NOTE: p(j | t) is the relative frequency of class j at node t). § Gini of Split Node is computed as the weighted average Gini of each child node at the split: $Gini(split) = \sum_i \frac{n_i}{n} Gini(i)$, where n_i = number of records at child i, n = total number of records in parent node § Gini Gain = Gini(t) – Gini(split) www.cs.kent.edu/~jin/DM07/ClassificationDecisionTree.ppt
  • 15. Gini calculations (split on Gender). Root Node N: 10, T: 4; Gender = M N: 6, T: 3; Gender = F N: 4, T: 1
      Cust_ID | Gender | Occupation | Age | Target
      1       | M      | Sal        | 22  | 1
      2       | M      | Sal        | 22  | 0
      3       | M      | Self-Emp   | 23  | 1
      4       | M      | Self-Emp   | 23  | 0
      5       | M      | Self-Emp   | 24  | 1
      6       | M      | Self-Emp   | 24  | 0
      7       | F      | Sal        | 25  | 1
      8       | F      | Sal        | 25  | 0
      9       | F      | Sal        | 26  | 0
      10      | F      | Self-Emp   | 26  | 0

      Node       | Gini Computation Formula       | Gini Index
      Overall    | 1 - ((4/10)^2 + (6/10)^2)      | 0.48
      Gender = M | 1 - ((3/6)^2 + (3/6)^2)        | 0.50
      Gender = F | 1 - ((1/4)^2 + (3/4)^2)        | 0.375
      Gender     | (6/10) * 0.5 + (4/10) * 0.375  | 0.45
      Gini Gain  | Gini(Overall) - Gini(Gender)   | 0.03
  • 16. Gini calculations (split on Occupation and on Age). Root Node N: 10, T: 4; Occupation = Sal N: 5, T: 2; Occupation = Self-Emp N: 5, T: 2
      Node           | Gini Computation Formula         | Gini Index
      Overall        | 1 - ((4/10)^2 + (6/10)^2)        | 0.48
      Occ = Sal      | 1 - ((2/5)^2 + (3/5)^2)          | 0.48
      Occ = Self-Emp | 1 - ((2/5)^2 + (3/5)^2)          | 0.48
      Occupation     | (5/10) * 0.48 + (5/10) * 0.48    | 0.48
      Gini Gain      | Gini(Overall) - Gini(Occupation) | 0.0

      Age         | <=22 | <=23 | <=24 | <=25
      Gini(Left)  | 0.5  | 0.5  | 0.5  | 0.5
      Gini(Right) | 0.47 | 0.44 | 0.38 | 0
      GiniSplit   | 0.48 | 0.47 | 0.45 | 0.40
      GiniGain    | 0.0  | 0.01 | 0.03 | 0.08
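The Gini calculations on slides 15 and 16 can be reproduced with a few lines of plain Python. This is a minimal sketch, not the course's own code: the rows are transcribed from the 10-customer table, and the helper names `gini` and `gini_gain` are illustrative assumptions.

```python
from collections import Counter

# Rows transcribed from the slide: (gender, occupation, age, target)
ROWS = [
    ("M", "Sal", 22, 1), ("M", "Sal", 22, 0),
    ("M", "Self-Emp", 23, 1), ("M", "Self-Emp", 23, 0),
    ("M", "Self-Emp", 24, 1), ("M", "Self-Emp", 24, 0),
    ("F", "Sal", 25, 1), ("F", "Sal", 25, 0),
    ("F", "Sal", 26, 0), ("F", "Self-Emp", 26, 0),
]


def gini(targets):
    """Gini impurity of a node: 1 - sum_j p(j|t)^2."""
    n = len(targets)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(targets).values())


def gini_gain(rows, key):
    """Gini(parent) minus the weighted average Gini of the child nodes.

    `key` maps a row to the label of the child node it falls into.
    """
    parent = [row[3] for row in rows]
    children = {}
    for row in rows:
        children.setdefault(key(row), []).append(row[3])
    split = sum(len(t) / len(rows) * gini(t) for t in children.values())
    return gini(parent) - split


gain_gender = gini_gain(ROWS, lambda r: r[0])
gain_occupation = gini_gain(ROWS, lambda r: r[1])
gain_age_25 = gini_gain(ROWS, lambda r: r[2] <= 25)
```

On this data the split on Age <= 25 gives the largest gain of the candidates shown on the slides, so a greedy tree builder would pick it over Gender or Occupation.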
  • 17. Exercise… Compute Gini Gain. Root Node N: 100; T: 40. M N: 25; T: 10. F N: 75; T: 30. Visits > 3: Y / N
  • 18. Sampling… Separate Dev & Val samples are provided, so we import them directly rather than use the sampling code.
      ## Creating Development and Validation Sample (sampling code, not used here)
      ## dummy_df = pd.read_csv("/home/utkarsh/Desktop/bank.csv", na_values=['NA'])
      ## x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)
      import pandas as pd
      # header=0 treats the first row of each file as column names
      CTDF_dev = pd.read_csv("datafile/DEV_SAMPLE.csv", sep=",", header=0)
      CTDF_holdout = pd.read_csv("datafile/HOLDOUT_SAMPLE.csv", sep=",", header=0)
  • 19. Decision Tree code to build CART Tree
      ## importing the packages for CART
      from sklearn.model_selection import train_test_split
      from sklearn.tree import DecisionTreeClassifier, export_graphviz
      import matplotlib.pyplot as plt
      from io import StringIO  # sklearn.externals.six is no longer available in recent scikit-learn releases
      from IPython.display import Image
      import pydotplus
      ## calling the Decision Tree function to build the tree
      model_dt = DecisionTreeClassifier(max_depth=8, criterion="gini", min_samples_split=100, min_samples_leaf=10)
  • 20. Decision Tree control arguments § min_samples_split: the minimum number of observations that must exist in a node in order for a split to be attempted. § min_samples_leaf: the minimum number of observations in any terminal leaf node. In scikit-learn the two arguments are independent and default to min_samples_split = 2 and min_samples_leaf = 1; the rule that derives one from the other (minsplit = minbucket*3, or minbucket = minsplit/3) belongs to R's rpart and its minsplit/minbucket arguments. § max_depth: the maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. § criterion: the function to measure the quality of the split. It can be “gini” for the Gini impurity and “entropy” for the information gain.
  • 21. Loss, Mis-Classification Error and Response Rate § Loss is the number of cases mis-classified in a given node § Mis-Classification Error is the ratio of total number of cases mis-classified to total number of cases – We are interested in mis-classification error for the full tree § Response Rate is the ratio of number of responders (Target = 1) to the total number of cases – We are interested in finding nodes where the response rate is very high [Figure: tree. Root Node: # Obs 14,000, # Target = 1: 1,235, # Target = 0: 12,765. The root is split on Holding Period >= 10 (Y/N) into a node with # Obs 9,182 (Target = 1: 443, Target = 0: 8,739) and a node with # Obs 4,818 (Target = 1: 792, Target = 0: 4,026). The 4,818 node is split on ABC > X into a node with # Obs 600 (Target = 1: 400, Target = 0: 200) and a node with # Obs 4,218 (Target = 1: 392, Target = 0: 3,826).] What is the mis-classification error for the above tree?
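The question on slide 21 can be worked through in code. This is a sketch under two assumptions that the slide does not state explicitly: the three leaves are the nodes with 9,182, 600 and 4,218 observations, and each leaf predicts its majority class, so its loss is the count of the minority class. The function names are illustrative.

```python
# (number of observations, number with Target = 1) for each assumed leaf
LEAVES = [(9182, 443), (600, 400), (4218, 392)]
ROOT = (14000, 1235)


def node_loss(n_obs, n_pos):
    """Cases mis-classified when the node predicts its majority class."""
    return min(n_pos, n_obs - n_pos)


def misclassification_error(leaves):
    """Total loss over all leaves divided by the total number of cases."""
    total = sum(n_obs for n_obs, _ in leaves)
    return sum(node_loss(n_obs, n_pos) for n_obs, n_pos in leaves) / total


def response_rate(n_obs, n_pos):
    """Share of responders (Target = 1) in a node."""
    return n_pos / n_obs


tree_error = misclassification_error(LEAVES)   # (443 + 200 + 392) / 14,000
root_error = misclassification_error([ROOT])   # 1,235 / 14,000 with no splits
best_leaf = max(LEAVES, key=lambda leaf: response_rate(*leaf))
```

Under these assumptions the tree mis-classifies 1,035 of 14,000 cases (about 7.4%), against about 8.8% for the unsplit root, and the 600-observation node has the highest response rate (about 67%).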
  • 22. Plotting the Classification Tree. Let us export the output to PDF format to have a clear view of the tree
  • 23. Concepts | Greedy Algorithm. Make 31 Paise using any combination of the above coins. Optimal solution with fewest coins: 25 + 5 + 1. What if the 5 paise coin is not there? Optimal solution with fewest coins: 10 * 3 + 1. Greedy Algorithm solution: 25 + 1 * 6
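The coin example on slide 23 can be checked with a short sketch. The coin picture is not in the transcript, so the denominations 25, 10, 5 and 1 are an assumption inferred from the solutions quoted on the slide; `greedy_change` and `optimal_change` are illustrative names. The greedy routine always takes the largest coin that fits, much as a tree builder takes the best split at each node without looking ahead.

```python
def greedy_change(amount, coins):
    """Repeatedly take the largest coin that still fits."""
    used = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            used.append(coin)
            amount -= coin
    if amount != 0:
        raise ValueError("amount cannot be made with these coins")
    return used


def optimal_change(amount, coins):
    """Fewest coins via dynamic programming over all sub-amounts."""
    best = [[]] + [None] * amount
    for sub in range(1, amount + 1):
        for coin in coins:
            if coin <= sub and best[sub - coin] is not None:
                candidate = best[sub - coin] + [coin]
                if best[sub] is None or len(candidate) < len(best[sub]):
                    best[sub] = candidate
    return best[amount]


with_five = greedy_change(31, [25, 10, 5, 1])       # greedy is optimal here
without_five = greedy_change(31, [25, 10, 1])       # greedy uses seven coins
best_without_five = optimal_change(31, [25, 10, 1]) # optimum uses four coins
```

With the 5 paise coin removed, the greedy answer needs seven coins while the optimum needs four, which is the gap the slide illustrates.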
  • 24. Concepts | Cross Validation § Cross Validation is part of the CART algorithm § Method to see how well the model performs on unseen data § Typically the xval parameter for cross-validation (as in R's rpart) is set to 10
      K-Fold CV | P1    | P2    | P3    | P4    | P5    | P6    | P7    | P8    | P9    | P10
      Fold 1    | Train | Train | Train | Train | Train | Train | Train | Train | Train | Test
      Fold 2    | Train | Train | Train | Train | Train | Train | Train | Train | Test  | Train
      Fold 3    | Train | Train | Train | Train | Train | Train | Train | Test  | Train | Train
      Fold 4    | Train | Train | Train | Train | Train | Train | Test  | Train | Train | Train
      Fold 5    | Train | Train | Train | Train | Train | Test  | Train | Train | Train | Train
      Fold 6    | Train | Train | Train | Train | Test  | Train | Train | Train | Train | Train
      Fold 7    | Train | Train | Train | Test  | Train | Train | Train | Train | Train | Train
      Fold 8    | Train | Train | Test  | Train | Train | Train | Train | Train | Train | Train
      Fold 9    | Train | Test  | Train | Train | Train | Train | Train | Train | Train | Train
      Fold 10   | Test  | Train | Train | Train | Train | Train | Train | Train | Train | Train
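The K-fold layout on slide 24 can be generated with a small sketch in plain Python. It assumes contiguous, near-equal partitions P1 to Pk with no shuffling, and it holds out P10 in fold 1, P9 in fold 2 and so on, as in the slide's table. The function names are illustrative and this is not how any particular library implements cross-validation.

```python
def k_fold_partitions(n_records, k=10):
    """Split record indices 0..n_records-1 into k near-equal partitions."""
    base, extra = divmod(n_records, k)
    partitions, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        partitions.append(list(range(start, start + size)))
        start += size
    return partitions


def k_fold_splits(n_records, k=10):
    """Yield (fold number, train indices, test indices) for each fold."""
    partitions = k_fold_partitions(n_records, k)
    for fold in range(1, k + 1):
        test_pos = k - fold  # fold 1 holds out P10, fold 2 holds out P9, ...
        test = partitions[test_pos]
        train = [i for pos, part in enumerate(partitions)
                 if pos != test_pos for i in part]
        yield fold, train, test


splits = list(k_fold_splits(25, k=10))
```

Each record lands in the test set exactly once across the ten folds, so the averaged test performance uses every record as unseen data once.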
  • 25. Concepts | Over-fitting [Figure: Accuracy (0% to 100%) against Tree Size (No. of Nodes, 0 to 100), with one curve for Training Data and one for Test Data.] § If you grow the tree too large you run the risk of over-fitting § Classification model may not work well on unseen data. How do we avoid Over-fitting? Stopping Rule: don’t expand a node if the impurity reduction of the best split is below some threshold. Pruning: grow a very large tree and merge back nodes
  • 26. Concepts | Parsimony Principle & Re-substitution Error § Parsimony principle is basic to all science and tells us to choose the simplest scientific explanation that fits the evidence. § Resubstitution Error: it measures what fraction of the cases in a node is classified incorrectly if we assign every case to the majority class in that node; it always favours a large tree § To counterbalance the resubstitution error we need a penalty component that favours a smaller tree [Figure: sub-tree with splits SCR < 334 (Y/N) and Gender: M, O. Top node (530; 113; 0) has children Node 14 (122; 10; 0) and Node 15 (408; 103; 0); Node 15 has children Node 30 (388; 90; 0) and Node 31 (20; 7; 1).] Re(pruned) = 113/530; Re(leaves) = 107/530
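  • 27. Cost Complexity Pruning § “cost-complexity” – a measure of avg. error reduced per leaf § Calculate number of errors for each node if collapsed to leaf § Compare to errors in leaves, taking into account more nodes used [Figure: same sub-tree as the previous slide. Top node (530; 113; 0); Node 14 (122; 10; 0); Node 15 (408; 103; 0); Node 30 (388; 90; 0); Node 31 (20; 7; 1); splits SCR < 334 (Y/N) and Gender: M, O.] The pruned sub-tree has 1 leaf and the unpruned sub-tree has 3, so the break-even penalty $\alpha$ satisfies $R_e(\text{pruned}) + 1 \cdot \alpha = R_e(\text{leaves}) + 3 \cdot \alpha$, that is $113/530 + \alpha = 107/530 + 3\alpha$, giving $\alpha = 3/530 \approx 0.0057$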
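The break-even penalty on slide 27 follows from one line of algebra, sketched here with exact fractions. The reading of each node triple as (cases; mis-classified cases; majority class) is an assumption based on the Re values quoted on the slides, and `break_even_alpha` is an illustrative name.

```python
from fractions import Fraction


def break_even_alpha(errors_pruned, errors_leaves, n_leaves, n_total):
    """Solve Re(pruned) + 1*alpha = Re(leaves) + n_leaves*alpha for alpha."""
    error_reduction = Fraction(errors_pruned - errors_leaves, n_total)
    return error_reduction / (n_leaves - 1)


# Sub-tree from the slide: 113 errors if collapsed to one leaf,
# 10 + 90 + 7 = 107 errors across its three leaves, 530 cases in total.
alpha = break_even_alpha(113, 10 + 90 + 7, n_leaves=3, n_total=530)
alpha_as_float = float(alpha)
```

The result is 3/530, about 0.00566. For any penalty per leaf above this value the collapsed node is preferred; below it, the three-leaf sub-tree is kept.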
  • 28. Pruning § Pruning is driven by the average cost complexity reduced per leaf in a Decision Tree. § In practice it is a trial-and-error exercise: reduce the depth of the tree or the number of nodes and check whether accuracy improves without over-fitting. § Practically, we create a tree structure that is refined under certain prior assumptions to improve the performance and accuracy of a Decision Tree classifier http://stats.stackexchange.com/questions/92547/r-rpart-cross-validation-and-1-se-rule-why-is-the-column-in-cptable-called-xst https://stats.stackexchange.com/questions/13471/how-to-choose-the-number-of-splits-in-rpart
  • 30. Model Evaluation. Various measures to see the model performance § Error Matrix § Gini Coefficient § AUC § KS § Lift Chart https://www.youtube.com/watch?v=OAl6eAyP-yo Demo of Rattle interface to build model and generate various model evaluation measures
  • 32. Area Under Curve. Sensitivity = True Positive Rate = True Positive / Total Positive = a / (a + b). Specificity = True Negative / Total Negative = d / (c + d). False Positive Rate = 1 - Specificity
      Classification Matrix | Predicted Y | Predicted N
      Actual Y              | a           | b
      Actual N              | c           | d
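The rates defined on slide 32 translate directly into code. In this sketch the cell names a, b, c and d follow the slide's classification matrix, while the counts passed in are made-up illustration values, not results from the course data.

```python
def classification_rates(a, b, c, d):
    """Rates from a 2x2 classification matrix.

    a = actual Y predicted Y, b = actual Y predicted N,
    c = actual N predicted Y, d = actual N predicted N.
    """
    sensitivity = a / (a + b)   # true positive rate
    specificity = d / (c + d)   # true negative rate
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_positive_rate": 1 - specificity,
    }


# Hypothetical counts, chosen only to illustrate the formulas
rates = classification_rates(a=40, b=10, c=20, d=130)
```

Evaluating these rates at a series of score cut-offs and plotting sensitivity against the false positive rate traces the ROC curve, whose area is the AUC measure named in the slide title.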