SlideShare a Scribd company logo
Evaluating Classifier Performance for Titanic Survivor Dataset!
Raman Kannan
Adjunct NYU Tandon School of Engineering
Tandon OnLine Introduction to Machine Learning CS 6923
Random Forest
Improving Classifier Performance
Process
Preparing the Data – train and testing data set
Baseline Classifier – GLM
Tree – Refine and fine tune,all features on entire dataset
Random Forest – Aggregating Weak Learners to make a strong Learner
– Random subset of features, bootstrapped dataset
Compare Performance – ROC Curve, AUC
Prepare
dataset
Baseline Classifer
D Tree
Random Forest
Training + testing
dataset
Compare
refine
Simplest strategy may sometimes be the best solution
Sophisticated strategies may not always yield the best solution.
Setup objective means to compare
Same training set and testing set
Same measurement criteria
Dataset Preparation
titanic<-
read.csv("http://christianherta.de/lehre/dataScience/machine
Learning/data/titanic-train.csv",header=T)
sm_titantic_3<-titanic[,c(2,3,5,6,10)] #dont need all the columns
sm_titanic_3<-sm_titantic_3[complete.cases(sm_titantic_3),]
set.seed(43)
tst_idx<-sample(714,200,replace=FALSE)
tstdata<-sm_titanic_3[tst_idx,]
trdata<-sm_titanic_3[-tst_idx,]
Running GLM
Create a model using logistic regression
glm_sm_titanic_3<-glm(Survived~.,data=trdata,family=binomial())
Predict using the test dataset and glm model created
glm_predicted<-
predict(glm_sm_titanic_3,tstdata[,2:5],type="response");
How well did GLM perform?
require(ROCR) # use ROCR package
True/False Positives, True/False Negatives
We will use Precision, Recall, and derived measures
Precision (related to false positives)
Recall (related to false negatives)
https://en.wikipedia.org/wiki/Precision_and_recall
Binary Classifier
Precision → (TRUE POSITIVES)/(TP+FP)
Recall → (TP)/(All positives in the sample)
https://uberpython.wordpress.com/2012/01/01/precision-recall-sensitivity-and-specificity/
Calculate
glm_sm_titanic_3<-glm(Survived~.,data=trdata,family=binomial())
glm_predicted<-predict(glm_sm_titanic_3,tstdata[,2:5],type="response");
require(ROCR)
glm_auc_1<-prediction(glm_predicted,tstdata$Survived)
glm_prf<-performance(glm_auc_1, measure="tpr", x.measure="fpr")
glm_slot_fp<-slot(glm_auc_1,"fp")
glm_slot_tp<-slot(glm_auc_1,"tp")
Model with training and Predict with test
Extract fp and tp using ROCR package
glm_fpr3<-unlist(glm_slot_fp)/unlist(slot(glm_auc_1,"n.neg"))
glm_tpr3<-unlist(glm_slot_tp)/unlist(slot(glm_auc_1,"n.pos"))
Calculate fpr and tpr vectors
glm_perf_AUC=performance(glm_auc_1,"auc")
glm_AUC=glm_perf_AUC@y.values[[1]] Calculate Area Under Curve – ROC
Plot the results for GLM
Plot RoC
plot(glm_fpr3,glm_tpr3,
main="ROC Curve from first principles -- raw counts",
xlab="FPR",ylab="TPR")
points(glm_fpr3,glm_fpr3,cex=0.3) # will generate a diagonal plot Let us draw a diagonal and see the lift
text(0.4,0.6,
paste("GLM AUC = ",format(glm_AUC, digits=5, scientific=FALSE))) Let us place AUC on the plot
Let us repeat the steps for individual trees
tree<-rpart(Survived~.,data=trdata)
tree_predicted_prob_03<-
predict(pruned.tree.03,tstdata[,2:5]);
tree_predicted_class_03<-round(tree_predicted_prob_03)
tree_prediction_rocr_03<-prediction(tree_predicted_class_03,tstdata$Survived)
tree_prf_rocr_03<-performance(tree_prediction_rocr_03, measure="tpr", x.measure="fpr")
tree_perf_AUC_03=
performance(tree_prediction_rocr_03,"auc")
Model with training and Predict with test
Extract measures using ROCR package
plot(tree_prf_rocr_03,main="ROC plot cp=0.03(DTREE using rpart)")
text(0.5,0.5,paste("AUC=",format(tree_perf_AUC_03@y.values[[1]],digits=5, scientific=FALSE)))
#Use prune and printcp/plotcp to fine tune the model
# as shown in the video
We will use recursive partition package – there are many other packages
Visualize
Motivation for Ensemble Methods
Tree while non-parametric, includes all features and all observations, and consequently can
result in over-fitting.
Bootstrap aggregation (aka Bagging).
Construct multiple individual trees
At each split while constructing the tree
Randomly select a subset of features
Bootstrap dataset (with REPLACEMENT)
Aggregate results from multiple trees
Using any strategy
Allow voting and pick most voted
Simply average over all the trees
Let us repeat the steps for individual trees
fac.titanic.rf<-randomForest(as.factor(Survived)~.,data=trdata,keep.inbag=TRUE,
type=classification,importance=TRUE,keep.forest=TRUE,ntree=193)
predicted.rf <- predict(fac.titanic.rf, tstdata[,-1], type='response')
confusionTable <- table(predicted.rf, tstdata[,1],dnn = c('Predicted','Observed'))
table( predicted.rf==tstdata[,1])
pred.rf<-prediction(as.numeric(predicted.rf),as.numeric(tstdata[,1]))
perf.rf<-performance(pred.rf,measure="tpr",x.measure="fpr")
auc_rf<-performance(pred.rf,measure="auc")
Model with training and Predict with test
Extract measures using ROCR package
plot(perf.rf, col=rainbow(7), main="ROC curve Titanic (Random Forest)",
xlab="FPR", ylab="TPR")
text(0.5,0.5,paste("AUC=",format(auc_rf@y.values[[1]],digits=5, scientific=FALSE)))
#Use prune and printcp/plotcp to fine tune the model
# as shown in the video
We will use randomForest package (which uses CART)– there are many other packages
Visualize
Comparing Models
Random Forests always outperforms Individual Trees
In this experiment, GLM outperforms Random Forest.
Research Directions: Random Forest
R Exercise : extract and reconstruct individual trees for further investigation
GLM still rules...because superior LIFT
AUC for GLM → 0.84
AUC for DT → 0.78
AUC for RF → 0.82
AUC (more area) → LIFT

More Related Content

What's hot

Gradient Boosted trees
Gradient Boosted treesGradient Boosted trees
Gradient Boosted trees
Nihar Ranjan
 
Summerization notes for descriptive statistics using r
Summerization notes for descriptive statistics using r Summerization notes for descriptive statistics using r
Summerization notes for descriptive statistics using r
Ashwini Mathur
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret package
Vivian S. Zhang
 
R Workshop for Beginners
R Workshop for BeginnersR Workshop for Beginners
R Workshop for Beginners
Metamarkets
 
R Cheat Sheet – Data Management
R Cheat Sheet – Data ManagementR Cheat Sheet – Data Management
R Cheat Sheet – Data Management
Dr. Volkan OBAN
 
23 Machine Learning Feature Generation
23 Machine Learning Feature Generation23 Machine Learning Feature Generation
23 Machine Learning Feature Generation
Andres Mendez-Vazquez
 
18 Machine Learning Radial Basis Function Networks Forward Heuristics
18 Machine Learning Radial Basis Function Networks Forward Heuristics18 Machine Learning Radial Basis Function Networks Forward Heuristics
18 Machine Learning Radial Basis Function Networks Forward Heuristics
Andres Mendez-Vazquez
 
R learning by examples
R learning by examplesR learning by examples
R learning by examples
Michelle Darling
 
De-Cluttering-ML | TechWeekends
De-Cluttering-ML | TechWeekendsDe-Cluttering-ML | TechWeekends
De-Cluttering-ML | TechWeekends
DSCUSICT
 
Machine learning in science and industry — day 3
Machine learning in science and industry — day 3Machine learning in science and industry — day 3
Machine learning in science and industry — day 3
arogozhnikov
 
Data preprocessing for Machine Learning with R and Python
Data preprocessing for Machine Learning with R and PythonData preprocessing for Machine Learning with R and Python
Data preprocessing for Machine Learning with R and Python
Akhilesh Joshi
 
Kaggle talk series top 0.2% kaggler on amazon employee access challenge
Kaggle talk series  top 0.2% kaggler on amazon employee access challengeKaggle talk series  top 0.2% kaggler on amazon employee access challenge
Kaggle talk series top 0.2% kaggler on amazon employee access challenge
Vivian S. Zhang
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptions
Gilles Louppe
 
R short-refcard
R short-refcardR short-refcard
R short-refcard
conline
 
R교육1
R교육1R교육1
R교육1
Kangwook Lee
 
Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...
Yao Yao
 
R code for data manipulation
R code for data manipulationR code for data manipulation
R code for data manipulation
Avjinder (Avi) Kaler
 
Reweighting and Boosting to uniforimty in HEP
Reweighting and Boosting to uniforimty in HEPReweighting and Boosting to uniforimty in HEP
Reweighting and Boosting to uniforimty in HEP
arogozhnikov
 
Machine Learning Feature Selection - Random Forest
Machine Learning Feature Selection - Random Forest Machine Learning Feature Selection - Random Forest
Machine Learning Feature Selection - Random Forest
Rupak Roy
 
Introduction to Boosted Trees by Tianqi Chen
Introduction to Boosted Trees by Tianqi ChenIntroduction to Boosted Trees by Tianqi Chen
Introduction to Boosted Trees by Tianqi Chen
Zhuyi Xue
 

What's hot (20)

Gradient Boosted trees
Gradient Boosted treesGradient Boosted trees
Gradient Boosted trees
 
Summerization notes for descriptive statistics using r
Summerization notes for descriptive statistics using r Summerization notes for descriptive statistics using r
Summerization notes for descriptive statistics using r
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret package
 
R Workshop for Beginners
R Workshop for BeginnersR Workshop for Beginners
R Workshop for Beginners
 
R Cheat Sheet – Data Management
R Cheat Sheet – Data ManagementR Cheat Sheet – Data Management
R Cheat Sheet – Data Management
 
23 Machine Learning Feature Generation
23 Machine Learning Feature Generation23 Machine Learning Feature Generation
23 Machine Learning Feature Generation
 
18 Machine Learning Radial Basis Function Networks Forward Heuristics
18 Machine Learning Radial Basis Function Networks Forward Heuristics18 Machine Learning Radial Basis Function Networks Forward Heuristics
18 Machine Learning Radial Basis Function Networks Forward Heuristics
 
R learning by examples
R learning by examplesR learning by examples
R learning by examples
 
De-Cluttering-ML | TechWeekends
De-Cluttering-ML | TechWeekendsDe-Cluttering-ML | TechWeekends
De-Cluttering-ML | TechWeekends
 
Machine learning in science and industry — day 3
Machine learning in science and industry — day 3Machine learning in science and industry — day 3
Machine learning in science and industry — day 3
 
Data preprocessing for Machine Learning with R and Python
Data preprocessing for Machine Learning with R and PythonData preprocessing for Machine Learning with R and Python
Data preprocessing for Machine Learning with R and Python
 
Kaggle talk series top 0.2% kaggler on amazon employee access challenge
Kaggle talk series  top 0.2% kaggler on amazon employee access challengeKaggle talk series  top 0.2% kaggler on amazon employee access challenge
Kaggle talk series top 0.2% kaggler on amazon employee access challenge
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptions
 
R short-refcard
R short-refcardR short-refcard
R short-refcard
 
R교육1
R교육1R교육1
R교육1
 
Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...
 
R code for data manipulation
R code for data manipulationR code for data manipulation
R code for data manipulation
 
Reweighting and Boosting to uniforimty in HEP
Reweighting and Boosting to uniforimty in HEPReweighting and Boosting to uniforimty in HEP
Reweighting and Boosting to uniforimty in HEP
 
Machine Learning Feature Selection - Random Forest
Machine Learning Feature Selection - Random Forest Machine Learning Feature Selection - Random Forest
Machine Learning Feature Selection - Random Forest
 
Introduction to Boosted Trees by Tianqi Chen
Introduction to Boosted Trees by Tianqi ChenIntroduction to Boosted Trees by Tianqi Chen
Introduction to Boosted Trees by Tianqi Chen
 

Similar to Evaluating classifierperformance ml-cs6923

Machine Learning With R
Machine Learning With RMachine Learning With R
Machine Learning With R
David Chiu
 
Nyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedNyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expanded
Vivian S. Zhang
 
PCA and LDA in machine learning
PCA and LDA in machine learningPCA and LDA in machine learning
PCA and LDA in machine learning
Akhilesh Joshi
 
Passenger forecasting at KLM
Passenger forecasting at KLMPassenger forecasting at KLM
Passenger forecasting at KLM
Alexander Backus
 
logistic regression with python and R
logistic regression with python and Rlogistic regression with python and R
logistic regression with python and R
Akhilesh Joshi
 
R programming intro with examples
R programming intro with examplesR programming intro with examples
R programming intro with examples
Dennis
 
Tree-Based Methods (Article 8 - Practical Exercises)
Tree-Based Methods (Article 8 - Practical Exercises)Tree-Based Methods (Article 8 - Practical Exercises)
Tree-Based Methods (Article 8 - Practical Exercises)
Theodore Grammatikopoulos
 
Classification examp
Classification exampClassification examp
Classification examp
Ryan Hong
 
R console
R consoleR console
R console
Ananth Raj
 
R and data mining
R and data miningR and data mining
R and data mining
Chaozhong Yang
 
13. Random forest
13. Random forest13. Random forest
13. Random forest
ExternalEvents
 
proj1v2
proj1v2proj1v2
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
Chakkrit (Kla) Tantithamthavorn
 
Bigger Data v Better Math
Bigger Data v Better MathBigger Data v Better Math
Bigger Data v Better Math
Brent Schneeman
 
Sampling Techniques.pptx
Sampling Techniques.pptxSampling Techniques.pptx
Sampling Techniques.pptx
ChamilaWalgampaya1
 
R/Finance 2009 Chicago
R/Finance 2009 ChicagoR/Finance 2009 Chicago
R/Finance 2009 Chicago
gyollin
 
Time Series Analysis and Mining with R
Time Series Analysis and Mining with RTime Series Analysis and Mining with R
Time Series Analysis and Mining with R
Yanchang Zhao
 
R Get Started II
R Get Started IIR Get Started II
R Get Started II
Sankhya_Analytics
 
maXbox starter65 machinelearning3
maXbox starter65 machinelearning3maXbox starter65 machinelearning3
maXbox starter65 machinelearning3
Max Kleiner
 
Optimization and Mathematical Programming in R and ROI - R Optimization Infra...
Optimization and Mathematical Programming in R and ROI - R Optimization Infra...Optimization and Mathematical Programming in R and ROI - R Optimization Infra...
Optimization and Mathematical Programming in R and ROI - R Optimization Infra...
Dr. Volkan OBAN
 

Similar to Evaluating classifierperformance ml-cs6923 (20)

Machine Learning With R
Machine Learning With RMachine Learning With R
Machine Learning With R
 
Nyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedNyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expanded
 
PCA and LDA in machine learning
PCA and LDA in machine learningPCA and LDA in machine learning
PCA and LDA in machine learning
 
Passenger forecasting at KLM
Passenger forecasting at KLMPassenger forecasting at KLM
Passenger forecasting at KLM
 
logistic regression with python and R
logistic regression with python and Rlogistic regression with python and R
logistic regression with python and R
 
R programming intro with examples
R programming intro with examplesR programming intro with examples
R programming intro with examples
 
Tree-Based Methods (Article 8 - Practical Exercises)
Tree-Based Methods (Article 8 - Practical Exercises)Tree-Based Methods (Article 8 - Practical Exercises)
Tree-Based Methods (Article 8 - Practical Exercises)
 
Classification examp
Classification exampClassification examp
Classification examp
 
R console
R consoleR console
R console
 
R and data mining
R and data miningR and data mining
R and data mining
 
13. Random forest
13. Random forest13. Random forest
13. Random forest
 
proj1v2
proj1v2proj1v2
proj1v2
 
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
 
Bigger Data v Better Math
Bigger Data v Better MathBigger Data v Better Math
Bigger Data v Better Math
 
Sampling Techniques.pptx
Sampling Techniques.pptxSampling Techniques.pptx
Sampling Techniques.pptx
 
R/Finance 2009 Chicago
R/Finance 2009 ChicagoR/Finance 2009 Chicago
R/Finance 2009 Chicago
 
Time Series Analysis and Mining with R
Time Series Analysis and Mining with RTime Series Analysis and Mining with R
Time Series Analysis and Mining with R
 
R Get Started II
R Get Started IIR Get Started II
R Get Started II
 
maXbox starter65 machinelearning3
maXbox starter65 machinelearning3maXbox starter65 machinelearning3
maXbox starter65 machinelearning3
 
Optimization and Mathematical Programming in R and ROI - R Optimization Infra...
Optimization and Mathematical Programming in R and ROI - R Optimization Infra...Optimization and Mathematical Programming in R and ROI - R Optimization Infra...
Optimization and Mathematical Programming in R and ROI - R Optimization Infra...
 

More from Raman Kannan

Essays on-civic-responsibilty
Essays on-civic-responsibiltyEssays on-civic-responsibilty
Essays on-civic-responsibilty
Raman Kannan
 
M12 boosting-part02
M12 boosting-part02M12 boosting-part02
M12 boosting-part02
Raman Kannan
 
M12 random forest-part01
M12 random forest-part01M12 random forest-part01
M12 random forest-part01
Raman Kannan
 
M11 bagging loo cv
M11 bagging loo cvM11 bagging loo cv
M11 bagging loo cv
Raman Kannan
 
M10 gradient descent
M10 gradient descentM10 gradient descent
M10 gradient descent
Raman Kannan
 
M09-Cross validating-naive-bayes
M09-Cross validating-naive-bayesM09-Cross validating-naive-bayes
M09-Cross validating-naive-bayes
Raman Kannan
 
M06 tree
M06 treeM06 tree
M06 tree
Raman Kannan
 
M07 svm
M07 svmM07 svm
M07 svm
Raman Kannan
 
M08 BiasVarianceTradeoff
M08 BiasVarianceTradeoffM08 BiasVarianceTradeoff
M08 BiasVarianceTradeoff
Raman Kannan
 
Chapter 05 k nn
Chapter 05 k nnChapter 05 k nn
Chapter 05 k nn
Raman Kannan
 
Augmented 11022020-ieee
Augmented 11022020-ieeeAugmented 11022020-ieee
Augmented 11022020-ieee
Raman Kannan
 
Chapter01 introductory handbook
Chapter01 introductory handbookChapter01 introductory handbook
Chapter01 introductory handbook
Raman Kannan
 
Chapter 2: R tutorial Handbook for Data Science and Machine Learning Practiti...
Chapter 2: R tutorial Handbook for Data Science and Machine Learning Practiti...Chapter 2: R tutorial Handbook for Data Science and Machine Learning Practiti...
Chapter 2: R tutorial Handbook for Data Science and Machine Learning Practiti...
Raman Kannan
 
A voyage-inward-02
A voyage-inward-02A voyage-inward-02
A voyage-inward-02
Raman Kannan
 
A data scientist's study plan
A data scientist's study planA data scientist's study plan
A data scientist's study plan
Raman Kannan
 
Cognitive Assistants
Cognitive AssistantsCognitive Assistants
Cognitive Assistants
Raman Kannan
 
Essay on-data-analysis
Essay on-data-analysisEssay on-data-analysis
Essay on-data-analysis
Raman Kannan
 
Joy of Unix
Joy of UnixJoy of Unix
Joy of Unix
Raman Kannan
 
How to-run-ols-diagnostics-02
How to-run-ols-diagnostics-02How to-run-ols-diagnostics-02
How to-run-ols-diagnostics-02
Raman Kannan
 
Sdr dodd frankbirdseyeview
Sdr dodd frankbirdseyeviewSdr dodd frankbirdseyeview
Sdr dodd frankbirdseyeview
Raman Kannan
 

More from Raman Kannan (20)

Essays on-civic-responsibilty
Essays on-civic-responsibiltyEssays on-civic-responsibilty
Essays on-civic-responsibilty
 
M12 boosting-part02
M12 boosting-part02M12 boosting-part02
M12 boosting-part02
 
M12 random forest-part01
M12 random forest-part01M12 random forest-part01
M12 random forest-part01
 
M11 bagging loo cv
M11 bagging loo cvM11 bagging loo cv
M11 bagging loo cv
 
M10 gradient descent
M10 gradient descentM10 gradient descent
M10 gradient descent
 
M09-Cross validating-naive-bayes
M09-Cross validating-naive-bayesM09-Cross validating-naive-bayes
M09-Cross validating-naive-bayes
 
M06 tree
M06 treeM06 tree
M06 tree
 
M07 svm
M07 svmM07 svm
M07 svm
 
M08 BiasVarianceTradeoff
M08 BiasVarianceTradeoffM08 BiasVarianceTradeoff
M08 BiasVarianceTradeoff
 
Chapter 05 k nn
Chapter 05 k nnChapter 05 k nn
Chapter 05 k nn
 
Augmented 11022020-ieee
Augmented 11022020-ieeeAugmented 11022020-ieee
Augmented 11022020-ieee
 
Chapter01 introductory handbook
Chapter01 introductory handbookChapter01 introductory handbook
Chapter01 introductory handbook
 
Chapter 2: R tutorial Handbook for Data Science and Machine Learning Practiti...
Chapter 2: R tutorial Handbook for Data Science and Machine Learning Practiti...Chapter 2: R tutorial Handbook for Data Science and Machine Learning Practiti...
Chapter 2: R tutorial Handbook for Data Science and Machine Learning Practiti...
 
A voyage-inward-02
A voyage-inward-02A voyage-inward-02
A voyage-inward-02
 
A data scientist's study plan
A data scientist's study planA data scientist's study plan
A data scientist's study plan
 
Cognitive Assistants
Cognitive AssistantsCognitive Assistants
Cognitive Assistants
 
Essay on-data-analysis
Essay on-data-analysisEssay on-data-analysis
Essay on-data-analysis
 
Joy of Unix
Joy of UnixJoy of Unix
Joy of Unix
 
How to-run-ols-diagnostics-02
How to-run-ols-diagnostics-02How to-run-ols-diagnostics-02
How to-run-ols-diagnostics-02
 
Sdr dodd frankbirdseyeview
Sdr dodd frankbirdseyeviewSdr dodd frankbirdseyeview
Sdr dodd frankbirdseyeview
 

Recently uploaded

一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
74nqk8xf
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
g4dpvqap0
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
74nqk8xf
 

Recently uploaded (20)

一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
 

Evaluating classifierperformance ml-cs6923

  • 1. Evaluating Classifier Performance for Titanic Survivor Dataset! Raman Kannan Adjunct NYU Tandon School of Engineering Tandon OnLine Introduction to Machine Learning CS 6923
  • 2. Random Forest Improving Classifier Performance Process Preparing the Data – train and testing data set Baseline Classifier – GLM Tree – Refine and fine tune,all features on entire dataset Random Forest – Aggregating Weak Learners to make a strong Learner – Random subset of features, bootstrapped dataset Compare Performance – ROC Curve, AUC Prepare dataset Baseline Classifer D Tree Random Forest Training + testing dataset Compare refine Simplest strategy may sometimes be the best solution Sophisticated strategies may not always yield the best solution. Setup objective means to compare Same training set and testing set Same measurement criteria
  • 3. Dataset Preparation titanic<- read.csv("http://christianherta.de/lehre/dataScience/machine Learning/data/titanic-train.csv",header=T) sm_titantic_3<-titanic[,c(2,3,5,6,10)] #dont need all the columns sm_titanic_3<-sm_titantic_3[complete.cases(sm_titantic_3),] set.seed(43) tst_idx<-sample(714,200,replace=FALSE) tstdata<-sm_titanic_3[tst_idx,] trdata<-sm_titanic_3[-tst_idx,]
  • 4. Running GLM Create a model using logistic regression glm_sm_titanic_3<-glm(Survived~.,data=trdata,family=binomial()) Predict using the test dataset and glm model created glm_predicted<- predict(glm_sm_titanic_3,tstdata[,2:5],type="response");
  • 5. How well did GLM perform? require(ROCR) # use ROCR package True/False Positives, True/False Negatives We will use Precision, Recall, and derived measures Precision (related to false positives) Recall (related to false negatives) https://en.wikipedia.org/wiki/Precision_and_recall
  • 6. Binary Classifier Precision → (TRUE POSITIVES)/(TP+FP) Recall → (TP)/(All positives in the sample) https://uberpython.wordpress.com/2012/01/01/precision-recall-sensitivity-and-specificity/
  • 7. Calculate glm_sm_titanic_3<-glm(Survived~.,data=trdata,family=binomial()) glm_predicted<-predict(glm_sm_titanic_3,tstdata[,2:5],type="response"); require(ROCR) glm_auc_1<-prediction(glm_predicted,tstdata$Survived) glm_prf<-performance(glm_auc_1, measure="tpr", x.measure="fpr") glm_slot_fp<-slot(glm_auc_1,"fp") glm_slot_tp<-slot(glm_auc_1,"tp") Model with training and Predict with test Extract fp and tp using ROCR package glm_fpr3<-unlist(glm_slot_fp)/unlist(slot(glm_auc_1,"n.neg")) glm_tpr3<-unlist(glm_slot_tp)/unlist(slot(glm_auc_1,"n.pos")) Calculate fpr and tpr vectors glm_perf_AUC=performance(glm_auc_1,"auc") glm_AUC=glm_perf_AUC@y.values[[1]] Calculate Area Under Curve – ROC
  • 8. Plot the results for GLM Plot RoC plot(glm_fpr3,glm_tpr3, main="ROC Curve from first principles -- raw counts", xlab="FPR",ylab="TPR") points(glm_fpr3,glm_fpr3,cex=0.3) # will generate a diagonal plot Let us draw a diagonal and see the lift text(0.4,0.6, paste("GLM AUC = ",format(glm_AUC, digits=5, scientific=FALSE))) Let us place AUC on the plot
  • 9. Let us repeat the steps for individual trees tree<-rpart(Survived~.,data=trdata) tree_predicted_prob_03<- predict(pruned.tree.03,tstdata[,2:5]); tree_predicted_class_03<-round(tree_predicted_prob_03) tree_prediction_rocr_03<-prediction(tree_predicted_class_03,tstdata$Survived) tree_prf_rocr_03<-performance(tree_prediction_rocr_03, measure="tpr", x.measure="fpr") tree_perf_AUC_03= performance(tree_prediction_rocr_03,"auc") Model with training and Predict with test Extract measures using ROCR package plot(tree_prf_rocr_03,main="ROC plot cp=0.03(DTREE using rpart)") text(0.5,0.5,paste("AUC=",format(tree_perf_AUC_03@y.values[[1]],digits=5, scientific=FALSE))) #Use prune and printcp/plotcp to fine tune the model # as shown in the video We will use recursive partition package – there are many other packages Visualize
  • 10. Motivation for Ensemble Methods Tree while non-parametric, includes all features and all observations, and consequently can result in over-fitting. Bootstrap aggregation (aka Bagging). Construct multiple individual trees At each split while constructing the tree Randomly select a subset of features Bootstrap dataset (with REPLACEMENT) Aggregate results from multiple trees Using any strategy Allow voting and pick most voted Simply average over all the trees
  • 11. Let us repeat the steps for individual trees fac.titanic.rf<-randomForest(as.factor(Survived)~.,data=trdata,keep.inbag=TRUE, type=classification,importance=TRUE,keep.forest=TRUE,ntree=193) predicted.rf <- predict(fac.titanic.rf, tstdata[,-1], type='response') confusionTable <- table(predicted.rf, tstdata[,1],dnn = c('Predicted','Observed')) table( predicted.rf==tstdata[,1]) pred.rf<-prediction(as.numeric(predicted.rf),as.numeric(tstdata[,1])) perf.rf<-performance(pred.rf,measure="tpr",x.measure="fpr") auc_rf<-performance(pred.rf,measure="auc") Model with training and Predict with test Extract measures using ROCR package plot(perf.rf, col=rainbow(7), main="ROC curve Titanic (Random Forest)", xlab="FPR", ylab="TPR") text(0.5,0.5,paste("AUC=",format(auc_rf@y.values[[1]],digits=5, scientific=FALSE))) #Use prune and printcp/plotcp to fine tune the model # as shown in the video We will use randomForest package (which uses CART)– there are many other packages Visualize
  • 12. Comparing Models Random Forests always outperforms Individual Trees In this experiment, GLM outperforms Random Forest.
  • 13. Research Directions: Random Forest R Exercise : extract and reconstruct individual trees for further investigation GLM still rules...because superior LIFT AUC for GLM → 0.84 AUC for DT → 0.78 AUC for RF → 0.82 AUC (more area) → LIFT