Benchmarking 20 Machine Learning Models Accuracy
and Speed
Marc Borowczak, PRC Consulting LLC
March 29, 2016
Contents
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Step 0: Selection & Reproducibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Step 1: Retrieve 1st Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Step 2. Data Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Step 3. Classification Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Step 4. Performance Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Step 5: Retrieve 2nd Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Step 6. Data Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Step 7. Classification Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Step 8. Performance Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Step 9. Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Step 10. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Summary
As Machine Learning tools become mainstream and an ever-growing choice of them is available to data scientists
and analysts, assessing which tools are best suited becomes challenging. In this study, 20 Machine Learning
models were benchmarked for accuracy and speed on multi-core hardware, applied to 2 multinomial datasets
differing broadly in size and complexity. BAG-CART, RF and BOOST-C50 topped the list at more than 99%
accuracy, while NNET, PART, GBM, SVM and C45 exceeded 95% accuracy on the small Car Evaluation dataset.
On the larger and more complex Nursery dataset, BAG-CART, BOOST-C50, PART, SVM and RF exceeded 99%
accuracy, while JRIP, NNET, H2O, C45 and KNN exceeded 95% accuracy. However, the combined rankings were
dominated by Speed (determined as an average of 5 runs), leaving only CART, MDA and GBM as contenders
for the Car Evaluation dataset. For the more complex Nursery dataset, a different outcome was observed,
with MDA, ONE-R and BOOST-C50 as the fastest and overall best predictors. The implication for Data
Analytics leaders is to continue allocating resources to ensure Machine Learning benchmarks are conducted
regularly, documented, and communicated through the analyst teams, and to ensure that the most efficient tools,
selected on established criteria, are applied in day-to-day operations. The implication of these findings for
data scientists is to keep benchmarking tasks on the front burner, not the back burner, of their activity list,
and to continue monitoring new, more efficient, distributed and/or parallelized algorithms and their
effects on various hardware platforms. Ultimately, finding the best tool depends strongly on criteria selection
and certainly on the hardware platforms available. Therefore this benchmarking task may well remain on the data
analyst leaders' and engineers' to-do lists for the foreseeable future.
Step 0: Selection & Reproducibility
As Machine Learning gains attention, more applications and models are in use, yet the speed and
accuracy of the resulting predictions are rarely compared. In this analysis, we'll compare the accuracy and speed
of 20 commonly selected Machine Learning models. We'll exercise these models on two multinomial UCI
reference datasets that differ in size, in the number and levels of their predictors, and in the number of class
levels. The first and smaller dataset is Car Evaluation; we'll then compare results when applying the same
models to the larger and more complex Nursery dataset.
Sys.info()[1:5]
## sysname release version nodename machine
## "Windows" "10 x64" "build 10586" "STALLION" "x86-64"
sessionInfo()
## R version 3.2.4 Revised (2016-03-16 r70336)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 formatR_1.3 tools_3.2.4 htmltools_0.3
## [5] yaml_2.1.13 stringi_1.0-1 rmarkdown_0.9.5 knitr_1.12.3
## [9] stringr_1.0.0 digest_0.6.9 evaluate_0.8.3
library(stringr)
library(knitr)
userdir <- getwd()
set.seed(123)
Step 1: Retrieve 1st Dataset
We will mirror the approach used in the formulation challenge and start with the Car Evaluation dataset hosted
on the UCI Machine Learning Repository. We will use R to download the dataset and its full description quickly
and reproducibly. We continue to maintain reproducibility of the analysis as a general practice: the analysis
tool and platform are documented, all libraries are clearly listed, and the data is retrieved programmatically and
date-stamped from the repository.
We will display the structure of the Car Evaluation dataset and the corresponding dictionary used to translate the
attribute factors.
datadir <- "./data"
if (!file.exists("data")){dir.create("data")}
uciUrl <- "http://archive.ics.uci.edu/ml/machine-learning-databases/"
fileUrl <- paste0(uciUrl,"car/car.data?accessType=DOWNLOAD")
download.file(fileUrl, destfile="./data/cardata.csv", method = "curl")
dateDownloaded <- date()
car_eval <- read.csv("./data/cardata.csv",header=FALSE)
fileUrl <- paste0(uciUrl,"car/car.names?accessType=DOWNLOAD")
download.file(fileUrl, destfile = "./data/carnames.txt")
txt <- readLines("./data/carnames.txt")
lns <- data.frame(beg=which(grepl("buying v-high",txt)),end=which(grepl("med, high",txt)))
# we now capture all lines of text between beg and end from txt
res <- lapply(seq_along(lns$beg),function(l){paste(txt[seq(from=lns$beg[l],to=lns$end[l],by=1)],collapse=" ")})
res <- gsub(" ", ":", res, fixed = TRUE)
res <- gsub(" ", ":", res, fixed = TRUE)
res <- gsub(" ", ":", res, fixed = TRUE)
res <- gsub(" ", ":", res, fixed = TRUE)
res <- gsub(" ", "\n", res, fixed = TRUE)
res <- gsub(" ", "", res, fixed = TRUE)
res <- gsub(" ", "", res, fixed = TRUE)
res <- str_c(res,"\n")
writeLines(res, "./data/parsed_attr.csv")
attrib <- readLines("./data/parsed_attr.csv")
nv <- length(attrib) # number of attributes
attrib <- sapply (1:nv,function(i) {gsub(":"," ",attrib[i],fixed=TRUE)})
dictionary <- sapply (1:nv,function(i) {strsplit(attrib[i],' ')})
dictionary[[nv]][1]<-"class"
colnames(car_eval)<-sapply(1:nv,function(i) {colnames(car_eval)[i]<-dictionary[[i]][1]})
cm<-list()
x<-car_eval[,1:(nv-1)]
y<-car_eval[,nv]
fmla<-paste(colnames(car_eval)[1:(nv-1)],collapse="+")
fmla<-paste0(colnames(car_eval)[nv],"~",fmla)
fmla<-as.formula(fmla)
nlev<-nlevels(y) # number of factors describing class
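Echoing fmla confirms the constructed formula for this dataset (the echoed output below is implied by the column names shown in the next step):
fmla
## class ~ buying + maint + doors + persons + lug_boot + safety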
Step 2. Data Exploration
head(car_eval)
## buying maint doors persons lug_boot safety class
## 1 vhigh vhigh 2 2 small low unacc
## 2 vhigh vhigh 2 2 small med unacc
## 3 vhigh vhigh 2 2 small high unacc
## 4 vhigh vhigh 2 2 med low unacc
## 5 vhigh vhigh 2 2 med med unacc
## 6 vhigh vhigh 2 2 med high unacc
summary(car_eval)
## buying maint doors persons lug_boot safety
## high :432 high :432 2 :432 2 :576 big :576 high:576
## low :432 low :432 3 :432 4 :576 med :576 low :576
## med :432 med :432 4 :432 more:576 small:576 med :576
## vhigh:432 vhigh:432 5more:432
## class
## acc : 384
## good : 69
## unacc:1210
## vgood: 65
str(car_eval)
## 'data.frame': 1728 obs. of 7 variables:
## $ buying : Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ maint : Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ doors : Factor w/ 4 levels "2","3","4","5more": 1 1 1 1 1 1 1 1 1 1 ...
## $ persons : Factor w/ 3 levels "2","4","more": 1 1 1 1 1 1 1 1 1 2 ...
## $ lug_boot: Factor w/ 3 levels "big","med","small": 3 3 3 2 2 2 1 1 1 3 ...
## $ safety : Factor w/ 3 levels "high","low","med": 2 3 1 2 3 1 2 3 1 2 ...
## $ class : Factor w/ 4 levels "acc","good","unacc",..: 3 3 3 3 3 3 3 3 3 3 ...
cm<-list() # initialize
Exploration reveals that the multinomial Car Evaluation dataset comprises 6 attribute factors we can use to
predict the 4-level car recommendation class, with no missing data.
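A quick programmatic check confirms this; a minimal sketch:
sum(is.na(car_eval))        # 0: no missing values
sapply(car_eval, nlevels)   # 4 4 4 3 3 3 4: factor levels per column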
Step 3. Classification Analysis
20 models will be selected sequentially to represent 3 groups: A) Linear, B) Non-Linear, and C) Non-Linear
Classification with decision trees. From the collected confusion matrix performance, we will build a results
data frame and compare the prediction accuracies.
For each analysis, we'll follow the same protocol: invoke the package library, build the model, summarize
it and predict the class (dependent variable), then save the accuracy data and predicted values in a list. A
schematic template is sketched below.
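Schematically, every model section that follows instantiates the same steps (somePkg and some_fit are placeholders, not a specific package or function):
library(somePkg)                                           # 1. invoke the package library
model <- some_fit(fmla, data = car_eval)                   # 2. build the model
pred  <- predict(model, x)                                 # 3. predict the class from the attributes
l     <- union(pred, y)                                    #    align predicted and actual factor levels
mtab  <- table(factor(pred, l), factor(y, l))              #    confusion table
cm[[i]] <- c("Description", "TAG", confusionMatrix(mtab))  # 4. save accuracy data and predictions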
3.A Linear Classification
3.A1 Multinomial
library(nnet)
library(caret)
model<-multinom(fmla, data = car_eval, maxit = 500, trace=FALSE)
prob<-predict(model,x,type="probs")
pred<-apply(prob,1,which.max)
pred[which(pred=="1")]<-levels(y)[1]
pred[which(pred=="2")]<-levels(y)[2]
pred[which(pred=="3")]<-levels(y)[3]
pred[which(pred=="4")]<-levels(y)[4]
pred<-as.factor(pred)
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[1]]<-c("Multinomial","MULTINOM",confusionMatrix(mtab))
cm[[1]]$table
##
## unacc acc vgood good
## unacc 1166 32 0 0
## acc 43 346 2 6
## vgood 0 2 63 4
## good 1 4 0 59
cm[[1]]$overall[1]
## Accuracy
## 0.9456019
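As an aside, the four pred[which(...)] reassignments above (repeated in several sections below) can be collapsed into one idiomatic line; a sketch equivalent for this data:
pred <- factor(levels(y)[apply(prob, 1, which.max)], levels = levels(y))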
3.A2 Logistic Regression
library(VGAM)
model<-vglm(fmla, family = "multinomial", data = car_eval, maxit = 100)
prob<-predict(model,x,type="response")
pred<-apply(prob,1,which.max)
pred[which(pred=="1")]<-levels(y)[1]
pred[which(pred=="2")]<-levels(y)[2]
pred[which(pred=="3")]<-levels(y)[3]
pred[which(pred=="4")]<-levels(y)[4]
pred<-as.factor(pred)
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[2]]<-c("Logistic Regression","GLM",confusionMatrix(mtab))
cm[[2]]$table
##
## unacc acc vgood good
## unacc 1166 32 0 0
## acc 43 346 2 6
## vgood 0 2 63 4
## good 1 4 0 59
cm[[2]]$overall[1]
## Accuracy
## 0.9456019
3.A3 Linear Discriminant Analysis
library(MASS)
model<-lda(fmla,data=car_eval)
pred<-predict(model,x)$class
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[3]]<-c("Linear Discriminant Analysis","LDA",confusionMatrix(mtab))
cm[[3]]$table
##
## unacc acc vgood good
## unacc 1138 13 0 0
## acc 70 361 27 44
## vgood 0 0 35 2
## good 2 10 3 23
cm[[3]]$overall[1]
## Accuracy
## 0.9010417
3.B Non-Linear Classification
3.B1 Mixture Discriminant Analysis
library(mda)
model<-mda(fmla,data=car_eval)
pred<-predict(model,x)
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[4]]<-c("Mixture Discriminant Analysis","MDA",confusionMatrix(mtab))
cm[[4]]$table
##
## unacc acc vgood good
## unacc 1151 16 0 0
## acc 52 341 7 21
## vgood 0 4 55 3
## good 7 23 3 45
cm[[4]]$overall[1]
## Accuracy
## 0.9212963
3.B2 Regularized Discriminant Analysis
library(klaR)
model<-rda(fmla,data=car_eval,gamma = 0.05,lambda = 0.01)
pred<-predict(model,x)$class
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[5]]<-c("Regularized Discriminant Analysis","RDA",confusionMatrix(mtab))
cm[[5]]$table
##
## unacc acc vgood good
## unacc 1072 0 0 0
## acc 134 317 0 0
## vgood 0 22 65 6
## good 4 45 0 63
cm[[5]]$overall[1]
## Accuracy
## 0.8778935
3.B3 Neural Network
library(nnet)
library(devtools)
model<-nnet(fmla,data=car_eval,size = 4, decay = 0.0001, maxit = 700, trace = FALSE)
#import the function from Github
# source_url('https://gist.githubusercontent.com/Peque/41a9e20d6687f2f3108d/raw/85e14f3a292e126f14548644
# plot.nnet(model, alpha.val = 0.5, cex= 0.7, circle.col = list('lightblue', 'white'), bord.col = 'black
pred<-predict(model,x,type="class")
pred<-as.factor(pred)
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[6]]<-c("Neural Network","NNET",confusionMatrix(mtab))
cm[[6]]$table
##
## unacc acc vgood good
## unacc 1208 2 0 0
## acc 2 376 0 6
## vgood 0 0 64 1
## good 0 6 1 62
cm[[6]]$overall[1]
## Accuracy
## 0.9895833
3.B4 Flexible Discriminant Analysis
library(mda)
model<-fda(fmla,data=car_eval)
pred<-predict(model,x,type="class")
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[7]]<-c("Flexible Discriminant Analysis","FDA",confusionMatrix(mtab))
cm[[7]]$table
##
## unacc acc vgood good
## unacc 1136 13 0 0
## acc 72 361 27 44
## vgood 0 0 35 2
## good 2 10 3 23
cm[[7]]$overall[1]
## Accuracy
## 0.8998843
3.B5 Support Vector Machine
library(kernlab)
model<-ksvm(fmla,data=car_eval)
pred<-predict(model,x,type="response")
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[8]]<-c("Support Vector Machine","SVM",confusionMatrix(mtab))
cm[[8]]$table
##
## unacc acc vgood good
## unacc 1174 0 0 0
## acc 33 375 0 0
## vgood 0 1 65 9
## good 3 8 0 60
cm[[8]]$overall[1]
## Accuracy
## 0.96875
3.B6 k-Nearest Neighbors
library(caret)
model<-knn3(fmla,data=car_eval,k=nlev+1)
pred<-predict(model,x,type="class")
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[9]]<-c("k-Nearest Neighbors","KNN",confusionMatrix(mtab))
cm[[9]]$table
##
## unacc acc good vgood
## unacc 1199 41 4 1
## acc 11 336 45 15
## good 0 6 15 4
## vgood 0 1 5 45
cm[[9]]$overall[1]
## Accuracy
## 0.9230324
3.B7 Naive Bayes
library(e1071)
model<-naiveBayes(fmla,data=car_eval,k=nlev+1)
pred<-predict(model,x)
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[10]]<-c("Naive Bayes","NBAYES",confusionMatrix(mtab))
cm[[10]]$table
##
## unacc acc vgood good
## unacc 1161 85 0 0
## acc 47 289 26 46
## vgood 0 0 39 2
## good 2 10 0 21
cm[[10]]$overall[1]
## Accuracy
## 0.8738426
3.C Non-Linear Classification with Decision Trees
3.C1 Classification and Regression Trees (CART)
library(rpart)
library(rpart.plot)
model<-rpart(fmla,data=car_eval)
# prp(model, faclen=3)
pred<-predict(model, x ,type="class")
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[11]]<-c("classification and Regression Trees","CART",confusionMatrix(mtab))
cm[[11]]$table
##
## unacc acc good vgood
## unacc 1161 7 0 0
## acc 45 358 0 13
## good 4 16 60 0
## vgood 0 3 9 52
cm[[11]]$overall[1]
## Accuracy
## 0.9438657
3.C2 OneR
library(RWeka)
model<-OneR(fmla,data=car_eval)
pred<-predict(model,x,type="class")
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[12]]<-c("One R","ONE-R",confusionMatrix(mtab))
cm[[12]]$table
##
## unacc acc vgood good
## unacc 1210 384 65 69
## acc 0 0 0 0
## vgood 0 0 0 0
## good 0 0 0 0
cm[[12]]$overall[1]
## Accuracy
## 0.7002315
3.C3 C4.5
library(RWeka)
model<-J48(fmla,data=car_eval)
pred<-predict(model,x)
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[13]]<-c("C4.5","C45",confusionMatrix(mtab))
cm[[13]]$table
##
## unacc acc vgood good
## unacc 1172 0 0 0
## acc 35 380 4 9
## vgood 0 2 55 3
## good 3 2 6 57
cm[[13]]$overall[1]
## Accuracy
## 0.962963
3.C4 PART
library(RWeka)
model<-PART(fmla,data=car_eval)
pred<-predict(model,x)
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[14]]<-c("PART","PART",confusionMatrix(mtab))
cm[[14]]$table
##
## unacc acc vgood good
## unacc 1196 6 0 1
## acc 13 374 0 3
## vgood 0 3 65 2
## good 1 1 0 63
cm[[14]]$overall[1]
## Accuracy
## 0.9826389
3.C5 Bagging CART
library(ipred)
model<-bagging(fmla,data=car_eval)
pred<-predict(model,x)
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[15]]<-c("Bagging CART","BAG-CART",confusionMatrix(mtab))
cm[[15]]$table
##
## unacc acc vgood good
## unacc 1210 0 0 0
## acc 0 384 0 0
## vgood 0 0 65 0
## good 0 0 0 69
cm[[15]]$overall[1]
## Accuracy
## 1
3.C6 Random Forest
library(randomForest)
model<-randomForest(fmla,data=car_eval)
pred<-predict(model,x)
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[16]]<-c("Random Forest","RF",confusionMatrix(mtab))
cm[[16]]$table
##
## unacc acc vgood good
## unacc 1208 0 0 0
## acc 2 384 0 0
## vgood 0 0 65 0
## good 0 0 0 69
cm[[16]]$overall[1]
## Accuracy
## 0.9988426
3.C7 Gradient Boosted Machine
library(gbm)
model<-gbm(fmla,data=car_eval,n.trees=5000,interaction.depth=nlev,shrinkage=0.001,bag.fraction=0.8,distribution="multinomial")
prob<-predict(model,x,n.trees=5000,type="response")
pred<-apply(prob,1,which.max)
pred[which(pred=="1")]<-levels(y)[1]
pred[which(pred=="2")]<-levels(y)[2]
pred[which(pred=="3")]<-levels(y)[3]
pred[which(pred=="4")]<-levels(y)[4]
pred<-as.factor(pred)
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[17]]<-c("Gradient Boosted Machine","GBM",confusionMatrix(mtab))
cm[[17]]$table
##
## unacc acc vgood good
## unacc 1191 3 0 0
## acc 16 372 1 0
## vgood 0 0 64 4
## good 3 9 0 65
cm[[17]]$overall[1]
## Accuracy
## 0.9791667
3.C8 Boosted C5.0
library(C50)
model<-C5.0(fmla,data=car_eval,trials=10)
# tree <- rpart(model,data=car_eval,control=rpart.control(minsplit=20,cp=0,digits=6))
# prp(tree,faclen=3)
pred<-predict(model,x)
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[18]]<-c("Boosted C5.0","BOOST-C50",confusionMatrix(mtab))
cm[[18]]$table
##
## unacc acc vgood good
## unacc 1207 1 0 0
## acc 3 383 2 0
## vgood 0 0 63 0
## good 0 0 0 69
cm[[18]]$overall[1]
## Accuracy
## 0.9965278
3.C9 JRip
library(RWeka)
model<-JRip(fmla,data=car_eval)
pred<-predict(model,x)
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[19]]<-c("JRip","JRIP",confusionMatrix(mtab))
cm[[19]]$table
##
## unacc acc vgood good
## unacc 1156 0 0 0
## acc 44 356 2 6
## vgood 4 16 61 3
## good 6 12 2 60
cm[[19]]$overall[1]
## Accuracy
## 0.9450231
3.C10 H2O Deep Learning
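The code for this model is not echoed in the report; the following sketch is reconstructed from the benchmark call in Step 4, with the CSV export and prediction steps assumed:
library(h2o)
localH2O <- h2o.init(nthreads = -1)                               # start a local H2O cluster
write.csv(car_eval, paste0(userdir, "/car_eval.csv"), row.names = FALSE)  # assumed export for upload
car_eval.hex <- h2o.uploadFile(path = paste0(userdir, "/car_eval.csv"), destination_frame = "car_eval.hex")
model <- h2o.deeplearning(x = 1:(nv-1), y = nv, training_frame = car_eval.hex, variable_importances = TRUE)
pred <- as.data.frame(h2o.predict(model, car_eval.hex))$predict   # pull predictions back into R
l <- union(pred, y)
mtab <- table(factor(pred, l), factor(y, l))
cm[[20]] <- c("H2O Deep Learning", "H2O", confusionMatrix(mtab))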
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## C:\Users\MARC_B~1\AppData\Local\Temp\Rtmpglsya5/h2o_Marc_Borowczak_started_from_r.out
## C:\Users\MARC_B~1\AppData\Local\Temp\Rtmpglsya5/h2o_Marc_Borowczak_started_from_r.err
##
##
## Starting H2O JVM and connecting: ... Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 9 seconds 10 milliseconds
## H2O cluster version: 3.8.1.3
## H2O cluster name: H2O_started_from_R_Marc_Borowczak_dvn595
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.44 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## R Version: R version 3.2.4 Revised (2016-03-16 r70336)
## acc good unacc vgood
## acc 319 26 36 3
## good 0 66 0 3
## unacc 24 9 1177 0
## vgood 0 0 0 65
## Accuracy
## 0.9415509
Step 4. Performance Comparison
We now gather all analysis results from the confusion matrix list into a results data frame for comparison.
library(h2o)
localH2O=h2o.init(nthreads=-1)
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## C:\Users\MARC_B~1\AppData\Local\Temp\RtmpUrf7Ob/h2o_Marc_Borowczak_started_from_r.out
## C:\Users\MARC_B~1\AppData\Local\Temp\RtmpUrf7Ob/h2o_Marc_Borowczak_started_from_r.err
##
##
## Starting H2O JVM and connecting: .... Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 13 seconds 909 milliseconds
## H2O cluster version: 3.8.1.3
## H2O cluster name: H2O_started_from_R_Marc_Borowczak_abx142
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.44 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## R Version: R version 3.2.4 Revised (2016-03-16 r70336)
h2o.no_progress()
car_eval.hex=h2o.uploadFile(path=paste0(userdir,"/car_eval.csv"),destination_frame="car_eval.hex")
library(microbenchmark)
mbm<-microbenchmark(
m1<-multinom(fmla, data = car_eval, maxit = 500, trace=FALSE),
m2<-vglm(fmla, family = "multinomial", data = car_eval, maxit = 100),
m3<-lda(fmla,data=car_eval),
m4<-mda(fmla,data=car_eval),
m5<-rda(fmla,data=car_eval,gamma = 0.05,lambda = 0.01),
m6<-nnet(fmla,data=car_eval,size = 4, decay = 0.0001, maxit = 700,trace=FALSE),
m7<-fda(fmla,data=car_eval),
m8<-ksvm(fmla,data=car_eval),
m9<-knn3(fmla,data=car_eval,k=nlev+1),
m10<-naiveBayes(fmla,data=car_eval,k=nlev+1),
m11<-rpart(fmla,data=car_eval),
m12<-OneR(fmla,data=car_eval),
m13<-J48(fmla,data=car_eval),
m14<-PART(fmla,data=car_eval),
m15<-bagging(fmla,data=car_eval),
m16<-randomForest(fmla,data=car_eval),
m17<-gbm(fmla,data=car_eval,n.trees=5000,interaction.depth=nlev,shrinkage=0.001,bag.fraction=0.8,distribution="multinomial"),
m18<-C5.0(fmla,data=car_eval,trials=10),
m19<-JRip(fmla,data=car_eval),
m20<-h2o.deeplearning(x=1:(nv-1), y=nv, training_frame = car_eval.hex, variable_importances = TRUE),
times=5)
h2o.shutdown(prompt = FALSE)
## [1] TRUE
#
library(dplyr)
models<-length(cm)
mbm$expr<-rep(sapply(1:models, function(i) {cm[[i]][[2]]}),5)
mbm<-aggregate(x=mbm$time,by=list(Model=mbm$expr),FUN=mean)
mbm$x<-mbm$x/min(mbm$x)
results<-sapply (1:models, function(i) {c(cm[[i]][[1]],cm[[i]][[2]],mbm$x[i],cm[[i]]$overall[1:6])})
row.names(results)<-c("Description","Model","Model_Time_X",names(cm[[1]]$overall[1:6]))
results<-as.data.frame(t(results))
results[,3:9]<-sapply(3:9,function(i){results[,i]<-as.numeric(levels(results[,i])[results[,i]])})
results<-results[,-(8:9)]
results<-arrange(results,desc(Accuracy))
We are now ready to chart the results as polar bar charts, ordered by decreasing value to ease comparisons.
library(ggplot2)
library(gridExtra)
res<-data.frame(Model=results$Model,Accuracy=results$Accuracy,Speed=1/results$Model_Time_X,Overall=results$Accuracy/results$Model_Time_X)
library(RColorBrewer)
myPalette <- colorRampPalette(rev(brewer.pal(12, "Set3")))
sc <- scale_colour_gradientn(colours = myPalette(256), limits=c(0.8, 1))
g<-ggplot(res,aes(x=reorder(Model,-Accuracy),y=Accuracy,fill=Model))+
geom_bar(stat="identity")+
coord_polar(theta="x",direction=1)+
labs(x="Machine Learning Model",y="Prediction Accuracy")+
theme(legend.position="bottom",legend.box="horizontal")+
ggtitle('Car Evaluation Dataset Accuracy Performance')
g
[Figure: 'Car Evaluation Dataset Accuracy Performance', a polar bar chart of Prediction Accuracy by Machine Learning Model, in decreasing order from BAG-CART to ONE-R.]
g<-ggplot(res,aes(x=reorder(Model,-Speed),y=Speed,fill=Model))+
geom_bar(stat="identity")+
coord_polar(theta="x",direction=1)+
labs(x="Machine Learning Model",y="Prediction Speed")+
theme(legend.position="bottom",legend.box="horizontal")+
ggtitle('Car Evaluation Dataset Speed Performance')
g
[Figure: 'Car Evaluation Dataset Speed Performance', a polar bar chart of Prediction Speed by Machine Learning Model, in decreasing order from CART to GLM.]
g<-ggplot(res,aes(x=reorder(Model,-Overall),y=Overall,fill=Model))+
geom_bar(stat="identity")+
coord_polar(theta="x",direction=1)+
labs(x="Machine Learning Model",y="Prediction Overall")+
theme(legend.position="bottom",legend.box="horizontal")+
ggtitle('Car Evaluation Dataset Overall Performance')
g
[Figure: 'Car Evaluation Dataset Overall Performance', a polar bar chart of Overall (Accuracy x Speed) by Machine Learning Model, in decreasing order from CART to GLM.]
We conclude this dataset analysis by tabulating the results obtained.
kable(res)
Model Accuracy Speed Overall
BAG-CART 1.0000000 0.0093808 0.0093808
RF 0.9988426 0.0966751 0.0965632
BOOST-C50 0.9965278 0.0624463 0.0622294
NNET 0.9895833 0.1358213 0.1344065
PART 0.9826389 0.0272376 0.0267647
GBM 0.9791667 0.2219264 0.2173029
SVM 0.9687500 0.0893670 0.0865743
C45 0.9629630 0.0434614 0.0418517
MULTINOM 0.9456019 0.0685611 0.0648315
GLM 0.9456019 0.0011533 0.0010906
JRIP 0.9450231 0.1148750 0.1085596
CART 0.9438657 1.0000000 0.9438657
H2O 0.9415509 0.0013525 0.0012735
KNN 0.9230324 0.0398154 0.0367509
MDA 0.9212963 0.3604409 0.3320729
LDA 0.9010417 0.0472147 0.0425424
FDA 0.8998843 0.0013615 0.0012252
RDA 0.8778935 0.0062438 0.0054814
NBAYES 0.8738426 0.0263502 0.0230260
ONE-R 0.7002315 0.0045960 0.0032182
Step 5: Retrieve 2nd Dataset
We now repeat with the nursery dataset. Again, We will use R to demonstrate quickly the approach on
this dataset, and its full description. We continue to maintain reproducibility of the analysis as a general
practice. The analysis tool and platform are documented, all libraries clearly listed, while data is retrieved
programmatically and date stamped from the repository.
We will display a structure of the Nursery dataset and the corresponding dictionary to translate the property
factors.
datadir <- "./data"
if (!file.exists("data")){dir.create("data")}
uciUrl <- "http://archive.ics.uci.edu/ml/machine-learning-databases/"
fileUrl <- paste0(uciUrl,"nursery/nursery.data?accessType=DOWNLOAD")
download.file(fileUrl,destfile="./data/nurserydata.csv",method="curl")
dateDownloaded <- date()
nursery <- read.csv("./data/nurserydata.csv",header=FALSE)
fileUrl <- paste0(uciUrl,"nursery/nursery.names?accessType=DOWNLOAD")
download.file(fileUrl,destfile="./data/nurserynames.txt")
txt <- readLines("./data/nurserynames.txt")
lns <- data.frame(beg=which(grepl("parents usual",txt)),end=which(grepl("priority, not_recom",txt)))
# capture text between beg and end from txt
res <- lapply(seq_along(lns$beg),
function(l){paste(txt[seq(from=lns$beg[l],to=lns$end[l],by=1)],collapse=" ")})
res <- gsub(" ", ":", res, fixed = TRUE)
res <- gsub(" ", ":", res, fixed = TRUE)
res <- gsub(" ", ":", res, fixed = TRUE)
res <- gsub(" ", ":", res, fixed = TRUE)
res <- gsub(" ", "\n", res, fixed = TRUE)
res <- gsub(" ", "", res, fixed = TRUE)
res <- gsub(" ", "", res, fixed = TRUE)
res <- gsub(" ", "", res, fixed = TRUE)
res <- str_c(res,"\n")
writeLines(res,"./data/n_parsed_attr.csv")
attrib <- readLines("./data/n_parsed_attr.csv")
nv <- length(attrib) # number of attributes
attrib <- sapply (1:nv,function(i) {gsub(":"," ",attrib[i],fixed=TRUE)})
dictionary <- sapply (1:nv,function(i) {strsplit(attrib[i],' ')})
dictionary[[nv]][1]<-"class"
colnames(nursery)<-sapply(1:nv,function(i) {colnames(nursery)[i]<-dictionary[[i]][1]})
x<-nursery[,1:(nv-1)]
y<-nursery[,nv]
fmla<-paste(colnames(nursery)[1:(nv-1)],collapse="+")
fmla<-paste0(colnames(nursery)[nv],"~",fmla)
fmla<-as.formula(fmla)
nlev<-nlevels(y) # number of factors describing class
Step 6. Data Exploration
head(nursery)
## parents has_nurs form children housing finance social
## 1 usual proper complete 1 convenient convenient nonprob
## 2 usual proper complete 1 convenient convenient nonprob
## 3 usual proper complete 1 convenient convenient nonprob
## 4 usual proper complete 1 convenient convenient slightly_prob
## 5 usual proper complete 1 convenient convenient slightly_prob
## 6 usual proper complete 1 convenient convenient slightly_prob
## health class
## 1 recommended recommend
## 2 priority priority
## 3 not_recom not_recom
## 4 recommended recommend
## 5 priority priority
## 6 not_recom not_recom
summary(nursery)
## parents has_nurs form children
## great_pret :4320 critical :2592 complete :3240 1 :3240
## pretentious:4320 improper :2592 completed :3240 2 :3240
## usual :4320 less_proper:2592 foster :3240 3 :3240
## proper :2592 incomplete:3240 more:3240
## very_crit :2592
## housing finance social
## convenient:4320 convenient:6480 nonprob :4320
## critical :4320 inconv :6480 problematic :4320
## less_conv :4320 slightly_prob:4320
##
##
## health class
## not_recom :4320 not_recom :4320
## priority :4320 priority :4266
## recommended:4320 recommend : 2
## spec_prior:4044
## very_recom: 328
str(nursery)
## 'data.frame': 12960 obs. of 9 variables:
## $ parents : Factor w/ 3 levels "great_pret","pretentious",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ has_nurs: Factor w/ 5 levels "critical","improper",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ form : Factor w/ 4 levels "complete","completed",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ children: Factor w/ 4 levels "1","2","3","more": 1 1 1 1 1 1 1 1 1 1 ...
## $ housing : Factor w/ 3 levels "convenient","critical",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ finance : Factor w/ 2 levels "convenient","inconv": 1 1 1 1 1 1 1 1 1 2 ...
## $ social : Factor w/ 3 levels "nonprob","problematic",..: 1 1 1 3 3 3 2 2 2 1 ...
## $ health : Factor w/ 3 levels "not_recom","priority",..: 3 2 1 3 2 1 3 2 1 3 ...
## $ class : Factor w/ 5 levels "not_recom","priority",..: 3 2 1 3 2 1 2 2 1 5 ...
m<-cm<-list() # initialize
Exploration reveals that the multinomial Nursery dataset comprises 8 attribute variables to predict the 5-level
nursery recommendation class, with no missing data.
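Note also the severe class imbalance visible in the summary: the recommend class counts only 2 of 12,960 observations, which is why several confusion matrices below contain an all-zero recommend row. A quick check, as a sketch:
round(prop.table(table(y)), 4)   # 'recommend' is ~0.0002 of all observations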
Step 7. Classification Analysis
We will perform a total of 20 analysis using the same 3 groups: A)Linear, B)Non-linear and C)Non-Linear
Classification with decision trees, sequentially and then compare their outcome in terms of prediction accuracy,
comparing predicted vs. actual dependent variable.
For each analysis, we follow the sampe protocol: invoque the package library, build the model, summarize it
and predict the dependent variable, the save the accuracy data and predicted values.
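The twenty fits of Step 3 are repeated verbatim with nursery substituted for car_eval, so only the H2O output is reproduced here. One representative refit, as a sketch:
model <- randomForest(fmla, data = nursery)   # same pattern for the other 19 models
pred <- predict(model, x)
l <- union(pred, y)
mtab <- table(factor(pred, l), factor(y, l))
cm[[16]] <- c("Random Forest", "RF", confusionMatrix(mtab))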
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## C:\Users\MARC_B~1\AppData\Local\Temp\Rtmpglsya5/h2o_Marc_Borowczak_started_from_r.out
## C:\Users\MARC_B~1\AppData\Local\Temp\Rtmpglsya5/h2o_Marc_Borowczak_started_from_r.err
##
##
## Starting H2O JVM and connecting: ... Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 8 seconds 961 milliseconds
## H2O cluster version: 3.8.1.3
## H2O cluster name: H2O_started_from_R_Marc_Borowczak_doh554
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.44 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## R Version: R version 3.2.4 Revised (2016-03-16 r70336)
## not_recom priority recommend spec_prior
## not_recom 3316 0 0 2
## priority 0 3105 0 165
## recommend 0 0 0 0
## spec_prior 0 11 0 3067
## Accuracy
## 0.9815849
Step 8. Performance Comparison
library(h2o)
localH2O=h2o.init(nthreads=-1)
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## C:\Users\MARC_B~1\AppData\Local\Temp\RtmpUrf7Ob/h2o_Marc_Borowczak_started_from_r.out
## C:\Users\MARC_B~1\AppData\Local\Temp\RtmpUrf7Ob/h2o_Marc_Borowczak_started_from_r.err
##
##
## Starting H2O JVM and connecting: .... Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 9 seconds 761 milliseconds
## H2O cluster version: 3.8.1.3
## H2O cluster name: H2O_started_from_R_Marc_Borowczak_gcl376
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.44 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## R Version: R version 3.2.4 Revised (2016-03-16 r70336)
h2o.no_progress()
nursery.hex=h2o.uploadFile(path=paste0(userdir,"/nursery.csv"),destination_frame="nursery.hex")
library(microbenchmark)
mbm<-microbenchmark(
m1<-multinom(fmla, data = nursery, maxit = 1200, trace=FALSE),
m2<-vglm(fmla, family = "multinomial", data = nursery, maxit = 100),
m3<-lda(fmla,data=nursery),
m4<-mda(fmla,data=nursery),
m5<-rda(fmla,data=nursery,gamma = 0.05,lambda = 0.01),
m6<-nnet(fmla,data=nursery,size = 4, decay = 0.0001, maxit = 1200,trace=FALSE),
m7<-fda(fmla,data=nursery),
m8<-ksvm(fmla,data=nursery),
m9<-knn3(fmla,data=nursery,k=nlev+1),
m10<-naiveBayes(fmla,data=nursery,k=nlev+1),
m11<-rpart(fmla,data=nursery),
m12<-OneR(fmla,data=nursery),
m13<-J48(fmla,data=nursery),
m14<-PART(fmla,data=nursery),
m15<-bagging(fmla,data=nursery),
m16<-randomForest(fmla,data=nursery),
m17<-gbm(fmla,data=nursery,n.trees=5000,interaction.depth=nlev,shrinkage=0.001,bag.fraction=0.8,distribution="multinomial"),
m18<-C5.0(fmla,data=nursery,trials=10),
m19<-JRip(fmla,data=nursery),
m20<-h2o.deeplearning(x=1:(nv-1), y=nv, training_frame = nursery.hex, variable_importances = TRUE),
times=5)
h2o.shutdown(prompt = FALSE)
## [1] TRUE
#
library(dplyr)
models<-length(cm)
mbm$expr<-rep(sapply(1:models, function(i) {cm[[i]][[2]]}),5)
mbm<-aggregate(x=mbm$time,by=list(Model=mbm$expr),FUN=mean)
mbm$x<-mbm$x/min(mbm$x)
results<-sapply (1:models, function(i) {c(cm[[i]][[1]],cm[[i]][[2]],mbm$x[i],cm[[i]]$overall[1:6])})
row.names(results)<-c("Description","Model","Model_Time_X",names(cm[[1]]$overall[1:6]))
results<-as.data.frame(t(results))
results[,3:9]<-sapply(3:9,function(i){results[,i]<-as.numeric(levels(results[,i])[results[,i]])})
results<-results[,-(8:9)]
results<-arrange(results,desc(Accuracy))
#
# Performance Plot
library(ggplot2)
library(gridExtra)
res<-data.frame(Model=results$Model,Accuracy=results$Accuracy,Speed=1/results$Model_Time_X,Overall=results$Accuracy/results$Model_Time_X)
library(RColorBrewer)
myPalette <- colorRampPalette(rev(brewer.pal(12, "Set3")))
sc <- scale_colour_gradientn(colours = myPalette(256), limits=c(0.8, 1))
We are now ready to chart, and will again compare the models on polar bar charts.
[Figure: 'Nursery Dataset Accuracy Performance', a polar bar chart of Prediction Accuracy by Machine Learning Model, in decreasing order from BAG-CART to LDA.]
[Figure: 'Nursery Dataset Speed Performance', a polar bar chart of Prediction Speed by Machine Learning Model, in decreasing order from ONE-R to H2O.]
[Figure: 'Nursery Dataset Overall Performance', a polar bar chart of Overall (Accuracy x Speed) by Machine Learning Model, in decreasing order from MDA to H2O.]
We conclude this second dataset analysis by tabulating the results obtained.
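As before, the table is rendered with kable (the call itself is assumed, matching Step 4):
kable(res)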
Model Accuracy Speed Overall
BAG-CART 0.9999228 0.0094075 0.0094068
BOOST-C50 0.9997685 0.3000380 0.2999686
PART 0.9986883 0.1309383 0.1307665
SVM 0.9932099 0.0125290 0.0124439
RF 0.9916667 0.0045268 0.0044891
JRIP 0.9821759 0.0040615 0.0039892
NNET 0.9818673 0.0006675 0.0006554
H2O 0.9815849 0.0005785 0.0005678
C45 0.9813272 0.0491118 0.0481947
KNN 0.9768519 0.0699498 0.0683306
MDA 0.9331019 0.9991693 0.9323268
MULTINOM 0.9262346 0.0006674 0.0006182
GLM 0.9262346 0.1569507 0.1453732
FDA 0.9130401 0.0077249 0.0070531
NBAYES 0.9030864 0.0052343 0.0047270
RDA 0.8983025 0.0076060 0.0068325
CART 0.8726852 0.0055864 0.0048752
GBM 0.8726852 0.0284867 0.0248599
ONE-R 0.7097222 1.0000000 0.7097222
LDA 0.5483796 0.0037381 0.0020499
Step 9. Data Analysis
Comparing accuracy across the 2 datasets, we observe that BAG-CART, RF and BOOST-C50 top the list at
more than 99% accuracy, while NNET, PART, GBM, SVM and C45 exceed 95% accuracy on the smaller Car
Evaluation dataset. On the Nursery dataset, BAG-CART, BOOST-C50, PART, SVM and RF exceed 99%
accuracy, while JRIP, NNET, H2O, C45 and KNN exceed 95% accuracy.
From these observations, we should definitely include BAG-CART, BOOST-C50 and RF as prime models to
tackle multinomial data. Including NNET, PART, SVM and C45 as a second tier also seems a good idea. To be
complete, a third group should include JRIP, H2O and KNN. For reference, if only the main dependency is
desired, ONE-R can pinpoint it with more than 70% accuracy.
Speed can also be a determining factor when selecting a model. The microbenchmark data let us compare the
average time used by each of the 20 models over 5 runs. Normalized times are obtained by dividing all times by
the minimum time recorded, and transforming time into Speed takes the reciprocal of these values.
The Overall ranking is obtained here by simply forming the product Accuracy x Speed, as restated below.
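In code, the normalization and combined ranking restate the steps already used in the performance comparisons:
mbm$x <- mbm$x / min(mbm$x)           # normalize: the fastest model gets time 1
Speed <- 1 / results$Model_Time_X     # reciprocal of normalized time
Overall <- results$Accuracy * Speed   # combined Accuracy x Speed ranking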
The combined ranking is dominated by Speed, with only CART, MDA and GBM as contenders for the Car
Evaluation dataset, consistent with the tabulated Overall values above. For the more complex Nursery dataset, a
different outcome is observed, with ONE-R, MDA and BOOST-C50 as the fastest and overall best predictors.
Step 10. Conclusions
We have compared 20 Machine Learning models, benchmarking their accuracy and speed on 2 multinomial
datasets. Although the accuracy ranking was fairly consistent across the 2 datasets, forming a 3-tier grouping,
model execution speed, often also a selection factor, showed strong dataset dependencies, so the combined
ranking remains strongly dataset dependent.
For data scientists this means that benchmarking should remain on the front burner, not the back burner, of the
activity list, along with continuous monitoring of new, more efficient, distributed and/or parallelized algorithms
and their effects on different hardware platforms.
We have evaluated the accuracy on 2 multinomial datasets, and the analysis led to a 3-tier grouping; Speed
ranking, however, can narrow the options. We will continue to benchmark new Machine Learning tools by
applying them to broader datasets.
References
The following sources are referenced as they provided significant help and information in developing this Machine
Learning analysis:
1. UCI Machine Learning Repository
2. Car Evaluation Database documentation, donated by marko.bohanec@ijs.si, June 1997
3. Nursery Database documentation, donated by marko.bohanec@ijs.si, June 1997
4. stringr: Simple, Consistent Wrappers for Common String Operations
5. RWeka: R/Weka Interface
6. C50: Decision Trees and Rule-Based Models
7. rpart: Recursive Partitioning and Regression Trees
8. rpart.plot: Plot 'rpart' Models: An Enhanced Version of 'plot.rpart'
9. rattle: A Graphical User Interface for Data Mining using R
10. VGAM: Vector Generalized Linear and Additive Models
11. MASS: Support Functions and Datasets for Venables and Ripley's MASS
12. mda: Mixture and Flexible Discriminant Analysis
13. klaR: Classification and Visualization
14. nnet: Feed-Forward Neural Networks and Multinomial Log-Linear Models
15. kernlab: Kernel-Based Machine Learning Lab
16. caret (short for Classification And REgression Training): a set of functions that attempt to streamline the
process of creating predictive models
17. e1071: Functions for latent class analysis, short-time Fourier transform, fuzzy clustering, support vector
machines, shortest-path computation, bagged clustering, naive Bayes classifiers, etc.
18. ipred: Improved Predictors
19. randomForest: Breiman and Cutler's Random Forests for Classification and Regression
20. gbm: Generalized Boosted Regression Models
21. H2O Deep Learning: R Interface for H2O
22. H2O: H2O.ai documentation
23. gridExtra: Miscellaneous Functions for Grid Graphics
24. knitr: A General-Purpose Package for Dynamic Report Generation in R
25. RStudio: Open-source and enterprise-ready professional software for R

More Related Content

Viewers also liked

Trabajo de comunicacion liana chalco 5_c
Trabajo de comunicacion liana chalco 5_cTrabajo de comunicacion liana chalco 5_c
Trabajo de comunicacion liana chalco 5_c
Institución Educativa "Uriel García" del Cusco
 
Avaliação preliminar
Avaliação preliminarAvaliação preliminar
Avaliação preliminar
Denise Oliveira
 
Influencia del internet en la sociedad yeny tintaya 5_c
Influencia del internet en la sociedad yeny tintaya 5_cInfluencia del internet en la sociedad yeny tintaya 5_c
Influencia del internet en la sociedad yeny tintaya 5_c
Institución Educativa "Uriel García" del Cusco
 
Avaliação do potencial de contaminação ambiental em imóveis
Avaliação do potencial de contaminação ambiental em imóveisAvaliação do potencial de contaminação ambiental em imóveis
Avaliação do potencial de contaminação ambiental em imóveis
Denise Oliveira
 
Influencia del internet en la sociedad lucero paz 5_b _31
Influencia del internet en la sociedad  lucero paz 5_b _31Influencia del internet en la sociedad  lucero paz 5_b _31
Influencia del internet en la sociedad lucero paz 5_b _31
Institución Educativa "Uriel García" del Cusco
 
Curso adm 479 inglés funcional de conversación
Curso adm 479   inglés funcional de conversaciónCurso adm 479   inglés funcional de conversación
Curso adm 479 inglés funcional de conversaciónProcasecapacita
 
Communication using network
Communication using networkCommunication using network
Communication using network
Online
 
La influencia del internet en la sociedad ruth naupa 5_c
La influencia del internet en la sociedad ruth naupa 5_cLa influencia del internet en la sociedad ruth naupa 5_c
La influencia del internet en la sociedad ruth naupa 5_c
Institución Educativa "Uriel García" del Cusco
 
Components of computers
Components of computersComponents of computers
Components of computers
Online
 
2016 Accounting and finance salaries in Leeds
2016 Accounting and finance salaries in Leeds2016 Accounting and finance salaries in Leeds
2016 Accounting and finance salaries in Leeds
Robert Half UK
 
Effective listening
Effective listeningEffective listening
Effective listening
Online
 

Viewers also liked (12)

Trabajo de comunicacion liana chalco 5_c
Trabajo de comunicacion liana chalco 5_cTrabajo de comunicacion liana chalco 5_c
Trabajo de comunicacion liana chalco 5_c
 
Avaliação preliminar
Avaliação preliminarAvaliação preliminar
Avaliação preliminar
 
Influencia del internet en la sociedad yeny tintaya 5_c
Influencia del internet en la sociedad yeny tintaya 5_cInfluencia del internet en la sociedad yeny tintaya 5_c
Influencia del internet en la sociedad yeny tintaya 5_c
 
Avaliação do potencial de contaminação ambiental em imóveis
Avaliação do potencial de contaminação ambiental em imóveisAvaliação do potencial de contaminação ambiental em imóveis
Avaliação do potencial de contaminação ambiental em imóveis
 
tar_Deg_eng
tar_Deg_engtar_Deg_eng
tar_Deg_eng
 
Influencia del internet en la sociedad lucero paz 5_b _31
Influencia del internet en la sociedad  lucero paz 5_b _31Influencia del internet en la sociedad  lucero paz 5_b _31
Influencia del internet en la sociedad lucero paz 5_b _31
 
Curso adm 479 inglés funcional de conversación
Curso adm 479   inglés funcional de conversaciónCurso adm 479   inglés funcional de conversación
Curso adm 479 inglés funcional de conversación
 
Communication using network
Communication using networkCommunication using network
Communication using network
 
La influencia del internet en la sociedad ruth naupa 5_c
La influencia del internet en la sociedad ruth naupa 5_cLa influencia del internet en la sociedad ruth naupa 5_c
La influencia del internet en la sociedad ruth naupa 5_c
 
Components of computers
Components of computersComponents of computers
Components of computers
 
2016 Accounting and finance salaries in Leeds
2016 Accounting and finance salaries in Leeds2016 Accounting and finance salaries in Leeds
2016 Accounting and finance salaries in Leeds
 
Effective listening
Effective listeningEffective listening
Effective listening
 

Similar to Benchmarking_ML_Tools

Machine Learning Project - Neural Network
Machine Learning Project - Neural Network Machine Learning Project - Neural Network
Machine Learning Project - Neural Network
HamdaAnees
 
Comparison of Top Data Mining(Final)
Comparison of Top Data Mining(Final)Comparison of Top Data Mining(Final)
Comparison of Top Data Mining(Final)Sanghun Kim
 
Arules_TM_Rpart_Markdown
Arules_TM_Rpart_MarkdownArules_TM_Rpart_Markdown
Arules_TM_Rpart_MarkdownAdrian Cuyugan
 
Rapport d'analyse Dimensionality Reduction
Rapport d'analyse Dimensionality ReductionRapport d'analyse Dimensionality Reduction
Rapport d'analyse Dimensionality Reduction
Matthieu Cisel
 
A Survey on Stroke Prediction
A Survey on Stroke PredictionA Survey on Stroke Prediction
A Survey on Stroke Prediction
MohammadRakib8
 
A survey on heart stroke prediction
A survey on heart stroke predictionA survey on heart stroke prediction
A survey on heart stroke prediction
drubosaha
 
Hadoop as an extension of DW
Hadoop as an extension of DWHadoop as an extension of DW
Hadoop as an extension of DW
Sidi yazid
 
A Machine Learning approach to predict Software Defects
A Machine Learning approach to predict Software DefectsA Machine Learning approach to predict Software Defects
A Machine Learning approach to predict Software Defects
Chetan Hireholi
 
Microsoft Professional Capstone: Data Science
Microsoft Professional Capstone: Data ScienceMicrosoft Professional Capstone: Data Science
Microsoft Professional Capstone: Data Science
Mashfiq Shahriar
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupDoug Needham
 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science Challenge
Mark Nichols, P.E.
 
Data Mining & Analytics for U.S. Airlines On-Time Performance
Data Mining & Analytics for U.S. Airlines On-Time Performance Data Mining & Analytics for U.S. Airlines On-Time Performance
Data Mining & Analytics for U.S. Airlines On-Time Performance
Mingxuan Li
 
Practical Data Science: Data Modelling and Presentation
Practical Data Science: Data Modelling and PresentationPractical Data Science: Data Modelling and Presentation
Practical Data Science: Data Modelling and Presentation
HariniMS1
 
CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...
CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...
CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...
vinoth raja
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
Abhishek M Shivalingaiah
 

Similar to Benchmarking_ML_Tools (20)

Machine Learning Project - Neural Network
Machine Learning Project - Neural Network Machine Learning Project - Neural Network
Machine Learning Project - Neural Network
 
Comparison of Top Data Mining(Final)
Comparison of Top Data Mining(Final)Comparison of Top Data Mining(Final)
Comparison of Top Data Mining(Final)
 
Arules_TM_Rpart_Markdown
Arules_TM_Rpart_MarkdownArules_TM_Rpart_Markdown
Arules_TM_Rpart_Markdown
 
Rapport d'analyse Dimensionality Reduction
Rapport d'analyse Dimensionality ReductionRapport d'analyse Dimensionality Reduction
Rapport d'analyse Dimensionality Reduction
 
A Survey on Stroke Prediction
A Survey on Stroke PredictionA Survey on Stroke Prediction
A Survey on Stroke Prediction
 
A survey on heart stroke prediction
A survey on heart stroke predictionA survey on heart stroke prediction
A survey on heart stroke prediction
 
Aregay_Msc_EEMCS
Aregay_Msc_EEMCSAregay_Msc_EEMCS
Aregay_Msc_EEMCS
 
Hadoop as an extension of DW
Hadoop as an extension of DWHadoop as an extension of DW
Hadoop as an extension of DW
 
A Machine Learning approach to predict Software Defects
A Machine Learning approach to predict Software DefectsA Machine Learning approach to predict Software Defects
A Machine Learning approach to predict Software Defects
 
Microsoft Professional Capstone: Data Science
Microsoft Professional Capstone: Data ScienceMicrosoft Professional Capstone: Data Science
Microsoft Professional Capstone: Data Science
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup Group
 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science Challenge
 
Data Mining & Analytics for U.S. Airlines On-Time Performance
Data Mining & Analytics for U.S. Airlines On-Time Performance Data Mining & Analytics for U.S. Airlines On-Time Performance
Data Mining & Analytics for U.S. Airlines On-Time Performance
 
Practical Data Science: Data Modelling and Presentation
Practical Data Science: Data Modelling and PresentationPractical Data Science: Data Modelling and Presentation
Practical Data Science: Data Modelling and Presentation
 
MSc_Thesis
MSc_ThesisMSc_Thesis
MSc_Thesis
 
KHAN_FAHAD_FL14
KHAN_FAHAD_FL14KHAN_FAHAD_FL14
KHAN_FAHAD_FL14
 
CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...
CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...
CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...
 
Ashwin_Thesis
Ashwin_ThesisAshwin_Thesis
Ashwin_Thesis
 
Report
ReportReport
Report
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
 

Benchmarking_ML_Tools

  • 1. Benchmarking 20 Machine Learning Models Accuracy and Speed Marc Borowczak, PRC Consulting LLC March 29, 2016 Contents Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Step 0: Selection & Reproducibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Step 1: Retrieve 1st Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Step 2. Data Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Step 3. Classification Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Step 4. Performance Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Step 5: Retrieve 2nd Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 Step 6. Data Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 Step 7. Classification Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 Step 8. Performance Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 Step 9. Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Step 10. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Summary As Machine Learning tools become mainstream, and ever-growing choice of these is available to data scientists and analysts, the need to assess those best suited becomes challenging. In this study, 20 Machine Learning models were benchmarked for their accuracy and speed performance on a multi-core hardware, when applied to 2 multinomial datasets differing broadly in size and complexity. It was observed that BAG-CART, RF and BOOST-C50 top the list at more than 99% accuracy while NNET, PART, GBM, SVM and C45 exceeded 95% accuracy on the small Car Evaluation dataset. On the larger and more complex Nursery dataset, we observed BAG-CART, BOOST-C50, PART, SVM and RF exceeded 99% accuracy, while JRIP, NNET, H2O, C45, and KNN exceeded 95% accuracy. However, overwhelming dependencies on Speed (determined on an average of 5-runs) were observed on a multicore hardware, with only CART, MDA and GBM as contenders for the Car Evaluation dataset. For the more complex Nursery dataset, a different outcome was observed, with MDA, ONE-R and BOOST-C50 as fastest and overall best predictors. The implications for the Data Analytics Leaders are to continue allocating resources to insure Machine Learning benchmarks are conducted regularly, documented and communicated thru the Analysts teams, and to insure the most efficient tools based on established criteria are applied on day-to-day operations. The implications of these findings for data scientists are to retain benchmarking tasks in the front- and not on the back-burner of activities’ list, and to continue monitoring new, more efficient and distributed and/or parallellized algorithms and their effects on various hardware platforms. Ultimately, finding the best tool depends strongly on criteria selection and certainly on hardware platforms available. Therefore this benchmarking task may well rest on the data analyst leaders’ and engineers’ to-do list for the foreseeable future. 1
  • 2. Step 0: Selection & Reproducibility As Machine Learning gains attention, more applications and models are being used, and often speed or accuracy of the predicted models lack comparisons. In this analysis, we’ll compare the accuracy and speed of 20 Machine learning models commonly selected. We’ll exercise these models on two multinomial UCI reference datasets, differing in size, predictor levels and number of dependent variables. The first and smallest dataset is Car_Evaluation and we’ll compare results when applying on the larger and more complex Nursery dataset. Sys.info()[1:5] ## sysname release version nodename machine ## "Windows" "10 x64" "build 10586" "STALLION" "x86-64" sessionInfo() ## R version 3.2.4 Revised (2016-03-16 r70336) ## Platform: x86_64-w64-mingw32/x64 (64-bit) ## Running under: Windows 10 x64 (build 10586) ## ## locale: ## [1] LC_COLLATE=English_United States.1252 ## [2] LC_CTYPE=English_United States.1252 ## [3] LC_MONETARY=English_United States.1252 ## [4] LC_NUMERIC=C ## [5] LC_TIME=English_United States.1252 ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## loaded via a namespace (and not attached): ## [1] magrittr_1.5 formatR_1.3 tools_3.2.4 htmltools_0.3 ## [5] yaml_2.1.13 stringi_1.0-1 rmarkdown_0.9.5 knitr_1.12.3 ## [9] stringr_1.0.0 digest_0.6.9 evaluate_0.8.3 library(stringr) library(knitr) userdir <- getwd() set.seed(123) Step 1: Retrieve 1st Dataset We will mirror the approach used in the formulation challenge and use first the Car Evaluation dataset hosted on UCI Machine Learning Repository. We will use R to reproducibly and quickly download the dataset, and its full description. We continue to maintain reproducibility of the analysis as a general practice. The analysis tool and platform are documented, all libraries clearly listed, while data is retrieved programmatically and date stamped from the repository. We will display a structure of the Car_Evaluation dataset and the corresponding dictionary to translate the attribute factors. 2
  • 3. datadir <- "./data" if (!file.exists("data")){dir.create("data")} uciUrl <- "http://archive.ics.uci.edu/ml/machine-learning-databases/" fileUrl <- paste0(uciUrl,"car/car.data?accessType=DOWNLOAD") download.file(fileUrl, destfile="./data/cardata.csv", method = "curl") dateDownloaded <- date() car_eval <- read.csv("./data/cardata.csv",header=FALSE) fileUrl <- paste0(uciUrl,"car/car.names?accessType=DOWNLOAD") download.file(fileUrl, destfile = "./data/carnames.txt") txt <- readLines("./data/carnames.txt") lns <- data.frame(beg=which(grepl("buying v-high",txt)),end=which(grepl("med, high",txt))) # we now capture all lines of text between beg and end from txt res <- lapply(seq_along(lns$beg),function(l){paste(txt[seq(from=lns$beg[l],to=lns$end[l],by=1)],collapse res <- gsub(" ", ":", res, fixed = TRUE) res <- gsub(" ", ":", res, fixed = TRUE) res <- gsub(" ", ":", res, fixed = TRUE) res <- gsub(" ", ":", res, fixed = TRUE) res <- gsub(" ", "n", res, fixed = TRUE) res <- gsub(" ", "", res, fixed = TRUE) res <- gsub(" ", "", res, fixed = TRUE) res <- str_c(res,"n") writeLines(res, "./data/parsed_attr.csv") attrib <- readLines("./data/parsed_attr.csv") nv <- length(attrib) # number of attributes attrib <- sapply (1:nv,function(i) {gsub(":"," ",attrib[i],fixed=TRUE)}) dictionary <- sapply (1:nv,function(i) {strsplit(attrib[i],' ')}) dictionary[[nv]][1]<-"class" colnames(car_eval)<-sapply(1:nv,function(i) {colnames(car_eval)[i]<-dictionary[[i]][1]}) cm<-list() x<-car_eval[,1:(nv-1)] y<-car_eval[,nv] fmla<-paste(colnames(car_eval)[1:(nv-1)],collapse="+") fmla<-paste0(colnames(car_eval)[nv],"~",fmla) fmla<-as.formula(fmla) nlev<-nlevels(y) # number of factors describing class Step 2. Data Exploration head(car_eval) ## buying maint doors persons lug_boot safety class ## 1 vhigh vhigh 2 2 small low unacc ## 2 vhigh vhigh 2 2 small med unacc ## 3 vhigh vhigh 2 2 small high unacc ## 4 vhigh vhigh 2 2 med low unacc ## 5 vhigh vhigh 2 2 med med unacc ## 6 vhigh vhigh 2 2 med high unacc summary(car_eval) ## buying maint doors persons lug_boot safety 3
Step 2. Data Exploration

head(car_eval)

## buying maint doors persons lug_boot safety class
## 1 vhigh vhigh 2 2 small low unacc
## 2 vhigh vhigh 2 2 small med unacc
## 3 vhigh vhigh 2 2 small high unacc
## 4 vhigh vhigh 2 2 med low unacc
## 5 vhigh vhigh 2 2 med med unacc
## 6 vhigh vhigh 2 2 med high unacc

summary(car_eval)

## buying     maint      doors      persons    lug_boot   safety
## high :432  high :432  2    :432  2   :576   big  :576  high:576
## low  :432  low  :432  3    :432  4   :576   med  :576  low :576
## med  :432  med  :432  4    :432  more:576   small:576  med :576
## vhigh:432  vhigh:432  5more:432
## class
## acc  : 384
## good :  69
## unacc:1210
## vgood:  65

str(car_eval)

## 'data.frame': 1728 obs. of 7 variables:
## $ buying : Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ maint : Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ doors : Factor w/ 4 levels "2","3","4","5more": 1 1 1 1 1 1 1 1 1 1 ...
## $ persons : Factor w/ 3 levels "2","4","more": 1 1 1 1 1 1 1 1 1 2 ...
## $ lug_boot: Factor w/ 3 levels "big","med","small": 3 3 3 2 2 2 1 1 1 3 ...
## $ safety : Factor w/ 3 levels "high","low","med": 2 3 1 2 3 1 2 3 1 2 ...
## $ class : Factor w/ 4 levels "acc","good","unacc",..: 3 3 3 3 3 3 3 3 3 3 ...

cm<-list() # initialize

Exploration reveals the multinomial Car Evaluation dataset comprises 6 attribute factors we can use to predict the 4-level car recommendation class, with no missing data.

Step 3. Classification Analysis

20 models will be sequentially selected to represent 3 groups: A) Linear, B) Non-Linear and C) Non-Linear Classification with decision trees. From the collected confusion-matrix performance, we will build a results data frame and compare the prediction accuracies. For each analysis, we'll follow the same protocol: invoke the package library, build the model, summarize it and predict the class (dependent variable), then save the accuracy data and predicted values in a list.

3.A Linear Classification

3.A1 Multinomial

library(nnet)
library(caret)
model<-multinom(fmla, data = car_eval, maxit = 500, trace=FALSE)
prob<-predict(model,x,type="probs")
pred<-apply(prob,1,which.max)
pred[which(pred=="1")]<-levels(y)[1]
pred[which(pred=="2")]<-levels(y)[2]
pred[which(pred=="3")]<-levels(y)[3]
pred[which(pred=="4")]<-levels(y)[4]
pred<-as.factor(pred)
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[1]]<-c("Multinomial","MULTINOM",confusionMatrix(mtab))
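The four pred[which(pred==...)] lines above map the column index returned by which.max back onto the class labels; the same idiom reappears verbatim for the GLM and GBM models below. As a compact equivalent (a sketch only; prob_to_class is a hypothetical helper, not used in this script):

# Turn an n x k matrix of class probabilities into a factor of predicted
# labels: which.max picks the most probable column per row, and levels(y)
# maps that column index back to the class name.
prob_to_class <- function(prob, y) {
  factor(levels(y)[apply(prob, 1, which.max)], levels = levels(y))
}
# usage: pred <- prob_to_class(prob, y)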
cm[[1]]$table

##       unacc  acc vgood good
## unacc  1166   32     0    0
## acc      43  346     2    6
## vgood     0    2    63    4
## good      1    4     0   59

cm[[1]]$overall[1]

## Accuracy
## 0.9456019

3.A2 Logistic Regression

library(VGAM)
model<-vglm(fmla, family = "multinomial", data = car_eval, maxit = 100)
prob<-predict(model,x,type="response")
pred<-apply(prob,1,which.max)
pred[which(pred=="1")]<-levels(y)[1]
pred[which(pred=="2")]<-levels(y)[2]
pred[which(pred=="3")]<-levels(y)[3]
pred[which(pred=="4")]<-levels(y)[4]
pred<-as.factor(pred)
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[2]]<-c("Logistic Regression","GLM",confusionMatrix(mtab))
cm[[2]]$table

##       unacc  acc vgood good
## unacc  1166   32     0    0
## acc      43  346     2    6
## vgood     0    2    63    4
## good      1    4     0   59

cm[[2]]$overall[1]

## Accuracy
## 0.9456019

3.A3 Linear Discriminant Analysis

library(MASS)
model<-lda(fmla,data=car_eval)
pred<-predict(model,x)$class
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[3]]<-c("Linear Discriminant Analysis","LDA",confusionMatrix(mtab))
cm[[3]]$table
##       unacc  acc vgood good
## unacc  1138   13     0    0
## acc      70  361    27   44
## vgood     0    0    35    2
## good      2   10     3   23

cm[[3]]$overall[1]

## Accuracy
## 0.9010417

3.B Non-Linear Classification

3.B1 Mixture Discriminant Analysis

library(mda)
model<-mda(fmla,data=car_eval)
pred<-predict(model,x)
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[4]]<-c("Mixture Discriminant Analysis","MDA",confusionMatrix(mtab))
cm[[4]]$table

##       unacc  acc vgood good
## unacc  1151   16     0    0
## acc      52  341     7   21
## vgood     0    4    55    3
## good      7   23     3   45

cm[[4]]$overall[1]

## Accuracy
## 0.9212963

3.B2 Regularized Discriminant Analysis

library(klaR)
model<-rda(fmla,data=car_eval,gamma = 0.05,lambda = 0.01)
pred<-predict(model,x)$class
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[5]]<-c("Regularized Discriminant Analysis","RDA",confusionMatrix(mtab))
cm[[5]]$table

##       unacc  acc vgood good
## unacc  1072    0     0    0
## acc     134  317     0    0
## vgood     0   22    65    6
## good      4   45     0   63
cm[[5]]$overall[1]

## Accuracy
## 0.8778935

3.B3 Neural Network

library(nnet)
library(devtools)
model<-nnet(fmla,data=car_eval,size = 4, decay = 0.0001, maxit = 700, trace = FALSE)
#import the function from Github
# source_url('https://gist.githubusercontent.com/Peque/41a9e20d6687f2f3108d/raw/85e14f3a292e126f14548644
# plot.nnet(model, alpha.val = 0.5, cex= 0.7, circle.col = list('lightblue', 'white'), bord.col = 'black
pred<-predict(model,x,type="class")
pred<-as.factor(pred)
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[6]]<-c("Neural Network","NNET",confusionMatrix(mtab))
cm[[6]]$table

##       unacc  acc vgood good
## unacc  1208    2     0    0
## acc       2  376     0    6
## vgood     0    0    64    1
## good      0    6     1   62

cm[[6]]$overall[1]

## Accuracy
## 0.9895833

3.B4 Flexible Discriminant Analysis

library(mda)
model<-fda(fmla,data=car_eval)
pred<-predict(model,x,type="class")
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[7]]<-c("Flexible Discriminant Analysis","FDA",confusionMatrix(mtab))
cm[[7]]$table

##       unacc  acc vgood good
## unacc  1136   13     0    0
## acc      72  361    27   44
## vgood     0    0    35    2
## good      2   10     3   23
cm[[7]]$overall[1]

## Accuracy
## 0.8998843

3.B5 Support Vector Machine

library(kernlab)
model<-ksvm(fmla,data=car_eval)
pred<-predict(model,x,type="response")
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[8]]<-c("Support Vector Machine","SVM",confusionMatrix(mtab))
cm[[8]]$table

##       unacc  acc vgood good
## unacc  1174    0     0    0
## acc      33  375     0    0
## vgood     0    1    65    9
## good      3    8     0   60

cm[[8]]$overall[1]

## Accuracy
## 0.96875

3.B6 k-Nearest Neighbors

library(caret)
model<-knn3(fmla,data=car_eval,k=nlev+1)
pred<-predict(model,x,type="class")
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[9]]<-c("k-Nearest Neighbors","KNN",confusionMatrix(mtab))
cm[[9]]$table

##       unacc  acc good vgood
## unacc  1199   41    4     1
## acc      11  336   45    15
## good      0    6   15     4
## vgood     0    1    5    45

cm[[9]]$overall[1]

## Accuracy
## 0.9230324
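Here k is simply fixed at nlev+1 = 5 without tuning. If tuning were of interest, caret can cross-validate k directly; a sketch (the grid is an arbitrary illustrative choice, and this step is not part of the benchmark):

library(caret)
# 10-fold cross-validation over a small grid of odd k values.
ctrl <- trainControl(method = "cv", number = 10)
fit  <- train(fmla, data = car_eval, method = "knn",
              trControl = ctrl, tuneGrid = data.frame(k = seq(3, 15, 2)))
fit$bestTune   # the k with the highest cross-validated accuracy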
3.B7 Naive Bayes

library(e1071)
model<-naiveBayes(fmla,data=car_eval)
pred<-predict(model,x)
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[10]]<-c("Naive Bayes","NBAYES",confusionMatrix(mtab))
cm[[10]]$table

##       unacc  acc vgood good
## unacc  1161   85     0    0
## acc      47  289    26   46
## vgood     0    0    39    2
## good      2   10     0   21

cm[[10]]$overall[1]

## Accuracy
## 0.8738426

3.C Non-Linear Classification with Decision Trees

3.C1 Classification and Regression Trees (CART)

library(rpart)
library(rpart.plot)
model<-rpart(fmla,data=car_eval)
# prp(model, faclen=3)
pred<-predict(model, x ,type="class")
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[11]]<-c("Classification and Regression Trees","CART",confusionMatrix(mtab))
cm[[11]]$table

##       unacc  acc good vgood
## unacc  1161    7    0     0
## acc      45  358    0    13
## good      4   16   60     0
## vgood     0    3    9    52

cm[[11]]$overall[1]

## Accuracy
## 0.9438657

3.C2 OneR
library(RWeka)
model<-OneR(fmla,data=car_eval)
pred<-predict(model,x,type="class")
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[12]]<-c("One R","ONE-R",confusionMatrix(mtab))
cm[[12]]$table

##       unacc  acc vgood good
## unacc  1210  384    65   69
## acc       0    0     0    0
## vgood     0    0     0    0
## good      0    0     0    0

cm[[12]]$overall[1]

## Accuracy
## 0.7002315
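The confusion matrix shows the single best one-attribute rule degenerates on this dataset: every car is labelled unacc, so OneR's accuracy is exactly the majority-class baseline. A one-line check:

# All 1728 predictions fall in the unacc row, so accuracy reduces to the
# majority-class share:
1210 / 1728
## [1] 0.7002315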
3.C3 C4.5

library(RWeka)
model<-J48(fmla,data=car_eval)
pred<-predict(model,x)
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[13]]<-c("C4.5","C45",confusionMatrix(mtab))
cm[[13]]$table

##       unacc  acc vgood good
## unacc  1172    0     0    0
## acc      35  380     4    9
## vgood     0    2    55    3
## good      3    2     6   57

cm[[13]]$overall[1]

## Accuracy
## 0.962963

3.C4 PART

library(RWeka)
model<-PART(fmla,data=car_eval)
pred<-predict(model,x)
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[14]]<-c("PART","PART",confusionMatrix(mtab))
cm[[14]]$table

##       unacc  acc vgood good
## unacc  1196    6     0    1
## acc      13  374     0    3
## vgood     0    3    65    2
## good      1    1     0   63

cm[[14]]$overall[1]

## Accuracy
## 0.9826389

3.C5 Bagging CART

library(ipred)
model<-bagging(fmla,data=car_eval)
pred<-predict(model,x)
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[15]]<-c("Bagging CART","BAG-CART",confusionMatrix(mtab))
cm[[15]]$table

##       unacc  acc vgood good
## unacc  1210    0     0    0
## acc       0  384     0    0
## vgood     0    0    65    0
## good      0    0     0   69

cm[[15]]$overall[1]

## Accuracy
## 1

3.C6 Random Forest

library(randomForest)
model<-randomForest(fmla,data=car_eval)
pred<-predict(model,x)
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[16]]<-c("Random Forest","RF",confusionMatrix(mtab))
cm[[16]]$table

##       unacc  acc vgood good
## unacc  1208    0     0    0
## acc       2  384     0    0
## vgood     0    0    65    0
## good      0    0     0   69
cm[[16]]$overall[1]

## Accuracy
## 0.9988426

3.C7 Gradient Boosted Machine

library(gbm)
model<-gbm(fmla,data=car_eval,n.trees=5000,interaction.depth=nlev,shrinkage=0.001,bag.fraction=0.8,distribution="multinomial")
prob<-predict(model,x,n.trees=5000,type="response")
pred<-apply(prob,1,which.max)
pred[which(pred=="1")]<-levels(y)[1]
pred[which(pred=="2")]<-levels(y)[2]
pred[which(pred=="3")]<-levels(y)[3]
pred[which(pred=="4")]<-levels(y)[4]
pred<-as.factor(pred)
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[17]]<-c("Gradient Boosted Machine","GBM",confusionMatrix(mtab))
cm[[17]]$table

##       unacc  acc vgood good
## unacc  1191    3     0    0
## acc      16  372     1    0
## vgood     0    0    64    4
## good      3    9     0   65

cm[[17]]$overall[1]

## Accuracy
## 0.9791667
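With shrinkage = 0.001, n.trees = 5000 is a guess at how many boosting iterations are enough; gbm can estimate the optimal count from the fit itself. A sketch using gbm's out-of-bag method (usable here because bag.fraction < 1; the OOB estimate tends to be conservative and is best treated as a rough guide):

# Estimated optimal number of boosting iterations from the out-of-bag curve.
best.iter <- gbm.perf(model, method = "OOB", plot.it = FALSE)
best.iter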
3.C8 Boosted C5.0

library(C50)
model<-C5.0(fmla,data=car_eval,trials=10)
# tree <- rpart(model,data=car_eval,control=rpart.control(minsplit=20,cp=0,digits=6))
# prp(tree,faclen=3)
pred<-predict(model,x)
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[18]]<-c("Boosted C5.0","BOOST-C50",confusionMatrix(mtab))
cm[[18]]$table

##       unacc  acc vgood good
## unacc  1207    1     0    0
## acc       3  383     2    0
## vgood     0    0    63    0
## good      0    0     0   69

cm[[18]]$overall[1]

## Accuracy
## 0.9965278

3.C9 JRip

library(RWeka)
model<-JRip(fmla,data=car_eval)
pred<-predict(model,x)
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[19]]<-c("JRip","JRIP",confusionMatrix(mtab))
cm[[19]]$table

##       unacc  acc vgood good
## unacc  1156    0     0    0
## acc      44  356     2    6
## vgood     4   16    61    3
## good      6   12     2   60

cm[[19]]$overall[1]

## Accuracy
## 0.9450231

3.C10 H2O Deep Learning

##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## C:\Users\MARC_B~1\AppData\Local\Temp\Rtmpglsya5/h2o_Marc_Borowczak_started_from_r.out
## C:\Users\MARC_B~1\AppData\Local\Temp\Rtmpglsya5/h2o_Marc_Borowczak_started_from_r.err
##
##
## Starting H2O JVM and connecting: ... Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 9 seconds 10 milliseconds
## H2O cluster version: 3.8.1.3
## H2O cluster name: H2O_started_from_R_Marc_Borowczak_dvn595
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.44 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## R Version: R version 3.2.4 Revised (2016-03-16 r70336)
##        acc good unacc vgood
## acc    319   26    36     3
## good     0   66     0     3
## unacc   24    9  1177     0
## vgood    0    0     0    65

## Accuracy
## 0.9415509

Step 4. Performance Comparison

We now compare all analysis results gathered in the confusionMatrix list into a results data frame.

library(h2o)
localH2O=h2o.init(nthreads=-1)

##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## C:\Users\MARC_B~1\AppData\Local\Temp\RtmpUrf7Ob/h2o_Marc_Borowczak_started_from_r.out
## C:\Users\MARC_B~1\AppData\Local\Temp\RtmpUrf7Ob/h2o_Marc_Borowczak_started_from_r.err
##
##
## Starting H2O JVM and connecting: .... Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 13 seconds 909 milliseconds
## H2O cluster version: 3.8.1.3
## H2O cluster name: H2O_started_from_R_Marc_Borowczak_abx142
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.44 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## R Version: R version 3.2.4 Revised (2016-03-16 r70336)

h2o.no_progress()
car_eval.hex=h2o.uploadFile(path=paste0(userdir,"/car_eval.csv"),destination_frame="car_eval.hex")
library(microbenchmark)
mbm<-microbenchmark(
  m1<-multinom(fmla, data = car_eval, maxit = 500, trace=FALSE),
  m2<-vglm(fmla, family = "multinomial", data = car_eval, maxit = 100),
  m3<-lda(fmla,data=car_eval),
  m4<-mda(fmla,data=car_eval),
  m5<-rda(fmla,data=car_eval,gamma = 0.05,lambda = 0.01),
  m6<-nnet(fmla,data=car_eval,size = 4, decay = 0.0001, maxit = 700,trace=FALSE),
  m7<-fda(fmla,data=car_eval),
  m8<-ksvm(fmla,data=car_eval),
  m9<-knn3(fmla,data=car_eval,k=nlev+1),
  m10<-naiveBayes(fmla,data=car_eval),
  m11<-rpart(fmla,data=car_eval),
  m12<-OneR(fmla,data=car_eval),
  m13<-J48(fmla,data=car_eval),
  m14<-PART(fmla,data=car_eval),
  m15<-bagging(fmla,data=car_eval),
  m16<-randomForest(fmla,data=car_eval),
  m17<-gbm(fmla,data=car_eval,n.trees=5000,interaction.depth=nlev,shrinkage=0.001,bag.fraction=0.8,distribution="multinomial"),
  m18<-C5.0(fmla,data=car_eval,trials=10),
  m19<-JRip(fmla,data=car_eval),
  m20<-h2o.deeplearning(x=1:(nv-1), y=nv, training_frame = car_eval.hex, variable_importances = TRUE),
  times=5)
h2o.shutdown(prompt = FALSE)

## [1] TRUE

#
library(dplyr)
models<-length(cm)
mbm$expr<-rep(sapply(1:models, function(i) {cm[[i]][[2]]}),5)
mbm<-aggregate(x=mbm$time,by=list(Model=mbm$expr),FUN=mean)
mbm$x<-mbm$x/min(mbm$x)
results<-sapply (1:models, function(i) {c(cm[[i]][[1]],cm[[i]][[2]],mbm$x[i],cm[[i]]$overall[1:6])})
row.names(results)<-c("Description","Model","Model_Time_X",names(cm[[1]]$overall[1:6]))
results<-as.data.frame(t(results))
results[,3:9]<-sapply(3:9,function(i){results[,i]<-as.numeric(levels(results[,i])[results[,i]])})
results<-results[,-(8:9)]
results<-arrange(results,desc(Accuracy))

We are now ready to chart the results on polar bar charts, ordered by decreasing value to ease comparisons.

library(ggplot2)
library(gridExtra)
res<-data.frame(Model=results$Model,Accuracy=results$Accuracy,Speed=1/results$Model_Time_X,Overall=results$Accuracy/results$Model_Time_X)
library(RColorBrewer)
myPalette <- colorRampPalette(rev(brewer.pal(12, "Set3")))
sc <- scale_colour_gradientn(colours = myPalette(256), limits=c(0.8, 1))
g<-ggplot(res,aes(x=reorder(Model,-Accuracy),y=Accuracy,fill=Model))+
  geom_bar(stat="identity")+
  coord_polar(theta="x",direction=1)+
  labs(x="Machine Learning Model",y="Prediction Accuracy")+
  theme(legend.position="bottom",legend.box="horizontal")+
  ggtitle('Car Evaluation Dataset Accuracy Performance')
g
[Figure: polar bar chart "Car Evaluation Dataset Accuracy Performance", Prediction Accuracy by Machine Learning Model in decreasing order]

g<-ggplot(res,aes(x=reorder(Model,-Speed),y=Speed,fill=Model))+
  geom_bar(stat="identity")+
  coord_polar(theta="x",direction=1)+
  labs(x="Machine Learning Model",y="Prediction Speed")+
  theme(legend.position="bottom",legend.box="horizontal")+
  ggtitle('Car Evaluation Dataset Speed Performance')
g
[Figure: polar bar chart "Car Evaluation Dataset Speed Performance", Prediction Speed by Machine Learning Model in decreasing order]

g<-ggplot(res,aes(x=reorder(Model,-Overall),y=Overall,fill=Model))+
  geom_bar(stat="identity")+
  coord_polar(theta="x",direction=1)+
  labs(x="Machine Learning Model",y="Prediction Overall")+
  theme(legend.position="bottom",legend.box="horizontal")+
  ggtitle('Car Evaluation Dataset Overall Performance')
g
[Figure: polar bar chart "Car Evaluation Dataset Overall Performance", Prediction Overall (Accuracy x Speed) by Machine Learning Model in decreasing order]

We conclude this dataset analysis by tabulating the results obtained on this dataset.

kable(res)

Model      Accuracy   Speed      Overall
BAG-CART   1.0000000  0.0093808  0.0093808
RF         0.9988426  0.0966751  0.0965632
BOOST-C50  0.9965278  0.0624463  0.0622294
NNET       0.9895833  0.1358213  0.1344065
PART       0.9826389  0.0272376  0.0267647
GBM        0.9791667  0.2219264  0.2173029
SVM        0.9687500  0.0893670  0.0865743
C45        0.9629630  0.0434614  0.0418517
MULTINOM   0.9456019  0.0685611  0.0648315
GLM        0.9456019  0.0011533  0.0010906
JRIP       0.9450231  0.1148750  0.1085596
CART       0.9438657  1.0000000  0.9438657
H2O        0.9415509  0.0013525  0.0012735
KNN        0.9230324  0.0398154  0.0367509
MDA        0.9212963  0.3604409  0.3320729
LDA        0.9010417  0.0472147  0.0425424
FDA        0.8998843  0.0013615  0.0012252
RDA        0.8778935  0.0062438  0.0054814
NBAYES     0.8738426  0.0263502  0.0230260
ONE-R      0.7002315  0.0045960  0.0032182
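As a quick consistency check on the table, the Overall column is just the product of the two columns before it; re-deriving it and sorting reproduces the combined ranking discussed in Step 9 (a sketch over the res data frame built above):

# Re-derive Overall = Accuracy x Speed and list the top of the combined
# ranking (CART, MDA and GBM for this dataset).
chk <- transform(res, Check = Accuracy * Speed)
head(chk[order(-chk$Check), c("Model", "Accuracy", "Speed", "Check")], 3)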
Step 5: Retrieve 2nd Dataset

We now repeat the exercise with the Nursery dataset. Again, we will use R to quickly download the dataset and its full description. We continue to maintain reproducibility of the analysis as a general practice: the analysis tool and platform are documented, all libraries are clearly listed, and the data are retrieved programmatically and date-stamped from the repository. We will display the structure of the Nursery dataset and the corresponding dictionary to translate the attribute factors.

datadir <- "./data"
if (!file.exists("data")){dir.create("data")}
uciUrl <- "http://archive.ics.uci.edu/ml/machine-learning-databases/"
fileUrl <- paste0(uciUrl,"nursery/nursery.data?accessType=DOWNLOAD")
download.file(fileUrl,destfile="./data/nurserydata.csv",method="curl")
dateDownloaded <- date()
nursery <- read.csv("./data/nurserydata.csv",header=FALSE)
fileUrl <- paste0(uciUrl,"nursery/nursery.names?accessType=DOWNLOAD")
download.file(fileUrl,destfile="./data/nurserynames.txt")
txt <- readLines("./data/nurserynames.txt")
lns <- data.frame(beg=which(grepl("parents usual",txt)),end=which(grepl("priority, not_recom",txt)))
# capture text between beg and end from txt
res <- lapply(seq_along(lns$beg),
              function(l){paste(txt[seq(from=lns$beg[l],to=lns$end[l],by=1)],collapse=" ")})
res <- gsub(" ", ":", res, fixed = TRUE)
res <- gsub(" ", ":", res, fixed = TRUE)
res <- gsub(" ", ":", res, fixed = TRUE)
res <- gsub(" ", ":", res, fixed = TRUE)
res <- gsub(" ", "\n", res, fixed = TRUE)
res <- gsub(" ", "", res, fixed = TRUE)
res <- gsub(" ", "", res, fixed = TRUE)
res <- gsub(" ", "", res, fixed = TRUE)
res <- str_c(res,"\n")
writeLines(res,"./data/n_parsed_attr.csv")
attrib <- readLines("./data/n_parsed_attr.csv")
nv <- length(attrib) # number of attributes
attrib <- sapply (1:nv,function(i) {gsub(":"," ",attrib[i],fixed=TRUE)})
dictionary <- sapply (1:nv,function(i) {strsplit(attrib[i],' ')})
dictionary[[nv]][1]<-"class"
colnames(nursery)<-sapply(1:nv,function(i) {colnames(nursery)[i]<-dictionary[[i]][1]})
x<-nursery[,1:(nv-1)]
y<-nursery[,nv]
fmla<-paste(colnames(nursery)[1:(nv-1)],collapse="+")
fmla<-paste0(colnames(nursery)[nv],"~",fmla)
fmla<-as.formula(fmla)
nlev<-nlevels(y) # number of factors describing class

Step 6. Data Exploration

head(nursery)

## parents has_nurs form children housing finance social
## 1 usual proper complete 1 convenient convenient nonprob
## 2 usual proper complete 1 convenient convenient nonprob
## 3 usual proper complete 1 convenient convenient nonprob
## 4 usual proper complete 1 convenient convenient slightly_prob
## 5 usual proper complete 1 convenient convenient slightly_prob
## 6 usual proper complete 1 convenient convenient slightly_prob
##   health      class
## 1 recommended recommend
## 2 priority    priority
## 3 not_recom   not_recom
## 4 recommended recommend
## 5 priority    priority
## 6 not_recom   not_recom

summary(nursery)

## parents           has_nurs          form             children
## great_pret :4320   critical   :2592  complete  :3240  1   :3240
## pretentious:4320   improper   :2592  completed :3240  2   :3240
## usual      :4320   less_proper:2592  foster    :3240  3   :3240
##                    proper     :2592  incomplete:3240  more:3240
##                    very_crit  :2592
## housing            finance           social
## convenient:4320    convenient:6480   nonprob      :4320
## critical  :4320    inconv    :6480   problematic  :4320
## less_conv :4320                      slightly_prob:4320
##
##
## health             class
## not_recom  :4320   not_recom :4320
## priority   :4320   priority  :4266
## recommended:4320   recommend :   2
##                    spec_prior:4044
##                    very_recom: 328

str(nursery)

## 'data.frame': 12960 obs. of 9 variables:
## $ parents : Factor w/ 3 levels "great_pret","pretentious",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ has_nurs: Factor w/ 5 levels "critical","improper",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ form : Factor w/ 4 levels "complete","completed",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ children: Factor w/ 4 levels "1","2","3","more": 1 1 1 1 1 1 1 1 1 1 ...
## $ housing : Factor w/ 3 levels "convenient","critical",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ finance : Factor w/ 2 levels "convenient","inconv": 1 1 1 1 1 1 1 1 1 2 ...
## $ social : Factor w/ 3 levels "nonprob","problematic",..: 1 1 1 3 3 3 2 2 2 1 ...
## $ health : Factor w/ 3 levels "not_recom","priority",..: 3 2 1 3 2 1 3 2 1 3 ...
## $ class : Factor w/ 5 levels "not_recom","priority",..: 3 2 1 3 2 1 2 2 1 5 ...

m<-cm<-list() # initialize

Exploration reveals the multinomial Nursery dataset comprises 8 attribute factors we can use to predict the 5-level nursery recommendation class, with no missing data.
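One point worth flagging before modelling: the class distribution is highly unbalanced, with recommend present in only 2 of the 12,960 rows, so several models may never predict that label. A quick check, with frequencies computed from the summary above:

# Relative class frequencies; recommend is roughly 0.015% of the data.
round(prop.table(table(y)), 5)
## not_recom   priority  recommend spec_prior very_recom
##   0.33333    0.32917    0.00015    0.31204    0.02531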
Step 7. Classification Analysis

We will perform a total of 20 analyses using the same 3 groups: A) Linear, B) Non-Linear and C) Non-Linear Classification with decision trees, sequentially, and then compare their outcomes in terms of prediction accuracy, comparing the predicted vs. actual dependent variable. For each analysis, we follow the same protocol: invoke the package library, build the model, summarize it and predict the dependent variable, then save the accuracy data and predicted values.

##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## C:\Users\MARC_B~1\AppData\Local\Temp\Rtmpglsya5/h2o_Marc_Borowczak_started_from_r.out
## C:\Users\MARC_B~1\AppData\Local\Temp\Rtmpglsya5/h2o_Marc_Borowczak_started_from_r.err
##
##
## Starting H2O JVM and connecting: ... Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 8 seconds 961 milliseconds
## H2O cluster version: 3.8.1.3
## H2O cluster name: H2O_started_from_R_Marc_Borowczak_doh554
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.44 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## R Version: R version 3.2.4 Revised (2016-03-16 r70336)

##            not_recom priority recommend spec_prior
## not_recom       3316        0         0          2
## priority           0     3105         0        165
## recommend          0        0         0          0
## spec_prior         0       11         0       3067

## Accuracy
## 0.9815849

Step 8. Performance Comparison

library(h2o)
localH2O=h2o.init(nthreads=-1)

##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## C:\Users\MARC_B~1\AppData\Local\Temp\RtmpUrf7Ob/h2o_Marc_Borowczak_started_from_r.out
## C:\Users\MARC_B~1\AppData\Local\Temp\RtmpUrf7Ob/h2o_Marc_Borowczak_started_from_r.err
##
##
## Starting H2O JVM and connecting: .... Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 9 seconds 761 milliseconds
## H2O cluster version: 3.8.1.3
## H2O cluster name: H2O_started_from_R_Marc_Borowczak_gcl376
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.44 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## R Version: R version 3.2.4 Revised (2016-03-16 r70336)

h2o.no_progress()
nursery.hex=h2o.uploadFile(path=paste0(userdir,"/nursery.csv"),destination_frame="nursery.hex")
library(microbenchmark)
mbm<-microbenchmark(
  m1<-multinom(fmla, data = nursery, maxit = 1200, trace=FALSE),
  m2<-vglm(fmla, family = "multinomial", data = nursery, maxit = 100),
  m3<-lda(fmla,data=nursery),
  m4<-mda(fmla,data=nursery),
  m5<-rda(fmla,data=nursery,gamma = 0.05,lambda = 0.01),
  m6<-nnet(fmla,data=nursery,size = 4, decay = 0.0001, maxit = 1200,trace=FALSE),
  m7<-fda(fmla,data=nursery),
  m8<-ksvm(fmla,data=nursery),
  m9<-knn3(fmla,data=nursery,k=nlev+1),
  m10<-naiveBayes(fmla,data=nursery),
  m11<-rpart(fmla,data=nursery),
  m12<-OneR(fmla,data=nursery),
  m13<-J48(fmla,data=nursery),
  m14<-PART(fmla,data=nursery),
  m15<-bagging(fmla,data=nursery),
  m16<-randomForest(fmla,data=nursery),
  m17<-gbm(fmla,data=nursery,n.trees=5000,interaction.depth=nlev,shrinkage=0.001,bag.fraction=0.8,distribution="multinomial"),
  m18<-C5.0(fmla,data=nursery,trials=10),
  m19<-JRip(fmla,data=nursery),
  m20<-h2o.deeplearning(x=1:(nv-1), y=nv, training_frame = nursery.hex, variable_importances = TRUE),
  times=5)
h2o.shutdown(prompt = FALSE)

## [1] TRUE

#
library(dplyr)
models<-length(cm)
mbm$expr<-rep(sapply(1:models, function(i) {cm[[i]][[2]]}),5)
mbm<-aggregate(x=mbm$time,by=list(Model=mbm$expr),FUN=mean)
mbm$x<-mbm$x/min(mbm$x)
results<-sapply (1:models, function(i) {c(cm[[i]][[1]],cm[[i]][[2]],mbm$x[i],cm[[i]]$overall[1:6])})
row.names(results)<-c("Description","Model","Model_Time_X",names(cm[[1]]$overall[1:6]))
results<-as.data.frame(t(results))
results[,3:9]<-sapply(3:9,function(i){results[,i]<-as.numeric(levels(results[,i])[results[,i]])})
results<-results[,-(8:9)]
results<-arrange(results,desc(Accuracy))
#
# Performance Plot
library(ggplot2)
library(gridExtra)
res<-data.frame(Model=results$Model,Accuracy=results$Accuracy,Speed=1/results$Model_Time_X,Overall=results$Accuracy/results$Model_Time_X)
library(RColorBrewer)
myPalette <- colorRampPalette(rev(brewer.pal(12, "Set3")))
sc <- scale_colour_gradientn(colours = myPalette(256), limits=c(0.8, 1))

We are now ready to chart and will again compare the models on polar bar charts.

[Figure: polar bar chart "Nursery Dataset Accuracy Performance", Prediction Accuracy by Machine Learning Model in decreasing order]
[Figure: polar bar chart "Nursery Dataset Overall Performance", Prediction Overall (Accuracy x Speed) by Machine Learning Model in decreasing order]

We conclude this second dataset analysis by tabulating the results obtained.

Model      Accuracy   Speed      Overall
BAG-CART   0.9999228  0.0094075  0.0094068
BOOST-C50  0.9997685  0.3000380  0.2999686
PART       0.9986883  0.1309383  0.1307665
SVM        0.9932099  0.0125290  0.0124439
RF         0.9916667  0.0045268  0.0044891
JRIP       0.9821759  0.0040615  0.0039892
NNET       0.9818673  0.0006675  0.0006554
H2O        0.9815849  0.0005785  0.0005678
C45        0.9813272  0.0491118  0.0481947
KNN        0.9768519  0.0699498  0.0683306
MDA        0.9331019  0.9991693  0.9323268
MULTINOM   0.9262346  0.0006674  0.0006182
GLM        0.9262346  0.1569507  0.1453732
FDA        0.9130401  0.0077249  0.0070531
NBAYES     0.9030864  0.0052343  0.0047270
RDA        0.8983025  0.0076060  0.0068325
CART       0.8726852  0.0055864  0.0048752
GBM        0.8726852  0.0284867  0.0248599
ONE-R      0.7097222  1.0000000  0.7097222
LDA        0.5483796  0.0037381  0.0020499
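Before interpreting these numbers in Step 9, it helps to restate how the Speed and Overall columns are derived from the raw timings (a compact sketch of the code used above, assuming mbm still holds the raw microbenchmark result of 5 timed runs per model):

# Average the 5 timed runs per model, rescale so the fastest model gets a
# time ratio of 1, then invert: Speed lies in (0, 1] with the fastest at 1.
mean_times <- aggregate(x = mbm$time, by = list(Model = mbm$expr), FUN = mean)
time_ratio <- mean_times$x / min(mean_times$x)  # fastest model -> ratio 1
mean_times$Speed <- 1 / time_ratio
# Overall in the table is then Accuracy * Speed, matched by model name.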
Step 9. Data Analysis

Comparing the accuracy on these 2 datasets, we observe that BAG-CART, RF and BOOST-C50 top the list at more than 99% accuracy while NNET, PART, GBM, SVM and C45 exceeded 95% accuracy on the smaller dataset. On the second dataset, we observe BAG-CART, BOOST-C50, PART, SVM and RF exceed 99% accuracy, while JRIP, NNET, H2O, C45 and KNN exceed 95% accuracy. From these observations, we should definitely include BAG-CART, BOOST-C50 and RF as prime models to tackle multinomial data. Including NNET, PART, SVM and C45 as a 2nd tier also seems a good idea. To be complete, the third group of models should include JRIP, H2O and KNN. For reference, if only the main dependency is desired, ONE-R can pinpoint it with more than 70% accuracy.

Speed can also be a determining factor when selecting a model. The microbenchmark data allow us to compare the time used by the 20 models, averaged over 5 runs. Normalized times are obtained by dividing all times by the minimum time recorded; transforming time into Speed involves taking the reciprocal values. Overall ranking is obtained here by simply forming the product of Accuracy x Speed. We observe overwhelming dependencies on Speed, with only CART, MDA and GBM as contenders for the Car Evaluation dataset. For the more complex Nursery dataset, a different outcome is observed, with MDA, ONE-R and BOOST-C50 as the fastest and overall best predictors.

Step 10. Conclusions

We have compared 20 Machine Learning models and benchmarked their accuracy and speed on 2 multinomial datasets. Although the accuracy ranking seemed consistent across these 2 datasets, forming a 3-tier grouping, model execution speed, which is often also a factor, showed strong dataset dependencies, so that the combined ranking remains strongly dataset dependent.

What this means for data scientists is that benchmarking should remain on the front- and not the back-burner of the activities list, along with continuous monitoring of new, more efficient and distributed and/or parallelized algorithms and their effects on different hardware platforms. We have evaluated the accuracy on 2 multinomial datasets and the analysis led to a 3-tier grouping; however, Speed ranking could reduce our options. We will continue to benchmark new Machine Learning tools by applying them to broader datasets.

References

The following sources are referenced as they provided significant help and information to develop this Machine Learning analysis applied to formulations:

1. UCI Machine Learning Repository
2. Car Evaluation Database documentation, donated by marko.bohanec@ijs.si, June 1997
3. Nursery Database documentation, donated by marko.bohanec@ijs.si, June 1997
4. stringr: Simple, Consistent Wrappers for Common String Operations
5. RWeka: R/Weka Interface
6. C50: Decision Trees and Rule-Based Models
7. rpart: Recursive Partitioning and Regression Trees
8. rpart.plot: Plot 'rpart' Models: An Enhanced Version of 'plot.rpart'
9. rattle: A Graphical User Interface for Data Mining using R
10. VGAM: Vector Generalized Linear and Additive Models
11. MASS: Support Functions and Datasets for Venables and Ripley's MASS
12. mda: Mixture and Flexible Discriminant Analysis
13. klaR: Classification and Visualization
14. nnet: Feed-forward Neural Networks and Multinomial Log-Linear Models
15. kernlab: Kernel-based Machine Learning Lab
16. caret (short for Classification And REgression Training): a set of functions that attempt to streamline the process of creating predictive models
17. e1071: Functions for latent class analysis, short-time Fourier transform, fuzzy clustering, support vector machines, shortest-path computation, bagged clustering, naive Bayes classifiers, etc.
18. ipred: Improved Predictors
19. randomForest: Breiman and Cutler's Random Forests for Classification and Regression
20. gbm: Generalized Boosted Regression Models
21. H2O Deep Learning: R Interface for H2O
22. H2O: H2O.ai documentation
23. gridExtra: Miscellaneous Functions for Grid Graphics
24. knitr: A General-Purpose Package for Dynamic Report Generation in R
25. RStudio: Open source and enterprise-ready professional software for R