Benchmarking 20 Machine Learning Models: Accuracy and Speed
Marc Borowczak, PRC Consulting LLC
March 29, 2016
Contents
Summary
Step 0: Selection & Reproducibility
Step 1: Retrieve 1st Dataset
Step 2. Data Exploration
Step 3. Classification Analysis
Step 4. Performance Comparison
Step 5: Retrieve 2nd Dataset
Step 6. Data Exploration
Step 7. Classification Analysis
Step 8. Performance Comparison
Step 9. Data Analysis
Step 10. Conclusions
References
Summary
As Machine Learning tools become mainstream and an ever-growing choice of them becomes available to data scientists and analysts, assessing which are best suited to a task becomes challenging. In this study, 20 Machine Learning models were benchmarked for their accuracy and speed on multi-core hardware, when applied to 2 multinomial datasets differing broadly in size and complexity. It was observed that BAG-CART, RF and BOOST-C50 top the list at more than 99% accuracy, while NNET, PART, GBM, SVM and C45 exceeded 95% accuracy on the small Car Evaluation dataset. On the larger and more complex Nursery dataset, we observed BAG-CART, BOOST-C50, PART, SVM and RF exceed 99% accuracy, while JRIP, NNET, H2O, C45 and KNN exceeded 95% accuracy. However, overwhelming dependencies on speed (determined as a 5-run average) were observed on the multi-core hardware, leaving only CART, MDA and GBM as contenders for the Car Evaluation dataset. For the more complex Nursery dataset, a different outcome was observed, with MDA, ONE-R and BOOST-C50 as the fastest and overall best predictors. The implication for Data Analytics leaders is to continue allocating resources to ensure Machine Learning benchmarks are conducted regularly, documented and communicated through the analyst teams, and to ensure the most efficient tools, based on established criteria, are applied in day-to-day operations. The implication of these findings for data scientists is to keep benchmarking tasks on the front, not the back, burner of the activity list, and to continue monitoring new, more efficient, distributed and/or parallelized algorithms and their effects on various hardware platforms. Ultimately, finding the best tool depends strongly on the selection criteria and certainly on the hardware platforms available. Therefore, this benchmarking task may well rest on data analyst leaders' and engineers' to-do lists for the foreseeable future.
Step 0: Selection & Reproducibility
As Machine Learning gains attention, more applications and models are being used, yet the speed and accuracy of the resulting models often lack comparison. In this analysis, we'll compare the accuracy and speed of 20 commonly selected Machine Learning models. We'll exercise these models on two multinomial UCI reference datasets differing in size, predictor levels and number of levels of the dependent variable. The first and smaller dataset is Car_Evaluation; we'll then compare results when applying the same models to the larger and more complex Nursery dataset.
Sys.info()[1:5]
## sysname release version nodename machine
## "Windows" "10 x64" "build 10586" "STALLION" "x86-64"
sessionInfo()
## R version 3.2.4 Revised (2016-03-16 r70336)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 formatR_1.3 tools_3.2.4 htmltools_0.3
## [5] yaml_2.1.13 stringi_1.0-1 rmarkdown_0.9.5 knitr_1.12.3
## [9] stringr_1.0.0 digest_0.6.9 evaluate_0.8.3
library(stringr)
library(knitr)
userdir <- getwd()
set.seed(123)
Step 1: Retrieve 1st Dataset
We will mirror the approach used in the formulation challenge and first use the Car Evaluation dataset hosted on the UCI Machine Learning Repository. We will use R to download the dataset and its full description quickly and reproducibly. We continue to maintain reproducibility of the analysis as a general practice: the analysis tool and platform are documented, all libraries are clearly listed, and the data is retrieved programmatically and date-stamped from the repository.
We will display the structure of the Car_Evaluation dataset and the corresponding dictionary used to translate the attribute factors.
datadir <- "./data"
if (!file.exists("data")){dir.create("data")}
uciUrl <- "http://archive.ics.uci.edu/ml/machine-learning-databases/"
fileUrl <- paste0(uciUrl,"car/car.data?accessType=DOWNLOAD")
download.file(fileUrl, destfile="./data/cardata.csv", method = "curl")
dateDownloaded <- date()
car_eval <- read.csv("./data/cardata.csv",header=FALSE)
fileUrl <- paste0(uciUrl,"car/car.names?accessType=DOWNLOAD")
download.file(fileUrl, destfile = "./data/carnames.txt")
txt <- readLines("./data/carnames.txt")
lns <- data.frame(beg=which(grepl("buying\\s+v-high",txt)),end=which(grepl("med, high",txt)))
# we now capture all lines of text between beg and end from txt
res <- lapply(seq_along(lns$beg),
              function(l){paste(txt[seq(from=lns$beg[l],to=lns$end[l],by=1)],collapse=" ")})
# normalize the fixed-width layout: long space runs become ":" field
# separators, shorter runs become line breaks, and leftover spaces are
# dropped (the exact run widths below reconstruct collapsed whitespace)
res <- gsub("      ", ":", res, fixed = TRUE)
res <- gsub("     ", ":", res, fixed = TRUE)
res <- gsub("    ", ":", res, fixed = TRUE)
res <- gsub("   ", ":", res, fixed = TRUE)
res <- gsub("  ", "\n", res, fixed = TRUE)
res <- gsub(" ", "", res, fixed = TRUE)
res <- str_c(res, "\n")
writeLines(res, "./data/parsed_attr.csv")
attrib <- readLines("./data/parsed_attr.csv")
nv <- length(attrib) # number of attributes
attrib <- sapply (1:nv,function(i) {gsub(":"," ",attrib[i],fixed=TRUE)})
dictionary <- sapply (1:nv,function(i) {strsplit(attrib[i],' ')})
dictionary[[nv]][1]<-"class"
colnames(car_eval)<-sapply(1:nv,function(i) {colnames(car_eval)[i]<-dictionary[[i]][1]})
cm<-list()
x<-car_eval[,1:(nv-1)]
y<-car_eval[,nv]
fmla<-paste(colnames(car_eval)[1:(nv-1)],collapse="+")
fmla<-paste0(colnames(car_eval)[nv],"~",fmla)
fmla<-as.formula(fmla)
nlev<-nlevels(y) # number of factors describing class
Step 2. Data Exploration
head(car_eval)
## buying maint doors persons lug_boot safety class
## 1 vhigh vhigh 2 2 small low unacc
## 2 vhigh vhigh 2 2 small med unacc
## 3 vhigh vhigh 2 2 small high unacc
## 4 vhigh vhigh 2 2 med low unacc
## 5 vhigh vhigh 2 2 med med unacc
## 6 vhigh vhigh 2 2 med high unacc
summary(car_eval)
## buying maint doors persons lug_boot safety
## high :432 high :432 2 :432 2 :576 big :576 high:576
## low :432 low :432 3 :432 4 :576 med :576 low :576
## med :432 med :432 4 :432 more:576 small:576 med :576
## vhigh:432 vhigh:432 5more:432
## class
## acc : 384
## good : 69
## unacc:1210
## vgood: 65
str(car_eval)
## 'data.frame': 1728 obs. of 7 variables:
## $ buying : Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ maint : Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ doors : Factor w/ 4 levels "2","3","4","5more": 1 1 1 1 1 1 1 1 1 1 ...
## $ persons : Factor w/ 3 levels "2","4","more": 1 1 1 1 1 1 1 1 1 2 ...
## $ lug_boot: Factor w/ 3 levels "big","med","small": 3 3 3 2 2 2 1 1 1 3 ...
## $ safety : Factor w/ 3 levels "high","low","med": 2 3 1 2 3 1 2 3 1 2 ...
## $ class : Factor w/ 4 levels "acc","good","unacc",..: 3 3 3 3 3 3 3 3 3 3 ...
cm<-list() # initialize
Exploration reveals that the multinomial car evaluation dataset comprises 6 categorical attributes we can use to predict the 4-level car recommendation class, with no missing data.
Step 3. Classification Analysis
20 models will be selected sequentially to represent 3 groups: A) Linear, B) Non-linear and C) Non-linear classification with decision trees. From the collected confusion-matrix performance, we will build a results data frame and compare the prediction accuracies.
For each analysis, we'll follow the same protocol: invoke the package library, build the model, summarize it and predict the class (dependent variable), then save the accuracy data and predicted values in a list.
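The repeated protocol could be factored into a small helper; the sketch below is illustrative only (the function and argument names are our own, not part of the report's code):

```r
library(caret)

# Illustrative wrapper for the shared per-model protocol.
evaluate_model <- function(label, abbrev, fit_fun, x, y) {
  model <- fit_fun()                        # build the model
  pred  <- predict(model, x)                # predict the class
  l     <- union(pred, y)                   # align predicted and actual levels
  mtab  <- table(factor(pred, l), factor(y, l))
  c(label, abbrev, confusionMatrix(mtab))   # accuracy data for the cm list
}
```

Each of the 20 models could then populate one `cm[[i]]` entry with a single call, mirroring the multinomial example that follows.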
3.A Linear Classification
3.A1 Multinomial
library(nnet)
library(caret)
model<-multinom(fmla, data = car_eval, maxit = 500, trace=FALSE)
prob<-predict(model,x,type="probs")
pred<-apply(prob,1,which.max)
pred[which(pred=="1")]<-levels(y)[1]
pred[which(pred=="2")]<-levels(y)[2]
pred[which(pred=="3")]<-levels(y)[3]
pred[which(pred=="4")]<-levels(y)[4]
pred<-as.factor(pred)
l<-union(pred,y)
mtab<-table(factor(pred,l),factor(y,l))
cm[[1]]<-c("Multinomial","MULTINOM",confusionMatrix(mtab))
cm[[1]]$table
[Figure: Car Evaluation Dataset Overall Performance — bar chart of Prediction Overall (0.00–0.75 scale) versus Machine Learning Model, with bars and legend covering all 20 models: BAG-CART, BOOST-C50, C45, CART, FDA, GBM, GLM, H2O, JRIP, KNN, LDA, MDA, MULTINOM, NBAYES, NNET, ONE-R, PART, RDA, RF and SVM.]
We conclude the analysis of this dataset by tabulating the results obtained.
kable(res)
Model Accuracy Speed Overall
BAG-CART 1.0000000 0.0093808 0.0093808
RF 0.9988426 0.0966751 0.0965632
BOOST-C50 0.9965278 0.0624463 0.0622294
NNET 0.9895833 0.1358213 0.1344065
PART 0.9826389 0.0272376 0.0267647
GBM 0.9791667 0.2219264 0.2173029
SVM 0.9687500 0.0893670 0.0865743
C45 0.9629630 0.0434614 0.0418517
MULTINOM 0.9456019 0.0685611 0.0648315
GLM 0.9456019 0.0011533 0.0010906
JRIP 0.9450231 0.1148750 0.1085596
CART 0.9438657 1.0000000 0.9438657
H2O 0.9415509 0.0013525 0.0012735
KNN 0.9230324 0.0398154 0.0367509
MDA 0.9212963 0.3604409 0.3320729
LDA 0.9010417 0.0472147 0.0425424
FDA 0.8998843 0.0013615 0.0012252
RDA 0.8778935 0.0062438 0.0054814
NBAYES 0.8738426 0.0263502 0.0230260
ONE-R 0.7002315 0.0045960 0.0032182
Step 5: Retrieve 2nd Dataset
We now repeat the process with the Nursery dataset. Again, we will use R to quickly download the dataset and its full description, and we continue to maintain reproducibility of the analysis as a general practice: the analysis tool and platform are documented, all libraries are clearly listed, and the data is retrieved programmatically and date-stamped from the repository.
We will display the structure of the Nursery dataset and the corresponding dictionary used to translate the property factors.
datadir <- "./data"
if (!file.exists("data")){dir.create("data")}
uciUrl <- "http://archive.ics.uci.edu/ml/machine-learning-databases/"
fileUrl <- paste0(uciUrl,"nursery/nursery.data?accessType=DOWNLOAD")
download.file(fileUrl,destfile="./data/nurserydata.csv",method="curl")
dateDownloaded <- date()
nursery <- read.csv("./data/nurserydata.csv",header=FALSE)
fileUrl <- paste0(uciUrl,"nursery/nursery.names?accessType=DOWNLOAD")
download.file(fileUrl,destfile="./data/nurserynames.txt")
txt <- readLines("./data/nurserynames.txt")
lns <- data.frame(beg=which(grepl("parents\\s+usual",txt)),end=which(grepl("priority, not_recom",txt)))
# capture text between beg and end from txt
res <- lapply(seq_along(lns$beg),
              function(l){paste(txt[seq(from=lns$beg[l],to=lns$end[l],by=1)],collapse=" ")})
# normalize the fixed-width layout, as for the Car Evaluation names file
# (the exact run widths below reconstruct collapsed whitespace)
res <- gsub("      ", ":", res, fixed = TRUE)
res <- gsub("     ", ":", res, fixed = TRUE)
res <- gsub("    ", ":", res, fixed = TRUE)
res <- gsub("   ", ":", res, fixed = TRUE)
res <- gsub("  ", "\n", res, fixed = TRUE)
res <- gsub(" ", "", res, fixed = TRUE)
res <- str_c(res, "\n")
writeLines(res,"./data/n_parsed_attr.csv")
attrib <- readLines("./data/n_parsed_attr.csv")
nv <- length(attrib) # number of attributes
attrib <- sapply (1:nv,function(i) {gsub(":"," ",attrib[i],fixed=TRUE)})
dictionary <- sapply (1:nv,function(i) {strsplit(attrib[i],' ')})
dictionary[[nv]][1]<-"class"
colnames(nursery)<-sapply(1:nv,function(i) {colnames(nursery)[i]<-dictionary[[i]][1]})
x<-nursery[,1:(nv-1)]
y<-nursery[,nv]
fmla<-paste(colnames(nursery)[1:(nv-1)],collapse="+")
fmla<-paste0(colnames(nursery)[nv],"~",fmla)
fmla<-as.formula(fmla)
nlev<-nlevels(y) # number of factors describing class
Step 6. Data Exploration
head(nursery)
## parents has_nurs form children housing finance social
Step 7. Classification Analysis
We will perform a total of 20 analyses using the same 3 groups: A) Linear, B) Non-linear and C) Non-linear classification with decision trees, sequentially, and then compare their outcomes in terms of prediction accuracy, comparing the predicted vs. actual dependent variable.
For each analysis, we follow the same protocol: invoke the package library, build the model, summarize it and predict the dependent variable, then save the accuracy data and predicted values.
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## C:\Users\MARC_B~1\AppData\Local\Temp\Rtmpglsya5/h2o_Marc_Borowczak_started_from_r.out
## C:\Users\MARC_B~1\AppData\Local\Temp\Rtmpglsya5/h2o_Marc_Borowczak_started_from_r.err
##
##
## Starting H2O JVM and connecting: ... Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 8 seconds 961 milliseconds
## H2O cluster version: 3.8.1.3
## H2O cluster name: H2O_started_from_R_Marc_Borowczak_doh554
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.44 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## R Version: R version 3.2.4 Revised (2016-03-16 r70336)
## not_recom priority recommend spec_prior
## not_recom 3316 0 0 2
## priority 0 3105 0 165
## recommend 0 0 0 0
## spec_prior 0 11 0 3067
## Accuracy
## 0.9815849
Step 8. Performance Comparison
library(h2o)
localH2O=h2o.init(nthreads=-1)
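The 5-run timing protocol used for the performance comparison can be sketched as follows; the model call shown here is illustrative, not the report's full benchmark loop:

```r
library(microbenchmark)
library(nnet)

# Illustrative 5-run timing of one model's training step. The Speed metric
# later divides the fastest model's mean time by each model's mean time.
timing <- microbenchmark(
  MULTINOM = multinom(fmla, data = nursery, maxit = 500, trace = FALSE),
  times = 5
)
mean_time <- summary(timing)$mean   # average over the 5 runs
```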
Step 9. Data Analysis
Comparing accuracy across these 2 datasets, we observe that BAG-CART, RF and BOOST-C50 top the list at more than 99% accuracy, while NNET, PART, GBM, SVM and C45 exceeded 95% accuracy on the smaller dataset. On the second dataset, BAG-CART, BOOST-C50, PART, SVM and RF exceed 99% accuracy, while JRIP, NNET, H2O, C45 and KNN exceed 95% accuracy.
From these observations, we should definitely include BAG-CART, BOOST-C50 and RF as prime models to tackle multinomial data. Including NNET, PART, SVM and C45 as a 2nd tier also seems a good idea. To be complete, the third group of models should include JRIP, H2O and KNN. For reference, if only the main dependency is desired, ONE-R can pinpoint it with more than 70% accuracy.
Speed can also be a determining factor when selecting a model. Microbenchmark data allow us to compare the average time used by the 20 models over a 5-run average. Normalized times are obtained by dividing all times by the minimum time recorded; transforming time into speed then involves taking the reciprocal values.
The overall ranking is obtained here by simply forming the product Accuracy x Speed.
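As an illustration, the normalization and ranking just described can be sketched in base R; the numeric values here are placeholders, not the measured benchmark times:

```r
# Illustrative ranking computation; times and accuracy values are placeholders.
times    <- c(CART = 0.02, MDA = 0.05, GBM = 0.09, NNET = 0.15)   # mean seconds
accuracy <- c(CART = 0.944, MDA = 0.921, GBM = 0.979, NNET = 0.990)
speed    <- min(times) / times       # divide by the minimum time, then take
                                     # the reciprocal: fastest model gets 1
overall  <- accuracy * speed         # Overall = Accuracy x Speed
res <- data.frame(Model = names(times), Accuracy = accuracy,
                  Speed = speed, Overall = overall)
res[order(-res$Overall), ]           # the fastest model can outrank more
                                     # accurate but slower ones
```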
We observe overwhelming dependencies on speed, with only CART, MDA and GBM as contenders for the Car Evaluation dataset, consistent with the Overall column of the table above. For the more complex Nursery dataset, a different outcome is observed, with ONE-R, MDA and BOOST-C50 as the fastest and overall best predictors.
Step 10. Conclusions
We have compared 20 Machine Learning models and benchmarked their accuracy and speed on 2 multinomial datasets. Although the accuracy ranking seemed consistent across these 2 datasets, forming a 3-tier grouping, model execution speed, which is often also a factor, showed strong dataset dependencies, so that the combined ranking remains strongly dataset-dependent.
For data scientists, this means that benchmarking should remain on the front, not the back, burner of the activity list, along with continuous monitoring of new, more efficient, distributed and/or parallelized algorithms and their effects on different hardware platforms.
We have evaluated accuracy on 2 multinomial datasets and the analysis led to a 3-tier grouping; however, speed ranking could reduce our options. We will continue to monitor and benchmark new Machine Learning tools by applying them to broader datasets.
References
The following sources are referenced as they provided significant help and information in developing this Machine Learning analysis:
1. UCI Machine Learning Repository
2. Car Evaluation Database documentation, donated by marko.bohanec@ijs.si, June 1997
3. Nursery Database documentation, donated by marko.bohanec@ijs.si, June 1997
4. stringr Simple, Consistent Wrappers for Common String Operations
5. RWeka R/Weka interface
6. C50 Decision Trees and Rule-Based Models
7. rpart Recursive Partitioning and Regression Trees.
8. rpart.plot Plot ‘rpart’ Models: An Enhanced Version of ‘plot.rpart’
9. rattle A Graphical User Interface for Data Mining using R
10. VGAM Vector Generalized Linear and Additive Models
11. MASS Support Functions and Datasets for Venables and Ripley’s MASS
12. mda Mixture and Flexible Discriminant Analysis
13. klaR Classification and Visualization
14. nnet Feed-forward Neural Networks and Multinomial Log-Linear Models
15. kernlab Kernel-based Machine Learning Lab
16. caret (short for Classification And REgression Training) A set of functions that attempt to streamline the process of creating predictive models
17. e1071 Functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier, etc.
18. ipred Improved Predictors
19. randomForest Breiman and Cutler’s random forests for classification and regression
20. gbm Generalized Boosted Regression Models
21. H2O Deep Learning R Interface for H2O
22. H2O H2O.ai documentation
23. gridExtra Miscellaneous Functions for Grid Graphics
24. knitr A General-Purpose Package for Dynamic Report Generation in R
25. RStudio Open Source and enterprise-ready professional software for R