Using Adaboost Algorithm to
Enhance the Prediction Accuracy of
Decision Trees
MOHAMMED ALHAMADI - PROJECT 2
Acknowledgement
This project was done as a partial requirement for the course Introduction to Machine Learning, offered online in Fall 2016 through Tandon Online, NYU Tandon School of Engineering.
Outline
1. Adaboost Algorithm
2. The Dataset
3. Data Exploration and Visualization
4. Factorizing the Dependent Variable
5. Prediction Using a Single Decision Tree
6. Prediction Using Adaboost
7. Can we improve this?
8. Can we improve even further?
9. Comparison
10. References
Adaboost Algorithm
• Adaboost is short for “Adaptive Boosting”
• Created by Yoav Freund and Robert Schapire in 1997
• Adaboost is a meta-algorithm (it is applied on top of other learning algorithms rather than used on its own)
• Can be used with other learning algorithms to improve their performance
• Makes weak classifiers strong by combining them
• The weak classifiers are typically decision trees (which we’ll use in this project)
Adaboost Algorithm (cont.)
• Weak models are added sequentially, each trained on a reweighted version of the training data
• The process continues until:
• No further improvement can be made on the training data set
• A preset maximum number of weak classifiers is reached
• Predictions are then made by a weighted vote of the weak classifiers
• Each weak classifier has a weight (its classification result is multiplied by its weight before the votes are combined)
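Before moving to a packaged implementation, it may help to see what this loop amounts to in code. The following is a minimal sketch of discrete Adaboost, assuming depth-1 rpart trees as the weak learners; the function and variable names are my own, not part of the project code, and it omits guards (e.g. for a weak learner with zero error).

library(rpart)
adaboost_sketch <- function(formula, data, n_rounds = 15) {
  environment(formula) <- environment()  # let rpart() find the local weight vector
  y <- model.response(model.frame(formula, data))
  n <- nrow(data)
  w <- rep(1 / n, n)                     # start with uniform observation weights
  trees <- vector("list", n_rounds)
  alpha <- numeric(n_rounds)
  for (m in seq_len(n_rounds)) {
    trees[[m]] <- rpart(formula, data = data, weights = w,
                        control = rpart.control(maxdepth = 1))  # a decision stump
    miss <- as.numeric(predict(trees[[m]], data, type = "class") != y)
    err <- sum(w * miss)                 # weighted training error of this stump
    alpha[m] <- 0.5 * log((1 - err) / err)  # the stump's vote weight
    w <- w * exp(alpha[m] * miss)        # up-weight the misclassified observations
    w <- w / sum(w)
  }
  # To predict: sum alpha[m] * (+1/-1 vote of trees[[m]]) and take the sign
  list(trees = trees, alpha = alpha)
}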
Adaboost Algorithm (cont.)
• Before processing the data with Adaboost, it helps to attend to the following:
◦ Quality Data:
◦ The ensemble method attempts to correct misclassifications in the training data
◦ Make sure the training data is of high quality
◦ Outliers:
◦ Outliers force the ensemble to work very hard to correct unrealistic cases
◦ It is best to remove them from the dataset
◦ Noisy Data:
◦ Noise in the output variable can be problematic for the ensemble
◦ Isolate and clean noisy data from your training dataset
The Data Set
• The data set that we will use in this project contains 4898 observations of white wine varieties, with quality rated by wine tasters
• The data set contains 11 independent variables and 1 dependent variable
• The independent variables are: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol
• The dependent variable is the quality of the wine, rated from 3 (lowest quality) to 9 (highest quality)
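For reproducibility, the same file can also be read straight from the UCI repository rather than from a local path (the URL below is an assumption based on the repository's layout at the time of writing; the local-path call on the next slide is what the project actually ran):

wine_data <- read.csv(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv",
  header=TRUE, sep=";")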
Data Exploration and Visualization
wine_data <- read.csv("C:/Users/Mohammed/Google Drive/R_code/Project1/winequality-white.csv",
header=TRUE, sep=";")
dim(wine_data)
[1] 4898 12
names(wine_data)
[1] "fixed.acidity" "volatile.acidity" "citric.acid" "residual.sugar"
[5] "chlorides" "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
[9] "pH" "sulphates" "alcohol" "quality"
Data Exploration and Visualization (cont.)
str(wine_data)
'data.frame': 4898 obs. of 12 variables:
$ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
$ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
$ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
$ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
$ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
$ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
$ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
$ density : num 1.001 0.994 0.995 0.996 0.996 ...
$ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
$ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
$ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
$ quality : int 6 6 6 6 6 6 6 6 6 6 ...
Data Exploration and Visualization (cont.)
range(wine_data$fixed.acidity)
[1] 3.8 14.2
range(wine_data$volatile.acidity)
[1] 0.08 1.10
range(wine_data$citric.acid)
[1] 0.00 1.66
range(wine_data$residual.sugar)
[1] 0.6 65.8
range(wine_data$chlorides)
[1] 0.009 0.346
range(wine_data$free.sulfur.dioxide)
[1] 2 289
range(wine_data$total.sulfur.dioxide)
[1] 9 440
range(wine_data$density)
[1] 0.98711 1.03898
range(wine_data$pH)
[1] 2.72 3.82
range(wine_data$sulphates)
[1] 0.22 1.08
range(wine_data$alcohol)
[1] 8.0 14.2
Data Exploration and Visualization (cont.)
round(cor(wine_data), 2)
Fixed acid Vol. acid Citric acid Res.Sugar Chlorides FS dioxide TS dioxide Density pH Sulphates Alcohol Quality
Fixed acid 1 -0.02 0.29 0.09 0.02 -0.05 0.1 0.27 -0.43 -0.02 -0.12 -0.11
Vol. acid -0.02 1 -0.15 0.06 0.07 -0.1 0.09 0.03 -0.03 -0.04 0.07 -0.19
Citric acid 0.29 -0.15 1 0.09 0.11 0.09 0.12 0.15 -0.16 0.06 -0.08 -0.01
Res.Sugar 0.09 0.06 0.09 1 0.09 0.3 0.4 0.83 -0.2 -0.02 -0.05 -0.1
Chlorides 0.02 0.07 0.11 0.09 1 0.1 0.2 0.26 -0.1 0.02 -0.36 -0.21
FS dioxide -0.05 -0.1 0.09 0.3 0.1 1 0.62 0.3 -0.01 0.01 -0.25 0.01
TS dioxide 0.1 0.09 0.12 0.4 0.2 0.62 1 0.53 0.01 0.13 -0.45 -0.17
Density 0.27 0.03 0.15 0.83 0.26 0.3 0.53 1 -0.1 0.07 -0.8 -0.31
pH -0.43 -0.03 -0.16 -0.2 -0.1 -0.01 0.01 -0.1 1 0.16 0.12 0.1
Sulphates -0.02 -0.04 0.06 -0.02 0.02 0.01 0.13 0.07 0.16 1 -0.02 0.05
Alcohol -0.12 0.07 -0.08 -0.05 -0.36 -0.25 -0.45 -0.8 0.12 -0.02 1 0.44
Quality -0.11 -0.19 -0.01 -0.1 -0.21 0.01 -0.17 -0.31 0.1 0.05 0.44 1
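Since quality is still numeric at this point, the predictors can also be ranked by their correlation with the response instead of scanning the matrix by eye (a small convenience, not in the original slides):

# Rank predictors by correlation with quality; alcohol (0.44) and density (-0.31) stand out
sort(round(cor(wine_data)[, "quality"], 2), decreasing = TRUE)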
Data Exploration and Visualization (cont.)
hist(wine_data$alcohol, col="#EE3B3B", main="Histogram of Alcohol Percent in Wine", xlab="Alcohol Percent", ylab="Number of samples", las=1)
hist(wine_data$density, col="#BCEE6B", main="Histogram of Wine Density", xlab="Density", ylab="Number of samples", las=1)
hist(wine_data$chlorides, col="#CDB79E", main="Histogram of Chlorides in Wine", xlab="Chlorides", ylab="Number of samples", las=1)
Data Exploration and Visualization (cont.)
hist(wine_data$quality, col="#458B74", main="Wine Quality Histogram", xlab="Quality", ylab="Number of samples")
Factorizing the Dependent Variable
quality_fac <- ifelse(wine_data$quality >= 6, "high", "low")
wine_data <- data.frame(wine_data, quality_fac)
table(wine_data$quality_fac)
high low
3258 1640
• We can now remove the old integer dependent variable (quality)
wine_data <- wine_data[,-12]
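One portability caveat: since R 4.0, data.frame() no longer converts character columns to factors by default, and C5.0() (used next) expects a factor response. A version-proof variant of the same three commands (my rewrite, equivalent on older R):

# Explicit factor() avoids relying on automatic character-to-factor conversion
wine_data$quality_fac <- factor(ifelse(wine_data$quality >= 6, "high", "low"))
wine_data$quality <- NULL   # drop the original integer response
table(wine_data$quality_fac)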
Prediction Using a Single Decision Tree
• To establish a baseline for Adaboost, we first fit a single decision tree
• We use the C50 package in R
• The misclassification error rate is almost 22% with a single tree
set.seed(797)
training_size <- round(0.7 * dim(wine_data)[1])
training_sample <- sample(dim(wine_data)[1], training_size, replace=FALSE)
testing_sample <- -training_sample
training_data <- wine_data[training_sample,]
testing_data <- wine_data[-training_sample,]
testing <- quality_fac[testing_sample]
library(C50)
C50_model <- C5.0(quality_fac~., data=training_data)
predict_C50 <- predict(C50_model, testing_data[,-12])
mean(predict_C50 != testing)
[1] 0.2178353
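The single error number hides which class is harder to predict; a one-line confusion matrix (my addition, not in the original slides) shows where the tree goes wrong:

# Rows are predicted classes, columns are the true classes
table(predicted = predict_C50, actual = testing)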
Prediction Using Adaboost
• We will use the fastAdaboost library
• Initially, we choose 15 classifiers and combine them
• We see an immediate improvement when using Adaboost with 15 classifiers
• The error rate drops from 22% to roughly 18%
library(fastAdaboost)
adaboost_model <- adaboost(quality_fac~., training_data, 15)
pred <- predict(adaboost_model, newdata=testing_data)
error <- pred$error
print(paste("Adaboost Error:", error))
"Adaboost Error: 0.184479237576583"
Can we improve this?
• Choosing 15 classifiers in the previous step was arbitrary
• We can try different numbers of classifiers and compare performance
• Try every ensemble size from 1 to 50
trials <- data.frame(Trees=numeric(0), Error=numeric(0))
for (num_of_trees in 1:50){
pred <- predict(adaboost(quality_fac~., training_data, num_of_trees), newdata=testing_data)
error <- pred$error
print(paste("Adaboost Error at", num_of_trees, ":", error))
trials[num_of_trees, ] <- c(num_of_trees, error)
}
Can we improve this? (cont.)
[1] "Adaboost Error at 1 : 0.243703199455412"
[1] "Adaboost Error at 2 : 0.277739959155888"
[1] "Adaboost Error at 3 : 0.219877467665078"
[1] "Adaboost Error at 4 : 0.228727025187202"
[1] "Adaboost Error at 5 : 0.208304969366916"
[1] "Adaboost Error at 6 : 0.222600408441116"
[1] "Adaboost Error at 7 : 0.194009530292716"
[1] "Adaboost Error at 8 : 0.200816882232811"
[1] "Adaboost Error at 9 : 0.191967324710688"
[1] "Adaboost Error at 10 : 0.195371000680735"
[1] "Adaboost Error at 11 : 0.193328795098707"
[1] "Adaboost Error at 12 : 0.198093941456773"
[1] "Adaboost Error at 13 : 0.185840707964602"
[1] "Adaboost Error at 14 : 0.183117767188564"
[1] "Adaboost Error at 15 : 0.184479237576583"
[1] "Adaboost Error at 16 : 0.189244383934649"
[1] "Adaboost Error at 17 : 0.186521443158611"
[1] "Adaboost Error at 18 : 0.189925119128659"
[1] "Adaboost Error at 19 : 0.185159972770592"
[1] "Adaboost Error at 20 : 0.184479237576583"
[1] "Adaboost Error at 21 : 0.182437031994554"
[1] "Adaboost Error at 22 : 0.187202178352621"
[1] "Adaboost Error at 23 : 0.181075561606535"
[1] "Adaboost Error at 24 : 0.184479237576583"
[1] "Adaboost Error at 25 : 0.174948944860449"
[1] "Adaboost Error at 26 : 0.183798502382573"
[1] "Adaboost Error at 27 : 0.181075561606535"
[1] "Adaboost Error at 28 : 0.176310415248468"
[1] "Adaboost Error at 29 : 0.179714091218516"
[1] "Adaboost Error at 30 : 0.180394826412526"
[1] "Adaboost Error at 31 : 0.179714091218516"
[1] "Adaboost Error at 32 : 0.177671885636487"
[1] "Adaboost Error at 33 : 0.171545268890402"
[1] "Adaboost Error at 34 : 0.171545268890402"
[1] "Adaboost Error at 35 : 0.170864533696392"
[1] "Adaboost Error at 36 : 0.172906739278421"
[1] "Adaboost Error at 37 : 0.170864533696392"
[1] "Adaboost Error at 38 : 0.175629680054459"
[1] "Adaboost Error at 39 : 0.171545268890402"
[1] "Adaboost Error at 40 : 0.17358747447243"
[1] "Adaboost Error at 41 : 0.17358747447243"
[1] "Adaboost Error at 42 : 0.175629680054459"
[1] "Adaboost Error at 43 : 0.17358747447243"
[1] "Adaboost Error at 44 : 0.172906739278421"
[1] "Adaboost Error at 45 : 0.174948944860449"
[1] "Adaboost Error at 46 : 0.174948944860449"
[1] "Adaboost Error at 47 : 0.174948944860449"
[1] "Adaboost Error at 48 : 0.171545268890402"
[1] "Adaboost Error at 49 : 0.172226004084411"
[1] "Adaboost Error at 50 : 0.172226004084411"
print(paste("The minimum is", min(trials$Error)))
[1] "The minimum is 0.170864533696392"
We improved from an 18% error rate with 15 classifiers to 17% with 35 classifiers.
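Rather than eyeballing the printout, which.min() recovers the winning ensemble size directly (a small convenience, not in the original slides):

best <- which.min(trials$Error)   # row with the lowest test error
paste("Minimum error", round(trials$Error[best], 4), "at", trials$Trees[best], "classifiers")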
Can we improve this? (cont.)
In general, the error decreases as the number of classifiers increases.
library(ggplot2)
qplot(Trees, Error, data=trials, color=Trees, xlab="Number of Classifiers", ylab="Error", geom=c("point", "smooth"))
Can we improve even further?
• One of the recommendations for data used with Adaboost is that it be free of outliers
• We did not check for outliers
• We can check whether the data contains outliers and remove them to improve performance
• … and we repeat the two steps below for every other variable in the dataset
wine_data2 <- wine_data
outlier <- boxplot.stats(wine_data2$fixed.acidity)$out
wine_data2$fixed.acidity <- ifelse(wine_data2$fixed.acidity %in% outlier, NA, wine_data2$fixed.acidity)
boxplot.stats(wine_data2$fixed.acidity)
wine_data2 <- wine_data2[complete.cases(wine_data2),]
outlier <- boxplot.stats(wine_data2$volatile.acidity)$out
wine_data2$volatile.acidity <- ifelse(wine_data2$volatile.acidity %in% outlier, NA, wine_data2$volatile.acidity)
boxplot.stats(wine_data2$volatile.acidity)
# … continue for all other variables
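Repeating that block for all eleven predictors invites copy-paste mistakes. The loop below is a compact sketch of the same filter; note it flags outliers against the full data in a single pass, whereas the slide's sequential version re-filters after each variable, so the resulting row counts can differ slightly.

# Apply the boxplot-based outlier filter to every numeric predictor at once
wine_data2 <- wine_data
num_vars <- names(wine_data2)[sapply(wine_data2, is.numeric)]
for (v in num_vars) {
  out <- boxplot.stats(wine_data2[[v]])$out
  wine_data2[[v]][wine_data2[[v]] %in% out] <- NA
}
wine_data2 <- wine_data2[complete.cases(wine_data2), ]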
Can we improve even further? (cont.)
• After removing all the NA values, we are left with 3971 data instances
• We originally had 4898 data instances
• We lost 927 data instances to outliers in all the variables
• Is that worth it?
dim(wine_data)
[1] 4898 12
dim(wine_data2)
[1] 3971 12
Can we improve even further? (cont.)
• After removing the outliers, we again try ensemble sizes from 1 to 50
set.seed(797)
training_size <- round(0.7 * dim(wine_data2)[1])
training_sample <- sample(dim(wine_data2)[1], training_size, replace=FALSE)
testing_sample <- -training_sample
training_data <- wine_data2[training_sample,]
testing_data <- wine_data2[-training_sample,]
testing <- wine_data2$quality_fac[testing_sample]  # index labels from the filtered data, not the original vector
trials2 <- data.frame(Trees=numeric(0), Error=numeric(0))
for (num_of_trees in 1:50){
pred <- predict(adaboost(quality_fac~., training_data, num_of_trees), newdata=testing_data)
error <- pred$error
print(paste("Adaboost Error at", num_of_trees, ":", error))
trials2[num_of_trees, ] <- c(num_of_trees, error)
}
Can we improve even further? (cont.)
[1] "Adaboost Error at 1 : 0.250209907640638"
[1] "Adaboost Error at 2 : 0.300587741393787"
[1] "Adaboost Error at 3 : 0.237615449202351"
[1] "Adaboost Error at 4 : 0.26448362720403"
[1] "Adaboost Error at 5 : 0.216624685138539"
[1] "Adaboost Error at 6 : 0.219983207388749"
[1] "Adaboost Error at 7 : 0.203190596137699"
[1] "Adaboost Error at 8 : 0.222502099076406"
[1] "Adaboost Error at 9 : 0.194794290512175"
[1] "Adaboost Error at 10 : 0.198152812762385"
[1] "Adaboost Error at 11 : 0.19311502938707"
[1] "Adaboost Error at 12 : 0.199832073887489"
[1] "Adaboost Error at 13 : 0.19647355163728"
[1] "Adaboost Error at 14 : 0.194794290512175"
[1] "Adaboost Error at 15 : 0.184718723761545"
[1] "Adaboost Error at 16 : 0.185558354324097"
[1] "Adaboost Error at 17 : 0.174643157010915"
[1] "Adaboost Error at 18 : 0.18975650713686"
[1] "Adaboost Error at 19 : 0.174643157010915"
[1] "Adaboost Error at 20 : 0.185558354324097"
[1] "Adaboost Error at 21 : 0.17632241813602"
[1] "Adaboost Error at 22 : 0.182199832073887"
[1] "Adaboost Error at 23 : 0.174643157010915"
[1] "Adaboost Error at 24 : 0.182199832073887"
[1] "Adaboost Error at 25 : 0.174643157010915"
[1] "Adaboost Error at 26 : 0.17296389588581"
[1] "Adaboost Error at 27 : 0.180520570948783"
[1] "Adaboost Error at 28 : 0.174643157010915"
[1] "Adaboost Error at 29 : 0.17296389588581"
[1] "Adaboost Error at 30 : 0.183879093198992"
[1] "Adaboost Error at 31 : 0.174643157010915"
[1] "Adaboost Error at 32 : 0.178001679261125"
[1] "Adaboost Error at 33 : 0.170445004198153"
[1] "Adaboost Error at 34 : 0.168765743073048"
[1] "Adaboost Error at 35 : 0.164567590260285"
[1] "Adaboost Error at 36 : 0.1696053736356"
[1] "Adaboost Error at 37 : 0.162048698572628"
[1] "Adaboost Error at 38 : 0.168765743073048"
[1] "Adaboost Error at 39 : 0.167086481947943"
[1] "Adaboost Error at 40 : 0.161209068010076"
[1] "Adaboost Error at 41 : 0.1696053736356"
[1] "Adaboost Error at 42 : 0.165407220822838"
[1] "Adaboost Error at 43 : 0.171284634760705"
[1] "Adaboost Error at 44 : 0.174643157010915"
[1] "Adaboost Error at 45 : 0.167926112510495"
[1] "Adaboost Error at 46 : 0.171284634760705"
[1] "Adaboost Error at 47 : 0.1696053736356"
[1] "Adaboost Error at 48 : 0.170445004198153"
[1] "Adaboost Error at 49 : 0.16624685138539"
[1] "Adaboost Error at 50 : 0.168765743073048"
print(paste("The minimum is", min(trials2$Error)))
[1] "The minimum is 0.161209068010076"
We improved from a 17% error rate with the outliers present to 16% after removing them (not a substantial improvement, but worthwhile).
Can we improve even further? (cont.)
Here too, the error generally decreases as the number of classifiers increases.
qplot(Trees, Error, data=trials2, color=Trees, xlab="Number of Classifiers", ylab="Error", geom=c("point", "smooth"))
Comparison
• Single Tree: misclassification error = 22%
• Adaboost (arbitrary choice of 15 classifiers): misclassification error = 18%
• Adaboost (tuned, with outliers): misclassification error = 17%
• Adaboost (tuned, no outliers): misclassification error = 16%
References
• Boosting and AdaBoost for Machine Learning
• What is AdaBoost? – Quora
• AdaBoost – Wikipedia
• Identify, describe, plot, and remove the outliers from the dataset
• Wine Quality Data Set – UCI Machine Learning Repository
• fastAdaboost – GitHub
• fastAdaboost – CRAN