Using the R package fastAdaboost, we apply the Adaboost algorithm created by Yoav Freund and Robert Schapire to a public data set (white wine quality) to see whether we can improve on the performance of a single decision tree.
1. Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees
MOHAMMED ALHAMADI - PROJECT 2
2. Acknowledgement
This project was done as a partial requirement for the course Introduction
to Machine Learning, offered online in Fall 2016 through NYU Tandon Online,
Tandon School of Engineering.
3. Outline
1. Adaboost Algorithm
2. The Dataset
3. Data Exploration and Visualization
4. Factorizing the Dependent Variable
5. Prediction Using a Single Decision Tree
6. Prediction Using Adaboost
7. Can we improve this?
8. Can we improve even further?
9. Comparison
10. References
4. Adaboost Algorithm
• Adaboost is short for “Adaptive Boosting”
• Created by Yoav Freund and Robert Schapire in 1997
• Adaboost is a meta-algorithm (it provides a sufficiently good solution to an optimization problem)
• Can be used with other learning algorithms to improve their performance
• Makes weak classifiers strong by combining them
• The weak classifiers are typically decision trees (which we’ll use in this project)
5. Adaboost Algorithm (cont.)
• Weak models are added sequentially after being trained on the training data
• The process continues until:
• No further improvement can be made on the training data set
• A threshold number of weak classifiers is created
• Predictions are then made by taking the weighted average of the weak classifiers
• Each weak classifier has a weight (its classification result is multiplied by its weight)
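The weighted-vote step described above can be sketched in a few lines of R. This is a toy illustration, not fastAdaboost's internals; the votes and weights below are made up:

```r
# Each weak classifier votes -1 or +1; its vote is scaled by its
# weight (alpha), and the sign of the weighted sum is the final label.
weak_votes <- c(+1, -1, +1)        # predictions from 3 weak classifiers
alpha      <- c(0.9, 0.3, 0.5)     # their weights: higher = more trusted
final_pred <- sign(sum(alpha * weak_votes))
final_pred                         # +1: the two agreeing classifiers outvote the dissenter
```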
6. Adaboost Algorithm (cont.)
• The data should ideally have the following characteristics before it is processed with Adaboost:
◦ Quality Data:
◦ The ensemble method attempts to correct misclassifications in the training data
◦ Make sure that the training data is of high quality
◦ Outliers:
◦ Outliers will force the ensemble to work very hard in order to correct for cases that are unrealistic
◦ Best to remove them from dataset
◦ Noisy Data:
◦ Noise in the output variable can be problematic for the ensemble
◦ Isolate and clean the noisy data from your training dataset
7. The Data Set
• The data set that we will use in this project contains 4898 observations on white wine varieties, with quality ranked by wine tasters
• The data set contains 11 independent variables and 1 dependent variable
• The Independent variables include: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides,
free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol
• The dependent variable is the quality of the wine ranked from 3 (lowest quality) to 9 (highest quality)
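Loading the UCI file is straightforward but for one detail: it is semicolon-separated, so read.csv needs sep = ";". The sketch below uses a tiny in-memory sample with made-up values so it runs without the file; the file name in the comment is an assumption:

```r
# The real call would be something like:
#   wine_data <- read.csv("winequality-white.csv", sep = ";")
# Here a 2-row in-memory sample stands in for the file.
sample_csv <- '"fixed acidity";"alcohol";"quality"
7.0;8.8;6
6.3;9.5;5'
wine_sample <- read.csv(text = sample_csv, sep = ";")
str(wine_sample)   # note: R converts "fixed acidity" to fixed.acidity
```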
12. Data Exploration and Visualization (cont.)
hist(wine_data$alcohol, col="#EE3B3B", main="Histogram of Alcohol Percent in Wine", xlab="Alcohol Percent", ylab="Number of samples", las=1)
hist(wine_data$density, col="#BCEE6B", main="Histogram of Wine Density", xlab="Density", ylab="Number of samples", las=1)
hist(wine_data$chlorides, col="#CDB79E", main="Histogram of Chlorides in Wine", xlab="Chlorides", ylab="Number of samples", las=1)
13. Data Exploration and Visualization (cont.)
hist(wine_data$quality, col="#458B74", main="Wine Quality Histogram", xlab="Quality", ylab="Number of samples")
14. Factorizing the Dependent Variable
quality_fac <- factor(ifelse(wine_data$quality >= 6, "high", "low"))
wine_data <- data.frame(wine_data, quality_fac)
table(wine_data$quality_fac)
high low
3258 1640
• We can now remove the old integer dependent variable (quality)
wine_data <- wine_data[,-12]
15. Prediction Using a Single Decision Tree
• In order to compare the performance of Adaboost, we will first use a single decision tree
• Package (C50) in R
• So the misclassification error rate is almost 22% with a single tree
set.seed(797)
training_size <- round(0.7 * dim(wine_data)[1])
training_sample <- sample(dim(wine_data)[1], training_size, replace=FALSE)
testing_sample <- -training_sample
training_data <- wine_data[training_sample,]
testing_data <- wine_data[-training_sample,]
testing <- quality_fac[testing_sample]
C50_model <- C5.0(quality_fac~., data=training_data)
predict_C50 <- predict(C50_model, testing_data[,-12])
mean(predict_C50 != testing)
[1] 0.2178353
16. Prediction Using Adaboost
• We will use library (fastAdaboost)
• Initially, we will choose to use 15 classifiers and combine them
• So, we can see a quick improvement when using Adaboost with 15 classifiers
• We improved from a 22% error rate to 18%
library(fastAdaboost)
adaboost_model <- adaboost(quality_fac~., training_data, 15)
pred <- predict(adaboost_model, newdata=testing_data)
error <- pred$error
print(paste("Adaboost Error:", error))
"Adaboost Error: 0.184479237576583"
17. Can we improve this?
• Choosing 15 classifiers in the previous step was a random choice
• We can try many different numbers of classifiers and check the performance
• Try a range of classifiers from 1 to 50
trials <- data.frame(Trees=numeric(0), Error=numeric(0))
for (num_of_trees in 1:50){
pred <- predict(adaboost(quality_fac~., training_data, num_of_trees), newdata=testing_data)
error <- pred$error
print(paste("Adaboost Error at", num_of_trees, ":", error))
trials[num_of_trees, ] <- c(num_of_trees, error)
}
18. Can we improve this? (cont.)
[1] "Adaboost Error at 1 : 0.243703199455412"
[1] "Adaboost Error at 2 : 0.277739959155888"
[1] "Adaboost Error at 3 : 0.219877467665078"
[1] "Adaboost Error at 4 : 0.228727025187202"
[1] "Adaboost Error at 5 : 0.208304969366916"
[1] "Adaboost Error at 6 : 0.222600408441116"
[1] "Adaboost Error at 7 : 0.194009530292716"
[1] "Adaboost Error at 8 : 0.200816882232811"
[1] "Adaboost Error at 9 : 0.191967324710688"
[1] "Adaboost Error at 10 : 0.195371000680735"
[1] "Adaboost Error at 11 : 0.193328795098707"
[1] "Adaboost Error at 12 : 0.198093941456773"
[1] "Adaboost Error at 13 : 0.185840707964602"
[1] "Adaboost Error at 14 : 0.183117767188564"
[1] "Adaboost Error at 15 : 0.184479237576583"
[1] "Adaboost Error at 16 : 0.189244383934649"
[1] "Adaboost Error at 17 : 0.186521443158611"
[1] "Adaboost Error at 18 : 0.189925119128659"
[1] "Adaboost Error at 19 : 0.185159972770592"
[1] "Adaboost Error at 20 : 0.184479237576583"
[1] "Adaboost Error at 21 : 0.182437031994554"
[1] "Adaboost Error at 22 : 0.187202178352621"
[1] "Adaboost Error at 23 : 0.181075561606535"
[1] "Adaboost Error at 24 : 0.184479237576583"
[1] "Adaboost Error at 25 : 0.174948944860449"
[1] "Adaboost Error at 26 : 0.183798502382573"
[1] "Adaboost Error at 27 : 0.181075561606535"
[1] "Adaboost Error at 28 : 0.176310415248468"
[1] "Adaboost Error at 29 : 0.179714091218516"
[1] "Adaboost Error at 30 : 0.180394826412526"
[1] "Adaboost Error at 31 : 0.179714091218516"
[1] "Adaboost Error at 32 : 0.177671885636487"
[1] "Adaboost Error at 33 : 0.171545268890402"
[1] "Adaboost Error at 34 : 0.171545268890402"
[1] "Adaboost Error at 35 : 0.170864533696392"
[1] "Adaboost Error at 36 : 0.172906739278421"
[1] "Adaboost Error at 37 : 0.170864533696392"
[1] "Adaboost Error at 38 : 0.175629680054459"
[1] "Adaboost Error at 39 : 0.171545268890402"
[1] "Adaboost Error at 40 : 0.17358747447243"
[1] "Adaboost Error at 41 : 0.17358747447243"
[1] "Adaboost Error at 42 : 0.175629680054459"
[1] "Adaboost Error at 43 : 0.17358747447243"
[1] "Adaboost Error at 44 : 0.172906739278421"
[1] "Adaboost Error at 45 : 0.174948944860449"
[1] "Adaboost Error at 46 : 0.174948944860449"
[1] "Adaboost Error at 47 : 0.174948944860449"
[1] "Adaboost Error at 48 : 0.171545268890402"
[1] "Adaboost Error at 49 : 0.172226004084411"
[1] "Adaboost Error at 50 : 0.172226004084411"
print(paste("The minimum is", min(trials[,2])))
[1] "The minimum is 0.170864533696392"
We improved from an 18% error rate when using 15 classifiers to 17% when using 35 classifiers
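Rather than scanning the printout for the smallest error, which.min() picks the best row of the trials frame directly. The sketch below uses a toy stand-in for trials, filled with three of the error values printed above:

```r
# Toy stand-in for the trials data frame built in the loop above
trials <- data.frame(Trees = c(1, 35, 50),
                     Error = c(0.2437032, 0.1708645, 0.1722260))
best <- trials[which.min(trials$Error), ]   # row with the lowest error
best                                        # Trees = 35, Error = 0.1708645
```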
19. Can we improve this? (cont.)
We can notice that, generally, the error
decreases as the number of classifiers
increases.
library(ggplot2)
qplot(Trees, Error, data=trials, color=Trees, xlab="Number of Classifiers", ylab="Error", geom=c("point", "smooth"))
20. Can we improve even further?
• One of the recommendations for data used with Adaboost is that it should not contain outliers
• We did not check for outliers
• We can check if the data contains outliers and remove them to improve performance
•… and we do this for all other variables in the dataset
wine_data2 <- wine_data
outlier <- boxplot.stats(wine_data2$fixed.acidity)$out
wine_data2$fixed.acidity <- ifelse(wine_data2$fixed.acidity %in% outlier, NA, wine_data2$fixed.acidity)
boxplot.stats(wine_data2$fixed.acidity)
wine_data2 <- wine_data2[complete.cases(wine_data2),]
outlier <- boxplot.stats(wine_data2$volatile.acidity)$out
wine_data2$volatile.acidity <- ifelse(wine_data2$volatile.acidity %in% outlier, NA, wine_data2$volatile.acidity)
boxplot.stats(wine_data2$volatile.acidity)
# … continue for all other variables
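The per-column steps above can be collapsed into a loop over every column. This is a hedged sketch: the toy data frame stands in for wine_data, and on the real data the factor column quality_fac would need to be excluded from the loop:

```r
# Flag boxplot outliers as NA in each numeric column, then drop incomplete rows
df <- data.frame(a = c(1, 2, 3, 4, 5, 100),   # 100 lies beyond the whiskers
                 b = c(10, 11, 12, 13, 14, 15))
for (col in names(df)) {
  out <- boxplot.stats(df[[col]])$out          # values beyond the whiskers
  df[[col]] <- ifelse(df[[col]] %in% out, NA, df[[col]])
}
df <- df[complete.cases(df), ]                 # the row containing 100 is removed
nrow(df)                                       # 5
```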
21. Can we improve even further? (cont.)
• After removing all the NA values, we are left with 3971 data instances
• We originally had 4898 data instances
• We lost 927 data instances to outliers across all the variables
• Is that worth it?
dim(wine_data)
dim(wine_data2)
[1] 4898 12
[1] 3971 12
22. Can we improve even further? (cont.)
• We will check multiple number of classifiers again after removing outliers
set.seed(797)
training_size <- round(0.7 * dim(wine_data2)[1])
training_sample <- sample(dim(wine_data2)[1], training_size, replace=FALSE)
testing_sample <- -training_sample
training_data <- wine_data2[training_sample,]
testing_data <- wine_data2[-training_sample,]
testing <- wine_data2$quality_fac[testing_sample]
trials2 <- data.frame(Trees=numeric(0), Error=numeric(0))
for (num_of_trees in 1:50){
pred <- predict(adaboost(quality_fac~., training_data, num_of_trees), newdata=testing_data)
error <- pred$error
print(paste("Adaboost Error at", num_of_trees, ":", error))
trials2[num_of_trees, ] <- c(num_of_trees, error)
}
23. Can we improve even further? (cont.)
[1] "Adaboost Error at 1 : 0.250209907640638"
[1] "Adaboost Error at 2 : 0.300587741393787"
[1] "Adaboost Error at 3 : 0.237615449202351"
[1] "Adaboost Error at 4 : 0.26448362720403"
[1] "Adaboost Error at 5 : 0.216624685138539"
[1] "Adaboost Error at 6 : 0.219983207388749"
[1] "Adaboost Error at 7 : 0.203190596137699"
[1] "Adaboost Error at 8 : 0.222502099076406"
[1] "Adaboost Error at 9 : 0.194794290512175"
[1] "Adaboost Error at 10 : 0.198152812762385"
[1] "Adaboost Error at 11 : 0.19311502938707"
[1] "Adaboost Error at 12 : 0.199832073887489"
[1] "Adaboost Error at 13 : 0.19647355163728"
[1] "Adaboost Error at 14 : 0.194794290512175"
[1] "Adaboost Error at 15 : 0.184718723761545"
[1] "Adaboost Error at 16 : 0.185558354324097"
[1] "Adaboost Error at 17 : 0.174643157010915"
[1] "Adaboost Error at 18 : 0.18975650713686"
[1] "Adaboost Error at 19 : 0.174643157010915"
[1] "Adaboost Error at 20 : 0.185558354324097"
[1] "Adaboost Error at 21 : 0.17632241813602"
[1] "Adaboost Error at 22 : 0.182199832073887"
[1] "Adaboost Error at 23 : 0.174643157010915"
[1] "Adaboost Error at 24 : 0.182199832073887"
[1] "Adaboost Error at 25 : 0.174643157010915"
[1] "Adaboost Error at 26 : 0.17296389588581"
[1] "Adaboost Error at 27 : 0.180520570948783"
[1] "Adaboost Error at 28 : 0.174643157010915"
[1] "Adaboost Error at 29 : 0.17296389588581"
[1] "Adaboost Error at 30 : 0.183879093198992"
[1] "Adaboost Error at 31 : 0.174643157010915"
[1] "Adaboost Error at 32 : 0.178001679261125"
[1] "Adaboost Error at 33 : 0.170445004198153"
[1] "Adaboost Error at 34 : 0.168765743073048"
[1] "Adaboost Error at 35 : 0.164567590260285"
[1] "Adaboost Error at 36 : 0.1696053736356"
[1] "Adaboost Error at 37 : 0.162048698572628"
[1] "Adaboost Error at 38 : 0.168765743073048"
[1] "Adaboost Error at 39 : 0.167086481947943"
[1] "Adaboost Error at 40 : 0.161209068010076"
[1] "Adaboost Error at 41 : 0.1696053736356"
[1] "Adaboost Error at 42 : 0.165407220822838"
[1] "Adaboost Error at 43 : 0.171284634760705"
[1] "Adaboost Error at 44 : 0.174643157010915"
[1] "Adaboost Error at 45 : 0.167926112510495"
[1] "Adaboost Error at 46 : 0.171284634760705"
[1] "Adaboost Error at 47 : 0.1696053736356"
[1] "Adaboost Error at 48 : 0.170445004198153"
[1] "Adaboost Error at 49 : 0.16624685138539"
[1] "Adaboost Error at 50 : 0.168765743073048"
print(paste("The minimum is", min(trials2[,2])))
[1] "The minimum is 0.161209068010076"
We improved from a 17% error rate with outliers in the data to 16% when we removed the outliers (not a substantial improvement, but good enough)
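Pulling the error rates reported in this deck into one small frame makes the progression easy to see (the numbers are copied from the runs above):

```r
# Error rates from the four stages of this project
results <- data.frame(
  Model = c("Single C5.0 tree",
            "Adaboost, 15 classifiers",
            "Adaboost, 35 classifiers",
            "Adaboost, 40 classifiers (outliers removed)"),
  Error = c(0.2178, 0.1845, 0.1709, 0.1612))
results
```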
24. Can we improve even further? (cont.)
Here also we notice that, generally, the
error decreases as the number of
classifiers increases.
library(ggplot2)
qplot(Trees, Error, data=trials2, color=Trees, xlab="Number of Classifiers", ylab="Error", geom=c("point", "smooth"))
26. References
• Boosting and AdaBoost for Machine Learning
• What is AdaBoost? - Quora
• AdaBoost - Wikipedia
• Identify, describe, plot, and remove the outliers from the dataset
• Wine Quality Data Set - UCI Machine Learning Repository
• fastAdaboost - GitHub
• fastAdaboost - CRAN