Using the R package fastAdaboost, we apply the Adaboost algorithm created by Yoav Freund and Robert Schapire to a public data set (white wine quality) to see whether we can improve on the performance of a single decision tree.
1. Using Adaboost Algorithm to Enhance the Prediction Accuracy of Decision Trees
MOHAMMED ALHAMADI - PROJECT 2
2. Acknowledgement
This project was done as a partial requirement for the course Introduction
to Machine Learning, offered online in Fall 2016 through NYU Tandon Online,
Tandon School of Engineering.
3. Outline
1. Adaboost Algorithm
2. The Dataset
3. Data Exploration and Visualization
4. Factorizing the Dependent Variable
5. Prediction Using a Single Decision Tree
6. Prediction Using Adaboost
7. Can we improve this?
8. Can we improve even further?
9. Comparison
10. References
4. Adaboost Algorithm
• Adaboost is short for “Adaptive Boosting”
• Created by Yoav Freund and Robert Schapire in 1997
• Adaboost is a meta-algorithm (it provides a sufficiently good solution to an optimization problem)
• Can be used with other learning algorithms to improve their performance
• Makes weak classifiers strong by combining them
• The weak classifiers are typically decision trees (which we’ll use in this project)
5. Adaboost Algorithm (cont.)
• Weak models are added sequentially after being trained on the training data
• The process continues until:
• No further improvement can be made on the training data set
• A threshold number of weak classifiers is created
• Predictions are then made by taking the weighted average of the weak classifiers
• Each weak classifier has a weight (its classification result is multiplied by its weight)
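The weighted-vote step described above can be sketched in a few lines of R. This is a toy illustration, not fastAdaboost's internals; the votes and weights below are made up:

```r
# Each weak classifier votes -1 or +1; its vote is scaled by its
# weight (alpha), and the sign of the weighted sum is the final label.
weak_votes <- c(+1, -1, +1)        # predictions from 3 weak classifiers
alpha      <- c(0.9, 0.3, 0.5)     # their weights: higher = more trusted
final_pred <- sign(sum(alpha * weak_votes))
final_pred                         # +1: the two agreeing classifiers outvote the dissenter
```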
6. Adaboost Algorithm (cont.)
• The data should ideally have the following characteristics before it is processed with Adaboost:
◦ Quality Data:
◦ The ensemble method attempts to correct misclassifications in the training data
◦ Make sure that the training data is of high quality
◦ Outliers:
◦ Outliers will force the ensemble to work very hard in order to correct for cases that are unrealistic
◦ Best to remove them from dataset
◦ Noisy Data:
◦ Noise in the output variable can be problematic for the ensemble
◦ Isolate and clean the noisy data from your training dataset
7. The Data Set
• The data set that we will use in this project contains 4898 observations on white wine varieties, with quality ranked by wine tasters
• The data set contains 11 independent variables and 1 dependent variable
• The Independent variables include: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides,
free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol
• The dependent variable is the quality of the wine ranked from 3 (lowest quality) to 9 (highest quality)
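Loading the UCI file is straightforward but for one detail: it is semicolon-separated, so read.csv needs sep = ";". The sketch below uses a tiny in-memory sample with made-up values so it runs without the file; the file name in the comment is an assumption:

```r
# The real call would be something like:
#   wine_data <- read.csv("winequality-white.csv", sep = ";")
# Here a 2-row in-memory sample stands in for the file.
sample_csv <- '"fixed acidity";"alcohol";"quality"
7.0;8.8;6
6.3;9.5;5'
wine_sample <- read.csv(text = sample_csv, sep = ";")
str(wine_sample)   # note: R converts "fixed acidity" to fixed.acidity
```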
12. Data Exploration and Visualization (cont.)
hist(wine_data$alcohol, col="#EE3B3B", main="Histogram of Alcohol Percent in Wine", xlab="Alcohol Percent", ylab="Number of samples", las=1)
hist(wine_data$density, col="#BCEE6B", main="Histogram of Wine Density", xlab="Density", ylab="Number of samples", las=1)
hist(wine_data$chlorides, col="#CDB79E", main="Histogram of Chlorides in Wine", xlab="Chlorides", ylab="Number of samples", las=1)
13. Data Exploration and Visualization (cont.)
hist(wine_data$quality, col="#458B74", main="Wine Quality Histogram", xlab="Quality", ylab="Number of samples")
14. Factorizing the Dependent Variable
quality_fac <- factor(ifelse(wine_data$quality >= 6, "high", "low"))
wine_data <- data.frame(wine_data, quality_fac)
table(wine_data$quality_fac)
high low
3258 1640
• We can now remove the old integer dependent variable (quality)
wine_data <- wine_data[,-12]
15. Prediction Using a Single Decision Tree
• In order to compare the performance of Adaboost, we will first use a single decision tree
• Package (C50) in R
• So the misclassification error rate is almost 22% with a single tree
set.seed(797)
training_size <- round(0.7 * dim(wine_data)[1])
training_sample <- sample(dim(wine_data)[1], training_size, replace=FALSE)
testing_sample <- -training_sample
training_data <- wine_data[training_sample,]
testing_data <- wine_data[-training_sample,]
testing <- quality_fac[testing_sample]
C50_model <- C5.0(quality_fac~., data=training_data)
predict_C50 <- predict(C50_model, testing_data[,-12])
mean(predict_C50 != testing)
[1] 0.2178353
16. Prediction Using Adaboost
• We will use library (fastAdaboost)
• Initially, we will choose to use 15 classifiers and combine them
• So, we can see a quick improvement when using Adaboost with 15 classifiers
• We improved from a 22% error rate to 18%
library(fastAdaboost)
adaboost_model <- adaboost(quality_fac~., training_data, 15)
pred <- predict(adaboost_model, newdata=testing_data)
error <- pred$error
print(paste("Adaboost Error:", error))
"Adaboost Error: 0.184479237576583"
17. Can we improve this?
• Choosing 15 classifiers in the previous step was a random choice
• We can try many different numbers of classifiers and check the performance
• Try a range of classifiers from 1 to 50
trials <- data.frame(Trees=numeric(0), Error=numeric(0))
for (num_of_trees in 1:50){
pred <- predict(adaboost(quality_fac~., training_data, num_of_trees), newdata=testing_data)
error <- pred$error
print(paste("Adaboost Error at", num_of_trees, ":", error))
trials[num_of_trees, ] <- c(num_of_trees, error)
}
18. Can we improve this? (cont.)
[1] "Adaboost Error at 1 : 0.243703199455412"
[1] "Adaboost Error at 2 : 0.277739959155888"
[1] "Adaboost Error at 3 : 0.219877467665078"
[1] "Adaboost Error at 4 : 0.228727025187202"
[1] "Adaboost Error at 5 : 0.208304969366916"
[1] "Adaboost Error at 6 : 0.222600408441116"
[1] "Adaboost Error at 7 : 0.194009530292716"
[1] "Adaboost Error at 8 : 0.200816882232811"
[1] "Adaboost Error at 9 : 0.191967324710688"
[1] "Adaboost Error at 10 : 0.195371000680735"
[1] "Adaboost Error at 11 : 0.193328795098707"
[1] "Adaboost Error at 12 : 0.198093941456773"
[1] "Adaboost Error at 13 : 0.185840707964602"
[1] "Adaboost Error at 14 : 0.183117767188564"
[1] "Adaboost Error at 15 : 0.184479237576583"
[1] "Adaboost Error at 16 : 0.189244383934649"
[1] "Adaboost Error at 17 : 0.186521443158611"
[1] "Adaboost Error at 18 : 0.189925119128659"
[1] "Adaboost Error at 19 : 0.185159972770592"
[1] "Adaboost Error at 20 : 0.184479237576583"
[1] "Adaboost Error at 21 : 0.182437031994554"
[1] "Adaboost Error at 22 : 0.187202178352621"
[1] "Adaboost Error at 23 : 0.181075561606535"
[1] "Adaboost Error at 24 : 0.184479237576583"
[1] "Adaboost Error at 25 : 0.174948944860449"
[1] "Adaboost Error at 26 : 0.183798502382573"
[1] "Adaboost Error at 27 : 0.181075561606535"
[1] "Adaboost Error at 28 : 0.176310415248468"
[1] "Adaboost Error at 29 : 0.179714091218516"
[1] "Adaboost Error at 30 : 0.180394826412526"
[1] "Adaboost Error at 31 : 0.179714091218516"
[1] "Adaboost Error at 32 : 0.177671885636487"
[1] "Adaboost Error at 33 : 0.171545268890402"
[1] "Adaboost Error at 34 : 0.171545268890402"
[1] "Adaboost Error at 35 : 0.170864533696392"
[1] "Adaboost Error at 36 : 0.172906739278421"
[1] "Adaboost Error at 37 : 0.170864533696392"
[1] "Adaboost Error at 38 : 0.175629680054459"
[1] "Adaboost Error at 39 : 0.171545268890402"
[1] "Adaboost Error at 40 : 0.17358747447243"
[1] "Adaboost Error at 41 : 0.17358747447243"
[1] "Adaboost Error at 42 : 0.175629680054459"
[1] "Adaboost Error at 43 : 0.17358747447243"
[1] "Adaboost Error at 44 : 0.172906739278421"
[1] "Adaboost Error at 45 : 0.174948944860449"
[1] "Adaboost Error at 46 : 0.174948944860449"
[1] "Adaboost Error at 47 : 0.174948944860449"
[1] "Adaboost Error at 48 : 0.171545268890402"
[1] "Adaboost Error at 49 : 0.172226004084411"
[1] "Adaboost Error at 50 : 0.172226004084411"
print(paste("The minimum is", min(trials[,2])))
[1] "The minimum is 0.170864533696392"
We improved from an 18% error rate when using 15 classifiers to 17% when using 35 classifiers
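Rather than scanning the printout for the smallest error, which.min() picks the best row of the trials frame directly. The sketch below uses a toy stand-in for trials, filled with three of the error values printed above:

```r
# Toy stand-in for the trials data frame built in the loop above
trials <- data.frame(Trees = c(1, 35, 50),
                     Error = c(0.2437032, 0.1708645, 0.1722260))
best <- trials[which.min(trials$Error), ]   # row with the lowest error
best                                        # Trees = 35, Error = 0.1708645
```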
19. Can we improve this? (cont.)
We can notice that, generally, the error
decreases as the number of classifiers
increases.
library(ggplot2)
qplot(Trees, Error, data=trials, color=Trees, xlab="Number of Classifiers", ylab="Error", geom=c("point", "smooth"))
20. Can we improve even further?
• One of the recommendations for data used with Adaboost is that it should not contain outliers
• We did not check for outliers
• We can check if the data contains outliers and remove them to improve performance
•… and we do this for all other variables in the dataset
wine_data2 <- wine_data
outlier <- boxplot.stats(wine_data2$fixed.acidity)$out
wine_data2$fixed.acidity <- ifelse(wine_data2$fixed.acidity %in% outlier, NA, wine_data2$fixed.acidity)
boxplot.stats(wine_data2$fixed.acidity)
wine_data2 <- wine_data2[complete.cases(wine_data2),]
outlier <- boxplot.stats(wine_data2$volatile.acidity)$out
wine_data2$volatile.acidity <- ifelse(wine_data2$volatile.acidity %in% outlier, NA, wine_data2$volatile.acidity)
boxplot.stats(wine_data2$volatile.acidity)
# … continue for all other variables
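The per-column steps above can be collapsed into a loop over every column. This is a hedged sketch: the toy data frame stands in for wine_data, and on the real data the factor column quality_fac would need to be excluded from the loop:

```r
# Flag boxplot outliers as NA in each numeric column, then drop incomplete rows
df <- data.frame(a = c(1, 2, 3, 4, 5, 100),   # 100 lies beyond the whiskers
                 b = c(10, 11, 12, 13, 14, 15))
for (col in names(df)) {
  out <- boxplot.stats(df[[col]])$out          # values beyond the whiskers
  df[[col]] <- ifelse(df[[col]] %in% out, NA, df[[col]])
}
df <- df[complete.cases(df), ]                 # the row containing 100 is removed
nrow(df)                                       # 5
```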
21. Can we improve even further? (cont.)
• After removing all the NA values, we are left with 3971 data instances
• We originally had 4898 data instances
• We lost 927 data instances to outliers across all the variables
• Is that worth it?
dim(wine_data)
dim(wine_data2)
[1] 4898 12
[1] 3971 12
22. Can we improve even further? (cont.)
• We will check multiple number of classifiers again after removing outliers
set.seed(797)
training_size <- round(0.7 * dim(wine_data2)[1])
training_sample <- sample(dim(wine_data2)[1], training_size, replace=FALSE)
testing_sample <- -training_sample
training_data <- wine_data2[training_sample,]
testing_data <- wine_data2[-training_sample,]
testing <- wine_data2$quality_fac[testing_sample]
trials2 <- data.frame(Trees=numeric(0), Error=numeric(0))
for (num_of_trees in 1:50){
pred <- predict(adaboost(quality_fac~., training_data, num_of_trees), newdata=testing_data)
error <- pred$error
print(paste("Adaboost Error at", num_of_trees, ":", error))
trials2[num_of_trees, ] <- c(num_of_trees, error)
}
23. Can we improve even further? (cont.)
[1] "Adaboost Error at 1 : 0.250209907640638"
[1] "Adaboost Error at 2 : 0.300587741393787"
[1] "Adaboost Error at 3 : 0.237615449202351"
[1] "Adaboost Error at 4 : 0.26448362720403"
[1] "Adaboost Error at 5 : 0.216624685138539"
[1] "Adaboost Error at 6 : 0.219983207388749"
[1] "Adaboost Error at 7 : 0.203190596137699"
[1] "Adaboost Error at 8 : 0.222502099076406"
[1] "Adaboost Error at 9 : 0.194794290512175"
[1] "Adaboost Error at 10 : 0.198152812762385"
[1] "Adaboost Error at 11 : 0.19311502938707"
[1] "Adaboost Error at 12 : 0.199832073887489"
[1] "Adaboost Error at 13 : 0.19647355163728"
[1] "Adaboost Error at 14 : 0.194794290512175"
[1] "Adaboost Error at 15 : 0.184718723761545"
[1] "Adaboost Error at 16 : 0.185558354324097"
[1] "Adaboost Error at 17 : 0.174643157010915"
[1] "Adaboost Error at 18 : 0.18975650713686"
[1] "Adaboost Error at 19 : 0.174643157010915"
[1] "Adaboost Error at 20 : 0.185558354324097"
[1] "Adaboost Error at 21 : 0.17632241813602"
[1] "Adaboost Error at 22 : 0.182199832073887"
[1] "Adaboost Error at 23 : 0.174643157010915"
[1] "Adaboost Error at 24 : 0.182199832073887"
[1] "Adaboost Error at 25 : 0.174643157010915"
[1] "Adaboost Error at 26 : 0.17296389588581"
[1] "Adaboost Error at 27 : 0.180520570948783"
[1] "Adaboost Error at 28 : 0.174643157010915"
[1] "Adaboost Error at 29 : 0.17296389588581"
[1] "Adaboost Error at 30 : 0.183879093198992"
[1] "Adaboost Error at 31 : 0.174643157010915"
[1] "Adaboost Error at 32 : 0.178001679261125"
[1] "Adaboost Error at 33 : 0.170445004198153"
[1] "Adaboost Error at 34 : 0.168765743073048"
[1] "Adaboost Error at 35 : 0.164567590260285"
[1] "Adaboost Error at 36 : 0.1696053736356"
[1] "Adaboost Error at 37 : 0.162048698572628"
[1] "Adaboost Error at 38 : 0.168765743073048"
[1] "Adaboost Error at 39 : 0.167086481947943"
[1] "Adaboost Error at 40 : 0.161209068010076"
[1] "Adaboost Error at 41 : 0.1696053736356"
[1] "Adaboost Error at 42 : 0.165407220822838"
[1] "Adaboost Error at 43 : 0.171284634760705"
[1] "Adaboost Error at 44 : 0.174643157010915"
[1] "Adaboost Error at 45 : 0.167926112510495"
[1] "Adaboost Error at 46 : 0.171284634760705"
[1] "Adaboost Error at 47 : 0.1696053736356"
[1] "Adaboost Error at 48 : 0.170445004198153"
[1] "Adaboost Error at 49 : 0.16624685138539"
[1] "Adaboost Error at 50 : 0.168765743073048"
print(paste("The minimum is", min(trials2[,2])))
[1] "The minimum is 0.161209068010076"
We improved from a 17% error rate with outliers in the data to 16% when we removed the outliers (not a substantial improvement, but good enough)
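Pulling the error rates reported in this deck into one small frame makes the progression easy to see (the numbers are copied from the runs above):

```r
# Error rates from the four stages of this project
results <- data.frame(
  Model = c("Single C5.0 tree",
            "Adaboost, 15 classifiers",
            "Adaboost, 35 classifiers",
            "Adaboost, 40 classifiers (outliers removed)"),
  Error = c(0.2178, 0.1845, 0.1709, 0.1612))
results
```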
24. Can we improve even further? (cont.)
Here also we notice that, generally, the
error decreases as the number of
classifiers increases.
library(ggplot2)
qplot(Trees, Error, data=trials2, color=Trees, xlab="Number of Classifiers", ylab="Error", geom=c("point", "smooth"))
26. References
• Boosting and AdaBoost for Machine Learning
• What is AdaBoost? - Quora
• AdaBoost - Wikipedia
• Identify, describe, plot, and remove the outliers from the dataset
• Wine Quality Data Set - UCI Machine Learning Repository
• fastAdaboost - GitHub
• fastAdaboost - CRAN