Applied Machine Learning Final Project
Jim Nelson
December 5, 2016
OVERVIEW
Aim
Compare the performance of R machine learning packages implementing logistic regression and random forest
algorithms.
Datasets
Datasets used in this study were obtained from the UCI Machine Learning Dataset Repository. Two datasets
covering the red and white wine variants from the “Vinho Verde” region of northern Portugal were used. Each
dataset comprises 11 physicochemical attributes of each wine, such as acidity, residual sugar, and percent
alcohol (1). The red dataset includes 1599 different wines, while the white dataset has a total of 4898 wines.
Specific data about grape types, wine brand, wine selling price, etc., have been omitted. Each wine was given
a quality score that graded the wine on a scale from 0 (very bad) to 10 (excellent). Importantly, the score
was derived subjectively from a panel of at least 3 experts using blind taste tests; the final quality score is
given by the median of these evaluations (2). The binary class label “drinkit” was created for this analysis
based on a quality score of >= 7. The datasets were submitted on 10/7/2009 by Paulo Cortez, University of
Minho, Guimarães, Portugal.
Software and Computing Environment
The study was performed locally on an HP Pavilion 23 All-in-One with a 64-bit OS running Windows
10, with 4.0 GB of RAM and an AMD AP-5300 APU processor with Radeon HD graphics. All analyses were
performed in R version 3.2.4 (2016-03-10), “Very Secure Dishes” (platform: x86_64-w64-mingw32/x64
(64-bit); Copyright (C) 2015 The R Foundation for Statistical Computing), using the open source IDE
RStudio (version 0.98.1102, © 2009-2014 RStudio, Inc.).
The following R software packages, with corresponding manuals and vignettes, were obtained from The
Comprehensive R Archive Network (CRAN):
Data Manipulation and Visualization:
dplyr: A Grammar of Data Manipulation (v.0.4.3);
ggplot2: An Implementation of the Grammar of Graphics (v.1.01)
Logistic Regression Analysis:
glm2: Fitting Generalized Linear Models (v. 1.1.2)
Classification statistics and AUROC analysis:
caret: Classification and Regression Training (v.6.0-62); pROC: Display and Analyze ROC Curves (v. 1.8)
Random Forest Modeling:
randomForest: Breiman and Cutler’s Random Forests for Classification and Regression (v. 4.6-10)
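For reproducibility, the version information above can be captured directly from an R session; a minimal
sketch (illustrative only, not part of the original analysis):
R.version.string # prints the running R version
for (p in c("dplyr", "ggplot2", "glm2", "caret", "pROC", "randomForest"))
  print(paste(p, packageVersion(p))) # prints each package version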
DATA CURATION
Load Datasets and R Packages
Load packages for data transformation
library(dplyr)
library(caTools)
Load packages for data visualization
library(ggplot2)
library(fBasics)
Load ML packages
library(glm2)
library(caret)
library(pROC)
library(randomForest)
Load the datasets
red <- read.csv("winequality-red.csv", header = TRUE, sep = ";")
str(red)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
white <- read.csv("winequality-white.csv", header = TRUE, sep = ";")
str(white)
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
Create class labels
red <- red %>%
  mutate(drinkit = as.factor(as.numeric(quality >= 7)))
class(red$drinkit)
## [1] "factor"
white <- white %>%
  mutate(drinkit = as.factor(as.numeric(quality >= 7)))
class(white$drinkit)
## [1] "factor"
Create test and train datasets
set.seed(123)
sample <- sample.split(red$drinkit, SplitRatio = 0.70)
redtrain <- subset(red, sample == TRUE)
redtest <- subset(red, sample == FALSE)
sample <- sample.split(white$drinkit, SplitRatio = 0.70)
whitetrain <- subset(white, sample == TRUE)
whitetest <- subset(white, sample == FALSE)
dim(redtrain) #70% for training
## [1] 1119 13
dim(redtest) #30% for test
## [1] 480 13
dim(whitetrain)
## [1] 3429 13
dim(whitetest)
## [1] 1469 13
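Note that sample.split() stratifies on the supplied label, so the class balance of drinkit should be
preserved in both subsets. A quick check (illustrative sketch, not part of the original run):
prop.table(table(red$drinkit)) # class balance in the full red dataset
prop.table(table(redtrain$drinkit)) # training subset should closely match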
DATA EXPLORATION
Descriptive statistics
by(red[, 1:12], red$drinkit, basicStats)
## red$drinkit: 0
## fixed.acidity volatile.acidity citric.acid residual.sugar
## nobs 1382.000000 1382.000000 1382.000000 1382.000000
## NAs 0.000000 0.000000 0.000000 0.000000
## Minimum 4.600000 0.160000 0.000000 0.900000
## Maximum 15.900000 1.580000 1.000000 15.500000
## 1. Quartile 7.100000 0.420000 0.082500 1.900000
## 3. Quartile 9.100000 0.650000 0.400000 2.600000
## Mean 8.236831 0.547022 0.254407 2.512120
## Median 7.800000 0.540000 0.240000 2.200000
## Sum 11383.300000 755.985000 351.590000 3471.750000
## SE Mean 0.045265 0.004743 0.005102 0.038084
## LCL Mean 8.148036 0.537717 0.244398 2.437412
## UCL Mean 8.325626 0.556327 0.264415 2.586829
## Variance 2.831568 0.031095 0.035973 2.004428
## Stdev 1.682726 0.176337 0.189665 1.415778
## Skewness 1.071064 0.670310 0.422857 4.878595
## Kurtosis 1.337482 1.438348 -0.673833 32.003919
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## nobs 1382.000000 1382.000000 1382.000000
## NAs 0.000000 0.000000 0.000000
## Minimum 0.034000 1.000000 6.000000
## Maximum 0.611000 72.000000 165.000000
## 1. Quartile 0.071000 8.000000 23.000000
## 3. Quartile 0.091000 22.000000 65.000000
## Mean 0.089281 16.172214 48.285818
## Median 0.080000 14.000000 39.500000
## Sum 123.386000 22350.000000 66731.000000
## SE Mean 0.001321 0.281577 0.876540
## LCL Mean 0.086689 15.619850 46.566324
## UCL Mean 0.091872 16.724578 50.005312
## Variance 0.002412 109.572421 1061.821580
## Stdev 0.049113 10.467685 32.585604
## Skewness 5.547353 1.224203 1.110405
## Kurtosis 38.898772 2.056348 0.704994
## density pH sulphates alcohol quality
## nobs 1382.000000 1382.000000 1382.000000 1382.000000 1382.000000
## NAs 0.000000 0.000000 0.000000 0.000000 0.000000
## Minimum 0.990070 2.740000 0.330000 8.400000 3.000000
## Maximum 1.003690 4.010000 2.000000 14.900000 6.000000
## 1. Quartile 0.995785 3.210000 0.540000 9.500000 5.000000
## 3. Quartile 0.997900 3.410000 0.700000 10.900000 6.000000
## Mean 0.996859 3.314616 0.644754 10.251037 5.408828
## Median 0.996800 3.310000 0.600000 10.000000 5.000000
## Sum 1377.659370 4580.800000 891.050000 14166.933333 7475.000000
## SE Mean 0.000049 0.004146 0.004590 0.026084 0.016186
## LCL Mean 0.996764 3.306483 0.635750 10.199869 5.377076
## UCL Mean 0.996955 3.322750 0.653758 10.302205 5.440580
## Variance 0.000003 0.023758 0.029114 0.940248 0.362065
## Stdev 0.001808 0.154135 0.170629 0.969664 0.601719
## Skewness 0.117601 0.168576 2.774057 1.058300 -0.673203
## Kurtosis 1.091744 0.843706 13.810053 0.921817 0.546000
## --------------------------------------------------------
## red$drinkit: 1
## fixed.acidity volatile.acidity citric.acid residual.sugar
## nobs 217.000000 217.000000 217.000000 217.000000
## NAs 0.000000 0.000000 0.000000 0.000000
## Minimum 4.900000 0.120000 0.000000 1.200000
## Maximum 15.600000 0.915000 0.760000 8.900000
## 1. Quartile 7.400000 0.300000 0.300000 2.000000
## 3. Quartile 10.100000 0.490000 0.490000 2.700000
## Mean 8.847005 0.405530 0.376498 2.708756
## Median 8.700000 0.370000 0.400000 2.300000
## Sum 1919.800000 88.000000 81.700000 587.800000
## SE Mean 0.135767 0.009841 0.013199 0.092528
## LCL Mean 8.579406 0.386134 0.350482 2.526382
## UCL Mean 9.114603 0.424926 0.402514 2.891130
## Variance 3.999910 0.021014 0.037806 1.857840
## Stdev 1.999977 0.144963 0.194438 1.363026
## Skewness 0.460276 0.987628 -0.373539 2.173338
## Kurtosis 0.313564 0.884353 -0.475447 4.660784
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## nobs 217.000000 217.000000 217.000000 217.000000
## NAs 0.000000 0.000000 0.000000 0.000000
## Minimum 0.012000 3.000000 7.000000 0.990640
## Maximum 0.358000 54.000000 289.000000 1.003200
## 1. Quartile 0.062000 6.000000 17.000000 0.994700
## 3. Quartile 0.085000 18.000000 43.000000 0.997350
## Mean 0.075912 13.981567 34.889401 0.996030
## Median 0.073000 11.000000 27.000000 0.995720
## Sum 16.473000 3034.000000 7571.000000 216.138570
## SE Mean 0.001933 0.694771 2.211148 0.000149
## LCL Mean 0.072102 12.612168 30.531213 0.995736
## UCL Mean 0.079723 15.350966 39.247589 0.996325
## Variance 0.000811 104.747344 1060.950674 0.000005
## Stdev 0.028480 10.234615 32.572238 0.002201
## Skewness 5.035644 1.456370 4.439771 0.262062
## Kurtosis 44.818747 1.889996 29.222588 0.292770
## pH sulphates alcohol quality
## nobs 217.000000 217.000000 217.000000 217.000000
## NAs 0.000000 0.000000 0.000000 0.000000
## Minimum 2.880000 0.390000 9.200000 7.000000
## Maximum 3.780000 1.360000 14.000000 8.000000
## 1. Quartile 3.200000 0.650000 10.800000 7.000000
## 3. Quartile 3.380000 0.820000 12.200000 7.000000
## Mean 3.288802 0.743456 11.518049 7.082949
## Median 3.270000 0.740000 11.600000 7.000000
## Sum 713.670000 161.330000 2499.416667 1537.000000
## SE Mean 0.010487 0.009099 0.067759 0.018766
## LCL Mean 3.268133 0.725522 11.384496 7.045961
## UCL Mean 3.309471 0.761391 11.651603 7.119938
## Variance 0.023863 0.017966 0.996310 0.076421
## Stdev 0.154478 0.134038 0.998153 0.276443
## Skewness 0.358697 0.620835 0.065494 3.003356
## Kurtosis 0.608113 1.964857 -0.430136 7.052712
by(white[, 1:12], white$drinkit, basicStats)
## white$drinkit: 0
## fixed.acidity volatile.acidity citric.acid residual.sugar
## nobs 3838.000000 3838.000000 3838.000000 3838.000000
## NAs 0.000000 0.000000 0.000000 0.000000
## Minimum 3.800000 0.080000 0.000000 0.600000
## Maximum 14.200000 1.100000 1.660000 65.800000
## 1. Quartile 6.300000 0.220000 0.260000 1.700000
## 3. Quartile 7.400000 0.320000 0.400000 10.400000
## Mean 6.890594 0.281802 0.336438 6.703478
## Median 6.800000 0.270000 0.320000 6.000000
## Sum 26446.100000 1081.555000 1291.250000 25727.950000
## SE Mean 0.013884 0.001651 0.002098 0.084341
## LCL Mean 6.863374 0.278564 0.332325 6.538121
## UCL Mean 6.917814 0.285039 0.340551 6.868835
## Variance 0.739786 0.010464 0.016889 27.301127
## Stdev 0.860108 0.102293 0.129959 5.225048
## Skewness 0.752179 1.720644 1.242293 1.035464
## Kurtosis 2.339990 5.737845 5.472595 3.755082
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## nobs 3838.000000 3.838000e+03 3.838000e+03
## NAs 0.000000 0.000000e+00 0.000000e+00
## Minimum 0.009000 2.000000e+00 9.000000e+00
## Maximum 0.346000 2.890000e+02 4.400000e+02
## 1. Quartile 0.037000 2.300000e+01 1.110000e+02
## 3. Quartile 0.051000 4.700000e+01 1.730000e+02
## Mean 0.047875 3.551733e+01 1.419829e+02
## Median 0.045000 3.400000e+01 1.400000e+02
## Sum 183.743000 1.363155e+05 5.449305e+05
## SE Mean 0.000380 2.871250e-01 7.125790e-01
## LCL Mean 0.047129 3.495439e+01 1.405859e+02
## UCL Mean 0.048620 3.608026e+01 1.433800e+02
## Variance 0.000554 3.164067e+02 1.948817e+03
## Stdev 0.023548 1.778783e+01 4.414540e+01
## Skewness 4.851869 1.423264e+00 2.830020e-01
## Kurtosis 33.327716 1.178770e+01 4.934680e-01
## density pH sulphates alcohol quality
## nobs 3838.000000 3838.000000 3838.000000 3838.000000 3838.000000
## NAs 0.000000 0.000000 0.000000 0.000000 0.000000
## Minimum 0.987220 2.720000 0.230000 8.000000 3.000000
## Maximum 1.038980 3.810000 1.060000 14.000000 6.000000
## 1. Quartile 0.992320 3.080000 0.410000 9.400000 5.000000
## 3. Quartile 0.996570 3.267500 0.540000 11.000000 6.000000
## Mean 0.994474 3.180847 0.487004 10.265215 5.519802
## Median 0.994380 3.170000 0.470000 10.000000 6.000000
## Sum 3816.789390 12208.090000 1869.120000 39397.896667 21185.000000
## SE Mean 0.000047 0.002396 0.001746 0.017765 0.009764
## LCL Mean 0.994382 3.176150 0.483580 10.230385 5.500659
## UCL Mean 0.994565 3.185544 0.490427 10.300045 5.538945
## Variance 0.000008 0.022027 0.011700 1.211267 0.365910
## Stdev 0.002894 0.148414 0.108167 1.100576 0.604905
## Skewness 1.139004 0.518051 0.939134 0.690666 -1.004627
## Kurtosis 14.002972 0.817759 1.572179 -0.281440 0.695824
## --------------------------------------------------------
## white$drinkit: 1
## fixed.acidity volatile.acidity citric.acid residual.sugar
## nobs 1060.000000 1060.000000 1060.000000 1060.000000
## NAs 0.000000 0.000000 0.000000 0.000000
## Minimum 3.900000 0.080000 0.010000 0.800000
## Maximum 9.200000 0.760000 0.740000 19.250000
## 1. Quartile 6.200000 0.190000 0.280000 1.800000
## 3. Quartile 7.200000 0.320000 0.360000 7.400000
## Mean 6.725142 0.265349 0.326057 5.261509
## Median 6.700000 0.250000 0.310000 3.875000
## Sum 7128.650000 281.270000 345.620000 5577.200000
## SE Mean 0.023613 0.002890 0.002466 0.131792
## LCL Mean 6.678807 0.259678 0.321218 5.002906
## UCL Mean 6.771476 0.271020 0.330895 5.520113
## Variance 0.591050 0.008854 0.006446 18.411355
## Stdev 0.768798 0.094097 0.080288 4.290845
## Skewness 0.019010 0.874201 0.705643 1.080500
## Kurtosis 0.437095 1.098804 3.030720 0.098557
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## nobs 1060.000000 1060.000000 1.060000e+03
## NAs 0.000000 0.000000 0.000000e+00
## Minimum 0.012000 5.000000 3.400000e+01
## Maximum 0.135000 108.000000 2.290000e+02
## 1. Quartile 0.031000 25.000000 1.010000e+02
## 3. Quartile 0.044000 42.000000 1.460000e+02
## Mean 0.038160 34.550472 1.252453e+02
## Median 0.037000 33.000000 1.220000e+02
## Sum 40.450000 36623.500000 1.327600e+05
## SE Mean 0.000342 0.423776 1.005136e+00
## LCL Mean 0.037489 33.718936 1.232730e+02
## UCL Mean 0.038832 35.382008 1.272176e+02
## Variance 0.000124 190.361237 1.070916e+03
## Stdev 0.011145 13.797146 3.272485e+01
## Skewness 2.258097 1.015169 5.094610e-01
## Kurtosis 14.741475 2.729507 2.420860e-01
## density pH sulphates alcohol quality
## nobs 1060.000000 1060.000000 1060.000000 1060.000000 1060.000000
## NAs 0.000000 0.000000 0.000000 0.000000 0.000000
## Minimum 0.987110 2.840000 0.220000 8.500000 7.000000
## Maximum 1.000600 3.820000 1.080000 14.200000 9.000000
## 1. Quartile 0.990500 3.100000 0.400000 10.700000 7.000000
## 3. Quartile 0.993605 3.320000 0.580000 12.400000 7.000000
## Mean 0.992412 3.215132 0.500142 11.416022 7.174528
## Median 0.991730 3.200000 0.480000 11.500000 7.000000
## Sum 1051.956700 3408.040000 530.150000 12100.983333 7605.000000
## SE Mean 0.000085 0.004828 0.004086 0.038553 0.012040
## LCL Mean 0.992245 3.205659 0.492123 11.340372 7.150904
## UCL Mean 0.992579 3.224605 0.508160 11.491672 7.198152
## Variance 0.000008 0.024707 0.017701 1.575551 0.153647
## Stdev 0.002772 0.157185 0.133044 1.255209 0.391978
## Skewness 1.001822 0.235375 0.942026 -0.404076 1.945040
## Kurtosis 0.406561 -0.186210 1.063202 -0.522122 2.498491
Boxplots
attribute = fixed.acidity
bp_fixed.acidity <- ggplot(red, aes(x=drinkit, y=fixed.acidity ))
bp_fixed.acidity + geom_boxplot()
[Boxplot: fixed.acidity by drinkit (red wine dataset)]
bp_fixed.acidity <- ggplot(white, aes(x=drinkit, y=fixed.acidity ))
bp_fixed.acidity + geom_boxplot()
[Boxplot: fixed.acidity by drinkit (white wine dataset)]
attribute = volatile.acidity
bp_volatile.acidity <- ggplot(red, aes(x=drinkit, y=volatile.acidity ))
bp_volatile.acidity + geom_boxplot()
[Boxplot: volatile.acidity by drinkit (red wine dataset)]
bp_volatile.acidity <- ggplot(white, aes(x=drinkit, y=volatile.acidity ))
bp_volatile.acidity + geom_boxplot()
[Boxplot: volatile.acidity by drinkit (white wine dataset)]
attribute = citric.acid
bp_citric.acid <- ggplot(red, aes(x=drinkit, y=citric.acid ))
bp_citric.acid + geom_boxplot()
[Boxplot: citric.acid by drinkit (red wine dataset)]
bp_citric.acid <- ggplot(white, aes(x=drinkit, y=citric.acid ))
bp_citric.acid + geom_boxplot()
[Boxplot: citric.acid by drinkit (white wine dataset)]
attribute = residual.sugar
bp_residual.sugar <- ggplot(red, aes(x=drinkit, y= residual.sugar ))
bp_residual.sugar + geom_boxplot()
[Boxplot: residual.sugar by drinkit (red wine dataset)]
bp_residual.sugar <- ggplot(white, aes(x=drinkit, y= residual.sugar ))
bp_residual.sugar + geom_boxplot()
[Boxplot: residual.sugar by drinkit (white wine dataset)]
attribute = chlorides
bp_chlorides <- ggplot(red, aes(x=drinkit, y= chlorides ))
bp_chlorides + geom_boxplot()
[Boxplot: chlorides by drinkit (red wine dataset)]
bp_chlorides <- ggplot(white, aes(x=drinkit, y= chlorides ))
bp_chlorides + geom_boxplot()
[Boxplot: chlorides by drinkit (white wine dataset)]
attribute = free.sulfur.dioxide
bp_free.sulfur.dioxide <- ggplot(red, aes(x=drinkit, y= free.sulfur.dioxide ))
bp_free.sulfur.dioxide + geom_boxplot()
[Boxplot: free.sulfur.dioxide by drinkit (red wine dataset)]
bp_free.sulfur.dioxide <- ggplot(white, aes(x=drinkit, y= free.sulfur.dioxide ))
bp_free.sulfur.dioxide + geom_boxplot()
[Boxplot: free.sulfur.dioxide by drinkit (white wine dataset)]
attribute = total.sulfur.dioxide
bp_total.sulfur.dioxide <- ggplot(red, aes(x=drinkit, y= total.sulfur.dioxide ))
bp_total.sulfur.dioxide + geom_boxplot()
[Boxplot: total.sulfur.dioxide by drinkit (red wine dataset)]
bp_total.sulfur.dioxide <- ggplot(white, aes(x=drinkit, y= total.sulfur.dioxide ))
bp_total.sulfur.dioxide + geom_boxplot()
[Boxplot: total.sulfur.dioxide by drinkit (white wine dataset)]
attribute = sulphates
bp_sulphates <- ggplot(red, aes(x=drinkit, y= sulphates ))
bp_sulphates + geom_boxplot()
[Boxplot: sulphates by drinkit (red wine dataset)]
bp_sulphates <- ggplot(white, aes(x=drinkit, y= sulphates ))
bp_sulphates + geom_boxplot()
[Boxplot: sulphates by drinkit (white wine dataset)]
attribute = alcohol
bp_alcohol <- ggplot(red, aes(x=drinkit, y= alcohol ))
bp_alcohol + geom_boxplot()
[Boxplot: alcohol by drinkit (red wine dataset)]
bp_alcohol <- ggplot(white, aes(x=drinkit, y= alcohol ))
bp_alcohol + geom_boxplot()
[Boxplot: alcohol by drinkit (white wine dataset)]
PREDICTIVE MODELING
Logistic regression - red wine dataset
Model 1
redmodel1 <- glm(drinkit ~ fixed.acidity+volatile.acidity+citric.acid+residual.sugar+
total.sulfur.dioxide+chlorides+alcohol, family="binomial", data= redtrain)
summary(redmodel1)
##
## Call:
## glm(formula = drinkit ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + total.sulfur.dioxide + chlorides + alcohol,
## family = "binomial", data = redtrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6103 -0.4541 -0.2492 -0.1403 2.6757
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.12828 1.51262 -6.696 2.14e-11 ***
## fixed.acidity 0.12368 0.08316 1.487 0.13695
## volatile.acidity -4.49437 0.91316 -4.922 8.58e-07 ***
## citric.acid 0.31682 0.97102 0.326 0.74422
## residual.sugar 0.10574 0.06983 1.514 0.12996
## total.sulfur.dioxide -0.01329 0.00419 -3.172 0.00151 **
## chlorides -6.34684 4.16619 -1.523 0.12765
## alcohol 0.91992 0.10404 8.842 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 889.23 on 1118 degrees of freedom
## Residual deviance: 634.16 on 1111 degrees of freedom
## AIC: 650.16
##
## Number of Fisher Scoring iterations: 6
exp(cbind(OR = coef(redmodel1), confint(redmodel1))) # odds ratios and 95% CI
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 3.993409e-05 1.934350e-06 0.0007319036
## fixed.acidity 1.131655e+00 9.622602e-01 1.3339255345
## volatile.acidity 1.117174e-02 1.792495e-03 0.0642802846
## citric.acid 1.372752e+00 2.016707e-01 9.1164704929
## residual.sugar 1.111529e+00 9.597369e-01 1.2685734018
## total.sulfur.dioxide 9.867968e-01 9.783435e-01 0.9945651864
## chlorides 1.752269e-03 2.005370e-07 2.3287091038
## alcohol 2.509101e+00 2.055529e+00 3.0926740162
Model 1 performance
redtrain$drinkitYhat <- predict(redmodel1, type = "response") # generate yhat values on train df
redtrain$drinkitYhat <- ifelse(redtrain$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix( redtrain$drinkitYhat, redtrain$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 932 110
## 1 35 42
##
## Accuracy : 0.8704
## 95% CI : (0.8493, 0.8896)
## No Information Rate : 0.8642
## P-Value [Acc > NIR] : 0.2878
##
## Kappa : 0.3032
## Mcnemar's Test P-Value : 7.978e-10
##
## Sensitivity : 0.9638
## Specificity : 0.2763
## Pos Pred Value : 0.8944
## Neg Pred Value : 0.5455
## Prevalence : 0.8642
## Detection Rate : 0.8329
## Detection Prevalence : 0.9312
## Balanced Accuracy : 0.6201
##
## 'Positive' Class : 0
##
auc(roc(redtrain$drinkit, redtrain$drinkitYhat), levels = levels(as.factor(redtrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.6201
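Note that drinkitYhat was thresholded to 0/1 before the call above, so the reported AUC is that of a single
operating point and equals the balanced accuracy (0.6201). A minimal sketch of the conventional AUROC
computed from the raw predicted probabilities instead (illustrative only, not part of the original run):
probs <- predict(redmodel1, type = "response") # raw probabilities, no 0.5 cutoff
auc(roc(redtrain$drinkit, probs)) # AUROC across all thresholds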
Model 2
redmodel2 <- glm(drinkit ~ volatile.acidity+total.sulfur.dioxide+alcohol, family="binomial",
data= redtrain)
summary(redmodel2)
##
## Call:
## glm(formula = drinkit ~ volatile.acidity + total.sulfur.dioxide +
## alcohol, family = "binomial", data = redtrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0665 -0.4559 -0.2634 -0.1452 2.6374
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.90155 1.12801 -7.891 2.99e-15 ***
## volatile.acidity -5.21126 0.71903 -7.248 4.24e-13 ***
## total.sulfur.dioxide -0.01379 0.00409 -3.371 0.000749 ***
## alcohol 0.92389 0.09549 9.675 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 889.23 on 1118 degrees of freedom
## Residual deviance: 644.62 on 1115 degrees of freedom
## AIC: 652.62
##
## Number of Fisher Scoring iterations: 6
exp(cbind(OR = coef(redmodel2), confint(redmodel2))) # odds ratios and 95% CI
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.0001361775 1.429867e-05 0.001198158
## volatile.acidity 0.0054547915 1.276332e-03 0.021452206
## total.sulfur.dioxide 0.9863066898 9.780499e-01 0.993846316
## alcohol 2.5190775373 2.097376e+00 3.051543824
Model 2 performance
redtrain$drinkitYhat <- predict(redmodel2, type = "response") # generate yhat values on train df
redtrain$drinkitYhat <- ifelse(redtrain$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix( redtrain$drinkitYhat, redtrain$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 934 112
## 1 33 40
##
## Accuracy : 0.8704
## 95% CI : (0.8493, 0.8896)
## No Information Rate : 0.8642
## P-Value [Acc > NIR] : 0.2878
##
## Kappa : 0.2933
## Mcnemar's Test P-Value : 9.323e-11
##
## Sensitivity : 0.9659
## Specificity : 0.2632
## Pos Pred Value : 0.8929
## Neg Pred Value : 0.5479
## Prevalence : 0.8642
## Detection Rate : 0.8347
## Detection Prevalence : 0.9348
## Balanced Accuracy : 0.6145
##
## 'Positive' Class : 0
##
auc(roc(redtrain$drinkit, redtrain$drinkitYhat), levels = levels(as.factor(redtrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.6145
Model 3
redmodel3 <- glm(drinkit ~ alcohol, family="binomial", data= redtrain)
summary(redmodel3)
##
## Call:
## glm(formula = drinkit ~ alcohol, family = "binomial", data = redtrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2384 -0.5138 -0.3279 -0.2540 2.6650
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -13.11263 1.00437 -13.06 <2e-16 ***
## alcohol 1.04246 0.08978 11.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 889.23 on 1118 degrees of freedom
## Residual deviance: 724.88 on 1117 degrees of freedom
## AIC: 728.88
##
## Number of Fisher Scoring iterations: 5
exp(cbind(OR = coef(redmodel3), confint(redmodel3))) # odds ratios and 95% CI
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 2.019563e-06 2.659407e-07 1.371582e-05
## alcohol 2.836199e+00 2.388290e+00 3.397552e+00
Model 3 performance
redtrain$drinkitYhat <- predict(redmodel3, type = "response") # generate yhat values on train df
redtrain$drinkitYhat <- ifelse(redtrain$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix( redtrain$drinkitYhat, redtrain$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 943 131
## 1 24 21
##
## Accuracy : 0.8615
## 95% CI : (0.8398, 0.8812)
## No Information Rate : 0.8642
## P-Value [Acc > NIR] : 0.6236
##
## Kappa : 0.1611
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9752
## Specificity : 0.1382
## Pos Pred Value : 0.8780
## Neg Pred Value : 0.4667
## Prevalence : 0.8642
## Detection Rate : 0.8427
## Detection Prevalence : 0.9598
## Balanced Accuracy : 0.5567
##
## 'Positive' Class : 0
##
auc(roc(redtrain$drinkit, redtrain$drinkitYhat), levels = levels(as.factor(redtrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.5567
Use best model (model 2) for testset prediction
redtest$drinkitYhat <- predict(redmodel2, newdata = redtest, type = "response") # predict values on test df
redtest$drinkitYhat <- ifelse(redtest$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix(redtest$drinkitYhat, redtest$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 402 49
## 1 13 16
##
## Accuracy : 0.8708
## 95% CI : (0.8375, 0.8995)
## No Information Rate : 0.8646
## P-Value [Acc > NIR] : 0.3749
##
## Kappa : 0.2803
## Mcnemar's Test P-Value : 8.789e-06
##
## Sensitivity : 0.9687
## Specificity : 0.2462
## Pos Pred Value : 0.8914
## Neg Pred Value : 0.5517
## Prevalence : 0.8646
## Detection Rate : 0.8375
## Detection Prevalence : 0.9396
## Balanced Accuracy : 0.6074
##
## 'Positive' Class : 0
##
auc(roc(redtest$drinkit, redtest$drinkitYhat), levels = levels(as.factor(redtrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.6074
Logistic regression - white wine dataset
Model 1
whitemodel1 <- glm(drinkit ~ fixed.acidity+volatile.acidity+citric.acid+residual.sugar+
total.sulfur.dioxide+chlorides+alcohol, family="binomial", data= whitetrain)
summary(whitemodel1)
##
## Call:
## glm(formula = drinkit ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + total.sulfur.dioxide + chlorides + alcohol,
## family = "binomial", data = whitetrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9694 -0.6635 -0.4286 -0.1833 2.8909
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.906528 0.805489 -11.057 < 2e-16 ***
## fixed.acidity -0.039714 0.058123 -0.683 0.494
## volatile.acidity -4.905984 0.578866 -8.475 < 2e-16 ***
## citric.acid -0.717323 0.460893 -1.556 0.120
## residual.sugar 0.047718 0.011470 4.160 3.18e-05 ***
## total.sulfur.dioxide 0.001578 0.001356 1.164 0.245
## chlorides -18.758762 4.513067 -4.157 3.23e-05 ***
## alcohol 0.899407 0.052430 17.154 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3581.9 on 3428 degrees of freedom
## Residual deviance: 2942.3 on 3421 degrees of freedom
## AIC: 2958.3
##
## Number of Fisher Scoring iterations: 6
exp(cbind(OR = coef(whitemodel1), confint(whitemodel1))) # odds ratios and 95% CI
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 1.355014e-04 2.768784e-05 6.517844e-04
## fixed.acidity 9.610643e-01 8.571076e-01 1.076523e+00
## volatile.acidity 7.402156e-03 2.339702e-03 2.263315e-02
## citric.acid 4.880570e-01 1.948869e-01 1.193235e+00
## residual.sugar 1.048875e+00 1.025407e+00 1.072573e+00
## total.sulfur.dioxide 1.001579e+00 9.989146e-01 1.004239e+00
## chlorides 7.131371e-09 7.522433e-13 3.505045e-05
## alcohol 2.458146e+00 2.220357e+00 2.727226e+00
Model 1 performance
whitetrain$drinkitYhat <- predict(whitemodel1, type = "response") # generate yhat values on train df
whitetrain$drinkitYhat <- ifelse(whitetrain$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix( whitetrain$drinkitYhat, whitetrain$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2532 565
## 1 155 177
##
## Accuracy : 0.79
## 95% CI : (0.776, 0.8036)
## No Information Rate : 0.7836
## P-Value [Acc > NIR] : 0.1865
##
## Kappa : 0.2261
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9423
## Specificity : 0.2385
## Pos Pred Value : 0.8176
## Neg Pred Value : 0.5331
## Prevalence : 0.7836
## Detection Rate : 0.7384
## Detection Prevalence : 0.9032
## Balanced Accuracy : 0.5904
##
## 'Positive' Class : 0
##
auc(roc(whitetrain$drinkit, whitetrain$drinkitYhat), levels = levels(as.factor(whitetrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.5904
Model 2
whitemodel2 <- glm(drinkit ~ volatile.acidity+residual.sugar+chlorides+alcohol, family="binomial",
data= whitetrain)
summary(whitemodel2)
##
## Call:
## glm(formula = drinkit ~ volatile.acidity + residual.sugar + chlorides +
## alcohol, family = "binomial", data = whitetrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9024 -0.6594 -0.4260 -0.1902 2.8269
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.09922 0.63146 -14.410 < 2e-16 ***
## volatile.acidity -4.68741 0.56405 -8.310 < 2e-16 ***
## residual.sugar 0.04845 0.01115 4.345 1.39e-05 ***
## chlorides -18.36325 4.41268 -4.161 3.16e-05 ***
## alcohol 0.88237 0.04981 17.714 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3581.9 on 3428 degrees of freedom
## Residual deviance: 2947.1 on 3424 degrees of freedom
## AIC: 2957.1
##
## Number of Fisher Scoring iterations: 5
exp(cbind(OR = coef(whitemodel2), confint(whitemodel2))) # odds ratios and 95% CI
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 1.117534e-04 3.232733e-05 3.849781e-04
## volatile.acidity 9.210485e-03 2.998984e-03 2.737557e-02
## residual.sugar 1.049642e+00 1.026810e+00 1.072716e+00
## chlorides 1.059114e-08 1.369508e-12 4.332245e-05
## alcohol 2.416612e+00 2.193714e+00 2.667033e+00
Model 2 performance
whitetrain$drinkitYhat <- predict(whitemodel2, type = "response") # generate yhat values on train df
whitetrain$drinkitYhat <- ifelse(whitetrain$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix( whitetrain$drinkitYhat, whitetrain$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2536 564
## 1 151 178
##
## Accuracy : 0.7915
## 95% CI : (0.7775, 0.805)
## No Information Rate : 0.7836
## P-Value [Acc > NIR] : 0.1357
##
## Kappa : 0.23
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9438
## Specificity : 0.2399
## Pos Pred Value : 0.8181
## Neg Pred Value : 0.5410
## Prevalence : 0.7836
## Detection Rate : 0.7396
## Detection Prevalence : 0.9041
## Balanced Accuracy : 0.5918
##
## 'Positive' Class : 0
##
auc(roc(whitetrain$drinkit, whitetrain$drinkitYhat), levels = levels(as.factor(whitetrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.5918
Use best model (model 2) for testset prediction
whitetest$drinkitYhat <- predict(whitemodel2, newdata = whitetest, type = "response")
whitetest$drinkitYhat <- ifelse(whitetest$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix(whitetest$drinkitYhat, whitetest$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1081 242
## 1 70 76
##
## Accuracy : 0.7876
## 95% CI : (0.7658, 0.8083)
## No Information Rate : 0.7835
## P-Value [Acc > NIR] : 0.3657
##
## Kappa : 0.2215
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9392
## Specificity : 0.2390
## Pos Pred Value : 0.8171
## Neg Pred Value : 0.5205
## Prevalence : 0.7835
## Detection Rate : 0.7359
## Detection Prevalence : 0.9006
## Balanced Accuracy : 0.5891
##
## 'Positive' Class : 0
##
auc(roc(whitetest$drinkit, whitetest$drinkitYhat), levels = levels(as.factor(whitetrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.5891
36
Random forest algorithm - red wine dataset
set.seed(77)
rf1 <- randomForest(drinkit ~ volatile.acidity+total.sulfur.dioxide+alcohol, type = classification,
data=redtrain, ntree = 1000, importance = TRUE,confusion = TRUE)
round(importance(rf1), 1)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## volatile.acidity 18.7 54.4 44.3 81.8
## total.sulfur.dioxide 10.6 37.5 28.9 70.7
## alcohol 24.6 73.7 59.5 89.5
print(rf1)
##
## Call:
## randomForest(formula = drinkit ~ volatile.acidity + total.sulfur.dioxide + alcohol, data = redt
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 12.87%
## Confusion matrix:
## 0 1 class.error
## 0 918 49 0.05067218
## 1 95 57 0.62500000
Random forest using only alcohol attribute
set.seed(77)
rf2 <- randomForest(drinkit ~ alcohol, type = classification, data=redtrain, ntree = 1000,
importance = TRUE,confusion = TRUE)
round(importance(rf2), 1)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## alcohol 67.5 66.4 67.6 72.2
print(rf2)
##
## Call:
## randomForest(formula = drinkit ~ alcohol, data = redtrain, type = classification, ntree = 1000,
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 15.91%
## Confusion matrix:
## 0 1 class.error
## 0 922 45 0.04653568
## 1 133 19 0.87500000
Testset prediction
rf1predict <-predict(rf1, redtest, type="response")
Model performance on testset
confusionMatrix( rf1predict, redtest$drinkit ) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 397 27
## 1 18 38
##
## Accuracy : 0.9062
## 95% CI : (0.8766, 0.9308)
## No Information Rate : 0.8646
## P-Value [Acc > NIR] : 0.003355
##
## Kappa : 0.5748
## Mcnemar's Test P-Value : 0.233038
##
## Sensitivity : 0.9566
## Specificity : 0.5846
## Pos Pred Value : 0.9363
## Neg Pred Value : 0.6786
## Prevalence : 0.8646
## Detection Rate : 0.8271
## Detection Prevalence : 0.8833
## Balanced Accuracy : 0.7706
##
## 'Positive' Class : 0
##
rf1predict<- as.numeric(rf1predict)
auc(roc(redtest$drinkit, rf1predict )) # calculate AUROC curve
## Area under the curve: 0.7706
Random forest algorithm - white wine dataset
set.seed(77)
rf3 <- randomForest(drinkit ~ volatile.acidity+residual.sugar+chlorides+alcohol, type = classification,
data=whitetrain, ntree = 1000, importance = TRUE,confusion = TRUE)
round(importance(rf3), 1)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## volatile.acidity 45.4 128.0 113.4 248.4
## residual.sugar 48.0 105.0 110.3 295.5
## chlorides 32.6 116.0 101.8 251.7
## alcohol 68.0 226.6 179.2 359.4
print(rf3)
##
## Call:
## randomForest(formula = drinkit ~ volatile.acidity + residual.sugar + chlorides + alcohol, data
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 14.44%
## Confusion matrix:
## 0 1 class.error
## 0 2521 166 0.06177894
## 1 329 413 0.44339623
Testset prediction
rf3predict <-predict(rf3, whitetest, type="response")
Model performance on testset
confusionMatrix( rf3predict, whitetest$drinkit ) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1075 140
## 1 76 178
##
## Accuracy : 0.853
## 95% CI : (0.8338, 0.8707)
## No Information Rate : 0.7835
## P-Value [Acc > NIR] : 9.184e-12
##
## Kappa : 0.5325
## Mcnemar's Test P-Value : 1.814e-05
##
## Sensitivity : 0.9340
## Specificity : 0.5597
## Pos Pred Value : 0.8848
## Neg Pred Value : 0.7008
## Prevalence : 0.7835
## Detection Rate : 0.7318
## Detection Prevalence : 0.8271
## Balanced Accuracy : 0.7469
##
## 'Positive' Class : 0
##
rf3predict<- as.numeric(rf3predict)
auc(roc(whitetest$drinkit, rf3predict )) # calculate AUROC curve
## Area under the curve: 0.7469
RESULTS SUMMARY
Descriptive statistics and visualization of the data were employed to select the variables to include in the
predictive models of high quality wines. Several models were compared using multivariate logistic
regression. The best model (model 2) fit on the red wine training dataset included the variables
volatile.acidity, total.sulfur.dioxide, and alcohol. Results of the confusion matrix and AUROC calculation
on the training set were as follows (see Model 2 performance above):
Please note that after checking the raw data I found that, although the confusion matrices reported
in the caret package output above are correct, the Sensitivity and Specificity labels are inverted,
because caret treated 0 (undrinkable) as the ‘Positive’ class. The correct values are shown below,
followed by a short sketch of how to recompute them directly.
1. Accuracy: 0.8704
2. Sensitivity (TPR): 0.2632
3. Specificity (TNR): 0.9659
4. FPR (1 - Specificity): 0.0341
5. Area under the curve: 0.6145
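A minimal sketch of how the correctly oriented values can be recovered directly in caret, assuming
redtrain$drinkitYhat still holds the Model 2 training predictions from the Model 2 performance section
(the positive argument tells confusionMatrix which factor level is the event of interest):
# Recompute with "1" (drinkable) as the positive class, so the
# Sensitivity and Specificity labels are no longer inverted
confusionMatrix(factor(redtrain$drinkitYhat, levels = c("0", "1")),
                redtrain$drinkit, positive = "1")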
Results of this model on the red wine test set were:
1. Accuracy: 0.8708
2. Sensitivity: 0.2462
3. Specificity: 0.9687
4. FPR (1 - Specificity): 0.0313
5. Area under the curve: 0.6074
The best white wine logit model included the variables volatile.acidity, residual.sugar, chlorides, and alcohol.
Results of this model on the white wine training set were (see Model 2 performance above):
1. Accuracy: 0.7915
2. Sensitivity: 0.2399
3. Specificity: 0.9438
4. FPR (1 - Specificity): 0.0562
5. Area under the curve: 0.5918
Results of this model on the white wine test set were:
1. Accuracy: 0.7876
2. Sensitivity: 0.2390
3. Specificity: 0.9392
4. FPR (1 - Specificity): 0.0608
5. Area under the curve: 0.5891
For direct comparison of the logit and random forest algorithms, the best models defined using logistic
regression were then re-fit as random forests using 1000 trees. Red wine training set (out-of-bag) results were:
1. Accuracy: 0.8713
2. Sensitivity: 0.5377
3. Specificity: 0.9062
4. FPR (1 - Specificity): 0.0938
5. Area under the curve: N/A using the randomForest package
Results of this model on the red wine test set were:
1. Accuracy: 0.9062
2. Sensitivity: 0.5846
3. Specificity: 0.9566
4. FPR (1 - Specificity): 0.0434
5. Area under the curve: 0.7706
White wine training set (out-of-bag) results were:
1. Accuracy: 0.8556
2. Sensitivity: 0.7133
3. Specificity: 0.8846
4. FPR (1 - Specificity): 0.1154
5. Area under the curve: N/A using the randomForest package
Results of this model on the white wine test set were:
1. Accuracy: 0.8530
2. Sensitivity: 0.5597
3. Specificity: 0.9340
4. FPR (1 - Specificity): 0.0660
5. Area under the curve: 0.7469
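The “N/A” entries above reflect that the randomForest package does not itself report an AUROC. A training
AUROC could nevertheless be derived from the out-of-bag class-vote proportions together with pROC; a
minimal sketch for the red wine model (illustrative only, not run in this analysis):
oob_scores <- rf1$votes[, "1"] # OOB vote fraction for class "1" as a probability-like score
auc(roc(redtrain$drinkit, oob_scores)) # training AUROC from OOB votes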
DISCUSSION
Neither algorithm performed very well on these datasets, although both performed better for negative
predictive value (i.e., predicting the bad wines). This makes sense, since the physicochemical attributes of
each wine in these datasets are probably more indicative of bad wine than good wine. For example, high
sulfur or acidity can easily spoil an otherwise good wine, but these components may combine more
subtly to affect a wine’s flavor. I think what makes a wine taste good is highly subjective anyway, which
probably makes the class label harder to predict using these data. Interestingly, alcohol content was the
single most predictive variable (doesn’t everything taste better when you’re intoxicated?).
Both algorithms performed better on the white wine dataset, which was 3 times larger than the red wine
dataset. This result reinforces a major tenet of data science: more data is superior to better algorithms.
In comparison, the random forest algorithm somewhat outperformed the logit models. This was expected,
as decision trees usually outperform logistic regression in my experience. Decision trees generally perform
well since they are highly iterative, robust to noisy data including outliers (of which there were many in
these data), and have good predictive power. In contrast, logit is also fairly robust but is more sensitive to
nonlinearity in the independent variables, and some of these data were skewed.
There are several things I could do to improve the analysis. First, some of the variables were skewed and had
many outliers, especially in the white wine dataset, but I didn’t perform any type of transformation. This
would have linearized the data and improved the logistic regression performance, as I stated. Moreover,
performing stepwise regression optimizes model fitting, and categorizing continuous variables improves
linearity in the independent variables, but neither of these methods was performed. Second, I didn’t perform
any pruning of the random forest trees to try to improve performance. Lastly, both R packages have many
functions that can be employed to optimize algorithm performance, but in general I didn’t make use of
these. A sketch of what some of these improvements might look like follows below.
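A minimal sketch under stated assumptions (the redtrain data frame and redmodel1 fit from above; log1p is
used only as an example transformation, and the tuneRF settings are arbitrary):
# Example: log-transform a right-skewed predictor before refitting the logit
redtrain$log.tsd <- log1p(redtrain$total.sulfur.dioxide)
redmodelT <- glm(drinkit ~ volatile.acidity + log.tsd + alcohol,
                 family = "binomial", data = redtrain)
# Example: stepwise selection by AIC starting from the full model
redmodelStep <- step(redmodel1, direction = "both", trace = FALSE)
# Example: tune the mtry parameter of the random forest
tuneRF(redtrain[, c("volatile.acidity", "total.sulfur.dioxide", "alcohol")],
       redtrain$drinkit, ntreeTry = 500, stepFactor = 2, improve = 0.01)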
CONCLUSIONS
1. The random forest algorithm outperformed logit.
2. Both algorithms performed better on the larger white wine dataset; getting more data is the best way
to improve your models.
3. What makes a wine taste good is subjective.
4. Good wines, and especially reds (14% of reds vs. 22% of whites rated drinkable), are hard to find, at
least in the northern Portugal “Vinho Verde” region.
REFERENCES
Datasets
1. Wine Quality Data Set from UCI ML Repository
2. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining
from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
dplyr resources
www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
www.youtube.com/watch?v=jWjqLW-u3hc&feature=youtu.be
www.youtube.com/watch?v=2mh1PqfsXVI
groups.google.com/forum/#!topic/manipulatr/Z46zwYXNh0g
stackoverflow.com/questions/22850026/filtering-row-which-contains-a-certain-string-using-dplyr
stackoverflow.com/questions/13520515/command-to-remove-row-from-a-data-frame
Logistic regression resources
cran.r-project.org/web/packages/glm2/glm2.pdf
www.kaggle.com/eyebervil/titanic/titanic-simple-logit-with-interaction
cran.r-project.org/web/packages/caret/vignettes/caret.pdf
cran.r-project.org/web/packages/caret/caret.pdf
cran.r-project.org/web/packages/pROC/pROC.pdf
stats.stackexchange.com/questions/87234/aic-values-and-their-use-in-stepwise-model-selection-for-a-simple-linear-regress
Random forest algorithm resources
cran.r-project.org/web/packages/randomForest/randomForest.pdf
campus.datacamp.com/courses/kaggle-r-tutorial-on-machine-learning/chapter-3-improving-your-predictions-through-random-
ex=1
R Markdown resources
rmarkdown.rstudio.com/
www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf
www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf
42

IU Applied Machine Learning Class Final Project: ML Methods for Predicting Wine Preference

  • 1.
    Applied Machine LearningFinal Project Jim Nelson December 5, 2016 OVERVIEW Aim Compare performace of R Machine Learning packages for logistic regression and random forest algorithms. Datasets Datasets used in this study were obtained from the UCI Machine Learning Dataset Repository. Two datasets of red and white wine variants of the northern Portugal “Vinho Verde” region were used. Each dataset comprises 11 physicochemical attributes of each wine such as acidity, residual sugars and percent alcohol(1). The red dataset includes 1599 different wines, while the white dataset has a total of 4898 wines. Specific data about grape types, wine brand, wine selling price, etc, have been omitted. Each wine was given a quality score which graded the wine in a scale that ranges from 0 (very bad) to 10 (excellent). Importantly, the score was derived subjectively from a panel of at least 3 experts using blind taste tests. The final quality score is given by the median of these evaluations(2). The binary class label “drinkit” was created for this analysis based on a quality score of >= 7. The datasets were submitted on 10/7/2009 by Paulo Cortez, University of Minho, Guimarães, Portugal. Software and Computing Environment The study was performed locally on a HP Pavilion 23 All-in-One with a 64 Bit OS running Windows 10 with 4.0 GB Ram and an AMD AP-5300 APU processor with Radeon HD graphics. The study was performed in R (R version 33.2.4 (2016-03-10) – “Very Secure Dishes”. Platform: x86_64-w64-mingw32/x64 (64-bit))Copyright (C) 2015 The R Foundation for Statistical Computing) using the open source R platform R Studio (Version 0.98.1102 - 2009-2014 RStudio, Inc.) The following R software packages with corresponding manuals and vignettes were obtained from the The Comprehensive R Archive Network (CRAN): Data Manipulation and Visualization: dplyr: A Grammar of Data Manipulation (v.0.4.3); ggplot2: An Implementation of the Grammar of Graphics (v.1.01) Logistic Regression Analysis: glm2: Fitting Generalized Linear Models (v. 1.1.2) Classification statistics and AUROC analysis: caret: Classification and Regression Training (v.6.0-62); pROC: Display and Analyze ROC Curves (v. 1.8) Random Forest Modeling: randomForest: Breiman and Cutler’s Random Forests for Classification and Regression( v. 4.6-10) 1
  • 2.
    DATA CURATION Load Datasetsand R Packages Load packages for data transformation library(dplyr) library(caTools) Load packages for data visualization library(ggplot2) library(fBasics) Load ML packages library(glm2) library(caret) library(pROC) library(randomForest) Load the datasets red <- read.csv ("winequality-red.csv", header= TRUE, sep=";") str(red) ## 'data.frame': 1599 obs. of 12 variables: ## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ... ## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ... ## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ... ## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ... ## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ... ## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ... ## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ... ## $ density : num 0.998 0.997 0.997 0.998 0.998 ... ## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ... ## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ... ## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ... ## $ quality : int 5 5 5 6 5 5 5 7 7 5 ... white <- read.csv ("winequality-white.csv", header= TRUE, sep=";") str(white) ## 'data.frame': 4898 obs. of 12 variables: ## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ... ## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ... ## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ... ## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ... ## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ... ## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ... 2
  • 3.
    ## $ total.sulfur.dioxide:num 170 132 97 186 186 97 136 170 132 129 ... ## $ density : num 1.001 0.994 0.995 0.996 0.996 ... ## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ... ## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ... ## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ... ## $ quality : int 6 6 6 6 6 6 6 6 6 6 ... Create class labels red<- red %>% mutate (drinkit = as.factor(as.numeric(quality >= 7 ))) class(red$drinkit) ## [1] "factor" white<- white %>% mutate (drinkit = as.factor(as.numeric(quality >= 7 ))) class(white$drinkit) ## [1] "factor" Create test and train datasets set.seed(123) sample = sample.split(red$drinkit, SplitRatio = .70) redtrain = subset(red, sample == TRUE) redtest = subset(red, sample == FALSE) sample <-sample.split(white$drinkit, SplitRatio = .70) whitetrain <- subset(white, sample == TRUE) whitetest <- subset(white, sample == FALSE) dim(redtrain) #70% for training ## [1] 1119 13 dim(redtest) #30% for test ## [1] 480 13 dim(whitetrain) ## [1] 3429 13 dim(whitetest) ## [1] 1469 13 3
  • 4.
    DATA EXPLORATION Descriptive statistics by(red[1:12][,c(1:12)],red$drinkit, basicStats) ## red$drinkit: 0 ## fixed.acidity volatile.acidity citric.acid residual.sugar ## nobs 1382.000000 1382.000000 1382.000000 1382.000000 ## NAs 0.000000 0.000000 0.000000 0.000000 ## Minimum 4.600000 0.160000 0.000000 0.900000 ## Maximum 15.900000 1.580000 1.000000 15.500000 ## 1. Quartile 7.100000 0.420000 0.082500 1.900000 ## 3. Quartile 9.100000 0.650000 0.400000 2.600000 ## Mean 8.236831 0.547022 0.254407 2.512120 ## Median 7.800000 0.540000 0.240000 2.200000 ## Sum 11383.300000 755.985000 351.590000 3471.750000 ## SE Mean 0.045265 0.004743 0.005102 0.038084 ## LCL Mean 8.148036 0.537717 0.244398 2.437412 ## UCL Mean 8.325626 0.556327 0.264415 2.586829 ## Variance 2.831568 0.031095 0.035973 2.004428 ## Stdev 1.682726 0.176337 0.189665 1.415778 ## Skewness 1.071064 0.670310 0.422857 4.878595 ## Kurtosis 1.337482 1.438348 -0.673833 32.003919 ## chlorides free.sulfur.dioxide total.sulfur.dioxide ## nobs 1382.000000 1382.000000 1382.000000 ## NAs 0.000000 0.000000 0.000000 ## Minimum 0.034000 1.000000 6.000000 ## Maximum 0.611000 72.000000 165.000000 ## 1. Quartile 0.071000 8.000000 23.000000 ## 3. Quartile 0.091000 22.000000 65.000000 ## Mean 0.089281 16.172214 48.285818 ## Median 0.080000 14.000000 39.500000 ## Sum 123.386000 22350.000000 66731.000000 ## SE Mean 0.001321 0.281577 0.876540 ## LCL Mean 0.086689 15.619850 46.566324 ## UCL Mean 0.091872 16.724578 50.005312 ## Variance 0.002412 109.572421 1061.821580 ## Stdev 0.049113 10.467685 32.585604 ## Skewness 5.547353 1.224203 1.110405 ## Kurtosis 38.898772 2.056348 0.704994 ## density pH sulphates alcohol quality ## nobs 1382.000000 1382.000000 1382.000000 1382.000000 1382.000000 ## NAs 0.000000 0.000000 0.000000 0.000000 0.000000 ## Minimum 0.990070 2.740000 0.330000 8.400000 3.000000 ## Maximum 1.003690 4.010000 2.000000 14.900000 6.000000 ## 1. Quartile 0.995785 3.210000 0.540000 9.500000 5.000000 ## 3. Quartile 0.997900 3.410000 0.700000 10.900000 6.000000 ## Mean 0.996859 3.314616 0.644754 10.251037 5.408828 ## Median 0.996800 3.310000 0.600000 10.000000 5.000000 ## Sum 1377.659370 4580.800000 891.050000 14166.933333 7475.000000 ## SE Mean 0.000049 0.004146 0.004590 0.026084 0.016186 ## LCL Mean 0.996764 3.306483 0.635750 10.199869 5.377076 4
  • 5.
    ## UCL Mean0.996955 3.322750 0.653758 10.302205 5.440580 ## Variance 0.000003 0.023758 0.029114 0.940248 0.362065 ## Stdev 0.001808 0.154135 0.170629 0.969664 0.601719 ## Skewness 0.117601 0.168576 2.774057 1.058300 -0.673203 ## Kurtosis 1.091744 0.843706 13.810053 0.921817 0.546000 ## -------------------------------------------------------- ## red$drinkit: 1 ## fixed.acidity volatile.acidity citric.acid residual.sugar ## nobs 217.000000 217.000000 217.000000 217.000000 ## NAs 0.000000 0.000000 0.000000 0.000000 ## Minimum 4.900000 0.120000 0.000000 1.200000 ## Maximum 15.600000 0.915000 0.760000 8.900000 ## 1. Quartile 7.400000 0.300000 0.300000 2.000000 ## 3. Quartile 10.100000 0.490000 0.490000 2.700000 ## Mean 8.847005 0.405530 0.376498 2.708756 ## Median 8.700000 0.370000 0.400000 2.300000 ## Sum 1919.800000 88.000000 81.700000 587.800000 ## SE Mean 0.135767 0.009841 0.013199 0.092528 ## LCL Mean 8.579406 0.386134 0.350482 2.526382 ## UCL Mean 9.114603 0.424926 0.402514 2.891130 ## Variance 3.999910 0.021014 0.037806 1.857840 ## Stdev 1.999977 0.144963 0.194438 1.363026 ## Skewness 0.460276 0.987628 -0.373539 2.173338 ## Kurtosis 0.313564 0.884353 -0.475447 4.660784 ## chlorides free.sulfur.dioxide total.sulfur.dioxide density ## nobs 217.000000 217.000000 217.000000 217.000000 ## NAs 0.000000 0.000000 0.000000 0.000000 ## Minimum 0.012000 3.000000 7.000000 0.990640 ## Maximum 0.358000 54.000000 289.000000 1.003200 ## 1. Quartile 0.062000 6.000000 17.000000 0.994700 ## 3. Quartile 0.085000 18.000000 43.000000 0.997350 ## Mean 0.075912 13.981567 34.889401 0.996030 ## Median 0.073000 11.000000 27.000000 0.995720 ## Sum 16.473000 3034.000000 7571.000000 216.138570 ## SE Mean 0.001933 0.694771 2.211148 0.000149 ## LCL Mean 0.072102 12.612168 30.531213 0.995736 ## UCL Mean 0.079723 15.350966 39.247589 0.996325 ## Variance 0.000811 104.747344 1060.950674 0.000005 ## Stdev 0.028480 10.234615 32.572238 0.002201 ## Skewness 5.035644 1.456370 4.439771 0.262062 ## Kurtosis 44.818747 1.889996 29.222588 0.292770 ## pH sulphates alcohol quality ## nobs 217.000000 217.000000 217.000000 217.000000 ## NAs 0.000000 0.000000 0.000000 0.000000 ## Minimum 2.880000 0.390000 9.200000 7.000000 ## Maximum 3.780000 1.360000 14.000000 8.000000 ## 1. Quartile 3.200000 0.650000 10.800000 7.000000 ## 3. Quartile 3.380000 0.820000 12.200000 7.000000 ## Mean 3.288802 0.743456 11.518049 7.082949 ## Median 3.270000 0.740000 11.600000 7.000000 ## Sum 713.670000 161.330000 2499.416667 1537.000000 ## SE Mean 0.010487 0.009099 0.067759 0.018766 ## LCL Mean 3.268133 0.725522 11.384496 7.045961 ## UCL Mean 3.309471 0.761391 11.651603 7.119938 5
  • 6.
## Variance 0.023863 0.017966 0.996310 0.076421
## Stdev 0.154478 0.134038 0.998153 0.276443
## Skewness 0.358697 0.620835 0.065494 3.003356
## Kurtosis 0.608113 1.964857 -0.430136 7.052712
by(white[1:12][,c(1:12)], white$drinkit, basicStats)
## white$drinkit: 0
## fixed.acidity volatile.acidity citric.acid residual.sugar
## nobs 3838.000000 3838.000000 3838.000000 3838.000000
## NAs 0.000000 0.000000 0.000000 0.000000
## Minimum 3.800000 0.080000 0.000000 0.600000
## Maximum 14.200000 1.100000 1.660000 65.800000
## 1. Quartile 6.300000 0.220000 0.260000 1.700000
## 3. Quartile 7.400000 0.320000 0.400000 10.400000
## Mean 6.890594 0.281802 0.336438 6.703478
## Median 6.800000 0.270000 0.320000 6.000000
## Sum 26446.100000 1081.555000 1291.250000 25727.950000
## SE Mean 0.013884 0.001651 0.002098 0.084341
## LCL Mean 6.863374 0.278564 0.332325 6.538121
## UCL Mean 6.917814 0.285039 0.340551 6.868835
## Variance 0.739786 0.010464 0.016889 27.301127
## Stdev 0.860108 0.102293 0.129959 5.225048
## Skewness 0.752179 1.720644 1.242293 1.035464
## Kurtosis 2.339990 5.737845 5.472595 3.755082
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## nobs 3838.000000 3.838000e+03 3.838000e+03
## NAs 0.000000 0.000000e+00 0.000000e+00
## Minimum 0.009000 2.000000e+00 9.000000e+00
## Maximum 0.346000 2.890000e+02 4.400000e+02
## 1. Quartile 0.037000 2.300000e+01 1.110000e+02
## 3. Quartile 0.051000 4.700000e+01 1.730000e+02
## Mean 0.047875 3.551733e+01 1.419829e+02
## Median 0.045000 3.400000e+01 1.400000e+02
## Sum 183.743000 1.363155e+05 5.449305e+05
## SE Mean 0.000380 2.871250e-01 7.125790e-01
## LCL Mean 0.047129 3.495439e+01 1.405859e+02
## UCL Mean 0.048620 3.608026e+01 1.433800e+02
## Variance 0.000554 3.164067e+02 1.948817e+03
## Stdev 0.023548 1.778783e+01 4.414540e+01
## Skewness 4.851869 1.423264e+00 2.830020e-01
## Kurtosis 33.327716 1.178770e+01 4.934680e-01
## density pH sulphates alcohol quality
## nobs 3838.000000 3838.000000 3838.000000 3838.000000 3838.000000
## NAs 0.000000 0.000000 0.000000 0.000000 0.000000
## Minimum 0.987220 2.720000 0.230000 8.000000 3.000000
## Maximum 1.038980 3.810000 1.060000 14.000000 6.000000
## 1. Quartile 0.992320 3.080000 0.410000 9.400000 5.000000
## 3. Quartile 0.996570 3.267500 0.540000 11.000000 6.000000
## Mean 0.994474 3.180847 0.487004 10.265215 5.519802
## Median 0.994380 3.170000 0.470000 10.000000 6.000000
## Sum 3816.789390 12208.090000 1869.120000 39397.896667 21185.000000
## SE Mean 0.000047 0.002396 0.001746 0.017765 0.009764
## LCL Mean 0.994382 3.176150 0.483580 10.230385 5.500659
## UCL Mean 0.994565 3.185544 0.490427 10.300045 5.538945
## Variance 0.000008 0.022027 0.011700 1.211267 0.365910
## Stdev 0.002894 0.148414 0.108167 1.100576 0.604905
## Skewness 1.139004 0.518051 0.939134 0.690666 -1.004627
## Kurtosis 14.002972 0.817759 1.572179 -0.281440 0.695824
## --------------------------------------------------------
## white$drinkit: 1
## fixed.acidity volatile.acidity citric.acid residual.sugar
## nobs 1060.000000 1060.000000 1060.000000 1060.000000
## NAs 0.000000 0.000000 0.000000 0.000000
## Minimum 3.900000 0.080000 0.010000 0.800000
## Maximum 9.200000 0.760000 0.740000 19.250000
## 1. Quartile 6.200000 0.190000 0.280000 1.800000
## 3. Quartile 7.200000 0.320000 0.360000 7.400000
## Mean 6.725142 0.265349 0.326057 5.261509
## Median 6.700000 0.250000 0.310000 3.875000
## Sum 7128.650000 281.270000 345.620000 5577.200000
## SE Mean 0.023613 0.002890 0.002466 0.131792
## LCL Mean 6.678807 0.259678 0.321218 5.002906
## UCL Mean 6.771476 0.271020 0.330895 5.520113
## Variance 0.591050 0.008854 0.006446 18.411355
## Stdev 0.768798 0.094097 0.080288 4.290845
## Skewness 0.019010 0.874201 0.705643 1.080500
## Kurtosis 0.437095 1.098804 3.030720 0.098557
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## nobs 1060.000000 1060.000000 1.060000e+03
## NAs 0.000000 0.000000 0.000000e+00
## Minimum 0.012000 5.000000 3.400000e+01
## Maximum 0.135000 108.000000 2.290000e+02
## 1. Quartile 0.031000 25.000000 1.010000e+02
## 3. Quartile 0.044000 42.000000 1.460000e+02
## Mean 0.038160 34.550472 1.252453e+02
## Median 0.037000 33.000000 1.220000e+02
## Sum 40.450000 36623.500000 1.327600e+05
## SE Mean 0.000342 0.423776 1.005136e+00
## LCL Mean 0.037489 33.718936 1.232730e+02
## UCL Mean 0.038832 35.382008 1.272176e+02
## Variance 0.000124 190.361237 1.070916e+03
## Stdev 0.011145 13.797146 3.272485e+01
## Skewness 2.258097 1.015169 5.094610e-01
## Kurtosis 14.741475 2.729507 2.420860e-01
## density pH sulphates alcohol quality
## nobs 1060.000000 1060.000000 1060.000000 1060.000000 1060.000000
## NAs 0.000000 0.000000 0.000000 0.000000 0.000000
## Minimum 0.987110 2.840000 0.220000 8.500000 7.000000
## Maximum 1.000600 3.820000 1.080000 14.200000 9.000000
## 1. Quartile 0.990500 3.100000 0.400000 10.700000 7.000000
## 3. Quartile 0.993605 3.320000 0.580000 12.400000 7.000000
## Mean 0.992412 3.215132 0.500142 11.416022 7.174528
## Median 0.991730 3.200000 0.480000 11.500000 7.000000
## Sum 1051.956700 3408.040000 530.150000 12100.983333 7605.000000
## SE Mean 0.000085 0.004828 0.004086 0.038553 0.012040
## LCL Mean 0.992245 3.205659 0.492123 11.340372 7.150904
## UCL Mean 0.992579 3.224605 0.508160 11.491672 7.198152
## Variance 0.000008 0.024707 0.017701 1.575551 0.153647
## Stdev 0.002772 0.157185 0.133044 1.255209 0.391978
## Skewness 1.001822 0.235375 0.942026 -0.404076 1.945040
## Kurtosis 0.406561 -0.186210 1.063202 -0.522122 2.498491
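The grouping variable passed to by() above is the binary drinkit label described in the Overview. The code that creates it appears earlier in the document; for orientation, a minimal sketch of how it can be derived (storing the label as a factor is an assumption, consistent with the classification random forests fit below):

red$drinkit <- factor(ifelse(red$quality >= 7, 1, 0)) # 1 = median expert quality score of 7 or higher
white$drinkit <- factor(ifelse(white$quality >= 7, 1, 0))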
Boxplots
attribute = fixed.acidity
bp_fixed.acidity <- ggplot(red, aes(x=drinkit, y=fixed.acidity))
bp_fixed.acidity + geom_boxplot()
[Boxplot: fixed.acidity by drinkit, red wine dataset]
bp_fixed.acidity <- ggplot(white, aes(x=drinkit, y=fixed.acidity))
bp_fixed.acidity + geom_boxplot()
[Boxplot: fixed.acidity by drinkit, white wine dataset]
attribute = volatile.acidity
bp_volatile.acidity <- ggplot(red, aes(x=drinkit, y=volatile.acidity))
bp_volatile.acidity + geom_boxplot()
[Boxplot: volatile.acidity by drinkit, red wine dataset]
bp_volatile.acidity <- ggplot(white, aes(x=drinkit, y=volatile.acidity))
bp_volatile.acidity + geom_boxplot()
[Boxplot: volatile.acidity by drinkit, white wine dataset]
attribute = citric.acid
bp_citric.acid <- ggplot(red, aes(x=drinkit, y=citric.acid))
bp_citric.acid + geom_boxplot()
[Boxplot: citric.acid by drinkit, red wine dataset]
bp_citric.acid <- ggplot(white, aes(x=drinkit, y=citric.acid))
bp_citric.acid + geom_boxplot()
[Boxplot: citric.acid by drinkit, white wine dataset]
attribute = residual.sugar
bp_residual.sugar <- ggplot(red, aes(x=drinkit, y=residual.sugar))
bp_residual.sugar + geom_boxplot()
[Boxplot: residual.sugar by drinkit, red wine dataset]
bp_residual.sugar <- ggplot(white, aes(x=drinkit, y=residual.sugar))
bp_residual.sugar + geom_boxplot()
[Boxplot: residual.sugar by drinkit, white wine dataset]
attribute = chlorides
bp_chlorides <- ggplot(red, aes(x=drinkit, y=chlorides))
bp_chlorides + geom_boxplot()
[Boxplot: chlorides by drinkit, red wine dataset]
bp_chlorides <- ggplot(white, aes(x=drinkit, y=chlorides))
bp_chlorides + geom_boxplot()
[Boxplot: chlorides by drinkit, white wine dataset]
attribute = free.sulfur.dioxide
bp_free.sulfur.dioxide <- ggplot(red, aes(x=drinkit, y=free.sulfur.dioxide))
bp_free.sulfur.dioxide + geom_boxplot()
[Boxplot: free.sulfur.dioxide by drinkit, red wine dataset]
bp_free.sulfur.dioxide <- ggplot(white, aes(x=drinkit, y=free.sulfur.dioxide))
bp_free.sulfur.dioxide + geom_boxplot()
[Boxplot: free.sulfur.dioxide by drinkit, white wine dataset]
attribute = total.sulfur.dioxide
bp_total.sulfur.dioxide <- ggplot(red, aes(x=drinkit, y=total.sulfur.dioxide))
bp_total.sulfur.dioxide + geom_boxplot()
[Boxplot: total.sulfur.dioxide by drinkit, red wine dataset]
bp_total.sulfur.dioxide <- ggplot(white, aes(x=drinkit, y=total.sulfur.dioxide))
bp_total.sulfur.dioxide + geom_boxplot()
[Boxplot: total.sulfur.dioxide by drinkit, white wine dataset]
attribute = sulphates
bp_sulphates <- ggplot(red, aes(x=drinkit, y=sulphates))
bp_sulphates + geom_boxplot()
[Boxplot: sulphates by drinkit, red wine dataset]
bp_sulphates <- ggplot(white, aes(x=drinkit, y=sulphates))
bp_sulphates + geom_boxplot()
[Boxplot: sulphates by drinkit, white wine dataset]
attribute = alcohol
bp_alcohol <- ggplot(red, aes(x=drinkit, y=alcohol))
bp_alcohol + geom_boxplot()
[Boxplot: alcohol by drinkit, red wine dataset]
bp_alcohol <- ggplot(white, aes(x=drinkit, y=alcohol))
bp_alcohol + geom_boxplot()
[Boxplot: alcohol by drinkit, white wine dataset]
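The plot blocks above differ only in the y variable. For reference, a compact equivalent loops over the attribute names; this is a sketch, and it assumes the aes_string() helper is available in the installed ggplot2 version:

attribs <- setdiff(names(red), c("quality", "drinkit")) # the 11 physicochemical attributes
for (a in attribs) {
  print(ggplot(red, aes_string(x = "drinkit", y = a)) + geom_boxplot() + ggtitle(paste("red:", a)))
  print(ggplot(white, aes_string(x = "drinkit", y = a)) + geom_boxplot() + ggtitle(paste("white:", a)))
}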
PREDICTIVE MODELING
Logistic regression - red wine dataset
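The redtrain/redtest (and whitetrain/whitetest) partitions used below were created in the data-curation step earlier in the document. For orientation, a minimal sketch of such a split with caTools (already loaded above); the seed is an assumption, and the 0.7 ratio is inferred from the 1119-row training set and 480-row test set seen in the output:

set.seed(88)                                         # assumed seed, for reproducibility
split <- sample.split(red$drinkit, SplitRatio = 0.7) # ~70/30 split, stratified on the class label
redtrain <- subset(red, split == TRUE)
redtest  <- subset(red, split == FALSE)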
Model 1
redmodel1 <- glm(drinkit ~ fixed.acidity+volatile.acidity+citric.acid+residual.sugar+
    total.sulfur.dioxide+chlorides+alcohol, family="binomial", data= redtrain)
summary(redmodel1)
##
## Call:
## glm(formula = drinkit ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + total.sulfur.dioxide + chlorides + alcohol,
## family = "binomial", data = redtrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6103 -0.4541 -0.2492 -0.1403 2.6757
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.12828 1.51262 -6.696 2.14e-11 ***
## fixed.acidity 0.12368 0.08316 1.487 0.13695
## volatile.acidity -4.49437 0.91316 -4.922 8.58e-07 ***
## citric.acid 0.31682 0.97102 0.326 0.74422
## residual.sugar 0.10574 0.06983 1.514 0.12996
## total.sulfur.dioxide -0.01329 0.00419 -3.172 0.00151 **
## chlorides -6.34684 4.16619 -1.523 0.12765
## alcohol 0.91992 0.10404 8.842 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 889.23 on 1118 degrees of freedom
## Residual deviance: 634.16 on 1111 degrees of freedom
## AIC: 650.16
##
## Number of Fisher Scoring iterations: 6
exp(cbind(OR = coef(redmodel1), confint(redmodel1))) # odds ratios and 95% CI
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 3.993409e-05 1.934350e-06 0.0007319036
## fixed.acidity 1.131655e+00 9.622602e-01 1.3339255345
## volatile.acidity 1.117174e-02 1.792495e-03 0.0642802846
## citric.acid 1.372752e+00 2.016707e-01 9.1164704929
## residual.sugar 1.111529e+00 9.597369e-01 1.2685734018
## total.sulfur.dioxide 9.867968e-01 9.783435e-01 0.9945651864
## chlorides 1.752269e-03 2.005370e-07 2.3287091038
## alcohol 2.509101e+00 2.055529e+00 3.0926740162
Model 1 performance
redtrain$drinkitYhat <- predict(redmodel1, type = "response") # generate yhat values on train df
redtrain$drinkitYhat <- ifelse(redtrain$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix(redtrain$drinkitYhat, redtrain$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 932 110
## 1 35 42
##
## Accuracy : 0.8704
## 95% CI : (0.8493, 0.8896)
## No Information Rate : 0.8642
## P-Value [Acc > NIR] : 0.2878
##
## Kappa : 0.3032
## Mcnemar's Test P-Value : 7.978e-10
##
## Sensitivity : 0.9638
## Specificity : 0.2763
## Pos Pred Value : 0.8944
## Neg Pred Value : 0.5455
## Prevalence : 0.8642
## Detection Rate : 0.8329
## Detection Prevalence : 0.9312
## Balanced Accuracy : 0.6201
##
## 'Positive' Class : 0
##
auc(roc(redtrain$drinkit, redtrain$drinkitYhat), levels=levels(as.factor(redtrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.6201
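One caveat worth flagging: auc() here is computed on the thresholded 0/1 predictions, so it equals the balanced accuracy rather than the usual threshold-independent ROC AUC. Scoring the raw predicted probabilities gives the conventional AUROC; a sketch (its value is not part of the original output):

probs <- predict(redmodel1, type = "response") # predicted probabilities, no 0.5 cutoff applied
auc(roc(redtrain$drinkit, probs))              # threshold-independent AUROC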
Model 2
redmodel2 <- glm(drinkit ~ volatile.acidity+total.sulfur.dioxide+alcohol, family="binomial",
    data= redtrain)
summary(redmodel2)
##
## Call:
## glm(formula = drinkit ~ volatile.acidity + total.sulfur.dioxide +
## alcohol, family = "binomial", data = redtrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0665 -0.4559 -0.2634 -0.1452 2.6374
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.90155 1.12801 -7.891 2.99e-15 ***
## volatile.acidity -5.21126 0.71903 -7.248 4.24e-13 ***
## total.sulfur.dioxide -0.01379 0.00409 -3.371 0.000749 ***
## alcohol 0.92389 0.09549 9.675 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 889.23 on 1118 degrees of freedom
## Residual deviance: 644.62 on 1115 degrees of freedom
## AIC: 652.62
##
## Number of Fisher Scoring iterations: 6
exp(cbind(OR = coef(redmodel2), confint(redmodel2))) # odds ratios and 95% CI
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.0001361775 1.429867e-05 0.001198158
## volatile.acidity 0.0054547915 1.276332e-03 0.021452206
## total.sulfur.dioxide 0.9863066898 9.780499e-01 0.993846316
## alcohol 2.5190775373 2.097376e+00 3.051543824
Model 2 performance
redtrain$drinkitYhat <- predict(redmodel2, type = "response") # generate yhat values on train df
redtrain$drinkitYhat <- ifelse(redtrain$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix(redtrain$drinkitYhat, redtrain$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 934 112
## 1 33 40
##
## Accuracy : 0.8704
## 95% CI : (0.8493, 0.8896)
## No Information Rate : 0.8642
## P-Value [Acc > NIR] : 0.2878
##
## Kappa : 0.2933
## Mcnemar's Test P-Value : 9.323e-11
##
## Sensitivity : 0.9659
## Specificity : 0.2632
## Pos Pred Value : 0.8929
## Neg Pred Value : 0.5479
## Prevalence : 0.8642
## Detection Rate : 0.8347
## Detection Prevalence : 0.9348
## Balanced Accuracy : 0.6145
##
## 'Positive' Class : 0
##
auc(roc(redtrain$drinkit, redtrain$drinkitYhat), levels=levels(as.factor(redtrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.6145
Model 3
redmodel3 <- glm(drinkit ~ alcohol, family="binomial", data= redtrain)
summary(redmodel3)
##
## Call:
## glm(formula = drinkit ~ alcohol, family = "binomial", data = redtrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2384 -0.5138 -0.3279 -0.2540 2.6650
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -13.11263 1.00437 -13.06 <2e-16 ***
## alcohol 1.04246 0.08978 11.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 889.23 on 1118 degrees of freedom
## Residual deviance: 724.88 on 1117 degrees of freedom
## AIC: 728.88
##
## Number of Fisher Scoring iterations: 5
exp(cbind(OR = coef(redmodel3), confint(redmodel3))) # odds ratios and 95% CI
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 2.019563e-06 2.659407e-07 1.371582e-05
## alcohol 2.836199e+00 2.388290e+00 3.397552e+00
Model 3 performance
redtrain$drinkitYhat <- predict(redmodel3, type = "response") # generate yhat values on train df
redtrain$drinkitYhat <- ifelse(redtrain$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix(redtrain$drinkitYhat, redtrain$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 943 131
## 1 24 21
##
## Accuracy : 0.8615
## 95% CI : (0.8398, 0.8812)
## No Information Rate : 0.8642
## P-Value [Acc > NIR] : 0.6236
##
## Kappa : 0.1611
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9752
## Specificity : 0.1382
## Pos Pred Value : 0.8780
## Neg Pred Value : 0.4667
## Prevalence : 0.8642
## Detection Rate : 0.8427
## Detection Prevalence : 0.9598
## Balanced Accuracy : 0.5567
##
## 'Positive' Class : 0
##
auc(roc(redtrain$drinkit, redtrain$drinkitYhat), levels=levels(as.factor(redtrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.5567
Use best model (model 2) for testset prediction
redtest$drinkitYhat <- predict(redmodel2, newdata = redtest, type = "response") # predict values on test df
redtest$drinkitYhat <- ifelse(redtest$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix(redtest$drinkitYhat, redtest$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 402 49
## 1 13 16
##
## Accuracy : 0.8708
## 95% CI : (0.8375, 0.8995)
## No Information Rate : 0.8646
## P-Value [Acc > NIR] : 0.3749
##
## Kappa : 0.2803
## Mcnemar's Test P-Value : 8.789e-06
##
## Sensitivity : 0.9687
## Specificity : 0.2462
## Pos Pred Value : 0.8914
## Neg Pred Value : 0.5517
## Prevalence : 0.8646
## Detection Rate : 0.8375
## Detection Prevalence : 0.9396
## Balanced Accuracy : 0.6074
##
## 'Positive' Class : 0
##
auc(roc(redtest$drinkit, redtest$drinkitYhat), levels=levels(as.factor(redtrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.6074
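Model 2 is chosen here even though model 1 has a marginally lower AIC (650.16 vs 652.62), presumably because it achieves nearly identical fit with four fewer terms. A quick side-by-side check of that trade-off (a sketch, not part of the original output):

AIC(redmodel1, redmodel2, redmodel3)        # compare information criteria across the three fits
anova(redmodel2, redmodel1, test = "Chisq") # likelihood-ratio test: do the four extra terms help?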
Logistic regression - white wine dataset
Model 1
whitemodel1 <- glm(drinkit ~ fixed.acidity+volatile.acidity+citric.acid+residual.sugar+
    total.sulfur.dioxide+chlorides+alcohol, family="binomial", data= whitetrain)
summary(whitemodel1)
##
## Call:
## glm(formula = drinkit ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + total.sulfur.dioxide + chlorides + alcohol,
## family = "binomial", data = whitetrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9694 -0.6635 -0.4286 -0.1833 2.8909
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.906528 0.805489 -11.057 < 2e-16 ***
## fixed.acidity -0.039714 0.058123 -0.683 0.494
## volatile.acidity -4.905984 0.578866 -8.475 < 2e-16 ***
## citric.acid -0.717323 0.460893 -1.556 0.120
## residual.sugar 0.047718 0.011470 4.160 3.18e-05 ***
## total.sulfur.dioxide 0.001578 0.001356 1.164 0.245
## chlorides -18.758762 4.513067 -4.157 3.23e-05 ***
## alcohol 0.899407 0.052430 17.154 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3581.9 on 3428 degrees of freedom
## Residual deviance: 2942.3 on 3421 degrees of freedom
## AIC: 2958.3
##
## Number of Fisher Scoring iterations: 6
exp(cbind(OR = coef(whitemodel1), confint(whitemodel1))) # odds ratios and 95% CI
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 1.355014e-04 2.768784e-05 6.517844e-04
## fixed.acidity 9.610643e-01 8.571076e-01 1.076523e+00
## volatile.acidity 7.402156e-03 2.339702e-03 2.263315e-02
## citric.acid 4.880570e-01 1.948869e-01 1.193235e+00
## residual.sugar 1.048875e+00 1.025407e+00 1.072573e+00
## total.sulfur.dioxide 1.001579e+00 9.989146e-01 1.004239e+00
## chlorides 7.131371e-09 7.522433e-13 3.505045e-05
## alcohol 2.458146e+00 2.220357e+00 2.727226e+00
Model 1 performance
whitetrain$drinkitYhat <- predict(whitemodel1, type = "response") # generate yhat values on train df
whitetrain$drinkitYhat <- ifelse(whitetrain$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix(whitetrain$drinkitYhat, whitetrain$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2532 565
## 1 155 177
##
## Accuracy : 0.79
## 95% CI : (0.776, 0.8036)
## No Information Rate : 0.7836
## P-Value [Acc > NIR] : 0.1865
##
## Kappa : 0.2261
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9423
## Specificity : 0.2385
## Pos Pred Value : 0.8176
## Neg Pred Value : 0.5331
## Prevalence : 0.7836
## Detection Rate : 0.7384
## Detection Prevalence : 0.9032
## Balanced Accuracy : 0.5904
##
## 'Positive' Class : 0
##
auc(roc(whitetrain$drinkit, whitetrain$drinkitYhat), levels=levels(as.factor(whitetrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.5904
Model 2
whitemodel2 <- glm(drinkit ~ volatile.acidity+residual.sugar+chlorides+alcohol, family="binomial",
    data= whitetrain)
summary(whitemodel2)
##
## Call:
## glm(formula = drinkit ~ volatile.acidity + residual.sugar + chlorides +
## alcohol, family = "binomial", data = whitetrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9024 -0.6594 -0.4260 -0.1902 2.8269
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.09922 0.63146 -14.410 < 2e-16 ***
## volatile.acidity -4.68741 0.56405 -8.310 < 2e-16 ***
## residual.sugar 0.04845 0.01115 4.345 1.39e-05 ***
## chlorides -18.36325 4.41268 -4.161 3.16e-05 ***
## alcohol 0.88237 0.04981 17.714 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3581.9 on 3428 degrees of freedom
## Residual deviance: 2947.1 on 3424 degrees of freedom
## AIC: 2957.1
##
## Number of Fisher Scoring iterations: 5
exp(cbind(OR = coef(whitemodel2), confint(whitemodel2))) # odds ratios and 95% CI
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 1.117534e-04 3.232733e-05 3.849781e-04
## volatile.acidity 9.210485e-03 2.998984e-03 2.737557e-02
## residual.sugar 1.049642e+00 1.026810e+00 1.072716e+00
## chlorides 1.059114e-08 1.369508e-12 4.332245e-05
## alcohol 2.416612e+00 2.193714e+00 2.667033e+00
Model 2 performance
whitetrain$drinkitYhat <- predict(whitemodel2, type = "response") # generate yhat values on train df
whitetrain$drinkitYhat <- ifelse(whitetrain$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix(whitetrain$drinkitYhat, whitetrain$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2536 564
## 1 151 178
##
## Accuracy : 0.7915
## 95% CI : (0.7775, 0.805)
## No Information Rate : 0.7836
## P-Value [Acc > NIR] : 0.1357
##
## Kappa : 0.23
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9438
## Specificity : 0.2399
## Pos Pred Value : 0.8181
## Neg Pred Value : 0.5410
## Prevalence : 0.7836
## Detection Rate : 0.7396
## Detection Prevalence : 0.9041
## Balanced Accuracy : 0.5918
##
## 'Positive' Class : 0
##
auc(roc(whitetrain$drinkit, whitetrain$drinkitYhat), levels=levels(as.factor(whitetrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.5918
Use best model (model 2) for testset prediction
whitetest$drinkitYhat <- predict(whitemodel2, newdata = whitetest, type = "response")
whitetest$drinkitYhat <- ifelse(whitetest$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix(whitetest$drinkitYhat, whitetest$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1081 242
## 1 70 76
##
## Accuracy : 0.7876
## 95% CI : (0.7658, 0.8083)
## No Information Rate : 0.7835
## P-Value [Acc > NIR] : 0.3657
##
## Kappa : 0.2215
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9392
## Specificity : 0.2390
## Pos Pred Value : 0.8171
## Neg Pred Value : 0.5205
## Prevalence : 0.7835
## Detection Rate : 0.7359
## Detection Prevalence : 0.9006
## Balanced Accuracy : 0.5891
##
## 'Positive' Class : 0
##
auc(roc(whitetest$drinkit, whitetest$drinkitYhat), levels=levels(as.factor(whitetrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.5891
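At the default 0.5 cutoff these white-wine models catch fewer than a quarter of the good wines, which is unsurprising given the roughly 78%/22% class imbalance. The cutoff itself can be tuned; pROC can suggest the threshold that maximizes sensitivity plus specificity (a sketch, not run in the original analysis):

probs <- predict(whitemodel2, type = "response")
roc_obj <- roc(whitetrain$drinkit, probs)
coords(roc_obj, "best", best.method = "youden") # cutoff maximizing sensitivity + specificity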
Random forest algorithm - red wine dataset
set.seed(77)
rf1 <- randomForest(drinkit ~ volatile.acidity+total.sulfur.dioxide+alcohol, type = classification,
    data=redtrain, ntree = 1000, importance = TRUE, confusion = TRUE)
round(importance(rf1), 1)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## volatile.acidity 18.7 54.4 44.3 81.8
## total.sulfur.dioxide 10.6 37.5 28.9 70.7
## alcohol 24.6 73.7 59.5 89.5
print(rf1)
##
## Call:
## randomForest(formula = drinkit ~ volatile.acidity + total.sulfur.dioxide + alcohol, data = redtrain, type = classification, ntree = 1000, importance = TRUE, confusion = TRUE)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 12.87%
## Confusion matrix:
## 0 1 class.error
## 0 918 49 0.05067218
## 1 95 57 0.62500000
Random forest using only alcohol attribute
set.seed(77)
rf2 <- randomForest(drinkit ~ alcohol, type = classification, data=redtrain, ntree = 1000,
    importance = TRUE, confusion = TRUE)
round(importance(rf2), 1)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## alcohol 67.5 66.4 67.6 72.2
print(rf2)
##
## Call:
## randomForest(formula = drinkit ~ alcohol, data = redtrain, type = classification, ntree = 1000, importance = TRUE, confusion = TRUE)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 15.91%
## Confusion matrix:
## 0 1 class.error
## 0 922 45 0.04653568
## 1 133 19 0.87500000
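mtry was left at its default here (one variable tried per split for rf1). The randomForest package ships a helper that searches for a better mtry against the out-of-bag error; a sketch using the same predictors (the tuning settings are assumptions, and with only three predictors the search space is small):

set.seed(77)
tuneRF(x = redtrain[, c("volatile.acidity", "total.sulfur.dioxide", "alcohol")],
       y = redtrain$drinkit, ntreeTry = 500, stepFactor = 2, improve = 0.01)
# prints/plots the OOB error for each mtry tried and returns the error matrix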
Testset prediction
rf1predict <- predict(rf1, redtest, type="response")
Model performance on testset
confusionMatrix(rf1predict, redtest$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 397 27
## 1 18 38
##
## Accuracy : 0.9062
## 95% CI : (0.8766, 0.9308)
## No Information Rate : 0.8646
## P-Value [Acc > NIR] : 0.003355
##
## Kappa : 0.5748
## Mcnemar's Test P-Value : 0.233038
##
## Sensitivity : 0.9566
## Specificity : 0.5846
## Pos Pred Value : 0.9363
## Neg Pred Value : 0.6786
## Prevalence : 0.8646
## Detection Rate : 0.8271
## Detection Prevalence : 0.8833
## Balanced Accuracy : 0.7706
##
## 'Positive' Class : 0
##
rf1predict <- as.numeric(rf1predict)
auc(roc(redtest$drinkit, rf1predict)) # calculate AUROC curve
## Area under the curve: 0.7706
Random forest algorithm - white wine dataset
set.seed(77)
rf3 <- randomForest(drinkit ~ volatile.acidity+residual.sugar+chlorides+alcohol, type = classification,
    data=whitetrain, ntree = 1000, importance = TRUE, confusion = TRUE)
round(importance(rf3), 1)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## volatile.acidity 45.4 128.0 113.4 248.4
## residual.sugar 48.0 105.0 110.3 295.5
## chlorides 32.6 116.0 101.8 251.7
## alcohol 68.0 226.6 179.2 359.4
print(rf3)
##
## Call:
## randomForest(formula = drinkit ~ volatile.acidity + residual.sugar + chlorides + alcohol, data = whitetrain, type = classification, ntree = 1000, importance = TRUE, confusion = TRUE)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 14.44%
## Confusion matrix:
## 0 1 class.error
## 0 2521 166 0.06177894
## 1 329 413 0.44339623
Testset prediction
rf3predict <- predict(rf3, whitetest, type="response")
Model performance on testset
confusionMatrix(rf3predict, whitetest$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1075 140
## 1 76 178
##
## Accuracy : 0.853
## 95% CI : (0.8338, 0.8707)
## No Information Rate : 0.7835
## P-Value [Acc > NIR] : 9.184e-12
##
## Kappa : 0.5325
## Mcnemar's Test P-Value : 1.814e-05
##
## Sensitivity : 0.9340
## Specificity : 0.5597
## Pos Pred Value : 0.8848
## Neg Pred Value : 0.7008
## Prevalence : 0.7835
## Detection Rate : 0.7318
## Detection Prevalence : 0.8271
## Balanced Accuracy : 0.7469
##
## 'Positive' Class : 0
##
rf3predict <- as.numeric(rf3predict)
auc(roc(whitetest$drinkit, rf3predict)) # calculate AUROC curve
## Area under the curve: 0.7469
RESULTS SUMMARY
Descriptive statistics and visualization of the data were employed to select variables for the predictive models of high-quality wines. Several models were compared using multivariate logistic regression. The best model (model 2) defined on the red wine training dataset included the variables volatile.acidity, total.sulfur.dioxide and alcohol. Results of the confusion matrix and AUROC calculation were as follows. Please note that after checking the raw data I found that, although the confusion matrices reported in the caret output above are correct, the Sensitivity and Specificity labels are inverted with respect to the drinkit = 1 class (caret treats 0, the majority class, as the 'positive' class). The corrected values are shown below.
Red wine training set results:
1. Accuracy: 0.8704
2. Sensitivity (TPR): 0.2632
3. Specificity (TNR): 0.9659
4. FPR (1 - Specificity): 0.0341
5. Area under the curve: 0.6145
Results of this model on the test set:
1. Accuracy: 0.8708
2. Sensitivity: 0.2462
3. Specificity: 0.9687
4. FPR (1 - Specificity): 0.0313
5. Area under the curve: 0.6074
The best white wine logit model included the variables volatile.acidity, residual.sugar, chlorides and alcohol. Results of this model on the training set:
1. Accuracy: 0.7915
2. Sensitivity: 0.2399
3. Specificity: 0.9438
4. FPR (1 - Specificity): 0.0562
5. Area under the curve: 0.5918
Results of this model on the test set:
1. Accuracy: 0.7876
2. Sensitivity: 0.2390
3. Specificity: 0.9392
4. FPR (1 - Specificity): 0.0608
5. Area under the curve: 0.5891
For direct comparison of the logit and random forest algorithms, the best models defined using logistic regression were then evaluated by random forest using 1000 trees. Red wine training set results (out-of-bag):
1. Accuracy: 0.8713
2. Sensitivity: 0.5377
3. Specificity: 0.9062
4. FPR (1 - Specificity): 0.0938
5. Area under the curve: N/A using randomForest package
Results of this model on the test set:
1. Accuracy: 0.9062
2. Sensitivity: 0.5846
3. Specificity: 0.9566
4. FPR (1 - Specificity): 0.0434
5. Area under the curve: 0.7706
White wine training set results (out-of-bag):
1. Accuracy: 0.8556
2. Sensitivity: 0.7133
3. Specificity: 0.8846
4. FPR (1 - Specificity): 0.1154
5. Area under the curve: N/A using randomForest package
Results of this model on the test set:
1. Accuracy: 0.8530
2. Sensitivity: 0.5597
3. Specificity: 0.9340
4. FPR (1 - Specificity): 0.0660
5. Area under the curve: 0.7469
DISCUSSION
Neither algorithm performed very well on these datasets, although both performed better on negative predictive value (i.e., predicting the bad wines). This makes sense: the physicochemical attributes of each wine in these datasets are probably more indicative of a bad wine than of a good one. For example, high sulfur or acidity can easily spoil an otherwise good wine, whereas the same components may combine more subtly to affect a wine's flavor. What makes a wine taste good is highly subjective anyway, which probably makes the class label harder to predict from these data. Interestingly, alcohol content was the single most predictive variable (doesn't everything taste better when you're intoxicated?).
Both algorithms performed better on the white wine dataset, which was three times larger than the red wine dataset. This result reinforces a major tenet of data science: more data tends to beat better algorithms.
In a direct comparison, the random forest algorithm somewhat outperformed the logit models. This was expected, as decision trees usually outperform logistic regression in my experience. Tree ensembles generally perform well because they are highly iterative, robust to noisy data including outliers (of which there were many in these data), and have good predictive power. Logit is also fairly robust, but it is more sensitive to nonlinearity in the independent variables, and some of these data were skewed.
There are several things I could do to improve the analysis. First, some of the variables were skewed and had many outliers, especially in the white wine dataset, but I didn't perform any type of transformation; transforming would have linearized the data and improved the logistic regression performance, as stated above. Moreover, performing stepwise regression optimizes model fitting, and categorizing continuous variables improves linearity in the independent variables, but neither of these methods was used (a sketch of both ideas follows below). Second, I didn't prune or otherwise tune the random forest trees to try to improve performance. Lastly, both R packages have many functions that can be employed to optimize algorithm performance, but in general I didn't make use of these.
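As a sketch of those first improvements (none of this was run in the original analysis; log1p is one common choice for right-skewed, zero-containing variables, and the transformed-variable name is illustrative):

# Log-transform a right-skewed predictor and refit the white wine model
whitetrain$log.residual.sugar <- log1p(whitetrain$residual.sugar)
whitemodel2b <- glm(drinkit ~ volatile.acidity + log.residual.sugar + chlorides + alcohol,
                    family = "binomial", data = whitetrain)
# Backward stepwise selection by AIC, starting from the full model
whitestep <- step(whitemodel1, direction = "backward")
summary(whitestep)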
CONCLUSIONS
1. The random forest algorithm outperformed logit.
2. Both algorithms performed better on the larger white wine dataset; getting more data is the best way to improve your models.
3. What makes a wine taste good is subjective.
4. Good wines, and especially reds (14% vs 22% for whites), are hard to find, at least in the northern Portugal “Vinho Verde” region.
REFERENCES
Datasets
1. Wine Quality Data Set from UCI ML Repository
2. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
dplyr resources
www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
www.youtube.com/watch?v=jWjqLW-u3hc&feature=youtu.be
www.youtube.com/watch?v=2mh1PqfsXVI
groups.google.com/forum/#!topic/manipulatr/Z46zwYXNh0g
stackoverflow.com/questions/22850026/filtering-row-which-contains-a-certain-string-using-dplyr
stackoverflow.com/questions/13520515/command-to-remove-row-from-a-data-frame
Logistic regression resources
cran.r-project.org/web/packages/glm2/glm2.pdf
www.kaggle.com/eyebervil/titanic/titanic-simple-logit-with-interaction
cran.r-project.org/web/packages/caret/vignettes/caret.pdf
cran.r-project.org/web/packages/caret/caret.pdf
cran.r-project.org/web/packages/pROC/pROC.pdf
stats.stackexchange.com/questions/87234/aic-values-and-their-use-in-stepwise-model-selection-for-a-simple-linear-regress
Random forest algorithm resources
cran.r-project.org/web/packages/randomForest/randomForest.pdf
campus.datacamp.com/courses/kaggle-r-tutorial-on-machine-learning/chapter-3-improving-your-predictions-through-random-forests?ex=1
R Markdown resources
rmarkdown.rstudio.com/
www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf
www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf