Applied Machine Learning Final Project
Jim Nelson
December 5, 2016
OVERVIEW
Aim
Compare the performance of R machine learning packages implementing logistic regression and random forest
algorithms.
Datasets
Datasets used in this study were obtained from the UCI Machine Learning Dataset Repository. Two datasets
covering the red and white wine variants from the “Vinho Verde” region of northern Portugal were used. Each
dataset comprises 11 physicochemical attributes of each wine, such as acidity, residual sugar, and percent
alcohol (1). The red dataset includes 1599 different wines, while the white dataset has a total of 4898 wines.
Specific data about grape types, wine brand, wine selling price, etc., have been omitted. Each wine was given
a quality score that graded the wine on a scale from 0 (very bad) to 10 (excellent). Importantly, the score
was derived subjectively from a panel of at least 3 experts using blind taste tests; the final quality score is
given by the median of these evaluations (2). The binary class label “drinkit” was created for this analysis
based on a quality score of >= 7. The datasets were submitted on 10/7/2009 by Paulo Cortez, University of
Minho, Guimarães, Portugal.
Software and Computing Environment
The study was performed locally on an HP Pavilion 23 All-in-One with a 64-bit OS running Windows
10, with 4.0 GB of RAM and an AMD AP-5300 APU processor with Radeon HD graphics. All analyses were
performed in R version 3.2.4 (2016-03-10), “Very Secure Dishes” (platform: x86_64-w64-mingw32/x64
(64-bit); Copyright (C) 2015 The R Foundation for Statistical Computing), using the open source IDE
RStudio (version 0.98.1102, © 2009-2014 RStudio, Inc.).
The following R software packages, with corresponding manuals and vignettes, were obtained from The
Comprehensive R Archive Network (CRAN):
Data Manipulation and Visualization:
dplyr: A Grammar of Data Manipulation (v.0.4.3);
ggplot2: An Implementation of the Grammar of Graphics (v.1.01)
Logistic Regression Analysis:
glm2: Fitting Generalized Linear Models (v. 1.1.2)
Classification statistics and AUROC analysis:
caret: Classification and Regression Training (v.6.0-62); pROC: Display and Analyze ROC Curves (v. 1.8)
Random Forest Modeling:
randomForest: Breiman and Cutler’s Random Forests for Classification and Regression (v. 4.6-10)
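For reproducibility, the version information above can be captured directly from an R session; a minimal
sketch (illustrative only, not part of the original analysis):
R.version.string # prints the running R version
for (p in c("dplyr", "ggplot2", "glm2", "caret", "pROC", "randomForest"))
  print(paste(p, packageVersion(p))) # prints each package version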
DATA CURATION
Load Datasets and R Packages
Load packages for data transformation
library(dplyr)
library(caTools)
Load packages for data visualization
library(ggplot2)
library(fBasics)
Load ML packages
library(glm2)
library(caret)
library(pROC)
library(randomForest)
Load the datasets
red <- read.csv("winequality-red.csv", header = TRUE, sep = ";")
str(red)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
white <- read.csv("winequality-white.csv", header = TRUE, sep = ";")
str(white)
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
Create class labels
red <- red %>%
  mutate(drinkit = as.factor(as.numeric(quality >= 7)))
class(red$drinkit)
## [1] "factor"
white <- white %>%
  mutate(drinkit = as.factor(as.numeric(quality >= 7)))
class(white$drinkit)
## [1] "factor"
Create test and train datasets
set.seed(123)
sample <- sample.split(red$drinkit, SplitRatio = 0.70)
redtrain <- subset(red, sample == TRUE)
redtest <- subset(red, sample == FALSE)
sample <- sample.split(white$drinkit, SplitRatio = 0.70)
whitetrain <- subset(white, sample == TRUE)
whitetest <- subset(white, sample == FALSE)
dim(redtrain) #70% for training
## [1] 1119 13
dim(redtest) #30% for test
## [1] 480 13
dim(whitetrain)
## [1] 3429 13
dim(whitetest)
## [1] 1469 13
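Note that sample.split() stratifies on the supplied label, so the class balance of drinkit should be
preserved in both subsets. A quick check (illustrative sketch, not part of the original run):
prop.table(table(red$drinkit)) # class balance in the full red dataset
prop.table(table(redtrain$drinkit)) # training subset should closely match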
DATA EXPLORATION
Descriptive statistics
by(red[, 1:12], red$drinkit, basicStats)
## red$drinkit: 0
## fixed.acidity volatile.acidity citric.acid residual.sugar
## nobs 1382.000000 1382.000000 1382.000000 1382.000000
## NAs 0.000000 0.000000 0.000000 0.000000
## Minimum 4.600000 0.160000 0.000000 0.900000
## Maximum 15.900000 1.580000 1.000000 15.500000
## 1. Quartile 7.100000 0.420000 0.082500 1.900000
## 3. Quartile 9.100000 0.650000 0.400000 2.600000
## Mean 8.236831 0.547022 0.254407 2.512120
## Median 7.800000 0.540000 0.240000 2.200000
## Sum 11383.300000 755.985000 351.590000 3471.750000
## SE Mean 0.045265 0.004743 0.005102 0.038084
## LCL Mean 8.148036 0.537717 0.244398 2.437412
## UCL Mean 8.325626 0.556327 0.264415 2.586829
## Variance 2.831568 0.031095 0.035973 2.004428
## Stdev 1.682726 0.176337 0.189665 1.415778
## Skewness 1.071064 0.670310 0.422857 4.878595
## Kurtosis 1.337482 1.438348 -0.673833 32.003919
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## nobs 1382.000000 1382.000000 1382.000000
## NAs 0.000000 0.000000 0.000000
## Minimum 0.034000 1.000000 6.000000
## Maximum 0.611000 72.000000 165.000000
## 1. Quartile 0.071000 8.000000 23.000000
## 3. Quartile 0.091000 22.000000 65.000000
## Mean 0.089281 16.172214 48.285818
## Median 0.080000 14.000000 39.500000
## Sum 123.386000 22350.000000 66731.000000
## SE Mean 0.001321 0.281577 0.876540
## LCL Mean 0.086689 15.619850 46.566324
## UCL Mean 0.091872 16.724578 50.005312
## Variance 0.002412 109.572421 1061.821580
## Stdev 0.049113 10.467685 32.585604
## Skewness 5.547353 1.224203 1.110405
## Kurtosis 38.898772 2.056348 0.704994
## density pH sulphates alcohol quality
## nobs 1382.000000 1382.000000 1382.000000 1382.000000 1382.000000
## NAs 0.000000 0.000000 0.000000 0.000000 0.000000
## Minimum 0.990070 2.740000 0.330000 8.400000 3.000000
## Maximum 1.003690 4.010000 2.000000 14.900000 6.000000
## 1. Quartile 0.995785 3.210000 0.540000 9.500000 5.000000
## 3. Quartile 0.997900 3.410000 0.700000 10.900000 6.000000
## Mean 0.996859 3.314616 0.644754 10.251037 5.408828
## Median 0.996800 3.310000 0.600000 10.000000 5.000000
## Sum 1377.659370 4580.800000 891.050000 14166.933333 7475.000000
## SE Mean 0.000049 0.004146 0.004590 0.026084 0.016186
## LCL Mean 0.996764 3.306483 0.635750 10.199869 5.377076
## UCL Mean 0.996955 3.322750 0.653758 10.302205 5.440580
## Variance 0.000003 0.023758 0.029114 0.940248 0.362065
## Stdev 0.001808 0.154135 0.170629 0.969664 0.601719
## Skewness 0.117601 0.168576 2.774057 1.058300 -0.673203
## Kurtosis 1.091744 0.843706 13.810053 0.921817 0.546000
## --------------------------------------------------------
## red$drinkit: 1
## fixed.acidity volatile.acidity citric.acid residual.sugar
## nobs 217.000000 217.000000 217.000000 217.000000
## NAs 0.000000 0.000000 0.000000 0.000000
## Minimum 4.900000 0.120000 0.000000 1.200000
## Maximum 15.600000 0.915000 0.760000 8.900000
## 1. Quartile 7.400000 0.300000 0.300000 2.000000
## 3. Quartile 10.100000 0.490000 0.490000 2.700000
## Mean 8.847005 0.405530 0.376498 2.708756
## Median 8.700000 0.370000 0.400000 2.300000
## Sum 1919.800000 88.000000 81.700000 587.800000
## SE Mean 0.135767 0.009841 0.013199 0.092528
## LCL Mean 8.579406 0.386134 0.350482 2.526382
## UCL Mean 9.114603 0.424926 0.402514 2.891130
## Variance 3.999910 0.021014 0.037806 1.857840
## Stdev 1.999977 0.144963 0.194438 1.363026
## Skewness 0.460276 0.987628 -0.373539 2.173338
## Kurtosis 0.313564 0.884353 -0.475447 4.660784
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## nobs 217.000000 217.000000 217.000000 217.000000
## NAs 0.000000 0.000000 0.000000 0.000000
## Minimum 0.012000 3.000000 7.000000 0.990640
## Maximum 0.358000 54.000000 289.000000 1.003200
## 1. Quartile 0.062000 6.000000 17.000000 0.994700
## 3. Quartile 0.085000 18.000000 43.000000 0.997350
## Mean 0.075912 13.981567 34.889401 0.996030
## Median 0.073000 11.000000 27.000000 0.995720
## Sum 16.473000 3034.000000 7571.000000 216.138570
## SE Mean 0.001933 0.694771 2.211148 0.000149
## LCL Mean 0.072102 12.612168 30.531213 0.995736
## UCL Mean 0.079723 15.350966 39.247589 0.996325
## Variance 0.000811 104.747344 1060.950674 0.000005
## Stdev 0.028480 10.234615 32.572238 0.002201
## Skewness 5.035644 1.456370 4.439771 0.262062
## Kurtosis 44.818747 1.889996 29.222588 0.292770
## pH sulphates alcohol quality
## nobs 217.000000 217.000000 217.000000 217.000000
## NAs 0.000000 0.000000 0.000000 0.000000
## Minimum 2.880000 0.390000 9.200000 7.000000
## Maximum 3.780000 1.360000 14.000000 8.000000
## 1. Quartile 3.200000 0.650000 10.800000 7.000000
## 3. Quartile 3.380000 0.820000 12.200000 7.000000
## Mean 3.288802 0.743456 11.518049 7.082949
## Median 3.270000 0.740000 11.600000 7.000000
## Sum 713.670000 161.330000 2499.416667 1537.000000
## SE Mean 0.010487 0.009099 0.067759 0.018766
## LCL Mean 3.268133 0.725522 11.384496 7.045961
## UCL Mean 3.309471 0.761391 11.651603 7.119938
## Variance 0.023863 0.017966 0.996310 0.076421
## Stdev 0.154478 0.134038 0.998153 0.276443
## Skewness 0.358697 0.620835 0.065494 3.003356
## Kurtosis 0.608113 1.964857 -0.430136 7.052712
by(white[, 1:12], white$drinkit, basicStats)
## white$drinkit: 0
## fixed.acidity volatile.acidity citric.acid residual.sugar
## nobs 3838.000000 3838.000000 3838.000000 3838.000000
## NAs 0.000000 0.000000 0.000000 0.000000
## Minimum 3.800000 0.080000 0.000000 0.600000
## Maximum 14.200000 1.100000 1.660000 65.800000
## 1. Quartile 6.300000 0.220000 0.260000 1.700000
## 3. Quartile 7.400000 0.320000 0.400000 10.400000
## Mean 6.890594 0.281802 0.336438 6.703478
## Median 6.800000 0.270000 0.320000 6.000000
## Sum 26446.100000 1081.555000 1291.250000 25727.950000
## SE Mean 0.013884 0.001651 0.002098 0.084341
## LCL Mean 6.863374 0.278564 0.332325 6.538121
## UCL Mean 6.917814 0.285039 0.340551 6.868835
## Variance 0.739786 0.010464 0.016889 27.301127
## Stdev 0.860108 0.102293 0.129959 5.225048
## Skewness 0.752179 1.720644 1.242293 1.035464
## Kurtosis 2.339990 5.737845 5.472595 3.755082
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## nobs 3838.000000 3.838000e+03 3.838000e+03
## NAs 0.000000 0.000000e+00 0.000000e+00
## Minimum 0.009000 2.000000e+00 9.000000e+00
## Maximum 0.346000 2.890000e+02 4.400000e+02
## 1. Quartile 0.037000 2.300000e+01 1.110000e+02
## 3. Quartile 0.051000 4.700000e+01 1.730000e+02
## Mean 0.047875 3.551733e+01 1.419829e+02
## Median 0.045000 3.400000e+01 1.400000e+02
## Sum 183.743000 1.363155e+05 5.449305e+05
## SE Mean 0.000380 2.871250e-01 7.125790e-01
## LCL Mean 0.047129 3.495439e+01 1.405859e+02
## UCL Mean 0.048620 3.608026e+01 1.433800e+02
## Variance 0.000554 3.164067e+02 1.948817e+03
## Stdev 0.023548 1.778783e+01 4.414540e+01
## Skewness 4.851869 1.423264e+00 2.830020e-01
## Kurtosis 33.327716 1.178770e+01 4.934680e-01
## density pH sulphates alcohol quality
## nobs 3838.000000 3838.000000 3838.000000 3838.000000 3838.000000
## NAs 0.000000 0.000000 0.000000 0.000000 0.000000
## Minimum 0.987220 2.720000 0.230000 8.000000 3.000000
## Maximum 1.038980 3.810000 1.060000 14.000000 6.000000
## 1. Quartile 0.992320 3.080000 0.410000 9.400000 5.000000
## 3. Quartile 0.996570 3.267500 0.540000 11.000000 6.000000
## Mean 0.994474 3.180847 0.487004 10.265215 5.519802
## Median 0.994380 3.170000 0.470000 10.000000 6.000000
## Sum 3816.789390 12208.090000 1869.120000 39397.896667 21185.000000
## SE Mean 0.000047 0.002396 0.001746 0.017765 0.009764
## LCL Mean 0.994382 3.176150 0.483580 10.230385 5.500659
## UCL Mean 0.994565 3.185544 0.490427 10.300045 5.538945
## Variance 0.000008 0.022027 0.011700 1.211267 0.365910
## Stdev 0.002894 0.148414 0.108167 1.100576 0.604905
## Skewness 1.139004 0.518051 0.939134 0.690666 -1.004627
## Kurtosis 14.002972 0.817759 1.572179 -0.281440 0.695824
## --------------------------------------------------------
## white$drinkit: 1
## fixed.acidity volatile.acidity citric.acid residual.sugar
## nobs 1060.000000 1060.000000 1060.000000 1060.000000
## NAs 0.000000 0.000000 0.000000 0.000000
## Minimum 3.900000 0.080000 0.010000 0.800000
## Maximum 9.200000 0.760000 0.740000 19.250000
## 1. Quartile 6.200000 0.190000 0.280000 1.800000
## 3. Quartile 7.200000 0.320000 0.360000 7.400000
## Mean 6.725142 0.265349 0.326057 5.261509
## Median 6.700000 0.250000 0.310000 3.875000
## Sum 7128.650000 281.270000 345.620000 5577.200000
## SE Mean 0.023613 0.002890 0.002466 0.131792
## LCL Mean 6.678807 0.259678 0.321218 5.002906
## UCL Mean 6.771476 0.271020 0.330895 5.520113
## Variance 0.591050 0.008854 0.006446 18.411355
## Stdev 0.768798 0.094097 0.080288 4.290845
## Skewness 0.019010 0.874201 0.705643 1.080500
## Kurtosis 0.437095 1.098804 3.030720 0.098557
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## nobs 1060.000000 1060.000000 1.060000e+03
## NAs 0.000000 0.000000 0.000000e+00
## Minimum 0.012000 5.000000 3.400000e+01
## Maximum 0.135000 108.000000 2.290000e+02
## 1. Quartile 0.031000 25.000000 1.010000e+02
## 3. Quartile 0.044000 42.000000 1.460000e+02
## Mean 0.038160 34.550472 1.252453e+02
## Median 0.037000 33.000000 1.220000e+02
## Sum 40.450000 36623.500000 1.327600e+05
## SE Mean 0.000342 0.423776 1.005136e+00
## LCL Mean 0.037489 33.718936 1.232730e+02
## UCL Mean 0.038832 35.382008 1.272176e+02
## Variance 0.000124 190.361237 1.070916e+03
## Stdev 0.011145 13.797146 3.272485e+01
## Skewness 2.258097 1.015169 5.094610e-01
## Kurtosis 14.741475 2.729507 2.420860e-01
## density pH sulphates alcohol quality
## nobs 1060.000000 1060.000000 1060.000000 1060.000000 1060.000000
## NAs 0.000000 0.000000 0.000000 0.000000 0.000000
## Minimum 0.987110 2.840000 0.220000 8.500000 7.000000
## Maximum 1.000600 3.820000 1.080000 14.200000 9.000000
## 1. Quartile 0.990500 3.100000 0.400000 10.700000 7.000000
## 3. Quartile 0.993605 3.320000 0.580000 12.400000 7.000000
## Mean 0.992412 3.215132 0.500142 11.416022 7.174528
## Median 0.991730 3.200000 0.480000 11.500000 7.000000
## Sum 1051.956700 3408.040000 530.150000 12100.983333 7605.000000
## SE Mean 0.000085 0.004828 0.004086 0.038553 0.012040
## LCL Mean 0.992245 3.205659 0.492123 11.340372 7.150904
## UCL Mean 0.992579 3.224605 0.508160 11.491672 7.198152
## Variance 0.000008 0.024707 0.017701 1.575551 0.153647
## Stdev 0.002772 0.157185 0.133044 1.255209 0.391978
## Skewness 1.001822 0.235375 0.942026 -0.404076 1.945040
## Kurtosis 0.406561 -0.186210 1.063202 -0.522122 2.498491
Boxplots
attribute = fixed.acidity
bp_fixed.acidity <- ggplot(red, aes(x=drinkit, y=fixed.acidity ))
bp_fixed.acidity + geom_boxplot()
[Boxplot: fixed.acidity by drinkit (red wine dataset)]
bp_fixed.acidity <- ggplot(white, aes(x=drinkit, y=fixed.acidity ))
bp_fixed.acidity + geom_boxplot()
[Boxplot: fixed.acidity by drinkit (white wine dataset)]
attribute = volatile.acidity
bp_volatile.acidity <- ggplot(red, aes(x=drinkit, y=volatile.acidity ))
bp_volatile.acidity + geom_boxplot()
[Boxplot: volatile.acidity by drinkit (red wine dataset)]
bp_volatile.acidity <- ggplot(white, aes(x=drinkit, y=volatile.acidity ))
bp_volatile.acidity + geom_boxplot()
[Boxplot: volatile.acidity by drinkit (white wine dataset)]
attribute = citric.acid
bp_citric.acid <- ggplot(red, aes(x=drinkit, y=citric.acid ))
bp_citric.acid + geom_boxplot()
[Boxplot: citric.acid by drinkit (red wine dataset)]
bp_citric.acid <- ggplot(white, aes(x=drinkit, y=citric.acid ))
bp_citric.acid + geom_boxplot()
[Boxplot: citric.acid by drinkit (white wine dataset)]
attribute = residual.sugar
bp_residual.sugar <- ggplot(red, aes(x=drinkit, y= residual.sugar ))
bp_residual.sugar + geom_boxplot()
[Boxplot: residual.sugar by drinkit (red wine dataset)]
bp_residual.sugar <- ggplot(white, aes(x=drinkit, y= residual.sugar ))
bp_residual.sugar + geom_boxplot()
[Boxplot: residual.sugar by drinkit (white wine dataset)]
attribute = chlorides
bp_chlorides <- ggplot(red, aes(x=drinkit, y= chlorides ))
bp_chlorides + geom_boxplot()
[Boxplot: chlorides by drinkit (red wine dataset)]
bp_chlorides <- ggplot(white, aes(x=drinkit, y= chlorides ))
bp_chlorides + geom_boxplot()
[Boxplot: chlorides by drinkit (white wine dataset)]
attribute = free.sulfur.dioxide
bp_free.sulfur.dioxide <- ggplot(red, aes(x=drinkit, y= free.sulfur.dioxide ))
bp_free.sulfur.dioxide + geom_boxplot()
[Boxplot: free.sulfur.dioxide by drinkit (red wine dataset)]
bp_free.sulfur.dioxide <- ggplot(white, aes(x=drinkit, y= free.sulfur.dioxide ))
bp_free.sulfur.dioxide + geom_boxplot()
[Boxplot: free.sulfur.dioxide by drinkit (white wine dataset)]
attribute = total.sulfur.dioxide
bp_total.sulfur.dioxide <- ggplot(red, aes(x=drinkit, y= total.sulfur.dioxide ))
bp_total.sulfur.dioxide + geom_boxplot()
[Boxplot: total.sulfur.dioxide by drinkit (red wine dataset)]
bp_total.sulfur.dioxide <- ggplot(white, aes(x=drinkit, y= total.sulfur.dioxide ))
bp_total.sulfur.dioxide + geom_boxplot()
[Boxplot: total.sulfur.dioxide by drinkit (white wine dataset)]
attribute = sulphates
bp_sulphates <- ggplot(red, aes(x=drinkit, y= sulphates ))
bp_sulphates + geom_boxplot()
[Boxplot: sulphates by drinkit (red wine dataset)]
bp_sulphates <- ggplot(white, aes(x=drinkit, y= sulphates ))
bp_sulphates + geom_boxplot()
[Boxplot: sulphates by drinkit (white wine dataset)]
attribute = alcohol
bp_alcohol <- ggplot(red, aes(x=drinkit, y= alcohol ))
bp_alcohol + geom_boxplot()
[Boxplot: alcohol by drinkit (red wine dataset)]
bp_alcohol <- ggplot(white, aes(x=drinkit, y= alcohol ))
bp_alcohol + geom_boxplot()
[Boxplot: alcohol by drinkit (white wine dataset)]
PREDICTIVE MODELING
Logistic regression - red wine dataset
Model 1
redmodel1 <- glm(drinkit ~ fixed.acidity+volatile.acidity+citric.acid+residual.sugar+
total.sulfur.dioxide+chlorides+alcohol, family="binomial", data= redtrain)
summary(redmodel1)
##
## Call:
## glm(formula = drinkit ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + total.sulfur.dioxide + chlorides + alcohol,
## family = "binomial", data = redtrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6103 -0.4541 -0.2492 -0.1403 2.6757
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.12828 1.51262 -6.696 2.14e-11 ***
## fixed.acidity 0.12368 0.08316 1.487 0.13695
## volatile.acidity -4.49437 0.91316 -4.922 8.58e-07 ***
## citric.acid 0.31682 0.97102 0.326 0.74422
## residual.sugar 0.10574 0.06983 1.514 0.12996
## total.sulfur.dioxide -0.01329 0.00419 -3.172 0.00151 **
## chlorides -6.34684 4.16619 -1.523 0.12765
## alcohol 0.91992 0.10404 8.842 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 889.23 on 1118 degrees of freedom
## Residual deviance: 634.16 on 1111 degrees of freedom
## AIC: 650.16
##
## Number of Fisher Scoring iterations: 6
exp(cbind(OR = coef(redmodel1), confint(redmodel1))) # odds ratios and 95% CI
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 3.993409e-05 1.934350e-06 0.0007319036
## fixed.acidity 1.131655e+00 9.622602e-01 1.3339255345
## volatile.acidity 1.117174e-02 1.792495e-03 0.0642802846
## citric.acid 1.372752e+00 2.016707e-01 9.1164704929
## residual.sugar 1.111529e+00 9.597369e-01 1.2685734018
## total.sulfur.dioxide 9.867968e-01 9.783435e-01 0.9945651864
## chlorides 1.752269e-03 2.005370e-07 2.3287091038
## alcohol 2.509101e+00 2.055529e+00 3.0926740162
Model 1 performance
redtrain$drinkitYhat <- predict(redmodel1, type = "response") # generate yhat values on train df
redtrain$drinkitYhat <- ifelse(redtrain$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix( redtrain$drinkitYhat, redtrain$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 932 110
## 1 35 42
##
## Accuracy : 0.8704
## 95% CI : (0.8493, 0.8896)
## No Information Rate : 0.8642
## P-Value [Acc > NIR] : 0.2878
##
## Kappa : 0.3032
## Mcnemar's Test P-Value : 7.978e-10
##
## Sensitivity : 0.9638
## Specificity : 0.2763
## Pos Pred Value : 0.8944
## Neg Pred Value : 0.5455
## Prevalence : 0.8642
## Detection Rate : 0.8329
## Detection Prevalence : 0.9312
## Balanced Accuracy : 0.6201
##
## 'Positive' Class : 0
##
auc(roc(redtrain$drinkit, redtrain$drinkitYhat), levels = levels(as.factor(redtrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.6201
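Note that drinkitYhat was thresholded to 0/1 before the call above, so the reported AUC is that of a single
operating point and equals the balanced accuracy (0.6201). A minimal sketch of the conventional AUROC
computed from the raw predicted probabilities instead (illustrative only, not part of the original run):
probs <- predict(redmodel1, type = "response") # raw probabilities, no 0.5 cutoff
auc(roc(redtrain$drinkit, probs)) # AUROC across all thresholds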
Model 2
redmodel2 <- glm(drinkit ~ volatile.acidity+total.sulfur.dioxide+alcohol, family="binomial",
data= redtrain)
summary(redmodel2)
##
## Call:
## glm(formula = drinkit ~ volatile.acidity + total.sulfur.dioxide +
## alcohol, family = "binomial", data = redtrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0665 -0.4559 -0.2634 -0.1452 2.6374
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.90155 1.12801 -7.891 2.99e-15 ***
## volatile.acidity -5.21126 0.71903 -7.248 4.24e-13 ***
## total.sulfur.dioxide -0.01379 0.00409 -3.371 0.000749 ***
## alcohol 0.92389 0.09549 9.675 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 889.23 on 1118 degrees of freedom
## Residual deviance: 644.62 on 1115 degrees of freedom
## AIC: 652.62
##
## Number of Fisher Scoring iterations: 6
exp(cbind(OR = coef(redmodel2), confint(redmodel2))) # odds ratios and 95% CI
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.0001361775 1.429867e-05 0.001198158
## volatile.acidity 0.0054547915 1.276332e-03 0.021452206
## total.sulfur.dioxide 0.9863066898 9.780499e-01 0.993846316
## alcohol 2.5190775373 2.097376e+00 3.051543824
Model 2 performance
redtrain$drinkitYhat <- predict(redmodel2, type = "response") # generate yhat values on train df
redtrain$drinkitYhat <- ifelse(redtrain$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix( redtrain$drinkitYhat, redtrain$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 934 112
## 1 33 40
##
## Accuracy : 0.8704
## 95% CI : (0.8493, 0.8896)
## No Information Rate : 0.8642
## P-Value [Acc > NIR] : 0.2878
##
## Kappa : 0.2933
## Mcnemar's Test P-Value : 9.323e-11
##
## Sensitivity : 0.9659
## Specificity : 0.2632
## Pos Pred Value : 0.8929
## Neg Pred Value : 0.5479
## Prevalence : 0.8642
## Detection Rate : 0.8347
## Detection Prevalence : 0.9348
## Balanced Accuracy : 0.6145
##
## 'Positive' Class : 0
##
auc(roc(redtrain$drinkit, redtrain$drinkitYhat), levels = levels(as.factor(redtrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.6145
Model 3
redmodel3 <- glm(drinkit ~ alcohol, family="binomial", data= redtrain)
summary(redmodel3)
##
## Call:
## glm(formula = drinkit ~ alcohol, family = "binomial", data = redtrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2384 -0.5138 -0.3279 -0.2540 2.6650
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -13.11263 1.00437 -13.06 <2e-16 ***
## alcohol 1.04246 0.08978 11.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 889.23 on 1118 degrees of freedom
## Residual deviance: 724.88 on 1117 degrees of freedom
## AIC: 728.88
##
## Number of Fisher Scoring iterations: 5
exp(cbind(OR = coef(redmodel3), confint(redmodel3))) # odds ratios and 95% CI
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 2.019563e-06 2.659407e-07 1.371582e-05
## alcohol 2.836199e+00 2.388290e+00 3.397552e+00
Model 3 performance
redtrain$drinkitYhat <- predict(redmodel3, type = "response") # generate yhat values on train df
redtrain$drinkitYhat <- ifelse(redtrain$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix( redtrain$drinkitYhat, redtrain$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 943 131
## 1 24 21
##
## Accuracy : 0.8615
## 95% CI : (0.8398, 0.8812)
## No Information Rate : 0.8642
## P-Value [Acc > NIR] : 0.6236
##
## Kappa : 0.1611
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9752
## Specificity : 0.1382
## Pos Pred Value : 0.8780
## Neg Pred Value : 0.4667
## Prevalence : 0.8642
## Detection Rate : 0.8427
## Detection Prevalence : 0.9598
## Balanced Accuracy : 0.5567
##
## 'Positive' Class : 0
##
auc(roc(redtrain$drinkit, redtrain$drinkitYhat), levels = levels(as.factor(redtrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.5567
Use best model (model 2) for testset prediction
redtest$drinkitYhat <- predict(redmodel2, newdata = redtest, type = "response") # predict values on test df
redtest$drinkitYhat <- ifelse(redtest$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix(redtest$drinkitYhat, redtest$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 402 49
## 1 13 16
##
## Accuracy : 0.8708
## 95% CI : (0.8375, 0.8995)
## No Information Rate : 0.8646
## P-Value [Acc > NIR] : 0.3749
##
## Kappa : 0.2803
## Mcnemar's Test P-Value : 8.789e-06
##
## Sensitivity : 0.9687
## Specificity : 0.2462
## Pos Pred Value : 0.8914
## Neg Pred Value : 0.5517
## Prevalence : 0.8646
## Detection Rate : 0.8375
## Detection Prevalence : 0.9396
## Balanced Accuracy : 0.6074
##
## 'Positive' Class : 0
##
auc(roc(redtest$drinkit, redtest$drinkitYhat), levels = levels(as.factor(redtrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.6074
Logistic regression - white wine dataset
Model 1
whitemodel1 <- glm(drinkit ~ fixed.acidity+volatile.acidity+citric.acid+residual.sugar+
total.sulfur.dioxide+chlorides+alcohol, family="binomial", data= whitetrain)
summary(whitemodel1)
##
## Call:
## glm(formula = drinkit ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + total.sulfur.dioxide + chlorides + alcohol,
## family = "binomial", data = whitetrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9694 -0.6635 -0.4286 -0.1833 2.8909
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.906528 0.805489 -11.057 < 2e-16 ***
## fixed.acidity -0.039714 0.058123 -0.683 0.494
## volatile.acidity -4.905984 0.578866 -8.475 < 2e-16 ***
## citric.acid -0.717323 0.460893 -1.556 0.120
## residual.sugar 0.047718 0.011470 4.160 3.18e-05 ***
## total.sulfur.dioxide 0.001578 0.001356 1.164 0.245
## chlorides -18.758762 4.513067 -4.157 3.23e-05 ***
## alcohol 0.899407 0.052430 17.154 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3581.9 on 3428 degrees of freedom
## Residual deviance: 2942.3 on 3421 degrees of freedom
## AIC: 2958.3
##
## Number of Fisher Scoring iterations: 6
exp(cbind(OR = coef(whitemodel1), confint(whitemodel1))) # odds ratios and 95% CI
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 1.355014e-04 2.768784e-05 6.517844e-04
## fixed.acidity 9.610643e-01 8.571076e-01 1.076523e+00
## volatile.acidity 7.402156e-03 2.339702e-03 2.263315e-02
## citric.acid 4.880570e-01 1.948869e-01 1.193235e+00
## residual.sugar 1.048875e+00 1.025407e+00 1.072573e+00
## total.sulfur.dioxide 1.001579e+00 9.989146e-01 1.004239e+00
## chlorides 7.131371e-09 7.522433e-13 3.505045e-05
## alcohol 2.458146e+00 2.220357e+00 2.727226e+00
Model 1 performance
whitetrain$drinkitYhat <- predict(whitemodel1, type = "response") # generate yhat values on train df
whitetrain$drinkitYhat <- ifelse(whitetrain$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix( whitetrain$drinkitYhat, whitetrain$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2532 565
## 1 155 177
##
## Accuracy : 0.79
## 95% CI : (0.776, 0.8036)
## No Information Rate : 0.7836
## P-Value [Acc > NIR] : 0.1865
##
## Kappa : 0.2261
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9423
## Specificity : 0.2385
## Pos Pred Value : 0.8176
## Neg Pred Value : 0.5331
## Prevalence : 0.7836
## Detection Rate : 0.7384
## Detection Prevalence : 0.9032
## Balanced Accuracy : 0.5904
##
## 'Positive' Class : 0
##
auc(roc(whitetrain$drinkit, whitetrain$drinkitYhat), levels = levels(as.factor(whitetrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.5904
Model 2
whitemodel2 <- glm(drinkit ~ volatile.acidity+residual.sugar+chlorides+alcohol, family="binomial",
data= whitetrain)
summary(whitemodel2)
##
## Call:
## glm(formula = drinkit ~ volatile.acidity + residual.sugar + chlorides +
## alcohol, family = "binomial", data = whitetrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9024 -0.6594 -0.4260 -0.1902 2.8269
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.09922 0.63146 -14.410 < 2e-16 ***
## volatile.acidity -4.68741 0.56405 -8.310 < 2e-16 ***
## residual.sugar 0.04845 0.01115 4.345 1.39e-05 ***
## chlorides -18.36325 4.41268 -4.161 3.16e-05 ***
## alcohol 0.88237 0.04981 17.714 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3581.9 on 3428 degrees of freedom
## Residual deviance: 2947.1 on 3424 degrees of freedom
## AIC: 2957.1
##
## Number of Fisher Scoring iterations: 5
exp(cbind(OR = coef(whitemodel2), confint(whitemodel2))) # odds ratios and 95% CI
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 1.117534e-04 3.232733e-05 3.849781e-04
## volatile.acidity 9.210485e-03 2.998984e-03 2.737557e-02
## residual.sugar 1.049642e+00 1.026810e+00 1.072716e+00
## chlorides 1.059114e-08 1.369508e-12 4.332245e-05
## alcohol 2.416612e+00 2.193714e+00 2.667033e+00
Model 2 performance
whitetrain$drinkitYhat <- predict(whitemodel2, type = "response") # generate yhat values on train df
whitetrain$drinkitYhat <- ifelse(whitetrain$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix( whitetrain$drinkitYhat, whitetrain$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2536 564
## 1 151 178
##
## Accuracy : 0.7915
## 95% CI : (0.7775, 0.805)
## No Information Rate : 0.7836
## P-Value [Acc > NIR] : 0.1357
##
## Kappa : 0.23
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9438
## Specificity : 0.2399
## Pos Pred Value : 0.8181
## Neg Pred Value : 0.5410
## Prevalence : 0.7836
## Detection Rate : 0.7396
## Detection Prevalence : 0.9041
## Balanced Accuracy : 0.5918
##
## 'Positive' Class : 0
##
auc(roc(whitetrain$drinkit, whitetrain$drinkitYhat), levels = levels(as.factor(whitetrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.5918
Use best model (model 2) for testset prediction
whitetest$drinkitYhat <- predict(whitemodel2, newdata = whitetest, type = "response")
whitetest$drinkitYhat <- ifelse(whitetest$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix(whitetest$drinkitYhat, whitetest$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1081 242
## 1 70 76
##
## Accuracy : 0.7876
## 95% CI : (0.7658, 0.8083)
## No Information Rate : 0.7835
## P-Value [Acc > NIR] : 0.3657
##
## Kappa : 0.2215
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9392
## Specificity : 0.2390
## Pos Pred Value : 0.8171
## Neg Pred Value : 0.5205
## Prevalence : 0.7835
## Detection Rate : 0.7359
## Detection Prevalence : 0.9006
## Balanced Accuracy : 0.5891
##
## 'Positive' Class : 0
##
auc(roc(whitetest$drinkit, whitetest$drinkitYhat), levels = levels(as.factor(whitetrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.5891
36
Random forest algorithm - red wine dataset
set.seed(77)
rf1 <- randomForest(drinkit ~ volatile.acidity+total.sulfur.dioxide+alcohol, type = classification,
data=redtrain, ntree = 1000, importance = TRUE,confusion = TRUE)
round(importance(rf1), 1)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## volatile.acidity 18.7 54.4 44.3 81.8
## total.sulfur.dioxide 10.6 37.5 28.9 70.7
## alcohol 24.6 73.7 59.5 89.5
print(rf1)
##
## Call:
## randomForest(formula = drinkit ~ volatile.acidity + total.sulfur.dioxide + alcohol, data = redt
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 12.87%
## Confusion matrix:
## 0 1 class.error
## 0 918 49 0.05067218
## 1 95 57 0.62500000
Random forest using only alcohol attribute
set.seed(77)
rf2 <- randomForest(drinkit ~ alcohol, type = classification, data=redtrain, ntree = 1000,
importance = TRUE,confusion = TRUE)
round(importance(rf2), 1)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## alcohol 67.5 66.4 67.6 72.2
print(rf2)
##
## Call:
## randomForest(formula = drinkit ~ alcohol, data = redtrain, type = classification, ntree = 1000,
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 15.91%
## Confusion matrix:
## 0 1 class.error
## 0 922 45 0.04653568
## 1 133 19 0.87500000
Testset prediction
rf1predict <-predict(rf1, redtest, type="response")
Model performance on testset
confusionMatrix( rf1predict, redtest$drinkit ) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 397 27
## 1 18 38
##
## Accuracy : 0.9062
## 95% CI : (0.8766, 0.9308)
## No Information Rate : 0.8646
## P-Value [Acc > NIR] : 0.003355
##
## Kappa : 0.5748
## Mcnemar's Test P-Value : 0.233038
##
## Sensitivity : 0.9566
## Specificity : 0.5846
## Pos Pred Value : 0.9363
## Neg Pred Value : 0.6786
## Prevalence : 0.8646
## Detection Rate : 0.8271
## Detection Prevalence : 0.8833
## Balanced Accuracy : 0.7706
##
## 'Positive' Class : 0
##
rf1predict<- as.numeric(rf1predict)
auc(roc(redtest$drinkit, rf1predict )) # calculate AUROC curve
## Area under the curve: 0.7706
Random forest algorithm - white wine dataset
set.seed(77)
rf3 <- randomForest(drinkit ~ volatile.acidity+residual.sugar+chlorides+alcohol, type = classification,
data=whitetrain, ntree = 1000, importance = TRUE,confusion = TRUE)
round(importance(rf3), 1)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## volatile.acidity 45.4 128.0 113.4 248.4
## residual.sugar 48.0 105.0 110.3 295.5
## chlorides 32.6 116.0 101.8 251.7
## alcohol 68.0 226.6 179.2 359.4
print(rf3)
##
## Call:
## randomForest(formula = drinkit ~ volatile.acidity + residual.sugar + chlorides + alcohol, data
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 14.44%
## Confusion matrix:
## 0 1 class.error
## 0 2521 166 0.06177894
## 1 329 413 0.44339623
Testset prediction
rf3predict <-predict(rf3, whitetest, type="response")
Model performance on testset
confusionMatrix( rf3predict, whitetest$drinkit ) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1075 140
## 1 76 178
##
## Accuracy : 0.853
## 95% CI : (0.8338, 0.8707)
## No Information Rate : 0.7835
## P-Value [Acc > NIR] : 9.184e-12
##
## Kappa : 0.5325
## Mcnemar's Test P-Value : 1.814e-05
##
## Sensitivity : 0.9340
## Specificity : 0.5597
## Pos Pred Value : 0.8848
## Neg Pred Value : 0.7008
## Prevalence : 0.7835
## Detection Rate : 0.7318
## Detection Prevalence : 0.8271
## Balanced Accuracy : 0.7469
##
## 'Positive' Class : 0
##
rf3predict<- as.numeric(rf3predict)
auc(roc(whitetest$drinkit, rf3predict )) # calculate AUROC curve
## Area under the curve: 0.7469
RESULTS SUMMARY
Descriptive statistics and visualization of the data were employed to select the variables to include in the
predictive models of high quality wines. Several models were compared using multivariate logistic
regression. The best model (model 2) fit on the red wine training dataset included the variables
volatile.acidity, total.sulfur.dioxide, and alcohol. Results of the confusion matrix and AUROC calculation
on the training set were as follows (see Model 2 performance above):
Please note that after checking the raw data I found that, although the confusion matrices reported
in the caret package output above are correct, the Sensitivity and Specificity labels are inverted,
because caret treated 0 (undrinkable) as the ‘Positive’ class. The correct values are shown below,
followed by a short sketch of how to recompute them directly.
1. Accuracy: 0.8704
2. Sensitivity (TPR): 0.2632
3. Specificity (TNR): 0.9659
4. FPR (1 - Specificity): 0.0341
5. Area under the curve: 0.6145
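A minimal sketch of how the correctly oriented values can be recovered directly in caret, assuming
redtrain$drinkitYhat still holds the Model 2 training predictions from the Model 2 performance section
(the positive argument tells confusionMatrix which factor level is the event of interest):
# Recompute with "1" (drinkable) as the positive class, so the
# Sensitivity and Specificity labels are no longer inverted
confusionMatrix(factor(redtrain$drinkitYhat, levels = c("0", "1")),
                redtrain$drinkit, positive = "1")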
Results of this model on the red wine test set were:
1. Accuracy: 0.8708
2. Sensitivity: 0.2462
3. Specificity: 0.9687
4. FPR (1 - Specificity): 0.0313
5. Area under the curve: 0.6074
The best white wine logit model included the variables volatile.acidity, residual.sugar, chlorides, and alcohol.
Results of this model on the white wine training set were (see Model 2 performance above):
1. Accuracy: 0.7915
2. Sensitivity: 0.2399
3. Specificity: 0.9438
4. FPR (1 - Specificity): 0.0562
5. Area under the curve: 0.5918
Results of this model on the white wine test set were:
1. Accuracy: 0.7876
2. Sensitivity: 0.2390
3. Specificity: 0.9392
4. FPR (1 - Specificity): 0.0608
5. Area under the curve: 0.5891
For direct comparison of the logit and random forest algorithms, the best models defined using logistic
regression were then re-fit as random forests using 1000 trees. Red wine training set (out-of-bag) results were:
1. Accuracy: 0.8713
2. Sensitivity: 0.5377
3. Specificity: 0.9062
4. FPR (1 - Specificity): 0.0938
5. Area under the curve: N/A using the randomForest package
Results of this model on the red wine test set were:
1. Accuracy: 0.9062
2. Sensitivity: 0.5846
3. Specificity: 0.9566
4. FPR (1 - Specificity): 0.0434
5. Area under the curve: 0.7706
White wine training set (out-of-bag) results were:
1. Accuracy: 0.8556
2. Sensitivity: 0.7133
3. Specificity: 0.8846
4. FPR (1 - Specificity): 0.1154
5. Area under the curve: N/A using the randomForest package
Results of this model on the white wine test set were:
1. Accuracy: 0.8530
2. Sensitivity: 0.5597
3. Specificity: 0.9340
4. FPR (1 - Specificity): 0.0660
5. Area under the curve: 0.7469
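The “N/A” entries above reflect that the randomForest package does not itself report an AUROC. A training
AUROC could nevertheless be derived from the out-of-bag class-vote proportions together with pROC; a
minimal sketch for the red wine model (illustrative only, not run in this analysis):
oob_scores <- rf1$votes[, "1"] # OOB vote fraction for class "1" as a probability-like score
auc(roc(redtrain$drinkit, oob_scores)) # training AUROC from OOB votes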
DISCUSSION
Neither algorithm performed very well on these datasets, although both performed better for negative
predictive value (i.e., predicting the bad wines). This makes sense, since the physicochemical attributes of
each wine in these datasets are probably more indicative of bad wine than good wine. For example, high
sulfur or acidity can easily spoil an otherwise good wine, but these components may combine more
subtly to affect a wine’s flavor. I think what makes a wine taste good is highly subjective anyway, which
probably makes the class label harder to predict using these data. Interestingly, alcohol content was the
single most predictive variable (doesn’t everything taste better when you’re intoxicated?).
Both algorithms performed better on the white wine dataset, which was 3 times larger than the red wine
dataset. This result reinforces a major tenet of data science: more data is superior to better algorithms.
In comparison, the random forest algorithm somewhat outperformed the logit models. This was expected,
as decision trees usually outperform logistic regression in my experience. Decision trees generally perform
well since they are highly iterative, robust to noisy data including outliers (of which there were many in
these data), and have good predictive power. In contrast, logit is also fairly robust but is more sensitive to
nonlinearity in the independent variables, and some of these data were skewed.
There are several things I could do to improve the analysis. First, some of the variables were skewed and had
many outliers, especially in the white wine dataset, but I didn’t perform any type of transformation. This
would have linearized the data and improved the logistic regression performance, as I stated. Moreover,
performing stepwise regression optimizes model fitting, and categorizing continuous variables improves
linearity in the independent variables, but neither of these methods was performed. Second, I didn’t perform
any pruning of the random forest trees to try to improve performance. Lastly, both R packages have many
functions that can be employed to optimize algorithm performance, but in general I didn’t make use of
these. A sketch of what some of these improvements might look like follows below.
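A minimal sketch under stated assumptions (the redtrain data frame and redmodel1 fit from above; log1p is
used only as an example transformation, and the tuneRF settings are arbitrary):
# Example: log-transform a right-skewed predictor before refitting the logit
redtrain$log.tsd <- log1p(redtrain$total.sulfur.dioxide)
redmodelT <- glm(drinkit ~ volatile.acidity + log.tsd + alcohol,
                 family = "binomial", data = redtrain)
# Example: stepwise selection by AIC starting from the full model
redmodelStep <- step(redmodel1, direction = "both", trace = FALSE)
# Example: tune the mtry parameter of the random forest
tuneRF(redtrain[, c("volatile.acidity", "total.sulfur.dioxide", "alcohol")],
       redtrain$drinkit, ntreeTry = 500, stepFactor = 2, improve = 0.01)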
CONCLUSIONS
1. The random forest algorithm outperformed logit.
2. Both algorithms performed better on the larger white wine dataset; getting more data is the best way
to improve your models.
3. What makes a wine taste good is subjective.
4. Good wines, and especially reds (14% of reds vs. 22% of whites rated drinkable), are hard to find, at
least in the northern Portugal “Vinho Verde” region.
REFERENCES
Datasets
1. Wine Quality Data Set from UCI ML Repository
2. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining
from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
dplyr resources
www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
www.youtube.com/watch?v=jWjqLW-u3hc&feature=youtu.be
www.youtube.com/watch?v=2mh1PqfsXVI
groups.google.com/forum/#!topic/manipulatr/Z46zwYXNh0g
stackoverflow.com/questions/22850026/filtering-row-which-contains-a-certain-string-using-dplyr
stackoverflow.com/questions/13520515/command-to-remove-row-from-a-data-frame
Logistic regression resources
cran.r-project.org/web/packages/glm2/glm2.pdf
www.kaggle.com/eyebervil/titanic/titanic-simple-logit-with-interaction
cran.r-project.org/web/packages/caret/vignettes/caret.pdf
cran.r-project.org/web/packages/caret/caret.pdf
cran.r-project.org/web/packages/pROC/pROC.pdf
stats.stackexchange.com/questions/87234/aic-values-and-their-use-in-stepwise-model-selection-for-a-simple-linear-regress
Random forest algorithm resources
cran.r-project.org/web/packages/randomForest/randomForest.pdf
campus.datacamp.com/courses/kaggle-r-tutorial-on-machine-learning/chapter-3-improving-your-predictions-through-random-
ex=1
R Markdown resources
rmarkdown.rstudio.com/
www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf
www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf
42

IU Applied Machine Learning Class Final Project: ML Methods for Predicting Wine Preference

  • 1.
    Applied Machine LearningFinal Project Jim Nelson December 5, 2016 OVERVIEW Aim Compare performace of R Machine Learning packages for logistic regression and random forest algorithms. Datasets Datasets used in this study were obtained from the UCI Machine Learning Dataset Repository. Two datasets of red and white wine variants of the northern Portugal “Vinho Verde” region were used. Each dataset comprises 11 physicochemical attributes of each wine such as acidity, residual sugars and percent alcohol(1). The red dataset includes 1599 different wines, while the white dataset has a total of 4898 wines. Specific data about grape types, wine brand, wine selling price, etc, have been omitted. Each wine was given a quality score which graded the wine in a scale that ranges from 0 (very bad) to 10 (excellent). Importantly, the score was derived subjectively from a panel of at least 3 experts using blind taste tests. The final quality score is given by the median of these evaluations(2). The binary class label “drinkit” was created for this analysis based on a quality score of >= 7. The datasets were submitted on 10/7/2009 by Paulo Cortez, University of Minho, Guimarães, Portugal. Software and Computing Environment The study was performed locally on a HP Pavilion 23 All-in-One with a 64 Bit OS running Windows 10 with 4.0 GB Ram and an AMD AP-5300 APU processor with Radeon HD graphics. The study was performed in R (R version 33.2.4 (2016-03-10) – “Very Secure Dishes”. Platform: x86_64-w64-mingw32/x64 (64-bit))Copyright (C) 2015 The R Foundation for Statistical Computing) using the open source R platform R Studio (Version 0.98.1102 - 2009-2014 RStudio, Inc.) The following R software packages with corresponding manuals and vignettes were obtained from the The Comprehensive R Archive Network (CRAN): Data Manipulation and Visualization: dplyr: A Grammar of Data Manipulation (v.0.4.3); ggplot2: An Implementation of the Grammar of Graphics (v.1.01) Logistic Regression Analysis: glm2: Fitting Generalized Linear Models (v. 1.1.2) Classification statistics and AUROC analysis: caret: Classification and Regression Training (v.6.0-62); pROC: Display and Analyze ROC Curves (v. 1.8) Random Forest Modeling: randomForest: Breiman and Cutler’s Random Forests for Classification and Regression( v. 4.6-10) 1
  • 2.
    DATA CURATION Load Datasetsand R Packages Load packages for data transformation library(dplyr) library(caTools) Load packages for data visualization library(ggplot2) library(fBasics) Load ML packages library(glm2) library(caret) library(pROC) library(randomForest) Load the datasets red <- read.csv ("winequality-red.csv", header= TRUE, sep=";") str(red) ## 'data.frame': 1599 obs. of 12 variables: ## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ... ## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ... ## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ... ## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ... ## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ... ## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ... ## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ... ## $ density : num 0.998 0.997 0.997 0.998 0.998 ... ## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ... ## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ... ## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ... ## $ quality : int 5 5 5 6 5 5 5 7 7 5 ... white <- read.csv ("winequality-white.csv", header= TRUE, sep=";") str(white) ## 'data.frame': 4898 obs. of 12 variables: ## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ... ## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ... ## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ... ## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ... ## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ... ## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ... 2
  • 3.
    ## $ total.sulfur.dioxide:num 170 132 97 186 186 97 136 170 132 129 ... ## $ density : num 1.001 0.994 0.995 0.996 0.996 ... ## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ... ## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ... ## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ... ## $ quality : int 6 6 6 6 6 6 6 6 6 6 ... Create class labels red<- red %>% mutate (drinkit = as.factor(as.numeric(quality >= 7 ))) class(red$drinkit) ## [1] "factor" white<- white %>% mutate (drinkit = as.factor(as.numeric(quality >= 7 ))) class(white$drinkit) ## [1] "factor" Create test and train datasets set.seed(123) sample = sample.split(red$drinkit, SplitRatio = .70) redtrain = subset(red, sample == TRUE) redtest = subset(red, sample == FALSE) sample <-sample.split(white$drinkit, SplitRatio = .70) whitetrain <- subset(white, sample == TRUE) whitetest <- subset(white, sample == FALSE) dim(redtrain) #70% for training ## [1] 1119 13 dim(redtest) #30% for test ## [1] 480 13 dim(whitetrain) ## [1] 3429 13 dim(whitetest) ## [1] 1469 13 3
  • 4.
    DATA EXPLORATION Descriptive statistics by(red[1:12][,c(1:12)],red$drinkit, basicStats) ## red$drinkit: 0 ## fixed.acidity volatile.acidity citric.acid residual.sugar ## nobs 1382.000000 1382.000000 1382.000000 1382.000000 ## NAs 0.000000 0.000000 0.000000 0.000000 ## Minimum 4.600000 0.160000 0.000000 0.900000 ## Maximum 15.900000 1.580000 1.000000 15.500000 ## 1. Quartile 7.100000 0.420000 0.082500 1.900000 ## 3. Quartile 9.100000 0.650000 0.400000 2.600000 ## Mean 8.236831 0.547022 0.254407 2.512120 ## Median 7.800000 0.540000 0.240000 2.200000 ## Sum 11383.300000 755.985000 351.590000 3471.750000 ## SE Mean 0.045265 0.004743 0.005102 0.038084 ## LCL Mean 8.148036 0.537717 0.244398 2.437412 ## UCL Mean 8.325626 0.556327 0.264415 2.586829 ## Variance 2.831568 0.031095 0.035973 2.004428 ## Stdev 1.682726 0.176337 0.189665 1.415778 ## Skewness 1.071064 0.670310 0.422857 4.878595 ## Kurtosis 1.337482 1.438348 -0.673833 32.003919 ## chlorides free.sulfur.dioxide total.sulfur.dioxide ## nobs 1382.000000 1382.000000 1382.000000 ## NAs 0.000000 0.000000 0.000000 ## Minimum 0.034000 1.000000 6.000000 ## Maximum 0.611000 72.000000 165.000000 ## 1. Quartile 0.071000 8.000000 23.000000 ## 3. Quartile 0.091000 22.000000 65.000000 ## Mean 0.089281 16.172214 48.285818 ## Median 0.080000 14.000000 39.500000 ## Sum 123.386000 22350.000000 66731.000000 ## SE Mean 0.001321 0.281577 0.876540 ## LCL Mean 0.086689 15.619850 46.566324 ## UCL Mean 0.091872 16.724578 50.005312 ## Variance 0.002412 109.572421 1061.821580 ## Stdev 0.049113 10.467685 32.585604 ## Skewness 5.547353 1.224203 1.110405 ## Kurtosis 38.898772 2.056348 0.704994 ## density pH sulphates alcohol quality ## nobs 1382.000000 1382.000000 1382.000000 1382.000000 1382.000000 ## NAs 0.000000 0.000000 0.000000 0.000000 0.000000 ## Minimum 0.990070 2.740000 0.330000 8.400000 3.000000 ## Maximum 1.003690 4.010000 2.000000 14.900000 6.000000 ## 1. Quartile 0.995785 3.210000 0.540000 9.500000 5.000000 ## 3. Quartile 0.997900 3.410000 0.700000 10.900000 6.000000 ## Mean 0.996859 3.314616 0.644754 10.251037 5.408828 ## Median 0.996800 3.310000 0.600000 10.000000 5.000000 ## Sum 1377.659370 4580.800000 891.050000 14166.933333 7475.000000 ## SE Mean 0.000049 0.004146 0.004590 0.026084 0.016186 ## LCL Mean 0.996764 3.306483 0.635750 10.199869 5.377076 4
  • 5.
    ## UCL Mean0.996955 3.322750 0.653758 10.302205 5.440580 ## Variance 0.000003 0.023758 0.029114 0.940248 0.362065 ## Stdev 0.001808 0.154135 0.170629 0.969664 0.601719 ## Skewness 0.117601 0.168576 2.774057 1.058300 -0.673203 ## Kurtosis 1.091744 0.843706 13.810053 0.921817 0.546000 ## -------------------------------------------------------- ## red$drinkit: 1 ## fixed.acidity volatile.acidity citric.acid residual.sugar ## nobs 217.000000 217.000000 217.000000 217.000000 ## NAs 0.000000 0.000000 0.000000 0.000000 ## Minimum 4.900000 0.120000 0.000000 1.200000 ## Maximum 15.600000 0.915000 0.760000 8.900000 ## 1. Quartile 7.400000 0.300000 0.300000 2.000000 ## 3. Quartile 10.100000 0.490000 0.490000 2.700000 ## Mean 8.847005 0.405530 0.376498 2.708756 ## Median 8.700000 0.370000 0.400000 2.300000 ## Sum 1919.800000 88.000000 81.700000 587.800000 ## SE Mean 0.135767 0.009841 0.013199 0.092528 ## LCL Mean 8.579406 0.386134 0.350482 2.526382 ## UCL Mean 9.114603 0.424926 0.402514 2.891130 ## Variance 3.999910 0.021014 0.037806 1.857840 ## Stdev 1.999977 0.144963 0.194438 1.363026 ## Skewness 0.460276 0.987628 -0.373539 2.173338 ## Kurtosis 0.313564 0.884353 -0.475447 4.660784 ## chlorides free.sulfur.dioxide total.sulfur.dioxide density ## nobs 217.000000 217.000000 217.000000 217.000000 ## NAs 0.000000 0.000000 0.000000 0.000000 ## Minimum 0.012000 3.000000 7.000000 0.990640 ## Maximum 0.358000 54.000000 289.000000 1.003200 ## 1. Quartile 0.062000 6.000000 17.000000 0.994700 ## 3. Quartile 0.085000 18.000000 43.000000 0.997350 ## Mean 0.075912 13.981567 34.889401 0.996030 ## Median 0.073000 11.000000 27.000000 0.995720 ## Sum 16.473000 3034.000000 7571.000000 216.138570 ## SE Mean 0.001933 0.694771 2.211148 0.000149 ## LCL Mean 0.072102 12.612168 30.531213 0.995736 ## UCL Mean 0.079723 15.350966 39.247589 0.996325 ## Variance 0.000811 104.747344 1060.950674 0.000005 ## Stdev 0.028480 10.234615 32.572238 0.002201 ## Skewness 5.035644 1.456370 4.439771 0.262062 ## Kurtosis 44.818747 1.889996 29.222588 0.292770 ## pH sulphates alcohol quality ## nobs 217.000000 217.000000 217.000000 217.000000 ## NAs 0.000000 0.000000 0.000000 0.000000 ## Minimum 2.880000 0.390000 9.200000 7.000000 ## Maximum 3.780000 1.360000 14.000000 8.000000 ## 1. Quartile 3.200000 0.650000 10.800000 7.000000 ## 3. Quartile 3.380000 0.820000 12.200000 7.000000 ## Mean 3.288802 0.743456 11.518049 7.082949 ## Median 3.270000 0.740000 11.600000 7.000000 ## Sum 713.670000 161.330000 2499.416667 1537.000000 ## SE Mean 0.010487 0.009099 0.067759 0.018766 ## LCL Mean 3.268133 0.725522 11.384496 7.045961 ## UCL Mean 3.309471 0.761391 11.651603 7.119938 5
  • 6.
## Variance 0.023863 0.017966 0.996310 0.076421
## Stdev 0.154478 0.134038 0.998153 0.276443
## Skewness 0.358697 0.620835 0.065494 3.003356
## Kurtosis 0.608113 1.964857 -0.430136 7.052712
by(white[1:12][,c(1:12)], white$drinkit, basicStats)
## white$drinkit: 0
## fixed.acidity volatile.acidity citric.acid residual.sugar
## nobs 3838.000000 3838.000000 3838.000000 3838.000000
## NAs 0.000000 0.000000 0.000000 0.000000
## Minimum 3.800000 0.080000 0.000000 0.600000
## Maximum 14.200000 1.100000 1.660000 65.800000
## 1. Quartile 6.300000 0.220000 0.260000 1.700000
## 3. Quartile 7.400000 0.320000 0.400000 10.400000
## Mean 6.890594 0.281802 0.336438 6.703478
## Median 6.800000 0.270000 0.320000 6.000000
## Sum 26446.100000 1081.555000 1291.250000 25727.950000
## SE Mean 0.013884 0.001651 0.002098 0.084341
## LCL Mean 6.863374 0.278564 0.332325 6.538121
## UCL Mean 6.917814 0.285039 0.340551 6.868835
## Variance 0.739786 0.010464 0.016889 27.301127
## Stdev 0.860108 0.102293 0.129959 5.225048
## Skewness 0.752179 1.720644 1.242293 1.035464
## Kurtosis 2.339990 5.737845 5.472595 3.755082
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## nobs 3838.000000 3.838000e+03 3.838000e+03
## NAs 0.000000 0.000000e+00 0.000000e+00
## Minimum 0.009000 2.000000e+00 9.000000e+00
## Maximum 0.346000 2.890000e+02 4.400000e+02
## 1. Quartile 0.037000 2.300000e+01 1.110000e+02
## 3. Quartile 0.051000 4.700000e+01 1.730000e+02
## Mean 0.047875 3.551733e+01 1.419829e+02
## Median 0.045000 3.400000e+01 1.400000e+02
## Sum 183.743000 1.363155e+05 5.449305e+05
## SE Mean 0.000380 2.871250e-01 7.125790e-01
## LCL Mean 0.047129 3.495439e+01 1.405859e+02
## UCL Mean 0.048620 3.608026e+01 1.433800e+02
## Variance 0.000554 3.164067e+02 1.948817e+03
## Stdev 0.023548 1.778783e+01 4.414540e+01
## Skewness 4.851869 1.423264e+00 2.830020e-01
## Kurtosis 33.327716 1.178770e+01 4.934680e-01
## density pH sulphates alcohol quality
## nobs 3838.000000 3838.000000 3838.000000 3838.000000 3838.000000
## NAs 0.000000 0.000000 0.000000 0.000000 0.000000
## Minimum 0.987220 2.720000 0.230000 8.000000 3.000000
## Maximum 1.038980 3.810000 1.060000 14.000000 6.000000
## 1. Quartile 0.992320 3.080000 0.410000 9.400000 5.000000
## 3. Quartile 0.996570 3.267500 0.540000 11.000000 6.000000
## Mean 0.994474 3.180847 0.487004 10.265215 5.519802
## Median 0.994380 3.170000 0.470000 10.000000 6.000000
## Sum 3816.789390 12208.090000 1869.120000 39397.896667 21185.000000
## SE Mean 0.000047 0.002396 0.001746 0.017765 0.009764
## LCL Mean 0.994382 3.176150 0.483580 10.230385 5.500659
## UCL Mean 0.994565 3.185544 0.490427 10.300045 5.538945
## Variance 0.000008 0.022027 0.011700 1.211267 0.365910
## Stdev 0.002894 0.148414 0.108167 1.100576 0.604905
## Skewness 1.139004 0.518051 0.939134 0.690666 -1.004627
## Kurtosis 14.002972 0.817759 1.572179 -0.281440 0.695824
## --------------------------------------------------------
## white$drinkit: 1
## fixed.acidity volatile.acidity citric.acid residual.sugar
## nobs 1060.000000 1060.000000 1060.000000 1060.000000
## NAs 0.000000 0.000000 0.000000 0.000000
## Minimum 3.900000 0.080000 0.010000 0.800000
## Maximum 9.200000 0.760000 0.740000 19.250000
## 1. Quartile 6.200000 0.190000 0.280000 1.800000
## 3. Quartile 7.200000 0.320000 0.360000 7.400000
## Mean 6.725142 0.265349 0.326057 5.261509
## Median 6.700000 0.250000 0.310000 3.875000
## Sum 7128.650000 281.270000 345.620000 5577.200000
## SE Mean 0.023613 0.002890 0.002466 0.131792
## LCL Mean 6.678807 0.259678 0.321218 5.002906
## UCL Mean 6.771476 0.271020 0.330895 5.520113
## Variance 0.591050 0.008854 0.006446 18.411355
## Stdev 0.768798 0.094097 0.080288 4.290845
## Skewness 0.019010 0.874201 0.705643 1.080500
## Kurtosis 0.437095 1.098804 3.030720 0.098557
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## nobs 1060.000000 1060.000000 1.060000e+03
## NAs 0.000000 0.000000 0.000000e+00
## Minimum 0.012000 5.000000 3.400000e+01
## Maximum 0.135000 108.000000 2.290000e+02
## 1. Quartile 0.031000 25.000000 1.010000e+02
## 3. Quartile 0.044000 42.000000 1.460000e+02
## Mean 0.038160 34.550472 1.252453e+02
## Median 0.037000 33.000000 1.220000e+02
## Sum 40.450000 36623.500000 1.327600e+05
## SE Mean 0.000342 0.423776 1.005136e+00
## LCL Mean 0.037489 33.718936 1.232730e+02
## UCL Mean 0.038832 35.382008 1.272176e+02
## Variance 0.000124 190.361237 1.070916e+03
## Stdev 0.011145 13.797146 3.272485e+01
## Skewness 2.258097 1.015169 5.094610e-01
## Kurtosis 14.741475 2.729507 2.420860e-01
## density pH sulphates alcohol quality
## nobs 1060.000000 1060.000000 1060.000000 1060.000000 1060.000000
## NAs 0.000000 0.000000 0.000000 0.000000 0.000000
## Minimum 0.987110 2.840000 0.220000 8.500000 7.000000
## Maximum 1.000600 3.820000 1.080000 14.200000 9.000000
## 1. Quartile 0.990500 3.100000 0.400000 10.700000 7.000000
## 3. Quartile 0.993605 3.320000 0.580000 12.400000 7.000000
## Mean 0.992412 3.215132 0.500142 11.416022 7.174528
## Median 0.991730 3.200000 0.480000 11.500000 7.000000
## Sum 1051.956700 3408.040000 530.150000 12100.983333 7605.000000
## SE Mean 0.000085 0.004828 0.004086 0.038553 0.012040
## LCL Mean 0.992245 3.205659 0.492123 11.340372 7.150904
## UCL Mean 0.992579 3.224605 0.508160 11.491672 7.198152
## Variance 0.000008 0.024707 0.017701 1.575551 0.153647
## Stdev 0.002772 0.157185 0.133044 1.255209 0.391978
## Skewness 1.001822 0.235375 0.942026 -0.404076 1.945040
## Kurtosis 0.406561 -0.186210 1.063202 -0.522122 2.498491
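The grouping variable passed to by() above is the binary drinkit label described in the Overview. The code that creates it appears earlier in the document; for orientation, a minimal sketch of how it can be derived (storing the label as a factor is an assumption, consistent with the classification random forests fit below):

red$drinkit <- factor(ifelse(red$quality >= 7, 1, 0)) # 1 = median expert quality score of 7 or higher
white$drinkit <- factor(ifelse(white$quality >= 7, 1, 0))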
Boxplots
attribute = fixed.acidity
bp_fixed.acidity <- ggplot(red, aes(x=drinkit, y=fixed.acidity))
bp_fixed.acidity + geom_boxplot()
[Boxplot: fixed.acidity by drinkit, red wine dataset]
bp_fixed.acidity <- ggplot(white, aes(x=drinkit, y=fixed.acidity))
bp_fixed.acidity + geom_boxplot()
[Boxplot: fixed.acidity by drinkit, white wine dataset]
attribute = volatile.acidity
bp_volatile.acidity <- ggplot(red, aes(x=drinkit, y=volatile.acidity))
bp_volatile.acidity + geom_boxplot()
[Boxplot: volatile.acidity by drinkit, red wine dataset]
bp_volatile.acidity <- ggplot(white, aes(x=drinkit, y=volatile.acidity))
bp_volatile.acidity + geom_boxplot()
[Boxplot: volatile.acidity by drinkit, white wine dataset]
attribute = citric.acid
bp_citric.acid <- ggplot(red, aes(x=drinkit, y=citric.acid))
bp_citric.acid + geom_boxplot()
[Boxplot: citric.acid by drinkit, red wine dataset]
bp_citric.acid <- ggplot(white, aes(x=drinkit, y=citric.acid))
bp_citric.acid + geom_boxplot()
[Boxplot: citric.acid by drinkit, white wine dataset]
attribute = residual.sugar
bp_residual.sugar <- ggplot(red, aes(x=drinkit, y=residual.sugar))
bp_residual.sugar + geom_boxplot()
[Boxplot: residual.sugar by drinkit, red wine dataset]
bp_residual.sugar <- ggplot(white, aes(x=drinkit, y=residual.sugar))
bp_residual.sugar + geom_boxplot()
[Boxplot: residual.sugar by drinkit, white wine dataset]
attribute = chlorides
bp_chlorides <- ggplot(red, aes(x=drinkit, y=chlorides))
bp_chlorides + geom_boxplot()
[Boxplot: chlorides by drinkit, red wine dataset]
bp_chlorides <- ggplot(white, aes(x=drinkit, y=chlorides))
bp_chlorides + geom_boxplot()
[Boxplot: chlorides by drinkit, white wine dataset]
attribute = free.sulfur.dioxide
bp_free.sulfur.dioxide <- ggplot(red, aes(x=drinkit, y=free.sulfur.dioxide))
bp_free.sulfur.dioxide + geom_boxplot()
[Boxplot: free.sulfur.dioxide by drinkit, red wine dataset]
bp_free.sulfur.dioxide <- ggplot(white, aes(x=drinkit, y=free.sulfur.dioxide))
bp_free.sulfur.dioxide + geom_boxplot()
[Boxplot: free.sulfur.dioxide by drinkit, white wine dataset]
attribute = total.sulfur.dioxide
bp_total.sulfur.dioxide <- ggplot(red, aes(x=drinkit, y=total.sulfur.dioxide))
bp_total.sulfur.dioxide + geom_boxplot()
[Boxplot: total.sulfur.dioxide by drinkit, red wine dataset]
bp_total.sulfur.dioxide <- ggplot(white, aes(x=drinkit, y=total.sulfur.dioxide))
bp_total.sulfur.dioxide + geom_boxplot()
[Boxplot: total.sulfur.dioxide by drinkit, white wine dataset]
attribute = sulphates
bp_sulphates <- ggplot(red, aes(x=drinkit, y=sulphates))
bp_sulphates + geom_boxplot()
[Boxplot: sulphates by drinkit, red wine dataset]
bp_sulphates <- ggplot(white, aes(x=drinkit, y=sulphates))
bp_sulphates + geom_boxplot()
[Boxplot: sulphates by drinkit, white wine dataset]
attribute = alcohol
bp_alcohol <- ggplot(red, aes(x=drinkit, y=alcohol))
bp_alcohol + geom_boxplot()
[Boxplot: alcohol by drinkit, red wine dataset]
bp_alcohol <- ggplot(white, aes(x=drinkit, y=alcohol))
bp_alcohol + geom_boxplot()
[Boxplot: alcohol by drinkit, white wine dataset]
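The plot blocks above differ only in the y variable. For reference, a compact equivalent loops over the attribute names; this is a sketch, and it assumes the aes_string() helper is available in the installed ggplot2 version:

attribs <- setdiff(names(red), c("quality", "drinkit")) # the 11 physicochemical attributes
for (a in attribs) {
  print(ggplot(red, aes_string(x = "drinkit", y = a)) + geom_boxplot() + ggtitle(paste("red:", a)))
  print(ggplot(white, aes_string(x = "drinkit", y = a)) + geom_boxplot() + ggtitle(paste("white:", a)))
}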
PREDICTIVE MODELING
Logistic regression - red wine dataset
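The redtrain/redtest (and whitetrain/whitetest) partitions used below were created in the data-curation step earlier in the document. For orientation, a minimal sketch of such a split with caTools (already loaded above); the seed is an assumption, and the 0.7 ratio is inferred from the 1119-row training set and 480-row test set seen in the output:

set.seed(88)                                         # assumed seed, for reproducibility
split <- sample.split(red$drinkit, SplitRatio = 0.7) # ~70/30 split, stratified on the class label
redtrain <- subset(red, split == TRUE)
redtest  <- subset(red, split == FALSE)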
Model 1
redmodel1 <- glm(drinkit ~ fixed.acidity+volatile.acidity+citric.acid+residual.sugar+
    total.sulfur.dioxide+chlorides+alcohol, family="binomial", data= redtrain)
summary(redmodel1)
##
## Call:
## glm(formula = drinkit ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + total.sulfur.dioxide + chlorides + alcohol,
## family = "binomial", data = redtrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6103 -0.4541 -0.2492 -0.1403 2.6757
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.12828 1.51262 -6.696 2.14e-11 ***
## fixed.acidity 0.12368 0.08316 1.487 0.13695
## volatile.acidity -4.49437 0.91316 -4.922 8.58e-07 ***
## citric.acid 0.31682 0.97102 0.326 0.74422
## residual.sugar 0.10574 0.06983 1.514 0.12996
## total.sulfur.dioxide -0.01329 0.00419 -3.172 0.00151 **
## chlorides -6.34684 4.16619 -1.523 0.12765
## alcohol 0.91992 0.10404 8.842 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 889.23 on 1118 degrees of freedom
## Residual deviance: 634.16 on 1111 degrees of freedom
## AIC: 650.16
##
## Number of Fisher Scoring iterations: 6
exp(cbind(OR = coef(redmodel1), confint(redmodel1))) # odds ratios and 95% CI
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 3.993409e-05 1.934350e-06 0.0007319036
## fixed.acidity 1.131655e+00 9.622602e-01 1.3339255345
## volatile.acidity 1.117174e-02 1.792495e-03 0.0642802846
## citric.acid 1.372752e+00 2.016707e-01 9.1164704929
## residual.sugar 1.111529e+00 9.597369e-01 1.2685734018
## total.sulfur.dioxide 9.867968e-01 9.783435e-01 0.9945651864
## chlorides 1.752269e-03 2.005370e-07 2.3287091038
## alcohol 2.509101e+00 2.055529e+00 3.0926740162
Model 1 performance
redtrain$drinkitYhat <- predict(redmodel1, type = "response") # generate yhat values on train df
redtrain$drinkitYhat <- ifelse(redtrain$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix(redtrain$drinkitYhat, redtrain$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 932 110
## 1 35 42
##
## Accuracy : 0.8704
## 95% CI : (0.8493, 0.8896)
## No Information Rate : 0.8642
## P-Value [Acc > NIR] : 0.2878
##
## Kappa : 0.3032
## Mcnemar's Test P-Value : 7.978e-10
##
## Sensitivity : 0.9638
## Specificity : 0.2763
## Pos Pred Value : 0.8944
## Neg Pred Value : 0.5455
## Prevalence : 0.8642
## Detection Rate : 0.8329
## Detection Prevalence : 0.9312
## Balanced Accuracy : 0.6201
##
## 'Positive' Class : 0
##
auc(roc(redtrain$drinkit, redtrain$drinkitYhat), levels=levels(as.factor(redtrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.6201
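One caveat worth flagging: auc() here is computed on the thresholded 0/1 predictions, so it equals the balanced accuracy rather than the usual threshold-independent ROC AUC. Scoring the raw predicted probabilities gives the conventional AUROC; a sketch (its value is not part of the original output):

probs <- predict(redmodel1, type = "response") # predicted probabilities, no 0.5 cutoff applied
auc(roc(redtrain$drinkit, probs))              # threshold-independent AUROC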
Model 2
redmodel2 <- glm(drinkit ~ volatile.acidity+total.sulfur.dioxide+alcohol, family="binomial",
    data= redtrain)
summary(redmodel2)
##
## Call:
## glm(formula = drinkit ~ volatile.acidity + total.sulfur.dioxide +
## alcohol, family = "binomial", data = redtrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0665 -0.4559 -0.2634 -0.1452 2.6374
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.90155 1.12801 -7.891 2.99e-15 ***
## volatile.acidity -5.21126 0.71903 -7.248 4.24e-13 ***
## total.sulfur.dioxide -0.01379 0.00409 -3.371 0.000749 ***
## alcohol 0.92389 0.09549 9.675 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 889.23 on 1118 degrees of freedom
## Residual deviance: 644.62 on 1115 degrees of freedom
## AIC: 652.62
##
## Number of Fisher Scoring iterations: 6
exp(cbind(OR = coef(redmodel2), confint(redmodel2))) # odds ratios and 95% CI
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.0001361775 1.429867e-05 0.001198158
## volatile.acidity 0.0054547915 1.276332e-03 0.021452206
## total.sulfur.dioxide 0.9863066898 9.780499e-01 0.993846316
## alcohol 2.5190775373 2.097376e+00 3.051543824
Model 2 performance
redtrain$drinkitYhat <- predict(redmodel2, type = "response") # generate yhat values on train df
redtrain$drinkitYhat <- ifelse(redtrain$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix(redtrain$drinkitYhat, redtrain$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 934 112
## 1 33 40
##
## Accuracy : 0.8704
## 95% CI : (0.8493, 0.8896)
## No Information Rate : 0.8642
## P-Value [Acc > NIR] : 0.2878
##
## Kappa : 0.2933
## Mcnemar's Test P-Value : 9.323e-11
##
## Sensitivity : 0.9659
## Specificity : 0.2632
## Pos Pred Value : 0.8929
## Neg Pred Value : 0.5479
## Prevalence : 0.8642
## Detection Rate : 0.8347
## Detection Prevalence : 0.9348
## Balanced Accuracy : 0.6145
##
## 'Positive' Class : 0
##
auc(roc(redtrain$drinkit, redtrain$drinkitYhat), levels=levels(as.factor(redtrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.6145
Model 3
redmodel3 <- glm(drinkit ~ alcohol, family="binomial", data= redtrain)
summary(redmodel3)
##
## Call:
## glm(formula = drinkit ~ alcohol, family = "binomial", data = redtrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2384 -0.5138 -0.3279 -0.2540 2.6650
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -13.11263 1.00437 -13.06 <2e-16 ***
## alcohol 1.04246 0.08978 11.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 889.23 on 1118 degrees of freedom
## Residual deviance: 724.88 on 1117 degrees of freedom
## AIC: 728.88
##
## Number of Fisher Scoring iterations: 5
exp(cbind(OR = coef(redmodel3), confint(redmodel3))) # odds ratios and 95% CI
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 2.019563e-06 2.659407e-07 1.371582e-05
## alcohol 2.836199e+00 2.388290e+00 3.397552e+00
Model 3 performance
redtrain$drinkitYhat <- predict(redmodel3, type = "response") # generate yhat values on train df
redtrain$drinkitYhat <- ifelse(redtrain$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix(redtrain$drinkitYhat, redtrain$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 943 131
## 1 24 21
##
## Accuracy : 0.8615
## 95% CI : (0.8398, 0.8812)
## No Information Rate : 0.8642
## P-Value [Acc > NIR] : 0.6236
##
## Kappa : 0.1611
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9752
## Specificity : 0.1382
## Pos Pred Value : 0.8780
## Neg Pred Value : 0.4667
## Prevalence : 0.8642
## Detection Rate : 0.8427
## Detection Prevalence : 0.9598
## Balanced Accuracy : 0.5567
##
## 'Positive' Class : 0
##
auc(roc(redtrain$drinkit, redtrain$drinkitYhat), levels=levels(as.factor(redtrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.5567
Use best model (model 2) for testset prediction
redtest$drinkitYhat <- predict(redmodel2, newdata = redtest, type = "response") # predict values on test df
redtest$drinkitYhat <- ifelse(redtest$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix(redtest$drinkitYhat, redtest$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 402 49
## 1 13 16
##
## Accuracy : 0.8708
## 95% CI : (0.8375, 0.8995)
## No Information Rate : 0.8646
## P-Value [Acc > NIR] : 0.3749
##
## Kappa : 0.2803
## Mcnemar's Test P-Value : 8.789e-06
##
## Sensitivity : 0.9687
## Specificity : 0.2462
## Pos Pred Value : 0.8914
## Neg Pred Value : 0.5517
## Prevalence : 0.8646
## Detection Rate : 0.8375
## Detection Prevalence : 0.9396
## Balanced Accuracy : 0.6074
##
## 'Positive' Class : 0
##
auc(roc(redtest$drinkit, redtest$drinkitYhat), levels=levels(as.factor(redtrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.6074
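Model 2 is chosen here even though model 1 has a marginally lower AIC (650.16 vs 652.62), presumably because it achieves nearly identical fit with four fewer terms. A quick side-by-side check of that trade-off (a sketch, not part of the original output):

AIC(redmodel1, redmodel2, redmodel3)        # compare information criteria across the three fits
anova(redmodel2, redmodel1, test = "Chisq") # likelihood-ratio test: do the four extra terms help?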
Logistic regression - white wine dataset
Model 1
whitemodel1 <- glm(drinkit ~ fixed.acidity+volatile.acidity+citric.acid+residual.sugar+
    total.sulfur.dioxide+chlorides+alcohol, family="binomial", data= whitetrain)
summary(whitemodel1)
##
## Call:
## glm(formula = drinkit ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + total.sulfur.dioxide + chlorides + alcohol,
## family = "binomial", data = whitetrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9694 -0.6635 -0.4286 -0.1833 2.8909
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.906528 0.805489 -11.057 < 2e-16 ***
## fixed.acidity -0.039714 0.058123 -0.683 0.494
## volatile.acidity -4.905984 0.578866 -8.475 < 2e-16 ***
## citric.acid -0.717323 0.460893 -1.556 0.120
## residual.sugar 0.047718 0.011470 4.160 3.18e-05 ***
## total.sulfur.dioxide 0.001578 0.001356 1.164 0.245
## chlorides -18.758762 4.513067 -4.157 3.23e-05 ***
## alcohol 0.899407 0.052430 17.154 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3581.9 on 3428 degrees of freedom
## Residual deviance: 2942.3 on 3421 degrees of freedom
## AIC: 2958.3
##
## Number of Fisher Scoring iterations: 6
exp(cbind(OR = coef(whitemodel1), confint(whitemodel1))) # odds ratios and 95% CI
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 1.355014e-04 2.768784e-05 6.517844e-04
## fixed.acidity 9.610643e-01 8.571076e-01 1.076523e+00
## volatile.acidity 7.402156e-03 2.339702e-03 2.263315e-02
## citric.acid 4.880570e-01 1.948869e-01 1.193235e+00
## residual.sugar 1.048875e+00 1.025407e+00 1.072573e+00
## total.sulfur.dioxide 1.001579e+00 9.989146e-01 1.004239e+00
## chlorides 7.131371e-09 7.522433e-13 3.505045e-05
## alcohol 2.458146e+00 2.220357e+00 2.727226e+00
Model 1 performance
whitetrain$drinkitYhat <- predict(whitemodel1, type = "response") # generate yhat values on train df
whitetrain$drinkitYhat <- ifelse(whitetrain$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix(whitetrain$drinkitYhat, whitetrain$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2532 565
## 1 155 177
##
## Accuracy : 0.79
## 95% CI : (0.776, 0.8036)
## No Information Rate : 0.7836
## P-Value [Acc > NIR] : 0.1865
##
## Kappa : 0.2261
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9423
## Specificity : 0.2385
## Pos Pred Value : 0.8176
## Neg Pred Value : 0.5331
## Prevalence : 0.7836
## Detection Rate : 0.7384
## Detection Prevalence : 0.9032
## Balanced Accuracy : 0.5904
##
## 'Positive' Class : 0
##
auc(roc(whitetrain$drinkit, whitetrain$drinkitYhat), levels=levels(as.factor(whitetrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.5904
Model 2
whitemodel2 <- glm(drinkit ~ volatile.acidity+residual.sugar+chlorides+alcohol, family="binomial",
    data= whitetrain)
summary(whitemodel2)
##
## Call:
## glm(formula = drinkit ~ volatile.acidity + residual.sugar + chlorides +
## alcohol, family = "binomial", data = whitetrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9024 -0.6594 -0.4260 -0.1902 2.8269
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.09922 0.63146 -14.410 < 2e-16 ***
## volatile.acidity -4.68741 0.56405 -8.310 < 2e-16 ***
## residual.sugar 0.04845 0.01115 4.345 1.39e-05 ***
## chlorides -18.36325 4.41268 -4.161 3.16e-05 ***
## alcohol 0.88237 0.04981 17.714 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3581.9 on 3428 degrees of freedom
## Residual deviance: 2947.1 on 3424 degrees of freedom
## AIC: 2957.1
##
## Number of Fisher Scoring iterations: 5
exp(cbind(OR = coef(whitemodel2), confint(whitemodel2))) # odds ratios and 95% CI
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 1.117534e-04 3.232733e-05 3.849781e-04
## volatile.acidity 9.210485e-03 2.998984e-03 2.737557e-02
## residual.sugar 1.049642e+00 1.026810e+00 1.072716e+00
## chlorides 1.059114e-08 1.369508e-12 4.332245e-05
## alcohol 2.416612e+00 2.193714e+00 2.667033e+00
Model 2 performance
whitetrain$drinkitYhat <- predict(whitemodel2, type = "response") # generate yhat values on train df
whitetrain$drinkitYhat <- ifelse(whitetrain$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix(whitetrain$drinkitYhat, whitetrain$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2536 564
## 1 151 178
##
## Accuracy : 0.7915
## 95% CI : (0.7775, 0.805)
## No Information Rate : 0.7836
## P-Value [Acc > NIR] : 0.1357
##
## Kappa : 0.23
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9438
## Specificity : 0.2399
## Pos Pred Value : 0.8181
## Neg Pred Value : 0.5410
## Prevalence : 0.7836
## Detection Rate : 0.7396
## Detection Prevalence : 0.9041
## Balanced Accuracy : 0.5918
##
## 'Positive' Class : 0
##
auc(roc(whitetrain$drinkit, whitetrain$drinkitYhat), levels=levels(as.factor(whitetrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.5918
Use best model (model 2) for testset prediction
whitetest$drinkitYhat <- predict(whitemodel2, newdata = whitetest, type = "response")
whitetest$drinkitYhat <- ifelse(whitetest$drinkitYhat > 0.5, 1.0, 0.0)
confusionMatrix(whitetest$drinkitYhat, whitetest$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1081 242
## 1 70 76
##
## Accuracy : 0.7876
## 95% CI : (0.7658, 0.8083)
## No Information Rate : 0.7835
## P-Value [Acc > NIR] : 0.3657
##
## Kappa : 0.2215
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9392
## Specificity : 0.2390
## Pos Pred Value : 0.8171
## Neg Pred Value : 0.5205
## Prevalence : 0.7835
## Detection Rate : 0.7359
## Detection Prevalence : 0.9006
## Balanced Accuracy : 0.5891
##
## 'Positive' Class : 0
##
auc(roc(whitetest$drinkit, whitetest$drinkitYhat), levels=levels(as.factor(whitetrain$drinkit))) # calculate AUROC curve
## Area under the curve: 0.5891
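At the default 0.5 cutoff these white-wine models catch fewer than a quarter of the good wines, which is unsurprising given the roughly 78%/22% class imbalance. The cutoff itself can be tuned; pROC can suggest the threshold that maximizes sensitivity plus specificity (a sketch, not run in the original analysis):

probs <- predict(whitemodel2, type = "response")
roc_obj <- roc(whitetrain$drinkit, probs)
coords(roc_obj, "best", best.method = "youden") # cutoff maximizing sensitivity + specificity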
Random forest algorithm - red wine dataset
set.seed(77)
rf1 <- randomForest(drinkit ~ volatile.acidity+total.sulfur.dioxide+alcohol, type = classification,
    data=redtrain, ntree = 1000, importance = TRUE, confusion = TRUE)
round(importance(rf1), 1)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## volatile.acidity 18.7 54.4 44.3 81.8
## total.sulfur.dioxide 10.6 37.5 28.9 70.7
## alcohol 24.6 73.7 59.5 89.5
print(rf1)
##
## Call:
## randomForest(formula = drinkit ~ volatile.acidity + total.sulfur.dioxide + alcohol, data = redtrain, type = classification, ntree = 1000, importance = TRUE, confusion = TRUE)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 12.87%
## Confusion matrix:
## 0 1 class.error
## 0 918 49 0.05067218
## 1 95 57 0.62500000
Random forest using only alcohol attribute
set.seed(77)
rf2 <- randomForest(drinkit ~ alcohol, type = classification, data=redtrain, ntree = 1000,
    importance = TRUE, confusion = TRUE)
round(importance(rf2), 1)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## alcohol 67.5 66.4 67.6 72.2
print(rf2)
##
## Call:
## randomForest(formula = drinkit ~ alcohol, data = redtrain, type = classification, ntree = 1000, importance = TRUE, confusion = TRUE)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 15.91%
## Confusion matrix:
## 0 1 class.error
## 0 922 45 0.04653568
## 1 133 19 0.87500000
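mtry was left at its default here (one variable tried per split for rf1). The randomForest package ships a helper that searches for a better mtry against the out-of-bag error; a sketch using the same predictors (the tuning settings are assumptions, and with only three predictors the search space is small):

set.seed(77)
tuneRF(x = redtrain[, c("volatile.acidity", "total.sulfur.dioxide", "alcohol")],
       y = redtrain$drinkit, ntreeTry = 500, stepFactor = 2, improve = 0.01)
# prints/plots the OOB error for each mtry tried and returns the error matrix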
Testset prediction
rf1predict <- predict(rf1, redtest, type="response")
Model performance on testset
confusionMatrix(rf1predict, redtest$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 397 27
## 1 18 38
##
## Accuracy : 0.9062
## 95% CI : (0.8766, 0.9308)
## No Information Rate : 0.8646
## P-Value [Acc > NIR] : 0.003355
##
## Kappa : 0.5748
## Mcnemar's Test P-Value : 0.233038
##
## Sensitivity : 0.9566
## Specificity : 0.5846
## Pos Pred Value : 0.9363
## Neg Pred Value : 0.6786
## Prevalence : 0.8646
## Detection Rate : 0.8271
## Detection Prevalence : 0.8833
## Balanced Accuracy : 0.7706
##
## 'Positive' Class : 0
##
rf1predict <- as.numeric(rf1predict)
auc(roc(redtest$drinkit, rf1predict)) # calculate AUROC curve
## Area under the curve: 0.7706
Random forest algorithm - white wine dataset
set.seed(77)
rf3 <- randomForest(drinkit ~ volatile.acidity+residual.sugar+chlorides+alcohol, type = classification,
    data=whitetrain, ntree = 1000, importance = TRUE, confusion = TRUE)
round(importance(rf3), 1)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## volatile.acidity 45.4 128.0 113.4 248.4
## residual.sugar 48.0 105.0 110.3 295.5
## chlorides 32.6 116.0 101.8 251.7
## alcohol 68.0 226.6 179.2 359.4
print(rf3)
##
## Call:
## randomForest(formula = drinkit ~ volatile.acidity + residual.sugar + chlorides + alcohol, data = whitetrain, type = classification, ntree = 1000, importance = TRUE, confusion = TRUE)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 14.44%
## Confusion matrix:
## 0 1 class.error
## 0 2521 166 0.06177894
## 1 329 413 0.44339623
Testset prediction
rf3predict <- predict(rf3, whitetest, type="response")
Model performance on testset
confusionMatrix(rf3predict, whitetest$drinkit) # run confusionMatrix to assess accuracy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1075 140
## 1 76 178
##
## Accuracy : 0.853
## 95% CI : (0.8338, 0.8707)
## No Information Rate : 0.7835
## P-Value [Acc > NIR] : 9.184e-12
##
## Kappa : 0.5325
## Mcnemar's Test P-Value : 1.814e-05
##
## Sensitivity : 0.9340
## Specificity : 0.5597
## Pos Pred Value : 0.8848
## Neg Pred Value : 0.7008
## Prevalence : 0.7835
## Detection Rate : 0.7318
## Detection Prevalence : 0.8271
## Balanced Accuracy : 0.7469
##
## 'Positive' Class : 0
##
rf3predict <- as.numeric(rf3predict)
auc(roc(whitetest$drinkit, rf3predict)) # calculate AUROC curve
## Area under the curve: 0.7469
RESULTS SUMMARY
Descriptive statistics and visualization of the data were employed to select variables for the predictive models of high-quality wines. Several models were compared using multivariate logistic regression. The best model (model 2) defined on the red wine training dataset included the variables volatile.acidity, total.sulfur.dioxide and alcohol. Results of the confusion matrix and AUROC calculation were as follows. Please note that after checking the raw data I found that, although the confusion matrices reported in the caret output above are correct, the Sensitivity and Specificity labels are inverted with respect to the drinkit = 1 class (caret treats 0, the majority class, as the 'positive' class). The corrected values are shown below.
Red wine training set results:
1. Accuracy: 0.8704
2. Sensitivity (TPR): 0.2632
3. Specificity (TNR): 0.9659
4. FPR (1 - Specificity): 0.0341
5. Area under the curve: 0.6145
Results of this model on the test set:
1. Accuracy: 0.8708
2. Sensitivity: 0.2462
3. Specificity: 0.9687
4. FPR (1 - Specificity): 0.0313
5. Area under the curve: 0.6074
The best white wine logit model included the variables volatile.acidity, residual.sugar, chlorides and alcohol. Results of this model on the training set:
1. Accuracy: 0.7915
2. Sensitivity: 0.2399
3. Specificity: 0.9438
4. FPR (1 - Specificity): 0.0562
5. Area under the curve: 0.5918
Results of this model on the test set:
1. Accuracy: 0.7876
2. Sensitivity: 0.2390
3. Specificity: 0.9392
4. FPR (1 - Specificity): 0.0608
5. Area under the curve: 0.5891
For direct comparison of the logit and random forest algorithms, the best models defined using logistic regression were then evaluated by random forest using 1000 trees. Red wine training set results (out-of-bag):
1. Accuracy: 0.8713
2. Sensitivity: 0.5377
3. Specificity: 0.9062
4. FPR (1 - Specificity): 0.0938
5. Area under the curve: N/A using randomForest package
Results of this model on the test set:
1. Accuracy: 0.9062
2. Sensitivity: 0.5846
3. Specificity: 0.9566
4. FPR (1 - Specificity): 0.0434
5. Area under the curve: 0.7706
White wine training set results (out-of-bag):
1. Accuracy: 0.8556
2. Sensitivity: 0.7133
3. Specificity: 0.8846
4. FPR (1 - Specificity): 0.1154
5. Area under the curve: N/A using randomForest package
Results of this model on the test set:
1. Accuracy: 0.8530
2. Sensitivity: 0.5597
3. Specificity: 0.9340
4. FPR (1 - Specificity): 0.0660
5. Area under the curve: 0.7469
DISCUSSION
Neither algorithm performed very well on these datasets, although both performed better on negative predictive value (i.e., predicting the bad wines). This makes sense: the physicochemical attributes of each wine in these datasets are probably more indicative of a bad wine than of a good one. For example, high sulfur or acidity can easily spoil an otherwise good wine, whereas the same components may combine more subtly to affect a wine's flavor. What makes a wine taste good is highly subjective anyway, which probably makes the class label harder to predict from these data. Interestingly, alcohol content was the single most predictive variable (doesn't everything taste better when you're intoxicated?).
Both algorithms performed better on the white wine dataset, which was three times larger than the red wine dataset. This result reinforces a major tenet of data science: more data tends to beat better algorithms.
In a direct comparison, the random forest algorithm somewhat outperformed the logit models. This was expected, as decision trees usually outperform logistic regression in my experience. Tree ensembles generally perform well because they are highly iterative, robust to noisy data including outliers (of which there were many in these data), and have good predictive power. Logit is also fairly robust, but it is more sensitive to nonlinearity in the independent variables, and some of these data were skewed.
There are several things I could do to improve the analysis. First, some of the variables were skewed and had many outliers, especially in the white wine dataset, but I didn't perform any type of transformation; transforming would have linearized the data and improved the logistic regression performance, as stated above. Moreover, performing stepwise regression optimizes model fitting, and categorizing continuous variables improves linearity in the independent variables, but neither of these methods was used (a sketch of both ideas follows below). Second, I didn't prune or otherwise tune the random forest trees to try to improve performance. Lastly, both R packages have many functions that can be employed to optimize algorithm performance, but in general I didn't make use of these.
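As a sketch of those first improvements (none of this was run in the original analysis; log1p is one common choice for right-skewed, zero-containing variables, and the transformed-variable name is illustrative):

# Log-transform a right-skewed predictor and refit the white wine model
whitetrain$log.residual.sugar <- log1p(whitetrain$residual.sugar)
whitemodel2b <- glm(drinkit ~ volatile.acidity + log.residual.sugar + chlorides + alcohol,
                    family = "binomial", data = whitetrain)
# Backward stepwise selection by AIC, starting from the full model
whitestep <- step(whitemodel1, direction = "backward")
summary(whitestep)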
CONCLUSIONS
1. The random forest algorithm outperformed logit.
2. Both algorithms performed better on the larger white wine dataset; getting more data is the best way to improve your models.
3. What makes a wine taste good is subjective.
4. Good wines, and especially reds (14% vs 22% for whites), are hard to find, at least in the northern Portugal “Vinho Verde” region.
REFERENCES
Datasets
1. Wine Quality Data Set from UCI ML Repository
2. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
dplyr resources
www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
www.youtube.com/watch?v=jWjqLW-u3hc&feature=youtu.be
www.youtube.com/watch?v=2mh1PqfsXVI
groups.google.com/forum/#!topic/manipulatr/Z46zwYXNh0g
stackoverflow.com/questions/22850026/filtering-row-which-contains-a-certain-string-using-dplyr
stackoverflow.com/questions/13520515/command-to-remove-row-from-a-data-frame
Logistic regression resources
cran.r-project.org/web/packages/glm2/glm2.pdf
www.kaggle.com/eyebervil/titanic/titanic-simple-logit-with-interaction
cran.r-project.org/web/packages/caret/vignettes/caret.pdf
cran.r-project.org/web/packages/caret/caret.pdf
cran.r-project.org/web/packages/pROC/pROC.pdf
stats.stackexchange.com/questions/87234/aic-values-and-their-use-in-stepwise-model-selection-for-a-simple-linear-regress
Random forest algorithm resources
cran.r-project.org/web/packages/randomForest/randomForest.pdf
campus.datacamp.com/courses/kaggle-r-tutorial-on-machine-learning/chapter-3-improving-your-predictions-through-random-forests?ex=1
R Markdown resources
rmarkdown.rstudio.com/
www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf
www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf