1. C. Dorow, PRED411, Sec55 Insurance, Project 3, May 31, 2015
Predict 411 Section 55 Project 3
‘Wine Sales Review’
By Christopher Dorow
Due Date: May 31, 2015
File Name: Chris_Dorow_PRED411_Sec55_PROJ3.PDF
Results Summary and Conclusion
Several models were developed to predict the probability of a sale and the number of
cases of wine sold, based upon a collection of chemical and marketing variables. The
training data consisted of 12,795 records. The best model from my investigation was a
Zero Inflated Poisson regression, which yielded an AIC of 40,865. The factors most
likely to influence wine sales were the presence of a rating for the wine, as wines
without a STARS rating sold poorly, and label appeal, as greater label appeal was
associated with higher sales.
Introduction
The purpose of this assignment is to develop a regression that will predict the number of
cases of wine sold, based upon the data set provided. Variables included in this data set
are listed below:
• Acid index, a measurement of total acidity
• Alcohol content
• Chloride content of wine
• Citric acid content
• Wine density
• Wine fixed acidity
• Free sulfur dioxide content
• Label appeal
• Residual sugar
• Independent rating by stars
• Sulphate content of wine
• Total sulfur dioxide
• Volatile acidity
• Wine pH
Evaluations of data quality will be made, including identification of missing or outlier
data. Linear, Poisson, Zero Inflated Poisson, Negative Binomial, and Zero Inflated
Negative Binomial regressions will be generated and compared. The best model will
be selected that predicts the amount of wine sold, in cases.
Data Exploration
Within the provided information was a data dictionary, which is copied below.
Variable Name | Definition | Theoretical Effect
INDEX | Identification variable (do not use) | None
TARGET | Number of cases purchased | None
AcidIndex | Proprietary method of testing total acidity of wine by using a weighted average |
Alcohol | Alcohol content |
Chlorides | Chloride content of wine |
CitricAcid | Citric acid content |
Density | Density of wine |
FixedAcidity | Fixed acidity of wine |
FreeSulfurDioxide | Sulfur dioxide content of wine |
LabelAppeal | Marketing score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customers don't like the design. | Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales.
ResidualSugar | Residual sugar of wine |
STARS | Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor | A high number of stars suggests high sales
Sulphates | Sulfate content of wine |
TotalSulfurDioxide | Total sulfur dioxide of wine |
VolatileAcidity | Volatile acid content of wine |
pH | pH of wine |
The continuous variables were reviewed, and I could not discern any trends among them
that could be utilized. However, upon reviewing two contingency tables, I was able to
locate two key variables. The tables are located in Appendix 1.
The first contingency table considered LabelAppeal and TARGET. Wines with lower rated
labels had lower TARGET values. An appealing label and bottle combination can be very
useful in grabbing the attention of the consumer.
The second contingency table that was very useful was the STARS and TARGET table.
When there was no rating, no sales occurred in just over 2,000 records, or about
one-sixth of the training data. Consumers seem to shy away from wine of unknown
quality.
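The count behind the STARS and TARGET table can be sketched in a few lines of Python. This is only an illustration: the six records are hypothetical, and the helper name zero_sale_share is an assumption rather than part of the project code, which is written in SAS.

```python
def zero_sale_share(records):
    """Share of unrated wines (stars is None) that sold zero cases."""
    unrated = [r for r in records if r["stars"] is None]
    if not unrated:
        return 0.0
    zero_sales = sum(1 for r in unrated if r["target"] == 0)
    return zero_sales / len(unrated)


# Hypothetical records; None marks a missing STARS rating.
records = [
    {"stars": None, "target": 0},
    {"stars": None, "target": 0},
    {"stars": None, "target": 0},
    {"stars": None, "target": 3},
    {"stars": 2, "target": 4},
    {"stars": 3, "target": 0},
]
share = zero_sale_share(records)
```

Applied to the real training data, the same ratio would give the share of unrated wines with no sales.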
Data Preparation
The descriptive statistics for the data set are summarized for the continuous variables in
the following table.
Missing values were replaced with a central value for the respective variable, in most
cases the mean. The SAS code in the appendix uses the median for STARS, Chlorides, and
ResidualSugar, and a fixed value of 4 for pH. Missing values are also flagged with
indicator variables (for example, M_STARS) for identification and reference in the
chosen model.
Variable            N      N Miss  Median       Mean         Minimum       Maximum      Std Dev
INDEX               12795  0       8110.00      8069.98      1.0000000     16129.00     4656.91
TARGET              12795  0       3.0000000    3.0290739    0             8.0000000    1.9263682
FixedAcidity        12795  0       6.9000000    7.0757171    -18.1000000   34.4000000   6.3176435
VolatileAcidity     12795  0       0.2800000    0.3241039    -2.7900000    3.6800000    0.7840142
CitricAcid          12795  0       0.3100000    0.3084127    -3.2400000    3.8600000    0.8620798
ResidualSugar       12179  616     3.9000000    5.4187331    -127.8000000  141.1500000  33.7493790
Chlorides           12157  638     0.0460000    0.0548225    -1.1710000    1.3510000    0.3184673
FreeSulfurDioxide   12148  647     30.0000000   30.8455713   -555.0000000  623.0000000  148.7145577
TotalSulfurDioxide  12113  682     123.0000000  120.7142326  -823.0000000  1057.00      231.9132105
Density             12795  0       0.9944900    0.9942027    0.8880900     1.0992400    0.0265376
pH                  12400  395     3.2000000    3.2076282    0.4800000     6.1300000    0.6796871
Sulphates           11585  1210    0.5000000    0.5271118    -3.1300000    4.2400000    0.9321293
Alcohol             12142  653     10.4000000   10.4892363   -4.7000000    26.5000000   3.7278190
LabelAppeal         12795  0       0            -0.0090660   -2.0000000    2.0000000    0.8910892
AcidIndex           12795  0       8.0000000    7.7727237    4.0000000     17.0000000   1.3239264
STARS               9436   3359    2.0000000    2.0417550    1.0000000     4.0000000    0.9025400
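The replacement of missing values together with a missing-value flag can be sketched in Python as follows. The project itself does this in a SAS data step; the function name impute_with_flag and the four sample readings are assumptions made for illustration.

```python
from statistics import mean


def impute_with_flag(values, fill=None):
    """Replace None entries with `fill` (default: mean of the observed values).

    Returns (imputed_values, missing_flags); a flag is 1 where the original
    value was missing and 0 otherwise.
    """
    observed = [v for v in values if v is not None]
    if fill is None:
        fill = mean(observed)
    imputed = [fill if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]
    return imputed, flags


# Hypothetical Sulphates readings; None marks a missing record.
sulphates = [0.50, None, 0.75, 0.25]
imp_sulphates, m_sulphates = impute_with_flag(sulphates)

# A fixed fill value can be passed instead, as the SAS code does for STARS.
imp_stars, m_stars = impute_with_flag([3, None, 1], fill=2)
```

Keeping the flag alongside the imputed value lets a later model use the fact that a value was missing, which is how M_STARS enters the zero-inflated models.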
Treatment of Outliers
Sulfur dioxide records (free and total) were limited to the range of 10 mg/l to 350 mg/l,
as concentrations above 10 mg/l require labeling, and the maximum concentration of
sulfur dioxide is limited to 350 mg/l by law (source:
http://www.piwine.com/use-and-measurement-of-sulfur-dioxide-in-wine.html). A lower limit
of 3 was placed on pH, as values as low as the observed minimum of 0.48 are indicative of
highly concentrated mineral acids, such as hydrochloric or sulfuric acid, which are unfit
for human consumption, indicating the inappropriateness of the value. Negative values
for any concentration or composition variable were also conditioned, as they are not
physically possible. These values were replaced with the lowest acceptable value.
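The limits described above amount to clamping each value into an allowed range. A minimal Python sketch follows, assuming the limits stated in the text (10 to 350 mg/l for sulfur dioxide, a floor of 3 for pH, a floor of 0 for sulphates); the function and constant names are illustrative, and the sample inputs are the extreme values from the descriptive statistics table.

```python
SO2_LOW, SO2_HIGH = 10.0, 350.0   # mg/l limits quoted in the text
PH_FLOOR = 3.0                    # lowest pH treated as plausible
CONCENTRATION_FLOOR = 0.0         # concentrations cannot be negative


def clamp(value, low, high):
    """Force value into the closed interval [low, high]."""
    return max(low, min(high, value))


def condition_record(free_so2, total_so2, ph, sulphates):
    """Apply the outlier limits to one record and return the cleaned values."""
    return {
        "free_so2": clamp(free_so2, SO2_LOW, SO2_HIGH),
        "total_so2": clamp(total_so2, SO2_LOW, SO2_HIGH),
        "ph": max(ph, PH_FLOOR),
        "sulphates": max(sulphates, CONCENTRATION_FLOOR),
    }


# Extreme values taken from the descriptive statistics table.
cleaned = condition_record(free_so2=-555.0, total_so2=1057.0, ph=0.48, sulphates=-3.13)
```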
Variable Creation and Combination
The following variables were created:
New Variable | Description | Implication
Alcohol_Type | Less than 10.5 (value = 1); 10.5 or greater (value = 2) | Wines with alcohol content less than 10.5% are predominantly white wines; wines at 10.5% or greater are predominantly red wines.
Label_Group | Grouping of LabelAppeal: if negative, Label_Group = 1; if zero or positive, Label_Group = 2. | Grouping of impact of LabelAppeal on sales (negative or positive correlation)
Star_Impact | Grouping of STARS: if less than 2, Star_Impact = 0; if 2 or greater, Star_Impact = 1. | Grouping of impact of wine rating system
Real_pH | Conversion of pH into hydrogen ion concentration in moles/liter | Concentration = 10**(-pH)
Density_Adjusted | Density - 1 | Indication if above or below specific gravity of water
Impurities | Sum of chlorides, sulphates, free sulfur dioxide, and total sulfur dioxide | Impact of preservatives
IMP_CHLORIDES_LOG | Base-10 log of chlorides concentration | Impact of chlorides
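The derived variables can be sketched in Python as below. The cut-offs and codings follow the SAS listing in the appendix (which codes STAR_IMPACT as 0 or 1 and treats a LabelAppeal of 0 as the upper group); the function name derive_features and the sample inputs are assumptions for illustration.

```python
import math


def derive_features(alcohol, label_appeal, stars, ph, density, chlorides):
    """Build the derived variables from already-imputed inputs.

    Chlorides must be positive so that the base-10 log is defined.
    """
    return {
        "alcohol_type": 1 if alcohol < 10.5 else 2,
        "label_group": 1 if label_appeal < 0 else 2,
        "star_impact": 0 if stars < 2 else 1,
        "real_ph": 10.0 ** (-ph),            # hydrogen ion concentration, mol/l
        "density_adjusted": density - 1.0,   # negative means lighter than water
        "chlorides_log": math.log10(chlorides),
    }


features = derive_features(alcohol=10.4, label_appeal=0, stars=2,
                           ph=3.0, density=0.9945, chlorides=0.01)
```

Because 10**(-pH) is very small for typical wine pH values, any coefficient fitted on Real_pH will be correspondingly large in magnitude.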
Model Development
Linear Model
The most appropriate linear model that I was able to develop has an R-squared of
0.3365. The regression has an average error of 0.002, with a standard deviation of 1.59.
The variable coefficients are presented below.
Variable Parameter Estimate Standard Error Type II SS F Value Pr > F
Intercept 4.59507 0.54968 172.14337 69.88 <.0001
IMP_STARS 1.34815 0.02780 5794.49325 2352.26 <.0001
IMP_Density -1.06520 0.52398 10.18023 4.13 0.0421
IMP_Sulphates -0.06317 0.02104 22.20909 9.02 0.0027
IMP_LabelAppeal 0.53029 0.02645 990.02230 401.90 <.0001
IMP_FREESULFURDIOXIDE 0.00069504 0.00015990 46.54574 18.90 <.0001
IMP_TotalSulfurDioxide 0.00076498 0.00012274 95.69266 38.85 <.0001
IMP_PH -0.12880 0.02906 48.39406 19.65 <.0001
IMP_ACIDINDEX -0.29945 0.01067 1939.71306 787.42 <.0001
IMP_CITRICACID 0.03850 0.01614 14.01150 5.69 0.0171
IMP_VOLATILEACIDITY -0.14508 0.01774 164.78316 66.89 <.0001
Alcohol_TYPE 0.17647 0.02799 97.90097 39.74 <.0001
STAR_IMPACT -1.44641 0.04898 2147.93290 871.95 <.0001
IMP_CHLORIDES_LOG -0.11579 0.02372 58.70894 23.83 <.0001
Label_GROUP 0.08371 0.05121 6.58391 2.67 0.1021
Summary of Stepwise Selection
Step Variable Entered Variable Removed Number Vars In Partial R-Square Model R-Square C(p) F Value Pr > F
1 IMP_STARS 1 0.1601 0.1601 3394.31 2438.72 <.0001
2 IMP_LabelAppeal 2 0.0639 0.2240 2164.98 1053.31 <.0001
3 STAR_IMPACT 3 0.0550 0.2790 1106.34 976.49 <.0001
4 IMP_ACIDINDEX 4 0.0454 0.3245 233.114 859.89 <.0001
5 IMP_VOLATILEACIDITY 5 0.0037 0.3282 163.250 70.99 <.0001
6 Alcohol_TYPE 6 0.0021 0.3302 125.697 39.19 <.0001
7 IMP_TotalSulfurDioxide 7 0.0022 0.3324 85.4345 42.01 <.0001
8 IMP_CHLORIDES_LOG 8 0.0013 0.3338 62.0761 25.25 <.0001
9 IMP_PH 9 0.0010 0.3348 44.0543 19.97 <.0001
10 IMP_FREESULFURDIOXIDE 10 0.0010 0.3358 27.2288 18.80 <.0001
11 IMP_Sulphates 11 0.0005 0.3362 20.0288 9.19 0.0024
12 IMP_CITRICACID 12 0.0003 0.3365 16.1780 5.85 0.0156
13 IMP_Density 13 0.0002 0.3368 13.9635 4.21 0.0401
14 Label_GROUP 14 0.0001 0.3369 13.2911 2.67 0.1021
For a wine novice, the coefficients are difficult to interpret. The result that seems
most counterintuitive is that the coefficients for IMP_STARS (expert rating) and
STAR_IMPACT have opposite signs and appear to be in conflict. Because STAR_IMPACT is
derived from IMP_STARS, the two coefficients should be read together rather than
separately. Based upon the reference sources
(http://www.piwine.com/use-and-measurement-of-sulfur-dioxide-in-wine.html ,
http://waterhouse.ucdavis.edu/whats-in-wine/fixed-acidity , and
http://winefolly.com/wp-content/uploads/2013/10/basic-wine-101-guide-infographic-poster.jpg#big),
it is possible that the combination of variables makes sense overall, as wine critic
opinions may not represent popular opinion or economic sense to the consumer.
There is some indication that label appeal drives sales, based upon LABEL_GROUP.
The following represent some examples of unique wine labels that capture consumer
interest.
Poisson Regression
The most appropriate Poisson model that I was able to develop is presented below. It
has an AIC of 49,895, and the variable coefficients are presented in the table below.
The regression has an average error of 0.025, with a standard deviation of 1.62.
Analysis Of Maximum Likelihood Parameter Estimates
Parameter DF Estimate Standard Error Wald 95% Confidence Limits Wald Chi-Square Pr > ChiSq
Intercept 1 1.5004 0.2003 1.1078 1.8930 56.10 <.0001
IMP_STARS 1 0.3348 0.0085 0.3181 0.3515 1546.53 <.0001
IMP_Density 1 -0.3517 0.1922 -0.7284 0.0250 3.35 0.0672
IMP_Sulphates 1 -0.0233 0.0079 -0.0387 -0.0079 8.80 0.0030
IMP_Alcohol 1 -0.0015 0.0020 -0.0054 0.0024 0.59 0.4430
IMP_LabelAppeal 1 0.1526 0.0090 0.1350 0.1702 287.73 <.0001
IMP_CHLORIDES 1 0.0354 0.0475 -0.0577 0.1285 0.56 0.4562
IMP_FREESULFURDIOXID 1 0.0002 0.0001 0.0001 0.0003 15.73 <.0001
IMP_TotalSulfurDioxi 1 0.0002 0.0000 0.0002 0.0003 30.22 <.0001
REAL_pH 1 89.6532 14.8763 60.4961 118.8102 36.32 <.0001
IMP_ACIDINDEX 1 -0.1173 0.0045 -0.1261 -0.1085 678.54 <.0001
IMP_RESIDUALSUGAR 1 0.0002 0.0002 -0.0001 0.0005 1.31 0.2519
IMP_CITRICACID 1 0.0129 0.0059 0.0014 0.0245 4.82 0.0281
IMP_VOLATILEACIDITY 1 -0.0476 0.0065 -0.0603 -0.0349 53.72 <.0001
IMP_FixedAcidity 1 -0.0005 0.0008 -0.0021 0.0011 0.40 0.5245
Alcohol_TYPE 1 0.0633 0.0144 0.0351 0.0915 19.37 <.0001
STAR_IMPACT 1 -0.3615 0.0178 -0.3964 -0.3265 410.32 <.0001
IMP_CHLORIDES_LOG 1 -0.0496 0.0165 -0.0820 -0.0172 8.99 0.0027
Label_GROUP 1 0.1108 0.0191 0.0732 0.1483 33.46 <.0001
Scale 0 1.0000 0.0000 1.0000 1.0000
The variable coefficients presented in this Poisson regression are consistent with the
linear regression, including the apparent conflict between IMP_STARS and STAR_IMPACT
noted earlier. The same observations also hold true for the variable coefficients
presented in the Negative Binomial regression.
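Poisson coefficients are on the log scale, so they are easier to read after exponentiation, which turns each one into a multiplicative change in the expected number of cases. The Python sketch below uses four estimates copied from the Poisson table above; the function name rate_ratio is illustrative. It also assumes the STAR_IMPACT coding in the SAS listing in the appendix (0 below 2 stars, 1 otherwise) to show how the IMP_STARS and STAR_IMPACT effects combine when a wine moves from 1 to 2 stars.

```python
import math

# Estimates copied from the Poisson table above (log scale).
POISSON_COEF = {
    "IMP_STARS": 0.3348,
    "IMP_LabelAppeal": 0.1526,
    "IMP_ACIDINDEX": -0.1173,
    "STAR_IMPACT": -0.3615,
}


def rate_ratio(coef, delta=1.0):
    """Multiplicative change in the expected count when a predictor rises by delta."""
    return math.exp(coef * delta)


ratios = {name: rate_ratio(b) for name, b in POISSON_COEF.items()}

# Moving from 1 to 2 stars raises IMP_STARS by 1 and switches STAR_IMPACT from 0 to 1,
# so the two log-scale effects add before exponentiation.
one_to_two_stars = math.exp(POISSON_COEF["IMP_STARS"] + POISSON_COEF["STAR_IMPACT"])
```

Under that coding the combined factor for the step from 1 to 2 stars is close to 1, which is one way to see why the two coefficients only make sense when read together.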
Negative Binomial Regression
The most appropriate Negative Binomial model that I was able to develop is presented
below. It has an AIC of 49,897, and the variable coefficients are presented in the table
below. The regression has an average error of 0.025, with a standard deviation of 1.62.
These results are essentially identical to the Poisson model. This occurred because the
same variable selection was used and because the Poisson distribution is a limiting case
of the Negative Binomial distribution: the estimated dispersion parameter is 0.0000, so
the fitted variance equals the mean, as in a Poisson model.
Analysis Of Maximum Likelihood Parameter Estimates
Parameter DF Estimate Standard Error Wald 95% Confidence Limits Wald Chi-Square Pr > ChiSq
Intercept 1 1.5004 0.2003 1.1078 1.8930 56.10 <.0001
IMP_STARS 1 0.3348 0.0085 0.3181 0.3515 1546.52 <.0001
IMP_Density 1 -0.3517 0.1922 -0.7284 0.0250 3.35 0.0672
IMP_Sulphates 1 -0.0233 0.0079 -0.0387 -0.0079 8.80 0.0030
IMP_Alcohol 1 -0.0015 0.0020 -0.0054 0.0024 0.59 0.4430
IMP_LabelAppeal 1 0.1526 0.0090 0.1350 0.1702 287.73 <.0001
IMP_CHLORIDES 1 0.0354 0.0475 -0.0577 0.1285 0.56 0.4562
IMP_FREESULFURDIOXID 1 0.0002 0.0001 0.0001 0.0003 15.73 <.0001
IMP_TotalSulfurDioxi 1 0.0002 0.0000 0.0002 0.0003 30.22 <.0001
REAL_pH 1 89.6532 14.8763 60.4961 118.8102 36.32 <.0001
IMP_ACIDINDEX 1 -0.1173 0.0045 -0.1261 -0.1085 678.54 <.0001
IMP_RESIDUALSUGAR 1 0.0002 0.0002 -0.0001 0.0005 1.31 0.2519
IMP_CITRICACID 1 0.0129 0.0059 0.0014 0.0245 4.82 0.0281
IMP_VOLATILEACIDITY 1 -0.0476 0.0065 -0.0603 -0.0349 53.72 <.0001
IMP_FixedAcidity 1 -0.0005 0.0008 -0.0021 0.0011 0.40 0.5245
Alcohol_TYPE 1 0.0633 0.0144 0.0351 0.0915 19.37 <.0001
STAR_IMPACT 1 -0.3615 0.0178 -0.3964 -0.3265 410.31 <.0001
IMP_CHLORIDES_LOG 1 -0.0496 0.0165 -0.0820 -0.0172 8.99 0.0027
Label_GROUP 1 0.1108 0.0191 0.0732 0.1483 33.46 <.0001
Dispersion 1 0.0000 0.0001 0.0000 2.24E122
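The match between the Negative Binomial and Poisson fits follows from their variance functions. The Python sketch below assumes the common parameterization in which the Negative Binomial variance is mu + k * mu**2 for dispersion k; the function names are illustrative.

```python
def poisson_variance(mu):
    """Poisson: the variance equals the mean."""
    return mu


def negbin_variance(mu, k):
    """Negative Binomial with dispersion k: variance = mu + k * mu**2."""
    return mu + k * mu ** 2


mean_cases = 3.0
same_as_poisson = negbin_variance(mean_cases, 0.0) == poisson_variance(mean_cases)
overdispersed = negbin_variance(mean_cases, 0.5)
```

With the dispersion estimated at 0.0000, as in the table above, the extra term vanishes and the two models give the same likelihood and coefficients.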
I then manually modified the model according to the assignment instructions. I inserted
a new variable, called EXPERT_OPINION, which was the sum of the squares of
LABEL_GROUP and STAR_IMPACT. The AIC increased to 50,177. I chose not to run
additional analysis, as the model did not improve upon the earlier Poisson model. The
table below summarizes the variable coefficients of this alternative model.
Zero Inflated Poisson Regression
The most appropriate ZIP model that I was able to develop is presented below. It has an
AIC of 40,865 and the variable coefficients are presented in the table below.
Analysis Of Maximum Likelihood Parameter Estimates
Parameter DF Estimate Standard Error Wald 95% Confidence Limits Wald Chi-Square Pr > ChiSq
Intercept 1 1.3960 0.1998 1.0045 1.7876 48.84 <.0001
IMP_STARS 1 0.1137 0.0088 0.0964 0.1309 166.49 <.0001
IMP_Density 1 -0.2694 0.1969 -0.6553 0.1164 1.87 0.1711
IMP_Sulphates 1 0.0006 0.0080 -0.0151 0.0162 0.00 0.9439
IMP_Alcohol 1 0.0003 0.0028 -0.0052 0.0058 0.01 0.9198
STAR_IMPACT 1 -0.0280 0.0187 -0.0646 0.0086 2.25 0.1339
IMP_CHLORIDES 1 -0.0389 0.0258 -0.0895 0.0116 2.28 0.1313
IMP_FREESULFURDIOXID 1 0.0000 0.0001 -0.0001 0.0002 0.46 0.4966
IMP_TotalSulfurDioxi 1 -0.0000 0.0000 -0.0001 0.0000 0.78 0.3761
IMP_ACIDINDEX 1 -0.0194 0.0049 -0.0290 -0.0098 15.64 <.0001
IMP_LabelAppeal 1 0.2413 0.0062 0.2291 0.2536 1494.55 <.0001
IMP_CITRICACID 1 0.0002 0.0087 -0.0168 0.0172 0.00 0.9807
IMP_VOLATILEACIDITY 1 -0.0220 0.0097 -0.0410 -0.0030 5.17 0.0230
IMP_FixedAcidity 1 0.0002 0.0010 -0.0017 0.0022 0.06 0.8131
REAL_pH 1 -10.0082 15.2121 -39.8233 19.8069 0.43 0.5106
Alcohol_TYPE 1 0.0795 0.0149 0.0502 0.1087 28.33 <.0001
Scale 0 1.0000 0.0000 1.0000 1.0000
Analysis Of Maximum Likelihood Zero Inflation Parameter Estimates
Parameter DF Estimate Standard Error Wald 95% Confidence Limits Wald Chi-Square Pr > ChiSq
Intercept 1 6.4613 72.2951 -135.234 148.1570 0.01 0.9288
IMP_STARS 1 -11.3195 72.2946 -153.014 130.3752 0.02 0.8756
M_STARS 1 5.8765 0.3463 5.1977 6.5553 287.88 <.0001
M_SULPHATES 1 0.0900 0.1108 -0.1271 0.3071 0.66 0.4164
IMP_LabelAppeal 1 0.6992 0.0415 0.6179 0.7805 284.40 <.0001
IMP_CHLORIDES_LOG 1 0.0575 0.0568 -0.0538 0.1688 1.03 0.3111
IMP_TotalSulfurDioxi 1 -0.0019 0.0003 -0.0025 -0.0013 42.30 <.0001
IMP_ACIDINDEX 1 0.4391 0.0255 0.3891 0.4890 296.70 <.0001
IMP_CITRICACID 1 -0.0889 0.0572 -0.2010 0.0231 2.42 0.1198
IMP_VOLATILEACIDITY 1 0.2550 0.0573 0.1426 0.3674 19.77 <.0001
REAL_pH 1 -636.197 97.4542 -827.204 -445.190 42.62 <.0001
IMP_Alcohol 1 -0.0128 0.0193 -0.0506 0.0249 0.44 0.5055
Alcohol_TYPE 1 0.3255 0.0990 0.1315 0.5194 10.82 0.0010
STAR_IMPACT 1 7.5539 72.2970 -134.146 149.2534 0.01 0.9168
The most important improvement was the inclusion of M_STARS, the indicator that the
STARS rating was missing. From the EDA, in 76% of the cases when no rating was
provided or available, no cases of wine were sold.
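A zero-inflated Poisson model combines the two tables above: the zero-inflation part gives the probability that a wine never sells, and the count part gives the expected cases otherwise. The Python sketch below assumes a logit link for the zero-inflation part and a log link for the count part; the function names and the example linear predictors are illustrative assumptions, while the M_STARS estimate is copied from the zero-inflation table.

```python
import math


def logistic(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def zip_expected_count(count_lp, zero_lp):
    """Expected cases: (1 - P(structural zero)) * exp(count linear predictor)."""
    return (1.0 - logistic(zero_lp)) * math.exp(count_lp)


def zip_prob_zero(count_lp, zero_lp):
    """P(Y = 0): structural zeros plus zeros from the Poisson part."""
    p_zero = logistic(zero_lp)
    return p_zero + (1.0 - p_zero) * math.exp(-math.exp(count_lp))


# Illustrative linear predictors: a Poisson mean of 4 cases and even odds of a structural zero.
expected = zip_expected_count(math.log(4.0), 0.0)
prob_zero = zip_prob_zero(math.log(4.0), 0.0)

# M_STARS estimate from the zero-inflation table, expressed as an odds multiplier.
m_stars_odds_multiplier = math.exp(5.8765)
```

On the odds scale, the M_STARS estimate multiplies the odds of a structural zero several hundred times, which agrees with the EDA finding for unrated wines.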
Zero Inflated Negative Binomial Regression
The most appropriate ZINB model that I was able to develop is presented below. It has
an AIC of 43,937 and the variable coefficients are presented in the table below.
Analysis Of Maximum Likelihood Parameter Estimates
Parameter DF Estimate Standard Error Wald 95% Confidence Limits Wald Chi-Square Pr > ChiSq
Intercept 1 1.1376 0.2055 0.7349 1.5403 30.65 <.0001
IMP_STARS 1 0.1155 0.0088 0.0983 0.1328 171.82 <.0001
IMP_Density 1 -0.2516 0.1968 -0.6374 0.1341 1.63 0.2011
IMP_Sulphates 1 0.0007 0.0080 -0.0149 0.0163 0.01 0.9305
IMP_Alcohol 1 0.0001 0.0028 -0.0054 0.0056 0.00 0.9748
IMP_LabelAppeal 1 0.2007 0.0091 0.1829 0.2186 483.72 <.0001
IMP_CHLORIDES 1 0.0059 0.0492 -0.0906 0.1024 0.01 0.9048
IMP_FREESULFURDIOXID 1 0.0000 0.0001 -0.0001 0.0002 0.40 0.5271
IMP_TotalSulfurDioxi 1 -0.0000 0.0000 -0.0001 0.0000 0.72 0.3967
REAL_pH 1 -9.1897 15.2098 -39.0003 20.6209 0.37 0.5457
IMP_ACIDINDEX 1 -0.0190 0.0049 -0.0286 -0.0094 15.05 0.0001
IMP_RESIDUALSUGAR 1 0.0000 0.0002 -0.0005 0.0005 0.00 0.9677
IMP_CITRICACID 1 0.0002 0.0087 -0.0168 0.0172 0.00 0.9844
IMP_VOLATILEACIDITY 1 -0.0221 0.0097 -0.0411 -0.0031 5.19 0.0227
IMP_FixedAcidity 1 0.0002 0.0010 -0.0018 0.0021 0.03 0.8563
Alcohol_TYPE 1 0.0795 0.0149 0.0502 0.1088 28.32 <.0001
STAR_IMPACT 1 -0.0329 0.0187 -0.0695 0.0037 3.10 0.0785
IMP_CHLORIDES_LOG 1 -0.0191 0.0171 -0.0527 0.0144 1.25 0.2639
Label_GROUP 1 0.1207 0.0196 0.0823 0.1591 37.94 <.0001
Dispersion 1 0.0000 0.0000 0.0000 1.007E39
Analysis Of Maximum Likelihood Zero Inflation Parameter Estimates
Parameter DF Estimate Standard Error Wald 95% Confidence Limits Wald Chi-Square Pr > ChiSq
Intercept 1 6.7501 84.4772 -158.822 172.3224 0.01 0.9363
IMP_STARS 1 -11.6220 84.4768 -177.193 153.9494 0.02 0.8906
IMP_LabelAppeal 1 0.7127 0.0419 0.6306 0.7948 289.62 <.0001
IMP_CHLORIDES_LOG 1 0.0524 0.0569 -0.0592 0.1640 0.85 0.3572
M_STARS 1 5.8954 0.3528 5.2040 6.5869 279.29 <.0001
M_SULPHATES 1 0.0905 0.1110 -0.1271 0.3080 0.66 0.4150
IMP_TotalSulfurDioxi 1 -0.0019 0.0003 -0.0025 -0.0013 42.16 <.0001
IMP_ACIDINDEX 1 0.4398 0.0255 0.3898 0.4899 296.47 <.0001
IMP_CITRICACID 1 -0.0895 0.0573 -0.2018 0.0228 2.44 0.1183
IMP_VOLATILEACIDITY 1 0.2559 0.0575 0.1432 0.3685 19.83 <.0001
REAL_pH 1 -637.209 97.6977 -828.693 -445.725 42.54 <.0001
IMP_Alcohol 1 -0.0134 0.0193 -0.0513 0.0245 0.48 0.4871
Alcohol_TYPE 1 0.3283 0.0992 0.1338 0.5227 10.95 0.0009
STAR_IMPACT 1 7.8393 84.4789 -157.736 173.4149 0.01 0.9261
Model Selection
Model AIC
Poisson 49,877
Negative Binomial 49,877
Negative Binomial (modified) 50,902
Zero Inflated Poisson 40,865
Zero Inflated Negative Binomial (modified) 43,937
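The comparison in the table above reduces to picking the smallest AIC. A short Python sketch using the AIC values from the table follows; the dictionary and variable names are illustrative.

```python
# AIC values copied from the model selection table above.
AIC_BY_MODEL = {
    "Poisson": 49877,
    "Negative Binomial": 49877,
    "Negative Binomial (modified)": 50902,
    "Zero Inflated Poisson": 40865,
    "Zero Inflated Negative Binomial (modified)": 43937,
}

# Lower AIC is better; the difference from the best model shows how far each one trails.
best_model = min(AIC_BY_MODEL, key=AIC_BY_MODEL.get)
delta_aic = {name: aic - AIC_BY_MODEL[best_model] for name, aic in AIC_BY_MODEL.items()}
```

AIC values are only comparable between models fitted to the same response and the same records, which is the case for the count models listed here.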
The model I chose was the ZIP model, based upon its AIC, which is the lowest of the
models compared. The scoring code for this model yields the following histogram.
A strength of the model is that approximately 80% of the projections are within 1.5
cases of the target value and over 30% of the projections are on target (see Appendix 2).
A weakness of the model is that records with 0 cases are undercounted.
Based upon the instruction set for this assignment, the linear model could not be
considered. However, Occam's Razor, which states "…when you have two competing
theories that make exactly the same predictions, the simpler one is the better"
(source: www.math.ucr.edu/home/baez/physics/General/occam.html), applies.
The performance of the linear regression over the range of concern for the model was
equally, or nearly equally, accurate.
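The accuracy figures quoted above are shares of predictions that fall within a tolerance of the observed value. A minimal Python sketch of that calculation is shown below; the function name share_within and the four prediction and target pairs are hypothetical.

```python
def share_within(predictions, targets, tolerance):
    """Fraction of predictions whose absolute error is at most `tolerance`."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and equal in length")
    hits = sum(1 for p, t in zip(predictions, targets) if abs(p - t) <= tolerance)
    return hits / len(targets)


# Hypothetical scored values and observed case counts.
predicted_cases = [3.2, 0.9, 5.0, 4.1]
observed_cases = [3, 0, 2, 4]
within_range = share_within(predicted_cases, observed_cases, tolerance=1.5)
```

Running the same function on the errors of the linear and ZIP models would allow the two to be compared on the measure used in the text.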
[Figure: Distribution of P_SCORE_ZIP. Horizontal axis: P_SCORE_ZIP (0 to 8); vertical axis: Percent (0 to 30).]
Model Interpretation
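The following tables summarize the selected zero-inflated model and the meaning of the
respective coefficients. The coefficient values listed are those reported in the Zero
Inflated Negative Binomial output above, which are close to the ZIP estimates for most
variables.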
Maximum Likelihood Parameter Estimates
Parameter Coefficient Interpretation
Intercept 1.1376
IMP_STARS 0.1155 A higher star rating increases wine sales.
IMP_Density -0.2516 Increased wine density reduces wine sales.
IMP_Sulphates 0.0007 A higher concentration of sulphates increases wine sales.
IMP_Alcohol 0.0001 Increased alcohol content increases the amount of wine sales.
IMP_LabelAppeal 0.2007 A higher label appeal rating increases wine sales.
IMP_CHLORIDES 0.0059 A higher concentration of chlorides increases the amount of wine sales.
IMP_FREESULFURDIOXID 0.0000 The presence of free sulfur dioxide has no impact on the amount of wine sales.
IMP_TotalSulfurDioxi -0.0000 The presence of total sulfur dioxide has no impact on the amount of wine sales.
REAL_pH -9.1897 A higher hydrogen ion concentration (lower pH) reduces the amount of wine sales.
IMP_ACIDINDEX -0.0190 Acid index has a negative impact on the amount of wine sales.
IMP_RESIDUALSUGAR 0.0000 Residual sugar has no impact on the amount of wine sales.
IMP_CITRICACID 0.0002 A higher citric acid concentration increases wine sales.
IMP_VOLATILEACIDITY -0.0221 Higher volatile acidity decreases the amount of wine sales.
IMP_FixedAcidity 0.0002 Higher fixed acidity increases the amount of wine sales.
Alcohol_TYPE 0.0795 Wines having alcohol of 10.5% or greater sell in greater amounts.
STAR_IMPACT -0.0329 With IMP_STARS held fixed, wines in the higher star group (2 or more stars) sell slightly less than wines in the lower star group.
IMP_CHLORIDES_LOG -0.0191 The logarithm of chlorides negatively impacts wine sales.
Label_GROUP 0.1207 Wines with a label rating below 0 sell in smaller amounts than wines with labels rated 0 or higher.
Maximum Likelihood Zero Inflation Parameter Estimates
Parameter Coefficient Interpretation
Intercept 6.7501
IMP_STARS -11.6220 A higher star rating reduces the probability that none of this wine will be sold.
IMP_LabelAppeal 0.7127 A higher label appeal rating increases the probability that none of this wine will be sold.
IMP_CHLORIDES_LOG 0.0524 A higher log concentration of chlorides in the wine increases the probability that none of this wine will be sold.
M_STARS 5.8954 A missing record for STARS results in an increased probability that none of the particular wine will be sold.
M_SULPHATES 0.0905 A missing record for SULPHATES results in an increased probability that none of the particular wine will be sold.
IMP_TotalSulfurDioxi -0.0019 A higher concentration of total sulfur dioxide decreases the probability that none of this wine will be sold.
IMP_ACIDINDEX 0.4398 A higher acid index increases the probability that none of this wine will be sold.
IMP_CITRICACID -0.0895 A higher concentration of citric acid decreases the probability that none of this wine will be sold.
IMP_VOLATILEACIDITY 0.2559 Higher volatile acidity increases the probability that none of this wine will be sold.
REAL_pH -637.209 A higher hydrogen ion concentration (REAL_pH = 10**(-pH), that is, a lower pH) decreases the probability that none of this wine will be sold.
IMP_Alcohol -0.0134 A higher concentration of alcohol decreases the probability that none of this wine will be sold.
Alcohol_TYPE 0.3283 The shift from lower alcohol wines (<10.5%, almost all whites and lighter reds) to higher alcohol wines (heartier, drier wines, primarily reds) increases the probability that none of this wine will be sold.
STAR_IMPACT 7.8393 The shift from lower rated wines (<2 stars) to higher rated wines (2 or more stars) increases the probability that none of this wine will be sold.
File Attachments
File Name Contents Comments
CDOROW_PRD411_SEC55_PROJ3TEST.sas Test code SAS
CDOROW_PRED411_PROJ3_SCORE_FILE.sas Scored data Bingo Bonus for .sas file.
CDOROW_PRED411_PROJ3_SCORE.csv CSV file contingency
CDOROW_SEC55_MODELWINNER_PROJ3.sas
Appendix 1
Correlation of Continuous Variables
[Figure: Distribution of TARGET. Horizontal axis: TARGET (0 to 8); vertical axis: Percent (0 to 25).]
Appendix 2
Selected Regression Error Histograms
Linear Regression
Zero Inflated Poisson Regression
Linear Regression Error Histogram
[Figure: Distribution of TARGET_ERROR. Horizontal axis: TARGET_ERROR (-5 to 5); vertical axis: Percent (0 to 30).]
Zero Inflated Poisson Regression
[Figure: Distribution of error_term. Horizontal axis: error_term (-6 to 7); vertical axis: Percent (0 to 40).]
Linear Regression
libname mydata '/folders/myfolders' access=readonly;
proc contents data=mydata.wine;
run;
data work.wine_scrub;
   set mydata.wine;
   * cleaning up variables;
   TARGET_FLAG = ( TARGET > 0 );
   TARGET_AMT = TARGET - 1;
   if TARGET_FLAG = 0 then TARGET_AMT = .;
   * start each IMP_ variable as a copy of the raw variable;
   IMP_STARS = STARS;
   IMP_Density = Density;
   IMP_Sulphates = Sulphates;
   IMP_Alcohol = Alcohol;
   IMP_LabelAppeal = LabelAppeal;
   IMP_CHLORIDES = Chlorides;
   IMP_FREESULFURDIOXIDE = FREESULFURDIOXIDE;
   IMP_TotalSulfurDioxide = TotalSulfurDioxide;
   IMP_PH = pH;
   IMP_ACIDINDEX = ACIDINDEX;
   IMP_RESIDUALSUGAR = ResidualSugar;
   IMP_CITRICACID = CitricAcid;
   IMP_VOLATILEACIDITY = VolatileAcidity;
   IMP_FixedAcidity = FixedAcidity;
   * missing-value flags, 1 means the original value was missing;
   M_STARS = 0;
   M_RESIDUALSUGAR = 0;
   M_CHLORIDES = 0;
   M_FREESULFURDIOXIDE = 0;
   M_TOTALSULFURDIOXIDE = 0;
   M_SULPHATES = 0;
   M_ALCOHOL = 0;
   M_PH = 0;
   * impute missing values and set the matching flag;
   if missing(STARS) then do;
      IMP_STARS = 2;
      M_STARS = 1;
   end;
   if missing(Density) then IMP_Density = 0.9942027;
   if missing(Sulphates) then do;
      IMP_Sulphates = 0.5271118;
      M_SULPHATES = 1;
   end;
   if missing(Alcohol) then do;
      IMP_Alcohol = 10.4892363;
      M_ALCOHOL = 1;
   end;
   if missing(pH) then do;
      * a missing pH is imputed as 4;
      IMP_pH = 4;
      M_pH = 1;
   end;
   if missing(LabelAppeal) then IMP_LabelAppeal = 0;
   if missing(TotalSulfurDioxide) then IMP_TotalSulfurDioxide = 120.7142326;
   if missing(FreeSulfurDioxide) then IMP_FreeSulfurDioxide = 30.845;
   if missing(Chlorides) then IMP_Chlorides = 0.046;
   * floor chlorides at 0.01 so that the log of chlorides is defined;
   if IMP_Chlorides <= 0.01 then IMP_Chlorides = 0.01;
   * disabled transforms of total sulfur dioxide;
   *IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * sqrt( abs(IMP_TotalSulfurDioxide)+1 );
   *IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * log( abs(IMP_TotalSulfurDioxide)+1 );
   * more than 10 mg/l requires labeling, >350 mg/l is prohibited, limits based upon requirements;
   if IMP_TotalSulfurDioxide < 10 then IMP_TotalSulfurDioxide = 10;
   if IMP_TotalSulfurDioxide > 350 then IMP_TotalSulfurDioxide = 350;
   if IMP_FreeSulfurDioxide < 10 then IMP_FreeSulfurDioxide = 10;
   if IMP_FreeSulfurDioxide > 350 then IMP_FreeSulfurDioxide = 350;
   * a pH of 0.48 is high concentration acid that is unfit for human consumption;
   if IMP_PH < 3 then IMP_PH = 3;
   if IMP_Sulphates < 0 then IMP_Sulphates = 0;
   if missing(ResidualSugar) then IMP_ResidualSugar = 3.9;
   * grouping of wines by http://winefolly.com/wp-content/uploads/2013/10/basic-wine-101-guide-infographic-poster.jpg#big
     light to heavy, which is a crude classification of white to red;
   if IMP_Alcohol < 10.5 then Alcohol_TYPE = 1;
   if IMP_Alcohol >= 10.5 then Alcohol_TYPE = 2;
   if IMP_LabelAppeal < 0 then Label_GROUP = 1;
   if IMP_LabelAppeal >= 0 then Label_GROUP = 2;
   if IMP_STARS < 2 then STAR_IMPACT = 0;
   if IMP_STARS >= 2 then STAR_IMPACT = 1;
   * hydrogen ion concentration in moles/liter;
   REAL_pH = 10**(-IMP_pH);
   density_adjusted = density - 1;
   IMPURITIES = IMP_Chlorides + IMP_Sulphates + IMP_FreeSulfurDioxide + IMP_TotalSulfurDioxide;
   TARGET_FLAG = ( TARGET > 0 );
   TARGET_AMT = TARGET - 1;
   if TARGET_FLAG = 0 then TARGET_AMT = .;
   IMP_CHLORIDES_LOG = LOG10(IMP_CHLORIDES);
   * TARGET_LOG is 1 when any cases were sold;
   TARGET_LOG = 0;
   if TARGET > 0 then TARGET_LOG = 1;
run;
proc means data=work.wine_scrub n nmiss median mean min max stddev;
run;
proc reg data = work.wine_scrub;
   stepwise: model TARGET =
      IMP_STARS
      IMP_Density
      IMP_Sulphates
      IMP_Alcohol
      IMP_LabelAppeal
      IMP_CHLORIDES
      IMP_FREESULFURDIOXIDE
      IMP_TotalSulfurDioxide
      IMP_PH
      IMP_ACIDINDEX
[pages 32-33 not included]
M_SULPHATES =1;
end;
if missing(Alcohol) then do;
IMP_Alcohol = 10.4892363;
M_ALCOHOL =1;
end;
if missing(pH) then do;
IMP_pH = 4;
M_pH =1;
*typical wine pH is now 4;
end;
if missing(LabelAppeal) then IMP_LabelAppeal = 0;
if missing(TotalSulfurDioxide) then IMP_TotalSulfurDioxide = 120.7142326;
if missing(FreeSulfurDioxide) then IMP_FreeSulfurDioxide = 30.845;
if missing(Chlorides) then IMP_Chlorides = 0.046;
if IMP_Chlorides <= 0.01 then IMP_Chlorides = 0.01;
*IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * sqrt(
abs(IMP_TotalSulfurDioxide)+1 );
*IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * log(
abs(IMP_TotalSulfurDioxide)+1 );
if IMP_TotalSulfurDioxide < 10 then IMP_TotalSulfurDioxide = 10 ;
if IMP_TotalSulfurDioxide > 350 then IMP_TotalSulfurDioxide = 350;
if IMP_FreeSulfurDioxide < 10 then IMP_FreeSulfurDioxide = 10 ;
if IMP_FreeSulfurDioxide > 350 then IMP_FreeSulfurDioxide = 350;
* more than 10 mg/l requires labeling, >350 mg/l is prohibited, limits based upon
requirements;
if IMP_PH < 3 then IMP_PH=3;
*a pH of 0.48 is high concentration acid that is unfit for human consumption;
if IMP_Sulphates <0 then IMP_SULPHATES= 0;
if missing(ResidualSugar) then IMP_ResidualSugar = 3.9;
*grouping of wines by http://winefolly.com/wp-content/uploads/2013/10/basic-wine-101-guide-infographic-poster.jpg#big
light to heavy, which is a crude classification of white to red;
if IMP_Alcohol < 10.5 then Alcohol_TYPE=1;
if IMP_Alcohol >= 10.5 then Alcohol_TYPE=2;
if IMP_LabelAppeal <0 then Label_GROUP =1;
if IMP_LabelAppeal >=0 then Label_GROUP = 2;
if IMP_STARS <2 then STAR_IMPACT = 0;
if IMP_STARS >=2 then STAR_IMPACT = 1;
REAL_pH = 10**(-IMP_pH);
density_adjusted = density - 1;
IMPURITIES = IMP_Chlorides + IMP_sulphates + IMP_FreeSulfurDioxide +
IMP_TotalSulfurDioxide;
TARGET_FLAG = ( TARGET > 0 );
TARGET_AMT = TARGET - 1;
if TARGET_FLAG = 0 then TARGET_AMT = .;
IMP_CHLORIDES_LOG = LOG10(IMP_CHLORIDES);
TARGET_LOG=0;
IF TARGET>0 then TARGET_LOG=1;
run;
proc means data=work.wine_scrub n nmiss median mean min max stddev;
run;
proc genmod data = work.wine_scrub;
stepwise:
model TARGET =
IMP_STARS
IMP_Density
IMP_Sulphates
IMP_Alcohol
IMP_LabelAppeal
IMP_CHLORIDES
IMP_FREESULFURDIOXIDE
IMP_TotalSulfurDioxide
REAL_PH
IMP_ACIDINDEX
IMP_RESIDUALSUGAR
IMP_CITRICACID
IMP_VOLATILEACIDITY
[pages 36-38 not included]
M_SULPHATES =1;
end;
if missing(Alcohol) then do;
IMP_Alcohol = 10.4892363;
M_ALCOHOL =1;
end;
if missing(pH) then do;
IMP_pH = 4;
M_pH =1;
*typical wine pH is now 4;
end;
if missing(LabelAppeal) then IMP_LabelAppeal = 0;
if missing(TotalSulfurDioxide) then IMP_TotalSulfurDioxide = 120.7142326;
if missing(FreeSulfurDioxide) then IMP_FreeSulfurDioxide = 30.845;
if missing(Chlorides) then IMP_Chlorides = 0.046;
if IMP_Chlorides <= 0.01 then IMP_Chlorides = 0.01;
*IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * sqrt(
abs(IMP_TotalSulfurDioxide)+1 );
*IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * log(
abs(IMP_TotalSulfurDioxide)+1 );
if IMP_TotalSulfurDioxide < 10 then IMP_TotalSulfurDioxide = 10 ;
if IMP_TotalSulfurDioxide > 350 then IMP_TotalSulfurDioxide = 350;
if IMP_FreeSulfurDioxide < 10 then IMP_FreeSulfurDioxide = 10 ;
if IMP_FreeSulfurDioxide > 350 then IMP_FreeSulfurDioxide = 350;
* more than 10 mg/l requires labeling, >350 mg/l is prohibited, limits based upon
requirements;
if IMP_PH < 3 then IMP_PH=3;
*a pH of 0.48 is high concentration acid that is unfit for human consumption;
if IMP_Sulphates <0 then IMP_SULPHATES= 0;
if missing(ResidualSugar) then IMP_ResidualSugar = 3.9;
*grouping of wines by http://winefolly.com/wp-content/uploads/2013/10/basic-wine-101-guide-infographic-poster.jpg#big
light to heavy, which is a crude classification of white to red;
if IMP_Alcohol < 10.5 then Alcohol_TYPE=1;
if IMP_Alcohol >= 10.5 then Alcohol_TYPE=2;
if IMP_LabelAppeal <0 then Label_GROUP =1;
if IMP_LabelAppeal >=0 then Label_GROUP = 2;
if IMP_STARS <2 then STAR_IMPACT = 0;
if IMP_STARS >=2 then STAR_IMPACT = 1;
REAL_pH = 10**(-IMP_pH);
density_adjusted = IMP_Density - 1;
IMPURITIES = IMP_Chlorides + IMP_sulphates + IMP_FreeSulfurDioxide +
IMP_TotalSulfurDioxide;
EXPERT_OPINION = (STAR_IMPACT**2) + (LABEL_GROUP**2);
TARGET_FLAG = ( TARGET > 0 );
TARGET_AMT = TARGET - 1;
if TARGET_FLAG = 0 then TARGET_AMT = .;
IMP_CHLORIDES_LOG = LOG10(IMP_CHLORIDES);
TARGET_LOG=0;
IF TARGET>0 then TARGET_LOG=1;
run;
proc means data=work.wine_scrub n nmiss median mean min max stddev;
run;
proc genmod data = work.wine_scrub;
model TARGET =
IMP_STARS
IMP_Density
IMP_Sulphates
IMP_Alcohol
IMP_LabelAppeal
IMP_CHLORIDES
IMP_FREESULFURDIOXIDE
IMP_TotalSulfurDioxide
REAL_PH
IMP_ACIDINDEX
IMP_RESIDUALSUGAR
IMP_CITRICACID
IMP_VOLATILEACIDITY
Zero Inflated Poisson
libname mydata '/folders/myfolders' access=readonly;
proc contents data=mydata.wine;
run;
data work.wine_scrub;
set mydata.wine;
*cleaning up variables;
TARGET_FLAG = ( TARGET > 0 );
TARGET_AMT = TARGET - 1;
if TARGET_FLAG = 0 then TARGET_AMT = .;
IMP_STARS = STARS;
IMP_Density = Density;
IMP_Sulphates = Sulphates;
IMP_Alcohol = Alcohol;
IMP_LabelAppeal = LabelAppeal;
IMP_CHLORIDES = Chlorides;
IMP_FREESULFURDIOXIDE = FREESULFURDIOXIDE;
IMP_TotalSulfurDioxide = TotalSulfurDioxide;
IMP_PH = pH;
IMP_ACIDINDEX = ACIDINDEX;
IMP_RESIDUALSUGAR = ResidualSugar;
IMP_CITRICACID = CitricAcid;
IMP_VOLATILEACIDITY = VolatileAcidity;
IMP_FixedAcidity = FixedAcidity;
*missing counts;
M_STARS = 0;
M_RESIDUALSUGAR = 0;
M_CHLORIDES = 0;
M_FREESULFURDIOXIDE = 0;
M_TOTALSULFURDIOXIDE = 0;
M_SULPHATES = 0;
M_ALCOHOL = 0;
M_pH = 0;
if missing(STARS) then do;
IMP_STARS = 2;
M_STARS = 1;
end;
if missing(Density) then IMP_Density = 0.9942027;
if missing(Sulphates) then do;
IMP_Sulphates = 0.5271118;
M_SULPHATES =1;
end;
if missing(Alcohol) then do;
IMP_Alcohol = 10.4892363;
M_ALCOHOL =1;
end;
if missing(pH) then do;
IMP_pH = 4;
M_pH =1;
*typical wine pH is now 4;
end;
if missing(STARS) then do;
IMP_STARS = 2;
M_STARS = 1;
end;
if missing(LabelAppeal) then IMP_LabelAppeal = 0;
if missing(TotalSulfurDioxide) then IMP_TotalSulfurDioxide = 120.7142326;
if missing(FreeSulfurDioxide) then IMP_FreeSulfurDioxide = 30.845;
if missing(Chlorides) then IMP_Chlorides = 0.046;
if IMP_Chlorides <= 0.01 then IMP_Chlorides = 0.01;
*IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * sqrt(
abs(IMP_TotalSulfurDioxide)+1 );
*IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * log(
abs(IMP_TotalSulfurDioxide)+1 );
if IMP_TotalSulfurDioxide < 10 then IMP_TotalSulfurDioxide = 10 ;
if IMP_TotalSulfurDioxide > 350 then IMP_TotalSulfurDioxide = 350;
if IMP_FreeSulfurDioxide < 10 then IMP_FreeSulfurDioxide = 10 ;
if IMP_FreeSulfurDioxide > 350 then IMP_FreeSulfurDioxide = 350;
* more than 10 mg/l requires labeling, >350 mg/l is prohibited, limits based upon
requirements;
if IMP_PH < 3 then IMP_PH=3;
*a pH of 0.48 is high concentration acid that is unfit for human consumption;
if IMP_Sulphates <0 then IMP_SULPHATES= 0;
if missing(ResidualSugar) then IMP_ResidualSugar = 3.9;
if IMP_Alcohol <9 then IMP_Alcohol =9.0;
if IMP_ResidualSugar <1 then IMP_ResidualSugar=1;
if IMP_CITRICACID <0 then IMP_CITRICACID=0;
if IMP_VOLATILEACIDITY <0 then IMP_VOLATILEACIDITY=0;
if IMP_FixedAcidity <0 then IMP_FixedAcidity =0;
*grouping of wines by http://winefolly.com/wp-content/uploads/2013/10/basic-wine-101-guide-infographic-poster.jpg#big
light to heavy, which is a crude classification of white to red;
if IMP_Alcohol < 10.5 then Alcohol_TYPE=1;
if IMP_Alcohol >= 10.5 then Alcohol_TYPE=2;
if IMP_LabelAppeal <0 then Label_GROUP =1;
if IMP_LabelAppeal >=0 then Label_GROUP = 2;
if IMP_STARS <2 then STAR_IMPACT = 0;
if IMP_STARS >=2 then STAR_IMPACT = 1;
ALCOHOL_EMP = ALCOHOL_TYPE**2;
STAR_EMP = STAR_IMPACT**2;
EXPERT_INFLUENCE = ALCOHOL_TYPE + STAR_IMPACT +LABEL_GROUP;
REAL_pH = 10**(-IMP_pH);
density_adjusted = IMP_Density - 1;
IMPURITIES = IMP_Chlorides + IMP_sulphates + IMP_FreeSulfurDioxide +
IMP_TotalSulfurDioxide;
TARGET_FLAG = ( TARGET > 0 );
TARGET_AMT = TARGET - 1;
if TARGET_FLAG = 0 then TARGET_AMT = .;
IMP_CHLORIDES_LOG = LOG10(IMP_CHLORIDES);
TARGET_LOG=0;
IF TARGET>0 then TARGET_LOG=1;
run;
proc means data=work.wine_scrub n nmiss median mean min max stddev;
run;
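The data step above prepares work.wine_scrub for the Zero Inflated Poisson model. As a minimal sketch only, a fit of that kind could be requested as shown here. The predictor choices (IMP_STARS, IMP_LabelAppeal, IMP_ACIDINDEX in the count model; IMP_STARS and M_STARS in the zero model) and the output names work.wine_zip, P_TARGET, and P_ZERO are assumptions made for illustration, not the model reported in this project.

```sas
* Illustrative sketch only: predictor lists and output names are assumptions;
proc genmod data = work.wine_scrub;
* count model: Poisson mean with a log link;
model TARGET =
IMP_STARS
IMP_LabelAppeal
IMP_ACIDINDEX
/ dist = zip link = log;
* zero model: logit model for the excess-zero probability;
zeromodel IMP_STARS M_STARS / link = logit;
* pred = predicted mean, pzero = estimated zero-inflation probability;
output out = work.wine_zip pred = P_TARGET pzero = P_ZERO;
run;
```

The MODEL statement describes the number of cases sold, while the ZEROMODEL statement describes the probability that a wine sells no cases at all, which is where the missing-STARS flag would be expected to matter most.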
end;
if missing(Alcohol) then do;
IMP_Alcohol = 10.4892363;
M_ALCOHOL =1;
end;
if missing(pH) then do;
IMP_pH = 4;
M_pH =1;
*typical wine pH is now 4;
end;
if missing(LabelAppeal) then do;
IMP_LabelAppeal = 0;
M_LabelAppeal =1;
end;
if missing(TotalSulfurDioxide) then IMP_TotalSulfurDioxide = 120.7142326;
if missing(FreeSulfurDioxide) then IMP_FreeSulfurDioxide = 30.845;
if missing(Chlorides) then IMP_Chlorides = 0.046;
if IMP_Chlorides <= 0.01 then IMP_Chlorides = 0.01;
*IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * sqrt(
abs(IMP_TotalSulfurDioxide)+1 );
*IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * log(
abs(IMP_TotalSulfurDioxide)+1 );
if IMP_TotalSulfurDioxide < 10 then IMP_TotalSulfurDioxide = 10 ;
if IMP_TotalSulfurDioxide > 350 then IMP_TotalSulfurDioxide = 350;
if IMP_FreeSulfurDioxide < 10 then IMP_FreeSulfurDioxide = 10 ;
if IMP_FreeSulfurDioxide > 350 then IMP_FreeSulfurDioxide = 350;
* more than 10 mg/l requires labeling, >350 mg/l is prohibited, limits based upon
requirements;
if IMP_PH < 3 then IMP_PH=3;
*a pH of 0.48 is high concentration acid that is unfit for human consumption;
if IMP_Sulphates <0 then IMP_SULPHATES= 0;
if missing(ResidualSugar) then IMP_ResidualSugar = 3.9;
if IMP_Alcohol <9 then IMP_Alcohol =9.0;
if IMP_ResidualSugar <1 then IMP_ResidualSugar=1;
if IMP_CITRICACID <0 then IMP_CITRICACID=0;
if IMP_VOLATILEACIDITY <0 then IMP_VOLATILEACIDITY=0;
if IMP_FixedAcidity <0 then IMP_FixedAcidity =0;
*grouping of wines by http://winefolly.com/wp-content/uploads/2013/10/basic-wine-101-guide-infographic-poster.jpg#big
light to heavy, which is a crude classification of white to red;
if IMP_Alcohol < 10.5 then Alcohol_TYPE=1;
if IMP_Alcohol >= 10.5 then Alcohol_TYPE=2;
if IMP_LabelAppeal <0 then Label_GROUP =1;
if IMP_LabelAppeal >=0 then Label_GROUP = 2;
if IMP_STARS <2 then STAR_IMPACT = 0;
if IMP_STARS >=2 then STAR_IMPACT = 1;
ALCOHOL_EMP = ALCOHOL_TYPE**2;
STAR_EMP = STAR_IMPACT**2;
EXPERT_INFLUENCE = ALCOHOL_TYPE + STAR_IMPACT +LABEL_GROUP;
REAL_pH = 10**(-IMP_pH);
density_adjusted = IMP_Density - 1;
IMPURITIES = IMP_Chlorides + IMP_sulphates + IMP_FreeSulfurDioxide +
IMP_TotalSulfurDioxide;
TARGET_FLAG = ( TARGET > 0 );
TARGET_AMT = TARGET - 1;
if TARGET_FLAG = 0 then TARGET_AMT = .;
IMP_CHLORIDES_LOG = LOG10(IMP_CHLORIDES);
TARGET_LOG=0;
IF TARGET>0 then TARGET_LOG=1;
run;
proc means data=work.wine_scrub n nmiss median mean min max stddev;
run;
proc genmod data = work.wine_scrub;
model TARGET =
IMP_STARS
IMP_Density
IMP_Sulphates
IMP_Alcohol