C. Dorow, PRED411, Sec55 Insurance, Project 3, May 31, 2015
Predict 411 Section 55 Project 3
‘Wine Sales Review’
By Christopher Dorow
Due Date: May 31, 2015
File Name: Chris_Dorow_PRED411_Sec55_PROJ3.PDF
Results Summary and Conclusion
Several models were developed to predict the probability and amount of wine sales
based upon a collection of variables. The training data consisted of 12,795 records.
The best model from my investigation was a Zero Inflated Poisson Regression, which
yielded a model AIC of 40,865. The factors that most influenced wine sales were the
presence of a rating and the label appeal: wines without a STARS rating sold poorly,
and greater label appeal was associated with higher wine sales.
Introduction
The purpose of this assignment is to develop a regression that will predict the number of
cases of wine sold based upon the data set provided. Variables included in this data set
are listed below:
• Acid index, a measurement of total acidity
• Alcohol content
• Chloride content of wine
• Citric acid content
• Wine density
• Wine fixed acidity
• Free sulfur dioxide content
• Label appeal
• Residual sugar
• Independent rating by stars
• Sulphate content of wine
• Total sulfur dioxide
• Volatile acidity
• Wine pH
Evaluations of data quality will be made, including identification of missing or outlier
data. Linear, Poisson, Zero Inflated Poisson, Negative Binomial, and Zero Inflated
Negative Binomial regressions will be generated and compared. The best model will
be selected that predicts the amount of wine sold, in cases.
Data Exploration
Within the provided information was a data dictionary, which is copied below.
Variable Name | Definition | Theoretical Effect
INDEX | Identification Variable (do not use) | None
TARGET | Number of Cases Purchased | None
AcidIndex | Proprietary method of testing total acidity of wine by using a weighted average |
Alcohol | Alcohol Content |
Chlorides | Chloride content of wine |
CitricAcid | Citric Acid Content |
Density | Density of Wine |
FixedAcidity | Fixed Acidity of Wine |
FreeSulfurDioxide | Sulfur Dioxide content of wine |
LabelAppeal | Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customers don't like the design. | Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales.
ResidualSugar | Residual Sugar of wine |
STARS | Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor | A high number of stars suggests high sales
Sulphates | Sulfate content of wine |
TotalSulfurDioxide | Total Sulfur Dioxide of Wine |
VolatileAcidity | Volatile Acid content of wine |
pH | pH of wine |
Continuous variables were reviewed, and I could not discern trends that could be utilized
among the continuous data. However, upon reviewing the contingency tables, I was
able to locate two key variables. The tables are located in Appendix 1.
The first contingency table considered LabelAppeal and TARGET. Lower rated labels had
lower target values. Given the examples below, an appealing label and bottle
combination can be very useful in grabbing the attention of the consumer.
The second contingency table that was very useful was the Stars and Target table.
When there was no rating, no sales occurred in just over 2,000 records, or one-sixth of
the training data. The consumers seem to shy away from the unknown quality when it
comes to wine.
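The shares quoted from the STARS by TARGET table can be recomputed directly from its counts. The following is a minimal Python sketch (the project itself used SAS); the variable and function names are illustrative, and the counts are copied from the table in Appendix 1.

```python
# Counts copied from the STARS-by-TARGET contingency table in Appendix 1.
unrated_zero_sales = 2038   # records with no star rating and TARGET = 0
unrated_total = 3359        # all records with no star rating
zero_sales_total = 2734     # all records with TARGET = 0
all_records = 12795         # size of the training data


def share(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# Unrated wines with no sales, as a share of the whole training set.
share_of_all = share(unrated_zero_sales, all_records)
# Wines with no sales, as a share of the unrated wines (the row percent).
share_of_unrated = share(unrated_zero_sales, unrated_total)
# Unrated wines, as a share of all wines with no sales (the column percent).
share_of_zero_sales = share(unrated_zero_sales, zero_sales_total)

print(share_of_all, share_of_unrated, share_of_zero_sales)
```

The first figure is the "one-sixth of the training data" mentioned above; the other two correspond to the row and column percentages of the same cell.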
Data Preparation
The missing records for the respective variables were replaced with representative
values, generally the variable mean or median (the exact values are listed in the code
in Appendix 3). Missing values were flagged for identification and reference in the
chosen model.
The descriptive statistics for the data set are summarized for the continuous variables in
the following table.
Variable            N      N Miss  Median       Mean         Minimum       Maximum      Std Dev
INDEX               12795  0       8110.00      8069.98      1.0000000     16129.00     4656.91
TARGET              12795  0       3.0000000    3.0290739    0             8.0000000    1.9263682
FixedAcidity        12795  0       6.9000000    7.0757171    -18.1000000   34.4000000   6.3176435
VolatileAcidity     12795  0       0.2800000    0.3241039    -2.7900000    3.6800000    0.7840142
CitricAcid          12795  0       0.3100000    0.3084127    -3.2400000    3.8600000    0.8620798
ResidualSugar       12179  616     3.9000000    5.4187331    -127.8000000  141.1500000  33.7493790
Chlorides           12157  638     0.0460000    0.0548225    -1.1710000    1.3510000    0.3184673
FreeSulfurDioxide   12148  647     30.0000000   30.8455713   -555.0000000  623.0000000  148.7145577
TotalSulfurDioxide  12113  682     123.0000000  120.7142326  -823.0000000  1057.00      231.9132105
Density             12795  0       0.9944900    0.9942027    0.8880900     1.0992400    0.0265376
pH                  12400  395     3.2000000    3.2076282    0.4800000     6.1300000    0.6796871
Sulphates           11585  1210    0.5000000    0.5271118    -3.1300000    4.2400000    0.9321293
Alcohol             12142  653     10.4000000   10.4892363   -4.7000000    26.5000000   3.7278190
LabelAppeal         12795  0       0            -0.0090660   -2.0000000    2.0000000    0.8910892
AcidIndex           12795  0       8.0000000    7.7727237    4.0000000     17.0000000   1.3239264
STARS               9436   3359    2.0000000    2.0417550    1.0000000     4.0000000    0.9025400
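The replacement and flagging of missing values described above can be sketched as follows. This is a minimal Python illustration, not the SAS data step used for the project; the function name `impute_with_flag` and the sample readings are hypothetical.

```python
from statistics import mean


def impute_with_flag(values):
    """Replace None entries with the mean of the observed values.

    Returns (imputed_values, missing_flags); a flag of 1 marks a record
    whose value was filled in, mirroring the M_ flag variables.
    """
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    imputed = [fill if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]
    return imputed, flags


# Hypothetical alcohol readings with two missing entries.
alcohol = [9.9, None, 10.4, 12.1, None]
imp_alcohol, m_alcohol = impute_with_flag(alcohol)
print(imp_alcohol, m_alcohol)
```

Keeping the flag alongside the filled value lets a model treat "missing" as information in its own right, which is how M_STARS is used later in the zero inflation model.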
Treatment of Outliers
Sulfur dioxide records (free and total) were limited to the range of 10 mg/l to 350 mg/l,
as concentrations above 10 mg/l require labeling, and the maximum concentration of
sulfites is limited to 350 mg/l by law (Source:
http://www.piwine.com/use-and-measurement-of-sulfur-dioxide-in-wine.html). pH values
below 3 were raised to 3, as very low or negative values of pH are indicative of highly
concentrated mineral acids, such as hydrochloric or sulfuric acids, which are unfit for
human consumption, indicating the inappropriateness of the value. Negative values for
any concentration or composition values were also conditioned, as they are not possible.
These values were replaced with the lowest acceptable value.
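The limits described above amount to clamping each value to an allowed range. A minimal Python sketch follows (the project code is SAS); the helper name `clamp` is illustrative, and the inputs are the minimum and maximum values from the descriptive statistics table.

```python
def clamp(value, low, high=None):
    """Limit value to [low, high]; with high=None only the floor applies."""
    if value < low:
        return low
    if high is not None and value > high:
        return high
    return value


# Extremes taken from the descriptive statistics table.
total_so2_low = clamp(-823.0, 10.0, 350.0)    # negative concentration -> 10
total_so2_high = clamp(1057.0, 10.0, 350.0)   # above the legal limit -> 350
ph_floor = clamp(0.48, 3.0)                   # implausibly acidic -> 3
sulphates_floor = clamp(-3.13, 0.0)           # negative concentration -> 0
in_range = clamp(123.0, 10.0, 350.0)          # median value is left unchanged

print(total_so2_low, total_so2_high, ph_floor, sulphates_floor, in_range)
```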
Variable Creation and Combination
The following variables were created:
New Variable | Description | Implication
Alcohol_Type | Less than 10.5 (value = 1); 10.5 or greater (value = 2) | Wines with alcohol content less than 10.5% are predominantly white wines; wines with 10.5% or more are predominantly red wines.
Label_Group | Grouping of LabelAppeal: if negative, Label_Group = 1; if zero or positive, Label_Group = 2. | Grouping of impact of LabelAppeal on sales (negative or positive correlation)
Star_Impact | Grouping of STARS: if less than 2, Star_Impact = 0; if 2 or greater, Star_Impact = 1 (coding as used in the code in Appendix 3). | Grouping of impact of wine rating system
Real_pH | Conversion of pH into hydrogen ion concentration in moles/liter | Concentration = 10**(-pH)
Density Adjusted | Density - 1 | Indication if above or below specific gravity of water
Impurities | Sum of chlorides and sulphates. | Impact of preservatives
IMP_CHLORIDES_LOG | Log of chlorides concentration | Impact of chlorides
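The derived variables can be sketched in a few lines. This is a Python illustration rather than the SAS data step; it follows the cut-offs and codings used in the code in Appendix 3 (including 0/1 coding for STAR_IMPACT), and the function name `derive_features` is hypothetical.

```python
import math


def derive_features(alcohol, label_appeal, stars, ph, chlorides):
    """Build the grouped and transformed variables for one wine record."""
    return {
        # 1 = lighter wines (< 10.5% alcohol), 2 = heavier wines.
        "Alcohol_TYPE": 1 if alcohol < 10.5 else 2,
        # 1 = negative label appeal, 2 = zero or positive label appeal.
        "Label_GROUP": 1 if label_appeal < 0 else 2,
        # 0 = fewer than 2 stars, 1 = 2 or more stars.
        "STAR_IMPACT": 0 if stars < 2 else 1,
        # pH = -log10[H+], so [H+] = 10**(-pH) in moles/liter.
        "REAL_pH": 10 ** (-ph),
        # Base-10 log; chlorides must already be floored above zero.
        "IMP_CHLORIDES_LOG": math.log10(chlorides),
    }


# Example record built from the median values in the statistics table.
example = derive_features(alcohol=10.4, label_appeal=0, stars=2,
                          ph=3.2, chlorides=0.046)
print(example)
```

Note that REAL_pH is a very small number (on the order of 0.0006 for a typical wine), which is why its regression coefficients are so large in magnitude.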
Model Development
Linear Model
The most appropriate linear model that I was able to develop has an R-squared of 0.3365;
the variable coefficients are presented in the table below. The regression has an
average error of 0.002, with a standard deviation of 1.59.
Variable  Parameter Estimate  Standard Error  Type II SS  F Value  Pr > F
Intercept 4.59507 0.54968 172.14337 69.88 <.0001
IMP_STARS 1.34815 0.02780 5794.49325 2352.26 <.0001
IMP_Density -1.06520 0.52398 10.18023 4.13 0.0421
IMP_Sulphates -0.06317 0.02104 22.20909 9.02 0.0027
IMP_LabelAppeal 0.53029 0.02645 990.02230 401.90 <.0001
IMP_FREESULFURDIOXIDE 0.00069504 0.00015990 46.54574 18.90 <.0001
IMP_TotalSulfurDioxide 0.00076498 0.00012274 95.69266 38.85 <.0001
IMP_PH -0.12880 0.02906 48.39406 19.65 <.0001
IMP_ACIDINDEX -0.29945 0.01067 1939.71306 787.42 <.0001
IMP_CITRICACID 0.03850 0.01614 14.01150 5.69 0.0171
IMP_VOLATILEACIDITY -0.14508 0.01774 164.78316 66.89 <.0001
Alcohol_TYPE 0.17647 0.02799 97.90097 39.74 <.0001
STAR_IMPACT -1.44641 0.04898 2147.93290 871.95 <.0001
IMP_CHLORIDES_LOG -0.11579 0.02372 58.70894 23.83 <.0001
Label_GROUP 0.08371 0.05121 6.58391 2.67 0.1021
Summary of Stepwise Selection
Step  Variable Entered  Variable Removed  Number Vars In  Partial R-Square  Model R-Square  C(p)  F Value  Pr > F
1 IMP_STARS 1 0.1601 0.1601 3394.31 2438.72 <.0001
2 IMP_LabelAppeal 2 0.0639 0.2240 2164.98 1053.31 <.0001
3 STAR_IMPACT 3 0.0550 0.2790 1106.34 976.49 <.0001
4 IMP_ACIDINDEX 4 0.0454 0.3245 233.114 859.89 <.0001
5 IMP_VOLATILEACIDITY 5 0.0037 0.3282 163.250 70.99 <.0001
6 Alcohol_TYPE 6 0.0021 0.3302 125.697 39.19 <.0001
7 IMP_TotalSulfurDioxide 7 0.0022 0.3324 85.4345 42.01 <.0001
8 IMP_CHLORIDES_LOG 8 0.0013 0.3338 62.0761 25.25 <.0001
9 IMP_PH 9 0.0010 0.3348 44.0543 19.97 <.0001
10 IMP_FREESULFURDIOXIDE 10 0.0010 0.3358 27.2288 18.80 <.0001
11 IMP_Sulphates 11 0.0005 0.3362 20.0288 9.19 0.0024
12 IMP_CITRICACID 12 0.0003 0.3365 16.1780 5.85 0.0156
13 IMP_Density 13 0.0002 0.3368 13.9635 4.21 0.0401
14 Label_GROUP 14 0.0001 0.3369 13.2911 2.67 0.1021
For a wine novice, the coefficients are difficult to interpret. The counterintuitive
result is that IMP_STARS (expert rating) and STAR_IMPACT, which is derived from it,
carry opposite signs and so appear to be in conflict. Based upon the reference sources
(http://www.piwine.com/use-and-measurement-of-sulfur-dioxide-in-wine.html ,
http://waterhouse.ucdavis.edu/whats-in-wine/fixed-acidity , and
http://winefolly.com/wp-content/uploads/2013/10/basic-wine-101-guide-infographic-poster.jpg#big)
it is possible that the combination of variables makes sense overall, as wine critic
opinions may not represent popular opinion or economic sense to the consumer.
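The apparent conflict can be made concrete by adding the two linear-model terms together. The sketch below is a Python illustration, not part of the original SAS work; it uses the coefficients from the linear model table and assumes STAR_IMPACT is coded 0 for fewer than 2 stars and 1 otherwise, as in the data step in Appendix 3.

```python
# Coefficients copied from the linear model table.
COEF_IMP_STARS = 1.34815
COEF_STAR_IMPACT = -1.44641


def star_contribution(stars):
    """Combined contribution of IMP_STARS and STAR_IMPACT to predicted cases."""
    star_impact = 0 if stars < 2 else 1   # assumed coding, see Appendix 3
    return COEF_IMP_STARS * stars + COEF_STAR_IMPACT * star_impact


contributions = {stars: round(star_contribution(stars), 5) for stars in (1, 2, 3, 4)}
for stars, cases in contributions.items():
    print(stars, cases)
```

Under these assumptions the two terms together contribute about 1.35 cases at 1 star, about 1.25 at 2 stars, about 2.60 at 3 stars, and about 3.95 at 4 stars, so the net effect still rises with the rating apart from a dip at 2 stars. One possible reading, given that missing ratings are imputed as 2 in Appendix 3, is that STAR_IMPACT partly absorbs the poor sales of unrated wines.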
There is some indication that label appeal drives sales, based upon LABEL_GROUP.
The following represent some examples of unique wine labels that capture consumer
interest.
Poisson Regression
The most appropriate Poisson model that I was able to develop is presented below. It
has an AIC of 49,895, and the variable coefficients are presented in the table below.
The regression has an average error of 0.025, with a standard deviation of 1.62.
Analysis Of Maximum Likelihood Parameter Estimates
Parameter  DF  Estimate  Standard Error  Wald 95% Lower Limit  Wald 95% Upper Limit  Wald Chi-Square  Pr > ChiSq
Intercept 1 1.5004 0.2003 1.1078 1.8930 56.10 <.0001
IMP_STARS 1 0.3348 0.0085 0.3181 0.3515 1546.53 <.0001
IMP_Density 1 -0.3517 0.1922 -0.7284 0.0250 3.35 0.0672
IMP_Sulphates 1 -0.0233 0.0079 -0.0387 -0.0079 8.80 0.0030
IMP_Alcohol 1 -0.0015 0.0020 -0.0054 0.0024 0.59 0.4430
IMP_LabelAppeal 1 0.1526 0.0090 0.1350 0.1702 287.73 <.0001
IMP_CHLORIDES 1 0.0354 0.0475 -0.0577 0.1285 0.56 0.4562
IMP_FREESULFURDIOXID 1 0.0002 0.0001 0.0001 0.0003 15.73 <.0001
IMP_TotalSulfurDioxi 1 0.0002 0.0000 0.0002 0.0003 30.22 <.0001
REAL_pH 1 89.6532 14.8763 60.4961 118.8102 36.32 <.0001
IMP_ACIDINDEX 1 -0.1173 0.0045 -0.1261 -0.1085 678.54 <.0001
IMP_RESIDUALSUGAR 1 0.0002 0.0002 -0.0001 0.0005 1.31 0.2519
IMP_CITRICACID 1 0.0129 0.0059 0.0014 0.0245 4.82 0.0281
IMP_VOLATILEACIDITY 1 -0.0476 0.0065 -0.0603 -0.0349 53.72 <.0001
IMP_FixedAcidity 1 -0.0005 0.0008 -0.0021 0.0011 0.40 0.5245
Alcohol_TYPE 1 0.0633 0.0144 0.0351 0.0915 19.37 <.0001
STAR_IMPACT 1 -0.3615 0.0178 -0.3964 -0.3265 410.32 <.0001
IMP_CHLORIDES_LOG 1 -0.0496 0.0165 -0.0820 -0.0172 8.99 0.0027
Label_GROUP 1 0.1108 0.0191 0.0732 0.1483 33.46 <.0001
Scale 0 1.0000 0.0000 1.0000 1.0000
The signs of the variable coefficients in this Poisson regression are consistent with the
linear regression, including the apparent conflict between IMP_STARS and STAR_IMPACT
noted earlier. The same observations also hold true for the variable coefficients
presented in the Negative Binomial Regression.
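Because the Poisson model uses a log link, each coefficient acts multiplicatively on the expected number of cases: a one-unit increase in a variable multiplies the expected count by exp(coefficient). The following Python sketch (an illustration, not part of the SAS output) converts three of the coefficients from the table above into percentage changes; the function name `percent_change` is hypothetical.

```python
import math

# Selected coefficients copied from the Poisson parameter table.
poisson_coefs = {
    "IMP_STARS": 0.3348,
    "IMP_LabelAppeal": 0.1526,
    "IMP_ACIDINDEX": -0.1173,
}


def percent_change(coef):
    """Percent change in the expected count for a one-unit increase."""
    return (math.exp(coef) - 1.0) * 100.0


effects = {name: round(percent_change(b), 1) for name, b in poisson_coefs.items()}
for name, pct in effects.items():
    print(name, pct)
```

Read this way, one additional star corresponds to roughly a 40% increase in expected cases, one additional point of label appeal to roughly a 16% increase, and one additional point of acid index to roughly an 11% decrease, with the other variables held fixed.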
Negative Binomial Regression
The most appropriate Negative Binomial model that I was able to develop is presented
below. It has an AIC of 49,897, and the variable coefficients are presented in the table
below. The regression has an average error of 0.025, with a standard deviation of 1.62.
These results are essentially identical to the Poisson model. This occurred because the
same variables were used and because the Poisson distribution is a special case of the
Negative Binomial distribution: the estimated dispersion parameter is 0.0000, and with
zero dispersion the variance equals the mean, so the Negative Binomial model reduces
to the Poisson model.
Analysis Of Maximum Likelihood Parameter Estimates
Parameter  DF  Estimate  Standard Error  Wald 95% Lower Limit  Wald 95% Upper Limit  Wald Chi-Square  Pr > ChiSq
Intercept 1 1.5004 0.2003 1.1078 1.8930 56.10 <.0001
IMP_STARS 1 0.3348 0.0085 0.3181 0.3515 1546.52 <.0001
IMP_Density 1 -0.3517 0.1922 -0.7284 0.0250 3.35 0.0672
IMP_Sulphates 1 -0.0233 0.0079 -0.0387 -0.0079 8.80 0.0030
IMP_Alcohol 1 -0.0015 0.0020 -0.0054 0.0024 0.59 0.4430
IMP_LabelAppeal 1 0.1526 0.0090 0.1350 0.1702 287.73 <.0001
IMP_CHLORIDES 1 0.0354 0.0475 -0.0577 0.1285 0.56 0.4562
IMP_FREESULFURDIOXID 1 0.0002 0.0001 0.0001 0.0003 15.73 <.0001
IMP_TotalSulfurDioxi 1 0.0002 0.0000 0.0002 0.0003 30.22 <.0001
REAL_pH 1 89.6532 14.8763 60.4961 118.8102 36.32 <.0001
IMP_ACIDINDEX 1 -0.1173 0.0045 -0.1261 -0.1085 678.54 <.0001
IMP_RESIDUALSUGAR 1 0.0002 0.0002 -0.0001 0.0005 1.31 0.2519
IMP_CITRICACID 1 0.0129 0.0059 0.0014 0.0245 4.82 0.0281
IMP_VOLATILEACIDITY 1 -0.0476 0.0065 -0.0603 -0.0349 53.72 <.0001
IMP_FixedAcidity 1 -0.0005 0.0008 -0.0021 0.0011 0.40 0.5245
Alcohol_TYPE 1 0.0633 0.0144 0.0351 0.0915 19.37 <.0001
STAR_IMPACT 1 -0.3615 0.0178 -0.3964 -0.3265 410.31 <.0001
IMP_CHLORIDES_LOG 1 -0.0496 0.0165 -0.0820 -0.0172 8.99 0.0027
Label_GROUP 1 0.1108 0.0191 0.0732 0.1483 33.46 <.0001
Dispersion 1 0.0000 0.0001 0.0000 2.24E122
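The relationship between the two models can be illustrated numerically. The Python sketch below assumes the common parameterization in which the Negative Binomial variance is mu + k * mu**2 and the AIC is 2 * (number of parameters) - 2 * (log-likelihood). The log-likelihood value is illustrative only: it is back-calculated from the reported Poisson AIC under the assumption of 19 estimated parameters and is not read from the SAS output.

```python
def nb_variance(mu, k):
    """Negative Binomial variance for mean mu and dispersion k."""
    return mu + k * mu ** 2


def aic(log_likelihood, n_params):
    """Akaike information criterion (smaller is better)."""
    return 2 * n_params - 2 * log_likelihood


mean_cases = 3.03
# With zero dispersion the variance collapses to the mean (the Poisson case).
poisson_like_variance = nb_variance(mean_cases, 0.0)

# Assumed, illustrative log-likelihood shared by both fits.
log_likelihood = -24928.5
aic_poisson = aic(log_likelihood, 19)             # intercept + 18 predictors
aic_negative_binomial = aic(log_likelihood, 20)   # one extra dispersion parameter

print(poisson_like_variance, aic_poisson, aic_negative_binomial)
```

If the two fits share the same likelihood, the extra dispersion parameter alone raises the AIC by 2, which is consistent with the AIC values of 49,895 and 49,897 quoted in the text.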
I then manually modified the model according to the assignment instructions. I inserted
a new variable, called EXPERT_OPINION, which was the sum of the squared
LABEL_GROUP and STAR_IMPACT. The AIC increased to 50,177. I chose not to run
additional analysis as the model did not improve from the Poisson model earlier. The
table below summarizes the variable coefficients of this alternative model.
Analysis Of Maximum Likelihood Parameter Estimates
Parameter  DF  Estimate  Standard Error  Wald 95% Lower Limit  Wald 95% Upper Limit  Wald Chi-Square  Pr > ChiSq
Intercept 1 1.2531 0.1976 0.8658 1.6404 40.21 <.0001
IMP_STARS 1 0.3644 0.0083 0.3481 0.3807 1912.32 <.0001
IMP_Density 1 -0.3291 0.1922 -0.7059 0.0476 2.93 0.0869
IMP_Sulphates 1 -0.0226 0.0079 -0.0380 -0.0072 8.24 0.0041
IMP_Alcohol 1 -0.0013 0.0020 -0.0052 0.0026 0.44 0.5065
EXPERT_OPINION 1 0.1154 0.0043 0.1068 0.1239 706.46 <.0001
IMP_CHLORIDES 1 0.0192 0.0476 -0.0741 0.1124 0.16 0.6870
IMP_FREESULFURDIOXID 1 0.0002 0.0001 0.0001 0.0004 17.47 <.0001
IMP_TotalSulfurDioxi 1 0.0002 0.0000 0.0002 0.0003 29.48 <.0001
REAL_pH 1 88.7296 14.8692 59.5864 117.8728 35.61 <.0001
IMP_ACIDINDEX 1 -0.1145 0.0045 -0.1233 -0.1057 647.47 <.0001
IMP_RESIDUALSUGAR 1 0.0002 0.0002 -0.0001 0.0005 1.91 0.1665
IMP_CITRICACID 1 0.0131 0.0059 0.0015 0.0246 4.91 0.0266
IMP_VOLATILEACIDITY 1 -0.0495 0.0065 -0.0622 -0.0368 58.10 <.0001
IMP_FixedAcidity 1 -0.0008 0.0008 -0.0024 0.0009 0.85 0.3562
Alcohol_TYPE 1 0.0593 0.0144 0.0311 0.0875 17.02 <.0001
STAR_IMPACT 1 -0.4941 0.0183 -0.5299 -0.4583 731.26 <.0001
IMP_CHLORIDES_LOG 1 -0.0435 0.0166 -0.0759 -0.0111 6.91 0.0086
Dispersion 1 0.0000 0.0001 0.0000 8.49E183
Zero Inflated Poisson Regression
The most appropriate ZIP model that I was able to develop is presented below. It has an
AIC of 40,865 and the variable coefficients are presented in the table below.
Analysis Of Maximum Likelihood Parameter Estimates
Parameter  DF  Estimate  Standard Error  Wald 95% Lower Limit  Wald 95% Upper Limit  Wald Chi-Square  Pr > ChiSq
Intercept 1 1.3960 0.1998 1.0045 1.7876 48.84 <.0001
IMP_STARS 1 0.1137 0.0088 0.0964 0.1309 166.49 <.0001
IMP_Density 1 -0.2694 0.1969 -0.6553 0.1164 1.87 0.1711
IMP_Sulphates 1 0.0006 0.0080 -0.0151 0.0162 0.00 0.9439
IMP_Alcohol 1 0.0003 0.0028 -0.0052 0.0058 0.01 0.9198
STAR_IMPACT 1 -0.0280 0.0187 -0.0646 0.0086 2.25 0.1339
IMP_CHLORIDES 1 -0.0389 0.0258 -0.0895 0.0116 2.28 0.1313
IMP_FREESULFURDIOXID 1 0.0000 0.0001 -0.0001 0.0002 0.46 0.4966
IMP_TotalSulfurDioxi 1 -0.0000 0.0000 -0.0001 0.0000 0.78 0.3761
IMP_ACIDINDEX 1 -0.0194 0.0049 -0.0290 -0.0098 15.64 <.0001
IMP_LabelAppeal 1 0.2413 0.0062 0.2291 0.2536 1494.55 <.0001
IMP_CITRICACID 1 0.0002 0.0087 -0.0168 0.0172 0.00 0.9807
IMP_VOLATILEACIDITY 1 -0.0220 0.0097 -0.0410 -0.0030 5.17 0.0230
IMP_FixedAcidity 1 0.0002 0.0010 -0.0017 0.0022 0.06 0.8131
REAL_pH 1 -10.0082 15.2121 -39.8233 19.8069 0.43 0.5106
Alcohol_TYPE 1 0.0795 0.0149 0.0502 0.1087 28.33 <.0001
Scale 0 1.0000 0.0000 1.0000 1.0000
Analysis Of Maximum Likelihood Zero Inflation Parameter Estimates
Parameter  DF  Estimate  Standard Error  Wald 95% Lower Limit  Wald 95% Upper Limit  Wald Chi-Square  Pr > ChiSq
Intercept 1 6.4613 72.2951 -135.234 148.1570 0.01 0.9288
IMP_STARS 1 -11.3195 72.2946 -153.014 130.3752 0.02 0.8756
M_STARS 1 5.8765 0.3463 5.1977 6.5553 287.88 <.0001
M_SULPHATES 1 0.0900 0.1108 -0.1271 0.3071 0.66 0.4164
IMP_LabelAppeal 1 0.6992 0.0415 0.6179 0.7805 284.40 <.0001
IMP_CHLORIDES_LOG 1 0.0575 0.0568 -0.0538 0.1688 1.03 0.3111
IMP_TotalSulfurDioxi 1 -0.0019 0.0003 -0.0025 -0.0013 42.30 <.0001
IMP_ACIDINDEX 1 0.4391 0.0255 0.3891 0.4890 296.70 <.0001
IMP_CITRICACID 1 -0.0889 0.0572 -0.2010 0.0231 2.42 0.1198
IMP_VOLATILEACIDITY 1 0.2550 0.0573 0.1426 0.3674 19.77 <.0001
REAL_pH 1 -636.197 97.4542 -827.204 -445.190 42.62 <.0001
IMP_Alcohol 1 -0.0128 0.0193 -0.0506 0.0249 0.44 0.5055
Alcohol_TYPE 1 0.3255 0.0990 0.1315 0.5194 10.82 0.0010
STAR_IMPACT 1 7.5539 72.2970 -134.146 149.2534 0.01 0.9168
The most important improvement was the inclusion of M_STARS (a flag indicating that
the STARS record was missing) in the zero inflation model. From the EDA, about 61% of
the wines with no rating sold no cases, and these unrated wines account for about 75%
of all records with zero sales.
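A zero inflated Poisson model combines the two parameter tables above: the zero inflation part gives the probability that a wine sells nothing at all, and the count part gives the expected cases otherwise. The Python sketch below shows how a prediction is assembled, assuming the usual log link for the count part and logit link for the zero inflation part; the function name and the linear predictor values are hypothetical.

```python
import math


def zip_expected_count(count_linear_predictor, zero_linear_predictor):
    """Expected count under a zero inflated Poisson model.

    count_linear_predictor: intercept + sum(coef * x) from the count table.
    zero_linear_predictor:  intercept + sum(coef * x) from the zero inflation table.
    """
    p_zero = 1.0 / (1.0 + math.exp(-zero_linear_predictor))   # logit link
    poisson_mean = math.exp(count_linear_predictor)           # log link
    return (1.0 - p_zero) * poisson_mean


# Hypothetical linear predictors for two wines with the same count-model score.
likely_sold = zip_expected_count(1.2, -2.0)     # low chance of a structural zero
likely_unsold = zip_expected_count(1.2, 2.0)    # high chance of a structural zero

print(round(likely_sold, 3), round(likely_unsold, 3))
```

A large positive coefficient in the zero inflation table, such as the one on M_STARS, pushes the zero linear predictor up and therefore pulls the expected number of cases toward zero.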
Zero Inflated Negative Binomial Regression
The most appropriate ZINB model that I was able to develop is presented below. It has
an AIC of 43,937 and the variable coefficients are presented in the table below.
Analysis Of Maximum Likelihood Parameter Estimates
Parameter  DF  Estimate  Standard Error  Wald 95% Lower Limit  Wald 95% Upper Limit  Wald Chi-Square  Pr > ChiSq
Intercept 1 1.1376 0.2055 0.7349 1.5403 30.65 <.0001
IMP_STARS 1 0.1155 0.0088 0.0983 0.1328 171.82 <.0001
IMP_Density 1 -0.2516 0.1968 -0.6374 0.1341 1.63 0.2011
IMP_Sulphates 1 0.0007 0.0080 -0.0149 0.0163 0.01 0.9305
IMP_Alcohol 1 0.0001 0.0028 -0.0054 0.0056 0.00 0.9748
IMP_LabelAppeal 1 0.2007 0.0091 0.1829 0.2186 483.72 <.0001
IMP_CHLORIDES 1 0.0059 0.0492 -0.0906 0.1024 0.01 0.9048
IMP_FREESULFURDIOXID 1 0.0000 0.0001 -0.0001 0.0002 0.40 0.5271
IMP_TotalSulfurDioxi 1 -0.0000 0.0000 -0.0001 0.0000 0.72 0.3967
REAL_pH 1 -9.1897 15.2098 -39.0003 20.6209 0.37 0.5457
IMP_ACIDINDEX 1 -0.0190 0.0049 -0.0286 -0.0094 15.05 0.0001
IMP_RESIDUALSUGAR 1 0.0000 0.0002 -0.0005 0.0005 0.00 0.9677
IMP_CITRICACID 1 0.0002 0.0087 -0.0168 0.0172 0.00 0.9844
IMP_VOLATILEACIDITY 1 -0.0221 0.0097 -0.0411 -0.0031 5.19 0.0227
IMP_FixedAcidity 1 0.0002 0.0010 -0.0018 0.0021 0.03 0.8563
Alcohol_TYPE 1 0.0795 0.0149 0.0502 0.1088 28.32 <.0001
STAR_IMPACT 1 -0.0329 0.0187 -0.0695 0.0037 3.10 0.0785
IMP_CHLORIDES_LOG 1 -0.0191 0.0171 -0.0527 0.0144 1.25 0.2639
Label_GROUP 1 0.1207 0.0196 0.0823 0.1591 37.94 <.0001
Dispersion 1 0.0000 0.0000 0.0000 1.007E39
Analysis Of Maximum Likelihood Zero Inflation Parameter Estimates
Parameter  DF  Estimate  Standard Error  Wald 95% Lower Limit  Wald 95% Upper Limit  Wald Chi-Square  Pr > ChiSq
Intercept 1 6.7501 84.4772 -158.822 172.3224 0.01 0.9363
IMP_STARS 1 -11.6220 84.4768 -177.193 153.9494 0.02 0.8906
IMP_LabelAppeal 1 0.7127 0.0419 0.6306 0.7948 289.62 <.0001
IMP_CHLORIDES_LOG 1 0.0524 0.0569 -0.0592 0.1640 0.85 0.3572
M_STARS 1 5.8954 0.3528 5.2040 6.5869 279.29 <.0001
M_SULPHATES 1 0.0905 0.1110 -0.1271 0.3080 0.66 0.4150
IMP_TotalSulfurDioxi 1 -0.0019 0.0003 -0.0025 -0.0013 42.16 <.0001
IMP_ACIDINDEX 1 0.4398 0.0255 0.3898 0.4899 296.47 <.0001
IMP_CITRICACID 1 -0.0895 0.0573 -0.2018 0.0228 2.44 0.1183
IMP_VOLATILEACIDITY 1 0.2559 0.0575 0.1432 0.3685 19.83 <.0001
REAL_pH 1 -637.209 97.6977 -828.693 -445.725 42.54 <.0001
IMP_Alcohol 1 -0.0134 0.0193 -0.0513 0.0245 0.48 0.4871
Alcohol_TYPE 1 0.3283 0.0992 0.1338 0.5227 10.95 0.0009
STAR_IMPACT 1 7.8393 84.4789 -157.736 173.4149 0.01 0.9261
Model Selection
Model | AIC
Poisson | 49,877
Negative Binomial | 49,877
Negative Binomial (modified) | 50,902
Zero Inflated Poisson | 40,865
Zero Inflated Negative Binomial (modified) | 43,937
The model I chose was the ZIP model, based upon the AIC. This model scoring code
yields the following histogram.
A strength of the model is that approximately 80% of the projections are within 1.5
cases of the target value and over 30% of the projections are on target (see Appendix 2).
A weakness of this model is that records with 0 cases are undercounted.
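An accuracy figure of this kind can be computed as the share of records whose prediction falls within a given distance of the actual value. The Python sketch below is illustrative only: the function name `share_within` and the five sample records are hypothetical and are not taken from the scored data.

```python
def share_within(actual, predicted, tolerance):
    """Percent of records where |actual - predicted| <= tolerance."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= tolerance)
    return 100.0 * hits / len(actual)


# Hypothetical actual cases sold and model scores for five wines.
actual_cases = [0, 3, 4, 5, 2]
predicted_cases = [1.8, 3.2, 3.1, 4.9, 2.0]

within_range = share_within(actual_cases, predicted_cases, 1.5)
print(within_range)
```

In this made-up sample the only miss is the wine that sold 0 cases, which mirrors the weakness noted above.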
Based upon the instruction set for this assignment, the linear model could not be
considered. However, Occam's Razor applies here. It states that "…when you have
two competing theories that make exactly the same predictions, the simpler one is the
better" (source: www.math.ucr.edu/home/baez/physics/General/occam.html).
The performance of the linear regression over the range of concern for the model was
equally, or nearly equally, accurate.
[Figure: Distribution of P_SCORE_ZIP (percent of records by predicted score, 0 to 8)]
Model Interpretation
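The following tables summarize the count model and the zero inflation model and the
meaning of the respective coefficients. The coefficient values listed are those of the
Zero Inflated Negative Binomial fit reported above; for the statistically significant
terms, their signs agree with the selected ZIP model.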
Maximum Likelihood Parameter Estimates
Parameter | Coefficient | Interpretation
Intercept | 1.1376 |
IMP_STARS | 0.1155 | A higher star rating will increase wine sales.
IMP_Density | -0.2516 | Increased wine density will reduce wine sales.
IMP_Sulphates | 0.0007 | A higher concentration of sulphates will slightly increase wine sales.
IMP_Alcohol | 0.0001 | Increased alcohol content will slightly increase the amount of wine sales.
IMP_LabelAppeal | 0.2007 | A higher label appeal rating will increase wine sales.
IMP_CHLORIDES | 0.0059 | A higher concentration of chlorides will increase the amount of wine sales.
IMP_FREESULFURDIOXID | 0.0000 | The presence of free sulfur dioxide has no impact on wine sales amount.
IMP_TotalSulfurDioxi | -0.0000 | The presence of total sulfur dioxide has no impact on wine sales amount.
REAL_pH | -9.1897 | A higher hydrogen ion concentration (that is, a lower pH) will reduce wine sales amount.
IMP_ACIDINDEX | -0.0190 | Acid index has a negative impact on wine sales amount.
IMP_RESIDUALSUGAR | 0.0000 | Residual sugar has no impact on wine sales amount.
IMP_CITRICACID | 0.0002 | Citric acid concentration will increase wine sales.
IMP_VOLATILEACIDITY | -0.0221 | Volatile acidity will decrease the wine sales amount.
IMP_FixedAcidity | 0.0002 | Fixed acidity will increase the wine sales amount.
Alcohol_TYPE | 0.0795 | Wines having alcohol content of 10.5% or greater sell in greater amounts.
STAR_IMPACT | -0.0329 | With the star rating itself held constant, wines in the group rated 2 stars or higher sell slightly less than wines rated below 2 stars.
IMP_CHLORIDES_LOG | -0.0191 | The logarithm of chlorides negatively impacts wine sales.
Label_GROUP | 0.1207 | Wines with a label rating less than 0 sell in smaller amounts than wines with labels rated 0 or higher.
Maximum Likelihood Zero Inflation Parameter Estimates
Parameter | Coefficient | Interpretation
Intercept | 6.7501 |
IMP_STARS | -11.6220 | An increase in the star rating will reduce the probability that none of this wine will be sold.
IMP_LabelAppeal | 0.7127 | An increase in the label appeal rating of the wine will increase the probability that none of this wine will be sold.
IMP_CHLORIDES_LOG | 0.0524 | An increase in the log concentration of chlorides in the wine will increase the probability that none of this wine will be sold.
M_STARS | 5.8954 | A missing record for STARS results in an increased probability that none of the particular wine will be sold.
M_SULPHATES | 0.0905 | A missing record for SULPHATES results in an increased probability that none of the particular wine will be sold.
IMP_TotalSulfurDioxi | -0.0019 | An increase in the concentration of total sulfur dioxide will decrease the probability that none of this wine will be sold.
IMP_ACIDINDEX | 0.4398 | An increase in the acid index will increase the probability that none of this wine will be sold.
IMP_CITRICACID | -0.0895 | An increase in the concentration of citric acid will decrease the probability that none of this wine will be sold.
IMP_VOLATILEACIDITY | 0.2559 | An increase in the volatile acidity will increase the probability that none of this wine will be sold.
REAL_pH | -637.209 | An increase in the hydrogen ion concentration [H+] (that is, a lower pH) will decrease the probability that none of this wine will be sold.
IMP_Alcohol | -0.0134 | An increase in the concentration of alcohol will decrease the probability that none of this wine will be sold.
Alcohol_TYPE | 0.3283 | The shift from lower alcohol wines (<10.5%, almost all whites and lighter reds) to higher alcohol wines (heartier, drier wines, primarily reds) will increase the probability that none of this wine will be sold.
STAR_IMPACT | 7.8393 | The shift from lower rated wines (<2 stars) to higher rated wines (2 or more stars) will increase the probability that none of this wine will be sold.
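Assuming the zero inflation part uses a logit link, each coefficient can be read as a multiplier on the odds that no cases are sold: exp(coefficient) is the factor by which those odds change for a one-unit increase. The Python sketch below applies this to two coefficients from the table above; the function name `odds_multiplier` is illustrative.

```python
import math


def odds_multiplier(coef):
    """Factor applied to the odds of zero sales per one-unit increase."""
    return math.exp(coef)


# Coefficients copied from the zero inflation table above.
m_stars_factor = odds_multiplier(5.8954)      # STARS record missing vs. present
acid_index_factor = odds_multiplier(0.4398)   # one additional acid index point

print(round(m_stars_factor, 1), round(acid_index_factor, 2))
```

Under this reading, a missing star rating multiplies the odds of selling nothing by a factor of roughly 360, while each additional acid index point multiplies them by roughly 1.55, other variables held fixed.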
File Attachments
File Name | Contents | Comments
CDOROW_PRD411_SEC55_PROJ3TEST.sas | Test code | SAS
CDOROW_PRED411_PROJ3_SCORE_FILE.sas | Scored data | Bingo Bonus for .sas file.
CDOROW_PRED411_PROJ3_SCORE.csv | CSV file | contingency
CDOROW_SEC55_MODELWINNER_PROJ3.sas | |
Appendix 1
Correlation of Continuous Variables
[Figure: Distribution of TARGET (percent of records by number of cases, 0 to 8)]
Contingency Table of Stars by Target
Table of STARS by TARGET
Columns are TARGET values (cases sold); "." denotes a missing STARS rating.
STARS  Statistic  0      1      2      3      4      5      6      7      8      Total
.      Frequency  2038   126    335    457    260    101    32     8      2      3359
.      Percent    15.93  0.98   2.62   3.57   2.03   0.79   0.25   0.06   0.02   26.25
.      Row Pct    60.67  3.75   9.97   13.61  7.74   3.01   0.95   0.24   0.06
.      Col Pct    74.54  51.64  30.71  17.50  8.18   5.01   4.18   5.63   11.76
1      Frequency  607    98     469    916    716    214    22     0      0      3042
1      Percent    4.74   0.77   3.67   7.16   5.60   1.67   0.17   0.00   0.00   23.77
1      Row Pct    19.95  3.22   15.42  30.11  23.54  7.03   0.72   0.00   0.00
1      Col Pct    22.20  40.16  42.99  35.08  22.54  10.63  2.88   0.00   0.00
2      Frequency  89     20     253    948    1333   716    199    12     0      3570
2      Percent    0.70   0.16   1.98   7.41   10.42  5.60   1.56   0.09   0.00   27.90
2      Row Pct    2.49   0.56   7.09   26.55  37.34  20.06  5.57   0.34   0.00
2      Col Pct    3.26   8.20   23.19  36.31  41.96  35.55  26.01  8.45   0.00
3      Frequency  0      0      34     290    764    750    313    57     4      2212
3      Percent    0.00   0.00   0.27   2.27   5.97   5.86   2.45   0.45   0.03   17.29
3      Row Pct    0.00   0.00   1.54   13.11  34.54  33.91  14.15  2.58   0.18
3      Col Pct    0.00   0.00   3.12   11.11  24.05  37.24  40.92  40.14  23.53
4      Frequency  0      0      0      0      104    233    199    65     11     612
4      Percent    0.00   0.00   0.00   0.00   0.81   1.82   1.56   0.51   0.09   4.78
4      Row Pct    0.00   0.00   0.00   0.00   16.99  38.07  32.52  10.62  1.80
4      Col Pct    0.00   0.00   0.00   0.00   3.27   11.57  26.01  45.77  64.71
Total  Frequency  2734   244    1091   2611   3177   2014   765    142    17     12795
Total  Percent    21.37  1.91   8.53   20.41  24.83  15.74  5.98   1.11   0.13   100.00
Table of Label Appeal by Target
Table of LabelAppeal by TARGET
Columns are TARGET values (cases sold).
LabelAppeal  Statistic  0      1      2      3      4      5      6      7      8      Total
-2           Frequency  102    136    177    74     14     1      0      0      0      504
-2           Percent    0.80   1.06   1.38   0.58   0.11   0.01   0.00   0.00   0.00   3.94
-2           Row Pct    20.24  26.98  35.12  14.68  2.78   0.20   0.00   0.00   0.00
-2           Col Pct    3.73   55.74  16.22  2.83   0.44   0.05   0.00   0.00   0.00
-1           Frequency  671    89     755    1118   413    88     2      0      0      3136
-1           Percent    5.24   0.70   5.90   8.74   3.23   0.69   0.02   0.00   0.00   24.51
-1           Row Pct    21.40  2.84   24.08  35.65  13.17  2.81   0.06   0.00   0.00
-1           Col Pct    24.54  36.48  69.20  42.82  13.00  4.37   0.26   0.00   0.00
0            Frequency  1193   19     152    1347   1972   775    155    4      0      5617
0            Percent    9.32   0.15   1.19   10.53  15.41  6.06   1.21   0.03   0.00   43.90
0            Row Pct    21.24  0.34   2.71   23.98  35.11  13.80  2.76   0.07   0.00
0            Col Pct    43.64  7.79   13.93  51.59  62.07  38.48  20.26  2.82   0.00
1            Frequency  660    0      7      70     765    1040   425    79     2      3048
1            Percent    5.16   0.00   0.05   0.55   5.98   8.13   3.32   0.62   0.02   23.82
1            Row Pct    21.65  0.00   0.23   2.30   25.10  34.12  13.94  2.59   0.07
1            Col Pct    24.14  0.00   0.64   2.68   24.08  51.64  55.56  55.63  11.76
2            Frequency  108    0      0      2      13     110    183    59     15     490
2            Percent    0.84   0.00   0.00   0.02   0.10   0.86   1.43   0.46   0.12   3.83
2            Row Pct    22.04  0.00   0.00   0.41   2.65   22.45  37.35  12.04  3.06
2            Col Pct    3.95   0.00   0.00   0.08   0.41   5.46   23.92  41.55  88.24
Total        Frequency  2734   244    1091   2611   3177   2014   765    142    17     12795
Total        Percent    21.37  1.91   8.53   20.41  24.83  15.74  5.98   1.11   0.13   100.00
Appendix 2
Selected Regression Error Histograms
Linear Regression
Zero Inflated Poisson Regression
Linear Regression Error Histogram
[Figure: Distribution of TARGET_ERROR (percent of records by error, -5 to 5)]
Zero Inflated Poisson Regression
[Figure: Distribution of error_term (percent of records by error, -6 to 7)]
Appendix 3
Code Used
Linear Regression
libname mydata '/folders/myfolders' access=readonly;
proc contents data=mydata.wine;
run;
data work.wine_scrub;
set mydata.wine;
*cleaning up variables;
TARGET_FLAG = ( TARGET > 0 );
TARGET_AMT = TARGET - 1;
if TARGET_FLAG = 0 then TARGET_AMT = .;
IMP_STARS = STARS;
IMP_Density = Density;
IMP_Sulphates = Sulphates;
IMP_Alcohol = Alcohol;
IMP_LabelAppeal = LabelAppeal;
IMP_CHLORIDES = Chlorides;
IMP_FREESULFURDIOXIDE = FREESULFURDIOXIDE;
IMP_TotalSulfurDioxide = TotalSulfurDioxide;
IMP_PH = pH;
IMP_ACIDINDEX = ACIDINDEX;
IMP_RESIDUALSUGAR = ResidualSugar;
IMP_CITRICACID = CitricAcid;
IMP_VOLATILEACIDITY = VolatileAcidity;
IMP_FixedAcidity = FixedAcidity;
*missing counts;
M_STARS = 0;
M_RESIDUALSUGAR = 0;
M_CHLORIDES = 0;
M_FREESULFURDIOXIDE = 0;
M_TOTALSULFURDIOXIDE = 0;
M_SULPHATES = 0;
M_ALCOHOL = 0;
M_pH = 0;
if missing(STARS) then do; IMP_STARS = 2;
M_STARS = 1;
end;
if missing(Density) then IMP_Density = 0.9942027;
if missing(Sulphates) then do;
IMP_Sulphates = 0.5271118;
M_SULPHATES =1;
end;
if missing(Alcohol) then do;
IMP_Alcohol = 10.4892363;
M_ALCOHOL =1;
end;
if missing(pH) then do;
IMP_pH = 4;
M_pH =1;
*typical wine pH is now 4;
end;
if missing(LabelAppeal) then IMP_LabelAppeal = 0;
if missing(TotalSulfurDioxide) then IMP_TotalSulfurDioxide = 120.7142326;
if missing (FreeSulfurDioxide) then IMP_FreeSulfurDioxide = 30.845;
if missing (Chlorides) then IMP_Chlorides = 0.046;
*floor chlorides at 0.01 so that the log transform below is defined;
if IMP_Chlorides <= 0.01 then IMP_Chlorides= 0.01;
*IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * sqrt(
abs(IMP_TotalSulfurDioxide)+1 );
*IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * log(
abs(IMP_TotalSulfurDioxide)+1 );
if IMP_TotalSulfurDioxide < 10 then IMP_TotalSulfurDioxide = 10 ;
if IMP_TotalSulfurDioxide > 350 then IMP_TotalSulfurDioxide = 350;
if IMP_FreeSulfurDioxide < 10 then IMP_FreeSulfurDioxide = 10 ;
if IMP_FreeSulfurDioxide > 350 then IMP_FreeSulfurDioxide = 350;
* more than 10 mg/l requires labeling, >350 mg/l is prohibited, limits based upon
requirements;
if IMP_PH < 3 then IMP_PH=3;
*a pH of 0.48 is high concentration acid that is unfit for human consumption;
if IMP_Sulphates <0 then IMP_SULPHATES= 0;
if missing(ResidualSugar) then IMP_ResidualSugar = 3.9;
*grouping of wines by http://winefolly.com/wp-content/uploads/2013/10/basic-wine-101-guide-infographic-poster.jpg#big
light to heavy, which is a crude classification of white to red;
if IMP_Alcohol < 10.5 then Alcohol_TYPE=1;
if IMP_Alcohol >= 10.5 then Alcohol_TYPE=2;
if IMP_LabelAppeal <0 then Label_GROUP =1;
if IMP_LabelAppeal >=0 then Label_GROUP = 2;
if IMP_STARS <2 then STAR_IMPACT = 0;
if IMP_STARS >=2 then STAR_IMPACT = 1;
REAL_pH = 10**(-IMP_pH);
density_adjusted = density - 1;
IMPURITIES = IMP_Chlorides + IMP_sulphates + IMP_FreeSulfurDioxide +
IMP_TotalSulfurDioxide;
TARGET_FLAG = ( TARGET > 0 );
TARGET_AMT = TARGET - 1;
if TARGET_FLAG = 0 then TARGET_AMT = .;
IMP_CHLORIDES_LOG = LOG10(IMP_CHLORIDES);
TARGET_LOG=0;
IF TARGET>0 then TARGET_LOG=1;
run;
proc means data=work.wine_scrub n nmiss median mean min max stddev;
run;
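/* Sketch only, not part of the original program. The imputation constants in
   the data step above were typed in from PROC MEANS output. Assuming SAS/STAT
   is available, PROC STDIZE with the REPONLY option can replace missing values
   with the column mean directly. Note that it overwrites the listed variables
   in the output data set instead of creating IMP_ copies, and that
   work.wine_meanfill is a hypothetical data set name. Only the variables that
   the program imputes with their means are listed. */
proc stdize data=mydata.wine out=work.wine_meanfill reponly method=mean;
var Density Sulphates Alcohol TotalSulfurDioxide FreeSulfurDioxide;
run;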
proc reg data = work.wine_scrub;
stepwise: model TARGET =
IMP_STARS
IMP_Density
IMP_Sulphates
IMP_Alcohol
IMP_LabelAppeal
IMP_CHLORIDES
IMP_FREESULFURDIOXIDE
IMP_TotalSulfurDioxide
IMP_PH
IMP_ACIDINDEX
IMP_RESIDUALSUGAR
IMP_CITRICACID
IMP_VOLATILEACIDITY
IMP_FixedAcidity
Alcohol_Type
STAR_IMPACT
IMP_CHLORIDES_LOG
LABEL_GROUP
/ selection = stepwise;
run;
data work.wine_scrub;
set work.wine_scrub;
TARGET_TEMP=4.59507 +
IMP_STARS* 1.34815 +
IMP_Density* -1.06520 +
IMP_Sulphates* -0.06317 +
IMP_LabelAppeal* 0.53029 +
IMP_FREESULFURDIOXIDE* 0.00069504 +
IMP_TotalSulfurDioxide* 0.00076498 +
IMP_PH* -0.12880 +
IMP_ACIDINDEX* -0.29945 +
IMP_CITRICACID* 0.03850 +
IMP_VOLATILEACIDITY* -0.14508 +
Alcohol_TYPE* 0.17647 +
STAR_IMPACT* -1.44641 +
IMP_CHLORIDES_LOG* -0.11579 +
Label_GROUP* 0.08371 ;
If target_temp <0 then target_temp=0;
TARGET_ERROR = Target - TARGET_TEMP;
target_error = round (target_error, 1);
run;
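/* Sketch only, not part of the original program. Rather than retyping the
   coefficients as in the data step above, PROC REG can save them with OUTEST=
   and PROC SCORE can apply them. The three predictors are a shortened list
   for illustration, and work.reg_est and work.reg_scored are hypothetical
   names. With TYPE=PARMS and PREDICT the new score variable holds the
   predicted TARGET; it is normally named after the model label (MODEL1 when
   no label is given). Unlike the step above, this sketch does not reset
   negative predictions to 0. */
proc reg data=work.wine_scrub outest=work.reg_est noprint;
model TARGET = IMP_STARS IMP_LabelAppeal IMP_ACIDINDEX;
run;
quit;
proc score data=work.wine_scrub score=work.reg_est out=work.reg_scored
type=parms predict;
var IMP_STARS IMP_LabelAppeal IMP_ACIDINDEX;
run;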
proc univariate data=work.wine_scrub noprint;
histogram target_error/midpoints = -5 -4 -3 -2 -1 0 1 2 3 4 5 ;
run;
proc univariate data=work.wine_scrub;
var target_temp;
histogram/midpoints = 0 1 2 3 4 5 6 7 8;
run;
proc means data=work.wine_scrub n nmiss median mean min max stddev;
run;
Poisson Regression
libname mydata '/folders/myfolders' access=readonly;
proc contents data=mydata.wine;
run;
data work.wine_scrub;
set mydata.wine;
*cleaning up variables;
TARGET_FLAG = ( TARGET > 0 );
TARGET_AMT = TARGET - 1;
if TARGET_FLAG = 0 then TARGET_AMT = .;
IMP_STARS = STARS;
IMP_Density = Density;
IMP_Sulphates = Sulphates;
IMP_Alcohol = Alcohol;
IMP_LabelAppeal = LabelAppeal;
IMP_CHLORIDES = Chlorides;
IMP_FREESULFURDIOXIDE = FREESULFURDIOXIDE;
IMP_TotalSulfurDioxide = TotalSulfurDioxide;
IMP_PH = pH;
IMP_ACIDINDEX = ACIDINDEX;
IMP_RESIDUALSUGAR = ResidualSugar;
IMP_CITRICACID = CitricAcid;
IMP_VOLATILEACIDITY = VolatileAcidity;
IMP_FixedAcidity = FixedAcidity;
*missing counts;
M_STARS = 0;
M_RESIDUALSUGAR = 0;
M_CHLORIDES = 0;
M_FREESULFURDIOXIDE = 0;
M_TOTALSULFURDIOXIDE = 0;
M_SULPHATES = 0;
M_ALCOHOL = 0;
if missing(STARS) then do; IMP_STARS = 2;
M_STARS = 1;
end;
if missing(Density) then IMP_Density = 0.9942027;
if missing(Sulphates) then do;
IMP_Sulphates = 0.5271118;
M_SULPHATES =1;
end;
if missing(Alcohol) then do;
IMP_Alcohol = 10.4892363;
M_ALCOHOL =1;
end;
if missing(pH) then do;
IMP_pH = 4;
M_pH =1;
*typical wine pH is now 4;
end;
if missing(LabelAppeal) then IMP_LabelAppeal = 0;
if missing(TotalSulfurDioxide) then IMP_TotalSulfurDioxide = 120.7142326;
if missing (FreeSulfurDioxide) then IMP_FreeSulfurDioxide = 30.845;
if missing (Chlorides) then IMP_Chlorides = 0.046;
if IMP_Chlorides =< 0.01 then IMP_Chlorides= 0.01;
*IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * sqrt(
abs(IMP_TotalSulfurDioxide)+1 );
*IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * log(
abs(IMP_TotalSulfurDioxide)+1 );
if IMP_TotalSulfurDioxide < 10 then IMP_TotalSulfurDioxide = 10 ;
if IMP_TotalSulfurDioxide > 350 then IMP_TotalSulfurDioxide = 350;
if IMP_FreeSulfurDioxide < 10 then IMP_FreeSulfurDioxide = 10 ;
if IMP_FreeSulfurDioxide > 350 then IMP_FreeSulfurDioxide = 350;
* more than 10 mg/l requires labeling, >350 mg/l is prohibited, limits based upon
requirements;
if IMP_PH < 3 then IMP_PH=3;
*a pH of 0.48 is high concentration acid that is unfit for human consumption;
if IMP_Sulphates <0 then IMP_SULPHATES= 0;
if missing(ResidualSugar) then IMP_ResidualSugar = 3.9;
*grouping of wines by http://winefolly.com/wp-content/uploads/2013/10/basic-wine-101-guide-infographic-poster.jpg#big
light to heavy, which is a crude classification of white to red;
if IMP_Alcohol < 10.5 then Alcohol_TYPE=1;
if IMP_Alcohol >= 10.5 then Alcohol_TYPE=2;
if IMP_LabelAppeal <0 then Label_GROUP =1;
if IMP_LabelAppeal >=0 then Label_GROUP = 2;
if IMP_STARS <2 then STAR_IMPACT = 0;
if IMP_STARS >=2 then STAR_IMPACT = 1;
REAL_pH = 10**(-IMP_pH);
density_adjusted = density - 1;
IMPURITIES = IMP_Chlorides + IMP_sulphates + IMP_FreeSulfurDioxide +
IMP_TotalSulfurDioxide;
TARGET_FLAG = ( TARGET > 0 );
TARGET_AMT = TARGET - 1;
if TARGET_FLAG = 0 then TARGET_AMT = .;
IMP_CHLORIDES_LOG = LOG10(IMP_CHLORIDES);
TARGET_LOG=0;
IF TARGET>0 then TARGET_LOG=1;
run;
proc means data=work.wine_scrub n nmiss median mean min max stddev;
run;
proc genmod data = work.wine_scrub;
stepwise:
model TARGET =
IMP_STARS
IMP_Density
IMP_Sulphates
IMP_Alcohol
IMP_LabelAppeal
IMP_CHLORIDES
IMP_FREESULFURDIOXIDE
IMP_TotalSulfurDioxide
REAL_PH
IMP_ACIDINDEX
IMP_RESIDUALSUGAR
IMP_CITRICACID
IMP_VOLATILEACIDITY
IMP_FixedAcidity
Alcohol_Type
STAR_IMPACT
IMP_CHLORIDES_LOG
LABEL_GROUP
/link=log dist=poi;
output out= work.wine_scrub_poi_out p=y_poi;
run;
proc genmod data = work.wine_scrub;
stepwise:
model TARGET =
IMP_STARS
IMP_Density
IMP_Sulphates
IMP_Alcohol
IMP_LabelAppeal
IMP_CHLORIDES
IMP_FREESULFURDIOXIDE
/link=log dist=poi;
output out= work.wine_scrub_poi_outx p=y_poi;
run;
data work.wine_scrub;
set work.wine_scrub;
P_SCORE_TEMP = 1.5004 +
IMP_STARS * 0.3348 +
IMP_Density * -0.3517 +
IMP_Sulphates * -0.0233 +
IMP_Alcohol * -0.0015 +
IMP_LabelAppeal * 0.1526 +
IMP_CHLORIDES * 0.0354 +
IMP_FREESULFURDIOXIDE * 0.0002 +
IMP_TotalSulfurDioxide * 0.0002 +
REAL_PH * 89.6532 +
IMP_ACIDINDEX * -0.1173 +
IMP_RESIDUALSUGAR * 0.0002 +
IMP_CITRICACID * 0.0129 +
IMP_VOLATILEACIDITY * -0.0476 +
IMP_FixedAcidity * -0.0005 +
Alcohol_TYPE * 0.0633 +
STAR_IMPACT * -0.3615 +
IMP_CHLORIDES_LOG * -0.0496 +
Label_GROUP * 0.1108
;
P_SCORE_POISSON = exp(P_SCORE_TEMP );
P_SCORE_POISSON = round (P_SCORE_POISSON,1);
if P_SCORE_POISSON > 8 then P_SCORE_POISSON =8;
POISSON_ERROR = TARGET - P_SCORE_POISSON;
run;
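/* Sketch only, not part of the original program. The data step above applies
   the inverse of the log link by hand: exp() of the linear predictor. As an
   alternative, PROC GENMOD can save the fitted model with a STORE statement
   and PROC PLM can score a data set from it; the ILINK option returns
   predictions on the count scale. The short predictor list, the item store
   work.poi_fit, the output data set work.poi_scored and the variable
   P_POISSON_PLM are all hypothetical. Rounding and the cap at 8 would still
   need a data step like the one above. */
proc genmod data=work.wine_scrub;
model TARGET = IMP_STARS IMP_LabelAppeal IMP_ACIDINDEX / link=log dist=poi;
store work.poi_fit;
run;
proc plm restore=work.poi_fit;
score data=work.wine_scrub out=work.poi_scored predicted=P_POISSON_PLM / ilink;
run;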
proc univariate data=work.wine_scrub noprint;
histogram poisson_error/midpoints = -5 -4 -3 -2 -1 0 1 2 3 4 5 ;
run;
proc means data=work.wine_scrub n nmiss median mean min max stddev;
run;
Negative Binomial Regression
libname mydata '/folders/myfolders' access=readonly;
proc contents data=mydata.wine;
run;
data work.wine_scrub;
set mydata.wine;
*cleaning up variables;
TARGET_FLAG = ( TARGET > 0 );
TARGET_AMT = TARGET - 1;
if TARGET_FLAG = 0 then TARGET_AMT = .;
IMP_STARS = STARS;
IMP_Density = Density;
IMP_Sulphates = Sulphates;
IMP_Alcohol = Alcohol;
IMP_LabelAppeal = LabelAppeal;
IMP_CHLORIDES = Chlorides;
IMP_FREESULFURDIOXIDE = FREESULFURDIOXIDE;
IMP_TotalSulfurDioxide = TotalSulfurDioxide;
IMP_PH = pH;
IMP_ACIDINDEX = ACIDINDEX;
IMP_RESIDUALSUGAR = ResidualSugar;
IMP_CITRICACID = CitricAcid;
IMP_VOLATILEACIDITY = VolatileAcidity;
IMP_FixedAcidity = FixedAcidity;
*missing counts;
M_STARS = 0;
M_RESIDUALSUGAR = 0;
M_CHLORIDES = 0;
M_FREESULFURDIOXIDE = 0;
M_TOTALSULFURDIOXIDE = 0;
M_SULPHATES = 0;
M_ALCOHOL = 0;
if missing(STARS) then do; IMP_STARS = 2;
M_STARS = 1;
end;
if missing(Density) then IMP_Density = 0.9942027;
if missing(Sulphates) then do;
IMP_Sulphates = 0.5271118;
M_SULPHATES =1;
end;
if missing(Alcohol) then do;
IMP_Alcohol = 10.4892363;
M_ALCOHOL =1;
end;
if missing(pH) then do;
IMP_pH = 4;
M_pH =1;
*typical wine pH is now 4;
end;
if missing(LabelAppeal) then IMP_LabelAppeal = 0;
if missing(TotalSulfurDioxide) then IMP_TotalSulfurDioxide = 120.7142326;
if missing (FreeSulfurDioxide) then IMP_FreeSulfurDioxide = 30.845;
if missing (Chlorides) then IMP_Chlorides = 0.046;
if IMP_Chlorides =< 0.01 then IMP_Chlorides= 0.01;
*IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * sqrt(
abs(IMP_TotalSulfurDioxide)+1 );
*IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * log(
abs(IMP_TotalSulfurDioxide)+1 );
if IMP_TotalSulfurDioxide < 10 then IMP_TotalSulfurDioxide = 10 ;
if IMP_TotalSulfurDioxide > 350 then IMP_TotalSulfurDioxide = 350;
if IMP_FreeSulfurDioxide < 10 then IMP_FreeSulfurDioxide = 10 ;
if IMP_FreeSulfurDioxide > 350 then IMP_FreeSulfurDioxide = 350;
* more than 10 mg/l requires labeling, >350 mg/l is prohibited, limits based upon
requirements;
if IMP_PH < 3 then IMP_PH=3;
*a pH of 0.48 is high concentration acid that is unfit for human consumption;
if IMP_Sulphates <0 then IMP_SULPHATES= 0;
if missing(ResidualSugar) then IMP_ResidualSugar = 3.9;
*grouping of wines by http://winefolly.com/wp-content/uploads/2013/10/basic-wine-101-guide-infographic-poster.jpg#big
light to heavy, which is a crude classification of white to red;
if IMP_Alcohol < 10.5 then Alcohol_TYPE=1;
if IMP_Alcohol >= 10.5 then Alcohol_TYPE=2;
if IMP_LabelAppeal <0 then Label_GROUP =1;
if IMP_LabelAppeal >=0 then Label_GROUP = 2;
if IMP_STARS <2 then STAR_IMPACT = 0;
if IMP_STARS >=2 then STAR_IMPACT = 1;
REAL_pH = 10**(-IMP_pH);
density_adjusted = density - 1;
IMPURITIES = IMP_Chlorides + IMP_sulphates + IMP_FreeSulfurDioxide +
IMP_TotalSulfurDioxide;
EXPERT_OPINION = (STAR_IMPACT**2) + (LABEL_GROUP**2);
TARGET_FLAG = ( TARGET > 0 );
TARGET_AMT = TARGET - 1;
if TARGET_FLAG = 0 then TARGET_AMT = .;
IMP_CHLORIDES_LOG = LOG10(IMP_CHLORIDES);
TARGET_LOG=0;
IF TARGET>0 then TARGET_LOG=1;
run;
proc means data=work.wine_scrub n nmiss median mean min max stddev;
run;
proc genmod data = work.wine_scrub;
model TARGET =
IMP_STARS
IMP_Density
IMP_Sulphates
IMP_Alcohol
IMP_LabelAppeal
IMP_CHLORIDES
IMP_FREESULFURDIOXIDE
IMP_TotalSulfurDioxide
REAL_PH
IMP_ACIDINDEX
IMP_RESIDUALSUGAR
IMP_CITRICACID
IMP_VOLATILEACIDITY
IMP_FixedAcidity
Alcohol_Type
STAR_IMPACT
IMP_CHLORIDES_LOG
LABEL_GROUP
/link=log dist=nb;
output out= work.wine_scrub_negbin_out p=y_nb;
run;
proc genmod data = work.wine_scrub;
model TARGET =
IMP_STARS
IMP_Density
IMP_Sulphates
IMP_Alcohol
EXPERT_OPINION
IMP_CHLORIDES
IMP_FREESULFURDIOXIDE
IMP_TotalSulfurDioxide
REAL_PH
IMP_ACIDINDEX
IMP_RESIDUALSUGAR
IMP_CITRICACID
IMP_VOLATILEACIDITY
IMP_FixedAcidity
Alcohol_Type
STAR_IMPACT
IMP_CHLORIDES_LOG
/link=log dist=nb;
output out= work.wine_scrub_negbin_out p=y_nb;
run;
data work.wine_scrub;
set work.wine_scrub;
P_SCORE_TEMP = 1.5004 +
IMP_STARS * 0.3348 +
IMP_Density * -0.3517 +
IMP_Sulphates * -0.0233 +
IMP_Alcohol * -0.0015 +
IMP_LabelAppeal * 0.1526 +
IMP_CHLORIDES * 0.0354 +
IMP_FREESULFURDIOXIDE * 0.0002 +
IMP_TotalSulfurDioxide * 0.0002 +
REAL_PH * 89.6532 +
IMP_ACIDINDEX * -0.1173 +
IMP_RESIDUALSUGAR * 0.0002 +
IMP_CITRICACID * 0.0129 +
IMP_VOLATILEACIDITY * -0.0476 +
IMP_FixedAcidity * -0.0005 +
Alcohol_TYPE * 0.0633 +
STAR_IMPACT * -0.3615 +
IMP_CHLORIDES_LOG * -0.0496 +
Label_GROUP * 0.1108
;
P_NEGBIN = exp(P_SCORE_TEMP );
P_NEGBIN = round (P_NEGBIN,1);
if P_NEGBIN > 8 then P_NEGBIN =8;
NEGBIN_ERROR = TARGET - P_NEGBIN;
run;
proc univariate data=work.wine_scrub noprint;
histogram NEGBIN_ERROR/midpoints = -5 -4 -3 -2 -1 0 1 2 3 4 5 ;
run;
proc univariate data=work.wine_scrub noprint;
histogram P_NEGBIN/midpoints = 0 1 2 3 4 5 6 7 8;
run;
proc means data=work.wine_scrub n nmiss median mean min max stddev;
run;
Zero Inflated Poisson
libname mydata '/folders/myfolders' access=readonly;
proc contents data=mydata.wine;
run;
data work.wine_scrub;
set mydata.wine;
*cleaning up variables;
TARGET_FLAG = ( TARGET > 0 );
TARGET_AMT = TARGET - 1;
if TARGET_FLAG = 0 then TARGET_AMT = .;
IMP_STARS = STARS;
IMP_Density = Density;
IMP_Sulphates = Sulphates;
IMP_Alcohol = Alcohol;
IMP_LabelAppeal = LabelAppeal;
IMP_CHLORIDES = Chlorides;
IMP_FREESULFURDIOXIDE = FREESULFURDIOXIDE;
IMP_TotalSulfurDioxide = TotalSulfurDioxide;
IMP_PH = pH;
IMP_ACIDINDEX = ACIDINDEX;
IMP_RESIDUALSUGAR = ResidualSugar;
IMP_CITRICACID = CitricAcid;
IMP_VOLATILEACIDITY = VolatileAcidity;
IMP_FixedAcidity = FixedAcidity;
*missing counts;
M_STARS = 0;
M_RESIDUALSUGAR = 0;
M_CHLORIDES = 0;
M_FREESULFURDIOXIDE = 0;
M_TOTALSULFURDIOXIDE = 0;
M_SULPHATES = 0;
M_ALCOHOL = 0;
if missing(STARS) then do; IMP_STARS = 2;
M_STARS = 1;
end;
if missing(Density) then IMP_Density = 0.9942027;
if missing(Sulphates) then do;
IMP_Sulphates = 0.5271118;
M_SULPHATES =1;
end;
if missing(Alcohol) then do;
IMP_Alcohol = 10.4892363;
M_ALCOHOL =1;
end;
if missing(pH) then do;
IMP_pH = 4;
M_pH =1;
*typical wine pH is now 4;
end;
if missing(STARS) then do; IMP_STARS = 2;
M_STARS = 1;
end;
if missing(LabelAppeal) then IMP_LabelAppeal = 0;
if missing(TotalSulfurDioxide) then IMP_TotalSulfurDioxide = 120.7142326;
if missing (FreeSulfurDioxide) then IMP_FreeSulfurDioxide = 30.845;
if missing (Chlorides) then IMP_Chlorides = 0.046;
if IMP_Chlorides =< 0.01 then IMP_Chlorides= 0.01;
*IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * sqrt(
abs(IMP_TotalSulfurDioxide)+1 );
*IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * log(
abs(IMP_TotalSulfurDioxide)+1 );
if IMP_TotalSulfurDioxide < 10 then IMP_TotalSulfurDioxide = 10 ;
if IMP_TotalSulfurDioxide > 350 then IMP_TotalSulfurDioxide = 350;
if IMP_FreeSulfurDioxide < 10 then IMP_FreeSulfurDioxide = 10 ;
if IMP_FreeSulfurDioxide > 350 then IMP_FreeSulfurDioxide = 350;
* more than 10 mg/l requires labeling, >350 mg/l is prohibited, limits based upon
requirements;
if IMP_PH < 3 then IMP_PH=3;
*a pH of 0.48 is high concentration acid that is unfit for human consumption;
if IMP_Sulphates <0 then IMP_SULPHATES= 0;
if missing(ResidualSugar) then IMP_ResidualSugar = 3.9;
if IMP_Alcohol <9 then IMP_Alcohol =9.0;
if IMP_ResidualSugar <1 then IMP_ResidualSugar=1;
if IMP_CITRICACID <0 then IMP_CITRICACID=0;
if IMP_VOLATILEACIDITY <0 then IMP_VOLATILEACIDITY=0;
if IMP_FixedAcidity <0 then IMP_FixedAcidity =0;
*grouping of wines by http://winefolly.com/wp-content/uploads/2013/10/basic-wine-101-guide-infographic-poster.jpg#big
light to heavy, which is a crude classification of white to red;
if IMP_Alcohol < 10.5 then Alcohol_TYPE=1;
if IMP_Alcohol >= 10.5 then Alcohol_TYPE=2;
if IMP_LabelAppeal <0 then Label_GROUP =1;
if IMP_LabelAppeal >=0 then Label_GROUP = 2;
if IMP_STARS <2 then STAR_IMPACT = 0;
if IMP_STARS >=2 then STAR_IMPACT = 1;
ALCOHOL_EMP = ALCOHOL_TYPE**2;
STAR_EMP = STAR_IMPACT**2;
EXPERT_INFLUENCE = ALCOHOL_TYPE + STAR_IMPACT +LABEL_GROUP;
REAL_pH = 10**(-IMP_pH);
density_adjusted = density - 1;
IMPURITIES = IMP_Chlorides + IMP_sulphates + IMP_FreeSulfurDioxide +
IMP_TotalSulfurDioxide;
TARGET_FLAG = ( TARGET > 0 );
TARGET_AMT = TARGET - 1;
if TARGET_FLAG = 0 then TARGET_AMT = .;
IMP_CHLORIDES_LOG = LOG10(IMP_CHLORIDES);
TARGET_LOG=0;
IF TARGET>0 then TARGET_LOG=1;
run;
proc means data=work.wine_scrub n nmiss median mean min max stddev;
run;
proc genmod data = work.wine_scrub;
model TARGET =
IMP_STARS
IMP_Density
IMP_Sulphates
IMP_Alcohol
STAR_IMPACT
IMP_CHLORIDES
IMP_FREESULFURDIOXIDE
IMP_TotalSulfurDioxide
IMP_ACIDINDEX
IMP_LabelAppeal
IMP_CITRICACID
IMP_VOLATILEACIDITY
IMP_FixedAcidity
REAL_pH
ALCOHOL_TYPE
/link=log dist=zip;
zeromodel IMP_STARS
M_STARS
M_SULPHATES
IMP_LabelAppeal
IMP_CHLORIDES_LOG
IMP_TotalSulfurDioxide
IMP_ACIDINDEX
IMP_CITRICACID
IMP_VOLATILEACIDITY
REAL_pH
IMP_Alcohol
ALCOHOL_TYPE
STAR_IMPACT
/link=logit;
output out= work.winezip0526 pred=p_target_zip pzero=p_zero_zip;
run;
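/* Sketch only, not part of the original program. The zero-inflated Poisson
   output can be turned into an error measure comparable to the Poisson and
   negative binomial steps. This assumes that PRED= holds the overall mean
   (1 - omega) * lambda and PZERO= holds the zero-inflation probability omega,
   which is how the SAS/STAT zero-inflated Poisson example treats them; check
   this against the documentation for your release. work.winezip_scored,
   LAMBDA_ZIP, P_NO_SALE and ZIP_ERROR are hypothetical names. */
data work.winezip_scored;
set work.winezip0526;
*Poisson mean of the count component;
LAMBDA_ZIP = p_target_zip / (1 - p_zero_zip);
*total probability of zero cases: structural zero or a Poisson zero;
P_NO_SALE = p_zero_zip + (1 - p_zero_zip) * pdf('POISSON', 0, LAMBDA_ZIP);
ZIP_ERROR = TARGET - round(p_target_zip, 1);
run;
proc univariate data=work.winezip_scored noprint;
histogram ZIP_ERROR/midpoints = -5 -4 -3 -2 -1 0 1 2 3 4 5 ;
run;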
Zero Inflated Negative Binomial
libname mydata '/folders/myfolders' access=readonly;
proc contents data=mydata.wine;
run;
data work.wine_scrub;
set mydata.wine;
*cleaning up variables;
TARGET_FLAG = ( TARGET > 0 );
TARGET_AMT = TARGET - 1;
if TARGET_FLAG = 0 then TARGET_AMT = .;
IMP_STARS = STARS;
IMP_Density = Density;
IMP_Sulphates = Sulphates;
IMP_Alcohol = Alcohol;
IMP_LabelAppeal = LabelAppeal;
IMP_CHLORIDES = Chlorides;
IMP_FREESULFURDIOXIDE = FREESULFURDIOXIDE;
IMP_TotalSulfurDioxide = TotalSulfurDioxide;
IMP_PH = pH;
IMP_ACIDINDEX = ACIDINDEX;
IMP_RESIDUALSUGAR = ResidualSugar;
IMP_CITRICACID = CitricAcid;
IMP_VOLATILEACIDITY = VolatileAcidity;
IMP_FixedAcidity = FixedAcidity;
*missing counts;
M_STARS = 0;
M_RESIDUALSUGAR = 0;
M_CHLORIDES = 0;
M_FREESULFURDIOXIDE = 0;
M_TOTALSULFURDIOXIDE = 0;
M_SULPHATES = 0;
M_ALCOHOL = 0;
if missing(STARS) then do; IMP_STARS = 2;
M_STARS = 1;
end;
if missing(Density) then IMP_Density = 0.9942027;
if missing(Sulphates) then do;
IMP_Sulphates = 0.5271118;
M_SULPHATES =1;
end;
if missing(Alcohol) then do;
IMP_Alcohol = 10.4892363;
M_ALCOHOL =1;
end;
if missing(pH) then do;
IMP_pH = 4;
M_pH =1;
*typical wine pH is now 4;
end;
if missing(LabelAppeal) then do;
IMP_LabelAppeal = 0;
M_LabelAppeal =1;
end;
if missing(TotalSulfurDioxide) then IMP_TotalSulfurDioxide = 120.7142326;
if missing (FreeSulfurDioxide) then IMP_FreeSulfurDioxide = 30.845;
if missing (Chlorides) then IMP_Chlorides = 0.046;
if IMP_Chlorides =< 0.01 then IMP_Chlorides= 0.01;
*IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * sqrt(
abs(IMP_TotalSulfurDioxide)+1 );
*IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * log(
abs(IMP_TotalSulfurDioxide)+1 );
if IMP_TotalSulfurDioxide < 10 then IMP_TotalSulfurDioxide = 10 ;
if IMP_TotalSulfurDioxide > 350 then IMP_TotalSulfurDioxide = 350;
if IMP_FreeSulfurDioxide < 10 then IMP_FreeSulfurDioxide = 10 ;
if IMP_FreeSulfurDioxide > 350 then IMP_FreeSulfurDioxide = 350;
* more than 10 mg/l requires labeling, >350 mg/l is prohibited, limits based upon
requirements;
if IMP_PH < 3 then IMP_PH=3;
*a pH of 0.48 is high concentration acid that is unfit for human consumption;
if IMP_Sulphates <0 then IMP_SULPHATES= 0;
if missing(ResidualSugar) then IMP_ResidualSugar = 3.9;
if IMP_Alcohol <9 then IMP_Alcohol =9.0;
if IMP_ResidualSugar <1 then IMP_ResidualSugar=1;
if IMP_CITRICACID <0 then IMP_CITRICACID=0;
if IMP_VOLATILEACIDITY <0 then IMP_VOLATILEACIDITY=0;
if IMP_FixedAcidity <0 then IMP_FixedAcidity =0;
*grouping of wines by http://winefolly.com/wp-content/uploads/2013/10/basic-wine-101-guide-infographic-poster.jpg#big
light to heavy, which is a crude classification of white to red;
if IMP_Alcohol < 10.5 then Alcohol_TYPE=1;
if IMP_Alcohol >= 10.5 then Alcohol_TYPE=2;
if IMP_LabelAppeal <0 then Label_GROUP =1;
if IMP_LabelAppeal >=0 then Label_GROUP = 2;
if IMP_STARS <2 then STAR_IMPACT = 0;
if IMP_STARS >=2 then STAR_IMPACT = 1;
ALCOHOL_EMP = ALCOHOL_TYPE**2;
STAR_EMP = STAR_IMPACT**2;
EXPERT_INFLUENCE = ALCOHOL_TYPE + STAR_IMPACT +LABEL_GROUP;
REAL_pH = 10**(-IMP_pH);
density_adjusted = density - 1;
IMPURITIES = IMP_Chlorides + IMP_sulphates + IMP_FreeSulfurDioxide +
IMP_TotalSulfurDioxide;
TARGET_FLAG = ( TARGET > 0 );
TARGET_AMT = TARGET - 1;
if TARGET_FLAG = 0 then TARGET_AMT = .;
IMP_CHLORIDES_LOG = LOG10(IMP_CHLORIDES);
TARGET_LOG=0;
IF TARGET>0 then TARGET_LOG=1;
run;
proc means data=work.wine_scrub n nmiss median mean min max stddev;
run;
proc genmod data = work.wine_scrub;
stepwise:
model TARGET =
IMP_STARS
IMP_Density
IMP_Sulphates
IMP_Alcohol
IMP_LabelAppeal
IMP_CHLORIDES
IMP_FREESULFURDIOXIDE
IMP_TotalSulfurDioxide
REAL_PH
IMP_ACIDINDEX
IMP_RESIDUALSUGAR
IMP_CITRICACID
IMP_VOLATILEACIDITY
IMP_FixedAcidity
Alcohol_Type
STAR_IMPACT
IMP_CHLORIDES_LOG
LABEL_GROUP
/link=log dist=zinb;
zeromodel IMP_STARS
IMP_LabelAppeal
IMP_CHLORIDES_LOG
M_STARS
M_SULPHATES
IMP_TotalSulfurDioxide
IMP_ACIDINDEX
IMP_CITRICACID
IMP_VOLATILEACIDITY
REAL_pH
IMP_Alcohol
ALCOHOL_TYPE
STAR_IMPACT /link=logit;
output out= work.wine_scrub_zinb_out p=y_zinb;
run;
proc genmod data = work.wine_scrub;
stepwise:
model TARGET =
IMP_STARS
IMP_Density
IMP_Sulphates
IMP_Alcohol
IMP_LabelAppeal
IMP_CHLORIDES
IMP_FREESULFURDIOXIDE
/link=log dist=zinb;
zeromodel IMP_STARS
IMP_LabelAppeal
IMP_CHLORIDES_LOG
IMP_TotalSulfurDioxide
IMP_ACIDINDEX
IMP_CITRICACID
IMP_VOLATILEACIDITY
REAL_pH
IMP_Alcohol
ALCOHOL_TYPE
STAR_IMPACT /link=logit;
output out= work.wine_scrub_zinb_outx p=y_zinb;
run;

Chris_Dorow_PRED411_Sec55_PROJ3

  • 1. C. Dorow, PRED411, Sec55 Insurance, Project 3, May 31, 2015 1 Predict 411 Section 55 Project 3 ‘Wine Sales Review’ By Christopher Dorow Due Date: May 31, 2015 File Name: Chris_Dorow_PRED411_Sec55_PROJ3.PDF
  • 2. Chris Dorow, PRED411, Sec55 Project 3, May 31, 2015 2 Results Summary and Conclusion Several models were developed to predict the probability and amount of wine sales based upon a collection of variables. The training data consisted of approximately 12,000 records. The best model from my investigation was a Zero Inflated Poisson Regression, which yielded a model AIC of 40,865. The factors most likely to influence wine sales were the presence of a rating for the wine, as wines without a STAR rating sold poorly, and greater label appeal was likely to increase wine sales. Introduction The purpose of this assignment is to develop a regression that will predict the number of probability of claim based upon the data set provided. Variables included in this data set are listed below: • Acid index, a measurement of total acidity • Alcohol content • Chloride content of wine • Citric acid content • Wine density • Wine fixed acidity • Free sulfur dioxide content • Label appeal • Residual sugar • Independent rating by stars • Sulphate content of wine • Total sulfur dioxide • Volatile acidity • Wine pH Evaluations of data quality will be made, including identification of missing or outlier data. Linear, Poisson, Zero Inflated Poisson, Negative Binomial, and Zero Inflated Negative Binomial regressions will be generated and compared. The best model will be selected that predicts the amount of wine sold, in cases.
  • 3. C. Dorow, PRED411, Sec55 Insurance, Project 3, May 31, 2015 3 Data Exploration Within the provided information was a data dictionary, which is copied below. Variable  Name   Definition   Theoretical   Effect   INDEX           Identification  Variable  (do  not  use)   None   TARGET Number  of  Cases  Purchased   None           AcidIndex   Proprietary method of testing total acidity of wine by using a weighted average     Alcohol   Alcohol Content     Chlorides   Chloride content of wine     CitricAcid   Citric Acid Content     Density   Density of Wine     FixedAcidity   Fixed Acidity of Wine     FreeSulfurDioxide   Sulfur Dioxide content of wine     LabelAppeal   Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customes don't like the design. Many   consumers   purchase   based  on  the   visual  appeal   of  the  wine   label  design.   Higher   numbers   suggest  better   sales.   ResidualSugar   Residual Sugar of wine     STARS   Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor A  high  number   of  stars   suggests  high   sales   Sulphates   Sulfate content of wine     TotalSulfurDioxide   Total Sulfur Dioxide of Wine     VolatileAcidity   Volatile Acid content of wine     pH   pH of wine     Continuous variables were reviewed and I could not discern trends that could be utilized among the continuous data. However upon reviewing two key contingency tables, I was able to locate two key variables. The tables are located in Attachment 1. The first contingency table considered LabelAppeal and Target. Lower rated labels had lower target values. Given the examples below, an appealing labal and bottle combination can be very useful in grabbing the attention of the consumer.
  • 4. Chris Dorow, PRED411, Sec55 Project 3, May 31, 2015 4
  • 5. C. Dorow, PRED411, Sec55 Insurance, Project 3, May 31, 2015 5
  • 6. Chris Dorow, PRED411, Sec55 Project 3, May 31, 2015 6 The second contingency table that was very useful was the Stars and Target table. When there was no rating, no sales occurred in just over 2,000 records, or one-sixth of the training data. The consumers seem to shy away from the unknown quality when it comes to wine. Data Preparation The descriptive statistics for the data set are summarized for the continuous variables in the following table. The missing records for the respective variables were replaced with the respective variable mean values. Missing values are flagged in the chosen model for identification and reference. Missing values were flagged for identification purposes Variable N N Miss Median Mean Minimum Maximum Std Dev INDEX TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates Alcohol LabelAppeal AcidIndex STARS 12795 12795 12795 12795 12795 12179 12157 12148 12113 12795 12400 11585 12142 12795 12795 9436 0 0 0 0 0 616 638 647 682 0 395 1210 653 0 0 3359 8110.00 3.0000000 6.9000000 0.2800000 0.3100000 3.9000000 0.0460000 30.0000000 123.0000000 0.9944900 3.2000000 0.5000000 10.4000000 0 8.0000000 2.0000000 8069.98 3.0290739 7.0757171 0.3241039 0.3084127 5.4187331 0.0548225 30.8455713 120.7142326 0.9942027 3.2076282 0.5271118 10.4892363 -0.0090660 7.7727237 2.0417550 1.0000000 0 -18.1000000 -2.7900000 -3.2400000 -127.8000000 -1.1710000 -555.0000000 -823.0000000 0.8880900 0.4800000 -3.1300000 -4.7000000 -2.0000000 4.0000000 1.0000000 16129.00 8.0000000 34.4000000 3.6800000 3.8600000 141.1500000 1.3510000 623.0000000 1057.00 1.0992400 6.1300000 4.2400000 26.5000000 2.0000000 17.0000000 4.0000000 4656.91 1.9263682 6.3176435 0.7840142 0.8620798 33.7493790 0.3184673 148.7145577 231.9132105 0.0265376 0.6796871 0.9321293 3.7278190 0.8910892 1.3239264 0.9025400 Treatment of Outliers Sulfur dioxide records (free and total) were limited to 10 mg/l and 350 mg/l, as 
concentrations above 10 mg/l require labeling, and the maximum concentration of sulphates is limited to 350 mg/l by law. (Source: http://www.piwine.com/use-and- measurement-of-sulfur-dioxide-in-wine.html_). pH limits were put at 3, as negative values of pH are indicative of highly concentrated mineral acids, such as hydrochloric or sulfuric acids, and unfit for human consumption, indicating the inappropriateness of the value. Negative values for any concentration or composition values were also conditioned as they are not possible. These values were replaced with the lowest acceptable value.
  • 7. Variable Creation and Combination

The following variables were created (thresholds follow the code in Appendix 3):

Alcohol_Type
    Description: Alcohol less than 10.5 (value = 1); 10.5 or greater (value = 2).
    Implication: Wines with alcohol content less than 10.5% are predominantly white wines; those above 10.5% are predominantly red wines.

Label_Group
    Description: Grouping of Label_appeal. If negative, Label_Group = 1; if zero or positive, Label_Group = 2.
    Implication: Grouping of impact of Label_appeal on sales (negative or positive correlation).

Star_Impact
    Description: Grouping of STARS. If STARS is less than 2, Star_Impact = 0; if 2 or greater, Star_Impact = 1.
    Implication: Grouping of impact of the wine rating system.

Real_pH
    Description: Conversion of pH into hydrogen ion concentration in moles/liter.
    Implication: Concentration = 10**(-pH).

Density Adjusted
    Description: Density - 1.
    Implication: Indication if above or below the specific gravity of water.

Impurities
    Description: Sum of chlorides, sulphates, free sulfur dioxide and total sulfur dioxide.
    Implication: Impact of preservatives.

Imp_Chlorides_Log
    Description: Log (base 10) of chlorides concentration.
    Implication: Impact of chlorides.
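The created variables can be illustrated with a short Python sketch. It mirrors the thresholds used in the Appendix 3 code (alcohol split at 10.5, label appeal split at 0, stars split at 2); the function name `derive_features` and the example record are assumptions made for illustration only.

```python
import math


def derive_features(alcohol, label_appeal, stars, ph, chlorides):
    """Build the grouped and transformed variables for one (already imputed) record."""
    return {
        "Alcohol_TYPE": 1 if alcohol < 10.5 else 2,
        "Label_GROUP": 1 if label_appeal < 0 else 2,
        "STAR_IMPACT": 0 if stars < 2 else 1,
        # 10**(-pH) is the hydrogen ion concentration in moles/liter.
        "REAL_pH": 10 ** (-ph),
        # chlorides must be positive here; the report floors them at 0.01 first.
        "IMP_CHLORIDES_LOG": math.log10(chlorides),
    }


# Hypothetical record built from the median values in the descriptive statistics.
features = derive_features(alcohol=10.4, label_appeal=0, stars=2, ph=3.2, chlorides=0.046)
```

Because REAL_pH is on the order of 0.0006 for a typical wine, its regression coefficients are numerically very large compared with those of the other predictors.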
  • 8. Model Development

Linear Model

The most appropriate linear model that I was able to develop is presented below. It has an R-squared of 0.3365, and the variable coefficients are presented below. The regression has an average error of 0.002, with a standard deviation of 1.59.

Variable                Parameter Estimate  Standard Error  Type II SS   F Value  Pr > F
Intercept               4.59507             0.54968         172.14337    69.88    <.0001
IMP_STARS               1.34815             0.02780         5794.49325   2352.26  <.0001
IMP_Density             -1.06520            0.52398         10.18023     4.13     0.0421
IMP_Sulphates           -0.06317            0.02104         22.20909     9.02     0.0027
IMP_LabelAppeal         0.53029             0.02645         990.02230    401.90   <.0001
IMP_FREESULFURDIOXIDE   0.00069504          0.00015990      46.54574     18.90    <.0001
IMP_TotalSulfurDioxide  0.00076498          0.00012274      95.69266     38.85    <.0001
IMP_PH                  -0.12880            0.02906         48.39406     19.65    <.0001
IMP_ACIDINDEX           -0.29945            0.01067         1939.71306   787.42   <.0001
IMP_CITRICACID          0.03850             0.01614         14.01150     5.69     0.0171
IMP_VOLATILEACIDITY     -0.14508            0.01774         164.78316    66.89    <.0001
Alcohol_TYPE            0.17647             0.02799         97.90097     39.74    <.0001
STAR_IMPACT             -1.44641            0.04898         2147.93290   871.95   <.0001
IMP_CHLORIDES_LOG       -0.11579            0.02372         58.70894     23.83    <.0001
Label_GROUP             0.08371             0.05121         6.58391      2.67     0.1021
  • 9. Summary of Stepwise Selection

Step  Variable Entered        Variable Removed  Number Vars In  Partial R-Square  Model R-Square  C(p)     F Value  Pr > F
1     IMP_STARS                                 1               0.1601            0.1601          3394.31  2438.72  <.0001
2     IMP_LabelAppeal                           2               0.0639            0.2240          2164.98  1053.31  <.0001
3     STAR_IMPACT                               3               0.0550            0.2790          1106.34  976.49   <.0001
4     IMP_ACIDINDEX                             4               0.0454            0.3245          233.114  859.89   <.0001
5     IMP_VOLATILEACIDITY                       5               0.0037            0.3282          163.250  70.99    <.0001
6     Alcohol_TYPE                              6               0.0021            0.3302          125.697  39.19    <.0001
7     IMP_TotalSulfurDioxide                    7               0.0022            0.3324          85.4345  42.01    <.0001
8     IMP_CHLORIDES_LOG                         8               0.0013            0.3338          62.0761  25.25    <.0001
9     IMP_PH                                    9               0.0010            0.3348          44.0543  19.97    <.0001
10    IMP_FREESULFURDIOXIDE                     10              0.0010            0.3358          27.2288  18.80    <.0001
11    IMP_Sulphates                             11              0.0005            0.3362          20.0288  9.19     0.0024
12    IMP_CITRICACID                            12              0.0003            0.3365          16.1780  5.85     0.0156
13    IMP_Density                               13              0.0002            0.3368          13.9635  4.21     0.0401
14    Label_GROUP                               14              0.0001            0.3369          13.2911  2.67     0.1021

For a wine novice, the coefficients are difficult to interpret. The counterintuitive result is that IMP_STARS (expert rating) and STAR_IMPACT appear to be in conflict: the former has a positive coefficient and the latter a negative one. Based upon the reference sources (http://www.piwine.com/use-and-measurement-of-sulfur-dioxide-in-wine.html, http://waterhouse.ucdavis.edu/whats-in-wine/fixed-acidity, and http://winefolly.com/wp-content/uploads/2013/10/basic-wine-101-guide-infographic-poster.jpg#big), it is possible that the combination of variables makes sense overall, as wine critic opinions may not represent popular opinion or economic sense to the consumer. There is some indication that label appeal drives sales, based upon LABEL_GROUP. The following represent some examples of unique wine labels that capture consumer interest.
  • 10. Poisson Regression

The most appropriate Poisson model that I was able to develop is presented below. It has an AIC of 49,895, and the variable coefficients are presented in the table below. The regression has an average error of 0.025, with a standard deviation of 1.62.

Analysis Of Maximum Likelihood Parameter Estimates

Parameter             DF  Estimate  Standard Error  Wald 95% Lower  Wald 95% Upper  Wald Chi-Square  Pr > ChiSq
Intercept             1   1.5004    0.2003          1.1078          1.8930          56.10            <.0001
IMP_STARS             1   0.3348    0.0085          0.3181          0.3515          1546.53          <.0001
IMP_Density           1   -0.3517   0.1922          -0.7284         0.0250          3.35             0.0672
IMP_Sulphates         1   -0.0233   0.0079          -0.0387         -0.0079         8.80             0.0030
IMP_Alcohol           1   -0.0015   0.0020          -0.0054         0.0024          0.59             0.4430
IMP_LabelAppeal       1   0.1526    0.0090          0.1350          0.1702          287.73           <.0001
IMP_CHLORIDES         1   0.0354    0.0475          -0.0577         0.1285          0.56             0.4562
IMP_FREESULFURDIOXID  1   0.0002    0.0001          0.0001          0.0003          15.73            <.0001
IMP_TotalSulfurDioxi  1   0.0002    0.0000          0.0002          0.0003          30.22            <.0001
REAL_pH               1   89.6532   14.8763         60.4961         118.8102        36.32            <.0001
IMP_ACIDINDEX         1   -0.1173   0.0045          -0.1261         -0.1085         678.54           <.0001
IMP_RESIDUALSUGAR     1   0.0002    0.0002          -0.0001         0.0005          1.31             0.2519
IMP_CITRICACID        1   0.0129    0.0059          0.0014          0.0245          4.82             0.0281
IMP_VOLATILEACIDITY   1   -0.0476   0.0065          -0.0603         -0.0349         53.72            <.0001
IMP_FixedAcidity      1   -0.0005   0.0008          -0.0021         0.0011          0.40             0.5245
Alcohol_TYPE          1   0.0633    0.0144          0.0351          0.0915          19.37            <.0001
STAR_IMPACT           1   -0.3615   0.0178          -0.3964         -0.3265         410.32           <.0001
IMP_CHLORIDES_LOG     1   -0.0496   0.0165          -0.0820         -0.0172         8.99             0.0027
Label_GROUP           1   0.1108    0.0191          0.0732          0.1483          33.46            <.0001
Scale                 0   1.0000    0.0000          1.0000          1.0000

The variable coefficients presented in this Poisson regression are consistent with the linear regression, including the apparent conflict between IMP_STARS and STAR_IMPACT noted earlier. The same observations also hold true for the variable coefficients presented in the Negative Binomial Regression.
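Because the Poisson model uses a log link, its coefficients act multiplicatively on the expected number of cases. The sketch below is a hedged Python illustration of that reading. The function names are hypothetical, and the coefficient dictionary deliberately holds only two of the fitted terms, so the means it produces are for illustration and are not the model's actual predictions.

```python
import math


def poisson_mean(intercept, coefficients, record):
    """Expected count under a log link: exp(intercept + sum(beta * x))."""
    eta = intercept + sum(coefficients[name] * record[name] for name in coefficients)
    return math.exp(eta)


def rate_ratio(coefficient):
    """Multiplicative change in the expected count per one-unit increase."""
    return math.exp(coefficient)


# Two coefficients copied from the table above; all other terms are omitted
# (an assumption made only to keep the example short).
ILLUSTRATIVE_COEFS = {"IMP_STARS": 0.3348, "IMP_LabelAppeal": 0.1526}
INTERCEPT = 1.5004

one_star = poisson_mean(INTERCEPT, ILLUSTRATIVE_COEFS, {"IMP_STARS": 1, "IMP_LabelAppeal": 0})
two_star = poisson_mean(INTERCEPT, ILLUSTRATIVE_COEFS, {"IMP_STARS": 2, "IMP_LabelAppeal": 0})
stars_ratio = rate_ratio(0.3348)
```

With the IMP_STARS estimate of 0.3348, each additional star multiplies the expected count by roughly 1.40, other variables held fixed; the scoring code in Appendix 3 applies the same exponentiation to the full linear predictor.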
  • 11. Negative Binomial Regression

The most appropriate Negative Binomial model that I was able to develop is presented below. It has an AIC of 49,897, and the variable coefficients are presented in the table below. The regression has an average error of 0.025, with a standard deviation of 1.62. These initial results are essentially identical to those of the Poisson model. This occurred because of the variable selection method used and because the Poisson distribution is a special case of the Negative Binomial distribution: with the dispersion parameter estimated at 0.0000, the variance equals the mean, as in the Poisson model.

Analysis Of Maximum Likelihood Parameter Estimates

Parameter             DF  Estimate  Standard Error  Wald 95% Lower  Wald 95% Upper  Wald Chi-Square  Pr > ChiSq
Intercept             1   1.5004    0.2003          1.1078          1.8930          56.10            <.0001
IMP_STARS             1   0.3348    0.0085          0.3181          0.3515          1546.52          <.0001
IMP_Density           1   -0.3517   0.1922          -0.7284         0.0250          3.35             0.0672
IMP_Sulphates         1   -0.0233   0.0079          -0.0387         -0.0079         8.80             0.0030
IMP_Alcohol           1   -0.0015   0.0020          -0.0054         0.0024          0.59             0.4430
IMP_LabelAppeal       1   0.1526    0.0090          0.1350          0.1702          287.73           <.0001
IMP_CHLORIDES         1   0.0354    0.0475          -0.0577         0.1285          0.56             0.4562
IMP_FREESULFURDIOXID  1   0.0002    0.0001          0.0001          0.0003          15.73            <.0001
IMP_TotalSulfurDioxi  1   0.0002    0.0000          0.0002          0.0003          30.22            <.0001
REAL_pH               1   89.6532   14.8763         60.4961         118.8102        36.32            <.0001
IMP_ACIDINDEX         1   -0.1173   0.0045          -0.1261         -0.1085         678.54           <.0001
IMP_RESIDUALSUGAR     1   0.0002    0.0002          -0.0001         0.0005          1.31             0.2519
IMP_CITRICACID        1   0.0129    0.0059          0.0014          0.0245          4.82             0.0281
IMP_VOLATILEACIDITY   1   -0.0476   0.0065          -0.0603         -0.0349         53.72            <.0001
IMP_FixedAcidity      1   -0.0005   0.0008          -0.0021         0.0011          0.40             0.5245
Alcohol_TYPE          1   0.0633    0.0144          0.0351          0.0915          19.37            <.0001
STAR_IMPACT           1   -0.3615   0.0178          -0.3964         -0.3265         410.31           <.0001
IMP_CHLORIDES_LOG     1   -0.0496   0.0165          -0.0820         -0.0172         8.99             0.0027
Label_GROUP           1   0.1108    0.0191          0.0732          0.1483          33.46            <.0001
Dispersion            1   0.0000    0.0001          0.0000          2.24E122

I then manually modified the model according to the assignment instructions. I inserted a new variable, called EXPERT_OPINION, which was the sum of the squared LABEL_GROUP and STAR_IMPACT. The AIC increased to 50,177. I chose not to run additional analysis, as the model did not improve on the earlier Poisson model. The table below summarizes the variable coefficients of this alternative model.
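The relationship between the two count models can be made concrete with a small Python sketch. It assumes the common parameterization in which the Negative Binomial variance is mu + k * mu**2, with k the dispersion parameter; that form is assumed here to correspond to the Dispersion row in the tables, and the function names and the value of mu are illustrative.

```python
def poisson_variance(mu):
    """Poisson: the variance equals the mean."""
    return mu


def negative_binomial_variance(mu, k):
    """Negative Binomial (assumed parameterization): mu + k * mu**2."""
    return mu + k * mu ** 2


mu = 3.03  # roughly the mean of TARGET, used only as an example
var_poisson = poisson_variance(mu)
var_nb_zero_dispersion = negative_binomial_variance(mu, 0.0)
var_nb_overdispersed = negative_binomial_variance(mu, 0.5)
```

With a dispersion estimate of essentially zero the two variances coincide, which is consistent with the two tables sharing almost identical estimates and standard errors.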
  • 12. Analysis Of Maximum Likelihood Parameter Estimates

Parameter             DF  Estimate  Standard Error  Wald 95% Lower  Wald 95% Upper  Wald Chi-Square  Pr > ChiSq
Intercept             1   1.2531    0.1976          0.8658          1.6404          40.21            <.0001
IMP_STARS             1   0.3644    0.0083          0.3481          0.3807          1912.32          <.0001
IMP_Density           1   -0.3291   0.1922          -0.7059         0.0476          2.93             0.0869
IMP_Sulphates         1   -0.0226   0.0079          -0.0380         -0.0072         8.24             0.0041
IMP_Alcohol           1   -0.0013   0.0020          -0.0052         0.0026          0.44             0.5065
EXPERT_OPINION        1   0.1154    0.0043          0.1068          0.1239          706.46           <.0001
IMP_CHLORIDES         1   0.0192    0.0476          -0.0741         0.1124          0.16             0.6870
IMP_FREESULFURDIOXID  1   0.0002    0.0001          0.0001          0.0004          17.47            <.0001
IMP_TotalSulfurDioxi  1   0.0002    0.0000          0.0002          0.0003          29.48            <.0001
REAL_pH               1   88.7296   14.8692         59.5864         117.8728        35.61            <.0001
IMP_ACIDINDEX         1   -0.1145   0.0045          -0.1233         -0.1057         647.47           <.0001
IMP_RESIDUALSUGAR     1   0.0002    0.0002          -0.0001         0.0005          1.91             0.1665
IMP_CITRICACID        1   0.0131    0.0059          0.0015          0.0246          4.91             0.0266
IMP_VOLATILEACIDITY   1   -0.0495   0.0065          -0.0622         -0.0368         58.10            <.0001
IMP_FixedAcidity      1   -0.0008   0.0008          -0.0024         0.0009          0.85             0.3562
Alcohol_TYPE          1   0.0593    0.0144          0.0311          0.0875          17.02            <.0001
STAR_IMPACT           1   -0.4941   0.0183          -0.5299         -0.4583         731.26           <.0001
IMP_CHLORIDES_LOG     1   -0.0435   0.0166          -0.0759         -0.0111         6.91             0.0086
Dispersion            1   0.0000    0.0001          0.0000          8.49E183
  • 13. Zero Inflated Poisson Regression

The most appropriate ZIP model that I was able to develop is presented below. It has an AIC of 40,865, and the variable coefficients are presented in the table below.

Analysis Of Maximum Likelihood Parameter Estimates

Parameter             DF  Estimate  Standard Error  Wald 95% Lower  Wald 95% Upper  Wald Chi-Square  Pr > ChiSq
Intercept             1   1.3960    0.1998          1.0045          1.7876          48.84            <.0001
IMP_STARS             1   0.1137    0.0088          0.0964          0.1309          166.49           <.0001
IMP_Density           1   -0.2694   0.1969          -0.6553         0.1164          1.87             0.1711
IMP_Sulphates         1   0.0006    0.0080          -0.0151         0.0162          0.00             0.9439
IMP_Alcohol           1   0.0003    0.0028          -0.0052         0.0058          0.01             0.9198
STAR_IMPACT           1   -0.0280   0.0187          -0.0646         0.0086          2.25             0.1339
IMP_CHLORIDES         1   -0.0389   0.0258          -0.0895         0.0116          2.28             0.1313
IMP_FREESULFURDIOXID  1   0.0000    0.0001          -0.0001         0.0002          0.46             0.4966
IMP_TotalSulfurDioxi  1   -0.0000   0.0000          -0.0001         0.0000          0.78             0.3761
IMP_ACIDINDEX         1   -0.0194   0.0049          -0.0290         -0.0098         15.64            <.0001
IMP_LabelAppeal       1   0.2413    0.0062          0.2291          0.2536          1494.55          <.0001
IMP_CITRICACID        1   0.0002    0.0087          -0.0168         0.0172          0.00             0.9807
IMP_VOLATILEACIDITY   1   -0.0220   0.0097          -0.0410         -0.0030         5.17             0.0230
IMP_FixedAcidity      1   0.0002    0.0010          -0.0017         0.0022          0.06             0.8131
REAL_pH               1   -10.0082  15.2121         -39.8233        19.8069         0.43             0.5106
Alcohol_TYPE          1   0.0795    0.0149          0.0502          0.1087          28.33            <.0001
Scale                 0   1.0000    0.0000          1.0000          1.0000
  • 14. Analysis Of Maximum Likelihood Zero Inflation Parameter Estimates

Parameter             DF  Estimate  Standard Error  Wald 95% Lower  Wald 95% Upper  Wald Chi-Square  Pr > ChiSq
Intercept             1   6.4613    72.2951         -135.234        148.1570        0.01             0.9288
IMP_STARS             1   -11.3195  72.2946         -153.014        130.3752        0.02             0.8756
M_STARS               1   5.8765    0.3463          5.1977          6.5553          287.88           <.0001
M_SULPHATES           1   0.0900    0.1108          -0.1271         0.3071          0.66             0.4164
IMP_LabelAppeal       1   0.6992    0.0415          0.6179          0.7805          284.40           <.0001
IMP_CHLORIDES_LOG     1   0.0575    0.0568          -0.0538         0.1688          1.03             0.3111
IMP_TotalSulfurDioxi  1   -0.0019   0.0003          -0.0025         -0.0013         42.30            <.0001
IMP_ACIDINDEX         1   0.4391    0.0255          0.3891          0.4890          296.70           <.0001
IMP_CITRICACID        1   -0.0889   0.0572          -0.2010         0.0231          2.42             0.1198
IMP_VOLATILEACIDITY   1   0.2550    0.0573          0.1426          0.3674          19.77            <.0001
REAL_pH               1   -636.197  97.4542         -827.204        -445.190        42.62            <.0001
IMP_Alcohol           1   -0.0128   0.0193          -0.0506         0.0249          0.44             0.5055
Alcohol_TYPE          1   0.3255    0.0990          0.1315          0.5194          10.82            0.0010
STAR_IMPACT           1   7.5539    72.2970         -134.146        149.2534        0.01             0.9168

The most important improvement was the inclusion of M_STARS (an indicator that the STARS record was missing). From the EDA (Appendix 1), about 61% of the wines with no rating sold no cases, and these wines account for about 75% of all records with zero sales.
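How the two halves of a zero-inflated Poisson model combine can be shown in a brief Python sketch. It assumes the standard ZIP form with a logit link for the zero-inflation part and a log link for the count part; the function names and the two linear-predictor values (1.2 and -2.0) are invented for illustration, and only the M_STARS shift of 5.8765 comes from the table above.

```python
import math


def logistic(z):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def zip_expected_count(count_eta, zero_eta):
    """Expected cases: (1 - P(structural zero)) * Poisson mean."""
    pi_zero = logistic(zero_eta)
    mu = math.exp(count_eta)
    return (1.0 - pi_zero) * mu


def zip_prob_zero(count_eta, zero_eta):
    """P(0 cases): a structural zero, or a Poisson draw that happens to be zero."""
    pi_zero = logistic(zero_eta)
    mu = math.exp(count_eta)
    return pi_zero + (1.0 - pi_zero) * math.exp(-mu)


# Hypothetical wine, with and without a star rating. A missing rating adds
# the M_STARS coefficient to the zero-inflation linear predictor.
M_STARS_COEF = 5.8765
rated = zip_expected_count(count_eta=1.2, zero_eta=-2.0)
unrated = zip_expected_count(count_eta=1.2, zero_eta=-2.0 + M_STARS_COEF)
```

A positive coefficient in the zero-inflation table therefore raises the chance of selling nothing at all, which is the opposite of how a positive coefficient reads in the count table.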
  • 15. Zero Inflated Negative Binomial Regression

The most appropriate ZINB model that I was able to develop is presented below. It has an AIC of 43,937, and the variable coefficients are presented in the table below.

Analysis Of Maximum Likelihood Parameter Estimates

Parameter             DF  Estimate  Standard Error  Wald 95% Lower  Wald 95% Upper  Wald Chi-Square  Pr > ChiSq
Intercept             1   1.1376    0.2055          0.7349          1.5403          30.65            <.0001
IMP_STARS             1   0.1155    0.0088          0.0983          0.1328          171.82           <.0001
IMP_Density           1   -0.2516   0.1968          -0.6374         0.1341          1.63             0.2011
IMP_Sulphates         1   0.0007    0.0080          -0.0149         0.0163          0.01             0.9305
IMP_Alcohol           1   0.0001    0.0028          -0.0054         0.0056          0.00             0.9748
IMP_LabelAppeal       1   0.2007    0.0091          0.1829          0.2186          483.72           <.0001
IMP_CHLORIDES         1   0.0059    0.0492          -0.0906         0.1024          0.01             0.9048
IMP_FREESULFURDIOXID  1   0.0000    0.0001          -0.0001         0.0002          0.40             0.5271
IMP_TotalSulfurDioxi  1   -0.0000   0.0000          -0.0001         0.0000          0.72             0.3967
REAL_pH               1   -9.1897   15.2098         -39.0003        20.6209         0.37             0.5457
IMP_ACIDINDEX         1   -0.0190   0.0049          -0.0286         -0.0094         15.05            0.0001
IMP_RESIDUALSUGAR     1   0.0000    0.0002          -0.0005         0.0005          0.00             0.9677
IMP_CITRICACID        1   0.0002    0.0087          -0.0168         0.0172          0.00             0.9844
IMP_VOLATILEACIDITY   1   -0.0221   0.0097          -0.0411         -0.0031         5.19             0.0227
IMP_FixedAcidity      1   0.0002    0.0010          -0.0018         0.0021          0.03             0.8563
Alcohol_TYPE          1   0.0795    0.0149          0.0502          0.1088          28.32            <.0001
STAR_IMPACT           1   -0.0329   0.0187          -0.0695         0.0037          3.10             0.0785
IMP_CHLORIDES_LOG     1   -0.0191   0.0171          -0.0527         0.0144          1.25             0.2639
Label_GROUP           1   0.1207    0.0196          0.0823          0.1591          37.94            <.0001
Dispersion            1   0.0000    0.0000          0.0000          1.007E39
  • 16. Analysis Of Maximum Likelihood Zero Inflation Parameter Estimates

Parameter             DF  Estimate  Standard Error  Wald 95% Lower  Wald 95% Upper  Wald Chi-Square  Pr > ChiSq
Intercept             1   6.7501    84.4772         -158.822        172.3224        0.01             0.9363
IMP_STARS             1   -11.6220  84.4768         -177.193        153.9494        0.02             0.8906
IMP_LabelAppeal       1   0.7127    0.0419          0.6306          0.7948          289.62           <.0001
IMP_CHLORIDES_LOG     1   0.0524    0.0569          -0.0592         0.1640          0.85             0.3572
M_STARS               1   5.8954    0.3528          5.2040          6.5869          279.29           <.0001
M_SULPHATES           1   0.0905    0.1110          -0.1271         0.3080          0.66             0.4150
IMP_TotalSulfurDioxi  1   -0.0019   0.0003          -0.0025         -0.0013         42.16            <.0001
IMP_ACIDINDEX         1   0.4398    0.0255          0.3898          0.4899          296.47           <.0001
IMP_CITRICACID        1   -0.0895   0.0573          -0.2018         0.0228          2.44             0.1183
IMP_VOLATILEACIDITY   1   0.2559    0.0575          0.1432          0.3685          19.83            <.0001
REAL_pH               1   -637.209  97.6977         -828.693        -445.725        42.54            <.0001
IMP_Alcohol           1   -0.0134   0.0193          -0.0513         0.0245          0.48             0.4871
Alcohol_TYPE          1   0.3283    0.0992          0.1338          0.5227          10.95            0.0009
STAR_IMPACT           1   7.8393    84.4789         -157.736        173.4149        0.01             0.9261

Model Selection

Model                                       AIC
Poisson                                     49,877
Negative Binomial                           49,877
Negative Binomial (modified)                50,902
Zero Inflated Poisson                       40,865
Zero Inflated Negative Binomial (modified)  43,937

The model I chose was the ZIP model, based upon the AIC. The scoring code for this model yields the following histogram.
  • 17. [Figure: Distribution of P_SCORE_ZIP (histogram; x-axis P_SCORE_ZIP, 0 to 8; y-axis Percent, 0 to 30)]

A strength of the model is that approximately 80% of the projections are within 1.5 of the target value and over 30% of the projections are on target (see Attachment 2). A weakness of this model is that records with 0 cases are undercounted. Based upon the instruction set for this assignment, the linear model could not be considered. However, Occam's Razor applies here: "…when you have two competing theories that make exactly the same predictions, the simpler one is the better" (source: www.math.ucr.edu/home/baez/physics/General/occam.html). Over the range of concern for the model, the linear regression was equally, or nearly equally, accurate.
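The accuracy figures quoted above are shares of records whose prediction error falls within a tolerance. A minimal Python sketch of that calculation follows; the function name `share_within` and the five target/prediction pairs are made up for illustration and are not taken from the scored data.

```python
def share_within(targets, predictions, tolerance):
    """Fraction of records whose absolute error is at most `tolerance`."""
    errors = [t - p for t, p in zip(targets, predictions)]
    hits = sum(1 for e in errors if abs(e) <= tolerance)
    return hits / len(errors)


# Hypothetical cases sold and rounded model scores.
targets = [0, 3, 4, 5, 2]
predictions = [2, 3, 3, 5, 4]

within_1_5 = share_within(targets, predictions, 1.5)
on_target = share_within(targets, predictions, 0)
```

Applied to the error terms behind the histograms in Appendix 2, the same function would give the "within 1.5" and "on target" shares for each model.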
  • 18. Model Interpretation

The following tables summarize the ZIP model selected and the meaning of the respective coefficients.

Maximum Likelihood Parameter Estimates

Parameter             Coefficient  Interpretation
Intercept             1.1376
IMP_STARS             0.1155       A higher star rating increases wine sales.
IMP_Density           -0.2516      Increased wine density reduces wine sales.
IMP_Sulphates         0.0007       A higher concentration of sulphates increases wine sales.
IMP_Alcohol           0.0001       Increased alcohol content increases the amount of wine sales.
IMP_LabelAppeal       0.2007       A higher label appeal rating increases wine sales.
IMP_CHLORIDES         0.0059       A higher concentration of chlorides increases the amount of wine sales.
IMP_FREESULFURDIOXID  0.0000       The presence of free sulfur dioxide has no impact on wine sales amount.
IMP_TotalSulfurDioxi  -0.0000      The presence of total sulfur dioxide has no impact on wine sales amount.
REAL_pH               -9.1897      pH, expressed as hydrogen ion concentration, reduces wine sales amount.
IMP_ACIDINDEX         -0.0190      Acid index has a negative impact on wine sales amount.
IMP_RESIDUALSUGAR     0.0000       Residual sugar has no impact on wine sales amount.
IMP_CITRICACID        0.0002       A higher citric acid concentration increases wine sales.
IMP_VOLATILEACIDITY   -0.0221      Volatile acidity decreases the wine sales amount.
IMP_FixedAcidity      0.0002       Fixed acidity increases the wine sales amount.
Alcohol_TYPE          0.0795       Wines having alcohol greater than 10.5% sell in greater amounts.
STAR_IMPACT           -0.0329      Wines rated below 2 stars sell more than higher-rated wines, other variables held equal.
IMP_CHLORIDES_LOG     -0.0191      The logarithm of chlorides negatively impacts wine sales.
Label_GROUP           0.1207       Wines with label ratings below 0 sell in smaller amounts than wines with positively rated labels.
  • 19. Maximum Likelihood Zero Inflation Parameter Estimates

Parameter             Coefficient  Interpretation
Intercept             6.7501
IMP_STARS             -11.6220     An increase in the star rating reduces the probability that none of this wine will be sold.
IMP_LabelAppeal       0.7127       An increase in the label appeal rating increases the probability that none of this wine will be sold.
IMP_CHLORIDES_LOG     0.0524       An increase in the log concentration of chlorides increases the probability that none of this wine will be sold.
M_STARS               5.8954       A missing record for STARS results in an increased probability that none of the particular wine will be sold.
M_SULPHATES           0.0905       A missing record for SULPHATES results in an increased probability that none of the particular wine will be sold.
IMP_TotalSulfurDioxi  -0.0019      An increase in the concentration of total sulfur dioxide decreases the probability that none of this wine will be sold.
IMP_ACIDINDEX         0.4398       An increase in the acid index increases the probability that none of this wine will be sold.
IMP_CITRICACID        -0.0895      An increase in the concentration of citric acid decreases the probability that none of this wine will be sold.
IMP_VOLATILEACIDITY   0.2559       An increase in the volatile acidity increases the probability that none of this wine will be sold.
REAL_pH               -637.209     An increase in the hydrogen ion concentration (REAL_pH = 10**(-pH)) decreases the probability that none of this wine will be sold.
IMP_Alcohol           -0.0134      An increase in the concentration of alcohol decreases the probability that none of this wine will be sold.
Alcohol_TYPE          0.3283       The shift from lower alcohol wines (<10.5%, almost all whites and lighter reds) to higher alcohol wines (heartier, drier wines, primarily reds) increases the probability that none of this wine will be sold.
STAR_IMPACT           7.8393       The shift from lower rated wines (<2 stars) to higher rated wines (>2 stars) increases the probability that none of this wine will be sold.
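The zero-inflation coefficients are easier to read once converted to odds ratios. The Python sketch below assumes the zero-inflation part uses a logit link, so that exponentiating a coefficient gives the multiplicative change in the odds that no cases are sold; the function names are hypothetical and the two coefficients are copied from the table above.

```python
import math


def odds_ratio(coefficient):
    """Multiplicative change in the odds of zero sales per one-unit increase."""
    return math.exp(coefficient)


def probability_from_log_odds(log_odds):
    """Convert a logit-scale linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


m_stars_odds_ratio = odds_ratio(5.8954)     # missing STARS record
acid_index_odds_ratio = odds_ratio(0.4398)  # one-unit increase in acid index
```

Under that assumption, a missing star rating multiplies the odds of zero sales by several hundred, while each additional unit of acid index raises them by roughly half.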
  • 20. File Attachments

File Name                              Contents     Comments
CDOROW_PRD411_SEC55_PROJ3TEST.sas      Test code    SAS
CDOROW_PRED411_PROJ3_SCORE_FILE.sas    Scored data  Bingo Bonus for .sas file.
CDOROW_PRED411_PROJ3_SCORE.csv         CSV file     contingency
CDOROW_SEC55_MODELWINNER_PROJ3.sas
  • 21. Appendix 1

Correlation of Continuous Variables
  • 22. [Figure: Distribution of TARGET (histogram; x-axis TARGET, 0 to 8; y-axis Percent, 0 to 25)]
  • 23. Contingency Table of Stars by Target

Table of STARS by TARGET (cell statistics: Frequency, Percent, Row Pct, Col Pct)

STARS  Statistic  TARGET=0  1      2      3      4      5      6      7      8      Total
.      Frequency  2038      126    335    457    260    101    32     8      2      3359
       Percent    15.93     0.98   2.62   3.57   2.03   0.79   0.25   0.06   0.02   26.25
       Row Pct    60.67     3.75   9.97   13.61  7.74   3.01   0.95   0.24   0.06
       Col Pct    74.54     51.64  30.71  17.50  8.18   5.01   4.18   5.63   11.76
1      Frequency  607       98     469    916    716    214    22     0      0      3042
       Percent    4.74      0.77   3.67   7.16   5.60   1.67   0.17   0.00   0.00   23.77
       Row Pct    19.95     3.22   15.42  30.11  23.54  7.03   0.72   0.00   0.00
       Col Pct    22.20     40.16  42.99  35.08  22.54  10.63  2.88   0.00   0.00
2      Frequency  89        20     253    948    1333   716    199    12     0      3570
       Percent    0.70      0.16   1.98   7.41   10.42  5.60   1.56   0.09   0.00   27.90
       Row Pct    2.49      0.56   7.09   26.55  37.34  20.06  5.57   0.34   0.00
       Col Pct    3.26      8.20   23.19  36.31  41.96  35.55  26.01  8.45   0.00
3      Frequency  0         0      34     290    764    750    313    57     4      2212
       Percent    0.00      0.00   0.27   2.27   5.97   5.86   2.45   0.45   0.03   17.29
       Row Pct    0.00      0.00   1.54   13.11  34.54  33.91  14.15  2.58   0.18
       Col Pct    0.00      0.00   3.12   11.11  24.05  37.24  40.92  40.14  23.53
4      Frequency  0         0      0      0      104    233    199    65     11     612
       Percent    0.00      0.00   0.00   0.00   0.81   1.82   1.56   0.51   0.09   4.78
       Row Pct    0.00      0.00   0.00   0.00   16.99  38.07  32.52  10.62  1.80
       Col Pct    0.00      0.00   0.00   0.00   3.27   11.57  26.01  45.77  64.71
Total  Frequency  2734      244    1091   2611   3177   2014   765    142    17     12795
       Percent    21.37     1.91   8.53   20.41  24.83  15.74  5.98   1.11   0.13   100.00
  • 24. Table of Label Appeal by Target

Table of LabelAppeal by TARGET (cell statistics: Frequency, Percent, Row Pct, Col Pct)

LabelAppeal  Statistic  TARGET=0  1      2      3      4      5      6      7      8      Total
-2           Frequency  102       136    177    74     14     1      0      0      0      504
             Percent    0.80      1.06   1.38   0.58   0.11   0.01   0.00   0.00   0.00   3.94
             Row Pct    20.24     26.98  35.12  14.68  2.78   0.20   0.00   0.00   0.00
             Col Pct    3.73      55.74  16.22  2.83   0.44   0.05   0.00   0.00   0.00
-1           Frequency  671       89     755    1118   413    88     2      0      0      3136
             Percent    5.24      0.70   5.90   8.74   3.23   0.69   0.02   0.00   0.00   24.51
             Row Pct    21.40     2.84   24.08  35.65  13.17  2.81   0.06   0.00   0.00
             Col Pct    24.54     36.48  69.20  42.82  13.00  4.37   0.26   0.00   0.00
0            Frequency  1193      19     152    1347   1972   775    155    4      0      5617
             Percent    9.32      0.15   1.19   10.53  15.41  6.06   1.21   0.03   0.00   43.90
             Row Pct    21.24     0.34   2.71   23.98  35.11  13.80  2.76   0.07   0.00
             Col Pct    43.64     7.79   13.93  51.59  62.07  38.48  20.26  2.82   0.00
1            Frequency  660       0      7      70     765    1040   425    79     2      3048
             Percent    5.16      0.00   0.05   0.55   5.98   8.13   3.32   0.62   0.02   23.82
             Row Pct    21.65     0.00   0.23   2.30   25.10  34.12  13.94  2.59   0.07
             Col Pct    24.14     0.00   0.64   2.68   24.08  51.64  55.56  55.63  11.76
2            Frequency  108       0      0      2      13     110    183    59     15     490
             Percent    0.84      0.00   0.00   0.02   0.10   0.86   1.43   0.46   0.12   3.83
             Row Pct    22.04     0.00   0.00   0.41   2.65   22.45  37.35  12.04  3.06
             Col Pct    3.95      0.00   0.00   0.08   0.41   5.46   23.92  41.55  88.24
Total        Frequency  2734      244    1091   2611   3177   2014   765    142    17     12795
             Percent    21.37     1.91   8.53   20.41  24.83  15.74  5.98   1.11   0.13   100.00
  • 25. Appendix 2

Selected Regression Error Histograms: Linear Regression and Zero Inflated Poisson Regression
  • 26. Linear Regression Error Histogram

[Figure: Distribution of TARGET_ERROR (histogram; x-axis TARGET_ERROR, -5 to 5; y-axis Percent, 0 to 30)]
  • 27. Zero Inflated Poisson Regression

[Figure: Distribution of error_term (histogram; x-axis error_term, -6 to 7; y-axis Percent, 0 to 40)]
  • 28. Appendix 3

Code Used
  • 29. Linear Regression

    libname mydata '/folders/myfolders' access=readonly;

    proc contents data=mydata.wine;
    run;

    data work.wine_scrub;
    set mydata.wine;

    *cleaning up variables;
    TARGET_FLAG = ( TARGET > 0 );
    TARGET_AMT = TARGET - 1;
    if TARGET_FLAG = 0 then TARGET_AMT = .;

    *start from the raw values, missing ones are overwritten below;
    IMP_STARS = STARS;
    IMP_Density = Density;
    IMP_Sulphates = Sulphates;
    IMP_Alcohol = Alcohol;
    IMP_LabelAppeal = LabelAppeal;
    IMP_CHLORIDES = Chlorides;
    IMP_FREESULFURDIOXIDE = FREESULFURDIOXIDE;
    IMP_TotalSulfurDioxide = TotalSulfurDioxide;
    IMP_PH = pH;
    IMP_ACIDINDEX = ACIDINDEX;
    IMP_RESIDUALSUGAR = ResidualSugar;
    IMP_CITRICACID = CitricAcid;
    IMP_VOLATILEACIDITY = VolatileAcidity;
    IMP_FixedAcidity = FixedAcidity;

    *missing counts;
    M_STARS = 0;
    M_RESIDUALSUGAR = 0;
    M_CHLORIDES = 0;
    M_FREESULFURDIOXIDE = 0;
    M_TOTALSULFURDIOXIDE = 0;
    M_SULPHATES = 0;
    M_ALCOHOL = 0;
    M_PH = 0;

    if missing(STARS) then do;
        IMP_STARS = 2;
        M_STARS = 1;
    end;
    if missing(Density) then IMP_Density = 0.9942027;
    if missing(Sulphates) then do;
        IMP_Sulphates = 0.5271118;
        M_SULPHATES = 1;
  • 30.
    end;
    if missing(Alcohol) then do;
        IMP_Alcohol = 10.4892363;
        M_ALCOHOL = 1;
    end;
    if missing(pH) then do;
        IMP_pH = 4;
        M_pH = 1;
        *missing pH is set to 4;
    end;
    if missing(LabelAppeal) then IMP_LabelAppeal = 0;
    if missing(TotalSulfurDioxide) then IMP_TotalSulfurDioxide = 120.7142326;
    if missing(FreeSulfurDioxide) then IMP_FreeSulfurDioxide = 30.845;
    if missing(Chlorides) then IMP_Chlorides = 0.046;
    if IMP_Chlorides <= 0.01 then IMP_Chlorides = 0.01;

    *IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * sqrt( abs(IMP_TotalSulfurDioxide)+1 );
    *IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * log( abs(IMP_TotalSulfurDioxide)+1 );

    if IMP_TotalSulfurDioxide < 10 then IMP_TotalSulfurDioxide = 10;
    if IMP_TotalSulfurDioxide > 350 then IMP_TotalSulfurDioxide = 350;
    if IMP_FreeSulfurDioxide < 10 then IMP_FreeSulfurDioxide = 10;
    if IMP_FreeSulfurDioxide > 350 then IMP_FreeSulfurDioxide = 350;
    * more than 10 mg/l requires labeling, >350 mg/l is prohibited, limits based upon requirements;

    if IMP_PH < 3 then IMP_PH = 3;
    *a pH of 0.48 is high concentration acid that is unfit for human consumption;

    if IMP_Sulphates < 0 then IMP_SULPHATES = 0;
    if missing(ResidualSugar) then IMP_ResidualSugar = 3.9;
  • 31.
    *grouping of wines by http://winefolly.com/wp-content/uploads/2013/10/basic-wine-101-guide-infographic-poster.jpg#big
     light to heavy, which is a crude classification of white to red;
    if IMP_Alcohol < 10.5 then Alcohol_TYPE = 1;
    if IMP_Alcohol >= 10.5 then Alcohol_TYPE = 2;

    if IMP_LabelAppeal < 0 then Label_GROUP = 1;
    if IMP_LabelAppeal >= 0 then Label_GROUP = 2;

    if IMP_STARS < 2 then STAR_IMPACT = 0;
    if IMP_STARS >= 2 then STAR_IMPACT = 1;

    REAL_pH = 10**(-IMP_pH);
    density_adjusted = density - 1;
    IMPURITIES = IMP_Chlorides + IMP_sulphates + IMP_FreeSulfurDioxide + IMP_TotalSulfurDioxide;

    TARGET_FLAG = ( TARGET > 0 );
    TARGET_AMT = TARGET - 1;
    if TARGET_FLAG = 0 then TARGET_AMT = .;

    IMP_CHLORIDES_LOG = LOG10(IMP_CHLORIDES);

    TARGET_LOG = 0;
    IF TARGET > 0 then TARGET_LOG = 1;
    run;

    proc means data=work.wine_scrub n nmiss median mean min max stddev;
    run;

    proc reg data = work.wine_scrub;
    stepwise: model TARGET =
        IMP_STARS
        IMP_Density
        IMP_Sulphates
        IMP_Alcohol
        IMP_LabelAppeal
        IMP_CHLORIDES
        IMP_FREESULFURDIOXIDE
        IMP_TotalSulfurDioxide
        IMP_PH
        IMP_ACIDINDEX
  • 32.
        IMP_RESIDUALSUGAR
        IMP_CITRICACID
        IMP_VOLATILEACIDITY
        IMP_FixedAcidity
        Alcohol_Type
        STAR_IMPACT
        IMP_CHLORIDES_LOG
        LABEL_GROUP
        / selection = stepwise;
    run;

    *score the data with the fitted linear coefficients;
    data work.wine_scrub;
    set work.wine_scrub;
    TARGET_TEMP = 4.59507 +
        IMP_STARS * 1.34815 +
        IMP_Density * -1.06520 +
        IMP_Sulphates * -0.06317 +
        IMP_LabelAppeal * 0.53029 +
        IMP_FREESULFURDIOXIDE * 0.00069504 +
        IMP_TotalSulfurDioxide * 0.00076498 +
        IMP_PH * -0.12880 +
        IMP_ACIDINDEX * -0.29945 +
        IMP_CITRICACID * 0.03850 +
        IMP_VOLATILEACIDITY * -0.14508 +
        Alcohol_TYPE * 0.17647 +
        STAR_IMPACT * -1.44641 +
        IMP_CHLORIDES_LOG * -0.11579 +
        Label_GROUP * 0.08371
        ;
    *negative predictions are floored at zero;
    if target_temp < 0 then target_temp = 0;
    TARGET_ERROR = Target - TARGET_TEMP;
    target_error = round(target_error, 1);
    run;

    proc univariate data=work.wine_scrub noprint;
    histogram target_error / midpoints = -5 -4 -3 -2 -1 0 1 2 3 4 5;
    run;

    proc univariate data=work.wine_scrub;
    var target_temp;
    histogram / midpoints = 0 1 2 3 4 5 6 7 8;
    run;

    proc means data=work.wine_scrub n nmiss median mean min max stddev;
    run;
  • 33. Poisson Regression

    libname mydata '/folders/myfolders' access=readonly;

    proc contents data=mydata.wine;
    run;

    data work.wine_scrub;
    set mydata.wine;

    *cleaning up variables;
    TARGET_FLAG = ( TARGET > 0 );
    TARGET_AMT = TARGET - 1;
    if TARGET_FLAG = 0 then TARGET_AMT = .;

    *start from the raw values, missing ones are overwritten below;
    IMP_STARS = STARS;
    IMP_Density = Density;
    IMP_Sulphates = Sulphates;
    IMP_Alcohol = Alcohol;
    IMP_LabelAppeal = LabelAppeal;
    IMP_CHLORIDES = Chlorides;
    IMP_FREESULFURDIOXIDE = FREESULFURDIOXIDE;
    IMP_TotalSulfurDioxide = TotalSulfurDioxide;
    IMP_PH = pH;
    IMP_ACIDINDEX = ACIDINDEX;
    IMP_RESIDUALSUGAR = ResidualSugar;
    IMP_CITRICACID = CitricAcid;
    IMP_VOLATILEACIDITY = VolatileAcidity;
    IMP_FixedAcidity = FixedAcidity;

    *missing counts;
    M_STARS = 0;
    M_RESIDUALSUGAR = 0;
    M_CHLORIDES = 0;
    M_FREESULFURDIOXIDE = 0;
    M_TOTALSULFURDIOXIDE = 0;
    M_SULPHATES = 0;
    M_ALCOHOL = 0;
    M_PH = 0;

    if missing(STARS) then do;
        IMP_STARS = 2;
        M_STARS = 1;
    end;
    if missing(Density) then IMP_Density = 0.9942027;
    if missing(Sulphates) then do;
        IMP_Sulphates = 0.5271118;
  • 34.
        M_SULPHATES = 1;
    end;
    if missing(Alcohol) then do;
        IMP_Alcohol = 10.4892363;
        M_ALCOHOL = 1;
    end;
    if missing(pH) then do;
        IMP_pH = 4;
        M_pH = 1;
        *missing pH is set to 4;
    end;
    if missing(LabelAppeal) then IMP_LabelAppeal = 0;
    if missing(TotalSulfurDioxide) then IMP_TotalSulfurDioxide = 120.7142326;
    if missing(FreeSulfurDioxide) then IMP_FreeSulfurDioxide = 30.845;
    if missing(Chlorides) then IMP_Chlorides = 0.046;
    if IMP_Chlorides <= 0.01 then IMP_Chlorides = 0.01;

    *IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * sqrt( abs(IMP_TotalSulfurDioxide)+1 );
    *IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * log( abs(IMP_TotalSulfurDioxide)+1 );

    if IMP_TotalSulfurDioxide < 10 then IMP_TotalSulfurDioxide = 10;
    if IMP_TotalSulfurDioxide > 350 then IMP_TotalSulfurDioxide = 350;
    if IMP_FreeSulfurDioxide < 10 then IMP_FreeSulfurDioxide = 10;
    if IMP_FreeSulfurDioxide > 350 then IMP_FreeSulfurDioxide = 350;
    * more than 10 mg/l requires labeling, >350 mg/l is prohibited, limits based upon requirements;

    if IMP_PH < 3 then IMP_PH = 3;
    *a pH of 0.48 is high concentration acid that is unfit for human consumption;

    if IMP_Sulphates < 0 then IMP_SULPHATES = 0;
    if missing(ResidualSugar) then IMP_ResidualSugar = 3.9;
  • 35.
    *grouping of wines by http://winefolly.com/wp-content/uploads/2013/10/basic-wine-101-guide-infographic-poster.jpg#big
     light to heavy, which is a crude classification of white to red;
    if IMP_Alcohol < 10.5 then Alcohol_TYPE = 1;
    if IMP_Alcohol >= 10.5 then Alcohol_TYPE = 2;

    if IMP_LabelAppeal < 0 then Label_GROUP = 1;
    if IMP_LabelAppeal >= 0 then Label_GROUP = 2;

    if IMP_STARS < 2 then STAR_IMPACT = 0;
    if IMP_STARS >= 2 then STAR_IMPACT = 1;

    REAL_pH = 10**(-IMP_pH);
    density_adjusted = density - 1;
    IMPURITIES = IMP_Chlorides + IMP_sulphates + IMP_FreeSulfurDioxide + IMP_TotalSulfurDioxide;

    TARGET_FLAG = ( TARGET > 0 );
    TARGET_AMT = TARGET - 1;
    if TARGET_FLAG = 0 then TARGET_AMT = .;

    IMP_CHLORIDES_LOG = LOG10(IMP_CHLORIDES);

    TARGET_LOG = 0;
    IF TARGET > 0 then TARGET_LOG = 1;
    run;

    proc means data=work.wine_scrub n nmiss median mean min max stddev;
    run;

    proc genmod data = work.wine_scrub;
    stepwise: model TARGET =
        IMP_STARS
        IMP_Density
        IMP_Sulphates
        IMP_Alcohol
        IMP_LabelAppeal
        IMP_CHLORIDES
        IMP_FREESULFURDIOXIDE
        IMP_TotalSulfurDioxide
        REAL_PH
        IMP_ACIDINDEX
        IMP_RESIDUALSUGAR
        IMP_CITRICACID
        IMP_VOLATILEACIDITY
  • 36.
        IMP_FixedAcidity
        Alcohol_Type
        STAR_IMPACT
        IMP_CHLORIDES_LOG
        LABEL_GROUP
        / link=log dist=poi;
    output out=work.wine_scrub_poi_out p=y_poi;
    run;

    proc genmod data = work.wine_scrub;
    stepwise: model TARGET =
        IMP_STARS
        IMP_Density
        IMP_Sulphates
        IMP_Alcohol
        IMP_LabelAppeal
        IMP_CHLORIDES
        IMP_FREESULFURDIOXIDE
        / link=log dist=poi;
    output out=work.wine_scrub_poi_outx p=y_poi;
    run;

    *score the data with the fitted Poisson coefficients (linear predictor first);
    data work.wine_scrub;
    set work.wine_scrub;
    P_SCORE_TEMP = 1.5004 +
        IMP_STARS * 0.3348 +
        IMP_Density * -0.3517 +
        IMP_Sulphates * -0.0233 +
        IMP_Alcohol * -0.0015 +
        IMP_LabelAppeal * 0.1526 +
        IMP_CHLORIDES * 0.0354 +
        IMP_FREESULFURDIOXIDE * 0.0002 +
        IMP_TotalSulfurDioxide * 0.0002 +
        REAL_PH * 89.6532 +
        IMP_ACIDINDEX * -0.1173 +
        IMP_RESIDUALSUGAR * 0.0002 +
        IMP_CITRICACID * 0.0129 +
        IMP_VOLATILEACIDITY * -0.0476 +
        IMP_FixedAcidity * -0.0005 +
        Alcohol_TYPE * 0.0633 +
        STAR_IMPACT * -0.3615 +
  • 37.
        IMP_CHLORIDES_LOG * -0.0496 +
        Label_GROUP * 0.1108
        ;
    P_SCORE_POISSON = exp(P_SCORE_TEMP);
    P_SCORE_POISSON = round(P_SCORE_POISSON, 1);
    if P_SCORE_POISSON > 8 then P_SCORE_POISSON = 8;
    POISSON_ERROR = TARGET - P_SCORE_POISSON;
    run;

    proc univariate data=work.wine_scrub noprint;
    histogram poisson_error / midpoints = -5 -4 -3 -2 -1 0 1 2 3 4 5;
    run;

    proc means data=work.wine_scrub n nmiss median mean min max stddev;
    run;
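The scoring data step on slides 36 and 37 can be summarized as a short formula. This is only a sketch of what that code computes: the symbols \eta, \beta_j and x_j are notation introduced here, and the two values of \eta in the worked lines are hypothetical, chosen to show the rounding and the cap.

```latex
% Linear predictor built from the hard-coded coefficients (intercept 1.5004)
\eta = \beta_0 + \sum_{j} \beta_j x_j , \qquad \beta_0 = 1.5004

% Log link inverted, rounded to the nearest whole case, capped at 8
\hat{y} = \min\bigl(\operatorname{round}(e^{\eta}),\; 8\bigr)

% Hypothetical worked values
\eta = 1.2:\quad e^{1.2} \approx 3.32 \;\Rightarrow\; \hat{y} = 3
\eta = 2.3:\quad e^{2.3} \approx 9.97 \;\Rightarrow\; \operatorname{round} = 10 \;\Rightarrow\; \hat{y} = 8

% Error plotted in the histogram
\text{POISSON\_ERROR} = \text{TARGET} - \hat{y}
```

Because the prediction is rounded before the error is taken, the error is always an integer, which is why the histogram midpoints run over whole numbers.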
  • 38. Negative Binomial Regression
    libname mydata '/folders/myfolders' access=readonly;

    proc contents data=mydata.wine;
    run;

    data work.wine_scrub;
    set mydata.wine;
    * cleaning up variables;
    TARGET_FLAG = ( TARGET > 0 );
    TARGET_AMT = TARGET - 1;
    if TARGET_FLAG = 0 then TARGET_AMT = .;
    IMP_STARS = STARS;
    IMP_Density = Density;
    IMP_Sulphates = Sulphates;
    IMP_Alcohol = Alcohol;
    IMP_LabelAppeal = LabelAppeal;
    IMP_CHLORIDES = Chlorides;
    IMP_FREESULFURDIOXIDE = FREESULFURDIOXIDE;
    IMP_TotalSulfurDioxide = TotalSulfurDioxide;
    IMP_PH = pH;
    IMP_ACIDINDEX = ACIDINDEX;
    IMP_RESIDUALSUGAR = ResidualSugar;
    IMP_CITRICACID = CitricAcid;
    IMP_VOLATILEACIDITY = VolatileAcidity;
    IMP_FixedAcidity = FixedAcidity;
    * missing counts;
    M_STARS = 0;
    M_RESIDUALSUGAR = 0;
    M_CHLORIDES = 0;
    M_FRESSULFURDIOXIDE = 0;
    M_TOTALSULFURDIOXIDE = 0;
    M_SULPHATES = 0;
    M_ALCOHOL = 0;
    if missing(STARS) then do;
        IMP_STARS = 2;
        M_STARS = 1;
    end;
    if missing(Density) then IMP_Density = 0.9942027;
    if missing(Sulphates) then do;
        IMP_Sulphates = 0.5271118;
  • 39.
        M_SULPHATES = 1;
    end;
    if missing(Alcohol) then do;
        IMP_Alcohol = 10.4892363;
        M_ALCOHOL = 1;
    end;
    if missing(pH) then do;
        IMP_pH = 4;
        M_pH = 1; * typical wine pH is now 4;
    end;
    if missing(LabelAppeal) then IMP_LabelAppeal = 0;
    * impute into the IMP_ variables that are capped below and used in the models;
    if missing(TotalSulfurDioxide) then IMP_TotalSulfurDioxide = 120.7142326;
    if missing(FreeSulfurDioxide) then IMP_FreeSulfurDioxide = 30.845;
    if missing(Chlorides) then IMP_Chlorides = 0.046;
    if IMP_Chlorides <= 0.01 then IMP_Chlorides = 0.01;
    *IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * sqrt( abs(IMP_TotalSulfurDioxide)+1 );
    *IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * log( abs(IMP_TotalSulfurDioxide)+1 );
    if IMP_TotalSulfurDioxide < 10 then IMP_TotalSulfurDioxide = 10;
    if IMP_TotalSulfurDioxide > 350 then IMP_TotalSulfurDioxide = 350;
    if IMP_FreeSulfurDioxide < 10 then IMP_FreeSulfurDioxide = 10;
    if IMP_FreeSulfurDioxide > 350 then IMP_FreeSulfurDioxide = 350;
    * more than 10 mg/l requires labeling, >350 mg/l is prohibited, limits based upon requirements;
    if IMP_PH < 3 then IMP_PH = 3;
    * a pH of 0.48 is high concentration acid that is unfit for human consumption;
    if IMP_Sulphates < 0 then IMP_SULPHATES = 0;
    if missing(ResidualSugar) then IMP_ResidualSugar = 3.9;
    * grouping of wines by http://winefolly.com/wp-content/uploads/2013/10/basic-wine-101-guide-infographic-poster.jpg#big
  • 40.
      light to heavy, which is a crude classification of white to red;
    if IMP_Alcohol < 10.5 then Alcohol_TYPE = 1;
    if IMP_Alcohol >= 10.5 then Alcohol_TYPE = 2;
    if IMP_LabelAppeal < 0 then Label_GROUP = 1;
    if IMP_LabelAppeal >= 0 then Label_GROUP = 2;
    if IMP_STARS < 2 then STAR_IMPACT = 0;
    if IMP_STARS >= 2 then STAR_IMPACT = 1;
    REAL_pH = 10**(-IMP_pH);
    density_adjusted = density - 1;
    IMPURITIES = IMP_Chlorides + IMP_sulphates + IMP_FreeSulfurDioxide + IMP_TotalSulfurDioxide;
    EXPERT_OPINION = (STAR_IMPACT**2) + (LABEL_GROUP**2);
    TARGET_FLAG = ( TARGET > 0 );
    TARGET_AMT = TARGET - 1;
    if TARGET_FLAG = 0 then TARGET_AMT = .;
    IMP_CHLORIDES_LOG = LOG10(IMP_CHLORIDES);
    TARGET_LOG = 0;
    if TARGET > 0 then TARGET_LOG = 1;
    run;

    proc means data=work.wine_scrub n nmiss median mean min max stddev;
    run;

    proc genmod data = work.wine_scrub;
    model TARGET =
        IMP_STARS
        IMP_Density
        IMP_Sulphates
        IMP_Alcohol
        IMP_LabelAppeal
        IMP_CHLORIDES
        IMP_FREESULFURDIOXIDE
        IMP_TotalSulfurDioxide
        REAL_PH
        IMP_ACIDINDEX
        IMP_RESIDUALSUGAR
        IMP_CITRICACID
        IMP_VOLATILEACIDITY
  • 41.
        IMP_FixedAcidity
        Alcohol_Type
        STAR_IMPACT
        IMP_CHLORIDES_LOG
        LABEL_GROUP
        /link=log dist=nb;
    output out=work.wine_scrub_negbin_out p=y_nb;
    run;

    proc genmod data = work.wine_scrub;
    model TARGET =
        IMP_STARS
        IMP_Density
        IMP_Sulphates
        IMP_Alcohol
        EXPERT_OPINION
        IMP_CHLORIDES
        IMP_FREESULFURDIOXIDE
        IMP_TotalSulfurDioxide
        REAL_PH
        IMP_ACIDINDEX
        IMP_RESIDUALSUGAR
        IMP_CITRICACID
        IMP_VOLATILEACIDITY
        IMP_FixedAcidity
        Alcohol_Type
        STAR_IMPACT
        IMP_CHLORIDES_LOG
        /link=log dist=nb;
    output out=work.wine_scrub_negbin_out p=y_nb;
    run;

    data work.wine_scrub;
    set work.wine_scrub;
    P_SCORE_TEMP = 1.5004 +
        IMP_STARS * 0.3348 +
  • 42.
        IMP_Density * -0.3517 +
        IMP_Sulphates * -0.0233 +
        IMP_Alcohol * -0.0015 +
        IMP_LabelAppeal * 0.1526 +
        IMP_CHLORIDES * 0.0354 +
        IMP_FREESULFURDIOXIDE * 0.0002 +
        IMP_TotalSulfurDioxide * 0.0002 +
        REAL_PH * 89.6532 +
        IMP_ACIDINDEX * -0.1173 +
        IMP_RESIDUALSUGAR * 0.0002 +
        IMP_CITRICACID * 0.0129 +
        IMP_VOLATILEACIDITY * -0.0476 +
        IMP_FixedAcidity * -0.0005 +
        Alcohol_TYPE * 0.0633 +
        STAR_IMPACT * -0.3615 +
        IMP_CHLORIDES_LOG * -0.0496 +
        Label_GROUP * 0.1108
        ;
    P_NEGBIN = exp(P_SCORE_TEMP);
    P_NEGBIN = round(P_NEGBIN, 1);
    if P_NEGBIN > 8 then P_NEGBIN = 8;
    NEGBIN_ERROR = TARGET - P_NEGBIN;
    run;

    proc univariate data=work.wine_scrub noprint;
    histogram NEGBIN_ERROR / midpoints = -5 -4 -3 -2 -1 0 1 2 3 4 5;
    run;

    proc univariate data=work.wine_scrub noprint;
    histogram P_NEGBIN / midpoints = 0 1 2 3 4 5 6 7 8;
    run;

    proc means data=work.wine_scrub n nmiss median mean min max stddev;
    run;
  • 43. Zero Inflated Poisson
    libname mydata '/folders/myfolders' access=readonly;

    proc contents data=mydata.wine;
    run;

    data work.wine_scrub;
    set mydata.wine;
    * cleaning up variables;
    TARGET_FLAG = ( TARGET > 0 );
    TARGET_AMT = TARGET - 1;
    if TARGET_FLAG = 0 then TARGET_AMT = .;
    IMP_STARS = STARS;
    IMP_Density = Density;
    IMP_Sulphates = Sulphates;
    IMP_Alcohol = Alcohol;
    IMP_LabelAppeal = LabelAppeal;
    IMP_CHLORIDES = Chlorides;
    IMP_FREESULFURDIOXIDE = FREESULFURDIOXIDE;
    IMP_TotalSulfurDioxide = TotalSulfurDioxide;
    IMP_PH = pH;
    IMP_ACIDINDEX = ACIDINDEX;
    IMP_RESIDUALSUGAR = ResidualSugar;
    IMP_CITRICACID = CitricAcid;
    IMP_VOLATILEACIDITY = VolatileAcidity;
    IMP_FixedAcidity = FixedAcidity;
    * missing counts;
    M_STARS = 0;
    M_RESIDUALSUGAR = 0;
    M_CHLORIDES = 0;
    M_FRESSULFURDIOXIDE = 0;
    M_TOTALSULFURDIOXIDE = 0;
    M_SULPHATES = 0;
    M_ALCOHOL = 0;
    if missing(STARS) then do;
        IMP_STARS = 2;
        M_STARS = 1;
    end;
    if missing(Density) then IMP_Density = 0.9942027;
    if missing(Sulphates) then do;
        IMP_Sulphates = 0.5271118;
  • 44.
        M_SULPHATES = 1;
    end;
    if missing(Alcohol) then do;
        IMP_Alcohol = 10.4892363;
        M_ALCOHOL = 1;
    end;
    if missing(pH) then do;
        IMP_pH = 4;
        M_pH = 1; * typical wine pH is now 4;
    end;
    if missing(STARS) then do;
        IMP_STARS = 2;
        M_STARS = 1;
    end;
    if missing(LabelAppeal) then IMP_LabelAppeal = 0;
    * impute into the IMP_ variables that are capped below and used in the models;
    if missing(TotalSulfurDioxide) then IMP_TotalSulfurDioxide = 120.7142326;
    if missing(FreeSulfurDioxide) then IMP_FreeSulfurDioxide = 30.845;
    if missing(Chlorides) then IMP_Chlorides = 0.046;
    if IMP_Chlorides <= 0.01 then IMP_Chlorides = 0.01;
    *IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * sqrt( abs(IMP_TotalSulfurDioxide)+1 );
    *IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * log( abs(IMP_TotalSulfurDioxide)+1 );
    if IMP_TotalSulfurDioxide < 10 then IMP_TotalSulfurDioxide = 10;
    if IMP_TotalSulfurDioxide > 350 then IMP_TotalSulfurDioxide = 350;
    if IMP_FreeSulfurDioxide < 10 then IMP_FreeSulfurDioxide = 10;
    if IMP_FreeSulfurDioxide > 350 then IMP_FreeSulfurDioxide = 350;
    * more than 10 mg/l requires labeling, >350 mg/l is prohibited, limits based upon requirements;
    if IMP_PH < 3 then IMP_PH = 3;
    * a pH of 0.48 is high concentration acid that is unfit for human consumption;
    if IMP_Sulphates < 0 then IMP_SULPHATES = 0;
    if missing(ResidualSugar) then IMP_ResidualSugar = 3.9;
    if IMP_Alcohol < 9 then IMP_Alcohol = 9.0;
  • 45.
    if IMP_ResidualSugar < 1 then IMP_ResidualSugar = 1;
    if IMP_CITRICACID < 0 then IMP_CITRICACID = 0;
    if IMP_VOLATILEACIDITY < 0 then IMP_VOLATILEACIDITY = 0;
    if IMP_FixedAcidity < 0 then IMP_FixedAcidity = 0;
    * grouping of wines by http://winefolly.com/wp-content/uploads/2013/10/basic-wine-101-guide-infographic-poster.jpg#big
      light to heavy, which is a crude classification of white to red;
    if IMP_Alcohol < 10.5 then Alcohol_TYPE = 1;
    if IMP_Alcohol >= 10.5 then Alcohol_TYPE = 2;
    if IMP_LabelAppeal < 0 then Label_GROUP = 1;
    if IMP_LabelAppeal >= 0 then Label_GROUP = 2;
    if IMP_STARS < 2 then STAR_IMPACT = 0;
    if IMP_STARS >= 2 then STAR_IMPACT = 1;
    ALCOHOL_EMP = ALCOHOL_TYPE**2;
    STAR_EMP = STAR_IMPACT**2;
    EXPERT_INFLUENCE = ALCOHOL_TYPE + STAR_IMPACT + LABEL_GROUP;
    REAL_pH = 10**(-IMP_pH);
    density_adjusted = density - 1;
    IMPURITIES = IMP_Chlorides + IMP_sulphates + IMP_FreeSulfurDioxide + IMP_TotalSulfurDioxide;
    TARGET_FLAG = ( TARGET > 0 );
    TARGET_AMT = TARGET - 1;
    if TARGET_FLAG = 0 then TARGET_AMT = .;
    IMP_CHLORIDES_LOG = LOG10(IMP_CHLORIDES);
    TARGET_LOG = 0;
    if TARGET > 0 then TARGET_LOG = 1;
    run;

    proc means data=work.wine_scrub n nmiss median mean min max stddev;
    run;
  • 46.
    proc genmod data = work.wine_scrub;
    model TARGET =
        IMP_STARS
        IMP_Density
        IMP_Sulphates
        IMP_Alcohol
        STAR_IMPACT
        IMP_CHLORIDES
        IMP_FREESULFURDIOXIDE
        IMP_TotalSulfurDioxide
        IMP_ACIDINDEX
        IMP_LabelAppeal
        IMP_CITRICACID
        IMP_VOLATILEACIDITY
        IMP_FixedAcidity
        REAL_pH
        ALCOHOL_TYPE
        /link=log dist=zip;
    zeromodel
        IMP_STARS
        M_STARS
        M_SULPHATES
        IMP_LabelAppeal
        IMP_CHLORIDES_LOG
        IMP_TotalSulfurDioxide
        IMP_ACIDINDEX
        IMP_CITRICACID
        IMP_VOLATILEACIDITY
        REAL_pH
        IMP_Alcohol
        ALCOHOL_TYPE
        STAR_IMPACT
        /link=logit;
    output out=work.winezip0526 pred=p_target_zip pzero=p_zero_zip;
    run;
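For readers less familiar with the two-part structure fitted on slide 46, the standard zero-inflated Poisson distribution is sketched below. The symbols \pi (zero-inflation probability, modeled by the logit-link zeromodel), \lambda (Poisson mean, modeled by the log-link model), x, z, \beta and \gamma are notation introduced here and do not appear in the original report.

```latex
% Two linked regressions
\log \lambda = x^{\top}\beta , \qquad
\operatorname{logit}(\pi) = \log\frac{\pi}{1-\pi} = z^{\top}\gamma

% Mixture of a point mass at zero and a Poisson count
P(Y = 0) = \pi + (1 - \pi)\, e^{-\lambda}
P(Y = k) = (1 - \pi)\, \frac{e^{-\lambda}\lambda^{k}}{k!} , \qquad k = 1, 2, \dots

% Mean of the mixture
E[Y] = (1 - \pi)\,\lambda
```

How the output variables p_target_zip and p_zero_zip map onto \lambda, \pi and E[Y] should be checked against the PROC GENMOD documentation before they are combined into a single predicted count.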
  • 47. Zero Inflated Negative Binomial
    libname mydata '/folders/myfolders' access=readonly;

    proc contents data=mydata.wine;
    run;

    data work.wine_scrub;
    set mydata.wine;
    * cleaning up variables;
    TARGET_FLAG = ( TARGET > 0 );
    TARGET_AMT = TARGET - 1;
    if TARGET_FLAG = 0 then TARGET_AMT = .;
    IMP_STARS = STARS;
    IMP_Density = Density;
    IMP_Sulphates = Sulphates;
    IMP_Alcohol = Alcohol;
    IMP_LabelAppeal = LabelAppeal;
    IMP_CHLORIDES = Chlorides;
    IMP_FREESULFURDIOXIDE = FREESULFURDIOXIDE;
    IMP_TotalSulfurDioxide = TotalSulfurDioxide;
    IMP_PH = pH;
    IMP_ACIDINDEX = ACIDINDEX;
    IMP_RESIDUALSUGAR = ResidualSugar;
    IMP_CITRICACID = CitricAcid;
    IMP_VOLATILEACIDITY = VolatileAcidity;
    IMP_FixedAcidity = FixedAcidity;
    * missing counts;
    M_STARS = 0;
    M_RESIDUALSUGAR = 0;
    M_CHLORIDES = 0;
    M_FRESSULFURDIOXIDE = 0;
    M_TOTALSULFURDIOXIDE = 0;
    M_SULPHATES = 0;
    M_ALCOHOL = 0;
    if missing(STARS) then do;
        IMP_STARS = 2;
        M_STARS = 1;
    end;
    if missing(Density) then IMP_Density = 0.9942027;
    if missing(Sulphates) then do;
        IMP_Sulphates = 0.5271118;
        M_SULPHATES = 1;
  • 48.
    end;
    if missing(Alcohol) then do;
        IMP_Alcohol = 10.4892363;
        M_ALCOHOL = 1;
    end;
    if missing(pH) then do;
        IMP_pH = 4;
        M_pH = 1; * typical wine pH is now 4;
    end;
    if missing(LabelAppeal) then do;
        IMP_LabelAppeal = 0;
        M_LabelAppeal = 1;
    end;
    * impute into the IMP_ variables that are capped below and used in the models;
    if missing(TotalSulfurDioxide) then IMP_TotalSulfurDioxide = 120.7142326;
    if missing(FreeSulfurDioxide) then IMP_FreeSulfurDioxide = 30.845;
    if missing(Chlorides) then IMP_Chlorides = 0.046;
    if IMP_Chlorides <= 0.01 then IMP_Chlorides = 0.01;
    *IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * sqrt( abs(IMP_TotalSulfurDioxide)+1 );
    *IMP_TotalSulfurDioxide = sign( IMP_TotalSulfurDioxide ) * log( abs(IMP_TotalSulfurDioxide)+1 );
    if IMP_TotalSulfurDioxide < 10 then IMP_TotalSulfurDioxide = 10;
    if IMP_TotalSulfurDioxide > 350 then IMP_TotalSulfurDioxide = 350;
    if IMP_FreeSulfurDioxide < 10 then IMP_FreeSulfurDioxide = 10;
    if IMP_FreeSulfurDioxide > 350 then IMP_FreeSulfurDioxide = 350;
    * more than 10 mg/l requires labeling, >350 mg/l is prohibited, limits based upon requirements;
    if IMP_PH < 3 then IMP_PH = 3;
    * a pH of 0.48 is high concentration acid that is unfit for human consumption;
    if IMP_Sulphates < 0 then IMP_SULPHATES = 0;
    if missing(ResidualSugar) then IMP_ResidualSugar = 3.9;
    if IMP_Alcohol < 9 then IMP_Alcohol = 9.0;
    if IMP_ResidualSugar < 1 then IMP_ResidualSugar = 1;
  • 49.
    if IMP_CITRICACID < 0 then IMP_CITRICACID = 0;
    if IMP_VOLATILEACIDITY < 0 then IMP_VOLATILEACIDITY = 0;
    if IMP_FixedAcidity < 0 then IMP_FixedAcidity = 0;
    * grouping of wines by http://winefolly.com/wp-content/uploads/2013/10/basic-wine-101-guide-infographic-poster.jpg#big
      light to heavy, which is a crude classification of white to red;
    if IMP_Alcohol < 10.5 then Alcohol_TYPE = 1;
    if IMP_Alcohol >= 10.5 then Alcohol_TYPE = 2;
    if IMP_LabelAppeal < 0 then Label_GROUP = 1;
    if IMP_LabelAppeal >= 0 then Label_GROUP = 2;
    if IMP_STARS < 2 then STAR_IMPACT = 0;
    if IMP_STARS >= 2 then STAR_IMPACT = 1;
    ALCOHOL_EMP = ALCOHOL_TYPE**2;
    STAR_EMP = STAR_IMPACT**2;
    EXPERT_INFLUENCE = ALCOHOL_TYPE + STAR_IMPACT + LABEL_GROUP;
    REAL_pH = 10**(-IMP_pH);
    density_adjusted = density - 1;
    IMPURITIES = IMP_Chlorides + IMP_sulphates + IMP_FreeSulfurDioxide + IMP_TotalSulfurDioxide;
    TARGET_FLAG = ( TARGET > 0 );
    TARGET_AMT = TARGET - 1;
    if TARGET_FLAG = 0 then TARGET_AMT = .;
    IMP_CHLORIDES_LOG = LOG10(IMP_CHLORIDES);
    TARGET_LOG = 0;
    if TARGET > 0 then TARGET_LOG = 1;
    run;

    proc means data=work.wine_scrub n nmiss median mean min max stddev;
    run;

    proc genmod data = work.wine_scrub;
    stepwise: model TARGET =
        IMP_STARS
        IMP_Density
        IMP_Sulphates
        IMP_Alcohol
  • 50.
        IMP_LabelAppeal
        IMP_CHLORIDES
        IMP_FREESULFURDIOXIDE
        IMP_TotalSulfurDioxide
        REAL_PH
        IMP_ACIDINDEX
        IMP_RESIDUALSUGAR
        IMP_CITRICACID
        IMP_VOLATILEACIDITY
        IMP_FixedAcidity
        Alcohol_Type
        STAR_IMPACT
        IMP_CHLORIDES_LOG
        LABEL_GROUP
        /link=log dist=zinb;
    zeromodel
        IMP_STARS
        IMP_LabelAppeal
        IMP_CHLORIDES_LOG
        M_STARS
        M_SULPHATES
        IMP_TotalSulfurDioxide
        IMP_ACIDINDEX
        IMP_CITRICACID
        IMP_VOLATILEACIDITY
        REAL_pH
        IMP_Alcohol
        ALCOHOL_TYPE
        STAR_IMPACT
        /link=logit;
    output out=work.wine_scrub_zinb_out p=y_zinb;
    run;

    proc genmod data = work.wine_scrub;
    stepwise: model TARGET =
        IMP_STARS
        IMP_Density
        IMP_Sulphates
        IMP_Alcohol
        IMP_LabelAppeal
        IMP_CHLORIDES
        IMP_FREESULFURDIOXIDE
  • 51.
        /link=log dist=zinb;
    zeromodel
        IMP_STARS
        IMP_LabelAppeal
        IMP_CHLORIDES_LOG
        IMP_TotalSulfurDioxide
        IMP_ACIDINDEX
        IMP_CITRICACID
        IMP_VOLATILEACIDITY
        REAL_pH
        IMP_Alcohol
        ALCOHOL_TYPE
        STAR_IMPACT
        /link=logit;
    output out=work.wine_scrub_zinb_outx p=y_zinb;
    run;