Assignment - 03
Model Building, Selection, & Prediction
Question 1:
1. Predicting the Output Variable Y – Energy Production Prediction
a) Importing the data from a CSV file and splitting it into training and test data:
Using the read.csv() function, we can import the data into R.
INPUT:
OUTPUT:
INPUT:
OUTPUT:
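The import-and-split step described above can be sketched as follows. This is a minimal illustration, not the assignment's actual code: the data frame `energy`, its column names, and the 70/30 split ratio are all assumptions, and a temporary CSV stands in for the real data file.

```r
# Hypothetical stand-in for the assignment data (column names are assumed).
set.seed(42)
n <- 200
energy <- data.frame(
  Pressure = rnorm(n, mean = 1010, sd = 8),
  Wind     = rnorm(n, mean = 12, sd = 4),
  Temp     = rnorm(n, mean = 20, sd = 5)
)
energy$Y <- 0.5 * energy$Pressure + 0.3 * energy$Wind + rnorm(n, sd = 6)

# Round-trip through a temporary CSV to show the read.csv() call itself.
csv_path <- tempfile(fileext = ".csv")
write.csv(energy, csv_path, row.names = FALSE)
energy <- read.csv(csv_path)

# Random 70/30 split into training and test rows (ratio is an assumption).
train_idx <- sample(seq_len(nrow(energy)), size = round(0.7 * nrow(energy)))
train <- energy[train_idx, ]
test  <- energy[-train_idx, ]

dim(train)
dim(test)
```

Fixing the seed with set.seed() makes the split reproducible, so later model comparisons are made on the same rows.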
b) Fitting a Linear Regression Model:
Running the Linear Regression Model with all the Variables
INPUT:
OUTPUT:
The Adjusted R-Squared value is found to be 0.2366.
From the output, it can be seen that only Pressure and Wind are significant.
So, we rerun the model with only the Wind and Pressure variables.
Reduced Regression Model (Wind and Pressure variables only)
INPUT:
OUTPUT:
We remove the Wind variable, since the Adjusted R-Squared value is only 0.0229, and rerun the regression using only the Pressure variable.
Running the Regression model with only the Pressure Variable:
INPUT:
OUTPUT:
The Adjusted R-Squared value is found to be 0.219, which is less than that of the previous regression models.
An ANOVA test is conducted to compare the all-variable model, the reduced (Pressure and Wind) model, and the Pressure-only model.
INPUT:
OUTPUT:
Between the all-variable model and the reduced model, the p-value is 0.2578, so we fail to reject the null hypothesis that the extra variables add nothing, and we use the reduced model.
Between the Pressure-only model and the reduced model, the p-value is 0.0768, so at the 5% level we again fail to reject the null hypothesis and use the Pressure-only model.
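The nested-model comparison above can be sketched with anova(). The data here are simulated and the variable names are assumptions; the point is only to show how the p-value of the partial F-test drives the choice between a smaller and a larger model.

```r
# Simulated data (assumption): Y depends on Pressure and Wind, not on Noise.
set.seed(1)
n <- 150
d <- data.frame(Pressure = rnorm(n), Wind = rnorm(n), Noise = rnorm(n))
d$Y <- 2 * d$Pressure + 0.5 * d$Wind + rnorm(n)

full    <- lm(Y ~ Pressure + Wind + Noise, data = d)
reduced <- lm(Y ~ Pressure + Wind, data = d)

# Partial F-test. H0: the extra term in `full` has a zero coefficient.
cmp <- anova(reduced, full)
p_value <- cmp[["Pr(>F)"]][2]

# A large p-value means we fail to reject H0 and keep the smaller model.
decision <- if (p_value > 0.05) "keep reduced model" else "keep full model"
decision
```

The models passed to anova() must be nested (the smaller model's terms are a subset of the larger model's) and fitted to the same rows.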
Running Best Subset to find the model:
Best subset selection computes fit statistics for every combination of variables and prints them for comparison, which we can use to select the appropriate variables.
INPUT:
OUTPUT:
The RSS value decreases as the number of variables increases.
The model with 5 variables has the highest Adjusted R-Squared.
The model with 3 variables has the smallest AIC (or Cp).
The model with 8 variables has the smallest BIC.
Since the best subset approach does not single out one model, we check the predicted R-Squared and use the model with the highest R-Squared and the lowest RMSE.
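The criteria discussed above (RSS, Adjusted R-Squared, AIC, BIC) can be tabulated in base R. This sketch uses simulated data and a hand-picked list of nested candidate models rather than an exhaustive best-subset search, so it illustrates the comparison and is not the procedure used in the assignment.

```r
# Simulated data (assumption): only Pressure and Wind carry signal.
set.seed(2)
n <- 150
d <- data.frame(Pressure = rnorm(n), Wind = rnorm(n), Temp = rnorm(n))
d$Y <- 2 * d$Pressure + 0.5 * d$Wind + rnorm(n)

# Hand-picked nested candidates (not an exhaustive subset search).
candidates <- list(
  pressure      = Y ~ Pressure,
  pressure_wind = Y ~ Pressure + Wind,
  all           = Y ~ Pressure + Wind + Temp
)

# One row of fit statistics per candidate model.
criteria <- do.call(rbind, lapply(names(candidates), function(nm) {
  fit <- lm(candidates[[nm]], data = d)
  data.frame(
    model  = nm,
    rss    = sum(residuals(fit)^2),
    adj_r2 = summary(fit)$adj.r.squared,
    aic    = AIC(fit),
    bic    = BIC(fit)
  )
}))
criteria
```

RSS can only go down as terms are added, which is why it cannot be used alone to choose a model; Adjusted R-Squared, AIC, and BIC each penalize model size differently and may disagree.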
R-Squared and RMSE of the predictions:
For the all-variable model:
INPUT:
OUTPUT:
For the Reduced Model with Pressure and Wind Variables:
INPUT:
OUTPUT:
Single model with Pressure as the only predictor:
INPUT:
OUTPUT:
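The prediction check described in this section can be sketched as below: fit on the training rows, predict the test rows, then compute RMSE and an out-of-sample R-Squared. The data are simulated and all names are assumptions.

```r
# Simulated data and split (assumptions).
set.seed(3)
n <- 200
d <- data.frame(Pressure = rnorm(n), Wind = rnorm(n))
d$Y <- 2 * d$Pressure + 0.5 * d$Wind + rnorm(n)
idx   <- sample(seq_len(n), size = 140)
train <- d[idx, ]
test  <- d[-idx, ]

# Fit on training data only, then predict the held-out rows.
fit  <- lm(Y ~ Pressure, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared prediction error on the test set.
rmse_test <- sqrt(mean((test$Y - pred)^2))

# Out-of-sample R^2: 1 - SSE / SST, with SST taken around the test-set mean.
r2_test <- 1 - sum((test$Y - pred)^2) / sum((test$Y - mean(test$Y))^2)

c(rmse = rmse_test, r2 = r2_test)
```

Repeating the last four steps for each candidate formula gives the numbers needed to rank the models by test RMSE and predicted R-Squared.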
Summary:
From the analysis we can conclude that the model with Pressure as the only predictor is better than the other models. Its Adjusted R-Squared value of 0.31 is the best, and its RMSE is also the lowest.
An Adjusted R-Squared of 0.31 means that the Pressure model explains about 31% of the variance in the energy production rate; it does not mean that 31% of the observations are predicted accurately.
c) Backward Selection Approach:
Regression Model using all the variables:
INPUT:
OUTPUT:
Conclusion:
The backward stepwise AIC function gives a slightly different result than the models generated above. However, when we create the regression model, we see a lower R2 value than for our single model.
Below, we can compare all the 3 models above with this step
model.
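A backward selection run of the kind described in part c) can be sketched with step() from base R's stats package. This is an assumption about tooling: the original may have used a different function, and the data below are simulated.

```r
# Simulated data (assumption): Temp and Humidity are pure noise.
set.seed(4)
n <- 150
d <- data.frame(
  Pressure = rnorm(n), Wind = rnorm(n),
  Temp = rnorm(n), Humidity = rnorm(n)
)
d$Y <- 2 * d$Pressure + 0.5 * d$Wind + rnorm(n)

# Start from the model with every predictor and drop terms while AIC improves.
full     <- lm(Y ~ ., data = d)
backward <- step(full, direction = "backward", trace = 0)

formula(backward)
c(full = AIC(full), backward = AIC(backward))
```

Because backward elimination only ever removes terms, it can end at a different model than best subset selection, which is consistent with the "slightly different result" noted in the conclusion.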
Final Project
ALY-6015 Week 6 Project
Intermediate Analytics
Submitted to: Ani Aghababyan
College of Professional Studies
Northeastern University, MA
Submitted by:
Vikrant Kakad
Vikas Warudkar
Sunita Mohapatra
Darshan Shah
Akshay kannan
Academic Term Spring 2018 - Quarter 2
Introduction
Wine making is affected by a series of variables. Several
variables, from alcohol content to pH, can affect the final
result, so it is crucial to understand how these variables
impact the quality of red wine. The scope of this project is
to understand the effect of the various attributes that influence
the quality of red wine. The data set used for the analysis was
downloaded from the UCI repository. The analysis places additional
focus on the following key parameters:
pH value - pH is considered a key parameter in determining the
quality of wine, so the analysis focuses on the impact of pH on
the final quality score.
SO2 values (free and total) - SO2 has always been a debated
topic because of the allergic reactions associated with it. The
analysis tries to determine the impact of SO2 on pH and on the
final quality score of the wine samples.
Alcohol content - Alcohol content is an important parameter for
a buyer of any alcoholic product. The analysis tries to uncover
the relationship of alcohol content with pH and SO2 content, and
its impact on quality.
In this project, we analyzed the red wine data to understand
which variables are responsible for the quality of the wine.
First, we got a feel for the variables on their own; then we
examined the correlations between them and wine quality, with
other factors included. Finally, we created a linear model to
predict the outcome for a test data set.
We propose a supervised learning approach to predict human wine
taste preferences, based on analytical tests that are easily
available at the certification step. A large dataset (compared
with other studies in this domain) is considered, with red Vinho
Verde samples from Portugal (CVRVV, 2008). Three regression
techniques were applied, under a computationally efficient
procedure that performs simultaneous variable and model
selection. The support vector machine achieved promising
results, outperforming the multiple regression and neural
network methods. Such a model is useful to support the
oenologist's wine tasting evaluations and to improve wine
production. Furthermore, similar techniques can help in target
marketing by modeling consumer tastes from niche markets.
Research Question
By performing this analysis, we seek to answer the following
questions:
1. How is the quality of the wines tasted?
2. What is the minimum set of properties and their values that
defines a high-quality wine?
3. What are considered wine defects?
About dataset
· Name: Red Wine Quality Data Set
· Source: created by Paulo Cortez (Univ. Minho), Antonio
Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis
(CVRVV, 2009)
· Input variables:
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
· Output variable: quality (score between 0 and 10)
· Data Set Characteristics: Multivariate
· Number of Observations: 1599
· Number of Attributes: 12
· Missing Values: N/A
Description of attributes:
1. Fixed acidity: Most acids involved with wine are fixed or
nonvolatile (do not evaporate readily)
2. Volatile acidity: The amount of acetic acid in wine, which at
too high a level can lead to an unpleasant, vinegar taste
3. Citric acid: Found in small quantities, citric acid can add
'freshness' and flavor to wines
4. Residual sugar: The amount of sugar remaining after
fermentation stops, it's rare to find wines with less than 1
gram/liter and wines with greater than 45 grams/liter are
considered sweet
5. Chlorides: The amount of salt in the wine
6. Free sulfur dioxide: The free form of SO2 exists in
equilibrium between molecular SO2 (as a dissolved gas) and
bisulfite ion; it prevents microbial growth and the oxidation of
wine
7. Total sulfur dioxide: Amount of free and bound forms of SO2;
in low concentrations, SO2 is mostly undetectable in wine, but
at free SO2 concentrations over 50 ppm, SO2 becomes evident
in the nose and taste of wine
8. Density: The density of wine is close to that of water,
depending on the percent alcohol and sugar content
9. pH: Describes how acidic or basic a wine is on a scale from 0
(very acidic) to 14 (very basic); most wines are between 3-4 on
the pH scale
10. Sulphates: A wine additive which can contribute to sulfur
dioxide gas (SO2) levels, which acts as an antimicrobial and
antioxidant
11. Alcohol: the percent alcohol content of the wine
12. Quality: output variable (based on sensory data, score
between 0 and 10)
The chosen dataset has the attributes listed above, and we found
that it gave better results for detecting quality after testing.
The data types of these attributes are as follows.
As described before, there are 1599 observations (rows) of 12
variables (columns). Quality is an ordered, categorical,
discrete variable whose observed values range from 3 to 8.
A statistical description of the above dataset would provide a
more coherent picture as to how the numerical values are
distributed across the dataset (Range, Quartiles, Central
Tendencies, etc). They are as follows:
The overall summary of the dataset covers all the above
information, and presents the data in a concise & lucid way.
They can be shown as follows:
Methods chosen:
Univariate Plot Analysis:
A univariate plot shows the data and summarizes its
distribution. A dot plot, also known as a strip plot, shows the
individual observations. A box plot shows the five-
number summary of the data – the minimum, first quartile,
median, third quartile, and maximum.
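The five-number summary behind a box plot can be computed directly with fivenum(). The vector below is a small made-up sample, chosen only to make the summary easy to check by hand.

```r
# Hypothetical sample of seven pH-like measurements.
x <- c(3.0, 3.2, 3.3, 3.4, 3.5, 3.6, 4.0)

# Minimum, lower hinge, median, upper hinge, maximum.
fn <- fivenum(x)
names(fn) <- c("min", "q1", "median", "q3", "max")
fn

# boxplot.stats() reports the same five numbers (no outliers in this sample).
bs <- boxplot.stats(x)$stats
bs
```

Calling boxplot(x) draws the box from these same values, while stripchart(x) shows the individual observations as a dot plot.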
The graph analysis is as follows:
Density, pH, and wine quality appear to be approximately
normally distributed. Fixed acidity, volatile acidity, the
sulfur dioxides, sulphates, and alcohol appear to be
long-tailed. Residual sugar and chlorides have extreme
outliers. Citric acid has a large number of zero values, which
might be a case of non-reporting.
Exploratory Data Analysis (EDA) and Data Pre-processing
Histograms show the distribution of the variable values. Citric
acid was the one feature found not to be normally distributed on
a logarithmic scale.
Next, a combined variable named “TAC.acidity” is created as the
sum of the tartaric (fixed), acetic (volatile), and citric acid
values. It is as follows:
Boxplots for each of the variables as another indicator of
spread.
Observations regarding variables (all variables have outliers):
· Acidity variables such as citric acid, volatile acidity, and
fixed acidity have critical outliers. If these outliers are
removed, the distributions of these attributes can become
symmetric.
· Residual sugar shows a positively skewed distribution;
interestingly, the skewness remains even if we ignore the
outliers.
· Variables such as density and free sulfur dioxide have
significant outliers, which are very different from the rest of
the data.
· Most of the outliers lie on the larger side of the data.
· Alcohol content shows an irregular distribution without any
major outliers.
Support vector machines are a class of statistical models first
developed in the mid-1960s by Vladimir Vapnik. In later years,
the model has evolved considerably into one of the most flexible
and effective machine learning tools available. It is a
supervised learning algorithm that can be used to solve both
classification and regression problems, although this
description focuses on classification. In a nutshell, the
algorithm looks for a linearly separating hyperplane, that is, a
decision boundary separating members of one class from the
other. If such a hyperplane exists, the work is done. If it does
not, the SVM uses a nonlinear mapping to transform the training
data into a higher dimension and then searches for the optimal
linear separating hyperplane there. With an appropriate
nonlinear mapping to a sufficiently high dimension, data from
two classes can always be separated by a hyperplane. The SVM
algorithm finds this hyperplane using support vectors and
margins. As a training algorithm, SVM may not be fast compared
with some other classification methods, but owing to its ability
to model complex nonlinear boundaries, it has high accuracy. SVM
is also comparatively less prone to overfitting. It has been
applied successfully to handwritten digit recognition, text
classification, speaker identification, and so forth. In this
project, the technique helped us estimate values more closely
through regression.
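The idea of mapping data into a higher dimension to make it linearly separable can be shown with a toy example in base R. This is not an SVM fit: the data, the helper `splits_cleanly`, and the hand-chosen mapping x -> x^2 are all illustrative assumptions.

```r
# One feature, two classes: "in" sits between the two halves of "out",
# so no single threshold on x separates the classes.
x   <- c(-3, -2.5, -2, -0.5, 0, 0.5, 2, 2.5, 3)
cls <- ifelse(abs(x) < 1, "in", "out")

# Hypothetical helper: TRUE if one threshold on `score` separates the labels,
# i.e. the labels change exactly once when sorted by score.
splits_cleanly <- function(score, labels) {
  sorted <- labels[order(score)]
  sum(sorted[-1] != sorted[-length(sorted)]) == 1
}

# In the original space the classes interleave.
splits_cleanly(x, cls)

# After the nonlinear mapping x -> x^2, the threshold x^2 = 1 separates them.
splits_cleanly(x^2, cls)
```

An SVM with a nonlinear kernel performs this kind of mapping implicitly instead of requiring the transformed feature to be constructed by hand.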
Results and Findings
Each variable was correlated against wine quality to determine
which factors have comparatively more influence on the quality
of wine. The top 4 variables that influence wine quality were
found to be:
1) alcohol
2) sulphates (log10)
3) volatile acidity
4) citric acid
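A ranking like the one above can be produced by correlating each predictor with quality and sorting by absolute value. The sketch below uses a small simulated data frame, `wine_demo`, as a stand-in; its columns and coefficients are assumptions and do not reproduce the project's results.

```r
# Simulated stand-in for the wine data (assumption).
set.seed(5)
n <- 300
wine_demo <- data.frame(alcohol = rnorm(n), sulphates = rnorm(n), pH = rnorm(n))
wine_demo$quality <- round(
  5.5 + 0.8 * wine_demo$alcohol + 0.3 * wine_demo$sulphates + rnorm(n, sd = 0.7)
)

# Pearson correlation of every predictor with quality.
predictors <- setdiff(names(wine_demo), "quality")
cors <- sapply(predictors, function(v) cor(wine_demo[[v]], wine_demo$quality))

# Rank by strength of association, ignoring sign.
ranked <- sort(abs(cors), decreasing = TRUE)
ranked
```

Sorting on the absolute value matters because a strongly negative correlation (as volatile acidity may have with quality) is as informative as a strongly positive one.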
The following was done to examine the acidity variables.
Of all the factors, the base-10 logarithm of TAC.acidity
correlated very well with pH, and rightfully so, since pH is a
defining measure of acidity.
An interesting question to pose, using basic chemistry
knowledge, is which components other than the measured acids
are affecting pH.
We can quantify this difference by building a linear model that
predicts pH from TAC.acidity and capturing the % difference as
a new variable.
Conclusion
By examining the above information, we found that the supervised
learning technique called the support vector machine predicted
red wine quality best, and its results indicate that wine
quality increases with alcohol content. The other techniques
were in the same class as this one, but the SVM helped us
arrive at an approximate result, and we could predict quality
from the amount of alcohol.
We compared the linear model and the support vector machine.
The SVM performed marginally better, and we chose to stay with
it for any further predictions. This analysis can be used to
understand whether, by adjusting the variables during wine
making, it is possible to increase the quality of the wine on
the market. If you can control your variables, then you can
predict the quality of your wine and earn more profit.
References
CVRVV. 2008. Portuguese Wine — Vinho Verde. Comissão de
Viticultura da Região dos Vinhos Verdes (CVRVV),
http://www.vinhoverde.pt.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. 2009.
Modeling wine preferences by data mining from
physicochemical properties. In Decision Support Systems,
Elsevier, 47(4):547-553.
V. Cherkassy, Y. Ma. 2004. Practical selection of SVM
parameters and noise estimation for SVM regression. Neural
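```{r,message=FALSE,warning=FALSE}
# Packages used below: ggplot2 (qplot, ggplot), gridExtra (grid.arrange), skimr (skim)
library(ggplot2)
library(gridExtra)
library(skimr)
```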
Read csv file and explore statistics
```{r echo=FALSE,message=FALSE,warning=FALSE}
Wine <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep = ";")
str(Wine)
summary(Wine)
skim(Wine)
Wine$quality <- as.numeric(Wine$quality)
```
Creates tabular results of categorical variables
```{r,message=FALSE,warning=FALSE}
table(Wine$quality)
```
# Univariate Plots Section
```{r echo=FALSE,message=FALSE,warning=FALSE}
grid.arrange(qplot(Wine$fixed.acidity),
qplot(Wine$volatile.acidity),
qplot(Wine$citric.acid),
qplot(Wine$residual.sugar),
qplot(Wine$chlorides),
qplot(Wine$free.sulfur.dioxide),
qplot(Wine$total.sulfur.dioxide),
qplot(Wine$density),
qplot(Wine$pH),
qplot(Wine$sulphates),
qplot(Wine$alcohol),
qplot(Wine$quality),
ncol = 4)
```
# Univariate Analysis
1. Wine Quality forms a normal distribution.
2. Density and pH are normally distributed with a few outliers.
Create a new variable for better exploration
```{r,message=FALSE,warning=FALSE}
Wine$rating <- ifelse(Wine$quality < 5, 'bad', ifelse(
Wine$quality < 7, 'average', 'good'))
Wine$rating <- ordered(Wine$rating,
levels = c('bad', 'average', 'good'))
summary(Wine$rating)
```
Create histograms of the variables on a log scale for further
analysis
```{r,message=FALSE,warning=FALSE}
# Histogram of fixed acidity on a log10 x-axis
ggplot(Wine, aes(x = fixed.acidity)) +
  geom_histogram(fill = 'red') +
  scale_x_log10(breaks = 4:15) +
  xlab('Fixed Acidity') + ylab('Count') +
  ggtitle('Histogram of Fixed Acidity Values')
# Interactive histogram of citric acid
library(plotly)
plot_ly(data = Wine, x = ~citric.acid, type = 'histogram')
ggplot(Wine) +
geom_histogram(aes(x=volatile.acidity),fill='blue')+
scale_x_log10(breaks=seq(0.1,1,0.1))
ggplot(Wine) +
geom_histogram(aes(x=citric.acid),fill='green') +
scale_x_log10()
```
Citric acid was one feature that was found to be not
normally distributed on a logarithmic scale.
Create a combined variable,
TAC.acidity, containing the sum of tartaric, acetic, and citric
acid.
```{r,message=FALSE,warning=FALSE}
Wine$TAC.acidity <- Wine$fixed.acidity +
Wine$volatile.acidity +
Wine$citric.acid
qplot(Wine$TAC.acidity, main = 'Histogram of TAC Acidity (fixed + volatile + citric)')
```
```{r,message=FALSE,warning=FALSE}
cor.test(Wine$volatile.acidity, Wine$citric.acid)
ggplot(data = Wine, aes(x = log10(TAC.acidity), y = pH)) +
  geom_point(alpha = 0.3)
cor.test(log10(Wine$TAC.acidity), Wine$pH)
```
Base 10 logarithm TAC.acidity correlated very well with pH.
Building a predictive linear model,
to predict pH based off of TAC.acidity and
capture the % difference as a new variable.
```{r,message=FALSE,warning=FALSE}
m <- lm(I(pH) ~ I(log10(TAC.acidity)), data = Wine)
Wine$pH.predictions <- predict(m, Wine)
# Relative error: (predicted - observed) / observed
Wine$pH.error <- (Wine$pH.predictions - Wine$pH)/Wine$pH
```
To check its accuracy, compute the RMS error.
```{r,message=FALSE,warning=FALSE}
rmse <- function(error)
{
sqrt(mean(error^2))
}
rmse(m$residuals)
#Now, we train a Support Vector Machine.
require(e1071)
SVM <- svm(I(pH) ~ I(log10(TAC.acidity)), data = Wine)
Wine$pH.Predict.SVM <- predict(SVM,Wine)
Wine$pH.error.SVM <- (Wine$pH.Predict.SVM -
Wine$pH)/Wine$pH
rmse(SVM$residuals)
```
On this RMS error, the SVM performs slightly better than the linear model (LM).
### Plot 1: Effect of Alcohol on Wine Quality
```{r echo=FALSE,message=FALSE,warning=FALSE}
# factor(quality) gives one box per quality score
ggplot(data = Wine, aes(x = factor(quality), y = alcohol,
                        fill = rating)) +
geom_boxplot(outlier.color = 'red') +
ggtitle('Alcohol Levels in Different Wine Qualities') +
xlab('Quality') +
ylab('Alcohol (% volume)')
```
### Description 1
These boxplots demonstrate the effect of alcohol content on
wine quality.
Generally, higher alcohol content correlated with higher wine
quality.
However, as the outliers and intervals show, alcohol content
alone did not produce a higher quality.