With no prior knowledge of wine or wine quality, and purely out of curiosity, I worked on the well-known wine dataset to find relations between the chemical composition of the wines and the quality rating given by individuals.
Kindly go through the report and share your comments and suggestions.
Wine Quality
Exercise on Wine Dataset
Objective: Finding a Relation between Wine Quality and Chemical Composition of Wines
Summary of the Data:
Total number of observations: 6497
Number of instances for Red Wine: 1599
Number of instances for White Wine: 4898
Total number of Numeric variables: 11; Character variables: 2
Exploratory Data Analysis: The following graphical analysis helps us understand how the
different components are used in making the wines, and gives a sense of how each chemical
is distributed across the wines.
Interpretation:
From the plot, it is evident that quality values are concentrated in categories 5, 6 and 7.
Only a small proportion falls in categories 3, 4, 8 and 9, and none in categories 1, 2 and 10.
Fixed acidity, volatile acidity and citric acid have outliers; in other words, in some cases
these components are present at levels that spoil the taste of the wine. Had those cases been
corrected, the quality of the wine would have been better. Residual sugar has a positively
skewed distribution; even after eliminating the outliers, the distribution remains skewed.
Descriptive Statistical Analysis of the Wine Dataset:
Interpretation: Overall, the average quality of the wine is 5.8. If we calculate the
interquartile range (IQR), we see that for most variables the range is much greater than the
IQR, which tells us that the dataset has outliers, mostly on the higher side, and that due care
should be taken during the preparation of wine.
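The comparison of range and IQR described above can be sketched in a few lines of Python. This is only an illustration: the 1.5 * IQR boxplot rule is an assumption (the report does not state which outlier rule was used), and the sample values are made up rather than taken from the wine dataset.

```python
import statistics


def range_and_iqr(values):
    """Return (range, IQR, outliers) for a list of numbers.

    Outliers are flagged with the common 1.5 * IQR boxplot rule, which is
    an assumption here, not necessarily the rule used in the report.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return max(values) - min(values), iqr, outliers


# Hypothetical toy values, not taken from the wine dataset.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
value_range, iqr, outliers = range_and_iqr(sample)
print(value_range, iqr, outliers)
```

A single extreme value stretches the range far beyond the IQR, which is the pattern the interpretation points to.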
Outlier Analysis: Boxplot analysis clearly shows that the dataset has outliers for almost
all the variables.
Variables n mean sd median trimmed mad min max range skew kurtosis se
fixed_acidity 6497 7.2 1.3 7.0 7.1 0.9 3.8 15.9 12.1 1.7 5.1 0.0
volatile_acidity 6497 0.3 0.2 0.3 0.3 0.1 0.1 1.6 1.5 1.5 2.8 0.0
citric_acid 6497 0.3 0.1 0.3 0.3 0.1 0.0 1.7 1.7 0.5 2.4 0.0
residual_sugar 6497 5.4 4.8 3.0 4.7 2.5 0.6 65.8 65.2 1.4 4.4 0.1
chlorides 6497 0.1 0.0 0.0 0.1 0.0 0.0 0.6 0.6 5.4 50.8 0.0
free_sulfur_dioxide 6497 30.5 17.7 29.0 29.3 17.8 1.0 289.0 288.0 1.2 7.9 0.2
total_sulfur_dioxide 6497 115.7 56.5 118.0 115.9 57.8 6.0 440.0 434.0 0.0 -0.4 0.7
density 6497 1.0 0.0 1.0 1.0 0.0 1.0 1.0 0.1 0.5 6.6 0.0
pH 6497 3.2 0.2 3.2 3.2 0.2 2.7 4.0 1.3 0.4 0.4 0.0
sulphates 6497 0.5 0.1 0.5 0.5 0.1 0.2 2.0 1.8 1.8 8.6 0.0
alcohol 6497 10.5 1.2 10.3 10.4 1.3 8.0 14.9 6.9 0.6 -0.5 0.0
quality 6497 5.8 0.9 6.0 5.8 1.5 3.0 9.0 6.0 0.2 0.2 0.0
Correlation Analysis: No strong correlation exists among the variables; each variable has
its own importance in making the wine.
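For readers who want to reproduce a correlation check, a minimal sketch of the Pearson coefficient follows. It assumes Pearson correlation was the measure used, which the report does not state; the function names and the toy columns are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise correlations for a dict mapping column name -> list of values."""
    names = list(columns)
    return {(a, b): pearson_r(columns[a], columns[b]) for a in names for b in names}


# Hypothetical toy columns, not taken from the wine dataset.
toy = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]}
matrix = correlation_matrix(toy)
print(matrix[("a", "b")], matrix[("a", "c")])
```

Values close to +1 or -1 would indicate strongly related variables; values near 0 match the "no strong correlation" reading above.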
Deep Dive on the Wine Dataset: As our main objective is to find the relation between the
quality of the wine and the composition of the chemical variables used to produce it, we have
looked at each quality level and its corresponding composition.
Quality-3:
Interpretation: Although the analysis is quite lengthy, from the above statistical summary and
multiple plots we can easily find the cut-off points for a good or excellent quality of wine.
It is observed that people find a wine to be of bad quality or excellent quality when the
following criteria are satisfied:
Variables               Bad Quality                          Excellent Quality
fixed_acidity           less than 4 or greater than 8        between 6.5 and 7.5
volatile_acidity        greater than 0.5                     between 0.2 and 0.4
citric_acid             greater than 0.5                     between 0.2 and 0.5
residual_sugar          greater than 10                      between 0.5 and 5
chlorides               greater than 0.05                    between 0.01 and 0.04
free_sulfur_dioxide     greater than 45                      between 20 and 40
total_sulfur_dioxide    less than 30 or greater than 150     between 60 and 145
density                 greater than 1                       between 0.98 and 1
pH                      less than 3                          between 3 and 3.6
sulphates               less than 0.3                        between 0.4 and 0.75
alcohol                 less than 8 or greater than 13.5     between 10.5 and 13
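As an illustration of how these cut-offs could be applied, the sketch below encodes the "Excellent Quality" ranges from the table and reports which variables of a given wine fall outside them. Treating the bounds as inclusive is an assumption, and the sample wine is hypothetical.

```python
# "Excellent Quality" ranges copied from the cut-off table above.
# Inclusive bounds are an assumption; the table does not specify them.
EXCELLENT_RANGES = {
    "fixed_acidity": (6.5, 7.5),
    "volatile_acidity": (0.2, 0.4),
    "citric_acid": (0.2, 0.5),
    "residual_sugar": (0.5, 5.0),
    "chlorides": (0.01, 0.04),
    "free_sulfur_dioxide": (20, 40),
    "total_sulfur_dioxide": (60, 145),
    "density": (0.98, 1.0),
    "pH": (3.0, 3.6),
    "sulphates": (0.4, 0.75),
    "alcohol": (10.5, 13.0),
}


def outside_excellent_range(wine):
    """Return the names of variables whose value lies outside the excellent range."""
    return [
        name
        for name, (low, high) in EXCELLENT_RANGES.items()
        if not low <= wine[name] <= high
    ]


# Hypothetical wine, made up for illustration only.
sample_wine = {
    "fixed_acidity": 7.1,
    "volatile_acidity": 0.3,
    "citric_acid": 0.4,
    "residual_sugar": 2.2,
    "chlorides": 0.03,
    "free_sulfur_dioxide": 28,
    "total_sulfur_dioxide": 119,
    "density": 0.99,
    "pH": 3.3,
    "sulphates": 0.5,
    "alcohol": 9.5,
}
print(outside_excellent_range(sample_wine))
```

For this made-up wine only alcohol falls outside its excellent range, so the function returns a one-element list.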
Model Building and Classification Analysis:
After thoroughly working through the exploratory data analysis, we have observed the behavior
of the wine dataset. Based on the patterns in the data, we have seen the possible reasons for a
wine to be of bad or good quality. As a next step, I used the random forest algorithm to
classify the wine dataset. The variable importance plot and the mean decrease in accuracy show
which chemical components are most important in making the wine. I found an accuracy of 85%,
which is reasonably good and suggests that the variables classify quality correctly in most
cases. The model can be improved, and further analysis can be done to better understand the
variable importance.
Variables n mean sd median trimmed mad min max range skew kurtosis se
fixed_acidity 5 7.4 1.0 7.1 7.4 0.4 6.6 9.1 2.5 0.8 -1.2 0.4
volatile_acidity 5 0.3 0.1 0.3 0.3 0.0 0.2 0.4 0.1 0.2 -2.2 0.0
citric_acid 5 0.4 0.1 0.4 0.4 0.1 0.3 0.5 0.2 0.1 -2.0 0.0
residual_sugar 5 4.1 3.8 2.2 4.1 0.9 1.6 10.6 9.0 0.9 -1.2 1.7
chlorides 5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -0.2 -2.1 0.0
free_sulfur_dioxide 5 33.4 13.4 28.0 33.4 4.4 24.0 57.0 33.0 1.0 -1.0 6.0
total_sulfur_dioxide 5 116.0 19.8 119.0 116.0 8.9 85.0 139.0 54.0 -0.4 -1.4 8.9
density 5 1.0 0.0 1.0 1.0 0.0 1.0 1.0 0.0 1.0 -1.0 0.0
pH 5 3.3 0.1 3.3 3.3 0.1 3.2 3.4 0.2 0.0 -1.9 0.0
sulphates 5 0.5 0.1 0.5 0.5 0.1 0.4 0.6 0.3 0.4 -1.5 0.0
alcohol 5 12.2 1.0 12.5 12.2 0.3 10.4 12.9 2.5 -1.0 -1.0 0.5
Variable Importance Plot:
Importance of the Variables:
Variables MeanDecreaseAccuracy
alcohol 82.6
free_sulfur_dioxide 64.4
volatile_acidity 62.0
pH 61.1
sulphates 60.3
residual_sugar 59.7
chlorides 56.0
fixed_acidity 52.8
density 51.2
total_sulfur_dioxide 49.6
citric_acid 46.7
It is found that alcohol is the most important chemical component in making the wine, while
citric acid is the least significant variable.
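The idea behind a mean-decrease-in-accuracy score can be illustrated with a generic permutation test: shuffle one variable, re-score the model, and record how much accuracy drops. The sketch below is not the exact computation of whichever random forest package the report used; the toy model, data and function names are all hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    correct = sum(predict(row) == label for row, label in zip(rows, labels))
    return correct / len(rows)


def permutation_importance(predict, rows, labels, feature, seed=0):
    """Drop in accuracy after shuffling one feature's values across the rows."""
    rng = random.Random(seed)
    shuffled_values = [row[feature] for row in rows]
    rng.shuffle(shuffled_values)
    permuted_rows = [
        {**row, feature: value} for row, value in zip(rows, shuffled_values)
    ]
    return accuracy(predict, rows, labels) - accuracy(predict, permuted_rows, labels)


# Hypothetical toy model that looks only at alcohol.
def toy_predict(row):
    return "Good" if row["alcohol"] >= 11.0 else "Normal"


# Hypothetical toy data; labels are generated by the toy model itself.
rows = [{"alcohol": 9.0 + 0.5 * i, "citric_acid": 0.1 * i} for i in range(8)]
labels = [toy_predict(row) for row in rows]

for name in ("alcohol", "citric_acid"):
    print(name, permutation_importance(toy_predict, rows, labels, name))
```

A variable the model never uses (here citric_acid) scores exactly zero, while shuffling a variable the model relies on can only lower accuracy.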
Confusion Matrix:
                   Predicted Variable
Target Variable    Bad    Good    Normal
Bad                  9       0         1
Good                 1     227        54
Normal              68     162      1428
Accuracy: 85%
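The reported accuracy can be checked directly from the confusion matrix: 9 + 227 + 1428 = 1664 correct predictions out of 1950 observations, or about 85.3%. A short sketch, assuming rows are the target classes and columns the predicted classes as the table labels suggest:

```python
# Counts copied from the confusion matrix above.
# Assumption: rows are target classes, columns are predicted classes.
CLASSES = ["Bad", "Good", "Normal"]
CONFUSION = [
    [9, 0, 1],
    [1, 227, 54],
    [68, 162, 1428],
]


def overall_accuracy(matrix):
    """Share of observations on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_recall(matrix, classes):
    """For each target class, the share of its rows predicted correctly."""
    return {name: matrix[i][i] / sum(matrix[i]) for i, name in enumerate(classes)}


print(round(overall_accuracy(CONFUSION), 3))  # 0.853
print(per_class_recall(CONFUSION, CLASSES))
```

Per-class recall is worth looking at alongside the overall figure, since the Normal class dominates the counts.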
Note:
Although there are outliers in the dataset, I have not deleted the outlier observations from
the data, as I believe we have to identify those cases in which the wine-making procedure can
be improved.
As our target is to identify the relation between the perceived quality of the wine and
the chemical composition of the wines, I did not treat red and white wine as separate
datasets; on inspection, there are no significant differences between the red and white wine
data.
To classify the wines, I have considered Bad as quality levels 1, 2, 3 and 4;
Normal as quality levels 5, 6 and 7; and
Good as quality levels 8, 9 and 10.
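The grouping of quality levels into three classes can be written as a small helper. This is a sketch that assumes integer scores from 1 to 10; the function name is hypothetical.

```python
def quality_class(score):
    """Map a quality score (1-10) to the Bad / Normal / Good classes of the note."""
    if not 1 <= score <= 10:
        raise ValueError("quality score must be between 1 and 10")
    if score <= 4:
        return "Bad"
    if score <= 7:
        return "Normal"
    return "Good"


print([quality_class(q) for q in (3, 5, 7, 8, 9)])
```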