Problem Statement
We are trying to predict wine quality as the target for a number of wines from a given set of
predictor variables, even though wine quality is a subjective measurement. This is an EDA, or
data-driven story, that combines a range of graphs and images with an attribute-based quality forecast.
The question to answer is: “What is the quality of the wine, as an ordinal value from 3 to 9?” It is a
regression problem.
Perform Data Cleaning, Pre-processing and Feature Selection
Apply ML models to predict wine quality
Use Auto-ML to determine the best model
Use SHAP library to determine the impact of the predictor variables
ML Data Cleaning and Feature Selection
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy.stats import norm
from scipy import stats
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from math import sqrt
from sklearn.metrics import r2_score
Cabernet Sauvignon is known as the king of red wines.
Cabernet_Sauvignon = pd.read_csv('
0 white 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.0010
1 white 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940
2 white 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951
3 white 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956
4 white 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956
(6497, 13)
What are the data types? (Only numeric and categorical)
type object
fixed acidity float64
volatile acidity float64
citric acid float64
residual sugar float64
chlorides float64
free sulfur dioxide float64
total sulfur dioxide float64
density float64
pH float64
sulphates float64
alcohol float64
quality int64
dtype: object
The dataset has 1 Categorical and 12 Numerical Features.
What features are in the dataset?
fixed acidity - Fixed acidity is due to the presence of non-volatile acids in wine, for example
tartaric, citric or malic acid. These acids contribute to the balance of the wine's taste and bring
freshness to it.
volatile acidity - Volatile acidity is the part of the acid in wine that can be picked up by the nose,
unlike the fixed acids described above, which are perceived by taste. Excess volatile acidity, in other
words souring of the wine, is one of the most common defects.
citric acid - Its use in winemaking is permitted by OIV Resolution No. 23/2000. It can be used
in three cases: for acid treatment of wine (increasing acidity), for collecting wine, and for cleaning
filters of possible fungal and mold infections.
residual sugar - The grape sugar that has not been fermented into alcohol.
chlorides - The structure of the wine also depends on its mineral content, which determines
taste sensations such as salinity (sapidità). Anions of inorganic acids (chlorides, sulfates,
sulfites...), anions of transferred acids and metal cations (potassium, sodium, magnesium...)
are found in wine. Their content depends mainly on the climatic zone (cold or warm region, salty
soils depending on the proximity of the sea), oenological practices, and the storage and aging of the wine.
free sulfur dioxide, total sulfur dioxide - Sulfur dioxide (SO2, food additive E220) is used as a
preservative because of its antioxidant and antimicrobial properties. Molecular SO2 is an extremely
important antimicrobial agent that acts on microorganisms (including wild yeast) that can cause
wine spoilage.
density - The density of wine can be either lower or higher than that of water. Its value is determined
primarily by the concentration of alcohol and sugar. White, rosé and red wines are generally light:
their density at 20°C is below 998.3 kg/m3.
pH - A measure of the acidity of wine. All wines ideally have a pH between 2.9 and 4.2. The
lower the pH, the more acidic the wine; the higher the pH, the less acidic the wine.
sulphates - Sulfites are a natural by-product of yeast fermenting the sugar in wine into alcohol, so
their presence in wine cannot be excluded.
alcohol - The alcohol content of wine depends on many factors: the grape variety and the amount of
sugar in the berries, the production technology and the growing conditions. Wines vary greatly in
strength: this parameter ranges from 4.5 to 22 depending on the category.
quality - The target variable.
Are there missing values?
type 0
fixed acidity 10
volatile acidity 8
citric acid 3
residual sugar 2
chlorides 2
free sulfur dioxide 0
total sulfur dioxide 0
density 0
pH 9
sulphates 4
alcohol 0
quality 0
dtype: int64
Which independent variables have missing data? How much?
fixed acidity - 10
volatile acidity - 8
citric acid - 3
residual sugar - 2
chlorides - 2
pH - 9
sulphates - 4
The features above have the listed numbers of missing values. Since the distributions are fairly
symmetric, mean replacement is a reasonable choice.
Before examining the quality feature, the categorical variable is mapped to integer codes with the
help of cat.codes. This makes the data analysis easier and more comprehensible.
Cabernet_Sauvignon['type'] = Cabernet_Sauvignon['type'].astype("category")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 type 6497 non-null int8
1 fixed acidity 6487 non-null float64
2 volatile acidity 6489 non-null float64
3 citric acid 6494 non-null float64
4 residual sugar 6495 non-null float64
5 chlorides 6495 non-null float64
6 free sulfur dioxide 6497 non-null float64
7 total sulfur dioxide 6497 non-null float64
8 density 6497 non-null float64
9 pH 6488 non-null float64
10 sulphates 6493 non-null float64
11 alcohol 6497 non-null float64
12 quality 6497 non-null int64
dtypes: float64(11), int64(1), int8(1)
memory usage: 615.6 KB
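A minimal sketch of the encoding step, run on a toy frame rather than the wine data. The frame `toy` and its values are illustrative assumptions; the point is that `cat.codes` numbers the categories in sorted order, which is consistent with `type` appearing as `int8` in the summary above.

```python
import pandas as pd

# Toy stand-in for the 'type' column (values are illustrative only)
toy = pd.DataFrame({"type": ["white", "red", "white", "red", "white"]})
toy["type"] = toy["type"].astype("category")

# Record which integer code each category receives before overwriting the column
mapping = dict(enumerate(toy["type"].cat.categories))
toy["type"] = toy["type"].cat.codes

print(mapping)
print(toy["type"].tolist())
```

Keeping the `mapping` dictionary around makes it possible to translate the codes back to labels when reading later plots.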
1. Mean
# mean = Cabernet_Sauvignon["fixed acidity"].mean()
# Cabernet_Sauvignon["fixed acidity"].fillna(mean,inplace=True)
# Cabernet_Sauvignon["fixed acidity"].isnull().sum()
# mean2 = Cabernet_Sauvignon["volatile acidity"].mean()
# Cabernet_Sauvignon["volatile acidity"].fillna(mean2,inplace=True)
# Cabernet_Sauvignon["volatile acidity"].isnull().sum()
# mean3 = Cabernet_Sauvignon["citric acid"].mean()
# Cabernet_Sauvignon["citric acid"].fillna(mean3,inplace=True)
# Cabernet_Sauvignon["citric acid"].isnull().sum()
# mean4 = Cabernet_Sauvignon["residual sugar"].mean()
# Cabernet_Sauvignon["residual sugar"].fillna(mean4,inplace=True)
# Cabernet_Sauvignon["residual sugar"].isnull().sum()
# mean5 = Cabernet_Sauvignon["chlorides"].mean()
# Cabernet_Sauvignon["chlorides"].fillna(mean5,inplace=True)
# Cabernet_Sauvignon["chlorides"].isnull().sum()
# mean6 = Cabernet_Sauvignon["pH"].mean()
# Cabernet_Sauvignon["pH"].fillna(mean6,inplace=True)
# Cabernet_Sauvignon["pH"].isnull().sum()
# mean7 = Cabernet_Sauvignon["sulphates"].mean()
# Cabernet_Sauvignon["sulphates"].fillna(mean7,inplace=True)
# Cabernet_Sauvignon["sulphates"].isnull().sum()
# Cabernet_Sauvignon.isnull().sum()
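The commented-out cells above repeat the same three steps for every column. A more compact variant could look like the sketch below; the helper name `fill_with_mean` and the toy data are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def fill_with_mean(df, columns):
    """Return a copy of df with NaNs in the given columns replaced by the column mean."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].mean())
    return out

# Toy data with one missing value per column (illustrative only)
toy = pd.DataFrame({"pH": [3.0, np.nan, 3.4], "sulphates": [0.5, 0.7, np.nan]})
filled = fill_with_mean(toy, ["pH", "sulphates"])
print(filled.isnull().sum())
```

Working on a copy leaves the original frame untouched, so the mean and KNN variants can be compared on the same input.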
2. KNN Imputer
#Creating a separate dataframe for performing the KNN imputation
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler
imputer = KNNImputer(n_neighbors=5)
Cabernet_Sauvignon = pd.DataFrame(imputer.fit_transform(Cabernet_Sauvignon), columns =
type 0
fixed acidity 0
volatile acidity 0
citric acid 0
residual sugar 0
chlorides 0
free sulfur dioxide 0
total sulfur dioxide 0
density 0
pH 0
sulphates 0
alcohol 0
quality 0
dtype: int64
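KNN imputation relies on distances between rows, so features on large scales can dominate the neighbour search. The sketch below shows one possible way to scale before imputing and to undo the scaling afterwards. It runs on a toy frame, and the choice of `n_neighbors=2` is an assumption made to suit five rows, not the setting used above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Toy data: one missing density value (illustrative only)
toy = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 12.0, 12.5],
    "density": [0.998, 0.997, np.nan, 0.992, 0.991],
})

scaler = MinMaxScaler()
scaled = scaler.fit_transform(toy)  # NaNs are ignored during fitting and kept in the output

# Impute in the scaled space, then map the values back to the original units
imputed = KNNImputer(n_neighbors=2).fit_transform(scaled)
restored = pd.DataFrame(scaler.inverse_transform(imputed), columns=toy.columns)
print(restored)
```

The missing density is filled from the two rows whose alcohol values are closest to 10.0.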
What are the likely distributions of the numeric variables? What are the distributions of the
predictor variables?
In the plots below, a good fit indicates that normality is a reasonable approximation.
Distribution of Predictors
Cabernet_SauvignonColumnList = Cabernet_Sauvignon.columns
for i in Cabernet_SauvignonColumnList:
    plt.figure(figsize= (5,5))
    sns.distplot(Cabernet_Sauvignon[i], fit = norm)
    plt.title(f"Distribution of {i} (checking normal distribution fit)",size = 15, wei
type : categorical values
fixed acidity : normal distribution
volatile acidity : almost normal distribution with a bit of right-skewness
citric acid : almost normal distribution with a bit of edge-peak
residual sugar : almost normal distribution with a bit of right-skewness
chlorides : almost normal distribution with a bit of right-skewness
free sulfur dioxide : normal distribution
total sulfur dioxide : almost normal distribution with a bit of edge-peak
sulphates : normal distribution
alcohol : almost normal distribution with a bit of right-skewness
pH : normal distribution
density : normal distribution
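The visual judgements in the list above can be backed by a number. The sketch below uses the sample skewness from pandas on synthetic columns; the column names and distributions are assumptions chosen to show a symmetric and a right-skewed case. Values near 0 suggest symmetry, and clearly positive values suggest right-skewness.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns: one roughly normal, one right-skewed (illustrative only)
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=3.2, scale=0.15, size=2000),
    "right_skewed": rng.lognormal(mean=0.0, sigma=0.8, size=2000),
})

skewness = toy.skew()  # sample skewness per column
print(skewness)
```

Applied to the wine data, the same call would be `Cabernet_Sauvignon.skew()`.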
Do the ranges of the predictor variables make sense?
chlorides su
count 6497.000000 6497.000000 6497.000000 6497.000000 6497.000000 6497.000000 6497.0
mean 0.753886 7.216501 0.339634 0.318675 5.445704 0.056041 30.5
std 0.430779 1.295928 0.164563 0.145267 4.758043 0.035032 17.7
min 0.000000 3.800000 0.080000 0.000000 0.600000 0.009000 1.0
25% 1.000000 6.400000 0.230000 0.250000 1.800000 0.038000 17.0
50% 1.000000 7.000000 0.290000 0.310000 3.000000 0.047000 29.0
75% 1.000000 7.700000 0.400000 0.390000 8.100000 0.065000 41.0
max 1.000000 15.900000 1.580000 1.660000 65.800000 0.611000 289.0
#Range of each column
Cabernet_Sauvignon.max() - Cabernet_Sauvignon.min()
The ranges make sense for each attribute of a wine. The range of the "total sulfur
dioxide" variable is large, which implies high variability in its distribution.
Do the training and test sets have the same data?
Using train_test_split, the train and test sets are split at a ratio of 80/20 from the same dataset.
The two sets are disjoint, and the test set is not seen by the model during the training phase. The
distribution of each attribute should nevertheless be similar in both the train and test sets.
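One way to check that the two splits look alike is to compare simple summary statistics after splitting. The sketch below does this on synthetic data; the array shapes, the hypothetical quality range and `random_state=42` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))           # three synthetic predictors
y = rng.integers(3, 10, size=1000)       # hypothetical quality scores from 3 to 9

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
# Absolute difference between the per-feature means of the two splits
print(np.abs(X_train.mean(axis=0) - X_test.mean(axis=0)))
```

Small differences between the split means are expected from a random split; large ones would be a reason to look at the split more closely.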
Phase 1
Cabernet_Sauvignon_x = Cabernet_Sauvignon[['type','fixed acidity','volatile acidity','
Cabernet_Sauvignon_y = Cabernet_Sauvignon['quality']
# .iloc[:,:12], Cabernet_Sauvignon.iloc[:,-1]
0 6.0
1 6.0
2 6.0
3 6.0
4 6.0
Name: quality, dtype: float64
scaler = StandardScaler()
# #Dataframe Cabernet_Sauvignon with outliers
Cabernet_Sauvignon_x = scaler.fit_transform(Cabernet_Sauvignon_x)
ax = sns.boxplot(data=Cabernet_Sauvignon_x)
[Box plot of the standardized predictors; x-axis tick labels: type, fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol]
#Splitting the dataset with outlier into Train and Test sets at 80-20 proportion
X_train, X_test, y_train, y_test = train_test_split(Cabernet_Sauvignon_x, Cabernet_Sau
(5197, 12)
(1300, 12)
Model Building
Linear Regression Model
##Linear Regression
# `normalize` was removed in scikit-learn 1.2; the inputs are already standardized above
lr = LinearRegression(copy_X= True, fit_intercept = True)
lr.fit(X_train, y_train)
lr_pred= lr.predict(X_test)
mae1 = mean_absolute_error(y_test, lr_pred)
print('MAE: %f'% mae1)
rmse1= np.sqrt(mean_squared_error(y_test, lr_pred))
print('RMSE: %f'% rmse1)
r21 = r2_score(y_test, lr_pred)
print('R2: %f' % r21)
MAE: 0.545152
RMSE: 0.686665
R2: 0.340363
Random Forest
from sklearn.ensemble import RandomForestRegressor
model2 = RandomForestRegressor(random_state=1, n_estimators=1000)
model2.fit(X_train, y_train)
Rm_pred = model2.predict(X_test)
mae2 = mean_absolute_error(y_test, Rm_pred)
print('MAE: %f'% mae2)
rmse2 = np.sqrt(mean_squared_error(y_test, Rm_pred))
print('RMSE: %f'% rmse2 )
r22 = r2_score(y_test, Rm_pred)
print('R2: %f' % r22)
MAE: 0.401750
RMSE: 0.561165
R2: 0.559449
plt.figure(figsize=(5, 7))
ax = sns.distplot(y_test, hist=False, color="r", label="Actual Value")
sns.distplot(Rm_pred, hist=False, color="b", label="Fitted Values" , ax=ax)
plt.title('Actual(red) vs Fitted(blue) Values for Quality')
Decision Tree
from sklearn.tree import DecisionTreeRegressor
model3 = DecisionTreeRegressor(max_depth=6)
model3.fit(X_train, y_train)
Dt_pred = model3.predict(X_test)
mae3 = mean_absolute_error(y_test, Dt_pred)
print('MAE: %f'% mae3)
rmse3 = np.sqrt(mean_squared_error(y_test, Dt_pred))
print('RMSE: %f'% rmse3)
r23 = r2_score(y_test, Dt_pred)
print('R2: %f' % r23)
MAE: 0.541020
RMSE: 0.696854
R2: 0.320642
plt.figure(figsize=(5, 7))
ax = sns.distplot(y_test, hist=False, color="r", label="Actual Value")
sns.distplot(Dt_pred, hist=False, color="b", label="Fitted Values" , ax=ax)
plt.title('Actual(red) vs Fitted(blue) Values for Quality')
Phase 2
Are the predictor variables independent of all the other predictor variables?
Multicollinearity describes strong relationships among the explanatory variables in a multiple
regression. If multicollinearity occurs, the highly related input variables should be considered for
removal from the model.
In this notebook, multicollinearity is checked by plotting a correlation heatmap.
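Besides a heatmap, multicollinearity is often quantified with the variance inflation factor (VIF). This is a swapped-in technique, not something the notebook computes. The sketch below implements it with NumPy on synthetic columns; the helper name `vif` and the data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def vif(df):
    """Variance inflation factor per column: 1 / (1 - R^2) of that column regressed on the others."""
    values = {}
    for col in df.columns:
        y = df[col].to_numpy()
        X = df.drop(columns=col).to_numpy()
        X = np.column_stack([np.ones(len(X)), X])   # add an intercept term
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        r2 = 1.0 - resid.var() / y.var()
        values[col] = 1.0 / (1.0 - r2)
    return pd.Series(values)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),   # nearly a copy of "a"
    "c": rng.normal(size=500),                  # independent of the others
})
result = vif(toy)
print(result)
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity; here the two nearly identical columns stand out while the independent one stays close to 1.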
Which independent variables are useful to predict a target (dependent variable)? (Use at least
three methods)
For a regression model, the most useful independent variables can be determined statistically
using the following methods:
f_regression (univariate F-test)
Mutual information metric
Correlation Matrix with Heatmap
Each of these methods is applied to the dataset below.
1. f_regression
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression, mutual_info_regression
X = Cabernet_Sauvignon.iloc[:,0:12]
y = Cabernet_Sauvignon.iloc[:,-1]
# y=y.astype('int')
# y = pd.DataFrame(y)
# y.head(10)
# y.describe()
#Applying SelectKBest class to extract top features
# feature selection
f_selector = SelectKBest(score_func=f_regression, k='all')
# learn relationship from training data
f_selector.fit(X_train, y_train)
# transform train input data
X_train_fs = f_selector.transform(X_train)
# transform test input data
X_test_fs = f_selector.transform(X_test)
# Plot the scores for the features
plt.rcParams["figure.figsize"] = (30,10)
plt.bar([i for i in range(len(f_selector.scores_))], f_selector.scores_)
plt.xlabel("feature index")
plt.xticks([i for i in range(len(f_selector.scores_))], Cabernet_SauvignonColumnList[:
plt.ylabel("F-value (transformed from the correlation values)")
# bestFeatures = SelectKBest(score_func= chi2, k =12)
# fit = bestFeatures.fit(X,y)
4/26/23, 1:22 PM Wine_Quality_report (2).ipynb - Colaboratory 16/93
We can see that volatile acidity, chlorides, density and alcohol have more importance than the
other features.
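The y-axis label of the plot describes the F-value as a transformation of the correlation. The sketch below illustrates that relation on synthetic data: with centering, `f_regression` returns r² / (1 − r²) · (n − 2) for each feature, where r is the Pearson correlation with the target. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0] + rng.normal(size=200)   # only the first feature drives y

f_scores, p_values = f_regression(X, y)

# Recompute the F-values by hand from the Pearson correlations
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
f_manual = r ** 2 / (1 - r ** 2) * (len(y) - 2)

print(f_scores)
print(f_manual)
```

Because the score depends only on the linear correlation, features with a purely non-linear relation to the target can score low here and still matter for a tree-based model.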
2.Mutual information metric
# feature selection
f_selector = SelectKBest(score_func=mutual_info_regression, k='all')
# learn relationship from training data
f_selector.fit(X_train, y_train)
# transform train input data
X_train_fs = f_selector.transform(X_train)
# transform test input data
X_test_fs = f_selector.transform(X_test)
# Plot the scores for the features
plt.bar([i for i in range(len(f_selector.scores_))], f_selector.scores_, align = 'cent
plt.xlabel("feature index")
plt.xticks([i for i in range(len(f_selector.scores_))], Cabernet_SauvignonColumnList[:
plt.ylabel("Estimated MI value")
# plt.rcParams["figure.figsize"] = (30,10)
3. Correlation Matrix with HeatMap
corrmat = Cabernet_Sauvignon.corr()
top_corr_features = corrmat.index
plt.figure(figsize = (20,20))
#plot heatmap
g = sns.heatmap(Cabernet_Sauvignon[top_corr_features].corr(), annot= True, cmap='RdYlG
By looking at the correlation matrix above we can gain the following insights:
1. volatile acidity and chlorides are highly negatively correlated with type.
2. alcohol is highly negatively correlated with density.
3. total sulfur dioxide is highly positively correlated with type.
Looking at the three feature importance methods above, we can see that volatile acidity, chlorides,
density and alcohol are consistently the most important features for predicting quality.
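Reading a 13 by 13 heatmap by eye is error-prone, so the correlations with the target can also be listed directly. The sketch below ranks features by absolute correlation with `quality` on synthetic data; the columns and coefficients are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
alcohol = rng.normal(10.5, 1.2, size=300)

# Synthetic frame: quality depends on alcohol, not on the noise column
toy = pd.DataFrame({
    "alcohol": alcohol,
    "noise": rng.normal(size=300),
    "quality": 0.5 * alcohol + rng.normal(scale=0.5, size=300),
})

ranking = (
    toy.corr()["quality"]      # correlations of every column with the target
    .drop("quality")           # remove the trivial self-correlation
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

On the wine data the same chain can be applied to `Cabernet_Sauvignon.corr()`.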
Outlier Treatment
Q1fixed,Q3fixed = np.percentile(Cabernet_Sauvignon['fixed acidity'] , [25,75])
IQRfixed = Q3fixed - Q1fixed
Ufixed_acidity = Q3fixed + 1.5*IQRfixed
Lfixed_acidity = Q1fixed - 1.5*IQRfixed
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['fixed acidity'] < Lfixe
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['fixed acidity'] > Ufixe
Q1volatile,Q3volatile = np.percentile(Cabernet_Sauvignon['volatile acidity'] , [25,75]
IQRvolatile = Q3volatile - Q1volatile
Uvolatile_acidity = Q3volatile + 1.5*IQRvolatile
Lvolatile_acidity= Q1volatile - 1.5*IQRvolatile
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['volatile acidity'] < Lv
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['volatile acidity'] > Uv
Q1citric,Q3citric = np.percentile(Cabernet_Sauvignon['citric acid'] , [25,75])
IQRcitric = Q3citric - Q1citric
Ucitric_acid = Q3citric + 1.5*IQRcitric
Lcitric_acid= Q1citric - 1.5*IQRcitric
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['citric acid'] < Lcitric
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['citric acid'] > Ucitric
Q1residual,Q3residual = np.percentile(Cabernet_Sauvignon['residual sugar'] , [25,75])
IQRresidual = Q3residual - Q1residual
Uresidual_sugar = Q3residual + 1.5*IQRresidual
Lresidual_sugar= Q1residual - 1.5*IQRresidual
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['residual sugar'] < Lres
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['residual sugar'] > Ures
Q1chlorides,Q3chlorides = np.percentile(Cabernet_Sauvignon['chlorides'] , [25,75])
IQRchlorides = Q3chlorides - Q1chlorides
Uchlorides = Q3chlorides + 1.5*IQRchlorides
# Cabernet_Sauvignon['chlori
Lchlorides= Q1chlorides - 1.5*IQRchlorides
# Cabernet_Sauvignon['chlori
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['chlorides'] < Lchloride
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['chlorides'] > Uchloride
Q1free_sulfur,Q3free_sulfur = np.percentile(Cabernet_Sauvignon['free sulfur dioxide']
IQRfree_sulfur = Q3free_sulfur - Q1free_sulfur
Ufree_sulfur_dioxide = Q3free_sulfur + 1.5*IQRfree_sulfur
Lfree_sulfur_dioxide= Q1free_sulfur - 1.5*IQRfree_sulfur
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['free sulfur dioxide'] <
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['free sulfur dioxide'] >
Q1total_sulfur,Q3total_sulfur = np.percentile(Cabernet_Sauvignon['total sulfur dioxide
IQRtotal_sulfur = Q3total_sulfur - Q1total_sulfur
Utotal_sulfur_dioxide = Q3total_sulfur + 1.5*IQRtotal_sulfur
Ltotal_sulfur_dioxide= Q1total_sulfur - 1.5*IQRtotal_sulfur
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['total sulfur dioxide']
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['total sulfur dioxide']
Q1sulphates,Q3sulphates = np.percentile(Cabernet_Sauvignon['sulphates'] , [25,75])
IQRsulphates = Q3sulphates - Q1sulphates
Usulphates = Q3sulphates + 1.5*IQRsulphates
Lsulphates= Q1sulphates - 1.5*IQRsulphates
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['sulphates'] < Lsulphate
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['sulphates'] > Usulphate
Q1alcohol,Q3alcohol = np.percentile(Cabernet_Sauvignon['alcohol'] , [25,75])
IQRalcohol = Q3alcohol - Q1alcohol
Ualcohol = Q3alcohol + 1.5*IQRalcohol
Lalcohol= Q1alcohol - 1.5*IQRalcohol
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['alcohol'] < Lalcohol].i
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['alcohol'] > Ualcohol].i
Q1pH,Q3pH = np.percentile(Cabernet_Sauvignon['pH'] , [25,75])
IQRpH = Q3pH - Q1pH
UpH = Q3pH + 1.5*IQRpH
LpH= Q1pH - 1.5*IQRpH
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['pH'] < LpH].index, inpl
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['pH'] > UpH].index, inpl
Q1density,Q3density = np.percentile(Cabernet_Sauvignon['density'] , [25,75])
IQRdensity = Q3density - Q1density
Udensity = Q3density + 1.5*IQRdensity
Ldensity= Q1density - 1.5*IQRdensity
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['density'] < Ldensity].i
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['density'] > Udensity].i
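The cells above apply the same 1.5 × IQR rule to each column in turn. A compact sketch of that procedure is shown below; the helper name `drop_iqr_outliers` and the toy data are assumptions for illustration. Like the cells above, it recomputes the quartiles on the rows that survived the previous columns.

```python
import numpy as np
import pandas as pd

def drop_iqr_outliers(df, columns, k=1.5):
    """Drop rows outside [Q1 - k*IQR, Q3 + k*IQR], one column at a time."""
    out = df.copy()
    for col in columns:
        q1, q3 = np.percentile(out[col], [25, 75])
        iqr = q3 - q1
        keep = out[col].between(q1 - k * iqr, q3 + k * iqr)
        out = out[keep]
    return out

# Toy data with one obvious pH outlier (illustrative only)
toy = pd.DataFrame({
    "pH": [3.0, 3.1, 3.2, 3.3, 9.0],
    "alcohol": [9.0, 10.0, 11.0, 12.0, 10.5],
})
cleaned = drop_iqr_outliers(toy, ["pH", "alcohol"])
print(cleaned)
```

Because the filter is sequential, the order of the columns can change which rows are removed.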
chlorides su
count 4598.000000 4598.000000 4598.000000 4598.000000 4598.000000 4598.000000 4598.0
mean 0.921923 6.911398 0.284059 0.320317 5.939374 0.044548 33.0
std 0.268323 0.832672 0.101024 0.089928 4.743293 0.012699 15.3
min 0.000000 4.700000 0.080000 0.090000 0.600000 0.009000 2.0
25% 1.000000 6.400000 0.210000 0.260000 1.800000 0.036000 22.0
50% 1.000000 6.800000 0.270000 0.310000 4.600000 0.043000 32.0
75% 1.000000 7.400000 0.330000 0.370000 8.987500 0.051000 44.0
max 1.000000 9.600000 0.645000 0.560000 18.950000 0.081000 78.0
# Cabernet_Sauvignon.drop([9])
Cabernet_Sauvignon_cleaned_x,Cabernet_Sauvignon_cleaned_y = Cabernet_Sauvignon.iloc[:,
(4598, 12)
Cabernet_Sauvignon_cleaned_x = scaler.fit_transform(Cabernet_Sauvignon_cleaned_x)
#Splitting the dataset after outlier treatment into Train and Test sets at 80-20 propo
Xclean_train, Xclean_test, yclean_train, yclean_test = train_test_split(Cabernet_Sauvi
ax = sns.boxplot(data=Cabernet_Sauvignon_cleaned_x)
[Box plot of the standardized predictors after outlier treatment; x-axis tick labels: type, fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol]
##Linear Regression
# lr = LinearRegression(copy_X= True, fit_intercept = True, normalize = True)
lr.fit(Xclean_train, yclean_train)
lrclean_pred= lr.predict(Xclean_test)
# model2 = RandomForestRegressor(random_state=1, n_estimators=1000)
model2.fit(Xclean_train, yclean_train)
Rmclean_pred = model2.predict(Xclean_test)
model3.fit(Xclean_train, yclean_train)
Dtclean_pred = model3.predict(Xclean_test)
print('-------------Linear Regression-----------')
print('MAE: %f'% mae1)
print('RMSE: %f'% rmse1)
print('R2: %f' % r21)
print('MAE: %f'% mean_absolute_error(yclean_test, lrclean_pred))
print('RMSE: %f'% np.sqrt(mean_squared_error(yclean_test, lrclean_pred)))
print('R2: %f' % r2_score(yclean_test, lrclean_pred))
print('-------------Random forest-----------')
print('MAE: %f'% mae2)
print('RMSE: %f'% rmse2)
print('R2: %f' % r22)
print('MAE: %f'% mean_absolute_error(yclean_test, Rmclean_pred))
print('RMSE: %f'% np.sqrt(mean_squared_error(yclean_test, Rmclean_pred)))
print('R2: %f' % r2_score(yclean_test, Rmclean_pred))
print('-------------Descision Tree-----------')
print('MAE: %f'% mae3)
print('RMSE: %f'% rmse3)
print('R2: %f' % r23)
print('MAE: %f'% mean_absolute_error(yclean_test, Dtclean_pred))
print('RMSE: %f'% np.sqrt(mean_squared_error(yclean_test, Dtclean_pred)))
print('R2: %f' % r2_score(yclean_test, Dtclean_pred))
-------------Linear Regression-----------
MAE: 0.545152
RMSE: 0.686665
R2: 0.340363
MAE: 0.578749
RMSE: 0.748469
R2: 0.274277
-------------Random forest-----------
MAE: 0.401750
RMSE: 0.561165
R2: 0.559449
MAE: 0.438112
RMSE: 0.622107
R2: 0.498635
-------------Descision Tree-----------
MAE: 0.541020
RMSE: 0.696854
R2: 0.320642
MAE: 0.586013
RMSE: 0.756198
R2: 0.259211
The results show that the two phases give different predictions, although the difference per metric
is modest. MAE and RMSE are higher in Phase 2, which means the prediction error increased after
outlier removal, and R2 decreased for every model (for example from 0.56 to 0.50 for the random
forest).
Remove outliers and keep outliers (does it have an effect on the final predictive model)? An MAE
of 0 would indicate no error, in other words a perfect prediction. The results above show that all
models have a noticeable error, especially in Phase 2. RMSE gives an idea of how much error the
system typically makes in its predictions, and it became worse after removing the outliers. R2
represents the proportion of the variance of the dependent variable that is explained by the
independent variables.
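To make the three metrics concrete, the sketch below computes them by hand for four made-up predictions. The numbers are illustrative assumptions, not values from the models above.

```python
import numpy as np

# Made-up true and predicted quality scores (illustrative only)
y_true = np.array([5.0, 6.0, 6.0, 7.0])
y_pred = np.array([5.5, 6.0, 5.0, 7.5])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                      # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))               # root mean squared error
ss_res = np.sum(errors ** 2)                       # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
r2 = 1.0 - ss_res / ss_tot                         # coefficient of determination

print(mae, rmse, r2)
```

RMSE is larger than MAE here because squaring gives the single error of 1.0 more weight than the two errors of 0.5.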
Cabernet_Sauvignon_class = Cabernet_Sauvignon
Cabernet_Sauvignon_imputation= Cabernet_Sauvignon
quality_mapping = { 3 : 'Low', 4 : 'Low', 5: 'Medium', 6 : 'Medium', 7: 'Medium', 8 :
Cabernet_Sauvignon_class['quality'] = Cabernet_Sauvignon_class['quality'].map(quality
Cabernet_Sauvignon_class_x,Cabernet_Sauvignon_class_y = Cabernet_Sauvignon.iloc[:,:12]
Cabernet_Sauvignon_class_x = scaler.fit_transform(Cabernet_Sauvignon_class_x)
#Splitting the dataset after classifying quality to class into Train and Test sets at
Xclass_train, Xclass_test, yclass_train, yclass_test = train_test_split(Cabernet_Sauvi
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# creating a RF classifier
clf = RandomForestClassifier(n_estimators = 1000)
# Training the model on the training dataset
# fit function is used to train the model using the training sets as parameters
clf.fit(Xclass_train, yclass_train)
# performing predictions on the test dataset
yclass_pred = clf.predict(Xclass_test)
# metrics are used to find accuracy or error
from sklearn import metrics
# using metrics module for accuracy calculation
print("ACCURACY OF THE MODEL: ", metrics.accuracy_score(yclass_test, yclass_pred))
print(classification_report(yclass_test, yclass_pred))
ACCURACY OF THE MODEL: 0.9456521739130435
precision recall f1-score support
High 1.00 0.34 0.51 38
Low 0.00 0.00 0.00 24
Medium 0.95 1.00 0.97 858
accuracy 0.95 920
macro avg 0.65 0.45 0.49 920
weighted avg 0.92 0.95 0.93 920
quality_mapping_again = { 'Low':0, 'Medium':1, 'High':2}
yclass_test =
yclass_pred_new = [s.replace('Medium', '1') for s in yclass_pred]
yclass_pred_new = [s.replace('Low', '0') for s in yclass_pred_new]
yclass_pred_new = [s.replace('High', '2') for s in yclass_pred_new]
yclass_pred_new = [int(item) for item in yclass_pred_new]
plt.figure(figsize=(5, 7))
ax = sns.distplot(yclass_test, hist=False, color="r", label="Actual Value")
sns.distplot(yclass_pred_new, hist=False, color="b", label="Fitted Values" , ax=ax)
plt.title('Actual vs Fitted Values for Quality')
As we can see here, the accuracy of the classification model (about 0.95) looks far better than the
R2 of any regression method used in Phase 1, although accuracy and R2 are not directly
comparable. Note also that 858 of the 920 test wines are Medium and that the model identifies
none of the Low wines. One interpretation: wine tastings are generally blind tastings, and even for
the best wine connoisseurs it is very difficult to differentiate between a quality of 7 and 8. Also,
judging the quality of a wine by how it tastes is very subjective. Often it is how the product is
marketed and promoted that forms the general opinion of the targeted audience.
That being said, a good wine is a good wine. Based on the chemical composition of the wine itself,
we can at least say whether it is a good or a bad one. So, when a model is asked to place a wine in
a category, it achieves a much higher accuracy, as classifying into bins is easier than predicting a
precise quality score.
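A useful reference point for the reported accuracy is the score of a model that always predicts the most frequent class. The sketch below computes it from the support column of the classification report above; only the variable names are assumptions.

```python
# Class counts taken from the "support" column of the classification report
support = {"High": 38, "Low": 24, "Medium": 858}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

# Unweighted mean of the per-class recalls listed in the report
macro_recall = (0.34 + 0.00 + 1.00) / 3

print(majority_class, round(baseline_accuracy, 4), round(macro_recall, 2))
```

With such an imbalanced target, the macro-averaged recall is a more informative summary than plain accuracy, because it weights the three classes equally.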
Data Imputation
Remove 1%, 5%, and 10% of your data randomly and impute the values back using at least 3
imputation methods. How well did the methods recover the missing values? That is, remove some
data, check the % error on the residuals for numeric data, and check the bias and variance of the error.
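The evaluation described above can be sketched as follows: hide known values, impute them, and compare the imputed values with the hidden truth. The example uses synthetic data and mean imputation as the method under test; the variable names, the 5% fraction and the seeds are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
truth = pd.Series(rng.normal(10.5, 1.2, size=1000), name="alcohol")

# Hide 5% of the values at random and remember which rows were hidden
masked = truth.copy()
hidden = masked.sample(frac=0.05, random_state=0).index
masked.loc[hidden] = np.nan

# Method under test: mean imputation
imputed = masked.fillna(masked.mean())

# Compare imputed values with the hidden ground truth
residuals = imputed.loc[hidden] - truth.loc[hidden]
percent_error = 100 * (residuals.abs() / truth.loc[hidden].abs()).mean()
bias = residuals.mean()        # systematic over- or under-estimation
variance = residuals.var()     # spread of the imputation error

print(percent_error, bias, variance)
```

Comparing only column-level statistics such as the mean or standard deviation can hide large per-row errors, which is why the residuals are taken on the hidden rows alone.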
Imputation 1
Cabernet_Sauvignon_imputation['1_percent'] = Cabernet_Sauvignon_imputation[['alcohol']
Cabernet_Sauvignon_imputation['5_percent'] = Cabernet_Sauvignon_imputation[['alcohol']
Cabernet_Sauvignon_imputation['10_percent'] = Cabernet_Sauvignon_imputation[['alcohol'
1 1.0 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940
2 1.0 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951
3 1.0 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956
4 1.0 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956
5 1.0 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951
def get_percent_missing(dataframe):
    percent_missing = dataframe.isnull().sum() * 100 / len(dataframe)
    missing_value_Cabernet_Sauvignon = pd.DataFrame({'column_name': dataframe.columns,
                                                     'percent_missing': percent_missing})
    return missing_value_Cabernet_Sauvignon
column_name percent_missing
type type 0.0
fixed acidity fixed acidity 0.0
volatile acidity volatile acidity 0.0
citric acid citric acid 0.0
residual sugar residual sugar 0.0
chlorides chlorides 0.0
free sulfur dioxide free sulfur dioxide 0.0
total sulfur dioxide total sulfur dioxide 0.0
density density 0.0
pH pH 0.0
sulphates sulphates 0.0
alcohol alcohol 0.0
quality quality 0.0
1_percent 1_percent 0.0
5_percent 5_percent 0.0
10_percent 10_percent 0.0
def create_missing(dataframe, percent, col):
    dataframe.loc[dataframe.sample(frac = percent).index, col] = np.nan
create_missing(Cabernet_Sauvignon_imputation, 0.01, '1_percent')
create_missing(Cabernet_Sauvignon_imputation, 0.05, '5_percent')
create_missing(Cabernet_Sauvignon_imputation, 0.1, '10_percent')
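Since `sample` draws different rows on every run, the masked positions change each time the cell is executed. A reproducible variant might look like the sketch below; the `seed` parameter and the returned index are additions for illustration, not part of the function above.

```python
import numpy as np
import pandas as pd

def create_missing(dataframe, percent, col, seed=0):
    """Set a random fraction `percent` of column `col` to NaN and return the affected index."""
    idx = dataframe.sample(frac=percent, random_state=seed).index
    dataframe.loc[idx, col] = np.nan
    return idx

# Toy frame with a copy of the column that will receive the missing values
toy = pd.DataFrame({"alcohol": np.linspace(8.0, 14.0, 200)})
toy["10_percent"] = toy["alcohol"]

masked_idx = create_missing(toy, 0.10, "10_percent")
print(len(masked_idx), toy["10_percent"].isna().sum())
```

Returning the index makes it easy to compare imputed values with the untouched `alcohol` column at exactly the masked rows.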
column_name percent_missing
type type 0.000000
fixed acidity fixed acidity 0.000000
volatile acidity volatile acidity 0.000000
citric acid citric acid 0.000000
residual sugar residual sugar 0.000000
chlorides chlorides 0.000000
free sulfur dioxide free sulfur dioxide 0.000000
total sulfur dioxide total sulfur dioxide 0.000000
density density 0.000000
pH pH 0.000000
sulphates sulphates 0.000000
alcohol alcohol 0.000000
quality quality 0.000000
1_percent 1_percent 1.000435
5_percent 5_percent 5.002175
10_percent 10_percent 10.004350
# Store the index of NaN values in each column
number_1_idx = list(np.where(Cabernet_Sauvignon_imputation['1_percent'].isna())[0])
number_5_idx = list(np.where(Cabernet_Sauvignon_imputation['5_percent'].isna())[0])
number_10_idx = list(np.where(Cabernet_Sauvignon_imputation['10_percent'].isna())[0])
print(f"Length of number_1_idx is {len(number_1_idx)} and it contains {(len(number_1_i
print(f"Length of number_5_idx is {len(number_5_idx)} and it contains {(len(number_5_i
print(f"Length of number_10_idx is {len(number_10_idx)} and it contains {(len(number_1
Length of number_1_idx is 46 and it contains 1.0004349717268377% of total data in
Length of number_5_idx is 230 and it contains 5.002174858634189% of total data in
Length of number_10_idx is 460 and it contains 10.004349717268378% of total data
Imputation 2
KNN Imputation
The k-nearest neighbours algorithm is commonly used for simple classification. It uses ‘feature
similarity’ to predict the values of new data points: a new point is assigned a value based on how
closely it resembles the points in the training set.
#Creating a separate dataframe for performing the KNN imputation
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler
Cabernet_Sauvignon_imputation1 = Cabernet_Sauvignon_imputation[['1_percent','5_percent
imputer = KNNImputer(n_neighbors=5)
imputed_number_Cabernet_Sauvignon = pd.DataFrame(imputer.fit_transform(Cabernet_Sauvig
# imputed_number_Cabernet_Sauvignon.sample(10)
column_name percent_missing
1_percent 1_percent 0.0
5_percent 5_percent 0.0
10_percent 10_percent 0.0
alcohol = Cabernet_Sauvignon["alcohol"]
imputed_mean = pd.concat([alcohol,imputed_number_Cabernet_Sauvignon])
imputed_mean.columns = ["Alcohol","1_Percent","5_Percent","10_Percent"]
Alcohol 1.470385
1_Percent 1.470326
5_Percent 1.470391
10_Percent 1.470429
dtype: float64
The KNN-based method showed negligible variability: the standard deviations of the imputed
columns stay within about 0.0001 of the original 1.4704. This method is therefore acceptable for
the current dataset.
Mean-based Imputation with SimpleImputer
This works by calculating the mean/median of the non-missing values in a column and then
replacing the missing values within each column separately and independently of the others. It can
only be used with numeric data.
Cabernet_Sauvignon_imputation_mean = Cabernet_Sauvignon_imputation[['1_percent','5_per
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer( strategy='mean') #for median imputation replace 'mean' with
imputed_train_Cabernet_Sauvignon = imp_mean.transform(Cabernet_Sauvignon_imputation_me
imputed_mean = pd.DataFrame(imp_mean.fit_transform(Cabernet_Sauvignon_imputation_mean)
column_name percent_missing
1_percent 1_percent 0.0
5_percent 5_percent 0.0
10_percent 10_percent 0.0
alcohol = Cabernet_Sauvignon["alcohol"]
combined_mean = pd.concat([alcohol,imputed_mean])
0 10.587102
10_percent 10.588810
1_percent 10.586540
5_percent 10.581520
dtype: float64
0 1.470385
10_percent 1.320797
1_percent 1.456402
5_percent 1.395375
dtype: float64
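The falling standard deviation in the output above is expected: every imputed value sits exactly at the column mean, so the mean is unchanged while the spread shrinks as the share of filled values grows. A minimal sketch on synthetic data (the sample size, seed and 10% missing rate are assumptions for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(1)
values = rng.normal(10.5, 1.5, size=(1000, 1))

# Knock out 10% of the entries.
masked = values.copy()
missing_idx = rng.choice(1000, size=100, replace=False)
masked[missing_idx, 0] = np.nan

imputed = SimpleImputer(strategy="mean").fit_transform(masked)

observed_mean = np.nanmean(masked)
print("mean of observed values:", round(observed_mean, 4))
print("mean after imputation  :", round(imputed.mean(), 4))
print("std of observed values :", round(np.nanstd(masked, ddof=1), 4))
print("std after imputation   :", round(imputed.std(ddof=1), 4))
```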
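The mean-based method preserves the column mean, but the standard deviation shrinks as the share of
imputed values grows (1.456, 1.395 and 1.321 against 1.470 for the original column). The change is
still small, so this method is acceptable for the current dataset.
Imputation 3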
Imputation Using Multivariate Imputation by Chained Equations (MICE) This type of imputation works
by filling the missing data multiple times. Multiple imputations (MIs) are generally preferable to
a single imputation because they capture the uncertainty of the missing values better. The chained
equations approach is also very flexible and can handle variables of different data types
(i.e., continuous or binary) as well as complexities such as bounds or survey skip patterns.
Cabernet_Sauvignon_imputation_mice = Cabernet_Sauvignon_imputation[['1_percent','5_per
column_name percent_missing
1_percent 1_percent 1.000435
5_percent 5_percent 5.002175
10_percent 10_percent 10.004350
!pip install impyute
from impyute.imputation.cs import mice
# start the MICE training
imputed_training = pd.DataFrame(imputed_training)
imputed_training.columns = ("1_percent","5_percent","10_percent")
# imputed_mice = pd.DataFrame(imputed_training.fit_transform(Cabernet_Sauvignon_imputa
column_name percent_missing
1_percent 1_percent 0.0
5_percent 5_percent 0.0
10_percent 10_percent 0.0
alcohol = Cabernet_Sauvignon["alcohol"]
combined_mice = pd.concat([alcohol,imputed_training])
combined_mice.columns = ["Alcohol","1_Percent","5_Percent","10_Percent"]
Alcohol 10.587102
1_Percent 10.586915
5_Percent 10.587098
10_Percent 10.586915
dtype: float64
Alcohol 1.470385
1_Percent 1.467981
5_Percent 1.470375
10_Percent 1.467981
dtype: float64
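The MICE method showed a negligible change in variability: the standard deviations of the imputed
columns (1.4680 to 1.4704) stay very close to that of the original column (1.4704). Therefore this
method is acceptable for the current dataset.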
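The chained-equations idea can also be sketched with scikit-learn's `IterativeImputer`, which is swapped in here in place of the `impyute` call used above. It models each incomplete column as a function of the other columns and cycles until the fills stabilise; note that it returns a single imputation per run rather than several. The synthetic columns, seed and `max_iter` value are assumptions for illustration.

```python
import numpy as np
# IterativeImputer is still experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(2)
a = rng.normal(10.5, 1.5, size=400)
b = 2.0 * a + rng.normal(0.0, 0.3, size=400)
data = np.column_stack([a, b])

# Remove 10% of column 0; column 1 stays complete and acts as a predictor.
masked = data.copy()
missing_idx = rng.choice(400, size=40, replace=False)
masked[missing_idx, 0] = np.nan

imputer = IterativeImputer(max_iter=10, random_state=0)
filled = imputer.fit_transform(masked)

# Error of the fills against the values that were hidden.
mae = np.abs(filled[missing_idx, 0] - data[missing_idx, 0]).mean()
print("mean absolute error of imputed values:", round(mae, 4))
```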
#Install AutoML library - PyCaret
!pip install pycaret
from scipy import stats
# import math
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
#Reading Data
Chateau_Montelena_AutoML = pd.read_csv('
Chateau_Montelena_AutoMLM = Chateau_Montelena_AutoML.copy()
Chateau_Montelena_AutoMLB = Chateau_Montelena_AutoML.copy()
Each row represents a wine; each column contains one of the wine’s attributes, such as type,
sulphates or chlorides, plus the target label 'quality'.
Problem Statement
Binary Classification: Predict the quality of wine, i.e. Low or High.
Multiclass Classification: Predict the quality of wine, i.e. Low, Medium or High.
Regression: Predict the quality of wine between 3-9 based on the independent predictor
Dataset - Wine Quality
count 6487.000000 6489.000000 6494.000000 6495.000000 6495.000000 6497.000000 6497.0
mean 7.216579 0.339691 0.318722 5.444326 0.056042 30.525319 115.7
std 1.296750 0.164649 0.145265 4.758125 0.035036 17.749400 56.5
min 3.800000 0.080000 0.000000 0.600000 0.009000 1.000000 6.0
25% 6.400000 0.230000 0.250000 1.800000 0.038000 17.000000 77.0
50% 7.000000 0.290000 0.310000 3.000000 0.047000 29.000000 118.0
75% 7.700000 0.400000 0.390000 8.100000 0.065000 41.000000 156.0
max 15.900000 1.580000 1.660000 65.800000 0.611000 289.000000 440.0
Dataset Shape: (6497, 13)
Name dtypes Missing Uniques Sample Value Entropy
0 type object 0 2 white 0.24
1 fixed acidity float64 10 106 7.0 1.65
2 volatile acidity float64 8 187 0.27 1.79
3 citric acid float64 3 89 0.36 1.70
4 residual sugar float64 2 316 20.7 2.08
5 chlorides float64 2 214 0.045 1.90
6 free sulfur dioxide float64 0 135 45.0 1.82
7 total sulfur dioxide float64 0 276 170.0 2.32
8 density float64 0 998 1.001 2.70
9 pH float64 9 108 3.0 1.81
10 sulphates float64 4 111 0.45 1.72
11 alcohol float64 0 111 8.8 1.66
12 quality int64 0 7 6 0.55
def tableinfo(Chateau_Montelena_AutoML):
    print(f"Dataset Shape: {Chateau_Montelena_AutoML.shape}")
    summary = pd.DataFrame(Chateau_Montelena_AutoML.dtypes,columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name','dtypes']]
    summary['Missing'] = Chateau_Montelena_AutoML.isnull().sum().values
    summary['Uniques'] = Chateau_Montelena_AutoML.nunique().values
    summary['Sample Value'] = Chateau_Montelena_AutoML.loc[0].values
    for name in summary['Name'].value_counts().index:
        summary.loc[summary['Name'] == name, 'Entropy'] = round(stats.entropy(Chateau_
    return summary
Entropy is defined as the randomness or measuring the disorder of the information being
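As an illustration, the Shannon entropy of a column can be computed from its relative value frequencies with `scipy.stats.entropy`. The helper name `column_entropy` and the choice of base 2 are assumptions; the corresponding line in `tableinfo` is cut off, so the exact arguments used there are not visible.

```python
import pandas as pd
from scipy import stats


def column_entropy(series: pd.Series, base: float = 2) -> float:
    """Shannon entropy of the empirical distribution of a column."""
    probabilities = series.value_counts(normalize=True)
    return float(stats.entropy(probabilities, base=base))


balanced = pd.Series(["red", "white", "red", "white"])
constant = pd.Series(["white"] * 4)

print(column_entropy(balanced))  # two equally likely values: 1 bit
print(column_entropy(constant))  # a single value carries no uncertainty
```

A column whose values are spread evenly over many levels scores high, while a column dominated by one value scores close to zero.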
Actions required for data preparation:
Converting 'Type' to an integer data type. Encoding categorical features.
4/26/23, 1:22 PM Wine_Quality_report (2).ipynb - Colaboratory 37/93
print(round(Chateau_Montelena_AutoML['quality'].value_counts(normalize=True) * 100,2))
6 43.65
5 32.91
7 16.61
4 3.32
8 2.97
3 0.46
9 0.08
Name: quality, dtype: float64
Chateau_Montelena_AutoML['type'] = Chateau_Montelena_AutoML['type'].astype("category")
Chateau_Montelena_AutoML_copy = Chateau_Montelena_AutoML.copy()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 type 6497 non-null int8
1 fixed acidity 6487 non-null float64
2 volatile acidity 6489 non-null float64
3 citric acid 6494 non-null float64
4 residual sugar 6495 non-null float64
5 chlorides 6495 non-null float64
6 free sulfur dioxide 6497 non-null float64
7 total sulfur dioxide 6497 non-null float64
8 density 6497 non-null float64
9 pH 6488 non-null float64
10 sulphates 6493 non-null float64
11 alcohol 6497 non-null float64
12 quality 6497 non-null int64
dtypes: float64(11), int64(1), int8(1)
memory usage: 615.6 KB
Analyzing the numeric features
plot , ax = plt.subplots( 4,3 , figsize = (35 , 20))
g = sns.histplot(Chateau_Montelena_AutoML['type'] , kde = True , ax = ax[0][0])
g = sns.histplot(Chateau_Montelena_AutoML['fixed acidity'] , kde = True , ax = ax[0][1
g = sns.histplot(Chateau_Montelena_AutoML['volatile acidity'] , kde = True , ax = ax[0
g = sns.histplot(Chateau_Montelena_AutoML['citric acid'] , kde = True , ax = ax[1][0])
g = sns.histplot(Chateau_Montelena_AutoML['residual sugar'] , kde = True , ax = ax[1][
g = sns.histplot(Chateau_Montelena_AutoML['chlorides'] , kde = True , ax = ax[1][2])
g = sns.histplot(Chateau_Montelena_AutoML['density'] , kde = True , ax = ax[2][0])
g = sns.histplot(Chateau_Montelena_AutoML['pH'] , kde = True , ax = ax[2][1])
g = sns.histplot(Chateau_Montelena_AutoML['sulphates'] , kde = True , ax = ax[2][2])
g = sns.histplot(Chateau_Montelena_AutoML['alcohol'] , kde = True , ax = ax[3][0])
Observation :
Most of these numeric variables do not follow a normal distribution. Several histograms show
separate, independent peaks, which suggests that the population contains more than one underlying
distribution.
Action :
Data scaling. Many algorithms are sensitive to the scale of their inputs, so we normalize these
features to the [0, 1] range with min-max scaling. This changes the range of each feature but not
the shape of its distribution.
0 1 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.0010
1 1 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940
2 1 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951
3 1 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956
4 1 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956
from sklearn.preprocessing import MinMaxScaler,StandardScaler
mms = MinMaxScaler() # Normalization
# cust_dummies=pd.get_dummies(cust)
Chateau_Montelena_AutoML_copy['type'] = mms.fit_transform(Chateau_Montelena_AutoML_cop
Chateau_Montelena_AutoML_copy['fixed acidity'] = mms.fit_transform(Chateau_Montelena_A
Chateau_Montelena_AutoML_copy['volatile acidity'] = mms.fit_transform(Chateau_Montelen
Chateau_Montelena_AutoML_copy['citric acid']= mms.fit_transform(Chateau_Montelena_Auto
Chateau_Montelena_AutoML_copy['residual sugar']= mms.fit_transform(Chateau_Montelena_A
Chateau_Montelena_AutoML_copy['chlorides']= mms.fit_transform(Chateau_Montelena_AutoML
# Chateau_Montelena_AutoML_copy['free sulphur dioxide']= mms.fit_transform(Chateau_Mon
# Chateau_Montelena_AutoML_copy['total sulphur dioxide']= mms.fit_transform(Chateau_Mo
Chateau_Montelena_AutoML_copy['density'] = mms.fit_transform(Chateau_Montelena_AutoML_
Chateau_Montelena_AutoML_copy['pH'] = mms.fit_transform(Chateau_Montelena_AutoML_copy[
Chateau_Montelena_AutoML_copy['sulphates'] = mms.fit_transform(Chateau_Montelena_AutoM
Chateau_Montelena_AutoML_copy['alcohol'] = mms.fit_transform(Chateau_Montelena_AutoML_
sns.boxplot(data=Chateau_Montelena_AutoML_copy[['type','fixed acidity','volatile acidi
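Min-max scaling maps each column to [0, 1] via (x - min) / (max - min). A compact sketch, assuming a small hypothetical frame `df` in place of the wine data, shows that several columns can be scaled in one call and that the result matches the formula applied by hand.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical sample; the values are illustrative.
df = pd.DataFrame({
    "fixed acidity": [7.0, 6.3, 8.1, 7.2],
    "alcohol": [8.8, 9.5, 10.1, 9.9],
})

scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)

# The same result by hand: (x - min) / (max - min), column by column.
manual = (df - df.min()) / (df.max() - df.min())

print(scaled.round(3))
```

Fitting one scaler on all columns also keeps the fitted minima and maxima in a single object, which makes it easier to apply the identical transformation to new data later.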
Observation : Several features have values beyond the upper and lower whiskers of the box plots
(1.5 x Inter Quartile Range), i.e. outliers.
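The whisker rule can be written down directly: a point is flagged when it lies more than 1.5 x IQR below the first quartile or above the third. A minimal sketch; the function name `iqr_outliers` and the sample values are hypothetical.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


sample = pd.Series([1.8, 2.0, 2.2, 2.4, 2.6, 3.0, 65.8])
print(sample[iqr_outliers(sample)])
```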
corr = Chateau_Montelena_AutoML_copy.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr,mask=mask, cmap='RdYlGn')
Observation :
By looking at the correlation matrix above we can gain the following insights:
volatile acidity and chlorides are highly (-ve) correlated with type.
alcohol is highly (-ve) correlated with density.
total sulfur dioxide is highly (+ve) correlated with type.
Action :
Dropping some of the highly correlated variables.
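One way to turn the heatmap into a list of candidate columns to drop is to scan the upper triangle of the correlation matrix for pairs whose absolute correlation exceeds a threshold. This is a sketch under stated assumptions: the synthetic columns and the helper `high_corr_pairs` are hypothetical, and the 0.9 threshold is an illustrative choice.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """Return (col_a, col_b, r) for column pairs with |r| above threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = []
    for i, j in zip(*np.where(upper)):
        r = corr.iloc[i, j]
        if abs(r) > threshold:
            pairs.append((corr.index[i], corr.columns[j], round(float(r), 3)))
    return pairs


rng = np.random.default_rng(3)
base = rng.normal(size=300)
demo = pd.DataFrame({
    "a": base,
    "b": -base + rng.normal(scale=0.1, size=300),  # strongly (-ve) correlated with a
    "c": rng.normal(size=300),                     # unrelated
})
print(high_corr_pairs(demo))
```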
Target Variable = Quality between 3-9
!pip install numba==0.53
from pycaret.regression import *
s = setup(Chateau_Montelena_AutoML, target = 'quality',train_size=0.8,
silent = True)
Description Value
0 session_id 6943
1 Target quality
2 Original Data (6497, 13)
3 Missing Values True
4 Numeric Features 12
5 Categorical Features 0
6 Ordinal Features False
7 High Cardinality Features False
8 High Cardinality Method None
9 Transformed Train Set (4937, 12)
10 Transformed Test Set (1300, 12)
11 Shuffle Train-Test True
12 Stratify Train-Test False
13 Fold Generator KFold
14 Fold Number 5
15 CPU Jobs -1
16 Use GPU False
17 Log Experiment False
18 Experiment Name reg-default-name
19 USI 900b
20 Imputation Type simple
21 Iterative Imputation Iteration None
22 Numeric Imputer mean
23 Iterative Imputation Numeric Model None
24 Categorical Imputer constant
25 Iterative Imputation Categorical Model None
26 Unknown Categoricals Handling least_frequent
27 Normalize True
28 Normalize Method minmax
29 Transformation False
30 Transformation Method None
31 PCA False
32 PCA Method None
33 PCA Components None
34 Ignore Low Variance False
35 Combine Rare Levels False
36 Rare Level Threshold None
37 Numeric Binning False
38 Remove Outliers True
39 Outliers Threshold 0.05
40 Remove Multicollinearity True
41 Multicollinearity Threshold 0.9
42 Remove Perfect Collinearity True
43 Clustering False
44 Clustering Iteration None
45 Polynomial Features False
46 Polynomial Degree None
47 Trignometry Features False
48 Polynomial Threshold None
49 Group Features False
50 Feature Selection False
51 Feature Selection Method classic
52 Features Selection Threshold None
53 Feature Interaction False
54 Feature Ratio False
55 Interaction Threshold None
56 Transform Target False
57 Transform Target Method box-cox
Model                         MAE     MSE     RMSE    R2       RMSLE   MAPE    TT (Sec)
et Extra Trees Regressor      0.3974  0.3534  0.5941  0.5312   0.0890  0.0710  1.150
rf Random Forest Regressor    0.4454  0.3757  0.6124  0.5018   0.0916  0.0793  2.436
Light Gradient Boosting       0.4847  0.4085  0.6388  0.4577   0.0951  0.0857  0.190
Extreme Gradient              0.4631  0.4104  0.6404  0.4548   0.0955  0.0821  0.590
Gradient Boosting             0.5298  0.4610  0.6786  0.3880   0.1006  0.0934  1.008
knn K Neighbors Regressor     0.5362  0.5059  0.7111  0.3280   0.1055  0.0950  0.082
ada AdaBoost Regressor        0.5725  0.5243  0.7235  0.3048   0.1074  0.1015  0.600
lr Linear Regression          0.5643  0.5288  0.7268  0.2982   0.1074  0.0995  0.588
lar Least Angle Regression    0.5643  0.5288  0.7268  0.2982   0.1074  0.0995  0.012
br Bayesian Ridge             0.5645  0.5289  0.7269  0.2981   0.1074  0.0995  0.012
ridge Ridge Regression        0.5652  0.5296  0.7273  0.2972   0.1075  0.0996  0.010
huber Huber Regressor         0.5636  0.5301  0.7277  0.2965   0.1074  0.0990  0.102
Orthogonal Matching           0.6133  0.5987  0.7733  0.2056   0.1145  0.1086  0.010
dt Decision Tree Regressor    0.5058  0.7132  0.8440  0.0484   0.1252  0.0889  0.046
lasso Lasso Regression        0.6772  0.7537  0.8678  -0.0004  0.1277  0.1203  0.012
en Elastic Net                0.6772  0.7537  0.8678  -0.0004  0.1277  0.1203  0.014
Lasso Least Angle             0.6772  0.7537  0.8678  -0.0004  0.1277  0.1203  0.012
dummy Dummy Regressor         0.6772  0.7537  0.8678  -0.0004  0.1277  0.1203  0.014
Passive Aggressive            0.8006  0.9957  0.9905  -0.3256  0.1469  0.1372  0.014
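For reference, the error metrics that a PyCaret regression comparison reports (MAE, MSE, RMSE, R2, RMSLE, MAPE) can be computed by hand with scikit-learn. The toy `y_true` and `y_pred` arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    mean_squared_log_error,
    r2_score,
)

# Toy quality scores and predictions (illustrative only).
y_true = np.array([5, 6, 6, 7, 5, 8])
y_pred = np.array([5.4, 5.8, 6.3, 6.5, 5.1, 7.2])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
r2 = r2_score(y_true, y_pred)
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))
mape = mean_absolute_percentage_error(y_true, y_pred)

for name, value in [("MAE", mae), ("MSE", mse), ("RMSE", rmse),
                    ("R2", r2), ("RMSLE", rmsle), ("MAPE", mape)]:
    print(f"{name:6s}{value:.4f}")
```

The same relationship holds in the comparison table: for the Extra Trees row, 0.5941 squared is approximately the reported MSE of 0.3534.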
tuned_model = tune_model(best)
#Creating Models
lightgbm = create_model('lightgbm');
et = create_model('et');
rf = create_model('rf');
#Blending the top 3 models
blend = blend_models(estimator_list=[lightgbm,et,rf])
Fold  MAE     MSE     RMSE    R2      RMSLE   MAPE
0     0.4346  0.3397  0.5828  0.5165  0.0863  0.0763
1     0.4596  0.4069  0.6379  0.5017  0.0952  0.0819
2     0.4237  0.3473  0.5893  0.5462  0.0889  0.0761
3     0.4418  0.3806  0.6169  0.5084  0.0937  0.0797
4     0.4261  0.3356  0.5793  0.5262  0.0858  0.0747
Mean  0.4372  0.3620  0.6012  0.5198  0.0900  0.0777
Std   0.0129  0.0275  0.0226  0.0155  0.0038  0.0026
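Blending averages the predictions of several fitted models. Outside PyCaret, the same idea can be sketched with scikit-learn's `VotingRegressor`. This is a stand-in, not the notebook's code, and it substitutes `GradientBoostingRegressor` for LightGBM so that the example needs no extra dependency. The synthetic data and hyperparameters are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (illustrative only).
X, y = make_regression(n_samples=300, n_features=5, n_informative=3,
                       noise=10.0, random_state=0)

# The blended prediction is the plain average of the three base models.
blend = VotingRegressor(estimators=[
    ("gbr", GradientBoostingRegressor(random_state=0)),
    ("et", ExtraTreesRegressor(n_estimators=100, random_state=0)),
    ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
])

# 5-fold cross-validated R2 of the averaged prediction.
scores = cross_val_score(blend, X, y, cv=5, scoring="r2")
print("R2 per fold:", scores.round(3))
print("mean R2    :", round(scores.mean(), 3))
```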
learning_rate=0.1, max_depth=-1,
min_split_gain=0.0, n_estimators=100,
n_jobs=-1, num_leaves=31,
objective=None, random_state=6943,
reg_alpha=0.0, reg_lambda=0.0,
silent='warn', s...
n_estimators=100, n_jobs=-1,
random_state=6943, verbose=0,
n_jobs=-1, verbose=False, weights=None)
plot_model(estimator = tuned_model, plot = 'feature')
INFO:logs:Initializing interpret_model()
INFO:logs:interpret_model(estimator=ExtraTreesRegressor(bootstrap=False, ccp_alph
max_depth=9, max_features=1.0, max_leaf_nodes=None,
max_samples=None, min_impurity_decrease=0.002,
min_impurity_split=None, min_samples_leaf=3,
min_samples_split=5, min_weight_fraction_leaf=0.0,
n_estimators=210, n_jobs=-1, oob_score=False,
random_state=6943, verbose=0, warm_start=False), use_train_da
INFO:logs:Checking exceptions
INFO:logs:plot type: summary
INFO:logs:Creating TreeExplainer
INFO:logs:Compiling shap values
plot_model(estimator = tuned_model, plot = 'residuals')
Observation : The residuals are evenly distributed and the line fits well.
Target Variable = Quality- Low or High
Binary classification
from pycaret.classification import *
Categorization of Quality
quality_mapping = { 3 : 'Low', 4 : 'Low', 5: 'Low', 6 : 'High', 7: 'High', 8 : 'High',
Chateau_Montelena_AutoMLB['quality'] = Chateau_Montelena_AutoMLB['quality'].map(quali
print("Wine Quality(%):")
print(round(Chateau_Montelena_AutoMLB['quality'].value_counts(normalize=True) * 100,2)
Wine Quality(%):
High 63.31
Low 36.69
Name: quality, dtype: float64
Classifier Setup
clfb = setup(data = Chateau_Montelena_AutoMLB,
target = 'quality',
# ignore_features = ['customerID'],
silent = True)
Description Value
0 session_id 4967
1 Target quality
2 Target Type Binary
3 Label Encoded High: 0, Low: 1
4 Original Data (6497, 13)
5 Missing Values True
6 Numeric Features 11
7 Categorical Features 1
8 Ordinal Features False
9 High Cardinality Features False
10 High Cardinality Method None
11 Transformed Train Set (4937, 12)
12 Transformed Test Set (1300, 12)
13 Shuffle Train-Test True
14 Stratify Train-Test False
15 Fold Generator StratifiedKFold
16 Fold Number 5
17 CPU Jobs -1
18 Use GPU False
19 Log Experiment False
20 Experiment Name clf-default-name
21 USI 3508
22 Imputation Type simple
23 Iterative Imputation Iteration None
24 Numeric Imputer mean
25 Iterative Imputation Numeric Model None
26 Categorical Imputer constant
27 Iterative Imputation Categorical Model None
28 Unknown Categoricals Handling least_frequent
29 Normalize True
30 Normalize Method minmax
31 Transformation False
32 Transformation Method None
33 PCA False
34 PCA Method None
35 PCA Components None
36 Ignore Low Variance False
37 Combine Rare Levels False
38 Rare Level Threshold None
39 Numeric Binning False
40 Remove Outliers True
41 Outliers Threshold 0.05
42 Remove Multicollinearity True
43 Multicollinearity Threshold 0.9
44 Remove Perfect Collinearity True
45 Clustering False
46 Clustering Iteration None
47 Polynomial Features False
48 Polynomial Degree None
49 Trignometry Features False
50 Polynomial Threshold None
51 Group Features False
52 Feature Selection False
53 Feature Selection Method classic
54 Features Selection Threshold None
55 Feature Interaction False
56 Feature Ratio False
57 Interaction Threshold None
58 Fix Imbalance True
59 Fix Imbalance Method SMOTE
Pycaret provides the following metrics used for comparing model performance in the
compare_models() function:
Confusion Matrix is a performance measurement for machine learning classification problem
where output can be two or more classes. It is a table with 4 different combinations of
predicted and actual values.
AUC known as the Area Under the ROC Curve can be calculated and provides a single score to
summarize the plot that can be used to compare models. A no skill classifier will have a score
of 0.5, whereas a perfect classifier will have a score of 1.0.
F1 score is the harmonic mean of Precision and recall, a single score that seeks to balance
both concerns.
Accuracy is the fraction of correct predictions out of all predictions made.
Accuracy = Correct Predictions / Total Predictions
MCC produces a high score only if the prediction obtained good results in all of the four
confusion matrix categories (true positives, false negatives, true negatives, and false
positives), proportionally both to the size of positive elements and the size of negative
elements in the dataset.
Precision summarizes the fraction of examples assigned the positive class that belong to the
positive class.
Precision = TruePositive / (TruePositive + FalsePositive)
Cohen’s Kappa Statistic is used to measure the level of agreement between two raters or
judges who each classify items into mutually exclusive categories.
kappa = (Observed agreement - chance agreement) / (1-chance agreement)
Recall summarizes how well the positive class was predicted.
Recall = TruePositive / (TruePositive + FalseNegative)
F-Measure = (2 * Precision * Recall) / (Precision + Recall)
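The definitions above can be checked on a small hand-made example with scikit-learn. The labels and scores below are toy values chosen for illustration, with 1 as the positive class.

```python
from sklearn.metrics import (
    accuracy_score,
    cohen_kappa_score,
    confusion_matrix,
    f1_score,
    matthews_corrcoef,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy labels: 1 = positive class, 0 = negative class (illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
# Predicted probability of the positive class, needed for AUC.
y_score = [0.9, 0.8, 0.7, 0.4, 0.1, 0.2, 0.3, 0.35, 0.6, 0.55]

# For binary labels the matrix unravels to TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, FN, TN:", tp, fp, fn, tn)

accuracy = accuracy_score(y_true, y_pred)      # (TP + TN) / total
precision = precision_score(y_true, y_pred)    # TP / (TP + FP)
recall = recall_score(y_true, y_pred)          # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                  # harmonic mean of the two
kappa = cohen_kappa_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)

for name, value in [("Accuracy", accuracy), ("AUC", auc), ("Recall", recall),
                    ("Prec.", precision), ("F1", f1), ("Kappa", kappa), ("MCC", mcc)]:
    print(f"{name:9s}{value:.4f}")
```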
Searching for the best models
Model Comparison & Evaluation
Model            Accuracy  AUC     Recall  Prec.   F1      Kappa   MCC     TT (Sec)
Extra Trees      0.8232    0.9011  0.7532  0.7558  0.7539  0.6160  0.6166  0.432
Random Forest    0.8209    0.8940  0.7623  0.7463  0.7539  0.6132  0.6136  1.092
(name missing)   0.8112    0.8668  0.7392  0.7377  0.7380  0.5905  0.5909  0.978
Light Gradient   0.8009    0.8680  0.7538  0.7114  0.7314  0.5735  0.5748  0.208
(name missing)   0.7582    0.8375  0.7499  0.6405  0.6905  0.4942  0.4987  0.836
Decision Tree    0.7559    0.7384  0.6761  0.6556  0.6655  0.4735  0.4737  0.104
K Neighbors      0.7379    0.8094  0.7386  0.6124  0.6695  0.4555  0.4611  0.120
Ada Boost        0.7377    0.8115  0.7442  0.6116  0.6712  0.4566  0.4629  0.252
(name missing)   0.7284    0.8077  0.7662  0.5952  0.6697  0.4452  0.4558  0.042
(name missing)   0.7249    0.0000  0.7600  0.5920  0.6653  0.4380  0.4482  0.054
(name missing)   0.7223    0.8052  0.7532  0.5896  0.6611  0.4319  0.4415  0.054
(name missing)   0.7203    0.7995  0.7386  0.5890  0.6550  0.4249  0.4329  0.040
Hyperparameter Tuning
tuned_modelB = tune_model(best_modelB)
Accuracy AUC Recall Prec. F1 Kappa MCC
0 0.7611 0.8567 0.8085 0.6308 0.7086 0.5114 0.5227
1 0.7520 0.8351 0.7690 0.6261 0.6903 0.4871 0.4943
2 0.7700 0.8457 0.8113 0.6429 0.7173 0.5278 0.5380
3 0.7021 0.8110 0.8028 0.5599 0.6597 0.4095 0.4306
4 0.7427 0.8236 0.7493 0.6172 0.6768 0.4663 0.4724
Mean 0.7456 0.8344 0.7882 0.6154 0.6906 0.4804 0.4916
Std 0.0236 0.0160 0.0246 0.0289 0.0209 0.0412 0.0380
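Hyperparameter tuning of this kind searches over candidate settings and keeps the configuration with the best cross-validated score. A comparable search can be sketched with scikit-learn's `RandomizedSearchCV`; the parameter ranges below are assumptions for illustration, not PyCaret's internal grid, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           random_state=0)

# Illustrative search space; these ranges are assumptions.
param_distributions = {
    "n_estimators": [50, 100, 150],
    "max_depth": [3, 5, 7, 9, 11],
    "min_samples_leaf": [1, 3, 5],
    "criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    ExtraTreesClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,           # number of sampled configurations
    cv=5,                # 5-fold cross-validation
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print("best parameters :", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 4))
```

A tuned model is not guaranteed to beat the untuned one, so the two are worth comparing. In the fold table above the tuned model's mean accuracy is 0.7456, while the best row of the comparison table reports 0.8232 (assuming `best_modelB` corresponds to that row).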
We will use the LightGBM, Extra Trees and Random Forest classifiers here, as these are among the
best performers.
Creating a model
#Creating Models
lightgbmB = create_model('lightgbm');
etB = create_model('et');
rfB = create_model('rf');
#Blending the top 3 models
blendB = blend_models(estimator_list=[lightgbmB,etB,rfB])
Accuracy AUC Recall Prec. F1 Kappa MCC
0 0.8451 0.9127 0.8000 0.7760 0.7878 0.6659 0.6661
1 0.8148 0.8971 0.7296 0.7486 0.7389 0.5955 0.5956
2 0.8470 0.9021 0.7859 0.7881 0.7870 0.6677 0.6677
3 0.8024 0.8846 0.7859 0.7010 0.7410 0.5822 0.5847
4 0.8126 0.8924 0.7296 0.7443 0.7368 0.5913 0.5914
Mean 0.8244 0.8978 0.7662 0.7516 0.7583 0.6205 0.6211
Std 0.0182 0.0094 0.0303 0.0302 0.0238 0.0380 0.0376
plot_model(estimator = tuned_modelB, plot = 'feature')
#Plotting the confusion Matrix
plot_model(estimator = tuned_modelB, plot = 'confusion_matrix')
Observation :
We can see a strong diagonal, indicating good predictions.
#plotting decision boundary
plot_model(estimator = tuned_modelB, plot = 'boundary', use_train_data = True)
We can see a clear separation with very few misclassifications.
plot_model(tuned_modelB, plot = 'parameter')
bootstrap False
ccp_alpha 0.0
class_weight {}
criterion entropy
max_depth 11
max_features log2
max_leaf_nodes None
max_samples None
min_impurity_decrease 0.0001
min_impurity_split None
min_samples_leaf 5
min_samples_split 9
min_weight_fraction_leaf 0.0
n_estimators 180
n_jobs -1
oob_score False
random_state 4967
verbose 0
warm_start False
#Plotting Area under Curve
plot_model(estimator = tuned_modelB, plot = 'auc')
Target Variable = Quality - Low,Medium,High
Multiclass classification
#from pycaret.classification import *
Classification of Quality
quality_mappingM = { 3 : 'Low', 4 : 'Low', 5: 'Medium', 6 : 'Medium', 7: 'Medium', 8 :
Chateau_Montelena_AutoMLM['quality'] = Chateau_Montelena_AutoMLM['quality'].map(quali
print("Wine Quality(%):")
print(round(Chateau_Montelena_AutoMLM['quality'].value_counts(normalize=True) * 100,2)
Wine Quality(%):
Medium 93.17
Low 3.79
High 3.05
Name: quality, dtype: float64
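With about 93% of the wines in the Medium class, accuracy alone is a weak signal: a model that always predicts Medium already scores about 0.93. The sketch below illustrates this with a majority-class baseline on synthetic labels drawn with roughly the shares printed above (the sample size and seed are assumptions). Chance-corrected scores such as Kappa expose the problem.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score

rng = np.random.default_rng(4)
# Labels drawn with roughly the class shares printed above.
y = rng.choice(["Medium", "Low", "High"], size=2000, p=[0.93, 0.04, 0.03])
X = np.zeros((len(y), 1))  # features are irrelevant for this baseline

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

print("accuracy:", round(accuracy_score(y, pred), 4))
print("kappa   :", round(cohen_kappa_score(y, pred), 4))
```

For this reason the Kappa and MCC columns of the comparison are more informative here than Accuracy.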
Setting the classifier
clfM = setup(data = Chateau_Montelena_AutoMLM,
target = 'quality',
# ignore_features = ['customerID'],
silent = True)
Description Value
0 session_id 4450
1 Target quality
2 Target Type Multiclass
3 Label Encoded High: 0, Low: 1, Medium: 2
4 Original Data (6497, 13)
5 Missing Values True
6 Numeric Features 11
7 Categorical Features 1
8 Ordinal Features False
9 High Cardinality Features False
10 High Cardinality Method None
11 Transformed Train Set (4937, 12)
12 Transformed Test Set (1300, 12)
13 Shuffle Train-Test True
14 Stratify Train-Test False
15 Fold Generator StratifiedKFold
16 Fold Number 5
17 CPU Jobs -1
18 Use GPU False
19 Log Experiment False
20 Experiment Name clf-default-name
21 USI 40d8
22 Imputation Type simple
23 Iterative Imputation Iteration None
24 Numeric Imputer mean
25 Iterative Imputation Numeric Model None
26 Categorical Imputer constant
27 Iterative Imputation Categorical Model None
28 Unknown Categoricals Handling least_frequent
29 Normalize True
30 Normalize Method minmax
31 Transformation False
32 Transformation Method None
33 PCA False
34 PCA Method None
35 PCA Components None
36 Ignore Low Variance False
37 Combine Rare Levels False
38 Rare Level Threshold None
39 Numeric Binning False
40 Remove Outliers True
41 Outliers Threshold 0.05
42 Remove Multicollinearity True
43 Multicollinearity Threshold 0.9
44 Remove Perfect Collinearity True
45 Clustering False
46 Clustering Iteration None
47 Polynomial Features False
48 Polynomial Degree None
49 Trignometry Features False
50 Polynomial Threshold None
51 Group Features False
52 Feature Selection False
53 Feature Selection Method classic
54 Features Selection Threshold None
55 Feature Interaction False
56 Feature Ratio False
57 Interaction Threshold None
58 Fix Imbalance True
59 Fix Imbalance Method SMOTE
Model            Accuracy  AUC     Recall  Prec.   F1      Kappa   MCC     TT (Sec)
(name missing)   0.9299    0.7765  0.5413  0.9209  0.9243  0.3454  0.3514  5.532
Light Gradient   0.9279    0.7702  0.5348  0.9196  0.9225  0.3327  0.3389  0.538
Extra Trees      0.9230    0.8402  0.5646  0.9195  0.9210  0.3475  0.3487  0.618
Random Forest    0.9123    0.8244  0.5722  0.9166  0.9141  0.3233  0.3248  2.222
Decision Tree    0.8404    0.6445  0.5569  0.9048  0.8679  0.1915  0.2112  0.136
(name missing)   0.7727    0.7342  0.6042  0.9068  0.8254  0.1643  0.2067  9.140
K Neighbors      0.7432    0.7225  0.6320  0.9112  0.8064  0.1613  0.2160  0.180
Ada Boost        0.5345    0.5782  0.5922  0.9011  0.6462  0.0730  0.1325  1.000
(name missing)   0.4950    0.6411  0.5851  0.9010  0.6113  0.0646  0.1249  0.052
(name missing)   0.4857    0.7076  0.6144  0.9079  0.6017  0.0735  0.1446  0.038
(name missing)   0.4794    0.7101  0.6330  0.9118  0.5952  0.0780  0.1556  0.562
(name missing)   0.4132    0.0000  0.6236  0.9116  0.5293  0.0650  0.1426  0.022
SVM - Linear     0.3830    0.0000  0.6252  0.9121  0.4962  0.0613  0.1404  0.072
Naive Bayes      0.3721    0.6096  0.5746  0.9040  0.4885  0.0492  0.1140  0.022
LGBM has close to the best F1 score (0.9225 against 0.9243 for the top row) and is roughly ten times
faster than that model (0.538 s against 5.532 s).
tuned_modelM = tune_model(best_modelM)

