The document analyzes wine quality prediction using machine learning models. It aims to predict wine quality, measured on an ordinal scale of 3 to 9, from a set of physico-chemical predictor variables. The report performs data cleaning and preprocessing steps such as handling missing data through mean imputation and normalizing variables, analyzes the distributions of the predictor variables (most are found to be approximately normal), and checks that their ranges make sense. The objective is to apply ML models to predict wine quality and to use AutoML and the SHAP library to analyze model performance and feature importance.
Wine_Quality_report.ipynb (Colaboratory)
Problem Statement
Although we are attempting to predict wine quality as a target for a number of wines from a given set of predictor variables, wine quality is a subjective measurement. This is an EDA, or data-driven story, including a range of graphs and images as well as an attribute-based quality forecast. The question is: what is the quality of the wine, in ordinal values from 3 to 9? It is a regression task.
Objective
Perform Data Cleaning, Pre-processing and Feature Selection
Apply ML models to predict wine quality
Use Auto-ML to determine the best model
Use SHAP library to determine the impact of the predictor variables
ML Data Cleaning and Feature Selection
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy.stats import norm
from scipy import stats
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from math import sqrt
from sklearn.metrics import r2_score
Cabernet Sauvignon is known as the king of red wines.
Cabernet_Sauvignon = pd.read_csv('https://raw.githubusercontent.com/Mohanmanjunatha/DA
Cabernet_Sauvignon.head()

   type  fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density
0  white           7.0              0.27         0.36            20.7      0.045                 45.0                 170.0   1.0010
1  white           6.3              0.30         0.34             1.6      0.049                 14.0                 132.0   0.9940
2  white           8.1              0.28         0.40             6.9      0.050                 30.0                  97.0   0.9951
3  white           7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956
4  white           7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956
(remaining columns truncated in this export)

Cabernet_Sauvignon.shape

(6497, 13)
What are the data types? (Only numeric and categorical)
Cabernet_Sauvignon.dtypes
type object
fixed acidity float64
volatile acidity float64
citric acid float64
residual sugar float64
chlorides float64
free sulfur dioxide float64
total sulfur dioxide float64
density float64
pH float64
sulphates float64
alcohol float64
quality int64
dtype: object
The dataset has 1 Categorical and 12 Numerical Features.
What features are in the dataset?

fixed acidity - Fixed acidity is due to the presence of non-volatile acids in wine, for example tartaric, citric or malic acid. These acids contribute to the balance of the wine's taste and bring freshness to it.

volatile acidity - Volatile acidity is the part of the acid in wine that can be picked up by the nose, unlike the acids that are palpable to the taste (discussed above). Volatile acidity, in other words the souring of wine, is one of the most common defects.

citric acid - Permitted in winemaking by OIV Resolution No. 23/2000. It can be used in three cases: for acid treatment of wine (increasing acidity), for wine stabilization, and for cleaning filters of possible fungal and mold contamination.

residual sugar - The grape sugar that has not been fermented into alcohol.

chlorides - The structure of a wine also depends on its mineral content, which determines taste sensations such as salinity (sapidità). Anions of inorganic acids (chlorides, sulfates, sulfites, ...), anions of organic acids, and metal cations (potassium, sodium, magnesium, ...) are found in wine. Their content depends mainly on the climatic zone (cold or warm region, salinity of the soils depending on proximity to the sea), oenological practices, and the storage and aging of the wine.

free sulfur dioxide, total sulfur dioxide - Sulfur dioxide (sulfur oxide, food additive E220, SO2) is used as a preservative due to its antioxidant and antimicrobial properties. Molecular SO2 is an extremely important antimicrobial, suppressing spoilage organisms (including wild yeasts) that can manifest themselves in wine spoilage.

density - The density of wine can be either less or more than that of water. Its value is determined primarily by the concentration of alcohol and sugar. White, rosé and red wines are generally light: their density at 20°C is below 998.3 kg/m3.

pH - A measure of the acidity of wine. All wines ideally have a pH level between 2.9 and 4.2. The lower the pH, the more acidic the wine; the higher the pH, the less acidic the wine.

sulphates - Sulfites are a natural result of yeast fermenting the sugar in wine into alcohol, so the presence of sulfites in wine can never be fully excluded.

alcohol - The alcohol content of a wine depends on many factors: the grape variety and the amount of sugar in the berries, production technology, and growing conditions. Wines vary greatly in strength: this parameter ranges from 4.5 to 22 percent depending on the category.

quality - The target variable.
Are there missing values?
Cabernet_Sauvignon.isna().sum()
type 0
fixed acidity 10
volatile acidity 8
citric acid 3
residual sugar 2
chlorides 2
free sulfur dioxide 0
total sulfur dioxide 0
density 0
pH 9
sulphates 4
alcohol 0
quality 0
dtype: int64
Which independent variables have missing data? How much?
fixed acidity - 10
volatile acidity - 8
citric acid - 3
residual sugar - 2
chlorides - 2
pH - 9
sulphates - 4
The above features have the respective numbers of missing values. Since the distributions are fairly symmetric, mean imputation is a reasonable choice.
Before examining the quality feature, the categorical variable will be mapped with the help of cat.codes. This makes the analysis easier and more comprehensible.
Cabernet_Sauvignon['type'] = Cabernet_Sauvignon['type'].astype("category").cat.codes
Cabernet_Sauvignon.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 type 6497 non-null int8
1 fixed acidity 6487 non-null float64
2 volatile acidity 6489 non-null float64
3 citric acid 6494 non-null float64
4 residual sugar 6495 non-null float64
5 chlorides 6495 non-null float64
6 free sulfur dioxide 6497 non-null float64
7 total sulfur dioxide 6497 non-null float64
8 density 6497 non-null float64
9 pH 6488 non-null float64
10 sulphates 6493 non-null float64
11 alcohol 6497 non-null float64
12 quality 6497 non-null int64
dtypes: float64(11), int64(1), int8(1)
memory usage: 615.6 KB
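The imputation cell itself is not visible in this export; a minimal sketch of the mean imputation described above, over the columns listed as having missing values:

# Mean imputation: replace each missing entry with the column mean
cols_with_missing = ['fixed acidity', 'volatile acidity', 'citric acid',
                     'residual sugar', 'chlorides', 'pH', 'sulphates']
for col in cols_with_missing:
    Cabernet_Sauvignon[col] = Cabernet_Sauvignon[col].fillna(Cabernet_Sauvignon[col].mean())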
After mean imputation, re-checking Cabernet_Sauvignon.isna().sum() reports zero missing values (the first rows of this output are not visible in the export):

total sulfur dioxide 0
density 0
pH 0
sulphates 0
alcohol 0
quality 0
dtype: int64
What are the likely distributions of the numeric variables? What are the distributions of the predictor variables?

In the plots below, the good fit of the overlaid normal curve indicates that normality is a reasonable approximation.
Distribution of Predictors
Cabernet_SauvignonColumnList = Cabernet_Sauvignon.columns
for i in Cabernet_SauvignonColumnList:
    plt.figure(figsize=(5, 5))
    sns.distplot(Cabernet_Sauvignon[i], fit=norm)
    plt.title(f"Distribution of {i} (checking normal distribution fit)", size=15, weight='bold')
type : categorical values
fixed acidity : normal distribution
volatile acidity : almost normal distribution with a bit of right-skewness
citric acid : almost normal distribution with a bit of edge-peak
residual sugar : almost normal distribution with a bit of right-skewness
chlorides : almost normal distribution with a bit of right-skewness
free sulfur dioxide : normal distribution
total sulfur dioxide : almost normal distribution with a bit of edge-peak
sulphates : normal distribution
alcohol : almost normal distribution with a bit of right-skewness
pH : normal distribution
density : normal distribution
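A quantile-quantile plot gives a sharper visual check of these normality claims than the overlaid density curve; a small sketch using scipy's probplot (stats is already imported), applied here to pH as an example:

# Q-Q plot of one predictor against a theoretical normal distribution
plt.figure(figsize=(5, 5))
stats.probplot(Cabernet_Sauvignon['pH'], dist='norm', plot=plt)
plt.title('Q-Q plot of pH against a normal distribution')
plt.show()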
Do the ranges of the predictor variables make sense?
#Range of each column
Cabernet_Sauvignon.max() - Cabernet_Sauvignon.min()

Cabernet_Sauvignon.describe()

              type  fixed acidity  volatile acidity  citric acid  residual sugar    chlorides  free sulfur dioxide
count  6497.000000    6497.000000       6497.000000  6497.000000     6497.000000  6497.000000          6497.000000
mean      0.753886       7.216501          0.339634     0.318675        5.445704     0.056041            30.525319
std       0.430779       1.295928          0.164563     0.145267        4.758043     0.035032            17.749400
min       0.000000       3.800000          0.080000     0.000000        0.600000     0.009000             1.000000
25%       1.000000       6.400000          0.230000     0.250000        1.800000     0.038000            17.000000
50%       1.000000       7.000000          0.290000     0.310000        3.000000     0.047000            29.000000
75%       1.000000       7.700000          0.400000     0.390000        8.100000     0.065000            41.000000
max       1.000000      15.900000          1.580000     1.660000       65.800000     0.611000           289.000000
(remaining columns truncated in this export)
The ranges make sense for each attribute of a wine. The range of the "total sulfur dioxide" variable is large, which implies high variability in its distribution.
Do the training and test sets have the same data?
Using train_test_split, the train and test sets are split at an 80/20 ratio from the same dataset, but the two sets are disjoint, and the test set is not seen by the model during the training phase. The distribution of each attribute is, however, proportional across the train and test sets.
Phase 1
Cabernet_Sauvignon_x = Cabernet_Sauvignon[['type','fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH','sulphates','alcohol']]
Cabernet_Sauvignon_y = Cabernet_Sauvignon['quality']
# equivalently: Cabernet_Sauvignon.iloc[:,:12], Cabernet_Sauvignon.iloc[:,-1]
Cabernet_Sauvignon_y.head()
0 6.0
1 6.0
2 6.0
3 6.0
4 6.0
Name: quality, dtype: float64
scaler = StandardScaler()
# #Dataframe Cabernet_Sauvignon with outliers
Cabernet_Sauvignon_x = scaler.fit_transform(Cabernet_Sauvignon_x)
plt.figure(figsize=(20,7))
ax = sns.boxplot(data=Cabernet_Sauvignon_x)
ax.set_xticklabels(Cabernet_SauvignonColumnList[:12])
(Output of model-fitting cells not visible in this export: scikit-learn emits a FutureWarning that LinearRegression's 'normalize' parameter is deprecated, suggesting a Pipeline with a StandardScaler preprocessing step, with sample_weight then passed as a fit parameter.)
Three metrics will be calculated to evaluate the predictions.
Mean Absolute Error (MAE) shows the average absolute difference between predictions and actual values.
Root Mean Square Error (RMSE) shows how accurately the model predicts the response.
R^2 will be calculated as a goodness-of-fit measure.
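The split and model-fitting cells for Phase 1 are not visible in this export; a hedged sketch of the 80/20 split described above plus the linear-regression metrics (named mae1, rmse1 and r21 to match the print-out further below; the random_state is an assumption):

# 80/20 train/test split on the scaled features (random_state assumed)
X_train, X_test, y_train, y_test = train_test_split(
    Cabernet_Sauvignon_x, Cabernet_Sauvignon_y, test_size=0.2, random_state=42)

# Fit the baseline linear regression and score it on the held-out test set
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)

mae1 = mean_absolute_error(y_test, lr_pred)
rmse1 = sqrt(mean_squared_error(y_test, lr_pred))
r21 = r2_score(y_test, lr_pred)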
plt.figure(figsize=(5, 7))
ax = sns.distplot(y_test, hist=False, color="r", label="Actual Value")
sns.distplot(lr_pred, hist=False, color="b", label="Fitted Values" , ax=ax)
plt.title('Actual(red) vs Fitted(blue) Values for Quality')
plt.show()
plt.close()
(seaborn emits FutureWarnings noting that distplot is deprecated.)
plt.figure(figsize=(5, 7))
ax = sns.distplot(y_test, hist=False, color="r", label="Actual Value")  # first lines of this cell are missing from the export; reconstructed to mirror the earlier plot cell
sns.distplot(Dt_pred, hist=False, color="b", label="Fitted Values", ax=ax)
plt.title('Actual(red) vs Fitted(blue) Values for Quality')
plt.show()
plt.close()
Phase 2
Are the predictor variables independent of all the other predictor variables?

Multicollinearity

Multicollinearity measures the relationships among the explanatory variables in a multiple regression. If multicollinearity occurs, the highly related input variables should be eliminated from the model.

In this kernel, multicollinearity is checked by plotting a correlation heatmap.
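The heatmap cell itself is not visible in this export; a minimal sketch of the correlation heatmap it describes:

# Correlation heatmap across all 13 columns ('type' is already numeric)
plt.figure(figsize=(12, 8))
sns.heatmap(Cabernet_Sauvignon.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation matrix of the wine attributes')
plt.show()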
Which independent variables are useful to predict the target (dependent variable)? (Use at least three methods.) For a regression model, the most useful independent variables can be statistically determined using the following methods:

f_regression
mutual_info_regression
Correlation Matrix with Heatmap

Each of these methods is applied to the dataset below.

1. f_regression
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression, mutual_info_regression
X = Cabernet_Sauvignon.iloc[:,0:12]
y = Cabernet_Sauvignon.iloc[:,-1]
# y=y.astype('int')
# y = pd.DataFrame(y)
# y.head(10)
# y.describe()
#Applying SelectKBest class to extract top features
# feature selection
f_selector = SelectKBest(score_func=f_regression, k='all')
# learn relationship from training data
f_selector.fit(X_train, y_train)
# transform train input data
X_train_fs = f_selector.transform(X_train)
# transform test input data
X_test_fs = f_selector.transform(X_test)
# Plot the scores for the features
plt.rcParams["figure.figsize"] = (30,10)
plt.bar([i for i in range(len(f_selector.scores_))], f_selector.scores_)
plt.xlabel("feature index")
plt.xticks([i for i in range(len(f_selector.scores_))], Cabernet_SauvignonColumnList[:12])
plt.ylabel("F-value (transformed from the correlation values)")
plt.show()
# bestFeatures = SelectKBest(score_func= chi2, k =12)
# fit = bestFeatures.fit(X,y)
We can see that volatile acidity, chlorides, density and alcohol have more importance than the others.

2. Mutual information metric
# feature selection
f_selector = SelectKBest(score_func=mutual_info_regression, k='all')
# learn relationship from training data
f_selector.fit(X_train, y_train)
# transform train input data
X_train_fs = f_selector.transform(X_train)
# transform test input data
X_test_fs = f_selector.transform(X_test)
# Plot the scores for the features
plt.bar([i for i in range(len(f_selector.scores_))], f_selector.scores_, align='center')
plt.xlabel("feature index")
plt.xticks([i for i in range(len(f_selector.scores_))], Cabernet_SauvignonColumnList[:12])
plt.ylabel("Estimated MI value")
# plt.rcParams["figure.figsize"] = (30,10)
plt.show()
By looking at the correlation matrix above we can gain the following insights:
1. volatile acidity and chlorides are highly (-ve) correlated with type.
2. alcohol is highly (-ve) correlated with density.
3. total sulfur dioxide is highly (+ve) correlated with type.

By looking at the three feature importance methods above, we can see that volatile acidity, chlorides, density and alcohol are the most important common features for predicting the value of quality.
Outlier Treatment
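The outlier-removal cells are not visible in this export; a hedged sketch of an IQR-based filter that would produce the Xclean/yclean splits used below (the 1.5*IQR rule, the variable names and the random_state are all assumptions):

# Drop any row that falls outside 1.5*IQR on any of the 12 predictors
features = pd.DataFrame(Cabernet_Sauvignon_x, columns=Cabernet_SauvignonColumnList[:12])
q1, q3 = features.quantile(0.25), features.quantile(0.75)
iqr = q3 - q1
keep = ~((features < (q1 - 1.5 * iqr)) | (features > (q3 + 1.5 * iqr))).any(axis=1)

Xclean = features[keep]
yclean = Cabernet_Sauvignon_y[keep.values]
Xclean_train, Xclean_test, yclean_train, yclean_test = train_test_split(
    Xclean, yclean, test_size=0.2, random_state=42)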
[Boxplot of the 12 standardized predictor columns after outlier treatment; x-axis labels: type, fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol]
##Linear Regression, Random Forest and Decision Tree on the outlier-treated data
# lr = LinearRegression(copy_X= True, fit_intercept = True, normalize = True)
lr.fit(Xclean_train, yclean_train)
lrclean_pred = lr.predict(Xclean_test)

# model2 = RandomForestRegressor(random_state=1, n_estimators=1000)
model2.fit(Xclean_train, yclean_train)
Rmclean_pred = model2.predict(Xclean_test)

# model3 (presumably the DecisionTreeRegressor from Phase 1; its definition is not visible in this export)
model3.fit(Xclean_train, yclean_train)
Dtclean_pred = model3.predict(Xclean_test)
(scikit-learn again emits the FutureWarning about LinearRegression's deprecated 'normalize' parameter.)
print('-------------Linear Regression-----------')
print('--Phase-1--')
print('MAE: %f'% mae1)
print('RMSE: %f'% rmse1)
print('R2: %f' % r21)
print('--Phase-2--')
print('MAE: %f'% mean_absolute_error(yclean_test, lrclean_pred))
print('RMSE: %f'% np.sqrt(mean_squared_error(yclean_test, lrclean_pred)))
print('R2: %f' % r2_score(yclean_test, lrclean_pred))
print('-------------Random forest-----------')
print('--Phase-1--')
print('MAE: %f'% mae2)
print('RMSE: %f'% rmse2)
print('R2: %f' % r22)
print('--Phase-2--')
print('MAE: %f'% mean_absolute_error(yclean_test, Rmclean_pred))
print('RMSE: %f'% np.sqrt(mean_squared_error(yclean_test, Rmclean_pred)))
print('R2: %f' % r2_score(yclean_test, Rmclean_pred))
print('-------------Decision Tree-----------')
print('--Phase-1--')
print('MAE: %f'% mae3)
print('RMSE: %f'% rmse3)
print('R2: %f' % r23)
print('--Phase-2--')
print('MAE: %f'% mean_absolute_error(yclean_test, Dtclean_pred))
print('RMSE: %f'% np.sqrt(mean_squared_error(yclean_test, Dtclean_pred)))
print('R2: %f' % r2_score(yclean_test, Dtclean_pred))
-------------Linear Regression-----------
--Phase-1--
MAE: 0.545152
RMSE: 0.686665
R2: 0.340363
--Phase-2--
MAE: 0.578749
RMSE: 0.748469
R2: 0.274277
-------------Random forest-----------
--Phase-1--
MAE: 0.401750
RMSE: 0.561165
R2: 0.559449
--Phase-2--
MAE: 0.438112
RMSE: 0.622107
R2: 0.498635
-------------Decision Tree-----------
--Phase-1--
MAE: 0.541020
RMSE: 0.696854
R2: 0.320642
--Phase-2--
MAE: 0.586013
RMSE: 0.756198
R2: 0.259211
The results show that the two phases have different prediction results, though Phase 1 and Phase 2 do not differ greatly on any metric. The MAE and RMSE values increase in Phase 2, which means the prediction error is higher in that phase, and the model's explanatory power has decreased by a negligible margin.
Remove outliers and keep outliers (does it have an effect on the final predictive model)? An MAE value of 0 indicates no error in the model, in other words a perfect prediction. The above results show that all predictions have noticeable error, especially in Phase 2. RMSE gives an idea of how much error the system typically makes in its predictions; the above results show that RMSE became worse after removing the outliers. R2 represents the proportion of the variance of the dependent variable that is explained by the independent variables.
Cabernet_Sauvignon_class = Cabernet_Sauvignon.copy()  # .copy() avoids mutating the original frame
Cabernet_Sauvignon_imputation = Cabernet_Sauvignon.copy()

quality_mapping = { 3 : 'Low', 4 : 'Low', 5 : 'Medium', 6 : 'Medium', 7 : 'Medium', 8 : 'High', 9 : 'High'}  # tail of the mapping completed from the Low/Medium/High classes reported below
Cabernet_Sauvignon_class['quality'] = Cabernet_Sauvignon_class['quality'].map(quality_mapping)

Cabernet_Sauvignon_class_x, Cabernet_Sauvignon_class_y = Cabernet_Sauvignon_class.iloc[:,:12], Cabernet_Sauvignon_class['quality']
Cabernet_Sauvignon_class_x = scaler.fit_transform(Cabernet_Sauvignon_class_x)

#Splitting the dataset, after binning quality into classes, into Train and Test sets at the same 80/20 ratio
Xclass_train, Xclass_test, yclass_train, yclass_test = train_test_split(Cabernet_Sauvignon_class_x, Cabernet_Sauvignon_class_y, test_size=0.2)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# creating a RF classifier
clf = RandomForestClassifier(n_estimators = 1000)
# Training the model on the training dataset
# fit function is used to train the model using the training sets as parameters
clf.fit(Xclass_train, yclass_train)
# performing predictions on the test dataset
yclass_pred = clf.predict(Xclass_test)
# metrics are used to find accuracy or error
from sklearn import metrics
print()
# using metrics module for accuracy calculation
print("ACCURACY OF THE MODEL: ", metrics.accuracy_score(yclass_test, yclass_pred))
print(classification_report(yclass_test, yclass_pred))
ACCURACY OF THE MODEL: 0.9456521739130435
precision recall f1-score support
High 1.00 0.34 0.51 38
Low 0.00 0.00 0.00 24
Medium 0.95 1.00 0.97 858
accuracy 0.95 920
macro avg 0.65 0.45 0.49 920
weighted avg 0.92 0.95 0.93 920
quality_mapping_again = { 'Low':0, 'Medium':1, 'High':2}
yclass_test = yclass_test.map(quality_mapping_again)
yclass_pred_new = [s.replace('Medium', '1') for s in yclass_pred]
yclass_pred_new = [s.replace('Low', '0') for s in yclass_pred_new]
yclass_pred_new = [s.replace('High', '2') for s in yclass_pred_new]
yclass_pred_new = [int(item) for item in yclass_pred_new]
plt.figure(figsize=(5, 7))
ax = sns.distplot(yclass_test, hist=False, color="r", label="Actual Value")
sns.distplot(yclass_pred_new, hist=False, color="b", label="Fitted Values" , ax=ax)
plt.title('Actual vs Fitted Values for Quality')
plt.show()
plt.close()
As we can see here, the accuracy of the classification model turned out to be much higher than any regression method used in Phase 1. It can be interpreted as follows: wine tastings are generally blind tastings, and even for the best wine connoisseurs it is very difficult to differentiate between a quality of 7 or 8. The quality of a wine, judged by how it tastes, is also highly subjective; often it is how the product is marketed and promoted that forms the general opinion of the targeted audience.

That being said, a good wine is a good wine. Based on the chemical composition of the wine itself, we can at least say whether it is a good or a bad one. So when a model is asked to place a wine in a category, it achieves much greater accuracy, because classifying into bins is easier than predicting a precise quality rating.
Data Imputation
Remove 1%, 5%, and 10% of your data randomly and impute the values back using at least 3 imputation methods. How well did the methods recover the missing values? That is: remove some data, check the % error on the residuals for numeric data, and check the bias and variance of the error.
Imputation 1
Cabernet_Sauvignon_imputation['1_percent'] = Cabernet_Sauvignon_imputation[['alcohol']
Cabernet_Sauvignon_imputation['5_percent'] = Cabernet_Sauvignon_imputation[['alcohol']
Cabernet_Sauvignon_imputation['10_percent'] = Cabernet_Sauvignon_imputation[['alcohol'
Cabernet_Sauvignon_imputation.head()
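The three cell lines above are truncated in this export; they copy the alcohol column into 1_percent, 5_percent and 10_percent columns and blank out the corresponding fraction of values. A hedged sketch of one way to do that (the sampling approach and random_state are assumptions):

# Blank out a random 1%, 5% and 10% of the copied alcohol values
for frac, col in [(0.01, '1_percent'), (0.05, '5_percent'), (0.10, '10_percent')]:
    s = Cabernet_Sauvignon_imputation['alcohol'].copy()
    drop_idx = s.sample(frac=frac, random_state=0).index  # rows to blank out
    s.loc[drop_idx] = np.nan
    Cabernet_Sauvignon_imputation[col] = s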
column_name percent_missing
quality quality 0.000000
1_percent 1_percent 1.000435
5_percent 5_percent 5.002175
10_percent 10_percent 10.004350
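The helper get_percent_missing is used throughout this section, but its definition is not visible in this export; a minimal sketch consistent with the printed column_name/percent_missing output:

def get_percent_missing(df):
    # Percentage of missing values per column, returned as a small summary frame
    percent_missing = df.isnull().sum() * 100 / len(df)
    return pd.DataFrame({'column_name': df.columns, 'percent_missing': percent_missing})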
# Store Index of NaN values in each coloumns
number_1_idx = list(np.where(Cabernet_Sauvignon_imputation['1_percent'].isna())[0])
number_5_idx = list(np.where(Cabernet_Sauvignon_imputation['5_percent'].isna())[0])
number_10_idx = list(np.where(Cabernet_Sauvignon_imputation['10_percent'].isna())[0])
print(f"Length of number_1_idx is {len(number_1_idx)} and it contains {(len(number_1_i
print(f"Length of number_5_idx is {len(number_5_idx)} and it contains {(len(number_5_i
print(f"Length of number_10_idx is {len(number_10_idx)} and it contains {(len(number_1
Length of number_1_idx is 46 and it contains 1.0004349717268377% of total data in
Length of number_5_idx is 230 and it contains 5.002174858634189% of total data in
Length of number_10_idx is 460 and it contains 10.004349717268378% of total data
Imputation 2
KNN Imputation. k-nearest neighbours is an algorithm used for simple classification. The algorithm uses 'feature similarity' to predict the values of new data points: a new point is assigned a value based on how closely it resembles the points in the training set.
#Creating a separate dataframe for performing the KNN imputation
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

Cabernet_Sauvignon_imputation1 = Cabernet_Sauvignon_imputation[['1_percent','5_percent','10_percent']]
imputer = KNNImputer(n_neighbors=5)
imputed_number_Cabernet_Sauvignon = pd.DataFrame(imputer.fit_transform(Cabernet_Sauvignon_imputation1), columns=['1_percent','5_percent','10_percent'])
# imputed_number_Cabernet_Sauvignon.sample(10)
imputed_number_Cabernet_Sauvignon.head()
print(get_percent_missing(imputed_number_Cabernet_Sauvignon))
column_name percent_missing
1_percent 1_percent 0.0
5_percent 5_percent 0.0
10_percent 10_percent 0.0
alcohol = Cabernet_Sauvignon["alcohol"]
imputed_mean = pd.concat([alcohol,imputed_number_Cabernet_Sauvignon])
imputed_mean.columns = ["Alcohol","1_Percent","5_Percent","10_Percent"]
imputed_mean.var()
Alcohol 1.470385
1_Percent 1.470326
5_Percent 1.470391
10_Percent 1.470429
dtype: float64
The KNN-based method showed very negligible variability; therefore this method is acceptable for the current dataset.
Mean-based Imputation with SimpleImputer. This works by calculating the mean/median of the non-missing values in a column and then replacing the missing values within each column separately and independently of the others. It can only be used with numeric data.
Cabernet_Sauvignon_imputation_mean = Cabernet_Sauvignon_imputation[['1_percent','5_percent','10_percent']]
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(strategy='mean') #for median imputation replace 'mean' with 'median'
imp_mean.fit(Cabernet_Sauvignon_imputation_mean)
imputed_train_Cabernet_Sauvignon = imp_mean.transform(Cabernet_Sauvignon_imputation_mean)
imputed_mean = pd.DataFrame(imp_mean.fit_transform(Cabernet_Sauvignon_imputation_mean), columns=['1_percent','5_percent','10_percent'])
print(get_percent_missing(imputed_mean))
column_name percent_missing
1_percent 1_percent 0.0
5_percent 5_percent 0.0
10_percent 10_percent 0.0
alcohol = Cabernet_Sauvignon["alcohol"]
combined_mean = pd.concat([alcohol,imputed_mean])
combined_mean.mean()
0 10.587102
10_percent 10.588810
1_percent 10.586540
5_percent 10.581520
dtype: float64
combined_mean.var()
0 1.470385
10_percent 1.320797
1_percent 1.456402
5_percent 1.395375
dtype: float64
The mean-based method showed very negligible variability; therefore this method is acceptable for the current dataset.

Imputation 3

Imputation Using Multivariate Imputation by Chained Equations (MICE). This type of imputation works by filling in the missing data multiple times. Multiple Imputations (MIs) are much better than a single imputation, as they measure the uncertainty of the missing values more accurately. The chained-equations approach is also very flexible: it can handle variables of different data types (i.e., continuous or binary) as well as complexities such as bounds or survey skip patterns.
Cabernet_Sauvignon_imputation_mice = Cabernet_Sauvignon_imputation[['1_percent','5_percent','10_percent']]
print(get_percent_missing(Cabernet_Sauvignon_imputation_mice))
column_name percent_missing
1_percent 1_percent 1.000435
5_percent 5_percent 5.002175
10_percent 10_percent 10.004350
!pip install impyute
from impyute.imputation.cs import mice
# start the MICE training
imputed_training=mice(Cabernet_Sauvignon_imputation_mice.values)
imputed_training = pd.DataFrame(imputed_training)
imputed_training.columns = ("1_percent","5_percent","10_percent")
# imputed_mice = pd.DataFrame(imputed_training.fit_transform(Cabernet_Sauvignon_imputa
print(get_percent_missing(imputed_training))
column_name percent_missing
1_percent 1_percent 0.0
5_percent 5_percent 0.0
10_percent 10_percent 0.0
alcohol = Cabernet_Sauvignon["alcohol"]
combined_mice = pd.concat([alcohol,imputed_training])
combined_mice.columns = ["Alcohol","1_Percent","5_Percent","10_Percent"]
combined_mice.mean()
Alcohol 10.587102
1_Percent 10.586915
5_Percent 10.587098
10_Percent 10.586915
dtype: float64
combined_mice.var()
Alcohol 1.470385
1_Percent 1.467981
5_Percent 1.470375
10_Percent 1.467981
dtype: float64
The MICE method showed very negligible variability; therefore this method is acceptable for the current dataset.
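The brief at the top of this section also asks for the % error on the residuals and for the bias and variance of the error, which this export never shows explicitly; a hedged sketch of that check for the MICE-imputed 1% column (the index lists number_1_idx etc. were stored earlier; the comparison itself is an assumption about the cells not shown):

# Compare imputed values against the true alcohol values at the masked positions
true_vals = Cabernet_Sauvignon_imputation['alcohol'].iloc[number_1_idx].values
imputed_vals = imputed_training['1_percent'].iloc[number_1_idx].values
residuals = imputed_vals - true_vals

print('mean % error         :', (np.abs(residuals) / true_vals).mean() * 100)
print('bias (mean residual) :', residuals.mean())
print('variance of residuals:', residuals.var())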
AutoML
#Install AutoML library - PyCaret
!pip install pycaret
(pip output condensed: most PyCaret dependencies are already satisfied; pip downgrades numpy from 1.20.0 to 1.19.5, and its dependency resolver warns that tensorflow 2.9.2, jaxlib 0.3.25+cuda11.cudnn805, jax 0.3.25 and cmdstanpy 1.0.8 require numpy>=1.20, and that en-core-web-sm 3.4.1 requires spacy<3.5.0,>=3.4.0 while spacy 2.3.8 is installed.)
from scipy import stats
# import math
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
for filename in filenames:
print(os.path.join(dirname, filename))
#Reading Data
Chateau_Montelena_AutoML = pd.read_csv('https://raw.githubusercontent.com/Mohanmanjuna
Chateau_Montelena_AutoMLM = Chateau_Montelena_AutoML.copy()
Chateau_Montelena_AutoMLB = Chateau_Montelena_AutoML.copy()
Each row represents a wine; each column contains the wine's attributes, such as type, sulphates, chlorides, etc., and the target label 'quality'.
Problem Statement

Binary Classification: predict the quality of wine as Low or High.
Multiclass Classification: predict the quality of wine as Low, Medium or High.
Regression: predict the quality of wine between 3 and 9 based on the independent predictor variables.

Dataset - Wine Quality
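The PyCaret modeling cells fall beyond this excerpt; a minimal sketch of the regression setup under PyCaret's 2.x API (consistent with the pinned scikit-learn 0.23.2 in the install log above; the session_id is an assumption):

from pycaret.regression import setup, compare_models

# Let PyCaret preprocess the data, then train and rank candidate regressors
reg = setup(data=Chateau_Montelena_AutoML, target='quality', session_id=42, silent=True)
best_model = compare_models()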
Chateau_Montelena_AutoML.describe()
       fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide
count    6487.000000       6489.000000  6494.000000     6495.000000  6495.000000         6497.000000               6497.0
mean        7.216579          0.339691     0.318722        5.444326     0.056042           30.525319                115.7
std         1.296750          0.164649     0.145265        4.758125     0.035036           17.749400                 56.5
min         3.800000          0.080000     0.000000        0.600000     0.009000            1.000000                  6.0
25%         6.400000          0.230000     0.250000        1.800000     0.038000           17.000000                 77.0
50%         7.000000          0.290000     0.310000        3.000000     0.047000           29.000000                118.0
75%         7.700000          0.400000     0.390000        8.100000     0.065000           41.000000                156.0
max        15.900000          1.580000     1.660000       65.800000     0.611000          289.000000                440.0
(output clipped at the right edge: the total sulfur dioxide values are truncated, and the density, pH, sulphates, alcohol and quality columns are cut off)
Dataset Shape: (6497, 13)
Name dtypes Missing Uniques Sample Value Entropy
0 type object 0 2 white 0.24
1 fixed acidity float64 10 106 7.0 1.65
2 volatile acidity float64 8 187 0.27 1.79
3 citric acid float64 3 89 0.36 1.70
4 residual sugar float64 2 316 20.7 2.08
5 chlorides float64 2 214 0.045 1.90
6 free sulfur dioxide float64 0 135 45.0 1.82
7 total sulfur dioxide float64 0 276 170.0 2.32
8 density float64 0 998 1.001 2.70
9 pH float64 9 108 3.0 1.81
10 sulphates float64 4 111 0.45 1.72
11 alcohol float64 0 111 8.8 1.66
12 quality int64 0 7 6 0.55
def tableinfo(Chateau_Montelena_AutoML):
    print(f"Dataset Shape: {Chateau_Montelena_AutoML.shape}")
    summary = pd.DataFrame(Chateau_Montelena_AutoML.dtypes, columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name', 'dtypes']]
    summary['Missing'] = Chateau_Montelena_AutoML.isnull().sum().values
    summary['Uniques'] = Chateau_Montelena_AutoML.nunique().values
    summary['Sample Value'] = Chateau_Montelena_AutoML.loc[0].values
    # Shannon entropy of each column's value distribution; this line is
    # reconstructed from a truncated printout -- base=10 matches the Entropy
    # values shown in the summary table above
    for name in summary['Name'].value_counts().index:
        summary.loc[summary['Name'] == name, 'Entropy'] = round(
            stats.entropy(Chateau_Montelena_AutoML[name].value_counts(normalize=True), base=10), 2)
    return summary
tableinfo(Chateau_Montelena_AutoML)
Entropy measures the randomness, or disorder, of a variable's value distribution: a value near zero means one category dominates, while higher values mean the observations are spread more evenly across many values.
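As a quick illustration of how the Entropy column above is computed, here is a minimal sketch using scipy.stats.entropy with base 10, matching the values in the summary table:
# Entropy (base 10) of the 'type' column's value distribution.
# The near-zero result reflects that one category (white) dominates.
probs = Chateau_Montelena_AutoML['type'].value_counts(normalize=True)
print(round(stats.entropy(probs, base=10), 2))  # ~0.24, per the summary table above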
Actions required for data preparation:
Converting 'type' to an integer data type, i.e. encoding the categorical feature (a sketch follows below).
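A minimal sketch of that encoding step; the 0/1 assignment below is an assumption for illustration, not necessarily the mapping applied later in the report:
# Hypothetical encoding of the 'type' column ('white'/'red') as integers;
# the 0/1 assignment is assumed, not taken from the report.
Chateau_Montelena_AutoML['type'] = Chateau_Montelena_AutoML['type'].map({'white': 0, 'red': 1}).astype(int)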
g = sns.histplot(Chateau_Montelena_AutoML['sulphates'], kde=True, ax=ax[2][2])
g = sns.histplot(Chateau_Montelena_AutoML['alcohol'], kde=True, ax=ax[3][0])
Observation :
These numerical variables do not follow a normal distribution. Their separate, independent peaks suggest that the sample mixes distinct sub-populations, plausibly the red and white wines, since the dataset combines both types.
Action :
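As an illustrative aside (not a step shown in the report), one way to probe whether the separate peaks come from mixing the two wine types is to overlay per-type histograms:
# Illustrative check, assuming the peaks come from mixing red and white wines:
# overlay per-type histograms for one of the multimodal variables.
sns.histplot(data=Chateau_Montelena_AutoML, x='volatile acidity', hue='type', kde=True)
plt.title('volatile acidity by wine type')
plt.show()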
INFO:logs:display_container: 1
INFO:logs:Pipeline(memory=None,
steps=[('dtypes',
DataTypes_Auto_infer(categorical_features=[],
display_types=False, features_todrop=[],
id_columns=[],
ml_usecase='classification',
numerical_features=[], target='quality',
time_features=[])),
('imputer',
Simple_Imputer(categorical_strategy='not_available',
fill_value_categorical=None,
fill_value_numerical=None,
numeric_str...
('dummy', Dummify(target='quality')),
('fix_perfect', Remove_100(target='quality')),
('clean_names', Clean_Colum_Names()),
('feature_select', 'passthrough'),
('fix_multi',
Fix_multicollinearity(correlation_with_target_preference=None,
correlation_with_target_threshold=0.0,
target_variable='quality',
threshold=0.9)),
('dfs', 'passthrough'), ('pca', 'passthrough')],
verbose=False)
INFO:logs:setup() succesfully completed......................................
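The log above is the tail of PyCaret's setup() output. As a minimal sketch, a setup() call along these lines would produce such a pipeline (the report's exact arguments are not shown here, so the values below are assumptions):
# PyCaret 2.x classification setup sketch; argument values are assumed for
# illustration and are not necessarily those used in the report.
from pycaret.classification import setup, compare_models
clf_setup = setup(data=Chateau_Montelena_AutoMLB,  # binary-labelled copy of the data
                  target='quality',                # column to predict
                  remove_multicollinearity=True,   # matches the Fix_multicollinearity step
                  multicollinearity_threshold=0.9, # matches threshold=0.9 in the log
                  silent=True)                     # skip the interactive dtype confirmation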
PyCaret provides the following metrics for comparing model performance in the compare_models() function:
Confusion Matrix is a performance measurement for classification problems where the output can be two or more classes. It is a table of all combinations of predicted and actual values; for a binary problem it has four cells (true positives, false positives, false negatives and true negatives).
AUC, the Area Under the ROC Curve, provides a single score that summarizes the ROC plot and can be used to compare models. A no-skill classifier scores 0.5, whereas a perfect classifier scores 1.0.
F1 score is the harmonic mean of Precision and Recall, a single score that seeks to balance both concerns.
F1 = (2 * Precision * Recall) / (Precision + Recall)
Accuracy is the fraction of correct predictions out of all predictions.
Accuracy = Correct Predictions / Total Predictions
MCC (Matthews Correlation Coefficient) produces a high score only if the prediction obtained good results in all four confusion matrix categories (true positives, false negatives, true negatives and false positives), proportionally both to the number of positive and the number of negative elements in the dataset.
Precision summarizes the fraction of examples assigned the positive class that actually belong to the positive class.
Precision = TruePositive / (TruePositive + FalsePositive)
Cohen's Kappa statistic measures the level of agreement between two raters (here, the model's predictions and the true labels) classifying items into mutually exclusive categories, corrected for agreement expected by chance.
Kappa = (Observed Agreement - Chance Agreement) / (1 - Chance Agreement)
Recall summarizes how well the positive class was predicted.
Recall = TruePositive / (TruePositive + FalseNegative)
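To make these definitions concrete, here is a small self-contained sketch computing each metric with scikit-learn on made-up binary labels (these are not the report's actual model predictions):
# Illustrative metric computation on made-up binary labels.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, matthews_corrcoef,
                             cohen_kappa_score, confusion_matrix)
y_true  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]                        # actual classes
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]                        # predicted classes
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1, 0.95, 0.35]  # predicted probabilities
print(confusion_matrix(y_true, y_pred))              # [[TN, FP], [FN, TP]]
print('Accuracy :', accuracy_score(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))
print('Recall   :', recall_score(y_true, y_pred))
print('F1       :', f1_score(y_true, y_pred))
print('AUC      :', roc_auc_score(y_true, y_score))  # uses probabilities, not labels
print('MCC      :', matthews_corrcoef(y_true, y_pred))
print('Kappa    :', cohen_kappa_score(y_true, y_pred))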
Searching for the best models
Model Comparison & Evaluation
best_modelB = compare_models()
#Plotting the confusion Matrix
plot_model(estimator = tuned_modelB, plot = 'confusion_matrix')
Observation :
We can see a strong diagonal, indicating that most predictions match the true class.
#plotting decision boundary
plot_model(estimator = tuned_modelB, plot = 'boundary', use_train_data = True)
INFO:logs:Visual Rendered Successfully
INFO:logs:plot_model() succesfully completed....................................
Observation:
We can see clear separation between the classes, with very few misclassifications.
plot_model(tuned_modelB, plot = 'parameter')