Problem Statement
We attempt to predict wine quality for a set of wines from a given set of predictor variables, keeping in mind that wine quality is ultimately a subjective measurement. The report is an EDA, or data-driven story, combining a range of graphs and visualizations with an attribute-based quality prediction. The question we address is: "What is the quality of the wine (an ordinal value from 3 to 9)?" This is a regression task.
Objective
Perform Data Cleaning, Pre-processing and Feature Selection
Apply ML models to predict wine quality
Use Auto-ML to determine the best model
Use SHAP library to determine the impact of the predictor variables
ML Data Cleaning and Feature Selection
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy.stats import norm
from scipy import stats
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from math import sqrt
from sklearn.metrics import r2_score
Cabernet Sauvignon is known as the king of red wines.
Cabernet_Sauvignon = pd.read_csv('https://raw.githubusercontent.com/Mohanmanjunatha/DA
Cabernet_Sauvignon.head()

    type  fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density
0  white            7.0              0.27         0.36            20.7      0.045                 45.0                 170.0   1.0010
1  white            6.3              0.30         0.34             1.6      0.049                 14.0                 132.0   0.9940
2  white            8.1              0.28         0.40             6.9      0.050                 30.0                  97.0   0.9951
3  white            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956
4  white            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956
(remaining columns truncated in this view)
Cabernet_Sauvignon.shape
(6497, 13)
What are the data types? (Only numeric and categorical)
Cabernet_Sauvignon.dtypes
type object
fixed acidity float64
volatile acidity float64
citric acid float64
residual sugar float64
chlorides float64
free sulfur dioxide float64
total sulfur dioxide float64
density float64
pH float64
sulphates float64
alcohol float64
quality int64
dtype: object
The dataset has 1 Categorical and 12 Numerical Features.
What features are in the dataset?
fixed acidity. Fixed acidity is due to the presence of non-volatile acids in wine, for example tartaric, citric or malic acid. These acids contribute to the balance of the wine's taste and bring freshness to it.
Volatile acidity is the part of the acidity in wine that can be picked up by the nose, unlike the acids that are perceptible on the palate (discussed above). Volatile acidity, or in other words the souring of wine, is one of the most common defects.
citric acid - permitted in winemaking by OIV Resolution No. 23/2000. It can be used in three ways: for acid treatment of wine (increasing acidity), for stabilizing wine, and for cleaning filters from possible fungal and mold infections.
residual sugar is the grape sugar that has not been fermented into alcohol.
chlorides. The structure of a wine also depends on its mineral content, which determines taste sensations such as salinity (sapidità). Anions of inorganic acids (chlorides, sulfates, sulfites, ...), anions of organic acids and metal cations (potassium, sodium, magnesium, ...) are found in wine. Their content depends mainly on the climatic zone (cold or warm region, salty soils depending on proximity to the sea), oenological practices, and the storage and aging of the wine.
free sulfur dioxide, total sulfur dioxide - Sulfur dioxide (sulfur oxide, food additive E220, SO2) is used as a preservative due to its antioxidant and antimicrobial properties. Molecular SO2 is an extremely important antiseptic, suppressing microorganisms (including wild yeasts) that can cause wine spoilage.
Density - The density of wine can be either lower or higher than that of water. Its value is determined primarily by the concentration of alcohol and sugar. White, rosé and red wines are generally light: their density at 20°C is below 998.3 kg/m3.
pH is a measure of the acidity of wine. Most wines have a pH between 2.9 and 4.2. The lower the pH, the more acidic the wine; the higher the pH, the less acidic the wine.
sulphates - Sulfites are a natural result of yeast fermenting the sugar in wine into alcohol; that is, a wine entirely free of sulfites is practically impossible.
alcohol - The alcohol content of a wine depends on many factors: the grape variety and the amount of sugar in the berries, the production technology and the growing conditions. Wines vary greatly in strength: this parameter ranges from about 4.5 to 22 depending on the category.
quality is the target.
Are there missing values?
Cabernet_Sauvignon.isna().sum()
type 0
fixed acidity 10
volatile acidity 8
citric acid 3
residual sugar 2
chlorides 2
free sulfur dioxide 0
total sulfur dioxide 0
density 0
pH 9
sulphates 4
alcohol 0
quality 0
dtype: int64
Which independent variables have missing data? How much?
fixed acidity - 10
volatile acidity - 8
citric acid - 3
residual sugar - 2
chlorides - 2
pH - 9
sulphates - 4
The features above have the listed numbers of missing values. Since their distributions are fairly symmetric, mean replacement would be a reasonable choice.
Before examining the quality feature, the categorical variable is encoded with cat.codes. This makes the subsequent analysis easier and more consistent.
Cabernet_Sauvignon['type'] = Cabernet_Sauvignon['type'].astype("category").cat.codes
Cabernet_Sauvignon.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 type 6497 non-null int8
1 fixed acidity 6487 non-null float64
2 volatile acidity 6489 non-null float64
3 citric acid 6494 non-null float64
4 residual sugar 6495 non-null float64
5 chlorides 6495 non-null float64
6 free sulfur dioxide 6497 non-null float64
7 total sulfur dioxide 6497 non-null float64
8 density 6497 non-null float64
9 pH 6488 non-null float64
10 sulphates 6493 non-null float64
11 alcohol 6497 non-null float64
12 quality 6497 non-null int64
dtypes: float64(11), int64(1), int8(1)
memory usage: 615.6 KB
4/26/23, 1:22 PM Wine_Quality_report (2).ipynb - Colaboratory
https://colab.research.google.com/drive/1wRV8eEd_0fvwQr6MUuoR5ibqWm6wLzAI#scrollTo=3ed695ef&printMode=true 5/93
1. Mean
# mean = Cabernet_Sauvignon["fixed acidity"].mean()
# Cabernet_Sauvignon["fixed acidity"].fillna(mean,inplace=True)
# Cabernet_Sauvignon["fixed acidity"].isnull().sum()
# mean2 = Cabernet_Sauvignon["volatile acidity"].mean()
# Cabernet_Sauvignon["volatile acidity"].fillna(mean2,inplace=True)
# Cabernet_Sauvignon["volatile acidity"].isnull().sum()
# mean3 = Cabernet_Sauvignon["citric acid"].mean()
# Cabernet_Sauvignon["citric acid"].fillna(mean3,inplace=True)
# Cabernet_Sauvignon["citric acid"].isnull().sum()
# mean4 = Cabernet_Sauvignon["residual sugar"].mean()
# Cabernet_Sauvignon["residual sugar"].fillna(mean4,inplace=True)
# Cabernet_Sauvignon["residual sugar"].isnull().sum()
# mean5 = Cabernet_Sauvignon["chlorides"].mean()
# Cabernet_Sauvignon["chlorides"].fillna(mean5,inplace=True)
# Cabernet_Sauvignon["chlorides"].isnull().sum()
# mean6 = Cabernet_Sauvignon["pH"].mean()
# Cabernet_Sauvignon["pH"].fillna(mean6,inplace=True)
# Cabernet_Sauvignon["pH"].isnull().sum()
# mean7 = Cabernet_Sauvignon["sulphates"].mean()
# Cabernet_Sauvignon["sulphates"].fillna(mean7,inplace=True)
# Cabernet_Sauvignon["sulphates"].isnull().sum()
# Cabernet_Sauvignon.isnull().sum()
2. KNN Imputer
#Creating a separate dataframe for performing the KNN imputation
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler
imputer = KNNImputer(n_neighbors=5)
Cabernet_Sauvignon = pd.DataFrame(imputer.fit_transform(Cabernet_Sauvignon), columns = Cabernet_Sauvignon.columns)
Cabernet_Sauvignon.isnull().sum()
type 0
fixed acidity 0
volatile acidity 0
citric acid 0
residual sugar 0
chlorides 0
free sulfur dioxide 0
total sulfur dioxide 0
density 0
pH 0
sulphates 0
alcohol 0
quality 0
dtype: int64
What are the likely distributions of the numeric variables, and what are the distributions of the predictor variables?
In the plots below, a good fit to the overlaid normal curve indicates that normality is a reasonable approximation.
Distribution of Predictors
Cabernet_SauvignonColumnList = Cabernet_Sauvignon.columns
for i in Cabernet_SauvignonColumnList:
    plt.figure(figsize= (5,5))
    sns.distplot(Cabernet_Sauvignon[i], fit = norm)
    plt.title(f"Distribution of {i} (checking normal distribution fit)", size = 15, weight = 'bold')
4/26/23, 1:22 PM Wine_Quality_report (2).ipynb - Colaboratory
https://colab.research.google.com/drive/1wRV8eEd_0fvwQr6MUuoR5ibqWm6wLzAI#scrollTo=3ed695ef&printMode=true 8/93
type : categorical values
fixed acidity : normal distribution
volatile acidity : almost normal distribution with a bit of right-skewness
citric acid : almost normal distribution with a bit of edge-peak
residual sugar : almost normal distribution with a bit of right-skewness
chlorides : almost normal distribution with a bit of right-skewness
free sulfur dioxide : normal distribution
total sulfur dioxide : almost normal distribution with a bit of edge-peak
sulphates : normal distribution
alcohol : almost normal distribution with a bit of right-skewness
pH : normal distribution
density : normal distribution
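A quantitative complement to the visual check above (not part of the original notebook) is a normality test on each column; a minimal sketch using scipy's D'Agostino-Pearson test:

from scipy.stats import normaltest
# Small p-values argue against strict normality, even when the fitted curve above looks reasonable.
for col in Cabernet_Sauvignon.columns:
    stat, p = normaltest(Cabernet_Sauvignon[col])
    print(f"{col:22s} statistic={stat:8.2f}  p-value={p:.4f}")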
Do the ranges of the predictor variables make sense?
#Range of each column
Cabernet_Sauvignon.max() - Cabernet_Sauvignon.min()
Cabernet_Sauvignon.describe()

              type  fixed acidity  volatile acidity  citric acid  residual sugar    chlorides
count  6497.000000    6497.000000       6497.000000  6497.000000     6497.000000  6497.000000
mean      0.753886       7.216501          0.339634     0.318675        5.445704     0.056041
std       0.430779       1.295928          0.164563     0.145267        4.758043     0.035032
min       0.000000       3.800000          0.080000     0.000000        0.600000     0.009000
25%       1.000000       6.400000          0.230000     0.250000        1.800000     0.038000
50%       1.000000       7.000000          0.290000     0.310000        3.000000     0.047000
75%       1.000000       7.700000          0.400000     0.390000        8.100000     0.065000
max       1.000000      15.900000          1.580000     1.660000       65.800000     0.611000
(remaining columns truncated in this view)
The ranges make sense for each attribute of a wine. The range of the "total sulfur dioxide" variable is large, which implies high variability in its distribution.
Do the training and test sets have the same data?
By using test_train_split, the train and test sets are split at a ratio of 80/20 from the same dataset.
But both sets are distinct and is not seen by the model during the training phase. Although the
distribution of each attribute is proportional in both train and test sets.
Phase 1
Cabernet_Sauvignon_x = Cabernet_Sauvignon[['type','fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH','sulphates','alcohol']]
Cabernet_Sauvignon_y = Cabernet_Sauvignon['quality']
# .iloc[:,:12], Cabernet_Sauvignon.iloc[:,-1]
Cabernet_Sauvignon_y.head()
0 6.0
1 6.0
2 6.0
3 6.0
4 6.0
Name: quality, dtype: float64
scaler = StandardScaler()
# #Dataframe Cabernet_Sauvignon with outliers
Cabernet_Sauvignon_x = scaler.fit_transform(Cabernet_Sauvignon_x)
plt.figure(figsize=(20,7))
ax = sns.boxplot(data=Cabernet_Sauvignon_x)
ax.set_xticklabels(Cabernet_SauvignonColumnList[:12])
[Boxplot of the 12 standardized predictor columns, outliers still present]
#Splitting the dataset with outliers into Train and Test sets at 80-20 proportion
X_train, X_test, y_train, y_test = train_test_split(Cabernet_Sauvignon_x, Cabernet_Sauvignon_y, test_size=0.2)
X_train.shape
(5197, 12)
X_test.shape
(1300, 12)
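One way to check that the train and test splits follow similar distributions (a sketch not present in the original notebook) is a two-sample Kolmogorov-Smirnov test per feature:

from scipy.stats import ks_2samp
# Large p-values suggest the train/test marginal distributions of a feature are indistinguishable.
for j, col in enumerate(Cabernet_SauvignonColumnList[:12]):
    stat, p = ks_2samp(X_train[:, j], X_test[:, j])
    print(f"{col:22s} KS statistic={stat:.3f}  p-value={p:.3f}")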
Model Building
Linear Regression Model
##Linear Regression
lr = LinearRegression(copy_X= True, fit_intercept = True, normalize = True)
lr.fit(X_train, y_train)
lr_pred= lr.predict(X_test)
print('--Phase-1--')
mae1 = mean_absolute_error(y_test, lr_pred)
print('MAE: %f'% mae1)
rmse1= np.sqrt(mean_squared_error(y_test, lr_pred))
print('RMSE: %f'% rmse1)
r21 = r2_score(y_test, lr_pred)
print('R2: %f' % r21)
--Phase-1--
MAE: 0.545152
RMSE: 0.686665
R2: 0.340363
(sklearn FutureWarning from linear_model/_base.py: the `normalize` parameter of LinearRegression is deprecated; the suggested replacement is a Pipeline with a StandardScaler preprocessing step, with sample_weight then passed as a fit parameter of the pipeline.)
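As the warning suggests, the deprecated normalize argument can be replaced by an explicit preprocessing step. A minimal sketch of the equivalent pipeline (illustrative only, not the code run above):

from sklearn.pipeline import make_pipeline
# Hypothetical equivalent of LinearRegression(normalize=True): scale inside a pipeline.
lr_pipe = make_pipeline(StandardScaler(), LinearRegression(copy_X=True, fit_intercept=True))
lr_pipe.fit(X_train, y_train)
lr_pipe_pred = lr_pipe.predict(X_test)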
Three metrics will be calculated to evaluate the predictions.
Mean Absolute Error (MAE) shows the average difference between predictions and actual values.
Root Mean Square Error (RMSE) shows how accurately the model predicts the response, penalizing large errors more heavily.
R^2 is calculated as a goodness-of-fit measure.
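For reference, with $y_i$ the actual quality, $\hat{y}_i$ the prediction and $n$ the number of test samples, the standard definitions are:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}.$$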
plt.figure(figsize=(5, 7))
ax = sns.distplot(y_test, hist=False, color="r", label="Actual Value")
sns.distplot(lr_pred, hist=False, color="b", label="Fitted Values" , ax=ax)
plt.title('Actual(red) vs Fitted(blue) Values for Quality')
plt.show()
plt.close()
Random Forest
from sklearn.ensemble import RandomForestRegressor
model2 = RandomForestRegressor(random_state=1, n_estimators=1000)
model2.fit(X_train, y_train)
Rm_pred = model2.predict(X_test)
print('--Phase-1--')
mae2 = mean_absolute_error(y_test, Rm_pred)
print('MAE: %f'% mae2)
rmse2 = np.sqrt(mean_squared_error(y_test, Rm_pred))
print('RMSE: %f'% rmse2 )
r22 = r2_score(y_test, Rm_pred)
print('R2: %f' % r22)
--Phase-1--
MAE: 0.401750
RMSE: 0.561165
R2: 0.559449
plt.figure(figsize=(5, 7))
ax = sns.distplot(y_test, hist=False, color="r", label="Actual Value")
sns.distplot(Rm_pred, hist=False, color="b", label="Fitted Values" , ax=ax)
plt.title('Actual(red) vs Fitted(blue) Values for Quality')
plt.show()
plt.close()
Decision Tree
from sklearn.tree import DecisionTreeRegressor
model3 = DecisionTreeRegressor(max_depth=6)
model3.fit(X_train, y_train)
Dt_pred = model3.predict(X_test)
print('--Phase-1--')
mae3 = mean_absolute_error(y_test, Dt_pred)
print('MAE: %f'% mae3)
rmse3 = np.sqrt(mean_squared_error(y_test, Dt_pred))
print('RMSE: %f'% rmse3)
r23 = r2_score(y_test, Dt_pred)
print('R2: %f' % r23)
--Phase-1--
MAE: 0.541020
RMSE: 0.696854
R2: 0.320642
plt.figure(figsize=(5, 7))
ax = sns.distplot(y_test, hist=False, color="r", label="Actual Value")
sns.distplot(Dt_pred, hist=False, color="b", label="Fitted Values" , ax=ax)
plt.title('Actual(red) vs Fitted(blue) Values for Quality')
plt.show()
plt.close()
Phase 2
Are the predictor variables independent of all the other predictor variables?
Multicollinearity
Multicollinearity measures the degree of linear relationship between explanatory variables in a multiple regression. If multicollinearity occurs, the highly related input variables should be eliminated from the model.
In this notebook, multicollinearity is checked by plotting a correlation heatmap (see also the VIF sketch below).
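A complementary, quantitative check (not part of the original notebook) is the variance inflation factor (VIF); values above roughly 5-10 are usually taken to indicate problematic collinearity. A minimal sketch using statsmodels:

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Hypothetical check: VIF of each predictor regressed on all the others.
X_vif = sm.add_constant(Cabernet_Sauvignon.iloc[:, 0:12])
for j, col in enumerate(X_vif.columns):
    if col != 'const':
        print(f"{col:22s} VIF = {variance_inflation_factor(X_vif.values, j):.2f}")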
Which independent variables are useful for predicting the target (dependent variable)? (Use at least three methods.) For a regression model, the most useful independent variables can be statistically determined using the following methods:
f_regression
mutual_info_regression
Correlation Matrix with Heatmap
Each of these methods is applied to the dataset below.
1. f_regression
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression, mutual_info_regression
X = Cabernet_Sauvignon.iloc[:,0:12]
y = Cabernet_Sauvignon.iloc[:,-1]
# y=y.astype('int')
# y = pd.DataFrame(y)
# y.head(10)
# y.describe()
#Applying SelectKBest class to extract top features
# feature selection
f_selector = SelectKBest(score_func=f_regression, k='all')
# learn relationship from training data
f_selector.fit(X_train, y_train)
# transform train input data
X_train_fs = f_selector.transform(X_train)
# transform test input data
X_test_fs = f_selector.transform(X_test)
# Plot the scores for the features
plt.rcParams["figure.figsize"] = (30,10)
plt.bar([i for i in range(len(f_selector.scores_))], f_selector.scores_)
plt.xlabel("feature index")
plt.xticks([i for i in range(len(f_selector.scores_))], Cabernet_SauvignonColumnList[:12])
plt.ylabel("F-value (transformed from the correlation values)")
plt.show()
# bestFeatures = SelectKBest(score_func= chi2, k =12)
# fit = bestFeatures.fit(X,y)
From the F-values we can see that volatile acidity, chlorides, density and alcohol have more importance than the others.
2. Mutual information metric
# feature selection
f_selector = SelectKBest(score_func=mutual_info_regression, k='all')
# learn relationship from training data
f_selector.fit(X_train, y_train)
# transform train input data
X_train_fs = f_selector.transform(X_train)
# transform test input data
X_test_fs = f_selector.transform(X_test)
# Plot the scores for the features
plt.bar([i for i in range(len(f_selector.scores_))], f_selector.scores_, align = 'center')
plt.xlabel("feature index")
plt.xticks([i for i in range(len(f_selector.scores_))], Cabernet_SauvignonColumnList[:12])
plt.ylabel("Estimated MI value")
# plt.rcParams["figure.figsize"] = (30,10)
plt.show()
3. Correlation Matrix with HeatMap
corrmat = Cabernet_Sauvignon.corr()
top_corr_features = corrmat.index
plt.figure(figsize = (20,20))
#plot heatmap
g = sns.heatmap(Cabernet_Sauvignon[top_corr_features].corr(), annot= True, cmap='RdYlGn')
By looking at the correlation matrix above we can gain the following insights:
1. volatile acidity and chlorides are highly (-ve) correlated with type.
2. alcohol is highly (-ve) correlated with density.
3. total sulfur dioxide is highly (+ve) correlated with type.
By looking at the three feature importance methods above, we can see that volatile acidity, chlorides, density and alcohol are consistently the most important features for predicting the value of quality.
Outlier Treatment
Q1fixed,Q3fixed = np.percentile(Cabernet_Sauvignon['fixed acidity'] , [25,75])
IQRfixed = Q3fixed - Q1fixed
Ufixed_acidity = Q3fixed + 1.5*IQRfixed
Lfixed_acidity = Q1fixed - 1.5*IQRfixed
print(Ufixed_acidity)
print(Lfixed_acidity)
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['fixed acidity'] < Lfixed_acidity].index, inplace=True)
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['fixed acidity'] > Ufixed_acidity].index, inplace=True)
9.65
4.450000000000001
Q1volatile,Q3volatile = np.percentile(Cabernet_Sauvignon['volatile acidity'] , [25,75])
IQRvolatile = Q3volatile - Q1volatile
Uvolatile_acidity = Q3volatile + 1.5*IQRvolatile
Lvolatile_acidity= Q1volatile - 1.5*IQRvolatile
print(Uvolatile_acidity)
print(Lvolatile_acidity)
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['volatile acidity'] < Lvolatile_acidity].index, inplace=True)
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['volatile acidity'] > Uvolatile_acidity].index, inplace=True)
0.645
-0.035
Q1citric,Q3citric = np.percentile(Cabernet_Sauvignon['citric acid'] , [25,75])
IQRcitric = Q3citric - Q1citric
Ucitric_acid = Q3citric + 1.5*IQRcitric
Lcitric_acid= Q1citric - 1.5*IQRcitric
print(Ucitric_acid)
print(Lcitric_acid)
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['citric acid'] < Lcitric_acid].index, inplace=True)
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['citric acid'] > Ucitric_acid].index, inplace=True)
0.56
0.08000000000000002
Q1residual,Q3residual = np.percentile(Cabernet_Sauvignon['residual sugar'] , [25,75])
IQRresidual = Q3residual - Q1residual
Uresidual_sugar = Q3residual + 1.5*IQRresidual
Lresidual_sugar= Q1residual - 1.5*IQRresidual
print(Uresidual_sugar)
print(Lresidual_sugar)
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['residual sugar'] < Lresidual_sugar].index, inplace=True)
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['residual sugar'] > Uresidual_sugar].index, inplace=True)
19.049999999999997
-8.549999999999999
Q1chlorides,Q3chlorides = np.percentile(Cabernet_Sauvignon['chlorides'] , [25,75])
IQRchlorides = Q3chlorides - Q1chlorides
Uchlorides = Q3chlorides + 1.5*IQRchlorides
Lchlorides = Q1chlorides - 1.5*IQRchlorides
print(Uchlorides)
print(Lchlorides)
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['chlorides'] < Lchlorides].index, inplace=True)
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['chlorides'] > Uchlorides].index, inplace=True)
0.081
0.008999999999999994
Q1free_sulfur,Q3free_sulfur = np.percentile(Cabernet_Sauvignon['free sulfur dioxide'] , [25,75])
IQRfree_sulfur = Q3free_sulfur - Q1free_sulfur
Ufree_sulfur_dioxide = Q3free_sulfur + 1.5*IQRfree_sulfur
Lfree_sulfur_dioxide= Q1free_sulfur - 1.5*IQRfree_sulfur
print(Ufree_sulfur_dioxide)
print(Lfree_sulfur_dioxide)
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['free sulfur dioxide'] < Lfree_sulfur_dioxide].index, inplace=True)
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['free sulfur dioxide'] > Ufree_sulfur_dioxide].index, inplace=True)
78.5
-13.5
Q1total_sulfur,Q3total_sulfur = np.percentile(Cabernet_Sauvignon['total sulfur dioxide'] , [25,75])
IQRtotal_sulfur = Q3total_sulfur - Q1total_sulfur
Utotal_sulfur_dioxide = Q3total_sulfur + 1.5*IQRtotal_sulfur
Ltotal_sulfur_dioxide= Q1total_sulfur - 1.5*IQRtotal_sulfur
print(Utotal_sulfur_dioxide)
print(Ltotal_sulfur_dioxide)
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['total sulfur dioxide'] < Ltotal_sulfur_dioxide].index, inplace=True)
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['total sulfur dioxide'] > Utotal_sulfur_dioxide].index, inplace=True)
254.0
6.0
Q1sulphates,Q3sulphates = np.percentile(Cabernet_Sauvignon['sulphates'] , [25,75])
IQRsulphates = Q3sulphates - Q1sulphates
Usulphates = Q3sulphates + 1.5*IQRsulphates
Lsulphates= Q1sulphates - 1.5*IQRsulphates
print(Usulphates)
print(Lsulphates)
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['sulphates'] < Lsulphates].index, inplace=True)
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['sulphates'] > Usulphates].index, inplace=True)
0.7949999999999999
0.19500000000000003
Q1alcohol,Q3alcohol = np.percentile(Cabernet_Sauvignon['alcohol'] , [25,75])
IQRalcohol = Q3alcohol - Q1alcohol
Ualcohol = Q3alcohol + 1.5*IQRalcohol
Lalcohol= Q1alcohol - 1.5*IQRalcohol
print(Ualcohol)
print(Lalcohol)
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['alcohol'] < Lalcohol].index, inplace=True)
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['alcohol'] > Ualcohol].index, inplace=True)
14.25
6.6499999999999995
Q1pH,Q3pH = np.percentile(Cabernet_Sauvignon['pH'] , [25,75])
IQRpH = Q3pH - Q1pH
UpH = Q3pH + 1.5*IQRpH
LpH= Q1pH - 1.5*IQRpH
print(UpH)
print(LpH)
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['pH'] < LpH].index, inplace=True)
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['pH'] > UpH].index, inplace=True)
3.5999999999999996
2.8000000000000007
Q1density,Q3density = np.percentile(Cabernet_Sauvignon['density'] , [25,75])
IQRdensity = Q3density - Q1density
Udensity = Q3density + 1.5*IQRdensity
Ldensity= Q1density - 1.5*IQRdensity
print(Udensity)
print(Ldensity)
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['density'] < Ldensity].index, inplace=True)
Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['density'] > Udensity].index, inplace=True)
1.00267
0.9851500000000002
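The repetitive per-column blocks above could equivalently be collapsed into a single helper; a minimal sketch (not the code actually run in this notebook), assuming the same 1.5*IQR rule is applied column by column:

def drop_iqr_outliers(df, columns, k=1.5):
    # Drop rows falling outside [Q1 - k*IQR, Q3 + k*IQR] for each column in turn.
    for col in columns:
        q1, q3 = np.percentile(df[col], [25, 75])
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        df.drop(df[(df[col] < lower) | (df[col] > upper)].index, inplace=True)
    return df

# Hypothetical usage: all numeric predictors except 'type' and 'quality'.
# drop_iqr_outliers(Cabernet_Sauvignon, Cabernet_SauvignonColumnList[1:12])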
Cabernet_Sauvignon.describe()
              type  fixed acidity  volatile acidity  citric acid  residual sugar    chlorides
count  4598.000000    4598.000000       4598.000000  4598.000000     4598.000000  4598.000000
mean      0.921923       6.911398          0.284059     0.320317        5.939374     0.044548
std       0.268323       0.832672          0.101024     0.089928        4.743293     0.012699
min       0.000000       4.700000          0.080000     0.090000        0.600000     0.009000
25%       1.000000       6.400000          0.210000     0.260000        1.800000     0.036000
50%       1.000000       6.800000          0.270000     0.310000        4.600000     0.043000
75%       1.000000       7.400000          0.330000     0.370000        8.987500     0.051000
max       1.000000       9.600000          0.645000     0.560000       18.950000     0.081000
(remaining columns truncated in this view)
# Cabernet_Sauvignon.drop([9])
Cabernet_Sauvignon_cleaned_x,Cabernet_Sauvignon_cleaned_y = Cabernet_Sauvignon.iloc[:,:12], Cabernet_Sauvignon.iloc[:,-1]
Cabernet_Sauvignon_cleaned_x.shape
(4598, 12)
Cabernet_Sauvignon_cleaned_x = scaler.fit_transform(Cabernet_Sauvignon_cleaned_x)
#Splitting the dataset after outlier treatment into Train and Test sets at 80-20 proportion
Xclean_train, Xclean_test, yclean_train, yclean_test = train_test_split(Cabernet_Sauvignon_cleaned_x, Cabernet_Sauvignon_cleaned_y, test_size=0.2)
plt.figure(figsize=(20,7))
ax = sns.boxplot(data=Cabernet_Sauvignon_cleaned_x)
ax.set_xticklabels(Cabernet_SauvignonColumnList[:12])
[Boxplot of the 12 standardized predictor columns after outlier removal]
##Linear Regression
# lr = LinearRegression(copy_X= True, fit_intercept = True, normalize = True)
lr.fit(Xclean_train, yclean_train)
lrclean_pred= lr.predict(Xclean_test)
# model2 = RandomForestRegressor(random_state=1, n_estimators=1000)
model2.fit(Xclean_train, yclean_train)
Rmclean_pred = model2.predict(Xclean_test)
model3.fit(Xclean_train, yclean_train)
Dtclean_pred = model3.predict(Xclean_test)
print('-------------Linear Regression-----------')
print('--Phase-1--')
print('MAE: %f'% mae1)
print('RMSE: %f'% rmse1)
print('R2: %f' % r21)
print('--Phase-2--')
print('MAE: %f'% mean_absolute_error(yclean_test, lrclean_pred))
print('RMSE: %f'% np.sqrt(mean_squared_error(yclean_test, lrclean_pred)))
print('R2: %f' % r2_score(yclean_test, lrclean_pred))
print('-------------Random forest-----------')
print('--Phase-1--')
print('MAE: %f'% mae2)
print('RMSE: %f'% rmse2)
print('R2: %f' % r22)
print('--Phase-2--')
print('MAE: %f'% mean_absolute_error(yclean_test, Rmclean_pred))
print('RMSE: %f'% np.sqrt(mean_squared_error(yclean_test, Rmclean_pred)))
print('R2: %f' % r2_score(yclean_test, Rmclean_pred))
print('-------------Decision Tree-----------')
print('--Phase-1--')
print('MAE: %f'% mae3)
print('RMSE: %f'% rmse3)
print('R2: %f' % r23)
print('--Phase-2--')
print('MAE: %f'% mean_absolute_error(yclean_test, Dtclean_pred))
print('RMSE: %f'% np.sqrt(mean_squared_error(yclean_test, Dtclean_pred)))
print('R2: %f' % r2_score(yclean_test, Dtclean_pred))
-------------Linear Regression-----------
--Phase-1--
MAE: 0.545152
RMSE: 0.686665
R2: 0.340363
--Phase-2--
MAE: 0.578749
RMSE: 0.748469
R2: 0.274277
-------------Random forest-----------
--Phase-1--
MAE: 0.401750
RMSE: 0.561165
R2: 0.559449
--Phase-2--
MAE: 0.438112
RMSE: 0.622107
R2: 0.498635
-------------Decision Tree-----------
--Phase-1--
MAE: 0.541020
RMSE: 0.696854
R2: 0.320642
--Phase-2--
MAE: 0.586013
RMSE: 0.756198
R2: 0.259211
The results show that the two phases give different prediction results, although the difference in each metric is not large. The MAE and RMSE values increase in Phase 2, which means the prediction error is higher in that phase, and model explainability has decreased by a small margin.
Remove outliers and keep outliers (does it have an effect on the final predictive model)? An MAE value of 0 would indicate no error, i.e. a perfect prediction. The results above show that all models retain substantial error, especially in Phase 2. RMSE gives an idea of how much error the system typically makes in its predictions, and it became worse after removing the outliers. R2 represents the proportion of the variance of the dependent variable that is explained by the independent variables.
Cabernet_Sauvignon_class = Cabernet_Sauvignon
Cabernet_Sauvignon_imputation= Cabernet_Sauvignon
quality_mapping = { 3 : 'Low', 4 : 'Low', 5: 'Medium', 6 : 'Medium', 7: 'Medium', 8 : 'High', 9 : 'High'}
Cabernet_Sauvignon_class['quality'] = Cabernet_Sauvignon_class['quality'].map(quality_mapping)
Cabernet_Sauvignon_class_x,Cabernet_Sauvignon_class_y = Cabernet_Sauvignon.iloc[:,:12], Cabernet_Sauvignon.iloc[:,-1]
Cabernet_Sauvignon_class_x = scaler.fit_transform(Cabernet_Sauvignon_class_x)
#Splitting the dataset after mapping quality to classes into Train and Test sets at 80-20 proportion
Xclass_train, Xclass_test, yclass_train, yclass_test = train_test_split(Cabernet_Sauvignon_class_x, Cabernet_Sauvignon_class_y, test_size=0.2)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# creating a RF classifier
clf = RandomForestClassifier(n_estimators = 1000)
# Training the model on the training dataset
# fit function is used to train the model using the training sets as parameters
clf.fit(Xclass_train, yclass_train)
# performing predictions on the test dataset
yclass_pred = clf.predict(Xclass_test)
# metrics are used to find accuracy or error
from sklearn import metrics
print()
# using metrics module for accuracy calculation
print("ACCURACY OF THE MODEL: ", metrics.accuracy_score(yclass_test, yclass_pred))
print(classification_report(yclass_test, yclass_pred))
ACCURACY OF THE MODEL: 0.9456521739130435
precision recall f1-score support
High 1.00 0.34 0.51 38
Low 0.00 0.00 0.00 24
Medium 0.95 1.00 0.97 858
accuracy 0.95 920
macro avg 0.65 0.45 0.49 920
weighted avg 0.92 0.95 0.93 920
quality_mapping_again = { 'Low':0, 'Medium':1, 'High':2}
yclass_test = yclass_test.map(quality_mapping_again)
yclass_pred_new = [s.replace('Medium', '1') for s in yclass_pred]
yclass_pred_new = [s.replace('Low', '0') for s in yclass_pred_new]
yclass_pred_new = [s.replace('High', '2') for s in yclass_pred_new]
yclass_pred_new = [int(item) for item in yclass_pred_new]
plt.figure(figsize=(5, 7))
ax = sns.distplot(yclass_test, hist=False, color="r", label="Actual Value")
sns.distplot(yclass_pred_new, hist=False, color="b", label="Fitted Values" , ax=ax)
plt.title('Actual vs Fitted Values for Quality')
plt.show()
plt.close()
As we can see, the accuracy of the classification model turned out to be much higher than any regression method used in Phase 1. It can be interpreted as follows: wine tastings are generally blind tastings, and even for the best wine connoisseurs it is very difficult to differentiate between a quality of 7 or 8. Also, judging the quality of a wine by how it tastes is highly subjective; often it is how the product is marketed and promoted that forms the general opinion of the target audience.
That being said, a good wine is a good wine. Based on the chemical composition of the wine itself, we can at least say whether it is a good or a bad one. So when the model is asked to place a wine in a category, it achieves much greater accuracy, because classifying into bins is easier than predicting a precise quality rating.
Data Imputation
Remove 1%, 5%, and 10% of your data randomly and impute the values back using at least three imputation methods. How well did the methods recover the missing values? That is, remove some data, check the % error on residuals for the numeric data, and check the bias and variance of the error.
Imputation 1
Cabernet_Sauvignon_imputation['1_percent'] = Cabernet_Sauvignon_imputation[['alcohol']]
Cabernet_Sauvignon_imputation['5_percent'] = Cabernet_Sauvignon_imputation[['alcohol']]
Cabernet_Sauvignon_imputation['10_percent'] = Cabernet_Sauvignon_imputation[['alcohol']]
Cabernet_Sauvignon_imputation.head()
   type  fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density
1   1.0            6.3              0.30         0.34             1.6      0.049                 14.0                 132.0   0.9940
2   1.0            8.1              0.28         0.40             6.9      0.050                 30.0                  97.0   0.9951
3   1.0            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956
4   1.0            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956
5   1.0            8.1              0.28         0.40             6.9      0.050                 30.0                  97.0   0.9951
(remaining columns truncated in this view)
def get_percent_missing(dataframe):
    percent_missing = dataframe.isnull().sum() * 100 / len(dataframe)
    missing_value_Cabernet_Sauvignon = pd.DataFrame({'column_name': dataframe.columns,
                                                     'percent_missing': percent_missing})
    return missing_value_Cabernet_Sauvignon

print(get_percent_missing(Cabernet_Sauvignon_imputation))
column_name percent_missing
type type 0.0
fixed acidity fixed acidity 0.0
volatile acidity volatile acidity 0.0
citric acid citric acid 0.0
residual sugar residual sugar 0.0
chlorides chlorides 0.0
free sulfur dioxide free sulfur dioxide 0.0
total sulfur dioxide total sulfur dioxide 0.0
density density 0.0
pH pH 0.0
sulphates sulphates 0.0
alcohol alcohol 0.0
quality quality 0.0
1_percent 1_percent 0.0
5_percent 5_percent 0.0
10_percent 10_percent 0.0
def create_missing(dataframe, percent, col):
    dataframe.loc[dataframe.sample(frac = percent).index, col] = np.nan

create_missing(Cabernet_Sauvignon_imputation, 0.01, '1_percent')
create_missing(Cabernet_Sauvignon_imputation, 0.05, '5_percent')
create_missing(Cabernet_Sauvignon_imputation, 0.1, '10_percent')
print(get_percent_missing(Cabernet_Sauvignon_imputation))
column_name percent_missing
type type 0.000000
fixed acidity fixed acidity 0.000000
volatile acidity volatile acidity 0.000000
citric acid citric acid 0.000000
residual sugar residual sugar 0.000000
chlorides chlorides 0.000000
free sulfur dioxide free sulfur dioxide 0.000000
total sulfur dioxide total sulfur dioxide 0.000000
density density 0.000000
pH pH 0.000000
sulphates sulphates 0.000000
alcohol alcohol 0.000000
quality quality 0.000000
1_percent 1_percent 1.000435
5_percent 5_percent 5.002175
10_percent 10_percent 10.004350
# Store Index of NaN values in each coloumns
number_1_idx = list(np.where(Cabernet_Sauvignon_imputation['1_percent'].isna())[0])
number_5_idx = list(np.where(Cabernet_Sauvignon_imputation['5_percent'].isna())[0])
number_10_idx = list(np.where(Cabernet_Sauvignon_imputation['10_percent'].isna())[0])
print(f"Length of number_1_idx is {len(number_1_idx)} and it contains {(len(number_1_i
print(f"Length of number_5_idx is {len(number_5_idx)} and it contains {(len(number_5_i
print(f"Length of number_10_idx is {len(number_10_idx)} and it contains {(len(number_1
Length of number_1_idx is 46 and it contains 1.0004349717268377% of total data in
Length of number_5_idx is 230 and it contains 5.002174858634189% of total data in
Length of number_10_idx is 460 and it contains 10.004349717268378% of total data
Imputation 2
KNN Imputation. The k-nearest neighbours algorithm is commonly used for simple classification; it uses 'feature similarity' to predict the values of new data points. This means that a missing value is filled in based on how closely the point resembles the other points in the training set.
#Creating a separate dataframe for performing the KNN imputation
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler
Cabernet_Sauvignon_imputation1 = Cabernet_Sauvignon_imputation[['1_percent','5_percent','10_percent']]
imputer = KNNImputer(n_neighbors=5)
imputed_number_Cabernet_Sauvignon = pd.DataFrame(imputer.fit_transform(Cabernet_Sauvignon_imputation1), columns = Cabernet_Sauvignon_imputation1.columns)
# imputed_number_Cabernet_Sauvignon.sample(10)
imputed_number_Cabernet_Sauvignon.head()
print(get_percent_missing(imputed_number_Cabernet_Sauvignon))
column_name percent_missing
1_percent 1_percent 0.0
5_percent 5_percent 0.0
10_percent 10_percent 0.0
alcohol = Cabernet_Sauvignon["alcohol"]
imputed_mean = pd.concat([alcohol,imputed_number_Cabernet_Sauvignon])
imputed_mean.columns = ["Alcohol","1_Percent","5_Percent","10_Percent"]
imputed_mean.var()
Alcohol 1.470385
1_Percent 1.470326
5_Percent 1.470391
10_Percent 1.470429
dtype: float64
The KNN-based method showed negligible change in variability. Therefore this method is acceptable for the current dataset.
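The assignment also asks for the % error on the residuals and the bias and variance of the error. A minimal sketch of such a check (not part of the original notebook), comparing the KNN-imputed values with the held-out 'alcohol' values at the masked positions:

# Hypothetical check at the positions masked in the '10_percent' column.
true_vals = Cabernet_Sauvignon_imputation['alcohol'].iloc[number_10_idx].to_numpy()
imputed_vals = imputed_number_Cabernet_Sauvignon['10_percent'].iloc[number_10_idx].to_numpy()
residuals = imputed_vals - true_vals
print('Bias (mean residual)  :', residuals.mean())
print('Variance of residuals :', residuals.var())
print('Mean absolute % error :', np.mean(np.abs(residuals) / true_vals) * 100)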
Mean-based Imputation with SimpleImputer. This works by calculating the mean/median of the non-missing values in a column and then replacing the missing values in each column separately and independently of the others. It can only be used with numeric data.
Cabernet_Sauvignon_imputation_mean = Cabernet_Sauvignon_imputation[['1_percent','5_percent','10_percent']]
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer( strategy='mean') #for median imputation replace 'mean' with 'median'
imp_mean.fit(Cabernet_Sauvignon_imputation_mean)
imputed_train_Cabernet_Sauvignon = imp_mean.transform(Cabernet_Sauvignon_imputation_mean)
imputed_mean = pd.DataFrame(imp_mean.fit_transform(Cabernet_Sauvignon_imputation_mean), columns = Cabernet_Sauvignon_imputation_mean.columns)
print(get_percent_missing(imputed_mean))
column_name percent_missing
1_percent 1_percent 0.0
5_percent 5_percent 0.0
10_percent 10_percent 0.0
alcohol = Cabernet_Sauvignon["alcohol"]
combined_mean = pd.concat([alcohol,imputed_mean])
combined_mean.mean()
0 10.587102
10_percent 10.588810
1_percent 10.586540
5_percent 10.581520
dtype: float64
combined_mean.var()
0 1.470385
10_percent 1.320797
1_percent 1.456402
5_percent 1.395375
dtype: float64
The mean-based method also showed negligible change in variability. Therefore this method is acceptable for the current dataset.
Imputation 3
Imputation Using Multivariate Imputation by Chained Equations (MICE). This type of imputation works by filling in the missing data multiple times. Multiple imputations (MIs) are much better than a single imputation because they capture the uncertainty of the missing values more faithfully. The chained-equations approach is also very flexible and can handle variables of different data types (i.e., continuous or binary) as well as complexities such as bounds or survey skip patterns.
Cabernet_Sauvignon_imputation_mice = Cabernet_Sauvignon_imputation[['1_percent','5_percent','10_percent']]
print(get_percent_missing(Cabernet_Sauvignon_imputation_mice))
column_name percent_missing
1_percent 1_percent 1.000435
5_percent 5_percent 5.002175
10_percent 10_percent 10.004350
!pip install impyute
from impyute.imputation.cs import mice
# start the MICE training
imputed_training=mice(Cabernet_Sauvignon_imputation_mice.values)
imputed_training = pd.DataFrame(imputed_training)
imputed_training.columns = ("1_percent","5_percent","10_percent")
# imputed_mice = pd.DataFrame(imputed_training.fit_transform(Cabernet_Sauvignon_imputa
print(get_percent_missing(imputed_training))
column_name percent_missing
1_percent 1_percent 0.0
5_percent 5_percent 0.0
10_percent 10_percent 0.0
alcohol = Cabernet_Sauvignon["alcohol"]
combined_mice = pd.concat([alcohol,imputed_training])
combined_mice.columns = ["Alcohol","1_Percent","5_Percent","10_Percent"]
combined_mice.mean()
Alcohol 10.587102
1_Percent 10.586915
5_Percent 10.587098
10_Percent 10.586915
dtype: float64
combined_mice.var()
Alcohol 1.470385
1_Percent 1.467981
5_Percent 1.470375
10_Percent 1.467981
dtype: float64
The MICE method showed negligible change in variability. Therefore this method is acceptable for the current dataset.
AutoML
#Install AutoML library - PyCaret
!pip install pycaret
from scipy import stats
# import math
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
#Reading Data
Chateau_Montelena_AutoML = pd.read_csv('https://raw.githubusercontent.com/Mohanmanjuna
Chateau_Montelena_AutoMLM = Chateau_Montelena_AutoML.copy()
Chateau_Montelena_AutoMLB = Chateau_Montelena_AutoML.copy()
Each row represents a wine; each column contains a wine's attributes such as type, sulphates,
chlorides, etc., along with the target label 'quality'.
Problem Statement
Binary Classification: Predict the quality of wine, i.e. Low or High.
Multiclass Classification: Predict the quality of wine, i.e. Low, Medium, or High.
Regression: Predict the quality of wine between 3-9 based on the independent predictor
variables.
Dataset - Wine Quality
Chateau_Montelena_AutoML.describe()
(pip output truncated: remaining requirements already satisfied)
Collecting numpy>=1.13.3
  Using cached numpy-1.19.5-cp38-cp38-manylinux2010_x86_64.whl (14.9 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.20.0
    Uninstalling numpy-1.20.0:
      Successfully uninstalled numpy-1.20.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed.
tensorflow 2.9.2 requires numpy>=1.20, but you have numpy 1.19.5 which is incompatible.
jaxlib 0.3.25+cuda11.cudnn805 requires numpy>=1.20, but you have numpy 1.19.5 which is incompatible.
jax 0.3.25 requires numpy>=1.20, but you have numpy 1.19.5 which is incompatible.
en-core-web-sm 3.4.1 requires spacy<3.5.0,>=3.4.0, but you have spacy 2.3.8 which is incompatible.
cmdstanpy 1.0.8 requires numpy>=1.21, but you have numpy 1.19.5 which is incompatible.
Successfully installed numpy-1.19.5
       fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide
count    6487.000000       6489.000000  6494.000000     6495.000000  6495.000000         6497.000000                6497.0
mean        7.216579          0.339691     0.318722        5.444326     0.056042           30.525319                 115.7
std         1.296750          0.164649     0.145265        4.758125     0.035036           17.749400                  56.5
min         3.800000          0.080000     0.000000        0.600000     0.009000            1.000000                   6.0
25%         6.400000          0.230000     0.250000        1.800000     0.038000           17.000000                  77.0
50%         7.000000          0.290000     0.310000        3.000000     0.047000           29.000000                 118.0
75%         7.700000          0.400000     0.390000        8.100000     0.065000           41.000000                 156.0
max        15.900000          1.580000     1.660000       65.800000     0.611000          289.000000                 440.0
(remaining columns truncated in the PDF export)
Dataset Shape: (6497, 13)
Name dtypes Missing Uniques Sample Value Entropy
0 type object 0 2 white 0.24
1 fixed acidity float64 10 106 7.0 1.65
2 volatile acidity float64 8 187 0.27 1.79
3 citric acid float64 3 89 0.36 1.70
4 residual sugar float64 2 316 20.7 2.08
5 chlorides float64 2 214 0.045 1.90
6 free sulfur dioxide float64 0 135 45.0 1.82
7 total sulfur dioxide float64 0 276 170.0 2.32
8 density float64 0 998 1.001 2.70
9 pH float64 9 108 3.0 1.81
10 sulphates float64 4 111 0.45 1.72
11 alcohol float64 0 111 8.8 1.66
12 quality int64 0 7 6 0.55
def tableinfo(Chateau_Montelena_AutoML):
    print(f"Dataset Shape: {Chateau_Montelena_AutoML.shape}")
    summary = pd.DataFrame(Chateau_Montelena_AutoML.dtypes, columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name', 'dtypes']]
    summary['Missing'] = Chateau_Montelena_AutoML.isnull().sum().values
    summary['Uniques'] = Chateau_Montelena_AutoML.nunique().values
    summary['Sample Value'] = Chateau_Montelena_AutoML.loc[0].values
    for name in summary['Name'].value_counts().index:
        # entropy of the normalized value counts (base 10, matching the output above)
        summary.loc[summary['Name'] == name, 'Entropy'] = round(
            stats.entropy(Chateau_Montelena_AutoML[name].value_counts(normalize=True), base=10), 2)
    return summary
tableinfo(Chateau_Montelena_AutoML)
Entropy measures the randomness, or disorder, of the information in a feature: the more evenly
a feature's values are spread across its categories, the higher the entropy.
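As a quick illustration (not part of the original notebook, and assuming the same base-10 entropy used in tableinfo() above), the Entropy value reported for 'type' can be reproduced from its normalized value counts:
from scipy import stats
p = Chateau_Montelena_AutoML['type'].value_counts(normalize=True)
print(round(stats.entropy(p, base=10), 2))  # roughly 0.24: only two categories, heavily skewed towards white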
Actions required for data preparation:
Converting 'type' to an integer data type, i.e. encoding the categorical feature.
print("Quality(%):")
print(round(Chateau_Montelena_AutoML['quality'].value_counts(normalize=True) * 100,2))
Quality(%):
6 43.65
5 32.91
7 16.61
4 3.32
8 2.97
3 0.46
9 0.08
Name: quality, dtype: float64
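A minimal sketch (not in the original notebook) to visualize the class distribution printed above; the quality scores are heavily concentrated around 5 and 6:
sns.countplot(x='quality', data=Chateau_Montelena_AutoML)
plt.title('Distribution of wine quality scores')
plt.show()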
Chateau_Montelena_AutoML['type'] = Chateau_Montelena_AutoML['type'].astype("category").cat.codes
Chateau_Montelena_AutoML_copy = Chateau_Montelena_AutoML.copy()
Chateau_Montelena_AutoML.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 type 6497 non-null int8
1 fixed acidity 6487 non-null float64
2 volatile acidity 6489 non-null float64
3 citric acid 6494 non-null float64
4 residual sugar 6495 non-null float64
5 chlorides 6495 non-null float64
6 free sulfur dioxide 6497 non-null float64
7 total sulfur dioxide 6497 non-null float64
8 density 6497 non-null float64
9 pH 6488 non-null float64
10 sulphates 6493 non-null float64
11 alcohol 6497 non-null float64
12 quality 6497 non-null int64
dtypes: float64(11), int64(1), int8(1)
memory usage: 615.6 KB
Analyzing the numeric features
plot , ax = plt.subplots( 4,3 , figsize = (35 , 20))
g = sns.histplot(Chateau_Montelena_AutoML['type'] , kde = True , ax = ax[0][0])
g = sns.histplot(Chateau_Montelena_AutoML['fixed acidity'] , kde = True , ax = ax[0][1])
g = sns.histplot(Chateau_Montelena_AutoML['volatile acidity'] , kde = True , ax = ax[0][2])
g = sns.histplot(Chateau_Montelena_AutoML['citric acid'] , kde = True , ax = ax[1][0])
g = sns.histplot(Chateau_Montelena_AutoML['residual sugar'] , kde = True , ax = ax[1][1])
g = sns.histplot(Chateau_Montelena_AutoML['chlorides'] , kde = True , ax = ax[1][2])
g = sns.histplot(Chateau_Montelena_AutoML['density'] , kde = True , ax = ax[2][0])
g = sns.histplot(Chateau_Montelena_AutoML['pH'] , kde = True , ax = ax[2][1])
g = sns.histplot(Chateau_Montelena_AutoML['sulphates'] , kde = True , ax = ax[2][2])
g = sns.histplot(Chateau_Montelena_AutoML['alcohol'] , kde = True , ax = ax[3][0])
Observation :
These numerical variables do not follow a normal distribution. Several of them are skewed or
show multiple separate, independent peaks, suggesting distinct sub-populations within the data.
Action :
Scale the data. Because most algorithms are sensitive to feature scale (and several assume
approximately Gaussian inputs), we normalize these features to a common range.
   type  fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density
0     1            7.0              0.27         0.36            20.7      0.045                 45.0                 170.0   1.0010
1     1            6.3              0.30         0.34             1.6      0.049                 14.0                 132.0   0.9940
2     1            8.1              0.28         0.40             6.9      0.050                 30.0                  97.0   0.9951
3     1            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956
4     1            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956
(remaining columns truncated in the PDF export)
Chateau_Montelena_AutoML.head()
Outliers
from sklearn.preprocessing import MinMaxScaler,StandardScaler
mms = MinMaxScaler() # Normalization
# cust_dummies=pd.get_dummies(cust)
Chateau_Montelena_AutoML_copy['type'] = mms.fit_transform(Chateau_Montelena_AutoML_copy[['type']])
Chateau_Montelena_AutoML_copy['fixed acidity'] = mms.fit_transform(Chateau_Montelena_AutoML_copy[['fixed acidity']])
Chateau_Montelena_AutoML_copy['volatile acidity'] = mms.fit_transform(Chateau_Montelena_AutoML_copy[['volatile acidity']])
Chateau_Montelena_AutoML_copy['citric acid']= mms.fit_transform(Chateau_Montelena_AutoML_copy[['citric acid']])
Chateau_Montelena_AutoML_copy['residual sugar']= mms.fit_transform(Chateau_Montelena_AutoML_copy[['residual sugar']])
Chateau_Montelena_AutoML_copy['chlorides']= mms.fit_transform(Chateau_Montelena_AutoML_copy[['chlorides']])
# Chateau_Montelena_AutoML_copy['free sulphur dioxide']= mms.fit_transform(Chateau_Montelena_AutoML_copy[['free sulfur dioxide']])
# Chateau_Montelena_AutoML_copy['total sulphur dioxide']= mms.fit_transform(Chateau_Montelena_AutoML_copy[['total sulfur dioxide']])
Chateau_Montelena_AutoML_copy['density'] = mms.fit_transform(Chateau_Montelena_AutoML_copy[['density']])
Chateau_Montelena_AutoML_copy['pH'] = mms.fit_transform(Chateau_Montelena_AutoML_copy[['pH']])
Chateau_Montelena_AutoML_copy['sulphates'] = mms.fit_transform(Chateau_Montelena_AutoML_copy[['sulphates']])
Chateau_Montelena_AutoML_copy['alcohol'] = mms.fit_transform(Chateau_Montelena_AutoML_copy[['alcohol']])
plt.figure(figsize=(16,4))
sns.boxplot(data=Chateau_Montelena_AutoML_copy[['type','fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','density','pH','sulphates','alcohol']])
<AxesSubplot:>
Observation : There are values beyond the upper and lower whiskers of the box plots (1.5 x the
interquartile range), i.e. the dataset contains outliers.
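A minimal sketch (not in the original notebook) of how the 1.5 x IQR rule behind the box-plot whiskers can be used to count potential outliers per scaled feature:
for col in Chateau_Montelena_AutoML_copy.select_dtypes(include='number').columns:
    q1, q3 = Chateau_Montelena_AutoML_copy[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (Chateau_Montelena_AutoML_copy[col] < q1 - 1.5 * iqr) | (Chateau_Montelena_AutoML_copy[col] > q3 + 1.5 * iqr)
    print(f"{col}: {mask.sum()} potential outliers")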
Multicolinearity
<AxesSubplot:>
plt.figure(figsize=(24,8))
corr = Chateau_Montelena_AutoML_copy.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr,mask=mask, cmap='RdYlGn')
Observation :
Looking at the correlation matrix above, we can draw the following insights:
volatile acidity and chlorides are strongly negatively correlated with type.
alcohol is strongly negatively correlated with density.
total sulfur dioxide is strongly positively correlated with type.
Action :
Drop some of the highly correlated variables.
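A minimal sketch (not in the original notebook) that lists feature pairs whose absolute correlation exceeds 0.7 as drop candidates; the 0.7 cut-off is an illustrative choice, and PyCaret's remove_multicollinearity option used in the setup() calls below automates the same idea with a 0.9 threshold:
corr_abs = Chateau_Montelena_AutoML_copy.corr().abs()
upper_tri = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
print(upper_tri.stack().loc[lambda s: s > 0.7].sort_values(ascending=False))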
Target Variable = Quality between 3-9
Regression
!pip install numba==0.53
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-whee
Collecting numba==0.53
Downloading numba-0.53.0-cp38-cp38-manylinux2014_x86_64.whl (3.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.4/3.4 MB 31.4 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.15 in /usr/local/lib/python3.8/dist-packa
Requirement already satisfied: setuptools in /usr/local/lib/python3.8/dist-packag
Collecting llvmlite<0.37,>=0.36.0rc1
Downloading llvmlite-0.36.0-cp38-cp38-manylinux2010_x86_64.whl (25.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 25.3/25.3 MB 54.4 MB/s eta 0:00:00
Installing collected packages: llvmlite, numba
Attempting uninstall: llvmlite
Found existing installation: llvmlite 0.37.0
Uninstalling llvmlite-0.37.0:
Successfully uninstalled llvmlite-0.37.0
Attempting uninstall: numba
Found existing installation: numba 0.54.1
Uninstalling numba-0.54.1:
Successfully uninstalled numba-0.54.1
Successfully installed llvmlite-0.36.0 numba-0.53.0
from pycaret.regression import *
s = setup(Chateau_Montelena_AutoML, target = 'quality',train_size=0.8,
normalize=True,
normalize_method='minmax',
remove_multicollinearity=True,
remove_outliers=True,
fold=5,
silent = True)
Description Value
0 session_id 6943
1 Target quality
2 Original Data (6497, 13)
3 Missing Values True
4 Numeric Features 12
5 Categorical Features 0
6 Ordinal Features False
7 High Cardinality Features False
8 High Cardinality Method None
9 Transformed Train Set (4937, 12)
10 Transformed Test Set (1300, 12)
11 Shuffle Train-Test True
12 Stratify Train-Test False
13 Fold Generator KFold
14 Fold Number 5
15 CPU Jobs -1
16 Use GPU False
17 Log Experiment False
18 Experiment Name reg-default-name
19 USI 900b
20 Imputation Type simple
21 Iterative Imputation Iteration None
22 Numeric Imputer mean
23 Iterative Imputation Numeric Model None
24 Categorical Imputer constant
25 Iterative Imputation Categorical Model None
26 Unknown Categoricals Handling least_frequent
27 Normalize True
28 Normalize Method minmax
29 Transformation False
30 Transformation Method None
31 PCA False
32 PCA Method None
33 PCA Components None
34 Ignore Low Variance False
35 Combine Rare Levels False
36 Rare Level Threshold None
37 Numeric Binning False
38 Remove Outliers True
39 Outliers Threshold 0.05
40 Remove Multicollinearity True
41 Multicollinearity Threshold 0.9
42 Remove Perfect Collinearity True
43 Clustering False
44 Clustering Iteration None
45 Polynomial Features False
46 Polynomial Degree None
47 Trignometry Features False
48 Polynomial Threshold None
49 Group Features False
50 Feature Selection False
51 Feature Selection Method classic
52 Features Selection Threshold None
53 Feature Interaction False
54 Feature Ratio False
55 Interaction Threshold None
56 Transform Target False
57 Transform Target Method box-cox
INFO:logs:create_model_container: 0
INFO:logs:master_model_container: 0
INFO:logs:display_container: 1
INFO:logs:Pipeline(memory=None,
steps=[('dtypes',
DataTypes_Auto_infer(categorical_features=[],
          Model                              MAE     MSE    RMSE       R2   RMSLE    MAPE  TT (Sec)
et        Extra Trees Regressor           0.3974  0.3534  0.5941   0.5312  0.0890  0.0710     1.150
rf        Random Forest Regressor         0.4454  0.3757  0.6124   0.5018  0.0916  0.0793     2.436
lightgbm  Light Gradient Boosting Machine 0.4847  0.4085  0.6388   0.4577  0.0951  0.0857     0.190
xgboost   Extreme Gradient Boosting       0.4631  0.4104  0.6404   0.4548  0.0955  0.0821     0.590
gbr       Gradient Boosting Regressor     0.5298  0.4610  0.6786   0.3880  0.1006  0.0934     1.008
knn       K Neighbors Regressor           0.5362  0.5059  0.7111   0.3280  0.1055  0.0950     0.082
ada       AdaBoost Regressor              0.5725  0.5243  0.7235   0.3048  0.1074  0.1015     0.600
lr        Linear Regression               0.5643  0.5288  0.7268   0.2982  0.1074  0.0995     0.588
lar       Least Angle Regression          0.5643  0.5288  0.7268   0.2982  0.1074  0.0995     0.012
br        Bayesian Ridge                  0.5645  0.5289  0.7269   0.2981  0.1074  0.0995     0.012
ridge     Ridge Regression                0.5652  0.5296  0.7273   0.2972  0.1075  0.0996     0.010
huber     Huber Regressor                 0.5636  0.5301  0.7277   0.2965  0.1074  0.0990     0.102
omp       Orthogonal Matching Pursuit     0.6133  0.5987  0.7733   0.2056  0.1145  0.1086     0.010
dt        Decision Tree Regressor         0.5058  0.7132  0.8440   0.0484  0.1252  0.0889     0.046
lasso     Lasso Regression                0.6772  0.7537  0.8678  -0.0004  0.1277  0.1203     0.012
en        Elastic Net                     0.6772  0.7537  0.8678  -0.0004  0.1277  0.1203     0.014
llar      Lasso Least Angle Regression    0.6772  0.7537  0.8678  -0.0004  0.1277  0.1203     0.012
dummy     Dummy Regressor                 0.6772  0.7537  0.8678  -0.0004  0.1277  0.1203     0.014
par       Passive Aggressive Regressor    0.8006  0.9957  0.9905  -0.3256  0.1469  0.1372     0.014
INFO:logs:create_model_container: 19
INFO:logs:master_model_container: 19
INFO:logs:display_container: 2
INFO:logs:ExtraTreesRegressor(bootstrap=False, ccp_alpha=0.0, criterion='mse',
max_depth=None, max_features='auto', max_leaf_nodes=None,
best = compare_models()
Tuning the best model here, i.e. the Extra Trees Regressor.
display_types=False, features_todrop=[],
id_columns=[], ml_usecase='regression',
numerical_features=[], target='quality',
time_features=[])),
('imputer',
Simple_Imputer(categorical_strategy='not_available',
fill_value_categorical=None,
fill_value_numerical=None,
numeric_strateg...
('dummy', Dummify(target='quality')),
('fix_perfect', Remove_100(target='quality')),
('clean_names', Clean_Colum_Names()),
('feature_select', 'passthrough'),
('fix_multi',
Fix_multicollinearity(correlation_with_target_preference=None,
correlation_with_target_threshold=0.0,
target_variable='quality',
threshold=0.9)),
('dfs', 'passthrough'), ('pca', 'passthrough')],
verbose=False)
INFO:logs:setup() succesfully completed......................................
MAE MSE RMSE R2 RMSLE MAPE
Fold
0 0.5600 0.4734 0.6881 0.3261 0.1013 0.0982
1 0.6064 0.5845 0.7645 0.2842 0.1133 0.1078
2 0.5680 0.5103 0.7144 0.3331 0.1063 0.1010
3 0.5849 0.5351 0.7315 0.3088 0.1100 0.1047
4 0.5651 0.4953 0.7038 0.3008 0.1029 0.0985
Mean 0.5769 0.5197 0.7204 0.3106 0.1068 0.1020
Std 0.0170 0.0381 0.0262 0.0176 0.0044 0.0037
INFO:logs:create_model_container: 20
INFO:logs:master_model_container: 20
INFO:logs:display_container: 3
INFO:logs:ExtraTreesRegressor(bootstrap=False, ccp_alpha=0.0, criterion='mse',
max_depth=9, max_features=1.0, max_leaf_nodes=None,
max_samples=None, min_impurity_decrease=0.002,
min_impurity_split=None, min_samples_leaf=3,
min_samples_split=5, min_weight_fraction_leaf=0.0,
n_estimators=210, n_jobs=-1, oob_score=False,
random_state=6943, verbose=0, warm_start=False)
INFO:logs:tune_model() succesfully completed....................................
tuned_model = tune_model(best)
#Creating Models
lightgbm = create_model('lightgbm');
et = create_model('et');
rf = create_model('rf');
#Blending the top 3 models
blend = blend_models(estimator_list=[lightgbm,et,rf])
MAE MSE RMSE R2 RMSLE MAPE
Fold
0 0.4346 0.3397 0.5828 0.5165 0.0863 0.0763
1 0.4596 0.4069 0.6379 0.5017 0.0952 0.0819
2 0.4237 0.3473 0.5893 0.5462 0.0889 0.0761
3 0.4418 0.3806 0.6169 0.5084 0.0937 0.0797
4 0.4261 0.3356 0.5793 0.5262 0.0858 0.0747
Mean 0.4372 0.3620 0.6012 0.5198 0.0900 0.0777
Std 0.0129 0.0275 0.0226 0.0155 0.0038 0.0026
INFO:logs:create_model_container: 24
INFO:logs:master_model_container: 24
INFO:logs:display_container: 7
INFO:logs:VotingRegressor(estimators=[('lightgbm',
LGBMRegressor(boosting_type='gbdt',
class_weight=None,
colsample_bytree=1.0,
importance_type='split',
learning_rate=0.1, max_depth=-1,
min_child_samples=20,
min_child_weight=0.001,
min_split_gain=0.0, n_estimators=100,
n_jobs=-1, num_leaves=31,
objective=None, random_state=6943,
reg_alpha=0.0, reg_lambda=0.0,
silent='warn', s...
RandomForestRegressor(bootstrap=True,
ccp_alpha=0.0,
criterion='mse',
max_depth=None,
max_features='auto',
max_leaf_nodes=None,
max_samples=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators=100, n_jobs=-1,
oob_score=False,
random_state=6943, verbose=0,
warm_start=False))],
n_jobs=-1, verbose=False, weights=None)
INFO:logs:blend_models() succesfully completed..................................
plot_model(estimator = tuned_model, plot = 'feature')
INFO:logs:Visual Rendered Successfully
INFO:logs:plot_model() succesfully completed....................................
interpret_model(tuned_model)
INFO:logs:Initializing interpret_model()
INFO:logs:interpret_model(estimator=ExtraTreesRegressor(bootstrap=False, ccp_alph
max_depth=9, max_features=1.0, max_leaf_nodes=None,
max_samples=None, min_impurity_decrease=0.002,
min_impurity_split=None, min_samples_leaf=3,
min_samples_split=5, min_weight_fraction_leaf=0.0,
n_estimators=210, n_jobs=-1, oob_score=False,
random_state=6943, verbose=0, warm_start=False), use_train_da
INFO:logs:Checking exceptions
INFO:logs:plot type: summary
INFO:logs:Creating TreeExplainer
INFO:logs:Compiling shap values
INFO:logs:Visual Rendered Successfully
INFO:logs:interpret_model() succesfully completed...............................
INFO:logs:Visual Rendered Successfully
INFO:logs:plot_model() succesfully completed....................................
plot_model(estimator = tuned_model, plot = 'residuals')
Observation : The residuals are evenly distributed and the line fits well.
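A minimal sketch (not in the original notebook) for inspecting the residuals numerically; in PyCaret 2.x regression, predict_model() scores the hold-out split and adds the predictions in a 'Label' column:
holdout_preds = predict_model(tuned_model)
residuals = holdout_preds['quality'] - holdout_preds['Label']
print(residuals.describe())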
Target Variable = Quality- Low or High
Binary classification
from pycaret.classification import *
Categorization of Quality
quality_mapping = { 3 : 'Low', 4 : 'Low', 5: 'Low', 6 : 'High', 7: 'High', 8 : 'High', 9 : 'High'}
Chateau_Montelena_AutoMLB['quality'] = Chateau_Montelena_AutoMLB['quality'].map(quality_mapping)
print("Wine Quality(%):")
print(round(Chateau_Montelena_AutoMLB['quality'].value_counts(normalize=True) * 100,2))
Wine Quality(%):
High 63.31
Low 36.69
Name: quality, dtype: float64
Classifier Setup
clfb = setup(data = Chateau_Montelena_AutoMLB,
target = 'quality',
# ignore_features = ['customerID'],
train_size=0.8,
normalize=True,
normalize_method='minmax',
fix_imbalance=True,
remove_multicollinearity=True,
remove_outliers=True,
fold=5,
silent = True)
Description Value
0 session_id 4967
1 Target quality
2 Target Type Binary
3 Label Encoded High: 0, Low: 1
4 Original Data (6497, 13)
5 Missing Values True
6 Numeric Features 11
7 Categorical Features 1
8 Ordinal Features False
9 High Cardinality Features False
10 High Cardinality Method None
11 Transformed Train Set (4937, 12)
12 Transformed Test Set (1300, 12)
13 Shuffle Train-Test True
14 Stratify Train-Test False
15 Fold Generator StratifiedKFold
16 Fold Number 5
17 CPU Jobs -1
18 Use GPU False
19 Log Experiment False
20 Experiment Name clf-default-name
21 USI 3508
22 Imputation Type simple
23 Iterative Imputation Iteration None
24 Numeric Imputer mean
25 Iterative Imputation Numeric Model None
26 Categorical Imputer constant
27 Iterative Imputation Categorical Model None
28 Unknown Categoricals Handling least_frequent
29 Normalize True
30 Normalize Method minmax
31 Transformation False
32 Transformation Method None
33 PCA False
34 PCA Method None
35 PCA Components None
36 Ignore Low Variance False
37 Combine Rare Levels False
38 Rare Level Threshold None
39 Numeric Binning False
40 Remove Outliers True
41 Outliers Threshold 0.05
42 Remove Multicollinearity True
43 Multicollinearity Threshold 0.9
44 Remove Perfect Collinearity True
45 Clustering False
46 Clustering Iteration None
47 Polynomial Features False
48 Polynomial Degree None
49 Trignometry Features False
50 Polynomial Threshold None
51 Group Features False
52 Feature Selection False
53 Feature Selection Method classic
54 Features Selection Threshold None
55 Feature Interaction False
56 Feature Ratio False
57 Interaction Threshold None
58 Fix Imbalance True
59 Fix Imbalance Method SMOTE
INFO:logs:create_model_container: 0
INFO:logs:master_model_container: 0
INFO:logs:display_container: 1
Evaluation Metrics
INFO:logs:display_container: 1
INFO:logs:Pipeline(memory=None,
steps=[('dtypes',
DataTypes_Auto_infer(categorical_features=[],
display_types=False, features_todrop=[],
id_columns=[],
ml_usecase='classification',
numerical_features=[], target='quality',
time_features=[])),
('imputer',
Simple_Imputer(categorical_strategy='not_available',
fill_value_categorical=None,
fill_value_numerical=None,
numeric_str...
('dummy', Dummify(target='quality')),
('fix_perfect', Remove_100(target='quality')),
('clean_names', Clean_Colum_Names()),
('feature_select', 'passthrough'),
('fix_multi',
Fix_multicollinearity(correlation_with_target_preference=None,
correlation_with_target_threshold=0.0,
target_variable='quality',
threshold=0.9)),
('dfs', 'passthrough'), ('pca', 'passthrough')],
verbose=False)
INFO:logs:setup() succesfully completed......................................
Pycaret provides the following metrics used for comparing model performance in the
compare_models() function:
A Confusion Matrix is a performance measurement for machine learning classification problems
where the output can be two or more classes. For binary classification it is a table with the
4 combinations of predicted and actual values.
AUC known as the Area Under the ROC Curve can be calculated and provides a single score to
summarize the plot that can be used to compare models. A no skill classifier will have a score
of 0.5, whereas a perfect classifier will have a score of 1.0.
F1 score is the harmonic mean of Precision and recall, a single score that seeks to balance
both concerns.
Accuracy is the fraction of correct predictions out of all predictions:
Accuracy = Correct Predictions / Total Predictions
MCC produces a high score only if the prediction obtained good results in all of the four
confusion matrix categories (true positives, false negatives, true negatives, and false
positives), proportionally both to the size of positive elements and the size of negative
elements in the dataset.
Precision summarizes the fraction of examples assigned the positive class that belong to the
positive class.
Precision = TruePositive / (TruePositive + FalsePositive)
Cohen’s Kappa Statistic is used to measure the level of agreement between two raters or
judges who each classify items into mutually exclusive categories.
kappa = (Observed agreement - chance agreement) / (1-chance agreement)
Recall summarizes how well the positive class was predicted.
Recall = TruePositive / (TruePositive + FalseNegative)
F-Measure = (2 * Precision * Recall) / (Precision + Recall)
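A minimal sketch (not in the original notebook) showing the same metrics computed directly with scikit-learn on a small, made-up set of labels (y_true, y_pred and y_prob below are hypothetical):
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             roc_auc_score, matthews_corrcoef, cohen_kappa_score,
                             confusion_matrix)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]  # predicted probabilities, needed for AUC
print(confusion_matrix(y_true, y_pred))
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_prob))
print("Kappa    :", cohen_kappa_score(y_true, y_pred))
print("MCC      :", matthews_corrcoef(y_true, y_pred))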
Searching for the best models
Model Comparison & Evaluation
best_modelB=compare_models()
          Model                            Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC  TT (Sec)
et        Extra Trees Classifier             0.8232  0.9011  0.7532  0.7558  0.7539  0.6160  0.6166     0.432
rf        Random Forest Classifier           0.8209  0.8940  0.7623  0.7463  0.7539  0.6132  0.6136     1.092
xgboost   Extreme Gradient Boosting          0.8112  0.8668  0.7392  0.7377  0.7380  0.5905  0.5909     0.978
lightgbm  Light Gradient Boosting Machine    0.8009  0.8680  0.7538  0.7114  0.7314  0.5735  0.5748     0.208
gbc       Gradient Boosting Classifier       0.7582  0.8375  0.7499  0.6405  0.6905  0.4942  0.4987     0.836
dt        Decision Tree Classifier           0.7559  0.7384  0.6761  0.6556  0.6655  0.4735  0.4737     0.104
knn       K Neighbors Classifier             0.7379  0.8094  0.7386  0.6124  0.6695  0.4555  0.4611     0.120
ada       Ada Boost Classifier               0.7377  0.8115  0.7442  0.6116  0.6712  0.4566  0.4629     0.252
lda       Linear Discriminant Analysis       0.7284  0.8077  0.7662  0.5952  0.6697  0.4452  0.4558     0.042
ridge     Ridge Classifier                   0.7249  0.0000  0.7600  0.5920  0.6653  0.4380  0.4482     0.054
lr        Logistic Regression                0.7223  0.8052  0.7532  0.5896  0.6611  0.4319  0.4415     0.054
qda       Quadratic Discriminant Analysis    0.7203  0.7995  0.7386  0.5890  0.6550  0.4249  0.4329     0.040
Hyperparameter Tuning
tuned_modelB = tune_model(best_modelB)
Accuracy AUC Recall Prec. F1 Kappa MCC
Fold
0 0.7611 0.8567 0.8085 0.6308 0.7086 0.5114 0.5227
1 0.7520 0.8351 0.7690 0.6261 0.6903 0.4871 0.4943
2 0.7700 0.8457 0.8113 0.6429 0.7173 0.5278 0.5380
3 0.7021 0.8110 0.8028 0.5599 0.6597 0.4095 0.4306
4 0.7427 0.8236 0.7493 0.6172 0.6768 0.4663 0.4724
Mean 0.7456 0.8344 0.7882 0.6154 0.6906 0.4804 0.4916
Std 0.0236 0.0160 0.0246 0.0289 0.0209 0.0412 0.0380
INFO:logs:create_model_container: 16
INFO:logs:master_model_container: 16
INFO:logs:display_container: 3
INFO:logs:ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight={},
criterion='entropy', max_depth=11, max_features='log2',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0001, min_impurity_split=None,
min_samples_leaf=5, min_samples_split=9,
min_weight_fraction_leaf=0.0, n_estimators=180, n_jobs=-1,
oob_score=False, random_state=4967, verbose=0,
warm_start=False)
INFO:logs:tune_model() succesfully completed....................................
We will use the LightGBM, Extra Trees and Random Forest classifiers here, as these
perform the best.
Creating a model
#Creating Models
lightgbmB = create_model('lightgbm');
etB = create_model('et');
rfB = create_model('rf');
#Blending the top 3 models
blendB = blend_models(estimator_list=[lightgbmB,etB,rfB])
Accuracy AUC Recall Prec. F1 Kappa MCC
Fold
0 0.8451 0.9127 0.8000 0.7760 0.7878 0.6659 0.6661
1 0.8148 0.8971 0.7296 0.7486 0.7389 0.5955 0.5956
2 0.8470 0.9021 0.7859 0.7881 0.7870 0.6677 0.6677
3 0.8024 0.8846 0.7859 0.7010 0.7410 0.5822 0.5847
4 0.8126 0.8924 0.7296 0.7443 0.7368 0.5913 0.5914
Mean 0.8244 0.8978 0.7662 0.7516 0.7583 0.6205 0.6211
Std 0.0182 0.0094 0.0303 0.0302 0.0238 0.0380 0.0376
INFO:logs:create_model_container: 20
INFO:logs:master_model_container: 20
INFO:logs:display_container: 7
INFO:logs:VotingClassifier(estimators=[('lightgbm',
LGBMClassifier(boosting_type='gbdt',
class_weight=None,
colsample_bytree=1.0,
importance_type='split',
learning_rate=0.1, max_depth=-1,
min_child_samples=20,
min_child_weight=0.001,
min_split_gain=0.0,
n_estimators=100, n_jobs=-1,
num_leaves=31, objective=None,
random_state=4967, reg_alpha=0.0,
reg_lambda=0.0, silent='warn'...
max_depth=None,
max_features='auto',
max_leaf_nodes=None,
max_samples=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0
n_estimators=100,
n_jobs=-1, oob_score=False,
random_state=4967,
verbose=0,
warm_start=False))],
flatten_transform=True, n_jobs=-1, verbose=False,
voting='soft', weights=None)
INFO:logs:blend_models() succesfully completed..................................
plot_model(estimator = tuned_modelB, plot = 'feature')
INFO:logs:Visual Rendered Successfully
INFO:logs:plot_model() succesfully completed....................................
INFO:logs:Visual Rendered Successfully
INFO:logs:plot_model() succesfully completed....................................
#Plotting the confusion Matrix
plot_model(estimator = tuned_modelB, plot = 'confusion_matrix')
Observation :
We can see a strong diagonal, indicating good predictions.
#plotting decision boundary
plot_model(estimator = tuned_modelB, plot = 'boundary', use_train_data = True)
INFO:logs:Visual Rendered Successfully
INFO:logs:plot_model() succesfully completed....................................
Observation:
We can see a clear separation with very few misclassifications.
plot_model(tuned_modelB, plot = 'parameter')
Parameters
bootstrap False
ccp_alpha 0.0
class_weight {}
criterion entropy
max_depth 11
max_features log2
max_leaf_nodes None
max_samples None
min_impurity_decrease 0.0001
min_impurity_split None
min_samples_leaf 5
min_samples_split 9
min_weight_fraction_leaf 0.0
n_estimators 180
n_jobs -1
oob_score False
random_state 4967
verbose 0
warm_start False
INFO:logs:Visual Rendered Successfully
INFO:logs:plot_model() succesfully completed....................................
INFO:logs:Visual Rendered Successfully
INFO:logs:plot_model() succesfully completed....................................
#Plotting Area under Curve
plot_model(estimator = tuned_modelB, plot = 'auc')
interpret_model(tuned_modelB)
INFO:logs:Initializing interpret_model()
INFO:logs:interpret_model(estimator=ExtraTreesClassifier(bootstrap=False, ccp_alp
criterion='entropy', max_depth=11, max_features='log2',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0001, min_impurity_split=None,
min_samples_leaf=5, min_samples_split=9,
min_weight_fraction_leaf=0.0, n_estimators=180, n_jobs=-1,
oob_score=False, random_state=4967, verbose=0,
warm_start=False), use_train_data=False, X_new_sample=None,
INFO:logs:Checking exceptions
INFO:logs:plot type: summary
INFO:logs:Creating TreeExplainer
INFO:logs:Compiling shap values
INFO:logs:Visual Rendered Successfully
INFO:logs:interpret_model() succesfully completed...............................
Target Variable = Quality - Low, Medium, High
Multiclass classification
#from pycaret.classification import *
Classification of Quality
quality_mappingM = { 3 : 'Low', 4 : 'Low', 5: 'Medium', 6 : 'Medium', 7: 'Medium', 8 : 'High', 9 : 'High'}
Chateau_Montelena_AutoMLM['quality'] = Chateau_Montelena_AutoMLM['quality'].map(quality_mappingM)
Distribution
print("Wine Quality(%):")
print(round(Chateau_Montelena_AutoMLM['quality'].value_counts(normalize=True) * 100,2))
Wine Quality(%):
Medium 93.17
Low 3.79
High 3.05
Name: quality, dtype: float64
Setting the classifier
clfM = setup(data = Chateau_Montelena_AutoMLM,
target = 'quality',
# ignore_features = ['customerID'],
train_size=0.8,
normalize=True,
normalize_method='minmax',
fix_imbalance=True,
remove_multicollinearity=True,
remove_outliers=True,
fold=5,
silent = True)
Description Value
0 session_id 4450
1 Target quality
2 Target Type Multiclass
3 Label Encoded High: 0, Low: 1, Medium: 2
4 Original Data (6497, 13)
5 Missing Values True
6 Numeric Features 11
7 Categorical Features 1
8 Ordinal Features False
9 High Cardinality Features False
10 High Cardinality Method None
11 Transformed Train Set (4937, 12)
12 Transformed Test Set (1300, 12)
13 Shuffle Train-Test True
14 Stratify Train-Test False
15 Fold Generator StratifiedKFold
16 Fold Number 5
17 CPU Jobs -1
18 Use GPU False
19 Log Experiment False
20 Experiment Name clf-default-name
21 USI 40d8
22 Imputation Type simple
23 Iterative Imputation Iteration None
24 Numeric Imputer mean
25 Iterative Imputation Numeric Model None
26 Categorical Imputer constant
27 Iterative Imputation Categorical Model None
28 Unknown Categoricals Handling least_frequent
29 Normalize True
30 Normalize Method minmax
31 Transformation False
32 Transformation Method None
33 PCA False
34 PCA Method None
35 PCA Components None
36 Ignore Low Variance False
37 Combine Rare Levels False
38 Rare Level Threshold None
39 Numeric Binning False
40 Remove Outliers True
41 Outliers Threshold 0.05
42 Remove Multicollinearity True
43 Multicollinearity Threshold 0.9
44 Remove Perfect Collinearity True
45 Clustering False
46 Clustering Iteration None
47 Polynomial Features False
48 Polynomial Degree None
49 Trignometry Features False
50 Polynomial Threshold None
51 Group Features False
52 Feature Selection False
53 Feature Selection Method classic
54 Features Selection Threshold None
55 Feature Interaction False
56 Feature Ratio False
57 Interaction Threshold None
58 Fix Imbalance True
59 Fix Imbalance Method SMOTE
INFO:logs:create_model_container: 0
INFO:logs:master_model_container: 0
INFO:logs:display_container: 1
          Model                            Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC  TT (Sec)
xgboost   Extreme Gradient Boosting          0.9299  0.7765  0.5413  0.9209  0.9243  0.3454  0.3514     5.532
lightgbm  Light Gradient Boosting Machine    0.9279  0.7702  0.5348  0.9196  0.9225  0.3327  0.3389     0.538
et        Extra Trees Classifier             0.9230  0.8402  0.5646  0.9195  0.9210  0.3475  0.3487     0.618
rf        Random Forest Classifier           0.9123  0.8244  0.5722  0.9166  0.9141  0.3233  0.3248     2.222
dt        Decision Tree Classifier           0.8404  0.6445  0.5569  0.9048  0.8679  0.1915  0.2112     0.136
gbc       Gradient Boosting Classifier       0.7727  0.7342  0.6042  0.9068  0.8254  0.1643  0.2067     9.140
knn       K Neighbors Classifier             0.7432  0.7225  0.6320  0.9112  0.8064  0.1613  0.2160     0.180
ada       Ada Boost Classifier               0.5345  0.5782  0.5922  0.9011  0.6462  0.0730  0.1325     1.000
qda       Quadratic Discriminant Analysis    0.4950  0.6411  0.5851  0.9010  0.6113  0.0646  0.1249     0.052
lda       Linear Discriminant Analysis       0.4857  0.7076  0.6144  0.9079  0.6017  0.0735  0.1446     0.038
lr        Logistic Regression                0.4794  0.7101  0.6330  0.9118  0.5952  0.0780  0.1556     0.562
ridge     Ridge Classifier                   0.4132  0.0000  0.6236  0.9116  0.5293  0.0650  0.1426     0.022
svm       SVM - Linear Kernel                0.3830  0.0000  0.6252  0.9121  0.4962  0.0613  0.1404     0.072
nb        Naive Bayes                        0.3721  0.6096  0.5746  0.9040  0.4885  0.0492  0.1140     0.022
best_modelM=compare_models()
LightGBM's F1 score is nearly as high as XGBoost's, and it trains much faster than the other top models.
tuned_modelM = tune_model(best_modelM)
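A minimal sketch (not in the original notebook) of the usual follow-up once tuning is done: score the hold-out split, refit on the full data and persist the pipeline (the file name is hypothetical):
predict_model(tuned_modelM)                   # hold-out metrics for the tuned model
final_modelM = finalize_model(tuned_modelM)   # refit on the entire dataset
save_model(final_modelM, 'wine_quality_multiclass_pipeline')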
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf
pdf.pdf

More Related Content

Similar to pdf.pdf

Wine ppt template
Wine ppt templateWine ppt template
Wine ppt template
Krishna Bollojula
 
Beer and water analysis directly in your brewery
Beer and water analysis directly in your brewery Beer and water analysis directly in your brewery
Beer and water analysis directly in your brewery
CDR S.r.l.
 
5228_Leeder Wine Bro-low res
5228_Leeder Wine Bro-low res5228_Leeder Wine Bro-low res
5228_Leeder Wine Bro-low resDr John Leeder
 
CLEAN IN PLACE PROCESS AND TEORY AND PRAC
CLEAN IN PLACE PROCESS AND TEORY AND PRACCLEAN IN PLACE PROCESS AND TEORY AND PRAC
CLEAN IN PLACE PROCESS AND TEORY AND PRAC
JoseGuerra736717
 
ACS NERM 2013 Sour Beer - NMR Talk
ACS NERM 2013   Sour Beer - NMR TalkACS NERM 2013   Sour Beer - NMR Talk
ACS NERM 2013 Sour Beer - NMR TalkJohn Edwards
 
2018 Oregon Wine Symposium | Understanding Control Points from Crush Pad to B...
2018 Oregon Wine Symposium | Understanding Control Points from Crush Pad to B...2018 Oregon Wine Symposium | Understanding Control Points from Crush Pad to B...
2018 Oregon Wine Symposium | Understanding Control Points from Crush Pad to B...
Oregon Wine Board
 
Wine Quality
Wine QualityWine Quality
Wine Quality
Tapas Saha
 
21 Cost efficient QM in microbreweries agk
21 Cost efficient QM in microbreweries agk21 Cost efficient QM in microbreweries agk
21 Cost efficient QM in microbreweries agkAxel Kristiansen
 
36 evans
36 evans36 evans
36 evans
Vohinh Ngo
 
Leeder -Analytical Brewing Bro Vis05
Leeder -Analytical Brewing Bro Vis05Leeder -Analytical Brewing Bro Vis05
Leeder -Analytical Brewing Bro Vis05Dr John Leeder
 
Re-utilization of winemaking lees as a new food ingredient
Re-utilization of winemaking lees as a new food ingredientRe-utilization of winemaking lees as a new food ingredient
Re-utilization of winemaking lees as a new food ingredient
World Bulk Wine Exhibition Amsterdam - Asia
 
IRJET- Production and Optimization of Citric Acid by Aspergillus Niger Is...
IRJET-  	  Production and Optimization of Citric Acid by Aspergillus Niger Is...IRJET-  	  Production and Optimization of Citric Acid by Aspergillus Niger Is...
IRJET- Production and Optimization of Citric Acid by Aspergillus Niger Is...
IRJET Journal
 
ABHISHEK S2 FA FERMENTATION food analysis
ABHISHEK S2 FA FERMENTATION food analysisABHISHEK S2 FA FERMENTATION food analysis
ABHISHEK S2 FA FERMENTATION food analysis
Venkatesan R - 6369851191
 
BR-70701-LC-GC-NIR-ICP-Edible-Oil-Workflows
BR-70701-LC-GC-NIR-ICP-Edible-Oil-WorkflowsBR-70701-LC-GC-NIR-ICP-Edible-Oil-Workflows
BR-70701-LC-GC-NIR-ICP-Edible-Oil-Workflowsdmend129
 
Beer Brewing & Ethanol Production
Beer Brewing & Ethanol ProductionBeer Brewing & Ethanol Production
Beer Brewing & Ethanol Production
Betsy Kenaston
 

Similar to pdf.pdf (20)

Wine ppt template
Wine ppt templateWine ppt template
Wine ppt template
 
Beer and water analysis directly in your brewery
Beer and water analysis directly in your brewery Beer and water analysis directly in your brewery
Beer and water analysis directly in your brewery
 
5228_Leeder Wine Bro-low res
5228_Leeder Wine Bro-low res5228_Leeder Wine Bro-low res
5228_Leeder Wine Bro-low res
 
Team_Random
Team_RandomTeam_Random
Team_Random
 
CLEAN IN PLACE PROCESS AND TEORY AND PRAC
CLEAN IN PLACE PROCESS AND TEORY AND PRACCLEAN IN PLACE PROCESS AND TEORY AND PRAC
CLEAN IN PLACE PROCESS AND TEORY AND PRAC
 
ACS NERM 2013 Sour Beer - NMR Talk
ACS NERM 2013   Sour Beer - NMR TalkACS NERM 2013   Sour Beer - NMR Talk
ACS NERM 2013 Sour Beer - NMR Talk
 
2018 Oregon Wine Symposium | Understanding Control Points from Crush Pad to B...
2018 Oregon Wine Symposium | Understanding Control Points from Crush Pad to B...2018 Oregon Wine Symposium | Understanding Control Points from Crush Pad to B...
2018 Oregon Wine Symposium | Understanding Control Points from Crush Pad to B...
 
Wine Quality
Wine QualityWine Quality
Wine Quality
 
21 Cost efficient QM in microbreweries agk
21 Cost efficient QM in microbreweries agk21 Cost efficient QM in microbreweries agk
21 Cost efficient QM in microbreweries agk
 
GCB Syllabus 2014
GCB Syllabus 2014GCB Syllabus 2014
GCB Syllabus 2014
 
36 evans
36 evans36 evans
36 evans
 
Leeder -Analytical Brewing Bro Vis05
Leeder -Analytical Brewing Bro Vis05Leeder -Analytical Brewing Bro Vis05
Leeder -Analytical Brewing Bro Vis05
 
Re-utilization of winemaking lees as a new food ingredient
Re-utilization of winemaking lees as a new food ingredientRe-utilization of winemaking lees as a new food ingredient
Re-utilization of winemaking lees as a new food ingredient
 
Chris_Dorow_PRED411_Sec55_PROJ3
Chris_Dorow_PRED411_Sec55_PROJ3Chris_Dorow_PRED411_Sec55_PROJ3
Chris_Dorow_PRED411_Sec55_PROJ3
 
IRJET- Production and Optimization of Citric Acid by Aspergillus Niger Is...
IRJET-  	  Production and Optimization of Citric Acid by Aspergillus Niger Is...IRJET-  	  Production and Optimization of Citric Acid by Aspergillus Niger Is...
IRJET- Production and Optimization of Citric Acid by Aspergillus Niger Is...
 
P-225_Gore
P-225_GoreP-225_Gore
P-225_Gore
 
ABHISHEK S2 FA FERMENTATION food analysis
ABHISHEK S2 FA FERMENTATION food analysisABHISHEK S2 FA FERMENTATION food analysis
ABHISHEK S2 FA FERMENTATION food analysis
 
BDI Dec 2016 Sodium
BDI Dec 2016 SodiumBDI Dec 2016 Sodium
BDI Dec 2016 Sodium
 
BR-70701-LC-GC-NIR-ICP-Edible-Oil-Workflows
BR-70701-LC-GC-NIR-ICP-Edible-Oil-WorkflowsBR-70701-LC-GC-NIR-ICP-Edible-Oil-Workflows
BR-70701-LC-GC-NIR-ICP-Edible-Oil-Workflows
 
Beer Brewing & Ethanol Production
Beer Brewing & Ethanol ProductionBeer Brewing & Ethanol Production
Beer Brewing & Ethanol Production
 

Recently uploaded

一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 

Recently uploaded (20)

一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 

pdf.pdf

  • 1. 4/26/23, 1:22 PM Wine_Quality_report (2).ipynb - Colaboratory https://colab.research.google.com/drive/1wRV8eEd_0fvwQr6MUuoR5ibqWm6wLzAI#scrollTo=3ed695ef&printMode=true 1/93 Problem Statement Although we are attempting to predict wine quality as a target for a certain number of wines with a given set of predictor factors, wine quality is a subjective measurement. This is an EDA, or data- driven story, including a range of graphs and images as well as an attribute-based quality forecast. Here we need to know: “what is the quality of the wine (in ordinal values)(3-9)? It is a regression task. Objective Perform Data Cleaning, Pre-processing and Feature Selection Apply ML models to predict the Churned Customers Use Auto-ML to determine the best model Use SHAP library to determine the impact of the predictor variables ML Data Cleaning and Feature Selection import numpy as np import pandas as pd import matplotlib.pyplot as plt %matplotlib inline import seaborn as sns from scipy.stats import norm from scipy import stats from scipy.stats import norm from scipy import stats from sklearn import preprocessing from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import StandardScaler from sklearn.model_selection import train_test_split from sklearn.model_selection import GridSearchCV from sklearn.ensemble import ExtraTreesClassifier from sklearn.linear_model import LinearRegression from sklearn.metrics import mean_squared_error from sklearn.metrics import mean_absolute_error from math import sqrt from sklearn.metrics import r2_score Cabernet Sauvignon is known as the king of the red wine. C b t S i d d ('htt // ith b t t /M h j th /DA
  • 2. 4/26/23, 1:22 PM Wine_Quality_report (2).ipynb - Colaboratory https://colab.research.google.com/drive/1wRV8eEd_0fvwQr6MUuoR5ibqWm6wLzAI#scrollTo=3ed695ef&printMode=true 2/93 Cabernet_Sauvignon = pd.read_csv('https://raw.githubusercontent.com/Mohanmanjunatha/DA type fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density 0 white 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.0010 1 white 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940 2 white 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951 3 white 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 4 white 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 Cabernet_Sauvignon.head() Cabernet_Sauvignon.shape (6497, 13) What are the data types? (Only numeric and categorical) Cabernet_Sauvignon.dtypes type object fixed acidity float64 volatile acidity float64 citric acid float64 residual sugar float64 chlorides float64 free sulfur dioxide float64 total sulfur dioxide float64 density float64 pH float64 sulphates float64 alcohol float64 quality int64 dtype: object The dataset has 1 Categorical and 12 Numerical Features. What features are in the dataset? fixed acidity. Fixed acidity is due to the presence of non-volatile acids in wine. For example, tartaric, citric or malic acid. This type of acid combines the balance of the taste of wine, brings freshness to the taste. Volatile acidity is the part of the acid in wine that can be picked up by the nose. Unlike those acids that are palpable to the taste (as we talked about above). Volatile acidity, or in other words, souring
  • 3. 4/26/23, 1:22 PM Wine_Quality_report (2).ipynb - Colaboratory https://colab.research.google.com/drive/1wRV8eEd_0fvwQr6MUuoR5ibqWm6wLzAI#scrollTo=3ed695ef&printMode=true 3/93 of wine, is one of the most common defects. citric acid - allowed to offer in winemaking by the Resolution of the OIV No. 23/2000. It can be used in three cases: for acid treatment of wine (increasing acidity), for collecting wine, for cleaning filters from possible fungal and mold infections. residual sugar is that grape sugar that has not been fermented in alcohol chlorides. The structure of the wine also depends on the content of minerals in the wine, which determine the taste sensation such as salinity (sapidità). Anions of inorganic acids (chlorides, sulfates, sulfites..), anions of transferred acids, metal cations (potassium, sodium, magnesium...) are found in wine. Their content depends mainly on the climatic zone (cold or warm region, salty soils depending on the observation of the sea), oenological practices, storage and aging of wine. free sulfur dioxide, total sulfur dioxide - Sulfur dioxide (sulfur oxide, sulfur dioxide, readiness E220, SO2) is used as a preservative due to its antioxidant and antimicrobial properties. Molecular SO2 is an extremely important antibiotic, affecting significant consumption (including wild yeast) that can manifest itself in wine spoilage. Density - The density of wine can be either less or more than water. Its value is determined primarily by the concentration of alcohol and sugar. White, rosé and red wines are generally light - their density at 20°C is below 998.3 kg/m3. pH is a measure of the acidity of wine. All wines ideally have a pH level between 2.9 and 4.2. The lower the pH, the more acidic the wine; the lower the pH, the less acidic the wine. Sulfates are a natural result of yeast fermenting the sugar in wine into alcohol. That is, the presence of sulfites in wine is excluded. alcohol - The alcohol content in wines depends on many tastes: the grape variety and the amount of sugar in the berries, production technology and growing conditions. Wines vary greatly in degree: this Parameter varies from 4.5 to 22 depending on the category. quality is a target. Are there missing values? Cabernet_Sauvignon.isna().sum() type 0 fixed acidity 10 volatile acidity 8 citric acid 3 residual sugar 2 chlorides 2 free sulfur dioxide 0 total sulfur dioxide 0 density 0
  • 4. 4/26/23, 1:22 PM Wine_Quality_report (2).ipynb - Colaboratory https://colab.research.google.com/drive/1wRV8eEd_0fvwQr6MUuoR5ibqWm6wLzAI#scrollTo=3ed695ef&printMode=true 4/93 pH 9 sulphates 4 alcohol 0 quality 0 dtype: int64 Which independent variables have missing data? How much? fixed acidity - 10 volatile acidity - 8 citric acid - 3 residual sugar - 2 chlorides - 2 pH - 9 sulphates - 4 The above features have the respective number of missing data. Since the data is more symmetric, mean replacement would be better. Before examining quality feature, categorical variables will be mapped with help of cat.code. This will assist to make easier and comprehensible data analysis. Cabernet_Sauvignon['type'] = Cabernet_Sauvignon['type'].astype("category").cat.codes Cabernet_Sauvignon.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 6497 entries, 0 to 6496 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 type 6497 non-null int8 1 fixed acidity 6487 non-null float64 2 volatile acidity 6489 non-null float64 3 citric acid 6494 non-null float64 4 residual sugar 6495 non-null float64 5 chlorides 6495 non-null float64 6 free sulfur dioxide 6497 non-null float64 7 total sulfur dioxide 6497 non-null float64 8 density 6497 non-null float64 9 pH 6488 non-null float64 10 sulphates 6493 non-null float64 11 alcohol 6497 non-null float64 12 quality 6497 non-null int64 dtypes: float64(11), int64(1), int8(1) memory usage: 615.6 KB
  • 5. 1. Mean
# Per-column mean imputation (kept commented out; the KNN imputer below is used instead):
# for col in ["fixed acidity", "volatile acidity", "citric acid", "residual sugar", "chlorides", "pH", "sulphates"]:
#     Cabernet_Sauvignon[col].fillna(Cabernet_Sauvignon[col].mean(), inplace=True)
# Cabernet_Sauvignon.isnull().sum()
2. KNN Imputer
#Creating a separate dataframe for performing the KNN imputation
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler
imputer = KNNImputer(n_neighbors=5)
Cabernet_Sauvignon = pd.DataFrame(imputer.fit_transform(Cabernet_Sauvignon), columns = Cabernet_Sauvignon.columns)
Cabernet_Sauvignon.isnull().sum()
type 0 fixed acidity 0 volatile acidity 0 citric acid 0 residual sugar 0 chlorides 0 free sulfur dioxide 0
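KNNImputer is distance based, so when it is run on raw values (as above) features with large scales such as total sulfur dioxide dominate the neighbour search. A minimal sketch, assuming the unimputed Cabernet_Sauvignon dataframe, of scaling before imputing and mapping back afterwards; the MinMaxScaler import above is otherwise unused, and the variable names here are illustrative only.

import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled = scaler.fit_transform(Cabernet_Sauvignon)            # NaNs pass through the scaler
imputed_scaled = KNNImputer(n_neighbors=5).fit_transform(scaled)
# undo the scaling so the imputed values are back in the original units
Cabernet_Sauvignon_knn = pd.DataFrame(scaler.inverse_transform(imputed_scaled),
                                      columns=Cabernet_Sauvignon.columns)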
  • 6. total sulfur dioxide 0 density 0 pH 0 sulphates 0 alcohol 0 quality 0 dtype: int64
What are the likely distributions of the numeric variables? What are the distributions of the predictor variables?
In the plots below, a good fit to the overlaid normal curve indicates that normality is a reasonable approximation.
Distribution of Predictors
Cabernet_SauvignonColumnList = Cabernet_Sauvignon.columns
for i in Cabernet_SauvignonColumnList:
    plt.figure(figsize= (5,5))
    sns.distplot(Cabernet_Sauvignon[i], fit = norm)
    plt.title(f"Distribution of {i} (checking normal distribution fit)", size = 15, weight = 'bold')
  • 7. (Repeated seaborn FutureWarning output omitted: distplot is deprecated in favour of displot/histplot.)
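The warnings above come from seaborn deprecating distplot. A minimal sketch of the same normality check with the non-deprecated API (histplot plus a normal density fitted with scipy); this is an equivalent alternative, not the notebook's original code.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm

for col in Cabernet_Sauvignon.select_dtypes("number").columns:
    data = Cabernet_Sauvignon[col].dropna()
    plt.figure(figsize=(5, 5))
    sns.histplot(data, stat="density", kde=True)      # replacement for distplot
    mu, sigma = norm.fit(data)                        # what fit=norm did internally
    xs = np.linspace(data.min(), data.max(), 200)
    plt.plot(xs, norm.pdf(xs, mu, sigma), "r--", label=f"N({mu:.2f}, {sigma:.2f})")
    plt.title(f"Distribution of {col}")
    plt.legend()
    plt.show()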
  • 8. type : categorical values
fixed acidity : normal distribution
volatile acidity : almost normal distribution with a bit of right-skewness
citric acid : almost normal distribution with a bit of edge-peak
residual sugar : almost normal distribution with a bit of right-skewness
chlorides : almost normal distribution with a bit of right-skewness
free sulfur dioxide : normal distribution
total sulfur dioxide : almost normal distribution with a bit of edge-peak
sulphates : normal distribution
alcohol : almost normal distribution with a bit of right-skewness
pH : normal distribution
density : normal distribution
Do the ranges of the predictor variables make sense?
type fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide
count 6497.000000 6497.000000 6497.000000 6497.000000 6497.000000 6497.000000 6497.0
mean 0.753886 7.216501 0.339634 0.318675 5.445704 0.056041 30.5
std 0.430779 1.295928 0.164563 0.145267 4.758043 0.035032 17.7
min 0.000000 3.800000 0.080000 0.000000 0.600000 0.009000 1.0
25% 1.000000 6.400000 0.230000 0.250000 1.800000 0.038000 17.0
50% 1.000000 7.000000 0.290000 0.310000 3.000000 0.047000 29.0
75% 1.000000 7.700000 0.400000 0.390000 8.100000 0.065000 41.0
max 1.000000 15.900000 1.580000 1.660000 65.800000 0.611000 289.0
#Range of each column
Cabernet_Sauvignon.max() - Cabernet_Sauvignon.min()
Cabernet_Sauvignon.describe()
The ranges make sense for each attribute that a wine constitutes. The range of the "total sulfur dioxide" variable is high, which implies high variability in its distribution.
  • 9. Do the training and test sets have the same data?
Using train_test_split, the data is split into train and test sets at an 80/20 ratio from the same dataset. The two sets are disjoint, and the test set is not seen by the model during the training phase, although the distribution of each attribute is roughly proportional in both train and test sets (see the KS-test sketch below).
Phase 1
Cabernet_Sauvignon_x = Cabernet_Sauvignon[['type','fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH','sulphates','alcohol']]
Cabernet_Sauvignon_y = Cabernet_Sauvignon['quality']
# .iloc[:,:12], Cabernet_Sauvignon.iloc[:,-1]
Cabernet_Sauvignon_y.head()
0 6.0 1 6.0 2 6.0 3 6.0 4 6.0 Name: quality, dtype: float64
scaler = StandardScaler()
# #Dataframe Cabernet_Sauvignon with outliers
Cabernet_Sauvignon_x = scaler.fit_transform(Cabernet_Sauvignon_x)
plt.figure(figsize=(20,7))
ax = sns.boxplot(data=Cabernet_Sauvignon_x)
ax.set_xticklabels(Cabernet_SauvignonColumnList[:12])
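One way to check the claim that train and test follow the same distribution is a two-sample Kolmogorov-Smirnov test per feature. A minimal sketch, assuming the X_train/X_test arrays produced by the 80/20 split on the next slide; large p-values are consistent with the two samples coming from the same distribution.

from scipy.stats import ks_2samp

for j, name in enumerate(Cabernet_SauvignonColumnList[:12]):
    stat, p = ks_2samp(X_train[:, j], X_test[:, j])
    print(f"{name:22s} KS statistic={stat:.3f} p-value={p:.3f}")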
  • 10. (boxplot x-tick labels set to the 12 feature names)
#Splitting the dataset with outliers into Train and Test sets at 80-20 proportion
X_train, X_test, y_train, y_test = train_test_split(Cabernet_Sauvignon_x, Cabernet_Sauvignon_y, test_size=0.2)
X_train.shape (5197, 12)
X_test.shape (1300, 12)
Model Building
Linear Regression Model
##Linear Regression
lr = LinearRegression(copy_X= True, fit_intercept = True, normalize = True)
lr.fit(X_train, y_train)
lr_pred= lr.predict(X_test)
print('--Phase-1--')
mae1 = mean_absolute_error(y_test, lr_pred)
print('MAE: %f'% mae1)
rmse1= np.sqrt(mean_squared_error(y_test, lr_pred))
print('RMSE: %f'% rmse1)
r21 = r2_score(y_test, lr_pred)
print('R2: %f' % r21)
--Phase-1-- MAE: 0.545152 RMSE: 0.686665 R2: 0.340363
(sklearn FutureWarning: the normalize parameter of LinearRegression is deprecated; the suggested replacement is a StandardScaler + LinearRegression Pipeline.)
  • 11. 3 metrics will be calculated for evaluating predictions. Mean Absolute Error (MAE) shows the difference between predictions and actual values. Root Mean Square Error (RMSE) shows how accurately the model predicts the response. R^2 will be calculated as the goodness-of-fit measure.
plt.figure(figsize=(5, 7))
ax = sns.distplot(y_test, hist=False, color="r", label="Actual Value")
sns.distplot(lr_pred, hist=False, color="b", label="Fitted Values" , ax=ax)
plt.title('Actual(red) vs Fitted(blue) Values for Quality')
plt.show()
plt.close()
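The deprecation warning from LinearRegression(normalize=True) suggests moving the scaling into a Pipeline. A minimal sketch of that replacement, assuming the same X_train/y_train split as above; results should be close to the Phase 1 numbers but are not guaranteed to match them exactly.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

lr_pipe = make_pipeline(StandardScaler(), LinearRegression())
lr_pipe.fit(X_train, y_train)
pipe_pred = lr_pipe.predict(X_test)
print('MAE: %f' % mean_absolute_error(y_test, pipe_pred))
print('RMSE: %f' % np.sqrt(mean_squared_error(y_test, pipe_pred)))
print('R2: %f' % r2_score(y_test, pipe_pred))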
  • 12. 4/26/23, 1:22 PM Wine_Quality_report (2).ipynb - Colaboratory https://colab.research.google.com/drive/1wRV8eEd_0fvwQr6MUuoR5ibqWm6wLzAI#scrollTo=3ed695ef&printMode=true 12/93 /usr/local/lib/python3.8/dist-packages/seaborn/distributions.py:2619: FutureWarni warnings.warn(msg, FutureWarning) /usr/local/lib/python3.8/dist-packages/seaborn/distributions.py:2619: FutureWarni warnings.warn(msg, FutureWarning) Random Forest from sklearn.ensemble import RandomForestRegressor model2 = RandomForestRegressor(random_state=1, n_estimators=1000) model2.fit(X_train, y_train) Rm_pred = model2.predict(X_test) print('--Phase-1--') mae2 = mean_absolute_error(y_test, Rm_pred) print('MAE: %f'% mae2) rmse2 = np.sqrt(mean_squared_error(y_test, Rm_pred)) print('RMSE: %f'% rmse2 ) r22 = r2_score(y_test, Rm_pred) print('R2: %f' % r22) --Phase-1-- MAE: 0.401750 RMSE: 0.561165 R2: 0.559449 plt.figure(figsize=(5, 7)) ax = sns.distplot(y_test, hist=False, color="r", label="Actual Value") sns.distplot(Rm_pred, hist=False, color="b", label="Fitted Values" , ax=ax) plt.title('Actual(red) vs Fitted(blue) Values for Quality') plt.show() plt.close()
  • 13. Decision Tree
from sklearn.tree import DecisionTreeRegressor
model3 = DecisionTreeRegressor(max_depth=6)
model3.fit(X_train, y_train)
Dt_pred = model3.predict(X_test)
print('--Phase-1--')
mae3 = mean_absolute_error(y_test, Dt_pred)
print('MAE: %f'% mae3)
rmse3 = np.sqrt(mean_squared_error(y_test, Dt_pred))
print('RMSE: %f'% rmse3)
r23 = r2_score(y_test, Dt_pred)
print('R2: %f' % r23)
--Phase-1-- MAE: 0.541020 RMSE: 0.696854 R2: 0.320642
plt.figure(figsize=(5, 7))
ax = sns.distplot(y_test, hist=False, color="r", label="Actual Value")
  • 14. sns.distplot(Dt_pred, hist=False, color="b", label="Fitted Values" , ax=ax)
plt.title('Actual(red) vs Fitted(blue) Values for Quality')
plt.show()
plt.close()
Phase 2
Are the predictor variables independent of all the other predictor variables?
Multicollinearity
Multicollinearity measures the relationship between explanatory variables in multiple regression. If multicollinearity occurs, the highly related input variables should be eliminated from the model. In this notebook, multicollinearity is checked by plotting a correlation heatmap; a variance inflation factor (VIF) check is sketched below as well.
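The variance inflation factor is a standard complement to the correlation heatmap: VIF for a feature is 1/(1-R^2) from regressing that feature on all the others, and values above roughly 5-10 are usually read as problematic collinearity. A minimal sketch using statsmodels on the first 12 columns of the imputed dataframe; this check is an addition, not part of the original notebook.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_vif = sm.add_constant(Cabernet_Sauvignon.iloc[:, 0:12])   # 12 predictors plus an intercept column
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
)
print(vif.drop("const").sort_values(ascending=False))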
  • 15. 4/26/23, 1:22 PM Wine_Quality_report (2).ipynb - Colaboratory https://colab.research.google.com/drive/1wRV8eEd_0fvwQr6MUuoR5ibqWm6wLzAI#scrollTo=3ed695ef&printMode=true 15/93 Which independent variables are useful to predict a target (dependent variable)? (Use at least three methods) For a regression model, the most useful Independent Variables can be statistically determined using the following methods: f_regression mutual_info_regression Correlation Matrix with Heatmap Each of the following method is applied below to the dataset. 1. f_regression from sklearn.feature_selection import SelectKBest from sklearn.feature_selection import f_regression, mutual_info_regression X = Cabernet_Sauvignon.iloc[:,0:12] y = Cabernet_Sauvignon.iloc[:,-1] # y=y.astype('int') # y = pd.DataFrame(y) # y.head(10) # y.describe() #Applying SelectKBest class to extract top features # feature selection f_selector = SelectKBest(score_func=f_regression, k='all') # learn relationship from training data f_selector.fit(X_train, y_train) # transform train input data X_train_fs = f_selector.transform(X_train) # transform test input data X_test_fs = f_selector.transform(X_test) # Plot the scores for the features plt.rcParams["figure.figsize"] = (30,10) plt.bar([i for i in range(len(f_selector.scores_))], f_selector.scores_) plt.xlabel("feature index") plt.xticks([i for i in range(len(f_selector.scores_))], Cabernet_SauvignonColumnList[: plt.ylabel("F-value (transformed from the correlation values)") plt.show() # bestFeatures = SelectKBest(score_func= chi2, k =12) # fit = bestFeatures.fit(X,y)
  • 16. 4/26/23, 1:22 PM Wine_Quality_report (2).ipynb - Colaboratory https://colab.research.google.com/drive/1wRV8eEd_0fvwQr6MUuoR5ibqWm6wLzAI#scrollTo=3ed695ef&printMode=true 16/93 we can see that volatile acidity, chlorides, density and alcohol have more importance than the others. 2.Mutual information metric # feature selection f_selector = SelectKBest(score_func=mutual_info_regression, k='all') # learn relationship from training data f_selector.fit(X_train, y_train) # transform train input data X_train_fs = f_selector.transform(X_train) # transform test input data X_test_fs = f_selector.transform(X_test) # Plot the scores for the features plt.bar([i for i in range(len(f_selector.scores_))], f_selector.scores_, align = 'cent plt.xlabel("feature index") plt.xticks([i for i in range(len(f_selector.scores_))], Cabernet_SauvignonColumnList[: plt.ylabel("Estimated MI value") # plt.rcParams["figure.figsize"] = (30,10) plt.show()
  • 17. 3. Correlation Matrix with HeatMap
corrmat = Cabernet_Sauvignon.corr()
top_corr_features = corrmat.index
plt.figure(figsize = (20,20))
#plot heatmap
g = sns.heatmap(Cabernet_Sauvignon[top_corr_features].corr(), annot= True, cmap='RdYlGn')
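As a quick numeric companion to the heatmap, the correlations can also be ranked directly; this small convenience is an addition to the notebook and supports the insights listed on the next slide.

# strongest absolute correlations with quality, and the most correlated feature pairs
corr_with_target = corrmat['quality'].drop('quality').abs().sort_values(ascending=False)
print(corr_with_target)

pair_corr = corrmat.abs().where(~np.eye(len(corrmat), dtype=bool)).stack().sort_values(ascending=False)
print(pair_corr.head(10))   # each pair appears twice, as (A, B) and (B, A)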
  • 18. By looking at the correlation matrix above we can gain the following insights:
1. volatile acidity and chlorides are highly (-ve) correlated with type.
2. alcohol is highly (-ve) correlated with density.
3. total sulfur dioxide is highly (+ve) correlated with type.
Looking at the 3 feature importance methods above, volatile acidity, chlorides, density and alcohol are the most important features in common for predicting the value of quality.
Outlier Treatment
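The following slides remove outliers column by column with the 1.5*IQR rule (several of those cells are truncated in this export). A minimal equivalent sketch of the same rule written as a single loop; the column list and bounds follow the per-column cells below.

iqr_columns = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
               'chlorides', 'free sulfur dioxide', 'total sulfur dioxide',
               'sulphates', 'alcohol', 'pH', 'density']

for col in iqr_columns:
    q1, q3 = np.percentile(Cabernet_Sauvignon[col], [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    print(col, lower, upper)
    # drop rows outside the 1.5*IQR fences before moving to the next column
    mask = (Cabernet_Sauvignon[col] < lower) | (Cabernet_Sauvignon[col] > upper)
    Cabernet_Sauvignon.drop(Cabernet_Sauvignon[mask].index, inplace=True)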
  • 19. 4/26/23, 1:22 PM Wine_Quality_report (2).ipynb - Colaboratory https://colab.research.google.com/drive/1wRV8eEd_0fvwQr6MUuoR5ibqWm6wLzAI#scrollTo=3ed695ef&printMode=true 19/93 Q1fixed,Q3fixed = np.percentile(Cabernet_Sauvignon['fixed acidity'] , [25,75]) IQRfixed = Q3fixed - Q1fixed Ufixed_acidity = Q3fixed + 1.5*IQRfixed Lfixed_acidity = Q1fixed - 1.5*IQRfixed print(Ufixed_acidity) print(Lfixed_acidity) Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['fixed acidity'] < Lfixe Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['fixed acidity'] > Ufixe 9.65 4.450000000000001 Q1volatile,Q3volatile = np.percentile(Cabernet_Sauvignon['volatile acidity'] , [25,75] IQRvolatile = Q3volatile - Q1volatile Uvolatile_acidity = Q3volatile + 1.5*IQRvolatile Lvolatile_acidity= Q1volatile - 1.5*IQRvolatile print(Uvolatile_acidity) print(Lvolatile_acidity) Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['volatile acidity'] < Lv Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['volatile acidity'] > Uv 0.645 -0.035 Q1citric,Q3citric = np.percentile(Cabernet_Sauvignon['citric acid'] , [25,75]) IQRcitric = Q3citric - Q1citric Ucitric_acid = Q3citric + 1.5*IQRcitric Lcitric_acid= Q1citric - 1.5*IQRcitric print(Ucitric_acid) print(Lcitric_acid) Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['citric acid'] < Lcitric Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['citric acid'] > Ucitric 0.56 0.08000000000000002
  • 20. 4/26/23, 1:22 PM Wine_Quality_report (2).ipynb - Colaboratory https://colab.research.google.com/drive/1wRV8eEd_0fvwQr6MUuoR5ibqWm6wLzAI#scrollTo=3ed695ef&printMode=true 20/93 Q1residual,Q3residual = np.percentile(Cabernet_Sauvignon['residual sugar'] , [25,75]) IQRresidual = Q3residual - Q1residual Uresidual_sugar = Q3residual + 1.5*IQRresidual Lresidual_sugar= Q1residual - 1.5*IQRresidual print(Uresidual_sugar) print(Lresidual_sugar) Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['residual sugar'] < Lres Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['residual sugar'] > Ures 19.049999999999997 -8.549999999999999 Q1chlorides,Q3chlorides = np.percentile(Cabernet_Sauvignon['chlorides'] , [25,75]) IQRchlorides = Q3chlorides - Q1chlorides Uchlorides = Q3chlorides + 1.5*IQRchlorides # Cabernet_Sauvignon['chlori Lchlorides= Q1chlorides - 1.5*IQRchlorides # Cabernet_Sauvignon['chlori print(Uchlorides) print(Lchlorides) Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['chlorides'] < Lchloride Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['chlorides'] > Uchloride 0.081 0.008999999999999994 Q1free_sulfur,Q3free_sulfur = np.percentile(Cabernet_Sauvignon['free sulfur dioxide'] IQRfree_sulfur = Q3free_sulfur - Q1free_sulfur Ufree_sulfur_dioxide = Q3free_sulfur + 1.5*IQRfree_sulfur Lfree_sulfur_dioxide= Q1free_sulfur - 1.5*IQRfree_sulfur print(Ufree_sulfur_dioxide) print(Lfree_sulfur_dioxide) Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['free sulfur dioxide'] < Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['free sulfur dioxide'] >
  • 21. 4/26/23, 1:22 PM Wine_Quality_report (2).ipynb - Colaboratory https://colab.research.google.com/drive/1wRV8eEd_0fvwQr6MUuoR5ibqWm6wLzAI#scrollTo=3ed695ef&printMode=true 21/93 78.5 -13.5 Q1total_sulfur,Q3total_sulfur = np.percentile(Cabernet_Sauvignon['total sulfur dioxide IQRtotal_sulfur = Q3total_sulfur - Q1total_sulfur Utotal_sulfur_dioxide = Q3total_sulfur + 1.5*IQRtotal_sulfur Ltotal_sulfur_dioxide= Q1total_sulfur - 1.5*IQRtotal_sulfur print(Utotal_sulfur_dioxide) print(Ltotal_sulfur_dioxide) Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['total sulfur dioxide'] Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['total sulfur dioxide'] 254.0 6.0 Q1sulphates,Q3sulphates = np.percentile(Cabernet_Sauvignon['sulphates'] , [25,75]) IQRsulphates = Q3sulphates - Q1sulphates Usulphates = Q3sulphates + 1.5*IQRsulphates Lsulphates= Q1sulphates - 1.5*IQRsulphates print(Usulphates) print(Lsulphates) Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['sulphates'] < Lsulphate Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['sulphates'] > Usulphate 0.7949999999999999 0.19500000000000003 Q1alcohol,Q3alcohol = np.percentile(Cabernet_Sauvignon['alcohol'] , [25,75]) IQRalcohol = Q3alcohol - Q1alcohol Ualcohol = Q3alcohol + 1.5*IQRalcohol Lalcohol= Q1alcohol - 1.5*IQRalcohol print(Ualcohol) print(Lalcohol) Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['alcohol'] < Lalcohol].i Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['alcohol'] > Ualcohol].i
  • 22. 4/26/23, 1:22 PM Wine_Quality_report (2).ipynb - Colaboratory https://colab.research.google.com/drive/1wRV8eEd_0fvwQr6MUuoR5ibqWm6wLzAI#scrollTo=3ed695ef&printMode=true 22/93 14.25 6.6499999999999995 Q1pH,Q3pH = np.percentile(Cabernet_Sauvignon['pH'] , [25,75]) IQRpH = Q3pH - Q1pH UpH = Q3pH + 1.5*IQRpH LpH= Q1pH - 1.5*IQRpH print(UpH) print(LpH) Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['pH'] < LpH].index, inpl Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['pH'] > UpH].index, inpl 3.5999999999999996 2.8000000000000007 Q1density,Q3density = np.percentile(Cabernet_Sauvignon['density'] , [25,75]) IQRdensity = Q3density - Q1density Udensity = Q3density + 1.5*IQRdensity Ldensity= Q1density - 1.5*IQRdensity print(Udensity) print(Ldensity) Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['density'] < Ldensity].i Cabernet_Sauvignon.drop(Cabernet_Sauvignon[Cabernet_Sauvignon['density'] > Udensity].i 1.00267 0.9851500000000002 Cabernet_Sauvignon.describe()
  • 23. 4/26/23, 1:22 PM Wine_Quality_report (2).ipynb - Colaboratory https://colab.research.google.com/drive/1wRV8eEd_0fvwQr6MUuoR5ibqWm6wLzAI#scrollTo=3ed695ef&printMode=true 23/93 type fixed acidity volatile acidity citric acid residual sugar chlorides su dio count 4598.000000 4598.000000 4598.000000 4598.000000 4598.000000 4598.000000 4598.0 mean 0.921923 6.911398 0.284059 0.320317 5.939374 0.044548 33.0 std 0.268323 0.832672 0.101024 0.089928 4.743293 0.012699 15.3 min 0.000000 4.700000 0.080000 0.090000 0.600000 0.009000 2.0 25% 1.000000 6.400000 0.210000 0.260000 1.800000 0.036000 22.0 50% 1.000000 6.800000 0.270000 0.310000 4.600000 0.043000 32.0 75% 1.000000 7.400000 0.330000 0.370000 8.987500 0.051000 44.0 max 1.000000 9.600000 0.645000 0.560000 18.950000 0.081000 78.0 # Cabernet_Sauvignon.drop([9]) Cabernet_Sauvignon_cleaned_x,Cabernet_Sauvignon_cleaned_y = Cabernet_Sauvignon.iloc[:, Cabernet_Sauvignon_cleaned_x.shape (4598, 12) Cabernet_Sauvignon_cleaned_x = scaler.fit_transform(Cabernet_Sauvignon_cleaned_x) #Splitting the dataset after outlier treatment into Train and Test sets at 80-20 propo Xclean_train, Xclean_test, yclean_train, yclean_test = train_test_split(Cabernet_Sauvi plt.figure(figsize=(20,7)) ax = sns.boxplot(data=Cabernet_Sauvignon_cleaned_x) ax.set_xticklabels(Cabernet_SauvignonColumnList[:12])
  • 24. 4/26/23, 1:22 PM Wine_Quality_report (2).ipynb - Colaboratory https://colab.research.google.com/drive/1wRV8eEd_0fvwQr6MUuoR5ibqWm6wLzAI#scrollTo=3ed695ef&printMode=true 24/93 [Text(0, 0, 'type'), Text(0, 0, 'fixed acidity'), Text(0, 0, 'volatile acidity'), Text(0, 0, 'citric acid'), Text(0, 0, 'residual sugar'), Text(0, 0, 'chlorides'), Text(0, 0, 'free sulfur dioxide'), Text(0, 0, 'total sulfur dioxide'), Text(0, 0, 'density'), Text(0, 0, 'pH'), Text(0, 0, 'sulphates'), Text(0, 0, 'alcohol')] ##Linear Regression # lr = LinearRegression(copy_X= True, fit_intercept = True, normalize = True) lr.fit(Xclean_train, yclean_train) lrclean_pred= lr.predict(Xclean_test) # model2 = RandomForestRegressor(random_state=1, n_estimators=1000) model2.fit(Xclean_train, yclean_train) Rmclean_pred = model2.predict(Xclean_test) model3.fit(Xclean_train, yclean_train) Dtclean_pred = model3.predict(Xclean_test) /usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_base.py:141: FutureW If you wish to scale the data, use Pipeline with a StandardScaler in a preprocess from sklearn.pipeline import make_pipeline model = make_pipeline(StandardScaler(with_mean=False), LinearRegression()) If you wish to pass a sample_weight parameter, you need to pass it as a fit param kwargs = {s[0] + '__sample_weight': sample_weight for s in model.steps} model.fit(X, y, **kwargs) warnings.warn( print('-------------Linear Regression-----------') print('--Phase-1--') print('MAE: %f'% mae1) print('RMSE: %f'% rmse1) print('R2: %f' % r21) print('--Phase-2--') print('MAE: %f'% mean_absolute_error(yclean_test, lrclean_pred)) print('RMSE: %f'% np.sqrt(mean_squared_error(yclean_test, lrclean_pred))) print('R2: %f' % r2_score(yclean_test, lrclean_pred)) print('-------------Random forest-----------') print('--Phase-1--') print('MAE: %f'% mae2)
  • 25. 4/26/23, 1:22 PM Wine_Quality_report (2).ipynb - Colaboratory https://colab.research.google.com/drive/1wRV8eEd_0fvwQr6MUuoR5ibqWm6wLzAI#scrollTo=3ed695ef&printMode=true 25/93 print('RMSE: %f'% rmse2) print('R2: %f' % r22) print('--Phase-2--') print('MAE: %f'% mean_absolute_error(yclean_test, Rmclean_pred)) print('RMSE: %f'% np.sqrt(mean_squared_error(yclean_test, Rmclean_pred))) print('R2: %f' % r2_score(yclean_test, Rmclean_pred)) print('-------------Descision Tree-----------') print('--Phase-1--') print('MAE: %f'% mae3) print('RMSE: %f'% rmse3) print('R2: %f' % r23) print('--Phase-2--') print('MAE: %f'% mean_absolute_error(yclean_test, Dtclean_pred)) print('RMSE: %f'% np.sqrt(mean_squared_error(yclean_test, Dtclean_pred))) print('R2: %f' % r2_score(yclean_test, Dtclean_pred)) -------------Linear Regression----------- --Phase-1-- MAE: 0.545152 RMSE: 0.686665 R2: 0.340363 --Phase-2-- MAE: 0.578749 RMSE: 0.748469 R2: 0.274277 -------------Random forest----------- --Phase-1-- MAE: 0.401750 RMSE: 0.561165 R2: 0.559449 --Phase-2-- MAE: 0.438112 RMSE: 0.622107 R2: 0.498635 -------------Descision Tree----------- --Phase-1-- MAE: 0.541020 RMSE: 0.696854 R2: 0.320642 --Phase-2-- MAE: 0.586013 RMSE: 0.756198 R2: 0.259211 The results show that both phases have different prediction results. Phase 1 and 2 don't have a great difference for each metric. MAE, RMSE metric values are increased in Phase 2 which means,
  • 26. the prediction error is higher in that phase, while model explainability decreased by a negligible margin.
Remove outliers vs. keep outliers (does it have an effect on the final predictive model)?
An MAE value of 0 would indicate no error, in other words a perfect prediction; the results above show that all predictions have considerable error, especially in Phase 2. RMSE gives an idea of how much error the system typically makes in its predictions; here RMSE got worse after removing the outliers. R2 represents the proportion of the variance of the dependent variable that is explained by the independent variables.
Cabernet_Sauvignon_class = Cabernet_Sauvignon
Cabernet_Sauvignon_imputation= Cabernet_Sauvignon
quality_mapping = { 3 : 'Low', 4 : 'Low', 5: 'Medium', 6 : 'Medium', 7: 'Medium', 8 : 'High', 9 : 'High'}
Cabernet_Sauvignon_class['quality'] = Cabernet_Sauvignon_class['quality'].map(quality_mapping)
Cabernet_Sauvignon_class_x,Cabernet_Sauvignon_class_y = Cabernet_Sauvignon.iloc[:,:12], Cabernet_Sauvignon.iloc[:,-1]
Cabernet_Sauvignon_class_x = scaler.fit_transform(Cabernet_Sauvignon_class_x)
#Splitting the dataset, after binning quality into classes, into Train and Test sets at 80-20 proportion
Xclass_train, Xclass_test, yclass_train, yclass_test = train_test_split(Cabernet_Sauvignon_class_x, Cabernet_Sauvignon_class_y, test_size=0.2)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# creating a RF classifier
clf = RandomForestClassifier(n_estimators = 1000)
# Training the model on the training dataset
# fit function is used to train the model using the training sets as parameters
clf.fit(Xclass_train, yclass_train)
# performing predictions on the test dataset
yclass_pred = clf.predict(Xclass_test)
# metrics are used to find accuracy or error
from sklearn import metrics
print()
# using metrics module for accuracy calculation
print("ACCURACY OF THE MODEL: ", metrics.accuracy_score(yclass_test, yclass_pred))
print(classification_report(yclass_test, yclass_pred))
ACCURACY OF THE MODEL: 0.9456521739130435
precision recall f1-score support
  • 27. 4/26/23, 1:22 PM Wine_Quality_report (2).ipynb - Colaboratory https://colab.research.google.com/drive/1wRV8eEd_0fvwQr6MUuoR5ibqWm6wLzAI#scrollTo=3ed695ef&printMode=true 27/93 High 1.00 0.34 0.51 38 Low 0.00 0.00 0.00 24 Medium 0.95 1.00 0.97 858 accuracy 0.95 920 macro avg 0.65 0.45 0.49 920 weighted avg 0.92 0.95 0.93 920 quality_mapping_again = { 'Low':0, 'Medium':1, 'High':2} yclass_test = yclass_test.map(quality_mapping_again) yclass_pred_new = [s.replace('Medium', '1') for s in yclass_pred] yclass_pred_new = [s.replace('Low', '0') for s in yclass_pred_new] yclass_pred_new = [s.replace('High', '2') for s in yclass_pred_new] yclass_pred_new = [int(item) for item in yclass_pred_new] plt.figure(figsize=(5, 7)) ax = sns.distplot(yclass_test, hist=False, color="r", label="Actual Value") sns.distplot(yclass_pred_new, hist=False, color="b", label="Fitted Values" , ax=ax) plt.title('Actual vs Fitted Values for Quality') plt.show() plt.close()
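The classification report above also shows the cost of class imbalance: 'Low' (24 test samples) gets 0.00 recall and 'High' only 0.34, while the dominant 'Medium' class drives the 0.95 accuracy. A small hedged variation, assuming the same Xclass/yclass split, that weights classes inversely to their frequency; this typically trades some overall accuracy for better minority-class recall and is an addition rather than part of the original notebook.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# assumes the string-labelled yclass_train/yclass_test from the split above
# (re-run the split or keep a copy, since yclass_test was remapped to integers for the plot)
clf_balanced = RandomForestClassifier(n_estimators=1000, class_weight='balanced')
clf_balanced.fit(Xclass_train, yclass_train)
print(classification_report(yclass_test, clf_balanced.predict(Xclass_test)))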
  • 28. As we can see here, the accuracy of the classification model turned out to be much higher than any regression method used in Phase 1. It can be interpreted as follows: wine tastings are generally blind tastings, and even for the best wine connoisseurs it is very difficult to differentiate between a quality of 7 or 8. Also, judging the quality of a wine by how it tastes is very subjective; often it is how the product is marketed and promoted that forms the general opinion of the target audience. That said, a good wine is a good wine: based on the chemical composition of the wine itself, we can at least say whether it is a good or a bad one. So, when the model is asked to place a wine in a category, it achieves much higher accuracy, since classifying into bins is easier than predicting a precise quality rating.
Data Imputation
Remove 1%, 5%, and 10% of your data randomly and impute the values back using at least 3 imputation methods. How well did the methods recover the missing values? That is, remove some data, check the % error on the residuals for numeric data, and check the bias and variance of the error.
Imputation 1
Cabernet_Sauvignon_imputation['1_percent'] = Cabernet_Sauvignon_imputation[['alcohol']]
Cabernet_Sauvignon_imputation['5_percent'] = Cabernet_Sauvignon_imputation[['alcohol']]
Cabernet_Sauvignon_imputation['10_percent'] = Cabernet_Sauvignon_imputation[['alcohol']]
Cabernet_Sauvignon_imputation.head()
  • 29. 4/26/23, 1:22 PM Wine_Quality_report (2).ipynb - Colaboratory https://colab.research.google.com/drive/1wRV8eEd_0fvwQr6MUuoR5ibqWm6wLzAI#scrollTo=3ed695ef&printMode=true 29/93 type fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density 1 1.0 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940 2 1.0 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951 3 1.0 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 4 1.0 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 5 1.0 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951 def get_percent_missing(dataframe): percent_missing = dataframe.isnull().sum() * 100 / len(dataframe) missing_value_Cabernet_Sauvignon = pd.DataFrame({'column_name': dataframe.columns, 'percent_missing': percent_missing}) return missing_value_Cabernet_Sauvignon print(get_percent_missing(Cabernet_Sauvignon_imputation)) column_name percent_missing type type 0.0 fixed acidity fixed acidity 0.0 volatile acidity volatile acidity 0.0 citric acid citric acid 0.0 residual sugar residual sugar 0.0 chlorides chlorides 0.0 free sulfur dioxide free sulfur dioxide 0.0 total sulfur dioxide total sulfur dioxide 0.0 density density 0.0 pH pH 0.0 sulphates sulphates 0.0 alcohol alcohol 0.0 quality quality 0.0 1_percent 1_percent 0.0 5_percent 5_percent 0.0 10_percent 10_percent 0.0 def create_missing(dataframe, percent, col): dataframe.loc[dataframe.sample(frac = percent).index, col] = np.nan create_missing(Cabernet_Sauvignon_imputation, 0.01, '1_percent') create_missing(Cabernet_Sauvignon_imputation, 0.05, '5_percent') create_missing(Cabernet_Sauvignon_imputation, 0.1, '10_percent') print(get_percent_missing(Cabernet_Sauvignon_imputation)) column_name percent_missing type type 0.000000 fixed acidity fixed acidity 0.000000 volatile acidity volatile acidity 0.000000 citric acid citric acid 0.000000 residual sugar residual sugar 0.000000 chlorides chlorides 0.000000 free sulfur dioxide free sulfur dioxide 0.000000 total sulfur dioxide total sulfur dioxide 0.000000 density density 0.000000 pH pH 0.000000 sulphates sulphates 0.000000 alcohol alcohol 0.000000
  • 30. quality quality 0.000000 1_percent 1_percent 1.000435 5_percent 5_percent 5.002175 10_percent 10_percent 10.004350
# Store index of NaN values in each column
number_1_idx = list(np.where(Cabernet_Sauvignon_imputation['1_percent'].isna())[0])
number_5_idx = list(np.where(Cabernet_Sauvignon_imputation['5_percent'].isna())[0])
number_10_idx = list(np.where(Cabernet_Sauvignon_imputation['10_percent'].isna())[0])
print(f"Length of number_1_idx is {len(number_1_idx)} and it contains {(len(number_1_idx)/len(Cabernet_Sauvignon_imputation))*100}% of total data in the column")
print(f"Length of number_5_idx is {len(number_5_idx)} and it contains {(len(number_5_idx)/len(Cabernet_Sauvignon_imputation))*100}% of total data in the column")
print(f"Length of number_10_idx is {len(number_10_idx)} and it contains {(len(number_10_idx)/len(Cabernet_Sauvignon_imputation))*100}% of total data in the column")
Length of number_1_idx is 46 and it contains 1.0004349717268377% of total data in the column
Length of number_5_idx is 230 and it contains 5.002174858634189% of total data in the column
Length of number_10_idx is 460 and it contains 10.004349717268378% of total data in the column
Imputation 2
KNN Imputation
The k nearest neighbours algorithm is commonly used for simple classification and regression. For imputation it uses 'feature similarity' to predict missing values: a missing entry is filled in based on how closely its row resembles the other rows in the training set.
#Creating a separate dataframe for performing the KNN imputation
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler
Cabernet_Sauvignon_imputation1 = Cabernet_Sauvignon_imputation[['1_percent','5_percent','10_percent']]
imputer = KNNImputer(n_neighbors=5)
imputed_number_Cabernet_Sauvignon = pd.DataFrame(imputer.fit_transform(Cabernet_Sauvignon_imputation1), columns = Cabernet_Sauvignon_imputation1.columns)
# imputed_number_Cabernet_Sauvignon.sample(10)
imputed_number_Cabernet_Sauvignon.head()
print(get_percent_missing(imputed_number_Cabernet_Sauvignon))
column_name percent_missing 1_percent 1_percent 0.0 5_percent 5_percent 0.0 10_percent 10_percent 0.0
alcohol = Cabernet_Sauvignon["alcohol"]
imputed_mean = pd.concat([alcohol,imputed_number_Cabernet_Sauvignon])
imputed_mean.columns = ["Alcohol","1_Percent","5_Percent","10_Percent"]
imputed_mean.var()
  • 31. Alcohol 1.470385 1_Percent 1.470326 5_Percent 1.470391 10_Percent 1.470429 dtype: float64
The KNN based method showed very negligible variability. Therefore this method is acceptable for the current dataset.
Mean based Imputation with SimpleImputer
This works by calculating the mean/median of the non-missing values in a column and then replacing the missing values within each column separately and independently from the others. It can only be used with numeric data.
Cabernet_Sauvignon_imputation_mean = Cabernet_Sauvignon_imputation[['1_percent','5_percent','10_percent']]
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer( strategy='mean') #for median imputation replace 'mean' with 'median'
imp_mean.fit(Cabernet_Sauvignon_imputation_mean)
imputed_train_Cabernet_Sauvignon = imp_mean.transform(Cabernet_Sauvignon_imputation_mean)
imputed_mean = pd.DataFrame(imp_mean.fit_transform(Cabernet_Sauvignon_imputation_mean), columns = Cabernet_Sauvignon_imputation_mean.columns)
print(get_percent_missing(imputed_mean))
column_name percent_missing 1_percent 1_percent 0.0 5_percent 5_percent 0.0 10_percent 10_percent 0.0
alcohol = Cabernet_Sauvignon["alcohol"]
combined_mean = pd.concat([alcohol,imputed_mean])
combined_mean.mean()
0 10.587102 10_percent 10.588810 1_percent 10.586540 5_percent 10.581520 dtype: float64
combined_mean.var()
0 1.470385 10_percent 1.320797 1_percent 1.456402 5_percent 1.395375 dtype: float64
  • 32. Imputation 3
The mean-based imputation preserved the column means, but its variance decreased noticeably as more data was removed (about 1.32 vs 1.47 for the 10% case), which is a known drawback of mean imputation; it is still acceptable for the current dataset.
Imputation Using Multivariate Imputation by Chained Equations (MICE)
This type of imputation works by filling in the missing data multiple times. Multiple Imputations (MIs) are much better than a single imputation as they measure the uncertainty of the missing values in a better way. The chained equations approach is also very flexible and can handle variables of different data types (i.e., continuous or binary) as well as complexities such as bounds or survey skip patterns.
Cabernet_Sauvignon_imputation_mice = Cabernet_Sauvignon_imputation[['1_percent','5_percent','10_percent']]
print(get_percent_missing(Cabernet_Sauvignon_imputation_mice))
column_name percent_missing 1_percent 1_percent 1.000435 5_percent 5_percent 5.002175 10_percent 10_percent 10.004350
!pip install impyute
from impyute.imputation.cs import mice
# start the MICE training
imputed_training=mice(Cabernet_Sauvignon_imputation_mice.values)
(pip output: impyute and its dependencies already satisfied.)
imputed_training = pd.DataFrame(imputed_training)
imputed_training.columns = ("1_percent","5_percent","10_percent")
# imputed_mice = pd.DataFrame(imputed_training.fit_transform(Cabernet_Sauvignon_imputation_mice))
print(get_percent_missing(imputed_training))
column_name percent_missing 1_percent 1_percent 0.0 5_percent 5_percent 0.0 10_percent 10_percent 0.0
  • 33. alcohol = Cabernet_Sauvignon["alcohol"]
combined_mice = pd.concat([alcohol,imputed_training])
combined_mice.columns = ["Alcohol","1_Percent","5_Percent","10_Percent"]
combined_mice.mean()
Alcohol 10.587102 1_Percent 10.586915 5_Percent 10.587098 10_Percent 10.586915 dtype: float64
combined_mice.var()
Alcohol 1.470385 1_Percent 1.467981 5_Percent 1.470375 10_Percent 1.467981 dtype: float64
The MICE method showed very negligible variability. Therefore this method is acceptable for the current dataset.
AutoML
#Install AutoML library - PyCaret
!pip install pycaret
  • 34. (pip install output for pycaret omitted: a long list of "Requirement already satisfied" dependency lines.)
  • 35. (pip dependency output continued; omitted.)
from scipy import stats
# import math
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
#Reading Data
Chateau_Montelena_AutoML = pd.read_csv('https://raw.githubusercontent.com/Mohanmanjuna
Chateau_Montelena_AutoMLM = Chateau_Montelena_AutoML.copy()
Chateau_Montelena_AutoMLB = Chateau_Montelena_AutoML.copy()
Each row represents a wine; each column contains the wine's attributes, such as type, sulphates, chlorides etc., plus the target label 'quality'.
Problem Statement
Binary Classification: Predict the quality of wine, i.e. Low or High.
Multiclass Classification: Predict the quality of wine, i.e. Low, Medium, High.
Regression: Predict the quality of wine between 3-9 based on the independent predictor variables.
Dataset - Wine Quality
Chateau_Montelena_AutoML.describe()
  • 36. (pip output: installing pycaret downgrades numpy to 1.19.5; pip's dependency resolver reports that tensorflow, jax, jaxlib, cmdstanpy and en-core-web-sm expect newer numpy/spacy versions and are incompatible with the downgraded packages.)
Successfully installed numpy-1.19.5
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide
count 6487.000000 6489.000000 6494.000000 6495.000000 6495.000000 6497.000000 6497.0
mean 7.216579 0.339691 0.318722 5.444326 0.056042 30.525319 115.7
std 1.296750 0.164649 0.145265 4.758125 0.035036 17.749400 56.5
min 3.800000 0.080000 0.000000 0.600000 0.009000 1.000000 6.0
25% 6.400000 0.230000 0.250000 1.800000 0.038000 17.000000 77.0
50% 7.000000 0.290000 0.310000 3.000000 0.047000 29.000000 118.0
75% 7.700000 0.400000 0.390000 8.100000 0.065000 41.000000 156.0
max 15.900000 1.580000 1.660000 65.800000 0.611000 289.000000 440.0
Dataset Shape: (6497, 13)
Name dtypes Missing Uniques Sample Value Entropy
0 type object 0 2 white 0.24
1 fixed acidity float64 10 106 7.0 1.65
2 volatile acidity float64 8 187 0.27 1.79
3 citric acid float64 3 89 0.36 1.70
4 residual sugar float64 2 316 20.7 2.08
5 chlorides float64 2 214 0.045 1.90
6 free sulfur dioxide float64 0 135 45.0 1.82
7 total sulfur dioxide float64 0 276 170.0 2.32
8 density float64 0 998 1.001 2.70
9 pH float64 9 108 3.0 1.81
10 sulphates float64 4 111 0.45 1.72
11 alcohol float64 0 111 8.8 1.66
12 quality int64 0 7 6 0.55
def tableinfo(Chateau_Montelena_AutoML):
    print(f"Dataset Shape: {Chateau_Montelena_AutoML.shape}")
    summary = pd.DataFrame(Chateau_Montelena_AutoML.dtypes,columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name','dtypes']]
    summary['Missing'] = Chateau_Montelena_AutoML.isnull().sum().values
    summary['Uniques'] = Chateau_Montelena_AutoML.nunique().values
    summary['Sample Value'] = Chateau_Montelena_AutoML.loc[0].values
    for name in summary['Name'].value_counts().index:
        summary.loc[summary['Name'] == name, 'Entropy'] = round(stats.entropy(Chateau_Montelena_AutoML[name].value_counts(normalize=True), base=10), 2)
    return summary
tableinfo(Chateau_Montelena_AutoML)
Entropy is a measure of the randomness or disorder of the information being processed.
Actions required for data preparation:
Converting 'type' to an integer data type.
Encoding categorical features.
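The Entropy column can be reproduced directly from the value counts. A small worked check (an addition for clarity): the 0.24 for type and 0.55 for quality in the table above are consistent with Shannon entropy computed in base 10 over the normalized value counts.

from scipy import stats

for col in ["type", "quality"]:
    p = Chateau_Montelena_AutoML[col].value_counts(normalize=True)
    # expected roughly 0.24 for type and 0.55 for quality, matching the table above
    print(col, round(stats.entropy(p, base=10), 2))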
print("Quality(%):")
print(round(Chateau_Montelena_AutoML['quality'].value_counts(normalize=True) * 100, 2))

Quality(%):
6    43.65
5    32.91
7    16.61
4     3.32
8     2.97
3     0.46
9     0.08
Name: quality, dtype: float64

Chateau_Montelena_AutoML['type'] = Chateau_Montelena_AutoML['type'].astype("category")
Chateau_Montelena_AutoML_copy = Chateau_Montelena_AutoML.copy()

Chateau_Montelena_AutoML.info()

(Note: the info() output below reports type as int8, so the categorical codes were evidently taken as well, e.g. via LabelEncoder or .cat.codes; .astype("category") on its own would report dtype category.)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   type                  6497 non-null   int8
 1   fixed acidity         6487 non-null   float64
 2   volatile acidity      6489 non-null   float64
 3   citric acid           6494 non-null   float64
 4   residual sugar        6495 non-null   float64
 5   chlorides             6495 non-null   float64
 6   free sulfur dioxide   6497 non-null   float64
 7   total sulfur dioxide  6497 non-null   float64
 8   density               6497 non-null   float64
 9   pH                    6488 non-null   float64
 10  sulphates             6493 non-null   float64
 11  alcohol               6497 non-null   float64
 12  quality               6497 non-null   int64
dtypes: float64(11), int64(1), int8(1)
memory usage: 615.6 KB

Analyzing the numeric features

plot, ax = plt.subplots(4, 3, figsize=(35, 20))
g = sns.histplot(Chateau_Montelena_AutoML['type'], kde=True, ax=ax[0][0])
g = sns.histplot(Chateau_Montelena_AutoML['fixed acidity'], kde=True, ax=ax[0][1])
g = sns.histplot(Chateau_Montelena_AutoML['volatile acidity'], kde=True, ax=ax[0][2])
g = sns.histplot(Chateau_Montelena_AutoML['citric acid'], kde=True, ax=ax[1][0])
g = sns.histplot(Chateau_Montelena_AutoML['residual sugar'], kde=True, ax=ax[1][1])
g = sns.histplot(Chateau_Montelena_AutoML['chlorides'], kde=True, ax=ax[1][2])
g = sns.histplot(Chateau_Montelena_AutoML['density'], kde=True, ax=ax[2][0])
g = sns.histplot(Chateau_Montelena_AutoML['pH'], kde=True, ax=ax[2][1])
g = sns.histplot(Chateau_Montelena_AutoML['sulphates'], kde=True, ax=ax[2][2])
g = sns.histplot(Chateau_Montelena_AutoML['alcohol'], kde=True, ax=ax[3][0])

Observation: none of these numeric features follows a normal distribution. Most are right-skewed, and several show more than one separate peak, which points to distinct sub-populations (for example, red and white wines) mixed together in the data.
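That observation can be backed with numbers. A minimal sketch (the 0.5 cut-off is an illustrative choice, not something used in the original notebook):

# Skewness per numeric column; values well away from 0 indicate non-normal, skewed features.
numeric_cols = Chateau_Montelena_AutoML.select_dtypes('number').columns
skewness = Chateau_Montelena_AutoML[numeric_cols].skew().sort_values(ascending=False)
print(skewness[skewness.abs() > 0.5])  # clearly skewed features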
Action: scale the data. Since many algorithms work best with (approximately) Gaussian, comparably scaled inputs, these features are rescaled with min-max normalization.

   type  fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density
0     1            7.0              0.27         0.36            20.7      0.045                 45.0                 170.0   1.0010
1     1            6.3              0.30         0.34             1.6      0.049                 14.0                 132.0   0.9940
2     1            8.1              0.28         0.40             6.9      0.050                 30.0                  97.0   0.9951
3     1            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956
4     1            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956

Chateau_Montelena_AutoML.head()

Outliers

from sklearn.preprocessing import MinMaxScaler, StandardScaler

mms = MinMaxScaler()  # Normalization
# cust_dummies = pd.get_dummies(cust)
# (The assignments below are truncated in the export; each one min-max scales a single
#  column, passed as a one-column frame so the scaler receives 2-D input.)
Chateau_Montelena_AutoML_copy['type'] = mms.fit_transform(Chateau_Montelena_AutoML_copy[['type']])
Chateau_Montelena_AutoML_copy['fixed acidity'] = mms.fit_transform(Chateau_Montelena_AutoML_copy[['fixed acidity']])
Chateau_Montelena_AutoML_copy['volatile acidity'] = mms.fit_transform(Chateau_Montelena_AutoML_copy[['volatile acidity']])
Chateau_Montelena_AutoML_copy['citric acid'] = mms.fit_transform(Chateau_Montelena_AutoML_copy[['citric acid']])
Chateau_Montelena_AutoML_copy['residual sugar'] = mms.fit_transform(Chateau_Montelena_AutoML_copy[['residual sugar']])
Chateau_Montelena_AutoML_copy['chlorides'] = mms.fit_transform(Chateau_Montelena_AutoML_copy[['chlorides']])
# Chateau_Montelena_AutoML_copy['free sulphur dioxide'] = mms.fit_transform(Chateau_Montelena_AutoML_copy[['free sulphur dioxide']])
# Chateau_Montelena_AutoML_copy['total sulphur dioxide'] = mms.fit_transform(Chateau_Montelena_AutoML_copy[['total sulphur dioxide']])
Chateau_Montelena_AutoML_copy['density'] = mms.fit_transform(Chateau_Montelena_AutoML_copy[['density']])
Chateau_Montelena_AutoML_copy['pH'] = mms.fit_transform(Chateau_Montelena_AutoML_copy[['pH']])
Chateau_Montelena_AutoML_copy['sulphates'] = mms.fit_transform(Chateau_Montelena_AutoML_copy[['sulphates']])
Chateau_Montelena_AutoML_copy['alcohol'] = mms.fit_transform(Chateau_Montelena_AutoML_copy[['alcohol']])

plt.figure(figsize=(16, 4))
# (The column list is also truncated in the export; it is assumed to be the scaled columns.)
sns.boxplot(data=Chateau_Montelena_AutoML_copy[['type', 'fixed acidity', 'volatile acidity',
                                                'citric acid', 'residual sugar', 'chlorides',
                                                'density', 'pH', 'sulphates', 'alcohol']])
<AxesSubplot:>

Observation: there are values beyond the upper and lower whiskers of the box plots (more than 1.5 x the inter-quartile range outside the quartiles), i.e. the data contain outliers.
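A small sketch of how those outliers could be counted with the same 1.5 x IQR rule (illustrative only; the notebook itself later delegates outlier removal to PyCaret's remove_outliers option):

# Count, per numeric column, the values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
numeric = Chateau_Montelena_AutoML_copy.select_dtypes('number')
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
outlier_mask = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
print(outlier_mask.sum().sort_values(ascending=False))  # number of outliers per column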
Multicollinearity

<AxesSubplot:>

plt.figure(figsize=(24, 8))
corr = Chateau_Montelena_AutoML_copy.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, cmap='RdYlGn')

Observation: the correlation matrix yields the following insights:
- volatile acidity and chlorides are strongly negatively correlated with type,
- alcohol is strongly negatively correlated with density,
- total sulfur dioxide is strongly positively correlated with type.

Action: drop some of these highly correlated variables (in the PyCaret setup below this is handled by remove_multicollinearity; a manual sketch follows).
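A manual version of that action might look like this sketch (the drop_correlated helper and the 0.9 cut-off are illustrative; 0.9 happens to match the multicollinearity threshold PyCaret reports in the setup below):

# Drop one feature from every pair whose absolute correlation exceeds the threshold.
def drop_correlated(df, threshold=0.9):
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Example: keep the target out of the check, then drop the collinear predictors.
reduced, dropped = drop_correlated(Chateau_Montelena_AutoML_copy.drop(columns=['quality']))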
Target variable = quality (ordinal, 3 to 9): regression

!pip install numba==0.53
(output abridged: pip downloads numba 0.53.0 and llvmlite 0.36.0, uninstalls llvmlite 0.37.0 and numba 0.54.1, and reports "Successfully installed llvmlite-0.36.0 numba-0.53.0")

from pycaret.regression import *

s = setup(Chateau_Montelena_AutoML, target='quality',
          train_size=0.8,
          normalize=True, normalize_method='minmax',
          remove_multicollinearity=True,
          remove_outliers=True,
          fold=5,
          silent=True)

Setup summary (Description / Value):

 0  session_id                               6943
 1  Target                                   quality
 2  Original Data                            (6497, 13)
 3  Missing Values                           True
 4  Numeric Features                         12
 5  Categorical Features                     0
 6  Ordinal Features                         False
 7  High Cardinality Features                False
 8  High Cardinality Method                  None
 9  Transformed Train Set                    (4937, 12)
10  Transformed Test Set                     (1300, 12)
11  Shuffle Train-Test                       True
12  Stratify Train-Test                      False
13  Fold Generator                           KFold
14  Fold Number                              5
15  CPU Jobs                                 -1
16  Use GPU                                  False
17  Log Experiment                           False
18  Experiment Name                          reg-default-name
19  USI                                      900b
20  Imputation Type                          simple
21  Iterative Imputation Iteration           None
22  Numeric Imputer                          mean
23  Iterative Imputation Numeric Model       None
24  Categorical Imputer                      constant
25  Iterative Imputation Categorical Model   None
26  Unknown Categoricals Handling            least_frequent
27  Normalize                                True
28  Normalize Method                         minmax
29  Transformation                           False
30  Transformation Method                    None
31  PCA                                      False
32  PCA Method                               None
33  PCA Components                           None
34  Ignore Low Variance                      False
35  Combine Rare Levels                      False
36  Rare Level Threshold                     None
37  Numeric Binning                          False
38  Remove Outliers                          True
39  Outliers Threshold                       0.05
40  Remove Multicollinearity                 True
41  Multicollinearity Threshold              0.9
42  Remove Perfect Collinearity              True
43  Clustering                               False
44  Clustering Iteration                     None
45  Polynomial Features                      False
46  Polynomial Degree                        None
47  Trignometry Features                     False
48  Polynomial Threshold                     None
49  Group Features                           False
50  Feature Selection                        False
51  Feature Selection Method                 classic
52  Features Selection Threshold             None
53  Feature Interaction                      False
54  Feature Ratio                            False
55  Interaction Threshold                    None
56  Transform Target                         False
57  Transform Target Method                  box-cox

          Model                            MAE     MSE     RMSE    R2       RMSLE   MAPE    TT (Sec)
et        Extra Trees Regressor            0.3974  0.3534  0.5941   0.5312  0.0890  0.0710  1.150
rf        Random Forest Regressor          0.4454  0.3757  0.6124   0.5018  0.0916  0.0793  2.436
lightgbm  Light Gradient Boosting Machine  0.4847  0.4085  0.6388   0.4577  0.0951  0.0857  0.190
xgboost   Extreme Gradient Boosting        0.4631  0.4104  0.6404   0.4548  0.0955  0.0821  0.590
gbr       Gradient Boosting Regressor      0.5298  0.4610  0.6786   0.3880  0.1006  0.0934  1.008
knn       K Neighbors Regressor            0.5362  0.5059  0.7111   0.3280  0.1055  0.0950  0.082
ada       AdaBoost Regressor               0.5725  0.5243  0.7235   0.3048  0.1074  0.1015  0.600
lr        Linear Regression                0.5643  0.5288  0.7268   0.2982  0.1074  0.0995  0.588
lar       Least Angle Regression           0.5643  0.5288  0.7268   0.2982  0.1074  0.0995  0.012
br        Bayesian Ridge                   0.5645  0.5289  0.7269   0.2981  0.1074  0.0995  0.012
ridge     Ridge Regression                 0.5652  0.5296  0.7273   0.2972  0.1075  0.0996  0.010
huber     Huber Regressor                  0.5636  0.5301  0.7277   0.2965  0.1074  0.0990  0.102
omp       Orthogonal Matching Pursuit      0.6133  0.5987  0.7733   0.2056  0.1145  0.1086  0.010
dt        Decision Tree Regressor          0.5058  0.7132  0.8440   0.0484  0.1252  0.0889  0.046
lasso     Lasso Regression                 0.6772  0.7537  0.8678  -0.0004  0.1277  0.1203  0.012
en        Elastic Net                      0.6772  0.7537  0.8678  -0.0004  0.1277  0.1203  0.014
llar      Lasso Least Angle Regression     0.6772  0.7537  0.8678  -0.0004  0.1277  0.1203  0.012
dummy     Dummy Regressor                  0.6772  0.7537  0.8678  -0.0004  0.1277  0.1203  0.014
par       Passive Aggressive Regressor     0.8006  0.9957  0.9905  -0.3256  0.1469  0.1372  0.014

best = compare_models()

Tuning the best model, i.e. the Extra Trees Regressor
       MAE     MSE     RMSE    R2      RMSLE   MAPE
Fold
0      0.5600  0.4734  0.6881  0.3261  0.1013  0.0982
1      0.6064  0.5845  0.7645  0.2842  0.1133  0.1078
2      0.5680  0.5103  0.7144  0.3331  0.1063  0.1010
3      0.5849  0.5351  0.7315  0.3088  0.1100  0.1047
4      0.5651  0.4953  0.7038  0.3008  0.1029  0.0985
Mean   0.5769  0.5197  0.7204  0.3106  0.1068  0.1020
Std    0.0170  0.0381  0.0262  0.0176  0.0044  0.0037

Tuned estimator: ExtraTreesRegressor(max_depth=9, max_features=1.0, min_impurity_decrease=0.002, min_samples_leaf=3, min_samples_split=5, n_estimators=210, random_state=6943).

tuned_model = tune_model(best)

# Creating models
lightgbm = create_model('lightgbm');
et = create_model('et');
rf = create_model('rf');

# Blending the top 3 models
blend = blend_models(estimator_list=[lightgbm, et, rf])
       MAE     MSE     RMSE    R2      RMSLE   MAPE
Fold
0      0.4346  0.3397  0.5828  0.5165  0.0863  0.0763
1      0.4596  0.4069  0.6379  0.5017  0.0952  0.0819
2      0.4237  0.3473  0.5893  0.5462  0.0889  0.0761
3      0.4418  0.3806  0.6169  0.5084  0.0937  0.0797
4      0.4261  0.3356  0.5793  0.5262  0.0858  0.0747
Mean   0.4372  0.3620  0.6012  0.5198  0.0900  0.0777
Std    0.0129  0.0275  0.0226  0.0155  0.0038  0.0026

(The blend is a VotingRegressor over the LightGBM, Extra Trees and Random Forest regressors with equal weights.)

plot_model(estimator=tuned_model, plot='feature')
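Before interpreting the model, hold-out performance and final predictions could be obtained with PyCaret's standard calls (a minimal sketch; these steps are not shown in the original notebook):

# Score the blended model on the 20% hold-out set created by setup().
holdout_predictions = predict_model(blend)

# Refit on the full dataset and predict on data with the same schema.
final_model = finalize_model(blend)
new_predictions = predict_model(final_model, data=Chateau_Montelena_AutoML.drop(columns=['quality']))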
interpret_model(tuned_model)
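interpret_model draws a SHAP summary plot under the hood. The direct SHAP equivalent would look roughly like this sketch (the use of get_config('X_train') to fetch the transformed training features is an assumption about the PyCaret 2.x API, not something done in the original notebook):

import shap

explainer = shap.TreeExplainer(tuned_model)      # tree explainer for the tuned Extra Trees model
X_train = get_config('X_train')                  # transformed training features from the setup pipeline
shap_values = explainer.shap_values(X_train)
shap.summary_plot(shap_values, X_train)          # global feature-impact plot, as in interpret_model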
plot_model(estimator=tuned_model, plot='residuals')

Observation: the residuals are evenly distributed around the fitted line, so the model fits reasonably well.

Target variable = quality, binned into Low or High: binary classification

from pycaret.classification import *

Categorization of quality

quality_mapping = {3: 'Low', 4: 'Low', 5: 'Low', 6: 'High', 7: 'High', 8: 'High', 9: 'High'}
# (The dictionary is cut off in the export after 8: 'High'; mapping 9 to 'High' as well
#  reproduces the 63.31% / 36.69% split printed below.)
Chateau_Montelena_AutoMLB['quality'] = Chateau_Montelena_AutoMLB['quality'].map(quality_mapping)

print("Wine Quality(%):")
print(round(Chateau_Montelena_AutoMLB['quality'].value_counts(normalize=True) * 100, 2))
Wine Quality(%):
High    63.31
Low     36.69
Name: quality, dtype: float64

Classifier Setup

clfb = setup(data=Chateau_Montelena_AutoMLB, target='quality',
             # ignore_features = ['customerID'],
             train_size=0.8,
             normalize=True, normalize_method='minmax',
             fix_imbalance=True,
             remove_multicollinearity=True,
             remove_outliers=True,
             fold=5,
             silent=True)
Setup summary (condensed: apart from the rows listed here, the grid repeats the regression setup, with all other preprocessing flags at the same values):

session_id                4967
Target / Target Type      quality / Binary
Label Encoded             High: 0, Low: 1
Original Data             (6497, 13)
Missing Values            True
Numeric Features          11
Categorical Features      1
Transformed Train Set     (4937, 12)
Transformed Test Set      (1300, 12)
Fold Generator            StratifiedKFold (5 folds)
Normalize                 True (minmax)
Remove Outliers           True (threshold 0.05)
Remove Multicollinearity  True (threshold 0.9)
Fix Imbalance             True (method: SMOTE)
Experiment Name / USI     clf-default-name / 3508
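fix_imbalance=True makes PyCaret oversample the minority class with SMOTE inside each training fold. Stand-alone, the same idea looks roughly like this sketch (imblearn is the library PyCaret relies on; X and y are placeholders for the preprocessed features and the encoded binary target, not objects defined in this notebook):

from imblearn.over_sampling import SMOTE

# Synthesize new minority-class samples until both classes are equally represented.
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print(y_resampled.value_counts())  # both classes now the same size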
Evaluation Metrics

PyCaret reports the following metrics when comparing model performance in compare_models():

A Confusion Matrix is a performance summary for classification problems whose output can be two or more classes: a table of the four combinations of predicted and actual values.

AUC, the Area Under the ROC Curve, summarizes the ROC plot in a single score that can be used to compare models. A no-skill classifier scores 0.5; a perfect classifier scores 1.0.

Accuracy is the fraction of correct predictions among all predictions: Accuracy = Correct Predictions / Total Predictions.

Precision is the fraction of examples assigned to the positive class that truly belong to it: Precision = TruePositive / (TruePositive + FalsePositive).

Recall summarizes how well the positive class was predicted: Recall = TruePositive / (TruePositive + FalseNegative).

F1 score (F-Measure) is the harmonic mean of precision and recall, a single score that balances both concerns: F1 = (2 * Precision * Recall) / (Precision + Recall).

Cohen's Kappa measures the level of agreement between two raters who each classify items into mutually exclusive categories, here the model and the ground truth: Kappa = (observed agreement - chance agreement) / (1 - chance agreement).

MCC produces a high score only if the predictions do well in all four confusion-matrix categories (true positives, false negatives, true negatives and false positives), in proportion to the sizes of the positive and negative classes in the dataset.
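As a concrete reference for these definitions, the same metrics can be computed directly with scikit-learn from a vector of true labels and predictions (a minimal sketch with made-up arrays, independent of the PyCaret pipeline):

from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             cohen_kappa_score, matthews_corrcoef, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # illustrative labels, e.g. 1 = Low, 0 = High
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))              # [[TN, FP], [FN, TP]]
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("Kappa    :", cohen_kappa_score(y_true, y_pred))
print("MCC      :", matthews_corrcoef(y_true, y_pred))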
Searching for the best models

Model Comparison & Evaluation

best_modelB = compare_models()

          Model                            Accuracy  AUC     Recall  Prec.   F1      Kappa   MCC     TT (Sec)
et        Extra Trees Classifier           0.8232    0.9011  0.7532  0.7558  0.7539  0.6160  0.6166  0.432
rf        Random Forest Classifier         0.8209    0.8940  0.7623  0.7463  0.7539  0.6132  0.6136  1.092
xgboost   Extreme Gradient Boosting        0.8112    0.8668  0.7392  0.7377  0.7380  0.5905  0.5909  0.978
lightgbm  Light Gradient Boosting Machine  0.8009    0.8680  0.7538  0.7114  0.7314  0.5735  0.5748  0.208
gbc       Gradient Boosting Classifier     0.7582    0.8375  0.7499  0.6405  0.6905  0.4942  0.4987  0.836
dt        Decision Tree Classifier         0.7559    0.7384  0.6761  0.6556  0.6655  0.4735  0.4737  0.104
knn       K Neighbors Classifier           0.7379    0.8094  0.7386  0.6124  0.6695  0.4555  0.4611  0.120
ada       Ada Boost Classifier             0.7377    0.8115  0.7442  0.6116  0.6712  0.4566  0.4629  0.252
lda       Linear Discriminant Analysis     0.7284    0.8077  0.7662  0.5952  0.6697  0.4452  0.4558  0.042
ridge     Ridge Classifier                 0.7249    0.0000  0.7600  0.5920  0.6653  0.4380  0.4482  0.054
lr        Logistic Regression              0.7223    0.8052  0.7532  0.5896  0.6611  0.4319  0.4415  0.054
qda       Quadratic Discriminant Analysis  0.7203    0.7995  0.7386  0.5890  0.6550  0.4249  0.4329  0.040

Hyperparameter Tuning

tuned_modelB = tune_model(best_modelB)
       Accuracy  AUC     Recall  Prec.   F1      Kappa   MCC
Fold
0      0.7611    0.8567  0.8085  0.6308  0.7086  0.5114  0.5227
1      0.7520    0.8351  0.7690  0.6261  0.6903  0.4871  0.4943
2      0.7700    0.8457  0.8113  0.6429  0.7173  0.5278  0.5380
3      0.7021    0.8110  0.8028  0.5599  0.6597  0.4095  0.4306
4      0.7427    0.8236  0.7493  0.6172  0.6768  0.4663  0.4724
Mean   0.7456    0.8344  0.7882  0.6154  0.6906  0.4804  0.4916
Std    0.0236    0.0160  0.0246  0.0289  0.0209  0.0412  0.0380

Tuned estimator: ExtraTreesClassifier(criterion='entropy', max_depth=11, max_features='log2', min_impurity_decrease=0.0001, min_samples_leaf=5, min_samples_split=9, n_estimators=180, random_state=4967).

We will use LightGBM, the Extra Trees Classifier and the Random Forest Classifier here, as these perform best.

Creating a model

# Creating models
lightgbmB = create_model('lightgbm');
etB = create_model('et');
rfB = create_model('rf');

# Blending the top 3 models
blendB = blend_models(estimator_list=[lightgbmB, etB, rfB])
       Accuracy  AUC     Recall  Prec.   F1      Kappa   MCC
Fold
0      0.8451    0.9127  0.8000  0.7760  0.7878  0.6659  0.6661
1      0.8148    0.8971  0.7296  0.7486  0.7389  0.5955  0.5956
2      0.8470    0.9021  0.7859  0.7881  0.7870  0.6677  0.6677
3      0.8024    0.8846  0.7859  0.7010  0.7410  0.5822  0.5847
4      0.8126    0.8924  0.7296  0.7443  0.7368  0.5913  0.5914
Mean   0.8244    0.8978  0.7662  0.7516  0.7583  0.6205  0.6211
Std    0.0182    0.0094  0.0303  0.0302  0.0238  0.0380  0.0376

(The blend is a soft-voting VotingClassifier over the LightGBM, Extra Trees and Random Forest classifiers with equal weights.)

plot_model(estimator=tuned_modelB, plot='feature')
# Plotting the confusion matrix
plot_model(estimator=tuned_modelB, plot='confusion_matrix')

Observation: the confusion matrix shows a strong diagonal, indicating mostly correct predictions.

# Plotting the decision boundary
plot_model(estimator=tuned_modelB, plot='boundary', use_train_data=True)
Observation: the classes are well separated, with only a few misclassifications.

plot_model(tuned_modelB, plot='parameter')
Parameters of the tuned Extra Trees Classifier:

bootstrap                 False
ccp_alpha                 0.0
class_weight              {}
criterion                 entropy
max_depth                 11
max_features              log2
max_leaf_nodes            None
max_samples               None
min_impurity_decrease     0.0001
min_impurity_split        None
min_samples_leaf          5
min_samples_split         9
min_weight_fraction_leaf  0.0
n_estimators              180
n_jobs                    -1
oob_score                 False
random_state              4967
verbose                   0
warm_start                False

# Plotting the area under the ROC curve
plot_model(estimator=tuned_modelB, plot='auc')

interpret_model(tuned_modelB)
Target variable = quality, binned into Low, Medium or High: multiclass classification

# from pycaret.classification import *

Classification of quality

quality_mappingM = {3: 'Low', 4: 'Low', 5: 'Medium', 6: 'Medium', 7: 'Medium', 8: 'High', 9: 'High'}
# (The dictionary is cut off in the export after "8 :"; mapping 8 and 9 to 'High'
#  reproduces the 93.17% / 3.79% / 3.05% split printed below.)
Chateau_Montelena_AutoMLM['quality'] = Chateau_Montelena_AutoMLM['quality'].map(quality_mappingM)

Distribution

print("Wine Quality(%):")
print(round(Chateau_Montelena_AutoMLM['quality'].value_counts(normalize=True) * 100, 2))

Wine Quality(%):
Medium    93.17
Low        3.79
High       3.05
Name: quality, dtype: float64
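With about 93% of the samples in the Medium class, plain accuracy is a weak yardstick here: a classifier that always predicts Medium already reaches roughly 93%. A one-line check of that baseline (illustrative only, not part of the original notebook):

# Accuracy of always predicting the majority class 'Medium'.
baseline = (Chateau_Montelena_AutoMLM['quality'] == 'Medium').mean()
print(f"Majority-class baseline accuracy: {baseline:.4f}")   # ~0.93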
Setting the classifier

clfM = setup(data=Chateau_Montelena_AutoMLM, target='quality',
             # ignore_features = ['customerID'],
             train_size=0.8,
             normalize=True, normalize_method='minmax',
             fix_imbalance=True,
             remove_multicollinearity=True,
             remove_outliers=True,
             fold=5,
             silent=True)
Setup summary (condensed: apart from the rows listed here, the grid repeats the binary-classification setup, including Fix Imbalance True with SMOTE, Remove Outliers True at threshold 0.05 and Remove Multicollinearity True at threshold 0.9):

session_id             4450
Target / Target Type   quality / Multiclass
Label Encoded          High: 0, Low: 1, Medium: 2
Original Data          (6497, 13)
Transformed Train Set  (4937, 12)
Transformed Test Set   (1300, 12)
Fold Generator         StratifiedKFold (5 folds)
Experiment Name / USI  clf-default-name / 40d8

          Model                            Accuracy  AUC     Recall  Prec.   F1      Kappa   MCC     TT (Sec)
xgboost   Extreme Gradient Boosting        0.9299    0.7765  0.5413  0.9209  0.9243  0.3454  0.3514  5.532
lightgbm  Light Gradient Boosting Machine  0.9279    0.7702  0.5348  0.9196  0.9225  0.3327  0.3389  0.538
et        Extra Trees Classifier           0.9230    0.8402  0.5646  0.9195  0.9210  0.3475  0.3487  0.618
rf        Random Forest Classifier         0.9123    0.8244  0.5722  0.9166  0.9141  0.3233  0.3248  2.222
dt        Decision Tree Classifier         0.8404    0.6445  0.5569  0.9048  0.8679  0.1915  0.2112  0.136
gbc       Gradient Boosting Classifier     0.7727    0.7342  0.6042  0.9068  0.8254  0.1643  0.2067  9.140
knn       K Neighbors Classifier           0.7432    0.7225  0.6320  0.9112  0.8064  0.1613  0.2160  0.180
ada       Ada Boost Classifier             0.5345    0.5782  0.5922  0.9011  0.6462  0.0730  0.1325  1.000
qda       Quadratic Discriminant Analysis  0.4950    0.6411  0.5851  0.9010  0.6113  0.0646  0.1249  0.052
lda       Linear Discriminant Analysis     0.4857    0.7076  0.6144  0.9079  0.6017  0.0735  0.1446  0.038
lr        Logistic Regression              0.4794    0.7101  0.6330  0.9118  0.5952  0.0780  0.1556  0.562
ridge     Ridge Classifier                 0.4132    0.0000  0.6236  0.9116  0.5293  0.0650  0.1426  0.022
svm       SVM - Linear Kernel              0.3830    0.0000  0.6252  0.9121  0.4962  0.0613  0.1404  0.072
nb        Naive Bayes                      0.3721    0.6096  0.5746  0.9040  0.4885  0.0492  0.1140  0.022

best_modelM = compare_models()

LightGBM's F1 score (0.9225) is essentially tied with the best one (XGBoost, 0.9243) while training roughly ten times faster, making it the most practical of the top models.

tuned_modelM = tune_model(best_modelM)