The document analyzes wine quality prediction using machine learning models. It aims to predict wine quality, measured on an ordinal scale of 3 to 9, from a set of physico-chemical predictor variables. The report performs data cleaning and preprocessing steps such as handling missing data through mean imputation and normalizing variables, analyzes the distributions of the predictor variables (most are found to be approximately normal), and checks that their ranges make sense. The objective is to apply ML models to predict wine quality and to use AutoML and the SHAP library to analyze model performance and feature importance.
Wine_Quality_report.ipynb (Colaboratory)
Problem Statement
Although we are attempting to predict wine quality as a target for a number of wines from a given set of predictor variables, wine quality is a subjective measurement. This is an EDA, or data-driven story, including a range of graphs and images as well as an attribute-based quality forecast. The question is: what is the quality of the wine, in ordinal values from 3 to 9? It is a regression task.
Objective
Perform Data Cleaning, Pre-processing and Feature Selection
Apply ML models to predict wine quality
Use Auto-ML to determine the best model
Use SHAP library to determine the impact of the predictor variables
ML Data Cleaning and Feature Selection
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy.stats import norm
from scipy import stats
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from math import sqrt
from sklearn.metrics import r2_score
Cabernet Sauvignon is known as the king of red wines.
Cabernet_Sauvignon = pd.read_csv('https://raw.githubusercontent.com/Mohanmanjunatha/DA
Cabernet_Sauvignon.head()

   type  fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density
0  white           7.0              0.27         0.36            20.7      0.045                 45.0                 170.0   1.0010
1  white           6.3              0.30         0.34             1.6      0.049                 14.0                 132.0   0.9940
2  white           8.1              0.28         0.40             6.9      0.050                 30.0                  97.0   0.9951
3  white           7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956
4  white           7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956
(remaining columns truncated in this export)

Cabernet_Sauvignon.shape

(6497, 13)
What are the data types? (Only numeric and categorical)
Cabernet_Sauvignon.dtypes
type object
fixed acidity float64
volatile acidity float64
citric acid float64
residual sugar float64
chlorides float64
free sulfur dioxide float64
total sulfur dioxide float64
density float64
pH float64
sulphates float64
alcohol float64
quality int64
dtype: object
The dataset has 1 Categorical and 12 Numerical Features.
What features are in the dataset?

fixed acidity - Fixed acidity is due to the presence of non-volatile acids in wine, for example tartaric, citric or malic acid. These acids contribute to the balance of the wine's taste and bring freshness to it.

volatile acidity - Volatile acidity is the part of the acid in wine that can be picked up by the nose, unlike the acids that are palpable to the taste (discussed above). Volatile acidity, in other words the souring of wine, is one of the most common defects.

citric acid - Permitted in winemaking by OIV Resolution No. 23/2000. It can be used in three cases: for acid treatment of wine (increasing acidity), for wine stabilization, and for cleaning filters of possible fungal and mold contamination.

residual sugar - The grape sugar that has not been fermented into alcohol.

chlorides - The structure of a wine also depends on its mineral content, which determines taste sensations such as salinity (sapidità). Anions of inorganic acids (chlorides, sulfates, sulfites, ...), anions of organic acids, and metal cations (potassium, sodium, magnesium, ...) are found in wine. Their content depends mainly on the climatic zone (cold or warm region, salinity of the soils depending on proximity to the sea), oenological practices, and the storage and aging of the wine.

free sulfur dioxide, total sulfur dioxide - Sulfur dioxide (sulfur oxide, food additive E220, SO2) is used as a preservative due to its antioxidant and antimicrobial properties. Molecular SO2 is an extremely important antimicrobial, suppressing spoilage organisms (including wild yeasts) that can manifest themselves in wine spoilage.

density - The density of wine can be either less or more than that of water. Its value is determined primarily by the concentration of alcohol and sugar. White, rosé and red wines are generally light: their density at 20°C is below 998.3 kg/m3.

pH - A measure of the acidity of wine. All wines ideally have a pH level between 2.9 and 4.2. The lower the pH, the more acidic the wine; the higher the pH, the less acidic the wine.

sulphates - Sulfites are a natural result of yeast fermenting the sugar in wine into alcohol, so the presence of sulfites in wine can never be fully excluded.

alcohol - The alcohol content of a wine depends on many factors: the grape variety and the amount of sugar in the berries, production technology, and growing conditions. Wines vary greatly in strength: this parameter ranges from 4.5 to 22 percent depending on the category.

quality - The target variable.
Are there missing values?
Cabernet_Sauvignon.isna().sum()
type 0
fixed acidity 10
volatile acidity 8
citric acid 3
residual sugar 2
chlorides 2
free sulfur dioxide 0
total sulfur dioxide 0
density 0
pH 9
sulphates 4
alcohol 0
quality 0
dtype: int64
Which independent variables have missing data? How much?
fixed acidity - 10
volatile acidity - 8
citric acid - 3
residual sugar - 2
chlorides - 2
pH - 9
sulphates - 4
The above features have the respective numbers of missing values. Since the distributions are fairly symmetric, mean imputation is a reasonable choice.
Before examining the quality feature, the categorical variable will be mapped with the help of cat.codes. This makes the analysis easier and more comprehensible.
Cabernet_Sauvignon['type'] = Cabernet_Sauvignon['type'].astype("category").cat.codes
Cabernet_Sauvignon.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 type 6497 non-null int8
1 fixed acidity 6487 non-null float64
2 volatile acidity 6489 non-null float64
3 citric acid 6494 non-null float64
4 residual sugar 6495 non-null float64
5 chlorides 6495 non-null float64
6 free sulfur dioxide 6497 non-null float64
7 total sulfur dioxide 6497 non-null float64
8 density 6497 non-null float64
9 pH 6488 non-null float64
10 sulphates 6493 non-null float64
11 alcohol 6497 non-null float64
12 quality 6497 non-null int64
dtypes: float64(11), int64(1), int8(1)
memory usage: 615.6 KB
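The imputation cell itself is not visible in this export; a minimal sketch of the mean imputation described above, over the columns listed as having missing values:

# Mean imputation: replace each missing entry with the column mean
cols_with_missing = ['fixed acidity', 'volatile acidity', 'citric acid',
                     'residual sugar', 'chlorides', 'pH', 'sulphates']
for col in cols_with_missing:
    Cabernet_Sauvignon[col] = Cabernet_Sauvignon[col].fillna(Cabernet_Sauvignon[col].mean())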
After mean imputation, re-checking Cabernet_Sauvignon.isna().sum() reports zero missing values (the first rows of this output are not visible in the export):

total sulfur dioxide 0
density 0
pH 0
sulphates 0
alcohol 0
quality 0
dtype: int64
What are the likely distributions of the numeric variables? What are the distributions of the predictor variables?

In the plots below, the good fit of the overlaid normal curve indicates that normality is a reasonable approximation.
Distribution of Predictors
Cabernet_SauvignonColumnList = Cabernet_Sauvignon.columns
for i in Cabernet_SauvignonColumnList:
    plt.figure(figsize=(5, 5))
    sns.distplot(Cabernet_Sauvignon[i], fit=norm)
    plt.title(f"Distribution of {i} (checking normal distribution fit)", size=15, weight='bold')
type : categorical values
fixed acidity : normal distribution
volatile acidity : almost normal distribution with a bit of right-skewness
citric acid : almost normal distribution with a bit of edge-peak
residual sugar : almost normal distribution with a bit of right-skewness
chlorides : almost normal distribution with a bit of right-skewness
free sulfur dioxide : normal distribution
total sulfur dioxide : almost normal distribution with a bit of edge-peak
sulphates : normal distribution
alcohol : almost normal distribution with a bit of right-skewness
pH : normal distribution
density : normal distribution
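A quantile-quantile plot gives a sharper visual check of these normality claims than the overlaid density curve; a small sketch using scipy's probplot (stats is already imported), applied here to pH as an example:

# Q-Q plot of one predictor against a theoretical normal distribution
plt.figure(figsize=(5, 5))
stats.probplot(Cabernet_Sauvignon['pH'], dist='norm', plot=plt)
plt.title('Q-Q plot of pH against a normal distribution')
plt.show()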
Do the ranges of the predictor variables make sense?
#Range of each column
Cabernet_Sauvignon.max() - Cabernet_Sauvignon.min()

Cabernet_Sauvignon.describe()

              type  fixed acidity  volatile acidity  citric acid  residual sugar    chlorides  free sulfur dioxide
count  6497.000000    6497.000000       6497.000000  6497.000000     6497.000000  6497.000000          6497.000000
mean      0.753886       7.216501          0.339634     0.318675        5.445704     0.056041            30.525319
std       0.430779       1.295928          0.164563     0.145267        4.758043     0.035032            17.749400
min       0.000000       3.800000          0.080000     0.000000        0.600000     0.009000             1.000000
25%       1.000000       6.400000          0.230000     0.250000        1.800000     0.038000            17.000000
50%       1.000000       7.000000          0.290000     0.310000        3.000000     0.047000            29.000000
75%       1.000000       7.700000          0.400000     0.390000        8.100000     0.065000            41.000000
max       1.000000      15.900000          1.580000     1.660000       65.800000     0.611000           289.000000
(remaining columns truncated in this export)
The ranges make sense for each attribute of a wine. The range of the "total sulfur dioxide" variable is large, which implies high variability in its distribution.
Do the training and test sets have the same data?
Using train_test_split, the train and test sets are split at an 80/20 ratio from the same dataset, but the two sets are disjoint, and the test set is not seen by the model during the training phase. The distribution of each attribute is, however, proportional across the train and test sets.
Phase 1
Cabernet_Sauvignon_x = Cabernet_Sauvignon[['type','fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH','sulphates','alcohol']]
Cabernet_Sauvignon_y = Cabernet_Sauvignon['quality']
# equivalently: Cabernet_Sauvignon.iloc[:,:12], Cabernet_Sauvignon.iloc[:,-1]
Cabernet_Sauvignon_y.head()
0 6.0
1 6.0
2 6.0
3 6.0
4 6.0
Name: quality, dtype: float64
scaler = StandardScaler()
# #Dataframe Cabernet_Sauvignon with outliers
Cabernet_Sauvignon_x = scaler.fit_transform(Cabernet_Sauvignon_x)
plt.figure(figsize=(20,7))
ax = sns.boxplot(data=Cabernet_Sauvignon_x)
ax.set_xticklabels(Cabernet_SauvignonColumnList[:12])
(Output of model-fitting cells not visible in this export: scikit-learn emits a FutureWarning that LinearRegression's 'normalize' parameter is deprecated, suggesting a Pipeline with a StandardScaler preprocessing step, with sample_weight then passed as a fit parameter.)
Three metrics will be calculated to evaluate the predictions.
Mean Absolute Error (MAE) shows the average absolute difference between predictions and actual values.
Root Mean Square Error (RMSE) shows how accurately the model predicts the response.
R^2 will be calculated as a goodness-of-fit measure.
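The split and model-fitting cells for Phase 1 are not visible in this export; a hedged sketch of the 80/20 split described above plus the linear-regression metrics (named mae1, rmse1 and r21 to match the print-out further below; the random_state is an assumption):

# 80/20 train/test split on the scaled features (random_state assumed)
X_train, X_test, y_train, y_test = train_test_split(
    Cabernet_Sauvignon_x, Cabernet_Sauvignon_y, test_size=0.2, random_state=42)

# Fit the baseline linear regression and score it on the held-out test set
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)

mae1 = mean_absolute_error(y_test, lr_pred)
rmse1 = sqrt(mean_squared_error(y_test, lr_pred))
r21 = r2_score(y_test, lr_pred)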
plt.figure(figsize=(5, 7))
ax = sns.distplot(y_test, hist=False, color="r", label="Actual Value")
sns.distplot(lr_pred, hist=False, color="b", label="Fitted Values" , ax=ax)
plt.title('Actual(red) vs Fitted(blue) Values for Quality')
plt.show()
plt.close()
(seaborn emits FutureWarnings noting that distplot is deprecated.)
plt.figure(figsize=(5, 7))
ax = sns.distplot(y_test, hist=False, color="r", label="Actual Value")  # first lines of this cell are missing from the export; reconstructed to mirror the earlier plot cell
sns.distplot(Dt_pred, hist=False, color="b", label="Fitted Values", ax=ax)
plt.title('Actual(red) vs Fitted(blue) Values for Quality')
plt.show()
plt.close()
Phase 2
Are the predictor variables independent of all the other predictor variables?

Multicollinearity

Multicollinearity measures the relationships among the explanatory variables in a multiple regression. If multicollinearity occurs, the highly related input variables should be eliminated from the model.

In this kernel, multicollinearity is checked by plotting a correlation heatmap.
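The heatmap cell itself is not visible in this export; a minimal sketch of the correlation heatmap it describes:

# Correlation heatmap across all 13 columns ('type' is already numeric)
plt.figure(figsize=(12, 8))
sns.heatmap(Cabernet_Sauvignon.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation matrix of the wine attributes')
plt.show()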
Which independent variables are useful to predict the target (dependent variable)? (Use at least three methods.) For a regression model, the most useful independent variables can be statistically determined using the following methods:

f_regression
mutual_info_regression
Correlation Matrix with Heatmap

Each of these methods is applied to the dataset below.

1. f_regression
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression, mutual_info_regression
X = Cabernet_Sauvignon.iloc[:,0:12]
y = Cabernet_Sauvignon.iloc[:,-1]
# y=y.astype('int')
# y = pd.DataFrame(y)
# y.head(10)
# y.describe()
#Applying SelectKBest class to extract top features
# feature selection
f_selector = SelectKBest(score_func=f_regression, k='all')
# learn relationship from training data
f_selector.fit(X_train, y_train)
# transform train input data
X_train_fs = f_selector.transform(X_train)
# transform test input data
X_test_fs = f_selector.transform(X_test)
# Plot the scores for the features
plt.rcParams["figure.figsize"] = (30,10)
plt.bar([i for i in range(len(f_selector.scores_))], f_selector.scores_)
plt.xlabel("feature index")
plt.xticks([i for i in range(len(f_selector.scores_))], Cabernet_SauvignonColumnList[:12])
plt.ylabel("F-value (transformed from the correlation values)")
plt.show()
# bestFeatures = SelectKBest(score_func= chi2, k =12)
# fit = bestFeatures.fit(X,y)
We can see that volatile acidity, chlorides, density and alcohol have more importance than the others.

2. Mutual information metric
# feature selection
f_selector = SelectKBest(score_func=mutual_info_regression, k='all')
# learn relationship from training data
f_selector.fit(X_train, y_train)
# transform train input data
X_train_fs = f_selector.transform(X_train)
# transform test input data
X_test_fs = f_selector.transform(X_test)
# Plot the scores for the features
plt.bar([i for i in range(len(f_selector.scores_))], f_selector.scores_, align='center')
plt.xlabel("feature index")
plt.xticks([i for i in range(len(f_selector.scores_))], Cabernet_SauvignonColumnList[:12])
plt.ylabel("Estimated MI value")
# plt.rcParams["figure.figsize"] = (30,10)
plt.show()
By looking at the correlation matrix above we can gain the following insights:
1. volatile acidity and chlorides are highly (-ve) correlated with type.
2. alcohol is highly (-ve) correlated with density.
3. total sulfur dioxide is highly (+ve) correlated with type.

By looking at the three feature importance methods above, we can see that volatile acidity, chlorides, density and alcohol are the most important common features for predicting the value of quality.
Outlier Treatment
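The outlier-removal cells are not visible in this export; a hedged sketch of an IQR-based filter that would produce the Xclean/yclean splits used below (the 1.5*IQR rule, the variable names and the random_state are all assumptions):

# Drop any row that falls outside 1.5*IQR on any of the 12 predictors
features = pd.DataFrame(Cabernet_Sauvignon_x, columns=Cabernet_SauvignonColumnList[:12])
q1, q3 = features.quantile(0.25), features.quantile(0.75)
iqr = q3 - q1
keep = ~((features < (q1 - 1.5 * iqr)) | (features > (q3 + 1.5 * iqr))).any(axis=1)

Xclean = features[keep]
yclean = Cabernet_Sauvignon_y[keep.values]
Xclean_train, Xclean_test, yclean_train, yclean_test = train_test_split(
    Xclean, yclean, test_size=0.2, random_state=42)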
[Boxplot of the 12 standardized predictor columns after outlier treatment; x-axis labels: type, fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol]
##Linear Regression, Random Forest and Decision Tree on the outlier-treated data
# lr = LinearRegression(copy_X= True, fit_intercept = True, normalize = True)
lr.fit(Xclean_train, yclean_train)
lrclean_pred = lr.predict(Xclean_test)

# model2 = RandomForestRegressor(random_state=1, n_estimators=1000)
model2.fit(Xclean_train, yclean_train)
Rmclean_pred = model2.predict(Xclean_test)

# model3 (presumably the DecisionTreeRegressor from Phase 1; its definition is not visible in this export)
model3.fit(Xclean_train, yclean_train)
Dtclean_pred = model3.predict(Xclean_test)
(scikit-learn again emits the FutureWarning about LinearRegression's deprecated 'normalize' parameter.)
print('-------------Linear Regression-----------')
print('--Phase-1--')
print('MAE: %f'% mae1)
print('RMSE: %f'% rmse1)
print('R2: %f' % r21)
print('--Phase-2--')
print('MAE: %f'% mean_absolute_error(yclean_test, lrclean_pred))
print('RMSE: %f'% np.sqrt(mean_squared_error(yclean_test, lrclean_pred)))
print('R2: %f' % r2_score(yclean_test, lrclean_pred))
print('-------------Random forest-----------')
print('--Phase-1--')
print('MAE: %f'% mae2)
print('RMSE: %f'% rmse2)
print('R2: %f' % r22)
print('--Phase-2--')
print('MAE: %f'% mean_absolute_error(yclean_test, Rmclean_pred))
print('RMSE: %f'% np.sqrt(mean_squared_error(yclean_test, Rmclean_pred)))
print('R2: %f' % r2_score(yclean_test, Rmclean_pred))
print('-------------Decision Tree-----------')
print('--Phase-1--')
print('MAE: %f'% mae3)
print('RMSE: %f'% rmse3)
print('R2: %f' % r23)
print('--Phase-2--')
print('MAE: %f'% mean_absolute_error(yclean_test, Dtclean_pred))
print('RMSE: %f'% np.sqrt(mean_squared_error(yclean_test, Dtclean_pred)))
print('R2: %f' % r2_score(yclean_test, Dtclean_pred))
-------------Linear Regression-----------
--Phase-1--
MAE: 0.545152
RMSE: 0.686665
R2: 0.340363
--Phase-2--
MAE: 0.578749
RMSE: 0.748469
R2: 0.274277
-------------Random forest-----------
--Phase-1--
MAE: 0.401750
RMSE: 0.561165
R2: 0.559449
--Phase-2--
MAE: 0.438112
RMSE: 0.622107
R2: 0.498635
-------------Decision Tree-----------
--Phase-1--
MAE: 0.541020
RMSE: 0.696854
R2: 0.320642
--Phase-2--
MAE: 0.586013
RMSE: 0.756198
R2: 0.259211
The results show that the two phases have different prediction results, though Phase 1 and Phase 2 do not differ greatly on any metric. The MAE and RMSE values increase in Phase 2, which means the prediction error is higher in that phase, and the model's explanatory power has decreased by a negligible margin.
Remove outliers and keep outliers (does it have an effect on the final predictive model)? An MAE value of 0 indicates no error in the model, in other words a perfect prediction. The above results show that all predictions have noticeable error, especially in Phase 2. RMSE gives an idea of how much error the system typically makes in its predictions; the above results show that RMSE became worse after removing the outliers. R2 represents the proportion of the variance of the dependent variable that is explained by the independent variables.
Cabernet_Sauvignon_class = Cabernet_Sauvignon.copy()  # .copy() avoids mutating the original frame
Cabernet_Sauvignon_imputation = Cabernet_Sauvignon.copy()

quality_mapping = { 3 : 'Low', 4 : 'Low', 5 : 'Medium', 6 : 'Medium', 7 : 'Medium', 8 : 'High', 9 : 'High'}  # tail of the mapping completed from the Low/Medium/High classes reported below
Cabernet_Sauvignon_class['quality'] = Cabernet_Sauvignon_class['quality'].map(quality_mapping)

Cabernet_Sauvignon_class_x, Cabernet_Sauvignon_class_y = Cabernet_Sauvignon_class.iloc[:,:12], Cabernet_Sauvignon_class['quality']
Cabernet_Sauvignon_class_x = scaler.fit_transform(Cabernet_Sauvignon_class_x)

#Splitting the dataset, after binning quality into classes, into Train and Test sets at the same 80/20 ratio
Xclass_train, Xclass_test, yclass_train, yclass_test = train_test_split(Cabernet_Sauvignon_class_x, Cabernet_Sauvignon_class_y, test_size=0.2)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# creating a RF classifier
clf = RandomForestClassifier(n_estimators = 1000)
# Training the model on the training dataset
# fit function is used to train the model using the training sets as parameters
clf.fit(Xclass_train, yclass_train)
# performing predictions on the test dataset
yclass_pred = clf.predict(Xclass_test)
# metrics are used to find accuracy or error
from sklearn import metrics
print()
# using metrics module for accuracy calculation
print("ACCURACY OF THE MODEL: ", metrics.accuracy_score(yclass_test, yclass_pred))
print(classification_report(yclass_test, yclass_pred))
ACCURACY OF THE MODEL: 0.9456521739130435
precision recall f1-score support
High 1.00 0.34 0.51 38
Low 0.00 0.00 0.00 24
Medium 0.95 1.00 0.97 858
accuracy 0.95 920
macro avg 0.65 0.45 0.49 920
weighted avg 0.92 0.95 0.93 920
quality_mapping_again = { 'Low':0, 'Medium':1, 'High':2}
yclass_test = yclass_test.map(quality_mapping_again)
yclass_pred_new = [s.replace('Medium', '1') for s in yclass_pred]
yclass_pred_new = [s.replace('Low', '0') for s in yclass_pred_new]
yclass_pred_new = [s.replace('High', '2') for s in yclass_pred_new]
yclass_pred_new = [int(item) for item in yclass_pred_new]
plt.figure(figsize=(5, 7))
ax = sns.distplot(yclass_test, hist=False, color="r", label="Actual Value")
sns.distplot(yclass_pred_new, hist=False, color="b", label="Fitted Values" , ax=ax)
plt.title('Actual vs Fitted Values for Quality')
plt.show()
plt.close()
As we can see here, the accuracy of the classification model turned out to be much higher than any regression method used in Phase 1. It can be interpreted as follows: wine tastings are generally blind tastings, and even for the best wine connoisseurs it is very difficult to differentiate between a quality of 7 or 8. The quality of a wine, judged by how it tastes, is also highly subjective; often it is how the product is marketed and promoted that forms the general opinion of the targeted audience.

That being said, a good wine is a good wine. Based on the chemical composition of the wine itself, we can at least say whether it is a good or a bad one. So when a model is asked to place a wine in a category, it achieves much greater accuracy, because classifying into bins is easier than predicting a precise quality rating.
Data Imputation
Remove 1%, 5%, and 10% of your data randomly and impute the values back using at least 3 imputation methods. How well did the methods recover the missing values? That is: remove some data, check the % error on the residuals for numeric data, and check the bias and variance of the error.
Imputation 1
Cabernet_Sauvignon_imputation['1_percent'] = Cabernet_Sauvignon_imputation[['alcohol']
Cabernet_Sauvignon_imputation['5_percent'] = Cabernet_Sauvignon_imputation[['alcohol']
Cabernet_Sauvignon_imputation['10_percent'] = Cabernet_Sauvignon_imputation[['alcohol'
Cabernet_Sauvignon_imputation.head()
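The three cell lines above are truncated in this export; they copy the alcohol column into 1_percent, 5_percent and 10_percent columns and blank out the corresponding fraction of values. A hedged sketch of one way to do that (the sampling approach and random_state are assumptions):

# Blank out a random 1%, 5% and 10% of the copied alcohol values
for frac, col in [(0.01, '1_percent'), (0.05, '5_percent'), (0.10, '10_percent')]:
    s = Cabernet_Sauvignon_imputation['alcohol'].copy()
    drop_idx = s.sample(frac=frac, random_state=0).index  # rows to blank out
    s.loc[drop_idx] = np.nan
    Cabernet_Sauvignon_imputation[col] = s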
column_name percent_missing
quality quality 0.000000
1_percent 1_percent 1.000435
5_percent 5_percent 5.002175
10_percent 10_percent 10.004350
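The helper get_percent_missing is used throughout this section, but its definition is not visible in this export; a minimal sketch consistent with the printed column_name/percent_missing output:

def get_percent_missing(df):
    # Percentage of missing values per column, returned as a small summary frame
    percent_missing = df.isnull().sum() * 100 / len(df)
    return pd.DataFrame({'column_name': df.columns, 'percent_missing': percent_missing})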
# Store Index of NaN values in each coloumns
number_1_idx = list(np.where(Cabernet_Sauvignon_imputation['1_percent'].isna())[0])
number_5_idx = list(np.where(Cabernet_Sauvignon_imputation['5_percent'].isna())[0])
number_10_idx = list(np.where(Cabernet_Sauvignon_imputation['10_percent'].isna())[0])
print(f"Length of number_1_idx is {len(number_1_idx)} and it contains {(len(number_1_i
print(f"Length of number_5_idx is {len(number_5_idx)} and it contains {(len(number_5_i
print(f"Length of number_10_idx is {len(number_10_idx)} and it contains {(len(number_1
Length of number_1_idx is 46 and it contains 1.0004349717268377% of total data in
Length of number_5_idx is 230 and it contains 5.002174858634189% of total data in
Length of number_10_idx is 460 and it contains 10.004349717268378% of total data
Imputation 2
KNN Imputation. k-nearest neighbours is an algorithm used for simple classification. The algorithm uses 'feature similarity' to predict the values of new data points: a new point is assigned a value based on how closely it resembles the points in the training set.
#Creating a separate dataframe for performing the KNN imputation
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

Cabernet_Sauvignon_imputation1 = Cabernet_Sauvignon_imputation[['1_percent','5_percent','10_percent']]
imputer = KNNImputer(n_neighbors=5)
imputed_number_Cabernet_Sauvignon = pd.DataFrame(imputer.fit_transform(Cabernet_Sauvignon_imputation1), columns=['1_percent','5_percent','10_percent'])
# imputed_number_Cabernet_Sauvignon.sample(10)
imputed_number_Cabernet_Sauvignon.head()
print(get_percent_missing(imputed_number_Cabernet_Sauvignon))
column_name percent_missing
1_percent 1_percent 0.0
5_percent 5_percent 0.0
10_percent 10_percent 0.0
alcohol = Cabernet_Sauvignon["alcohol"]
imputed_mean = pd.concat([alcohol,imputed_number_Cabernet_Sauvignon])
imputed_mean.columns = ["Alcohol","1_Percent","5_Percent","10_Percent"]
imputed_mean.var()
Alcohol 1.470385
1_Percent 1.470326
5_Percent 1.470391
10_Percent 1.470429
dtype: float64
The KNN-based method showed very negligible variability; therefore this method is acceptable for the current dataset.
Mean-based Imputation with SimpleImputer. This works by calculating the mean/median of the non-missing values in a column and then replacing the missing values within each column separately and independently of the others. It can only be used with numeric data.
Cabernet_Sauvignon_imputation_mean = Cabernet_Sauvignon_imputation[['1_percent','5_percent','10_percent']]
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(strategy='mean') #for median imputation replace 'mean' with 'median'
imp_mean.fit(Cabernet_Sauvignon_imputation_mean)
imputed_train_Cabernet_Sauvignon = imp_mean.transform(Cabernet_Sauvignon_imputation_mean)
imputed_mean = pd.DataFrame(imp_mean.fit_transform(Cabernet_Sauvignon_imputation_mean), columns=['1_percent','5_percent','10_percent'])
print(get_percent_missing(imputed_mean))
column_name percent_missing
1_percent 1_percent 0.0
5_percent 5_percent 0.0
10_percent 10_percent 0.0
alcohol = Cabernet_Sauvignon["alcohol"]
combined_mean = pd.concat([alcohol,imputed_mean])
combined_mean.mean()
0 10.587102
10_percent 10.588810
1_percent 10.586540
5_percent 10.581520
dtype: float64
combined_mean.var()
0 1.470385
10_percent 1.320797
1_percent 1.456402
5_percent 1.395375
dtype: float64
The mean-based method showed very negligible variability; therefore this method is acceptable for the current dataset.

Imputation 3

Imputation Using Multivariate Imputation by Chained Equations (MICE). This type of imputation works by filling in the missing data multiple times. Multiple Imputations (MIs) are much better than a single imputation, as they measure the uncertainty of the missing values more accurately. The chained-equations approach is also very flexible: it can handle variables of different data types (i.e., continuous or binary) as well as complexities such as bounds or survey skip patterns.
Cabernet_Sauvignon_imputation_mice = Cabernet_Sauvignon_imputation[['1_percent','5_percent','10_percent']]
print(get_percent_missing(Cabernet_Sauvignon_imputation_mice))
column_name percent_missing
1_percent 1_percent 1.000435
5_percent 5_percent 5.002175
10_percent 10_percent 10.004350
!pip install impyute
from impyute.imputation.cs import mice
# start the MICE training
imputed_training=mice(Cabernet_Sauvignon_imputation_mice.values)
imputed_training = pd.DataFrame(imputed_training)
imputed_training.columns = ("1_percent","5_percent","10_percent")
# imputed_mice = pd.DataFrame(imputed_training.fit_transform(Cabernet_Sauvignon_imputa
print(get_percent_missing(imputed_training))
column_name percent_missing
1_percent 1_percent 0.0
5_percent 5_percent 0.0
10_percent 10_percent 0.0
alcohol = Cabernet_Sauvignon["alcohol"]
combined_mice = pd.concat([alcohol,imputed_training])
combined_mice.columns = ["Alcohol","1_Percent","5_Percent","10_Percent"]
combined_mice.mean()
Alcohol 10.587102
1_Percent 10.586915
5_Percent 10.587098
10_Percent 10.586915
dtype: float64
combined_mice.var()
Alcohol 1.470385
1_Percent 1.467981
5_Percent 1.470375
10_Percent 1.467981
dtype: float64
The MICE method showed very negligible variability; therefore this method is acceptable for the current dataset.
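The brief at the top of this section also asks for the % error on the residuals and for the bias and variance of the error, which this export never shows explicitly; a hedged sketch of that check for the MICE-imputed 1% column (the index lists number_1_idx etc. were stored earlier; the comparison itself is an assumption about the cells not shown):

# Compare imputed values against the true alcohol values at the masked positions
true_vals = Cabernet_Sauvignon_imputation['alcohol'].iloc[number_1_idx].values
imputed_vals = imputed_training['1_percent'].iloc[number_1_idx].values
residuals = imputed_vals - true_vals

print('mean % error         :', (np.abs(residuals) / true_vals).mean() * 100)
print('bias (mean residual) :', residuals.mean())
print('variance of residuals:', residuals.var())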
AutoML
#Install AutoML library - PyCaret
!pip install pycaret
(pip output condensed: most PyCaret dependencies are already satisfied; pip downgrades numpy from 1.20.0 to 1.19.5, and its dependency resolver warns that tensorflow 2.9.2, jaxlib 0.3.25+cuda11.cudnn805, jax 0.3.25 and cmdstanpy 1.0.8 require numpy>=1.20, and that en-core-web-sm 3.4.1 requires spacy<3.5.0,>=3.4.0 while spacy 2.3.8 is installed.)
from scipy import stats
# import math
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
for filename in filenames:
print(os.path.join(dirname, filename))
#Reading Data
Chateau_Montelena_AutoML = pd.read_csv('https://raw.githubusercontent.com/Mohanmanjuna
Chateau_Montelena_AutoMLM = Chateau_Montelena_AutoML.copy()
Chateau_Montelena_AutoMLB = Chateau_Montelena_AutoML.copy()
Each row represents a wine; each column contains the wine's attributes, such as type, sulphates, chlorides, etc., and the target label 'quality'.
Problem Statement

Binary Classification: predict the quality of wine as Low or High.
Multiclass Classification: predict the quality of wine as Low, Medium or High.
Regression: predict the quality of wine between 3 and 9 based on the independent predictor variables.

Dataset - Wine Quality
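The PyCaret modeling cells fall beyond this excerpt; a minimal sketch of the regression setup under PyCaret's 2.x API (consistent with the pinned scikit-learn 0.23.2 in the install log above; the session_id is an assumption):

from pycaret.regression import setup, compare_models

# Let PyCaret preprocess the data, then train and rank candidate regressors
reg = setup(data=Chateau_Montelena_AutoML, target='quality', session_id=42, silent=True)
best_model = compare_models()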
Chateau_Montelena_AutoML.describe()
       fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide
count    6487.000000       6489.000000  6494.000000     6495.000000  6495.000000         6497.000000               6497.0
mean        7.216579          0.339691     0.318722        5.444326     0.056042           30.525319                115.7
std         1.296750          0.164649     0.145265        4.758125     0.035036           17.749400                 56.5
min         3.800000          0.080000     0.000000        0.600000     0.009000            1.000000                  6.0
25%         6.400000          0.230000     0.250000        1.800000     0.038000           17.000000                 77.0
50%         7.000000          0.290000     0.310000        3.000000     0.047000           29.000000                118.0
75%         7.700000          0.400000     0.390000        8.100000     0.065000           41.000000                156.0
max        15.900000          1.580000     1.660000       65.800000     0.611000          289.000000                440.0
(output clipped at the right edge: the total sulfur dioxide values are truncated, and the density, pH, sulphates, alcohol and quality columns are cut off)
Dataset Shape: (6497, 13)
Name dtypes Missing Uniques Sample Value Entropy
0 type object 0 2 white 0.24
1 fixed acidity float64 10 106 7.0 1.65
2 volatile acidity float64 8 187 0.27 1.79
3 citric acid float64 3 89 0.36 1.70
4 residual sugar float64 2 316 20.7 2.08
5 chlorides float64 2 214 0.045 1.90
6 free sulfur dioxide float64 0 135 45.0 1.82
7 total sulfur dioxide float64 0 276 170.0 2.32
8 density float64 0 998 1.001 2.70
9 pH float64 9 108 3.0 1.81
10 sulphates float64 4 111 0.45 1.72
11 alcohol float64 0 111 8.8 1.66
12 quality int64 0 7 6 0.55
def tableinfo(Chateau_Montelena_AutoML):
    print(f"Dataset Shape: {Chateau_Montelena_AutoML.shape}")
    summary = pd.DataFrame(Chateau_Montelena_AutoML.dtypes, columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name', 'dtypes']]
    summary['Missing'] = Chateau_Montelena_AutoML.isnull().sum().values
    summary['Uniques'] = Chateau_Montelena_AutoML.nunique().values
    summary['Sample Value'] = Chateau_Montelena_AutoML.loc[0].values
    # Shannon entropy of each column's value distribution; this line is
    # reconstructed from a truncated printout -- base=10 matches the Entropy
    # values shown in the summary table above
    for name in summary['Name'].value_counts().index:
        summary.loc[summary['Name'] == name, 'Entropy'] = round(
            stats.entropy(Chateau_Montelena_AutoML[name].value_counts(normalize=True), base=10), 2)
    return summary
tableinfo(Chateau_Montelena_AutoML)
Entropy measures the randomness, or disorder, of a variable's value distribution: a value near zero means one category dominates, while higher values mean the observations are spread more evenly across many values.
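As a quick illustration of how the Entropy column above is computed, here is a minimal sketch using scipy.stats.entropy with base 10, matching the values in the summary table:
# Entropy (base 10) of the 'type' column's value distribution.
# The near-zero result reflects that one category (white) dominates.
probs = Chateau_Montelena_AutoML['type'].value_counts(normalize=True)
print(round(stats.entropy(probs, base=10), 2))  # ~0.24, per the summary table above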
Actions required for data preparation:
Converting 'type' to an integer data type, i.e. encoding the categorical feature (a sketch follows below).
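A minimal sketch of that encoding step; the 0/1 assignment below is an assumption for illustration, not necessarily the mapping applied later in the report:
# Hypothetical encoding of the 'type' column ('white'/'red') as integers;
# the 0/1 assignment is assumed, not taken from the report.
Chateau_Montelena_AutoML['type'] = Chateau_Montelena_AutoML['type'].map({'white': 0, 'red': 1}).astype(int)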
g = sns.histplot(Chateau_Montelena_AutoML['sulphates'], kde=True, ax=ax[2][2])
g = sns.histplot(Chateau_Montelena_AutoML['alcohol'], kde=True, ax=ax[3][0])
Observation :
These numerical variables do not follow a normal distribution. Their separate, independent peaks suggest that the sample mixes distinct sub-populations, plausibly the red and white wines, since the dataset combines both types.
Action :
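As an illustrative aside (not a step shown in the report), one way to probe whether the separate peaks come from mixing the two wine types is to overlay per-type histograms:
# Illustrative check, assuming the peaks come from mixing red and white wines:
# overlay per-type histograms for one of the multimodal variables.
sns.histplot(data=Chateau_Montelena_AutoML, x='volatile acidity', hue='type', kde=True)
plt.title('volatile acidity by wine type')
plt.show()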
INFO:logs:display_container: 1
INFO:logs:Pipeline(memory=None,
steps=[('dtypes',
DataTypes_Auto_infer(categorical_features=[],
display_types=False, features_todrop=[],
id_columns=[],
ml_usecase='classification',
numerical_features=[], target='quality',
time_features=[])),
('imputer',
Simple_Imputer(categorical_strategy='not_available',
fill_value_categorical=None,
fill_value_numerical=None,
numeric_str...
('dummy', Dummify(target='quality')),
('fix_perfect', Remove_100(target='quality')),
('clean_names', Clean_Colum_Names()),
('feature_select', 'passthrough'),
('fix_multi',
Fix_multicollinearity(correlation_with_target_preference=None,
correlation_with_target_threshold=0.0,
target_variable='quality',
threshold=0.9)),
('dfs', 'passthrough'), ('pca', 'passthrough')],
verbose=False)
INFO:logs:setup() succesfully completed......................................
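The log above is the tail of PyCaret's setup() output. As a minimal sketch, a setup() call along these lines would produce such a pipeline (the report's exact arguments are not shown here, so the values below are assumptions):
# PyCaret 2.x classification setup sketch; argument values are assumed for
# illustration and are not necessarily those used in the report.
from pycaret.classification import setup, compare_models
clf_setup = setup(data=Chateau_Montelena_AutoMLB,  # binary-labelled copy of the data
                  target='quality',                # column to predict
                  remove_multicollinearity=True,   # matches the Fix_multicollinearity step
                  multicollinearity_threshold=0.9, # matches threshold=0.9 in the log
                  silent=True)                     # skip the interactive dtype confirmation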
PyCaret provides the following metrics for comparing model performance in the compare_models() function:
Confusion Matrix is a performance measurement for classification problems where the output can be two or more classes. It is a table of all combinations of predicted and actual values; for a binary problem it has four cells (true positives, false positives, false negatives and true negatives).
AUC, the Area Under the ROC Curve, provides a single score that summarizes the ROC plot and can be used to compare models. A no-skill classifier scores 0.5, whereas a perfect classifier scores 1.0.
F1 score is the harmonic mean of Precision and Recall, a single score that seeks to balance both concerns.
F1 = (2 * Precision * Recall) / (Precision + Recall)
Accuracy is the fraction of correct predictions out of all predictions.
Accuracy = Correct Predictions / Total Predictions
MCC (Matthews Correlation Coefficient) produces a high score only if the prediction obtained good results in all four confusion matrix categories (true positives, false negatives, true negatives and false positives), proportionally both to the number of positive and the number of negative elements in the dataset.
Precision summarizes the fraction of examples assigned the positive class that actually belong to the positive class.
Precision = TruePositive / (TruePositive + FalsePositive)
Cohen's Kappa statistic measures the level of agreement between two raters (here, the model's predictions and the true labels) classifying items into mutually exclusive categories, corrected for agreement expected by chance.
Kappa = (Observed Agreement - Chance Agreement) / (1 - Chance Agreement)
Recall summarizes how well the positive class was predicted.
Recall = TruePositive / (TruePositive + FalseNegative)
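To make these definitions concrete, here is a small self-contained sketch computing each metric with scikit-learn on made-up binary labels (these are not the report's actual model predictions):
# Illustrative metric computation on made-up binary labels.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, matthews_corrcoef,
                             cohen_kappa_score, confusion_matrix)
y_true  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]                        # actual classes
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]                        # predicted classes
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1, 0.95, 0.35]  # predicted probabilities
print(confusion_matrix(y_true, y_pred))              # [[TN, FP], [FN, TP]]
print('Accuracy :', accuracy_score(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))
print('Recall   :', recall_score(y_true, y_pred))
print('F1       :', f1_score(y_true, y_pred))
print('AUC      :', roc_auc_score(y_true, y_score))  # uses probabilities, not labels
print('MCC      :', matthews_corrcoef(y_true, y_pred))
print('Kappa    :', cohen_kappa_score(y_true, y_pred))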
Searching for the best models
Model Comparison & Evaluation
best_modelB = compare_models()
#Plotting the confusion Matrix
plot_model(estimator = tuned_modelB, plot = 'confusion_matrix')
Observation :
We can see a strong diagonal, indicating that most predictions match the true class.
#plotting decision boundary
plot_model(estimator = tuned_modelB, plot = 'boundary', use_train_data = True)
INFO:logs:Visual Rendered Successfully
INFO:logs:plot_model() succesfully completed....................................
Observation:
We can see clear separation between the classes, with very few misclassifications.
plot_model(tuned_modelB, plot = 'parameter')