This document summarizes a student assignment to predict red wine quality using classification models. It describes using the wine quality dataset from UCI, preprocessing the data, exploring it visually, and training KNN and decision tree classifiers to predict wine quality. Evaluation shows the decision tree model achieved slightly higher accuracy than KNN, particularly when standard scaling was applied during modeling.
Practical Data Science: Data Modelling and Presentation
1. COSC2670 Practical Data Science Assignment 2
Predicting the Quality of Red Wine
Names: Junaid Ahmed Syed & Harini Mylanahally Sannaveeranna
Student IDs: s3731300 & s3755660
May 29, 2019
3. Chapter 1
Abstract
The main objective of this assignment is data modelling, a core step in the data science
process. The dataset used here is 'Red Wine Quality', with 'wine quality' as the target feature. The task
is framed as classification, and the chosen models are K-Nearest-Neighbour and Decision Tree. The rest of
this report is organized as follows. Chapter 2 gives an introduction and describes the dataset and its
attributes. Chapter 3 presents the methodology, covering data pre-processing, data exploration and data
modelling. Chapter 4 reports the results obtained in Chapter 3, and Chapter 5 discusses them. The last
chapter presents a summary.
4. Chapter 2
Introduction
2.1 DataSet Information
This dataset is sourced from the UCI Machine Learning Repository at
https://archive.ics.uci.edu/ml/datasets/Wine+Quality [1]. The repository provides two wine-quality
datasets, but only winequality-red.csv is used for this assignment. The dataset has 1599 observations
and 12 variables.
2.2 Target Feature
The classification goal is to predict whether the quality of the wine is good or bad.

Wine[quality] = { bad   if value = 0
                { good  if value = 1
2.3 Descriptive Features
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
5. Chapter 3
Methodology
3.1 Data Preprocessing
In 3.1, we checked that the feature types matched the description outlined in the documentation
using the pandas dtypes attribute.
3.1.1 Missing values
After verifying the feature types, missing values were checked with isnull().sum(); there are no missing
values in this dataset, at least at the surface level.
3.1.2 pd.cut() and LabelEncoder()
The target feature, quality, takes values from 2 to 8. With the help of pd.cut() we can bin these values
into discrete intervals. The mean quality is 5.6, so we treat the interval from 2 to 5.6 as bad quality and
from 5.6 to 8 as good quality. LabelEncoder() was then used to encode the labels as 0 and 1.
3.1.3 Outliers
We keep outliers for our predictive analysis, as outliers can be a great source of information.
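The binning and encoding steps can be sketched as follows. This is a minimal example assuming pandas and scikit-learn, with a small stand-in series in place of the real quality column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in for the quality column of winequality-red.csv
df = pd.DataFrame({"quality": [3, 5, 6, 7, 8, 4]})

# Bin quality into two intervals: (2, 5.6] -> "bad", (5.6, 8] -> "good"
df["quality"] = pd.cut(df["quality"], bins=(2, 5.6, 8), labels=["bad", "good"])

# Encode the string labels as integers (sorted alphabetically: bad -> 0, good -> 1)
le = LabelEncoder()
df["quality"] = le.fit_transform(df["quality"])
print(df["quality"].tolist())  # -> [0, 0, 1, 1, 1, 0]
```

Note that LabelEncoder() sorts classes alphabetically, so "bad" happens to map to 0 and "good" to 1, matching the target definition above.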
3.2 Data Exploration
3.2.1 Univariate visualisation
BoxHistogramPlot(x) is a function defined for numerical features, for the sake of simplicity. For a given
numeric input column, BoxHistogramPlot(x) draws a box plot and a histogram. A histogram is useful to
visualize the shape of the underlying distribution, whereas a box plot shows the range of the attribute and
helps detect any outliers. The following code chunks show how this function was defined using the numpy
library and the matplotlib library.
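A minimal sketch of such a helper is shown below; the function name comes from the report, but the exact two-panel layout is an assumption:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

def BoxHistogramPlot(x, name="feature"):
    """Draw a box plot above a histogram for one numeric column."""
    fig, (ax_box, ax_hist) = plt.subplots(
        2, sharex=True, gridspec_kw={"height_ratios": (0.2, 0.8)}
    )
    ax_box.boxplot(x, vert=False)  # range of the attribute and any outliers
    ax_hist.hist(x, bins=20)       # shape of the underlying distribution
    ax_hist.set_xlabel(name)
    return fig

# Example call on synthetic data standing in for a wine attribute
fig = BoxHistogramPlot(np.random.default_rng(0).normal(size=200), "alcohol")
```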
From the plots, we can see that the majority of the column histograms are unimodal. Among these,
fixed acidity, density, pH, sulphates and residual sugar appeared normally distributed, whereas
free sulphur dioxide, total sulphur dioxide and chlorides are left-skewed. We can also describe volatile
acidity as a bimodal attribute, because most of its values lie between 0.4 and 0.5 or between 0.6 and 0.7,
and the citric acid column as a plateau distribution, since it has more than three modes.
3.2.2 Multivariate Visualisation
• Histogram of numeric features segregated by Wine Quality
From the histograms, we can see that if the volatile acidity of the wine is above 0.6, the quality of
the wine is good. Likewise, higher citric acid levels are not so good for wine, and alcohol in excess
quantity, i.e., above 10%, may make the quality of the wine bad.
• Pairwise scatter plot between numeric features by Wine quality
A function named scatterplotByCategory(c, x, y, D) is designed to draw a scatter plot between two
numeric attributes y and x, labelled by a categorical attribute c, given an input data set D. In this case,
D is the dataset itself and c is the wine quality.
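A minimal sketch of such a helper (the signature follows the report; the colouring-by-group approach is an assumption):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

def scatterplotByCategory(c, x, y, D):
    """Scatter y against x, with one colour per level of the categorical column c."""
    fig, ax = plt.subplots()
    for level, group in D.groupby(c):
        ax.scatter(group[x], group[y], label=str(level))
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.legend(title=c)
    return fig

# Toy frame standing in for the wine data
df = pd.DataFrame({
    "volatile acidity": [0.3, 0.7, 0.4, 0.8],
    "alcohol": [9.5, 11.0, 10.2, 12.0],
    "quality": ["bad", "good", "bad", "good"],
})
fig = scatterplotByCategory("quality", "volatile acidity", "alcohol", df)
```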
We plotted scatter plots for volatile acidity, citric acid and alcohol, segregated by the target feature.
However, the graphs show no clear correlation between any two numeric variables; therefore, the numeric
features are likely to be independent of each other.
Chapter 4
Data Modelling
4.0.1 Train and Test data split
In order to perform predictive analysis, the dataset was divided into two parts: one containing all the
descriptive features and the other containing the target feature. These were named X and y, respectively.
4.0.2 Knn Classification Training
Data Slicing
We split the data randomly into training and test sets in a 50:50 ratio using train_test_split(), which is
provided by Scikit-learn. Later on, we fit/train a classifier on the training set and make predictions on
the test set. StandardScaler() is applied to the features, which helps improve the model's performance by
keeping the feature values from varying too widely.
KnnClassifier()
There are two important parameters for KnnClassifier(): n_neighbors and the distance metric. The default
metric is the Minkowski distance, and we have used the default.
- Predicting the optimal number of neighbours (K value):
The most common heuristic chooses k as the square root of the number of observations in the test set.
We then define the Knn classifier with the optimal value of K and fit the training data to the
model. We use predict() to obtain the test results, and finally evaluate the model using a confusion matrix
and classification report. We repeat this process two more times with train and test ratios of 60:40 and
80:20, respectively. The results are discussed in the next chapter.
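The KNN steps above can be sketched as follows. This is a minimal sketch assuming scikit-learn; a synthetic dataset with 11 features stands in for winequality-red.csv, and the square-root/odd-k heuristic follows the report:

```python
import math
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 11 descriptive features and binary target
X, y = make_classification(n_samples=400, n_features=11, random_state=42)

# 50:50 train/test split, as in the first experiment
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)

# Standardise so no feature dominates the distance metric
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# k = sqrt(test-set size), incremented to the next odd number if even
k = int(math.sqrt(len(X_test)))
if k % 2 == 0:
    k += 1

knn = KNeighborsClassifier(n_neighbors=k)  # default Minkowski metric
knn.fit(X_train, y_train)
cm = confusion_matrix(y_test, knn.predict(X_test))
print(cm)  # rows: true class, columns: predicted class
```

With 200 test observations, this heuristic gives k = 15 (sqrt(200) ≈ 14.1, truncated to 14, then incremented because it is even).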
4.0.3 Decision Tree Training
We used a similar approach for decision tree classification as we did for KNN classification, although the
parameters differ between the two. An advantage of the Decision Tree over KNN is that minimal effort is
required for data preparation; in particular, no scaling of the feature variables is needed.
DecisionTreeClassifier()
The important parameters of DecisionTreeClassifier() are:
- criterion: the function used to measure the quality of a split. We used the default, which is the Gini
index.
- max_depth: an integer value denoting the maximum depth of the tree. When not specified, it takes the
default value of None.
- min_samples_leaf: used to restrict the decision tree by specifying the minimum number of samples
required at a leaf node.
After defining the parameters, we fitted the training data, made predictions, and finally evaluated the
models using a confusion matrix and classification report with train and test ratios of 50:50, 60:40 and
80:20.
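A minimal sketch of this decision tree workflow, assuming scikit-learn; the synthetic dataset is a stand-in, and max_depth=4 follows the tuning described later in the report:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 11 descriptive features and binary target
X, y = make_classification(n_samples=400, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)  # 50:50 split; no scaling needed

# criterion="gini" is the scikit-learn default; max_depth=4 follows the
# report's tuning in the discussion chapter
tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=42)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```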
Plotting decision tree
We plotted the decision tree to see how it looks internally. The plot shows the Gini index at each split.
The value row in each node tells us how many of the observations that were sorted into that node fall
into each category. As expected, the maximum depth of the decision tree is 4, and we got 16 leaf nodes
because we did not specify a value for min_samples_leaf.
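Such a plot can be produced with scikit-learn's plot_tree; a synthetic dataset stands in for the wine data, and a headless matplotlib backend is assumed:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in for the wine features and binary target
X, y = make_classification(n_samples=200, n_features=11, random_state=0)
tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0)
tree.fit(X, y)

# Each node box shows the gini impurity, the sample count and the "value" row,
# i.e. how many observations in that node fall into each class
fig, ax = plt.subplots(figsize=(16, 8))
plot_tree(tree, filled=True, class_names=["bad", "good"], ax=ax)
```

A depth-4 binary tree has at most 2^4 = 16 leaves, which matches the leaf count reported above.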
Chapter 5
Results
The confusion matrices and classification reports for both classification algorithms are as follows:
• A table of the confusion matrix for KnnClassifier with train and test ratios of 50:50, 60:40 and 80:20.
confusion matrix 50:50 60:40 80:20
True negative 679 534 267
False positive 14 16 10
False negative 80 67 33
True positive 27 23 10
• A table of the confusion matrix for the Decision Tree with train and test ratios of 50:50, 60:40 and 80:20.
confusion matrix 50:50 60:40 80:20
True negative 651 538 270
False positive 42 12 7
False negative 61 70 27
True positive 46 20 16
• A table of the accuracy percentage for both KNN and Decision Tree with train and test ratios of
50:50, 60:40 and 80:20.
accuracy (%) KNN Decision Tree
50:50 88.25 89.37
60:40 87.03 89.37
80:20 86.56 86.56
From the tables, we can see that the KNN and Decision Tree models achieve similar accuracy. However,
if we do not apply the StandardScaler step and otherwise use the same training process, the result drops
by around 7%. Since the Decision Tree scores are equal or slightly higher at every split ratio, we conclude
that decision tree classification suits this particular dataset better than KNN classification.
Chapter 6
Discussion
• The functions used for the visualizations are taken from MATH2319 [2].
• For finding the optimal k value, we came across many functions online; all of them gave similar
results, but we chose the one from [3]. If the square root of the test-set size is an odd number, it is
taken as the k value; if it is even, it is incremented by 1.
• To find the max_depth value, we took a range from 2 to 8 and fitted models for each value. Of all
the values, we obtained the best precision with a depth of 4.
Chapter 7
Conclusion
In this assignment, we reduced the cardinality of the 'quality' column to a binary integer feature. From
the visualizations, we learned that all of the variables are potentially useful for predicting wine quality.
Finally, after fitting the binary classifiers and evaluating the models, we found that decision tree
classification performs better on this dataset.
Bibliography
[1] P. Cortez, S. Moro and P. Rita. UCI Machine Learning Repository: Wine Data Set.
[2] MATH2319, Machine Learning course, RMIT.
[3] Sklearn. URL: http://www.simplilean.com.