COSC2670 Practical Data Science Assignment 2
Predicting the Quality of Red Wine
Names: Junaid Ahmed Syed & Harini Mylanahally Sannaveeranna
Student ID: s3731300 & s3755660
May 29, 2019
Contents

1 Abstract
2 Introduction
   2.1 Dataset Information
   2.2 Target Feature
   2.3 Descriptive Features
3 Methodology
   3.1 Data Preprocessing
       3.1.1 Missing values
   3.2 Data Exploration
       3.2.1 Univariate visualisation
       3.2.2 Multivariate Visualisation
4 Data Modelling
   4.0.1 Train and Test data split
   4.0.2 KNN Classification Training
   4.0.3 Decision Tree Training
5 Results
6 Discussion
7 Conclusion
Chapter 1
Abstract
The main objective of this assignment is data modelling, which is a core step in the data science
process. The dataset used here is 'Red Wine Quality', with 'quality' as the target feature. The problem
can be framed as a classification task, and the chosen models are K-Nearest Neighbors (KNN) and
Decision Tree. The rest of this report is organised as follows. Chapter 2 gives an introduction and
describes the dataset and its attributes. Chapter 3 covers the methodology: data preprocessing and
data exploration. Chapter 4 covers data modelling, Chapter 5 presents the results, and Chapter 6
discusses them. The last chapter presents a summary.
Chapter 2
Introduction
2.1 Dataset Information
This dataset is sourced from the UCI Machine Learning Repository at
https://archive.ics.uci.edu/ml/datasets/Wine+Quality [1]. The repository provides two wine-quality
datasets, but only winequality-red.csv is used for this assignment. The dataset has 1599 observations
and 12 variables.
2.2 Target Feature
The classification goal is to predict whether the quality of the wine is good or bad.

Wine[quality] =
    bad,  if value = 0
    good, if value = 1
2.3 Descriptive Features
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Chapter 3
Methodology
3.1 Data Preprocessing
As a first step, we checked that the feature types matched the description outlined in the documentation
using the DataFrame's dtypes attribute.
3.1.1 Missing values
Upon verifying the feature types, missing values were checked with isnull().sum(); there are no missing
values in this dataset, at least at the surface level.

pd.cut() and LabelEncoder()

The target feature, quality, has values from 2 to 8. With the help of pd.cut() we can bin these values
into discrete intervals. The mean quality is 5.6, so we label the interval from 2 to 5.6 as bad quality
and 5.6 to 8 as good quality. LabelEncoder() was then used to encode the labels as 0 and 1.

Outliers

We keep outliers for our predictive analysis, as outliers can be a great source of information.
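The binning and encoding steps above can be sketched as follows. The cut points (2, 5.6, 8) follow the report, while the short quality series is only a toy stand-in for the real column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the 'quality' column of winequality-red.csv.
quality = pd.Series([3, 5, 5, 6, 7, 8])

# pd.cut() bins each score into one of two intervals:
# (2, 5.6] -> 'bad' and (5.6, 8] -> 'good'.
binned = pd.cut(quality, bins=(2, 5.6, 8), labels=["bad", "good"])

# LabelEncoder() then maps the string labels to integers: bad -> 0, good -> 1.
encoded = LabelEncoder().fit_transform(binned)
print(list(encoded))  # [0, 0, 0, 1, 1, 1]
```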
3.2 Data Exploration
3.2.1 Univariate visualisation
BoxHistogramPlot(x) is a helper function defined for numerical features, for the sake of simplicity. For a
given numeric input column, BoxHistogramPlot(x) plots a box plot and a histogram. A histogram is useful
to visualise the shape of the underlying distribution, whereas a box plot shows the range of the attribute
and helps detect any outliers. The function was defined using the numpy library and the matplotlib library.
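A minimal sketch of such a helper, assuming a box plot stacked above a histogram (the report does not show the exact layout, so the subplot arrangement here is an assumption):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt

def BoxHistogramPlot(x):
    """Draw a box plot above a histogram for one numeric column."""
    fig, (ax_box, ax_hist) = plt.subplots(
        2, 1, sharex=True, gridspec_kw={"height_ratios": (0.2, 0.8)}
    )
    ax_box.boxplot(x, vert=False)  # shows the range and any outliers
    ax_hist.hist(x, bins=20)       # shows the shape of the distribution
    return fig

fig = BoxHistogramPlot(np.random.default_rng(0).normal(size=200))
```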
From the plots, we can see that the majority of the column histograms are unimodal; among these,
fixed acidity, density, pH, sulphates and residual sugar appeared normally distributed, whereas
free sulphur dioxide, total sulphur dioxide and chlorides are left-skewed. We can also see that volatile
acidity is a bimodal attribute, because most of its values lie between 0.4–0.5 and 0.6–0.7, and that the
citric acid column is plateau-shaped, since it has more than three modes.
3.2.2 Multivariate Visualisation
• Histograms of numeric features segregated by wine quality
From the histograms, we can see that if the volatile acidity of the wine is above 0.6, the quality of
the wine tends to be good. Likewise, higher citric acid levels are not so good for the wine. Alcohol in
excess quantity, i.e., above 10%, may make the quality of the wine bad.
• Pairwise scatter plot between numeric features by Wine quality
A function named scatterplotByCategory(c, x, y, D) is designed to draw a scatter plot between two
numeric attributes x and y, labelled by a categorical attribute c, given input data D. In this case, D is
the dataset itself and c is the wine quality.
We plotted scatter plots for volatile acidity, citric acid and alcohol, segregated by the target feature.
But the graphs show no clear correlation between any two numeric variables; therefore, the numeric
features are likely to be independent of each other.
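One plausible implementation of this helper (the grouping and styling details are assumptions, and a small toy frame stands in for the wine data):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

def scatterplotByCategory(c, x, y, D):
    """Scatter D[x] against D[y], coloured by the categorical column c."""
    fig, ax = plt.subplots()
    for level, group in D.groupby(c):
        ax.scatter(group[x], group[y], label=str(level), alpha=0.6)
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.legend(title=c)
    return ax

# Toy stand-in for the wine dataset.
D = pd.DataFrame({
    "volatile acidity": [0.3, 0.7, 0.4, 0.8],
    "alcohol": [9.5, 10.5, 9.8, 11.0],
    "quality": ["bad", "good", "bad", "good"],
})
ax = scatterplotByCategory("quality", "volatile acidity", "alcohol", D)
```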
Chapter 4
Data Modelling
4.0.1 Train and Test data split
In order to perform predictive analysis, The dataset got divided into two parts. One part has all the de-
scriptive features, and another part has a target feature itself. These were named as X and y respectively.
4.0.2 KNN Classification Training
Data Slicing
Now we need to split the data randomly into training and test sets in the ratio of 50:50. We used
train_test_split(), provided by Scikit-learn, to perform the split. Later on, we fit/train a classifier on
the training set and make predictions on the test set. We also applied StandardScaler(), which helps
improve model performance by keeping feature values from varying widely in scale.
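The split-and-scale step might look like this, with synthetic data standing in for the 11 descriptive features and the binary target:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 100 rows, 11 descriptive features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 11))
y = rng.integers(0, 2, size=100)

# 50:50 train/test split, as in the first experiment.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
```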
KnnClassifier()
There are two important parameters for KnnClassifier(): one is n_neighbors, and the other is the
distance metric. The default metric is Minkowski distance, and we have used the default.
- Choosing the optimal number of neighbours (k value):
The most common way of choosing the k value is the square root of the number of observations in the
test set.
We then define the KNN classifier with the optimal value of k and fit the training data to the model.
We use predict() to obtain predictions on the test set, and lastly we evaluate the model using a
confusion matrix and a classification report. We repeat this process two more times with train/test
ratios of 60:40 and 80:20, respectively. We discuss the results in the next chapter.
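The k-selection rule above can be sketched as follows, using scikit-learn's KNeighborsClassifier (which the report's KnnClassifier() presumably wraps); the odd-number adjustment follows the rule described in the Discussion chapter.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def optimal_k(n_test):
    """Square root of the test-set size, bumped to the next odd integer."""
    k = int(round(np.sqrt(n_test)))
    return k if k % 2 == 1 else k + 1

# Synthetic stand-in for the scaled training data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 11))
y_train = rng.integers(0, 2, size=80)

k = optimal_k(80)  # sqrt(80) ~ 8.94 -> 9, already odd
knn = KNeighborsClassifier(n_neighbors=k)  # default metric: Minkowski (p=2)
knn.fit(X_train, y_train)
preds = knn.predict(X_train[:5])
```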
4.0.3 Decision Tree Training
We used a similar approach for decision tree classification as we did for KNN classification; however,
the parameters differ between the two. An advantage of decision trees over KNN is that minimal effort
is required for data preparation, i.e., no scaling of feature variables is needed.
DecisionTreeClassifier()
The important parameters of DecisionTreeClassifier() are:
- criterion: the function used to measure the quality of a split. We have used the default, which is the
Gini index.
- max_depth: an integer value denoting the maximum depth of the tree. When not specified, it takes
the default value None.
- min_samples_leaf: used to restrict the decision tree by specifying the minimum number of samples
required at a leaf node.
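Putting these parameters together might look like the sketch below. criterion='gini' and min_samples_leaf's default follow the report, max_depth=4 is the value the report later settles on, and the synthetic data stands in for the wine features.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 11 wine features and the binary quality target.
X, y = make_classification(n_samples=200, n_features=11, random_state=0)

# criterion='gini' is the default; max_depth=4 caps the tree's depth;
# min_samples_leaf=1 is the default minimum leaf size.
tree = DecisionTreeClassifier(criterion="gini", max_depth=4,
                              min_samples_leaf=1, random_state=0)
tree.fit(X, y)
```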
After defining the parameters, we fitted the training data, made predictions, and finally evaluated
the models using a confusion matrix and classification report, with train/test ratios of 50:50, 60:40
and 80:20.
Plotting the decision tree
We plotted the decision tree to see how it looks internally. The plot uses the Gini index as the splitting
criterion. The value row in each node tells us how many of the observations sorted into that node fall
into each category. As expected, the maximum depth of the decision tree is 4, and we got 16 leaf nodes
because we did not restrict the number of leaves.
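One way to produce such a plot is with scikit-learn's plot_tree; the report does not name its plotting routine, so this choice is an assumption, and the data is again synthetic.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in for the wine data; max_depth=4 as in the report.
X, y = make_classification(n_samples=200, n_features=11, random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

fig, ax = plt.subplots(figsize=(12, 6))
# Each node shows the split condition, gini impurity, sample count and the
# 'value' row with per-class counts described above.
plot_tree(tree, filled=True, ax=ax)
```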
Chapter 5
Results
The confusion matrix and classification report results for both classification algorithms are as follows:
- Confusion matrices for the KNN classifier with train/test ratios of 50:50, 60:40 and 80:20:

confusion matrix   50:50   60:40   80:20
True negative        679     534     267
False positive        14      16      10
False negative        80      67      33
True positive         27      23      10
• Confusion matrices for the decision tree with train/test ratios of 50:50, 60:40 and 80:20:

confusion matrix   50:50   60:40   80:20
True negative        651     538     270
False positive        42      12       7
False negative        61      70      27
True positive         46      20      16
• Accuracy percentage for both KNN and decision tree with train/test ratios of 50:50, 60:40 and 80:20:

accuracy (%)   KNN     Decision Tree
50:50          88.25   89.37
60:40          87.03   89.37
80:20          86.56   86.56
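As a sanity check, each accuracy figure follows from its confusion-matrix counts; for example, for KNN at 50:50:

```python
# KNN, 50:50 split: counts from the confusion-matrix table above.
tn, fp, fn, tp = 679, 14, 80, 27

# accuracy = correct predictions / all predictions
accuracy = (tn + tp) / (tn + fp + fn + tp)
print(round(100 * accuracy, 2))  # 88.25
```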
From the tables, we can say that both models, KNN and decision tree, achieve similar accuracy.
However, if we do not apply StandardScaler before training and otherwise follow the same process,
the results are around 7% lower. So, for this particular dataset, we conclude that decision tree
classification is better suited than KNN classification.
Chapter 6
Discussion
• The functions used for visualisations are taken from MATH2319 [2].
• For finding the optimal k value, we came across many functions on the internet; all of them gave
similar results, but we chose the function from [3]. If the result is an odd number, the k value is taken
as that number; if the result is even, it is incremented by 1.
• To find the max_depth value, we took a range from 2 to 8 and fitted models for each value. Out of
all the values, we got the best precision for max_depth = 4.
Chapter 7
Conclusion
In this assignment, we converted the cardinality of the 'quality' column into a binary integer type.
From the visualisations, we came to know that all the variables are potentially useful features for
predicting wine quality. Finally, after fitting the binary classifiers and evaluating the models, we found
that decision tree classification is better suited for this dataset.
Bibliography
[1] P. Cortez, S. Moro and P. Rita. UCI Machine Learning Repository: Wine Quality Data Set.
[2] MATH2319, Machine Learning course, RMIT.
[3] Sklearn. URL: http://www.simplilean.com.

More Related Content

What's hot

Dimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLabDimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLab
CloudxLab
 
Contoh Soal HOTS Fisika.pdf
Contoh Soal HOTS Fisika.pdfContoh Soal HOTS Fisika.pdf
Contoh Soal HOTS Fisika.pdf
AmrinaRosada40
 
Predict Breast Cancer using Deep Learning
Predict Breast Cancer using Deep LearningPredict Breast Cancer using Deep Learning
Predict Breast Cancer using Deep Learning
Ayesha Shafique
 
UH IPA7 S1 B1 BESARAN N PENGUKURAN.docx
UH IPA7 S1 B1 BESARAN N PENGUKURAN.docxUH IPA7 S1 B1 BESARAN N PENGUKURAN.docx
UH IPA7 S1 B1 BESARAN N PENGUKURAN.docx
sajidintuban
 
Statistika rata - rata gabungan
Statistika rata - rata gabunganStatistika rata - rata gabungan
Statistika rata - rata gabungan
dinakudus
 
Latihan soal olimpiade fisika SMP
Latihan soal olimpiade fisika SMPLatihan soal olimpiade fisika SMP
Latihan soal olimpiade fisika SMPDaniel Tohari
 
Modul UN Matematika SMP 2018 (yogaZsor)
Modul UN Matematika SMP 2018 (yogaZsor)Modul UN Matematika SMP 2018 (yogaZsor)
Modul UN Matematika SMP 2018 (yogaZsor)
IC Magnet School
 
Soal suhu dan kalor
Soal suhu dan kalorSoal suhu dan kalor
Soal suhu dan kalorAlfipi
 
heart final last sem.pptx
heart final last sem.pptxheart final last sem.pptx
heart final last sem.pptx
rakshashadu
 

What's hot (9)

Dimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLabDimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLab
 
Contoh Soal HOTS Fisika.pdf
Contoh Soal HOTS Fisika.pdfContoh Soal HOTS Fisika.pdf
Contoh Soal HOTS Fisika.pdf
 
Predict Breast Cancer using Deep Learning
Predict Breast Cancer using Deep LearningPredict Breast Cancer using Deep Learning
Predict Breast Cancer using Deep Learning
 
UH IPA7 S1 B1 BESARAN N PENGUKURAN.docx
UH IPA7 S1 B1 BESARAN N PENGUKURAN.docxUH IPA7 S1 B1 BESARAN N PENGUKURAN.docx
UH IPA7 S1 B1 BESARAN N PENGUKURAN.docx
 
Statistika rata - rata gabungan
Statistika rata - rata gabunganStatistika rata - rata gabungan
Statistika rata - rata gabungan
 
Latihan soal olimpiade fisika SMP
Latihan soal olimpiade fisika SMPLatihan soal olimpiade fisika SMP
Latihan soal olimpiade fisika SMP
 
Modul UN Matematika SMP 2018 (yogaZsor)
Modul UN Matematika SMP 2018 (yogaZsor)Modul UN Matematika SMP 2018 (yogaZsor)
Modul UN Matematika SMP 2018 (yogaZsor)
 
Soal suhu dan kalor
Soal suhu dan kalorSoal suhu dan kalor
Soal suhu dan kalor
 
heart final last sem.pptx
heart final last sem.pptxheart final last sem.pptx
heart final last sem.pptx
 

Similar to Practical Data Science: Data Modelling and Presentation

Caravan insurance data mining prediction models
Caravan insurance data mining prediction modelsCaravan insurance data mining prediction models
Caravan insurance data mining prediction models
Muthu Kumaar Thangavelu
 
Caravan insurance data mining prediction models
Caravan insurance data mining prediction modelsCaravan insurance data mining prediction models
Caravan insurance data mining prediction models
Muthu Kumaar Thangavelu
 
Machine_Learning_Trushita
Machine_Learning_TrushitaMachine_Learning_Trushita
Machine_Learning_Trushita
Trushita Redij
 
Comparison of Top Data Mining(Final)
Comparison of Top Data Mining(Final)Comparison of Top Data Mining(Final)
Comparison of Top Data Mining(Final)
Sanghun Kim
 
forest-cover-type
forest-cover-typeforest-cover-type
forest-cover-type
Kayleigh Beard
 
Benchmarking_ML_Tools
Benchmarking_ML_ToolsBenchmarking_ML_Tools
Benchmarking_ML_Tools
Marc Borowczak
 
House Sale Price Prediction
House Sale Price PredictionHouse Sale Price Prediction
House Sale Price Prediction
sriram30691
 
07 learning
07 learning07 learning
07 learning
ankit_ppt
 
Using Artificial Neural Networks to Detect Multiple Cancers from a Blood Test
Using Artificial Neural Networks to Detect Multiple Cancers from a Blood TestUsing Artificial Neural Networks to Detect Multiple Cancers from a Blood Test
Using Artificial Neural Networks to Detect Multiple Cancers from a Blood Test
StevenQu1
 
Predicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine LearningPredicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine Learning
Leo Salemann
 
Predicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine LearningPredicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine Learning
Karunakar Kotha
 
Predicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine LearningPredicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine Learning
Wenfan Xu
 
Mat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports DataMat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports Data
KathleneNgo
 
Microsoft Professional Capstone: Data Science
Microsoft Professional Capstone: Data ScienceMicrosoft Professional Capstone: Data Science
Microsoft Professional Capstone: Data Science
Mashfiq Shahriar
 
Enhancing the performance of Naive Bayesian Classifier using Information Gain...
Enhancing the performance of Naive Bayesian Classifier using Information Gain...Enhancing the performance of Naive Bayesian Classifier using Information Gain...
Enhancing the performance of Naive Bayesian Classifier using Information Gain...
Rafiul Sabbir
 
Predicting rainfall using ensemble of ensembles
Predicting rainfall using ensemble of ensemblesPredicting rainfall using ensemble of ensembles
Predicting rainfall using ensemble of ensembles
Varad Meru
 
House Price Estimation as a Function Fitting Problem with using ANN Approach
House Price Estimation as a Function Fitting Problem with using ANN ApproachHouse Price Estimation as a Function Fitting Problem with using ANN Approach
House Price Estimation as a Function Fitting Problem with using ANN Approach
Yusuf Uzun
 
Unsupervised learning
Unsupervised learning Unsupervised learning
Unsupervised learning
AlexAman1
 
Dm
DmDm
07 dimensionality reduction
07 dimensionality reduction07 dimensionality reduction
07 dimensionality reduction
Marco Quartulli
 

Similar to Practical Data Science: Data Modelling and Presentation (20)

Caravan insurance data mining prediction models
Caravan insurance data mining prediction modelsCaravan insurance data mining prediction models
Caravan insurance data mining prediction models
 
Caravan insurance data mining prediction models
Caravan insurance data mining prediction modelsCaravan insurance data mining prediction models
Caravan insurance data mining prediction models
 
Machine_Learning_Trushita
Machine_Learning_TrushitaMachine_Learning_Trushita
Machine_Learning_Trushita
 
Comparison of Top Data Mining(Final)
Comparison of Top Data Mining(Final)Comparison of Top Data Mining(Final)
Comparison of Top Data Mining(Final)
 
forest-cover-type
forest-cover-typeforest-cover-type
forest-cover-type
 
Benchmarking_ML_Tools
Benchmarking_ML_ToolsBenchmarking_ML_Tools
Benchmarking_ML_Tools
 
House Sale Price Prediction
House Sale Price PredictionHouse Sale Price Prediction
House Sale Price Prediction
 
07 learning
07 learning07 learning
07 learning
 
Using Artificial Neural Networks to Detect Multiple Cancers from a Blood Test
Using Artificial Neural Networks to Detect Multiple Cancers from a Blood TestUsing Artificial Neural Networks to Detect Multiple Cancers from a Blood Test
Using Artificial Neural Networks to Detect Multiple Cancers from a Blood Test
 
Predicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine LearningPredicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine Learning
 
Predicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine LearningPredicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine Learning
 
Predicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine LearningPredicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine Learning
 
Mat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports DataMat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports Data
 
Microsoft Professional Capstone: Data Science
Microsoft Professional Capstone: Data ScienceMicrosoft Professional Capstone: Data Science
Microsoft Professional Capstone: Data Science
 
Enhancing the performance of Naive Bayesian Classifier using Information Gain...
Enhancing the performance of Naive Bayesian Classifier using Information Gain...Enhancing the performance of Naive Bayesian Classifier using Information Gain...
Enhancing the performance of Naive Bayesian Classifier using Information Gain...
 
Predicting rainfall using ensemble of ensembles
Predicting rainfall using ensemble of ensemblesPredicting rainfall using ensemble of ensembles
Predicting rainfall using ensemble of ensembles
 
House Price Estimation as a Function Fitting Problem with using ANN Approach
House Price Estimation as a Function Fitting Problem with using ANN ApproachHouse Price Estimation as a Function Fitting Problem with using ANN Approach
House Price Estimation as a Function Fitting Problem with using ANN Approach
 
Unsupervised learning
Unsupervised learning Unsupervised learning
Unsupervised learning
 
Dm
DmDm
Dm
 
07 dimensionality reduction
07 dimensionality reduction07 dimensionality reduction
07 dimensionality reduction
 

Recently uploaded

Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfUnleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Enterprise Wired
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
mzpolocfi
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
eddie19851
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 

Recently uploaded (20)

Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfUnleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 

Practical Data Science: Data Modelling and Presentation

  • 1. COSC2670 Practical Data Science Assignment 2 Predicting The quality of Red Wine Names: Junaid Ahmed Syed &Harini Mylanahally Sannaveeranna Student ID: s3731300& s3755660 May 29, 2019
  • 2. Contents 1 Abstract 2 2 Introduction 3 2.1 DataSet Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.2 Target Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.3 Descriptive Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 3 Methodology 4 3.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.1.1 Missing values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.2 Data Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.2.1 Univariate visualisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.2.2 Mulitvariate Visualisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 4 Data Modelling 6 4.0.1 Train and Test data split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 4.0.2 Knn Classification Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 4.0.3 Decision Tree Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 5 Results 8 6 Disscusion 9 7 Conclusion 10 1
  • 3. Chapter 1 Abstract The main objective of this assignment is to focus on data modelling, Which is a core step in the data science process. The dataset used here is ’Red Wine Quality’ with a target feature being ’wine quality’. This dataset can be viewed as a classification task, and the chosen models within these particular tasks are KNearest- Neighbor, DecisionTree. The rest of this report is organized as follows. Section 2 gives an introduction and describes the data sets and their attributes. Part 3 is a Methodology that covers data pre-processing, Data Exploration and Data Modelling. In Section 4, we explore the results got in Section 3. In section 5, we discuss the effects we got in Section 4.The last part is to present a summary. 2
  • 4. Chapter 2 Introduction 2.1 DataSet Information This Dataset is sourced from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Wine+Quality[1]..The UCI Machine Learning Repository has 2 datasets, but only winequality-red.csv is useful for this Assignment. This data set has 1599 observations and 12 variables. 2.2 Target Feature The classification goal is to predict wheater the quality of the wine is good or bad. Wine[quality] = { bad if value = 0 good if value = 1 2.3 Descriptive Features 1 - fixed acidity 2 - volatile acidity 3 - citric acid 4 - residual sugar 5 - chlorides 6 - free sulfur dioxide 7 - total sulfur dioxide 8 - density 9 - pH 10 - sulphates 11 - alcohol 3
  • 5. Chapter 3 Methodology 3.1 Data Preprocessing In 3.1, we checked that the feature types matched the description as outlined in the documentation by using dtypes(). 3.1.1 Missing values Upon verifying feature types, Missing values got checked with isnull(). sum() but there are no missing values for this dataset on the surface level. ### pd.cut() and LableEncoder() We can see that the target feature - quality has values from 2 to 8, with the help of pd.cut() we can Bin the values into discrete intervals. The mean of the quality is 5.6, So we have set an interval of 2 to 5.6 and address it as bad quality and 5.6 to 8 as good quality. LabelEncoder() was then used to encode labels with values of 0 and 1. ### Outliers We tend to keep outliers for our predictive analysis as outliers can be a great source of information. 3.2 Data Exploration 3.2.1 Univariate visualisation BoxHistogramPlot(x) is a function defined for numerical features, for the sake of simplicity. For a given binary input column, BoxHistogramPlot(x) plots a histogram. A histogram is are useful to visualize the shape of the underlying distribution, whereas A box plot tells the range of the attribute and helps detect any outliers. The following chunk codes show how these functions were defined using the numpy library and the matplotlib library. From the plots, we can see that the majority of histograms of columns are unimodal, and among these graphs fixed acidity, density, pH, sulphates and residual sugar were seemed normally distributed whereas free sulphur dioxide, total sulphur dioxide, chlorides are left-skewed. We can also notice volatile acid as a bimodal attribute because most of the values lie 0.4 to 0.5 and 0.6 to 0.7 and citric acid column as Plateau since there are more than 3 modes. 
3.2.2 Multivariate Visualisation
• Histogram of numeric features segregated by wine quality. From the histograms, we can see that when the volatile acidity of the wine ranges above 0.6, the quality of the wine is good. Likewise, higher citric acid levels are not so good for the wine, and alcohol in excess quantity, i.e., above 10%, may make the quality of the wine bad.
• Pairwise scatter plot between numeric features by wine quality.
A function named scatterplotByCategory(c, x, y, D) is designed to draw a scatter plot between two numeric attributes x and y, labelled by a categorical attribute c, given an input dataset D. In this case, D is the dataset itself and c is the wine quality. We plotted scatter plots for volatile acidity, citric acid and alcohol, segregated by the target feature. The graphs show no clear correlation between any two numeric variables; the numeric features are therefore likely independent of each other.
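A reconstruction sketch of this helper, assuming a pandas groupby over the category with one scatter layer per level (the report's actual implementation may differ):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd

def scatterplotByCategory(c, x, y, D):
    """Scatter plot of numeric columns x vs. y in DataFrame D,
    coloured by the levels of categorical column c."""
    fig, ax = plt.subplots()
    for level, group in D.groupby(c):
        ax.scatter(group[x], group[y], label=str(level), alpha=0.6)
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.legend(title=c)
    return fig

# Toy stand-in for the wine data
D = pd.DataFrame({"alcohol": [9.4, 10.2, 11.0, 12.5],
                  "volatile acidity": [0.70, 0.60, 0.30, 0.40],
                  "quality": [0, 0, 1, 1]})
fig = scatterplotByCategory("quality", "alcohol", "volatile acidity", D)
```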
Chapter 4 Data Modelling

4.0.1 Train and Test data split
To perform predictive analysis, the dataset was divided into two parts: one holding all the descriptive features and one holding the target feature, named X and y respectively.

4.0.2 Knn Classification Training
Data slicing: we split the data randomly into training and test sets in a 50:50 ratio, using Scikit-learn's train_test_split(). We then fit a classifier on the training set and make predictions on the test set. StandardScaler() is applied first; standardising the features improves the performance of the model by keeping the feature values from varying too widely in scale.

KnnClassifier(): two important parameters for the K-nearest-neighbour classifier are n_neighbors and the distance metric. The default metric is Minkowski distance, and we used the default.

Predicting the optimal number of neighbours (K value): the most common heuristic is to choose k as the square root of the number of observations in the test set.

We then define the Knn classifier with the optimal value of K, fit the training data, and use predict() on the test set. Finally, we evaluate the model using a confusion matrix and a classification report. We repeat this process two more times with train/test ratios of 60:40 and 80:20, and discuss the results in the next chapter.

4.0.3 Decision Tree Training
We used a similar approach for decision tree classification as for Knn classification, though the parameters differ. An advantage of a decision tree over Knn is that minimal effort is required for data preparation, i.e., no scaling of the feature variables is needed.

DecisionTreeClassifier(): the important parameters of DecisionTreeClassifier() are:
- criterion: the function used to measure the quality of a split. We used the default, the Gini index.
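The Knn training pipeline above can be sketched as follows. Synthetic data stands in for winequality-red.csv, and the square-root-of-the-test-set heuristic (forced odd, per the Discussion chapter) selects k; treat this as an outline under those assumptions rather than the report's exact code:

```python
from math import sqrt
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Stand-in for the wine data: 1599 rows, 11 descriptive features
X, y = make_classification(n_samples=1599, n_features=11, random_state=0)

# 50:50 split (the report repeats this with 60:40 and 80:20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Standardise so no single feature dominates the distance metric
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# k = sqrt(test-set size), incremented to the next odd number if even
k = int(sqrt(len(X_test)))
if k % 2 == 0:
    k += 1

knn = KNeighborsClassifier(n_neighbors=k)  # default Minkowski metric
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```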
- max_depth: an integer denoting the maximum depth of the tree. When not specified, it takes the default value of None.
- min_samples_leaf: restricts the decision tree by specifying the minimum number of samples required at a leaf node.

After defining the parameters, we fit on the training data, predict, and evaluate the models using a confusion matrix and classification report, with train/test ratios of 50:50, 60:40 and 80:20.

Plotting the decision tree
We plotted the decision tree to inspect its internal structure. The plot uses the Gini index as the splitting criterion. The value row in each node tells us how many of the observations sorted into that node fall into each category. As expected, the maximum depth of the decision tree is 4, and it has 16 leaf nodes because we did not specify a value for min_samples_leaf.
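A sketch of the decision-tree step, again on synthetic stand-in data, using the parameters the report settles on (Gini criterion, max_depth of 4); random_state is an added assumption for reproducibility:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Stand-in for the wine data; no feature scaling needed for tree models
X, y = make_classification(n_samples=1599, n_features=11, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0)
tree.fit(X_train, y_train)
print(confusion_matrix(y_test, tree.predict(X_test)))

# Visualise the fitted tree; each node's 'value' row shows the class counts
fig, ax = plt.subplots(figsize=(14, 6))
plot_tree(tree, filled=True, ax=ax)
```

With max_depth=4 and no leaf restriction, the tree can have at most 2^4 = 16 leaves, which matches the 16 leaf nodes reported above.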
Chapter 5 Results
The confusion matrices and classification reports for both classification algorithms are as follows.

• Confusion matrix for the Knn classifier with train/test ratios of 50:50, 60:40 and 80:20:

confusion matrix   50:50   60:40   80:20
True negative        679     534     267
False positive        14      16      10
False negative        80      67      33
True positive         27      23      10

• Confusion matrix for the decision tree with train/test ratios of 50:50, 60:40 and 80:20:

confusion matrix   50:50   60:40   80:20
True negative        651     538     270
False positive        42      12       7
False negative        61      70      27
True positive         46      20      16

• Accuracy percentage for both KNN and the decision tree with train/test ratios of 50:50, 60:40 and 80:20:

split    KNN     Decision tree
50:50    88.25   89.37
60:40    87.03   89.37
80:20    86.56   86.56

From the tables, the Knn and decision tree models achieve similar accuracy. However, if we train the Knn model without applying StandardScaler and otherwise follow the same process, precision drops by around 7%. For this particular dataset, we therefore consider decision tree classification better than KNN classification.
Chapter 6 Discussion
• The functions used for the visualisations are taken from MATH2319 [2].
• For finding the optimal k value, we came across many heuristics online, all of which gave similar results; we chose the rule from [3]: take the square root of the test-set size; if the result is odd, use it as k, and if it is even, increment it by 1.
• To find the max_depth value, we took a range from 2 to 8 and fitted a model for each. Of all the values, max_depth = 4 gave the best precision.
Chapter 7 Conclusion
In this assignment, we converted the quality column into a binary integer feature. The visualisations showed that all the variables are potentially useful features for predicting wine quality. Finally, after fitting the binary classifiers and evaluating the models, we found that decision tree classification performs better on this dataset.
Bibliography
[1] P. Cortez, S. Moro and P. Rita. UCI Machine Learning Repository: Wine Data Set.
[2] MATH2319, Machine Learning course, RMIT.
[3] Sklearn. URL: http://www.simplilean.com.