This report uses a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. It covers data retrieval, cleaning, exploration, and classification with Decision Tree and KNN models.
Practical Data Science – COSC2670
Assignment - 2
Title: Practical Data Science Assignment - 2
Author: Sai Chandan V (s3734305)
Contact Details: s3734305@student.rmit.edu.au
Table of Contents
1. Abstract / Executive summary
2. Introduction
3. Methodology
3.1 Data Retrieving
3.2 Data Exploration
3.3 Data Modeling
4. Conclusion
5. References
Abstract/Executive Summary
This report uses the Contraceptive Method Choice dataset and treats it as a
classification task. The dataset was downloaded from the following link:
https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice.
It is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. The
samples are married women who were either not pregnant or did not know whether they
were pregnant at the time of the survey. The task is to predict each woman’s current
contraceptive method choice (no use, short-term methods, or long-term methods)
based on her socio-economic and demographic characteristics. The dataset is
multivariate and has 1473 instances and 9 attributes plus the class attribute.
Task 1 is data retrieval. After loading, the data was checked for missing and NaN
values, and none were found. Task 2 is data exploration, in which each column is
explored (at least 10 columns) and described with descriptive statistics and graphs,
such as the distribution of a numerical attribute or the value counts of a categorical
attribute. The relationships between pairs of attributes are then explored, with a
focus on selected pairs of columns.
Task 3 is data modelling. The classifiers are trained on three splits: 50% for training
and 50% for testing, 60% for training and 40% for testing, and 80% for training and
20% for testing. The KNN and decision tree models are evaluated with the confusion
matrix, classification error rate, precision, recall and F1 score. The results show that
KNN test accuracy improves slightly as the training share grows, and that the decision
tree is more accurate than KNN on every split.
Introduction:
The dataset is the Contraceptive Method Choice dataset, a subset of the 1987
National Indonesia Contraceptive Prevalence Survey. The samples are married women
who were either not pregnant or did not know whether they were pregnant at the time
the data was collected. The problem is to predict a woman’s current contraceptive
method choice (no use, short-term methods, or long-term methods) from her
socio-economic and demographic characteristics. The dataset has 1473 instances,
9 attributes plus the class attribute, and no missing values.
Attributes:
The attributes present in the data set are
1. Wife’s age => (Numerical)
2. Wife’s education => (Categorical) 1=low … 4 = High
3. Husband’s education => (Categorical) 1=low … 4 = High
4. Number of children ever born => (Numerical)
5. Wife’s religion => (binary) 0=Non-Islam, 1=Islam
6. Wife’s now working => (binary) 0=Yes, 1=No
7. Husband’s occupation => (Categorical) 1,2,3,4
8. Standard-of-living index => (Categorical) 1=low … 4 = High
9. Media exposure => (binary) 0=Good, 1=Not good
10. Contraceptive Method used => (Class attribute) 1=No use, 2=Long Term, 3=Short Term
Methodology:
Data Manipulation:
Import pandas, load the data with pd.read_csv, and add column headers to the data
set. Check for stray white space, missing values, NaN values and the data types.
A null check such as rawdata.isnull().sum() returns zero for every column. Since the
Contraceptive Method Choice dataset has none of these problems, no cleaning is
needed.
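The loading and checking steps can be sketched as follows. This is a minimal example, not the report's exact code: apart from WifeReligion and WifeWorking, the column names are assumptions, and the three inline rows are illustrative stand-ins for the downloaded file.

```python
import io

import pandas as pd

# Column names are assumptions; only WifeReligion and WifeWorking appear in the report.
COLUMNS = [
    "WifeAge", "WifeEducation", "HusbandEducation", "ChildrenBorn",
    "WifeReligion", "WifeWorking", "HusbandOccupation",
    "StandardOfLiving", "MediaExposure", "ContraceptiveMethod",
]


def load_cmc(source):
    """Read a header-less, comma-separated file and attach column names."""
    # skipinitialspace drops stray blanks that follow a delimiter.
    return pd.read_csv(source, header=None, names=COLUMNS, skipinitialspace=True)


# Illustrative rows in the file's layout (not the real data);
# pass the path of the downloaded file instead.
sample = io.StringIO(
    "24,2,3,3,1,1,2,3,0,1\n"
    "45,1,3,10,1,1,3,4,0,1\n"
    "30, 4,4,2,0,0,1,4,0,3\n"
)
rawdata = load_cmc(sample)

print(rawdata.isnull().sum())  # missing values per column
print(rawdata.dtypes)          # every column should be an integer type
```

If any column shows up with the object dtype, that usually points to stray text or white space in the file.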
Data Exploration:
The process starts with a few simple univariate (single-feature) analyses.
There are many ways to classify feature types, but for simplicity let’s define two:
numerical and categorical.
Numerical: a feature that has numeric values.
Categorical: a feature that contains text or categories.
The categorical variables in rawdata are stored as numeric codes. We map these codes
to their labels and check the data frame, first making a copy of the data as datasetraw
with datasetraw = rawdata.copy(). The labels are then defined and substituted into the
data frame.
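wifereligion = {0: "Non_Islam", 1: "Islam"}
# Assign the result back instead of using a chained inplace replace
datasetraw["WifeReligion"] = datasetraw["WifeReligion"].replace(wifereligion)
wifeworking = {0: "Yes", 1: "No"}
datasetraw["WifeWorking"] = datasetraw["WifeWorking"].replace(wifeworking)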
Let's create two new data frames, one for discretized continuous variables and one for
continuous variables.
dataset_bin = pd.DataFrame() # data frame for discretized continuous variables
dataset_con = pd.DataFrame() # data frame for continuous variables
From these data frames we plot the contraceptive method counts using seaborn (sns),
with no use, long term and short term on the x-axis and the count on the y-axis.
The next graphs show the wife’s age against the methods used. The first has the count
on the x-axis and the age intervals on the y-axis. The second shows how the usage of
contraceptive methods varies with age.
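A minimal sketch of how the two data frames might be filled for the age column. The ages, the bin edges and the WifeAge column name are all hypothetical; the report does not state which bins were used.

```python
import pandas as pd

# Hypothetical ages standing in for the wife's age column (assumed name: WifeAge).
ages = pd.Series([16, 24, 30, 38, 45, 49], name="WifeAge")

dataset_bin = pd.DataFrame()  # discretized continuous variables
dataset_con = pd.DataFrame()  # continuous variables

# Keep the raw values for distribution plots.
dataset_con["WifeAge"] = ages

# Bin edges and labels are assumptions chosen for illustration.
dataset_bin["WifeAge"] = pd.cut(
    ages, bins=[15, 25, 35, 50], labels=["16-25", "26-35", "36-50"]
)

age_counts = dataset_bin["WifeAge"].value_counts().to_dict()
print(age_counts)
```

The binned column suits count plots, while the continuous column suits distribution plots of the same attribute.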
The graph shows contraceptive method usage by wife’s education, with wife’s education
on the x-axis and the count on the y-axis. The higher the woman’s education, the
higher the contraceptive method usage.
The graph shows contraceptive method usage by husband’s education, with husband’s
education on the x-axis and the count on the y-axis. The higher the husband’s
education, the higher the contraceptive method usage.
The graph shows contraceptive method usage by wife’s religion, with wife’s religion on
the x-axis and the count on the y-axis. There are two categories, Islam and non-Islam.
The usage counts are very high for Islam and low for non-Islam.
The graph shows contraceptive method usage by wife’s working status, with working
status on the x-axis and the count on the y-axis. If the wife is working, the
contraceptive method usage is lower.
The graph shows contraceptive method usage by husband’s education, with husband’s
education on the x-axis and the count on the y-axis. The higher the husband’s
education, the higher the wife’s contraceptive method usage.
The graph shows contraceptive method usage by standard-of-living index, with the
standard-of-living index on the x-axis and the count on the y-axis. The higher the
standard-of-living index, the higher the contraceptive method usage.
The graph shows contraceptive method usage by media exposure, with media exposure on
the x-axis and the count on the y-axis. The higher the media exposure, the greater the
contraceptive method usage.
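The count plots above compare raw counts, which depend on how many women fall into each category. As a complementary check, a row-normalised cross-tabulation shows the share of each method within a category. The sketch below uses a handful of hypothetical rows, and the ContraceptiveMethod column name is an assumption.

```python
import pandas as pd

# Hypothetical rows for illustration; not taken from the survey data.
df = pd.DataFrame({
    "WifeReligion": ["Islam", "Islam", "Islam", "Islam", "Non_Islam", "Non_Islam"],
    "ContraceptiveMethod": ["No use", "Short term", "No use",
                            "Long term", "Long term", "No use"],
})

# Raw counts per (religion, method) pair: what a count plot displays.
counts = pd.crosstab(df["WifeReligion"], df["ContraceptiveMethod"])

# Shares within each religion: every row sums to 1.
shares = pd.crosstab(df["WifeReligion"], df["ContraceptiveMethod"], normalize="index")

print(counts)
print(shares.round(2))
```

Comparing shares rather than counts avoids reading a larger group as a higher usage rate.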
Bivariate Analysis and Multi-Variate Analysis:
The features have been analyzed individually. Now let's combine them to understand
the interactions between them. The graph shows contraceptive method usage by number
of children born, with children born on the x-axis and the count on the y-axis.
This graph shows the contraceptive methods used by wife’s education and wife’s age.
When the wife’s education is low, contraceptive method usage is low, and the higher the
wife’s education, the greater the contraceptive method usage.
This graph shows the contraceptive methods used, with media exposure on the x-axis
and wife’s age on the y-axis. When the wife’s age is low, contraceptive method usage
is low, and the higher the media exposure, the greater the contraceptive method usage.
Another plot has media exposure on the x-axis and children born on the y-axis. Again,
the higher the media exposure, the higher the contraceptive method usage.
This graph shows the contraceptive methods used by wife’s age and number of children
born. It illustrates the relationship between this pair of features.
Data Modelling:
The task is classification, and the models are trained on three
different splits: 1. 50% for training and 50% for testing, 2. 60% for training and
40% for testing, 3. 80% for training and 20% for testing. The models used are KNN
and a decision tree. For each model and split we report the confusion matrix,
precision, recall and F1 score.
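The procedure can be sketched with scikit-learn as below. This is an illustration under stated assumptions, not the report's code: the data is a synthetic stand-in with the same shape as the dataset (1473 rows, 9 features, 3 classes), and k = 5, the default tree settings and the random seed are assumptions, since the report does not state its hyperparameters.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace X and y with the real features and class column.
X, y = make_classification(n_samples=1473, n_features=9, n_informative=5,
                           n_redundant=0, n_classes=3, random_state=0)


def evaluate_splits(X, y, test_sizes=(0.5, 0.4, 0.2), seed=0):
    """Fit KNN and a decision tree on each split and collect test metrics."""
    results = {}
    for test_size in test_sizes:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        models = {
            "knn": KNeighborsClassifier(n_neighbors=5),         # k is an assumption
            "tree": DecisionTreeClassifier(random_state=seed),  # default settings
        }
        for name, model in models.items():
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            results[(name, test_size)] = {
                "n_test": len(y_test),
                "accuracy": accuracy_score(y_test, y_pred),
                "confusion": confusion_matrix(y_test, y_pred),
            }
    return results


results = evaluate_splits(X, y)
for (name, test_size), r in results.items():
    print(name, test_size, r["n_test"], round(r["accuracy"], 4))
```

With 1473 rows, the three splits give test sets of 737, 590 and 295 rows, so the total of a confusion matrix identifies the split it came from.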
KNN for 50% training and 50% for testing.
Testing accuracy: 49.38941655359565%
Confusion Matrix:
[[207 35 84]
[ 55 65 44]
[ 92 63 92]]
precision: 0.4672335290095506
recall: 0.46792680806517956
f1 score: 0.46679377629552765
KNN for 60% training and 40% for testing.
Testing accuracy: 49.49152542372882%
Confusion Matrix:
[[167 36 58]
[ 48 47 34]
[ 73 49 78]]
precision: 0.464915082194494
recall: 0.46472927618877896
f1 score: 0.463384583000185
KNN for 80% training and 20% for testing.
Testing accuracy: 49.83050847457628%
Confusion Matrix:
[[82 16 32]
[21 27 15]
[41 23 38]]
precision: 0.47519805902158846
recall: 0.47729655964950085
f1 score: 0.4745206364825525
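The reported precision, recall and F1 values are consistent with macro-averaging, that is, the unweighted mean of the per-class scores. The sketch below recomputes them from the KNN confusion matrix of the 50/50 split, assuming rows are true classes and columns are predicted classes (the scikit-learn convention). The function name macro_metrics is illustrative.

```python
import numpy as np

# KNN confusion matrix for the 50/50 split, as reported above.
# Assumed layout: rows = true class, columns = predicted class.
cm = np.array([[207, 35, 84],
               [55, 65, 44],
               [92, 63, 92]])


def macro_metrics(cm):
    """Return accuracy and macro-averaged precision, recall and F1."""
    tp = np.diag(cm).astype(float)    # correct predictions per class
    precision = tp / cm.sum(axis=0)   # divide by predicted totals (columns)
    recall = tp / cm.sum(axis=1)      # divide by true totals (rows)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision.mean(), recall.mean(), f1.mean()


accuracy, precision, recall, f1 = macro_metrics(cm)
error_rate = 1 - accuracy  # classification error rate

print(f"accuracy={accuracy:.4f} error={error_rate:.4f}")
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The classification error rate is simply one minus the accuracy, so it carries the same information on a different scale.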
From the KNN results we can conclude that giving the model more training data slightly
improves the test accuracy (49.39%, 49.49% and 49.83% for the three splits). The same
process is repeated for the decision tree, with the following results.
Decision tree for 50% training and 50% for testing.
Testing accuracy: 56.98778833107191%
Confusion Matrix:
[[210 30 86]
[ 40 66 58]
[ 61 42 144]]
precision: 0.5511673423738291
recall: 0.5432022516494507
f1 score: 0.544914836355079
Decision tree for 60% training and 40% for testing.
Testing accuracy: 57.11864406779661%
Classification error rate: 42.88135593220339%
Confusion Matrix:
[[172 25 64]
[ 34 51 44]
[ 50 36 114]]
precision: 0.5469152187902188
recall: 0.5414508895423089
f1 score: 0.5429660169092897
Decision tree for 80% training and 20% for testing.
Testing accuracy: 54.23728813559322%
Classification error rate: 45.76271186440678%
Confusion Matrix:
[[82 18 30]
[16 20 27]
[30 14 58]]
precision: 0.5098627369007803
recall: 0.5056189997366468
f1 score: 0.5060157378889235
Conclusion:
The dataset is a classification dataset. KNN and decision tree models were each trained
on three different train/test splits. For KNN, test accuracy rose slightly as the
training share increased, while the decision tree did not show a consistent trend.
The decision tree, with a best test accuracy of 57.11864406779661%, was more accurate
than the KNN model on every split.
References:
"Indonesia | Data". Data.worldbank.org. N.p., 2016. Web. 8 Apr. 2016.
Grubinger, Thomas, Achim Zeileis, and Karl-Peter Pfeiffer. "Evtree : Evolutionary Learning
Of Globally Optimal Classification And Regression Trees In R". Journal of Statistical
Software 61.1 (2014): n. pag. Web.
Lim, Tjen-Sien, Wei-Yin Loh, and Yu-Shan Shih. "A Comparison Of Prediction Accuracy,
Complexity, And Training Time Of Thirty-Three Old And New Classification Algorithms".
Machine Learning 40.3 (2000): 203-228. Web. 8 Apr. 2016.