Practical Data Science – COSC2670
Assignment - 2
Title: Practical Data Science Assignment - 2
Author: Sai Chandan V (s3734305)
Contact Details: s3734305@student.rmit.edu.au
Table of Contents
1. Abstract / Executive summary
2. Introduction
3. Methodology
3.1 Data Retrieving
3.2 Data Exploration
3.3 Data Modeling
4. Conclusion
5. References
Abstract/Executive Summary
The dataset used is the Contraceptive Method Choice dataset, which is treated as a
classification task. It was downloaded from the following link
https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice.
This dataset is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey.
It contains data on married women who were either not pregnant or did not know whether
they were pregnant at the time of the interview. The task is to predict a woman's current
contraceptive method choice (no use, short-term methods, or long-term methods) based on
her socio-economic and demographic characteristics. The dataset is multivariate and has
1473 instances and 9 attributes.
Task 1 is data retrieving. The dataset chosen is a classification dataset. After loading,
the data was checked for NaN and other missing values, and none were found. Task 2 is
data exploration, in which each column (minimum 10 columns) is explored and explained
with descriptive statistics and graphs, such as the distribution of a numerical attribute
or the value counts of a categorical attribute. The relationships between pairs of
attributes are then explored, with a focus on selected pairs of columns.
Task 3 is data modeling, where classifiers are trained on three splits of the dataset:
50% for training and 50% for testing, 60% for training and 40% for testing, and 80% for
training and 20% for testing. This is done with the KNN model and the decision tree, and
each is evaluated with the confusion matrix, classification error rate, precision, recall
and F1-score. From this we find that, for KNN, accuracy increases slightly as the share of
training data grows, and that the decision tree is more accurate than KNN on every split.
Introduction:
The dataset is the Contraceptive Method Choice dataset, a subset of the 1987 National
Indonesia Contraceptive Prevalence Survey. The samples are married women who were either
not pregnant or did not know whether they were pregnant at the time the data was
collected. The problem is to predict a woman's current contraceptive method choice (no
use, short-term methods, or long-term methods) based on her socio-economic and
demographic characteristics. The dataset has 1473 instances, 9 attributes plus the class
attribute, and no missing values.
Attributes:
The attributes present in the data set are
1. Wife’s age => (Numerical)
2. Wife’s education => (Categorical) 1=low … 4 = High
3. Husband’s education => (Categorical) 1=low … 4 = High
4. Number of children ever born => (Numerical)
5. Wife’s religion => (binary) 0=Non-Islam, 1=Islam
6. Wife’s now working => (binary) 0=Yes, 1=No
7. Husband’s occupation => (Categorical) 1,2,3,4
8. Standard-of-living index => (Categorical) 1=low … 4 = High
9. Media exposure => (binary) 0=Good, 1=Not good
10. Contraceptive Method used => (Class attribute) 1=No use, 2= Long Term,
3=Short Term.
Methodology:
Data Manipulation:
Import pandas, load the data using pd.read_csv, and add headers for the dataset. Check
for white space, missing values, NaN values and data types. Checking for null values
with rawdata.isnull().sum() gives zero nulls in every column. Since the contraceptive
method choice dataset has none of these problems, we don't have to clean it.
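The loading and checking steps described above can be sketched as follows. This is a minimal illustration, not the report's actual notebook: the column names are assumptions (only WifeReligion and WifeWorking appear in the report), and a few illustrative rows in an in-memory buffer stand in for the downloaded file.

```python
import io
import pandas as pd

# Assumed column names; only WifeReligion and WifeWorking are named in the report.
columns = ["WifeAge", "WifeEducation", "HusbandEducation", "ChildrenBorn",
           "WifeReligion", "WifeWorking", "HusbandOccupation",
           "StandardOfLiving", "MediaExposure", "ContraceptiveMethod"]

# Illustrative rows in a comma-separated layout without a header line.
# With the real file, pass its path to pd.read_csv instead of this buffer.
sample = io.StringIO(
    "24,2,3,3,1,1,2,3,0,1\n"
    "45,1,3,10,1,1,3,4,0,1\n"
    "43,2,3,7,1,1,3,4,0,1\n"
)

# header=None tells pandas the file has no header row; names supplies one.
rawdata = pd.read_csv(sample, header=None, names=columns)

null_counts = rawdata.isnull().sum()  # missing values per column
print(null_counts)
print(rawdata.dtypes)                 # data type of each column
```

If every entry of null_counts is zero and all columns have a numeric dtype, no cleaning step is needed before exploration.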
Data Exploration:
The process starts with a few simple univariate (one-feature) analyses.
There are many ways to classify feature types, but for simplicity let's define two:
numerical and categorical.
Numerical: a feature that has numeric values.
Categorical: a feature that contains text or categories.
Since rawdata stores its categorical variables as numeric codes, we decode those
variables to their labels and check the dataframe. First make a copy of the data as
datasetraw by using datasetraw = rawdata.copy(), then define the labels and replace the
codes in the dataframe.
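# Map the numeric codes to labels; assigning the result back avoids chained inplace edits.
wifereligion = {0: "Non_Islam", 1: "Islam"}
datasetraw["WifeReligion"] = datasetraw["WifeReligion"].replace(wifereligion)
wifeworking = {0: "Yes", 1: "No"}
datasetraw["WifeWorking"] = datasetraw["WifeWorking"].replace(wifeworking)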
Let's create two new dataframes, one for the discretized continuous variables and one
for the continuous variables.
dataset_bin = pd.DataFrame() # dataframe for discretized continuous variables
dataset_con = pd.DataFrame() # dataframe for continuous variables
From these dataframes we create a count plot of the contraceptive method using sns, with
no use, long term and short term on the x-axis and the count on the y-axis.
The next two graphs show the wife's age and the methods followed. The first has the count
on the x-axis and the age ranges on the y-axis; the second shows the usage of the
contraceptive methods depending on the age.
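One way the discretized and continuous views of age could be built is sketched below. The ages, the bin edges, the bin labels and the WifeAge column name are all assumptions made for illustration; the report does not state the binning it used.

```python
import pandas as pd

# Hypothetical ages standing in for the wife's age column (not the real data).
ages = pd.Series([16, 22, 25, 31, 38, 44, 49], name="WifeAge")

dataset_bin = pd.DataFrame()  # dataframe for discretized continuous variables
dataset_con = pd.DataFrame()  # dataframe for continuous variables

# Assumed bin edges; pd.cut uses right-closed intervals by default,
# so the first bin covers ages greater than 15 and up to 25.
edges = [15, 25, 35, 45, 55]
labels = ["16-25", "26-35", "36-45", "46-55"]
dataset_bin["WifeAge"] = pd.cut(ages, bins=edges, labels=labels)
dataset_con["WifeAge"] = ages

# Counts per age range: the numbers a count plot of the binned column would draw.
age_counts = dataset_bin["WifeAge"].value_counts(sort=False)
print(age_counts)
```

Keeping the binned column in dataset_bin and the raw values in dataset_con lets the same attribute feed both a count plot and a distribution plot.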
The graph shows contraceptive method usage by wife's education, with the wife's
education on the x-axis and the count on the y-axis. It shows that the higher the wife's
education, the higher the contraceptive method usage.
The graph shows contraceptive method usage by husband's education, with the husband's
education on the x-axis and the count on the y-axis. The higher the husband's education,
the higher the contraceptive method usage.
The graph shows contraceptive method usage by wife's religion, with the wife's religion
on the x-axis and the count on the y-axis. There are two categories, Islam and non-Islam:
the counts are very high for Islam and low for non-Islam.
The graph shows contraceptive method usage by whether the wife is working, with the
wife's working status on the x-axis and the count on the y-axis. If the wife is working,
contraceptive method usage is lower.
The graph shows contraceptive method usage by husband's education, with the husband's
education on the x-axis and the count on the y-axis. The higher the education, the higher
the wife's contraceptive method usage.
The graph shows contraceptive method usage by standard-of-living index, with the
standard-of-living index on the x-axis and the count on the y-axis. The higher the
standard-of-living index, the higher the contraceptive method usage.
The graph shows contraceptive method usage by media exposure, with the media exposure on
the x-axis and the count on the y-axis. The higher the media exposure, the higher the
contraceptive method usage.
Bivariate Analysis and Multi-Variate Analysis:
The features have been analyzed individually. Now let's combine these features to
understand the interactions between them. The graph shows contraceptive method usage by
number of children born, with the children born on the x-axis and the count on the
y-axis.
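The counts behind plots like these can also be tabulated directly. The sketch below uses made-up records and assumed column names. It builds the count table for one pair of features, plus a row-normalised version, which may be easier to compare when the groups differ a lot in size.

```python
import pandas as pd

# Made-up records; the column names are assumptions, the labels follow the attribute list.
df = pd.DataFrame({
    "WifeEducation": [1, 1, 2, 3, 4, 4, 4, 3],
    "ContraceptiveMethod": ["No use", "No use", "Short term", "No use",
                            "Long term", "Short term", "Long term", "Short term"],
})

# Raw counts per (education, method) pair: what a grouped count plot displays.
counts = pd.crosstab(df["WifeEducation"], df["ContraceptiveMethod"])

# Share of each method within every education level (each row sums to 1).
shares = pd.crosstab(df["WifeEducation"], df["ContraceptiveMethod"],
                     normalize="index")

print(counts)
print(shares.round(2))
```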
This graph shows the contraceptive methods used by wife's education and wife's age. It
shows that when the wife's education is low, contraceptive method usage is low, and the
higher the wife's education, the higher the usage.
This graph shows the contraceptive methods used with media exposure on the x-axis and
wife's age on the y-axis. It shows that when the wife's age is low, contraceptive method
usage is low, and the higher the media exposure, the higher the usage. Another plot
shows media exposure on the x-axis and children born on the y-axis; again, the higher the
media exposure, the higher the contraceptive method usage.
This graph shows the contraceptive methods used by wife's age and children born. It
illustrates the relationship between this pair of features.
Data Modelling:
The task is classification, and the models are trained on three different splits of the
dataset:
1. 50% for training and 50% for testing
2. 60% for training and 40% for testing
3. 80% for training and 20% for testing
The models used are KNN and the decision tree. For each model and split we must find the
confusion matrix, precision, recall and F1 score.
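The three splits and two models can be sketched with scikit-learn as below. This is an illustration under stated assumptions: the data is synthetic (random values with the dataset's shape of 1473 rows and 9 features), and the hyperparameters (n_neighbors=5, default tree settings, random_state=0) are placeholders, since the report does not state the ones it used.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: 1473 rows, 9 features, classes 1 to 3.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(1473, 9))
y = rng.integers(1, 4, size=1473)

results = {}
for test_size in (0.5, 0.4, 0.2):  # 50/50, 60/40 and 80/20 splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0)
    models = (("KNN", KNeighborsClassifier(n_neighbors=5)),
              ("Decision tree", DecisionTreeClassifier(random_state=0)))
    for name, model in models:
        model.fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        results[(name, test_size)] = (len(y_test), acc)
        print(f"{name}, test share {test_size:.0%}: "
              f"n_test={len(y_test)}, accuracy={acc:.3f}")
```

With 1473 rows, these test shares give test sets of 737, 590 and 295 rows, which is a quick way to check which split a reported confusion matrix belongs to: its entries should sum to the test-set size.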
KNN for 50% training and 50% for testing.
Testing accuracy: 49.38941655359565%
Confusion Matrix:
[[207 35 84]
[ 55 65 44]
[ 92 63 92]]
precision: 0.4672335290095506
recall: 0.46792680806517956
f1 score: 0.46679377629552765
KNN for 60% training and 40% for testing.
Testing accuracy: 49.49152542372882%
Confusion Matrix:
[[167 36 58]
[ 48 47 34]
[ 73 49 78]]
precision: 0.464915082194494
recall: 0.46472927618877896
f1 score: 0.463384583000185
KNN for 80% training and 20% for testing.
Testing accuracy: 49.83050847457628%
Confusion Matrix:
[[82 16 32]
[21 27 15]
[41 23 38]]
precision: 0.47519805902158846
recall: 0.47729655964950085
f1 score: 0.4745206364825525
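The reported precision, recall and F1 values are consistent with macro-averaging, that is, computing each metric per class and taking the unweighted mean over the three classes. The sketch below recomputes the metrics from the confusion matrix reported for KNN with the 50%/50% split, assuming rows are true classes and columns are predicted classes.

```python
# Confusion matrix reported for KNN with the 50%/50% split
# (assumed layout: rows are true classes, columns are predicted classes).
cm = [[207, 35, 84],
      [55, 65, 44],
      [92, 63, 92]]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[k][k] for k in range(n_classes))

accuracy = 100 * correct / total  # testing accuracy in percent
error_rate = 100 - accuracy       # classification error rate in percent

precisions, recalls, f1s = [], [], []
for k in range(n_classes):
    tp = cm[k][k]
    predicted_k = sum(cm[i][k] for i in range(n_classes))  # column sum
    actual_k = sum(cm[k])                                  # row sum
    p = tp / predicted_k
    r = tp / actual_k
    precisions.append(p)
    recalls.append(r)
    f1s.append(2 * p * r / (p + r))

# Macro average: unweighted mean of the per-class values.
precision = sum(precisions) / n_classes
recall = sum(recalls) / n_classes
f1 = sum(f1s) / n_classes

print(f"accuracy {accuracy:.2f}%, error rate {error_rate:.2f}%")
print(f"precision {precision:.4f}, recall {recall:.4f}, f1 {f1:.4f}")
```

For this matrix the sketch gives about 49.39% accuracy, precision 0.4672, recall 0.4679 and F1 0.4668, matching the figures reported for that split.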
From the KNN model we can conclude that the more data is given to train the model, the
better the accuracy rate, although the gain is small (from 49.39% with 50% training data
to 49.83% with 80%). The same process is repeated for the decision tree, with the
following results.
Decision tree for 50% training and 50% for testing.
Testing accuracy: 56.98778833107191%
Confusion Matrix:
[[210 30 86]
[ 40 66 58]
[ 61 42 144]]
precision: 0.5511673423738291
recall: 0.5432022516494507
f1 score: 0.544914836355079
Decision tree for 60% training and 40% for testing.
Testing accuracy: 57.11864406779661%
Classification error rate: 42.88135593220339%
Confusion Matrix:
[[172 25 64]
[ 34 51 44]
[ 50 36 114]]
precision: 0.5469152187902188
recall: 0.5414508895423089
f1 score: 0.5429660169092897
Decision tree for 80% training and 20% for testing.
Testing accuracy: 54.23728813559322%
Classification error rate: 45.76271186440678%
Confusion Matrix:
[[82 18 30]
[16 20 27]
[30 14 58]]
precision: 0.5098627369007803
recall: 0.5056189997366468
f1 score: 0.5060157378889235
Conclusion:
The dataset is a classification dataset on which KNN and decision tree models were
trained using three different train/test splits. For KNN, accuracy increases slightly as
the model is given more training data; the decision tree results do not follow as steady
a trend. From the above, the decision tree, with a best test accuracy of
57.11864406779661%, has a better accuracy rate than the KNN model, whose best is
49.83050847457628%.
References:
"Indonesia | Data". Data.worldbank.org. N.p., 2016. Web. 8 Apr. 2016.
Grubinger, Thomas, Achim Zeileis, and Karl-Peter Pfeiffer. "Evtree : Evolutionary Learning
Of Globally Optimal Classification And Regression Trees In R". Journal of Statistical
Software 61.1 (2014): n. pag. Web.
Lim, Tjen-Sien, Wei-Yin Loh, and Yu-Shan Shih. "A Comparison Of Prediction Accuracy,
Complexity, And Training Time Of Thirty-Three Old And New Classification Algorithms".
Machine Learning 40.3 (2000): 203-228. Web. 8 Apr. 2016.

More Related Content

What's hot

Diagnosis of Cancer using Fuzzy Rough Set Theory
Diagnosis of Cancer using Fuzzy Rough Set TheoryDiagnosis of Cancer using Fuzzy Rough Set Theory
Diagnosis of Cancer using Fuzzy Rough Set TheoryIRJET Journal
 
A MODIFIED MAXIMUM RELEVANCE MINIMUM REDUNDANCY FEATURE SELECTION METHOD BASE...
A MODIFIED MAXIMUM RELEVANCE MINIMUM REDUNDANCY FEATURE SELECTION METHOD BASE...A MODIFIED MAXIMUM RELEVANCE MINIMUM REDUNDANCY FEATURE SELECTION METHOD BASE...
A MODIFIED MAXIMUM RELEVANCE MINIMUM REDUNDANCY FEATURE SELECTION METHOD BASE...gerogepatton
 
Hiv Replication Model for The Succeeding Period Of Viral Dynamic Studies In A...
Hiv Replication Model for The Succeeding Period Of Viral Dynamic Studies In A...Hiv Replication Model for The Succeeding Period Of Viral Dynamic Studies In A...
Hiv Replication Model for The Succeeding Period Of Viral Dynamic Studies In A...inventionjournals
 
Prognosticating Autism Spectrum Disorder Using Artificial Neural Network: Lev...
Prognosticating Autism Spectrum Disorder Using Artificial Neural Network: Lev...Prognosticating Autism Spectrum Disorder Using Artificial Neural Network: Lev...
Prognosticating Autism Spectrum Disorder Using Artificial Neural Network: Lev...Avishek Choudhury
 
An Empirical Study on Mushroom Disease Diagnosis:A Data Mining Approach
An Empirical Study on Mushroom Disease Diagnosis:A Data Mining ApproachAn Empirical Study on Mushroom Disease Diagnosis:A Data Mining Approach
An Empirical Study on Mushroom Disease Diagnosis:A Data Mining ApproachIRJET Journal
 
Statistical concepts
Statistical conceptsStatistical concepts
Statistical conceptsCarlo Magno
 
Factors Associated with Antenatal Care Service Utilization among Women with C...
Factors Associated with Antenatal Care Service Utilization among Women with C...Factors Associated with Antenatal Care Service Utilization among Women with C...
Factors Associated with Antenatal Care Service Utilization among Women with C...YogeshIJTSRD
 
MLTDD : USE OF MACHINE LEARNING TECHNIQUES FOR DIAGNOSIS OF THYROID GLAND DIS...
MLTDD : USE OF MACHINE LEARNING TECHNIQUES FOR DIAGNOSIS OF THYROID GLAND DIS...MLTDD : USE OF MACHINE LEARNING TECHNIQUES FOR DIAGNOSIS OF THYROID GLAND DIS...
MLTDD : USE OF MACHINE LEARNING TECHNIQUES FOR DIAGNOSIS OF THYROID GLAND DIS...cscpconf
 
A Magnified Application of Deficient Data Using Bolzano Classifier
A Magnified Application of Deficient Data Using Bolzano ClassifierA Magnified Application of Deficient Data Using Bolzano Classifier
A Magnified Application of Deficient Data Using Bolzano Classifierjournal ijrtem
 
The methodology for handling missing data during development of predictive model
The methodology for handling missing data during development of predictive modelThe methodology for handling missing data during development of predictive model
The methodology for handling missing data during development of predictive modelpingxiaoou
 
The methodology for handling missing data during development of predictive model
The methodology for handling missing data during development of predictive modelThe methodology for handling missing data during development of predictive model
The methodology for handling missing data during development of predictive modelpingxiaoou
 
Does the perception of trust in the usefulness or ease of use of a HIS (Healt...
Does the perception of trust in the usefulness or ease of use of a HIS (Healt...Does the perception of trust in the usefulness or ease of use of a HIS (Healt...
Does the perception of trust in the usefulness or ease of use of a HIS (Healt...Monica Barrowman MacFadyen
 
Prediction of Dengue, Diabetes and Swine Flu using Random Forest Classificati...
Prediction of Dengue, Diabetes and Swine Flu using Random Forest Classificati...Prediction of Dengue, Diabetes and Swine Flu using Random Forest Classificati...
Prediction of Dengue, Diabetes and Swine Flu using Random Forest Classificati...IRJET Journal
 
Bayesian random effects meta-analysis model for normal data - Pubrica
Bayesian random effects meta-analysis model for normal data - PubricaBayesian random effects meta-analysis model for normal data - Pubrica
Bayesian random effects meta-analysis model for normal data - PubricaPubrica
 
IRJET- The Prediction of Heart Disease using Naive Bayes Classifier
IRJET- The Prediction of Heart Disease using Naive Bayes ClassifierIRJET- The Prediction of Heart Disease using Naive Bayes Classifier
IRJET- The Prediction of Heart Disease using Naive Bayes ClassifierIRJET Journal
 
Breast cancer diagnosis and recurrence prediction using machine learning tech...
Breast cancer diagnosis and recurrence prediction using machine learning tech...Breast cancer diagnosis and recurrence prediction using machine learning tech...
Breast cancer diagnosis and recurrence prediction using machine learning tech...eSAT Journals
 
Breast cancer classification
Breast cancer classificationBreast cancer classification
Breast cancer classificationAshwan Abdulmunem
 
September Journal Club -Aishwarya
September Journal Club -AishwaryaSeptember Journal Club -Aishwarya
September Journal Club -AishwaryaRSG Luxembourg
 
Big Data Analytics for Healthcare
Big Data Analytics for HealthcareBig Data Analytics for Healthcare
Big Data Analytics for HealthcareChandan Reddy
 

What's hot (19)

Diagnosis of Cancer using Fuzzy Rough Set Theory
Diagnosis of Cancer using Fuzzy Rough Set TheoryDiagnosis of Cancer using Fuzzy Rough Set Theory
Diagnosis of Cancer using Fuzzy Rough Set Theory
 
A MODIFIED MAXIMUM RELEVANCE MINIMUM REDUNDANCY FEATURE SELECTION METHOD BASE...
A MODIFIED MAXIMUM RELEVANCE MINIMUM REDUNDANCY FEATURE SELECTION METHOD BASE...A MODIFIED MAXIMUM RELEVANCE MINIMUM REDUNDANCY FEATURE SELECTION METHOD BASE...
A MODIFIED MAXIMUM RELEVANCE MINIMUM REDUNDANCY FEATURE SELECTION METHOD BASE...
 
Hiv Replication Model for The Succeeding Period Of Viral Dynamic Studies In A...
Hiv Replication Model for The Succeeding Period Of Viral Dynamic Studies In A...Hiv Replication Model for The Succeeding Period Of Viral Dynamic Studies In A...
Hiv Replication Model for The Succeeding Period Of Viral Dynamic Studies In A...
 
Prognosticating Autism Spectrum Disorder Using Artificial Neural Network: Lev...
Prognosticating Autism Spectrum Disorder Using Artificial Neural Network: Lev...Prognosticating Autism Spectrum Disorder Using Artificial Neural Network: Lev...
Prognosticating Autism Spectrum Disorder Using Artificial Neural Network: Lev...
 
An Empirical Study on Mushroom Disease Diagnosis:A Data Mining Approach
An Empirical Study on Mushroom Disease Diagnosis:A Data Mining ApproachAn Empirical Study on Mushroom Disease Diagnosis:A Data Mining Approach
An Empirical Study on Mushroom Disease Diagnosis:A Data Mining Approach
 
Statistical concepts
Statistical conceptsStatistical concepts
Statistical concepts
 
Factors Associated with Antenatal Care Service Utilization among Women with C...
Factors Associated with Antenatal Care Service Utilization among Women with C...Factors Associated with Antenatal Care Service Utilization among Women with C...
Factors Associated with Antenatal Care Service Utilization among Women with C...
 
MLTDD : USE OF MACHINE LEARNING TECHNIQUES FOR DIAGNOSIS OF THYROID GLAND DIS...
MLTDD : USE OF MACHINE LEARNING TECHNIQUES FOR DIAGNOSIS OF THYROID GLAND DIS...MLTDD : USE OF MACHINE LEARNING TECHNIQUES FOR DIAGNOSIS OF THYROID GLAND DIS...
MLTDD : USE OF MACHINE LEARNING TECHNIQUES FOR DIAGNOSIS OF THYROID GLAND DIS...
 
A Magnified Application of Deficient Data Using Bolzano Classifier
A Magnified Application of Deficient Data Using Bolzano ClassifierA Magnified Application of Deficient Data Using Bolzano Classifier
A Magnified Application of Deficient Data Using Bolzano Classifier
 
The methodology for handling missing data during development of predictive model
The methodology for handling missing data during development of predictive modelThe methodology for handling missing data during development of predictive model
The methodology for handling missing data during development of predictive model
 
The methodology for handling missing data during development of predictive model
The methodology for handling missing data during development of predictive modelThe methodology for handling missing data during development of predictive model
The methodology for handling missing data during development of predictive model
 
Does the perception of trust in the usefulness or ease of use of a HIS (Healt...
Does the perception of trust in the usefulness or ease of use of a HIS (Healt...Does the perception of trust in the usefulness or ease of use of a HIS (Healt...
Does the perception of trust in the usefulness or ease of use of a HIS (Healt...
 
Prediction of Dengue, Diabetes and Swine Flu using Random Forest Classificati...
Prediction of Dengue, Diabetes and Swine Flu using Random Forest Classificati...Prediction of Dengue, Diabetes and Swine Flu using Random Forest Classificati...
Prediction of Dengue, Diabetes and Swine Flu using Random Forest Classificati...
 
Bayesian random effects meta-analysis model for normal data - Pubrica
Bayesian random effects meta-analysis model for normal data - PubricaBayesian random effects meta-analysis model for normal data - Pubrica
Bayesian random effects meta-analysis model for normal data - Pubrica
 
IRJET- The Prediction of Heart Disease using Naive Bayes Classifier
IRJET- The Prediction of Heart Disease using Naive Bayes ClassifierIRJET- The Prediction of Heart Disease using Naive Bayes Classifier
IRJET- The Prediction of Heart Disease using Naive Bayes Classifier
 
Breast cancer diagnosis and recurrence prediction using machine learning tech...
Breast cancer diagnosis and recurrence prediction using machine learning tech...Breast cancer diagnosis and recurrence prediction using machine learning tech...
Breast cancer diagnosis and recurrence prediction using machine learning tech...
 
Breast cancer classification
Breast cancer classificationBreast cancer classification
Breast cancer classification
 
September Journal Club -Aishwarya
September Journal Club -AishwaryaSeptember Journal Club -Aishwarya
September Journal Club -Aishwarya
 
Big Data Analytics for Healthcare
Big Data Analytics for HealthcareBig Data Analytics for Healthcare
Big Data Analytics for Healthcare
 

Similar to Sample Data Preparation

Prediction of Diabetes using Probability Approach
Prediction of Diabetes using Probability ApproachPrediction of Diabetes using Probability Approach
Prediction of Diabetes using Probability ApproachIRJET Journal
 
Data science notes for ASDS calicut 2.pptx
Data science notes for ASDS calicut 2.pptxData science notes for ASDS calicut 2.pptx
Data science notes for ASDS calicut 2.pptxswapnaraghav
 
IRJET- Breast Cancer Prediction using Supervised Machine Learning Algorithms
IRJET- Breast Cancer Prediction using Supervised Machine Learning AlgorithmsIRJET- Breast Cancer Prediction using Supervised Machine Learning Algorithms
IRJET- Breast Cancer Prediction using Supervised Machine Learning AlgorithmsIRJET Journal
 
Development of Cognitive Instruments in Epidemiology Using Asyncronous Methods
Development of Cognitive Instruments in Epidemiology Using Asyncronous MethodsDevelopment of Cognitive Instruments in Epidemiology Using Asyncronous Methods
Development of Cognitive Instruments in Epidemiology Using Asyncronous MethodsAJHSSR Journal
 
1. What is a codebook Who might create one, what would she or h.docx
1.  What is a codebook  Who might create one, what would she or h.docx1.  What is a codebook  Who might create one, what would she or h.docx
1. What is a codebook Who might create one, what would she or h.docxSONU61709
 
Exploring the performance of feature selection method using breast cancer dat...
Exploring the performance of feature selection method using breast cancer dat...Exploring the performance of feature selection method using breast cancer dat...
Exploring the performance of feature selection method using breast cancer dat...nooriasukmaningtyas
 
Hybrid filtering methods for feature selection in high-dimensional cancer data
Hybrid filtering methods for feature selection in high-dimensional cancer dataHybrid filtering methods for feature selection in high-dimensional cancer data
Hybrid filtering methods for feature selection in high-dimensional cancer dataIJECEIAES
 
Two Layer k-means based Consensus Clustering for Rural Health Information System
Two Layer k-means based Consensus Clustering for Rural Health Information SystemTwo Layer k-means based Consensus Clustering for Rural Health Information System
Two Layer k-means based Consensus Clustering for Rural Health Information SystemIRJET Journal
 
IRJET- Breast Cancer Relapse Prognosis by Classic and Modern Structures o...
IRJET-  	  Breast Cancer Relapse Prognosis by Classic and Modern Structures o...IRJET-  	  Breast Cancer Relapse Prognosis by Classic and Modern Structures o...
IRJET- Breast Cancer Relapse Prognosis by Classic and Modern Structures o...IRJET Journal
 
"Predictive Modelling for Overweight and Obesity: Harnessing Machine Learning...
"Predictive Modelling for Overweight and Obesity: Harnessing Machine Learning..."Predictive Modelling for Overweight and Obesity: Harnessing Machine Learning...
"Predictive Modelling for Overweight and Obesity: Harnessing Machine Learning...IRJET Journal
 
IRJET - Prediction and Detection of Diabetes using Machine Learning
IRJET - Prediction and Detection of Diabetes using Machine LearningIRJET - Prediction and Detection of Diabetes using Machine Learning
IRJET - Prediction and Detection of Diabetes using Machine LearningIRJET Journal
 
Analysis of Imbalanced Classification Algorithms A Perspective View
Analysis of Imbalanced Classification Algorithms A Perspective ViewAnalysis of Imbalanced Classification Algorithms A Perspective View
Analysis of Imbalanced Classification Algorithms A Perspective Viewijtsrd
 
ESTIMATING FETAL WEIGHT AT VARYING GESTATIONAL AGE USING MACHINE LEARNING
ESTIMATING FETAL WEIGHT AT VARYING GESTATIONAL AGE USING MACHINE LEARNINGESTIMATING FETAL WEIGHT AT VARYING GESTATIONAL AGE USING MACHINE LEARNING
ESTIMATING FETAL WEIGHT AT VARYING GESTATIONAL AGE USING MACHINE LEARNINGIRJET Journal
 
Unit 7 ‒ Scientific Knowledge, Contributions, an
Unit 7 ‒ Scientific Knowledge, Contributions, anUnit 7 ‒ Scientific Knowledge, Contributions, an
Unit 7 ‒ Scientific Knowledge, Contributions, analisondakintxt
 
Unit 7 ‒ Scientific Knowledge, Contributions, an
                Unit 7 ‒ Scientific Knowledge, Contributions, an                Unit 7 ‒ Scientific Knowledge, Contributions, an
Unit 7 ‒ Scientific Knowledge, Contributions, andrennanmicah
 
ENSEMBLE LEARNING MODEL FOR SCREENING AUTISM IN CHILDREN
ENSEMBLE LEARNING MODEL FOR SCREENING AUTISM IN CHILDRENENSEMBLE LEARNING MODEL FOR SCREENING AUTISM IN CHILDREN
ENSEMBLE LEARNING MODEL FOR SCREENING AUTISM IN CHILDRENijcsit
 
ENSEMBLE LEARNING MODEL FOR SCREENING AUTISM IN CHILDREN
ENSEMBLE LEARNING MODEL FOR SCREENING AUTISM IN CHILDRENENSEMBLE LEARNING MODEL FOR SCREENING AUTISM IN CHILDREN
ENSEMBLE LEARNING MODEL FOR SCREENING AUTISM IN CHILDRENAIRCC Publishing Corporation
 
Factors Influencing Postnatal Monitoring in the Bafang Health District (West ...
Factors Influencing Postnatal Monitoring in the Bafang Health District (West ...Factors Influencing Postnatal Monitoring in the Bafang Health District (West ...
Factors Influencing Postnatal Monitoring in the Bafang Health District (West ...Healthcare and Medical Sciences
 
Multiple Linear Regression Homework Help
Multiple Linear Regression Homework HelpMultiple Linear Regression Homework Help
Multiple Linear Regression Homework HelpExcel Homework Help
 

Similar to Sample Data Preparation (20)

Prediction of Diabetes using Probability Approach
Prediction of Diabetes using Probability ApproachPrediction of Diabetes using Probability Approach
Prediction of Diabetes using Probability Approach
 
Data science notes for ASDS calicut 2.pptx
Data science notes for ASDS calicut 2.pptxData science notes for ASDS calicut 2.pptx
Data science notes for ASDS calicut 2.pptx
 
IRJET- Breast Cancer Prediction using Supervised Machine Learning Algorithms
IRJET- Breast Cancer Prediction using Supervised Machine Learning AlgorithmsIRJET- Breast Cancer Prediction using Supervised Machine Learning Algorithms
IRJET- Breast Cancer Prediction using Supervised Machine Learning Algorithms
 
Development of Cognitive Instruments in Epidemiology Using Asyncronous Methods
Development of Cognitive Instruments in Epidemiology Using Asyncronous MethodsDevelopment of Cognitive Instruments in Epidemiology Using Asyncronous Methods
Development of Cognitive Instruments in Epidemiology Using Asyncronous Methods
 
1. What is a codebook Who might create one, what would she or h.docx
1.  What is a codebook  Who might create one, what would she or h.docx1.  What is a codebook  Who might create one, what would she or h.docx
1. What is a codebook Who might create one, what would she or h.docx
 
Multiple Linear Regression Homework Help
Multiple Linear Regression Homework HelpMultiple Linear Regression Homework Help
Multiple Linear Regression Homework Help
 
Exploring the performance of feature selection method using breast cancer dat...
Exploring the performance of feature selection method using breast cancer dat...Exploring the performance of feature selection method using breast cancer dat...
Exploring the performance of feature selection method using breast cancer dat...
 
Hybrid filtering methods for feature selection in high-dimensional cancer data
Hybrid filtering methods for feature selection in high-dimensional cancer dataHybrid filtering methods for feature selection in high-dimensional cancer data
Hybrid filtering methods for feature selection in high-dimensional cancer data
 
Two Layer k-means based Consensus Clustering for Rural Health Information System
Two Layer k-means based Consensus Clustering for Rural Health Information SystemTwo Layer k-means based Consensus Clustering for Rural Health Information System
Two Layer k-means based Consensus Clustering for Rural Health Information System
 
IRJET- Breast Cancer Relapse Prognosis by Classic and Modern Structures o...
Sample Data Preparation

  • 1. Practical Data Science – COSC2670, Assignment 2. Title: Practical Data Science Assignment - 2. Author: Sai Chandan V (s3734305). Contact Details: s3734305@student.rmit.edu.au
  • 2. Table of Contents: 1. Abstract / Executive Summary (3); 2. Introduction (4); 3. Methodology (5): 3.1 Data Retrieving, 3.2 Data Exploration, 3.3 Data Modeling; 4. Conclusion (15); 5. References (15)
  • 3. Abstract/Executive Summary: The dataset used is Contraceptive Method Choice, treated here as a classification task. It was downloaded from https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice. The dataset is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. It covers married women who were either not pregnant or did not know whether they were pregnant at the time of the survey. The goal is to predict each woman's current contraceptive method choice (no use, short-term, or long-term) from her socio-economic and demographic characteristics. The dataset is multivariate, with 1473 instances and 9 attributes. Task 1 is data retrieval: the dataset was loaded and checked for NaN and other missing values, and none were found. Task 2 is data exploration: each column is explored (minimum 10 columns) and described with descriptive statistics and graphs, such as the distribution of a numerical attribute or the value counts of a categorical attribute, followed by the relationships between pairs of attributes. Task 3 is data modeling: classifiers are trained on three splits, 50% training / 50% testing, 60% / 40%, and 80% / 20%. KNN and decision tree models are evaluated with the confusion matrix, classification error rate, precision, recall, and F1 score. The results suggest that accuracy tends to improve as more data is used for training.
  • 4. Introduction: The dataset is the Contraceptive Method Choice dataset, a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. The samples are married women who were either not pregnant or did not know whether they were pregnant at the time the data was collected. The problem is to predict a woman's current contraceptive method choice (no use, short-term methods, or long-term methods) from her socio-economic and demographic characteristics. The dataset has 1473 instances, 9 attributes plus the class attribute, and no missing values. Attributes: 1. Wife's age (numerical); 2. Wife's education (categorical) 1 = low … 4 = high; 3. Husband's education (categorical) 1 = low … 4 = high; 4. Number of children ever born (numerical); 5. Wife's religion (binary) 0 = Non-Islam, 1 = Islam; 6. Wife now working (binary) 0 = Yes, 1 = No; 7. Husband's occupation (categorical) 1, 2, 3, 4; 8. Standard-of-living index (categorical) 1 = low … 4 = high; 9. Media exposure (binary) 0 = Good, 1 = Not good; 10. Contraceptive method used (class attribute) 1 = No use, 2 = Long term, 3 = Short term.
  • 5. Methodology: Data Manipulation: Import pandas, load the data with pd.read_csv, and add headers to the dataset. Check for white space, missing values, NaN values, and data types. The null check rawdata.isnull().sum() returns zero for every column. Since the dataset has none of these problems, no cleaning is needed. Data Exploration: The process starts with some simple univariate (one-feature) analysis. There are many ways to classify feature types, but for simplicity we define two. Numerical: a feature that has numeric values. Categorical: a feature that contains text or categories. The categorical variables in rawdata are stored as numeric codes, so we decode them to their labels and check the dataframe, working on a copy made with datasetraw = rawdata.copy(). Defining the labels and replacing them in the data frame: wifereligion = {0:"Non_Islam", 1:"Islam"}; datasetraw.WifeReligion.replace(wifereligion, inplace=True); wifeworking = {0:"Yes", 1:"No"}; datasetraw.WifeWorking.replace(wifeworking, inplace=True)
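The loading, null check, and label decoding described in this slide can be sketched as follows. This is a minimal sketch, not the author's notebook: the four inline rows are made up for illustration in the dataset's ten-column layout, and every column name other than `WifeReligion` and `WifeWorking` is an assumption. It uses `map` and assigns the result back instead of calling `replace(..., inplace=True)` on a column.

```python
import io

import pandas as pd

# Hypothetical column names; only WifeReligion and WifeWorking appear in the slides.
COLUMNS = [
    "WifeAge", "WifeEducation", "HusbandEducation", "ChildrenBorn",
    "WifeReligion", "WifeWorking", "HusbandOccupation", "SolIndex",
    "MediaExposure", "ContraceptiveMethod",
]

# Made-up rows in the same layout as the headerless data file (illustration only).
SAMPLE_CSV = """24,2,3,3,1,1,2,3,0,1
45,1,3,10,1,1,3,4,0,1
36,4,4,2,0,0,1,4,0,2
29,3,3,1,1,1,2,2,1,3
"""

# The file has no header row, so the column names are supplied on load.
rawdata = pd.read_csv(io.StringIO(SAMPLE_CSV), header=None, names=COLUMNS)

# Count missing values per column; every count should be zero.
null_counts = rawdata.isnull().sum()

# Decode the binary codes to labels on a copy, leaving rawdata untouched.
datasetraw = rawdata.copy()
wifereligion = {0: "Non_Islam", 1: "Islam"}
wifeworking = {0: "Yes", 1: "No"}
datasetraw["WifeReligion"] = datasetraw["WifeReligion"].map(wifereligion)
datasetraw["WifeWorking"] = datasetraw["WifeWorking"].map(wifeworking)

print(null_counts.sum())
print(datasetraw[["WifeReligion", "WifeWorking"]])
```

With the real file, the `io.StringIO(SAMPLE_CSV)` argument would be replaced by the path to the downloaded data.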
  • 6. Next, create two new data frames: dataset_bin = pd.DataFrame() for the discretized continuous variable and dataset_con = pd.DataFrame() for the continuous variable. From these data frames we plot the contraceptive method with sns, with no use, long term, and short term on the x-axis and the count on the y-axis. The next graphs show the wife's age against the method used: the first has the count on the x-axis and the age bins on the y-axis, and the second shows contraceptive method usage by age.
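One way to fill the two data frames is to keep the raw age in `dataset_con` and a binned copy in `dataset_bin`. The sketch below uses `pd.cut`; the bin edges, the labels, the `WifeAge` column name, and the sample ages are all assumptions, since the slides do not state how the age was discretized.

```python
import pandas as pd

# Hypothetical ages standing in for the wife's age column.
ages = pd.Series([16, 24, 29, 36, 45, 49], name="WifeAge")

dataset_bin = pd.DataFrame()  # dataframe for discretized continuous variable
dataset_con = pd.DataFrame()  # dataframe for continuous variable

# Keep the continuous values as they are.
dataset_con["WifeAge"] = ages

# Assumed bin edges; intervals are right-closed, so (15, 25] holds ages 16 to 25.
edges = [15, 25, 35, 45, 55]
labels = ["16-25", "26-35", "36-45", "46-55"]
dataset_bin["WifeAge"] = pd.cut(ages, bins=edges, labels=labels)

# Number of women in each age bin, in bin order.
print(dataset_bin["WifeAge"].value_counts(sort=False))
```

The binned column is categorical, so it can be passed to a count plot in the same way as the other categorical attributes.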
  • 7. The graph shows contraceptive method usage by wife's education, with wife's education on the x-axis and the count on the y-axis. The higher the woman's education, the higher the contraceptive method usage. The next graph shows contraceptive method usage by husband's education, with husband's education on the x-axis and the count on the y-axis. The higher the husband's education, the higher the contraceptive method usage.
  • 8. The graph shows contraceptive method usage by wife's religion, with wife's religion on the x-axis and the count on the y-axis. There are two categories, Islam and non-Islam: usage is very high in the Islam group and low in the non-Islam group. The next graph shows contraceptive method usage by whether the wife is working, with wife working on the x-axis and the count on the y-axis. Usage is lower when the wife is working.
  • 9. The graph shows contraceptive method usage by husband's education, with husband's education on the x-axis and the count on the y-axis. The higher the husband's education, the higher the wife's contraceptive method usage. The next graph shows contraceptive method usage by standard-of-living index, with the index on the x-axis and the count on the y-axis. The higher the standard-of-living index, the higher the contraceptive method usage.
  • 10. The graph shows contraceptive method usage by media exposure, with media exposure on the x-axis and the count on the y-axis. The higher the media exposure, the higher the contraceptive method usage. Bivariate and Multivariate Analysis: The features have so far been analyzed individually. Now let's combine them to understand the interactions between them. The graph shows contraceptive method usage by number of children born, with children born on the x-axis and the count on the y-axis.
  • 11. This graph shows the contraceptive methods used by wife's education and wife's age. Usage is low when the wife's education is low and higher when it is high. The next graph shows the contraceptive methods used with media exposure on the x-axis and wife's age on the y-axis. Usage is low when the wife's age is low, and it increases with media exposure. Another plot has media exposure on the x-axis and children born on the y-axis: the higher the media exposure, the higher the contraceptive method usage.
  • 12. This graph shows the contraceptive methods used by wife's age and children born, illustrating the relationship between this pair of features.
  • 13. Data Modelling: This is a classification dataset, and the models are trained with three different splits: 1. 50% for training and 50% for testing; 2. 60% for training and 40% for testing; 3. 80% for training and 20% for testing. The models used are KNN and a decision tree. For each we report the confusion matrix, precision, recall, and F1 score. KNN with 50% training and 50% testing. Testing accuracy: 49.38941655359565% Confusion Matrix: [[207 35 84] [ 55 65 44] [ 92 63 92]] precision: 0.4672335290095506 recall: 0.46792680806517956 f1 score: 0.46679377629552765 KNN with 60% training and 40% testing. Testing accuracy: 49.49152542372882% Confusion Matrix: [[167 36 58] [ 48 47 34] [ 73 49 78]] precision: 0.464915082194494 recall: 0.46472927618877896 f1 score: 0.463384583000185
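The metrics reported for each model can be recomputed from its confusion matrix. The sketch below does this in plain Python for the KNN 50/50 matrix, assuming rows are true classes and columns are predicted classes. The reported precision, recall, and F1 values are reproduced by macro averaging (the unweighted mean over the three classes); that is an inference from the numbers, not something the slides state.

```python
# Confusion matrix reported for KNN with the 50/50 split
# (assumed layout: rows are true classes, columns are predicted classes).
confusion = [
    [207, 35, 84],
    [55, 65, 44],
    [92, 63, 92],
]

n_classes = len(confusion)
total = sum(sum(row) for row in confusion)
correct = sum(confusion[k][k] for k in range(n_classes))

accuracy = correct / total
error_rate = 1 - accuracy  # classification error rate

precisions, recalls, f1_scores = [], [], []
for k in range(n_classes):
    true_positive = confusion[k][k]
    predicted_k = sum(confusion[i][k] for i in range(n_classes))  # column sum
    actual_k = sum(confusion[k])  # row sum
    p = true_positive / predicted_k
    r = true_positive / actual_k
    precisions.append(p)
    recalls.append(r)
    f1_scores.append(2 * p * r / (p + r))

# Macro average: unweighted mean over the classes.
macro_precision = sum(precisions) / n_classes
macro_recall = sum(recalls) / n_classes
macro_f1 = sum(f1_scores) / n_classes

print(f"accuracy:   {accuracy:.4f}")
print(f"error rate: {error_rate:.4f}")
print(f"precision:  {macro_precision:.4f}")
print(f"recall:     {macro_recall:.4f}")
print(f"f1 score:   {macro_f1:.4f}")
```

Swapping in any of the other reported matrices gives the corresponding figures for that model and split.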
  • 14. KNN with 80% training and 20% testing. Testing accuracy: 49.83050847457628% Confusion Matrix: [[82 16 32] [21 27 15] [41 23 38]] precision: 0.47519805902158846 recall: 0.47729655964950085 f1 score: 0.4745206364825525 For the KNN model, accuracy rises slightly as more data is given for training, from 49.39% with the 50/50 split to 49.83% with the 80/20 split. The same process is repeated for the decision tree, giving the following results. Decision tree with 50% training and 50% testing. Testing accuracy: 56.98778833107191% Confusion Matrix: [[210 30 86] [ 40 66 58] [ 61 42 144]] precision: 0.5511673423738291 recall: 0.5432022516494507 f1 score: 0.544914836355079 Decision tree with 60% training and 40% testing. Testing accuracy: 54.23728813559322% Classification error rate: 45.76271186440678% Confusion Matrix: [[82 18 30] [16 20 27] [30 14 58]]
  • 15. precision: 0.5098627369007803 recall: 0.5056189997366468 f1 score: 0.5060157378889235 Decision tree with 80% training and 20% testing. Testing accuracy: 57.11864406779661% Confusion Matrix: [[172 25 64] [ 34 51 44] [ 50 36 114]] Classification error rate: 42.88135593220339% precision: 0.5469152187902188 recall: 0.5414508895423089 f1 score: 0.5429660169092897 Conclusion: The dataset is a classification dataset on which KNN and decision tree models were trained using three different train/test splits. From this we conclude that the more training data a model has, the higher its accuracy tends to be. The decision tree, with a test accuracy of 57.11864406779661%, has a better accuracy rate than the KNN model. References: "Indonesia | Data". Data.worldbank.org. N.p., 2016. Web. 8 Apr. 2016. Grubinger, Thomas, Achim Zeileis, and Karl-Peter Pfeiffer. "Evtree: Evolutionary Learning Of Globally Optimal Classification And Regression Trees In R". Journal of Statistical Software 61.1 (2014): n. pag. Web.
  • 16. Lim, Tjen-Sien, Wei-Yin Loh, and Yu-Shan Shih. "A Comparison Of Prediction Accuracy, Complexity, And Training Time Of Thirty-Three Old And New Classification Algorithms". Machine Learning 40.3 (2000): 203-228. Web. 8 Apr. 2016.
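As a consistency check on the Data Modelling results, the number of test samples implied by each split can be compared with the total of each reported confusion matrix. The sketch below assumes the test portion of the 1473 instances is rounded up, which is how scikit-learn's `train_test_split` treats a fractional test size; the slides do not say which splitting routine was used, and `holdout_size` is a hypothetical helper name.

```python
import math

N_SAMPLES = 1473  # instances in the dataset


def holdout_size(n_samples: int, test_fraction: float) -> int:
    """Test-set size, assuming the test portion is rounded up."""
    return math.ceil(n_samples * test_fraction)


expected = {frac: holdout_size(N_SAMPLES, frac) for frac in (0.5, 0.4, 0.2)}

# Confusion matrices keyed by (model, test fraction) as labelled in the slides.
reported = {
    ("KNN", 0.5): [[207, 35, 84], [55, 65, 44], [92, 63, 92]],
    ("KNN", 0.4): [[167, 36, 58], [48, 47, 34], [73, 49, 78]],
    ("KNN", 0.2): [[82, 16, 32], [21, 27, 15], [41, 23, 38]],
    ("Decision tree", 0.5): [[210, 30, 86], [40, 66, 58], [61, 42, 144]],
    ("Decision tree", 0.4): [[82, 18, 30], [16, 20, 27], [30, 14, 58]],
    ("Decision tree", 0.2): [[172, 25, 64], [34, 51, 44], [50, 36, 114]],
}

# Each matrix total is the number of test samples that were classified.
totals = {key: sum(map(sum, matrix)) for key, matrix in reported.items()}

for (model, frac), total in totals.items():
    status = "matches" if total == expected[frac] else "differs from"
    print(f"{model}, test fraction {frac}: {total} samples, "
          f"{status} expected {expected[frac]}")
```

Under this assumption the KNN totals agree with their labels, while the decision tree matrices labelled 60/40 and 80/20 total 295 and 590 samples respectively, which suggests those two labels may have been swapped in the slides.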