DATA ANALYSIS COLLECTION
ASSIGNMENT
Data Analysis And Interpretation Specialization
Running a classification tree
Andrea Rubio Amorós
June 26, 2017
Module 4
Assignment 1
Data Analysis And Interpretation Specialization
Running a classification tree M4A1
1 General
In this session, you will learn about decision trees, a type of data mining algorithm that can select, from among a large
number of variables, those variables and interactions that are most important in predicting the target or response
variable. Decision trees create segmentations or subgroups in the data by repeatedly applying a series of simple rules
or criteria, choosing the variable constellations that best predict the target variable.
Document written in LaTeX
template_version_01.tex
2 Description
GENDER <= 0.5, gini = 0.2736, samples = 25855, value = [21626, 4229]
  True: DIVORCED <= 0.5, gini = 0.2091, samples = 11122, value = [9803, 1319]
    FIRED <= 0.5, gini = 0.198, samples = 10422, value = [9261, 1161]
      gini = 0.1929, samples = 9691, value = [8643, 1048]
      gini = 0.2614, samples = 731, value = [618, 113]
    FIRED <= 0.5, gini = 0.3495, samples = 700, value = [542, 158]
      gini = 0.3399, samples = 585, value = [458, 127]
      gini = 0.3938, samples = 115, value = [84, 31]
  False: DIVORCED <= 0.5, gini = 0.317, samples = 14733, value = [11823, 2910]
    FIRED <= 0.5, gini = 0.3028, samples = 13713, value = [11163, 2550]
      gini = 0.296, samples = 13088, value = [10724, 2364]
      gini = 0.4181, samples = 625, value = [439, 186]
    FIRED <= 0.5, gini = 0.4567, samples = 1020, value = [660, 360]
      gini = 0.4487, samples = 877, value = [579, 298]
      gini = 0.4912, samples = 143, value = [81, 62]
Figure 2.1 Decision tree
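The gini value shown in each node of Figure 2.1 can be reproduced from that node's class counts. The following is a minimal sketch, assuming the standard Gini impurity 1 - sum(p_k^2), which is the default split criterion of scikit-learn's DecisionTreeClassifier. The helper name gini_impurity is illustrative and not part of the original code.

```python
def gini_impurity(counts):
    """Gini impurity 1 - sum(p_k^2) for a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# class counts [no depression, depression] read from Figure 2.1
nodes = {
    "root": [21626, 4229],
    "leaf with lowest gini": [8643, 1048],
    "leaf with highest gini": [81, 62],
}

for name, counts in nodes.items():
    print(name, round(gini_impurity(counts), 4))
```

A value near 0 means a node contains almost only one class, while 0.5 is the maximum for two classes (a 50/50 mix).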
Today, I will build a decision tree (statistical model) to study supervised prediction problems. For that, I will work
with the NESARC data set.
First, I set my explanatory and response variables (binary, categorical).
Explanatory variables (predictors):
• GENDER: Sex (0=Male / 1=Female)
• DIVORCED: got divorced in last 12 months (0=no / 1=yes)
• FIRED: fired in last 12 months (0=no / 1=yes)
Response variable (target):
• MAJORDEPP12: major depression in last 12 months (0=no / 1=yes)
Then, I use the train_test_split function to split the predictors and the target into a training sample and a test sample:
train sample
(25855, 3)
test sample
(17238, 3)
• The train sample has 25855 observations or rows, 60% of the original sample, and 3 explanatory variables.
• The test sample has 17238 observations or rows, 40% of the original sample, and again 3 explanatory variables
or columns.
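The reported sample sizes follow directly from a 40% test share of the 43093 observations (25855 + 17238). The sketch below is a plain-Python illustration of a shuffled 60/40 split, not the scikit-learn implementation. Rounding the test share up and the fixed seed are assumptions chosen here to reproduce the reported sizes and to make the split repeatable; split_indices is a hypothetical helper name.

```python
import math
import random


def split_indices(n_samples, test_size=0.4, seed=0):
    """Shuffle row indices and cut them into train and test parts."""
    n_test = math.ceil(test_size * n_samples)  # assumption: round the test share up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # fixed seed gives the same split on every run
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(43093)
print(len(train_idx), len(test_idx))  # 25855 17238
```

Without a fixed seed, each run draws a different split, so node counts and accuracy can differ slightly between runs.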
Once the training and testing data sets have been created, I initialize the DecisionTreeClassifier from scikit-learn
(sklearn) and fit it to the training data.
Then, I use the predict function, the confusion matrix function and the accuracy score function to study the
classification accuracy of my decision tree.
Confusion matrix function - show number of true and false negatives and positives
[[14393 0]
[ 2845 0]]
The confusion matrix shows the correct and incorrect classifications of the decision tree. The diagonal entries, 14393 and 0,
are the number of true negatives and the number of true positives for major depression, respectively. The 2845
on the bottom left is the number of false negatives: individuals who suffer from major depression but were classified
as not suffering from it. The 0 on the top right is the number of false positives: individuals who do not suffer
from major depression but were classified as suffering from it. In other words, the tree predicted no major depression
for every individual in the test sample.
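The four cells of the matrix can be turned into summary measures directly. This is a minimal sketch, assuming scikit-learn's layout for confusion_matrix (rows are true classes, columns are predicted classes). The variable names are illustrative.

```python
# confusion matrix as printed above: rows = true class, columns = predicted class
confusion = [[14393, 0],
             [2845, 0]]

(tn, fp), (fn, tp) = confusion
total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # share of all test cases classified correctly
sensitivity = tp / (tp + fn)   # share of depressed individuals detected
specificity = tn / (tn + fp)   # share of non-depressed individuals detected

print(total, round(accuracy, 4), sensitivity, specificity)
```

Because no case is predicted positive, sensitivity is 0 and specificity is 1. The accuracy derived from this particular matrix is about 0.835, marginally different from the separately printed accuracy score. One possible explanation is that the two outputs come from different random splits.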
Accuracy score function - show classification accuracy in percentage
0.836291913215
The accuracy score function returns approximately 0.84, which suggests that the decision tree
model has classified about 84% of the test sample correctly.
My decision tree is built with MAJORDEPP12, my binary major depression variable, as the target, and GENDER,
DIVORCED and FIRED as the predictors or explanatory variables.
The resulting tree starts with the root node, which shows the total number of observations in the training set, 25855;
of those, 21649 do not suffer from major depression while 4206 do.
The first split is made on GENDER, our first explanatory variable. Values of GENDER less than or equal to 0.5, that is,
male, move to the left side of the split and include 11079 of the 25855 individuals. Values greater than 0.5, that is,
female, move to the right side of the split and include 14776 of the 25855 individuals. In other words, 11079 of the
individuals are male and 14776 are female.
On each side we can also see that:
• Of the male individuals (left node), 9809 do not suffer from major depression while 1270 do.
• Of the female individuals (right node), 11840 do not suffer from major depression while 2936 do.
From these nodes, further splits are made on the variables DIVORCED and FIRED, which generate more nodes in the same
way as described before.
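The way an individual is routed through these splits can be sketched in a few lines. The leaf counts below are transcribed from Figure 2.1, so they differ slightly from the counts quoted in the text, possibly because the figure comes from a different random split. The sketch assumes that the first child of each node is the branch where the test (value <= 0.5) is true, as in scikit-learn's tree plots. The names LEAVES, route and predict are illustrative only.

```python
# leaf class counts [no depression, depression] from Figure 2.1,
# keyed by the outcome of each split: (GENDER, DIVORCED, FIRED)
LEAVES = {
    (0, 0, 0): [8643, 1048],
    (0, 0, 1): [618, 113],
    (0, 1, 0): [458, 127],
    (0, 1, 1): [84, 31],
    (1, 0, 0): [10724, 2364],
    (1, 0, 1): [439, 186],
    (1, 1, 0): [579, 298],
    (1, 1, 1): [81, 62],
}


def route(sample):
    """Apply the three '<= 0.5' tests in order and return the leaf key."""
    return tuple(0 if sample[name] <= 0.5 else 1
                 for name in ("GENDER", "DIVORCED", "FIRED"))


def predict(sample):
    """Predict the majority class of the leaf the sample falls into."""
    counts = LEAVES[route(sample)]
    return counts.index(max(counts))


person = {"GENDER": 1, "DIVORCED": 1, "FIRED": 1}
print(route(person), LEAVES[route(person)], predict(person))
```

With these counts, class 0 (no major depression) is the majority in every leaf, so the sketch predicts 0 for every combination of predictors.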
Looking at the bottom-left and bottom-right nodes, we can describe the output as:
• Of the 9646 male individuals who did not get divorced and did not get fired, 8622 individuals (89%) do not
suffer from major depression, while 1024 (11%) do.
• Of the 143 female individuals who got divorced and got fired, 79 individuals (55%) do not suffer from major
depression, while 64 (45%) do.
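The percentages quoted for the two leaves are simply each class count divided by the leaf's sample size. A minimal sketch of that arithmetic, with an illustrative helper name:

```python
def class_shares(counts):
    """Return each class count as a rounded percentage of the node total."""
    total = sum(counts)
    return [round(100 * c / total) for c in counts]


# [no depression, depression] counts quoted in the text above
print(class_shares([8622, 1024]))  # [89, 11]
print(class_shares([79, 64]))      # [55, 45]
```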
3 Python Code
import sys

import pandas as pd
import sklearn.metrics
# train_test_split lives in sklearn.model_selection;
# the old sklearn.cross_validation module was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# folder holding the data and output files; it is not defined in the
# original listing, so this empty placeholder is an assumption to adjust
working_folder = ""

# saving the python console window as text file
sys.stdout = open(working_folder + "M4A1output.txt", "w")

# reading in the data set we want to work with
mydata = pd.read_csv(working_folder + "M4A1data_nesarc_pds.csv", low_memory=False)

# cleaning my data from NaNs
mydata_clean = mydata.dropna()
mydata_clean.dtypes
mydata_clean.describe()

# recode variable observations to 0, 1
def GENDER(x):
    # SEX == 1 is male (recoded to 0), anything else is recoded to 1 (female)
    if x['SEX'] == 1:
        return 0
    else:
        return 1

mydata_clean['GENDER'] = mydata_clean.apply(GENDER, axis=1)

def DIVORCED(x):
    # S1Q238 == 1 means divorced in the last 12 months
    if x['S1Q238'] == 1:
        return 1
    else:
        return 0

mydata_clean['DIVORCED'] = mydata_clean.apply(DIVORCED, axis=1)

def FIRED(x):
    # S1Q234 == 1 means fired in the last 12 months
    if x['S1Q234'] == 1:
        return 1
    else:
        return 0

mydata_clean['FIRED'] = mydata_clean.apply(FIRED, axis=1)

# set explanatory (predictors) and response (target) variables
predictors = mydata_clean[['GENDER', 'DIVORCED', 'FIRED']]
target = mydata_clean.MAJORDEPP12

# split into training and testing sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.4)
print('train sample')
print(pred_train.shape)
print('test sample')
print(pred_test.shape)
tar_train.shape
tar_test.shape

# build model on training data
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)

# predict for the test values
predictions = classifier.predict(pred_test)

# show number of true and false negatives and positives
print('show number of true and false negatives and positives')
print(sklearn.metrics.confusion_matrix(tar_test, predictions))

# show classification accuracy in percentage
print('show classification accuracy in percentage')
print(sklearn.metrics.accuracy_score(tar_test, predictions))

# Display the decision tree
from sklearn import tree
import graphviz
import pydotplus

dot_data = tree.export_graphviz(classifier,
                                feature_names=['GENDER', 'DIVORCED', 'FIRED'],
                                filled=True, rounded=True, out_file=None)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf(working_folder + "M4A1fig1.pdf")
4 Codebook
Document written in LATEX
template_version_01.tex
7

More Related Content

What's hot

Discriminant analysis
Discriminant analysisDiscriminant analysis
Discriminant analysis
緯鈞 沈
 
Discriminant analysis basicrelationships
Discriminant analysis basicrelationshipsDiscriminant analysis basicrelationships
Discriminant analysis basicrelationshipsdivyakalsi89
 
[M2A4] Data Analysis and Interpretation Specialization
[M2A4] Data Analysis and Interpretation Specialization [M2A4] Data Analysis and Interpretation Specialization
[M2A4] Data Analysis and Interpretation Specialization
Andrea Rubio
 
Chapter3 session1
Chapter3 session1Chapter3 session1
Chapter3 session1
QucPh9
 
Measures of dispersion by Prof Najeeb Memon BMC lumhs jamshoro
Measures of dispersion by Prof Najeeb Memon BMC lumhs jamshoroMeasures of dispersion by Prof Najeeb Memon BMC lumhs jamshoro
Measures of dispersion by Prof Najeeb Memon BMC lumhs jamshoro
muhammed najeeb
 
MS SQL SERVER: Microsoft sequence clustering and association rules
MS SQL SERVER: Microsoft sequence clustering and association rulesMS SQL SERVER: Microsoft sequence clustering and association rules
MS SQL SERVER: Microsoft sequence clustering and association rules
DataminingTools Inc
 
[M2A2] Data Analysis and Interpretation Specialization
[M2A2] Data Analysis and Interpretation Specialization [M2A2] Data Analysis and Interpretation Specialization
[M2A2] Data Analysis and Interpretation Specialization
Andrea Rubio
 
Predict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPredict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an Organization
Piyush Srivastava
 
M1 regression metrics_middleschool
M1 regression metrics_middleschoolM1 regression metrics_middleschool
M1 regression metrics_middleschool
aiclub_slides
 
Two-way Mixed Design with SPSS
Two-way Mixed Design with SPSSTwo-way Mixed Design with SPSS
Two-way Mixed Design with SPSS
J P Verma
 
Hepatic injury classification
Hepatic injury classificationHepatic injury classification
Hepatic injury classification
Zheliang Jiang
 
Moderation and Meditation conducting in SPSS
Moderation and Meditation conducting in SPSSModeration and Meditation conducting in SPSS
Moderation and Meditation conducting in SPSS
Osama Yousaf
 
Measures of Variation
Measures of Variation Measures of Variation
Measures of Variation
Long Beach City College
 
Discriminant analysis ravi nakulan slideshare
Discriminant analysis ravi nakulan slideshareDiscriminant analysis ravi nakulan slideshare
Discriminant analysis ravi nakulan slideshare
Ravi Nakulan
 
Tree pruning
 Tree pruning Tree pruning
Tree pruning
Shivangi Gupta
 

What's hot (18)

Discriminant analysis
Discriminant analysisDiscriminant analysis
Discriminant analysis
 
Discriminant analysis basicrelationships
Discriminant analysis basicrelationshipsDiscriminant analysis basicrelationships
Discriminant analysis basicrelationships
 
[M2A4] Data Analysis and Interpretation Specialization
[M2A4] Data Analysis and Interpretation Specialization [M2A4] Data Analysis and Interpretation Specialization
[M2A4] Data Analysis and Interpretation Specialization
 
Student’s t test
Student’s  t testStudent’s  t test
Student’s t test
 
Chapter3 session1
Chapter3 session1Chapter3 session1
Chapter3 session1
 
Measures of dispersion by Prof Najeeb Memon BMC lumhs jamshoro
Measures of dispersion by Prof Najeeb Memon BMC lumhs jamshoroMeasures of dispersion by Prof Najeeb Memon BMC lumhs jamshoro
Measures of dispersion by Prof Najeeb Memon BMC lumhs jamshoro
 
Solve sysbyelimmult (1)
Solve sysbyelimmult (1)Solve sysbyelimmult (1)
Solve sysbyelimmult (1)
 
MS SQL SERVER: Microsoft sequence clustering and association rules
MS SQL SERVER: Microsoft sequence clustering and association rulesMS SQL SERVER: Microsoft sequence clustering and association rules
MS SQL SERVER: Microsoft sequence clustering and association rules
 
[M2A2] Data Analysis and Interpretation Specialization
[M2A2] Data Analysis and Interpretation Specialization [M2A2] Data Analysis and Interpretation Specialization
[M2A2] Data Analysis and Interpretation Specialization
 
Revised DEMATEL1
Revised DEMATEL1Revised DEMATEL1
Revised DEMATEL1
 
Predict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPredict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an Organization
 
M1 regression metrics_middleschool
M1 regression metrics_middleschoolM1 regression metrics_middleschool
M1 regression metrics_middleschool
 
Two-way Mixed Design with SPSS
Two-way Mixed Design with SPSSTwo-way Mixed Design with SPSS
Two-way Mixed Design with SPSS
 
Hepatic injury classification
Hepatic injury classificationHepatic injury classification
Hepatic injury classification
 
Moderation and Meditation conducting in SPSS
Moderation and Meditation conducting in SPSSModeration and Meditation conducting in SPSS
Moderation and Meditation conducting in SPSS
 
Measures of Variation
Measures of Variation Measures of Variation
Measures of Variation
 
Discriminant analysis ravi nakulan slideshare
Discriminant analysis ravi nakulan slideshareDiscriminant analysis ravi nakulan slideshare
Discriminant analysis ravi nakulan slideshare
 
Tree pruning
 Tree pruning Tree pruning
Tree pruning
 

Similar to [M4A1] Data Analysis and Interpretation Specialization

[M4A2] Data Analysis and Interpretation Specialization
[M4A2] Data Analysis and Interpretation Specialization [M4A2] Data Analysis and Interpretation Specialization
[M4A2] Data Analysis and Interpretation Specialization
Andrea Rubio
 
1. Outline the differences between Hoarding power and Encouraging..docx
1. Outline the differences between Hoarding power and Encouraging..docx1. Outline the differences between Hoarding power and Encouraging..docx
1. Outline the differences between Hoarding power and Encouraging..docx
paynetawnya
 
classification in data mining and data warehousing.pdf
classification in data mining and data warehousing.pdfclassification in data mining and data warehousing.pdf
classification in data mining and data warehousing.pdf
321106410027
 
Data Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationData Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health Classification
Boston Institute of Analytics
 
[M3A4] Data Analysis and Interpretation Specialization
[M3A4] Data Analysis and Interpretation Specialization[M3A4] Data Analysis and Interpretation Specialization
[M3A4] Data Analysis and Interpretation Specialization
Andrea Rubio
 
Basics of SPSS and how to use it first time
Basics of SPSS and how to use it first timeBasics of SPSS and how to use it first time
Basics of SPSS and how to use it first time
RagabGautam1
 
Factor Analysis-Presentation DATA ANALYTICS
Factor Analysis-Presentation DATA ANALYTICSFactor Analysis-Presentation DATA ANALYTICS
Factor Analysis-Presentation DATA ANALYTICS
HaritikaChhatwal1
 
dataminingclassificationprediction123 .pptx
dataminingclassificationprediction123 .pptxdataminingclassificationprediction123 .pptx
dataminingclassificationprediction123 .pptx
AsrithaKorupolu
 
Four machine learning methods to predict academic achievement of college stud...
Four machine learning methods to predict academic achievement of college stud...Four machine learning methods to predict academic achievement of college stud...
Four machine learning methods to predict academic achievement of college stud...
Venkat Projects
 
Chapter 02-logistic regression
Chapter 02-logistic regressionChapter 02-logistic regression
Chapter 02-logistic regression
Raman Kannan
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and prediction
DataminingTools Inc
 
Data mining: Classification and Prediction
Data mining: Classification and PredictionData mining: Classification and Prediction
Data mining: Classification and Prediction
Datamining Tools
 
EDA by Sastry.pptx
EDA by Sastry.pptxEDA by Sastry.pptx
EDA by Sastry.pptx
AmitDas125851
 
Measures of Central Tendency.ppt
Measures of Central Tendency.pptMeasures of Central Tendency.ppt
Measures of Central Tendency.ppt
AdamRayManlunas1
 
Machine Learning.pptx
Machine Learning.pptxMachine Learning.pptx
Machine Learning.pptx
NitinSharma134320
 
analysis part 02.pptx
analysis part 02.pptxanalysis part 02.pptx
analysis part 02.pptx
efrembeyene4
 
Assessment of Anxiety,Depression and Stress using Machine Learning Models
Assessment of Anxiety,Depression and Stress using Machine Learning ModelsAssessment of Anxiety,Depression and Stress using Machine Learning Models
Assessment of Anxiety,Depression and Stress using Machine Learning Models
Prince Kumar
 
Rubric for Investigational Design
Rubric for Investigational DesignRubric for Investigational Design
Rubric for Investigational Design
kalegado
 
Machine learning Mind Map
Machine learning Mind MapMachine learning Mind Map
Machine learning Mind Map
Ashish Patel
 

Similar to [M4A1] Data Analysis and Interpretation Specialization (20)

[M4A2] Data Analysis and Interpretation Specialization
[M4A2] Data Analysis and Interpretation Specialization [M4A2] Data Analysis and Interpretation Specialization
[M4A2] Data Analysis and Interpretation Specialization
 
1. Outline the differences between Hoarding power and Encouraging..docx
1. Outline the differences between Hoarding power and Encouraging..docx1. Outline the differences between Hoarding power and Encouraging..docx
1. Outline the differences between Hoarding power and Encouraging..docx
 
classification in data mining and data warehousing.pdf
classification in data mining and data warehousing.pdfclassification in data mining and data warehousing.pdf
classification in data mining and data warehousing.pdf
 
Data Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationData Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health Classification
 
[M3A4] Data Analysis and Interpretation Specialization
[M3A4] Data Analysis and Interpretation Specialization[M3A4] Data Analysis and Interpretation Specialization
[M3A4] Data Analysis and Interpretation Specialization
 
Basics of SPSS and how to use it first time
Basics of SPSS and how to use it first timeBasics of SPSS and how to use it first time
Basics of SPSS and how to use it first time
 
Factor Analysis-Presentation DATA ANALYTICS
Factor Analysis-Presentation DATA ANALYTICSFactor Analysis-Presentation DATA ANALYTICS
Factor Analysis-Presentation DATA ANALYTICS
 
dataminingclassificationprediction123 .pptx
dataminingclassificationprediction123 .pptxdataminingclassificationprediction123 .pptx
dataminingclassificationprediction123 .pptx
 
Four machine learning methods to predict academic achievement of college stud...
Four machine learning methods to predict academic achievement of college stud...Four machine learning methods to predict academic achievement of college stud...
Four machine learning methods to predict academic achievement of college stud...
 
Chapter 02-logistic regression
Chapter 02-logistic regressionChapter 02-logistic regression
Chapter 02-logistic regression
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and prediction
 
Data mining: Classification and Prediction
Data mining: Classification and PredictionData mining: Classification and Prediction
Data mining: Classification and Prediction
 
EDA by Sastry.pptx
EDA by Sastry.pptxEDA by Sastry.pptx
EDA by Sastry.pptx
 
forest-cover-type
forest-cover-typeforest-cover-type
forest-cover-type
 
Measures of Central Tendency.ppt
Measures of Central Tendency.pptMeasures of Central Tendency.ppt
Measures of Central Tendency.ppt
 
Machine Learning.pptx
Machine Learning.pptxMachine Learning.pptx
Machine Learning.pptx
 
analysis part 02.pptx
analysis part 02.pptxanalysis part 02.pptx
analysis part 02.pptx
 
Assessment of Anxiety,Depression and Stress using Machine Learning Models
Assessment of Anxiety,Depression and Stress using Machine Learning ModelsAssessment of Anxiety,Depression and Stress using Machine Learning Models
Assessment of Anxiety,Depression and Stress using Machine Learning Models
 
Rubric for Investigational Design
Rubric for Investigational DesignRubric for Investigational Design
Rubric for Investigational Design
 
Machine learning Mind Map
Machine learning Mind MapMachine learning Mind Map
Machine learning Mind Map
 

Recently uploaded

Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
Nistarini College, Purulia (W.B) India
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
Wasswaderrick3
 
Hemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxHemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptx
muralinath2
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
pablovgd
 
role of pramana in research.pptx in science
role of pramana in research.pptx in sciencerole of pramana in research.pptx in science
role of pramana in research.pptx in science
sonaliswain16
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
Richard Gill
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
ChetanK57
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
fafyfskhan251kmf
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Ana Luísa Pinho
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
Columbia Weather Systems
 
in vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptxin vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptx
yusufzako14
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
moosaasad1975
 
extra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdfextra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdf
DiyaBiswas10
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
RenuJangid3
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Erdal Coalmaker
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
AlaminAfendy1
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Studia Poinsotiana
 
GBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture MediaGBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture Media
Areesha Ahmad
 
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiologyBLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
NoelManyise1
 
general properties of oerganologametal.ppt
general properties of oerganologametal.pptgeneral properties of oerganologametal.ppt
general properties of oerganologametal.ppt
IqrimaNabilatulhusni
 

Recently uploaded (20)

Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
 
Hemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxHemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptx
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
 
role of pramana in research.pptx in science
role of pramana in research.pptx in sciencerole of pramana in research.pptx in science
role of pramana in research.pptx in science
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
 
in vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptxin vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptx
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
extra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdfextra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdf
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
 
GBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture MediaGBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture Media
 
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiologyBLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
 
general properties of oerganologametal.ppt
general properties of oerganologametal.pptgeneral properties of oerganologametal.ppt
general properties of oerganologametal.ppt
 

[M4A1] Data Analysis and Interpretation Specialization

  • 1. DATA ANALYSIS COLLECTION ASSIGNMENT Data Analysis And Interpretation Specialization Running a classification tree Andrea Rubio Amorós June 26, 2017 Modul 4 Assignment 1
  • 2. Data Analysis And Interpretation Specialization Running a classification tree M4A1 1 General In this session, you will learn about decision trees, a type of data mining algorithm that can select from among a large number of variables those and their interactions that are most important in predicting the target or response variable to be explained. Decision trees create segmentations or subgroups in the data, by applying a series of simple rules or criteria over and over again, which choose variable constellations that best predict the target variable. Document written in LATEX template_version_01.tex 2
  • 3. 2 Description
    GENDER <= 0.5, gini = 0.2736, samples = 25855, value = [21626, 4229]
      True: DIVORCED <= 0.5, gini = 0.2091, samples = 11122, value = [9803, 1319]
        True: FIRED <= 0.5, gini = 0.198, samples = 10422, value = [9261, 1161]
          True: gini = 0.1929, samples = 9691, value = [8643, 1048]
          False: gini = 0.2614, samples = 731, value = [618, 113]
        False: FIRED <= 0.5, gini = 0.3495, samples = 700, value = [542, 158]
          True: gini = 0.3399, samples = 585, value = [458, 127]
          False: gini = 0.3938, samples = 115, value = [84, 31]
      False: DIVORCED <= 0.5, gini = 0.317, samples = 14733, value = [11823, 2910]
        True: FIRED <= 0.5, gini = 0.3028, samples = 13713, value = [11163, 2550]
          True: gini = 0.296, samples = 13088, value = [10724, 2364]
          False: gini = 0.4181, samples = 625, value = [439, 186]
        False: FIRED <= 0.5, gini = 0.4567, samples = 1020, value = [660, 360]
          True: gini = 0.4487, samples = 877, value = [579, 298]
          False: gini = 0.4912, samples = 143, value = [81, 62]
    Figure 2.1 Decision tree
    Today, I will build a decision tree (a statistical model) for a supervised prediction problem. For that, I will work with the NESARC data set.
    First, I set my explanatory and response variables (all binary, categorical).
    Explanatory variables (predictors):
      • GENDER: sex (0=male / 1=female)
      • DIVORCED: got divorced in the last 12 months (0=no / 1=yes)
      • FIRED: fired in the last 12 months (0=no / 1=yes)
    Response variable (target):
      • MAJORDEPP12: major depression in the last 12 months (0=no / 1=yes)
    Then, I apply the train_test_split function to the predictors and the target, to split my data set into two:
    train sample (25855, 3)
    test sample (17238, 3)
      • The train sample has 25855 observations (rows), 60% of the original sample, and 3 explanatory variables (columns).
      • The test sample has 17238 observations (rows), 40% of the original sample, and again 3 explanatory variables (columns).
    Once the training and testing data sets have been created, I initialize the DecisionTreeClassifier from scikit-learn and fit it on the training data.
    Then, I use the predict function on the test sample, followed by the confusion_matrix and accuracy_score functions, to study the classification accuracy of my decision tree.
    Confusion matrix function - show number of true and false negatives and positives
    [[14393 0]
     [ 2845 0]]
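    The gini values in Figure 2.1 can be checked by hand from the class counts in each node. The sketch below assumes the standard Gini impurity, 1 minus the sum of squared class shares; the helper name `gini_from_counts` is hypothetical, and the counts are copied from the root node and the bottom right leaf of the figure.

```python
def gini_from_counts(counts):
    """Gini impurity 1 - sum(p_k ** 2), computed from per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Class counts [no depression, depression] as shown in Figure 2.1
root_counts = [21626, 4229]   # root node, 25855 samples
leaf_counts = [81, 62]        # bottom right leaf, 143 samples

root_gini = round(gini_from_counts(root_counts), 4)
leaf_gini = round(gini_from_counts(leaf_counts), 4)
print(root_gini, leaf_gini)   # 0.2736 0.4912, matching the figure
```

    A value near 0 means a node is dominated by one class, while 0.5 is the maximum for two classes, so the bottom right leaf is close to an even mix.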
  • 4. The confusion matrix shows the correct and incorrect classifications of our decision tree. The diagonal, 14393 and 0, gives the number of true negatives and the number of true positives for major depression, respectively. The 2845, on the bottom left, is the number of false negatives: individuals who suffer from major depression but are classified as individuals who do not. The 0 on the top right is the number of false positives: individuals who do not suffer from major depression but are classified as individuals who do. In other words, the tree predicts "no major depression" for every individual in the test sample.
    Accuracy score function - show classification accuracy in percentage
    0.836291913215
    The accuracy score is approximately 0.84, which suggests that the decision tree model has classified about 84% of the test sample correctly.
    My decision tree is built with MAJORDEPP12, my binary major depression variable, as the target, and GENDER, DIVORCED and FIRED as the predictors or explanatory variables.
    The resulting tree starts with the root node, which shows the total number of observations in the train set, 25855; of those, 21649 do not suffer from major depression while 4206 do. (The counts quoted in this description differ slightly from those in Figure 2.1. train_test_split draws a new random split on every run unless random_state is fixed, so they presumably come from a different run.)
    The first split is made on GENDER, our first explanatory variable. Values of GENDER below 0.5, that is, male, move to the left side of the split and include 11079 of the 25855 individuals. Values of 0.5 or higher, that is, female, move to the right side of the split and include 14776 of the 25855 individuals. That means that, of all individuals, 11079 are male and 14776 are female. On each side we can also see that:
      • Of the male individuals (left node), 9809 do not suffer from major depression while 1270 do.
      • Of the female individuals (right node), 11840 do not suffer from major depression while 2936 do.
    From these nodes, further splits are made on the variables DIVORCED and FIRED, which generate more nodes in the same way as described before.
    Looking at the bottom left and the bottom right nodes, we can describe the output as:
      • Of the 9646 male individuals who did not get divorced and did not get fired, 8622 individuals (89%) do not suffer from major depression, while 1024 (11%) do.
      • Of the 143 female individuals who got divorced and got fired, 79 individuals (55%) do not suffer from major depression, while 64 (45%) do.
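    To make the reading of the confusion matrix concrete, the sketch below unpacks the matrix reported above into its four cells and derives the usual rates. It assumes scikit-learn's layout (rows are actual classes, columns are predicted classes, class order 0 then 1); the variable names are illustrative only.

```python
# Confusion matrix as printed above: rows = actual class, columns = predicted class
matrix = [[14393, 0],
          [2845, 0]]

(tn, fp), (fn, tp) = matrix          # true/false negatives and positives
total = tn + fp + fn + tp            # size of the test sample

accuracy = (tn + tp) / total         # share of correct classifications
sensitivity = tp / (tp + fn)         # share of depressed individuals detected
specificity = tn / (tn + fp)         # share of non-depressed individuals detected

print(total, round(accuracy, 3), sensitivity, specificity)
```

    For this matrix the sensitivity is 0, because no individual is predicted as depressed, so the accuracy alone overstates how useful the model is for identifying major depression.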
  • 5. 3 Python Code
    import sys
    import pandas as pd
    # train_test_split is imported from sklearn.model_selection; the
    # sklearn.cross_validation module used originally has been removed
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    import sklearn.metrics

    # folder holding the data and output files; it is not defined in the
    # original listing, so this value is a placeholder to adjust
    working_folder = "./"

    # saving the python console output as text file
    sys.stdout = open(working_folder + "M4A1output.txt", "w")

    # reading in the data set we want to work with
    mydata = pd.read_csv(working_folder + "M4A1data_nesarc_pds.csv", low_memory=False)

    # cleaning my data from NaNs (copied so that new columns can be added safely)
    mydata_clean = mydata.dropna().copy()
    mydata_clean.dtypes
    mydata_clean.describe()

    # recode variable observations to 0, 1
    # GENDER: SEX == 1 becomes 0 (male), anything else becomes 1 (female)
    mydata_clean['GENDER'] = (mydata_clean['SEX'] != 1).astype(int)
    # DIVORCED: S1Q238 == 1 becomes 1 (yes), anything else becomes 0 (no)
    mydata_clean['DIVORCED'] = (mydata_clean['S1Q238'] == 1).astype(int)
    # FIRED: S1Q234 == 1 becomes 1 (yes), anything else becomes 0 (no)
    mydata_clean['FIRED'] = (mydata_clean['S1Q234'] == 1).astype(int)

    # set explanatory (predictors) and response (target) variables
    predictors = mydata_clean[['GENDER', 'DIVORCED', 'FIRED']]
    target = mydata_clean.MAJORDEPP12

    # split into training and testing sets (60% / 40%)
    pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.4)
    print('train sample')
    print(pred_train.shape)
    print('test sample')
    print(pred_test.shape)
    tar_train.shape
    tar_test.shape

    # build model on training data
    classifier = DecisionTreeClassifier()
    classifier = classifier.fit(pred_train, tar_train)

    # predict for the test values
    predictions = classifier.predict(pred_test)

    # show number of true and false negatives and positives
    print('show number of true and false negatives and positives')
    print(sklearn.metrics.confusion_matrix(tar_test, predictions))

    # show classification accuracy in percentage
    print('show classification accuracy in percentage')
  • 6. 3 Python Code (continued)
    print(sklearn.metrics.accuracy_score(tar_test, predictions))

    # Display the decision tree
    from sklearn import tree
    import graphviz
    import pydotplus
    dot_data = tree.export_graphviz(classifier,
                                    feature_names=['GENDER', 'DIVORCED', 'FIRED'],
                                    filled=True, rounded=True, out_file=None)
    graph = pydotplus.graph_from_dot_data(dot_data)
    graph.write_pdf(working_folder + "M4A1fig1.pdf")
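    The listing above calls train_test_split without a fixed random_state, so every run produces a different split and slightly different node counts. The sketch below shows one way to make a run repeatable and to print the fitted tree as text with scikit-learn's export_text, which needs no Graphviz installation. It runs on synthetic 0/1 data, not on the NESARC file; the seed value, the simulated target and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the three binary predictors (NOT the NESARC data)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 3))
# Hypothetical target: the chance of a 1 rises with each predictor that is 1
y = rng.binomial(1, 0.1 + 0.1 * X.sum(axis=1))


def split(seed):
    """60/40 split that is identical for identical seeds."""
    return train_test_split(X, y, test_size=0.4, random_state=seed)


X_train, X_test, y_train, y_test = split(seed=42)
X_train_again, _, _, _ = split(seed=42)   # same seed, same rows

# Fit the tree and print its rules as indented text
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
report = export_text(clf, feature_names=["GENDER", "DIVORCED", "FIRED"])
print(report)
```

    With a fixed seed, the counts described in the text and the counts in the plotted tree would come from the same split.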
  • 7. 4 Codebook