[M4A1] Data Analysis and Interpretation Specialization

DATA ANALYSIS COLLECTION
ASSIGNMENT
Data Analysis And Interpretation Specialization
Running a classification tree
Andrea Rubio Amorós
June 26, 2017
Module 4
Assignment 1

1 General
In this session, you will learn about decision trees, a type of data mining algorithm that can select, from a large
number of variables, those variables and interactions that are most important in predicting the target or response
variable to be explained. Decision trees create segmentations or subgroups in the data by applying a series of simple
rules or criteria over and over again, each time choosing the variable split that best predicts the target variable.

2 Description
GENDER <= 0.5 | gini = 0.2736 | samples = 25855 | value = [21626, 4229]
  True: DIVORCED <= 0.5 | gini = 0.2091 | samples = 11122 | value = [9803, 1319]
    True: FIRED <= 0.5 | gini = 0.198 | samples = 10422 | value = [9261, 1161]
      True: leaf | gini = 0.1929 | samples = 9691 | value = [8643, 1048]
      False: leaf | gini = 0.2614 | samples = 731 | value = [618, 113]
    False: FIRED <= 0.5 | gini = 0.3495 | samples = 700 | value = [542, 158]
      True: leaf | gini = 0.3399 | samples = 585 | value = [458, 127]
      False: leaf | gini = 0.3938 | samples = 115 | value = [84, 31]
  False: DIVORCED <= 0.5 | gini = 0.317 | samples = 14733 | value = [11823, 2910]
    True: FIRED <= 0.5 | gini = 0.3028 | samples = 13713 | value = [11163, 2550]
      True: leaf | gini = 0.296 | samples = 13088 | value = [10724, 2364]
      False: leaf | gini = 0.4181 | samples = 625 | value = [439, 186]
    False: FIRED <= 0.5 | gini = 0.4567 | samples = 1020 | value = [660, 360]
      True: leaf | gini = 0.4487 | samples = 877 | value = [579, 298]
      False: leaf | gini = 0.4912 | samples = 143 | value = [81, 62]
Figure 2.1 Decision tree
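The gini value shown in each node of Figure 2.1 can be checked by hand. The following is a minimal sketch, assuming the usual definition of Gini impurity (one minus the sum of the squared class shares); the helper name gini_impurity is my own and does not appear in the report.

```python
def gini_impurity(counts):
    """Gini impurity of a node, given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Class counts [no depression, depression] copied from Figure 2.1
root = gini_impurity([21626, 4229])   # root node
last_leaf = gini_impurity([81, 62])   # bottom right leaf

print(round(root, 4), round(last_leaf, 4))
```

A value near 0 means the node is almost pure (nearly everyone in one class); for two classes the maximum is 0.5, which is why the bottom right leaf, with an almost even split, has the highest gini in the tree.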
Today, I will build a decision tree (statistical model) to study supervised prediction problems. For that, I will work
with the NESARC data set.
First, I set my explanatory and response variables (binary, categorical).
Explanatory variables (predictors):
• GENDER: Sex (0=Male / 1=Female)
• DIVORCED: got divorced in last 12 months (0=no / 1=yes)
• FIRED: fired in last 12 months (0=no / 1=yes)
Response variable (target):
• MAJORDEPP12: major depression in last 12 months (0=no / 1=yes)
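The recoding to 0/1 can also be written without row-by-row functions. This is only a sketch on made-up rows: the column names SEX, S1Q238 and S1Q234 and the rule "equal to 1" come from the Python Code section, while the toy values and the names raw and recoded are my own assumptions for illustration.

```python
import pandas as pd

# Toy rows; the values are invented, only the column names follow the report
raw = pd.DataFrame({
    "SEX": [1, 2, 2, 1],
    "S1Q238": [2, 1, 2, 1],
    "S1Q234": [1, 2, 2, 9],
})

# Boolean comparisons cast to int give the same 0/1 coding as the report's functions
recoded = pd.DataFrame({
    "GENDER": (raw["SEX"] != 1).astype(int),       # SEX == 1 -> 0, anything else -> 1
    "DIVORCED": (raw["S1Q238"] == 1).astype(int),  # 1 -> 1, anything else -> 0
    "FIRED": (raw["S1Q234"] == 1).astype(int),     # 1 -> 1, anything else -> 0
})

print(recoded)
```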
Then, I apply the train_test_split function to the predictors and the target, to split my data set into two:
train sample
(25855, 3)
test sample
(17238, 3)
• The train sample has 25855 observations or rows, 60% of the original sample, and 3 explanatory variables.
• The test sample has 17238 observations or rows, 40% of the original sample, and again 3 explanatory variables
or columns.
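The two sample sizes follow from the 40% test share. As a small arithmetic sketch, assuming (as I understand scikit-learn's behaviour) that a fractional test size is rounded up and the remainder goes to the train sample:

```python
import math

n_total = 25855 + 17238              # rows after cleaning, from the two shapes printed above
n_test = math.ceil(0.4 * n_total)    # assumed rule: the test share is rounded up
n_train = n_total - n_test           # everything else is used for training

print(n_total, n_train, n_test)
print(round(n_train / n_total, 2), round(n_test / n_total, 2))
```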
Once the training and testing data sets have been created, I initialize the DecisionTreeClassifier from scikit-learn
and fit it to the training data. Then, I use the predict function, the confusion matrix function and the accuracy score
function to study the classification accuracy of my decision tree.
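The workflow described so far (split, fit, predict, evaluate) can be sketched end to end on synthetic data. Everything below is an illustration under my own assumptions: the data are randomly generated rather than taken from NESARC, and the random_state values are added only so that repeated runs give the same split and the same tree.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000

# Three synthetic binary predictors and a binary target whose probability
# rises a little with each predictor (made-up relationship)
X = rng.integers(0, 2, size=(n, 3))
prob = 0.1 + 0.1 * X.sum(axis=1)
y = (rng.random(n) < prob).astype(int)

# 60/40 split; fixing random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Fit on the training data only, then predict the held-out test data
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pred = clf.predict(X_test)

cm = confusion_matrix(y_test, pred, labels=[0, 1])
acc = accuracy_score(y_test, pred)
print(cm)
print(acc)
```

Without a fixed random_state, each run draws a different train and test sample, so node counts, the confusion matrix and the accuracy can all change slightly from run to run.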
Confusion matrix function - show number of true and false negatives and positives
[[14393 0]
[ 2845 0]]

The confusion matrix shows the correct and incorrect classifications of our decision tree on the test sample. The
diagonal, 14393 and 0, holds the number of true negatives and the number of true positives for major depression,
respectively. The 2845 on the bottom left is the number of false negatives: individuals who suffer from major
depression but are classified as individuals who do not. The 0 on the top right is the number of false positives:
individuals who do not suffer from major depression but are classified as individuals who do. Because the second
column contains only zeros, the tree predicts "no major depression" for every individual in the test sample.
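The four cells of the matrix above are enough to derive the usual summary measures. A minimal sketch in plain Python; the variable names are mine, and the row/column layout (rows are actual classes, columns are predicted classes) is the one used by scikit-learn's confusion_matrix:

```python
# Confusion matrix printed above: rows = actual 0/1, columns = predicted 0/1
cm = [[14393, 0],
      [2845, 0]]

(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tn + tp) / total      # share of all test cases classified correctly
sensitivity = tp / (tp + fn)      # share of depressed individuals that were detected
specificity = tn / (tn + fp)      # share of non-depressed individuals classified correctly

print(total, round(accuracy, 4), sensitivity, specificity)
```

The accuracy derived from this particular matrix is about 0.835, slightly below the 0.836 printed by the accuracy score function, so the two outputs presumably come from different random splits. The sensitivity of 0 shows that the high accuracy rests entirely on the majority class.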
Accuracy score function - show classification accuracy as a proportion
0.836291913215
The accuracy score function returns approximately 0.84, which suggests that the decision tree model has classified
about 84% of the test sample correctly.
My decision tree is built with MAJORDEPP12, my binary major depression variable, as the target, and GENDER,
DIVORCED and FIRED as the predictors or explanatory variables.
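The resulting tree starts with the first node, which indicates the total number of observations in the train set, 25855;
of those, 21649 do not suffer from major depression while 4206 do. The counts quoted in this description differ
slightly from those in Figure 2.1, presumably because the random train/test split was not seeded and the figure comes
from a different run.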
The first split is made on GENDER, our first explanatory variable. Observations with GENDER at or below 0.5, that is
males (GENDER = 0), move to the left side of the split and include 11079 of the 25855 individuals. Observations with
GENDER above 0.5, that is females (GENDER = 1), move to the right side of the split and include 14776 of the 25855
individuals. That means that of all individuals in the train set, 11079 are male and 14776 are female.
On each side we can also see that:
• Of the male individuals (left node), 9809 do not suffer from major depression while 1270 do.
• Of the female individuals (right node), 11840 do not suffer from major depression while 2936 do.
From these nodes, further splits are made on the variables DIVORCED and FIRED, which generate more nodes in the
same way as described before.
Looking at the bottom left and the bottom right nodes, we can describe the output as:
• Of the 9646 male individuals who did not get divorced and did not get fired, 8622 individuals (89%) do not
suffer from major depression, while 1024 (11%) do.
• Of the 143 female individuals who got divorced and got fired, 79 individuals (55%) do not suffer from major
depression, while 64 (45%) do.
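The percentages quoted for the two leaves are simply each class count divided by the leaf size. A short sketch of that arithmetic, using the counts from the two bullets above; the helper name class_shares is mine:

```python
def class_shares(no_depression, depression):
    """Return the rounded percentage of each class within one leaf."""
    n = no_depression + depression
    return round(100 * no_depression / n), round(100 * depression / n)


# Counts taken from the two leaves described above
male_not_divorced_not_fired = class_shares(8622, 1024)
female_divorced_fired = class_shares(79, 64)

print(male_not_divorced_not_fired, female_divorced_fired)
```

Even in the leaf with the highest share of major depression, the majority class is still "no depression", which is consistent with the tree never predicting a positive case.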

3 Python Code
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
# train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import sklearn.metrics

# assumption: working_folder is not defined in the original listing;
# set it to the folder that holds the data file (with a trailing separator)
working_folder = ""

# saving the python console window as text file
import sys
sys.stdout = open(working_folder + "M4A1output.txt", "w")

# reading in the data set we want to work with
mydata = pd.read_csv(working_folder + "M4A1data_nesarc_pds.csv", low_memory=False)

# cleaning my data from NaNs (copy, so that the new columns below are added
# to an independent DataFrame and pandas raises no SettingWithCopyWarning)
mydata_clean = mydata.dropna().copy()
mydata_clean.dtypes
mydata_clean.describe()

# recode variable observations to 0, 1
def GENDER(x):
    if x['SEX'] == 1:
        return 0
    else:
        return 1
mydata_clean['GENDER'] = mydata_clean.apply(lambda x: GENDER(x), axis=1)

def DIVORCED(x):
    if x['S1Q238'] == 1:
        return 1
    else:
        return 0
mydata_clean['DIVORCED'] = mydata_clean.apply(lambda x: DIVORCED(x), axis=1)

def FIRED(x):
    if x['S1Q234'] == 1:
        return 1
    else:
        return 0
mydata_clean['FIRED'] = mydata_clean.apply(lambda x: FIRED(x), axis=1)

# set explanatory (predictors) and response (target) variables
predictors = mydata_clean[['GENDER', 'DIVORCED', 'FIRED']]
target = mydata_clean.MAJORDEPP12

# split into training and testing sets (60% train, 40% test)
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.4)
print('train sample')
print(pred_train.shape)
print('test sample')
print(pred_test.shape)
tar_train.shape
tar_test.shape

# build model on training data
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)

# predict for the test values
predictions = classifier.predict(pred_test)

# show number of true and false negatives and positives
print('show number of true and false negatives and positives')
print(sklearn.metrics.confusion_matrix(tar_test, predictions))

# show classification accuracy (a proportion between 0 and 1)
print('show classification accuracy in percentage')
print(sklearn.metrics.accuracy_score(tar_test, predictions))

# Display the decision tree
from sklearn import tree
import graphviz
import pydotplus
dot_data = tree.export_graphviz(classifier,
                                feature_names=['GENDER', 'DIVORCED', 'FIRED'],
                                filled=True, rounded=True, out_file=None)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf(working_folder + "M4A1fig1.pdf")

4 Codebook