Data Analysis and Interpretation Specialization
Assignment: Running a Random Forest (Module 4, Assignment 2 - M4A2)
Andrea Rubio Amorós
July 9, 2017
1 General
In this session, you will learn about random forests, a type of data-mining algorithm that can select, from among a large
number of variables, those that are most important in determining the target or response variable to be explained.
Unlike a single decision tree, a random forest tends to generalize well to new data.
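The core idea can be sketched in a few lines: each tree in the forest is grown on a bootstrap sample of the observations and
considers only a random subset of the variables at each split, and the forest predicts by majority vote. The snippet below is
only an illustration of that mechanism on synthetic data; the toy data set and the names in it are illustrative and not part
of this assignment.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# toy data, only to illustrate the mechanism (not the student data set used below)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

rng = np.random.RandomState(0)
forest = []
for _ in range(25):
    rows = rng.choice(len(X), size=len(X), replace=True)  # bootstrap sample of the rows
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=rng)  # random variable subset per split
    forest.append(tree.fit(X[rows], y[rows]))

# the forest predicts by majority vote over its trees
votes = np.array([tree.predict(X) for tree in forest])
majority = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the hand-rolled forest:", (majority == y).mean())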
2 Python Code
For this week’s assignment, I performed a random forest analysis to evaluate the importance of a series of explanatory
variables in predicting alcohol consumption among students (my response variable).
To start my Python code, I import all necessary libraries, read in my data set, and drop the rows containing missing values (NaNs).
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older scikit-learn versions
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
# Feature Importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier
# saving the python console output as a text file
import sys
working_folder = "./"  # adjust to the directory that holds the data and output files
sys.stdout = open(working_folder+"M4A2output.txt","w")
# reading in the data set we want to work with
mydata = pd.read_csv(working_folder+"M4A2data_student-mat.csv", low_memory=False)
# cleaning my data from NaNs
mydata_clean = mydata.dropna()
print(mydata_clean.dtypes)
print(mydata_clean.describe())
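A quick check (not part of the original script) confirms how many rows, if any, the NaN cleaning removed, and how the
original weekend-alcohol variable is distributed before re-coding:
# compare row counts before and after dropping missing values
print('rows before/after dropna:', len(mydata), len(mydata_clean))
# distribution of the original weekend alcohol consumption scale (1-5)
print(mydata_clean['Walc'].value_counts().sort_index())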
Then I re-code all of the categorical variables I am going to use as binary (0/1), and create a new data set containing only the
selected variables.
# recode variable observations to 0, 1
def GENDER(x):
    if x['sex'] == "M":
        return 0
    else:
        return 1
mydata_clean['GENDER'] = mydata_clean.apply(lambda x: GENDER(x), axis=1)

def HOMETYPE(x):
    if x['address'] == "U":
        return 0
    else:
        return 1
mydata_clean['HOMETYPE'] = mydata_clean.apply(lambda x: HOMETYPE(x), axis=1)

def FAMSIZE(x):
    if x['famsize'] == "LE3":
        return 0
    else:
        return 1
mydata_clean['FAMSIZE'] = mydata_clean.apply(lambda x: FAMSIZE(x), axis=1)

def PSTATUS(x):
    if x['Pstatus'] == "T":
        return 0
    else:
        return 1
mydata_clean['PSTATUS'] = mydata_clean.apply(lambda x: PSTATUS(x), axis=1)

def ACTIVITIES(x):
    if x['activities'] == "no":
        return 0
    else:
        return 1
mydata_clean['ACTIVITIES'] = mydata_clean.apply(lambda x: ACTIVITIES(x), axis=1)

def INTERNET(x):
    if x['internet'] == "no":
        return 0
    else:
        return 1
mydata_clean['INTERNET'] = mydata_clean.apply(lambda x: INTERNET(x), axis=1)

def ROMANTIC(x):
    if x['romantic'] == "no":
        return 0
    else:
        return 1
mydata_clean['ROMANTIC'] = mydata_clean.apply(lambda x: ROMANTIC(x), axis=1)

def FAMREL(x):
    if x['famrel'] < 3:
        return 0
    else:
        return 1
mydata_clean['FAMREL'] = mydata_clean.apply(lambda x: FAMREL(x), axis=1)

def FREETIME(x):
    if x['freetime'] < 3:
        return 0
    else:
        return 1
mydata_clean['FREETIME'] = mydata_clean.apply(lambda x: FREETIME(x), axis=1)

def GOOUT(x):
    if x['goout'] < 3:
        return 0
    else:
        return 1
mydata_clean['GOOUT'] = mydata_clean.apply(lambda x: GOOUT(x), axis=1)

def WALC(x):
    if x['Walc'] < 3:
        return 0
    else:
        return 1
mydata_clean['WALC'] = mydata_clean.apply(lambda x: WALC(x), axis=1)

# set explanatory (predictors) and response (target) variables
predictors = mydata_clean[['GENDER', 'age', 'HOMETYPE', 'FAMSIZE', 'PSTATUS', 'ACTIVITIES', 'INTERNET',
                           'ROMANTIC', 'FAMREL', 'studytime', 'FREETIME', 'GOOUT', 'absences']]
target = mydata_clean.WALC
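As a side note, the same re-coding could be written more compactly with numpy's vectorized where function. This is just an
alternative sketch, not the version used for the results below; it is shown here for two of the variables.
# equivalent one-line recodings using np.where
mydata_clean['GENDER'] = np.where(mydata_clean['sex'] == 'M', 0, 1)
mydata_clean['WALC'] = np.where(mydata_clean['Walc'] < 3, 0, 1)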
Re-coded explanatory variables (binary, categorical):
• GENDER - student’s sex (0 = Male or 1 = Female)
• HOMETYPE - student’s home address type (0 = urban or 1 = rural)
• FAMSIZE - family size (0 = less or equal to 3 or 1 = greater than 3)
• PSTATUS - parent’s cohabitation status (0 = living together or 1 = apart)
• ACTIVITIES - extra-curricular activities (0 = no or 1 = yes)
• INTERNET - Internet access at home (0 = no or 1 = yes)
• ROMANTIC - with a romantic relationship (0 = no or 1 = yes)
• FAMREL - quality of family relationships (0 = very bad or 1 = very good)
• FREETIME - free time after school (0 = very low or 1 = very high)
• GOOUT - going out with friends (0 = very low or 1 = very high)
Explanatory variables (quantitative):
• age - student’s age (from 15 to 22)
• absences - number of school absences (from 0 to 93)
Response variable (binary, categorical):
• WALC - weekend alcohol consumption (0 = very low or 1 = very high)
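A simple cross-tabulation (an optional check, not in the original script) verifies that each re-coded variable matches its
source column, for example for GENDER:
# every "M" should fall in the 0 column and every "F" in the 1 column
print(pd.crosstab(mydata_clean['sex'], mydata_clean['GENDER']))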
Once my variables are defined, I apply the train_test_split function to the predictors and the target to split my data set in
two:
# split into training and testing sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.4)
print('train sample')
print(pred_train.shape)
print('\ntest sample')
print(pred_test.shape)
tar_train.shape
tar_test.shape
train sample
(237, 13)
test sample
(158, 13)
• The train sample has 237 observations or rows, 60% of the original sample, and 13 explanatory variables.
• The test sample has 158 observations or rows, 40% of the original sample, and again 13 explanatory variables.
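Because train_test_split shuffles the rows randomly, the exact counts and the results below vary slightly from run to run.
Fixing a seed and stratifying on the target, as in the variant below (not what was used for the reported results), would make
the split reproducible and keep the WALC class proportions equal in both subsets:
# reproducible, stratified variant of the split used above
pred_train, pred_test, tar_train, tar_test = train_test_split(
    predictors, target, test_size=.4, random_state=123, stratify=target)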
I initialize the RandomForestClassifier from scikit-learn, then use the predict function together with the confusion matrix
and accuracy-score functions to assess the classification accuracy of my random forest.
# build model on training data
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=30)
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)
# show the numbers of true and false negatives and positives
print('\nConfusion matrix - numbers of true and false negatives and positives')
print(sklearn.metrics.confusion_matrix(tar_test, predictions))
# show classification accuracy as a proportion
print('\nAccuracy score - proportion of correctly classified observations')
print(sklearn.metrics.accuracy_score(tar_test, predictions))
Confusion matrix - numbers of true and false negatives and positives
[[76 19]
[33 30]]
Accuracy score - proportion of correctly classified observations
0.670886075949
The accuracy of the random forest on the test set is about 67%.
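The classification_report function imported at the top (but not used in the original script) breaks this overall accuracy down
into per-class precision, recall and F1, which is informative here because the two WALC classes are unbalanced in the test set
(95 versus 63 observations in the confusion matrix above):
# per-class precision, recall and F1 for the random forest predictions
print(classification_report(tar_test, predictions))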
I also display the importance of each explanatory variable in relation to my response variable.
# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train, tar_train)
# display the relative importance of each attribute
print('\nRelative importance of each attribute')
print(model.feature_importances_)
Relative importance of each attribute
[ 0.06641071 0.15107883 0.05013939 0.04982962 0.03054427 0.06166548
0.0248313 0.07963848 0.02088957 0.10560803 0.03823574 0.12249413
0.19863446]
The explanatory variables with the highest relative importance scores are absences (0.20), age (0.15) and going out
with friends (0.12).
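The raw array is hard to read because it only gives positions; pairing the scores with the column names of the predictors (a
small addition, not in the original output) makes the ranking explicit:
# label each importance score with its predictor name and sort in descending order
importances = pd.Series(model.feature_importances_, index=predictors.columns)
print(importances.sort_values(ascending=False))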
To conclude, I plot the accuracy obtained with each forest size as a line graph, to visualize how the accuracy changes as new
trees are added to my forest.
trees = range(30)
accuracy = np.zeros(30)
for idx in range(len(trees)):
    classifier = RandomForestClassifier(n_estimators=idx + 1)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)
plt.figure(1)
plt.cla()
plt.plot(trees, accuracy)
# saving figure 1 as pdf
plt.savefig(working_folder+'M4A2fig1.pdf')
plt.ion()
plt.show()
Figure 2.1 Random forest accuracy (test-set accuracy plotted against the number of trees; values between roughly 0.58 and 0.72)
The accuracy of the random forest starts at about 66% with the first decision tree and increases as more trees are added to
the forest (about 72% with 12 trees). This suggests that relying on multiple decision trees, rather than a single tree, is
appropriate for this case study.
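Since every forest in the loop above is grown with a different random sample of rows and features, the accuracy curve is
somewhat noisy. Averaging the accuracy over a few repetitions per forest size, as in the sketch below (a possible refinement,
not part of the submitted analysis), would give a smoother picture of how many trees are really needed.
# average test accuracy over several repetitions for each forest size
n_repeats = 5
accuracy_mean = np.zeros(30)
for idx in range(30):
    scores = []
    for rep in range(n_repeats):
        clf = RandomForestClassifier(n_estimators=idx + 1, random_state=rep)
        clf.fit(pred_train, tar_train)
        scores.append(sklearn.metrics.accuracy_score(tar_test, clf.predict(pred_test)))
    accuracy_mean[idx] = np.mean(scores)
plt.figure(2)
plt.plot(range(1, 31), accuracy_mean)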