DATA ANALYSIS COLLECTION
ASSIGNMENT
Data Analysis And Interpretation Specialization
Running A Random Forest
Andrea Rubio Amorós
July 9, 2017
Module 4
Assignment 2
Data Analysis And Interpretation Specialization
Running A Random Forest M4A2
1 General
In this session, you will learn about random forests, a type of data mining algorithm that can select from among a large
number of variables those that are most important in determining the target or response variable to be explained.
Unlike single decision trees, which tend to overfit the training sample, random forests produce results that generalize well to new data.
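To illustrate the generalization claim, here is a minimal sketch that compares a single decision tree with a random forest on synthetic data. The data set from `make_classification`, the sample size, and the `random_state` values are assumptions chosen for illustration; they are not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): 400 rows, 13 features, some label noise.
X, y = make_classification(n_samples=400, n_features=13, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# One fully grown tree versus a forest of 30 trees.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=30, random_state=0).fit(X_train, y_train)

# (train accuracy, test accuracy) for each model.
scores = {
    "tree": (tree.score(X_train, y_train), tree.score(X_test, y_test)),
    "forest": (forest.score(X_train, y_train), forest.score(X_test, y_test)),
}
for name, (train_acc, test_acc) in scores.items():
    print(f"{name}: train accuracy {train_acc:.2f}, test accuracy {test_acc:.2f}")
```

A fully grown tree usually fits its training data perfectly, so the gap between its train and test accuracy is the number to look at. The forest's test accuracy is often, though not always, higher.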
Document written in LaTeX, template_version_01.tex
2 Python Code
For this week’s assignment, I performed a random forest analysis to evaluate the importance of a series of explanatory
variables in predicting alcohol consumption among students (my response variable).
To begin, I import the necessary libraries, read in my data set, and drop all rows that contain NaNs.
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import matplotlib.pylab as plt
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
# Feature Importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier
# folder that holds the data and output files (assumption: not defined in the original listing)
working_folder = "./"
# saving the python console window as text file
import sys
sys.stdout = open(working_folder + "M4A2output.txt", "w")
# reading in the data set we want to work with
mydata = pd.read_csv(working_folder + "M4A2data_student-mat.csv", low_memory=False)
# cleaning my data from NaNs (copy() so that new columns can be added without a pandas warning)
mydata_clean = mydata.dropna().copy()
mydata_clean.dtypes
mydata_clean.describe()
Then I recode every categorical variable that I am going to use as a binary variable (0, 1) and build a new data set that contains only the selected variables.
# recode variable observations to 0, 1
def GENDER(x):
    if x['sex'] == "M":
        return 0
    else:
        return 1
mydata_clean['GENDER'] = mydata_clean.apply(GENDER, axis=1)
def HOMETYPE(x):
    if x['address'] == "U":
        return 0
    else:
        return 1
mydata_clean['HOMETYPE'] = mydata_clean.apply(HOMETYPE, axis=1)
def FAMSIZE(x):
    if x['famsize'] == "LE3":
        return 0
    else:
        return 1
mydata_clean['FAMSIZE'] = mydata_clean.apply(FAMSIZE, axis=1)
def PSTATUS(x):
    if x['Pstatus'] == "T":
        return 0
    else:
        return 1
mydata_clean['PSTATUS'] = mydata_clean.apply(PSTATUS, axis=1)
def ACTIVITIES(x):
    if x['activities'] == "no":
        return 0
    else:
        return 1
mydata_clean['ACTIVITIES'] = mydata_clean.apply(ACTIVITIES, axis=1)
def INTERNET(x):
    if x['internet'] == "no":
        return 0
    else:
        return 1
mydata_clean['INTERNET'] = mydata_clean.apply(INTERNET, axis=1)
def ROMANTIC(x):
    if x['romantic'] == "no":
        return 0
    else:
        return 1
mydata_clean['ROMANTIC'] = mydata_clean.apply(ROMANTIC, axis=1)
def FAMREL(x):
    if x['famrel'] < 3:
        return 0
    else:
        return 1
mydata_clean['FAMREL'] = mydata_clean.apply(FAMREL, axis=1)
def FREETIME(x):
    if x['freetime'] < 3:
        return 0
    else:
        return 1
mydata_clean['FREETIME'] = mydata_clean.apply(FREETIME, axis=1)
def GOOUT(x):
    if x['goout'] < 3:
        return 0
    else:
        return 1
mydata_clean['GOOUT'] = mydata_clean.apply(GOOUT, axis=1)
def WALC(x):
    if x['Walc'] < 3:
        return 0
    else:
        return 1
mydata_clean['WALC'] = mydata_clean.apply(WALC, axis=1)
# set explanatory (predictors) and response (target) variables
predictors = mydata_clean[['GENDER', 'age', 'HOMETYPE', 'FAMSIZE', 'PSTATUS', 'ACTIVITIES', 'INTERNET',
                           'ROMANTIC', 'FAMREL', 'studytime', 'FREETIME', 'GOOUT', 'absences']]
target = mydata_clean.WALC
Re-coded explanatory variables (binary, categorical):
• GENDER - student’s sex (0 = Male or 1 = Female)
• HOMETYPE - student’s home address type (0 = urban or 1 = rural)
• FAMSIZE - family size (0 = less than or equal to 3 or 1 = greater than 3)
• PSTATUS - parents’ cohabitation status (0 = living together or 1 = apart)
• ACTIVITIES - extra-curricular activities (0 = no or 1 = yes)
• INTERNET - Internet access at home (0 = no or 1 = yes)
• ROMANTIC - in a romantic relationship (0 = no or 1 = yes)
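• FAMREL - quality of family relationships (0 = rating below 3 or 1 = rating of 3 or above, on the original scale from 1 to 5)
• FREETIME - free time after school (0 = rating below 3 or 1 = rating of 3 or above, on the original scale from 1 to 5)
• GOOUT - going out with friends (0 = rating below 3 or 1 = rating of 3 or above, on the original scale from 1 to 5)
Explanatory variables (quantitative):
• age - student’s age (from 15 to 22)
• studytime - weekly study time (ordinal scale from 1 to 4)
• absences - number of school absences (from 0 to 93)
Response variable (binary, categorical):
• WALC - weekend alcohol consumption (0 = rating below 3 or 1 = rating of 3 or above, on the original scale from 1 to 5)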
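The recoding above can also be written without one helper function per variable. The following is a minimal sketch, assuming the same column names and category labels as in the code above; the three-row `demo` frame is made-up data for illustration only, and only a few of the variables are shown.

```python
import pandas as pd

# Made-up rows (assumption) with a few of the columns used in the assignment.
demo = pd.DataFrame({
    "sex": ["M", "F", "F"],
    "address": ["U", "R", "U"],
    "internet": ["no", "yes", "yes"],
    "famrel": [2, 3, 5],
    "Walc": [1, 3, 4],
})

# Category that maps to 0 for each text variable; every other value maps to 1.
zero_label = {"sex": "M", "address": "U", "internet": "no"}
new_name = {"sex": "GENDER", "address": "HOMETYPE", "internet": "INTERNET"}
for column, label in zero_label.items():
    demo[new_name[column]] = (demo[column] != label).astype(int)

# Ratings on the 1 to 5 scale: below 3 becomes 0, 3 or above becomes 1.
for column in ["famrel", "Walc"]:
    demo[column.upper()] = (demo[column] >= 3).astype(int)

print(demo[["GENDER", "HOMETYPE", "INTERNET", "FAMREL", "WALC"]])
```

Comparing a whole column at once avoids the row-by-row `apply` call, which matters little for a few hundred rows but keeps the recoding rules in one place.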
Once my variables are defined, I use the train_test_split function to split the predictors and the target into a training set and a test set:
# split into training and testing sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.4)
print('train sample')
print(pred_train.shape)
print('\ntest sample')
print(pred_test.shape)
tar_train.shape
tar_test.shape
train sample
(237, 13)
test sample
(158, 13)
• The train sample has 237 observations or rows, 60% of the original sample, and 13 explanatory variables.
• The test sample has 158 observations or rows, 40% of the original sample, and again 13 explanatory variables.
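Because `train_test_split` shuffles the rows at random, each run of the code above produces a different split and therefore slightly different results. A minimal sketch of a reproducible split follows; the `random_state` value, the `stratify` argument, and the placeholder arrays are assumptions added for illustration and were not used in the original analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 395 rows and 13 columns, like the cleaned data set.
rng = np.random.default_rng(0)
X = rng.normal(size=(395, 13))
y = rng.integers(0, 2, size=395)

# random_state fixes the shuffle; stratify keeps the 0/1 ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)
```

With a fixed `random_state`, the confusion matrix and accuracy reported further down could be reproduced exactly on a second run.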
I initialize the RandomForestClassifier from scikit-learn with 30 trees, fit it to the training data, and then use the predict function together with the confusion matrix and accuracy score functions to study the classification accuracy of my random forest.
# Build model on training data
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=30)
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)
# show number of true and false negatives and positives
print('\nPredict function - show number of true and false negatives and positives')
print(sklearn.metrics.confusion_matrix(tar_test, predictions))
# show classification accuracy in percentage
print('\nConfusion matrix function - show classification accuracy in percentage')
print(sklearn.metrics.accuracy_score(tar_test, predictions))
Predict function - show number of true and false negatives and positives
[[76 19]
[33 30]]
Confusion matrix function - show classification accuracy in percentage
0.670886075949
The accuracy of the random forest is 67%: 106 of the 158 test observations (76 + 30) are classified correctly.
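The accuracy can be recomputed by hand from the confusion matrix printed above. The sketch below assumes scikit-learn's layout, in which rows are the true classes and columns the predicted classes, so the first row holds the students whose true WALC value is 0. The variable names are illustrative.

```python
# Confusion matrix copied from the output above: rows = true class, columns = predicted class.
confusion = [[76, 19],
             [33, 30]]
(tn, fp), (fn, tp) = confusion

total = tn + fp + fn + tp
accuracy = (tn + tp) / total   # share of all test students classified correctly
sensitivity = tp / (tp + fn)   # share of true WALC = 1 students that were found
specificity = tn / (tn + fp)   # share of true WALC = 0 students that were found

print(f"accuracy    = {accuracy:.3f}")
print(f"sensitivity = {sensitivity:.3f}")
print(f"specificity = {specificity:.3f}")
```

With these numbers the forest recognizes 80% of the students in class 0 but only about 48% of those in class 1, a detail that the overall accuracy of 67% hides.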
I also display the relative importance of each explanatory variable for predicting my response variable, using an extra-trees classifier fitted to the same training data.
# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train,tar_train)
# display the relative importance of each attribute
print('\nRelative importance of each attribute')
print(model.feature_importances_)
Relative importance of each attribute
[ 0.06641071 0.15107883 0.05013939 0.04982962 0.03054427 0.06166548
0.0248313 0.07963848 0.02088957 0.10560803 0.03823574 0.12249413
0.19863446]
The explanatory variables with the highest relative importance scores are absences (0.20), age (0.15) and going out
with friends (0.12).
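The printed array is easier to read when each score is paired with its variable name. A minimal sketch, assuming the scores appear in the same order as the columns of `predictors` (which is how scikit-learn reports them); the numbers are copied from the output above.

```python
names = ['GENDER', 'age', 'HOMETYPE', 'FAMSIZE', 'PSTATUS', 'ACTIVITIES', 'INTERNET',
         'ROMANTIC', 'FAMREL', 'studytime', 'FREETIME', 'GOOUT', 'absences']
importances = [0.06641071, 0.15107883, 0.05013939, 0.04982962, 0.03054427, 0.06166548,
               0.0248313, 0.07963848, 0.02088957, 0.10560803, 0.03823574, 0.12249413,
               0.19863446]

# Sort the (name, score) pairs from most to least important.
ranking = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name:<10} {score:.3f}")
```

The scores sum to 1, so each one can be read as a share of the total importance.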
To conclude, I fit forests of 1 to 30 trees and plot the test accuracy of each forest in a line graph, to visualize how the accuracy changes as trees are added to my forest.
trees = range(30)
accuracy = np.zeros(30)
# fit a forest with idx + 1 trees for idx = 0..29 and record its test accuracy
for idx in range(len(trees)):
    classifier = RandomForestClassifier(n_estimators=idx + 1)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)
plt.figure(1)
plt.cla()
# the x value idx corresponds to a forest of idx + 1 trees
plt.plot(trees, accuracy)
# saving figure 1 as pdf
plt.savefig(working_folder + 'M4A2fig1.pdf')
plt.ion()
plt.show()
[Line plot: accuracy on the y-axis (ticks from 0.58 to 0.72) against the tree index on the x-axis (ticks from 0 to 30)]
Figure 2.1 Random forest accuracy
The accuracy of the random forest starts at 66% with a single decision tree and increases as more trees are added to my forest (72% with 12 trees). This suggests that combining multiple decision trees, rather than interpreting a single tree, may be appropriate for this case study.
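One way to see why accuracy tends to rise as trees are added is to treat the forest as a majority vote. The sketch below is a simplified model, not the scikit-learn algorithm: it assumes every tree is correct with the same probability `p` and that the trees err independently, which real trees trained on the same data do not. The function name and the value `p = 0.6` are illustrative assumptions.

```python
from math import comb

def majority_vote_accuracy(n_trees: int, p: float) -> float:
    """Probability that more than half of n_trees independent voters are correct."""
    needed = n_trees // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n_trees, k) * p**k * (1 - p)**(n_trees - k)
               for k in range(needed, n_trees + 1))

# Odd forest sizes avoid ties in the vote.
for n_trees in (1, 5, 15, 29):
    print(n_trees, round(majority_vote_accuracy(n_trees, p=0.6), 3))
```

Because real trees are correlated, the gain in practice is smaller than this idealized model suggests and levels off as the forest grows.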
Document written in LATEX
template_version_01.tex
6

More Related Content

What's hot

MS SQL SERVER: Microsoft sequence clustering and association rules
MS SQL SERVER: Microsoft sequence clustering and association rulesMS SQL SERVER: Microsoft sequence clustering and association rules
MS SQL SERVER: Microsoft sequence clustering and association rulesDataminingTools Inc
 
Observations
ObservationsObservations
Observationsbutest
 
Data Assessment and Analysis for Model Evaluation
Data Assessment and Analysis for Model Evaluation Data Assessment and Analysis for Model Evaluation
Data Assessment and Analysis for Model Evaluation SaravanakumarSekar4
 
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...IJRES Journal
 
Arrays In Python | Python Array Operations | Edureka
Arrays In Python | Python Array Operations | EdurekaArrays In Python | Python Array Operations | Edureka
Arrays In Python | Python Array Operations | EdurekaEdureka!
 
Set data structure
Set data structure Set data structure
Set data structure Tech_MX
 
Spock the enterprise ready specifiation framework - Ted Vinke
Spock the enterprise ready specifiation framework - Ted VinkeSpock the enterprise ready specifiation framework - Ted Vinke
Spock the enterprise ready specifiation framework - Ted VinkeTed Vinke
 
Introduction to Data Structures & Algorithms
Introduction to Data Structures & AlgorithmsIntroduction to Data Structures & Algorithms
Introduction to Data Structures & AlgorithmsAfaq Mansoor Khan
 
Key functions in_oracle_sql
Key functions in_oracle_sqlKey functions in_oracle_sql
Key functions in_oracle_sqlpgolhar
 
Python Dictionaries and Sets
Python Dictionaries and SetsPython Dictionaries and Sets
Python Dictionaries and SetsNicole Ryan
 

What's hot (20)

Java arrays (1)
Java arrays (1)Java arrays (1)
Java arrays (1)
 
R교육1
R교육1R교육1
R교육1
 
MS SQL SERVER: Microsoft sequence clustering and association rules
MS SQL SERVER: Microsoft sequence clustering and association rulesMS SQL SERVER: Microsoft sequence clustering and association rules
MS SQL SERVER: Microsoft sequence clustering and association rules
 
8 python data structure-1
8 python data structure-18 python data structure-1
8 python data structure-1
 
Observations
ObservationsObservations
Observations
 
Data Assessment and Analysis for Model Evaluation
Data Assessment and Analysis for Model Evaluation Data Assessment and Analysis for Model Evaluation
Data Assessment and Analysis for Model Evaluation
 
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
 
Arrays In Python | Python Array Operations | Edureka
Arrays In Python | Python Array Operations | EdurekaArrays In Python | Python Array Operations | Edureka
Arrays In Python | Python Array Operations | Edureka
 
Deep Factor Model
Deep Factor ModelDeep Factor Model
Deep Factor Model
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
R training2
R training2R training2
R training2
 
Set data structure
Set data structure Set data structure
Set data structure
 
LectureNotes-05-DSA
LectureNotes-05-DSALectureNotes-05-DSA
LectureNotes-05-DSA
 
Dictionaries in Python
Dictionaries in PythonDictionaries in Python
Dictionaries in Python
 
Spock the enterprise ready specifiation framework - Ted Vinke
Spock the enterprise ready specifiation framework - Ted VinkeSpock the enterprise ready specifiation framework - Ted Vinke
Spock the enterprise ready specifiation framework - Ted Vinke
 
Arrays
ArraysArrays
Arrays
 
Introduction to Data Structures & Algorithms
Introduction to Data Structures & AlgorithmsIntroduction to Data Structures & Algorithms
Introduction to Data Structures & Algorithms
 
Key functions in_oracle_sql
Key functions in_oracle_sqlKey functions in_oracle_sql
Key functions in_oracle_sql
 
Python Dictionaries and Sets
Python Dictionaries and SetsPython Dictionaries and Sets
Python Dictionaries and Sets
 
Sorting
SortingSorting
Sorting
 

Similar to [M4A2] Data Analysis and Interpretation Specialization

Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Yao Yao
 
[M3A4] Data Analysis and Interpretation Specialization
[M3A4] Data Analysis and Interpretation Specialization[M3A4] Data Analysis and Interpretation Specialization
[M3A4] Data Analysis and Interpretation SpecializationAndrea Rubio
 
Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...Yao Yao
 
MATLAB/SIMULINK for Engineering Applications day 2:Introduction to simulink
MATLAB/SIMULINK for Engineering Applications day 2:Introduction to simulinkMATLAB/SIMULINK for Engineering Applications day 2:Introduction to simulink
MATLAB/SIMULINK for Engineering Applications day 2:Introduction to simulinkreddyprasad reddyvari
 
Introduction to matlab
Introduction to matlabIntroduction to matlab
Introduction to matlabBilawalBaloch1
 
[M4A1] Data Analysis and Interpretation Specialization
[M4A1] Data Analysis and Interpretation Specialization[M4A1] Data Analysis and Interpretation Specialization
[M4A1] Data Analysis and Interpretation SpecializationAndrea Rubio
 
20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdf20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdfMariaKhan905189
 
Multiclass Logistic Regression: Derivation and Apache Spark Examples
Multiclass Logistic Regression: Derivation and Apache Spark ExamplesMulticlass Logistic Regression: Derivation and Apache Spark Examples
Multiclass Logistic Regression: Derivation and Apache Spark ExamplesMarjan Sterjev
 
Reaction StatisticsBackgroundWhen collecting experimental data f.pdf
Reaction StatisticsBackgroundWhen collecting experimental data f.pdfReaction StatisticsBackgroundWhen collecting experimental data f.pdf
Reaction StatisticsBackgroundWhen collecting experimental data f.pdffashionbigchennai
 
PCA and LDA in machine learning
PCA and LDA in machine learningPCA and LDA in machine learning
PCA and LDA in machine learningAkhilesh Joshi
 
Mca ii dfs u-1 introduction to data structure
Mca ii dfs u-1 introduction to data structureMca ii dfs u-1 introduction to data structure
Mca ii dfs u-1 introduction to data structureRai University
 
maXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VIImaXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VIIMax Kleiner
 
Chapter 02-logistic regression
Chapter 02-logistic regressionChapter 02-logistic regression
Chapter 02-logistic regressionRaman Kannan
 
[M3A3] Data Analysis and Interpretation Specialization
[M3A3] Data Analysis and Interpretation Specialization [M3A3] Data Analysis and Interpretation Specialization
[M3A3] Data Analysis and Interpretation Specialization Andrea Rubio
 
Rcommands-for those who interested in R.
Rcommands-for those who interested in R.Rcommands-for those who interested in R.
Rcommands-for those who interested in R.Dr. Volkan OBAN
 
wk5ppt1_Titanic
wk5ppt1_Titanicwk5ppt1_Titanic
wk5ppt1_TitanicAliciaWei1
 

Similar to [M4A2] Data Analysis and Interpretation Specialization (20)

Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
 
[M3A4] Data Analysis and Interpretation Specialization
[M3A4] Data Analysis and Interpretation Specialization[M3A4] Data Analysis and Interpretation Specialization
[M3A4] Data Analysis and Interpretation Specialization
 
Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...
 
MATLAB/SIMULINK for Engineering Applications day 2:Introduction to simulink
MATLAB/SIMULINK for Engineering Applications day 2:Introduction to simulinkMATLAB/SIMULINK for Engineering Applications day 2:Introduction to simulink
MATLAB/SIMULINK for Engineering Applications day 2:Introduction to simulink
 
Introduction to matlab
Introduction to matlabIntroduction to matlab
Introduction to matlab
 
[M4A1] Data Analysis and Interpretation Specialization
[M4A1] Data Analysis and Interpretation Specialization[M4A1] Data Analysis and Interpretation Specialization
[M4A1] Data Analysis and Interpretation Specialization
 
R code for data manipulation
R code for data manipulationR code for data manipulation
R code for data manipulation
 
R code for data manipulation
R code for data manipulationR code for data manipulation
R code for data manipulation
 
20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdf20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdf
 
Decision Tree.pptx
Decision Tree.pptxDecision Tree.pptx
Decision Tree.pptx
 
Multiclass Logistic Regression: Derivation and Apache Spark Examples
Multiclass Logistic Regression: Derivation and Apache Spark ExamplesMulticlass Logistic Regression: Derivation and Apache Spark Examples
Multiclass Logistic Regression: Derivation and Apache Spark Examples
 
PythonML.pptx
PythonML.pptxPythonML.pptx
PythonML.pptx
 
Reaction StatisticsBackgroundWhen collecting experimental data f.pdf
Reaction StatisticsBackgroundWhen collecting experimental data f.pdfReaction StatisticsBackgroundWhen collecting experimental data f.pdf
Reaction StatisticsBackgroundWhen collecting experimental data f.pdf
 
PCA and LDA in machine learning
PCA and LDA in machine learningPCA and LDA in machine learning
PCA and LDA in machine learning
 
Mca ii dfs u-1 introduction to data structure
Mca ii dfs u-1 introduction to data structureMca ii dfs u-1 introduction to data structure
Mca ii dfs u-1 introduction to data structure
 
maXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VIImaXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VII
 
Chapter 02-logistic regression
Chapter 02-logistic regressionChapter 02-logistic regression
Chapter 02-logistic regression
 
[M3A3] Data Analysis and Interpretation Specialization
[M3A3] Data Analysis and Interpretation Specialization [M3A3] Data Analysis and Interpretation Specialization
[M3A3] Data Analysis and Interpretation Specialization
 
Rcommands-for those who interested in R.
Rcommands-for those who interested in R.Rcommands-for those who interested in R.
Rcommands-for those who interested in R.
 
wk5ppt1_Titanic
wk5ppt1_Titanicwk5ppt1_Titanic
wk5ppt1_Titanic
 

Recently uploaded

Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Types of different blotting techniques.pptx
Types of different blotting techniques.pptxTypes of different blotting techniques.pptx
Types of different blotting techniques.pptxkhadijarafiq2012
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfSwapnil Therkar
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PPRINCE C P
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 sciencefloriejanemacaya1
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physicsvishikhakeshava1
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡anilsa9823
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxSwapnil Therkar
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRDelhi Call girls
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Patrick Diehl
 
Caco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionCaco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionPriyansha Singh
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisDiwakar Mishra
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxpradhanghanshyam7136
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 

Recently uploaded (20)

Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Types of different blotting techniques.pptx
Types of different blotting techniques.pptxTypes of different blotting techniques.pptx
Types of different blotting techniques.pptx
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C P
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 science
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physics
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?
 
Caco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionCaco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorption
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptx
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 

[M4A2] Data Analysis and Interpretation Specialization

  • 1. DATA ANALYSIS COLLECTION ASSIGNMENT Data Analysis And Interpretation Specialization Running A Random Forest Andrea Rubio Amorós July 9, 2017 Modul 4 Assignment 2
  • 2. Data Analysis And Interpretation Specialization Running A Random Forest M4A2 1 General In this session, you will learn about random forests, a type of data mining algorithm that can select from among a large number of variables those that are most important in determining the target or response variable to be explained. Unlike decision trees, the results of random forests generalize well to new data. Document written in LATEX template_version_01.tex 2
  • 3. Data Analysis And Interpretation Specialization Running A Random Forest M4A2 2 Python Code For this week’s assignment, I performed a random forest analysis to evaluate the importance of a series of explanatory variables in predicting alcohol consumption among students (my response variable). To start writing my python code, I import all necessary libraries, call-in my dataset and clean it from NANs. import pandas as pd from pandas import Series, DataFrame import numpy as np import matplotlib.pylab as plt from sklearn.cross_validation import train_test_split from sklearn.tree import DecisionTreeClassifier from sklearn.metrics import classification_report import sklearn.metrics # Feature Importance from sklearn import datasets from sklearn.ensemble import ExtraTreesClassifier # saving the python console window as text file import sys sys.stdout =open(working_folder+"M4A2output.txt","w") # reading in the data set we want to work with mydata = pd.read_csv(working_folder+"M4A2data_student-mat.csv",low_memory=False) # cleaning my data from NaNs mydata_clean = mydata.dropna() mydata_clean.dtypes mydata_clean.describe() Then I re-code all categorical variables that I’m going to use to binary (0,1), and create my new dataset only with the selected variables. # recode variable observations to 0, 1 def GENDER(x): if x['sex'] == "M": return 0 else: return 1 mydata_clean['GENDER'] = mydata_clean.apply(lambda x: GENDER(x), axis = 1) def HOMETYPE(x): if x['address'] == "U": return 0 else: return 1 mydata_clean['HOMETYPE'] = mydata_clean.apply(lambda x: HOMETYPE(x), axis = 1) def FAMSIZE(x): if x['famsize'] == "LE3": return 0 else: return 1 mydata_clean['FAMSIZE'] = mydata_clean.apply(lambda x: FAMSIZE(x), axis = 1) def PSTATUS(x): if x['Pstatus'] == "T": return 0 else: return 1 mydata_clean['PSTATUS'] = mydata_clean.apply(lambda x: PSTATUS(x), axis = 1) def ACTIVITIES(x): if x['activities'] == "no": return 0 Document written in LATEX template_version_01.tex 3
  • 4. Data Analysis And Interpretation Specialization Running A Random Forest M4A2 else: return 1 mydata_clean['ACTIVITIES'] = mydata_clean.apply(lambda x: ACTIVITIES(x), axis = 1) def INTERNET(x): if x['internet'] == "no": return 0 else: return 1 mydata_clean['INTERNET'] = mydata_clean.apply(lambda x: INTERNET(x), axis = 1) def ROMANTIC(x): if x['romantic'] == "no": return 0 else: return 1 mydata_clean['ROMANTIC'] = mydata_clean.apply(lambda x: ROMANTIC(x), axis = 1) def FAMREL(x): if x['famrel'] < 3: return 0 else: return 1 mydata_clean['FAMREL'] = mydata_clean.apply(lambda x: FAMREL(x), axis = 1) def FREETIME(x): if x['freetime'] < 3: return 0 else: return 1 mydata_clean['FREETIME'] = mydata_clean.apply(lambda x: FREETIME(x), axis = 1) def GOOUT(x): if x['goout'] < 3: return 0 else: return 1 mydata_clean['GOOUT'] = mydata_clean.apply(lambda x: GOOUT(x), axis = 1) def WALC(x): if x['Walc'] < 3: return 0 else: return 1 mydata_clean['WALC'] = mydata_clean.apply(lambda x: WALC(x), axis = 1) # set explanatory (predictors) and response (target) variables predictors = mydata_clean[['GENDER', 'age', 'HOMETYPE', 'FAMSIZE', 'PSTATUS', 'ACTIVITIES', 'INTERNET', 'ROMANTIC', 'FAMREL', 'studytime', 'FREETIME', 'GOOUT', 'absences']] target = mydata_clean.WALC Re-coded explanatory variables (binary, categorical): • GENDER - student’s sex (0 = Male or 1 = Female) • HOMETYPE - student’s home address type (0 = urban or 1 = rural) • FAMSIZE - family size (0 = less or equal to 3 or 1 = greater than 3) • PSTATUS - parent’s cohabitation status (0 = living together or 1 = apart) • ACTIVITIES - extra-curricular activities (0 = no or 1 = yes) • INTERNET - Internet access at home (0 = no or 1 = yes) • ROMANTIC - with a romantic relationship (0 = no or 1 = yes) Document written in LATEX template_version_01.tex 4
• FAMREL - quality of family relationships (0 = bad, original score below 3; 1 = good, original score 3 or above)
• FREETIME - free time after school (0 = low, original score below 3; 1 = high, original score 3 or above)
• GOOUT - going out with friends (0 = low, original score below 3; 1 = high, original score 3 or above)

Explanatory variables (quantitative):
• age - student’s age (from 15 to 22)
• studytime - weekly study time (kept in its original coding)
• absences - number of school absences (from 0 to 93)

Response variable (binary, categorical):
• WALC - weekend alcohol consumption (0 = low, original score below 3; 1 = high, original score 3 or above)

Once my variables are defined, I use the train_test_split function to split predictors and target into a training set and a test set:

    # split into training and testing sets
    pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.4)
    print('train sample')
    print(pred_train.shape)
    print('\ntest sample')
    print(pred_test.shape)
    tar_train.shape
    tar_test.shape

    train sample
    (237, 13)

    test sample
    (158, 13)

• The train sample has 237 observations or rows, 60% of the original sample, and 13 explanatory variables.
• The test sample has 158 observations or rows, 40% of the original sample, and the same 13 explanatory variables.

I initialize the RandomForestClassifier from scikit-learn and use the predict function and the confusion matrix function to study the classification accuracy of my random forest.
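Both the split and the forest are randomised, so the sample rows and the accuracy figures reported in this section change from run to run. The following is a minimal sketch of pinning both with random_state, not part of the original analysis. The synthetic data, the helper run_once and the seed 42 are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 395 rows, 13 binary predictors, binary target.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(395, 13))
y = (X[:, 0] + X[:, 1] + rng.randint(0, 2, size=395) >= 2).astype(int)

def run_once(seed):
    """Split 60/40 and fit a 30-tree forest, both seeded with `seed`."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=seed)
    forest = RandomForestClassifier(n_estimators=30, random_state=seed)
    forest.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, forest.predict(X_test))
    return X_train.shape, X_test.shape, accuracy

first = run_once(42)
second = run_once(42)

print(first[0], first[1])      # (237, 13) (158, 13)
print(first[2] == second[2])   # True: same seed, same split, same forest
```

With a fixed seed, re-running the script reproduces the same confusion matrix and accuracy, which makes the reported numbers easier to check.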
    # build model on training data
    from sklearn.ensemble import RandomForestClassifier
    classifier = RandomForestClassifier(n_estimators=30)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)

    # show number of true and false negatives and positives
    print('\nPredict function - show number of true and false negatives and positives')
    print(sklearn.metrics.confusion_matrix(tar_test, predictions))

    # show classification accuracy (as a fraction between 0 and 1)
    print('\nConfusion matrix function - show classification accuracy in percentage')
    print(sklearn.metrics.accuracy_score(tar_test, predictions))

    Predict function - show number of true and false negatives and positives
    [[76 19]
     [33 30]]
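The confusion matrix printed above can be read cell by cell. In scikit-learn's layout the rows are the true classes and the columns the predicted classes, so the four cells are true negatives, false positives, false negatives and true positives. The sketch below is an illustration rather than part of the original script: it recomputes the usual summary figures from those four counts. The variable names are my own, and the majority-class baseline is an added point of comparison.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class.
cm = np.array([[76, 19],
               [33, 30]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()             # share of all test cases classified correctly
sensitivity = tp / (tp + fn)                # share of true WALC = 1 cases that were found
specificity = tn / (tn + fp)                # share of true WALC = 0 cases that were found
precision = tp / (tp + fp)                  # share of predicted WALC = 1 that were correct
baseline = cm.sum(axis=1).max() / cm.sum()  # accuracy of always predicting the larger class

print(f"accuracy    {accuracy:.3f}")     # accuracy    0.671
print(f"sensitivity {sensitivity:.3f}")  # sensitivity 0.476
print(f"specificity {specificity:.3f}")  # specificity 0.800
print(f"precision   {precision:.3f}")    # precision   0.612
print(f"baseline    {baseline:.3f}")     # baseline    0.601
```

Read this way, the forest identifies low weekend drinkers far more reliably than high ones, and its overall accuracy sits about seven percentage points above the majority-class baseline.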
    Confusion matrix function - show classification accuracy in percentage
    0.670886075949

The accuracy of the random forest is 67%.

I also display the importance of each explanatory variable in relation to my response variable.

    # fit an Extra Trees model to the data
    model = ExtraTreesClassifier()
    model.fit(pred_train, tar_train)

    # display the relative importance of each attribute
    print('\nRelative importance of each attribute')
    print(model.feature_importances_)

    Relative importance of each attribute
    [ 0.06641071 0.15107883 0.05013939 0.04982962 0.03054427 0.06166548
      0.0248313 0.07963848 0.02088957 0.10560803 0.03823574 0.12249413
      0.19863446]

The explanatory variables with the highest relative importance scores are absences (0.20), age (0.15) and going out with friends (0.12).

To conclude, I plot the test accuracy of forests grown with 1 to 30 trees in a line graph, to visualize how adding trees to my forest affects accuracy.

    trees = range(30)
    accuracy = np.zeros(30)
    for idx in range(len(trees)):
        # the forest at position idx is grown with idx + 1 trees
        classifier = RandomForestClassifier(n_estimators=idx + 1)
        classifier = classifier.fit(pred_train, tar_train)
        predictions = classifier.predict(pred_test)
        accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)

    plt.figure(1)
    plt.cla()
    plt.plot(trees, accuracy)
    # saving figure 1 as PDF
    plt.savefig(working_folder + 'M4A2fig1.pdf')
    plt.ion()
    plt.show()

Figure 2.1 Random forest accuracy (line plot; x-axis ticks from 0 to 30, y-axis ticks from 0.58 to 0.72)

The accuracy of the random forest starts at 66% with a single decision tree and rises as more trees are added to the forest (72% with 12 trees). This suggests that combining multiple decision trees, rather than interpreting a single one, may be appropriate for this case study.
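The relative importance array reported earlier in this section is printed without labels, so each score has to be matched to a predictor by position. A minimal sketch of that pairing follows, using the column order of the predictors list and the scores exactly as printed. The variable names are illustrative and not taken from the original script.

```python
# Column order of the predictors, as defined when the predictors were selected.
names = ['GENDER', 'age', 'HOMETYPE', 'FAMSIZE', 'PSTATUS', 'ACTIVITIES',
         'INTERNET', 'ROMANTIC', 'FAMREL', 'studytime', 'FREETIME',
         'GOOUT', 'absences']

# Scores copied from the printed feature_importances_ output.
importances = [0.06641071, 0.15107883, 0.05013939, 0.04982962, 0.03054427,
               0.06166548, 0.0248313, 0.07963848, 0.02088957, 0.10560803,
               0.03823574, 0.12249413, 0.19863446]

# Pair each name with its score and sort from most to least important.
ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)

for name, score in ranked[:3]:
    print(f"{name:<10s} {score:.2f}")
# absences   0.20
# age        0.15
# GOOUT      0.12
```

In a live session the same pairing could be built from predictors.columns and model.feature_importances_ instead of hard-coded lists. Note that these scores come from the Extra Trees model, not from the random forest whose accuracy is reported.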