SlideShare a Scribd company logo
1 of 10
Download to read offline
DATA ANALYSIS COLLECTION
ASSIGNMENT
Data Analysis And Interpretation Specialization
Test A Basic Linear Regression Model
Andrea Rubio Amorós
June 15, 2017
Modul 3
Assignment 2
Data Analysis And Interpretation Specialization
Test A Basic Linear Regression Model M3A2
1 Introduction
In this assignment, I discuss more about the importance of testing for confounding, and provide examples of situ-
ations in which a confounding variable can explain the association between an explanatory and response variable.
In addition, now that I have statistically tested the association between an explanatory variable and your response
variable, I will test and interpret this association using basic linear regression analysis for a quantitative response
variable. You will also learn about how the linear regression model can be used to predict your observed response
variable. Finally, I will also discuss the statistical assumptions underlying the linear regression model, and show you
some best practices for coding your explanatory variables. Note that if my research question does not include one
quantitative response variable, I can use one from your data set just to get some practice with the tool.
Document written in LATEX
template_version_01.tex
2
Data Analysis And Interpretation Specialization
Test A Basic Linear Regression Model M3A2
2 Python Code
For this week’s assignment, I will continue working with the Global Terrorism Database (GTD). I have selected a
quantitative explanatory variable (nperps), which indicates the total number of terrorists participating in an incident;
and a quantitative response variable (nkill), which stores the number of total confirmed fatalities for an incident.
Through a linear regression model, I want to study how strong is the relationship between this both variables. With
other words, I would like to know if there is a positive association between the number of terrorists that participated
in an incident and the number of people that got killed. Writing the program: The first step is to call in all the libraries
that I will need and to read my dataset with the “pandas” library read function. Then, I code my variables as numeric
to make sure that I’m working with quantitative values. Now I can use the OLS regression model to get a summary of
information about the studied variables.
# import libraries
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
import scipy
import statsmodels.formula.api as smf
# reading in the data set we want to work with
# as my csv is separated by ";" instead of "," I need to add sep=';' to the code
# as my csv has extrange simbols as tildes 'é' I need to add encoding='ISO8859-1' to the code
mydata = pandas.read_csv(working_folder+'M3A2data_gtd_12to15_0616dist_MSDOS.csv', sep=';',
encoding='ISO8859-1',low_memory=False)
# convert the variables to numeric format
mydata['nperps'] = pandas.to_numeric(mydata['nperps'], errors='coerce')
mydata['nkill'] = pandas.to_numeric(mydata['nkill'], errors='coerce')
mydata['nperps'] = mydata['nperps'].replace(-99, numpy.nan)
print('OLS regression model for the assotiation between n° of terrorists and total n° of fatalities')
reg1 = smf.ols('nkill ~ nperps', data=mydata).fit()
print(reg1.summary())
OLS regression model for the assotiation between n° of terrorists and total n° of fatalities
OLS Regression Results
==============================================================================
Dep. Variable: nkill R-squared: 0.034
Model: OLS Adj. R-squared: 0.034
Method: Least Squares F-statistic: 304.3
Date: Wed, 14 Jun 2017 Prob (F-statistic): 5.44e-67
Time: 17:40:52 Log-Likelihood: -31750.
No. Observations: 8519 AIC: 6.350e+04
Df Residuals: 8517 BIC: 6.352e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 3.3591 0.111 30.206 0.000 3.141 3.577
nperps 0.0301 0.002 17.445 0.000 0.027 0.034
==============================================================================
Omnibus: 13372.286 Durbin-Watson: 1.867
Prob(Omnibus): 0.000 Jarque-Bera (JB): 8999254.771
Skew: 10.006 Prob(JB): 0.00
Kurtosis: 160.964 Cond. No. 65.7
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
When looking at the output, we can first recognize on the top (red marked) our response variable (nkill). There are
8519 observations that had valid data on both the response and explanatory variables. The p value (5.44e-67) is
extremely small and that indicates a strong relationship between both variables “Number of terrorists” and “Total
number of fatalities”.
Document written in LATEX
template_version_01.tex
3
Data Analysis And Interpretation Specialization
Test A Basic Linear Regression Model M3A2
The intercept is 3.3591 and the slope is 0.0301. For both figures, the p value is smaller than 0.0001.
y = f (x) = β0 +β1 · x
= 3.3591+0.0301· x (2.1)
We can conclude saying that the results of the linear regression model indicates that the number of terrorists is sig-
nificantly and positively associated with the total number of fatalities.
Document written in LATEX
template_version_01.tex
4
My Codebook (GTD) for Module 3, week 2:
Number of Perpetrators
(nperps)
Numeric Variable
This field indicates the total number of terrorists participating in the incident. (In the instance of
multiple perpetrator groups participating in one case, the total number of perpetrators, across
groups, is recorded). There are often discrepancies in information on this value.
Where several independent credible sources report different numbers of attackers, the value of
this variable reflects the number given by the majority of sources, unless there is reason to do
otherwise. Where there is no majority figure among independent sources, the database records
the lowest proffered perpetrator figure, unless there is clear reason to do otherwise. In cases
where the number of perpetrators is stated vaguely, for example “…at least 11 attackers”, then
the lowest possible number is recorded, in this example, “11.” “-99” or “Unknown” appears
when the number of perpetrators is not reported.
Number of Perpetrators Captured
(nperpcap)
Numeric Variable
This field records the number of perpetrators taken into custody. “-99” or “Unknown” appears
when there is evidence of captured, but the number is not reported.
Divergent reports on the number of perpetrators captured are dealt with in same manner used
for the Number of Perpetrators variable described above.
Note: This field is presently only systematically available with incidents occurring after 1997.
Total Number of Fatalities
(nkill)
Numeric Variable
This field stores the number of total confirmed fatalities for the incident. The number includes
all victims and attackers who died as a direct result of the incident.
Where there is evidence of fatalities, but a figure is not reported or it is too vague to be of use,
this field remains blank. If information is missing regarding the number of victims killed in an
attack, but perpetrator fatalities are known, this value will reflect only the number of
perpetrators who died as a result of the incident. Likewise, if information on the number of
perpetrators killed in an attack is missing, but victim fatalities are known, this field will only
report the number of victims killed in the incident.
Data Analysis And Interpretation Specialization
Test A Basic Linear Regression Model M3A2
3 Codebook
Document written in LATEX
template_version_01.tex
5
Where several independent sources report different numbers of casualties, the database will
usually reflect the number given by the most recent source. However, the most recent source
will not be used if the source itself is of questionable validity or if the source bases its casualty
numbers on claims made by a perpetrator group. When there are several “most recent” sources
published around the same time, or there are concerns about the validity of a recent source, the
majority figure will be used. Where there is no majority figure among independent sources, the
database will record the lowest proffered fatality figure, unless that figure comes from a source
of questionable validity or there is another compelling reason to do otherwise. Conflicting
reports of fatalities will be noted in the “Additional Notes” field.
Number of US Fatalities
(nkillus)
Numeric Variable
This field records the number of U.S. citizens who died as a result of the incident, and follows
the conventions of “Total Number of Fatalities” described above. Thus, this field records the
number of U.S. victims and U.S. perpetrators who died as a result of the attack. The value for
this field is not limited to U.S. citizens killed on U.S. soil, but also includes U.S. citizens who died
in incidents occurring outside of the U.S.
Number of Perpetrator Fatalities
(nkillter)
Numeric Variable
Limited to only perpetrator fatalities, this field follows the conventions of the “Total
Number of Fatalities” field described above.
Total Number of Injured
(nwound)
Numeric Variable
This field records the number of confirmed non-fatal injuries to both perpetrators and victims. It
follows the conventions of the “Total Number of Fatalities” field
described above.
Number of U.S. Injured
(nwoundus)
Numeric Variable
This field records the number of confirmed non-fatal injuries to U.S. citizens, both perpetrators
and victims. It follows the conventions of the “Number of U.S.
Fatalities” field described above.
Data Analysis And Interpretation Specialization
Test A Basic Linear Regression Model M3A2
Document written in LATEX
template_version_01.tex
6
Number of Perpetrators Injured
(nwoundte)
Numeric Variable
Conventions follow the “Number of Perpetrator Fatalities” field described above.
Value of Property Damage (in USD)
(propvalue)
Numeric Variable
If “Property Damage?” is “Yes,” then the exact U.S. dollar amount (at the time of the incident) of
total damages is listed. Where applicable, property values reported in foreign currencies are
converted to U.S. dollars before being entered into the GTD. If
no dollar figure is reported, the field is left blank. That is, a blank field here does not indicate
that there was no property damage but, rather, that no precise estimate of the value was
available. The value of damages only includes direct economic effects of the incident (i.e. cost
of buildings, etc.) and not indirect economic costs (longer term effects on the company,
industry, tourism, etc.).
Protocols for recording inconsistent numbers, etc., listed above are followed (see,
for example, “Number of Perpetrators”).
Total Number of Hostages/ Kidnapping Victims
(nhostkid)
Numeric Variable
This field records the total number of hostages or kidnapping victims. For successful hijackings,
this value will reflect the total number of crew members and passengers aboard the vehicle at
the time of the incident.
As with other fields, where several independent sources report different numbers of hostages,
the GTD reflects the number given by the most recent source, unless there is reason to do
otherwise. When there are several most recent sources available, or there are question about
the validity of a recent source, the GTD will report the majority figure from a group of
independent sources. Where there is no majority figure among independent sources, the
database will record the lowest proffered hostage figure, unless there is clear reason to do
otherwise.
In cases where the number of hostages or kidnapping victims is stated vaguely, for example,
“…at least 11 hostages”, then the lowest possible number will be recorded, in this example
“11.” If the number of hostages is unknown or unidentified, this field
records “-99” (unknown).
Data Analysis And Interpretation Specialization
Test A Basic Linear Regression Model M3A2
Document written in LATEX
template_version_01.tex
7
Number of U.S. Hostages/ Kidnapping Victims
(nhostkidus)
Numeric Variable
This field reports the number of U.S. citizens that were taken hostage or kidnapped in the
incident. Conventions follow the “Total Number of Hostages/ Kidnapping
Victims” field described above.
Total Ransom Amount Demanded
(ransomamt)
Numeric Variable
If a ransom was demanded, then the amount (in U.S. dollars) is listed in this field. If a ransom
was demanded but the monetary figure is unknown, then this field will report “-99”
(unknown). If there are conflicting reports on the amount of ransom demanded, the majority
figure from independent sources will be used. If no majority exists, the lowest proffered figure
will be used, unless that figure comes from a source of questionable validity or there is another
compelling reason to do otherwise.
Ransom Amount Demanded from U.S. Sources
(ransomamtus)
Numeric Variable
If a ransom was demanded from U.S. sources, then the amount (in U.S. dollars) is listed in this
field. If a ransom was demanded from U.S. sources but the monetary figure is unknown, then
this field will report “-99” (unknown). If there are conflicting reports on the amount of ransom
demanded from a U.S. source, the majority figure from independent sources will be used. If no
majority exists, the lowest proffered
figure will be used, unless there is a reason to do otherwise.
Total Ransom Amount Paid
(ransompaid)
Numeric Variable
If a ransom amount was paid, then the amount (in U.S. dollars) is listed in this field. If a
ransom was paid but the monetary figure was unspecified, then this field will report “-99”
(unknown). A value of “-99” will also be reported for any case where it is suspected that a
ransom was paid, but it has not been confirmed. If there are conflicting reports on the
amount of ransom paid, the majority figure from independent sources will be used. If no
majority exists, the lowest proffered figure will be used, unless that figure comes from a
source of questionable validity or there is another compelling reason to do otherwise.
Data Analysis And Interpretation Specialization
Test A Basic Linear Regression Model M3A2
Document written in LATEX
template_version_01.tex
8
Ransom Amount Paid By U.S. Sources
(ransompaidus)
Numeric Variable
If a ransom amount was paid by U.S. sources, then this figure is listed in U.S. dollars. If a
ransom was paid by U.S. sources but the monetary figure was unspecified, then this field will
report “-99” (unknown). If there are conflicting reports on the amount of ransom paid by a
U.S. source, the majority figure from independent sources will be used. If no majority exists,
the lowest proffered figure will be used, unless that figure comes from a source of
questionable validity or there is another compelling reason to do otherwise.
Number Released/Escaped/Rescued
(nreleased)
Numeric Variable
If the “Attack Type” is “Hostage Taking (Kidnapping),” “Hostage Taking (Barricade Incident),” or
a successful “Hijacking,” then this field will apply. This field records the number of hostages who
survived the incident.
As with the total number of kidnapping victims, where several independent sources report
different numbers of hostages, the database will reflect the number given by the most recent
source, unless there is reason to do otherwise. If there are several most recent sources
available, the majority number form a group of independent sources will be used, unless
there is reason to do otherwise. Where there is no majority figure among independent
sources, the database will record the lowest proffered hostage released/escaped/rescued
figure, unless there is clear reason to do otherwise.
In cases where the number of hostages released/escaped/rescued is stated vaguely, for example
“…at least 11 hostages were released”, then the lowest possible number will be recorded, in this
example “11.” If the fate of the hostages is
unknown, this field will record “-99.”
Data Analysis And Interpretation Specialization
Test A Basic Linear Regression Model M3A2
Document written in LATEX
template_version_01.tex
9
Data Analysis And Interpretation Specialization
Test A Basic Linear Regression Model M3A2
4 List Of Abbreviations
GTD Global Terrorism Database
Document written in LATEX
template_version_01.tex
10

More Related Content

Similar to [M3A2] Data Analysis and Interpretation Specialization

Data What Type Of Data Do You Have V2.1
Data   What Type Of Data Do You Have V2.1Data   What Type Of Data Do You Have V2.1
Data What Type Of Data Do You Have V2.1TimKasse
 
Reducing False Positives
Reducing False PositivesReducing False Positives
Reducing False PositivesMayank Johri
 
Time Series Project
Time Series Project Time Series Project
Time Series Project Sean Cahill
 
Predicting the Crimes in Chicago
Predicting the Crimes in Chicago Predicting the Crimes in Chicago
Predicting the Crimes in Chicago Swati Arora
 
Binary OR Binomial logistic regression
Binary OR Binomial logistic regression Binary OR Binomial logistic regression
Binary OR Binomial logistic regression Dr Athar Khan
 
Generating test data for Statistical and ML models
Generating test data for Statistical and ML modelsGenerating test data for Statistical and ML models
Generating test data for Statistical and ML modelsVladimir Ulogov
 
An introduction to probabilistic data structures
An introduction to probabilistic data structuresAn introduction to probabilistic data structures
An introduction to probabilistic data structuresMiguel Ping
 
DETECTION OF RELIABLE SOFTWARE USING SPRT ON TIME DOMAIN DATA
DETECTION OF RELIABLE SOFTWARE USING SPRT ON TIME DOMAIN DATADETECTION OF RELIABLE SOFTWARE USING SPRT ON TIME DOMAIN DATA
DETECTION OF RELIABLE SOFTWARE USING SPRT ON TIME DOMAIN DATAIJCSEA Journal
 
BMI214_Assignment2_S..
BMI214_Assignment2_S..BMI214_Assignment2_S..
BMI214_Assignment2_S..butest
 
Applied python for correlation on churn and stocks datasets
Applied python for correlation on churn and stocks datasetsApplied python for correlation on churn and stocks datasets
Applied python for correlation on churn and stocks datasetsMahmoud Fouad
 
Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)Abhimanyu Dwivedi
 
Workbook Project
Workbook ProjectWorkbook Project
Workbook ProjectBrian Ryan
 
Advanced compression technique for random data
Advanced compression technique for random dataAdvanced compression technique for random data
Advanced compression technique for random dataijitjournal
 

Similar to [M3A2] Data Analysis and Interpretation Specialization (20)

Data What Type Of Data Do You Have V2.1
Data   What Type Of Data Do You Have V2.1Data   What Type Of Data Do You Have V2.1
Data What Type Of Data Do You Have V2.1
 
TamingStatistics
TamingStatisticsTamingStatistics
TamingStatistics
 
Reducing False Positives
Reducing False PositivesReducing False Positives
Reducing False Positives
 
Time Series Project
Time Series Project Time Series Project
Time Series Project
 
50120130405032
5012013040503250120130405032
50120130405032
 
Predicting the Crimes in Chicago
Predicting the Crimes in Chicago Predicting the Crimes in Chicago
Predicting the Crimes in Chicago
 
Eviews forecasting
Eviews forecastingEviews forecasting
Eviews forecasting
 
Data science
Data scienceData science
Data science
 
Binary OR Binomial logistic regression
Binary OR Binomial logistic regression Binary OR Binomial logistic regression
Binary OR Binomial logistic regression
 
Generating test data for Statistical and ML models
Generating test data for Statistical and ML modelsGenerating test data for Statistical and ML models
Generating test data for Statistical and ML models
 
Telecom customer churn prediction
Telecom customer churn predictionTelecom customer churn prediction
Telecom customer churn prediction
 
working with python
working with pythonworking with python
working with python
 
An introduction to probabilistic data structures
An introduction to probabilistic data structuresAn introduction to probabilistic data structures
An introduction to probabilistic data structures
 
DETECTION OF RELIABLE SOFTWARE USING SPRT ON TIME DOMAIN DATA
DETECTION OF RELIABLE SOFTWARE USING SPRT ON TIME DOMAIN DATADETECTION OF RELIABLE SOFTWARE USING SPRT ON TIME DOMAIN DATA
DETECTION OF RELIABLE SOFTWARE USING SPRT ON TIME DOMAIN DATA
 
BMI214_Assignment2_S..
BMI214_Assignment2_S..BMI214_Assignment2_S..
BMI214_Assignment2_S..
 
Applied python for correlation on churn and stocks datasets
Applied python for correlation on churn and stocks datasetsApplied python for correlation on churn and stocks datasets
Applied python for correlation on churn and stocks datasets
 
Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)
 
Workbook Project
Workbook ProjectWorkbook Project
Workbook Project
 
Advanced compression technique for random data
Advanced compression technique for random dataAdvanced compression technique for random data
Advanced compression technique for random data
 
Lab manual_statistik
Lab manual_statistikLab manual_statistik
Lab manual_statistik
 

More from Andrea Rubio

[M4A2] Data Analysis and Interpretation Specialization
[M4A2] Data Analysis and Interpretation Specialization [M4A2] Data Analysis and Interpretation Specialization
[M4A2] Data Analysis and Interpretation Specialization Andrea Rubio
 
[M2A2] Data Analysis and Interpretation Specialization
[M2A2] Data Analysis and Interpretation Specialization [M2A2] Data Analysis and Interpretation Specialization
[M2A2] Data Analysis and Interpretation Specialization Andrea Rubio
 
[M2A3] Data Analysis and Interpretation Specialization
[M2A3] Data Analysis and Interpretation Specialization [M2A3] Data Analysis and Interpretation Specialization
[M2A3] Data Analysis and Interpretation Specialization Andrea Rubio
 
[M2A4] Data Analysis and Interpretation Specialization
[M2A4] Data Analysis and Interpretation Specialization [M2A4] Data Analysis and Interpretation Specialization
[M2A4] Data Analysis and Interpretation Specialization Andrea Rubio
 
[M3A1] Data Analysis and Interpretation Specialization
[M3A1] Data Analysis and Interpretation Specialization [M3A1] Data Analysis and Interpretation Specialization
[M3A1] Data Analysis and Interpretation Specialization Andrea Rubio
 
[M3A3] Data Analysis and Interpretation Specialization
[M3A3] Data Analysis and Interpretation Specialization [M3A3] Data Analysis and Interpretation Specialization
[M3A3] Data Analysis and Interpretation Specialization Andrea Rubio
 
[M4A1] Data Analysis and Interpretation Specialization
[M4A1] Data Analysis and Interpretation Specialization[M4A1] Data Analysis and Interpretation Specialization
[M4A1] Data Analysis and Interpretation SpecializationAndrea Rubio
 
[M3A4] Data Analysis and Interpretation Specialization
[M3A4] Data Analysis and Interpretation Specialization[M3A4] Data Analysis and Interpretation Specialization
[M3A4] Data Analysis and Interpretation SpecializationAndrea Rubio
 

More from Andrea Rubio (8)

[M4A2] Data Analysis and Interpretation Specialization
[M4A2] Data Analysis and Interpretation Specialization [M4A2] Data Analysis and Interpretation Specialization
[M4A2] Data Analysis and Interpretation Specialization
 
[M2A2] Data Analysis and Interpretation Specialization
[M2A2] Data Analysis and Interpretation Specialization [M2A2] Data Analysis and Interpretation Specialization
[M2A2] Data Analysis and Interpretation Specialization
 
[M2A3] Data Analysis and Interpretation Specialization
[M2A3] Data Analysis and Interpretation Specialization [M2A3] Data Analysis and Interpretation Specialization
[M2A3] Data Analysis and Interpretation Specialization
 
[M2A4] Data Analysis and Interpretation Specialization
[M2A4] Data Analysis and Interpretation Specialization [M2A4] Data Analysis and Interpretation Specialization
[M2A4] Data Analysis and Interpretation Specialization
 
[M3A1] Data Analysis and Interpretation Specialization
[M3A1] Data Analysis and Interpretation Specialization [M3A1] Data Analysis and Interpretation Specialization
[M3A1] Data Analysis and Interpretation Specialization
 
[M3A3] Data Analysis and Interpretation Specialization
[M3A3] Data Analysis and Interpretation Specialization [M3A3] Data Analysis and Interpretation Specialization
[M3A3] Data Analysis and Interpretation Specialization
 
[M4A1] Data Analysis and Interpretation Specialization
[M4A1] Data Analysis and Interpretation Specialization[M4A1] Data Analysis and Interpretation Specialization
[M4A1] Data Analysis and Interpretation Specialization
 
[M3A4] Data Analysis and Interpretation Specialization
[M3A4] Data Analysis and Interpretation Specialization[M3A4] Data Analysis and Interpretation Specialization
[M3A4] Data Analysis and Interpretation Specialization
 

Recently uploaded

GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsSumit Kumar yadav
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.Nitya salvi
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSSLeenakshiTyagi
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfSumit Kumar yadav
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡anilsa9823
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 

Recently uploaded (20)

GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 

[M3A2] Data Analysis and Interpretation Specialization

  • 1. DATA ANALYSIS COLLECTION ASSIGNMENT Data Analysis And Interpretation Specialization Test A Basic Linear Regression Model Andrea Rubio Amorós June 15, 2017 Modul 3 Assignment 2
  • 2. Data Analysis And Interpretation Specialization Test A Basic Linear Regression Model M3A2 1 Introduction In this assignment, I discuss more about the importance of testing for confounding, and provide examples of situ- ations in which a confounding variable can explain the association between an explanatory and response variable. In addition, now that I have statistically tested the association between an explanatory variable and your response variable, I will test and interpret this association using basic linear regression analysis for a quantitative response variable. You will also learn about how the linear regression model can be used to predict your observed response variable. Finally, I will also discuss the statistical assumptions underlying the linear regression model, and show you some best practices for coding your explanatory variables. Note that if my research question does not include one quantitative response variable, I can use one from your data set just to get some practice with the tool. Document written in LATEX template_version_01.tex 2
  • 3. Data Analysis And Interpretation Specialization Test A Basic Linear Regression Model M3A2 2 Python Code For this week’s assignment, I will continue working with the Global Terrorism Database (GTD). I have selected a quantitative explanatory variable (nperps), which indicates the total number of terrorists participating in an incident; and a quantitative response variable (nkill), which stores the number of total confirmed fatalities for an incident. Through a linear regression model, I want to study how strong is the relationship between this both variables. With other words, I would like to know if there is a positive association between the number of terrorists that participated in an incident and the number of people that got killed. Writing the program: The first step is to call in all the libraries that I will need and to read my dataset with the “pandas” library read function. Then, I code my variables as numeric to make sure that I’m working with quantitative values. Now I can use the OLS regression model to get a summary of information about the studied variables. # import libraries import pandas import numpy import seaborn import matplotlib.pyplot as plt import scipy import statsmodels.formula.api as smf # reading in the data set we want to work with # as my csv is separated by ";" instead of "," I need to add sep=';' to the code # as my csv has extrange simbols as tildes 'é' I need to add encoding='ISO8859-1' to the code mydata = pandas.read_csv(working_folder+'M3A2data_gtd_12to15_0616dist_MSDOS.csv', sep=';', encoding='ISO8859-1',low_memory=False) # convert the variables to numeric format mydata['nperps'] = pandas.to_numeric(mydata['nperps'], errors='coerce') mydata['nkill'] = pandas.to_numeric(mydata['nkill'], errors='coerce') mydata['nperps'] = mydata['nperps'].replace(-99, numpy.nan) print('OLS regression model for the assotiation between n° of terrorists and total n° of fatalities') reg1 = smf.ols('nkill ~ nperps', data=mydata).fit() print(reg1.summary()) OLS regression model for the assotiation between n° of terrorists and total n° of fatalities OLS Regression Results ============================================================================== Dep. Variable: nkill R-squared: 0.034 Model: OLS Adj. R-squared: 0.034 Method: Least Squares F-statistic: 304.3 Date: Wed, 14 Jun 2017 Prob (F-statistic): 5.44e-67 Time: 17:40:52 Log-Likelihood: -31750. No. Observations: 8519 AIC: 6.350e+04 Df Residuals: 8517 BIC: 6.352e+04 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 3.3591 0.111 30.206 0.000 3.141 3.577 nperps 0.0301 0.002 17.445 0.000 0.027 0.034 ============================================================================== Omnibus: 13372.286 Durbin-Watson: 1.867 Prob(Omnibus): 0.000 Jarque-Bera (JB): 8999254.771 Skew: 10.006 Prob(JB): 0.00 Kurtosis: 160.964 Cond. No. 65.7 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. When looking at the output, we can first recognize on the top (red marked) our response variable (nkill). There are 8519 observations that had valid data on both the response and explanatory variables. The p value (5.44e-67) is extremely small and that indicates a strong relationship between both variables “Number of terrorists” and “Total number of fatalities”. Document written in LATEX template_version_01.tex 3
  • 4. Data Analysis And Interpretation Specialization Test A Basic Linear Regression Model M3A2 The intercept is 3.3591 and the slope is 0.0301. For both figures, the p value is smaller than 0.0001. y = f (x) = β0 +β1 · x = 3.3591+0.0301· x (2.1) We can conclude saying that the results of the linear regression model indicates that the number of terrorists is sig- nificantly and positively associated with the total number of fatalities. Document written in LATEX template_version_01.tex 4
  • 5. My Codebook (GTD) for Module 3, week 2: Number of Perpetrators (nperps) Numeric Variable This field indicates the total number of terrorists participating in the incident. (In the instance of multiple perpetrator groups participating in one case, the total number of perpetrators, across groups, is recorded). There are often discrepancies in information on this value. Where several independent credible sources report different numbers of attackers, the value of this variable reflects the number given by the majority of sources, unless there is reason to do otherwise. Where there is no majority figure among independent sources, the database records the lowest proffered perpetrator figure, unless there is clear reason to do otherwise. In cases where the number of perpetrators is stated vaguely, for example “…at least 11 attackers”, then the lowest possible number is recorded, in this example, “11.” “-99” or “Unknown” appears when the number of perpetrators is not reported. Number of Perpetrators Captured (nperpcap) Numeric Variable This field records the number of perpetrators taken into custody. “-99” or “Unknown” appears when there is evidence of captured, but the number is not reported. Divergent reports on the number of perpetrators captured are dealt with in same manner used for the Number of Perpetrators variable described above. Note: This field is presently only systematically available with incidents occurring after 1997. Total Number of Fatalities (nkill) Numeric Variable This field stores the number of total confirmed fatalities for the incident. The number includes all victims and attackers who died as a direct result of the incident. Where there is evidence of fatalities, but a figure is not reported or it is too vague to be of use, this field remains blank. If information is missing regarding the number of victims killed in an attack, but perpetrator fatalities are known, this value will reflect only the number of perpetrators who died as a result of the incident. Likewise, if information on the number of perpetrators killed in an attack is missing, but victim fatalities are known, this field will only report the number of victims killed in the incident. Data Analysis And Interpretation Specialization Test A Basic Linear Regression Model M3A2 3 Codebook Document written in LATEX template_version_01.tex 5
  • 6. Where several independent sources report different numbers of casualties, the database will usually reflect the number given by the most recent source. However, the most recent source will not be used if the source itself is of questionable validity or if the source bases its casualty numbers on claims made by a perpetrator group. When there are several “most recent” sources published around the same time, or there are concerns about the validity of a recent source, the majority figure will be used. Where there is no majority figure among independent sources, the database will record the lowest proffered fatality figure, unless that figure comes from a source of questionable validity or there is another compelling reason to do otherwise. Conflicting reports of fatalities will be noted in the “Additional Notes” field. Number of US Fatalities (nkillus) Numeric Variable This field records the number of U.S. citizens who died as a result of the incident, and follows the conventions of “Total Number of Fatalities” described above. Thus, this field records the number of U.S. victims and U.S. perpetrators who died as a result of the attack. The value for this field is not limited to U.S. citizens killed on U.S. soil, but also includes U.S. citizens who died in incidents occurring outside of the U.S. Number of Perpetrator Fatalities (nkillter) Numeric Variable Limited to only perpetrator fatalities, this field follows the conventions of the “Total Number of Fatalities” field described above. Total Number of Injured (nwound) Numeric Variable This field records the number of confirmed non-fatal injuries to both perpetrators and victims. It follows the conventions of the “Total Number of Fatalities” field described above. Number of U.S. Injured (nwoundus) Numeric Variable This field records the number of confirmed non-fatal injuries to U.S. citizens, both perpetrators and victims. It follows the conventions of the “Number of U.S. Fatalities” field described above. Data Analysis And Interpretation Specialization Test A Basic Linear Regression Model M3A2 Document written in LATEX template_version_01.tex 6
  • 7. Number of Perpetrators Injured (nwoundte) Numeric Variable Conventions follow the “Number of Perpetrator Fatalities” field described above. Value of Property Damage (in USD) (propvalue) Numeric Variable If “Property Damage?” is “Yes,” then the exact U.S. dollar amount (at the time of the incident) of total damages is listed. Where applicable, property values reported in foreign currencies are converted to U.S. dollars before being entered into the GTD. If no dollar figure is reported, the field is left blank. That is, a blank field here does not indicate that there was no property damage but, rather, that no precise estimate of the value was available. The value of damages only includes direct economic effects of the incident (i.e. cost of buildings, etc.) and not indirect economic costs (longer term effects on the company, industry, tourism, etc.). Protocols for recording inconsistent numbers, etc., listed above are followed (see, for example, “Number of Perpetrators”). Total Number of Hostages/ Kidnapping Victims (nhostkid) Numeric Variable This field records the total number of hostages or kidnapping victims. For successful hijackings, this value will reflect the total number of crew members and passengers aboard the vehicle at the time of the incident. As with other fields, where several independent sources report different numbers of hostages, the GTD reflects the number given by the most recent source, unless there is reason to do otherwise. When there are several most recent sources available, or there are question about the validity of a recent source, the GTD will report the majority figure from a group of independent sources. Where there is no majority figure among independent sources, the database will record the lowest proffered hostage figure, unless there is clear reason to do otherwise. In cases where the number of hostages or kidnapping victims is stated vaguely, for example, “…at least 11 hostages”, then the lowest possible number will be recorded, in this example “11.” If the number of hostages is unknown or unidentified, this field records “-99” (unknown). Data Analysis And Interpretation Specialization Test A Basic Linear Regression Model M3A2 Document written in LATEX template_version_01.tex 7
  • 8. Number of U.S. Hostages/ Kidnapping Victims (nhostkidus) Numeric Variable This field reports the number of U.S. citizens that were taken hostage or kidnapped in the incident. Conventions follow the “Total Number of Hostages/ Kidnapping Victims” field described above. Total Ransom Amount Demanded (ransomamt) Numeric Variable If a ransom was demanded, then the amount (in U.S. dollars) is listed in this field. If a ransom was demanded but the monetary figure is unknown, then this field will report “-99” (unknown). If there are conflicting reports on the amount of ransom demanded, the majority figure from independent sources will be used. If no majority exists, the lowest proffered figure will be used, unless that figure comes from a source of questionable validity or there is another compelling reason to do otherwise. Ransom Amount Demanded from U.S. Sources (ransomamtus) Numeric Variable If a ransom was demanded from U.S. sources, then the amount (in U.S. dollars) is listed in this field. If a ransom was demanded from U.S. sources but the monetary figure is unknown, then this field will report “-99” (unknown). If there are conflicting reports on the amount of ransom demanded from a U.S. source, the majority figure from independent sources will be used. If no majority exists, the lowest proffered figure will be used, unless there is a reason to do otherwise. Total Ransom Amount Paid (ransompaid) Numeric Variable If a ransom amount was paid, then the amount (in U.S. dollars) is listed in this field. If a ransom was paid but the monetary figure was unspecified, then this field will report “-99” (unknown). A value of “-99” will also be reported for any case where it is suspected that a ransom was paid, but it has not been confirmed. If there are conflicting reports on the amount of ransom paid, the majority figure from independent sources will be used. If no majority exists, the lowest proffered figure will be used, unless that figure comes from a source of questionable validity or there is another compelling reason to do otherwise. Data Analysis And Interpretation Specialization Test A Basic Linear Regression Model M3A2 Document written in LATEX template_version_01.tex 8
  • 9. Ransom Amount Paid By U.S. Sources (ransompaidus) Numeric Variable If a ransom amount was paid by U.S. sources, then this figure is listed in U.S. dollars. If a ransom was paid by U.S. sources but the monetary figure was unspecified, then this field will report “-99” (unknown). If there are conflicting reports on the amount of ransom paid by a U.S. source, the majority figure from independent sources will be used. If no majority exists, the lowest proffered figure will be used, unless that figure comes from a source of questionable validity or there is another compelling reason to do otherwise. Number Released/Escaped/Rescued (nreleased) Numeric Variable If the “Attack Type” is “Hostage Taking (Kidnapping),” “Hostage Taking (Barricade Incident),” or a successful “Hijacking,” then this field will apply. This field records the number of hostages who survived the incident. As with the total number of kidnapping victims, where several independent sources report different numbers of hostages, the database will reflect the number given by the most recent source, unless there is reason to do otherwise. If there are several most recent sources available, the majority number form a group of independent sources will be used, unless there is reason to do otherwise. Where there is no majority figure among independent sources, the database will record the lowest proffered hostage released/escaped/rescued figure, unless there is clear reason to do otherwise. In cases where the number of hostages released/escaped/rescued is stated vaguely, for example “…at least 11 hostages were released”, then the lowest possible number will be recorded, in this example “11.” If the fate of the hostages is unknown, this field will record “-99.” Data Analysis And Interpretation Specialization Test A Basic Linear Regression Model M3A2 Document written in LATEX template_version_01.tex 9
  • 10. Data Analysis And Interpretation Specialization Test A Basic Linear Regression Model M3A2 4 List Of Abbreviations GTD Global Terrorism Database Document written in LATEX template_version_01.tex 10