DATA ANALYSIS COLLECTION
ASSIGNMENT
Data Analysis And Interpretation Specialization
Test A Multiple Regression Model
Andrea Rubio Amorós
June 15, 2017
Module 3
Assignment 3
Data Analysis And Interpretation Specialization
Test A Multiple Regression Model M3A3
1 Introduction
Multiple regression analysis is a tool that allows you to expand on your research question and conduct a more rigorous
test of the association between your explanatory and response variables by adding further quantitative and/or
categorical explanatory variables to your linear regression model. In this session, you will apply and interpret a multiple
regression analysis for a quantitative response variable, and will learn how to use confidence intervals to take into
account error in estimating a population parameter. You will also learn how to account for nonlinear associations in
a linear regression model.
Finally, you will develop experience using regression diagnostic techniques to evaluate how well your multiple
regression model predicts your observed response variable. Note that if you have not yet identified additional explanatory
variables, you should choose at least one additional explanatory variable from your data set. When you go back to
your codebooks, ask yourself a few questions like “What other variables might explain the association between my
explanatory and response variable?”; “What other variables might explain more of the variability in my response
variable?”, or even “What other explanatory variables might be interesting to explore?” Additional explanatory variables
can be either quantitative, categorical, or both. Although you need only two explanatory variables to test a multiple
regression model, we encourage you to identify more than one additional explanatory variable. Doing so will really
allow you to experience the power of multiple regression analysis, and will increase your confidence in your ability
to test and interpret more complex regression models. If your research question does not include one quantitative
response variable, you can use the same quantitative response variable that you used in Module 2, or you may choose
another one from your data set.
Document written in LATEX
template_version_01.tex
2 Python Code
This week, I will be working with the NESARC dataset, in order to study multiple variables that might influence the
personal income of the population.
Multiple Explanatory Variables:
Age                        Sex            Grade
18-97. (Age in years)      1. (Male)      1. (no)
98. (98 years or older)    2. (Female)    2. (yes)
Response Variable:
Income
0-3’000’000. (Income in USD)
Table 2.1 Multiple Explanatory Variables and Response Variable
Using a multiple regression model, I want to study whether there is a positive or negative association between each
explanatory variable and my response variable.
Writing the program: The selected dataset contains observations from a population (aged 18 and older, according to
the codebook) that includes students, unemployed individuals, part-time workers, full-time workers and retired
individuals. Not all of these observations are relevant for my research: when studying personal income, I need
to focus on employed individuals and, to be more precise, on full-time workers.
Therefore, I start by creating a new variable called FULLTIME that flags the full-time workers. Then I create a
personalized dataset called “mydata” that includes only the chosen variables for this research. Dropping the rows
with missing values at this step keeps only the full-time workers.
# import libraries
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
import scipy
import statsmodels.api as sm
import statsmodels.formula.api as smf

# folder that contains the data file (assumption: not defined in the original listing; adjust as needed)
working_folder = ""

# reading in the data set we want to work with
data = pandas.read_csv(working_folder+"M3A3data_nesarc_pds.csv",low_memory=False)

# create new variable FULLTIME from original S1Q7A1, to include only full time workers (observations 1=yes)
# all other rows get no value (NaN) and are removed by dropna() below
def FULLTIME(row):
    if row['S1Q7A1'] == 1:
        return 1

data['FULLTIME'] = data.apply(lambda row: FULLTIME(row), axis = 1)

# create new variable GRADE from original S1Q6A, to reduce it to two levels (observations 1=no, 2=yes)
# codes 12 and above mean a completed bachelor's degree or higher (see codebook)
def GRADE(row):
    if row['S1Q6A'] < 12:
        return 1
    elif row['S1Q6A'] >= 12:
        return 2

data['GRADE'] = data.apply(lambda row: GRADE(row), axis = 1)

# give a new name to variable for better understanding
data['INCOME'] = data['S1Q10A']

# create a personalized dataset only with the chosen variables for this research
mydata = data[['AGE','SEX','GRADE','FULLTIME','INCOME']].dropna()
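The row-wise recoding above can also be expressed with vectorized pandas operations. The following is only a minimal sketch on a small hypothetical data frame: the rows are invented for illustration, and only the column names and codes are taken from the codebook. It is not the code that produced the results in this report.

```python
import pandas as pd

# Hypothetical rows for illustration only; column names and codes follow the codebook.
toy = pd.DataFrame({
    "AGE":    [25, 40, 63, 31],
    "SEX":    [1, 2, 2, 1],
    "S1Q6A":  [8, 12, 14, 10],      # highest grade completed (12+ = bachelor's degree or higher)
    "S1Q7A1": [1, 1, 2, 1],         # 1 = works full time, 2 = does not
    "S1Q10A": [30000, 52000, 18000, 41000],
})

# Keep only full-time workers with a boolean mask instead of a helper that returns NaN.
fulltime = toy[toy["S1Q7A1"] == 1].copy()

# Two-level education variable: 1 = code below 12, 2 = code 12 or above.
fulltime["GRADE"] = (fulltime["S1Q6A"] >= 12).astype(int) + 1

# Rename the income column for readability.
fulltime["INCOME"] = fulltime["S1Q10A"]

toy_data = fulltime[["AGE", "SEX", "GRADE", "INCOME"]].dropna()
print(toy_data)
```

On a data set of this size the mask avoids calling a Python function once per row, which is usually noticeably faster than `apply` with `axis=1`.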
To make the outputs easier to compare, I will first fit a simple linear regression model for the association between AGE and
INCOME, and then compare the results with a multiple regression model that adds more explanatory variables
to the model formula.
# linear regression
# scatterplot for the association between AGE and TOTAL PERSONAL INCOME
plt.figure(1)
scat1 = seaborn.regplot(x='AGE', y='INCOME', scatter=True, data=mydata, line_kws={'color': 'red'})
plt.xlabel('AGE')
plt.ylabel('TOTAL PERSONAL INCOME')
plt.title('Association between AGE and TOTAL PERSONAL INCOME for full time workers')
[Figure: scatterplot with fitted regression line; x-axis: AGE, y-axis: TOTAL PERSONAL INCOME]
Figure 2.1 Association between AGE and TOTAL PERSONAL INCOME for full time workers
As you can see, a few observations have very high values compared to the majority of the observations, which makes
the plot difficult to read. For that reason, we restrict the ranges of the x and y axes in order
to zoom into the scatterplot.
# linear regression
# scatterplot for the association between AGE and TOTAL PERSONAL INCOME
plt.figure(2)
scat1 = seaborn.regplot(x='AGE', y='INCOME', scatter=True, data=mydata, line_kws={'color': 'red'})
plt.xlabel('AGE')
plt.ylabel('TOTAL PERSONAL INCOME')
plt.title('Association between AGE and TOTAL PERSONAL INCOME for full time workers')
plt.xlim((0,100))
plt.ylim((0,1000000))
[Figure: zoomed scatterplot with fitted regression line; x-axis: AGE, y-axis: TOTAL PERSONAL INCOME]
Figure 2.2 Association between AGE and TOTAL PERSONAL INCOME for full time workers
Now it looks like there is a positive relationship between AGE and INCOME (increasing red line), but the line does not
exactly fit the pattern of the dots. The scatterplot suggests a curve that starts with small values, rises to higher values
in the middle and falls again at the end. I therefore expect a second-order polynomial regression model to follow
the pattern of the observations more closely than a straight line.
# polynomial regression
# scatterplot for the association between AGE and TOTAL PERSONAL INCOME
plt.figure(3)
scat2 = seaborn.regplot(x='AGE', y='INCOME', scatter=True, order=2, data=mydata, line_kws={'color': 'red'})
plt.xlabel('AGE')
plt.ylabel('TOTAL PERSONAL INCOME')
plt.title('Association between AGE and TOTAL PERSONAL INCOME')
plt.xlim((0,100))
plt.ylim((0,1000000))
[Figure: zoomed scatterplot with second-order regression curve; x-axis: AGE, y-axis: TOTAL PERSONAL INCOME]
Figure 2.3 Association between AGE and TOTAL PERSONAL INCOME
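The reasoning behind the second-order fit can be checked on a small synthetic example. The sketch below uses made-up, noise-free data with an inverted-U shape (the numbers are assumptions, not NESARC values) and compares a straight-line fit with a quadratic fit using `numpy.polyfit`; it is meant only to illustrate why a squared term helps when the pattern is curved.

```python
import numpy as np

# Synthetic, noise-free data with a concave (inverted-U) shape; the numbers are made up.
age = np.arange(20, 71)
age_c = age - age.mean()                      # centered age, analogous to AGE_c
income = 40000 + 700 * age_c - 20 * age_c**2

# Least-squares fits of order 1 and order 2 (coefficients come highest power first).
lin_coef = np.polyfit(age_c, income, 1)
quad_coef = np.polyfit(age_c, income, 2)

# Sum of squared residuals for each fit.
sse_lin = float(np.sum((income - np.polyval(lin_coef, age_c)) ** 2))
sse_quad = float(np.sum((income - np.polyval(quad_coef, age_c)) ** 2))

print("SSE linear:   ", round(sse_lin, 2))
print("SSE quadratic:", round(sse_quad, 2))
print("quadratic coefficient:", round(float(quad_coef[0]), 4))
```

A negative quadratic coefficient together with a much smaller sum of squared residuals is the same signature that the OLS summaries are examined for later in this report.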
As Figure 2.3 shows, the regression line now fits the pattern better. Another way to check this result is to fit the
linear and the polynomial regression models with the OLS function and compare their summaries.
# center quantitative variables for regression analysis
mydata['AGE_c'] = (mydata['AGE'] - mydata['AGE'].mean())
# linear regression model
print('OLS Regression Model for the association between TOTAL PERSONAL INCOME and AGE')
regmod1 = smf.ols('INCOME ~ AGE_c', data=mydata).fit()
print(regmod1.summary())
# regression model with second order polynomial (quadratic term)
print('OLS Regression Model for the association between TOTAL PERSONAL INCOME and AGE')
regmod2 = smf.ols('INCOME ~ AGE_c + I(AGE_c**2)', data=mydata).fit()
print(regmod2.summary())
OLS Regression Model for the association between TOTAL PERSONAL INCOME and AGE
                            OLS Regression Results
==============================================================================
Dep. Variable:                 INCOME   R-squared:                       0.025
Model:                            OLS   Adj. R-squared:                  0.025
Method:                 Least Squares   F-statistic:                     571.5
Date:                Thu, 15 Jun 2017   Prob (F-statistic):          9.66e-125
Time:                        09:21:34   Log-Likelihood:            -2.7322e+05
No. Observations:               22267   AIC:                         5.464e+05
Df Residuals:                   22265   BIC:                         5.465e+05
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   3.969e+04    345.791    114.770      0.000     3.9e+04    4.04e+04
AGE_c        693.5094     29.009     23.907      0.000     636.650     750.368
==============================================================================
Omnibus:                    52977.388   Durbin-Watson:                   1.994
Prob(Omnibus):                  0.000   Jarque-Bera (JB):       1122433086.644
Skew:                          23.952   Prob(JB):                         0.00
Kurtosis:                    1101.861   Cond. No.                         11.9
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Model for the association between TOTAL PERSONAL INCOME and AGE
                            OLS Regression Results
==============================================================================
Dep. Variable:                 INCOME   R-squared:                       0.029
Model:                            OLS   Adj. R-squared:                  0.029
Method:                 Least Squares   F-statistic:                     333.1
Date:                Thu, 15 Jun 2017   Prob (F-statistic):          2.95e-143
Time:                        09:21:34   Log-Likelihood:            -2.7317e+05
No. Observations:               22267   AIC:                         5.464e+05
Df Residuals:                   22264   BIC:                         5.464e+05
Df Model:                           2
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept      4.243e+04    448.224     94.671      0.000    4.16e+04    4.33e+04
AGE_c           753.3524     29.612     25.441      0.000     695.310     811.394
I(AGE_c ** 2)   -19.3358      2.013     -9.605      0.000     -23.282     -15.390
==============================================================================
Omnibus:                    53353.873   Durbin-Watson:                   1.993
Prob(Omnibus):                  0.000   Jarque-Bera (JB):       1184305888.518
Skew:                          24.372   Prob(JB):                         0.00
Kurtosis:                    1131.761   Cond. No.                         293.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
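The 95% confidence limits reported in the [0.025 0.975] columns of the summaries above can be reproduced approximately by hand from the coefficient and its standard error. The sketch below does this for the AGE_c row of the linear model. It assumes that, with 22265 residual degrees of freedom, the standard normal quantile is a close enough stand-in for the t quantile that statsmodels uses, so the last digits can differ slightly from the table (636.650 to 750.368).

```python
from statistics import NormalDist

# Values copied from the linear model summary (AGE_c row).
coef = 693.5094
std_err = 29.009

# Assumption: the normal 97.5% quantile approximates the t quantile,
# which is reasonable here because the residual degrees of freedom are very large.
z = NormalDist().inv_cdf(0.975)

lower = coef - z * std_err
upper = coef + z * std_err
print(f"approximate 95% CI for AGE_c: [{lower:.1f}, {upper:.1f}]")
```

Because the interval does not contain zero, it agrees with the very small p-value reported for AGE_c.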
The R-squared values indicate that the polynomial regression model explains slightly more of the variability in INCOME
than the linear regression model (0.029 vs. 0.025), although both values are low. In both regression models, AGE is
positively related to INCOME, with p-values smaller than 0.0001. In the polynomial regression model, however, the
coefficient of the quadratic term is negative (-19.3358), which indicates that the curve is concave.
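A concave curve has a single peak, and its position follows directly from the two age coefficients of the polynomial model. The short calculation below is a sketch of that arithmetic. Since the sample mean of AGE is not reported in this document, the result is expressed relative to the mean age rather than as an absolute age.

```python
# Coefficients copied from the quadratic model summary.
b1 = 753.3524     # AGE_c
b2 = -19.3358     # I(AGE_c ** 2)

# A parabola b0 + b1*x + b2*x**2 with b2 < 0 peaks where its derivative
# b1 + 2*b2*x equals zero.
peak_age_c = -b1 / (2 * b2)
print(f"predicted income peaks about {peak_age_c:.1f} years above the mean age")
```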
Now that my regression model accounts for this curvature, I will proceed to add more explanatory variables to the OLS
function and analyze the results.
# multiple regression model
print('OLS Regression Model for the association between TOTAL PERSONAL INCOME and AGE, SEX and GRADE')
regmod3 = smf.ols('INCOME ~ AGE_c + I(AGE_c**2) + SEX + GRADE', data=mydata).fit()
print(regmod3.summary())
OLS Regression Model for the association between TOTAL PERSONAL INCOME and AGE, SEX and GRADE
                            OLS Regression Results
==============================================================================
Dep. Variable:                 INCOME   R-squared:                       0.112
Model:                            OLS   Adj. R-squared:                  0.111
Method:                 Least Squares   F-statistic:                     698.8
Date:                Thu, 15 Jun 2017   Prob (F-statistic):               0.00
Time:                        09:21:34   Log-Likelihood:            -2.7218e+05
No. Observations:               22267   AIC:                         5.444e+05
Df Residuals:                   22262   BIC:                         5.444e+05
Df Model:                           4
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept       2.53e+04   1433.216     17.656      0.000    2.25e+04    2.81e+04
AGE_c           650.2752     28.439     22.865      0.000     594.532     706.018
I(AGE_c ** 2)   -12.3280      1.934     -6.375      0.000     -16.118      -8.538
SEX            -1.46e+04    660.861    -22.087      0.000   -1.59e+04   -1.33e+04
GRADE          2.905e+04    726.163     40.000      0.000    2.76e+04    3.05e+04
==============================================================================
Omnibus:                    55681.626   Durbin-Watson:                   1.996
Prob(Omnibus):                  0.000   Jarque-Bera (JB):       1650204138.297
Skew:                          27.098   Prob(JB):                         0.00
Kurtosis:                    1335.554   Cond. No.                     1.08e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.08e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
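This new output shows that there is a significant association between each of the explanatory variables and the response
variable (all p-values are reported as 0.000). The association is positive for AGE and GRADE and negative for SEX.
Since SEX is coded 1 for male and 2 for female, the negative coefficient means that, holding the other variables
constant, women earn on average about 14,600 USD less than men. Together, the explanatory variables explain about
11% of the variability in INCOME (R-squared 0.112).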
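To make the coefficients more concrete, they can be plugged into the regression equation for a few profiles. The sketch below uses the rounded coefficients from the summary above; the helper `predict_income` and the two profiles are hypothetical and serve only as an illustration, so the results are rough point predictions rather than reported findings.

```python
# Rounded coefficients copied from the multiple regression summary.
COEF = {
    "Intercept": 2.53e4,
    "AGE_c": 650.2752,
    "AGE_c_sq": -12.3280,
    "SEX": -1.46e4,      # SEX is coded 1 = male, 2 = female
    "GRADE": 2.905e4,    # GRADE is coded 1 = no, 2 = yes
}

def predict_income(age_c, sex, grade):
    """Return the model's predicted income for one hypothetical person."""
    return (COEF["Intercept"]
            + COEF["AGE_c"] * age_c
            + COEF["AGE_c_sq"] * age_c ** 2
            + COEF["SEX"] * sex
            + COEF["GRADE"] * grade)

# Two hypothetical profiles at the mean age (AGE_c = 0).
male_grad = predict_income(age_c=0, sex=1, grade=2)
female_no_grad = predict_income(age_c=0, sex=2, grade=1)
print(round(male_grad), round(female_no_grad))
```

Because SEX and GRADE enter the model with their 1/2 codes, the intercept on its own does not correspond to any real group; a prediction only makes sense once valid codes are inserted for both variables.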
To conclude, I have included a regression diagnostic plot (Q-Q plot) to evaluate the assumption that the residuals
from our regression model are normally distributed, and to check whether there are any outlying observations that might be
unduly influencing the estimation of the regression coefficients.
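# regression diagnostic plot
# Q-Q plot for normality
# (qqplot opens its own figure unless an axis is passed, so the axis of figure 4 is handed over explicitly)
fig4 = plt.figure(4)
regdiag = sm.qqplot(regmod3.resid, line='r', ax=fig4.add_subplot(111))
[Figure: Q-Q plot of the residuals; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]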
Figure 2.4 Regression diagnostic plot
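The output shows that most of the residuals follow the red reference line, but some values depart from it clearly.
Together with the very high skew (27.098) and kurtosis (1335.554) reported in the model summary, this indicates that the
residuals are not normally distributed in the tails, and that a few extreme observations can be regarded as outliers
that may influence the estimation of the regression coefficients.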
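What a Q-Q plot compares can be shown with a few lines of standard-library Python. The sketch below is a simplified illustration, not the statsmodels implementation: the residuals are hypothetical, they are standardized before the comparison, and the plotting positions (i + 0.5) / n are an assumption chosen for simplicity.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical residuals: mostly small values plus one very large positive one.
residuals = [-3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 40.0]

n = len(residuals)
m, s = mean(residuals), stdev(residuals)

# Sample quantiles: standardized residuals in ascending order.
sample_q = sorted((r - m) / s for r in residuals)

# Theoretical quantiles: standard normal quantiles at plotting positions (i + 0.5) / n.
theory_q = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

for t, q in zip(theory_q, sample_q):
    print(f"theoretical {t:6.2f}   sample {q:6.2f}")
```

If the residuals were normally distributed, the two columns would be roughly equal. A single extreme value pulls the last sample quantile well above its theoretical counterpart, which is the kind of departure an outlier produces in the plot.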
3 Codebook
Variable rows: Tape Location, Source Code, Title
Value rows (indented): Frequency, Item value, Description

68-69    AGE      AGE
         43079    18-97.      Age in years
         14       98.         98 years or older
79-79    SEX      SEX
         18518    1.          Male
         24575    2.          Female
131-132  S1Q6A    HIGHEST GRADE OR YEAR OF SCHOOL COMPLETED
         218      1.          No formal schooling
         137      2.          Completed grade K, 1 or 2
         421      3.          Completed grade 3 or 4
         931      4.          Completed grade 5 or 6
         414      5.          Completed grade 7
         1210     6.          Completed grade 8
         4518     7.          Some high school (grades 9-11)
         10935    8.          Completed high school
         1612     9.          Graduate equivalency degree (GED)
         8891     10.         Some college (no degree)
         3772     11.         Completed associate or other technical 2-year degree
         5251     12.         Completed college (bachelor's degree)
         1526     13.         Some graduate or professional studies (completed bachelor's degree but not graduate degree)
         3257     14.         Completed graduate or professional degree (master's degree or higher)
136-136  S1Q7A1   PRESENT SITUATION INCLUDES WORKING FULL TIME (35+ HOURS A WEEK)
         22267    1.          Yes
         20826    2.          No
171-172  S1Q9A    BUSINESS OR INDUSTRY: CURRENT OR MOST RECENT JOB
         1143     1.          Agriculture
         147      2.          Mining
         2124     3.          Construction
         3909     4.          Manufacturing
         2404     5.          Transportation, Communications and Other Public Utilities
         724      6.          Wholesale Trade
         4701     7.          Retail Trade
         1945     8.          Finance, Insurance and Real Estate
         1248     9.          Business and Repair Service
         4186     10.         Personal Services
         992      11.         Entertainment and Recreation Services
         8181     12.         Professional and Related Services
         1925     13.         Public Administration
         431      14.         Armed Services
         9033     BL.         NA, never worked for pay or in family business or farm
179-185  S1Q10A   TOTAL PERSONAL INCOME IN LAST 12 MONTHS
         43093    0-3000000.  Income in dollars
186-187  S1Q10B   TOTAL PERSONAL INCOME IN LAST 12 MONTHS: CATEGORY
         2462     0.          $0 (No personal income)
         3571     1.          $1 to $4,999
         3823     2.          $5,000 to $7,999
         2002     3.          $8,000 to $9,999
         3669     4.          $10,000 to $12,999
         1634     5.          $13,000 to $14,999
         3940     6.          $15,000 to $19,999
         3887     7.          $20,000 to $24,999
         3085     8.          $25,000 to $29,999
         3003     9.          $30,000 to $34,999
         2351     10.         $35,000 to $39,999
         3291     11.         $40,000 to $49,999
         2059     12.         $50,000 to $59,999
         1328     13.         $60,000 to $69,999
         857      14.         $70,000 to $79,999
         521      15.         $80,000 to $89,999
         290      16.         $90,000 to $99,999
         1320     17.         $100,000 or more

More Related Content

What's hot

Chapter01 introductory handbook
Chapter01 introductory handbookChapter01 introductory handbook
Chapter01 introductory handbookRaman Kannan
 
The caret Package: A Unified Interface for Predictive Models
The caret Package: A Unified Interface for Predictive ModelsThe caret Package: A Unified Interface for Predictive Models
The caret Package: A Unified Interface for Predictive ModelsNYC Predictive Analytics
 
Network Analytics - Homework 3 - Msc Business Analytics - Imperial College Lo...
Network Analytics - Homework 3 - Msc Business Analytics - Imperial College Lo...Network Analytics - Homework 3 - Msc Business Analytics - Imperial College Lo...
Network Analytics - Homework 3 - Msc Business Analytics - Imperial College Lo...Jonathan Zimmermann
 
Ridge regression, lasso and elastic net
Ridge regression, lasso and elastic netRidge regression, lasso and elastic net
Ridge regression, lasso and elastic netVivian S. Zhang
 
An Introduction to Simulation in the Social Sciences
An Introduction to Simulation in the Social SciencesAn Introduction to Simulation in the Social Sciences
An Introduction to Simulation in the Social Sciencesfsmart01
 
Caret Package for R
Caret Package for RCaret Package for R
Caret Package for Rkmettler
 
Machine learning session8(svm nlp)
Machine learning   session8(svm nlp)Machine learning   session8(svm nlp)
Machine learning session8(svm nlp)Abhimanyu Dwivedi
 
SupportVectorRegression
SupportVectorRegressionSupportVectorRegression
SupportVectorRegressionDaniel K
 
DN 2017 | Multi-Paradigm Data Science - On the many dimensions of Knowledge D...
DN 2017 | Multi-Paradigm Data Science - On the many dimensions of Knowledge D...DN 2017 | Multi-Paradigm Data Science - On the many dimensions of Knowledge D...
DN 2017 | Multi-Paradigm Data Science - On the many dimensions of Knowledge D...Dataconomy Media
 
Binary classification metrics_cheatsheet
Binary classification metrics_cheatsheetBinary classification metrics_cheatsheet
Binary classification metrics_cheatsheetJakub Czakon
 
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project ADN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project ADataconomy Media
 
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...Md. Main Uddin Rony
 
M08 BiasVarianceTradeoff
M08 BiasVarianceTradeoffM08 BiasVarianceTradeoff
M08 BiasVarianceTradeoffRaman Kannan
 
Machine learning session6(decision trees random forrest)
Machine learning   session6(decision trees random forrest)Machine learning   session6(decision trees random forrest)
Machine learning session6(decision trees random forrest)Abhimanyu Dwivedi
 
Machine learning session9(clustering)
Machine learning   session9(clustering)Machine learning   session9(clustering)
Machine learning session9(clustering)Abhimanyu Dwivedi
 
Machine Learning Performance Evaluation: Tips and Pitfalls - Jose Hernandez O...
Machine Learning Performance Evaluation: Tips and Pitfalls - Jose Hernandez O...Machine Learning Performance Evaluation: Tips and Pitfalls - Jose Hernandez O...
Machine Learning Performance Evaluation: Tips and Pitfalls - Jose Hernandez O...PAPIs.io
 
The caret package is a unified interface to a large number of predictive mode...
The caret package is a unified interface to a large number of predictive mode...The caret package is a unified interface to a large number of predictive mode...
The caret package is a unified interface to a large number of predictive mode...odsc
 
Tree-Based Methods (Article 8 - Practical Exercises)
Tree-Based Methods (Article 8 - Practical Exercises)Tree-Based Methods (Article 8 - Practical Exercises)
Tree-Based Methods (Article 8 - Practical Exercises)Theodore Grammatikopoulos
 

What's hot (20)

Chapter01 introductory handbook
Chapter01 introductory handbookChapter01 introductory handbook
Chapter01 introductory handbook
 
The caret Package: A Unified Interface for Predictive Models
The caret Package: A Unified Interface for Predictive ModelsThe caret Package: A Unified Interface for Predictive Models
The caret Package: A Unified Interface for Predictive Models
 
Network Analytics - Homework 3 - Msc Business Analytics - Imperial College Lo...
Network Analytics - Homework 3 - Msc Business Analytics - Imperial College Lo...Network Analytics - Homework 3 - Msc Business Analytics - Imperial College Lo...
Network Analytics - Homework 3 - Msc Business Analytics - Imperial College Lo...
 
Ridge regression, lasso and elastic net
Ridge regression, lasso and elastic netRidge regression, lasso and elastic net
Ridge regression, lasso and elastic net
 
An Introduction to Simulation in the Social Sciences
An Introduction to Simulation in the Social SciencesAn Introduction to Simulation in the Social Sciences
An Introduction to Simulation in the Social Sciences
 
Caret Package for R
Caret Package for RCaret Package for R
Caret Package for R
 
Machine learning session8(svm nlp)
Machine learning   session8(svm nlp)Machine learning   session8(svm nlp)
Machine learning session8(svm nlp)
 
Cg32519523
Cg32519523Cg32519523
Cg32519523
 
SupportVectorRegression
SupportVectorRegressionSupportVectorRegression
SupportVectorRegression
 
Answers
AnswersAnswers
Answers
 
DN 2017 | Multi-Paradigm Data Science - On the many dimensions of Knowledge D...
DN 2017 | Multi-Paradigm Data Science - On the many dimensions of Knowledge D...DN 2017 | Multi-Paradigm Data Science - On the many dimensions of Knowledge D...
DN 2017 | Multi-Paradigm Data Science - On the many dimensions of Knowledge D...
 
Binary classification metrics_cheatsheet
Binary classification metrics_cheatsheetBinary classification metrics_cheatsheet
Binary classification metrics_cheatsheet
 
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project ADN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
 
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
 
M08 BiasVarianceTradeoff
M08 BiasVarianceTradeoffM08 BiasVarianceTradeoff
M08 BiasVarianceTradeoff
 
Machine learning session6(decision trees random forrest)
Machine learning   session6(decision trees random forrest)Machine learning   session6(decision trees random forrest)
Machine learning session6(decision trees random forrest)
 
Machine learning session9(clustering)
Machine learning   session9(clustering)Machine learning   session9(clustering)
Machine learning session9(clustering)
 
Machine Learning Performance Evaluation: Tips and Pitfalls - Jose Hernandez O...
Machine Learning Performance Evaluation: Tips and Pitfalls - Jose Hernandez O...Machine Learning Performance Evaluation: Tips and Pitfalls - Jose Hernandez O...
Machine Learning Performance Evaluation: Tips and Pitfalls - Jose Hernandez O...
 
The caret package is a unified interface to a large number of predictive mode...
The caret package is a unified interface to a large number of predictive mode...The caret package is a unified interface to a large number of predictive mode...
The caret package is a unified interface to a large number of predictive mode...
 
Tree-Based Methods (Article 8 - Practical Exercises)
Tree-Based Methods (Article 8 - Practical Exercises)Tree-Based Methods (Article 8 - Practical Exercises)
Tree-Based Methods (Article 8 - Practical Exercises)
 

Similar to [M3A3] Data Analysis and Interpretation Specialization

[M3A4] Data Analysis and Interpretation Specialization
[M3A4] Data Analysis and Interpretation Specialization[M3A4] Data Analysis and Interpretation Specialization
[M3A4] Data Analysis and Interpretation SpecializationAndrea Rubio
 
Insurance Optimization
Insurance OptimizationInsurance Optimization
Insurance OptimizationAlbert Chu
 
© Charles T. Diebold, Ph.D., 71113, 100313. All Rights Res.docx
© Charles T. Diebold, Ph.D., 71113, 100313. All Rights Res.docx© Charles T. Diebold, Ph.D., 71113, 100313. All Rights Res.docx
© Charles T. Diebold, Ph.D., 71113, 100313. All Rights Res.docxLynellBull52
 
Eli plots visualizing innumerable number of correlations
Eli plots   visualizing innumerable number of correlationsEli plots   visualizing innumerable number of correlations
Eli plots visualizing innumerable number of correlationsLeonardo Auslender
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationBoston Institute of Analytics
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Yao Yao
 
Linear Regression (Machine Learning)
Linear Regression (Machine Learning)Linear Regression (Machine Learning)
Linear Regression (Machine Learning)Omkar Rane
 
Exploring Support Vector Regression - Signals and Systems Project
Exploring Support Vector Regression - Signals and Systems ProjectExploring Support Vector Regression - Signals and Systems Project
Exploring Support Vector Regression - Signals and Systems ProjectSurya Chandra
 
Logistic Regression in Case-Control Study
Logistic Regression in Case-Control StudyLogistic Regression in Case-Control Study
Logistic Regression in Case-Control StudySatish Gupta
 
DIRECTIONS READ THE FOLLOWING STUDENT POST AND RESPOND EVALUATE I.docx
DIRECTIONS READ THE FOLLOWING STUDENT POST AND RESPOND EVALUATE I.docxDIRECTIONS READ THE FOLLOWING STUDENT POST AND RESPOND EVALUATE I.docx
DIRECTIONS READ THE FOLLOWING STUDENT POST AND RESPOND EVALUATE I.docxlynettearnold46882
 
Kaggle digits analysis_final_fc
Kaggle digits analysis_final_fcKaggle digits analysis_final_fc
Kaggle digits analysis_final_fcZachary Combs
 
Moving Beyond Linearity (Article 7 - Practical Exercises)
Moving Beyond Linearity (Article 7 - Practical Exercises)Moving Beyond Linearity (Article 7 - Practical Exercises)
Moving Beyond Linearity (Article 7 - Practical Exercises)Theodore Grammatikopoulos
 
maXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VIImaXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VIIMax Kleiner
 
Essentials of machine learning algorithms
Essentials of machine learning algorithmsEssentials of machine learning algorithms
Essentials of machine learning algorithmsArunangsu Sahu
 
4_Tutorial.pdf
4_Tutorial.pdf4_Tutorial.pdf
4_Tutorial.pdfbozo18
 
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...Chakkrit (Kla) Tantithamthavorn
 
AlgorithmsModelsNov13.pptx
AlgorithmsModelsNov13.pptxAlgorithmsModelsNov13.pptx
AlgorithmsModelsNov13.pptxPerumalPitchandi
 
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...IJCSES Journal
 

Similar to [M3A3] Data Analysis and Interpretation Specialization (20)

[M3A4] Data Analysis and Interpretation Specialization
[M3A4] Data Analysis and Interpretation Specialization[M3A4] Data Analysis and Interpretation Specialization
[M3A4] Data Analysis and Interpretation Specialization
 
Insurance Optimization
Insurance OptimizationInsurance Optimization
Insurance Optimization
 
© Charles T. Diebold, Ph.D., 71113, 100313. All Rights Res.docx
© Charles T. Diebold, Ph.D., 71113, 100313. All Rights Res.docx© Charles T. Diebold, Ph.D., 71113, 100313. All Rights Res.docx
© Charles T. Diebold, Ph.D., 71113, 100313. All Rights Res.docx
 
Eli plots visualizing innumerable number of correlations
Eli plots   visualizing innumerable number of correlationsEli plots   visualizing innumerable number of correlations
Eli plots visualizing innumerable number of correlations
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation


[M3A3] Data Analysis and Interpretation Specialization

  • 1. DATA ANALYSIS COLLECTION ASSIGNMENT Data Analysis And Interpretation Specialization Test A Multiple Regression Model Andrea Rubio Amorós June 15, 2017 Module 3 Assignment 3
  • 2. 1 Introduction
Multiple regression analysis is a tool that allows you to expand on your research question and conduct a more rigorous test of the association between your explanatory and response variables by adding additional quantitative and/or categorical explanatory variables to your linear regression model. In this session, you will apply and interpret a multiple regression analysis for a quantitative response variable, and will learn how to use confidence intervals to take into account error in estimating a population parameter. You will also learn how to account for nonlinear associations in a linear regression model.

Finally, you will develop experience using regression diagnostic techniques to evaluate how well your multiple regression model predicts your observed response variable. Note that if you have not yet identified additional explanatory variables, you should choose at least one additional explanatory variable from your data set. When you go back to your codebooks, ask yourself a few questions like “What other variables might explain the association between my explanatory and response variable?”, “What other variables might explain more of the variability in my response variable?”, or even “What other explanatory variables might be interesting to explore?” Additional explanatory variables can be quantitative, categorical, or both. Although you need only two explanatory variables to test a multiple regression model, we encourage you to identify more than one additional explanatory variable. Doing so will really allow you to experience the power of multiple regression analysis, and will increase your confidence in your ability to test and interpret more complex regression models.

If your research question does not include one quantitative response variable, you can use the same quantitative response variable that you used in Module 2, or you may choose another one from your data set.

Document written in LaTeX, template_version_01.tex
  • 3. 2 Python Code
This week, I will be working with the NESARC dataset in order to study multiple variables that might influence the personal income of the population.

Multiple explanatory variables:
    Age      18-97 (age in years); 98 (98 years or older)
    Sex      1 (male); 2 (female)
    Grade    1 (no); 2 (yes)
Response variable:
    Income   0-3,000,000 (income in USD)
Table 2.1 Multiple Explanatory Variables and Response Variable

Through a multiple regression model, I want to study whether there is a positive or negative association between each explanatory variable and my response variable.

Writing the program: The selected dataset includes observations from a population that includes children, adults, students, unemployed individuals, part-time workers, full-time workers and retired individuals. Not all of these observations are relevant for my research. When studying personal income, I need to focus only on employed individuals and, to be more precise, on full-time workers. Therefore, I start by creating a new variable called FULLTIME, which is set to 1 for full-time workers and left missing for everyone else, so that dropping missing values keeps only the full-time workers. Then I create a personalized dataset called “mydata”, which includes only the variables chosen for this research.
# import libraries
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
import scipy
import statsmodels.api as sm
import statsmodels.formula.api as smf

# folder that contains the data file
# (assumption: working_folder is not defined in the original listing; adjust as needed)
working_folder = ''

# read in the data set we want to work with
data = pandas.read_csv(working_folder + "M3A3data_nesarc_pds.csv", low_memory=False)

# create new variable FULLTIME from original S1Q7A1, to include only full time workers (observations 1=yes)
# rows that are not full time return None (missing) and are removed by dropna() below
def FULLTIME(row):
    if row['S1Q7A1'] == 1:
        return 1

data['FULLTIME'] = data.apply(lambda row: FULLTIME(row), axis=1)

# create new variable GRADE from original S1Q6A, to reduce it to two levels (observations 1=no, 2=yes)
# codes 12 and above correspond to a completed bachelor's degree or higher
def GRADE(row):
    if row['S1Q6A'] < 12:
        return 1
    elif row['S1Q6A'] >= 12:
        return 2

data['GRADE'] = data.apply(lambda row: GRADE(row), axis=1)

# give a new name to variable for better understanding
data['INCOME'] = data['S1Q10A']

# create a personalized dataset only with the chosen variables for this research
mydata = data[['AGE', 'SEX', 'GRADE', 'FULLTIME', 'INCOME']].dropna()
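The row-by-row recoding above can also be written with vectorized column operations. The following is a minimal sketch on a hypothetical four-row table (the values are invented for illustration; only the column names come from the codebook), and the helper name `recode` is an assumption, not part of the original program. It assumes S1Q6A has no missing values, which matches the frequencies listed in the codebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; column names follow the NESARC codebook.
toy = pd.DataFrame({
    "S1Q7A1": [1, 2, 1, 2],                  # 1 = works full time, 2 = does not
    "S1Q6A": [8, 12, 14, 7],                 # highest grade completed (codes 1-14)
    "S1Q10A": [30000, 52000, 88000, 12000],  # total personal income
})


def recode(df):
    """Vectorized version of the FULLTIME / GRADE recoding (illustrative helper)."""
    out = df.copy()
    # 1 for full-time workers, NaN otherwise, so dropna() removes everyone else.
    out["FULLTIME"] = np.where(out["S1Q7A1"] == 1, 1.0, np.nan)
    # 1 = below a bachelor's degree (codes < 12), 2 = bachelor's degree or higher.
    # Assumes S1Q6A has no missing values (a NaN would be coded 1 here).
    out["GRADE"] = np.where(out["S1Q6A"] >= 12, 2, 1)
    out["INCOME"] = out["S1Q10A"]
    return out[["GRADE", "FULLTIME", "INCOME"]].dropna()


result = recode(toy)
print(result)
```

On a data set of this size, column-wise operations are usually much faster than `DataFrame.apply` with `axis=1`, which calls a Python function once per row.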
  • 4. To make the output easier to compare, I will first use a linear regression model for the association between AGE and INCOME, and then compare the results with a multiple regression model by adding more explanatory variables to the function.

# linear regression
# scatterplot for the association between AGE and TOTAL PERSONAL INCOME
plt.figure(1)
scat1 = seaborn.regplot(x='AGE', y='INCOME', scatter=True, data=mydata, line_kws={'color': 'red'})
plt.xlabel('AGE')
plt.ylabel('TOTAL PERSONAL INCOME')
plt.title('Association between AGE and TOTAL PERSONAL INCOME for full time workers')

Figure 2.1 Association between AGE and TOTAL PERSONAL INCOME for full time workers (x-axis: AGE, y-axis: TOTAL PERSONAL INCOME)

As you can see, a few observations have very high values compared to the majority of the observations, which makes the plot difficult to read. For that reason, we need to restrict the x and y axes to smaller ranges in order to zoom into the scatterplot.

# linear regression
# scatterplot for the association between AGE and TOTAL PERSONAL INCOME
plt.figure(2)
scat1 = seaborn.regplot(x='AGE', y='INCOME', scatter=True, data=mydata, line_kws={'color': 'red'})
plt.xlabel('AGE')
plt.ylabel('TOTAL PERSONAL INCOME')
plt.title('Association between AGE and TOTAL PERSONAL INCOME for full time workers')
# zoom in: limit both axes so that the bulk of the observations is visible
plt.xlim((0, 100))
plt.ylim((0, 1000000))
  • 5. Figure 2.2 Association between AGE and TOTAL PERSONAL INCOME for full time workers (x-axis: AGE, y-axis: TOTAL PERSONAL INCOME)

Now it looks like there is a positive relationship between AGE and INCOME (increasing red line), but the line does not exactly fit the pattern of the dots. The output suggests a curve that starts with small values, increases to higher values in the middle and decreases again at the end. I therefore expect a polynomial (second-order) regression model to fit this pattern better than a straight line.

# polynomial regression
# scatterplot for the association between AGE and TOTAL PERSONAL INCOME
plt.figure(3)
# order=2 fits a second-order polynomial instead of a straight line
scat2 = seaborn.regplot(x='AGE', y='INCOME', scatter=True, order=2, data=mydata, line_kws={'color': 'red'})
plt.xlabel('AGE')
plt.ylabel('TOTAL PERSONAL INCOME')
plt.title('Association between AGE and TOTAL PERSONAL INCOME')
plt.xlim((0, 100))
plt.ylim((0, 1000000))

Figure 2.3 Association between AGE and TOTAL PERSONAL INCOME (x-axis: AGE, y-axis: TOTAL PERSONAL INCOME)
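The effect of adding a second-order term can be sketched outside of seaborn with `numpy.polyfit`. The data below are synthetic and noise-free (invented for illustration, not taken from NESARC), and the helper `sse` is a hypothetical name: income rises and then falls with age, so a straight line leaves a large residual error while a quadratic fits the curve.

```python
import numpy as np

# Synthetic, noise-free data that rises and then falls (hypothetical values).
age = np.arange(20, 70, 5, dtype=float)
income = -20.0 * (age - 45.0) ** 2 + 50000.0

# Least-squares fits of degree 1 (straight line) and degree 2 (quadratic).
linear = np.polyfit(age, income, deg=1)
quadratic = np.polyfit(age, income, deg=2)


def sse(coefs):
    """Sum of squared errors of a fitted polynomial on the synthetic data."""
    return float(np.sum((income - np.polyval(coefs, age)) ** 2))


print("linear SSE:   ", sse(linear))
print("quadratic SSE:", sse(quadratic))
print("quadratic leading coefficient:", quadratic[0])
```

A negative leading coefficient corresponds to a curve that opens downward, which is the shape described for the scatterplot.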
  • 6. As shown in the new output, the regression line now fits the pattern better. Another way to check these results is to fit the linear and polynomial regression models with the OLS function.

# center quantitative variables for regression analysis
mydata['AGE_c'] = (mydata['AGE'] - mydata['AGE'].mean())

# linear regression model
print('OLS Regression Model for the association between TOTAL PERSONAL INCOME and AGE')
regmod1 = smf.ols('INCOME ~ AGE_c', data=mydata).fit()
print(regmod1.summary())

# regression model with second order polynomial (quadratic term)
print('OLS Regression Model for the association between TOTAL PERSONAL INCOME and AGE')
regmod2 = smf.ols('INCOME ~ AGE_c + I(AGE_c**2)', data=mydata).fit()
print(regmod2.summary())

OLS Regression Model for the association between TOTAL PERSONAL INCOME and AGE
                            OLS Regression Results
==============================================================================
Dep. Variable:                 INCOME   R-squared:                       0.025
Model:                            OLS   Adj. R-squared:                  0.025
Method:                 Least Squares   F-statistic:                     571.5
Date:                Thu, 15 Jun 2017   Prob (F-statistic):          9.66e-125
Time:                        09:21:34   Log-Likelihood:            -2.7322e+05
No. Observations:               22267   AIC:                         5.464e+05
Df Residuals:                   22265   BIC:                         5.465e+05
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   3.969e+04    345.791    114.770      0.000     3.9e+04    4.04e+04
AGE_c        693.5094     29.009     23.907      0.000     636.650     750.368
==============================================================================
Omnibus:                    52977.388   Durbin-Watson:                   1.994
Prob(Omnibus):                  0.000   Jarque-Bera (JB):       1122433086.644
Skew:                          23.952   Prob(JB):                         0.00
Kurtosis:                    1101.861   Cond. No.                         11.9
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

OLS Regression Model for the association between TOTAL PERSONAL INCOME and AGE
                            OLS Regression Results
==============================================================================
Dep. Variable:                 INCOME   R-squared:                       0.029
Model:                            OLS   Adj. R-squared:                  0.029
Method:                 Least Squares   F-statistic:                     333.1
Date:                Thu, 15 Jun 2017   Prob (F-statistic):          2.95e-143
Time:                        09:21:34   Log-Likelihood:            -2.7317e+05
No. Observations:               22267   AIC:                         5.464e+05
Df Residuals:                   22264   BIC:                         5.464e+05
Df Model:                           2
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept      4.243e+04    448.224     94.671      0.000    4.16e+04    4.33e+04
AGE_c           753.3524     29.612     25.441      0.000     695.310     811.394
I(AGE_c ** 2)   -19.3358      2.013     -9.605      0.000     -23.282     -15.390
==============================================================================
Omnibus:                    53353.873   Durbin-Watson:                   1.993
Prob(Omnibus):                  0.000   Jarque-Bera (JB):       1184305888.518
Skew:                          24.372   Prob(JB):                         0.00
Kurtosis:                    1131.761   Cond. No.                         293.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
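The t value and the 95% confidence interval in the summary tables follow directly from each coefficient and its standard error. The sketch below reproduces them for the AGE_c row of the linear model; the helper name `t_and_ci` is an assumption, and it uses the normal critical value 1.96 as an approximation of the t quantile, which is reasonable here because the model has more than 22,000 residual degrees of freedom.

```python
def t_and_ci(coef, std_err, critical=1.96):
    """Return the t statistic and an approximate 95% confidence interval.

    critical=1.96 is the normal approximation of the t quantile; statsmodels
    uses the exact t distribution, so the last digits can differ slightly.
    """
    t_value = coef / std_err
    half_width = critical * std_err
    return t_value, (coef - half_width, coef + half_width)


# Coefficient and standard error of AGE_c, copied from the linear model output.
t_value, (low, high) = t_and_ci(693.5094, 29.009)
print(f"t = {t_value:.3f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

The printed values should agree with the table (t = 23.907, interval 636.650 to 750.368) up to rounding. Because the interval does not contain zero, the association between AGE_c and INCOME is significant at the 5% level.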
  • 7. The R-squared values indicate that the polynomial regression model explains slightly more of the variability in income than the linear regression model (0.029 vs. 0.025). The p-values show that in both regression models AGE is positively associated with INCOME, with p-values smaller than 0.0001. In the polynomial regression model, however, the coefficient of the quadratic term is negative (-19.3358), which indicates that the curve is concave.

Now that the model accounts for the curvature in AGE, I will proceed to add more explanatory variables to the OLS function and analyze the results.

# multiple regression model
print('OLS Regression Model for the association between TOTAL PERSONAL INCOME and AGE, SEX and GRADE')
regmod3 = smf.ols('INCOME ~ AGE_c + I(AGE_c**2) + SEX + GRADE', data=mydata).fit()
print(regmod3.summary())

OLS Regression Model for the association between TOTAL PERSONAL INCOME and AGE, SEX and GRADE
                            OLS Regression Results
==============================================================================
Dep. Variable:                 INCOME   R-squared:                       0.112
Model:                            OLS   Adj. R-squared:                  0.111
Method:                 Least Squares   F-statistic:                     698.8
Date:                Thu, 15 Jun 2017   Prob (F-statistic):               0.00
Time:                        09:21:34   Log-Likelihood:            -2.7218e+05
No. Observations:               22267   AIC:                         5.444e+05
Df Residuals:                   22262   BIC:                         5.444e+05
Df Model:                           4
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept       2.53e+04   1433.216     17.656      0.000    2.25e+04    2.81e+04
AGE_c           650.2752     28.439     22.865      0.000     594.532     706.018
I(AGE_c ** 2)   -12.3280      1.934     -6.375      0.000     -16.118      -8.538
SEX            -1.46e+04    660.861    -22.087      0.000   -1.59e+04   -1.33e+04
GRADE          2.905e+04    726.163     40.000      0.000    2.76e+04    3.05e+04
==============================================================================
Omnibus:                    55681.626   Durbin-Watson:                   1.996
Prob(Omnibus):                  0.000   Jarque-Bera (JB):       1650204138.297
Skew:                          27.098   Prob(JB):                         0.00
Kurtosis:                    1335.554   Cond. No.                     1.08e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.08e+03. This might indicate that there are strong multicollinearity or other numerical problems.

This new output shows that each of the explanatory variables is significantly associated with the response variable (every p-value is reported as 0.000), and the R-squared increases to 0.112. The association is positive for AGE and GRADE and negative for SEX: holding the other variables constant, income is on average about 14,600 USD lower for SEX = 2 (female) than for SEX = 1 (male), and about 29,050 USD higher for GRADE = 2 than for GRADE = 1.
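The concavity of the age curve can be made concrete by locating its turning point. For a fitted curve b1 * AGE_c + b2 * AGE_c**2 with b2 < 0, the maximum lies at AGE_c = -b1 / (2 * b2). The sketch below applies this to the coefficients reported in the two polynomial outputs; the function name `turning_point` is an assumption for illustration. Since AGE_c is centered, the result is expressed in years above the sample mean age, which is not reported in this document.

```python
def turning_point(b_linear, b_quadratic):
    """Centered age at which a quadratic age curve reaches its extremum."""
    if b_quadratic == 0:
        raise ValueError("no turning point: the quadratic coefficient is zero")
    return -b_linear / (2.0 * b_quadratic)


# Coefficients copied from the OLS outputs (AGE_c and I(AGE_c ** 2)).
peak_age_only = turning_point(753.3524, -19.3358)   # model with AGE only
peak_multiple = turning_point(650.2752, -12.3280)   # model with AGE, SEX and GRADE

print(f"AGE-only model: peak about {peak_age_only:.1f} years above the mean age")
print(f"Multiple model: peak about {peak_multiple:.1f} years above the mean age")
```

Both estimates place the predicted income peak well above the mean age of the full-time workers in the sample, consistent with a curve that rises over most of the working life and flattens or declines at the end.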
  • 8. To conclude, I have included a regression diagnostic plot (Q-Q plot) to evaluate the assumption that the residuals from our regression model are normally distributed, and to check whether there are any outlying observations that might be unduly influencing the estimation of the regression coefficients.

# regression diagnostic plot
# Q-Q plot for normality
plt.figure(4)
# line='r' adds a regression line fitted through the quantile points
regdiag = sm.qqplot(regmod3.resid, line='r')
print(regdiag)

Figure 2.4 Regression diagnostic plot (x-axis: Theoretical Quantiles, y-axis: Sample Quantiles)

The output shows that most of the residuals follow the red line, as expected for normally distributed residuals, but some values deviate strongly from it. These observations can be regarded as outliers and may influence the estimation of the regression coefficients. The skew (27.098) and kurtosis (1335.554) reported in the model summary point in the same direction: a small number of very high incomes gives the residuals a heavy right tail.
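What a Q-Q plot compares can be sketched with the standard library alone: the sorted, standardized residuals are paired with the quantiles a normal distribution would produce at the same ranks. The residuals below are invented for illustration, `qq_points` is a hypothetical helper, and the plotting positions (i - 0.5) / n are one common convention that may differ from the one statsmodels uses.

```python
from statistics import NormalDist


def qq_points(residuals):
    """Pair theoretical normal quantiles with standardized sample quantiles."""
    ordered = sorted(residuals)
    n = len(ordered)
    mean = sum(ordered) / n
    sd = (sum((r - mean) ** 2 for r in ordered) / (n - 1)) ** 0.5
    normal = NormalDist()
    points = []
    for i, r in enumerate(ordered, start=1):
        # Plotting position (i - 0.5) / n is an assumed convention.
        theoretical = normal.inv_cdf((i - 0.5) / n)
        points.append((theoretical, (r - mean) / sd))
    return points


# Hypothetical residuals: nine moderate values and one large outlier.
residuals = [-1.2, -0.7, -0.3, -0.1, 0.0, 0.2, 0.4, 0.8, 1.1, 9.0]
points = qq_points(residuals)
for theoretical, sample in points:
    print(f"{theoretical:6.2f}  {sample:6.2f}")
```

Points that lie close to the diagonal indicate agreement with the normal distribution. The last pair sits far above it, which is how a single extreme residual shows up in the upper right corner of a Q-Q plot.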
  • 9. 3 Codebook

Columns: Tape Location, Source Code, Description Title; below each variable: Frequency, Item value, Description

68-69    AGE      AGE
    43079   18-97. Age in years
       14   98. 98 years or older

79-79    SEX      SEX
    18518   1. Male
    24575   2. Female

131-132  S1Q6A    HIGHEST GRADE OR YEAR OF SCHOOL COMPLETED
      218   1. No formal schooling
      137   2. Completed grade K, 1 or 2
      421   3. Completed grade 3 or 4
      931   4. Completed grade 5 or 6
      414   5. Completed grade 7
     1210   6. Completed grade 8
     4518   7. Some high school (grades 9-11)
    10935   8. Completed high school
     1612   9. Graduate equivalency degree (GED)
     8891   10. Some college (no degree)
     3772   11. Completed associate or other technical 2-year degree
     5251   12. Completed college (bachelor's degree)
     1526   13. Some graduate or professional studies (completed bachelor's degree but not graduate degree)
     3257   14. Completed graduate or professional degree (master's degree or higher)

136-136  S1Q7A1   PRESENT SITUATION INCLUDES WORKING FULL TIME (35+ HOURS A WEEK)
    22267   1. Yes
    20826   2. No

171-172  S1Q9A    BUSINESS OR INDUSTRY: CURRENT OR MOST RECENT JOB
     1143   1. Agriculture
      147   2. Mining
     2124   3. Construction
     3909   4. Manufacturing
     2404   5. Transportation, Communications and Other Public Utilities
      724   6. Wholesale Trade
     4701   7. Retail Trade
     1945   8. Finance, Insurance and Real Estate
     1248   9. Business and Repair Service
     4186   10. Personal Services
      992   11. Entertainment and Recreation Services
     8181   12. Professional and Related Services
     1925   13. Public Administration
      431   14. Armed Services
     9033   BL. NA, never worked for pay or in family business or farm

179-185  S1Q10A   TOTAL PERSONAL INCOME IN LAST 12 MONTHS
    43093   0-3000000. Income in dollars

186-187  S1Q10B   TOTAL PERSONAL INCOME IN LAST 12 MONTHS: CATEGORY
     2462   0. $0 (No personal income)
     3571   1. $1 to $4,999
     3823   2. $5,000 to $7,999
     2002   3. $8,000 to $9,999
     3669   4. $10,000 to $12,999
     1634   5. $13,000 to $14,999
     3940   6. $15,000 to $19,999
     3887   7. $20,000 to $24,999
     3085   8. $25,000 to $29,999
     3003   9. $30,000 to $34,999
     2351   10. $35,000 to $39,999
     3291   11. $40,000 to $49,999
     2059   12. $50,000 to $59,999
     1328   13. $60,000 to $69,999
      857   14. $70,000 to $79,999
      521   15. $80,000 to $89,999
      290   16. $90,000 to $99,999
     1320   17. $100,000 or more