Data Analysis and Interpretation Specialization
Test a Multiple Regression Model
Andrea Rubio Amorós
June 15, 2017
Module 3, Assignment 3 (M3A3)
1 Introduction
Multiple regression analysis is a tool that allows you to expand on your research question and conduct a more rigorous test of the association between your explanatory and response variables, by adding additional quantitative and/or categorical explanatory variables to your linear regression model. In this session, you will apply and interpret a multiple regression analysis for a quantitative response variable, and you will learn how to use confidence intervals to take into account the error in estimating a population parameter. You will also learn how to account for nonlinear associations in a linear regression model.
Finally, you will gain experience using regression diagnostic techniques to evaluate how well your multiple regression model predicts your observed response variable. Note that if you have not yet identified additional explanatory variables, you should choose at least one additional explanatory variable from your data set. When you go back to your codebooks, ask yourself a few questions such as "What other variables might explain the association between my explanatory and response variables?", "What other variables might explain more of the variability in my response variable?", or even "What other explanatory variables might be interesting to explore?" Additional explanatory variables can be quantitative, categorical, or both. Although you need only two explanatory variables to test a multiple regression model, we encourage you to identify more than one additional explanatory variable. Doing so will really allow you to experience the power of multiple regression analysis, and will increase your confidence in your ability to test and interpret more complex regression models. If your research question does not include one quantitative response variable, you can use the same quantitative response variable that you used in Module 2, or you may choose another one from your data set.
Document written in LATEX
template_version_01.tex
2 Python Code
This week, I will be working with the NESARC dataset, in order to study multiple variables that might influence the
personal income of the population.
Multiple explanatory variables:
  Age:   18-97 (age in years); 98 (98 years or older)
  Sex:   1 (male); 2 (female)
  Grade: 1 (no); 2 (yes)
Response variable:
  Income: 0-3,000,000 (income in USD)
Table 2.1 Multiple Explanatory Variables and Response Variable
Through a multiple regression model, I want to study whether there is a positive or negative association between each explanatory variable and my response variable.
Writing the program: The selected dataset includes a random sample of observations from a population that includes children, adults, students, unemployed individuals, part-time workers, full-time workers, and retired individuals. Obviously, not all of these observations are relevant for my research: when studying personal income I need to focus only on employed individuals and, to be more precise, on full-time workers.
Therefore, I start by creating a new variable called FULLTIME that flags only the full-time workers. Then I create a personalized dataset called "mydata" that includes only the variables chosen for this research.
# import libraries
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
import scipy
import statsmodels.api as sm
import statsmodels.formula.api as smf

# read in the data set we want to work with
# (working_folder must point to the folder containing the csv file)
data = pandas.read_csv(working_folder+"M3A3data_nesarc_pds.csv", low_memory=False)

# create new variable FULLTIME from original S1Q7A1, to include only full-time
# workers (1=yes); all other rows get NaN and are removed by dropna() below
def FULLTIME(row):
    if row['S1Q7A1'] == 1:
        return 1
data['FULLTIME'] = data.apply(lambda row: FULLTIME(row), axis=1)

# create new variable GRADE from original S1Q6A, reduced to two levels
# (1 = S1Q6A codes below 12, 2 = codes 12 and above)
def GRADE(row):
    if row['S1Q6A'] < 12:
        return 1
    elif row['S1Q6A'] >= 12:
        return 2
data['GRADE'] = data.apply(lambda row: GRADE(row), axis=1)

# give the income variable a new name for better understanding
data['INCOME'] = data['S1Q10A']

# create a personalized dataset with only the variables chosen for this research
mydata = data[['AGE','SEX','GRADE','FULLTIME','INCOME']].dropna()
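As a side note, the row-wise apply() recodes above can also be written with vectorized numpy.where() calls, which behave identically and run faster on a dataset of this size. A minimal sketch on a small made-up frame (the column names follow the NESARC codebook, but the values here are invented for illustration):

```python
import numpy as np
import pandas as pd

# tiny synthetic stand-in for the NESARC data (values are invented)
data = pd.DataFrame({
    'AGE':    [25, 40, 61, 33],
    'SEX':    [1, 2, 2, 1],
    'S1Q6A':  [8, 11, 14, 9],    # highest grade completed (codebook codes)
    'S1Q7A1': [1, 2, 1, 1],      # 1 = works full time
    'S1Q10A': [30000, 0, 80000, 42000],
})

# keep only full-time workers: non-matches become NaN and are dropped below
data['FULLTIME'] = np.where(data['S1Q7A1'] == 1, 1, np.nan)
# collapse grade to two levels, same cut as the GRADE() function above
data['GRADE'] = np.where(data['S1Q6A'] < 12, 1, 2)
data['INCOME'] = data['S1Q10A']

mydata = data[['AGE', 'SEX', 'GRADE', 'FULLTIME', 'INCOME']].dropna()
print(mydata)
```

With the four synthetic rows above, the second row (a part-time worker) is dropped and three full-time workers remain.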
To better differentiate the output, I will first use a linear regression model for the association between AGE and INCOME, and then compare the results with a multiple regression model by adding more explanatory variables to the function.
# linear regression
# scatterplot for the association between AGE and TOTAL PERSONAL INCOME
plt.figure(1)
scat1 = seaborn.regplot(x='AGE', y='INCOME', scatter=True, data=mydata, line_kws={'color': 'red'})
plt.xlabel('AGE')
plt.ylabel('TOTAL PERSONAL INCOME')
plt.title('Association between AGE and TOTAL PERSONAL INCOME for full time workers')
Figure 2.1 Association between AGE and TOTAL PERSONAL INCOME for full time workers
As you can see, a few observations have very high values compared with the majority, which makes the data difficult to read. For that reason, we need to limit the x and y axes to smaller ranges in order to zoom into the scatterplot.
# linear regression
# scatterplot for the association between AGE and TOTAL PERSONAL INCOME
plt.figure(2)
scat1 = seaborn.regplot(x='AGE', y='INCOME', scatter=True, data=mydata, line_kws={'color': 'red'})
plt.xlabel('AGE')
plt.ylabel('TOTAL PERSONAL INCOME')
plt.title('Association between AGE and TOTAL PERSONAL INCOME for full time workers')
plt.xlim((0,100))
plt.ylim((0,1000000))
Figure 2.2 Association between AGE and TOTAL PERSONAL INCOME for full time workers
Now it looks like there is a positive relationship between AGE and INCOME (increasing red line), but the line does not exactly fit the pattern of the dots. The output suggests a curve: small values at younger ages, higher values in the middle, and decreasing values again at the end. I expect that a polynomial regression model will fit a higher number of observations and make our research more accurate.
# polynomial regression
# scatterplot for the association between AGE and TOTAL PERSONAL INCOME
plt.figure(3)
scat2 = seaborn.regplot(x='AGE', y='INCOME', scatter=True, order=2, data=mydata, line_kws={'color': 'red'})
plt.xlabel('AGE')
plt.ylabel('TOTAL PERSONAL INCOME')
plt.title('Association between AGE and TOTAL PERSONAL INCOME')
plt.xlim((0,100))
plt.ylim((0,1000000))
Figure 2.3 Association between AGE and TOTAL PERSONAL INCOME
As shown in the new output, the regression line now fits the pattern better. Another way to verify these results is to fit OLS models for the linear and the polynomial regression.
# center quantitative variables for regression analysis
mydata['AGE_c'] = (mydata['AGE'] - mydata['AGE'].mean())

# linear regression model
print('OLS Regression Model for the association between TOTAL PERSONAL INCOME and AGE')
regmod1 = smf.ols('INCOME ~ AGE_c', data=mydata).fit()
print(regmod1.summary())

# regression model with second order polynomial (quadratic term)
print('OLS Regression Model for the association between TOTAL PERSONAL INCOME and AGE')
regmod2 = smf.ols('INCOME ~ AGE_c + I(AGE_c**2)', data=mydata).fit()
print(regmod2.summary())
OLS Regression Model for the association between TOTAL PERSONAL INCOME and AGE
OLS Regression Results
==============================================================================
Dep. Variable: INCOME R-squared: 0.025
Model: OLS Adj. R-squared: 0.025
Method: Least Squares F-statistic: 571.5
Date: Thu, 15 Jun 2017 Prob (F-statistic): 9.66e-125
Time: 09:21:34 Log-Likelihood: -2.7322e+05
No. Observations: 22267 AIC: 5.464e+05
Df Residuals: 22265 BIC: 5.465e+05
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept 3.969e+04 345.791 114.770 0.000 3.9e+04 4.04e+04
AGE_c 693.5094 29.009 23.907 0.000 636.650 750.368
==============================================================================
Omnibus: 52977.388 Durbin-Watson: 1.994
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1122433086.644
Skew: 23.952 Prob(JB): 0.00
Kurtosis: 1101.861 Cond. No. 11.9
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Model for the association between TOTAL PERSONAL INCOME and AGE
OLS Regression Results
==============================================================================
Dep. Variable: INCOME R-squared: 0.029
Model: OLS Adj. R-squared: 0.029
Method: Least Squares F-statistic: 333.1
Date: Thu, 15 Jun 2017 Prob (F-statistic): 2.95e-143
Time: 09:21:34 Log-Likelihood: -2.7317e+05
No. Observations: 22267 AIC: 5.464e+05
Df Residuals: 22264 BIC: 5.464e+05
Df Model: 2
Covariance Type: nonrobust
=================================================================================
coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept 4.243e+04 448.224 94.671 0.000 4.16e+04 4.33e+04
AGE_c 753.3524 29.612 25.441 0.000 695.310 811.394
I(AGE_c ** 2) -19.3358 2.013 -9.605 0.000 -23.282 -15.390
==============================================================================
Omnibus: 53353.873 Durbin-Watson: 1.993
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1184305888.518
Skew: 24.372 Prob(JB): 0.00
Kurtosis: 1131.761 Cond. No. 293.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The R-squared values indicate that the polynomial regression model explains more of the variability in the response than the linear regression model (0.029 vs. 0.025). Looking at the p-values, we can see that in both regression models AGE is positively related to INCOME, with p-values smaller than 0.0001. In the polynomial regression model, however, the quadratic coefficient is negative, which indicates that the curve is concave.
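As a quick check of what the concave curve implies, the fitted quadratic can be solved for the age at which predicted income peaks. For income = a + b·AGE_c + c·AGE_c², the maximum lies where the derivative b + 2c·AGE_c is zero, i.e. at AGE_c = −b/(2c); plugging in the regmod2 coefficients from the output above:

```python
# coefficients taken from the regmod2 summary above
b = 753.3524    # AGE_c (linear term)
c = -19.3358    # I(AGE_c ** 2) (quadratic term)

# a concave parabola peaks where its derivative b + 2*c*AGE_c is zero
peak_age_c = -b / (2 * c)
print(round(peak_age_c, 1))  # -> 19.5 years above the sample mean age
```

So predicted income peaks roughly 19.5 years above the mean age of the full-time workers in the sample, consistent with the rise-then-fall shape seen in Figure 2.3.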
Now that my regression model fits the data better, I will proceed to add more explanatory variables to the OLS function and analyze the results.
# multiple regression model
print('OLS Regression Model for the association between TOTAL PERSONAL INCOME and AGE, SEX and GRADE')
regmod3 = smf.ols('INCOME ~ AGE_c + I(AGE_c**2) + SEX + GRADE', data=mydata).fit()
print(regmod3.summary())
OLS Regression Model for the association between TOTAL PERSONAL INCOME and AGE, SEX and GRADE
OLS Regression Results
==============================================================================
Dep. Variable: INCOME R-squared: 0.112
Model: OLS Adj. R-squared: 0.111
Method: Least Squares F-statistic: 698.8
Date: Thu, 15 Jun 2017 Prob (F-statistic): 0.00
Time: 09:21:34 Log-Likelihood: -2.7218e+05
No. Observations: 22267 AIC: 5.444e+05
Df Residuals: 22262 BIC: 5.444e+05
Df Model: 4
Covariance Type: nonrobust
=================================================================================
coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept 2.53e+04 1433.216 17.656 0.000 2.25e+04 2.81e+04
AGE_c 650.2752 28.439 22.865 0.000 594.532 706.018
I(AGE_c ** 2) -12.3280 1.934 -6.375 0.000 -16.118 -8.538
SEX -1.46e+04 660.861 -22.087 0.000 -1.59e+04 -1.33e+04
GRADE 2.905e+04 726.163 40.000 0.000 2.76e+04 3.05e+04
==============================================================================
Omnibus: 55681.626 Durbin-Watson: 1.996
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1650204138.297
Skew: 27.098 Prob(JB): 0.00
Kurtosis: 1335.554 Cond. No. 1.08e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.08e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
This new output shows that there is an association between each of the explanatory variables and the response variable. The association is positive for AGE and GRADE and negative for SEX (with SEX coded 1 = male and 2 = female, the negative coefficient means that women report lower personal income).
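One refinement worth noting: SEX and GRADE are coded 1/2 and entered above as if they were quantitative, which works for two-level variables but makes the intercept refer to a nonexistent level 0. Wrapping them in C() tells statsmodels to treat them as categorical, with level 1 as the reference. A sketch on simulated data (the variable names mirror mydata, but the values and effect sizes here are invented):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulate a small frame shaped like mydata (values are invented)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    'AGE_c': rng.normal(0, 12, n),
    'SEX': rng.integers(1, 3, n),     # 1 = male, 2 = female
    'GRADE': rng.integers(1, 3, n),   # two-level education indicator
})
df['INCOME'] = (40000 + 650 * df['AGE_c'] - 14000 * (df['SEX'] == 2)
                + 29000 * (df['GRADE'] == 2) + rng.normal(0, 20000, n))

# C() requests treatment (dummy) coding with level 1 as the reference
model = smf.ols('INCOME ~ AGE_c + C(SEX) + C(GRADE)', data=df).fit()
print(model.params)   # C(SEX)[T.2] is the female-vs-male difference
```

With a 1/2 coding the slope estimates are numerically the same either way; the gain is that the summary labels (e.g. C(SEX)[T.2]) make the reference group explicit.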
To conclude, I have included a regression diagnostic plot (Q-Q plot) to evaluate the assumption that the residuals from our regression model are normally distributed, and to check whether there are any outlying observations that might be unduly influencing the estimation of the regression coefficients.
# regression diagnostic plot
# Q-Q plot for normality
plt.figure(4)
regdiag = sm.qqplot(regmod3.resid, line='r')
plt.show()
Figure 2.4 Regression diagnostic plot (Q-Q plot: Sample Quantiles vs. Theoretical Quantiles)
The output shows that most of the residuals are approximately normally distributed (they follow the red line), but there are some values that can be considered outliers and may slightly influence the estimation of the regression coefficients.
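The visual impression from the Q-Q plot can be backed up numerically by counting observations whose standardized residual exceeds ±2.5, a common rule of thumb for flagging outliers. A sketch on simulated data with a few planted outliers; on the real data, regmod3.resid_pearson would be used in the same way:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulate a simple regression with five gross outliers planted
rng = np.random.default_rng(1)
df = pd.DataFrame({'x': rng.normal(0, 1, 300)})
df['y'] = 2 * df['x'] + rng.normal(0, 1, 300)
df.loc[:4, 'y'] += 15          # shift the first five observations upward

fit = smf.ols('y ~ x', data=df).fit()
std_resid = fit.resid_pearson  # residuals scaled to unit variance
n_outliers = int((np.abs(std_resid) > 2.5).sum())
print(n_outliers)
```

The five planted points stand far outside the ±2.5 band, so they are flagged, while the well-behaved observations are not.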
3 Codebook

Columns: Tape Location · Source Code · Title; the rows below each variable give Frequency · Item value · Description.
68-69 AGE AGE
43079 18-97. Age in years
14 98. 98 years or older
79-79 SEX SEX
18518 1. Male
24575 2. Female
131-132 S1Q6A HIGHEST GRADE OR YEAR OF SCHOOL COMPLETED
218 1. No formal schooling
137 2. Completed grade K, 1 or 2
421 3. Completed grade 3 or 4
931 4. Completed grade 5 or 6
414 5. Completed grade 7
1210 6. Completed grade 8
4518 7. Some high school (grades 9-11)
10935 8. Completed high school
1612 9. Graduate equivalency degree (GED)
8891 10. Some college (no degree)
3772 11. Completed associate or other technical 2-year degree
5251 12. Completed college (bachelor's degree)
1526 13. Some graduate or professional studies (completed bachelor's degree but not graduate degree)
3257 14. Completed graduate or professional degree (master's degree or higher)
136-136 S1Q7A1 PRESENT SITUATION INCLUDES WORKING FULL TIME (35+ HOURS A WEEK)
22267 1. Yes
20826 2. No
171-172 S1Q9A BUSINESS OR INDUSTRY: CURRENT OR MOST RECENT JOB
1143 1. Agriculture
147 2. Mining
2124 3. Construction
3909 4. Manufacturing
2404 5. Transportation, Communications and Other Public Utilities
724 6. Wholesale Trade
4701 7. Retail Trade
1945 8. Finance, Insurance and Real Estate
1248 9. Business and Repair Service
4186 10. Personal Services
992 11. Entertainment and Recreation Services
8181 12. Professional and Related Services
1925 13. Public Administration
431 14. Armed Services
9033 BL. NA, never worked for pay or in family business or farm
179-185 S1Q10A TOTAL PERSONAL INCOME IN LAST 12 MONTHS
43093 0-3000000. Income in dollars
186-187 S1Q10B TOTAL PERSONAL INCOME IN LAST 12 MONTHS: CATEGORY
2462 0. $0 (No personal income)
3571 1. $1 to $4,999
3823 2. $5,000 to $7,999
2002 3. $8,000 to $9,999
3669 4. $10,000 to $12,999
1634 5. $13,000 to $14,999
3940 6. $15,000 to $19,999
3887 7. $20,000 to $24,999
3085 8. $25,000 to $29,999
3003 9. $30,000 to $34,999
2351 10. $35,000 to $39,999
3291 11. $40,000 to $49,999
2059 12. $50,000 to $59,999
1328 13. $60,000 to $69,999
857 14. $70,000 to $79,999
521 15. $80,000 to $89,999
290 16. $90,000 to $99,999
1320 17. $100,000 or more