DATA 503 – Applied Regression Analysis
Lecture 14: Diagnostics, Collinearity, Transformation, and Missing Data Highlights
By Dr. Ellie Small
Overview
Topics:
• Diagnostics:
  - Checking Error Variance
  - Checking for Normality of the Errors
  - Checking the Structure of the Model
• Categorical Predictors
• Collinearity
• Transformations:
  - Box-Cox and Logtrans
  - Polynomials
• Missing Data
Checking Error Variance
• To verify constant symmetrical variance, or homoscedasticity (vs. heteroscedasticity, or non-constant variation), we plot the residuals (or standardized residuals) vs. the fitted values, with a horizontal line at residual = 0. The points should be randomly spread around the center line. We look for:
  - "cone shape" – the variance appears to be either increasing or decreasing with increasing fitted values
  - "nonlinearity" – the data points do not seem to be randomly spread around the center line but appear to follow a pattern
The graph can be created manually using plot(fitted(lmod), residuals(lmod)); abline(h=0), or via plot(lmod), where it will be the first plot.
If there are any doubts, a magnified view can be obtained by plotting the square root of the absolute residuals versus the fitted values (the third plot in plot(lmod)).
• Also plot the predictors vs. the residuals using plot(predictor, residuals(lmod)); abline(h=0); these data points should also be randomly spread around the center line.
• Plotting a predictor that was not included in the model vs. the residuals could indicate that the predictor should be included if a relationship appears to exist in the plot.
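The checks above can be collected into a short, self-contained sketch; the data set and model (lmod) here are simulated purely for illustration, and with real data you would substitute your own fit:

```r
# Simulated data and fit, for illustration only
set.seed(1)
d <- data.frame(x1 = runif(50), x2 = runif(50))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(50)
lmod <- lm(y ~ x1 + x2, data = d)

# Residuals vs. fitted values, with the reference line at residual = 0
plot(fitted(lmod), residuals(lmod), xlab = "Fitted", ylab = "Residuals")
abline(h = 0)

# Magnified view: the scale-location plot (the third plot of plot(lmod))
plot(lmod, which = 3)

# Each predictor vs. the residuals
plot(d$x1, residuals(lmod)); abline(h = 0)
plot(d$x2, residuals(lmod)); abline(h = 0)
```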
Checking for Normality of the Errors
• We check for normality using qqnorm(residuals(lmod)) and qqline(residuals(lmod)), or via plot(lmod), where it will be the second plot. For large sample sizes (>30), we are only concerned with a plot that starts below the line and ends above it, indicating a long-tailed distribution. In that case we should run the Shapiro-Wilk test for normality using shapiro.test(residuals(lmod)). If the p-value is large (greater than .05), we can assume normality. If the p-value is small, we need to deal with non-normality of the data.
[Figure: plot of a long-tailed distribution]
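A minimal sketch of this check, again on a simulated fit (lmod is hypothetical example data, not from the lecture):

```r
# Simulated data and fit, for illustration only
set.seed(2)
d <- data.frame(x = runif(40))
d$y <- 1 + 2 * d$x + rnorm(40)
lmod <- lm(y ~ x, data = d)

# Q-Q plot of the residuals with the reference line
qqnorm(residuals(lmod))
qqline(residuals(lmod))

# Shapiro-Wilk test: a large p-value gives no evidence against normality
shapiro.test(residuals(lmod))
```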
Finding Unusual Observations
Studentized Residual i: The residual of case (or subject) i vs. the model that excludes case i.
• Leverages: h = diag(P_X), where P_X is the hat matrix. The ith element of h is the leverage of case (or subject) i; the leverages can be obtained using hatvalues(lmod). For all i, 0 ≤ h_i ≤ 1, with the average equal to p/n. Large values indicate extreme predictor values; report/investigate values that exceed 2p/n. A half-normal plot (available in the faraway package) of h will also show the largest values: halfnorm(hatvalues(lmod)).
• Outliers: Data points for which the magnitude of the residual or the studentized residual is large. Find them using abs(residuals(lmod)) (regular) or abs(rstudent(lmod)) (studentized).
• Influential Observations: Points whose removal from the data set would cause a large change in the fit. These are usually outliers with high leverage and can be found using Cook's statistic via cooks.distance(lmod). Investigate values larger than .5. The fourth and last plot in plot(lmod) also shows data points with a Cook's distance larger than .5.
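The three checks can be sketched together as follows, on a simulated fit (the halfnorm() call is left commented since it needs the faraway package):

```r
# Simulated data and fit, for illustration only
set.seed(3)
d <- data.frame(x1 = runif(50), x2 = runif(50))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(50)
lmod <- lm(y ~ x1 + x2, data = d)

# Leverages: flag cases exceeding 2p/n
h <- hatvalues(lmod)
p <- length(coef(lmod))        # number of model parameters
n <- nrow(d)
h[h > 2 * p / n]
# library(faraway); halfnorm(h) # half-normal plot of the leverages

# Outliers: largest studentized residuals
sort(abs(rstudent(lmod)), decreasing = TRUE)[1:3]

# Influential observations: Cook's distances above .5
cd <- cooks.distance(lmod)
cd[cd > 0.5]
```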
Checking the Structure of the Model
In simple linear regression we can plot the response versus the predictor to check whether the relationship is linear. We can also easily find outliers, high leverage points, and influential points using this plot. In multiple regression we can do this too for each predictor, but we need to account for the other predictors. For this we use the partial regression plot and the partial residual plot.
• Partial Regression Plot: For a partial regression plot of the response Y versus predictor x_j, we first remove the effect of the other predictors from both Y and x_j. We plot the residuals of the regression of Y onto all predictors except x_j versus the residuals of the regression of x_j onto all predictors except x_j. The former we call δ, the latter γ. We then use plot(γ, δ). The slope of this plot equals the jth regression coefficient in the full model.
• Partial Residual Plot: For a partial residual plot of the response Y versus predictor x_j, we remove the predicted effect of all predictors except x_j from Y, then plot these values against x_j. The slope of this plot also equals the jth regression coefficient in the full model. We can obtain the partial residual plot in R using termplot(lmod, partial.resid=T, terms=(j-1)).
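Both plots can be sketched on a simulated two-predictor model; the coefficient match at the end follows from the Frisch-Waugh-Lovell result described above:

```r
# Simulated data and fit, for illustration only
set.seed(4)
d <- data.frame(x1 = runif(60), x2 = runif(60))
d$y <- 1 + 3 * d$x1 - 2 * d$x2 + rnorm(60)
lmod <- lm(y ~ x1 + x2, data = d)

# Partial regression plot for x1:
delta <- residuals(lm(y ~ x2, data = d))   # y with x2's effect removed
gamma <- residuals(lm(x1 ~ x2, data = d))  # x1 with x2's effect removed
plot(gamma, delta)
coef(lm(delta ~ gamma))[2]                 # equals coef(lmod)["x1"]

# Partial residual plot for the first term
termplot(lmod, partial.resid = TRUE, terms = 1)
```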
Categorical Predictors
If a predictor only takes on a limited number of values, it is considered a categorical predictor and needs to be identified as such. If those values are alphanumeric, R will automatically do this for you. If they are numeric, this needs to be done using x_j = factor(x_j). If this is not done, R will treat the variable as an ordinary numeric variable, possibly resulting in an inferior model.
R will create dummy variables, one for each value except the first, and name those dummy variables as the original variable name indexed with the value it represents. The dummy variable for a specific value will equal 1 if the variable equals that value, and it will equal 0 otherwise. The first value (for which there is no dummy variable) will therefore be implied when all the dummy variables equal 0.
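A small sketch of the dummy-variable coding, using a made-up numeric group code:

```r
# A numeric code with three levels, treated as categorical via factor()
d <- data.frame(grp = rep(c(1, 2, 3), times = 4),
                y   = c(1.2, 2.1, 3.3, 0.9, 2.4, 3.0,
                        1.1, 1.9, 3.2, 1.0, 2.2, 3.1))
d$grp <- factor(d$grp)        # without this, grp would be fit as one numeric slope
lmod <- lm(y ~ grp, data = d)
coef(lmod)  # (Intercept) is the grp=1 baseline; grp2 and grp3 are dummy offsets
```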
Collinearity
Exact Collinearity: Some predictors are a linear combination of some of the others.
Collinearity: Some predictors are almost a linear combination of some of the others.
Finding collinearity:
• Eigenvalues: To find out if collinearity exists, we determine the condition number K = sqrt(λ1/λp), where λ1 ≥ … ≥ λp are the ordered eigenvalues of X′X. If K exceeds 30, there is significant collinearity. Use ei=eigen(crossprod(X))$val; K=sqrt(ei[1]/ei) to find K (it is the last value).
• Pairwise collinearity: This can be found from the correlations between the predictors; any correlation close to 1 or -1 needs to be investigated. Use cor(X).
• Variance Inflation Factors (VIFs): VIF_j = 1/(1 - R_j^2), where R_j^2 comes from the regression model that regresses x_j on all other predictors. Investigate when it exceeds 5. In R: vif(X) gives all of them.
We investigate by removing suspect predictors one by one and checking whether this affects the model. If removing one does not appear to make a great difference to the adjusted R^2 of the model, we leave it out.
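The three checks can be sketched on a simulated design matrix in which x2 is nearly a copy of x1 (the vif() call is commented since it needs the faraway package):

```r
# Simulated design matrix with built-in collinearity, for illustration only
set.seed(6)
x1 <- runif(40)
x2 <- x1 + rnorm(40, sd = 0.01)   # nearly identical to x1
x3 <- runif(40)
X <- cbind(x1, x2, x3)

# Condition number from the eigenvalues of X'X; the last value is K
ei <- eigen(crossprod(X))$val
K <- sqrt(ei[1] / ei)
K

# Pairwise correlations between the predictors
cor(X)

# Variance inflation factors:
# library(faraway); vif(X)
```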
Transformations
The Box-Cox Method: Used to determine the best power transformation of the response. It provides the likelihood of observing the transformed response Y given the predictors X for a range of powers, with a power of λ = 0 indicating the log transformation. In R: boxcox(lmod, plotit=T); this function also plots vertical lines indicating the 95% confidence interval.
NOTE: Can be used for strictly positive responses only! But we may add a constant to all responses if small negative responses are present.
NOTE: Box-Cox does not deal very well with outliers; as such, if the power is extreme, it is likely wrong.
Logtrans: Used to determine the optimal α for the transformation log(Y_i + α). It provides the likelihood of observing the transformed response Y given the predictors X for a range of α. In R: logtrans(lmod, plotit=T); this function also plots vertical lines indicating the 95% confidence interval.
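A sketch of the Box-Cox method on a simulated response that is log-normal, so the likelihood interval should cover λ = 0; boxcox() and logtrans() are both in the MASS package, and the logtrans() line is left commented as an outline:

```r
library(MASS)   # provides boxcox() and logtrans()

# Simulated log-scale response, for illustration only
set.seed(7)
d <- data.frame(x = runif(60, 1, 5))
d$y <- exp(0.5 + 0.3 * d$x + rnorm(60, sd = 0.2))
lmod <- lm(y ~ x, data = d)

boxcox(lmod, plotit = TRUE)   # profile likelihood over powers, with 95% CI lines
# logtrans(lmod, plotit = TRUE)   # likelihood over alpha in log(y + alpha)
```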
Transformations - 2
Polynomials: When there appears to be a nonlinear relationship between the residuals and the fitted values, there is likely a nonlinear relationship between the residuals and at least one of the predictors. If this situation is found for one of the predictors, we create polynomials for that predictor, adding additional predictors equal to the square of the predictor, the cube, and so on. We go as far as the last power that is significant in the model.
NOTE: All powers less than the last significant power must be included, even if they themselves are not significant.
NOTE: The function poly(x_j, m) creates orthogonal polynomials for predictor x_j up to the mth power.
NOTE: Orthogonal polynomials in 2 predictors x_j and x_k can be created using polym(x_j, x_k, degree=2).
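A minimal sketch of poly() on simulated quadratic data; the fitted model automatically includes all powers up to the one requested:

```r
# Simulated quadratic relationship, for illustration only
set.seed(8)
d <- data.frame(x = runif(80, -2, 2))
d$y <- 1 + d$x - 2 * d$x^2 + rnorm(80, sd = 0.5)

# Orthogonal polynomials up to the 2nd power (powers 1 and 2 both included)
lmod <- lm(y ~ poly(x, 2), data = d)
summary(lmod)

# For two predictors: lm(y ~ polym(x1, x2, degree = 2), data = ...)
```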
Missing Data
Missing values in R should be indicated by NA. Make sure to change the data appropriately if some other method was used. In R, summary(data) will indicate whether there are missing values for each predictor, and if so, how many. For linear regression, deleting cases with missing data is the default in R. Other options are:
1. Single imputation: Replace missing values in predictors with either the mean of the predictor, or with the predicted value of a regression of the predictor on the other predictors. For the latter, make sure to include all predictors in the regression, even those that you will not include in your final model.
Regression method for predictor x_j: lmod=lm(x_j~X, data); data2[is.na(data[,j]),j]=predict(lmod, data[is.na(data[,j]),])
Note that data2 must start off equal to data and will be the data set with the imputed values in the end.
2. Multiple Imputation: Use the regression method above to determine the missing values but add a random error to each. Do this multiple times and take the aggregate of the regression coefficients and standard errors over all of them.
In R, the package Amelia may be used to do multiple imputation, where a=amelia(data, m) creates m data sets like the original one, but with the missing values replaced using regression (the regression method of single imputation) plus a random error. The function outputs a$m, and a$imputations[[i]] as the data sets with imputed values for i going from 1 to m. A regression can then be performed on each data set using a loop, and the regression coefficients and standard errors saved as rows in matrices. The function agg=mi.meld(q=betas, se=st.errors) will then aggregate the matrices, resulting in the vectors agg$q.mi and agg$se.mi, containing the aggregated regression coefficients and aggregated standard errors, respectively.
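The single-imputation recipe can be sketched on a small simulated data frame with missing values; the Amelia portion is left as a commented outline since it requires the Amelia package:

```r
# Simulated data frame with NAs in column j = 1 (x1), for illustration only
set.seed(9)
data <- data.frame(x1 = runif(30), x2 = runif(30))
data$y <- 1 + data$x1 + data$x2 + rnorm(30, sd = 0.1)
data$x1[c(3, 17)] <- NA

# 1. Single imputation by regression for column j
j <- 1
data2 <- data                           # copy that receives the imputed values
miss  <- is.na(data[, j])
lmod  <- lm(x1 ~ x2 + y, data = data)   # regress x1 on all other variables
data2[miss, j] <- predict(lmod, newdata = data[miss, , drop = FALSE])
sum(is.na(data2$x1))                    # all missing values filled in

# 2. Multiple imputation with Amelia (outline):
# library(Amelia)
# a <- amelia(data, m = 5)
# betas <- st.errors <- NULL
# for (i in 1:a$m) {
#   fit       <- lm(y ~ x1 + x2, data = a$imputations[[i]])
#   betas     <- rbind(betas, coef(fit))
#   st.errors <- rbind(st.errors, coef(summary(fit))[, "Std. Error"])
# }
# agg <- mi.meld(q = betas, se = st.errors)  # agg$q.mi, agg$se.mi
```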
11

More Related Content

What's hot

Regression analysis algorithm
Regression analysis algorithm Regression analysis algorithm
Regression analysis algorithm Sammer Qader
ย 
Regression analysis in excel
Regression analysis in excelRegression analysis in excel
Regression analysis in excelThilina Rathnayaka
ย 
Lasso and ridge regression
Lasso and ridge regressionLasso and ridge regression
Lasso and ridge regressionSreerajVA
ย 
Functional Forms of Regression Models | Eonomics
Functional Forms of Regression Models | EonomicsFunctional Forms of Regression Models | Eonomics
Functional Forms of Regression Models | EonomicsTransweb Global Inc
ย 
Machine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionMachine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionSiddharth Shrivastava
ย 
Regression Analysis
Regression AnalysisRegression Analysis
Regression AnalysisMuhammad Fazeel
ย 
Diagnostic methods for Building the regression model
Diagnostic methods for Building the regression modelDiagnostic methods for Building the regression model
Diagnostic methods for Building the regression modelMehdi Shayegani
ย 
Statistics-Regression analysis
Statistics-Regression analysisStatistics-Regression analysis
Statistics-Regression analysisRabin BK
ย 
Lecture 1 maximum likelihood
Lecture 1 maximum likelihoodLecture 1 maximum likelihood
Lecture 1 maximum likelihoodAnant Dashpute
ย 
Linear regression
Linear regression Linear regression
Linear regression Vani011
ย 
Logistic regression
Logistic regressionLogistic regression
Logistic regressionYashwantGahlot1
ย 
Simple linear regression analysis
Simple linear  regression analysisSimple linear  regression analysis
Simple linear regression analysisNorma Mingo
ย 
Ridge regression
Ridge regressionRidge regression
Ridge regressionAnanda Swarup
ย 
Tbs910 regression models
Tbs910 regression modelsTbs910 regression models
Tbs910 regression modelsStephen Ong
ย 
Introduction to Maximum Likelihood Estimator
Introduction to Maximum Likelihood EstimatorIntroduction to Maximum Likelihood Estimator
Introduction to Maximum Likelihood EstimatorAmir Al-Ansary
ย 
Visual Explanation of Ridge Regression and LASSO
Visual Explanation of Ridge Regression and LASSOVisual Explanation of Ridge Regression and LASSO
Visual Explanation of Ridge Regression and LASSOKazuki Yoshida
ย 
Notes Ch8
Notes Ch8Notes Ch8
Notes Ch8jmalpass
ย 
Contingency Table Test, M. Asad Hayat, UET Taxila
Contingency Table Test, M. Asad Hayat, UET TaxilaContingency Table Test, M. Asad Hayat, UET Taxila
Contingency Table Test, M. Asad Hayat, UET TaxilaMuhammad Warraich
ย 
Estimation theory 1
Estimation theory 1Estimation theory 1
Estimation theory 1Gopi Saiteja
ย 

What's hot (20)

Regression analysis algorithm
Regression analysis algorithm Regression analysis algorithm
Regression analysis algorithm
ย 
Regression analysis in excel
Regression analysis in excelRegression analysis in excel
Regression analysis in excel
ย 
Lasso and ridge regression
Lasso and ridge regressionLasso and ridge regression
Lasso and ridge regression
ย 
Functional Forms of Regression Models | Eonomics
Functional Forms of Regression Models | EonomicsFunctional Forms of Regression Models | Eonomics
Functional Forms of Regression Models | Eonomics
ย 
Machine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionMachine Learning - Simple Linear Regression
Machine Learning - Simple Linear Regression
ย 
Regression Analysis
Regression AnalysisRegression Analysis
Regression Analysis
ย 
Diagnostic methods for Building the regression model
Diagnostic methods for Building the regression modelDiagnostic methods for Building the regression model
Diagnostic methods for Building the regression model
ย 
Statistics-Regression analysis
Statistics-Regression analysisStatistics-Regression analysis
Statistics-Regression analysis
ย 
Lecture 1 maximum likelihood
Lecture 1 maximum likelihoodLecture 1 maximum likelihood
Lecture 1 maximum likelihood
ย 
Linear regression
Linear regression Linear regression
Linear regression
ย 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
ย 
Simple linear regression analysis
Simple linear  regression analysisSimple linear  regression analysis
Simple linear regression analysis
ย 
Ridge regression
Ridge regressionRidge regression
Ridge regression
ย 
Mle
MleMle
Mle
ย 
Tbs910 regression models
Tbs910 regression modelsTbs910 regression models
Tbs910 regression models
ย 
Introduction to Maximum Likelihood Estimator
Introduction to Maximum Likelihood EstimatorIntroduction to Maximum Likelihood Estimator
Introduction to Maximum Likelihood Estimator
ย 
Visual Explanation of Ridge Regression and LASSO
Visual Explanation of Ridge Regression and LASSOVisual Explanation of Ridge Regression and LASSO
Visual Explanation of Ridge Regression and LASSO
ย 
Notes Ch8
Notes Ch8Notes Ch8
Notes Ch8
ย 
Contingency Table Test, M. Asad Hayat, UET Taxila
Contingency Table Test, M. Asad Hayat, UET TaxilaContingency Table Test, M. Asad Hayat, UET Taxila
Contingency Table Test, M. Asad Hayat, UET Taxila
ย 
Estimation theory 1
Estimation theory 1Estimation theory 1
Estimation theory 1
ย 

Similar to 2. diagnostics, collinearity, transformation, and missing data

Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)Abhimanyu Dwivedi
ย 
FSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docx
FSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docxFSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docx
FSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docxbudbarber38650
ย 
Simple Linear Regression.pptx
Simple Linear Regression.pptxSimple Linear Regression.pptx
Simple Linear Regression.pptxAbdalrahmanTahaJaya
ย 
Pampers CaseIn an increasingly competitive diaper market, P&Gโ€™.docx
Pampers CaseIn an increasingly competitive diaper market, P&Gโ€™.docxPampers CaseIn an increasingly competitive diaper market, P&Gโ€™.docx
Pampers CaseIn an increasingly competitive diaper market, P&Gโ€™.docxbunyansaturnina
ย 
Advanced Econometrics L5-6.pptx
Advanced Econometrics L5-6.pptxAdvanced Econometrics L5-6.pptx
Advanced Econometrics L5-6.pptxakashayosha
ย 
Statistical parameters
Statistical parametersStatistical parameters
Statistical parametersBurdwan University
ย 
Multiple regression
Multiple regressionMultiple regression
Multiple regressionAntoine De Henau
ย 
Linear Regression
Linear Regression Linear Regression
Linear Regression Rupak Roy
ย 
Supervised Learning.pdf
Supervised Learning.pdfSupervised Learning.pdf
Supervised Learning.pdfgadissaassefa
ย 
Get Multiple Regression Assignment Help
Get Multiple Regression Assignment Help Get Multiple Regression Assignment Help
Get Multiple Regression Assignment Help HelpWithAssignment.com
ย 
Multiple linear regression
Multiple linear regressionMultiple linear regression
Multiple linear regressionAvjinder (Avi) Kaler
ย 
chap4_Parametric_Methods.ppt
chap4_Parametric_Methods.pptchap4_Parametric_Methods.ppt
chap4_Parametric_Methods.pptShayanChowdary
ย 
Introduction to correlation and regression analysis
Introduction to correlation and regression analysisIntroduction to correlation and regression analysis
Introduction to correlation and regression analysisFarzad Javidanrad
ย 

Similar to 2. diagnostics, collinearity, transformation, and missing data (20)

1. linear model, inference, prediction
1. linear model, inference, prediction1. linear model, inference, prediction
1. linear model, inference, prediction
ย 
Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)
ย 
FSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docx
FSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docxFSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docx
FSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docx
ย 
Simple egression.pptx
Simple egression.pptxSimple egression.pptx
Simple egression.pptx
ย 
Simple Linear Regression.pptx
Simple Linear Regression.pptxSimple Linear Regression.pptx
Simple Linear Regression.pptx
ย 
Simple lin regress_inference
Simple lin regress_inferenceSimple lin regress_inference
Simple lin regress_inference
ย 
Pampers CaseIn an increasingly competitive diaper market, P&Gโ€™.docx
Pampers CaseIn an increasingly competitive diaper market, P&Gโ€™.docxPampers CaseIn an increasingly competitive diaper market, P&Gโ€™.docx
Pampers CaseIn an increasingly competitive diaper market, P&Gโ€™.docx
ย 
Advanced Econometrics L5-6.pptx
Advanced Econometrics L5-6.pptxAdvanced Econometrics L5-6.pptx
Advanced Econometrics L5-6.pptx
ย 
Statistical parameters
Statistical parametersStatistical parameters
Statistical parameters
ย 
Multiple regression
Multiple regressionMultiple regression
Multiple regression
ย 
Linear Regression
Linear Regression Linear Regression
Linear Regression
ย 
Supervised Learning.pdf
Supervised Learning.pdfSupervised Learning.pdf
Supervised Learning.pdf
ย 
Get Multiple Regression Assignment Help
Get Multiple Regression Assignment Help Get Multiple Regression Assignment Help
Get Multiple Regression Assignment Help
ย 
rugs koco.pptx
rugs koco.pptxrugs koco.pptx
rugs koco.pptx
ย 
Multiple linear regression
Multiple linear regressionMultiple linear regression
Multiple linear regression
ย 
Simple linear regression
Simple linear regressionSimple linear regression
Simple linear regression
ย 
Correlation
CorrelationCorrelation
Correlation
ย 
JISA_Paper
JISA_PaperJISA_Paper
JISA_Paper
ย 
chap4_Parametric_Methods.ppt
chap4_Parametric_Methods.pptchap4_Parametric_Methods.ppt
chap4_Parametric_Methods.ppt
ย 
Introduction to correlation and regression analysis
Introduction to correlation and regression analysisIntroduction to correlation and regression analysis
Introduction to correlation and regression analysis
ย 

Recently uploaded

Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknowmakika9823
ย 
ๅฎšๅˆถ่‹ฑๅ›ฝ็™ฝ้‡‘ๆฑ‰ๅคงๅญฆๆฏ•ไธš่ฏ๏ผˆUCBๆฏ•ไธš่ฏไนฆ๏ผ‰ ๆˆ็ปฉๅ•ๅŽŸ็‰ˆไธ€ๆฏ”ไธ€
ๅฎšๅˆถ่‹ฑๅ›ฝ็™ฝ้‡‘ๆฑ‰ๅคงๅญฆๆฏ•ไธš่ฏ๏ผˆUCBๆฏ•ไธš่ฏไนฆ๏ผ‰																			ๆˆ็ปฉๅ•ๅŽŸ็‰ˆไธ€ๆฏ”ไธ€ๅฎšๅˆถ่‹ฑๅ›ฝ็™ฝ้‡‘ๆฑ‰ๅคงๅญฆๆฏ•ไธš่ฏ๏ผˆUCBๆฏ•ไธš่ฏไนฆ๏ผ‰																			ๆˆ็ปฉๅ•ๅŽŸ็‰ˆไธ€ๆฏ”ไธ€
ๅฎšๅˆถ่‹ฑๅ›ฝ็™ฝ้‡‘ๆฑ‰ๅคงๅญฆๆฏ•ไธš่ฏ๏ผˆUCBๆฏ•ไธš่ฏไนฆ๏ผ‰ ๆˆ็ปฉๅ•ๅŽŸ็‰ˆไธ€ๆฏ”ไธ€ffjhghh
ย 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
ย 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
ย 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
ย 
From idea to production in a day โ€“ Leveraging Azure ML and Streamlit to build...
From idea to production in a day โ€“ Leveraging Azure ML and Streamlit to build...From idea to production in a day โ€“ Leveraging Azure ML and Streamlit to build...
From idea to production in a day โ€“ Leveraging Azure ML and Streamlit to build...Florian Roscheck
ย 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
ย 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
ย 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
ย 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
ย 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
ย 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
ย 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionBoston Institute of Analytics
ย 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
ย 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
ย 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
ย 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
ย 
๊งโค Greater Noida Call Girls Delhi โค๊ง‚ 9711199171 โ˜Ž๏ธ Hard And Sexy Vip Call
๊งโค Greater Noida Call Girls Delhi โค๊ง‚ 9711199171 โ˜Ž๏ธ Hard And Sexy Vip Call๊งโค Greater Noida Call Girls Delhi โค๊ง‚ 9711199171 โ˜Ž๏ธ Hard And Sexy Vip Call
๊งโค Greater Noida Call Girls Delhi โค๊ง‚ 9711199171 โ˜Ž๏ธ Hard And Sexy Vip Callshivangimorya083
ย 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
ย 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
ย 

Recently uploaded (20)

Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
ย 
ๅฎšๅˆถ่‹ฑๅ›ฝ็™ฝ้‡‘ๆฑ‰ๅคงๅญฆๆฏ•ไธš่ฏ๏ผˆUCBๆฏ•ไธš่ฏไนฆ๏ผ‰ ๆˆ็ปฉๅ•ๅŽŸ็‰ˆไธ€ๆฏ”ไธ€
ๅฎšๅˆถ่‹ฑๅ›ฝ็™ฝ้‡‘ๆฑ‰ๅคงๅญฆๆฏ•ไธš่ฏ๏ผˆUCBๆฏ•ไธš่ฏไนฆ๏ผ‰																			ๆˆ็ปฉๅ•ๅŽŸ็‰ˆไธ€ๆฏ”ไธ€ๅฎšๅˆถ่‹ฑๅ›ฝ็™ฝ้‡‘ๆฑ‰ๅคงๅญฆๆฏ•ไธš่ฏ๏ผˆUCBๆฏ•ไธš่ฏไนฆ๏ผ‰																			ๆˆ็ปฉๅ•ๅŽŸ็‰ˆไธ€ๆฏ”ไธ€
ๅฎšๅˆถ่‹ฑๅ›ฝ็™ฝ้‡‘ๆฑ‰ๅคงๅญฆๆฏ•ไธš่ฏ๏ผˆUCBๆฏ•ไธš่ฏไนฆ๏ผ‰ ๆˆ็ปฉๅ•ๅŽŸ็‰ˆไธ€ๆฏ”ไธ€
ย 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
ย 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
ย 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
ย 
From idea to production in a day โ€“ Leveraging Azure ML and Streamlit to build...
From idea to production in a day โ€“ Leveraging Azure ML and Streamlit to build...From idea to production in a day โ€“ Leveraging Azure ML and Streamlit to build...
From idea to production in a day โ€“ Leveraging Azure ML and Streamlit to build...
ย 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
ย 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
ย 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
ย 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
ย 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
ย 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
ย 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
ย 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
ย 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
ย 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
ย 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
ย 
๊งโค Greater Noida Call Girls Delhi โค๊ง‚ 9711199171 โ˜Ž๏ธ Hard And Sexy Vip Call
๊งโค Greater Noida Call Girls Delhi โค๊ง‚ 9711199171 โ˜Ž๏ธ Hard And Sexy Vip Call๊งโค Greater Noida Call Girls Delhi โค๊ง‚ 9711199171 โ˜Ž๏ธ Hard And Sexy Vip Call
๊งโค Greater Noida Call Girls Delhi โค๊ง‚ 9711199171 โ˜Ž๏ธ Hard And Sexy Vip Call
ย 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
ย 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
ย 

2. diagnostics, collinearity, transformation, and missing data

  • 1. DATA 503 โ€“ Applied Regression Analysis Lecture 14: Diagnostics, Collinearity, Transformation, and Missing Data Highlights By Dr. Ellie Small
  • 2. Overview Topics: โ€ข Diagnostics: โ‚‹ Checking Error Variance โ‚‹ Checking for Normality of the Errors โ‚‹ Checking the Structure of the Model โ€ข Categorical Predictors โ€ข Collinearity โ€ข Transformations: โ‚‹ Box-Cox and Logtrans โ‚‹ Polynomials โ€ข Missing Data 2
  • 3. Checking Error Variance
    • To verify constant symmetrical variance, or homoscedasticity (vs. heteroscedasticity, or non-constant
      variation), we plot the residuals (or standardized residuals) vs. the fitted values, with a horizontal line at
      residual = 0. The points should be randomly spread around the center line. We look for:
      ₋ “cone shape” – the variance appears to be either increasing or decreasing with increasing fitted values
      ₋ “nonlinearity” – the data points do not seem to be randomly spread around the center line but appear
        to follow a pattern
      The graph can be created manually using plot(fitted(lmod), residuals(lmod)); abline(h=0), or via
      plot(lmod), where it will be the first plot.
      If there are any doubts, a magnified view can be obtained by plotting the square root of the absolute
      residuals versus the fitted values (the third plot in plot(lmod)).
    • Also plot the predictors vs. the residuals using plot(predictor, residuals(lmod)); abline(h=0); these data
      points should also be randomly spread around the center line.
    • Plotting a predictor that was not included in the model vs. the residuals could indicate that the predictor
      should be included, if a relationship appears to exist in the plot.
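A minimal sketch of these diagnostic plots, using the built-in mtcars data (the model mpg ~ wt + hp is illustrative, not from the course):

```r
# Fit a small illustrative model on the built-in mtcars data
lmod <- lm(mpg ~ wt + hp, data = mtcars)

# Residuals vs. fitted values, with the center line at residual = 0
plot(fitted(lmod), residuals(lmod), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0)

# Magnified view: square root of the absolute residuals vs. fitted values
plot(fitted(lmod), sqrt(abs(residuals(lmod))), xlab = "Fitted values")

# One predictor vs. the residuals
plot(mtcars$wt, residuals(lmod), xlab = "wt", ylab = "Residuals")
abline(h = 0)
```

A cone shape or a curved band in the first plot would suggest heteroscedasticity or nonlinearity, respectively.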
  • 4. Checking for Normality of the Errors
    • We check for normality using qqnorm(residuals(lmod)) and qqline(residuals(lmod)), or via plot(lmod),
      where it will be the second plot. For large sample sizes (>30), we are only concerned with a plot that
      starts below the line and ends above it, indicating a long-tailed distribution. In that case we should run
      Shapiro’s test for normality using shapiro.test(residuals(lmod)). If the p-value is large (greater than .05),
      we can assume normality. If the p-value is small, we need to deal with non-normality of the data.
      [Plot of a long-tailed distribution]
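A sketch of the normality check, again with an illustrative mtcars model:

```r
# Q-Q plot of the residuals, with the reference line
lmod <- lm(mpg ~ wt + hp, data = mtcars)
qqnorm(residuals(lmod))
qqline(residuals(lmod))

# Shapiro-Wilk test: a large p-value gives no evidence against normality
shapiro.test(residuals(lmod))
```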
  • 5. Finding Unusual Observations
    Studentized residual i: the residual of case (or subject) i vs. the model that excludes case i.
    Unusual observations:
    • Leverages: h = diag(P_X), where P_X is the hat matrix. The ith element of h is the leverage of case (or
      subject) i; the leverages can be obtained using hatvalues(lmod). For all i, 0 ≤ h_i ≤ 1, with the average
      equal to p/n. Large values indicate extreme predictor values; report/investigate values that exceed 2p/n.
      A half-normal plot (available in the faraway package) of h will also show the largest values:
      halfnorm(hatvalues(lmod)).
    • Outliers: Data points for which the magnitude of the residual or the studentized residual is large. Find
      them using abs(residuals(lmod)) (regular) or abs(rstudent(lmod)) (studentized).
    • Influential observations: Points whose removal from the data set would cause a large change in the fit.
      These are usually outliers with high leverage and can be found using Cook’s statistic via
      cooks.distance(lmod). Investigate values larger than .5. The fourth and last plot in plot(lmod) also shows
      data points with a Cook’s distance larger than .5.
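The three checks can be sketched as follows on an illustrative mtcars model:

```r
# Leverages, studentized residuals, and Cook's distances
lmod <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
p <- length(coef(lmod))

h <- hatvalues(lmod)
which(h > 2 * p / n)                 # high-leverage cases (exceed 2p/n)
# faraway::halfnorm(h)               # half-normal plot, if faraway is installed

head(sort(abs(rstudent(lmod)), decreasing = TRUE))   # largest studentized residuals

d <- cooks.distance(lmod)
which(d > 0.5)                       # influential cases by the .5 rule of thumb
```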
  • 6. Checking the Structure of the Model
    In simple linear regression we can plot the response versus the predictor to check whether the relationship
    is linear. We can also easily find outliers, high-leverage points, and influential points using this plot. In
    multiple regression we can do this too for each predictor, but we need to account for the other predictors.
    To do so, we use the partial regression plot and the partial residual plot.
    • Partial regression plot: For a partial regression plot of the response Y versus predictor x_j, we first
      remove the effect of the other predictors from both Y and x_j. We plot the residuals of the regression of
      Y onto all predictors except x_j versus the residuals of the regression of x_j onto all predictors except x_j.
      The former we call δ, the latter γ. We then use plot(γ, δ), with δ on the vertical axis. The slope of this
      plot equals the jth regression coefficient in the full model.
    • Partial residual plot: For a partial residual plot of the response Y versus predictor x_j, we remove the
      predicted effect of all predictors except x_j from Y, then plot these values against x_j. The slope of this
      plot also equals the jth regression coefficient in the full model. We can obtain the partial residual plot in
      R using termplot(lmod, partial.resid=T, terms=(j-1)).
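As a sketch, the partial regression plot for wt in the illustrative model mpg ~ wt + hp, together with the partial residual plots:

```r
# Partial regression (added-variable) plot for wt
lmod  <- lm(mpg ~ wt + hp, data = mtcars)
delta <- residuals(lm(mpg ~ hp, data = mtcars))  # response with hp's effect removed
gamma <- residuals(lm(wt  ~ hp, data = mtcars))  # wt with hp's effect removed
plot(gamma, delta)
coef(lm(delta ~ gamma))[2]   # equals coef(lmod)["wt"], the full-model coefficient

# Partial residual plots for all terms
termplot(lmod, partial.resid = TRUE)
```

The match between the two coefficients is exactly the property the slide states.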
  • 7. Categorical Predictors
    If a predictor only takes on a limited number of values, it is considered a categorical predictor and needs
    to be identified as such. If those values are alphanumeric, R will automatically do this for you. If they are
    numeric, this needs to be done using x_j = factor(x_j). If this is not done, R will treat the variable as an
    ordinary numeric variable, possibly resulting in an inferior model.
    R will create dummy variables, one for each value except the first, and name those dummy variables as
    the original variable name indexed with the value it represents. The dummy variable for a specific value
    will equal 1 if the variable equals that value, and it will equal 0 otherwise. The first value (for which there
    is no dummy variable) will therefore be implied when all the dummy variables equal 0.
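For example, cyl in mtcars takes only the values 4, 6, and 8, so it should be converted before fitting:

```r
# Convert the numeric cyl to a factor so R builds dummy variables
d <- mtcars
d$cyl <- factor(d$cyl)

lmod <- lm(mpg ~ wt + cyl, data = d)
names(coef(lmod))   # dummies appear as "cyl6" and "cyl8"; cyl == 4 is the baseline
```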
  • 8. Collinearity
    Exact collinearity: some predictors are a linear combination of some of the others.
    Collinearity: some predictors are almost a linear combination of some of the others.
    Finding collinearity:
    • Eigenvalues: To find out if collinearity exists, we determine the condition number K = √(λ_1/λ_p), where
      λ_1 ≥ ⋯ ≥ λ_p are the ordered eigenvalues of X′X. If K exceeds 30, there is significant collinearity. Use
      ei=eigen(crossprod(X))$val; K=sqrt(ei[1]/ei) to find the condition numbers (K is the last value).
    • Pairwise collinearity: This can be found from the correlations between the predictors; any correlation
      close to 1 or -1 needs to be investigated. Use cor(X).
    • Variance inflation factors (VIFs): VIF_j = 1/(1 − R_j²), where R_j² comes from the regression model that
      regresses x_j on all other predictors. Investigate when it exceeds 5. In R, vif(X) gives all of them.
    We investigate by removing suspect predictors one by one and seeing whether that affects the model. If
    removing one does not appear to make a great difference to the adjusted R² for the model, we leave it out.
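A sketch of all three checks on three mtcars predictors; the VIFs are computed by hand from their definition, so no extra package is needed (faraway's vif(X) would give the same values):

```r
# Predictor matrix without the intercept column
X <- model.matrix(~ wt + hp + disp, data = mtcars)[, -1]

# Condition numbers from the eigenvalues of X'X; the last value is K
ei <- eigen(crossprod(X))$val
sqrt(ei[1] / ei)

# Pairwise correlations between the predictors
cor(X)

# VIF_j = 1 / (1 - R_j^2), regressing each predictor on the others
sapply(colnames(X), function(j) {
  r2 <- summary(lm(X[, j] ~ X[, colnames(X) != j]))$r.squared
  1 / (1 - r2)
})
```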
  • 9. Transformations
    The Box-Cox method: Used to determine the best power transformation of the response. It provides the
    likelihood of observing the transformed response Y given the predictors X for a range of powers, with a
    power of λ = 0 indicating the log transformation. In R: boxcox(lmod, plotit=T); this function also plots
    vertical lines indicating the 95% confidence interval.
    NOTE: Can be used for strictly positive responses only! But we may add a constant to all responses if small
    negative responses are present.
    NOTE: Box-Cox does not deal very well with outliers; as such, if the power is extreme, it is likely wrong.
    Logtrans: Used to determine the optimal α for the transformation log(Y_i + α). It provides the likelihood of
    observing the transformed response Y given the predictors X for a range of α. In R: logtrans(lmod,
    plotit=T); this function also plots vertical lines indicating the 95% confidence interval.
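Both functions live in the MASS package. A sketch on a strictly positive response (mpg in mtcars); the alpha grid passed to logtrans is an arbitrary illustrative choice:

```r
library(MASS)  # provides boxcox() and logtrans()

lmod <- lm(mpg ~ wt + hp, data = mtcars)   # mpg is strictly positive

# Profile log-likelihood over lambda, with 95% confidence-interval lines
bc <- boxcox(lmod, plotit = TRUE)
bc$x[which.max(bc$y)]                      # lambda with the highest likelihood

# Profile log-likelihood over alpha for the transformation log(Y + alpha)
logtrans(lmod, plotit = TRUE, alpha = seq(0.5, 40, length.out = 100))
```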
  • 10. Transformations - 2
    Polynomials: When there appears to be a nonlinear relationship between the residuals and the fitted
    values, there is likely a nonlinear relationship between the residuals and at least one of the predictors. If
    this situation is found for one of the predictors, we create polynomials for that predictor, adding
    additional predictors equal to the square of the predictor, the cube, etc. We go as far as the last power
    that is significant in the model.
    NOTE: All powers less than the last significant power must be included, even if they themselves are not
    significant.
    NOTE: The function poly(x_j, m) creates orthogonal polynomials for predictor x_j up to the mth power.
    NOTE: Orthogonal polynomials in 2 predictors x_j and x_k can be created using polym(x_j, x_k, degree=2).
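As a sketch, a quadratic in wt fitted with raw powers and with orthogonal polynomials (illustrative models on mtcars):

```r
# Raw powers: wt must stay in the model even if only wt^2 is significant
lm(mpg ~ wt + I(wt^2), data = mtcars)

# Orthogonal polynomial in wt up to the 2nd power
lm(mpg ~ poly(wt, 2), data = mtcars)

# Orthogonal polynomials in two predictors, wt and hp, up to degree 2
lm(mpg ~ polym(wt, hp, degree = 2), data = mtcars)
```

Both parameterizations give the same fitted values; orthogonal polynomials merely avoid the collinearity between a predictor and its powers.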
  • 11. Missing Data
    Missing values in R should be indicated by NA. Make sure to change the data appropriately if some other
    method was used. In R, summary(data) will indicate whether there are missing values for each predictor,
    and if so, how many.
    For linear regression, deleting cases with missing data is the default in R. Other options are:
    1. Single imputation: Replace missing values in a predictor with either the mean of that predictor or with
       the predicted value of a regression of that predictor on the other predictors. For the latter, make sure
       to include all predictors in the regression, even those that you will not include in your final model.
       Regression method for predictor x_j:
       lmod=lm(x_j~X, data); data2[is.na(data[,j]),j]=predict(lmod, data[is.na(data[,j]),])
       Note that data2 must start off equal to data; it will be the data set with the imputed values in the end.
    2. Multiple imputation: Use the regression method above to determine the missing values, but add a
       random error to each. Do this multiple times and take the aggregate of the regression coefficients and
       standard errors over all of them.
       In R, the package Amelia may be used to do multiple imputation, where a=amelia(data, m) creates m
       data sets like the original one, but with the missing values replaced using regression (the regression
       method of single imputation) plus a random error. The function outputs a$m (the number of
       imputations) and a$imputations[[i]] as the data sets with imputed values, for i going from 1 to m. A
       regression can then be performed on each data set using a loop, and the regression coefficients and
       standard errors saved as rows in matrices. The function agg=mi.meld(q=betas, se=st.errors) will then
       aggregate the matrices, resulting in the vectors agg$q.mi and agg$se.mi, containing the aggregated
       regression coefficients and aggregated standard errors, respectively.
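The multiple-imputation loop can be sketched as follows, using the built-in airquality data (Ozone and Solar.R contain NAs); this assumes the Amelia package is installed, and the model is illustrative:

```r
library(Amelia)

# Numeric columns only; Ozone and Solar.R have missing values
df <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]
a  <- amelia(df, m = 5)          # 5 imputed data sets in a$imputations

# Fit the same regression on each imputed data set, stacking the
# coefficients and standard errors as rows of two matrices
betas <- st.errors <- NULL
for (i in 1:a$m) {
  fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = a$imputations[[i]])
  s   <- summary(fit)$coefficients
  betas     <- rbind(betas, s[, "Estimate"])
  st.errors <- rbind(st.errors, s[, "Std. Error"])
}

# Aggregate across the imputations
agg <- mi.meld(q = betas, se = st.errors)
agg$q.mi    # aggregated regression coefficients
agg$se.mi   # aggregated standard errors
```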