Reif Regression Diagnostics I and II

Lectures prepared for sections of Political Science 699 Winter 2010, when I served as a Graduate Student Instructor (TA) for this course, taught by Rob Franzese, at the University of Michigan. The last part of the last slide was cut off and I have not fixed it.

Presentation Transcript

    • PS 699 Section, March 18, 2010
      Megan Reif, Graduate Student Instructor, Political Science
      Professor Rob Franzese, University of Michigan
      Regression Diagnostics for Extreme Values (also known as extreme value diagnostics, influence diagnostics, leverage diagnostics, or case diagnostics)
    • Review of the (often iterative) modeling process: THEORY FORMULATION & MODEL SPECIFICATION → DATA (measure, sample, collect, clean) → MODEL ESTIMATION & INFERENCE → POST-ESTIMATION ANALYSIS / CRITICAL ASSESSMENT OF ASSUMPTIONS.
      – Theory stage: think about the data-generating process and assumptions; include your prior information about the population distribution and variance.
      – Data stage: Exploratory Data Analysis (EDA) of the empirical distribution (center, spread, skewness, tail length, outliers); uni- and bivariate; numeric and graphic, formal and informal. But don't start dropping observations at this stage!
      – Estimation stage: EDA helps identify obvious violations of CLRM; address trade-offs between corrections.
      – Post-estimation: numeric and graphic, formal and informal diagnostics of influence, normality, collinearity, and non-sphericality. Treat outliers as INFO, not NUISANCE: explain them, don't hide them.
      (c) Megan Reif
    • I. Pre-Modeling Exploratory Data Analysis (EDA) (Review/Checklist)
      – Not to be confused with data mining; arrive at the data with your theory in hand.
      – Because multivariate analysis builds on uni- and bivariate analysis, begin with univariate analysis and follow with bivariate analysis before proceeding.
      – These notes assume knowledge of producing descriptive statistics, but provide basic commands and output as a sort of checklist.
      – Don't forget to start with Stata's "describe", "summarize", "codebook", and "inspect" commands to understand (a) how the variables are labeled and coded, (b) basic distributions, and (c) how much missing data there are for each variable.
      – To think about the possible effect of missing data on your model, use the "list if" command:
          list yvar xvar1 xvar2 xvar3 if yvar==.
          list yvar xvar1 xvar2 xvar3 if xvar1==.
        and so on.
      – Recode and label your variables for easier interpretation before proceeding, particularly the unique id variable (such as country-year, individual 1-n, etc.) for easy labeling of points (choose a short name).
    • I.A Exploratory Data Analysis (EDA): Univariate & Bivariate Analysis
      1. Summarize basic univariate and bivariate distributions for theoretical model variables to assess data structure:
         1. Location (mean, median)
         2. Spread (range, variance, quartiles)
         3. Genuine skewness vs. outliers
      The most efficient way to obtain this information is to use Stata's "tabstat" command with the statistics you desire for your model variables, and then inspect:
      – Histograms (do not forget to explore different bin sizes, between 5 and 20 bins, since histogram distributions are sensitive to bin size)
      – Boxplots
      – Matrix scatterplots
    • Univariate Outliers
      – Distinguish between GENUINE skewness in the population distribution (and subsequently the empirical distribution), as opposed to unusual behavior (outliers) in one of the tails. Your theory about the population may guide you on this.
      – Do not leave univariate outliers out of your model, or model them explicitly based on descriptive statistics, until you have done post-estimation diagnostics to determine whether they are also MULTIVARIATE outliers (or correct them if they are due to obvious typos or missing-data/non-response codes like "999").
      – A UNIVARIATE outlier is a data point which is distant from the main body of the data (say, the middle 50%). One way to measure this distance is the inter-quartile range, IQR (the range of the middle 50% of the data). A data point xo is an outlier if
          xo < QL − 1.5·IQR   or   xo > QU + 1.5·IQR
        and a far outlier if
          xo < QL − 3.0·IQR   or   xo > QU + 3.0·IQR
      – OBSERVE whether the middle 50 percent of the data ALSO manifest skewness.
      – If the IQR is skewed, a transformation such as a log or square may be called for; if not, focus on the outliers.
      – Use a box plot to check the location of the median in relation to the quartiles. In Stata, a box plot will show outliers (1.5·IQR criterion) as points if they are present in the data.
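The fence rule above can be sketched in Python (an illustration, not part of the original slides). The 21 values are the annual revenue figures that appear on the labeled-scatter slide later in this deck; note that quartile conventions differ slightly across Stata, R, and Python, so the computed IQR will not match Stata's tabstat value exactly.

```python
# 1.5*IQR and 3*IQR fences for univariate outliers (pure Python sketch).
import statistics

def iqr_fences(data, k=1.5):
    q1, _, q3 = statistics.quantiles(data, n=4)  # sample quartiles
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Tanzania annual recurrent revenue, 1970-1990 (from the scatter slide)
data = [2549, 2623, 2746, 2906, 2972, 3243, 3409, 3426, 3470, 3497, 3544,
        3603, 3756, 3928, 4112, 4169, 4482, 4498, 4506, 5424, 5433]

lo, hi = iqr_fences(data)                    # ordinary outlier fences
far_lo, far_hi = iqr_fences(data, k=3.0)     # "far outlier" fences
outliers = [x for x in data if x < lo or x > hi]
print(outliers)  # []  -- consistent with the boxplot slide: no formal univariate outliers
```

The empty result matches the boxplot slide's observation that revenue has no formal univariate outliers, even though the distribution is skewed.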
    • Tanzania Revenue Data
      EXAMPLE Data: Tanzania.dta (Mukherjee et al.)
      REV: Gov Recurrent Revenue   REXP: Gov Recurrent Expenditure
      DEXP: Gov Development Expenditure   Year (T) 1970-1990
      Decade: 0=1970s, 1=1980s, 2=1990

      tabstat rev rexp dexp t, s(mean median sd var count min max iqr)

         stats |       rev      rexp      dexp         t
      ---------+----------------------------------------
          mean |  3728.381  4030.048  1693.619        80
           p50 |      3544      3891      1549        80
            sd |  817.1005  821.3014   894.879  6.204837
      variance |  667653.2  674535.9  800808.3      38.5
             N |        21        21        21        21
           min |      2549      2899       586        70
           max |      5433      5627      3589        90
           iqr |       926      1127      1379        10
      --------------------------------------------------
    • BIVARIATE NOTE: You can add "by(groupvariable)" after the comma to look at descriptives for subgroups of interest.

      tabstat rev rexp dexp t, s(mean median sd var count min max iqr) by(decade)

      Summary statistics: mean, p50, sd, variance, N, min, max, iqr
      by categories of: decade (decade)

        decade |       rev      rexp      dexp         t
      ---------+----------------------------------------
             0 |    4133.7    4057.3    2151.6      74.5
               |    3962.5      3850      1994      74.5
               |  814.8448  789.6686  774.8313   3.02765
               |    663972  623576.5  600363.6  9.166667
               |        10        10        10        10
               |      3243      3122      1228        70
               |      5433      5571      3589        79
               |      1072       937       927         5
      ---------+----------------------------------------
             1 |    3303.1      3950    1346.4      84.5
               |      3221    3812.5       993      84.5
               |  657.0976  914.5912  822.1245   3.02765
               |  431777.2  836477.1  675888.7  9.166667
               |        10        10        10        10
               |      2549      2899       588        80
               |      4506      5627      3096        89
               |       857      1392      1037         5
      ---------+----------------------------------------
             2 |      3928      4558       586        90
               |      3928      4558       586        90
               |         .         .         .         .
               |         .         .         .         .
               |         1         1         1         1
               |      3928      4558       586        90
               |      3928      4558       586        90
               |         0         0         0         0
      ---------+----------------------------------------
         Total |  3728.381  4030.048  1693.619        80
               |      3544      3891      1549        80
               |  817.1005  821.3014   894.879  6.204837
               |  667653.2  674535.9  800808.3      38.5
               |        21        21        21        21
               |      2549      2899       586        70
               |      5433      5627      3589        90
               |       926      1127      1379        10
      --------------------------------------------------
    • Univariate Box Plots & Histograms
      graph box rev
      [Box plot of rev and histograms of Tanzania annual revenue with different bin sizes not reproduced here.]
      – Notice that the inter-quartile range manifests skewness, in addition to the maximum being much further from the middle 50% of the observations.
      – Note how different the histogram for revenue appears with 4, 6, 8, and 10 bins (21 observations).
      – See the histogram help file to ensure you properly display histograms for continuous vs. discrete variables.
    • Bivariate Box Plots & Histograms: Inspecting by Subgroups or Categorical Transformations of Continuous Variables
      graph box rev if decade==0 | decade==1, over(decade)
      histogram rev, by(decade)
      [Graphs not reproduced here.]
      – Box plot of revenue by decade (1970s and 1980s).
      – Note that the IQR is less skewed for the 1970s than the 1980s.
      – Since there are no dots in the box plot, we know there are no formal univariate outliers.
      – We also know from other financial data that skewness may be something to correct for with a log transformation.
    • Scatterplot Matrices and Cross-Tabulations
      – Use these prior to ever running a regression to see differences and reveal potential violations of CLRM.
      [Illustrative scatterplot not reproduced: two groups may have the same relationship to y on average, but something else is going on.]
    • Bi-Variate Correlations/Regressions: The NEED TO GRAPH Data: Same Statistics, Different Relationships
      Remember that looking at correlations alone will conceal curvilinear relationships, heteroskedasticity, outliers, and distributional shape. For example, THE DATA IN THE FOUR PLOTS HAVE THE SAME:
      1) means for both y and x variables
      2) slope and intercept estimates in a regression of y on x
      3) R2 and F values (statistics we will come to later).
      The four panels form "Anscombe's Quartet," a famous demonstration by statistician Francis Anscombe in 1973. By creating the four plots he was able to check the assumptions of his linear regression model, and found them wanting for three of the four data sets (all but the top left). As Epstein et al. write, "Anscombe's point, of course, was to underscore the importance of graphing data before analyzing it" (24).
      F.J. Anscombe, 1973. "Graphs in Statistical Analysis," American Statistician 27:17, 19-20, cited in Lee Epstein, Andrew D. Martin, and Matthew M. Schneider, 2006. "On the Effective Communication of the Results of Empirical Studies, Part I." Paper presented at the Vanderbilt Law Review Symposium on Empirical Legal Scholarship, February 17.
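The identical-statistics claim is easy to verify directly. Below is a small Python sketch (not part of the slides) using the quartet's values as transcribed from Anscombe (1973); each data set has the same x mean (9.0), y mean (about 7.50), and OLS slope (about 0.500) despite wildly different shapes.

```python
# Anscombe's quartet: same summary statistics, very different relationships.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]       # x for sets 1-3
x4   = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]           # x for set 4
ys = [
    [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
]

def ols_slope(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

for y, x in zip(ys, [x123, x123, x123, x4]):
    print(round(sum(y) / len(y), 2), round(ols_slope(x, y), 3))
```

All four rows print a y mean of about 7.50 and a slope of about 0.500; only the plots reveal the curvature, the outlier, and the single high-leverage point.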
    • Scatterplot Matrices
      graph matrix rev rexp dexp t, half
      [Matrix of pairwise scatterplots of rev, rexp, dexp, and t not reproduced here.]
      – Allows you to look at bivariate relationships between your model variables, think about possible collinearity between explanatory variables, non-linearity in relationships, etc.
      – Notice the time trend of all three financial variables; consider autocorrelation.
      – Extreme points: we may want to inspect the scatterplots for rev-dexp and rexp-dexp for observations that seem to be unusual given our theory that development expenditure would be a function of revenue (the observations have high development expenditure but low revenue).
    • A Closer Look: Scatterplot with Labels
      scatter dexp rev, mlabel(t)
      – Note that in 1990, revenue was middling but development expenditures were low. What might cause this?
      scatter rev t, mlabel(rev)
      – A scatter of revenue over time suggests a trend and possible autocorrelation. It is also curious that 1979 and 1980 have almost identical (and high) levels of revenue. Possible data error, or real stagnation in revenue? There was a war between Uganda and Tanzania in 1979. Note how inspecting the data can lead to case-specific information that may require modeling adjustments (e.g., war dummies). And we didn't know a thing about Tanzania!
      [Labeled scatterplots not reproduced here.]
    • Cross-Tabulations (Contingency Tables)
      – Recode continuous variables into categories (see notes from March 11), which enables you to summarize continuous variables by categories (below) and inspect test statistics for inter-group differences in means and variances (next slide).
      gen revcat=rev
      recode revcat 2549/3500=1 3501/4500=2 4501/max=3
      label define revcat 1 "low" 2 "med" 3 "high"
      label values revcat revcat
      tab revcat decade, sum(dexp)
      – We want to see if the mean and sd of development expenditure vary by revenue level and decade, for example, in order to see if one decade is responsible for all of the high-revenue observations, etc. Remember how important sub-group size is when using interaction terms. Cross-tabs are an important tool for exploring whether the same small subgroup is driving the key results of estimation. Remember the 13 educated women in the dummy model (Feb 25 notes).

                 |              decade
          revcat |         0          1          2 |      Total
      -----------+---------------------------------+----------
             low |   1497.25  934.16667          . |     1159.4
                 | 188.20977  275.87274          . |   372.3422
                 |         4          6          0 |         10
      -----------+---------------------------------+----------
             med |   2205.75  1587.6667        586 |     1771.5
                 | 439.05989  850.62408          0 |  782.53526
                 |         4          3          1 |          8
      -----------+---------------------------------+----------
            high |      3352       3096          . |  3266.6667
                 | 335.16861          0          . |  279.31046
                 |         2          1          0 |          3
      -----------+---------------------------------+----------
           Total |    2151.6     1346.4        586 |   1693.619
                 | 774.83134  822.12451          0 |  894.87896
                 |        10         10          1 |         21
    • Cross-Tabulations (Contingency Tables), continued
      – Inspect test statistics for inter-group differences in means and variances.
      – Categories of low, medium, and high revenue levels are not statistically significantly disproportionately distributed in any one decade, so one period alone will probably not be driving statistically significant results for revenue effects, with the caveat that our categories need to be meaningful, perhaps coded at natural breaks in the data, quartiles, etc. However, outliers that do not fall in subgroups will not show up with this method. It is still useful to consider possible clusters of data that will influence our model.

      tab revcat decade, column row chi2 lrchi2 V exact gamma taub

                 |              decade
          revcat |         0          1          2 |      Total
      -----------+---------------------------------+----------
             low |         4          6          0 |         10
                 |     40.00      60.00       0.00 |     100.00
                 |     40.00      60.00       0.00 |      47.62
      -----------+---------------------------------+----------
             med |         4          3          1 |          8
                 |     50.00      37.50      12.50 |     100.00
                 |     40.00      30.00     100.00 |      38.10
      -----------+---------------------------------+----------
            high |         2          1          0 |          3
                 |     66.67      33.33       0.00 |     100.00
                 |     20.00      10.00       0.00 |      14.29
      -----------+---------------------------------+----------
           Total |        10         10          1 |         21
                 |     47.62      47.62       4.76 |     100.00
                 |    100.00     100.00     100.00 |     100.00

                Pearson chi2(4) =   2.6075   Pr = 0.625
       likelihood-ratio chi2(4) =   2.8982   Pr = 0.575
                     Cramér's V =   0.2492
                          gamma =  -0.2000   ASE = 0.327
                Kendall's tau-b =  -0.1183   ASE = 0.197
                 Fisher's exact =            0.645
    • II. Post-Estimation Diagnostics: The OLS Estimator is a (Sensitive) Mean
      – The sample mean is a least squares estimator of the location of the center of the data, but the mean is not a resistant estimator, in that it is sensitive to the presence of outliers in the sample. That is, changing a small part of the data can change the value of the estimator substantially, leading us astray.
      – This is particularly problematic if we are unsure about the actual shape of the population distribution from which our data are drawn.
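The point above in miniature (an illustration with made-up numbers, not from the slides): one contaminated value drags the least-squares location estimate far away, while a resistant estimator such as the median barely moves.

```python
# Mean (least squares) vs. median (resistant) under one wild observation.
import statistics

clean = [10, 11, 12, 13, 14]
dirty = [10, 11, 12, 13, 140]   # a single contaminated value

print(statistics.mean(clean), statistics.median(clean))   # 12, 12
print(statistics.mean(dirty), statistics.median(dirty))   # 37.2, 12
```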
    • II. Post-Estimation Diagnostics: Extreme Points (start here, since extreme points will affect formal testing procedures). Also called case diagnostics or case-deletion diagnostics.
      – In multivariate analysis, extreme data points create more complex problems than in univariate analysis.
      – A UNIVARIATE outlier is simply a value of x far from the rest of the X values (unconditionally unusual, but it may not be a REGRESSION outlier).
      – An outlier in simple bivariate regression is an observation whose dependent-variable value is UNUSUAL GIVEN the value of the independent variable (conditionally unusual).
    • II. Bivariate Regression Extreme Points
      – An outlier in either X or Y that has an atypical or anomalous X value has LEVERAGE. It affects model summary statistics (e.g., R2, standard error), but has little effect on the regression coefficient estimates.
      – An INFLUENCE point has an unusual Y value (AND possibly an extreme X value). It is characterized by having a noticeable impact on the estimated regression coefficients (i.e., removing it from the sample would markedly change the slope and direction of the regression line).
      – A RESIDUAL OUTLIER has a large VERTICAL distance from the regression line. IMPORTANT NOTE: an outlier in X or Y is NOT necessarily associated with a large residual, and vice versa.
    • II.A.1.a Extreme Observations in Y
      [Two slides of illustrative graphs not reproduced here.]
    • II.A.1.b Extreme Observations in X
      NOTE: These examples reveal that it is most typically observations extreme in BOTH x AND y that have influence (second graph on these two slides), but it is not always the case.
    • Summary Table: Model Effects for Outliers, Leverage, Influence
      – Outlier in y (yi far from Ȳ), x in trend: no leverage, no influence; effect on intercept/coefficients/uncertainty* = Yes/No/Yes.
      – Outlier in y, x also unusual: leverage and influence; effect = Yes/Large/Yes.
      – Outlier in x (xi far from X̄), y in trend: leverage but no influence; effect = No/No/Yes (tends to reduce uncertainty).
      – Outlier in x, y also unusual: leverage and influence; effect = Yes/Large/Yes.
      – Outlier in the residual (y unusual, x possibly): possibly, but not necessarily, leverage or influence; effect = No/No/Yes.
      *Note that influence can refer to several things: (1) the effect on the y-intercept; (2) on a particular coefficient; (3) on all coefficients; (4) on estimated standard errors. Thus we have a variety of procedures to evaluate influence.
    • 1. OUTLIERS are not necessarily influential
      2. BUT they can be, depending on leverage
      3. Yet high-LEVERAGE points are not always influential
      4. And INFLUENTIAL points are not necessarily outliers

      PLOT   OUTLIER   LEVERAGE   INFLUENCE
        1      Yes        No         No
        2      Yes        Yes        Yes
        3      No         Yes        No
        4      No         Yes        Yes
      (Slide 23)
    • II.A Multivariate Extreme Points
      – Influence in multivariate regression results from a particular combination of values on all variables in the regression, not necessarily just from unusual values on one or two of the variables, but the concepts from the bivariate case apply.
      – When there are two or more explanatory variables X, scatterplots may not reveal multivariate outliers, which are separated from the centroid of all the Xs but do not appear unusual in bivariate plots of any two of them.
    • Residual Analysis: A Caution
      – Recall that the residuals e are just an estimate of an unobservable vector with given distributional properties. Assessing the appropriateness of the model for a given problem may entail using the residuals in the absence of ε, but since e is by construction orthogonal to (uncorrelated with) the regressors, with Cov(X,e)=0 and E(e)=0, one cannot use the residuals to test these assumptions of the CLRM.

      Sample: e — residuals — estimated
      Population: ε — error / disturbance term / stochastic component — unobserved parameter we try to estimate

      – The difference between these means that you are never totally confident that e is a good estimate of ε: if you meet all assumptions of CLRM, then e is an unbiased, efficient, and consistent estimate of ε.
    • II.A.1 The "Hat" Matrix (Least Squares Projection Matrix / Fitted Value Maker)
      – DeNardo calls it P (because it is the projection matrix for the predictor space / least squares projection matrix; see http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_projection_matrix.htm for a lovely geometric explanation); Rob calls it N ("fitted value maker"); Cook calls it V; and Belsley calls it H. I use H, since most of the books on diagnostics seem to use H.
      – The hat matrix is H = X(X'X)^(-1)X'.
      – Since b = (X'X)^(-1)X'y and, by definition, the vector of fitted values is ŷ = Xb, it follows that ŷ = Hy.
      – The individual diagonal elements h1, h2, ..., hi, ..., hn of H can thus be related to the distance between each row vector of explanatory variables xi and the row vector of explanatory-variable means x̄, where xi is the ith row of the matrix X.
    • II.A.1 Hat Matrix, cont.
      – H = X(X'X)^(-1)X' (the matrix); the vector of diagonal elements is hii = xi(X'X)^(-1)xi', which equals ∂ŷi/∂yi. This is the effect of the ith observation on its own predicted value. The off-diagonal elements are hij = xi(X'X)^(-1)xj'.
      – In scalar form, the hat (leverage) value for the ith observation in simple regression is
          hii = 1/n + (xi − x̄)² / Σj (xj − x̄)²
        (note the adjustment for the number of observations: as n grows larger, the individual leverage of any one observation diminishes).
      – h serves as a measure of leverage of the ith data point because its numerator is the squared distance of the ith data point from its mean in the X direction, while its denominator is a measure of the overall variability of the data points along the X axis. It is therefore the distance of the data point in relation to the overall variation in the X direction.
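The scalar leverage formula is easy to compute directly. A Python sketch (illustrative, with a made-up x series standing in for a time index): the diagonal of H always sums to the number of estimated parameters, so with an intercept and one regressor the hat values sum to 2 and average (k+1)/n.

```python
# Leverage in simple regression: h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2
def leverages(x):
    n = len(x)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    return [1 / n + (xi - mx) ** 2 / sxx for xi in x]

x = [70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80]  # e.g. a year index
h = leverages(x)
print(round(sum(h), 6))  # 2.0 = intercept + one slope
print(round(max(h), 4))  # endpoints of the x range carry the most leverage
```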
    • II.A.1 Hat Matrix, cont.
      – Because H is a projection matrix, 0 ≤ hii ≤ 1 (for proof see Belsley et al., 1980, Appendix 2A).
      – It is possible to express the fitted values in terms of the observed values (scalar form):
          ŷj = h1j·y1 + h2j·y2 + h3j·y3 + ... + hjj·yj + ... + hnj·yn = Σi hij·yi
      – hij therefore captures the extent to which yi contributes to the fitted values. If it is large, then the ith observation had a substantial impact on the jth fitted value. The hat value summarizes the potential influence of yi on ALL the fitted values.
    • II.A.1.a Hat and the Residuals
      – Since ŷ = Hy, the residuals are
          e = y − ŷ = (I − H)y
        where I is the identity matrix. Substituting Xβ + ε for y:
          e = (I − H)(Xβ + ε) = (I − H)ε
        (because (I − H)X = 0). In scalar form, ei = εi − Σj hij·εj for i = 1, 2, ..., n.
      – The relationship between the residual and the true stochastic component therefore depends on H. If the hij's are sufficiently small, e is a reasonable estimate of ε.
      – Note the interesting situation in which a better "fit," if based on extreme values, may signal an underestimate of the randomness in the world.
    • II.A.1.a Hat and the Residuals, cont.
      – The variance of e is also related to H (see DeNardo): Var(ei) = σ²(1 − hii).
      – For high-leverage cases, in which hii approaches its upper bound of one, the residual value will tend to zero.
      – This means that the residuals will not be a reliable means of detecting influential points, so we need to transform them, leading us to the subject of studentized (jackknifed) residuals.
    • II.A.1.a Hat / Studentized Residuals
      PURPOSE: detection of multivariate outliers.
      – Adjust the residuals to make them conspicuous, so they are reliable for detecting leverage and influential points.
      – DeNardo's "internally studentized residual" is called the "standardized" or "normalized" residual in other contexts; it can disguise outliers.
      – The "externally" studentized residual uses the standard error of the regression (residual sum of squares/(n−k) = e'e/(n−k)) computed after deleting the ith observation, which allows solving for h, the measure of leverage:
          ri* = ei / ( s(i) · sqrt(1 − hii) )
        where s(i) is the standard error of the estimate/regression calculated after deleting the ith observation.
      – These residuals are distributed as Student's t, so "a test" of each outlier can be made, with each studentized residual representing a t-value for its observation.
      – This is an application of the jackknife method, whereby observations are omitted and estimation is iterated to arrive at the studentized residuals (just one of many applications of the jackknife). Also called the "jackknife residual."
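The slides compute these in Stata with "predict estu, rstudent"; the same leave-one-out logic can be sketched from scratch in Python (made-up data, with an outlier planted at index 5): refit without case i to get s(i), then divide.

```python
# Externally studentized (jackknife) residuals for simple regression.
import math

def fit(x, y):
    """OLS intercept and slope for one regressor."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b = (sum((a - mx) * (c - my) for a, c in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
    return my - b * mx, b

def rstudent(x, y):
    n = len(x)
    a, b = fit(x, y)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    out = []
    for i in range(n):
        e_i = y[i] - (a + b * x[i])                 # residual from the FULL fit
        h_i = 1 / n + (x[i] - mx) ** 2 / sxx        # leverage of case i
        xd, yd = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        ad, bd = fit(xd, yd)                        # refit WITHOUT case i
        rss = sum((q - (ad + bd * p)) ** 2 for p, q in zip(xd, yd))
        s_i = math.sqrt(rss / (len(xd) - 2))        # SER with case i deleted
        out.append(e_i / (s_i * math.sqrt(1 - h_i)))
    return out

noise = [0.1, -0.2, 0.15, -0.1, 0.2, 0.0, -0.15, 0.1, -0.1, 0.05]
x = list(range(10))
y = [2 * xi + 1 + d for xi, d in zip(x, noise)]
y[5] += 8                                           # planted outlier
r = rstudent(x, y)
print(max(range(10), key=lambda i: abs(r[i])))      # index of largest |r*|
```

The planted case stands out with a huge |r*| precisely because s(i) is computed without it.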
    • II.A.1.a, continued: Steps for Assessing Studentized Residuals
      1. Studentized residuals correspond to the t-statistic we would obtain by including in the regression a dummy predictor coded 1 for that observation and 0 for all others. One can then test the null hypothesis that the coefficient δ equals zero (Ho: δ=0) in:
           E(yi) = β0 + β1·xi1 + β2·xi2 + ... + βk-1·xi,k-1 + δ·Ii
         This tests whether case i causes a shift in the regression intercept.
      2. We set an alpha significance level α for our overall Type I error risk: the probability of rejecting the null when it is in fact true. According to the Bonferroni inequality [Pr(a set of events occurring) cannot exceed the sum of the individual probabilities of the events], the probability that at least one of the cases is a statistically significant outlier (when the null hypothesis is actually true) cannot exceed nα, so...
      3. We want to run n tests (one for each case), testing each residual at the α/n level (call this α*). Suppose we set α=.05 and we have 21 observations. To test whether ANY case in a sample of n=21 is a significant outlier at level α, we check whether the maximum studentized residual max|ri| is significant at α* = .05/21 = .0024 (given a t-distribution with df = n − K = 21 − 2 = 19). Most t-tables do not go down to such small tail probabilities, so a computer is required.
    • Tanzania Revenue Data
      regress rexp rev   (Expenditure as a function of Revenue)

            Source |       SS       df       MS              Number of obs =      21
      -------------+------------------------------           F(  1,    19) =   55.16
             Model |    10034268     1    10034268           Prob > F      =  0.0000
          Residual |  3456450.93    19   181918.47           R-squared     =  0.7438
      -------------+------------------------------           Adj R-squared =  0.7303
             Total |    13490719    20  674535.948           Root MSE      =  426.52

      ------------------------------------------------------------------------------
              rexp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
               rev |   .8668668   .1167207     7.43   0.000     .6225675    1.111166
             _cons |    798.038   445.0211     1.79   0.089    -133.4019    1729.478
      ------------------------------------------------------------------------------

      predict resid, resid     (creates a variable with the ORDINARY RESIDUALS)
      predict estu, rstudent   (STUDENTIZED RESIDUALS)

      EXAMPLE Data: Tanzania.dta (Mukherjee et al.)
      REV: Gov Recurrent Revenue   REXP: Gov Recurrent Expenditure
      DEXP: Gov Development Expenditure   Year (T) 1970-1990
    • II.A.1.a, continued: Assessing Studentized Residuals; Bonferroni Outlier Test (test for outlier influence on the y-intercept)
      4. Identify the largest and smallest residuals. As a rule of thumb, we should pay attention to residuals with absolute values greater than 2, be worried about those with values greater than 2.5, and be most concerned about those exceeding 3. There are a variety of ways to identify/inspect these residuals; see http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm for more options. The fastest in a small dataset is to list the observations with a studentized residual exceeding +2 or −2. We see here that 1980 is an outlier:

      list if abs(estu)>2
           |  rev    rexp   dexp    t      resid       estu   decade   revcat |
           |-----------------------------------------------------------------|
       11. | 4506    5627   3096   80   922.8602   2.590934        1     high |
           +-----------------------------------------------------------------+

      We can use Stata to carry out the Bonferroni outlier test as follows. The maximum studentized residual, 2.59, is our t-value, and n=21. For 1980 to be a significant outlier (i.e., to cause a significant shift in the intercept) at α=.05, t=2.59 must be significant at .05/21:

      display .05/21
      .00238095
      display 2*ttail(19, 2.59)
      .01796427

      The obtained P-value (p=.01796) is NOT below α/n=.00238, so 1980 is NOT a significant outlier at α=.05.
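The same arithmetic can be reproduced without Stata. A Python sketch (illustrative; the numeric integration of the t density stands in for Stata's ttail, since the standard library has no t-distribution CDF):

```python
# Two-sided tail probability of t = 2.59 with 19 d.f., vs. the Bonferroni level.
import math

def t_two_tail(t, df, steps=100000, upper=100.0):
    """Two-sided tail area of Student's t via midpoint-rule integration."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda u: c * (1 + u * u / df) ** (-(df + 1) / 2)
    dx = (upper - t) / steps
    area = sum(pdf(t + (i + 0.5) * dx) for i in range(steps)) * dx
    return 2 * area

p = t_two_tail(2.59, 19)
alpha_star = 0.05 / 21
print(round(p, 5), round(alpha_star, 5))  # ~.01796 vs .00238: not a significant outlier
```

This reproduces the slide's conclusion: .01796 is well above the Bonferroni-adjusted threshold .00238, so 1980 does not cause a significant intercept shift at α = .05.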
    • II.A.1.b Hat Matrix and Leverage: Outlier Influence on Fitted Values (recall that fit is overly dependent on these outliers)
      – Note that if hii=1, then ŷi = yi; that is, ei = 0, and the ith case would be fit exactly.
      – This means that, if no observations are exact replicates, one parameter is dedicated to one data point, which would make it impossible to invert X'X and obtain OLS estimates.
      – This rarely occurs, so the value of hii will rarely reach its upper bound of 1.
      – The MAGNITUDE of hii depends on the relationship 1/n ≤ hii ≤ 1/c, where c is the number of times the ith row of X is replicated (generally, then, h will range from 1/n to 1, but in survey data it is possible to have duplicate responses for multiple respondents; you can check this in Stata with the "duplicates" command).
      – The higher the value of hii, the higher the leverage of the ith data point.
      – The average hat value is E(h) = (k+1)/n, where k is the number of regressors. We therefore proceed by looking at the maximum hat value. A hat value indicates leverage if it is more than twice the mean hat value.
      – Huber (1981) suggests another rule of thumb for interpreting hii, though it might overlook more than one large hat value:
          max(hii) ≤ .2           little to worry about
          .2 < max(hii) ≤ .5      risky
          .5 < max(hii)           too much leverage
    • II.A.1.b Hat Matrix and Leverage: Outlier Influence of X Values on Fitted Values, continued
      – Use the predict command to create the hat values for each observation, then summarize (or calculate) to get the mean:

      predict h, hat
      summarize h
          Variable |  Obs        Mean   Std. Dev.        Min        Max
      -------------+--------------------------------------------------
                 h |   21    .0952381   .0638661   .0476762   .2652265
      display 2/21
      .0952381

      – List the observations whose h values exceed twice E(h). We see that 1978 and 1979 have leverage:

      list if h>2*.0952381
       9. | rev 5424 | rexp 5058 | dexp 3115 | t 78 | resid -441.9235 | estu -1.222458 | h .2629347 |
      10. | rev 5433 | rexp 5571 | dexp 3589 | t 79 | resid  63.27473 | estu  .1685843 | h .2652265 |

      scatter h rev, mlabel(t)
      [Scatterplot of leverage against rev not reproduced here.]
      – We can graph the hat values against the values of the independent variable(s). The leverage points are well above 0.2 and more than twice their mean. Recall that we identified from EDA that something might be different about 1978 and 1979. This means that too much of the sample's information about the X-Y relationship may come from a single case.
    • II.A.1.c The DFBETA Statistic (depends on X and Y values; measures how much a case i influences the coefficients; not a formal test statistic with a hypothesis test)
      – The regression coefficient on Xk is bk. Let bk(i) represent the same coefficient when the ith case is deleted. Deleting the ith case therefore changes the coefficient on Xk by bk − bk(i). We can express this change in standard errors:
          DFBETASik = (bk − bk(i)) / ( se(i) / sqrt(RSSk) )
        where se(i) represents the residual standard deviation with the ith case deleted, and RSSk is the residual sum of squares from the auxiliary regression of Xk on all the other X variables (without deleting the ith case). The denominator therefore modifies the usual estimate of the standard error of the coefficient if the ith case is deleted. DFBETA can also be expressed in terms of the hat statistic (see DeNardo).
      – Interpreting the direction of influence with DFBETAS:
          If DFBETASik > 0, case i increases the magnitude of bk.
          If DFBETASik < 0, case i decreases the magnitude of bk.
      – The size of influence: DFBETAS tells us "by how many standard errors does the coefficient change if we drop case i?" A DFBETAS of +1.34, for example, means that if case i were deleted, the coefficient for regressor k would be 1.34 standard errors lower.
    • II.A.1.c The DFBETA Statistic, continued
      – Stata's dfbeta command creates the DFBETA statistic for each of the regressors in the model; we then list it for all of our observations. A rule of thumb for large datasets, where listing and inspecting all of the DFBETA values would be difficult, is to inspect all DFBETAs in excess of 2/sqrt(n).

      dfbeta
      _dfbeta_1: dfbeta(rev)
      list _dfbeta_1
          | _dfbeta_1 |
       1. |  .1004588 |
       2. |  .0401034 |
       3. |  .0582458 |
       4. | -.0044781 |
       5. |  .1422126 |
       6. | -.1596971 |
       7. | -.1744603 |
       8. | -.1502057 |
       9. | -.6607218 |
      10. |  .0917439 |
      11. |  .5789033 |
      12. |  .1527179 |
      13. |  -.059624 |
      14. | -.0800557 |
      15. | -.0528694 |
      16. | -.0164145 |
      17. |  .1149248 |
      18. |  .0945607 |
      19. |  -.064976 |
      20. | -.0342036 |
      21. |  .0475227 |

      display 2/sqrt(21)
      .43643578
      histogram _dfbeta_1, bin(10) frequency xline(-.4364 .4364) xlabel(#10)
      (bin=10, start=-.66072184, width=.12396252)
      [Histogram not reproduced here.]

      – Since DFBETAs are obtained by case-wise deletion, they do not account for situations where a number of observations may cluster together, jointly pulling the regression line in a direction, but not individually showing up as influential. You should not rely solely on DFBETA, then, to test for influence. A histogram of DFBETA can reveal groups of influential cases (the one produced above uses reference lines at ±2/sqrt(n) = ±.4364). Two observations fall outside the safe range.
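The case-deletion formula above can be sketched directly in Python (made-up data, not the Tanzania series; the last case is planted as a high-leverage, off-trend observation). For simple regression, the auxiliary regression of x on a constant makes RSSk just Σ(x − x̄)².

```python
# DFBETAS by case deletion for simple regression:
# (b - b_(i)) / ( s_(i) / sqrt(Sxx) ), Sxx from the full sample.
import math

def slope_fit(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b = (sum((p - mx) * (q - my) for p, q in zip(x, y))
         / sum((p - mx) ** 2 for p in x))
    return my - b * mx, b

def dfbetas(x, y):
    n = len(x)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)    # RSS of x on a constant
    _, b = slope_fit(x, y)
    out = []
    for i in range(n):
        xd, yd = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        ad, bd = slope_fit(xd, yd)           # refit without case i
        rss = sum((q - (ad + bd * p)) ** 2 for p, q in zip(xd, yd))
        s_i = math.sqrt(rss / (len(xd) - 2))
        out.append((b - bd) / (s_i / math.sqrt(sxx)))
    return out

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.2, 17.8, 20.1, 30.0]  # last case off the y = 2x trend
d = dfbetas(x, y)
cut = 2 / math.sqrt(len(x))
print(max(range(len(d)), key=lambda i: abs(d[i])))  # the planted influential case
```

The planted case dwarfs the 2/sqrt(n) cutoff, while the in-trend cases do not.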
• II.A.1.c The DFBETA Statistic (Slide 40)

. scatter _dfbeta_1 t, ylabel(-1(.5)1) yline(.4364 -.4364) mlabel(t)

[Scatterplot of _dfbeta_1 against year t, labeled by year, with reference lines at -.4364 and .4364]

. list t _dfbeta_1 rev rexp if t==78 | t==80

     |  t   _dfbeta_1    rev   rexp |
     |------------------------------|
  9. | 78   -.6607218   5424   5058 |
 11. | 80    .5789033   4506   5627 |

• Now that we know there are two potential observations to worry about, it is useful to use another plot to identify which they are (this is most useful for multivariate regression; it is rather obvious in the single-regressor case).
• We see that 1978 and 1980 are influential.
• Note that 1978 and 1979 had leverage, but only 1978 is also influential. 1980 is influential but did not have leverage (review Slide 23).
• 1978 decreases the coefficient on revenue by .66 standard errors and 1980 increases it by .58 standard errors.
• II.A.1.d Influence of a Case on the Model as a Whole (Cook's Distance and DFFITS Statistics) (Slide 41)

• Returning to the Hat statistic: if we want to know the effect of case i on the predicted values, we can use the DFFITS statistic, which does not depend on the coordinate system used to form the regression model.

    DFFIT_i = yhat_i - yhat_i(i) = x_i[b - b(i)] = h_i e_i / (1 - h_i)

To scale the measure, one can divide by the standard deviation of the fit, where s^2(i) is our estimate of the variance with observation i deleted:

    DFFITS_i = x_i[b - b(i)] / ( s(i) sqrt(h_i) )

This is intuitive in that one term grows the greater the hat statistic (and therefore the leverage) for case i, and the other grows the larger the studentized residual (outlier). Since r*_i = e_i / (s(i) sqrt(1 - h_i)), DFFITS can be written as

    DFFITS_i = r*_i sqrt( h_i / (1 - h_i) )

Then we want to know the scaled change in fit for the cases other than the ith row:

    x_j[b - b(i)] / ( s(i) sqrt(h_j) ) = h_ij e_i / ( (1 - h_i) s(i) sqrt(h_j) )

The absolute value of this change in fit for the remaining cases will be less than the absolute value of the change attributed to the fitted value when the ith case is deleted. DFFITS_i is the number of standard errors by which the fitted value for case i changes if the ith observation is deleted from the data.

• Rule-of-thumb cutoff values are to inspect observations whose DFFITS exceed the following (and to run the regression without those observations to see by how much the coefficient estimates change):
  Small to medium datasets: |DFFITS_i| > 1
  Large datasets: |DFFITS_i| > 2 sqrt( (k+1)/n )
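The two expressions for DFFITS above are algebraically identical, which a short Python sketch can confirm (synthetic data, not the Tanzania series; numpy only):

```python
import numpy as np

# Sketch: DFFITS via the r* formula versus the change-in-fit definition.
rng = np.random.default_rng(1)
n, p = 21, 2  # p = k + 1 parameters (intercept + slope)
x = rng.normal(size=n)
y = 1.0 + 0.8 * x + rng.normal(scale=0.5, size=n)
X = np.column_stack([np.ones(n), x])

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # hat values

# deleted residual variance s(i)^2 via the standard identity,
# then externally studentized residuals r*_i = e_i / (s(i) sqrt(1 - h_i))
s2 = e @ e / (n - p)
s2_i = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
r_star = e / np.sqrt(s2_i * (1 - h))

dffits = r_star * np.sqrt(h / (1 - h))

# check case 0 against the definition: scaled change in its fitted value
i = 0
keep = np.arange(n) != i
b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
direct = (X[i] @ b - X[i] @ b_i) / np.sqrt(s2_i[i] * h[i])
```

The refit-by-deletion value (`direct`) and the closed form agree to machine precision, so in practice no case-wise refitting is needed.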
• DFFITS, and Hat vs. DFFITS Plot (from Tanzania Model) (Slide 42)

. predict dffit, dfits

. list t rev rexp dffit

     |  t    rev   rexp      dffit |
     |-----------------------------|
  1. | 70   3243   3304  -.1932092 |
  2. | 71   3497   3569   -.143909 |
  3. | 72   3426   3480  -.1642726 |
  4. | 73   3756   3809  -.1293692 |
  5. | 74   3409   3122  -.3824875 |
     |-----------------------------|
  6. | 75   4169   3891  -.3301978 |
  7. | 76   4482   4352  -.2539932 |
  8. | 77   4498   4417   -.216292 |
  9. | 78   5424   5058  -.7301378 |
 10. | 79   5433   5571   .1012859 |
     |-----------------------------|
 11. | 80   4506   5627   .8291757 |
 12. | 81   4112   4932   .3522714 |
 13. | 82   3603   4594   .3838607 |
 14. | 83   3470   4261   .2597122 |
 15. | 84   2972   3476   .0768232 |
     |-----------------------------|
 16. | 85   2746   3202   .0211414 |
 17. | 86   2623   2929  -.1417075 |
 18. | 87   2549   2899  -.1141464 |
 19. | 88   2906   3431   .0905055 |
 20. | 89   3544   4149   .1518263 |
     |-----------------------------|
 21. | 90   3928   4558   .1956947 |
     +-----------------------------+

. scatter h dffit, mlabel(t)

[Scatterplot of leverage (h) against DFFITS, labeled by year]

• No observation has a DFFITS statistic larger than 1 in this small dataset. The largest is .8291757, for 1980.
• Note that, as a function of hat and the studentized residuals, DFFITS_i = r*_i sqrt(h_i/(1 - h_i)) is a kind of measure of OUTLIERNESS x LEVERAGE.
• A graphical alternative to the influence measures is to plot hat against the studentized residuals and look for observations for which both are big (only 1979 approaches this criterion, but it is well under the DFFITS cutoff).
• II.A.1.d Influence of a Case on the Model as a Whole (Cook's Distance) (Slide 43)

• Cook's D is similar to the DFFITS statistic, but DFFITS gives relatively more weight to leverage points, since it shows the effect on an observation's fitted value when that particular case is dropped.
• Cook's Distance "tests the hypothesis" that the true slope coefficients are equal in the aggregate to the slope coefficients estimated with observation i deleted (H0: beta = b(i)). It is more a rule of thumb producing a measure of distance independent of how the variables are measured than a formal F-test: a point is influential if D_i exceeds the median of the F distribution, F(k+1, n-k-1)(.5).

    D_i = h_i e_i^2 / ( k s(i)^2 (1 - h_i)^2 )

Since r*_i = e_i / (s(i) sqrt(1 - h_i)), Cook's D can be rewritten as

    D_i = ( r*_i^2 / k ) ( h_i / (1 - h_i) )

• Observations with larger D values than the rest of the data are those that have unusual influence.
• While there are numerical rules for assessing Cook's D, authors differ in their advice.
• Some argue that it is best to graph the Cook's D values to see whether any one or two points have a much bigger D_i than the others.
• Cook's D, Continued (Slide 44)

. predict cooksd, cooksd

• We can then look up the median value of the F distribution with k+1 numerator and n-k-1 denominator degrees of freedom. For the Tanzania data, no observations are this large:

. display invFtail(2,19,.5)
.71906057

. list t rev rexp if cooksd>.71906057

• Some authors suggest looking at the five most influential observations, which can be done in Stata by (NOTE: the last term is a lowercase "L" for last observation):

. list t rev rexp cooksd dffit _dfbeta_1 in -5/l

     |  t    rev   rexp     cooksd       dffit   _dfbeta_1 |
     |-----------------------------------------------------|
 17. | 81   4112   4932   .0589684    .3522714    .1527179 |
 18. | 82   3603   4594   .0670656    .3838607    -.059624 |
 19. | 74   3409   3122    .067792   -.3824875    .1422126 |
 20. | 78   5424   5058   .2597905   -.7301378   -.6607218 |
 21. | 80   4506   5627   .2642971    .8291757    .5789033 |
     +-----------------------------------------------------+
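A hedged Python sketch of Cook's D on synthetic data. One convention caveat: the slide writes the closed form with k and the studentized residual; the version below uses the internally studentized residual and p = k + 1 parameters, which is the form that agrees exactly with the definition of D_i as an aggregate (scaled) change in all n fitted values:

```python
import numpy as np

# Cook's D: closed form versus the aggregate change-in-fit definition.
rng = np.random.default_rng(2)
n, p = 21, 2  # p parameters = intercept + one slope
x = rng.normal(size=n)
y = 3.0 - 0.4 * x + rng.normal(scale=0.6, size=n)
X = np.column_stack([np.ones(n), x])

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
s2 = e @ e / (n - p)

# closed form: D_i = (r_i^2 / p) * h_i / (1 - h_i), r_i internally studentized
r = e / np.sqrt(s2 * (1 - h))
D = r**2 / p * h / (1 - h)

# definition: squared change in ALL fitted values when case i is dropped,
# scaled by p * s^2 -- checked here for one case
i = 4
keep = np.arange(n) != i
b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
D_def = ((X @ b - X @ b_i) ** 2).sum() / (p * s2)
```

The two values match exactly, which is why statistical packages can compute Cook's D without refitting the model n times.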
• Proportional Plots for Influence Statistics (Slide 45)

• It is useful to graph Cook's D and DFFITS with residual vs. fitted plots, with symbols proportional to the size of Cook's D. First we have to predict the fitted values:

. predict yhat
(option xb assumed; fitted values)

• Then weight the symbols by the value of the influence statistic of interest:

. graph twoway scatter resid yhat [aweight=cooksd], msymbol(Oh) yline(0) saving(Dprop)

NOTE: a proportional plot with weights disallows labeling, so I create two versions, one with labels and one with proportions, and use PowerPoint to overlay them:

. graph twoway scatter resid yhat [aweight=cooksd], mlabel(t) yline(0) saving(Dlabe

• We can also plot the studentized residuals vs. HAT (leverage, not the fitted values), with symbols proportional to Cook's D, to look at outlierness, leverage, and influence at the same time. Same command as above except the variables are estu h (or whatever you have named your studentized residuals and hat).

[Two plots: residuals vs. fitted values, and studentized residuals vs. leverage, with symbols weighted by Cook's D]
• II.A.1.d Influence of a Case on Precision of the Estimates (COVRATIO) (Slide 46)

• Recall that by increasing the variance of one or more Xs, a high-leverage observation will decrease the standard error of the coefficient(s), even if it does not influence their magnitude. Though this may be considered beneficial, it may also exaggerate our confidence in our estimate, especially if we do not know whether the high-leverage outlier is representative of the population distribution or due entirely to stochastic factors or error (sampling, coding, etc.), that is, a true outlier.
• Using the COVRATIO statistic, we can examine the impact of deleting each observation in turn on the size of the joint confidence region (in n-space) for beta, since the size of this region is equivalent to the length of the confidence interval for an individual coefficient, which is proportional to its standard error. The squared length of the CI is therefore proportional to the sampling variance for b. The squared size of a joint confidence region is proportional to the variance for a set of coefficients ("generalized variance") (Fox 1991, 31; see Belsley et al., pp. 22-24, for the derivation).

    COVRATIO_i = 1 / { [ (n - k - 2 + r*_i^2) / (n - k - 1) ]^(k+1) (1 - h_i) }
• COVRATIO (Slide 47)

• Look for values that differ substantially from 1.
• A small COVRATIO (below 1) means that the generalized variance of the model would be SMALLER without the ith observation (i is reducing the precision of the estimates).
• A big COVRATIO (above 1) means the generalized variance would be LARGER without the ith case; but if it is a high-leverage point, it may be making us overly confident in the precision of our estimated coefficients.
• Belsley et al. suggest that a COVRATIO should be examined when:

    | COVRATIO_i - 1 | > 3(k+1)/n
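Belsley et al.'s determinant definition of COVRATIO and the closed form above agree, as a Python sketch can verify (synthetic data; p = k + 1 parameters):

```python
import numpy as np

# COVRATIO: ratio of generalized variances with and without case i.
rng = np.random.default_rng(3)
n, p = 21, 2
x = rng.normal(size=n)
y = 0.5 + 1.2 * x + rng.normal(scale=0.4, size=n)
X = np.column_stack([np.ones(n), x])

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
s2 = e @ e / (n - p)
s2_i = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)  # deleted variance s(i)^2
r_star = e / np.sqrt(s2_i * (1 - h))                  # studentized residuals

# closed form: 1 / { [ (n-p-1+r*^2)/(n-p) ]^p (1-h_i) }
covratio = 1.0 / (((n - p - 1 + r_star**2) / (n - p)) ** p * (1 - h))

# definition for one case: det[ s(i)^2 (X(i)'X(i))^-1 ] / det[ s^2 (X'X)^-1 ]
i = 7
keep = np.arange(n) != i
Xi = X[keep]
num = np.linalg.det(s2_i[i] * np.linalg.inv(Xi.T @ Xi))
den = np.linalg.det(s2 * np.linalg.inv(X.T @ X))
```

Values below 1 flag cases whose deletion would shrink the joint confidence region; values above 1 flag cases that are tightening it, possibly spuriously.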
• COVRATIO Example (Slide 48)

. predict covratio, covratio

. list t covratio rev rexp if abs(covratio-1)>(3*3)/21

     |  t   covratio    rev   rexp |
     |-----------------------------|
  4. | 79   1.511605   5433   5571 |
     +-----------------------------+

• We see that 1979 is large and therefore has perhaps exaggerated our certainty.
• Plotting COVRATIO against hat reveals that 1979 has leverage, but plotted against DFFITS, we see it is not greater than one. 1979 does not affect the magnitude of our coefficient estimates, but it may affect our hypothesis testing and conclusions.

[Two scatterplots labeled by year: leverage vs. COVRATIO, and DFFITS vs. COVRATIO]
• A Summary of Tests / Statistics for Extreme Values (note sample-size dependence) (Slide 49)

1. Studentized Residual
   Formula: r*_i = e_i / ( s(i) sqrt(1 - h_i) )
   Use: outliers' effect on the intercept. Critical values run higher than the usual t-test; recommended for exploratory diagnosis.
   Rules of thumb: |r*_i| > 2 pay attention; |r*_i| > 2.5 cause for worry; |r*_i| > 3 cause for greatest concern.

2. Hat Statistic (h)
   Formula (bivariate): h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2; in general, h_i = x_i (X'X)^-1 x_i'
   Use: leverage (depends on X values only). Bounded by 1/n and 1 (assumes no replicates; check this in survey data). Higher value = higher leverage.
   Rules of thumb: mean hat is hbar = (k+1)/n; investigate h_i > 2*hbar. Or: max(h_i) <= .2 little to worry about; .2 < max(h_i) <= .5 risky; max(h_i) > .5 too much leverage.

3. DFBETA
   Formula: DFBETAS_ik = (b_k - b_k(i)) / ( s_e(i) / sqrt(RSS_k) )
   Use: influence of a case on a particular coefficient; calculate for each regressor (depends on both X AND Y values).
   Rule of thumb: under 2/sqrt(n) the point has no influence; over it, the point is influential. The value of DFBETAS is the number of s.e.s by which case i increases (DFBETAS_ik > 0) or decreases (DFBETAS_ik < 0) the coefficient for regressor k.

4. Cook's Distance
   Formula: D_i = ( r*_i^2 / k ) ( h_i / (1 - h_i) )
   Use: influence of a case on the model; a measure of the aggregate impact of the ith case on the group of regression coefficients as well as the group of fitted values (sometimes called the forecasting effect).
   Rule of thumb: a point is influential if D_i exceeds the median of the F distribution, F(k+1, n-k-1)(.5).

5. DFFITS
   Formula: DFFITS_i = r*_i sqrt( h_i / (1 - h_i) )
   Use: influence of a case on the model; the number of s.e.s by which the fitted value yhat_i changes if the ith observation is deleted.
   Rules of thumb: small/medium datasets, |DFFITS_i| > 1; large datasets, |DFFITS_i| > 2 sqrt((k+1)/n).

6. COVRATIO
   Formula: COVRATIO_i = 1 / { [ (n - k - 2 + r*_i^2) / (n - k - 1) ]^(k+1) (1 - h_i) }
   Use: influence of a case on the model's standard errors; measures how the precision of the parameter estimates (generalized variance) changes with removal of the ith observation.
   Rule of thumb: inspect if | COVRATIO_i - 1 | > 3(k+1)/n.

Note: n is the sample size; k is the number of regressors; the subscript (i) (i.e., with parentheses) indicates an estimate from the sample omitting observation i. In each case you should use the absolute value of the calculated statistic.
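All of the table's closed forms can be computed in one pass, with no case-wise refitting. A hypothetical helper in Python rather than Stata (note: Cook's D below uses the internally studentized residual and p = k + 1, one common convention; the table follows the slides' r*^2/k notation):

```python
import numpy as np

def extreme_value_stats(X, y):
    """Studentized residuals, hat values, DFFITS, Cook's D, COVRATIO.

    X must already include the intercept column, so p = k + 1."""
    n, p = X.shape
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)  # hat values
    s2 = e @ e / (n - p)
    s2_del = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)  # deleted variance
    r_star = e / np.sqrt(s2_del * (1 - h))                  # studentized residual
    dffits = r_star * np.sqrt(h / (1 - h))
    r = e / np.sqrt(s2 * (1 - h))                           # internal version
    cooks_d = r**2 / p * h / (1 - h)
    covratio = 1.0 / (((n - p - 1 + r_star**2) / (n - p)) ** p * (1 - h))
    return r_star, h, dffits, cooks_d, covratio

# usage on synthetic data
rng = np.random.default_rng(4)
n = 50
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.3 * x + rng.normal(size=n)
r_star, h, dffits, cooks_d, covratio = extreme_value_stats(X, y)
```

The hat values returned here obey the table's bounds: each lies between 1/n and 1, and they sum to k + 1.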
• III. Plots to Identify Extreme Values (Slide 50)

• EXAMPLE: Model from Mukherjee et al. of crude birth rate as a function of:
  – GNP per capita (logged, per the Feb 18 notes and general practice for such variables)
  – IM: infant mortality
  – URBAN: percent of population urban
  – HDI: human development index (from the WB Human Development Report 1993)

    BIRTHrt = b0 + b1 lnGNPC + b2 HDI + b3 IM + b4 URBAN + e

. regress birthr lngnp hdi infmor urbanpop

      Source |       SS       df       MS              Number of obs =     110
-------------+------------------------------           F(  4,   105) =  129.19
       Model |  16552.2585     4  4138.06462           Prob > F      =  0.0000
    Residual |  3363.19755   105  32.0304528           R-squared     =  0.8311
-------------+------------------------------           Adj R-squared =  0.8247
       Total |   19915.456   109  182.710606           Root MSE      =  5.6595

------------------------------------------------------------------------------
      birthr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       lngnp |  -.2138487   .7960166    -0.27   0.789    -1.792203    1.364505
         hdi |  -24.50566   7.495152    -3.27   0.001    -39.36716   -9.644157
      infmor |    .111157   .0396176     2.81   0.006     .0326026    .1897115
    urbanpop |   .0111358   .0396627     0.28   0.779    -.0675081    .0897797
       _cons |   39.56958   6.599771     6.00   0.000     26.48346    52.65571
------------------------------------------------------------------------------
• III.A. Leverage vs. Normalized Squared Residual Plots (Slide 51)

. lvr2plot, mcolor(green) msize(vsmall) mlabel(cid) mlabcolor(black)

• This plot squares the NORMALIZED residuals (the standard deviation of each residual from the mean residual) to make them more conspicuous in the plot (these are not the same as the externally studentized residuals).
• Remember that we are worried about observations with HIGH LEVERAGE but LOW RESIDUALS, which indicates potential influence.
• What we would like to see: a ball of points evenly spread around the intersection of the two means, with no points discernibly far out in any direction, and no leverage point above 0.2 with a low residual (to the left of the mean normalized squared residual line).
• The vertical line represents the average squared normalized residual and the horizontal line represents the average hat (leverage) value.
• Points with high leverage and low residuals will lie to the left of the mean squared residual (the x-axis variable) and above the mean of hat; we should worry about them if hat is above .2, and really worry if it is above .5.
• Leverage vs. Normalized Squared Residual Plots (Slide 52)

[Annotated lvr2plot. One region is marked "Outlier but low leverage"; another "High leverage, high residual (might be reducing our standard errors, but not above the risky .2 level; may want to look at COVRATIO)"; the high-leverage, low-residual region is marked "Examine further if above 0.2 in this region"]

• Based on this plot, the potential for points with high influence on our coefficients is low. There are no points that meet the high-leverage, low-residual criteria, individually or as a group.
• III.B. Partial Regression Leverage Plots (also known as adjusted partial residual plots, adjusted variable plots, individual coefficient plots, and added-variable plots) (Slide 53)

. avplots, rlopts(lcolor(red)) mlabel(cid) mlabsize(tiny) msize(small)

• These plots graph, on the y-axis, the residuals from regressing y on all the X variables EXCEPT X_k against, on the x-axis, the ordinary residuals from regressing the EXCLUDED X_k on the remaining independent variables.
• They help to uncover observations exerting a disproportionate influence on the regression model by showing how each coefficient has been influenced by particular observations.
• The regression slopes we see in the plots are the same as the original multiple regression coefficients from the regression of y on the full X matrix.
• What we would like to see: a scatter of points even around the line in each plot. The "noise" or size of the cloud and spacing around the line need not concern us, but points very far from the rest should be examined.
• Cause for concern: recall the bivariate examples from the first part of the notes; you are looking for values extreme in X (horizontal axis) with unusual, out-of-trend y-values. Pay most attention to the theoretical variable(s) of interest and whether your conclusions and/or statistical significance would change without the observation.
• Utility of the graph: DFBETA will give a much more precise assessment of the change in magnitude of the coefficient in the absence of an influential point, but the graph can identify clusters of points that might be jointly influential.
• Cautions: pay attention to the SCALE of the axes reported in your computer output; a point may look like an outlier but in reality be part of a cloud of points on which we are "zoomed in" rather closely. If you have doubts about the reliability of "eyeballing" the plot, you can re-run the regression leaving out the influential point and compare the change in slope, but be sure to use commands that will retain the original scale of the output so you can compare the changes (see Slides 54-55). Some books recommend this plot for deciding whether to include or discard variables; it is BETTER TO BASE THIS DECISION ON THEORY and the techniques discussed previously.
• Partial Regression Leverage Plots (Slide 54)

[Added-variable plots for the four regressors, points labeled by country code:
e(lngnp | X): coef = -.21384874, se = .79601659, t = -.27
e(hdi | X): coef = -24.505659, se = 7.4951522, t = -3.27
e(infmor | X): coef = .11115703, se = .03961763, t = 2.81
e(urbanpop | X): coef = .01113578, se = .03966274, t = .28]
• Partial Regression Leverage Plots, Continued (Slide 55)

[Added-variable plots from re-estimated models, shown against the originals (Slide 54), points labeled by country code.
With OMN omitted from the label clouds: e(lngnp | X) coef = -.91084738, se = .7842531, t = -1.16; e(hdi | X) coef = -22.281675, se = 7.1637796, t = -3.11; e(infmor | X) coef = .12391559, se = .0378934, t = 3.27; e(urbanpop | X) coef = .05291099, se = .03965302, t = 1.33.
With SEN omitted from the label clouds: e(lngnp | X) coef = -.24870177, se = .80570718, t = -.31; e(hdi | X) coef = -23.625004, se = 7.9461429, t = -2.97; e(infmor | X) coef = .11481757, se = .04116964, t = 2.79; e(urbanpop | X) coef = .00935766, se = .04016079, t = .23]

• Note that Senegal looked like a possible outlier, but it was of the good sort, and it wasn't particularly extreme relative to the scale of values shown. The coefficient changes little and the SE increases slightly without it (indicating it was contributing to the fit somewhat).
• Partial Regression Leverage Plots, Continued (Slide 56)

[Added-variable plots from the re-estimated model with OMN absent from the label clouds:
e(lngnp | X): coef = -.91084738, se = .7842531, t = -1.16
e(hdi | X): coef = -22.281675, se = 7.1637796, t = -3.11
e(infmor | X): coef = .12391559, se = .0378934, t = 3.27
e(urbanpop | X): coef = .05291099, se = .03965302, t = 1.33]
• III.C. Star Plots for Outliers, Leverage, and Model (Generalized) Influence (Slide 57)

. display invFtail(5,105,.5)
.87591656

(Use the above command to get the cut-off for Cook's D, then use it with the other rules of thumb to choose observations to display.)

. graph7 estu h cooksd if abs(estu) > 2 & h > .2 & cooksd > .87591656, star
. graph7 estu h cooksd, star label(cid) select(88, 108)

NOTES: graph7 is an old but working Stata 7 command; search "graph7" for its help file. The variable, and thus the direction associated with each line, depends on the order listed in the command.

• The scaling of a star chart is a function of all the stars. Selecting just a few to be displayed still maintains the scaling based on all the observations and variables.
• In our example model, no observations meet all three criteria for influence, so instead I will tell Stata to select some observations that include Senegal and Oman to show what the plot looks like (do this by selecting observations 88-108).
• What we want to see: a dot, OR a line in the outlier direction and/or a line in the leverage direction, with no or only a tiny line in the influence direction.
• Look for longer lines in the influence direction (pointing lower left) and the leverage direction (lower right).
• III.C. Star Plots for DFBETAS (Individual Coefficient Influence) (Slide 58)

. display 5/sqrt(110)
.47673129

(Use the above command to get the cut-off for DFBETA.)

. graph7 _dfbeta_1 _dfbeta_2 _dfbeta_3 _dfbeta_4 if abs(_dfbeta_1) > .4767 | abs(_dfbeta_2) > .4767 | abs(_dfbeta_3) > .4767 | abs(_dfbeta_4) > .4767, star label(cid)

NOTE: we have to create new variables in the next command to ensure graphing of absolute values, so we do not know from the star plot whether a point increases or decreases the coefficient.

. gen dflngnp = abs(_dfbeta_1)
. gen dfhdi = abs(_dfbeta_2)
. gen dfinform = abs(_dfbeta_3)
. gen dfurban = abs(_dfbeta_4)
. graph7 dflngnp dfhdi dfinform dfurban, star label(cid) select(88, 108)

• The scaling of a star chart is a function of all the stars. Selecting just a few to be displayed still maintains the scaling based on all the observations and variables.
• In our example model, only OMAN meets ANY of the criteria for influence, so let's select some observations to show what the plot looks like (a good reminder to use the statistics and rules of thumb in addition to eyeballing). Only OMAN is influential on all the coefficients at a level above the cut-off point for DFBETAS. OMAN is an oddity: lots of oil, a relatively small Omani population, high birth rates, and a great deal of social development spending, raising HDI despite a largely rural population. How would you model this without deleting Oman?
• What we want to see: a dot, with tiny lines in ALL directions.
• Look for longer lines in ANY direction.
• A Summary of COMMON DIAGNOSTIC PLOTS to Identify Potential Extreme Values

Leverage (h, y-axis) vs. Squared Normalized Residual (x-axis) Plot: lvr2plot
– Preferred appearance: scatter evenly spread around the intersection of the two means; no points to the left of the mean normalized squared residual line (upper LEFT quadrant).
– Use: potential influence on (1) ALL coefficients and (2) standard errors.
– Interpretation: the vertical line represents the average squared normalized residual and the horizontal line the average hat value. 1. IDENTIFY POINTS in the RED AREA: high leverage AND low residual, when leverage is greater than 0.2. 2. POINTS in the upper RIGHT quadrant have high leverage (>0.2) AND high residual; not influential on b, but they may diminish SEs and overstate certainty.

Partial Regression Leverage Plots (also called Added-Variable Plots): avplots
– Preferred appearance: scatter (loose or tight) of points spread evenly around the line in each plot.
– Use: potential influence on EACH coefficient.
– Interpretation: residuals from regressing y on all the X's EXCEPT one (y-axis) vs. residuals from regressing the EXCLUDED xi on the remaining X's (x-axis). Look for points extreme in X with unusual residual values. CAUTIONS: (a) verify points identified through "eyeballing" with DFBETAS; (b) pay attention to the scale of the plots, since stretched or compacted displays mislead.

Star Plots
(a) Outliers, Leverage, & Model Influence (Cook's D): gr7 estu h cooksd, star
(b) Coefficient Influence (DFBETAs): gr7 dfx1 dfx2 dfxn, star
– Preferred appearance: (a) a dot, OR a line in the outlier direction and/or leverage direction with no (or only a tiny) line in the influence direction; (b) a dot (lines in one or more directions = possible influence on one or more coefficients).
– Use: (a) multivariate outliers, leverage, and/or influence points; (b) potential influence on EACH coefficient.
– Interpretation: (a) look for longer lines in the outlier direction (pointing lower left) or the leverage direction (lower right); (b) look for a longer line in any direction for DFBETA, one line per coefficient.
– NOTES: 1. This is a working old Stata 7 command; search "graph7" for the help file. 2. The variable associated with each line depends on the order listed in the command.
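The leverage axis of lvr2plot is the hat value h_i, the i-th diagonal element of H = X(X'X)^-1 X'. As a language-neutral sketch of that quantity (Python rather than the slides' Stata; the function name is mine):

```python
import numpy as np

def hat_values(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X'.
    h_i measures how far observation i sits from the bulk of the X data."""
    return np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
```

Because trace(H) = k (the number of columns in X, including the constant), the average hat value is always k/n, which is what the plot's horizontal reference line marks; a common "high leverage" flag is h_i > 2k/n.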
• Cautions about Extreme Value Procedures
– One weakness of DFFITS and similar statistics is that they will not always detect cases where there are two similar outliers: either point would count as influential by itself, but included together they mask one another.
– A cluster of outliers may indicate that the model was wrongly applied to a set of points. Partial regression plots and other methods may be better for finding such clusters than individual diagnostic statistics such as DFBETA. Both types of post-estimation should be conducted.
– A single outlier may indicate a typing error or an ignored missing-data code (such as 999), or suggest that the model does not account for important variation in the data. Only delete or change an observation if it is an obvious error, like a person being 10 feet tall or a negative geographical distance.
– These procedures should not be abused to remove points to effect a desired change in a coefficient or its standard error! "An observation should only be removed if it is shown to be uncorrectably in error. Often no action is warranted, and when it is, the action should be more subtle than deletion… the benefits obtained from information on influential points far outweigh any potential danger" (Belsley et al., 16).
– Think about non-linear or other specifications that might model the outliers directly. Outliers may present a research opportunity: do the outliers have anything in common?
– Often the most that can be done is to report the results both with and without the outlier (perhaps with one set of results in an appendix). The exception is the case of extreme x-values: it is possible to reduce the range over which your predictions will be valid (e.g., only OECD countries, only the EU, only low-income countries). It is fine to say that your height-weight relationship is only usable for people between 5'5" and 6'5", for example, or that your model applies only to advanced industrialized democracies.
• RESOURCES
• UCLA Stata Regression Diagnostic Steps (good examples of data with problems)
– http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm
– http://www.ats.ucla.edu/stat/stata/examples/alsm/alsm9.htm
– http://www.ats.ucla.edu/stat/stata/examples/ara/arastata11.htm
• Belsley, D. A., E. Kuh, and R. E. Welsch (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.
• Cook, R. D. and S. Weisberg (1982). Residuals and Influence in Regression. New York: Chapman and Hall.
• Fox, J. (1991). Regression Diagnostics. Newbury Park: Sage Publications.
• Hamilton, L. C. (1992). Regression with Graphics: A Second Course in Applied Statistics. Pacific Grove, CA: Brooks/Cole Publishing Company.
– Also has an excellent chapter on pre-estimation graphical inspection of data
– Includes a section on post-estimation diagnostics for logit
• For regression diagnostics with survey data (weighting for surveys requires adjusted methods), see Li, J. and R. Valliant, "Influence Analysis in Linear Regression with Sampling Weights," 3330-3337, and Valliant, R., J. Li, et al. (2009). "Regression Diagnostics for Survey Data." Stata Conference, Washington, DC: Stata Users Group.
• Temple, J. (2000). "Growth Regressions and What the Textbooks Don't Tell You." Bulletin of Economic Research 52(3): 181-205. The paper discusses three econometric problems that are rarely given adequate discussion in textbooks: model uncertainty, parameter heterogeneity, and outliers.
• PS 699 Section, March 25, 2010
Megan Reif
Graduate Student Instructor, Political Science
Professor Rob Franzese
University of Michigan

Regression Diagnostics
1. Diagnostics for Assessing (assessable) CLRM Assumptions
2. Diagnostics for Assessing Data Problems (e.g., Multicollinearity)
• I. Normal Distribution of Disturbances, ε, Can Only Be Evaluated Using the Estimate e

Step One: Histogram and Box-Plot of the Ordinary Residuals (the former is useful in detecting a multi-modal distribution of residuals, which suggests an omitted qualitative variable that divides the data into groups)
Step Two: Graphical methods (tests exist for error normality, but visual methods are generally preferred)
Step Three: Q-Q Plot of Residuals vs. Normal Distribution, and Normal Probability Plot

Background: What is a Q-Q (Quantile-Quantile) Plot?
– A Q-Q plot is a scatterplot that graphs quantiles of one variable against quantiles of a second variable.
– The quantiles are the data values in ascending order: the first coordinate pairs the lowest x1 value with the lowest x2 value, the second pairs the next-lowest values of x1 and x2, and so on (we graph a set of points with coordinates (X1i, X2i), where X1i is the ith-from-lowest value of X1 and X2i is the ith-from-lowest value of X2).
– What we can learn from a Q-Q plot of two variables:
1. If the distributions of the two variables are similar in center, spread, and shape, the points will lie on the 45-degree diagonal line through the origin.
2. If the distributions have the same SPREAD and SHAPE but a different center (mean, median, …), the points will follow a straight line parallel to the 45-degree diagonal but not through the origin.
3. If the distributions have different spreads/variances and centers but similar shape, the points will follow a straight line NOT parallel to the diagonal.
4. If the points do not follow a straight line, the distributions have entirely different shapes.

Two uses for Q-Q plots:
1. Compare two empirical distributions (useful to assess whether subsets of the data, such as different time periods or groups, share the same distribution or come from different populations).
2. Compare an empirical distribution against a theoretical distribution (such as the Normal).
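The pairing described above (i-th smallest of one sample against i-th smallest of the other) can be sketched in a few lines. This Python version is illustrative, not from the slides; it also handles unequal sample sizes by interpolating quantiles at common plotting positions:

```python
import numpy as np

def qq_points(x1, x2):
    """Coordinates of an empirical Q-Q plot of x1 against x2.
    With equal sizes this pairs the sorted values; with unequal sizes it
    interpolates quantiles at the plotting positions of the smaller sample."""
    n = min(len(x1), len(x2))
    p = (np.arange(1, n + 1) - 0.5) / n   # plotting positions (i - 0.5)/n
    return np.quantile(x1, p), np.quantile(x2, p)
```

The test below demonstrates interpretation 2 above: shifting a distribution without changing its spread or shape moves every Q-Q point by a constant, i.e. onto a line parallel to the 45-degree diagonal.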
• I.A. Q-Q Plot of Residuals vs. Normal Distribution

A. Residual Quantile-Normal Plot (also known as a probit plot or normal-quantile comparison plot of residuals)
1. Quantile-Normal Plot (qnorm): emphasizes the tails of the distribution
2. Normal Probability Plot (pnorm): puts the focus on the center of the distribution
• What we expect to see if the empirical distribution is identical to a normal distribution: all points lie on a diagonal line.
• Quantile-Normal Plot Interpretation Basics
Source: Hamilton, Regression with Graphics, p. 16
• Quantile-Quantile Plot Diagnostic Patterns (description of point pattern → possible interpretation)
– Points on the 45° diagonal line from the origin → distributions similar in center, spread, and shape.
– Points on a straight line parallel to the 45° diagonal → same SPREAD and SHAPE but different center (mean, median, …); we never see e with a non-zero mean!
– Points follow a straight line NOT parallel to the diagonal → different spreads/variances and centers, but similar shape.
– Points do not follow a straight line → distributions have different shapes.
– Vertically steep (closer to parallel to the y-axis) at top and bottom → heavy tails; outliers at low and high data values.
– Horizontal (closer to parallel to the x-axis) at top and bottom → light tails, fewer outliers.
– Two or more less-steep areas (nearly parallel to the x-axis), indicating higher-than-normal density, separated by a gap or steep climb (an area of lower density) → distribution is bi- or multi-modal (subgroups, different populations).
– All but a few points fall on a line, with some points vertically separated from the rest of the data → outliers in the data.
– Left end of the pattern below the line; right end above the line → long tails at both ends of the data distribution.
– Left end of the pattern above the line; right end below the line → short tails at both ends of the distribution.
– Curved pattern with slope increasing from left to right → data distribution is skewed to the right.
– Curved pattern with slope decreasing from left to right → data distribution is skewed to the left.
– Granularity: staircase pattern (plateaus and gaps) → data values have been rounded or are discrete.
• CONTINUING EXAMPLE (from March 18 Notes): Model from Mukherjee et al. of crude birth rate as a function of:
– GNP per capita (logged, per Feb 18 Notes and general practice for such variables)
– IM: infant mortality
– URBAN: percent of population urban
– HDI: human development index (from WB Human Development Report 1993)

BIRTHrt = β0 + β1·lnGNPC + β2·HDI + β3·IM + β4·URBAN + ε

regress birthr lngnp hdi infmor urbanpop

      Source |       SS       df       MS              Number of obs =     110
-------------+------------------------------           F(  4,   105) =  129.19
       Model |  16552.2585     4  4138.06462           Prob > F      =  0.0000
    Residual |  3363.19755   105  32.0304528           R-squared     =  0.8311
-------------+------------------------------           Adj R-squared =  0.8247
       Total |   19915.456   109  182.710606           Root MSE      =  5.6595

------------------------------------------------------------------------------
      birthr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       lngnp |  -.2138487   .7960166    -0.27   0.789    -1.792203    1.364505
         hdi |  -24.50566   7.495152    -3.27   0.001    -39.36716   -9.644157
      infmor |    .111157   .0396176     2.81   0.006     .0326026    .1897115
    urbanpop |   .0111358   .0396627     0.28   0.779    -.0675081    .0897797
       _cons |   39.56958   6.599771     6.00   0.000     26.48346    52.65571
------------------------------------------------------------------------------
• Quantile-Normal Plot

qnorm estu, grid mcolor(green) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) yline(-.1535893, lpattern(longdash) lcolor(cranberry)) caption(Red Dashed Line Shows Median of Studentized Residuals, size(vsmall)) legend(on)
• Normal Probability Plot

pnorm estu, grid mcolor(green) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on)

What does this granularity suggest?
• What Non-Normal Residuals Do to Your OLS Estimates, and What to Do
• If the errors are not normally distributed:
– Efficiency decreases, and inference based on the t- and F-distributions is not justified, especially as sample size decreases.
– Heavy-tailed error distributions (more outliers) will result in great sample-to-sample variation (less generalizability).
– Normality is not required in order to obtain unbiased estimates of the regression coefficients.
• If you have not already transformed skewed variables, doing so may help, as a non-normal distribution of e may be caused by skewed X and/or Y distributions.
• Model re-specification may be required if there is evidence of granularity or multi-modality.
• Robust methods provide alternatives to OLS for dealing with non-normal errors.
• (Ordinary) Residual vs. Fitted Plot
CLRM:
• Heteroskedasticity (leads to inefficiency and biased standard error estimates)
• Residual non-normality (compounds inefficiency and undermines the rationale for t- and F-tests, casting doubt on the p-values reported in output)
SPECIFICATION:
• Non-linearity in X-Y relationship(s)
• rvfplot, mcolor(green) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on)

Heteroskedasticity? Is the variance for smaller fitted values larger than for medium fitted values?
• Absolute Value of Residual vs. Fitted (easier to see heteroskedasticity)

predict yhat
predict resid, resid
gen absresid=abs(resid)
graph twoway scatter absresid yhat, mcolor(green) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on)
• Note: Fox recommends using Studentized Residuals vs. Fitted Values (in our example there is little difference)

graph twoway scatter estu yhat, mcolor(green) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on)
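The estu variable used throughout these plots holds the studentized residuals (presumably generated earlier with Stata's predict ..., rstudent). For reference, the externally studentized residual e_i / (s_(i)·sqrt(1 − h_i)) can be computed without refitting the model n times via the standard leave-one-out shortcut for s_(i). A Python sketch (illustrative; names are mine, not the slides' code):

```python
import numpy as np

def studentized_residuals(X, y):
    """Externally studentized residuals: e_i / (s_(i) * sqrt(1 - h_i)),
    where s_(i) is the regression standard error with observation i deleted,
    obtained via the leave-one-out shortcut rather than n refits."""
    n, k = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)                       # leverages
    e = y - H @ y                        # ordinary residuals
    s2 = e @ e / (n - k)                 # full-sample error variance
    # shortcut: s_(i)^2 = [(n-k) s^2 - e_i^2/(1-h_i)] / (n-k-1)
    s2_i = ((n - k) * s2 - e**2 / (1 - h)) / (n - k - 1)
    return e / np.sqrt(s2_i * (1 - h))
```

The test below checks the shortcut against an actual leave-one-out refit for one observation; the two routes are algebraically identical.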
• Residual v. Predictor Plot
• Heteroskedasticity: e varies with the values of one or more Xs.

rvpplot lngnp, mcolor(green) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on) name(lngnp)
rvpplot hdi, mcolor(red) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on) name(hdi, replace)
rvpplot infmor, mcolor(blue) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on) name(infmor, replace)
rvpplot urbanpop, mcolor(orange) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on) name(urbanpop, replace)
graph combine lngnp hdi infmor urbanpop
    • Residual v. Predictor Plot
• Component-Plus-Residual Plot
• The component-plus-residual plot is also known as a partial residual plot or adjusted partial residual plot (distinct from the added-variable / partial-regression leverage plot).
• This plot shows the expectation of the dependent variable given a single independent variable, holding all else constant, PLUS the residual for that observation from the FULL model.
• It looks at one of the explained parts of Y, plus the unexplained part (e), plotted against an independent variable.
• CLRM: Heteroskedasticity
• Functional form / non-linearity

cprplot lngnp, mcolor(green) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on) name(lngnpcp, replace)
cprplot hdi, mcolor(red) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on) name(hdicp, replace)
cprplot infmor, mcolor(blue) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on) name(infcp, replace)
cprplot urbanpop, mcolor(orange) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on) name(urbcp, replace)
graph combine lngnpcp hdicp infcp urbcp
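The y-axis value of a component-plus-residual plot is simply the fitted component b_j·x_ij plus the full-model residual e_i. A minimal Python sketch of that bookkeeping (illustrative names, not the slides' Stata):

```python
import numpy as np

def cpr(X, y, j):
    """Component-plus-residual values for regressor column j:
    b_j * x_ij + e_i, to be plotted against x_ij to inspect functional form."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)  # full-model OLS fit
    e = y - X @ b                              # full-model residuals
    return b[j] * X[:, j] + e
```

Because the OLS residuals are orthogonal to every column of X, the least-squares slope of these values on x_j recovers b_j exactly; systematic curvature around that line is what signals a functional-form problem.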
    • CPR Plot
• I. Autocorrelation
– Durbin-Watson Test Statistic
– Correlograms
– Semi-Variograms
– Time Plot
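Of these, the Durbin-Watson statistic is the simplest to compute by hand: d = Σ(e_t − e_{t−1})² / Σe_t², which is near 2 when there is no first-order autocorrelation, falls toward 0 under positive autocorrelation, and rises toward 4 under negative autocorrelation. An illustrative Python sketch (not the slides' Stata workflow):

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson statistic on a residual series: the sum of squared
    successive differences divided by the residual sum of squares."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
```
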
• I. Multicollinearity: Variance Inflation Factor
High collinearity inflates standard errors and reduces significance on important variables.
VIFj = 1/(1 - R²j), where R²j is from the regression of variable j on the other independent variables. If variable j is completely uncorrelated with the other variables, then R²j will be zero and VIFj will be one. If the fit is nearly perfect, R²j will approach one and VIFj will blow up. Larger VIF = more collinearity.
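The auxiliary-regression definition above translates directly into code. In Stata this is available as estat vif after regress; the Python sketch below (with names of my own choosing) is just to make the formula concrete:

```python
import numpy as np

def vif(X):
    """Variance inflation factors for each column of X (predictors only,
    no constant column): VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
    regressing column j on the remaining columns plus a constant."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        e = X[:, j] - others @ b
        r2 = 1 - e @ e / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out[j] = 1 / (1 - r2)
    return out
```
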
• Summary of COMMON DIAGNOSTIC PLOTS to Assess CLRM Assumptions & Data Problems (post-estimation)

Quantile-Normal Plot (ordinary residuals vs. normal): qnorm estu, and Normal Probability Plot (studentized residuals vs. standard normal): pnorm estu
– Preferred appearance: if the empirical distribution of the residuals is identical to a normal distribution, expect all points to lie on the 45-degree diagonal line through the origin.
– Use: normally distributed stochastic component.
– Interpretation: Q-Normal plot: inspect the tails; P-Probability plot: inspect the middle. 1. Look for multi-modality and granularity (possible misspecification). 2. Right or left skewness (bowed up, bowed down). 3. Heavy tails. 4. Vertical differences in values (outliers).

Ordinary or Studentized Residual vs. Fitted Values: rvfplot, and |Residual| vs. Fitted: graph twoway scatter absresid yhat
– Preferred appearance: no discernible pattern; an even band with constant variance above and below zero, at both high and low values of y.
– Use: CLRM: heteroskedasticity (e varies with y), residual normality; SPECIFICATION: non-linearity in X-Y relationship(s).
– Interpretation: the sum total of what the regression has explained. 1. Look for systematic variation in the distance of residuals from their mean of zero. 2. The Q-N plot is better for assessing normality. 3. This plot helps assess whether error variance increases or decreases at smaller or larger values of y. 4. Look for clusters of residuals above or below zero.

(Ordinary) Residual vs. Predictor Plot (each X): rvpplot x1varname
– Preferred appearance: no discernible pattern; an even band with constant variance above and below zero, at both high and low values of each X.
– Use: CLRM: heteroskedasticity (e varies with values of one or more Xs); SPECIFICATION: non-linearity.
– Interpretation: 1. Look for systematic variation in the distance of residuals from their mean. 2. Whether error variance increases or decreases at smaller or larger values of each X. 3. Clusters of residuals above or below zero.

Component-Plus-Residual Plot: cprplot x1varname