This document discusses outliers and influential points in regression analysis. It defines outliers as observations that lie outside the overall pattern of data, while influential points markedly change the regression if removed. The document provides methods to detect outliers and influential points, such as Cook's distance and DFBETA. It notes that outliers may be due to errors and should be corrected if possible, while influential points require more analysis to understand their impact before being removed from the regression model.
This was a presentation I gave to my firm's internal CPE in December 2012. It related to correlation and simple regression models and how we can utilize these statistics in both income and market approaches.
Overviews non-parametric and parametric approaches to (bivariate) linear correlation. See also: http://en.wikiversity.org/wiki/Survey_research_and_design_in_psychology/Lectures/Correlation
Presentation at "Impact Evaluation for Financial Inclusion" (January 2013)
CGAP and the UK Department for International Development (DFID) convened over 70 funders, practitioners, and researchers for a workshop on impact evaluation for financial inclusion in January 2013. Co-hosted by DFID in London, the workshop was an opportunity for participants to engage with leading researchers on the latest research methods of impact evaluation and to discuss other areas on the impact evaluation agenda.
The error (or disturbance) of an observed value is the deviation of the observed value from the (unobservable) true value of a quantity of interest (for example, a population mean), and the residual of an observed value is the difference between the observed value and the estimated value of the quantity of interest (for example, a sample mean).
Suppose there is a series of observations from a univariate distribution and we want to estimate the mean of that distribution (the so-called location model). In this case, the errors are the deviations of the observations from the population mean, while the residuals are the deviations of the observations from the sample mean.
A statistical error (or disturbance) is the amount by which an observation differs from its expected value, the latter being based on the whole population from which the statistical unit was chosen randomly. For example, if the mean height in a population of 21-year-old men is 1.75 meters, and one randomly chosen man is 1.80 meters tall, then the "error" is 0.05 meters; if the randomly chosen man is 1.70 meters tall, then the "error" is −0.05 meters. The expected value, being the mean of the entire population, is typically not observable, and hence the statistical error cannot be observed either.
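The distinction between errors and residuals can be illustrated with a small simulation. This is only a sketch: the population mean of 1.75 m comes from the example above, while the standard deviation of 0.07 m, the sample size, and the random seed are assumptions made purely for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

MU = 1.75     # population mean height in metres (from the example above)
SIGMA = 0.07  # hypothetical population standard deviation (assumption)

heights = [random.gauss(MU, SIGMA) for _ in range(10)]
sample_mean = statistics.fmean(heights)

# Errors are measured from the population mean, which is normally unobservable.
errors = [h - MU for h in heights]
# Residuals are measured from the sample mean, which can always be computed.
residuals = [h - sample_mean for h in heights]

print(f"sum of errors    = {sum(errors):+.4f}")
print(f"sum of residuals = {sum(residuals):+.4f}")
```

The residuals sum to zero (up to floating-point rounding) by construction, whereas the errors generally do not. In a real study only the residuals would be available, because the population mean is not known.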
2. Outliers & Influential Points
Outlier:
An observation that lies outside the overall pattern of
observations.
“Influential observation”:
An observation that markedly changes the regression if
removed. This is often an outlier on the x-axis.
[Figure: scatterplot with two labeled points, "outlier in y direction" and "outlier in x direction"]
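The difference between the two kinds of point can be sketched numerically. The data below are hypothetical (ten points lying exactly on y = 2 + 0.5x), and `fit_line` is an illustrative helper written for this sketch, not a library function.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical base data lying exactly on y = 2 + 0.5x.
x = [float(v) for v in range(1, 11)]
y = [2.0 + 0.5 * v for v in x]

_, slope_base = fit_line(x, y)
# Outlier in the y direction: x equals the mean of x, y is far above the line.
_, slope_y = fit_line(x + [5.5], y + [9.75])
# Outlier in the x direction: x is far from the other x values, y is off the line.
_, slope_x = fit_line(x + [30.0], y + [5.0])

print(f"slope, base data        : {slope_base:.3f}")
print(f"slope, with y-outlier   : {slope_y:.3f}")
print(f"slope, with x-outlier   : {slope_x:.3f}")
```

In this constructed example the y-direction outlier sits at the mean of x, so it shifts the intercept but leaves the slope at 0.5, while the single x-direction outlier pulls the slope down sharply. Real data will rarely separate the two cases this cleanly.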
3. Outliers
Broadly speaking, an outlier is a data point that is distinct or deviant from
the bulk of the data.
Because outliers have a relatively low probability of natural
occurrence, they are most likely due to error.
• Errors in measurement, recording, administration of treatment, etc.
Of course, outliers can also occur in the absence of error. These are
known as true outliers in contrast to the false outliers described above.
The distinction is important because true outliers have something to
teach us.
• Correct model?
• Violation of assumptions?
• Are there observations with an undue influence on the results?
4. Univariate Outliers
[Figure: distribution curve. 95% of all observations will be somewhere in the central region; 2.5% of all observations will be somewhere out in each tail.]
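One common way to turn the 95% / 2.5% picture into a screening rule is to flag observations whose z-score exceeds 1.96 in absolute value. The sketch below assumes approximately normal data; the measurements and the name `CUTOFF` are hypothetical, and the slides do not say that this exact rule was used.

```python
import statistics

# Hypothetical measurements; the last value looks suspiciously large.
values = [4.9, 5.1, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9, 5.0, 9.0]

CUTOFF = 1.96  # two-sided 95% point of the standard normal distribution

mean = statistics.fmean(values)
sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)

# z-score: distance from the mean in units of the standard deviation.
z_scores = [(v - mean) / sd for v in values]
flagged = [v for v, z in zip(values, z_scores) if abs(z) > CUTOFF]

print("flagged as univariate outliers:", flagged)
```

Note that under this rule about 5% of perfectly valid observations from a normal distribution would also be flagged, so a flag is a prompt for inspection rather than proof of error.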
7. Detection of Outliers
When an Outlier is detected:
Can the outlier be traced to a reparable cause (e.g., incorrect entry into
dataset)?
• If yes, then simply fix the data point and repeat your analysis to verify.
What if an outlier cannot be “repaired”? Can nothing be done to
rescue the data?
• In this case, you may opt to remove this subject from further analysis.
The univariate statistics have changed, so do you report the original
mean, standard deviation, etc. or the new ones?
• You report the new statistics as the statistics that reflect your data.
• The original statistics may be reported in the section wherein you
describe the rule for detecting the outliers.
8. Influence Analysis
Broadly speaking, an Influential Observation may be defined as:
“one which, either individually or together with several other
observations, has a demonstrably larger impact on the calculated
values of various estimates than is the case for most of the other
observations.”
As mentioned earlier, an outlier is not necessarily an influential point.
• An outlier, although relatively deviant, may have little impact on
the resulting regression line.
• An influential point may not be appreciably deviant, based on an
analysis of the residuals, yet it may greatly affect the terms in
the regression equation (which is why it is seldom detected by an
analysis of the residuals).
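The point that residuals alone can miss an influential observation can be sketched with leverage and Cook's distance for a simple regression. The data are hypothetical, and `influence_measures` is an illustrative helper written for this sketch; it uses the standard formulas h_i = 1/n + (x_i - x̄)²/Sxx and D_i = e_i² / (p s²) · h_i / (1 - h_i)², with p = 2 fitted coefficients.

```python
def influence_measures(x, y):
    """Return (residuals, leverage, cooks_distance) for the line y = a + b*x."""
    n, p = len(x), 2  # p = number of fitted coefficients (intercept, slope)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)  # residual variance
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    cooks = [e ** 2 / (p * s2) * h / (1 - h) ** 2
             for e, h in zip(residuals, leverage)]
    return residuals, leverage, cooks

# Hypothetical data: ten points on y = 2 + 0.5x plus one point far out in x.
x = [float(v) for v in range(1, 11)] + [30.0]
y = [2.0 + 0.5 * v for v in x[:10]] + [5.0]

residuals, leverage, cooks = influence_measures(x, y)
for xi, e, h, d in zip(x, residuals, leverage, cooks):
    print(f"x={xi:5.1f}  residual={e:+.3f}  leverage={h:.3f}  Cook's D={d:.3f}")
```

In this constructed example the point at x = 30 does not have the largest residual, because it drags the fitted line toward itself, yet it has by far the largest leverage and Cook's distance.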
11. Influence Analysis
When an Influential Observation is detected:
The procedures for dealing with an influential observation are more
complicated than those for dealing with outliers.
First, scrutinize the attributes of the influential individual.
• How does this subject differ from the other subjects?
Second, understand that influential observations may reflect random
error, or they may serve as clues.
• Increase sample size
• Replication!
19. Standardized DFBETA
On that note, what constitutes a large DFBETA?
Unfortunately, there is no simple answer to this question since the
magnitude of DFBETA is dependent on the units of measurement
used in the study.
To get around this problem, then, one may prefer to use standardized
scores.
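The unit dependence of DFBETA, and the way standardization removes it, can be sketched by refitting the regression with each observation left out in turn. The data and helper names are hypothetical. The standardized form used here divides the change in slope by s_(i) / sqrt(Sxx), where s_(i) is the residual standard deviation with observation i removed; the 2 / sqrt(n) cutoff is a commonly cited rule of thumb and is an assumption, not something stated in the slides.

```python
import math

def ols(x, y):
    """Return (slope, residual_sd, sxx) for the least-squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return slope, math.sqrt(sse / (n - 2)), sxx

def slope_dfbeta(x, y):
    """Return (raw DFBETA, standardized DFBETAS) of the slope for each point."""
    b, _, sxx = ols(x, y)
    raw, standardized = [], []
    for i in range(len(x)):
        # Refit without observation i.
        b_i, s_i, _ = ols(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        change = b - b_i
        raw.append(change)
        standardized.append(change / (s_i / math.sqrt(sxx)))
    return raw, standardized

# Hypothetical data: a noisy line plus one point far out in x.
x = [float(v) for v in range(1, 11)] + [30.0]
y = [2.6, 2.9, 3.6, 3.9, 4.6, 4.9, 5.6, 5.9, 6.6, 6.9, 5.0]

raw, std = slope_dfbeta(x, y)
# Same data with y expressed in units 100 times smaller (e.g. m -> cm).
raw_scaled, std_scaled = slope_dfbeta(x, [100 * v for v in y])

cutoff = 2 / math.sqrt(len(x))  # rule-of-thumb threshold (assumption)
for i, (d, ds) in enumerate(zip(raw, std)):
    marker = "  <-- exceeds cutoff" if abs(ds) > cutoff else ""
    print(f"obs {i:2d}  DFBETA={d:+.4f}  DFBETAS={ds:+.3f}{marker}")
```

Changing the units of y multiplies every raw DFBETA by 100 but leaves the standardized values unchanged, which is why a threshold is easier to state for the standardized version.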
23. Remedies
“To delete or not to delete, that is the question.”
-Wilbur Shakesmann
When an influential point is detected, it is important to keep in mind
that it was only “detected” because of your criterion.
• In other words, a different criterion, a more conservative one, may
not have labeled said point as “influential.”
The urge to “correct” your data set by removing the errant point may
be strong, but before you do so, you must carefully weigh the
consequences.
• Is it more misleading to retain the errant point and report the data
as a whole?
• Or is it more misleading to remove the errant point and only
explain the data you understand?