Seven sins of regression analysis
1. Lecture:
Seven sins of regression analysis
Based on the book Naked Statistics by Charles Wheelan
Chaudhary Awais Salman
Doctoral Researcher in Future Energy
Course instructor
School of Business, Society and Engineering
Future Energy – Centre of Excellence
Email: awaissalman@hotmail.com
2. 1. Using regression to analyze a nonlinear relationship
● “Do Not Use When There Is Not a Linear Association between the Variables That You Are
Analyzing.”
● “Remember, a regression coefficient describes the slope of the “line of best fit” for the data; a
line that is not straight will have a different slope in different places.”
● Consider the following hypothetical relationship
between the number of golf lessons that a person
takes during a month and his or her golf score.
● The first few golf lessons appear to bring the score
down rapidly, which is good.
● But once a person reaches the point of spending
$200 to $300 a month on lessons, additional lessons
do not seem to have much effect at all.
The most important point here is that we cannot accurately summarize the
relationship between lessons and scores with a single coefficient.
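As a rough illustration, here is a minimal Python sketch (with invented numbers, assuming numpy is available) of how a single fitted slope hides the fact that the slope is very different at the cheap and expensive ends of the data.

```python
# Hypothetical sketch: golf scores vs. monthly spending on lessons.
# The data are made up to illustrate diminishing returns.
import numpy as np

spending = np.array([0, 50, 100, 150, 200, 250, 300, 350, 400])
score = np.array([100, 92, 87, 84, 83, 82.5, 82, 82, 81.8])  # improvement flattens out

# One straight line through all the data: a single slope for every spending level.
slope_all, intercept_all = np.polyfit(spending, score, 1)

# The same "line of best fit" computed separately on the two halves of the data
# gives very different slopes, which is the warning sign of a nonlinear relationship.
slope_low, _ = np.polyfit(spending[:5], score[:5], 1)
slope_high, _ = np.polyfit(spending[4:], score[4:], 1)

print(f"overall slope: {slope_all:.3f} strokes per dollar")
print(f"slope below $200: {slope_low:.3f}, slope above $200: {slope_high:.3f}")
```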
3. 2. Correlation does not equal causation.
● We cannot prove with statistics alone that a change in one variable is causing a change in the
other.
● A sloppy regression equation can produce a large and statistically significant association between
two variables that have nothing to do with one another.
● The graph shows a near-perfect correlation
between US spending on science and
suicides by hanging; the R-squared equals
0.99.
● But this does not make any sense.
● Regression analysis can only
demonstrate an association between two
variables.
● It is the scientist’s or analyst’s duty to interpret
the results according to their expertise.
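A minimal sketch of how such a spurious correlation arises: two series that merely trend upward over the same years will show a high R-squared even though neither causes the other. All numbers below are invented purely for illustration.

```python
# Two unrelated series that both trend upward over the same period.
import numpy as np

years = np.arange(1999, 2010)
science_spending = 18 + 0.9 * (years - 1999) + np.random.default_rng(0).normal(0, 0.3, len(years))
hanging_suicides = 5400 + 210 * (years - 1999) + np.random.default_rng(1).normal(0, 80, len(years))

r = np.corrcoef(science_spending, hanging_suicides)[0, 1]
print(f"R-squared = {r**2:.2f}")  # very high, yet the association is meaningless
```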
4. 3. Reverse causality.
● Let's take the example from slide 2, where golf scores were predicted from money spent on golf
lessons.
● There, the variable for golf lessons is consistently associated with worse scores.
● The more lessons a person takes, the worse they shoot!
● One explanation could be that they have a really, really bad golf instructor.
● A more plausible explanation is that people tend to take more lessons when they are playing
poorly; bad golf is causing more lessons, not the other way around.
● (There are some simple methodological fixes to a problem of this nature. For example, I might
include golf lessons in one month as an explanatory variable for golf scores in the next month;
a sketch of this lagged-variable fix follows this slide.)
A statistical association between A and B
does not prove that A causes B. In fact, it’s
entirely plausible that B is causing A.
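Here is a minimal sketch of the lagged-variable fix mentioned above: regress this month's golf score on last month's lessons, so the suspected cause precedes the effect. The data and column names are hypothetical, and pandas and statsmodels are assumed to be available.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "month":   [1, 2, 3, 4, 5, 6, 7, 8],
    "lessons": [0, 1, 4, 3, 5, 2, 1, 0],      # lessons taken in that month (invented)
    "score":   [95, 96, 99, 97, 98, 94, 92, 91],
})

# Explanatory variable: lessons taken in the PREVIOUS month.
df["lessons_last_month"] = df["lessons"].shift(1)
df = df.dropna()

X = sm.add_constant(df[["lessons_last_month"]])
model = sm.OLS(df["score"], X).fit()
print(model.params)
```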
5. 4. Omitted variable bias.
● Regression results will be misleading and inaccurate if the regression equation leaves out an
important explanatory variable, particularly if other variables in the equation “pick up” that effect.
● You see a headline, “Golfers More Prone to Heart Disease, Cancer, and Arthritis!”
● But people play more golf when they get older, particularly in retirement.
● Any analysis that leaves out age as an explanatory variable is going to miss the fact that golfers, on
average, will be older than nongolfers.
● Golf isn’t killing people; old age is killing people.
● Older people just happen to enjoy playing golf while it does.
● When age is inserted into the regression analysis as a control variable, we will get a different
outcome.
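A hedged sketch of this bias: leaving age out of the model lets a "plays golf" indicator soak up the effect of age on heart disease, while adding age as a control shrinks the golf coefficient toward zero. All numbers are simulated, and statsmodels/pandas/numpy are assumed available.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 2000
age = rng.uniform(25, 80, n)
plays_golf = (rng.uniform(0, 1, n) < (age - 25) / 80).astype(int)   # older people golf more
heart_disease_risk = 0.002 * age + rng.normal(0, 0.02, n)           # driven by age, not golf

df = pd.DataFrame({"golf": plays_golf, "age": age, "risk": heart_disease_risk})

# Model 1: omit age -> golf appears "harmful".
m1 = sm.OLS(df["risk"], sm.add_constant(df[["golf"]])).fit()
# Model 2: control for age -> the golf coefficient shrinks toward zero.
m2 = sm.OLS(df["risk"], sm.add_constant(df[["golf", "age"]])).fit()

print("golf coefficient without age control:", round(m1.params["golf"], 4))
print("golf coefficient with age control:   ", round(m2.params["golf"], 4))
```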
6. 5. Highly correlated explanatory variables (multicollinearity).
If a regression equation includes two or more explanatory variables that are highly correlated with each
other, the analysis will not necessarily be able to discern the true relationship between each of those
variables and the outcome being explained.
● Assume we are trying to gauge the effect of illegal drug use on SAT scores. Specifically, we
have data on whether the participants in our study have ever used cocaine and also on whether
they have ever used heroin.
● The methodological challenge is that people who have used heroin have likely also used
cocaine.
● If we put both variables in the regression equation, we will have very few individuals who have
used one drug but not the other, which leaves us very little variation in the data with which to
calculate their independent effects.
Similarly, the correlation between a husband’s and a wife’s educational
attainments is so high that we cannot depend on regression analysis to give
us coefficients that meaningfully isolate the effect of either parent’s
education on the SAT scores of their child.
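One common way to diagnose this problem is the variance inflation factor (VIF), which is not discussed in the slides but illustrates the point. The sketch below simulates cocaine and heroin indicators that almost always coincide, mirroring the example above; everything is invented, and statsmodels is assumed available.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
n = 500
cocaine = rng.integers(0, 2, n)
heroin = np.where(rng.uniform(0, 1, n) < 0.98, cocaine, 1 - cocaine)  # ~98% overlap

X = sm.add_constant(pd.DataFrame({"cocaine": cocaine, "heroin": heroin}))
vifs = [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])]
print(dict(zip(["cocaine", "heroin"], vifs)))   # large VIFs flag the multicollinearity
```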
7. 6. Extrapolating beyond the data.
● Regression analysis, like all forms of statistical inference, is designed to offer us insights
into the world around us. We seek patterns that will hold true for the larger population.
● However, our results are valid only for a population that is similar to the sample on which
the analysis has been done.
● In the last chapter, I created a regression equation to predict ice cream sales based on the
outside temperature.
[Figure: scatter plot of ice cream sales (USD) against outside temperature (deg F) with a fitted
linear regression line, y = 12.728x - 442.39, R² = 0.8774.]
● What will sales be at 120 deg F (about 49 deg C) or higher?
● Sales would be zero, because no one will be outside their homes to buy ice cream at such
extreme temperatures.
● Fun fact: Swedish people like ice cream in winter.
● So as a researcher you have the responsibility to recognize the limits of the data range in
your analysis (see the sketch after this slide).
https://www.thelocal.se/20160406/nine-bizarre-swedish-eating-habits-that-will-confuse-foreigners
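A short sketch of why extrapolation is risky, using the fitted line from the figure above (y = 12.728x - 442.39). The line was estimated on temperatures of roughly 70 to 90 deg F, so plugging in 120 deg F asks the model about a world it has never seen.

```python
def predicted_sales(temp_f: float) -> float:
    """Fitted ice cream sales (USD) as a linear function of temperature (deg F)."""
    return 12.728 * temp_f - 442.39

print(predicted_sales(85))    # inside the observed range: a plausible figure
print(predicted_sales(120))   # far outside the range: the model happily predicts
                              # roughly 1085 USD, even though real sales would collapse
```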
8. 7. Data mining (too many variables).
● If omitting important variables is a potential problem, then presumably adding as many
explanatory variables as possible to a regression equation must be the solution. Nope.
● Your results can be compromised if you include too many variables, particularly extraneous
explanatory variables with no theoretical justification.
● For example, one should not design a research strategy built around the following premise: Since
we don’t know what causes autism, we should put as many potential explanatory variables as
possible in the regression equation just to see what might turn up as statistically significant; then
maybe we’ll get some answers.
● If you put enough junk variables in a regression equation, one of them is bound to meet the
threshold for statistical significance just by chance. The problem is that junk variables are not
always easily recognized as such.
● Clever researchers can always build a theory after the fact for why some curious variable that is
really just nonsense turns up as statistically significant.
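A minimal sketch of the data-mining problem: regress a purely random outcome on fifty purely random "explanatory" variables and count how many look statistically significant at the 5% level. Everything here is simulated noise; statsmodels and numpy are assumed available.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, k = 200, 50
X = rng.normal(size=(n, k))          # 50 junk predictors
y = rng.normal(size=n)               # outcome unrelated to any of them

model = sm.OLS(y, sm.add_constant(X)).fit()
significant = (model.pvalues[1:] < 0.05).sum()   # skip the intercept
print(f"{significant} of {k} junk variables appear 'significant' by chance")
```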