The document provides an overview of correlation, regression, and related statistical methods. It defines correlation as measuring the association between two variables, while regression finds the best-fitting line to predict a dependent variable from an independent variable. Simple linear regression uses one predictor variable, while multiple linear regression uses two or more. Logistic regression is used for nominal dependent variables, and nonlinear regression fits curved lines to nonlinear data. The document provides examples and guidelines for choosing the appropriate statistical test based on the type of variables.
1. Applied Statistics
Part 4
By
M. H. Farjoo MD, PhD, Bioanimator
Shahid Beheshti University of Medical Sciences
Instagram: @bio_animation
2. Applied Statistics
Part 4
Introduction to Correlation and Regression
Difference Between Correlation & Regression
Correlation
Regression
Simple Linear Regression
Multiple Linear Regression
Simple Logistic Regression
Multiple Logistic Regression
Nonlinear (Curvilinear) Regression
Choosing a Test
3. Introduction
Correlation and regression are not the same.
Use correlation to know:
Whether two measurement variables are associated.
Whether, as one variable increases, the other increases or decreases.
The strength of the association or relation.
Use regression to know:
The equation of a line that fits the cloud of data, describes the relationship, and predicts unknown values.
4. Difference Between Correlation & Regression
Goal:
Correlation quantifies the degree to which two variables are related; it does not fit a line through the data points.
Linear regression finds the best line (equation) that fits the data points.
Kind of data and sampling:
Correlation is used when you measure both variables and sample both randomly from a population.
In regression, X is a variable we manipulate and whose values we choose (time, concentration, etc.), and we predict Y from X.
5. Difference Between Correlation & Regression
Relationship between results:
Correlation computes a correlation coefficient, r.
Linear regression quantifies goodness of fit as r² (or R²).
Which variable is which?
Correlation gives the same coefficient (r) if we swap the two variables.
Regression gives a different best-fit line and a different coefficient (r²) if we swap the two variables.
6. Correlation
When two variables vary together, there is covariation or correlation.
The null hypothesis implies:
There is no relationship between the variables.
As the X variable changes, the Y variable does not change.
The correlation coefficient is not significantly different from zero (or statistically: r = 0).
Correlation does not imply causation.
But a significant correlation may suggest further research to test for a cause-and-effect relation.
11. Guidelines for Judging Causality
1. Is there a temporal relationship?
2. What is the strength of association?
3. Is there a dose/response relationship?
4. Were the findings replicated?
5. Is there biological plausibility?
6. What happens with cessation of exposure?
7. Is this explanation consistent with other knowledge?
12. Correlation
Causal inferences are licensed primarily by the design of your study, not by the statistical techniques you use.
Correlation only quantifies linear (straight-line) covariation.
A correlation analysis is not helpful if Y changes up to a point, and then reverses direction as X keeps increasing.
In that case we obtain a low value of r, even though the two variables are strongly related, as the sketch below shows.
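A minimal Python sketch of this pitfall (the data are invented): Y is completely determined by X, but because it rises and then falls, Pearson's r comes out near zero.

```python
# Sketch: Pearson's r misses a strong but non-monotonic relationship.
import numpy as np
from scipy import stats

x = np.linspace(0, 10, 101)
y = -(x - 5) ** 2  # Y rises to a peak at X = 5, then falls symmetrically

r, p = stats.pearsonr(x, y)
print(f"r = {r:.3f}")  # essentially 0, even though Y is an exact function of X
```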
13. Correlation
The value of the correlation coefficient may be:
-1 (perfect inverse relationship: as X goes up, Y goes down),
1 (perfect positive relationship: as X goes up, so does Y), or
0 (no correlation at all).
Pearson correlation (r) is parametric and assumes both X and Y come from a Gaussian distribution.
Spearman correlation [rs or ρ (rho)] does not make this assumption and is non-parametric.
Correlation is not very sensitive to non-normality, so you can use the Pearson method any time you have two measurement variables, even if they look non-normal.
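As a hedged illustration, both coefficients are one-liners in Python with SciPy (the sample values are invented):

```python
# Pearson (parametric) vs. Spearman (rank-based, non-parametric) correlation.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.1, 2.9, 4.2, 5.1, 6.0])   # invented measurement variable
y = np.array([2.3, 3.9, 6.1, 8.0, 9.8, 12.5])  # invented measurement variable

r, p_r = stats.pearsonr(x, y)       # assumes X and Y are roughly Gaussian
rho, p_rho = stats.spearmanr(x, y)  # uses only ranks, no Gaussian assumption

print(f"Pearson  r   = {r:.3f} (p = {p_r:.4f})")
print(f"Spearman rho = {rho:.3f} (p = {p_rho:.4f})")
```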
14. Value of r (or rs) and its interpretation:
1.0: perfect correlation.
> 0 to 1: the two variables tend to increase or decrease together.
0.0: the two variables do not vary together at all.
-1 to < 0: one variable increases as the other decreases.
-1.0: perfect negative or inverse correlation.
15. Correlation
If r or rs is far from zero, there are four possible explanations:
1. Changes in the X variable cause a change in the value of the Y variable.
2. Changes in the Y variable cause a change in the value of the X variable.
3. Changes in another variable influence both X and Y.
4. X and Y don't really correlate at all, and the observed correlation occurred by chance.
16. Regression
In regression we fit a line through the data and use its equation to predict Y from X.
We predict scores on one variable (Y axis) from the scores on a second variable (X axis).
The variable we base our predictions on is the independent or predictor variable (X axis).
The variable we are predicting is the criterion or dependent variable (Y axis).
Only the dependent variable (Y axis) determines the type of regression, NOT the independent variable (X axis).
17. Regression
The null hypothesis implies: the slope of the best-fit line is equal to zero.
Try to use the line equation for prediction within the X values found in the data set (interpolation).
Predicting Y values outside the range of X values (extrapolation) can yield ridiculous results if you go too far!
The expansion of an iron rod is related to heat, but at 2000 °C it will not expand, it will melt!!
18. Regression
r² (in the output window of the results) is called the coefficient of determination, or "r squared".
It ranges from 0 to 1 and is the fraction of the variation in the two variables that is "shared".
A regression can have a small r² (a weak relationship), yet have a slope that is significantly different from zero, as the simulation below shows.
The null hypothesis has nothing to do with r².
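A simulated sketch of that point (the simulation parameters are arbitrary): with 2,000 points, even a weak relationship gives a slope significantly different from zero while r² stays tiny.

```python
# Weak but real relationship: significant slope, small r-squared.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 2000)
y = 0.1 * x + rng.normal(0, 2, 2000)  # slope 0.1 buried in large noise

fit = stats.linregress(x, y)
print(f"slope = {fit.slope:.3f}, p = {fit.pvalue:.2e}, r2 = {fit.rvalue**2:.3f}")
# p is far below 0.05 (slope differs from zero), yet r2 is small:
# the null hypothesis concerns the slope, not r2.
```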
19. Simulated data showing the effect of the range of X values on r². For the exact same data, measuring Y over a smaller range of X values yields a smaller r².
20. Simple Linear Regression
When Y is a continuous variable and there is only one predictor variable, it is called simple linear regression.
An example: the weight of an infant at birth (Y), predicted by gestational age (X).
In simple linear regression, the predictions of Y from X form a straight line.
The regression line predicts Y from X and is the best-fitting straight line through the points, defined by a slope and an intercept.
The line minimizes the sum of the squares of the vertical distances of the points (errors) from the line.
21. • The slope quantifies the steepness of the line. It equals the change in Y for each unit change in X.
• The Y intercept is the Y value of the line when X equals zero. It defines the elevation of the line.
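A minimal sketch of the birth-weight example using SciPy (the eight data points are invented for illustration):

```python
# Simple linear regression: predict birth weight (kg) from gestational age (weeks).
import numpy as np
from scipy import stats

gestational_age = np.array([34, 35, 36, 37, 38, 39, 40, 41])           # X, invented
birth_weight = np.array([2.1, 2.3, 2.5, 2.8, 3.0, 3.2, 3.4, 3.6])      # Y, invented

fit = stats.linregress(gestational_age, birth_weight)
print(f"slope = {fit.slope:.3f} kg/week")      # change in Y per unit change in X
print(f"intercept = {fit.intercept:.3f} kg")   # Y value of the line at X = 0
print(f"r2 = {fit.rvalue**2:.3f}")

# Interpolate (stay inside the observed X range!):
print(f"predicted weight at 38.5 weeks: {fit.intercept + fit.slope*38.5:.2f} kg")
```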
24. In Graph A, the points are closer to the line than they are in Graph B. Therefore, the predictions in Graph A are more accurate than in Graph B.
25. Simple Linear Regression
Frank Anscombe's quartet demonstrates the importance of looking at your data.
All four data sets have 11 points, and they look very different.
Surprisingly, when analyzed by linear regression, all of these values are identical for all four graphs (verified in the sketch below):
The mean values of X and Y
The slopes and intercepts
r²
The SE and CI of the slope and intercept
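This is easy to verify yourself. The sketch below uses the standard published quartet values; all four fits print (almost) exactly the same numbers:

```python
# Anscombe's quartet: four very different data sets, identical regression output.
import numpy as np
from scipy import stats

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared X for sets I-III
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                   7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                   6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
             5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    fit = stats.linregress(x, y)
    print(f"{name:>3}: mean_y={np.mean(y):.2f} slope={fit.slope:.2f} "
          f"intercept={fit.intercept:.2f} r2={fit.rvalue**2:.2f}")
# All four lines print (almost) the same numbers; plot the sets to see
# how different they really are.
```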
26. Frank Anscombe (1918–2001) was brother-in-law to another well-known statistician, John Tukey; their wives were sisters.
29. Multiple Linear Regression
In multiple regression, Y is predicted by two or more X variables.
We can use it:
To predict the values of the dependent variable.
To decide which independent variable (X) has a major effect on the dependent variable (Y).
An example: the weight of an infant at birth (Y), predicted by gestational age, weight of the mother, and whether the mother smokes (all on the X side).
Not all predictors (X) are worth including in a multiple linear regression model.
30. Multiple Linear Regression
Another example: predicting a student's university score based on their high school scores and their total SAT score.
The basic idea is to find the linear combination of high school scores that best predicts university score.
Be very careful in using multiple regression to understand cause-and-effect relationships.
It is very easy to be misled by the results of a fancy multiple regression analysis.
The results should be used as a suggestion, rather than for hypothesis testing. A sketch of such a model follows.
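A hedged sketch of the birth-weight model using statsmodels (one common Python choice, not the only one; the data are simulated and the generating coefficients are arbitrary):

```python
# Multiple linear regression: birth weight predicted by gestational age,
# mother's weight, and smoking status (0/1).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 50
age = rng.uniform(34, 42, n)        # gestational age, weeks (invented)
mom_weight = rng.normal(65, 10, n)  # mother's weight, kg (invented)
smokes = rng.integers(0, 2, n)      # 1 = smoker, 0 = non-smoker
weight = 0.15*age + 0.01*mom_weight - 0.25*smokes + rng.normal(0, 0.2, n)

# Predictors in order: x1 = age, x2 = mom_weight, x3 = smokes.
X = sm.add_constant(np.column_stack([age, mom_weight, smokes]))
model = sm.OLS(weight, X).fit()
print(model.summary())  # coefficients and p-values hint at which X matters most
```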
32. Simple Logistic Regression
Simple logistic regression is used when there is one measurement independent variable (X) and the Y variable is nominal.
The goal is:
To check whether the probability of a particular condition of Y is associated with X.
To predict a particular condition of Y, given X.
If Y has only two values, the regression is called "binary logistic regression" (male/female, dead/alive).
If Y has more than two values, the regression is called "multinomial logistic regression".
33. Simple Logistic Regression
An example of binary logistic regression: the effect of study time (X) on exam outcome (Y).
Such a model can also be used to predict the occurrence of a heart attack based on plasma cholesterol.
An example of multinomial logistic regression: the effect of the grade of a tumor (X) on the treatment method (radiotherapy, chemotherapy, surgery) (Y).
This model can be used to choose how to treat the patient based on the severity of the cancer.
34. Simple Logistic Regression
Pass: Y = 1
Fail: Y = 0
Y is only 0 or 1 because the result is only pass/fail and there is nothing in between.
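A minimal sketch of this pass/fail model with statsmodels (the study hours and outcomes are invented):

```python
# Binary logistic regression: probability of passing (Y = 1) vs. study time (X).
import numpy as np
import statsmodels.api as sm

hours = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])  # invented
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])  # Y is only 0 or 1

X = sm.add_constant(hours)
model = sm.Logit(passed, X).fit(disp=0)
print(model.params)  # intercept and slope on the log-odds scale

# Predicted probability of passing after 2.75 hours of study:
print(model.predict(np.array([[1.0, 2.75]])))
```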
35. Multiple Logistic Regression
The dependent variable (Y) is nominal and there are 2 or more independent variables (X).
Example: the effect of cholesterol, age, and weight on the probability of a heart attack in the next year.
We can measure the risk factors on new individuals and estimate their probability of a heart attack.
This is done by comparing the odds ratios in the output window of the software.
36. Multiple Logistic Regression
From the output we can try to judge which risk factor most changes the probability of the dependent variable; a sketch follows below.
The null hypothesis implies:
There is no relationship between the X variables and the Y variable.
Adding each X variable does not really improve the fit of the equation.
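A hedged sketch with statsmodels (simulated data; the generating coefficients are arbitrary). Exponentiating the fitted coefficients gives the odds ratios the slide refers to:

```python
# Multiple logistic regression: heart attack (0/1) vs. cholesterol, age, weight.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200
chol = rng.normal(200, 30, n)   # mg/dL, invented
age = rng.uniform(40, 75, n)    # years, invented
weight = rng.normal(80, 12, n)  # kg, invented
logit = -16 + 0.04*chol + 0.12*age + 0.01*weight
attack = (rng.random(n) < 1/(1 + np.exp(-logit))).astype(int)  # simulated 0/1

# Predictors in order: x1 = chol, x2 = age, x3 = weight.
X = sm.add_constant(np.column_stack([chol, age, weight]))
model = sm.Logit(attack, X).fit(disp=0)
print(np.exp(model.params))  # odds ratios: odds multiplier per unit of each X
```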
37. Nonlinear (Curvilinear) Regression
If data would have to be transformed to create a linear relationship, nonlinear regression should be used instead.
Avoid transformations such as Scatchard or Lineweaver-Burk, whose only goal is to linearize your data.
These methods are outdated and should not be used to analyze data.
You might, however, analyze the data by nonlinear regression but show the results as a linear transformation.
The human brain and eye are keen on straight lines!
38. The Scatchard equation is an equation for calculating the affinity constant of a ligand for a protein.
39. In biochemistry, the Lineweaver–Burk plot is a representation of enzyme kinetics.
40. Nonlinear (Curvilinear) Regression
Fitting a straight line to transformed data gives different results than fitting a curved line to untransformed data.
The equation for a curve is a polynomial equation.
In polynomial equations X is raised to integer powers, such as X² and X³.
A quadratic equation, Y = aX + bX² + d, produces a parabola.
A cubic equation, Y = aX + bX² + cX³ + d, produces an S-shaped (sigmoid) curve.
Both can be fit by least squares, as in the sketch below.
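A small NumPy sketch (the data are invented): np.polyfit fits such polynomials by least squares, and np.polyval evaluates them for prediction.

```python
# Fitting quadratic and cubic polynomials to invented curved data.
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 40)
y = 1.0 + 0.8*x - 0.07*x**2 + rng.normal(0, 0.3, 40)  # curved "data"

quad = np.polyfit(x, y, deg=2)   # coefficients of Y = aX^2 + bX + c (highest power first)
cubic = np.polyfit(x, y, deg=3)  # coefficients of Y = aX^3 + bX^2 + cX + d

print("quadratic coefficients:", np.round(quad, 3))
print("cubic coefficients:   ", np.round(cubic, 3))
print("prediction at X = 5:", np.polyval(quad, 5.0))
```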
43. Nonlinear (Curvilinear) Regression
Nonlinear regression is used for three purposes:
To fit a model to data in order to obtain the best-fit values of the parameters.
To compare the fits of alternative models.
To simply fit a smooth curve in order to interpolate values from the curve.
The goal is not to describe the system perfectly, but to fit a curve that comes close enough to capture its behavior.
In this way we can understand the system and reach valid scientific conclusions.
44. Nonlinear (Curvilinear) Regression
The nonlinear method may yield results that are weird. Such results happen with noisy or incomplete data and include:
A rate constant that is negative.
A best-fit fraction that is greater than 1.
A best-fit Kd value that is negative.
A top plateau of a sigmoid curve that is far above the highest data point.
An EC50 outside the range of your X values.
If the results make no sense, they are unacceptable, even if the curve comes close to the points and R² is close to 1.
45. Correlation & Regression
Hands-on practice
To calculate correlation & regression in SPSS:
For Correlation: Analyze => Correlate
For Regression: Analyze => Regression
To calculate correlation & regression in Prism:
XY (from welcome screen) => choose appropriate option
46. Choosing a Test of Association
Relationship between 2 continuous variables:
Dependent variable: Scale. Independent variable: Scale.
Parametric test: Pearson's correlation coefficient. Non-parametric test: Spearman's correlation coefficient.
Predicting the value of one variable from the value of a predictor variable, or looking for significant relationships:
Dependent variable: Scale. Independent variable: Any.
Parametric test: Simple linear regression. Non-parametric test: transform the data.
Dependent variable: Nominal (binary). Independent variable: Any.
Test: Logistic regression (no separate non-parametric test).
Assessing the relationship between two categorical variables:
Dependent variable: Categorical. Independent variable: Categorical.
Non-parametric test: Chi-squared test (no parametric test).
47. [Flow chart for choosing a test. It branches on three YES/NO questions: Is your dependent variable (DV) continuous? Is your independent variable (IV) continuous? Do you have only two groups?]
48. Choosing a test from the goal and the type of data

Goal: Describe one group.
Measurement (Gaussian population): Mean, SD.
Rank, score, or measurement (non-Gaussian population): Median, interquartile range.
Binomial (two possible outcomes): Proportion.
Survival time: Kaplan-Meier survival curve.

Goal: Compare one group to a hypothetical value.
Gaussian: One-sample t test.
Non-Gaussian: Wilcoxon test.
Binomial: Chi-square or binomial test.**

Goal: Compare two unpaired groups.
Gaussian: Unpaired t test.
Non-Gaussian: Mann-Whitney test.
Binomial: Fisher's test (chi-square for large samples).
Survival time: Log-rank test or Mantel-Haenszel.*

Goal: Compare two paired groups.
Gaussian: Paired t test.
Non-Gaussian: Wilcoxon test.
Binomial: McNemar's test.
Survival time: Conditional proportional hazards regression.*

Goal: Compare three or more unmatched groups.
Gaussian: One-way ANOVA.
Non-Gaussian: Kruskal-Wallis test.
Binomial: Chi-square test.
Survival time: Cox proportional hazards regression.**

Goal: Compare three or more matched groups.
Gaussian: Repeated-measures ANOVA.
Non-Gaussian: Friedman test.
Binomial: Cochran's Q.**
Survival time: Conditional proportional hazards regression.**

Goal: Quantify association between two variables.
Gaussian: Pearson correlation.
Non-Gaussian: Spearman correlation.
Binomial: Contingency coefficients.**

Goal: Predict a value from another measured variable.
Gaussian: Simple linear regression or nonlinear regression.
Non-Gaussian: Nonparametric regression.**
Binomial: Simple logistic regression.*
Survival time: Cox proportional hazards regression.*

Goal: Predict a value from several measured or binomial variables.
Gaussian: Multiple linear regression* or multiple nonlinear regression.**
Binomial: Multiple logistic regression.*
Survival time: Cox proportional hazards regression.*