CORRELATION AND REGRESSION.pptx

INTRODUCTION
• We deal with data that consist of random pairs
(or sets) of observations. The elements of each
pair are observations on the same subject
Ways to deal with such data
• Ignore any relation between the variables and
analyze them separately
• Use correlation to describe the intensity of
association between the two variables
• Use regression analysis to assess the degree and
nature of association between the variables

CORRELATION
• If we have two continuous variables X and Y
we can summarize them with five parameters
namely
• The two means (µx, µy)
• The two variances (σx², σy²) and
• The covariance (σxy)

Covariance
• The sample covariance is calculated as the
sum of cross products ( of deviations ) divided
by the degrees of freedom
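The verbal definition above can be written as a formula. This is a sketch in assumed notation: the symbols s_xy (sample covariance), x̄ and ȳ (sample means) and n (number of pairs) are not defined on the slide, and the degrees of freedom are taken to be n - 1.

```latex
s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```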

Properties of covariance
• If large values of X pair with large values of Y
and small values of X with small values of Y, the
covariance will be positive
• If large values of X go with small values of Y and
vice versa, the covariance will be negative
• If X and Y are independent then the covariance
will be zero

Correlation coefficient
• Can replace the covariance without any loss of
information. The population parameter is denoted
by ρ while the sample statistic is denoted by r

Computation of r
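The sample correlation coefficient can be
calculated by
$r = \frac{s_{xy}}{s_x s_y}$
where s_xy is the sample covariance and s_x, s_y
are the sample standard deviations of X and Y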

Properties of correlation coefficient
• The value of r is always between -1 and 1.
• Positive values indicate a positive association
between the variables
• Negative values indicate a negative
association between the variables
• If r = 1 or r = -1 then all of the cases fall on a
straight line

Coefficient of determination
• This is the square of the correlation
coefficient.
• Recall that the total sum of squares is a
measure of variability of a variable

Cont
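• The sum of squares of Y may be given as
$SS_Y = \sum_{i=1}^{n} (y_i - \bar{y})^2$
• Similarly
$SS_{XY} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
is a measure of the total joint variability of X and
Y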

Cont
• The measure of variability of Y over and above
that of the joint variability of both X and Y
(SS[Y|X]) is called the sum of squares due to
regression of Y on X, denoted SSr
• It can be shown that r² is the ratio of SSr to
the total sum of squares SSto

Cont
• The coefficient of determination is therefore a
measure of the proportion of variability in Y that
is explained by the variable X

Properties of r2
• The coefficient of determination lies between 0
and 1
• When the variables are highly correlated r² is
near 1, and it is near 0 when they are not
correlated.

Example
• Consider the following data
X Y
9 0
9 9
8 1
5 1
7 9
• Find the variances of x and y and the covariance
of x and y
• Find the correlation coefficient and the
coefficient of determination
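The exercise above can be worked through with a few lines of Python. This is a minimal sketch that uses only the standard library; the variable names are illustrative, and the divisor n - 1 is assumed for the sample variances and covariance.

```python
import math

# Data from the example slide.
x = [9, 9, 8, 5, 7]
y = [0, 9, 1, 1, 9]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross products of deviations from the means.
ss_x = sum((xi - mean_x) ** 2 for xi in x)
ss_y = sum((yi - mean_y) ** 2 for yi in y)
ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Sample variances and covariance divide by the degrees of freedom, n - 1.
var_x = ss_x / (n - 1)
var_y = ss_y / (n - 1)
cov_xy = ss_xy / (n - 1)

# Correlation coefficient and coefficient of determination.
r = cov_xy / math.sqrt(var_x * var_y)
r_squared = r ** 2

print(f"var(x) = {var_x:.2f}, var(y) = {var_y:.2f}, cov(x, y) = {cov_xy:.2f}")
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```

For these five pairs the variances come out as 2.8 and 21, the covariance as 1.25, r as roughly 0.163 and r² as roughly 0.027, so X explains under 3% of the variability in Y.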

Sampling distribution of r
• The sampling distribution of r is symmetric only
when the parameter ρ = 0.
• It becomes increasingly skewed as ρ moves away
from 0
• Hence we cannot rely on the CLT directly when
computing a confidence interval for ρ or when
testing hypotheses about it
• As a rough rule of thumb, two variables are
regarded as correlated if |r| > 0.5 and the
sample is large enough

Testing hypothesis about ρ
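Test H0: ρ = 0
• Recall that if ρ = 0 then the two variables are
not correlated
• The test assesses whether there is correlation
between the variables. The test statistic is
$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$
which follows a t distribution with n - 2 degrees
of freedom when H0 is true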

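Hypothesis testing cont
Test H0: ρ = ρ0, where ρ0 is not equal to zero.
We transform r to z' (Fisher's z transformation)
and the test statistic is
$Z = \frac{z' - \zeta_0}{\sigma_z}$
where $z' = \frac{1}{2}\ln\frac{1+r}{1-r}$, $\zeta_0 = \frac{1}{2}\ln\frac{1+\rho_0}{1-\rho_0}$ and $\sigma_z = \frac{1}{\sqrt{n-3}}$
A 95% C.I. on the transformed scale is given by
z' ± 1.96 × σz; back-transforming its limits gives
the interval for ρ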
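The interval above can be sketched in Python as follows. The helper names fisher_ci and fisher_z_stat are hypothetical, the code assumes the usual Fisher transformation z' = atanh(r) with standard error 1/sqrt(n - 3), and the inputs r = 0.163 and n = 5 are taken from the small data set in the earlier example.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for rho (hypothetical helper)."""
    z_prime = math.atanh(r)          # same as 0.5 * ln((1 + r) / (1 - r))
    sigma_z = 1 / math.sqrt(n - 3)   # standard error on the z' scale
    low_z = z_prime - z_crit * sigma_z
    high_z = z_prime + z_crit * sigma_z
    # Back-transform the limits to the correlation scale.
    return math.tanh(low_z), math.tanh(high_z)

def fisher_z_stat(r, n, rho0):
    """Test statistic for H0: rho = rho0 (hypothetical helper)."""
    return (math.atanh(r) - math.atanh(rho0)) * math.sqrt(n - 3)

lower, upper = fisher_ci(0.163, 5)
print(f"95% C.I. for rho: ({lower:.3f}, {upper:.3f})")
```

With only five pairs the interval is very wide and comfortably includes zero, which illustrates why the sample size matters when judging a correlation.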

Example
• Consider the data used above. Are x and y
correlated?
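One way to answer this is to test H0: ρ = 0 with the usual t statistic, t = r·sqrt(n - 2)/sqrt(1 - r²), on n - 2 degrees of freedom. The sketch below assumes a two-sided test at the 5% level and takes the critical value 3.182 for 3 degrees of freedom from standard t tables; the variable names are illustrative.

```python
import math

# Data from the earlier example.
x = [9, 9, 8, 5, 7]
y = [0, 9, 1, 1, 9]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n
ss_x = sum((xi - mean_x) ** 2 for xi in x)
ss_y = sum((yi - mean_y) ** 2 for yi in y)
ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

r = ss_xy / math.sqrt(ss_x * ss_y)

# t statistic for H0: rho = 0, with n - 2 degrees of freedom.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Two-sided 5% critical value of t with 3 d.f., taken from tables (assumption).
T_CRIT = 3.182
reject_h0 = abs(t_stat) > T_CRIT

print(f"r = {r:.3f}, t = {t_stat:.3f}, reject H0: {reject_h0}")
```

Here t is about 0.29, far below the critical value, so these data give no evidence that x and y are correlated.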

REGRESSION
• A model of the relationship between some
covariates and an outcome.
• Often used in exploratory settings
• Sometimes used for confirmatory studies
• A regression line is an equation that describes
the relationship between a response variable
y (outcome) and an explanatory variable x
(covariate).

Regression continued
• Statistical relationships may be linear,
exponential, polynomial, logarithmic, etc.
• The simplest form is the linear one
• Linear means linear in the coefficients, i.e. y is
a linear function of the coefficients
• Non-linear relations can often be modified into
forms that are approximately linear through
a transformation
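As an illustration of the last point, an exponential relation y = a·exp(b·x) becomes linear after taking logs: ln(y) = ln(a) + b·x. The sketch below uses hypothetical, noise-free data generated with a = 2 and b = 0.5 (values chosen only for illustration) and recovers them with an ordinary least squares line fitted to (x, ln y).

```python
import math

# Hypothetical, noise-free data generated from y = 2 * exp(0.5 * x).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0 * math.exp(0.5 * xi) for xi in x]

# After the log transformation the relation is linear in the coefficients.
log_y = [math.log(yi) for yi in y]

n = len(x)
mean_x = sum(x) / n
mean_log_y = sum(log_y) / n

# Least squares slope and intercept for ln(y) against x.
slope = (
    sum((xi - mean_x) * (li - mean_log_y) for xi, li in zip(x, log_y))
    / sum((xi - mean_x) ** 2 for xi in x)
)
intercept = mean_log_y - slope * mean_x

print(f"b = {slope:.3f}, a = {math.exp(intercept):.3f}")
```

With real, noisy data the recovered values would only be estimates, and the error structure on the log scale would need checking.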

Simple linear regression
• A linear relationship may be summarized using an
equation
y = β0 + β1x
where β0 is the intercept and β1 the slope of the
line.
• For observation i (i = 1, 2, …, n) whose value of
the explanatory variable is xi, one would expect
the corresponding response yi to be such that
E(yi) = β0 + β1xi

• The statistical model fitted for a simple linear
regression is of the form
yi = β0 + β1xi + εi, i = 1, ..., n
where εi is the random error term

EXAMPLE
Study of the effect of temperature on the rate of development
of the potato leafhopper, Empoasca fabae. The response (y)
was the mean length of the development period (in days)
from egg to adult.
Temperature (°F)   Mean length (days)
59.8               30.2
67.6               27.3
70.0               26.8
70.4               23.3
74.0               19.1
75.3               19.0
78.0               16.5
80.4               15.9
81.4               14.8
83.2               14.2

[Figure: scatter plot of mean length of development period of
potato leafhopper (days, vertical axis) versus temperature
(horizontal axis)]

Genstat Analysis using Stats, Regression
Analysis, Linear Models…

Output from Genstat
Regression Analysis
Response variate: length
Fitted terms: Constant, temp
Summary of analysis
             d.f.   s.s.     m.s.      v.r.     F pr.
Regression   1      282.28   282.282   120.85   <.001
Residual     8      18.69    2.336
Total        9      300.97   33.441
Estimates of parameters
             estimate   s.e.     t(8)     t pr.
Constant     78.09      5.24     14.90    <.001
temp         -0.7753    0.0705   -10.99   <.001
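The Genstat estimates can be cross-checked by hand. The following is a sketch using the standard least squares formulas and only the Python standard library; it is not Genstat's own algorithm, and the variable names are illustrative.

```python
import math

# Data from the leafhopper example: temperature and mean length (days).
temp = [59.8, 67.6, 70.0, 70.4, 74.0, 75.3, 78.0, 80.4, 81.4, 83.2]
length = [30.2, 27.3, 26.8, 23.3, 19.1, 19.0, 16.5, 15.9, 14.8, 14.2]
n = len(temp)

mean_x = sum(temp) / n
mean_y = sum(length) / n
s_xx = sum((x - mean_x) ** 2 for x in temp)
s_yy = sum((y - mean_y) ** 2 for y in length)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temp, length))

# Least squares estimates of the slope and intercept.
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Analysis of variance quantities.
ss_regression = s_xy ** 2 / s_xx
ss_residual = s_yy - ss_regression
ms_residual = ss_residual / (n - 2)
variance_ratio = ss_regression / ms_residual

# Standard error of the slope.
se_slope = math.sqrt(ms_residual / s_xx)

print(f"intercept = {intercept:.2f}, slope = {slope:.4f} (s.e. {se_slope:.4f})")
print(f"SS regression = {ss_regression:.2f}, SS residual = {ss_residual:.2f}")
print(f"variance ratio = {variance_ratio:.2f}")
```

The results agree with the output above to the precision shown there: each extra degree of temperature shortens the mean development period by about 0.78 days.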

ASSUMPTIONS
• Error terms have constant variance
(homoscedasticity)
• The error terms are independent
• The error terms are normally distributed
• The regression function is linear
• There are no influential outliers
• All important independent variables are included
in the model
– Must be checked

How to check the assumptions
• Diagnostic plots
• The plots tell you whether the regression is
even appropriate.
• Include univariate plots, bivariate plots,
Residual analysis plots

Univariate plots of X and Y
• To look for outliers
• Examine the shape of the distribution
• Include box plots, stem plots , histograms and
dot plots for x and y

Bivariate plots
• Plots of X vs Y
• Is the relationship between the two variables
linear?
• Are there two-dimensional outliers?
• Does the assumption of constant variance
look reasonable?

Plots of residuals versus X
• Useful for detecting non-linearity
• Any observable pattern in the residual versus
X plot indicates a problem with the model
assumptions.
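The residuals for such a plot are the observed responses minus the fitted values. The sketch below computes them for the leafhopper data with a plain least squares fit and prints them against temperature; plotting is left out so that only the standard library is needed, and the variable names are illustrative.

```python
# Data from the leafhopper example: temperature and mean length (days).
temp = [59.8, 67.6, 70.0, 70.4, 74.0, 75.3, 78.0, 80.4, 81.4, 83.2]
length = [30.2, 27.3, 26.8, 23.3, 19.1, 19.0, 16.5, 15.9, 14.8, 14.2]
n = len(temp)

mean_x = sum(temp) / n
mean_y = sum(length) / n
s_xx = sum((x - mean_x) ** 2 for x in temp)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temp, length))

slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Residual = observed response minus fitted value.
fitted = [intercept + slope * x for x in temp]
residuals = [y - f for y, f in zip(length, fitted)]

for x, e in zip(temp, residuals):
    print(f"temp = {x:5.1f}  residual = {e:6.2f}")
```

A run of residuals with the same sign, or a spread that widens with temperature, would be the kind of pattern to look for.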

Plot the residuals versus Y'
• For one predictor variable it has the same
information as the previous plot.
• For multiple linear regression the plot lets us
examine patterns in the residuals as the fitted
response Y' increases.

Plot the standardized residuals versus x
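Standardized residuals rescale each residual by its estimated standard deviation. The sketch below assumes the common definition e_i / (s·sqrt(1 - h_i)), where s² is the residual mean square and h_i = 1/n + (x_i - x̄)²/Sxx is the leverage; software packages differ in their exact definition, so this may not match a given program's output digit for digit.

```python
import math

# Data from the leafhopper example: temperature and mean length (days).
temp = [59.8, 67.6, 70.0, 70.4, 74.0, 75.3, 78.0, 80.4, 81.4, 83.2]
length = [30.2, 27.3, 26.8, 23.3, 19.1, 19.0, 16.5, 15.9, 14.8, 14.2]
n = len(temp)

mean_x = sum(temp) / n
mean_y = sum(length) / n
s_xx = sum((x - mean_x) ** 2 for x in temp)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temp, length))

slope = s_xy / s_xx
intercept = mean_y - slope * mean_x
residuals = [y - (intercept + slope * x) for x, y in zip(temp, length)]

# Residual standard deviation on n - 2 degrees of freedom.
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Leverage of each observation in simple linear regression.
leverage = [1 / n + (x - mean_x) ** 2 / s_xx for x in temp]

# Standardized residual (assumed definition, see the note above).
std_resid = [e / (s * math.sqrt(1 - h)) for e, h in zip(residuals, leverage)]

for x, r_i in zip(temp, std_resid):
    print(f"temp = {x:5.1f}  standardized residual = {r_i:6.2f}")
```

Under this definition the observation at 70.0 degrees has the largest standardized residual, a little above 2, which makes it the point most worth a second look.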
