Correlation and Regression
Correlation and Regression
The test you choose depends on level of measurement:
Independent                   Dependent                     Test
Dichotomous                   Interval-Ratio, Dichotomous   Independent Samples t-test
Nominal, Dichotomous          Interval-Ratio, Dichotomous   ANOVA
Nominal, Dichotomous          Nominal, Dichotomous          Cross Tabs
Interval-Ratio, Dichotomous   Interval-Ratio                Bivariate Regression/Correlation
Correlation and Regression
 Bivariate regression is a technique that fits a straight line as close as possible to all the coordinates of two continuous variables plotted on a two-dimensional graph, in order to summarize the relationship between the variables
 Correlation is a statistic that assesses the strength and direction of the association between two continuous variables . . . It is produced through a technique called “regression”
Bivariate Regression
 For example:
A criminologist may be interested in the
relationship between Income and Number of
Children in a family or self-esteem and
criminal behavior.
Independent Variables: Family Income, Self-esteem
Dependent Variables: Number of Children, Criminal Behavior
Bivariate Regression
 For example:
 Research Hypotheses:
 As family income increases, the number of children in
families declines (negative relationship).
 As self-esteem increases, reports of criminal behavior
increase (positive relationship).
Independent Variables: Family Income, Self-esteem
Dependent Variables: Number of Children, Criminal Behavior
Bivariate Regression
 For example:
 Null Hypotheses:
 There is no relationship between family income and the
number of children in families. The relationship statistic b = 0.
 There is no relationship between self-esteem and criminal
behavior. The relationship statistic b = 0.
Independent Variables: Family Income, Self-esteem
Dependent Variables: Number of Children, Criminal Behavior
Bivariate Regression
 Let’s look at the relationship between self-
esteem and criminal behavior.
 Regression starts with plots of coordinates of
variables in a hypothesis (although you will
hardly ever “plot” your data in reality).
 The data:
Each respondent has filled out a self-esteem
assessment and reported number of crimes
committed.
Bivariate Regression
[Scatterplot: Y = crimes (0–10) on the vertical axis, X = self-esteem (10–40) on the horizontal axis.]
What do you think the relationship is?
Bivariate Regression
[Scatterplot: Y = crimes (0–10), X = self-esteem (10–40).]
Is it positive? Negative? No change?
Bivariate Regression
[Scatterplot: Y = crimes (0–10), X = self-esteem (10–40).]
Regression is a procedure that fits a line to the data. The slope of that line acts as a model for the relationship between the plotted variables.
Bivariate Regression
[Scatterplot with example lines: Y = crimes (0–10), X = self-esteem (10–40).]
The slope of a line is the change in the corresponding Y value for each one-unit increase in X (rise over run).
Slope = 0: No relationship!
Slope = 0.2: Positive relationship!
Slope = -0.2: Negative relationship!
Bivariate Regression
 The mathematical equation for a line:
Y = mX + b
Where: Y = the line’s position on the vertical axis at any point
X = the line’s position on the horizontal axis at any point
m = the slope of the line
b = the intercept with the Y axis, where X equals zero
Bivariate Regression
 The statistics equation for a line:
Ŷ = a + bX
Where: Ŷ = the line’s position on the vertical axis at any point (predicted value of the dependent variable)
X = the line’s position on the horizontal axis at any point (value of the independent variable)
b = the slope of the line (called the coefficient)
a = the intercept with the Y axis, where X equals zero
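To make the notation concrete, here is a minimal Python sketch of using the line equation to generate predicted values. The intercept and slope (a = 6, b = -0.2) are simply the values from the self-esteem example later in the deck, not estimates from real data.

```python
# Minimal sketch: predicted values from the line Y-hat = a + bX.
# The intercept (a = 6) and slope (b = -0.2) are illustrative values only.

def predict(a, b, x):
    """Return the predicted (conditional mean) value of Y at a given X."""
    return a + b * x

a, b = 6.0, -0.2                 # hypothetical intercept and slope
for x in (10, 20, 30, 40):       # some self-esteem scores
    print(f"X = {x:2d}  ->  Y-hat = {predict(a, b, x):.1f}")
```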
Bivariate Regression
 The next question:
How do we draw the line???
 Our goal for the line:
Fit the line as close as possible to all the
data points for all values of X.
Bivariate Regression
[Scatterplot: Y = crimes (0–10), X = self-esteem (10–40).]
How do we minimize the distance between a line and all the data points?
Bivariate Regression
• How do we minimize the distance between a line and
all the data points?
• You already know of a statistic that minimizes the
distance between itself and all data values for a
variable--the mean!
• The mean minimizes the sum of squared deviations--it is where deviations sum to zero and where the squared deviations are at their lowest value: Σ(Y - Ȳ)²
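A quick numerical sketch of that claim, on made-up Y values: searching over candidate constants shows that the mean is (up to grid resolution) the value that minimizes the sum of squared deviations, and deviations from it sum to zero.

```python
# Quick numerical check (hypothetical data): the mean minimizes the sum of
# squared deviations Σ(Y - c)² over all candidate constants c.
import numpy as np

y = np.array([2.0, 4.0, 5.0, 7.0, 9.0])          # made-up Y values
candidates = np.linspace(y.min(), y.max(), 501)   # constants to try
ssd = [np.sum((y - c) ** 2) for c in candidates]

best = candidates[int(np.argmin(ssd))]
print(f"mean = {y.mean():.2f}, SSD-minimizing constant ≈ {best:.2f}")
print(f"deviations from the mean sum to {np.sum(y - y.mean()):.1e}")
```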
Bivariate Regression
• The mean minimizes the sum of squared
deviations--it is where deviations sum to zero and
where the squared deviations are at their lowest
value.
• Take this principle and “fit the line” to the place
where squared deviations (on Y) from the line are
at their lowest value (across all X’s).
• Σ(Y - Ŷ)², where Ŷ = the value given by the line
Bivariate Regression
 There are several lines that you could draw where the
deviations would sum to zero...
 Minimizing the sum of squared errors gives you the
unique, best fitting line for all the data points. It is the
line that is closest to all points.
 Ŷ (or Y-hat) = the Y value of the line at any X
 Y = a case’s value on variable Y
 Y - Ŷ = residual
 Σ(Y - Ŷ) = 0; therefore, we use Σ(Y - Ŷ)² …and minimize that!
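A small sketch with hypothetical self-esteem and crime values: the residuals from the least-squares line sum to (essentially) zero, which is why it is the squared residuals that we minimize.

```python
# Sketch with made-up data: for the least-squares line, residuals (Y - Y-hat)
# sum to zero, so we judge fit by the sum of *squared* residuals instead.
import numpy as np

x = np.array([10., 15., 20., 25., 30., 35., 40.])   # hypothetical self-esteem
y = np.array([8., 6., 7., 4., 3., 2., 1.])          # hypothetical crime counts

b, a = np.polyfit(x, y, deg=1)       # least-squares slope and intercept
resid = y - (a + b * x)

print(f"sum of residuals         = {resid.sum():.2e}")   # ~0 by construction
print(f"sum of squared residuals = {(resid ** 2).sum():.3f}")
```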
Bivariate Regression
[Scatterplot: Y = crimes (0–10), X = self-esteem (10–40), with the fitted line. Legend: Yi = the actual Y value at a given X; Ŷi = the line’s level on Y at that X.]
Illustration of Y - Ŷ:
Y = 10, Ŷ = 5, so the residual is 5
Y = 0, Ŷ = 4, so the residual is -4
Bivariate Regression
[Same scatterplot. Legend: Yi = the actual Y value at a given X; Ŷi = the line’s level on Y at that X.]
Illustration of (Y - Ŷ)²:
(Yi - Ŷ)² = deviation²
Y = 10, Ŷ = 5 . . . 25
Y = 0, Ŷ = 4 . . . 16
Bivariate Regression
[Same scatterplot, with a candidate line. Legend: Yi = the actual Y value at a given X; Ŷi = the line’s level on Y at that X.]
Illustration of (Y - Ŷ)²:
The goal: find the line that minimizes the sum of squared deviations.
The best line will have the lowest value of the sum of squared deviations (adding the squared deviations for each case in the sample).
Bivariate Regression
[Scatterplot: Y = crimes (0–10), X = self-esteem (10–40), with the fitted line Ŷ = a + bX and a residual labeled e.]
Bivariate Regression
 We use Σ(Y - Ŷ)² …and minimize that!
 There is a simple, elegant formula for “discovering” the line that minimizes the sum of squared errors:
b = Σ(X - X̄)(Y - Ȳ) / Σ(X - X̄)²
a = Ȳ - bX̄
Ŷ = a + bX
 This is the method of least squares; it gives our least squares estimate and explains why we call this technique “ordinary least squares,” or OLS, regression
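Here is a short sketch of those formulas in Python, applied to hypothetical data; numpy's own least-squares fit is printed as a cross-check.

```python
# A sketch of the least-squares formulas from this slide, on hypothetical data:
#   b = Σ(X - X̄)(Y - Ȳ) / Σ(X - X̄)²,   a = Ȳ - bX̄
import numpy as np

x = np.array([10., 15., 20., 25., 30., 35., 40.])   # hypothetical self-esteem
y = np.array([8., 6., 7., 4., 3., 2., 1.])          # hypothetical crime counts

x_bar, y_bar = x.mean(), y.mean()
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
a = y_bar - b * x_bar                                             # intercept

print(f"Y-hat = {a:.2f} + {b:.3f} X")
# Cross-check against numpy's own least-squares fit:
print(np.polyfit(x, y, deg=1))   # [slope, intercept]
```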
Bivariate Regression
[Scatterplot: Y (1–10) against a dichotomous X coded 0 and 1.]
Considering that a regression line minimizes Σ(Y - Ŷ)², where would the regression line cross for an interval-ratio variable regressed on a dichotomous independent variable?
For example:
0 = Men: Mean = 6
1 = Women: Mean = 4
Bivariate Regression
[Same scatterplot: the line passes through the two group means.]
The difference of means will be the slope. This is the same number that is tested for significance in an independent samples t-test.
0 = Men: Mean = 6
1 = Women: Mean = 4
Slope = -2; Ŷ = 6 - 2X
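A sketch with made-up scores chosen so the group means are 6 (men, X = 0) and 4 (women, X = 1): regressing Y on the 0/1 variable returns the X = 0 group mean as the intercept and the difference of means as the slope.

```python
# Sketch: regressing Y on a 0/1 variable reproduces the two group means.
# The data are made up so that X=0 averages 6 and X=1 averages 4.
import numpy as np

x = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y = np.array([5., 6., 6., 7., 3., 4., 4., 5.])   # means: 6 for X=0, 4 for X=1

b, a = np.polyfit(x, y, deg=1)
print(f"intercept a = {a:.1f}  (mean of the X=0 group)")
print(f"slope b     = {b:.1f}  (difference of means, 4 - 6 = -2)")
```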
Correlation
 This lecture has
covered how to model
the relationship
between two
variables with
regression.
 Another concept is
strength of
association.
 Correlation provides
that.
Correlation
[Scatterplot: Y = crimes (0–10), X = self-esteem (10–40), with the fitted line.]
So our equation is: Ŷ = 6 - .2X
The slope tells us the direction of the association… How strong is that?
Correlation
[Scatterplot: Example of Low Negative Correlation (wide scatter of Y around the line).]
When there is a lot of difference on the dependent variable across subjects at particular values of X, there is NOT as much association (weaker).
Correlation
[Scatterplot: Example of High Negative Correlation (points tightly clustered around the line).]
When there is little difference on the dependent variable across subjects at particular values of X, there is MORE association (stronger).
Correlation
 To find the strength of the relationship
between two variables, we need
correlation.
 The correlation is the standardized
slope… it refers to the standard deviation
change in Y when you go up a standard
deviation in X.
Correlation
 The correlation is the standardized slope… it refers to the standard deviation change in Y when you go up a standard deviation in X.
 Recall that the s.d. of X is Sx = √[Σ(X - X̄)² / (n - 1)]
 and the s.d. of Y is Sy = √[Σ(Y - Ȳ)² / (n - 1)]
 Pearson correlation: r = b (Sx / Sy)
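A sketch of that identity on hypothetical data: multiplying the slope by Sx/Sy reproduces the Pearson correlation that numpy computes directly.

```python
# Sketch (hypothetical data): the Pearson correlation is the standardized
# slope, r = b * (Sx / Sy), and matches numpy's own correlation.
import numpy as np

x = np.array([10., 15., 20., 25., 30., 35., 40.])
y = np.array([8., 6., 7., 4., 3., 2., 1.])

b, a = np.polyfit(x, y, deg=1)
s_x, s_y = x.std(ddof=1), y.std(ddof=1)   # sample standard deviations (n - 1)

r_from_slope = b * s_x / s_y
r_numpy = np.corrcoef(x, y)[0, 1]
print(f"r from standardized slope = {r_from_slope:.3f}")
print(f"r from np.corrcoef        = {r_numpy:.3f}")
```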
Correlation
 The Pearson Correlation, r:
tells the direction and strength of the
relationship between continuous variables
ranges from -1 to +1
is + when the relationship is positive and -
when the relationship is negative
the higher the absolute value of r, the stronger
the association
a standard deviation change in X corresponds with an r standard deviation change in Y
Correlation
 The Pearson Correlation, r:
The Pearson correlation is also an inferential statistic.
t(n-2) = (r - 0) / √[(1 - r²) / (n - 2)]
When it is significant, there is a relationship in the population that is not equal to zero!
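A sketch of that test in Python. The r of .26 and n of 700 are illustrative values only (r ≈ .26 is roughly what the deck's GSS example with r² = .069 implies); the two-tailed p-value comes from the t distribution with n - 2 degrees of freedom.

```python
# Sketch of the significance test for r on this slide, t with n - 2 df:
#   t = r / sqrt((1 - r²) / (n - 2))
# The r and n values below are made up for illustration.
import math
from scipy import stats

r, n = 0.26, 700                       # hypothetical correlation and sample size
t = r / math.sqrt((1 - r ** 2) / (n - 2))
p = 2 * stats.t.sf(abs(t), df=n - 2)   # two-tailed p-value

print(f"t({n - 2}) = {t:.2f}, p = {p:.4f}")
```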
Error Analysis
 Ŷ = a + bX This equation gives the conditional mean of Y at any given value of X.
 So… in reality, our line gives us the expected mean of Y given each value of X.
 The line’s equation tells you how the mean of your dependent variable changes as your independent variable goes up.
[Diagram: regression line of Y on X passing through the conditional means.]
Error Analysis
 As you know, every mean has a distribution around it--so there is a standard deviation. This is true for conditional means as well. So, you also have a conditional standard deviation.
 “Conditional Standard Deviation” or “Root Mean Square Error” equals “approximate average deviation from the line.”
σ̂ = √[SSE / (n - 2)] = √[Σ(Y - Ŷ)² / (n - 2)]
[Diagram: regression line with the spread of Y values around it at each X.]
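A sketch of the root mean square error on hypothetical data, compared with the unconditional standard deviation of Y.

```python
# Sketch: the "conditional standard deviation" (root mean square error) of
# this slide, sqrt(SSE / (n - 2)), on hypothetical data.
import numpy as np

x = np.array([10., 15., 20., 25., 30., 35., 40.])
y = np.array([8., 6., 7., 4., 3., 2., 1.])

b, a = np.polyfit(x, y, deg=1)
sse = np.sum((y - (a + b * x)) ** 2)        # sum of squared errors
rmse = np.sqrt(sse / (len(y) - 2))          # conditional s.d. around the line

print(f"SSE = {sse:.3f}, root mean square error = {rmse:.3f}")
print(f"compare with the unconditional s.d. of Y = {y.std(ddof=1):.3f}")
```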
Error Analysis
 The Assumption of Homoskedasticity:
 The variation around the line is the same no matter the X.
 The conditional standard deviation is for any given value of X.
 If there is a relationship between X and Y, the conditional standard
deviation is going to be less than the standard deviation of Y--if this
is so, you have improved prediction of the mean value of Y by
looking at each level of X.
 If there were no relationship, the conditional standard deviation
would be the same as the original, and the regression line would be
flat at the mean of Y.
[Diagram: the conditional standard deviation around the line vs. the original standard deviation of Y around its mean.]
Error Analysis
 So guess what?
 We have a way to determine how much our
understanding of Y is improved when taking X
into account—it is based on the fact that
conditional standard deviations should be
smaller than Y’s original standard deviation.
Error Analysis
 Proportional Reduction in Error
 Let’s call the variation around the mean in Y “Error 1.”
 Let’s call the variation around the line when X is considered
“Error 2.”
 But rather than going all the way to standard deviation to
determine error, let’s just stop at the basic measure, Sum of
Squared Deviations.
 Error 1 (E1) = Σ(Y - Ȳ)², also called the “Sum of Squares”
 Error 2 (E2) = Σ(Y - Ŷ)², also called the “Sum of Squared Errors”
[Diagram: Error 1 measured from the data points to the mean of Y; Error 2 measured from the data points to the regression line.]
R-Squared
 Proportional Reduction in Error
 To determine how much taking X into consideration reduces the
variation in Y (at each level of X) we can use a simple formula:
(E1 – E2) / E1, which tells us the proportion or percentage of the original error that is explained by X.
 Error 1 (E1) = Σ(Y - Ȳ)²
 Error 2 (E2) = Σ(Y - Ŷ)²
[Diagram: Error 1 from the data points to the mean of Y; Error 2 from the data points to the line.]
R-squared
r² = (E1 - E2) / E1 = (TSS - SSE) / TSS = [Σ(Y - Ȳ)² - Σ(Y - Ŷ)²] / Σ(Y - Ȳ)²
r² is called the “coefficient of determination”…
It is also the square of the Pearson correlation.
[Diagram: Error 1 from the data points to the mean of Y; Error 2 from the data points to the line.]
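A sketch of the proportional-reduction-in-error calculation on hypothetical data: (TSS - SSE)/TSS matches the square of the Pearson correlation.

```python
# Sketch of the proportional-reduction-in-error formula for r²,
# (TSS - SSE) / TSS, on hypothetical data.
import numpy as np

x = np.array([10., 15., 20., 25., 30., 35., 40.])
y = np.array([8., 6., 7., 4., 3., 2., 1.])

b, a = np.polyfit(x, y, deg=1)
tss = np.sum((y - y.mean()) ** 2)        # Error 1: deviations from the mean
sse = np.sum((y - (a + b * x)) ** 2)     # Error 2: deviations from the line

r_squared = (tss - sse) / tss
print(f"r² = {r_squared:.3f}")
print(f"square of Pearson r = {np.corrcoef(x, y)[0, 1] ** 2:.3f}")  # same value
```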
R-Squared
 R²:
 Is the improvement obtained by using X (and drawing a line through the conditional means) in getting as near as possible to everybody’s value for Y, over just using the mean of Y alone.
 Falls between 0 and 1.
 An R² of 1 means an exact fit (there is no variation of scores around the regression line).
 An R² of 0 means no relationship (as much scatter as in the original Y variable, and a flat regression line through the mean of Y).
 Would be the same for X regressed on Y as for Y regressed on X.
 Can be interpreted as the percentage of variability in Y that is explained by X.
 Some people get hung up on maximizing R², but that is unfortunate: any effect is still a finding--a small R² only indicates that you haven’t told the whole (or much of the) story with your variable.
Error Analysis, SPSS
Some SPSS output (Anti-Gay Marriage regressed on Age):
r² = [Σ(Y - Ȳ)² - Σ(Y - Ŷ)²] / Σ(Y - Ȳ)²
The numerator (line to the mean) is the Regression SS; Σ(Y - Ŷ)² (data points to the line) is the Residual SS; the denominator (data points to the mean) is the original SS for Anti-Gay Marriage.
196.886 ÷ 2853.286 = .069
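A one-line check of the slide's arithmetic, using the Regression and Total sums of squares reported in the output:

```python
# Arithmetic check of the slide's SPSS numbers: Regression SS = 196.886,
# Total SS = 2853.286, so r² = 196.886 / 2853.286 ≈ .069.
reg_ss, total_ss = 196.886, 2853.286
print(f"r² = {reg_ss / total_ss:.3f}")   # 0.069
```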
Error Analysis
Some SPSS output (Anti-Gay Marriage regressed on Age):
r² = [Σ(Y - Ȳ)² - Σ(Y - Ŷ)²] / Σ(Y - Ȳ)²; 196.886 ÷ 2853.286 = .069
[Scatterplot: Age (18 to 89) on the X axis; Anti-Gay Marriage attitude on the Y axis, scored 1 = Strong Support, 2 = Support, 3 = Neutral, 4 = Oppose, 5 = Strong Oppose; M = 2.98.]
Colored lines are examples of:
Distance from each person’s data point to the line or model--new, still unexplained error.
Distance from the line or model to the Mean for each person--reduction in error.
Distance from each person’s data point to the Mean--the original variable’s error.
ANOVA Table
[Diagram: scatterplot of Y on X with the regression line, the mean of Y, and the BSS, WSS, and TSS distances marked; the line intersects the group means.]
Q: Why do I see an ANOVA Table?
A: We “bust up” variance to get R².
Each case has a value for the distance from the line (the conditional mean, Ȳcond) to the grand mean Ȳ, and a value for the distance from its Y value to the line (Ȳcond).
The squared distance from the line to the mean (Regression SS) is equivalent to BSS, df = 1. In ANOVA, everyone in a group shares Ȳgroup.
The squared distance from the line to the data values on Y (Residual SS) is equivalent to WSS, df = n - 2. In ANOVA, everyone in a group shares Ȳgroup.
The ratio of the Regression to the Residual sum of squares, each divided by its degrees of freedom, forms an F distribution in repeated sampling. If F is significant, X explains some of the variation in Y.
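A sketch of the decomposition on hypothetical data: the Regression and Residual sums of squares, their degrees of freedom (1 and n - 2), and the F ratio built from the resulting mean squares.

```python
# Sketch of the regression ANOVA decomposition on hypothetical data:
# TSS = Regression SS + Residual SS, with df 1 and n - 2, and
# F = (Regression SS / 1) / (Residual SS / (n - 2)).
import numpy as np
from scipy import stats

x = np.array([10., 15., 20., 25., 30., 35., 40.])
y = np.array([8., 6., 7., 4., 3., 2., 1.])
n = len(y)

b, a = np.polyfit(x, y, deg=1)
y_hat = a + b * x
reg_ss = np.sum((y_hat - y.mean()) ** 2)   # like BSS: line to the mean
res_ss = np.sum((y - y_hat) ** 2)          # like WSS: points to the line

F = (reg_ss / 1) / (res_ss / (n - 2))
p = stats.f.sf(F, dfn=1, dfd=n - 2)
print(f"Regression SS = {reg_ss:.2f}, Residual SS = {res_ss:.2f}")
print(f"F(1, {n - 2}) = {F:.2f}, p = {p:.4f}")
```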
Dichotomous Variables
[Scatterplot: Y (1–10) against a dichotomous X coded 0 and 1, with the regression line through the two group means; overall Mean = 5; BSS, WSS, and TSS marked.]
Using a dichotomous independent variable, the ANOVA table in bivariate regression will have the same numbers and ANOVA results as a one-way ANOVA table would (and compare this with an independent samples t-test).
0 = Men: Mean = 6
1 = Women: Mean = 4
Slope = -2; Ŷ = 6 - 2X
Regression, Inferential Statistics
Recall that statistics are divided between descriptive and inferential statistics.
Descriptive:
 The equation for your line is a descriptive statistic. It tells you the real, best-fitted line that minimizes squared errors.
Inferential:
 But what about the population? What can we say about the relationship between your variables in the population???
 The inferential statistics are estimates based on the best-fitted line.
Regression, Inferential Statistics
 The significance of F, you already understand.
 The ratio of the Regression (line to the mean of Y) to the Residual (line to data point) sums of squares, each divided by its degrees of freedom, forms an F ratio in repeated sampling.
 Null: r² = 0 in the population. If F exceeds the critical F, then your variables have a relationship in the population (X explains some of the variation in Y).
[Diagram: F distribution with the most extreme 5% of F’s shaded. F = Regression MS / Residual MS.]
Regression, Inferential Statistics
 What about the Slope or “Coefficient”?
 From sample to sample, different slopes would be obtained.
 The slope has a sampling distribution that is normally distributed.
 So we can do a significance test.
[Diagram: normal sampling distribution of the slope, centered on β, with z marked from -3 to 3.]
Regression, Inferential Statistics
Conducting a test of significance for the slope of the regression line:
By slapping the sampling distribution for the slope over a guess of the population’s slope, H₀, one determines whether a sample could have been drawn from a population where the slope is equal to H₀’s value.
1. Two-tailed significance test for α-level = .05
2. Critical t = +/- 1.96
3. To find out whether there is a significant slope in the population,
H₀: β = 0
Hₐ: β ≠ 0
 Collect data
 Calculate t (z): t = (b - β₀) / s.e., where s.e. = √[ (Σ(Y - Ŷ)² / (n - 2)) / Σ(X - X̄)² ]
 Make a decision about the null hypothesis
 Find the p-value
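A sketch of steps 3 onward on hypothetical data: the standard error of b, the t statistic under H₀: β = 0, and the two-tailed p-value.

```python
# Sketch of the slope test on this slide, with hypothetical data:
#   s.e.(b) = sqrt( [Σ(Y - Y-hat)² / (n - 2)] / Σ(X - X̄)² ),  t = (b - 0) / s.e.
import numpy as np
from scipy import stats

x = np.array([10., 15., 20., 25., 30., 35., 40.])
y = np.array([8., 6., 7., 4., 3., 2., 1.])
n = len(y)

b, a = np.polyfit(x, y, deg=1)
sse = np.sum((y - (a + b * x)) ** 2)
se_b = np.sqrt((sse / (n - 2)) / np.sum((x - x.mean()) ** 2))

t = (b - 0) / se_b                        # null hypothesis: beta = 0
p = 2 * stats.t.sf(abs(t), df=n - 2)      # two-tailed p-value
print(f"b = {b:.3f}, s.e. = {se_b:.3f}, t({n - 2}) = {t:.2f}, p = {p:.4f}")
```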
Correlation and Regression
Back to the SPSS output:
The standard error and t appear on the SPSS output …and the p-value too!
Correlation and Regression
Back to the SPSS output:
Ŷ = 1.88 + .023X
So in the GSS example, the slope is significant. There is evidence of a positive relationship in the population between Age and Anti-Gay Marriage sentiment. 6.9% of the variation in Marriage attitude is explained by Age. The older Americans get, the more likely they are to oppose gay marriage.
A one-year increase in age elevates “anti” attitudes by .023 scale units. There is a weak positive correlation: a standard deviation increase in age produces an r (≈ .26) standard deviation increase in “anti” attitudes.