Introduction to
Correlation and Regression
S.V. Bhaskar,
Associate Professor,
Department of Mechanical Engineering,
Sanjivani College of Engineering,
Kopargaon (MS), INDIA.
1
Correlation
 Correlation
 A measure of association between two numerical variables.
 A quantitative relationship between two variables that
measures and describes the strength and direction of
the relationship.
 Example (positive correlation)
 Typically, in the summer, as the temperature
increases people become thirstier.
2
Specific Example
For seven random summer days, a person recorded the
temperature and their water consumption during a
three-hour period spent outside.

Temperature (°F)   Water Consumption (ounces)
      75                     16
      83                     20
      85                     25
      85                     27
      92                     32
      97                     48
      99                     48
3
How would you describe the graph?
4
SCATTER Plot
The simplest method to assess the relationship between two
quantitative variables is to draw a scatter diagram.
As age increases, there is a general tendency for the BP to
increase. But this does not give us a quantitative estimate
of the degree of the relationship.
5
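As a practical aside, a scatter diagram like the ones on these slides can be drawn in a few lines of Python. This is a minimal sketch using matplotlib (an assumption; the slides do not name any software), with the temperature and water-consumption data from the earlier example.

```python
# Minimal sketch: scatter diagram of the temperature/water data.
import matplotlib.pyplot as plt

temp = [75, 83, 85, 85, 92, 97, 99]    # temperature (deg F)
water = [16, 20, 25, 27, 32, 48, 48]   # water consumed (ounces)

plt.scatter(temp, water)
plt.xlabel("Temperature (F)")
plt.ylabel("Water Consumption (ounces)")
plt.title("Temperature vs. water consumption")
plt.show()
```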
Scatter Plots and Types of Correlation
Negative Correlation – as x increases, y decreases
x = hours of training (horizontal axis)
y = number of accidents (vertical axis)
[Scatter plot: Hours of Training (0–20) vs. Accidents (0–60), points trending downward]
6
Scatter Plots and Types of Correlation
Positive Correlation – as x increases, y increases
x = Math SAT score (horizontal axis)
y = GPA (vertical axis)
[Scatter plot: Math SAT (300–800) vs. GPA (1.50–4.00), points trending upward]
7
Scatter Plots and Types of Correlation
No linear correlation
x = Height (horizontal axis)
y = IQ (vertical axis)
[Scatter plot: Height (60–80) vs. IQ (80–160), no visible pattern]
8
How “strong” is the linear relationship?
9
Measuring the Relationship
Pearson's Sample Correlation Coefficient, r
The correlation coefficient is an index of the degree of
association between two variables. It measures the direction
and the strength of the linear association between two
numerical paired variables.
10
Correlation Coefficient “r”
A measure of the strength and direction of a linear
relationship between two variables.
The range of r is from –1 to 1.
 If r is close to 1, there is a strong positive correlation.
 If r is close to –1, there is a strong negative correlation.
 If r is close to 0, there is no linear correlation.
11
When high values of one variable tend to occur with high
values of the other (and low with low), we say that there is
a positive correlation.
When high values of one variable occur with low values of the
other (and vice versa), we say that there is a negative
correlation.
12
A NOTE OF CAUTION
The correlation coefficient is purely a measure of the degree of
association and does not provide any evidence of a
cause-effect relationship.
It is valid only in the range of values studied.
Extrapolation of the association may not always be valid.
E.g.: age & grip strength.
13
r measures the degree of linear relationship.
r = 0 does not necessarily mean that there is no
relationship between the two characteristics under
study; the relationship could be curvilinear.
Spurious correlation:
the production of steel in the UK and the population of India
over the last 25 years may be highly correlated.
14
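The curvilinear caveat is easy to demonstrate. A quick sketch (using numpy, an assumption; the data are illustrative, not from the slides): a perfect quadratic relationship still gives r near 0, because there is no *linear* association.

```python
# Sketch: an exact curvilinear relationship with r = 0.
import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = x**2                         # y is completely determined by x

r = np.corrcoef(x, y)[0, 1]      # Pearson correlation coefficient
print(round(r, 4))               # 0.0 -- no linear association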
r does not give the rate of change in one variable
for changes in the other variable.
E.g.: age & systolic BP – males: r = 0.7; females: r = 0.5.
From this one should not conclude that systolic BP increases
at a higher rate among males than among females.
If the correlation coefficient between height in inches and
weight in pounds is, say, 0.6, the correlation coefficient
between height in cm and weight in kg will also be 0.6.
15
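The unit-invariance claim can be checked numerically. A small sketch (numpy assumed; the height/weight numbers are made up for illustration, not taken from the slides): converting units multiplies each variable by a positive constant, which leaves r unchanged.

```python
# Sketch: r is unchanged by a linear change of units.
import numpy as np

height_in = np.array([60, 64, 66, 68, 70, 72, 75], dtype=float)
weight_lb = np.array([110, 140, 150, 155, 170, 180, 200], dtype=float)

r_original = np.corrcoef(height_in, weight_lb)[0, 1]
r_converted = np.corrcoef(height_in * 2.54,      # inches -> cm
                          weight_lb * 0.4536)[0, 1]  # pounds -> kg

print(round(r_original, 6) == round(r_converted, 6))   # True
```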
Direction of Association
[Two scatter plots: Positive Correlation (left), Negative Correlation (right)]
16
Strength of Linear Association

r value   Interpretation
   1      perfect positive linear relationship
   0      no linear relationship
  -1      perfect negative linear relationship
17
Strength of Linear Association
18
Other Strengths of Association

r value   Interpretation
  0.9     strong association
  0.5     moderate association
  0.25    weak association
19
Other Strengths of Association
20
Calculation of r

r = Σ[(xi – x̄)(yi – ȳ)] / [(n – 1) · sx · sy]

Σ = the sum
n = number of paired items
xi = input variable;  yi = output variable
x̄ = x-bar = mean of x’s
ȳ = y-bar = mean of y’s
sx = standard deviation of x’s
sy = standard deviation of y’s
21
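A direct translation of this formula into plain Python (a sketch, using the sample standard deviation with n – 1 in the denominator, consistent with the worked example on the next slides):

```python
# Pearson's r:  r = sum((x - xbar)(y - ybar)) / ((n - 1) * sx * sy)
import math

def pearson_r(xs, ys):
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    cov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (sx * sy)
```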
Application
[Scatter plot: Absences (x, 0–16) vs. FinalGrade (y, 40–95)]

Absences   Final Grade
    8          78
    2          92
    5          90
   12          58
   15          43
    9          74
    6          81
22
Computation of r

      x     y      xy     x²     y²
 1    8    78     624     64    6084
 2    2    92     184      4    8464
 3    5    90     450     25    8100
 4   12    58     696    144    3364
 5   15    43     645    225    1849
 6    9    74     666     81    5476
 7    6    81     486     36    6561
 Σ   57   516    3751    579   39898

r = Cov(x, y) / [s.d.(x) · s.d.(y)]
23
COMPUTATION OF THE
CORRELATION COEFFICIENT

Covariance (X, Y),  n = 7

   X     Y    (X – X̄)   (Y – Ȳ)   (X – X̄)(Y – Ȳ)
   8    12       1         0            0
   3     9      -4        -3           12
   4    10      -3        -2            6
  10    15       3         3            9
   6    11      -1        -1            1
   7    12       0         0            0
  11    15       4         3           12
Sum:  49    84       0         0           40

X̄ = ΣX/n = 49/7 = 7        Ȳ = ΣY/n = 84/7 = 12

Cov(X, Y) = Σ(X – X̄)(Y – Ȳ) / (n – 1) = 40/6 = 6.67

r = Cov(X, Y) / [s.d.(X) · s.d.(Y)] = 6.67 / (2.94 × 2.31) = 0.98
24
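Checking this worked example with the pearson_r sketch defined after the "Calculation of r" slide:

```python
# Verify the slide's computation: r should come out to 0.98.
X = [8, 3, 4, 10, 6, 7, 11]
Y = [12, 9, 10, 15, 11, 12, 15]
print(round(pearson_r(X, Y), 2))   # 0.98, matching the slide
```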
Standard Deviation
The most widely used measure of dispersion: σ (sigma).
It is the square root of the average of the squared deviations:

σ = √[ Σ(Xi – X̄)² / n ]
25
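Note that this slide's σ divides by n (the population form), while the covariance computation above divided by n – 1 (the sample form). A small sketch showing both, applied to the X data from the worked example:

```python
# Population vs. sample standard deviation.
import math

def std_dev(xs, sample=True):
    n = len(xs)
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    return math.sqrt(ss / (n - 1 if sample else n))

X = [8, 3, 4, 10, 6, 7, 11]
print(round(std_dev(X), 2))                 # 2.94 (sample, n - 1)
print(round(std_dev(X, sample=False), 2))   # 2.73 (population, n)
```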
Regression Analysis
BIVARIATE LINEAR REGRESSION
Regression: a method of describing the relationship
between two variables.
Use: to predict the value of one variable given the other,
and to use data to analyse the relationship.
26
BIVARIATE REGRESSION ANALYSIS
Hypothesis (proposed) → Data collection → Analysis →
Verification of hypothesis
Hypothesis: Workers are rewarded with greater salaries
as their experience increases.
27
Simple Linear Regression
 Statistical method for finding
 the “line of best fit”
 for one response (dependent)
numerical variable
 based on one explanatory
(independent) variable.
28
Regression: 3 Main Purposes
 To describe (or model)
 To predict (or estimate)
 To control (or administer)
29
The Line of Regression
The linear regression model assumes that the mean values of Y
for given values of X are a linear function of X:

E(Y | Xi) = α + βXi

E(Y | Xi) is the expected value of Y for a given value of Xi.
The difference between the actual and expected value is denoted
by ui, the error term.
30
The Line of Regression
• Population Regression Function (PRF): describes the entire population.
• When it is not possible to collect data for the entire population,
samples are collected from the population to predict how the
population behaves.
• This gives the SRF – Sample Regression Function.
31
The Line of Regression
• SRF – Sample Regression Function:
  Ŷi = a + bXi
• a & b are sample versions of the population’s α & β.
32
The Line of Regression
• The intercept ‘a’ is a sample estimate of the population’s α;
‘b’ is that of β.
• Yi = Ŷi + ei = a + bXi + ei   (sample)
  Yi = α + βXi + ui            (population)
• ei is the error term (residual), analogous to ui.
33
The Line of Regression
Once we know there is a significant linear correlation, we can
write an equation describing the relationship between the x and
y variables. This equation is called the line of regression or
least squares line: the sum of squared errors is minimised.
34
Least Squares Regression
 GOAL – minimize the sum of the squares of the errors of the
data points. This minimizes the Mean Square Error.
35
[Scatter plot: Ad $ (1.5–3.0) vs. revenue (180–260) with the best
fitting straight line; for each data point (xi, yi), the residual
is the vertical distance to the point on the line with the same
x-value.]
36
The Line of Regression
The equation of a line may be written as y = a + bXi,
where b is the slope of the line and a is the y-intercept.

The line of regression is:  ŷ = a + bx

The slope b is:  b = Σ(x – x̄)(y – ȳ) / Σ(x – x̄)²

The y-intercept a is:  a = ȳ – b·x̄
37
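A sketch of these least-squares formulas in plain Python, applied to the absences/final-grade data used in the slides that follow:

```python
# Least squares: b = sum((x-xbar)(y-ybar)) / sum((x-xbar)^2),
#                a = ybar - b * xbar
x = [8, 2, 5, 12, 15, 9, 6]           # number of absences
y = [78, 92, 90, 58, 43, 74, 81]      # final grades
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
    / sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar
print(round(b, 3), round(a, 3))       # -3.924 105.667
```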
Steps to Reaching a Solution
 Draw a scatterplot of the data.
 Visually, consider the strength of the
linear relationship.
 If the relationship appears relatively
strong, find the correlation coefficient
as a numerical verification.
 If the correlation is still relatively
strong, then find the simple linear
regression line.
41
Example of Grade & no. of absences
Calculate b and a. Write the equation of the line of regression
with x = number of absences and y = final grade.

      x     y      xy     x²     y²
 1    8    78     624     64    6084
 2    2    92     184      4    8464
 3    5    90     450     25    8100
 4   12    58     696    144    3364
 5   15    43     645    225    1849
 6    9    74     666     81    5476
 7    6    81     486     36    6561
 Σ   57   516    3751    579   39898

The line of regression is:  ŷ = –3.924x + 105.667
42
The Line of Regression
[Scatter plot: Absences (0–16) vs. FinalGrade (40–95) with the
fitted line.]
b = –3.924 and a = 105.667
The line of regression is:  ŷ = 105.667 – 3.924x
Note that the point (x̄, ȳ) = (8.143, 73.714) is on the line.
43
The Line of Regression
• If the CLRM (Classical Linear Regression Model) assumptions are
satisfied, then the OLS regression line provides the best possible
estimate of the population regression line, i.e. OLS is
BLUE – the Best Linear Unbiased Estimator.
• Linear: Yi = a + bXi + ei, with a & b raised to the power 1.
• Unbiased: a 1st sample gives some value of b; a 2nd sample is
likely to give a different value of b; but the average of all the
b’s equals β.
• Best: a & b bounce around from sample to sample. If unbiased,
they have mean values equal to α & β; to be best, they must
bounce around the least.
44
Model Performance & Evaluation
• Goodness of fit: how well does the line fit the data?
• How well does the model as a whole perform?
• How confident are we that our sample results are a good
reflection of the population’s behavior?
45
Goodness of Fit
 Coefficient of Determination – R²
 R² = ESS/TSS
 General interpretation: the coefficient of determination tells
the percent of the variation in the response variable that is
explained (determined) by the model and the explanatory
variable.
 What proportion of the behavior of the dependent variable
is explained by the independent variable?
 R² = 1: the total behavior of Y is explained by X.
 R² = 0: X may tell nothing about Y.
 0 ≤ R² ≤ 1
46
Goodness of Fit
 Coefficient of Determination – R²
 R² = ESS/TSS
 TSS = Σ(Yi – Ȳ)², which tells how much the values of Y
bounce around their mean.
 Part of TSS is explained by the model – ESS:
ESS = Σ(Ŷi – Ȳ)²
 The remainder is unexplained – RSS:
RSS = Σei²
 R² = explained sum of squares / total sum of squares,
0 ≤ R² ≤ 1
47
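A sketch of this TSS/ESS/RSS decomposition for the fitted line ŷ = 105.667 – 3.924x on the absences/grade data (the rounded coefficients mean the printed ratio matches R² ≈ 0.9506 only up to rounding):

```python
# R^2 = ESS / TSS for the fitted absences/grade line.
x = [8, 2, 5, 12, 15, 9, 6]
y = [78, 92, 90, 58, 43, 74, 81]
a, b = 105.667, -3.924

ybar = sum(y) / len(y)
yhat = [a + b * xi for xi in x]

tss = sum((yi - ybar) ** 2 for yi in y)               # total
ess = sum((yh - ybar) ** 2 for yh in yhat)            # explained
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # residual

print(round(ess / tss, 2))        # 0.95
```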
Example of Grade & no. of absences
The line of regression is:  ŷ = –3.924x + 105.667

      x     y      xy     x²     y²
 1    8    78     624     64    6084
 2    2    92     184      4    8464
 3    5    90     450     25    8100
 4   12    58     696    144    3364
 5   15    43     645    225    1849
 6    9    74     666     81    5476
 7    6    81     486     36    6561
 Σ   57   516    3751    579   39898

The correlation coefficient of number of times absent and
final grade is r = –0.975.
The coefficient of determination is R² = (–0.975)² = 0.9506.
Interpretation: About 95% of the variation in final grades can be
explained by the number of times a student is absent. The other
5% is unexplained and can be due to sampling error or other
variables such as intelligence, amount of time studied, etc.
48
Example

Temperature (°F)   Water Consumption (ounces)
      75                     16
      83                     20
      85                     25
      85                     27
      92                     32
      97                     48
      99                     48
49
5. Interpreting and Visualizing
 Interpreting the result: y = a + bXi
 The value of b is the slope
 The value of a is the y-intercept
 r is the correlation coefficient
 R² is the coefficient of determination
50
Interpretation in Context
 Regression Equation:  ŷ = –96.9 + 1.5Xi
Water Consumption = 1.5 × Temperature – 96.9
51
Interpretation in Context
 Slope = 1.5 (ounces)/(degree F)
 For each 1 degree F increase in temperature, you expect an
increase of 1.5 ounces in the amount of water drunk.
52
Interpretation in Context
 y-intercept = –96.9
 For this example, when the temperature is 0 degrees F, the
model says a person would drink about –97 ounces of water.
 That does not make any sense!
 Our model is not applicable for x = 0.
53
Prediction Example
 Predict the amount of
water a person would drink when the
temperature is 95 degrees F.
 Solution: Substitute the value of x=95
(degrees F) into the regression equation
and solve for y (water consumption).
If x=95, y=1.5*95 - 96.9 = 45.6 ounces.
54
Interpretation of R²
 Example: R² = 92.7%.
 Interpretation:
 Almost 93% of the variability in the amount of water
consumed is explained by outside temperature using this model.
 Note: Therefore about 7% of the variation in the amount of
water consumed is not explained by this model using temperature.
55
Standard Error
 The positive square root of the variance of the error.
 It is a measure used to judge the reliability of a & b as
estimates of α & β.
 Two important things about the standard error:
 Its unit is the same as that of the dependent variable.
 Its size is judged relative to the value of the estimated
coefficient.
 The t stat (t statistic) gives the size of the estimated
coefficient relative to its standard error:
 t stat = estimated coefficient / std. error
 Values of the t stat above 5 in magnitude suggest the
corresponding coefficient is a reliable estimate of α or β.
56
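The slides do not show how these quantities are obtained in practice. As an illustrative sketch (using SciPy, an assumption; the deck does not reference any software), linregress reports the slope's standard error directly, and the t statistic follows from the ratio above:

```python
# Slope, standard error, and t statistic for the absences/grade data.
from scipy.stats import linregress

x = [8, 2, 5, 12, 15, 9, 6]           # absences
y = [78, 92, 90, 58, 43, 74, 81]      # final grades

res = linregress(x, y)
t_stat = res.slope / res.stderr
# slope ~ -3.92, stderr ~ 0.40, so |t| ~ 9.8 here
print(res.slope, res.stderr, t_stat, res.pvalue)
```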
Null Hypothesis
 Hypothesis: Workers are rewarded with greater salaries as
their experience increases.
 The relation between salary & experience is not an
established fact; it is rather a theory or hypothesis that is
put forth.
 It could be wrong & there may be no relationship.
 The task of regression analysis is to check this hypothesis.
 If Y has nothing to do with X, then β would be 0.
 If β is truly 0, then βXi = 0 & hence Y is independent of X.
 Whether or not β is really 0 is a hypothesis – the null
hypothesis.
 Let H0: β = 0 – Null Hypothesis
     H1: β ≠ 0 – Alternate Hypothesis
57
p-Value
 How reliable are a & b?
 The p values for a & b represent probabilities that we can
use to test the null hypothesis.
 Example: say b = 0.880 and the p value for b is 0.0005,
which is very small.
 Interpretation: the hypothesis that β = 0 is not true, i.e.
we can reject H0: β = 0 (the null hypothesis) with great
confidence & accept H1: β ≠ 0.
 Example: if the p value for b is 0.2, the chance of getting
the value of b we found, or greater, when β = 0 is about 20%.
58
When to Reject the Null Hypothesis
 When the associated p values are 0.05 or smaller
 (95% confidence level)
 In some cases, p values of 0.1 or smaller are used
 (90% confidence level)
59
Real Life Applications
Cost Estimating for Future Space
Flight Vehicles (Multiple
Regression)
60
Nonlinear Application
Predicting when Solar Maximum Will
Occur
61
Real Life Applications
 Estimating Seasonal Sales for
Department Stores (Periodic)
62
Thank You
63