2. 2
GOALS
• Define a hypothesis and hypothesis testing.
• Describe the five-step hypothesis-testing procedure.
• Distinguish between a one-tailed and a two-tailed
test of hypothesis.
• Conduct a test of hypothesis about a population
mean.
• Conduct a test of hypothesis about a population
proportion.
• Define Type I and Type II errors.
• Compute the probability of a Type II error.
3. 3
What is a Hypothesis?
A Hypothesis is a statement about the
value of a population parameter
developed for the purpose of testing.
Examples of hypotheses made about a
population parameter are:
– The mean monthly income for systems analysts is
$3,625.
– Twenty percent of all customers at Bovine’s Chop
House return for another meal within a month.
4. 4
What is Hypothesis Testing?
Hypothesis testing is a procedure, based
on sample evidence and probability
theory, used to determine whether the
hypothesis is a reasonable statement
and should not be rejected, or is
unreasonable and should be rejected.
6. 6
Important Things to Remember about H0 and H1
• H0: null hypothesis and H1: alternate hypothesis
• H0 and H1 are mutually exclusive and collectively exhaustive
• H0 is always presumed to be true
• H1 has the burden of proof
• A random sample (n) is used to “reject H0”
• If we conclude “do not reject H0”, this does not necessarily mean
that the null hypothesis is true; it only suggests that there is not
sufficient evidence to reject H0. Rejecting the null hypothesis,
then, suggests that the alternate hypothesis may be true.
• Equality is always part of H0 (e.g. “=”, “≥”, “≤”).
• “≠”, “<”, and “>” are always part of H1
7. 7
How to Set Up a Claim as Hypothesis
• In actual practice, the status quo is set up as H0
• If the claim is “boastful”, the claim is set up as H1
(we apply the Missouri rule: “show me”).
Remember, H1 has the burden of proof
• In problem solving, look for key words and
convert them into symbols. Some key words
include: “improved”, “better than”, “as effective as”,
“different from”, “has changed”, etc.
8. 8
Left-tail or Right-tail Test?
Keywords                      Inequality Symbol      Part of:
Larger (or more) than         >                      H1
Smaller (or less)             <                      H1
No more than                  ≤                      H0
At least                      ≥                      H0
Has increased                 >                      H1
Is there difference?          ≠                      H1
Has not changed               =                      H0
Has “improved”, “is better    < or >, depending      H1
than”, “is more effective”    on the variable
• The direction of the test involving
claims that use the words “has
improved”, “is better than”, and the like
will depend upon the variable being
measured.
• For instance, if the variable involves
time for a certain medication to take
effect, the words “better”, “improve”, or
“more effective” are translated as “<”
(less than, i.e. faster relief).
• On the other hand, if the variable
refers to a test score, then the words
“better”, “improve”, or “more effective”
are translated as “>” (greater than, i.e.
higher test scores).
14. 14
Testing for a Population Mean with a
Known Population Standard Deviation- Example
Jamestown Steel Company
manufactures and assembles desks
and other office equipment at
several plants in western New York
State. The weekly production of the
Model A325 desk at the Fredonia
Plant follows the normal probability
distribution with a mean of 200 and
a standard deviation of 16.
Recently, because of market
expansion, new production
methods have been introduced and
new employees hired. The vice
president of manufacturing would
like to investigate whether there has
been a change in the weekly
production of the Model A325 desk.
15. 15
Testing for a Population Mean with a
Known Population Standard Deviation- Example
Step 1: State the null hypothesis and the alternate
hypothesis.
H0: µ = 200
H1: µ ≠ 200
(note: keyword in the problem “has changed”)
Step 2: Select the level of significance.
α = 0.01 as stated in the problem
Step 3: Select the test statistic.
Use Z-distribution since σ is known
16. 16
Testing for a Population Mean with a
Known Population Standard Deviation- Example
Step 4: Formulate the decision rule.
Reject H0 if |Z| > Zα/2
$$|Z| = \left|\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right| = \left|\frac{203.5 - 200}{16/\sqrt{50}}\right| = 1.55$$
$$Z_{\alpha/2} = Z_{.01/2} = 2.58$$
1.55 is not > 2.58
Step 5: Make a decision and interpret the result.
Because 1.55 does not fall in the rejection region, H0 is not
rejected. We cannot conclude that the population mean is different from
200. So we would report to the vice president of manufacturing that the
sample evidence does not show that the production rate at the Fredonia
Plant has changed from 200 per week.
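The five steps for the Jamestown Steel example can be checked numerically. The following is a minimal Python sketch, not part of the original slides, using only the standard library. The variable names are illustrative, and the critical value is taken from `statistics.NormalDist` instead of a printed table (about 2.576, which the slide rounds to 2.58).

```python
from math import sqrt
from statistics import NormalDist

# Values from the example: H0: mu = 200, sigma = 16, n = 50,
# sample mean = 203.5, alpha = 0.01 (two-tailed test).
mu0, sigma, n, x_bar, alpha = 200, 16, 50, 203.5, 0.01

# Test statistic: z = (x_bar - mu0) / (sigma / sqrt(n))
z = (x_bar - mu0) / (sigma / sqrt(n))

# Two-tailed critical value z_{alpha/2} from the standard normal distribution
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

# Two-tailed p-value: probability of |Z| at least this large when H0 is true
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

reject_h0 = abs(z) > z_crit
print(f"z = {z:.2f}, critical value = {z_crit:.2f}, p-value = {p_value:.4f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

The sketch reaches the same decision as Step 5: the statistic is inside the acceptance region, so H0 is not rejected.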
17. 17
Suppose in the previous problem the vice
president wants to know whether there has
been an increase in the number of units
assembled. To put it another way, can we
conclude, because of the improved
production methods, that the mean number
of desks assembled in the last 50 weeks was
more than 200?
Recall: µ = 200, σ = 16, n = 50, α = .01
Testing for a Population Mean with a Known
Population Standard Deviation- Another Example
18. 18
Testing for a Population Mean with a Known
Population Standard Deviation- Example
Step 1: State the null hypothesis and the alternate
hypothesis.
H0: µ ≤ 200
H1: µ > 200
(note: keyword in the problem “an increase”)
Step 2: Select the level of significance.
α = 0.01 as stated in the problem
Step 3: Select the test statistic.
Use Z-distribution since σ is known
19. 19
Testing for a Population Mean with a Known
Population Standard Deviation- Example
Step 4: Formulate the decision rule.
Reject H0 if Z > Zα, where Zα = Z.01 = 2.33
Step 5: Make a decision and interpret the result.
Because 1.55 does not fall in the rejection region, H0 is not rejected.
We cannot conclude that the average number of desks assembled in the last
50 weeks is more than 200.
20. 20
Type of Errors in Hypothesis Testing
• Type I Error:
– Rejecting the null hypothesis when it is actually true.
– The probability of a Type I error is denoted by the Greek letter “α”
– This probability is also known as the significance level of a test
• Type II Error:
– “Accepting” (failing to reject) the null hypothesis when it is
actually false.
– The probability of a Type II error is denoted by the Greek letter “β”
21. 21
p-Value in Hypothesis Testing
• The p-VALUE is the probability of observing a sample
value as extreme as, or more extreme than, the
value observed, given that the null hypothesis is
true.
• In testing a hypothesis, we can also compare the
p-value with the significance level (α).
• If the p-value < significance level, H0 is rejected; otherwise
H0 is not rejected.
22. 22
p-Value in Hypothesis Testing - Example
Recall the last problem where the
hypothesis and decision rules
were set up as:
H0: µ ≤ 200
H1: µ > 200
Reject H0 if Z > Zα
where Z = 1.55 and Zα = 2.33
Reject H0 if p-value < α
p-value = P(Z ≥ 1.55) = 0.0606, and 0.0606 is not < 0.01
Conclude: Fail to reject H0
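The p-value quoted in this example can be reproduced from the standard normal distribution. This is a small illustrative sketch (not from the original slides) that assumes the right-tailed setup of the example and uses only Python's standard library.

```python
from statistics import NormalDist

z = 1.55      # computed test statistic from the example
alpha = 0.01  # significance level from the example

# Right-tailed test (H1: mu > 200): p-value = P(Z >= z) when H0 is true
p_value = 1 - NormalDist().cdf(z)

print(f"p-value = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

Comparing the p-value with α gives the same decision as comparing Z with the critical value, because both express the same tail area.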
23. 23
What does it mean when p-value < α?
If the p-value is less than:
(a) .10, we have some evidence that H0 is not true.
(b) .05, we have strong evidence that H0 is not true.
(c) .01, we have very strong evidence that H0 is not true.
(d) .001, we have extremely strong evidence that H0 is not
true.
24. 24
Testing for the Population Mean: Population
Standard Deviation Unknown
• When the population standard deviation (σ) is
unknown, the sample standard deviation (s) is used in
its place
• The test statistic follows the t-distribution with n - 1 degrees of
freedom and is computed using the formula:
$$t = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$
25. 25
Testing for the Population Mean: Population
Standard Deviation Unknown - Example
The McFarland Insurance Company Claims Department reports the mean
cost to process a claim is $60. An industry comparison showed this
amount to be larger than most other insurance companies, so the
company instituted cost-cutting measures. To evaluate the effect of the
cost-cutting measures, the Supervisor of the Claims Department
selected a random sample of 26 claims processed last month. The
sample information is reported below.
At the .01 significance level, is it reasonable to conclude that the mean cost to process a claim is now less than $60?
26. 26
Testing for a Population Mean with an
Unknown Population Standard Deviation- Example
Step 1: State the null hypothesis and the alternate
hypothesis.
H0: µ ≥ $60
H1: µ < $60
(note: keyword in the problem “now less than”)
Step 2: Select the level of significance.
α = 0.01 as stated in the problem
Step 3: Select the test statistic.
Use t-distribution since σ is unknown
28. 28
Testing for the Population Mean: Population
Standard Deviation Unknown – Minitab Solution
29. 29
Testing for a Population Mean with an
Unknown Population Standard Deviation- Example
Step 4: Formulate the decision rule.
Reject H0 if t < -tα,n-1, that is, if t < -t.01,25 = -2.485
Step 5: Make a decision and interpret the result.
Because -1.818 does not fall in the rejection region, H0 is not rejected at the
.01 significance level. We have not demonstrated that the cost-cutting
measures reduced the mean cost per claim to less than $60. The difference
of $3.58 ($56.42 - $60) between the sample mean and the hypothesized population mean
could be due to sampling error.
30. 30
The current rate for producing 5 amp fuses at Neary
Electric Co. is 250 per hour. A new machine has
been purchased and installed that, according to the
supplier, will increase the production rate. A sample
of 10 randomly selected hours from last month
revealed the mean hourly production on the new
machine was 256 units, with a sample standard
deviation of 6 per hour.
At the .05 significance level can Neary conclude that
the new machine is faster?
Testing for a Population Mean with an Unknown
Population Standard Deviation- Example
31. 31
Testing for a Population Mean with an
Unknown Population Standard Deviation- Example continued
Step 1: State the null and the alternate hypothesis.
H0: µ ≤ 250; H1: µ > 250
Step 2: Select the level of significance.
It is .05.
Step 3: Find a test statistic. Use the t distribution
because the population standard deviation is not
known and the sample size is less than 30.
32. 32
Testing for a Population Mean with an
Unknown Population Standard Deviation- Example continued
Step 4: State the decision rule.
There are 10 – 1 = 9 degrees of freedom. The null
hypothesis is rejected if t > 1.833.
Step 5: Make a decision and interpret the results.
$$t = \frac{\bar{X} - \mu}{s/\sqrt{n}} = \frac{256 - 250}{6/\sqrt{10}} = 3.162$$
Because 3.162 > 1.833, the null hypothesis is rejected. The mean number produced is
more than 250 per hour.
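The Neary Electric computation can be verified with a few lines of Python. This is an illustrative sketch, not part of the original slides. The critical value 1.833 is the table value quoted in Step 4 and is entered by hand, since the standard library has no t-distribution quantile function.

```python
from math import sqrt

# Values from the example: H0: mu <= 250, sample mean 256,
# sample standard deviation 6, n = 10, alpha = .05 (right-tailed test).
mu0, x_bar, s, n = 250, 256, 6, 10

df = n - 1       # degrees of freedom
t_crit = 1.833   # t(.05, 9) as quoted in Step 4

# Test statistic: t = (x_bar - mu0) / (s / sqrt(n))
t = (x_bar - mu0) / (s / sqrt(n))

reject_h0 = t > t_crit
print(f"t = {t:.3f} with {df} degrees of freedom")
print("Reject H0" if reject_h0 else "Do not reject H0")
```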
33. 33
Tests Concerning Proportion
• A Proportion is the fraction or percentage that indicates the part of
the population or sample having a particular trait of interest.
• The sample proportion is denoted by p and is found by x/n
• The test statistic is computed as follows:
34. 34
Assumptions in Testing a Population Proportion
using the z-Distribution
• A random sample is chosen from the population.
• It is assumed that the binomial assumptions discussed in
Chapter 6 are met:
(1) the sample data collected are the result of counts;
(2) the outcome of an experiment is classified into one of two
mutually exclusive categories: a “success” or a “failure”;
(3) the probability of a success is the same for each trial; and
(4) the trials are independent
• The test we will conduct shortly is appropriate when both nπ
and n(1 - π) are at least 5, where π is the hypothesized population proportion.
• When the above conditions are met, the normal distribution can
be used as an approximation to the binomial distribution
35. 35
Test Statistic for Testing a Single
Population Proportion
$$z = \frac{p - \pi}{\sqrt{\pi(1 - \pi)/n}}$$
where p is the sample proportion, π is the hypothesized
population proportion, and n is the sample size.
36. 36
Test Statistic for Testing a Single
Population Proportion - Example
Suppose prior elections in a certain state indicated
it is necessary for a candidate for governor to
receive at least 80 percent of the vote in the
northern section of the state to be elected. The
incumbent governor is interested in assessing
his chances of returning to office and plans to
conduct a survey of 2,000 registered voters in
the northern section of the state. Using the
hypothesis-testing procedure, assess the
governor’s chances of reelection.
37. 37
Test Statistic for Testing a Single
Population Proportion - Example
Step 1: State the null hypothesis and the alternate
hypothesis.
H0: π ≥ .80
H1: π < .80
(note: keyword in the problem “at least”)
Step 2: Select the level of significance.
α = 0.05, the level used in the Step 5 decision
Step 3: Select the test statistic.
Use Z-distribution since the assumptions are met
and nπ and n(1 - π) ≥ 5
38. 38
Testing for a Population Proportion - Example
Step 4: Formulate the decision rule.
Reject H0 if Z < -Zα
Step 5: Make a decision and interpret the result.
$$z = \frac{p - \pi}{\sqrt{\pi(1 - \pi)/n}} = \frac{0.775 - 0.80}{\sqrt{0.80(0.20)/2{,}000}} = -2.80$$
The computed value of z (-2.80) is in the rejection region, so the null hypothesis is rejected
at the .05 level. The difference of 2.5 percentage points between the sample percent (77.5
percent) and the hypothesized population percent (80) is statistically significant. The
evidence at this point does not support the claim that the incumbent governor will return to
the governor’s mansion for another four years.
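The proportion test for the governor example can be sketched in Python as follows. This is an illustration and not part of the original slides. It assumes the sample percent of 77.5 quoted in Step 5 and the .05 significance level used in that step, and it takes the critical value from `statistics.NormalDist` (about -1.645) instead of a table.

```python
from math import sqrt
from statistics import NormalDist

pi0 = 0.80     # hypothesized population proportion (H0: pi >= .80)
n = 2000       # number of voters surveyed
p_hat = 0.775  # sample proportion quoted in Step 5
alpha = 0.05   # level used in the Step 5 decision (assumption of this sketch)

# Normal approximation is appropriate when n*pi and n*(1 - pi) are at least 5
normal_ok = n * pi0 >= 5 and n * (1 - pi0) >= 5

# Test statistic: z = (p - pi) / sqrt(pi * (1 - pi) / n)
z = (p_hat - pi0) / sqrt(pi0 * (1 - pi0) / n)

# Left-tailed test: reject H0 when z falls below the lower critical value
z_crit = NormalDist().inv_cdf(alpha)
p_value = NormalDist().cdf(z)

reject_h0 = z < z_crit
print(f"z = {z:.2f}, critical value = {z_crit:.3f}, p-value = {p_value:.4f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```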
39. 39
Type II Error
• Recall that a Type I error is rejecting the null hypothesis
when it is actually true. Its probability, the level of
significance, is denoted by the Greek letter “α”.
• A Type II error is “accepting” the null hypothesis
when it is actually false. Its probability is denoted by
the Greek letter “β”.
40. 40
Type II Error - Example
A manufacturer purchases steel bars to make cotter
pins. Past experience indicates that the mean tensile
strength of all incoming shipments is 10,000 psi and
that the standard deviation, σ, is 400 psi. In order to
make a decision about incoming shipments of steel
bars, the manufacturer set up this rule for the quality-
control inspector to follow: “Take a sample of 100
steel bars. At the .05 significance level if the sample
mean strength falls between 9,922 psi and 10,078
psi, accept the lot. Otherwise the lot is to be
rejected.”
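The probability of a Type II error depends on what the true mean actually is, which the slide text does not specify. The sketch below (not from the original slides) first checks that the acceptance limits match 10,000 ± 1.96 standard errors, and then computes β for a hypothetical true mean of 9,900 psi. That value of 9,900 is an assumption chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n = 10_000, 400, 100   # values from the example
lower, upper = 9_922, 10_078       # acceptance limits from the example
se = sigma / sqrt(n)               # standard error of the mean (40 psi)

# Check: at the .05 level (two-tailed) the limits are mu0 +/- z(.025) * se
z_half = NormalDist().inv_cdf(0.975)
limits = (mu0 - z_half * se, mu0 + z_half * se)

# Hypothetical true mean (an assumption, not given in the slides)
mu1 = 9_900

# beta = P(lower <= sample mean <= upper) when the true mean is mu1,
# i.e. the chance of accepting the lot although H0 is false
sampling_dist = NormalDist(mu=mu1, sigma=se)
beta = sampling_dist.cdf(upper) - sampling_dist.cdf(lower)

print(f"acceptance limits = {limits[0]:.1f} to {limits[1]:.1f}")
print(f"beta for a true mean of {mu1} = {beta:.4f}")
```

Trying other values of `mu1` shows how β shrinks as the true mean moves farther from 10,000.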
46. 2
GOALS
• Conduct a test of a hypothesis about the difference
between two independent population means.
• Conduct a test of a hypothesis about the difference
between two population proportions.
• Conduct a test of a hypothesis about the mean
difference between paired or dependent
observations.
• Understand the difference between dependent and
independent samples.
47. 3
Comparing two populations – Some
Examples
• Is there a difference in the mean value of residential real
estate sold by male agents and female agents in south
Florida?
• Is there a difference in the mean number of defects
produced on the day and the afternoon shifts at Kimble
Products?
• Is there a difference in the mean number of days absent
between young workers (under 21 years of age) and older
workers (more than 60 years of age) in the fast-food
industry?
• Is there a difference in the proportion of Ohio State
University graduates and University of Cincinnati graduates
who pass the state Certified Public Accountant Examination
on their first attempt?
• Is there an increase in the production rate if music is piped
into the production area?
48. 4
Comparing Two Population Means
• No assumptions about the shape of the populations are
required.
• The samples are from independent populations.
• The formula for computing the value of z is:
Use if sample sizes > 30 or if σ1 and σ2 are known:
$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$
Use if sample sizes > 30 and if σ1 and σ2 are unknown:
$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
49. 5
EXAMPLE 1
The U-Scan facility was recently installed at the Byrne
Road Food-Town location. The store manager would
like to know if the mean checkout time using the
standard checkout method is longer than using the U-
Scan. She gathered the following sample information.
The time is measured from when the customer enters
the line until their bags are in the cart. Hence the time
includes both waiting in line and checking out.
50. 6
EXAMPLE 1 continued
Step 1: State the null and alternate hypotheses.
H0: µS ≤ µU
H1: µS > µU
Step 2: State the level of significance.
The .01 significance level is stated in the problem.
Step 3: Find the appropriate test statistic.
Because both samples are more than 30, we can use z-distribution
as the test statistic.
52. 8
Example 1 continued
Step 5: Compute the value of z and make a decision
$$z = \frac{\bar{X}_s - \bar{X}_u}{\sqrt{\dfrac{\sigma_s^2}{n_s} + \dfrac{\sigma_u^2}{n_u}}} = \frac{5.5 - 5.3}{\sqrt{\dfrac{0.40^2}{50} + \dfrac{0.30^2}{100}}} = \frac{0.2}{0.064} = 3.13$$
The computed value of 3.13 is larger than the
critical value of 2.33. Our decision is to reject the
null hypothesis. The difference of .20 minutes
between the mean checkout times using the
standard method and the U-Scan is too large to
have occurred by chance. We conclude the U-Scan
method is faster.
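The U-Scan computation can be reproduced with the short Python sketch below, which is an illustration and not part of the original slides. The means, standard deviations, and sample sizes are read from the Step 5 formula. Without rounding the standard error to 0.064, z comes out near 3.12 instead of 3.13, and the decision is unchanged.

```python
from math import sqrt
from statistics import NormalDist

# Standard checkout: mean 5.5, std dev 0.40, n = 50
# U-Scan checkout:   mean 5.3, std dev 0.30, n = 100
x_s, sd_s, n_s = 5.5, 0.40, 50
x_u, sd_u, n_u = 5.3, 0.30, 100
alpha = 0.01  # right-tailed test, H1: mu_S > mu_U

# Standard error of the difference between two independent sample means
se = sqrt(sd_s**2 / n_s + sd_u**2 / n_u)

# Test statistic: z = (x_s - x_u) / se
z = (x_s - x_u) / se

z_crit = NormalDist().inv_cdf(1 - alpha)  # about 2.33
reject_h0 = z > z_crit
print(f"standard error = {se:.3f}, z = {z:.2f}, critical value = {z_crit:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```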
53. 9
Two-Sample Tests about Proportions
Here are several examples.
• The vice president of human resources wishes to know whether
there is a difference in the proportion of hourly employees who
miss more than 5 days of work per year at the Atlanta and the
Houston plants.
• General Motors is considering a new design for the Pontiac
Grand Am. The design is shown to a group of potential buyers
under 30 years of age and another group over 60 years of age.
Pontiac wishes to know whether there is a difference in the
proportion of the two groups who like the new design.
• A consultant to the airline industry is investigating the fear of
flying among adults. Specifically, the company wishes to know
whether there is a difference in the proportion of men versus
women who are fearful of flying.
54. 10
Two Sample Tests of Proportions
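• We investigate whether two samples came from
populations with an equal proportion of successes.
• The two samples are pooled using the following
formula.
$$p_c = \frac{X_1 + X_2}{n_1 + n_2}$$
where X1 and X2 are the numbers of successes in the two samples
and n1 and n2 are the sample sizes.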
55. 11
Two Sample Tests of Proportions
continued
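The value of the test statistic is computed from the following
formula.
$$z = \frac{p_1 - p_2}{\sqrt{\dfrac{p_c(1 - p_c)}{n_1} + \dfrac{p_c(1 - p_c)}{n_2}}}$$
where p1 and p2 are the sample proportions and pc is the pooled proportion.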
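The pooling step and the test statistic can be sketched together in Python. The function name and the counts below are hypothetical and chosen only for illustration; they are not data from the slides. The sketch assumes the usual pooled form, in which the pooled proportion replaces both sample proportions in the standard error.

```python
from math import sqrt

def two_proportion_z(x1, n1, x2, n2):
    """Return (pooled proportion, z) for a test of H0: pi1 = pi2."""
    p1, p2 = x1 / n1, x2 / n2
    # Pooled proportion: total successes divided by total sample size
    p_c = (x1 + x2) / (n1 + n2)
    # Standard error of the difference, using the pooled proportion
    se = sqrt(p_c * (1 - p_c) / n1 + p_c * (1 - p_c) / n2)
    return p_c, (p1 - p2) / se

# Hypothetical counts: 45 of 150 in group 1 and 80 of 200 in group 2
p_c, z = two_proportion_z(45, 150, 80, 200)

z_crit = 1.96  # two-tailed critical value at the .05 level
reject_h0 = abs(z) > z_crit
print(f"pooled proportion = {p_c:.4f}, z = {z:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```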
56. 12
Manelli Perfume Company recently developed a new fragrance that
it plans to market under the name Heavenly. A number of market
studies indicate that Heavenly has very good market potential. The
Sales Department at Manelli is particularly interested in whether
there is a difference in the proportions of younger and older women
who would purchase Heavenly if it were marketed. There are two
independent populations, a population consisting
of the younger women and a population consisting of the older
women. Each sampled woman will be asked to smell Heavenly and
indicate whether she likes the fragrance well enough to purchase a
bottle.
Two Sample Tests of Proportions -
Example
57. 13
Step 1: State the null and alternate hypotheses.
H0: π1 = π2
H1: π1 ≠ π2
Step 2: State the level of significance.
The .05 significance level is stated in the problem.
Step 3: Find the appropriate test statistic.
We will use the z-distribution
Two Sample Tests of Proportions -
Example
58. 14
Step 4: State the decision rule.
Reject H0 if Z > Zα/2 or Z < -Zα/2
Z > 1.96 or Z < -1.96
Two Sample Tests of Proportions -
Example
59. 15
Step 5: Compute the value of z and make a decision
The computed value of 2.21 is in the area of rejection. Therefore, the null hypothesis is
rejected at the .05 significance level. To put it another way, we reject the null hypothesis
that the proportion of young women who would purchase Heavenly is equal to the
proportion of older women who would purchase Heavenly.
Two Sample Tests of Proportions -
Example
61. 17
Comparing Population Means with Unknown
Population Standard Deviations (the Pooled t-test)
The t distribution is used as the test statistic if one
or more of the samples have fewer than 30
observations. The required assumptions are:
1. Both populations must follow the normal
distribution.
2. The populations must have equal standard
deviations.
3. The samples are from independent populations.
62. 18
Small sample test of means continued
Finding the value of the test
statistic requires two
steps.
1. Pool the sample standard
deviations.
2. Use the pooled standard
deviation in the formula.
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
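The two steps of the pooled t-test can be sketched in Python as follows. The function name and the sample values are hypothetical, chosen only to illustrate the arithmetic with samples of five and six observations; they are not data from the slides. The critical value 1.833 is the table value for 9 degrees of freedom at the .10 level (two-tailed).

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(sample_1, sample_2):
    """Return (pooled variance, t, degrees of freedom) for H0: mu1 = mu2."""
    n1, n2 = len(sample_1), len(sample_2)
    df = n1 + n2 - 2
    # Step 1: pool the sample variances, weighting each by its degrees of freedom
    sp2 = ((n1 - 1) * variance(sample_1) + (n2 - 1) * variance(sample_2)) / df
    # Step 2: use the pooled variance in the t formula
    t = (mean(sample_1) - mean(sample_2)) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return sp2, t, df

# Hypothetical times in minutes (illustrative only)
procedure_1 = [4, 6, 5, 7, 3]
procedure_2 = [6, 8, 7, 9, 5, 7]

sp2, t, df = pooled_t(procedure_1, procedure_2)
t_crit = 1.833  # t(.05, 9), two-tailed test at the .10 level
reject_h0 = abs(t) > t_crit
print(f"pooled variance = {sp2:.3f}, t = {t:.2f}, df = {df}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```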
63. 19
Owens Lawn Care, Inc., manufactures and assembles
lawnmowers that are shipped to dealers throughout the
United States and Canada. Two different procedures
have been proposed for mounting the engine on the
frame of the lawnmower. The question is: Is there a
difference in the mean time to mount the engines on the
frames of the lawnmowers? The first procedure was
developed by longtime Owens employee Herb Welles
(designated as procedure 1), and the other procedure
was developed by Owens Vice President of Engineering
William Atkins (designated as procedure 2). To evaluate
the two methods, it was decided to conduct a time and
motion study.
A sample of five employees was timed using the Welles
method and six using the Atkins method. The results, in
minutes, are shown on the right.
Is there a difference in the mean mounting times? Use
the .10 significance level.
Comparing Population Means with Unknown
Population Standard Deviations (the Pooled t-test)
64. 20
Step 1: State the null and alternate hypotheses.
H0: µ1 = µ2
H1: µ1 ≠ µ2
Step 2: State the level of significance. The .10 significance level is
stated in the problem.
Step 3: Find the appropriate test statistic.
Because the population standard deviations are not known but are
assumed to be equal, we use the pooled t-test.
Comparing Population Means with Unknown Population
Standard Deviations (the Pooled t-test) - Example
65. 21
Step 4: State the decision rule.
Reject H0 if t > tα/2,n1+n2-2 or t < -tα/2,n1+n2-2
t > t.05,9 or t < -t.05,9
t > 1.833 or t < -1.833
Comparing Population Means with Unknown Population
Standard Deviations (the Pooled t-test) - Example
66. 22
Step 5: Compute the value of t and make a decision
(a) Calculate the sample standard deviations
Comparing Population Means with Unknown Population
Standard Deviations (the Pooled t-test) - Example
67. 23
Step 5: Compute the value of t and make a decision
Comparing Population Means with Unknown Population
Standard Deviations (the Pooled t-test) - Example
t = -0.662
The decision is not to reject
the null hypothesis, because
-0.662 falls in the region
between -1.833 and 1.833.
We cannot conclude that there is a
difference in the mean times
to mount the engine on the
frame using the two methods.
69. 25
Comparing Population Means with Unequal
Population Standard Deviations
If it is not reasonable to assume the
population standard deviations are
equal, then we compute the t-statistic
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
The sample standard deviations s1 and
s2 are used in place of the
respective population standard
deviations.
In addition, the degrees of freedom are
adjusted downward by a rather
complex approximation formula.
The effect is to reduce the number
of degrees of freedom in the test,
which will require a larger value of
the test statistic to reject the null
hypothesis.
70. 26
Comparing Population Means with Unequal
Population Standard Deviations - Example
Personnel in a consumer testing laboratory are evaluating the absorbency of
paper towels. They wish to compare a set of store brand towels to a similar
group of name brand ones. For each brand they dip a ply of the paper into a
tub of fluid, allow the paper to drain back into the vat for two minutes, and
then evaluate the amount of liquid the paper has taken up from the vat. A
random sample of 9 store brand paper towels absorbed the following
amounts of liquid in milliliters.
8 8 3 1 9 7 5 5 12
An independent random sample of 12 name brand towels absorbed the
following amounts of liquid in milliliters:
12 11 10 6 8 9 9 10 11 9 8 10
Use the .10 significance level and test if there is a difference in the mean
amount of liquid absorbed by the two types of paper towels.
71. 27
The following dot plot provided by MINITAB shows the
variances to be unequal.
Comparing Population Means with Unequal
Population Standard Deviations - Example
72. 28
Step 1: State the null and alternate hypotheses.
H0: µ1 = µ2
H1: µ1 ≠ µ2
Step 2: State the level of significance.
The .10 significance level is stated in the problem.
Step 3: Find the appropriate test statistic.
We will use unequal variances t-test
Comparing Population Means with Unequal
Population Standard Deviations - Example
73. 29
Step 4: State the decision rule.
Reject H0 if
t > tα/2,d.f. or t < -tα/2,d.f.
t > t.05,10 or t < -t.05,10
t > 1.812 or t < -1.812
Step 5: Compute the value of t
and make a decision
The computed value of t is less than the lower critical value, so our
decision is to reject the null hypothesis. We conclude that the
mean absorption rate for the two towels is not the same.
Comparing Population Means with Unequal
Population Standard Deviations - Example
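The paper towel example can be worked through in Python with the data listed in the problem. This sketch is not part of the original slides. It assumes the usual Welch-Satterthwaite approximation for the degrees of freedom, rounded down to a whole number, which reproduces the 10 degrees of freedom used in Step 4. The critical value 1.812 is the table value quoted there.

```python
from math import floor, sqrt
from statistics import mean, variance

# Absorption in milliliters, as listed in the example
store = [8, 8, 3, 1, 9, 7, 5, 5, 12]
name = [12, 11, 10, 6, 8, 9, 9, 10, 11, 9, 8, 10]

# Variance of each sample mean: s^2 / n
v1 = variance(store) / len(store)
v2 = variance(name) / len(name)

# Unequal-variances t statistic
t = (mean(store) - mean(name)) / sqrt(v1 + v2)

# Welch-Satterthwaite degrees of freedom, rounded down (assumed formula)
df = floor((v1 + v2) ** 2 / (v1**2 / (len(store) - 1) + v2**2 / (len(name) - 1)))

t_crit = 1.812  # t(.05, 10), two-tailed test at the .10 level
reject_h0 = abs(t) > t_crit
print(f"t = {t:.2f} with {df} degrees of freedom")
print("Reject H0" if reject_h0 else "Do not reject H0")
```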
75. 31
Two-Sample Tests of Hypothesis:
Dependent Samples
Dependent samples are samples that are paired or
related in some fashion.
For example:
– If you wished to buy a car you would look at the
same car at two (or more) different dealerships
and compare the prices.
– If you wished to measure the effectiveness of a
new diet you would weigh the dieters at the start
and at the finish of the program.
76. 32
Hypothesis Testing Involving
Paired Observations
Use the following test when the samples are
dependent:
$$t = \frac{\bar{d}}{s_d/\sqrt{n}}$$
where
d̄ is the mean of the differences
sd is the standard deviation of the differences
n is the number of pairs (differences)
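A paired t statistic can be computed from the differences alone, as in the sketch below. The function name and the five pairs of values are hypothetical, chosen only to illustrate the formula; they are not data from the slides. The critical value 2.776 is the t-table value for 4 degrees of freedom at the .05 level (two-tailed).

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(first, second):
    """Return (t, degrees of freedom) for H0: mean difference = 0."""
    # Work with the difference within each pair
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    # t = d_bar / (s_d / sqrt(n))
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Hypothetical paired measurements (illustrative only)
first = [10, 12, 9, 11, 13]
second = [8, 11, 9, 9, 10]

t, df = paired_t(first, second)
t_crit = 2.776  # t(.025, 4), two-tailed test at the .05 level
reject_h0 = abs(t) > t_crit
print(f"t = {t:.2f} with {df} degrees of freedom")
print("Reject H0" if reject_h0 else "Do not reject H0")
```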
77. 33
Nickel Savings and Loan wishes to
compare the two companies it
uses to appraise the value of
residential homes. Nickel
Savings selected a sample of
10 residential properties and
scheduled both firms for an
appraisal. The results, reported
in $000, are shown on the table
(right).
At the .05 significance level, can
we conclude there is a
difference in the mean
appraised values of the homes?
Hypothesis Testing Involving
Paired Observations - Example
78. 34
Step 1: State the null and alternate hypotheses.
H0: µd = 0
H1: µd ≠ 0
Step 2: State the level of significance.
The .05 significance level is stated in the problem.
Step 3: Find the appropriate test statistic.
We will use the t-test
Hypothesis Testing Involving
Paired Observations - Example
79. 35
Step 4: State the decision rule.
Reject H0 if
t > tα/2,n-1 or t < -tα/2,n-1
t > t.025,9 or t < -t.025,9
t > 2.262 or t < -2.262
Hypothesis Testing Involving
Paired Observations - Example
80. 36
Step 5: Compute the value of t and make a decision
The computed value of t
is greater than the
higher critical value, so
our decision is to reject
the null hypothesis. We
conclude that there is a
difference in the mean
appraised values of the
homes.
Hypothesis Testing Involving
Paired Observations - Example
84. 2
GOALS
• List the characteristics of the F distribution.
• Conduct a test of hypothesis to determine whether the
variances of two populations are equal.
• Discuss the general idea of analysis of variance.
• Organize data into a one-way and a two-way ANOVA table.
• Conduct a test of hypothesis among three or more treatment
means.
• Develop confidence intervals for the difference in treatment
means.
• Conduct a test of hypothesis among treatment means using a
blocking variable.
• Conduct a two-way ANOVA with interaction.
85. 3
Characteristics of F-Distribution
• There is a “family” of F
Distributions.
• Each member of the family is
determined by two parameters:
the numerator degrees of
freedom and the denominator
degrees of freedom.
• F cannot be negative, and it is
a continuous distribution.
• The F distribution is positively
skewed.
• Its values range from 0 to ∞
• As F → ∞ the curve
approaches the X-axis.
86. 4
Comparing Two Population Variances
The F distribution is used to test the hypothesis that the variance of one
normal population equals the variance of another normal population.
The following examples will show the use of the test:
• Two Barth shearing machines are set to produce steel bars of the
same length. The bars, therefore, should have the same mean length.
We want to ensure that in addition to having the same mean length
they also have similar variation.
• The mean rate of return on two types of common stock may be the
same, but there may be more variation in the rate of return in one than
the other. A sample of 10 technology and 10 utility stocks shows the
same mean rate of return, but there is likely more variation in the
technology stocks.
• A study by the marketing department for a large newspaper found that
men and women spent about the same amount of time per day
reading the paper. However, the same report indicated there was
nearly twice as much variation in time spent per day among the men
than the women.
88. 6
Test for Equal Variances - Example
Lammers Limos offers limousine service from the city hall in Toledo,
Ohio, to Metro Airport in Detroit. Sean Lammers, president of the
company, is considering two routes. One is via U.S. 25 and the
other via I-75. He wants to study the time it takes to drive to the
airport using each route and then compare the results. He collected
the following sample data, which is reported in minutes.
Using the .10 significance level, is there a difference in the variation
in the driving times for the two routes?
89. 7
Step 1: The hypotheses are:
H0: σ1² = σ2²
H1: σ1² ≠ σ2²
Step 2: The significance level is .10.
Step 3: The test statistic is the F distribution.
Test for Equal Variances - Example
90. 8
Step 4: State the decision rule.
Reject H0 if F > Fα/2,v1,v2
F > F.10/2,7-1,8-1
F > F.05,6,7 = 3.87
Test for Equal Variances - Example
91. 9
Step 5: Compute the value of F and make a decision
The decision is to reject the null hypothesis, because the computed F
value (4.23) is larger than the critical value (3.87).
We conclude that there is a difference in the variation of the travel times along
the two routes.
Test for Equal Variances - Example
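The F statistic for comparing two variances is the ratio of the two sample variances. The sketch below is illustrative and not part of the original slides: the function name and the driving times are hypothetical, with sample sizes of 7 and 8 to match the degrees of freedom in Step 4. It assumes the common convention of placing the larger sample variance in the numerator.

```python
from statistics import variance

def f_ratio(sample_a, sample_b):
    """Return (F, numerator df, denominator df), larger variance on top."""
    var_a, var_b = variance(sample_a), variance(sample_b)
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1

# Hypothetical driving times in minutes (illustrative only)
route_a = [50, 60, 55, 45, 70, 52, 63]
route_b = [58, 60, 61, 55, 57, 62, 59, 64]

f_stat, df_num, df_den = f_ratio(route_a, route_b)
f_crit = 3.87  # critical value for 6 and 7 degrees of freedom quoted in the example
reject_h0 = f_stat > f_crit
print(f"F = {f_stat:.2f} with ({df_num}, {df_den}) degrees of freedom")
print("Reject H0" if reject_h0 else "Do not reject H0")
```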
93. 11
Comparing Means of Two or More
Populations
• The F distribution is also used for testing whether
two or more sample means came from the same
or equal populations.
• Assumptions:
– The sampled populations follow the normal
distribution.
– The populations have equal standard
deviations.
– The samples are randomly selected and are
independent.
94. 12
• The Null Hypothesis is that the population
means are the same. The Alternative Hypothesis
is that at least one of the means is different.
• The Test Statistic follows the F distribution.
• The Decision rule is to reject the null hypothesis
if F (computed) is greater than F (table) with
numerator and denominator degrees of freedom.
• Hypothesis Setup and Decision Rule:
H0: µ1 = µ2 =…= µk
H1: The means are not all equal
Reject H0 if F > Fα,k-1,n-k
Comparing Means of Two or More
Populations
95. 13
Analysis of Variance – F statistic
• If there are k populations being sampled, the numerator degrees
of freedom is k – 1.
• If there are a total of n observations the denominator degrees of
freedom is n – k.
• The test statistic is computed by:
$$F = \frac{SST/(k - 1)}{SSE/(n - k)}$$
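The F statistic above can be computed directly from grouped data. In the sketch below, which is not part of the original slides, SST is taken to mean the treatment (between-groups) sum of squares and SSE the error (within-groups) sum of squares, consistent with the degrees of freedom k - 1 and n - k. The function name and the three small samples are hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F, numerator df, denominator df) for a one-way ANOVA."""
    k = len(groups)                    # number of treatments
    n = sum(len(g) for g in groups)    # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]

    # SST: variation of the treatment means around the grand mean
    sst = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # SSE: variation of the observations around their own treatment mean
    sse = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    f_stat = (sst / (k - 1)) / (sse / (n - k))
    return f_stat, k - 1, n - k

# Hypothetical observations for three treatments (illustrative only)
groups = [[5, 6, 7, 6], [8, 9, 10, 9], [4, 5, 3, 4]]
f_stat, df_num, df_den = one_way_anova_f(groups)
print(f"F = {f_stat:.2f} with ({df_num}, {df_den}) degrees of freedom")
```

The computed F is then compared with the table value for the chosen significance level and these degrees of freedom.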
96. 14
Joyce Kuhlman manages a regional financial center. She wishes to
compare the productivity, as measured by the number of customers
served, among three employees. Four days are randomly selected
and the number of customers served by each employee is recorded.
The results are:
Comparing Means of Two or More
Populations – Illustrative Example
98. 16
Recently a group of four major carriers
joined in hiring Brunner Marketing
Research, Inc., to survey recent
passengers regarding their level of
satisfaction with a recent flight.
The survey included questions on
ticketing, boarding, in-flight
service, baggage handling, pilot
communication, and so forth.
Twenty-five questions offered a
range of possible answers:
excellent, good, fair, or poor. A
response of excellent was given a
score of 4, good a 3, fair a 2, and
poor a 1. These responses were
then totaled, so the total score
was an indication of the
satisfaction with the flight. Brunner
Marketing Research, Inc.,
randomly selected and surveyed
passengers from the four airlines.
Comparing Means of Two or More
Populations – Example
Is there a difference in the mean
satisfaction level among the four
airlines?
Use the .01 significance level.
99. 17
Step 1: State the null and alternate hypotheses.
H0: µE = µA = µT = µO
H1: The means are not all equal
Reject H0 if F > Fα,k-1,n-k
Step 2: State the level of significance.
The .01 significance level is stated in the problem.
Step 3: Find the appropriate test statistic.
Because we are comparing means of more than
two groups, use the F statistic
Comparing Means of Two or More
Populations – Example
100. 18
Step 4: State the decision rule.
Reject H0 if F > Fα,k-1,n-k
F > F.01,4-1,22-4
F > F.01,3,18
F > 5.09
Comparing Means of Two or More
Populations – Example
101. 19
Step 5: Compute the value of F and make a decision
Comparing Means of Two or More
Populations – Example
104. 22
Computing SST
The computed value of F is 8.99, which is greater than the critical value of 5.09,
so the null hypothesis is rejected.
Conclusion: The population means are not all equal. The mean scores are not
the same for the four airlines; at this point we can only conclude there is a
difference in the treatment means. We cannot determine which treatment groups
differ or how many treatment groups differ.
105. 23
Inferences About Treatment Means
l When we reject the null hypothesis
that the means are equal, we may
want to know which treatment means
differ.
l One of the simplest procedures is
through the use of confidence
intervals.
106. 24
Confidence Interval for the
Difference Between Two Means
$$(\bar{X}_1 - \bar{X}_2) \pm t\sqrt{MSE\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
l where t is obtained from the t table with degrees of
freedom (n - k).
l MSE = [SSE/(n - k)]
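As a rough illustration, the interval (X̄1 - X̄2) ± t·sqrt(MSE(1/n1 + 1/n2)) can be sketched in Python. The function name and all input values below are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import math

def diff_means_ci(mean1, mean2, t, mse, n1, n2):
    """Confidence interval for the difference between two treatment means."""
    half_width = t * math.sqrt(mse * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    return diff - half_width, diff + half_width

# Hypothetical inputs (not the airline data)
low, high = diff_means_ci(mean1=12.0, mean2=6.0, t=2.0, mse=8.0, n1=4, n2=4)
print(low, high)

# If the interval excludes zero, the two treatment means differ significantly.
differ = low > 0 or high < 0
```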
107. 25
From the previous example, develop a 95% confidence interval
for the difference in the mean rating for Eastern and Ozark.
Can we conclude that there is a difference between the two
airlines’ ratings?
The 95 percent confidence interval ranges from 10.46 up to
26.04. Both endpoints are positive; hence, we can conclude
these treatment means differ significantly. That is, passengers
on Eastern rated service significantly different from those
on Ozark.
Confidence Interval for the
Difference Between Two Means - Example
110. 28
Two-Way Analysis of Variance
l For the two-factor ANOVA we test whether there is a
significant difference in the treatment effect
and whether there is a difference in the blocking
effect. Let Br be the block totals (r for rows).
l Let SSB represent the sum of squares for the blocks,
where:
$$SSB = \sum\left[\frac{B_r^2}{k}\right] - \frac{(\sum X)^2}{n}$$
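A small Python sketch of the block sum of squares, SSB = Σ(Br²/k) - (ΣX)²/n, assuming the data are stored with one row per block and one column per treatment. The function name and the 2-by-2 table are hypothetical.

```python
def block_sum_of_squares(table):
    """SSB for a table given as a list of rows, one row per block.

    Each row holds one observation per treatment, so k = len(row).
    """
    k = len(table[0])                        # number of treatments
    n = sum(len(row) for row in table)       # total observations
    grand_total = sum(sum(row) for row in table)
    # Sum of (block total squared / k), minus the correction term (sum X)^2 / n
    return sum(sum(row) ** 2 / k for row in table) - grand_total ** 2 / n

# Hypothetical table: 2 blocks (rows) by 2 treatments (columns)
ssb = block_sum_of_squares([[1, 2], [3, 4]])
print(ssb)
```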
111. 29
WARTA, the Warren Area Regional Transit Authority, is expanding bus
service from the suburb of Starbrick into the central business district of
Warren. There are four routes being considered from Starbrick to
downtown Warren: (1) via U.S. 6, (2) via the West End, (3) via the
Hickory Street Bridge, and (4) via Route 59.
WARTA conducted several tests to determine whether there was a difference
in the mean travel times along the four routes. Because there will be many
different drivers, the test was set up so each driver drove along each of the
four routes. Next slide shows the travel time, in minutes, for each driver-route
combination. At the .05 significance level, is there a difference in the mean
travel time along the four routes? If we remove the effect of the drivers, is
there a difference in the mean travel time?
Two-Way Analysis of Variance -
Example
113. 31
Step 1: State the null and alternate hypotheses.
H0: µu = µw = µh = µr
H1: The means are not all equal
Reject H0 if F > Fα,k-1,n-k
Step 2: State the level of significance.
The .05 significance level is stated in the problem.
Step 3: Find the appropriate test statistic.
Because we are comparing means of more than
two groups, use the F statistic
Two-Way Analysis of Variance -
Example
114. 32
Step 4: State the decision rule.
Reject H0 if F > Fα,v1,v2
F > F.05,k-1,n-k
F > F.05,4-1,20-4
F > F.05,3,16
F > 3.24
Two-Way Analysis of Variance -
Example
117. 35
Using Excel to perform the
calculations. The
computed value of F is
2.482, so our decision is to
not reject the null
hypothesis. We conclude
there is no difference in
the mean travel time along
the four routes. There is
no reason to select one of
the routes as faster than
the other.
Two-Way Analysis of Variance – Excel
Example
118. 36
Two-Way ANOVA with Interaction
Interaction occurs if the combination of two factors has some effect
on the variable under study, in addition to each factor alone. We refer
to the variable being studied as the response variable.
An everyday illustration of interaction is the effect of diet and exercise
on weight. It is generally agreed that a person’s weight (the response
variable) can be controlled with two factors, diet and exercise.
Research shows that weight is affected by diet alone and that weight
is affected by exercise alone. However, the general recommended
method to control weight is based on the combined or interaction
effect of diet and exercise.
119. 37
Graphical Observation of Mean Times
Our graphical observations show us that
interaction effects are possible. The next
step is to conduct statistical tests of
hypothesis to further investigate the
possible interaction effects. In summary,
our study of travel times has several
questions:
l Is there really an interaction between
routes and drivers?
l Are the travel times for the drivers the
same?
l Are the travel times for the routes the
same?
Of the three questions, we are most
interested in the test for interactions. To
put it another way, does a particular
route/driver combination result in
significantly faster (or slower) driving
times? Also, the results of the hypothesis
test for interaction affect the way we
analyze the route and driver questions.
120. 38
Interaction Effect
l We can investigate these questions statistically by extending
the two-way ANOVA procedure presented in the previous
section. We add another source of variation, namely, the
interaction.
l In order to estimate the “error” sum of squares, we need at
least two measurements for each driver/route combination.
l As an example, suppose the experiment presented earlier is
repeated by measuring two more travel times for each driver
and route combination. That is, we replicate the experiment.
Now we have three observations for each driver/route
combination.
l Using the mean of three travel times for each driver/route
combination we get a more reliable measure of the mean travel
time.
122. 40
Three Tests in ANOVA with Replication
The ANOVA now has three sets of hypotheses
to test:
1. H0: There is no interaction between drivers and routes.
H1: There is interaction between drivers and routes.
2. H0: The driver means are the same.
H1: The driver means are not the same.
3. H0: The route means are the same.
H1: The route means are not the same.
128. 2
GOALS
l Understand and interpret the terms dependent and
independent variable.
l Calculate and interpret the coefficient of correlation,
the coefficient of determination, and the standard
error of estimate.
l Conduct a test of hypothesis to determine whether
the coefficient of correlation in the population is zero.
l Calculate the least squares regression line.
l Construct and interpret confidence and prediction
intervals for the dependent variable.
129. 3
Regression Analysis - Introduction
l Recall in Chapter 4 the idea of showing the
relationship between two variables with a scatter
diagram was introduced.
l In that case we showed that, as the age of the buyer
increased, the amount spent for the vehicle also
increased.
l In this chapter we carry this idea further. Numerical
measures to express the strength of relationship
between two variables are developed.
l In addition, an equation is used to express the
relationship between variables, allowing us to
estimate one variable on the basis of another.
130. 4
Regression Analysis - Uses
Some examples.
l Is there a relationship between the amount Healthtex
spends per month on advertising and its sales in the
month?
l Can we base an estimate of the cost to heat a home
in January on the number of square feet in the
home?
l Is there a relationship between the miles per gallon
achieved by large pickup trucks and the size of the
engine?
l Is there a relationship between the number of hours
that students studied for an exam and the score
earned?
131. 5
Correlation Analysis
l Correlation Analysis is the study of the
relationship between variables. It is also
defined as a group of techniques to measure
the association between two variables.
l A Scatter Diagram is a chart that portrays
the relationship between the two variables. It
is the usual first step in correlation analysis.
– The Dependent Variable is the variable being
predicted or estimated.
– The Independent Variable provides the basis for
estimation. It is the predictor variable.
132. 6
Regression Example
The sales manager of Copier Sales
of America, which has a large
sales force throughout the
United States and Canada,
wants to determine whether
there is a relationship between
the number of sales calls made
in a month and the number of
copiers sold that month. The
manager selects a random
sample of 10 representatives
and determines the number of
sales calls each representative
made last month and the
number of copiers sold.
134. 8
The Coefficient of Correlation, r
The Coefficient of Correlation (r) is a measure of the
strength of the relationship between two variables. It
requires interval or ratio-scaled data.
l It can range from -1.00 to 1.00.
l Values of -1.00 or 1.00 indicate perfect and strong
correlation.
l Values close to 0.0 indicate weak correlation.
l Negative values indicate an inverse relationship and
positive values indicate a direct relationship.
139. 13
Coefficient of Determination
The coefficient of determination (r²) is the
proportion of the total variation in the
dependent variable (Y) that is explained or
accounted for by the variation in the
independent variable (X). It is the square of
the coefficient of correlation.
l It ranges from 0 to 1.
l It does not give any information on the
direction of the relationship between the
variables.
140. 14
Using the Copier Sales of
America data, for which a
scatterplot was
developed earlier,
compute the correlation
coefficient and
coefficient of
determination.
Correlation Coefficient - Example
143. 17
How do we interpret a correlation of 0.759?
First, it is positive, so we see there is a direct relationship between
the number of sales calls and the number of copiers sold. The value
of 0.759 is fairly close to 1.00, so we conclude that the association
is strong.
However, does this mean that more sales calls cause more sales?
No, we have not demonstrated cause and effect here, only that the
two variables—sales calls and copiers sold—are related.
Correlation Coefficient - Example
144. 18
Coefficient of Determination (r²) - Example
•The coefficient of determination, r², is 0.576,
found by (0.759)²
•This is a proportion or a percent; we can say that
57.6 percent of the variation in the number of
copiers sold is explained, or accounted for, by the
variation in the number of sales calls.
145. 19
Testing the Significance of
the Correlation Coefficient
H0: ρ = 0 (the correlation in the population is 0)
H1: ρ ≠ 0 (the correlation in the population is not 0)
Reject H0 if:
t > tα/2,n-2 or t < -tα/2,n-2
146. 20
Testing the Significance of
the Correlation Coefficient - Example
H0: ρ = 0 (the correlation in the population is 0)
H1: ρ ≠ 0 (the correlation in the population is not 0)
Reject H0 if:
t > tα/2,n-2 or t < -tα/2,n-2
t > t0.025,8 or t < -t0.025,8
t > 2.306 or t < -2.306
147. 21
Testing the Significance of
the Correlation Coefficient - Example
The computed t (3.297) is within the rejection region, therefore, we will reject H0. This means
the correlation in the population is not zero. From a practical standpoint, it indicates to the
sales manager that there is correlation with respect to the number of sales calls made
and the number of copiers sold in the population of salespeople.
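The computed t of 3.297 can be checked with a short Python sketch. It assumes the standard test statistic for a correlation coefficient, t = r·sqrt(n-2)/sqrt(1-r²), which is not shown in this part of the deck; the function name is illustrative.

```python
import math

def correlation_t(r, n):
    """t statistic for H0: the population correlation is zero (n - 2 df)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t = correlation_t(r=0.759, n=10)   # r and n from the Copier Sales example
critical = 2.306                   # t(.025, 8) as stated on the slide
reject_h0 = abs(t) > critical
print(round(t, 3), reject_h0)
```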
152. 26
Regression Analysis
In regression analysis we use the independent variable
(X) to estimate the dependent variable (Y).
l The relationship between the variables is linear.
l Both variables must be at least interval scale.
l The least squares criterion is used to determine the
equation.
153. 27
Regression Analysis – Least Squares
Principle
l The least squares principle is used to
obtain a and b.
l The equations to determine a and b
are:
$$b = \frac{n(\sum XY) - (\sum X)(\sum Y)}{n(\sum X^2) - (\sum X)^2}$$
$$a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n}$$
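A minimal Python sketch of the least squares computation, using the sums form b = [nΣXY - ΣXΣY] / [nΣX² - (ΣX)²] and a = ΣY/n - b·ΣX/n. The function name and the five data points are hypothetical; they are not the Copier Sales data.

```python
def least_squares(x, y):
    """Return (a, b) for the line Y-hat = a + bX using the sums formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b

# Hypothetical data for illustration
a, b = least_squares([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(a, 4), round(b, 4))
```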
155. 29
Regression Equation - Example
Recall the example involving
Copier Sales of America. The
sales manager gathered
information on the number of
sales calls made and the
number of copiers sold for a
random sample of 10 sales
representatives. Use the least
squares method to determine a
linear equation to express the
relationship between the two
variables.
What is the expected number of
copiers sold by a representative
who made 20 calls?
156. 30
Finding the Regression Equation - Example
The regression equation is:
$$\hat{Y} = a + bX$$
$$\hat{Y} = 18.9476 + 1.1842X$$
$$\hat{Y} = 18.9476 + 1.1842(20) = 42.6316$$
157. 31
Computing the Estimates of Y
Step 1 – Using the regression equation, substitute the
value of each X to solve for the estimated sales
Soni Jones:
$$\hat{Y} = 18.9476 + 1.1842X = 18.9476 + 1.1842(30) = 54.4736$$
Tom Keller:
$$\hat{Y} = 18.9476 + 1.1842X = 18.9476 + 1.1842(20) = 42.6316$$
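The substitutions into the regression equation can be reproduced with a few lines of Python. The intercept 18.9476 and slope 1.1842 are the values reported on the slides; the function name is illustrative.

```python
A, B = 18.9476, 1.1842   # intercept and slope reported on the slides

def estimate_copiers(calls):
    """Estimated copiers sold for a given number of sales calls."""
    return A + B * calls

for calls in (20, 25, 30):
    print(calls, round(estimate_copiers(calls), 4))
```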
159. 33
The Standard Error of Estimate
l The standard error of estimate measures the
scatter, or dispersion, of the observed values
around the line of regression
l The formulas that are used to compute the
standard error:
$$s_{y \cdot x} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}}$$
$$s_{y \cdot x} = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{n - 2}}$$
160. 34
Standard Error of the Estimate - Example
Recall the example involving
Copier Sales of America.
The sales manager
determined the least
squares regression
equation is given below.
Determine the standard error
of estimate as a measure
of how well the values fit
the regression line.
$$\hat{Y} = 18.9476 + 1.1842X$$
$$s_{y \cdot x} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} = \sqrt{\frac{784.211}{10 - 2}} = 9.901$$
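The standard error of estimate for this example can be verified with a short Python sketch. The sum of squared residuals (784.211) and n = 10 come from the slide; the function name is illustrative.

```python
import math

def standard_error_of_estimate(sum_sq_residuals, n):
    """s(y.x) = sqrt( sum of (Y - Y-hat)^2 / (n - 2) )."""
    return math.sqrt(sum_sq_residuals / (n - 2))

s_yx = standard_error_of_estimate(784.211, 10)
print(round(s_yx, 3))
```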
163. 37
Assumptions Underlying Linear
Regression
For each value of X, there is a group of Y values.
l These Y values are normally distributed. The means of these normal
distributions of Y values all lie on the straight line of regression.
l The standard deviations of these normal distributions are equal.
l The Y values are statistically independent. This means that in
the selection of a sample, the Y values chosen for a particular X
value do not depend on the Y values for any other X values.
164. 38
Confidence Interval and Prediction
Interval Estimates of Y
•A confidence interval reports the mean value of Y
for a given X.
•A prediction interval reports the range of values
of Y for a particular value of X.
165. 39
Confidence Interval Estimate - Example
We return to the Copier Sales of America
illustration. Determine a 95 percent confidence
interval for all sales representatives who make
25 calls.
166. 40
Step 1 – Compute the point estimate of Y
In other words, determine the number of copiers we expect a sales
representative to sell if he or she makes 25 calls.
The regression equation is:
$$\hat{Y} = 18.9476 + 1.1842X$$
$$\hat{Y} = 18.9476 + 1.1842(25) = 48.5526$$
Confidence Interval Estimate - Example
167. 41
Step 2 – Find the value of t
l To find the t value, we need to first know the number
of degrees of freedom. In this case the degrees of
freedom is n - 2 = 10 – 2 = 8.
l We set the confidence level at 95 percent. To find the
value of t, move down the left-hand column of
Appendix B.2 to 8 degrees of freedom, then move
across to the column with the 95 percent level of
confidence.
l The value of t is 2.306.
Confidence Interval Estimate - Example
169. 43
Confidence Interval Estimate - Example
Step 4 – Use the formula above by substituting the numbers computed
in previous slides
Thus, the 95 percent confidence interval for the average sales of all
sales representatives who make 25 calls is from 40.9170 up to
56.1882 copiers.
170. 44
Prediction Interval Estimate - Example
We return to the Copier Sales of America
illustration. Determine a 95 percent
prediction interval for Sheila Baker, a West
Coast sales representative who made 25
calls.
171. 45
Step 1 – Compute the point estimate of Y
In other words, determine the number of copiers we
expect a sales representative to sell if he or she
makes 25 calls.
The regression equation is:
$$\hat{Y} = 18.9476 + 1.1842X$$
$$\hat{Y} = 18.9476 + 1.1842(25) = 48.5526$$
Prediction Interval Estimate - Example
172. 46
Step 2 – Using the information computed
earlier in the confidence interval estimation
example, use the formula above.
Prediction Interval Estimate - Example
If Sheila Baker makes 25 sales calls, the number of copiers she
will sell will be between about 24 and 73 copiers.
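The confidence and prediction intervals can be sketched in Python, assuming the usual textbook forms Ŷ ± t·s·sqrt(1/n + (X - X̄)²/Σ(X - X̄)²) for the mean and Ŷ ± t·s·sqrt(1 + 1/n + (X - X̄)²/Σ(X - X̄)²) for an individual. The formula slides are not part of this section, so both forms are assumptions. The values X̄ = 22 and Σ(X - X̄)² = 760 are also assumed (they are not shown here); they are used because they are consistent with the intervals reported above.

```python
import math

def interval(y_hat, t, s_yx, n, x, x_bar, sxx, prediction=False):
    """Confidence interval for the mean of Y at x, or a prediction
    interval for an individual Y when prediction=True."""
    core = 1 / n + (x - x_bar) ** 2 / sxx
    if prediction:
        core += 1          # extra term for the scatter of one individual
    half = t * s_yx * math.sqrt(core)
    return y_hat - half, y_hat + half

# From the slides: y_hat, t, s_yx, n, x. Assumed (not shown here): x_bar, sxx.
args = dict(y_hat=48.5526, t=2.306, s_yx=9.901, n=10, x=25, x_bar=22, sxx=760)
ci = interval(**args)
pi = interval(**args, prediction=True)
print([round(v, 2) for v in ci], [round(v, 2) for v in pi])
```

The prediction interval is wider because it covers a single representative rather than the mean of all representatives who make 25 calls.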
174. 48
Transforming Data
l The coefficient of correlation describes the
strength of the linear relationship between
two variables. It could be that two variables
are closely related, but their relationship is
not linear.
l Be cautious when you are interpreting the
coefficient of correlation. A value of r may
indicate there is no linear relationship, but it
could be there is a relationship of some other
nonlinear or curvilinear form.
175. 49
Transforming Data - Example
On the right is a listing of 22 professional
golfers, the number of events in
which they participated, the amount
of their winnings, and their mean
score for the 2004 season. In golf,
the objective is to play 18 holes in
the least number of strokes. So, we
would expect that those golfers with
the lower mean scores would have
the larger winnings. To put it another
way, score and winnings should be
inversely related. In 2004 Tiger
Woods played in 19 events, earned
$5,365,472, and had a mean score
per round of 69.04. Fred Couples
played in 16 events, earned
$1,396,109, and had a mean score
per round of 70.92. The data for the
22 golfers follows.
176. 50
Scatterplot of Golf Data
l The correlation between the
variables Winnings and
Score is -0.782. This is a
fairly strong inverse
relationship.
l However, when we plot the
data on a scatter diagram
the relationship does not
appear to be linear; it does
not seem to follow a straight
line.
177. 51
What can we do to explore other (nonlinear)
relationships?
One possibility is to transform one of the
variables. For example, instead of using Y as
the dependent variable, we might use its log,
reciprocal, square, or square root. Another
possibility is to transform the independent
variable in the same way. There are other
transformations, but these are the most
common.
178. 52
In the golf winnings
example, changing the
scale of the dependent
variable is effective. We
determine the log of each
golfer’s winnings and
then find the correlation
between the log of
winnings and score. That
is, we find the log to the
base 10 of Tiger Woods’
earnings of $5,365,472,
which is 6.72961.
Transforming Data - Example
181. 55
Using the Transformed Equation for
Estimation
Based on the regression equation, a golfer with
a mean score of 70 could expect to earn:
•The value 6.4372 is the log to the base 10 of winnings.
•The antilog of 6.4372 is 2,736,528 (about 2.736 million).
•So a golfer that had a mean score of 70 could expect to
earn $2,736,528.
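The log transformation and its reversal are easy to check in Python. This sketch uses only the figures quoted in the slides (Tiger Woods' winnings and the estimated log value 6.4372); the variable names are illustrative.

```python
import math

log_winnings = math.log10(5_365_472)   # Tiger Woods' 2004 winnings
estimated = 10 ** 6.4372               # antilog of the estimated log winnings
print(round(log_winnings, 5), round(estimated))
```

Because the regression is fitted to the log of winnings, every estimate has to be converted back with the antilog before it is read as dollars.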
184. 2
GOALS
l Describe the relationship between several independent variables and
a dependent variable using multiple regression analysis.
l Set up, interpret, and apply an ANOVA table
l Compute and interpret the multiple standard error of estimate, the
coefficient of multiple determination, and the adjusted coefficient of
multiple determination.
l Conduct a test of hypothesis to determine whether regression
coefficients differ from zero.
l Conduct a test of hypothesis on each of the regression coefficients.
l Use residual analysis to evaluate the assumptions of multiple
regression analysis.
l Evaluate the effects of correlated independent variables.
l Use and understand qualitative independent variables.
l Understand and interpret the stepwise regression method.
l Understand and interpret possible interaction among independent
variables.
185. 3
Multiple Regression Analysis
The general multiple regression with k
independent variables is given by:
The least squares criterion is used to develop
this equation. Because determining b1, b2, etc. is
very tedious, a software package such as Excel
or MINITAB is recommended.
186. 4
Multiple Regression Analysis
For two independent variables, the general form
of the multiple regression equation is:
•X1 and X2 are the independent variables.
•a is the Y-intercept
•b1 is the net change in Y for each unit change in X1 holding X2
constant. It is called a partial regression coefficient, a net regression
coefficient, or just a regression coefficient.
188. 6
Salsberry Realty sells homes along the east
coast of the United States. One of the
questions most frequently asked by
prospective buyers is: If we purchase this
home, how much can we expect to pay to
heat it during the winter? The research
department at Salsberry has been asked to
develop some guidelines regarding heating
costs for single-family homes.
Three variables are thought to relate to the
heating costs: (1) the mean daily outside
temperature, (2) the number of inches of
insulation in the attic, and (3) the age in
years of the furnace.
To investigate, Salsberry’s research department
selected a random sample of 20 recently
sold homes. It determined the cost to heat
each home last January, as well
Multiple Linear Regression - Example
192. 10
The Multiple Regression Equation –
Interpreting the Regression Coefficients
The regression coefficient for mean outside temperature is -4.583. The coefficient is
negative and shows an inverse relationship between heating cost and temperature.
As the outside temperature increases, the cost to heat the home decreases. The
numeric value of the regression coefficient provides more information. If we
increase temperature by 1 degree and hold the other two independent variables
constant, we can estimate a decrease of $4.583 in monthly heating cost. So if the
mean temperature in Boston is 25 degrees and it is 35 degrees in Philadelphia, all
other things being the same (insulation and age of furnace), we expect the heating
cost would be $45.83 less in Philadelphia.
The attic insulation variable also shows an inverse relationship: the more insulation in
the attic, the less the cost to heat the home. So the negative sign for this coefficient
is logical. For each additional inch of insulation, we expect the cost to heat the
home to decline $14.83 per month, regardless of the outside temperature or the
age of the furnace.
The age of the furnace variable shows a direct relationship. With an older furnace, the
cost to heat the home increases. Specifically, for each additional year older the
furnace is, we expect the cost to increase $6.10 per month.
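The "hold the others constant" reading of each coefficient can be sketched in Python. The coefficients below are the ones quoted in the text, with signs taken from the described direction of each relationship; the dictionary keys and the function name are illustrative.

```python
# Net regression coefficients quoted in the text (dollars per unit change)
coef = {"temperature": -4.583, "insulation": -14.83, "furnace_age": 6.10}

def cost_change(variable, delta):
    """Estimated change in monthly heating cost when one variable changes
    by `delta` units and the other two are held constant."""
    return coef[variable] * delta

# Boston (25 degrees) versus Philadelphia (35 degrees)
boston_to_philly = cost_change("temperature", 35 - 25)
print(round(boston_to_philly, 2))
```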
193. 11
Applying the Model for Estimation
What is the estimated heating cost for a home if the
mean outside temperature is 30 degrees, there
are 5 inches of insulation in the attic, and the
furnace is 10 years old?
194. 12
Multiple Standard Error of
Estimate
The multiple standard error of estimate is a measure of the
effectiveness of the regression equation.
l It is measured in the same units as the dependent
variable.
l It is difficult to determine what is a large value and what
is a small value of the standard error.
l The formula is:
196. 14
Multiple Regression and
Correlation Assumptions
l The independent variables and the dependent
variable have a linear relationship. The dependent
variable must be continuous and at least interval-
scale.
l The variation in the residuals must be the same for all values of Y.
When this is the case, we say the residuals exhibit
homoscedasticity.
l The residuals should follow the normal distribution
with mean 0.
l Successive values of the dependent variable must
be uncorrelated.
197. 15
The ANOVA Table
The ANOVA table reports the variation in the
dependent variable. The variation is divided
into two components.
l The Explained Variation is that accounted for
by the set of independent variables.
l The Unexplained or Random Variation is not
accounted for by the independent variables.
199. 17
Coefficient of Multiple Determination (R²)
Characteristics of the coefficient of multiple determination:
1. It is symbolized by a capital R squared. In other words, it is written
as R² because it behaves like the square of a correlation coefficient.
2. It can range from 0 to 1. A value near 0 indicates little association
between the set of independent variables and the dependent
variable. A value near 1 means a strong association.
3. It cannot assume negative values. Any number that is squared or
raised to the second power cannot be negative.
4. It is easy to interpret. Because R² is a value between 0 and 1, it is easy
to interpret, compare, and understand.
200. 18
Minitab – the ANOVA Table
$$R^2 = \frac{SSR}{SS\ total} = \frac{171{,}220}{212{,}916} = 0.804$$
201. 19
Adjusted Coefficient of Determination
l The number of independent variables in a multiple
regression equation makes the coefficient of
determination larger. Each new independent variable
causes the predictions to be more accurate.
l If the number of variables, k, and the sample size, n,
are equal, the coefficient of determination is 1.0. In
practice, this situation is rare and would also be
ethically questionable.
l To balance the effect that the number of independent
variables has on the coefficient of multiple
determination, statistical software packages use an
adjusted coefficient of multiple determination.
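As a hedged illustration, R², its adjusted version, and the multiple standard error of estimate can all be computed from the ANOVA sums of squares. SSR = 171,220 and SS total = 212,916 come from the Minitab slide, and n = 20 and k = 3 from the Salsberry example. The adjusted formula used here, 1 - [SSE/(n-k-1)] / [SS total/(n-1)], is the common textbook form and is an assumption, since the slide with the formula is not in this section.

```python
import math

ssr, ss_total = 171_220, 212_916   # from the ANOVA table on the slide
n, k = 20, 3                       # 20 homes, 3 independent variables
sse = ss_total - ssr               # unexplained (error) variation

r2 = ssr / ss_total
# Assumed standard form of the adjustment (the slide's formula is not shown)
r2_adj = 1 - (sse / (n - k - 1)) / (ss_total / (n - 1))
# Multiple standard error of estimate, in the units of Y (dollars)
s_multiple = math.sqrt(sse / (n - k - 1))
print(round(r2, 3), round(r2_adj, 3), round(s_multiple, 2))
```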
203. 21
Correlation Matrix
A correlation matrix is used to show all
possible simple correlation
coefficients among the variables.
l The matrix is useful for locating
correlated independent variables.
l It shows how strongly each
independent variable is correlated
with the dependent variable.
204. 22
Global Test: Testing the Multiple
Regression Model
The global test is used to investigate
whether any of the independent
variables have significant coefficients.
The hypotheses are:
$$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$$
$$H_1: \text{Not all } \beta\text{s equal } 0$$
205. 23
Global Test continued
l The test statistic follows the F
distribution with k (number of
independent variables) and
n-(k+1) degrees of freedom, where
n is the sample size.
l Decision Rule:
Reject H0 if F > Fα,k,n-k-1
208. 26
Interpretation
l The computed value of F is
21.90, which is in the rejection
region.
l The null hypothesis that all the
multiple regression coefficients
are zero is therefore rejected.
l Interpretation: some of the
independent variables (amount
of insulation, etc.) do have the
ability to explain the variation in
the dependent variable (heating
cost).
l Logical question – which ones?
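The computed F of 21.90 can be reproduced from the ANOVA sums of squares with a short Python sketch. SSR, SS total, n, and k are taken from the heating cost example; the critical value 3.24 for F(.05; 3, 16) is read from a standard F table and is an assumption here, because the slide stating it is not in this section.

```python
ssr, ss_total = 171_220, 212_916   # from the ANOVA table
n, k = 20, 3                       # sample size, number of independent variables
sse = ss_total - ssr

# Global test statistic: mean square regression over mean square error
f = (ssr / k) / (sse / (n - k - 1))
critical = 3.24                    # assumed table value for F(.05; 3, 16)
reject_h0 = f > critical
print(round(f, 2), reject_h0)
```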
209. 27
Evaluating Individual Regression
Coefficients (βi = 0)
l This test is used to determine which independent variables have nonzero
regression coefficients.
l The variables that have zero regression coefficients are usually dropped
from the analysis.
l The test statistic follows the t distribution with n-(k+1) degrees of freedom.
l The hypothesis test is as follows:
H0: βi = 0
H1: βi ≠ 0
Reject H0 if t > tα/2,n-k-1 or t < -tα/2,n-k-1
216. 34
Critical t-stat for the New Slopes
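Reject H0 if:
$$t < -t_{\alpha/2,\,n-k-1} \quad \text{or} \quad t > t_{\alpha/2,\,n-k-1}$$
$$\frac{b_i - 0}{s_{b_i}} < -t_{\alpha/2,\,n-k-1} \quad \text{or} \quad \frac{b_i - 0}{s_{b_i}} > t_{\alpha/2,\,n-k-1}$$
$$\frac{b_i - 0}{s_{b_i}} < -t_{.05/2,\,20-2-1} \quad \text{or} \quad \frac{b_i - 0}{s_{b_i}} > t_{.05/2,\,20-2-1}$$
$$\frac{b_i - 0}{s_{b_i}} < -t_{.025,\,17} \quad \text{or} \quad \frac{b_i - 0}{s_{b_i}} > t_{.025,\,17}$$
$$\frac{b_i - 0}{s_{b_i}} < -2.110 \quad \text{or} \quad \frac{b_i - 0}{s_{b_i}} > 2.110$$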
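The decision rule for an individual coefficient can be sketched in Python. The critical value 2.110 is t(.025, 17) from the slide; the coefficient and its standard error below are hypothetical, since the regression output is not reproduced in this section.

```python
def coefficient_t(b_i, s_b_i):
    """t statistic for H0: beta_i = 0."""
    return (b_i - 0) / s_b_i

def reject_h0(t, critical=2.110):   # t(.025, 17) from the slide
    """Two-tailed rule: reject when t falls in either tail."""
    return t < -critical or t > critical

# Hypothetical coefficient and standard error, for illustration only
t = coefficient_t(b_i=-5.0, s_b_i=2.0)
print(t, reject_h0(t))
```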
218. 36
Evaluating the
Assumptions of Multiple Regression
1. There is a linear relationship. That is, there is a straight-line
relationship between the dependent variable and the set of
independent variables.
2. The variation in the residuals is the same for both large and
small values of the estimated Y. To put it another way, the
residual is unrelated to whether the estimated Y is large or small.
3. The residuals follow the normal probability distribution.
4. The independent variables should not be correlated. That is,
we would like to select a set of independent variables that are
not themselves correlated.
5. The residuals are independent. This means that successive
observations of the dependent variable are not correlated. This
assumption is often violated when time is involved with the
sampled observations.
219. 37
Analysis of Residuals
A residual is the difference between the
actual value of Y and the predicted
value of Y. Residuals should be
approximately normally distributed.
Histograms and stem-and-leaf charts
are useful in checking this requirement.
l A plot of the residuals and their
corresponding Y’ values is used for
showing that there are no trends or
patterns in the residuals.
222. 40
Distribution of Residuals
Both MINITAB and Excel offer another graph that helps to evaluate the
assumption of normally distributed residuals. It is called a normal
probability plot and is shown to the right of the histogram.
223. 41
Multicollinearity
l Multicollinearity exists when independent
variables (X’s) are correlated.
l Correlated independent variables make it
difficult to make inferences about the
individual regression coefficients (slopes)
and their individual effects on the dependent
variable (Y).
l However, correlated independent variables
do not affect a multiple regression equation’s
ability to predict the dependent variable (Y).
224. 42
Variance Inflation Factor
l A general rule is if the correlation between two independent
variables is between -0.70 and 0.70 there likely is not a problem
using both of the independent variables.
l A more precise test is to use the variance inflation factor (VIF).
l The value of VIF is found as follows:
$$VIF = \frac{1}{1 - R_j^2}$$
•The term $R_j^2$ refers to the coefficient of determination, where the selected
independent variable is used as a dependent variable and the remaining
independent variables are used as independent variables.
•A VIF greater than 10 is considered unsatisfactory, indicating that the
independent variable should be removed from the analysis.
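A minimal Python sketch of the variance inflation factor, assuming the standard definition VIF = 1 / (1 - R²j), where R²j comes from regressing the selected independent variable on the remaining ones. The R²j values below are hypothetical and are not taken from the slides.

```python
def vif(r2_j):
    """Variance inflation factor from the R-squared of X_j on the other X's."""
    return 1 / (1 - r2_j)

# Hypothetical R-squared values
for r2_j in (0.0, 0.5, 0.9):
    print(r2_j, round(vif(r2_j), 2))
```

Under this definition the cutoff of 10 corresponds to an R²j of 0.90.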
225. 43
Multicollinearity – Example
Refer to the data in the
table, which relates the
heating cost to the
independent variables
outside temperature,
amount of insulation,
and age of furnace.
Develop a correlation
matrix for all the
independent variables.
Does it appear there is a
problem with
multicollinearity?
Find and interpret the
variance inflation factor
for each of the
independent variables.
227. 45
VIF – Minitab Example
The VIF value of 1.32 is less than the upper limit
of 10. This indicates that the independent variable
temperature is not strongly correlated with the
other independent variables.
228. 46
Independence Assumption
l The fifth assumption about regression and
correlation analysis is that successive
residuals should be independent.
l When successive residuals are correlated we
refer to this condition as autocorrelation.
Autocorrelation frequently occurs when the
data are collected over a period of time.
229. 47
Residual Plot versus Fitted Values
l The graph below shows the
residuals plotted on the
vertical axis and the fitted
values on the horizontal
axis.
l Note the run of residuals
above the mean of the
residuals, followed by a run
below the mean. A scatter
plot such as this would
indicate possible
autocorrelation.
230. 48
Qualitative Independent Variables
l Frequently we wish to use nominal-scale
variables—such as gender, whether the
home has a swimming pool, or whether the
sports team was the home or the visiting
team—in our analysis. These are called
qualitative variables.
l To use a qualitative variable in regression
analysis, we use a scheme of dummy
variables in which one of the two possible
conditions is coded 0 and the other 1.
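The 0/1 coding scheme can be illustrated with a tiny Python sketch. The function name and the list of homes are hypothetical.

```python
def dummy(has_feature):
    """Code a two-level qualitative variable as 0 or 1."""
    return 1 if has_feature else 0

# Hypothetical homes: True means the home has an attached garage
garage = [dummy(h) for h in (False, True, True, False)]
print(garage)
```

With this coding, the regression coefficient on the dummy variable estimates the difference in the dependent variable between the two conditions, other variables held constant.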
231. 49
Qualitative Variable - Example
Suppose in the Salsberry
Realty example that the
independent variable
“garage” is added. For those
homes without an attached
garage, 0 is used; for homes
with an attached garage, a 1
is used. We will refer to the
“garage” variable as The
data from Table 14–2 are
entered into the MINITAB
system.
233. 51
Using the Model for Estimation
What is the effect of the garage variable? Suppose we have two houses exactly
alike next to each other in Buffalo, New York; one has an attached garage,
and the other does not. Both homes have 3 inches of insulation, and the
mean January temperature in Buffalo is 20 degrees.
For the house without an attached garage, a 0 is substituted for the garage variable in the
regression equation. The estimated heating cost is $280.90.
For the house with an attached garage, a 1 is substituted for the garage variable in the
regression equation. The estimated heating cost is $358.30.
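The two estimates can be reproduced with a short sketch. The garage coefficient of 77.4 follows from the $77.40 difference between the two estimates. The intercept and the temperature and insulation coefficients do not appear in this section; the values used below are assumptions, chosen because they reproduce both estimates for a home with 3 inches of insulation at 20 degrees.

```python
# Coefficients: only B_GARAGE is implied by the slides (358.30 - 280.90).
# The other three values are assumptions that reproduce the two estimates.
INTERCEPT = 394.0       # assumed
B_TEMPERATURE = -3.96   # assumed
B_INSULATION = -11.3    # assumed
B_GARAGE = 77.4         # difference between the two estimates


def estimated_heating_cost(temperature, insulation, garage):
    """Estimated heating cost; garage is the dummy variable (0 or 1)."""
    return (INTERCEPT
            + B_TEMPERATURE * temperature
            + B_INSULATION * insulation
            + B_GARAGE * garage)


cost_without = estimated_heating_cost(temperature=20, insulation=3, garage=0)
cost_with = estimated_heating_cost(temperature=20, insulation=3, garage=1)

print(f"Without garage: ${cost_without:.2f}")
print(f"With garage:    ${cost_with:.2f}")
print(f"Difference:     ${cost_with - cost_without:.2f}")
```

Because the garage variable takes only the values 0 and 1, its coefficient is exactly the estimated difference in heating cost between otherwise identical homes.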
234. 52
Testing the Model for Significance
l We have shown the difference between the two
types of homes to be $77.40, but is the difference
significant?
l We conduct the following test of hypothesis.
H0: βi = 0
H1: βi ≠ 0
Reject H0 if t > t_{α/2, n-k-1} or t < -t_{α/2, n-k-1}
235. 53
Evaluating Individual Regression
Coefficients (βi = 0)
l This test is used to determine which independent variables have nonzero
regression coefficients.
l Variables whose regression coefficients are not significantly different from zero
are usually dropped from the analysis.
l The test statistic follows the t distribution with
n-(k+1), or n-k-1, degrees of freedom.
l The hypothesis test is as follows:
H0: βi = 0
H1: βi ≠ 0
Reject H0 if t > t_{α/2, n-k-1} or t < -t_{α/2, n-k-1}
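The decision rule can be sketched as follows. The sample size, number of independent variables, and computed t values below are hypothetical, used only to show the mechanics; the critical value 2.120 is the t-table value for a two-tailed test at the 0.05 level with 16 degrees of freedom.

```python
def degrees_of_freedom(n, k):
    """Degrees of freedom for the t test: n - (k + 1)."""
    return n - (k + 1)


def reject_h0(t_statistic, t_critical):
    """Two-tailed rule: reject H0 if t > t_crit or t < -t_crit."""
    return t_statistic > t_critical or t_statistic < -t_critical


# Hypothetical setup: 20 observations and 3 independent variables.
n, k = 20, 3
df = degrees_of_freedom(n, k)

# t-table value for alpha/2 = 0.025 and 16 degrees of freedom.
t_critical = 2.120

# Hypothetical computed t statistics for two coefficients.
for name, t_statistic in [("variable A", 2.5), ("variable B", -1.0)]:
    decision = "reject H0" if reject_h0(t_statistic, t_critical) else "do not reject H0"
    print(f"{name}: t = {t_statistic}, df = {df}, {decision}")
```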
237. 55
Stepwise Regression
The advantages to the stepwise method are:
1. Only independent variables with significant regression
coefficients are entered into the equation.
2. The steps involved in building the regression equation are clear.
3. It is efficient in finding the regression equation with only
significant regression coefficients.
4. The changes in the multiple standard error of estimate and the
coefficient of determination are shown.
238. 56
Stepwise Regression – Minitab Example
The stepwise MINITAB output for the heating cost
problem follows.
Temperature is selected first. This variable explains more of the
variation in heating cost than any of the other three proposed
independent variables.
Garage is selected next, followed by Insulation.
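The first step of the stepwise method, choosing the single variable that explains the most variation in the dependent variable, can be illustrated by comparing squared correlations. This is a simplified sketch of that first step only; later steps also require fitting multiple-regression models and testing each added coefficient, which is not shown. All data values and names below are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def first_variable_to_enter(candidates, y):
    """Pick the candidate with the largest r^2 against y."""
    r_squared = {name: pearson_r(x, y) ** 2 for name, x in candidates.items()}
    best = max(r_squared, key=r_squared.get)
    return best, r_squared


# Hypothetical sample of six homes.
heating_cost = [300, 250, 200, 150, 100, 50]
candidates = {
    "temperature": [10, 22, 28, 41, 50, 60],
    "insulation": [3, 6, 4, 8, 5, 7],
    "garage": [1, 0, 1, 1, 0, 0],
}

best, r_squared = first_variable_to_enter(candidates, heating_cost)
print("First variable entered:", best)
for name, value in r_squared.items():
    print(f"  {name}: r^2 = {value:.3f}")
```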
239. 57
Regression Models with Interaction
l In Chapter 12 we discussed interaction among independent variables.
To explain, suppose we are studying weight loss and assume, as the
current literature suggests, that diet and exercise are related. So the
dependent variable is amount of change in weight and the
independent variables are: diet (yes or no) and exercise (none,
moderate, significant). We are interested in whether there is interaction
among the independent variables. That is, if those studied maintain
their diet and exercise significantly, will that increase the mean amount
of weight lost? Is total weight loss more than the sum of the loss due to
the diet effect and the loss due to the exercise effect?
l In regression analysis, interaction can be examined as a separate
independent variable. An interaction prediction variable can be
developed by multiplying the data values in one independent variable
by the values in another independent variable, thereby creating a new
independent variable. A two-variable model that includes an interaction
term is:
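A common way to write such a model is shown below. The notation is an assumption, since the original equation is not reproduced here: X_1 and X_2 are the two independent variables, their product X_1 X_2 is the interaction variable, and β_3 is its regression coefficient.

```latex
Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2
```

If β_3 is zero, the effect of X_1 on Y does not depend on the level of X_2, and there is no interaction.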
240. 58
Regression Models with Interaction - Example
Refer to the heating cost example. Is there an interaction between
the outside temperature and the amount of insulation? If both
variables are increased, is the effect on heating cost greater than
the sum of the savings from warmer temperature and the savings from
increased insulation separately?
241. 59
Regression Models with Interaction - Example
Creating the Interaction Variable – Using the
information from the table in the previous slide, an
interaction variable is created by multiplying the
temperature variable by the insulation variable.
For the first sampled home, the temperature is 35
degrees and the insulation is 3 inches, so the value of
the interaction variable is 35 × 3 = 105. The values
of the other interaction products are found in a
similar fashion.
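Creating the interaction variable amounts to an element-by-element product of two columns. In the sketch below, only the first home's values (35 degrees, 3 inches) come from the slide; the remaining values are hypothetical placeholders.

```python
# First entry of each list is from the slide; the rest are hypothetical.
temperature = [35, 29, 36]
insulation = [3, 4, 7]

# Interaction variable: temperature multiplied by insulation, home by home.
interaction = [t * i for t, i in zip(temperature, insulation)]

print(interaction[0])  # first sampled home: 35 x 3 = 105
```

The new column is then entered into the regression as one more independent variable alongside temperature and insulation.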
243. 61
Regression Models with Interaction - Example
The regression equation is:
Is the interaction variable significant at the 0.05
significance level?
244. 62
There are other situations that can occur when studying
interaction among independent variables.
1. It is possible to have a three-way interaction among
the independent variables. In the heating example,
we might have considered the three-way interaction
between temperature, insulation, and age of the
furnace.
2. It is possible to have an interaction where one of the
independent variables is nominal scale. In our
heating cost example, we could have studied the
interaction between temperature and garage.
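Both situations reduce to the same product construction. The sketch below is illustrative only: the helper name and all data values are hypothetical. With a 0/1 dummy such as garage, the product equals the temperature for homes with a garage and 0 for homes without one, so its coefficient measures how the temperature effect differs between the two groups.

```python
from math import prod


def interaction_term(*columns):
    """Element-by-element product of any number of equal-length columns."""
    length = len(columns[0])
    if any(len(column) != length for column in columns):
        raise ValueError("all columns must have the same length")
    return [prod(row) for row in zip(*columns)]


# Hypothetical data for three homes.
temperature = [35, 29, 36]
insulation = [3, 4, 7]
furnace_age = [6, 10, 3]
garage = [0, 1, 1]  # dummy variable: 1 = attached garage

# Interaction with a nominal-scale (dummy) variable.
temp_x_garage = interaction_term(temperature, garage)

# Three-way interaction among temperature, insulation, and furnace age.
three_way = interaction_term(temperature, insulation, furnace_age)

print(temp_x_garage)
print(three_way)
```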