SCHAUM’S OUTLINE OF
THEORY AND PROBLEMS
OF
BEGINNING STATISTICS
LARRY J. STEPHENS, Ph.D.
Professor o
f Mathematics
University of Nebrasku at Oriialin
SCHAUM’S OUTLINE SERIES
McGRAW-HILL
New York San Francisco Washington,D.C. Auckland Bogotci Caracas Lisbon
London Madrid Mexico City Milan Montreal New Dehli
San Juan Singapore Sydney Tokyo Toronto
To My Mother and Father, Rosie, and Johnie Stephens
LARRY J. STEPHENS is Professor of Mathematics at the University of Nebraska at
Omaha. He received his bachelor’s degree from Memphis State University in Mathematics,
his master’s degree from the University of Arizona in Mathematics, and his Ph.D. degree
from Oklahoma State University in Statistics. Professor Stephens has over 40 publications in
professional journals. He has over 25 years of experience teaching Statistics. He has taught at
the University of Arizona, Christian Brothers College, Gonzaga University, Oklahoma State
University, the University of Nebraska at Kearney, and the University of Nebraska at Omaha.
He has published numerous computerized test banks to accompany elementary Statistics texts.
He has worked for NASA, Livermore Radiation Laboratory, and Los Alamos Laboratory.
Since 1989, Dr. Stephens has consulted with and conducted Statistics seminars for the
engineering group at 3M, Valley, Nebraska plant.
Schaurn’s Outline of Theory and Problems of
BEGlNNlNG STATISTICS
Copyright 0 1998by the McGraw-Hill Companies, Inc. All rights reserved. Printed in the United
States of Atnerica. Except as permitted under the Copyright Act of 1976, no part of this publication
tna he reproduced or distributed in any form or by any means, or stored in a data base or retrieval
system, without the prior written permission of the publisher.
3 4 5 6 7 8 9 10 I 1 12 13 14 IS 16 17 18 19 20 PRS PRS 9 0 2 1 0 9
ISBN 0-07-061259-5
Sponsoring Editor: Barbara Gilson
Production Supervisor: Clara Stanley
Editing Supervisor: Maureen B. Walker
Mirrirtih is U registered tradenuirk o
fMinitub h c .
Library of CongressCataloging-in-PublicationData
Stephens. Larry J.
Schaum’s outline of theory and problems of beginning statistics /
Larry J. Stephens.
Includes index.
p. cm. - (Schaurn’s outline series)
ISBN 0-07-061259-5(pbk.)
I . Mathematical statistics-Outlines, syllabi, etc.
2. Mathematical statistics-Problems, exercises, etc. I. Title.
11. Series.
QA276.19.S74 1998
519.5’0764~21 97-45979
CIP
AC
Preface
Statistics is a required course for undergraduate college students in a number of majors. Students in the
following disciplines are often required to take a course in beginning statistics: allied health careers, biology,
business, computer science, criminal justice, decision science, engineering, education, geography, geology,
information science, nursing, nutrition, medicine, pharmacy, psychology, and public administration. This
outline is intended to assist these students in the understanding of Statistics. The outline may be used as a
supplement to textbooks used in these courses or a text for the course itself.
The author has taught such courses for over 25 years and understands the difficulty students encounter with
statistics. I have included examples from a wide variety of current areas of application in order to motivate an
interest in learning statistics. As we leave the twentieth century and enter the twenty-first century, an
understanding of statistics is essential in understanding new technology, world affairs, and the ever-expanding
volume of knowledge. Statistical concepts are encountered in television and radio broadcasting, as well as in
magazines and newspapers. Modern newspapers, such as USA Today, are full of statistical information. The
sports section is filled with descriptive statistics concerning players and teams performance. The money section
of USA Today contains descriptive statistics concerning stocks and mutual funds. The life section of USA Today
often contains summaries of research studies in medicine. An understanding of statistics is helpful in evaluating
these research summaries.
The nature of the beginning statistics course has changed drastically in the past 30 or so years. This change
is due to the technical advances in computing. Prior to the 1960s statistical computing was usually performed
on mechanical calculators. These were large cumbersome computing devices (compared to today’s hand-held
calculators) that performed arithmetic by moving mechanical parts. Computers and computer software were no
comparison to today’s computers and software. The number of statistical packages available today numbers in
the hundreds. The burden of statistical computing has been reduced to simply entering your data into a data file
and then giving the correct command to perform the statistical method of interest.
One of the most widely used statistical packages in academia as well as industrial settings is the package
called Minitab (Minitab Inc., 3081 Enterprise Drive, State College, PA 16801-3008).I wish to thank Minitab
Inc. for granting me permission to include Minitab output, including graphics, throughout the text. Most
modern Statistics textbooks include computer software as part of the text. I have chosen to include Minitab
because it is widely used and is very friendly. Once a student learns the various data file structures needed to
use Minitab, and the structure of the commands and subcommands, this knowledge is readily transferable to
other statistical software.
The outline contains all the topics, and more, covered in a beginning statistics course. The only
mathematical prerequisite needed for the material found in the outline is arithmetic and some basic algebra. I
wish to thank my wife, Lana, for her understanding during the preparation of the book. I wish to thank my
friend Stanley Wileman for all the computer help he has given me during the preparation of the book. I wish to
thank Dr. Edwin C . Hackleman of Delta Software, Inc. for his timely assistance as compositor of the final
camera-ready manuscript. Finally, I wish to thank the staff at McGraw-Hill for their cooperation and
helpfulness.
LARRYJ. STEPHENS
...
111
This page intentionally left blank
Contents
Chapter 1 INTRODUCTION................................................................................. 1
Statistics. Descriptive Statistics. Inferential Statistics: Population and
Sample. Variable, Observation. and Data Set. Quantitative Variable:
Discrete and Continuous Variable. Qualitative Variable. Nominal,
Ordinal, Interval, and Ratio Levels of Measurement. Summation Notation.
Computers and Statistics.
Chapter 2 ORGANIZING DATA.......................................................................... 14
Raw Data. Frequency Distribution for Qualitative Data. Relative
Frequency of a Category. Percentage. Bar Graph. Pie Chart.
Frequency Distribution for Quantitative Data. Class Limits, Class
Boundaries, Class Marks, and Class Width. Single-Valued Classes.
Histograms. Cumulative Frequency Distributions. Cumulative Relative
Frequency Distributions. Ogives. Stem-and-Leaf Displays.
Chapter 3 DESCRIPTIVE MEASURES.............................................................. 40
Measures of Central Tendency. Mean, Median, and Mode for Ungrouped
Data. Measures of Dispersion. Range, Variance, and Standard Deviation
for Ungrouped Data. Measures of Central Tendency and Dispersion for
Grouped Data. Chebyshev’s Theorem. Empirical Rule. Coefficient of
Variation. Z Scores. Measures of Position: Percentiles, Deciles, and
Quartiles. Interquartile Range. Box-and-Whisker Plot.
Chapter 4 PROBABILITY..................................................................................... 63
Experiment, Outcomes, and Sample Space. Tree Diagrams and the
Counting Rule. Events, Simple Events, and Compound Events. Probability.
Classical, Relative Frequency and Subjective Probability Definitions.
Marginal and Conditional Probabilities. Mutually Exclusive Events.
Dependent and Independent Events. Complementary Events.
Multiplication Rule for the Intersection of Events. Addition Rule for the
Union of Events. Bayes’ Theorem. Permutations and Combinations. Using
Permutations and Combinations to Solve Probability Problems.
Chapter 5 DISCRETE RANDOM VARIABLES ................................................ 89
Random Variable. Discrete Random Variable. Continuous Random
Variable. Probability Distribution. Mean of a Discrete Random Variable.
Standard Deviation of a Discrete Random Variable. Binomial Random
Variable. Binomial Probability Formula. Tables of the Binomial
Distribution. Mean and Standard Deviation of a Binomial Random
Variable. Poisson Random Variable. Poisson Probability Formula.
Hypergeometric Random Variable. Hypergeometric Probability Formula.
V
vi CONTENTS
Chapter6 CONTINUOUSRANDOM VARIABLES AND THEIR
PROBABILITYDISTRIBUTIONS.................................................... 113
Uniform Probability Distribution. Mean and Standard Deviation for the
Uniform Probability Distribution. Normal Probability Distribution.
Standard Normal Distribution. Standardizing a Normal Distribution.
Applications of the Normal Distribution. Determining the z and x Values
When an Area under the Normal Curve is Known. Normal Approximation
to the Binomial Distribution. Exponential Probability Distribution.
Probabilities for the Exponential Probability Distribution.
Chapter7 SAMPLINGDISTRIBUTIONS.......................................................... 140
Simple Random Sampling. Using Random Number Tables. Using the
Computer to Obtain a Simple Random Sample. Systematic Random
Sampling. Cluster Sampling. Stratified Sampling. Sampling Distribution
of the Sampling Mean. Sampling Error. Mean and Standard Deviation of the
Sample Mean. Shape of the Sampling Distribution of the Sample Mean
and the Central Limit Theorem. Applications of the Sampling Distribution
of the Sample Mean. Sampling Distribution of the Sample Proportion.
Mean and Standard Deviation of the Sample Proportion. Shape of the
Sampling Distribution of the Sample Proportion and the Central Limit
Theorem. Applications of the Sampling Distribution of the Sample Proportion.
Chapter 8 ESTIMATIONAND SAMPLESIZE DETERMINATION:
ONE POPULATION ........................................................................... 166
Point Estimate. Interval Estimate. Confidence Interval for the Population
Mean: Large Samples. Maximum Error of Estimate for the Population
Mean. The t Distribution. Confidence Interval for the Population Mean:
Small Samples. Confidence Interval for the Population Proportion: Large
Samples. Determining the Sample Size for the Estimation of the
Population Mean. Determining the Sample Size for the Estimation of the
Population Proportion.
Chapter9 TESTS OF HYPOTHESIS:ONE POPULATION............................ 185
Null Hypothesis and Alternative Hypothesis. Test Statistic, Critical
Values, Rejection and Nonrejection Regions.Type I and Type I1 Errors.
Hypothesis Tests about a Population Mean: Large Samples. Calculating
Type I1 Errors. P Values. Hypothesis Tests about a Population Mean:
Small Samples. Hypothesis Tests about a Population Proportion: Large
Samples.
CONTENTS vii
Chapter 10 INFERENCESFOR TWO POPULATIONS..................................... 211
Sampling Distribution of -x,for Large Independent Samples.
Estimation of p1
-p2 Using Large Independent Samples. Testing
Hypothesis about pI-p, Using Large Independent Samples.
Sampling Distribution of XI-x2for Small Independent Samples
from Normal Populations with Equal (but unknown) Standard
Deviations. Estimation of p1-p, Using Small Independent Samples froin
Normal Populations with Equal (but unknown) Standard Deviations.
Testing Hypothesis about p,-p, Using Small Independent
Samples from Normal Populations with Equal (but Unknown) Standard
Deviations. Sampling Distribution of XI-x2for Small Independent Samples
from Normal Populations with Unequal (and Unknown) Standard Deviations.
Estimation of p,- p2Using Small Independent Samples from Normal
Populations with Unequal (and Unknown) Standard Deviations. Testing
Hypothesis about p1-p2 Using Small Independent Samples from Normal
Populations with Unequal (and Unknown) Standard Deviations.
Sampling Distribution of a for Normally Distributed Differences
Computed from Dependent Samples. Estimation of pdUsing Normally
Distributed Differences Computed from Dependent Samples. Testing
Hypothesis about pdUsing Normally Distributed Differences Computed
from Dependent Samples. Sampling Distribution of PI-p2for Large Independent
Samples. Estimation of PI-P2 Using Large Independent Samples. Testing
Hypothesis about PI-P2 Using Large Independent Samplcs.
Chapter 11 CHI-SQUAREPROCEDURES........................................................... 249
Chi-square Distribution. Chi-square Tables. Goodness-of-Fit Test.
Observed and Expected Frequencies. Sampling Distribution of the
Goodness-of-Fit Test Statistic. Chi-square Independence Test.
Sampling Distribution of the Test Statistic for the Chi-square
Independence Test. Sampling Distribution of the Sample Variance.
Inferences Concerning the Population Variance.
Chapter 12 ANALYSIS OF VARIANCE (ANOVA)............................................. 272
F Distribution. F Table. Logic Behind a One-way ANOVA. Sum of
Squares, Mean Squares, and Degrees of Freedom for a One-way ANOVA.
Sampling Distribution for the One-way ANOVA Test Statistic. Building
One-way ANOVA Tables and Testing the Equality of Means. Logic
Behind a Two-way ANOVA. Sum of Squares, Mean Squares, and Degrees
of Freedom for a Two-way ANOVA. Building Two-way ANOVA Tables.
Sampling Distributions for the Two-way ANOVA. Testing Hypothesis
Concerning Main Effects and Interaction.
-~
...
V l l l CONTENTS
Chapter 13 REGRESSIONAND CORRELATION.............................................. 309
Straight Lines. Linear Regression Model. Least Squares Line. Error Sum
of Squares. Standard Deviation of Errors. Total Sum of Squares.
Regression Sum of Squares. Coefficient of Determination. hiean, Standard
Deviaticjii,and Sampling Distribution of the Slope of the Estimated Regression
Equation. Inferences Concerning the Slope of the Population Regression Line.
Estimation and Prediction in Linear Regression. Linear Correlation
Coefficient. Inference Concerning the Population Correlation Coefficient.
Chapter 14 NONPARAMETRICSTATISTICS.................................................... 334
NonparametricMethods. Sign Test. Wilcoxon Signed-Ranks Test for Two
Dependent Samples. Wilcoxon Rank-Sum Test for Two Independent Samples.
Kruskal-Wall6 Test. Rank Correlation. Runs Test for Randomness.
APPENDIX
1 Binomial Probabilities....................................................................................................................... 359
~~- 2 Areas under the Standard Normal Curve from 0 to 2 ....................................................................... 364
~ 3 Area in the Right Tail under the t Distribution Curve....................................................................... 365
~ 4 Area in the Right Tail under the Chi-square Distribution Curve....................................................... 366
~ 5 Area in the Right Tail under the F Distribution Curve...................................................................... 367
..
INDEX.................................................................................................................................................... 369
-
Chapter 1
Introduction
STATISTICS
Statistics is a discipline of study dealing with the collection, analysis, interpretation, and
presentation of data. Statistical methodology is utilized by pollsters who sample our opinions
concerning topics ranging from art to zoology. Statistical methodology is also utilized by business
and industry to help control the quality of goods and services that they produce. Social scientists and
psychologists use statistical methodology to study our behaviors. Because of its broad range of
applicability, a course in statistics is required of majors in disciplines such as sociology, psychology,
criminal justice, nursing, exercise science, pharmacy, education, and many others. To accommodate
this diverse group of users, examples and problems in this outline are chosen from many different
sources.
DESCRIPTIVE STATISTICS
The use of graphs, charts, and tables and the calculation of various statistical measures to
organize and summarize information is called descriptive statistics. Descriptive statistics help to
reduce our information to a manageable size and put it into focus.
EXAMPLE 1.1 The compilation of batting average, runs batted in, runs scored, and number of home runs for
each player, as well as earned run average, wordlost percentage, number of saves, etc., for each pitcher from the
official score sheets for major league baseball players is an example of descriptive statistics. These statistical
measures allow us to compare players, determine whether a player is having an “off year” or “good year,” etc.
EXAMPLE 1.2 The publication entitled Crime in the United States published by the Federal Bureau of
Investigation gives summary information concerning various crimes for the United States. The statistical
measures given in this publication are also examples of descriptive statistics and they are useful to individuals in
law enforcement.
INFERENTIALSTATISTICS:POPULATIONAND SAMPLE
The complete collection of individuals, items, or data under consideration in a statistical study
is referred to as the pupulatiurz. The portion of the population selected for analysis is called the
sample. Inferential statistics consists of techniques for reaching conclusions about a population
based upon information contained in a sample.
EXAMPLE 1.3 The results of polls are widely reported by both the written and the electronic media. The
techniques of inferential statistics are widely utilized by pollsters. Table 1.1 gives several examples of
populations and samples encountered in polls reported by the media. The methods of inferential statistics are
used to make inferences about the populations based upon the results found in the samples and to give an
indication about the reliability of these inferences. The results of a poll of 600 registered voters might be
reported as follows: Forty percent of the voters approve of the president’s economic policies. The margin of
error for the survey is 4%. The survey indicates that an estimated 40%of all registered voters approve of the
economic policies, but it might be as low as 36%or as high as 44%.
1
2
Population
All registered voters
All owners of handguns
Households headed by a single parent
The CEOs of all private companies
INTRODUCTION [CHAP. 1
Sample
A telephone survey of 600 registered voters
A telephone survey of 1000handgun owners
The results from questionnaires sent to 2500
households headed by a single parent
The results from surveys sent to 150CEO’s of
private companies
Table 1.1
Population
All prison inmates
Legal aliens living in the United States
Alzheimer patients in the United States
Adult children of alcoholics
Sample
A criminaljustice study of 350 prison inmates
A sociological study conducted by a university
researcher of 200 legal aliens
A medical study of 75 such patients
conducted by a university hospital
A psychological study of 200 such individuals
EXAMPLE 1.4 The techniques of inferential statistics are applied in many industrial processes to control the
quality of the products produced. In industrial settings, the population may consist of the daily production of
toothbrushes, computer chips, bolts, and so forth. The sample will consist of a random and representative
selection of items from the process producing the toothbrushes, computer chips, bolts, etc. The information
contained in the daily samples is used to construct control charts. The control charts are then used to monitor the
quality of the products.
EXAMPLE 1.5 The statistical methods of inferential statistics are used to analyze the data collected in
research studies. Table 1.2 gives the samples and populations for several such studies. The information
contained in the samples is utilized to make inferences concerning the populations. If it is found that 245 of 350
or 70% of prison inmates in a criminal justice study were abused as children, what conclusions may be inferred
concerning the percent of all prison inmates who were abused as children? The answers to this question are
found in Chapters 8 and 9.
VARIABLE, OBSERVATION, AND DATA SET
A characteristic of interest concerning the individual elements of a population or a sample is
called a variable. A variable is often represented by a letter such as x, y, or z. The value of a variable
for one particular element from the sample or population is called an Observation. A data set consists
of the observations of a variable for the elements of a sample.
EXAMPLE 1.6 Six hundred registered voters are polled and each one is asked if they approve or disapprove
of the president’s economic policies. The variable is the registered voter’s opinion of the president’s economic
policies. The data set consists of 600 observations. Each observation will be the response “approve” or the
response “do not approve.’’ If the response “approve” is coded as the number 1 and the response “do not
approve’’is coded as 0, then the data set will consist of 600 observations, each one of which is either 0 or 1. If x
is used to represent the variable, then x can assume two values, 0 or I .
CHAP. 11 INTRODUCTION 3
appears for the first time
EXAMPLE 1.7 A survey of 2500 households headed by a single parent is conducted and one characteristic of
interest is the yearly household income. The data set consists of the 2500 yearly household incomes for the
individuals in the survey. If y is used to represent the variable, then the values for y will be between the smallest
and the largest yearly household incomes for the 2500 households.
, (there is no upper limit since conceivably
one might need to flip forever to obtain the
first head)
EXAMPLE 1.8 The number of speeding tickets issued by 75 Nebraska state troopers for the month of June is
recorded. The data set consists of 75 observations.
QUANTITATIVEVARIABLE: DISCRETE AND CONTINUOUS VARIABLE
A quantitative variable is determined when the description of the characteristic of interest results
in a numerical value. When a measurement is required to describe the characteristic of interest or it is
necessary to perform a count to describe the characteristic, a quantitative variable is defined. A
discrete variable is a quantitative variable whose values are countable. Discrete variables usually
result from counting. A continuous variable is a quantitative variable that can assume any numerical
value over an interval or over several intervals. A continuous variable usually results from making a
measurement of some type.
EXAMPLE 1.9 Table 1.3 gives several discrete variables and the set of possible values for each one. In each
case the value of the variable is determined by counting. For a given box of 100diabetic syringes, the number of
defective needles is determined by counting how many of the 100are defective. The number of defectives found
must equal one of the 101 values listed. The number of possible outcomes is finite for each of the first four
variables; that is, the number of possible outcomes are 101, 31, 501, and 51 respectively. The number of
possible outcomes for the last variable is infinite. Since the number of possible outcomes is infinite and
countable for this variable, we say that the number of outcomes is countably infinite.
Sometimes it is not clear whether a variable is discrete or continuous. Test scores expressed as a
percent, for example, are usually given as whole numbers between 0 and 100. It is possible to give a
score such as 75.57565. However, this is not done in practice because teachers are unable to evaluate
to this degree of accuracy. This variable is usually regarded as continuous, although for all practical
purposes, it is discrete. To summarize, due to measurement limitations, many continuous variables
actually assume only a countable number of values.
Table 1.3
Discrete variable
The number of defective needles in boxes of 100diabetic
syringes
The number of individuals in groups of 30 with a type A
personality
The number of surveys returned out of 500 mailed in
sociological studies
The number of prison inmates in 50 having finished high
school or obtained a GED who are selected for criminal
justice studies
The number of times you need to flip a coin before a head
Possible values for the variable
0, I, 2,. . . , 100
0,1,2, . . . , 30
0, 1,2, . . . , 500
0,1,2, . . . , 50
1 , 2 , 3 , .. .
4 INTRODUCTION [CHAP. 1
Qualitative variable
Marital status
Gender
Crime classification
Pain level
Personality type
EXAMPLE 1.10 Table 1.4 gives several continuous variables and the set of possible values for each one. All
three continuous variables given in Table 1.4 involve measurement, whereas the variables in Example I .9 all
involve counting.
Table 1.4
Possible categories for the variable
Single,married, divorced, separated
Male, female
Misdemeanor, felony
None, low, moderate, severe
Type A, type B
Continuous variable
The length of prison time served for
individuals convicted of first degree murder
The household income for households with
incomes less than or equal to $20,000
The cholesterol reading for those
individuals having cholesterol readings
equal to or greater than 200 mg/dl
Possible values for the variable
All the real numbers between a and b,
where a is the smallest amount of time
served and b is the largest amount
All the real numbers between a and
$20,000,where a is the smallest
household income in the population
All real numbers between 200 and b,
where b is the largest cholesterol reading
of all such individuals
QUALITATIVE VARIABLE
A qualitative variable is determined when the description of the characteristic of interest results
in a nonnumerical value. A qualitative variable may be classified into two or more categories.
EXAMPLE 1.11 Table 1.5 gives several examples of qualitative variables along with a set of categories into
which they may be classified.
The possible categories for qualitative variables are often coded for the purpose of performing
computerized statistical analysis. Marital status might be coded as 1, 2, 3, or 4, where 1 represents
single, 2 represents married, 3 represents divorced, and 4 represents separated. The variable gender
might be coded as 0 for female and 1 for male. The categories for any qualitative variable may be
coded in a similar fashion. Even though numerical values are associated with the characteristic of
interest after being coded, the variable is considered a qualitative variable.
NOMINAL, ORDINAL, INTERVAL, AND RATIO LEVELS OF MEASUREMENT
There are four levels of meusuremerit or scales o
f measurements into which data can be
classified. The nominal scale applies to data that are used for category identification. The nominal
level o
f measi~rernerttis characterized by data that consist of names, labels, or categories only.
Nominal scale duta cannot be arranged in an ordering scheme. The arithmetic operations of addition,
subtraction, multiplication, and division are not performed for nominal data.
CHAP. 1
3
Qualitative variable
Blood type
INTRODUCTION
Possible nominal level data values
associated with the variable
A, B, AB, 0
5
Qualitative variable
Automobile size description
Product rating
Socioeconomic class
Pain level
Baseball team classification
EXAMPLE 1.12 Table 1.6gives several qualitative variables and a set of possible nominal level data values.
The data values are often encoded for recording in a computer data tile. Blood type might be recorded as 1, 2, 3,
or 4; state of residence might be recorded as 1,2,. . . ,or 50; and type of crime might be recorded as 0 or I , or 1
or 2, etc. Similarly, color of road sign could be recorded as 1, 2, 3, 4, or 5 and religion could be recorded as 1,
2, or 3. There is no order associated with these data and arithmetic operations are not performed. For example,
adding Christian and Moslem (1 + 2) does not give other (3).
Possible ordinal level data values associated
with the variable
Subcompact, compact, intermediate, full-size
Poor, good, excellent
Lower, middle, upper
None, low, moderate, severe
Class A, class AA, class AAA , major league
State of residence
Type of crime
Color of road signs in the state of Nebraska
Religion
Alabama,. . . ,Wyoming
Misdemeanor, felony
Red, white, blue, brown, green
Christian, Moslem, other
The ordinal scale applies to data that can be arranged in some order, but differences between
data values either cannot be determined or are meaningless. The ordinal level o
f measurement is
characterized by data that applies to categories that can be ranked. Ordinal scale data can be
arranged in an ordering scheme.
EXAMPLE 1.13 Table 1.7 gives several qualitative variables and a set of possible ordinal level data values.
The data values for ordinal level data are often encoded for inclusion in computer data files. Arithmetic
operations are not performed on ordinal level data, but an ordering scheme exists. A full-size automobile is
larger than a subcompact, a tire rated excellent is better than one rated poor, no pain is preferable to any level
of pain, the level of play in major league baseball is better than the level of play in class AA, and so forth.
The interval scale applies to data that can be arranged in some order and for which differences in
data values are meaningful. The interval level o
f measurement results from counting or measuring.
Interval scale data can be arranged in an ordering scheme and differences can be calculated and
interpreted. The value zero is arbitrarily chosen for interval data and does not imply an absence of
the characteristic being measured. Ratios are not meaningful for interval data.
EXAMPLE 1.14 Stanford-Binet IQ scores represent interval level data. Joe’s IQ score equals 100 and John’s
IQ score equals 150.John has a higher IQ than Joe; that is, IQ scores can be arranged in order. John’s IQ score
is 50 points higher than Joe’s IQ score; that is, differences can be calculated and interpreted. However, we
cannot conclude that John is 1.5 times (150/100 = 1.5) more intelligent than Joe. An IQ score of zero does not
indicate a complete lack of intelligence.
6 INTRODUCTION [CHAP. 1
EXAMPLE 1.15 Temperatures represent interval level data. The high temperature on February 1 equaled 25°F
and the high temperature on March 1 equaled 50°F. It was warmer on March 1 than it was on February 1. That
is, temperatures can be arranged in order. It was 25” warmer on March 1 than on February 1. That is, differences
may be calculated and interpreted. We cannot conclude that it was twice as warm on March 1 than it was on
February 1. That is, ratios are not readily interpretable. A temperature of 0°F does not indicate an absence of
warmth.
EXAMPLE 1.16 Test scores represent interval level data. Lana scored 80 on a test and Christine scored 40 on
a test. Lana scored higher than Christine did on the test; that is, the test scores can be arranged in order. Lana
scored 40 points higher than Christine did on the test; that is, differences can be calculated and interpreted. We
cannot conclude that Lana knows twice as much as Christine about the subject matter. A test score of 0 does not
indicate an absence of knowledge concerning the subject matter.
The ratio scale applies to data that can be ranked and for which all arithmetic operations
including division can be performed. Division by zero is, of course, excluded. The ratio level of
measurement results from counting or measuring. Ratio scale data can be arranged in an ordering
scheme and differences and ratios can be calculated and interpreted. Ratio level data has an absolute
zero and a value of zero indicates a complete absence of the characteristicof interest.
EXAMPLE 1.17 The grams of fat consumed per day for adults in the United States is ratio scale data. Joe
consumes 50 grams of fat per day and John consumes 25 grams per day. Joe consumes twice as much fat as
John per day, since 50/25 = 2. For an individual who consumes 0 grams of fat on a given day, there is a
complete absence of fat consumed on that day. Notice that a ratio is interpretable and an absolute zero exists.
EXAMPLE 1.18 The number of 911 emergency calls in a sample of 50 such calls selected from a 24-hour
period involving a domestic disturbance is ratio scale data. The number found on May 1 equals 5 and the
number found on June 1 equals 10. Since 10/5 = 2, we say that twice as many were found on June 1 than were
found on May 1. For a 24-hour period in which no domestic disturbance calls were found, there is a complete
absence of such calls. Notice that a ratio is interpretable and an absolute zero exists.
SUMMATIONNOTATION
Many of the statistical measures discussed in the following chapters involve sums of various
types. Suppose the number of 91I emergency calls received on four days were 41 I, 375, 400, and
478. If we let x represent the number of calls received per day, then the values of the variable for the
four days are represented as follows: x1 = 41 I, x2 = 375, x j = 400, and x4 = 478. The sum of calls for
the four days is represented as XI + x2 + x j + x4 which equals 411 + 375 + 400 + 478 or 1664. The
symbol Cx, read as “the summation o
f x,” is used to represent XI +x2 + xj +x4 .The uppercase Greek
letter C (pronounced sigma) corresponds to the English letter S and stands for the phrase “the sum
of.” Using the summation notation, the total number of 911 calls for the four days would be written
as Cx = 1664.
EXAMPLE 1.19 The following five values were observed for the variable x: xI = 4, x2 = 5, x3 = 0, x4 = 6, and
x5 = 10.The following computations illustrate the usage of the summation notation.
CX= X I +~2 + x j + ~4 + ~ 5 = 4
+ 5 +O + 6 + 1 0 ~ 2 5
(CS)~
= (XI+ x2 + x j + ~4 + ~ 5 ) ~
= (25)’ = 625
Cx2= x12+ x
: +x3’ + X
: + xS2= 42+5
’ +0’ +6’ + 102= 177
C(x - 5 ) = (x, - 5) + (x2 - 5) + ( X j - 5) + (x4 - 5) + (xs - 5 )
C(X- 5) = ( 4 - 5 ) + ( 5 - 5 ) + ( 0 - 5 ) + ( 6 - 5 ) + ( 10- 5 )= -1 +0 - 5 + 1 +5 = 0
CHAP. I] INTRODUCTION 7
EXAMPLE 1.20 The following values were observed for the variables x and y: XI = I , x2 = 2, xj = 0, x4 = 4,
y1= 2, y2 = I , y3 = 4, and y4 = 5. The following computations show how the summation notation is used for two
variables.
ZXY
= xlyl + ~ 2 ~ 2
+xjyj + ~ 4 ~ 4
= 1 x 2 + 2 x 1 +0 x 4 +4 x 5 = 24
(Cx)(Cy)=(x1 +X2+Xj+Xq)(Y1 + y 2 + y 3 + y 4 ) = ( 1 + 2 + 0 + 4 ) ( 2 + 1 + 4 + 5 ) = 7 x 12=84
( C X ~ - ( C X ) ~ / ~ ) X ( Z ~ ~ - ( C Y ) ~ / ~ ) = ( ~
+ 4 + 0 + 1 6 - 7 2 / 4 ) ~ ( 4 +
1 + 16+25- 12*/4)=(8.75)~(10)=
87.5
COMPUTERSAND STATISTICS
The techniques of descriptive and inferential statistics involve lengthy repetitive computations as
well as the construction of various graphical constructs. These computations and graphical
constructions have been simplified by the development of computer software. These computer
software programs are referred to as statistical sojhare packages, or simply statistical packages.
These statistical packages are large computer programs which perform the various computations and
graphical constructions discussed in this outline plus many other ones beyond the scope of the
outline. Statistical packages are currently available for use on mainframes, minicomputers, and
microcomputers.
There are currently available numerous statistical packages. Four widely used statistical
packages are: MINITAB, BMDP, SPSS, and SAS. Many of the figures found in the following
chapters are MINITAB generated. MINITAB is a registered trademark of Minitab, Inc., 3081
Enterprise Drive, State College, PA 16801. Phone: 814-238-3280; fax: 8 14-238-4383; telex: 88 1612.
The author would like to thank Minitab Inc. for their permission to use output from MINITAB
throughout the outline.
Solved Problems
DESCRIPTIVE STATISTICSAND INFERENTIAL STATISTICS:
POPULATIONAND SAMPLE
1.1 Classify each of the following as descriptive statistics or inferential statistics.
(a) The average points per game, percent of free throws made, average number of rebounds
per game, and average number of fouls per game as well as several other measures
for players in the NBA are computed.
(b) Ten percent of the boxes of cereal sampled by a quality technician are found to be under
the labeled weight. Based on this finding, the filling machine is adjusted to increase the
amount of fill.
(c) USA Today gives several pages of numerical quantities concerning stocks listed in AMEX,
NASDAQ, and NYSE as well as mutual funds listed in MUTUALS.
(d)Based on a study of 500 single parent households by a social researcher, a magazine
reports that 25% of all single parent households are headed by a high school dropout.
Ans. (a) The measurements given organize and summarize information concerning the players and is
therefore considered descriptive statistics.
8 INTRODUCTION [CHAP. 1
(6) Because of the high percent of boxes of cereal which are under the labeled weight in the sample, a
decision is made to increase the weight per box for each box in the population. This is an example of
inferential statistics.
(c) The tables of measurements such as stock prices and change in stock prices are descriptive in nature
and therefore represent descriptive statistics.
((2) The magazine is stating a conclusion about &he
population based upon a sample and therefore this is
an example of inferential statistics.
1.2 Identify the sample and the population in each of the following scenarios.
(a) In order to study the response times for emergency 91 1 calls in Chicago, fifty “robbery in
progress” calls are selected randomly over a six-month period and the response times are
recorded.
(b) In order to study a new medical charting system at Saint Anthony’s Hospital, a
representative group of nurses is asked to use the charting system. Recording times and
error rates are recorded for the group.
(c)Fifteen hundred individuals who listen to talk radio programs of various types are selected
and information concerning their education level, income level, and so forth is recorded.
Am. (a) The 50 “robbery in progress” calls is the sample, and all “robbery in progress” calls in Chicago
during the six-month period is the population.
(bj The representative group of nurses who use the medical charting system is the sample and all nurses
who use the medical charting system at Saint Anthony’s is the population.
(cj The 1500 selected individuals who listen to talk radio programs is the sample and the millions who
listen nationally is the population.
VARIABLE, OBSERVATION, AND DATA SET
1.3
1.4
1.5
In a sociological study involving 35 low-income households, the number of children per
household was recorded for each household. What is the variable? How many observations are
in the data set?
Ans. The variable is the number of children per household. The data set contains 35 observations.
A national survey was mailed to 5000 households and one question asked for the number of
handguns per household. Three thousand of the surveys were completed and returned. What is
the variable and how large is the data set?
Ans. The variable is the number of handguns per household and there are 3000 observations in the data
set.
The number of hours spent per week on paper work was determined for 200 middle level
managers. The minimum was 0 hours and the maximum was 27 hours. What is the variable?
How many observations are in the data set?
Am. The variable is the number of hours spent per week on paper work and the number of observations
equals 200.
QUANTITATIVE VARIABLE: DISCRETE AND CONTINUOUS VARIABLE
1.6 Classify the variables in problems I .3, 1.4, and 1.S as continuous or discrete.
CHAP. 1
3 INTRODUCTION 9
Ans. The number of children per household is a discrete variable since the number of values this
variable may assume is countable. The values range from 0 to some maximum value such as 10 or
15depending upon the population.
The number of handguns per household is countable, ranging from 0 to some maximum value and
therefore this variable is discrete.
The time spent per week on paper work by middle level managers may be any real number
between 0 and some upper limit. The number of values possible is not countable and therefore this
variable is continuous.
1.7 A program to locate drunk drivers is initiated and roadblocks are used to check for individuals
driving under the influence of alcohol. Let n represent the number of drivers stopped before the
first drunk driver is found. What are the possible values for n? Classify n as discrete or
continuous.
Ans. The number of drivers stopped before finding the first drunk driver may equal 1, 2, 3, . . . ,up to
an infinitely large number. Although not likely, it is theoretically possible that an extremely large
number of drivers would need to be checked before finding the first drunk driver. The possible
values for n are all the positive integers. N is a discrete variable.
1.8 The KSW computer science aptitude test consists of 25 questions. The score reported is
reflective of the computer science aptitude of the test taker. How would the score likely be
reported for the test? What are the possible values for the scores? Is the variable discrete or
continuous?
Ans. The score reported would likely be the number or percent of correct answers. The number correct
would be a whole number from 0 to 25 and the percent correct would range from 0 to 100 in steps
of size 4. However if the test evaluator considered the reasoning process used to arrive at the
answers and assigned partial credit for each problem, the scores could range from 0 to 25 or 0 to
100 percent continuously. That is, the score could be any real number between 0 and 25 or any
real number between 0 and 100 percent. We might say that for all practical purposes, the variable
is discrete. However, theoretically the variable is continuous.
QUALITATIVE VARIABLE
1.9 Which of the following are qualitative variables?
(a) The color of automobiles involved in several severe accidents
(b)The length of time required for rats to move through a maze
(c) The classification of police administrations as city, county, or state
(d)The rating given to a pizza in a taste test as poor, good, or excellent
(e) The number of times subjects in a sociological research study have been married
Am. The variables given in (a),(c), and (d)are qualitative variables since they result in nonnumerical
values. They are classified into categories. The variables given in (b)and (e) result in numerical
values as a result of measuring and counting, respectively.
1.10 The pain level following surgery for an intestinal blockage was classified as none, low,
moderate, or severe for several patients. Give three different numerical coding schemes that
might be used for the purpose of inclusion of the responses in a computer data file. Does this
coding change the variable to a quantitative variable?
10 INTRODUCTION [CHAP. 1
Am. The responses none, low, moderate, or severe might be coded as 0, 1,2, or 3 or 1, 2, 3, or 4 or as
10, 20, 30, or 40. There is no limit to the number of coding schemes that could be used. Coding
the variable does not change it into a quantitative variable. Many times coding a qualitative
variable simplifies the computer analysis performed on the variable.
NOMINAL, ORDINAL, INTERVAL, AND RATIO LEVELS OF MEASUREMENT
1.11 Indicate the scale of measurement for each of the following variables: racial origin, monthly
phone bills, Fahrenheit and centigrade temperature scales, military ranks, time, ranking of a
personality trait, clinical diagnoses, and calendar numbering of the years.
Am. racial origin: nominal
monthly phone bills: ratio
temperature scales: interval
military ranks: ordinal
time: ratio
ranking of personality trait: ordinal
clinical diagnoses: nominal
calendar numbering of the years: interval
1.12 Which scales of measurement would usually apply to qualitative data?
Am. nominal or ordinal
SUMMATION NOTATION
1.13 The following values are recorded for the variable x: X I = 1.3, x2 = 2.5, x3 = 0.7, x4 = 3.5.
Evaluate the following summations: Ex, Zx2,(Ex)*,and E(x - S).
A ~ s . CX= XI + ~2 + x j +~4 = 1.3 +2.5 +0.7 + 3.5 = 8.0
Cx2 = xI2 + ~2~ + ~3~ + X
: = 1.32+2.5’ +0.72+ 3S2= 20.68
( C X ) ~
= (8.0)2= 64.0
C(X- .5) = (XI - .5) + ( ~ 2
- .5) +(xj - .5) + ( ~ 4
- .5) = 0.8 + 2.0 +0.2 + 3.0 = 6.0
1.14 The following values are recorded for the variables x and y: XI = 25.67, x2 = 10.95, xj = 5.65,
yl = 3.45, y2 = 1.55, and y j = 3.50. Evaluate the following summations: Zxy, Cx2y2,and
cxy - cxcy.
AIZS. Cxy = xlyl + x2y2+ x3y3= 25.67 x 3.45 + 10.95 x 1.55+5.65 x 3.50 = 125.31
Cx2y2= x12y12+ X
;
Y
; + ~ 3 ~ ~ 3 ~
= 25.672x 3.452+ 10.952x 1.552 +5.652x 3.502= 8522.26
CXY
- CXZY
= 125.31 - 42.27 x 8.50 = -233.99
1.15 The sum of four values for the variable y equals 25, that is, Cy = 25. If it is known that yl = 2,
y2 = 7, and y3 = 6, find y4.
Ans. Cy = 25 = 2 + 7 + 6 + y4, or 25 = 15 + y4. From this, we see that y4 must equal 10.
CHAP. 11
Patient name
Sam Alcorn
Susan Collins
Larry Halsey
Bill Samuels
Lana Williams
INTRODUCTION
Fasting blood sugar reading
135
I57
168
120
160
11
Supplementary Problems
DESCRIPTIVE STATISTICS AND INFERENTIALSTATISTICS: POPULATION AND SAMPLE
1.16 Classify each of the following as descriptive statistics or inferential statistics.
(a) The Nielsen Report on Television utilizes data from a sample of viewers to give estimates of average
viewing time per week per viewer for all television viewers.
(6) The U.S. National Center for Health Statistics publication entitled Vital Statistics of the United
States lists the leading causes of death in a given year. The estimates are based upon a sampling of
death certificates.
(c) The Omaha WorldHerald lists the low and high temperatures for several American cities.
(d)The number of votes a presidential candidate receives are given for each state following the
presidential election.
(e) The National Household Survey on Drug Abuse gives the current percentage of young adults using
different types of drugs. The percentages are based upon national samples.
Ans. (a) inferential statistics (b) inferential statistics (c) descriptive statistics
(d)descriptive statistics (e) inferential statistics
1.17 Classify each of the following as a sample or a population.
(a) all diabetics in the United States
(6) a group of 374 individuals selected for a New York TinzedCBS news poll
(c) all owners of Ford trucks
(d) all registered voters in the state of Arkansas
(e) a group of 22,000physicians who participate in a study to determine the role of aspirin in preventing
heart attacks
Am. (a)population (b) sample (c) population ( d ) population (e) sample
VARIABLE, OBSERVATION, AND DATA SET
1.18 Changes in systolic blood pressure readings were recorded for 325 hypertensive patients who were
participating in a study involving a new medication to control hypertension. Larry Doe is a patient in the
study and he experienced a drop of 15 units in his systolic blood pressure. What statistical term is used to
describe the change in systolic blood pressure readings? What does the number 325 represent? What term
is used for the 15-unitdrop is systolic blood pressure'?
Arts. The change in blood pressure is the variable, 325 is the number of observations in the data set, and
15-unitdrop in blood pressure is an observation.
1.19 Table 1.8 gives the fasting blood sugar reading for five patients at a small medical clinic. What is the
variable? Give the observations that comprise this data set.
Table 1.8
12 INTRODUCTION [CHAP. 1
Ans. The variable is the fasting blood sugar reading for a patient. The observations are 135, 157, 168,
120, and 160.
1.20 A sociological study involving a minority group recorded the educational level of the participants. The
educational level was coded as follows: less than high school was coded as 1, high school was coded as 2,
college graduate was coded as 3, and postgraduate was coded as 4. The results were:
1 1 2 3 4 3 2 2 2 2 1 1 1 2 2 1 2 3 3 2 2 1 1
2 2 2 2 2 2 1 1 3 3 2 2 2 2 2 1 2 2 2 2 2 1 3
What is the variable'?How many observations are in the data set?
Am. The variable is the educational level of a participant. There are 46 observations in the data set.
QUANTITATIVE VARIABLE: DISCRETE AND CONTINUOUS VARIABLE
1.21 Classify the variables in problems I. 18, 1.19,and 1.20as discrete or continuous.
Ans. The variables in problems 1.18 and 1.19 are continuous. The variable in problem 1.20 is not a
quantitative variable.
1.22 A die is tossed until the face 6 turns up on a toss. The variable x equals the toss upon which the face 6
first appears. What are the possible values that x may assume? Is x discrete or continuous?
Am. x may equal any positive integer, and it is therefore a discrete variable.
1.23 Is it possible for a variable to be both discrete and continuous?
Ans. no
QUALITATIVE VARIABLE
1.24 Give five examples of a qualitative variable.
Ans. I. Classification of government employees 4. Medical specialty of doctors
5. ZIP code
2. Motion picture ratings
3. College student classification
1.25 Which of the following is not a qualitative variable? hair color, eye color, make of computer, personality
type, and percent of income spent on food
Am. percent of income spent on food
NOMINAL, ORDINAL, INTERVAL, AND RATIO LEVELS OF MEASUREMENT
1.26 Indicate the scale of measurement for each of the following variables: religion classification; movie
ratings of 1, 2, 3, or 4 stars; body temperature; weights of runners; and consumer product ratings given
as poor, average, or excellent.
Ans. religion: nominal
movie ratings: ordinal
body temperature: interval
weights of runners: ratio
consumer product ratings: ordinal
1.27 Which scales of measurement would usually apply to quantitative data?
CHAP. I] INTRODUCTION 13
Ans. interval or ratio
SUMMATION NOTATION
1.28 The following values are recorded for the variable x: x1 = 15, x2 = 25, x3 = 10, and x4 = 5. Evaluate the
following summations: Cx, Ex2, (CX)~,
and C(x - 5).
Ans. Ex = 55, Cx’ = 975, (CX)~
= 3025, C(x - 5) = 35
1.29 The following values are recorded for the variables x and y: xi =17, x2 = 28, x3 = 35, y1 = 20, y2 = 30,
and y3 = 40. Evaluate the following summations:Zxy, Cx2y2,and Cxy - ZxCy.
Ans. Cxy = 2580, Cx2y2= 2,781,200, CXY- CXCY= -4,620
1.30 Given that xI = 5, x2= 10, y1 = 20, and Cxy = 200, find y2.
Ans. y2 = 10
Chapter 2
Offense
Rape
Robbery
Burglary
Arson
Murder
Theft
MansIaughter
Organizing Data
Tally Frequency
I/ 2
I// 3
/I/ 3
I// 3
/I/ 3
//I 3
)xl/ I// 8
RAW DATA
Information obtained by observing values of a variable is called raw data. Data obtained by
observing values of a qualitative variable are referred to as qualitative dafa. Data obtained by
observing values of a quantitative variable are referred to as quantitative data. Quantitative data
obtained from a discrete variable are also referred to as discrete data and quantitative data obtained
from a continuous variable are called continuous data.
EXAMPLE 2.1 A study is conducted in which individuals are classified into one of sixteen personality types
using the Myers-Briggs type indicator. The resulting raw data would be classified as qualitative data.
EXAMPLE 2.2 The cardiac output in liters per minute is measured for the participants in a medical study. The
resulting data would be classified as quantitative data and continuous data.
EXAMPLE 2.3 The number of murders per 100,000 inhabitants is recorded for each of several large cities for
the year 1994.The resulting data would be classified as quantitative data and discrete data.
FREQUENCY DISTRIBUTIONFOR QUALITATIVE DATA
A frequencv distributiou for qualitative data lists all categories and the number of elements that
belong to each of the categories.
EXAMPLE 2.4 A sample of rural county arrests gave the following set of offenses with which individuals were
charged:
rape robbery burglary arson murder robbery rape manslaughter
arson theft arson burglary theft robbery theft theft
theft burglary murder murder theft theft theft manslaughter
manslaughter
The variable, type of offense, is classified into the categories: rape, robbery, burglary, arson, murder, theft, and
manslaughter. As shown in Table 2.1, the seven categories are listed under the column entitled Offense, and each
occurrence of a category is recorded by using the symbol / in order to tally the number of times each offense
occurs. The number of tallies for each offense is counted and listed under the column entitled Frequency.
Occasionally the term absolute frequency is used rather than frequency.
Table 2.1
14
CHAP. 21 ORGANIZINGDATA
2/25 = .08
3/25 = .I2
3/25 = .I2
3/25 = .12
3/25 = .I2
8/25 = .32
3/25 = .12
15
.08 x 100= 8%
.12x loo= 12%
.I2 x loo= 12%
.I2 x loo= 12%
.I2 x loo= 12%
3 2 x 100 = 32%
.12 x loo= 12%
RELATIVE FREQUENCY OF A CATEGORY
The relative frequency of a category is obtained by dividing the frequency for a category by the
sum of all the frequencies. The relative frequencies for the seven categories in Table 2.1 are shown in
Table 2.2. The sum of the relative frequencies will always equal one.
PERCENTAGE
The percentage for a category is obtained by multiplying the relative frequency for that category
by 100. The percentages for the seven categories in Table 2.1 are shown in Table 2.2. The sum of the
percentages for all the categories will always equal 100percent.
Offense
Rape
Robbery
Burglary
Arson
Murder
Theft
Manslaughter
BAR GRAPH
A bar graph is a graph composed of bars whose heights are the frequencies of the different
categories. A bar graph displays graphically the same information concerning qualitative data that a
frequency distribution shows in tabular form.
EXAMPLE 2.5 The distribution of the primary sites for cancer is given in Table 2.3 for the residents of Dalton
County.
Tab
Primary site
Digestive system
Respiratory
Breast
Genitals
Urinary tract
Other
2
.
3
Frequency
20
30
10
5
5
To construct a bar graph, the categories are placed along the horizontal axis and frequencies are marked along
the vertical axis. A bar is drawn for each category such that the height of the bar is equal to the frequency for that
category. A small gap is left between the bars. The bar graph for Table 2.3 is shown in Fig. 2-1. Bar graphs can
also be constructed by placing the categories along the vertical axis and the frequencies along the horizontal axis.
See problem 2.5 for a bar graph of this type.
16
Primary site Relative frequency Angle sir,e
Digestive system -26 360 x .26 = 93.6"
Respiratory .40 360 x .40 = 144"
Breast .I3 360 x .I3 = 46.8"
Genitals .07 360 x .07= 25.2"
Urinary tract .07 360 x .07 = 25.2"
Other .07 360 x .07= 25.2"
b
ORGANIZING DATA [CHAP. 2
n
u n u o
Digesiive Resphtory Breast Genitals Urinary Other
system system tract
Primary site
Fig. 2-1
PIE CHART
A pie chart is also used to graphically display qualitative data. To construct a pie chart, a circle is
divided into portions that represent the relative frequencies or percentages belonging to different
categories.
EXAMPLE 2
.
6 To construct a pie chart for the frequency distribution in Table 2.3, construct a table that gives
angle sizes for each category. Table 2.4 shows the determination of the angle sizes for each of the categories in
Table 2.3. The 360" in a circle are divided into portions that are proportional to the category s k s . The pie chart
for the frequency distribution in Table 2.3 is shown in Fig. 2-2.
Table 2.4
Primary cancer sites
Respiratory
(40.0%
)
system
stern
(13.341) (6 7 % )
Fig. 2-2
CHAP. 21 ORGANIZING DATA
IQ score Frequency
80-94 8
95- 109 14
110-124 24
125- I39 16
J 140- I54 13
17
Class limits Class boundaries
80-94 79.5-94.5
95- 109 94.5-1 09.5
110-124 109.5- I 24.5
125- 139 124.5-1 39.5
140- 154 139.5- 154.5
FREQUENCY DISTRIBUTION FOR QUANTITATIVE DATA
Class width Class marks
15 87.0
15 102.0
15 117.0
15 132.0
15 147.0
There are many similarities between frequency distributions for qualitative data and frequency
distributions for quantitative data. Terminology for frequency distributions of quantitative data is
discussed first, and then examples illustrating the construction of frequency distributions for
quantitative data are given. Table 2.5 gives a frequency distribution of the Stanford-Binet intelligence
test scores for 75 adults.
IQ score is a quantitative variable and according to Table 2.5, eight of the individuals have an IQ
score between 80 and 94, fourteen have scores between 95 and 109, twenty-four have scores between
I I0 and 124, sixteen have scores between I25 and 139,and thirteen have scores between 140and 154.
CLASS LIMITS, CLASS BOUNDARIES, CLASS MARKS, AND CLASS WIDTH
The frequency distribution given in Table 2.5 is composed of five classes. The classes are: 80-94,
95-1 09, 1 10- 124, 125-1 39, and 140- 154. Each class has a lobtter class limit and an irpper class limit.
The lower class limits for this distribution are 80, 95, 110, 125,and 140.The upper class limits are 94,
109, 124, 139,and 154.
If the lower class limit for the second class, 95, is added to the upper class limit for the first class,
94, and the sum divided by 2, the upper horrrrdar?l for the first class and the Iokt'er bozrrrdary for the
second class is determined. Table 2.6 gives all the boundaries for Table 2.5.
If the lower class limit is added to the upper class limit for any class and the sum divided by 2, the
class riicirk for that class is obtained. The class mark for a class is the midpoint of the class and is
sometimes called the class rniclpoirrt rather than the class mark. The class marks for Table 2.5 are
shown in Table 2.6.
The difference between the boundaries for any class gives the class width for a distribution. The
class width for the distribution in Table 2.5 is IS.
When forming a frequency distribution, the following general guidelines should be followed:
I . The riiiiziher of c*lnssesshoirld be hetweeri 5 mid 1
.
5
2. Each data dire rziirst belong to m e , arid only one, class.
3. Wheii possible, all classes should be of equal Midth.
18 ORGANIZING DATA [CHAP. 2
Weight
100to under 125
125to under 150
150 to under 175
175to under 200
200 to under 225
225 to under 250
EXAMPLE 2.7 Group the following weights into the classes 100 to under 125, 125to under 150,and so forth:
Frequency
2
5
10
10
4
4
1 1 1 120 127 129 130 145 145 150 153 155 160
161 165 167 170 171 174 175 177 179 180 180
185 185 190 I95 I95 201 210 220 224 225 230
245 248
Price
2.50 to 2.59
2.60 to 2.69
2.70 to 2.79
2.80 to 2.89
2.90 to 2.99
3.00 to 3.09
The weights I 1 1 and 120 are tallied into the class 100 to under 125. The weights 127, 129, 130, 145 and 145 are
tallied into the class 125 to under 150 and so forth until the frequencies for all classes are found. The frequency
distribution for these weights is given in Table 2.7
'Tally Frequency
I 1
11 2
2
%, 6
/I/ 3
1/11 4
When a frequency distribution is given in this form, the class limits and class boundaries may be considered to be
the same. The class marks are 1 12.5, 137.5, 162.5, 187.5,212.5, and 237.5. The class width is 25.
EXAMPLE 2.8 The price thr 500 aspirin tablets is determined for each of twenty randomly selected stores as
part of a larger consuiiier study. 'The prices are as follows:
2.50 2.95 2.65 3.10 3.15 3.05 3.05 2.60 2.70 2.75
2.80 2.80 2.85 2.80 3.00 3.00 2.90 2.90 2.85 2.85
Suppose we wish to group these data into seven classes. Since the maximum price is 3.15 and the minimum price
is 2.50, the spread in prices is 0.65. Each class should then have a width equal to approximately 117 of 0.65 or
.093. There is it lot o f flexibility in choosing the classes while following the guidelines given above. Table 2.8
shows the results if a class width equal to 0.10 is selected and the first class begins at the minimum price.
Table 2.8
3.10 to 3.19 11 2
The frequency distribution might also be given in a form such as that shown in Table 2.9. The two different
ways of expressing the classes shown in Tables 2.8 and 2.9 will result in the same frequencies.
Table 2.9
Price
2.50 to less than 2.60
2.60 to less than 2.70
2.70 to less than 2.80
2.80 to less than 2.90
2.90 to less than 3.00
3.00 to less than 3.10
3.10 to less than 3.20
Frequency
1
2
2
6
5
4
7
Frequency
1
2
2
6
3
4
7
CHAP. 21 ORGANIZING DATA
Weight
4.72
4.73
4.74
4.75
4.76
4.77
19
Frequency
2
1
6
9
2
5
SINGLE-VALUED CLASSES
If only a few unique values occur in a set of data, the classes are expressed as a single value rather
than an interval of values. This typically occurs with discrete data but may also occur with continuous
data because of measurernent constraints.
EXAMPLE 2.9 A quality technician selects 25 bars of soap from the daily production. The weights in ounces of
the 25 bars are as follows:
4.75 4.74 4.74 4.77 4.73 4.75 4.76 4.77
4.72 4.75 4.77 4.74 4.75 4.77 4.72 4.74
4.75 4.75 4.74 4.76 4.75 4.75 4.74 4.75
4.77
Since only six unique values occur, we will use single-valued classes. The weight 4.72 occurs twice, 4.73 occurs
once, 4.74 occurs six times, 4.75 occurs nine times, 4.76 occurs twice, and 4.77 occurs five times. The frequency
distribution is shown in Table 2.10.
Table 2.10
HISTOGRAMS
A histogram is a graph that displays the classes on the horizontal axis and the frequencies of the
classes on the vertical axis. The frequency of each class is represented by a vertical bar whose height is
equal to the frequency of the class. A histogram is similar to a bar graph. However, a histogram utilizes
classes or intervals and frequencies while a bar graph utilizes categories and frequencies.
EXAMPLE 2.10 A histogram for the aspirin prices in Table 2.9 is shown in Fig. 2-3.
6
5
2 4
s
3
c
W
6)
L 2
I
n
"
I I I I 1 1 1
2.55 2.65 2.75 2.85 2.95 3.05 3.15
Price
Fig. 2-3
20 ORGANIZING DATA [CHAP. 2
5 -
4-
)
r
0
3
3 3 -
s
L t 2 -
1 -
0-
A symmetric histogram is one that can be divided into two pieces such that each is the mirror
image of the other. One of the most commonly occurring symmetric histograms is shown in Fig. 2-4.
This type of histogram is often referred to as a mound-shaped histogram or a bell-shaped histogram. A
symmetric histogram in which each class has the same frequency is called a uniform or rectangular
histogram. A skewed to the right histogram has a longer tail on the right side. The histogram shown in
Fig. 2-5 is skewed to the right. A skewed to the fefi histogram has a longer tail on the left side. The
histogram shown in Fig. 2-6 is skewed to the left.
8 -
7 -
6 -
2 5 -
& 3 -
2 -
1 -
0 7
h
Q)
g 4 -
Classes
Fig. 2-4
Classes
Fig. 2-5
Classes
Fig. 2-6
CHAP. 21
Cumulative frequency
0
5
5 + 10= 15
1 5 + 5 = 2 0
20 + 3 = 23
23 +2 = 25
ORGANIZINGDATA
Cumulative
0125 = 0 0%
5/25 = .20 20%
15/25= .60 60%
20/25 = .80 80%
23/25 = .92 92%
25/25 = 1.00 100%
relative frequency Cumulative percentage
21
CUMULATIVE FREQUENCY DISTRIBUTIONS
A cumulative frequency distribution gives the total number of values that fall below various class
boundaries of a frequency distribution.
EXAMPLE 2.11 Table 2.11 shows the frequency distribution of the contents in milliliters of a sample of 25 one-
liter bottles of soda. Table 2.12 shows how to construct the cumulative frequency distribution that corresponds to
the distribution in Table 2.11,
Table 2.11
Content
970 to less than 990
990 to less than 1010
1010to less than 1030
1030 to less than 1050
1050to less than 1070
3
CUMULATIVE RELATIVE FREQUENCY DISTRIBUTIONS
A cumulative relative frequency is obtained by dividing a cumulative frequency by the total
number of observations in the data set. The cumulative relative frequencies for the frequency
distribution given in Table 2.11 are shown in Table 2-12. Cr4mulative perceittages are obtained by
multiplying cumulative relative frequencies by 100. The cumulative percentages for the distribution
given in Table 2.1I are shown in Table 2.12.
Contents less than
970
990
1010
I030
1050
1070
OGIVES
An ogive is a graph in which a point is plotted above each class boundary at a height equal to the
cumulative frequency corresponding to that boundary. Ogives can also be constructed for a cumulative
relative frequency distribution as well as a cumulative percentage distribution.
EXAMPLE 2.12 The ogive corresponding to the cumulative frequency distribution in Table 2.12 is shown in
Fig. 2-7.
22 ORGANIZINGDATA [CHAP. 2
l l l l l 1 1 1 I l l
970 980 990 1OOO 1010 1020 103010401050 1060 1070
Contents
Fig. 2-7
STEM-AND-LEAFDISPLAYS
In a stem-and-leaf display each value is divided into a stem and a leaf. The leaves for each stem are
shown separately. The stem-and-leaf diagram preserves the information on individual observations.
EXAMPLE 2.13 The following are the California Achievement Percentile Scores (CAT scores) for 30 seventh-
grade students:
50 65 70 35 40 57 66 65 70 35
29 33 44 56 66 60 44 50 58 46
67 78 79 47 35 36 44 57 60 57
A stem-and-leaf diagram for these CAT scores is shown in Fig. 2-8.
Leaves
9
3 5 5 5 6
0 4 4 4 6 7
0 0 6 7 7 7 8
0 0 5 5 6 6 7
0 0 8 9
Fig. 2-8
Solved Problems
RAW DATA
2.1 Classify the following data as either qualitative data or quantitative data. In addition, classify the
quantitative data as discrete or continuous.
(a) The number of times that a movement authority is sent to a train from a relay station is
recorded for several trains over a two-week period. The movement authority, which is an
electronic transmission, is sent repeatedly until a return signal is received from the train.
CHAP. 21 ORGANIZING DATA 23
(b)A physician records the follow-up condition of patients with optic neuritis as improved,
unchanged, or worse.
(c) A quality technician records the length of material in a roll product for several products
selected from a production line.
(d) The Bureau o
f Justice Statistics Sourcebook of Criminal Justice Statistics in reporting on the
daily use within the last 30 days of drugs among young adults lists the type of drug as
marijuana, cocaine, or stimulants (adjusted).
(e) The number of aces in five-card poker hands is noted by a gambler over several weeks of
gambling at a casino.
Am. (a) The number of times that the moving authority must be sent is countable and therefore these data
are quantitative and discrete.
(h) These data are categorical or qualitative.
(c) The length of material can be any number within an interval of values, and therefore these data
are quantitative and continuous.
(d) These data are categorical or qualitative
(e) Each value in the data set would be one of the five numbers 0, 1, 2, 3, or 4. These data are
quantitative and discrete.
FREQUENCY DISTRIBUTIONFOR QUALITATIVEDATA
2.2 The following list gives the academic ranks of the 25 female faculty members at a small liberal
arts college:
instructor assistant professor assistant professor instructor
associate professor assistant professor associate professor assistant professor
full professor associate professor instructor assistant professor
full professor associate professor assistant professor instructor
assistant professor assistant professor associate professor assistant professor
full professor assistant professor assistant professor assistant professor
associate professor
Give a frequency distribution for these data.
Am. The academic ranks are tallied into the four possible categories and the results are shown in Table
2.13.
Table 2.13
Academic rank Fre uenc
Full professor
Associate professor
Assistant professor
Instructor
RELATIVEFREQUENCYOF A CATEGORY AND PERCENTAGE
2.3 Give the relative frequencies and percentages for the categories shown in Table 2.13.
24 ORGANIZINGDATA [CHAP. 2
Academic rank Relative frequency Percentage
Full professor .I2 12%
Associate professor .24 24%
Assistant professor .48 48%
Instructor .I6 16% .
Am. Each frequency in Table 2.13 is divided by 25 to obtain the relative frequencies for the categories.
The relative frequencies are then multiplied by 100 to obtain percentages. The results are shown in
Table 2.14.
~ ~~ ~
2.4 Refer to Table 2.14 to answer the following.
(a) What percent of the female faculty have a rank of associate professor or higher?
(h) What percent of the female faculty are not full professors?
(c) What percent of the female faculty are assistant or associate professors?
Ans. (a) 24% + 12%= 36% (11) 16%+- 48% + 24% = 88% (c) 48% + 24% = 72%
BAR GRAPHS AND PIE CHARTS
2.5 The subjects in an eating disorders research study were divided into one of three different
groups. Table 2.15 gives the frequency distribution for these three groups. Construct a bar graph.
Table 2.15
Anorexic
Control 20
AIIS. The bar graph for the distribution given in Table 2.15 is shown in Fig. 2-9.
Control
Anorexic
croup Bulimic 1 1
I 1 I I 1 I
0 10 20 30 40 50
Frequency
Fig. 2-9
2.6 Construct a pie chart for the frequency distribution given in Table 2.15.
Am. Table 2.16 illustrates the determination of the angles for each sector of the pie chart.
CHAP. 2) ORGANIZING DATA
k
Group Relative frequency Angle size
Bulimic .3 360 x .3 = 108"
Anorexic .5 360 x .5 = 180"
Control .2 360 x .2 = 72"
25
Ans. The pie chart for the distributiongiven in Table 2.15 is shown in Fig. 2-10.
Pie chart for eatingdisorder subiects
Fig. 2-10
2
.
7 A survey of 500 randomly chosen individuals is conducted. The individuals are asked to name
their favorite sport. The pie chart in Fig. 2-1 I summarizes the results of this survey.
Pie chart for favorite sport
Baseball (30.0%)
Basketball(20.0%)
Fig. 2-11
(a) How many individuals in the 500 gave baseball as their favorite sport?
(b) How many gave a sport other than basketball as their favorite sport?
(c) How many gave hockey or golf as their favorite sport?
Ans. (a) .3 x 500 = 150 (b) .8 x 500 = 400 (c) .2 x 500 = loo
26 ORGANIZINGDATA [CHAP. 2
Upper limit
189
209
229
249
269
FREQUENCY DISTRIBUTION FOR QUANTITATIVE DATA:
CLASS LIMITS, CLASS BOUNDARIES, CLASS MARKS, AND CLASS WIDTH
Lower Upper
boundary boundary Class mark
169.5 189.5 179.5
189.5 209.5 199.5
209.5 229.5 219.5
229.5 249.5 239.5
249.5 269.5 259.5
2.8 Table 2.17 gives the frequency distribution for the cholesterol values of 45 patients in a cardiac
rehabilitation study. Give the lower and upper class limits and boundaries as well as the class
marks for each class.
Class
170to 189
190to 209
210 to 229
230 to 249
250 to 269
Table 2.17
Lower limit
170
I90
2 10
230
250
Cholesterol value
170to 189
I90 to 209
210 to 229
230 to 249
250 to 269
Expenditure
0.5 to 0.9
1.oto 1.4
1.5 to 1.9
2.0 to 2.4
2.5 to 2.9
3.0to 3.4
Am. Table 2.18gives the limits, boundaries, and marks for the classes in Table 2.17.
Frequency
1
3
5
5
7
4
Table 2.18
2.9 The following data set gives the yearly food stamp expenditure in thousands of dollars for 25
households in Alcorn County:
2.3 1.9 1.1 3.2 2.7 1.5 0.7 2.5 2.5 3.1 2.5
2.0 2.7 1.9 2.2 1.2 1.3 1.7 2.9 3.0 3.2 1.7
2.2 2.7 2.0
Construct a frequency distribution consisting of six classes for this data set. Use 0.5 as the lower
limit for the first class and use a class width equal to 0.5.
Ans. The first class would extend from 0.5 to 0.9 since the desired lower limit is 0.5 and the desired class
width is 0.5. Note that the class boundaries are 0.45 and 0.95 and therefore the class width equals
0.95 - 0.45 or 0.5.The frequency distribution is shown in Table 2.19.
Table 2.19
2.10 Express the frequency distribution given in Table 2.19 using the “less than” form for the classes.
Am. The answer is shown in Table 2.20.
CHAP. 21
Expenditure
0.5 to less than 1.0
1.O to less than 1.S
1.S to less than 2.0
2.0 to less than 2.5
2.5 to less than 3.0
3.0 to less than 3.5
ORGANIZING DATA
Frequency
i
3
5
5
7
4
27
Gallons
0.000to 3.999
4.000 to 7.999
8.000 to 11.999
12.000to 15.999
16.000to 19.999
Frequency
3
8
3
20
14
2.11 The manager of a convenience store records the number of gallons of gasoline purchased for a
sample of customers chosen over a one-week period. Table 2.21 lists the raw data. Construct a
frequency distribution having five classes, each of width 4. Use 0.0oO as the lower limit of the
first class.
Table 2.21
Gallons
0 to less than 4
4 to less than 8
8 to less than 12
12 to less than 16
16to less than 20
Ans. When the data are tallied into the five classes, the frequency distribution shown in Table 2.22 is
obtained.
Frequency
3
8
3
20
14
2.12 Using the data given in Table 2.21, form a frequency distribution consisting of the classes 0 to
less than 4,4 to less than 8, 8 to less than 12, 12to less than 16, and 16 to less than 20. Compare
this frequency distribution with the one given in Table 2.22.
SINGLE-VALUED CLASSES
2.13 The Food Guide Pyramid divides food into the following six groups: Fats, Oils, and Sweets
Group; Milk, Yogurt, and Cheese Group; Vegetable Group; Bread, Cereal, Rice, and Pasta
Group; Meat, Poultry, Fish, Dry Beans, Eggs, and Nuts Group; Fruit Group. One question in a
28
Number of children
0
1
2
3
4
5
6
ORGANIZING DATA [CHAP. 2
Frequency
1
2
4
8
8
7
1
nutrition study asked the individuals in the study to give the number of groups included in their
daily meals. The results are given below:
6 4 5 4 4 3 4 5 5 5
6 5 4 3 6 6 6 5 2 3
4 5 6 4 5 5 5 6 5 6
5
Give a frequency distribution for these data.
Ans. A frequency distribution with single-valued classes is appropriate since only five unique values
occur. The frequency distribution is shown in Table 2.24.
Table 2.24
2 1
3
Number of food groups Frequency
5 12
2.14 A sociological study involving Mexican-American women utilized a 50-question survey. One
question concerned the number of children living at home. The data for this question are given
below:
5 2 3 0 4 6 2 1 1 2
3 3 4 4 5 5 5 3 3 3
3 4 4 4 5 5 2 3 4 4
5
Give a frequency distribution for these data.
Am. Since only a small number of unique values occur, the classes will be chosen to be single valued. The
frequency distribution is shown in Table 2.25.
HISTOGRAMS
2.15 Construct a histogram for the frequency distribution shown in Table 2.23.
Ans. The histogram for the frequency distribution in Table 2.23 is shown in Fig. 2- 12.
CHAP. 21
Beckmann-Beal score Frequency
0-7 5
8-15 15
16-23 20
24-3 1 30
32-39 50
40-47 30
ORGANIZING DATA 29
0 4 8 12 16 20
Gallons
Fig. 2-12
2.16 Construct a histogram for the frequency distribution given in Table 2.24.
Ans. The histogram for the frequency distribution in Table 2.24 is shown in Fig. 2-13.
I 1 1 I 1
2 3 4 5 6
Number of food goups
Fig. 2-13
CUMULATIVE FREQUENCY DISTRIBUTIONS
2.17 The Beckmann-Beal mathematics competency test is administered to 150 high school students
for an educational study. The test consists of 48 questions and the frequency distribution for the
scores is given in Table 2.26. Construct a cumulative frequency distribution for the scores.
Ans. The cumulative frequency distribution is shown in Table 2.27.
30
Scores less than
50
ORGANIZINGDATA
Cumulative frequency
0
[CHAP. 2
Cumulativerelative
Scores less than frequency Cumulative percentages
50 0.0 0.0%
60 0.143 14.3%
70 0.429 42.9%
80 0.857 85.7%
90 1.o 100.0%
i
Table 2.27
Scores less than I Cumulative freauencv
8
16
24
32
40
48
5
20
40
70
120
I50
2.18 Table 2.28 gives the cumulative frequency distribution of reading readiness scores for 35
kindergarten pupils.
(a) How many of the pupils scored 80 or higher?
(b) How many of the pupils scored 60 or higher but lower than 80?
(c) How many of the pupils scored 50 or higher?
(d)How many of the pupils scored 90 or lower?
Ans. (a) 35 - 3 0 = 5 (6) 30-5=25 (c) 35 (d) 35
CUMULATIVE RELATIVEFREQUENCY DISTRIBUTIONS
2.19 Give the cumulative relative frequencies and the cumulative percentages for the reading
readiness scores in Table 2.28.
Ans. The cumulative relative frequencies and cumulative percentages are shown in Table 2.29.
OGIVES
2.20 Table 2.30 gives the cumulative frequency distribution for the daily breast-milk production in
grams for 25 nursing mothers in a research study. Construct an ogive for this distribution.
CHAP. 21
Daily production
less than
500
550
600
650
700
750
ORGANIZINGDATA
Cumulative frequency
0
3
1 1
20
22
25
31
0
g 1.0-
%
9
)
9
)
>
&
3
2 0.5-
.C.(
Y
.CI
Y
p
U' 0.0-
500 550 600 650 700 750
Grams
Fig. 2-14
2.21 Construct an ogive curve for the cumulative relative frequency distribution that corresponds to
the cumulative frequency distributionin Table 2.30.
Ans. Each of the cumulative frequencies in Table 2.30 is divided by 25 and the cumulative relative
frequencies 0, .12, .44, 30, 3 8 , and 1.00 are determined. Using these, the cumulative relative
frequency distributionshown in Fig. 2-15 can be constructed.
500 550 600 650 700 750
Grams
Fig. 2-15
2.22 Construct an ogive curve for the cumulative percentage distribution that corresponds to the
cumulative frequency distribution in Table 2.30.
32
Stem
2
3
4
ORGANIZING DATA [CHAP. 2
Leaves
2 3 6 8 8 9 9
0 0 0 0 0 3 4 4 5 6 7 7 7 8 8 8
0 0 0 0 0 4 5
Am. The cumulative percentages are obtained by multiplying the cumulative relative frequencies given in
problem 2.21 by 100 to obtain 0%, 12%, 44%, 80%, 88%, and 100%.These percentages are then
used to construct the ogive shown in Fig. 2-16.
Stem
2
2
3
3
4
4 5
I I I 1 1 I
Leaves
2 3
6 8 8 9 9
0 0 0 0 0 3 4 4
5 6 7 7 7 8 8 8
0 0 0 0 0 4
500 550 600 650 700 750
Grams
Fig. 2-16
STEM-AND-LEAF DISPLAYS
2.23 The mathematical competency scores of 30junior high students participating in an educational
study are as follows:
30 35 28 44 33 22 40 38 37 36
28 29 30 30 40 30 34 37 38 40
38 34 40 37 23 26 30 45 29 40
Construct a stem-and-leaf display for these data. Use 2, 3, and 4 as your stems.
Ans. The stem-and-leaf display is shown in Fig. 2- 17.
Fig. 2-18
CHAP. 21 ORGANIZING DATA 33
2.25 The stem-and-leaf display shown in Fig. 2-19 gives the savings in 15 randomly selected
accounts. If the total amount in these 15 accounts equals $4340, find the value of the missing
leaf, x.
OS SO 65 90
102055 x 75
25 50 70
65
Fig. 2-19
Ails. Let A be the amount in the account with the missing leaf. Then the followingequation must hold.
(105 + 150 + 165+ 190+210 +220 +255 +A +275 + 325 + 350 + 370 +415 +480 + 565) = 4340
The solution to this equation is A = 265, which implies that x = 65.
Supplementary Problems
RAW DATA
2.26 Classify the data described in the followingscenarios as qualitative or quantitative. Classify the quantitative
data as either discrete or continuous.
The individuals in a sociological study are classified into one of five income classes as follows: low,
low to middle, middle, middle to upper, or upper.
The fasting blood sugar readings are determined for several individuals in a study involving diabetics.
The number of questions correctly answered on a 25-item test is recorded for each student in a
computer science class.
The number of attempts needed before successfully finding the path through a maze that leads to a
reward is recorded for several rats in a psychological study.
The race of each inmate is recorded for the individuals in a criminal justice study.
( a ) qualitative (6) quantitative, continuous (c) quantitative, discrete
(d) quantitative, discrete (e) qualitative
FREQUENCY DISTRIBUTION FOR QUALITATIVEDATA
2.27 The following responses were obtained when 50 randomly selected residents of a small city were asked the
question “How safe do you think your neighborhood is for kids?”
very very not sure not at all very not very not sure
very not sure somewhat not very very not at all not very
very very very very not very somewhat somewhat
very very not sure not very not at all not very very
very not sure very very not very very very
not very somewhat somewhat very somewhat very very
not very not at all very very very somewhat very
somewhat
Give a frequency distribution for these data.
34
Response
Very
Somewhat
Not very
Not at all
ORGANIZINGDATA
Relative frequency Percentage
.48 48%
.I6 16%
.I8 18%
.08 8%
[CHAP. 2
A m . See Table 2.31.
Table 2.31
Somewhat
Not very
Not at all
RELATIVE FREQUENCY OF A CATEGORY AND PERCENTAGE
2.28 Give the relative frequencies and percentages for the categories shown in Table 2.3I .
Am. See Table 2.32.
Table 2.32
I Not sure I .10 I 10% I
2.29 Refer to Table 2.32 to answer the following questions.
(a) What percent of the respondents have no opinion, i.e., responded not sure, on how safe the neighbor-
hood is for children‘?
(h) What percent of the respondents think the neighborhood is very or somewhat safe for children?
(c) What percent of the respondents give a response other than very safe?
Am. (a) 10% (h) 64% (c) 52%
BAR GRAPHS AND PIE CHARTS
2.30 Construct a bar graph for the frequency distribution in Table 2.3I .
2.31 Construct a pie chart for the frequency distribution given in Table 2.3I
2.32 The bar graph given Fig. 2-20 shows the distribution of responses of 300 individuals to the question “How
do you prefer to spend stressful times?”
(a) What percent preferred to spend time alone?
(6) What percent pave a response other than “with friends”?
(c.) How many individuals responded “with family,” “with friends,*’or “other”?
Ans. ( a ) SO% (6) 83.3% (c) IS0
CHAP. 21
_ I -
ORGANIZING DATA
Response time
5-9
10-14
15-19
20-24
25-29
30-34
35
Frequency
3
7
25
19
14
2
Class
5-9
10-14
15-19
20-24
25-29
30-34
With family With friends Other
Response
Fig. 2-20
Lower Upper limit
limit
5 9
10 14
15 19
20 24
25 29
30 34
FREQUENCY DISTRIBUTIONFOR QUANTITATIVEDATA:
CLASS LIMITS, CLASS BOUNDARIES, CLASS MARKS, AND CLASS WIDTH
Upper
boundary
9.5
14.5
19.5
24.5
29.5
34.5
2.33 Table 2.33 gives the distribution of response times in minutes for 91I emergency calls classified as
domestic disturbance calls. Give the lower and upper class limits and boundaries as well as the class marks
for each class. What is the class width for the distribution?
Class mark
7
12
17
22
27
32
Ans. See Table 2.34.
Tablc
The class width equals 5 minutes.
2
.
3
4
Lower
boundary
4.5
9.5
14.5
19.5
24.5
29.5
2.34 Express the distribution given in Table 2.33 in the “less than” form.
Ans. See Table 2.35.
36
Response time
5 to less than 10
10 to less than 15
15 to less than 20
20 to less than 25
25 to less than 30
30 to less than 35
ORGANIZING DATA
Frequency
3
7
25
19
14
2
[CHAP. 2
2.5
5.o
Table 2.35
5.O 8.5 5.5 10.5
5.5 7.O 5.O 10.0
6.5
1.5
2.35 Table 2.36 gives the response times in minutes for 50 randomly selected 91 1 emergency calls classified as
robbery in progress calls. Group the data into five classes, using 1.0to 2.9 as your first class.
6.0 7.0 6.0 10.0
7.5 7.0 6.5 4.5
Table 2.36
7.0
10.0
9.5
8.O
7.5
2.0 5.5 7.o 5.O
3.0 6.5 5.5 5.O
2.5 3.5 3.5 5.O
3.O 7.5 2.0 4.5
6.5 7.0 1.5 9.0
. Response tjme Frequency
1.0to 2.9 7
3.0 to 4.9 6
5.0 to 6.9 16
7.0to 8.9 14
9.0 to 10.9 7
Class
ato b
A m See Table 2.37.
Frequency
f l
2.37 Refer to the frequency distribution of response times to 91I robbery in progress calls given in Table 2.37 to
answer the following.
( a ) What percent of the response times are less than seven minutes'?
(h) What percent of the response times are equal to or greater than three minutes but less than seven
minutes?
( c ) What percent cif the response times are nine or more minutes in length?
Am. ((I) 58% (l?) 44%) (c) 14%
c to d
e to f
g to i
i to k
f2
f3
f4
f<
CHAP. 21 ORGANLZING DATA
Number of citations
10
11
14
16
17
20
37
Frequency
5
7
10
3
3
X
Ans. (a) (b +c)/2 and (d +e)/2 (b) (e + f)/2 (c) (i +j)/2 - (f + g)/2
(4 g (e) (fI + f2 + f3 + f4 + fs)
Response time less than
1.o
3.O
s.o
7.O
9.0
11.0
SINGLE-VALUED CLASSES
Cumulative frequency
0
7
13
29
43
so
2.39 A quality control technician records the number of defective items found in samples of size SO for each of
30 days. The data are as follows:
0 2 3 0 0 0 0 1 2 1
1 2 0 0 1 2 2 2 0 0
1 1 1 0 0 0 3 2 0 1
Give a frequency distribution for these data.
Am. See Table 2.39
Table 2.39
Number of defectives
2
2.40 The number of daily traffic citations issued over a 100-milesection of Interstate 80 is recorded for each day
of September. The frequency distribution for these data is shown in Table 2.40. Find the value for x.
Ans. x = 2
HISTOGRAMS
2.41 Construct a histogram for the response times frequency distribution given in Table 2.37.
2.42 Construct a histogram for the number of defectives frequency distribution given in Table 2.39.
CUMULATIVE FREQUENCY DISTRIBUTIONS
2.43 Give the cumulative frequency distribution for the frequencydistribution shown in Table 2.37
Ans. See Table 2.41
38 ORGANIZING DATA
c
Cumulative relative
Response time less than frequency Cumulative percentage
1.o 0.0 0%
3.O 0.14 14%
5.O 0.26 26%
7.0 0.58 58%
9.0 0.86 86%
11.0 1.o 100% *
[CHAP. 2
450
345
2.44 Refer to Table 2.4 1 to answer the following questions.
(a) How many of the response times are less than 5.0 minutes?
(6) How many of the response times are 7.0 or more minutes?
(c) How many of the response times are equal to or greater than 5.0 minutes but less than 9.0 minutes?
333 660 765 589
347 456 501 543
,
CUMULATIVE RELATIVE FREQUENCY DISTRIBUTIONS
678
345
189
2.45 Give the cumulative relative frequencies and the cumulative percentages for the cumulative frequency
distribution shown in Table 2.4 1.
543 234 597 521
780 435 456 678
567 675 525 453
Arm See Table 2.42.
500 304 805 586 267
456 435 346 514 508
654 564 225 524 465
324 475 505 707 556
700 125 578 286 531
i
OGIVES
2.46 Construct an ogive for the cumulative frequency distribution given in Table 2.4 1.
2.47 Construct an ogive for the cumulative relative frequencydistribution given in Table 2.42.
STEM-AND-LEAF DISPLAYS
2.48 The number of calls per 24-hour period to a 911 emergency number is recorded for 50 such periods. The
results are shown in Table 2.43. Construct a stem-and-leaf display for these data.
Table 2.43
Ans. See Fig. 2-2 I.
CHAP. 21
Leaves
25 89
25 34 67 86
04243345454647
35 35 50 53 56 56 56 65 75
0001 0508 1421242531434356646778868997
54 60 75 78 78
00 07 65 80
05 A
ORGANIZING DATA
Stem
3
4
5
6
7
8
9
39
Leaves
0 0 5 5 5 5
0 0 3 5 6 8 8
2 4 4 4 8 8 9 9 9
0 0 0 0 0 0 5 5 5 5 5 8 8 8
3 3 3 5 5
0 0 5
0 0
6
I:.
8
Stem
3
3
4
4
5
5
6
6
7
7
8
8 5
9
Leaves
0 0
5 5 5 5
0 0 3
5 6 8 8
2 4 4 4
8 8 9 9 9
0 0 0 0 0 0
5 5 5 5 5 8 8 8
3 3 3
5 5
0 0
0 0
Fig. 2-21
2.49 The number of syringes used per month by the patients of a diabetic specialist is recorded and the results
are given in Fig. 2-22. Answer the followingquestions by referring to Fig. 2-22.
(a) What is the minimum number of syringes used per month by these patients?
(b) What is the maximum number of syringes used per month by these patients?
(c) What is the total usage per month by these patients?
(4 What usage occurs most frequently?
Fig. 2-22
Ans. (a) 30 (b) 90 (c) 2700 (4 60
2.50 Refine the stem-and-leaf display in Fig.2-22 by using either the leaves 0, 1,2,3, or 4 or the leaves 5, 6,7,
8, or 9 on a particular row.
Ans. See Fig. 2-23.
~ ~
Fig. 2-23
Chapter 3
Descriptive Measures
MEASURES OF CENTRAL TENDENCY
Chapter 2 gives several techniques for organizing data. Bar graphs, pie charts, frequency
distributions, histograms, and stem-and-leaf plots are techniques for describing data. Often times, we
are interested in a typical numerical value to help us describe a data set. This typical value is often
called an average value or a measure o
f central tendency. We are looking for a single number that is
in some sense representative of the complete data set.
EXAMPLE 3.1 The following are examples of measures of central tendency: median priced home, average
cost of a new automobile, the average household income in the United States, and the modal number of
televisions per household. Each of these examples is a single number, which is intended to be typical of the
characteristic of interest.
MEAN, MEDIAN, AND MODE FOR UNGROUPED DATA
A data set consisting of the observations for some variable is referred to as raw data or
urtgroicpeddata. Data presented in the form of a frequency distribution are called grouped data. The
measures of central tendency discussed in this chapter will be described for both grouped and
ungrouped data since both forms of data occur frequently.
There are many different measures of central tendency. The three most widely used measures of
central tendency are the mean, median, and mode. These measures are defined for both samples and
populations.
The mean for a sampleconsistingof n observations is
- C X
x =-
n
and the mean for a population consistingof N observations is
C X
p-
N
EXAMPLE 3
.
2 The number of 911 emergency calls classified as domestic disturbance calls in a large metro-
politan location were sampled for thirty randomly selected 24 hour periods with the following results. Find the
mean number of calls per 24-hour period.
25 46 34 45 37 36 40 30 29 37 44 56 50 47 23
40 30 27 38 47 58 22 29 56 40 46 38 19 49 50
- c x 1168
x = - - - - - - 38.9
n 30
40
CHAP. 31 DESCRIPTIVE MEASURES 41
EXAMPLE 3.3 The total number of 911 emergency calls classified as domestic disturbance calls last year in a
large metropolitan location was 14,950. Find the mean number of such calls per 24-hour period if last year was
not a leap year.
Ex 14,950
N 365
p=----=41.0
-
The median of a set of data is a value that divides the bottom 50% of the data from the top 50%
of the data. To find the median of a data set, first arrange the data in increasing order. If the number
of observations is odd, the median is the number in the middle of the ordered list. If the number of
observations is even, the median is the mean of the two values closest to the middle of the ordered
list. There is no widely used symbol used to represent the median. Occasionally, the symbol X is
used to represent the sample median and the symbol j
Xis used to represent the population median.
EXAMPLE 3.4 To find the median number of domestic disturbance calls per 24-hour period for the data in
Example 3.I ,first arrange the data in increasingorder.
19 22 23 25 27 29 29 30 30 34 36 37 37 38 38
40 40 40 44 45 46 46 47 47 49 50 50 56 56 58
The two values closest to the middle are 38 and 40. The median is the mean of these two values or 39.
EXAMPLE 3.5 A bank auditor selects 11 checking accounts and records the amount in each of the accounts.
The 1 I observations in increasing order are as follows:
150.25 175.35 195.00 200.00 235.00 240.45 250.55 256.00 275.50 290.10 300.55
The median is 240.45 since this is the middle value in the ordered list.
The mode is the value in a data set that occurs the most often. If no such value exists, we say that
the data set has no mode. If two such values exist, we say the data set is bimodal. If three such values
exist, we say the data set is trimodal. There is no symbol that is used to represent the mode.
EXAMPLE 3.6 Find the mode for the data given in Example 3.2. Often it is helpful to arrange the data in
increasing order when finding the mode. The data, in increasing order, are given in Example 3.4. When the data
are examined, it is seen that 40 occurs three times, and that no other value occurs that often. The mode is equal
to 40.
For a large data set, as the number of classes is increased (and the width of the classes is
decreased), the histogram becomes a smooth curve. Oftentimes, the smooth curve assumes a shape
like that shown in Fig. 3-1. In this case, the data set is said to have a bell-shaped distribirtioit or a
mound-shaped distribution. For such a distribution,the mean, median, and mode are equal and they
are located at the center of the curve.
Fig. 3-1
42 DESCRIPTIVE MEASURES [CHAP. 3
L
Data set Distribution shape Mean Median Mode
1 Be11-shaped 15 15 15
2 Left-skewed 10 10.5 1s
3 Right-skewed 20.5 19.5 15
For a data set having a skewed to the right distribution, the mode is usually less than the median
which is usually less than the mean. For a data set having a skewed to the left distribution, the mean
is usually less than the median which is usually less than the mode.
EXAMPLE 3.7 Find the mean, median, and mode for the following three data sets and confirm the above
paragraph.
Dataset 1: 10, 12, 15, 15, 18, 20 Dataset 2: 2,4,6, 15, 15, I8 Data set 3: 12, 15, 15, 24, 26, 28
Table 3.1 gives the shape of the distribution, the mean, the median, and the mode for the three data sets.
MEASURES OF DISPERSION
In addition to measures of central tendency, it is desirable to have numerical values to describe
the spread or dispersion of a data set. Measures that describe the spread of a data set are called
measiires o
f dispersion.
EXAMPLE 3.8 Jon and Jack are two golfers who both average 85. However, Jon has shot as low as 75 and as
high as 99 whereas Jack has never shot below 80 nor higher than 90. When we say that Jack is a more consistent
golfer than Jon is, we mean that the spread in Jack’s scores is less than the spread in Jon’s scores. A measure of
dispersion is a numerical value that illustrates the differences in the spread of their scores.
RANGE, VARIANCE, AND STANDARD DEVIATION FOR UNGROUPED DATA
The range for a data set is equal to the maximum value in the data set minus the minimum value
in the data set. It is clear that the range is reflective of the spread in the data set since the difference
between the largest and the smallest value is directly related to the spread in the data.
EXAMPLE 3.9 Compare the range in golf scores for Jon and Jack in Example 3.8. The range for Jon is 99 -
75 = 24 and the range for Jack is 90 - 80 = 10.The spread in Jon’s scores, as measured by range, is over twice
the spread in Jack’s scores.
The variance and the standard deviation of a data set measures the spread of the data about the
mean of the data set. The variance of a sample of size n is represented by s2 and is given by
C(X--X)*
s2 =
n-1
and the variance of a population of size N is represented by d and is given by
(3.3)
The symbol o is the lowercase sigma of the Greek alphabet and o2is read as sigma squared.
CHAP. 31 DESCRIPTIVE MEASURES 43
EXAMPLE 3.10 The times required in minutes for five preschoolers to complete a task were 5 , 10, 15, 3, and
7. The mean time for the five preschoolers is 8 minutes. Table 3.2 illustrates the computation indicated by
formula (3.3). The first column lists the observations, x. The second column lists the deiiatiorzs from the niearr,
x - x . The third column lists the squares of the deviations. The sum at the bottom of the second column is
called the slim o
f the deviations, and is always equal to zero for any data set. The sum at the bottom of the third
column is referred to as the sum o
f the squares of the deviations. The sample variance is obtained by dividing
the sum of the squares of the deviations by n - I, or 5 -I = 4. The sample variance equals 88 divided by 4 which
is 22 minutes squared.
-
Table 3
.
2
I X
3 3-8=-5 (-S)2= 25
7 7 - 8 = - I (-I)2= 1
C(x-X)= 0 c(x-X)2
= 88
The variance of the data in Example 3.10 is 22 minutes squared. The units for the variance are
minutes squared since the terms which are added in column 3 are minutes squared. The square root
of the variance is called the standard deviation and the standard deviation is measured in the same
units as the variable. The standard deviation of the times to complete the task is or 4.7 minutes.
The sample standard deviation is
s = o (3.5)
and the populatioiz standard deviation is
(T=
Shortcut formulas equivalent to formulas (3.3)and (3.4)are useful in computing variances and
standard deviations. The shortcutformulas for computing sample and population variances are
2 n
s =
n-1
and
N
(3.6)
(3.7)
EXAMPLE 3.11 Formula (3.7)can be used to find the variance and standard deviation of the times given in
Example 3.10.The term Zx2is cailed the sum ofthe squares and is found as follows:
Cx2= S2 + 102+ 152+ 32+72=408
The term (CX)~
is referred to as the sum squared and is found as follows:
(CX)~
= (5 + 10+ 15 +3 + 7)*= 1600
The variance is given as follows:
1600
408--
408-320
5 -
- = 22
2
s =
4 4
44
and the standard deviation is
DESCRIPTIVE MEASURES [CHAP. 3
s = 422 =4.7
These are the same values we found in Example 3.10 for the variance and standard deviation of the times.
Since most populations are large, the computation of CT’
is rarely performed. In practice, the
population variance (or standard deviation) is usually estimated by taking a sample from the
population and using s2as a estimate of CT*.The use of n - 1 rather than n in the denominator of the
formula for s2enhances the ability of s’ to estimate d.
For data sets having a symmetric mound-shaped distribution, the standard deviation is
approximately equal to one-fourth of the range of the data set. This fact can be used to estimate s for
be1I-shaped distributions.
Statistical software packages are frequently used to compute the standard deviation as well as
many other statistical measures.
EXAMPLE 3.12 The costs for scientific calculators with comparable built-in functions were recorded for 20
different sales locations. The results were as follows:
10.50 12.75 11.00 16.50 19.30 20.00 16.50 13.90 17.50 18.00
13.50 17.75 18.50 20.00 15.00 14.45 17.85 15.00 17.50 13.50
The analysis of these data using Minitab is as follows.
MTB > name cl ‘cost’
MTB > set cl
DATA > 10.50 13.50 12.75 17.75 11.00 18.50 16.50 20.00 19.30
DATA> 15.00 20.00 14.45 16.50 17.85 13.90 15.00 17.50 17.50
DATA > 18.00 13.50
DATA > end
MTB > describe c1
Descriptive Statistics
Variable N Mean Median TrMean StDev SEMean
cost 20 15.950 16.500 16.028 2.826 0.632
Variable Min Max Q1 Q3
cost 10.500 20.000 13.600 17.962
The cost data are set into column c1, and the command describe cl produces 10different descriptive measures.
The student is encouraged to confirm the values for the mean, median, standard deviation, minimum, and
maximum. The other four measures are described elsewhere in the outline. Even though these data are not
symmetric mound shaped, a “ballpark” approximation to the standard deviation is obtained by dividing the
range by 4. One-fourth of the range is 2.375 and the value of the standard deviation is 2.826.
MEASURES OF CENTRAL TENDENCY AND DISPERSION
FOR GROUPED DATA
Statistical data are often given in grouped form, i.e., in the form of a frequency distribution, and
the raw data corresponding to the grouped data are not available or may be difficult to obtain. The
articles that appear in newspapers and professional journals do not give the raw data, but give the
CHAP. 31 DESCRIPTIVEMEASURES 45
Age
5-14
results in grouped form. Table 3.3 gives the frequency distribution of the ages of 5000 shoplifters in
a recent psychological study of these individuals.
Frequency
750
Table 3.3
15-24
25-34
45-54 100
The techniques for finding the mean, median, mode, range, variance and standard deviation of
grouped data will be illustrated by finding these measures for the data given in Table 3.3. The
formulas and techniques given will be for sample data. Similar formulas and techniques are used for
population data.
The meanfor grouped data is given by
where x represents the class marks, f represents the class frequencies, and n = Cf.
EXAMPLE 3.13 The class marks in Table 3.3 are xi = 9.5, x2 = 19.5, x3 = 29.5,
frequenciesare fl = 750, f2 = 2005, f3= 1950,f4= 195,and fs = 100.The sample size n is 5000. The mean is
= 39.5, x5 = 49.5 and the
- 9.5x 750+19.5x 2005+295x 1950+395x 195+495x 100 I 16,400
---
X = - - 23.3 years
5000 5000
The medianfor grouped data is found by locating the value that divides the data into two equal
parts. In finding the median for grouped data, it is assumed that the data in each class is uniformly
spread across the class.
EXAMPLE 3.14 The median age for the data in Table 3.3 is a value such that 2500 ages are less than the
value and 2500 are greater than the value. The median age must occur in the age group 15-24, since 750 are less
than 15 and 2755 are 24 years or less. The class 15-24 is called the median class since the median must fall in
this class. Since 750 are less than 15 years, there must be 1750additional ages in the class 15- 24 that are less
than the median. In other words, we need to go the fraction 1750/2005 across the class 15-24 to locate the
median. We give the value 14.5+ (1750/2OO5) x 10= 23.2 years as the median age. To summarize, 14.5 is the
lower boundary of the median class, 1750/2005is the fraction we must go across the median class to reach the
median, and 10 is the class width for the median class.
The modal class is defined to be the class with the maximum frequency. The mode for grouped
data is defined to be the class mark of the modal class.
EXAMPLE 3.15 The modal class for the distribution in Table 3.3 is the class 15-24. The mode is the class
mark for this class that equals 19.5 years.
The range for grouped data is given by the difference between the upper boundary of the class
having the largest values minus the lower boundary of the class having the smallest values.
EXAMPLE 3.16 The upper boundary for the class 45-54 is 54.5 and the lower boundary for the class 5-14 is
4.5, and the range is 54.5 - 4.5 = 50.0 years.
46 DESCRIPTIVEMEASURES
The variuncefor grouped data is given by
n-1
[CHAP. 3
(3.10)
and the standard deviation is given by
s = o (3.1I )
EXAMPLE 3.17 In order to find the variance and standard deviation for the distribution in Table 3.3 using
formulas (3.10)and (3.I I ) , we first evaluate cx2f and -
.
(Cxf)’
n
2x’f = 9S2x 750 + 19S2 x 2005 + 29S2x 1950+ 39S2x 195+49S2x 100= 3,076,350
(Cxf)2
From Example 3.13, we see that Cxf = I 16,400,and therefore -
= 2,709,792. The variance is
n
3,076,350- 2,709,792
4999
s = = 73.3
and the standard deviation is = 8.6 years.
CHEBYSHEV’STHEOREM
Cliebyshev ’s theorem provides a useful interpretation of the standard deviation. Chebyshev’s
theorem states that the fraction of any data set lying within k standard deviations of the mean is at
least 1 - - where k is a number greater than 1. The theorem applies to either a sample or a
1
k’ *
1 1 3
2 4 4
population. If k = 2, this theorem states that at least 1 - 7 = 1 - - = - or 75% of the data set will
8
9
fall between X-2s and X+2s. Similarly, for k = 3, the theorem states that at least - or 89% of the
data set will fall between X-3s and X+3s.
EXAMPLE 3.18 The reading readiness scores for a group of 4 and 5 year old children have a mean of 73.5
and a standard deviation equal to 5.5. At least 75% of the scores are between 73.5 - 2 x 5.5 = 62.5 and 73.5 + 2
x 5.5 = 84.5. At least 89% of the scores are between 73.5 - 3 x 5.5 = 57.0 and 73.5 + 3 x 5.5 = 90.0.
EMPIRICAL RULE
The empirical ride states that for a data set having a bell-shaped distribution, approximately 68%
of the observations lie within one standard deviation of the mean, approximately 95% of the
observations lie within two standard deviations of the mean, and approximately 99.7% of the
observations lie within three standard deviations of the mean. The empirical rule applies to either
large samples or populations.
EXAMPLE 3.19 Assuming the incomes for all single parent households last year had a bell-shaped distribu-
tion with a mean equal to $23,500 and a standard deviation equal to $4,500, the following conclusions follow:
68% of the incomes lie between $19,000 and $28,000, 95% of the incomes lie between $14,500 and $32,500,
CHAP. 31 DESCRIPTIVE MEASURES 47
and 99.7% of the incomes lie between $10,000 and $37,000. If the shape of the distribution is unknown, then
Chebyshev’s theorem does not give us any information about the percent of the distribution between $19,500
and $28,000. However, Chebyshev’s theorem assures us that at least 75% of the incomes are between $14,500
and $32,500 and at least 89% of the incomes are between $10,000and $37,000.
COEFFICIENTOF VARIATION
The coefficient of variation is equal to the standard deviation divided by the mean. The result is
usually multiplied by 100to express it as a percent. The coefficient of variation for a sample is given
by
S
X
cv = - x
- 100%
and the coefficient of variation for a population is given by
CL
cv=-xlOO%
0
(3.12)
(3.13)
The coefficient of variation is a measure of relative variation, whereas the standard deviation is a
measure of absolute variation.
EXAMPLE 3.20 A national sampling of prices for new and used cars found that the mean price for a new car
is $20,100 and the standard deviation is $6,125 and that the mean price for a used car is $5,485 with a standard
deviation equal to $2,730. In terms of absolute variation, the standard deviation of price for new cars is more
than twice that of used cars. However, in terms of relative variation, there is more relative variation in the price
of used cars than in new cars. The CV for used cars is e x loo= 49.8% and the CV for new cars is
5,485
x 100 = 30.5%.
20,100
2 SCORES
A z score is the number of standard deviations that a given observation, x, is below or above the
mean. For sample data, the z score is
-
x - x
z=-
S
and for population data, the z score is
z=- x-P
0
(3.14)
(3.15)
EXAMPLE 3.21 The mean salary for deputies in Douglas County is $27,500 and the standard deviation is
$4,500. The mean salary for deputies in Hall County is $24,250 and the standard deviation is $2,750. A deputy
who makes $30,000 in Douglas County makes $1,500 more than a deputy does in Hall County who makes
$28,500. Which deputy has the higher salary relative to the county in which he works?
For the deputy in Douglas County who makes $30,000,the z score is
48
3.0
3.3
3.5
3.5
3.6
4.0
4.0
4.2
DESCRIPTIVE MEASURES
5.o 6.2 7.6 9.4
5.2 6.3 7.6 9.S
5.5 6.4 7.7 9.5
5.5 6.6 7.8 10.0
5.5 6.6 7.8 10.5
5.8 6.8 8.S 10.8
5.8 6.8 8.5 10.9
5.9 6.8 8.8 11.0
[CHAP. 3
X-p 30,000-27,500
z=- -
- = 0.S6
0 4,500
For the deputy in Hall County who makes $28.500, the z score is
x -CL 28,500- 24,250
z=-- - = 1.55
0 2,750
When the county of employment is taken into consideration, the $28,500 salary is a higher relative salary than
the $30,000salary.
MEASURESOF POSITION: PERCENTILES,DECILES, AND QUARTILES
Measures o
f position are used to describe the location of a particular observation in relation to
the rest of the data set. Percerrtiles are values that divide the ranked data set into 100equal parts. The
pth percentile of a data set is a value such that at least p percent of the observations take on this value
or less and at least (100 - p) percent of the observations take on this value or more. Decile.7 are
values that divide the ranked data set into 10equal parts. Qicartilvs are values that divide the ranked
data set into four equal parts. The techniques for finding the various measures of position will be
illustrated by using the data in Table 3.4. Table 3.4 contains the aortic diameters measured in
centimeters for 45 patients. Notice that the data in Table 3.4 are already ranked. Raw data need to be
ranked prior to finding measures of position.
Table 3.4
I
4.6 I 6.0 I 7.0 1 8.8 I 11.0
The perreittilefor observutiori x is found by dividing the number of observations less than x by
the total number of observations and then multiplying this quantity by 100. This percent is then
rounded to the nearest whole number to give the percentile for observation x.
EXAMPLE 3.22 The number of observations in Table 3.4 less than 5.5 is 1 1. Eleven divided by 45 is ,244 and
.244 multiplied by 100 is 24.4%. This percent rounds to 24. The diameter 5.5 is the 24th percentile and we
express this as Pz4= 5.5. The number of observations less than 5.0 is 9. Nine divided by 45 is .20 and .20
multiplied by 100 is 20%. P3()= 5.0. The number of observations less than 10.0is 39. Thirty-nine divided by 45
is 367 and .867 multiplied by 100 is 86.7%. Since 86.7% rounds to 87%. we write Px7= 10.0
The prh perceiztile for a ranked data set consisting of n observations is found by a two-step
procedure. The first step is to compute index i = -
(p)(n).If i is not an integer, the next integer greater
I 00
than i locates the position of the pth percentile in the ranked data set. If i is an integer, the pth
percentile is the average of the observations in positions i and i + I in the ranked data set.
CHAP. 31 DESCRIPTIVE MEASURES 49
(1W45)
100
EXAMPLE 3.23 To find the tenth percentile for the data of Table 3.4, compute i = ~ = 4.5. The next
integer greater than 4.5 is 5. The observation in the fifth position in Table 3.4 is 3.6. Therefore, Plo = 3.6. Note
that at least 10% of the data in Table 3.4 are 3.6 or less (the actual aniount is I I , 1%) and at least 90% of the
data are 3.6 or more (the actual amount is 91.1%). For very large data sets, the percentage of observations equal
to or less than Plowill be very close tu 105T and the percentage of observations equal to or greater than PI0 will
be very close to 90%.
(40x45)
100
EXAMPLE 3.24 To find the tortieth percentile for the data in Table 3.4, compute i = -= 18. The
fortieth percentile is the average of the observations in the 18th and 19th positions in the ranked data set. The
observation in the 18th position is 6.0 and the observation in the 19th position is 6.2.Therefore Pj0 =-
-
6.1. Note that 40% of the data in Table 3.4 are 6.1 or less and that 60% of the observations are 6.I or more.
-
6.0+6.2
2
Deciles and quartiles are determined in the same manner as percentiles, since they may be
expressed as percentiles. The deciles are represented as DI, Dz, . . . , D
Y and the quartiles are repre-
sented by QI ,Q2 ,and Q3. The following equalities hold for deciles and percentiles:
The following equalities hold for quartiles and percentiles:
From the above definitions of percentiles, deciles, and quartiles, the following equalities also hold:
Median = P
5
o = D5 = Q2
The techniques for finding percentiles, deciles, and quartiles differ somewhat from textbook to
textbook, but the values obtained by the various techniques are usually very close to one another.
INTERQUARTILE RANGE
The irirerqirartile rtirige, designated by IQR, is defined as follows:
IQR = 4
3 - QI (3.16)
The interquartile range shows the spread of the middle 50% of the data and is not affected by
extremes in the data set.
EXAMPLE 3.25 The interquartile range for the aortic diameters in Table 3.4 is found by subtracting the value
of QI from Q3 . The first quartilc is equal to the 25th percentile and is found by noting that -= 11.25
and therefore i = 12. QI is in the 12th position in Table 3.4 and Q1 = 5.5. The third quartile is equal to the 75th
45 x 75
percentile and is found by noting that -= 33.75 and therefore i = 34. Q1is in the 34th position in Table
I 00
3.4 and Q3= 8.5. The IQR equals 8.5 -5.5 or 3.0 cm.
45 x 25
100
50 DESCRIPTIVE MEASURES
10.5
-2.5
14.0
[CHAP. 3
12.5 14.5 22.0 12.5
20.2 3.5 7.5 14.5
17.5 14.0 12.0 17.0
BOX-AND-WHISKER PLOT
20.3
5.5
4.O
A bo.r-atzd-whiskerplot, sometimes simply called a boxplot is a graphical display in which a box
extending from Q
1 to QJ is constructed and which contains the middle 50% of the data. Lines, called
whiskers, are drawn from Qi to the smallest value and from Q3 to the largest value. In addition, a
vertical line is constructed inside the box corresponding to the median.
27.5 22.5 10.5 40.0
12.7 35.5 38.0 10.5
-5.5 19.0 14.5 10.5
EXAMPLE 3.26 A Minitab generated boxplot for the data in Table 3.4 is shown in Fig. 3-2. For these data,
the minimum diameter is 3.0 cm, the maximum diameter is 11.0 cm., QI = 5.5 cm, Q2 = 6.6 cm, and Q3= 8.5
cm. Because Minitab uses a slightly different technique for finding the first and third quartile, the box extends
from 5.35 to 8.65. rather than from 5.5 to 8.5.
L I I
I 1 1 I I 1 I I 1
3 4 5 6 7 8 9 1 0 1 1
Diameter (cm)
Fig. 3-2
Another type boxplot, called a tnodifed huxplot, is also sometimes constructed in which possible
and probable outliers are identified. The modified boxplot is illustrated in problem 3.27.
Solved Problems
MEASURES OF CENTRAL TENDENCY: MEAN, MEDIAN, AND MODE
FOR UNGROUPED DATA
Table 3
.
5
Table 3.5 gives the annual returns for 30 randomly selected mutual funds. Problems 3.I , 3.2, and
3.3 refer to this data set.
3.1 Find the mean for the annual returns in Table 3.5.
Am. For the data in Table 3.5, Cx = 455.20, n = 30, and X = 15.17.
3
.
2 Find the median for the annual returns in Table 3.5.
CHAP. 31 DESCRIPTIVEMEASURES 51
60.5
75.O
100.0
89.0
50.0
Ans. The ranked annual returns are as follows:
113.5 79.0 475.5
70.0 122.5 150.0
125.5 90.0 175.5
130.0 111.5 100.0
340.5 100.0 525.0
-5.5 -2.5 3.5 4.0 5.5 7.5 10.5 10.5 10.5 10.5 12.0
12.5 12.5 12.7 14.0 14.0 14.5 14.5 14.5 17.0 17.5 19.0
20.2 20.3 22.0 22.5 27.5 35.5 38.0 40.0
The median is the average of the 15th and 16th values in the ranked returns or the average of 14.0
and 14.0, which equals 14.0.
3.3 Find the mode for the annual returns in Table 3.5.
Am. By considering the ranked annual returns in the solution to problem 3.2, we see that the observa-
tion 10.5 occurs more frequently than any other value and is therefore the mode for this data set.
3.4 Table 3.6 gives the distribution of the cause of death due to accidents or violence for white
males during a recent year.
Table 3.6
All other accidents 27,500
Suicide 20,234
What is the modal cause of death due to accidents or violence for white males? Can the mean
or median be calculated for the cause of death?
Am. The modal cause of death due to accidents or violence is motor vehicle accident. Because this is
nominal level data, the mean and the median have no meaning.
3
.
5 Table 3.7 gives the selling prices in tens of thousands of dollars for 20 homes sold during the
past month. Find the mean, median, and mode. Which measure is most representative for the
selling price of such homes?
Table 3.7
A m The ranked selling prices are:
50.0 60.5 70.0 75.0 79.0 89.0 90.0 100.0 100.0 100.0
I 1 1.5 113.5 122.5 125.5 130.0 150.0 175.5 340.5 475.5 525.0
The mean is 154.1, the median is 105.7, and the mode is 100.0. The median is the most
representative measure. The three selling prices 340.5,475.5, and 525.0 inflate the mean and make
it less representative than the median. Generally, the median is the best measure of central
tendency to use when the data are skewed.
52 DESCRIPTIVEMEASURES
Age
25
30
40
53
29
45
40
55
35
47
[CHAP. 3
Income
23,500
25,000
30,000
47,500
32,000
37,500
32,000
50,500
40,000
43,750
MEASURES OF DISPERSION: RANGE, VARIANCE, AND STANDARD
DEVIATION FOR UNGROUPED DATA
3.6 Find the range, variance, and standard deviation for the annual returns of the mutual funds in
Table 3.5.
Ans. The range is 40.0 - (-5.5) =45.5.
(455.2)*
10.060-
(Cx)2
c x 2--
30
= 108.7275and the standard
The variance is given by s2 = n --
n-1 29
deviation is s = 0= J m -= 10.43.
3
.
7 Find the range, variance,and standard deviation for the selling prices of homes in Table 3.7.
Ans. The range is 525 - 50 =475.
(3083)
812,884-----
(Cx)2
zx2--
The variance is given by s2 = n -
- 2o = 17,770.5026 and the
standard deviation is s = o=
4
- = 133.31.
n-1 19
range
4
3.8 Compare the values of the standard deviations with -for problems 3.6 and 3.7.
Ans. For problem 3.6, the values are 10.43 and 4 5 3 4 = 11.38 and for problem 3.7, the values are
133.31 and 47Y4 = 118.75. Even though the distributions are skewed in both problems, the
values are reasonably close. The approximation is closer for mound-shaped distributions than for
other distributions.
3.9 What are the chief advantage and the chief disadvantage of the range as a measure of
dispersion?
Am. The chief advantage is the simplicity of computationof the range and the chief disadvantageis that
it is insensitive to the values between the extremes.
3
.
1
0 The ages and incomes of the 10employeesat ComputerServices Inc. are given in Table 3.8.
Table 3.8
CHAP. 31 DESCRIPTIVE MEASURES 53
Compute the standard deviation of ages and incomes for these employees. Assuming that all
employees remain with the company 5 years and that each income is multiplied by 1.5 over
that period, what will the standard deviation of ages and incomes equal 5 years in the future?
Ans. The current standard deviations are 10.21 years and $9,224.10. Five years from now, the standard
deviations will equal 10.21 years and $13,836.15. In general, adding the same constant to each
observation does not affect the standard deviation of the data set and multiplying each observation
by the same constant multiplies the standard deviation by the constant.
MEASURES OF CENTRAL TENDENCY AND DISPERSION FOR GROUPED
DATA
3.11 Table 3.9 gives the age distribution of individuals starting new companies. Find the mean,
median, and mode for this distribution.
Table 3.9
20-29
30-39
40-49
50-59
60-69
Ans. The mean is found by dividing Xxf by n, where Cxf = 24.5 x 11 + 34.5 x 25 +44.5 x 14 + 54.5 x
7 + 64.5 x 3 = 2,330 and n = 60. X=38.8. The median class is the class 30-39. In order to find
the middle of the age distribution, that is, the age where 30 are younger than this age and 30 are
older, we must proceed through the 11 individuals in the 20-29 age group and 19 in the 30-39 age
group. This gives 29.5 + - x 10= 37.1 as the median age. The modal class is the class 30-39,
and the mode is the class mark for this class that equals 34.5.
19
25
3.12 Find the range, variance, and standard deviation for the distribution in Table 3.9.
Ans. The range is 69.5 - 19.5= 50.
! (2,330)*
deviation is s = 4116.497 = 10.8.
3.13 The raw data corresponding to the grouped data in Table 3.9 is given in Table 3.10. Find the
mean, median, and mode for the raw data and compare the results with the mean, median, and
mode for the grouped data found in problem 3.11.
54 DESCRIPTIVE MEASURES
Table 3.10
24 44
24 34 45
40 46
28 33 37 41 47 66
[CHAP. 3
Ans. The sum of the raw data equals 2,299, and the mean is 38.2. This compares with 38.8 for the
grouped data. The median is seen to be 37, and this compares with 37.1 for the grouped data. The
mode is seen to be 34 and this compares with 34.5 for the grouped data. This problem illustrates
that the measures of central tendency for a set of data in grouped and ungrouped form are
relatively close.
3.14 Find the range, variance, and standard deviation for the ungrouped data in Table 3.10 and
compare these with the same measures found for the grouped form in problem 3.12.
Ans. The range is 66 - 20 = 46. Cx = 2,299, Zcx2= 94,685 and n = 60.
(2,299)*
94,685-
2 ( C X > *
c x --
The variance is s2= n -- 6o = 111.788and s = 10.6.
n-1 59
These measures of dispersion compare favorably with the measures for the grouped data given in
problem 3.12.
CHEBYSHEV’S THEOREM AND THE EMPIRICAL RULE
3.15 The mean lifetime of rats used in many psychological experiments equals 3.5 years, and the
standard deviation of lifetimes is 0.5 year. At least what percent will have lifetimes between
2.5 years and 4.5 years? At least what percent will have lifetimes between 2.0 years and 5.0
years?
Ans. The interval from 2.5 years to 4.5 years is a 2 standard deviation interval about the mean, i.e., k = 2
in Chebyshev’s theorem. At least 75% of the rats will have lifetimes between 2.5 years and 4.5
years. The interval fiom 2.0 years to 5.0 years is a 3 standard deviation interval about the mean. At
least 89% of the lifetimes fall within this interval.
3.16 The mean height of adult females is 66 inches and the standard deviation is 2.5 inches. The
distribution of heights is mound-shaped. What percent have heights between: (a) 63.5 inches
and 68.5 inches? (b) 61.O inches and 71.O inches? (c) 58.5 inches and 73.5 inches?
Am. 63.5 to 68.5 is a one standard deviation interval about the mean, 61.0 to 71.0 is a two standard
deviation interval about the mean, and 58.5 to 73.5 is a three standard deviation interval about the
mean. According to the empirical rule, the percentages are: (a)68%; (b) 95% and (c)99.7%.
CHAP. 31 DESCRIPTIVE MEASURES 55
3.17 The mean length of service for Federal Bureau of Investigation (FBI) agents equals 9.5 years
and the standard deviation is 2.5 years, At least what percent of the employees have between
2.0 years of service and 17.0 years of service? If the lengths of service have a bell-shaped
distribution, what can you say about the percent having between 2.0 and 17.0years of service?
Ans. The interval from 2.0years to 17.0years is a 3 standard deviation interval about the mean, i.e., k =
3 in Chebyshev’s theorem. Therefore, at least 89% of the agents have lengths of service in this
interval. If we know that the distribution is bell shaped, then 99.7%of the agents will have lengths
of service between 2.0 and 17.0years.
COEFFICIENTOF VARIATION
3.18
3.19
Find the coefficient of variation for the ages in Table 3.10.
Ans. From problem 3.13, the mean age is 38.2 years and from problem 3.14 the standard deviation is
S 10.6
X 38.2
10.6years. Using formula (3.12), the coefficient of variation is CV = x 100% = -x loo=
27.7%.
The mean yearly salary of all the employees at Pretty Printing is $42,500 and the standard
deviation is $4,000. The mean number of years of education for the employees is 16 and the
standard deviation is 2.5 years. Which of the two variables has the higher relative variation?
4,000
42,500
Atis. The coefficient of variation for salaries is CV = -X
2.5
16
tion for years of education is CV = -x 100 = 15.6%.
variation.
2 SCORES
3.20
3.21
100= 9.4% and the coefficient of varia-
Years of education has a higher relative
The mean daily intake of protein for a group of individuals is 80 grams and the standard
deviation is 8 grams. Find the z scores for individuals with the following daily intakes of
protein: (a)95 grams; (b)75 grams; (c)80 grams.
95 -80 75-80 80-80
Ans. (a) z = -- - 1.88 (6) z = -
=-.63 (c) z = -
-
- 0
8 8 8
Three individuals were selected from the group described in problem 3.20 who have daily
intakes with z scores equal to -1.4,0.5, and 3.0. Find their daily intakes of protein.
-
x - x
S
A m . If the equation z = -is solved for x, the result is x = X +zs. The daily intake corresponding to
a z score of -I .4 is x = 80 +(-1.4)(8) = 68.8 grams. For a zscore equal to 0.5, x = 80 + (0.5)(8)=
84 grams. For a z score equal to 3.0, x = 80 +(3.0)(8)= 104 grams.
MEASURES OF POSITION:PERCENTILES, DECILES, AND QUARTILES
3.22 Find the percentiles for the ages 34,45, and 55 in Table 3.10.
56 DESCRIPTIVEMEASURES [CHAP. 3
20
60
Ans. The number of ages less than 3
4 is 20,and - x 100 = 33.396,which rounds to 33%. The age 34
is the thirty-third percentile.
The number of ages less than 45 is 45,and - x 100 = 75%. The age 45 is the seventy-fifth
percentile.
The number of ages less than 55 is 53,and - x 100 = 88.3%, which rounds to 88%. The age 55
is the eighty-eighth percentile.
45
60
53
60
3.23 Find the ninety-fifth percentile, the seventh decile, and the first quartile for the age distribution
given in Table 3.10.
- 57.Pg5 is the average of the observations in positions
np (60)(95)
Ans. To find Pgs, compute i = --
- -
-
100 100
57and 58 in the ranked data set, or the average of 58 and 62which is 60years.
= 42. D
7 is the average of the
observations in positions 42 and 43 in the ranked data set, or the average of 4
1 and 44 which is
42.5years.
np (60)(70)
To find D7,which is the same as Pt0, compute i = --
-
100 100
np (60)(25)
To find QI ,compute i = --
- -
= 15.QI is the average of the observations in positions
100 loo
15 and 1
6 in the ranked data set, or the average of 32 and 32which is 32 years.
INTERQUARTILERANGE
3.24 Find the interquartile range for the annual returns of the mutual funds given in Table 3.5.
Ans. The ranked annual returns are as follows.
-5.5 -2.5 3.5 4.0 5.5 7
.
5 10.5 1
0
.
5 10.5 1
0
.
5 12.0
12.5 12.5 12.7 14.0 14.0 1
4
.
5 14.5 1
4
.
5 17.0 17.5 19.0
20.2 20.3 22.0 22.5 27.5 35.5 38.0 40.0
= 7.5.
The next integer greater than 7.5 is 8,and this locates the position of Q,in the ranked data set.
np (30)(25)
The first quartile, which is the same as PtS ,is found by computing i = --
-
loo 100
QI = 10.5.
np (30)(75)
The third quartile is found by computing i = --
--
= 22.5,
and rounding this to 23.
100 loo
The third quartile is found in position 23 in the ranked data set. 4
3= 20.2
IQR = 4 3 - QI = 20.2- 1
0
.
5= 9.7.
3
.
2
5 Find the interquartile range for the selling prices given in Table 3.7.
Ans. The ranked selling prices are:
50.0 60.5 70.0 75.0 79.0 89.0 90.0 100.0 100.0 100.0
111.5 113.5 122.5 125.5 130.0 150.0 175.5 340.5 475.5 525.0
= 5. The
np (20)(25)
The first quartile, which is the same as P25 is found by computing i = --
-
1
0
0 100
first quartile is the average of the observations in positions 5 and 6 in the ranked data set.
CHAP. 31 DESCRIPTIVEMEASURES
35
40
60
55
56
57
25 90 60 45
58 90 90 55
55 80 90 60
60 85 75 60
55 75 80 90
79.0+89.0
2
= 84.0
QI=
The third quartile is found by computing i = - -
- ~ = 15. The third quartile is the
"P (20)(75)
100 100
averageof the observationsin positions 15 and 16in the ranked data set.
130.0+ 150.0 = 140.0
IQR = Q3- QI = 140.0- 84.0 = 56.0
2
4 3 =
BOX-AND-WHISKER PLOT
3.26 Table 3.11 gives the number of days that 25 individuals spent in house arrest in a criminal
justice study.Use Minitabto constructaboxplot for these data.
Table 3.11
Ans. The solution is shown in Fig. 3-3. It is seen that the minimum value is 25, the maximum value is
90, QI= 55, median = 60, and QJ = 83.
I I I I I I I 1
20 30 40 50 60 70 80 90
Days
Fig. 3-3
3.27 Construct a modifiedboxplot for the sellingprices given in Table 3.7.
Ans. The ranked selling prices are:
50.0 60.5 70.0 75.0 79.0 89.0 90.0 100.0 100.0 100.0
111.5 113.5 122.5 125.5 130.0 150.0 175.5 340.5 475.5 525.0
In problem 3.25 it is shown that QI= 84.0, Q3= 140.0,and IQR = 56.0.
A lower innerfence is defined to be QI- 1.5 x IQR = 84.0 - 1.5x 56.0 = 0.0
An upper innerfence is defined to be Q3+ 1.5 x IQR = 140.0+ 1.5 x 56.0 = 224.0
A lower outerfence is defined to be Q, -3 x IQR = 84.0- 3 x 56.0 = -84.0
An upper outerfence is defined to be Q3+ 3 x IQR = 140.0+ 3 x 56.0 = 308.0
The adjacent values are the most extreme values still lying within the inner fences. The
adjacent values for the above data are 50.0 and 175.5, since they are the most extreme values
between 0.0 and 224.0. In a modified boxplot, the whiskersextend only to the adjacent values.
DESCRIPTIVE MEASURES [CHAP. 3
t I
340 375 450 580 565
445 675 545 380 630
500 350 635 495 640
467 560 680 605 590
635 625 440 420 400
Data values that lie between the inner and outer fences are possible outliers and data values
that lie outside the outer fences are probable outliers. The observations340.5,
475.5,
and 525.0lie
outside the outer fences and are called probable outliers.
A Minitab printout for a boxplot of the data is given in Fig. 3-4.Each probable outlier is
represented by an asterisk.
I I 1 I 1 1
0 100 200 300 400 500
Sellingprice
Fig. 3-4
SupplementaryProblems
MEASURES OF CENTRAL TENDENCY: MEAN, MEDIAN, AND MODE FOR UNGROUPED DATA
3
.
2
8 Table 3.12gives the verbal scores for 25 individuals on the Scholastic Aptitude Test (SAT). Find the
mean verbal score for this set of data.
Ans. Cx = 13,027 51= 521.1
3
.
2
9 Find the median verbal scorefor the data in Table 3.12.
Ans. The ranked scores are shown below.
340 350 375 380 400 420 440 445 450 467 495 500 545
560 565 580 590 605 625 630 635 635 640 675 680
The median verbal score is 545.
3.30 Find the mode for the verbal scores in Table 3.12.
Am. 635
3.31 Give the output produced by the Describe command of Minitab when the data in Table 3.12 are analyzed.
Ans. Descriptive Statistics
Variable N Mean Median TrMean StDev SEMean
SAT 25 521.1 545.0 522.0 1
0
8
.
1 21.6
CHAP. 31 DESCRIPTIVEMEASURES 59
Variable Min Max Q1 Q3
SAT 340.0 680.0 430.0 627.5
3.32 Table 3.13 gives a stem-and-leaf display for the number of hours per week spent watching TV for a group
of teenagers. Find the mean, median, and mode for this distribution. What is the shape of the distribution'?
Table 3.13
1
1
0 0 0 0 0 0 0 0 0 0 0 0 5 5 5
Am. The mean, median, and mode are each equal to 20 hours. The distribution has a bell shape with
center at 20.
MEASURES OF DISPERSION: RANGE, VARIANCE, AND STANDARD DEVIATION
FOR UNGROUPED DATA
3.33
3.34
3.35
3.36
3.37
Find the range, variance, and standard deviation for the SAT scores in Table 3.12.
Am. range = 340 s2= I 1,692.91 s = 108.13
Find the range, variance, and standard deviation for the number of hours spent watching TV given in
Table 3.13.
Aim range = 20 s2= 18.4213 s = 4.29
What should your response be if you find that the variance of a data set equals -5.5'?
Ans.
Consider a data set in which all observations are equal. Find the range, variance, and standard deviation
for this data set.
Airs.
A data set consisting of 10observations has a mean equal to 0, and a variance equal to a. Express Ex2 in
terms of a.
Airs. Ex2= 9a
Check your calculations, since the variance can never be negative
The range, variance, and standard deviation will all equal zero.
MEASURES OF CENTRAL TENDENCY AND DISPERSION FOR GROUPED DATA
3.38 Table 3.14 gives the distribution of the words per minute for 60 individuals using a word processor. Find
the mean, median, and mode for this distribution.
Table 3.14
Words per minute I Frequency I
40-49
50-59
60-69
70-79
I 80-89
I 90-99
3
9
15
15
12
6
60 DESCRIPTIVE MEASURES [CHAP. 3
Ans.
3.39 Find the range, variance, and standard deviation for the distribution in Table 3.14.
mean = 71 .5, median = 7I .5, modes = 64.5 and 74.5
Am. range = 60 s2= 184.0678 s = 13.57
3.40 A quality control technician records the number of defective units found daily in samples of size 100 for
the month of July. The distribution of the number of defectives per 100 units is shown in Table 3.15.
Find the mean, median, and mode for this distribution. Which of the three measures is the most
representative for the distribution'?
Table 3.15
Number of defectives
27
A m . mean = 2 median = 0 mode = 0
The median or mode would be more representative than the mean. On most days, there were none
or one defective in the sample. The two days on which the process was out of control inflated the
mean.
3.41 The following formula is sometimes used for finding the median of grouped data.
5n-c
f
Median = L + -
x ( U - L )
In this formula, L is the lower boundary of the median class, U is the upper boundary of the median class,
n is the nurnber of observations, f is the frequency of the median class, and c is the cumulative frequency
of the class proceeding the median class. Give the values for L, U, n, f, c, and find the median for the
distribution given in Table 3.14.
A ~ s . L = 69.5 U = 79.5n = 60 f = 15 c = 27 median = 71.5
CHEBYSHEV'S THEOREM AND THE EMPIRICAL RULE
3.42
3.43
3.44
A sociological study of gang members in a large midwestern city found the mean age of the gang
members in the study to be 14.5 years and the standard deviation to be 1.5 years. According to
Chebyshev's theorem, at least what percent will be between 10.0and 19.0years of age?
Ans. 89%
A psychological study of Alcoholic Anonymous members found the mean nurnber of years without
drinking alcohol for individuals in the study to be 5.5 years and the standard deviation to be 1.5 years.
The distribution of the number of years without drinking is bell-shaped. What percent of the distribution
is between: ((I) 4.0 and 7.0 years; (b)2.5 and 8.5 years; (c) 1.0and 10.0years?
Am. ( a )68% (h)95% (c) 99.7%
The ranked verbal SAT scores in Table 3.12 are:
340 350 375 380 400 420 440 445 450 467 495 500 545
560 565 580 590 605 625 630 635 635 640 675 680
CHAP. 31 DESCRIPTIVE MEASURES 61
The mean and standard deviation of these data are 521.1 and 108.1, respectively. According to
Chebyshev’s theorem, at least 75% of the observations are between 304.9 and 737.3. What is the actual
percent of observations within two standard deviations of the mean?
Ans. 100%
COEFFICIENT OF VARIATION
3.45
3.46
The verbal scores on the SAT given in Table 3.12 have a mean equal to 52 1.1 and a standard deviation
equal to 108.1. Find the coefficient of variation for these scores.
Ans. 20.7%
Fastners Inc. produces nuts and bolts. One of their bolts has a mean length of 2.00 inches with a standard
deviation equal to 0.10 inch, and another type bolt has a mean length of 0.25 inch. What standard
deviation would the second type bolt need to have in order that both types of bolts have the same
coefficient of variation?
Am. 0.0125 inch
Z SCORES
3.47 The low-density lipoprotein (LDL) cholesterol concentration for a group has a mean equal to 140 mg/dL
and a standard deviation equal to 40 mg/dL. Find the z scores for individuals having LDL values of (a)
1IS; (b) 140; and (c) 200.
AILS. (a) -0.63 (b) 0.0 (c) 1.50
3.48 Three individuals from the group described in problem 3.47 have z scores equal to (a)-1.75; (b)0.5; and
(c) 2.0. Find their LDL values.
Ans. (a) 70 (6) 160 (c) 220
MEASURES OF POSITION: PERCENTILES, DECILES, AND QUARTILES
3.49 Table 3.16 gives the ages of commercial aircraft randomly selected from several airlines. Find the
percentiles for the ages 10, 15,and 20.
Table 3.16
Ans. The age 10 is the thirtieth percentile. The age 15 is the fifty-eighth percentile. The age 20 is the
eighty-fourth percentile.
3.50 Find Pm, Dg, and Q 3 for the commercial aircraft ages in Table 3.16.
62 DESCRIPTIVE MEASURES [CHAP. 3
INTERQUARTILE RANGE
3.51 'The first quartile for the salaries ofcounty sheriffs in the United States is $37,500 and the third quartile is
$50.500. What is the interquartile range for the salaries of county sheriffs?
Ans. $13,000
3.52 Find the interquartile range for the ages of commercial aircraft given in Table 3.16.
A m . 9 years
BOX-AND-WHISKER PLOT
3.53 A Minitab produced boxplot for the weights of high school football players is shown in Fig. 3-5. Give thc
rninimum. rnaximum, first quartile, median, and third quartile.
77-
-T---l---l--n
1
6
200 2 1 0 3 0 230 240 250 260 270
Weights
Fig. 3-5
Ans. rninimum = 195 maximum = 270 Q, = 205 Q2= 240 Q3= 260
3.54 Construct or use Minitab to construct a boxplot for the ages of the commercial aircraft in Table 3.16.
Atm The boxplot is shown in Fig. 3-6.
I I I 1
0 10 20 30
Age
Fig. 3-6
Chapter 4
ProbabiIity
EXPERIMENT,OUTCOMES, AND SAMPLESPACE
An experiment is any operation or procedure whose outcomes cannot be predicted with certainty.
The set of all possible outconies for an experiment is called the sample space for the experiment.
EXAMPLE4.1 Games of chance are examples of experiments. The single toss of a coin is an experiment
whose outcomes cannot be predicted with certainty. The sample space consists of two outcomes, heads or tails.
The letter S is used to represent the sample space and may be represented as S = (H, T}.The single toss of a die
is an experiment resulting in one of six outcomes. S may be represented as ( I , 2, 3, 4, 5 , 6). When a card is
selected from a standard deck, 52 outcomes are possible. When a roulette wheel is spun, the outcome cannot be
predicted with certainty.
EXAMPLE 4.2 When a quality control technician selects an item for inspection from a production line, it may
be classified as defective or nondefective. The sample space may be represented by S = (D,N}. When the blood
type of a patient is determined, the sample space may be represented as S = (A, AB, B, 0 ) .
When the Myers-
Briggs personality type indicator is administered to an individual, the sample space consists of I6 possible
outcomes.
The experiments discussed in Examples 4.1 and 4.2 are rather simple experiments and the
descriptions of the sample spaces are straightforward. More complicated experiments are discussed
in the following section and techniques such as tree diagrams are utilized to describe the sample
space for these experiments.
TREE DIAGRAMS AND THE COUNTING RULE
In a tree diagram, each outcome of an experiment is represented as a branch of a geometric
figure called a tree.
EXAMPLE 4.3 Figure 4-1 shows a tree diagram for the experiment of tossing a coin twice. The tree has four
branches. Each branch is an outcome for the experiment. If the experiment is expanded to three tosses, the
branches are simply continued with H or T added to the end of each branch shown in Fig. 4-1. This would result
in the eight outcomes: HHH, HHT, HTH, HTT, THH, THT, TTH,and TTT. This technique could be continued
systematically to give the outcomes for n tosses of a coin. Notice that 2 tosses has 4 outcomes and 3 tosses has 8
outcomes. N tosses has 2Npossible outcomes.
The courztirzg rule for a two-step experiment states that if the first step can result in any one of nl
outcomes, and the second step in any one of n2 outcomes, then the experiment can result in (nl)(nz)
outcomes. If a third step is added with n3 outcomes, then the experiment can result in (nl)(nz)(nl)
outcomes. The counting rule applies to an experiment consisting of any number of steps. If the
counting rule is applied to Example 4.3, we see that for two tosses of a coin, nl = 2, n2 = 2, and the
number of outcomes for the experiment is 2 x 2 = 4. For three tosses, there are 2 x 2 x 2 = 8
outcomes and so forth. The counting rule may be used to figure the number of outcomes of an
experiment and then a tree diagram may be used to actually represent the outcomes.
64
1
2
Die2 3 -
4
5
6
PROBABILITY
J , 8 8 8 8 8
- 8 8 8 8 8 8
8 8 8 8 8 8
- 8 8 8 8 W W
-. 8 8 8 8 8
8 8 8 8 8
[CHAP. 4
first toss second toss outcome
HH
I
HT
TH
TT
Fig. 4-1
EXAMPLE 4.4 For the experiment of rolling a pair of dice, the first die may be any of six numbers and the
second die may be any one of six numbers. According to the counting rule, there are 6 x 6 = 36 outcomes. The
outcomes may be represented by a tree having 36 branches. The sample space may also be represented by a two-
dimensional plot as shown in Fig. 4-2.
Fig. 4-2
EXAMPLE 4.5 An experiment consists of observing the blood types for five randomly selected individuals.
Each of the five will have one of four blood types A, B, AB, or 0. Using the counting rule, we see that the
experiment has 4 x 4 x 4 x 4 x 4 = 1,024 possible outcomes. In this case constructing a tree diagram would be
difficult.
EVENTS, SIMPLEEVENTS, AND COMPOUNDEVENTS
An event is a subset of the sample space consisting of at least one outcome from the sample
space. If the event consists of exactly one outcome, it is called a simple event. If an event consists of
more than one outcome, it is called a compound event.
EXAMPLE4.6 A quality control technician selects two computer mother boards and classifies each as
defective or nondefective. The sample space may be represented as S = {NN, ND, DN, DO], where D
represents a defective unit and N represents a nondefective unit. Let A represent the event that neither unit is
defective and let B represent the event that at least one of the units is defective. A = { NN} is a simple event and
B = {ND,DN, DD) is a compound event. Figure 4-3 is a Venn Diagram representation of the sample space S
CHAP. 41 PROBABILITY 65
and the events A and B. In a Venn diagram, the sample space is usually represented by a rectangle and events
are represented by circles within the rectangle.
Fig. 4-3
EXAMPLE 4.7 For the experiment described in Example 4.5,
there are 1,024different outcomes for the blood
types of the five individuals. The compound event that all five have the same blood type is composed of the
following four outcomes: (A, A, A, A, A), (B, B, B, B, B), (AB, AB, AB, AB, AB), and (0,0,0,0, 0).The
simple event that all five have blood type 0 would be the outcome (0,0,0,0 , O ) .
PROBABILITY
Probability is a measure of the likelihood of the occurrence of some event. There are several
different definitions of probability. Three definitions are discussed in the next section. The particular
definition that is utilized depends upon the nature of the event under consideration. However, all the
definitions satisfy the following two specific properties and obey the rules of probability developed
later in this chapter.
The probability of any event E is represented by the symbol P(E) and the symbol is read as “P of
E” or as “the probability of event E.” P(E) is a real number between zero and one as indicated in the
following inequality:
0 I P(E) I 1 (4.1)
The sum of the probabilities for all the simple events of an experiment must equal one. That is, if
E1 ,E2 , . . . ,E,are the simple events for an experiment, then the following equality must be true:
P(E1) +P(E2) + . ..+P(En)= 1 (4.2)
Equality (4.2)is also sometimes expressed as in formula (4.3):
P(S) = 1 (4-3)
Equation (4.3)states that the probability that some outcome in the sample space will occur is one.
CLASSICAL, RELATIVE FREQUENCY,AND SUBJECTIVE PROBABILITY
DEFINITIONS
The classical defirtition o
f probability is appropriate when all outcome‘s of an experiment are
equally likely. For an experiment consisting of n outcomes, the classical definition of probability
66 PROBABILITY [CHAP. 4
Type of transplant
Heart
Kidney
Liver
Pancreas
Eyes
1
n
assigns probability - to each outcome or simple event. For an event E consisting of k outcomes, the
probability of event E is given by formula (4.4)
k
n
P(E) = - (4.4)
Waiting Time for Transplant
10 5
7 3
5 5
3 2
5 5
Less than one year One year or more
EXAMPLE 4.8 The experiment of selecting one card randomly from a standard deck of cards has 52 equally
likely outcomes. The event A, = (club) has probability g,since A, consists of 13 outcomes. The event A2 =
{red card) has probability E,since A2 consists of 26 outcomes. The event A? = { face card (Jack, Queen,
King)} has probability g,since A3 consists of 12outcomes.
EXAMPLE 4.9 Table 4.1 gives information concerning fifty organ transplants in the state of Nebraska during
a recent year. Each patient represented in Table 4.1 had only one transplant. If one of the 50 patient records is
randomly selected, the probability that the patient had a heart transplant is = .30,since 15 of the patients had
heart transplants. The probability that a randomly selected patient had to wait one year or more for the transplant
I S - = .40,
since 20 of the patients had to wait one year or more. The display in Table 4.1 is called a titw-huy
table. It displays two different variables concerning the patients.
’ 20
so
EXAMPLE 4.10 To find the probability of the event A that the sum of the numbers on the faces of a pair of
dice equals seven when a pair of dice is rolled, consider the sample space shown in Fig. 4-4. The event A is
shown as a rectangular box in the sample space. The outcomes in A are as follows A = { ( 1, 6),(2, 5). (3, 4), (4,
3). (5, 2), (6, I)}. Since A contains six of the thirty-six equally likely outcomes for the experiment, the
probability of event A is .
36
Die 2
2 - 0
0
1 - - .
I I r I
1 2 3 4
Die 1
Fig. 4-4
CHAP. 41 PROBABILITY 67
The classical definition of probability is not always appropriate in computing probabilities of
events. If a coin is bent, heads and tails are not equally likely outcomes. If a die has been loaded,
each of the six faces do not have probability of occurrence equal to i.
For experiments not having
equally likely outcomes, the relative frequertcv definition ofprohahility is appropriate. The relative
frequency definition of probability states that if an experiment is performed n times, and if event E
occurs f times, then the probability of event E is given by formula (4.5).
f
P(E) = -
n
(4.5)
EXAMPLE 4.11 A bent coin is tossed 50 times and a head appears on 35 of the tosses. The relative frequency
definition of probability assigns the probability -7-5 = .70 to the event that a head occurs when this coin is tossed.
A loaded die is tossed 75 times and the face “6” appears 15 times in the 75 tosses. The relative frequency
definition of probability assigns the probability = .20 to the event that the face “6” will appear when this die
is tossed.
50
EXAMPLE 4.12 A study by the state of Tennessee found that when 750 drivers were randomly stopped 47 1
were found to be wearing seat belts. The relative frequency probability that a driver wears a seat belt in
Tennessee is I
= 0.63.
471
750
There are many circumstances where neither the classical definition nor the relative frequency
definition of probability is applicable. The subjectiLfe defirzitiorz o
f yrohabilihy utilizes intuition,
experience, and collective wisdom to assign a degree of belief that an event will occur. This method
of assigning probabilities allows for several different assignments of probability to a given event.
The different assignments must satisfy formulas (4.I ) and (4.2).
EXAMPLE 4.13 A military planner states that the probability of nuclear war in the next year is 1%. The
individual is assigning a subjective probability of .01 to the probability of the event “nuclear war in the next
year.” This event does not lend itself to either the classical definition or the relative frequency definition of
probability.
EXAMPLE 4.14 A medical doctor tells a patient with a newly diagnosed cancer that the probahility of
successfully treating the cancer is 90%. The doctor is assigning a subjective probability of .90 to the event that
the cancer can be successfully treated. The probability for this event cannot be determined by either the classical
definition or the relative frequency definition of probability.
MARGINAL AND CONDITIONAL PROBABILITIES
Table 4.2 classifies the 500 members of a police department according to their minority status as
well as their promotional status during the past year. One hundred of the individuals were classified
as being a minority and seventy were promoted during the past year. The probability that a randomly
selected individual from the police department is a minority is = .20 and the probability that a
randomly selected person was promoted during the past year is & = .14. Table 4.3 is obtained by
dividing each entry in Table 4.2 by 500.
68
Promoted
No
PROBABILITY
Minority
No Yes Total
350 80 430
[CHAP. 4
~ ~ ~
Promoted No
No .70
Yes . l O
Total .80
Table 4.2
Yes Total
.16 .86
.04 .14
.20 1.oo
Yes 50 I 20 I 70 I
I Total I 400 I 100 I 500 I
The four probabilities in the center of Table 4.3, .70, .16, .10, and .04, are called joint
probabilities. The four probabilities in the margin of the table, .80, .20, .86, and .14, are called
mtirginal probabilities.
Table 4.3 a
I Minoritv I I
The joint probabilities concerning the selected police officer may be described as follows:
.70 = the probability that the selected officer is not a minority and was not promoted
.I6 = the probability that the selected officer is a minority and was not promoted
.I0= the probability that the selected officer is not a minority and was promoted
.04= the probability that the selected officer is a minority and was promoted
The marginal probabilities concerning the selected police officer may be described as follows:
.80 = the probability that the selected officer is not a minority
.20 = the probability that the selected officer is a minority
.86= the probability that the selected officer was not promoted during the last year
.14 = the probability that the selected officer was promoted during the last year
In addition to the joint and marginal probabilities discussed above, another important concept is
that of a corzditional probability. If it is known that the selected police officer is a minority, then the
conditional probability of promotion during the past year is = .20, since 100 of the police officers
in Table 4.2 were classified as minority and 20 of those were promoted. This same probability may
be obtained from Table 4.3 by using the ratio $ = .20.
The formula for the conditional probability of the occurrence of event A given that event B is
known ta have occurred for some experiment is represented by P(A I B) and is the ratio of the joint
probability of A and B divided by the probability of B. The following formula is used to compute a
conditional probability.
P(A and B)
P(B)
P(A I B) =
The following example summarizes the above discussion and the newly introduced notation.
EXAMPLE 4.15 For the experiment of selecting one police officer at random from those described in Table
4.2,define event A to be the event that the individual was promoted last year and define event B to be the event
CHAP. 41 PROBABILITY 69
that the individual is a minority. The joint probability of A and B is expressed as P(A and B) = .04. The
marginal probabilities of A and B are expressed as P(A) = .I4 and P(B) = .20. The conditional probability of A
given B is P(A IB) =
P(AandB) .04
- -= .20.
P(B) - .20
MUTUALLY EXCLUSIVE EVENTS
Two or more events are said to be nzutuallv exclusive if the events do not have any outcomes in
common. They are events that cannot occur together. If A and B are mutually exclusive events then
the joint probability of A and B equals zero, that is, P(A and B) = 0. A Venn diagram representation
of two mutually exclusive events is shown in Fig. 4-5.
Fig. 4-5
EXAMPLE 4.16 An experiment consists in observing the gender of two randomly selected individuals. The
event, A, that both individuals are male and the event, B, that both individuals are female are mutually exclusive
since if both are male, then both cannot be female and P(A and B) = 0.
EXAMPLE 4.17 Let event A be the event that an employee at a large company is a white collar worker and let
B be the event that an employee is a blue collar worker. Then A and B are mutually exclusive since an employee
cannot be both a blue collar worker and a white collar worker and P(A and B) = 0.
DEPENDENT AND INDEPENDENT EVENTS
If the knowledge that some event B has occurred influences the probability of the occurrence of
another event A, then A and B are said to be dependent events. If knowing that event B has occurred
does not affect the probability of the occurrence of event A, then A and B are said to be independent
events. Two events are independent if the following equation is satisfied. Otherwise the events are
dependent.
P(A I B) = P(A) (4.7)
The event of having a criminal record and the event of not having a father in the home are
dependent events. The events of being a diabetic and having a family history of diabetes are
dependent events, since diabetes is an inheritable disease. The events of having 10letters in your last
name and being a sociology major are independent events. However, many times it is not obvious
whether two events are dependent or independent. In such cases, formula (4.7)is used to determine
whether the events are independent or not.
70
Smoker
No
Yes
Total
PROBABILITY
History of Heart Disease
No Yes Total
75 5 80
35 10 45
110 15 125
[CHAP. 4
EXAMPLE 4.18 For the experiment of drawing one card from a standard deck of 52 cards, let A be the event
that a club is selected, let B be the event that a face card (jack, queen, or king) is drawn, and let C be the event
that a jack is drawn. Then A and B are independent events since P(A) = $ = .25 and P(A I B) = $ = .25.
P(A I B) = 2 = .25. since there are 12 face cards and 3 of them are clubs. The events B and C are dependent
events since P(C)= 5 = .077 and P(C IB) = 5 = .333. P(C I B) = -% = .333, since there are 12 face cards and
4 of them are jacks.
12
52 12 12
EXAMPLE 4.19 Suppose one patient record is selected from the 125 represented in Table 4.4. The event that
a patient has a history of heart disease, A, and the event that a patient is a smoker, B, are dependent events, since
= .22. For this group of patients, knowing that an individual is a smoker
= .I2 and P(A I B) =
15
12s
P(A)= -
almost doubles the probability that the individual has a history of heart disease.
Table 4.4
COMPLEMENTARYEVENTS
To every event A, there corresponds another event A', called the complement of A and consisting
of all other outcomes in the sample space not in event A. The word nut is used to describe the
complement of an event. The complement of selecting a red card is not selecting a red card. The
complement of being a smoker is not being a smoker. Since an event and its complement must
account for all the outcomes of an experiment, their probabilities must add up to one. If A and A' are
complementary events then the following equation must be true.
P(A) +P(A') = 1 (4.8)
EXAMPLE 4.20 Approximately 2% of the American population is diabetic. The probability that a randomly
chosen American is not diabetic is .98, since P(A)= .02, where A is the event of being diabetic, and .02 + P(A')
= 1. Solving for P(A') we get P(A') = 1 - .02 = .98.
EXAMPLE4.21 Find the probability that on a given roll of a pair of dice that "snake eyes" are not rolled.
Snake eyes means that a one was observed on each of the dice. Let A be the event of rolling snake eyes. Then
P(A)= L!- = .028. The event that snake eyes are not rolled is AC.Then using formula (4.8),.028 + P(A') = 1, and
solving for P(A'), it follows that P(AC)= I - .028 = .972.
26
Complementary events are always mutually exclusive events but mutually exclusive events are
not always complementary events. The events of drawing a club and drawing a diamond from a
standard deck of cards are mutually exclusive, but they are not complementary events.
MULTIPLICATION RULE FOR THE INTERSECTION OF EVENTS
The intersection of two events A and B consists of all those outcomes which are common to both
A and B. The intersection of the two events is represented as A and B. The intersection of two events
CHAP. 41 PROBABILITY 71
is also represented by the symbol A n B, read as A intersect B. A Venn diagram representation of
the intersection of two events is shown in Fig. 4-6.
Fig. 4-6
The probability of the intersection of two events is given by the inultiplication rule. The
multiplication rule is obtained from the formula for conditional probabilities, and is given by formula
(4.9).
P(A and B) = P(A) P(B IA) (4.9)
EXAMPLE 4.22 A small hospital has 40 physicians on staff of which 5 are cardiologists. The probability that
two randomly selected physicians are both cardiologists is determined as follows. Let A be the event that the
first selected physician is a cardiologist, and B the event that the second selected physician is a cardiologist.
Then P(A) = $ = ,125, P(B I A) = L=.103 and P(A and B) = .125 x .103 = .013. If two physicians were
selected from a group of 40,000of which 5,000were cardiologists, then P(A) = = .125, P(B IA) = 4,999
40,000 39,999
39
= .125, and P(A and B) = (.125)2= .016. Notice that when the selection is from a large group, the probability of
selecting a cardiologist on the second selection is approximately the same as selecting one on the first selection.
Following this line of reasoning, suppose it is known that 12.5% of all physicians are cardiologists. If three
equals (.125)3 = .002. The
physicians are selected randomly, the probability that all three are cardiologists
probability that none of the three are cardiologists is (.875)3 = .670.
If events A and B are independent events, then P(A1B) = P(A) and P(B
is replaced by P(B), formula (4.9)simplifies to
P(A and B) =P(A) P(B)
A) = P(B). When P(B IA)
(4.10)
EXAMPLE 4.23 Ten percent of a particular population have hypertension and 40 percent of the same
population have a home computer. Assuming that having hypertension and owning a home computer are
independent events, the probability that an individual from this population has hypertension and owns a home
computer is .10 x .40 = -04.Another way of stating this result is that 4 percent have hypertension and own a
home computer.
ADDITION RULE FOR THE UNION OF EVENTS
The union of two events A and B consists of all those outcomes that belong to A or B or both A
and B. The union of events A and B is represented as A U B or simply as A or B. A Venn diagram
representation of the union of two events is shown in Fig. 4-7. The darker part of the shaded union of
the two events corresponds to overlap and corresponds to the outcomes in both A and B.
72 PROBABILITY [CHAP. 4
Fig. 4-7
To find the probability in the union, we add P(A) and P(B). Notice, however, that the darker part
indicates that P(A and B) gets added twice and must be subtracted out to obtain the correct
probability. The resultant equation is called the addition rule for probabilities and is given by
formula (4.1I).
P(A or B) = P(A) +P(B) -P(A and B) (4.11)
If A and B are mutually exclusive events, then P(A and B) = 0 and the formula (4.11)simplifies
to the following.
P(A or B) = P(A) +P(B)
EXAMPLE4.24 Forty percent of the employees at Computec, Inc. have a college degree, 30 percent have
been with Computec for at least three years, and 15 percent have both a college degree and have been with the
company for at least three years. If A is the event that a randomly selected employee has a college degree and B
is the event that a randomly selected employee has been with the company at least three years, then A or B is the
event that an employee has a college degree or has been with the company at least three years. The probability
of A or B is .40 + .30 - .15 = .55. Another way of stating the result is that 55 percent of the employees have a
college degree or have been with Computec for at least three years.
EXAMPLE 4.25 A hospital employs 25 medical-surgical nurses, 10intensive care nurses, 15 emergency room
nurses, and 50 floor care nurses. If a nurse is selected at random, the probability that the nurse is a medical-
surgical nurse or an emergency room nurse is .25 + .15 = .40. Since the events of being a medical-surgical nurse
and an emergency room nurse are mutually exclusive, the probability is simply the sum of probabilities of the
two events.
BAYES’ THEOREM
A computer disk manufacturer has three locations that produce computer disks. The Omaha plant
produces 30% of the disks, of which 0.5% are defective. The Memphis plant produces 50% of the
disks, of which 0.75% are defective. The Kansas City plant produces the remaining 20%, of which
0.25% are defective. If a disk is purchased at a store and found to be defective, what is the
probability that it was manufactured by the Omaha plant? This type of problem can be solved using
Bayes’ theorem. To formalize our approach, let AI be the event that the disk was manufactured by
the Omaha plant, let A2 be the event that the disk was manufactured by the Memphis plant, and let A3
be the event that it was manufactured by the Kansas City plant. Let B be the event that the disk is
defective. We are asked to find P(A1 I B). This probability is obtained by dividing P(A1 and B) by
P(B)-
CHAP. 41 PROBABILITY 73
The event that a disk is defective occurs if the disk is manufactured by the Omaha plant and is
defective or if the disk is manufactured by the Memphis plant and is defective or if the disk is
manufactured by the Kansas City plant and is defective. This is expressed as follows
B = (AIand B) or (A2 and B) or (A3 and B) (4.12)
Because the three events which are connected by or’s in formula (4.12)are mutually exclusive,
P(B) may be expressed as
P(B) =P(A1 and B) +P(A2 and B) +P(A3and B) (4.13)
By using the multiplication rule, formula (4.13)may be expressed as
Using formula (4.14),P(B) = .005x .3 + .0075 x .5 + .0025 x .2 = .00575. That is, 0.575% of
the disks manufactured by all three plants are defective. The probability P(AIand B) equals
.OO15
,00575
P(B I AI) P(AI) = ,005 x .3 = .0015. The probability we are seeking is equal to -= .261.
Summarizing, if a defective disk is found, the probability that it was manufactured by the Omaha
plant is .261.
In using Bayes’ theorem to find P(A1 I B) use the following steps:
Step 1: Compute P(A1 and B) by using the equation P(A1and B) = P(B IAI)P(A1).
Step 2: Compute P(B) by using formula (4.14).
Step 3: Divide the result in step 1by the result in step 2 to obtain P(A1 I B).
These same steps may be used to find P(A2 IB) and P(A3 IB).
Events like A1 ,A2 ,and A3 are called collectively exhaustive. They are mutually exclusive and
their union equals the sample space. Bayes’ theorem is applicable to any number of collectively
exhaustive events.
EXAMPLE 4.26 Using the three-step procedure given above, the probability that a defective disk was
manufactured by the Memphisplant is found as follows.
Step 1: P(A2 and B) = P(B I Az) P(A2) = .0075 x .5 = .00375.
Step 2: P(B) = .005 x .3 + .0075 x .5 + .0025 x .2 = ,00575.
.00375
.00575
Step 3: P(A21 B) = -
= .652.
The probability that a defective disk was manufactured by the Kansas City plant is found as follows.
Step 1: P(A3 and B) = P(B IA3) P(A3) = .0025 x .2= .0005.
Step 2: P(B) = .005x .3 + .0075 x .5 + .0025 x .2 = .00575.
.0005
.00575
Step 3: P(A2 IB) = -
-
- .087.
PERMUTATIONSAND COMBINATIONS
Many of the experiments in statistics involve the selection of a subset of items from a larger
group of items. The experiment of selecting two letters from the four letters a, b, c, and d is such an
experiment. The following pairs are possible: (a, b), (a, c), (a, d), (b, c), (b, d), and (c, d). We say that
74 PROBABILITY [CHAP. 4
when selecting two items from four distinct items that there are six possible combinations. The
number of combinations possible when selecting n from N items is represented by the symbol Cf;’ and
is given by
N !
n!( N -n)!
cf;’= (4.15)
NC”,C(N, n), and
addition to the symbol C,”.
The symbol n!, read as “n factorial,” is equal to n x (n - I ) x (n - 2) x . . . x 1. For example, 3! =
3 x 2 x 1 = 6, and 4! = 4 x 3 x 2 x I = 24. The values for n! become very large even for small values
of n. The value of 10!is 3,628,800, for example.
In the context of selecting two letters from four, N! = 4! = 4 x 3 x 2 x 1 = 24, n! = 2! = 2 x 1 = 2
and (N - n)! = 2! = 2. The number of combinations possible when selecting two items from four is
are three other notations that are used for the number of combinations in
[
:
I
24
-
4!
given by c; = -- -= 6 , the same number obtained when we listed all possibilities above.
2!2! 2 x 2
When the number of items is larger than four or five, it is difficult to enumerate all of the
possibilities.
EXAMPLE 4.27 The number of five card poker hands that can be dealt from a deck of 52 cards is given by
52! 52 x 5 I x 50x49 x48 x 47! 52 x 5 1 x 5 0 ~ 4 9
x 48
c
;
2 = -
-
- -
- = 2,598,960. Notice that by expressing 52!
5!47! 120x47! 120
as 52 x 51 x 50 x 49 x 48 x 47!, we are able to divide 47! out because it is a common factor in both the
numerator and the denominator.
If the order of selection of items is important, then we are interested in the number of
perniutations possible when selecting n items from N items. The number of permutations possible
when selecting n objects from N objects is represented by the symbol Pc;’,and given by
N !
(N-n)!
pf;’=- (4.16)
NPn, P(N, n) and (N)n are other symbols used to represent the number of permutations.
EXAMPLE 4.28 The number of permutations possible when selecting two letters from the four letters a, b, c,
anddis p!=-=-=-= 12. In this case, the 12 permutations are easy to list. They are ab, ba, ac, ca,
ad, da, bc, cb, bd, db, cd, and dc. There are always more permutations than combinations when selecting n items
from N, because each different ordering is a different permutation but not a different combination.
4! 4! 24
(4-2)! 2! 2
EXAMPLE 4.29 A president, vice president, and treasurer are to be selected from a group of 10 individuals.
How many different choices are possible? In this case, the order of listing of the three individuals for the three
offices is important because a slate of Jim, Joe, and Jane for president, vice president, and treasurer is different
from Joe, Jim, and Jane for president, vice president, and treasurer, for example. The number of permutations is
P:* = 7
= 1Ox 9 x 8 = 720. That is, there are 720 different sets of size three that could serve as president, vice
presideqt, and treasurer.
1O!
CHAP. 41 PROBABILITY 75
USING PERMUTATIONSAND COMBINATIONSTO SOLVE
PROBABILITY PROBLEMS
EXAMPLE4.30 For a lotto contest in which six numbers are selected from the numbers 01 through 45, the
number of combinations possible for the six numbers selected is Cis= -
-
-8,145,060. The probability that
you select the correct six numbers in order to win this lotto is = .000000123. The probability of
winning the lotto can be reduced by requiring that the six numbers be selected in the correct order. The number
4s !
of permutations possible when six numbers are selected from the numbers 01 through 45 is given by p:' = -
39 !
= 45 x 44 x 43 x 42 x 41 x 40 = 5,864,443,200. The probability of winning the lotto is
.00000000017I, when the order of selection of the six numbers is important.
45!
6!39!
1
8,145,060
-
-
I
5,864,443,200
EXAMPLE 4.31 A Royal Flush is a five-card hand consisting of the ace, king, queen, jack, and ten of the same
suit. The probability of a Royal Flush is equal to = .00000154, since from Example 4.27, there are
2,598,960 five-card hands possible and four of them are Royal Flushes.
4
2,598,960
Solved Problems
EXPERIMENT,OUTCOMES,AND SAMPLESPACE
4.1
4.2
An experiment consists of flipping a coin, followed by tossing a die. Give the sample space for
this experiment.
Ans. One of many possible representations of the sample space is S = {HI, H2, H3, H4, HS, H6, T1,
T2, T3, T4, T5, T6).
Give the sample space for observing a patient's Rh blood type.
Ans. One of many possible representations of the sample space is S = { Rh-, Rh').
TREE DIAGRAMS AND THE COUNTING RULE
4.3 Use a tree diagram to illustrate the sample space for the experiment of observing the sex of the
children in families consisting of three children.
Ans. The tree diagram representation for the sex distribution of the three children is shown in Fig. 4-8,
where, for example, the branch or outcome mfm represents the outcome that the first born was a
male, the second born was a female, and the last born was a male.
76 PROBABILITY [CHAP. 4
Fig. 4-8
4.4 A sociological study consists of recording the marital status, religion, race, and income of an
individual. If marital status is classified into one of four categories, religion into one of three
categories, race into one of five categories, and income into one of five categories, how many
outcomes are possible for the experiment of recording the information of one of these
individuals?
Ans. Using the counting rule, we see that there are 4 x 3 x 5 x 5 = 300 outcomes possible.
EVENTS, SIMPLEEVENTS, AND COMPOUNDEVENTS
4.5 For the sample space given in Fig. 4-8, give the outcomes associated with the following events
and classify each as a simple event or a compound event.
(a) At least one of the children is a girl.
(b) All the children are of the same sex.
(c) None of the children are boys.
(6) All of the children are boys.
Ans. The event that at least one of the children is a girl means that either one of the three was a girl, or
two of the three were girls, or all three were girls. The event that all were of the same sex means
that all three were boys or all three were girls. The event that none were boys means that all three
were girls. The outcomes for these events are as follows:
(a) mmf, mfm, mff, fmm, fmf, ffm, fff; compound event
(b) mmm, fff; compound event
(c) fff; simple event
(4 mmrn; simple event
4.6 In the game of Yahtzee, five dice are thrown simultaneously. How many outcomes are there for
this experiment? Give the outcomes that correspond to the event that the same number
appeared on all five dice.
Ans. By the counting rule, there are 6 x 6 x 6 x 6 x 6 = 7,776 outcomes possible. Six of these 7,776
outcomes correspond to the event that the same number appeared on all five dice. These six
outcomes are as follows: ( I , 1, 1, 1, l), (2, 2, 2, 2, 2), (3, 3, 3, 3, 3), (4, 4, 4, 4, 4), (5, 5 , 5 , 5 , 5).
(6, 6,6,6,6).
CHAP. 41 PROBABILITY 77
PROBABILITY
4.7 Which of the following are permissible values for the probability of the event E?
(a) P(E)= .75 (b) P(E) = -.25 (c) P(E) = 1S O
(d) P(E)= 1 (e) P(E) = .01
Ans. The probabilities given in (a), (d), and (e) are permissible since they are between 0 and I
inclusive. The probability in (6) is not permissible, since probability measure can never be
negative. The probability in (c) is not permissible, since probability measure can never exceed one.
4.8 An experiment is made up of five simple events designated A, B, C, D, and E. Given that
P(A) =.I , P(B) = .2, P(C) = 3,
and P(E) = .2, find P(D).
Am. The sum of the probabilities for all simple events in an experiment must equal one. This implies
that .1 + .2 + .3 + P(D) + .2 = 1, and solving for P(D), we find that P(D) = .2. Note that this
experiment does not have equally likely outcomes.
CLASSICAL, RELATIVE FREQUENCY, AND SUBJECTIVE PROBABILITY
DEFINITIONS
4.9 A container has 5 red balls, 10 white balls, and 35 blue balls. One of the balls is selected
randomly. Find the probability that the selected ball is (a)red; (h)white; (c)blue.
Ans. This experiment has 50 equally likely outcomes. The event that the ball is red consists of five
outcomes, the event that the ball is white consists of 10 outcomes, and the event that the ball is
blue consists of 35 outcomes. Using the classical definition of probability, the following
probabilities are obtained: (a) &=.10 ; (b) 5o =.20; ( c ) io =.70.
10 35
4.10 A store manager notes that for 250 randomly selected customers, 75 use coupons in their
purchase. What definition of probability should the manager use to compute the probability
that a customer will use coupons in their store purchase? What probability should be assigned
to this event?
Ans. The relative frequency definition of probability should be used. The probability of using coupons
in store purchases is approximately 250 =.30.
75
4.11 Statements such as “the probability of snow tonight is 70%,” “the probability that it will rain
today is 20%,” and “the probability that a new computer software package will be successful is
99%”are examples of what type of probability assignment?
Ans. Since all three of the statements are based on professional judgment and experience, they are
subjective probability assignments.
MARGINAL AND CONDITIONAL PROBABILITIES
4.12 Financial Planning Consultants Inc. keeps track of 500 stocks. Table 4.5 classifies the stocks
according to two criteria. Three hundred are from the New York exchange and 200 are from
the American exchange. Two hundred are up, 100are unchanged, and 200 are down,
78
NYSE
AMEX
Total
PROBABILITY [CHAP. 4
u p Unchanged Down Total
50 75 175 300
150 25 25 200
200 100 200 500
Table 4.5
Pair
A, B
A, c
A, D
B, c
B,D
c,D
Mutually exclusive Common outcomes
Yes None
No
No
No Ace of hearts
No
Yes None
Jack, queen, and king of hearts
Jack, queen, and king of spades and clubs
Ace of clubs and ace of spades
If one of these stocks is randomly selected find the following.
(a) The joint probability that the selected stock is from AMEX and unchanged.
(b)The marginal probability that the selected stock is from NYSE.
(c) The conditional probability that the stock is unchanged given that it is from AMEX.
Am. (a) There are 25 from the AMEX and unchanged. The joint probability is
(h) There are 300 from NYSE. The marginal probability is
(c) There are 200 from the AMEX. Of these 200, 25 are unchanged. The conditional probability
is -=.I25
=.OS.
=.60.
25
200
4.13 Twenty percent of a particular age group has hypertension. Five percent of this age group has
hypertension and diabetes. Given that an individual from this age group has hypertension, what
is the probability that the individual also has diabetes?
Am. Let A be the event that an individual from this age group has hypertension and let B be the event
that an individual from this age group has diabetes. We are given that P(A) = .20 and P(A and B ) =
.05. We are asked to find P(B I A). P(B IA) =
P(A and B) .OS
R A ) .20
---
- - .25.
MUTUALLY EXCLUSIVE EVENTS
4.14 For the experiment of drawing a card from a standard deck of 52, the following events are
defined: A is the event that the card is a face card, B is the event that the card is an Ace, C is
the event that the card is a heart, and D is the event that the card is black. List the six pairs of
events and determine which are mutually exclusive.
4.15 Three items are selected from a production process and each is classified as defective or non-
defective. Give the outcomes in the following events and check each pair to see if the pair is
mutually exclusive. Event A is the event that the first item is defective, B is the event that there
is exactly one defective in the three, and C is the event that all three items are defective.
Am. A consistsof the outcomes DDD, DDN, DND, arid DNN. B consists of the outcomes DNN, NDN,
and NND. C consists of the outcome DDD. (D represents a defective, and N represents a
nondefective.)
CHAP. 41
Pair Mutually cxclusive
A, B N0
A, C No
B, c Yes
PROBABILITY
Common outcomes
DNN
DDD
Nonc
79
Machine I
Machine 2
5 195
15 585
DEPENDENT AND INDEPENDENT EVENTS
4.16 African-American males have a higher rate of hypertension than the general population. Let A
represent the event that an individual is hypertensive and let B represent the event that an
individual is an African-American male. Are A and B independent or dependent events?
Am. To say that African-American males have a higher rate of hypertension than the general population
means that P(A IB) > P(A). Since P(A I B) # P(A), A and B are dependent events.
4.17 Table 4.6 gives the number of defective and nondefective items in samples from two different
machines. Is the event of a defective item being produced by the machines dependent upon
which machine produced it?
Table 4.6
I I Number defective I Number nondcfectivc I
Am. Let D be the event that a defective item is produced by the machines, Let M,be the event that the
item is produced by machine 1, and let M2 be the event that the item is produced by machine 2.
P(D) = =.025, P(D IM,) = 2oo =.025, and P(D IM2) = 6oo =.025, and the event of producing
a defective item is independent of which machine produces it.
20 5 15
COMPLEMENTARYEVENTS
4.18 The probability that a machine does not produce a defective item during a particular shift is
.90. What is the complement of the event that a machine does not produce a defective item
during that particular shift and what is the probability of that complementary event?
Ans. The complementary event is that the machine produces at least one defective item during the shift,
and the probability that the machine produces at least one defective item during the shift is
1 - .90= .10.
4.19 Events El, E2,and E3 have the following probabilities of occurrence: P(EI)= .OS, P(E2) = SO,
and P(E3) = .99. Find the probabilities of the complements of these events.
Am. P( El') = 1 - P(E,) = 1 - .05 = .95
P(E3' ) = 1 - P(E3) = 1 - .99 = .01
P(E2) = 1 - P(E2)= I - .SO= .SO
MULTIPLICATION RULE FOR THE INTERSECTION OF EVENTS
4.20 If one card is drawn from a standard deck, what is the probability that the card is a face card? If
two cards are drawn, without replacement, what is the probability that both are face cards? If
five cards are drawn, without replacement, what is the probability that all five are face cards?
80
Low IQ High IQ
Low creativity 75 30
High creativity 20 125 L
PROBABILITY [CHAP. 4
Ans. Let El be the event that the first card is a face card, let E2 be the event that the second drawn card
is a face card, and so on until Es represents the event that the fifth card is a face card. The
probability that the first card is a face card is P(EI)= i2 =.231. The event that both are face cards
12
1 1 132 -
is the event El and Ez,and P(EI and E*)= P(EI)P(E2 I El) = x lFT = -- .050. The event that
all five are face cards is the event Eland El nnd E3 and E4 arid ES. The probability of this event is
given by P(EI)RE2 I El) P(E.1I Et, E21 P(E4 I El, E2, Ed P(Es I El, Ez,
Ej, Ed, or
- .000305.
12 I 1 1
0 9 8 95040
- x - x - x - x - = L -
92 51 so 4Y 48 311,87S,200
2652
4.21 If 60 percent of all Americans own a handgun, find the probability that all five in a sample of
five randomly selected Americans own a handgun. Find the probability that none of the five
own a handgun.
Ans. Let E, be the event that the first individual owns a handgun, E2 be the event that the second
individual owns a handgun, E3 be the event that the third individual owns a handgun, E4 be the
event that the fourth individual owns a handgun, and E5 be the event that the fifth individual owns
a handgun. The probability that all five own a handgun is P(EI and E2 and E3and E4 arid E?).
Because of the large group from which the individuals are selected, the events El through Es are
independent and the probability is given by P(EI)P(E2) P(E3) P(E4)P(E5)= (.6)5= .078. Similarly,
the probability that none of the five own a handgun is (.4)’= .010.
ADDITION RULE FOR THE UNION OF EVENTS
4.22 Table 4.7 gives the IQ rating as well as the creativity rating of 250 individuals in a
psychological study. Find the probability that a randomly selected individual from this study
will be classified as having a high IQ or as having high creativity.
Ans. Let A be the event that the selected individual has a high IQ, and let B be the event that the
individual has high creativity. Then P(A) = 250 - .62,P(B) = 250 - .58,P(A and B) = = S O ,
and P(A or B) = P(A) + P(B) - P(A and B) = .62 + .58 - S O = .70.
15.5 - 14s -
4.23 The probability of event A is .25, the probability of event B is .lO, and A and B are inde-
pendent events. What is the probability of the event A or B?
Ans. Since A and B are independent events, P(A and B) = P(A) P(B) = .25 x .I0 = ,025.P(A or B) =
P(A) +P(B) - P(A and B) = .25 + .I0- ,025= .325.
BAYES’ THEOREM
4.24 Box 1 contains 30 red and 70 white balls, box 2 contains 50 red and 50 white balls, and box 3
contains 75 red and 25 white balls. The three boxes are all emptied into a large box, and a ball
is selected at random. If the selected ball is red, what is the probability that it came from (a)
box 1;(b)box 2; (c) box 3‘?
CHAP. 4
3 PROBABILITY 81
Region
Northeast
Midwest
South
West
Ans. Let B 1be the event that the selected ball came from box 1, let B2be the event that the selected ball
came from box 2, let B3be the event that the ball came from box 3, and let R be the event that the
selected ball is red.
Percentage of U.S.
popu1ation
Percentage of social security
rccipients in the region
20 15
2s 10
35 12
20 11
(CI) We are asked to find P(Bl I R). The three-step procedure is as follows:
Step 1:
Step 2:
Step 3:
(b) We are asked to find P(B2IR).
Step I: P(R and B2) = P(R IB2) P(B2)= Ex
Step 2: P(R) = .S 17
Step 3: P(B2IR)= -= .323
so 1
= .I67
.I 67
.517
(c) We are asked to find P(B3IR).
Step 1: P(R and B3) = P(R I B3) P(B2)=
Step 2: P(R) = .5 I7
7s 1
x _1 = .250
,250
.5I7
Step 3: P(B3I R)= --
- ,484
4.25 Table 4.8 gives the percentage of the U.S. population in four regions of the United States, as
well as the percentage of social security recipients within each region. For the population of all
social security recipients, what percent live in each of the four regions?
Table 4.8
Am. Let B1 be the event that an individual lives in the Northeast region, B2 be the event that an
individual lives in the Midwest region, B3be the event that an individual lives in the South, and B4
be the event that an individual lives in the West. Let S be the event that an individual is a social
security recipient. We are given that P(BI)= .20, P(B2) = .25, P(B3) = .3S.P(B4)= .20, P(S IBI)=
.15, P(S I B2) = .10, P(S 1 B3) = .12, and P(S I B4) = .11. We are to find P(B1 I S), P(B2 1 S),
P(B3I S), and P(B4I S).P(S) is needed to find each of the four probabilities.
P(S) = P(S I BI) P(B1)+ P(S I B2) w32) + P(S IBd P(B3) + P(S IB4) P(B4)
P(S) = .I5x .20+ .I0 x .25 + .12 x .35 + .I 1 x .20= .I 19.
This means that 1 I .9%of the population are social security recipients.
The three-step procedure to find P(BI I S) is as follows:
Step 1: P(BIarid S) = P(S I BI) P(BI)= .15 x .20 = .03
Step 2: P(S) = . I 19
.03
.1 19
Step 3: P(BII S) = -= .252
82 PROBABILITY [CHAP. 4
The three-step procedure to find P(B2 I S) is as follows:
Step 2:
Step I : P(B2 L i d S)= P(S IB2) P(B2) = . I 0 x .25 = .U25
P(S)= . I 19
.025
.I 19
Step 3: P(B2I S) = -= .210
The thrce-step procedure to find P(B3IS) is as follows:
Step 2: P(S)= . 1 19
Step 1: P ( B I ~ d S ) = P ( S I B j ) P ( B 3 ) = . 1 2 x , 3 5 = . 0 4 2
.U42
.I 19
Step 3: P(B3I S) = -= .353
The three-step procedure to find P(B4I S) is as follows:
Step 1: P(B d and S)= P(S I Bq)P(Bq)= .I 1 x .20= .022
Step 2: P(S)= .I19
Step 3: P(Bj IS)= -= .185
.022
.I 19
We can conclude that 25.2% of the social security recipients are from the Northeast, 21.0% are
from the Midwest, 35.3% are from the South, and 18.5%are from the West.
PERMUTATIONS AND COMBINATIONS
4.26 Evaluate the following: (a)C& (b)CE; (c) PE ;(d)P;.
Am. Each of the four parts uses the fact that O! = 1.
n!
I x n!
---
n!
( ( 1 ) c"- - - I , since the n! in the numerator and denominator divide out.
() - o!(n - o)!
n!
( t l ) c;:
= - -= 1, since O! is equal to one and the n! divides out of top and bottom.
-
n!
n!(n - n)! n!O!
n! n!
(n-O)! n!
( C ' ) PI=------- - -1.
n! n! n!
(n-n)! O! 1
-----
(d> Pi: = -
- - - n!
4.27 An e-wc-tizwngur at the racetrack is a bet where the bettor picks the horses that finish first and
second. A trff;ec.tn izwger is a bet where the bettor picks the three horses that finish first,
second, and third. (cl) In a 12-horse race, how many exactas are possible? (b)In a 12-horse
race, how many trifectas are possible?
Am. Since the finish order of the horse is important, we use permutations to count the number of
possib1e se1ections.
( ( 1 ) The number of ordered ways you can select two horses from twelve is
12! 12! 12xllx10!
=12~11=132
-
pi?= --=
(12-2)! IO! 10!
CHAP. 41 PROBABILITY 83
(b) The number of ordered ways you can select three horses from twelve is
17 12! 12! 12X11X10X9! = 12x11x10=
1320
--
p3-= ~ -
- -
(12-3)! 9! 9!
4.28 A committee of five senators is to be selected from the U.S. Senate. How many different
committees are possible?
Ans. Since the order of the five senators is not important, the proper counting technique is combina-
tions.
lOOx 99 x 98x 97 x 96x 95!
120x 95!
lOOx 99 x 98 x 97 x 96
I20
= 75,287,520.
-
-
-
-
loo!
5!( 100-5)!
c:" =
There are 75,287,520 different committees possible.
USING PERMUTATIONS AND COMBINATIONSTO SOLVE
PROBABILITY PROBLEMS
4.29
4.30
Twelve individuals are to be selected to serve on a jury from a group consisting of 10females
and 15 males. If the selection is done in a random fashion, what is the probability that all 12
are males?
Ans. Twelve individuals can be selected from 25 in the followingnumber of ways:
~ 2 5 - 25! - 2 5 x 2 4 ~ 2 3 ~ 2 2 ~ 2 1 ~ 2 0 ~
1 9 x l 8 x l 7 x l 6 x 15x 1 4 x 13!
-
12 - 12!13! 12!13!
After dividing out the common factor, 13!, we obtain the following.
25 x 24 x 23 x 22 x 21x 20 x I9 x I8 x 17x I 6x 15 x 14
12x 1 1 x 1 O x 9 x 8 x 7 x 6 x 5 x 4 x 3 x 2 x 1
c::= = 5,200,300
The above fraction may need to be evaluated in a zigzag fashion. That is, rather than multiply the
12 terms on top and then the 12 terms on bottom and then divide, do a multiplication, followed by
a division, followed by a multiplication,and so on until all terms on top and bottom are accounted
IS! 15x 14x 13x 12! 15x 14x 13
= 455
-
-
for. The jury can consist of all males in cl: = -
-
-
12!3! 12!3! 6
- .000087. That is, there are about 9 chances
ways. The probability of an all-malejury is 5,200,300 -
out of 100,000that an all-malejury would be chosen at random.
45s
The five teams in the western division of the American conference of the National Football
League are: Kansas City, Oakland, Denver, San Diego, and Seattle. Suppose the five teams are
equally balanced. (a) What is the probability that Kansas City, Seattle, and Denver finish the
season in first, second, and third place respectively? (b)What is the probability that the top
three finishers are Kansas City, Seattle, and Denver?
Ans. (a) Since the order of finish is specified, permutations are used to solve the problem. There are
P3=-= 60 different ordered ways that three of the five teams could finish the season in first,
second, and third place in the conference. The probability that Kansas City will finish first, Seattle
will finish second, and Denver will finish third is 6o = ,017.
5 !
2!
I
84 PROBABILITY [CHAP. 4
( 0 ) Since the order of finish for the three teams is not specified, cornbinations are used to solve
the problem. There are C:= -= 10 combinations of three teams that could finish in the top
three. The probability that the top three finishers are Kansas City, Seattle, and Denver is A = .10.
S!
3!2!
Supplementary Problems
EXPERIMENT, OUTCOMES, AND SAMPLE SPACE
4.31 An experiment consists of using a 25-question test instrument to classify an individual as having either a
type A or a type B personality. Give the sample space for this experiment. Suppose two individuals are
classified as to personality type. Give the sample space. Givc the sample space for three individuals.
Am. For one individual, S = {A,B},where A rneans the individual has a type A personality. and B
means the individual has a type B personality.
For two individuals, S = {AA,
AB,BA,BB},where AB,for example, is the outcome that the first
individual has a type A personality and the second individual has a type B personality.
For three individuals, S = {AAA,AAB,ABA,ABB,BAA,BAB,BBA,BBB},where ABA is the
outcome that the first individual has a type A personality, the second has a type B personality, and
the third has a type A.
4.32 A
t a roadblock, state troopers classify drivers as either driving while intoxicated, driving while impaired,
or sober. Give the sample space for the classification of one driver. Givc the sample space for two
drivers. How many outcomes are possible for three drivers'?
Ans. Let A be the event that a driver is classified as driving while intoxicated, let B be the event that a
driver is classified as driving while impaired, and let C be the event that a driver is classified as
sober.
The sample space for one driver is S = {A,
B,C).
The sample space for two drivers is S = {AA,
AB,AC, BA,BB,BC,CA,CB,CC)
The sample space for three drivers has 27 possible outcomes.
TREE DIAGRAMS AND THE COUNTING RULE
4.33 An experiment consists of inspecting four items selected from a production line and classifying each one
as defective, D, or nondefective, N.How many branches would a tree diagram for this experiment have'?
Give the branches that have exactly one defective. Give the branches that have exactly one nondefective.
Am. The tree would have 2' = 16 branches which would represent the possible outcomes for the
experiment.
The branches that have exactly one defective are DNNN, NDNN, NNDN, and NNND.
The branches that have exactly one nondefective are NDDD, DNDD, DDND. and DDDN.
4.34 An experiment consists of selecting one card from a standard deck, tossing a pair of dice, and then
flipping a coin. How many outcomes are possible for this experiment?
Am. According to the counting rule, there are 52 x 36 x 2 = 3,744 possible outcomes.
CHAP. 41 PROBABILITY 85
EVENTS, SIMPLE EVENTS, AND COMPOUND EVENTS
4.35 An experiment consists of rolling a single die. What are the simple events for this experiment'?
Ans. The simple events are the outcomes { 11, { 21, { 3 } , (41, { 5 ) ,and (61, where the number in braces
represents the number on the turned up face after the die is rolled.
4.36 Suppose we consider a baseball game between the New York Yankees and the Detroit Tigers as an
experiment. Is the event that the Tigers beat the Yankees one to nothing a simple event or a compound
event? Is the event that the Tigers shut out the Yankees a simple event or a compound event? (A shutout
is a game in which one of the teams scores no runs.)
Ans. The event that the Tigers shut out the Yankees one to nothing is a simple event because it
represents a single outcome. The event that the Tigers shut out the Yankees is a compound event
because it could be a one to nothing shutout, or a two to nothing shutout, or a three to nothing
shutout, etc.
PROBABILITY
4.37 Which of the following are permissible values for the probability of the event E?
Am. (a)and (c)are permissible values, since they are between 0 and 1 inclusive. (b)is not permissible
because it exceeds one. (d)is not permissible because it is negative.
4.38 An experiment is made up of three simple events A, B, and C. If P(A) = x, P(B) = y, and P(C) = z, and
x +y + z = 1, can you be sure that a valid assignmentof probabilities has been made?
Am. No. Suppose x = .75, y = .75, and z = -S, for example. Then x + y + z = 1, but this is not a valid
assignment of probabilities.
CLASSICAL, RELATIVE FREQUENCY, AND SUBJECTIVE PROBABILITY DEFINITIONS
4.39 If a U.S. senator is chosen at random, what is the probability that he/she is from one of the 48 contiguous
states?
Ans. The are 96 senators from the 48 contiguous states and a total of 100 from the 50 states. The
96
probability is = .96.
4.40 In an actuarial study, 9,875 females out of 10,000females who are age 20 live to be 30 years old. What is
the probability that a 20-year-old female will live to be 30 years old?
988.
-
9,875 -
Ans. Using the relative frequency definition of probability, the probability is
4.41 Casino odds for sporting events such as football games, fights, etc. are examples of which probability
definition?
Ans. subjective definition of probability
MARGINAL AND CONDITIONAL PROBABILITIES
4.42 Table 4.9 gives the joint probability distribution for a group of individuals in a sociological study.
86
High school graduate
N0
Yes
PROBABILITY
Welfare recipient
No Yes
.I5 .2s
.55 .OS
[CHAP. 4
Table 4.9
4.43
( ( I ) Find the marginal probability that an individual in this study is a welfare recipient.
( h ) Find the marginal probability that an individual in this study is not a high school graduate?
( c ) If an individual is a welfare recipient, what Is the probability that he/she is not a high school
graduatc?
(tl) It'an individual is a high school graduate, what is the probability that he/she is a welfare recipient?
.LJ
Airs. ( c l ) .25 + .OS= .30 (c) -= .83
.30
( h ) .IS+ .25 = .40
.05
.60
(d) -= .083
Sixty percent of the registered voters in Douglas County are Republicans. Fifteen percent of the
registered voters in Douglas County are Republican and have incomes above $250,000 per year. What
percent of the Republicans who are registered voters in Douglas County have incomes above $250,000
per year'!
.I 5
-= .25. Twenty-five percent of the Republicans have incomes above $250,000.
.60
A m .
MUTUALLY EXCLUSIVE EVENTS
4.44 With reference to the sociological study described in problem 4.42, are the events (high school graduate}
and { weltare recipient} mutually exclusive'?
Atis. No, since 5% of the individuals in the study satisfy both events.
4.45 Do mutually exclusive events cover all the possibilities in an experiment'?
A m . No. 'This is true only when the mutually exclusive events are also complementary.
DEPENDENT AND INDEPENDENT EVENTS
4.46 I f two events are mutually exclusive, are they dependent or independent?
A m If A and B are two nontrivial events (that is, they have a nonzero probability) and if they are
rnutually exclusive, then P(A I B) = 0, since if B occurs, then A cannot occur. But since P(A) is
positive, P(A I B) f P(A), and the events must be dependent.
4.47 Seventy-five percent of all Americans live in a metropolitan area. Eighty percent of all Americans
consider themselves happy. Sixty percent of all Americans live in a metropolitan area and consider
themselves happy. Are the events {lives in a metropolitan area) and {considers themselves happy)
independent or dependent events'?
Am. Let A be the event {live in a metropolitan area} and let B be the event {consider themselves
P(A and B) .60
P(B) .80
-= .75 = P(A) , and hence A and B are independent events.
-
-
happy}.P(A 1 B) =
COMPLEMENTARY EVENTS
4.48 What is the sum of the probabilities of two complementary events?
CHAP. 41 PROBABILITY 87
Ans. one
4.49 The probability that a machine used to make computer chips is out of control is .001. What is the
complement of the event that the machine is out of control and what is the probability of this event?
Ans. The complementary event is that the machine is in control. The probability of this event is .999.
MULTIPLICATION RULE FOR THE INTERSECTION OF EVENTS
4.50 In a particular state, 20 percent of the residents are classified as senior citizens. Sixty percent of the
senior citizens of this state are receiving social security payments. What percent of the residents are
senior citizens who are receiving social security payments?
A m Let A be the event that a resident is a senior citizen and let €3 be the event that a resident is
receiving social security payments. We are given that P(A) = .20 and P(B 1 A) = .60. P(A and B) =
P(A) P(B I A) = .20 x .60 = .12. Twelve percent of the residents are senior citizens who are
receiving social security payments.
4.51 If El ,E*, . . . ,E, are n independent events, then P(EI and E2 and E3 . . .and E,) is equal to the product
of the probabilities of the n events. Use this probability rule to answer the following.
(a) Find the probability of tossing five heads in a row with a coin.
(b) Find the probability that the face 6 turns up every time in four rolls of a die.
(c) If 43 percent of the population approve of the president's performance, what is the probability that
all 10individuals in a telephone poll disapprove of his performance.
Ans. (a) (S)' = .03125
(6) ( )4 = .00077
(c) (.57)" = .00362
ADDITION RULE FOR THE UNION OF EVENTS
4.52 Events A and B are mutually exclusive and P(A) = .25 and P(B) = .35.Find P(A and B) and P(A or B).
Ans. Since A and €3 are mutually exclusive, P(A and B) = 0. Also P(A or B) = .25 + .35= .60.
4.53 Fifty percent of a particular market own a VCR or have a cell phone. Forty percent of this market own a
VCR. Thirty percent of this market have a cell phone. What percent own a VCR and have a cell phone'?
Ans. P(A and B) = P(A) + P(B) - P(A or B) = .3 + .4 - .S = .2, and therefore 20% own a VCR and a
cell phone.
BAYES' THEOREM
4.54 In Arkansas, 30 percent of all cars emit excessive amounts of pollutants. The probability is 0.95 that a car
emitting excessive amounts of pollutants will fail the state's vehicular emission test, and the probability is
0.15 that a car not emitting excessive amounts of pollutants will also fail the test. If a car fails the
emission test, what is the probability that it actually emits excessive amounts of emissions?
Ans. Let A be the event that a car emits excessive amounts of pollutants, and B the event that a car fails
the emission test. Then, P(A) = .30,P(AC)
= .70, P(B IA) = .9S,and P(B IA') = .1S. We are asked
to find P(A IB). The three-step procedure results in P(A I B) = .73.
88 PROBABILITY [CHAP. 4
4.55 In a particular community, 15 percent of all adults over 50 have hypertension. The health service in this
community correctly diagnoses 99 percent of all such persons with hypertension. The health service
incorrectly diagnoses 5 percent who do not have hypertension as having hypertension.
(a) Find the probability that the health service will diagnose an adult over 50 as having hypertension.
(6) Find the probability that an individual over 50 who is diagnosed as having hypertension actually has
hypertension.
Am. Let A be the event that an individual over 50 in this community has hypertension, and let B be the
event that the health service diagnoses an individual over 50 as having hypertension. Then, P(A) =
.15, P(A') = .85, P(B IA) = .99, and P(B IA') = .05.
(a) P(B) = .15 x -99+.85 x .05 = .19
(6) Using the three-step procedure, P(A IB) = .78
PERMUTATIONS AND COMBINATIONS
4.56 How many ways can three letters be selected from the English alphabet if:
(0)The order of selection of the three letters is considered important, i.e., abc is different from cba, for
example.
(6) The order of selection of the three letters is not important?
Ans. (a) 15,600 (6) 2,600
4.57 The following teams comprise the Atlantic division of the Eastern conference of the National Hockey
League: Florida, Philadelphia, N.Y. Rangers, New Jersey, Washington, Tampa Bay, and N.Y. Islanders.
(a) Assuming no teams are tied at the end of the season, how many different final standinps are possible
(h) Assuming no ties, how many different first-, second-, and third-place finishers are possible?
for the seven teams?
Am. (a) 5,040 (6) 210
4.58 A criminologist selects five prison inmates from 30 volunteers for more intensive study. How many such
groups of five are possible when selected from the 30?
Ans. 142,506
USING PERMUTATIONS AND COMBINATIONS TO SOLVE PROBABILITY PROBLEMS
4.59
4.60
Three individuals are to be randomly selected from the 10 members of a club to serve as president, vice
president, and treasurer. What is the probability that Lana is selected for president, Larry for vice
president, and Johnny for treasurer?
1 I
-
A m . 1
- A= .00138
pio 720
A sample of size 3 is selected from a box which contains two defective items and 18 nondefective items.
What is the probability that the sample contains one defective item?
C:xCi8 - 2x153 - 306
C:O 1,140 1,140
Am. -
--
--= .268
Chapter 5
Outcome Value of X Value of Y
HH 0 2
HT 1 0
TH 1 0
r TT 2 -2
Discrete Random Variables
RANDOM VARIABLE
A random variable associates a numerical value with each outcome of an experiment.A random
variable is defined mathematically as a real-valued function defined on a sample space, and is
represented as a letter such as X or Y.
EXAMPLE 5.1 For the experiment of flipping a coin twice, the random variable X is defined to be the number
of tails to appear when the experiment is performed. The random variable Y is defined to be the number of
heads minus the number of tails when the experiment is conducted. Table 5.1 shows the outcomes and the
numerical value each random variable assigns to the outcome. These are two of the many random variables
possible for this experiment.
EXAMPLE 5.2 An experimental study involving diabetics measured the following random variables: fasting
blood sugar, hemoglobin, blood pressure, and triglecerides. These random variables assign numerical values to
each of the individuals in the study. The numerical values range over different intervals for the different random
variables.
DISCRETE RANDOM VARIABLE
A random variable is a discrete random variable if it has either a finite number of values or
infinitely many values that can be arranged in a sequence. We say that a discrete random variable
may assume a countable number of values. Discrete random variables usually arise from an
experiment that involves counting. The random variables given in Example 5.1 are discrete, since
they have a finite number of different values. Both of the variables are associated with counting.
EXAMPLE53 An experiment consists of observing 100 individuals who get a flu shot and counting the
number X who have a reaction. The variable X may assume 101 different values from 0 to 100. Another
experiment consists of counting the number of individuals W who get a flu shot until an individual gets a flu
shot and has a reaction. The variable W may assume the values 1, 2, 3, . . .. The variable W can assume a
countably infinite number of values.
CONTINUOUS RANDOM VARIABLE
A random variable is a continuous random variable if it is capable of assuming all the values in
an interval or in several intervals. Because of the limited accuracy of measuring devices, no random
89
90 DISCRETE RANDOM VARIABLES [CHAP. 5
Pfx)
variables are truly continuous. However, we may treat random variables abstractly as being
continuous.
~ 2 3 4 5 6 7 8 9 1 0 1 1 1 2
I
.028 .056 .083 .111 .139 .167 .139 .111 .083 .056 .028
EXAMPLE 5.4 The following random variables are considered continuous random variables: survival time of
cancer patients, the time between release from prison and conviction for another crime, the daily milk yield of
Holstein cows, weight loss during a dietary routine, and the household incomes for single-parent households in a
sociological study.
PROBABILITY DISTRIBUTION
The probability distribution of a discrete random variable X is a list or table of the distinct
numerical values of X and the probabilities associated with those values. The probability distribution
is usually given in tabular form or in the form of an equation.
EXAMPLE55 Table 5.2 lists the outcomes and the values of X, the sum of the up-turned faces for the
experiment of rolling a pair of dice. Table 5.2 is used to build the probability distribution of the random variable
X. This table lists the 36 possible outcomes. Only one outcome gives a value of 2 for X. The probability that
X = 2 is 1 divided by 36 or .028 when rounded to three decimal places. We write this as P(2)= .028. The
probability that X = 3,P(3), is equal to 2 divided by 36 or .056. The probability distribution for X is given in
Table 5.3.
Value of X
2
3
4
5
6
7
3
4
5
6
7
8
5.2
Value of X
4
5
6
7
8
9
5
6
7
8
9
10
Table 5.3
Outcome Value of X
The probability distribution, P(x)= P(X = x) satisfiesformulas (5.
I) and (5.2).
P(x) 20 for each value x of X (5.0
(5.2)
Notice that the values for P(x) in Table 5.3 are all positive, which satisfies formula ( 5 4 , and
X P(x)= 1 where the sum is over all values of X
that the sum equals 1 except for rounding errors.
X
EXAMPLE5.6 P(x) = -, x = 1,2,3,4
is a probability distribution since P(I) = .I, P(2) = .2, P(3)= 3,and
P(4) = .4 and (5.I ) and (5.2)are both satisfied.
10
CHAP. 51 DISCRETE RANDOM VARIABLES 91
X
P(x)
xP(x)
EXAMPLE 5.7 It is known from census data that for a particular income group that 10% of households have
no children, 25% have one child, 50%have two children, 10% have three children, and S% have four children.
If X represents the number of children per household for this income group, then the probability distribution of
X is given in Table 5.4.
2 3 4 5 6 7 8 9 10 11 12
.028 .OS6 .083 . I 1I .139 .167 .I39 .I 1 1 .083 .OS6 .028
.056 .I68 .332 .555 .834 1.169 1.1 12 .999 .830 .616 .336
Table 5.4
1 x 1 0 1 2 3 4 1
I P(x) I .I0 .25 S O .I0 .05 I
The event X 2 2 is the event that a household in this income group has at least two children and means that
X = 2, or X = 3, or X = 4. The probability that X 1 2 is given by
P(X 22) = P(X = 2) +P(X = 3) + P(X = 4) = .so+ .I0+ .OS = .6S
The event X I I is the event that a household in this income group has at most one child and is equivalent
to X = 0, or X = 1. The probability that X I1 is given by
P(X I1) = P(X = 0) + P(X = I ) = .I0+ .25 = .3S
The event 1 I X I
3 is the event that a household has between one and three children inclusive and is
equivalent to X = I, or X = 2, or X = 3. The probability that I IX I 3 is given by
P( 1 I
X 5 3 ) = P(X = I ) + P(X = 2) + P(X = 3 ) = .25 + S O + .I0 = .85
The above discussion may be summarized by stating that 65% of the households have at least two children,
35% have at most one child, and 85% have between one and three children inclusive.
MEAN OF A DISCRETE RANDOM VARIABLE
The rneatt o
f a discrete random variuble is given by
p =X x P(x) where the sum is over all values of X (5.-3)
The mean of a discrete random variable is also called the expected value, and is represented by E(X).
The mean or expected value will also often be referred to as the populatiorz meurz. Regardless of
whether it is called the mean, the expected value, or the population mean, the numerical value is
given by formula (5.3).
EXAMPLE 5.8 The mean of the random variable in Example 5.5 is found as follows.
The sum of the row labeled xP(x) is the mean of X and is equal to 7.007. If fractions are used in place of
decimals for P(x),the value will equal 7 exactly. In other words E(x) = 7. The long-term average value for the
sum on the dice is 7. The mean value of the sum on the dice for the population of all possible rolls of the dice
equals 7. If you were to record all the rolls of the dice at Las Vegas, the average value would equal 7.
EXAMPLE 5.9 The mean number of children per household for the distribution given in Example 5.7 is found
as follows.
92
P(x)
xP(x)
DISCRETE RANDOM VARIABLES
.I0 .25 .50 .10 .05
0 .25 1.00 .30 .20
[CHAP. 5
X P(x) X P W
0 0.0625 0.0
I 0.2s 0.25
2 0.375 0.75
3 0.25 0.75
4 0.0625 0.25
Total 1.0 p = 2
1 x 1 0 1 2 3 4 1
X-p (x - N2 (x - j$P( x)
-2 4 0.25
-I 1 0.2s
0 0 0.0
I 1 0.25
2 4 0.25
C(X - p) = 0 Var(x) = I
X P(X)
0 0.0625
1 0.25
2 0.375
3 0.25
4 0.0625
‘Total 1.o
The sum of the row labeled xP(x), 1.75, is the mean of the distribution. For this popudion of households, the
mean number of children per household is I .75. Notice that the mean does not have to equal a value assumed by
the random variable.
XP(X) X2P(X)
0.0 0.0
0.25 0.25
0.75 1.5
0.75 2.25
0.2s 1.o
/
.
I
= 2.0 c XZP(X) = 5
STANDARD DEVIATION OF A DISCRETE RANDOM VARIABLE
The wriwic‘e ofa discrete rcrrzcloriz vuriable is represented by 0’ and is defined by
d = x(x - p)”(x) (5.4)
The variance is also represented by Var(X) and may be calculated by the alternative formula given by
The standurd deviution o
f a discrete random variable is represented by or sd(X) and is given
EXAMPLE 5.10 A social researcher is interested in studying the family dynamics caused by gender makeup.
The distribution of the number of girls in families consisting of four children is as follows: 6.2S% of such
families have no girls, 2570 have one girl, 37.5% have two girls, 25% have three girls, and 6.25% have four
girls. Table 5.5 illustrates the cornputation of the variance of X, the number of girls in a family having four
children. The standard deviation is the square root of the variance and equals one.
Table 5.6 illustrates the computation of the components needed when using the alternative formula (5.5)to
compute the variance of X. Using formula (S.S),
the variance is Var(x) = c x2P(x)- p2= 5 - 4 = 1, and the
standard deviation is also equal to one. We see that formulas (5.4) and (5.5) give the same results for the
standard deviation. The mean number of girls in families of four children equals 2 and the standard deviation is
equal to 1 .
Table 5.6
CHAP. 51 DISCRETE RANDOM VARIABLES 93
BINOMIAL RANDOM VARIABLE
A binomial random variable is a discrete random variable that is defined when the conditions of
a binomial experimertt are satisfied. The conditions of a binomial experiment are given in Table 5.7.
I Conditions of a Binomial Exoeriment
~~~ ~ ~
I. There are n identical trials.
2. Each trial has only two possible outcomes.
3. The probabilities of the two outcomes remain constant for each trial.
r
4. The trials are indeoendent.
The two outcomes possible on each trial are called success and failure. The probability
associated with success is represented by the letter p and the probability associated with failure is
represented by the letter q, and since one or the other of success or failure must occur on each trial, p
+q must equal one, i.e., p +q = 1. When the conditions of the binomial experiment are satisfied, the
binomial random variable X is defined to equal the number of successes to occur in the n trials. The
random variable X may assume any one of the whole numbers from zero to n.
EXAMPLE 5.11 A balanced coin is tossed 10 times, and the number of times a head occurs is represented by
X. The conditions of a binomial experiment are satisfied. There are n = 10 identical trials. Each trial has two
possible outcomes, head or tail. Since we are interested in the occurrence of a head on a trial, we equate the
occurrence of a head with success, and the occurrence of a tail with failure. We see that p = .5 and q = .5. Also,
it is clear that the trials are independent since the occurrence of a head on a given toss is independent of what
occurred on previous tosses. The number of heads to occur in the 10 tosses, X, can equal any whole number
between 0 and 10. X is a binomial random variable with n = 10and p = .5.
EXAMPLE 5.12 A balanced die is tossed five times, and the number of times that the face with six spots on it
faces up is counted. The conditions of a binomial experiment are satisfied. There are five identical trials. Each
trial has two possible outcomes since the face 6 turns up or a face other than 6 turns up. Since we are interested
in the face 6, we equate the face 6 with success and any other face with failure. We see that p = and q = 5 .
Also, the outcomes from toss to toss are independent of one another. The number of times the face 6 turns up, X,
can equal 0, 1, 2, 3,4, or 5. X is a binomial random variable with n = 5 and p = .167.
6 6
EXAMPLE 5.13 A manufacturer uses an injection mold process to produce disposable razors. One-half of one
percent of the razors are defective. That is, on average, 500 out of every 100,000razors are defective. A quality
control technician chooses a daily sample of 100randomly selected razors and records the number of defectives
found in the sample in order to monitor the process. The conditions of a binomial experiment are satisfied.
There are 100 identical trials. Each trial has two possible outcomes since the razor is either defective or non-
defective. Since we zre recording the number of defectives, we equate the occurrence of a defective with success
and the occurrence of a nondefective with failure. We see that p = .005and q = .995. The number of defectives
in the 100,X, can equal any whole number between 0 and 100. X is a binomial random variable with n = 100
and p = .005.
BINOMIAL PROBABILITY FORMULA
The binomial prubability formula is used to compute probabilities for binomial random
variables. The binomial probability formula is given in
n!
x!(n - x)!
pXq("-')for x = 0,I , . . . ,n (5.7)
94 DISCRETE RANDOM VARIABLES [CHAP. 5
The symbol is discussed in Chapter 4 and represents the number of combinations possible when
(ri
x items are selected from n.
EXAMPLE 5.14 A die is rolled three times and the random variable X is defined to be the number of times
the face 6 turns up in the three tosses. X is a binomial random variable that assumes one of the values 0, I , 2, or
3. In this binomial experiment, success occurs on a given trial if the face 6 turns up and failure occurs if any
other face turns up. The probability of success is p = = .833. In
order to help understand why the binomial probability formula works, the binomial probabilities will he
computed using the basic principles of probability first and then by using formula (5.7).
To find P(0). note that X = 0 means that no successes occurred, that is, three failures occurred. Because of
the independence of trials, the probability of three failures is .833 x 333 x .833 = (.833)' = 578. That is, the
probability that the face 6 does not turn up on any of the three tosses is S78.
To find P( 1). note that X = 1 means that one success and two failures occurred. One success and two
failures occur if the sequence SFF, or the sequence FSF, or the sequence FFS occurs. The probability of SFF is
.I67 x .833 x .833 = .I 159. The probability of FSF is .833 x .I67 x ,833 = , I 159. The probability of FFS is
.833 x .833 x .I67 = . I 159. The three probabilities for SFF, FSF, and FFS are added because of the addition
rule for mutually exclusive events. Therefore, P(X = 1) = P( I ) = 3 x . I 159 = .348. The probability that the face
6 turns up on one of the three tosses is .348.
To find P(2), note that X = 2 means that two successes and one failure occurred. Two successes and one
failure occur if the sequence FSS or the sequence SFS or the sequence SSF occurs. The probability of FSS is
.833 x .I67 x .I67 = .0232. The probability of SFS is .I67 x .833 x .167 = .0232. The probability of SSF is
.I67 x .I67 x ,833 = .0232. The probabilities for FSS, SFS, and SSF are added because of the addition rule for
mutually exclusive events. Therefore, P(X = 2) = P(2) = 3 x .0232 = .070.The probability that the face 6 turns
up on two of the three tosses is .070.
To find P(3), note that X = 3 means that three successes occurred. The probability of three consecutive
successes is .I67 x .I67 x .I67 = (. 167)-'= .005.There are five chances in a thousand of the face 6 turning up on
each of the three tosses.
= .I67 and the probability of failure is q =
6
Using the binomial probability formula, we find the four probabilities as follows:
3!
O!3!
P(0) = -
(. 167)' (.833)-' = (.833)' = .578
3!
1!2!
P( I ) = -(. 167)' (.833)2= 3 x .I67 x (.833)*= .348
3!
2!I!
3!
3!0!
P(2) = -(.167)2 (.833)'= .070
P(3) = -(. 167)' (333)' = .00S
This example illustrates how much work the binomial probability formula saves us when solving problems
involving the binomial distribution. The distribution for the variable in Example 5.14 is given in Table 5.8.
Table 5.8
X I 0 1 2 3 1
I P(x) I .578 .348 .070 .005 I
In formula (5.7),
The term pXq("
- " gives the probability of x successes and (n - x) failures. The
counts the number of different arrangements which are possible for x successes and
n!
x!( n - x)!
term
(n - x) failures. The role of these terms is illustrated in Example 5.14.
CHAP. 51 DISCRETE RANDOM VARIABLES 95
~~~~~~~
4 0 . . . .I296 ,0625 .0256 . . .
1 . . . .3456 .2500 .I536 . . ,
2 . . . .3456 .3750 .3456 . . .
3 . .. .1536 ,2500 .3456 . . .
4 . . . .0256 .0625 .I 296 . . .
EXAMPLE515 Fifty-seven percent of companies in the U.S. ‘use networking to recruit workers. The
probability that in a survey of ten companies exactly half of them use networking to recruit workers is
I O!
5!5!
P(5) = -(37)‘ (.43y = .223
TABLES OF THE BINOMIAL DISTRIBUTION
Appendix 1 contains the table o
f birzoinialprobabilities. This table lists the probabilities of x for
n = 1 to n = 25 for selected values of p.
EXAMPLE 5.16 If X represents the number of girls in families having four children, then X is a binomial
random variable with n = 4 and p = .5. Using formula (5.7),the distribution of X is determined as follows:
4! 4! 4!
0!4! 1!3! 2!2!
P(0) = -(S)’
(.S)4 = (.S)4 = .0625 P(1) = -(5)’($ = .2s P(2)= -(5)*
(.S)* = .37s
4!
3!I!
P(3) = -(.5y(3’
= .2s
4!
4!0!
P(4) = -(.5)4(S)’ = .0625
Table 5.9 contains a portion of the table of binomial probabilities found in Appendix I . The numbers in bold
print indicates the portion of the table from which the binomial probability distribution for X is obtained. The
probabilities given are the same ones obtained by using formula (5.7).
Table 5.9
t d
I P
n X * . . .40 S O .60 . . .
EXAMPLE 5.17 Eighty percent of the residents in a large city feel that the government should allow more
than one company to provide local telephone service. Using the table of binomial probabilities, the probability
that at least five in a sample of ten residents feel that the government should allow more than one company to
provide local telephone service is found as follows. The event “at least five” means five or more and is
equivalent to X = 5 or X = 6 or X = 7 or X = 8 or X = 9 or X = 10.The probabilities are added because of the
addition law for mutually exclusive events.
P(X 2 5 )= P(5) + P(6) + P(7) + P(8) + P(9) + P(10)
P(X 25 )= .0264 + .0881 + .2013 + .3020+ .2684 + ,1074 = .9936
Statistical software is used to perform binomial probability computations and to some extent has
rendered binomial probability tables obsolete. Minitab contains routines for computing binomial
probabilities. Example 5.18 illustrates how to use Minitab to compute the probability for a single
value or the total distribution for a binomial random variable.
EXAMPLE 5.18 The following Minitab output shows the binomial probability computations given in
Examples 5.15 and 5.16. The binomial probabilities are shown in bold type.
96 DISCRETE RANDOM VARIABLES [CHAP. 5
MTB ># Minitab computation for Example 5.I5 #
MTB >pdf 5;
SUBC > binomial n = 10p = .57.
Probabi1ity Density Function
Binomial with n = 10and p = 0.570000
x P(X= x)
5.00 0.2229
MTB ># Minitab computation for Example 5.16#
MTB > pdf;
SUBC > binomial n = 4 and p = .5.
Probability Density Function
Binomial with n = 4 and p = 0.500000
x P(X= x)
0 0.0625
I 0.2500
2 0.3750
3 0.2500
4 0.0625
MEAN AND STANDARDDEVIATION OF A BINOMIAL RANDOM VARIABLE
The mean and variance for a binomial random variable may be found by using formulas (5.3)and
(5.4).However, in the case of a binomial random variable, shortcut formulas exist for computing the
mean and standard deviation of a binomial random variable. The mean of a binomial random variable
is given by
P= nP (5.8)
The variance of a binomial random variable is given by
d=npq (5.9)
EXAMPLE 5.19 The mean for the binomial distribution given in Example 5.18 using formula (5.3)is
p = C xP(x) = 0 x .0625+ 1 x .25 +2 x ,375 +3 x .25 +4 x .0625 = 2
The mean using the shortcut formula (5.8)is
p = np = 4 x .5 = 2
The variance for the binomial distribution given in Example 5.18 using formula (5.4)is
o2= C x2 P(x) - p2= 0 x .0625+ 1 x .25 +4 x .375 +9 x .25 + 16x .0625- 4 = 1
The variance using the shortcut formula (5.9)is
EXAMPLE 5.20 Chemotherapy provides a 5-year survival rate of 80% for a particular type of cancer. In a
group of 20 cancer patients receiving chemotherapy for this type of cancer, the mean number surviving after 5
CHAP. 51 DISCRETE RANDOM VARIABLES 97
years is 1= 20 x .8 = 16 and the standard deviation is (T = 420 x .8 x .2 = 1.8. On the average, 16 patients
will survive, and typically the number will vary by no more than two from this figure.
POISSON RANDOM VARIABLE
The binomial random variable is applicable when counting the number of occurrences of an
event called success in a finite number of trials. When the number of trials is large or potentially
infinite, another random variable called the Poisson random variable may be appropriate. The
Poisson probability distribution is applied to experiments with rarzdorn and iridependerit occurrences
of an event. The occurrences are considered with respect to a time interval, a length interval, a fixed
area or a particular volume.
EXAMPLE 5.21 The number of calls to arrive per hour at the reservation desk for Regional Airlines is a
Poisson random variable. The calls arrive randomly and independently of one another. The random variable X is
defined to be the number of calls arriving during a time interval equal to one hour and may be any number from
0 to some very large value.
EXAMPLE 5.22 The number of defects in a 10-foot coil of wire is a Poisson random variable. The defects
occur randomly and independently of one another. The random variable X is defined to be the number of defects
in a 10-foot coil of wire and may be any number between 0 and some very large value. The interval in this
example is a length interval of 10feet.
EXAMPLE 5.23 The number of pinholes in 1-yd2pieces of plastic is a Poisson random variable. The pinholes
occur randomly and independently of one another. The random variable X is defined to be the number of pin-
holes per square yard piece and can assume any number between 0 and a very large value. The interval in this
example is an area of 1 yd2.
POISSON PROBABILITY FORMULA
The probability of x occurrences of some event in an interval where the Poisson assumptions of
randomness and independence are satisfied is given by formula (5.10),where h is the mean number
of occurrences of the event in the interval and the value of e is approximately 2.71828.The value of e
is found on most calculators and powers of e are easily evaluated. Tables of Poisson probabilities are
found in many statistical texts. However, with the wide spread availability of calculators and
statistical software, they are being less widely utilized.
e-'
P(x)= - for x
X !
The mean of a Poisson random variable is given by
p = h
The variance of a Poisson random variable is given by
o2= h
EXAMPLE 5.24 The number of small pinholes in sheets
=o, 1,2,. . . (5.10)
(5.11)
(5.12)
of plastic are of concern to a manufacturer. If the
number of pinholes is too large, the plastic is unusable. The mean number per square yard is equal to 2.5. The 1-
yd2 sheets are unusable if the number of pinholes exceeds 6. The probability of interest is P(X > 6), where X
98 DISCRETE RANDOM VARIABLES [CHAP. 5
represents the number of pinholes in a I-yd2 sheet. This probability is found by finding the probability of the
complementary event and subtracting it from 1.
P(X > 6) = 1 - P(X 5 6)
The command cdf of Minitab can be used to find P(X 5 6). The following Minitab output illustrates how
this is accomplished.
MTB > cdf 6;
SUBC > poisson mean = 2.5.
Cumulative Distribution Function
Poisson with mu = 2.5oooO
x P(X<=x)
6.00 0.9858
P(X > 6) = 1 - P(X 5 6) = 1 - .9858 = .0142
By multiplying .0142 by 100we see that 1.42% of the plastic sheets are unusable.
EXAMPLE 5.25 The Poisson distribution approximates the binomial distribution closely when n 2 20 and p 5
.05. A machine produces items of which 1% are defective. In a sample of 150items selected from the output of
this machine, the probability of two or fewer defectives in the sample is P(X I 2) and is found by the command
cdf 2; when using Minitab. The following Minitab output gives the binomial probability that X 5 2 .
MTB > cdf 2;
SUBC > binomial n = I SO p = .O I .
Cumulative Distribution Function
Binomial with n = I50 and p = 0.0100000
X P(X <= x)
2.oo 0.8095
The mean number of defectives in a sample of 150 is np = 150 x .01 = 1.5. The Poisson probability that X 5 2
where h = 1.S is given by the following Minitab output.
MTB > cdf 2;
SUBC > poisson mean = 1.5.
Cumulative Distribution Function
Poisson with mu = 1SO000
x P(X <= x)
2.00 0.8088
The binomial probability of the event X 5 2 is ,8095 and the Poisson approximation is ,8088. Notice that the
Poisson approximation is very close to the binomial probability.
HYPERGEOMETRIC RANDOM VARIABLE
The hypergeometric rarzdom variable is used in situations where success or failure is possible on
each trial but where there is not independence from trial to trial. The lack of independence from trial
to trial distinguishes the hypergeometric distribution from the binomial distribution. The hyper-
geometric random variable applies in situations where there are N items, of which k are classified as
CHAP. 51 DISCRETE RANDOM VARIABLES 99
successes and N -k are classified as failures. A sample of size n i k is selected from the N items and
X is defined to equal the number of successes in the n items selected. X is a hypergeometric random
variable which can equal any whole number from 0 to n.
EXAMPLE 5.26 A sociologist randomly selects 5 individuals from a group consisting of 10 male single
parents and IS female single parents. The random variable X is defined to equal the number of male single
parents in the 5 selected individuals. In this example, N = 25, k = 10, N - k = 15,and n = 5. The number of male
single parents in the 5 selected is a hypergeometric random variable. If this hypergeometric random variable is
represented by X, then X may assume any one of the values 0, I , 2, 3,4,or 5.
EXAMPLE 5.27 A box contains 5 defective and 25 acceptable computer monitors. Three of the monitors are
randomly selected and X is defined to be the number of defective monitors in the three. X is a hypergeometric
random variable with N = 30, k = 5, N - k = 25, and n = 3. X may assume any one of the values 0, 1, 2, or 3.
HYPERGEOMETRICPROBABILITY FORMULA
When n items are selected from N items of which k are successes and N - k are failures, the
random variable X, defined to equal the number of successes in the n selected items, is a
hypergeometric random variable. The probability distribution of X is given by the liypergemietric
probabilityformula shown in formula (5.13).
k! (N - k)!
x!(k - x)! (n - x)!(N - k - n +x)!
- f n r y - f l , I , . . . , n (5.13)
(N
P(x) =
(4 n!(N - n)!
EXAMPLE 5.28 A police department consists of 25 officers of whom 5 are minorities. Three officers are
randomly selected to meet with the mayor. Let X be the number of minorities in the three selected to meet with
the mayor. X is a hypergeometric random variable with N = 25, k = 5, N - k = 20, and n = 3. The probability
distribution of X is derived as follows:
The probability distribution of X is given in Table 5.10.It is highly likely that at most one of the three will be a
minority. The probability that X I1 is .909.
Table 5.10
1 x 1 0 1 2 3 1
I P(x) I .496 .413 .087 .004 1
EXAMPLE 5.29 The binomial distribution approximates the hypergeometric distribution whenever n I.05N.
A box contains 200 computer chips, of which 7 are defective. The probability of finding one defective in a
sample of S randomly selected chips is given by the following hypergeometric probability computation.
100
I
Outcome
BBB
BBG
BGB
BGG
GBB
GBG
GGB
GGG
DISCRETE RANDOM VARIABLES [CHAP. 5
= .155
I,I ,Jx 4 J - 7 x 56,031,760
(200) 2,535,650,040
P(I) = -
Since n 5 -05 x 200 = 10, the probability may be approximated by using the binomial distribution. The five
selections of the computer chips may be viewed as n = 5 trials. The probability of success, selecting a defective
chip, is p = 7= .035,and q = .965.The binomial probability of one defective in the five chips is given as
foIlows:
200
J.
p(I)=- (.035)' (.965)4=.152
1!4!
The approximation is very good when n 5 .05N, as shown in this example.
Solved Problems
RANDOM VARIABLE
5.1 Let X represent the number of boys in families having three children. List all possible birth
order permutations for families having three children and give the value of X for each
outcome.
Ans. Table 5.1 1 gives the outcomes and the value X assigns to each outcome.
5.11
Value of X
3
2
2
1
2
1
I
0
5.2 For the experiment of rolling three dice, X is defined to be the sum of the three dice. What are
the unique values assumed by X?
Ans.
outcome (6,6,6).The unique values are 3,4,5,6,7, 8,9, 10, 1 I, 12, 13, 14, 15, 16, 17, and 18.
The values for X range from 3, corresponding to the outcome ( I , I , I ) to 18, corresponding to the
DISCRETE RANDOM VARIABLE
5
.
3 A telemarketing company administers an aptitude test consisting of 25 problems to potential
employees. A variable of interest to the company is X, the number of problems worked
correctly. How many different values are possible for X?
CHAP. 51 DISCRETE RANDOM VARIABLES
X
P(x)
101
3 6 9 12 15 18 21
.09 .13 .16 .21 .26 .13 .02
Am. 26, since an individual can get from 0 to 25 correct.
5.4 A random variable assigns a single number to each outcome of an experiment. Is it true that
each value of a random variable corresponds to a single outcome?
Ans. No. Consider the outcomes and values of X given in Table 5.11. The value X = I corresponds to
the three outcomes BGG, GBG, and GGB for example.
CONTINUOUS RANDOM VARIABLE
5.5 Identify the continuous random variables in parts (a) through (e).
(a) The time that an individual is logged onto the internet during a given week
(b) The number of domestic violence calls responded to per day by the Chicago police
(c) The mortality rate for women who used estrogen therapy for at least a year starting in 1969
(d)The daily room rate for luxury/upscale hotels in the U.S.
(e) The number of executions per state since 1975
department
Ans. Parts (a) and (d)are continuous random variables. Time and money are almost always considered
continuous even though in practice they are probably discrete.
5.6 Is it possible to give a probability value to each individual value of a continuous random
variable?
Ans. No. It is not possible to give an individual probability to each value of a continuous variable since
the variable may assume an uncountably infinite number of different values. Instead, probabilities
are assigned to an interval of values for a continuous random variable.
PROBABILITY DISTRIBUTION
5.7 According to the registrar’s office at the University of Nebraska at Omaha (UNO), during the
current semester, 9% of the students are registered for 3 credit hours, 13% are registered for 6
credit hours, 16% are registered for 9 credit hours, 21% are registered for 12 credit hours, 26%
are registered for 15credit hours, 13% are registered for 18credit hours, and 2% are registered
for 21 credit hours. If the random variable X represents the number of credit hours per student
at UNO, give the probability distribution for X.
Ans. The probability distribution for X is given in Table 5.12.
Table 5.12
5.8 An experiment consists of rolling a die and flipping a coin. The coin has the number 1 stamped
on one side and the number 2 stamped on the other side. The random variable Y is defined to
equal the sum of the number showing on the coin plus the number showing on the die after the
experiment is conducted. Give the probability distribution for Y.
102
Outcome
(coin= l,die= 1)
(coin = I , die = 2)
(coin = I , die = 3)
(coin = I , die = 4)
(coin = 1, die = 5 )
(coin = 1, die = 6)
(coin = 2, die = 1)
(coin = 2, die = 2)
(coin = 2, die = 3)
(coin = 2, die = 5 )
(coin = 2, die = 6)
(coin = 2, die = 4)
DISCRETE RANDOM VARIABLES [CHAP. 5
Value of Y
2
3
4
5
6
7
3
4
5
6
7
8
Am. Table 5.13 gives the outcomes for the experiment as well as the values the random variable Y
assigns to the outcomes. From Table 5.13, the probability distribution given in Table 5.14 is
determined.
Y = 2 corresponds to one of twelve equally likely outcomes. Therefore P(2) equals -L = .083.Y =
3 corresponds to two of twelve equally likely outcomes. Therefore P(3) equals A = .167. The
other probabilities are found in a similar fashion.
12
Table 5.14
l v 1 2 3 4 5 6 7 S I
I P(y) I .083 .I67 .I67 .I67 .167 .167 .083 I
MEAN OF A DISCRETE RANDOM VARIABLE
5.9
5.10
A roulette wheel has 18 red, 18 black, and 2 green slots. You bet $10 on red. If red comes up,
you get your $10 back plus $I0 more. If red does not come up, you lose your $10.Let random
variable P represent your profit when playing this roulette wheel. Find the mean value of P.
Ans. The distribution of P is given in table 5.15. The probability of a $10 loss, i.e., P = -10, is 2 =
326, and the probability of a $10gain is
38
= .474
Table 5.15
The mean profit is p = C xP(x) = -10 x .526 + 10x .474 = -.52. Your average loss is 52 cents per
play of the roulette wheel.
The distribution of the number of children per household for households receiving Aid to
Dependent Children (ADC) in a large eastern city is as follows: Five percent of the ADC
households have one child, 35% have 2 children, 30% have 3 children, 20% have 4 children,
and 10%have 5 children. Find the mean number of children per ADC household in this city.
Ans. The mean is p = CxP(x) = 1 x .05 + 2 x .35 + 3 x .30 + 4 x .20 + 5 x .I0 = 2.95. The mean is
about 3 per household.
CHAP. 51 DISCRETE RANDOM VARIABLES 103
STANDARD DEVIATION OF A DISCRETE RANDOM VARIABLE
5.11 Find the standard deviation of the profit when playing the color red on the roulette wheel
described in problem 5.9.
Arts. The variance is given by E x2P(x)- w2 = 100 x .526 + 100 x ,474 - .2704 = 99.7296, and the
standard deviation is JK
= $9.97.
5.12 Find the standard deviation of the number of children per ADC household for the distribution
given in problem 5.10.
Ans. The variance is given by C x2P(x) - p2= 1 x .05 +4 x .35 +9 x .30 + 16 x .20 +25 x .I0- 2.952=
1,1475and the standard deviation is 41.1475 = 1.07.
BINOMIAL RANDOM VARIABLE
5.13 Ninety percent of the residents of Stanford, California, twenty-five years of age or older have
at least a bachelor’s degree. Three hundred residents of Stanford twenty-five years or older are
selected for a poll concerning higher education. If X represents the number in the 300 who
have at least a bachelor’s degree, give the conditions necessary for X to be a binomial random
variable and identify n, p, and q.
Ans. The main condition we need to be concerned about is that the 300 residents be selected
independently. There are 300 identical trials and each trial has only two possible outcomes. We
may identify success with having at least a bachelor’s degree and failure with having less than a
bachelor’s degree. If the respondents are chosen randomly and independently, then the
probabilities of success and failure should remain constant. Pollsters use standard techniques to
ensure independence of individuals selected. The values of n, p, and q are 300, .90, and .10
respectively.
5.14 A box contains 20 items, of which 25% are defective. Three items are randomly selected and
X, the number of defectives in the three selected items, is determined. Explain why X is not a
binomial random variable with n = 3 and p = .25.
Ans. The probabilities p and q do not remain constant from trial to trial. Suppose we are interested in
the probability that X = 3; that is, we are interested in the probability of getting three consecutive
defectives. The probability that the first one is defective is .25. The probability that the second is
defective after selecting a defective on the first selection is = .21, and the probability that the
third is defective after selecting defectives on the first two selections is A = .17. The binomial
model assumes that the probability p remains constant at p = .25. In this experiment, p does not
remain constant, but changes from .25 to .21 to .17.
BINOMIAL PROBABILITY FORMULA
5.15 Approximately 12% of the U.S.population is composed of African-Americans. Assuming that
the same percentage is true for telephone ownership, what is the probability that when 25
phone numbers are selected at random for a small survey, that 5 of the numbers belong to an
African-American family?
104 DISCRETE RANDOM VARIABLES [CHAP. 5
Ans. Let X represent the number of phone numbers in the 25 belonging to African-Americans. Then, X
has a binomial distribution with n = 25 and p = .12. The probability P(X = 5 ) is given as follows:
25!
5!20!
P(X = 3) = -(.12)’ (.88)*O = .I025
5.16 It is estimated that 42% of women ages 45 to 54 are overweight. If 20 females between 45 and
54 are randomly selected, what is the probability that one-half of them are overweight?
Ans. Let X represent the number of women in the 20 who are overweight. Then, X has a binomial
distribution with n = 20 and p = .42.The probability P(X = 10)is given as follows:
20!
10!10!
P(X = 10)= -(.42)1°(.58)*O = .1359
TABLES OF THE BINOMIAL DISTRIBUTION
5.17 Sixty percent of teenagers who drink alcohol do so because of peer pressure. Use the table of
binomial probabilities to find the probability that in a sample of 15 teenagers who drink, 5 or
fewer do so because of peer pressure.
Ans. Table 5.16 shows the portion of the table needed to compute P(X 5 5).
P(X 55) = .OOOO + .0o00 + .0003 + .0016+ .0074 + .0212 = .0305
5.18 A domestic homicide is one in which the victim and the killer are relatives or involved in a
relationship. Suppose 40% of all murders are domestic homicides. A criminal justice study
randomly selects 10 murder cases for investigation. Use the table of binomial probabilities to
find the probability that between one and four inclusive of the murder cases will be domestic
homicide cases.
Ans. Table 5.17 shows the portion of the table needed to compute P(1 5 X I 4 ) .
P
(
l IX54)=.0403+.1209+.2150+.2508=.6270
Table 5.17
= .40
.0403
.2150
4 .2508
CHAP. 51 DISCRETE RANDOM VARIABLES 105
MEAN AND STANDARD DEVIATION OF A BINOMIAL RANDOM VARIABLE
5.19 Seventy-five percent of employed women say their income is essential to support their family.
Let X be the number in a sample of 200 employed women who will say their income is
essential to support their family. What is the mean and standard deviation of X?
Ans. X is a binomial random variable with n = 200 and p = .75. The mean is p = np = 200 x .75 = 150,
and the standard deviation is CY = 6= = 6.12.
5.20 A binomial distribution has a mean equal to 8 and a standard deviation equal to 2. Find the
values for n and p.
Ans. The following equations must hold: 8 = np and 4 = npq. Substituting 8 for np in the second
equation gives 4 = 8q, which gives q = .5. Since p +q = 1, p = 1 - .5 = .5. Substituting .5 for
the first equation gives n(S) = 8, and it follows that n = 16.
POISSON RANDOM VARIABLE
5.21 Consider the number of customers arriving at the Grover street branch of Industrial
Federal Savings and Loan during a one-hour interval. What assumptions concerning
arrivals of customers are necessary in order that X, the number of customers arriving in a
hour interval, be a Poisson random variable?
p in
and
the
one
Ans. The assumptions necessary are that the arrivals be random and independent of one another.
5.22 Why are the arrival of patients at a physician’s office and the arrival of commercial airplanes
at an airport not Poisson random variables?
Ans. Because of appointment times to see a physician, the arrivals are not random. Because of
scheduled arrivals of commercial airplanes, the arrivals are not random.
POISSON PROBABILITY FORMULA
5.23 The mean number of patients arriving at the emergency room of University Hospital on
Saturday nights between 1O:OO and 12:OO is 6.5. Assuming that the patients arrive randomly
and independently, what is the probability that on a given Saturday night, 5 or fewer patients
arrive at the emergency room between 1O:OO and 12:00?
Ans. Let X represent the number of patients to arrive at the emergency room of University Hospital
6.5xe-6.S
between 1O:OO and 12:OO. The probability formula for X is P(x) = -
.The probability of the
X!
event that X 5 5 is P(0) + P(I) + P(2) + P(3) + P(4) +P(5). Each of these probabilities contain the
common term e4.5, which may be factored out to give the following as the probability of the event
of interest.
p(x 5 5 ),,“5(65O ; 65’ + 652 ; 65’ ; 654 ; 65’)
O! I! 2! 3! 4! 5 !
P(X I
5 ) = .OO15(1 +6.5 +21.125 +45.7708 +74.7760 +96.6809) = .369
106 DISCRETE RANDOM VARIABLES [CHAP. 5
The solution using Minitab is as follows.
MTB > cdf 5 ;
SUBC > Poisson 6.5.
Cumulative Distribution Function
Poisson with mu = 6.50000
x P(X<= x)
5.00 0,3690
5.24 When working problems involving the Poisson random variable, it is important to remember
that the interval for the mean number of occurrences and the interval for X must be equal. If
they are not, the mean should be redefined to make them equal. This problem illustrates this
important point.
The mean number of patients arriving at the emergency room of University Hospital on
Saturday nights between 1O:OO and 12:OO is 6.5. Assuming that the patients arrive randomly
and independently, what is the probability that on a given Saturday night, 2 or fewer patients
arrive at the emergency room between 11:00and 12:00?
Arts. Let X be the number of patients to arrive at the emergency room of University Hospital on
6.5
2
Saturday nights between 1 1 9 0 and 12:OO. The mean for X is -= 3.25, and the event of interest
is X 5 2 .
e-3.25 3.250 e-3.2S33.251 e-3.2S33.252
P(X 5 2 ) = + + = ,369
O! I! 2!
HYPERGEOMETRIC RANDOM VARIABLE
5.25 A box contains 10 red marbles and 90 blue marbles. Five marbles are selected randomly from
the 100 in the box. Let X be the number of blue marbles in the five selected marbles. Identify
the values for N, k,N - k,and n in the hypergeometric distribution which corresponds to X.
Am. The total number of marbles is N = 100.The number of blue marbles (successes) is k = 90. The
number of red marbles (failures) is N - k = 10.The sample size is n = 5.
5.26 In problem 5.25, consider the event X = 2. This is the event that five marbles are selected and 2
are blue and 3 are red.
(a) How many ways may 5 marbles be selected from 100marbles?
(b) How many ways may 2 blue marbles be selected from 90 blue marbles?
(c) How many ways may 3 red marbles be selected from 10red marbles?
(d) How many ways may 3 red marbles be selected from 10red marbles and 2 blue marbles be
selected from 90 blue marbles?
(d) 120x 4,005= 480,600
CHAP. 51 DISCRETE RANDOM VARIABLES 107
HYPERGEOMETRICPROBABILITYFORMULA
5.27 Computec, Inc. manufactures personal computers. There are 40 employees at the Omaha plant
and 15 employees at the Lincoln plant. Five employees of Computec, Inc. are randomly
selected to fill out a benefits questionnaire.
(a) What is the probability that none of the five selected are from the Lincoln plant?
(b) What is the probability that all five of the selected employees are from the Lincoln plant?
Am. (a) [105)[450) -
- I x 658,008 = ,189 (6) [
:
I [
:
I -
- 3,003 x 1 = .000863
3,478,76 1 3,478,76 1
5.28 What is the probability of getting 5 face cards when 5 cards are selected from a deck of 52?
Am. A deck of cards consists of 12 face cards and 40 nonface cards. Let X represent the number of
face cards in the 5 cards selected. The event in which we are interested is X = 5. The probability is
P(X = 5 ) = [:'jo)-
- 7 9 2 ~
1 = .000305
2,598,960
(i:)
Supplementary Problems
RANDOM VARIABLE
5.29 A taste test is conducted involving 35 individuals. Random variable X is the number in the 35 who prefer
a locally produced nonalcoholic beer to a national brand. What are the possible values for X ?
Ans. The whole numbers 0 through 35
5.30 A psychological experiment was conducted in which the time to traverse a maze was recorded for each of
five dogs. The times were 4, 6, 8, 9, and 12 minutes. Two of the times were randomly selected and the
difference X = largest of the pair - smallest of the pair was recorded. Give all possible pairs of possible
selections, and the value of X for each outcome.
Am. See Table 5.18.
Table 5.18
Value of X
108
X -4 -1 0 3 8
P(x) . I .2 .4 .3 .2 L
DISCRETE RANDOM VARIABLES [CHAP. 5
DISCRETE RANDOM VARIABLE
5.31 A die is tossed until the face 6 turns up. Let X be the number of tosses needed until the face 6 first turns
up. Give the possible values for the variable X.
Ans. The possible values are the positive integers. That is, the possible values are I , 2, 3, . . . .
5.32 Identify the discrete random variables in parts (a)through (e).
(a) The number of arrests during a 10-day period during which the police apply a Zero-tolerance
strategy
(6) The time workers have spent with their current employer
( c ) The number of nurse practitioners per state
(d) The career lifetime of major league baseball players
(e) The number of executions of death row inmates per year in the U.S.
Am. (n), (c),and (e)
CONTINUOUSRANDOM VARIABLE
5.33 Identify the continuous random variables in the following list.
(a) Weight of individuals in kg
(b) Serum cholesterol level in mg/dl
(c) Length of intravenous therapy in hours
(6)Body mass index in kg/m2
(e) Cardiac output in liters/minute
Ans. All five are continuous.
5.34 What is the primary difference between a discrete random variable and a continuous random variable?
Am. There are values between the possible values of a discrete random variable which are not possible
values for the random variable. This is not generally true for a continuous random variable.
PROBABILITY DISTRIBUTION
5.35 Suppose Table 5.19 gives the number in thousands of students in grades 9 through 12 for public schools
in the United States. Let X represent the grade level. Give the probability distribution for X.
Table 5.19
I 12
Grade I 9 10 11 I
I Frequency I 3,525 3,475 3,050 2,950 I
Am. The distribution is given in Table 5.20.
Table 5.20
1 x 1 9 10 1 1 12 I
I P(x) I .271 267 .245 .227 I
CHAP. 51
X
P(x)
DISCRETE RANDOM VARIABLES
-20 -10 0 10 15 20 50
. I .2 .3 .2 .1 .2 -. 1
109
pounds
percent
0 5 15 25 50
33 21 11 25 10
2
(d) P(x)= .2x ,x = 1,2
h i s . Parts (6) and (d)are probability distributions. Part (a) is not because C P(x) = 1.2. Part (c)is not
P(50) < 0.
MEAN OF A DISCRETE RANDOM VARIABLE
5.37 The number of personal computers per household in the United States has the probability distribution
shown in Table 5.21. Find the mean number of personal computers per household.
Table 5.21
X I 0 1 2 1
Ans. p = 0.85
5.38 Based on a large survey, the distribution in Table 5.22 was found for the number of pounds individuals
desired to lose. Find the mean number of pounds they desired to lose.
Table 5.22
Ans. p = 13.95
STANDARD DEVIATION OF A DISCRETE RANDOM VARIABLE
5.39 Find the standard deviation of the number of personal computers per household for the distribution given
in Table 5.21,
Alzs. CJ = 0.65
5.40 Find the standard deviation of the number of pounds desired to be lost.
Ans. CJ = 15.55
BINOMIAL RANDOM VARIABLE
5.41 Thirty percent of the trees in a national forest are infested with a parasite. Fifty trees are randomly
selected from this forest and X is defined to equal the number of trees in the 50 sampled that are infested
with the parasite. The infestation is uniformly spread throughout the forest. Identify the values for n, p,
and q.
AIIS. n = SO, p = 30, and q = .70
110 DISCRETE RANDOM VARIABLES [CHAP. 5
5.42 Suppose in problem 5.41 that we define Y to be the number of trees in the 50 sampled that are not
infested with the parasite. Then Y is a binomial random variable.
(a) What are the values of n, p, and q for Y ?
(6) The event X = 20 is equivalent to the event that Y = a. Find the value for a.
Ans. (a) n = 50, p = .70,and q = .30
(h) a = 30
BINOMIAL PROBABILITY FORMULA
5.43 The Dallas-Fort Worth Airport claims that 85%of their flights are on time. If the claim is correct, what is
the probability that in a sample of 20 flights at the Dallas-Fort Worth Airport that 15 or more of the
sample flights are on time'?
Ans. .9321
5.44 A psychological study involving the troops in the Bosnia peacekeeping force was conducted. If 12
percent of the 21,496troops are females, what is the probability that in a sample of 50 randomly selected
individuals that five or fewer are female?
Ans. .435
3
TABLES OF THE BINOMIAL DISTRIBUTION
5.45 There are approximately 3,000 inmates on death row. Forty percent of the death row inmates are African-
American. Twenty of the death row inmates are randomly selected for a sociological study. Use the table
of binomial probabilities to find the probability that most of the selected inmates are African-American.
Am. .1275
5.46 Ten percent of the Rentwheels car-rental fleet are equipped with cellular phones. If five of the cars are
randomly selected, what is the probability that none are equipped with a cellular phone'?
Ans. S905
MEAN AND STANDARD DEVIATION OF A BINOMIAL RANDOM VARIABLE
5.47
5.48
It is conjectured that 60% of the deaths from melanoma can be prevented by a skin self-exam. If this
conjecture is correct, how many of the 7,000 deaths due to this skin cancer would be prevented per year
on the average'?What is the standard dcviation associatcd with the number of deaths prevented?
Ans. 4.200 41
Fifteen percent of the machinery and equipment at businesses is more than I0 years old. In a randomly
selected sample of 35 businesses, how many would you expect to have machinery or equipment that is
more than 10years old? What standard deviation is associated with this expected number?
Ans. 5.25 2.11
CHAP. 51 DISCRETE RANDOM VARIABLES 111
POISSON RANDOM VARIABLE
5.49 Give four examples of a Poisson random variable.
Am. 1. The number of telephone calls received per hour by an office
2. The number of keyboard errors per page made by an individual using a word processor
3. The number of bacteria in a given culture
4. The number of imperfections per yard in a roll of fabric
5.50 What are the potential values of a Poisson random variable?
Ails. 0, 1, 2,. . .
POISSON PROBABILITY FORMULA
5.51 Suppose the mean number of earthquakes per year is 13. What is the probability of 20 or more
earthquakes in a given year?
Arts, .0427
5.52 The number of plankton in a liter of lake water has a mean value of 7. What is the probability that the
number of plankton in a given liter is within one standard deviation of the mean?
Ans. .6575
HYPERGEOMETRIC RANDOM VARIABLE
5.53 Since 1977 twenty-four states have executed at least one death row inmate. In a study concerning capital
punishment, ten of the fifty states are randomly selected. Let X represent the number of states in the ten
that have executed at least one death row inmate. Identify success and failure. What are the values of N,
k, N - k,and n.
Am. Success is that at least one death row inmate has been executed since 1977. Failure is that no
execution has occurred in the state since 1977.N = 50, k = 24, N - k = 26, and n = 10.
5.54 If success and failure are interchanged in problem 5.53, how is X changed and what are the values of N,
k, N - k, and n for X?
Am. X is the number of states in the ten that have executed no one since 1977. N = SO,k = 26, N - k =
24, and n = 10.
HYPERGEOMETRIC PROBABILITY FORMULA
5.55 Thirty diabetics have volunteered for a medical study. Ten of the diabetics have high blood pressure. Five
are selected for a preliminary screening for the study. What is the probability that none of the five
selected have high blood pressure if the selection is done randomly?
5.56 A box of manufactured products contains 20 items. Three of the items are defective. Let X represent the
number of defectives in three randomly selected from the box. Give the probahility distribution for X and
find the mean value for X using the probability distribution. Show that the mean is also given by p = -.
nk
N
112
Ans.
DISCRETE RANDOM VARIABLES [CHAP. 5
1 x 1 0 1 2 3 1
I .000877 I
P(x> I S96491 .357895 .044737
p = .45
Chapter 6
Continuous Random Variables and
Their Probability Distributions
UNIFORM PROBABILITY DISTRIBUTION
A coiitir2uous random variclble is a random variable capable of assuming all the values in an
interval or several intervals of real numbers. Because of the uncountable number of possible values,
it is not possible to list all the values and their probabilities for a continuous random variable in a
table as is true with a discrete random variable. The probability distribution for a continuous random
variable is represented as the area under a curve called the probability densiqfunction, abbreviated
pdJ A pdf is characterized by thefollowirig t”o basic properties: The graph o
f the pdf is never below
the x axis arid the total area wider the pdf always equals I .
The probability density function shown in Fig. 6-1 is a urzifor-mprobability distribution. This
pdf represents the distribution of flight times between Omaha, Nebraska, and Memphis, Tennessee.
The flight time is represented by the letter X. The graph shows that the flight times range from 90 to
100minutes. The distance from the x axis to the graph remains constant at 0.10and since the area of
a rectangle is given by the length times the width, the area under the pdf is 10 x 0.10 = 1. Note that
this pdf has the two basic properties given above. The graph of the pdf is never below the x axis and
the total area under the pdf is equal to I.
X
90 100
Fig. 6-1
The representation of Fig. 6-1 by an equation is given as follows.
.I 9 0 < x < 1 0 0
0 elsewhere
f(x) =
In general, if a random variable X is uniformly distributed over the interval from a to b, then the
pdf is given by formula (6.1).
a < x < b
1
0 elsewhere
112
114 CONTINUOUS RANDOM VARIABLES [CHAP. 6
EXAMPLE 6.1 The probability that a flight takes between 92 and 97 minutes is represented as P(92 < X < 97)
and is equal to the shaded area shown in Fig. 6-2. The rectangular shaded area has a width equal to 5 and a
length equal to .1 and the area is equal to 5 x . 1 = .5. That is, 50 percent of the flights between Omaha and
Memphis will take between 92 and 97 minutes.
Fig. 6-2
MEAN AND STANDARD DEVIATION FOR THE UNIFORM
PROBABILITY DISTRIBUTION
The mean value for a random variable having a uniform probability distribution over the interval
from a to b is given by
a + b
p = -
2
The variance for a uniform random variable is given by
(b - a)2
cJ2=
12
(6-3)
EXAMPLE 6.2 The weights of 10-pound bags of potatoes packaged by Idaho Farms Inc. are uniformly
distributed between 9.75 pounds and 10.75 pounds. The distribution of weights for these bags is shown in Fig.
6-3.
f(x)
I
X
9.75 10.75
Fig. 6-3
a +b 9.75+ 10.75
Using formula (6.2),we see that the mean weight per bag is p = --
- = 10.25 pounds and using
2 2
formula (6.3),the standard deviation is CJ = ,/v
-= E= 0.29. If X represents the weight per bag, then
CHAP. 61 CONTINUOUS RANDOM VARIABLES 115
P(X > 10.0)corresponds to the proportion of bags that weigh inore than 10pounds. This probability is shown in
Fig. 6-4. The ,haded rectangle has dimensions 1 and 0.75 and the area is 1 x 0.75 = 0.75. Seventy-five percent
of the bags weigh more than 10 pounds. To help ensure consumer satisfaction, Idaho Farms instructs their
employees to try and never underfill if possible. This is reflected in the average weight of 10.25 pounds per bag.
Fig. 6-4
The probability associated with a single value for a continuous random variable is always equal
to zero, since there is no area associated with a single point. That is, the probability that X = a is
given by
P(X = a) = 0
NORMAL PROBABILITY DISTRIBUTION
The most important and widely used of all continuous distributions is the normal probability
distributiort. Figure 6-5 shows the pdf for a normal distribution having mean p and standard
deviation 0.
Fig. 6-5
Table 6.1 gives some of the main properties of the normal curve shown in Fig. 6-5. Figure 6-6
shows two different normal curves with the same mean but different standard deviations. The larger
the standard deviation, the more disperse are the values about the mean.
116 CONTINUOUS RANDOM VARIABLES [CHAP. 6
Table 6.1
ProDertiesof the Normal Probability Distribution
I
I. The total area under the normal curve is equal to one.
2.
3.
4.
5.
6.
7.
8.
9.
The curve is symmetric about p and the area under the curve on each side of the
mean equals 0.5.
The (ails of the curve extend indefinitely.
Each pair of values for p and CT determine a different normal curve.
The highest point on a normal curve occurs at the mean.
The mean, median, and mode are all equal for a normal curve.
The mean locates the center of the curve and can be any real number, negative,
positive, or zero.
The standard deviation is positive, and determines the shape of the normal curve.
The larger the standard deviation, the wider and tlatter the curve.
68.26% of the area under the curve is within I standard deviation of the mean,
95.44% of the area is within 2 standard deviations, and 99.72%of the area is within
3 standard dcviations of the mean.
I
X
P
Fig. 6-6
Figure 6-7 shows two normal curves having equal standard deviations but different means. The
normal curve with mean equal to 66 represents the distribution of adult female heights and the
normal curve with mean equal to 70 represents the distribution of adult male heights.
I
X
66 70
Fig. 6-7
The equation of the pdf of a normal curve having mean p and standard deviation 0 is given in
formula (6.5).
CHAP. 61
0.0
0.1
0.2
CONTINUOUS RANDOM VARIABLES
.oooO .0040 ... .O199 ... .0359
.0398 .0438 ... .0596 ... .0753
.0793 .0832 ... .0987 ... .1141
... ...
117
1.6
The two constants in the formula, e and x;, are important constants in mathematics. The constant e is
equal to 2.718281828 . . . and the constant .n is equal to 3.141592654 . . . . The equation of the pdf
for the normal curve is not used a great deal in practice and is given here for the sake of
completeness.
... ..,
... ...
.4452 .4463 ,.. .4505 ... .4545
... ... ...
... ... ...
STANDARDNORMAL DISTRIBUTION
3.0
The standard tionnal distvibirtion is the normal distribution having mean equal to 0 and standard
deviation equal to 1. The letter Z is used to represent the standard normal random variable. The
standard normal curve is shown in Fig. 6-8. The curve is centered at the mean, 0, and the z-axis is
labeled in standard deviations above and below the mean.
... ... ...
,4987 .4987 ... .4989 ... .4990
-3 -2 - 1 0 1 2 3
Fig. 6-8
Appendix 2 contains the startdard rtormal distribution table. This table gives areas under the
standard normal curve for the variable Z ranging from 0 to a positive number z. Some examples will
now be given to illustrate how to use this table.
EXAMPLE6.3 Table 6.2 illustrates how to use the standard normal distribution table to find the area under
the standard normal curve between z = 0 and z = 1.65. Figure 6-9 shows the corresponding area as the shaded
region under the curve. The value 1.65 may be written as 1.6 + .05, and by locating 1.6 under the column
labeled z and then moving to the right of 1.6until you come under the .05column you find the area .4505. This
is the area shown in Fig. 6-9. We express this area as P(0 < Z < 1.65) = .4505.
Table 6
.
2
I 2 I .oo .01 ... .05 ... .09 I
118 CONTINUOUS RANDOM VARIABLES [CHAP. 6
Fig. 6-9
EXAMPLE 6.4 The area under the standard normal curve between z = -1.65 and z = 0 is represented as
P(-1.65< Z< 0) and is shown in Fig. 6-10.By symmetry, the following probabilities are equal.
P(-1.65< Z < 0)= P(O < Z < 1.65)
From Example 6.3, we know that P(0< Z < 1.65)= ,4505and therefore, P(-1.65< Z < 0)= ,4505.
Fig. 6-10
EXAMPLE 6.5 The area under the standard normal curve between z = -1.65 and z = 1.65 is represented by
P(-1.65 < Z< I .65)and is shown in Fig. 6-11. The probability P(-1.65 < Z < 1.65) is expressible as
P(-l.65< Z< 1.65)= P(-l.65< Z< 0) +P(O< Z < 1.65)
The probabilities on the right side of the above equation are given in Examples 6.3 and 6.4,and their sum is
equal to 0.9010.Therefore, P(-1.65 < Z < 1.65)= .9010.
Fig. 6-11
CHAP. 61 CONTINUOUS RANDOM VARIABLES 119
EXAMPLE 6.6 The probability of the event Z c 1.96 is represented by P(Z < 1.96) and is shown in Fig. 6-12.
Fig. 6-12
The area shown in Fig. 6-12 is partitioned into two parts as shown in Fig. 6-13. The darker of the two areas is
equal to P(Z < 0) = .5, since it is one-half of the total area. The lighter of the two areas is found in the standard
normal distribution table to be .4750. The sum of the two areas is .5 + .4750 = .9750. Summarizing,
P(Z < 1.96)= P(Z < 0) +P(0 < Z c 1.96)= .5 + .4750 = .9750
Fig. 6-13
The probability in Example 6.6 can also be found by using CDF of Minitab as follows:
MTB > cdf 1.96;
SUBC > normal 0 1.
Cumulative Distribution Function
Normal with mean = 0 and standard deviation = I .OOOOO
x P(X <= x)
1.9600 0.9750
Fig. 6-14
120 CONTINUOUSRANDOM VARIABLES [CHAP. 6
In order to find the probability P(Z> 1.96),shown in Fig. 6-14, we use the complement of the event Z > 1.96
and the result for P(Z < 1.96)given above as follows.
P(Z > 1.96)= 1 - P(Z < 1.96)= 1 - .9750 = .0250
STANDARDIZING A NORMAL DISTRIBUTION
In order to find areas under a normal distribution having mean p and standard deviation 0,the
normal distribution must be standardized. A normal random variable X having mean p and standard
deviation 0 is converted or transformed to a standard normal random variable by the formula given
in (6.6).
x
-
C
L
Z = -
0
EXAMPLE 6.7 Figure 6-15 shows
weights, X, are normally distributed
the distribution of adult male weights for a particular age group. The
with mean 170 pounds and standard deviation equal to 15 pounds. The
X - - c ~ 215-170
standardized value for the weight X = 215 is z = --
- = 3. For this age group, an individual who
weighs 2 15 pounds is 3 standard deviations above average. The standardized value for 170 pounds is zero, and
the standardized value for 125 pounds is -3.
0 15
125 170 215
Fig. 6-15
APPLICATIONS OF THE NORMAL DISTRIBUTION
The fact that many real-world phenomena are normally distributed leads to numerous appli-
cations of the normal distribution. Applications of the normal distribution usually involve finding
areas under a normal curve. Tofind the area between two values o
f x’for a normal distribution, first
convert both values of x to their respective z values. Thenfind the area under the standard normal
cuwe between those two z values. The area between the tuqo z values gives the area between the
corresponding x values.
EXAMPLE 6.8 In a study involving stress-induced blood pressure, volunteers played a computer game called
the color-word interference task. The game was set so that everyone made errors about 17% of the time. The
average increase in systolic blood pressure was 10points of systolic pressure, and the standard deviation was 3
points. The percent experiencing an increase of 16 points or more is found by evaluating P(X > 16) and
CHAP. 61 CONTINUOUS RANDOM VARIABLES 121
multiplying by 100. The probability is shown as the shaded area in Fig. 6-16. The z value corresponding to x =
16isz= -
= 2. The area shown in Fig. 6-16 is the same as the area shown in Fig. 6-17.
16- 10
3
Fig. 6-16
The equality of the areas in Figures 6-16and 6-17 is expressible in terms of probability as follows:
P(X > 16)= P(Z > 2)
Using the standard normal distribution table, we find that P(Z > 2) = .5 - .4772 = .0228. The percent
experiencing an increase of 16 systolic points or more is 2.28%.
Fig. 6-17
EXAMPLE 6.9 The time between release from prison and conviction for another crime for individuals under
40 is normally distributed with a mean equal to 30 months and a standard deviation equal to 6 months. The
percentage of these individuals convicted for another crime within two years of their release from prison is
represented as P(X < 24) times 100.The probability is shown as the shaded area in Fig. 6-18. The event X < 24
X-30 24-30
is equivalent to Z = -
c -
= -1. The probability P(Z <-1) is shown as the shaded area in Fig. 6-19.
6 6
Fig. 6-18
122 CONTINUOUS RANDOM VARIABLES [CHAP. 6
The shaded area shown in Fig. 6-19 is P(Z < -l), and by symmetry P(Z c -1) = P(Z > 1). The probability
P(Z> 1) is found using the standard normal distribution table as .5 - .3413 = .1587. That is P(X c 24) =
P(Z c -1) = .1587 or 15.87% commit a crime within two years of their release.
Fig. 6-19
EXAMPLE 6.10 A study determined that the difference between the price quoted to women and men for used
cars is normally distributed with mean $400 and standard deviation $50. To clarify, let X be the amount quoted
to a woman minus the amount quoted to a man for a given used car. Then, the population distribution for X is
normal with p = $400 and a = $50.The percent of the time that quotes for used cars are $275 to $500 more for
women than men is given by P(275 c X c 500) times 100. The probability P(275 c X c 500) is shown as the
275-400
shaded area in Fig. 6-20. The Z value corresponding to X = 275 is z =
500-400
corresponding to X = 500 is z =
50
= -2.5 and the Z value
50
= 2.0. The probability P(-2.5 c Z c 2.0) is shown in Fig. 6-21.
Fig. 6-20
The probability P(-2.5 < 2 c 2.0) is found using the standard normal distribution table. The probability is
expressed as P(-2.5 c Z < 2.0) = P(-2.5 c Z < 0) + P(0 c Z c 2.0). By symmetry,P(-2S<ZcO) = P(O<Zc2.5)
= .4938 and P(0 < Z c 2.0) = .4772. Therefore, P(-2.5 < Z < 2.0) = .4938 + .4772= .971. P(275 c X c 500) =
P(-2.5 < Z <2.0) = .971,or 97.1% of the time the quotes for used cars will be between $275 and $500 more for
women than for men.
CHAP. 61 CONTINUOUS RANDOM VARIABLES 123
Fig. 6-21
DETERMINING THE Z AND X VALUES WHEN AN AREA UNDER THE
NORMAL CURVE IS KNOWN
In many applications, we are concerned with finding the z value or the x value when the area
under a normal curve is known. The following examples illustrate the techniques for solving these
type problems.
EXAMPLE 6.11 Find the positive value z such that the area under the standard normal curve between 0 and z
is .4951. Figure 6-22 shows the area and the location of z on the horizontal axis. Table 6.3 gives the portion of
the standard normal distribution table needed to find the z value.
Fig. 6-22
We search the interior of the table until we find .4951,the area we are given. This area is shown in bold print in
Table 6.3. By going to the beginning of the row and top of the column in which .4951 resides, we see that the
value for z is 2.58.
Table 6.3
124 CONTINUOUS RANDOM VARIABLES [CHAP. 6
To find the value of negative z such that the area under the standard normal curve between 0 and z equals
.4951, we find the value for positive z as above and then take z to be -2.58.
EXAMPLE 6.12 The mean number of passengers that fly per day is equal to 1.75 million and the standard
deviation is 0.25 million per day. If the number of passengers flying per day is normally distributed, the
distribution and ninety-fifth percentile, Pg5,is as shown in Fig. 6-23. The shaded area is equal to 0.95.We have
a known area and need to find Pg5,a value of X. Since we must always use the standard normal tables to solve
problems involving any normal curve, we first draw a standard normal curve corresponding to Fig. 6-23. This is
shown in Fig. 6-24.
Fig. 6-23
Using the technique shown in Example 6.11, we find the area .4500 in the interior of the standard normal
distribution table and find the value of z to equal 1.645. There is 95% of the area under the standard normal
curve to the left of 1.645 and there is 95% of the area under the curve
Pg5is standardized, the standardized value must equal 1.645. That is,
find Pgs= 1.75+ .25 x 1.645= 2.16 million.
in Fig. 6-23 to the left of P9s.Therefore if
Pgg -1.75
= 1.645 and solving for P
9
5 we
.25
Fig. 6-24
The value of z in Fig. 6-24 can also be found by using INVCDF of Minitab as follows.
MTB > invcdf .95;
SUBC > normal 0 1.
Inverse Cumulative Distribution Function
Normal with mean = 0 and standard deviation = 1.OOOOO
P(X <= x) X
0.9500 1.6449
CHAP. 61 CONTINUOUS RANDOM VARIABLES 125
NORMAL APPROXIMATION TO THE BINOMIAL DISTRIBUTION
Figure 6-25 shows the binomial distribution for X, the number of heads to occur in 10 tosses of a
coin. This distribution is shown as the shaded area under the histogram-shaped figure. Superimposed
upon this binomial distribution is a normal curve. The mean of the binomial distribution is p = np =
10 x .5 = 5 and the standard deviation of the binomial distribution is CT = ,/G-
= 410 x .5 x .5 =
= 1.58. The normal curve, which is shown, has the same mean and standard deviation as the
binomial distribution. When a normal curve is fit to a binomial distribution in this manner, this is
called the normal approximation to the binonzial distribution. The shaded area under the binomial
distribution is equal to one and so is the total area under the normal curve.
The normal approximation to the binomial distribution is appropriate whenever np 2 5 and
nq 2 5.
Fig. 6-25
EXAMPLE 6.13 Using the table of binomial probabilities, the probability of 4 to 6 heads inclusive is as
follows: P(4 I
X I 6) = .205 1 + .2461 + .2051 = 6563.This is the shaded area shown in Fig. 6-26.
Fig. 6-26
The normal approximation to this area is shown in Fig. 6-27. To account for the area of all three rectangles, note
that X must go from 3.5 to 6.5. The area under the normal curve for X between 3.5 and 6.5 is found by
determining the area under the standard normal curve for Z between z = -= -.95 and z = -= .95.
This area is 2 x .3289 = .6578. Note that the approximation, .6578, is extremely close to the exact answer,
.6563.
3.5-5 6.5-5
1.58 1.58
126 CONTINUOUS RANDOM VARIABLES [CHAP. 6
Fig. 6-27
EXAMPLE6.14 To aid in the early detection of breast cancer women are urged to perform a self-exam
monthly. Thirty-eight percent of American women perform this exam monthly. In a sample of 315 women, the
probability that 100 or fewer perform the exam monthly is P(X I
100).Even Minitab has difficulty performing
this extremely difficult computation involving binomial probabilities. However, this probability is quite easy to
approximate using the normal approximation. The mean value for X is p = np = 315 x .38 = 119.7 and the
standard deviation is (J = 6= 4315 x .38 x .62 = 8.61. A normal curve is constructed with mean 119.7
and standard deviation 8.61. In order to cover all the area associated with the rectangle at X = 100 and all those
less than 100, the normal curve area associated with x less than 100.5 is found. Figure 6-28 shows the normal
curve area we need to find.
Fig. 6-28
The corresponding area under the standard normal curve is shown in Fig. 6-29. The area under the standard
normal curve for Z <-2.23 is SO00 - .4871 = .0129.The probability is extremely small that 100 or fewer in the
315 will perform the breast examination each month.
Fig. 6-29
CHAP. 61 CONTINUOUS RANDOM VARIABLES 127
EXPONENTIAL PROBABILITY DISTRIBUTION
The exponentialprobability distribution is a continuous probability distribution that is useful in
describing the time it takes to complete some task. The pdf for an exponential probability distribution
is given by formula (6.3, where p is the mean of the probability distribution and e = 2.71828 to five
decimal places.
for x 2 0
1 - x l p
f(x)= -e
c1
The graph for the pdf of a typical exponential distribution is shown in Fig. 6-30.
Fig. 6-30
EXAMPLE 6.15 The exponential random variable can be used to describe the following characteristics: the
time between logins on the internet, the time between arrests for convicted felons, the lifetimes of electronic
devices, and the shelf life of fat free chips.
PROBABILITIES FOR THE EXPONENTIAL PROBABILITY DISTRIBUTION
A standardized table of probabilities does not exist for the exponential distribution and to find
areas under the exponential distribution curve requires the use of calculus. However, formula (6.8)
is useful in solving many problems involving the exponential distribution.
The area corresponding to the probability given in formula (6.8)is shown as the shaded area in Fig.
6-31.
Fig. 6-31
128 CONTINUOUS RANDOM VARIABLES [CHAP. 6
EXAMPLE 6.16 Suppose the time till death after infection with HIV, the AIDS virus, is exponentially
distributed with mean equal to 8 years. If X represents the time till death after infection with HIV, then the
percent who die within five years after infection with HIV is found by multiplying P(X I 5 ) by 100. The
probability is found as follows: P(X 5 5 ) = 1 - e-.625= 1 - .535 = .465. Using CDF of Minitab, we have the
following as an alternative solution.
MTB > cdf 5;
SUBC > exponential 8.
Cumulative Distribution Function
Exponential with mean = 8.00000
x P(X<=x)
5.OOOO 0.4647
To find the percent who live more than 10 years, we multiply P(X > 10) by 100. In order to utilize formula
(6.8),we use the complementary rule for probabilities. This rule allows us to write P(X > 10) as follows:
That is, 28.7% of the individuals live more than 10years after infection. This probability is shown as the shaded
area in Fig. 6-32.
Fig. 6-32
To find the percent who live between 2 and 4 years after infection, we multiply P(2 < X < 4) by 100. To use
formula (6.8)to find this probability, we express P(2 < X <4) as follows:
P(2 <X < 4) = P(X <4) - P(X < 2)= (1 - e 3- (1 - e-.25)= e-.25- e-.5= .172
That is, 17.2% live between 2 and 4 years after infection. This probability is shown as the shaded area in Fig.
6-33.
Fig. 6-33
CHAP. 61 CONTINUOUSRANDOM VARIABLES 129
Solved Problems
UNIFORMPROBABILITY DISTRIBUTION
6.1 The price for a gallon of whole milk is uniformly distributed between $2.25 and $2.75 during
July in the US.Give the equation and graph the pdf for X, the price per gallon of whole milk
during July. Also determine the percent of stores that charge more than $2.70 per gallon.
2 2.25< x < 2.75
0 elsewhere
.The graph is shown in Fig. 6-34.
Ans. The equation of the pdf is f(x) =
Fig. 6-34 Fig. 6-35
The percent of stores charging a higher price than $2.70 is P(X > 2.70) times 100. The probability
P(X > 2.70) is the shaded area in Fig. 6-35. This area is 2 x .05 = .10.Ten percent of all milk
outlets sell a gallon of milk for more than $2.70.
6
.
2 The time between release from prison and the commission of another crime is uniformly
distributed between 0 and 5 years for a high-risk group. Give the equation and graph the pdf
for X, the time between release and the commission of another crime for this group. What
percent of this group will commit another crime within two years of their release from prison?
.2 o < x < 5
0 elsewhere
.The graph of the pdf is shown in Fig. 6-36. The
Ans. The equation of the pdf is f(x) =
percent who commit another crime within two years is given by P(X c 2) times 100. This
probability is shown as the shaded area in Fig. 6-37, and is equal to 2 x .2 = .4. Forty percent will
commit another crime within two years.
Fig. 6-36 Fig. 6-37
130 CONTINUOUS RANDOM VARIABLES [CHAP. 6
MEAN AND STANDARD DEVIATION FOR THE UNIFORM
PROBABILITY DISTRIBUTION
6.3
6.4
Find the mean and standard deviation of the milk prices in problem 6.1. What percent of the
prices are within one standard deviation of the mean?
Ans.
Find
a + b 2.25+2.75
The mean is given by p = --
- = 2.50 and the standard deviation is given by
2 2
/ y = / T
(2*7s - 2‘25) = .144. The one standard deviation interval about the mean goes
from 2.36 to 2.64 and the probability of the interval is (2.64 - 2.36) x 2 = S6. Fifty-six percent of
the prices are within one standard deviation of the mean.
the mean and standard deviation of the times between release from prison and the
commission of another crime in problem 6.2. What percent of the times are within two
standard deviations of the mean?
(b - a)2
i 12
a + b 0+5
2 2
Am. The mean is given by = --
- --
- 2.50 and the standard deviation is given by
= J(5 - ”* = 1.44. A 2 standard deviation interval about the mean goes from -.38 to 5.38 and
100% of the times are within 2 standard deviations of the mean.
12
NORMAL PROBABILITY DISTRIBUTION
6.5
6.6
The mean net worth of all Hispanic individuals aged 51-61 in the U.S. is $80,000, and the
standard deviation of the net worths of such individuals is $20,000. If the net worths are
normally distributed, what percent have net worths between: (a) $60,000 and $100,000; (6)
$40,000 and $1 20,000;(c) $20,000and $140,000?
Ans. ( a ) 68.26% have net worths between $60,000and $100,000.
(b) 95.44% have net worths between $40,000 and $120,000.
(c) 99.72% have net worths between $20,000 and $140,000.
If the median amount of money that parents in the age group 51-61 gave a child in the last
year is $1,725 and the amount that parents in this age group give a child is normally
distributed, what is the modal amount that parents in this age group give a child?
Am. Since the distribution is normally distributed, the mean, median, and mode are all equal.
Therefore, the modal amount is also $1,725.
STANDARD NORMAL DISTRIBUTION
6.7 Express the areas shown in the following two standard normal curves as a probability
statement and find the area of each one.
CHAP. 61 CONTINUOUS RANDOM VARIABLES 131
Ans. The area under the curve on the left is represented as P(0 < Z < 1.83) and from the standard
normal distribution table is equal to .4664. The area under the curve on the right is represented as
P(-1.87< Z< 1.87)and from the standard normal distribution table is 2 x .4693 = .9386.
6.8 Represent the following probabilities as shaded areas under the standard normal curve and
explain in words how you find the areas: (a)P(Z < -1.75); (b)P(Z < 2.15).
Ans. The probability in part (a) is shown as the shaded area in Fig. 6-38 and the probability in part (b)
is shown in Fig 6-39.
To find the shaded area in Fig. 6-38, we note that P(Z < -1.75)= P(Z > 1.75)because of the
symmetry of the normal curve. In addition, P(Z > 1.75)= .5 - P(0 < Z< 1.75)since the total area
to the right of 0 is .S. From the standard normal distribution table, P(0 < Z < 1.75)is equal to
.4599. Therefore, P(Z <-1.75)= P(Z> 1.75)= .5 - .4599 = .0401.
To find the shaded area in Fig. 6-39, we note that P(Z < 2.15) = P(Z < 0)+ P(0 < Z < 2.15). The
probability P(Z< 0) = .5 because of the symmetry of the normal curve. From the standard normal
distribution table, P(0 < Z < 2.15) = .4842. Therefore, P(Z < 2.15) = .5 + ,4842 = .9842. The
solution to part (b)using Minitab is as follows:
MTB >cdf 2.15;
SUBC > normal 0 1.
Cumulative Distribution Function
Normal with mean = 0 and standard deviation = 1.OOOOO
X P(X <= x)
2.1500 0.9842
Fig. 6-38 Fig. 6-39
132 CONTINUOUS RANDOM VARIABLES
STANDARDIZING A NORMAL DISTRIBUTION
[CHAP. 6
6.9
6.10
The distribution of complaints per week per 100,OOO passengers for all airlines in the U.S. is
normally distributed with p = 4.5 and CJ = 0.8. Find the standardized values for the following
observed values of the number of complaints per week per 100,OOO passengers: (a)6.3; (h) 2.5;
(c) 4.5; (d)
8.0.
A m
X - / A 6.3-45
(a) The standardized value for 6.3 is found by z = --
- -
-
- 2.25.
(7 .8
X - c ~ 2.5-45
(6) The standardized value for 2.5 is found by z = --
- -
-
- -2.50.
(7 .8
x--J1 45-4.5
(c) The standardized value for 4.5 is found by z = --
- -
-
- 0.00.
(7 .8
X-/A 8.0-45
(4 The standardized value for 8.0 is found by z = --
- -
= 4.38.
(7 .8
Personal injury awards are normally distributed with a mean equal to $62,000 and a standard
deviation equal to $13,500. Find the amount of the award corresponding to the following
standardized values: (a) 2.0; (b) -3.0; (c)0.0; (44.5.
Ans. (a) The amount corresponding to standardized value 2.0 is x = p + ZCT= 62,000 + 2.0 x 13,500=
89,000.
(6) The amount corresponding to standardized value -3.0 is x = /A i-ZG = 62,000 - 3.0 x 13,500 =
2 1,500,
(c) The amount corresponding to standardized value 0.0 is x = /A + ZG = 62,000 + 0.0 x 13,500=
62,000.
(6)
The amount corresponding to standardized value 4.5 is x = p + zo = 62,000+ 4.5 x 13,500 =
122,750.
APPLICATIONS OF THE NORMAL DISTRIBUTION
6.11 In a sociological study concerning family life, it is found that the age at first marriage for men
is normally distributed with a mean equal to 23.7 years and a standard deviation equal to 3.5
years. Determine the percent of men for whom the age at first marriage is between 20 and 30
years of age. If X represents the age at first marriage for men, draw a normal curve for X and
show the shaded area for P(20 < X < 30) as well as the corresponding area under the standard
normal curve.
Ans. The distribution for X is shown in Fig. 6-40.The shaded area represents P(20 < X < 30). The z
value corresponding to x = 20 is z = = -1.06 and the z value corresponding to x = 30 is
z = = 1.80. The area shown in Fig. 6-4I is equal to the area under the curve shown in
Fig. 6-40, that is, P(20 < X < 30) = P(-l.O6 < Z < 1.80).Utilizing the standard normal distribution
table, P(-1.06 < Z < 1.80) = .3554+ .4641 = .8195. That is, about 82% of the first marriages for
men occur for men between 20 and 30 years of age.
20-23.1
3.5
30-23.1
3.5
CHAP. 61 CONTINUOUS RANDOM VARIABLES 133
Fig. 6-40 Fig. 6-41
6.12 The net worth of senior citizens is normally distributed with mean equal to $225,000 and
standard deviation equal to $35,000. What percent of senior citizens have a net worth less than
$300,000?
Ans. Let X represent the net worth of senior citizens in thousands of dollars. The percent of senior
citizens with a net worth less than $300,000 is found by multiplying P(X c 300) times 100. The
probability P(X < 300) is shown in Fig. 6-42. The event X < 300 is equivalent to the event
Z < = 2.14. The probability that Z < 2.14 is represented as the shaded area in Fig. 6-43.
The probability that Z is less than 2.14 is found by adding P(0 < Z < 2.14) to .5, which equals .5 +
.4838 = .9838. We can conclude that 98.38% of the senior citizens have net worths less than
$300,000.
300- 225
35
Fig. 6-42 Fig. 6-43
DETERMINING THE Z AND X VALUES WHEN AN AREA UNDER THE
NORMAL CURVE IS KNOWN
6.13 Find the value for positive number a such that P(-a < Z < a) = 9 5 .
Ans. The symmetry of the normal curve implies that P(0 < Z < a) = .4750. This area is found in the
interior of the standard normal distribution table and the z value corresponding to this area is 1.96.
The area under the standard normal curve corresponding to -1.96< Z < 1.96 is .9500.
134 CONTINUOUS RANDOM VARIABLES [CHAP. 6
6.14 U.S. divorce rates, by county, are normally distributed with mean value equal to 4.5 per 1000
people and standard deviation equal to 1.3 per 1000people. Find the third quartile for divorce
rates by county per 1000people. Draw graphs to illustrate your solution.
Ans. The divorce rate is represented by X and its distribution is shown in Fig. 6-44. The third quartile is
represented by Q3.The shaded area is equal to .7500. The third quartile for the standard normal
distribution is .67 and is shown in Fig. 6-45. Summarizing,we have P(X < 4 3 ) = P(Z < .67) = .75.
The standardized value of Q 3 must equal .67. That is, -= .67 or Q 3 = 4.5 + 1.3 x .67 =
5.37. Seventy-five percent of the divorce rates are 5.37 or below.
Q3 -4.5
1.3
Fig. 6-44 Fig. 6-45
NORMAL APPROXIMATION TO THE BINOMIAL DISTRIBUTION
6.15 Use Minitab to find the probability of the event that between 8 and 11 heads inclusive occur
when a coin is flipped 20 times. Find the normal approximation to this probability.
Ans. MTB > cdf 11; MTB >cdf 7;
SUBC > binomial 20 .5. SUBC > binomial 20 .5.
Cumulative Distribution Function
Binomial with n = 20 and p = 0.500
11.00 0.7483 7.00 0.1316
Cumulative Distribution Function
Binomial with n = 20 and p = 0.500
x P(X<=x) X P(X <= x)
To find the probability of the event 8 I X I 11 note that P(8 I X I11) = P(X I11) - P(X 27).
Therefore, P(8 I X I 11)= .7483 - .1316 = .6167.
The mean of the binomial distribution is p = np = 20 x .5 = 10, and the standard deviation is CT =
&= & = 2.24. A normal curve with mean equal to 10and standard deviation 2.24 is fit to the
binomial distribution and the area under this normal curve is found for X ranging between 7.5 and
11.5. This area is shown in Fig. 6-46. The standardized value for x = 7.5 is z = -1.12 and the
standardized value for x = 11.5 is z = .67. The area between z = -1.12 and z = .67 is shown in Fig.
6-47. The area is now found using Minitab.
MTB > cdf 11.5;
SUBC > normal 10 2.24.
Cumulative Distribution Function
Normal with mean = 10.00 and
standard deviation = 2.240
11.50 0.7485 7.50 0.1322
MTB >cdf 7.5;
SUBC > normal 10 2.24.
Cumulative Distribution Function
Normal with mean = 10.00and
standard deviation = 2.240
x P(X<= x) X P(X <= x)
CHAP. 61 CONTINUOUS RANDOM VARIABLES
The area between 7.5 and 11.5is the difference .7485 - .1322 = .6163.
135
Fig. 6-46 Fig. 6-47
6.16 In the United States, 80% of TV-owning homes also own a VCR. Use the normal approxi-
mation to the binomial distribution to find the probability that 425 or more in a sample of 500
TV-owning homes also own a VCR.
Ans. If we let X represent the number of TV-owning homes which also own a VCR, then we are to find
P(X 2 425). A normal curve having mean = np = 400 and standard deviation = 6= 8.94 is fit
to the binomial and we need to find the area under the normal curve to the right of 424.5. The
solution using Minitab is as follows.
MTB > cdf 424.5;
SUBC > normal 400 8.94.
Cumulative Distribution Function
Normal with mean = 400.00 and standard deviation = 8.94
424.50 0.9969
X P(X <= x)
Since .9969 is the probability that X is less than 424.5, the probability that X exceeds 424.5 is
1- .9969 = .0031.
EXPONENTIALPROBABILITYDISTRIBUTION
6.17 Which of the following statements best describes the exponential probability distribution?
(a) The exponential probability distribution is skewed to the right.
(b) The exponential probability distribution is skewed to the left.
(c) The exponential probability distribution is mound shaped.
Ans. (a) The exponential distribution is skewed to the right.
6.18 What is the equation of the exponential pdf having mean equal to 5?
Am. The equation is f(x) = .2e-.2X
for x 20.
PROBABILITIESFOR THE EXPONENTIALPROBABILITYDISTRIBUTION
6.19 The lifetimes in years for a particular brand of cathode ray tube are exponentially distributed
with a mean of 5 years. What percent of the tubes have lifetimes between 5 and 8 years? Draw
136 CONTINUOUS RANDOM VARIABLES [CHAP. 6
a graph of the pdf and shade the area which represents the probability of the event 5 < X < 8,
where X represents the lifetimes.
Ans. The shaded area in Fig. 6-48 represents P(5 <X < 8). The probability is found as follows:
P(5 c X c 8) = P(X c 8) - P(X c 5) = (1 - e-1.6)- (1 - e-I) = e-l - e-1.6= .3679 - .2019 = .166.
Approximately 16.6% of the tubes have lifetimesbetween 5 and 8 years.
Fig. 6-48
6.20 For the cathode ray tubes in problem 6.19, use Minitab to determine the percent of tubes that
have lifetimes of less than 10years.
Ans. MTB > cdf 10;
SUBC > exponential 5.
Cumulative Distribution Function
Exponential with mean = 5.00000
10.00 0.8647
x P(X<=x)
The percentage of tubes having lifetimes less than 10years is 86.5%.
Supplementary Problems
UNIFORM PROBABILITY DISTRIBUTION
6
.
2
1 The RND function is a computer language function which uniformly generates random numbers between
Oand 1.
(a) What percent of the random numbers generated by RND are less than .35?
(b) What percent of the random numbers generated by RND are between .20 and .55?
(c) What percent of the random numbers generated by RND are either less than .14 or greater than.81?
Ans. (a) 35% (b) 35% (c) 33%
6.22 In a psychological study involving personality types and career selections, it is found that the time
required to complete a task is uniformly distributed over the interval from 5.0to 7.5 minutes.
(a) What is the probability that the task is completed in less than 4 minutes?
(b) What is the probability that the task is completed in 7.0 or more minutes?
(c) What is the probability that it requires more than 10.0minutes to complete the task?
Ans. (a) 0 (b) .2 (c) 0
CHAP. 61 CONTINUOUS RANDOM VARIABLES 137
MEAN AND STANDARDDEVIATION FOR THE UNIFORM PROBABILITY DISTRIBUTION
6.23 What is the mean and standard deviation of the random numbers generated by RND in problem 6.21?
Ans. p = S O G = .29
6.24 What percent of the distribution in problem 6.23 is included within one standard deviation of the mean?
Ans. 57.6%
NORMAL PROBABILITY DISTRIBUTION
6.25 The heights of adult males are normally distributed with a mean equal to 70 inches and a standard
deviation equal to 3 inches. Give the pdf for the normal curve describing this distribution.
6.26 A normal distribution has mean equal to a and standard deviation equal to b and c is a positive number. If
it is known that P(X > a +c) = d, find the followingprobabilities:
(a) P(X<a+c) (6) P(X<a-c) (c) P(a - c <X < a +c) (d)P(0 < X < a +c)
Ans. (a) 1 -d (b) d (c) 1- 2d (d) .5-d
STANDARD NORMAL DISTRIBUTION
6.27 Find the following probabilities concerning the standard normal random variable Z.
(a) P(O< Z < 2.13) (b) P(-1.45 < Z < 2.10) (c) P(Z > 2.88) (d) P(Z 2 -2.01)
Ans. (a) A834 (6) .9086 (c) .0020 (6).9778
6.28 Find the probabilities of the followingevents involving the standard normal random variable Z.
(a) Z>4.50
(d) Z < 1.15andZ>-1.15 (e) Z<-2.45andZ> 1.11 u> thecomplementofZ< 1.44
(b) -4.00< Z < 5.50 (c) Z <-1.25 or Z > 2.35
Ans. (a) approx 0 (b) approx 1 (c) .I 150 (d).7498 (e) 0 (fj.0749
STANDARDIZING A NORMAL DISTRIBUTION
6.29 The marriage rate per 1,000population per county has a normal distribution with mean 8.9 and standard
deviation 1.7.Find the standardized values for the followingmarriage rates per 1,OOO per county.
(a) 6.5 (6) 8.8 (c) 12.5 (d)13.5
Ans. (a) -1.41 (b) -0.06 (c) 2.12 (d) 2.71
6.30 The hospital cost for individuals involved in accidents who do not wear seat belts is normally distributed
with mean $7,500 and standard deviation $1,200.
(a) Find the cost for an individual whose standardized value is 2.5.
(6) Find the cost for an individual whose bill is 3 standard deviations below the average.
Ans. (a) $10,500 (6) $3,900
138 CONTINUOUS RANDOM VARIABLES [CHAP. 6
APPLICATIONSOF THE NORMAL DISTRIBUTION
6.31 The average TV-viewing time per week for children ages 2 to 1 1 is 22.5 hours and the standard deviation
is 5.5 hours. Assuming the viewing times are normally distributed, find the following.
(a) What percent of the children have viewing times less than 10hours per week?
(6) What percent of the children have viewing times between 15 and 25 hours per week?
(c) What percent of the children have viewing times greater than 40 hours per week?
Ans. ( a ) 1.16% (6) 58.67% (c) less than 0.I %
6.32 The amount that airlines spend on food per passenger is normally distributed with mean $8.00 and
standard deviation $2.00.
(a) What percent spend less than $5.00 per passenger?
(6) What percent spend between $6.00 and $10.00?
(c) What percent spend more than $12.50?
Ans. (a) 6.68% (6) 68.26% (c) 1.22%
DETERMININGTHE Z AND X VALUES WHEN AN AREA UNDER THE NORMAL
CURVE IS KNOWN
6.33
6.34
Find the value of a in each of the following probability statements involving the standard normal variable
2.
(a) P(0 < Z < a) = .4616
(6) P(Z<a)=.1894
(6) P(Z < a) = .8980
(e) P(Z > a) = .I894
(c) P(-a < Z < a) = 3612
U
, P(Z = a) = SO00
Ans. (a) 1.77 (6) 1.27 (c) 1.48 (d)-0.88 (el 0.88 u> no solutions
The GMAT test is required for admission to most graduate programs in business. In a recent year, the
GMAT test scores were normally distributed with mean value 550 and standard deviation 100.
(a) Find the first quartile for the distribution of GMAT test scores.
(b) Find the median for the distribution of GMAT test scores.
(c) Find the ninety-fifth percentile for the distribution of GMAT test scores.
Ans. (a) 483 (6) 550 (c) 715
NORMAL APPROXIMATIONTO THE BINOMIALDISTRIBUTION
6.35
6.36
For which of the following binomial distributions is the normal approximation to the binomial
distribution appropriate?
((1) n = 15,p = .2 (6) n = 40, p = . I (c) n = 500, p = .05 (6) n = 50, p = .3
Ans. (a) and (d)
Thirteen percent of students took a college remedial course in 1992-1993. Assuming this is still true,
what is the probability that in 350 randomly selected students:
(a) Less than 40 take a remedial course
(6) Between 40 and 50, inclusive, take a remedial course
(c) More than 55 take a remedial course
Ans. (a) .1711 (b) .6141 (c) .0559
CHAP. 61 CONTINUOUS RANDOM VARIABLES 139
EXPONENTIALPROBABILITY DISTRIBUTION
6.37 The random variable X has an exponential distribution with pdf f(x) = .le-’’, for x > 0. What is the mean
for this random variable?
Ans. 10
6.38 What is the largest value an exponential random variable may assume?
Ans. There is no upper limit for an exponential distribution.
PROBABILITIESFOR THE EXPONENTIALPROBABILITYDISTRIBUTION
6.39 The time that a family lives in a home between purchase and resale is exponentially distributed with a
mean equal to 5 years. Let X represent the time between purchase and resale. (a) Find P(X -c3).(6) Find
P(X > 5).
Ans. (a) ,4512 (6) .3679
6.40 The time between orders received at a mail order company are exponentially distributed with a mean
equal to 0.5 hour. What is the probability that the time between orders is between 1 and 2 hours?
Ans. .1170
Chapter 7
Sampling Distributions
SIMPLE RANDOM SAMPLING
In order to obtain information about some population, either a censw of the whole population is
taken or a sample is chosen from the population and the information is inferred from the sample.
The second approach is usually taken, since it is much cheaper to obtain a sample than to conduct a
census. In choosing a sample, it is desirable to obtain one that is representative of the population.
The average weight of the football players at a college would not be a representative estimate of the
average weight of students attending the college, for example. A simple rundurn sumple of size n
from a population of size N is one selected in such a way that every sample of size n has the same
chance of occurring. In simple random sampling with replacement, a member of the population can
be selected more than once. In simple random sampling without replacement, a member of the
population can be selected at most once. Simple random sampling without replacement is the most
common type of simple random sampling.
EXAMPLE 7.1 Consider the population consisting of the world’s five busiest airports. This population
consists of the following: A: Chicago O’Hare, B: Atlanta Hartsfield, C: London Heathrow, D: Dallas-Fort
Worth, and E: Los Angeles Intl. The number of possible samples of size 2 from this population of size 5 is given
by the combination of 5 items selected two at a time, that is, = 10. In simple random sampling, each
possible pair would have probability 0.1 of being the pair selected. That is, Chicago O’Hare and 4tlanta
Hartsfield would have probability 0.1 of being chosen, Chicago O’Hare and London Heathrow would have
probability 0.1 of being chosen, etc. One way of ensuring that each pair would have an equal chance of being
selected would be to write the names of the five airports on separate sheets of paper and select two of the sheets
randomly from a box.
USING RANDOM NUMBER TABLES
The technique of writing names on slips of paper and selecting them from a box is not practical
for most real world situations. Tables of random numbers are available in a variety of sources. The
digits 0 through 9 occur randomly throughout a random number table with each digit having an equal
chance of occurring. Table 7.1 is an example of a random number table. This particular table has 50
columns and 20 rows. To use a random number table, first randomly select a starting position and
then move in any direction to select the numbers.
EXAMPLE 7
.
2 The money section of USA Today gives the 1,900 most active New York Stock Exchange
issues. The random numbers in Table 7.1 can be used to randomly select 10 of these issues. Imagine that the
issues are numbered from 0001 to 1900. Suppose we randomly decide to start in row 1 and columns 2 1 through
24. The four-digit number located here is 0345. Reading down these four columns and discarding any number
exceeding 1900, we obtain the following eight random numbers between 0001 and 1900: 0345, 1304, 0990,
1580, 1461, 1064, 0676, and 0347. To obtain our other two numbers, we proceed to row 1 and columns 26
through 29. Reading down this column, we find 1149and 1074.To obtain the 10stock issues, we read down the
columns and select the ones located in positions 345, 347,676,990, 1064, 1074, 1149. 1304, 1461, and 1580.
140
CHAP. 71 SAMPLING DISTRIBUTIONS 141
Table 7.1
Random Numbers (25 rows, 50 columns)
01-05 06-10 11-15 16-20 21-25 26-30 31-35 36-40 41-45 46-SO
87032
98340
70363
09749
84223
27 190
47087
47335
31031
60072
90202
91887
08264
46655
30428
55238
64993
90420
4462 1
15988
2656I
94192
95651
91862
87730
92922
68993
69753
52240
37318
45248
83092
04860
67610
33957
40718
84882
80152
76402
94013
44020
8I975
64089
12659
91759
86046
24807
44311
10346
07550
84967
39204
059I9
35334
53553
83328
03067
10418
04778
71898
06061
69931
31921
63079
9304I
09124
70755
33070
02I33
0441 I
67293
23539
28393
44369
22925
97613
19953
26576
58739
05785
03453
13047
09900
9I937
48757
42493
36834
15800
42534
43925
14612
98551
21460
10649
06766
23119
21077
4036 I
03474
72883
22484
28533
81554
58272
89477
26551
19522
92668
84923
I1499
99573
48427
28370
10744
37433
77718
27665
8242I
00570
17772
55858
34529
53640
90766
8422I
76639
595I0
78460
44548
93024
69573
25425
43026
505 15
45349
16016
10583
61952
28368
57471
6I768
02625
92109
09950
38566
15763
83888
39356
75222
72791
98695
43864
78296
01372
46565
58590
62587
627 I3
60340
75775
I2676
I1020
00459
27996
96274
I8068
51540
9 1692
89959
60190
51303
10714
58382
5508I
4701 1
03726
36875
04890
95227
95202
29353
65106
09599
29679
29195
38998
391 19
34824
921 19
29692
44925
08308
08276
31121
46762
03091
00638
0I032
39059
06545
USING THE COMPUTER TO OBTAIN A SIMPLE RANDOM SAMPLE
Most computer statistical software packages can be used to select random numbers and to some
extent have replaced random number tables. As the capability and availability of computers continue
to increase, many of the statistical tables are becoming obsolete.
EXAMPLE 7.3 Minitab can be used to select the random sample of stock issues in Example 7.2. The
commands are as follows.
MTB > set cl
DATA > 1:1900
DATA > end
MTB > sample 10c 1 c2
MTB > print c2
Data Display
c 2
1227 969 1834 1441 423 897 824 664 414 77
The first three lines of command put the numbers 1 through 1900 into column 1. The command on the fourth
line asks for a sample of size 10 from the numbers in column cl and asks that the selected numbers be placed
into column c2. The print command causes the random numbers to be printed. These are the numbers of the 10
stocks to be selected.
142 SAMPLING DISTRIBUTIONS [CHAP. 7
SYSTEMATIC RANDOM SAMPLING
Systematic random sampling consists of choosing a sample by randomly selecting the first
element and then selecting every kth element thereafter. The systematic method of selecting a sample
often saves time in selecting the sample units.
EXAMPLE 7.4 In order to obtain a systematic sample of 50 of the nation’s3,143 counties,divide 3,143 by 50
to obtain 62.86. Round 62.86 down to obtain 62. From a list of the 3,143 counties, select one of the first 62
counties at random. Suppose county number 35 is selected. To obtain the other 49 counties, add 62 to 35 to
obtain 97, add 2 x 62 to 35 to obtain 159,and continue in this fashion until the number 49 x 62 -t35 = 3,073 is
obtained. The counties numbered 35, 97, 159,, . . ,3073 would represent a systematic sample from the nations
counties.
CLUSTER SAMPLING
In cluster sampling, the population is divided into clusters and then a random sample of the
clusters are selected. The selected clusters may be completely sampled or a random sample may be
obtained from the selected clusters.
EXAMPLE 7.5 A large company has 30 plants located throughout the United States. In order to access a new
total quality plan, the 30 plants are considered to be clusters and five of the plants are randomly selected,All of
the quality control personnel at the five selected plants are asked to evaluate the total quality plan.
STRATIFIED SAMPLING
In stratifed sampling, the population is divided into strata and then a random sample is selected
from each strata. The strata may be determined by income levels, different stores in a supermarket
chain, different age groups, different governmental law enforcement agencies, and so forth.
EXAMPLE7.6 Super Value Discount has 10 stores. To assessjob satisfaction,one percent of the employees
at each of the 10 stores are administered ajob satisfaction questionnaire.The 10stores are the strata into which
the population of all employeesat Super Value Discount are divided. The results at the 10 stores are combined
to evaluate the job satisfactionof the employees.
SAMPLING DISTRIBUTION OF THE SAMPLING MEAN
The mean of a population, p, is a parameter that is often of interest but usually the value of p is
unknown. In order to obtain information about the population mean, a sample is taken and the sample
mean, T , is calculated. The value of the sample mean is determined by the sample actually selected.
The sample mean can assume several different values, whereas the population mean is constant. The
set of all possible values of the sample mean along with the probabilities of occurrence of the
possible values is called the sampling distribution o
f the sampling mean. The following example will
help illustrate the sampling distribution of the sample mean.
EXAMPLE 7.7 Suppose the five cities with the most African-American-owned businesses measured in
thousands is given in Table 7.2.
CHAP. 71
City
A: New York
C: Los Angeles
E:Atlanta
B: Washington, D.C.
D: Chicago
SAMPLING DISTRIBUTIONS
Number of African-American-
owned businesses, in thousands
42
39
36
33
30
143
X
P(x)
30 33 36 39 42
.2 .2 .2 .2 .2
If X represents the number of African-American-ownedbusinesses in thousands for this population consisting of
five cities, then the probability distribution for X is shown in Table 7.3.
I
Number of businesses Sample mean
-
in the samples X
Table 7
.
3
The population mean is p = C xP(x) = 30 x .2 + 33 x .2 + 36 x .2 + 39 x .2 +42 x .2 = 36, and the variance is
given by$=Cx2P(x)-p2 = 900 x .2 + 1089 x .2 + 1296 x .2 + 1521 x .2 + 1764 x .2 - 1296 = 1314 - 1296
= 18.The population standard deviation is the square root of 18,or 4.24.
The number of samples of size 3 possible from this population is equal to the number of combinations
possible when selecting three cities from five. The number of possible samples is C: = -= 10. Using the
letters A, B, C, D, and E rather than the name of the cities, Table 7.4 gives all the possible samples of three
cities, the sample values, and the means of the samples.
5!
3!2!
Samples
42, 39, 36
42, 39, 33
42, 39, 30
42, 36, 33
42, 36, 30
42, 33, 30
39, 36, 33
39, 36, 30
39, 33, 30
39
38
37
37
36
35
36
35
34
The sampling distribution of the mean is obtained from Table 7.4. For random sampling, each of the
samples in Table 7.4 is equally likely to be selected. The probability of selecting a sample with mean 39 is .1
since only one of 10samples has a mean of 39. The probability of selecting a sample with mean 36 is .2, since
two of the samples have a mean equal to 36. Table 7.5 gives the sampling distribution of the sample mean.
Table 7.5
I I
1 , 1 3 3 34 35 36 37 38 39 I
I PK’)I .I .1 .2 .2 .2 *1 .1 I
144
Sample mean
33
34
35
36
37
38
39
SAMPLING DISTRIBUTIONS
Sampling error Probability
3 .I
2 .1
I .2
0 .2
1 .2
2 .1
3 .I L
[CHAP. 7
33 34 35 36 37 38 39
P(Y) .I .I .2 .2 .2 .1 .I
-
, x
SAMPLINGERROR
When the sample mean is used to estimate the population mean, an error is usually made. This
error is called the sampling error, and is defined to be the absolute difference between the sample
mean and the population mean. The sampling error is defined by
sampling error = Ix -p I
EXAMPLE73 In Example 7.7, the population of the five cities with the most African-American-owned
businesses is given. The mean of this population is 36. Table 7.5 gives the possible sample means for all
samples of size 3 selected from this population. Table 7.6 gives the sampling errors and probabilities associated
with all the different sample means.
Table 7.6
From Table 7.6, it is seen that the probability of no sampling error in this scenario is .20. There is a 60%chance
that the sampling error is 1 or less.
MEAN AND STANDARDDEVIATION OF THE SAMPLEMEAN
Since the sample mean has a distribution, it is a random variable and has a mean and a standard
deviation, The mean of the sample mean is represented by the symbol px and the standard deviation
of the sample mean is represented by fi .The standard deviation of the sample mean is referred to as
the standard error o
f the mean. Example 7.9 illustrates how to find the mean of the sample mean and
the standard error of the mean.
EXAMPLE 7.9 In Example 7.7, the sampling distribution of the mean shown in Table 7.7 was obtained.
Table7.7
The mean of the sample mean is found as follows:
pir = Z 'XP(X) = 33 x .1 +34 x .1 +35 x .2 + 36 x .2 + 37 x .2 + 38 x .1 +39 x .1 = 36
The variance of the sample mean is found as follows:
a; = c X2P(Si;) - j.lz
2
0: = 1089x .l + 1156x .1 + 1225 x .2 + 1296x .2 + 1369x .2 + 1444 x .I + 1521 x . I - 1296= 3
The standard error of the mean, ox,
is equal to 6= 1.73.
CHAP. 71 SAMPLINGDISTRIBUTIONS 145
The relationshipbetween the mean of the sample mean and the population mean is expressed by
The relationship
expressed by formula
between the variance of the sample mean and the population variance is
(7.3),where N is the population size and n is the sample size.
2 c2x-
N-n
n N-1
‘
T
, = - (7.3)
EXAMPLE7.10 In Example 7.7, the population consisting of the five cities with the most African-American-
owned businesses was introduced. The population mean, p,is equal to 36 and the variance, 02,
is equal to 18. In
Example 7.9, the mean of the sample mean,pF, was shown to equal 36 and the variance of the sample mean,
02, was shown to equal 3. It is seen that p=kSr
= 36, illustrating formula (7.2). To illustrate formula (7.3),note
that
The standarderror of the mean is found taking the square root of both sides of formula (7.3),and
is given by
The term izis called thefinite population correctionfactor. If the sample size n is less than 5%
of the population size, i.e., n < .05N, the finite population correction factor is very near one and is
omitted in formula (7.4).If n < .05N, the standard error of the mean is given by
EXAMPLE7.11 The mean cost per county in the United States to maintain county roads is $785 thousand per
year and the standard deviation is $55 thousand. Approximately 4% of the counties are randomly selected and
the mean cost for the sample is computed. The number of counties is 3,143 and the sample size is 125.The
standard error of the mean using formula (7.4)is:
= 4.91935 x .98 = 4.82 thousand
3,143-1
The standard error of the mean using formula (7.5)is:
55
‘Tx = -= 4.92 thousand
J125
Ignoring the finite population correction factor in this case changes the standard error by a small amount.
146 SAMPLING DISTRIBUTIONS [CHAP. 7
SHAPE OF THE SAMPLING DISTRIBUTION OF'THE SAMPLE MEAN
AND THE CENTRAL LIMIT THEOREM
If samples are selected from a population which is normally distributed with mean p and
standard deviation 0,
then the distribution of sample means is normally distributed and the mean of
this distribution is p, = p, and the standard deviation of this distribution is e = -. The shape of
the distribution of the sample means is normal or bell-shaped regardless of the sample size.
The central l i k t theorem states that when sampling from a large population of any distribution
shape, the sample means have a normal distribution whenever the sample size is 30 or more.
Furthermore, the mean of the distribution of sample means is pX = p, and the standard deviation of
this distribution is = -. It is important to note that when sampling from a nonnormal
distribution, X has a normal distribution only if the sample size is 30 or more. The central limit
theorem is illustrated graphically in Figure 7-I.
0
&
0
J;T
Fig. 7-1
Figure 7-1 illustrates that for samples greater than or equal to 30, 51 has a distribution that is bell-
shaped and centers at p.The spread of the curve is determined by e.
EXAMPLE7.12 If a large number of samples each of size n, where n is 30 or more, are selected and the
means of the samples are calculated, then a histogram of the means will be bell-shaped regardless of the shape
of the population distribution from which the samples are selected. However, if the sample size is less than 30,
the histogram of the sample means may not be bell-shaped unless the samples are selected from a bell-shaped
distribution.
APPLICATIONS OF THE SAMPLING DISTRIBUTION
OF THE SAMPLE MEAN
The distribution properties of the sample mean are used to evaluate the results of sampling, and
form the underpinnings of many of the statistical inference techniques found in the remaining
chapters of this text. The examples in this section illustrate the usefulness of the central limit
theorem.
EXAMPLE 7.13 A government report states that the mean amount spent per capita for police protection for
cities exceeding 150,000 in population is $500 and the standard deviation is $75. A criminal justice research
CHAP. 71 SAMPLINGDISTRIBUTIONS 147
study found that for 40 such randomly selected cities, the average amount spent per capita for this sample for
police protection is $465. If the government report is correct, the probability of finding a sample mean that is
$35 or more below the national average is given by P(x < 465). The central limit theorem assures us that
because the sample size exceeds 30, the sample mean has a normal distribution. Furthermore, if the government
report is correct, the mean of x is $500 and the standard error is -= $1 1.86. The area to the left of $465
75
J40
465-500
under the normal curve for x is the same as the area to the left of 2 = = -2.95. We have the
11.86
following equality. P ( x < 465) = P(Z < -2.95) = .0016. This result suggests that either we have a highly
unusual sample or the government claim is incorrect. Figures 7-2 and 7-3 illustrate the solution graphically.
Fig. 7-2 Fig. 7-3
In Example 7.13, the value 465 was transformed to a z value by subtracting the population mean 500 from 465,
and then dividing by the standard error 11.86. The equation for transforming a sample mean to a z value is
shown in formula (7.6):
EXAMPLE 7.14 A machine fills containers of coffee labeled as 113 grams. Because of machine variability,
the amount per container is normally distributed with p = 113 and 0 = 1. Each day, 4 of the containers are
selected randomly and the mean amount in the 4 containers is determined. If the mean amount in the four
containers is either less than 112 grams or greater than 114 grams, the machine is stopped and adjusted. Since
the distribution of fills is normally distributed, the sample mean is normally distributed even for a sample as
small as four. The mean of the sample mean is px = 113 and the standard error is ( T ~
= .5. The machine is
adjusted if X < I12 or if ?I > 114. The probability the machine is adjusted is equal to the sum P(Y < 112) +
P(X > 114) since we add probabilities of events connected by the word or. To evaluate these probabilities, we
use formula (7.6)to express the events involving X in terms of z as follows:
X-113 < 112-113
5 5
P(X < 112)=P(- ) = P(z <-2.00) = .0228
X-113 114-113
.S 5
P(Y > 114)= P(- > ) = P(z >2.00)= .0228
The probability that the machine is adjusted is 2 x .0228 = .0456. It is seen that if this sampling technique is
used to monitor this process, there is a 4.56% chance that the machine will be adjusted even though it is
maintaining an average fill equal to 113grams.
148 SAMPLING DISTRIBUTIONS
Theme park
B: Magic Kingdom (Disney World)
C: Epcot (Disney World)
D: DisneyMGM Studios (Disney World)
E: Universal Studios Florida (Orlando)
A: Disneyland (Anaheim)
[CHAP. 7
Attendance exceeds 10 million
Yes
yes
yes
no
no
SAMPLING DISTRIBUTION OF THE SAMPLE PROPORTION
Sample
A, B, C, D
A, B, C, E
A, B, D, E
A, C, D, E
B, C, D, E
A populcition proportion is the proportion or percentage of a population that has a specified
characteristic or attribute. The population proportion is represented by the letter p. The sarnple
proportiori is the proportion or percentage in the sample that has a specified characteristic or
attribute. The sample proportion is represented by either the symbol 6 or F. We shall use the symbol
to represent the sample proportion in this text.
Exceeds 10 million Sample proportion Probabi1ity
Y. Y I Y1 n .75 .2
Y. Y I Y, n .75 .2
yl y, n, n .so .2
y, y, n*n so .2
y, y, n, n .so .2
EXAMPLE 7.15 The nation’s work force at a given time is 133,018,000and the number of unemployed is
7,355,000
133,O18,000
7,355,000. The proportion unemployed is p = = .055 and the jobless rate is 5.5%. A sample of
size 65,000 is selected from the nation’s work force and 3,900 are unemployed in the sample. The sample
proportion unemployed is j7 = --
- .06 and the samplejobless rate is 6%.
3,900
65,000
The population proportion p is a parumeter measured on the complete population and is constant
over some time interval. The sample proportion jT is a stutistic measured on a sample and is
considered to be a random variable whose value is dependent on the sample chosen. The set of all
possible values of a sample proportion along with the probabilities corresponding to those values is
called the sutnplitig distribirtion c$ the sample proportiori.
EXAMPLE 7.16 According to recent data, the nation’s five most popular theme parks are shown in Table 7.8.
The table gives the name of the theme park and indicates whether or not the attendance exceeds 10 million per
year.
For this population of size N = 5, the proportion of theme parks with attendance exceeding 10million is p = =
.60 or 60%. There are 5 samples of siLe 4 possible when selected from this population. These samples, the
theme parks exceeding 10 million (yes or no), the sample proportion, and the probability associated with the
sample proportion, are given in Table 7-9.
For each sample, the sampling error, Ip - jT 1, is either .I0 or .15. From Table 7.9, it is seen that the probability
associated with sample proportion .75 is .2 + .2 = .4 and the probability associated with sample proportion .SO is
.2 + .2 + .2 = .6.The sampling distribution for i7 is given in Table 7.10.
CHAP. 71 SAMPLING DISTRIBUTIONS 149
Table 7.10
For larger populations and samples, the sampling distribution of the sample proportion is more difficult to
construct, but the technique is the same.
MEAN AND STANDARDDEVIATION OF THE SAMPLEPROPORTION
Since the sample proportion has a distribution, it is a random variable and has a mean and a
standard deviation. The mean of the sample proportion is represented by the symbol pF and the
standard deviation of the sample proportion is represented by q.
The standard deviation of the
sample proportion is called the stmdard error of the proportion. Example 7.17 illustrates how to
find the mean of the sample proportion and the standard error of the proportion.
EXAMPLE 7.17 For the sampling distribution of the sample proportion shown in Table 7.10, the mean is
pLp
= C @'@) = .5 x .6 + .75 x .4 = 0.6
The variance of the sample proportion is
6 = C P2P@) - pi = .25 x .6 + S625 x .4- .36 = 0.015
The standard error of the proportion, OF,
is equal to JE
= 0.122.
The relationship between the mean of the sample proportion and the population proportion is
expressed by
vT,= p
The standard error of the sample proportion is related
size, and the sample size by
(7.7)
to the population proportion, the population
The term ,/%is called thefinite populatioiz correctiorzfactor. If the sample size n is less than 5%
of the population size, i.e., n < .05N, the finite population correction factor is very near one and is
omitted in formula (7.8).
If n < .OSN, the starzdard error o
f theproportion is given by formula (7.9),where q = 1 - p.
.=E
EXAMPLE 7.18 In Example 7.16, dealing with the five most popular theme parks, it was shown that p = 0.6
and in Example 7.17, it was shown that pii = 0.6 illustrating formula (7.7). In Example 7.17 it was also shown
that q,
= JG'
= 0.122. To illustrate formula (7.8),note that
150 SAMPLING DISTRIBUTIONS [CHAP. 7
EXAMPLE 7.19 Suppose 80% of all companies use e-mail. In a survey of 100companies, the standard error
of‘the sample proportion using e-mail is q
j = - -
- /=
~ -
- .04. The finite population correction factor
is not needed since there is a very large number of companies, and it is reasonable to assume that R < .05N.
SHAPE OF THE SAMPLINGDISTRIBUTION OF THE SAMPLEPROPORTION
AND THE CENTRAL LIMIT THEOREM
When the sample size satisfies the inequalities np > 5 and nq > 5, the sampling distribution of the
sample proportion is normally distributed with mean equal to p and standard error 9.
This result is
sometimes referred to as the central limit theoremfor the sample proportion. This result is illustrated
in Fig. 7-4.
EXAMPLE 7.20 Approximately 20% of the adults 25 and older have a bachelor’s degree in the United States.
If a large number of samples of adults 25 and older, each of size 100, were taken across the country, then the
sample proportions having a bachelor’s degree would vary from sample to sample. The distribution of sample
proportions would be normally distributed with a mean equal to 20% and a standard error equal to
4%.According to the empirical rule, approximately 68%of the sample proportions would fall between 16%and
24%, approximately 95% of the sample proportions would fall between 12% and 28%, and approximately
99.7% would fall between 8% and 32%. The sample proportion distribution may be assumed to be normally
distributed since np = 20 and nq = 80 are both greater than 5.
P
Fig. 7-4
APPLICATIONS OF THE SAMPLING DISTRIBUTION
OF THE SAMPLEPROPORTION
The theory underlying the sample proportion is utilized in numerous statistical applications. The
margin of error, control chart limits, and many other useful statistical techniques make use of the
sampling distribution theory connected with the sample proportion.
CHAP. 71 SAMPLING DISTRIBUTIONS 151
EXAMPLE 7.21 It is estimated that 42% of women and 36% of men ages 45 to 54 are overweight, that is,
20% over their desirable weight. The probability that one-half or more in a sample of 50 women ages 45 to 54
are overweight is expressed as P ( p 2 S).The distribution of the sample proportion is normal since np = 50 x
.42 = 21 > 5 and nq = 50 x .58 = 29 > 5. The mean of the distribution is .42 and the standard error is =
= .07.The probability, P ( p 2 3,is shown as the shaded area in Fig. 7-5.
= li*42:0*58
Fig. 7-5
To find the area under the normal curve shown in Fig. 7-5, it is necessary to transform the value .5 to a standard
normal value. The value for z is z = ~ = 1.14. The area to the right of j7 = .5 is equal to the area to the
right of z = 1.14 and is given as follows.
5 - .42
.07
P(p5: .5) = P(z > 1.14)= .5 - .3729= .1272
In Example 7.21, the p value equal to .5 was transformed to a z value by subtracting the mean
value of p from .5 and then dividing the result by the standard error of the proportion. The equation
for transforming a sample proportion value to a z value is given by
(7.10)
EXAMPLE 7.22 It is estimated that 1 out of 5 individuals over 65 have Parkinsonism, that is, signs of
Parkinson’s disease. The probability that 15% or less in a sample of 100 such individuals have Parkinsonism is
represented as P(p 5 15%).Since np = 100x .20 = 20 > 5 and nq = 100x -80= 80 > 5, it may be assumed that
-
p has a normal distribution. The mean of j7 is 20% and the standard error is 0
: = E= pg = 4%.
Formula (7.10)is used to find an event involving z which is equivalent to the event jT 5 15. The event jT 5 15
is equivalent to the event z = -<- = -1.25, and therefore we have P ( p 5 15%) = P(z < -1.25) =
j7-20 15-20
4 4
.5 - .3944 = .1056.
152
1. Alabama
2. Alaska
3. Arizona
4. Arkansas
5. California
SAMPLING DISTRIBUTIONS
18. Kentucky 35. North Dakota
19.Louisiana 36. Ohio
20. Maine 37. Oklahoma
2I. Maryland 38. Oregon
22. Massachusetts 39. Pennsvlvania
[CHAP. 7
6. Colorado
7. Connecticut
8. Delaware
Solved Problems
23. Michigan 40. Rhode Island
24. Minnesota
25. MississiDDi 42. South Dakota
4 1. South Carolina
SIMPLE RANDOM SAMPLING
9. District of Columbia 26. Missouri
10.Florida 27. Montana
11. Georgia 28. Nebraska
12. Hawaii 29. Nevada
13. Idaho 30. New Hampshire
14. Illinois
15. Indiana 32. New Mexico
3 1. New Jersey
7.1 The top nine medical doctor specialties in terms of median income are as follows: radiology,
surgery, anesthesiology, obstetrics/gynecology, pathology, internal medicine, psychiatry,
family practice, and pediatrics. How many simple random samples of size 4, chosen without
replacement, are possible when selected from the population consisting of these nine
specialties? What is the probability associated with each possible simple random sample of
size 4 from the population consisting of these nine specialties?
43. Tennessee
44. Texas
45. Utah
46. Vermont
47. Virginia
48. Washington
49. West Virginia
Arts: The number of possible samples is equal to /49)= 2= 126. Each possible sample has proba-
bility -!- = .00794 of being selected.
I26
16. Iowa
17. Kansas
7
.
2 In problem 7.1, how many of the 126 possible samples of size 4 would include the specialty
pediatrics?
Y
33. New York 50. Wisconsin
34. North Carolina 51. Wyoming
Ans: The number of samples of size 4 which include the specialtypediatrics would equal the number of
ways the other three specialties in the sample could be selected from the other eight specialties in
the population. The number of ways to select 3 from 8 is = -= 56. One such sample is the
sample consisting of the following: (radiology, surgery, pathology, pediatrics). There are 55 other
such samples containing the specialtypediatrics.
[:I :
:
!
USING RANDOM NUMBER TABLES
Table 7.11
7.3 Use Table 7.I of this chapter to obtain a sample of size 5 from the population consisting of the
50 states and the District of Columbia. Assume that the 51 members of the population are
CHAP. 71 SAMPLING DISTRIBUTIONS 153
listed in alphabetical as shown in Table 7.11. Start with the two digits in columns 11 and 12,
and line 1 of Table 7.1. Read down these two columns until five unique states have been
selected. Give the numbers and the selected states.
Ans: The random numbers are 44, 12, 24, 10, and 07. From an alphabetical listing of the 50 states and
the District of Columbia in Table 7.11, the following sample is obtained: (Connecticut, Florida,
Hawaii, Minnesota, and Texas)
7.4 The random numbers in Table 7.1 are used to select 15 faculty members from an alphabetical
listing of the 425 faculty members at a university. The total faculty listing goes from 001:
Lawrence Allison to 425: Carol Ziebarth. A random start position is determined to be columns
31 through 33 and line 1. Read down these columns until you reach the end. Then go to
columns 36 through 38, line 1 and read down these columns until you reach the end. If
necessary, go to columns 41 through 43 and line 1 and read down these columns until you
obtain 15 unique numbers. Give the 15 numbers which determine the 15 selected faculty
members.
Ans: The numbers, in the order obtained from Table 7.1 are: 345, 254, 160, 105, 283, 026, 099, 385,
157,393,013, 126, 110,004, and 279.
USING THE COMPUTER TO OBTAIN A SIMPLERANDOM SAMPLE
7.5 In reference to problem 7.3, use Minitab to obtain a sample of 5 of the 50 states plus the
District of Columbia.
Ans: MTB > set cl
DATA> 1 5 1
DATA > end
MTB > sample 5 cl c2
MTB > print c2
Data Display
c
2
3 21 38 44 36
From the alphabetical listing of the states shown in Table 7.11,the following sample is obtained:
(Arizona, Maryland, Ohio, Oregon, and Texas)
Texas is the only state which is common to the samples obtained in problems 7.3 and 7.5.
7.6 In reference to problem 7.4, use Minitab to obtain a sample of 15 of the 425 faculty members.
Ans: MTB > set cl
DATA > 1:425
DATA >end
MTB > sample 25 cl c2
MTB >sort c2 put into c3
MTB >print c3
Data Display
c3
58 73 111 112 187 228 239 283 285 319 322 325 364 384 394
The faculty member corresponding to number 283 is the only one that is common to the samples
obtained in problems 7.4 and 7.6.
154 SAMPLING DISTRIBUTIONS
City
A: New York
C: Los Angeles
E: Atlanta
B: Washington, D.C.
D: Chicago
[CHAP. 7
Number of African-American-owned
businesses, in thousands
42
39
36
33
30
SYSTEMATIC RANDOM SAMPLING
7.7 In reference to problem 7.3, choose a systematic sample of 5 of the 50 states plus the District
of Columbia.
Ans: Divide 51 by 5 to obtain 10.2. Round 10.2down to 10 and select a number randomly between 1
and 10. Suppose the number 7 is obtained. The other 4 members of the sample are 17, 27, 37, and
47. From the alphabetical listing of the states given in Table 7.11,the following sample is selected:
(Connecticut, Kansas, Montana, Oklahoma, Virginia)
CLUSTER SAMPLING
7.8 A large school district wishes to obtain a measure of the mathematical competency of the
junior high students in the district. The district consists of 40 different junior high schools.
Describe how you could use cluster sampling to obtain a measure of the mathematical
competencyof thejunior high school students in the district.
Ans: Randomly select a small number, say 5, of the 40 junior high schools. Then administer a test of
mathematical competency to all students in the 5 selected schools. The test results from the 5
schools constitute a cluster sample.
STRATIFIED SAMPLING
7.9 Refer to problem 7.8. Explain how you could use stratified sampling to determine the
mathematical competency of thejunior high students in the district.
Ans: Consider each of the 40 junior high schools as a stratum. Randomly select a sample from each
junior high proportional to the number of students in that junior high school and administer the
mathematical competency test to the selected students. Note that even though stratified sampling
could be used, cluster sampling as described in problem 7.8 would be easier to administer.
SAMPLING DISTRIBUTION OF THE SAMPLING MEAN
7.10 The five cities with the most African-American-ownedbusinesses was given in Table 7.2 and
is reproduced below.
List all samples of size 4 and find the mean of each sample. Also, construct the sampling
distribution of the sample mean.
CHAP. 71 SAMPLING DISTRIBUTIONS 155
Number of businesses
Samples in the samples Sample mean
A, B, C, D 42, 39, 36, 33 37.50
A, B,c,E 42, 39, 36, 30 36.75
A, B, D, E 42, 39, 33, 30 36.00
A, C , D, E 42, 36, 33, 30 35.25
Ans: Table 7.12 lists the 5 samples of size 4, along with the sample values for each sample, and the
mean of each sample, and Table 7.13 gives the sampling distribution of the sample mean.
f
Table 7.13
I ii- I 34.50 35.25 36.00 36.75 37.50 I
IP(T) I .2 .2 .2 .2 .2 1
7.11 Consider sampling from an infinite population. Suppose the number of firearms of any type per
household in this population is distributed uniformly with 25% having no firearms, 25%
having exactly one, 25% having exactly two, and 25% having exactly three. If X represents the
number of firearms per household, then X has the distribution shown in Table 7.14.
Table 7.14
X 1 0 1 2 3 1
I P(x) I .25 .25 .25 .25 I
Give the sampling distribution of the sample mean for all samples of size two from this
population.
Table 7.15
Samde mean
0
0.5
1.o
1.5
0.5
1.o
1.5
2.0
1.o
1.5
2.0
2.5
I .5
2.0
2.5
3.O
Probability
.0625
.0625
.0625
.0625
,0625
.0625
.0625
.0625
,0625
.0625
.0625
.0625
.0625
.0625
.0625
.0625
Ans: Suppose x1 and x2 represents the two possible values for the two households sampled. Table 7.15
gives the possible values for x1 and x2, the mean of the sample values, and the probability
associated with each pair of sample values. Since there are 16different possible sample pairs each
one has probability & = .0625. The possible values for the sample mean are 0, 0.5, 1.O, 1.5, 2.0,
156
-
, x
P(X)
SAMPLING DISTRIBUTIONS [CHAP. 7
0 0.5 1.o 1.5 2.0 2.5 3.0
.0625 .I250 .1875 ,2500 .I875 .1250 ,0625
2.5, and 3.0. The probability associated with each sample mean value can be determined from
Table 7.15. For example, the sample mean equals 1.5 for four different pairs and the probability
for the value 1.5 is obtained by adding the probabilities associated with (0, 3), ( I . 2), (2, I), and
(3, 0). That is, P(X = 1.5) = .0625 + .0625 + .0625 + .0625 = .25. The sampling distribution is
shown in Table 7.16.
Table 7.16
Samples Sample mean Sampling error
A, B, C. D 37.50 1S O
A, B, C, E 36.75 0.75
A, B, D, E 36.00 0
A, C, D, E 35.25 0.75
34.50 1.so
_i B, C, D, E
SAMPLING ERROR
7.12 Give the sampling error associated with each of the samples in problem 7.10.
Ans: The mean number of African-American-owned businesses for the five cities is 36,000. There are
five different possible samples as shown in Table 7.17. Each of the sample means shown in Table
7.17 is an estimate of the population mean. The sampling error is IX - 36 I,and is shown for each
sample in the last column in Table 7.17.
7.13 Give the minimum and the maximum sampling error encountered in the sampling described in
problem 7.1I and the probability associated with the minimum and maximum sampling error.
Ans: The population described in problem 7.11 has a uniform distribution and the mean is as follows:
p = 0 x .25 + 1 x .25 + 2 x .25 + 3 x .25 = 1.5.The minimum error is 0 and occurs when X = 1.5.
The probability of a sampling error equal to 0 is equal to the probability that the sample mean is
equal to 1.5 which is .25 as shown in Table 7.16. The maximum sampling error is 1.5 and occurs
when X = 0 or X = 3.0. The probability associated with the maximum sampling error is .0625 +
.0625 = .125.
MEAN AND STANDARD DEVIATION OF THE SAMPLE MEAN
7.14 Find the mean and variance of the sampling distribution of the sample mean derived in
problem 7.10 and given in Table 7.13. Verify that formulas (7.2) and (7.3) hold for this
problem.
Ans: Table 7.13 is reprinted below for ease of reference. The mean of the sample mean is
pK= C XP(X) = 34.50 x .2 + 35.25 x .2 + 36.00 x .2 + 36.75 x .2 + 37.50 x .2 = 36
The variance of the sample mean is 0; = C T2P(T)- p
: .
C Y2P(Y)= 1190.25x .2 + 1242.5625x .2 + 1296x .2 + 1350.5625x .2 + 1406.25 x .2
= 1297.I25 and p; = 1296,and therefore 0
: = 1297.125- 1296= I . 125.
CHAP. 71
-
, x
P(Z)
SAMPLING DISTRIBUTIONS
0 0.5 1.0 1.5 2.0 2.5 3.0
.0625 .1250 .1875 .2500 .I875 .I250 .0625
Table 7
.
1
3
X
P(x)
I P(X) I .2 .2 .2 .2 .2 I
0 I 2 3
.25 .25 .25 .25
157
In Example 7.7 it is shown that p = 36 and o2= 18. Formula (7.2)is verified since p = psr= 36.
-)(- - 2
To verify formula (7.3),note that -x- -
- - 1.125= 0,.
C
? N-n 18 5-4
n N-1 4 5-1
7.15 Find the mean and variance of the sampling distribution of the sample mean derived in
problem 7.1 I and given in Table 7.16. Verify that formulas (7.2) and (7.3) hold for this
problem.
Am: Table 7.16 is reprinted below for ease of reference. The mean of the sample mean is
& =cxP(X)
= O X .0625 +0.5 x .125 + 1 x .I875 + 1.5 x .25 + 2 x .1875 +2.5 x .125 + 3 x .0625
= 1S. The variance of the sample mean is 0
: = E X 'P( -jl ) - px.
2
CX2P(sI) = 0 ~ . 0 6 2 5 + . 2 5 ~ . 1 2 5 +
I ~ . 1 8 7 5 + 2 . 2 5 ~ . 2 5 + 4 ~ . 1 8 7 5 + 6 . 2 5 ~ . 1 2 5 + 9 ~ . 0 6 2 5
= 2.875 and pg = 2.25, and therefore02 = 2.875 - 2.25 = 0.625.
Table 7.16
In order to verify formulas (7.2) and (7.3),we need to find the values for p and d in problem
7.11.The population distribution was given in table 7.14 and is shown below.
The population mean is p = C xP(x) = 0 x .25 + I x .25 + 2 x .25 + 3 x .25 = 1.5, and the variance
is given by 0
' = E x2P(x)-p2 = 0 x .25 + 1 x .25 +4 x .25 +9 x .25 - 2.25 = 1.25.
Formula (7.2) is verified, since p = pLsr
= 1.5. To verify formula (7.3),note that since the
population is infinite, the finite population correction factor is not needed and - = --
- 0.625
= 0
:
. Anytime the population is infinite, the finite population correction factor is omitted and
O2 1.25
n 2
0'
formula (7.3)simplifies to 0
: = -.
n
SHAPE OF THE SAMPLING DISTRIBUTION OF THE MEAN
AND THE CENTRAL LIMIT THEOREM
7.16 The portfolios of wealthy people over the age of 50 produce yearly retirement incomes which
are normally distributed with mean equal to $125,000and standard deviation equal to $25,000.
Describe the distribution of the means of samples of size 16from this population.
158 SAMPLING DISTRIBUTIONS [CHAP. 7
Am: Since the population of yearly retirement incomes are normally distributed, the distribution of
sample means will be normally distributed for any sample size. The distribution of sample means
G 25,000
has mean $125,000and standard error q = -= -
-
- $6,250.
AJ16
7.17 The mean age of nonresidential buildings is 30 years and the standard deviation of the ages of
nonresidential buildings is 5 years. The distribution of the ages is not normally distributed.
Describe the distribution of the means of samples from the distribution of ages of non-
residential buildings for sample sizes n = 10and n = 50.
Am: The distribution type for the sample mean is unknown for small samples, i.e., n < 30. Therefore we
cannot say anything about the distribution of X when n = 10. For large samples, i.e., 230,the
central limit theorem assures us that T has a normal distribution. Therefore, for n = 50, the sample
mean has a normal distribution with mean = 30 years and standard error = -- --
- 0.707
years.
0 5
J;;-Jso
APPLICATIONS OF THE SAMPLING DISTRIBUTION OF
THE SAMPLE MEAN
7.18 In problem 7.16, find the probability of selecting a sample of 16 wealthy individuals whose
portfolios produce a mean retirement income exceeding $135,000.
Ans: We are asked to determine the probability that X exceeds $135,000.From problem 7.16, we know
that X has a normal distribution with mean $125,000 and standard error $6,250. The event
x > 135,000 is equivalent to the event z = > = 1.60. Since the
event X > 135,000 is equivalent to the event z > 1.60,the two events have equal probabilities.
That is, P(X > 135,000) = P(Z > 1.60). Using the standard normal distribution table, the
probability P(z > 1.60)is equal to .5 - .4452 = .0548.
That is, P(X > 135,000)= P(Z > 1.60) =
.0548.
- X - 125,000 135,000 - 125,000
6,250 6,250
7.19 In problem 7.17, determine the probability that a random sample of 50 nonresidential buildings
will have a mean age of 27.5 years or less.
Am: We are asked to determine P(X < 27.5). From problem 7.17, we know that X has a normal
distribution with mean 30 years and standard error 0.707 years. Since the event 51 < 27.5 is
X-30 275-30
equivalent to the event z = -< = -3.54, we have P(X < 27.5) = P(z < -3.54).
.707 .707
From the standard normal distribution table, the probability is less than .001.
SAMPLING DISTRIBUTIONOF THE SAMPLEPROPORTION
7.20 Approximately 8.8 million of the 12.8 million individuals receiving Aid to Families with
Dependent children are 18 or younger. What proportion of the individuals receiving such aid
are 18or younger?
CHAP. 71 SAMPLING DISTRIBUTIONS 159
State
A: Alabama
B: California
C: Colorado
D: Kentucky
E: Nebraska
Ans: The population consists of all individuals receiving Aid to Families with Dependent Children. The
proportion having the characteristic that an individual receiving such aid is 18 or younger is
At least one woman on death row
Yes
Yes
no
no
no
8.8
12.8
represented by p and is equal to -= 0.69.
7.21 Table 7.18 describes a population consisting of five states and indicates whether or not there is
at least one woman on death row for that state.
For this population 40%of the states have at least one woman on death row, that is, p = 0.4.
List all samples of size 2 from this population and find the proportion having at least one
woman on death row for each sample. Use this listing to derive the sampling distribution for
the sample proportion.
Ans: Table 7.19 lists all possible samples of size 2, indicates whether or not the states in the sample
have at least one woman on death row, gives the sample proportion for each sample, and gives the
probability for each sample.
From Table 7.19, it is seen that takes on the values 0, .5, and 1 with probabilities .3, .6, and .I,
respectively. The probability that iT = 0 is obtained by adding the probabilities for the samples (C,
D), (C, E), and (D, E) sincea = 0 if any one of these samples is selected. The other two
probabilities are obtained similarly.Table 7.20 gives the sampling distribution forF.
Table 7.20
7 1
MEAN AND STANDARD DEVIATION OF THE SAMPLE PROPORTION
7.22 Find the mean and variance of the sample proportion in problem 7.21 and verify that formula
(7.7)and formula (7.8)are satisfied.
160 SAMPLING DISTRIBUTIONS [CHAP.7
7.23
Ans: The mean of the sample proportion is pF= cp(p)= 0 x .3 + .5 x .6 + 1 x .1 = .4, and the
variance of the sample proportion is 0;= cD2
P(p) - pi = 0 x .3 + .25 x .6 + 1 x .1 - (.4)2
= .09.Formula (7.7) is satisfied since pF= p = .4. To verify formula (7.8), we need to show that
q
,= J ? x { ~ . Since the variance of the sample proportion is .09, the standard
deviation is the square root of .09 or .3. Therefore, to verify formula (7.8), all we need do is show
that the right-hand side of the equation equals .3.
Suppose 5% of all adults in America have 10 or more credit cards. Find the standard error of
the sample proportion in a sampleof 1,OOO American adults who have 10or more credit cards.
Ans: Since the sample size is less than 5% of the population size, the standard error is q -
SHAPE OF THE SAMPLING DISTRIBUTION OF THE SAMPLE PROPORTION
AND THE CENTRALLIMIT THEOREM
7.24 A sample of size n is selected from a large population. The proportion in the population with a
specified characteristic is p. The proportion in the sample with the specified characteristic is
p.In which of the following does
(a) n = 20, p = .9
have a normal distribution?
(c) n = 100, p = .97
(b) n = 15, p = .4 (6) n = IOOO, p = .01
np = 18, and nq = 2. Since both np and nq do not exceed 5, we are not sure that the
distribution of jT is normal.
np = 6, and nq =9. Since both np and nq exceed 5, the distribution of F is normal.
np = 97, and nq = 3. Since both np and nq do not exceed 5 , we are not sure that the
distribution of jT is normal.
np = 10,and nq = 990. Since both np and nq exceed 5, the distribution of is normal.
APPLICATIONS OF THE SAMPLING DISTRIBUTION
OF THE SAMPLEPROPORTION
7.25 Approximately 15% of the population is left-handed. What is the probability that in a sample
of 50 randomly chosen individuals, 30% or more in the sample will be left-handed? That is,
what is the probability of finding 15or more left-handersin the 50?
Ans: The sample proportion, jT,has a normal distribution since np = 50 x .15 = 7.5 > 5 and nq = 50 x
.85 = 42.5 > 5. The mean of the sample proportion is 15%and the standard error is oF= =
15x85 = 5.05%. To find the probability that F 2 30%,we first transform jT to a standard
lio
CHAP. 71 SAMPLING DISTRIBUTIONS 161
-
normal by the transormation z = -
’-”as follows. The inequality 2 30 is equivalent to z =
O
F
p-15 30-15
2 -= 2.97. Because F 2 30 is equivalent to z 2 2.97 we have P(F 2 30) =
5.05 5.05
P(z 22.97) = .5 - .4985 = .0015.That is, 15 or more left-handerswill be found in a sample of 50
individualsonly about 0.2% of the time.
7.26 Suppose 70% of the population support the ban on assault weapons. What is the probability
that between 65% and 75% in a poll of 100 individuals will support the ban on assault
weapons?
Ans: We are asked to find P(65 < j
j < 75). The sample proportion p has a normal distribution with
mean 70% and standard error air = = i
F= 4.58%. The event 65 < jT < 75 is
equivalentto the event -
<-<- or -1.09 c z c 1.09and since these events are
equivalent, they have equal probabilities.That is, P(65 c jT c 75) = P(-1.09 c z < 1.09) = 2 x
.3621 = .7242.
65-70 p-70 75-70
458 4.58 458
Supplementary Problems
SIMPLE RANDOM SAMPLING
7.27
7.28
Rather than have all 25 students in her Statistics class complete a teacher evaluation form, Mrs. Jones
decides to randomly select three students and have the department chairman interview the three
concerning her teaching after the course grade has been given. How many different samples of size 3 can
be selected?
Ans. 2,300
USA Today lists the 1900 most active New York stock exchange issues. How many samplesof size three
are possible when selected from these 1900stock issues?
Ans. 1,141,362,300
USING RANDOM NUMBER TABLES
7.29 In a table of random numbers such as Table 7.1 what relative frequency would you expect for each of the
digits 0, 1,2, 3,4, 5,6, 7, 8, and 9?
Ans. 0.1
7.30 The 100 U.S. Senators are listed in alphabeticalorder and then numbered as 00, 01, .. .,99. Use Table
7.1 to select 10 of the senators. Start with the two digits in columns 31 and 32 and row 6. The first
selected senator is numbered 76. Reading down the two columns from the number 76, what are the other
9 two-digit numbers of the other selected senators?
Ans. 59,78,44,93,69,25,43,50,
and 45
162 SAMPLING DISTRIBUTIONS [CHAP. 7
USING THE COMPUTER TO OBTAIN A SIMPLE RANDOM SAMPLE
7.31 Use Minitab to select 25 of the 1900 stock issues discussed in problem 7.28. Give the numbers in
ascending order of the 25 selected stock issues. Assume the stock issues are numbered from 1 to 1900.
Ans. MTB > set cl
DATA > 1:1900
DATA > end
MTB > sample 25 from cl put into c2
MTB > sort c2 put into c3
MTB > print c3
Data Display
c 3
26 219 238 288 328 423 766 785 943 1006 1187
1197 1234 1238 1257 1313 1532 1562 1613 1639 1728 1784
1788 1798 1870
7.32 Use Minitab to select 50 of the 3143 counties in the United States. Give the numbers in ascending order
of the
3143.
Ans.
50 selected counties. Assume the counties are in alphabetical order and are numbered from 1 to
MTB > set cl
DATA > 1:3143
DATA > end
MTB > sample 50 from cl put into c2
MTB > sort c2 put into c3
MTB > print c3
Data Display
c 3
33 47 53 139 265 312 321 343 444 519 599
652 658 764 789 829 949 964 1063 1134 1209 1300
1463 1567 1630 1694 1706 1818 1848 2021 2048 2138 2143
2270 2285 2306 2311 2408 2414 2463 2487 2513 2658 2701
2718 2763 2862 2892 2917 2975
SYSTEMATICRANDOM SAMPLING
7.33 Describe how the state patrol might obtain a systematic random sample of the speeds of vehicles along a
stretch of interstate 80.
Ans. Use a radar unit to measure the speeds of systematically chosen vehicles along the stretch of
interstate 80. For example, measure the speed of every tenth vehicle.
CLUSTER SAMPLING
7.34 A particular city is composed of 850 blocks and each block contains approximately 20 homes. Fifteen of
the 850 blocks are randomly selected and each household on the selected block is administered a survey
concerning issues of interest to the city council. How large is the population? How large is the sample?
What type of sampling is being used?
Ans. The population consists of 17,000 households. The sample consists of 300 households. Cluster
sampling is being used.
CHAP. 71
-
X
P(X)
SAMPLING DISTRIBUTIONS
0 .5 1 1.5 2
.01 .12 .42 .36 .09
163
STRATIFIED SAMPLING
7.35 A drug store chain has stores located in 5 cities as follows: 20 stores in Los Angeles, 40 stores in New
York, 20 stores in Seattle, 10 stores in Omaha, and 30 stores in Chicago. In order to estimate pharmacy
sales, the following number of stores are randomly selected from the 5 cities: 4 from Los Angeles, 8 from
New York, 4 from Seattle, 2 from Omaha, and 6 from Chicago. What type of sampling is being used?
Ans. stratified sampling
SAMPLING DISTRIBUTION OF THE SAMPLING MEAN
7.36
7.37
The export value in billions of dollars for four American cities for a recent year are as follows: A:
Detroit, 28; B: New York, 24; C: Los Angeles, 22; and D: Seattle, 20. If all possible samples of size 2 are
selected from this population of four cities and the sample mean value of exports computed for each
sample, give the sampling distribution of the sample mean.
Ans. P(X)= :,where X =21, 22, 23, 24, 25, or 26.
Consider a large population of households. Ten percent of the households have no home computer, 60
percent of the households have exactly one home computer, and 30 percent of the households have
exactly two home computers. Construct the sampling distribution for the mean of all possible samples of
size 2.
Ans.
SAMPLING ERROR
7.38
7.39
In reference to problem 7.36, if the mean of a sample of two cities is used to estimate the mean export
value of the four cities, what are the minimum and maximum values for the sampling error?
Ans. minimum sampling error = .5 maximum sampling error = 2.5
In reference to problem 7.37, what are the possible sampling error values associated with samples of two
households used to estimate the mean number of home computers per household for the population?
What is the most likely value for the samplingerror?
Am. 2, .3, .7, .8, and 1.2 The most likely value is .2. The probability that the sampling error equals .2 is
.42.
MEAN AND STANDARD DEVIATION OF THE SAMPLE MEAN
7.40 Find the mean and variance of the sampling distribution in problem 7.36. Verify that formulas (7.2)and
(7.3)hold for this sampling distribution.
Ans. p = 23.5 02=
8.75 pT = 23.5 0
: = 2.917
7.41 Find the mean and variance of the sampling distribution in problem 7.37. Verify that formulas (7.2) and
(7.3)hold for this sampling distribution.
164 SAMPLING DISTRIBUTIONS [CHAP. 7
2
Ans. p=1.2 02=0.36 p x = 1 . 2 0, =0.18
d
psi= p
d .36
n 2
Since the population is infinite, formula (7.3)becomes a
: = -
n
--
- - = 0;.
SHAPE OF THE SAMPLING DISTRIBUTION OF THE MEAN
AND THE CENTRAL LIMIT THEOREM
7.42
7.43
For cities of 100,000or more, the number of violent crimes per 1,000residents is normally distributed
with mean equal to 30 and standard deviation equal to 7. Describe the sampling distribution of means of
samples of size 4 from such cities.
Ans. The means of samples of size 4 will be normally distributed with mean equal to 30 and standard
error equal to 3.5.
For cities of 100,000 or more, the mean total crime rate per 1,000 residents is 95 and the standard
deviation of the total crime rate per 1,0oO is 15. The distribution of the total crime rate per 1,OOO
residents for cities of 100,000or more is not normally distributed. The distribution is skewed to the right.
Describe the sampling distribution of the means of samples of sizes 10and 50 from such cities.
Am. For samples of size 50, the central limit theorem assures us that the distribution of sample means is
normally distributed with mean equal to 95 and standard error equal to 2.12. For samples of size
10from a nonnormal distribution, the distribution form of the sample means is unknown.
APPLICATIONS OF THE SAMPLING DISTRIBUTIONOF THE SAMPLE MEAN
7.44 In problem 7.42, find the probability of selecting four cities of population 100,OOO or more whose mean
number of violent crimes per 1,000residents exceeds 40 violent crimes per 1,000 residents.
Ans. 0.0021
7.45 The mean number of bumped passengers per 10,000 passengers per day is 1.35 and the standard
deviation is 0.25. For a random selection of 40 days, what is the probability that the mean number of
bumped passengers per 1
0
,
O
O
O passengers for the 40 days will be between 1.25and 1S O ?
A m 0.9938
SAMPLING DISTRIBUTION OF THE SAMPLE PROPORTION
7.46 The world output of bicycles in 1995was 114 million. China produced 41 million bicycles in 1995. What
proportion of the world's new bicycles in 1995 were produced by China?
Ans. p = 0.36
7.47 A company has 38,000 employees and the proportion of the employees who have a college degree equals
0.25. In a sample of 400 of the employees, 30 percent have a college degree. How many of the company
employees have a college degree'?How many in the sample have a college degree?
Ans. 9,500 of the company employees have a college degree and 120 in the sample have a degree.
CHAP. 71 SAMPLING DISTRIBUTIONS 165
MEAN AND STANDARDDEVIATION OF THE SAMPLEPROPORTION
7.48 One percent of the vitamin C tablets produced by an industrial process are broken. The process fills
containers with 1,000tablets each. At regular intervals, a container is selected and 100 of the 1,000
tablets are inspected for broken tablets. What is the mean value and standard deviation of j7,where j
7
represents the proportion of broken tablets in the samples of lo0 selected tablets?
1,000- I
Ans. The mean value of iY is .01 and the standard deviation of is
,0094.
7.49 A survey reported that 20% of pregnant women smoke, 19% drink ,and 13% use crack cocaine or other
drugs. Assuming the survey results are correct, what is the mean and standard deviation of F,where jY is
the proportion of smokers in samples of 300 pregnant women?
Am. The mean value of i7 is 20% and the standard deviation of = 2.31%.
SHAPE OF THE SAMPLING DISTRIBUTION OF THE SAMPLEPROPORTION
AND THE CENTRALLIMIT THEOREM
7.51 For a sample of size 50, give the range of population proportion values for which j7 has an approximate
normal distribution.
Ans. . I < p < .9
APPLICATIONS OF THE SAMPLINGDISTRIBUTION OF THE SAMPLEPROPORTION
7.52 Thirty-five percent of the athletes who participated in the 1996 summer Olympics in Atlanta are female.
What is the probability of randomly selecting a sample of 50 of these athletes in which over half of those
selected are female?
Ans. P(j5 >SO) = P(z > 2.22) = .0132
7.53 It is estimated that 12% of Native Americans have diabetes. What is the probability of randomly selecting
100Native Americans and finding 5 or fewer in the hundred whom are diabetic?
Ans. P(p 5.05)= P(z < -2.15) = .0158
7.54 A pair of dice is tossed 180 times. What is the probability that the sum on the faces is equal to 7 on 20%
or more of the tosses?
Ans. P(F2.20) = .1170
Chapter 8
Estimationand Sample Size Determination:
One Population
POINT ESTIMATE
Estimation is the assignment of a numerical value to a population parameter or the construction
of an interval of numerical values likely to contain a population parameter. The value of a sample
statistic assigned to an unknown population parameter is called a point estimate of the parameter.
EXAMPLE 8.1 The mean starting salary for 10 randomly selected new graduates with a Masters of Business
Administration (MBA) at Fortune 500 companies is found to be $56,000.Fifty-six thousand dollars is a point
estimate of the mean starting salary for all new MBA degree graduates at Fortune 500 companies. The median
cost for 350 homes selected from across the United States is found to equal $1 15,OOO. The value of the sample
median, $1 15,000, is a point estimate of the median cost of a home in the United States. A survey of 950
households finds that 35% have a home computer. Thirty-five percent is a point estimate of the percentage of
homes that have a home computer.
INTERVAL ESTIMATE
In addition to a point estimate, it is desirable to have some idea of the size of the sampling error,
that is the difference between the population parameter and the point estimate. By utilizing the
standard error of the sample statistic and its sampling distribution, an interval estimate for the
population parameter may be developed. A confidence interval is an interval estimate that consists of
an interval of numbers obtained from the point estimate of the parameter along with a percentage that
specifies how confident we are that the value of the parameter lies in the interval. The confidence
percentage is called the confidence level. This chapter is concerned with the techniques for finding
confidence intervals for population means and population proportions.
CONFIDENCE INTERVAL FOR THE POPULATION MEAN: LARGE SAMPLES
According to the central limit theorem, the sample mean, 7,
has a normal distribution provided
the sample size is 30 or more. Furthermore, the mean of the sample mean equals the mean of the
population and the standard error of the mean equals the population standard deviation divided by
the square root of the sample size. In Chapter 7, the variable given in formula (7.6) (and reproduced
below) was shown to have a standard normal distribution provided n 2 30.
Since 95% of the area under the standard normal curve is between z = -1.96 and z = 1.96,and
since the variable in formula (7.6) has a standard normal distribution, we have the result shown in
formula (8.I )
166
CHAP. 81 ESTIMATION AND SAMPLE SIZE DETERMINATION 167
-
P(-1.96 < x--cL< 1.96)= .95
o
x
-
The inequality,-1.96 < -
'-'
< 1.96,is solved for p and the result is given in formula (8.2).
ajr
(8.2)
Y - 1.96- < p < X + 1.96-
The interval given in formula (8.2) is called a 95% confidence interval for the population mean, p.
The general form for the interval is shown in formula (8.3),where z represents the proper value from
the standard normal distributiontable as determinedby the desired confidencelevel.
EXAMPLE 8.2 The mean age of policyholders at Mutual Insurance Company is estimated by sampling the
records of 75 policyholders. The standard deviation of ages is known to equal 5.5 years and has not changed
over the years. However, it is unknown if the mean age has remained constant. The mean age for the sample of
75 policyholders is 30.5 years. The standard error of the ages is = ----
- .635 years. In order to find
a 90%confidence interval for p, it is necessary to find the value of z in formula (8.3)for confidence level equal
to 90%.If we let c be the correct value for z, then we are looking for that value of c which satisfies the equation
P(-c < z < c) = .90. Or, because of the symmetry of the z curve, we are looking for that value of c which
satisfies the equation P(0 < z < c) = .45. From the standard normal distribution table, we find P(0 < z < 1.64) =
.4495 and P(0 < z < 1.65) = .4505. The interpolated value for c is 1.645, which we round to 1.65. Figure 8-1
illustrates the confidence level and the corresponding values of z. Now the 90% confidence interval is computed
as follows. The lower limit of the interval is X- 1.650~
= 30.5 - 1.65 x .635 = 29.5 years, and the upper limit is
x + 1.650~= 31.5. We are 90% confident that the mean age of all 250,000 policyholders is between 29.5 and
31.5 years. It is important to note that p either is or is not between 29.5 and 31.5 years. To say we are 90%
confident that the mean age of all policyholders is between 29.5 and 31.5 years means that if this study were
conducted a large number of times and a confidence interval were computed each time, then 90% of all the
possible confidence intervals would contain the true value of p.
0 55
J;;-J75
-
Fig. 8-1
Since it is time consuming to determine the correct value of z in formula (8.3),the values for the most often
used confidence levels are given in Table 8.1. They are found in the same manner as illustrated in Example 8.2.
168 ESTIMATION AND SAMPLE SIZE DETERMINATION [CHAP. 8
Table 8.1
Confidence level Z value
I .96
I 99 I 2.58 I
EXAMPLE 8.3 The 91I response time for terrorists bomb threats was investigated. No historical data existed
concerning the standard deviation or mean for the response times. When (T is unknown and the sample size is 30
or more, the standard deviation of the sample itself is used in place of CJ when constructing a confidence interval
for p. A sample of 35 response times was obtained and it was found that the sample mean was 8.5 minutes and
the standard deviation was 4.5 minutes. The estimated standard error of the mean is represented by 3~ and is
equal to 5 -
- -
J
;
; -
- -
J35 -
- .76. From Table 8.1, the value for z is 2.58. The lower limit for the 99%
confidence interval is jC - 2.58 x q = 8.5 - 2.58 x .76 = 6.54 and the upper limit is 8.5 + 2.58 x .76 = 10.46.
We are 99% confident that the mean response time is between 6.54 minutes and 10.46 minutes. The true value
of p may or may not be between these limits. However, 99% of all such intervals contain p.It is this fact that
gives us 99% confidence in the interval from 6.54 to 10.46.
S 4 5
For large samples, no assumption is made concerning the shape of the population distribution. If
the population standard deviation is known, it is used in formula (8.3).If the population standard
deviation is unknown, it is estimated by using the sample standard deviation. The value for z in
formula (8.3)is found in the standard normal distribution table or, if applicable,by using Table 8.1.
MAXIMUM ERROR OF ESTIMATE FOR THE POPULATION MEAN
The inequality in formula(8.3)may be expressed as shown:
The left-hand side of formula (8.4)is the sampling error when X is used as a point estimate of p.
The right-hand side of formula (8.4) is the m i m u m error o
f estimate or margin of error when X is
used as a point estimate of p.That is, when X is used as a point estimate of p, the maximum error of
estimate or margin of error, E, is
When the confidence level is 95%, z = 1.96 and E = 1 . 9 6 ~ .
This value of E, 1.96%, is called the 95%
margin o
f error or simply margin o
f error when T is used as a point estimateof p.
EXAMPLE 8.4 The annual college tuition costs for 40 community colleges selected from across the United
States are given in Table 8.2. The mean for these 40 sample values is $1396, the standard deviation of the 40
s 655
values is $655, and the estimated standard error is 5 = -- -= $104. The mean tuition cost for all
J ; ; - f i
community colleges in the United States is represented by p. A point estimate of p is given by $1,396. The
margin of error associated with this estimate is 1.96 x 104 = $204. The 95% confidence interval for p, based
upon these data goes from 1396 - 204 = $1,192 to 1396 + 204 = $1,600. It is worth noting that the margin of
error is actually f $204, since the error may occur in either direction. Some publications give the margin of error
as E and some give it as fE. We shall omit the f sign when giving the margin of error.
CHAP. 81
1200
850
ESTIMATION AND SAMPLE SIZE DETERMINATION
850 I750 930
3000 1650 1640
Table 8.2
i
1500
700
I 1700 I 2100 I 900 I 1320 I
500 2050 I750
500 I780 2500
1200
1500
1950 675 2310
1000 1080 2900
2000
1950
I 750 I 500 I 1500 I 950 I
950 680 1875
560 900 1450
169
To compute the confidence interval for p using Minitab, set the data given in Table 8.2 into
column c1 and use the command standard deviation cl to compute the sample standard deviation. The
command zinterval 95% confidence, sigma = 655.44, data in c l uses the data in cl to compute a 95%
confidence interval for p.The output is shown below. The confidence interval extends from 1193 to 1599.The
interval computed in Example 8.4 extends from 1 192to 1600.The difference in the answers is due to rounding.
MTB > set cl
DATA> 1200 850 1700 1500 700 1200 1500 2000 1950 750 850
DATA> 3000 2100 500 500 1950 1000 950 560 500 1750 1650
DATA> 900 2050 1780 675 1080 680 900 1500 930 1640 1320
DATA> 1750 2500 2310 2900 1875 1450 950
DATA > end
MTB > standard deviation cl
Column Standard Deviation
Standard deviation of C1 = 655.44
MTB > name cl 'cost'
MTB > zinterval95% confidence, sigma =655.44, data in cl
Confidence Intervals
The assumed sigma = 655
Variable N Mean St Dev SEMean 95.0%C.I.
cost 40 1396 655 104 (1 193, 1599)
The width o
f a confidence interval is equal to the upper limit of the interval minus the lower limit
of the interval. In Example 8.4,the width of the 95% confidence interval is 1599- 1193= $406.
THE t DISTRIBUTION
S
When the sample size is less than 30 and the estimated standard error, 5 - - is used in place
of &jr in formula (8.3),the width of the confidence interval for p will generally be incorrect. The I
distribution, also known as the Student-t distribution, is used rather than the standard normal
distribution to find confidence intervals for p when the sample size is less than 30. In this section we
will discuss the properties of the t distribution, and in the next section, we will discuss the use of the
t distribution for confidence intervals when the sample size is small. The t distribution is used in
many different statistical applications.
The t distribution is actually a family of probability distributions. A parameter called the degrees
o
f freedom, and represented by df, determines each separate t distribution. The t distribution curves,
like the standard normal distribution curve, are centered at zero. However, the standard deviation of
- 42
170 ESTIMATION AND SAMPLE SIZE DETERMINATION [CHAP. 8
df
5
10
15
20
2s
each t distribution exceeds one and is dependent upon the value of df, whereas the standard deviation
of the standard normal distribution equals one. Figure 8-2 compares the standard normal curve with
the t distribution having 10degrees of freedom. Generally speaking, the t distribution curves have a
lower maximum height and thicker tails than the standard normal curve as shown in Fig. 8-2. All t
distribution curves are symmetrical about zero.
Area in the right tail under the t distributioncurve
.10 .05 .025 .o1 .005 .oo1
1.476 2.01s 2.571 3.365 4.032 5.893
1.372 1.812 2.228 2.764 3.169 4.144
1.341 1.753 2.131 2.602 2.947 3.733
1.325 1.725 2.086 2.528 2.845 3.552
1.316 1.708 2.060 2.485 2.787 3.450
Fig. 8-2
Appendix 3 gives right-hand tail areas under t distribution curves for degrees of freedom varying
from 1 to 40. This table is referred to as the t distribution table. Table 8.3 contains the rows
corresponding to degrees of freedom equal to 5, 10, 15, 20, and 25 selected from the t distribution
table in Appendix 3.
EXAMPLE 8.5 The t distribution having 10 degrees of freedom is shown in Fig. 8-3. To find the shaded area
in the right-hand tail to the right of t = 1.812, locate the degrees of freedom, 10, under the df column in Table
8.3. The t value, 1.812, is located under the column labeled .05 in Table 8.3. The shaded area is equal to .05.
Since the total area under the curve is equal to 1, the area under this curve to the left of t = 1.812 is 1- .05 =
.95. The area under this curve between t = -1.812 and t = 1.812is 1 - .05 -.05 = .90, since there is .05 to the
right of t = 1.812and .OS to the left of t = -1.8 12.
Fig. 8-3
CHAP. 81 ESTIMATION AND SAMPLE SIZE DETERMINATION 171
16.0
12.0
9.5
EXAMPLE 8.6 A t distribution curve with df = 20 is shown in Fig. 8-4. Suppose we wish to find the shaded
area under the curve between t = -2.528 and t = 2.086. From Table 8.3, the area to the right oft = 2.086 is .025,
and the area to the right of t = 2,528 is .01. By symmetry, the area to the left of t = -2.528 is also .O I. Since the
total area under the curve is 1,the shaded area is 1- .025 - .01= .965.
20.0 9.O 17.5
13.0 16.0 7.5
12.0 12.0 12.5
Fig. 8-4
df
19
CONFIDENCEINTERVALFOR THE POPULATIONMEAN: SMALL SAMPLES
Area in the right tail under the t distributioncurve
.10 .05 .025 .o1 ,005 .oo1
1.328 1.729 2.093 2.539 2.861 3.579
When a small sample (n < 30) is taken from a normally distributed population and the population
standard deviation, 0,is unknown, a confidence interval for the population mean, p, is given by
formula (8.6):
In formula (8.6),X is a point estimate of the population mean, SZ is the estimated standard error,
and t is determined by the confidence level and the degrees of freedom. The degrees of freedom is
given by formula (8.9,where n is the sample size.
d f = n - 1 (8.7)
EXAMPLE 8.7 The distance traveled, in hundreds of miles by automobile, was determined for 20 individuals
returning from vacation. The results are given in Table 8.4.
Table 8.4
I 12.5 I 15.0 I 10.0 I 15.5 I
8.0 5.O 15.0 9.5
For the data in Table 8.4, X = 12.375,s = 3.741, and sx = 0.837. To find a 90% confidence interval for the
mean vacation travel distance of all such individuals, we need to find the value of t in formula (8.6). The
degrees of freedom is df = n - 1 = 20 - 1 = 19. Using df = 19 and the t distribution table, we must find the t
value for which the area under the curve between -t and t is .90. This means the area to the left of -t is .05 and
the area to the right oft is also .05. Table 8.5, which is selected from the t distribution table, indicates that the
area to the right of 1.729 is .05. This is the proper value oft for the 90% confidence level when df = 19.
Table 8.5
172
8.O
16.0
12.0
ESTIMATION AND SAMPLE SIZE DETERMINATION
5.O 15.0 19.5
15.0 9.0 7.5
13.0 6.0 37.5
[CHAP. 8
The following technique is also often recommended for finding the t value for a confidence
interval: Tofind the t valuefor a given confidence interval, subtract the confidence level from I and
divide the answer by 2 tofind the correct area in the right-hand tuil. In this Example, the confidence
level is 90% or .90. Subtracting .90 from 1, we get .lo. Now, dividing S O by 2, we get .05 as the
right-hand tail area. From Table 8.5, we see that the proper t value is 1.729. Many students prefer
this technique for finding the proper t value for confidence intervals.
= 12.375- 1.729x
.837 = 10.928and the upper limit is X + t s = 12.375 + 1.729 x .837 = 13.822. A 90% confidence
interval for p extends from 1,093to 1,382 miles. The distribution of all vacation travel distances are
assumed to be normally distributed.
To find a 90% confidence interval for p using Minitab, use the set command to enter the data
given in Table 8.4. The command tinterval 90 percent confidence data in cl is used to find the
confidence interval. The output is shown below. The confidence interval is (10.928, 13.822).
Using formula (8.6),the lower limit for the 90%confidence interval is X - t
MTB > set cl
DATA> 12.5 8.0 16.0 12.0 9.5 15.0 5.0 20.0 13.0 12.0 10.0
DATA> 15.0 9.0 16.0 12.0 15.5 9.5 17.5 7.5 12.5
DATA > end
MTB > name c1 'distance'
MTB > tinterval90 percent confidence data in c l
Confidence Intervals
Variable N Mean St Dev SEMean 9O.O%C.I.
Distance 20 12.375 3.741 0.837 (10.928, 13.822)
EXAMPLE8.8 The health-care costs, in thousands of dollars for 20 males aged 75 or over, are shown in
Table 8.6. Formula (8.6) may not be used to set a confidence interval on the mean health-care cost for such
individuals, since the sample data indicate that the distribution of such health-care costs is not normally
distributed. The $515,000 and the $950,000 costs are far removed from the remaining health-care costs. This
indicates that the distribution of health-care costs for males aged 75 or over is skewed to the right and therefore
is not normally distributed. Formula (8.6) is applicable only if it is reasonable to assume that the population
characteristic is normally distributed.
Table 8.6
I 8.5 I 5 15.0 I 950.0 I 5.5 I
EXAMPLE 8.9 The number of square feet per mall devoted to children's apparel was determined for 15 malls
selected across the United States. The summary statistics for these 15 malls are as fo1lows:Y = 2,700, s = 450,
and 3~ = 116.19. To determine a 99% confidence interval for p, we need to find the value t where the area
under the t distribution curve between -t and t is .99 and the degrees of freedom = n - 1 = 15 - I = 14. Since the
area between -t and t is .99, the area to the left of -t is .005 and the area to the right of t is .005. Table 8.7 is
taken from the t distribution table and shows that the area to the right of 2.977 is .005. The lower limit for the
99%confidence interval is X - tsz = 2,700 - 2.977 x 116.19 = 2,354 ft2 and the upper limit is X + tsT( =
2,700 +2.977 x 116.19= 3,046 ft2.The 99% confidence interval for p is (2,354 to 3,046).
CHAP. 81 ESTIMATION AND SAMPLE SIZE DETERMINATION
df
14
173
Area in the right tail under the t distribution curve
.10 .05 .025 .o1 .005 .oo1
1.345 1.761 2.145 2.624 2.977 3.787
CONFIDENCE INTERVAL FOR THE POPULATION PROPORTION:
LARGE SAMPLES
In chapter 7, the variable shown in formula (7.10) and reproduced below was shown to have a
standard normal distribution provided np > 5 and nq > 5.
(7.10)
Since 95% of the area under the standard normal curve is between -I .96 and 1.96, and since the
variable in formula (7./0) has a standard normal distribution, we have the result shown below:
-
P(-1.96<-<1.96)
P-clfi = .95
3
From Chapter 7, we also know that pF=p and q
,= -. Substituting for p, and 075 in the inequality
/?
-
P -Pjj
-1.96 < -< 1.96given in formula (8.8)and solving the resulting inequa
%
-
p - 1 9 6 E < p < p+ 1 9 6 E
ity for p, we obtain
To obtain numerical values for the lower and upper limits of the confidence interval, it is necessary
to substitute the sample values p and for p and q in the expression for q.Making these
substitutions, we obtain formula (8.10).
(8.10)
The interval given in formula (8.10)is called a 95% confidence interval for p. The general form for
the interval is shown in formula (8./1),where z represents the proper value from the standard normal
distribution table as determined by the desired confidence level.
n
(8.I I )
EXAMPLE 8.10 A study of 75 small-business owners determined that 80% got the money to start their
business from personal savings or credit cards. If p represents the proportion of all small-business owners who
got the money to start their business from personal savings or credit cards, then a point estimate of p is f7 =
_ _
F=
80%. The estimated standard error of the proportion is represented by sr), and is given by SF, =
174 ESTIMATIONAND SAMPLE SIZE DETERMINATION [CHAP. 8
= 4.62%. From Table 8.1, the z value for a 99% confidence interval is 2.58. The lower limit for a
4 7
99%confidence interval is p- z = 80 - 2.58 x 4.62 = 68.08% and the upper limit is p+ z = 80 +
2.58 x 4.62 = 91.92%.Formula (8.
II) is considered to be valid if the sample size satisfies the inequalities np> 5
and nq >5. Since p and q are unknown, we check the sample size requirement by checking to see if np > 5 and
n q > 5. In this case nF = 75 x .8 = 60 and n q = 75 x .2 = 15,and since both nF and nT exceed 5, the sample
size is large enough to assure the valid use of formula (8.II).
EXAMPLE 8.11 In a random sample of 40 workplace homicides, it was found that 2 of the 40 were due to a
personal dispute. A point estimate of the population proportion of workplace homicides due to a personal
dispute is = = .05 or 5%. Formula (8.11)
may not be used to set a confidence interval on p, the population
proportion of homicides due to a personal dispute, since np = 40 x .05 = 2 is less than 5. Note that since p is
unknown, np cannot be computed. When p is unknown, nF and n q are computed to determine the appropriate-
ness of using formula (8.11).
40
DETERMINING THE SAMPLESIZE FOR THE ESTIMATION
OF THE POPULATION MEAN
The maximum error of estimate when using Fas a point estimate of p is defined by formula (8.5)
and is restated below.
If the maximum error of estimate is specified, then the sample size necessary to give this maximum
error may be determined from formula (8.5).Replacing by - to obtain E = z- and then
solving for n we obtain formula (8.22). The value obtained for n is always rounded up to the next
whole number. This is a conservative approach and sometimes results in a sample size larger than
actually needed.
0 0
J;; &’
Z2(T2
E2
n = - (8.22)
EXAMPLE 8.12 In Example 8.4, the mean of a sample of 40 tuition costs for community colleges was found
to equal to $1,396 and the standard deviation was $655. The margin of error associated with using $1,396 as a
point estimate of p is equal to $204. To reduce the margin of error to $100, a larger sample is needed. The
required sample size is n = 6552 = 164.8.To obtain a conservative estimate, we round the estimate up to
loo2
165. Note that we estimated 0 by s, because (T is unknown. The 40 tuition costs obtained in the original survey
would be supplemented by 165 - 40 = 125 additional community colleges. The new estimate based on the 165
tuition costs would have a margin of error of $100.
To use formula (8.12) to determine the sample size, either CT or an estimate of (T is needed. In
Example 8.12, the original study, based on sample size 40,may be regarded as a pilot sample. When
a pilot sample is taken, the sample standard deviation is used in place of (T.In other instances, a
historical value of (3 may exist. In some instances, the maximum and minimum value of the
characteristic being studied may be known and an estimate of (T may be obtained by dividing the
range by 4.
CHAP. 81 ESTIMATION AND SAMPLE SIZE DETERMINATION 175
EXAMPLE 8.13 A sociologist desires to estimate the mean age of teenagers having abortions. She wishes to
estimate p with a 99% confidence interval so that the maximum error of estimate is E = .1 years. The z value is
2.58. The range of ages for teenagers is 19- 13= 6 years. A rough estimate of the standard deviation is obtained
by dividing the range by 4 to obtain 1.5 years. This method for estimating 0 works best for mound-shaped
distributions. The approximate sample size is n = 2582 152 = 1497.7. Rounding up, we obtain n = 1,498.
.1*
DETERMINING THE SAMPLE SIZE FOR THE ESTIMATION
OF THE POPULATIONPROPORTION
When j
j is used as a point estimate of p, the maximum error of estimate is given by
E = z q (8.13)
If the maximum error of estimate is specified, then the sample size necessary to give this maximum
error may be determined from formula (8.13).If q is replaced by and the resultant equation is
solved for n, we obtain
z2P9
E2
n=- (8.14)
Since p and q are usually unknown, they must be estimated when formula (8.14) is used to
determine a sample size to give a specified maximum error of estimate. If a reasonable estimate of p
and q exists, then the estimate is used in the formula. If no reasonable estimate is known, then both p
and q are replaced by .5. This gives a conservative estimate for n. That is, replacing p and q by .5
usually gives a larger sample size than is needed, but it covers all cases so to speak.
EXAMPLE 8.14 A study is undertaken to obtain a precise estimate of the proportion of diabetics in the United
States. Estimates ranging from 2% to 5% are found in various publications. The sample size necessary to
1.96’ X .05 X .95
.oo1
estimate the population percentage to within 0.1% with 95% confidence is n = = 182,476.
When a range of possible values for p exists, as in this problem, use the value closest to .5 as a reasonable
estimate for p. In this example, the value in the range from .02 to .05 closest to .5 is .05. If the prior estimate of
p were not used, and .5 is used for p and q, then the computed sample size is n = = 960,400.
.0Ol2
Notice that using the prior information concerning p makes a tremendous difference in this example.
Solved Problems
POINT ESTIMATE
8.1 The mean annual salary for public school teachers is $32,000.The mean salary for a sample of
750 public school teachers equals $31,895. Identify the population, the population mean, and
the sample mean. Identify the parameter and the point estimate of the parameter.
Arts. The population consists of all public school teachers. p = $32,000, X = $31,895.The parameter is
p.A point estimate of the parameter p is $31,895.
176 ESTIMATION AND SAMPLE SIZE DETERMINATION [CHAP. 8
8.2 Ninety-eight percent of U.S. homes have a TV. A survey of 2000 homes finds that 1,925have
a TV.Identify the population, the population proportion, and the sample proportion. Identify
the parameter and a point estimate of the parameter.
Ans. The population consists of all U.S.homes. p = 98%, j
T = 96.25%. The parameter is p. A point
estimate of p is 96.25%.
INTERVAL ESTIMATE
8.3 When a multiple of the standard error of a point estimate is subtracted from the point estimate
to obtain a lower limit and added to the point estimate to obtain an upper limit, what term is
used to describe the numbers between the lower limit and the upper limit?
Ans. interval estimate
CONFIDENCE INTERVAL FOR THE POPULATION MEAN: LARGE SAMPLES
Table 8.8
8.4 A sample of 50 taxpayers receiving tax refunds is shown in Table 8.8. Find an 80% confidence
interval for p,where p represents the mean refund for all taxpayers receiving a refund.
Ans. For the data in Table 8.8, C x = 51,685, C x2 = 70,158,336, X= 1033.7, s = 584.3, JX = 82.6.
From Table 8.1, the z value for 80% confidence is 1.28. Lower limit = 1033.7 - 1.28 x 82.6 =
927.97, upper limit = 1033.7 + 1.28 x 82.6 = 1139.43.The 80% confidence interval is (927.97,
1139.43).
8.5 Use Minitab to find the 80%confidence interval for p in problem 8.4.
Ans. MTB > set cl
DATA> 1515 1432 120
DATA> 723 1212 202
DATA> 868 1236 1631
DATA> 1769 232 171
DATA> 1781 313 1067
DATA >end
MTB > name cl 'refund'
MTB > standard deviation c1
270 312 1904
118 1726 1562
681 1243 392
271 1062 1023
367 283 1973
662 1857 903 1313 959
518 1671 395 1744 764
623 169 589 1901 168
800 762 364 1396 668
CHAP. 81
df
7
ESTIMATION AND SAMPLE SIZE DETERMINATION
Area in the right tail under the t distributioncurve
.10 .05 .025 .o1 .005 .001
1.415 1.895 2.365 2.998 3.499 4.785
177
Standard deviation of refund = 584.35
MTB > zinterval 80 percent confidence sigma = 584.35data in cl
Confidence Intervals
The assumed sigma = 584
Variable N Mean St Dev SEMean 80.0%C.I.
Refund 50 1033.7 584.3 82.6 (927.8, 1139.6)
MAXIMUM ERROR OF ESTIMATE FOR THE POPULATION MEAN
8.6 A sample of size 50 is selected from a population having a standard deviation equal to 4. Find
the maximum error of estimate associated with the confidence intervals having the following
confidence levels: (a) 80%;(b) 90%; (c)95%; (499%.
4
Ans. The standard error of the mean is - = S66. Using the z values given in Table 8.1, the
following maximum errors of estimate are obtained.
sso
(a) E = 1.28x .566= .724
(c) E = 1.96x .566= 1.109
(b) E = 1.65 x .566= .934
(4 E = 2.58 x .566= 1.460
8.7 Describe the effect of the following on the maximum error of estimate when determining a
confidence interval for the mean: (a) sample size; (b) confidence level; (c) variability of the
characteristicbeing measured
Ans. (a) The maximum error of estimate decreases when the sample size is increased.
(b) The maximum error of estimate increases if the confidence level is increased.
(c) The larger the variability of the characteristic being measured, the larger the maximum error
of estimate.
THE t DISTRIBUTION
8.8 Table 8.9 contains the row corresponding to 7 degrees of freedom taken from the t distribution
table in Appendix 3. For a t distribution curve with df = 7, find the following:
(a) Area under the curve to the right oft = 2.998
(b) Area under the curve to the left oft =-2.998
(c) Area under the curve between t =-2.998 and t = 2.998
Ans. (a) .01
(b) Since the curve is symmetrical about 0, the answer is the same as in part (a),.01
(c) Since the total area under the curve is 1, the answer is 1 - .01- .01= .98
8.9 Refer to problem 8.8 and Table 8.9 to find the following.
(a) Find the t value for which the area under the curve to the right oft is .05.
(b) Find the t value for which the area under the curve to the left oft is .05.
(c) By considering the answers to parts (a) and (b), find that positive t value for which the
area between -t and t is equal to .90.
178
409
663
304
535
628
[CHAP. 8
487 480 535
676 494 332
565 670 434
554 515 665
308 319 519
ESTIMATION AND SAMPLE SIZE DETERMINATION
Ans. (a) I .895 (6)-1.895 (c) 1.895
CONFIDENCE INTERVAL FOR THE POPULATION MEAN: SMALL SAMPLES
Area in the right tail under the t distributioncurve
df .I0 .05 .025 .o1 .005 .oo1
* 19 1.328 1.729 2.093 2.539 2.861 3.579
8.10
8.11
In a transportation study, 20 cities were randomly selected from all cities having a population
of 50,000 or more. The number of cars per 1,000people was determined for each selected city,
and the results are shown in Table 8.10. Find a 95% confidence interval for p, where p is the
mean number of cars per 1,000 people for all cities having a population of 50,000 or more.
What assumption is necessary for the confidence interval to be valid?
Table 8.10
Am. For the data in Table 8.10, Z x = 10,092,C x2 = 5,381,738, X=504.6, s = 123.4, 5 = 27.6. The
t distribution is used because the sample is small. The degrees of freedom is df = 20 - I = 19.The
t value in the confidence interval is determined from Table 8.11, which is taken from the t
distribution table in Appendix 3. From Table 8.11, we see that the area to the right of 2.093 is ,025
and the area to the left of -2.093 is .025, and therefore the area between -2.093 and 2.093 is .95.
The lower limit of the 95% confidence interval is X - t
q = 504.6 - 2.093 x 27.6 = 446.8 and the
upper limit is X + t
s = 504.6 + 2.093 x 27.6 = 562.4. The 95% confidence interval is (446.8,
562.4). It is assumed that the distribution of the number of cars per 1,000 people in cities of
50,000or over is normally distributed.
Find the Minitab solution to problem 8.10.
Ans. MTB > Set cl
DATA> 409 663 304 535 628 487 676 565 554 308 480
DATA> 494 670 515 319 535 332 434 665 519
DATA > end
MTB > name cl 'cars'
MTB > tinterval 95 percent confidence data in cl
Confidence Intervals
Variable N Mean St Dev SE Mean 95.0% C.I.
Cars 20 504.6 123.4 27.6 (446.8,562.4)
CONFIDENCE INTERVAL FOR THE POPULATION PROPORTION:
LARGE SAMPLES
8.12 A national survey of 1200 adults found that 450 of those surveyed were pro-choice on the
abortion issue. Find a 95% confidence interval for p, the proportion of all adults who are pro-
choice on the abortion issue.
CHAP. 81 ESTIMATION AND SAMPLE SIZE DETERMINATION 179
450 -
Ans. The sample proportion who are pro-choice is =1
2
0
0 - .375 and the proportion who are not pro-
.625. The estimated standard error of the proportion is
choice or undecided is q==-
750 -
,/?
= /.375 '625 = .014. The z value is 1.96. The lower limit of the 95% confidence
1200
interval is p-z IF;'
--
- .375 - 1.96 x .014 = .348 and the upper limit is p+z
1.96 x .014 = .402. The 95%confidence interval is (.348, .402).
8.13 A national survey of 500 African-Americans found that 62% in the survey favored affirmative
action programs. Find a 90% confidence interval for p, the proportion of all African-Americans
who favor affirmative action programs.
Ans. The sample percent who favor affirmativeaction programs is 62%. The sample percent who do not
favor affirmative action programs or are undecided is 38%. The estimated standard error of
- .
proportion, expressed as a percentage, is JY
-= 4
% = 2.17%. The z value is 1.65. The
lower limit of the 90% confidence interval is p- zd
e= 62 - 1.65 x 2.17 = 58.4% and the
upper limit is p+zJ
'
:
'-= 62 + 1.65 x 2.17 = 65.6%. The 90% confidence interval is (58.4%,
65.6%).
DETERMINING THE SAMPLESIZEFOR THE ESTIMATION
OF THE POPULATIONMEAN
8.14 A machine fills containers with corn meal. The machine is set to put 680 grams in each
container on the average. The standard deviation is equal to 0.5 gram. The average fill is
known to shift from time to time. However, the variability remains constant. That is, o remains
constant at 0.5 gram. In order to estimate p,how many containers should be selected from a
large production run so that the maximum error of estimate equal 0.2 gram with probability
0.95?
z2cT2
Ans. The sample size is determined by use of the formula n = 2.
The value for CJ is known to equal
0.5, E is specified to be 0.2 and for probability equal to 0.95, the z value is 1.96. Therefore, the
E
Z2 cT2
E .2
sample size is n = 7
= 1*962 52 = 24.01. The required sample size is obtained by rounding
the answer up to 25.
8.15 A pilot study of 250 individuals found that the mean annual health-care cost per person was
$2550 and the standard deviation was $1350. How large a sample is needed to estimate the true
annual health-care cost with a maximum error of estimate equal to $100 with probability equal
to 0.99?
Z2 cT2
Ans. The sample size is determined by use of the formula n = 7.
The value for G is estimated from
the pilot study to be $1350. E is specified to be $100 and for probability equal to 0.99, the z value
E
180 ESTIMATION AND SAMPLE SIZE DETERMINATION [CHAP. 8
2582 *3502 = 1213.13. The required sample
is 2.58. Therefore, the sample size is n = --j- =
z2o2
E loo2
size is obtained by rounding the answer up to 1214. If the results from the pilot study are used,
then an additional 1214- 250 = 964 individuals need to be surveyed.
DETERMINING THE SAMPLESIZE FOR THE ESTIMATION
OF THE POPULATIONPROPORTION
8.16 A large study is undertaken to estimate thp, percentage of students in grades 9 through 12 who
use cigarettes. Other studies have indicated that the percent ranges between 30% and 35%.
How large a sample is needed in order to estimate the true percentage for all such students with
a maximum error of estimate equal to 0.5%with a probability of 0.90?
z2pq
Am. The sample size is given by n = 7.
The specified value for E is 0.5%. The z value for
probability equal to 0.90 is 1.65.The previous studies indicate that p is between 30% and 35%.
When a range of likely values for p exist, the value closest to 50% is used. This value gives a
E L
conservative estimate of the sample size. The sample size is n = 35 65 = 24,774.75. The
5*
sample size is 24.775.
8.17 Find the sample size in problem 8.16 if no prior estimate of the population proportion exists.
2
Ans. The sample size is given by n = -
". The specified value for E is 0.5%. The L value for
probability equal to 0.90 is 1.65.When no prior estimate for p exists, p and q are set equal to 50%
E2
in the sample size formula. Therefore, 1*652x50x50
= 27,225. Note that 2,450 fewer subjects are
52
needed when the prior estimate of 35% is used.
SupplementaryProblems
POINT ESTIMATE
8.18 Table 8.8 in problem 8.4 contains a sample of 50 tax refunds received by taxpayers. Give point estimates
for the population mean, population median, and the population standard deviation.
Ans. $1033.70, the mean of the SO tax refunds, is a point estimate of the population mean.
$1064.50,the median of the 50 tax refunds, is a point estimate of the population median.
$584.30, the standard deviation of the 50 tax refunds, is a point estimate of the population standard
deviation.
8.19 Table 8.10in problem 8.10 contains a sample of the number of cars per 1,000people for 20 cities having
a population of 50,000 or more. Give a point estimate of the proportion of all such cities with 600 or
more cars per 1,000people.
Ans. In this sample, there are S cities with 600 or more cars per 1,000.The point estimate of p is .2S or
25%.
CHAP. 81 ESTIMATION AND SAMPLE SIZE DETERMINATION
Distribution type
t distribution, df = 10
t distribution, df = 20
t distribution, df = 30
standard normal
181
Area under the curve to the
right of this value is .025
a
b
d
c
INTERVAL ESTIMATE
8.20 What happens to the width of an interval estimate when the sample size is increased?
Ans. The width of an interval estimate decreases when the sample s i x is increased.
CONFIDENCE INTERVAL FOR THE POPULATION MEAN: LARGE SAMPLES
8.21 One hundred subjects in a psychological study had a mean score of 35 on a test instrument designed to
measure anger. The standard deviation of the 100test scores was 10.Find a 99% confidence interval for
the mean anger score of the population from which the sample was selected.
Am. The confidence interval is X - z n < p < X + z n .
The lower limit is X-2% = 35 - 2.58 x 1 = 32.42 and the upper limit is X-2% = 35+2.58x 1 =
37.58. The interval is (32.42, 37.58).
8.22 The width of a large sample confidence interval is equal to w = (X + zox) - (X- zox)= 2 ~ 0 ~ .
Replacing
o
x by -, w is given as w = 22-. Discuss the effect of the confidence level, the standard deviation,
and the sample size on the width.
Ans.
0 0
& &
The width varies directly as the confidence level, directly as the variability of the characteristic
being measured, and inversely as the square root of the sample size.
MAXIMUM ERROR OF ESTIMATE FOR THE POPULATION MEAN
8.23
8.24
The U.S. abortion rate per 1,000 female residents in the age group 15 - 44 was determined for 35
different cities having a population of 25,000 or more across the U.S. The mean for the 35 cities was
equal to 24.5 and the standard deviation was equal to 4.5. What is the maximum error of estimate when a
90%confidence interval is constructed for p'?
Ans. The maximum error of estimate is E = zox.For a 90% confidence interval, z = 1.65.The estimated
standard error is 0.76. E = 1.65 x .76 = 1.25.
A study concerning one-way commuting times to work was conducted in Omaha, Nebraska. The mean
commuting time for 300 randomly selected workers was 45 minutes. The margin of error for the study
was 7 minutes. What is the 95% confidence interval for p?
Am. The 95% confidence interval is T - 1.96% < p < X + 1 . 9 6 ~ .
The margin of error is E = 1.96ojr
= 7 and therefore, the lower limit of the interval is 38 and the upper limit is 52.
THE T DISTRIBUTION
8.25 Use the t distribution table and the standard normal distribution table to find the values for a, b, c, and d,
What distribution does the t distribution approach as the degrees of freedom increases'?
Table 8.12
I82
df
5
ESTIMATION AND SAMPLE SIZE DETERMINATION [CHAP. 8
Standard deviation of the t
distribution with this value for df
1.29
Atis. a = 2.228 b = 2.086 c = 2.042 d = 1.96
The t distribution approaches the standard normal distribution when the degrees of freedom are
increased. This is illustrated by the results in Table 8.12.
525
300
8.26 The standard deviation of the t distribution is equal to jdP12.
- for df > 2. Find the standard deviation
for the t distributions having df values equal to 5,10, 20, 30, 50, and 100. What value is the standard
deviation getting close to when the degrees of freedom get larger?
500 390
715 425
Am. The standard deviations are shown in Table 8.13.The standard deviation is approaching one as the
degrees of freedom get larger. The t distribution approaches the standard normal distribution when
the degrees of freedom are increased.
1,127
625
475
Table 8.13
255 500
1,454 375
814 750
10
20
30
50
1.12
1.05
1.04
I .02
I I
100 1.01
1
CONFIDENCE INTERVAL FOR THE POPULATION MEAN: SMALL SAMPLES
8.27 In order to estimate the lifetime of a new bulb, ten were tested and the mean lifetime was equal to 835
hours with a standard deviation equal to 45 hours. A stem-and-leaf of the lifetimes indicated that it was
reasonable to assume that lifetimes were normally distributed. Determine a 99% confidence interval for
p,where p represents the mean lifetime of the population of lifetimes.
Am. The confidence interval is X - ts;3 < p < X -tts-jl, where X = 835, s = 45, sx = 14.23, and t =
3.250. The lower limit is X - ts? = 835 - 46.25 = 788.75 hours and the upper limit is X + t q =
835 + 46.25 = 881.25 hours.
8.28 The heights of 15 randomly selected buildings in Chicago are given in Table 8.14. Find a 90%
confidence interval on the mean height of buildings in Chicago.
Table 8.14
Am. A Minitab histogram of the data in Table 8.14 is shown in Fig. 8-5. Since the heights are not
normally distributed, the confidence interval using the t distribution is not appropriate.
CHAP. 81
2 .
1
ESTIMATION AND SAMPLE SIZE DETERMINATION 183
Fig. 8-5
CONFIDENCE INTERVAL FOR THE POPULATION PROPORTION:LARGE SAMPLES
8.29 In a survey of 900 adults, 360 responded “yes” to the question “Have you attended a major league
baseball game in the last year?” Determine a 90% confidence interval for p, the proportion of all adults
who attended a major league baseball game in the last year.
Am. The sample proportion attending a game in the last year is iT = .40 and the proportion not
attending is = .60. The z value for a 90% confidence interval is 1.65. The confidence interval
for p is p- z
]
y< p < p+ ZJF.
The maximum error of estimate is E = z,/?= .027. The
lower limit of the confidence interval is .40 - .027 = .373 and the upper limit is .40 +.027 = .427.
The 90%interval is (.373, .427).
8.30 Use the data in Table 8.8, found in problem 8.4, to find a 99% confidence interval for the proportion of
all taxpayers receiving a refund who receive a refund of more than $500.
Am. Thirty-seven of the fifty refunds in Table 8.8 exceed $500.The sample proportion exceeding $500
is .74. The z value for a 99% confidence interval is 2.58. The confidence interval for D is
p- z/F-
< p < p+ z/F-.
The maximum error of estimate is E = z,/?= .16 The lower
limit is .74 - .16 = .58 and the upper limit is .74 + .16 = .90.The 99%interval is ( 5 8 , .90).
DETERMINING THE SAMPLE SIZE FOR THE ESTIMATIONOF THE POPULATION MEAN
8.31
8
.
3
2
The estimated standard deviation of commutingdistances for workers in a large city is determined to be 3
miles in a pilot study. How large a sample is needed to estimate the mean commuting distance of all
workers in the city to within .5 mile with 95% confidence?
. For this problem, E = .5, z = 1.96, and CT is estimated to be
Am. The sample size is given by n = -
3. The sample size is determined to be 139.
Z2 o2
E2
In problem 8.31, what size sample is required to estimate p to within .I mile with 95% confidence?
Am. n = 3,458
184 ESTIMATION AND SAMPLE SIZE DETERMINATION [CHAP. 8
DETERMINING THE SAMPLE SIZE FOR THE ESTIMATION OF THE
POPULATION PROPORTION
8.33
8.34
What sample size is required to estimate the proportion of adults who wear a beeper to within 3% with
probability 0.95‘?A survey taken one year ago indicated that 15% of all adults wore a beeper.
2
Am. The sample size is given by n = -
“ . The error of estimate, E, is given to be 396, and the z value
E2
-
-
1.516~
X I5X 85
is 1.96. If we assume that the current proportion is “close” to 1596, then n =
32
544.2, which is rounded up to 545.
Suppose in problem 8.33 that there is no previous estimate of the population proportion. What sample
size would be required to estimate p to within 3% with probability 0.95?
2
Am. The sample size is given by n = -
pq . The error of estimate, E, is given to be 3%, and the z value
E2
= 1,067.I, which is
1.9fj2X 50x 50
is 1.96.With no previous estimate for p, use p = q = 50%. n =
32
rounded up to 1,068.
Chapter 9
Tests of Hypotheses: One Population
NULL HYPOTHESISAND ALTERNATIVEHYPOTHESIS
There are a number of mail order companies in the US.Consider such a company called L. C .
Stephens Inc. The mean sales per order for L. C. Stephens, determined from a large database last
year, is known to equal $155. From the same database, the standard deviation is determined to be 0=
$50. The company wishes to determine whether the mean sales per order for the current year is
different from the historical mean of $155. If p represents the mean sales per order for the current
year, then the company is interested in determining if p # $155. The alternative hypothesis, also
often called the research hypothesis, is related to the purpose of the research. In this instance, the
purpose of the research is to determine whether p f $155. The research hypothesis is represented
symbolicallyby Ha:p f $155. The negation of the research hypothesis is called the null hypothesis.
In this case, the negation of the research hypothesis is that p = $155. The null hypothesis is
represented symbolically by Ho: ~1 = $155. The company wishes to conduct a test of hypothesis to
determine which of the two hypotheses is true. A test of hypothesis where the research hypothesis is
of the form H
,: p f (some constant) is called a two-tailed test. The reason for this term will be clear
when the details of the test procedure are discussed.
EXAMPLE 9.1 A tire manufacturer claims that the mean mileage for their premium brand tire is 60,000 miles.
A consumer organization doubts the claim and decides to test it. That is, the purpose of the research from the
perspective of the consumer organization is to test if p < 60,000, where p represents the mean mileage of all
such tires. The research hypothesis is stated symbolically as Ha:p < 60,000miles. The negation of the research
hypothesis is p = 60,000. The null hypothesis is stated symbolically as Ho: p = 60,000. A test of hypothesis
where the research hypothesis is of the form Ha:p < (some constant) is called a one-tailed test, a lej3-taifedtest,
or a Lower-tailed test.
EXAMPLE 9.2 The National Football League (NFL)claims that the average cost for a family of four to attend
an NFL game is $200. This figure includes ticket prices and snacks. A sports magazine feels that this figure is
too low and plans to perform a test of hypothesis concerning the claim. The research hypothesis is Ha:p > $200,
where p is the mean cost for all such families attending an NFL game. The null hypothesis is Ho: p = $200. A
test of hypothesis where the research hypothesis is of the form H,: p > (some constant) is called a one-tailed
test, a right-tailed test, or an upper-tailed test.
TEST STATISTIC, CRITICAL VALUES, REJECTION, AND
NONREJECTION REGIONS
Consider again the test of hypothesis to be performed by the L. C. Stephensmail order company
discussed in the previous section. The two hypotheses are restated as follows:
Ho:p = $155 (the mean sales per order this year is $155)
Ha:p f $155 (the mean sales per order this year is not $155)
1 oc
186 TESTS OF HYPOTHESIS: ONE POPULATION [CHAP. 9
To test this hypothesis, a sample of 100 orders for the current year are selected and the mean of the
sample is used to decide whether to reject the null hypothesis or not. A statistic, which is used to
decide whether to reject the null hypothesis or not is called a test statistic. The central limit theorem
assures us that if the null hypothesis is true, then Y will equal $155 on the average and the standard
error will equal ax= -= 5, assuming that the standard deviation equals 50 and has remained
constant. Furthermore, the sample mean has a normal distribution.If the value of the test statistic, X,
is “close” to $155, then we would likely not reject the null hypothesis. However, if the computed
value of .jrc is considerably different from $155, we would reject the null hypothesis since this
outcome supports the truth of the research hypothesis. The critical decision is how different does X
need to be from $155 in order to reject the null hypothesis. Suppose we decide to reject the null
hypothesis if si; differs from $155 by two standard errors or more. Two standard errors is equal to 2 x
aK= $10. Since the distributionof the sample mean is bell-shaped, the empirical rule assures us that
the probability that .jrc differs from $155 by two standard errors or more is approximately equal to
0.05. That is, if T < $145 or if X > $165, then we will reject the null hypothesis.Otherwise, we will
not reject the null hypothesis. The values $145 and $165 are called critical values. The critical
values divide the possible values of the test statistic into two regions called the rejection region and
the nonrejection region. The critical values along with the regions are shown in Fig. 9-1.
50
f i
-
reiection region I nonrejection region Irejection region X
145 165
Fig. 9-1
EXAMPLE 9.3 In Example 9.1, the null and research hypothesis are stated as follows:
Ho: p = 60,000miles (the tire manufacturer’sclaim is correct)
Ha: p < 60,000 miles (the tire manufacturer’sclaim is false)
Suppose it is known that the standard deviation of tire mileages for this type of tire is 7,000 miles. If a sample of
49 of the tires are road tested for mileage and the manufacturer’s claim is correct, then X will equal 60,000
7
miles on the average and the standard error will equal 031 = -= 1,000 miles. Furthermore, because of the
J49
large sample size, the sample mean will have a normal distribution. Suppose the critical value is chosen two
standard errors below 60,000;then the rejection and nonrejection regions will be as shown in Fig. 9-2.
-
reiection region I nonreiection region X
58,000
Fig. 9-2
When the consumer organization tests the 49 tires, if the mean mileage is less than 58,000 miles for the sample,
then the manufacturer’s claim will be rejected.
EXAMPLE 9.4 In Example 9.2, the null hypothesis and the research hypothesis are stated as follows:
Ho: p = $200 (the NFL claim is correct)
Ha:p > $200 (the NFL claim is not correct)
Suppose the standard deviation of the cost for a family of four to attend an NFL game is known to equal $30. If
a sample of 36 costs for families of size 4 are obtained, then X will be normally distributed with mean equal to
CHAP. 91 TESTS OF HYPOTHESIS: ONE POPULATION 187
Conclusion
Do not reject H
o
Reject 6
30
$200 and standard error equal to -= $5 if the NFL claim is correct. Suppose the critical value is chosen
J36
HoTrue False
Correct conclusion Type I1 error
Type I error Correct conclusion
three standard errors above $200;then the rejection and nonrejectionregions are as shown in Fig. 9-3.
-
nonreiectionrepion Ireiection region X
215
Fig. 9-3
If the mean of the sample of costs for the 36 families of size 4 exceeds $215, then the NFL claim will be
rejected.
TYPE I AND TYPE I1 ERRORS
Consider again the test of hypothesis to be performed by the L. C. Stephens mail order company
discussed in the previous two sections. The two hypotheses are restated as follows:
Ho: p = $155 (the mean sales per order this year is $155)
Ha:pf $155 (the mean sales per order this year is not $155)
The decision to reject or not reject the null hypothesis is to be based on the sample mean. The critical
values and the rejection and nonrejection regions are given in Fig. 9-1. If the current year mean, p,is
$155 but the sample mean falls in the rejection region, resulting in the null hypothesis being rejected,
then a Type I error is made. If the current year mean is not equal to $155, but the sample mean falls
in the nonrejection region, resulting in the null, hypothesis not being rejected, then a Type I1 error is
made. That is, if the null hypothesis is true but the statistical test results in rejection of the null, then
a Type I error occurs. If the null hypothesis is false but the statistical test results in not rejecting the
null, then a Type I1 error occurs. The errors as well as the possible correct conclusions are
summarized in Table 9.1.
Table 9.1
The first two letters of the Greek alphabet are used in statistics to represent the probabilities of
committingthe two types of errors. The definitionsare as follows.
a=probability of making a Type I error
p= probability of making a Type 1
1error
The calculation of a Type I error will now be illustrated, and the calculation of Type IIerrors will be
illustrated in a later section.The term level ofsignificance is also used for a.
The level of significance,a,is defined by formula (9.1):
a = P(rejectingthe null hypothesis when the null hypothesis is true) (9.0
In the current discussion of the L. C. Stephensmail order company, the level of significancemay be
expressed as a = P(X < 145or X > 165and p = 155).In order to evaluate this probability,recall that
188 TESTS OF HYPOTHESIS:ONE POPULATION [CHAP. 9
-
for large samples, z = -
-’
has a standard normal distribution. As previously discussed, the standard
aji
-
x - 155
5
error is $5 and assuming that the null hypothesis is true, z = -has a standard normal
-
x - I55 145- 155
5 5
distribution. The event X < 145 is equivalent to the event -<
event 5i; > 165 is equivalent to the event ->
= -2 or z < -2. The
= 2 or z > 2. Therefore, a may be
-
x - I55 165- 155
5 5
expressed as a = P(Y c 145 ) + P(T > 165) = P(z c -2) + P(z > 2). From the standard normal
distribution table, P(z < -2) = P(z > 2) = .5 - .4772 = .0228. And therefore, a = 2 x .0228 = .0456.
When using the mean of a sample of 100 order sales and the rejection and nonrejection regions
shown in Fig. 9-1 to decide whether the overall mean sales per order has changed, 4.56% of the time
the company will conclude that the mean has changed when, in fact, it has not.
EXAMPLE 9.5 In Example 9.3, the null and research hypothesis for the tire manufacturer’s claim is as
follows:
Ho: p = 60,000 miles (the tire manufacturer’sclaim is correct)
Ha:p < 60,000miles (the tire manufacturer’sclaim is false)
The rejection and nonrejection regions were illustrated in Fig. 9-2. The level of significance is determined as
follows:
a = P(rejecting the null hypothesis when the null hypothesis is true)
Since the null is rejected when the sample mean is less than 58,000miles and the null is true when p = 60,000,
a = P(X < 58,000 when p = 60,000)
Transforming the sample mean to a standard normal, the expression for a now becomes
)
51-60,000 58,000- 60,000
a = P (
1,ooo 1,000
Simplifying, and finding the area to the left of -2 under the standard normal curve, we find
a = P(z < -2) = .5 - .4772 = .0228
If the tire manufacturer’s claim is tested by determining the mileages for 49 tires and rejecting the claim when
the mean mileage for the sample is less than 58,000 miles, then the probability of rejecting the manufacturer’s
claim, when it is correct, equals ,0228.
EXAMPLE 9.6 In Example 9.4, the null and research hypothesis for the NFL’s claim is as follows:
Ho: p = $200 (the NFL claim is correct)
Ha:j.~> $200 (the NFL claim is not correct)
The rejection and nonrejection regions were illustrated in Fig. 9-3. The level of significance is determined as
follows:
a = P(rejecting the null hypothesis when the null hypothesis is true)
Since the null is rejected when the sample mean is greater than $215 and the null is true when p = $200,
CHAP. 91 TESTS OF HYPOTHESIS: ONE POPULATION 189
a = P(X > 215 when p = 200)
Transforming the sample mean to a standard normal, the expression for a now becomes
)
X-200 215-200
5 5
a=P(- >
Simplifying, and finding the area to the right of 3 under the standard normal curve, we find
U
.= P(z> 3) = .5 - .4986 = .0014
The probability of falsely rejecting the NFL’s claim when using a sample of 36 families and the rejection and
nonrejection regions shown in Fig. 9-3 to test the hypothesis is .0014.
Thus far, we have considered how to find the level of significance when the rejection and
nonrejection regions are specified. Often, the level of significance is specified and the rejection and
nonrejection regions are required. Suppose in the mail order example that the level of significance is
specified to be a = .01. Because the hypothesis test is two-tailed, the level of significance is divided
into two halves equal to .005 each. We need to find the value a, where P(z > a) = .005 and P(z <-a) =
.005. The probability P(z > a) = .005 implies that P(0 < z < a) = .5 - .005 = .495. Using the standard
normal table, we see that P(0 < z < 2.57) = .4949 and P(0 < z < 2.58) = .4951. The interpolated value
is a = 2.575, which we round to 2.58. The rejection region is therefore as follows: z < -2.58 or
z > 2.58. Figure 9-4 shows the rejection regions as well as the area under the standard normal curve
associated with the regions.
Fig. 9-4
-
X - 155
5
To determine the rejection region in terms of si-, the equation z = -may be used. The
inequality z < -2.58 is replaced by -
< -2.58. Multiplying both sides of the inequality by 5 and
then adding 155 to both sides, the inequality -< -2.58 is seen to be equivalent to X < 142.1.
-
X - 155
5 -
X - 155
c
J
-
X - 155
5
The inequality z > 2.58 is replaced by -> 2.58
shows the rejection regions as well as the area under
means.
which is equivalent to X > 167.9. Figure 9-5
the normal curve associated with the sample
190 TESTS OF HYPOTHESIS: ONE POPULATION [CHAP. 9
Fig. 9-5
EXAMPLE 9.7 Suppose we wish to test the hypothesis in Example 9.5 at a level of significance equal to .01.
Since this is a lower-tailed test, we need to find a standard normal value a such that P(z < a) = .01. From the
standard normal distribution table, we find that P(0 < z < 2.33) = .4901. Therefore P(z > 2.33) = .01. Because of
the symmetry of the standard normal curve, P(z < -2.33) also equals .01. Hence z < -2.33 is a rejection region
of size a = .01.To find the rejection region in terms of the sample mean, we note from Example 9.5 that if the
null hypothesis is true, then z = has a standard normal distribution. Substituting for z in the
inequality z < -2.33, we have < -2.33 or solving for X’,
we find X < 57,670 miles as the rejection
region. Figure 9-6 shows the rejection region as well as the area under the standard normal curve associated with
the region.
51- 60,000
1,000
X - 60,000
1,000
F
i
g
.9-6
Figure 9-7 shows the rejection region as well as the area under the normal curve associated with the sample
means.
F
i
g
. 9-7
EXAMPLE 9.8 Suppose we wish to test the hypothesis in Example 9.6 at a level of significance equal to .10.
Since this is an upper-tailed test, we need to find a standard normal value a such that P(z > a) = .10. From the
CHAP. 91 TESTS OF HYPOTHESIS: ONE POPULATION 191
standard normal distribution table, we find that P(0 < z < 1.28) = .3997. Therefore P(z > 1.28) = .10. Hence
z > 1.28 is a rejection region of size a = .lO. To find the rejection region in terms of the sample mean, we note
-
x-200
from Example 9.6 that if the null hypothesis is true, then z = -has a standard normal distribution.
5
-
x-200
5
Substituting for z in the inequality z > 1.28, we have -> 1.28 or solving for X, we find X > $206.40 as
the rejection region. Figure 9-8 shows the rejection region as well as the area under the standard normal curve
associated with the region.
Fig. 9-8
Figure 9-9 shows the rejection region as well as the area under the normal curve associated with the sample
means.
Fig. 9-9
HYPOTHESIS TESTS ABOUT A POPULATION MEAN: LARGE SAMPLES
This section summarizes the material presented in the proceeding sections. The techniques
presented are appropriate when the sample size is 30 or more.
EXAMPLE 9.9 The mean age of policyholders at World Life Insurance Company, determined two years ago,
was found to equal 32.5 years and the standard deviation was found to equal 5.5 years. It is reasonable to
believe that the mean age has increased. However, some of the older policyholders are now deceased and some
younger policyholders have been added. The company determines the ages of 50 current policyholders in order
to decide whether the mean age has changed. If p represents the current mean of all policyholders, the null and
research hypothesis are stated as follows:
Ho:p= 32.5 (the mean age has not changed)
Ha:p f 32.5 (the mean age has changed)
192 TESTS OF HYPOTHESIS:ONE POPULATION [CHAP. 9
t
Steps for Testing a Hypothesis Concerning a Population Mean: Large Sample
Step 1: State the null and research hypothesis. The null hypothesis is represented symbolically by
Ho: p= h,and the research hypothesis is of the form Ha: p f
Step 2: Use the standard normal distribution table and the level of significance, a,to determine the
rejection region.
or Ha:p < or Ha: p > h.
The test is performed at the conventional level of significance, which is a = .05. From the standard normal
distribution, it is determined that P(z > 1.96) = .025 and therefore P(z c -1.96) = .025. The rejection and
nonrejection regions are therefore as follows:
re-iectionregion I nonrejection region 1 reiection region z
-1.96 1.96
The standard deviation of the policyholder ages is known to remain fairly constant, and therefore the standard
error is a
x = --
- .778. The mean age of the sample of 50 policyholders is determined to be 34.4 years. The
55
Jso -
x -po 34.4- 32.5
computed test statistic is z = --
- = 2.44. That is, assuming the mean age of all policyholders
0.
F .778
has not changed from two years ago, the mean of the current sample is 2.44 standard errors above the population
mean. Since this exceeds 1.96, the null hypothesis is rejected and it is concluded that the current mean age
exceeds 32.5 years. Notice that the four steps in Table 9.2 were followed in performing the test of hypothesis.
Table 9.2 gives a set of steps that may be used to perform a test of hypothesis about the mean of
a population.
-
X - F o
Step 3: Compute the value of the test statistic as follows: z = -
,where X is the mean of the
I 0,-
sample, his given in the null hypothesis, and a
x is computed by dividing CJ by &. If 0 is
unknown, estimate it by using the sample standard deviation, s, when computing OK.
Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test
statistic falls in the rejection region. Otherwise, the null hypothesis is not rejected.
EXAMPLE 9.10 The police department in a large city claims that the mean 91I response time for domestic
disturbance calls is 10 minutes. A “watchdog group” believes that the mean response time is greater than 10
minutes. If p represents the mean response time for all such calls, the watchdog group wishes to test the research
hypothesis that p > 10at level of significance a = .01.The null and research hypothesis are:
Ho:p = 10(the police department claim is correct)
Ha:p > 10(the police department claim is not correct)
From the standard normal distribution table, it is found that P(z > 2.33) = .01. The rejection and nonrejection
regions are as follows:
nonreiection region I reiection region Z
2.33
A sample of 35 response times for domestic disturbance calls is obtained and the mean response time is found to
be 11.5 minutes and the standard deviation of the 35 response times is 6.0 minutes. The standard error is
-
s 6.0 x - 11.5- 10.0
estimated to be ---= 1.01 minutes. The computed test statistic is z = 2= = 1.49.
& - & o
x 1.01
Since the computed test statistic does not exceed 2.33, the police department claim is not rejected. Note that the
CHAP. 91 TESTS OF HYPOTHESIS: ONE POPULATION 193
study has not proved the claim to be true. However, the results of the study are not strong enough to refute the
claim at the 1% level of significance.
EXAMPLE9.11 A sociologist wishes to test the null hypothesisthat the mean age of gang members in a large
city is 14years vs. the alternativethat the mean is less than 14 years at level of significancea= .10. A sampleof
40 ages from the police gang unit records in the city found that Y = 13.1 years and s = 2.5 years. The estimated
standard error is -= -= .40 years. Assuming the null hypothesis to be true, the sample mean is 2.25
standard errors below the population mean, since z = -
-
- -2.25. For a = .10, the rejection region is
z <-1.28. Since the computed test statistic is less than -1.28, the null hypothesis is rejected and it is concluded
that the mean age of the gang members is less than 14 years.
s 2.5
J;;&
13.1- 14
.40
The process of testing a statistical hypothesis is similar to the proceedings in a courtroom in the
United States. An individual, charged with a crime, is assumed not guilty. However, the prosecution
believes the individual is guilty and provides sample evidence to try and prove that the person is
guilty. The null and alternativehypothesis may be stated as follows:
Ho:the individualcharged with the crime is not guilty
Ha:the individualcharged with the crime is guilty
If the evidence is strong, the null hypothesis is rejected and the person is declared guilty. If the
evidence is circumstantial and not strong enough, the person is declared not guilty. Notice that the
person is not usually declared innocent, but is found not guilty. The evidence usually does not prove
the null hypothesis to be true, but is not strongenough to reject it in favor of the alternative.
CALCULATING TYPE I1 ERRORS
In testing a statistical hypothesis, a Type I1 error occurs when the test results in not rejecting the
null hypothesis when the research hypothesis is true. The probability of a Type I1 error is represented
by the Greek letter p and is defined in formula (9.2):
p =P(not rejecting the null hypothesis when the research hypothesis is true) (9.2)
Consider once again the mail order company example discussed in previous sections. The null
and research hypothesis are:
Ho:p = $155 (the mean sales per order this year is $155)
Ha:1.1
f $1 55 (the mean sales per order this year is not $155)
The rejection and nonrejection regions are:
-
rejection region I nonreiection region I rejection region X
145 165
The level of significance is a = .0456.
The level of significance is computed under the assumption
that the null hypothesis is true, that is, p = 155.The probability of a Type II error is calculated under
the assumption that the research hypothesis is true. However, the research hypothesis is true
whenever p f $155. Suppose we wish to calculate B when p = 157.5. The sequence of steps to
compute p are as follows:
p = P(not rejecting the null hypothesis when the research hypothesis is true)
194
P P
145.O so00
147.5 .6915
150.0 3399
152.5 .9270
155.0 .9544
157.5 .9270
160.0 3399
162.5 .6915
165.O so00
TESTS OF HYPOTHESIS:ONE POPULATION [CHAP. 9
p = P(145 < X < 165 when p = 157.5)
145- 157.5 X- 157.5 165- 157.5
< <
5 5 5
The event 145 < X < 165 when p = 157.5 is equivalent to or
-2.5 c z < 1.5.The calculation of p is therefore given as follows:
p = P(-2.5 < z < 1.5)= .4332+ .4938= .9270
Suppose we wish to compute p when p= 160.The sequence of steps to compute p are as follows:
p = P(not rejectingthe null hypothesis when the research hypothesis is true)
p = P(145 < X < 165 when p = 160.0)
145- 160 51- 160 165- 160
<- <
5 5 5
The event 145 < T < 165 when p = 160.0is equivalent to
-3 < z < 1. The calculation of p is therefore given as follows:
or
p = P(-3 < z < 1) = .4986+ .3413 = .8399
Notice that the value of p is not constant, but depends on the alternative value assumed for p. Table
9.3gives the values of p for several different values of p.
A plot of p vs. p is called an operating characteristic curve. An operating characteristic curve for
Table 9.3 is shown in Fig. 9-10.
P
0
.
9-
0.8 -
0.7 -
0
.
0-
0.5 -
145 155 165
Fig. 9-10
CHAP. 91 TESTS OF HYPOTHESIS:ONE POPULATION 195
56,000
56,500
57,000
57,500
58,000
58,500
59,000
59,500
60.000
EXAMPLE9.12 Suppose we wish to construct an operating characteristic curve for the hypothesis concerning
the tire manufacturer’s claim discussed in Example 9.3. Recall that the null and research hypothesis were as
follows:
Ho:p = 60,000miles (the tire manufacturer’sclaim is correct)
Ha:p < 60,000miles (the tire manufacturer’sclaim is false)
.0228
.0668
.1587
.3085
5000
.6915
3413
.9332
.9772
The rejection and nonrejection regions were as follows:
-
reiection region I nonreiection region X
58,000
The level of significance is a = .0228. To illustrate the computation of p, suppose we wish to compute the
probability of committing a Type I1 error when p = 59,000.
p = P(not rejecting the null hypothesis when the research hypothesis is true)
p = P(X > 58,000when p = 59,000)
X -59,000 58,000-59,000
The event X > 58,000when p = 59,000 is equivalent to > or z > -I. Therefore,
1,000 1,000
p = P(z >-1) = SO00 + .3413= .8413.
Table 9.4gives the p values for several different p values.
Table 9.4
The operating characteristic curve corresponding to Table 9.4 is shown in Fig. 9-11.
1.o
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0.0
56 57 58 59 60
Fig. 9-11
196 TESTS OF HYPOTHESIS:ONE POPULATION [CHAP. 9
P VALUES
Consider once again the mail order company example. The null and research hypotheses are:
Ho: p = $155 (the mean sales per order this year is $155)
Ha:p f $155 (the mean sales per order this year is not $155)
For level of significancea = .05,the rejection region is shown in Fig. 9-12.
rejection region I nonre-iectionregion Irejection region Z
-1.96 1.96
Fig. 9-12
To understand the concept of p value, consider the following four scenarios. Also recall that the
standarderror is 05 = -= 5.
50
Jroo
Scenario I: The mean of the 100 sampled accounts is found to equal $149. The computed test
149- 155
5
statistic is z = = -1.2, and since the computed test statistic falls in the nonrejection region,
the null hypothesis is not rejected.
Scenario 2: The mean of the 100 sampled accounts is found to equal $164. The computed test
statistic is z = 1.8, and since the computed test statistic falls in the nonrejection region, the null
hypothesis is not rejected.
Scenario 3: The mean of the 100 sampled accounts is found to equal $165. The computed test
statistic is z = 2.0, and since the computed test statistic falls in the rejection region, the null
hypothesis is rejected.
Scenario 4: The mean of the 100 sampled accounts is found to equal $140. The computed test
statistic is z = -3.0, and since the computed test statistic falls in the rejection region, the null
hypothesis is rejected.
Notice in scenarios 1and 2 that even though the evidence is not strong enough to reject the null
hypothesis, the test statistic is nearer the rejection region in scenario2 than scenario 1. Also notice in
scenarios 3 and 4 that the evidence favoring rejection of the null hypothesis is stronger in scenario 4
than scenario 3, since the sample mean in scenario 4 is 3 standard errors away from the population
mean, and the sample mean is only 2 standard errors away in scenario 3. The p value is used to
reflect these differences. The p value is defined to be the smallest level of significanceat which the
null hypothesis would be rejected.
In scenario I, the smallest level of significance at which the null hypothesis would be rejected
would be one in which the rejection region would be z 5-1.2 or z 2 1.2. The level of significance
corresponding to this rejection region is 2 x P(z 2 1.2) = 2 x (.5 - 3849) = .2302. The p value
corresponding to the test statistic z = -1.2 is .2302.
In scenario 2, the smallest level of significance at which the null hypothesis would be rejected
would be one in which the rejection region would be z 5 -1.8 or z 2 1.8. The level of significance
corresponding to this rejection region is 2 x P(z 2 1.8) = 2 x (.5 - .4641) = .0718. The p value
correspondingto the test statistic z = 1.8is .0718.
CHAP. 91 TESTS OF HYPOTHESIS: ONE POPULATION 197
In scenario 3, the smallest level of significance at which the null hypothesis would be rejected
would be one in which the rejection region would be z 5-2.0 or z 2 2.0. The level of significance
corresponding to this rejection region is 2 x P(z 2 2.0) = 2 x (.5 - .4772) = .0456. The p value
correspondingto the test statisticz = 2.0 is .0456.
In scenario 4, the smallest level of significance at which the null hypothesis would be rejected
would be one in which the rejection region would be z 5 -3.0 or z 2 3.0. The level of significance
corresponding to this rejection region is 2 x P(z 2 3.0) = 2 x (.5 - .4987) = .0026. The p value
correspondingto the test statisticz = 3.0 is ,0026.
To summarize the procedure for computing the p value for a two-tailed test, suppose z*
represents the computed test statisticwhen testing Ho: p = vs. Ha: p + h.
The p value is given by
formula (9.3):
p value = P(I z I> lz* I) (9.3)
When using the p value approach to testing hypothesis, the null hypothesis is rejected if the p value
is less than a.In addition, the p value gives an idea about the strength of the evidence for rejecting
the null hypothesis. The smaller the p value, the stronger the evidence for rejecting the null
hypothesis.
EXAMPLE 9.13 In Example 9.10, the followingstatistical hypothesis was tested for a= .OI :
Ho: p = 10(the police department claim is correct)
Ha: p > 10(the police department claim is not correct)
The computed test statistic is z* = 1.49.The smallest level of significance at which the null hypothesis wou 1 be
rejected is one in which the rejection region is z > 1.49. The p value is therefore equal to P(z > 1.49). From the
standard normal distribution table, we find the p value = .5 - .4319 = .0681. Using the p value approach to
testing, the null hypothesis is not rejected since the p value > a.
To summarize the procedure for computing the p value for an upper-tailed test, suppose z*
represents the computed test statistic when testing Ho: p = bvs. Ha : p > h.
The p value is given by
p value =P(z > z*) (9.4)
EXAMPLE 9.14 In Example 9.11, the following statistical hypothesis was tested for a = .10.
Ho: p = 14 (the mean age of gang members in the city is 14)
Ha:p < 14 (the mean age of gang members in the city is less than 14)
The computed test statistic is z* = -2.25. The smallest level of significance at which the null hypothesis would
be rejected is one in which the rejection region is z < -2.25. The p value is therefore equal to P(z < -2.25).
Using the standard normal distribution table, we find the p value = .5 - .4878 = .0122. Using the p value
approach to testing, the null hypothesis is rejected since the p value < .lO.
To summarize the procedure for computing the p value for a lower-tailed test, suppose z* represents
the computed test statistic when testing Ho:p= c ~ 0
vs. Ha:p<b.
The p value is given by
p value = P(z c z*) (9.5)
EXAMPLE 9.15 A random sample of 40 community college tuition costs was selected to test the null
hypothesis that the mean cost for all community colleges equals $1500 vs. the alternative that the mean does not
equal $1500. The level of significance is .05. The data are shown in Table 9.5.
198
1200
850
1700
1500
700
1200
TESTS OF HYPOTHESIS:ONE POPULATION
850 1750 930
3000 1650 1640
2100 900 1320
500 2050 1750
500 1780 2500
1950 675 2310
1500
2000
1o00 1080 2900
950 680 1875
~ ~ ~ ~~ ~ ~~ ~
1950 560 900 1450
F 750 500 1500 950
[CHAP.9
The Minitab analysis for this test of hypothesis is shown below. The command set c l is used to put the data in
column 1. The name cost is given to column 1. The command standard deviation c l is used to find s. The
command ztest mean = 1500 sigma = 655.44 data in cl; is used to specify the value of h,the estimate of
sigma, and the column where the data are located. The subcommand alternative 0. indicates a two-tailed test.
Since the p value is 3 2 and exceeds .05,the null hypothesis is not rejected.
MTB > set cl
DATA>1200 850 1700 1500 700 1200 1500 2000 1950 750
DATA> 850 3000 2100 500 500 1950 1000 950 560 500
DATA>1750 1650 900 2050 1780 675 1080 680 900 1500
DATA > 930 1640 1320 1750 2500 2310 2900 1875 1450 950
DATA > end
MTB > name cl 'cost'
MTB > standard deviation c1
Column StandardDeviation
Standard deviation of cost = 655.44
MTB > ztest mean = 1500sigma = 655.44 data in cl;
SUBC> alternative 0.
2-Test
Test of mu = 1500 vs mu not = 1500
The assumed sigma = 655
Variable N Mean St Dev SEMean Z P
cost 40 1396 655 104 -1.00 .32
To test the research hypothesis that p c 1500,the subcommand alternative-1. is used. The p value is now
equal to .16.
MTB > ztest mean = 1500 sigma = 655.44 data in cl;
SUBC > alternative -1.
2-Test
Test of mu = 1500vs mu < 1500
The assumed sigma = 655
Variable N Mean St Dev SEMean 2 P
cost 40 1396 655 104 -1.00 .I6
To test the research hypothesis that p > 1500,the subcommandalternative 1. is used. The p value is now equal
to .84.
MTB > ztest mean = 1500 sigma = 655.44 data in cl;
SUBC > alternative 1.
CHAP. 91 TESTS OF HYPOTHESIS: ONE POPULATION
. Steps for Testing a Hypothesis Concerning a Population Mean: Small Sample
Normally Distributed Population with Unknown 6
Z-Test
Test of mu = 1500vs mu > 1500
The assumed sigma = 655
Variable N Mean StDev SEMean Z P
cost 40 1396 655 104 -1.00 .84
Step 2: Use the t distribution table, with degrees of freedom = n - 1 and the level of significance, a,to
determine the rejection region.
Step 3: Compute the value of the test statistic as follows: t = -
- ”,where X is the mean of the sample,
-
SX
i
~ is given in the null hypothesis, and sjr is computed by dividing s by&.
Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls
in the rejection region. Otherwise, the null hypothesis is not rejected.
HYPOTHESIS TESTS ABOUT A POPULATIONMEAN: SMALL SAMPLES
199
Step 1 : State the null and research hypothesis. The null hypothesis is represented symbolically by Ho: p =b,
and the research hypothesis is of the form Ha: p f cl0or Ha: ~1 < cl0or Ha: ~1 > k.
EXAMPLE 9.16 In order to test the hypothesis that the mean age of all commercial airplanes is 10 years vs.
the alternative that the mean is not 10 years, a sample of 25 airplanes is selected from several airlines. A
histogram of the 25 ages is mound-shaped and it is therefore assumed that the distribution of all airplane ages is
normally distributed. The sample mean is found to equal 11.5 years and the sample standard deviation is equal
to 5.5 years. The level of significance is chosen to be a = .05. From the t distribution tables with df = 24, it is
determined that the area to the right of 2.064 is .025. By symmetry, the area to the left of -2.064 is also .025.
The rejection regions as well as the corresponding areas under the t distribution curve are shown in Fig. 9-13.
Fig. 9-13
200 TESTS OF HYPOTHESIS:ONE POPULATION [CHAP. 9
300
250
-
5.5 X-P, 11.5-10
The estimated standard error is
1.36,and since this value does not fall in the rejection region, the null hypothesis is not rejected.
= -= 1.1 years. The computed test statistic is t = --
- -
-
-
6 5 Sf 1.1
280 190 250
220 255 272
EXAMPLE 9.17 A social worker wishes to test the hypothesis that the mean monthly household payment for
food stamps in Shelby county is $225 vs. the alternative that it is greater than that amount. The payments for 20
randomly selected households are given in Table 9.7.
200
225
275
Table 9
.
7
210 245 228
290 265 213
310 235 277
A histogram of the payments is shown in Fig. 9-14. The histogram indicates that it is reasonable to assume that
the payments are normally distributed.
I I I
-
- 1
190 210 230 250 270 290 310
payment
Fig. 9-14
The Minitab analysis for the data is shown below. The command ttest mean = 225 data in cl; gives the value
for and the location of the sample data in column 1. The subcommand alternative = 1. Identifies the
alternative as being upper-tailed. The p value indicates that the results are significant for any level of
significance greater than .0023.
MTB > set cl
DATA > 300 250 200 225 275 280 220 210 290 310
DATA> 190 255 245 265 235 250 272 228 213 277
DATA > end
MTB > ttest mean = 225 data in cl;
SUBC > alternative = 1.
T-Testof the Mean
Test of mu = 225.00 vs mu > 225.00
Variable N Mean St Dev SEMean T P
Payment 20 249.50 34.04 7.61 3.22 0.0023
HYPOTHESIS TESTS ABOUT A POPULATION PROPORTION:
LARGE SAMPLES
In Chapter 7 the sampling distribution of p,the sample proportion, was discussed. When the
sample size satisfies the inequalities np > 5 and nq > 5, the sampling distribution of the sample
CHAP. 91 TESTS OF HYPOTHESIS: ONE POPULATION 201
proportion is normally distributed with mean equal to p and standard error q, = - .This result is
sometimes referred to as the central limit theorem for the sample proportion. Furthermore, it was
shown that z = ~ has a standard normal distribution. These results form the theoretical
underpinnings for tests of hypothesis concerning p, the population proportion. Table 9.8 gives the
t
-
P -Pg
q,
steps that may be followed when testing a hypothesis about the population proportion.
Table 9.8
Steps for Testing a Hypothesis Concerning a Population Proportion:
Large Samples (np > 5 and nq > 5 )
Step I: State the null and research hypothesis. The null hypothesis is represented symbolically by Ho:p = po,
and the research hypothesis is of the form Ha:p f poor Ha:p c poor Ha:p >po.
Step 2: Use the standard normal distribution table to determine the rejection and nonrejection regions.
-
Step 3: Compute the value of the test statistic as follows: z = -
- ,where jT is the sample proportion, po is
OF
specified in the null hypothesis, and oF=
Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls
in the rejection region. Otherwise, the null hypothesis is not rejected.
EXAMPLE 9.18 Health-care coverage for employees varies with company size. It is reported that 30% of all
companies with fewer than 10employees provide health benefits for their employees. A sample of 50 companies
with fewer than 10employees is selected to test Ho:p = .3 vs. Ha:p f .3 at a = .01. It is found that 19 of the 50
companies surveyed provide health benefits for their employees. Using the standard normal table, it is found
that the area under the curve to the right of 2.58 is .005 and by symmetry, the area to the left of -2.58 is also
.005. The rejection regions as well as the associated areas under the standard normal curve are shown in Fig.
9-15.
Fig. 9-15
--
Assuming the null hypothesis to be true, the standard error of the proportion is oF
= /
T
y
T
'
- -
19 .38 - .30
.065.The sample proportion is j7 = -= .38. The computed value of the test statistic is z* = = 1.23.
Since the computed value of the test statistic is not in the rejection region, the null hypothesis is not rejected.
The p value is P(I z I> 1.23)= P(z < -1.23) +P(z > 1.23)= 2 x P(z > 1.23)= 2 x (.5 - .3907)= .2186.
50 .065
202 TESTS OF HYPOTHESIS: ONE POPULATION [CHAP. 9
EXAMPLE 9.19 A national survey asked the followingquestion of 2500 registered voters: “Is the character of
a candidate for president important to you when deciding for whom to vote?” Two thousand of the responses
were yes. Let p represent the percent of all registered voters who believe the character of the president is
important when deciding for whom to vote. The results of the survey were used to test the null hypothesis Ho:
p = 90% vs. Ha: p < 90% at level of significance a = .05. The sample percent is = 2500 x 100% = 80%.
2Ooo
Assuming the null hypothesis to be true, the standard error is OF= /pot
-
‘ O -
-,/-
--
- .60%.The computed
test statistic is z*= -
= -16.7. The null hypothesis would be rejected at all practical levels of signiticance.
80-90
.60
Solved Problems
NULL HYPOTHESISAND ALTERNATIVE HYPOTHESIS
9.1
9.2
A study in 1992established the mean commuting distance for workers in a certain city to be 15
miles. Because of the westward spread of the city, it is hypothesized that the current mean
commuting distance exceeds 15 miles. A traffic engineer wishes to test the hypothesis that the
mean commuting distance for workers in this city is greater than 15 miles. Give the null and
alternative hypothesis for this scenario.
Ans. Ho: p = IS,Ha:p > 15, where p represents the current mean commuting distance in this city
The mean number of sick days used per year nationally is reported to be 5.5 days. A study is
undertaken to determine if the mean number of sick days used for nonunion members in
Kansas differs from the national mean. Give the null and alternative hypothesis for this
scenario.
Ans. Ho: p = 5.5, Ha: p # 5.5, where p represents the mean number of sick days used for nonunion
members in Kansas.
TEST STATISTIC,CRITICAL VALUES, REJECTION
AND NONREJECTION REGIONS
9.3 Refer to problem 9.1. The decision is made to reject the null hypothesis if the mean of a
sample of 49 randomly selected commuting distances is more than 2.5 standard errors above
the mean established in 1992.The standard deviation of the sample is found to equal 3.5 miles.
Give the rejection and nonrejection regions.
s 3.5
Ans. The estimated standard error of the mean is sx = -- --
- .5 miles. The null hypothesis is to
be rejected if the sample mean of the 49 commuting distances exceeds the 1992 population mean
by 2.5 standard errors. That is, reject Ho if T > 15 + 2.5 x .5 = 16.25.
&-G
-
rionreiection region I reiection region X
16-25
CHAP. 91 TESTS OF HYPOTHESIS: ONE POPULATION 203
9.4 Refer to problem 9.2. The null hypothesis is to be rejected if the mean of a sample of 250
nonunion members differs from the national mean by more than 2 standard errors. The
standard deviation of the sample is found to equal 2.1 days. Give the rejection and nonrejection
regions.
S 2.1
Ans. The estimated standard error of the mean is s i = -- -= .13 days. The null hypothesis is
J;;-J250
to be rejected if the mean of the sample differs fiom the national mean by more than 2 standard
errors. That is, reject H
o if X < 5.5 - 2 x .I3 = 5.24 or if 'jY > 5.5 +2 x .13 = 5.76.
-
reiection region I nonreiection region 1 reiection region X
5.24 5.76
TYPE I AND TYPE I1 ERRORS
9.5 Refer to problems 9.1 and 9.2. Describe, in words, how a Type I and a Type I1 error for both
problems might occur.
Ans. In problem 9.1, a Type I error is made if it is concluded that the current mean commuting distance
exceeds 15 miles when in fact the current mean is equal to 15 miles. A Type I1 error is made if it is
concluded that the current mean commutingdistance is 15 miles when in fact the mean exceeds 15
miles.
In problem 9.2, a Type I error is made if it is concluded that the mean number of sick days used
for the nonunion members in Kansas is different than the national mean when in fact it is not
different. A Type I1 error is made if it is concluded that the mean number of sick days used for
nonunion members in Kansas is equal to the national mean when in fact it differs from the national
mean.
9.6 Refer to problems 9.3 and 9.4.Find the level of significance for the test procedures described
in both problems.
Ans. In problem 9.3, a = P(rejecting the null hypothesis when the null hypothesis is true) or a = P(Y >
16.25 when p = 15)= P(z > 2.5), since 16.25is 2.5 standard errors above p = 15.a = P(z > 2.5) =
In problem 9.4,a = P(rejecting the null hypothesis when the null hypothesis is true) or a = P(T c
5.24 or X > 5.76 when p = 5.5) = P(z < -2 or z > 2), since 5.24 is 2 standard errors below p = 5.5
and 5.76 is 2 standard errors above p = 5.5. a = P(z <-2) + P(z > 2) = 2 x (.5 - .4772) = .0456.
.5 - .4938 = .0062.
HYPOTHESISTESTS ABOUT A POPULATIONMEAN: LARGE SAMPLES
9.7 A sample of size 50 is used to test the following hypothesis: Ho: p = 17.5vs. Ha:p # 17.5 and
the level of significance is .01.Give the critical values, the rejection and nonrejection regions,
the computed test statistic, and your conclusion i f (a)Y = 21.5 and s = 5.5 (0
unknown); (b)
'ji' = 21.5 and 0= 5.0
Ans. The critical values and therefore the rejection and nonrejection regions are the same in both parts.
Since P(z < -2.58) = P(z > 2.58) = .005, the critical values are -2.58 and 2.58. The rejection and
nonrejection regions are as follows:
reiection region I nonreiection region Ireiection region I,
-2.58 2.58
204 TESTS OF HYPOTHESIS: ONE POPULATION [CHAP. 9
s 5.5
In part (a) the standard error is estimated by
deviation is unknown. The computed test statistic is z =
= -= --
- .78, since the population standard
= 5.13, and the null hypothesis
&Jso
21.5- 17.5
.78
is rejected since this value falls in the rejection region.
a 5
In part (6)the standard error is computed by 0
: = -- --
- .71, since the population standard
C J S o
21.5- 17.5
deviation is known. The computed test statistic is z = = 5.63, and the null hypothesis is
.71
rejected since this value falls in the rejection region.
9.8 A state environmental study concerning the number of scrap-tires accumulated per tire
dealership during the past year was conducted. The null hypothesis is H
o
:p = 2500 and the
research hypothesis is H : p # 2500, where p represents the mean number of scrap-tires per
dealership in the state. For a random sample of 85 dealerships, the mean is 2750 and the
standard deviation is 950.Conduct the hypothesis test at the 5% level of significance.
Ans. The null hypothesis is rejected if Iz* I > 1.96, where z* is the computed value of the test statistic.
The estimate standard error is = - -
- -= 103 miles. The computed value of the test
950
J f ; &
2750- 2500
103
statistic is z* = = 2.43. It is concluded that the mean number of scrap-tires per
dealership exceeded 2500 last year
CALCULATING TYPE I1 ERRORS
9.9 In problems 9.1 and 9.3, the hypothesis system Ho: p= 15 andHa:p> 15 was tested by using a
sample of size 49. The null hypothesis was to be rejected if the sample mean exceeded 16.25.
Suppose the true value of p is 16.5. What is the probability that this test procedure will not
result in the rejection of the null hypothesis?
Am. p = P(not rejecting the null hypothesis when p = 16.5)= P(X < 16.25 when p = 16.5).In problem
9.3, the estimated standard error was found to equal .5. The event X < 16.25 is equivalent to the
51- 16.5 16.25- 16.5
.5 .5
event z = -
< =-.5 when p = 16.5.Therefore, p = P(z < -S) = P(z > .5) = .5 -
.I915 = .3085. That is, there is a 30.85% chance that, even though the mean commuting distance
has increased from the 15 miles figure of 1992 to the current figure of 16.5 miles, the hypothesis
test will result in concluding that the current mean is the same as the 1992 mean.
9.10 In problems 9.2 and 9.4, the hypothesis system Ho: p = 5.5 and Ha:
1-1 f 5.5 was tested by using
a sample of size 250. The null hypothesis was to be rejected if the sample mean was less than
5.24 or exceeded 5.76. Suppose the true value of p is 5.0. What is the probability that this test
procedure will not result in the rejection of the null hypothesis?
Ans. p = P(not rejecting the null hypothesis when p = 5.0) = P(5.24< T < 5.76 when p is 5.0). In
problem 9.4, the estimated standard error was found to equal .13. The event 5.24 < X < 5.76 when
p is 5.0 is equivalent to the event which is the same as the event
5.24-5.0 x -5.0 5.76-5.0
<- <
.I 3 .13 .I3
CHAP. 91 TESTS OF HYPOTHESIS: ONE POPULATION 205
1.85 < z < 5.85 . Therefore p = P( I .85 < z < 5.85). Since there is practically zero probability to the
right of 5.85, p = P(z > 1.85)= .5 - .4678 = .0322.
P VALUES
9.11 Use the p value approach to test the hypothesis in problem 9.8.
Ans. The computed value of the test statistic is z* = 2.43. Since the hypothesis is two-tailed, the p value
is given by p value = P(I z I > I z* I) or equivalently P(z < -1 z* I) + P(z >I z* 1) = P(z < -2.43) +
P(z > 2.43). Since these two probabilities are equal, we have the p value = 2 x P(z > 2.43) = 2 x
(.5 - .4925) = .015. Using the p value approach, the null hypothesis is rejected since the p value
< a.
9.12 Find the p values for the following hypothesis systems and z* values.
( U ) Ho: p = 22.5 VS.Ha: p +22.5 and Z* = 1.76
(b) Ho: p = 37.I VS. Ha: 1
f 37.1 and Z* = -2.38
(c) Ho: p= 12.5 VS. Ha: j~ < 12.5 and Z* = -.98
(d) Ho: p = 22.5 VS. Ha: 1-1< 22.5 and Z* = 0.35
(e) Ho: p= 100vs. Ha:p> 100 and z* = 2.76
(f) Ho: p= 72.5 VS. Ha: p> 72.5 and Z* = -0.25
Ans. (a) The p value is given by p value = P(I z I > I z* 1) = P(I z I >1.76) = 2 x P(z > 1.76)= 2 x
(b) The p value is given by p value = P(I z I > I z* I) = P(I z I> 2.38), since the absolute value of
-2.38 is 2.38. Therefore, p value = P(l z I> 2.38) = 2 x P(z > 2.38) = 2 x (.5 - .4913) = .0174.
(c) The p value is given by p value = P(z < z*) = P(z < -.98) = P(z > .98) = (.5 - .3365) = .1635.
(
d
) The p value is given by p value = P(z < z*) = P(z < 3 5 )= .5 + .1368 = .6368.
(e) The p value is given by p value = P(z > z*) = P(z > 2.76) = .5 - .4971 = .0029.
U> The p value is given by p value = P(z > z*) = P(z > -.25) = .5 + .0987 = .5987.
(.5 - .4608) = .0784.
HYPOTHESIS TESTS ABOUT A POPULATIONMEAN: SMALL SAMPLES
9.13 A psychological test, used to measure the level of hostility, is known to produce scores which
are normally distributed, with mean = 35 and standard deviation = 5. The test is used to test the
null hypothesis that p = 35 vs. the research hypothesis that p > 35, where p represents the
mean score for criminal defense lawyers. Sixteen criminal defense lawyers are administered
the test and the sample mean is found to equal 39.5. The sample standard deviation is found to
equal 10.The level of significance is selected to be a = .01.
(a) Perform the test assuming that o = 5. This amounts to assuming that the standard deviation
(b) Perform the test assuming o is unknown.
of test scores for criminal defense lawyers is the same as the general population.
-
x-35
(a) Since the test scores are normally distributed and (r is known, the test statistic, z = -is
o
r
normally distributed. The rejection region is found using the standard normal distribution
tables. The null hypothesis is rejected if the test statistic exceeds 2.33.The calculated value of
Ans.
39.5- 35
1.25
the test statistic is z* = = 3.6, where the standard error is computed as
o
ox=--- -
- 1.25.The null hypothesis is rejected since 3.6 falls in the rejection region.
&-a
206
5.5
7.5
12.5
10.0
15.3
TESTS OF HYPOTHESIS: ONE POPULATION [CHAP.9
13.2 15.5 16.5 18.0
4.O 3.3 9.0 17.6
2.5 7.5 8.5 4.9
14.0 10.0 16.6 2.9
13.3 6.6 9.7 13.4
9.14
-
x - 35
Because (T is unknown and the sample size is small the test statistic, t = -has a t
sx
distribution with df = 16 - 1 = 15. The rejection region is found by using the t distribution
with df = 15. The null hypothesis is rejected if the test statistic exceeds 2.602. The calculated
value of the test statistic is t* = = 1.80, where the estimated standard error is
computed as =---= 2.5. The null hypothesis is not rejected, since 1.80does not
fall in the rejection region.
395- 35
2.5
s 10
&--G
The weights in pounds of weekly garbage for 25 households is shown in Table 9.9. Use these
data to test the hypothesis that the weekly mean for all households in the city is 10pounds. The
alternative hypothesis is that the mean differs from 10pounds. The level of significance is a =
.05.
Table 9.9
Ans. The Minitab solution is shown below.
MTB > set cl
DATA> 5.5 7.5 12.5 10.0 15.5 13.2
DATA> 15.5 3.3 7.5 10.0 6.6 16.5
DATA> 18.0 17.6 4.9 2.9 13.4
DATA > end
MTB > ttest mean = 10data in c1;
SUBC > alt = 0.
T-Testof the Mean
Test of mu = 10.000vs mu not = IO.OO0
Variable N Mean St Dev SEMean
4.0
9.0
T
CI 25 10.320 4.923 0.985 0.33
2.5 14.0 13.3
8.5 16.6 9.7
P
0.75
Since the p value > .05,the null hypothesis is not rejected.
HYPOTHESIS TESTS ABOUT A POPULATION PROPORTION:
LARGE SAMPLES
9.15 A survey of 2500 women between the ages of 15 and 50 found that 28% of those surveyed
relied on the pill for birth control. Use these sample results to test the null hypothesis that p =
25% vs. the research hypothesis that p f 25%, where p represents the population percentage
using the pill for birth control. Conduct the test at a = .05 by giving the critical values, the
computed value of the test statistic,and your conclusion.
CHAP. 91 TESTS OF HYPOTHESIS:ONE POPULATION 207
-
Ans. -Th<critical valms are f 1.96. The test statistic is given by z = -, where og=
’-
OF
f *
The standard error of the proportion is oTi
= -
25x75 -
- .87%, and the computed test statistic is
42500
28-25
z=-- - 3.45. The null hypothesis is rejected since the computed test statistic exceeds the
.87
critical value, 1.96.
9.16 A poll taken just prior to election day finds that 389 of 700 registered voters intend to vote for
Jake Konvalina for mayor of a large Midwestern city. Test the null hypothesis that p = .5 vs.
the alternativethat p > .5 at a = .05,where p represents the proportion of all voters who intend
to vote for Jake. Use the p value approach to do the test.
389
Ans. The sample proportion is F=,, = S56, and the standard error of the proportion is
-
p-po S56-.5
q =/q
= ,/%
= .019. The computed test statistic is z* = --
- --
- 2.95.
OF .019
The p value is given by P(z > 2.95) = .5 - .4984 = .0016. The null hypothesis is rejected since the
p value < .05.
Supplementary Problems
NULL HYPOTHESIS AND ALTERNATIVEHYPOTHESIS
9.17
9.18
Classify each of the following as a lower-tailed, upper-tailed, or two-tailed test:
( U ) H
o
:~ = 3 5 ,
Ha: p < 35 (b) Ho: P = 1.29 Ha: CL # 1.2 (c) Ho: p=$l050, Ha: p > $1050
Ans. (a) lower-tailed test (b) two-tailed test (c) upper-tailed test
Suppose the current mean cost to incarcerate a prisoner for one year is $18,000. Consider the following
scenarios.
( U ) A prison reform plan is implemented which prison authorities feel will reduce prison costs.
(b) It is uncertain how a new prison reform plan will affect costs.
(c) A prison reform plan is implemented which prison authorities feel will increase prison costs.
Let p represent the mean cost per prisoner per year after the reform plan is implemented. Give the
research hypothesis in symbolic form for each of the above cases.
TEST STATISTIC,CRITICALVALUES, REJECTIONANDNONREJECTIONREGIONS
9.19 A coin is tossed 10times to determine if it is balanced. The coin will be declared “fair,” i.e., balanced, if
between 2 and 8 heads, inclusive, are obtained. Otherwise, the coin will be declared “unfair.” The null
hypothesis corresponding to this test is Ho: (the coin is fair), and the alternative hypothesis is Ha: (the
coin is unfair). Identify the test statistic, the critical values, and the rejection and nonrejection regions.
Ans. The test statistic is the number of heads to occur in the 10tosses of the coin. The critical values are
1 and 9. The rejection region is x = 0, 1, 9, or 10,where x represents the number of heads to occur
in the 10tosses. The nonrejection region is x = 2, 3,4,5,6, 7, or 8.
TESTS OF HYPOTHESIS: ONE POPULATION [CHAP. 9
9.20 A die is tossed 6 times to determine if it is balanced. The die will be declared “fair,” i.e., balanced, if the
face “6” turns up 3 or fewer times. Otherwise, the die will be declared “unfair.” The null hypothesis
corresponding to this test is b:
(the die is fair), and the alternative hypothesis is Ha:(the die is unfair).
Identify the test statistic, the critical values, and the rejection and nonrejection regions.
Ans. The test statistic is the number of times the face “6” turns up in 6 tosses. The critical value is 4.
The rejection region is x = 4,5, or 6, where x represents the number of times the face “6” turns up
in the 6 tosses of the die. The nonrejection region is x = 0, 1, 2, or 3.
TYPEI AND TYPE I1 ERRORS
9.21 In problems9.19 and 9.20, describe how a Type I and a Type I1error would occur.
Ans. In problem 9.19, if the coin is fair, but you happen to be unlucky and get 9 or 10heads or 9 or 10
tails and therefore conclude that the coin is unfair, you commit a Type I error. If the coin is
actually bent so that it favors heads more often than tails or vice versa, but you obtain between 2
and 8 heads, then you commit a Type I1error.
In problem 9.20, if the die is fair, but you happen to be unlucky and obtain an unusual happening
such as 4 or more sixes in the 6 tosses and therefore conclude that the die is unfair, you commit a
Type I error. If the die is loaded so that the face “6” comes up more often than the other faces, but
when you toss it, you actually obtain 3 or fewer sixes in the 6 tosses, then you commit a Type 11
error.
9.22 Find the level of significancein problems 9.19 and 9.20.
Ans. The level of significanceis given by a = P(rejecting the null hypothesis when the null hypothesis
is true). In problem 9.19, if the null hypothesis is true, then x, the number of heads in 10tosses of a
fair coin, has a binomial distributionand the level of significanceis a = P(x = 0, 1,9, or 10when
p = S).Using the binomial tables, we find that the level of significance is a = .0010 + .0098 +
.0010 + ,0098 = .0216. In problem 9.20, if the null hypothesis is true, then x, the number of times
the face “6”turns up in 6 tosses, has a binomial distributionwith n = 6 and p = !
. = .167, and the
level of significance is a = P(x = 4, 5, or 6 when p = .167). Using the binomial probability
formula,we find a = ,008096+ .000649+.oooO22= .008767.
6
HYPOTHESIS TESTSABOUT A POPULATION MEAN: LARGE SAMPLES
9.23
9.24
Home Videos Inc. surveys 450 households and finds that the mean amount spent for renting or buying
videos is $13.50 per month and the standard deviation of the sample is $7.25. Is this evidence sufficient
to conclude that p > $12.75 per month at level of significancea= .01?
Ans. The computed test statistic is z* = 2.19 and the critical value is 2.33. The null hypothesis is not
rejected.
A survey of 700 hourly wages in the US.is taken and it is found that: T = $17.65 and s = $7.55. Is this
evidence sufficientto concludethat p > $17.20, the stated current mean hourly wage in the U.S.?
Test at
a = .05.
Ans. The computed test statistic is z* = 1.58 and the critical value is 1.65. The null hypothesis is not
rejected.
CHAP. 91 TESTS OF HYPOTHESIS: ONE POPULATION 209
CALCULATING TYPE I1 ERRORS
9,25 In problem 9.23, find the probability of a Type I1 error if p = $13.25. Use the sample standard deviation
given in problem 9.23 as your estimate of 0.
Ans. p = P(not rejecting the null hypothesis when p = $13.25) = P(X c 13.55 when p = $13.25) =
P(z< 37) = 3078.
9,26 In problem 9.24, find the probability of a Type I1 error if p = $18.00.Use the sample standard deviation
given in problem 9.24 as your estimate of 0.
Ans. p = P(not rejecting the null hypothesis when p = $18.00) = P(Y c 17.67 when p = $18.00) =
P(z<-1.16) = .123.
P VALUES
9.27 Find the p value for the sample results given in problem 9.23. Use this computed p value to test the
hypothesis given in the problem.
Ans. The p value = P(Z > 2.19) = .0143. Since the p value exceeds the preset a,the null hypothesis is
not rejected.
9.28 Find the p value for the sample results given in problem 9.24. Use this computed p value to test the
hypothesis given in the problem.
Ans. The p value = P(Z > 1.58) = .057
1. Since the p value exceeds the preset a,the null hypothesis is
not rejected.
HYPOTHESIS TESTS ABOUT A POPULATION MEAN: SMALL SAMPLES
9.29 The mean score on the KSW Computer Science Aptitude Test is equal to 13.5. This test consists of 25
problems and the mean score given above was obtained from data supplied by many colleges and
universities. Metropolitan College administers the test to 25 randomly selected students and obtains a
mean equal to 12.0 and a standard deviation equal to 4.1. Can it be concluded that the mean for all
Metropolitan students is less than the mean reported nationally? Assume the scores at Metropolitan are
normally distributed and use a = -05.
Ans. The null hypothesis is rejected since t* = -1.83 < -1.71 1. The mean score for Metropolitan
students is less than the national mean.
9.30 The claim is made that nationally, the mean payout per $1 waged is 90 cents for casinos. Twenty
randomly selected casinos are selected from across the country. The data and a Minitab analysis are
shown below and on the next page. If the null hypothesis is &:p = 90, the alternative hypothesis is H
,
:
p <90, and the level of significance is a= .01, what is your conclusion?
Data Display
payout
85 82 85 83 91 78 90
86 95 82 86 '74 89 92
96 81 90 86 92 90
MTB > ttest mean = 90 data in c1;
SUBC > alt = -1.
T-Test of the Mean
Test of mu = 90.00 vs mu c 90.00
210 TESTS OF HYPOTHESIS: ONE POPULATION [CHAP. 9
Variable N Mean St Dev SE Mean T P
Payout 20 86.65 5.63 1.26 -2.66 0.0077
Ans. Since the p value is less than a = .01, reject the claim and conclude that the mean payout is less
than 90 cents per dollar.
HYPOTHESISTESTS ABOUT A POPULATION PROPORTION:LARGESAMPLES
9
.
3
1 A survey of 300 gun owners is taken and 40% of those surveyed say they have a gun for protection for
self/family. Use these results to test Ho: p = 32% vs. Ha: p > 32%,where p is the percent of all gun
owners who say they have a gun for protection of self/family.Test at a = .025.
Ans. The null hypothesis is rejected, since z* = 2.97 > 1.96. It is concluded that the percent is greater
than 32%.
9.32 Perform the following hypothesis about p.
(a) Ho: p = .35vs. Ha:p # .35,n = 100, = .38, a = .I0
(b) H o : p = . 7 5 v s . H a : p < . 7 5 , n = 7 0 0 , ~ = . 7 1 , a = . 0 5
(c) Ho: p = .55 VS, Ha: p > .55, n = 390, p = 57,a = .01
Ans. (a)z* =0.63,p value = .5286, Do not reject the null hypothesis.
(6) z* =-2.44, p value = .0073,Reject the null hypothesis.
(c) z* = 0.79, p value = .2148, Do not reject the null hypothesis.
Chapter 10
Inferences for Two Populations
SAMPLINGDISTRIBUTION OF -x,FOR LARGEINDEPENDENTSAMPLES
Two samples drawn from two populations are independent samples if the selection of the sample
from population 1 does not affect the selection of the sample from population 2. The following
notation will be used for the sample and population measurements:
p1and p
2 = means of populations 1 and 2
01 and 02 = standard deviations of populations 1 and 2
nl and n2 = sizes of the samples drawn from populations 1 and 2 (nl 230, n2 230)
x1and X, = means of the samples selected from populations 1 and 2
-
SI and s2 = standard deviations of the samples selected from populations 1 and 2
When two large samples (nl 2 30, n2 2 30) are selected from two populations, the sampling
distribution of 51, - 51, is normal. The mean or expected value of the random variable 51, - x, is
given by formula (10.I ) :
-
The standard error of XI- X, is given by
(10.2)
Figure 10-1 illustrates that 51, - X2 is normally distributed and centers at p
i - p
2
.The standard error
of the curve is given by formula (10.2).
Fig. 10-1
21 1
212 INFERENCES FOR TWO POPULATIONS [CHAP. 10
Formula (10.3)is used to transform the distribution of X,- X, to a standard normal distribution.
(10.3)
EXAMPLE 10.1 The mean height of adult males is 69 inches and the standard deviation is 2.5 inches. The
mean height of adult females is 65 inches and the standard deviation is 2.5 inches. Let population 1 be the
population of male heights, and population 2 the population of female heights. Suppose samples of 50 each are
selected from both populations. Then XI- ?I2 is normally distributed with mean p,,_,, = kl - p2= 60 - 65 = 4
inches and standard error equal to oxl-si2
=
50
50 = .5 inches. The probability that the
"I "2
mean height of the sample of males exceeds the mean height of the sample of females by more than 5 inches is
represented by P(X, - X2 > 5). The event X,- x2 > 5 is converted to an equivalent event involving z by using
-
- -
- XI - x 2 -(P,-CLz) - 5 - 4
ox,
-x2 5
- 2. From the
formula (10.3). The z value corresponding to 51, - x2 = 5 is z = -
standard normal distribution table the area to the right of 2 is .5 - .4772 = .0228. Only 2.28% of the time would
the mean height of a sample of 50 male heights exceed the mean height of a sample of 50 female heights by 5 or
more inches.
ESTIMATION OF ~ 1 -
~2 USING LARGE INDEPENDENT SAMPLES
The difference in sample means, 51, - X2 ,is a point estimate o
f pI - p2.An interval estimatefor
pI - p2is obtained by using formula (10.3).Since 95% of the area under the standard normal curve
is between -1.96 and 1.96,we have the following probability statement:
Solving the inequality inside the parenthesis for
interval for pI - p2.
- p2, we obtain the following 95% confidence
The general form of a confidence ititend for 1-11 - p2 is given by formula (10.4), where z is
determined by the specified level of confidence. The z values for the most common levels of
confidence are given in Table 10.1,
(20.4)
-
(X,- x,) k z x OK,-KZ
Table 10.1
Confidence level Z value
1.96
2.58
EXAMPLE 10.2 Keyhole heart bypass was performed on 48 patients and conventional surgery was performed
on 55 patients. The time spent on breathing tubes was recorded for all patients and is sumrnarized in Table 10.2.
CHAP. 101 INFERENCES FOR TWO POPULATIONS 213
Sample Sample size Mean
1. Keyhole 48 3.0 hours
2. Conventional 55 15.0hours
-
The difference in sample means is X,- x2 = 3.0 - 15.0 = -12.0. Since population standard deviations are
unknown, they are estimated with the sample standard deviations. The estimated standard error of the
Standard deviation
1.5hours
3.7 hours
d$ferencein sample means is represented by = +- = .544.
- ~~ _ _ ~
r - - ~~
1 Step 1: State the null and research hypothesis. The null hypothesis is represented symbolically by &:pI -
I p2 = Do, and the research hypothesis is of the form Ha: pl - p
2# Do or Ha:pl - p2< Door Ha:pl - p2> Do.
' Step 2: Use the standard normal distribution table and the level of significance, a,to determine the
rejection region.
A 90% confidence interval for p1- p
2 is found by using formula (10.4).The z value is determined to be 1.65
from Table 10.1. The interval is -I 2.0 +_ 1.65 x -544,or -1 2.0+_ 0.9. The interval extends from -12.0 - 0.9 =
-12.9 to -12.0 + 0.9 = -1 1.1. That is, we are 95% confident that the keyhole procedure requires on the average
from 11.1 to 12.9 hours less time on breathing tubes. Note that when population standard deviations are
unknown and the samples are large, that the sample standard deviations are substituted for the population
standard deviations.
- -
X i - X 2 - DO
ox,
-si?
-
I Step 3: Compute the value of the test statistic as follows: z* = , where XI - x2 is the
TESTING HYPOTHESIS ABOUT pi- p
2 USING LARGE INDEPENDENT
SAMPLES
Sample
1. Keyhole
2. Conventional
The procedure for testing hypothesis about the difference in population means when using two
large independent samples is given in Table 10.3.
Table 10.3
Sample size Mean Standard deviation
48 3.5 days 1.5days
55 8.0 days 2.0 days
I Steps for Testing a Hypothesis Concerning pI - p2:Large Independent Samples
computed difference in the sample means, D
o is the hypothesized difference in the population means as
I ? ?
0 5
given in the null hypothesis, and ( T ~ , - ~ ~
= d
: +- or if the population standard deviations are unknown,
n 2
l m n
SKI-312
= -+- is used as an estimate of ( T ~ , - ~ ~
.
4
: :
Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls
in the rejection region. Otherwise, the null hypothesis is not rejected.
EXAMPLE 10.3 Keyhole heart bypass was performed on 48 patients and conventional surgery was performed
on 55 patients. The length of hospital stay in days was recorded for each patient. A summary of the data are
given in Table 10.4.
214 INFERENCES FOR TWO POPULATIONS [CHAP. 10
These results are used to test the research hypothesis that the mean hospital stay for keyhole patients is less than
the mean hospital stay for conventional patients with level of significance a = .01. The null hypothesis is Ho:
pI - p2= 0 and the research hypothesis is Ha: pI - p
2 < 0. This is a lower-tailed test, and the critical value is
-2.33, since P(z < -2.33) = .01.This critical value is determined in exactly the same way as it was in chapter 9.
The null hypothesis is rejected if the computed test statistic is less than the critical value. The following
quantities are needed to compute the value of the test statistic: Do = 0, X,- 51, = 3.5 - 8.0 = -4.5, and the
estimated standard error is Sx,-wz= +- = .346.The computed test statistic is:
"1 "2 48 55
The null hypothesis is rejected and it is concluded that the mean hospital stay for the keyhole procedure is less
than that for the conventional procedure.
SAMPLING DISTRIBUTION OF XI-X
2 FOR SMALL INDEPENDENT
SAMPLES FROM NORMAL POPULATIONS WITH EQUAL (BUT UNKNOWN)
STANDARD DEVIATIONS
When the sample sizes are small (one or both less than 30) and the population standard
deviations are unknown, the substitution of sample standard deviations for the population standard
deviations, as was done in Examples 10.2 and 10.3, is not appropriate. The use of the standard
normal table for confidence intervals and testing hypothesis is not valid in this case. Statisticians
have developed two different procedures for the case where the sample sizes are small and the
population standard deviations are unknown. Both cases require the assumption that both populations
have normal distributions. In this section it will also be assumed that the populations have equal
standard deviations.
A statistical test for deciding whether to assume C T ~= (32 or 01 # 02 utilizes the two sample
standard deviations and a statistical distribution called the F distribution. HoMjever, a rule o
f thumb
used by some statisticiaris states that if .5 I- 5 2, then assume 01 = 02. Otherwise, assume that
s1
s
2
0
1 f 0 2
Suppose CT is the common population standard deviation, i. e., 01 = 02 = O. The standard error of
-+- . Replacing the two population standard
x, - x2 is given by formula (10.2)as ox,-il2
=
deviations by CT and factoring it out, we get formula ( I0.5):
d
e
'
z
-
- -
( I0.5)
Now, $, the common population variance, is estimated by pooling the sample variances as a
weighted average. The pooled estimator o
f d is represented by S2and is given by
Replacing d by S2 in formula (10.5),the estimuted standard error of the diflererice in the sample
means is obtained and is given in formula (10.7)
CHAP. 101 INFERENCES FOR TWO POPULATIONS
Replacing oxl-K2
by s ~ ~ - ~ ~
in formula (10.3),we obtain
215
(10.7)
(10.8)
When sampling from two normally distributed populations having equal population standard
deviations, the statistic given in formula (10.8)has a t distribution with degrees of freedom given by
df=nl+n2-2 (10.9)
EXAMPLE 10.4 A sample of size 10 is randomly selected from a normally distributed population and a
second sample of size 15 is selected from another normally distributed population. The standard deviation of the
first sample is equal to 13.24 and the standard deviation of the second sample is equal to 24.25. The rule of
thumb, given above, suggests that the populations may be assumed to have a common standard deviation since
Sl
s
2
.5 I-= .55 5 2 . A pooled estimate of the common variance is given as follows:
(nl-l)s:+(n2-1)~: 9 ~ 1 3 . 2 4 ~
+ 1 4 ~ 2 4 . 2 5 ~
= 426.5458
-
s2= -
nl+n2-2 10+15-2
and a pooled estimate of the common standard deviation is S = J426,r458 = 20.65. The estimated standard
error of the difference in sample means is as follows:
The computations of S and are required for setting confidence intervals on p1- p2and for testing
hypothesis concerning p
1 - p2 when the two samples are small, and the population standard deviations are
unknown but assumed equal.
ESTIMATION OF ~ 1 -
~2 USING SMALL INDEPENDENT SAMPLES
FROM NORMAL POPULATIONS WITH EQUAL (BUT UNKNOWN)
STANDARD DEVIATIONS
The difference in sample means, 51, -X,, is a point estimate o
f pi - p2..An interval estimatefor
pi - p2is obtained by using formula (10.8). When independent small samples are selected from two
normal populations having unknown but equal standard deviations, the general form of a confidence
interval forQpl
- p
2 is given by formula (10.10),where t is obtained from the t distribution table and
is determined by the level of confidence and the degrees of freedom which is given by df = nl + n2 -
2. The standard error of the difference in sample means, s ~ ~ - ~ ~ ,
is given by formula (10.7).
(10.10)
-
(XI- x*) *t x sxl-xz
216 INFERENCES FOR TWO POPULATIONS [CHAP. 10
Sample Sample size
1. Keyhole 8
2. Conventional 10
EXAMPLE 10.5 Keyhole heart bypass was performed on 8 patients and conventional surgery was performed
on 10patients. The time spent on breathing tubes was recorded for all patients and is summarized in Table 10.5.
The difference in sample means is '51, - x2 = 3.0 - 7.0 = -4.0 hours. The pooled estimate of the common
population variance is obtained as follows:
-
Mean Standard deviation
3.0 hours 1.5 hours
7.0 hours 2.7 hours
( n l - I)s:+(nz-l)s: 7x2.25+9x7.29
= 5.085
-
s
2 = -
nl+ n2 -2 8+10-2
~ Step I : State the null and research hypothesis. The null hypothesis is represented symbolically by Ho: pl - p2
1 = Do and the research hypothesis is of the form Ha:pI- p2 f Do or Ha:pl - p2< Door Ha:pI- p2> Do.
The standard error of the difference in sample means is:
1 Step 2: Use the t distribution table with degrees of freedom equal to nl + n2 - 2 and the level of significance,
~ a,to determine the rejection region.
- -
-
~ Step 3: Compute the value of the test statistic as follows: t* = _"' , where 51, - x2 is the computed
To find a 95% confidence interval for pI - p
2 , we need to find the t value for use in formula (10.10).The
degrees of freedom is df = nl + n2 - 2 = 8 + 10 - 2 = 16. From the t distribution table having 16 degrees of
freedom, we find the following: P(-2.120 < t < 2.120) = .95. The proper value of t is therefore 2.120. The
margin of error is t x Sf,-fz = 2.120 x 1.070 = 2.2684 or 2.27 to 2 decimal places. The 95% confidence
interval is -4.0 k 2.27. Assuming that the times spent on breathing tubes are normally distributed for both
procedures, the difference pl - p2 is between -6.27 and -1.73 hours with 95% confidence.
TESTING HYPOTHESIS ABOUT pi- ~2 USING SMALL INDEPENDENT
SAMPLES FROM NORMAL POPULATIONS WITH EQUAL (BUT UNKNOWN)
STANDARD DEVIATIONS
The procedure for testing a hypothesis about the difference in population means when using two
small independent samples from normal populations with equal (but unknown) standard deviations is
given in Table 10.6.
Table 10,6
Steps for Testing a Hypothesis Concerning p1- p2:Small Independent Samples from
Normal Populations with Equal (but Unknown) Standard Deviations
I
1 difference in the sample means, D
o is the hypothesized difference in the population means as stated in the null
(nl- ~)s:+(n2- ~ ) s $
IS2[ +;
) ,where S2 =
hypothesis, and =
nl+n2-2 *
Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls in
the rejection region. Otherwise, the null hypothesis is not rejected.
I
CHAP. 101
Sample
1. Keyhole
INFERENCES FOR TWO POPULATIONS
Sample size Mean Standard deviation
8 3.5 days 1.5 days
217
EXAMPLE 10.6 Keyhole heart bypass was performed on 8 patients and conventional surgery was performed
on 10patients. The length of hospital stay in days was recorded for each patient. A summary of the data is given
in Table 10.6.
These results are used to test the research hypothesis that the mean hospital stay for keyhole patients is less than
the mean hospital stay for conventional patients with level of significance a = .01. The null hypothesis is Ho:
p1- p2= 0 and the research hypothesis is Ha:p1- p2 < 0. To determine the rejection region, note that the test is
a lower-tailed test and the degrees of freedom is df = nl + n2 - 2 = 8 + 10- 2 = 16. Using the t distribution table,
we find that for df = 16, P(t > 2.583) = .01, and therefore, P(t < -2.583) = .01.The shaded rejection region is
shown in Fig. 10-2.
Fig. 10-2
The null hypothesis is to be rejected if the computed test statistic is less than -2.583. The computed test statistic
is determined according to step 3 in Table 10.6. The difference in sample means is XI- x2 = 3.5- 8.0 = -4.5.
The value specified for Dois 0. The pooled estimate of the commonpopulation variance is
-
(nl- l)sf+ (n2- l)s$ 7x2.25+9~4.00
= 3.2344
-
s2= -
nl+ n2 - 2 8+10-2
and the standard error of the difference in sample means is
The computed test statistic is
-4.5 -0
-
Xi -X2-DO
t* = - - -5.28.
Sw,
-R2 353
Since t* is less than the critical value, the null hypothesis is rejected and it is concluded that mean length of
hospital stay is smaller for the keyhole procedure than the conventional procedure. This hypothesis test
procedure assumes that the two populations are normally distributed.
EXAMPLE 10.7 A sociological study compared the dating practices of high school senior males and high
school senior females. The two samples were selected independently of one another. The number of dates per
month were recorded for each participant. The data are shown in Table 10.7.
218
ggg d .
gg _..
INFERENCES FOR TWO POPULATIONS
I ! I I I 1
I < I 1 I ,
.. ~ .i.... ^. .. ,.:. ........ , . I . . . . . . . . .
! ......... i. ........ . . ( ~ .
. . . . . . .! . . . . . . " i . . . . . . :.
I I t
I 6 > I I
I " ' ' ' 1
$ $ r
.... .~. . . . . . . ~. . . . . . ..~.,. . . . . . . I . . . . . ~ . . . . . . . ..I. . . . . . .y . . . . . .
[CHAP. 1
0
95 -. . . . . .! . . . . . . .
-80
-. ..." J ....... . . . . . . . . . . _. ....
I
I I I 5
I I
. i
.... . . . . . . . . ......... . . . . . . . . . . . . . . . . . . . .
Probability
I
I I r 7 . '
I I t I
t I i $ 1
" . , ~ - ~ " . . "
. . . . . . . - - . . - ~ " ~ . . - " ~ " "
.... . . . _ . - . . . . . . . . " L . . . .
.... ..... .. ........ ........ .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
,05-..
I I I
.- I
-01_ . !
." ..!. '. .." '.. , . ( .
I I I
t I
U " Y .I I,<'
*. Y ..%
I I
I I I 1 ? , I
.'..
I I
! 1.
-001-+. <
. XI #,,.>
, ,
. " %
. ,
$ Y I .
< I
, .* (
, .. .., " .. .
. .* * ... " ....... .
. .. .... " . I ...
<
C .
I I I } 1
$ I 1 I I
I 1 I 1 1 I I 1 1 -
Tab1
Males
8
11
7
5
13
10
10
8
9
5
10.7
Females
7
7
12
11
14
10
10
9
4
8
The assumptions of equal variances and normal populations are usually checked out with statistical software
before the actual test of equality of means is carried out. This has been made much easier because of the wide-
spread availability of statistical software. The checking of assumptions as well as the actual test of equality of
population means for the data in Table 10.7 will be illustrated by using Minitab. The commands % normplot
mdates and % normplotfdates produced the following plots, which may be used to check for normality.
Normal Probability Plot
Average: 8.6
StDev:2.54733
N: 10
CHAP. 101 INFERENCES FOR TWO POPULATIONS 219
Normal Probability Plot
ProbabiIity
.95-
.80-
50 -
.20- <
.05-
(I
~ ..
...
I
4
Average: 9.2
StDev: 2.85968
N: 10
fdates
Anderson-Darling NormalityTest
A-Squared: 0.147
P Value: 0.948
If the sample data are selected from a normal population, the points on the normal probability plot will tend to
fall along a straight line. Normality of the population is usually rejected if the p value is less than a = .05. For
the above data, the p values are 0.840and 0.948and normality of the populations is not rejected.
An edited version of the Minitab procedure to test for equal population variances is shown below. A I in
column 1 indicates the response in c2 came from the male group and a 2 indicates the response came from the
female group. The assumption of equal variances is usually rejected if the p value shown in Bartlett’s Test
(normal distribution) is less than .05. Since the p value is 0.736, the assumption of equal variances is not
rejected.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
CI c2
1 8
1 11
1 7
1 5
1 13
1 10
1 10
1 8
1 9
1 5
2 7
2 7
2 12
2 I 1
2 14
2 10
2 10
2 9
2 4
2 8
220 INFERENCES FOR TWO POPULATIONS [CHAP. 10
MTB > %vartest c2 cl
Homogeneityof Variance
Response C2
Factors C1
Bartlett's Test (normal distribution)
Test Statistic: 0.1 14
P value :0.736
Having checked out the normality assumptions and the equal population variances assumption, the test
procedure is now conducted.
MTB > twot data in c2, groups in cl;
SUBC > alternative is 0;
SUBC > pooled procedure.
Two Sample T-Testand ConfidenceInterval
Two sample T for dates
Sample N Mean St Dev SEMean
1 10 8.60 2.55 0.81
2 10 9.20 2.86 0.90
95% CI for mu (1) - mu (2): (-3.14, 1.94)
T-Test mu (1) = mu (2) (vs not =): T = -0.50 P = 0.63 DF= 18
To use the TWOT command in Minitab, the responses must be in a column and the group from which the
responses came must be in another column. The data are entered in columns 1 and 2 as shown above. The null
hypothesis is Ho: pl - k2= 0. The alternative command indicates the nature of the research hypothesis. A -1
indicates a lower-tailed test, a 0 indicates a two-tailed test, and a 1 indicates an upper-tailed test. The
subcommand pooled procedure indicates that equal population standard deviations are assumed and a pooled
estimated is to be computed.
The output gives the sample size, mean, standard deviation, and standard error of the mean for both groups
separately. A 95% confidence interval for 111 - p2 is given. The computed test statistic is t* = -0.50. The two-
tailed p value is 0.63.The degrees of freedom is 18.The pooled standard deviation estimate is S = 2.7 1.
SAMPLING DISTRIBUTION OF X,-X
2 FOR SMALL INDEPENDENT
SAMPLES FROM NORMAL POPULATIONS WITH UNEQUAL
(AND UNKNOWN) STANDARD DEVIATIONS
When the population standarddeviations are unknown and it is not reasonable to assume they are
equal, the sample variances are not pooled as they were in the previous three sections. Statisticians
have developed a test statistic for this case, which has a distribution, which is approximately a t
distribution. The standard error of the difference in the sample means is approximatedby substituting
the sample variances for the population variances in formula (10.2) as was done in the large sample
case to obtain the estimated standard error given in the following:
(10.11)
In formula (I0.3),we replace~ , l - k zby the expression for s,,-,,in formula (10.1I), and obtain the
statistic shown in formula (10.12):
CHAP. 101
. Sample Sample size Mean Standard deviation
1. Keyhole 8 3.5hours 0.5 hour
2. Conventional 10 5.0 hours 1.7 hours
INFERENCESFOR TWO POPULATIONS 221
When sampling from two normally distributed populations having unequal population standard
deviations, the statistic given in formula (10.12)has an approximatet distribution and the degrees of
freedom is given by
df = minimum of {(nl - l), (n2- I)} (10.13)
EXAMPLE 10.8 A sample of size 13 is taken from a normally distributed population, and a sample of size 15
is taken from a second normally distributed population. The mean of population 1 is 70 and the mean of
population 2 is 65. The population variances are unknown, but assumed to be unequal. The following statistic
has an approximate t distribution.
- -
X I - x 2 - 5
sit,-x2
t =
The degrees of freedom is df = minimum( 12, 14}= 12.
ESTIMATION OF pi- ~2 USING SMALL INDEPENDENT SAMPLES
FROM NORMAL POPULATIONS WITH UNEQUAL (AND UNKNOWN)
STANDARD DEVIATIONS
The difference in sample means, 51,-X2, is apoint estimateof p1- 1 2 . . An interval estimatefor
p1- p
2 is obtained by using formula (10.12).When independent small samples are selected from two
normal populations having unknown and unequal standard deviations, the general form of a
confidence interval for p
1-p
2 is given by formula (10.1#),where t is obtained from the t distribution
table and is determined by the level of confidence and the degrees of freedom which is given by df =
minimum of {(nl - l), (n2 - 1)). The standard error of the difference in sample means, s ~ , - ~ ~ ,
is
given by formula (10.I 1).
EXAMPLE 10.9 Keyhole heart bypass was performed on 8 patients and conventional surgery was performed
on 10patients. The time spent on breathing tubes was recorded for all patients and is summarized in Table 10.8.
The difference in sample means is Xi- X2 = 3.5- 5.0 = -1.5 hours. Because the ratio of the sample standard
deviation for sample 2 divided by sample 1 is 3.4, it is not reasonable to assume that o1= 02.
To find a 95% confidence interval for pI- ~ 2 ,
we need to find the t value for use in formula (Z0.14).The
degrees of freedom is df = minimum of { (nl - I), (n2- 1)) = minimum { 7, 9) = 7. From the t distribution table
having 7 degrees of freedom, we find the following: P(-2.365 < t < 2.365) = 9 5 . The proper value of t is
therefore 2.365. The standard error of the difference in sample means is ssr,-sr2
= Using the sample
nl n2
222 INFERENCES FOR TWO POPULATIONS [CHAP. 10
.25 2.89
= -+-= S66.The margin of error is t x s ~ , - ~ ~
=
J8 10
standard deviations in Table 10.8,we find that
2.365x .566= 1.3386or 1.34to 2 decimal places. The 95% confidence interval is -1.5 k 1.34.Assuming that
the times spent on breathing tubes are normally distributed for both procedures, the difference pI - p2 is
between -2.84and -0.16hours with 95% confidence.
EXAMPLE 10.10 Mothers Against Drunk Driving (MADD) have pushed for lowering the limit for driving
while intoxicated from .1% to .08%.A study of alcohol-related traffic deaths compared the blood alcohol levels
of individuals involved in such accidents in states with a 0.08%limit with those in states with a .l% limit. The
results are shown in Table 10.9.
Minitab will be used to set a 95% confidence interval on pl - p2, where pIis
the mean blood-alcohol level of such individuals in states having a .08%limit and p
2 is the mean blood-alcohol
level in states having a .l% limit.
Table 10.9
0.08%level
.12
.05
.07
.09
.07
.04
.12
.I1
.03
.06
.06
.09
.I0
.I0
.I2
.l% level
.09
.16
.03
.21
.16
.12
-09
.09
.14
-19
.OS
.09
.I 1
.16
.24
Normal probability plots, such as those in Example 10.7, indicate that it is reasonable to assume that the
samples are taken from normally distributed populations. An edited version of the Minitab test for equal
population variances is shown below. Before executing the Minitab command %vartestc2 cl, the data in Table
10.9are set in columns cl and c2, where cl contains a 1 or a 2,depending on where the sample value in c2
comes from. The set up for the columns is similar to that shown in Example 10.7.Bartlett’s Test is used to test
the null hypothesis H
o
:The population standard deviations are equal vs. the alternative hypothesis H,: The
population standard deviations are not equal. The hypothesis of equal standard deviations is usually rejected if
the p value for Bartlett’s Test is less than .05. In this case, the p value is equal to 0.025, and equal population
standard deviations is rejected.
MTB > %vartest c2 cl
Homogeneity of Variance
Response C2
Factors C1
ConfLvl 95.oooO
Bartlett’sTest (normal distribution)
Test Statistic: 4.998
P value :0.025
Since it is reasonable to assume that the populations have normally distributed populations with unequal
population standard deviations, the techniques of this section are appropriate. The command twot data in c2
groups in cl produces a 95% confidence interval for - p2. If the population standard deviations were
CHAP. 101 INFERENCES FOR TWO POPULATIONS 223
Sample Sample size Mean Standard deviation
1. Keyhole 8 3.5days 1.2days
2. Conventional 10 8.0 days 3.0 days h
assumed to be equal, the subcommand SUBC> pooled procedure would be added to the twot command. The
Minitab output gives the sample size, mean, standard deviation, and standard error for both groups. The 95%
confidence interval for pl - 12 extends from -0.0829 to -0.014. The edited output is as follows:
MTB >twot data in c2 groups in cl
Two SampleT-Testand ConfidenceInterval
Two sample T for C2
C1 N Mean StDev SEMean
1 15 0.0820 0.0300 0.0078
2 15 0.1307 0.0562 0.015
95% CI for mu (1) - mu (2): (-0.0829, -0.014)
TESTING HYPOTHESIS ABOUT 1
.
1
1-1.12 USING SMALL INDEPENDENT
SAMPLES FROM NORMAL POPULATIONS WITH UNEQUAL
(AND UNKNOWN) STANDARD DEVIATIONS
The procedure for testing a hypothesis about the difference in population means when using two
small independent samples from normal populations with unequal (and unknown) standard
deviations is given in Table 10.10.
Table 10.10
Steps for Testing a HypothesisConcerning pI - p2:Small Independent Samples from
Normal Populations with Unequal (and Unknown)Standard Deviations
IStep 1: State the null and research hypothesis. The null hypothesis is represented symbolically by &:p1- p2
= Do and the research hypothesis is of the form Ha: p1- p2f Do or Ha:p1- p2 c Do or Ha: p1- p
2> Do.
Step 2: Use the t distribution table with degrees of freedom equal to df = minimum of { (n, - l), (n2- 1)) and
the level of significance, a,to determine the rejection region.
- -
Step 3: Compute the value of the test statistic as follows: t* = XI- x2 -Do,where X,- X2 is the computed
I SX,-xz
difference in the sample means, Do is the hypothesized difference in the population means as stated in the null
hypothesis, and =
nl n2
Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls in
the rejection region. Otherwise, the null hypothesis is not rejected.
EXAMPLE 10.11 Keyhole heart bypass was performed on 8 patients and conventional surgery was performed
on 10patients. The length of hospital stay in days was recorded for each patient. A summary of the data is given
in Table 10.11. The null hypothesis H
o
:pI - p
2 = -4 vs. Ha: - p
2 f -4 is of interest to the researchers
involved in the study. In words, the null hypothesis states that the mean hospital stay is 4 days less for the
keyhole procedure than for the conventional procedure. The research hypothesis states that the difference in
means is not equal to 4 days. Since the sample standard deviation for the conventional procedure is 2.5 times the
sample standard deviation for the keyhole procedure, it is assumed that population standard deviations are
unequal. The sample observations (which are not shown) indicate that it is reasonable to assume that both
populations are normally distributed.
224 INFERENCES FOR TWO POPULATIONS [CHAP. 10
The degrees of freedom for the t distribution is df = minimum of { (nl - l), (n2- 1)) = minimum of { 7, 9) = 7.
For a = .01, an area of .005 is allocated to each tail, since the research hypothesis is two-tailed. Using the t
distribution with 7 degrees of freedom, it is found that P(t > 3.499) = .005. The critical values are +, 3.499. The
estimated standard error of the difference in sample means is :
The computed test statistic is:
- -
35-8.0-(-4)
Xi-Xz-Do -
t* = - =-.48
Sii,-ii2 1.039
Since t* is between -3.499 and 3.499, the evidence is not sufficient to reject the null hypothesis.
EXAMPLE 10.12 The data in Table 10.9 are used to test the research hypothesis H,: pi - p2 < 0. The
statement p
1-p
2 <0 is equivalent to p
i < p2.The statistical statement pl < p2 states that the mean blood-alcohol
level for individuals involved in alcohol-related traffic deaths is lower in states with a .OS% limit than the mean
level in states with a .I% limit. In Example 10.10, it was shown that it was reasonable to assume normal
population distributions and unequal population standard deviations. After putting the data in columns c1 and c2
as described in Example 10.10,the followingMinitab output was obtained.
MTB > twot data in c2 groups in c1;
SUBC > alternative =-1.
Two Sample T-Testand ConfidenceInterval
Two sample T for C2
C1 N Mean St Dev SEMean
1 15 0.0820 0.0300 0.0078
2 15 0.1307 0.0562 0.0150
95% CI for mu (1) - mu (2): (-0.0829, -0.014)
T-Test mu (1) = mu (2) (vs <): T= -2.96 P=0.0038 DF= 21
The subcommand, SUBC> alternative=-1, indicates that a lower-tailed test is required. Note that the research
hypothesis is supported at the a = .05 level, since the p value is 0.0038. The computed test statistic is shown to
be T = -2.96 and the degrees of freedom is shown as 21. The test statistic is computed as shown in Table 10.10.
The degrees of freedom is not computed by df = minimum of { (nl - l), (n2 - 1)) = minimum of { 14, 14) = 14.
The degrees of freedom computation is given by a more complicated formula. The use of df = minimum of
{(nl - l), (n2 - 1)) gives a more conservative test than that used by Minitab. Both ways of computing the
degrees of freedom may be found in various textbooks.
SAMPLING DISTRIBUTION OF aFOR NORMALLY DISTRIBUTED
DIFFERENCES COMPUTED FOR DEPENDENT SAMPLES
The previous sections in this chapter have dealt with independent samples selected from two
populations. When sample values are purposely matched or paired, the samples are called dependent
samples. The samples are also referred to as paired or matched samples. Dependent samples include
pairs of measurements such as beforelafter measurementsmade on the same person or machine, pre-
and post-test scores taken on the same person, similar measurements made on twins who have
undergone different treatments, and so forth. Table 10.12 illustrates a typical set of paired data. The
differences are formed by subtractingthe second sample value from the first.
CHAP. 101 INFERENCES FOR TWO POPULATIONS
Table 10.12
Pair I Sample 1 Sample 2 I Difference
I I
I
dl = X
I - Yl
d2 = x2 -Y2
Yl
I Y2
XI
x2
225
The estimation and testing procedures for experiments involving paired samples involve the
differences shown in Table 10.12. As in previous sections, we shall investigate the sampling
distribution of the mean of the sample differences first, and then apply these results to establish
confidence intervals and test hypothesis concerning the mean of the population of all possible
differences.The followingnotation will be used for inferences involvingdependent samples.
= the mean of the population of paired differences
a d = the standard deviation of the population of paired differences
d =the mean of the paired differences which are computed from the samples
s
d = the standard deviation of the paired differenceswhich are computed from the samples
-
n =the number of paired differenceswhich are computed from the samples
0 , = the standard error of d
S, = the estimated standard error of d
The sample mean of paired differences is given by
- z d
n
d = -
The sample standard deviation of paired differencesis given by
The estimated standard error of 3is given by
- S d
s, - -
&
(10.15)
( I0.16)
(10.17)
When the population of differences is normally distributed, the statistic given in formula (10.18)has
a t distribution with (n - 1) degrees of freedom.
(10.18)
This result is perfectly analogous to the previous result given in Chapters 8 and 9; namely the result
that the statistic given in formula (10.19)has a t distribution with df = n - 1 when the sample values
comprising X come from a normally distributed population having mean equal to p.
226
Lubricating
tears
INFERENCES FOR TWO POPULATIONS [CHAP. 10
Emotional
tears
(ZO. 19)
The confidence interval for jkJ as well as the procedure for testing hypothesis about P
d are analogous
to the techniques given in Chapters 8 and 9 for estimating the mean of one population or testing an
hypothesis about the mean of a population. The only difference between the techniques is that in the
case of dependent samples, the procedures are applied to differences.
EXAMPLE 10.13 A digital Sphygmomanometergives diastolic blood pressure readings that are 5 units higher
on the average than those obtained by the traditional method used by most medical professionals. Suppose 20
individuals have their diastolic blood pressure taken both ways. Let x represent the readings obtained by using
the digital Sphygmomanometer, and let y represent the readings obtained by the traditional method. The
differences in readings are found by the formula d = x - y. The mean population difference is & = 5 units.
Assuming that the population of all differences is normally distributed, the following statistic has a t distribution
with df = 20 - 1 = 19:
-
d -5
t = -
S-
d
ESTIMATION OF
COMPUTEDFROM DEPENDENT SAMPLES
USING NORMALLY DISTRIBUTED DIFFERENCES
When a paired sample of size n is selected and differences are computed as shown in Table
10.12, a confidence interval for the population mean difference, &, can be formed by using the
sampling distribution of the statistic given in formula (10.Z8). The confidence interval is given by
formula (Z0.20), where t is determined by the level of confidence and df = n - 1. The standard error
of d is computed by using formulas (10.16)and (IO.Z7).
-
d + t x s, (10.20)
EXAMPLE 10.14 A psychological study compared the amount of manganese in tears that lubricate the eye
with the amount in emotional tears. A measurement, which is related to the amount manganese, was taken for
each type of tear on 10different subjects. The results are shown in Table 10.13.The computations needed to set
a 99%confidence interval on are given below the table.
Subject
1
2
3
4
5
6
7
8
9
10
10
13
11
10
9
8
10
10
8
12
12
12
9
11
7
9
I 1
13
10
10
Difference
-2
1
2
-1
2
-1
-1
-3
-2
2
CHAP. 101 INFERENCES FOR TWO POPULATIONS 227
The sum of the differences is Zd = -2 + 1 + 2 - 1 + 2 - 1 - 1 - 3 - 2 +2 =-3 and the sum of the squares of the
differences is Zd2 = 4 + 1 + 4 + 1 + 4 + 1 + 1 + 9 + 4 +4 = 33. The mean difference is a = -.3. The sample
standard deviation for the differences is
The standard error of a is
s
d 1.889
s, = J
;
; = Jlo= .597
The t value for a 99% confidence interval is found as follows: The degrees of freedom is df = 10- 1 = 9. From
the t distribution table for 9 degrees of freedom, we find that P(-3.250 < t < 3.250) = .99. The 99% confidence
interval is found by using formula (20.20) to be: -.3 k 3.25 x S97, or the 99% confidence interval is (-2.24,
1.a).
The confidence interval is valid provided the population of differences is normally distributed.
EXAMPLE 10.15 A Minitab solution to Example 10.14 is given below. The lubricating tear data is put in
column 1 and the emotional tear data is put in column 2. The command let c3 = cl - c2 computes the
differences and puts them in column 3. The command name c3 'dB" assigns the name diff to column 3. The
command tinterval 99% confidence data in c3 computes the 99% confidence interval for p,,. Note that
Minitab computes and prints out the mean difference, the sample standard deviation of differences, the standard
error of d,as well as the 99% confidence interval. These quantities are the same as those computed by hand in
Example 10.14.
Data Display
Row lubricate emotion
1 10
2 13
3 11
4 10
5 9
6 8
7 10
8 10
9 8
10 12
12
12
9
11
7
9
11
13
10
10
MTB > let c3 = cl - c2
MTB > name c3 'diff
MTB > tinterval99% confidence data in c3
Codidence Intervals
Variable N Mean StDev SEMean 99.0%CI
diff 10 -0.300 1.889 0.597 (-2.241, 1.641)
TESTING HYPOTHESIS ABOUT
DIFFERENCES COMPUTEDFROM DEPENDENT SAMPLES
USING NORMALLY DISTRIBUTED
The procedure for a hypothesis test about the mean differencefor paired samples is given in
Table 10.14.
228 IIWERENCES FOR TWO POPULATIONS [CHAP. 10
-~ ~ ~~~
Step 1: State the null and research hypothesis. The null hypothesis is represented symbolically by Ho:
Doand the research hypothesis is of the form Ha:& # Do or Ha:pd < Door Ha:& > Do.
Step 2: Use the t distribution table with degrees of freedom equal to df = n - 1 and the level of significance,
a,to determine the rejection region.
=
-
Table 10.14
Steps for Testing a Hypothesis Concerning &:Normally Distributed Differences
Computed from Dependent Samples
d-DO - U
Compute the value of the test statistic as follows: t* = -
, where d = -, Do is the
n
Step 3:
I Sd
s
d
hypothesized value of p
d in the null hypothesis, Sa = - and s d =
42
Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls
in the rejection region. Otherwise, the null hypothesis is not rejected.
I
EXAMPLE 10.16 A sociological study concerning marriage and the family was conducted and one of several
factors of interest was the educational level of husbands and wives. Table 10.15 gives the number of years of
education beyond high school for 15 couples. These data were used to test the null hypothesis that the mean
educational levels are the same vs. the research hypothesis that the educational levels are different at level of
significance a = .05.
Couple
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Table 10.15
Husband
4
0
7
4
2
8
4
3
1
2
4
8
0
4
1
Wife
2
4
6
4
3
4
4
1
0
2
4
2
4
0
4
~~~~~
Difference
2
-4
I
0
-1
4
0
2
1
0
0
6
4
4
-3
The degrees of freedom is df = 15 - 1 = 14, and since the research hypothesis is two-tailed and a = .05, a t
distribution curve with area equal to .025 in each tail will determine the rejection regions. Consulting the t
distribution table with df = 14, we find that P(t > 2.145) = .025.Figure 10-3illustrates the rejection regions.
CHAP. 101 INFERENCES FOR TWO POPULATIONS 229
Fig. 10-3
The sum of the differences is Zd = 8 and the sum of the squares of the differences is Zd2 = 120. The mean
c d 8
difference is a= --
- --- S33, and the standard deviation is
n 15
s d = i . 1
cd2-(Id)*/ n = 120-g2 /15 z2.8752
The standard error of the mean difference is
Sd 2.8752
%=---- - .7424
A - J15
The computed test statistic is -
d-Do 533-0
= 0.72
t*= ---
-
Sd .7424
Since t* = 0.72 does not fall within the rejection region shown in Fig. 10-3, we are unable to conclude that the
educational levels of husbands and wives are different.
EXAMPLE 10.17 A Minitab solution for Example 10.16 is shown below. The data for the husbands is put into
column cl and the data for the wives is put into column c2. The differences are put into c3, and a one-sample t
test is performed on c3. The subcommand S U B 0 alternative= 0. indicates a two-tailed alternative hypothesis.
Note that the mean, standard deviation, standard error, and computed t values are the same as those computed in
Example 10.16.Since the computed p value is 0.48, the null hypothesis is not rejected at the a = .05 level.
Data Display
Row husband
1 4
2 0
3 7
4 4
5 2
6 8
7 4
8 3
9 1
10 2
11 4
12 8
13 0
14 4
15 1
wife
2
4
6
4
3
4
4
1
0
2
4
2
4
0
4
230 INFERENCESFOR TWO POPULATIONS [CHAP. 10
MTB>letc3=cl -c2
MTB > name c3 'diff
MTB > ttest mean = 0, data in c3;
SUBC > alternative = 0.
T-Test of the Mean
Test of mu = O.OO0 vs mu not = 0.000
Variable N Mean St Dev SEMean T P
diff 15 0.533 2.875 0.742 0.72 0.48
SAMPLING DISTRIBUTION OF PI-Fz FOR LARGE INDEPENDENT SAMPLES
Previous sections in this chapter have discussed inferences concerning the differences in means
for two populations. The remaining sections of Chapter 10are concerned with inferences concerning
the differences in population proportions or percents. How does the percent of defectives produced
by machine 1 compare with the percent of defectives produced by machine 2? Is there a difference in
the percent of males who will vote for a presidential candidate and the percent of females who will
vote for the candidate? Is the percentage of smokers the same for African-Americans as it is for
Whites? All of these questions involve the comparison of percents or proportions. Sample
proportions will be used to estimate the differencesin population proportions or to test a hypothesis
about the differences in population proportions. The following notation will be used.
pI and p
2 = proportions in populations 1 and 2 having the characteristicof interest
nl and n2 = sizes of the independent samples drawn from populations 1 and 2
pI and p2 = proportions in samples 1 and 2 having the characteristicof interest
-
q1 = I - pi and q 2 = 1 -p2 = proportions in populations 1 and 2 not having the characteristic
q1= 1-p,andq2= 1-ij2 = proportions in samples 1 and 2 not having the characteristic
When the samples sizes, nl and n 2 , are such that nlpl > 5, n2p2 > 5, nlql > 5 and n2q2 > 5, the
sampling distribution of PI - p2 is normal. The mean or expected value of PI - p2 is given by
formula (10.21):
-
- -
The standarderror of is, - p2is given by formula (10.22).
(10.2I )
(10.22)
When the sample sizes are large enough to satisfy the above requirements,the distribution of F, - is,
is as shown in Fig. 10-4.The standard error of the curve is given by formula (I0.22).
CHAP. 101 INFERENCES FOR TWO POPULATIONS 231
Fig. 10-4
-
Formula (10.23) is used to transform the distributionof pl- p2 to a standard normal distribution.
-
PI -Pz-(PI -P2)
2 =
%-P2
(10.23)
EXAMPLE 10.18 Three percent of the items produced by machine 1 have a minor defect and 2% of the same
item produced by machine 2 have a minor defect. If samples of size 500 are selected from each machine, the
distribution of p,- p2will be normal since nlpl = 500 x .03 = 15, n2p2 = 500 x .02 = 10,nlql = 500 x .97 =
485 and n2q2=500 x .98= 490. Normality may be assumed, since all four products are greater than 5. The mean
value of PI- Tj2 is .03- .02= .01, and the standard error of F, - p2 is
-
1
+m
- .03 x .97+ .02 x .98
= .00987.
n2 -d 500 500
The probability that the percent defective in the sample from machine 1 exceeds the percent defective in the
sample from machine 2 by 3% or more is expressed as P(pl - p2 > .03). The event p1 - p2 > .03 is
transformed to an equivalent event involving z by the use of formula (10.23). The equivalent event is
= 2.03, or z > 2.03.The event p,- p2> .03 is equivalent to the event z > 2.03, and
therefore P(pl - p2> .03) = P(z > 2.03) = .5 - .4788 = .0212. There are only approximately 2 chances out of
100that the percent defective in sample 1will exceed the percent defective in sample 2 by 3% or more.
- -
-
pI-p2- .01 .03 - .01
>
.00987 .00987
ESTIMATIONOF Pi -P
2 USING LARGE INDEPENDENTSAMPLES
The difference in sample proportions, F1- F2,is a point estimate ofpl - p2. An interval estimate o
f
pI-p2 is obtained by using formula (10.23). However, the standarderror as given in formula (10.22)
will need to be estimated since p1 and p2 are unknown. If the population proportions are estimated by
their corresponding sample proportions, we obtain the estimated standard error of PI - p2. The
estimated standard error of PI- p2is represented by spI-p,
and is given by formula (10.24).
The confidence interval for pl -p2 is given by formula(20.25):
(10.24)
( I0.25)
232 INFERENCES FOR TWO POPULATIONS [CHAP. 10
The confidence interval is valid provided that nlpl > 5, n2p2 > 5, nlql > 5 and n2q2 > 5. Since PI, 91,
p2 ,and q2are unknown, the corresponding sample quantities are substituted to check the validity of
using the confidence interval.
EXAMPLE 10.19 A survey of 2000 ninth-graders found that 32% had used cigarettes in the past week and a
survey of 1500 high school seniors found that 35% had used cigarettes in the past week. Suppose p1represents
the proportion of all ninth-graders who used cigarettes in the past week and p2represents the proportion of all
seniors who used cigarettes in the past week. A point estimate for pl - p2 is -3%. The estimated standard error
of F, - p2is as follows:
A 95% confidence interval for pi - p2 is given by (p,- p,) f
. z x Siil-ii2or -3% f
. 1.96 x 1.614% or -3% f
3.2%. The 95% confidence interval extends from -6.2% to 0.2%. Note that nl x PI = 2000 x .32 = 640,
nl x ql = 2000 x .68 = 1360, n2 x p2= 1500x .35 = 525, and n2 x q2= 1500 x .65 = 975. Since all 4 of these
quantities exceed 5, the confidence interval is valid.
TESTING HYPOTHESIS ABOUT PI -Pz USING LARGE INDEPENDENT
SAMPLES
The most common null hypothesis concerning pl - p2 is H
o
:pl - p2 = 0. Recall that in testing
hypothesis, the null hypothesis is always assumed to be true and the test statistic is computed under
this assumption, If the test statistic value is judged to be highly unusual, then the assumption that the
null hypothesis is true is rejected. When Ho is assumed to be true, p1 -p2 = 0 or p1 = p2. If we let p be
the common value of p1 and p2, then the standard error of Fl - p2 simplifies as follows:
-
I
Since p and q are unknown, they must be estimated from the two samples. Let XI be the number in
sample 1 with the characteristic of interest and let x2 be the number in sample 2 with the
characteristic of interest. A pooled estimate o
fp is given by
(Z0.26)
Substituting for p and 4 = 1 - j
7 for q in the above expression for ( J ~ , - ~ ,
, we obtain the
estimated standard error of F, - as given in
(10.27)
The test statistic for testing the null hypothesis Ho: p1 - p2 = 0 is obtained by using formula
(Z0.23) with pl - p2 replaced by 0 andOFl-F2
estimated by s ~ ~ - ~ ~
as given in formula (10.27). The
resulting test statistic is given in formula (10.28):
CHAP. 101
Sample
1. Whites
2, Hispanics
INFERENCES FOR TWO POPULATIONS
Sample size Number of smokers Sample proportion
nl = 1500 X I = 555 p, = 0.37
n2 = 500 ~2 = 175 P, = 0.35
-
-
233
(10.28)
The steps for testing a hypothesis concerning pI - p2 are given in Table 10.16.
Table 10.16
Steps for Testing a Hypothesis Concerning pI - p2: Large Independent Samples
nlpl > 5, n2p2 > 5, nlql > 5 and n2q2 >5
Step 1: State the null and research hypothesis. The null hypothesis is represented symbolically by Ho: p1 -
p2 = 0 and the research hypothesis is of the form Ha: P I - p2 <0 or Ha: PI- p2 > 0 or Ha:p1- p2 f 0.
Step 2: Use the standard normal distribution table and the level of significance, a,to determine the rejection
region.
-
PI-ii,
Step 3: Compute the value of the test statistic as follows: z* = -, where PI- p2is the computed
I SB1-F2
difference in sample proportions, sF1-B2
= ,and S = 1 - jY.
Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls
in the rejection region. Otherwise, the null hypothesis is not rejected.
I
EXAMPLE 10.20 A study was conducted to compare teen cigarette use for whites and Hispanics. Suppose pI
represents the proportion of teenage whites who use cigarettes and p2 represents the proportion of teenage
Hispanics who use cigarettes. The null hypothesis is Ho:p1- p2= 0 and the research hypothesis is Ha:p1- pz#
0. The critical values for a level of significance equal to .05 are k 1.96.The sample results are given in Table
10.17.
xi+ x2 555 +175
The pooled proportion is = --
- = 0.365.The estimated standard error for PI- P, is:
n i + n 2 1500+500
and the computed test statistic is:
Based on this study, we would not be able to conclude that a difference exists between the proportion of
smokers within the two groups of teenagers.
234
Sample
I . Men
[CHAP. 10
Sample size Mean Standard deviation
250 5.5 years 2.I years
INFERENCES FOR TWO POPULATIONS
Solved Problems
SAMPLING DISTRIBUTION OF XI-X
2 FOR LARGE INDEPENDENT SAMPLES
2. Women 250 3.3 years 1.8years .
10.1 A sample of size 50 is taken from a population having a mean equal to 90 and a standard
deviation equal to 15. A second sample of size 70 and independent of the first sample is
selected from another population having mean equal to 75 and standard deviation equal to 10.
The mean of the sample of size 50 is represented by XIand the mean of the sample of size 70 is
represented as -j12 .
(a) What type distribution does XI - X2 have'?
(b) What is the expected value of XI- X2?
(c) What is the standard error of 511 - j12?
(d) Transform T1- 512 to a standard normal.
Am. (a)Because both samples are 30 or more, 511 - 512 will have a normal distribution.
(6)The expected value of XI- X2 is E(Xl - X2) = psr,-srz
= p1- p2 = 90 - 75 = 15.
= i$+:
= J-+-
225 100 = 2.43
(c)The standard error of XI - 512 is 051,-srz
50 70
- -
X I - ~2 - 15
( 4 z = is a standard normal variable.
2.43
ESTIMATION OF pi- ~2 USING LARGE INDEPENDENT SAMPLES
10.2 Table 10.18 gives the summary statistics for the number of years that 250 men and women
have spent with their current employers. Use these results to find a 90%confidence interval for
pl- p
2 ,the mean difference in years spent with their current employer for men and women.
Table 10.18
Am. The general form of the Confidence interval for p1 - p2 is (Tl - T2)+_ z x The difference
in means is 51,- x2= 5.5 - 3.3 = 2.2. The z value for a 90% confidence interval is 1.65. The
estimated standard error of the difference in means is +- =.1749.
The 90% margin of error when using 5(,- 51, as an estimate of pl - p2 is 1.65 x .1749 or .2886.
The confidence interval is 2.2 +_ .3. The confidence interval extends from 1.9to 2.5 years.
TESTING HYPOTHESIS ABOUT pi- ~2 USING LARGE INDEPENDENT
SAMPLES
10.3 Use the data in Table 10.18to test the null hypothesis that p1- p
2 = 1.5 years vs. the research
hypothesis that p1- p2> 1.5years at level of significance a = .01,
CHAP. 101 INFERENCES FOR TWO POPULATIONS 235
Ans. Since the samples are large, the standard normal distribution table is used to find the critical value.
From the standard normal table, we find that P(z > 2.33) = .01, and therefore the critical value is
2.33. The computed value of the test statistic is given as follows:
The standard error of the difference in means is replaced by the estimated standard error of the
difference in means. It is concluded that the mean exceeds 1.5years.
SAMPLING DISTRIBUTION OF XI-x,FOR SMALL INDEPENDENT
SAMPLES FROM NORMAL POPULATIONS WITH EQUAL (BUT UNKNOWN)
STANDARD DEVIATIONS
10.4 A sample of size 10is taken from a normal population having a mean equal to 35 and a sample
of size 15 is taken from another normal population having mean equal to 40. The two normal
populations have equal variances. The mean and variance of the sample of size 10 are
represented by F1and S: and the mean and variance of the sample of size 15 are represented
by X2 and s;.
(a) Give the expression for the pooled estimate of the common population variance.
(b) Give the expression for the estimated standard error of the difference in the sample means.
(c) Use the results in parts (a) and (b) to form a statistic that has a t distribution with 23
degrees of freedom.
(nI -1)s: +(n2 - 1) s$ - 9 x S
: +14x S:
nl+ n2 - 2 23
Ans. (a) S2= -
(b) = /S2[ +--!--I = lis..(”),
10 1s where S2is given in part (a).
ESTIMATION OF pi -j ~ 2
USING SMALLINDEPENDENT SAMPLES
FROM NORMAL POPULATIONS WITH EQUAL (BUT UNKNOWN)
STANDARD DEVIATIONS
10.5 A comparison of motel room rates for single occupancy was made for the cities of Omaha,
Nebraska and Kansas City, Missouri. The rates for the two cities are shown in Table 10.19.
Using the command % normplot of Minitab for both samples, it is found that it is reasonable
to assume that both populations are normally distributed. Using the command %vartestof
Minitab, it is also found that it is reasonable to assume that the populations have equal
variability. The Minitab output for setting a 99%confidence interval on p1- p
2 is given below.
Verify the Pooled standard deviation, the degrees of freedom, and the 99% confidence interval
given in the output.
236 INFERENCES FOR TWO POPULATIONS [CHAP. 10
Table 10.19
I
Omaha
75
70
70
65
85
85
80
90
90
60
60
Kansas City
80
75
75
85
85
100
60
65
95
105
70
MTB > twot 99% confidence data in c2, groups in c 1;
SUBC > pooled.
Two SampleT-Testand ConfidenceInterval
Two sample T for rate
city N Mean St Dev SEMean
1 1 1 75.5 11.3 3.4
2 11 81.4 14.3 4.3
99%CI for mu (1) - mu (2): (-2 1.6, 9.7)
T-Test mu (1) = mu (2) (vs not =):T =-1.07 P = 0.30 DF = 20
Both use Pooled St Dev = 12.9
-
Ans. The difference in sample mean is '51, - x2 = 75.5 - 81.4 =-5.9.
(nl -1)s: +(n2 - I) s! 10x 127.69+10x204.49
= 166.09,and S =
-
-
The pooled variance is S2 =
nl+ n2 -2 11+11-2
4166.09 = 12.9.
The standard error of the difference in the sample means is SxI-x2 = &
2
[
;
+
;
] =
1
7
3
166.09~-+- ~ 5 . 5 0 .
Using the t distributiontable with df = 20 and right-hand tail area equal to .005, we find the t value
is 2.845. The 99% margin of error is 2.845 x 5.50 = 15.6. The 99% confidence interval extends
from -5.9 - 15.6= -21.5 to -5.9 + 15.6= 9.7.
TESTING HYPOTHESIS ABOUT ~ 1 -
~2 USING SMALL INDEPENDENT
SAMPLES FROM NORMAL POPULATIONS WITH EQUAL (BUT UNKNOWN)
STANDARD DEVIATIONS
10.6 Use the motel rate data in Table 10.I9 and the Minitab output given in problem 10.5 to test the
null hypothesis Ho:pI- p2 = 0 vs. Ha:pl -p
2 # 0 at significance level a = .O 1.
(a) Give the critical values for performingthe test.
(b) Give the computed value of the test statistic, and your conclusion based upon this value
and the critical value in part (a).
(c) Give the p value and your conclusion based on this value.
CHAP. 101
Ans. (a)
INFERENCES FOR TWO POPULATIONS 237
The degrees of freedom for the t distribution is 20. Since the research hypothesis is two-tailed,
the significance level is divided by 2 and .005 is put into each tail of the distribution. By
consulting the t distribution table, we find that for df = 20 and right-tail area = .005, the t
value is 2.845. The critical values are k2.845.
The computed t value, from the Minitab output in problem 10.5,is t* = -1.07. Since this value
does not fall in the rejection region, the null hypothesis is not rejected. It cannot be concluded
that the mean motel rates differ for the two cities.
The p value, from the Minitab output in problem 10.5,is equal to 0.30. Since this exceeds the
preset level of significance,the null hypothesis is not rejected. The same conclusion is reached
as in part (b).
SAMPLING DISTRIBUTION OF Xi-X,FOR SMALL INDEPENDENT
SAMPLES FROM NORMAL POPULATIONS WITH UNEQUAL
(AND UNKNOWN) STANDARD DEVIATIONS
10.7 Refer to problem 10.4.Suppose the two populations have unequal population variances. What
changes are needed to the answers given in the problem?
Am. Since the population variances are unequal, the sample variances are not pooled together to
estimate a common population variance. The standard error of the difference in the sample means
- -
X i - X 2 4 - 5
is given by sxl-siz has a t distribution and
sx,-iT2
the degrees of freedom is given by df = minimum of { (nl - l), (n2- 1)) = minimum of { 9, 14) =
9. Note that the degrees of freedom is reduced from 23 to 9. This problem illustrates the
importance of checking the assumptions underlying the estimation and testing procedures. The
computation of the test statistic as well as the degrees of freedom is determined by the assumption
concerning the variances of the two populations.
ESTIMATION OF 1 1 -~2 USING SMALL INDEPENDENT SAMPLES
FROM NORMAL POPULATIONS WITH UNEQUAL (ANDUNKNOWN)
STANDARD DEVIATIONS
10.8 Refer to problem 10.5. Compute the standard error of the difference in means, the degrees of
freedom, and the 99% confidence interval assuming unequal population variances. Compare
the results with those in problem 10.5.
Ans. The standard error of the difference in sample means isSsr,-n2= +-
nl n2 I 1 11
which equals 5.50. This is the same answer obtained in the equal variances case. This will always
occur if the sample sizes are equal.
The degrees of freedom is df = minimum of ( ( n l - I), (n2 - l)] = minimum of { 10, 10) = 10.
Using the t distribution table with df = 10and right-hand tail area equal to .005, we find the t value
is 3.169. The 99% margin of error is 3.169 x 5.50 = 17.4. The 99% confidence interval extends
from -5.9 - 17.4= -23.3 to -5.9 + 17.4 = 11.5.
The 99% confidence interval in the equal variances case is (-21.5, 9.7). The 99% confidence
interval for the unequal variances case is (-23.3, 11.5). Note that the interval in the unequal
variances case is wider.
238 INFERENCES FOR TWO POPULATIONS [CHAP. 10
Sample
1. College graduate
2. Non-college graduate
TESTING HYPOTHESIS ABOUT PI- ~2 USING SMALL INDEPENDENT
SAMPLES FROM NORMAL POPULATIONS WITH UNEQUAL
(AND UNKNOWN) STANDARD DEVIATIONS
Sample size Mean Standard deviation
14 8.6 hours 1.1 hours
12 6.3 hours 2.7 hours
10.9 In a study of internet users, the average time spent online per week was determined for a group
of college graduates as well as a group of non-college graduates. The results of the study are
shown in Table 10.20.Test the research hypothesis that > p
2 at level of significancea = .05.
Give the critical value, the computed test statistic, and your conclusion. Assume that the times
are normally distributed for both populations.
Pair
1
2
3
4
5
6
Table 10.20
Sample 1 Sample 2 Difference
18 15 dl= 18- 15 = 3
23 22 d2=23-22= 1
27 25 d3 = 27 - 25 = 2
20 22 d4 = 20 - 22 = -2
19 19 d5 = 19- 19= 0
24 22 d6 = 24 - 22 = 2
Ans. Some statisticians use the following rule to decide whether population variances are equal or not:
If.5 I-5 2, then assume 01 = 02. Otherwise,assume that 01 f 02. Since the ratio of the sample
is less than .5, we assume that the populations have unequal standard deviations. The degrees of
freedom for this model is df = minimum of {(nl - l), (n2- I)} = minimum of { 13, 1 1}= 11. The
critical value is determined by using the t distribution table with df = I 1 and right-tail area equal to
.OS. This value is found to equal 1.796.
S1
s 2
The standard error of the difference in sample means is =
nl n2 14 12
- -
XI
-~2 -(PI -~ 2 )
- 8.6-6.3- 0
,833. The computed test statistic is t* = - = 2.76. The research
srr,-x2 .833
hypothesis is supported since t* = 2.76 exceeds the critical value, 1.796.
SAMPLING DISTRIBUTION OF d FOR NORMALLY DISTRIBUTED
DIFFERENCES COMPUTEDFOR DEPENDENT SAMPLES
10.10 Table 10.21 gives a set of paired data, along with the differences for the pairs. Answer the
following questionsconcerning these paired data.
Sd
U
n
/a2
-n(y)2
/ n ,and sa = -
(a) Find the following: 7= -, s
d =
A *
(b) What parameters do each of the statistics in part (a)estimate?
(c) What assumption is needed in order that the statistic t = k&-
have a t distribution
-
Sa
with (n - 1) = (6- 1) = 5 degrees of freedom?
Table 10.21
CHAP. 101 INFERENCES FOR TWO POPULATIONS 239
Pair
1
s
d 1.789
= 1.789, and S;i = -- -= 0.73.
& -
African-American White
65 60
(6) The statistic, 3,
estimates ~.ld
,the mean of the population of paired differences. The statistic,
s d , estimates a d ,the standard deviation of the population of paired differences. The statistic,
ss,estimatesoa,the standard error of the population of paired differences.
(c) It is assumed that the population of paired differences is normally distributed.
2
3
4
5
6
7
8
9
10
ESTIMATION OF
COMPUTEDFROM DEPENDENTSAMPLES
USING NORMALLY DISTRIBUTED DIFFERENCES
50
75
80
105
90
65
60
115
80
10.11 A sociological study compared the salaries of 10professional African-American women with
the salaries of 10 corresponding professional White-American women. The women were
paired according to certain salient characteristics and the 10 pairs were chosen from ten
different professions. The salaries (in thousands) are shown in Table 10.22.
Table 10.22
Difference
p-
55
70
105 10
90 -10
The Minitab command tinterval 90 percent confidence data in c3 is used to produce the
following output. The differences are computed and put into column c3. The command
requests a 90%confidence interval on the mean difference.
MTB>letc3=cl-c2
MTB >print c l -c3
Data Display
Row Afamer white diff
1 65 60 5
2 50 55 -5
3 75 70 5
4 80 75 5
5 105 95 10
6 90 100 -10
7 65 70 -5
8 60 50 10
9 115 105 10
10 80 90 -10
MTB > tinterval90 percent confidencedata in c3
240 INFERENCES FOR TWO POPULATIONS [CHAP. 10
Variable N Mean St Dev SEMean 90.0%CI
diff 10 1.50 8.18 2.59 (-3.24,6.24)
Verify that
output.
= 1SO, s
d = 8.18, SJ = 2.59, and that the confidence interval is as given in the
Ans. Using the differencesgiven above, we find that E
l = 15, and 3 = 1.5. We also find that Zd2 =
w2-(a)2
/ n 625- 22.5
625, and s
d=,
/
-
' =/T=
8.18. The estimated standard error of is Sa =
n-1
s
d 8.18196
= 2.59.
---
The confidence interval is given by f t x SJ. The t value for a 90% confidence interval is
found as follows. The degrees of freedom is df = 10- 1= 9. For a 90%confidence interval 5%
is put into each tail of the t distribution.Using the t distributiontable with 9 degrees of freedom
and a right-hand tail area equal to .05,we find the t value to be 1.833.The 90%margin of error
is t x S;i = 1.833x 2.59 = 4.75. The 90%confidence interval goes from 1S O - 4.75 = -3.25 to
1.50+4.75 = 6.25.
TESTINGHYPOTHESIS ABOUT ~.la
USING NORMALLY DISTRIBUTED
DIFFERENCES COMPUTEDFROM DEPENDENT SAMPLES
10.12 Use the data in Table 10.22 to test the research hypothesis # 0 at level of significance
a =.01.
Ans. The degrees of freedom is one less than the number of pairs, i.e., df = 9. Since the research
hypothesis is two-tailed, tail areas equal to .005 are put into both tails of the t distribution. From
the t distribution table the critical value is determined by using a right-tail area equal to .005. The
critical value is equal to 3.250.
From problem 10.11, the followingare found: d = 1S O , SJ = 2.59, and D
o = 0. The computed test
statistic is t* = --
- --
- 0.58. Since this value falls between -3.250 and 3.250, the
null hypothesis is not rejected.
-
d - D o 150-0
SS 259
SAMPLINGDISTRIBUTION OF PI-;Pz FOR LARGE INDEPENDENT SAMPLES
10.13 Population 1 has pl = ,010
and population 2 has p2 = .005. Independent samples of size 5000
each are selected from both populations. Find the probability that PI- p2exceeds .015?
- -
Ans. The mean value of E, - p2 is .010- .005 = .005. The standard error of PI- pz is as follows:
Since nlpl> 5, n2p2> 5 , nlql > 5 and n2q2> 5, Fl - T2has a normal distribution.We are asked to
-
-
is used to
pI-ij2-(pI-p2) -
- pl-?j,-.005
find P(p, - F2 > .015). The transformation z =
w , - v 2 .001725
-
transform the statistic Fl - p2 to a standard normal variable. The same transformation on .015
CHAP. 101 INFERENCES FOR TWO POPULATIONS 241
.015 - ,005
.001725
yives the value = 5.80. We have the following: P(F, - F2> .015) = P(z > 5.80).
which is approximately0. It is highly unlikely that pl- F2exceeds .015.
ESTINIATION OF Pi -P2 USING LARGE SAMPLES
10.14 Three thousand commuters in both New York and Chicago were surveyed and the percentage
of commuters who took more than 60 minutes to get to work were determined for both
groups. It was found that 16.5% in New York and 10.7% in Chicago required more than 60
minutes. Set a 90%confidence interval on p1- p2 ,where p1 correspondsto New York.
-
Ans. The confidence interval is given by (jjl- p2) f z x sil-i2,
where sjil-B2
=
n2
The z value for 90%confidenceis 1.65.The differencein samplepercentages is 16.5%- 10.7% =
5.8%. The standard error is sil-B2= = 0.88%.The 90%margin of
error is 1.65x 0.88 = 1.5%.The 90%confidence interval extends from 5.8 - 1.5 = 4.3%to 5.8 +
1.5= 7.3%.
TESTING HYPOTHESIS ABOUT Pi -P2 USING LARGE INDEPENDENT
SAMPLES
10.15 A survey of 50 men and 50 women was conducted and it was found that 16of the men and 10
of the women used hotel room minibars. Test the hypothesis that pl = p
2 vs. pl # p2 at level of
significancea = .05.
-
XI +x2
n l + n2
Ans. The critical values are f 1.96. The pooled estimate of the common proportion is jT = --
16+10
50+50
-- - .26. The standard error of the difference in proportions is SF,-p2
i.26 x .74($+&) = .0877.The differencein sampleproportions is - 55, = .32 - .20= .16.
-
P1-pz .16
The computed test statistic is z* = --
- --
- 1.82. Since the computed test statistic does
s V P 2 .Of377
not exceed the critical value, we cannot conclude that there is a difference between men and
women users of hotel room minibars.
Supplementary Problems
SAMPLINGDISTRIBUTIONOF XI- X,FOR LARGEINDEPENDENTSAMPLES
10.16 A sample of size 100 is selected from a population having mean 75 and standard deviation 3. Another
independent sample of size 100 is selected from a population having mean 50 and standard deviation 4.
Verify that j i l - X2 has mean 25 and standard deviation0.5.
(a) What percent of the time will sT1 - 512 fall within 0.5 of 25?
(b) What percent of the time will T1 - 512 fall within 1.0of 25?
(c) What percent of the time will - 512 fall within 1.5 of 25?
242
Sample
1
2
INFERENCESFOR TWO POPULATIONS
Ans. (a)68% (b)95% (c) 99.7%
ESTIMATIONOF pi- ~2 USING LARGEINDEPENDENTSAMPLES
Sample size Mean Standard deviation
50 $9,500 $1,250
75 $9,125 $950
[CHAP. 10
Chatty toddlers
75
10.17
Quiet toddlers
80
The information shown in Table 10.23 was obtained from two independent samples selected from two
populations.
(a) Give a point estimate for p~
- p2.
(b) Find a 99% confidence interval for pI-p2.
Am. (a)$375 (b)375 k 2.58 x 208.05 or -$161.77 to $911.7
TESTING HYPOTHESIS ABOUT -~2 USING LARGE INDEPENDENTSAMPLES
10.18 Use the data given in problem 10.17 to test the hypothesis Ho: p
1- p
2 = 0 vs. Ha:pl - p2> 0 at level of
significance a= .OI.
(a) Give the computed test statistic.
(6) Give the p value.
(c) Give your conclusion.
375-0
208.05
Ans. (a) z* = -
-
- 1.80 (6) p value = .5 - .4641= .0359
(c) Do not reject H
o since p value > a.
SAMPLINGDISTRIBUTIONOF ?i,- x2FOR SMALLINDEPENDENTSAMPLESFROM
NORMAL POPULATIONSWITH EQUAL (BUTUNKNOWN)STANDARDDEVIATIONS
10.19 A psychological study compared the language skills and mental development of two groups of two-year-
olds. One group consisted of chatty toddlers and the other group consisted of quiet children. The scores
on a test which measured language skills are shown for the two groups in Table 10.24. Use Minitab to
determine whether it is reasonable to assume the populations of test scores are normally distributed and
also determine if it is reasonable to assume that two populations have equal standard deviations.
tble 10.24
70
70
65
85
85
80
90
90
60
60
75
65
70
90
90
75
85
90
75
80
CHAP. 101 INFERENCES FOR TWO POPULATIONS 243
Ans. The following normal probability plots were produced by Minitab. Beneath the graph, the mean,
standard deviation, and sample size are shown as well as a p value for the Anderson-Darlington
normality test. The p value corresponds to a null hypothesis, which states that the sample data were
selected from a normally distributed population. This hypothesis is rejected if the p value is less
than a = .05. Otherwise, normality is usually assumed. In this case, it is safe to assume that both
samples were obtained from normally distributed populations since the p value for the chatty
sample is 0.433 and the p value for the quiet sample is 0.368.
Normal Probability Plot
.999
.99
.95
.80
.50
ProbabiIity2o
.05
.01
,001
Average: 75.4545
StDev: 11.2815
N:11
, ...
. . . . . .
1 I 1 1
60 70 80 90
chatty
Anderson-Darling NormalityTest
A-Sauared:0.337
PValue: 0.433
Normal Probability Plot
. . . . . . . . . . . . . . . . . .
: ........................ " .... .;. ................................
;
.
_ ..
,
.001 -. . . . . . . . . . . . . . . . . . ........................ - .." ',"". . . . . . . . . . . . . ............-.. ~
I
I
70
Average: 79.5455
StDev: 8.50134
N: 11
80 90
quiet
Anderson-Darling Normality Test
A-Squared: 0.365
P Value: 0.368
244 INFERENCES FOR TWO POPULATIONS [CHAP. 10
The following Minitab output may be used to test for equal population standard deviations. The
null hypothesis of equal population standard deviations is rejected if the p value corresponding to
Bartlett’s test is less than a = .05. In this case, the p value equals 0.386, and is reasonable to
assume that o1= 02.
MTB > %vartest c2 cl
Homogeneity of Variance
Response score
Factors sample
ConfLvl 95.0000
Bartlett’sTest (normal distribution)
Test Statistic: 0.752
Pvalue : 0.386
ESTIMATION OF pi- ~2 USING SMALL INDEPENDENTSAMPLES FROM NORMAL
POPULATIONS WITH EQUAL (BUT UNKNOWN) STANDARD DEVIATIONS
10.20 Refer to the data in Table 10.24. Minitab was used to set a 90% confidence interval on pI- p2,The
output is shown below. Using the output, give a 90% confidence interval for pI- p2
MTB > twot 90%confidence, data in c2, sample number in c1;
SUBC > pooled.
Two Sample T-Test and Confidence Interval
Two sample T for score
sample N Mean St Dev SEMean
1 11 75.50 11.30 3.4
2 1 1 79.55 8.50 2.6
90%CI for mu (1) - mu (2):(-1 1.4, 3.3)
T-Test mu (1) = mu (2) (vs not =): T= -0.96 P=0.35 DF= 20
Both use Pooled St Dev = 9.99
Am. The 90% interval extends from -1 1.4 to 3.3.
TESTING HYPOTHESIS ABOUT pi- ~2 USING SMALL INDEPENDENT SAMPLES FROM
NORMAL POPULATIONS WITH EQUAL (BUT UNKNOWN) STANDARD DEVIATIONS
10.21 Refer to problems 10.19 and 10.20.Suppose the research hypothesis is that the language skills scores
differ for the two groups. Give the computed test statistic and the corresponding p value.
Ans. The computed test statistic is t* = -0.96 and the p value is 0.35.
SAMPLING DISTRIBUTION OF
NORMAL POPULATIONS WITH UNEQUAL (AND UNKNOWN) STANDARD DEVIATIONS
FOR SMALL INDEPENDENT SAMPLES FROM
10.22 Thirty individuals who suffered from insomnia were randomly divided into two groups of 15 each. One
group was put on an exercise program of 40 minutes per day for four days a week. The other group was
not put on the exercise program and served as the control group. After six weeks, the time taken to fall
asleep was measured for each individual in the study. The results are given in Table 10.25.
CHAP. 101 INFERENCES FOR TWO POPULATIONS
Table 10.25
Exercise group
15
17
15
15
16
16
17
14
14
15
15
17
13
14
18
Control group
19
22
25
30
31
15
16
19
19
30
27
28
22
13
32
245
The Minitab output for the test of equal standard deviations is shown below. Would you assume equal
or unequal population standard deviations?
MTB > %oartest c2 cl
Homogeneity of Variance
Response C2
Factors C1
ConfLvl 95.OOOO
Bartlett’sTest (normal distribution)
Test Statistic: 23.039
P value : 0.000
Ans. The Minitab procedure is used to test Ho: 01 = o2vs. Ha:olf 02, Since the p value = 0.0o0,the
null hypothesis should be rejected. Assume that the standard deviations are not equal for the two
groups.
ESTIMATIONOF 1.11 -1.12 USING SMALL INDEPENDENTSAMPLESFROM NORMAL
POPULATIONSWITH UNEQUAL(AND UNKNOWN)STANDARDDEVIATIONS
10.23 Refer to the data in Table 10.25. The Minitab analysis for setting a 99% confidence interval on p1
- p2
is shown below. This analysis assumes unequal population standard deviations.
MTB >twot 99% data in c2 groups in cl
Two SampleT-Test and ConfidenceInterval
Two sample T for C2
c
1 N Mean St Dev SEMean
1 15 15.40 1.40 0.36
2 15 23.20 6.27 1.60
99% C1for mu (1) -mu (2): (-12.69, -2.9)
T-Test mu (1) = mu (2) (vs not =): T = -4.70 P = 0.0003 DF = 15
246 INFERENCES FOR TWO POPULATIONS [CHAP. 10
Rather than finding the degrees of freedom by using df = minimum of { nl - 1, n2 - 1 ), Minitab uses a
different formula. If the degrees of freedom are found using df = minimum of ( 14, 14) = 14, the t value
is 2.977. The t value, using df = 15,is 2.947. The confidence interval will be approximately the same for
either value. Using df = 14, find a 90% confidence interval for pl-p2,
1.40' 6.27*
Ans. (15.40- 23.20) k 1.761 x -
i15 '
1
5or -7.8 k 2.9 or (-10.7, -4.9)
TESTING HYPOTHESIS ABOUT -J.L~USING SMALL INDEPENDENT SAMPLES FROM
NORMAL POPULATIONS WITH UNEQUAL (AND UNKNOWN) STANDARD DEVIATIONS
10.24 Refer to problems 10.22and 10.23. Is there a difference in the time to go to sleep for the two groups?
Ans. Yes, the computed test statistic is t* = -4.70, and the p value is 0.0003.
SAMPLING DISTRIBUTION OF a FOR NORMALLY DISTRIBUTED DIFFERENCES COMPUTED
FOR DEPENDENT SAMPLES
10.25 Table 10.26 gives the diastolic blood pressure before treatment and six weeks after treatment is started
for 10hypertensive patients. The statistic, t = -
-cLd , has a t distribution with df = n - 1, provided the
differences have a normal distribution. The basic normality assumption needs to be verified before
setting a confidence interval on or testing a hypothesis concerning k,
The normal probability plot in
Minitab can be used to test the following hypothesis: H
o
:The differences are normally distributed vs.
Ha: The differences are not normally distributed. If the level of significance is set at the conventional
level of significance a = .05, then the null hypothesis is rejected if the p value < a. Using the Minitab
output shown below and on the next page, what decision do you reach concerning the assumption of
normality for the differences?
-
S;i
Patient
1
2
3
4
5
6
7
8
9
10
Tablc
Before
90
100
95
85
99
105
90
97
99
110
10.26
After
80
85
83
75
85
85
80
79
85
90
Difference
10
15
12
10
14
20
10
18
14
20
AIZS.The p value for the Anderson-Darling Normality test is 0.233. The null hypothesis is not rejected,
and it is assumed that the differences are normally distributed.
CHAP. 101
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1
. . . . . . . . . . . . . . . . . . . . . . . .
I - .
,
I
INFERENCES FOR TWO POPULATIONS
Normal Probability Plot
247
.999
.99
.95
.80
Probability .50
.20
.05
.01
,001
1
%
. . . . . . . . . ~,
I
, s., . I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
<
. . . . . . . . . . . . . . . . p' . . . . . . . . . . . . . . . . . . . . . . . .
I 1
1
10
Average: 14.3
StDev: 3.94546
N: 10
I
15
difference
1
20
Anderson-DarlingNormalityTest
A-Squared:0.438
P Value: 0.233
ESTIMATION OF
FROM DEPENDENTSAMPLES
USING NORMALLYDISTRIBUTEDDIFFERENCES COMPUTED
10.26 Verify the output shown in the following Minitab output for a 90% confidence interval for the mean
difference in the blood pressures given in problem 10.25.
MTB > tinterval90 percent confidence, data in cl
ConfidenceIntervals
Variable N Mean St Dev SEMean 90.0 %CI
diff 10 14.30 3.95 1.25 (12.01, 16.59)
Ans. Cd = 143 cd2= 2185 d = 14.3 s d = 3.9455 S j = 1.2477 t = 1.833
-
d kt x Sa is found to be (12.01, 16.59).
TESTING HYPOTHESISABOUT
COMPUTEDFROM DEPENDENTSAMPLES
USING NORMALLYDISTRIBUTEDDIFFERENCES
10.27 Verify the computed test statistic in the below Minitab output for the differences in problem 10.25.
MTB > ttest mu = 0 data in cl
T-Test of the Mean
Test of mu = 0.00 vs mu not = 0.00
Variable N Mean St Dev SEMean T P
diff 10 14.30 3.95 1.25 11.46 0.0000
-
d-D, 14.30-0
Ans. t*= --
- = 11.44
S3 I.25
248 INFERENCES FOR TWO POPULATIONS [CHAP. 10
SAMPLINGDISTRIBUTIONOF Pi - P2 FOR LARGEINDEPENDENTSAMPLES
10.28 In a study of 100surgery patients, 50 were kept warm with blankets after surgery and 50 were kept cool.
Eight of the warm group developed wound infections and 14 of the cool group developed wound
infections. Let pi represent the proportion of all surgery patients kept warm after surgery who develop
wound infections and let p2 represent the proportion for the cool group. The sample proportions for the
two groups are P1= .16 and p2= .28.The difference in sample proportions, Pr- P2,will have a normal
distribution provided that nlpl > 5, n2p2> 5, nlql> 5, and n2q2> 5. Since the population proportions are
unknown, these conditions cannot be checked directly. In practice, the conditions are checked out by
substituting the sample proportions for the population proportions. Use the sample proportions to check
out the requirements for assuming normality.
Ans. nip, = 8, n2F2 = 16, nlq, = 42, and n2q, = 34 p1- p2has a normal distribution.
ESTIMATION OF PI -P2 USING LARGESAMPLES
10.29 Refer to problem 10.28.Find a 90%confidence interval for p1- p2.
-
- X;Tr ; Pzx% .
Ans. The 90% confidence interval is (PI- p2) k z x SFIUF2,
where SFl-Fz
=
nl n2
(.16 - .28)+_ 1.65x .0820 or -.I2 +_ .14 or (-.26 ,.02)
TESTING HYPOTHESISABOUT PI-P2 USING LARGEINDEPENDENTSAMPLES
10.30 Refer to problem 10.28.Test H
o
:p1 - pz= 0 vs. Ha:p1 - p2< 0 at a = .05.
Ans. The pooled estimate of the common proportion is jT = = 0.22. The standard error is
n l + n2
-
P,-F,
SF,-B2
STilPF2
= ,/- = .0828. The computed test statistic is z* = -= -1.45. The
critical value is -1.65. Do not reject the null hypothesis.
Chapter 11
0.15 -
0.10 -
0.05 -
0.007
Chi-square Procedures
df Mean
5 5
10 10
15 15
CHI-SQUAREDISTRIBUTION
Mode Standard deviation
3 3.16
8 4.47
13 5.48
The Chi-square procedures discussed in this chapter utilize a distribution called the Chi-square
distribution. The symbol x2 is often used rather than the term Chi-square. The Greek letter x is
pronounced Chi. Like the t distribution, the shape of the x2distribution curve is determined by the
degrees of freedom (df) associated with the distribution.Figure 11-1 shows x2distributions for 5, 10,
and 15 degrees of freedom.
I I I I
= 15
0 10 20 30
Fig. 11-1
Table 11.1 gives some of the basic propertiesof x2distributioncurves.
Table 11.1
Properties of the x2Distribution
1. The total area under a x2curve is equal to one.
2. A x2 curve starts at 0 on the horizontal axis and extends indefinitely to the right,
3. A x2curve is always skewed to the right.
4. As the number of degrees of freedom becomes larger, the x2 curves look more and
5. The mean of a x2distribution is df and the variance is 2df.
6. When the degrees of freedom is 3 or more, the peak of the x2 curve occurs at df -2.
approaching, but never touching the horizontal axis.
more like normal curves.
This value is the mode of the distribution.
EXAMPLE 11.1 Table 11.2 gives the mean, mode, and standard deviation for each of the three x2 curves
shown in Fig. 11-1. The mean, mode, and standard deviation are determined by using properties 5 and 6 from
Table 11.1.
249
250 CHI-SQUARE PROCEDURES
df
[CHAP. 11
.995 .990 .975 .950 .900 .lOO .050 .025 .010 .005
CHI-SQUARETABLES
5 0.412
The area in the right tail under the x2distribution curve for various degrees of freedom is given
in the Chi-square tables found in Appendix 4. Example 11.2illustrates how to read this table.
0.554 0.831 1.145 1.610 9.236 11.070 12.833 15.086 , 16.750
EXAMPLE 11.2 Table 11.3 contains the row corresponding to df = 5 from the Chi-square distribution table.
Figures 11-2 and 11-3 give Chi-square curves having df = 5. The shaded area to the right of 11.070 in Fig 11-2
is .050. The shaded area to the right of 1.610 in Fig 11-3 is equal to .900.These areas and Chi-square values are
shown in bold print in Table 11.3.
Table 11.3
I I Area in the right tail under the Chi-square distribution curve I
Fig. 11-2 Fig. 11-3
GOODNESS-OF-FITTEST
In many situations, each element of a population is assigned to one and only one of k categories
or classes. Such a population is described by a multinomial probability distribution. Example 1 1.3
describes such a population and illustrates the structure of the null and alternative hypotheses for a
goodness-of-fit test.
EXAMPLE 11.3 Consider the population of Americans who’ve dieted. A survey reported that 85% were most
likely to go off their diet on the weekend, 10%were most likely to go off on a weekday, and 5% didn’t know.
This population is divided into three categories: category 1: most likely to go off their diet on the weekend,
category 2: most likely to go off their diet on a weekday, and category 3: did not know when they were most
likely to go off their diet. The multinomial probability distribution is: p1= 3 5 , p2= .10, and p3= .05. The Delta
Health fitness club is interested in whether their members follow this same multinomial probability distribution.
A goodness-of-fit test is used to test the null hypothesis: Ho:p1= .85, p2= .10, and p3= .05 vs. the following
alternative hypothesis: Ha:The population proportions are not p1= 3 5 , p2= .10,and p3= .05. The probabilities
pl, p2, and p3 in the hypotheses statements represent the proportions for the categories as applied to the health
fitness club members. The next two sections will describe the steps for performing a goodness-of-fit test.
EXAMPLE 11.4 Table 11.4 gives the age distribution of part-time college students as determined five years
ago. If p1, p2,p3,p4,and p5represent the current percentages for the five groups, then the null hypothesis that the
current distribution is the same as five years ago is stated as follows: Ho:p1= .25, p2= .35, p3 = .25, p4= .10,
CHAP. 111 CHI-SQUAREPROCEDURES 251
Age 18-24 25-34 35-44 45-54
Percent 25% 35% 25% 10%
and p5 = .05. The research hypothesis is stated as: H,: The current proportions are not as stated in the null
hypothesis. A goodness-of-fit test is used to test this hypothesis system.
55 or over
5%
OBSERVEDAND EXPECTED FREQUENCIES
The first step in performing a goodness-of-fittest is the selection of a sample of size n from the
population and the determination of the observedfrequencies for k classes. Recall that in testing
hypothesis, the null hypothesis is assumed to be true, and is rejected if a highly unlikely value is
obtained for the test statistic.The expected frequencies are computed assuming the null hypothesis to
be true.
EXAMPLE 11.5 In Example 11.3, 200 members of Delta Health fitness club were surveyed and it was found
that 160 were most likely to go off the diet on the weekend, 22 were most likely to go off the diet on a weekday,
and 18 did not know. If the null hypothesis is true and the club members follow the nationwide distribution, then
the expected numbers in the three categories are: npl = 200 x .85 = 170, np2 = 200 x .10 = 20, and np3 =
200 x .05 = 10. The observed frequencies are 160,22, and 18 and the expected frequencies are 170,20, and 10.
EXAMPLE 11.6 In Example 11.4, 1500 part-time students were surveyed across the country. It was observed
that 352 were in the age group 18-24, 501 were in the age group 25-34, 371 were in the age group 35-44, 126
were in the age group 45-54, and the remainder were in the age group 55 or over. If the null hypothesis is true
and the distribution is the same as it was five years ago, the expected numbers in the five categories are as
follows: npl = 1500x .25 = 375, np2= 1500x .35 = 525, np3= 1500 x 2 5 = 375, np4= 1500 x .10 = 150, and
np5= 1500 x .05 = 75. The observed frequencies are 352,501, 371, 126, and 150 and the expected frequencies
are 375,525,375, 150,and 75.
SAMPLINGDISTRIBUTIONOF THE GOODNESS-OF-FITTEST STATISTIC
The goodness-of-fit test statistic is given in formula (22.2), where o represents an observed
frequency and e representsan expected frequency, and the sum is over all k categories.
(22.2)
The test statistic given in formula (21.2) has a Chi-square distribution with df = k - 1, provided
all the expected frequencies are 5 or more. Some statisticians use a less restrictive requirement,
namely, that all expected frequencies are at least one and that at most 20% of the expected
frequencies are less than 5. We shall use the requirement that all expected frequenciesare 5 or more.
This requirement means that a minimum sample size is needed to use this procedure. If the observed
and expected frequencies are close, then the computed value for x2will be close to zero since the
differences (0-e) will all be near zero. If the observed and expected values differ considerably,then
the computed value of x2will be large, supportingthe research hypothesis. Since only large values of
x2indicate that the null hypothesis should be rejected and the research hypothesis supported, this is
always a one-tailed test. That is, the null hypothesis is rejected only for large values of the computed
test statistic.Table 11.5 summarizesthe procedure for performing a goodness-of-fit test.
252 CHI-SQUARE PROCEDURES [CHAP. 11
Category 0 e o - e
1 160 I70 -10
2 22 20 2
3 18 10 8
Sum 200 200 0
Table 11.5
Steps for Performing a Goodness-of-Fit Test
Step 1: State the null and research hypothesis concerning the hypothesized distribution for the k categories.
Step 2: Use the x2 table and the level of significance, a,to determine the rejection region.
Step 3: Compute the value of the test statistic as follows: x2= c- ,where o represents the observed
frequencies and e represents the expected frequencies. Check to make sure that all expected frequencies are
5 or more.
Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls
in the rejection region. Otherwise, the null hypothesis is not rejected.
(0-e)*
e
(0
-e>'
100 .588
4 .200
64 6.4
7.188
(0 - e)2 e
EXAMPLE 11.7 Refer to Examples 11.3and 11.5.The null hypothesis may be stated in either of two ways:
Ho: The distribution of categories for Delta Health fitness club members is the same as the national distribution
or Ho: pi = .85, p2= .10,and p3 = .05. The research hypothesis may also be stated in either of two ways: Ha:The
distribution of categories for Delta Health fitness club members is not the same as the national distribution or
Ha:The population proportions are not pi = 3 5 , p2 = .10,and p3= .05.Table 11.6illustrates the computation of
the test statistic. The computed value of the test statistic is x2* = 7.188.
df
2
.995 ,990 .975 .950 .900 .100 .OS0 .025 .010 .005
0.010 0.020 0.051 0.103 0.211 4.605 5.991 7.378 9.210 10.597
For level of significance a = .05, the critical value is 5.991.The row corresponding to df = 3 - 1 = 2 from
the Chi-square table in Appendix 4 is shown in Table 11.7.
Table 11.7
r - I Area in the right tail under the Chi-square distribution curve I
Since the computed test statistic exceeds 5.991,the null hypothesis is rejected. There are almost twice as many
in the Don't Know category at Delta Health fitness club as would be expected if the distributions were the same.
EXAMPLE 11.8 Refer to Examples 1 1.4 and 11.6.The null hypothesis is b:
pi = .25, p2 = .35, p3 = .25, p4=
.10,and p5= -05.The research hypothesis is stated as: Ha: The current proportions are not as stated in the null
hypothesis. The observed and expected frequencies are given in Example 11.6. Table 11.8 illustrates the
computation of the goodness-of-fit test statistic. The computed value of the test statistic is x2* = 81.391.
Table 11.8
CHAP. 111 CHI-SQUARE PROCEDURES 253
df
4
The row corresponding to df = 5 - I = 4 from the Chi-square table in Appendix 4 is shown in Table 11.9.For
level of significance a = .01, the critical value is 13.277and is shown in bold type.
.995 .990 .975 .950 .900 .loo .050 .025 .010 .005
0.207 0.297 0.484 0.711 1.064 7,779 9.488 11.143 13.277 14.860
Table 11.9
I I Area in the right tail under the Chi-sauare distribution curve 1
Male
Female
Column total
Supports Opposes Undecided Row total
70 20 10 loo
70 20 10 I 00
140 40 20 200
Since the computed test statistic exceeds 13.277,the null hypothesis is rejected. There appears to have been an
increase in the 55 or over group from 5 years ago.
EXAMPLE 11.9 The Minitab solutions to Examples 11.7 and 11.8are shown in Figures 11-4 and 11-5.
MTB > set the observed values in cl
DATA> 160 22 18
DATA > end
MTB > set the expected values in c2
DATA> 170 20 10
DATA > end
MTB > let kl = sum((c1 - c2)**2/c2)
MTB > print kl
kl 7.18824
MTB >cdf kl k2;
SUBC > Chisquare 2.
MTB >let k3 = 1-k2
MTB >print k3
k3 0.0274849
Fig. 11-4
MTB > set the observed values in cl
DATA>352 501 371 126 150
DATA >end
MTB > set the expected values in c2
DATA > 375 525 375 150 75
DATA > end
MTB > let kl = sum((c1- c2)**2/k2)
MTB > print kl
kl 81.3905
MTB > cdf kl k2;
SUBC > Chisquare 4.
MTB >let k3 = 1-k2
MTB > print k3
Ik3 o
Fig. 11-5
The upper part of both figures illustrates the computation of the test statistic for Examples 11.7 and 11.8. The
computed values, 7.18824 and 81.3905, are the same as shown in Examples 11.7 and 1I .8. The portion of the
output shown in bold illustrates the computation of the p value for the two Examples. The p value is shown next
to k3. The p value for Example 11.7 is 0.027 and the p value for Example 1 1.8 is 0.
CHI-SQUAREINDEPENDENCETEST
Consider a survey of 100 males and 100 females concerning their opinion toward capital
punishment. Tables I I. 10 and 11.11 gives two different sets of results. In Table 11.10, the
distribution of opinions is exactly the same for males and females. In this case, we say that the
opinion concerning capital punishment is independent of the sex of the respondent.
Table 11.10
254
Male
Female
Column total
CHI-SQUAREPROCEDURES
supports Opposes Undecided Row total
70 20 10 100
40 50 10 100
110 70 20 200
[CHAP. 11
I
supports Opposes Undecided Row total
Male 80 15 5 100
Female 70 25 5 100
Column total 150 40 10 200
In Table 11.11,the distributionof opinions is clearly different for males and females. In this case, we
say that the opinion concerningcapital punishment is dependent on the sex of the respondent.
Supports Opposes
l00x 150 100x40
200 200
200 200
Male -=75 -=20
Female -=75
lOOx 150 -=20
100x40
Cnliimn tntal 150 40
Table 11.11
Undecided Row total
lO0x 10 - 100
--
200
200
- = 5
lO0x10 100
10 200
Tables I I .10and 1I. 11 are called contingency tables. In a Chi-square independence test, we are
interested in using results such as those shown in these tables, to test for the independence of two
characteristics on the elements of a population. In the above discussion, the two characteristics are
sex of the individual and opinion of the individual concerning capital punishment. The conclusions
regarding independence would be clear in the two tables given above. But suppose the results of the
survey are not as clear cut as above. How do we decide from the results given in a contingency table
if two characteristics are independent? Suppose the results of the survey were as shown in Table
11.12.
In a Chi-square test of independence, the null hypothesis is that the two characteristics are
independent. The obsewedfrequencies are shown in Table I 1.12.A table of expectedfrequencies is
also determined assuming the null hypothesis to be true. In Table 11.12,note that 150 out of 200, or
75% of those surveyed, supported capital punishment. If the two characteristics are independent, we
would expect 75% of the 100males and 75%of the 100females to support capital punishment. From
this discussion, note that if ell represents the expected frequency in the first row, first column cell,
then
(row 1 total)x (column 1 total)
= 75 =
100x150
ell= 100x .75 =
200 sample size
In general, the expected frequency in row i columnj is given by formula (IZ.2):
(row i total)x (columnj total)
samplesize
eij =
Table 11.13 showsthe computation of the expected frequencies for Table 1I.12.
( I 1.2)
CHAP. 111 CHI-SQUARE PROCEDURES 255
Male
Female
Column total
EXAMPLE 11.10 The table of expected frequencies for Table 11.10is given in Table 1I. 14.Notice that when
the contingency table indicates independence, the observed and expected frequencies are exactly the same.
supports Opposes Undecided Row total
70 -=20 --
100x20 - *o 100
--
100x40- 20 -
lOOx 20 = 10 100
100x40
200 200 200
200 200 200
-=
100x140 7o
-=
140 40 20 200
Table 11.14
Male
Female
Column total
Supports Opposes Undecided Row total
-- - 5 5 -=35 --
lOOx 110
200 200 200
200 200 200
100x70 100x20 - 1o 100
lOOx 110 - 55 100x20 100
-- 1oox70 - 35 -
= 10
110 70 20 200
EXAMPLE 11.11 The table of expected frequencies for Table 11.1 1 is given in Table 1 1.15. Notice that when
the contingency table indicates strong dependence, that the observed and expected frequencies are very
different.
Male
Female
Column total
Table 11.15
Supports Opposes Undecided Row total
80 (75) 15 (20) 5 (5) 100
70 (75) 25 (20) 5 (5) 100
150 40 10 200
How different do the observed frequencies and those you would expect when the characteristics are
independent need to be before you would reject independence? The test statistic given in the next
section will answer this question.
SAMPLING DISTRIBUTION OF THE TEST STATISTICFOR THE
CHI-SQUARE INDEPENDENCETEST
Table 11.16 gives the observed frequencies and the expected frequencies in parenthesis for the
data on Table 11.12. The null hypothesis is: Ho: Opinion concerning capital punishment is
independent of the sex of the individual. The research hypothesis is: Ha:Opinion concerning capital
punishment differs for males and females. The test statistic for testing this hypothesis is given by
formula (1Z.3), where the sum is over all 6 cells. If there are r rows and c columns, there will be r x c
cells in the contingency table. The test statistic has a Chi-square distribution with (r - I) x (c - 1)
degrees of freedom.
Table 11.16
The computed value of the test statistic is found as follows:
( I 1.3)
(0-e)* (80-75)2 (15-20)2 (5-5)* (70-75)2 (25-20)2 (5-5)2
+ +- + + + -= 3.167
p = E-- -
e 75 20 5 75 20 5
256 CHI-SQUARE PROCEDURES [CHAP. 11
The degrees of freedom is equal to df = (r - 1) x (c - 1) = (2 - 1) x (3 - 1) = 2. The critical value
from the Chi-square distribution table is 5.991 for a = .OS. Since the computed test statisticdoes not
exceed the critical value, the null hypothesis is not rejected. We cannot reject that the characteristics
are independent. A Minitab analysis of the same data is shown below. The data are read first. The
command Chisquare cl - c3 produces the expected frequencies, the computed value of the test
statistic, and the p value. The p value is equal to 0.205.
MTB > read cl -c3
DATA>80 15 5
DATA>70 25 5
DATA > end
2 rows read.
MTB > Chisquarecl -c3
Expected counts are printed below observed counts
C1 C2 C3 Total
1 80 15 5 100
75.00 20.00 5.00
2 70 25 5 100
75.00 20.00 5.00
Total 150 40 10 200
Chi-Sq = 0.333 + 1.250 +O.OO0 +0.333 + 1.250 +O.OO0 = 3.167
DF = 2, P value = 0.205
EXAMPLE 11.12 The Minitab output for the data in Tables 11.10 and 11.11 are shown in Figures 11-6 and
11-7.
MTB > read cl -c3
DATA>70 20 10
DATA>70 20 10
DATA >end
2 rows read.
MTB >Chisquarecl -c3
Expected counts are printed below observed counts
c1 C2 C3 Total
1 70 20 10 100
70.00 20.00 10.00
2 70 20 10 100
70.00 20.00 10.00
Total 140 40 20 200
Chi-Sq = O.OO0 +O.OO0 +O.OO0 +O.OO0 +0.0o0 +
o.oO0= o.Oo0
DF = 2. P value = 1.OOO
MTB > read cl -c3
DATA>70 20 10
DATA>40 50 10
DATA >end
2 rows read.
MTB >Chisquare cI -c3
Expected countsare printed below observed counts
c1 C2 C3 Total
1 70 20 10 100
55.00 35.00 10.00
2 40 50 10 100
55.00 35.00 10.00
Total 110 70 20 200
Chi-Sq = 4.091 +6.429 +O.OO0 +4.091 + 6.429 +
DF = 2, P value = O.OO0
0.O00 = 21.039
Fig. 11-6 Fig. 11-7
CHAP. 111 CHI-SQUARE PROCEDURES 257
SAMPLINGDISTRIBUTION OFTHE SAMPLE VARIANCE
The population and sample variance was defined and illustrated in Chapter 3. The population
variance is given by formula (ZZ.4):
The sample variance is given by
(11.4)
(11.5)
The concept of the samplingdistribution of the sample variance is established in Example 11.13.
EXAMPLE 11.13 Consider the small finite population consisting of the times required for six individuals to
open a “Child proof’ aspirin bottle. The required times (in seconds) are shown in Table 11.17. The population
mean is p= 30 seconds. The population variance is found as follows:
(x-p)2 400+ 100+O+ O+ 100+400
= 166.67
-
d = -
N 6
The standard deviation is G= 4166.67 = 12.9.
Table 11.17
Individual
Individual 1
Required time
10
20
30
30
40
50
There are 20 different samples of size 3 possible and they are listed along with the sample variance and
sampling error for each sample in Table 11.18.
Sample number
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
Tab
Sample
A, B, C = 10,20,30
A, B, D = 10,20,30
A, B, E = 10,20,40
A, B, F = 10,20,50
A, C, D = 10,30,30
A, C, E = 10,30,40
A, C, F = 10,30,50
A, D, E = 10,30,40
A, D, F = 10,30,50
A, E, F = 10,40,50
B, C, D = 20,30,30
B,C, E =20,30,40
B, C, F =20,30,50
B, D, E =20,30,40
B,D, F = 20,30,50
B, E, F =20,40,50
C, D,E =30,30,40
C, D,F = 30,30,50
C, E, F = 30,40,50
D.E. F = 30.40.50
! 11.18
Sample variance, s2
100.00
100.00
233.48
432.64
133.40
233.48
400.00
233.48
400.00
432.64
33.29
100.00
233.48
100.00
233.48
233.48
33.29
133.40
100.00
100.00
Sampling error, I s2 -dI
66.67
66.67
66.81
265.97
33.27
66.81
233.33
66.81
233.33
265.97
133.38
66.67
66.81
66.67
66.81
66.81
133.38
33.27
66.67
66.67
258 CHI-SQUARE PROCEDURES [CHAP. 11
33.29 100.0 133.40 233.48 400.00 432.64
' s2
~ l l l l _ _ l . . l -
P
(
S ) .1 .3 .1 .3 .1 .1
-
-
7
-
-
-
.
-
---
The distribution of the sample variance is obtained from Table 11.18and is given in Table 1 1.19.
249.990
250.010
250.005
250.000
249.980 250.OOO
250.000 249.999
250.015 250.001
250.010 250.000
The above procedure may be used to find the sampling distribution of the sample variance when
sampling from a finite population. However, it is clear that it is a tedious procedure even when using
a computer. When sampling from an infinite population, the result given in Table 11.20 is used to
determine the sampling distribution for a function of the sample variance, namely, . The
proof of this result is beyond the scope of this text. In the next section, this result is utilized to set a
confidence interval on o2as well as test hypotheses about 02.
(n - 1)s2
2
df
14
-
Table 11.20
Area in the right tail under the Chi-square distribution curve
.995 .990 .975 .950 .900 .100 .050 .025 .O10 .005
4.075 4.660 5.629 6.571 7.790 21.064 23.685 26.119 29.141 31.319
I
(n - 1)s2
Sampling Distribution of ,
When a simple random sample of size n is selected from a normally distributed
has a Chi-square distribution
population having population variance, c?,
with (n - 1) degrees of freedom, where S2is the sample variance.
(n - 1)s2
2
INFERENCES CONCERNING THE POPULATION VARIANCE
The result given in Table 11.20 will be used to find confidence intervals and test hypotheses
about a population variance or standard deviation.
EXAMPLE 11.14 It is important that drug manufacturers control the variation of the dosage in their products.
A drug company produces 250 milligram tablets of the antibiotic amoxicilliri. A sample of size 15 is selected
from the production process and the level ofamoxicillin is determined for each tablet. The results are shown in
Table 11.21.
Table 11.21
I 249.995 I 249.985 I 250.000 I
The mean of the sample is X=250.000mg and the sample variance is s2= .00008538mg2. According to Table
(n-1)s' 14s'
--
- has a Chi-square distribution with df = 14. Table 11.22 gives the row corresponding to
I 1.20,
df = 14 from the Chi-square table in Appendix 4.
2 d
Table 11.22
CHAP. 111 CHI-SQUAREPROCEDURES 259
1452
is between 5.629 and 26.119 with
14S2
has a Chi-square distribution with df = 14, we know that -
Since -
probability .95. This is true because according to Table 11.22,there is 97.5%of the area under the curve to the
right of 5.629 and 2.5%of the area under the curve to the right of 26.1 19, and therefore 95%of the area under
the curve is between 5.629 and 26.119.This may be expressed as follows:
02
d
14S2
P(5.629 <-<26.119) = .95
o2
The following notation is often used for the tabled Chi-square v a l u e s : ~ ~ ~ ~
= 5.629 and
probability statement may be expressed as follows:
= 26.119. The
14S2 1452 1452
Now, the if inequality < -< Xi25is solved for $,we obtain 7
< d< 7as a 95%confidence
interval for o2. The numerical confidence interval is obtained by replacing S2by .oooO8538mg2and using the
values obtained from the Chi-square table. The lower confidence limit is:
02 x.025 x.975
--
14S2 14 x .00008538 =.ooo04576
2 26.119
x.025
-
The upper confidence interval is:
= .0002124
14S2 14 x .00008538
X ,975
-- -
5.629
2
The 95% confidence interval for goes from .00004576 to .0002124. The 95% confidence interval for o is
obtained by taking the square root of the limits for 02.
The lower limit is = .006765 and the upper
limit is Jm
= .014574. The 95%confidence interval for the standard deviation (to three decimal places)
is (.007,.015).
The general form for a (1 - a)x 100% confidence interval for the population variance is given
by formula (11.6),where the values of x2are based on the Chi-squaredistribution with df = n - 1.
(11.6)
The (1 - a)x 100%confidence interval for the population standard deviation is given by formula
(11.7).
Both of the confidence intervals assume that the sample was obtained from a normally distributed
population.
260
df
29
CHI-SQUARE PROCEDURES
.995 .990 .975 .950 .900 .100 .050 .025 .010 .005
13.121 14.256 16.047 17.708 19.768 39.087 42.557 45.722 49,588 52.336
[CHAP. 11
EXAMPLE 11.15 A random sample of the lengths of bolts produced by Fastners, Inc. was taken. The sample
results were as follows: n = 30, X = 5.000cm, and s = 0.055 cm. A 99% confidence interval for D is determined
as follows: For a 99% confidence interval, I - a = .99, and therefore a = .01, or - = .005 and I - - = .995.
The degrees of freedom is df = n - 1 = 30 - 1 = 29. Table 11.23 gives that portion of the Chi-square table
needed to determine the values for use in formula (IZ.7). The following values are shown in bold print in the
table:X9s = 13.121 and xLs= 52.336. The sample variance is s2 = (0.055)2= ,003025. Substituting into
formula ( I /.7) we obtain the following 99% confidence interval for the population standard deviation.
a a
2 2
The population standard deviation of bolt lengths is between .041 cm and .082 cm with 99% confidence.
Table 11.23
r - I Area in the right tail under the Chi-square distribution curve - 1
(n - I)S*
2
, given in Table 11.20, is also utilized to test hypotheses
concerning the population variance or standard deviation. The steps for testing a population variance
are given in Table 11.24.
The sampling distribution of
Table 11.24
Steps for Testing a Hypothesis Concerning 02:
Sampling from a Normal Population
Step I: State the null and research hypothesis. The null hypothesis is represented symbolically by b:
o2= 0
:
and the research hypothesis is of the form Ha: o2# 0
: or Ha: d < 0;or Ha: d > 0;.
Step 2: For level of significance a,the critical values for Ha: o2f 0; are x:-u/z and xilz,the critical value
for Ha:o2< 0;is x;-~,and the critical value for Ha: d > 0;is Xa.
Step 3: Compute the value of the test statistic as follows: x2* = (n-1)S2 , whereoi is given in the null
hypothesis and S2is computed from your sample.
2
0;
Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls in
the rejection region. Otherwise, the null hypothesis is not rejected.
EXAMPLE 11.16 The ratio of potassium to sodium in an individual's diet is sometimes referred to as the K
factor. A sample of 15 Yanomamo Indians from Brazil was obtained and their K factors were determined. The
standard deviation of the sample of K factor values was found to equal 0.15. These sample results were used to
test the hypothesis Ho:0= 0.25 vs. Ha:0f 0.25 at level of significance a = .05.The critical values are shown in
bold in Table 1 1.25.From this table, we have &s = 5.629 and Xi25
= 26.I 19.The computed value of the test
statistic is:
(n-l)S2 14 x .1s2- 14 x .0225
x 2 * = - - = 5.04
0; .252 .0625
-
Since the computed value of the test statistic is less than 5.629, the null hypothesis is rejected and it is concluded
that the standard deviation for Yanomamo Indians is less than 0.25.
CHAP. 111
Area in the right tail under the Chi-square distribution curve I
Df ,995 .990 .975 .950 .900 .100 .050 .025 .O10 .005
14 4.075 4.660 5,629 6.571 7.790 21.064 23.685 26.119 29.141 31.319
CHI-SQUARE PROCEDURES
Area in the right tail under the Chi-square distribution curve
L I
Df ,995 .990 .975 .950 .900 .100 .050 .025 .010 .005
5 0.412 0.554 0.831 1.145 1.610 9.236 11.070 12.833 15.086 16.750
26 1
Df
10
Area in the right tail under the Chi-square distribution curve
.995 .990 .975 .950 .900 .loo .050 .025 .010 .005
2.156 2.558 3.247 3.940 4.865 15.987 18.307 20.483 23.209 25.188
Solved Problems
CHI-SQUARE DISTRIBUTION
11.1 What happens to the peak of the Chi-square curve as the degrees of freedom is increased?
Ans. The peak of the Chi-square curve shifts to the right as the degrees of freedom is increased.
11.2 The area under the Chi-square curve corresponding to x2 values greater than 118.498 is .10.
Find P(0 < x2< 118.498).
Ans. Since the total area under the Chi-square curve is equal to 1, P(0 < x2 < 118.498) is equal to
1 - P(x2> 118.498)= I - .I0= .90.
CHI-SQUARE TABLES
11.3 Find the value of x2for 5 degrees of freedom and
(a) .005 area in the right tail of the Chi-square distribution curve.
(6) .005 area in the left tail of the Chi-square distribution curve.
Ans. Table 11.26 gives the row corresponding to df = 5 from the Chi-square table. (a) The table
indicates thatthe area to the right of 16.750is .005. (b)The table indicates that the area to the
right of 0.412 is .995.Therefore, the area to the left of 0.412 is 1 - .995= .005.
Ans. According to Table 11.27,the area to the right of 23.209 is ,010 and the area to the right of 4.865
is .900. In Fig. 11-8,the shaded area equals .010 and corresponds to x2 values exceeding 23.209.
The shaded area in Fig. 11-9 equals .900 and corresponds to x2 exceeding 4.865. The area
corresponding to x2between 4.865 and 23.209 is the difference in the two shaded areas, or .900-
.010= .890.
262
Fig. 11-8
CHI-SQUARE PROCEDURES [CHAP. 11
Fig. 11-9
GOODNESS-OF-FITTEST
11.5 A magazine reported that 75% of the population oppose same sex marriages, 20% approve, and
5% are undecided. A survey is conducted to test the research hypothesis that the distribution is
different from that reported by the magazine. State the null and alternative hypotheses in terms
of p1 ,p2 ,and p3 where pl = the population proportion opposed to same sex marriage, p2 = the
population proportion who approve of same sex marriage, and p3 = the proportion who are
undecided.
Ans. The null hypothesis is Ho: The population proportions are p1 = .75, p2 = .20, and p3 = .05. The
research hypothesis is Ha: The population proportions are not p1= .75,p2= .20, and p3= .05.
11.6 The fairness of a die may be tested as a goodness-of-fit test. Let pi represent the proportion of
the time that face i turns up when the die is tossed, where i is 1, 2, 3, 4, 5, or 6. If the null
hypothesis is that the die is fair and the research hypothesis is that the die is unfair, state Ho
and Hain terms of the pi .
Ans. The null hypothesis is Ho: p1 = p2 = p3 = p4 =ps = p6 = i.The research hypothesis is Ha:Not all pi
equal +.
OBSERVEDAND EXPECTEDFREQUENCIES
11.7 Refer to problem 11.5. A poll was conducted and 585 opposed same sex marriage, 195
approved, 120 were undecided. What results would you expect if the null hypothesis were
true?
Ans. The sample size is n = 900. The expected frequencies are: el = npl = 900 x .75 = 675, e2 = np2=
900 x .20= 180,and e3= np3= 900 x .05 = 45.
11.8 Refer to problem 11.6. The die in question was rolled and face 1 turned up 89 times, face 2
turned up 93 times, face 3 turned up 103times, face 4 turned up 111 times, face 5 turned up
100 times, and face 6 turned up 104 times. What results would you expect if the null
hypothesis is true and the die is fair?
Ans. The sample size is n = 600. If the null hypothesis is true, then each face should occur 100times in
600 rolls.
CHAP. 113 CHI-SQUARE PROCEDURES
Category 0 e
Oppose 585 675
Support 195 180
Undecided 120 45
Sum 900 900
263
( 0-e12
o - e (0- el2 e
-90 8100 12.000
15 225 1.250
75 5625 125.000
0 138.250
SAMPLING DISTRIBUTION OF THE GOODNESS-OF-FITTEST STATISTIC
Df
5
11.9 Refer to problems I 1.5 and 1I .7. Test the hypothesis at a = .01.
Area in the right tail under the Chi-square distribution curve
.995 .990 .975 .950 .900 .100 .OS0 .025 .O10 .005
0.412 0.554 0.831 1.145 1.610 9.236 11.070 12.833 15.086 16.750
Ans. The number of categories is k = 3, and the degrees of freedom is df = k - 1 = 2. By referring to the
Chi-square distribution table in Appendix 4, we find that the critical value for a 1% level of
significance is equal to 9.210. The null hypothesis will be rejected if the computed value of the test
statistic exceeds this critical value.
Turned up face
1
2
3
4
5
6
sum
The computation of the test statistic is shown in Table 11.28. Since the computed test statistic is
138.250, the null hypothesis is rejected. There is a much higher number in the undecided group
than would be expected if the null hypothesis were true.
(0 -eI2
0 e o - e (0-e)2 e
89 100 -1 1 121 1.21
93 100 -7 49 0.49
103 100 3 9 0.09
111 100 11 121 1.21
100 100 0 0 0
104 100 4 16 0.16
600 600 0 3.16
11.10 Refer to problems 11.6and 11.8.Test the hypothesis at a = .05.
Ans. The number of categories is k = 6, and the degrees of freedom is df = k - 1 = 5. Table 11.29 gives
the row corresponding to df = 5 from the Chi-square distribution table in Appendix 4. The critical
value is 11.070, as shown in bold print.
Table 1
1
.
2
9
The computation of the test statistic is shown in Table 11.30. The computed value of the test
statistic is 3.16. Based on this experiment, there is no reason to doubt the fairness of the die.
CHI-SQUARE INDEPENDENCETEST
11.11 A study involving several cities from across the country involving crime was conducted. The
cities were divided into the categories South, Northeast, North Central, and West. Based on
interviews, crime was classified as a major concern, a minor concern, or of no concern for
each individual interviewed. The results are shown in Table 11.31. If the level of concern
264 CHI-SQUAREPROCEDURES [CHAP. 11
. Concern South Northeast North Central West
Major 25 40 35 40
Minor 65 45 40 40
No concern 15 10 15 20
regardingcrime is independent of the section of the country, find the expected frequencies for
the 12cells.
. Concern South Northeast North Central West Row total
I
140
190
60x 100 60
--
l4OX lo5
- 37.69 -
l4OX 95 -
-34.10 -
140x90 -
-32.31 --
l4OX loo
- 35.90
Major
Minor
390 390 390 390
390 390 390 390
390 390 390 390
190x95-46.28 --
l9OX9O-43.85 --
l9OX loo
-48.72
190x 105
-- - 51.15 ---
-
= 15.38
60x 90
-
= 13.85
-- 95 - 14.62
No concern 60x105= 16.1
I Column total I 105 95 90 100 I 390 I
Net worth
100to 249
250 to 499
500 to 999
loo0 or more
Ans. The computations of the expected frequencies are shown in Table 11.32.
Married Single Widowed
225 50 65
60 20 25
15 4 3
10 2 3
L
Net worth Married Single Widowed Row total
340x 96 340
105x96 105
250 - 499
22
500-999
15x76- 2.37 15x96 15
loo0 or more
482 482 482
Column total 310 76 96 482
-
=67.72
340x 76
482 482 482
105x 76
482 482 482
482 482 482
15x310
-
=53.61
340x 310
-
= 218.67
105x310
22 x 310
100- 249
-
= 20.91
-
= 1656
-
= 6753
-
22x 76 = 3.47 --
22x96-4.38
-= 14.15
-
= 2.99
--
-
= 9.65
11.12 Table 11.33 gives the results of a study involving marital status and net worth in thousands.
Find the expected frequenciesassuming the two characteristicsare independent.Comment on
the expected frequencies.
Table 11.33
Note that 4 of the 15,or 27%, of the expected frequencies are less than 5. The Chi-square test of
independence is not appropriate.
CHAP. 111 CHI-SQUARE PROCEDURES 265
SAMPLINGDISTRIBUTIONOF THE TEST STATISTICFOR THE
CHI-SQUAREINDEPENDENCETEST
11.13 Use Minitab to test the hypothesisthat opinions regarding crime is independent of the section
of the country in problem 11.11 at level of significance a = .05. Compare the expected
frequencies in the Minitab output with those computed in problem 11.11.
Row C1 C2 C3 C4
1 25 40 35 40
2 65 45 40 40
3 15 1
0 15 20
MTB > chisquare cl-c4
Expected counts are printed below observed counts.
c1 C2 C3 C4 Total
1 25 40 35 40 140
37.69 34.10 32.31 35.90
2 65 45 40 40 190
51.15 46.28 43.85 48.72
3 15 10 15 20 60
16.15 14.62 13.85 15.38
Total 1
0
5 95 90 100 390
Chi-Sq = 4.274+ 1.020+0.224+0.469+3.748+0.036+0.337+ 1.560+0.082+ 1.457+0.096+
DF = 6,P value = 0.023
1.385= 14.688
Ans. Since the computed p value = 0.023is less than a = .05, the null hypothesis is rejected. The
expected frequenciesare the same as computed in problem 1 1.11.
11.14 Refer to problem 11.12. Combine the categories Single and Widowed into a new category
called Nonmarried so that all cell expected frequencies exceed 5 and perform a test of
independence.
Ans. Table 1 1.35gives the new results after combiningcategories.
Table 11.35
Net worth Married Not Married
100to 249
250to 499
500to 999
loo0 or more
The Minitab analysisof the test of independenceis as follows:
Data Display
Row Cl C2
1 225 115
2 60 45
3 15 7
4 1
0 5
266 CHI-SQUAREPROCEDURES
MTB > Chisquare cl c2
[CHAP. 11
Chi-SquareTest
Expected counts are printed below observed counts
c1 C2 Total
I 225 115 340
218.67 121.33
2 60 45 105
67.53 37.47
3 15 7 22
14.15 7.85
4 10 5 15
9.65 5.35
Total 310 172 482
Chi-Sq = 0.183 +0.330 +0.840 + 1.514 +0.051 +0.092 +0.013+0.023 = 3.046
DF = 3, P value = 0.385
SAMPLING DISTRIBUTION OF THE SAMPLE VARIANCE
11.15 A small population consists of the three values 10, 20, and 30. Determine the population
variance and the sampling distribution of s2for samples of size 2.
Ans. The population mean is 20. The population variance is:
(x -p) (10- 20)*+(20- 20)2+(30- 20)2
= 66.67
-
o2= -
N 3
The three samples and their variances are: 10and 20, s2= 50; 10and 30, s2= 200; 20 and 30, s2=
50. The sampling distribution for s2is: P(s2= 50)= +,P(s2= 200) = 1 .
3
11.16 A sample of size 15 is taken from a normal population having cr2 = 14. What is the probability
that s2exceeds 21.064?
(n-1)S2 14s'
- - - - s2 has a Chi-square distribution with 14 degrees of freedom.
-
2 14
Ans. The statistic
From the Chi-square table, the area to the right of 21.064 is 0.10 when df = 14. Therefore P(s2>
21.064) = 0.10.
INFERENCES CONCERNING THE POPULATION VARIANCE
11.17 The standard deviation for the annual medical costs of 30 heart disease patients was found to
equal $1005. Find a 95% confidence interval for the standard deviation of the annual medical
costs of all heart disease patients.
.'
Ans. The general form of the confidence interval for the population variance is
(n - I)s2 (n- 1)s2
C O 2 < .
2
Xa12 X1-at2
CHAP. 111 CHI-SQUARE PROCEDURES 267
a a
2 2
The level of confidence is 1 - a = .95. This implies that a = .05, - = .025, and 1 - - = .975.
The Chi-square table values are
The lower limit for the 95% confidence interval for d is
= 45.722 and = 16.047, for df = 29.
(n-I)S2 29~1,010,025
-
- = 640,626.5037.
45.722
2
XaJ2
The upper limit for the 95% confidence interval for o2is
-
- 29 190109025
= 1,825,308.469.
(n -1)s2
Xl-aJ2
2 16.047
The lower limit for the 95% confidence interval for 0 is ,/640,6265037 = $800.39.
The upper limit for the 95% confidence interval for (T is 41,825,308.469 = $1,351.04.
11.18 In manufacturing processes, it is desirable to control the variability of mass produced items. A
machine fills containers with 5.68 liters of bleach. The variability in fills is acceptable
provided the standard deviation is 0.15 liters or less. A sample of 10containers is chosen to
test the research hypothesis that the standard deviation exceeds 0.15 liters at level of
significance a = .05. What conclusion is made if the variance of the sample is found to equal
0.095 liters?
Ans. The null hypothesis is Ho:(T = 0.15 and the research hypothesis is Ha:(T > 0.15.
The degrees of freedom is df = 10- 1 = 9, and the critical value is xis= 16.919.
The computed value of the test statistic is x2* =
(n-1)S2 - 9 x .095
0; .0225
- = 38.
Based on this sample, it is concluded that the variability is unacceptable and the filling machine
needs to be checked.
SupplementaryProblems
CHI-SQUARE DISTRIBUTION
11.19 Which one of the following terms best describes the Chi-square distribution curve: right-skewed, left
skewed, symmetric, or uniform?
Ans. right-skewed
11.20 According to Chebyshev’s theorem, at least 75% of any distribution will fall within 2 standard
deviations of the mean. For a Chi-square distribution having 8 degrees of freedom, find two values a and
b such that at least 75% of the area under the distribution curve will be between those values.
Ans. a = 8 - 2 ~ 4 = 0 , b = 8 + 2 ~ 4 =
16
CHI-SQUARE TABLES
11.21 A Chi-square distribution has 11 degrees of freedom. Find the x2values corresponding to the following
right-hand tail areas.
(a) .005 (b) .975 (c) .990 (d).025
Ans. (a)26,757 (6) 3.816 (c) 3.053 (d)21.920
268 CHI-SQUARE PROCEDURES [CHAP. 11
11.22 A Chi-square distribution has 17 degrees of freedom. Find the left-hand tail area corresponding to the
following values.
(a)30.191 (6)6.408 (c) 33.409 (d)24.769
Ans. (a).975 (6) .OlO (c) .990 (d).900
GOODNESS-OF-FITTEST
11.23 A national survey found that among married Americans age 40 to 65 with household income above
$50,000,45% planned to work after retirement age, 45% planned not to work after retirement age, and
10% were not sure. A similar survey was conducted in Nebraska. Let pI represent the percent in
Nebraska who plan to work after retirement age, p2 represent the percent in Nebraska that plan not to
work after retirement age, and p3 represent the percent not sure. What null and research hypotheses
would be tested to compare the multinomial probability distribution in Nebraska with the national
distribution?
Ans. Ho: pi = .45, p2= .45, and p3= .10; Ha:The proportions are not as stated in the null hypothesis.
11.24 A sociological study concerning impaired coworkers was conducted. It was reported that 30% of
American workers have worked with someone whose alcohol or drug use affected hisher job, 60% have
not worked with such individuals, and 10% did not know. The Central Intelligence Agency (CIA)
conducted a similar study of CIA employees. Let p1 represent the proportion of CIA employees who
have worked with someone whose alcohol or drug use affected hisher job, p2 represent the proportion
of CIA employees who have not worked with such coworkers, and p3 represent the proportion who do
not know. What null and research hypothesis would you test to determine if the CIA multinomial
distribution was the same as that reported in the study?
Am. Ho: pi = .30,p2 = .60, and p3 = .lO; Ha:The proportions are not as stated in the null hypothesis.
OBSERVEDAND EXPECTEDFREQUENCIES
11.25 Refer to problem 11.23. The Nebraska survey found that 120 planned to work after retirement age,l60
did not plan to work after retirement age, and 35 were not sure. What are the expected frequencies?
Ans. et = 141.75, e2= 141.75, and e3= 31.5
11.26 Refer to problem 11.24. The CIA study found that 37 had worked with someone whose alcohol or drug
use had affected theirjob, 58 had not, and 17 did not know. What are the expected frequencies?
Ans. el = 33.6, e2 = 67.2, and e3 = 11.2
SAMPLINGDISTRIBUTIONOF THE GOODNESS-OF-FITTEST STATISTIC
11.27 Refer to problems 11.23and 11.25.Perform the test at a = .01.
Ans. The critical value is 9.21. The computed value of the test statistic is x2* = 6.076. It cannot be
concluded that the Nebraska distribution is different from the national distribution.
11.28 Refer to problems 1 1.24 and 11.26.Perform the test at a = .05.
Am. The critical value is 5.991. The computed value of the test statistic is x2* = 4.607. It cannot be
concluded that the CIA distribution is different from that reported in the sociological study.
CHAP. 111
Crime type
Murder
Rape
Robbery
Assault
CHI-SQUARE PROCEDURES
18-25 26-33 34-40 over 40
11 15 4 2
21 26 11 3
120 85 3 2
147 42 5 3
269
Crime type 18-25
Murder 19.14
Rape 36.48
CHI-SQUARE INDEPENDENCE TEST
26-33 34-40 over 40
10.75 1.47 0.64
20.50 2.8 1 1.22
11.29 Five hundred arrest records were randomly selected and the records were categorized according to age
and type of violent crime. The results are shown in Table 1 1.36.
Robbery
Assault
Table 11.36
125.58 70.56 9.66 4.20
117.81 66.19 9.06 3.94
Cancer site
Lung
Oral-bladder
Other cancer
Find the table of expected frequencies assuming independence, and comment on performing the Chi-
square test of independence.
Smoking history
Never Light Medium Heavy
15 35 45 100
35 40 40 35
250 135 190 150
Ans. The expected frequencies are given in Table 11.37. The Chi-square test of independence is
inappropriate because 37.5% of the expected frequencies are less than 5 and one is less than I.
, Smoking history
Cancer site Never Light Medium Heavy
Lung 90.37 36.78 35.04 32.82
Oral-bladder 69.5 1 28.29 26.95 25.24
Other cancer 335.98 136.75 130.26 122.01
No cancer 2354.15 958.18 912.75 854.93
11.30 Table 11.38 gives various cancer sites and smoking history for participants in a cancer study.
I ~ o c a n c e r I 2550 I 950 I 830 I 750 1
Find the expected frequencies, assuming that cancer site is independent of smoking history.
Ans. The expected frequencies are shown in Table 11.39.
SAMPLING DISTRIBUTION OF’THE TEST STATISTICFOR THE CHI-SQUARE
INDEPENDENCE TEST
11.31 Combine the categories “34-40” and “Over 40” into a new category called “Over 33” and perform the
Chi-square independence test for problem 11.29.
270
Murder
Rape
Robbery
Assault
CHI-SQUARE PROCEDURES
11 (19.14) 15 (10.75) 6 (2.11)
21 (36.48) 26 (20.50) 14 (4.03)
120 (125.58) 85 (70.56) 5 (13.86)
147 (117.81) 42 (66.19) 8 (13.00)
[CHAP. 1 1
850
750
loo0
1250
800
950
Ans. The new contingency table is given in Table 11.40. The expected frequencies are shown in
parentheses.
1100 975 675 990
I150 700 780 1100
1025 I025 900 1250
975 1150 1125 980
I050 1200 I500 1040
1350 800 1255 1350
Table 11.40
I Crime tvDe I 18-25 I 26-33 I Over33 I
Two of the cells have expected frequencies less than 5. Some statisticians would not perform the
independence test because two of the expected frequencies are less than 5 . The less restrictive rule
states that the test is valid if all expected frequencies are at least one and at most 20% of the
expected frequencies are less than one. The computed value of the test statistic is 71.9I8 and the p
value is .000.The type of crime is most likely dependent on the age group.
11.32 Compute the value of the test statistic for the test of independence in problem 11.30 and test the
hypothesis of independence at a = .01.
Ans. The degrees of freedom is df = (r - 1) x (c - 1) = (4 - 1) x (4 - 1) = 9. The critical value is
21.666. The computed value of the test statistic is x2* = 327.960. The cancer site is dependent
upon the smoking history of an individual.
SAMPLING DISTRIBUTION OF THE SAMPLE VARIANCE
11.33 Fill in the following parentheses with s or 0.
(a)( ) is a constant, but (
(b)( ) has a distribution, but ( ) does not have a distribution.
(c) ( ) is a parameter and (
(
d
)
( ) describes the variability of some characteristic for a sample, whereas (
) is a variable.
) is a statistic.
bility of a characteristic for a population.
) describes the varia-
Ans. (a)o,s (b)s, o ( c ) ~ ,
s (d)s, 0
(n - I)S*
11.34 What distributional assumption is necessary in order that ~ have a Chi-square distribution with
0
"
(n - I) degrees of freedom?
Ans. The random sample from which S2 is computed is assumed to be selected from a normal
distribution.
INFERENCES CONCERNING THE POPULATION VARIANCE
11.35 Table 1I .41 gives the monthly rent for 30 randomly selected 2-bedroom apartments in good condition
from the state of New York. Find a 95% confidence interval for the standard deviation of all monthly
rents in New York for 2-bedroom apartments in good condition.
Table 11.41
CHAP. 111 CHI-SQUAREPROCEDURES
45
50
45
28
30
32
271
55 58 38 44
54 57 44 40
29 47 37 38
44 46 35 56
42 49 57 34
41 52 52 47
Ans. The sample standard deviation is equal to $203.24.
The Chi-square table values are
The 95% confidence interval extends from $161.86 to $273.22.
= 45.722 and x& = 16.047 for 29 degrees of freedom.
11.36 Table 11.42 gives the ages of 30 randomly selected airline pilots. Use the data to test the null hypothesis
that the standard deviation of airline pilots is equal to 10 years vs. the alternative that the standard
deviation is not equal to 10years. Use level of significancea = .01.
Ans. The standard deviation of the sample is 8.83 years. The critical values are 13.121 and 52.336. The
computed value of the test statistic is 22.61. The sample data does not contradict the null
hypothesis.
Chapter 12
0.9 -
J
0.8 -
0.7 -
0.6 -
0.5 -
0.4 -
0.3-
0.2-
0.1 -
0.0-
Analysis of Variance (ANOVA)
F DISTRIBUTION
Analysis of Variance (ANOVA) procedures are used to compare the means of several
populations. ANOVA procedures might be used to answer the following questions: Do differences
exist in the mean levels of repetitive stress injuries (RSI) for four different keyboard designs? Is there
a difference in the mean amount awarded for age bias suits, race bias suits, and sex bias suits? Is
there a difference in the mean amount spent on vacations for the three minority groups: Asians,
Hispanics, and African Americans?
ANOVA procedures utilize a distribution called the F distribution. A given F distribution has
two separate degrees of freedom, represented by dfl and df2. The first, dfl, is called the degrees of
freedom for the numerator and the second, df2, is called degrees of freedom for the denominator.
For an F distribution with 5 degrees of freedom for the numerator and 10degrees of freedom for the
numerator, we write df = (dfl, df2) = (5, 10).Figure 12-1 shows one F distribution with df = (10, 50)
and another with df = (10,5).
/ 
I 
-----------
I 1 I 1 I 1
0 1 2 3 4 5
Fig. 12-1
Table 12.1gives the basic properties of the F distribution.
Table 12.1
1 Propertiesof the Fdistribution
1. The total area under the F distributioncurve is equal to one.
2. An F distribution curve starts at 0 on the horizontal axis and extends
indefinitely to the right, approaching but never touching the horizontal axis.
3. An F distributioncurve is always skewed to the right.
4. The mean of the F distribution is p =-
df2 for df2 > 2, where df2 is the
I df2-2
I degrees of freedom for the denominator.
272
CHAP. 121 ANALYSIS OF VARIANCE (ANOVA) 273
-
df2
df2-2
EXAMPLE 1
2
.
1 The F distribution with df = (10, 5) shown in Fig. 12-1 has mean equal to p =--
5 df2 50
-- - 1.67,and the F distribution shown in the same figure with df = (10,50) has mean p =----
df2 -2-50- 2 -
5-2
1.04.
F TABLE
The F distribution table for 5% and 1% right-hand tail areas is given in Appendix 5. A technique
for finding F values for the 5% and 1%left-hand tail areas will be given after illustratinghow to read
the table.
EXAMPLE 1
2
.
2 Consider the F distribution with dfi = 4 and df2 = 6. Table 12.2 shows a portion of the F
distributiontable, with right tail area equal to .05,given in Appendix 5.
df2
1
2
3
4
5
6
100
...
Table 1
2
.
2
C
1
5.99
2
5.14
3
4.76
4
4.53
...
...
20
3.87
Table 12.2 indicates that the area to the right of 4.53 is .05. This is shown in Fig. 12-2. The F distribution
table with right tail area equal to .01 indicates that the area to the right of 9.15 is .01. This is shown in Fig.
12-3.
Fig. 12-2 Fig. 12-3
The following notation will be used with respect to the F distribution.The value, 4.53, shown in Fig. 12-2, is
represented as F.05(4, 6) =4.53, and the value 9.15, as shown in Fig. 12-3is represented as F.o1(4,6)=9.15.
274 ANALYSIS OF VARIANCE (ANOVA) [CHAP. 12
dfl
dfi 1 2 3 4 5 6 . . . 20
. . .
4 7.71 6.94 6.59 6.39 6.26 6.16 5.80
>
Using the notation introduced in Example 12.2,the result shown in formula (12.I ) holds for any
F distribution.
12
mean = 10
(12.1)
26 17
mean = 24 mean = 17
EXAMPLE 12.3 Formula (12.1)may be used to find the F values for 1% and 5% left-hand tail areas for the F
distribution discussed in Example 12.2. The symbol F.95(4,6) represents the value for which the area to the right
of F,95(4,
6) is .95 and therefore the area to the left of F.95(4,6) is .05. Using formula (12.1), we find that
The value F,05(6,4) = 6.16 is shown in bold print in Table 12.3,which is taken from the F distribution table.
Table 12.3
Similarly, the 1% left-hand tail area is
LOGIC BEHIND A ONE-WAY ANOVA
An experiment was conducted to compare three different computer keyboard designs with
respect to their affect on repetitive stress injuries (rsi). Fifteen businesses of comparable size
participated in a study to compare the three keyboard designs. Five of the fifteen businesses were
randomly selected and their computers were equipped with design 1 keyboards.Five of the remaining
ten were selected and equipped with design 2 keyboards, and the remaining five used design 3
keyboards. After one year, the number of rsi were recorded for each company. The results are shown
in Table 12.4.
Table 12.4
Design 1 Design 2 Design 3
24
10 24 19
A Minitab dotplot of the data for the three designs is shown in Fig. 12-4.
CHAP. 121 ANALYSIS OF VARIANCE (ANOVA) 275
Fig. 12-4
If p1, p
2
, and p3 represent the mean number of rsi for design 1, design 2, and design 3,
respectively, for the population of all such companies, the data indicate that the population means
differ.
Suppose the data for the study were as shown in Table 12.5. A Minitab dotplot for these data is
shown in Fig. 12-5.
Table 12.5
24
19 10
22 29 24
1 mean= 10 I mean=24 1 mean = 17 I
keyboard
2 . .
Fig. 12-5
Note that for the data shown in Fig. 12-5,it is not clear that the mean number of rsi differ for the
three populations using the three different designs. The difference between the two sets of data can
be explained by considering two different sources of variation. The between samples variation is
measured by considering the variation between the sample means. In both cases, the sample means
are XI = 10, 512 = 24, and X3 = 17.The between samples variation is the same for the two sets of data.
The within samples variation is measured by considering the variation within the samples. The
dotplots indicate that the within samples variation is much greater for the data in Table 12.5than for
the data in Table 12.4.The measurement of these two sources of variation will be discussed in the
next section. However, it is clear from the above discussion that a decision concerning the
effectiveness of the three designs may be based on a considerationof the two sources of variation. In
particular, it will be shown that the ratio of the between variation to the within variation may be used
to decide whether PI,p2,and p3 are equal or not.
276 ANALYSIS OF VARIANCE (ANOVA) [CHAP. 12
35 -
30
25
10
5 -
0 1
EXAMPLE 12.4 Figures 12-6 and 12-7 show boxplots for the data shown in Tables 12.4 and 12.5,
respectively. Notice that Fig. 12-6 suggests different population means, while Fig. 12-7 suggests that it is
possible that the population means may be the same.
:::fi8fl
1 I I
rsi
1 2 3
keyboard
Fig. 12-6 Fig. 12-7
SUM OF SQUARES, MEAN SQUARES, AND DEGREES OF FREEDOM
FOR A ONE-WAY ANOVA
The following notation will be used when samples from k different normally distributed
populations having equal population variances are selected in order to test for the equality of the
means of the k populations:
The sample size, sample mean, and sample variance for the ith population are represented by
ni, X,,ands?,,respectively. The total sample size is n = nl + n2 + . . - + nk. The overall mean
for all n sample values is represented by X . The population mean for the ith population is
represented by piand the standard deviation for the ith population is represented by oi.
The between samples variation is measured by the between treatmerzts mearz square and is
represented by MSTR. The expression for MSTR is given by formula (12.2):
SSTR
MSTR = -
k - 1
(12.2)
The numerator of formula (12.2),SSTR, is called the treatment sum of squares, and is computed by
using formula (12.3):
The within samples variation is measured by the error mean square and is represented by MSE.
The expression for MSE is given by
SSE
MSE = -
n-k
( I2.4)
The numerator of formula (12.4), SSE, is called the error sum of squares, and is computed by using
formula (12.5),where s:, s:, . . . ,S
[ are the sample variances.
CHAP. 121
Design 1
10
10
8
10
12
mean = 10
ANALYSIS OF VARIANCE (ANOVA) 277
Design 2 Design 3
24 17
22 17
24 15
24 19
26 17
mean = 24 mean = 17
SSE = (nl - 1)s: +(n2- l)s$ +.- +( n k - 1)s; (12.5)
The denominatorof formula (12.2),k - 1, is called the degrees offreedomfor treatments and the
The sum of the treatment sum of squares and the error sum of squares is called the total sum o
f
denominatorof formula (12.4),n - k,is called the degrees o
ffreedomfor error.
squares. The total sum of squares is represented by SST and is given by
SST = SSTR +SSE (12.6)
The total sum of squares may be computed directly by using formula ( 1 2 4 , where the sum is
over all n sample values. The degrees o
ffreedomfor total is equal to n - 1.
SST =C(X-7i)2 (12.7)
EXAMPLE 12.5 The data from Table 12.4 is listed below for convenient reference.
Since the three sample sizes are equal, the overall mean is computed by finding the mean of the three treatment
(design)means. The mean of 10, 24, and 17 is 17, and therefore Ti: = 17. The total sum of squares will be found
first.
SST= C ( X - X ) ~
= (10- 17)2+(10- 17)2+(8- 17)2+(10- 17)2+(12- 17)2+(24- 17)2+(22- 17)2+(24- 17)2
= 4 9 + 4 9 + 8 1 + 4 9 + 2 5 + 4 9 + 2 5 + 4 9 + 4 9 + 8 1 + 0 + 0 + 4 + 4 + 0
= 514
+(24- 17)2+(26- 17)2+(17- 17)2+(17- 17)2+(15- 17)2+(19- 17)2+(17- 17)2
The treatment sum of squares is computed by using formula (12.3).
SSTR= nl(xl- ~ r > ~
+ n 2 ( ~ 2
- x12+ n3 (x3- x12
= 5 ~ ( 1 0 - 1 7 ) ~ + 5 ~ ( 2 4 - 1 7 ) ~ + 5 ~ ( 1 7 - 1 7 ) ~
= 5 X 4 9 + 5 X 4 9 + 5 X O
= 490
We need to find the three sample variances before finding the error sum of squares. The formula for the
sample variance is applied to ezch sample separately as follows:
( 10 - 10)2+(10 - 10)2+(8 - 10)2+(10 - 10)2+( 12 - 10)2 =
s: =
5 - 1
(24 -24)’ +(22 -2412+(24 - 24)2+(24 -24)2+(26 -24)2 =
s; =
5-1
(17- 17)2+(17-17)2+(15-17)’+(19-17)2+(17-17)2 = 2
s: =
5 - 1
The error sum of squares is computed using formula (12.5).
SSE = (nl - 1)s: + (n2 - 1)s; +(n3 - 1)s;
= 4 X 2 + 4 X 2 + 4 X 2 = 2 4
278 ANALYSIS OF VARIANCE (ANOVA) [CHAP. 12
Design I
10
10
8
10
12
TI = 50
Note that SST = SSTR + SSE. It is clear from this example that the computation of the various
sums of squares is rather time consuming and subject to computational errors. Computer software is
usually employed to perform these computations. This will be discussed later. If it is necessary to
perform the computations by hand, shortcut formulas are recommended. These shortcut formulas will
now be discussed. The shortcut formula for computing the total sum of squares is given in (12.8):
Design 2 Design 3
24 17
22 17
24 15
24 19
26 17
Tz= 120 T3 = 85
(W2
SST = Cx2- - (12.8)
n
The treatment sum of squares is computed by using formula (12.9), where Ti is the sum of the
sample values for the ith treatment.
(12.9)
After SST and SSTR are computed, SSE is found by using formula (12.10):
SSE = SST - SSTR (12.10)
EXAMPLE 12.6 To find the sum of squares found in Example 12.5 using the shortcut formulas, refer to the
following table.
To find SST, it is necessary first to find Cx and Zx2.
C x = 1 0 + 1 0 + 8 + 1 0 + 1 2 + 2 4 + 2 2 + 2 4 + 2 4 + 2 6 + 1 7 + 1 7 + 1 5 + 1 9 + 1 7 = 2 5 5
Cx2= 100-t. 100+ 64 + 100 + 144 +576 +484 + 576 +576 +676 +289 + 289 +225 + 361 +289 = 4849
-4849-4335=514
SST = Cx2- --
(Cx)*
n
The treatment sum of squares is found next.
and, then the error sum of squares is found by subtraction:
SSE= SST - SSTR= 514 - 490 = 24
The degrees of freedom for total is n - I = 14, the degrees of freedom for treatments is k - 1 = 2, and the
degrees of freedom for error is n - k = 15 - 3 = 12.
CHAP. 121 ANALYSIS OF VARIANCE (ANOVA) 279
Source df ss MS = SS/df F Statistic
Error n - k SSE MSE
Total n - 1 SST
Treatment k - 1 SSTR MSTR F
SSTR 490
The between treatments mean square is MSTR = --
- -= 245, and the error mean square is MSE =
k - 1 2
Source
Treatment
Error
Total
SSE 24
- - - - = 2. This example illustrates that it is much easier to compute the sums of squares using the
n - k 12
shortcut formulas than it is to use the defining formulas. Most researchers use statistical software to perform
these computations.
df ss MS = SS/df F Statistic
2 490 245 122.5
12 24 2
14 514
SAMPLING DISTRIBUTION FOR THE ONE-WAY ANOVA TEST STATISTIC
The purpose behind a one-way ANOVA is to determine if the means of k populations are equal
or not. In particular, we are interested in testing the null hypothesis Ho: 111 = p2 = . - - = p
k against the
alternative hypothesis Ha: All k means are not equal. When the k samples are randomly and
independently selected from normal populations having equal population standard deviations (or
equivalently equal population variances), and the null hypothesis is true, the ratio of the treatment
mean square to the error mean square has an F distribution with dfl = k - I , and df2 = n - k. We
express this as shown in formula (12.11).This ratio, sometimes referred to as the F ratio, is the test
statistic used to test the above hypotheses.
MSTR
F=-
MSE
(12.12)
EXAMPLE 12.7 In Examples 12.5 and 12.6, three different keyboard designs were compared with respect to
the number of rsi found at 15 companies using the three designs. The test statistic given in formula (12.1I ) has
an F distribution with df, = k - 1 = 3 - 1 = 2, and df2 = n - k = 15 - 3 = 12. In Example 12.6, it was shown that
MSTR = 245 and MSE = 2. The computed value of the test statistic is F* = ?= 122.5. From the F distribution
table in Appendix 5, the 5% and 1% right-hand tail critical values are as follows: F05(2, 12) = 3.89 and
F01(2,12)= 6.93. The extremely large value of the test statistic suggests that the population means are not equal
and that the null hypothesis should be rejected. The differences in the sample means are significant, and indicate
that design I may reduce the number of rsi.
BUILDING ONE-WAY ANOVA TABLES AND TESTING THE EQUALITY
OF MEANS
The results of the computations in the proceeding sections are usually conveniently displayed in
a one-wwy ANOVA table. The general structure of the one-way ANOVA table is given in Table 12.6.
EXAMPLE 12.8 The computations in Examples 12.6and 12.7 are summarized in the ANOVA Table 12.7.
The steps in testing the equality of means is summarized in Table 12.8.
280
. Steps for Testing the Equality of Means
Using the One-way ANOVA Procedure
Step I : State the null and alternative hypothesis as follows:
H
o
: =112 = * * * =
Ha:
All k means are not equal.
Step 2: Use the F distribution table and the level of significance, a,to determine the
rejection region.
Step 3: Build the ANOVA Table, and from the table determine the computed value of
the F ratio.
Step 4: State your conclusion. The null hypothesis is rejected if the computed value of
the test statistic falls in the rejection region. Otherwise, the null hypothesis is not rejected.
ANALYSIS OF VARIANCE (ANOVA) [CHAP. 12
Design 1
10
Design 2 Design 3
24 17
Computer software is routinely used to perform the necessary computations to compute the test
statistic used to test the equality of means. The software usually produces the ANOVA table and in
addition provides a p value for the test. When the p value is given, the null hypothesis is rejected if
the p value is less than the preset level of significance.
EXAMPLE 12.9 Two different Minitab commands will now be discussed which produce a one-way ANOVA
table. The data for the number of rsi for the three keyboard designs is reproduced below.
The Minitab command aovonewaycl -c3 requires that the data for the three designs be put into three separate
columns. These data were put into columns cl, c2, and c3.
Row design 1 design2 design3
1 10 24 17
2 10 22 17
3 8 24 15
4 10 24 19
5 12 26 17
MTB > aovoneway c1 -c3
One-way Analysis of Variance
Analysis of Variance
Source DF SS MS
Factor 2 490.00 245.00
Error 12 24.00 2.00
Total 14 514.00
Level N Mean St Dev
design1 5 10.00 1.414
design2 5 24.00 1.414
design3 5 17.00 1.414
F P
122.50 O.OO0
CHAP. 121 ANALYSIS OF VARIANCE (ANOVA) 281
The above ANOVA table is the same as the one given in Table 12.7. The Minitab output shows the p value
associated with the test under the column labeled P. Recall from our previous discussion of p value, that the p
value is the area to the right of F*= 122.50 for an F distribution having df, = 2 and df2= 12. Or, stated in words,
if the null hypothesis is true, i.e., if pI= p
2 = p3,the probability of obtaining an F value this large or larger is
.OOO. For this reason, the null hypothesis would be rejected.
The second Minitab command which produces a one-way ANOVA table is oneway response in c2
treatment in cl. The treatment groups are identified in one column and the responses are given in another
column. The ouput is the same for both commands. The required data setup differs for the two commands.
Row
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
design
1
1
1
1
1
2
2
2
2
2
3
3
3
3
3
rsi
10
10
8
10
12
24
22
24
24
26
17
17
15
19
17
MTB >oneway response in c2 treatment in cl
One-way Analysisof Variance
Analysis of Variance for rsi
Source DF SS MS F P
Design 2 490.00 245.00 122.50 O.OO0
Error 12 24.00 2.00
Total 14 514.00
EXAMPLE 12.10 Before ending our discussion of the one-way ANOVA, it is instructive to compare the
ANOVA tables for the two sets of data given in Tables 12.4 and 12.5 and illustrated in Figs. 12-4 and 12-5 as
well as Figs. 12-6 and 12-7. The ANOVA for the data in Table 12.4 is given above in Example 12.9. The
Minitab output for the ANOVA Table corresponding to the data in Table 12.5 is
One-way Analysis of Variance
Analysis of Variance
Source DF SS MS F P
Factor 2 490.0 245.0 3.30 0.072
Error 12 890.0 74.2
Total 14 1380.0
The ANOVA for the data in Table 12.5 does not indicate a difference in the three designs for a = .05, since p =
0.072. Thus, the statistical test for the hypotheses concerning the means confirms what Figs. 12-4 and 12-5 as
well as Figs. 12-6and 12-7 suggested.
LOGIC BEHIND A TWO-WAY ANOVA
Suppose that rather than considering only the effect of the factor keyboard design, we would
also like to consider the effect of the factor seating design on repetitive stress injuries (rsi). The one-
282 ANALYSIS OF VARIANCE (ANOVA) [CHAP. 12
Seating design
I
2
3
way ANOVA allows us to analyze the effect of only one factor on the response variable, rsi. A two-
way ANOVA allows one to analyze the effects of two factors on a response variable. Suppose there
are three different keyboard designsand three different seating designs we are interested in testing. A
3 x 3factorialdesign is one in which each keyboard design is matched with each seatingdesign for a
total of nine treatment combinations. Such a design allows one to test the main eflects for keyboard
designs and seating designs as well as test for interaction between the two factors. Suppose 18
businesses of comparable size are selected for the study. A randomization scheme is used and each of
the nine treatments are used at two businesses. The number of rsi are recorded at each location and
the results are shown in Table 12.9.
1 2 3
10,12 20,22 16, 18
14, 16 25,27 20,22
898 18, 16 14, 16
Table 12.9
Seating design
1
2
3
Column mean
I Kevboard design I
Keyboard design
1 2 3 Row mean
11 21 17 16.33
15 26 21 20.67
8 17 15 13.33
11.33 21.33 17.67
21.0 -
18.5 -
16.0 -
13.5 -
11.0 -
EXAMPLE 12.11 Table 12.10 is a table of means for the results shown in Table 12.9. The mean for each
treatment (Keyboard-Seating design combination) is given as well as the marginal means. It is clear that
keyboard design 2 with seating design 2 resulted in the largest mean number of rsi. Also, keyboard design 1 with
seating design 3 resulted in the smallest mean number of rsi.
I
Table 12.10
A graphical technique for analyzing the results of the experiment is a main effects plot. A Minitab main effects
plot is shown in Fig. 12-8.The main effects plot for the factor, seating, is a plot of the row means given in Table
12.10. Each of these means is calculated for six businesses. The main effects plot for the factor, keyboard, is a
plot of the column means given in Table 12.10. Each of these means is calculated for six businesses.
Main Effects Plot - Means for rsi
1 I I
seating
I 1 1
keyboard
Fig. 12-8
CHAP. 121
Seating design I 2 3
1 18,20 20,22 16,18
2 14, 16 25,27 20,22
+ 3 8, 8 18, 16 14, 16
ANALYSIS OF VARIANCE (ANOVA) 283
Another plot used to help explain the experimental results is called the interaction plot. A Minitab interaction
plot is shown in Fig. 12-9. The interaction plot indicates that regardless of the seating design used, keyboard
design 1 results in the smallest number of rsi, keyboard design 2 results in the largest number of rsi, and the
number of rsi for keyboard design 3 is between the other two. When the line segments are roughly parallel, as
they are here, we say that there is no interactionbetween the factors.
Interaction Plot - Means for rsi
seating
1 2
keyboard
Fig. 12-9
EXAMPLE 12.12 Suppose the experiment described in
12.11.
3
Example 12.11 resulted in the data shown in Table
Table 12.11
1 Kevboard desien 1
The Minitab main effects plot is shown in Fig. 12-10and the Minitab interaction plot is shown in Fig. 12-11. An
understanding of interaction is accomplished by comparing Fig. 12-11 with Fig. 12-9. Recall that in Fig. 12-9,
we were able to conclude that regardless of the seating design used, keyboard design 1 results in the smallest
number of rsi, keyboard design 2 results in the largest number of rsi, and the number of rsi for keyboard design
3 is between the other two. Notice that this is not true in Fig. 12-11. In particular, when seating design 1 is used
the smallest mean number of rsi occur for keyboard design 3, not keyboard design 1. In this study, the factors
keyboard design and seating design are said to interact. Note that the line segments on the interaction plot are
not parallel. The keyboard design which produces the minimum number of rsi depends on the seating design
used. We cannot conclude that keyboard design 1 always produces the minimum number of rsi. When
interaction is present, the main effects must be interpreted carefully.
The discussion and examples in this section are intended to provide some insight into the logic
behind a two-factorfactorial design. In the next section the ideas will be generalized and the analysis
284 ANALYSIS OF VARIANCE (ANOVA) [CHAP. 12
of variance associated with the two-factordesign will be discussed.This section is intended to make
the general discussioneasierto understand.
Main Effects Plot - Means for rsi
21
19
17
15
13
25
20
15
10
I 1 I 1 1
seating keyboard
Fig. 12-10
Interaction Plot - Means for rsi
seating
' 1
O 2
* 3
1
2
3
-
--
- - - .
I
1
I
2
keyboard
I
3
Fig. 12-11
CHAP. 121
Gender
Male
Female
ANALYSIS OF VARIANCE (ANOVA)
Age Race Disability
200, 175,215 150, 125, 135 100,95, 115
185,210,225 130, 145, 115 90,80, 110
285
Source df ss MS = SS/df F Statistic
Error n - ab SSE MSE
Treatment ab- 1 SSTR MSTR F
i Total n - 1 SST
SUM OF SQUARES,MEAN SQUARES,AND DEGREES OF FREEDOM
FOR A TWO-WAYANOVA
Let A and B be two factors whose influence on a response variable we wish to investigate. Let a
be the number of levels of factor A and let b be the number of levels of factor B. Each level of factor
A is combined with each level of factor B to form a total of ab treatments. Equal size samples are
taken from each of the ab treatments. If each sample is of size m, the total number of sample values
is n = mab.
EXAMPLE 12.13 Table 12.12 gives the compensatory awards (in thousands)for wrongful firing based on age,
race, or disability for males and females. Either bias type or gender may be called factor A. Suppose we let
factor A be bias type. Then A has three levels and B has two levels. There are six populations or treatments. The
terms population, treatment, or treatment combination are used for the six groups. We shall use the term
treatment. The six treatments are: (age bias and male), (race bias and male), (disability bias and male), (age bias
and female), (race bias and female), and (disability bias and female). The values for a, b, m, and n are 3, 2, 3,
and 18, respectively.
Table 12.12
I Bias type I
Initially, a two factor factorial design may be analyzed using a one-way ANOVA. The number of
treatments is k = ab, and the total sample size is n. The one-way ANOVA for the two factor design is
given in Table 12.13.
The Treatment sum of squares, SSTR, is now partitioned in three parts as shown in formula (Z2.Z2),
where SSA is calledfactor A sum o
f squares, SSB is calledfactcr B sum of squares, and SSAB is
called interaction sum o
f squares.
SSTR = SSA +SSB +SSAB (12.12)
The treatment degrees of freedom, ab - I, is also partitioned into three parts as follows:
df for A = a - 1, df for B = b - 1, df for AB = (a - l)(b - 1)
The degrees of freedom for A, B, and AB add up to the degrees of freedom for treatment as given by
formula ( I2.13):
ab - 1 = (a - 1)+ (b - 1)+ (a - l)(b - I) (12.13)
Mean squares are also defined for A, B, and AB. The factor A mean square, MSA, is given by
formula (12.14):
SSA
MSA = -
a-1
(12.14)
286 ANALYSIS OF VARIANCE (ANOVA)
L
Source df ss MS F Statistic
Bias type 2 SSA MSA F A
Sex 1 SSB MSB FB
Interaction 2 SSAB MSAB FAB
Error 12 SSE MSE
Total 17 SST
:
The factor B mean square,MSB, is given by
SSB
MSB = -
b- 1
The interaction mean square,MSAB, is given by
SSAB
(a - I)(b - 1)
MSAB =
[CHAP. 12
(12.1.5)
(12.16)
The formulas for the sum of squares for A, B, and AB will not be discussed since the analysis of
factorial designs are almost always performed by computer software.
EXAMPLE 12.14 In Example 12.13,the total degrees of freedom is n - 1 = 18- 1 = 17, the treatment degrees
of freedom is ab - 1 = 6 - 1 = 5, and the error degrees of freedom is n - ab = 18 - 6 = 12. Furthermore, the 5
degrees of freedom for treatments is partitioned into a - 1 = 3 - 1 = 2 for factor A, b - 1 = 2 - 1 = 1 for factor
B, and (a - l)(b - I)= 2 x 1 = 2 for interaction.
BUILDING TWO-WAYANOVA TABLES
The sum of squares, mean squares, degrees of freedom, and test statistics for a factorial design are
summarized in a two-way ANOVA table. The structure for a two-way ANOVA table is given in
Table 12.14.
Table 12.14
Source
Factor A
Factor B
Interaction (a -l)(b - 1) SSAB
Error n - ab
Total n - 1 SST
MS = SS/df
'
7
1
F Statistic
FA= MSA/MSE
F
B = MSBMSE
FAB = MSABMSE
EXAMPLE 12.15 A two-way ANOVA table for the data given in Table 12.12 is given in Table 12.15.
Table 12.15
SAMPLING DISTRIBUTIONSFOR THETWO-WAYANOVA
When there is no interaction between factors A and B, the test statistic FABhas an F distribution
with dfl = (a - l)(b - 1)and df2 = n - ab. When there are no differences among the means for the
main effect A, the test statistic FAhas an F distribution with dfl = a - I and df2 = n - ab. When there
are no differences among the means for the main effect B, the test statistic Fg has an F distribution
with dfl = b - 1 and df2 = n - ab.
CHAP. 121 ANALYSIS OF VARIANCE (ANOVA) 287
Seating design
1
2
3
When there are no differences among the means for the factor A, we say there is no main effect
due to factor A. Otherwise, we say there is a main effect due to factor A. Similar terminology is used
for factor B.
1 2 3
10,12 20,22 16, 18
14, 16 25,27 20,22
8. 8 18. 16 14. 16
EXAMPLE 12.16 In Example 12.15, the test statistic F
A has an F distribution with df, = 2 and dfi = 12. The
test statistic F
B has an F distribution with dfl = 1 and df2= 12. The test statistic F
A
B has an F distribution with
dfl = 2 and dfi = 12. These distributions may be used to test for significant interaction and main effects and is
discussed in the next section when the computer analysis of the data is discussed.
TESTING HYPOTHESIS CONCERNINGMAIN EFFECTS AND INTERACTION
The steps for testing interaction and main effects in a two-factor factorial design is summarized
in Table 12.16.
Steps for Testing for Interaction and Main Effects
Using the Two-way ANOVA Procedure
Step 1: State the null and alternative hypothesis for interaction and main effects as
follows:
Ho:There is no interaction between factors A and B
Ha:Factors A and B interact
Ho:There are no differences among the means for main effect A
Ha: At least two of the main effect A means differ
Ho:There are no differences among the means for main effect B
Ha:At least two of the main effect B means differ
Step 2: Use the F distribution table and the level of significance a to determine the
rejection regions.
Step 3: Build the ANOVA table and from the table determine the computed values of the
test statistics.
Step 4: State your conclusions. The null hypothesis is rejected if the computed value of
the test statistic falls in the rejection region. Otherwise, the null hypothesis is not rejected.
EXAMPLE 12.17 Table 12.9gave the number of repetitive stress injuries (rsi)for a 3 by 3 factorial design and
the table is reproduced below. The Minitab analysis for this experiment is also shown.
I Keyboard design I
Two different Minitab commands will be discussed to analyze the data. The first to be discussed is the command
Twoway. First of all note the form of the data as given below. The data in the above table must be put in the
form shown before performing the analysis. Each data line identifies the row, the column, and the data value in
that row and column.
288 ANALYSIS OF VARIANCE (ANOVA) [CHAP. 12
Row
1
2
3
4
5
6
7
9
10
11
12
13
14
15
16
1 7
18
a
seating
1
1
1
1
1
1
2
2
2
2
2
2
3
3
3
3
3
3
keyboard
1
1
2
2
3
3
1
1
2
2
3
3
1
1
2
2
3
3
rsi
10
12
20
22
16
18
14
16
25
27
20
22
8
18
16
14
16
a
The command Twoway 'rsi' 'seating' 'keyboard';starts with the keyword Twoway, followed by the response
and then the row and column names. The subcommand Means 'seating' 'keyboard'. gives row and column
means and confidence intervals.
MTB > Twoooay Irmil lmoatiagl lkoyboardl;
SUBC > Moanm lmoati~gl
lkoyboardl.
Two-way Analysis of Variance
Analysis of Variance for rsi
Source DF ss MS
seating 2 163.11 81.56
keyboard 2 307.11 153.56
Interaction 4 4 . 0 9 1.22
Error 9 16.00 1.78
Total 17 491.11
The test statistic value for the null hypothesis b:No interaction between factors A and B is easily
computed, since MSAB = 1.22and MSE = 1.78.The computed value for FAB*
is 1.2211.78= 0.69. The critical
value for a = .05 is F.o5(4, 9) = 3.63. Since FA^* does not exceed 3.63, the null hypothesis is not rejected.
Therefore, interactionis not significant.This confirms what the interaction plot in Fig. 12-9indicates.
Suppose we call keyboard design factor A. The test statistic value for the null hypothesis &:There are no
differences among the means for main effect A is easily computed, since MSA = 153.56and MSE = 1.78.The
computed value for FA* is 153.56/1.78= 86.27. The critical value for a = .05 is F.m(2, 9) = 4.26. Since FA* far
exceeds 4.26, we reject the null hypothesis and concludethat the means for the three keyboard designsdiffer.
The seating design is factor B.The test statistic value for the null hypothesis H
o
:There are no differences
among the means for main effect B is easily computed, since MSB = 81.56 and MSE = 1.78.The computed
value for FB* is 81.56/1.78 = 45.82. The critical value for a = .05 is F.os(2, 9) = 4.26. Since FA* far exceeds
CHAP. 121
Seatingdesign
1
2
3
ANALYSIS OF VARIANCE (ANOVA)
1 2 3
18,20 20,22 16, 18
14,16 25,21 20,22
8,8 18, 16 14,16
289
4.26, we reject the null hypothesis and conclude that the means for the three seating designs differ. The main
effects plot in Fig. 12-8 indicates that seating design 3 and keyboard design 1 is a good combination for
reducing repetitive stress injuries.
A second Minitab command for doing the analysis is the command ANOVA 'rsi' = 'seating' 'keyboard'
'seating'*'keyboard'. The output below is produced by this command. This output may be preferable, since it
provides p values. By considering the p values, we see immediately that interaction is not significant, but that
both main effects are significant.
MTB > ANOVA 'rain = laeatl~gn
lkeyboardn lmoati~gn*'koyb~ardl
Analysis of Variance(BalancedDesigns)
Factor Type Levels Values
seating fixed 3 1 2 3
keyboard fixed 3 1 2 3
Analysis of Variance for rsi
Source DF ss MS F P
seating 2 163.111 81.556 45.87 0.000
keyboard 2 307.111 153.556 86.37 0.000
seating*keyboard 4 4 . aa9 1.222 0.69 0.619
Total 17 491.111
Error 9 16.000 1.778
EXAMPLE 12.18 The data from Table 12.11are reproduced below.
I Kevboard design I
The Minitab analysisfor these data is shown below.
MTB > ANOVA 'rsi' = 'seating' 'keyboard' 'seating'*'keyboard'
Analysis of Variance(BalancedDesigns)
Factor Type Levels Values
seating fixed 3 1 2 3
keyboard fixed 3 1 2 3
Analysis of Variance for rsi
Source DF ss MS F P
seating 2 177.333 88.667 49.88 0.000
keyboard 2 161.333 80.667 45.38 0.000
seating*keyboard 4 65.333 16.333 9.19 0.003
Error 9 16.000 1.778
Total 17 420.000
From this output, we immediatelysee that there is significant interaction. This confirmswhat the interactionplot
shows in Fig 12-11. It would be a good idea for you to review the discussion given in Example 12.12
concerning the nature of the interaction. Remember, when interaction is present, the main effects must be
interpreted carefully.
EXAMPLE 12.19 The 3 by 2 factorial data for the factors bias type and gender given in Table 12.12 are
reproduced on the next page.
290
Gender
Male
Female
ANALYSIS OF VARIANCE (ANOVA)
Age Race Disability
200, 175,215 150, 125, 135 100,95, 115
185.210.225 130. 145. 115 90.80. 110
[CHAP. 12
I Bias tvue I
The Minitab output for the Table 12.12data is given below.
MTB > ANOVA 'award' = 'gender' 'biastype' 'gender'*'biastype'
Analysis of Variance (Balanced Designs)
Factor Type Levels Values
gender fixed 2 1 2
biastype fixed 3 1 2 3
Analysis of Variance for award
Source DF ss MS F P
gender 1 2 2 . 2 2 2 . 2 0.09 0 . 7 7 4
biastype 2 3 3 1 4 4 . 4 1 6 5 7 2 . 2 6 4 . 5 0 0 . 0 0 0
gender*biastype 2 3 4 4 . 4 1 7 2 . 2 0 . 6 7 0 . 5 3 0
Error 1 2 3 0 8 3 . 3 2 5 6 . 9
Total 1 7 3 6 5 9 4 . 4
The p values indicate that there is no significant interaction. There is no significant effect due to gender.
However, the mean awards differ according to the type of bias.
Solved Problems
F DISTRIBUTION
12.1 What does it mean to say that the F distribution curve is asymptotic to the horizontal axis of
the rectangular coordinate system?
Ans. This means that as we move to the right from the origin along the horizontal axis, the F distribution
curve approaches the horizontal axis, but never touches it.
F TABLE
12.2 Find the following using the F distribution in Appendix 5.
(a) Foi(7, 3) (b) Fod4, 6)
Ans. ( a ) 21.67 (6) 4.53
12.3 Find the following using the F distribution in Appendix 5
(4 F99(7,3 (6) F95(4,6)
12.4 Find the critical value of F for the following.
(a) df = (4, 8) and area in the right tail equal to .01.
(b) df = (6,6) and area in the right tail equal to .05.
Ans. (a) 7.01 (h) 4.28
CHAP. 121 ANALYSIS OF VARIANCE (ANOVA)
Treatment 1 Treatment 2
4 6
3 8
-2 8
2 10
0 10
-1 12
291
Treatment 3
I 1
6
16
I 1
14
8
LOGIC BEHIND A ONE-WAY ANOVA
Treatment 1
-10
10
4
-4
0
6
mean = 1
12.5 Eighteen individuals, diagnosed as having mild high blood pressure, were randomly divided
into three groups of six each and were assigned to one of three treatment groups. The treatment
1 group served as a control and made no changes in their diet. The individuals in the treatment
2 group replaced 50% of their diet with fruits and vegetables. The individuals in treatment
group 3 replaced 50% of their diet with fruits and vegetables and reduced fat calories to 15%
of their daily total. After six months, the change in diastolic blood pressure was measured and
the results are given in Table 12.17. The change in blood pressure was determined by
subtracting the reading after the six-month study from the initial reading.
Treatment 2 Treatment 3
0 0
16 22
8 11
9 5
19 16
2 12
mean = 9 mean = 11
I I I
mean = 1 mean =9 mean = I 1
A Minitab dotplot of the data given in Table 12.17 is shown in Fig 12-12. Describe in words
what the dotplot suggests concerning the three treatments.
trtment 3
Fig. 12-12
Ans. The plot suggests that the means for treatments 2 and 3 are greater than the mean for treatment 1.
It is not clear whether the means for treatments 2 and 3 are different.
12.6 Table 12.18 is another set of data obtained for the same experiment described in Problem 12.5.
A Minitab dotplot of the data is shown in Fig. 12-13. Describe in words what the dotplot
suggests concerning the three treatments.
292 ANALYSIS OF VARIANCE (ANOVA) [CHAP. 12
A Minitab dotplot of the data is shown in Fig. 12-13. Describe in words what the dotplot
suggests concerning the three treatments.
Fig. 12-13
Ans. It is not clear whether the treatment means differ or not since the data points overlap for the three
treatments.
SUM OF SQUARES, MEAN SQUARES, AND DEGREES OF FREEDOM
FOR A ONE-WAY ANOVA
1
2
.
7 For the data in Table 12.17, find SST, SSTR, and SSE using both the defining and the shortcut
formulas. After finding the sum of squares, find MSTR and MSE. Also give the degrees of
freedom for total, treatments, and error.
Ans. Definingformulas:
SSTR = nl(Xl - F;)2+ n2(X2 - X)’ + n3(Xk - Sr)2= 6(I - 7)’ + 6(9 - 7)’+ 6(I I - 7)2 = 336
SSE = (nl - 1)s: + (nz - 1)s; +(n3- 1)s: = 5 x 2.3662+5 x 2.09g2+ 5 x 3.688’ = 118
SST= SSTR + SSE = 336 + I18 = 454
Shortcut formulas:
SSE= SST- SSTR = 454- 3 3 6 4 18
Degreesof freedom:for total = n - I = 17, for treatments= k - 1 = 2, for error = n - k = 15
= 168
SSTR
MSTR= -
k-1
SSE
n-k
MSE = -= 7.87
12.8 For the data in Table 12.18,find SST, SSTR, and SSE using both the defining and the shortcut
formulas. After finding the sum of squares, find MSTR and MSE. Also give the degrees of
freedom for total, treatments, and error.
CHAP. 121 ANALYSIS OF VARIANCE (ANOVA) 293
Ans. Defining formulas:
SSTR=nl(S11-Y')2+ I I ~ ( X ~ - X ) ~ +
n3(Xk -X)*= 6(1 -7)2+6(9-7)2+6(11-7)2=336
SSE = (nl - 1)s: +(n2 - 1)s; + (n3- 1)s; = 5 x 7,2392+5 x 7.4832+5 x 7.7972= 846
SST = SSTR + SSE = 336 +846 = I182
Shortcutformulas:
SSEzSST-SSTR= 1182-336=846
Degrees of freedom: for total = n - 1 = 17,for treatments= k - 1 = 2, for error = n -k = 15
MSTR= -
SSTR -
- 168
k-1
= 56.4
MSE= -
SSE
n - k
SAMPLING DISTRIBUTION FOR THE ONE-WAYANOVA TEST STATISTIC
12.9 (a) For the experiment described in problem 12.5,what conditions are necessary in order that
MSTR
F = -have an F distribution with dfl = 2 and df2= 15?
MSE
(b) What is the computed value of F, F*, for the data in problem 12.5?
(c) What is the critical value for testing equal means at a = .05? Would you reject the null
hypothesis at this significance level?
Ans.
12.10 (a)
(b)
(4
(a) The populations represented by the three samples are normally distributed. The three popula-
(6) The computed value of the test statistic is F*= -= 21.35.
(c) F,os(2, 15)= 3.68. Reject the null hypothesisbecause F* > 3.68,
tions have equal variances. The null hypothesis is assumed to be true.
168
7.87
For the experiment described in problem 12.6, what conditions are necessary in order that
MSTR
MSE
F=- have an F distribution with df, = 2 and df;!= 15?
What is the computed value of F, F*, for the data in problem 12.6?
What is the critical value for testing equal means at a = .05? Would you reject the null
hypothesis at this significance level?
294
Source Df
Treatment 2
Error 15
Total 17
ANALYSIS OF VARIANCE (ANOVA) [CHAP. 12
J
ss MS = SSfdf F Statistic
336 168 21.35
118 7.87
d
Ans. (a) The populations represented by the three samples are normally distributed. The three
populations have equal variances. The null hypothesis is assumed to be true.
Source Df ss
Treatment 2 336
Error 15 846
Total 17
(6) The computed value of the test statistic is F* = -- - 2.98.
(c) F,05(2,15)= 3.68. Do not reject the null hypothesis,since F* does not exceed 3.68.
56.4
MS = SS/df F Statistic
168 2.98
56.4
BUILDING ONE-WAY ANOVA TABLES AND TESTING THE EQUALITY
OF MEANS
12.11 Refer to Problems 12.5, 12.7, and 12.9. Use the results in these three problems to build the
one-way ANOVA table for this experiment.
Ans. The one-way ANOVA table is shown below. The ANOVA table is used to systematically
summarize the lengthy computations for the procedure and to give the computed value of the test
statistic. By referring to the F distribution table, we see that the null hypothesis of equal means for
the three treatments would be rejected at both the .05 and the -01levels. Replacing 50% of their
diet with fruits and vegetables appears to reduce the blood pressure for individuals with mild high
blood pressure based on the results of this study.
ANOVA Table for the Data in Table 12.17
12.12 Refer to problems 12.6, 12.8, and 12.10. Use the results in these three problems to build the
one-way ANOVA table for this experiment.
Am. The one-way ANOVA table is shown below. The ANOVA table is used to systematically
summarize the lengthy computations for the procedure and to give the computed value of the test
statistic. By referring to the F distribution table, we see that the null hypothesis of equal means for
the three treatments would not be rejected at either the .05 or the .01 levels. Based on the results
of this study, we cannot claim that replacing 50% of their diets by fruits and vegetables will reduce
the blood pressures for individuals with mild high blood pressure.
LOGIC BEHIND A TWO-WAY ANOVA
12.13 A medical study utilized a 3 by 2 factorial design. One factor was the type of surgical
procedure and the other factor was the temperature of the patients’ environment during
surgery. Some patients were kept warm with blankets and intravenous fluids and others were
kept cool. Four patients were available for each surgical procedure-temperature combination.
For each patient, the length of hospital stay was recorded and the results of the study are
given in Table 12.19.
CHAP. 121
Temperature
Warm
Cool
ANALYSIS OF VARIANCE (ANOVA)
Surgical Procedure .
1 2 3
2,393, 4 5,6,8,5 9,9, 10, 12
45,596 9,10,11, 10 13,14, 14, 15
295
Temperature
Warm
Cool
Column mean
Surgical Procedure
1 2 3 Row mean
3 6 10 6.3
5 1
0 14 9.7
4 8 12
The mean for each treatment (Surgical Procedure-Temperature combination), as well as
marginal means, is given in Table 12.20.
Table 12.20
A Minitab main effects plot is shown in Fig. 12-14 and an interaction plot is shown in Fig.
12-15.
Main EffectsPlot - M
e
a
n
sfor Days
12
10
Days 8
6
4
/
Surgical Procedure
Fig. 12-14
Explain in words the interaction plot and the main effects plot shown in Figures 12-14 and
12-15.
Ans. The solid line segment in Fig. 12-15 corresponds to patients who were kept warm during surgery
and the dashed line corresponds to patients who were kept cool during surgery. The interaction
plot indicates that regardless of surgical procedure, the mean length of hospital stay is less when
the patient is kept warm during surgery since the solid line is always below the dashed line. This
indicates that there is no interaction between temperature and the type of surgical procedure. The
main effects plot for temperature shows that the mean length of stay for patients kept warm during
surgery is less than the mean length of stay for those kept cool during surgery. The main effects
plot for surgical procedure contrasts the mean lengths of stay for the three surgical procedures.
296
Temperature
Warm
Cool
14
1 2 3
293, 394 9, 10, 11, 10 9,9, 10, 12
4,59596 5,6,8,5 13, 14, 14, 15
9
Days
Temperature 1 2 3 Row mean
Warm 3 10 10 7.67
Cool 5 6 14 8.33
Column mean 4 8 12
_I L
4
ANALYSIS OF VARIANCE (ANOVA)
Interaction Plot - Means for days
I
I I
1 2 3
Surgical Procedure
[CHAP. 12
T W
. I
’ 2
1
- - 2
Fig. 12-15
12.14 Suppose the data in the design described in Problem 12.13are as shown in Table 12.21.
Table 12.21
I Surgical Procedure I
The mean for each treatment (Surgical Procedure-Temperature combination), as well as
marginal means, is given in Table 12.22.
Table 12.22
I Surgical Procedure I
Explain in words the interaction plot and the main effects plot shown in Figs. 12-16 and
12-17.
CHAP. 121
Days
12
10
8
6
4
ANALYSIS OF VARIANCE (ANOVA)
MainEffectsPlot- MeansforDays
297
Fig. 12-16
I I I
Interaction Plot -Meansfor days
14
9
Days
4
I
1
/
/
f
/
/
/
/
/
/
'
/
I
2
I
3
ten^p
' 1
' 2
1
-- 2
-
Surgical Procedure
Fig. 12-17
Ans. The interaction plot in Fig. 12-17 illustrates a totally different situation than the interaction plot in
Fig. 12-15. The interaction plot in Fig. 12-17 indicates that keeping the patient warm during
surgery shortens the length of hospital stay for patients undergoing surgical procedures 1 and 3,
but lengthens the stay for those undergoing procedure 2. The response lines are not parallel and we
say there is interaction present. This was not true in Fig. 12-15. When interaction is present, the
main effects are more difficult to explain.
298 ANALYSIS OF VARIANCE (ANOVA) [CHAP. 12
Source
Procedure
Temperature
Interaction
Error
Total
SUM OF SQUARES, MEAN SQUARES, AND DEGREES OF FREEDOM
FOR A TWO-WAY ANOVA
df ss MS F Statistic
2 SSA MSA F
A
1 SSB MSB F
B
2 SSAB MSAB FAB
18 SSE MSE
23 SST
12.15 Refer to problem 12.13. By initially considering the two-way design as a one-way design,
give the degrees of freedom for total, treatments, and error. Then, break the degrees of
freedom for treatments down into degrees of freedom for temperature, surgical procedure, and
temperature-surgical procedure interaction.
Ans. As a one-way design, the total degrees of freedom is n - 1 = 24 - 1 = 23. The degrees of freedom
for treatments is ab - 1 = 2 x 3 - 1 = 5 . The degrees of freedom for error is n - ab = 24 - 6 = 18.
The 5 degrees of freedom for treatments is partitioned into a - 1 = 2 - 1 = 1 degree of freedom for
temperature, b - 1 = 3 - 1 = 2 degrees of freedom for surgical procedures, and (a - I)(b - 1) =
1 x 2 = 2 degrees of freedom for interaction.
12.16 Sixty plots were used in an agricultural experiment. Four varieties of wheat and five different
levels of fertilizer were used in a 4 x 5 factorial experiment. Each variety of wheat and each
fertilizer level combination was used on three randomly chosen plots. List the 20 treatments,
give the complete breakdown for the total degrees of freedom, and express the total sum of
squares as a sum of four separate sums of squares.
Ans. One treatment would be the first variety of wheat combined with the first level of fertilizer,
represented as (V 1,Fl). The other 19 treatments are: (V1, F2), (V1,F3), (V1,F4), (V1, F5), (V2,
Fl), (V2, F2), (V2, F3), (V2, F4), (V2, F5),(V3, Fl), (V3, F2), (V3, F3), W3, F4), (V3, F5), W4,
FI), (V4, F2), (V4, F3), (V4, F4), and (V4, F5).
The total degrees of freedom is n - 1 = 60 - 1 = 59. Let factor A be variety of wheat and let factor
B be fertilizer level. The degrees of freedom for A is a - 1 = 4 - 1 = 3, the degrees of freedom for
B is b - 1 = 5 - 1 = 4, the degrees of freedom for interaction is (a - I ) x (b - 1) = 3 x 4 = 12. The
degrees of freedom for error is n - ab = 60 - 20 = 40. Note that 59 = 3 +4 + 12 +40.
SST = SSA + SSB + SSAB + SSE.
BUILDING TWO-WAY ANOVA TABLES
12.17 Give the general structure for the two-way ANOVA table for the data in Table 12.19.
Ans. The results are given in Table 12.23.
12.18 Give the general structure for the two-way ANOVA table for the experiment described in
problem 12.16.
Ans. The results are given in Table 12.24.
CHAP. 121
Source Df ss MS
Variety 3 SSA MSA
Fertilizer level 4 SSB MSB
Interaction 12 SSAB MSAB
Error 40 SSE MSE
Total 59 SST
ANALYSIS OF VARIANCE (ANOVA)
F Statistic
F
A
FB
FAB
299
SAMPLING DISTRIBUTIONS FOR THE TWO-WAY ANOVA
12.19 Find the critical values for FA,Fg, and FAB
in problem 12.17.Use a = .05.
Ans. Factor A is the surgical procedure and factor B is the temperature of the patients' environment.
The test statistic F
A has an F distribution with dfi = 2 and df2 = 18. The critical value for FA is
3.55. The test statistic F
B has an F distribution with dfl = 1 and df2= 18. The critical value for FB
is 4.41. The test statistic F
A
B has an F distribution with df, = 2 and df2 = 18 and the critical value
for FAB is 3.55.
12.20 Find the critical values for FA,Fg, and FAB in problem 12.18. Use a= .01.
Ans. Factor A is the variety of wheat and factor B is the level of fertilizer. The test statistic F
A has an F
distribution with dfl = 3 and df2= 40. The critical value for F
A is 4.31. The test statistic FBhas an F
distribution with dfl = 4 and dfz = 40. The critical value for F
B is 3.83. The test statistic FAB has an
F distribution with dfl = 12 and dfi = 40 and the critical value for FAB is 2.66
TESTING HYPOTHESIS CONCERNING MAIN EFFECTS AND INTERACTION
12.21 The Minitab output for the data given in Table 12.19 is shown below. After reviewing
Problems 12.13, 12.15, 12.17, and 12.19, as well as the Minitab output, test for main effects
as well as interaction at a = .05.
Analysis of Variance for days
Source DF ss MS F P
Temp 1 66.667 66.667 6 0 . 0 0 0.000
Surgproc 2 256.000 128.000 115.20 0.000
Temp*Surgproc 2 5.333 2.667 2.40 0.119
Error 1 8 2 0 . 0 0 0 1.111
Total 23 348.000
Ans. The Minitab output confirms the results discussed in problems 12.13, 12.15, 12.17, and 12.19.The
p value for interaction, 0.119, is greater than a and interaction is not significant. This makes the
interpretation of main effects easier. The p values for temperature and surgical procedure are both
0.000.The mean length of hospital stay in days differs for the three surgical procedures and for the
two operation temperatures at which the patients are kept.
12.22 The Minitab output for the data given in Table 12.21 is shown below. After reviewing
Problem 12.14, as well as the Minitab output, test for main effects as well as interaction at
a = .05.
300 ANALYSIS OF VARIANCE (ANOVA) [CHAP. 12
Analysis of Variance for days
Source DF ss MS F P
Temp 1 2.667 2.667 2.40 0.139
Surgproc 2 256.000 128.000 115.20 0.000
Temp*Surgproc 2 69.333 34.667 31.20 0 . 0 0 0
Error 1 8 20.000 1.111
Total 23 348.000
Ans. Since the interaction is significant, i. e., the p value = 0.0oO < a,we explain the nature of the
interaction rather than make broad generalizations about the main effects. In Fig. 12-17, the solid
line segment corresponds to the patients who were kept warm during surgery and the line segment
made up of dashes corresponds to patients who were kept cool during surgery. The interaction plot
suggests that the hospital stay is shorter if the patient is kept warm during surgery for procedures 1
and 3. However, it appears that for procedure 2, keeping the patient warm during surgery increases
the hospital stay.
Supplementary Problems
F DISTRIBUTION
12.23 Fill in the following blanks with the appropriate distribution. Choose from the words standard normal,
student t, Chi-square, and F.
(4The -
- distribution is symmetricalabout zero and has standard deviation equal to one.
(6) The
(c) The
(6)The shapeofthe
distribution is skewed to the right. The shape of the distribution curve is
distribution is symmetrical about zero. The shape of the distribution curve is
determined by the number of degrees of freedom.
determined by the number of degrees of freedom.
distribution is determined by two separate degrees of freedom.
Ans. (a)standard normal (6)Chi-square (c) student t (d)F
F TABLE
12.24 Find the following using the F distribution in Appendix 5.
(a)F.Ol(2, 8) (6)F.05(2,8)
Ans. (a) 8.65 (6)4.46
12.25 Find the following using the F distribution in Appendix 5.
(a)F.W@,2) (6)F.95t8,2)
Ans. (a)0.1156 (6)0.2242
12.26 The random variable Frabo has an F distribution with dfl = 5 and df2= 5. Find two values a and b such
that P(a < Fmbo
< b) = .90.
Ans. a = 0.1980 and b = 5.05
CHAP. 121 ANALYSIS OF VARIANCE (ANOVA)
IRS
40
45
5s
50
45
45
50
so
40
60
301
Garbage collection Long distance service
55 60
60 65
65 70
so 70
55 70
55 7s
60 7s
70 70
65 80
60 80
LOGIC BEHIND A ONE-WAY ANOVA
12.27 Thirty individuals were randomly divided into 3 groups of 10 each. Each member of one group
completed a questionnaire concerning the Internal Revenue Service (IRS). The score on the
questionnaire is called the Customer Satisfaction Index (CSI). The higher the score, the greater the
satisfaction. A second set of CSI scores were obtained for garbage collection from the second group,
and a third set of scores were obtained for long distance telephone service, The scores are given in Table
12.25. A Minitab boxplot for the data in Table 12.25 is shown in Fig. 12-18. Describe what the boxplot
suggests concerning the mean CSI scores for the IRS, garbage collection, and long distance phone
service.
mean = 48.0 I mean = 59.5 I mean = 71.5
- - - - - - -
* ------- I------ Long
Distance
Service
-------
Fig. 12-18
Ans. Even though the whiskers overlap for the three different groups, the boxes that contain the middle
50%of the data, do not overlap for the three groups. Note that the plus signs are above the median
values and are fairly spread apart. The data suggest that the mean CSI scores for the three
populations are probably different. An analysis of variance may be performed to test the null
hypothesis of equal population means.
12.28 Suppose the survey described in problem 12.27resulted in the data given in Table 12.26. Describe what
the boxplot in Fig. 12-19 suggests concerning the mean CSI scores for the IRS, garbage collection, and
long distance phone service.
302 ANALYSIS OF VARIANCE (ANOVA)
IRS
15
45
30
70
45
45
30
70
40
90
mean = 48
Table 12.26
Garbage collection
20
50
65
50
25
85
60
90
65
85
mean = 59.5
~~
Long distance service
35
65
70
80
60
75
75
80
80
95
mean = 7 1.5
[CHAP. 12
*
Fig. 12-19
Ans. The boxplots for the three sets of CSI scores shown in Fig. 12-19 exhibit a considerable amount of
overlap. This is the type of results we might expect if there is no difference in the three population
means. The boxplots indicate that the assumption that the three populations have normal
distributions with equal variances should be checked.
SUM OF SQUARES,MEAN SQUARES,AND DEGREESOF FREEDOM FOR A ONE-WAY ANOVA
12.29 For the data in Table 12.25, find SST, SSTR, and SSE using both the defining and the shortcut
formulas. After finding the sum of squares, find MSTR and MSE. Also give the degrees of freedom for
total, treatments, and error.
Am. SSTR = 2761.7 SSE = 1035.0 SST = 3796.7 MSTR = 1380.8 MSE = 38.3
Degrees of freedom: for total = 29, for treatments= 2, and for error = 27.
12.30 For the data in Table 12.26, find SST, SSTR, and SSE using both the defining and the shortcut
formulas. After finding the sum of squares, find MSTR and MSE. Also give the degrees of freedom for
total, treatments, and error.
Ans. SSTR = 2761.7 SSE =12085 SST = 14847 MSTR = 1380.8 MSE = 448
Degrees of freedom: for total = 29, for treatments = 2, and for error = 27.
CHAP. 121 ANALYSIS OF VARIANCE (ANOVA)
South
15.5
13.0
19.5
18.8
17.5
19.0
16.6
14.9
mean = 16.85
303
North West
17.8 15.5
20.0 17.0
21.3 22.0
24.3 18.4
18.8 19.9
19.0 21.5
19.5
20.5
22.0
23.4
mean = 20.66 mean= 19.05
SAMPLINGDISTRIBUTIONFOR THE ONE-WAYANOVA TESTSTATISTIC
12.31 Refer to problems 12.27 and 12.29. If the populations of CSI responses concerning the IRS, garbage
collection, and long distance phone service are normally distributed, have equal variances and equal
MSTR
MSE
means, what is the distribution of F = -
? What is the a = .05 critical value for testing the null
hypothesis that the population means are equal? Give the computed value of the test statistic and your
conclusion.
MSTR
MSE
Ans. The test statistic F = -has an F distribution with dfl = 2 and dfi = 27. The critical value,
F,o5(2, 27), is between 3.39 and 3.32. The computed value of the test statistic is F* = 36.05. The
null hypothesis is rejected.
12.32 Refer to problems 12.28 and 12.30. If the populations of CSI responses concerning the IRS, garbage
collection, and long distance phone service are normally distributed, have equal variances and equal
means, what is the distribution of F = -
? What is the a = .05 critical value for testing the null
hypothesis that the population means are equal? Give the computed value of the test statistic and your
conclusion.
MSTR
MSE
MSTR
MSE
Ans. The test statistic F = -has an F distribution with dfl = 2 and df2 = 27. The critical value,
F.o5(2,27), is between 3.39 and 3.32. The computed value of the test statistic is F* = 3.08. The null
hypothesis is not rejected.
BUILDING ONE-WAY ANOVA TABLESAND TESTING THE EQUALITY OF MEANS
12.33 Table 12.27 gives the cost in thousands of dollars for randomly selected weddings with approximately
100 guests for three geographical regions in the U.S. Build a one-way ANOVA and test the null
hypothesis that the mean costs for such weddings do not differ for the three regions. Test at a 5% level
of significance.
Ans. The Minitab output for the above data is as follows:
Analysis of Variance
Source DF ss MS F P
Factor 2 64.55 3 2 . 2 7 6 . 2 6 0.007
Error 2 1 1 0 8 . 2 0 5.15
Total 23 172.75
304
I Urban
ANALYSIS OF VARIANCE (ANOVA)
PhysicianAssistant Nurse Practitioner
4 5 51,52,48,54 42,44,47,49,50
[CHAP. 12
The p value is less than the preset alpha. The mean costs differ for the three regions. Such
weddings appear to cost less in the South than in the North or West on the average.
12.34 A sociological study compared the time spent watching TV per week for children ages 2 to 11 for the
years 1994, 1995, 1996, and 1997. The times in hours per week are shown in Table 12.28. Is there a
difference in the means for the four years? Test at a 5% level of significance.
1994
22
24
25
18
19
25
17
24
15
25
25
25
25
16
20
20
17
22
25
16
mean = 21.3
Table
1995
21
19
25
20
18
15
19
21
19
20
16
25
22
24
20
20
18
18
18
20
mean= 19.9
12.28
1996
20
24
21
18
15
19
24
24
18
17
23
23
20
16
20
16
15
19
24
18
mean = 19.7
~
1997
25
20
18
20
15
21
20
15
23
21
20
15
19
25
25
23
16
22
21
15
mean = 20.0
AM. The Minitaboutputfor the one-way ANOVA is shown below.
Analysis of Variance
Source DF ss M S F P
F a c t o r 3 30.1 1 0 . 0
E r r o r 76 802.7 10.6
Total 79 832.8
0.95 0.421
The means seem to have decreased since 1994. However, the p value = .421 indicates that the
means are not different.
LOCIC BEHINDA TWO-WAY ANOVA
12.35 Table 12.29 gives the salaries in thousands of dollars for 20 individuals. Half are Nurse Practitioners
and half are Physician Assistants. In addition, they are classified as practicing in a rural or an urban
setting.
Table 12.29
I RWal [ 46,50,50,50,52 I 45,48,49,47,48 I
The interaction plots for the data given in Table 12.29 are given in Figs. 12-20and 12-21.Explain the
plots in words.
CHAP. 121
49-
Mean
48-
47 -
5
0
4
9
Mean
48
47
1
-- 2
-
----.
& 4 - - &
c # - c -
/I---
) - - / - -
I I
I
1
ANALYSIS OF VARIANCE (ANOVA)
InteractionPlot - Meansfor salary
l.kbdRl
. 1
8 2
- 1
-- 2
I
2
305
PNNP
Fig. 12-20
Interaction Plot - Means for salary
P A N
1
Ubatml
Fig. 12-21
Ans. The solid line in Fig. 12-20 corresponds to urban health care workers. The solid line shows that
urban Physician Assistants in the study had a greater mean salary than urban Nurse Practitioners.
The dashed line correspondsto rural health care workers. This line shows that the mean salary for
rural Physician Assistants in the study was also greater than the mean for Nurse Practitioners. The
solid line in Fig. 12-21 corresponds to Physician Assistants and the dashed line corresponds to
Nurse Practitioners. This Figure shows that the mean for urban Physician Assistants exceeds the
mean for rural Physician Assistants. The dashed line shows the opposite to be true for Nurse
Practitioners.
306 ANALYSIS OF VARIANCE (ANOVA) [CHAP. 12
Low Moisture
High Moisture
1
2
.
3
6 Table 12.30 gives the results for a 2 x 2 factorial design. The yield in kilograms of okra per plot is
given for 28 different plots. Each Fertilizer-Moisture combination was applied to 7 different plots and
the total yield was recorded for each plot. The interaction plot for the data in Table 12.30 is shown in
Fig. 12-22.Explain, in words, the nature of the interaction.
Low Fertilizer High Fertilizer
8, 8, 8,9, 10,6, 10
1, 1,292, 2,393
2,3,4,5,6,4,4
9,9, 11, 8, 8,9, 8
Table 12.30
Interaction Plot = Means for yield
Moisture
. 1
' 2
1
-- 2
-
I
1
I
Fertilizer 2
Fig. 12-22
Ans. The solid line corresponds to the low level of moisture. This line indicates that at the low level of
moisture, the mean yield is increased when the level of fertilizer is increased. The dashed line
corresponds to the high level of moisture. This line indicates that at the high level of moisture, the
yield is decreased when the level of fertilizer is increased. Whether or not this interaction is
significant is determined by performing an analysis of variance.
SUM OF SQUARES,MEAN SQUARES,AND DEGREES OF FREEDOM FOR A TWO-WAYANOVA
12.37
12.38
In problem 12.35, let factor A be the type of health professional and let the two levels of factor A be
Physician Assistant and Nurse Practitioner. Let factor B be the population setting and let the two levels
of factor B be urban and rural. Give the degrees of freedom for the following sources of variation: total,
A, B, AB, and error.
Ans. The degrees of freedom for the sources are: total df = 19,A df = 1, B df = 1,AB df = 1,error df =
16.
In problem 12.36, let factor A be fertilizer level and let the two levels of factor A be low and high. Let
factor B be the moisture level and let the two levels of factor B be low and high. Give the degrees of
freedom for the following sources of variation: total, A, B, AB, and error.
CHAP. 121 ANALYSIS OF VARIANCE (ANOVA) 307
Source df
A 1
B 1
AB 1
Error 16
Total 19
Ans. The degrees of freedom for the sources are: total df = 27, A df = 1, B df = 1, AB df = 1,error df =
24.
ss MS F Statistic
SSA MSA FA
SSB MSB F
B
SSAB MSAB F
A
B
SSE MSE
SST
BUILDINGTWO-WAY ANOVA TABLES
Source Df
A 1
B 1
AB 1
Error 24
Tota1 27
12.39
12.40
ss MS F Statistic
SSA MSA F
A
SSB MSB FB
SSAB MSAB FAB
SSE MSE
SST
Give the general structure for the two-way ANOVA table for the data in Table 12.29. Name the factors
and levels as given in problem 12.37.
Ans. The results are given in Table 12.31.
SAMPLINGDISTRIBUTIONSFOR THETWO-WAY ANOVA
12.41 Give the critical values for the test statistics FA, FB, and FAB in problem 12.39. Use a = .05.
Ans. The critical value for all three is 4.49.
12.42 Give the critical values for the test statistics FA, FB,and FABin problem 12.40.Use a = .05.
Ans. The critical value for all three is 4.26.
TESTING HYPOTHESIS CONCERNINGMAIN EFFECTS AND INTERACTION
12.43 The Minitab output for the data given in Table 12.29is shown below. After reviewing problems 12.35,
12.37, 12.39, and 12.41, as well as the below Minitab output, test for main effects as well as interaction
at a = .05.
Analysis of Variance for salary
Source DF ss MS F P
Urban/ru 1 0.450 0.450 0 . 0 6 0.812
PA/NP 1 42.050 42.050 5.44 0.033
Urban/ru*PA/NP 1 2 . 4 5 0 2 . 4 5 0 0 . 3 2 0 . 5 8 1
Error 1 6 1 2 3 . 6 0 0 7.725
Total 1 9 1 6 8 . 5 5 0
308 ANALYSIS OF VARIANCE (ANOVA) [CHAP. 12
Ans. The p values indicate that at the a = .05 level of significance, the interaction is not significant.
There is no significant difference in the mean salaries between urban and rural settings. However,
the mean salary for Physician Assistants exceeds the mean salary for Nurse Practitioners.
12.44 The Minitab output for the data given in Table 12.30is shown below. After reviewing Problems 12.36,
12.38, 12.40, and 12.42, as well as the below Minitab output, test for main effects as well as interaction
at a = .05.
Analysis of Variance for yield
Source DF ss MS F P
Moisture 1 4 . 3 2 1 4 . 3 2 1 3 . 1 8 0 . 0 8 7
Fertlzer 1 1 0 - 3 2 1 1 0 . 3 2 1 7 . 6 1 0 . 0 1 1
Moisture*Fertlzer 1 2 2 2 . 8 9 3 2 2 2 . 8 9 3 1 6 4 . 2 4 0 . 0 0 0
Error 2 4 3 2 . 5 7 1 1 . 3 5 7
Total 2 7 2 7 0 . 1 0 7
Am. Because of the significant moisture-fertilizer interaction, the main effects must be interpreted in
the presence of this interaction.
Chapter 13
X
2
4
8
10
16
20
24
30
Regression and Correlation
Y
12
13
15
16
19
21
23
26
STRAIGHTLINES
2 5 -
20-
W
c
4
f
15 -
10 --3
Suppose eight individuals, employed at large companies were interviewed, and the number of years
of service, x, and the number of annual paid days off, y, were determined for each. The results are
shown in Table 13.1.
0
8
8
8
I 1 1
Table 13.1
A scatterplot for these data is shown in Fig. 13-1.
8
8
Fig. 13-1
The data plotted in Fig. 13-1have a very special property. The points fall perfectly on a straight
line. The equation of the line is y = 11.O +0 .5 ~ .
The number 11.O is called the y intercept.This is the
value of y when x = 0. The number 0.5 is called the slope of the line. The equation may also be
expressed as y = 0 . 5 ~
+ 11.O. When straight lines are studied in algebra, the slope-intercept form of a
line is given as y = mx + b. When lines are discussed in statistics, the equation is often written as
310 REGRESSION AND CORRELATION [CHAP. 13
y = PO+plx. In this form, POis the y intercept and
+0 . 5 ~
and the points from Table 13.1on the same graph.
is the slope.Figure 13-2shows the line y = 11.O
25
10
Fig. 13-2
If the relationship between years of service and annual paid days off for all employees of large
companies were described by the equation y = 11.O + OSx, then we could predict perfectly the
number of annual paid days off if we knew the number of years of service. Assuming the equation
described the relationship perfectly, a person with x = 14 years of service would receive y = 11 +
.5 x 14 = 18annual paid days off.
EXAMPLE 13.1 The line y = 1.7- 4 . 5 ~
has y intercept 1.7and slope -4.5. When x = 2, the corresponding y
value is y = 1.7 - 4.5(2) = -7.3. That is, the point (2, -7.3) falls on the line whose equation is y = 1.7 - 4.5~.
LINEAR REGRESSION MODEL
Table 13.2 contains a more realistic data set than that shown in Table 13.1. In Table 13.2, x
represents the number of years of service, and y represents the annual paid days off. These data were
obtained from 20 individuals at a large company. Figure 13-3 is a plot of the data. These points
clearly do not fall along a straight line. However, the data do indicate a linear trend. That is, a line
could be fit to the points such that the variation of the points about the line is small. A line is shown
which provides a good fit to the data.
A linear regression model assumes that some dependent or response variable, represented by y,
is related to an independent variable, represented by x, by the relationship shown in formula (13.I).
y = p o + p * x + e (13.I)
The error term, e, is a normally distributed random variable with mean equal to 0 and standard
deviation equal to 0.
Each value of x determines a population of y values.
CHAP. 131
X
16
20
20
20
24
24
24
30
30
30
REGRESSION AND CORRELATION
Y
20
20
22
21
23
24
22
27
25
26
311
L
Basic Properties of the Linear Regression Model
1. The regression model is y = P O + plx + e, where P O is the y intercept of the line
given by PO+ plx, PI is the slope of the line given by P O + Plx, and e is the error
or deviation of the actual y value from the line given by PO+ plx.
2. The error term e is a random variable with a mean of 0, i.e., E(e) = 0.
3. The expected value of y is E(y) = PO+ plx.
4. The variance of y equals dand is the same for all values of x.
5. The values of e are independent.
6. The error term e is normally distributed.
Tab1
I I
X Y
2 10
2 14
2 12
4 11
4 15
8 14
8 16
10 14
10 18
16 18
I X I Y
10
14
12
11
15
14
16
14
18
18
25
20
Days off
15
10
I
0
0
0
0
I
10
I
2
0
1
30
Years
Fig. 13-3
The linear regression model, given in formula (13.1), describes the relationship between y and x
for some population. The intercept, PO,and the slope, PI, are unknown parameters and must be
estimated from sample data. The next section discusses the technique used to estimate P O and P I .
Table 13.3contains a summary of the basic properties of the linear regression model.
312 REGRESSION AND CORRELATION [CHAP. 13
EXAMPLE 13.2 Consider the linear regression model y = 2.5 + 3x + e, where the normal random variable e
has mean equal to 0 and standard deviation equal to 1.2. The mean response for y when x = 2 is E(y) = 2.5 +
3(2) = 8.5 and the mean response for y when x = 4 is E(y) = 2.5 + 3(4) = 14.5. In this example we are assuming
that we know the values for POand PI.However, this is not usually the case. We usually have to estimate these
two parameters from sample data taken from the population.
LEAST SQUARES LINE
The line shown in Fig. 13-3is called the least-squares line. The least-squares line is determined
such that the sum of the squares of the deviations of the data points from the line is a minimum. For
point (xi, yi), the deviation from the fitted line, whose equation is represented as 9 = bo + blx, is
given by di = yi - (bo + blxi). The estimates bo and bl are determined so that the sum of squares of
deviations about the line is minimized. The sum of squares of deviations is given by
(13.2)
The estimated y intercept, bo, is also represented by the symbols a and a. and the estimated slope is
represented by b and b,.We shall use bo and bl as the symbols for the estimates of POand .Using
calculus, it may be shown that the estimate for PI is given by
(13.3)
where S,, is given by
sx,= zx2--
(Cx)’ (13.4)
n
and Sx, is given by
(13.5)
The estimate for POis given by
EXAMPLE 13.3 The data from Table 13.2 is reproduced below. Recall that x represents the number of years
of service and y represents the annual paid days off.
r
2
4
4
8
8
10
10
16
Y
10
14
12
11
15
14
16
14
18
18
X
16
20
20
20
24
24
24
30
30
30
Y
20
20
22
21
23
24
22
27
25
26
CHAP. 131 REGRESSION AND CORRELATION 313
For these data, Cxy = 2 x 10+2 x 14 +2 x 12 + - - +30 x 27 +30 x 25 +30 x 26 = 6600.
Xx=2+2+2+...+30+30+30=304,andX= % = 15.2.
Cy = 10+ 14 + 12 + - . - +27 +25 +26 = 372, and 7 = % = 18.6.
Cx2=4+4 + 4 + * * * +900 +900 +900 = 6512, Zy2= 100+ 196 + 144+ * * * +729 + 625 +676 = 7426.
S,, =Ex2- @ = 6512- 4620.8 = 1891.2, S,, = Zxy - (Xx)('y) = 6600 - 5654.4 =945.6.
n n
Sxy 945.6 -
b l = - = - =0.5andbo= y - b l % =18.6-0.5x 15.2=11.0.
sx, 1891.2
The equation of the line is 9 = bo + blx = 11.0+ 0 . 5 ~ .The fitted line is referred to by several
different names. The fitted line is called the line o
f best fit, the least-squares line, the estimated
regression line, and theprediction line. The computations for bo and bl are rarely performed by hand
in practice. Computer software is normally used to find the equation. This is illustrated in Example
13.4.
EXAMPLE 13.4 The Minitab procedure for computing the equation of the regression line is shown in Fig.
13-4. The command Regress 'Daysow 1 'Years' requires that the column containing the dependent variable
values be named after the word Regress, followed by the number of independent variables, followed by the
column containing the independent variable values. The regression equation is printed out as the first line of
output in Fig. 13-4. The remaining output will be discussed in later sections.
MTB > Regress 'Daysoff' 1 'Years';
SUBC> Constant.
Regression Analysis
The regression equation is
Daysoff = 1 1 . 0 + 0.500 Years
Predictor Coef St Dev T P
Constant 1 1 . 0 0 0 0 0.5703 1 9 . 2 9 0.000
Years 0 . 5 0 0 0 0.0316 1 5 . 8 2 0.000
s = 1.374 R-Sq = 93.3% R-Sq(adj) = 9 2 . 9 %
Analysis of Variance
Source DF ss MS F P
Error 1 8 34.00 1 . 8 9
Total 1 9 506.80
Regression 1 472.00 472.00 2 5 0 . 3 1 0.000
Fig. 13-4
The amount of computation saved by this procedure is hard to appreciate. Before the availability of computer
software, these computations were done by use of a hand held calculator .
ERROR SUM OF SQUARES
The error sum o
f squares, denoted by SSE, is given by formula (13.7):
SSE =Z(yi - Yi)*
314 REGRESSION AND CORRELATION [CHAP. 13
The differences, (yi - yi), are called residuals. A residual measures the deviation of an observed
data point from the estimated regression line. If the estimated regression line fits the data points
perfectly, as is the case for the data in Table 13.1,then SSE = 0. The more the variability of the data
points away from the line, the larger the value for SSE. The computation of SSE is illustrated in
Example 13.5.
EXAMPLE 13.5 The data in Table 13.2 gives 20 observations for the number of years of service, x, and the
number of annual paid days off, y. The equation of the estimated regression line is found to be 9 = bo + blx =
11.0+ 0 . 5 ~
in Example 13.3. The computation of the residuals and SSE is shown in Table 13.4. Note that the
sum of the residuals, Z(yi - ii),
is equal to zero. The error sum of squares, SSE, is equal to 34. The error sum
of squares is shown in Fig. 13-4 as part of the Minitab output. It is shown under the Analysis of Variance
portion of the output and is located at the intersection of the Error row and the SS column.
Employee
1
2
3
4
5
6
7
8
9
10
I 1
12
13
14
15
16
17
18
19
20
Xi
2
2
2
4
4
8
8
10
10
16
16
20
20
20
24
24
24
30
30
30
Ta
Yi
10
14
12
11
15
14
16
14
18
18
20
20
22
21
23
24
22
27
25
26
le 13.4
9 i = 1 1 +0.5xi
12
12
12
13
13
15
15
16
16
19
19
21
21
21
23
23
23
26
26
26
-2
2
0
-2
2
-I
1
-2
2
- 1
I
-1
1
0
0
1
-1
1
-I
0
C(y1 -f i ) = 0
4
4
0
4
4
1
1
4
4
1
1
1
1
0
0
1
1
1
1
0
C(y, - iJ2
= 34
A convenient formula for computing SSE that is equivalent to formula (13.7) is given in formula
(13.8).The computation of S,, is the same as that for S
,
, using the y values instead of the x values.
s:,
SSE = sYy--
sxx
(13.8)
EXAMPLE 13.6 To illustrate the computation of SSE in Example 13.5using the computation formula, recall
that in Example 13.3,we found that S,, = 1891.2and S,, = 945.6. The computation of S,, is now illustrated:
945.6* =34
S,, = Cy2- oz= 7426 - -
3722 = 506.8 and SSE = sYy--
sty= 506.8 - -
n 20 s x x 1891.2
CHAP. 131 REGRESSION AND CORRELATION 315
STANDARDDEVIATION OF ERRORS
The standard deviation of the error term in the model y = +plx +e is represented by <T and the
variance of the error term is 0'. The variance 0' is estimated by S2,where S2 is given by formula
(13.9):
SSE
s=-
- -
n-2
The square root of S2is called the standard deviation o
f errors, and is given by formula (13.10):
(13.9)
s =J"""
n-2
(13.10)
EXAMPLE 13.7 The standard deviation of errors for the data given in Table 13.2 is found by recalling that
n = 20 and in Example 13.6, we found that SSE = 34, The standard deviation of errors is:
The value for S is shown in the Minitab output given in Fig. 13-4.
TOTAL SUM OF SQUARES
Suppose we were requested to estimate the mean number of annual paid days off given by large
companies, but that we did not know the years of service, x, for each sampled employee. Our best
estimate would be the mean of the 20 values for y given in Table 13.2.The mean of these 20 values is
y = 18.6days per year. The accuracy of the estimate is related to the variation of the individual y
values about the mean. The sum of squares about the mean is called the total sum o
f squares, and is
given by
-
(13.I I )
EXAMPLE 13.8 By referring to Example 13.6, it is seen that the total sum of squares is equal to Syy.
Therefore, from Example 13.6, we see that SST = 506.8. The total sum of squares is given in the Minitab output
shown in Fig. 13-4.
REGRESSIONSUM OF SQUARES
When the regression line is not used in estimating the mean annual paid days off, the total sum of
squares measures the variation about the mean, 18.6,as discussed in the previous section. When the
regression line is used, there is still some unexplained variation about the regression line. This
unexplained variation is given by SSE. This implies that the regression line explains an amount of
variation equal to SST - SSE. This explained variation is called the regression sum of squares. The
regression sum of squares is given by formula (I3.12):
SSR = SST- SSE (13.12)
EXAMPLE 13.9 When the mean, 7 = 18.6days per year, is used to estimate the mean number of days off per
year, the variation of the y values about this mean is SST = 506.8.When the number of years of service is taken
316 REGRESSION AND CORRELATION [CHAP. 13
into account, the estimated regression line was found to be f = 11.0+ 0 . 5 ~ .
There is variation about this line
that is not explained by the estimated regression line. This unexplained variation is given by SSE = 34. This
implies that the regression line explained SST - SSE = 506.8 - 34 = 472.8 or 93.3% of the variation of the
values about the mean. The regression sum of squares is SSR = 472.8.
The regression sum of squares is directly computable by the formula
SSR = E(ii -7)’ (13.13)
EXAMPLE 13.10 Table 13.5 illustrates the computation of the regression sum of squares using formula
(13.13).Note that SSR = 472.80. In Examples 13.5 and 13.8, we found that SSE = 34.0 and SST = 506.8. We
see that SST = SSR + SSE. These three sum of squares is shown under the SS column of the analysis of
variance portion of Fig. 13-4.
Employee
I
2
3
4
5
6
7
8
9
10
I 1
12
13
14
15
16
17
18
19
20
Xi
2
2
2
4
4
8
8
10
10
16
16
20
20
20
24
24
24
30
30
30
Tab
Yi
10
14
12
I 1
15
14
16
14
18
18
20
20
22
21
23
24
22
27
25
26
COEFFICIENT OF DETERMINATION
The coeficient o
f determination is defined by
12
12
12
13
13
15
15
16
16
19
19
21
21
21
23
23
23
26
26
26
SSR
r2=-
SST
-6.6
-6.6
-6.6
-5.6
-5.6
-3.6
-3.6
-2.6
-2.6
0.4
0.4
2.4
2.4
2.4
4.4
4.4
4.4
7.4
7.4
7.4
sum = 0
(9,-9’
43.56
43.56
43.56
31.36
3I .36
12.96
12.96
6.76
6.76
0.16
0.16
5.76
5.76
5.76
19.36
19.36
19.36
54.76
54.76
54.76
sum = 472.80
(13.14)
When the data points fall perfectly on a straight line as in Fig. 13-1, SSE = 0 and therefore SST =
SSR. In this case, the coefficient of determination is equal to 1. When the estimated regression line
explains none of the variation in y, SSR = 0, and the coefficient of determination is equal to 0. When
the coefficient of determination is expressed as a percentage, it can be thought of as the percentage of
the total sum of squares that can be explained using the estimated regression equation.
EXAMPLE 13.11 For the data in Table 13.2 and discussed in the previous examples, the coefficient of
x 100%= -
472*8 x 100=93.3%.
2 SSR
determination is r = -
SST 506.8
CHAP. 131 REGRESSION AND CORRELATION
The coefficient of determination may be determined by the use of formula (13.15):
s:y
r2= -
s x x s y y
317
(13.15)
EXAMPLE 13.12 For the data in Table 13.2, it was found in Example 13.3that Sx, = 1891.2 and S,, = 945.6.
Also in Example 13.6 we found that Sy,= 506.8. Therefore, the coefficient of determination is found as follows:
r2= --
s& - 945*62 = .933or 93.3%
sXx
syy 1891.2x 506.8
The value for r2is also given in Fig. 13-4.
MEAN, STANDARD DEVIATION, AND SAMPLING DISTRIBUTIONOF THE
SLOPE OF THE ESTIMATEDREGRESSIONEQUATION
The slope of the estimated regression line, bl, is a statistic and has a sampling distribution. That
is, if several different surveys concerning the number of years of service, x, and the number of annual
paid days off, y, were conducted, and the estimated regression line found in each case, the values for
the slope and the y intercept would not all be equal. The estimated slope, bl, has a sampling
distribution which is normally distributed. The mean of bl is PI and the standard deviation of bl is
given in the following:
(13.16)
Since 0 is unknown, the standard deviation of errors is substituted for 0 to obtain the standard error
for bl. The standard error for b1 is given in formula (13.17):
(13.17)
S
-
s b l - J
s
,
,
EXAMPLE 13.13 For the data given in Table 13.2, the value for S,, was found to equal 1891.2 in Example
13.3 and in Example 13.7, the value for S was found to equal 1.374. The standard error for bl is therefore equal
= 0.0316. This value is given in the computer output in Fig. 13-4 at the intersection
S 1.374
to Sb,= -
J s X x = 4 m
of the Years row and the St Dev column. The standard error for bl is needed to set confidence intervals on and
test hypothesis concerning the slope of the population regression line.
INFERENCES CONCERNINGTHE SLOPEOF THE POPULATION
REGRESSION LINE
A (I - a)x 100% confidence interval for is obtained by using the distribution theory given in
the previous section. The confidence interval is given in formula (13.18), where ta/2 is obtained from
the Student t distribution using n - 2 degrees of freedom.
318 REGRESSION AND CORRELATION [CHAP. 13
. Steps for Testing the Value of the Slope of the Population Regression Line
Step 1: The null hypothesis is H
o
:p1 = plo and the alternative hypothesis is either lower,
upper, or two-tailed.
Step 2: Use the Student t distribution table and the level of significance a to determine
the rejection regions. The degrees of freedom is n - 2.
EXAMPLE 13.14 Consider the continuing Example concerning the number of annual paid days off as a
function of the number of years of service. Suppose we wish to determine a 95% confidence interval for pl.The
values for bl and s b , were obtained in Examples 13.3 and 13.13.They may also be obtained from the computer
printout in Fig. 13-4.The values are b, = 0.5 and Sb,= 0.0316.The value for t . 0 2 ~is obtained from the Student t
table using df = 20 - 2 = 18.The value is t.025 = 2.101. The 95% margin of error is k tansbl
= k 2.101 x 0.0316
=f0.066.The confidence interval extends from 0.5 - 0.066=0.434 to 0.5 +0.066= 0.566.
The steps for testing an hypothesis about the value of the slope of the population regression line
are given in Table 13.6.
b, -PI0
Step 3: The test statistic is computed as t* = -
.
'bl
I I
Step 4: State your conclusions. The null hypothesis is rejected if the computed value of
Ithe test statistic falls in the rejection region. Otherwise, the null hypothesis is not rejected.
EXAMPLE 13.15 One of the most often tested hypothesis concerning Pr is that it equals zero. This is
equivalent to testing the hypothesis that x does not determine y and that there is no significant relationship
between x and y. Suppose we wish to test Ho: PI= 0 vs. Ha:PI# 0 at significance level a = .05 using the data in
Table 13.2. Recall from Example 13.3 that bl = 0.5 and from Example 13.13 that sbl
= 0.0316. The value for
pl0in Table 13.6 is 0. The critical values are determined by finding the t value with .025 area in the right tail
and df = 18.The values are f2.101. The computed value of the test statistic is found as follows:
Since this value falls far beyond the value 2.101, the null hypothesis is rejected. Note that the value 15.82 is
found in the printout in Fig. 13-4 as the t value corresponding to the predictor years and the corresponding p
value is given as O.OO0. Most researchers would use the information given in the printout to test this hypothesis.
ESTIMATION AND PREDICTION IN LINEAR REGRESSION
Suppose we are interested in estimating the mean number of annual paid days off for all
employees of large companies who have 20 years of service at such companies. Recalling basic
property 3 in Table 13.3,we have E(y) = PO+ plx. The mean number of annual paid days off would
be equal to P O + 2OP1. Since we do not know POand p1 ,we would use b
o and bl to estimate P O and
P1.Thepoint estimate for the mean would be b
o + 20bI. The standard error of this estimate is needed
to set a confidence interval on the mean or to test an hypothesis about this mean. A general
expression for a confidence interval when predicting the mean response will now be given.
A 100(1 - a)% confidence interval for the mean response E(y) = PO+ Plxo is given by formula
(13.19):
CHAP. 131 REGRESSION AND CORRELATION 319
EXAMPLE 13.16 The data in Table 13.2 will be used to obtain a 95% confidence interval for the mean
number of annual paid days off for employees having 20 years of service. In Example 13.3, the estimated
regression equation was found to be 9 = bo + b,x = 11.0 + 0 . 5 ~ .
The point estimate of the mean is 9 = 11.0 +
OS(20) = 2I days. The standard error of this point estimated is found by recalling from Example 13.3 that X =
15.2, S,, = 1891.2, and n = 20. In Example 13.7, we found that S = 1.374. Also, xo = 20. The standard error is
equal to
For df = n - 2 = 18degrees of freedom, the student t value is t.02s = 2.101.The 95% margin of error is
= 2.101 x 0.343= 0.72
s
x
x
The 95% confidence interval for the mean number of annual days off for employees having 20 years of service
extends from 21 - 0.72 = 20.28 days to 21 +0.72 = 21.72days.
Suppose we wished to predict the number of annual days off for a single individual having 20
years of service rather than the mean number for all employees having 20 years of service. The
prediction is still determined by using the estimated regression line as in the case of estimating the
mean. However, the standard error is larger because a single observation is more uncertain than the
mean of the population distribution.
A 100(1 - a)% prediction interval when predicting a single observation y at x = xo given by
formula (23.20):
1 (xO-X)’
bo +blxo k ta,,S 1+-+
i
. s x x
(13.20)
EXAMPLE 13.17 A 95% prediction interval for the number of annual days off for an individual having 20
years of service is found as follows. The point estimate is the same as that found when estimating the mean in
Example 13.16,21 years. The standard error associated with the prediction is found as follows:
t = 1
.
3
7
4
,
/
- = 1.42days
The margin of error associated with this prediction is 2.101 x 1.42= 2.98. The prediction interval extends from
21 - 2.98 = 18.02days to 21 +2.98 = 23.98 days.
EXAMPLE 13.18 In the Minitab output shown in Fig. 13-4, if the subcommand Predict 20; is added to the
commands, the following additional output is obtained.
F i t StDev F i t 9 5 . 0 % C I 95.0% P I
21.000 0.343 ( 2 0 . 2 8 0 , 2 1 . 7 2 0 ) ( 1 8 . 0 2 3 , 2 3 . 9 7 7 )
The fit value is the value obtained by substituting 20 into the estimated regression equation. The StDev Fit
value is the standard error associated with estimating the mean. In addition, the 95% confidence interval and the
95% prediction interval are also given.
LINEAR CORRELATION COEFFICIENT
The linear correlation coeficient, also known as simply the correlation coefirient, is a measure
of the strength of the linear association between two variables. The sample correlation coefficient is
320 REGRESSION AND CORRELATION [CHAP. 13
X
2
given by formula (13.21). The correlation coefficient is also sometimes referred to as the Pearsun
correlation coeflcient.
SXY (13.21)
Y
12
EXAMPLE 13.19 The correlation coefficient for the data in Table 13.2 will be determined. In Example 13.3
we found that S,, = 945.6 and S,, = 1891.2. In Example 13.6, we found that S,, = 506.8. The correlation
coefficient is therefore found as follows.
= 0.966.
S
X
Y = 945.6
= ,/%41891.2x 506.8
The Minitab computation for r is as shown below. The x values are put into column 1 and the y values are put
into column 2 and the command corr cl c2 is used to obtain the correlation.
MTB > corr cl c2
Correlations(Pearson)
Correlation of Years and Daysoff = 0.966
The basic properties of the sample correlation coefficient are given in Table 13.7.
Table 13.7
I
I1. The value of r is always between -1 and +1, i.e., -1 I
r I
+l.
2. The magnitude of r indicates the strength of the linear relationship, and the sign of r
indicates whether the relation is direct or inverse. If r is positive, then y tends to
increase linearly as x increases, with the tendency being greater the closer that r is to
1. If r is negative, then y tends to decrease linearly as x increases, with the tendency
being greater the closer that r is to -1. If all the points on the scatter diagram lie
perfectly on a straight line with a positive slope, the r = +l. If all the points on the
scatter diagram lie perfectly on a straight line with a negative slope, then r = -1.
I
3. A value of r close to -1 or +1 represents a strong linear relationship.
I4. A value of r close to 0 represents a weak linear relationship.
If the formula for r is applied to every pair of (x, y) values in the population, we obtain the population
correlation coeficient. The Greek letter p represents the population correlation coefficient.
EXAMPLE 13.20 The data given in Table 13.1 and the corresponding scatter plot shown in Fig, 13-1 are
reproduced below and on the next page.
4
8
10
16
20
24
13
15
16
19
21
23
I 30 I 26
CHAP. 131
25 -
20 -
Days off
15 -
10 1
REGRESSION AND CORRELATION
0
I I I
321
Step 1: The null hypothesis is H
o
:p = 0 and the alternativehypothesis is either
lower, upper, or two tailed.
Step 2: Use the Student t distribution table and the level of significance a to
, determinethe rejection regions. The degrees of freedom is n - 2.
Step 3: The test statistic is computed as t*=r -
Step 4: State your conclusions.The null hypothesisis rejected if the computed
value of the test statistic falls in the rejection region. Otherwise, the null
hypothesis is not rejected.
/z*
Years
For the data shown in the table, we have:
n = 8, Cx = 114, Cy = 145, Cx2= 2316, Cy2= 2801, and Cxy = 2412.
s,, = Ex2- -
(zx)2 = 2316 - 1624.5= 691.5, S,, = Cy2- -
(zy)2 -
-2801 - 2628.125 = 172.875, and
n n
s,, = zxy - (Cx)(zy) = 2412 - 2066.25 = 345.75. The correlation coefficient is then found as follows:
n
= 1.0
S
X
Y = 345.75
= Jsl(xsyy 46915 x 172.875
Since the points fall exactly on a straight line with a positive slope, the correlation coefficient is equal to 1.
INFERENCE CONCERNING THE POPULATION CORRELATION COEFFICIENT
One of the most important uses of the sample correlation coefficient is the determination of
whether or not a correlation exists between two population variables. Recall that r measures the
correlation between x and y in the sample and p is the corresponding population correlation. In
particular, we are often interested in testing the null hypothesis that p =0 vs. one of three alternatives.
The alternative p f 0 states that a correlation is present. The alternative p < 0 states that an inverse
correlation exists, and the alternative p > 0 states that a direct correlation exists. Table 13.8 gives the
steps for testing a hypothesis concerningp.
Table 1
3
.
8
I Steps for Testing for No PopulationCorrelation
322 REGRESSION AND CORRELATION [CHAP. 13
EXAMPLE 13.21 The data in Table 13.2 and the computed value of r in Example 13.19 will be used to test
whether or not there is a positive correlation between the years of service and the number of annual paid days
off for employees of large companies. The null hypothesis is Ho: p = 0 and the alternative hypothesis is Ha:p >
0. The computed value of the test statistic is:
Because of this large value oft*, we would reject the null and conclude that a positive correlation exists in the
population.
Solved Problems
STRAIGHTLINES
1
3
.
1 Find the slope and y intercept of the following straight lines.
Am. In order to find the slope and y intercept, each equation will be put into the form y = mx + b. The
number m is the slope and the number b is the y intercept. (a) The slope is m = 1.5 and the y
intercept is b = -2.5. (6) The slope is m = -2.5 and the y intercept is b = 3.0. (c) The given line is
equivalent to y = 2x - 3 and slope is m = 2 and the y intercept is b = -3. (d)The given line is
equivalent to y = .5x + .25 and the slope is m = .5 and the y intercept is b = .25.
13.2 Determine which of the following points fall on the line y = -2x +4.
(a) (2,O) (b) (0, 2) (c) (50,96) (d) (-10,24) (e) (14.23, -24.46)
AIZS. The points given in (a),(4,and (e)fall on the line since they satisfy the equation.
LINEAR REGRESSIONMODEL
13.3 Identify the values of PO,
is a normal random with mean 0 and standard 1.2.
,and CT in the linear regression model y = 12.1 +3 . 5 ~
+e, where e
Arts. PO=12.l,p,=3.5andO= 1.2.
13.4 Consider the linear regression model y = - 1 . 5 ~+ 5.6 + e, where e is a normal random variable
with mean 0 and variance d = 4. Determine the mean and standard deviation of y when x = 2
and when x = 4.5.
Ans. When x = 2, y = -1.5 (2) + 5.6 + e = 2.6 + e. E(y) = E(2.6 + e) = 2.6. Since 2.6 is a constant, the
variance of y is the same as the variance of e or 4. Therefore the standard deviation of y is 2 when
x = 2.
When x = 4.5, y = -1.5(4.5) + 5.6 + e = -1.15 + e. E(y) = E(-1.15 +- e) = -1.15. Since -1.15 is a
constant, the variance of y is the same as the variance of e or 4. Therefore the standard deviation
of y is 2 when x = 4.5. Notice that the standard deviation of y is 2, regardless of the value of x.
CHAP. 131 REGRESSION AND CORRELATION 323
LEAST SQUARES LINE
1
3
.
5 The number of hours spent per week viewing TV, y, and the number of years of education, x,
were recorded for 10 randomly selected individuals. The results are given in Table 13.9. In
addition, this table gives computations needed in finding bo and bl. Find the least-squares line
for these data.
Ans. S,,= Ex2- -= 2085 - 1988.1= 96.9 S,, = Zxy - (cx)(cy) = 1351- 1494.6=-143.6
n n
- -143.6
x = 14.1 7 = 10.6 bl = --
"' - -=--1.4819 bo= 7 -b15?= 10.6-(-1.4819)(14.1)
S X X 96.9
bo = 10.6+20.8948 = 31.4948
The equation of the least-squares line is 9 = b
o +blx = 31.495- 1.482~.
X
12
14
11
16
16
18
12
20
10
12
Y
Table 13.9
I X2
10
9
15
8
5
4
20
4
16
15
I44
196
121
256
256
324
144
400
100
144
Y2
100
81
225
64
25
16
400
16
256
225
XY
I20
126
I65
128
80
72
240
80
160
180
Cx= 141 I Cy= 106 I Zx2=2085 I Zyz= 1408 I ZXY=1351
13.6 The Minitab output for the data in problem 13.5 is shown in Fig. 13-5. Give the estimated
regression line.
MTB > Regress 'TVhours' 1 'Yearsedu';
SUBC> Constant;
SUBC> Predict 15.
Regression Analysis
The regression equation is
TVhours = 3 1 . 5 - 1 . 4 8 Yearsedu
Predictor Coef StDev T P
Constant 3 1 . 4 9 5 4.388 7 . 1 8 0.000
Yearsedu -1.4819 0.3039 -4.88 0.000
S = 2.992 R-Sq = 74.8% R-Sq(adj) = 71.7%
Analysis of Variance
Source DF ss MS F P
Regression 1 2 1 2 . 8 1 2 1 2 . 8 1 23.78 0.000
Error 8 7 1 . 5 9 8.95
Total 9 284.40
Fit St Dev Fit
9 . 2 6 6 0.985 ( 6 . 9 9 5 , 1 1 . 5 3 8 ) ( 2 . 0 0 2 , 1 6 . 5 3 1 )
95.0% CI 95.0% PI
Fig. 13-5
324 REGRESSION AND CORRELATION [CHAP. 13
Ans. From Fig. 13-5,we see that the estimated regression equation is given in the following portion of
the output in Fig. 13-5. Note that this is the same equation as the least-squares line found in
problem 13.5.
The regression equation is TVhours = 31.5 - 1.48 Yearsedu
Predictor Coef StDev T P
Constant 31.495 4.388 7 . 1 8 0 * 000
Yearsedu -1.4819 0.3039 - 4 . 8 8 0.000
ERROR SUM OF SQUARES
13.7 Compute the error sum of squares for the data in Table 13.9using both the defining formula
and the computational formula.
Ans. Table 13.10gives the details of the computation of SSE = C(y, - ii)2.
The columns from left to
right give the following information: Individual or subject, Number of years of education, Number
of hours spent per week watching TV, The fitted or estimated value using the least squares line,
The residual, and The squares of the residuals.
Table 13.10
Subject
1
2
3
4
5
6
7
8
9
10
Xi
12
14
I 1
16
16
18
12
20
10
12
Yi
10
9
15
8
5
4
20
4
16
15
9i = 31.495- 1.482~
13.7121
10.7482
15.1940
7.7843
7.7843
4.8204
13.7121
1.U66
16.6760
13.7121
(yi - f i
-3.71207
-1.74819
-0.1940 1
0.21569
-2.7843 1
-0.82043
6.28793
2.14345
I .28793
-0.67 595
13.7795
3.0562
0.0376
0.0465
7.7524
0.6731
39.5380
4.5944
0.4569
1.6588
The error sum of squares may also be computed using SSE = syy--’”. From problem 13.5, we have
SX
L
X
S,, = 96.9 and S,, = -143.6. In addition, S,, = Cy2- = 1408- 1123.6= 284.4. Using these values,
n
= 71.593.
(- 143.6)2
we find SSE= sY--
’”= 284.4-
Sxx 96.9
13.8 Using the Minitab output shown in Fig. 13-5, locate SSE and compare it with the result found
in problem 13.7.
Am. The error sum of squares is shown in bold in the following portion of Fig. 13-5.
Analysis of Variance
Source DF ss MS F P
Error 8 71.59 8 . 9 5
Total 9 284.40
Regression 1 2 1 2 . 8 1 2 1 2 . 8 1 2 3 - 7 8 0 . 0 0 0
CHAP. 131 REGRESSION AND CORRELATION
' Subject
1
2
3
4
5
6
7
8
9
10
325
Xi Yi ii = 31.495- 1.482~ (Yi - y > (yi - YY
12 10 13.7121 3.11207 9.6850
9 10.7482 0.14819 0.0220
14
11 15 15.1940 4.59401 21.1050
16 8 7.7843 -2.8 1569 7.9281
16 5 7.7843 -2.8 1569 7.9281
18 4 4.8204 -5.77957 33.4034
12 20 13.7121 3.11207 9.6850
20 4 1.8566 -8.74345 76.4479
10 16 16.6760 6.07595 36.9172
12 15 13.7121 3.11207 9.6850
Sum = 0 Sum = 212.81
STANDARDDEVIATION OF ERRORS
13.9 Use the computed value for SSE in problems 13.7 and 13.8 to find the standard deviation of
errors, S.
Ans. The standard deviation of errors is given by the formula S = j,"s";
--
- /?-= 2.9915.
13.10 Locate the standard deviation of errors in Fig. 13-5 and confirm that it is the same as that
computed in problem 13.9.
Ans. The following row taken from Fig. 13-5gives the value for S.
8 = 2.992 R-Sq = 74.8% R-Sq(adj) = 7 1 . 7 %
TOTAL SUM OF SQUARES
13.11 Find the total sum of squares for the data given in Table 13.9.
1M2
Ans. The total sum of squares is given by SST = C(yi - y)2= Zy2- &f = 1408- -= 1408-
n 10
1123.6= 284.4. The values for Cy2and Zy are found at the bottom of Table 13.9.
13.12 Locate the Total sum of squares in Fig. 13-5 and confirm that it is the same as that found in
problem 13.11.
Ans. The following portion of Fig. 13-5 gives the total sum of squares. It is the same as that found in
problem 13.11.
Source DF ss MS F P
Regression 1 2 1 2 . 8 1 2 1 2 . 8 1 2 3 . 7 8 0.000
E r r o r 8 71.59 8.95
Total 9 284.40
REGRESSIONSUM OF SQUARES
13.13 Find the regression sum of squares for the data given in Table 13.9 by subtraction as well as
by direct computation.
Ans. The regression sum of squares is given by SSR = SST - SSE = 284.4 - 71.593 = 212.807, where
SST is computed in problems 13.1I and 13.12and SSE is computed in problems 13.7 and 13.8.
The direct computation is illustrated in Table 13.11. The mean of the y values is 10.6.
326 REGRESSION AND CORRELATION [CHAP. 13
13.14 Locate the regression sum of squares in Fig. 13-5 and confirm that you get the same value as
that computed in problem 13.13.
Ans. The regression sum of squares is shown in the following portion of output selected from Fig.
13-5.
Source DF ss MS F P
ROgrO68iOn 1 212.81 2 1 2 . 8 1 2 3 . 7 8 0.000
Error a 7 1 . 5 9 8.95
Total 9 284.40
COEFFICIENT OF DETERMINATION
13.15
13.16
Calculate the coefficient of determination for the results given in Table 13.9 and explain its
meaning.
Ans. The coefficient of determination is given by r2 = - -
- -= 0.748. The estimated
regression equation accounts for about 75% of the variation in TV viewing time.
SSR 21281
SST 284.4
Locate the coefficient of determination in Fig. 13-5 and confirm that you get the same value as
that computed in problem 13.15.
Ans. The r2value shown in the following row, taken from Fig. 13-5,is the same as that in problem
13.15.
S = 2 . 9 9 2 R-8q m 74.8% R-Sq(adj) = 71.7%
MEAN, STANDARDDEVIATION,AND SAMPLINGDISTRIBUTION OF THE
SLOPE OF THE ESTIMATED REGRESSION EQUATION
13.17 Find the standarderror for bl using the data in Table 13.9.
2'992 - 0.304. The value for S is found in
The standard error for bl is given by sb,
= --
S
Ans.
J
s
x
X
-
3
z
-
problem 13.9and the value for S,, is given in problem 13.5.
13.18 Locate the value for the standard error for bl in Fig. 13-5and compare it with the value
computed in problem 13.17.
Ans. The standard error of bl is shown in bold print in the following selection taken from Fig. 13-5.It
is the same as that computed in problem 13.17.
Predictor Coef St Dev T P
Constant 3 1 . 4 9 5 4 . 3 8 8 7 . 1 8 0.000
Yeareedu -1.4819 0.3039 - 4 . 8 8 0.000
INFERENCES CONCERNING THE SLOPE OF THE POPULATION
REGRESSION LINE
13.19 Use the data in Table 13.9to test Ho: =0 vs. Ha: <0 at significancelevel a = .05.
Ans. First, let us determine the critical value. The degrees of freedom is df = n - 2 = 8. The t value for
a right tail area equal to -05 is t,os = 1.860. Since this is a lower-tail test, the critical value is
CHAP. 131 REGRESSION AND CORRELATION 327
-1.860. The computed value of the test statistic is t* = U.
From problem 13.5, we know
that bl = -1.4819, and from problem 13.17, s,,,= 0.304. The hypothesized value for the slope is
plo= 0. The computed value of the test statistic is therefore t* = = -4.87. Since this
value is smaller than -1.860, we reject the null hypothesis and conclude that the slope is less than
zero. This indicates that the number of hours spent viewing TV and the number of years of
education are inversely related.
Sb*
-1.48 19-0
.304
13.20 Locate the computed value of the test statistic in problem 13.19 in the Minitab output given in
Fig. 13-5.
Ans. The following selection from Fig. 13-5 gives the computed test statistic as -4.88 with a
corresponding p value equal to O.OO0.
Predictor Coef StDev T P
Constant 31.495 4.388 7 . 1 8 0.000
Yearsedu -1-4819 0.3039 -4.00 0.000
ESTIMATIONAND PREDICTIONIN LINEAR REGRESSION
13.21 Find a 95%confidence interval for the mean number of hours spent per week watching TV
for all individuals with 15 years of education and a 95% prediction interval for an individual
with 15years of education using the data in Table 13.9.
Ans. The 95% confidence interval for the mean is given by bo + blx, k t,,,S
point estimate for the mean .is
. The
= bo + bl(15) = 31.495 - 1.482(15) = 9.265, The margin of
I
error associated with this estimate is t,,,S .The value for t.025 is found by using
8 degrees of freedom to be 2.306. The following values have been found in previous problems:
S = 2.992, n = 10, X= 14.1, S,, = 96.9, and ~0 = 15. Therefore, the margin of error is:
2.306x 2.992 x ,/=
= 2.27. The 95% confidence interval for the mean number of
hours watching TV for all individuals having 15 years of education extends from 9.265 - 2.27 =
6.995 to 9.265 + 2.27 = 11.535. The 95% prediction interval for an individual is given by
The point estimate is the same as given above when setting a confidence interval on the mean
and the t value is the same as above. The margin of error is computed as follows:
(15- 14.1),
2.306x 2.992 x ,/z=
1+-+ 7.264. The 95% prediction interval extends from 9.265
- 7.264 = 2.001 to 9.265 + 7.264 = 16.529. Note that the prediction interval is much wider than
the confidence interval for the mean.
13.22 Locate the 95%confidence interval and the 95% prediction interval in Fig. 13-5and compare
with the results in problem 13.21.
328 REGRESSION AND CORRELATION [CHAP. 13
Ans. The following portion of the output given in Fig. 13-5 gives the 95% confidence interval and prediction
interval. The results are the same except for a small amount of round off error.
F i t S t D e v F i t 95.0% C I 95.0% PI
9 . 2 6 6 0 . 9 8 5 (6.995, 11.538) (2.002, 16.531)
LINEAR CORRELATIONCOEFFICIENT
13.23 Compute the linear correlation coefficient for the data in Table 13.9.
Ans. The correlation coefficient is given by the formula r = . In problem 13.5, we found
that S,, = 96.9 and S
,
, = -143.6. In problem 13.7, we found that S,, = 284.4. The correlation
sxy
.Is..syy
-143.6
4
- =
coefficient is equal to
13.24 Determine the correlation coefficient using the computer output in Fig. 13-5.
Ans. In problem 13.16, we found that r2 = ,748. Therefore r = &4
% = k 365. The negative sign is
taken since we know that the variables are inversely related. We know they are inversely related
since the slope of the estimated regression line is negative. Therefore r = -.865.
INFERENCE CONCERNING THE POPULATION CORRELATION COEFFICIENT
13.25
13.26
Use the data in Table 13.9to test Ho: p = 0 vs. Ha:p c 0 at a = .01.
Ans. The critical value is determined by noting that df = n - 2 = 10- 2 = 8 and that t,ol= 2.896. Since
this is a lower-tail test, the critical value is -2.896. The test statistic is computed as
= -4.876. Since this value is less than the critical value, we
8
J1-.748225
t * = r d s = -.865
conclude that the two variables are negatively correlated.
A sample of 100 federal taxpayers were interviewed and the annual income and cost to
prepare their return was determined for each. The correlation coefficient between the two
variables was found to be 0.37. Do these results indicate a positive correlation between these
two variables in the population? Use a = .05.
Ans. Because of the large sample size, we may use the standard normal distribution table to find the
critical value. The critical value is 1.645. The computed value of the test statistic is t* =
r i % = .37,/&
= 3.94. Since the computed value of the test statistic exceeds the critical
1 - r
value, we conclude that there is a positive population correlation.
Supplementary Problems
STRAIGHTLINES
13.27 Figure 13-6shows the line whose equation is y =4 +2x and 8 points, labeled A through H. Points A, C,
F, and G have the following coordinates: A:(2, 6), C:(2, lO), G:(5, 16), and F(5, 12). Find the
CHAP. 131 REGRESSION AND CORRELATION 329
coordinates for the points B, D, E, and H. Also find the distance between the following pairs of points:
A and B, B and C, H and G, and H and F. Find the sum of the squares of the distances between the four
pairs of points.
I I I
2 3 4 5
X
Fig. 13-6
Ans. B: (2,8),D: (3, lO), E: (4, 12),and H: (5, 14)
All four of the distances are equal to 2 units. The sum of the squares of the distances is equal to
16.
13.28 Use the coordinates of the points B and H found in problem 13.27to find the slope of the straight line
passing through these points and compare this with the slope of the line y = 4 + 2x.
Y 2 - Y 1 14-8
Ans. The slope of the line is m = -= -= 2. The slope of the line y =4 +2x is m = 2.
x Z - X ~ 5 - 2
LINEAR REGRESSION MODEL
13.29 A linear regression model has a slope of 12.5 and a y intercept equal to 19.2. The error term has a
standard deviation equal to 2.5.Give the equation of the linear regression model.
Ans. y = 19.2+ 1 2 . 5 ~
+e
13.30 For the linear regression model described in problem 13.29,find the mean values for y when x equals 2
and 5.
Ans. 44.2 and 81.7
LEAST-SQUARES LINE
13.31 Table 13.12 gives the household income in thousands of dollars, x, and cost of filing federal taxes in
dollars, y, for 20 randomly selected federal taxpayers. Find the following: Zx, Ex2,Zy, Zy2, Zxy, S,,,
S,,, S,,, bl, and bo.
330 REGRESSION AND CORRELATION [CHAP. 13
X
17.5
37.5
47.5
25.O
55.5
35.O
15.5
12.0
32.0
42.3
Table
Y
35.0
60.5
88.5
70.5
125.0
63.0
30.0
30.0
65.O
80.0
13.12
X
28.0
22.5
25.O
29.5
65.O
51.0
39.3
33.0
45.O
75.O
Y
75.0
70.0
60.0
65.O
150.0
100.0
75.O
40.0
75.0
200.0
Ans. Cx = 733.10, Cx2= 31,992, Cy = 1557.50, Cy2= 153,407, Cxy = 68,864, S,, = 5120.2195, S,, =
32116.6875, S,, = 11773.8375, bl = 2.2995, bo = -6.4132.
13.32 The Minitab output for the regression of y on x for the data in Table 13.12 is shown in Fig. 13-7. Give
the portion of the output that shows the values for boand bl.
MTB > Regress 'Cost' 1 'Income';
S U B 0 Predict 4 0 .
RegressionAnalysis
The rogrerraion oquation i8
Cost = - 6.42 + 2.30 Income
Predictor Coef St D e v T P
Constant - 6 . 4 2 0 9.353 - 0 . 6 9 0 . 5 0 1
Income 2.2997 0.2339 9.83 0.000
S = 1 6 . 7 3 R-Sq = 84.3% R-Sq(adj) = 8 3 . 4 %
Analysis of Variance
Source DF ss MS F P
Error 1 8 5040 280
Total 1 9 32116
Regression 1 27076 27076 9 6 . 7 0 0 . 0 0 0
I St Dev Fit 95.0% CI 95.0% PI
Fit
~ 8 5 . 5 7 3 . 8 2 ( 7 7 . 5 3 , 9 3 . 6 0 ) ( 4 9 . 5 0 , 1 2 1 . 6 4 )
Fig. 13-7
Ans. The estimate regression equation is shown in bold print. The differences between these values
and those given in problem 13.31 are due to round off error.
ERROR SUM OF SQUARES
13.33 Use the results given in problem 13.31 to find SSE for the regression line fit to the data in Table 13.12.
Ans. SSE = syy
--
'"
= 32I 16.6875- 27073.6927 = 5042.9948
sxx
13.34 The Minitab output for the regression of y on x for the data in Table 13.12 is shown in Fig. 13-7. Give
the portion of the output that shows the value for SSE.
CHAP. 131 REGRESSION AND CORRELATION 331
Ans. Error 1 8 5040 2 8 0
The difference in the values for SSE in problems 13.33and 13.34 are due to round-off error.
STANDARDDEVIATION OF ERRORS
13.35 Use the result found in problem 13.33 to find the standard deviation of errors for the regression line fit
to the data in Table 13.12.
Ans. S = ,/E
= ,/y
= 16.7382
13.36 The Minitab output for the regression of y on x for the data in Table 13.12 is shown in Fig. 13-7. Give
the portion of the output that shows the value for the standard deviation of errors.
Ans. S = 16.73 R-Sq = 8 4 . 3 % R-Sq(adj) = 8 3 . 4 %
TOTAL SUM OF SQUARES
13.37 Use the results given in problem 13.31 to find the total sum of squares for the data given in Table 13.12.
Ans. SST = S,, = 32116.6875
13.38 The Minitab output for the regression of y on x for the data in Table 13.12 is shown in Fig. 13-7. Give
the portion of the output that shows the value for the total sum of squares.
Ans. Total 1 9 32116
REGRESSIONSUM OF SQUARES
13.39 Use the results of problems 13.33 and 13.37to find the regression sum of squares for the data in Table
13.12.
Ans. SSR = SST - SSE = 32116.6875- 5042.9948 = 27073.6927
13.40 The Minitab output for the regression of y on x for the data in Table 13.12 is shown in Fig. 13-7. Give
the portion of the output that shows the value for the regression sum of squares.
Ans. Regression 1 27076 27076 96.70 0.000
The difference in the answers given in problems 13.39and 13.40is due to round-off error.
COEFFICIENTOF DETERMINATION
13.41
13.42
Use the total sum of squares found in problem 13.37 and the regression sum of squares found in
problem 13.39to find the coefficient of determination for the linear regression model applied to the data
in Table 13.12.
SSR - 27073.6927 = o.843
SST 32116.6875
Ans. r2 = --
The Minitab output for the regression of y on x for the data in Table 13.12 is shown in Fig. 13-7. Give
the portion of the output that shows the value of the coefficient of determination.
Am. s = 1 6 . 7 3 R-Sq p 04.3% R-Sq(adj) = 8 3 . 4 %
332 REGRESSION AND CORRELATION [CHAP. 13
MEAN, STANDARDDEVIATION, AND SAMPLINGDISTRIBUTIONOF THE SLOPE OF THE
ESTIMATEDREGRESSIONEQUATION
13.43
13.44
Find the standard error for bl ,the slope of the estimated regression equation found in problem 13.31.
- 16*7382 = 0.2339
S
-
Ans. sb,---
Jsxx 4
-
The Minitab output for the regression of y on x for the data in Table 13.12 is shown in Fig. 13-7. Give
the portion of the output that shows the value standard error for bl.
Ans. Income 2.2997 0 . 2 3 3 9 9.83 0.000
INFERENCESCONCERNINGTHE SLOPEOF THE POPULATIONREGRESSIONLINE
13.45 Using the data in Table 13.12, test the hypothesis that the slope of the line in the population regression
model is equal to 0. Use level of significance a = .05.
Arts, b -P
Sb,
The critical values are f 2.101. The computed value of the test statistic is t* = =
2'2995-0 = 9.8311.The slope of the regression line is different from 0.
.2339
13.46 Give the portion of Fig. 13-7that is used to test the null hypothesis that the slope of the regression line
is equal to 0. The regression model is y = PO+ Plx + e, where x = household income in thousands, and
y = cost of filing federal taxes. Assume a = .01.
Ans. Predictor Coef St Dev T P
Constant - 6 . 4 2 0 9.353 - 0 . 6 9 0 . 5 0 1
Income 2 . 2 9 9 7 0.2339 9.83 0 . 0 0 0
The computed test statistic is seen to be the same as in problem 13.45. The p value, O.OO0,
indicates that the null hypothesis should be rejected.
ESTIMATIONAND PREDICTION IN LINEAR REGRESSION
13.47 Find a 95%confidence interval for the mean cost of filing federal income taxes for all those individuals
with household income equal to $40,000 and a 95%prediction interval for an individual with household
income equal to $40,000 using the data in Table 13.12.
Ans. The 95% confidence interval for the mean is given by bo+blxo rf: t,,,S .The
point estimate of the mean is -6.4 I32 +2.2995 x 40 = 85.5668. The margin of error is 5 2.101 x
16.7382 x il+
(40-36.655)2 or +, 8.0336. The 95% confidence interval extends from 85.5668
20 5120.2195
- 8.0336 = $77.53 to 85.5668 + 8.0336 = $93.60. The 95%prediction interval for an individual is
given by bo + blxo k t,,,S .The point estimate for the individual is 85.5668.
The margin of error is k 2.101 x 16.7382 x (40-36'655)2 or f 36.0729. The 95%
20 5120.2195
prediction interval extends from 85.5668- 36.0729 = $49.49 to 85.5668 + 36.0729 = $121.64.
CHAP. 131 REGRESSION AND CORRELATION 333
13.48 Locate the 95% confidence interval and the 95% prediction interval in Fig. 13-7 and compare with the
results in problem 13.47.
Ans. Fit St D e v Fit 95.0% CI 95.0% PI
8 5 . 5 7 3 . 8 2 (77.53, 93.60) (49.50, 1 2 1 . 6 4 )
The 95% confidence interval and the 95% prediction intervals are seen to be the same as those
found in problem 13.47.
LINEAR CORRELATIONCOEFFICIENT
13.49 Calculate the correlation coefficient between the household income and the cost of filing federal taxes
using the data in Table 13.12.
= 0.918.
S X Y - 11773.8375
Arts. The correlation coefficient is given by r =
,/=
- J5120.2195~
32116.6875
13.50 Using the output shown in Fig. 13-7, determine the correlation coefficient for the data given in Table
13.12.
Ans. S = 16.73 R-Sq =84.3% R-Sq(adj) = 83.4%
Since r2= 343, r = 4.843 = 0.918.
INFERENCECONCERNINGTHE POPULATIONCORRELATIONCOEFFICIENT
13.51 Use the computed value for r in problem 13.49 to test the hypothesis Ho: p = 0 vs. Ha: p # 0 at level of
significance a = .05, where p represents the correlation between household income and the cost of
filing federal taxes in the population.
Ans. The test statistic is computed as t*= r iz-
-
- .918J
'
" = 9.82. The critical values are
1-.842724
*2.101. The null hypothesis is rejected at a = .05.
13.52 How can the Minitab output in Fig. 13-7be used to test the hypothesis Ho: p = 0 vs. Ha: p f 0 at level of
significance a = .05, where p represents the correlation between household income and the cost of
filing federal taxes in the population.
Ans. The hypothesis Ho: p = 0 vs. Ha: p # 0 is equivalent to the hypothesis &:
The following portion of the output is used to test &: = 0 vs. Ha: p1# 0.
= 0 vs. Ha: PI +0.
Predictor Coef S t D e v T P
Constant -6.420 9.353 -0.69 0 . 5 0 1
Income 2.2997 0.2339 9.03 0 .000
The p value = 0.000indicates that the null hypothesis is rejected.
Chapter 14
Value 17.2 13.5 , 15.5 12.5 13.5 16.0 11.5 13.5
Rank 9 4 7 2 4 8 1 4
Nonpararnetric
14.3 18.0 '
6 10
Statistics
Value
I
Sign
NONPARAMETRIC METHODS
thin thick thin thick thin thin thin thick thin thin I
- - - - -
+
+
-
- +
Many of the hypotheses tests discussed in previous chapters required various assumptions such as
normality of the characteristic being measured or equality of population variances in two sample tests
for example. What can we do if the assumptions are not satisfied? In addition, many research studies
involve low-level data such as nominal or ordinal data to which previously discussed procedures do
not apply. In a taste test, the response may be Pepsi or Coke when asked which of two colas are
preferred. In situations such as these, nonparametric statistical methods are often used.
Nonparametric tests often replace the raw data with ranks, signs, or both ranks and signs. The
nonparametric tests then analyze the resulting ranks and/or signs. Since the original data are not
actually used, one criticism of these procedures is that they are wasteful of information.
EXAMPLE 14.1 Table 14.1 gives a set of data and the corresponding ranks associated with the data values.
The smallest number receives a rank of 1, the next a rank of 2, and so on. If two or more values are tied, they
receive the average of the ranks that would normally occur. Note that 13.5 occurred 3 times and the ranks
would have been 3, 4, and 5. The average of 3, 4, and 5 is 4 and so 4 was assigned to each occurrence of the
value 13.5.The sum of the ranks 1 through n is always equal to -
. If n = 10, then the sum of the ranks is
1OX 11
= 55. This sum is obtained even if ties occur in the data. If the ranks are added in Table 14.1, the sum
2
n(n +1)
2
55 is obtained. The replacement of data with ranks will be utilized in many of the following sections.
EXAMPLE 14.2 In a taste test, each individual was asked to taste a pizza with a thick crust and a pizza with a
thin crust, and to state which one they preferred. A coin was flipped to determine which one they tasted first in
order to eliminate the order of tasting as a factor. Table 14.2 gives the response for each individual and each
response is coded as + if the response was thick and - if the response was thin. The results seem to indicate that
the thin crust is preferred over the thick crust. But are these sample results significant'?That is, what do these
results tell us about the set of preferences in the population? Seventy percent of the sample preferred thin over
thick crust. What are the chances of these results if in fact there is no difference in preference in the population?
The sign test in the next section will help us answer this question.
Table 14.2
SIGN TEST
The sign test is one of the simplest and easiest to understand of the nonparametric tests. In
Example 14.2, the purpose of the taste test is to decide if there is a difference in taste preference
334
CHAP. 141 NONPARAMETRIC STATISTICS 335
Price
Sign
between thin and thick pizza crust. The null hypothesis is that there is no difference in preference for
the two types of crusts. The research hypothesis might be one-tail or two-tail. Suppose it is a two-tail
research hypothesis. As usual, we proceed under the assumption that the null hypothesis is true and
only reject that assumption if our sample results lead us to reject it. If there is no difference in
preference, then the probability of a + on any trial is .5 and the probability of a - is also .5. The
number of + signs in the 10trials, x, is a binomial random variable. The p value associated with the
outcome x = 3 is computed by finding P(x 5 3) and then doubling the result since this is a two-tail
test. Using the binomial distribution tables in Appendix 1 with n = 10and p = .5, we find that P
(
x 5 3 )
is equal to .0010 + .0098 + .0439 + .1172 = .1719 and the p value = 2 x .1719 = 0.3438. At the
conventional level of significance cx = .05 we cannot reject the null hypothesis since the p value is not
less than the preselected level of significance. We see from this discussion that the sign test uses the
binomial distribution to perform the sign test. The p value may be computed by using Minitab. The
computation using Minitab is as follows:
110 130 102 125 140 112 117 130 200 119
+ + - + + + + + + +
MTB > cdf 3;
SUBC> binomial n = 10 p = . 5 .
Cumulative Distribution Function
Binomial with n = 10 and p = 0.500000
X P ( x <= x)
3 . 0 0 0 . 1 7 1 9
The p value is then found by doubling 0.1719to get 0.3438 as found above.
EXAMPLE 14.3 In order to test the hypothesis that the median price for a home in a city is $105,000, a
sample of 10 recently sold homes is obtained. If the selling price exceeds $105,000, a plus is recorded. If the
price is less than $105,000, a negative sign is recorded. If the price is equal to $105,OOO, the selling price is not
used in the analysis and the sample size is reduced. Table 14.3 gives the selling prices in thousands and the
signs are recorded as described.
Suppose the alternative hypothesis is that the median selling price exceeds $105,000.If x represents the number
of + signs, then the p value = P(x 29).The computation of the p value is as follows.
MTB > cdf 8;
S U B 0 binomial n = 10 p = . 5 .
Cumulative Distribution Function
Binomial with n = 10 and p = 0.500000
X P ( x <= x)
8 . 0 0 0 . 9 8 9 3
Since P(x 1 9 )= 1 - P(x S 8)= 1 - .9893 = 0.0107, the p value is 0.0107. The null hypothesis is rejected and we
conclude that the median price exceeds $105,000at level of significance a = .05. The normal approximation to
the binomial distribution may be used to determine the p value also. It is recommended that the student review
this approximation which is discussed in Chapter 6. The approximation is valid provided that np 15 and nq 2 5.
np and nq both equal 5 in this example. The mean is computed as p = np = 10 x .5 = 5 and the standard
deviation is computed as 0 = 6= d m = 1.581. The z value is computed as z = -
= 2.21.
The p value is the area to the right of 2.21 under the standard normal curve. P(z > 2.21) = .5 - .4864 = .0136.
The normal approximation to the binomial distribution provides a reasonably good result if n is 10 or greater.
8.5 -5
1.58I
336 NONPARAMETRICSTATISTICS [CHAP. 14
Patient
Beforereading
Afterreading
Difference
Many books recommend that n be 20 or 25, but if the continuity correction is used the result will be fairly good
for n equal to 10or more.
1 2 3 4 5 6 7 8 9 1 0 1 1 1 2 1 3 1 4 1 5
95 90 98 99 87 95 95 97 90 90 99 '95 95 92 96
85 80 91 88 90 90 88 84 93 91 90 90 97 94 86
10 10 7 11 , -3 5 7 13 -3 -1 9 5 -2 -2 10
EXAMPLE 14.4 The sign test described in Example 14.3 may be performed by using the sign test procedure
in Minitab. The prices given in Table 14.3 are entered in column cl and the following commands are used. The
subcommand Alternative 1. indicates that the alternative is upper tailed.
MTB > STest 105 C1;
SUBC> Alternative 1.
Sign Test for Median
Sign test of median = 105.0 vs. > 105.0
N Below Equal Above P Median
c1 10 1 0 9 0.0107 1 2 2 .0
Note that the same p value is given here as was found in Example 14.3.
EXAMPLE 14.5 The sign test is also used to test for differences in matched paired experiments.
gives the blood pressure readings before and six weeks after starting hypertensive medication as
differences in the readings.
Table 14.4
Table 14.4
well as the
The research hypothesis is that the medication will reduce the blood pressure. The difference is found by
subtracting the after reading from the before reading. If the medication is effective, the majority of the
differences will be positive. If x represents the number of positive signs in the 15 differences and if the null
hypothesis that the medication has no effect is true, then x will have a binomial distribution with n = 15 and p =
.5. The null hypothesis should be rejected for large values of x. That is, as this study is described, this is an
upper-tail test. The Minitab solution is as follows. If the medication is not effective, the median difference
should be zero. The differences are entered into column 1.
10 10 7 11 - 3 5 7 13 - 3 -1 9 5
-2 -2 10
MTB > STest median = 0.0 data in C1;
SUBC> Alternative 1.
Sign Test for Median
Sign test of median = 0.00000 vs. > 0.00000
c1 15 5 0 10 0.1509 7.000
N Below Equal Above P Median
The p value is seen to equal 0.1509.This indicates that the null hypothesis would not be rejected at level of
significance a = .05. A dotplot of the differences is shown in Fig. 14-1. The dotplot illustrates a weakness of the
sign test. Note that the 5 minus values are much smaller in absolute value than are the absolute values of the 10
positive differences. The sign test does not utilize this additional information. The signed-rank test discussed in
the next section uses this additional information.
Fig. 14-1
CHAP. 141 NONPARAMETRIC STATISTICS 337
The sign test is a nonparametric alternative to the one sample t test. The one sample t test
assumes that the characteristic being measured has a normal distribution. The sign test does not
require this assumption. If the selling prices of the homes discussed in Example 14.3 are normally
distributed, then the one sample t test could be used to test the null hypothesis that the mean is
$105,000.
WILCOXON SIGNED-RANKTEST FOR TWO DEPENDENT SAMPLES
In Example 14.5,it was noted that the sign test did not make use of the information that the 10
positive differences were all larger in absolute value than the absolute value of any of the 5 negative
differences. The sign test used only the information that there were 10 pluses and 5 negatives. The
data in Table 14.4 will be used to describe the Wilcoxon signed-rank test. Table 14.5 gives the
computationsneeded for the Wilcoxon signed-ranktest.
~~~ ~
Before reading
95
90
98
99
87
95
95
97
90
90
99
95
95
92
96
After reading
85
80
91
88
90
90
88
84
93
91
90
90
97
94
Tab1
Difference, D
10
10
7
11
-3
5
7
13
-3
-1
9
5
-2
-2
10
14.5
ID1
10
10
7
11
3
5
7
13
3
1
9
5
2
2
Rank of I D 1
12
12
14
8.5
4.5
6.5
8.5
15
4.5
1
10
6.5
2.5
2.5
12
Signed rank
12
12
8.5
14
-4.5
6.5
8.5
15
-4.5
-1
10
-2.5
-2.5
12
6.5
The first column gives the before blood pressure readings, the second column gives the after
readings, the third column gives the difference D = Before - After, the fourth column gives the
absolute values of the differences, the fifth column gives the ranks of the absolute differences,and the
sixth column gives the signed-rank which restores the sign of the difference to the rank of the
absolute difference.Two sum of ranks are defined as follows: W+= sum of positive ranks and W- =
absolute value of sum of negative ranks. For the data in Table 14.5,we have the following:
W+= 12+ 12+8.5+ 14+6.5+8.5+15+ 10+6.5+ 12= 105
W- = 4.5 +4.5 + 1+2.5 +2.5 = 15
To help in understanding the logic behind the signed-rank test, it is helpful to note some interesting
facts concerningthe above data. The sum of the ranks of the absolute differences must equal the sum
of the first 15 positive integers. Using the formula given in Example 14.1, the sum of the ranks 1
through 15 is -
l5 l6 -
- 120. Note that W+ + W- = 120. If the null hypothesis is true and the
medication has no effect, we would expect a 60,60 split in the two sum of ranks. That is if Ho is true,
we would expect to find W+ = 60 and W- = 60. If the medication is effective for every patient, we
would expect W+= 120and W- = 0. Either of the rank sums, W+ or W-, may be used to perform the
hypothesis test. Tables for the Wilcoxon test statistic are available for determining critical. values for
2
338 NONPARAMETRICSTATISTICS [CHAP. 14
the Wilcoxon signed-rank test. However, statistical software would most often be used to perform the
test in real world applications. The Minitab solution is given in Example 14.6.
EXAMPLE 14.6 The differences in column 3 of Table 14.5 are entered into column cl. The command
west 0 . 0 C1; indicates that the Wilcoxon signed-rank test is to be used to test that the median of the
differences is 0 and that the differences are in column c1. The subcommand Alternative 1.indicates that
the test is upper-tailed. The output gives the Wilcoxon statistic, W+,computed in the above discussion and most
importantly it gives the p value for the test. The probability of obtaining a value of W+equal to 105or larger is
0.006.This p value is much less than .05 and indicates that the medication is effective. Note that the p value
obtained when using the sign test in Example 14.5 is 0.1509. The Wilcoxon signed-rank test is more powerful
than the sign test as indicated by this example.
D i f f
10 10 7 11 -3 5 7 13 -3 -1
9 5 -2 - 2 10
MTB > WTest 0 . 0 C1;
S U B 0 Alternative 1.
Wilcoxon Signed Rank Test
Test of median = 0.000000 vs. median > 0.000000
N for Wilcoxon Estimated
N Test Statistic P Median
D i f f 15 15 105.0 0.006 5.000
A normal approximation procedure is also used for the Wilcoxon signed-rank test procedure
when the sample size is I5 or more. The test statistic is given by formula (14.I ) , where W is either
W' or W- depending on the nature of the alternative hypothesis, and n is the number of nonzero
differences.
W -n(n +1)/4
Jn(n +1)(2n+1)/24
Z = (14,I )
EXAMPLE 14.7 The normal approximation will be applied to the blood pressure data in Table 14.5 and the
results compared to the results in Example 14.6.The computed test statistic is as follows:
105- 15(16)/4 = 2.56
z = 4150(31)/24
The p value is P(z > 2.56) = .5 - ,4948 = 0.0052.The p value in Example 14.6 is equal to 0.006. Note that the
two values are very close. The difference may be due to round-off error or it may be due the Minitab software
using the exact distribution for the test statistic rather than the normal approximation.
Like the sign test, the Wilcoxon signeL rank test is a nonparametric alternative to the one sample
t test. However, the Wilcoxon signed-rank test is usually more powerful than the sign test. There are
situations such as taste preference tests where the sign test is applicable but the Wilcoxon signed-rank
test is not.
WILCOXON RANK-SUM TEST FOR TWO INDEPENDENT SAMPLES
Consider an experiment designed to compare the lifetimes of rats restricted to two different diets.
Eight rats were randomly divided into two groups of four each. The rats in the control group were
allowed to consume all the food they desired and the rats in the experimental group were allowed
CHAP. 141
Control
Experimental
NONPARAMETRIC STATISTICS
3.4 3.6 4.0 2.5
3.7 4.2 4.2 4.5
339
Control
only 80% of the food that they normally consumed. The experiment was continued until all rats
expired and their lifetimes are given in Table 14.6.
3.4 (2) 3.6 (3) 4.0 (5) 2.5 (1)
Experimental 3.7 (4) 4.2 (6.5) 4.2 (6.5) 4.5 (8) .
A nonparametric procedure for comparing the two groups was developed by Wilcoxon. Mann
and Whitney developed an equivalent but slightly different nonparametric procedure. Some books
discuss the Wilcoxort rank-srun test and some books discuss the Marzrz-Whitrzey U test. Both
procedures make use of a ranking procedure we will now describe. Imagine that the data values are
combined together and ranked. Table 14.7 gives the same data as shown in Table 14.6 along with the
combined rankings shown in parentheses.
Table 14.7
We let R
I be the sum of ranks for group 1, the control group, and R2 be the sum of ranks for
group 2. For the data in Table 14.7, we have RI = 11 and R
2 = 25. The sum of all eight ranks must
equal -
* = 36. Note that R1 + R2 = 36.
2
Now, consider some hypothetical situations. Suppose there is no difference in the lifetimes of the
control and experimental diets. We would expect R
I = I8 and R2 = 18. Suppose the experimental diet
is highly superior to the control diet, then we might expect all rats in the experimental group to be
alive after all the rats in the control diet have died, and as a result obtain RI = 10and R2 = 26. Critical
values for the testing the hypothesis of no difference in the two groups may be obtained from a table
of the Mann-Whitney statistic or a table of the Wilcoxon rank-sum statistic. These tables are available
in most elementary statistics texts. Most users of Statistics will use a computer software package to
evaluate their results. Example 14.8 illustrates the use of Minitab to evaluate the results in the
experiment described above.
EXAMPLE 14.8 The Minitab analysis for the data in Table 14.6 is shown below. Note that the data are stored
in separate columns. The command Mann-Whitney 95.0 C1 C2; will perform a Mann-Whitney test on
the data in columns c I and c2. The subcommand Alternative -1. will test the alternative that the median
lifetime for the control group is less than the median lifetime for the experimental group. The output line W =
11 is the value for R I computed in the above discussion of this experiment. The output line Test of ETAl
= ETA2 vs. ETAl < ETA2 is significant at 0.0303 gives the p value as 0.0303 for the
lower-tail alternative hypothesis. At level of significance a = .05, the null hypothesis would be rejected since
the p value < a. This rather small study seems to indicate that restricting the diet prolongs the lifetime of such
rats.
Row Control Expermtl
1 3 . 4 3 . 7
2 3 . 6 4.2
3 4 . 0 4 . 2
4 2.5 4 . 5
MTB > Mann-Whitney 95.0 C1 C2;
SUBC> Alternative -1.
340 NONPARAMETRIC STATISTICS
I RI= 184 I R
3 = 69
[CHAP. 14
Mann-WhitneyConfidence Interval and Test
Control N = 4 Median = 3.500
Expermtl N = 4 Median = 4.200
Point estimate f o r ETA1-ETA2 is -0.700
97.0 Percent CI for ETA1-ETA2 is ( - 2 . 0 0 0 , 0 . 3 0 0 )
w - 11.0
Teat of ETA1 = ETA2 va. ETA1 < ETA2 i a significant at 0.0303
The t e s t is significant at 0.0295 (adjusted for ties)
A normal approximation procedure is also used for the Wilcoxon rank-sum test procedure when the
two sample sizes are large. If both sample sizes, nl and n2, are 10 or more, the approximation is
usually good. The test statistic is given by formula (14.2) where RIis the rank-sum associated with
sample size nl .
R, - n,(n, +n2 +1) / 2
Jn,n2(nl +n2+1) / 12
Z = (14.2)
EXAMPLE 14.9 A test instrument which measures the level of type A personality behavior was administered
to a group of lawyers and a group of doctors. The results and the combined rankings are given in Table 14.8.
The higher the score on the instrument, the higher the level of type A behavior. The null hypothesis is that the
medians are equal for the two groups, and the alternative hypothesis is that the medians are unequal.
Tab1
Lawyers
28 (5)
51 (20)
54 (21)
34 (10)
33 (9)
38 (12)
44 (15)
45 (16)
47 (17)
48 (18)
56 (22)
50 (19)
14.8
Doctors
20 (1)
26 (4)
32 (8)
36 (11)
40 (13)
23 (2)
24 (3)
30 (6.5)
30 (6.5)
42 (14)
We note that nl = 12, n2 = 10,and R1= 184.The computed value of the test statistic is found as follows;
184-12(12+10+1)/2 =3.03
z*=
412 x 10x (12 +10+ 1) / 12
The null hypothesis is rejected for level of significance a = .05, since the critical values are k 1.96 and the
computed value of the test statistic exceeds 1.96. The median score for lawyers exceeds the median score for
doctors.
In closing this section, we note that the parametric equivalent test for the Wilcoxon rank-sum test
and the Mann-Whitney U test is the two-sample t test discussed in Chapter 10.
KRUSKAL-WALLIS TEST
In Chapter 12, the one-way ANOVA was discussed as a statistical procedure for comparing
several populations in terms of means. The procedure assumed that the samples were obtained from
CHAP. 141 NONPARAMETRIC STATISTICS 341
Whites
15
12
13
14
9.5
RI = 63.5
normally distributed populations with equal variances. The Kruskal-Wallis test is a nonparametric
statistical method for comparing several populations. Consider a study designed to compare the net
worth of three different groups of senior citizens (ages 51- 61). The net worth (in thousands) for each
senior citizen in the study is shown in Table 14.9.The null hypothesis is that the distributions of net
worths are the same for the three groups and the alternative hypothesis is that the distributions are
different.
African-Americans Hispanics
4 1.5
11 7
1.5 9.5
5 3
8 6
R2 = 29.5 R3 = 27
Table 14.9
205
225 110
The data are combined into one group of 15 numbers and ranked. The rankings are shown in Table
14.10.The sums of the ranks for each of the three samples are shown as RI,R2, and R3.
The sum of the ranks from 1 to 15 is equal to 120. If the null hypothesis were true, we would expect
the sum of ranks for each of the three groups to equal 40. That is, if H
o is true we would expect that
RI = R2 = R3 = 40. The Kruskal-Wallis test statistic measures the extent to which the rank sums for
the samples differs from what would be expected if the null were true. The Kruskal-Wallis test
statistic is given by formula (24.3),where k = the number of populations, n,= the number of items in
sample i, Ri = sum of ranks for sample i, and n =the total number of items in all samples.
(24.3)
W has an approximate Chi-square distribution with df = k - 1 when all sample sizes are 5 or more.
EXAMPLE 14.10 The computed value for W for the data in Table 14.9is determined as follows:
The critical value from the Chi-square table with df = 2 and a = .05 is 5.991. The population distributions are
not the same, since W* exceeds the critical value. Suppose the sample rank sums were each the same. That is,
suppose RI= R2 = R3= 40. The computed value for W would then equal the following:
It is clear that this is always a one-tailed test. That is, the null is rejected only for large values of the computed
test statistic.
342 NONPARAMETRICSTATISTICS [CHAP. 14
t
Basic Properties of the Spearman Rank Correlation Coefficient
1. The value of rs is always between -1 and +1, i.e., -1 I
rs I+1
2. A value of rs near +1 indicates a strong positive association between the rankings.
A value of rs near -I indicates a strong negative association between the rankings.
3. The Spearman rank correlation coefficient may be applied to data at the ordinal
level of measurement or above.
EXAMPLE 14.11 The Minitab analysis for the data in Table 14.9 is given below. The value for H is the same
as that for W* found in Example 14.10. The p value is 0.016, indicating that the null hypothesis would be
rejected for a = .05, the same conclusion reached in Example 14.10.
Row Group Networth
1 1 250
2 1 1 7 5
3 1 1 3 0
4 1 2 05
5 1 225
6 2 80
7 2 1 5 0
8 2 7 0
9 2 90
10 2 110
1
1 3 7 0
1 2 3 100
1 3 3 130
1 4 3 75
1 5 3 95
MTB > Kruskal-Wallis 'Networth' 'Group'.
Kruskal-Wallis Test
Kruskal-Wallis Test on Networth
Group N Median Ave Rank Z
1 5 205.00 1 2 . 7 2.88
2 5 9 0 . 0 0 5 . 9 - 1 . 2 9
3 5 9 5 . 0 0 5 . 4 - 1 . 5 9
Overall 15 8.0
H = 8.31 DF = 2 P = 0.016
RANK CORRELATION
The Pearson correlationcoefficient was introduced in chapter 13 as a measure of the strength of
the linear relationship of two variables. The Pearson correlation coefficient is defined by formula
(14.4).The Spearman rank correlation coefzcient is computed by replacing the observations for x
and y by their ranks and then applying formula (14.4)to the ranks. The Spearman rank correlation
coefficient is representedby rs
(14.4)
EXAMPLE 14.12 Table 14.12 gives the number of tornadoes and number of deaths due to tornadoes per
month for a sample of 10 months. In addition, the ranks are shown for the values for each variable.
CHAP. 141 NONPARAMETRIC STATISTICS 343
Number of
tornadoes, x
35
85
90
94
50
76
98
104
88
128
Table 14.12
Rank of x
1
4
6
7
2
3
8
9
5
10
Number of
deaths, y
0
0
2
1
0
3
4
4
5
7
Rank of y
2
2
5
4
2
6
7.5
7.5
9
10
The formula for the Pearson correlation coefficient given in formula (14.4)is applied to the ranks of x and the
ranks of y.
C X = 1 + 4 + 6 + . * . +l0=55,CxZ=
1 + 1 6 + 3 6 + - + 1oO=385,Cy=55,Cy2=4+4+25+...+ loo=
382.5,E~y=
1 ~ 2 + 4 ~ 2 + 6 ~ 5 + * . * + 1 0 ~ 1 0 = 3 6 3 . 5 .
(W@Y) -
= 385 - 302.5 = 82.5, S,, = Cy2- -
(cy)2 -
- 382.5 - 302.5 = 80, and S,, = Zxy - -
(EX)*
s,, = Ex2- -
n n n
363.5 - 302.5 = 61.
= 0.75
XY 61
Rs= ,
/
- = 9825x80
EXAMPLE 14.13 The Minitab solution for Example 14.12 is shown below. The raw data are entered into
columns cl and c2. The rank command is then used to put the ranks into columns c3 and c4. Then the Pearson
correlation coefficient is requested for columns c3 and c4.
Row
1
2
3
4
5
6
7
8
9
10
tornado deaths
3 5 0
85 0
90 2
94 1
50 0
7 6 3
98 4
1 0 4 4
88 5
1 2 8 7
MTB > rank cl put into c3
MTB > rank c2 put into c 4
MTB > print c3 c4
Row C3 C4
1 1 2.0
2 4 2.0
3 6 5 . 0
4 7 4.0
5 2 2.0
6 3 6 . 0
7 8 7 . 5
8 9 7 . 5
9 5 9 . 0
1 0 1 0 1 0 . 0
MTB > correlation c3 c4
Correlation of c3 and c4 = 0.739
344 NONPARAMETRIC STATISTICS [CHAP. 14
The population Spearman rank correlation coeficient is represented by ps.In order to test the
null hypothesis that ps= 0, we use the result that J"-lr, has an approximate standard normal
distribution when the null hypothesis is true and n 2 10. Some books suggest a larger value of n in
order to use the normal approximation.
EXAMPLE 14.14 The data in Table 14.12 may be used to test the null hypothesis &:ps= 0 vs. Ha: ps+0 at
level of significance a = .05. The critical values are k 1.96.The computed value of the test statistic is found as
follows: z* = 410-1 x .75 = 2.25. We conclude that there is a positive association between the number of
tornadoes per month and the number of tornadodeaths per month since z* > 1.96.
RUNS TEST FOR RANDOMNESS
Suppose 3.5-ounce bars of soap are being sampled and their true weights determined. If the
weight is above 3.5 ounces, then the letter A is recorded. If the weight is below 3.5 ounces, then a B
is recorded. Suppose that in the last 10 bars the result was AAAAABBBBB. This would indicate
nonrandomness in the variation about the 3.5 ounces. A result such as ABABABABAB would also
indicate nonrandomness. The first outcome is said to have R = 2 runs. The second outcome is said to
have R = 10runs. There are nl = 5 A's and n2 = 5 B's. A run is a sequence of the same letter written
one or more times. There is a different letter (or no letter)before and after the sequence. A very large
or very small number of runs would imply nonrandomness.
In general, suppose there are nl symbols of one type and n2 symbols of a second type and R runs
in the sequence of nl + n2 symbols. When the occurrence of the symbols is random, the mean value
of R is given by formula (14.5):
When the occurrence of the symbols is random, the standard deviation of R is given by formula
(14.6):
2nln2(2n2n2-n1- n2)
a = J (n1 +n2I2(n1+n,-l)
(14.6)
When both nl and n2 are greater than 10, R is approximately normally distributed. The size
requirements for nl and n2 in order that R be normally distributed vary from book to book. We shall
use the requirement that both exceed 10.From the proceeding discussion, we conclude that when the
occurrence of the symbols is random and both nl and n2 exceed 10, then the expression in formula
(14.7)has a standard normal distribution.
When the normal approximation is not appropriate, several Statistics books give a table of critical
values for the runs test.
EXAMPLE 14.15 Bars of soap are selected from the productionline and the letter A is recorded if the weight
of the bar exceeds 3.5 ounces and B is recorded if the weight is below 3.5 ounces. The following sequence was
obtained.
AAABBAAAABBABABAAABBBBAABBBABABBAA
CHAP. 141 NONPARAMETRIC STATISTICS 345
These results are used to test the null hypothesis that the occurrence of A’s and B’s is random vs. the alternative
hypothesis that the occurrences are not random. The level of significance is a = .01. The critical values are
f2.58. We note that nl = the number of A’s = 18 and n2 = the number of B’s = 16. The number of runs is R =
+1 =
2nln2 + I = 2 x 1 8 ~ 1 6
17. Assuming the null hypothesis is true, the mean number of runs is p = -
17.9411. Assuming the null hypothesis is true, the standard deviation of the number of runs is :
“1 + “ 2 34
2n1n2(2n2n2- n l -n2) 2 ~ 1 8 ~ 1 6 x 5 4 2
O = d342x33=2-8607
R -p 17-17.9411
The computed value of the test statistic is Z* = -= = -.33.
0 2.8607
There is no reason to doubt the randomness of the observations. The process seems to be varying randomly
about the 3.5 ounces, the target value for the bars of soap.
EXAMPLE 14.16 The Minitab solution to Example 14.15 is shown below. The symbols A and B are coded
as 1 and -1 respectively. The 0. in the statement Rune 0 C1, is the mean of -1 and 1. The output gives the
observed number of runs, the mean number of runs, the number of A’s and B’s, and the p value = 0.7422. The
output is the same as that obtained in Example 14.15.
Cl
1 1 1 -1 -1 1 1 1 1 -1 -1 1 -1 1 -1
1 1 1 -1 -1 -1 -1 1 1 -1 -1 -1 1 -1 1
-1 -1 1 1
MTB > Runs 0 C1.
RunsTest
The observed number of runs = 17
The expected number of runs = 17.9412
18 Observations above K 16 below
The test is significant at 0.7422
Cannot reject at alpha = 0.05
Solved Problems
NONPARAMETRICMETHODS
14.1
14.2
A table (not shown) gives the number of students per computer for each of the 50 states and the
District of Columbia. If the data values in the table are replaced by their ranks, what is the sum
of the ranks?
-
51x52
Ans. The ranks will range from 1 to 51. The sum of the integers from 1 to 51 inclusive is 7
-
L
1,326.This same sum is obtained even if some of the ranks are tied ranks.
A group of 30 individuals are asked to taste a low fat yogurt and one that is not low fat, and to
indicate their preference. The participants are unaware which one is low fat and which one is
not. A plus sign is recorded if the low fat yogurt is preferred and a minus sign is recorded if the
one that is not low fat is preferred. Assuming there is no difference in taste for the two types of
yogurt, what is the probability that all 30 prefer the low fat yogurt?
Ans. Under the assumption that there is no difference in taste, the plus and minus signs are randomly
assigned. There are z3O = 1,073,741,824 different arrangements of the 30 plus and minus signs.
346~ NONPARAMETRIC STATISTICS [CHAP. 14
~ Couple 1 2 3 4 5 6 7 8 9 1 0 1 1 1 2 1 3 i 4 ! 5
~
-
-
-
-
-
-
-
-
3
- I
Husband 35 16 17 25 30 32 28 31 27 15 19 22 33 30 18 ,
Wife 25 20 18 20 25 25 25 I 1 9 30 20 15 20 27 28 2 0 ,
Difference 10 -4 -1 5 5 7
- 3 12 , -3 i -5 4 , 2 6 ~ 2 -2
The probability that all 30 are plus signs is1divided by 1,073,741,824,or 9.313225746x 10-l'. If
this outcome occurred, we would surely reject the hypothesis of no difference in the taste of the
two types of pgurt.
SIGN TEST
14.3 Find the probability that 20 or more prefer the low fat yogurt in problem 14.2 if there is no
difference in the taste. Use the normal approximationto the binomial distribution to determine
yoiii answer.
Ans. Assuming that there is no difference in taste, the 30 responses represent 30 independent trials with
p = q = .5, where p is the probability of a + and q is the probability of a -. The mean number of +
signs is p = np = 15 and the standard deviation is 0 = &= 2.739. The z value corresponding
to 20 p!us signs is z = = 1.64.The probability of 20 or more plus signs is approximated
19.5-15
2.739
by P (Z > I .64)
= .5 - .4495= ,0505.
14.4 A sociotogical study involving married couples, where both husband and wife worked full
time, recorded the yeariy income for each in thousands of dollars. The results are shown in
Table 14.13.Use the sign test to test the research hypothesis that husbands have higher salaries.
Use level of significance a = .05.
Ans. A high number of plus signs among the differences supports the research hypothesis. There are 10
plus signs in the 15 differences. The p value is equal to the probability of 10or more pius signs in
the 15 differences. The p value is computed assuming the probability of a plus sign is .5 for any
one of the differences, since the p value is computed assuming the null hypothesis is true. From
the binemial distribution, we find that the p value = .0916 + .@I17 + .0139 + .0032 + .OOO5'+
.OOOO = 0.1509. Since the p value exceeds the preset leve! of significance, we cannot reject the
null hypothesis.
WILCOXON SIGNED-RANKTEST FOR TW-0DEPENDENT SAMPLES
14.5 Use the Wilcoxon signed-rank test and the data in Table M.13 to test the research hypothesis
that husbands have higher salaries. Determine the values for W+ and W- and then use the
normal approximation procedure to perform the test. Use level of significance a = .05
Am. From Table 14.14, we see that W+= 15 + 10 + 10 + 13 + 5.5 i 14 + 7.5 + 3 + 12 + 3 = 93, and
W-= 7.5 + l + 5.5 + 10 + 3 = 27. Since Difference = Husband salary - Wife saiary, large values
for W+ or small values for 'AT-will lend support to the research hypothesis that husbands have
higher salaries. If w' is used, the test will be upper-tailed. If W- is used, the test will be lower-
tailed. The test statistic used in the normal approximation is:
CHAP. 141 NONPARAMETRIC STATISTICS 347
Husband
35
16
17
25
30
32
28
31
27
15
19
22
33
30
18
Wife
25
20
18
20
25
25
25
19
30
20
15
20
27
28
20
Tablt
Difference. D
10
-4
-1
5
5
7
3
12
-3
-5
4
2
6
2
-2
14.14
ID1
10
4
1
5
5
7
3
12
3
5
4
2
6
2
2
Rank of ID I
15
7.5
1
10
10
13
5.5
14
5.5
10
7.5
3
12
3
3
Signed rank
15
-7.5
-1
10
10
13
5.5
14
-5.5
-10
7.5
3
12
3
-3
27 -60
If we use W-, then 2 = -
= -1.87, and the p value = .5 - .4693 = .0307. If we use W+, then 2 =
17.607
93 -60
-- - 1.87, and the p value = .0307. It is clear that we may use either W+or W-. In either case, we
17.607
reject the null hypothesis since the p value < .05.
14.6 The Minitab output for the solution to problem 14.5 is shown below. Answer the following
questions by referring to this output.
(a) Explain the various parts of the commandWTest 0.0 I Diff ;
(b) Why was the subcommandAlternative 1 used?
(c) What is the number 93.0shown under Wilcoxon statistic?
(4 How does the p value given in the Minitab output compare with the p value found in
problem 14.5?
MTB > print cl
Data Display
Diff
10 -4 -1 5 5 7
4 2 6 2 -2
3 12 -3 - 5
MTB > WTest 0.0 IDiff';
SUBC> Alternative 1.
Wilcoxon Signed Rank Test
Test of median = 0.000000 vs. median > 0.000000
N for Wilcoxon Estimated
N Test Statistic P Median
Diff 1 5 1 5 93.0 0.032 2 . 7 5 0
Ans. (a) Wtest indicates that the Wilcoxon signed-rank test is to be performed. The value 0.0 indicates
that we are testing that the median difference is assumed to be 0.0.The median difference will
be 0.0 if the null hypothesis is true. Diff indicates that we are performing the test for the
differences in column 1.
(6) The subcommand Alternative 1.indicates an upper-tailed test. Since Diff = Husband
salary - Wife salary, a positive median will support the research hypothesis that the husband
salaries are greater than the wife salaries.
(c) The Wilcoxon statistic is the same as W+ found in problem 14.5.
348 NONPARAMETRICSTATISTICS [CHAP. 14
(6)The p values are very close. The p value based on the normal approximation is 0.0307 and the
Minitab p value is 0.032.
WILCOXON RANK-SUM TEST FOR TWO INDEPENDENT SAMPLES
14.7 A study compared the household giving for two different Protestant denominations. Table
14.15 gives the yearly household giving in hundreds of dollars for individuals in both samples.
The ranks are shown in parentheses beside the household giving. Use the normal
approximation to the Wilcoxon rank-sum test statistic to test the research hypothesis that the
household giving differs for the two denominations.Use level of significancea = .Ol.
Tabk
Denomination 1
9.3 (5)
13.5(20)
13.7 (21)
10.2(10)
10.0(9)
10.7(12)
11.3(15)
11.4 (16)
11.5(17)
12.3(18)
14.2(22)
12.5(19)
14.15
Denomination 2
7.0 (1)
9.0(4)
9.7 (8)
10.4(11)
10.9(13)
8.0 (2)
8.5 (3)
9.5 (6.5)
9.5 (6.5)
11.1(14)
14.5 (23)
I RI = 184 I Rz = 92
Ans. The sample sizes are nl = 12 and n2= 1I. The rank sum associated with sample 1 is RI= 184.The
test statistic is computed as follows:
The critical values are f2.58. Since the computed value of the test statistic does not exceed the
right side critical value, the null hypothesis is not rejected. The p value is computed as follows.
The area to the right of 2.46 is .5 - .4931 = .0069, and since the alternative is two sided, the p
value = 2 x .0069= 0.0138. If the p value approach to testing is used, we do not reject since the p
value is not less than the preset a.
14.8 The Minitab output for the solution to problem 14.7 is shown below. Answer the following
questions by referring to this output.
(a) Explain the command Mann-Whitney Denoml I Denan2 ;
(b) What does the subcommandAlternative 0 mean?
(c) What is the line W = 184 giving you?
(d)Explain the line Test of ETA1 = ETA2 vs. ETA1 not = ETA2 is significantat 0.0151.
Row Denoml Denom2 Row D e n o m l Denom2
1 9 . 3 7 . 0 7 11.3 8 . 5
2 1 3 . 5 9 . 0 a 1 1 . 4 9 . 5
3 1 3 . 7 9.7 9 1 1 . 5 9 . 5
4 1 0 . 2 1 0 . 4 1 0 1 2 . 3 11.1
5 1 0 . 0 1 0 . 9 1
1 1 4 . 2 1 4 . 5
6 1 0 . 7 8 . 0 1 2 1 2 . 5
CHAP. 141 NONPARAMETRIC STATISTICS
IRS Garbage collection Long distance service ,
40 55 60
45 60 65
55 65 70
50 50 70
45 55 70
45 55 75
50 60 75
50 70 70
40 65 80
349
IRS
1.5
4.0
11.5
7.5
4.0
4.0
7.5
7.5
1.5
16.0
MTB > Mann-Whitney 'Denomlu 'Den0m2~;
S U B 0 Alternative 0 .
Mann-Whitney Confidence Interval and Test
Denoml N = 12 Median = 11.450
Denom2 N = 11 Median = 9.500
Point estimate f o r ETA1-ETA2 is 2.000
95.5 Percent CI for ETA1-ETA2 is (0.499,3.501)
W = 184.0
Test of ETAl = ETA2 vs. ETAl not = ETA2 is significant at 0.0151
The test is significant at 0.0150 (adjusted f o r ties)
Garbage collection Long distance service
11.5 16.0
16.0 20.0
20.0 24.0
7.5 24.0
11.5 24.0
11.5 27.5
16.0 27.5
24.0 24.0
20.0 29.5
16.0 29.5
Ans, (a) This command requests a Mann-Whitney test for the data in columns named Denoml and
Denom2.
(6) This subcommand indicates that the hypothesis is two-tailed.
(c) This is the same as R1,
the sum of ranks for sample 1.
(6)This line gives the p value for the test as 0.0151 . This value is close to the p value obtained in
problem 14.7.
KRUSKAL-WALLIS TEST
14.9 Thirty individualswere randomly divided into 3 groups of 10each. Each member of one group
completed a questionnaire concerning the Internal Revenue Service (IRS). The score on the
questionnaire is called the Customer Satisfaction Index (CSI). The higher the score, the greater
the satisfaction. A second set of CSI scores were obtained for garbage collection from the
second group, and a third set of scores were obtained for long distance telephone service. The
scores are given in Table 14.16.Perform a Kruskal-Wallis test to determine if the population
distributionsdiffer. Use significance level a = .01.
Ans. The data in Table 14.16 are combined and ranked as one group. The resulting ranks are given in
Table 14.17. The following rank sums are obtained for the three groups: RI= 65.0, R2 = 154.0,
and R3= 246.0.
Table 14.17
350
BMI Age
22.5 27
24.6 32
28.7 45
30.1 49
18.5 19
20.0 22
24.5 31
25.O 27
27.5 25
30.0 44
NONPARAMETRIC STATISTICS
BMI Age
19.0 44
17.5 18
32.5 29
22.4 29
28.8 40
21.3 39
25.O 21
19.0 20
29.7 52
16.7 19
[CHAP. 14
The computed value of the Kruskal-Wallis test statistic is found as follows:
-3X31 ~ 2 1 . 1 3 8
The critical value is found by using the Chi-square distribution table with df = 3 - 1 = 2 to equal
9.210. The null hypothesis is rejected since the computed value of the test statistic exceeds the
critical value.
14.10 The Minitab analysis for problem 14.9is shown below. Answer the following questions.
(a) Explain the command Kruekal-Wallis 'CSI 'Group'
(b) If the average ranks are multiplied by 10, what do you obtain?
(c) WhatdoestherowH = 2 1 . 1 4 DF = 2 P = 0 . 0 0 0 give you?
MTB > Kruskal-Wallis 'CSI' 'Group'.
Kruskal-WallisTest
Kruskal-Wallis Test on C S I
Group N Median Ave Rank Z
1 1 0 5 . 7 5 0 6 . 5 - 3 . 9 6
2 1 0 1 6 . 0 0 0 1 5 . 4 - 0 . 0 4
3 1 0 24.000 2 4 . 6 4 . 0 0
Overall 3 0 1 5 . 5
H = 2 1 . 1 4 DF = 2 P = 0.000
H = 2 1 . 4 8 DF = 2 P = 0.000 (adjusted for ties)
Ans. (a) The command asks for a Kruskal-Wallis test for the data in the column called CSI. The
column called Group identifies the three samples.
(b) RI= 65, R2 = 154, and R3 = 246.
(c) This row gives the computed value of the test statistic, the degrees of freedom for the Chi-
square distribution, and the p value = 0.0o0.
RANK CORRELATION
14.11 The body mass index (BMI) for an individual is found as follows: Using your weight in
pounds and your height in inches, multiply your weight by 705, divide the result by your
height, and divide again by your height. The desirable body mass index varies between 19 and
25. Table 14.18 gives the body mass index and the age for 20 individuals. Find the Spearman
rank correlation coefficient for data shown in the table.
CHAP. 141 NONPARAMETRIC STATISTICS 351
Ans. Table 14.19contains the ranks for the BMI values and the ranks for the ages given in Table 14.18.
The formula for the Pearson correlation coefficient given in formula (14.4)is applied to the ranks
of the BMI values and the ranks of the ages. Let x represent the ranks of the BMI values and y
represent the ranks of the ages.
Cx=9.0+ 11.0+ 15.0+...+17.0+ 1.O=210,Cx2=81+ 121 + 2 2 5 + . . . + 2 8 9 + 1 ~ 2 8 6 9 ,
Cy = 8.5 + 13.0+ 18.0 + * * + 20 +2.5 = 210, Cy2= 72.25 + 169+ 324 +* * . +400 +6.25 = 2868,
C x y = 9 ~ 8 , 5 +
11 x 13+ 1 5 ~
1 8 + . . . + 1 7 ~ 2 0 +
1 ~2.5=2646.5
(zy)2 - 2868 - 2205 = 663
S,, = Ex2- -= 2869 - 2205 = 664, S,, = Cy2- --
(W2
n n
s,, = cxy - (Cx)(cy) = 2646.5 -2205 = 441.5
n
441.5
,
-
, = 0.665
sxY -
r , = ,
-
-
BMI
9.0
11.0
15.0
19.0
3.0
6.0
10.0
12.5
14.0
18.0
Table
Age
8.5
13.0
18.0
19.0
2.5
6.0
12.0
8.5
7.O
16.5
d664 x 663
14.19
BMI
4.5
2.0
20.0
8.O
16.0
7.0
12.5
4.5
17.0
1.o
AGE
16.5
1.o
10.5
10.5
15.0
14.0
5.O
4.0
20.0
2.5
14.12 Use the results found in problem 14.11 to test the null hypothesis Ho: ps= 0 vs. Ha: psf 0 at
level of significance a = .05.
Ans. The critical values are k1.96. The computed value of the test statistic is found as follows: z* =
J
G
x .665 = 2.90. We conclude that there is a positive association between age and body
mass index.
RUNS TEST FOR RANDOMNESS
14.13 The gender of the past 30 individuals hired by a personnel office are as follows, where M
represents a male and F represents a female.
FFFMMMMFMFMFFFFMMMMMMMFFMMMFFF
Test for randomness in hiring using level of significance a = .05.
Ans. There are nl = 14 F’s and n2 =16 M’s. There are R = 11 runs. If the hiring is random, the mean
number of runs is
l4
nl + n 2 30
l6 +1= 15.933
p = -+I=
2n1n2
and the standard deviation of the number of runs is
2nln2(2n,n2- n l - n 2 ) - 2 ~ 1 4 ~ 1 6 x 4 1 8
(T= /
=
- d=2
352 NONPARAMETRIC STATISTICS [CHAP. 14
The computed value of the test statistic is Z* = %-
- - 5*933= -1.84. The critical values
0 2.679
are fl.96, and since the computed value of the test statistic is not less than -1.96, we are unable
to reject randomness in hiring with respect to gender.
14.14 The Minitab analysis for the runs test is shown below. Compute the p value correspondingto
the value Z* =-1.84in problem 14.13and compare it with the p value given in the Minitab output.
c1
0 0 0 1 1 1 1 0 1 0 1 0 0 0 0 1 1 1
1 1 1 1 0 0 1 1 1 0 0 0
MTB > Runs .5 C1.
The observed number of runs = 11
The expected number of runs = 15.9333
16 Observations above K 14 below
The test is significant at 0.0658
Cannot reject at alpha = 0.05
Ans. The area to the left of Z*= -1.84 is .5 - .4671 = .0329. Since the alternative hypothesis is two-
tailed, the p value = 2 x .0329 = 0.0658. This is the same as the p value given in the Minitab
output.
Supplementary Problems
NONPARAMETRICMETHODS
14.15
14.16
Three mathematical formulas for dealing with ranks are often utilized in nonparmetric methods. These
formulas are used for summing powers of ranks. They are as follows:
n(n +1)
1 + 2 + 3 + . . . + n = -
2
n(n +1)(2n+1)
6
l2+22+3*+- ' +n2 =
n2( n + ~ ) ~
l3+23+ 33+. . +n3 =
4
Use the above special formulas to find the sum, sum of squares, and sum of cubes for the ranks 1
through 10.
Am. sum = 55, sum of squares = 385, and sum of cubes = 3,025.
Nonparametric methods often deal with arrangements of plus and minus signs. The number of different
possible arrangements consisting of a plus signs and b minus signs is given by or
this formula to determine the number of different arrangements consisting of 2 plus signs and 3 negative
signs and list the arrangements.
5 !
Ans. The number of arrangements is = -
-
- 10.The arrangements are as follows:
(32! x 3!
CHAP. 141
35
66
NONPARAMETRIC STATISTICS
70 50 52 88
44 52 26 26
353
+ + ---, +-+--, +--+-, + - - - +, -++--, -+-+-, -+--+, --++-,
--+-+, --- ++
SIGN TEST
14.17
14.18
The police chief of a large city claims that the median response time for all 911calls is 15 minutes. In a
random sample of 250 such calls it was found that 149 of the calls exceeded 15 minutes in response
time. All the other response times were less than 15 minutes. Do these results refute the claim? Use
level of significance .05. Assume a two-tailed alternative hypothesis.
Ans. Assuming the claim is correct, we would expect p.= np = 250 x .5 = 125calls on the average. The
standard deviation is equal CJ = & = 7.906. The z value corresponding to 149 calls is 2.97.
Since the computed z value exceeds 1.96, the results refute the claim.
The claim is made that the median number of movies seen per year per person is 35. Table 14.20 gives
the number of movies seen last year by a random sample of 25 individuals.
Table 14.20
45 29 23 25 25
Test the null hypothesis that the median is 35 vs. the alternative that the median is not 35.Use a = .05.
Am. Table 14.21 gives a plus sign if the value in the corresponding cell in Table 14.20 exceeds 35, a 0
if the corresponding cell value equals 35 and a minus sign if the corresponding cell value is less
than 35. There are 12plus signs, 10minus signs, and 3 zeros.
Table 14.21
The below Minitab output indicates a p value = 0.8318, indicating that there is no statistical
evidence to reject the claim that the median is 35.
MTB > print cl
1 9 39 45 35 66 44 12 29 70 44 37
3 1 23 50 52 29 3 5 25 52 26 35 41
25 aa 2 6
MTB > STest 35 'movies';
SUBC> Alternative 0.
SignTest for Median
Sign test of median = 35.00 vs. not = 35.00
N Below Equal Above P Median
Movies 2 5 10 3 1 2 0.8318 35.00
354 NONPARAMETRIC STATISTICS [CHAP. 14
WILCOXON SIGNED-RANK TEST FOR TWO DEPENDENT SAMPLES
14.19 Use the normal approximation to the Wilcoxon sign-rank test statistic to solve problem 14.18. Note that
this application of the Wilcoxon sign-rank test does not involve dependent samples. We are testing the
median value of a population using a single sample.
Ans. Table 14.22 shows the computation of the signed ranks. Note that 0 differences are not ranked and
are omitted from the analysis. The sum of positive ranks is W+= 150.5,and the absolute value of
the sum of the negative ranks is W = 102.5. The test statistic used in the normal approximation
is:
I50.5-22~ 2 3
/ 4
= 0.779
-
-
W -n(n +1)/4
Jn(n +1)(2n+1)/24
z*=
422 x 23x 45 / 24
Since the computed value of the test statistic does not exceed 1.96, the null hypothesis is not
rejected.
X = Number
19
39
45
35
66
44
12
29
70
44
37
31
23
50
52
29
35
25
52
26
35
41
25
88
26
D = X - 3 5
-16
4
10
0
31
9
-23
-6
35
9
2
-4
-12
15
17
-6
0
-10
17
-9
0
6
-10
53
-9
Table 14.22
ID1
16
4
10
0
31
9
23
6
35
9
2
4
12
1s
17
6
0
10
17
9
0
6
10
53
9
Rank of ID I
16.0
2.5
12.0
20.0
8.5
19.0
5.o
21.0
8.5
1.O
2.5
14.0
15.0
17.5
5.O
12.0
17.5
8.5
5.O
12.0
22.0
8.5
Not used
Not used
Not used
Signed rank
-16.0
2.5
12.0
20.0
8.5
-19.0
-5 .O
21.o
8.5
1.o
-2.5
-14.0
15.0
17.5
-5 .O
-12.0
17.5
-8.5
5.0
-12.0
22.0
-8.5
14.20 The Minitab output for problem 14.19 is shown below. Answer the following questions.
(a) Explain the command m o s t 35 'movies' ;
(6) Explain the subcommand Alternative 0 .
(c) Explain the output.
(d) Compute the p value corresponding to the value Z* = 0.779 in problem 14.29 and compare it with
the p value given in the Minitab output.
MTB > print cl
1 9 3 9 45 3 5 66 44 1 2 2 9 7 0
44 37 3 1 23 50 52 29 3 5 2 5
52 26 35 4 1 2 5 88 2 6
CHAP. 141 NONPARAMETRIC STATISTICS
Men
I
3
4
6
8.5
8.5
1 1
11
15.5
19
355
Women
2
6
6
11
13
14
15.5
17
18
20
MTB > WTest 35 'movies';
SUBC> Alternative 0 .
Wilcoxon Signed RankTest
Test of median = 3 5 . 0 0 vs. median not = 35.00
N f o r Wilcoxon Estimated
N Test Statistic P Median
Movies 25 22 1 5 0 . 5 0.445 3 7 . 5 0
Ans. (a) Wtest indicates a Wilcoxon sign-rank test. The number 35 is the median value to be tested.
Movies indicates the column containing the sample data.
(h) The subcommand indicates that the research hypothesis is two-tailed.
(c) The output indicates that the sample size is 25. However only 22 of the values were used
since there were 3 differences that were 0. The value for the Wilcoxon Statistic is the same as
W+in problem 14.19.The p value is 0.445.
(d) P value = 2 x (.5 - .2823) = 0.4354
WILCOXON RANK-SUM TEST FOR TWO INDEPENDENT SAMPLES
14.21 The leading cause of posttraumatic stress disorder (PTSD) among men is wartime combat and among
women the leading causes are rape and sexual molestation. The duration of symptoms was measured for
a group of men and a group of women. The times of duration in years for both groups are given in
Table 14.23. Use the Wilcoxon rank-sum test to test the research hypothesis that the median times of
duration differ for men and women. Use level of significance a = .01. Use the normal approximation
procedure to perform the test.
Table
Men
2.5
3.3
3.5
4.0
5.O
5.0
6.5
6.5
10.0
15.0
14.23
Women
3.0
4.0
4.O
6.5
7.0
8.5
10.0
12.5
13.0
17.5
Ans. Table 14.24gives the rankings when the two samples are combined and ranked together.
Table 14.24
RI = 87.5 I R2 = 122.5
356 NONPARAMETRIC STATISTICS [CHAP. 14
The sample sizes are nl = 10and n2 = 10.The rank sum associated with sample 1 is RI= 87.5. The
test statistic is computed as follows:
RI-nl(n1 +n2+1)/2 875-10(10+10+1)/2
Jnln2(nl +n2 +I )/ 12
= -1.32
-
Z = -
JIOX I
O
X (10+ 10+ I)/ 12
The critical values are & 2.58. Since the computed value of the test statistic does not exceed the
left side critical value, the null hypothesis is not rejected. The p value is computed as follows. The
area to the left of -I .32 is .5 - .4066 = .0934,and since the alternative is two sided, the p value =
2 x .0934 = 0.1868. If the p value approach to testing is used, we do not reject the null since the p
value is not less than the preset a.
14.22 The Minitab solution for problem 14.21 is shown below. Answer the following questions.
(a) Explain the command line Mann-Whitney 95.0 I Men I
(6) Explain the subcommand Alternative 0.
(c) Discuss the output.
Women' ;
MTB > print
Row Men
1 2.5
2 3.3
3 3.5
4 4.0
5 5 . 0
6 5 . 0
7 6.5
8 6.5
9 10.0
10 15.0
cl c2
Women
3.0
4.0
4.0
6.5
7 . 0
8 . 5
1 0 . 0
12.5
13.0
17.5
MTB > Mann-Whitney 95.0 'Men' lWomen';
SUBC> Alternative 0.
Mann-WhitneyConfidenceIntervaland Test
Men N = 10 Median = 5.000
Women N = 10 Median = 7.750
Point estimate for ETA1-ETA2 is -2.250
95.5 Percent CI f o r ETA1-ETA2 is (-6.503,1.002)
W = 8 7 . 5
Test of ETAl = ETA2 vs. ETAl not = ETA2 is significant at 0.1988
The test is significant at 0.1971 (adjusted for ties)
Cannot reject at alpha = 0.05
Ans. (a) The command line requests a Mann-Whitney analysis using the data in the columns labeled
Men and Women. Set a 95%confidence interval on the difference in the medians for the two
populations.
(6) The subcommand indicates that the test is two-tailed.
(c) The output gives a 95% confidence interval on the difference in the two population medians.
The p value is given to be 0.1988. We are unable to reject the null hypothesis.
KRUSKAL-WALLIS TEST
14.23 The cost of a meal, including drink, tax and tip was determined for ten randomly selected individuals in
New York City, Chicago, Boston, and San Francisco. The results are shown in Table 14.25. Test the
null hypothesis that the four population distributions of costs are the same for the four cities vs. the
hypothesis that the distributions are different. Use level of significance a = .05.
CHAP. 141 NONPARAMETRIC STATISTICS 357
New York
21.oo
23.00
23.50
25.OO
25.50
30.00
30.00
33.50
35.00
40.00
Table
Chicago
15.OO
17.50
19.00
2I s o
22.00
23.75
25.00
26.00
27.50
33.00
14.25
Boston
15.50
16.00
18.50
19.50
20.00
22.50
25.25
26.50
30.00
37.50
San Francisco
16.50
18.25
21.25
24.25
26.25
27.75
29.00
31.25
35.50
42.25
Ans. Table 14.26gives the results when the four samples are combined and ranked together.
Table 14.26
New York
11.0
16.0
17.0
20.5
23.0
31.0
31.0
35.0
36.0
39.0
Chicago
1.o
5.O
8.0
13.0
14.0
18.0
20.5
24.0
27.O
34.0
Boston
2.0
3.0
7.0
9.0
10.0
15.0
22.0
26.0
31.0
38.0
San Francisco
~~ ~ ~
4.0
6.0
12.0
19.0
25.O
28.0
29.0
33.0
37.0
40.0
I RI =259.50 I R3= 164.50 I R I =163.00 I L=233.00
The critical value is 7.815. We cannot conclude that the distributions are different.
14.24 The Minitab output for the Kruskal-Wallistest in problem 14.23is shown below.
(a) Explain the command line.
(b) Explain the output.
MTB > Kruakal-Wallis 'cost' 'city'.
Kruskal-WallisTest
Kruskal-Wallis T e s t on cost
C i t y N Median A v e Rank Z
1 1 0 27.75 26.0 1.70
2 1 0 2 2 . 8 8 1 6 . 5 -1.27
3 1 0 21.25 16.3 - 1 . 3 1
4 1 0 27.00 23.3 0.87
Overall 40 2 0 . 5
H = 5 . 2 4 DF = 3 P = 0.155
H = 5.24 DF = 3 P = 0.155 (adjusted for ties)
Ans. (a) The command line Kruskal-Wallis cost city'.requests a Kruskal-Wallis test
(6) The output lists the 4 cities, the median cost for each city, and the mean rank for each city. In
for the responses in the column called cost and the cities identified in the column called city.
addition, the computed Kruskal-Wallis test statistic is H = 5.24. The p value is 0.155.
358 NONPARAMETRIC STATISTICS [CHAP. 14
Percent Fat Calories
40
35
33
29
35
36
30
36
33
28
39
RANK CORRELATION
Lead
13
12
1 1
8
13
15
I 1
14
12
9
15
14.25 Table 14.27 gives the percent of calories from fat and the micrograms of lead per deciliter of blood for a
sample of preschoolers. Find the Spearman rank correlation coefficient for data shown in the table.
28.2 28.2 27.9 27.8
27.9 27.8 28.1 28.2
28.1 28.0 27.9 27.8
27.8 28.0 28.3 28.1 28.2 28.3
28.1 27.7 28.4 28.2 28.3 28.3
27.6 27.8 27.9 28.1 28.3 28.5
Ans. The Spearman correlation coefficient = 0.919.
A A B B B B A A A A
B B A A A B A A A A
A B B B B B B A A A
-
-
-
-
-
-
-
-
-
-
14.26 Test for a positive correlation between the percent of calories from fat and the level of lead in the blood
using the results in problem 14.25.
A A B B B B A A A A
B B A A A B A A A A
A B B B B B B A A A
-
-
-
-
-
-
-
-
-
-
Ans. The computed test statistic is 3.05 and the a = .05 critical value is 1.65. We conclude that a
positive correlation exists.
RUNS TEST FOR RANDOMNESS
14.27 The first 100decimal places of TI contain 51 even digits (0,2,4,6, 8) and 49 odd digits. The number of
runs is 43. Are the occurrences of even and odd digits random?
Am. Assuming randomness, p = 50.98, 0 = 4.9727, and z* = -1.60. Critical values = k 1.96. The
occurrences are random.
14.28 The weights in grams of the last 30 containers of black pepper selected from a filling machine are
shown in Table 14.28.The weights are listed in order by rows. Are the weights varying randomly about
the mean‘?
Am. The mean of the 30 observations is 28.06. Table 14.29 shows an A if the observation is above
28.06 and a B if the observation is below 28.06.
Table 14.29
The sequence AABBBBAAAABBAAABAAAAABBBBBBAAA has 17 A’s, I3 B’s, and 9 runs.
If the weights are varying randomly about the mean, the mean number of runs is 15.7333 and the
standard deviation is 2.6414. The computed value of z is z* = -2.55. The critical values for a =
.05 are k I .96. Randomness about the mean is rejected.
Appendix 1
Binomial Probabilities
P
n x .OS .lO .20 .30 .40 S O .60 .70 .80 .90 .95
.9500
.0500
.9000
.loo0
.8OOo
.2000
.7000
.3000
.6000 SO00
.4000 SO00
.4O00
.6O00
.3OOo
.7OOo
.2000
.go00
.1OOo
.9OOo
.0500
.9500
1 0
1
2 0
1
2
.9025
.0950
.0025
.8100
.1800
.o100
,6400
.3200
,0400
.4900
.4200
.0900
.3600 ,2500
.4800 SO00
.1600 .2500
.1600
.4800
.3600
.woo
.4200
.4900
.0400
.3200
.6400
.0100
.1800
.8100
.0025
.0950
.9025
3 0
1
2
3
3574
.1354
,0071
.o001
.7290
.2430
.0270
.oo10
.5 120
.3840
.0960
.0080
.3430
.4410
.1890
.0270
.2160 .1250
.4320 .3750
,2880 .3750
.0640 .1250
.0640
.2880
.4320
.2160
.0270
.1890
.4410
.3430
.0080
.0960
.3840
.5 120
.0010
,0270
.2430
.7290
.0001
.0071
.1354
3574
.8145
.1715
.O135
,0005
.OoOo
.6561
,2916
.0486
.0036
.ooo1
.4096
.4096
.I536
.0256
.0016
.2401
.4116
.2646
.0756
.0081
.1296 .0625
.3456 .2500
.3456 .3750
.1536 ,2500
.0256 .0625
.0256
.1536
.3456
,3456
.1296
.0081
.0756
.2646
.4116
.2401
.0016
.0256
.1536
.4096
.4096
.OOo1
.0036
.0486
.2916
.6561
.m
.0005
.O135
.1715
.8145
4 0
1
2
3
4
5 0
1
2
3
4
5
.7738
.2036
.0214
.0011
.oooo
.moo
S905
,3280
.0729
.0081
.o004
.oooo
.3271
.4096
.2048
.0512
.0064
.WO3
.1681
.3602
.3087
.1323
.0283
.0024
.0778 .0312
.2592 .1562
.3456 .3125
.2304 .3125
.0768 .1562
.0102 .0312
.o102
.0768
.2304
.3456
.2592
.0778
.0024
.0284
.1323
.3087
.3601
.1681
.0003
.0064
.0512
.2048
.4096
.3277
.m
.o005
.0081
.0729
.3281
S905
.m
.m
.0011
.0214
.2036
.7738
6 0
1
2
3
4
5
6
.7351
.2321
.0305
.0021
.WO1
.m
.o000
.5314
.3543
.0984
.O146
.oo12
.ooo1
.m
.2621
.3932
.2458
.0819
.O154
.0015
.oO01
.1176
.3025
.3241
.1852
.0595
.o102
.oO07
.0467 .0156
.1866 .0937
.3110 .2344
.2765 .3125
.1382 .2344
.0369 .0937
.0041 .0156
.0041
,0369
.1382
.2765
.3110
.1866
.0467
.oO07
.o102
.0595
.1852
.3241
.3025
.1176
.o001
.0015
.O154
.OS19
.2458
.3932
.2621
.m
.o001
.0012
.O146
.0984
.3543
S314
.oooo
.m
.0001
.0021
.0305
.2321
.7351
7 0
1
2
3
4
5
6
7
.6983
.2573
.0406
.0036
.WO2
.oooo
.m
.m
.4783
.3720
.1240
.0230
,0026
.0002
.oooo
.0000
2097
.3670
.2753
.I 147
.0287
.0043
.o004
.m
.0824
.2471
.3177
.2269
.0972
.0250
.0036
.0002
.0280 .0078
.1306 .0547
.2613 .1641
.2903 .2734
.1935 .2734
.0774 .1641
.0172 .0547
.0016 .0078
.0016
.0172
.0774
.1935
.2903
.2613
.1306
.0280
.o002
.0036
.0250
.0972
.2269
.3177
.2471
.0824
.m
.o004
. w 3
.0287
.I 147
.2753
.3670
.2097
.m
.m
.o002
.0026
.0230
.1240
.3720
.4783
.m
.oooo
.m
.0002
.0036
.0406
.2573
.6983
8 0
1
2
3
4
.6634
.2793
.05 15
.0054
.0004
.4305
.3826
.1488
.0331
.0046
.1678
.3355
.2936
.1468
.0459
.0576
.1977
.2965
.2541
.1361
.0168 .0039
.0896 .0312
.2090 .1094
.2787 .2187
.2322 .2734
.o007
.0079
.0413
.123Y
.2322
.WO1
.oo12
.o100
.0467
.1361
.oOOo
.o001
.0011
.0092
.0459
.oOOo
.ooO0
.oOoO
.o004
.0046
.m
.m
.m
.m
.0004
359
360 BINOMIALPROBABILITIES [APPENDIX 1
P
n x .05 .lO .20 .30 .40 S O .60 .70 .80 .90 .95
.m
.m
.m
.0000
.oO04
.m
.m
.moo
.0092
,0011
.o001
.oooo
.0467
.o100
.0012
.oO01
.1239
.0413
.0079
.0007
.2187
.I094
.0312
,0039
.2787
.2090
.0896
.O168
.2541
.2965
.1977
.0576
.1468
.2936
.3355
.1678
.0331
.1488
,3826
.4305
.0054
.0515
.2793
.6634
9 0
1
2
3
4
5
6
7
8
9
.6302
.2985
.0629
.0077
.0006
.oooo
.m
.m
.oOOo
.m
.3874
.3874
,1722
.0446
.0074
.0008
.ooo1.
.m
.moo
.0000
.1342
,3020
.3020
.1762
.0661
.O165
,0028
.o003
.0000
.oo00
.0404
.1556
,2668
.2668
.1715
.0735
.0210
.0039
.WO4
.oooo
,0101
.0605
.1612
.2508
.2508
.1672
.0743
.0212
.0035
.oO03
.0020
.O176
.0703
,1641
.2461
.2461
.1641
.0703
.0176
,0020
.0003
.0035
.0212
.0743
.1672
.2508
.2508
.1612
.0605
.0101
.m
.oO04
.0039
.0210
.0735
.1715
.2668
,2668
.1556
.0404
.oooo
.oooo
.oO03
,0028
.O165
.0661
.1762
.3020
.3020
,1342
.oooo
.m
.m
,0001
.0008
.0074
,0446
.1722
.3874
.3874
.0000
.0000
.m
.m
.m
,0006
.0077
.0629
.2985
.6302
.0282
.1211
.2335
.2668
.2001
.1029
.0368
.0090
.OO 14
.ooo1
.oo00
.0060
.0403
.1209
.2150
.2508
.2007
.1115
.0425
.OI06
.0016
.0001
.0010
.0098
.0439
.I 172
.2051
.2461
.2051
.I 172
.0439
.0098
.0010
.o001
.OO16
.0106
,0425
.1115
.2007
.2508
.2150
.1209
.0403
.#60
.moo
.WO1
DO14
.0090
.0368
.1029
.2001
.2668
.2335
.1211
.0282
.m
.moo
.o001
.oO08
.0055
.0264
.0881
.2013
.3020
.2684
.1074
10 0
1
2
3
4
5
6
7
8
9
10
S987
.3151
.0746
.O105
.0010
.o001
.m
.0000
.0000
.m
.oooo
.3487
.3874
.1937
.0574
.0112
.0015
.o001
.m
.moo
.0000
.oooo
.1074
.2684
.3020
.2013
.0881
.0264
,0055
.0008
.ooo1
.m
.oooo
.oooo
.oOOo
.OoOo
.woo
.o001
.0015
.0112
.0574
.I937
.3874
.3487
.OoOo
.0000
.oooo
.m
.m
.ooo1
.0010
.O105
.0746
.3151
5987
.0036
.0266
.0887
.1774
.2365
.2207
.1471
.0701
.0234
.0052
.OOo7
.oOOo
.oooo
.oO07
,0052
.0234
.0701
.1471
.2207
-2365
.1774
.0887
,0266
.0036
.ooO0
.oooo
.0005
.0037
.O173
.0566
.1321
.2201
.2568
.1998
.0932
.O198
.m
.0000
.oooo
.0002
.0017
.0097
.0388
.I 107
.2215
.2953
,2362
.OM9
.oooo
.m
.moo
.0000
.m
.0000
.0001
.0014
.0137
.0867
.3293
5688
1 1 0
1
2
3
4
5
6
7
8
9
10
11
S688
.3293
.0867
.0137
.0014
.WO1
.m
.OooO
.0000
.oooo
.m
.m
.3138
.3835
.2131
.0710
.0158
.0025
.0003
.m
.oOOo
.oooo
.oOOo
.m
.0859
.2362
.2953
.2215
.I 107
.0388
.0097
.0017
.o002
.oooo
.m
.oo00
.O198
.0932
,1998
.2568
.2201
.1321
.0566
.0173
.0037
.0005
.moo
.oooo
.0005
.0054
.0269
.Of306
.1611
.2256
.2256
.1611
.0806
.0269
.0054
.0005
.0000
.oooo
.oooo
.m
.0000
.o003
.0025
.0158
.0710
.2131
.3835
.3138
12 0
1
2
3
4
5
6
7
8
9
,5404
.3413
.0988
.O173
.0021
.o002
.m
.m
.oooo
.OoOo
.2824
,3766
.2301
.0852
.0213
.0038
.0005
.oOOo
.OoOo
.oooo
.0687
,2062.
.2835
.2362
.1329
.0532
.OI55
.0033
.0005
.oO01
.0138
.0712
.1678
.2397
.2311
.1585
.0792
.0291
.0078
.OO15
.0022
.O174
,0639
.1419
.2128
.2270
.1766
.1009
.0420
.O125
.0002
,0029
.0161
.0537
,1208
.1934
.2256
.1934
.1208
.0537
.oooo
.o003
.0025
.O125
.0420
,1009
.1766
,2270
.2128
.1419
.m
.oooo
.0002
DO15
.0078
.0291
.0792
.1585
.2311
.2397
.m
.oOOo
.oooo
.o001
.oO05
.0033
.O155
.0532
.1329
,2362
.oooo
.oooo
.oooo
.oooo
.OOoo
.oooo
.ooo5
.0038
.0213
.0852
.0000
.oooo
.woo
.0000
.m
.OoOo
.oooo
.0002
.0021
.0173
APPENDIX I] BINOMIALPROBABILITIES 361
P
n x .05 .lO .20 30 .40 S O .60 .70 .80 .90 .95
1
0
11
12
.oooo
.OOoo
.oooo
.m
.oooo
.oooo
.m
.oOOo
.m
.o002
.m
.OoOo
.0025 .
0
1
6
1
.OOo3 .0029
.oooo .OOo2
.0639
.O174
.0022
.1678
.07
12
.0138
.2835
.2062
.0687
.2301
.3766
.2824
.0988
.34
13
5404
.5 133
.35
12
.
1
1
0
9
.02
1
4
.0028
.OOo3
.m
.m
.moo
.m
.0000
.GO00
.oOOo
.m
.2542
.3672
.2448
.0997
.0277
,0055
.OOo8
*OOo1
.m
.OOoO
.oOOo
,0000
.oooo
.OOoo
.0550
.1787
.2680
.2457
.1535
.069
1
.0230
.0058
.0011
.0001
.oo00
.oOOo
.m
.m
.
0
0
9
7
.0540
.1388
.2181
.2337
.1803
.1030
.0442
.0142
.0034
.o
00
6
.o001
.oooo
.m
.0013 .
o
0
0
1
.0113 .
0
0
1
6
.0453 .0095
.1107 .0349
.I845 .0873
.2214 .1571
.1968 .2095
.1312 .2095
.0656 .1571
.0243 ,0873
.0065 .0349
.0012 .0095
.OOo1 .0016
.m .
o
0
0
1
.m
.ooo1
.0012
.0065
.0243
.0656
,1312
.I968
.2214
,1845
.1107
.0453
.0113
,0013
.oOOo
.OOoo
.OOo1
.OOo6
.0034
.0142
0442
.I030
.1803
.2337
.2181
.1388
.0540
.0097
.m
.m
.m
.m
.
o
0
0
1
.0011
.0058
.0230
.
0
6
9
1
.1535
.2457
.2680
.1787
.0550
.m
.oooo
.m
.m
.m
.m
.OOo1
.OOo8
.0055
.0277
.0997
.2448
.3672
.2542
.OoOo
.OoOo
.0000
.m
.m
.m
.m
.m
.0003
.0028
.02
1
4
. I 1
0
9
.35
12
.5 133
13 0
1
2
3
4
5
6
7
8
9
1
0
11
12
13
1
4 0
1
2
3
4
5
6
7
8
9
10
11
12
13
1
4
.4877
.3593
.I229
.0259
.0037
.o004
.m
.oooo
.OOoO
.oooo
.m
.m
.m
.moo
.OOoO
,2288
.3559
,2570
.1142
.0349
.0078
.OO13
.OOo2
.m
.moo
.0000
.OOoo
.m
.oOOo
.0000
.0440
.1539
.2501
.2501
.1720
.0860
.0322
.#92
.0020
.oO03
.m
.OoOo
.OoOo
.oooo
.0000
.0068
.0
40
7
.1134
.1943
.2290
.1963
.1262
.
0
6
18
.0232
.0066
.
0
0
1
4
.o002
.m
.oooo
.o000
.OOo8 .
o
0
0
1
,0073 .
0
0
0
9
.0317 .0056
.0845 .0222
.1549 .
0
6
1
1
.2066 .1222
.2066 .1833
.1574 .2095
.0918 .1833
.MO8 .1222
.0136 .
0
6
1
1
.0033 .0222
.OOo5 .0056
.oO01 ,0009
.moo .o001
.0000
.OOO1
.OOo5
.oo33
.0136
9408
.
0
9
18
* 1574
.2066
.2066
.1549
.0845
.03
17
.073
.WO8
.m
.m
.OOoO
.OOo2
.OO 1
4
.
0
0
6
6
.0232
.06
I8
.1262
.1963
.2290
.1943
.I134
.MO7
.0068
.oooo
.OOOo
.moo
.m
.0000
.o003
.oo20
.0092
.0322
.0860
.1720
.2501
.2501
.1539
.0440
.0000
.moo
.m
.moo
.oooo
.m
.m
.OOo2
.0013
.0078
.0349
.1142
.2570
.3559
.2288
.OoOo
.m
.m
.m
.oOOo
.m
.moo
.m
.oOOo
.0004
.0037
.0259
.1229
.3593
.4877
1
5 0
1
2
3
4
5
6
7
8
9
1
0
11
12
13
1
4
15
.4633
.3658
.1348
.0307
.
0
0
4
9
.OOo6
.oOOo
‘.0000
.moo
.oooo
.oooo
.m
.moo
.oooo
.m
.oooo
.2059
.3432
.2669
.1285
.0428
.O105
.0019
.o003
.m
.oooo
.oooo
.0000
.0000
.oooo
.moo
.m
.0352
.1319
.2309
.250
1
.1876
.1032
.0430
.0138
.0035
.
0
0
0
7
.ooo1
.m
.m
.oooo
.OoOo
.oooo
. w 7
.0305
.
0
9
1
6
.1700
.2
186
.206
1
.1472
.0811
.0348
,0116
.0030
.
0
0
0
6
.OOo1
.oooo
.0000
.m
.0005 .OOOO
.0047 .0005
.0219 .0032
.0634 .0139
.1268 .0417
.1859 .
0
9
1
6
.2066 .1527
.1771 .1964
.1181 .I964
.0612 .I527
.0245 .
0
9
1
6
-0074 .
0
4
1
7
.0016 .0139
.0003 .0032
.0000 .WO5
.m .oOOo
.OOoo
.0000
.0003
.0016
.0074
.0245
.
0
6
12
.1181
.1771
.2066
.1859
.I268
.0634
.02
1
9
.0047
.OOo5
.m
.m
.m
.OOo1
.m
.0030
.0116
.0348
.OS1 1
.1472
.206
1
.2
186
.1700
.
0
9
1
6
.0305
.0047
.m
.m
.oOOo
.m
.oOOo
.0001
.WO7
.oo35
.0138
0430
.1032
.1876
.250
1
.2309
.1319
.0352
.OOoo
.oooo
.m
.0000
.m
.oOOo
.moo
.oo00
.o003
.0019
.O105
.0428
-1285
.2669
.3432
.2059
.m
.m
.moo
.moo
.oOOo
.oooo
.m
.m
.oOoO
.0000
.w06
.0049
.0307
.1348
.3658
.4633
362 BINOMIALPROBABILITIES [APPENDIX 1
P
n x .05 .lO .20 .30 .40 S O .60 .70 .80 .90 .95
16 0
1
2
.4401
,3706
.1463
.0359
.0061
.o008
.oO01
.m
.m
.m
.m
.
o
O
O
O
.m
.OOOO
.oOOo
.OOOO
.m
.1853
,3294
.2745
,1423
.0514
.O137
.0028
.oO04
.0001
.oOOo
.m
.m
.oooo
.m
.oooo
.m
.OoOo
.0281
.1126
.2111
.2463
.2001
.1201
.0550
.O197
.0055
.0012
.0002
.m
.m
.m
.m
.m
.m
.0033
.0228
.0732
.1465
.2040
.2099
.1649
.lOlO
.0487
.O185
,0056
.0013
.WO2
.oooo
.m
.m
.m
.o003
.0030
.O150
.0468
.1014
.1623
.1983
.1889
.1417
.0840
.0392
.O142
.
W
O
.o008
.o001
.oooo
.oooo
.m
.o002
.0018
.0085
.0278
.0667
.1222
.1746
.1964
.1746
.1222
,0666
.0278
,0085
.oo18
,0002
.moo
.oooo
.m
.ooo1
.0008
,0040
.O142
.0392
.0840
.1417
.1889
.1983
,1623
.1014
.0468
.O150
.0030
,0003
.oooo
.m
.m
.m
.o002
.0013
.0056
.O185
.0487
.1010
,1649
.2099
.2040
.1465
,0732
.0228
.0033
.m
.0000
.0000
.m
.m
.m
.o002
.0012
.0055
.O197
.OS0
.1201
.2001
.2463
.2111
.1126
.0281
.m
.oOOo
.m
.m
.m
.o000
.0000
.oooo
.OOo1
.0004
.0028
.O137
.0514
.1423
.2745
.3294
.1853
.oooo
.oooo
.oooo
.0000
.m
.m
.m
.m
.m
.m
.0001
.0008
.0061
,0359
.1463
.3706
.4401
3
4
5
6
7
8
9
10
11
12
13
14
15
16
.1668
.3150
.2800
.1556
,0605
.O175
.0039
.OOO7
.o001
.m
.moo
.0000
.oOOo
.oooo
.
o
O
O
O
.OOOO
.oooo
.m
.0225
.0957
.1914
,2393
.2093
.1361
.0680
.0267
.0084
.0021
.0004
.o001
.m
.m
.oOOo
.oooo
.m
.oOOo
.0023
.O169
.058
1
.1245
,1868
.2081
.1794
.1201
,0644
.0276
.0095
.0026
.0006
.o001
.m
.oooo
.moo
.m
.WO2
.0019
.o102
.0341
,0796
.1379
.1839
.1927
.1606
.1070
.0571
.0242
.0081
.002I
.0004
,0001
.o000
.oOOo
.m
.ooo1
.0010
.0052
.O182
.@I72
.0944
.1484
.1855
.1855
.1484
.0944
.0472
.0182
,0052
.0010
.o001
.moo
.m
.oooo
.o001
.OOO4
.0021
.0081
.0242
.0571
.1070
,1606
,1927.
.1839
.1379
.0796
.0341-
.o102
.OO19
.WO2
.0000
.m
.moo
.0000
.o001
.o006
.0026
.0095
.0276
.OM4
.1201
.1784
.2081
.1868
.1245
.0581
.O169
DO23
.0000
.0000
.m
.moo
.0000
.m
.o001
.0004
.0021
.0084
.0267
,0680
.1361
.2093
.2393
.1914
.0957
.0225
.oooo
.m
.oooo
.oooo
.oooo
.oooo
.OOOo
.0000
.m
.o001
.o007
.0039
,0175
.0605
.1556
.2800
.3150
,1668
.m
.oooo
.oooo,
.m
.moo
.oooo
.oooo
.m
.m
.oooo
.m
.0001
.0010
.0076
.0415
,1575
.3741
,4181
17 0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
.4181
.3741
.1575
.0415
.0076
.0010
.o001
.oooo
.oooo
.m
.m
.m
.m
.oooo
.oOOO
.m
.oooo
.0000
18 0
1
2
3
4
5
6
7
8
9
10
11
12
.3972
.3763
.1683
.0473
.0093
.0014
.o002
.m
.woo
.oooo
.0000
.m
.o000
.1501
,3002
.2835
.1680
,0700
.0218
.0052
.oo10
.0002
.m
.oooo
.moo
.m
.0180
.081I
.1723
.2297
.2153
.1507
.OS16
-0350
.o120
.0033
.oO08
,000
1
.0000
.OO16
.O126
-0458
.1046
.1681
.2017
.1873
.1376
.OS11
.0386
.O149
. W 6
.oo12
.o001
.oo12
,0069
.0246
.0614
.1146
.1655
.1892
.1734
.1284
,0771
,0374
.O145
.m
.ooo1
.0006
.0031
.0117
.0327
,0708
,1214
.1669
.1855
.1669
.1214
.0708
.oooo
.oooo
.woo
.o002
.oo11
.0045
.O145
.0374
.077f
.1284
.1734
.1892
.1655
.m
.m
.oooo
.oooo
.0000
.o002
.0012
.0046
.O149
.0386
.OS11
.1376
,1873
.oOOo
.oooo
.0000
.oooo
.0000
.oooo
.m
.o001
.0008
.0033
.o120
.0350
.0816
.m
.oooo
.m
.moo
.0000
.oooo
.m
.0000
.oooo
.m
.o002
.0010
.0052
.moo
.oooo
.0000
.0000
.m
.0000
.oOOo
.m
.OOoo
.moo
.oooo
.0000
.oO02
APPENDIX 11 BINOMIAL PROBABILITIES 363
P
-
.20
~ ~~~
.30 .40 S O .60 .70 .80 .90 .95
n x .05 .I0
.0000
.OoOo
.oooo
.oooo
.oooo
.moo
.0002 .0045 .0327 .1146
.oooO .0011 .0117 .0614
.OOOO .0002 .0031 .0246
.OOOO .OOOO .0006 .0069
.moo .moo .OoO1 .0012
.oooo .oooo .moo .o001
.2017
.I681
.1046
0 4 5 8
.O126
.0016
.I507
.2153
.2297
.1723
.Of311
.0180
-02I8
.0700
.1680
.2835
.3002
.1501
.OO14
.m93
.0473
.I683
.3763
.3972
13
14
15
16
17
18
.oooo
.oooo
.moo
.oooo
.m
.moo
.oooo
.oooo
.oooo
.oooo
.oooo
.oooo
.3774
.3774
.I787
.0533
.0112
.0018
.o002
.m
.oooo
.oOoO
.oooo
.oooo
.oooo
.moo
.moo
.0000
.OoOo
.oooo
.oooo
.oooo
.1351
.2852
.2852
.1796
.0798
.0266
.0069
.OO14
.OoO2
.m
.oooo
.oooo
.oooo
.oooo
,0000
.oooo
.oooo
.o000
.m
.OoOo
.O 144
.0685
.I540
.2182
.2182
.I636
.0955
,0443
.O166
,0051
.0013
.0003
.oooo
.oooo
.oooo
.oooo
.oooo
.m
.moo
.oooo
.0011
,0093
.0358
.0869
.I491
.I916
,1916
.I525
.0981
.0514
.0220
.oo77
.0022
.0005
.ooo1
.oooo
.oooo
.0000
.oooo
.OoOo
.ooo1
.0008
.0046
.O175
,0467
,0933
.1451
.I797
.1797
.1464
.0976
.0532
.0237
.0085
.0024
.0005
.ooo1
.moo
.oooo
.oooo
.oooo
.oooo
.WO3
.OOI 8
.0074
.0222
.05 18
.0961
.I442
.I762
.I762
.1442
.0961
.05 18
.0222
.0074
.0018
.0003
.m
.OoOo
.moo
.oooo
.oooo
.oooI
.0005
.0024
.0085
.0237
.0532
.0976
.I464
.I797
.I797
.1451
.0933
.0467
.0175
.0046
,0008
.ooo1
.oooo
.oooo
.oooo
.oOOo
.OOOO
.o001
.OOO5
.0022
.0077
.0220
.0514
.0981
.I525
.I916
.I916
.I491
.0869
.0358
.0093
.0011
.moo
.0000
.OoOo
.m
.oooo
.moo
.oOOo
.OoOo
.0003
.0013
.0051
.O166
.0443
.0955
.I636
.2182
.2182
.1540
.0685
.O144
,0000
.oooo
.oooo
.m
.m
.m
.m
.m
.m
.oOoO
.oOoO
.WO2
.OO14
.0069
.0266
.0798
.1796
.2852
.2852
.I351
.oooo
.m
.m
.oooo
.oooo
.oooo
.moo
.oooo
.OoOo
.oOOo
.oooo
.m
.m
.w02
.OOI8
.0112
.0533
.1787
.3774
.3774
19 0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20 0
1
2
3
4
5
6
7
8
9
10
I 1
12
13
14
15
16
17
18
19
20
.3585
.3774
.I887
.0596
.OI33
.0022
.0003
.m
.m
.m
.OoOo
.oooo
.oooo
.moo
.m
.oooo
.oooo
.oooo
.oooo
.oooo
.moo
.I216
,2702
,2852
.1901
.0898
.0319
.0089
,0020
.oO04
.ooo1
,0000
.oooo
.oooo
.oooo
,0000
.oooo
.0000
,0000
.oooo
.oooo
.oooo
.0115
.0576
.I369
.2054
.2182
.I746
.1091
.0545
.0222
.0074
.0020
.0005
.ooo1
.oooo
.m
.oooo
.oooo
.oooo
.oooo
.oooo
.oooo
.0008
.0068
.0278
.07 16
.I304
.I789
.1916
.I643
.1144
.0654
.0308
.0120
.0039
.ooI0
.0002
.0000
.moo
.m
.oooo
.OoOo
.OoOo
.oooo
.0005
.0031
.O123
.0350
.0746
,1244
.I659
.I797
.1597
.I 171
.0710
.0355
.OI46
.0049
.0013
.0003
.oOOo
.oooo
.oOoO
.m
.oooo
.oooo
.0002
.0011
9046
.O148
,0370
.0739
.1201
.I602
.1762
.1602
.I201
,0739
.0370
.0148
.OM6
.0011
.OOO2
.m
.o000
.0000
.OoOo
.oooo
.oooo
.0003
.0013
,0049
.O146
.0355
.0710
.I 171
.I597
.I797
.I659
.I 244
.0746
.0350
.0123
.0031
.o005
.oooo
.oooo
.oooo
.oooo
.oooo
.m
.OOoo
.WO2
.0010
.0039
.o120
.0308
.0654
.I 144
,1643
.1916
.1789
.I304
.0716
.0278
.0068
.0008
.oooo
.oooo
.OoOo
.m
.m
.oooo
.oooo
.oooo
.ooo1
.OOO5
.0020
.0074
.0222
.0545
.I091
.1746
.2182
,2054
.I369
,0576
.0115
.oooo
.oooo
.oooo
.oOOo
.m
.oooo
.oooo
.oooo
.oooo
.moo
.oooo
.oO01
. m 4
.0020
.0089
.0319
.OS98
.1901
,2852
.2702
.I216
.oooo
.oooo
.oooo
.oooo
.oooo
.oooo
.oooo
.m
.oom
.oooo
.m
.m
.oooo
.moo
.0003
.0022
.0133
.0596
.1887
.3774
.3585
Appendix 2
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.o
1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
2.0
2.1
2.2
2.3
2.4
2.5
2.6
2.7
2.8
2.9
3.O
-
Areas Under the Standard Normal Curve from 0 to Z
.oo .o1 .02 .03 .04 .05
.oooo
.0398
,0793
.1179
.1554
.1915
.2257
,2580
.2881
.3159
.3413
.3643
.3849
.4032
.4192
.4332
.4452
.4554
.4641
.4713
.4772
.4821
.4861
.4893
.4918
.4938
.4953
.4965
.4974
.498I
.4987
.0040
,0438
.0832
.1217
.1591
.1950
.2291
.2611
.2910
.3438
.3665
.3869
.4049
.4207
.4345
.4463
,4564
,4649
.4719
.4778
.4826
.4864
.4896
.4920
.4940
.4955
.4966
.4975
.4982
.4987
.31a6
.0080
.0478
.0871
.1255
.1628
.1985
.2324
.2642
.2939
.3212
.3461
.3686
.3888
.4066
.4222
.4357
.4474
.4573
.4656
.4726
.4783
.4830
.4868
.4898
.4922
.4941
.4956
4967
.4976
.4982
.4987
.o120
.0517
.0910
.1293
.1664
.2019
.2357
.2673
.2967
,3238
.3485
.3708
.3907
.4082
.4236
.4370
.4484
.4582
,4664
.4732
.4788
.4834
.4871
.4901
.4925
.4943
.4957
.4968
.4977
.4983
.4998
.O160
.OS7
.0948
.1331
.1700
.2054
. n a g
.2704
.2995
.3264
.3508
.3729
.3925
.4099
.4251
.4382
.4495
.4591
.4671
.4738
.4793
.4838
.4875
.4904
.4927
.4945
.4959
.4969
.4977
.4984
.4988
.O199
.0596
.0987
.1368
.1736
.2088
.2422
.2734
,3023
.3289
.3531
.3749
.3944
.4115
.4265
.4394
.4505
.4599
.4678
.4744
.4798
.4842
.4878
.4906
.4929
.4946
.4960
.4970
.4978
.4994
.4989
.06
.0239
.0636
.1026
.1406
.1772
.2123
.2454
.2764
.3051
.3315
.3554
.3770
.3962
.4131
.4279
.4406
.4515
.4608
.4686
.4750
.4803
.4846
.4881
.4909
.4931
.4948
.4961
.4971
.4979
.4985
.4989
.07 .OS
.0279
-0675
.1064
.1443
.1808
.2157
.2486
,2794
.3078
.3340
.3577
.3790
.3980
.4147
.4292
.4418
.4525
.46I6
,4693
.4756
.4808
.4850
.4884
,4911
.4932
.4949
.4962
.4972
.4979
.4985
.4989
.0319
.0714
.1103
.1480
.1844
.2190
.2517
.2823
.3106
.3365
.3599
.3810
.3997
.4162
,4306
.4429
.4535
.4625
.4699
.4761
.4812
.4854
.4887
.4913
.4934
.4951
.4963
.4973
.4980
.4986
.4990
.09
.0359
.0753
.1141
.1517
.1879
.2224
.2549
.2852
.3133
.3389
.3621
.3830
.4015
.4177
.4319
.4441
.4545
.4633
.4706
,4767
.4817
.4857
.4890
.4916
.4936
.4952
.4964
.4974
.4981
.4986
.4990
Appendix 3
The entries in the table are the critical values of t for the specified number of degrees of freedom and
areas in the right tail.
df
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
Areas in the Right Tail under the t Distribution Curve
.01 .05 .025 .01 .005 ,001
3.078
1.886
1.638
1.533
1.476
1.440
1.415
1.397
1.385
1.372
1.363
1.356
1.350
1.345
1.341
1.337
1.333
1.330
1.328
1.325
1.323
1.321
1.319
1.318
1.316
1.315
1.314
1.313
1.311
1.310
6.314
2.920
2.353
2.132
2.015
1.943
1.895
1.860
1.833
1.812
1.796
1.782
1.771
1.761
1.753
1.746
1.740
1.734
1.729
1.725
1.721
1.717
1.714
1.711
1.708
1.706
1.703
1.701
1.699
1.697
12.706
4.303
3.182
2.776
2.571
2.447
2.365
2.306
2.262
2.228
2.201
2.179
2.160
2.145
2.131
2.120
2.110
2.101
2.093
2.086
2.080
2.074
2.069
2.064
2.060
2.056
2.052
2.048
2.045
2.042
31.821
6.965
4.541
3.747
3.365
3.143
2.998
2.896
2.821
2.764
2.718
2.681
2.650
2.624
2.602
2.583
2.567
2.552
2.539
2.528
2.518
2.508
2.500
2.492
2.485
2.479
2.473
2.467
2.462
2.457
63.657
9.925
5.841
4.604
4.032
3.707
3.499
3.355
3.250
3.169
3.106
3.055
3.012
2.977
2.947
2.921
2.898
2.878
2.861
2.845
2.831
2.819
2.807
2.797
2.787
2.779
2.771
2.763
2.756
2.750
318.309
22.327
10.215
7.173
5.893
5.208
4.785
4.501
4.297
4.144
4.025
3.930
3.852
3.787
3.733
3.686
3.646
3.610
3.579
3.552
3.527
3.505
3.485
3.467
3.450
3.435
3.421
3.408
3.396
3.385
Appendix 4
The entries in the table are the critical values of x2for the specifieddegrees of freedom and areas in the
right tail.
df
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
40
50
60
70
80
~ ~~~ ~~~
Area in the Right Tail under the Chi-squareDistribution Curve
.995 .990 .975 .950 .900 ,100 .OS0 ,025 .010 .005
0.ooo
0.010
0.072
0.207
0.412
0.676
0.989
1.344
1.735
2.156
2.603
3.074
3.565
4.075
4.601
5.142
5.697
6.265
6.844
7.434
8.034
8.643
9.260
9.886
10.520
11.160
I 1.808
12.461
13.121
13.787
20.707
27.991
35.534
43.215
51.172
0.000
0.020
0.115
0.297
0.554
0.872
1.239
1.646
2.088
2.558
3.053
3.571
4.107
4.660
5.229
5.812
6.408
7.015
7.633
8.260
8.897
9.542
10.196
10.856
11.524
12.198
12.879
13.565
14.256
14.953
22.164
29.707
37.485
45.442
53.540
0.001
0.051
0.216
0.484
0.831
1.237
1.690
2.180
2.700
3.247
3.816
4.404
5.009
5.629
6.262
6.908
7.564
8.231
8.907
9.591
10.283
10.982
1I .689
12.401
13.120
13.844
14.573
15.308
16.047
16.791
24.433
32.357
40.482
48.758
57.153
0.004
0.103
0.352
0.711
1.145
I .635
2.167
2.733
3.325
3.940
4.575
5.226
5.892
6.571
7.261
7.962
8.672
9.390
10.117
10.851
11.591
2.338
3.091
3.848
4.611
5.379
16.151
16.928
17.708
18.493
26.509
34.764
43.188
51.739
60.39I
0.016
0.21I
0.584
1.064
1.610
2.204
2.833
3.490
4.168
4.865
5.578
6.304
7.042
7.790
8.547
9.312
10.085
10.865
11651
12.443
13.240
14.041
14.848
15.659
16.473
17.292
18.114
18.939
19.768
20.599
29.051
46.459
55.329
64.278
37.689
2.706
4.605
6.251
7.779
9.236
10.645
12.017
13.362
14.684
15.987
17.275
18.549
19.812
21.064
22.307
23.542
24.769
25.989
27.204
28.412
29.615
30.813
32.007
33.196
34.382
35.563
36.741
37.916
39.087
40.256
51.805
63.167
74.397
85.527
96.578
3.841
5.991
7.815
9.488
11.070
12.592
14.067
15.507
16.919
18.307
19.675
21.026
22.362
23.685
24.996
26.296
27.587
28.869
30.144
31.410
32.671
33.924
35.172
36.415
37.652
38.885
40.113
41.337
42.557
43.773
55.758
67.505
79.082
90.531
101279
5.024
7.378
9.348
11.143
12.833
14.449
16.013
17.535
19.023
20.483
2 1.920
23.337
24.736
26.119
27.488
28.845
30.191
31S26
32.852
34.170
35.479
36.781
38.076
39.364
40.646
41.923
43.195
44.461
45.722
46.979
59.342
71.420
83.298
95.023
106.629
6.635 7.879
9.210 10.597
1.345 12.838
3.277 14.860
5.086 16.750
6.812 18.548
8.475 20.278
20.090
21.666
23.209
24.725
26.217
27.688
29.141
30.578
32.000
33.409
34.805
36.191
37.566
38.932
40.289
41638
42.980
44.314
45.642
46.963
48.278
49.588
50.892
63.691
76.154
88.379
100.425
112.329
2 1.955
23.589
25.188
26.757
28.300
29.819
31.319
32.801
34.267
35.718
37.156
38.582
39.997
41.401
42.796
44.181
45.559
46.928
48.290
49.645
50.993
52.336
53.672
66.766
79.490
91.952
104.215
116.321
Appendix 5
The area in the right tail under the F distributioncurve is equal to 0.01.
d
f
?
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
30
40
50
100
-
1
4052
98.50
34.12
21.20
16.26
13.75
12.25
11.26
10.56
10.04
9.65
9.33
9.07
8.86
8.68
8.53
8.40
8.29
8.18
8.10
8.02
7.95
7.88
7.82
7.77
7.56
7.31
7.17
6.90
-
2
5000
99.00
30.82
18.00
13.27
10.92
9.55
8.65
8.02
7.56
7.21
6.93
6.70
6.51
6.36
6.23
6.11
6.01
5.93
5.85
5.78
5.72
5.66
5.61
5.57
5.39
5.18
5.06
4.82
-
-
3
5403
99.17
29.46
16.69
12.06
9.78
8.45
7.59
6.99
6.55
6.22
5.95
5.74
5.56
5.42
5.29
5.18
5.09
5.01
4.94
4.87
4.82
4.76
4.72
4.68
4.5I
4.31
4.20
3.98
-
4
5625
99.25
28.71
15.98
11.39
9.15
7.85
7.01
6.42
5.99
5.67
5.41
5.21
5.04
4.89
4.77
4.67
4.58
4.50
4.43
4.37
4.31
4.26
4.22
4.18
4.02
3.83
3.72
3.51
-
5
5764
99.30
28.24
15.52
10.97
8.75
7.46
6.63
6.06
5.64
5.32
5.06
4.86
4.69
4.56
4.44
4.34
4.25
4.17
4.10
4.04
3.99
3.94
3.90
3.85
3.70
3.51
3.41
3.21
-
6
5859
99.33
27.91
15.21
10.67
8.47
7.19
6.37
5.80
5.39
5.07
4.82
4.62
4.46
4.32
4.20
4.10
4.01
3.94
3.87
3.81
3.76
3.71
3.67
3.63
3.47
3.29
3.19
2.99
-
-
d
7
5928
99.36
27.67
14.98
10.46
8.26
6.99
6.18
5.61
5.20
4.89
4.64
4.44
4.28
4.14
4.03
3.93
3.84
3.77
3.70
3.64
3.59
3.54
3.50
3.46
33 0
3.12
3.02
2.82
-
I
8
5981
99.37
27.49
14.80
10.29
8.10
6.84
6.03
5.47
5.06
4.74
4.50
4.30
4.14
4.00
3.89
3.79
3.71
3.63
3.56
3.51
3.45
3.41
3.36
3.32
3.17
2.99
2.89
2.69
-
-
9
6022
99.39
27.35
14.66
10.16
7.98
6.72
5.91
5.35
4.94
4.63
4.39
4.19
4.03
3.89
3.78
3.68
3.60
3.52
3.46
3.40
3.35
3.30
3.26
3.22
3.07
2.89
2.78
2.59
-
-
10
6056
99.40
27.23
14.55
10.05
7.87
6.62
5.81
5.26
4.85
4.54
4.30
4.10
3.94
3.80
3.69
3.59
3.51
3.43
3.37
3.31
3.26
3.21
3.17
3.13
2.98
2.80
2.70
2.50
-
11
6083
99.41
27.13
14.45
9.96
7.79
6.54
5.73
5.18
4.77
4.46
4.22
4.02
3.86
3.73
3.62
3.52
3.43
3.36
3.29
3.24
3.18
3.14
3.09
3.06
2.91
2.73
2.63
2.43
-
12
6106
99.42
27.05
14.37
9.89
7.72
6.47
5.67
5.11
4.71
4.40
4.16
3.96
3.80
3.67
3.55
3.46
3.37
3.30
3.23
3.17
3.12
3.07
3.03
2.99
2.84
2.66
2.56
2.37
-
15
6157
99.43
26.87
14.20
9.72
7.56
6.31
5.52
4.96
4.56
4.25
4.01
3.82
3.66
3.52
3.41
3.31
3.23
3.15
3.09
3.03
2.98
2.93
2.89
2.85
2.70
2.52
2.42
2.22
-
-
20
6209
99.45
26.69
14.02
9.55
7.40
6.16
5.36
4.81
4.41
4.10
3.86
3.66
3.51
3.37
3.26
3.16
3.08
3.OO
2.94
2.88
2.83
2.78
2.74
2.70
2.55
2.37
2.27
2.07
-
368 AREA IN RIGHT TAIL UNDER F DISTRIBUTION CURVE [APPENDIX 5
The area in the right tail under the F distributioncurve is equal to 0.05.
-
-
clfi
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
30
40
50
100
-
1
161.5
18.51
10.13
7.71
6.61
5.99
5.59
5.32
5.12
4.96
4.84
4.75
4.67
4.60
4.54
4.49
4.45
4.41
4.38
4.35
4.32
4.30
4.28
4.26
4.24
4.17
4.08
4.03
3.94
-
2
199.5
19.00
9.55
6.94
5.79
5.14
4.74
4.46
4.26
4.10
3.98
3.89
3.81
3.74
3.68
3.63
3.59
3.55
3.52
3.49
3.47
3.44
3.42
3.40
3.39
3.32
3.23
3.18
3.09
-
3
215.7
19.16
9.28
6.59
5.41
4.76
4.35
4.07
3.86
3.71
3.59
3.49
3.41
3.34
3.29
3.24
3.20
3.16
3.13
3.10
3.07
3.05
3.03
3.01
2.99
2.92
2.84
2.79
2.70
4
224.6
19.25
9.12
6.39
5.19
4.53
4.12
3.84
3.63
3.48
3.36
3.26
3.18
3.1I
3.06
3.01
2.96
2.93
2.90
2.87
2.84
2.82
2.80
2.78
2.76
2.69
2.61
2.56
2.46
5
230.2
19.30
9.01
6.26
5.05
4.39
3.97
3.69
3.48
3.33
3-20
3.11
3.03
2.96
2.90
2.85
2.81
2.77
2.74
2.71
2.68
2.66
2.64
2.62
2.60
2.53
2.45
2.40
2.31
6
234.0
19.33
8.94
6.16
4.95
4.28
3.87
3.58
3.37
3.22
3.09
3.oo
2.92
2.85
2.79
2.74
2.70
2.66
2.63
2.60
2.57
2.55
2.53
2.51
2.49
2.42
2.34
2.29
2.19
-
d
7
236.8
19.35
8.89
6.09
4.88
4.21
3.79
3.50
3.29
3.14
3.01
2.91
2.83
2.76
2.71
2.66
2.61
2.58
2.54
2.51
2.49
2.46
2.44
2.42
2.40
2.33
2.25
2.20
2.10
-
L
8
238.9
19.37
8.85
6.04
4.82
4.15
3.73
3.44
3.23
3.07
2.95
2.85
2.77
2.70
2.64
2.59
2.55
2.51
2.48
2.45
2.42
2.40
2.37
2.36
2.34
2.27
2.18
2.13
2.03
-
9
240.5
19.38
8.81
6.00
4.77
4.10
3.68
3.39
3.18
3.02
2.90
2.80
2.71
2.65
2.59
2.54
2.49
2.46
2.42
2.39
2.37
2.34
2.32
2.30
2.28
2.21
2.12
2.07
1.97
-
10
241.9
19.40
8.79
5.96
4.74
4.06
3.64
3.35
3.14
2.98
2.85
2.75
2.67
2.60
2.54
2.49
2.45
2.41
2.38
2.35
2.32
2.30
2.27
2.25
2.24
2.16
2.08
2.03
1.93
-
11
243.O
19.40
8.76
5.94
4.70
4.03
3.60
3.31
3.10
2.94
2.82
2.72
2.63
2.57
2.51
2.46
2.41
2.37
2.34
2.31
2.28
2.26
2.24
2.22
2.20
2.13
2.04
1.99
1.89
-
12
243.9
19.41
8.74
5.91
4.68
4.00
3.57
3.28
3.07
2.91
2.79
2.69
2.60
2.53
2.48
2.42
2.38
2.34
2.31
2.28
2.25
2.23
2.20
2.18
2.16
2.09
2.00
1.95
1.85
15
246.0
19.43
8.70
5.86
4.62
3.94
3.51
3.22
3.01
2.85
2.72
2.62
2.53
2.46
2.40
2.35
2.31
2.27
2.23
2.20
2.18
2.15
2.13
2.16
2.09
2.01
1.92
1.87
1.77
-
20
248.0
19.45
8.66
5.80
4.56
3.87
3.44
3.15
2.94
2.77
2.65
2.54
2.46
2.39
2.33
2.28
2.23
2.19
2.16
2.12
2.10
2.07
2.05
2.03
2.01
1.93
1.84
1.78
1.68
Index
Addition rule for the union of two events, 72
Adjacent values, 57
Alternative hypothesis, 185
I
I
1
I
I
1
I
I
I
1
3ar graph, 15
3ayes rule, 72
3ell-shaped histogram, 20,4 1
3etween samples variation, 275
3etween treatments mean square, 276
3imoda1,trimodal, 4 1
3inomial experiment, 93
3inomial probability formula, 93
3inomial random variable, 93
3ox-and-whisker plot, 50
Census, 140
Central limit theorem for the sample proportion,
Central limit theorem, 146
Chebyshev's theorem, 46
Chi-square distribution, 249
Chi-square independence test, 253
Chi-square tables, 250
Class mark, 17
Class midpoint, 17
Class width, 17
Classes, 17
Cluster sampling, 142
Coefficient of determination, 316
Coefficient of variation, 47
Collectively exhaustive events, 73
Combinations, 74
Complementary events, 70
Compound event, 64
Computer selected simple random samples, 141
Conditional probability, 68
Confidence interval, 166
Confidence interval for:
150
mean response in linear regression, 318
population variance, 259
difference in means, 212
difference in population proportions, 231
mean difference of dependent samples, 226
mean, large sample, 167
mean, small sample, 171
population proportion, large sample, 173
slope of the population regression line, 317
Confidence level, 166
Contingency tables, 254
Continuous data, 14,
Continuous random variable, 89, 113
Continuous variable, 3
Correlation coefficient, 319
Counting rule, 63
Critical values, 186
Cumulative frequency distribution, 21
Cumulative percentages, 21
Cumulative relative frequency, 2I
Data set, 2
Degrees of freedom
for denominator, F distribution, 272
for error, one-way ANOVA, 277
for numerator, F distribution, 272
for total, one-way ANOVA, 277
for treatments, one-way ANOVA, 277
for t distribution, 169
Dependent events, 69
Dependent variable, 310
Descriptive Statistics, 1
Deviations from the mean, 43
Discrete data, 14
Discrete random variable, 89
Discrete variable, 3
Empirical rule, 46
Error mean square, 276
Error sum of squares, 276,313
Error term, 310
Estimated regression line, 313
Estimated standard error of the difference in
sample means, 213,215
Estimation, 166
Estimation of mean difference for dependent
samples, 226
Estimation of the difference in means, large
independent samples, 212
Estimation of the difference in means, small
samples, 215, 221
Estimation of the difference in population
proportions, 231
Event, 64
Expected frequencies, 251
Experiment, 63
Explained variation, 315
Exponential probability distribution, 127
F distribution, 272
F table, 273
Factor sum of squares, 285
Factorial design, 282
Finite population correction factor, 145
Frequency distribution, 14
Goodness-of-fit test, 250
Grouped data, 40
369
370 INDEX
Histogram, 20
Hypergeometric probability formula, 99
Independent events, 69
Independent samples, 2 1I
Independent variable, 310
Inferential Statistics, I
Interaction, 282
Interaction plot, 283
Interaction sum of squares, 285
Interquartile range, 49
Intersection of events, 70
Interval estimate, 166
Interval level of measurement, 5
Joint probability, 68
Kruskal-Wallis test, 340-342
Least squares line, 3 12
Left-tailed test, 185
Level of significance, I87
Levels of a factor, 285
Levels of measurement, 4
Line of best fit, 313
Linear regression model, 310
Lower boundary, 17
Lower class limit, 17
Lower inner fence, 57
Lower outer fence, 57
Main effects, 282
Mann-Whitney U test, 339
Marginal probability, 68
Maximum error of estimate, 168
Mean, 40
Mean and standard deviation for the uniform
Mean and standard deviation of the sample
Mean and variance of a binomial random variable,
Mean for grouped data, 45
Mean of a discrete random variable, 91
Mean of the sample mean, 144
Measures of central tendency, 40
Measures of dispersion, 42
Measures of position, 48
Median class, 45
Median for grouped data, 45
Median, 4 1
Modal class, 45
Mode for grouped data, 45
Mode, 41
Modified boxplot, 50
distribution, 114
proportion, I49
96
Mu1tinomial probability distribution, 250
Multiplication rule for the intersection of two
Mutually exclusive events, 69
events, 71
Nominal level of measurement, 4
Nonparametric methods, 334
Nonrejection region, 186
Normal approximation to the binomial, 125- 126
Normal curve pdf, 117
Normal probability distribution, 1 15
Null hypothesis, 185
Observation, 2
Observed frequencies, 251
Ogive, 21
One-tailed test, 185
One-way ANOVA table, 279
Operating characteristic curve, 194
Ordinal level of measurement, 5
Outcomes, 63
Paired or matched samples, 224
Parameter, 148
Pearson correlation coefficient, 320
Percentage, 15
Percentile for an observation, 48
Percentiles, deciles, and quartiles, 48
Permutations, 74
Pie chart, 16
Point estimate, 166
Poisson probability formula, 97
Population, 1
Population correlation coefficient, 320
Population proportion, 148
Population Spearman rank correlation coefficient,
Population standard deviation, 42
Possible outliers, 58
Prediction interval when predicting a single
Prediction line, 313
Probability, 65
Probability density function (pdf), 113
ProbabiIity distribution, 90
Probability, classical definition, 65
Probability, relative frequency definition, 67
Probability, subjective definition, 67
Probable outliers, 58
Pth percentile, 48
P value, 196-198
344
observation, 3I9
Qualitative data, 14
Qualitative variable, 4
Quantitative data, 14
Quantitative variable, 3
Random number tables, 140
INDEX 371
Random variable, 89
Range for grouped data, 45
Range, 42
Rank, 334
Ratio level of measurement, 6
Raw data, 14
Regression sum of squares, 3 15
Rejection regions, 186
Relative frequency, I5
Research hypothesis, I85
Residuals, 314
Response variable, 282
Right-tailed test, I85
Runs test for randomness, 344
Sample proportion, 148
Sample size determination for estimating the mean,
Sample size determination for estimating the
Sample space, 63
Sample standard deviation. 43
Sample, 1
Sampling distribution of the sample mean, 142
Sampling distribution of the sample proportion,
Sampling distribution of the sample variance, 257
Sampling error, 144
Scales of measurement. 4
Scatter plot, 304,
Shortcut formulas for computing variances, 43
Sign test, 334-336
Signed rank, 337
Simple event, 64
Simple random sampling, 140
Single-valued class. I9
Skewed to the left histogram, 20
Skewed to the right histogram, 20
Spearman rank correlation coefficient. 342
Standard deviation. 42
Standard deviation of a discrete random variable,
Standard deviation of errors, 3I5
Standard error of the mean, 144
Standard error of the proportion, 149
Standard normal distribution table, I 17
Standardizing a normal distribution, I20
Statistic, 148
Statistical software packages, 7
Statistics. 1
Stem-and-leaf display, 22
174
population proportion, I75
148
92
Stratified sampling, 142
Sum of squares of the deviations, 43
Sum of the deviations, 43
Sum of the squares and sum squared, 43
Summation notation, 6
Synimetric histogram, 20
Systematic random sampling, I42
T distribution, 169
Table of binomial probabilities, 95
Test statistic, 186
Testing a hypothesis concerning:
difference in means, large samples, 213
difference in means, small samples, 216, 223
difference in population proportions, 232
mean difference for dependent samples, 228
main effects and interaction, 299
population correlation coefficient, 32I
population mean, large sample, 19I
population mean, small sample, 199
population proportion, large sample, 200
population variance, 260
Total sum of squares, 277
Treatment sum of squares, 276
Tree diagram, 63
Two-tailed test, 185
Two-way ANOVA table, 286
Type I and Type I1 errors, 187
Type I1 errors. calculating, 193
Ungrouped data, 40
Uniform or rectangular histogram, 20
Uniform probability distribution, 113
Union of events, 71
Upper boundary, 17
Upper class limit, 17
Upper inner fence, 57
Upper outer fence, 57
Variable, 2
Variance for grouped data, 46
Variance of a discrete random variable, 92
Variance, 42
Venn diagram, 64
Wilcoxon rank-sum test, 338-340
Wilcoxon signed-rank test, 337-338
Within samples variation, 275
Z score, 47

Schaum Outlines Of Beginning Statistics.pdf

  • 2.
    SCHAUM’S OUTLINE OF THEORYAND PROBLEMS OF BEGINNING STATISTICS LARRY J. STEPHENS, Ph.D. Professor o f Mathematics University of Nebrasku at Oriialin SCHAUM’S OUTLINE SERIES McGRAW-HILL New York San Francisco Washington,D.C. Auckland Bogotci Caracas Lisbon London Madrid Mexico City Milan Montreal New Dehli San Juan Singapore Sydney Tokyo Toronto
  • 3.
    To My Motherand Father, Rosie, and Johnie Stephens LARRY J. STEPHENS is Professor of Mathematics at the University of Nebraska at Omaha. He received his bachelor’s degree from Memphis State University in Mathematics, his master’s degree from the University of Arizona in Mathematics, and his Ph.D. degree from Oklahoma State University in Statistics. Professor Stephens has over 40 publications in professional journals. He has over 25 years of experience teaching Statistics. He has taught at the University of Arizona, Christian Brothers College, Gonzaga University, Oklahoma State University, the University of Nebraska at Kearney, and the University of Nebraska at Omaha. He has published numerous computerized test banks to accompany elementary Statistics texts. He has worked for NASA, Livermore Radiation Laboratory, and Los Alamos Laboratory. Since 1989, Dr. Stephens has consulted with and conducted Statistics seminars for the engineering group at 3M, Valley, Nebraska plant. Schaurn’s Outline of Theory and Problems of BEGlNNlNG STATISTICS Copyright 0 1998by the McGraw-Hill Companies, Inc. All rights reserved. Printed in the United States of Atnerica. Except as permitted under the Copyright Act of 1976, no part of this publication tna he reproduced or distributed in any form or by any means, or stored in a data base or retrieval system, without the prior written permission of the publisher. 3 4 5 6 7 8 9 10 I 1 12 13 14 IS 16 17 18 19 20 PRS PRS 9 0 2 1 0 9 ISBN 0-07-061259-5 Sponsoring Editor: Barbara Gilson Production Supervisor: Clara Stanley Editing Supervisor: Maureen B. Walker Mirrirtih is U registered tradenuirk o fMinitub h c . Library of CongressCataloging-in-PublicationData Stephens. Larry J. Schaum’s outline of theory and problems of beginning statistics / Larry J. Stephens. Includes index. p. cm. - (Schaurn’s outline series) ISBN 0-07-061259-5(pbk.) I . Mathematical statistics-Outlines, syllabi, etc. 2. Mathematical statistics-Problems, exercises, etc. I. Title. 11. Series. QA276.19.S74 1998 519.5’0764~21 97-45979 CIP AC
  • 4.
    Preface Statistics is arequired course for undergraduate college students in a number of majors. Students in the following disciplines are often required to take a course in beginning statistics: allied health careers, biology, business, computer science, criminal justice, decision science, engineering, education, geography, geology, information science, nursing, nutrition, medicine, pharmacy, psychology, and public administration. This outline is intended to assist these students in the understanding of Statistics. The outline may be used as a supplement to textbooks used in these courses or a text for the course itself. The author has taught such courses for over 25 years and understands the difficulty students encounter with statistics. I have included examples from a wide variety of current areas of application in order to motivate an interest in learning statistics. As we leave the twentieth century and enter the twenty-first century, an understanding of statistics is essential in understanding new technology, world affairs, and the ever-expanding volume of knowledge. Statistical concepts are encountered in television and radio broadcasting, as well as in magazines and newspapers. Modern newspapers, such as USA Today, are full of statistical information. The sports section is filled with descriptive statistics concerning players and teams performance. The money section of USA Today contains descriptive statistics concerning stocks and mutual funds. The life section of USA Today often contains summaries of research studies in medicine. An understanding of statistics is helpful in evaluating these research summaries. The nature of the beginning statistics course has changed drastically in the past 30 or so years. This change is due to the technical advances in computing. Prior to the 1960s statistical computing was usually performed on mechanical calculators. These were large cumbersome computing devices (compared to today’s hand-held calculators) that performed arithmetic by moving mechanical parts. Computers and computer software were no comparison to today’s computers and software. The number of statistical packages available today numbers in the hundreds. The burden of statistical computing has been reduced to simply entering your data into a data file and then giving the correct command to perform the statistical method of interest. One of the most widely used statistical packages in academia as well as industrial settings is the package called Minitab (Minitab Inc., 3081 Enterprise Drive, State College, PA 16801-3008).I wish to thank Minitab Inc. for granting me permission to include Minitab output, including graphics, throughout the text. Most modern Statistics textbooks include computer software as part of the text. I have chosen to include Minitab because it is widely used and is very friendly. Once a student learns the various data file structures needed to use Minitab, and the structure of the commands and subcommands, this knowledge is readily transferable to other statistical software. The outline contains all the topics, and more, covered in a beginning statistics course. The only mathematical prerequisite needed for the material found in the outline is arithmetic and some basic algebra. I wish to thank my wife, Lana, for her understanding during the preparation of the book. I wish to thank my friend Stanley Wileman for all the computer help he has given me during the preparation of the book. I wish to thank Dr. Edwin C . Hackleman of Delta Software, Inc. for his timely assistance as compositor of the final camera-ready manuscript. Finally, I wish to thank the staff at McGraw-Hill for their cooperation and helpfulness. LARRYJ. STEPHENS ... 111
  • 5.
  • 6.
    Contents Chapter 1 INTRODUCTION.................................................................................1 Statistics. Descriptive Statistics. Inferential Statistics: Population and Sample. Variable, Observation. and Data Set. Quantitative Variable: Discrete and Continuous Variable. Qualitative Variable. Nominal, Ordinal, Interval, and Ratio Levels of Measurement. Summation Notation. Computers and Statistics. Chapter 2 ORGANIZING DATA.......................................................................... 14 Raw Data. Frequency Distribution for Qualitative Data. Relative Frequency of a Category. Percentage. Bar Graph. Pie Chart. Frequency Distribution for Quantitative Data. Class Limits, Class Boundaries, Class Marks, and Class Width. Single-Valued Classes. Histograms. Cumulative Frequency Distributions. Cumulative Relative Frequency Distributions. Ogives. Stem-and-Leaf Displays. Chapter 3 DESCRIPTIVE MEASURES.............................................................. 40 Measures of Central Tendency. Mean, Median, and Mode for Ungrouped Data. Measures of Dispersion. Range, Variance, and Standard Deviation for Ungrouped Data. Measures of Central Tendency and Dispersion for Grouped Data. Chebyshev’s Theorem. Empirical Rule. Coefficient of Variation. Z Scores. Measures of Position: Percentiles, Deciles, and Quartiles. Interquartile Range. Box-and-Whisker Plot. Chapter 4 PROBABILITY..................................................................................... 63 Experiment, Outcomes, and Sample Space. Tree Diagrams and the Counting Rule. Events, Simple Events, and Compound Events. Probability. Classical, Relative Frequency and Subjective Probability Definitions. Marginal and Conditional Probabilities. Mutually Exclusive Events. Dependent and Independent Events. Complementary Events. Multiplication Rule for the Intersection of Events. Addition Rule for the Union of Events. Bayes’ Theorem. Permutations and Combinations. Using Permutations and Combinations to Solve Probability Problems. Chapter 5 DISCRETE RANDOM VARIABLES ................................................ 89 Random Variable. Discrete Random Variable. Continuous Random Variable. Probability Distribution. Mean of a Discrete Random Variable. Standard Deviation of a Discrete Random Variable. Binomial Random Variable. Binomial Probability Formula. Tables of the Binomial Distribution. Mean and Standard Deviation of a Binomial Random Variable. Poisson Random Variable. Poisson Probability Formula. Hypergeometric Random Variable. Hypergeometric Probability Formula. V
  • 7.
    vi CONTENTS Chapter6 CONTINUOUSRANDOMVARIABLES AND THEIR PROBABILITYDISTRIBUTIONS.................................................... 113 Uniform Probability Distribution. Mean and Standard Deviation for the Uniform Probability Distribution. Normal Probability Distribution. Standard Normal Distribution. Standardizing a Normal Distribution. Applications of the Normal Distribution. Determining the z and x Values When an Area under the Normal Curve is Known. Normal Approximation to the Binomial Distribution. Exponential Probability Distribution. Probabilities for the Exponential Probability Distribution. Chapter7 SAMPLINGDISTRIBUTIONS.......................................................... 140 Simple Random Sampling. Using Random Number Tables. Using the Computer to Obtain a Simple Random Sample. Systematic Random Sampling. Cluster Sampling. Stratified Sampling. Sampling Distribution of the Sampling Mean. Sampling Error. Mean and Standard Deviation of the Sample Mean. Shape of the Sampling Distribution of the Sample Mean and the Central Limit Theorem. Applications of the Sampling Distribution of the Sample Mean. Sampling Distribution of the Sample Proportion. Mean and Standard Deviation of the Sample Proportion. Shape of the Sampling Distribution of the Sample Proportion and the Central Limit Theorem. Applications of the Sampling Distribution of the Sample Proportion. Chapter 8 ESTIMATIONAND SAMPLESIZE DETERMINATION: ONE POPULATION ........................................................................... 166 Point Estimate. Interval Estimate. Confidence Interval for the Population Mean: Large Samples. Maximum Error of Estimate for the Population Mean. The t Distribution. Confidence Interval for the Population Mean: Small Samples. Confidence Interval for the Population Proportion: Large Samples. Determining the Sample Size for the Estimation of the Population Mean. Determining the Sample Size for the Estimation of the Population Proportion. Chapter9 TESTS OF HYPOTHESIS:ONE POPULATION............................ 185 Null Hypothesis and Alternative Hypothesis. Test Statistic, Critical Values, Rejection and Nonrejection Regions.Type I and Type I1 Errors. Hypothesis Tests about a Population Mean: Large Samples. Calculating Type I1 Errors. P Values. Hypothesis Tests about a Population Mean: Small Samples. Hypothesis Tests about a Population Proportion: Large Samples.
  • 8.
    CONTENTS vii Chapter 10INFERENCESFOR TWO POPULATIONS..................................... 211 Sampling Distribution of -x,for Large Independent Samples. Estimation of p1 -p2 Using Large Independent Samples. Testing Hypothesis about pI-p, Using Large Independent Samples. Sampling Distribution of XI-x2for Small Independent Samples from Normal Populations with Equal (but unknown) Standard Deviations. Estimation of p1-p, Using Small Independent Samples froin Normal Populations with Equal (but unknown) Standard Deviations. Testing Hypothesis about p,-p, Using Small Independent Samples from Normal Populations with Equal (but Unknown) Standard Deviations. Sampling Distribution of XI-x2for Small Independent Samples from Normal Populations with Unequal (and Unknown) Standard Deviations. Estimation of p,- p2Using Small Independent Samples from Normal Populations with Unequal (and Unknown) Standard Deviations. Testing Hypothesis about p1-p2 Using Small Independent Samples from Normal Populations with Unequal (and Unknown) Standard Deviations. Sampling Distribution of a for Normally Distributed Differences Computed from Dependent Samples. Estimation of pdUsing Normally Distributed Differences Computed from Dependent Samples. Testing Hypothesis about pdUsing Normally Distributed Differences Computed from Dependent Samples. Sampling Distribution of PI-p2for Large Independent Samples. Estimation of PI-P2 Using Large Independent Samples. Testing Hypothesis about PI-P2 Using Large Independent Samplcs. Chapter 11 CHI-SQUAREPROCEDURES........................................................... 249 Chi-square Distribution. Chi-square Tables. Goodness-of-Fit Test. Observed and Expected Frequencies. Sampling Distribution of the Goodness-of-Fit Test Statistic. Chi-square Independence Test. Sampling Distribution of the Test Statistic for the Chi-square Independence Test. Sampling Distribution of the Sample Variance. Inferences Concerning the Population Variance. Chapter 12 ANALYSIS OF VARIANCE (ANOVA)............................................. 272 F Distribution. F Table. Logic Behind a One-way ANOVA. Sum of Squares, Mean Squares, and Degrees of Freedom for a One-way ANOVA. Sampling Distribution for the One-way ANOVA Test Statistic. Building One-way ANOVA Tables and Testing the Equality of Means. Logic Behind a Two-way ANOVA. Sum of Squares, Mean Squares, and Degrees of Freedom for a Two-way ANOVA. Building Two-way ANOVA Tables. Sampling Distributions for the Two-way ANOVA. Testing Hypothesis Concerning Main Effects and Interaction.
  • 9.
    -~ ... V l ll CONTENTS Chapter 13 REGRESSIONAND CORRELATION.............................................. 309 Straight Lines. Linear Regression Model. Least Squares Line. Error Sum of Squares. Standard Deviation of Errors. Total Sum of Squares. Regression Sum of Squares. Coefficient of Determination. hiean, Standard Deviaticjii,and Sampling Distribution of the Slope of the Estimated Regression Equation. Inferences Concerning the Slope of the Population Regression Line. Estimation and Prediction in Linear Regression. Linear Correlation Coefficient. Inference Concerning the Population Correlation Coefficient. Chapter 14 NONPARAMETRICSTATISTICS.................................................... 334 NonparametricMethods. Sign Test. Wilcoxon Signed-Ranks Test for Two Dependent Samples. Wilcoxon Rank-Sum Test for Two Independent Samples. Kruskal-Wall6 Test. Rank Correlation. Runs Test for Randomness. APPENDIX 1 Binomial Probabilities....................................................................................................................... 359 ~~- 2 Areas under the Standard Normal Curve from 0 to 2 ....................................................................... 364 ~ 3 Area in the Right Tail under the t Distribution Curve....................................................................... 365 ~ 4 Area in the Right Tail under the Chi-square Distribution Curve....................................................... 366 ~ 5 Area in the Right Tail under the F Distribution Curve...................................................................... 367 .. INDEX.................................................................................................................................................... 369 -
  • 10.
    Chapter 1 Introduction STATISTICS Statistics isa discipline of study dealing with the collection, analysis, interpretation, and presentation of data. Statistical methodology is utilized by pollsters who sample our opinions concerning topics ranging from art to zoology. Statistical methodology is also utilized by business and industry to help control the quality of goods and services that they produce. Social scientists and psychologists use statistical methodology to study our behaviors. Because of its broad range of applicability, a course in statistics is required of majors in disciplines such as sociology, psychology, criminal justice, nursing, exercise science, pharmacy, education, and many others. To accommodate this diverse group of users, examples and problems in this outline are chosen from many different sources. DESCRIPTIVE STATISTICS The use of graphs, charts, and tables and the calculation of various statistical measures to organize and summarize information is called descriptive statistics. Descriptive statistics help to reduce our information to a manageable size and put it into focus. EXAMPLE 1.1 The compilation of batting average, runs batted in, runs scored, and number of home runs for each player, as well as earned run average, wordlost percentage, number of saves, etc., for each pitcher from the official score sheets for major league baseball players is an example of descriptive statistics. These statistical measures allow us to compare players, determine whether a player is having an “off year” or “good year,” etc. EXAMPLE 1.2 The publication entitled Crime in the United States published by the Federal Bureau of Investigation gives summary information concerning various crimes for the United States. The statistical measures given in this publication are also examples of descriptive statistics and they are useful to individuals in law enforcement. INFERENTIALSTATISTICS:POPULATIONAND SAMPLE The complete collection of individuals, items, or data under consideration in a statistical study is referred to as the pupulatiurz. The portion of the population selected for analysis is called the sample. Inferential statistics consists of techniques for reaching conclusions about a population based upon information contained in a sample. EXAMPLE 1.3 The results of polls are widely reported by both the written and the electronic media. The techniques of inferential statistics are widely utilized by pollsters. Table 1.1 gives several examples of populations and samples encountered in polls reported by the media. The methods of inferential statistics are used to make inferences about the populations based upon the results found in the samples and to give an indication about the reliability of these inferences. The results of a poll of 600 registered voters might be reported as follows: Forty percent of the voters approve of the president’s economic policies. The margin of error for the survey is 4%. The survey indicates that an estimated 40%of all registered voters approve of the economic policies, but it might be as low as 36%or as high as 44%. 1
  • 11.
    2 Population All registered voters Allowners of handguns Households headed by a single parent The CEOs of all private companies INTRODUCTION [CHAP. 1 Sample A telephone survey of 600 registered voters A telephone survey of 1000handgun owners The results from questionnaires sent to 2500 households headed by a single parent The results from surveys sent to 150CEO’s of private companies Table 1.1 Population All prison inmates Legal aliens living in the United States Alzheimer patients in the United States Adult children of alcoholics Sample A criminaljustice study of 350 prison inmates A sociological study conducted by a university researcher of 200 legal aliens A medical study of 75 such patients conducted by a university hospital A psychological study of 200 such individuals EXAMPLE 1.4 The techniques of inferential statistics are applied in many industrial processes to control the quality of the products produced. In industrial settings, the population may consist of the daily production of toothbrushes, computer chips, bolts, and so forth. The sample will consist of a random and representative selection of items from the process producing the toothbrushes, computer chips, bolts, etc. The information contained in the daily samples is used to construct control charts. The control charts are then used to monitor the quality of the products. EXAMPLE 1.5 The statistical methods of inferential statistics are used to analyze the data collected in research studies. Table 1.2 gives the samples and populations for several such studies. The information contained in the samples is utilized to make inferences concerning the populations. If it is found that 245 of 350 or 70% of prison inmates in a criminal justice study were abused as children, what conclusions may be inferred concerning the percent of all prison inmates who were abused as children? The answers to this question are found in Chapters 8 and 9. VARIABLE, OBSERVATION, AND DATA SET A characteristic of interest concerning the individual elements of a population or a sample is called a variable. A variable is often represented by a letter such as x, y, or z. The value of a variable for one particular element from the sample or population is called an Observation. A data set consists of the observations of a variable for the elements of a sample. EXAMPLE 1.6 Six hundred registered voters are polled and each one is asked if they approve or disapprove of the president’s economic policies. The variable is the registered voter’s opinion of the president’s economic policies. The data set consists of 600 observations. Each observation will be the response “approve” or the response “do not approve.’’ If the response “approve” is coded as the number 1 and the response “do not approve’’is coded as 0, then the data set will consist of 600 observations, each one of which is either 0 or 1. If x is used to represent the variable, then x can assume two values, 0 or I .
  • 12.
    CHAP. 11 INTRODUCTION3 appears for the first time EXAMPLE 1.7 A survey of 2500 households headed by a single parent is conducted and one characteristic of interest is the yearly household income. The data set consists of the 2500 yearly household incomes for the individuals in the survey. If y is used to represent the variable, then the values for y will be between the smallest and the largest yearly household incomes for the 2500 households. , (there is no upper limit since conceivably one might need to flip forever to obtain the first head) EXAMPLE 1.8 The number of speeding tickets issued by 75 Nebraska state troopers for the month of June is recorded. The data set consists of 75 observations. QUANTITATIVEVARIABLE: DISCRETE AND CONTINUOUS VARIABLE A quantitative variable is determined when the description of the characteristic of interest results in a numerical value. When a measurement is required to describe the characteristic of interest or it is necessary to perform a count to describe the characteristic, a quantitative variable is defined. A discrete variable is a quantitative variable whose values are countable. Discrete variables usually result from counting. A continuous variable is a quantitative variable that can assume any numerical value over an interval or over several intervals. A continuous variable usually results from making a measurement of some type. EXAMPLE 1.9 Table 1.3 gives several discrete variables and the set of possible values for each one. In each case the value of the variable is determined by counting. For a given box of 100diabetic syringes, the number of defective needles is determined by counting how many of the 100are defective. The number of defectives found must equal one of the 101 values listed. The number of possible outcomes is finite for each of the first four variables; that is, the number of possible outcomes are 101, 31, 501, and 51 respectively. The number of possible outcomes for the last variable is infinite. Since the number of possible outcomes is infinite and countable for this variable, we say that the number of outcomes is countably infinite. Sometimes it is not clear whether a variable is discrete or continuous. Test scores expressed as a percent, for example, are usually given as whole numbers between 0 and 100. It is possible to give a score such as 75.57565. However, this is not done in practice because teachers are unable to evaluate to this degree of accuracy. This variable is usually regarded as continuous, although for all practical purposes, it is discrete. To summarize, due to measurement limitations, many continuous variables actually assume only a countable number of values. Table 1.3 Discrete variable The number of defective needles in boxes of 100diabetic syringes The number of individuals in groups of 30 with a type A personality The number of surveys returned out of 500 mailed in sociological studies The number of prison inmates in 50 having finished high school or obtained a GED who are selected for criminal justice studies The number of times you need to flip a coin before a head Possible values for the variable 0, I, 2,. . . , 100 0,1,2, . . . , 30 0, 1,2, . . . , 500 0,1,2, . . . , 50 1 , 2 , 3 , .. .
  • 13.
    4 INTRODUCTION [CHAP.1 Qualitative variable Marital status Gender Crime classification Pain level Personality type EXAMPLE 1.10 Table 1.4 gives several continuous variables and the set of possible values for each one. All three continuous variables given in Table 1.4 involve measurement, whereas the variables in Example I .9 all involve counting. Table 1.4 Possible categories for the variable Single,married, divorced, separated Male, female Misdemeanor, felony None, low, moderate, severe Type A, type B Continuous variable The length of prison time served for individuals convicted of first degree murder The household income for households with incomes less than or equal to $20,000 The cholesterol reading for those individuals having cholesterol readings equal to or greater than 200 mg/dl Possible values for the variable All the real numbers between a and b, where a is the smallest amount of time served and b is the largest amount All the real numbers between a and $20,000,where a is the smallest household income in the population All real numbers between 200 and b, where b is the largest cholesterol reading of all such individuals QUALITATIVE VARIABLE A qualitative variable is determined when the description of the characteristic of interest results in a nonnumerical value. A qualitative variable may be classified into two or more categories. EXAMPLE 1.11 Table 1.5 gives several examples of qualitative variables along with a set of categories into which they may be classified. The possible categories for qualitative variables are often coded for the purpose of performing computerized statistical analysis. Marital status might be coded as 1, 2, 3, or 4, where 1 represents single, 2 represents married, 3 represents divorced, and 4 represents separated. The variable gender might be coded as 0 for female and 1 for male. The categories for any qualitative variable may be coded in a similar fashion. Even though numerical values are associated with the characteristic of interest after being coded, the variable is considered a qualitative variable. NOMINAL, ORDINAL, INTERVAL, AND RATIO LEVELS OF MEASUREMENT There are four levels of meusuremerit or scales o f measurements into which data can be classified. The nominal scale applies to data that are used for category identification. The nominal level o f measi~rernerttis characterized by data that consist of names, labels, or categories only. Nominal scale duta cannot be arranged in an ordering scheme. The arithmetic operations of addition, subtraction, multiplication, and division are not performed for nominal data.
  • 14.
    CHAP. 1 3 Qualitative variable Bloodtype INTRODUCTION Possible nominal level data values associated with the variable A, B, AB, 0 5 Qualitative variable Automobile size description Product rating Socioeconomic class Pain level Baseball team classification EXAMPLE 1.12 Table 1.6gives several qualitative variables and a set of possible nominal level data values. The data values are often encoded for recording in a computer data tile. Blood type might be recorded as 1, 2, 3, or 4; state of residence might be recorded as 1,2,. . . ,or 50; and type of crime might be recorded as 0 or I , or 1 or 2, etc. Similarly, color of road sign could be recorded as 1, 2, 3, 4, or 5 and religion could be recorded as 1, 2, or 3. There is no order associated with these data and arithmetic operations are not performed. For example, adding Christian and Moslem (1 + 2) does not give other (3). Possible ordinal level data values associated with the variable Subcompact, compact, intermediate, full-size Poor, good, excellent Lower, middle, upper None, low, moderate, severe Class A, class AA, class AAA , major league State of residence Type of crime Color of road signs in the state of Nebraska Religion Alabama,. . . ,Wyoming Misdemeanor, felony Red, white, blue, brown, green Christian, Moslem, other The ordinal scale applies to data that can be arranged in some order, but differences between data values either cannot be determined or are meaningless. The ordinal level o f measurement is characterized by data that applies to categories that can be ranked. Ordinal scale data can be arranged in an ordering scheme. EXAMPLE 1.13 Table 1.7 gives several qualitative variables and a set of possible ordinal level data values. The data values for ordinal level data are often encoded for inclusion in computer data files. Arithmetic operations are not performed on ordinal level data, but an ordering scheme exists. A full-size automobile is larger than a subcompact, a tire rated excellent is better than one rated poor, no pain is preferable to any level of pain, the level of play in major league baseball is better than the level of play in class AA, and so forth. The interval scale applies to data that can be arranged in some order and for which differences in data values are meaningful. The interval level o f measurement results from counting or measuring. Interval scale data can be arranged in an ordering scheme and differences can be calculated and interpreted. The value zero is arbitrarily chosen for interval data and does not imply an absence of the characteristic being measured. Ratios are not meaningful for interval data. EXAMPLE 1.14 Stanford-Binet IQ scores represent interval level data. Joe’s IQ score equals 100 and John’s IQ score equals 150.John has a higher IQ than Joe; that is, IQ scores can be arranged in order. John’s IQ score is 50 points higher than Joe’s IQ score; that is, differences can be calculated and interpreted. However, we cannot conclude that John is 1.5 times (150/100 = 1.5) more intelligent than Joe. An IQ score of zero does not indicate a complete lack of intelligence.
  • 15.
    6 INTRODUCTION [CHAP.1 EXAMPLE 1.15 Temperatures represent interval level data. The high temperature on February 1 equaled 25°F and the high temperature on March 1 equaled 50°F. It was warmer on March 1 than it was on February 1. That is, temperatures can be arranged in order. It was 25” warmer on March 1 than on February 1. That is, differences may be calculated and interpreted. We cannot conclude that it was twice as warm on March 1 than it was on February 1. That is, ratios are not readily interpretable. A temperature of 0°F does not indicate an absence of warmth. EXAMPLE 1.16 Test scores represent interval level data. Lana scored 80 on a test and Christine scored 40 on a test. Lana scored higher than Christine did on the test; that is, the test scores can be arranged in order. Lana scored 40 points higher than Christine did on the test; that is, differences can be calculated and interpreted. We cannot conclude that Lana knows twice as much as Christine about the subject matter. A test score of 0 does not indicate an absence of knowledge concerning the subject matter. The ratio scale applies to data that can be ranked and for which all arithmetic operations including division can be performed. Division by zero is, of course, excluded. The ratio level of measurement results from counting or measuring. Ratio scale data can be arranged in an ordering scheme and differences and ratios can be calculated and interpreted. Ratio level data has an absolute zero and a value of zero indicates a complete absence of the characteristicof interest. EXAMPLE 1.17 The grams of fat consumed per day for adults in the United States is ratio scale data. Joe consumes 50 grams of fat per day and John consumes 25 grams per day. Joe consumes twice as much fat as John per day, since 50/25 = 2. For an individual who consumes 0 grams of fat on a given day, there is a complete absence of fat consumed on that day. Notice that a ratio is interpretable and an absolute zero exists. EXAMPLE 1.18 The number of 911 emergency calls in a sample of 50 such calls selected from a 24-hour period involving a domestic disturbance is ratio scale data. The number found on May 1 equals 5 and the number found on June 1 equals 10. Since 10/5 = 2, we say that twice as many were found on June 1 than were found on May 1. For a 24-hour period in which no domestic disturbance calls were found, there is a complete absence of such calls. Notice that a ratio is interpretable and an absolute zero exists. SUMMATIONNOTATION Many of the statistical measures discussed in the following chapters involve sums of various types. Suppose the number of 91I emergency calls received on four days were 41 I, 375, 400, and 478. If we let x represent the number of calls received per day, then the values of the variable for the four days are represented as follows: x1 = 41 I, x2 = 375, x j = 400, and x4 = 478. The sum of calls for the four days is represented as XI + x2 + x j + x4 which equals 411 + 375 + 400 + 478 or 1664. The symbol Cx, read as “the summation o f x,” is used to represent XI +x2 + xj +x4 .The uppercase Greek letter C (pronounced sigma) corresponds to the English letter S and stands for the phrase “the sum of.” Using the summation notation, the total number of 911 calls for the four days would be written as Cx = 1664. EXAMPLE 1.19 The following five values were observed for the variable x: xI = 4, x2 = 5, x3 = 0, x4 = 6, and x5 = 10.The following computations illustrate the usage of the summation notation. CX= X I +~2 + x j + ~4 + ~ 5 = 4 + 5 +O + 6 + 1 0 ~ 2 5 (CS)~ = (XI+ x2 + x j + ~4 + ~ 5 ) ~ = (25)’ = 625 Cx2= x12+ x : +x3’ + X : + xS2= 42+5 ’ +0’ +6’ + 102= 177 C(x - 5 ) = (x, - 5) + (x2 - 5) + ( X j - 5) + (x4 - 5) + (xs - 5 ) C(X- 5) = ( 4 - 5 ) + ( 5 - 5 ) + ( 0 - 5 ) + ( 6 - 5 ) + ( 10- 5 )= -1 +0 - 5 + 1 +5 = 0
  • 16.
    CHAP. I] INTRODUCTION7 EXAMPLE 1.20 The following values were observed for the variables x and y: XI = I , x2 = 2, xj = 0, x4 = 4, y1= 2, y2 = I , y3 = 4, and y4 = 5. The following computations show how the summation notation is used for two variables. ZXY = xlyl + ~ 2 ~ 2 +xjyj + ~ 4 ~ 4 = 1 x 2 + 2 x 1 +0 x 4 +4 x 5 = 24 (Cx)(Cy)=(x1 +X2+Xj+Xq)(Y1 + y 2 + y 3 + y 4 ) = ( 1 + 2 + 0 + 4 ) ( 2 + 1 + 4 + 5 ) = 7 x 12=84 ( C X ~ - ( C X ) ~ / ~ ) X ( Z ~ ~ - ( C Y ) ~ / ~ ) = ( ~ + 4 + 0 + 1 6 - 7 2 / 4 ) ~ ( 4 + 1 + 16+25- 12*/4)=(8.75)~(10)= 87.5 COMPUTERSAND STATISTICS The techniques of descriptive and inferential statistics involve lengthy repetitive computations as well as the construction of various graphical constructs. These computations and graphical constructions have been simplified by the development of computer software. These computer software programs are referred to as statistical sojhare packages, or simply statistical packages. These statistical packages are large computer programs which perform the various computations and graphical constructions discussed in this outline plus many other ones beyond the scope of the outline. Statistical packages are currently available for use on mainframes, minicomputers, and microcomputers. There are currently available numerous statistical packages. Four widely used statistical packages are: MINITAB, BMDP, SPSS, and SAS. Many of the figures found in the following chapters are MINITAB generated. MINITAB is a registered trademark of Minitab, Inc., 3081 Enterprise Drive, State College, PA 16801. Phone: 814-238-3280; fax: 8 14-238-4383; telex: 88 1612. The author would like to thank Minitab Inc. for their permission to use output from MINITAB throughout the outline. Solved Problems DESCRIPTIVE STATISTICSAND INFERENTIAL STATISTICS: POPULATIONAND SAMPLE 1.1 Classify each of the following as descriptive statistics or inferential statistics. (a) The average points per game, percent of free throws made, average number of rebounds per game, and average number of fouls per game as well as several other measures for players in the NBA are computed. (b) Ten percent of the boxes of cereal sampled by a quality technician are found to be under the labeled weight. Based on this finding, the filling machine is adjusted to increase the amount of fill. (c) USA Today gives several pages of numerical quantities concerning stocks listed in AMEX, NASDAQ, and NYSE as well as mutual funds listed in MUTUALS. (d)Based on a study of 500 single parent households by a social researcher, a magazine reports that 25% of all single parent households are headed by a high school dropout. Ans. (a) The measurements given organize and summarize information concerning the players and is therefore considered descriptive statistics.
  • 17.
    8 INTRODUCTION [CHAP.1 (6) Because of the high percent of boxes of cereal which are under the labeled weight in the sample, a decision is made to increase the weight per box for each box in the population. This is an example of inferential statistics. (c) The tables of measurements such as stock prices and change in stock prices are descriptive in nature and therefore represent descriptive statistics. ((2) The magazine is stating a conclusion about &he population based upon a sample and therefore this is an example of inferential statistics. 1.2 Identify the sample and the population in each of the following scenarios. (a) In order to study the response times for emergency 91 1 calls in Chicago, fifty “robbery in progress” calls are selected randomly over a six-month period and the response times are recorded. (b) In order to study a new medical charting system at Saint Anthony’s Hospital, a representative group of nurses is asked to use the charting system. Recording times and error rates are recorded for the group. (c)Fifteen hundred individuals who listen to talk radio programs of various types are selected and information concerning their education level, income level, and so forth is recorded. Am. (a) The 50 “robbery in progress” calls is the sample, and all “robbery in progress” calls in Chicago during the six-month period is the population. (bj The representative group of nurses who use the medical charting system is the sample and all nurses who use the medical charting system at Saint Anthony’s is the population. (cj The 1500 selected individuals who listen to talk radio programs is the sample and the millions who listen nationally is the population. VARIABLE, OBSERVATION, AND DATA SET 1.3 1.4 1.5 In a sociological study involving 35 low-income households, the number of children per household was recorded for each household. What is the variable? How many observations are in the data set? Ans. The variable is the number of children per household. The data set contains 35 observations. A national survey was mailed to 5000 households and one question asked for the number of handguns per household. Three thousand of the surveys were completed and returned. What is the variable and how large is the data set? Ans. The variable is the number of handguns per household and there are 3000 observations in the data set. The number of hours spent per week on paper work was determined for 200 middle level managers. The minimum was 0 hours and the maximum was 27 hours. What is the variable? How many observations are in the data set? Am. The variable is the number of hours spent per week on paper work and the number of observations equals 200. QUANTITATIVE VARIABLE: DISCRETE AND CONTINUOUS VARIABLE 1.6 Classify the variables in problems I .3, 1.4, and 1.S as continuous or discrete.
  • 18.
    CHAP. 1 3 INTRODUCTION9 Ans. The number of children per household is a discrete variable since the number of values this variable may assume is countable. The values range from 0 to some maximum value such as 10 or 15depending upon the population. The number of handguns per household is countable, ranging from 0 to some maximum value and therefore this variable is discrete. The time spent per week on paper work by middle level managers may be any real number between 0 and some upper limit. The number of values possible is not countable and therefore this variable is continuous. 1.7 A program to locate drunk drivers is initiated and roadblocks are used to check for individuals driving under the influence of alcohol. Let n represent the number of drivers stopped before the first drunk driver is found. What are the possible values for n? Classify n as discrete or continuous. Ans. The number of drivers stopped before finding the first drunk driver may equal 1, 2, 3, . . . ,up to an infinitely large number. Although not likely, it is theoretically possible that an extremely large number of drivers would need to be checked before finding the first drunk driver. The possible values for n are all the positive integers. N is a discrete variable. 1.8 The KSW computer science aptitude test consists of 25 questions. The score reported is reflective of the computer science aptitude of the test taker. How would the score likely be reported for the test? What are the possible values for the scores? Is the variable discrete or continuous? Ans. The score reported would likely be the number or percent of correct answers. The number correct would be a whole number from 0 to 25 and the percent correct would range from 0 to 100 in steps of size 4. However if the test evaluator considered the reasoning process used to arrive at the answers and assigned partial credit for each problem, the scores could range from 0 to 25 or 0 to 100 percent continuously. That is, the score could be any real number between 0 and 25 or any real number between 0 and 100 percent. We might say that for all practical purposes, the variable is discrete. However, theoretically the variable is continuous. QUALITATIVE VARIABLE 1.9 Which of the following are qualitative variables? (a) The color of automobiles involved in several severe accidents (b)The length of time required for rats to move through a maze (c) The classification of police administrations as city, county, or state (d)The rating given to a pizza in a taste test as poor, good, or excellent (e) The number of times subjects in a sociological research study have been married Am. The variables given in (a),(c), and (d)are qualitative variables since they result in nonnumerical values. They are classified into categories. The variables given in (b)and (e) result in numerical values as a result of measuring and counting, respectively. 1.10 The pain level following surgery for an intestinal blockage was classified as none, low, moderate, or severe for several patients. Give three different numerical coding schemes that might be used for the purpose of inclusion of the responses in a computer data file. Does this coding change the variable to a quantitative variable?
  • 19.
    10 INTRODUCTION [CHAP.1 Am. The responses none, low, moderate, or severe might be coded as 0, 1,2, or 3 or 1, 2, 3, or 4 or as 10, 20, 30, or 40. There is no limit to the number of coding schemes that could be used. Coding the variable does not change it into a quantitative variable. Many times coding a qualitative variable simplifies the computer analysis performed on the variable. NOMINAL, ORDINAL, INTERVAL, AND RATIO LEVELS OF MEASUREMENT 1.11 Indicate the scale of measurement for each of the following variables: racial origin, monthly phone bills, Fahrenheit and centigrade temperature scales, military ranks, time, ranking of a personality trait, clinical diagnoses, and calendar numbering of the years. Am. racial origin: nominal monthly phone bills: ratio temperature scales: interval military ranks: ordinal time: ratio ranking of personality trait: ordinal clinical diagnoses: nominal calendar numbering of the years: interval 1.12 Which scales of measurement would usually apply to qualitative data? Am. nominal or ordinal SUMMATION NOTATION 1.13 The following values are recorded for the variable x: X I = 1.3, x2 = 2.5, x3 = 0.7, x4 = 3.5. Evaluate the following summations: Ex, Zx2,(Ex)*,and E(x - S). A ~ s . CX= XI + ~2 + x j +~4 = 1.3 +2.5 +0.7 + 3.5 = 8.0 Cx2 = xI2 + ~2~ + ~3~ + X : = 1.32+2.5’ +0.72+ 3S2= 20.68 ( C X ) ~ = (8.0)2= 64.0 C(X- .5) = (XI - .5) + ( ~ 2 - .5) +(xj - .5) + ( ~ 4 - .5) = 0.8 + 2.0 +0.2 + 3.0 = 6.0 1.14 The following values are recorded for the variables x and y: XI = 25.67, x2 = 10.95, xj = 5.65, yl = 3.45, y2 = 1.55, and y j = 3.50. Evaluate the following summations: Zxy, Cx2y2,and cxy - cxcy. AIZS. Cxy = xlyl + x2y2+ x3y3= 25.67 x 3.45 + 10.95 x 1.55+5.65 x 3.50 = 125.31 Cx2y2= x12y12+ X ; Y ; + ~ 3 ~ ~ 3 ~ = 25.672x 3.452+ 10.952x 1.552 +5.652x 3.502= 8522.26 CXY - CXZY = 125.31 - 42.27 x 8.50 = -233.99 1.15 The sum of four values for the variable y equals 25, that is, Cy = 25. If it is known that yl = 2, y2 = 7, and y3 = 6, find y4. Ans. Cy = 25 = 2 + 7 + 6 + y4, or 25 = 15 + y4. From this, we see that y4 must equal 10.
  • 20.
    CHAP. 11 Patient name SamAlcorn Susan Collins Larry Halsey Bill Samuels Lana Williams INTRODUCTION Fasting blood sugar reading 135 I57 168 120 160 11 Supplementary Problems DESCRIPTIVE STATISTICS AND INFERENTIALSTATISTICS: POPULATION AND SAMPLE 1.16 Classify each of the following as descriptive statistics or inferential statistics. (a) The Nielsen Report on Television utilizes data from a sample of viewers to give estimates of average viewing time per week per viewer for all television viewers. (6) The U.S. National Center for Health Statistics publication entitled Vital Statistics of the United States lists the leading causes of death in a given year. The estimates are based upon a sampling of death certificates. (c) The Omaha WorldHerald lists the low and high temperatures for several American cities. (d)The number of votes a presidential candidate receives are given for each state following the presidential election. (e) The National Household Survey on Drug Abuse gives the current percentage of young adults using different types of drugs. The percentages are based upon national samples. Ans. (a) inferential statistics (b) inferential statistics (c) descriptive statistics (d)descriptive statistics (e) inferential statistics 1.17 Classify each of the following as a sample or a population. (a) all diabetics in the United States (6) a group of 374 individuals selected for a New York TinzedCBS news poll (c) all owners of Ford trucks (d) all registered voters in the state of Arkansas (e) a group of 22,000physicians who participate in a study to determine the role of aspirin in preventing heart attacks Am. (a)population (b) sample (c) population ( d ) population (e) sample VARIABLE, OBSERVATION, AND DATA SET 1.18 Changes in systolic blood pressure readings were recorded for 325 hypertensive patients who were participating in a study involving a new medication to control hypertension. Larry Doe is a patient in the study and he experienced a drop of 15 units in his systolic blood pressure. What statistical term is used to describe the change in systolic blood pressure readings? What does the number 325 represent? What term is used for the 15-unitdrop is systolic blood pressure'? Arts. The change in blood pressure is the variable, 325 is the number of observations in the data set, and 15-unitdrop in blood pressure is an observation. 1.19 Table 1.8 gives the fasting blood sugar reading for five patients at a small medical clinic. What is the variable? Give the observations that comprise this data set. Table 1.8
  • 21.
    12 INTRODUCTION [CHAP.1 Ans. The variable is the fasting blood sugar reading for a patient. The observations are 135, 157, 168, 120, and 160. 1.20 A sociological study involving a minority group recorded the educational level of the participants. The educational level was coded as follows: less than high school was coded as 1, high school was coded as 2, college graduate was coded as 3, and postgraduate was coded as 4. The results were: 1 1 2 3 4 3 2 2 2 2 1 1 1 2 2 1 2 3 3 2 2 1 1 2 2 2 2 2 2 1 1 3 3 2 2 2 2 2 1 2 2 2 2 2 1 3 What is the variable'?How many observations are in the data set? Am. The variable is the educational level of a participant. There are 46 observations in the data set. QUANTITATIVE VARIABLE: DISCRETE AND CONTINUOUS VARIABLE 1.21 Classify the variables in problems I. 18, 1.19,and 1.20as discrete or continuous. Ans. The variables in problems 1.18 and 1.19 are continuous. The variable in problem 1.20 is not a quantitative variable. 1.22 A die is tossed until the face 6 turns up on a toss. The variable x equals the toss upon which the face 6 first appears. What are the possible values that x may assume? Is x discrete or continuous? Am. x may equal any positive integer, and it is therefore a discrete variable. 1.23 Is it possible for a variable to be both discrete and continuous? Ans. no QUALITATIVE VARIABLE 1.24 Give five examples of a qualitative variable. Ans. I. Classification of government employees 4. Medical specialty of doctors 5. ZIP code 2. Motion picture ratings 3. College student classification 1.25 Which of the following is not a qualitative variable? hair color, eye color, make of computer, personality type, and percent of income spent on food Am. percent of income spent on food NOMINAL, ORDINAL, INTERVAL, AND RATIO LEVELS OF MEASUREMENT 1.26 Indicate the scale of measurement for each of the following variables: religion classification; movie ratings of 1, 2, 3, or 4 stars; body temperature; weights of runners; and consumer product ratings given as poor, average, or excellent. Ans. religion: nominal movie ratings: ordinal body temperature: interval weights of runners: ratio consumer product ratings: ordinal 1.27 Which scales of measurement would usually apply to quantitative data?
  • 22.
    CHAP. I] INTRODUCTION13 Ans. interval or ratio SUMMATION NOTATION 1.28 The following values are recorded for the variable x: x1 = 15, x2 = 25, x3 = 10, and x4 = 5. Evaluate the following summations: Cx, Ex2, (CX)~, and C(x - 5). Ans. Ex = 55, Cx’ = 975, (CX)~ = 3025, C(x - 5) = 35 1.29 The following values are recorded for the variables x and y: xi =17, x2 = 28, x3 = 35, y1 = 20, y2 = 30, and y3 = 40. Evaluate the following summations:Zxy, Cx2y2,and Cxy - ZxCy. Ans. Cxy = 2580, Cx2y2= 2,781,200, CXY- CXCY= -4,620 1.30 Given that xI = 5, x2= 10, y1 = 20, and Cxy = 200, find y2. Ans. y2 = 10
  • 23.
    Chapter 2 Offense Rape Robbery Burglary Arson Murder Theft MansIaughter Organizing Data TallyFrequency I/ 2 I// 3 /I/ 3 I// 3 /I/ 3 //I 3 )xl/ I// 8 RAW DATA Information obtained by observing values of a variable is called raw data. Data obtained by observing values of a qualitative variable are referred to as qualitative dafa. Data obtained by observing values of a quantitative variable are referred to as quantitative data. Quantitative data obtained from a discrete variable are also referred to as discrete data and quantitative data obtained from a continuous variable are called continuous data. EXAMPLE 2.1 A study is conducted in which individuals are classified into one of sixteen personality types using the Myers-Briggs type indicator. The resulting raw data would be classified as qualitative data. EXAMPLE 2.2 The cardiac output in liters per minute is measured for the participants in a medical study. The resulting data would be classified as quantitative data and continuous data. EXAMPLE 2.3 The number of murders per 100,000 inhabitants is recorded for each of several large cities for the year 1994.The resulting data would be classified as quantitative data and discrete data. FREQUENCY DISTRIBUTIONFOR QUALITATIVE DATA A frequencv distributiou for qualitative data lists all categories and the number of elements that belong to each of the categories. EXAMPLE 2.4 A sample of rural county arrests gave the following set of offenses with which individuals were charged: rape robbery burglary arson murder robbery rape manslaughter arson theft arson burglary theft robbery theft theft theft burglary murder murder theft theft theft manslaughter manslaughter The variable, type of offense, is classified into the categories: rape, robbery, burglary, arson, murder, theft, and manslaughter. As shown in Table 2.1, the seven categories are listed under the column entitled Offense, and each occurrence of a category is recorded by using the symbol / in order to tally the number of times each offense occurs. The number of tallies for each offense is counted and listed under the column entitled Frequency. Occasionally the term absolute frequency is used rather than frequency. Table 2.1 14
  • 24.
    CHAP. 21 ORGANIZINGDATA 2/25= .08 3/25 = .I2 3/25 = .I2 3/25 = .12 3/25 = .I2 8/25 = .32 3/25 = .12 15 .08 x 100= 8% .12x loo= 12% .I2 x loo= 12% .I2 x loo= 12% .I2 x loo= 12% 3 2 x 100 = 32% .12 x loo= 12% RELATIVE FREQUENCY OF A CATEGORY The relative frequency of a category is obtained by dividing the frequency for a category by the sum of all the frequencies. The relative frequencies for the seven categories in Table 2.1 are shown in Table 2.2. The sum of the relative frequencies will always equal one. PERCENTAGE The percentage for a category is obtained by multiplying the relative frequency for that category by 100. The percentages for the seven categories in Table 2.1 are shown in Table 2.2. The sum of the percentages for all the categories will always equal 100percent. Offense Rape Robbery Burglary Arson Murder Theft Manslaughter BAR GRAPH A bar graph is a graph composed of bars whose heights are the frequencies of the different categories. A bar graph displays graphically the same information concerning qualitative data that a frequency distribution shows in tabular form. EXAMPLE 2.5 The distribution of the primary sites for cancer is given in Table 2.3 for the residents of Dalton County. Tab Primary site Digestive system Respiratory Breast Genitals Urinary tract Other 2 . 3 Frequency 20 30 10 5 5 To construct a bar graph, the categories are placed along the horizontal axis and frequencies are marked along the vertical axis. A bar is drawn for each category such that the height of the bar is equal to the frequency for that category. A small gap is left between the bars. The bar graph for Table 2.3 is shown in Fig. 2-1. Bar graphs can also be constructed by placing the categories along the vertical axis and the frequencies along the horizontal axis. See problem 2.5 for a bar graph of this type.
  • 25.
    16 Primary site Relativefrequency Angle sir,e Digestive system -26 360 x .26 = 93.6" Respiratory .40 360 x .40 = 144" Breast .I3 360 x .I3 = 46.8" Genitals .07 360 x .07= 25.2" Urinary tract .07 360 x .07 = 25.2" Other .07 360 x .07= 25.2" b ORGANIZING DATA [CHAP. 2 n u n u o Digesiive Resphtory Breast Genitals Urinary Other system system tract Primary site Fig. 2-1 PIE CHART A pie chart is also used to graphically display qualitative data. To construct a pie chart, a circle is divided into portions that represent the relative frequencies or percentages belonging to different categories. EXAMPLE 2 . 6 To construct a pie chart for the frequency distribution in Table 2.3, construct a table that gives angle sizes for each category. Table 2.4 shows the determination of the angle sizes for each of the categories in Table 2.3. The 360" in a circle are divided into portions that are proportional to the category s k s . The pie chart for the frequency distribution in Table 2.3 is shown in Fig. 2-2. Table 2.4 Primary cancer sites Respiratory (40.0% ) system stern (13.341) (6 7 % ) Fig. 2-2
  • 26.
    CHAP. 21 ORGANIZINGDATA IQ score Frequency 80-94 8 95- 109 14 110-124 24 125- I39 16 J 140- I54 13 17 Class limits Class boundaries 80-94 79.5-94.5 95- 109 94.5-1 09.5 110-124 109.5- I 24.5 125- 139 124.5-1 39.5 140- 154 139.5- 154.5 FREQUENCY DISTRIBUTION FOR QUANTITATIVE DATA Class width Class marks 15 87.0 15 102.0 15 117.0 15 132.0 15 147.0 There are many similarities between frequency distributions for qualitative data and frequency distributions for quantitative data. Terminology for frequency distributions of quantitative data is discussed first, and then examples illustrating the construction of frequency distributions for quantitative data are given. Table 2.5 gives a frequency distribution of the Stanford-Binet intelligence test scores for 75 adults. IQ score is a quantitative variable and according to Table 2.5, eight of the individuals have an IQ score between 80 and 94, fourteen have scores between 95 and 109, twenty-four have scores between I I0 and 124, sixteen have scores between I25 and 139,and thirteen have scores between 140and 154. CLASS LIMITS, CLASS BOUNDARIES, CLASS MARKS, AND CLASS WIDTH The frequency distribution given in Table 2.5 is composed of five classes. The classes are: 80-94, 95-1 09, 1 10- 124, 125-1 39, and 140- 154. Each class has a lobtter class limit and an irpper class limit. The lower class limits for this distribution are 80, 95, 110, 125,and 140.The upper class limits are 94, 109, 124, 139,and 154. If the lower class limit for the second class, 95, is added to the upper class limit for the first class, 94, and the sum divided by 2, the upper horrrrdar?l for the first class and the Iokt'er bozrrrdary for the second class is determined. Table 2.6 gives all the boundaries for Table 2.5. If the lower class limit is added to the upper class limit for any class and the sum divided by 2, the class riicirk for that class is obtained. The class mark for a class is the midpoint of the class and is sometimes called the class rniclpoirrt rather than the class mark. The class marks for Table 2.5 are shown in Table 2.6. The difference between the boundaries for any class gives the class width for a distribution. The class width for the distribution in Table 2.5 is IS. When forming a frequency distribution, the following general guidelines should be followed: I . The riiiiziher of c*lnssesshoirld be hetweeri 5 mid 1 . 5 2. Each data dire rziirst belong to m e , arid only one, class. 3. Wheii possible, all classes should be of equal Midth.
  • 27.
    18 ORGANIZING DATA[CHAP. 2 Weight 100to under 125 125to under 150 150 to under 175 175to under 200 200 to under 225 225 to under 250 EXAMPLE 2.7 Group the following weights into the classes 100 to under 125, 125to under 150,and so forth: Frequency 2 5 10 10 4 4 1 1 1 120 127 129 130 145 145 150 153 155 160 161 165 167 170 171 174 175 177 179 180 180 185 185 190 I95 I95 201 210 220 224 225 230 245 248 Price 2.50 to 2.59 2.60 to 2.69 2.70 to 2.79 2.80 to 2.89 2.90 to 2.99 3.00 to 3.09 The weights I 1 1 and 120 are tallied into the class 100 to under 125. The weights 127, 129, 130, 145 and 145 are tallied into the class 125 to under 150 and so forth until the frequencies for all classes are found. The frequency distribution for these weights is given in Table 2.7 'Tally Frequency I 1 11 2 2 %, 6 /I/ 3 1/11 4 When a frequency distribution is given in this form, the class limits and class boundaries may be considered to be the same. The class marks are 1 12.5, 137.5, 162.5, 187.5,212.5, and 237.5. The class width is 25. EXAMPLE 2.8 The price thr 500 aspirin tablets is determined for each of twenty randomly selected stores as part of a larger consuiiier study. 'The prices are as follows: 2.50 2.95 2.65 3.10 3.15 3.05 3.05 2.60 2.70 2.75 2.80 2.80 2.85 2.80 3.00 3.00 2.90 2.90 2.85 2.85 Suppose we wish to group these data into seven classes. Since the maximum price is 3.15 and the minimum price is 2.50, the spread in prices is 0.65. Each class should then have a width equal to approximately 117 of 0.65 or .093. There is it lot o f flexibility in choosing the classes while following the guidelines given above. Table 2.8 shows the results if a class width equal to 0.10 is selected and the first class begins at the minimum price. Table 2.8 3.10 to 3.19 11 2 The frequency distribution might also be given in a form such as that shown in Table 2.9. The two different ways of expressing the classes shown in Tables 2.8 and 2.9 will result in the same frequencies. Table 2.9 Price 2.50 to less than 2.60 2.60 to less than 2.70 2.70 to less than 2.80 2.80 to less than 2.90 2.90 to less than 3.00 3.00 to less than 3.10 3.10 to less than 3.20 Frequency 1 2 2 6 5 4 7 Frequency 1 2 2 6 3 4 7
  • 28.
    CHAP. 21 ORGANIZINGDATA Weight 4.72 4.73 4.74 4.75 4.76 4.77 19 Frequency 2 1 6 9 2 5 SINGLE-VALUED CLASSES If only a few unique values occur in a set of data, the classes are expressed as a single value rather than an interval of values. This typically occurs with discrete data but may also occur with continuous data because of measurernent constraints. EXAMPLE 2.9 A quality technician selects 25 bars of soap from the daily production. The weights in ounces of the 25 bars are as follows: 4.75 4.74 4.74 4.77 4.73 4.75 4.76 4.77 4.72 4.75 4.77 4.74 4.75 4.77 4.72 4.74 4.75 4.75 4.74 4.76 4.75 4.75 4.74 4.75 4.77 Since only six unique values occur, we will use single-valued classes. The weight 4.72 occurs twice, 4.73 occurs once, 4.74 occurs six times, 4.75 occurs nine times, 4.76 occurs twice, and 4.77 occurs five times. The frequency distribution is shown in Table 2.10. Table 2.10 HISTOGRAMS A histogram is a graph that displays the classes on the horizontal axis and the frequencies of the classes on the vertical axis. The frequency of each class is represented by a vertical bar whose height is equal to the frequency of the class. A histogram is similar to a bar graph. However, a histogram utilizes classes or intervals and frequencies while a bar graph utilizes categories and frequencies. EXAMPLE 2.10 A histogram for the aspirin prices in Table 2.9 is shown in Fig. 2-3. 6 5 2 4 s 3 c W 6) L 2 I n " I I I I 1 1 1 2.55 2.65 2.75 2.85 2.95 3.05 3.15 Price Fig. 2-3
  • 29.
    20 ORGANIZING DATA[CHAP. 2 5 - 4- ) r 0 3 3 3 - s L t 2 - 1 - 0- A symmetric histogram is one that can be divided into two pieces such that each is the mirror image of the other. One of the most commonly occurring symmetric histograms is shown in Fig. 2-4. This type of histogram is often referred to as a mound-shaped histogram or a bell-shaped histogram. A symmetric histogram in which each class has the same frequency is called a uniform or rectangular histogram. A skewed to the right histogram has a longer tail on the right side. The histogram shown in Fig. 2-5 is skewed to the right. A skewed to the fefi histogram has a longer tail on the left side. The histogram shown in Fig. 2-6 is skewed to the left. 8 - 7 - 6 - 2 5 - & 3 - 2 - 1 - 0 7 h Q) g 4 - Classes Fig. 2-4 Classes Fig. 2-5 Classes Fig. 2-6
  • 30.
    CHAP. 21 Cumulative frequency 0 5 5+ 10= 15 1 5 + 5 = 2 0 20 + 3 = 23 23 +2 = 25 ORGANIZINGDATA Cumulative 0125 = 0 0% 5/25 = .20 20% 15/25= .60 60% 20/25 = .80 80% 23/25 = .92 92% 25/25 = 1.00 100% relative frequency Cumulative percentage 21 CUMULATIVE FREQUENCY DISTRIBUTIONS A cumulative frequency distribution gives the total number of values that fall below various class boundaries of a frequency distribution. EXAMPLE 2.11 Table 2.11 shows the frequency distribution of the contents in milliliters of a sample of 25 one- liter bottles of soda. Table 2.12 shows how to construct the cumulative frequency distribution that corresponds to the distribution in Table 2.11, Table 2.11 Content 970 to less than 990 990 to less than 1010 1010to less than 1030 1030 to less than 1050 1050to less than 1070 3 CUMULATIVE RELATIVE FREQUENCY DISTRIBUTIONS A cumulative relative frequency is obtained by dividing a cumulative frequency by the total number of observations in the data set. The cumulative relative frequencies for the frequency distribution given in Table 2.11 are shown in Table 2-12. Cr4mulative perceittages are obtained by multiplying cumulative relative frequencies by 100. The cumulative percentages for the distribution given in Table 2.1I are shown in Table 2.12. Contents less than 970 990 1010 I030 1050 1070 OGIVES An ogive is a graph in which a point is plotted above each class boundary at a height equal to the cumulative frequency corresponding to that boundary. Ogives can also be constructed for a cumulative relative frequency distribution as well as a cumulative percentage distribution. EXAMPLE 2.12 The ogive corresponding to the cumulative frequency distribution in Table 2.12 is shown in Fig. 2-7.
  • 31.
    22 ORGANIZINGDATA [CHAP.2 l l l l l 1 1 1 I l l 970 980 990 1OOO 1010 1020 103010401050 1060 1070 Contents Fig. 2-7 STEM-AND-LEAFDISPLAYS In a stem-and-leaf display each value is divided into a stem and a leaf. The leaves for each stem are shown separately. The stem-and-leaf diagram preserves the information on individual observations. EXAMPLE 2.13 The following are the California Achievement Percentile Scores (CAT scores) for 30 seventh- grade students: 50 65 70 35 40 57 66 65 70 35 29 33 44 56 66 60 44 50 58 46 67 78 79 47 35 36 44 57 60 57 A stem-and-leaf diagram for these CAT scores is shown in Fig. 2-8. Leaves 9 3 5 5 5 6 0 4 4 4 6 7 0 0 6 7 7 7 8 0 0 5 5 6 6 7 0 0 8 9 Fig. 2-8 Solved Problems RAW DATA 2.1 Classify the following data as either qualitative data or quantitative data. In addition, classify the quantitative data as discrete or continuous. (a) The number of times that a movement authority is sent to a train from a relay station is recorded for several trains over a two-week period. The movement authority, which is an electronic transmission, is sent repeatedly until a return signal is received from the train.
  • 32.
    CHAP. 21 ORGANIZINGDATA 23 (b)A physician records the follow-up condition of patients with optic neuritis as improved, unchanged, or worse. (c) A quality technician records the length of material in a roll product for several products selected from a production line. (d) The Bureau o f Justice Statistics Sourcebook of Criminal Justice Statistics in reporting on the daily use within the last 30 days of drugs among young adults lists the type of drug as marijuana, cocaine, or stimulants (adjusted). (e) The number of aces in five-card poker hands is noted by a gambler over several weeks of gambling at a casino. Am. (a) The number of times that the moving authority must be sent is countable and therefore these data are quantitative and discrete. (h) These data are categorical or qualitative. (c) The length of material can be any number within an interval of values, and therefore these data are quantitative and continuous. (d) These data are categorical or qualitative (e) Each value in the data set would be one of the five numbers 0, 1, 2, 3, or 4. These data are quantitative and discrete. FREQUENCY DISTRIBUTIONFOR QUALITATIVEDATA 2.2 The following list gives the academic ranks of the 25 female faculty members at a small liberal arts college: instructor assistant professor assistant professor instructor associate professor assistant professor associate professor assistant professor full professor associate professor instructor assistant professor full professor associate professor assistant professor instructor assistant professor assistant professor associate professor assistant professor full professor assistant professor assistant professor assistant professor associate professor Give a frequency distribution for these data. Am. The academic ranks are tallied into the four possible categories and the results are shown in Table 2.13. Table 2.13 Academic rank Fre uenc Full professor Associate professor Assistant professor Instructor RELATIVEFREQUENCYOF A CATEGORY AND PERCENTAGE 2.3 Give the relative frequencies and percentages for the categories shown in Table 2.13.
  • 33.
    24 ORGANIZINGDATA [CHAP.2 Academic rank Relative frequency Percentage Full professor .I2 12% Associate professor .24 24% Assistant professor .48 48% Instructor .I6 16% . Am. Each frequency in Table 2.13 is divided by 25 to obtain the relative frequencies for the categories. The relative frequencies are then multiplied by 100 to obtain percentages. The results are shown in Table 2.14. ~ ~~ ~ 2.4 Refer to Table 2.14 to answer the following. (a) What percent of the female faculty have a rank of associate professor or higher? (h) What percent of the female faculty are not full professors? (c) What percent of the female faculty are assistant or associate professors? Ans. (a) 24% + 12%= 36% (11) 16%+- 48% + 24% = 88% (c) 48% + 24% = 72% BAR GRAPHS AND PIE CHARTS 2.5 The subjects in an eating disorders research study were divided into one of three different groups. Table 2.15 gives the frequency distribution for these three groups. Construct a bar graph. Table 2.15 Anorexic Control 20 AIIS. The bar graph for the distribution given in Table 2.15 is shown in Fig. 2-9. Control Anorexic croup Bulimic 1 1 I 1 I I 1 I 0 10 20 30 40 50 Frequency Fig. 2-9 2.6 Construct a pie chart for the frequency distribution given in Table 2.15. Am. Table 2.16 illustrates the determination of the angles for each sector of the pie chart.
  • 34.
    CHAP. 2) ORGANIZINGDATA k Group Relative frequency Angle size Bulimic .3 360 x .3 = 108" Anorexic .5 360 x .5 = 180" Control .2 360 x .2 = 72" 25 Ans. The pie chart for the distributiongiven in Table 2.15 is shown in Fig. 2-10. Pie chart for eatingdisorder subiects Fig. 2-10 2 . 7 A survey of 500 randomly chosen individuals is conducted. The individuals are asked to name their favorite sport. The pie chart in Fig. 2-1 I summarizes the results of this survey. Pie chart for favorite sport Baseball (30.0%) Basketball(20.0%) Fig. 2-11 (a) How many individuals in the 500 gave baseball as their favorite sport? (b) How many gave a sport other than basketball as their favorite sport? (c) How many gave hockey or golf as their favorite sport? Ans. (a) .3 x 500 = 150 (b) .8 x 500 = 400 (c) .2 x 500 = loo
  • 35.
    26 ORGANIZINGDATA [CHAP.2 Upper limit 189 209 229 249 269 FREQUENCY DISTRIBUTION FOR QUANTITATIVE DATA: CLASS LIMITS, CLASS BOUNDARIES, CLASS MARKS, AND CLASS WIDTH Lower Upper boundary boundary Class mark 169.5 189.5 179.5 189.5 209.5 199.5 209.5 229.5 219.5 229.5 249.5 239.5 249.5 269.5 259.5 2.8 Table 2.17 gives the frequency distribution for the cholesterol values of 45 patients in a cardiac rehabilitation study. Give the lower and upper class limits and boundaries as well as the class marks for each class. Class 170to 189 190to 209 210 to 229 230 to 249 250 to 269 Table 2.17 Lower limit 170 I90 2 10 230 250 Cholesterol value 170to 189 I90 to 209 210 to 229 230 to 249 250 to 269 Expenditure 0.5 to 0.9 1.oto 1.4 1.5 to 1.9 2.0 to 2.4 2.5 to 2.9 3.0to 3.4 Am. Table 2.18gives the limits, boundaries, and marks for the classes in Table 2.17. Frequency 1 3 5 5 7 4 Table 2.18 2.9 The following data set gives the yearly food stamp expenditure in thousands of dollars for 25 households in Alcorn County: 2.3 1.9 1.1 3.2 2.7 1.5 0.7 2.5 2.5 3.1 2.5 2.0 2.7 1.9 2.2 1.2 1.3 1.7 2.9 3.0 3.2 1.7 2.2 2.7 2.0 Construct a frequency distribution consisting of six classes for this data set. Use 0.5 as the lower limit for the first class and use a class width equal to 0.5. Ans. The first class would extend from 0.5 to 0.9 since the desired lower limit is 0.5 and the desired class width is 0.5. Note that the class boundaries are 0.45 and 0.95 and therefore the class width equals 0.95 - 0.45 or 0.5.The frequency distribution is shown in Table 2.19. Table 2.19 2.10 Express the frequency distribution given in Table 2.19 using the “less than” form for the classes. Am. The answer is shown in Table 2.20.
  • 36.
    CHAP. 21 Expenditure 0.5 toless than 1.0 1.O to less than 1.S 1.S to less than 2.0 2.0 to less than 2.5 2.5 to less than 3.0 3.0 to less than 3.5 ORGANIZING DATA Frequency i 3 5 5 7 4 27 Gallons 0.000to 3.999 4.000 to 7.999 8.000 to 11.999 12.000to 15.999 16.000to 19.999 Frequency 3 8 3 20 14 2.11 The manager of a convenience store records the number of gallons of gasoline purchased for a sample of customers chosen over a one-week period. Table 2.21 lists the raw data. Construct a frequency distribution having five classes, each of width 4. Use 0.0oO as the lower limit of the first class. Table 2.21 Gallons 0 to less than 4 4 to less than 8 8 to less than 12 12 to less than 16 16to less than 20 Ans. When the data are tallied into the five classes, the frequency distribution shown in Table 2.22 is obtained. Frequency 3 8 3 20 14 2.12 Using the data given in Table 2.21, form a frequency distribution consisting of the classes 0 to less than 4,4 to less than 8, 8 to less than 12, 12to less than 16, and 16 to less than 20. Compare this frequency distribution with the one given in Table 2.22. SINGLE-VALUED CLASSES 2.13 The Food Guide Pyramid divides food into the following six groups: Fats, Oils, and Sweets Group; Milk, Yogurt, and Cheese Group; Vegetable Group; Bread, Cereal, Rice, and Pasta Group; Meat, Poultry, Fish, Dry Beans, Eggs, and Nuts Group; Fruit Group. One question in a
  • 37.
    28 Number of children 0 1 2 3 4 5 6 ORGANIZINGDATA [CHAP. 2 Frequency 1 2 4 8 8 7 1 nutrition study asked the individuals in the study to give the number of groups included in their daily meals. The results are given below: 6 4 5 4 4 3 4 5 5 5 6 5 4 3 6 6 6 5 2 3 4 5 6 4 5 5 5 6 5 6 5 Give a frequency distribution for these data. Ans. A frequency distribution with single-valued classes is appropriate since only five unique values occur. The frequency distribution is shown in Table 2.24. Table 2.24 2 1 3 Number of food groups Frequency 5 12 2.14 A sociological study involving Mexican-American women utilized a 50-question survey. One question concerned the number of children living at home. The data for this question are given below: 5 2 3 0 4 6 2 1 1 2 3 3 4 4 5 5 5 3 3 3 3 4 4 4 5 5 2 3 4 4 5 Give a frequency distribution for these data. Am. Since only a small number of unique values occur, the classes will be chosen to be single valued. The frequency distribution is shown in Table 2.25. HISTOGRAMS 2.15 Construct a histogram for the frequency distribution shown in Table 2.23. Ans. The histogram for the frequency distribution in Table 2.23 is shown in Fig. 2- 12.
  • 38.
    CHAP. 21 Beckmann-Beal scoreFrequency 0-7 5 8-15 15 16-23 20 24-3 1 30 32-39 50 40-47 30 ORGANIZING DATA 29 0 4 8 12 16 20 Gallons Fig. 2-12 2.16 Construct a histogram for the frequency distribution given in Table 2.24. Ans. The histogram for the frequency distribution in Table 2.24 is shown in Fig. 2-13. I 1 1 I 1 2 3 4 5 6 Number of food goups Fig. 2-13 CUMULATIVE FREQUENCY DISTRIBUTIONS 2.17 The Beckmann-Beal mathematics competency test is administered to 150 high school students for an educational study. The test consists of 48 questions and the frequency distribution for the scores is given in Table 2.26. Construct a cumulative frequency distribution for the scores. Ans. The cumulative frequency distribution is shown in Table 2.27.
  • 39.
    30 Scores less than 50 ORGANIZINGDATA Cumulativefrequency 0 [CHAP. 2 Cumulativerelative Scores less than frequency Cumulative percentages 50 0.0 0.0% 60 0.143 14.3% 70 0.429 42.9% 80 0.857 85.7% 90 1.o 100.0% i Table 2.27 Scores less than I Cumulative freauencv 8 16 24 32 40 48 5 20 40 70 120 I50 2.18 Table 2.28 gives the cumulative frequency distribution of reading readiness scores for 35 kindergarten pupils. (a) How many of the pupils scored 80 or higher? (b) How many of the pupils scored 60 or higher but lower than 80? (c) How many of the pupils scored 50 or higher? (d)How many of the pupils scored 90 or lower? Ans. (a) 35 - 3 0 = 5 (6) 30-5=25 (c) 35 (d) 35 CUMULATIVE RELATIVEFREQUENCY DISTRIBUTIONS 2.19 Give the cumulative relative frequencies and the cumulative percentages for the reading readiness scores in Table 2.28. Ans. The cumulative relative frequencies and cumulative percentages are shown in Table 2.29. OGIVES 2.20 Table 2.30 gives the cumulative frequency distribution for the daily breast-milk production in grams for 25 nursing mothers in a research study. Construct an ogive for this distribution.
  • 40.
    CHAP. 21 Daily production lessthan 500 550 600 650 700 750 ORGANIZINGDATA Cumulative frequency 0 3 1 1 20 22 25 31 0 g 1.0- % 9 ) 9 ) > & 3 2 0.5- .C.( Y .CI Y p U' 0.0- 500 550 600 650 700 750 Grams Fig. 2-14 2.21 Construct an ogive curve for the cumulative relative frequency distribution that corresponds to the cumulative frequency distributionin Table 2.30. Ans. Each of the cumulative frequencies in Table 2.30 is divided by 25 and the cumulative relative frequencies 0, .12, .44, 30, 3 8 , and 1.00 are determined. Using these, the cumulative relative frequency distributionshown in Fig. 2-15 can be constructed. 500 550 600 650 700 750 Grams Fig. 2-15 2.22 Construct an ogive curve for the cumulative percentage distribution that corresponds to the cumulative frequency distribution in Table 2.30.
  • 41.
    32 Stem 2 3 4 ORGANIZING DATA [CHAP.2 Leaves 2 3 6 8 8 9 9 0 0 0 0 0 3 4 4 5 6 7 7 7 8 8 8 0 0 0 0 0 4 5 Am. The cumulative percentages are obtained by multiplying the cumulative relative frequencies given in problem 2.21 by 100 to obtain 0%, 12%, 44%, 80%, 88%, and 100%.These percentages are then used to construct the ogive shown in Fig. 2-16. Stem 2 2 3 3 4 4 5 I I I 1 1 I Leaves 2 3 6 8 8 9 9 0 0 0 0 0 3 4 4 5 6 7 7 7 8 8 8 0 0 0 0 0 4 500 550 600 650 700 750 Grams Fig. 2-16 STEM-AND-LEAF DISPLAYS 2.23 The mathematical competency scores of 30junior high students participating in an educational study are as follows: 30 35 28 44 33 22 40 38 37 36 28 29 30 30 40 30 34 37 38 40 38 34 40 37 23 26 30 45 29 40 Construct a stem-and-leaf display for these data. Use 2, 3, and 4 as your stems. Ans. The stem-and-leaf display is shown in Fig. 2- 17. Fig. 2-18
  • 42.
    CHAP. 21 ORGANIZINGDATA 33 2.25 The stem-and-leaf display shown in Fig. 2-19 gives the savings in 15 randomly selected accounts. If the total amount in these 15 accounts equals $4340, find the value of the missing leaf, x. OS SO 65 90 102055 x 75 25 50 70 65 Fig. 2-19 Ails. Let A be the amount in the account with the missing leaf. Then the followingequation must hold. (105 + 150 + 165+ 190+210 +220 +255 +A +275 + 325 + 350 + 370 +415 +480 + 565) = 4340 The solution to this equation is A = 265, which implies that x = 65. Supplementary Problems RAW DATA 2.26 Classify the data described in the followingscenarios as qualitative or quantitative. Classify the quantitative data as either discrete or continuous. The individuals in a sociological study are classified into one of five income classes as follows: low, low to middle, middle, middle to upper, or upper. The fasting blood sugar readings are determined for several individuals in a study involving diabetics. The number of questions correctly answered on a 25-item test is recorded for each student in a computer science class. The number of attempts needed before successfully finding the path through a maze that leads to a reward is recorded for several rats in a psychological study. The race of each inmate is recorded for the individuals in a criminal justice study. ( a ) qualitative (6) quantitative, continuous (c) quantitative, discrete (d) quantitative, discrete (e) qualitative FREQUENCY DISTRIBUTION FOR QUALITATIVEDATA 2.27 The following responses were obtained when 50 randomly selected residents of a small city were asked the question “How safe do you think your neighborhood is for kids?” very very not sure not at all very not very not sure very not sure somewhat not very very not at all not very very very very very not very somewhat somewhat very very not sure not very not at all not very very very not sure very very not very very very not very somewhat somewhat very somewhat very very not very not at all very very very somewhat very somewhat Give a frequency distribution for these data.
  • 43.
    34 Response Very Somewhat Not very Not atall ORGANIZINGDATA Relative frequency Percentage .48 48% .I6 16% .I8 18% .08 8% [CHAP. 2 A m . See Table 2.31. Table 2.31 Somewhat Not very Not at all RELATIVE FREQUENCY OF A CATEGORY AND PERCENTAGE 2.28 Give the relative frequencies and percentages for the categories shown in Table 2.3I . Am. See Table 2.32. Table 2.32 I Not sure I .10 I 10% I 2.29 Refer to Table 2.32 to answer the following questions. (a) What percent of the respondents have no opinion, i.e., responded not sure, on how safe the neighbor- hood is for children‘? (h) What percent of the respondents think the neighborhood is very or somewhat safe for children? (c) What percent of the respondents give a response other than very safe? Am. (a) 10% (h) 64% (c) 52% BAR GRAPHS AND PIE CHARTS 2.30 Construct a bar graph for the frequency distribution in Table 2.3I . 2.31 Construct a pie chart for the frequency distribution given in Table 2.3I 2.32 The bar graph given Fig. 2-20 shows the distribution of responses of 300 individuals to the question “How do you prefer to spend stressful times?” (a) What percent preferred to spend time alone? (6) What percent pave a response other than “with friends”? (c.) How many individuals responded “with family,” “with friends,*’or “other”? Ans. ( a ) SO% (6) 83.3% (c) IS0
  • 44.
    CHAP. 21 _ I- ORGANIZING DATA Response time 5-9 10-14 15-19 20-24 25-29 30-34 35 Frequency 3 7 25 19 14 2 Class 5-9 10-14 15-19 20-24 25-29 30-34 With family With friends Other Response Fig. 2-20 Lower Upper limit limit 5 9 10 14 15 19 20 24 25 29 30 34 FREQUENCY DISTRIBUTIONFOR QUANTITATIVEDATA: CLASS LIMITS, CLASS BOUNDARIES, CLASS MARKS, AND CLASS WIDTH Upper boundary 9.5 14.5 19.5 24.5 29.5 34.5 2.33 Table 2.33 gives the distribution of response times in minutes for 91I emergency calls classified as domestic disturbance calls. Give the lower and upper class limits and boundaries as well as the class marks for each class. What is the class width for the distribution? Class mark 7 12 17 22 27 32 Ans. See Table 2.34. Tablc The class width equals 5 minutes. 2 . 3 4 Lower boundary 4.5 9.5 14.5 19.5 24.5 29.5 2.34 Express the distribution given in Table 2.33 in the “less than” form. Ans. See Table 2.35.
  • 45.
    36 Response time 5 toless than 10 10 to less than 15 15 to less than 20 20 to less than 25 25 to less than 30 30 to less than 35 ORGANIZING DATA Frequency 3 7 25 19 14 2 [CHAP. 2 2.5 5.o Table 2.35 5.O 8.5 5.5 10.5 5.5 7.O 5.O 10.0 6.5 1.5 2.35 Table 2.36 gives the response times in minutes for 50 randomly selected 91 1 emergency calls classified as robbery in progress calls. Group the data into five classes, using 1.0to 2.9 as your first class. 6.0 7.0 6.0 10.0 7.5 7.0 6.5 4.5 Table 2.36 7.0 10.0 9.5 8.O 7.5 2.0 5.5 7.o 5.O 3.0 6.5 5.5 5.O 2.5 3.5 3.5 5.O 3.O 7.5 2.0 4.5 6.5 7.0 1.5 9.0 . Response tjme Frequency 1.0to 2.9 7 3.0 to 4.9 6 5.0 to 6.9 16 7.0to 8.9 14 9.0 to 10.9 7 Class ato b A m See Table 2.37. Frequency f l 2.37 Refer to the frequency distribution of response times to 91I robbery in progress calls given in Table 2.37 to answer the following. ( a ) What percent of the response times are less than seven minutes'? (h) What percent of the response times are equal to or greater than three minutes but less than seven minutes? ( c ) What percent cif the response times are nine or more minutes in length? Am. ((I) 58% (l?) 44%) (c) 14% c to d e to f g to i i to k f2 f3 f4 f<
  • 46.
    CHAP. 21 ORGANLZINGDATA Number of citations 10 11 14 16 17 20 37 Frequency 5 7 10 3 3 X Ans. (a) (b +c)/2 and (d +e)/2 (b) (e + f)/2 (c) (i +j)/2 - (f + g)/2 (4 g (e) (fI + f2 + f3 + f4 + fs) Response time less than 1.o 3.O s.o 7.O 9.0 11.0 SINGLE-VALUED CLASSES Cumulative frequency 0 7 13 29 43 so 2.39 A quality control technician records the number of defective items found in samples of size SO for each of 30 days. The data are as follows: 0 2 3 0 0 0 0 1 2 1 1 2 0 0 1 2 2 2 0 0 1 1 1 0 0 0 3 2 0 1 Give a frequency distribution for these data. Am. See Table 2.39 Table 2.39 Number of defectives 2 2.40 The number of daily traffic citations issued over a 100-milesection of Interstate 80 is recorded for each day of September. The frequency distribution for these data is shown in Table 2.40. Find the value for x. Ans. x = 2 HISTOGRAMS 2.41 Construct a histogram for the response times frequency distribution given in Table 2.37. 2.42 Construct a histogram for the number of defectives frequency distribution given in Table 2.39. CUMULATIVE FREQUENCY DISTRIBUTIONS 2.43 Give the cumulative frequency distribution for the frequencydistribution shown in Table 2.37 Ans. See Table 2.41
  • 47.
    38 ORGANIZING DATA c Cumulativerelative Response time less than frequency Cumulative percentage 1.o 0.0 0% 3.O 0.14 14% 5.O 0.26 26% 7.0 0.58 58% 9.0 0.86 86% 11.0 1.o 100% * [CHAP. 2 450 345 2.44 Refer to Table 2.4 1 to answer the following questions. (a) How many of the response times are less than 5.0 minutes? (6) How many of the response times are 7.0 or more minutes? (c) How many of the response times are equal to or greater than 5.0 minutes but less than 9.0 minutes? 333 660 765 589 347 456 501 543 , CUMULATIVE RELATIVE FREQUENCY DISTRIBUTIONS 678 345 189 2.45 Give the cumulative relative frequencies and the cumulative percentages for the cumulative frequency distribution shown in Table 2.4 1. 543 234 597 521 780 435 456 678 567 675 525 453 Arm See Table 2.42. 500 304 805 586 267 456 435 346 514 508 654 564 225 524 465 324 475 505 707 556 700 125 578 286 531 i OGIVES 2.46 Construct an ogive for the cumulative frequency distribution given in Table 2.4 1. 2.47 Construct an ogive for the cumulative relative frequencydistribution given in Table 2.42. STEM-AND-LEAF DISPLAYS 2.48 The number of calls per 24-hour period to a 911 emergency number is recorded for 50 such periods. The results are shown in Table 2.43. Construct a stem-and-leaf display for these data. Table 2.43 Ans. See Fig. 2-2 I.
  • 48.
    CHAP. 21 Leaves 25 89 2534 67 86 04243345454647 35 35 50 53 56 56 56 65 75 0001 0508 1421242531434356646778868997 54 60 75 78 78 00 07 65 80 05 A ORGANIZING DATA Stem 3 4 5 6 7 8 9 39 Leaves 0 0 5 5 5 5 0 0 3 5 6 8 8 2 4 4 4 8 8 9 9 9 0 0 0 0 0 0 5 5 5 5 5 8 8 8 3 3 3 5 5 0 0 5 0 0 6 I:. 8 Stem 3 3 4 4 5 5 6 6 7 7 8 8 5 9 Leaves 0 0 5 5 5 5 0 0 3 5 6 8 8 2 4 4 4 8 8 9 9 9 0 0 0 0 0 0 5 5 5 5 5 8 8 8 3 3 3 5 5 0 0 0 0 Fig. 2-21 2.49 The number of syringes used per month by the patients of a diabetic specialist is recorded and the results are given in Fig. 2-22. Answer the followingquestions by referring to Fig. 2-22. (a) What is the minimum number of syringes used per month by these patients? (b) What is the maximum number of syringes used per month by these patients? (c) What is the total usage per month by these patients? (4 What usage occurs most frequently? Fig. 2-22 Ans. (a) 30 (b) 90 (c) 2700 (4 60 2.50 Refine the stem-and-leaf display in Fig.2-22 by using either the leaves 0, 1,2,3, or 4 or the leaves 5, 6,7, 8, or 9 on a particular row. Ans. See Fig. 2-23. ~ ~ Fig. 2-23
  • 49.
    Chapter 3 Descriptive Measures MEASURESOF CENTRAL TENDENCY Chapter 2 gives several techniques for organizing data. Bar graphs, pie charts, frequency distributions, histograms, and stem-and-leaf plots are techniques for describing data. Often times, we are interested in a typical numerical value to help us describe a data set. This typical value is often called an average value or a measure o f central tendency. We are looking for a single number that is in some sense representative of the complete data set. EXAMPLE 3.1 The following are examples of measures of central tendency: median priced home, average cost of a new automobile, the average household income in the United States, and the modal number of televisions per household. Each of these examples is a single number, which is intended to be typical of the characteristic of interest. MEAN, MEDIAN, AND MODE FOR UNGROUPED DATA A data set consisting of the observations for some variable is referred to as raw data or urtgroicpeddata. Data presented in the form of a frequency distribution are called grouped data. The measures of central tendency discussed in this chapter will be described for both grouped and ungrouped data since both forms of data occur frequently. There are many different measures of central tendency. The three most widely used measures of central tendency are the mean, median, and mode. These measures are defined for both samples and populations. The mean for a sampleconsistingof n observations is - C X x =- n and the mean for a population consistingof N observations is C X p- N EXAMPLE 3 . 2 The number of 911 emergency calls classified as domestic disturbance calls in a large metro- politan location were sampled for thirty randomly selected 24 hour periods with the following results. Find the mean number of calls per 24-hour period. 25 46 34 45 37 36 40 30 29 37 44 56 50 47 23 40 30 27 38 47 58 22 29 56 40 46 38 19 49 50 - c x 1168 x = - - - - - - 38.9 n 30 40
  • 50.
    CHAP. 31 DESCRIPTIVEMEASURES 41 EXAMPLE 3.3 The total number of 911 emergency calls classified as domestic disturbance calls last year in a large metropolitan location was 14,950. Find the mean number of such calls per 24-hour period if last year was not a leap year. Ex 14,950 N 365 p=----=41.0 - The median of a set of data is a value that divides the bottom 50% of the data from the top 50% of the data. To find the median of a data set, first arrange the data in increasing order. If the number of observations is odd, the median is the number in the middle of the ordered list. If the number of observations is even, the median is the mean of the two values closest to the middle of the ordered list. There is no widely used symbol used to represent the median. Occasionally, the symbol X is used to represent the sample median and the symbol j Xis used to represent the population median. EXAMPLE 3.4 To find the median number of domestic disturbance calls per 24-hour period for the data in Example 3.I ,first arrange the data in increasingorder. 19 22 23 25 27 29 29 30 30 34 36 37 37 38 38 40 40 40 44 45 46 46 47 47 49 50 50 56 56 58 The two values closest to the middle are 38 and 40. The median is the mean of these two values or 39. EXAMPLE 3.5 A bank auditor selects 11 checking accounts and records the amount in each of the accounts. The 1 I observations in increasing order are as follows: 150.25 175.35 195.00 200.00 235.00 240.45 250.55 256.00 275.50 290.10 300.55 The median is 240.45 since this is the middle value in the ordered list. The mode is the value in a data set that occurs the most often. If no such value exists, we say that the data set has no mode. If two such values exist, we say the data set is bimodal. If three such values exist, we say the data set is trimodal. There is no symbol that is used to represent the mode. EXAMPLE 3.6 Find the mode for the data given in Example 3.2. Often it is helpful to arrange the data in increasing order when finding the mode. The data, in increasing order, are given in Example 3.4. When the data are examined, it is seen that 40 occurs three times, and that no other value occurs that often. The mode is equal to 40. For a large data set, as the number of classes is increased (and the width of the classes is decreased), the histogram becomes a smooth curve. Oftentimes, the smooth curve assumes a shape like that shown in Fig. 3-1. In this case, the data set is said to have a bell-shaped distribirtioit or a mound-shaped distribution. For such a distribution,the mean, median, and mode are equal and they are located at the center of the curve. Fig. 3-1
  • 51.
    42 DESCRIPTIVE MEASURES[CHAP. 3 L Data set Distribution shape Mean Median Mode 1 Be11-shaped 15 15 15 2 Left-skewed 10 10.5 1s 3 Right-skewed 20.5 19.5 15 For a data set having a skewed to the right distribution, the mode is usually less than the median which is usually less than the mean. For a data set having a skewed to the left distribution, the mean is usually less than the median which is usually less than the mode. EXAMPLE 3.7 Find the mean, median, and mode for the following three data sets and confirm the above paragraph. Dataset 1: 10, 12, 15, 15, 18, 20 Dataset 2: 2,4,6, 15, 15, I8 Data set 3: 12, 15, 15, 24, 26, 28 Table 3.1 gives the shape of the distribution, the mean, the median, and the mode for the three data sets. MEASURES OF DISPERSION In addition to measures of central tendency, it is desirable to have numerical values to describe the spread or dispersion of a data set. Measures that describe the spread of a data set are called measiires o f dispersion. EXAMPLE 3.8 Jon and Jack are two golfers who both average 85. However, Jon has shot as low as 75 and as high as 99 whereas Jack has never shot below 80 nor higher than 90. When we say that Jack is a more consistent golfer than Jon is, we mean that the spread in Jack’s scores is less than the spread in Jon’s scores. A measure of dispersion is a numerical value that illustrates the differences in the spread of their scores. RANGE, VARIANCE, AND STANDARD DEVIATION FOR UNGROUPED DATA The range for a data set is equal to the maximum value in the data set minus the minimum value in the data set. It is clear that the range is reflective of the spread in the data set since the difference between the largest and the smallest value is directly related to the spread in the data. EXAMPLE 3.9 Compare the range in golf scores for Jon and Jack in Example 3.8. The range for Jon is 99 - 75 = 24 and the range for Jack is 90 - 80 = 10.The spread in Jon’s scores, as measured by range, is over twice the spread in Jack’s scores. The variance and the standard deviation of a data set measures the spread of the data about the mean of the data set. The variance of a sample of size n is represented by s2 and is given by C(X--X)* s2 = n-1 and the variance of a population of size N is represented by d and is given by (3.3) The symbol o is the lowercase sigma of the Greek alphabet and o2is read as sigma squared.
  • 52.
    CHAP. 31 DESCRIPTIVEMEASURES 43 EXAMPLE 3.10 The times required in minutes for five preschoolers to complete a task were 5 , 10, 15, 3, and 7. The mean time for the five preschoolers is 8 minutes. Table 3.2 illustrates the computation indicated by formula (3.3). The first column lists the observations, x. The second column lists the deiiatiorzs from the niearr, x - x . The third column lists the squares of the deviations. The sum at the bottom of the second column is called the slim o f the deviations, and is always equal to zero for any data set. The sum at the bottom of the third column is referred to as the sum o f the squares of the deviations. The sample variance is obtained by dividing the sum of the squares of the deviations by n - I, or 5 -I = 4. The sample variance equals 88 divided by 4 which is 22 minutes squared. - Table 3 . 2 I X 3 3-8=-5 (-S)2= 25 7 7 - 8 = - I (-I)2= 1 C(x-X)= 0 c(x-X)2 = 88 The variance of the data in Example 3.10 is 22 minutes squared. The units for the variance are minutes squared since the terms which are added in column 3 are minutes squared. The square root of the variance is called the standard deviation and the standard deviation is measured in the same units as the variable. The standard deviation of the times to complete the task is or 4.7 minutes. The sample standard deviation is s = o (3.5) and the populatioiz standard deviation is (T= Shortcut formulas equivalent to formulas (3.3)and (3.4)are useful in computing variances and standard deviations. The shortcutformulas for computing sample and population variances are 2 n s = n-1 and N (3.6) (3.7) EXAMPLE 3.11 Formula (3.7)can be used to find the variance and standard deviation of the times given in Example 3.10.The term Zx2is cailed the sum ofthe squares and is found as follows: Cx2= S2 + 102+ 152+ 32+72=408 The term (CX)~ is referred to as the sum squared and is found as follows: (CX)~ = (5 + 10+ 15 +3 + 7)*= 1600 The variance is given as follows: 1600 408-- 408-320 5 - - = 22 2 s = 4 4
  • 53.
    44 and the standarddeviation is DESCRIPTIVE MEASURES [CHAP. 3 s = 422 =4.7 These are the same values we found in Example 3.10 for the variance and standard deviation of the times. Since most populations are large, the computation of CT’ is rarely performed. In practice, the population variance (or standard deviation) is usually estimated by taking a sample from the population and using s2as a estimate of CT*.The use of n - 1 rather than n in the denominator of the formula for s2enhances the ability of s’ to estimate d. For data sets having a symmetric mound-shaped distribution, the standard deviation is approximately equal to one-fourth of the range of the data set. This fact can be used to estimate s for be1I-shaped distributions. Statistical software packages are frequently used to compute the standard deviation as well as many other statistical measures. EXAMPLE 3.12 The costs for scientific calculators with comparable built-in functions were recorded for 20 different sales locations. The results were as follows: 10.50 12.75 11.00 16.50 19.30 20.00 16.50 13.90 17.50 18.00 13.50 17.75 18.50 20.00 15.00 14.45 17.85 15.00 17.50 13.50 The analysis of these data using Minitab is as follows. MTB > name cl ‘cost’ MTB > set cl DATA > 10.50 13.50 12.75 17.75 11.00 18.50 16.50 20.00 19.30 DATA> 15.00 20.00 14.45 16.50 17.85 13.90 15.00 17.50 17.50 DATA > 18.00 13.50 DATA > end MTB > describe c1 Descriptive Statistics Variable N Mean Median TrMean StDev SEMean cost 20 15.950 16.500 16.028 2.826 0.632 Variable Min Max Q1 Q3 cost 10.500 20.000 13.600 17.962 The cost data are set into column c1, and the command describe cl produces 10different descriptive measures. The student is encouraged to confirm the values for the mean, median, standard deviation, minimum, and maximum. The other four measures are described elsewhere in the outline. Even though these data are not symmetric mound shaped, a “ballpark” approximation to the standard deviation is obtained by dividing the range by 4. One-fourth of the range is 2.375 and the value of the standard deviation is 2.826. MEASURES OF CENTRAL TENDENCY AND DISPERSION FOR GROUPED DATA Statistical data are often given in grouped form, i.e., in the form of a frequency distribution, and the raw data corresponding to the grouped data are not available or may be difficult to obtain. The articles that appear in newspapers and professional journals do not give the raw data, but give the
  • 54.
    CHAP. 31 DESCRIPTIVEMEASURES45 Age 5-14 results in grouped form. Table 3.3 gives the frequency distribution of the ages of 5000 shoplifters in a recent psychological study of these individuals. Frequency 750 Table 3.3 15-24 25-34 45-54 100 The techniques for finding the mean, median, mode, range, variance and standard deviation of grouped data will be illustrated by finding these measures for the data given in Table 3.3. The formulas and techniques given will be for sample data. Similar formulas and techniques are used for population data. The meanfor grouped data is given by where x represents the class marks, f represents the class frequencies, and n = Cf. EXAMPLE 3.13 The class marks in Table 3.3 are xi = 9.5, x2 = 19.5, x3 = 29.5, frequenciesare fl = 750, f2 = 2005, f3= 1950,f4= 195,and fs = 100.The sample size n is 5000. The mean is = 39.5, x5 = 49.5 and the - 9.5x 750+19.5x 2005+295x 1950+395x 195+495x 100 I 16,400 --- X = - - 23.3 years 5000 5000 The medianfor grouped data is found by locating the value that divides the data into two equal parts. In finding the median for grouped data, it is assumed that the data in each class is uniformly spread across the class. EXAMPLE 3.14 The median age for the data in Table 3.3 is a value such that 2500 ages are less than the value and 2500 are greater than the value. The median age must occur in the age group 15-24, since 750 are less than 15 and 2755 are 24 years or less. The class 15-24 is called the median class since the median must fall in this class. Since 750 are less than 15 years, there must be 1750additional ages in the class 15- 24 that are less than the median. In other words, we need to go the fraction 1750/2005 across the class 15-24 to locate the median. We give the value 14.5+ (1750/2OO5) x 10= 23.2 years as the median age. To summarize, 14.5 is the lower boundary of the median class, 1750/2005is the fraction we must go across the median class to reach the median, and 10 is the class width for the median class. The modal class is defined to be the class with the maximum frequency. The mode for grouped data is defined to be the class mark of the modal class. EXAMPLE 3.15 The modal class for the distribution in Table 3.3 is the class 15-24. The mode is the class mark for this class that equals 19.5 years. The range for grouped data is given by the difference between the upper boundary of the class having the largest values minus the lower boundary of the class having the smallest values. EXAMPLE 3.16 The upper boundary for the class 45-54 is 54.5 and the lower boundary for the class 5-14 is 4.5, and the range is 54.5 - 4.5 = 50.0 years.
  • 55.
    46 DESCRIPTIVEMEASURES The variunceforgrouped data is given by n-1 [CHAP. 3 (3.10) and the standard deviation is given by s = o (3.1I ) EXAMPLE 3.17 In order to find the variance and standard deviation for the distribution in Table 3.3 using formulas (3.10)and (3.I I ) , we first evaluate cx2f and - . (Cxf)’ n 2x’f = 9S2x 750 + 19S2 x 2005 + 29S2x 1950+ 39S2x 195+49S2x 100= 3,076,350 (Cxf)2 From Example 3.13, we see that Cxf = I 16,400,and therefore - = 2,709,792. The variance is n 3,076,350- 2,709,792 4999 s = = 73.3 and the standard deviation is = 8.6 years. CHEBYSHEV’STHEOREM Cliebyshev ’s theorem provides a useful interpretation of the standard deviation. Chebyshev’s theorem states that the fraction of any data set lying within k standard deviations of the mean is at least 1 - - where k is a number greater than 1. The theorem applies to either a sample or a 1 k’ * 1 1 3 2 4 4 population. If k = 2, this theorem states that at least 1 - 7 = 1 - - = - or 75% of the data set will 8 9 fall between X-2s and X+2s. Similarly, for k = 3, the theorem states that at least - or 89% of the data set will fall between X-3s and X+3s. EXAMPLE 3.18 The reading readiness scores for a group of 4 and 5 year old children have a mean of 73.5 and a standard deviation equal to 5.5. At least 75% of the scores are between 73.5 - 2 x 5.5 = 62.5 and 73.5 + 2 x 5.5 = 84.5. At least 89% of the scores are between 73.5 - 3 x 5.5 = 57.0 and 73.5 + 3 x 5.5 = 90.0. EMPIRICAL RULE The empirical ride states that for a data set having a bell-shaped distribution, approximately 68% of the observations lie within one standard deviation of the mean, approximately 95% of the observations lie within two standard deviations of the mean, and approximately 99.7% of the observations lie within three standard deviations of the mean. The empirical rule applies to either large samples or populations. EXAMPLE 3.19 Assuming the incomes for all single parent households last year had a bell-shaped distribu- tion with a mean equal to $23,500 and a standard deviation equal to $4,500, the following conclusions follow: 68% of the incomes lie between $19,000 and $28,000, 95% of the incomes lie between $14,500 and $32,500,
  • 56.
    CHAP. 31 DESCRIPTIVEMEASURES 47 and 99.7% of the incomes lie between $10,000 and $37,000. If the shape of the distribution is unknown, then Chebyshev’s theorem does not give us any information about the percent of the distribution between $19,500 and $28,000. However, Chebyshev’s theorem assures us that at least 75% of the incomes are between $14,500 and $32,500 and at least 89% of the incomes are between $10,000and $37,000. COEFFICIENTOF VARIATION The coefficient of variation is equal to the standard deviation divided by the mean. The result is usually multiplied by 100to express it as a percent. The coefficient of variation for a sample is given by S X cv = - x - 100% and the coefficient of variation for a population is given by CL cv=-xlOO% 0 (3.12) (3.13) The coefficient of variation is a measure of relative variation, whereas the standard deviation is a measure of absolute variation. EXAMPLE 3.20 A national sampling of prices for new and used cars found that the mean price for a new car is $20,100 and the standard deviation is $6,125 and that the mean price for a used car is $5,485 with a standard deviation equal to $2,730. In terms of absolute variation, the standard deviation of price for new cars is more than twice that of used cars. However, in terms of relative variation, there is more relative variation in the price of used cars than in new cars. The CV for used cars is e x loo= 49.8% and the CV for new cars is 5,485 x 100 = 30.5%. 20,100 2 SCORES A z score is the number of standard deviations that a given observation, x, is below or above the mean. For sample data, the z score is - x - x z=- S and for population data, the z score is z=- x-P 0 (3.14) (3.15) EXAMPLE 3.21 The mean salary for deputies in Douglas County is $27,500 and the standard deviation is $4,500. The mean salary for deputies in Hall County is $24,250 and the standard deviation is $2,750. A deputy who makes $30,000 in Douglas County makes $1,500 more than a deputy does in Hall County who makes $28,500. Which deputy has the higher salary relative to the county in which he works? For the deputy in Douglas County who makes $30,000,the z score is
  • 57.
    48 3.0 3.3 3.5 3.5 3.6 4.0 4.0 4.2 DESCRIPTIVE MEASURES 5.o 6.27.6 9.4 5.2 6.3 7.6 9.S 5.5 6.4 7.7 9.5 5.5 6.6 7.8 10.0 5.5 6.6 7.8 10.5 5.8 6.8 8.S 10.8 5.8 6.8 8.5 10.9 5.9 6.8 8.8 11.0 [CHAP. 3 X-p 30,000-27,500 z=- - - = 0.S6 0 4,500 For the deputy in Hall County who makes $28.500, the z score is x -CL 28,500- 24,250 z=-- - = 1.55 0 2,750 When the county of employment is taken into consideration, the $28,500 salary is a higher relative salary than the $30,000salary. MEASURESOF POSITION: PERCENTILES,DECILES, AND QUARTILES Measures o f position are used to describe the location of a particular observation in relation to the rest of the data set. Percerrtiles are values that divide the ranked data set into 100equal parts. The pth percentile of a data set is a value such that at least p percent of the observations take on this value or less and at least (100 - p) percent of the observations take on this value or more. Decile.7 are values that divide the ranked data set into 10equal parts. Qicartilvs are values that divide the ranked data set into four equal parts. The techniques for finding the various measures of position will be illustrated by using the data in Table 3.4. Table 3.4 contains the aortic diameters measured in centimeters for 45 patients. Notice that the data in Table 3.4 are already ranked. Raw data need to be ranked prior to finding measures of position. Table 3.4 I 4.6 I 6.0 I 7.0 1 8.8 I 11.0 The perreittilefor observutiori x is found by dividing the number of observations less than x by the total number of observations and then multiplying this quantity by 100. This percent is then rounded to the nearest whole number to give the percentile for observation x. EXAMPLE 3.22 The number of observations in Table 3.4 less than 5.5 is 1 1. Eleven divided by 45 is ,244 and .244 multiplied by 100 is 24.4%. This percent rounds to 24. The diameter 5.5 is the 24th percentile and we express this as Pz4= 5.5. The number of observations less than 5.0 is 9. Nine divided by 45 is .20 and .20 multiplied by 100 is 20%. P3()= 5.0. The number of observations less than 10.0is 39. Thirty-nine divided by 45 is 367 and .867 multiplied by 100 is 86.7%. Since 86.7% rounds to 87%. we write Px7= 10.0 The prh perceiztile for a ranked data set consisting of n observations is found by a two-step procedure. The first step is to compute index i = - (p)(n).If i is not an integer, the next integer greater I 00 than i locates the position of the pth percentile in the ranked data set. If i is an integer, the pth percentile is the average of the observations in positions i and i + I in the ranked data set.
  • 58.
    CHAP. 31 DESCRIPTIVEMEASURES 49 (1W45) 100 EXAMPLE 3.23 To find the tenth percentile for the data of Table 3.4, compute i = ~ = 4.5. The next integer greater than 4.5 is 5. The observation in the fifth position in Table 3.4 is 3.6. Therefore, Plo = 3.6. Note that at least 10% of the data in Table 3.4 are 3.6 or less (the actual aniount is I I , 1%) and at least 90% of the data are 3.6 or more (the actual amount is 91.1%). For very large data sets, the percentage of observations equal to or less than Plowill be very close tu 105T and the percentage of observations equal to or greater than PI0 will be very close to 90%. (40x45) 100 EXAMPLE 3.24 To find the tortieth percentile for the data in Table 3.4, compute i = -= 18. The fortieth percentile is the average of the observations in the 18th and 19th positions in the ranked data set. The observation in the 18th position is 6.0 and the observation in the 19th position is 6.2.Therefore Pj0 =- - 6.1. Note that 40% of the data in Table 3.4 are 6.1 or less and that 60% of the observations are 6.I or more. - 6.0+6.2 2 Deciles and quartiles are determined in the same manner as percentiles, since they may be expressed as percentiles. The deciles are represented as DI, Dz, . . . , D Y and the quartiles are repre- sented by QI ,Q2 ,and Q3. The following equalities hold for deciles and percentiles: The following equalities hold for quartiles and percentiles: From the above definitions of percentiles, deciles, and quartiles, the following equalities also hold: Median = P 5 o = D5 = Q2 The techniques for finding percentiles, deciles, and quartiles differ somewhat from textbook to textbook, but the values obtained by the various techniques are usually very close to one another. INTERQUARTILE RANGE The irirerqirartile rtirige, designated by IQR, is defined as follows: IQR = 4 3 - QI (3.16) The interquartile range shows the spread of the middle 50% of the data and is not affected by extremes in the data set. EXAMPLE 3.25 The interquartile range for the aortic diameters in Table 3.4 is found by subtracting the value of QI from Q3 . The first quartilc is equal to the 25th percentile and is found by noting that -= 11.25 and therefore i = 12. QI is in the 12th position in Table 3.4 and Q1 = 5.5. The third quartile is equal to the 75th 45 x 75 percentile and is found by noting that -= 33.75 and therefore i = 34. Q1is in the 34th position in Table I 00 3.4 and Q3= 8.5. The IQR equals 8.5 -5.5 or 3.0 cm. 45 x 25 100
  • 59.
    50 DESCRIPTIVE MEASURES 10.5 -2.5 14.0 [CHAP.3 12.5 14.5 22.0 12.5 20.2 3.5 7.5 14.5 17.5 14.0 12.0 17.0 BOX-AND-WHISKER PLOT 20.3 5.5 4.O A bo.r-atzd-whiskerplot, sometimes simply called a boxplot is a graphical display in which a box extending from Q 1 to QJ is constructed and which contains the middle 50% of the data. Lines, called whiskers, are drawn from Qi to the smallest value and from Q3 to the largest value. In addition, a vertical line is constructed inside the box corresponding to the median. 27.5 22.5 10.5 40.0 12.7 35.5 38.0 10.5 -5.5 19.0 14.5 10.5 EXAMPLE 3.26 A Minitab generated boxplot for the data in Table 3.4 is shown in Fig. 3-2. For these data, the minimum diameter is 3.0 cm, the maximum diameter is 11.0 cm., QI = 5.5 cm, Q2 = 6.6 cm, and Q3= 8.5 cm. Because Minitab uses a slightly different technique for finding the first and third quartile, the box extends from 5.35 to 8.65. rather than from 5.5 to 8.5. L I I I 1 1 I I 1 I I 1 3 4 5 6 7 8 9 1 0 1 1 Diameter (cm) Fig. 3-2 Another type boxplot, called a tnodifed huxplot, is also sometimes constructed in which possible and probable outliers are identified. The modified boxplot is illustrated in problem 3.27. Solved Problems MEASURES OF CENTRAL TENDENCY: MEAN, MEDIAN, AND MODE FOR UNGROUPED DATA Table 3 . 5 Table 3.5 gives the annual returns for 30 randomly selected mutual funds. Problems 3.I , 3.2, and 3.3 refer to this data set. 3.1 Find the mean for the annual returns in Table 3.5. Am. For the data in Table 3.5, Cx = 455.20, n = 30, and X = 15.17. 3 . 2 Find the median for the annual returns in Table 3.5.
  • 60.
    CHAP. 31 DESCRIPTIVEMEASURES51 60.5 75.O 100.0 89.0 50.0 Ans. The ranked annual returns are as follows: 113.5 79.0 475.5 70.0 122.5 150.0 125.5 90.0 175.5 130.0 111.5 100.0 340.5 100.0 525.0 -5.5 -2.5 3.5 4.0 5.5 7.5 10.5 10.5 10.5 10.5 12.0 12.5 12.5 12.7 14.0 14.0 14.5 14.5 14.5 17.0 17.5 19.0 20.2 20.3 22.0 22.5 27.5 35.5 38.0 40.0 The median is the average of the 15th and 16th values in the ranked returns or the average of 14.0 and 14.0, which equals 14.0. 3.3 Find the mode for the annual returns in Table 3.5. Am. By considering the ranked annual returns in the solution to problem 3.2, we see that the observa- tion 10.5 occurs more frequently than any other value and is therefore the mode for this data set. 3.4 Table 3.6 gives the distribution of the cause of death due to accidents or violence for white males during a recent year. Table 3.6 All other accidents 27,500 Suicide 20,234 What is the modal cause of death due to accidents or violence for white males? Can the mean or median be calculated for the cause of death? Am. The modal cause of death due to accidents or violence is motor vehicle accident. Because this is nominal level data, the mean and the median have no meaning. 3 . 5 Table 3.7 gives the selling prices in tens of thousands of dollars for 20 homes sold during the past month. Find the mean, median, and mode. Which measure is most representative for the selling price of such homes? Table 3.7 A m The ranked selling prices are: 50.0 60.5 70.0 75.0 79.0 89.0 90.0 100.0 100.0 100.0 I 1 1.5 113.5 122.5 125.5 130.0 150.0 175.5 340.5 475.5 525.0 The mean is 154.1, the median is 105.7, and the mode is 100.0. The median is the most representative measure. The three selling prices 340.5,475.5, and 525.0 inflate the mean and make it less representative than the median. Generally, the median is the best measure of central tendency to use when the data are skewed.
  • 61.
    52 DESCRIPTIVEMEASURES Age 25 30 40 53 29 45 40 55 35 47 [CHAP. 3 Income 23,500 25,000 30,000 47,500 32,000 37,500 32,000 50,500 40,000 43,750 MEASURESOF DISPERSION: RANGE, VARIANCE, AND STANDARD DEVIATION FOR UNGROUPED DATA 3.6 Find the range, variance, and standard deviation for the annual returns of the mutual funds in Table 3.5. Ans. The range is 40.0 - (-5.5) =45.5. (455.2)* 10.060- (Cx)2 c x 2-- 30 = 108.7275and the standard The variance is given by s2 = n -- n-1 29 deviation is s = 0= J m -= 10.43. 3 . 7 Find the range, variance,and standard deviation for the selling prices of homes in Table 3.7. Ans. The range is 525 - 50 =475. (3083) 812,884----- (Cx)2 zx2-- The variance is given by s2 = n - - 2o = 17,770.5026 and the standard deviation is s = o= 4 - = 133.31. n-1 19 range 4 3.8 Compare the values of the standard deviations with -for problems 3.6 and 3.7. Ans. For problem 3.6, the values are 10.43 and 4 5 3 4 = 11.38 and for problem 3.7, the values are 133.31 and 47Y4 = 118.75. Even though the distributions are skewed in both problems, the values are reasonably close. The approximation is closer for mound-shaped distributions than for other distributions. 3.9 What are the chief advantage and the chief disadvantage of the range as a measure of dispersion? Am. The chief advantage is the simplicity of computationof the range and the chief disadvantageis that it is insensitive to the values between the extremes. 3 . 1 0 The ages and incomes of the 10employeesat ComputerServices Inc. are given in Table 3.8. Table 3.8
  • 62.
    CHAP. 31 DESCRIPTIVEMEASURES 53 Compute the standard deviation of ages and incomes for these employees. Assuming that all employees remain with the company 5 years and that each income is multiplied by 1.5 over that period, what will the standard deviation of ages and incomes equal 5 years in the future? Ans. The current standard deviations are 10.21 years and $9,224.10. Five years from now, the standard deviations will equal 10.21 years and $13,836.15. In general, adding the same constant to each observation does not affect the standard deviation of the data set and multiplying each observation by the same constant multiplies the standard deviation by the constant. MEASURES OF CENTRAL TENDENCY AND DISPERSION FOR GROUPED DATA 3.11 Table 3.9 gives the age distribution of individuals starting new companies. Find the mean, median, and mode for this distribution. Table 3.9 20-29 30-39 40-49 50-59 60-69 Ans. The mean is found by dividing Xxf by n, where Cxf = 24.5 x 11 + 34.5 x 25 +44.5 x 14 + 54.5 x 7 + 64.5 x 3 = 2,330 and n = 60. X=38.8. The median class is the class 30-39. In order to find the middle of the age distribution, that is, the age where 30 are younger than this age and 30 are older, we must proceed through the 11 individuals in the 20-29 age group and 19 in the 30-39 age group. This gives 29.5 + - x 10= 37.1 as the median age. The modal class is the class 30-39, and the mode is the class mark for this class that equals 34.5. 19 25 3.12 Find the range, variance, and standard deviation for the distribution in Table 3.9. Ans. The range is 69.5 - 19.5= 50. ! (2,330)* deviation is s = 4116.497 = 10.8. 3.13 The raw data corresponding to the grouped data in Table 3.9 is given in Table 3.10. Find the mean, median, and mode for the raw data and compare the results with the mean, median, and mode for the grouped data found in problem 3.11.
  • 63.
    54 DESCRIPTIVE MEASURES Table3.10 24 44 24 34 45 40 46 28 33 37 41 47 66 [CHAP. 3 Ans. The sum of the raw data equals 2,299, and the mean is 38.2. This compares with 38.8 for the grouped data. The median is seen to be 37, and this compares with 37.1 for the grouped data. The mode is seen to be 34 and this compares with 34.5 for the grouped data. This problem illustrates that the measures of central tendency for a set of data in grouped and ungrouped form are relatively close. 3.14 Find the range, variance, and standard deviation for the ungrouped data in Table 3.10 and compare these with the same measures found for the grouped form in problem 3.12. Ans. The range is 66 - 20 = 46. Cx = 2,299, Zcx2= 94,685 and n = 60. (2,299)* 94,685- 2 ( C X > * c x -- The variance is s2= n -- 6o = 111.788and s = 10.6. n-1 59 These measures of dispersion compare favorably with the measures for the grouped data given in problem 3.12. CHEBYSHEV’S THEOREM AND THE EMPIRICAL RULE 3.15 The mean lifetime of rats used in many psychological experiments equals 3.5 years, and the standard deviation of lifetimes is 0.5 year. At least what percent will have lifetimes between 2.5 years and 4.5 years? At least what percent will have lifetimes between 2.0 years and 5.0 years? Ans. The interval from 2.5 years to 4.5 years is a 2 standard deviation interval about the mean, i.e., k = 2 in Chebyshev’s theorem. At least 75% of the rats will have lifetimes between 2.5 years and 4.5 years. The interval fiom 2.0 years to 5.0 years is a 3 standard deviation interval about the mean. At least 89% of the lifetimes fall within this interval. 3.16 The mean height of adult females is 66 inches and the standard deviation is 2.5 inches. The distribution of heights is mound-shaped. What percent have heights between: (a) 63.5 inches and 68.5 inches? (b) 61.O inches and 71.O inches? (c) 58.5 inches and 73.5 inches? Am. 63.5 to 68.5 is a one standard deviation interval about the mean, 61.0 to 71.0 is a two standard deviation interval about the mean, and 58.5 to 73.5 is a three standard deviation interval about the mean. According to the empirical rule, the percentages are: (a)68%; (b) 95% and (c)99.7%.
  • 64.
    CHAP. 31 DESCRIPTIVEMEASURES 55 3.17 The mean length of service for Federal Bureau of Investigation (FBI) agents equals 9.5 years and the standard deviation is 2.5 years, At least what percent of the employees have between 2.0 years of service and 17.0 years of service? If the lengths of service have a bell-shaped distribution, what can you say about the percent having between 2.0 and 17.0years of service? Ans. The interval from 2.0years to 17.0years is a 3 standard deviation interval about the mean, i.e., k = 3 in Chebyshev’s theorem. Therefore, at least 89% of the agents have lengths of service in this interval. If we know that the distribution is bell shaped, then 99.7%of the agents will have lengths of service between 2.0 and 17.0years. COEFFICIENTOF VARIATION 3.18 3.19 Find the coefficient of variation for the ages in Table 3.10. Ans. From problem 3.13, the mean age is 38.2 years and from problem 3.14 the standard deviation is S 10.6 X 38.2 10.6years. Using formula (3.12), the coefficient of variation is CV = x 100% = -x loo= 27.7%. The mean yearly salary of all the employees at Pretty Printing is $42,500 and the standard deviation is $4,000. The mean number of years of education for the employees is 16 and the standard deviation is 2.5 years. Which of the two variables has the higher relative variation? 4,000 42,500 Atis. The coefficient of variation for salaries is CV = -X 2.5 16 tion for years of education is CV = -x 100 = 15.6%. variation. 2 SCORES 3.20 3.21 100= 9.4% and the coefficient of varia- Years of education has a higher relative The mean daily intake of protein for a group of individuals is 80 grams and the standard deviation is 8 grams. Find the z scores for individuals with the following daily intakes of protein: (a)95 grams; (b)75 grams; (c)80 grams. 95 -80 75-80 80-80 Ans. (a) z = -- - 1.88 (6) z = - =-.63 (c) z = - - - 0 8 8 8 Three individuals were selected from the group described in problem 3.20 who have daily intakes with z scores equal to -1.4,0.5, and 3.0. Find their daily intakes of protein. - x - x S A m . If the equation z = -is solved for x, the result is x = X +zs. The daily intake corresponding to a z score of -I .4 is x = 80 +(-1.4)(8) = 68.8 grams. For a zscore equal to 0.5, x = 80 + (0.5)(8)= 84 grams. For a z score equal to 3.0, x = 80 +(3.0)(8)= 104 grams. MEASURES OF POSITION:PERCENTILES, DECILES, AND QUARTILES 3.22 Find the percentiles for the ages 34,45, and 55 in Table 3.10.
  • 65.
    56 DESCRIPTIVEMEASURES [CHAP.3 20 60 Ans. The number of ages less than 3 4 is 20,and - x 100 = 33.396,which rounds to 33%. The age 34 is the thirty-third percentile. The number of ages less than 45 is 45,and - x 100 = 75%. The age 45 is the seventy-fifth percentile. The number of ages less than 55 is 53,and - x 100 = 88.3%, which rounds to 88%. The age 55 is the eighty-eighth percentile. 45 60 53 60 3.23 Find the ninety-fifth percentile, the seventh decile, and the first quartile for the age distribution given in Table 3.10. - 57.Pg5 is the average of the observations in positions np (60)(95) Ans. To find Pgs, compute i = -- - - - 100 100 57and 58 in the ranked data set, or the average of 58 and 62which is 60years. = 42. D 7 is the average of the observations in positions 42 and 43 in the ranked data set, or the average of 4 1 and 44 which is 42.5years. np (60)(70) To find D7,which is the same as Pt0, compute i = -- - 100 100 np (60)(25) To find QI ,compute i = -- - - = 15.QI is the average of the observations in positions 100 loo 15 and 1 6 in the ranked data set, or the average of 32 and 32which is 32 years. INTERQUARTILERANGE 3.24 Find the interquartile range for the annual returns of the mutual funds given in Table 3.5. Ans. The ranked annual returns are as follows. -5.5 -2.5 3.5 4.0 5.5 7 . 5 10.5 1 0 . 5 10.5 1 0 . 5 12.0 12.5 12.5 12.7 14.0 14.0 1 4 . 5 14.5 1 4 . 5 17.0 17.5 19.0 20.2 20.3 22.0 22.5 27.5 35.5 38.0 40.0 = 7.5. The next integer greater than 7.5 is 8,and this locates the position of Q,in the ranked data set. np (30)(25) The first quartile, which is the same as PtS ,is found by computing i = -- - loo 100 QI = 10.5. np (30)(75) The third quartile is found by computing i = -- -- = 22.5, and rounding this to 23. 100 loo The third quartile is found in position 23 in the ranked data set. 4 3= 20.2 IQR = 4 3 - QI = 20.2- 1 0 . 5= 9.7. 3 . 2 5 Find the interquartile range for the selling prices given in Table 3.7. Ans. The ranked selling prices are: 50.0 60.5 70.0 75.0 79.0 89.0 90.0 100.0 100.0 100.0 111.5 113.5 122.5 125.5 130.0 150.0 175.5 340.5 475.5 525.0 = 5. The np (20)(25) The first quartile, which is the same as P25 is found by computing i = -- - 1 0 0 100 first quartile is the average of the observations in positions 5 and 6 in the ranked data set.
  • 66.
    CHAP. 31 DESCRIPTIVEMEASURES 35 40 60 55 56 57 2590 60 45 58 90 90 55 55 80 90 60 60 85 75 60 55 75 80 90 79.0+89.0 2 = 84.0 QI= The third quartile is found by computing i = - - - ~ = 15. The third quartile is the "P (20)(75) 100 100 averageof the observationsin positions 15 and 16in the ranked data set. 130.0+ 150.0 = 140.0 IQR = Q3- QI = 140.0- 84.0 = 56.0 2 4 3 = BOX-AND-WHISKER PLOT 3.26 Table 3.11 gives the number of days that 25 individuals spent in house arrest in a criminal justice study.Use Minitabto constructaboxplot for these data. Table 3.11 Ans. The solution is shown in Fig. 3-3. It is seen that the minimum value is 25, the maximum value is 90, QI= 55, median = 60, and QJ = 83. I I I I I I I 1 20 30 40 50 60 70 80 90 Days Fig. 3-3 3.27 Construct a modifiedboxplot for the sellingprices given in Table 3.7. Ans. The ranked selling prices are: 50.0 60.5 70.0 75.0 79.0 89.0 90.0 100.0 100.0 100.0 111.5 113.5 122.5 125.5 130.0 150.0 175.5 340.5 475.5 525.0 In problem 3.25 it is shown that QI= 84.0, Q3= 140.0,and IQR = 56.0. A lower innerfence is defined to be QI- 1.5 x IQR = 84.0 - 1.5x 56.0 = 0.0 An upper innerfence is defined to be Q3+ 1.5 x IQR = 140.0+ 1.5 x 56.0 = 224.0 A lower outerfence is defined to be Q, -3 x IQR = 84.0- 3 x 56.0 = -84.0 An upper outerfence is defined to be Q3+ 3 x IQR = 140.0+ 3 x 56.0 = 308.0 The adjacent values are the most extreme values still lying within the inner fences. The adjacent values for the above data are 50.0 and 175.5, since they are the most extreme values between 0.0 and 224.0. In a modified boxplot, the whiskersextend only to the adjacent values.
  • 67.
    DESCRIPTIVE MEASURES [CHAP.3 t I 340 375 450 580 565 445 675 545 380 630 500 350 635 495 640 467 560 680 605 590 635 625 440 420 400 Data values that lie between the inner and outer fences are possible outliers and data values that lie outside the outer fences are probable outliers. The observations340.5, 475.5, and 525.0lie outside the outer fences and are called probable outliers. A Minitab printout for a boxplot of the data is given in Fig. 3-4.Each probable outlier is represented by an asterisk. I I 1 I 1 1 0 100 200 300 400 500 Sellingprice Fig. 3-4 SupplementaryProblems MEASURES OF CENTRAL TENDENCY: MEAN, MEDIAN, AND MODE FOR UNGROUPED DATA 3 . 2 8 Table 3.12gives the verbal scores for 25 individuals on the Scholastic Aptitude Test (SAT). Find the mean verbal score for this set of data. Ans. Cx = 13,027 51= 521.1 3 . 2 9 Find the median verbal scorefor the data in Table 3.12. Ans. The ranked scores are shown below. 340 350 375 380 400 420 440 445 450 467 495 500 545 560 565 580 590 605 625 630 635 635 640 675 680 The median verbal score is 545. 3.30 Find the mode for the verbal scores in Table 3.12. Am. 635 3.31 Give the output produced by the Describe command of Minitab when the data in Table 3.12 are analyzed. Ans. Descriptive Statistics Variable N Mean Median TrMean StDev SEMean SAT 25 521.1 545.0 522.0 1 0 8 . 1 21.6
  • 68.
    CHAP. 31 DESCRIPTIVEMEASURES59 Variable Min Max Q1 Q3 SAT 340.0 680.0 430.0 627.5 3.32 Table 3.13 gives a stem-and-leaf display for the number of hours per week spent watching TV for a group of teenagers. Find the mean, median, and mode for this distribution. What is the shape of the distribution'? Table 3.13 1 1 0 0 0 0 0 0 0 0 0 0 0 0 5 5 5 Am. The mean, median, and mode are each equal to 20 hours. The distribution has a bell shape with center at 20. MEASURES OF DISPERSION: RANGE, VARIANCE, AND STANDARD DEVIATION FOR UNGROUPED DATA 3.33 3.34 3.35 3.36 3.37 Find the range, variance, and standard deviation for the SAT scores in Table 3.12. Am. range = 340 s2= I 1,692.91 s = 108.13 Find the range, variance, and standard deviation for the number of hours spent watching TV given in Table 3.13. Aim range = 20 s2= 18.4213 s = 4.29 What should your response be if you find that the variance of a data set equals -5.5'? Ans. Consider a data set in which all observations are equal. Find the range, variance, and standard deviation for this data set. Airs. A data set consisting of 10observations has a mean equal to 0, and a variance equal to a. Express Ex2 in terms of a. Airs. Ex2= 9a Check your calculations, since the variance can never be negative The range, variance, and standard deviation will all equal zero. MEASURES OF CENTRAL TENDENCY AND DISPERSION FOR GROUPED DATA 3.38 Table 3.14 gives the distribution of the words per minute for 60 individuals using a word processor. Find the mean, median, and mode for this distribution. Table 3.14 Words per minute I Frequency I 40-49 50-59 60-69 70-79 I 80-89 I 90-99 3 9 15 15 12 6
  • 69.
    60 DESCRIPTIVE MEASURES[CHAP. 3 Ans. 3.39 Find the range, variance, and standard deviation for the distribution in Table 3.14. mean = 71 .5, median = 7I .5, modes = 64.5 and 74.5 Am. range = 60 s2= 184.0678 s = 13.57 3.40 A quality control technician records the number of defective units found daily in samples of size 100 for the month of July. The distribution of the number of defectives per 100 units is shown in Table 3.15. Find the mean, median, and mode for this distribution. Which of the three measures is the most representative for the distribution'? Table 3.15 Number of defectives 27 A m . mean = 2 median = 0 mode = 0 The median or mode would be more representative than the mean. On most days, there were none or one defective in the sample. The two days on which the process was out of control inflated the mean. 3.41 The following formula is sometimes used for finding the median of grouped data. 5n-c f Median = L + - x ( U - L ) In this formula, L is the lower boundary of the median class, U is the upper boundary of the median class, n is the nurnber of observations, f is the frequency of the median class, and c is the cumulative frequency of the class proceeding the median class. Give the values for L, U, n, f, c, and find the median for the distribution given in Table 3.14. A ~ s . L = 69.5 U = 79.5n = 60 f = 15 c = 27 median = 71.5 CHEBYSHEV'S THEOREM AND THE EMPIRICAL RULE 3.42 3.43 3.44 A sociological study of gang members in a large midwestern city found the mean age of the gang members in the study to be 14.5 years and the standard deviation to be 1.5 years. According to Chebyshev's theorem, at least what percent will be between 10.0and 19.0years of age? Ans. 89% A psychological study of Alcoholic Anonymous members found the mean nurnber of years without drinking alcohol for individuals in the study to be 5.5 years and the standard deviation to be 1.5 years. The distribution of the number of years without drinking is bell-shaped. What percent of the distribution is between: ((I) 4.0 and 7.0 years; (b)2.5 and 8.5 years; (c) 1.0and 10.0years? Am. ( a )68% (h)95% (c) 99.7% The ranked verbal SAT scores in Table 3.12 are: 340 350 375 380 400 420 440 445 450 467 495 500 545 560 565 580 590 605 625 630 635 635 640 675 680
  • 70.
    CHAP. 31 DESCRIPTIVEMEASURES 61 The mean and standard deviation of these data are 521.1 and 108.1, respectively. According to Chebyshev’s theorem, at least 75% of the observations are between 304.9 and 737.3. What is the actual percent of observations within two standard deviations of the mean? Ans. 100% COEFFICIENT OF VARIATION 3.45 3.46 The verbal scores on the SAT given in Table 3.12 have a mean equal to 52 1.1 and a standard deviation equal to 108.1. Find the coefficient of variation for these scores. Ans. 20.7% Fastners Inc. produces nuts and bolts. One of their bolts has a mean length of 2.00 inches with a standard deviation equal to 0.10 inch, and another type bolt has a mean length of 0.25 inch. What standard deviation would the second type bolt need to have in order that both types of bolts have the same coefficient of variation? Am. 0.0125 inch Z SCORES 3.47 The low-density lipoprotein (LDL) cholesterol concentration for a group has a mean equal to 140 mg/dL and a standard deviation equal to 40 mg/dL. Find the z scores for individuals having LDL values of (a) 1IS; (b) 140; and (c) 200. AILS. (a) -0.63 (b) 0.0 (c) 1.50 3.48 Three individuals from the group described in problem 3.47 have z scores equal to (a)-1.75; (b)0.5; and (c) 2.0. Find their LDL values. Ans. (a) 70 (6) 160 (c) 220 MEASURES OF POSITION: PERCENTILES, DECILES, AND QUARTILES 3.49 Table 3.16 gives the ages of commercial aircraft randomly selected from several airlines. Find the percentiles for the ages 10, 15,and 20. Table 3.16 Ans. The age 10 is the thirtieth percentile. The age 15 is the fifty-eighth percentile. The age 20 is the eighty-fourth percentile. 3.50 Find Pm, Dg, and Q 3 for the commercial aircraft ages in Table 3.16.
  • 71.
    62 DESCRIPTIVE MEASURES[CHAP. 3 INTERQUARTILE RANGE 3.51 'The first quartile for the salaries ofcounty sheriffs in the United States is $37,500 and the third quartile is $50.500. What is the interquartile range for the salaries of county sheriffs? Ans. $13,000 3.52 Find the interquartile range for the ages of commercial aircraft given in Table 3.16. A m . 9 years BOX-AND-WHISKER PLOT 3.53 A Minitab produced boxplot for the weights of high school football players is shown in Fig. 3-5. Give thc rninimum. rnaximum, first quartile, median, and third quartile. 77- -T---l---l--n 1 6 200 2 1 0 3 0 230 240 250 260 270 Weights Fig. 3-5 Ans. rninimum = 195 maximum = 270 Q, = 205 Q2= 240 Q3= 260 3.54 Construct or use Minitab to construct a boxplot for the ages of the commercial aircraft in Table 3.16. Atm The boxplot is shown in Fig. 3-6. I I I 1 0 10 20 30 Age Fig. 3-6
  • 72.
    Chapter 4 ProbabiIity EXPERIMENT,OUTCOMES, ANDSAMPLESPACE An experiment is any operation or procedure whose outcomes cannot be predicted with certainty. The set of all possible outconies for an experiment is called the sample space for the experiment. EXAMPLE4.1 Games of chance are examples of experiments. The single toss of a coin is an experiment whose outcomes cannot be predicted with certainty. The sample space consists of two outcomes, heads or tails. The letter S is used to represent the sample space and may be represented as S = (H, T}.The single toss of a die is an experiment resulting in one of six outcomes. S may be represented as ( I , 2, 3, 4, 5 , 6). When a card is selected from a standard deck, 52 outcomes are possible. When a roulette wheel is spun, the outcome cannot be predicted with certainty. EXAMPLE 4.2 When a quality control technician selects an item for inspection from a production line, it may be classified as defective or nondefective. The sample space may be represented by S = (D,N}. When the blood type of a patient is determined, the sample space may be represented as S = (A, AB, B, 0 ) . When the Myers- Briggs personality type indicator is administered to an individual, the sample space consists of I6 possible outcomes. The experiments discussed in Examples 4.1 and 4.2 are rather simple experiments and the descriptions of the sample spaces are straightforward. More complicated experiments are discussed in the following section and techniques such as tree diagrams are utilized to describe the sample space for these experiments. TREE DIAGRAMS AND THE COUNTING RULE In a tree diagram, each outcome of an experiment is represented as a branch of a geometric figure called a tree. EXAMPLE 4.3 Figure 4-1 shows a tree diagram for the experiment of tossing a coin twice. The tree has four branches. Each branch is an outcome for the experiment. If the experiment is expanded to three tosses, the branches are simply continued with H or T added to the end of each branch shown in Fig. 4-1. This would result in the eight outcomes: HHH, HHT, HTH, HTT, THH, THT, TTH,and TTT. This technique could be continued systematically to give the outcomes for n tosses of a coin. Notice that 2 tosses has 4 outcomes and 3 tosses has 8 outcomes. N tosses has 2Npossible outcomes. The courztirzg rule for a two-step experiment states that if the first step can result in any one of nl outcomes, and the second step in any one of n2 outcomes, then the experiment can result in (nl)(nz) outcomes. If a third step is added with n3 outcomes, then the experiment can result in (nl)(nz)(nl) outcomes. The counting rule applies to an experiment consisting of any number of steps. If the counting rule is applied to Example 4.3, we see that for two tosses of a coin, nl = 2, n2 = 2, and the number of outcomes for the experiment is 2 x 2 = 4. For three tosses, there are 2 x 2 x 2 = 8 outcomes and so forth. The counting rule may be used to figure the number of outcomes of an experiment and then a tree diagram may be used to actually represent the outcomes.
  • 73.
    64 1 2 Die2 3 - 4 5 6 PROBABILITY J, 8 8 8 8 8 - 8 8 8 8 8 8 8 8 8 8 8 8 - 8 8 8 8 W W -. 8 8 8 8 8 8 8 8 8 8 [CHAP. 4 first toss second toss outcome HH I HT TH TT Fig. 4-1 EXAMPLE 4.4 For the experiment of rolling a pair of dice, the first die may be any of six numbers and the second die may be any one of six numbers. According to the counting rule, there are 6 x 6 = 36 outcomes. The outcomes may be represented by a tree having 36 branches. The sample space may also be represented by a two- dimensional plot as shown in Fig. 4-2. Fig. 4-2 EXAMPLE 4.5 An experiment consists of observing the blood types for five randomly selected individuals. Each of the five will have one of four blood types A, B, AB, or 0. Using the counting rule, we see that the experiment has 4 x 4 x 4 x 4 x 4 = 1,024 possible outcomes. In this case constructing a tree diagram would be difficult. EVENTS, SIMPLEEVENTS, AND COMPOUNDEVENTS An event is a subset of the sample space consisting of at least one outcome from the sample space. If the event consists of exactly one outcome, it is called a simple event. If an event consists of more than one outcome, it is called a compound event. EXAMPLE4.6 A quality control technician selects two computer mother boards and classifies each as defective or nondefective. The sample space may be represented as S = {NN, ND, DN, DO], where D represents a defective unit and N represents a nondefective unit. Let A represent the event that neither unit is defective and let B represent the event that at least one of the units is defective. A = { NN} is a simple event and B = {ND,DN, DD) is a compound event. Figure 4-3 is a Venn Diagram representation of the sample space S
  • 74.
    CHAP. 41 PROBABILITY65 and the events A and B. In a Venn diagram, the sample space is usually represented by a rectangle and events are represented by circles within the rectangle. Fig. 4-3 EXAMPLE 4.7 For the experiment described in Example 4.5, there are 1,024different outcomes for the blood types of the five individuals. The compound event that all five have the same blood type is composed of the following four outcomes: (A, A, A, A, A), (B, B, B, B, B), (AB, AB, AB, AB, AB), and (0,0,0,0, 0).The simple event that all five have blood type 0 would be the outcome (0,0,0,0 , O ) . PROBABILITY Probability is a measure of the likelihood of the occurrence of some event. There are several different definitions of probability. Three definitions are discussed in the next section. The particular definition that is utilized depends upon the nature of the event under consideration. However, all the definitions satisfy the following two specific properties and obey the rules of probability developed later in this chapter. The probability of any event E is represented by the symbol P(E) and the symbol is read as “P of E” or as “the probability of event E.” P(E) is a real number between zero and one as indicated in the following inequality: 0 I P(E) I 1 (4.1) The sum of the probabilities for all the simple events of an experiment must equal one. That is, if E1 ,E2 , . . . ,E,are the simple events for an experiment, then the following equality must be true: P(E1) +P(E2) + . ..+P(En)= 1 (4.2) Equality (4.2)is also sometimes expressed as in formula (4.3): P(S) = 1 (4-3) Equation (4.3)states that the probability that some outcome in the sample space will occur is one. CLASSICAL, RELATIVE FREQUENCY,AND SUBJECTIVE PROBABILITY DEFINITIONS The classical defirtition o f probability is appropriate when all outcome‘s of an experiment are equally likely. For an experiment consisting of n outcomes, the classical definition of probability
  • 75.
    66 PROBABILITY [CHAP.4 Type of transplant Heart Kidney Liver Pancreas Eyes 1 n assigns probability - to each outcome or simple event. For an event E consisting of k outcomes, the probability of event E is given by formula (4.4) k n P(E) = - (4.4) Waiting Time for Transplant 10 5 7 3 5 5 3 2 5 5 Less than one year One year or more EXAMPLE 4.8 The experiment of selecting one card randomly from a standard deck of cards has 52 equally likely outcomes. The event A, = (club) has probability g,since A, consists of 13 outcomes. The event A2 = {red card) has probability E,since A2 consists of 26 outcomes. The event A? = { face card (Jack, Queen, King)} has probability g,since A3 consists of 12outcomes. EXAMPLE 4.9 Table 4.1 gives information concerning fifty organ transplants in the state of Nebraska during a recent year. Each patient represented in Table 4.1 had only one transplant. If one of the 50 patient records is randomly selected, the probability that the patient had a heart transplant is = .30,since 15 of the patients had heart transplants. The probability that a randomly selected patient had to wait one year or more for the transplant I S - = .40, since 20 of the patients had to wait one year or more. The display in Table 4.1 is called a titw-huy table. It displays two different variables concerning the patients. ’ 20 so EXAMPLE 4.10 To find the probability of the event A that the sum of the numbers on the faces of a pair of dice equals seven when a pair of dice is rolled, consider the sample space shown in Fig. 4-4. The event A is shown as a rectangular box in the sample space. The outcomes in A are as follows A = { ( 1, 6),(2, 5). (3, 4), (4, 3). (5, 2), (6, I)}. Since A contains six of the thirty-six equally likely outcomes for the experiment, the probability of event A is . 36 Die 2 2 - 0 0 1 - - . I I r I 1 2 3 4 Die 1 Fig. 4-4
  • 76.
    CHAP. 41 PROBABILITY67 The classical definition of probability is not always appropriate in computing probabilities of events. If a coin is bent, heads and tails are not equally likely outcomes. If a die has been loaded, each of the six faces do not have probability of occurrence equal to i. For experiments not having equally likely outcomes, the relative frequertcv definition ofprohahility is appropriate. The relative frequency definition of probability states that if an experiment is performed n times, and if event E occurs f times, then the probability of event E is given by formula (4.5). f P(E) = - n (4.5) EXAMPLE 4.11 A bent coin is tossed 50 times and a head appears on 35 of the tosses. The relative frequency definition of probability assigns the probability -7-5 = .70 to the event that a head occurs when this coin is tossed. A loaded die is tossed 75 times and the face “6” appears 15 times in the 75 tosses. The relative frequency definition of probability assigns the probability = .20 to the event that the face “6” will appear when this die is tossed. 50 EXAMPLE 4.12 A study by the state of Tennessee found that when 750 drivers were randomly stopped 47 1 were found to be wearing seat belts. The relative frequency probability that a driver wears a seat belt in Tennessee is I = 0.63. 471 750 There are many circumstances where neither the classical definition nor the relative frequency definition of probability is applicable. The subjectiLfe defirzitiorz o f yrohabilihy utilizes intuition, experience, and collective wisdom to assign a degree of belief that an event will occur. This method of assigning probabilities allows for several different assignments of probability to a given event. The different assignments must satisfy formulas (4.I ) and (4.2). EXAMPLE 4.13 A military planner states that the probability of nuclear war in the next year is 1%. The individual is assigning a subjective probability of .01 to the probability of the event “nuclear war in the next year.” This event does not lend itself to either the classical definition or the relative frequency definition of probability. EXAMPLE 4.14 A medical doctor tells a patient with a newly diagnosed cancer that the probahility of successfully treating the cancer is 90%. The doctor is assigning a subjective probability of .90 to the event that the cancer can be successfully treated. The probability for this event cannot be determined by either the classical definition or the relative frequency definition of probability. MARGINAL AND CONDITIONAL PROBABILITIES Table 4.2 classifies the 500 members of a police department according to their minority status as well as their promotional status during the past year. One hundred of the individuals were classified as being a minority and seventy were promoted during the past year. The probability that a randomly selected individual from the police department is a minority is = .20 and the probability that a randomly selected person was promoted during the past year is & = .14. Table 4.3 is obtained by dividing each entry in Table 4.2 by 500.
  • 77.
    68 Promoted No PROBABILITY Minority No Yes Total 35080 430 [CHAP. 4 ~ ~ ~ Promoted No No .70 Yes . l O Total .80 Table 4.2 Yes Total .16 .86 .04 .14 .20 1.oo Yes 50 I 20 I 70 I I Total I 400 I 100 I 500 I The four probabilities in the center of Table 4.3, .70, .16, .10, and .04, are called joint probabilities. The four probabilities in the margin of the table, .80, .20, .86, and .14, are called mtirginal probabilities. Table 4.3 a I Minoritv I I The joint probabilities concerning the selected police officer may be described as follows: .70 = the probability that the selected officer is not a minority and was not promoted .I6 = the probability that the selected officer is a minority and was not promoted .I0= the probability that the selected officer is not a minority and was promoted .04= the probability that the selected officer is a minority and was promoted The marginal probabilities concerning the selected police officer may be described as follows: .80 = the probability that the selected officer is not a minority .20 = the probability that the selected officer is a minority .86= the probability that the selected officer was not promoted during the last year .14 = the probability that the selected officer was promoted during the last year In addition to the joint and marginal probabilities discussed above, another important concept is that of a corzditional probability. If it is known that the selected police officer is a minority, then the conditional probability of promotion during the past year is = .20, since 100 of the police officers in Table 4.2 were classified as minority and 20 of those were promoted. This same probability may be obtained from Table 4.3 by using the ratio $ = .20. The formula for the conditional probability of the occurrence of event A given that event B is known ta have occurred for some experiment is represented by P(A I B) and is the ratio of the joint probability of A and B divided by the probability of B. The following formula is used to compute a conditional probability. P(A and B) P(B) P(A I B) = The following example summarizes the above discussion and the newly introduced notation. EXAMPLE 4.15 For the experiment of selecting one police officer at random from those described in Table 4.2,define event A to be the event that the individual was promoted last year and define event B to be the event
  • 78.
    CHAP. 41 PROBABILITY69 that the individual is a minority. The joint probability of A and B is expressed as P(A and B) = .04. The marginal probabilities of A and B are expressed as P(A) = .I4 and P(B) = .20. The conditional probability of A given B is P(A IB) = P(AandB) .04 - -= .20. P(B) - .20 MUTUALLY EXCLUSIVE EVENTS Two or more events are said to be nzutuallv exclusive if the events do not have any outcomes in common. They are events that cannot occur together. If A and B are mutually exclusive events then the joint probability of A and B equals zero, that is, P(A and B) = 0. A Venn diagram representation of two mutually exclusive events is shown in Fig. 4-5. Fig. 4-5 EXAMPLE 4.16 An experiment consists in observing the gender of two randomly selected individuals. The event, A, that both individuals are male and the event, B, that both individuals are female are mutually exclusive since if both are male, then both cannot be female and P(A and B) = 0. EXAMPLE 4.17 Let event A be the event that an employee at a large company is a white collar worker and let B be the event that an employee is a blue collar worker. Then A and B are mutually exclusive since an employee cannot be both a blue collar worker and a white collar worker and P(A and B) = 0. DEPENDENT AND INDEPENDENT EVENTS If the knowledge that some event B has occurred influences the probability of the occurrence of another event A, then A and B are said to be dependent events. If knowing that event B has occurred does not affect the probability of the occurrence of event A, then A and B are said to be independent events. Two events are independent if the following equation is satisfied. Otherwise the events are dependent. P(A I B) = P(A) (4.7) The event of having a criminal record and the event of not having a father in the home are dependent events. The events of being a diabetic and having a family history of diabetes are dependent events, since diabetes is an inheritable disease. The events of having 10letters in your last name and being a sociology major are independent events. However, many times it is not obvious whether two events are dependent or independent. In such cases, formula (4.7)is used to determine whether the events are independent or not.
  • 79.
    70 Smoker No Yes Total PROBABILITY History of HeartDisease No Yes Total 75 5 80 35 10 45 110 15 125 [CHAP. 4 EXAMPLE 4.18 For the experiment of drawing one card from a standard deck of 52 cards, let A be the event that a club is selected, let B be the event that a face card (jack, queen, or king) is drawn, and let C be the event that a jack is drawn. Then A and B are independent events since P(A) = $ = .25 and P(A I B) = $ = .25. P(A I B) = 2 = .25. since there are 12 face cards and 3 of them are clubs. The events B and C are dependent events since P(C)= 5 = .077 and P(C IB) = 5 = .333. P(C I B) = -% = .333, since there are 12 face cards and 4 of them are jacks. 12 52 12 12 EXAMPLE 4.19 Suppose one patient record is selected from the 125 represented in Table 4.4. The event that a patient has a history of heart disease, A, and the event that a patient is a smoker, B, are dependent events, since = .22. For this group of patients, knowing that an individual is a smoker = .I2 and P(A I B) = 15 12s P(A)= - almost doubles the probability that the individual has a history of heart disease. Table 4.4 COMPLEMENTARYEVENTS To every event A, there corresponds another event A', called the complement of A and consisting of all other outcomes in the sample space not in event A. The word nut is used to describe the complement of an event. The complement of selecting a red card is not selecting a red card. The complement of being a smoker is not being a smoker. Since an event and its complement must account for all the outcomes of an experiment, their probabilities must add up to one. If A and A' are complementary events then the following equation must be true. P(A) +P(A') = 1 (4.8) EXAMPLE 4.20 Approximately 2% of the American population is diabetic. The probability that a randomly chosen American is not diabetic is .98, since P(A)= .02, where A is the event of being diabetic, and .02 + P(A') = 1. Solving for P(A') we get P(A') = 1 - .02 = .98. EXAMPLE4.21 Find the probability that on a given roll of a pair of dice that "snake eyes" are not rolled. Snake eyes means that a one was observed on each of the dice. Let A be the event of rolling snake eyes. Then P(A)= L!- = .028. The event that snake eyes are not rolled is AC.Then using formula (4.8),.028 + P(A') = 1, and solving for P(A'), it follows that P(AC)= I - .028 = .972. 26 Complementary events are always mutually exclusive events but mutually exclusive events are not always complementary events. The events of drawing a club and drawing a diamond from a standard deck of cards are mutually exclusive, but they are not complementary events. MULTIPLICATION RULE FOR THE INTERSECTION OF EVENTS The intersection of two events A and B consists of all those outcomes which are common to both A and B. The intersection of the two events is represented as A and B. The intersection of two events
  • 80.
    CHAP. 41 PROBABILITY71 is also represented by the symbol A n B, read as A intersect B. A Venn diagram representation of the intersection of two events is shown in Fig. 4-6. Fig. 4-6 The probability of the intersection of two events is given by the inultiplication rule. The multiplication rule is obtained from the formula for conditional probabilities, and is given by formula (4.9). P(A and B) = P(A) P(B IA) (4.9) EXAMPLE 4.22 A small hospital has 40 physicians on staff of which 5 are cardiologists. The probability that two randomly selected physicians are both cardiologists is determined as follows. Let A be the event that the first selected physician is a cardiologist, and B the event that the second selected physician is a cardiologist. Then P(A) = $ = ,125, P(B I A) = L=.103 and P(A and B) = .125 x .103 = .013. If two physicians were selected from a group of 40,000of which 5,000were cardiologists, then P(A) = = .125, P(B IA) = 4,999 40,000 39,999 39 = .125, and P(A and B) = (.125)2= .016. Notice that when the selection is from a large group, the probability of selecting a cardiologist on the second selection is approximately the same as selecting one on the first selection. Following this line of reasoning, suppose it is known that 12.5% of all physicians are cardiologists. If three equals (.125)3 = .002. The physicians are selected randomly, the probability that all three are cardiologists probability that none of the three are cardiologists is (.875)3 = .670. If events A and B are independent events, then P(A1B) = P(A) and P(B is replaced by P(B), formula (4.9)simplifies to P(A and B) =P(A) P(B) A) = P(B). When P(B IA) (4.10) EXAMPLE 4.23 Ten percent of a particular population have hypertension and 40 percent of the same population have a home computer. Assuming that having hypertension and owning a home computer are independent events, the probability that an individual from this population has hypertension and owns a home computer is .10 x .40 = -04.Another way of stating this result is that 4 percent have hypertension and own a home computer. ADDITION RULE FOR THE UNION OF EVENTS The union of two events A and B consists of all those outcomes that belong to A or B or both A and B. The union of events A and B is represented as A U B or simply as A or B. A Venn diagram representation of the union of two events is shown in Fig. 4-7. The darker part of the shaded union of the two events corresponds to overlap and corresponds to the outcomes in both A and B.
  • 81.
    72 PROBABILITY [CHAP.4 Fig. 4-7 To find the probability in the union, we add P(A) and P(B). Notice, however, that the darker part indicates that P(A and B) gets added twice and must be subtracted out to obtain the correct probability. The resultant equation is called the addition rule for probabilities and is given by formula (4.1I). P(A or B) = P(A) +P(B) -P(A and B) (4.11) If A and B are mutually exclusive events, then P(A and B) = 0 and the formula (4.11)simplifies to the following. P(A or B) = P(A) +P(B) EXAMPLE4.24 Forty percent of the employees at Computec, Inc. have a college degree, 30 percent have been with Computec for at least three years, and 15 percent have both a college degree and have been with the company for at least three years. If A is the event that a randomly selected employee has a college degree and B is the event that a randomly selected employee has been with the company at least three years, then A or B is the event that an employee has a college degree or has been with the company at least three years. The probability of A or B is .40 + .30 - .15 = .55. Another way of stating the result is that 55 percent of the employees have a college degree or have been with Computec for at least three years. EXAMPLE 4.25 A hospital employs 25 medical-surgical nurses, 10intensive care nurses, 15 emergency room nurses, and 50 floor care nurses. If a nurse is selected at random, the probability that the nurse is a medical- surgical nurse or an emergency room nurse is .25 + .15 = .40. Since the events of being a medical-surgical nurse and an emergency room nurse are mutually exclusive, the probability is simply the sum of probabilities of the two events. BAYES’ THEOREM A computer disk manufacturer has three locations that produce computer disks. The Omaha plant produces 30% of the disks, of which 0.5% are defective. The Memphis plant produces 50% of the disks, of which 0.75% are defective. The Kansas City plant produces the remaining 20%, of which 0.25% are defective. If a disk is purchased at a store and found to be defective, what is the probability that it was manufactured by the Omaha plant? This type of problem can be solved using Bayes’ theorem. To formalize our approach, let AI be the event that the disk was manufactured by the Omaha plant, let A2 be the event that the disk was manufactured by the Memphis plant, and let A3 be the event that it was manufactured by the Kansas City plant. Let B be the event that the disk is defective. We are asked to find P(A1 I B). This probability is obtained by dividing P(A1 and B) by P(B)-
  • 82.
    CHAP. 41 PROBABILITY73 The event that a disk is defective occurs if the disk is manufactured by the Omaha plant and is defective or if the disk is manufactured by the Memphis plant and is defective or if the disk is manufactured by the Kansas City plant and is defective. This is expressed as follows B = (AIand B) or (A2 and B) or (A3 and B) (4.12) Because the three events which are connected by or’s in formula (4.12)are mutually exclusive, P(B) may be expressed as P(B) =P(A1 and B) +P(A2 and B) +P(A3and B) (4.13) By using the multiplication rule, formula (4.13)may be expressed as Using formula (4.14),P(B) = .005x .3 + .0075 x .5 + .0025 x .2 = .00575. That is, 0.575% of the disks manufactured by all three plants are defective. The probability P(AIand B) equals .OO15 ,00575 P(B I AI) P(AI) = ,005 x .3 = .0015. The probability we are seeking is equal to -= .261. Summarizing, if a defective disk is found, the probability that it was manufactured by the Omaha plant is .261. In using Bayes’ theorem to find P(A1 I B) use the following steps: Step 1: Compute P(A1 and B) by using the equation P(A1and B) = P(B IAI)P(A1). Step 2: Compute P(B) by using formula (4.14). Step 3: Divide the result in step 1by the result in step 2 to obtain P(A1 I B). These same steps may be used to find P(A2 IB) and P(A3 IB). Events like A1 ,A2 ,and A3 are called collectively exhaustive. They are mutually exclusive and their union equals the sample space. Bayes’ theorem is applicable to any number of collectively exhaustive events. EXAMPLE 4.26 Using the three-step procedure given above, the probability that a defective disk was manufactured by the Memphisplant is found as follows. Step 1: P(A2 and B) = P(B I Az) P(A2) = .0075 x .5 = .00375. Step 2: P(B) = .005 x .3 + .0075 x .5 + .0025 x .2 = ,00575. .00375 .00575 Step 3: P(A21 B) = - = .652. The probability that a defective disk was manufactured by the Kansas City plant is found as follows. Step 1: P(A3 and B) = P(B IA3) P(A3) = .0025 x .2= .0005. Step 2: P(B) = .005x .3 + .0075 x .5 + .0025 x .2 = .00575. .0005 .00575 Step 3: P(A2 IB) = - - - .087. PERMUTATIONSAND COMBINATIONS Many of the experiments in statistics involve the selection of a subset of items from a larger group of items. The experiment of selecting two letters from the four letters a, b, c, and d is such an experiment. The following pairs are possible: (a, b), (a, c), (a, d), (b, c), (b, d), and (c, d). We say that
  • 83.
    74 PROBABILITY [CHAP.4 when selecting two items from four distinct items that there are six possible combinations. The number of combinations possible when selecting n from N items is represented by the symbol Cf;’ and is given by N ! n!( N -n)! cf;’= (4.15) NC”,C(N, n), and addition to the symbol C,”. The symbol n!, read as “n factorial,” is equal to n x (n - I ) x (n - 2) x . . . x 1. For example, 3! = 3 x 2 x 1 = 6, and 4! = 4 x 3 x 2 x I = 24. The values for n! become very large even for small values of n. The value of 10!is 3,628,800, for example. In the context of selecting two letters from four, N! = 4! = 4 x 3 x 2 x 1 = 24, n! = 2! = 2 x 1 = 2 and (N - n)! = 2! = 2. The number of combinations possible when selecting two items from four is are three other notations that are used for the number of combinations in [ : I 24 - 4! given by c; = -- -= 6 , the same number obtained when we listed all possibilities above. 2!2! 2 x 2 When the number of items is larger than four or five, it is difficult to enumerate all of the possibilities. EXAMPLE 4.27 The number of five card poker hands that can be dealt from a deck of 52 cards is given by 52! 52 x 5 I x 50x49 x48 x 47! 52 x 5 1 x 5 0 ~ 4 9 x 48 c ; 2 = - - - - - = 2,598,960. Notice that by expressing 52! 5!47! 120x47! 120 as 52 x 51 x 50 x 49 x 48 x 47!, we are able to divide 47! out because it is a common factor in both the numerator and the denominator. If the order of selection of items is important, then we are interested in the number of perniutations possible when selecting n items from N items. The number of permutations possible when selecting n objects from N objects is represented by the symbol Pc;’,and given by N ! (N-n)! pf;’=- (4.16) NPn, P(N, n) and (N)n are other symbols used to represent the number of permutations. EXAMPLE 4.28 The number of permutations possible when selecting two letters from the four letters a, b, c, anddis p!=-=-=-= 12. In this case, the 12 permutations are easy to list. They are ab, ba, ac, ca, ad, da, bc, cb, bd, db, cd, and dc. There are always more permutations than combinations when selecting n items from N, because each different ordering is a different permutation but not a different combination. 4! 4! 24 (4-2)! 2! 2 EXAMPLE 4.29 A president, vice president, and treasurer are to be selected from a group of 10 individuals. How many different choices are possible? In this case, the order of listing of the three individuals for the three offices is important because a slate of Jim, Joe, and Jane for president, vice president, and treasurer is different from Joe, Jim, and Jane for president, vice president, and treasurer, for example. The number of permutations is P:* = 7 = 1Ox 9 x 8 = 720. That is, there are 720 different sets of size three that could serve as president, vice presideqt, and treasurer. 1O!
  • 84.
    CHAP. 41 PROBABILITY75 USING PERMUTATIONSAND COMBINATIONSTO SOLVE PROBABILITY PROBLEMS EXAMPLE4.30 For a lotto contest in which six numbers are selected from the numbers 01 through 45, the number of combinations possible for the six numbers selected is Cis= - - -8,145,060. The probability that you select the correct six numbers in order to win this lotto is = .000000123. The probability of winning the lotto can be reduced by requiring that the six numbers be selected in the correct order. The number 4s ! of permutations possible when six numbers are selected from the numbers 01 through 45 is given by p:' = - 39 ! = 45 x 44 x 43 x 42 x 41 x 40 = 5,864,443,200. The probability of winning the lotto is .00000000017I, when the order of selection of the six numbers is important. 45! 6!39! 1 8,145,060 - - I 5,864,443,200 EXAMPLE 4.31 A Royal Flush is a five-card hand consisting of the ace, king, queen, jack, and ten of the same suit. The probability of a Royal Flush is equal to = .00000154, since from Example 4.27, there are 2,598,960 five-card hands possible and four of them are Royal Flushes. 4 2,598,960 Solved Problems EXPERIMENT,OUTCOMES,AND SAMPLESPACE 4.1 4.2 An experiment consists of flipping a coin, followed by tossing a die. Give the sample space for this experiment. Ans. One of many possible representations of the sample space is S = {HI, H2, H3, H4, HS, H6, T1, T2, T3, T4, T5, T6). Give the sample space for observing a patient's Rh blood type. Ans. One of many possible representations of the sample space is S = { Rh-, Rh'). TREE DIAGRAMS AND THE COUNTING RULE 4.3 Use a tree diagram to illustrate the sample space for the experiment of observing the sex of the children in families consisting of three children. Ans. The tree diagram representation for the sex distribution of the three children is shown in Fig. 4-8, where, for example, the branch or outcome mfm represents the outcome that the first born was a male, the second born was a female, and the last born was a male.
  • 85.
    76 PROBABILITY [CHAP.4 Fig. 4-8 4.4 A sociological study consists of recording the marital status, religion, race, and income of an individual. If marital status is classified into one of four categories, religion into one of three categories, race into one of five categories, and income into one of five categories, how many outcomes are possible for the experiment of recording the information of one of these individuals? Ans. Using the counting rule, we see that there are 4 x 3 x 5 x 5 = 300 outcomes possible. EVENTS, SIMPLEEVENTS, AND COMPOUNDEVENTS 4.5 For the sample space given in Fig. 4-8, give the outcomes associated with the following events and classify each as a simple event or a compound event. (a) At least one of the children is a girl. (b) All the children are of the same sex. (c) None of the children are boys. (6) All of the children are boys. Ans. The event that at least one of the children is a girl means that either one of the three was a girl, or two of the three were girls, or all three were girls. The event that all were of the same sex means that all three were boys or all three were girls. The event that none were boys means that all three were girls. The outcomes for these events are as follows: (a) mmf, mfm, mff, fmm, fmf, ffm, fff; compound event (b) mmm, fff; compound event (c) fff; simple event (4 mmrn; simple event 4.6 In the game of Yahtzee, five dice are thrown simultaneously. How many outcomes are there for this experiment? Give the outcomes that correspond to the event that the same number appeared on all five dice. Ans. By the counting rule, there are 6 x 6 x 6 x 6 x 6 = 7,776 outcomes possible. Six of these 7,776 outcomes correspond to the event that the same number appeared on all five dice. These six outcomes are as follows: ( I , 1, 1, 1, l), (2, 2, 2, 2, 2), (3, 3, 3, 3, 3), (4, 4, 4, 4, 4), (5, 5 , 5 , 5 , 5). (6, 6,6,6,6).
  • 86.
    CHAP. 41 PROBABILITY77 PROBABILITY 4.7 Which of the following are permissible values for the probability of the event E? (a) P(E)= .75 (b) P(E) = -.25 (c) P(E) = 1S O (d) P(E)= 1 (e) P(E) = .01 Ans. The probabilities given in (a), (d), and (e) are permissible since they are between 0 and I inclusive. The probability in (6) is not permissible, since probability measure can never be negative. The probability in (c) is not permissible, since probability measure can never exceed one. 4.8 An experiment is made up of five simple events designated A, B, C, D, and E. Given that P(A) =.I , P(B) = .2, P(C) = 3, and P(E) = .2, find P(D). Am. The sum of the probabilities for all simple events in an experiment must equal one. This implies that .1 + .2 + .3 + P(D) + .2 = 1, and solving for P(D), we find that P(D) = .2. Note that this experiment does not have equally likely outcomes. CLASSICAL, RELATIVE FREQUENCY, AND SUBJECTIVE PROBABILITY DEFINITIONS 4.9 A container has 5 red balls, 10 white balls, and 35 blue balls. One of the balls is selected randomly. Find the probability that the selected ball is (a)red; (h)white; (c)blue. Ans. This experiment has 50 equally likely outcomes. The event that the ball is red consists of five outcomes, the event that the ball is white consists of 10 outcomes, and the event that the ball is blue consists of 35 outcomes. Using the classical definition of probability, the following probabilities are obtained: (a) &=.10 ; (b) 5o =.20; ( c ) io =.70. 10 35 4.10 A store manager notes that for 250 randomly selected customers, 75 use coupons in their purchase. What definition of probability should the manager use to compute the probability that a customer will use coupons in their store purchase? What probability should be assigned to this event? Ans. The relative frequency definition of probability should be used. The probability of using coupons in store purchases is approximately 250 =.30. 75 4.11 Statements such as “the probability of snow tonight is 70%,” “the probability that it will rain today is 20%,” and “the probability that a new computer software package will be successful is 99%”are examples of what type of probability assignment? Ans. Since all three of the statements are based on professional judgment and experience, they are subjective probability assignments. MARGINAL AND CONDITIONAL PROBABILITIES 4.12 Financial Planning Consultants Inc. keeps track of 500 stocks. Table 4.5 classifies the stocks according to two criteria. Three hundred are from the New York exchange and 200 are from the American exchange. Two hundred are up, 100are unchanged, and 200 are down,
  • 87.
    78 NYSE AMEX Total PROBABILITY [CHAP. 4 up Unchanged Down Total 50 75 175 300 150 25 25 200 200 100 200 500 Table 4.5 Pair A, B A, c A, D B, c B,D c,D Mutually exclusive Common outcomes Yes None No No No Ace of hearts No Yes None Jack, queen, and king of hearts Jack, queen, and king of spades and clubs Ace of clubs and ace of spades If one of these stocks is randomly selected find the following. (a) The joint probability that the selected stock is from AMEX and unchanged. (b)The marginal probability that the selected stock is from NYSE. (c) The conditional probability that the stock is unchanged given that it is from AMEX. Am. (a) There are 25 from the AMEX and unchanged. The joint probability is (h) There are 300 from NYSE. The marginal probability is (c) There are 200 from the AMEX. Of these 200, 25 are unchanged. The conditional probability is -=.I25 =.OS. =.60. 25 200 4.13 Twenty percent of a particular age group has hypertension. Five percent of this age group has hypertension and diabetes. Given that an individual from this age group has hypertension, what is the probability that the individual also has diabetes? Am. Let A be the event that an individual from this age group has hypertension and let B be the event that an individual from this age group has diabetes. We are given that P(A) = .20 and P(A and B ) = .05. We are asked to find P(B I A). P(B IA) = P(A and B) .OS R A ) .20 --- - - .25. MUTUALLY EXCLUSIVE EVENTS 4.14 For the experiment of drawing a card from a standard deck of 52, the following events are defined: A is the event that the card is a face card, B is the event that the card is an Ace, C is the event that the card is a heart, and D is the event that the card is black. List the six pairs of events and determine which are mutually exclusive. 4.15 Three items are selected from a production process and each is classified as defective or non- defective. Give the outcomes in the following events and check each pair to see if the pair is mutually exclusive. Event A is the event that the first item is defective, B is the event that there is exactly one defective in the three, and C is the event that all three items are defective. Am. A consistsof the outcomes DDD, DDN, DND, arid DNN. B consists of the outcomes DNN, NDN, and NND. C consists of the outcome DDD. (D represents a defective, and N represents a nondefective.)
  • 88.
    CHAP. 41 Pair Mutuallycxclusive A, B N0 A, C No B, c Yes PROBABILITY Common outcomes DNN DDD Nonc 79 Machine I Machine 2 5 195 15 585 DEPENDENT AND INDEPENDENT EVENTS 4.16 African-American males have a higher rate of hypertension than the general population. Let A represent the event that an individual is hypertensive and let B represent the event that an individual is an African-American male. Are A and B independent or dependent events? Am. To say that African-American males have a higher rate of hypertension than the general population means that P(A IB) > P(A). Since P(A I B) # P(A), A and B are dependent events. 4.17 Table 4.6 gives the number of defective and nondefective items in samples from two different machines. Is the event of a defective item being produced by the machines dependent upon which machine produced it? Table 4.6 I I Number defective I Number nondcfectivc I Am. Let D be the event that a defective item is produced by the machines, Let M,be the event that the item is produced by machine 1, and let M2 be the event that the item is produced by machine 2. P(D) = =.025, P(D IM,) = 2oo =.025, and P(D IM2) = 6oo =.025, and the event of producing a defective item is independent of which machine produces it. 20 5 15 COMPLEMENTARYEVENTS 4.18 The probability that a machine does not produce a defective item during a particular shift is .90. What is the complement of the event that a machine does not produce a defective item during that particular shift and what is the probability of that complementary event? Ans. The complementary event is that the machine produces at least one defective item during the shift, and the probability that the machine produces at least one defective item during the shift is 1 - .90= .10. 4.19 Events El, E2,and E3 have the following probabilities of occurrence: P(EI)= .OS, P(E2) = SO, and P(E3) = .99. Find the probabilities of the complements of these events. Am. P( El') = 1 - P(E,) = 1 - .05 = .95 P(E3' ) = 1 - P(E3) = 1 - .99 = .01 P(E2) = 1 - P(E2)= I - .SO= .SO MULTIPLICATION RULE FOR THE INTERSECTION OF EVENTS 4.20 If one card is drawn from a standard deck, what is the probability that the card is a face card? If two cards are drawn, without replacement, what is the probability that both are face cards? If five cards are drawn, without replacement, what is the probability that all five are face cards?
  • 89.
    80 Low IQ HighIQ Low creativity 75 30 High creativity 20 125 L PROBABILITY [CHAP. 4 Ans. Let El be the event that the first card is a face card, let E2 be the event that the second drawn card is a face card, and so on until Es represents the event that the fifth card is a face card. The probability that the first card is a face card is P(EI)= i2 =.231. The event that both are face cards 12 1 1 132 - is the event El and Ez,and P(EI and E*)= P(EI)P(E2 I El) = x lFT = -- .050. The event that all five are face cards is the event Eland El nnd E3 and E4 arid ES. The probability of this event is given by P(EI)RE2 I El) P(E.1I Et, E21 P(E4 I El, E2, Ed P(Es I El, Ez, Ej, Ed, or - .000305. 12 I 1 1 0 9 8 95040 - x - x - x - x - = L - 92 51 so 4Y 48 311,87S,200 2652 4.21 If 60 percent of all Americans own a handgun, find the probability that all five in a sample of five randomly selected Americans own a handgun. Find the probability that none of the five own a handgun. Ans. Let E, be the event that the first individual owns a handgun, E2 be the event that the second individual owns a handgun, E3 be the event that the third individual owns a handgun, E4 be the event that the fourth individual owns a handgun, and E5 be the event that the fifth individual owns a handgun. The probability that all five own a handgun is P(EI and E2 and E3and E4 arid E?). Because of the large group from which the individuals are selected, the events El through Es are independent and the probability is given by P(EI)P(E2) P(E3) P(E4)P(E5)= (.6)5= .078. Similarly, the probability that none of the five own a handgun is (.4)’= .010. ADDITION RULE FOR THE UNION OF EVENTS 4.22 Table 4.7 gives the IQ rating as well as the creativity rating of 250 individuals in a psychological study. Find the probability that a randomly selected individual from this study will be classified as having a high IQ or as having high creativity. Ans. Let A be the event that the selected individual has a high IQ, and let B be the event that the individual has high creativity. Then P(A) = 250 - .62,P(B) = 250 - .58,P(A and B) = = S O , and P(A or B) = P(A) + P(B) - P(A and B) = .62 + .58 - S O = .70. 15.5 - 14s - 4.23 The probability of event A is .25, the probability of event B is .lO, and A and B are inde- pendent events. What is the probability of the event A or B? Ans. Since A and B are independent events, P(A and B) = P(A) P(B) = .25 x .I0 = ,025.P(A or B) = P(A) +P(B) - P(A and B) = .25 + .I0- ,025= .325. BAYES’ THEOREM 4.24 Box 1 contains 30 red and 70 white balls, box 2 contains 50 red and 50 white balls, and box 3 contains 75 red and 25 white balls. The three boxes are all emptied into a large box, and a ball is selected at random. If the selected ball is red, what is the probability that it came from (a) box 1;(b)box 2; (c) box 3‘?
  • 90.
    CHAP. 4 3 PROBABILITY81 Region Northeast Midwest South West Ans. Let B 1be the event that the selected ball came from box 1, let B2be the event that the selected ball came from box 2, let B3be the event that the ball came from box 3, and let R be the event that the selected ball is red. Percentage of U.S. popu1ation Percentage of social security rccipients in the region 20 15 2s 10 35 12 20 11 (CI) We are asked to find P(Bl I R). The three-step procedure is as follows: Step 1: Step 2: Step 3: (b) We are asked to find P(B2IR). Step I: P(R and B2) = P(R IB2) P(B2)= Ex Step 2: P(R) = .S 17 Step 3: P(B2IR)= -= .323 so 1 = .I67 .I 67 .517 (c) We are asked to find P(B3IR). Step 1: P(R and B3) = P(R I B3) P(B2)= Step 2: P(R) = .5 I7 7s 1 x _1 = .250 ,250 .5I7 Step 3: P(B3I R)= -- - ,484 4.25 Table 4.8 gives the percentage of the U.S. population in four regions of the United States, as well as the percentage of social security recipients within each region. For the population of all social security recipients, what percent live in each of the four regions? Table 4.8 Am. Let B1 be the event that an individual lives in the Northeast region, B2 be the event that an individual lives in the Midwest region, B3be the event that an individual lives in the South, and B4 be the event that an individual lives in the West. Let S be the event that an individual is a social security recipient. We are given that P(BI)= .20, P(B2) = .25, P(B3) = .3S.P(B4)= .20, P(S IBI)= .15, P(S I B2) = .10, P(S 1 B3) = .12, and P(S I B4) = .11. We are to find P(B1 I S), P(B2 1 S), P(B3I S), and P(B4I S).P(S) is needed to find each of the four probabilities. P(S) = P(S I BI) P(B1)+ P(S I B2) w32) + P(S IBd P(B3) + P(S IB4) P(B4) P(S) = .I5x .20+ .I0 x .25 + .12 x .35 + .I 1 x .20= .I 19. This means that 1 I .9%of the population are social security recipients. The three-step procedure to find P(BI I S) is as follows: Step 1: P(BIarid S) = P(S I BI) P(BI)= .15 x .20 = .03 Step 2: P(S) = . I 19 .03 .1 19 Step 3: P(BII S) = -= .252
  • 91.
    82 PROBABILITY [CHAP.4 The three-step procedure to find P(B2 I S) is as follows: Step 2: Step I : P(B2 L i d S)= P(S IB2) P(B2) = . I 0 x .25 = .U25 P(S)= . I 19 .025 .I 19 Step 3: P(B2I S) = -= .210 The thrce-step procedure to find P(B3IS) is as follows: Step 2: P(S)= . 1 19 Step 1: P ( B I ~ d S ) = P ( S I B j ) P ( B 3 ) = . 1 2 x , 3 5 = . 0 4 2 .U42 .I 19 Step 3: P(B3I S) = -= .353 The three-step procedure to find P(B4I S) is as follows: Step 1: P(B d and S)= P(S I Bq)P(Bq)= .I 1 x .20= .022 Step 2: P(S)= .I19 Step 3: P(Bj IS)= -= .185 .022 .I 19 We can conclude that 25.2% of the social security recipients are from the Northeast, 21.0% are from the Midwest, 35.3% are from the South, and 18.5%are from the West. PERMUTATIONS AND COMBINATIONS 4.26 Evaluate the following: (a)C& (b)CE; (c) PE ;(d)P;. Am. Each of the four parts uses the fact that O! = 1. n! I x n! --- n! ( ( 1 ) c"- - - I , since the n! in the numerator and denominator divide out. () - o!(n - o)! n! ( t l ) c;: = - -= 1, since O! is equal to one and the n! divides out of top and bottom. - n! n!(n - n)! n!O! n! n! (n-O)! n! ( C ' ) PI=------- - -1. n! n! n! (n-n)! O! 1 ----- (d> Pi: = - - - - n! 4.27 An e-wc-tizwngur at the racetrack is a bet where the bettor picks the horses that finish first and second. A trff;ec.tn izwger is a bet where the bettor picks the three horses that finish first, second, and third. (cl) In a 12-horse race, how many exactas are possible? (b)In a 12-horse race, how many trifectas are possible? Am. Since the finish order of the horse is important, we use permutations to count the number of possib1e se1ections. ( ( 1 ) The number of ordered ways you can select two horses from twelve is 12! 12! 12xllx10! =12~11=132 - pi?= --= (12-2)! IO! 10!
  • 92.
    CHAP. 41 PROBABILITY83 (b) The number of ordered ways you can select three horses from twelve is 17 12! 12! 12X11X10X9! = 12x11x10= 1320 -- p3-= ~ - - - (12-3)! 9! 9! 4.28 A committee of five senators is to be selected from the U.S. Senate. How many different committees are possible? Ans. Since the order of the five senators is not important, the proper counting technique is combina- tions. lOOx 99 x 98x 97 x 96x 95! 120x 95! lOOx 99 x 98 x 97 x 96 I20 = 75,287,520. - - - - loo! 5!( 100-5)! c:" = There are 75,287,520 different committees possible. USING PERMUTATIONS AND COMBINATIONSTO SOLVE PROBABILITY PROBLEMS 4.29 4.30 Twelve individuals are to be selected to serve on a jury from a group consisting of 10females and 15 males. If the selection is done in a random fashion, what is the probability that all 12 are males? Ans. Twelve individuals can be selected from 25 in the followingnumber of ways: ~ 2 5 - 25! - 2 5 x 2 4 ~ 2 3 ~ 2 2 ~ 2 1 ~ 2 0 ~ 1 9 x l 8 x l 7 x l 6 x 15x 1 4 x 13! - 12 - 12!13! 12!13! After dividing out the common factor, 13!, we obtain the following. 25 x 24 x 23 x 22 x 21x 20 x I9 x I8 x 17x I 6x 15 x 14 12x 1 1 x 1 O x 9 x 8 x 7 x 6 x 5 x 4 x 3 x 2 x 1 c::= = 5,200,300 The above fraction may need to be evaluated in a zigzag fashion. That is, rather than multiply the 12 terms on top and then the 12 terms on bottom and then divide, do a multiplication, followed by a division, followed by a multiplication,and so on until all terms on top and bottom are accounted IS! 15x 14x 13x 12! 15x 14x 13 = 455 - - for. The jury can consist of all males in cl: = - - - 12!3! 12!3! 6 - .000087. That is, there are about 9 chances ways. The probability of an all-malejury is 5,200,300 - out of 100,000that an all-malejury would be chosen at random. 45s The five teams in the western division of the American conference of the National Football League are: Kansas City, Oakland, Denver, San Diego, and Seattle. Suppose the five teams are equally balanced. (a) What is the probability that Kansas City, Seattle, and Denver finish the season in first, second, and third place respectively? (b)What is the probability that the top three finishers are Kansas City, Seattle, and Denver? Ans. (a) Since the order of finish is specified, permutations are used to solve the problem. There are P3=-= 60 different ordered ways that three of the five teams could finish the season in first, second, and third place in the conference. The probability that Kansas City will finish first, Seattle will finish second, and Denver will finish third is 6o = ,017. 5 ! 2! I
  • 93.
    84 PROBABILITY [CHAP.4 ( 0 ) Since the order of finish for the three teams is not specified, cornbinations are used to solve the problem. There are C:= -= 10 combinations of three teams that could finish in the top three. The probability that the top three finishers are Kansas City, Seattle, and Denver is A = .10. S! 3!2! Supplementary Problems EXPERIMENT, OUTCOMES, AND SAMPLE SPACE 4.31 An experiment consists of using a 25-question test instrument to classify an individual as having either a type A or a type B personality. Give the sample space for this experiment. Suppose two individuals are classified as to personality type. Give the sample space. Givc the sample space for three individuals. Am. For one individual, S = {A,B},where A rneans the individual has a type A personality. and B means the individual has a type B personality. For two individuals, S = {AA, AB,BA,BB},where AB,for example, is the outcome that the first individual has a type A personality and the second individual has a type B personality. For three individuals, S = {AAA,AAB,ABA,ABB,BAA,BAB,BBA,BBB},where ABA is the outcome that the first individual has a type A personality, the second has a type B personality, and the third has a type A. 4.32 A t a roadblock, state troopers classify drivers as either driving while intoxicated, driving while impaired, or sober. Give the sample space for the classification of one driver. Givc the sample space for two drivers. How many outcomes are possible for three drivers'? Ans. Let A be the event that a driver is classified as driving while intoxicated, let B be the event that a driver is classified as driving while impaired, and let C be the event that a driver is classified as sober. The sample space for one driver is S = {A, B,C). The sample space for two drivers is S = {AA, AB,AC, BA,BB,BC,CA,CB,CC) The sample space for three drivers has 27 possible outcomes. TREE DIAGRAMS AND THE COUNTING RULE 4.33 An experiment consists of inspecting four items selected from a production line and classifying each one as defective, D, or nondefective, N.How many branches would a tree diagram for this experiment have'? Give the branches that have exactly one defective. Give the branches that have exactly one nondefective. Am. The tree would have 2' = 16 branches which would represent the possible outcomes for the experiment. The branches that have exactly one defective are DNNN, NDNN, NNDN, and NNND. The branches that have exactly one nondefective are NDDD, DNDD, DDND. and DDDN. 4.34 An experiment consists of selecting one card from a standard deck, tossing a pair of dice, and then flipping a coin. How many outcomes are possible for this experiment? Am. According to the counting rule, there are 52 x 36 x 2 = 3,744 possible outcomes.
  • 94.
    CHAP. 41 PROBABILITY85 EVENTS, SIMPLE EVENTS, AND COMPOUND EVENTS 4.35 An experiment consists of rolling a single die. What are the simple events for this experiment'? Ans. The simple events are the outcomes { 11, { 21, { 3 } , (41, { 5 ) ,and (61, where the number in braces represents the number on the turned up face after the die is rolled. 4.36 Suppose we consider a baseball game between the New York Yankees and the Detroit Tigers as an experiment. Is the event that the Tigers beat the Yankees one to nothing a simple event or a compound event? Is the event that the Tigers shut out the Yankees a simple event or a compound event? (A shutout is a game in which one of the teams scores no runs.) Ans. The event that the Tigers shut out the Yankees one to nothing is a simple event because it represents a single outcome. The event that the Tigers shut out the Yankees is a compound event because it could be a one to nothing shutout, or a two to nothing shutout, or a three to nothing shutout, etc. PROBABILITY 4.37 Which of the following are permissible values for the probability of the event E? Am. (a)and (c)are permissible values, since they are between 0 and 1 inclusive. (b)is not permissible because it exceeds one. (d)is not permissible because it is negative. 4.38 An experiment is made up of three simple events A, B, and C. If P(A) = x, P(B) = y, and P(C) = z, and x +y + z = 1, can you be sure that a valid assignmentof probabilities has been made? Am. No. Suppose x = .75, y = .75, and z = -S, for example. Then x + y + z = 1, but this is not a valid assignment of probabilities. CLASSICAL, RELATIVE FREQUENCY, AND SUBJECTIVE PROBABILITY DEFINITIONS 4.39 If a U.S. senator is chosen at random, what is the probability that he/she is from one of the 48 contiguous states? Ans. The are 96 senators from the 48 contiguous states and a total of 100 from the 50 states. The 96 probability is = .96. 4.40 In an actuarial study, 9,875 females out of 10,000females who are age 20 live to be 30 years old. What is the probability that a 20-year-old female will live to be 30 years old? 988. - 9,875 - Ans. Using the relative frequency definition of probability, the probability is 4.41 Casino odds for sporting events such as football games, fights, etc. are examples of which probability definition? Ans. subjective definition of probability MARGINAL AND CONDITIONAL PROBABILITIES 4.42 Table 4.9 gives the joint probability distribution for a group of individuals in a sociological study.
  • 95.
    86 High school graduate N0 Yes PROBABILITY Welfarerecipient No Yes .I5 .2s .55 .OS [CHAP. 4 Table 4.9 4.43 ( ( I ) Find the marginal probability that an individual in this study is a welfare recipient. ( h ) Find the marginal probability that an individual in this study is not a high school graduate? ( c ) If an individual is a welfare recipient, what Is the probability that he/she is not a high school graduatc? (tl) It'an individual is a high school graduate, what is the probability that he/she is a welfare recipient? .LJ Airs. ( c l ) .25 + .OS= .30 (c) -= .83 .30 ( h ) .IS+ .25 = .40 .05 .60 (d) -= .083 Sixty percent of the registered voters in Douglas County are Republicans. Fifteen percent of the registered voters in Douglas County are Republican and have incomes above $250,000 per year. What percent of the Republicans who are registered voters in Douglas County have incomes above $250,000 per year'! .I 5 -= .25. Twenty-five percent of the Republicans have incomes above $250,000. .60 A m . MUTUALLY EXCLUSIVE EVENTS 4.44 With reference to the sociological study described in problem 4.42, are the events (high school graduate} and { weltare recipient} mutually exclusive'? Atis. No, since 5% of the individuals in the study satisfy both events. 4.45 Do mutually exclusive events cover all the possibilities in an experiment'? A m . No. 'This is true only when the mutually exclusive events are also complementary. DEPENDENT AND INDEPENDENT EVENTS 4.46 I f two events are mutually exclusive, are they dependent or independent? A m If A and B are two nontrivial events (that is, they have a nonzero probability) and if they are rnutually exclusive, then P(A I B) = 0, since if B occurs, then A cannot occur. But since P(A) is positive, P(A I B) f P(A), and the events must be dependent. 4.47 Seventy-five percent of all Americans live in a metropolitan area. Eighty percent of all Americans consider themselves happy. Sixty percent of all Americans live in a metropolitan area and consider themselves happy. Are the events {lives in a metropolitan area) and {considers themselves happy) independent or dependent events'? Am. Let A be the event {live in a metropolitan area} and let B be the event {consider themselves P(A and B) .60 P(B) .80 -= .75 = P(A) , and hence A and B are independent events. - - happy}.P(A 1 B) = COMPLEMENTARY EVENTS 4.48 What is the sum of the probabilities of two complementary events?
  • 96.
    CHAP. 41 PROBABILITY87 Ans. one 4.49 The probability that a machine used to make computer chips is out of control is .001. What is the complement of the event that the machine is out of control and what is the probability of this event? Ans. The complementary event is that the machine is in control. The probability of this event is .999. MULTIPLICATION RULE FOR THE INTERSECTION OF EVENTS 4.50 In a particular state, 20 percent of the residents are classified as senior citizens. Sixty percent of the senior citizens of this state are receiving social security payments. What percent of the residents are senior citizens who are receiving social security payments? A m Let A be the event that a resident is a senior citizen and let €3 be the event that a resident is receiving social security payments. We are given that P(A) = .20 and P(B 1 A) = .60. P(A and B) = P(A) P(B I A) = .20 x .60 = .12. Twelve percent of the residents are senior citizens who are receiving social security payments. 4.51 If El ,E*, . . . ,E, are n independent events, then P(EI and E2 and E3 . . .and E,) is equal to the product of the probabilities of the n events. Use this probability rule to answer the following. (a) Find the probability of tossing five heads in a row with a coin. (b) Find the probability that the face 6 turns up every time in four rolls of a die. (c) If 43 percent of the population approve of the president's performance, what is the probability that all 10individuals in a telephone poll disapprove of his performance. Ans. (a) (S)' = .03125 (6) ( )4 = .00077 (c) (.57)" = .00362 ADDITION RULE FOR THE UNION OF EVENTS 4.52 Events A and B are mutually exclusive and P(A) = .25 and P(B) = .35.Find P(A and B) and P(A or B). Ans. Since A and €3 are mutually exclusive, P(A and B) = 0. Also P(A or B) = .25 + .35= .60. 4.53 Fifty percent of a particular market own a VCR or have a cell phone. Forty percent of this market own a VCR. Thirty percent of this market have a cell phone. What percent own a VCR and have a cell phone'? Ans. P(A and B) = P(A) + P(B) - P(A or B) = .3 + .4 - .S = .2, and therefore 20% own a VCR and a cell phone. BAYES' THEOREM 4.54 In Arkansas, 30 percent of all cars emit excessive amounts of pollutants. The probability is 0.95 that a car emitting excessive amounts of pollutants will fail the state's vehicular emission test, and the probability is 0.15 that a car not emitting excessive amounts of pollutants will also fail the test. If a car fails the emission test, what is the probability that it actually emits excessive amounts of emissions? Ans. Let A be the event that a car emits excessive amounts of pollutants, and B the event that a car fails the emission test. Then, P(A) = .30,P(AC) = .70, P(B IA) = .9S,and P(B IA') = .1S. We are asked to find P(A IB). The three-step procedure results in P(A I B) = .73.
  • 97.
    88 PROBABILITY [CHAP.4 4.55 In a particular community, 15 percent of all adults over 50 have hypertension. The health service in this community correctly diagnoses 99 percent of all such persons with hypertension. The health service incorrectly diagnoses 5 percent who do not have hypertension as having hypertension. (a) Find the probability that the health service will diagnose an adult over 50 as having hypertension. (6) Find the probability that an individual over 50 who is diagnosed as having hypertension actually has hypertension. Am. Let A be the event that an individual over 50 in this community has hypertension, and let B be the event that the health service diagnoses an individual over 50 as having hypertension. Then, P(A) = .15, P(A') = .85, P(B IA) = .99, and P(B IA') = .05. (a) P(B) = .15 x -99+.85 x .05 = .19 (6) Using the three-step procedure, P(A IB) = .78 PERMUTATIONS AND COMBINATIONS 4.56 How many ways can three letters be selected from the English alphabet if: (0)The order of selection of the three letters is considered important, i.e., abc is different from cba, for example. (6) The order of selection of the three letters is not important? Ans. (a) 15,600 (6) 2,600 4.57 The following teams comprise the Atlantic division of the Eastern conference of the National Hockey League: Florida, Philadelphia, N.Y. Rangers, New Jersey, Washington, Tampa Bay, and N.Y. Islanders. (a) Assuming no teams are tied at the end of the season, how many different final standinps are possible (h) Assuming no ties, how many different first-, second-, and third-place finishers are possible? for the seven teams? Am. (a) 5,040 (6) 210 4.58 A criminologist selects five prison inmates from 30 volunteers for more intensive study. How many such groups of five are possible when selected from the 30? Ans. 142,506 USING PERMUTATIONS AND COMBINATIONS TO SOLVE PROBABILITY PROBLEMS 4.59 4.60 Three individuals are to be randomly selected from the 10 members of a club to serve as president, vice president, and treasurer. What is the probability that Lana is selected for president, Larry for vice president, and Johnny for treasurer? 1 I - A m . 1 - A= .00138 pio 720 A sample of size 3 is selected from a box which contains two defective items and 18 nondefective items. What is the probability that the sample contains one defective item? C:xCi8 - 2x153 - 306 C:O 1,140 1,140 Am. - -- --= .268
  • 98.
    Chapter 5 Outcome Valueof X Value of Y HH 0 2 HT 1 0 TH 1 0 r TT 2 -2 Discrete Random Variables RANDOM VARIABLE A random variable associates a numerical value with each outcome of an experiment.A random variable is defined mathematically as a real-valued function defined on a sample space, and is represented as a letter such as X or Y. EXAMPLE 5.1 For the experiment of flipping a coin twice, the random variable X is defined to be the number of tails to appear when the experiment is performed. The random variable Y is defined to be the number of heads minus the number of tails when the experiment is conducted. Table 5.1 shows the outcomes and the numerical value each random variable assigns to the outcome. These are two of the many random variables possible for this experiment. EXAMPLE 5.2 An experimental study involving diabetics measured the following random variables: fasting blood sugar, hemoglobin, blood pressure, and triglecerides. These random variables assign numerical values to each of the individuals in the study. The numerical values range over different intervals for the different random variables. DISCRETE RANDOM VARIABLE A random variable is a discrete random variable if it has either a finite number of values or infinitely many values that can be arranged in a sequence. We say that a discrete random variable may assume a countable number of values. Discrete random variables usually arise from an experiment that involves counting. The random variables given in Example 5.1 are discrete, since they have a finite number of different values. Both of the variables are associated with counting. EXAMPLE53 An experiment consists of observing 100 individuals who get a flu shot and counting the number X who have a reaction. The variable X may assume 101 different values from 0 to 100. Another experiment consists of counting the number of individuals W who get a flu shot until an individual gets a flu shot and has a reaction. The variable W may assume the values 1, 2, 3, . . .. The variable W can assume a countably infinite number of values. CONTINUOUS RANDOM VARIABLE A random variable is a continuous random variable if it is capable of assuming all the values in an interval or in several intervals. Because of the limited accuracy of measuring devices, no random 89
  • 99.
    90 DISCRETE RANDOMVARIABLES [CHAP. 5 Pfx) variables are truly continuous. However, we may treat random variables abstractly as being continuous. ~ 2 3 4 5 6 7 8 9 1 0 1 1 1 2 I .028 .056 .083 .111 .139 .167 .139 .111 .083 .056 .028 EXAMPLE 5.4 The following random variables are considered continuous random variables: survival time of cancer patients, the time between release from prison and conviction for another crime, the daily milk yield of Holstein cows, weight loss during a dietary routine, and the household incomes for single-parent households in a sociological study. PROBABILITY DISTRIBUTION The probability distribution of a discrete random variable X is a list or table of the distinct numerical values of X and the probabilities associated with those values. The probability distribution is usually given in tabular form or in the form of an equation. EXAMPLE55 Table 5.2 lists the outcomes and the values of X, the sum of the up-turned faces for the experiment of rolling a pair of dice. Table 5.2 is used to build the probability distribution of the random variable X. This table lists the 36 possible outcomes. Only one outcome gives a value of 2 for X. The probability that X = 2 is 1 divided by 36 or .028 when rounded to three decimal places. We write this as P(2)= .028. The probability that X = 3,P(3), is equal to 2 divided by 36 or .056. The probability distribution for X is given in Table 5.3. Value of X 2 3 4 5 6 7 3 4 5 6 7 8 5.2 Value of X 4 5 6 7 8 9 5 6 7 8 9 10 Table 5.3 Outcome Value of X The probability distribution, P(x)= P(X = x) satisfiesformulas (5. I) and (5.2). P(x) 20 for each value x of X (5.0 (5.2) Notice that the values for P(x) in Table 5.3 are all positive, which satisfies formula ( 5 4 , and X P(x)= 1 where the sum is over all values of X that the sum equals 1 except for rounding errors. X EXAMPLE5.6 P(x) = -, x = 1,2,3,4 is a probability distribution since P(I) = .I, P(2) = .2, P(3)= 3,and P(4) = .4 and (5.I ) and (5.2)are both satisfied. 10
  • 100.
    CHAP. 51 DISCRETERANDOM VARIABLES 91 X P(x) xP(x) EXAMPLE 5.7 It is known from census data that for a particular income group that 10% of households have no children, 25% have one child, 50%have two children, 10% have three children, and S% have four children. If X represents the number of children per household for this income group, then the probability distribution of X is given in Table 5.4. 2 3 4 5 6 7 8 9 10 11 12 .028 .OS6 .083 . I 1I .139 .167 .I39 .I 1 1 .083 .OS6 .028 .056 .I68 .332 .555 .834 1.169 1.1 12 .999 .830 .616 .336 Table 5.4 1 x 1 0 1 2 3 4 1 I P(x) I .I0 .25 S O .I0 .05 I The event X 2 2 is the event that a household in this income group has at least two children and means that X = 2, or X = 3, or X = 4. The probability that X 1 2 is given by P(X 22) = P(X = 2) +P(X = 3) + P(X = 4) = .so+ .I0+ .OS = .6S The event X I I is the event that a household in this income group has at most one child and is equivalent to X = 0, or X = 1. The probability that X I1 is given by P(X I1) = P(X = 0) + P(X = I ) = .I0+ .25 = .3S The event 1 I X I 3 is the event that a household has between one and three children inclusive and is equivalent to X = I, or X = 2, or X = 3. The probability that I IX I 3 is given by P( 1 I X 5 3 ) = P(X = I ) + P(X = 2) + P(X = 3 ) = .25 + S O + .I0 = .85 The above discussion may be summarized by stating that 65% of the households have at least two children, 35% have at most one child, and 85% have between one and three children inclusive. MEAN OF A DISCRETE RANDOM VARIABLE The rneatt o f a discrete random variuble is given by p =X x P(x) where the sum is over all values of X (5.-3) The mean of a discrete random variable is also called the expected value, and is represented by E(X). The mean or expected value will also often be referred to as the populatiorz meurz. Regardless of whether it is called the mean, the expected value, or the population mean, the numerical value is given by formula (5.3). EXAMPLE 5.8 The mean of the random variable in Example 5.5 is found as follows. The sum of the row labeled xP(x) is the mean of X and is equal to 7.007. If fractions are used in place of decimals for P(x),the value will equal 7 exactly. In other words E(x) = 7. The long-term average value for the sum on the dice is 7. The mean value of the sum on the dice for the population of all possible rolls of the dice equals 7. If you were to record all the rolls of the dice at Las Vegas, the average value would equal 7. EXAMPLE 5.9 The mean number of children per household for the distribution given in Example 5.7 is found as follows.
  • 101.
    92 P(x) xP(x) DISCRETE RANDOM VARIABLES .I0.25 .50 .10 .05 0 .25 1.00 .30 .20 [CHAP. 5 X P(x) X P W 0 0.0625 0.0 I 0.2s 0.25 2 0.375 0.75 3 0.25 0.75 4 0.0625 0.25 Total 1.0 p = 2 1 x 1 0 1 2 3 4 1 X-p (x - N2 (x - j$P( x) -2 4 0.25 -I 1 0.2s 0 0 0.0 I 1 0.25 2 4 0.25 C(X - p) = 0 Var(x) = I X P(X) 0 0.0625 1 0.25 2 0.375 3 0.25 4 0.0625 ‘Total 1.o The sum of the row labeled xP(x), 1.75, is the mean of the distribution. For this popudion of households, the mean number of children per household is I .75. Notice that the mean does not have to equal a value assumed by the random variable. XP(X) X2P(X) 0.0 0.0 0.25 0.25 0.75 1.5 0.75 2.25 0.2s 1.o / . I = 2.0 c XZP(X) = 5 STANDARD DEVIATION OF A DISCRETE RANDOM VARIABLE The wriwic‘e ofa discrete rcrrzcloriz vuriable is represented by 0’ and is defined by d = x(x - p)”(x) (5.4) The variance is also represented by Var(X) and may be calculated by the alternative formula given by The standurd deviution o f a discrete random variable is represented by or sd(X) and is given EXAMPLE 5.10 A social researcher is interested in studying the family dynamics caused by gender makeup. The distribution of the number of girls in families consisting of four children is as follows: 6.2S% of such families have no girls, 2570 have one girl, 37.5% have two girls, 25% have three girls, and 6.25% have four girls. Table 5.5 illustrates the cornputation of the variance of X, the number of girls in a family having four children. The standard deviation is the square root of the variance and equals one. Table 5.6 illustrates the computation of the components needed when using the alternative formula (5.5)to compute the variance of X. Using formula (S.S), the variance is Var(x) = c x2P(x)- p2= 5 - 4 = 1, and the standard deviation is also equal to one. We see that formulas (5.4) and (5.5) give the same results for the standard deviation. The mean number of girls in families of four children equals 2 and the standard deviation is equal to 1 . Table 5.6
  • 102.
    CHAP. 51 DISCRETERANDOM VARIABLES 93 BINOMIAL RANDOM VARIABLE A binomial random variable is a discrete random variable that is defined when the conditions of a binomial experimertt are satisfied. The conditions of a binomial experiment are given in Table 5.7. I Conditions of a Binomial Exoeriment ~~~ ~ ~ I. There are n identical trials. 2. Each trial has only two possible outcomes. 3. The probabilities of the two outcomes remain constant for each trial. r 4. The trials are indeoendent. The two outcomes possible on each trial are called success and failure. The probability associated with success is represented by the letter p and the probability associated with failure is represented by the letter q, and since one or the other of success or failure must occur on each trial, p +q must equal one, i.e., p +q = 1. When the conditions of the binomial experiment are satisfied, the binomial random variable X is defined to equal the number of successes to occur in the n trials. The random variable X may assume any one of the whole numbers from zero to n. EXAMPLE 5.11 A balanced coin is tossed 10 times, and the number of times a head occurs is represented by X. The conditions of a binomial experiment are satisfied. There are n = 10 identical trials. Each trial has two possible outcomes, head or tail. Since we are interested in the occurrence of a head on a trial, we equate the occurrence of a head with success, and the occurrence of a tail with failure. We see that p = .5 and q = .5. Also, it is clear that the trials are independent since the occurrence of a head on a given toss is independent of what occurred on previous tosses. The number of heads to occur in the 10 tosses, X, can equal any whole number between 0 and 10. X is a binomial random variable with n = 10and p = .5. EXAMPLE 5.12 A balanced die is tossed five times, and the number of times that the face with six spots on it faces up is counted. The conditions of a binomial experiment are satisfied. There are five identical trials. Each trial has two possible outcomes since the face 6 turns up or a face other than 6 turns up. Since we are interested in the face 6, we equate the face 6 with success and any other face with failure. We see that p = and q = 5 . Also, the outcomes from toss to toss are independent of one another. The number of times the face 6 turns up, X, can equal 0, 1, 2, 3,4, or 5. X is a binomial random variable with n = 5 and p = .167. 6 6 EXAMPLE 5.13 A manufacturer uses an injection mold process to produce disposable razors. One-half of one percent of the razors are defective. That is, on average, 500 out of every 100,000razors are defective. A quality control technician chooses a daily sample of 100randomly selected razors and records the number of defectives found in the sample in order to monitor the process. The conditions of a binomial experiment are satisfied. There are 100 identical trials. Each trial has two possible outcomes since the razor is either defective or non- defective. Since we zre recording the number of defectives, we equate the occurrence of a defective with success and the occurrence of a nondefective with failure. We see that p = .005and q = .995. The number of defectives in the 100,X, can equal any whole number between 0 and 100. X is a binomial random variable with n = 100 and p = .005. BINOMIAL PROBABILITY FORMULA The binomial prubability formula is used to compute probabilities for binomial random variables. The binomial probability formula is given in n! x!(n - x)! pXq("-')for x = 0,I , . . . ,n (5.7)
  • 103.
    94 DISCRETE RANDOMVARIABLES [CHAP. 5 The symbol is discussed in Chapter 4 and represents the number of combinations possible when (ri x items are selected from n. EXAMPLE 5.14 A die is rolled three times and the random variable X is defined to be the number of times the face 6 turns up in the three tosses. X is a binomial random variable that assumes one of the values 0, I , 2, or 3. In this binomial experiment, success occurs on a given trial if the face 6 turns up and failure occurs if any other face turns up. The probability of success is p = = .833. In order to help understand why the binomial probability formula works, the binomial probabilities will he computed using the basic principles of probability first and then by using formula (5.7). To find P(0). note that X = 0 means that no successes occurred, that is, three failures occurred. Because of the independence of trials, the probability of three failures is .833 x 333 x .833 = (.833)' = 578. That is, the probability that the face 6 does not turn up on any of the three tosses is S78. To find P( 1). note that X = 1 means that one success and two failures occurred. One success and two failures occur if the sequence SFF, or the sequence FSF, or the sequence FFS occurs. The probability of SFF is .I67 x .833 x .833 = .I 159. The probability of FSF is .833 x .I67 x ,833 = , I 159. The probability of FFS is .833 x .833 x .I67 = . I 159. The three probabilities for SFF, FSF, and FFS are added because of the addition rule for mutually exclusive events. Therefore, P(X = 1) = P( I ) = 3 x . I 159 = .348. The probability that the face 6 turns up on one of the three tosses is .348. To find P(2), note that X = 2 means that two successes and one failure occurred. Two successes and one failure occur if the sequence FSS or the sequence SFS or the sequence SSF occurs. The probability of FSS is .833 x .I67 x .I67 = .0232. The probability of SFS is .I67 x .833 x .167 = .0232. The probability of SSF is .I67 x .I67 x ,833 = .0232. The probabilities for FSS, SFS, and SSF are added because of the addition rule for mutually exclusive events. Therefore, P(X = 2) = P(2) = 3 x .0232 = .070.The probability that the face 6 turns up on two of the three tosses is .070. To find P(3), note that X = 3 means that three successes occurred. The probability of three consecutive successes is .I67 x .I67 x .I67 = (. 167)-'= .005.There are five chances in a thousand of the face 6 turning up on each of the three tosses. = .I67 and the probability of failure is q = 6 Using the binomial probability formula, we find the four probabilities as follows: 3! O!3! P(0) = - (. 167)' (.833)-' = (.833)' = .578 3! 1!2! P( I ) = -(. 167)' (.833)2= 3 x .I67 x (.833)*= .348 3! 2!I! 3! 3!0! P(2) = -(.167)2 (.833)'= .070 P(3) = -(. 167)' (333)' = .00S This example illustrates how much work the binomial probability formula saves us when solving problems involving the binomial distribution. The distribution for the variable in Example 5.14 is given in Table 5.8. Table 5.8 X I 0 1 2 3 1 I P(x) I .578 .348 .070 .005 I In formula (5.7), The term pXq(" - " gives the probability of x successes and (n - x) failures. The counts the number of different arrangements which are possible for x successes and n! x!( n - x)! term (n - x) failures. The role of these terms is illustrated in Example 5.14.
  • 104.
    CHAP. 51 DISCRETERANDOM VARIABLES 95 ~~~~~~~ 4 0 . . . .I296 ,0625 .0256 . . . 1 . . . .3456 .2500 .I536 . . , 2 . . . .3456 .3750 .3456 . . . 3 . .. .1536 ,2500 .3456 . . . 4 . . . .0256 .0625 .I 296 . . . EXAMPLE515 Fifty-seven percent of companies in the U.S. ‘use networking to recruit workers. The probability that in a survey of ten companies exactly half of them use networking to recruit workers is I O! 5!5! P(5) = -(37)‘ (.43y = .223 TABLES OF THE BINOMIAL DISTRIBUTION Appendix 1 contains the table o f birzoinialprobabilities. This table lists the probabilities of x for n = 1 to n = 25 for selected values of p. EXAMPLE 5.16 If X represents the number of girls in families having four children, then X is a binomial random variable with n = 4 and p = .5. Using formula (5.7),the distribution of X is determined as follows: 4! 4! 4! 0!4! 1!3! 2!2! P(0) = -(S)’ (.S)4 = (.S)4 = .0625 P(1) = -(5)’($ = .2s P(2)= -(5)* (.S)* = .37s 4! 3!I! P(3) = -(.5y(3’ = .2s 4! 4!0! P(4) = -(.5)4(S)’ = .0625 Table 5.9 contains a portion of the table of binomial probabilities found in Appendix I . The numbers in bold print indicates the portion of the table from which the binomial probability distribution for X is obtained. The probabilities given are the same ones obtained by using formula (5.7). Table 5.9 t d I P n X * . . .40 S O .60 . . . EXAMPLE 5.17 Eighty percent of the residents in a large city feel that the government should allow more than one company to provide local telephone service. Using the table of binomial probabilities, the probability that at least five in a sample of ten residents feel that the government should allow more than one company to provide local telephone service is found as follows. The event “at least five” means five or more and is equivalent to X = 5 or X = 6 or X = 7 or X = 8 or X = 9 or X = 10.The probabilities are added because of the addition law for mutually exclusive events. P(X 2 5 )= P(5) + P(6) + P(7) + P(8) + P(9) + P(10) P(X 25 )= .0264 + .0881 + .2013 + .3020+ .2684 + ,1074 = .9936 Statistical software is used to perform binomial probability computations and to some extent has rendered binomial probability tables obsolete. Minitab contains routines for computing binomial probabilities. Example 5.18 illustrates how to use Minitab to compute the probability for a single value or the total distribution for a binomial random variable. EXAMPLE 5.18 The following Minitab output shows the binomial probability computations given in Examples 5.15 and 5.16. The binomial probabilities are shown in bold type.
  • 105.
    96 DISCRETE RANDOMVARIABLES [CHAP. 5 MTB ># Minitab computation for Example 5.I5 # MTB >pdf 5; SUBC > binomial n = 10p = .57. Probabi1ity Density Function Binomial with n = 10and p = 0.570000 x P(X= x) 5.00 0.2229 MTB ># Minitab computation for Example 5.16# MTB > pdf; SUBC > binomial n = 4 and p = .5. Probability Density Function Binomial with n = 4 and p = 0.500000 x P(X= x) 0 0.0625 I 0.2500 2 0.3750 3 0.2500 4 0.0625 MEAN AND STANDARDDEVIATION OF A BINOMIAL RANDOM VARIABLE The mean and variance for a binomial random variable may be found by using formulas (5.3)and (5.4).However, in the case of a binomial random variable, shortcut formulas exist for computing the mean and standard deviation of a binomial random variable. The mean of a binomial random variable is given by P= nP (5.8) The variance of a binomial random variable is given by d=npq (5.9) EXAMPLE 5.19 The mean for the binomial distribution given in Example 5.18 using formula (5.3)is p = C xP(x) = 0 x .0625+ 1 x .25 +2 x ,375 +3 x .25 +4 x .0625 = 2 The mean using the shortcut formula (5.8)is p = np = 4 x .5 = 2 The variance for the binomial distribution given in Example 5.18 using formula (5.4)is o2= C x2 P(x) - p2= 0 x .0625+ 1 x .25 +4 x .375 +9 x .25 + 16x .0625- 4 = 1 The variance using the shortcut formula (5.9)is EXAMPLE 5.20 Chemotherapy provides a 5-year survival rate of 80% for a particular type of cancer. In a group of 20 cancer patients receiving chemotherapy for this type of cancer, the mean number surviving after 5
  • 106.
    CHAP. 51 DISCRETERANDOM VARIABLES 97 years is 1= 20 x .8 = 16 and the standard deviation is (T = 420 x .8 x .2 = 1.8. On the average, 16 patients will survive, and typically the number will vary by no more than two from this figure. POISSON RANDOM VARIABLE The binomial random variable is applicable when counting the number of occurrences of an event called success in a finite number of trials. When the number of trials is large or potentially infinite, another random variable called the Poisson random variable may be appropriate. The Poisson probability distribution is applied to experiments with rarzdorn and iridependerit occurrences of an event. The occurrences are considered with respect to a time interval, a length interval, a fixed area or a particular volume. EXAMPLE 5.21 The number of calls to arrive per hour at the reservation desk for Regional Airlines is a Poisson random variable. The calls arrive randomly and independently of one another. The random variable X is defined to be the number of calls arriving during a time interval equal to one hour and may be any number from 0 to some very large value. EXAMPLE 5.22 The number of defects in a 10-foot coil of wire is a Poisson random variable. The defects occur randomly and independently of one another. The random variable X is defined to be the number of defects in a 10-foot coil of wire and may be any number between 0 and some very large value. The interval in this example is a length interval of 10feet. EXAMPLE 5.23 The number of pinholes in 1-yd2pieces of plastic is a Poisson random variable. The pinholes occur randomly and independently of one another. The random variable X is defined to be the number of pin- holes per square yard piece and can assume any number between 0 and a very large value. The interval in this example is an area of 1 yd2. POISSON PROBABILITY FORMULA The probability of x occurrences of some event in an interval where the Poisson assumptions of randomness and independence are satisfied is given by formula (5.10),where h is the mean number of occurrences of the event in the interval and the value of e is approximately 2.71828.The value of e is found on most calculators and powers of e are easily evaluated. Tables of Poisson probabilities are found in many statistical texts. However, with the wide spread availability of calculators and statistical software, they are being less widely utilized. e-' P(x)= - for x X ! The mean of a Poisson random variable is given by p = h The variance of a Poisson random variable is given by o2= h EXAMPLE 5.24 The number of small pinholes in sheets =o, 1,2,. . . (5.10) (5.11) (5.12) of plastic are of concern to a manufacturer. If the number of pinholes is too large, the plastic is unusable. The mean number per square yard is equal to 2.5. The 1- yd2 sheets are unusable if the number of pinholes exceeds 6. The probability of interest is P(X > 6), where X
  • 107.
    98 DISCRETE RANDOMVARIABLES [CHAP. 5 represents the number of pinholes in a I-yd2 sheet. This probability is found by finding the probability of the complementary event and subtracting it from 1. P(X > 6) = 1 - P(X 5 6) The command cdf of Minitab can be used to find P(X 5 6). The following Minitab output illustrates how this is accomplished. MTB > cdf 6; SUBC > poisson mean = 2.5. Cumulative Distribution Function Poisson with mu = 2.5oooO x P(X<=x) 6.00 0.9858 P(X > 6) = 1 - P(X 5 6) = 1 - .9858 = .0142 By multiplying .0142 by 100we see that 1.42% of the plastic sheets are unusable. EXAMPLE 5.25 The Poisson distribution approximates the binomial distribution closely when n 2 20 and p 5 .05. A machine produces items of which 1% are defective. In a sample of 150items selected from the output of this machine, the probability of two or fewer defectives in the sample is P(X I 2) and is found by the command cdf 2; when using Minitab. The following Minitab output gives the binomial probability that X 5 2 . MTB > cdf 2; SUBC > binomial n = I SO p = .O I . Cumulative Distribution Function Binomial with n = I50 and p = 0.0100000 X P(X <= x) 2.oo 0.8095 The mean number of defectives in a sample of 150 is np = 150 x .01 = 1.5. The Poisson probability that X 5 2 where h = 1.S is given by the following Minitab output. MTB > cdf 2; SUBC > poisson mean = 1.5. Cumulative Distribution Function Poisson with mu = 1SO000 x P(X <= x) 2.00 0.8088 The binomial probability of the event X 5 2 is ,8095 and the Poisson approximation is ,8088. Notice that the Poisson approximation is very close to the binomial probability. HYPERGEOMETRIC RANDOM VARIABLE The hypergeometric rarzdom variable is used in situations where success or failure is possible on each trial but where there is not independence from trial to trial. The lack of independence from trial to trial distinguishes the hypergeometric distribution from the binomial distribution. The hyper- geometric random variable applies in situations where there are N items, of which k are classified as
  • 108.
    CHAP. 51 DISCRETERANDOM VARIABLES 99 successes and N -k are classified as failures. A sample of size n i k is selected from the N items and X is defined to equal the number of successes in the n items selected. X is a hypergeometric random variable which can equal any whole number from 0 to n. EXAMPLE 5.26 A sociologist randomly selects 5 individuals from a group consisting of 10 male single parents and IS female single parents. The random variable X is defined to equal the number of male single parents in the 5 selected individuals. In this example, N = 25, k = 10, N - k = 15,and n = 5. The number of male single parents in the 5 selected is a hypergeometric random variable. If this hypergeometric random variable is represented by X, then X may assume any one of the values 0, I , 2, 3,4,or 5. EXAMPLE 5.27 A box contains 5 defective and 25 acceptable computer monitors. Three of the monitors are randomly selected and X is defined to be the number of defective monitors in the three. X is a hypergeometric random variable with N = 30, k = 5, N - k = 25, and n = 3. X may assume any one of the values 0, 1, 2, or 3. HYPERGEOMETRICPROBABILITY FORMULA When n items are selected from N items of which k are successes and N - k are failures, the random variable X, defined to equal the number of successes in the n selected items, is a hypergeometric random variable. The probability distribution of X is given by the liypergemietric probabilityformula shown in formula (5.13). k! (N - k)! x!(k - x)! (n - x)!(N - k - n +x)! - f n r y - f l , I , . . . , n (5.13) (N P(x) = (4 n!(N - n)! EXAMPLE 5.28 A police department consists of 25 officers of whom 5 are minorities. Three officers are randomly selected to meet with the mayor. Let X be the number of minorities in the three selected to meet with the mayor. X is a hypergeometric random variable with N = 25, k = 5, N - k = 20, and n = 3. The probability distribution of X is derived as follows: The probability distribution of X is given in Table 5.10.It is highly likely that at most one of the three will be a minority. The probability that X I1 is .909. Table 5.10 1 x 1 0 1 2 3 1 I P(x) I .496 .413 .087 .004 1 EXAMPLE 5.29 The binomial distribution approximates the hypergeometric distribution whenever n I.05N. A box contains 200 computer chips, of which 7 are defective. The probability of finding one defective in a sample of S randomly selected chips is given by the following hypergeometric probability computation.
  • 109.
    100 I Outcome BBB BBG BGB BGG GBB GBG GGB GGG DISCRETE RANDOM VARIABLES[CHAP. 5 = .155 I,I ,Jx 4 J - 7 x 56,031,760 (200) 2,535,650,040 P(I) = - Since n 5 -05 x 200 = 10, the probability may be approximated by using the binomial distribution. The five selections of the computer chips may be viewed as n = 5 trials. The probability of success, selecting a defective chip, is p = 7= .035,and q = .965.The binomial probability of one defective in the five chips is given as foIlows: 200 J. p(I)=- (.035)' (.965)4=.152 1!4! The approximation is very good when n 5 .05N, as shown in this example. Solved Problems RANDOM VARIABLE 5.1 Let X represent the number of boys in families having three children. List all possible birth order permutations for families having three children and give the value of X for each outcome. Ans. Table 5.1 1 gives the outcomes and the value X assigns to each outcome. 5.11 Value of X 3 2 2 1 2 1 I 0 5.2 For the experiment of rolling three dice, X is defined to be the sum of the three dice. What are the unique values assumed by X? Ans. outcome (6,6,6).The unique values are 3,4,5,6,7, 8,9, 10, 1 I, 12, 13, 14, 15, 16, 17, and 18. The values for X range from 3, corresponding to the outcome ( I , I , I ) to 18, corresponding to the DISCRETE RANDOM VARIABLE 5 . 3 A telemarketing company administers an aptitude test consisting of 25 problems to potential employees. A variable of interest to the company is X, the number of problems worked correctly. How many different values are possible for X?
  • 110.
    CHAP. 51 DISCRETERANDOM VARIABLES X P(x) 101 3 6 9 12 15 18 21 .09 .13 .16 .21 .26 .13 .02 Am. 26, since an individual can get from 0 to 25 correct. 5.4 A random variable assigns a single number to each outcome of an experiment. Is it true that each value of a random variable corresponds to a single outcome? Ans. No. Consider the outcomes and values of X given in Table 5.11. The value X = I corresponds to the three outcomes BGG, GBG, and GGB for example. CONTINUOUS RANDOM VARIABLE 5.5 Identify the continuous random variables in parts (a) through (e). (a) The time that an individual is logged onto the internet during a given week (b) The number of domestic violence calls responded to per day by the Chicago police (c) The mortality rate for women who used estrogen therapy for at least a year starting in 1969 (d)The daily room rate for luxury/upscale hotels in the U.S. (e) The number of executions per state since 1975 department Ans. Parts (a) and (d)are continuous random variables. Time and money are almost always considered continuous even though in practice they are probably discrete. 5.6 Is it possible to give a probability value to each individual value of a continuous random variable? Ans. No. It is not possible to give an individual probability to each value of a continuous variable since the variable may assume an uncountably infinite number of different values. Instead, probabilities are assigned to an interval of values for a continuous random variable. PROBABILITY DISTRIBUTION 5.7 According to the registrar’s office at the University of Nebraska at Omaha (UNO), during the current semester, 9% of the students are registered for 3 credit hours, 13% are registered for 6 credit hours, 16% are registered for 9 credit hours, 21% are registered for 12 credit hours, 26% are registered for 15credit hours, 13% are registered for 18credit hours, and 2% are registered for 21 credit hours. If the random variable X represents the number of credit hours per student at UNO, give the probability distribution for X. Ans. The probability distribution for X is given in Table 5.12. Table 5.12 5.8 An experiment consists of rolling a die and flipping a coin. The coin has the number 1 stamped on one side and the number 2 stamped on the other side. The random variable Y is defined to equal the sum of the number showing on the coin plus the number showing on the die after the experiment is conducted. Give the probability distribution for Y.
  • 111.
    102 Outcome (coin= l,die= 1) (coin= I , die = 2) (coin = I , die = 3) (coin = I , die = 4) (coin = 1, die = 5 ) (coin = 1, die = 6) (coin = 2, die = 1) (coin = 2, die = 2) (coin = 2, die = 3) (coin = 2, die = 5 ) (coin = 2, die = 6) (coin = 2, die = 4) DISCRETE RANDOM VARIABLES [CHAP. 5 Value of Y 2 3 4 5 6 7 3 4 5 6 7 8 Am. Table 5.13 gives the outcomes for the experiment as well as the values the random variable Y assigns to the outcomes. From Table 5.13, the probability distribution given in Table 5.14 is determined. Y = 2 corresponds to one of twelve equally likely outcomes. Therefore P(2) equals -L = .083.Y = 3 corresponds to two of twelve equally likely outcomes. Therefore P(3) equals A = .167. The other probabilities are found in a similar fashion. 12 Table 5.14 l v 1 2 3 4 5 6 7 S I I P(y) I .083 .I67 .I67 .I67 .167 .167 .083 I MEAN OF A DISCRETE RANDOM VARIABLE 5.9 5.10 A roulette wheel has 18 red, 18 black, and 2 green slots. You bet $10 on red. If red comes up, you get your $10 back plus $I0 more. If red does not come up, you lose your $10.Let random variable P represent your profit when playing this roulette wheel. Find the mean value of P. Ans. The distribution of P is given in table 5.15. The probability of a $10 loss, i.e., P = -10, is 2 = 326, and the probability of a $10gain is 38 = .474 Table 5.15 The mean profit is p = C xP(x) = -10 x .526 + 10x .474 = -.52. Your average loss is 52 cents per play of the roulette wheel. The distribution of the number of children per household for households receiving Aid to Dependent Children (ADC) in a large eastern city is as follows: Five percent of the ADC households have one child, 35% have 2 children, 30% have 3 children, 20% have 4 children, and 10%have 5 children. Find the mean number of children per ADC household in this city. Ans. The mean is p = CxP(x) = 1 x .05 + 2 x .35 + 3 x .30 + 4 x .20 + 5 x .I0 = 2.95. The mean is about 3 per household.
  • 112.
    CHAP. 51 DISCRETERANDOM VARIABLES 103 STANDARD DEVIATION OF A DISCRETE RANDOM VARIABLE 5.11 Find the standard deviation of the profit when playing the color red on the roulette wheel described in problem 5.9. Arts. The variance is given by E x2P(x)- w2 = 100 x .526 + 100 x ,474 - .2704 = 99.7296, and the standard deviation is JK = $9.97. 5.12 Find the standard deviation of the number of children per ADC household for the distribution given in problem 5.10. Ans. The variance is given by C x2P(x) - p2= 1 x .05 +4 x .35 +9 x .30 + 16 x .20 +25 x .I0- 2.952= 1,1475and the standard deviation is 41.1475 = 1.07. BINOMIAL RANDOM VARIABLE 5.13 Ninety percent of the residents of Stanford, California, twenty-five years of age or older have at least a bachelor’s degree. Three hundred residents of Stanford twenty-five years or older are selected for a poll concerning higher education. If X represents the number in the 300 who have at least a bachelor’s degree, give the conditions necessary for X to be a binomial random variable and identify n, p, and q. Ans. The main condition we need to be concerned about is that the 300 residents be selected independently. There are 300 identical trials and each trial has only two possible outcomes. We may identify success with having at least a bachelor’s degree and failure with having less than a bachelor’s degree. If the respondents are chosen randomly and independently, then the probabilities of success and failure should remain constant. Pollsters use standard techniques to ensure independence of individuals selected. The values of n, p, and q are 300, .90, and .10 respectively. 5.14 A box contains 20 items, of which 25% are defective. Three items are randomly selected and X, the number of defectives in the three selected items, is determined. Explain why X is not a binomial random variable with n = 3 and p = .25. Ans. The probabilities p and q do not remain constant from trial to trial. Suppose we are interested in the probability that X = 3; that is, we are interested in the probability of getting three consecutive defectives. The probability that the first one is defective is .25. The probability that the second is defective after selecting a defective on the first selection is = .21, and the probability that the third is defective after selecting defectives on the first two selections is A = .17. The binomial model assumes that the probability p remains constant at p = .25. In this experiment, p does not remain constant, but changes from .25 to .21 to .17. BINOMIAL PROBABILITY FORMULA 5.15 Approximately 12% of the U.S.population is composed of African-Americans. Assuming that the same percentage is true for telephone ownership, what is the probability that when 25 phone numbers are selected at random for a small survey, that 5 of the numbers belong to an African-American family?
  • 113.
    104 DISCRETE RANDOMVARIABLES [CHAP. 5 Ans. Let X represent the number of phone numbers in the 25 belonging to African-Americans. Then, X has a binomial distribution with n = 25 and p = .12. The probability P(X = 5 ) is given as follows: 25! 5!20! P(X = 3) = -(.12)’ (.88)*O = .I025 5.16 It is estimated that 42% of women ages 45 to 54 are overweight. If 20 females between 45 and 54 are randomly selected, what is the probability that one-half of them are overweight? Ans. Let X represent the number of women in the 20 who are overweight. Then, X has a binomial distribution with n = 20 and p = .42.The probability P(X = 10)is given as follows: 20! 10!10! P(X = 10)= -(.42)1°(.58)*O = .1359 TABLES OF THE BINOMIAL DISTRIBUTION 5.17 Sixty percent of teenagers who drink alcohol do so because of peer pressure. Use the table of binomial probabilities to find the probability that in a sample of 15 teenagers who drink, 5 or fewer do so because of peer pressure. Ans. Table 5.16 shows the portion of the table needed to compute P(X 5 5). P(X 55) = .OOOO + .0o00 + .0003 + .0016+ .0074 + .0212 = .0305 5.18 A domestic homicide is one in which the victim and the killer are relatives or involved in a relationship. Suppose 40% of all murders are domestic homicides. A criminal justice study randomly selects 10 murder cases for investigation. Use the table of binomial probabilities to find the probability that between one and four inclusive of the murder cases will be domestic homicide cases. Ans. Table 5.17 shows the portion of the table needed to compute P(1 5 X I 4 ) . P ( l IX54)=.0403+.1209+.2150+.2508=.6270 Table 5.17 = .40 .0403 .2150 4 .2508
  • 114.
    CHAP. 51 DISCRETERANDOM VARIABLES 105 MEAN AND STANDARD DEVIATION OF A BINOMIAL RANDOM VARIABLE 5.19 Seventy-five percent of employed women say their income is essential to support their family. Let X be the number in a sample of 200 employed women who will say their income is essential to support their family. What is the mean and standard deviation of X? Ans. X is a binomial random variable with n = 200 and p = .75. The mean is p = np = 200 x .75 = 150, and the standard deviation is CY = 6= = 6.12. 5.20 A binomial distribution has a mean equal to 8 and a standard deviation equal to 2. Find the values for n and p. Ans. The following equations must hold: 8 = np and 4 = npq. Substituting 8 for np in the second equation gives 4 = 8q, which gives q = .5. Since p +q = 1, p = 1 - .5 = .5. Substituting .5 for the first equation gives n(S) = 8, and it follows that n = 16. POISSON RANDOM VARIABLE 5.21 Consider the number of customers arriving at the Grover street branch of Industrial Federal Savings and Loan during a one-hour interval. What assumptions concerning arrivals of customers are necessary in order that X, the number of customers arriving in a hour interval, be a Poisson random variable? p in and the one Ans. The assumptions necessary are that the arrivals be random and independent of one another. 5.22 Why are the arrival of patients at a physician’s office and the arrival of commercial airplanes at an airport not Poisson random variables? Ans. Because of appointment times to see a physician, the arrivals are not random. Because of scheduled arrivals of commercial airplanes, the arrivals are not random. POISSON PROBABILITY FORMULA 5.23 The mean number of patients arriving at the emergency room of University Hospital on Saturday nights between 1O:OO and 12:OO is 6.5. Assuming that the patients arrive randomly and independently, what is the probability that on a given Saturday night, 5 or fewer patients arrive at the emergency room between 1O:OO and 12:00? Ans. Let X represent the number of patients to arrive at the emergency room of University Hospital 6.5xe-6.S between 1O:OO and 12:OO. The probability formula for X is P(x) = - .The probability of the X! event that X 5 5 is P(0) + P(I) + P(2) + P(3) + P(4) +P(5). Each of these probabilities contain the common term e4.5, which may be factored out to give the following as the probability of the event of interest. p(x 5 5 ),,“5(65O ; 65’ + 652 ; 65’ ; 654 ; 65’) O! I! 2! 3! 4! 5 ! P(X I 5 ) = .OO15(1 +6.5 +21.125 +45.7708 +74.7760 +96.6809) = .369
  • 115.
    106 DISCRETE RANDOMVARIABLES [CHAP. 5 The solution using Minitab is as follows. MTB > cdf 5 ; SUBC > Poisson 6.5. Cumulative Distribution Function Poisson with mu = 6.50000 x P(X<= x) 5.00 0,3690 5.24 When working problems involving the Poisson random variable, it is important to remember that the interval for the mean number of occurrences and the interval for X must be equal. If they are not, the mean should be redefined to make them equal. This problem illustrates this important point. The mean number of patients arriving at the emergency room of University Hospital on Saturday nights between 1O:OO and 12:OO is 6.5. Assuming that the patients arrive randomly and independently, what is the probability that on a given Saturday night, 2 or fewer patients arrive at the emergency room between 11:00and 12:00? Arts. Let X be the number of patients to arrive at the emergency room of University Hospital on 6.5 2 Saturday nights between 1 1 9 0 and 12:OO. The mean for X is -= 3.25, and the event of interest is X 5 2 . e-3.25 3.250 e-3.2S33.251 e-3.2S33.252 P(X 5 2 ) = + + = ,369 O! I! 2! HYPERGEOMETRIC RANDOM VARIABLE 5.25 A box contains 10 red marbles and 90 blue marbles. Five marbles are selected randomly from the 100 in the box. Let X be the number of blue marbles in the five selected marbles. Identify the values for N, k,N - k,and n in the hypergeometric distribution which corresponds to X. Am. The total number of marbles is N = 100.The number of blue marbles (successes) is k = 90. The number of red marbles (failures) is N - k = 10.The sample size is n = 5. 5.26 In problem 5.25, consider the event X = 2. This is the event that five marbles are selected and 2 are blue and 3 are red. (a) How many ways may 5 marbles be selected from 100marbles? (b) How many ways may 2 blue marbles be selected from 90 blue marbles? (c) How many ways may 3 red marbles be selected from 10red marbles? (d) How many ways may 3 red marbles be selected from 10red marbles and 2 blue marbles be selected from 90 blue marbles? (d) 120x 4,005= 480,600
  • 116.
    CHAP. 51 DISCRETERANDOM VARIABLES 107 HYPERGEOMETRICPROBABILITYFORMULA 5.27 Computec, Inc. manufactures personal computers. There are 40 employees at the Omaha plant and 15 employees at the Lincoln plant. Five employees of Computec, Inc. are randomly selected to fill out a benefits questionnaire. (a) What is the probability that none of the five selected are from the Lincoln plant? (b) What is the probability that all five of the selected employees are from the Lincoln plant? Am. (a) [105)[450) - - I x 658,008 = ,189 (6) [ : I [ : I - - 3,003 x 1 = .000863 3,478,76 1 3,478,76 1 5.28 What is the probability of getting 5 face cards when 5 cards are selected from a deck of 52? Am. A deck of cards consists of 12 face cards and 40 nonface cards. Let X represent the number of face cards in the 5 cards selected. The event in which we are interested is X = 5. The probability is P(X = 5 ) = [:'jo)- - 7 9 2 ~ 1 = .000305 2,598,960 (i:) Supplementary Problems RANDOM VARIABLE 5.29 A taste test is conducted involving 35 individuals. Random variable X is the number in the 35 who prefer a locally produced nonalcoholic beer to a national brand. What are the possible values for X ? Ans. The whole numbers 0 through 35 5.30 A psychological experiment was conducted in which the time to traverse a maze was recorded for each of five dogs. The times were 4, 6, 8, 9, and 12 minutes. Two of the times were randomly selected and the difference X = largest of the pair - smallest of the pair was recorded. Give all possible pairs of possible selections, and the value of X for each outcome. Am. See Table 5.18. Table 5.18 Value of X
  • 117.
    108 X -4 -10 3 8 P(x) . I .2 .4 .3 .2 L DISCRETE RANDOM VARIABLES [CHAP. 5 DISCRETE RANDOM VARIABLE 5.31 A die is tossed until the face 6 turns up. Let X be the number of tosses needed until the face 6 first turns up. Give the possible values for the variable X. Ans. The possible values are the positive integers. That is, the possible values are I , 2, 3, . . . . 5.32 Identify the discrete random variables in parts (a)through (e). (a) The number of arrests during a 10-day period during which the police apply a Zero-tolerance strategy (6) The time workers have spent with their current employer ( c ) The number of nurse practitioners per state (d) The career lifetime of major league baseball players (e) The number of executions of death row inmates per year in the U.S. Am. (n), (c),and (e) CONTINUOUSRANDOM VARIABLE 5.33 Identify the continuous random variables in the following list. (a) Weight of individuals in kg (b) Serum cholesterol level in mg/dl (c) Length of intravenous therapy in hours (6)Body mass index in kg/m2 (e) Cardiac output in liters/minute Ans. All five are continuous. 5.34 What is the primary difference between a discrete random variable and a continuous random variable? Am. There are values between the possible values of a discrete random variable which are not possible values for the random variable. This is not generally true for a continuous random variable. PROBABILITY DISTRIBUTION 5.35 Suppose Table 5.19 gives the number in thousands of students in grades 9 through 12 for public schools in the United States. Let X represent the grade level. Give the probability distribution for X. Table 5.19 I 12 Grade I 9 10 11 I I Frequency I 3,525 3,475 3,050 2,950 I Am. The distribution is given in Table 5.20. Table 5.20 1 x 1 9 10 1 1 12 I I P(x) I .271 267 .245 .227 I
  • 118.
    CHAP. 51 X P(x) DISCRETE RANDOMVARIABLES -20 -10 0 10 15 20 50 . I .2 .3 .2 .1 .2 -. 1 109 pounds percent 0 5 15 25 50 33 21 11 25 10 2 (d) P(x)= .2x ,x = 1,2 h i s . Parts (6) and (d)are probability distributions. Part (a) is not because C P(x) = 1.2. Part (c)is not P(50) < 0. MEAN OF A DISCRETE RANDOM VARIABLE 5.37 The number of personal computers per household in the United States has the probability distribution shown in Table 5.21. Find the mean number of personal computers per household. Table 5.21 X I 0 1 2 1 Ans. p = 0.85 5.38 Based on a large survey, the distribution in Table 5.22 was found for the number of pounds individuals desired to lose. Find the mean number of pounds they desired to lose. Table 5.22 Ans. p = 13.95 STANDARD DEVIATION OF A DISCRETE RANDOM VARIABLE 5.39 Find the standard deviation of the number of personal computers per household for the distribution given in Table 5.21, Alzs. CJ = 0.65 5.40 Find the standard deviation of the number of pounds desired to be lost. Ans. CJ = 15.55 BINOMIAL RANDOM VARIABLE 5.41 Thirty percent of the trees in a national forest are infested with a parasite. Fifty trees are randomly selected from this forest and X is defined to equal the number of trees in the 50 sampled that are infested with the parasite. The infestation is uniformly spread throughout the forest. Identify the values for n, p, and q. AIIS. n = SO, p = 30, and q = .70
  • 119.
    110 DISCRETE RANDOMVARIABLES [CHAP. 5 5.42 Suppose in problem 5.41 that we define Y to be the number of trees in the 50 sampled that are not infested with the parasite. Then Y is a binomial random variable. (a) What are the values of n, p, and q for Y ? (6) The event X = 20 is equivalent to the event that Y = a. Find the value for a. Ans. (a) n = 50, p = .70,and q = .30 (h) a = 30 BINOMIAL PROBABILITY FORMULA 5.43 The Dallas-Fort Worth Airport claims that 85%of their flights are on time. If the claim is correct, what is the probability that in a sample of 20 flights at the Dallas-Fort Worth Airport that 15 or more of the sample flights are on time'? Ans. .9321 5.44 A psychological study involving the troops in the Bosnia peacekeeping force was conducted. If 12 percent of the 21,496troops are females, what is the probability that in a sample of 50 randomly selected individuals that five or fewer are female? Ans. .435 3 TABLES OF THE BINOMIAL DISTRIBUTION 5.45 There are approximately 3,000 inmates on death row. Forty percent of the death row inmates are African- American. Twenty of the death row inmates are randomly selected for a sociological study. Use the table of binomial probabilities to find the probability that most of the selected inmates are African-American. Am. .1275 5.46 Ten percent of the Rentwheels car-rental fleet are equipped with cellular phones. If five of the cars are randomly selected, what is the probability that none are equipped with a cellular phone'? Ans. S905 MEAN AND STANDARD DEVIATION OF A BINOMIAL RANDOM VARIABLE 5.47 5.48 It is conjectured that 60% of the deaths from melanoma can be prevented by a skin self-exam. If this conjecture is correct, how many of the 7,000 deaths due to this skin cancer would be prevented per year on the average'?What is the standard dcviation associatcd with the number of deaths prevented? Ans. 4.200 41 Fifteen percent of the machinery and equipment at businesses is more than I0 years old. In a randomly selected sample of 35 businesses, how many would you expect to have machinery or equipment that is more than 10years old? What standard deviation is associated with this expected number? Ans. 5.25 2.11
  • 120.
    CHAP. 51 DISCRETERANDOM VARIABLES 111 POISSON RANDOM VARIABLE 5.49 Give four examples of a Poisson random variable. Am. 1. The number of telephone calls received per hour by an office 2. The number of keyboard errors per page made by an individual using a word processor 3. The number of bacteria in a given culture 4. The number of imperfections per yard in a roll of fabric 5.50 What are the potential values of a Poisson random variable? Ails. 0, 1, 2,. . . POISSON PROBABILITY FORMULA 5.51 Suppose the mean number of earthquakes per year is 13. What is the probability of 20 or more earthquakes in a given year? Arts, .0427 5.52 The number of plankton in a liter of lake water has a mean value of 7. What is the probability that the number of plankton in a given liter is within one standard deviation of the mean? Ans. .6575 HYPERGEOMETRIC RANDOM VARIABLE 5.53 Since 1977 twenty-four states have executed at least one death row inmate. In a study concerning capital punishment, ten of the fifty states are randomly selected. Let X represent the number of states in the ten that have executed at least one death row inmate. Identify success and failure. What are the values of N, k, N - k,and n. Am. Success is that at least one death row inmate has been executed since 1977. Failure is that no execution has occurred in the state since 1977.N = 50, k = 24, N - k = 26, and n = 10. 5.54 If success and failure are interchanged in problem 5.53, how is X changed and what are the values of N, k, N - k, and n for X? Am. X is the number of states in the ten that have executed no one since 1977. N = SO,k = 26, N - k = 24, and n = 10. HYPERGEOMETRIC PROBABILITY FORMULA 5.55 Thirty diabetics have volunteered for a medical study. Ten of the diabetics have high blood pressure. Five are selected for a preliminary screening for the study. What is the probability that none of the five selected have high blood pressure if the selection is done randomly? 5.56 A box of manufactured products contains 20 items. Three of the items are defective. Let X represent the number of defectives in three randomly selected from the box. Give the probahility distribution for X and find the mean value for X using the probability distribution. Show that the mean is also given by p = -. nk N
  • 121.
    112 Ans. DISCRETE RANDOM VARIABLES[CHAP. 5 1 x 1 0 1 2 3 1 I .000877 I P(x> I S96491 .357895 .044737 p = .45
  • 122.
    Chapter 6 Continuous RandomVariables and Their Probability Distributions UNIFORM PROBABILITY DISTRIBUTION A coiitir2uous random variclble is a random variable capable of assuming all the values in an interval or several intervals of real numbers. Because of the uncountable number of possible values, it is not possible to list all the values and their probabilities for a continuous random variable in a table as is true with a discrete random variable. The probability distribution for a continuous random variable is represented as the area under a curve called the probability densiqfunction, abbreviated pdJ A pdf is characterized by thefollowirig t”o basic properties: The graph o f the pdf is never below the x axis arid the total area wider the pdf always equals I . The probability density function shown in Fig. 6-1 is a urzifor-mprobability distribution. This pdf represents the distribution of flight times between Omaha, Nebraska, and Memphis, Tennessee. The flight time is represented by the letter X. The graph shows that the flight times range from 90 to 100minutes. The distance from the x axis to the graph remains constant at 0.10and since the area of a rectangle is given by the length times the width, the area under the pdf is 10 x 0.10 = 1. Note that this pdf has the two basic properties given above. The graph of the pdf is never below the x axis and the total area under the pdf is equal to I. X 90 100 Fig. 6-1 The representation of Fig. 6-1 by an equation is given as follows. .I 9 0 < x < 1 0 0 0 elsewhere f(x) = In general, if a random variable X is uniformly distributed over the interval from a to b, then the pdf is given by formula (6.1). a < x < b 1 0 elsewhere 112
  • 123.
    114 CONTINUOUS RANDOMVARIABLES [CHAP. 6 EXAMPLE 6.1 The probability that a flight takes between 92 and 97 minutes is represented as P(92 < X < 97) and is equal to the shaded area shown in Fig. 6-2. The rectangular shaded area has a width equal to 5 and a length equal to .1 and the area is equal to 5 x . 1 = .5. That is, 50 percent of the flights between Omaha and Memphis will take between 92 and 97 minutes. Fig. 6-2 MEAN AND STANDARD DEVIATION FOR THE UNIFORM PROBABILITY DISTRIBUTION The mean value for a random variable having a uniform probability distribution over the interval from a to b is given by a + b p = - 2 The variance for a uniform random variable is given by (b - a)2 cJ2= 12 (6-3) EXAMPLE 6.2 The weights of 10-pound bags of potatoes packaged by Idaho Farms Inc. are uniformly distributed between 9.75 pounds and 10.75 pounds. The distribution of weights for these bags is shown in Fig. 6-3. f(x) I X 9.75 10.75 Fig. 6-3 a +b 9.75+ 10.75 Using formula (6.2),we see that the mean weight per bag is p = -- - = 10.25 pounds and using 2 2 formula (6.3),the standard deviation is CJ = ,/v -= E= 0.29. If X represents the weight per bag, then
  • 124.
    CHAP. 61 CONTINUOUSRANDOM VARIABLES 115 P(X > 10.0)corresponds to the proportion of bags that weigh inore than 10pounds. This probability is shown in Fig. 6-4. The ,haded rectangle has dimensions 1 and 0.75 and the area is 1 x 0.75 = 0.75. Seventy-five percent of the bags weigh more than 10 pounds. To help ensure consumer satisfaction, Idaho Farms instructs their employees to try and never underfill if possible. This is reflected in the average weight of 10.25 pounds per bag. Fig. 6-4 The probability associated with a single value for a continuous random variable is always equal to zero, since there is no area associated with a single point. That is, the probability that X = a is given by P(X = a) = 0 NORMAL PROBABILITY DISTRIBUTION The most important and widely used of all continuous distributions is the normal probability distributiort. Figure 6-5 shows the pdf for a normal distribution having mean p and standard deviation 0. Fig. 6-5 Table 6.1 gives some of the main properties of the normal curve shown in Fig. 6-5. Figure 6-6 shows two different normal curves with the same mean but different standard deviations. The larger the standard deviation, the more disperse are the values about the mean.
  • 125.
    116 CONTINUOUS RANDOMVARIABLES [CHAP. 6 Table 6.1 ProDertiesof the Normal Probability Distribution I I. The total area under the normal curve is equal to one. 2. 3. 4. 5. 6. 7. 8. 9. The curve is symmetric about p and the area under the curve on each side of the mean equals 0.5. The (ails of the curve extend indefinitely. Each pair of values for p and CT determine a different normal curve. The highest point on a normal curve occurs at the mean. The mean, median, and mode are all equal for a normal curve. The mean locates the center of the curve and can be any real number, negative, positive, or zero. The standard deviation is positive, and determines the shape of the normal curve. The larger the standard deviation, the wider and tlatter the curve. 68.26% of the area under the curve is within I standard deviation of the mean, 95.44% of the area is within 2 standard deviations, and 99.72%of the area is within 3 standard dcviations of the mean. I X P Fig. 6-6 Figure 6-7 shows two normal curves having equal standard deviations but different means. The normal curve with mean equal to 66 represents the distribution of adult female heights and the normal curve with mean equal to 70 represents the distribution of adult male heights. I X 66 70 Fig. 6-7 The equation of the pdf of a normal curve having mean p and standard deviation 0 is given in formula (6.5).
  • 126.
    CHAP. 61 0.0 0.1 0.2 CONTINUOUS RANDOMVARIABLES .oooO .0040 ... .O199 ... .0359 .0398 .0438 ... .0596 ... .0753 .0793 .0832 ... .0987 ... .1141 ... ... 117 1.6 The two constants in the formula, e and x;, are important constants in mathematics. The constant e is equal to 2.718281828 . . . and the constant .n is equal to 3.141592654 . . . . The equation of the pdf for the normal curve is not used a great deal in practice and is given here for the sake of completeness. ... .., ... ... .4452 .4463 ,.. .4505 ... .4545 ... ... ... ... ... ... STANDARDNORMAL DISTRIBUTION 3.0 The standard tionnal distvibirtion is the normal distribution having mean equal to 0 and standard deviation equal to 1. The letter Z is used to represent the standard normal random variable. The standard normal curve is shown in Fig. 6-8. The curve is centered at the mean, 0, and the z-axis is labeled in standard deviations above and below the mean. ... ... ... ,4987 .4987 ... .4989 ... .4990 -3 -2 - 1 0 1 2 3 Fig. 6-8 Appendix 2 contains the startdard rtormal distribution table. This table gives areas under the standard normal curve for the variable Z ranging from 0 to a positive number z. Some examples will now be given to illustrate how to use this table. EXAMPLE6.3 Table 6.2 illustrates how to use the standard normal distribution table to find the area under the standard normal curve between z = 0 and z = 1.65. Figure 6-9 shows the corresponding area as the shaded region under the curve. The value 1.65 may be written as 1.6 + .05, and by locating 1.6 under the column labeled z and then moving to the right of 1.6until you come under the .05column you find the area .4505. This is the area shown in Fig. 6-9. We express this area as P(0 < Z < 1.65) = .4505. Table 6 . 2 I 2 I .oo .01 ... .05 ... .09 I
  • 127.
    118 CONTINUOUS RANDOMVARIABLES [CHAP. 6 Fig. 6-9 EXAMPLE 6.4 The area under the standard normal curve between z = -1.65 and z = 0 is represented as P(-1.65< Z< 0) and is shown in Fig. 6-10.By symmetry, the following probabilities are equal. P(-1.65< Z < 0)= P(O < Z < 1.65) From Example 6.3, we know that P(0< Z < 1.65)= ,4505and therefore, P(-1.65< Z < 0)= ,4505. Fig. 6-10 EXAMPLE 6.5 The area under the standard normal curve between z = -1.65 and z = 1.65 is represented by P(-1.65 < Z< I .65)and is shown in Fig. 6-11. The probability P(-1.65 < Z < 1.65) is expressible as P(-l.65< Z< 1.65)= P(-l.65< Z< 0) +P(O< Z < 1.65) The probabilities on the right side of the above equation are given in Examples 6.3 and 6.4,and their sum is equal to 0.9010.Therefore, P(-1.65 < Z < 1.65)= .9010. Fig. 6-11
  • 128.
    CHAP. 61 CONTINUOUSRANDOM VARIABLES 119 EXAMPLE 6.6 The probability of the event Z c 1.96 is represented by P(Z < 1.96) and is shown in Fig. 6-12. Fig. 6-12 The area shown in Fig. 6-12 is partitioned into two parts as shown in Fig. 6-13. The darker of the two areas is equal to P(Z < 0) = .5, since it is one-half of the total area. The lighter of the two areas is found in the standard normal distribution table to be .4750. The sum of the two areas is .5 + .4750 = .9750. Summarizing, P(Z < 1.96)= P(Z < 0) +P(0 < Z c 1.96)= .5 + .4750 = .9750 Fig. 6-13 The probability in Example 6.6 can also be found by using CDF of Minitab as follows: MTB > cdf 1.96; SUBC > normal 0 1. Cumulative Distribution Function Normal with mean = 0 and standard deviation = I .OOOOO x P(X <= x) 1.9600 0.9750 Fig. 6-14
  • 129.
    120 CONTINUOUSRANDOM VARIABLES[CHAP. 6 In order to find the probability P(Z> 1.96),shown in Fig. 6-14, we use the complement of the event Z > 1.96 and the result for P(Z < 1.96)given above as follows. P(Z > 1.96)= 1 - P(Z < 1.96)= 1 - .9750 = .0250 STANDARDIZING A NORMAL DISTRIBUTION In order to find areas under a normal distribution having mean p and standard deviation 0,the normal distribution must be standardized. A normal random variable X having mean p and standard deviation 0 is converted or transformed to a standard normal random variable by the formula given in (6.6). x - C L Z = - 0 EXAMPLE 6.7 Figure 6-15 shows weights, X, are normally distributed the distribution of adult male weights for a particular age group. The with mean 170 pounds and standard deviation equal to 15 pounds. The X - - c ~ 215-170 standardized value for the weight X = 215 is z = -- - = 3. For this age group, an individual who weighs 2 15 pounds is 3 standard deviations above average. The standardized value for 170 pounds is zero, and the standardized value for 125 pounds is -3. 0 15 125 170 215 Fig. 6-15 APPLICATIONS OF THE NORMAL DISTRIBUTION The fact that many real-world phenomena are normally distributed leads to numerous appli- cations of the normal distribution. Applications of the normal distribution usually involve finding areas under a normal curve. Tofind the area between two values o f x’for a normal distribution, first convert both values of x to their respective z values. Thenfind the area under the standard normal cuwe between those two z values. The area between the tuqo z values gives the area between the corresponding x values. EXAMPLE 6.8 In a study involving stress-induced blood pressure, volunteers played a computer game called the color-word interference task. The game was set so that everyone made errors about 17% of the time. The average increase in systolic blood pressure was 10points of systolic pressure, and the standard deviation was 3 points. The percent experiencing an increase of 16 points or more is found by evaluating P(X > 16) and
  • 130.
    CHAP. 61 CONTINUOUSRANDOM VARIABLES 121 multiplying by 100. The probability is shown as the shaded area in Fig. 6-16. The z value corresponding to x = 16isz= - = 2. The area shown in Fig. 6-16 is the same as the area shown in Fig. 6-17. 16- 10 3 Fig. 6-16 The equality of the areas in Figures 6-16and 6-17 is expressible in terms of probability as follows: P(X > 16)= P(Z > 2) Using the standard normal distribution table, we find that P(Z > 2) = .5 - .4772 = .0228. The percent experiencing an increase of 16 systolic points or more is 2.28%. Fig. 6-17 EXAMPLE 6.9 The time between release from prison and conviction for another crime for individuals under 40 is normally distributed with a mean equal to 30 months and a standard deviation equal to 6 months. The percentage of these individuals convicted for another crime within two years of their release from prison is represented as P(X < 24) times 100.The probability is shown as the shaded area in Fig. 6-18. The event X < 24 X-30 24-30 is equivalent to Z = - c - = -1. The probability P(Z <-1) is shown as the shaded area in Fig. 6-19. 6 6 Fig. 6-18
  • 131.
    122 CONTINUOUS RANDOMVARIABLES [CHAP. 6 The shaded area shown in Fig. 6-19 is P(Z < -l), and by symmetry P(Z c -1) = P(Z > 1). The probability P(Z> 1) is found using the standard normal distribution table as .5 - .3413 = .1587. That is P(X c 24) = P(Z c -1) = .1587 or 15.87% commit a crime within two years of their release. Fig. 6-19 EXAMPLE 6.10 A study determined that the difference between the price quoted to women and men for used cars is normally distributed with mean $400 and standard deviation $50. To clarify, let X be the amount quoted to a woman minus the amount quoted to a man for a given used car. Then, the population distribution for X is normal with p = $400 and a = $50.The percent of the time that quotes for used cars are $275 to $500 more for women than men is given by P(275 c X c 500) times 100. The probability P(275 c X c 500) is shown as the 275-400 shaded area in Fig. 6-20. The Z value corresponding to X = 275 is z = 500-400 corresponding to X = 500 is z = 50 = -2.5 and the Z value 50 = 2.0. The probability P(-2.5 c Z c 2.0) is shown in Fig. 6-21. Fig. 6-20 The probability P(-2.5 < 2 c 2.0) is found using the standard normal distribution table. The probability is expressed as P(-2.5 c Z < 2.0) = P(-2.5 c Z < 0) + P(0 c Z c 2.0). By symmetry,P(-2S<ZcO) = P(O<Zc2.5) = .4938 and P(0 < Z c 2.0) = .4772. Therefore, P(-2.5 < Z < 2.0) = .4938 + .4772= .971. P(275 c X c 500) = P(-2.5 < Z <2.0) = .971,or 97.1% of the time the quotes for used cars will be between $275 and $500 more for women than for men.
  • 132.
    CHAP. 61 CONTINUOUSRANDOM VARIABLES 123 Fig. 6-21 DETERMINING THE Z AND X VALUES WHEN AN AREA UNDER THE NORMAL CURVE IS KNOWN In many applications, we are concerned with finding the z value or the x value when the area under a normal curve is known. The following examples illustrate the techniques for solving these type problems. EXAMPLE 6.11 Find the positive value z such that the area under the standard normal curve between 0 and z is .4951. Figure 6-22 shows the area and the location of z on the horizontal axis. Table 6.3 gives the portion of the standard normal distribution table needed to find the z value. Fig. 6-22 We search the interior of the table until we find .4951,the area we are given. This area is shown in bold print in Table 6.3. By going to the beginning of the row and top of the column in which .4951 resides, we see that the value for z is 2.58. Table 6.3
  • 133.
    124 CONTINUOUS RANDOMVARIABLES [CHAP. 6 To find the value of negative z such that the area under the standard normal curve between 0 and z equals .4951, we find the value for positive z as above and then take z to be -2.58. EXAMPLE 6.12 The mean number of passengers that fly per day is equal to 1.75 million and the standard deviation is 0.25 million per day. If the number of passengers flying per day is normally distributed, the distribution and ninety-fifth percentile, Pg5,is as shown in Fig. 6-23. The shaded area is equal to 0.95.We have a known area and need to find Pg5,a value of X. Since we must always use the standard normal tables to solve problems involving any normal curve, we first draw a standard normal curve corresponding to Fig. 6-23. This is shown in Fig. 6-24. Fig. 6-23 Using the technique shown in Example 6.11, we find the area .4500 in the interior of the standard normal distribution table and find the value of z to equal 1.645. There is 95% of the area under the standard normal curve to the left of 1.645 and there is 95% of the area under the curve Pg5is standardized, the standardized value must equal 1.645. That is, find Pgs= 1.75+ .25 x 1.645= 2.16 million. in Fig. 6-23 to the left of P9s.Therefore if Pgg -1.75 = 1.645 and solving for P 9 5 we .25 Fig. 6-24 The value of z in Fig. 6-24 can also be found by using INVCDF of Minitab as follows. MTB > invcdf .95; SUBC > normal 0 1. Inverse Cumulative Distribution Function Normal with mean = 0 and standard deviation = 1.OOOOO P(X <= x) X 0.9500 1.6449
  • 134.
    CHAP. 61 CONTINUOUSRANDOM VARIABLES 125 NORMAL APPROXIMATION TO THE BINOMIAL DISTRIBUTION Figure 6-25 shows the binomial distribution for X, the number of heads to occur in 10 tosses of a coin. This distribution is shown as the shaded area under the histogram-shaped figure. Superimposed upon this binomial distribution is a normal curve. The mean of the binomial distribution is p = np = 10 x .5 = 5 and the standard deviation of the binomial distribution is CT = ,/G- = 410 x .5 x .5 = = 1.58. The normal curve, which is shown, has the same mean and standard deviation as the binomial distribution. When a normal curve is fit to a binomial distribution in this manner, this is called the normal approximation to the binonzial distribution. The shaded area under the binomial distribution is equal to one and so is the total area under the normal curve. The normal approximation to the binomial distribution is appropriate whenever np 2 5 and nq 2 5. Fig. 6-25 EXAMPLE 6.13 Using the table of binomial probabilities, the probability of 4 to 6 heads inclusive is as follows: P(4 I X I 6) = .205 1 + .2461 + .2051 = 6563.This is the shaded area shown in Fig. 6-26. Fig. 6-26 The normal approximation to this area is shown in Fig. 6-27. To account for the area of all three rectangles, note that X must go from 3.5 to 6.5. The area under the normal curve for X between 3.5 and 6.5 is found by determining the area under the standard normal curve for Z between z = -= -.95 and z = -= .95. This area is 2 x .3289 = .6578. Note that the approximation, .6578, is extremely close to the exact answer, .6563. 3.5-5 6.5-5 1.58 1.58
  • 135.
    126 CONTINUOUS RANDOMVARIABLES [CHAP. 6 Fig. 6-27 EXAMPLE6.14 To aid in the early detection of breast cancer women are urged to perform a self-exam monthly. Thirty-eight percent of American women perform this exam monthly. In a sample of 315 women, the probability that 100 or fewer perform the exam monthly is P(X I 100).Even Minitab has difficulty performing this extremely difficult computation involving binomial probabilities. However, this probability is quite easy to approximate using the normal approximation. The mean value for X is p = np = 315 x .38 = 119.7 and the standard deviation is (J = 6= 4315 x .38 x .62 = 8.61. A normal curve is constructed with mean 119.7 and standard deviation 8.61. In order to cover all the area associated with the rectangle at X = 100 and all those less than 100, the normal curve area associated with x less than 100.5 is found. Figure 6-28 shows the normal curve area we need to find. Fig. 6-28 The corresponding area under the standard normal curve is shown in Fig. 6-29. The area under the standard normal curve for Z <-2.23 is SO00 - .4871 = .0129.The probability is extremely small that 100 or fewer in the 315 will perform the breast examination each month. Fig. 6-29
  • 136.
    CHAP. 61 CONTINUOUSRANDOM VARIABLES 127 EXPONENTIAL PROBABILITY DISTRIBUTION The exponentialprobability distribution is a continuous probability distribution that is useful in describing the time it takes to complete some task. The pdf for an exponential probability distribution is given by formula (6.3, where p is the mean of the probability distribution and e = 2.71828 to five decimal places. for x 2 0 1 - x l p f(x)= -e c1 The graph for the pdf of a typical exponential distribution is shown in Fig. 6-30. Fig. 6-30 EXAMPLE 6.15 The exponential random variable can be used to describe the following characteristics: the time between logins on the internet, the time between arrests for convicted felons, the lifetimes of electronic devices, and the shelf life of fat free chips. PROBABILITIES FOR THE EXPONENTIAL PROBABILITY DISTRIBUTION A standardized table of probabilities does not exist for the exponential distribution and to find areas under the exponential distribution curve requires the use of calculus. However, formula (6.8) is useful in solving many problems involving the exponential distribution. The area corresponding to the probability given in formula (6.8)is shown as the shaded area in Fig. 6-31. Fig. 6-31
  • 137.
    128 CONTINUOUS RANDOMVARIABLES [CHAP. 6 EXAMPLE 6.16 Suppose the time till death after infection with HIV, the AIDS virus, is exponentially distributed with mean equal to 8 years. If X represents the time till death after infection with HIV, then the percent who die within five years after infection with HIV is found by multiplying P(X I 5 ) by 100. The probability is found as follows: P(X 5 5 ) = 1 - e-.625= 1 - .535 = .465. Using CDF of Minitab, we have the following as an alternative solution. MTB > cdf 5; SUBC > exponential 8. Cumulative Distribution Function Exponential with mean = 8.00000 x P(X<=x) 5.OOOO 0.4647 To find the percent who live more than 10 years, we multiply P(X > 10) by 100. In order to utilize formula (6.8),we use the complementary rule for probabilities. This rule allows us to write P(X > 10) as follows: That is, 28.7% of the individuals live more than 10years after infection. This probability is shown as the shaded area in Fig. 6-32. Fig. 6-32 To find the percent who live between 2 and 4 years after infection, we multiply P(2 < X < 4) by 100. To use formula (6.8)to find this probability, we express P(2 < X <4) as follows: P(2 <X < 4) = P(X <4) - P(X < 2)= (1 - e 3- (1 - e-.25)= e-.25- e-.5= .172 That is, 17.2% live between 2 and 4 years after infection. This probability is shown as the shaded area in Fig. 6-33. Fig. 6-33
  • 138.
    CHAP. 61 CONTINUOUSRANDOMVARIABLES 129 Solved Problems UNIFORMPROBABILITY DISTRIBUTION 6.1 The price for a gallon of whole milk is uniformly distributed between $2.25 and $2.75 during July in the US.Give the equation and graph the pdf for X, the price per gallon of whole milk during July. Also determine the percent of stores that charge more than $2.70 per gallon. 2 2.25< x < 2.75 0 elsewhere .The graph is shown in Fig. 6-34. Ans. The equation of the pdf is f(x) = Fig. 6-34 Fig. 6-35 The percent of stores charging a higher price than $2.70 is P(X > 2.70) times 100. The probability P(X > 2.70) is the shaded area in Fig. 6-35. This area is 2 x .05 = .10.Ten percent of all milk outlets sell a gallon of milk for more than $2.70. 6 . 2 The time between release from prison and the commission of another crime is uniformly distributed between 0 and 5 years for a high-risk group. Give the equation and graph the pdf for X, the time between release and the commission of another crime for this group. What percent of this group will commit another crime within two years of their release from prison? .2 o < x < 5 0 elsewhere .The graph of the pdf is shown in Fig. 6-36. The Ans. The equation of the pdf is f(x) = percent who commit another crime within two years is given by P(X c 2) times 100. This probability is shown as the shaded area in Fig. 6-37, and is equal to 2 x .2 = .4. Forty percent will commit another crime within two years. Fig. 6-36 Fig. 6-37
  • 139.
    130 CONTINUOUS RANDOMVARIABLES [CHAP. 6 MEAN AND STANDARD DEVIATION FOR THE UNIFORM PROBABILITY DISTRIBUTION 6.3 6.4 Find the mean and standard deviation of the milk prices in problem 6.1. What percent of the prices are within one standard deviation of the mean? Ans. Find a + b 2.25+2.75 The mean is given by p = -- - = 2.50 and the standard deviation is given by 2 2 / y = / T (2*7s - 2‘25) = .144. The one standard deviation interval about the mean goes from 2.36 to 2.64 and the probability of the interval is (2.64 - 2.36) x 2 = S6. Fifty-six percent of the prices are within one standard deviation of the mean. the mean and standard deviation of the times between release from prison and the commission of another crime in problem 6.2. What percent of the times are within two standard deviations of the mean? (b - a)2 i 12 a + b 0+5 2 2 Am. The mean is given by = -- - -- - 2.50 and the standard deviation is given by = J(5 - ”* = 1.44. A 2 standard deviation interval about the mean goes from -.38 to 5.38 and 100% of the times are within 2 standard deviations of the mean. 12 NORMAL PROBABILITY DISTRIBUTION 6.5 6.6 The mean net worth of all Hispanic individuals aged 51-61 in the U.S. is $80,000, and the standard deviation of the net worths of such individuals is $20,000. If the net worths are normally distributed, what percent have net worths between: (a) $60,000 and $100,000; (6) $40,000 and $1 20,000;(c) $20,000and $140,000? Ans. ( a ) 68.26% have net worths between $60,000and $100,000. (b) 95.44% have net worths between $40,000 and $120,000. (c) 99.72% have net worths between $20,000 and $140,000. If the median amount of money that parents in the age group 51-61 gave a child in the last year is $1,725 and the amount that parents in this age group give a child is normally distributed, what is the modal amount that parents in this age group give a child? Am. Since the distribution is normally distributed, the mean, median, and mode are all equal. Therefore, the modal amount is also $1,725. STANDARD NORMAL DISTRIBUTION 6.7 Express the areas shown in the following two standard normal curves as a probability statement and find the area of each one.
  • 140.
    CHAP. 61 CONTINUOUSRANDOM VARIABLES 131 Ans. The area under the curve on the left is represented as P(0 < Z < 1.83) and from the standard normal distribution table is equal to .4664. The area under the curve on the right is represented as P(-1.87< Z< 1.87)and from the standard normal distribution table is 2 x .4693 = .9386. 6.8 Represent the following probabilities as shaded areas under the standard normal curve and explain in words how you find the areas: (a)P(Z < -1.75); (b)P(Z < 2.15). Ans. The probability in part (a) is shown as the shaded area in Fig. 6-38 and the probability in part (b) is shown in Fig 6-39. To find the shaded area in Fig. 6-38, we note that P(Z < -1.75)= P(Z > 1.75)because of the symmetry of the normal curve. In addition, P(Z > 1.75)= .5 - P(0 < Z< 1.75)since the total area to the right of 0 is .S. From the standard normal distribution table, P(0 < Z < 1.75)is equal to .4599. Therefore, P(Z <-1.75)= P(Z> 1.75)= .5 - .4599 = .0401. To find the shaded area in Fig. 6-39, we note that P(Z < 2.15) = P(Z < 0)+ P(0 < Z < 2.15). The probability P(Z< 0) = .5 because of the symmetry of the normal curve. From the standard normal distribution table, P(0 < Z < 2.15) = .4842. Therefore, P(Z < 2.15) = .5 + ,4842 = .9842. The solution to part (b)using Minitab is as follows: MTB >cdf 2.15; SUBC > normal 0 1. Cumulative Distribution Function Normal with mean = 0 and standard deviation = 1.OOOOO X P(X <= x) 2.1500 0.9842 Fig. 6-38 Fig. 6-39
  • 141.
    132 CONTINUOUS RANDOMVARIABLES STANDARDIZING A NORMAL DISTRIBUTION [CHAP. 6 6.9 6.10 The distribution of complaints per week per 100,OOO passengers for all airlines in the U.S. is normally distributed with p = 4.5 and CJ = 0.8. Find the standardized values for the following observed values of the number of complaints per week per 100,OOO passengers: (a)6.3; (h) 2.5; (c) 4.5; (d) 8.0. A m X - / A 6.3-45 (a) The standardized value for 6.3 is found by z = -- - - - - 2.25. (7 .8 X - c ~ 2.5-45 (6) The standardized value for 2.5 is found by z = -- - - - - -2.50. (7 .8 x--J1 45-4.5 (c) The standardized value for 4.5 is found by z = -- - - - - 0.00. (7 .8 X-/A 8.0-45 (4 The standardized value for 8.0 is found by z = -- - - = 4.38. (7 .8 Personal injury awards are normally distributed with a mean equal to $62,000 and a standard deviation equal to $13,500. Find the amount of the award corresponding to the following standardized values: (a) 2.0; (b) -3.0; (c)0.0; (44.5. Ans. (a) The amount corresponding to standardized value 2.0 is x = p + ZCT= 62,000 + 2.0 x 13,500= 89,000. (6) The amount corresponding to standardized value -3.0 is x = /A i-ZG = 62,000 - 3.0 x 13,500 = 2 1,500, (c) The amount corresponding to standardized value 0.0 is x = /A + ZG = 62,000 + 0.0 x 13,500= 62,000. (6) The amount corresponding to standardized value 4.5 is x = p + zo = 62,000+ 4.5 x 13,500 = 122,750. APPLICATIONS OF THE NORMAL DISTRIBUTION 6.11 In a sociological study concerning family life, it is found that the age at first marriage for men is normally distributed with a mean equal to 23.7 years and a standard deviation equal to 3.5 years. Determine the percent of men for whom the age at first marriage is between 20 and 30 years of age. If X represents the age at first marriage for men, draw a normal curve for X and show the shaded area for P(20 < X < 30) as well as the corresponding area under the standard normal curve. Ans. The distribution for X is shown in Fig. 6-40.The shaded area represents P(20 < X < 30). The z value corresponding to x = 20 is z = = -1.06 and the z value corresponding to x = 30 is z = = 1.80. The area shown in Fig. 6-4I is equal to the area under the curve shown in Fig. 6-40, that is, P(20 < X < 30) = P(-l.O6 < Z < 1.80).Utilizing the standard normal distribution table, P(-1.06 < Z < 1.80) = .3554+ .4641 = .8195. That is, about 82% of the first marriages for men occur for men between 20 and 30 years of age. 20-23.1 3.5 30-23.1 3.5
  • 142.
    CHAP. 61 CONTINUOUSRANDOM VARIABLES 133 Fig. 6-40 Fig. 6-41 6.12 The net worth of senior citizens is normally distributed with mean equal to $225,000 and standard deviation equal to $35,000. What percent of senior citizens have a net worth less than $300,000? Ans. Let X represent the net worth of senior citizens in thousands of dollars. The percent of senior citizens with a net worth less than $300,000 is found by multiplying P(X c 300) times 100. The probability P(X < 300) is shown in Fig. 6-42. The event X < 300 is equivalent to the event Z < = 2.14. The probability that Z < 2.14 is represented as the shaded area in Fig. 6-43. The probability that Z is less than 2.14 is found by adding P(0 < Z < 2.14) to .5, which equals .5 + .4838 = .9838. We can conclude that 98.38% of the senior citizens have net worths less than $300,000. 300- 225 35 Fig. 6-42 Fig. 6-43 DETERMINING THE Z AND X VALUES WHEN AN AREA UNDER THE NORMAL CURVE IS KNOWN 6.13 Find the value for positive number a such that P(-a < Z < a) = 9 5 . Ans. The symmetry of the normal curve implies that P(0 < Z < a) = .4750. This area is found in the interior of the standard normal distribution table and the z value corresponding to this area is 1.96. The area under the standard normal curve corresponding to -1.96< Z < 1.96 is .9500.
  • 143.
    134 CONTINUOUS RANDOMVARIABLES [CHAP. 6 6.14 U.S. divorce rates, by county, are normally distributed with mean value equal to 4.5 per 1000 people and standard deviation equal to 1.3 per 1000people. Find the third quartile for divorce rates by county per 1000people. Draw graphs to illustrate your solution. Ans. The divorce rate is represented by X and its distribution is shown in Fig. 6-44. The third quartile is represented by Q3.The shaded area is equal to .7500. The third quartile for the standard normal distribution is .67 and is shown in Fig. 6-45. Summarizing,we have P(X < 4 3 ) = P(Z < .67) = .75. The standardized value of Q 3 must equal .67. That is, -= .67 or Q 3 = 4.5 + 1.3 x .67 = 5.37. Seventy-five percent of the divorce rates are 5.37 or below. Q3 -4.5 1.3 Fig. 6-44 Fig. 6-45 NORMAL APPROXIMATION TO THE BINOMIAL DISTRIBUTION 6.15 Use Minitab to find the probability of the event that between 8 and 11 heads inclusive occur when a coin is flipped 20 times. Find the normal approximation to this probability. Ans. MTB > cdf 11; MTB >cdf 7; SUBC > binomial 20 .5. SUBC > binomial 20 .5. Cumulative Distribution Function Binomial with n = 20 and p = 0.500 11.00 0.7483 7.00 0.1316 Cumulative Distribution Function Binomial with n = 20 and p = 0.500 x P(X<=x) X P(X <= x) To find the probability of the event 8 I X I 11 note that P(8 I X I11) = P(X I11) - P(X 27). Therefore, P(8 I X I 11)= .7483 - .1316 = .6167. The mean of the binomial distribution is p = np = 20 x .5 = 10, and the standard deviation is CT = &= & = 2.24. A normal curve with mean equal to 10and standard deviation 2.24 is fit to the binomial distribution and the area under this normal curve is found for X ranging between 7.5 and 11.5. This area is shown in Fig. 6-46. The standardized value for x = 7.5 is z = -1.12 and the standardized value for x = 11.5 is z = .67. The area between z = -1.12 and z = .67 is shown in Fig. 6-47. The area is now found using Minitab. MTB > cdf 11.5; SUBC > normal 10 2.24. Cumulative Distribution Function Normal with mean = 10.00 and standard deviation = 2.240 11.50 0.7485 7.50 0.1322 MTB >cdf 7.5; SUBC > normal 10 2.24. Cumulative Distribution Function Normal with mean = 10.00and standard deviation = 2.240 x P(X<= x) X P(X <= x)
  • 144.
    CHAP. 61 CONTINUOUSRANDOM VARIABLES The area between 7.5 and 11.5is the difference .7485 - .1322 = .6163. 135 Fig. 6-46 Fig. 6-47 6.16 In the United States, 80% of TV-owning homes also own a VCR. Use the normal approxi- mation to the binomial distribution to find the probability that 425 or more in a sample of 500 TV-owning homes also own a VCR. Ans. If we let X represent the number of TV-owning homes which also own a VCR, then we are to find P(X 2 425). A normal curve having mean = np = 400 and standard deviation = 6= 8.94 is fit to the binomial and we need to find the area under the normal curve to the right of 424.5. The solution using Minitab is as follows. MTB > cdf 424.5; SUBC > normal 400 8.94. Cumulative Distribution Function Normal with mean = 400.00 and standard deviation = 8.94 424.50 0.9969 X P(X <= x) Since .9969 is the probability that X is less than 424.5, the probability that X exceeds 424.5 is 1- .9969 = .0031. EXPONENTIALPROBABILITYDISTRIBUTION 6.17 Which of the following statements best describes the exponential probability distribution? (a) The exponential probability distribution is skewed to the right. (b) The exponential probability distribution is skewed to the left. (c) The exponential probability distribution is mound shaped. Ans. (a) The exponential distribution is skewed to the right. 6.18 What is the equation of the exponential pdf having mean equal to 5? Am. The equation is f(x) = .2e-.2X for x 20. PROBABILITIESFOR THE EXPONENTIALPROBABILITYDISTRIBUTION 6.19 The lifetimes in years for a particular brand of cathode ray tube are exponentially distributed with a mean of 5 years. What percent of the tubes have lifetimes between 5 and 8 years? Draw
  • 145.
    136 CONTINUOUS RANDOMVARIABLES [CHAP. 6 a graph of the pdf and shade the area which represents the probability of the event 5 < X < 8, where X represents the lifetimes. Ans. The shaded area in Fig. 6-48 represents P(5 <X < 8). The probability is found as follows: P(5 c X c 8) = P(X c 8) - P(X c 5) = (1 - e-1.6)- (1 - e-I) = e-l - e-1.6= .3679 - .2019 = .166. Approximately 16.6% of the tubes have lifetimesbetween 5 and 8 years. Fig. 6-48 6.20 For the cathode ray tubes in problem 6.19, use Minitab to determine the percent of tubes that have lifetimes of less than 10years. Ans. MTB > cdf 10; SUBC > exponential 5. Cumulative Distribution Function Exponential with mean = 5.00000 10.00 0.8647 x P(X<=x) The percentage of tubes having lifetimes less than 10years is 86.5%. Supplementary Problems UNIFORM PROBABILITY DISTRIBUTION 6 . 2 1 The RND function is a computer language function which uniformly generates random numbers between Oand 1. (a) What percent of the random numbers generated by RND are less than .35? (b) What percent of the random numbers generated by RND are between .20 and .55? (c) What percent of the random numbers generated by RND are either less than .14 or greater than.81? Ans. (a) 35% (b) 35% (c) 33% 6.22 In a psychological study involving personality types and career selections, it is found that the time required to complete a task is uniformly distributed over the interval from 5.0to 7.5 minutes. (a) What is the probability that the task is completed in less than 4 minutes? (b) What is the probability that the task is completed in 7.0 or more minutes? (c) What is the probability that it requires more than 10.0minutes to complete the task? Ans. (a) 0 (b) .2 (c) 0
  • 146.
    CHAP. 61 CONTINUOUSRANDOM VARIABLES 137 MEAN AND STANDARDDEVIATION FOR THE UNIFORM PROBABILITY DISTRIBUTION 6.23 What is the mean and standard deviation of the random numbers generated by RND in problem 6.21? Ans. p = S O G = .29 6.24 What percent of the distribution in problem 6.23 is included within one standard deviation of the mean? Ans. 57.6% NORMAL PROBABILITY DISTRIBUTION 6.25 The heights of adult males are normally distributed with a mean equal to 70 inches and a standard deviation equal to 3 inches. Give the pdf for the normal curve describing this distribution. 6.26 A normal distribution has mean equal to a and standard deviation equal to b and c is a positive number. If it is known that P(X > a +c) = d, find the followingprobabilities: (a) P(X<a+c) (6) P(X<a-c) (c) P(a - c <X < a +c) (d)P(0 < X < a +c) Ans. (a) 1 -d (b) d (c) 1- 2d (d) .5-d STANDARD NORMAL DISTRIBUTION 6.27 Find the following probabilities concerning the standard normal random variable Z. (a) P(O< Z < 2.13) (b) P(-1.45 < Z < 2.10) (c) P(Z > 2.88) (d) P(Z 2 -2.01) Ans. (a) A834 (6) .9086 (c) .0020 (6).9778 6.28 Find the probabilities of the followingevents involving the standard normal random variable Z. (a) Z>4.50 (d) Z < 1.15andZ>-1.15 (e) Z<-2.45andZ> 1.11 u> thecomplementofZ< 1.44 (b) -4.00< Z < 5.50 (c) Z <-1.25 or Z > 2.35 Ans. (a) approx 0 (b) approx 1 (c) .I 150 (d).7498 (e) 0 (fj.0749 STANDARDIZING A NORMAL DISTRIBUTION 6.29 The marriage rate per 1,000population per county has a normal distribution with mean 8.9 and standard deviation 1.7.Find the standardized values for the followingmarriage rates per 1,OOO per county. (a) 6.5 (6) 8.8 (c) 12.5 (d)13.5 Ans. (a) -1.41 (b) -0.06 (c) 2.12 (d) 2.71 6.30 The hospital cost for individuals involved in accidents who do not wear seat belts is normally distributed with mean $7,500 and standard deviation $1,200. (a) Find the cost for an individual whose standardized value is 2.5. (6) Find the cost for an individual whose bill is 3 standard deviations below the average. Ans. (a) $10,500 (6) $3,900
  • 147.
    138 CONTINUOUS RANDOMVARIABLES [CHAP. 6 APPLICATIONSOF THE NORMAL DISTRIBUTION 6.31 The average TV-viewing time per week for children ages 2 to 1 1 is 22.5 hours and the standard deviation is 5.5 hours. Assuming the viewing times are normally distributed, find the following. (a) What percent of the children have viewing times less than 10hours per week? (6) What percent of the children have viewing times between 15 and 25 hours per week? (c) What percent of the children have viewing times greater than 40 hours per week? Ans. ( a ) 1.16% (6) 58.67% (c) less than 0.I % 6.32 The amount that airlines spend on food per passenger is normally distributed with mean $8.00 and standard deviation $2.00. (a) What percent spend less than $5.00 per passenger? (6) What percent spend between $6.00 and $10.00? (c) What percent spend more than $12.50? Ans. (a) 6.68% (6) 68.26% (c) 1.22% DETERMININGTHE Z AND X VALUES WHEN AN AREA UNDER THE NORMAL CURVE IS KNOWN 6.33 6.34 Find the value of a in each of the following probability statements involving the standard normal variable 2. (a) P(0 < Z < a) = .4616 (6) P(Z<a)=.1894 (6) P(Z < a) = .8980 (e) P(Z > a) = .I894 (c) P(-a < Z < a) = 3612 U , P(Z = a) = SO00 Ans. (a) 1.77 (6) 1.27 (c) 1.48 (d)-0.88 (el 0.88 u> no solutions The GMAT test is required for admission to most graduate programs in business. In a recent year, the GMAT test scores were normally distributed with mean value 550 and standard deviation 100. (a) Find the first quartile for the distribution of GMAT test scores. (b) Find the median for the distribution of GMAT test scores. (c) Find the ninety-fifth percentile for the distribution of GMAT test scores. Ans. (a) 483 (6) 550 (c) 715 NORMAL APPROXIMATIONTO THE BINOMIALDISTRIBUTION 6.35 6.36 For which of the following binomial distributions is the normal approximation to the binomial distribution appropriate? ((1) n = 15,p = .2 (6) n = 40, p = . I (c) n = 500, p = .05 (6) n = 50, p = .3 Ans. (a) and (d) Thirteen percent of students took a college remedial course in 1992-1993. Assuming this is still true, what is the probability that in 350 randomly selected students: (a) Less than 40 take a remedial course (6) Between 40 and 50, inclusive, take a remedial course (c) More than 55 take a remedial course Ans. (a) .1711 (b) .6141 (c) .0559
  • 148.
    CHAP. 61 CONTINUOUSRANDOM VARIABLES 139 EXPONENTIALPROBABILITY DISTRIBUTION 6.37 The random variable X has an exponential distribution with pdf f(x) = .le-’’, for x > 0. What is the mean for this random variable? Ans. 10 6.38 What is the largest value an exponential random variable may assume? Ans. There is no upper limit for an exponential distribution. PROBABILITIESFOR THE EXPONENTIALPROBABILITYDISTRIBUTION 6.39 The time that a family lives in a home between purchase and resale is exponentially distributed with a mean equal to 5 years. Let X represent the time between purchase and resale. (a) Find P(X -c3).(6) Find P(X > 5). Ans. (a) ,4512 (6) .3679 6.40 The time between orders received at a mail order company are exponentially distributed with a mean equal to 0.5 hour. What is the probability that the time between orders is between 1 and 2 hours? Ans. .1170
  • 149.
    Chapter 7 Sampling Distributions SIMPLERANDOM SAMPLING In order to obtain information about some population, either a censw of the whole population is taken or a sample is chosen from the population and the information is inferred from the sample. The second approach is usually taken, since it is much cheaper to obtain a sample than to conduct a census. In choosing a sample, it is desirable to obtain one that is representative of the population. The average weight of the football players at a college would not be a representative estimate of the average weight of students attending the college, for example. A simple rundurn sumple of size n from a population of size N is one selected in such a way that every sample of size n has the same chance of occurring. In simple random sampling with replacement, a member of the population can be selected more than once. In simple random sampling without replacement, a member of the population can be selected at most once. Simple random sampling without replacement is the most common type of simple random sampling. EXAMPLE 7.1 Consider the population consisting of the world’s five busiest airports. This population consists of the following: A: Chicago O’Hare, B: Atlanta Hartsfield, C: London Heathrow, D: Dallas-Fort Worth, and E: Los Angeles Intl. The number of possible samples of size 2 from this population of size 5 is given by the combination of 5 items selected two at a time, that is, = 10. In simple random sampling, each possible pair would have probability 0.1 of being the pair selected. That is, Chicago O’Hare and 4tlanta Hartsfield would have probability 0.1 of being chosen, Chicago O’Hare and London Heathrow would have probability 0.1 of being chosen, etc. One way of ensuring that each pair would have an equal chance of being selected would be to write the names of the five airports on separate sheets of paper and select two of the sheets randomly from a box. USING RANDOM NUMBER TABLES The technique of writing names on slips of paper and selecting them from a box is not practical for most real world situations. Tables of random numbers are available in a variety of sources. The digits 0 through 9 occur randomly throughout a random number table with each digit having an equal chance of occurring. Table 7.1 is an example of a random number table. This particular table has 50 columns and 20 rows. To use a random number table, first randomly select a starting position and then move in any direction to select the numbers. EXAMPLE 7 . 2 The money section of USA Today gives the 1,900 most active New York Stock Exchange issues. The random numbers in Table 7.1 can be used to randomly select 10 of these issues. Imagine that the issues are numbered from 0001 to 1900. Suppose we randomly decide to start in row 1 and columns 2 1 through 24. The four-digit number located here is 0345. Reading down these four columns and discarding any number exceeding 1900, we obtain the following eight random numbers between 0001 and 1900: 0345, 1304, 0990, 1580, 1461, 1064, 0676, and 0347. To obtain our other two numbers, we proceed to row 1 and columns 26 through 29. Reading down this column, we find 1149and 1074.To obtain the 10stock issues, we read down the columns and select the ones located in positions 345, 347,676,990, 1064, 1074, 1149. 1304, 1461, and 1580. 140
  • 150.
    CHAP. 71 SAMPLINGDISTRIBUTIONS 141 Table 7.1 Random Numbers (25 rows, 50 columns) 01-05 06-10 11-15 16-20 21-25 26-30 31-35 36-40 41-45 46-SO 87032 98340 70363 09749 84223 27 190 47087 47335 31031 60072 90202 91887 08264 46655 30428 55238 64993 90420 4462 1 15988 2656I 94192 95651 91862 87730 92922 68993 69753 52240 37318 45248 83092 04860 67610 33957 40718 84882 80152 76402 94013 44020 8I975 64089 12659 91759 86046 24807 44311 10346 07550 84967 39204 059I9 35334 53553 83328 03067 10418 04778 71898 06061 69931 31921 63079 9304I 09124 70755 33070 02I33 0441 I 67293 23539 28393 44369 22925 97613 19953 26576 58739 05785 03453 13047 09900 9I937 48757 42493 36834 15800 42534 43925 14612 98551 21460 10649 06766 23119 21077 4036 I 03474 72883 22484 28533 81554 58272 89477 26551 19522 92668 84923 I1499 99573 48427 28370 10744 37433 77718 27665 8242I 00570 17772 55858 34529 53640 90766 8422I 76639 595I0 78460 44548 93024 69573 25425 43026 505 15 45349 16016 10583 61952 28368 57471 6I768 02625 92109 09950 38566 15763 83888 39356 75222 72791 98695 43864 78296 01372 46565 58590 62587 627 I3 60340 75775 I2676 I1020 00459 27996 96274 I8068 51540 9 1692 89959 60190 51303 10714 58382 5508I 4701 1 03726 36875 04890 95227 95202 29353 65106 09599 29679 29195 38998 391 19 34824 921 19 29692 44925 08308 08276 31121 46762 03091 00638 0I032 39059 06545 USING THE COMPUTER TO OBTAIN A SIMPLE RANDOM SAMPLE Most computer statistical software packages can be used to select random numbers and to some extent have replaced random number tables. As the capability and availability of computers continue to increase, many of the statistical tables are becoming obsolete. EXAMPLE 7.3 Minitab can be used to select the random sample of stock issues in Example 7.2. The commands are as follows. MTB > set cl DATA > 1:1900 DATA > end MTB > sample 10c 1 c2 MTB > print c2 Data Display c 2 1227 969 1834 1441 423 897 824 664 414 77 The first three lines of command put the numbers 1 through 1900 into column 1. The command on the fourth line asks for a sample of size 10 from the numbers in column cl and asks that the selected numbers be placed into column c2. The print command causes the random numbers to be printed. These are the numbers of the 10 stocks to be selected.
  • 151.
    142 SAMPLING DISTRIBUTIONS[CHAP. 7 SYSTEMATIC RANDOM SAMPLING Systematic random sampling consists of choosing a sample by randomly selecting the first element and then selecting every kth element thereafter. The systematic method of selecting a sample often saves time in selecting the sample units. EXAMPLE 7.4 In order to obtain a systematic sample of 50 of the nation’s3,143 counties,divide 3,143 by 50 to obtain 62.86. Round 62.86 down to obtain 62. From a list of the 3,143 counties, select one of the first 62 counties at random. Suppose county number 35 is selected. To obtain the other 49 counties, add 62 to 35 to obtain 97, add 2 x 62 to 35 to obtain 159,and continue in this fashion until the number 49 x 62 -t35 = 3,073 is obtained. The counties numbered 35, 97, 159,, . . ,3073 would represent a systematic sample from the nations counties. CLUSTER SAMPLING In cluster sampling, the population is divided into clusters and then a random sample of the clusters are selected. The selected clusters may be completely sampled or a random sample may be obtained from the selected clusters. EXAMPLE 7.5 A large company has 30 plants located throughout the United States. In order to access a new total quality plan, the 30 plants are considered to be clusters and five of the plants are randomly selected,All of the quality control personnel at the five selected plants are asked to evaluate the total quality plan. STRATIFIED SAMPLING In stratifed sampling, the population is divided into strata and then a random sample is selected from each strata. The strata may be determined by income levels, different stores in a supermarket chain, different age groups, different governmental law enforcement agencies, and so forth. EXAMPLE7.6 Super Value Discount has 10 stores. To assessjob satisfaction,one percent of the employees at each of the 10 stores are administered ajob satisfaction questionnaire.The 10stores are the strata into which the population of all employeesat Super Value Discount are divided. The results at the 10 stores are combined to evaluate the job satisfactionof the employees. SAMPLING DISTRIBUTION OF THE SAMPLING MEAN The mean of a population, p, is a parameter that is often of interest but usually the value of p is unknown. In order to obtain information about the population mean, a sample is taken and the sample mean, T , is calculated. The value of the sample mean is determined by the sample actually selected. The sample mean can assume several different values, whereas the population mean is constant. The set of all possible values of the sample mean along with the probabilities of occurrence of the possible values is called the sampling distribution o f the sampling mean. The following example will help illustrate the sampling distribution of the sample mean. EXAMPLE 7.7 Suppose the five cities with the most African-American-owned businesses measured in thousands is given in Table 7.2.
  • 152.
    CHAP. 71 City A: NewYork C: Los Angeles E:Atlanta B: Washington, D.C. D: Chicago SAMPLING DISTRIBUTIONS Number of African-American- owned businesses, in thousands 42 39 36 33 30 143 X P(x) 30 33 36 39 42 .2 .2 .2 .2 .2 If X represents the number of African-American-ownedbusinesses in thousands for this population consisting of five cities, then the probability distribution for X is shown in Table 7.3. I Number of businesses Sample mean - in the samples X Table 7 . 3 The population mean is p = C xP(x) = 30 x .2 + 33 x .2 + 36 x .2 + 39 x .2 +42 x .2 = 36, and the variance is given by$=Cx2P(x)-p2 = 900 x .2 + 1089 x .2 + 1296 x .2 + 1521 x .2 + 1764 x .2 - 1296 = 1314 - 1296 = 18.The population standard deviation is the square root of 18,or 4.24. The number of samples of size 3 possible from this population is equal to the number of combinations possible when selecting three cities from five. The number of possible samples is C: = -= 10. Using the letters A, B, C, D, and E rather than the name of the cities, Table 7.4 gives all the possible samples of three cities, the sample values, and the means of the samples. 5! 3!2! Samples 42, 39, 36 42, 39, 33 42, 39, 30 42, 36, 33 42, 36, 30 42, 33, 30 39, 36, 33 39, 36, 30 39, 33, 30 39 38 37 37 36 35 36 35 34 The sampling distribution of the mean is obtained from Table 7.4. For random sampling, each of the samples in Table 7.4 is equally likely to be selected. The probability of selecting a sample with mean 39 is .1 since only one of 10samples has a mean of 39. The probability of selecting a sample with mean 36 is .2, since two of the samples have a mean equal to 36. Table 7.5 gives the sampling distribution of the sample mean. Table 7.5 I I 1 , 1 3 3 34 35 36 37 38 39 I I PK’)I .I .1 .2 .2 .2 *1 .1 I
  • 153.
    144 Sample mean 33 34 35 36 37 38 39 SAMPLING DISTRIBUTIONS Samplingerror Probability 3 .I 2 .1 I .2 0 .2 1 .2 2 .1 3 .I L [CHAP. 7 33 34 35 36 37 38 39 P(Y) .I .I .2 .2 .2 .1 .I - , x SAMPLINGERROR When the sample mean is used to estimate the population mean, an error is usually made. This error is called the sampling error, and is defined to be the absolute difference between the sample mean and the population mean. The sampling error is defined by sampling error = Ix -p I EXAMPLE73 In Example 7.7, the population of the five cities with the most African-American-owned businesses is given. The mean of this population is 36. Table 7.5 gives the possible sample means for all samples of size 3 selected from this population. Table 7.6 gives the sampling errors and probabilities associated with all the different sample means. Table 7.6 From Table 7.6, it is seen that the probability of no sampling error in this scenario is .20. There is a 60%chance that the sampling error is 1 or less. MEAN AND STANDARDDEVIATION OF THE SAMPLEMEAN Since the sample mean has a distribution, it is a random variable and has a mean and a standard deviation, The mean of the sample mean is represented by the symbol px and the standard deviation of the sample mean is represented by fi .The standard deviation of the sample mean is referred to as the standard error o f the mean. Example 7.9 illustrates how to find the mean of the sample mean and the standard error of the mean. EXAMPLE 7.9 In Example 7.7, the sampling distribution of the mean shown in Table 7.7 was obtained. Table7.7 The mean of the sample mean is found as follows: pir = Z 'XP(X) = 33 x .1 +34 x .1 +35 x .2 + 36 x .2 + 37 x .2 + 38 x .1 +39 x .1 = 36 The variance of the sample mean is found as follows: a; = c X2P(Si;) - j.lz 2 0: = 1089x .l + 1156x .1 + 1225 x .2 + 1296x .2 + 1369x .2 + 1444 x .I + 1521 x . I - 1296= 3 The standard error of the mean, ox, is equal to 6= 1.73.
  • 154.
    CHAP. 71 SAMPLINGDISTRIBUTIONS145 The relationshipbetween the mean of the sample mean and the population mean is expressed by The relationship expressed by formula between the variance of the sample mean and the population variance is (7.3),where N is the population size and n is the sample size. 2 c2x- N-n n N-1 ‘ T , = - (7.3) EXAMPLE7.10 In Example 7.7, the population consisting of the five cities with the most African-American- owned businesses was introduced. The population mean, p,is equal to 36 and the variance, 02, is equal to 18. In Example 7.9, the mean of the sample mean,pF, was shown to equal 36 and the variance of the sample mean, 02, was shown to equal 3. It is seen that p=kSr = 36, illustrating formula (7.2). To illustrate formula (7.3),note that The standarderror of the mean is found taking the square root of both sides of formula (7.3),and is given by The term izis called thefinite population correctionfactor. If the sample size n is less than 5% of the population size, i.e., n < .05N, the finite population correction factor is very near one and is omitted in formula (7.4).If n < .05N, the standard error of the mean is given by EXAMPLE7.11 The mean cost per county in the United States to maintain county roads is $785 thousand per year and the standard deviation is $55 thousand. Approximately 4% of the counties are randomly selected and the mean cost for the sample is computed. The number of counties is 3,143 and the sample size is 125.The standard error of the mean using formula (7.4)is: = 4.91935 x .98 = 4.82 thousand 3,143-1 The standard error of the mean using formula (7.5)is: 55 ‘Tx = -= 4.92 thousand J125 Ignoring the finite population correction factor in this case changes the standard error by a small amount.
  • 155.
    146 SAMPLING DISTRIBUTIONS[CHAP. 7 SHAPE OF THE SAMPLING DISTRIBUTION OF'THE SAMPLE MEAN AND THE CENTRAL LIMIT THEOREM If samples are selected from a population which is normally distributed with mean p and standard deviation 0, then the distribution of sample means is normally distributed and the mean of this distribution is p, = p, and the standard deviation of this distribution is e = -. The shape of the distribution of the sample means is normal or bell-shaped regardless of the sample size. The central l i k t theorem states that when sampling from a large population of any distribution shape, the sample means have a normal distribution whenever the sample size is 30 or more. Furthermore, the mean of the distribution of sample means is pX = p, and the standard deviation of this distribution is = -. It is important to note that when sampling from a nonnormal distribution, X has a normal distribution only if the sample size is 30 or more. The central limit theorem is illustrated graphically in Figure 7-I. 0 & 0 J;T Fig. 7-1 Figure 7-1 illustrates that for samples greater than or equal to 30, 51 has a distribution that is bell- shaped and centers at p.The spread of the curve is determined by e. EXAMPLE7.12 If a large number of samples each of size n, where n is 30 or more, are selected and the means of the samples are calculated, then a histogram of the means will be bell-shaped regardless of the shape of the population distribution from which the samples are selected. However, if the sample size is less than 30, the histogram of the sample means may not be bell-shaped unless the samples are selected from a bell-shaped distribution. APPLICATIONS OF THE SAMPLING DISTRIBUTION OF THE SAMPLE MEAN The distribution properties of the sample mean are used to evaluate the results of sampling, and form the underpinnings of many of the statistical inference techniques found in the remaining chapters of this text. The examples in this section illustrate the usefulness of the central limit theorem. EXAMPLE 7.13 A government report states that the mean amount spent per capita for police protection for cities exceeding 150,000 in population is $500 and the standard deviation is $75. A criminal justice research
  • 156.
    CHAP. 71 SAMPLINGDISTRIBUTIONS147 study found that for 40 such randomly selected cities, the average amount spent per capita for this sample for police protection is $465. If the government report is correct, the probability of finding a sample mean that is $35 or more below the national average is given by P(x < 465). The central limit theorem assures us that because the sample size exceeds 30, the sample mean has a normal distribution. Furthermore, if the government report is correct, the mean of x is $500 and the standard error is -= $1 1.86. The area to the left of $465 75 J40 465-500 under the normal curve for x is the same as the area to the left of 2 = = -2.95. We have the 11.86 following equality. P ( x < 465) = P(Z < -2.95) = .0016. This result suggests that either we have a highly unusual sample or the government claim is incorrect. Figures 7-2 and 7-3 illustrate the solution graphically. Fig. 7-2 Fig. 7-3 In Example 7.13, the value 465 was transformed to a z value by subtracting the population mean 500 from 465, and then dividing by the standard error 11.86. The equation for transforming a sample mean to a z value is shown in formula (7.6): EXAMPLE 7.14 A machine fills containers of coffee labeled as 113 grams. Because of machine variability, the amount per container is normally distributed with p = 113 and 0 = 1. Each day, 4 of the containers are selected randomly and the mean amount in the 4 containers is determined. If the mean amount in the four containers is either less than 112 grams or greater than 114 grams, the machine is stopped and adjusted. Since the distribution of fills is normally distributed, the sample mean is normally distributed even for a sample as small as four. The mean of the sample mean is px = 113 and the standard error is ( T ~ = .5. The machine is adjusted if X < I12 or if ?I > 114. The probability the machine is adjusted is equal to the sum P(Y < 112) + P(X > 114) since we add probabilities of events connected by the word or. To evaluate these probabilities, we use formula (7.6)to express the events involving X in terms of z as follows: X-113 < 112-113 5 5 P(X < 112)=P(- ) = P(z <-2.00) = .0228 X-113 114-113 .S 5 P(Y > 114)= P(- > ) = P(z >2.00)= .0228 The probability that the machine is adjusted is 2 x .0228 = .0456. It is seen that if this sampling technique is used to monitor this process, there is a 4.56% chance that the machine will be adjusted even though it is maintaining an average fill equal to 113grams.
  • 157.
    148 SAMPLING DISTRIBUTIONS Themepark B: Magic Kingdom (Disney World) C: Epcot (Disney World) D: DisneyMGM Studios (Disney World) E: Universal Studios Florida (Orlando) A: Disneyland (Anaheim) [CHAP. 7 Attendance exceeds 10 million Yes yes yes no no SAMPLING DISTRIBUTION OF THE SAMPLE PROPORTION Sample A, B, C, D A, B, C, E A, B, D, E A, C, D, E B, C, D, E A populcition proportion is the proportion or percentage of a population that has a specified characteristic or attribute. The population proportion is represented by the letter p. The sarnple proportiori is the proportion or percentage in the sample that has a specified characteristic or attribute. The sample proportion is represented by either the symbol 6 or F. We shall use the symbol to represent the sample proportion in this text. Exceeds 10 million Sample proportion Probabi1ity Y. Y I Y1 n .75 .2 Y. Y I Y, n .75 .2 yl y, n, n .so .2 y, y, n*n so .2 y, y, n, n .so .2 EXAMPLE 7.15 The nation’s work force at a given time is 133,018,000and the number of unemployed is 7,355,000 133,O18,000 7,355,000. The proportion unemployed is p = = .055 and the jobless rate is 5.5%. A sample of size 65,000 is selected from the nation’s work force and 3,900 are unemployed in the sample. The sample proportion unemployed is j7 = -- - .06 and the samplejobless rate is 6%. 3,900 65,000 The population proportion p is a parumeter measured on the complete population and is constant over some time interval. The sample proportion jT is a stutistic measured on a sample and is considered to be a random variable whose value is dependent on the sample chosen. The set of all possible values of a sample proportion along with the probabilities corresponding to those values is called the sutnplitig distribirtion c$ the sample proportiori. EXAMPLE 7.16 According to recent data, the nation’s five most popular theme parks are shown in Table 7.8. The table gives the name of the theme park and indicates whether or not the attendance exceeds 10 million per year. For this population of size N = 5, the proportion of theme parks with attendance exceeding 10million is p = = .60 or 60%. There are 5 samples of siLe 4 possible when selected from this population. These samples, the theme parks exceeding 10 million (yes or no), the sample proportion, and the probability associated with the sample proportion, are given in Table 7-9. For each sample, the sampling error, Ip - jT 1, is either .I0 or .15. From Table 7.9, it is seen that the probability associated with sample proportion .75 is .2 + .2 = .4 and the probability associated with sample proportion .SO is .2 + .2 + .2 = .6.The sampling distribution for i7 is given in Table 7.10.
  • 158.
    CHAP. 71 SAMPLINGDISTRIBUTIONS 149 Table 7.10 For larger populations and samples, the sampling distribution of the sample proportion is more difficult to construct, but the technique is the same. MEAN AND STANDARDDEVIATION OF THE SAMPLEPROPORTION Since the sample proportion has a distribution, it is a random variable and has a mean and a standard deviation. The mean of the sample proportion is represented by the symbol pF and the standard deviation of the sample proportion is represented by q. The standard deviation of the sample proportion is called the stmdard error of the proportion. Example 7.17 illustrates how to find the mean of the sample proportion and the standard error of the proportion. EXAMPLE 7.17 For the sampling distribution of the sample proportion shown in Table 7.10, the mean is pLp = C @'@) = .5 x .6 + .75 x .4 = 0.6 The variance of the sample proportion is 6 = C P2P@) - pi = .25 x .6 + S625 x .4- .36 = 0.015 The standard error of the proportion, OF, is equal to JE = 0.122. The relationship between the mean of the sample proportion and the population proportion is expressed by vT,= p The standard error of the sample proportion is related size, and the sample size by (7.7) to the population proportion, the population The term ,/%is called thefinite populatioiz correctiorzfactor. If the sample size n is less than 5% of the population size, i.e., n < .05N, the finite population correction factor is very near one and is omitted in formula (7.8). If n < .OSN, the starzdard error o f theproportion is given by formula (7.9),where q = 1 - p. .=E EXAMPLE 7.18 In Example 7.16, dealing with the five most popular theme parks, it was shown that p = 0.6 and in Example 7.17, it was shown that pii = 0.6 illustrating formula (7.7). In Example 7.17 it was also shown that q, = JG' = 0.122. To illustrate formula (7.8),note that
  • 159.
    150 SAMPLING DISTRIBUTIONS[CHAP. 7 EXAMPLE 7.19 Suppose 80% of all companies use e-mail. In a survey of 100companies, the standard error of‘the sample proportion using e-mail is q j = - - - /= ~ - - .04. The finite population correction factor is not needed since there is a very large number of companies, and it is reasonable to assume that R < .05N. SHAPE OF THE SAMPLINGDISTRIBUTION OF THE SAMPLEPROPORTION AND THE CENTRAL LIMIT THEOREM When the sample size satisfies the inequalities np > 5 and nq > 5, the sampling distribution of the sample proportion is normally distributed with mean equal to p and standard error 9. This result is sometimes referred to as the central limit theoremfor the sample proportion. This result is illustrated in Fig. 7-4. EXAMPLE 7.20 Approximately 20% of the adults 25 and older have a bachelor’s degree in the United States. If a large number of samples of adults 25 and older, each of size 100, were taken across the country, then the sample proportions having a bachelor’s degree would vary from sample to sample. The distribution of sample proportions would be normally distributed with a mean equal to 20% and a standard error equal to 4%.According to the empirical rule, approximately 68%of the sample proportions would fall between 16%and 24%, approximately 95% of the sample proportions would fall between 12% and 28%, and approximately 99.7% would fall between 8% and 32%. The sample proportion distribution may be assumed to be normally distributed since np = 20 and nq = 80 are both greater than 5. P Fig. 7-4 APPLICATIONS OF THE SAMPLING DISTRIBUTION OF THE SAMPLEPROPORTION The theory underlying the sample proportion is utilized in numerous statistical applications. The margin of error, control chart limits, and many other useful statistical techniques make use of the sampling distribution theory connected with the sample proportion.
  • 160.
    CHAP. 71 SAMPLINGDISTRIBUTIONS 151 EXAMPLE 7.21 It is estimated that 42% of women and 36% of men ages 45 to 54 are overweight, that is, 20% over their desirable weight. The probability that one-half or more in a sample of 50 women ages 45 to 54 are overweight is expressed as P ( p 2 S).The distribution of the sample proportion is normal since np = 50 x .42 = 21 > 5 and nq = 50 x .58 = 29 > 5. The mean of the distribution is .42 and the standard error is = = .07.The probability, P ( p 2 3,is shown as the shaded area in Fig. 7-5. = li*42:0*58 Fig. 7-5 To find the area under the normal curve shown in Fig. 7-5, it is necessary to transform the value .5 to a standard normal value. The value for z is z = ~ = 1.14. The area to the right of j7 = .5 is equal to the area to the right of z = 1.14 and is given as follows. 5 - .42 .07 P(p5: .5) = P(z > 1.14)= .5 - .3729= .1272 In Example 7.21, the p value equal to .5 was transformed to a z value by subtracting the mean value of p from .5 and then dividing the result by the standard error of the proportion. The equation for transforming a sample proportion value to a z value is given by (7.10) EXAMPLE 7.22 It is estimated that 1 out of 5 individuals over 65 have Parkinsonism, that is, signs of Parkinson’s disease. The probability that 15% or less in a sample of 100 such individuals have Parkinsonism is represented as P(p 5 15%).Since np = 100x .20 = 20 > 5 and nq = 100x -80= 80 > 5, it may be assumed that - p has a normal distribution. The mean of j7 is 20% and the standard error is 0 : = E= pg = 4%. Formula (7.10)is used to find an event involving z which is equivalent to the event jT 5 15. The event jT 5 15 is equivalent to the event z = -<- = -1.25, and therefore we have P ( p 5 15%) = P(z < -1.25) = j7-20 15-20 4 4 .5 - .3944 = .1056.
  • 161.
    152 1. Alabama 2. Alaska 3.Arizona 4. Arkansas 5. California SAMPLING DISTRIBUTIONS 18. Kentucky 35. North Dakota 19.Louisiana 36. Ohio 20. Maine 37. Oklahoma 2I. Maryland 38. Oregon 22. Massachusetts 39. Pennsvlvania [CHAP. 7 6. Colorado 7. Connecticut 8. Delaware Solved Problems 23. Michigan 40. Rhode Island 24. Minnesota 25. MississiDDi 42. South Dakota 4 1. South Carolina SIMPLE RANDOM SAMPLING 9. District of Columbia 26. Missouri 10.Florida 27. Montana 11. Georgia 28. Nebraska 12. Hawaii 29. Nevada 13. Idaho 30. New Hampshire 14. Illinois 15. Indiana 32. New Mexico 3 1. New Jersey 7.1 The top nine medical doctor specialties in terms of median income are as follows: radiology, surgery, anesthesiology, obstetrics/gynecology, pathology, internal medicine, psychiatry, family practice, and pediatrics. How many simple random samples of size 4, chosen without replacement, are possible when selected from the population consisting of these nine specialties? What is the probability associated with each possible simple random sample of size 4 from the population consisting of these nine specialties? 43. Tennessee 44. Texas 45. Utah 46. Vermont 47. Virginia 48. Washington 49. West Virginia Arts: The number of possible samples is equal to /49)= 2= 126. Each possible sample has proba- bility -!- = .00794 of being selected. I26 16. Iowa 17. Kansas 7 . 2 In problem 7.1, how many of the 126 possible samples of size 4 would include the specialty pediatrics? Y 33. New York 50. Wisconsin 34. North Carolina 51. Wyoming Ans: The number of samples of size 4 which include the specialtypediatrics would equal the number of ways the other three specialties in the sample could be selected from the other eight specialties in the population. The number of ways to select 3 from 8 is = -= 56. One such sample is the sample consisting of the following: (radiology, surgery, pathology, pediatrics). There are 55 other such samples containing the specialtypediatrics. [:I : : ! USING RANDOM NUMBER TABLES Table 7.11 7.3 Use Table 7.I of this chapter to obtain a sample of size 5 from the population consisting of the 50 states and the District of Columbia. Assume that the 51 members of the population are
  • 162.
    CHAP. 71 SAMPLINGDISTRIBUTIONS 153 listed in alphabetical as shown in Table 7.11. Start with the two digits in columns 11 and 12, and line 1 of Table 7.1. Read down these two columns until five unique states have been selected. Give the numbers and the selected states. Ans: The random numbers are 44, 12, 24, 10, and 07. From an alphabetical listing of the 50 states and the District of Columbia in Table 7.11, the following sample is obtained: (Connecticut, Florida, Hawaii, Minnesota, and Texas) 7.4 The random numbers in Table 7.1 are used to select 15 faculty members from an alphabetical listing of the 425 faculty members at a university. The total faculty listing goes from 001: Lawrence Allison to 425: Carol Ziebarth. A random start position is determined to be columns 31 through 33 and line 1. Read down these columns until you reach the end. Then go to columns 36 through 38, line 1 and read down these columns until you reach the end. If necessary, go to columns 41 through 43 and line 1 and read down these columns until you obtain 15 unique numbers. Give the 15 numbers which determine the 15 selected faculty members. Ans: The numbers, in the order obtained from Table 7.1 are: 345, 254, 160, 105, 283, 026, 099, 385, 157,393,013, 126, 110,004, and 279. USING THE COMPUTER TO OBTAIN A SIMPLERANDOM SAMPLE 7.5 In reference to problem 7.3, use Minitab to obtain a sample of 5 of the 50 states plus the District of Columbia. Ans: MTB > set cl DATA> 1 5 1 DATA > end MTB > sample 5 cl c2 MTB > print c2 Data Display c 2 3 21 38 44 36 From the alphabetical listing of the states shown in Table 7.11,the following sample is obtained: (Arizona, Maryland, Ohio, Oregon, and Texas) Texas is the only state which is common to the samples obtained in problems 7.3 and 7.5. 7.6 In reference to problem 7.4, use Minitab to obtain a sample of 15 of the 425 faculty members. Ans: MTB > set cl DATA > 1:425 DATA >end MTB > sample 25 cl c2 MTB >sort c2 put into c3 MTB >print c3 Data Display c3 58 73 111 112 187 228 239 283 285 319 322 325 364 384 394 The faculty member corresponding to number 283 is the only one that is common to the samples obtained in problems 7.4 and 7.6.
  • 163.
    154 SAMPLING DISTRIBUTIONS City A:New York C: Los Angeles E: Atlanta B: Washington, D.C. D: Chicago [CHAP. 7 Number of African-American-owned businesses, in thousands 42 39 36 33 30 SYSTEMATIC RANDOM SAMPLING 7.7 In reference to problem 7.3, choose a systematic sample of 5 of the 50 states plus the District of Columbia. Ans: Divide 51 by 5 to obtain 10.2. Round 10.2down to 10 and select a number randomly between 1 and 10. Suppose the number 7 is obtained. The other 4 members of the sample are 17, 27, 37, and 47. From the alphabetical listing of the states given in Table 7.11,the following sample is selected: (Connecticut, Kansas, Montana, Oklahoma, Virginia) CLUSTER SAMPLING 7.8 A large school district wishes to obtain a measure of the mathematical competency of the junior high students in the district. The district consists of 40 different junior high schools. Describe how you could use cluster sampling to obtain a measure of the mathematical competencyof thejunior high school students in the district. Ans: Randomly select a small number, say 5, of the 40 junior high schools. Then administer a test of mathematical competency to all students in the 5 selected schools. The test results from the 5 schools constitute a cluster sample. STRATIFIED SAMPLING 7.9 Refer to problem 7.8. Explain how you could use stratified sampling to determine the mathematical competency of thejunior high students in the district. Ans: Consider each of the 40 junior high schools as a stratum. Randomly select a sample from each junior high proportional to the number of students in that junior high school and administer the mathematical competency test to the selected students. Note that even though stratified sampling could be used, cluster sampling as described in problem 7.8 would be easier to administer. SAMPLING DISTRIBUTION OF THE SAMPLING MEAN 7.10 The five cities with the most African-American-ownedbusinesses was given in Table 7.2 and is reproduced below. List all samples of size 4 and find the mean of each sample. Also, construct the sampling distribution of the sample mean.
  • 164.
    CHAP. 71 SAMPLINGDISTRIBUTIONS 155 Number of businesses Samples in the samples Sample mean A, B, C, D 42, 39, 36, 33 37.50 A, B,c,E 42, 39, 36, 30 36.75 A, B, D, E 42, 39, 33, 30 36.00 A, C , D, E 42, 36, 33, 30 35.25 Ans: Table 7.12 lists the 5 samples of size 4, along with the sample values for each sample, and the mean of each sample, and Table 7.13 gives the sampling distribution of the sample mean. f Table 7.13 I ii- I 34.50 35.25 36.00 36.75 37.50 I IP(T) I .2 .2 .2 .2 .2 1 7.11 Consider sampling from an infinite population. Suppose the number of firearms of any type per household in this population is distributed uniformly with 25% having no firearms, 25% having exactly one, 25% having exactly two, and 25% having exactly three. If X represents the number of firearms per household, then X has the distribution shown in Table 7.14. Table 7.14 X 1 0 1 2 3 1 I P(x) I .25 .25 .25 .25 I Give the sampling distribution of the sample mean for all samples of size two from this population. Table 7.15 Samde mean 0 0.5 1.o 1.5 0.5 1.o 1.5 2.0 1.o 1.5 2.0 2.5 I .5 2.0 2.5 3.O Probability .0625 .0625 .0625 .0625 ,0625 .0625 .0625 .0625 ,0625 .0625 .0625 .0625 .0625 .0625 .0625 .0625 Ans: Suppose x1 and x2 represents the two possible values for the two households sampled. Table 7.15 gives the possible values for x1 and x2, the mean of the sample values, and the probability associated with each pair of sample values. Since there are 16different possible sample pairs each one has probability & = .0625. The possible values for the sample mean are 0, 0.5, 1.O, 1.5, 2.0,
  • 165.
    156 - , x P(X) SAMPLING DISTRIBUTIONS[CHAP. 7 0 0.5 1.o 1.5 2.0 2.5 3.0 .0625 .I250 .1875 ,2500 .I875 .1250 ,0625 2.5, and 3.0. The probability associated with each sample mean value can be determined from Table 7.15. For example, the sample mean equals 1.5 for four different pairs and the probability for the value 1.5 is obtained by adding the probabilities associated with (0, 3), ( I . 2), (2, I), and (3, 0). That is, P(X = 1.5) = .0625 + .0625 + .0625 + .0625 = .25. The sampling distribution is shown in Table 7.16. Table 7.16 Samples Sample mean Sampling error A, B, C. D 37.50 1S O A, B, C, E 36.75 0.75 A, B, D, E 36.00 0 A, C, D, E 35.25 0.75 34.50 1.so _i B, C, D, E SAMPLING ERROR 7.12 Give the sampling error associated with each of the samples in problem 7.10. Ans: The mean number of African-American-owned businesses for the five cities is 36,000. There are five different possible samples as shown in Table 7.17. Each of the sample means shown in Table 7.17 is an estimate of the population mean. The sampling error is IX - 36 I,and is shown for each sample in the last column in Table 7.17. 7.13 Give the minimum and the maximum sampling error encountered in the sampling described in problem 7.1I and the probability associated with the minimum and maximum sampling error. Ans: The population described in problem 7.11 has a uniform distribution and the mean is as follows: p = 0 x .25 + 1 x .25 + 2 x .25 + 3 x .25 = 1.5.The minimum error is 0 and occurs when X = 1.5. The probability of a sampling error equal to 0 is equal to the probability that the sample mean is equal to 1.5 which is .25 as shown in Table 7.16. The maximum sampling error is 1.5 and occurs when X = 0 or X = 3.0. The probability associated with the maximum sampling error is .0625 + .0625 = .125. MEAN AND STANDARD DEVIATION OF THE SAMPLE MEAN 7.14 Find the mean and variance of the sampling distribution of the sample mean derived in problem 7.10 and given in Table 7.13. Verify that formulas (7.2) and (7.3) hold for this problem. Ans: Table 7.13 is reprinted below for ease of reference. The mean of the sample mean is pK= C XP(X) = 34.50 x .2 + 35.25 x .2 + 36.00 x .2 + 36.75 x .2 + 37.50 x .2 = 36 The variance of the sample mean is 0; = C T2P(T)- p : . C Y2P(Y)= 1190.25x .2 + 1242.5625x .2 + 1296x .2 + 1350.5625x .2 + 1406.25 x .2 = 1297.I25 and p; = 1296,and therefore 0 : = 1297.125- 1296= I . 125.
  • 166.
    CHAP. 71 - , x P(Z) SAMPLINGDISTRIBUTIONS 0 0.5 1.0 1.5 2.0 2.5 3.0 .0625 .1250 .1875 .2500 .I875 .I250 .0625 Table 7 . 1 3 X P(x) I P(X) I .2 .2 .2 .2 .2 I 0 I 2 3 .25 .25 .25 .25 157 In Example 7.7 it is shown that p = 36 and o2= 18. Formula (7.2)is verified since p = psr= 36. -)(- - 2 To verify formula (7.3),note that -x- - - - 1.125= 0,. C ? N-n 18 5-4 n N-1 4 5-1 7.15 Find the mean and variance of the sampling distribution of the sample mean derived in problem 7.1 I and given in Table 7.16. Verify that formulas (7.2) and (7.3) hold for this problem. Am: Table 7.16 is reprinted below for ease of reference. The mean of the sample mean is & =cxP(X) = O X .0625 +0.5 x .125 + 1 x .I875 + 1.5 x .25 + 2 x .1875 +2.5 x .125 + 3 x .0625 = 1S. The variance of the sample mean is 0 : = E X 'P( -jl ) - px. 2 CX2P(sI) = 0 ~ . 0 6 2 5 + . 2 5 ~ . 1 2 5 + I ~ . 1 8 7 5 + 2 . 2 5 ~ . 2 5 + 4 ~ . 1 8 7 5 + 6 . 2 5 ~ . 1 2 5 + 9 ~ . 0 6 2 5 = 2.875 and pg = 2.25, and therefore02 = 2.875 - 2.25 = 0.625. Table 7.16 In order to verify formulas (7.2) and (7.3),we need to find the values for p and d in problem 7.11.The population distribution was given in table 7.14 and is shown below. The population mean is p = C xP(x) = 0 x .25 + I x .25 + 2 x .25 + 3 x .25 = 1.5, and the variance is given by 0 ' = E x2P(x)-p2 = 0 x .25 + 1 x .25 +4 x .25 +9 x .25 - 2.25 = 1.25. Formula (7.2) is verified, since p = pLsr = 1.5. To verify formula (7.3),note that since the population is infinite, the finite population correction factor is not needed and - = -- - 0.625 = 0 : . Anytime the population is infinite, the finite population correction factor is omitted and O2 1.25 n 2 0' formula (7.3)simplifies to 0 : = -. n SHAPE OF THE SAMPLING DISTRIBUTION OF THE MEAN AND THE CENTRAL LIMIT THEOREM 7.16 The portfolios of wealthy people over the age of 50 produce yearly retirement incomes which are normally distributed with mean equal to $125,000and standard deviation equal to $25,000. Describe the distribution of the means of samples of size 16from this population.
  • 167.
    158 SAMPLING DISTRIBUTIONS[CHAP. 7 Am: Since the population of yearly retirement incomes are normally distributed, the distribution of sample means will be normally distributed for any sample size. The distribution of sample means G 25,000 has mean $125,000and standard error q = -= - - - $6,250. AJ16 7.17 The mean age of nonresidential buildings is 30 years and the standard deviation of the ages of nonresidential buildings is 5 years. The distribution of the ages is not normally distributed. Describe the distribution of the means of samples from the distribution of ages of non- residential buildings for sample sizes n = 10and n = 50. Am: The distribution type for the sample mean is unknown for small samples, i.e., n < 30. Therefore we cannot say anything about the distribution of X when n = 10. For large samples, i.e., 230,the central limit theorem assures us that T has a normal distribution. Therefore, for n = 50, the sample mean has a normal distribution with mean = 30 years and standard error = -- -- - 0.707 years. 0 5 J;;-Jso APPLICATIONS OF THE SAMPLING DISTRIBUTION OF THE SAMPLE MEAN 7.18 In problem 7.16, find the probability of selecting a sample of 16 wealthy individuals whose portfolios produce a mean retirement income exceeding $135,000. Ans: We are asked to determine the probability that X exceeds $135,000.From problem 7.16, we know that X has a normal distribution with mean $125,000 and standard error $6,250. The event x > 135,000 is equivalent to the event z = > = 1.60. Since the event X > 135,000 is equivalent to the event z > 1.60,the two events have equal probabilities. That is, P(X > 135,000) = P(Z > 1.60). Using the standard normal distribution table, the probability P(z > 1.60)is equal to .5 - .4452 = .0548. That is, P(X > 135,000)= P(Z > 1.60) = .0548. - X - 125,000 135,000 - 125,000 6,250 6,250 7.19 In problem 7.17, determine the probability that a random sample of 50 nonresidential buildings will have a mean age of 27.5 years or less. Am: We are asked to determine P(X < 27.5). From problem 7.17, we know that X has a normal distribution with mean 30 years and standard error 0.707 years. Since the event 51 < 27.5 is X-30 275-30 equivalent to the event z = -< = -3.54, we have P(X < 27.5) = P(z < -3.54). .707 .707 From the standard normal distribution table, the probability is less than .001. SAMPLING DISTRIBUTIONOF THE SAMPLEPROPORTION 7.20 Approximately 8.8 million of the 12.8 million individuals receiving Aid to Families with Dependent children are 18 or younger. What proportion of the individuals receiving such aid are 18or younger?
  • 168.
    CHAP. 71 SAMPLINGDISTRIBUTIONS 159 State A: Alabama B: California C: Colorado D: Kentucky E: Nebraska Ans: The population consists of all individuals receiving Aid to Families with Dependent Children. The proportion having the characteristic that an individual receiving such aid is 18 or younger is At least one woman on death row Yes Yes no no no 8.8 12.8 represented by p and is equal to -= 0.69. 7.21 Table 7.18 describes a population consisting of five states and indicates whether or not there is at least one woman on death row for that state. For this population 40%of the states have at least one woman on death row, that is, p = 0.4. List all samples of size 2 from this population and find the proportion having at least one woman on death row for each sample. Use this listing to derive the sampling distribution for the sample proportion. Ans: Table 7.19 lists all possible samples of size 2, indicates whether or not the states in the sample have at least one woman on death row, gives the sample proportion for each sample, and gives the probability for each sample. From Table 7.19, it is seen that takes on the values 0, .5, and 1 with probabilities .3, .6, and .I, respectively. The probability that iT = 0 is obtained by adding the probabilities for the samples (C, D), (C, E), and (D, E) sincea = 0 if any one of these samples is selected. The other two probabilities are obtained similarly.Table 7.20 gives the sampling distribution forF. Table 7.20 7 1 MEAN AND STANDARD DEVIATION OF THE SAMPLE PROPORTION 7.22 Find the mean and variance of the sample proportion in problem 7.21 and verify that formula (7.7)and formula (7.8)are satisfied.
  • 169.
    160 SAMPLING DISTRIBUTIONS[CHAP.7 7.23 Ans: The mean of the sample proportion is pF= cp(p)= 0 x .3 + .5 x .6 + 1 x .1 = .4, and the variance of the sample proportion is 0;= cD2 P(p) - pi = 0 x .3 + .25 x .6 + 1 x .1 - (.4)2 = .09.Formula (7.7) is satisfied since pF= p = .4. To verify formula (7.8), we need to show that q ,= J ? x { ~ . Since the variance of the sample proportion is .09, the standard deviation is the square root of .09 or .3. Therefore, to verify formula (7.8), all we need do is show that the right-hand side of the equation equals .3. Suppose 5% of all adults in America have 10 or more credit cards. Find the standard error of the sample proportion in a sampleof 1,OOO American adults who have 10or more credit cards. Ans: Since the sample size is less than 5% of the population size, the standard error is q - SHAPE OF THE SAMPLING DISTRIBUTION OF THE SAMPLE PROPORTION AND THE CENTRALLIMIT THEOREM 7.24 A sample of size n is selected from a large population. The proportion in the population with a specified characteristic is p. The proportion in the sample with the specified characteristic is p.In which of the following does (a) n = 20, p = .9 have a normal distribution? (c) n = 100, p = .97 (b) n = 15, p = .4 (6) n = IOOO, p = .01 np = 18, and nq = 2. Since both np and nq do not exceed 5, we are not sure that the distribution of jT is normal. np = 6, and nq =9. Since both np and nq exceed 5, the distribution of F is normal. np = 97, and nq = 3. Since both np and nq do not exceed 5 , we are not sure that the distribution of jT is normal. np = 10,and nq = 990. Since both np and nq exceed 5, the distribution of is normal. APPLICATIONS OF THE SAMPLING DISTRIBUTION OF THE SAMPLEPROPORTION 7.25 Approximately 15% of the population is left-handed. What is the probability that in a sample of 50 randomly chosen individuals, 30% or more in the sample will be left-handed? That is, what is the probability of finding 15or more left-handersin the 50? Ans: The sample proportion, jT,has a normal distribution since np = 50 x .15 = 7.5 > 5 and nq = 50 x .85 = 42.5 > 5. The mean of the sample proportion is 15%and the standard error is oF= = 15x85 = 5.05%. To find the probability that F 2 30%,we first transform jT to a standard lio
  • 170.
    CHAP. 71 SAMPLINGDISTRIBUTIONS 161 - normal by the transormation z = - ’-”as follows. The inequality 2 30 is equivalent to z = O F p-15 30-15 2 -= 2.97. Because F 2 30 is equivalent to z 2 2.97 we have P(F 2 30) = 5.05 5.05 P(z 22.97) = .5 - .4985 = .0015.That is, 15 or more left-handerswill be found in a sample of 50 individualsonly about 0.2% of the time. 7.26 Suppose 70% of the population support the ban on assault weapons. What is the probability that between 65% and 75% in a poll of 100 individuals will support the ban on assault weapons? Ans: We are asked to find P(65 < j j < 75). The sample proportion p has a normal distribution with mean 70% and standard error air = = i F= 4.58%. The event 65 < jT < 75 is equivalentto the event - <-<- or -1.09 c z c 1.09and since these events are equivalent, they have equal probabilities.That is, P(65 c jT c 75) = P(-1.09 c z < 1.09) = 2 x .3621 = .7242. 65-70 p-70 75-70 458 4.58 458 Supplementary Problems SIMPLE RANDOM SAMPLING 7.27 7.28 Rather than have all 25 students in her Statistics class complete a teacher evaluation form, Mrs. Jones decides to randomly select three students and have the department chairman interview the three concerning her teaching after the course grade has been given. How many different samples of size 3 can be selected? Ans. 2,300 USA Today lists the 1900 most active New York stock exchange issues. How many samplesof size three are possible when selected from these 1900stock issues? Ans. 1,141,362,300 USING RANDOM NUMBER TABLES 7.29 In a table of random numbers such as Table 7.1 what relative frequency would you expect for each of the digits 0, 1,2, 3,4, 5,6, 7, 8, and 9? Ans. 0.1 7.30 The 100 U.S. Senators are listed in alphabeticalorder and then numbered as 00, 01, .. .,99. Use Table 7.1 to select 10 of the senators. Start with the two digits in columns 31 and 32 and row 6. The first selected senator is numbered 76. Reading down the two columns from the number 76, what are the other 9 two-digit numbers of the other selected senators? Ans. 59,78,44,93,69,25,43,50, and 45
  • 171.
    162 SAMPLING DISTRIBUTIONS[CHAP. 7 USING THE COMPUTER TO OBTAIN A SIMPLE RANDOM SAMPLE 7.31 Use Minitab to select 25 of the 1900 stock issues discussed in problem 7.28. Give the numbers in ascending order of the 25 selected stock issues. Assume the stock issues are numbered from 1 to 1900. Ans. MTB > set cl DATA > 1:1900 DATA > end MTB > sample 25 from cl put into c2 MTB > sort c2 put into c3 MTB > print c3 Data Display c 3 26 219 238 288 328 423 766 785 943 1006 1187 1197 1234 1238 1257 1313 1532 1562 1613 1639 1728 1784 1788 1798 1870 7.32 Use Minitab to select 50 of the 3143 counties in the United States. Give the numbers in ascending order of the 3143. Ans. 50 selected counties. Assume the counties are in alphabetical order and are numbered from 1 to MTB > set cl DATA > 1:3143 DATA > end MTB > sample 50 from cl put into c2 MTB > sort c2 put into c3 MTB > print c3 Data Display c 3 33 47 53 139 265 312 321 343 444 519 599 652 658 764 789 829 949 964 1063 1134 1209 1300 1463 1567 1630 1694 1706 1818 1848 2021 2048 2138 2143 2270 2285 2306 2311 2408 2414 2463 2487 2513 2658 2701 2718 2763 2862 2892 2917 2975 SYSTEMATICRANDOM SAMPLING 7.33 Describe how the state patrol might obtain a systematic random sample of the speeds of vehicles along a stretch of interstate 80. Ans. Use a radar unit to measure the speeds of systematically chosen vehicles along the stretch of interstate 80. For example, measure the speed of every tenth vehicle. CLUSTER SAMPLING 7.34 A particular city is composed of 850 blocks and each block contains approximately 20 homes. Fifteen of the 850 blocks are randomly selected and each household on the selected block is administered a survey concerning issues of interest to the city council. How large is the population? How large is the sample? What type of sampling is being used? Ans. The population consists of 17,000 households. The sample consists of 300 households. Cluster sampling is being used.
  • 172.
    CHAP. 71 - X P(X) SAMPLING DISTRIBUTIONS 0.5 1 1.5 2 .01 .12 .42 .36 .09 163 STRATIFIED SAMPLING 7.35 A drug store chain has stores located in 5 cities as follows: 20 stores in Los Angeles, 40 stores in New York, 20 stores in Seattle, 10 stores in Omaha, and 30 stores in Chicago. In order to estimate pharmacy sales, the following number of stores are randomly selected from the 5 cities: 4 from Los Angeles, 8 from New York, 4 from Seattle, 2 from Omaha, and 6 from Chicago. What type of sampling is being used? Ans. stratified sampling SAMPLING DISTRIBUTION OF THE SAMPLING MEAN 7.36 7.37 The export value in billions of dollars for four American cities for a recent year are as follows: A: Detroit, 28; B: New York, 24; C: Los Angeles, 22; and D: Seattle, 20. If all possible samples of size 2 are selected from this population of four cities and the sample mean value of exports computed for each sample, give the sampling distribution of the sample mean. Ans. P(X)= :,where X =21, 22, 23, 24, 25, or 26. Consider a large population of households. Ten percent of the households have no home computer, 60 percent of the households have exactly one home computer, and 30 percent of the households have exactly two home computers. Construct the sampling distribution for the mean of all possible samples of size 2. Ans. SAMPLING ERROR 7.38 7.39 In reference to problem 7.36, if the mean of a sample of two cities is used to estimate the mean export value of the four cities, what are the minimum and maximum values for the sampling error? Ans. minimum sampling error = .5 maximum sampling error = 2.5 In reference to problem 7.37, what are the possible sampling error values associated with samples of two households used to estimate the mean number of home computers per household for the population? What is the most likely value for the samplingerror? Am. 2, .3, .7, .8, and 1.2 The most likely value is .2. The probability that the sampling error equals .2 is .42. MEAN AND STANDARD DEVIATION OF THE SAMPLE MEAN 7.40 Find the mean and variance of the sampling distribution in problem 7.36. Verify that formulas (7.2)and (7.3)hold for this sampling distribution. Ans. p = 23.5 02= 8.75 pT = 23.5 0 : = 2.917 7.41 Find the mean and variance of the sampling distribution in problem 7.37. Verify that formulas (7.2) and (7.3)hold for this sampling distribution.
  • 173.
    164 SAMPLING DISTRIBUTIONS[CHAP. 7 2 Ans. p=1.2 02=0.36 p x = 1 . 2 0, =0.18 d psi= p d .36 n 2 Since the population is infinite, formula (7.3)becomes a : = - n -- - - = 0;. SHAPE OF THE SAMPLING DISTRIBUTION OF THE MEAN AND THE CENTRAL LIMIT THEOREM 7.42 7.43 For cities of 100,000or more, the number of violent crimes per 1,000residents is normally distributed with mean equal to 30 and standard deviation equal to 7. Describe the sampling distribution of means of samples of size 4 from such cities. Ans. The means of samples of size 4 will be normally distributed with mean equal to 30 and standard error equal to 3.5. For cities of 100,000 or more, the mean total crime rate per 1,000 residents is 95 and the standard deviation of the total crime rate per 1,0oO is 15. The distribution of the total crime rate per 1,OOO residents for cities of 100,000or more is not normally distributed. The distribution is skewed to the right. Describe the sampling distribution of the means of samples of sizes 10and 50 from such cities. Am. For samples of size 50, the central limit theorem assures us that the distribution of sample means is normally distributed with mean equal to 95 and standard error equal to 2.12. For samples of size 10from a nonnormal distribution, the distribution form of the sample means is unknown. APPLICATIONS OF THE SAMPLING DISTRIBUTIONOF THE SAMPLE MEAN 7.44 In problem 7.42, find the probability of selecting four cities of population 100,OOO or more whose mean number of violent crimes per 1,000residents exceeds 40 violent crimes per 1,000 residents. Ans. 0.0021 7.45 The mean number of bumped passengers per 10,000 passengers per day is 1.35 and the standard deviation is 0.25. For a random selection of 40 days, what is the probability that the mean number of bumped passengers per 1 0 , O O O passengers for the 40 days will be between 1.25and 1S O ? A m 0.9938 SAMPLING DISTRIBUTION OF THE SAMPLE PROPORTION 7.46 The world output of bicycles in 1995was 114 million. China produced 41 million bicycles in 1995. What proportion of the world's new bicycles in 1995 were produced by China? Ans. p = 0.36 7.47 A company has 38,000 employees and the proportion of the employees who have a college degree equals 0.25. In a sample of 400 of the employees, 30 percent have a college degree. How many of the company employees have a college degree'?How many in the sample have a college degree? Ans. 9,500 of the company employees have a college degree and 120 in the sample have a degree.
  • 174.
    CHAP. 71 SAMPLINGDISTRIBUTIONS 165 MEAN AND STANDARDDEVIATION OF THE SAMPLEPROPORTION 7.48 One percent of the vitamin C tablets produced by an industrial process are broken. The process fills containers with 1,000tablets each. At regular intervals, a container is selected and 100 of the 1,000 tablets are inspected for broken tablets. What is the mean value and standard deviation of j7,where j 7 represents the proportion of broken tablets in the samples of lo0 selected tablets? 1,000- I Ans. The mean value of iY is .01 and the standard deviation of is ,0094. 7.49 A survey reported that 20% of pregnant women smoke, 19% drink ,and 13% use crack cocaine or other drugs. Assuming the survey results are correct, what is the mean and standard deviation of F,where jY is the proportion of smokers in samples of 300 pregnant women? Am. The mean value of i7 is 20% and the standard deviation of = 2.31%. SHAPE OF THE SAMPLING DISTRIBUTION OF THE SAMPLEPROPORTION AND THE CENTRALLIMIT THEOREM 7.51 For a sample of size 50, give the range of population proportion values for which j7 has an approximate normal distribution. Ans. . I < p < .9 APPLICATIONS OF THE SAMPLINGDISTRIBUTION OF THE SAMPLEPROPORTION 7.52 Thirty-five percent of the athletes who participated in the 1996 summer Olympics in Atlanta are female. What is the probability of randomly selecting a sample of 50 of these athletes in which over half of those selected are female? Ans. P(j5 >SO) = P(z > 2.22) = .0132 7.53 It is estimated that 12% of Native Americans have diabetes. What is the probability of randomly selecting 100Native Americans and finding 5 or fewer in the hundred whom are diabetic? Ans. P(p 5.05)= P(z < -2.15) = .0158 7.54 A pair of dice is tossed 180 times. What is the probability that the sum on the faces is equal to 7 on 20% or more of the tosses? Ans. P(F2.20) = .1170
  • 175.
    Chapter 8 Estimationand SampleSize Determination: One Population POINT ESTIMATE Estimation is the assignment of a numerical value to a population parameter or the construction of an interval of numerical values likely to contain a population parameter. The value of a sample statistic assigned to an unknown population parameter is called a point estimate of the parameter. EXAMPLE 8.1 The mean starting salary for 10 randomly selected new graduates with a Masters of Business Administration (MBA) at Fortune 500 companies is found to be $56,000.Fifty-six thousand dollars is a point estimate of the mean starting salary for all new MBA degree graduates at Fortune 500 companies. The median cost for 350 homes selected from across the United States is found to equal $1 15,OOO. The value of the sample median, $1 15,000, is a point estimate of the median cost of a home in the United States. A survey of 950 households finds that 35% have a home computer. Thirty-five percent is a point estimate of the percentage of homes that have a home computer. INTERVAL ESTIMATE In addition to a point estimate, it is desirable to have some idea of the size of the sampling error, that is the difference between the population parameter and the point estimate. By utilizing the standard error of the sample statistic and its sampling distribution, an interval estimate for the population parameter may be developed. A confidence interval is an interval estimate that consists of an interval of numbers obtained from the point estimate of the parameter along with a percentage that specifies how confident we are that the value of the parameter lies in the interval. The confidence percentage is called the confidence level. This chapter is concerned with the techniques for finding confidence intervals for population means and population proportions. CONFIDENCE INTERVAL FOR THE POPULATION MEAN: LARGE SAMPLES According to the central limit theorem, the sample mean, 7, has a normal distribution provided the sample size is 30 or more. Furthermore, the mean of the sample mean equals the mean of the population and the standard error of the mean equals the population standard deviation divided by the square root of the sample size. In Chapter 7, the variable given in formula (7.6) (and reproduced below) was shown to have a standard normal distribution provided n 2 30. Since 95% of the area under the standard normal curve is between z = -1.96 and z = 1.96,and since the variable in formula (7.6) has a standard normal distribution, we have the result shown in formula (8.I ) 166
  • 176.
    CHAP. 81 ESTIMATIONAND SAMPLE SIZE DETERMINATION 167 - P(-1.96 < x--cL< 1.96)= .95 o x - The inequality,-1.96 < - '-' < 1.96,is solved for p and the result is given in formula (8.2). ajr (8.2) Y - 1.96- < p < X + 1.96- The interval given in formula (8.2) is called a 95% confidence interval for the population mean, p. The general form for the interval is shown in formula (8.3),where z represents the proper value from the standard normal distributiontable as determinedby the desired confidencelevel. EXAMPLE 8.2 The mean age of policyholders at Mutual Insurance Company is estimated by sampling the records of 75 policyholders. The standard deviation of ages is known to equal 5.5 years and has not changed over the years. However, it is unknown if the mean age has remained constant. The mean age for the sample of 75 policyholders is 30.5 years. The standard error of the ages is = ---- - .635 years. In order to find a 90%confidence interval for p, it is necessary to find the value of z in formula (8.3)for confidence level equal to 90%.If we let c be the correct value for z, then we are looking for that value of c which satisfies the equation P(-c < z < c) = .90. Or, because of the symmetry of the z curve, we are looking for that value of c which satisfies the equation P(0 < z < c) = .45. From the standard normal distribution table, we find P(0 < z < 1.64) = .4495 and P(0 < z < 1.65) = .4505. The interpolated value for c is 1.645, which we round to 1.65. Figure 8-1 illustrates the confidence level and the corresponding values of z. Now the 90% confidence interval is computed as follows. The lower limit of the interval is X- 1.650~ = 30.5 - 1.65 x .635 = 29.5 years, and the upper limit is x + 1.650~= 31.5. We are 90% confident that the mean age of all 250,000 policyholders is between 29.5 and 31.5 years. It is important to note that p either is or is not between 29.5 and 31.5 years. To say we are 90% confident that the mean age of all policyholders is between 29.5 and 31.5 years means that if this study were conducted a large number of times and a confidence interval were computed each time, then 90% of all the possible confidence intervals would contain the true value of p. 0 55 J;;-J75 - Fig. 8-1 Since it is time consuming to determine the correct value of z in formula (8.3),the values for the most often used confidence levels are given in Table 8.1. They are found in the same manner as illustrated in Example 8.2.
  • 177.
    168 ESTIMATION ANDSAMPLE SIZE DETERMINATION [CHAP. 8 Table 8.1 Confidence level Z value I .96 I 99 I 2.58 I EXAMPLE 8.3 The 91I response time for terrorists bomb threats was investigated. No historical data existed concerning the standard deviation or mean for the response times. When (T is unknown and the sample size is 30 or more, the standard deviation of the sample itself is used in place of CJ when constructing a confidence interval for p. A sample of 35 response times was obtained and it was found that the sample mean was 8.5 minutes and the standard deviation was 4.5 minutes. The estimated standard error of the mean is represented by 3~ and is equal to 5 - - - J ; ; - - - J35 - - .76. From Table 8.1, the value for z is 2.58. The lower limit for the 99% confidence interval is jC - 2.58 x q = 8.5 - 2.58 x .76 = 6.54 and the upper limit is 8.5 + 2.58 x .76 = 10.46. We are 99% confident that the mean response time is between 6.54 minutes and 10.46 minutes. The true value of p may or may not be between these limits. However, 99% of all such intervals contain p.It is this fact that gives us 99% confidence in the interval from 6.54 to 10.46. S 4 5 For large samples, no assumption is made concerning the shape of the population distribution. If the population standard deviation is known, it is used in formula (8.3).If the population standard deviation is unknown, it is estimated by using the sample standard deviation. The value for z in formula (8.3)is found in the standard normal distribution table or, if applicable,by using Table 8.1. MAXIMUM ERROR OF ESTIMATE FOR THE POPULATION MEAN The inequality in formula(8.3)may be expressed as shown: The left-hand side of formula (8.4)is the sampling error when X is used as a point estimate of p. The right-hand side of formula (8.4) is the m i m u m error o f estimate or margin of error when X is used as a point estimate of p.That is, when X is used as a point estimate of p, the maximum error of estimate or margin of error, E, is When the confidence level is 95%, z = 1.96 and E = 1 . 9 6 ~ . This value of E, 1.96%, is called the 95% margin o f error or simply margin o f error when T is used as a point estimateof p. EXAMPLE 8.4 The annual college tuition costs for 40 community colleges selected from across the United States are given in Table 8.2. The mean for these 40 sample values is $1396, the standard deviation of the 40 s 655 values is $655, and the estimated standard error is 5 = -- -= $104. The mean tuition cost for all J ; ; - f i community colleges in the United States is represented by p. A point estimate of p is given by $1,396. The margin of error associated with this estimate is 1.96 x 104 = $204. The 95% confidence interval for p, based upon these data goes from 1396 - 204 = $1,192 to 1396 + 204 = $1,600. It is worth noting that the margin of error is actually f $204, since the error may occur in either direction. Some publications give the margin of error as E and some give it as fE. We shall omit the f sign when giving the margin of error.
  • 178.
    CHAP. 81 1200 850 ESTIMATION ANDSAMPLE SIZE DETERMINATION 850 I750 930 3000 1650 1640 Table 8.2 i 1500 700 I 1700 I 2100 I 900 I 1320 I 500 2050 I750 500 I780 2500 1200 1500 1950 675 2310 1000 1080 2900 2000 1950 I 750 I 500 I 1500 I 950 I 950 680 1875 560 900 1450 169 To compute the confidence interval for p using Minitab, set the data given in Table 8.2 into column c1 and use the command standard deviation cl to compute the sample standard deviation. The command zinterval 95% confidence, sigma = 655.44, data in c l uses the data in cl to compute a 95% confidence interval for p.The output is shown below. The confidence interval extends from 1193 to 1599.The interval computed in Example 8.4 extends from 1 192to 1600.The difference in the answers is due to rounding. MTB > set cl DATA> 1200 850 1700 1500 700 1200 1500 2000 1950 750 850 DATA> 3000 2100 500 500 1950 1000 950 560 500 1750 1650 DATA> 900 2050 1780 675 1080 680 900 1500 930 1640 1320 DATA> 1750 2500 2310 2900 1875 1450 950 DATA > end MTB > standard deviation cl Column Standard Deviation Standard deviation of C1 = 655.44 MTB > name cl 'cost' MTB > zinterval95% confidence, sigma =655.44, data in cl Confidence Intervals The assumed sigma = 655 Variable N Mean St Dev SEMean 95.0%C.I. cost 40 1396 655 104 (1 193, 1599) The width o f a confidence interval is equal to the upper limit of the interval minus the lower limit of the interval. In Example 8.4,the width of the 95% confidence interval is 1599- 1193= $406. THE t DISTRIBUTION S When the sample size is less than 30 and the estimated standard error, 5 - - is used in place of &jr in formula (8.3),the width of the confidence interval for p will generally be incorrect. The I distribution, also known as the Student-t distribution, is used rather than the standard normal distribution to find confidence intervals for p when the sample size is less than 30. In this section we will discuss the properties of the t distribution, and in the next section, we will discuss the use of the t distribution for confidence intervals when the sample size is small. The t distribution is used in many different statistical applications. The t distribution is actually a family of probability distributions. A parameter called the degrees o f freedom, and represented by df, determines each separate t distribution. The t distribution curves, like the standard normal distribution curve, are centered at zero. However, the standard deviation of - 42
  • 179.
    170 ESTIMATION ANDSAMPLE SIZE DETERMINATION [CHAP. 8 df 5 10 15 20 2s each t distribution exceeds one and is dependent upon the value of df, whereas the standard deviation of the standard normal distribution equals one. Figure 8-2 compares the standard normal curve with the t distribution having 10degrees of freedom. Generally speaking, the t distribution curves have a lower maximum height and thicker tails than the standard normal curve as shown in Fig. 8-2. All t distribution curves are symmetrical about zero. Area in the right tail under the t distributioncurve .10 .05 .025 .o1 .005 .oo1 1.476 2.01s 2.571 3.365 4.032 5.893 1.372 1.812 2.228 2.764 3.169 4.144 1.341 1.753 2.131 2.602 2.947 3.733 1.325 1.725 2.086 2.528 2.845 3.552 1.316 1.708 2.060 2.485 2.787 3.450 Fig. 8-2 Appendix 3 gives right-hand tail areas under t distribution curves for degrees of freedom varying from 1 to 40. This table is referred to as the t distribution table. Table 8.3 contains the rows corresponding to degrees of freedom equal to 5, 10, 15, 20, and 25 selected from the t distribution table in Appendix 3. EXAMPLE 8.5 The t distribution having 10 degrees of freedom is shown in Fig. 8-3. To find the shaded area in the right-hand tail to the right of t = 1.812, locate the degrees of freedom, 10, under the df column in Table 8.3. The t value, 1.812, is located under the column labeled .05 in Table 8.3. The shaded area is equal to .05. Since the total area under the curve is equal to 1, the area under this curve to the left of t = 1.812 is 1- .05 = .95. The area under this curve between t = -1.812 and t = 1.812is 1 - .05 -.05 = .90, since there is .05 to the right of t = 1.812and .OS to the left of t = -1.8 12. Fig. 8-3
  • 180.
    CHAP. 81 ESTIMATIONAND SAMPLE SIZE DETERMINATION 171 16.0 12.0 9.5 EXAMPLE 8.6 A t distribution curve with df = 20 is shown in Fig. 8-4. Suppose we wish to find the shaded area under the curve between t = -2.528 and t = 2.086. From Table 8.3, the area to the right oft = 2.086 is .025, and the area to the right of t = 2,528 is .01. By symmetry, the area to the left of t = -2.528 is also .O I. Since the total area under the curve is 1,the shaded area is 1- .025 - .01= .965. 20.0 9.O 17.5 13.0 16.0 7.5 12.0 12.0 12.5 Fig. 8-4 df 19 CONFIDENCEINTERVALFOR THE POPULATIONMEAN: SMALL SAMPLES Area in the right tail under the t distributioncurve .10 .05 .025 .o1 ,005 .oo1 1.328 1.729 2.093 2.539 2.861 3.579 When a small sample (n < 30) is taken from a normally distributed population and the population standard deviation, 0,is unknown, a confidence interval for the population mean, p, is given by formula (8.6): In formula (8.6),X is a point estimate of the population mean, SZ is the estimated standard error, and t is determined by the confidence level and the degrees of freedom. The degrees of freedom is given by formula (8.9,where n is the sample size. d f = n - 1 (8.7) EXAMPLE 8.7 The distance traveled, in hundreds of miles by automobile, was determined for 20 individuals returning from vacation. The results are given in Table 8.4. Table 8.4 I 12.5 I 15.0 I 10.0 I 15.5 I 8.0 5.O 15.0 9.5 For the data in Table 8.4, X = 12.375,s = 3.741, and sx = 0.837. To find a 90% confidence interval for the mean vacation travel distance of all such individuals, we need to find the value of t in formula (8.6). The degrees of freedom is df = n - 1 = 20 - 1 = 19. Using df = 19 and the t distribution table, we must find the t value for which the area under the curve between -t and t is .90. This means the area to the left of -t is .05 and the area to the right oft is also .05. Table 8.5, which is selected from the t distribution table, indicates that the area to the right of 1.729 is .05. This is the proper value oft for the 90% confidence level when df = 19. Table 8.5
  • 181.
    172 8.O 16.0 12.0 ESTIMATION AND SAMPLESIZE DETERMINATION 5.O 15.0 19.5 15.0 9.0 7.5 13.0 6.0 37.5 [CHAP. 8 The following technique is also often recommended for finding the t value for a confidence interval: Tofind the t valuefor a given confidence interval, subtract the confidence level from I and divide the answer by 2 tofind the correct area in the right-hand tuil. In this Example, the confidence level is 90% or .90. Subtracting .90 from 1, we get .lo. Now, dividing S O by 2, we get .05 as the right-hand tail area. From Table 8.5, we see that the proper t value is 1.729. Many students prefer this technique for finding the proper t value for confidence intervals. = 12.375- 1.729x .837 = 10.928and the upper limit is X + t s = 12.375 + 1.729 x .837 = 13.822. A 90% confidence interval for p extends from 1,093to 1,382 miles. The distribution of all vacation travel distances are assumed to be normally distributed. To find a 90% confidence interval for p using Minitab, use the set command to enter the data given in Table 8.4. The command tinterval 90 percent confidence data in cl is used to find the confidence interval. The output is shown below. The confidence interval is (10.928, 13.822). Using formula (8.6),the lower limit for the 90%confidence interval is X - t MTB > set cl DATA> 12.5 8.0 16.0 12.0 9.5 15.0 5.0 20.0 13.0 12.0 10.0 DATA> 15.0 9.0 16.0 12.0 15.5 9.5 17.5 7.5 12.5 DATA > end MTB > name c1 'distance' MTB > tinterval90 percent confidence data in c l Confidence Intervals Variable N Mean St Dev SEMean 9O.O%C.I. Distance 20 12.375 3.741 0.837 (10.928, 13.822) EXAMPLE8.8 The health-care costs, in thousands of dollars for 20 males aged 75 or over, are shown in Table 8.6. Formula (8.6) may not be used to set a confidence interval on the mean health-care cost for such individuals, since the sample data indicate that the distribution of such health-care costs is not normally distributed. The $515,000 and the $950,000 costs are far removed from the remaining health-care costs. This indicates that the distribution of health-care costs for males aged 75 or over is skewed to the right and therefore is not normally distributed. Formula (8.6) is applicable only if it is reasonable to assume that the population characteristic is normally distributed. Table 8.6 I 8.5 I 5 15.0 I 950.0 I 5.5 I EXAMPLE 8.9 The number of square feet per mall devoted to children's apparel was determined for 15 malls selected across the United States. The summary statistics for these 15 malls are as fo1lows:Y = 2,700, s = 450, and 3~ = 116.19. To determine a 99% confidence interval for p, we need to find the value t where the area under the t distribution curve between -t and t is .99 and the degrees of freedom = n - 1 = 15 - I = 14. Since the area between -t and t is .99, the area to the left of -t is .005 and the area to the right of t is .005. Table 8.7 is taken from the t distribution table and shows that the area to the right of 2.977 is .005. The lower limit for the 99%confidence interval is X - tsz = 2,700 - 2.977 x 116.19 = 2,354 ft2 and the upper limit is X + tsT( = 2,700 +2.977 x 116.19= 3,046 ft2.The 99% confidence interval for p is (2,354 to 3,046).
  • 182.
    CHAP. 81 ESTIMATIONAND SAMPLE SIZE DETERMINATION df 14 173 Area in the right tail under the t distribution curve .10 .05 .025 .o1 .005 .oo1 1.345 1.761 2.145 2.624 2.977 3.787 CONFIDENCE INTERVAL FOR THE POPULATION PROPORTION: LARGE SAMPLES In chapter 7, the variable shown in formula (7.10) and reproduced below was shown to have a standard normal distribution provided np > 5 and nq > 5. (7.10) Since 95% of the area under the standard normal curve is between -I .96 and 1.96, and since the variable in formula (7./0) has a standard normal distribution, we have the result shown below: - P(-1.96<-<1.96) P-clfi = .95 3 From Chapter 7, we also know that pF=p and q ,= -. Substituting for p, and 075 in the inequality /? - P -Pjj -1.96 < -< 1.96given in formula (8.8)and solving the resulting inequa % - p - 1 9 6 E < p < p+ 1 9 6 E ity for p, we obtain To obtain numerical values for the lower and upper limits of the confidence interval, it is necessary to substitute the sample values p and for p and q in the expression for q.Making these substitutions, we obtain formula (8.10). (8.10) The interval given in formula (8.10)is called a 95% confidence interval for p. The general form for the interval is shown in formula (8./1),where z represents the proper value from the standard normal distribution table as determined by the desired confidence level. n (8.I I ) EXAMPLE 8.10 A study of 75 small-business owners determined that 80% got the money to start their business from personal savings or credit cards. If p represents the proportion of all small-business owners who got the money to start their business from personal savings or credit cards, then a point estimate of p is f7 = _ _ F= 80%. The estimated standard error of the proportion is represented by sr), and is given by SF, =
  • 183.
    174 ESTIMATIONAND SAMPLESIZE DETERMINATION [CHAP. 8 = 4.62%. From Table 8.1, the z value for a 99% confidence interval is 2.58. The lower limit for a 4 7 99%confidence interval is p- z = 80 - 2.58 x 4.62 = 68.08% and the upper limit is p+ z = 80 + 2.58 x 4.62 = 91.92%.Formula (8. II) is considered to be valid if the sample size satisfies the inequalities np> 5 and nq >5. Since p and q are unknown, we check the sample size requirement by checking to see if np > 5 and n q > 5. In this case nF = 75 x .8 = 60 and n q = 75 x .2 = 15,and since both nF and nT exceed 5, the sample size is large enough to assure the valid use of formula (8.II). EXAMPLE 8.11 In a random sample of 40 workplace homicides, it was found that 2 of the 40 were due to a personal dispute. A point estimate of the population proportion of workplace homicides due to a personal dispute is = = .05 or 5%. Formula (8.11) may not be used to set a confidence interval on p, the population proportion of homicides due to a personal dispute, since np = 40 x .05 = 2 is less than 5. Note that since p is unknown, np cannot be computed. When p is unknown, nF and n q are computed to determine the appropriate- ness of using formula (8.11). 40 DETERMINING THE SAMPLESIZE FOR THE ESTIMATION OF THE POPULATION MEAN The maximum error of estimate when using Fas a point estimate of p is defined by formula (8.5) and is restated below. If the maximum error of estimate is specified, then the sample size necessary to give this maximum error may be determined from formula (8.5).Replacing by - to obtain E = z- and then solving for n we obtain formula (8.22). The value obtained for n is always rounded up to the next whole number. This is a conservative approach and sometimes results in a sample size larger than actually needed. 0 0 J;; &’ Z2(T2 E2 n = - (8.22) EXAMPLE 8.12 In Example 8.4, the mean of a sample of 40 tuition costs for community colleges was found to equal to $1,396 and the standard deviation was $655. The margin of error associated with using $1,396 as a point estimate of p is equal to $204. To reduce the margin of error to $100, a larger sample is needed. The required sample size is n = 6552 = 164.8.To obtain a conservative estimate, we round the estimate up to loo2 165. Note that we estimated 0 by s, because (T is unknown. The 40 tuition costs obtained in the original survey would be supplemented by 165 - 40 = 125 additional community colleges. The new estimate based on the 165 tuition costs would have a margin of error of $100. To use formula (8.12) to determine the sample size, either CT or an estimate of (T is needed. In Example 8.12, the original study, based on sample size 40,may be regarded as a pilot sample. When a pilot sample is taken, the sample standard deviation is used in place of (T.In other instances, a historical value of (3 may exist. In some instances, the maximum and minimum value of the characteristic being studied may be known and an estimate of (T may be obtained by dividing the range by 4.
  • 184.
    CHAP. 81 ESTIMATIONAND SAMPLE SIZE DETERMINATION 175 EXAMPLE 8.13 A sociologist desires to estimate the mean age of teenagers having abortions. She wishes to estimate p with a 99% confidence interval so that the maximum error of estimate is E = .1 years. The z value is 2.58. The range of ages for teenagers is 19- 13= 6 years. A rough estimate of the standard deviation is obtained by dividing the range by 4 to obtain 1.5 years. This method for estimating 0 works best for mound-shaped distributions. The approximate sample size is n = 2582 152 = 1497.7. Rounding up, we obtain n = 1,498. .1* DETERMINING THE SAMPLE SIZE FOR THE ESTIMATION OF THE POPULATIONPROPORTION When j j is used as a point estimate of p, the maximum error of estimate is given by E = z q (8.13) If the maximum error of estimate is specified, then the sample size necessary to give this maximum error may be determined from formula (8.13).If q is replaced by and the resultant equation is solved for n, we obtain z2P9 E2 n=- (8.14) Since p and q are usually unknown, they must be estimated when formula (8.14) is used to determine a sample size to give a specified maximum error of estimate. If a reasonable estimate of p and q exists, then the estimate is used in the formula. If no reasonable estimate is known, then both p and q are replaced by .5. This gives a conservative estimate for n. That is, replacing p and q by .5 usually gives a larger sample size than is needed, but it covers all cases so to speak. EXAMPLE 8.14 A study is undertaken to obtain a precise estimate of the proportion of diabetics in the United States. Estimates ranging from 2% to 5% are found in various publications. The sample size necessary to 1.96’ X .05 X .95 .oo1 estimate the population percentage to within 0.1% with 95% confidence is n = = 182,476. When a range of possible values for p exists, as in this problem, use the value closest to .5 as a reasonable estimate for p. In this example, the value in the range from .02 to .05 closest to .5 is .05. If the prior estimate of p were not used, and .5 is used for p and q, then the computed sample size is n = = 960,400. .0Ol2 Notice that using the prior information concerning p makes a tremendous difference in this example. Solved Problems POINT ESTIMATE 8.1 The mean annual salary for public school teachers is $32,000.The mean salary for a sample of 750 public school teachers equals $31,895. Identify the population, the population mean, and the sample mean. Identify the parameter and the point estimate of the parameter. Arts. The population consists of all public school teachers. p = $32,000, X = $31,895.The parameter is p.A point estimate of the parameter p is $31,895.
  • 185.
    176 ESTIMATION ANDSAMPLE SIZE DETERMINATION [CHAP. 8 8.2 Ninety-eight percent of U.S. homes have a TV. A survey of 2000 homes finds that 1,925have a TV.Identify the population, the population proportion, and the sample proportion. Identify the parameter and a point estimate of the parameter. Ans. The population consists of all U.S.homes. p = 98%, j T = 96.25%. The parameter is p. A point estimate of p is 96.25%. INTERVAL ESTIMATE 8.3 When a multiple of the standard error of a point estimate is subtracted from the point estimate to obtain a lower limit and added to the point estimate to obtain an upper limit, what term is used to describe the numbers between the lower limit and the upper limit? Ans. interval estimate CONFIDENCE INTERVAL FOR THE POPULATION MEAN: LARGE SAMPLES Table 8.8 8.4 A sample of 50 taxpayers receiving tax refunds is shown in Table 8.8. Find an 80% confidence interval for p,where p represents the mean refund for all taxpayers receiving a refund. Ans. For the data in Table 8.8, C x = 51,685, C x2 = 70,158,336, X= 1033.7, s = 584.3, JX = 82.6. From Table 8.1, the z value for 80% confidence is 1.28. Lower limit = 1033.7 - 1.28 x 82.6 = 927.97, upper limit = 1033.7 + 1.28 x 82.6 = 1139.43.The 80% confidence interval is (927.97, 1139.43). 8.5 Use Minitab to find the 80%confidence interval for p in problem 8.4. Ans. MTB > set cl DATA> 1515 1432 120 DATA> 723 1212 202 DATA> 868 1236 1631 DATA> 1769 232 171 DATA> 1781 313 1067 DATA >end MTB > name cl 'refund' MTB > standard deviation c1 270 312 1904 118 1726 1562 681 1243 392 271 1062 1023 367 283 1973 662 1857 903 1313 959 518 1671 395 1744 764 623 169 589 1901 168 800 762 364 1396 668
  • 186.
    CHAP. 81 df 7 ESTIMATION ANDSAMPLE SIZE DETERMINATION Area in the right tail under the t distributioncurve .10 .05 .025 .o1 .005 .001 1.415 1.895 2.365 2.998 3.499 4.785 177 Standard deviation of refund = 584.35 MTB > zinterval 80 percent confidence sigma = 584.35data in cl Confidence Intervals The assumed sigma = 584 Variable N Mean St Dev SEMean 80.0%C.I. Refund 50 1033.7 584.3 82.6 (927.8, 1139.6) MAXIMUM ERROR OF ESTIMATE FOR THE POPULATION MEAN 8.6 A sample of size 50 is selected from a population having a standard deviation equal to 4. Find the maximum error of estimate associated with the confidence intervals having the following confidence levels: (a) 80%;(b) 90%; (c)95%; (499%. 4 Ans. The standard error of the mean is - = S66. Using the z values given in Table 8.1, the following maximum errors of estimate are obtained. sso (a) E = 1.28x .566= .724 (c) E = 1.96x .566= 1.109 (b) E = 1.65 x .566= .934 (4 E = 2.58 x .566= 1.460 8.7 Describe the effect of the following on the maximum error of estimate when determining a confidence interval for the mean: (a) sample size; (b) confidence level; (c) variability of the characteristicbeing measured Ans. (a) The maximum error of estimate decreases when the sample size is increased. (b) The maximum error of estimate increases if the confidence level is increased. (c) The larger the variability of the characteristic being measured, the larger the maximum error of estimate. THE t DISTRIBUTION 8.8 Table 8.9 contains the row corresponding to 7 degrees of freedom taken from the t distribution table in Appendix 3. For a t distribution curve with df = 7, find the following: (a) Area under the curve to the right oft = 2.998 (b) Area under the curve to the left oft =-2.998 (c) Area under the curve between t =-2.998 and t = 2.998 Ans. (a) .01 (b) Since the curve is symmetrical about 0, the answer is the same as in part (a),.01 (c) Since the total area under the curve is 1, the answer is 1 - .01- .01= .98 8.9 Refer to problem 8.8 and Table 8.9 to find the following. (a) Find the t value for which the area under the curve to the right oft is .05. (b) Find the t value for which the area under the curve to the left oft is .05. (c) By considering the answers to parts (a) and (b), find that positive t value for which the area between -t and t is equal to .90.
  • 187.
    178 409 663 304 535 628 [CHAP. 8 487 480535 676 494 332 565 670 434 554 515 665 308 319 519 ESTIMATION AND SAMPLE SIZE DETERMINATION Ans. (a) I .895 (6)-1.895 (c) 1.895 CONFIDENCE INTERVAL FOR THE POPULATION MEAN: SMALL SAMPLES Area in the right tail under the t distributioncurve df .I0 .05 .025 .o1 .005 .oo1 * 19 1.328 1.729 2.093 2.539 2.861 3.579 8.10 8.11 In a transportation study, 20 cities were randomly selected from all cities having a population of 50,000 or more. The number of cars per 1,000people was determined for each selected city, and the results are shown in Table 8.10. Find a 95% confidence interval for p, where p is the mean number of cars per 1,000 people for all cities having a population of 50,000 or more. What assumption is necessary for the confidence interval to be valid? Table 8.10 Am. For the data in Table 8.10, Z x = 10,092,C x2 = 5,381,738, X=504.6, s = 123.4, 5 = 27.6. The t distribution is used because the sample is small. The degrees of freedom is df = 20 - I = 19.The t value in the confidence interval is determined from Table 8.11, which is taken from the t distribution table in Appendix 3. From Table 8.11, we see that the area to the right of 2.093 is ,025 and the area to the left of -2.093 is .025, and therefore the area between -2.093 and 2.093 is .95. The lower limit of the 95% confidence interval is X - t q = 504.6 - 2.093 x 27.6 = 446.8 and the upper limit is X + t s = 504.6 + 2.093 x 27.6 = 562.4. The 95% confidence interval is (446.8, 562.4). It is assumed that the distribution of the number of cars per 1,000 people in cities of 50,000or over is normally distributed. Find the Minitab solution to problem 8.10. Ans. MTB > Set cl DATA> 409 663 304 535 628 487 676 565 554 308 480 DATA> 494 670 515 319 535 332 434 665 519 DATA > end MTB > name cl 'cars' MTB > tinterval 95 percent confidence data in cl Confidence Intervals Variable N Mean St Dev SE Mean 95.0% C.I. Cars 20 504.6 123.4 27.6 (446.8,562.4) CONFIDENCE INTERVAL FOR THE POPULATION PROPORTION: LARGE SAMPLES 8.12 A national survey of 1200 adults found that 450 of those surveyed were pro-choice on the abortion issue. Find a 95% confidence interval for p, the proportion of all adults who are pro- choice on the abortion issue.
  • 188.
    CHAP. 81 ESTIMATIONAND SAMPLE SIZE DETERMINATION 179 450 - Ans. The sample proportion who are pro-choice is =1 2 0 0 - .375 and the proportion who are not pro- .625. The estimated standard error of the proportion is choice or undecided is q==- 750 - ,/? = /.375 '625 = .014. The z value is 1.96. The lower limit of the 95% confidence 1200 interval is p-z IF;' -- - .375 - 1.96 x .014 = .348 and the upper limit is p+z 1.96 x .014 = .402. The 95%confidence interval is (.348, .402). 8.13 A national survey of 500 African-Americans found that 62% in the survey favored affirmative action programs. Find a 90% confidence interval for p, the proportion of all African-Americans who favor affirmative action programs. Ans. The sample percent who favor affirmativeaction programs is 62%. The sample percent who do not favor affirmative action programs or are undecided is 38%. The estimated standard error of - . proportion, expressed as a percentage, is JY -= 4 % = 2.17%. The z value is 1.65. The lower limit of the 90% confidence interval is p- zd e= 62 - 1.65 x 2.17 = 58.4% and the upper limit is p+zJ ' : '-= 62 + 1.65 x 2.17 = 65.6%. The 90% confidence interval is (58.4%, 65.6%). DETERMINING THE SAMPLESIZEFOR THE ESTIMATION OF THE POPULATIONMEAN 8.14 A machine fills containers with corn meal. The machine is set to put 680 grams in each container on the average. The standard deviation is equal to 0.5 gram. The average fill is known to shift from time to time. However, the variability remains constant. That is, o remains constant at 0.5 gram. In order to estimate p,how many containers should be selected from a large production run so that the maximum error of estimate equal 0.2 gram with probability 0.95? z2cT2 Ans. The sample size is determined by use of the formula n = 2. The value for CJ is known to equal 0.5, E is specified to be 0.2 and for probability equal to 0.95, the z value is 1.96. Therefore, the E Z2 cT2 E .2 sample size is n = 7 = 1*962 52 = 24.01. The required sample size is obtained by rounding the answer up to 25. 8.15 A pilot study of 250 individuals found that the mean annual health-care cost per person was $2550 and the standard deviation was $1350. How large a sample is needed to estimate the true annual health-care cost with a maximum error of estimate equal to $100 with probability equal to 0.99? Z2 cT2 Ans. The sample size is determined by use of the formula n = 7. The value for G is estimated from the pilot study to be $1350. E is specified to be $100 and for probability equal to 0.99, the z value E
  • 189.
    180 ESTIMATION ANDSAMPLE SIZE DETERMINATION [CHAP. 8 2582 *3502 = 1213.13. The required sample is 2.58. Therefore, the sample size is n = --j- = z2o2 E loo2 size is obtained by rounding the answer up to 1214. If the results from the pilot study are used, then an additional 1214- 250 = 964 individuals need to be surveyed. DETERMINING THE SAMPLESIZE FOR THE ESTIMATION OF THE POPULATIONPROPORTION 8.16 A large study is undertaken to estimate thp, percentage of students in grades 9 through 12 who use cigarettes. Other studies have indicated that the percent ranges between 30% and 35%. How large a sample is needed in order to estimate the true percentage for all such students with a maximum error of estimate equal to 0.5%with a probability of 0.90? z2pq Am. The sample size is given by n = 7. The specified value for E is 0.5%. The z value for probability equal to 0.90 is 1.65.The previous studies indicate that p is between 30% and 35%. When a range of likely values for p exist, the value closest to 50% is used. This value gives a E L conservative estimate of the sample size. The sample size is n = 35 65 = 24,774.75. The 5* sample size is 24.775. 8.17 Find the sample size in problem 8.16 if no prior estimate of the population proportion exists. 2 Ans. The sample size is given by n = - ". The specified value for E is 0.5%. The L value for probability equal to 0.90 is 1.65.When no prior estimate for p exists, p and q are set equal to 50% E2 in the sample size formula. Therefore, 1*652x50x50 = 27,225. Note that 2,450 fewer subjects are 52 needed when the prior estimate of 35% is used. SupplementaryProblems POINT ESTIMATE 8.18 Table 8.8 in problem 8.4 contains a sample of 50 tax refunds received by taxpayers. Give point estimates for the population mean, population median, and the population standard deviation. Ans. $1033.70, the mean of the SO tax refunds, is a point estimate of the population mean. $1064.50,the median of the 50 tax refunds, is a point estimate of the population median. $584.30, the standard deviation of the 50 tax refunds, is a point estimate of the population standard deviation. 8.19 Table 8.10in problem 8.10 contains a sample of the number of cars per 1,000people for 20 cities having a population of 50,000 or more. Give a point estimate of the proportion of all such cities with 600 or more cars per 1,000people. Ans. In this sample, there are S cities with 600 or more cars per 1,000.The point estimate of p is .2S or 25%.
  • 190.
    CHAP. 81 ESTIMATIONAND SAMPLE SIZE DETERMINATION Distribution type t distribution, df = 10 t distribution, df = 20 t distribution, df = 30 standard normal 181 Area under the curve to the right of this value is .025 a b d c INTERVAL ESTIMATE 8.20 What happens to the width of an interval estimate when the sample size is increased? Ans. The width of an interval estimate decreases when the sample s i x is increased. CONFIDENCE INTERVAL FOR THE POPULATION MEAN: LARGE SAMPLES 8.21 One hundred subjects in a psychological study had a mean score of 35 on a test instrument designed to measure anger. The standard deviation of the 100test scores was 10.Find a 99% confidence interval for the mean anger score of the population from which the sample was selected. Am. The confidence interval is X - z n < p < X + z n . The lower limit is X-2% = 35 - 2.58 x 1 = 32.42 and the upper limit is X-2% = 35+2.58x 1 = 37.58. The interval is (32.42, 37.58). 8.22 The width of a large sample confidence interval is equal to w = (X + zox) - (X- zox)= 2 ~ 0 ~ . Replacing o x by -, w is given as w = 22-. Discuss the effect of the confidence level, the standard deviation, and the sample size on the width. Ans. 0 0 & & The width varies directly as the confidence level, directly as the variability of the characteristic being measured, and inversely as the square root of the sample size. MAXIMUM ERROR OF ESTIMATE FOR THE POPULATION MEAN 8.23 8.24 The U.S. abortion rate per 1,000 female residents in the age group 15 - 44 was determined for 35 different cities having a population of 25,000 or more across the U.S. The mean for the 35 cities was equal to 24.5 and the standard deviation was equal to 4.5. What is the maximum error of estimate when a 90%confidence interval is constructed for p'? Ans. The maximum error of estimate is E = zox.For a 90% confidence interval, z = 1.65.The estimated standard error is 0.76. E = 1.65 x .76 = 1.25. A study concerning one-way commuting times to work was conducted in Omaha, Nebraska. The mean commuting time for 300 randomly selected workers was 45 minutes. The margin of error for the study was 7 minutes. What is the 95% confidence interval for p? Am. The 95% confidence interval is T - 1.96% < p < X + 1 . 9 6 ~ . The margin of error is E = 1.96ojr = 7 and therefore, the lower limit of the interval is 38 and the upper limit is 52. THE T DISTRIBUTION 8.25 Use the t distribution table and the standard normal distribution table to find the values for a, b, c, and d, What distribution does the t distribution approach as the degrees of freedom increases'? Table 8.12
  • 191.
    I82 df 5 ESTIMATION AND SAMPLESIZE DETERMINATION [CHAP. 8 Standard deviation of the t distribution with this value for df 1.29 Atis. a = 2.228 b = 2.086 c = 2.042 d = 1.96 The t distribution approaches the standard normal distribution when the degrees of freedom are increased. This is illustrated by the results in Table 8.12. 525 300 8.26 The standard deviation of the t distribution is equal to jdP12. - for df > 2. Find the standard deviation for the t distributions having df values equal to 5,10, 20, 30, 50, and 100. What value is the standard deviation getting close to when the degrees of freedom get larger? 500 390 715 425 Am. The standard deviations are shown in Table 8.13.The standard deviation is approaching one as the degrees of freedom get larger. The t distribution approaches the standard normal distribution when the degrees of freedom are increased. 1,127 625 475 Table 8.13 255 500 1,454 375 814 750 10 20 30 50 1.12 1.05 1.04 I .02 I I 100 1.01 1 CONFIDENCE INTERVAL FOR THE POPULATION MEAN: SMALL SAMPLES 8.27 In order to estimate the lifetime of a new bulb, ten were tested and the mean lifetime was equal to 835 hours with a standard deviation equal to 45 hours. A stem-and-leaf of the lifetimes indicated that it was reasonable to assume that lifetimes were normally distributed. Determine a 99% confidence interval for p,where p represents the mean lifetime of the population of lifetimes. Am. The confidence interval is X - ts;3 < p < X -tts-jl, where X = 835, s = 45, sx = 14.23, and t = 3.250. The lower limit is X - ts? = 835 - 46.25 = 788.75 hours and the upper limit is X + t q = 835 + 46.25 = 881.25 hours. 8.28 The heights of 15 randomly selected buildings in Chicago are given in Table 8.14. Find a 90% confidence interval on the mean height of buildings in Chicago. Table 8.14 Am. A Minitab histogram of the data in Table 8.14 is shown in Fig. 8-5. Since the heights are not normally distributed, the confidence interval using the t distribution is not appropriate.
  • 192.
    CHAP. 81 2 . 1 ESTIMATIONAND SAMPLE SIZE DETERMINATION 183 Fig. 8-5 CONFIDENCE INTERVAL FOR THE POPULATION PROPORTION:LARGE SAMPLES 8.29 In a survey of 900 adults, 360 responded “yes” to the question “Have you attended a major league baseball game in the last year?” Determine a 90% confidence interval for p, the proportion of all adults who attended a major league baseball game in the last year. Am. The sample proportion attending a game in the last year is iT = .40 and the proportion not attending is = .60. The z value for a 90% confidence interval is 1.65. The confidence interval for p is p- z ] y< p < p+ ZJF. The maximum error of estimate is E = z,/?= .027. The lower limit of the confidence interval is .40 - .027 = .373 and the upper limit is .40 +.027 = .427. The 90%interval is (.373, .427). 8.30 Use the data in Table 8.8, found in problem 8.4, to find a 99% confidence interval for the proportion of all taxpayers receiving a refund who receive a refund of more than $500. Am. Thirty-seven of the fifty refunds in Table 8.8 exceed $500.The sample proportion exceeding $500 is .74. The z value for a 99% confidence interval is 2.58. The confidence interval for D is p- z/F- < p < p+ z/F-. The maximum error of estimate is E = z,/?= .16 The lower limit is .74 - .16 = .58 and the upper limit is .74 + .16 = .90.The 99%interval is ( 5 8 , .90). DETERMINING THE SAMPLE SIZE FOR THE ESTIMATIONOF THE POPULATION MEAN 8.31 8 . 3 2 The estimated standard deviation of commutingdistances for workers in a large city is determined to be 3 miles in a pilot study. How large a sample is needed to estimate the mean commuting distance of all workers in the city to within .5 mile with 95% confidence? . For this problem, E = .5, z = 1.96, and CT is estimated to be Am. The sample size is given by n = - 3. The sample size is determined to be 139. Z2 o2 E2 In problem 8.31, what size sample is required to estimate p to within .I mile with 95% confidence? Am. n = 3,458
  • 193.
    184 ESTIMATION ANDSAMPLE SIZE DETERMINATION [CHAP. 8 DETERMINING THE SAMPLE SIZE FOR THE ESTIMATION OF THE POPULATION PROPORTION 8.33 8.34 What sample size is required to estimate the proportion of adults who wear a beeper to within 3% with probability 0.95‘?A survey taken one year ago indicated that 15% of all adults wore a beeper. 2 Am. The sample size is given by n = - “ . The error of estimate, E, is given to be 396, and the z value E2 - - 1.516~ X I5X 85 is 1.96. If we assume that the current proportion is “close” to 1596, then n = 32 544.2, which is rounded up to 545. Suppose in problem 8.33 that there is no previous estimate of the population proportion. What sample size would be required to estimate p to within 3% with probability 0.95? 2 Am. The sample size is given by n = - pq . The error of estimate, E, is given to be 3%, and the z value E2 = 1,067.I, which is 1.9fj2X 50x 50 is 1.96.With no previous estimate for p, use p = q = 50%. n = 32 rounded up to 1,068.
  • 194.
    Chapter 9 Tests ofHypotheses: One Population NULL HYPOTHESISAND ALTERNATIVEHYPOTHESIS There are a number of mail order companies in the US.Consider such a company called L. C . Stephens Inc. The mean sales per order for L. C. Stephens, determined from a large database last year, is known to equal $155. From the same database, the standard deviation is determined to be 0= $50. The company wishes to determine whether the mean sales per order for the current year is different from the historical mean of $155. If p represents the mean sales per order for the current year, then the company is interested in determining if p # $155. The alternative hypothesis, also often called the research hypothesis, is related to the purpose of the research. In this instance, the purpose of the research is to determine whether p f $155. The research hypothesis is represented symbolicallyby Ha:p f $155. The negation of the research hypothesis is called the null hypothesis. In this case, the negation of the research hypothesis is that p = $155. The null hypothesis is represented symbolically by Ho: ~1 = $155. The company wishes to conduct a test of hypothesis to determine which of the two hypotheses is true. A test of hypothesis where the research hypothesis is of the form H ,: p f (some constant) is called a two-tailed test. The reason for this term will be clear when the details of the test procedure are discussed. EXAMPLE 9.1 A tire manufacturer claims that the mean mileage for their premium brand tire is 60,000 miles. A consumer organization doubts the claim and decides to test it. That is, the purpose of the research from the perspective of the consumer organization is to test if p < 60,000, where p represents the mean mileage of all such tires. The research hypothesis is stated symbolically as Ha:p < 60,000miles. The negation of the research hypothesis is p = 60,000. The null hypothesis is stated symbolically as Ho: p = 60,000. A test of hypothesis where the research hypothesis is of the form Ha:p < (some constant) is called a one-tailed test, a lej3-taifedtest, or a Lower-tailed test. EXAMPLE 9.2 The National Football League (NFL)claims that the average cost for a family of four to attend an NFL game is $200. This figure includes ticket prices and snacks. A sports magazine feels that this figure is too low and plans to perform a test of hypothesis concerning the claim. The research hypothesis is Ha:p > $200, where p is the mean cost for all such families attending an NFL game. The null hypothesis is Ho: p = $200. A test of hypothesis where the research hypothesis is of the form H,: p > (some constant) is called a one-tailed test, a right-tailed test, or an upper-tailed test. TEST STATISTIC, CRITICAL VALUES, REJECTION, AND NONREJECTION REGIONS Consider again the test of hypothesis to be performed by the L. C. Stephensmail order company discussed in the previous section. The two hypotheses are restated as follows: Ho:p = $155 (the mean sales per order this year is $155) Ha:p f $155 (the mean sales per order this year is not $155) 1 oc
  • 195.
    186 TESTS OFHYPOTHESIS: ONE POPULATION [CHAP. 9 To test this hypothesis, a sample of 100 orders for the current year are selected and the mean of the sample is used to decide whether to reject the null hypothesis or not. A statistic, which is used to decide whether to reject the null hypothesis or not is called a test statistic. The central limit theorem assures us that if the null hypothesis is true, then Y will equal $155 on the average and the standard error will equal ax= -= 5, assuming that the standard deviation equals 50 and has remained constant. Furthermore, the sample mean has a normal distribution.If the value of the test statistic, X, is “close” to $155, then we would likely not reject the null hypothesis. However, if the computed value of .jrc is considerably different from $155, we would reject the null hypothesis since this outcome supports the truth of the research hypothesis. The critical decision is how different does X need to be from $155 in order to reject the null hypothesis. Suppose we decide to reject the null hypothesis if si; differs from $155 by two standard errors or more. Two standard errors is equal to 2 x aK= $10. Since the distributionof the sample mean is bell-shaped, the empirical rule assures us that the probability that .jrc differs from $155 by two standard errors or more is approximately equal to 0.05. That is, if T < $145 or if X > $165, then we will reject the null hypothesis.Otherwise, we will not reject the null hypothesis. The values $145 and $165 are called critical values. The critical values divide the possible values of the test statistic into two regions called the rejection region and the nonrejection region. The critical values along with the regions are shown in Fig. 9-1. 50 f i - reiection region I nonrejection region Irejection region X 145 165 Fig. 9-1 EXAMPLE 9.3 In Example 9.1, the null and research hypothesis are stated as follows: Ho: p = 60,000miles (the tire manufacturer’sclaim is correct) Ha: p < 60,000 miles (the tire manufacturer’sclaim is false) Suppose it is known that the standard deviation of tire mileages for this type of tire is 7,000 miles. If a sample of 49 of the tires are road tested for mileage and the manufacturer’s claim is correct, then X will equal 60,000 7 miles on the average and the standard error will equal 031 = -= 1,000 miles. Furthermore, because of the J49 large sample size, the sample mean will have a normal distribution. Suppose the critical value is chosen two standard errors below 60,000;then the rejection and nonrejection regions will be as shown in Fig. 9-2. - reiection region I nonreiection region X 58,000 Fig. 9-2 When the consumer organization tests the 49 tires, if the mean mileage is less than 58,000 miles for the sample, then the manufacturer’s claim will be rejected. EXAMPLE 9.4 In Example 9.2, the null hypothesis and the research hypothesis are stated as follows: Ho: p = $200 (the NFL claim is correct) Ha:p > $200 (the NFL claim is not correct) Suppose the standard deviation of the cost for a family of four to attend an NFL game is known to equal $30. If a sample of 36 costs for families of size 4 are obtained, then X will be normally distributed with mean equal to
  • 196.
    CHAP. 91 TESTSOF HYPOTHESIS: ONE POPULATION 187 Conclusion Do not reject H o Reject 6 30 $200 and standard error equal to -= $5 if the NFL claim is correct. Suppose the critical value is chosen J36 HoTrue False Correct conclusion Type I1 error Type I error Correct conclusion three standard errors above $200;then the rejection and nonrejectionregions are as shown in Fig. 9-3. - nonreiectionrepion Ireiection region X 215 Fig. 9-3 If the mean of the sample of costs for the 36 families of size 4 exceeds $215, then the NFL claim will be rejected. TYPE I AND TYPE I1 ERRORS Consider again the test of hypothesis to be performed by the L. C. Stephens mail order company discussed in the previous two sections. The two hypotheses are restated as follows: Ho: p = $155 (the mean sales per order this year is $155) Ha:pf $155 (the mean sales per order this year is not $155) The decision to reject or not reject the null hypothesis is to be based on the sample mean. The critical values and the rejection and nonrejection regions are given in Fig. 9-1. If the current year mean, p,is $155 but the sample mean falls in the rejection region, resulting in the null hypothesis being rejected, then a Type I error is made. If the current year mean is not equal to $155, but the sample mean falls in the nonrejection region, resulting in the null, hypothesis not being rejected, then a Type I1 error is made. That is, if the null hypothesis is true but the statistical test results in rejection of the null, then a Type I error occurs. If the null hypothesis is false but the statistical test results in not rejecting the null, then a Type I1 error occurs. The errors as well as the possible correct conclusions are summarized in Table 9.1. Table 9.1 The first two letters of the Greek alphabet are used in statistics to represent the probabilities of committingthe two types of errors. The definitionsare as follows. a=probability of making a Type I error p= probability of making a Type 1 1error The calculation of a Type I error will now be illustrated, and the calculation of Type IIerrors will be illustrated in a later section.The term level ofsignificance is also used for a. The level of significance,a,is defined by formula (9.1): a = P(rejectingthe null hypothesis when the null hypothesis is true) (9.0 In the current discussion of the L. C. Stephensmail order company, the level of significancemay be expressed as a = P(X < 145or X > 165and p = 155).In order to evaluate this probability,recall that
  • 197.
    188 TESTS OFHYPOTHESIS:ONE POPULATION [CHAP. 9 - for large samples, z = - -’ has a standard normal distribution. As previously discussed, the standard aji - x - 155 5 error is $5 and assuming that the null hypothesis is true, z = -has a standard normal - x - I55 145- 155 5 5 distribution. The event X < 145 is equivalent to the event -< event 5i; > 165 is equivalent to the event -> = -2 or z < -2. The = 2 or z > 2. Therefore, a may be - x - I55 165- 155 5 5 expressed as a = P(Y c 145 ) + P(T > 165) = P(z c -2) + P(z > 2). From the standard normal distribution table, P(z < -2) = P(z > 2) = .5 - .4772 = .0228. And therefore, a = 2 x .0228 = .0456. When using the mean of a sample of 100 order sales and the rejection and nonrejection regions shown in Fig. 9-1 to decide whether the overall mean sales per order has changed, 4.56% of the time the company will conclude that the mean has changed when, in fact, it has not. EXAMPLE 9.5 In Example 9.3, the null and research hypothesis for the tire manufacturer’s claim is as follows: Ho: p = 60,000 miles (the tire manufacturer’sclaim is correct) Ha:p < 60,000miles (the tire manufacturer’sclaim is false) The rejection and nonrejection regions were illustrated in Fig. 9-2. The level of significance is determined as follows: a = P(rejecting the null hypothesis when the null hypothesis is true) Since the null is rejected when the sample mean is less than 58,000miles and the null is true when p = 60,000, a = P(X < 58,000 when p = 60,000) Transforming the sample mean to a standard normal, the expression for a now becomes ) 51-60,000 58,000- 60,000 a = P ( 1,ooo 1,000 Simplifying, and finding the area to the left of -2 under the standard normal curve, we find a = P(z < -2) = .5 - .4772 = .0228 If the tire manufacturer’s claim is tested by determining the mileages for 49 tires and rejecting the claim when the mean mileage for the sample is less than 58,000 miles, then the probability of rejecting the manufacturer’s claim, when it is correct, equals ,0228. EXAMPLE 9.6 In Example 9.4, the null and research hypothesis for the NFL’s claim is as follows: Ho: p = $200 (the NFL claim is correct) Ha:j.~> $200 (the NFL claim is not correct) The rejection and nonrejection regions were illustrated in Fig. 9-3. The level of significance is determined as follows: a = P(rejecting the null hypothesis when the null hypothesis is true) Since the null is rejected when the sample mean is greater than $215 and the null is true when p = $200,
  • 198.
    CHAP. 91 TESTSOF HYPOTHESIS: ONE POPULATION 189 a = P(X > 215 when p = 200) Transforming the sample mean to a standard normal, the expression for a now becomes ) X-200 215-200 5 5 a=P(- > Simplifying, and finding the area to the right of 3 under the standard normal curve, we find U .= P(z> 3) = .5 - .4986 = .0014 The probability of falsely rejecting the NFL’s claim when using a sample of 36 families and the rejection and nonrejection regions shown in Fig. 9-3 to test the hypothesis is .0014. Thus far, we have considered how to find the level of significance when the rejection and nonrejection regions are specified. Often, the level of significance is specified and the rejection and nonrejection regions are required. Suppose in the mail order example that the level of significance is specified to be a = .01. Because the hypothesis test is two-tailed, the level of significance is divided into two halves equal to .005 each. We need to find the value a, where P(z > a) = .005 and P(z <-a) = .005. The probability P(z > a) = .005 implies that P(0 < z < a) = .5 - .005 = .495. Using the standard normal table, we see that P(0 < z < 2.57) = .4949 and P(0 < z < 2.58) = .4951. The interpolated value is a = 2.575, which we round to 2.58. The rejection region is therefore as follows: z < -2.58 or z > 2.58. Figure 9-4 shows the rejection regions as well as the area under the standard normal curve associated with the regions. Fig. 9-4 - X - 155 5 To determine the rejection region in terms of si-, the equation z = -may be used. The inequality z < -2.58 is replaced by - < -2.58. Multiplying both sides of the inequality by 5 and then adding 155 to both sides, the inequality -< -2.58 is seen to be equivalent to X < 142.1. - X - 155 5 - X - 155 c J - X - 155 5 The inequality z > 2.58 is replaced by -> 2.58 shows the rejection regions as well as the area under means. which is equivalent to X > 167.9. Figure 9-5 the normal curve associated with the sample
  • 199.
    190 TESTS OFHYPOTHESIS: ONE POPULATION [CHAP. 9 Fig. 9-5 EXAMPLE 9.7 Suppose we wish to test the hypothesis in Example 9.5 at a level of significance equal to .01. Since this is a lower-tailed test, we need to find a standard normal value a such that P(z < a) = .01. From the standard normal distribution table, we find that P(0 < z < 2.33) = .4901. Therefore P(z > 2.33) = .01. Because of the symmetry of the standard normal curve, P(z < -2.33) also equals .01. Hence z < -2.33 is a rejection region of size a = .01.To find the rejection region in terms of the sample mean, we note from Example 9.5 that if the null hypothesis is true, then z = has a standard normal distribution. Substituting for z in the inequality z < -2.33, we have < -2.33 or solving for X’, we find X < 57,670 miles as the rejection region. Figure 9-6 shows the rejection region as well as the area under the standard normal curve associated with the region. 51- 60,000 1,000 X - 60,000 1,000 F i g .9-6 Figure 9-7 shows the rejection region as well as the area under the normal curve associated with the sample means. F i g . 9-7 EXAMPLE 9.8 Suppose we wish to test the hypothesis in Example 9.6 at a level of significance equal to .10. Since this is an upper-tailed test, we need to find a standard normal value a such that P(z > a) = .10. From the
  • 200.
    CHAP. 91 TESTSOF HYPOTHESIS: ONE POPULATION 191 standard normal distribution table, we find that P(0 < z < 1.28) = .3997. Therefore P(z > 1.28) = .10. Hence z > 1.28 is a rejection region of size a = .lO. To find the rejection region in terms of the sample mean, we note - x-200 from Example 9.6 that if the null hypothesis is true, then z = -has a standard normal distribution. 5 - x-200 5 Substituting for z in the inequality z > 1.28, we have -> 1.28 or solving for X, we find X > $206.40 as the rejection region. Figure 9-8 shows the rejection region as well as the area under the standard normal curve associated with the region. Fig. 9-8 Figure 9-9 shows the rejection region as well as the area under the normal curve associated with the sample means. Fig. 9-9 HYPOTHESIS TESTS ABOUT A POPULATION MEAN: LARGE SAMPLES This section summarizes the material presented in the proceeding sections. The techniques presented are appropriate when the sample size is 30 or more. EXAMPLE 9.9 The mean age of policyholders at World Life Insurance Company, determined two years ago, was found to equal 32.5 years and the standard deviation was found to equal 5.5 years. It is reasonable to believe that the mean age has increased. However, some of the older policyholders are now deceased and some younger policyholders have been added. The company determines the ages of 50 current policyholders in order to decide whether the mean age has changed. If p represents the current mean of all policyholders, the null and research hypothesis are stated as follows: Ho:p= 32.5 (the mean age has not changed) Ha:p f 32.5 (the mean age has changed)
  • 201.
    192 TESTS OFHYPOTHESIS:ONE POPULATION [CHAP. 9 t Steps for Testing a Hypothesis Concerning a Population Mean: Large Sample Step 1: State the null and research hypothesis. The null hypothesis is represented symbolically by Ho: p= h,and the research hypothesis is of the form Ha: p f Step 2: Use the standard normal distribution table and the level of significance, a,to determine the rejection region. or Ha:p < or Ha: p > h. The test is performed at the conventional level of significance, which is a = .05. From the standard normal distribution, it is determined that P(z > 1.96) = .025 and therefore P(z c -1.96) = .025. The rejection and nonrejection regions are therefore as follows: re-iectionregion I nonrejection region 1 reiection region z -1.96 1.96 The standard deviation of the policyholder ages is known to remain fairly constant, and therefore the standard error is a x = -- - .778. The mean age of the sample of 50 policyholders is determined to be 34.4 years. The 55 Jso - x -po 34.4- 32.5 computed test statistic is z = -- - = 2.44. That is, assuming the mean age of all policyholders 0. F .778 has not changed from two years ago, the mean of the current sample is 2.44 standard errors above the population mean. Since this exceeds 1.96, the null hypothesis is rejected and it is concluded that the current mean age exceeds 32.5 years. Notice that the four steps in Table 9.2 were followed in performing the test of hypothesis. Table 9.2 gives a set of steps that may be used to perform a test of hypothesis about the mean of a population. - X - F o Step 3: Compute the value of the test statistic as follows: z = - ,where X is the mean of the I 0,- sample, his given in the null hypothesis, and a x is computed by dividing CJ by &. If 0 is unknown, estimate it by using the sample standard deviation, s, when computing OK. Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls in the rejection region. Otherwise, the null hypothesis is not rejected. EXAMPLE 9.10 The police department in a large city claims that the mean 91I response time for domestic disturbance calls is 10 minutes. A “watchdog group” believes that the mean response time is greater than 10 minutes. If p represents the mean response time for all such calls, the watchdog group wishes to test the research hypothesis that p > 10at level of significance a = .01.The null and research hypothesis are: Ho:p = 10(the police department claim is correct) Ha:p > 10(the police department claim is not correct) From the standard normal distribution table, it is found that P(z > 2.33) = .01. The rejection and nonrejection regions are as follows: nonreiection region I reiection region Z 2.33 A sample of 35 response times for domestic disturbance calls is obtained and the mean response time is found to be 11.5 minutes and the standard deviation of the 35 response times is 6.0 minutes. The standard error is - s 6.0 x - 11.5- 10.0 estimated to be ---= 1.01 minutes. The computed test statistic is z = 2= = 1.49. & - & o x 1.01 Since the computed test statistic does not exceed 2.33, the police department claim is not rejected. Note that the
  • 202.
    CHAP. 91 TESTSOF HYPOTHESIS: ONE POPULATION 193 study has not proved the claim to be true. However, the results of the study are not strong enough to refute the claim at the 1% level of significance. EXAMPLE9.11 A sociologist wishes to test the null hypothesisthat the mean age of gang members in a large city is 14years vs. the alternativethat the mean is less than 14 years at level of significancea= .10. A sampleof 40 ages from the police gang unit records in the city found that Y = 13.1 years and s = 2.5 years. The estimated standard error is -= -= .40 years. Assuming the null hypothesis to be true, the sample mean is 2.25 standard errors below the population mean, since z = - - - -2.25. For a = .10, the rejection region is z <-1.28. Since the computed test statistic is less than -1.28, the null hypothesis is rejected and it is concluded that the mean age of the gang members is less than 14 years. s 2.5 J;;& 13.1- 14 .40 The process of testing a statistical hypothesis is similar to the proceedings in a courtroom in the United States. An individual, charged with a crime, is assumed not guilty. However, the prosecution believes the individual is guilty and provides sample evidence to try and prove that the person is guilty. The null and alternativehypothesis may be stated as follows: Ho:the individualcharged with the crime is not guilty Ha:the individualcharged with the crime is guilty If the evidence is strong, the null hypothesis is rejected and the person is declared guilty. If the evidence is circumstantial and not strong enough, the person is declared not guilty. Notice that the person is not usually declared innocent, but is found not guilty. The evidence usually does not prove the null hypothesis to be true, but is not strongenough to reject it in favor of the alternative. CALCULATING TYPE I1 ERRORS In testing a statistical hypothesis, a Type I1 error occurs when the test results in not rejecting the null hypothesis when the research hypothesis is true. The probability of a Type I1 error is represented by the Greek letter p and is defined in formula (9.2): p =P(not rejecting the null hypothesis when the research hypothesis is true) (9.2) Consider once again the mail order company example discussed in previous sections. The null and research hypothesis are: Ho:p = $155 (the mean sales per order this year is $155) Ha:1.1 f $1 55 (the mean sales per order this year is not $155) The rejection and nonrejection regions are: - rejection region I nonreiection region I rejection region X 145 165 The level of significance is a = .0456. The level of significance is computed under the assumption that the null hypothesis is true, that is, p = 155.The probability of a Type II error is calculated under the assumption that the research hypothesis is true. However, the research hypothesis is true whenever p f $155. Suppose we wish to calculate B when p = 157.5. The sequence of steps to compute p are as follows: p = P(not rejecting the null hypothesis when the research hypothesis is true)
  • 203.
    194 P P 145.O so00 147.5.6915 150.0 3399 152.5 .9270 155.0 .9544 157.5 .9270 160.0 3399 162.5 .6915 165.O so00 TESTS OF HYPOTHESIS:ONE POPULATION [CHAP. 9 p = P(145 < X < 165 when p = 157.5) 145- 157.5 X- 157.5 165- 157.5 < < 5 5 5 The event 145 < X < 165 when p = 157.5 is equivalent to or -2.5 c z < 1.5.The calculation of p is therefore given as follows: p = P(-2.5 < z < 1.5)= .4332+ .4938= .9270 Suppose we wish to compute p when p= 160.The sequence of steps to compute p are as follows: p = P(not rejectingthe null hypothesis when the research hypothesis is true) p = P(145 < X < 165 when p = 160.0) 145- 160 51- 160 165- 160 <- < 5 5 5 The event 145 < T < 165 when p = 160.0is equivalent to -3 < z < 1. The calculation of p is therefore given as follows: or p = P(-3 < z < 1) = .4986+ .3413 = .8399 Notice that the value of p is not constant, but depends on the alternative value assumed for p. Table 9.3gives the values of p for several different values of p. A plot of p vs. p is called an operating characteristic curve. An operating characteristic curve for Table 9.3 is shown in Fig. 9-10. P 0 . 9- 0.8 - 0.7 - 0 . 0- 0.5 - 145 155 165 Fig. 9-10
  • 204.
    CHAP. 91 TESTSOF HYPOTHESIS:ONE POPULATION 195 56,000 56,500 57,000 57,500 58,000 58,500 59,000 59,500 60.000 EXAMPLE9.12 Suppose we wish to construct an operating characteristic curve for the hypothesis concerning the tire manufacturer’s claim discussed in Example 9.3. Recall that the null and research hypothesis were as follows: Ho:p = 60,000miles (the tire manufacturer’sclaim is correct) Ha:p < 60,000miles (the tire manufacturer’sclaim is false) .0228 .0668 .1587 .3085 5000 .6915 3413 .9332 .9772 The rejection and nonrejection regions were as follows: - reiection region I nonreiection region X 58,000 The level of significance is a = .0228. To illustrate the computation of p, suppose we wish to compute the probability of committing a Type I1 error when p = 59,000. p = P(not rejecting the null hypothesis when the research hypothesis is true) p = P(X > 58,000when p = 59,000) X -59,000 58,000-59,000 The event X > 58,000when p = 59,000 is equivalent to > or z > -I. Therefore, 1,000 1,000 p = P(z >-1) = SO00 + .3413= .8413. Table 9.4gives the p values for several different p values. Table 9.4 The operating characteristic curve corresponding to Table 9.4 is shown in Fig. 9-11. 1.o 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0 56 57 58 59 60 Fig. 9-11
  • 205.
    196 TESTS OFHYPOTHESIS:ONE POPULATION [CHAP. 9 P VALUES Consider once again the mail order company example. The null and research hypotheses are: Ho: p = $155 (the mean sales per order this year is $155) Ha:p f $155 (the mean sales per order this year is not $155) For level of significancea = .05,the rejection region is shown in Fig. 9-12. rejection region I nonre-iectionregion Irejection region Z -1.96 1.96 Fig. 9-12 To understand the concept of p value, consider the following four scenarios. Also recall that the standarderror is 05 = -= 5. 50 Jroo Scenario I: The mean of the 100 sampled accounts is found to equal $149. The computed test 149- 155 5 statistic is z = = -1.2, and since the computed test statistic falls in the nonrejection region, the null hypothesis is not rejected. Scenario 2: The mean of the 100 sampled accounts is found to equal $164. The computed test statistic is z = 1.8, and since the computed test statistic falls in the nonrejection region, the null hypothesis is not rejected. Scenario 3: The mean of the 100 sampled accounts is found to equal $165. The computed test statistic is z = 2.0, and since the computed test statistic falls in the rejection region, the null hypothesis is rejected. Scenario 4: The mean of the 100 sampled accounts is found to equal $140. The computed test statistic is z = -3.0, and since the computed test statistic falls in the rejection region, the null hypothesis is rejected. Notice in scenarios 1and 2 that even though the evidence is not strong enough to reject the null hypothesis, the test statistic is nearer the rejection region in scenario2 than scenario 1. Also notice in scenarios 3 and 4 that the evidence favoring rejection of the null hypothesis is stronger in scenario 4 than scenario 3, since the sample mean in scenario 4 is 3 standard errors away from the population mean, and the sample mean is only 2 standard errors away in scenario 3. The p value is used to reflect these differences. The p value is defined to be the smallest level of significanceat which the null hypothesis would be rejected. In scenario I, the smallest level of significance at which the null hypothesis would be rejected would be one in which the rejection region would be z 5-1.2 or z 2 1.2. The level of significance corresponding to this rejection region is 2 x P(z 2 1.2) = 2 x (.5 - 3849) = .2302. The p value corresponding to the test statistic z = -1.2 is .2302. In scenario 2, the smallest level of significance at which the null hypothesis would be rejected would be one in which the rejection region would be z 5 -1.8 or z 2 1.8. The level of significance corresponding to this rejection region is 2 x P(z 2 1.8) = 2 x (.5 - .4641) = .0718. The p value correspondingto the test statistic z = 1.8is .0718.
  • 206.
    CHAP. 91 TESTSOF HYPOTHESIS: ONE POPULATION 197 In scenario 3, the smallest level of significance at which the null hypothesis would be rejected would be one in which the rejection region would be z 5-2.0 or z 2 2.0. The level of significance corresponding to this rejection region is 2 x P(z 2 2.0) = 2 x (.5 - .4772) = .0456. The p value correspondingto the test statisticz = 2.0 is .0456. In scenario 4, the smallest level of significance at which the null hypothesis would be rejected would be one in which the rejection region would be z 5 -3.0 or z 2 3.0. The level of significance corresponding to this rejection region is 2 x P(z 2 3.0) = 2 x (.5 - .4987) = .0026. The p value correspondingto the test statisticz = 3.0 is ,0026. To summarize the procedure for computing the p value for a two-tailed test, suppose z* represents the computed test statisticwhen testing Ho: p = vs. Ha: p + h. The p value is given by formula (9.3): p value = P(I z I> lz* I) (9.3) When using the p value approach to testing hypothesis, the null hypothesis is rejected if the p value is less than a.In addition, the p value gives an idea about the strength of the evidence for rejecting the null hypothesis. The smaller the p value, the stronger the evidence for rejecting the null hypothesis. EXAMPLE 9.13 In Example 9.10, the followingstatistical hypothesis was tested for a= .OI : Ho: p = 10(the police department claim is correct) Ha: p > 10(the police department claim is not correct) The computed test statistic is z* = 1.49.The smallest level of significance at which the null hypothesis wou 1 be rejected is one in which the rejection region is z > 1.49. The p value is therefore equal to P(z > 1.49). From the standard normal distribution table, we find the p value = .5 - .4319 = .0681. Using the p value approach to testing, the null hypothesis is not rejected since the p value > a. To summarize the procedure for computing the p value for an upper-tailed test, suppose z* represents the computed test statistic when testing Ho: p = bvs. Ha : p > h. The p value is given by p value =P(z > z*) (9.4) EXAMPLE 9.14 In Example 9.11, the following statistical hypothesis was tested for a = .10. Ho: p = 14 (the mean age of gang members in the city is 14) Ha:p < 14 (the mean age of gang members in the city is less than 14) The computed test statistic is z* = -2.25. The smallest level of significance at which the null hypothesis would be rejected is one in which the rejection region is z < -2.25. The p value is therefore equal to P(z < -2.25). Using the standard normal distribution table, we find the p value = .5 - .4878 = .0122. Using the p value approach to testing, the null hypothesis is rejected since the p value < .lO. To summarize the procedure for computing the p value for a lower-tailed test, suppose z* represents the computed test statistic when testing Ho:p= c ~ 0 vs. Ha:p<b. The p value is given by p value = P(z c z*) (9.5) EXAMPLE 9.15 A random sample of 40 community college tuition costs was selected to test the null hypothesis that the mean cost for all community colleges equals $1500 vs. the alternative that the mean does not equal $1500. The level of significance is .05. The data are shown in Table 9.5.
  • 207.
    198 1200 850 1700 1500 700 1200 TESTS OF HYPOTHESIS:ONEPOPULATION 850 1750 930 3000 1650 1640 2100 900 1320 500 2050 1750 500 1780 2500 1950 675 2310 1500 2000 1o00 1080 2900 950 680 1875 ~ ~ ~ ~~ ~ ~~ ~ 1950 560 900 1450 F 750 500 1500 950 [CHAP.9 The Minitab analysis for this test of hypothesis is shown below. The command set c l is used to put the data in column 1. The name cost is given to column 1. The command standard deviation c l is used to find s. The command ztest mean = 1500 sigma = 655.44 data in cl; is used to specify the value of h,the estimate of sigma, and the column where the data are located. The subcommand alternative 0. indicates a two-tailed test. Since the p value is 3 2 and exceeds .05,the null hypothesis is not rejected. MTB > set cl DATA>1200 850 1700 1500 700 1200 1500 2000 1950 750 DATA> 850 3000 2100 500 500 1950 1000 950 560 500 DATA>1750 1650 900 2050 1780 675 1080 680 900 1500 DATA > 930 1640 1320 1750 2500 2310 2900 1875 1450 950 DATA > end MTB > name cl 'cost' MTB > standard deviation c1 Column StandardDeviation Standard deviation of cost = 655.44 MTB > ztest mean = 1500sigma = 655.44 data in cl; SUBC> alternative 0. 2-Test Test of mu = 1500 vs mu not = 1500 The assumed sigma = 655 Variable N Mean St Dev SEMean Z P cost 40 1396 655 104 -1.00 .32 To test the research hypothesis that p c 1500,the subcommand alternative-1. is used. The p value is now equal to .16. MTB > ztest mean = 1500 sigma = 655.44 data in cl; SUBC > alternative -1. 2-Test Test of mu = 1500vs mu < 1500 The assumed sigma = 655 Variable N Mean St Dev SEMean 2 P cost 40 1396 655 104 -1.00 .I6 To test the research hypothesis that p > 1500,the subcommandalternative 1. is used. The p value is now equal to .84. MTB > ztest mean = 1500 sigma = 655.44 data in cl; SUBC > alternative 1.
  • 208.
    CHAP. 91 TESTSOF HYPOTHESIS: ONE POPULATION . Steps for Testing a Hypothesis Concerning a Population Mean: Small Sample Normally Distributed Population with Unknown 6 Z-Test Test of mu = 1500vs mu > 1500 The assumed sigma = 655 Variable N Mean StDev SEMean Z P cost 40 1396 655 104 -1.00 .84 Step 2: Use the t distribution table, with degrees of freedom = n - 1 and the level of significance, a,to determine the rejection region. Step 3: Compute the value of the test statistic as follows: t = - - ”,where X is the mean of the sample, - SX i ~ is given in the null hypothesis, and sjr is computed by dividing s by&. Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls in the rejection region. Otherwise, the null hypothesis is not rejected. HYPOTHESIS TESTS ABOUT A POPULATIONMEAN: SMALL SAMPLES 199 Step 1 : State the null and research hypothesis. The null hypothesis is represented symbolically by Ho: p =b, and the research hypothesis is of the form Ha: p f cl0or Ha: ~1 < cl0or Ha: ~1 > k. EXAMPLE 9.16 In order to test the hypothesis that the mean age of all commercial airplanes is 10 years vs. the alternative that the mean is not 10 years, a sample of 25 airplanes is selected from several airlines. A histogram of the 25 ages is mound-shaped and it is therefore assumed that the distribution of all airplane ages is normally distributed. The sample mean is found to equal 11.5 years and the sample standard deviation is equal to 5.5 years. The level of significance is chosen to be a = .05. From the t distribution tables with df = 24, it is determined that the area to the right of 2.064 is .025. By symmetry, the area to the left of -2.064 is also .025. The rejection regions as well as the corresponding areas under the t distribution curve are shown in Fig. 9-13. Fig. 9-13
  • 209.
    200 TESTS OFHYPOTHESIS:ONE POPULATION [CHAP. 9 300 250 - 5.5 X-P, 11.5-10 The estimated standard error is 1.36,and since this value does not fall in the rejection region, the null hypothesis is not rejected. = -= 1.1 years. The computed test statistic is t = -- - - - - 6 5 Sf 1.1 280 190 250 220 255 272 EXAMPLE 9.17 A social worker wishes to test the hypothesis that the mean monthly household payment for food stamps in Shelby county is $225 vs. the alternative that it is greater than that amount. The payments for 20 randomly selected households are given in Table 9.7. 200 225 275 Table 9 . 7 210 245 228 290 265 213 310 235 277 A histogram of the payments is shown in Fig. 9-14. The histogram indicates that it is reasonable to assume that the payments are normally distributed. I I I - - 1 190 210 230 250 270 290 310 payment Fig. 9-14 The Minitab analysis for the data is shown below. The command ttest mean = 225 data in cl; gives the value for and the location of the sample data in column 1. The subcommand alternative = 1. Identifies the alternative as being upper-tailed. The p value indicates that the results are significant for any level of significance greater than .0023. MTB > set cl DATA > 300 250 200 225 275 280 220 210 290 310 DATA> 190 255 245 265 235 250 272 228 213 277 DATA > end MTB > ttest mean = 225 data in cl; SUBC > alternative = 1. T-Testof the Mean Test of mu = 225.00 vs mu > 225.00 Variable N Mean St Dev SEMean T P Payment 20 249.50 34.04 7.61 3.22 0.0023 HYPOTHESIS TESTS ABOUT A POPULATION PROPORTION: LARGE SAMPLES In Chapter 7 the sampling distribution of p,the sample proportion, was discussed. When the sample size satisfies the inequalities np > 5 and nq > 5, the sampling distribution of the sample
  • 210.
    CHAP. 91 TESTSOF HYPOTHESIS: ONE POPULATION 201 proportion is normally distributed with mean equal to p and standard error q, = - .This result is sometimes referred to as the central limit theorem for the sample proportion. Furthermore, it was shown that z = ~ has a standard normal distribution. These results form the theoretical underpinnings for tests of hypothesis concerning p, the population proportion. Table 9.8 gives the t - P -Pg q, steps that may be followed when testing a hypothesis about the population proportion. Table 9.8 Steps for Testing a Hypothesis Concerning a Population Proportion: Large Samples (np > 5 and nq > 5 ) Step I: State the null and research hypothesis. The null hypothesis is represented symbolically by Ho:p = po, and the research hypothesis is of the form Ha:p f poor Ha:p c poor Ha:p >po. Step 2: Use the standard normal distribution table to determine the rejection and nonrejection regions. - Step 3: Compute the value of the test statistic as follows: z = - - ,where jT is the sample proportion, po is OF specified in the null hypothesis, and oF= Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls in the rejection region. Otherwise, the null hypothesis is not rejected. EXAMPLE 9.18 Health-care coverage for employees varies with company size. It is reported that 30% of all companies with fewer than 10employees provide health benefits for their employees. A sample of 50 companies with fewer than 10employees is selected to test Ho:p = .3 vs. Ha:p f .3 at a = .01. It is found that 19 of the 50 companies surveyed provide health benefits for their employees. Using the standard normal table, it is found that the area under the curve to the right of 2.58 is .005 and by symmetry, the area to the left of -2.58 is also .005. The rejection regions as well as the associated areas under the standard normal curve are shown in Fig. 9-15. Fig. 9-15 -- Assuming the null hypothesis to be true, the standard error of the proportion is oF = / T y T ' - - 19 .38 - .30 .065.The sample proportion is j7 = -= .38. The computed value of the test statistic is z* = = 1.23. Since the computed value of the test statistic is not in the rejection region, the null hypothesis is not rejected. The p value is P(I z I> 1.23)= P(z < -1.23) +P(z > 1.23)= 2 x P(z > 1.23)= 2 x (.5 - .3907)= .2186. 50 .065
  • 211.
    202 TESTS OFHYPOTHESIS: ONE POPULATION [CHAP. 9 EXAMPLE 9.19 A national survey asked the followingquestion of 2500 registered voters: “Is the character of a candidate for president important to you when deciding for whom to vote?” Two thousand of the responses were yes. Let p represent the percent of all registered voters who believe the character of the president is important when deciding for whom to vote. The results of the survey were used to test the null hypothesis Ho: p = 90% vs. Ha: p < 90% at level of significance a = .05. The sample percent is = 2500 x 100% = 80%. 2Ooo Assuming the null hypothesis to be true, the standard error is OF= /pot - ‘ O - -,/- -- - .60%.The computed test statistic is z*= - = -16.7. The null hypothesis would be rejected at all practical levels of signiticance. 80-90 .60 Solved Problems NULL HYPOTHESISAND ALTERNATIVE HYPOTHESIS 9.1 9.2 A study in 1992established the mean commuting distance for workers in a certain city to be 15 miles. Because of the westward spread of the city, it is hypothesized that the current mean commuting distance exceeds 15 miles. A traffic engineer wishes to test the hypothesis that the mean commuting distance for workers in this city is greater than 15 miles. Give the null and alternative hypothesis for this scenario. Ans. Ho: p = IS,Ha:p > 15, where p represents the current mean commuting distance in this city The mean number of sick days used per year nationally is reported to be 5.5 days. A study is undertaken to determine if the mean number of sick days used for nonunion members in Kansas differs from the national mean. Give the null and alternative hypothesis for this scenario. Ans. Ho: p = 5.5, Ha: p # 5.5, where p represents the mean number of sick days used for nonunion members in Kansas. TEST STATISTIC,CRITICAL VALUES, REJECTION AND NONREJECTION REGIONS 9.3 Refer to problem 9.1. The decision is made to reject the null hypothesis if the mean of a sample of 49 randomly selected commuting distances is more than 2.5 standard errors above the mean established in 1992.The standard deviation of the sample is found to equal 3.5 miles. Give the rejection and nonrejection regions. s 3.5 Ans. The estimated standard error of the mean is sx = -- -- - .5 miles. The null hypothesis is to be rejected if the sample mean of the 49 commuting distances exceeds the 1992 population mean by 2.5 standard errors. That is, reject Ho if T > 15 + 2.5 x .5 = 16.25. &-G - rionreiection region I reiection region X 16-25
  • 212.
    CHAP. 91 TESTSOF HYPOTHESIS: ONE POPULATION 203 9.4 Refer to problem 9.2. The null hypothesis is to be rejected if the mean of a sample of 250 nonunion members differs from the national mean by more than 2 standard errors. The standard deviation of the sample is found to equal 2.1 days. Give the rejection and nonrejection regions. S 2.1 Ans. The estimated standard error of the mean is s i = -- -= .13 days. The null hypothesis is J;;-J250 to be rejected if the mean of the sample differs fiom the national mean by more than 2 standard errors. That is, reject H o if X < 5.5 - 2 x .I3 = 5.24 or if 'jY > 5.5 +2 x .13 = 5.76. - reiection region I nonreiection region 1 reiection region X 5.24 5.76 TYPE I AND TYPE I1 ERRORS 9.5 Refer to problems 9.1 and 9.2. Describe, in words, how a Type I and a Type I1 error for both problems might occur. Ans. In problem 9.1, a Type I error is made if it is concluded that the current mean commuting distance exceeds 15 miles when in fact the current mean is equal to 15 miles. A Type I1 error is made if it is concluded that the current mean commutingdistance is 15 miles when in fact the mean exceeds 15 miles. In problem 9.2, a Type I error is made if it is concluded that the mean number of sick days used for the nonunion members in Kansas is different than the national mean when in fact it is not different. A Type I1 error is made if it is concluded that the mean number of sick days used for nonunion members in Kansas is equal to the national mean when in fact it differs from the national mean. 9.6 Refer to problems 9.3 and 9.4.Find the level of significance for the test procedures described in both problems. Ans. In problem 9.3, a = P(rejecting the null hypothesis when the null hypothesis is true) or a = P(Y > 16.25 when p = 15)= P(z > 2.5), since 16.25is 2.5 standard errors above p = 15.a = P(z > 2.5) = In problem 9.4,a = P(rejecting the null hypothesis when the null hypothesis is true) or a = P(T c 5.24 or X > 5.76 when p = 5.5) = P(z < -2 or z > 2), since 5.24 is 2 standard errors below p = 5.5 and 5.76 is 2 standard errors above p = 5.5. a = P(z <-2) + P(z > 2) = 2 x (.5 - .4772) = .0456. .5 - .4938 = .0062. HYPOTHESISTESTS ABOUT A POPULATIONMEAN: LARGE SAMPLES 9.7 A sample of size 50 is used to test the following hypothesis: Ho: p = 17.5vs. Ha:p # 17.5 and the level of significance is .01.Give the critical values, the rejection and nonrejection regions, the computed test statistic, and your conclusion i f (a)Y = 21.5 and s = 5.5 (0 unknown); (b) 'ji' = 21.5 and 0= 5.0 Ans. The critical values and therefore the rejection and nonrejection regions are the same in both parts. Since P(z < -2.58) = P(z > 2.58) = .005, the critical values are -2.58 and 2.58. The rejection and nonrejection regions are as follows: reiection region I nonreiection region Ireiection region I, -2.58 2.58
  • 213.
    204 TESTS OFHYPOTHESIS: ONE POPULATION [CHAP. 9 s 5.5 In part (a) the standard error is estimated by deviation is unknown. The computed test statistic is z = = -= -- - .78, since the population standard = 5.13, and the null hypothesis &Jso 21.5- 17.5 .78 is rejected since this value falls in the rejection region. a 5 In part (6)the standard error is computed by 0 : = -- -- - .71, since the population standard C J S o 21.5- 17.5 deviation is known. The computed test statistic is z = = 5.63, and the null hypothesis is .71 rejected since this value falls in the rejection region. 9.8 A state environmental study concerning the number of scrap-tires accumulated per tire dealership during the past year was conducted. The null hypothesis is H o :p = 2500 and the research hypothesis is H : p # 2500, where p represents the mean number of scrap-tires per dealership in the state. For a random sample of 85 dealerships, the mean is 2750 and the standard deviation is 950.Conduct the hypothesis test at the 5% level of significance. Ans. The null hypothesis is rejected if Iz* I > 1.96, where z* is the computed value of the test statistic. The estimate standard error is = - - - -= 103 miles. The computed value of the test 950 J f ; & 2750- 2500 103 statistic is z* = = 2.43. It is concluded that the mean number of scrap-tires per dealership exceeded 2500 last year CALCULATING TYPE I1 ERRORS 9.9 In problems 9.1 and 9.3, the hypothesis system Ho: p= 15 andHa:p> 15 was tested by using a sample of size 49. The null hypothesis was to be rejected if the sample mean exceeded 16.25. Suppose the true value of p is 16.5. What is the probability that this test procedure will not result in the rejection of the null hypothesis? Am. p = P(not rejecting the null hypothesis when p = 16.5)= P(X < 16.25 when p = 16.5).In problem 9.3, the estimated standard error was found to equal .5. The event X < 16.25 is equivalent to the 51- 16.5 16.25- 16.5 .5 .5 event z = - < =-.5 when p = 16.5.Therefore, p = P(z < -S) = P(z > .5) = .5 - .I915 = .3085. That is, there is a 30.85% chance that, even though the mean commuting distance has increased from the 15 miles figure of 1992 to the current figure of 16.5 miles, the hypothesis test will result in concluding that the current mean is the same as the 1992 mean. 9.10 In problems 9.2 and 9.4, the hypothesis system Ho: p = 5.5 and Ha: 1-1 f 5.5 was tested by using a sample of size 250. The null hypothesis was to be rejected if the sample mean was less than 5.24 or exceeded 5.76. Suppose the true value of p is 5.0. What is the probability that this test procedure will not result in the rejection of the null hypothesis? Ans. p = P(not rejecting the null hypothesis when p = 5.0) = P(5.24< T < 5.76 when p is 5.0). In problem 9.4, the estimated standard error was found to equal .13. The event 5.24 < X < 5.76 when p is 5.0 is equivalent to the event which is the same as the event 5.24-5.0 x -5.0 5.76-5.0 <- < .I 3 .13 .I3
  • 214.
    CHAP. 91 TESTSOF HYPOTHESIS: ONE POPULATION 205 1.85 < z < 5.85 . Therefore p = P( I .85 < z < 5.85). Since there is practically zero probability to the right of 5.85, p = P(z > 1.85)= .5 - .4678 = .0322. P VALUES 9.11 Use the p value approach to test the hypothesis in problem 9.8. Ans. The computed value of the test statistic is z* = 2.43. Since the hypothesis is two-tailed, the p value is given by p value = P(I z I > I z* I) or equivalently P(z < -1 z* I) + P(z >I z* 1) = P(z < -2.43) + P(z > 2.43). Since these two probabilities are equal, we have the p value = 2 x P(z > 2.43) = 2 x (.5 - .4925) = .015. Using the p value approach, the null hypothesis is rejected since the p value < a. 9.12 Find the p values for the following hypothesis systems and z* values. ( U ) Ho: p = 22.5 VS.Ha: p +22.5 and Z* = 1.76 (b) Ho: p = 37.I VS. Ha: 1 f 37.1 and Z* = -2.38 (c) Ho: p= 12.5 VS. Ha: j~ < 12.5 and Z* = -.98 (d) Ho: p = 22.5 VS. Ha: 1-1< 22.5 and Z* = 0.35 (e) Ho: p= 100vs. Ha:p> 100 and z* = 2.76 (f) Ho: p= 72.5 VS. Ha: p> 72.5 and Z* = -0.25 Ans. (a) The p value is given by p value = P(I z I > I z* 1) = P(I z I >1.76) = 2 x P(z > 1.76)= 2 x (b) The p value is given by p value = P(I z I > I z* I) = P(I z I> 2.38), since the absolute value of -2.38 is 2.38. Therefore, p value = P(l z I> 2.38) = 2 x P(z > 2.38) = 2 x (.5 - .4913) = .0174. (c) The p value is given by p value = P(z < z*) = P(z < -.98) = P(z > .98) = (.5 - .3365) = .1635. ( d ) The p value is given by p value = P(z < z*) = P(z < 3 5 )= .5 + .1368 = .6368. (e) The p value is given by p value = P(z > z*) = P(z > 2.76) = .5 - .4971 = .0029. U> The p value is given by p value = P(z > z*) = P(z > -.25) = .5 + .0987 = .5987. (.5 - .4608) = .0784. HYPOTHESIS TESTS ABOUT A POPULATIONMEAN: SMALL SAMPLES 9.13 A psychological test, used to measure the level of hostility, is known to produce scores which are normally distributed, with mean = 35 and standard deviation = 5. The test is used to test the null hypothesis that p = 35 vs. the research hypothesis that p > 35, where p represents the mean score for criminal defense lawyers. Sixteen criminal defense lawyers are administered the test and the sample mean is found to equal 39.5. The sample standard deviation is found to equal 10.The level of significance is selected to be a = .01. (a) Perform the test assuming that o = 5. This amounts to assuming that the standard deviation (b) Perform the test assuming o is unknown. of test scores for criminal defense lawyers is the same as the general population. - x-35 (a) Since the test scores are normally distributed and (r is known, the test statistic, z = -is o r normally distributed. The rejection region is found using the standard normal distribution tables. The null hypothesis is rejected if the test statistic exceeds 2.33.The calculated value of Ans. 39.5- 35 1.25 the test statistic is z* = = 3.6, where the standard error is computed as o ox=--- - - 1.25.The null hypothesis is rejected since 3.6 falls in the rejection region. &-a
  • 215.
    206 5.5 7.5 12.5 10.0 15.3 TESTS OF HYPOTHESIS:ONE POPULATION [CHAP.9 13.2 15.5 16.5 18.0 4.O 3.3 9.0 17.6 2.5 7.5 8.5 4.9 14.0 10.0 16.6 2.9 13.3 6.6 9.7 13.4 9.14 - x - 35 Because (T is unknown and the sample size is small the test statistic, t = -has a t sx distribution with df = 16 - 1 = 15. The rejection region is found by using the t distribution with df = 15. The null hypothesis is rejected if the test statistic exceeds 2.602. The calculated value of the test statistic is t* = = 1.80, where the estimated standard error is computed as =---= 2.5. The null hypothesis is not rejected, since 1.80does not fall in the rejection region. 395- 35 2.5 s 10 &--G The weights in pounds of weekly garbage for 25 households is shown in Table 9.9. Use these data to test the hypothesis that the weekly mean for all households in the city is 10pounds. The alternative hypothesis is that the mean differs from 10pounds. The level of significance is a = .05. Table 9.9 Ans. The Minitab solution is shown below. MTB > set cl DATA> 5.5 7.5 12.5 10.0 15.5 13.2 DATA> 15.5 3.3 7.5 10.0 6.6 16.5 DATA> 18.0 17.6 4.9 2.9 13.4 DATA > end MTB > ttest mean = 10data in c1; SUBC > alt = 0. T-Testof the Mean Test of mu = 10.000vs mu not = IO.OO0 Variable N Mean St Dev SEMean 4.0 9.0 T CI 25 10.320 4.923 0.985 0.33 2.5 14.0 13.3 8.5 16.6 9.7 P 0.75 Since the p value > .05,the null hypothesis is not rejected. HYPOTHESIS TESTS ABOUT A POPULATION PROPORTION: LARGE SAMPLES 9.15 A survey of 2500 women between the ages of 15 and 50 found that 28% of those surveyed relied on the pill for birth control. Use these sample results to test the null hypothesis that p = 25% vs. the research hypothesis that p f 25%, where p represents the population percentage using the pill for birth control. Conduct the test at a = .05 by giving the critical values, the computed value of the test statistic,and your conclusion.
  • 216.
    CHAP. 91 TESTSOF HYPOTHESIS:ONE POPULATION 207 - Ans. -Th<critical valms are f 1.96. The test statistic is given by z = -, where og= ’- OF f * The standard error of the proportion is oTi = - 25x75 - - .87%, and the computed test statistic is 42500 28-25 z=-- - 3.45. The null hypothesis is rejected since the computed test statistic exceeds the .87 critical value, 1.96. 9.16 A poll taken just prior to election day finds that 389 of 700 registered voters intend to vote for Jake Konvalina for mayor of a large Midwestern city. Test the null hypothesis that p = .5 vs. the alternativethat p > .5 at a = .05,where p represents the proportion of all voters who intend to vote for Jake. Use the p value approach to do the test. 389 Ans. The sample proportion is F=,, = S56, and the standard error of the proportion is - p-po S56-.5 q =/q = ,/% = .019. The computed test statistic is z* = -- - -- - 2.95. OF .019 The p value is given by P(z > 2.95) = .5 - .4984 = .0016. The null hypothesis is rejected since the p value < .05. Supplementary Problems NULL HYPOTHESIS AND ALTERNATIVEHYPOTHESIS 9.17 9.18 Classify each of the following as a lower-tailed, upper-tailed, or two-tailed test: ( U ) H o :~ = 3 5 , Ha: p < 35 (b) Ho: P = 1.29 Ha: CL # 1.2 (c) Ho: p=$l050, Ha: p > $1050 Ans. (a) lower-tailed test (b) two-tailed test (c) upper-tailed test Suppose the current mean cost to incarcerate a prisoner for one year is $18,000. Consider the following scenarios. ( U ) A prison reform plan is implemented which prison authorities feel will reduce prison costs. (b) It is uncertain how a new prison reform plan will affect costs. (c) A prison reform plan is implemented which prison authorities feel will increase prison costs. Let p represent the mean cost per prisoner per year after the reform plan is implemented. Give the research hypothesis in symbolic form for each of the above cases. TEST STATISTIC,CRITICALVALUES, REJECTIONANDNONREJECTIONREGIONS 9.19 A coin is tossed 10times to determine if it is balanced. The coin will be declared “fair,” i.e., balanced, if between 2 and 8 heads, inclusive, are obtained. Otherwise, the coin will be declared “unfair.” The null hypothesis corresponding to this test is Ho: (the coin is fair), and the alternative hypothesis is Ha: (the coin is unfair). Identify the test statistic, the critical values, and the rejection and nonrejection regions. Ans. The test statistic is the number of heads to occur in the 10tosses of the coin. The critical values are 1 and 9. The rejection region is x = 0, 1, 9, or 10,where x represents the number of heads to occur in the 10tosses. The nonrejection region is x = 2, 3,4,5,6, 7, or 8.
  • 217.
    TESTS OF HYPOTHESIS:ONE POPULATION [CHAP. 9 9.20 A die is tossed 6 times to determine if it is balanced. The die will be declared “fair,” i.e., balanced, if the face “6” turns up 3 or fewer times. Otherwise, the die will be declared “unfair.” The null hypothesis corresponding to this test is b: (the die is fair), and the alternative hypothesis is Ha:(the die is unfair). Identify the test statistic, the critical values, and the rejection and nonrejection regions. Ans. The test statistic is the number of times the face “6” turns up in 6 tosses. The critical value is 4. The rejection region is x = 4,5, or 6, where x represents the number of times the face “6” turns up in the 6 tosses of the die. The nonrejection region is x = 0, 1, 2, or 3. TYPEI AND TYPE I1 ERRORS 9.21 In problems9.19 and 9.20, describe how a Type I and a Type I1error would occur. Ans. In problem 9.19, if the coin is fair, but you happen to be unlucky and get 9 or 10heads or 9 or 10 tails and therefore conclude that the coin is unfair, you commit a Type I error. If the coin is actually bent so that it favors heads more often than tails or vice versa, but you obtain between 2 and 8 heads, then you commit a Type I1error. In problem 9.20, if the die is fair, but you happen to be unlucky and obtain an unusual happening such as 4 or more sixes in the 6 tosses and therefore conclude that the die is unfair, you commit a Type I error. If the die is loaded so that the face “6” comes up more often than the other faces, but when you toss it, you actually obtain 3 or fewer sixes in the 6 tosses, then you commit a Type 11 error. 9.22 Find the level of significancein problems 9.19 and 9.20. Ans. The level of significanceis given by a = P(rejecting the null hypothesis when the null hypothesis is true). In problem 9.19, if the null hypothesis is true, then x, the number of heads in 10tosses of a fair coin, has a binomial distributionand the level of significanceis a = P(x = 0, 1,9, or 10when p = S).Using the binomial tables, we find that the level of significance is a = .0010 + .0098 + .0010 + ,0098 = .0216. In problem 9.20, if the null hypothesis is true, then x, the number of times the face “6”turns up in 6 tosses, has a binomial distributionwith n = 6 and p = ! . = .167, and the level of significance is a = P(x = 4, 5, or 6 when p = .167). Using the binomial probability formula,we find a = ,008096+ .000649+.oooO22= .008767. 6 HYPOTHESIS TESTSABOUT A POPULATION MEAN: LARGE SAMPLES 9.23 9.24 Home Videos Inc. surveys 450 households and finds that the mean amount spent for renting or buying videos is $13.50 per month and the standard deviation of the sample is $7.25. Is this evidence sufficient to conclude that p > $12.75 per month at level of significancea= .01? Ans. The computed test statistic is z* = 2.19 and the critical value is 2.33. The null hypothesis is not rejected. A survey of 700 hourly wages in the US.is taken and it is found that: T = $17.65 and s = $7.55. Is this evidence sufficientto concludethat p > $17.20, the stated current mean hourly wage in the U.S.? Test at a = .05. Ans. The computed test statistic is z* = 1.58 and the critical value is 1.65. The null hypothesis is not rejected.
  • 218.
    CHAP. 91 TESTSOF HYPOTHESIS: ONE POPULATION 209 CALCULATING TYPE I1 ERRORS 9,25 In problem 9.23, find the probability of a Type I1 error if p = $13.25. Use the sample standard deviation given in problem 9.23 as your estimate of 0. Ans. p = P(not rejecting the null hypothesis when p = $13.25) = P(X c 13.55 when p = $13.25) = P(z< 37) = 3078. 9,26 In problem 9.24, find the probability of a Type I1 error if p = $18.00.Use the sample standard deviation given in problem 9.24 as your estimate of 0. Ans. p = P(not rejecting the null hypothesis when p = $18.00) = P(Y c 17.67 when p = $18.00) = P(z<-1.16) = .123. P VALUES 9.27 Find the p value for the sample results given in problem 9.23. Use this computed p value to test the hypothesis given in the problem. Ans. The p value = P(Z > 2.19) = .0143. Since the p value exceeds the preset a,the null hypothesis is not rejected. 9.28 Find the p value for the sample results given in problem 9.24. Use this computed p value to test the hypothesis given in the problem. Ans. The p value = P(Z > 1.58) = .057 1. Since the p value exceeds the preset a,the null hypothesis is not rejected. HYPOTHESIS TESTS ABOUT A POPULATION MEAN: SMALL SAMPLES 9.29 The mean score on the KSW Computer Science Aptitude Test is equal to 13.5. This test consists of 25 problems and the mean score given above was obtained from data supplied by many colleges and universities. Metropolitan College administers the test to 25 randomly selected students and obtains a mean equal to 12.0 and a standard deviation equal to 4.1. Can it be concluded that the mean for all Metropolitan students is less than the mean reported nationally? Assume the scores at Metropolitan are normally distributed and use a = -05. Ans. The null hypothesis is rejected since t* = -1.83 < -1.71 1. The mean score for Metropolitan students is less than the national mean. 9.30 The claim is made that nationally, the mean payout per $1 waged is 90 cents for casinos. Twenty randomly selected casinos are selected from across the country. The data and a Minitab analysis are shown below and on the next page. If the null hypothesis is &:p = 90, the alternative hypothesis is H , : p <90, and the level of significance is a= .01, what is your conclusion? Data Display payout 85 82 85 83 91 78 90 86 95 82 86 '74 89 92 96 81 90 86 92 90 MTB > ttest mean = 90 data in c1; SUBC > alt = -1. T-Test of the Mean Test of mu = 90.00 vs mu c 90.00
  • 219.
    210 TESTS OFHYPOTHESIS: ONE POPULATION [CHAP. 9 Variable N Mean St Dev SE Mean T P Payout 20 86.65 5.63 1.26 -2.66 0.0077 Ans. Since the p value is less than a = .01, reject the claim and conclude that the mean payout is less than 90 cents per dollar. HYPOTHESISTESTS ABOUT A POPULATION PROPORTION:LARGESAMPLES 9 . 3 1 A survey of 300 gun owners is taken and 40% of those surveyed say they have a gun for protection for self/family. Use these results to test Ho: p = 32% vs. Ha: p > 32%,where p is the percent of all gun owners who say they have a gun for protection of self/family.Test at a = .025. Ans. The null hypothesis is rejected, since z* = 2.97 > 1.96. It is concluded that the percent is greater than 32%. 9.32 Perform the following hypothesis about p. (a) Ho: p = .35vs. Ha:p # .35,n = 100, = .38, a = .I0 (b) H o : p = . 7 5 v s . H a : p < . 7 5 , n = 7 0 0 , ~ = . 7 1 , a = . 0 5 (c) Ho: p = .55 VS, Ha: p > .55, n = 390, p = 57,a = .01 Ans. (a)z* =0.63,p value = .5286, Do not reject the null hypothesis. (6) z* =-2.44, p value = .0073,Reject the null hypothesis. (c) z* = 0.79, p value = .2148, Do not reject the null hypothesis.
  • 220.
    Chapter 10 Inferences forTwo Populations SAMPLINGDISTRIBUTION OF -x,FOR LARGEINDEPENDENTSAMPLES Two samples drawn from two populations are independent samples if the selection of the sample from population 1 does not affect the selection of the sample from population 2. The following notation will be used for the sample and population measurements: p1and p 2 = means of populations 1 and 2 01 and 02 = standard deviations of populations 1 and 2 nl and n2 = sizes of the samples drawn from populations 1 and 2 (nl 230, n2 230) x1and X, = means of the samples selected from populations 1 and 2 - SI and s2 = standard deviations of the samples selected from populations 1 and 2 When two large samples (nl 2 30, n2 2 30) are selected from two populations, the sampling distribution of 51, - 51, is normal. The mean or expected value of the random variable 51, - x, is given by formula (10.I ) : - The standard error of XI- X, is given by (10.2) Figure 10-1 illustrates that 51, - X2 is normally distributed and centers at p i - p 2 .The standard error of the curve is given by formula (10.2). Fig. 10-1 21 1
  • 221.
    212 INFERENCES FORTWO POPULATIONS [CHAP. 10 Formula (10.3)is used to transform the distribution of X,- X, to a standard normal distribution. (10.3) EXAMPLE 10.1 The mean height of adult males is 69 inches and the standard deviation is 2.5 inches. The mean height of adult females is 65 inches and the standard deviation is 2.5 inches. Let population 1 be the population of male heights, and population 2 the population of female heights. Suppose samples of 50 each are selected from both populations. Then XI- ?I2 is normally distributed with mean p,,_,, = kl - p2= 60 - 65 = 4 inches and standard error equal to oxl-si2 = 50 50 = .5 inches. The probability that the "I "2 mean height of the sample of males exceeds the mean height of the sample of females by more than 5 inches is represented by P(X, - X2 > 5). The event X,- x2 > 5 is converted to an equivalent event involving z by using - - - - XI - x 2 -(P,-CLz) - 5 - 4 ox, -x2 5 - 2. From the formula (10.3). The z value corresponding to 51, - x2 = 5 is z = - standard normal distribution table the area to the right of 2 is .5 - .4772 = .0228. Only 2.28% of the time would the mean height of a sample of 50 male heights exceed the mean height of a sample of 50 female heights by 5 or more inches. ESTIMATION OF ~ 1 - ~2 USING LARGE INDEPENDENT SAMPLES The difference in sample means, 51, - X2 ,is a point estimate o f pI - p2.An interval estimatefor pI - p2is obtained by using formula (10.3).Since 95% of the area under the standard normal curve is between -1.96 and 1.96,we have the following probability statement: Solving the inequality inside the parenthesis for interval for pI - p2. - p2, we obtain the following 95% confidence The general form of a confidence ititend for 1-11 - p2 is given by formula (10.4), where z is determined by the specified level of confidence. The z values for the most common levels of confidence are given in Table 10.1, (20.4) - (X,- x,) k z x OK,-KZ Table 10.1 Confidence level Z value 1.96 2.58 EXAMPLE 10.2 Keyhole heart bypass was performed on 48 patients and conventional surgery was performed on 55 patients. The time spent on breathing tubes was recorded for all patients and is sumrnarized in Table 10.2.
  • 222.
    CHAP. 101 INFERENCESFOR TWO POPULATIONS 213 Sample Sample size Mean 1. Keyhole 48 3.0 hours 2. Conventional 55 15.0hours - The difference in sample means is X,- x2 = 3.0 - 15.0 = -12.0. Since population standard deviations are unknown, they are estimated with the sample standard deviations. The estimated standard error of the Standard deviation 1.5hours 3.7 hours d$ferencein sample means is represented by = +- = .544. - ~~ _ _ ~ r - - ~~ 1 Step 1: State the null and research hypothesis. The null hypothesis is represented symbolically by &:pI - I p2 = Do, and the research hypothesis is of the form Ha: pl - p 2# Do or Ha:pl - p2< Door Ha:pl - p2> Do. ' Step 2: Use the standard normal distribution table and the level of significance, a,to determine the rejection region. A 90% confidence interval for p1- p 2 is found by using formula (10.4).The z value is determined to be 1.65 from Table 10.1. The interval is -I 2.0 +_ 1.65 x -544,or -1 2.0+_ 0.9. The interval extends from -12.0 - 0.9 = -12.9 to -12.0 + 0.9 = -1 1.1. That is, we are 95% confident that the keyhole procedure requires on the average from 11.1 to 12.9 hours less time on breathing tubes. Note that when population standard deviations are unknown and the samples are large, that the sample standard deviations are substituted for the population standard deviations. - - X i - X 2 - DO ox, -si? - I Step 3: Compute the value of the test statistic as follows: z* = , where XI - x2 is the TESTING HYPOTHESIS ABOUT pi- p 2 USING LARGE INDEPENDENT SAMPLES Sample 1. Keyhole 2. Conventional The procedure for testing hypothesis about the difference in population means when using two large independent samples is given in Table 10.3. Table 10.3 Sample size Mean Standard deviation 48 3.5 days 1.5days 55 8.0 days 2.0 days I Steps for Testing a Hypothesis Concerning pI - p2:Large Independent Samples computed difference in the sample means, D o is the hypothesized difference in the population means as I ? ? 0 5 given in the null hypothesis, and ( T ~ , - ~ ~ = d : +- or if the population standard deviations are unknown, n 2 l m n SKI-312 = -+- is used as an estimate of ( T ~ , - ~ ~ . 4 : : Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls in the rejection region. Otherwise, the null hypothesis is not rejected. EXAMPLE 10.3 Keyhole heart bypass was performed on 48 patients and conventional surgery was performed on 55 patients. The length of hospital stay in days was recorded for each patient. A summary of the data are given in Table 10.4.
  • 223.
    214 INFERENCES FORTWO POPULATIONS [CHAP. 10 These results are used to test the research hypothesis that the mean hospital stay for keyhole patients is less than the mean hospital stay for conventional patients with level of significance a = .01. The null hypothesis is Ho: pI - p2= 0 and the research hypothesis is Ha: pI - p 2 < 0. This is a lower-tailed test, and the critical value is -2.33, since P(z < -2.33) = .01.This critical value is determined in exactly the same way as it was in chapter 9. The null hypothesis is rejected if the computed test statistic is less than the critical value. The following quantities are needed to compute the value of the test statistic: Do = 0, X,- 51, = 3.5 - 8.0 = -4.5, and the estimated standard error is Sx,-wz= +- = .346.The computed test statistic is: "1 "2 48 55 The null hypothesis is rejected and it is concluded that the mean hospital stay for the keyhole procedure is less than that for the conventional procedure. SAMPLING DISTRIBUTION OF XI-X 2 FOR SMALL INDEPENDENT SAMPLES FROM NORMAL POPULATIONS WITH EQUAL (BUT UNKNOWN) STANDARD DEVIATIONS When the sample sizes are small (one or both less than 30) and the population standard deviations are unknown, the substitution of sample standard deviations for the population standard deviations, as was done in Examples 10.2 and 10.3, is not appropriate. The use of the standard normal table for confidence intervals and testing hypothesis is not valid in this case. Statisticians have developed two different procedures for the case where the sample sizes are small and the population standard deviations are unknown. Both cases require the assumption that both populations have normal distributions. In this section it will also be assumed that the populations have equal standard deviations. A statistical test for deciding whether to assume C T ~= (32 or 01 # 02 utilizes the two sample standard deviations and a statistical distribution called the F distribution. HoMjever, a rule o f thumb used by some statisticiaris states that if .5 I- 5 2, then assume 01 = 02. Otherwise, assume that s1 s 2 0 1 f 0 2 Suppose CT is the common population standard deviation, i. e., 01 = 02 = O. The standard error of -+- . Replacing the two population standard x, - x2 is given by formula (10.2)as ox,-il2 = deviations by CT and factoring it out, we get formula ( I0.5): d e ' z - - - ( I0.5) Now, $, the common population variance, is estimated by pooling the sample variances as a weighted average. The pooled estimator o f d is represented by S2and is given by Replacing d by S2 in formula (10.5),the estimuted standard error of the diflererice in the sample means is obtained and is given in formula (10.7)
  • 224.
    CHAP. 101 INFERENCESFOR TWO POPULATIONS Replacing oxl-K2 by s ~ ~ - ~ ~ in formula (10.3),we obtain 215 (10.7) (10.8) When sampling from two normally distributed populations having equal population standard deviations, the statistic given in formula (10.8)has a t distribution with degrees of freedom given by df=nl+n2-2 (10.9) EXAMPLE 10.4 A sample of size 10 is randomly selected from a normally distributed population and a second sample of size 15 is selected from another normally distributed population. The standard deviation of the first sample is equal to 13.24 and the standard deviation of the second sample is equal to 24.25. The rule of thumb, given above, suggests that the populations may be assumed to have a common standard deviation since Sl s 2 .5 I-= .55 5 2 . A pooled estimate of the common variance is given as follows: (nl-l)s:+(n2-1)~: 9 ~ 1 3 . 2 4 ~ + 1 4 ~ 2 4 . 2 5 ~ = 426.5458 - s2= - nl+n2-2 10+15-2 and a pooled estimate of the common standard deviation is S = J426,r458 = 20.65. The estimated standard error of the difference in sample means is as follows: The computations of S and are required for setting confidence intervals on p1- p2and for testing hypothesis concerning p 1 - p2 when the two samples are small, and the population standard deviations are unknown but assumed equal. ESTIMATION OF ~ 1 - ~2 USING SMALL INDEPENDENT SAMPLES FROM NORMAL POPULATIONS WITH EQUAL (BUT UNKNOWN) STANDARD DEVIATIONS The difference in sample means, 51, -X,, is a point estimate o f pi - p2..An interval estimatefor pi - p2is obtained by using formula (10.8). When independent small samples are selected from two normal populations having unknown but equal standard deviations, the general form of a confidence interval forQpl - p 2 is given by formula (10.10),where t is obtained from the t distribution table and is determined by the level of confidence and the degrees of freedom which is given by df = nl + n2 - 2. The standard error of the difference in sample means, s ~ ~ - ~ ~ , is given by formula (10.7). (10.10) - (XI- x*) *t x sxl-xz
  • 225.
    216 INFERENCES FORTWO POPULATIONS [CHAP. 10 Sample Sample size 1. Keyhole 8 2. Conventional 10 EXAMPLE 10.5 Keyhole heart bypass was performed on 8 patients and conventional surgery was performed on 10patients. The time spent on breathing tubes was recorded for all patients and is summarized in Table 10.5. The difference in sample means is '51, - x2 = 3.0 - 7.0 = -4.0 hours. The pooled estimate of the common population variance is obtained as follows: - Mean Standard deviation 3.0 hours 1.5 hours 7.0 hours 2.7 hours ( n l - I)s:+(nz-l)s: 7x2.25+9x7.29 = 5.085 - s 2 = - nl+ n2 -2 8+10-2 ~ Step I : State the null and research hypothesis. The null hypothesis is represented symbolically by Ho: pl - p2 1 = Do and the research hypothesis is of the form Ha:pI- p2 f Do or Ha:pl - p2< Door Ha:pI- p2> Do. The standard error of the difference in sample means is: 1 Step 2: Use the t distribution table with degrees of freedom equal to nl + n2 - 2 and the level of significance, ~ a,to determine the rejection region. - - - ~ Step 3: Compute the value of the test statistic as follows: t* = _"' , where 51, - x2 is the computed To find a 95% confidence interval for pI - p 2 , we need to find the t value for use in formula (10.10).The degrees of freedom is df = nl + n2 - 2 = 8 + 10 - 2 = 16. From the t distribution table having 16 degrees of freedom, we find the following: P(-2.120 < t < 2.120) = .95. The proper value of t is therefore 2.120. The margin of error is t x Sf,-fz = 2.120 x 1.070 = 2.2684 or 2.27 to 2 decimal places. The 95% confidence interval is -4.0 k 2.27. Assuming that the times spent on breathing tubes are normally distributed for both procedures, the difference pl - p2 is between -6.27 and -1.73 hours with 95% confidence. TESTING HYPOTHESIS ABOUT pi- ~2 USING SMALL INDEPENDENT SAMPLES FROM NORMAL POPULATIONS WITH EQUAL (BUT UNKNOWN) STANDARD DEVIATIONS The procedure for testing a hypothesis about the difference in population means when using two small independent samples from normal populations with equal (but unknown) standard deviations is given in Table 10.6. Table 10,6 Steps for Testing a Hypothesis Concerning p1- p2:Small Independent Samples from Normal Populations with Equal (but Unknown) Standard Deviations I 1 difference in the sample means, D o is the hypothesized difference in the population means as stated in the null (nl- ~)s:+(n2- ~ ) s $ IS2[ +; ) ,where S2 = hypothesis, and = nl+n2-2 * Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls in the rejection region. Otherwise, the null hypothesis is not rejected. I
  • 226.
    CHAP. 101 Sample 1. Keyhole INFERENCESFOR TWO POPULATIONS Sample size Mean Standard deviation 8 3.5 days 1.5 days 217 EXAMPLE 10.6 Keyhole heart bypass was performed on 8 patients and conventional surgery was performed on 10patients. The length of hospital stay in days was recorded for each patient. A summary of the data is given in Table 10.6. These results are used to test the research hypothesis that the mean hospital stay for keyhole patients is less than the mean hospital stay for conventional patients with level of significance a = .01. The null hypothesis is Ho: p1- p2= 0 and the research hypothesis is Ha:p1- p2 < 0. To determine the rejection region, note that the test is a lower-tailed test and the degrees of freedom is df = nl + n2 - 2 = 8 + 10- 2 = 16. Using the t distribution table, we find that for df = 16, P(t > 2.583) = .01, and therefore, P(t < -2.583) = .01.The shaded rejection region is shown in Fig. 10-2. Fig. 10-2 The null hypothesis is to be rejected if the computed test statistic is less than -2.583. The computed test statistic is determined according to step 3 in Table 10.6. The difference in sample means is XI- x2 = 3.5- 8.0 = -4.5. The value specified for Dois 0. The pooled estimate of the commonpopulation variance is - (nl- l)sf+ (n2- l)s$ 7x2.25+9~4.00 = 3.2344 - s2= - nl+ n2 - 2 8+10-2 and the standard error of the difference in sample means is The computed test statistic is -4.5 -0 - Xi -X2-DO t* = - - -5.28. Sw, -R2 353 Since t* is less than the critical value, the null hypothesis is rejected and it is concluded that mean length of hospital stay is smaller for the keyhole procedure than the conventional procedure. This hypothesis test procedure assumes that the two populations are normally distributed. EXAMPLE 10.7 A sociological study compared the dating practices of high school senior males and high school senior females. The two samples were selected independently of one another. The number of dates per month were recorded for each participant. The data are shown in Table 10.7.
  • 227.
    218 ggg d . gg_.. INFERENCES FOR TWO POPULATIONS I ! I I I 1 I < I 1 I , .. ~ .i.... ^. .. ,.:. ........ , . I . . . . . . . . . ! ......... i. ........ . . ( ~ . . . . . . . .! . . . . . . " i . . . . . . :. I I t I 6 > I I I " ' ' ' 1 $ $ r .... .~. . . . . . . ~. . . . . . ..~.,. . . . . . . I . . . . . ~ . . . . . . . ..I. . . . . . .y . . . . . . [CHAP. 1 0 95 -. . . . . .! . . . . . . . -80 -. ..." J ....... . . . . . . . . . . _. .... I I I I 5 I I . i .... . . . . . . . . ......... . . . . . . . . . . . . . . . . . . . . Probability I I I r 7 . ' I I t I t I i $ 1 " . , ~ - ~ " . . " . . . . . . . - - . . - ~ " ~ . . - " ~ " " .... . . . _ . - . . . . . . . . " L . . . . .... ..... .. ........ ........ .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ,05-.. I I I .- I -01_ . ! ." ..!. '. .." '.. , . ( . I I I t I U " Y .I I,<' *. Y ..% I I I I I 1 ? , I .'.. I I ! 1. -001-+. < . XI #,,.> , , . " % . , $ Y I . < I , .* ( , .. .., " .. . . .* * ... " ....... . . .. .... " . I ... < C . I I I } 1 $ I 1 I I I 1 I 1 1 I I 1 1 - Tab1 Males 8 11 7 5 13 10 10 8 9 5 10.7 Females 7 7 12 11 14 10 10 9 4 8 The assumptions of equal variances and normal populations are usually checked out with statistical software before the actual test of equality of means is carried out. This has been made much easier because of the wide- spread availability of statistical software. The checking of assumptions as well as the actual test of equality of population means for the data in Table 10.7 will be illustrated by using Minitab. The commands % normplot mdates and % normplotfdates produced the following plots, which may be used to check for normality. Normal Probability Plot Average: 8.6 StDev:2.54733 N: 10
  • 228.
    CHAP. 101 INFERENCESFOR TWO POPULATIONS 219 Normal Probability Plot ProbabiIity .95- .80- 50 - .20- < .05- (I ~ .. ... I 4 Average: 9.2 StDev: 2.85968 N: 10 fdates Anderson-Darling NormalityTest A-Squared: 0.147 P Value: 0.948 If the sample data are selected from a normal population, the points on the normal probability plot will tend to fall along a straight line. Normality of the population is usually rejected if the p value is less than a = .05. For the above data, the p values are 0.840and 0.948and normality of the populations is not rejected. An edited version of the Minitab procedure to test for equal population variances is shown below. A I in column 1 indicates the response in c2 came from the male group and a 2 indicates the response came from the female group. The assumption of equal variances is usually rejected if the p value shown in Bartlett’s Test (normal distribution) is less than .05. Since the p value is 0.736, the assumption of equal variances is not rejected. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 CI c2 1 8 1 11 1 7 1 5 1 13 1 10 1 10 1 8 1 9 1 5 2 7 2 7 2 12 2 I 1 2 14 2 10 2 10 2 9 2 4 2 8
  • 229.
    220 INFERENCES FORTWO POPULATIONS [CHAP. 10 MTB > %vartest c2 cl Homogeneityof Variance Response C2 Factors C1 Bartlett's Test (normal distribution) Test Statistic: 0.1 14 P value :0.736 Having checked out the normality assumptions and the equal population variances assumption, the test procedure is now conducted. MTB > twot data in c2, groups in cl; SUBC > alternative is 0; SUBC > pooled procedure. Two Sample T-Testand ConfidenceInterval Two sample T for dates Sample N Mean St Dev SEMean 1 10 8.60 2.55 0.81 2 10 9.20 2.86 0.90 95% CI for mu (1) - mu (2): (-3.14, 1.94) T-Test mu (1) = mu (2) (vs not =): T = -0.50 P = 0.63 DF= 18 To use the TWOT command in Minitab, the responses must be in a column and the group from which the responses came must be in another column. The data are entered in columns 1 and 2 as shown above. The null hypothesis is Ho: pl - k2= 0. The alternative command indicates the nature of the research hypothesis. A -1 indicates a lower-tailed test, a 0 indicates a two-tailed test, and a 1 indicates an upper-tailed test. The subcommand pooled procedure indicates that equal population standard deviations are assumed and a pooled estimated is to be computed. The output gives the sample size, mean, standard deviation, and standard error of the mean for both groups separately. A 95% confidence interval for 111 - p2 is given. The computed test statistic is t* = -0.50. The two- tailed p value is 0.63.The degrees of freedom is 18.The pooled standard deviation estimate is S = 2.7 1. SAMPLING DISTRIBUTION OF X,-X 2 FOR SMALL INDEPENDENT SAMPLES FROM NORMAL POPULATIONS WITH UNEQUAL (AND UNKNOWN) STANDARD DEVIATIONS When the population standarddeviations are unknown and it is not reasonable to assume they are equal, the sample variances are not pooled as they were in the previous three sections. Statisticians have developed a test statistic for this case, which has a distribution, which is approximately a t distribution. The standard error of the difference in the sample means is approximatedby substituting the sample variances for the population variances in formula (10.2) as was done in the large sample case to obtain the estimated standard error given in the following: (10.11) In formula (I0.3),we replace~ , l - k zby the expression for s,,-,,in formula (10.1I), and obtain the statistic shown in formula (10.12):
  • 230.
    CHAP. 101 . SampleSample size Mean Standard deviation 1. Keyhole 8 3.5hours 0.5 hour 2. Conventional 10 5.0 hours 1.7 hours INFERENCESFOR TWO POPULATIONS 221 When sampling from two normally distributed populations having unequal population standard deviations, the statistic given in formula (10.12)has an approximatet distribution and the degrees of freedom is given by df = minimum of {(nl - l), (n2- I)} (10.13) EXAMPLE 10.8 A sample of size 13 is taken from a normally distributed population, and a sample of size 15 is taken from a second normally distributed population. The mean of population 1 is 70 and the mean of population 2 is 65. The population variances are unknown, but assumed to be unequal. The following statistic has an approximate t distribution. - - X I - x 2 - 5 sit,-x2 t = The degrees of freedom is df = minimum( 12, 14}= 12. ESTIMATION OF pi- ~2 USING SMALL INDEPENDENT SAMPLES FROM NORMAL POPULATIONS WITH UNEQUAL (AND UNKNOWN) STANDARD DEVIATIONS The difference in sample means, 51,-X2, is apoint estimateof p1- 1 2 . . An interval estimatefor p1- p 2 is obtained by using formula (10.12).When independent small samples are selected from two normal populations having unknown and unequal standard deviations, the general form of a confidence interval for p 1-p 2 is given by formula (10.1#),where t is obtained from the t distribution table and is determined by the level of confidence and the degrees of freedom which is given by df = minimum of {(nl - l), (n2 - 1)). The standard error of the difference in sample means, s ~ , - ~ ~ , is given by formula (10.I 1). EXAMPLE 10.9 Keyhole heart bypass was performed on 8 patients and conventional surgery was performed on 10patients. The time spent on breathing tubes was recorded for all patients and is summarized in Table 10.8. The difference in sample means is Xi- X2 = 3.5- 5.0 = -1.5 hours. Because the ratio of the sample standard deviation for sample 2 divided by sample 1 is 3.4, it is not reasonable to assume that o1= 02. To find a 95% confidence interval for pI- ~ 2 , we need to find the t value for use in formula (Z0.14).The degrees of freedom is df = minimum of { (nl - I), (n2- 1)) = minimum { 7, 9) = 7. From the t distribution table having 7 degrees of freedom, we find the following: P(-2.365 < t < 2.365) = 9 5 . The proper value of t is therefore 2.365. The standard error of the difference in sample means is ssr,-sr2 = Using the sample nl n2
  • 231.
    222 INFERENCES FORTWO POPULATIONS [CHAP. 10 .25 2.89 = -+-= S66.The margin of error is t x s ~ , - ~ ~ = J8 10 standard deviations in Table 10.8,we find that 2.365x .566= 1.3386or 1.34to 2 decimal places. The 95% confidence interval is -1.5 k 1.34.Assuming that the times spent on breathing tubes are normally distributed for both procedures, the difference pI - p2 is between -2.84and -0.16hours with 95% confidence. EXAMPLE 10.10 Mothers Against Drunk Driving (MADD) have pushed for lowering the limit for driving while intoxicated from .1% to .08%.A study of alcohol-related traffic deaths compared the blood alcohol levels of individuals involved in such accidents in states with a 0.08%limit with those in states with a .l% limit. The results are shown in Table 10.9. Minitab will be used to set a 95% confidence interval on pl - p2, where pIis the mean blood-alcohol level of such individuals in states having a .08%limit and p 2 is the mean blood-alcohol level in states having a .l% limit. Table 10.9 0.08%level .12 .05 .07 .09 .07 .04 .12 .I1 .03 .06 .06 .09 .I0 .I0 .I2 .l% level .09 .16 .03 .21 .16 .12 -09 .09 .14 -19 .OS .09 .I 1 .16 .24 Normal probability plots, such as those in Example 10.7, indicate that it is reasonable to assume that the samples are taken from normally distributed populations. An edited version of the Minitab test for equal population variances is shown below. Before executing the Minitab command %vartestc2 cl, the data in Table 10.9are set in columns cl and c2, where cl contains a 1 or a 2,depending on where the sample value in c2 comes from. The set up for the columns is similar to that shown in Example 10.7.Bartlett’s Test is used to test the null hypothesis H o :The population standard deviations are equal vs. the alternative hypothesis H,: The population standard deviations are not equal. The hypothesis of equal standard deviations is usually rejected if the p value for Bartlett’s Test is less than .05. In this case, the p value is equal to 0.025, and equal population standard deviations is rejected. MTB > %vartest c2 cl Homogeneity of Variance Response C2 Factors C1 ConfLvl 95.oooO Bartlett’sTest (normal distribution) Test Statistic: 4.998 P value :0.025 Since it is reasonable to assume that the populations have normally distributed populations with unequal population standard deviations, the techniques of this section are appropriate. The command twot data in c2 groups in cl produces a 95% confidence interval for - p2. If the population standard deviations were
  • 232.
    CHAP. 101 INFERENCESFOR TWO POPULATIONS 223 Sample Sample size Mean Standard deviation 1. Keyhole 8 3.5days 1.2days 2. Conventional 10 8.0 days 3.0 days h assumed to be equal, the subcommand SUBC> pooled procedure would be added to the twot command. The Minitab output gives the sample size, mean, standard deviation, and standard error for both groups. The 95% confidence interval for pl - 12 extends from -0.0829 to -0.014. The edited output is as follows: MTB >twot data in c2 groups in cl Two SampleT-Testand ConfidenceInterval Two sample T for C2 C1 N Mean StDev SEMean 1 15 0.0820 0.0300 0.0078 2 15 0.1307 0.0562 0.015 95% CI for mu (1) - mu (2): (-0.0829, -0.014) TESTING HYPOTHESIS ABOUT 1 . 1 1-1.12 USING SMALL INDEPENDENT SAMPLES FROM NORMAL POPULATIONS WITH UNEQUAL (AND UNKNOWN) STANDARD DEVIATIONS The procedure for testing a hypothesis about the difference in population means when using two small independent samples from normal populations with unequal (and unknown) standard deviations is given in Table 10.10. Table 10.10 Steps for Testing a HypothesisConcerning pI - p2:Small Independent Samples from Normal Populations with Unequal (and Unknown)Standard Deviations IStep 1: State the null and research hypothesis. The null hypothesis is represented symbolically by &:p1- p2 = Do and the research hypothesis is of the form Ha: p1- p2f Do or Ha:p1- p2 c Do or Ha: p1- p 2> Do. Step 2: Use the t distribution table with degrees of freedom equal to df = minimum of { (n, - l), (n2- 1)) and the level of significance, a,to determine the rejection region. - - Step 3: Compute the value of the test statistic as follows: t* = XI- x2 -Do,where X,- X2 is the computed I SX,-xz difference in the sample means, Do is the hypothesized difference in the population means as stated in the null hypothesis, and = nl n2 Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls in the rejection region. Otherwise, the null hypothesis is not rejected. EXAMPLE 10.11 Keyhole heart bypass was performed on 8 patients and conventional surgery was performed on 10patients. The length of hospital stay in days was recorded for each patient. A summary of the data is given in Table 10.11. The null hypothesis H o :pI - p 2 = -4 vs. Ha: - p 2 f -4 is of interest to the researchers involved in the study. In words, the null hypothesis states that the mean hospital stay is 4 days less for the keyhole procedure than for the conventional procedure. The research hypothesis states that the difference in means is not equal to 4 days. Since the sample standard deviation for the conventional procedure is 2.5 times the sample standard deviation for the keyhole procedure, it is assumed that population standard deviations are unequal. The sample observations (which are not shown) indicate that it is reasonable to assume that both populations are normally distributed.
  • 233.
    224 INFERENCES FORTWO POPULATIONS [CHAP. 10 The degrees of freedom for the t distribution is df = minimum of { (nl - l), (n2- 1)) = minimum of { 7, 9) = 7. For a = .01, an area of .005 is allocated to each tail, since the research hypothesis is two-tailed. Using the t distribution with 7 degrees of freedom, it is found that P(t > 3.499) = .005. The critical values are +, 3.499. The estimated standard error of the difference in sample means is : The computed test statistic is: - - 35-8.0-(-4) Xi-Xz-Do - t* = - =-.48 Sii,-ii2 1.039 Since t* is between -3.499 and 3.499, the evidence is not sufficient to reject the null hypothesis. EXAMPLE 10.12 The data in Table 10.9 are used to test the research hypothesis H,: pi - p2 < 0. The statement p 1-p 2 <0 is equivalent to p i < p2.The statistical statement pl < p2 states that the mean blood-alcohol level for individuals involved in alcohol-related traffic deaths is lower in states with a .OS% limit than the mean level in states with a .I% limit. In Example 10.10, it was shown that it was reasonable to assume normal population distributions and unequal population standard deviations. After putting the data in columns c1 and c2 as described in Example 10.10,the followingMinitab output was obtained. MTB > twot data in c2 groups in c1; SUBC > alternative =-1. Two Sample T-Testand ConfidenceInterval Two sample T for C2 C1 N Mean St Dev SEMean 1 15 0.0820 0.0300 0.0078 2 15 0.1307 0.0562 0.0150 95% CI for mu (1) - mu (2): (-0.0829, -0.014) T-Test mu (1) = mu (2) (vs <): T= -2.96 P=0.0038 DF= 21 The subcommand, SUBC> alternative=-1, indicates that a lower-tailed test is required. Note that the research hypothesis is supported at the a = .05 level, since the p value is 0.0038. The computed test statistic is shown to be T = -2.96 and the degrees of freedom is shown as 21. The test statistic is computed as shown in Table 10.10. The degrees of freedom is not computed by df = minimum of { (nl - l), (n2 - 1)) = minimum of { 14, 14) = 14. The degrees of freedom computation is given by a more complicated formula. The use of df = minimum of {(nl - l), (n2 - 1)) gives a more conservative test than that used by Minitab. Both ways of computing the degrees of freedom may be found in various textbooks. SAMPLING DISTRIBUTION OF aFOR NORMALLY DISTRIBUTED DIFFERENCES COMPUTED FOR DEPENDENT SAMPLES The previous sections in this chapter have dealt with independent samples selected from two populations. When sample values are purposely matched or paired, the samples are called dependent samples. The samples are also referred to as paired or matched samples. Dependent samples include pairs of measurements such as beforelafter measurementsmade on the same person or machine, pre- and post-test scores taken on the same person, similar measurements made on twins who have undergone different treatments, and so forth. Table 10.12 illustrates a typical set of paired data. The differences are formed by subtractingthe second sample value from the first.
  • 234.
    CHAP. 101 INFERENCESFOR TWO POPULATIONS Table 10.12 Pair I Sample 1 Sample 2 I Difference I I I dl = X I - Yl d2 = x2 -Y2 Yl I Y2 XI x2 225 The estimation and testing procedures for experiments involving paired samples involve the differences shown in Table 10.12. As in previous sections, we shall investigate the sampling distribution of the mean of the sample differences first, and then apply these results to establish confidence intervals and test hypothesis concerning the mean of the population of all possible differences.The followingnotation will be used for inferences involvingdependent samples. = the mean of the population of paired differences a d = the standard deviation of the population of paired differences d =the mean of the paired differences which are computed from the samples s d = the standard deviation of the paired differenceswhich are computed from the samples - n =the number of paired differenceswhich are computed from the samples 0 , = the standard error of d S, = the estimated standard error of d The sample mean of paired differences is given by - z d n d = - The sample standard deviation of paired differencesis given by The estimated standard error of 3is given by - S d s, - - & (10.15) ( I0.16) (10.17) When the population of differences is normally distributed, the statistic given in formula (10.18)has a t distribution with (n - 1) degrees of freedom. (10.18) This result is perfectly analogous to the previous result given in Chapters 8 and 9; namely the result that the statistic given in formula (10.19)has a t distribution with df = n - 1 when the sample values comprising X come from a normally distributed population having mean equal to p.
  • 235.
    226 Lubricating tears INFERENCES FOR TWOPOPULATIONS [CHAP. 10 Emotional tears (ZO. 19) The confidence interval for jkJ as well as the procedure for testing hypothesis about P d are analogous to the techniques given in Chapters 8 and 9 for estimating the mean of one population or testing an hypothesis about the mean of a population. The only difference between the techniques is that in the case of dependent samples, the procedures are applied to differences. EXAMPLE 10.13 A digital Sphygmomanometergives diastolic blood pressure readings that are 5 units higher on the average than those obtained by the traditional method used by most medical professionals. Suppose 20 individuals have their diastolic blood pressure taken both ways. Let x represent the readings obtained by using the digital Sphygmomanometer, and let y represent the readings obtained by the traditional method. The differences in readings are found by the formula d = x - y. The mean population difference is & = 5 units. Assuming that the population of all differences is normally distributed, the following statistic has a t distribution with df = 20 - 1 = 19: - d -5 t = - S- d ESTIMATION OF COMPUTEDFROM DEPENDENT SAMPLES USING NORMALLY DISTRIBUTED DIFFERENCES When a paired sample of size n is selected and differences are computed as shown in Table 10.12, a confidence interval for the population mean difference, &, can be formed by using the sampling distribution of the statistic given in formula (10.Z8). The confidence interval is given by formula (Z0.20), where t is determined by the level of confidence and df = n - 1. The standard error of d is computed by using formulas (10.16)and (IO.Z7). - d + t x s, (10.20) EXAMPLE 10.14 A psychological study compared the amount of manganese in tears that lubricate the eye with the amount in emotional tears. A measurement, which is related to the amount manganese, was taken for each type of tear on 10different subjects. The results are shown in Table 10.13.The computations needed to set a 99%confidence interval on are given below the table. Subject 1 2 3 4 5 6 7 8 9 10 10 13 11 10 9 8 10 10 8 12 12 12 9 11 7 9 I 1 13 10 10 Difference -2 1 2 -1 2 -1 -1 -3 -2 2
  • 236.
    CHAP. 101 INFERENCESFOR TWO POPULATIONS 227 The sum of the differences is Zd = -2 + 1 + 2 - 1 + 2 - 1 - 1 - 3 - 2 +2 =-3 and the sum of the squares of the differences is Zd2 = 4 + 1 + 4 + 1 + 4 + 1 + 1 + 9 + 4 +4 = 33. The mean difference is a = -.3. The sample standard deviation for the differences is The standard error of a is s d 1.889 s, = J ; ; = Jlo= .597 The t value for a 99% confidence interval is found as follows: The degrees of freedom is df = 10- 1 = 9. From the t distribution table for 9 degrees of freedom, we find that P(-3.250 < t < 3.250) = .99. The 99% confidence interval is found by using formula (20.20) to be: -.3 k 3.25 x S97, or the 99% confidence interval is (-2.24, 1.a). The confidence interval is valid provided the population of differences is normally distributed. EXAMPLE 10.15 A Minitab solution to Example 10.14 is given below. The lubricating tear data is put in column 1 and the emotional tear data is put in column 2. The command let c3 = cl - c2 computes the differences and puts them in column 3. The command name c3 'dB" assigns the name diff to column 3. The command tinterval 99% confidence data in c3 computes the 99% confidence interval for p,,. Note that Minitab computes and prints out the mean difference, the sample standard deviation of differences, the standard error of d,as well as the 99% confidence interval. These quantities are the same as those computed by hand in Example 10.14. Data Display Row lubricate emotion 1 10 2 13 3 11 4 10 5 9 6 8 7 10 8 10 9 8 10 12 12 12 9 11 7 9 11 13 10 10 MTB > let c3 = cl - c2 MTB > name c3 'diff MTB > tinterval99% confidence data in c3 Codidence Intervals Variable N Mean StDev SEMean 99.0%CI diff 10 -0.300 1.889 0.597 (-2.241, 1.641) TESTING HYPOTHESIS ABOUT DIFFERENCES COMPUTEDFROM DEPENDENT SAMPLES USING NORMALLY DISTRIBUTED The procedure for a hypothesis test about the mean differencefor paired samples is given in Table 10.14.
  • 237.
    228 IIWERENCES FORTWO POPULATIONS [CHAP. 10 -~ ~ ~~~ Step 1: State the null and research hypothesis. The null hypothesis is represented symbolically by Ho: Doand the research hypothesis is of the form Ha:& # Do or Ha:pd < Door Ha:& > Do. Step 2: Use the t distribution table with degrees of freedom equal to df = n - 1 and the level of significance, a,to determine the rejection region. = - Table 10.14 Steps for Testing a Hypothesis Concerning &:Normally Distributed Differences Computed from Dependent Samples d-DO - U Compute the value of the test statistic as follows: t* = - , where d = -, Do is the n Step 3: I Sd s d hypothesized value of p d in the null hypothesis, Sa = - and s d = 42 Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls in the rejection region. Otherwise, the null hypothesis is not rejected. I EXAMPLE 10.16 A sociological study concerning marriage and the family was conducted and one of several factors of interest was the educational level of husbands and wives. Table 10.15 gives the number of years of education beyond high school for 15 couples. These data were used to test the null hypothesis that the mean educational levels are the same vs. the research hypothesis that the educational levels are different at level of significance a = .05. Couple 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Table 10.15 Husband 4 0 7 4 2 8 4 3 1 2 4 8 0 4 1 Wife 2 4 6 4 3 4 4 1 0 2 4 2 4 0 4 ~~~~~ Difference 2 -4 I 0 -1 4 0 2 1 0 0 6 4 4 -3 The degrees of freedom is df = 15 - 1 = 14, and since the research hypothesis is two-tailed and a = .05, a t distribution curve with area equal to .025 in each tail will determine the rejection regions. Consulting the t distribution table with df = 14, we find that P(t > 2.145) = .025.Figure 10-3illustrates the rejection regions.
  • 238.
    CHAP. 101 INFERENCESFOR TWO POPULATIONS 229 Fig. 10-3 The sum of the differences is Zd = 8 and the sum of the squares of the differences is Zd2 = 120. The mean c d 8 difference is a= -- - --- S33, and the standard deviation is n 15 s d = i . 1 cd2-(Id)*/ n = 120-g2 /15 z2.8752 The standard error of the mean difference is Sd 2.8752 %=---- - .7424 A - J15 The computed test statistic is - d-Do 533-0 = 0.72 t*= --- - Sd .7424 Since t* = 0.72 does not fall within the rejection region shown in Fig. 10-3, we are unable to conclude that the educational levels of husbands and wives are different. EXAMPLE 10.17 A Minitab solution for Example 10.16 is shown below. The data for the husbands is put into column cl and the data for the wives is put into column c2. The differences are put into c3, and a one-sample t test is performed on c3. The subcommand S U B 0 alternative= 0. indicates a two-tailed alternative hypothesis. Note that the mean, standard deviation, standard error, and computed t values are the same as those computed in Example 10.16.Since the computed p value is 0.48, the null hypothesis is not rejected at the a = .05 level. Data Display Row husband 1 4 2 0 3 7 4 4 5 2 6 8 7 4 8 3 9 1 10 2 11 4 12 8 13 0 14 4 15 1 wife 2 4 6 4 3 4 4 1 0 2 4 2 4 0 4
  • 239.
    230 INFERENCESFOR TWOPOPULATIONS [CHAP. 10 MTB>letc3=cl -c2 MTB > name c3 'diff MTB > ttest mean = 0, data in c3; SUBC > alternative = 0. T-Test of the Mean Test of mu = O.OO0 vs mu not = 0.000 Variable N Mean St Dev SEMean T P diff 15 0.533 2.875 0.742 0.72 0.48 SAMPLING DISTRIBUTION OF PI-Fz FOR LARGE INDEPENDENT SAMPLES Previous sections in this chapter have discussed inferences concerning the differences in means for two populations. The remaining sections of Chapter 10are concerned with inferences concerning the differences in population proportions or percents. How does the percent of defectives produced by machine 1 compare with the percent of defectives produced by machine 2? Is there a difference in the percent of males who will vote for a presidential candidate and the percent of females who will vote for the candidate? Is the percentage of smokers the same for African-Americans as it is for Whites? All of these questions involve the comparison of percents or proportions. Sample proportions will be used to estimate the differencesin population proportions or to test a hypothesis about the differences in population proportions. The following notation will be used. pI and p 2 = proportions in populations 1 and 2 having the characteristicof interest nl and n2 = sizes of the independent samples drawn from populations 1 and 2 pI and p2 = proportions in samples 1 and 2 having the characteristicof interest - q1 = I - pi and q 2 = 1 -p2 = proportions in populations 1 and 2 not having the characteristic q1= 1-p,andq2= 1-ij2 = proportions in samples 1 and 2 not having the characteristic When the samples sizes, nl and n 2 , are such that nlpl > 5, n2p2 > 5, nlql > 5 and n2q2 > 5, the sampling distribution of PI - p2 is normal. The mean or expected value of PI - p2 is given by formula (10.21): - - - The standarderror of is, - p2is given by formula (10.22). (10.2I ) (10.22) When the sample sizes are large enough to satisfy the above requirements,the distribution of F, - is, is as shown in Fig. 10-4.The standard error of the curve is given by formula (I0.22).
  • 240.
    CHAP. 101 INFERENCESFOR TWO POPULATIONS 231 Fig. 10-4 - Formula (10.23) is used to transform the distributionof pl- p2 to a standard normal distribution. - PI -Pz-(PI -P2) 2 = %-P2 (10.23) EXAMPLE 10.18 Three percent of the items produced by machine 1 have a minor defect and 2% of the same item produced by machine 2 have a minor defect. If samples of size 500 are selected from each machine, the distribution of p,- p2will be normal since nlpl = 500 x .03 = 15, n2p2 = 500 x .02 = 10,nlql = 500 x .97 = 485 and n2q2=500 x .98= 490. Normality may be assumed, since all four products are greater than 5. The mean value of PI- Tj2 is .03- .02= .01, and the standard error of F, - p2 is - 1 +m - .03 x .97+ .02 x .98 = .00987. n2 -d 500 500 The probability that the percent defective in the sample from machine 1 exceeds the percent defective in the sample from machine 2 by 3% or more is expressed as P(pl - p2 > .03). The event p1 - p2 > .03 is transformed to an equivalent event involving z by the use of formula (10.23). The equivalent event is = 2.03, or z > 2.03.The event p,- p2> .03 is equivalent to the event z > 2.03, and therefore P(pl - p2> .03) = P(z > 2.03) = .5 - .4788 = .0212. There are only approximately 2 chances out of 100that the percent defective in sample 1will exceed the percent defective in sample 2 by 3% or more. - - - pI-p2- .01 .03 - .01 > .00987 .00987 ESTIMATIONOF Pi -P 2 USING LARGE INDEPENDENTSAMPLES The difference in sample proportions, F1- F2,is a point estimate ofpl - p2. An interval estimate o f pI-p2 is obtained by using formula (10.23). However, the standarderror as given in formula (10.22) will need to be estimated since p1 and p2 are unknown. If the population proportions are estimated by their corresponding sample proportions, we obtain the estimated standard error of PI - p2. The estimated standard error of PI- p2is represented by spI-p, and is given by formula (10.24). The confidence interval for pl -p2 is given by formula(20.25): (10.24) ( I0.25)
  • 241.
    232 INFERENCES FORTWO POPULATIONS [CHAP. 10 The confidence interval is valid provided that nlpl > 5, n2p2 > 5, nlql > 5 and n2q2 > 5. Since PI, 91, p2 ,and q2are unknown, the corresponding sample quantities are substituted to check the validity of using the confidence interval. EXAMPLE 10.19 A survey of 2000 ninth-graders found that 32% had used cigarettes in the past week and a survey of 1500 high school seniors found that 35% had used cigarettes in the past week. Suppose p1represents the proportion of all ninth-graders who used cigarettes in the past week and p2represents the proportion of all seniors who used cigarettes in the past week. A point estimate for pl - p2 is -3%. The estimated standard error of F, - p2is as follows: A 95% confidence interval for pi - p2 is given by (p,- p,) f . z x Siil-ii2or -3% f . 1.96 x 1.614% or -3% f 3.2%. The 95% confidence interval extends from -6.2% to 0.2%. Note that nl x PI = 2000 x .32 = 640, nl x ql = 2000 x .68 = 1360, n2 x p2= 1500x .35 = 525, and n2 x q2= 1500 x .65 = 975. Since all 4 of these quantities exceed 5, the confidence interval is valid. TESTING HYPOTHESIS ABOUT PI -Pz USING LARGE INDEPENDENT SAMPLES The most common null hypothesis concerning pl - p2 is H o :pl - p2 = 0. Recall that in testing hypothesis, the null hypothesis is always assumed to be true and the test statistic is computed under this assumption, If the test statistic value is judged to be highly unusual, then the assumption that the null hypothesis is true is rejected. When Ho is assumed to be true, p1 -p2 = 0 or p1 = p2. If we let p be the common value of p1 and p2, then the standard error of Fl - p2 simplifies as follows: - I Since p and q are unknown, they must be estimated from the two samples. Let XI be the number in sample 1 with the characteristic of interest and let x2 be the number in sample 2 with the characteristic of interest. A pooled estimate o fp is given by (Z0.26) Substituting for p and 4 = 1 - j 7 for q in the above expression for ( J ~ , - ~ , , we obtain the estimated standard error of F, - as given in (10.27) The test statistic for testing the null hypothesis Ho: p1 - p2 = 0 is obtained by using formula (Z0.23) with pl - p2 replaced by 0 andOFl-F2 estimated by s ~ ~ - ~ ~ as given in formula (10.27). The resulting test statistic is given in formula (10.28):
  • 242.
    CHAP. 101 Sample 1. Whites 2,Hispanics INFERENCES FOR TWO POPULATIONS Sample size Number of smokers Sample proportion nl = 1500 X I = 555 p, = 0.37 n2 = 500 ~2 = 175 P, = 0.35 - - 233 (10.28) The steps for testing a hypothesis concerning pI - p2 are given in Table 10.16. Table 10.16 Steps for Testing a Hypothesis Concerning pI - p2: Large Independent Samples nlpl > 5, n2p2 > 5, nlql > 5 and n2q2 >5 Step 1: State the null and research hypothesis. The null hypothesis is represented symbolically by Ho: p1 - p2 = 0 and the research hypothesis is of the form Ha: P I - p2 <0 or Ha: PI- p2 > 0 or Ha:p1- p2 f 0. Step 2: Use the standard normal distribution table and the level of significance, a,to determine the rejection region. - PI-ii, Step 3: Compute the value of the test statistic as follows: z* = -, where PI- p2is the computed I SB1-F2 difference in sample proportions, sF1-B2 = ,and S = 1 - jY. Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls in the rejection region. Otherwise, the null hypothesis is not rejected. I EXAMPLE 10.20 A study was conducted to compare teen cigarette use for whites and Hispanics. Suppose pI represents the proportion of teenage whites who use cigarettes and p2 represents the proportion of teenage Hispanics who use cigarettes. The null hypothesis is Ho:p1- p2= 0 and the research hypothesis is Ha:p1- pz# 0. The critical values for a level of significance equal to .05 are k 1.96.The sample results are given in Table 10.17. xi+ x2 555 +175 The pooled proportion is = -- - = 0.365.The estimated standard error for PI- P, is: n i + n 2 1500+500 and the computed test statistic is: Based on this study, we would not be able to conclude that a difference exists between the proportion of smokers within the two groups of teenagers.
  • 243.
    234 Sample I . Men [CHAP.10 Sample size Mean Standard deviation 250 5.5 years 2.I years INFERENCES FOR TWO POPULATIONS Solved Problems SAMPLING DISTRIBUTION OF XI-X 2 FOR LARGE INDEPENDENT SAMPLES 2. Women 250 3.3 years 1.8years . 10.1 A sample of size 50 is taken from a population having a mean equal to 90 and a standard deviation equal to 15. A second sample of size 70 and independent of the first sample is selected from another population having mean equal to 75 and standard deviation equal to 10. The mean of the sample of size 50 is represented by XIand the mean of the sample of size 70 is represented as -j12 . (a) What type distribution does XI - X2 have'? (b) What is the expected value of XI- X2? (c) What is the standard error of 511 - j12? (d) Transform T1- 512 to a standard normal. Am. (a)Because both samples are 30 or more, 511 - 512 will have a normal distribution. (6)The expected value of XI- X2 is E(Xl - X2) = psr,-srz = p1- p2 = 90 - 75 = 15. = i$+: = J-+- 225 100 = 2.43 (c)The standard error of XI - 512 is 051,-srz 50 70 - - X I - ~2 - 15 ( 4 z = is a standard normal variable. 2.43 ESTIMATION OF pi- ~2 USING LARGE INDEPENDENT SAMPLES 10.2 Table 10.18 gives the summary statistics for the number of years that 250 men and women have spent with their current employers. Use these results to find a 90%confidence interval for pl- p 2 ,the mean difference in years spent with their current employer for men and women. Table 10.18 Am. The general form of the Confidence interval for p1 - p2 is (Tl - T2)+_ z x The difference in means is 51,- x2= 5.5 - 3.3 = 2.2. The z value for a 90% confidence interval is 1.65. The estimated standard error of the difference in means is +- =.1749. The 90% margin of error when using 5(,- 51, as an estimate of pl - p2 is 1.65 x .1749 or .2886. The confidence interval is 2.2 +_ .3. The confidence interval extends from 1.9to 2.5 years. TESTING HYPOTHESIS ABOUT pi- ~2 USING LARGE INDEPENDENT SAMPLES 10.3 Use the data in Table 10.18to test the null hypothesis that p1- p 2 = 1.5 years vs. the research hypothesis that p1- p2> 1.5years at level of significance a = .01,
  • 244.
    CHAP. 101 INFERENCESFOR TWO POPULATIONS 235 Ans. Since the samples are large, the standard normal distribution table is used to find the critical value. From the standard normal table, we find that P(z > 2.33) = .01, and therefore the critical value is 2.33. The computed value of the test statistic is given as follows: The standard error of the difference in means is replaced by the estimated standard error of the difference in means. It is concluded that the mean exceeds 1.5years. SAMPLING DISTRIBUTION OF XI-x,FOR SMALL INDEPENDENT SAMPLES FROM NORMAL POPULATIONS WITH EQUAL (BUT UNKNOWN) STANDARD DEVIATIONS 10.4 A sample of size 10is taken from a normal population having a mean equal to 35 and a sample of size 15 is taken from another normal population having mean equal to 40. The two normal populations have equal variances. The mean and variance of the sample of size 10 are represented by F1and S: and the mean and variance of the sample of size 15 are represented by X2 and s;. (a) Give the expression for the pooled estimate of the common population variance. (b) Give the expression for the estimated standard error of the difference in the sample means. (c) Use the results in parts (a) and (b) to form a statistic that has a t distribution with 23 degrees of freedom. (nI -1)s: +(n2 - 1) s$ - 9 x S : +14x S: nl+ n2 - 2 23 Ans. (a) S2= - (b) = /S2[ +--!--I = lis..(”), 10 1s where S2is given in part (a). ESTIMATION OF pi -j ~ 2 USING SMALLINDEPENDENT SAMPLES FROM NORMAL POPULATIONS WITH EQUAL (BUT UNKNOWN) STANDARD DEVIATIONS 10.5 A comparison of motel room rates for single occupancy was made for the cities of Omaha, Nebraska and Kansas City, Missouri. The rates for the two cities are shown in Table 10.19. Using the command % normplot of Minitab for both samples, it is found that it is reasonable to assume that both populations are normally distributed. Using the command %vartestof Minitab, it is also found that it is reasonable to assume that the populations have equal variability. The Minitab output for setting a 99%confidence interval on p1- p 2 is given below. Verify the Pooled standard deviation, the degrees of freedom, and the 99% confidence interval given in the output.
  • 245.
    236 INFERENCES FORTWO POPULATIONS [CHAP. 10 Table 10.19 I Omaha 75 70 70 65 85 85 80 90 90 60 60 Kansas City 80 75 75 85 85 100 60 65 95 105 70 MTB > twot 99% confidence data in c2, groups in c 1; SUBC > pooled. Two SampleT-Testand ConfidenceInterval Two sample T for rate city N Mean St Dev SEMean 1 1 1 75.5 11.3 3.4 2 11 81.4 14.3 4.3 99%CI for mu (1) - mu (2): (-2 1.6, 9.7) T-Test mu (1) = mu (2) (vs not =):T =-1.07 P = 0.30 DF = 20 Both use Pooled St Dev = 12.9 - Ans. The difference in sample mean is '51, - x2 = 75.5 - 81.4 =-5.9. (nl -1)s: +(n2 - I) s! 10x 127.69+10x204.49 = 166.09,and S = - - The pooled variance is S2 = nl+ n2 -2 11+11-2 4166.09 = 12.9. The standard error of the difference in the sample means is SxI-x2 = & 2 [ ; + ; ] = 1 7 3 166.09~-+- ~ 5 . 5 0 . Using the t distributiontable with df = 20 and right-hand tail area equal to .005, we find the t value is 2.845. The 99% margin of error is 2.845 x 5.50 = 15.6. The 99% confidence interval extends from -5.9 - 15.6= -21.5 to -5.9 + 15.6= 9.7. TESTING HYPOTHESIS ABOUT ~ 1 - ~2 USING SMALL INDEPENDENT SAMPLES FROM NORMAL POPULATIONS WITH EQUAL (BUT UNKNOWN) STANDARD DEVIATIONS 10.6 Use the motel rate data in Table 10.I9 and the Minitab output given in problem 10.5 to test the null hypothesis Ho:pI- p2 = 0 vs. Ha:pl -p 2 # 0 at significance level a = .O 1. (a) Give the critical values for performingthe test. (b) Give the computed value of the test statistic, and your conclusion based upon this value and the critical value in part (a). (c) Give the p value and your conclusion based on this value.
  • 246.
    CHAP. 101 Ans. (a) INFERENCESFOR TWO POPULATIONS 237 The degrees of freedom for the t distribution is 20. Since the research hypothesis is two-tailed, the significance level is divided by 2 and .005 is put into each tail of the distribution. By consulting the t distribution table, we find that for df = 20 and right-tail area = .005, the t value is 2.845. The critical values are k2.845. The computed t value, from the Minitab output in problem 10.5,is t* = -1.07. Since this value does not fall in the rejection region, the null hypothesis is not rejected. It cannot be concluded that the mean motel rates differ for the two cities. The p value, from the Minitab output in problem 10.5,is equal to 0.30. Since this exceeds the preset level of significance,the null hypothesis is not rejected. The same conclusion is reached as in part (b). SAMPLING DISTRIBUTION OF Xi-X,FOR SMALL INDEPENDENT SAMPLES FROM NORMAL POPULATIONS WITH UNEQUAL (AND UNKNOWN) STANDARD DEVIATIONS 10.7 Refer to problem 10.4.Suppose the two populations have unequal population variances. What changes are needed to the answers given in the problem? Am. Since the population variances are unequal, the sample variances are not pooled together to estimate a common population variance. The standard error of the difference in the sample means - - X i - X 2 4 - 5 is given by sxl-siz has a t distribution and sx,-iT2 the degrees of freedom is given by df = minimum of { (nl - l), (n2- 1)) = minimum of { 9, 14) = 9. Note that the degrees of freedom is reduced from 23 to 9. This problem illustrates the importance of checking the assumptions underlying the estimation and testing procedures. The computation of the test statistic as well as the degrees of freedom is determined by the assumption concerning the variances of the two populations. ESTIMATION OF 1 1 -~2 USING SMALL INDEPENDENT SAMPLES FROM NORMAL POPULATIONS WITH UNEQUAL (ANDUNKNOWN) STANDARD DEVIATIONS 10.8 Refer to problem 10.5. Compute the standard error of the difference in means, the degrees of freedom, and the 99% confidence interval assuming unequal population variances. Compare the results with those in problem 10.5. Ans. The standard error of the difference in sample means isSsr,-n2= +- nl n2 I 1 11 which equals 5.50. This is the same answer obtained in the equal variances case. This will always occur if the sample sizes are equal. The degrees of freedom is df = minimum of ( ( n l - I), (n2 - l)] = minimum of { 10, 10) = 10. Using the t distribution table with df = 10and right-hand tail area equal to .005, we find the t value is 3.169. The 99% margin of error is 3.169 x 5.50 = 17.4. The 99% confidence interval extends from -5.9 - 17.4= -23.3 to -5.9 + 17.4 = 11.5. The 99% confidence interval in the equal variances case is (-21.5, 9.7). The 99% confidence interval for the unequal variances case is (-23.3, 11.5). Note that the interval in the unequal variances case is wider.
  • 247.
    238 INFERENCES FORTWO POPULATIONS [CHAP. 10 Sample 1. College graduate 2. Non-college graduate TESTING HYPOTHESIS ABOUT PI- ~2 USING SMALL INDEPENDENT SAMPLES FROM NORMAL POPULATIONS WITH UNEQUAL (AND UNKNOWN) STANDARD DEVIATIONS Sample size Mean Standard deviation 14 8.6 hours 1.1 hours 12 6.3 hours 2.7 hours 10.9 In a study of internet users, the average time spent online per week was determined for a group of college graduates as well as a group of non-college graduates. The results of the study are shown in Table 10.20.Test the research hypothesis that > p 2 at level of significancea = .05. Give the critical value, the computed test statistic, and your conclusion. Assume that the times are normally distributed for both populations. Pair 1 2 3 4 5 6 Table 10.20 Sample 1 Sample 2 Difference 18 15 dl= 18- 15 = 3 23 22 d2=23-22= 1 27 25 d3 = 27 - 25 = 2 20 22 d4 = 20 - 22 = -2 19 19 d5 = 19- 19= 0 24 22 d6 = 24 - 22 = 2 Ans. Some statisticians use the following rule to decide whether population variances are equal or not: If.5 I-5 2, then assume 01 = 02. Otherwise,assume that 01 f 02. Since the ratio of the sample is less than .5, we assume that the populations have unequal standard deviations. The degrees of freedom for this model is df = minimum of {(nl - l), (n2- I)} = minimum of { 13, 1 1}= 11. The critical value is determined by using the t distribution table with df = I 1 and right-tail area equal to .OS. This value is found to equal 1.796. S1 s 2 The standard error of the difference in sample means is = nl n2 14 12 - - XI -~2 -(PI -~ 2 ) - 8.6-6.3- 0 ,833. The computed test statistic is t* = - = 2.76. The research srr,-x2 .833 hypothesis is supported since t* = 2.76 exceeds the critical value, 1.796. SAMPLING DISTRIBUTION OF d FOR NORMALLY DISTRIBUTED DIFFERENCES COMPUTEDFOR DEPENDENT SAMPLES 10.10 Table 10.21 gives a set of paired data, along with the differences for the pairs. Answer the following questionsconcerning these paired data. Sd U n /a2 -n(y)2 / n ,and sa = - (a) Find the following: 7= -, s d = A * (b) What parameters do each of the statistics in part (a)estimate? (c) What assumption is needed in order that the statistic t = k&- have a t distribution - Sa with (n - 1) = (6- 1) = 5 degrees of freedom? Table 10.21
  • 248.
    CHAP. 101 INFERENCESFOR TWO POPULATIONS 239 Pair 1 s d 1.789 = 1.789, and S;i = -- -= 0.73. & - African-American White 65 60 (6) The statistic, 3, estimates ~.ld ,the mean of the population of paired differences. The statistic, s d , estimates a d ,the standard deviation of the population of paired differences. The statistic, ss,estimatesoa,the standard error of the population of paired differences. (c) It is assumed that the population of paired differences is normally distributed. 2 3 4 5 6 7 8 9 10 ESTIMATION OF COMPUTEDFROM DEPENDENTSAMPLES USING NORMALLY DISTRIBUTED DIFFERENCES 50 75 80 105 90 65 60 115 80 10.11 A sociological study compared the salaries of 10professional African-American women with the salaries of 10 corresponding professional White-American women. The women were paired according to certain salient characteristics and the 10 pairs were chosen from ten different professions. The salaries (in thousands) are shown in Table 10.22. Table 10.22 Difference p- 55 70 105 10 90 -10 The Minitab command tinterval 90 percent confidence data in c3 is used to produce the following output. The differences are computed and put into column c3. The command requests a 90%confidence interval on the mean difference. MTB>letc3=cl-c2 MTB >print c l -c3 Data Display Row Afamer white diff 1 65 60 5 2 50 55 -5 3 75 70 5 4 80 75 5 5 105 95 10 6 90 100 -10 7 65 70 -5 8 60 50 10 9 115 105 10 10 80 90 -10 MTB > tinterval90 percent confidencedata in c3
  • 249.
    240 INFERENCES FORTWO POPULATIONS [CHAP. 10 Variable N Mean St Dev SEMean 90.0%CI diff 10 1.50 8.18 2.59 (-3.24,6.24) Verify that output. = 1SO, s d = 8.18, SJ = 2.59, and that the confidence interval is as given in the Ans. Using the differencesgiven above, we find that E l = 15, and 3 = 1.5. We also find that Zd2 = w2-(a)2 / n 625- 22.5 625, and s d=, / - ' =/T= 8.18. The estimated standard error of is Sa = n-1 s d 8.18196 = 2.59. --- The confidence interval is given by f t x SJ. The t value for a 90% confidence interval is found as follows. The degrees of freedom is df = 10- 1= 9. For a 90%confidence interval 5% is put into each tail of the t distribution.Using the t distributiontable with 9 degrees of freedom and a right-hand tail area equal to .05,we find the t value to be 1.833.The 90%margin of error is t x S;i = 1.833x 2.59 = 4.75. The 90%confidence interval goes from 1S O - 4.75 = -3.25 to 1.50+4.75 = 6.25. TESTINGHYPOTHESIS ABOUT ~.la USING NORMALLY DISTRIBUTED DIFFERENCES COMPUTEDFROM DEPENDENT SAMPLES 10.12 Use the data in Table 10.22 to test the research hypothesis # 0 at level of significance a =.01. Ans. The degrees of freedom is one less than the number of pairs, i.e., df = 9. Since the research hypothesis is two-tailed, tail areas equal to .005 are put into both tails of the t distribution. From the t distribution table the critical value is determined by using a right-tail area equal to .005. The critical value is equal to 3.250. From problem 10.11, the followingare found: d = 1S O , SJ = 2.59, and D o = 0. The computed test statistic is t* = -- - -- - 0.58. Since this value falls between -3.250 and 3.250, the null hypothesis is not rejected. - d - D o 150-0 SS 259 SAMPLINGDISTRIBUTION OF PI-;Pz FOR LARGE INDEPENDENT SAMPLES 10.13 Population 1 has pl = ,010 and population 2 has p2 = .005. Independent samples of size 5000 each are selected from both populations. Find the probability that PI- p2exceeds .015? - - Ans. The mean value of E, - p2 is .010- .005 = .005. The standard error of PI- pz is as follows: Since nlpl> 5, n2p2> 5 , nlql > 5 and n2q2> 5, Fl - T2has a normal distribution.We are asked to - - is used to pI-ij2-(pI-p2) - - pl-?j,-.005 find P(p, - F2 > .015). The transformation z = w , - v 2 .001725 - transform the statistic Fl - p2 to a standard normal variable. The same transformation on .015
  • 250.
    CHAP. 101 INFERENCESFOR TWO POPULATIONS 241 .015 - ,005 .001725 yives the value = 5.80. We have the following: P(F, - F2> .015) = P(z > 5.80). which is approximately0. It is highly unlikely that pl- F2exceeds .015. ESTINIATION OF Pi -P2 USING LARGE SAMPLES 10.14 Three thousand commuters in both New York and Chicago were surveyed and the percentage of commuters who took more than 60 minutes to get to work were determined for both groups. It was found that 16.5% in New York and 10.7% in Chicago required more than 60 minutes. Set a 90%confidence interval on p1- p2 ,where p1 correspondsto New York. - Ans. The confidence interval is given by (jjl- p2) f z x sil-i2, where sjil-B2 = n2 The z value for 90%confidenceis 1.65.The differencein samplepercentages is 16.5%- 10.7% = 5.8%. The standard error is sil-B2= = 0.88%.The 90%margin of error is 1.65x 0.88 = 1.5%.The 90%confidence interval extends from 5.8 - 1.5 = 4.3%to 5.8 + 1.5= 7.3%. TESTING HYPOTHESIS ABOUT Pi -P2 USING LARGE INDEPENDENT SAMPLES 10.15 A survey of 50 men and 50 women was conducted and it was found that 16of the men and 10 of the women used hotel room minibars. Test the hypothesis that pl = p 2 vs. pl # p2 at level of significancea = .05. - XI +x2 n l + n2 Ans. The critical values are f 1.96. The pooled estimate of the common proportion is jT = -- 16+10 50+50 -- - .26. The standard error of the difference in proportions is SF,-p2 i.26 x .74($+&) = .0877.The differencein sampleproportions is - 55, = .32 - .20= .16. - P1-pz .16 The computed test statistic is z* = -- - -- - 1.82. Since the computed test statistic does s V P 2 .Of377 not exceed the critical value, we cannot conclude that there is a difference between men and women users of hotel room minibars. Supplementary Problems SAMPLINGDISTRIBUTIONOF XI- X,FOR LARGEINDEPENDENTSAMPLES 10.16 A sample of size 100 is selected from a population having mean 75 and standard deviation 3. Another independent sample of size 100 is selected from a population having mean 50 and standard deviation 4. Verify that j i l - X2 has mean 25 and standard deviation0.5. (a) What percent of the time will sT1 - 512 fall within 0.5 of 25? (b) What percent of the time will T1 - 512 fall within 1.0of 25? (c) What percent of the time will - 512 fall within 1.5 of 25?
  • 251.
    242 Sample 1 2 INFERENCESFOR TWO POPULATIONS Ans.(a)68% (b)95% (c) 99.7% ESTIMATIONOF pi- ~2 USING LARGEINDEPENDENTSAMPLES Sample size Mean Standard deviation 50 $9,500 $1,250 75 $9,125 $950 [CHAP. 10 Chatty toddlers 75 10.17 Quiet toddlers 80 The information shown in Table 10.23 was obtained from two independent samples selected from two populations. (a) Give a point estimate for p~ - p2. (b) Find a 99% confidence interval for pI-p2. Am. (a)$375 (b)375 k 2.58 x 208.05 or -$161.77 to $911.7 TESTING HYPOTHESIS ABOUT -~2 USING LARGE INDEPENDENTSAMPLES 10.18 Use the data given in problem 10.17 to test the hypothesis Ho: p 1- p 2 = 0 vs. Ha:pl - p2> 0 at level of significance a= .OI. (a) Give the computed test statistic. (6) Give the p value. (c) Give your conclusion. 375-0 208.05 Ans. (a) z* = - - - 1.80 (6) p value = .5 - .4641= .0359 (c) Do not reject H o since p value > a. SAMPLINGDISTRIBUTIONOF ?i,- x2FOR SMALLINDEPENDENTSAMPLESFROM NORMAL POPULATIONSWITH EQUAL (BUTUNKNOWN)STANDARDDEVIATIONS 10.19 A psychological study compared the language skills and mental development of two groups of two-year- olds. One group consisted of chatty toddlers and the other group consisted of quiet children. The scores on a test which measured language skills are shown for the two groups in Table 10.24. Use Minitab to determine whether it is reasonable to assume the populations of test scores are normally distributed and also determine if it is reasonable to assume that two populations have equal standard deviations. tble 10.24 70 70 65 85 85 80 90 90 60 60 75 65 70 90 90 75 85 90 75 80
  • 252.
    CHAP. 101 INFERENCESFOR TWO POPULATIONS 243 Ans. The following normal probability plots were produced by Minitab. Beneath the graph, the mean, standard deviation, and sample size are shown as well as a p value for the Anderson-Darlington normality test. The p value corresponds to a null hypothesis, which states that the sample data were selected from a normally distributed population. This hypothesis is rejected if the p value is less than a = .05. Otherwise, normality is usually assumed. In this case, it is safe to assume that both samples were obtained from normally distributed populations since the p value for the chatty sample is 0.433 and the p value for the quiet sample is 0.368. Normal Probability Plot .999 .99 .95 .80 .50 ProbabiIity2o .05 .01 ,001 Average: 75.4545 StDev: 11.2815 N:11 , ... . . . . . . 1 I 1 1 60 70 80 90 chatty Anderson-Darling NormalityTest A-Sauared:0.337 PValue: 0.433 Normal Probability Plot . . . . . . . . . . . . . . . . . . : ........................ " .... .;. ................................ ; . _ .. , .001 -. . . . . . . . . . . . . . . . . . ........................ - .." ',"". . . . . . . . . . . . . ............-.. ~ I I 70 Average: 79.5455 StDev: 8.50134 N: 11 80 90 quiet Anderson-Darling Normality Test A-Squared: 0.365 P Value: 0.368
  • 253.
    244 INFERENCES FORTWO POPULATIONS [CHAP. 10 The following Minitab output may be used to test for equal population standard deviations. The null hypothesis of equal population standard deviations is rejected if the p value corresponding to Bartlett’s test is less than a = .05. In this case, the p value equals 0.386, and is reasonable to assume that o1= 02. MTB > %vartest c2 cl Homogeneity of Variance Response score Factors sample ConfLvl 95.0000 Bartlett’sTest (normal distribution) Test Statistic: 0.752 Pvalue : 0.386 ESTIMATION OF pi- ~2 USING SMALL INDEPENDENTSAMPLES FROM NORMAL POPULATIONS WITH EQUAL (BUT UNKNOWN) STANDARD DEVIATIONS 10.20 Refer to the data in Table 10.24. Minitab was used to set a 90% confidence interval on pI- p2,The output is shown below. Using the output, give a 90% confidence interval for pI- p2 MTB > twot 90%confidence, data in c2, sample number in c1; SUBC > pooled. Two Sample T-Test and Confidence Interval Two sample T for score sample N Mean St Dev SEMean 1 11 75.50 11.30 3.4 2 1 1 79.55 8.50 2.6 90%CI for mu (1) - mu (2):(-1 1.4, 3.3) T-Test mu (1) = mu (2) (vs not =): T= -0.96 P=0.35 DF= 20 Both use Pooled St Dev = 9.99 Am. The 90% interval extends from -1 1.4 to 3.3. TESTING HYPOTHESIS ABOUT pi- ~2 USING SMALL INDEPENDENT SAMPLES FROM NORMAL POPULATIONS WITH EQUAL (BUT UNKNOWN) STANDARD DEVIATIONS 10.21 Refer to problems 10.19 and 10.20.Suppose the research hypothesis is that the language skills scores differ for the two groups. Give the computed test statistic and the corresponding p value. Ans. The computed test statistic is t* = -0.96 and the p value is 0.35. SAMPLING DISTRIBUTION OF NORMAL POPULATIONS WITH UNEQUAL (AND UNKNOWN) STANDARD DEVIATIONS FOR SMALL INDEPENDENT SAMPLES FROM 10.22 Thirty individuals who suffered from insomnia were randomly divided into two groups of 15 each. One group was put on an exercise program of 40 minutes per day for four days a week. The other group was not put on the exercise program and served as the control group. After six weeks, the time taken to fall asleep was measured for each individual in the study. The results are given in Table 10.25.
  • 254.
    CHAP. 101 INFERENCESFOR TWO POPULATIONS Table 10.25 Exercise group 15 17 15 15 16 16 17 14 14 15 15 17 13 14 18 Control group 19 22 25 30 31 15 16 19 19 30 27 28 22 13 32 245 The Minitab output for the test of equal standard deviations is shown below. Would you assume equal or unequal population standard deviations? MTB > %oartest c2 cl Homogeneity of Variance Response C2 Factors C1 ConfLvl 95.OOOO Bartlett’sTest (normal distribution) Test Statistic: 23.039 P value : 0.000 Ans. The Minitab procedure is used to test Ho: 01 = o2vs. Ha:olf 02, Since the p value = 0.0o0,the null hypothesis should be rejected. Assume that the standard deviations are not equal for the two groups. ESTIMATIONOF 1.11 -1.12 USING SMALL INDEPENDENTSAMPLESFROM NORMAL POPULATIONSWITH UNEQUAL(AND UNKNOWN)STANDARDDEVIATIONS 10.23 Refer to the data in Table 10.25. The Minitab analysis for setting a 99% confidence interval on p1 - p2 is shown below. This analysis assumes unequal population standard deviations. MTB >twot 99% data in c2 groups in cl Two SampleT-Test and ConfidenceInterval Two sample T for C2 c 1 N Mean St Dev SEMean 1 15 15.40 1.40 0.36 2 15 23.20 6.27 1.60 99% C1for mu (1) -mu (2): (-12.69, -2.9) T-Test mu (1) = mu (2) (vs not =): T = -4.70 P = 0.0003 DF = 15
  • 255.
    246 INFERENCES FORTWO POPULATIONS [CHAP. 10 Rather than finding the degrees of freedom by using df = minimum of { nl - 1, n2 - 1 ), Minitab uses a different formula. If the degrees of freedom are found using df = minimum of ( 14, 14) = 14, the t value is 2.977. The t value, using df = 15,is 2.947. The confidence interval will be approximately the same for either value. Using df = 14, find a 90% confidence interval for pl-p2, 1.40' 6.27* Ans. (15.40- 23.20) k 1.761 x - i15 ' 1 5or -7.8 k 2.9 or (-10.7, -4.9) TESTING HYPOTHESIS ABOUT -J.L~USING SMALL INDEPENDENT SAMPLES FROM NORMAL POPULATIONS WITH UNEQUAL (AND UNKNOWN) STANDARD DEVIATIONS 10.24 Refer to problems 10.22and 10.23. Is there a difference in the time to go to sleep for the two groups? Ans. Yes, the computed test statistic is t* = -4.70, and the p value is 0.0003. SAMPLING DISTRIBUTION OF a FOR NORMALLY DISTRIBUTED DIFFERENCES COMPUTED FOR DEPENDENT SAMPLES 10.25 Table 10.26 gives the diastolic blood pressure before treatment and six weeks after treatment is started for 10hypertensive patients. The statistic, t = - -cLd , has a t distribution with df = n - 1, provided the differences have a normal distribution. The basic normality assumption needs to be verified before setting a confidence interval on or testing a hypothesis concerning k, The normal probability plot in Minitab can be used to test the following hypothesis: H o :The differences are normally distributed vs. Ha: The differences are not normally distributed. If the level of significance is set at the conventional level of significance a = .05, then the null hypothesis is rejected if the p value < a. Using the Minitab output shown below and on the next page, what decision do you reach concerning the assumption of normality for the differences? - S;i Patient 1 2 3 4 5 6 7 8 9 10 Tablc Before 90 100 95 85 99 105 90 97 99 110 10.26 After 80 85 83 75 85 85 80 79 85 90 Difference 10 15 12 10 14 20 10 18 14 20 AIZS.The p value for the Anderson-Darling Normality test is 0.233. The null hypothesis is not rejected, and it is assumed that the differences are normally distributed.
  • 256.
    CHAP. 101 . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . I - . , I INFERENCES FOR TWO POPULATIONS Normal Probability Plot 247 .999 .99 .95 .80 Probability .50 .20 .05 .01 ,001 1 % . . . . . . . . . ~, I , s., . I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . < . . . . . . . . . . . . . . . . p' . . . . . . . . . . . . . . . . . . . . . . . . I 1 1 10 Average: 14.3 StDev: 3.94546 N: 10 I 15 difference 1 20 Anderson-DarlingNormalityTest A-Squared:0.438 P Value: 0.233 ESTIMATION OF FROM DEPENDENTSAMPLES USING NORMALLYDISTRIBUTEDDIFFERENCES COMPUTED 10.26 Verify the output shown in the following Minitab output for a 90% confidence interval for the mean difference in the blood pressures given in problem 10.25. MTB > tinterval90 percent confidence, data in cl ConfidenceIntervals Variable N Mean St Dev SEMean 90.0 %CI diff 10 14.30 3.95 1.25 (12.01, 16.59) Ans. Cd = 143 cd2= 2185 d = 14.3 s d = 3.9455 S j = 1.2477 t = 1.833 - d kt x Sa is found to be (12.01, 16.59). TESTING HYPOTHESISABOUT COMPUTEDFROM DEPENDENTSAMPLES USING NORMALLYDISTRIBUTEDDIFFERENCES 10.27 Verify the computed test statistic in the below Minitab output for the differences in problem 10.25. MTB > ttest mu = 0 data in cl T-Test of the Mean Test of mu = 0.00 vs mu not = 0.00 Variable N Mean St Dev SEMean T P diff 10 14.30 3.95 1.25 11.46 0.0000 - d-D, 14.30-0 Ans. t*= -- - = 11.44 S3 I.25
  • 257.
    248 INFERENCES FORTWO POPULATIONS [CHAP. 10 SAMPLINGDISTRIBUTIONOF Pi - P2 FOR LARGEINDEPENDENTSAMPLES 10.28 In a study of 100surgery patients, 50 were kept warm with blankets after surgery and 50 were kept cool. Eight of the warm group developed wound infections and 14 of the cool group developed wound infections. Let pi represent the proportion of all surgery patients kept warm after surgery who develop wound infections and let p2 represent the proportion for the cool group. The sample proportions for the two groups are P1= .16 and p2= .28.The difference in sample proportions, Pr- P2,will have a normal distribution provided that nlpl > 5, n2p2> 5, nlql> 5, and n2q2> 5. Since the population proportions are unknown, these conditions cannot be checked directly. In practice, the conditions are checked out by substituting the sample proportions for the population proportions. Use the sample proportions to check out the requirements for assuming normality. Ans. nip, = 8, n2F2 = 16, nlq, = 42, and n2q, = 34 p1- p2has a normal distribution. ESTIMATION OF PI -P2 USING LARGESAMPLES 10.29 Refer to problem 10.28.Find a 90%confidence interval for p1- p2. - - X;Tr ; Pzx% . Ans. The 90% confidence interval is (PI- p2) k z x SFIUF2, where SFl-Fz = nl n2 (.16 - .28)+_ 1.65x .0820 or -.I2 +_ .14 or (-.26 ,.02) TESTING HYPOTHESISABOUT PI-P2 USING LARGEINDEPENDENTSAMPLES 10.30 Refer to problem 10.28.Test H o :p1 - pz= 0 vs. Ha:p1 - p2< 0 at a = .05. Ans. The pooled estimate of the common proportion is jT = = 0.22. The standard error is n l + n2 - P,-F, SF,-B2 STilPF2 = ,/- = .0828. The computed test statistic is z* = -= -1.45. The critical value is -1.65. Do not reject the null hypothesis.
  • 258.
    Chapter 11 0.15 - 0.10- 0.05 - 0.007 Chi-square Procedures df Mean 5 5 10 10 15 15 CHI-SQUAREDISTRIBUTION Mode Standard deviation 3 3.16 8 4.47 13 5.48 The Chi-square procedures discussed in this chapter utilize a distribution called the Chi-square distribution. The symbol x2 is often used rather than the term Chi-square. The Greek letter x is pronounced Chi. Like the t distribution, the shape of the x2distribution curve is determined by the degrees of freedom (df) associated with the distribution.Figure 11-1 shows x2distributions for 5, 10, and 15 degrees of freedom. I I I I = 15 0 10 20 30 Fig. 11-1 Table 11.1 gives some of the basic propertiesof x2distributioncurves. Table 11.1 Properties of the x2Distribution 1. The total area under a x2curve is equal to one. 2. A x2 curve starts at 0 on the horizontal axis and extends indefinitely to the right, 3. A x2curve is always skewed to the right. 4. As the number of degrees of freedom becomes larger, the x2 curves look more and 5. The mean of a x2distribution is df and the variance is 2df. 6. When the degrees of freedom is 3 or more, the peak of the x2 curve occurs at df -2. approaching, but never touching the horizontal axis. more like normal curves. This value is the mode of the distribution. EXAMPLE 11.1 Table 11.2 gives the mean, mode, and standard deviation for each of the three x2 curves shown in Fig. 11-1. The mean, mode, and standard deviation are determined by using properties 5 and 6 from Table 11.1. 249
  • 259.
    250 CHI-SQUARE PROCEDURES df [CHAP.11 .995 .990 .975 .950 .900 .lOO .050 .025 .010 .005 CHI-SQUARETABLES 5 0.412 The area in the right tail under the x2distribution curve for various degrees of freedom is given in the Chi-square tables found in Appendix 4. Example 11.2illustrates how to read this table. 0.554 0.831 1.145 1.610 9.236 11.070 12.833 15.086 , 16.750 EXAMPLE 11.2 Table 11.3 contains the row corresponding to df = 5 from the Chi-square distribution table. Figures 11-2 and 11-3 give Chi-square curves having df = 5. The shaded area to the right of 11.070 in Fig 11-2 is .050. The shaded area to the right of 1.610 in Fig 11-3 is equal to .900.These areas and Chi-square values are shown in bold print in Table 11.3. Table 11.3 I I Area in the right tail under the Chi-square distribution curve I Fig. 11-2 Fig. 11-3 GOODNESS-OF-FITTEST In many situations, each element of a population is assigned to one and only one of k categories or classes. Such a population is described by a multinomial probability distribution. Example 1 1.3 describes such a population and illustrates the structure of the null and alternative hypotheses for a goodness-of-fit test. EXAMPLE 11.3 Consider the population of Americans who’ve dieted. A survey reported that 85% were most likely to go off their diet on the weekend, 10%were most likely to go off on a weekday, and 5% didn’t know. This population is divided into three categories: category 1: most likely to go off their diet on the weekend, category 2: most likely to go off their diet on a weekday, and category 3: did not know when they were most likely to go off their diet. The multinomial probability distribution is: p1= 3 5 , p2= .10, and p3= .05. The Delta Health fitness club is interested in whether their members follow this same multinomial probability distribution. A goodness-of-fit test is used to test the null hypothesis: Ho:p1= .85, p2= .10, and p3= .05 vs. the following alternative hypothesis: Ha:The population proportions are not p1= 3 5 , p2= .10,and p3= .05. The probabilities pl, p2, and p3 in the hypotheses statements represent the proportions for the categories as applied to the health fitness club members. The next two sections will describe the steps for performing a goodness-of-fit test. EXAMPLE 11.4 Table 11.4 gives the age distribution of part-time college students as determined five years ago. If p1, p2,p3,p4,and p5represent the current percentages for the five groups, then the null hypothesis that the current distribution is the same as five years ago is stated as follows: Ho:p1= .25, p2= .35, p3 = .25, p4= .10,
  • 260.
    CHAP. 111 CHI-SQUAREPROCEDURES251 Age 18-24 25-34 35-44 45-54 Percent 25% 35% 25% 10% and p5 = .05. The research hypothesis is stated as: H,: The current proportions are not as stated in the null hypothesis. A goodness-of-fit test is used to test this hypothesis system. 55 or over 5% OBSERVEDAND EXPECTED FREQUENCIES The first step in performing a goodness-of-fittest is the selection of a sample of size n from the population and the determination of the observedfrequencies for k classes. Recall that in testing hypothesis, the null hypothesis is assumed to be true, and is rejected if a highly unlikely value is obtained for the test statistic.The expected frequencies are computed assuming the null hypothesis to be true. EXAMPLE 11.5 In Example 11.3, 200 members of Delta Health fitness club were surveyed and it was found that 160 were most likely to go off the diet on the weekend, 22 were most likely to go off the diet on a weekday, and 18 did not know. If the null hypothesis is true and the club members follow the nationwide distribution, then the expected numbers in the three categories are: npl = 200 x .85 = 170, np2 = 200 x .10 = 20, and np3 = 200 x .05 = 10. The observed frequencies are 160,22, and 18 and the expected frequencies are 170,20, and 10. EXAMPLE 11.6 In Example 11.4, 1500 part-time students were surveyed across the country. It was observed that 352 were in the age group 18-24, 501 were in the age group 25-34, 371 were in the age group 35-44, 126 were in the age group 45-54, and the remainder were in the age group 55 or over. If the null hypothesis is true and the distribution is the same as it was five years ago, the expected numbers in the five categories are as follows: npl = 1500x .25 = 375, np2= 1500x .35 = 525, np3= 1500 x 2 5 = 375, np4= 1500 x .10 = 150, and np5= 1500 x .05 = 75. The observed frequencies are 352,501, 371, 126, and 150 and the expected frequencies are 375,525,375, 150,and 75. SAMPLINGDISTRIBUTIONOF THE GOODNESS-OF-FITTEST STATISTIC The goodness-of-fit test statistic is given in formula (22.2), where o represents an observed frequency and e representsan expected frequency, and the sum is over all k categories. (22.2) The test statistic given in formula (21.2) has a Chi-square distribution with df = k - 1, provided all the expected frequencies are 5 or more. Some statisticians use a less restrictive requirement, namely, that all expected frequencies are at least one and that at most 20% of the expected frequencies are less than 5. We shall use the requirement that all expected frequenciesare 5 or more. This requirement means that a minimum sample size is needed to use this procedure. If the observed and expected frequencies are close, then the computed value for x2will be close to zero since the differences (0-e) will all be near zero. If the observed and expected values differ considerably,then the computed value of x2will be large, supportingthe research hypothesis. Since only large values of x2indicate that the null hypothesis should be rejected and the research hypothesis supported, this is always a one-tailed test. That is, the null hypothesis is rejected only for large values of the computed test statistic.Table 11.5 summarizesthe procedure for performing a goodness-of-fit test.
  • 261.
    252 CHI-SQUARE PROCEDURES[CHAP. 11 Category 0 e o - e 1 160 I70 -10 2 22 20 2 3 18 10 8 Sum 200 200 0 Table 11.5 Steps for Performing a Goodness-of-Fit Test Step 1: State the null and research hypothesis concerning the hypothesized distribution for the k categories. Step 2: Use the x2 table and the level of significance, a,to determine the rejection region. Step 3: Compute the value of the test statistic as follows: x2= c- ,where o represents the observed frequencies and e represents the expected frequencies. Check to make sure that all expected frequencies are 5 or more. Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls in the rejection region. Otherwise, the null hypothesis is not rejected. (0-e)* e (0 -e>' 100 .588 4 .200 64 6.4 7.188 (0 - e)2 e EXAMPLE 11.7 Refer to Examples 11.3and 11.5.The null hypothesis may be stated in either of two ways: Ho: The distribution of categories for Delta Health fitness club members is the same as the national distribution or Ho: pi = .85, p2= .10,and p3 = .05. The research hypothesis may also be stated in either of two ways: Ha:The distribution of categories for Delta Health fitness club members is not the same as the national distribution or Ha:The population proportions are not pi = 3 5 , p2 = .10,and p3= .05.Table 11.6illustrates the computation of the test statistic. The computed value of the test statistic is x2* = 7.188. df 2 .995 ,990 .975 .950 .900 .100 .OS0 .025 .010 .005 0.010 0.020 0.051 0.103 0.211 4.605 5.991 7.378 9.210 10.597 For level of significance a = .05, the critical value is 5.991.The row corresponding to df = 3 - 1 = 2 from the Chi-square table in Appendix 4 is shown in Table 11.7. Table 11.7 r - I Area in the right tail under the Chi-square distribution curve I Since the computed test statistic exceeds 5.991,the null hypothesis is rejected. There are almost twice as many in the Don't Know category at Delta Health fitness club as would be expected if the distributions were the same. EXAMPLE 11.8 Refer to Examples 1 1.4 and 11.6.The null hypothesis is b: pi = .25, p2 = .35, p3 = .25, p4= .10,and p5= -05.The research hypothesis is stated as: Ha: The current proportions are not as stated in the null hypothesis. The observed and expected frequencies are given in Example 11.6. Table 11.8 illustrates the computation of the goodness-of-fit test statistic. The computed value of the test statistic is x2* = 81.391. Table 11.8
  • 262.
    CHAP. 111 CHI-SQUAREPROCEDURES 253 df 4 The row corresponding to df = 5 - I = 4 from the Chi-square table in Appendix 4 is shown in Table 11.9.For level of significance a = .01, the critical value is 13.277and is shown in bold type. .995 .990 .975 .950 .900 .loo .050 .025 .010 .005 0.207 0.297 0.484 0.711 1.064 7,779 9.488 11.143 13.277 14.860 Table 11.9 I I Area in the right tail under the Chi-sauare distribution curve 1 Male Female Column total Supports Opposes Undecided Row total 70 20 10 loo 70 20 10 I 00 140 40 20 200 Since the computed test statistic exceeds 13.277,the null hypothesis is rejected. There appears to have been an increase in the 55 or over group from 5 years ago. EXAMPLE 11.9 The Minitab solutions to Examples 11.7 and 11.8are shown in Figures 11-4 and 11-5. MTB > set the observed values in cl DATA> 160 22 18 DATA > end MTB > set the expected values in c2 DATA> 170 20 10 DATA > end MTB > let kl = sum((c1 - c2)**2/c2) MTB > print kl kl 7.18824 MTB >cdf kl k2; SUBC > Chisquare 2. MTB >let k3 = 1-k2 MTB >print k3 k3 0.0274849 Fig. 11-4 MTB > set the observed values in cl DATA>352 501 371 126 150 DATA >end MTB > set the expected values in c2 DATA > 375 525 375 150 75 DATA > end MTB > let kl = sum((c1- c2)**2/k2) MTB > print kl kl 81.3905 MTB > cdf kl k2; SUBC > Chisquare 4. MTB >let k3 = 1-k2 MTB > print k3 Ik3 o Fig. 11-5 The upper part of both figures illustrates the computation of the test statistic for Examples 11.7 and 11.8. The computed values, 7.18824 and 81.3905, are the same as shown in Examples 11.7 and 1I .8. The portion of the output shown in bold illustrates the computation of the p value for the two Examples. The p value is shown next to k3. The p value for Example 11.7 is 0.027 and the p value for Example 1 1.8 is 0. CHI-SQUAREINDEPENDENCETEST Consider a survey of 100 males and 100 females concerning their opinion toward capital punishment. Tables I I. 10 and 11.11 gives two different sets of results. In Table 11.10, the distribution of opinions is exactly the same for males and females. In this case, we say that the opinion concerning capital punishment is independent of the sex of the respondent. Table 11.10
  • 263.
    254 Male Female Column total CHI-SQUAREPROCEDURES supports OpposesUndecided Row total 70 20 10 100 40 50 10 100 110 70 20 200 [CHAP. 11 I supports Opposes Undecided Row total Male 80 15 5 100 Female 70 25 5 100 Column total 150 40 10 200 In Table 11.11,the distributionof opinions is clearly different for males and females. In this case, we say that the opinion concerningcapital punishment is dependent on the sex of the respondent. Supports Opposes l00x 150 100x40 200 200 200 200 Male -=75 -=20 Female -=75 lOOx 150 -=20 100x40 Cnliimn tntal 150 40 Table 11.11 Undecided Row total lO0x 10 - 100 -- 200 200 - = 5 lO0x10 100 10 200 Tables I I .10and 1I. 11 are called contingency tables. In a Chi-square independence test, we are interested in using results such as those shown in these tables, to test for the independence of two characteristics on the elements of a population. In the above discussion, the two characteristics are sex of the individual and opinion of the individual concerning capital punishment. The conclusions regarding independence would be clear in the two tables given above. But suppose the results of the survey are not as clear cut as above. How do we decide from the results given in a contingency table if two characteristics are independent? Suppose the results of the survey were as shown in Table 11.12. In a Chi-square test of independence, the null hypothesis is that the two characteristics are independent. The obsewedfrequencies are shown in Table I 1.12.A table of expectedfrequencies is also determined assuming the null hypothesis to be true. In Table 11.12,note that 150 out of 200, or 75% of those surveyed, supported capital punishment. If the two characteristics are independent, we would expect 75% of the 100males and 75%of the 100females to support capital punishment. From this discussion, note that if ell represents the expected frequency in the first row, first column cell, then (row 1 total)x (column 1 total) = 75 = 100x150 ell= 100x .75 = 200 sample size In general, the expected frequency in row i columnj is given by formula (IZ.2): (row i total)x (columnj total) samplesize eij = Table 11.13 showsthe computation of the expected frequencies for Table 1I.12. ( I 1.2)
  • 264.
    CHAP. 111 CHI-SQUAREPROCEDURES 255 Male Female Column total EXAMPLE 11.10 The table of expected frequencies for Table 11.10is given in Table 1I. 14.Notice that when the contingency table indicates independence, the observed and expected frequencies are exactly the same. supports Opposes Undecided Row total 70 -=20 -- 100x20 - *o 100 -- 100x40- 20 - lOOx 20 = 10 100 100x40 200 200 200 200 200 200 -= 100x140 7o -= 140 40 20 200 Table 11.14 Male Female Column total Supports Opposes Undecided Row total -- - 5 5 -=35 -- lOOx 110 200 200 200 200 200 200 100x70 100x20 - 1o 100 lOOx 110 - 55 100x20 100 -- 1oox70 - 35 - = 10 110 70 20 200 EXAMPLE 11.11 The table of expected frequencies for Table 11.1 1 is given in Table 1 1.15. Notice that when the contingency table indicates strong dependence, that the observed and expected frequencies are very different. Male Female Column total Table 11.15 Supports Opposes Undecided Row total 80 (75) 15 (20) 5 (5) 100 70 (75) 25 (20) 5 (5) 100 150 40 10 200 How different do the observed frequencies and those you would expect when the characteristics are independent need to be before you would reject independence? The test statistic given in the next section will answer this question. SAMPLING DISTRIBUTION OF THE TEST STATISTICFOR THE CHI-SQUARE INDEPENDENCETEST Table 11.16 gives the observed frequencies and the expected frequencies in parenthesis for the data on Table 11.12. The null hypothesis is: Ho: Opinion concerning capital punishment is independent of the sex of the individual. The research hypothesis is: Ha:Opinion concerning capital punishment differs for males and females. The test statistic for testing this hypothesis is given by formula (1Z.3), where the sum is over all 6 cells. If there are r rows and c columns, there will be r x c cells in the contingency table. The test statistic has a Chi-square distribution with (r - I) x (c - 1) degrees of freedom. Table 11.16 The computed value of the test statistic is found as follows: ( I 1.3) (0-e)* (80-75)2 (15-20)2 (5-5)* (70-75)2 (25-20)2 (5-5)2 + +- + + + -= 3.167 p = E-- - e 75 20 5 75 20 5
  • 265.
    256 CHI-SQUARE PROCEDURES[CHAP. 11 The degrees of freedom is equal to df = (r - 1) x (c - 1) = (2 - 1) x (3 - 1) = 2. The critical value from the Chi-square distribution table is 5.991 for a = .OS. Since the computed test statisticdoes not exceed the critical value, the null hypothesis is not rejected. We cannot reject that the characteristics are independent. A Minitab analysis of the same data is shown below. The data are read first. The command Chisquare cl - c3 produces the expected frequencies, the computed value of the test statistic, and the p value. The p value is equal to 0.205. MTB > read cl -c3 DATA>80 15 5 DATA>70 25 5 DATA > end 2 rows read. MTB > Chisquarecl -c3 Expected counts are printed below observed counts C1 C2 C3 Total 1 80 15 5 100 75.00 20.00 5.00 2 70 25 5 100 75.00 20.00 5.00 Total 150 40 10 200 Chi-Sq = 0.333 + 1.250 +O.OO0 +0.333 + 1.250 +O.OO0 = 3.167 DF = 2, P value = 0.205 EXAMPLE 11.12 The Minitab output for the data in Tables 11.10 and 11.11 are shown in Figures 11-6 and 11-7. MTB > read cl -c3 DATA>70 20 10 DATA>70 20 10 DATA >end 2 rows read. MTB >Chisquarecl -c3 Expected counts are printed below observed counts c1 C2 C3 Total 1 70 20 10 100 70.00 20.00 10.00 2 70 20 10 100 70.00 20.00 10.00 Total 140 40 20 200 Chi-Sq = O.OO0 +O.OO0 +O.OO0 +O.OO0 +0.0o0 + o.oO0= o.Oo0 DF = 2. P value = 1.OOO MTB > read cl -c3 DATA>70 20 10 DATA>40 50 10 DATA >end 2 rows read. MTB >Chisquare cI -c3 Expected countsare printed below observed counts c1 C2 C3 Total 1 70 20 10 100 55.00 35.00 10.00 2 40 50 10 100 55.00 35.00 10.00 Total 110 70 20 200 Chi-Sq = 4.091 +6.429 +O.OO0 +4.091 + 6.429 + DF = 2, P value = O.OO0 0.O00 = 21.039 Fig. 11-6 Fig. 11-7
  • 266.
    CHAP. 111 CHI-SQUAREPROCEDURES 257 SAMPLINGDISTRIBUTION OFTHE SAMPLE VARIANCE The population and sample variance was defined and illustrated in Chapter 3. The population variance is given by formula (ZZ.4): The sample variance is given by (11.4) (11.5) The concept of the samplingdistribution of the sample variance is established in Example 11.13. EXAMPLE 11.13 Consider the small finite population consisting of the times required for six individuals to open a “Child proof’ aspirin bottle. The required times (in seconds) are shown in Table 11.17. The population mean is p= 30 seconds. The population variance is found as follows: (x-p)2 400+ 100+O+ O+ 100+400 = 166.67 - d = - N 6 The standard deviation is G= 4166.67 = 12.9. Table 11.17 Individual Individual 1 Required time 10 20 30 30 40 50 There are 20 different samples of size 3 possible and they are listed along with the sample variance and sampling error for each sample in Table 11.18. Sample number 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Tab Sample A, B, C = 10,20,30 A, B, D = 10,20,30 A, B, E = 10,20,40 A, B, F = 10,20,50 A, C, D = 10,30,30 A, C, E = 10,30,40 A, C, F = 10,30,50 A, D, E = 10,30,40 A, D, F = 10,30,50 A, E, F = 10,40,50 B, C, D = 20,30,30 B,C, E =20,30,40 B, C, F =20,30,50 B, D, E =20,30,40 B,D, F = 20,30,50 B, E, F =20,40,50 C, D,E =30,30,40 C, D,F = 30,30,50 C, E, F = 30,40,50 D.E. F = 30.40.50 ! 11.18 Sample variance, s2 100.00 100.00 233.48 432.64 133.40 233.48 400.00 233.48 400.00 432.64 33.29 100.00 233.48 100.00 233.48 233.48 33.29 133.40 100.00 100.00 Sampling error, I s2 -dI 66.67 66.67 66.81 265.97 33.27 66.81 233.33 66.81 233.33 265.97 133.38 66.67 66.81 66.67 66.81 66.81 133.38 33.27 66.67 66.67
  • 267.
    258 CHI-SQUARE PROCEDURES[CHAP. 11 33.29 100.0 133.40 233.48 400.00 432.64 ' s2 ~ l l l l _ _ l . . l - P ( S ) .1 .3 .1 .3 .1 .1 - - 7 - - - . - --- The distribution of the sample variance is obtained from Table 11.18and is given in Table 1 1.19. 249.990 250.010 250.005 250.000 249.980 250.OOO 250.000 249.999 250.015 250.001 250.010 250.000 The above procedure may be used to find the sampling distribution of the sample variance when sampling from a finite population. However, it is clear that it is a tedious procedure even when using a computer. When sampling from an infinite population, the result given in Table 11.20 is used to determine the sampling distribution for a function of the sample variance, namely, . The proof of this result is beyond the scope of this text. In the next section, this result is utilized to set a confidence interval on o2as well as test hypotheses about 02. (n - 1)s2 2 df 14 - Table 11.20 Area in the right tail under the Chi-square distribution curve .995 .990 .975 .950 .900 .100 .050 .025 .O10 .005 4.075 4.660 5.629 6.571 7.790 21.064 23.685 26.119 29.141 31.319 I (n - 1)s2 Sampling Distribution of , When a simple random sample of size n is selected from a normally distributed has a Chi-square distribution population having population variance, c?, with (n - 1) degrees of freedom, where S2is the sample variance. (n - 1)s2 2 INFERENCES CONCERNING THE POPULATION VARIANCE The result given in Table 11.20 will be used to find confidence intervals and test hypotheses about a population variance or standard deviation. EXAMPLE 11.14 It is important that drug manufacturers control the variation of the dosage in their products. A drug company produces 250 milligram tablets of the antibiotic amoxicilliri. A sample of size 15 is selected from the production process and the level ofamoxicillin is determined for each tablet. The results are shown in Table 11.21. Table 11.21 I 249.995 I 249.985 I 250.000 I The mean of the sample is X=250.000mg and the sample variance is s2= .00008538mg2. According to Table (n-1)s' 14s' -- - has a Chi-square distribution with df = 14. Table 11.22 gives the row corresponding to I 1.20, df = 14 from the Chi-square table in Appendix 4. 2 d Table 11.22
  • 268.
    CHAP. 111 CHI-SQUAREPROCEDURES259 1452 is between 5.629 and 26.119 with 14S2 has a Chi-square distribution with df = 14, we know that - Since - probability .95. This is true because according to Table 11.22,there is 97.5%of the area under the curve to the right of 5.629 and 2.5%of the area under the curve to the right of 26.1 19, and therefore 95%of the area under the curve is between 5.629 and 26.119.This may be expressed as follows: 02 d 14S2 P(5.629 <-<26.119) = .95 o2 The following notation is often used for the tabled Chi-square v a l u e s : ~ ~ ~ ~ = 5.629 and probability statement may be expressed as follows: = 26.119. The 14S2 1452 1452 Now, the if inequality < -< Xi25is solved for $,we obtain 7 < d< 7as a 95%confidence interval for o2. The numerical confidence interval is obtained by replacing S2by .oooO8538mg2and using the values obtained from the Chi-square table. The lower confidence limit is: 02 x.025 x.975 -- 14S2 14 x .00008538 =.ooo04576 2 26.119 x.025 - The upper confidence interval is: = .0002124 14S2 14 x .00008538 X ,975 -- - 5.629 2 The 95% confidence interval for goes from .00004576 to .0002124. The 95% confidence interval for o is obtained by taking the square root of the limits for 02. The lower limit is = .006765 and the upper limit is Jm = .014574. The 95%confidence interval for the standard deviation (to three decimal places) is (.007,.015). The general form for a (1 - a)x 100% confidence interval for the population variance is given by formula (11.6),where the values of x2are based on the Chi-squaredistribution with df = n - 1. (11.6) The (1 - a)x 100%confidence interval for the population standard deviation is given by formula (11.7). Both of the confidence intervals assume that the sample was obtained from a normally distributed population.
  • 269.
    260 df 29 CHI-SQUARE PROCEDURES .995 .990.975 .950 .900 .100 .050 .025 .010 .005 13.121 14.256 16.047 17.708 19.768 39.087 42.557 45.722 49,588 52.336 [CHAP. 11 EXAMPLE 11.15 A random sample of the lengths of bolts produced by Fastners, Inc. was taken. The sample results were as follows: n = 30, X = 5.000cm, and s = 0.055 cm. A 99% confidence interval for D is determined as follows: For a 99% confidence interval, I - a = .99, and therefore a = .01, or - = .005 and I - - = .995. The degrees of freedom is df = n - 1 = 30 - 1 = 29. Table 11.23 gives that portion of the Chi-square table needed to determine the values for use in formula (IZ.7). The following values are shown in bold print in the table:X9s = 13.121 and xLs= 52.336. The sample variance is s2 = (0.055)2= ,003025. Substituting into formula ( I /.7) we obtain the following 99% confidence interval for the population standard deviation. a a 2 2 The population standard deviation of bolt lengths is between .041 cm and .082 cm with 99% confidence. Table 11.23 r - I Area in the right tail under the Chi-square distribution curve - 1 (n - I)S* 2 , given in Table 11.20, is also utilized to test hypotheses concerning the population variance or standard deviation. The steps for testing a population variance are given in Table 11.24. The sampling distribution of Table 11.24 Steps for Testing a Hypothesis Concerning 02: Sampling from a Normal Population Step I: State the null and research hypothesis. The null hypothesis is represented symbolically by b: o2= 0 : and the research hypothesis is of the form Ha: o2# 0 : or Ha: d < 0;or Ha: d > 0;. Step 2: For level of significance a,the critical values for Ha: o2f 0; are x:-u/z and xilz,the critical value for Ha:o2< 0;is x;-~,and the critical value for Ha: d > 0;is Xa. Step 3: Compute the value of the test statistic as follows: x2* = (n-1)S2 , whereoi is given in the null hypothesis and S2is computed from your sample. 2 0; Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls in the rejection region. Otherwise, the null hypothesis is not rejected. EXAMPLE 11.16 The ratio of potassium to sodium in an individual's diet is sometimes referred to as the K factor. A sample of 15 Yanomamo Indians from Brazil was obtained and their K factors were determined. The standard deviation of the sample of K factor values was found to equal 0.15. These sample results were used to test the hypothesis Ho:0= 0.25 vs. Ha:0f 0.25 at level of significance a = .05.The critical values are shown in bold in Table 1 1.25.From this table, we have &s = 5.629 and Xi25 = 26.I 19.The computed value of the test statistic is: (n-l)S2 14 x .1s2- 14 x .0225 x 2 * = - - = 5.04 0; .252 .0625 - Since the computed value of the test statistic is less than 5.629, the null hypothesis is rejected and it is concluded that the standard deviation for Yanomamo Indians is less than 0.25.
  • 270.
    CHAP. 111 Area inthe right tail under the Chi-square distribution curve I Df ,995 .990 .975 .950 .900 .100 .050 .025 .O10 .005 14 4.075 4.660 5,629 6.571 7.790 21.064 23.685 26.119 29.141 31.319 CHI-SQUARE PROCEDURES Area in the right tail under the Chi-square distribution curve L I Df ,995 .990 .975 .950 .900 .100 .050 .025 .010 .005 5 0.412 0.554 0.831 1.145 1.610 9.236 11.070 12.833 15.086 16.750 26 1 Df 10 Area in the right tail under the Chi-square distribution curve .995 .990 .975 .950 .900 .loo .050 .025 .010 .005 2.156 2.558 3.247 3.940 4.865 15.987 18.307 20.483 23.209 25.188 Solved Problems CHI-SQUARE DISTRIBUTION 11.1 What happens to the peak of the Chi-square curve as the degrees of freedom is increased? Ans. The peak of the Chi-square curve shifts to the right as the degrees of freedom is increased. 11.2 The area under the Chi-square curve corresponding to x2 values greater than 118.498 is .10. Find P(0 < x2< 118.498). Ans. Since the total area under the Chi-square curve is equal to 1, P(0 < x2 < 118.498) is equal to 1 - P(x2> 118.498)= I - .I0= .90. CHI-SQUARE TABLES 11.3 Find the value of x2for 5 degrees of freedom and (a) .005 area in the right tail of the Chi-square distribution curve. (6) .005 area in the left tail of the Chi-square distribution curve. Ans. Table 11.26 gives the row corresponding to df = 5 from the Chi-square table. (a) The table indicates thatthe area to the right of 16.750is .005. (b)The table indicates that the area to the right of 0.412 is .995.Therefore, the area to the left of 0.412 is 1 - .995= .005. Ans. According to Table 11.27,the area to the right of 23.209 is ,010 and the area to the right of 4.865 is .900. In Fig. 11-8,the shaded area equals .010 and corresponds to x2 values exceeding 23.209. The shaded area in Fig. 11-9 equals .900 and corresponds to x2 exceeding 4.865. The area corresponding to x2between 4.865 and 23.209 is the difference in the two shaded areas, or .900- .010= .890.
  • 271.
    262 Fig. 11-8 CHI-SQUARE PROCEDURES[CHAP. 11 Fig. 11-9 GOODNESS-OF-FITTEST 11.5 A magazine reported that 75% of the population oppose same sex marriages, 20% approve, and 5% are undecided. A survey is conducted to test the research hypothesis that the distribution is different from that reported by the magazine. State the null and alternative hypotheses in terms of p1 ,p2 ,and p3 where pl = the population proportion opposed to same sex marriage, p2 = the population proportion who approve of same sex marriage, and p3 = the proportion who are undecided. Ans. The null hypothesis is Ho: The population proportions are p1 = .75, p2 = .20, and p3 = .05. The research hypothesis is Ha: The population proportions are not p1= .75,p2= .20, and p3= .05. 11.6 The fairness of a die may be tested as a goodness-of-fit test. Let pi represent the proportion of the time that face i turns up when the die is tossed, where i is 1, 2, 3, 4, 5, or 6. If the null hypothesis is that the die is fair and the research hypothesis is that the die is unfair, state Ho and Hain terms of the pi . Ans. The null hypothesis is Ho: p1 = p2 = p3 = p4 =ps = p6 = i.The research hypothesis is Ha:Not all pi equal +. OBSERVEDAND EXPECTEDFREQUENCIES 11.7 Refer to problem 11.5. A poll was conducted and 585 opposed same sex marriage, 195 approved, 120 were undecided. What results would you expect if the null hypothesis were true? Ans. The sample size is n = 900. The expected frequencies are: el = npl = 900 x .75 = 675, e2 = np2= 900 x .20= 180,and e3= np3= 900 x .05 = 45. 11.8 Refer to problem 11.6. The die in question was rolled and face 1 turned up 89 times, face 2 turned up 93 times, face 3 turned up 103times, face 4 turned up 111 times, face 5 turned up 100 times, and face 6 turned up 104 times. What results would you expect if the null hypothesis is true and the die is fair? Ans. The sample size is n = 600. If the null hypothesis is true, then each face should occur 100times in 600 rolls.
  • 272.
    CHAP. 113 CHI-SQUAREPROCEDURES Category 0 e Oppose 585 675 Support 195 180 Undecided 120 45 Sum 900 900 263 ( 0-e12 o - e (0- el2 e -90 8100 12.000 15 225 1.250 75 5625 125.000 0 138.250 SAMPLING DISTRIBUTION OF THE GOODNESS-OF-FITTEST STATISTIC Df 5 11.9 Refer to problems I 1.5 and 1I .7. Test the hypothesis at a = .01. Area in the right tail under the Chi-square distribution curve .995 .990 .975 .950 .900 .100 .OS0 .025 .O10 .005 0.412 0.554 0.831 1.145 1.610 9.236 11.070 12.833 15.086 16.750 Ans. The number of categories is k = 3, and the degrees of freedom is df = k - 1 = 2. By referring to the Chi-square distribution table in Appendix 4, we find that the critical value for a 1% level of significance is equal to 9.210. The null hypothesis will be rejected if the computed value of the test statistic exceeds this critical value. Turned up face 1 2 3 4 5 6 sum The computation of the test statistic is shown in Table 11.28. Since the computed test statistic is 138.250, the null hypothesis is rejected. There is a much higher number in the undecided group than would be expected if the null hypothesis were true. (0 -eI2 0 e o - e (0-e)2 e 89 100 -1 1 121 1.21 93 100 -7 49 0.49 103 100 3 9 0.09 111 100 11 121 1.21 100 100 0 0 0 104 100 4 16 0.16 600 600 0 3.16 11.10 Refer to problems 11.6and 11.8.Test the hypothesis at a = .05. Ans. The number of categories is k = 6, and the degrees of freedom is df = k - 1 = 5. Table 11.29 gives the row corresponding to df = 5 from the Chi-square distribution table in Appendix 4. The critical value is 11.070, as shown in bold print. Table 1 1 . 2 9 The computation of the test statistic is shown in Table 11.30. The computed value of the test statistic is 3.16. Based on this experiment, there is no reason to doubt the fairness of the die. CHI-SQUARE INDEPENDENCETEST 11.11 A study involving several cities from across the country involving crime was conducted. The cities were divided into the categories South, Northeast, North Central, and West. Based on interviews, crime was classified as a major concern, a minor concern, or of no concern for each individual interviewed. The results are shown in Table 11.31. If the level of concern
  • 273.
    264 CHI-SQUAREPROCEDURES [CHAP.11 . Concern South Northeast North Central West Major 25 40 35 40 Minor 65 45 40 40 No concern 15 10 15 20 regardingcrime is independent of the section of the country, find the expected frequencies for the 12cells. . Concern South Northeast North Central West Row total I 140 190 60x 100 60 -- l4OX lo5 - 37.69 - l4OX 95 - -34.10 - 140x90 - -32.31 -- l4OX loo - 35.90 Major Minor 390 390 390 390 390 390 390 390 390 390 390 390 190x95-46.28 -- l9OX9O-43.85 -- l9OX loo -48.72 190x 105 -- - 51.15 --- - = 15.38 60x 90 - = 13.85 -- 95 - 14.62 No concern 60x105= 16.1 I Column total I 105 95 90 100 I 390 I Net worth 100to 249 250 to 499 500 to 999 loo0 or more Ans. The computations of the expected frequencies are shown in Table 11.32. Married Single Widowed 225 50 65 60 20 25 15 4 3 10 2 3 L Net worth Married Single Widowed Row total 340x 96 340 105x96 105 250 - 499 22 500-999 15x76- 2.37 15x96 15 loo0 or more 482 482 482 Column total 310 76 96 482 - =67.72 340x 76 482 482 482 105x 76 482 482 482 482 482 482 15x310 - =53.61 340x 310 - = 218.67 105x310 22 x 310 100- 249 - = 20.91 - = 1656 - = 6753 - 22x 76 = 3.47 -- 22x96-4.38 -= 14.15 - = 2.99 -- - = 9.65 11.12 Table 11.33 gives the results of a study involving marital status and net worth in thousands. Find the expected frequenciesassuming the two characteristicsare independent.Comment on the expected frequencies. Table 11.33 Note that 4 of the 15,or 27%, of the expected frequencies are less than 5. The Chi-square test of independence is not appropriate.
  • 274.
    CHAP. 111 CHI-SQUAREPROCEDURES 265 SAMPLINGDISTRIBUTIONOF THE TEST STATISTICFOR THE CHI-SQUAREINDEPENDENCETEST 11.13 Use Minitab to test the hypothesisthat opinions regarding crime is independent of the section of the country in problem 11.11 at level of significance a = .05. Compare the expected frequencies in the Minitab output with those computed in problem 11.11. Row C1 C2 C3 C4 1 25 40 35 40 2 65 45 40 40 3 15 1 0 15 20 MTB > chisquare cl-c4 Expected counts are printed below observed counts. c1 C2 C3 C4 Total 1 25 40 35 40 140 37.69 34.10 32.31 35.90 2 65 45 40 40 190 51.15 46.28 43.85 48.72 3 15 10 15 20 60 16.15 14.62 13.85 15.38 Total 1 0 5 95 90 100 390 Chi-Sq = 4.274+ 1.020+0.224+0.469+3.748+0.036+0.337+ 1.560+0.082+ 1.457+0.096+ DF = 6,P value = 0.023 1.385= 14.688 Ans. Since the computed p value = 0.023is less than a = .05, the null hypothesis is rejected. The expected frequenciesare the same as computed in problem 1 1.11. 11.14 Refer to problem 11.12. Combine the categories Single and Widowed into a new category called Nonmarried so that all cell expected frequencies exceed 5 and perform a test of independence. Ans. Table 1 1.35gives the new results after combiningcategories. Table 11.35 Net worth Married Not Married 100to 249 250to 499 500to 999 loo0 or more The Minitab analysisof the test of independenceis as follows: Data Display Row Cl C2 1 225 115 2 60 45 3 15 7 4 1 0 5
  • 275.
    266 CHI-SQUAREPROCEDURES MTB >Chisquare cl c2 [CHAP. 11 Chi-SquareTest Expected counts are printed below observed counts c1 C2 Total I 225 115 340 218.67 121.33 2 60 45 105 67.53 37.47 3 15 7 22 14.15 7.85 4 10 5 15 9.65 5.35 Total 310 172 482 Chi-Sq = 0.183 +0.330 +0.840 + 1.514 +0.051 +0.092 +0.013+0.023 = 3.046 DF = 3, P value = 0.385 SAMPLING DISTRIBUTION OF THE SAMPLE VARIANCE 11.15 A small population consists of the three values 10, 20, and 30. Determine the population variance and the sampling distribution of s2for samples of size 2. Ans. The population mean is 20. The population variance is: (x -p) (10- 20)*+(20- 20)2+(30- 20)2 = 66.67 - o2= - N 3 The three samples and their variances are: 10and 20, s2= 50; 10and 30, s2= 200; 20 and 30, s2= 50. The sampling distribution for s2is: P(s2= 50)= +,P(s2= 200) = 1 . 3 11.16 A sample of size 15 is taken from a normal population having cr2 = 14. What is the probability that s2exceeds 21.064? (n-1)S2 14s' - - - - s2 has a Chi-square distribution with 14 degrees of freedom. - 2 14 Ans. The statistic From the Chi-square table, the area to the right of 21.064 is 0.10 when df = 14. Therefore P(s2> 21.064) = 0.10. INFERENCES CONCERNING THE POPULATION VARIANCE 11.17 The standard deviation for the annual medical costs of 30 heart disease patients was found to equal $1005. Find a 95% confidence interval for the standard deviation of the annual medical costs of all heart disease patients. .' Ans. The general form of the confidence interval for the population variance is (n - I)s2 (n- 1)s2 C O 2 < . 2 Xa12 X1-at2
  • 276.
    CHAP. 111 CHI-SQUAREPROCEDURES 267 a a 2 2 The level of confidence is 1 - a = .95. This implies that a = .05, - = .025, and 1 - - = .975. The Chi-square table values are The lower limit for the 95% confidence interval for d is = 45.722 and = 16.047, for df = 29. (n-I)S2 29~1,010,025 - - = 640,626.5037. 45.722 2 XaJ2 The upper limit for the 95% confidence interval for o2is - - 29 190109025 = 1,825,308.469. (n -1)s2 Xl-aJ2 2 16.047 The lower limit for the 95% confidence interval for 0 is ,/640,6265037 = $800.39. The upper limit for the 95% confidence interval for (T is 41,825,308.469 = $1,351.04. 11.18 In manufacturing processes, it is desirable to control the variability of mass produced items. A machine fills containers with 5.68 liters of bleach. The variability in fills is acceptable provided the standard deviation is 0.15 liters or less. A sample of 10containers is chosen to test the research hypothesis that the standard deviation exceeds 0.15 liters at level of significance a = .05. What conclusion is made if the variance of the sample is found to equal 0.095 liters? Ans. The null hypothesis is Ho:(T = 0.15 and the research hypothesis is Ha:(T > 0.15. The degrees of freedom is df = 10- 1 = 9, and the critical value is xis= 16.919. The computed value of the test statistic is x2* = (n-1)S2 - 9 x .095 0; .0225 - = 38. Based on this sample, it is concluded that the variability is unacceptable and the filling machine needs to be checked. SupplementaryProblems CHI-SQUARE DISTRIBUTION 11.19 Which one of the following terms best describes the Chi-square distribution curve: right-skewed, left skewed, symmetric, or uniform? Ans. right-skewed 11.20 According to Chebyshev’s theorem, at least 75% of any distribution will fall within 2 standard deviations of the mean. For a Chi-square distribution having 8 degrees of freedom, find two values a and b such that at least 75% of the area under the distribution curve will be between those values. Ans. a = 8 - 2 ~ 4 = 0 , b = 8 + 2 ~ 4 = 16 CHI-SQUARE TABLES 11.21 A Chi-square distribution has 11 degrees of freedom. Find the x2values corresponding to the following right-hand tail areas. (a) .005 (b) .975 (c) .990 (d).025 Ans. (a)26,757 (6) 3.816 (c) 3.053 (d)21.920
  • 277.
    268 CHI-SQUARE PROCEDURES[CHAP. 11 11.22 A Chi-square distribution has 17 degrees of freedom. Find the left-hand tail area corresponding to the following values. (a)30.191 (6)6.408 (c) 33.409 (d)24.769 Ans. (a).975 (6) .OlO (c) .990 (d).900 GOODNESS-OF-FITTEST 11.23 A national survey found that among married Americans age 40 to 65 with household income above $50,000,45% planned to work after retirement age, 45% planned not to work after retirement age, and 10% were not sure. A similar survey was conducted in Nebraska. Let pI represent the percent in Nebraska who plan to work after retirement age, p2 represent the percent in Nebraska that plan not to work after retirement age, and p3 represent the percent not sure. What null and research hypotheses would be tested to compare the multinomial probability distribution in Nebraska with the national distribution? Ans. Ho: pi = .45, p2= .45, and p3= .10; Ha:The proportions are not as stated in the null hypothesis. 11.24 A sociological study concerning impaired coworkers was conducted. It was reported that 30% of American workers have worked with someone whose alcohol or drug use affected hisher job, 60% have not worked with such individuals, and 10% did not know. The Central Intelligence Agency (CIA) conducted a similar study of CIA employees. Let p1 represent the proportion of CIA employees who have worked with someone whose alcohol or drug use affected hisher job, p2 represent the proportion of CIA employees who have not worked with such coworkers, and p3 represent the proportion who do not know. What null and research hypothesis would you test to determine if the CIA multinomial distribution was the same as that reported in the study? Am. Ho: pi = .30,p2 = .60, and p3 = .lO; Ha:The proportions are not as stated in the null hypothesis. OBSERVEDAND EXPECTEDFREQUENCIES 11.25 Refer to problem 11.23. The Nebraska survey found that 120 planned to work after retirement age,l60 did not plan to work after retirement age, and 35 were not sure. What are the expected frequencies? Ans. et = 141.75, e2= 141.75, and e3= 31.5 11.26 Refer to problem 11.24. The CIA study found that 37 had worked with someone whose alcohol or drug use had affected theirjob, 58 had not, and 17 did not know. What are the expected frequencies? Ans. el = 33.6, e2 = 67.2, and e3 = 11.2 SAMPLINGDISTRIBUTIONOF THE GOODNESS-OF-FITTEST STATISTIC 11.27 Refer to problems 11.23and 11.25.Perform the test at a = .01. Ans. The critical value is 9.21. The computed value of the test statistic is x2* = 6.076. It cannot be concluded that the Nebraska distribution is different from the national distribution. 11.28 Refer to problems 1 1.24 and 11.26.Perform the test at a = .05. Am. The critical value is 5.991. The computed value of the test statistic is x2* = 4.607. It cannot be concluded that the CIA distribution is different from that reported in the sociological study.
  • 278.
    CHAP. 111 Crime type Murder Rape Robbery Assault CHI-SQUAREPROCEDURES 18-25 26-33 34-40 over 40 11 15 4 2 21 26 11 3 120 85 3 2 147 42 5 3 269 Crime type 18-25 Murder 19.14 Rape 36.48 CHI-SQUARE INDEPENDENCE TEST 26-33 34-40 over 40 10.75 1.47 0.64 20.50 2.8 1 1.22 11.29 Five hundred arrest records were randomly selected and the records were categorized according to age and type of violent crime. The results are shown in Table 1 1.36. Robbery Assault Table 11.36 125.58 70.56 9.66 4.20 117.81 66.19 9.06 3.94 Cancer site Lung Oral-bladder Other cancer Find the table of expected frequencies assuming independence, and comment on performing the Chi- square test of independence. Smoking history Never Light Medium Heavy 15 35 45 100 35 40 40 35 250 135 190 150 Ans. The expected frequencies are given in Table 11.37. The Chi-square test of independence is inappropriate because 37.5% of the expected frequencies are less than 5 and one is less than I. , Smoking history Cancer site Never Light Medium Heavy Lung 90.37 36.78 35.04 32.82 Oral-bladder 69.5 1 28.29 26.95 25.24 Other cancer 335.98 136.75 130.26 122.01 No cancer 2354.15 958.18 912.75 854.93 11.30 Table 11.38 gives various cancer sites and smoking history for participants in a cancer study. I ~ o c a n c e r I 2550 I 950 I 830 I 750 1 Find the expected frequencies, assuming that cancer site is independent of smoking history. Ans. The expected frequencies are shown in Table 11.39. SAMPLING DISTRIBUTION OF’THE TEST STATISTICFOR THE CHI-SQUARE INDEPENDENCE TEST 11.31 Combine the categories “34-40” and “Over 40” into a new category called “Over 33” and perform the Chi-square independence test for problem 11.29.
  • 279.
    270 Murder Rape Robbery Assault CHI-SQUARE PROCEDURES 11 (19.14)15 (10.75) 6 (2.11) 21 (36.48) 26 (20.50) 14 (4.03) 120 (125.58) 85 (70.56) 5 (13.86) 147 (117.81) 42 (66.19) 8 (13.00) [CHAP. 1 1 850 750 loo0 1250 800 950 Ans. The new contingency table is given in Table 11.40. The expected frequencies are shown in parentheses. 1100 975 675 990 I150 700 780 1100 1025 I025 900 1250 975 1150 1125 980 I050 1200 I500 1040 1350 800 1255 1350 Table 11.40 I Crime tvDe I 18-25 I 26-33 I Over33 I Two of the cells have expected frequencies less than 5. Some statisticians would not perform the independence test because two of the expected frequencies are less than 5 . The less restrictive rule states that the test is valid if all expected frequencies are at least one and at most 20% of the expected frequencies are less than one. The computed value of the test statistic is 71.9I8 and the p value is .000.The type of crime is most likely dependent on the age group. 11.32 Compute the value of the test statistic for the test of independence in problem 11.30 and test the hypothesis of independence at a = .01. Ans. The degrees of freedom is df = (r - 1) x (c - 1) = (4 - 1) x (4 - 1) = 9. The critical value is 21.666. The computed value of the test statistic is x2* = 327.960. The cancer site is dependent upon the smoking history of an individual. SAMPLING DISTRIBUTION OF THE SAMPLE VARIANCE 11.33 Fill in the following parentheses with s or 0. (a)( ) is a constant, but ( (b)( ) has a distribution, but ( ) does not have a distribution. (c) ( ) is a parameter and ( ( d ) ( ) describes the variability of some characteristic for a sample, whereas ( ) is a variable. ) is a statistic. bility of a characteristic for a population. ) describes the varia- Ans. (a)o,s (b)s, o ( c ) ~ , s (d)s, 0 (n - I)S* 11.34 What distributional assumption is necessary in order that ~ have a Chi-square distribution with 0 " (n - I) degrees of freedom? Ans. The random sample from which S2 is computed is assumed to be selected from a normal distribution. INFERENCES CONCERNING THE POPULATION VARIANCE 11.35 Table 1I .41 gives the monthly rent for 30 randomly selected 2-bedroom apartments in good condition from the state of New York. Find a 95% confidence interval for the standard deviation of all monthly rents in New York for 2-bedroom apartments in good condition. Table 11.41
  • 280.
    CHAP. 111 CHI-SQUAREPROCEDURES 45 50 45 28 30 32 271 5558 38 44 54 57 44 40 29 47 37 38 44 46 35 56 42 49 57 34 41 52 52 47 Ans. The sample standard deviation is equal to $203.24. The Chi-square table values are The 95% confidence interval extends from $161.86 to $273.22. = 45.722 and x& = 16.047 for 29 degrees of freedom. 11.36 Table 11.42 gives the ages of 30 randomly selected airline pilots. Use the data to test the null hypothesis that the standard deviation of airline pilots is equal to 10 years vs. the alternative that the standard deviation is not equal to 10years. Use level of significancea = .01. Ans. The standard deviation of the sample is 8.83 years. The critical values are 13.121 and 52.336. The computed value of the test statistic is 22.61. The sample data does not contradict the null hypothesis.
  • 281.
    Chapter 12 0.9 - J 0.8- 0.7 - 0.6 - 0.5 - 0.4 - 0.3- 0.2- 0.1 - 0.0- Analysis of Variance (ANOVA) F DISTRIBUTION Analysis of Variance (ANOVA) procedures are used to compare the means of several populations. ANOVA procedures might be used to answer the following questions: Do differences exist in the mean levels of repetitive stress injuries (RSI) for four different keyboard designs? Is there a difference in the mean amount awarded for age bias suits, race bias suits, and sex bias suits? Is there a difference in the mean amount spent on vacations for the three minority groups: Asians, Hispanics, and African Americans? ANOVA procedures utilize a distribution called the F distribution. A given F distribution has two separate degrees of freedom, represented by dfl and df2. The first, dfl, is called the degrees of freedom for the numerator and the second, df2, is called degrees of freedom for the denominator. For an F distribution with 5 degrees of freedom for the numerator and 10degrees of freedom for the numerator, we write df = (dfl, df2) = (5, 10).Figure 12-1 shows one F distribution with df = (10, 50) and another with df = (10,5). / I ----------- I 1 I 1 I 1 0 1 2 3 4 5 Fig. 12-1 Table 12.1gives the basic properties of the F distribution. Table 12.1 1 Propertiesof the Fdistribution 1. The total area under the F distributioncurve is equal to one. 2. An F distribution curve starts at 0 on the horizontal axis and extends indefinitely to the right, approaching but never touching the horizontal axis. 3. An F distributioncurve is always skewed to the right. 4. The mean of the F distribution is p =- df2 for df2 > 2, where df2 is the I df2-2 I degrees of freedom for the denominator. 272
  • 282.
    CHAP. 121 ANALYSISOF VARIANCE (ANOVA) 273 - df2 df2-2 EXAMPLE 1 2 . 1 The F distribution with df = (10, 5) shown in Fig. 12-1 has mean equal to p =-- 5 df2 50 -- - 1.67,and the F distribution shown in the same figure with df = (10,50) has mean p =---- df2 -2-50- 2 - 5-2 1.04. F TABLE The F distribution table for 5% and 1% right-hand tail areas is given in Appendix 5. A technique for finding F values for the 5% and 1%left-hand tail areas will be given after illustratinghow to read the table. EXAMPLE 1 2 . 2 Consider the F distribution with dfi = 4 and df2 = 6. Table 12.2 shows a portion of the F distributiontable, with right tail area equal to .05,given in Appendix 5. df2 1 2 3 4 5 6 100 ... Table 1 2 . 2 C 1 5.99 2 5.14 3 4.76 4 4.53 ... ... 20 3.87 Table 12.2 indicates that the area to the right of 4.53 is .05. This is shown in Fig. 12-2. The F distribution table with right tail area equal to .01 indicates that the area to the right of 9.15 is .01. This is shown in Fig. 12-3. Fig. 12-2 Fig. 12-3 The following notation will be used with respect to the F distribution.The value, 4.53, shown in Fig. 12-2, is represented as F.05(4, 6) =4.53, and the value 9.15, as shown in Fig. 12-3is represented as F.o1(4,6)=9.15.
  • 283.
    274 ANALYSIS OFVARIANCE (ANOVA) [CHAP. 12 dfl dfi 1 2 3 4 5 6 . . . 20 . . . 4 7.71 6.94 6.59 6.39 6.26 6.16 5.80 > Using the notation introduced in Example 12.2,the result shown in formula (12.I ) holds for any F distribution. 12 mean = 10 (12.1) 26 17 mean = 24 mean = 17 EXAMPLE 12.3 Formula (12.1)may be used to find the F values for 1% and 5% left-hand tail areas for the F distribution discussed in Example 12.2. The symbol F.95(4,6) represents the value for which the area to the right of F,95(4, 6) is .95 and therefore the area to the left of F.95(4,6) is .05. Using formula (12.1), we find that The value F,05(6,4) = 6.16 is shown in bold print in Table 12.3,which is taken from the F distribution table. Table 12.3 Similarly, the 1% left-hand tail area is LOGIC BEHIND A ONE-WAY ANOVA An experiment was conducted to compare three different computer keyboard designs with respect to their affect on repetitive stress injuries (rsi). Fifteen businesses of comparable size participated in a study to compare the three keyboard designs. Five of the fifteen businesses were randomly selected and their computers were equipped with design 1 keyboards.Five of the remaining ten were selected and equipped with design 2 keyboards, and the remaining five used design 3 keyboards. After one year, the number of rsi were recorded for each company. The results are shown in Table 12.4. Table 12.4 Design 1 Design 2 Design 3 24 10 24 19 A Minitab dotplot of the data for the three designs is shown in Fig. 12-4.
  • 284.
    CHAP. 121 ANALYSISOF VARIANCE (ANOVA) 275 Fig. 12-4 If p1, p 2 , and p3 represent the mean number of rsi for design 1, design 2, and design 3, respectively, for the population of all such companies, the data indicate that the population means differ. Suppose the data for the study were as shown in Table 12.5. A Minitab dotplot for these data is shown in Fig. 12-5. Table 12.5 24 19 10 22 29 24 1 mean= 10 I mean=24 1 mean = 17 I keyboard 2 . . Fig. 12-5 Note that for the data shown in Fig. 12-5,it is not clear that the mean number of rsi differ for the three populations using the three different designs. The difference between the two sets of data can be explained by considering two different sources of variation. The between samples variation is measured by considering the variation between the sample means. In both cases, the sample means are XI = 10, 512 = 24, and X3 = 17.The between samples variation is the same for the two sets of data. The within samples variation is measured by considering the variation within the samples. The dotplots indicate that the within samples variation is much greater for the data in Table 12.5than for the data in Table 12.4.The measurement of these two sources of variation will be discussed in the next section. However, it is clear from the above discussion that a decision concerning the effectiveness of the three designs may be based on a considerationof the two sources of variation. In particular, it will be shown that the ratio of the between variation to the within variation may be used to decide whether PI,p2,and p3 are equal or not.
  • 285.
    276 ANALYSIS OFVARIANCE (ANOVA) [CHAP. 12 35 - 30 25 10 5 - 0 1 EXAMPLE 12.4 Figures 12-6 and 12-7 show boxplots for the data shown in Tables 12.4 and 12.5, respectively. Notice that Fig. 12-6 suggests different population means, while Fig. 12-7 suggests that it is possible that the population means may be the same. :::fi8fl 1 I I rsi 1 2 3 keyboard Fig. 12-6 Fig. 12-7 SUM OF SQUARES, MEAN SQUARES, AND DEGREES OF FREEDOM FOR A ONE-WAY ANOVA The following notation will be used when samples from k different normally distributed populations having equal population variances are selected in order to test for the equality of the means of the k populations: The sample size, sample mean, and sample variance for the ith population are represented by ni, X,,ands?,,respectively. The total sample size is n = nl + n2 + . . - + nk. The overall mean for all n sample values is represented by X . The population mean for the ith population is represented by piand the standard deviation for the ith population is represented by oi. The between samples variation is measured by the between treatmerzts mearz square and is represented by MSTR. The expression for MSTR is given by formula (12.2): SSTR MSTR = - k - 1 (12.2) The numerator of formula (12.2),SSTR, is called the treatment sum of squares, and is computed by using formula (12.3): The within samples variation is measured by the error mean square and is represented by MSE. The expression for MSE is given by SSE MSE = - n-k ( I2.4) The numerator of formula (12.4), SSE, is called the error sum of squares, and is computed by using formula (12.5),where s:, s:, . . . ,S [ are the sample variances.
  • 286.
    CHAP. 121 Design 1 10 10 8 10 12 mean= 10 ANALYSIS OF VARIANCE (ANOVA) 277 Design 2 Design 3 24 17 22 17 24 15 24 19 26 17 mean = 24 mean = 17 SSE = (nl - 1)s: +(n2- l)s$ +.- +( n k - 1)s; (12.5) The denominatorof formula (12.2),k - 1, is called the degrees offreedomfor treatments and the The sum of the treatment sum of squares and the error sum of squares is called the total sum o f denominatorof formula (12.4),n - k,is called the degrees o ffreedomfor error. squares. The total sum of squares is represented by SST and is given by SST = SSTR +SSE (12.6) The total sum of squares may be computed directly by using formula ( 1 2 4 , where the sum is over all n sample values. The degrees o ffreedomfor total is equal to n - 1. SST =C(X-7i)2 (12.7) EXAMPLE 12.5 The data from Table 12.4 is listed below for convenient reference. Since the three sample sizes are equal, the overall mean is computed by finding the mean of the three treatment (design)means. The mean of 10, 24, and 17 is 17, and therefore Ti: = 17. The total sum of squares will be found first. SST= C ( X - X ) ~ = (10- 17)2+(10- 17)2+(8- 17)2+(10- 17)2+(12- 17)2+(24- 17)2+(22- 17)2+(24- 17)2 = 4 9 + 4 9 + 8 1 + 4 9 + 2 5 + 4 9 + 2 5 + 4 9 + 4 9 + 8 1 + 0 + 0 + 4 + 4 + 0 = 514 +(24- 17)2+(26- 17)2+(17- 17)2+(17- 17)2+(15- 17)2+(19- 17)2+(17- 17)2 The treatment sum of squares is computed by using formula (12.3). SSTR= nl(xl- ~ r > ~ + n 2 ( ~ 2 - x12+ n3 (x3- x12 = 5 ~ ( 1 0 - 1 7 ) ~ + 5 ~ ( 2 4 - 1 7 ) ~ + 5 ~ ( 1 7 - 1 7 ) ~ = 5 X 4 9 + 5 X 4 9 + 5 X O = 490 We need to find the three sample variances before finding the error sum of squares. The formula for the sample variance is applied to ezch sample separately as follows: ( 10 - 10)2+(10 - 10)2+(8 - 10)2+(10 - 10)2+( 12 - 10)2 = s: = 5 - 1 (24 -24)’ +(22 -2412+(24 - 24)2+(24 -24)2+(26 -24)2 = s; = 5-1 (17- 17)2+(17-17)2+(15-17)’+(19-17)2+(17-17)2 = 2 s: = 5 - 1 The error sum of squares is computed using formula (12.5). SSE = (nl - 1)s: + (n2 - 1)s; +(n3 - 1)s; = 4 X 2 + 4 X 2 + 4 X 2 = 2 4
  • 287.
    278 ANALYSIS OFVARIANCE (ANOVA) [CHAP. 12 Design I 10 10 8 10 12 TI = 50 Note that SST = SSTR + SSE. It is clear from this example that the computation of the various sums of squares is rather time consuming and subject to computational errors. Computer software is usually employed to perform these computations. This will be discussed later. If it is necessary to perform the computations by hand, shortcut formulas are recommended. These shortcut formulas will now be discussed. The shortcut formula for computing the total sum of squares is given in (12.8): Design 2 Design 3 24 17 22 17 24 15 24 19 26 17 Tz= 120 T3 = 85 (W2 SST = Cx2- - (12.8) n The treatment sum of squares is computed by using formula (12.9), where Ti is the sum of the sample values for the ith treatment. (12.9) After SST and SSTR are computed, SSE is found by using formula (12.10): SSE = SST - SSTR (12.10) EXAMPLE 12.6 To find the sum of squares found in Example 12.5 using the shortcut formulas, refer to the following table. To find SST, it is necessary first to find Cx and Zx2. C x = 1 0 + 1 0 + 8 + 1 0 + 1 2 + 2 4 + 2 2 + 2 4 + 2 4 + 2 6 + 1 7 + 1 7 + 1 5 + 1 9 + 1 7 = 2 5 5 Cx2= 100-t. 100+ 64 + 100 + 144 +576 +484 + 576 +576 +676 +289 + 289 +225 + 361 +289 = 4849 -4849-4335=514 SST = Cx2- -- (Cx)* n The treatment sum of squares is found next. and, then the error sum of squares is found by subtraction: SSE= SST - SSTR= 514 - 490 = 24 The degrees of freedom for total is n - I = 14, the degrees of freedom for treatments is k - 1 = 2, and the degrees of freedom for error is n - k = 15 - 3 = 12.
  • 288.
    CHAP. 121 ANALYSISOF VARIANCE (ANOVA) 279 Source df ss MS = SS/df F Statistic Error n - k SSE MSE Total n - 1 SST Treatment k - 1 SSTR MSTR F SSTR 490 The between treatments mean square is MSTR = -- - -= 245, and the error mean square is MSE = k - 1 2 Source Treatment Error Total SSE 24 - - - - = 2. This example illustrates that it is much easier to compute the sums of squares using the n - k 12 shortcut formulas than it is to use the defining formulas. Most researchers use statistical software to perform these computations. df ss MS = SS/df F Statistic 2 490 245 122.5 12 24 2 14 514 SAMPLING DISTRIBUTION FOR THE ONE-WAY ANOVA TEST STATISTIC The purpose behind a one-way ANOVA is to determine if the means of k populations are equal or not. In particular, we are interested in testing the null hypothesis Ho: 111 = p2 = . - - = p k against the alternative hypothesis Ha: All k means are not equal. When the k samples are randomly and independently selected from normal populations having equal population standard deviations (or equivalently equal population variances), and the null hypothesis is true, the ratio of the treatment mean square to the error mean square has an F distribution with dfl = k - I , and df2 = n - k. We express this as shown in formula (12.11).This ratio, sometimes referred to as the F ratio, is the test statistic used to test the above hypotheses. MSTR F=- MSE (12.12) EXAMPLE 12.7 In Examples 12.5 and 12.6, three different keyboard designs were compared with respect to the number of rsi found at 15 companies using the three designs. The test statistic given in formula (12.1I ) has an F distribution with df, = k - 1 = 3 - 1 = 2, and df2 = n - k = 15 - 3 = 12. In Example 12.6, it was shown that MSTR = 245 and MSE = 2. The computed value of the test statistic is F* = ?= 122.5. From the F distribution table in Appendix 5, the 5% and 1% right-hand tail critical values are as follows: F05(2, 12) = 3.89 and F01(2,12)= 6.93. The extremely large value of the test statistic suggests that the population means are not equal and that the null hypothesis should be rejected. The differences in the sample means are significant, and indicate that design I may reduce the number of rsi. BUILDING ONE-WAY ANOVA TABLES AND TESTING THE EQUALITY OF MEANS The results of the computations in the proceeding sections are usually conveniently displayed in a one-wwy ANOVA table. The general structure of the one-way ANOVA table is given in Table 12.6. EXAMPLE 12.8 The computations in Examples 12.6and 12.7 are summarized in the ANOVA Table 12.7. The steps in testing the equality of means is summarized in Table 12.8.
  • 289.
    280 . Steps forTesting the Equality of Means Using the One-way ANOVA Procedure Step I : State the null and alternative hypothesis as follows: H o : =112 = * * * = Ha: All k means are not equal. Step 2: Use the F distribution table and the level of significance, a,to determine the rejection region. Step 3: Build the ANOVA Table, and from the table determine the computed value of the F ratio. Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the test statistic falls in the rejection region. Otherwise, the null hypothesis is not rejected. ANALYSIS OF VARIANCE (ANOVA) [CHAP. 12 Design 1 10 Design 2 Design 3 24 17 Computer software is routinely used to perform the necessary computations to compute the test statistic used to test the equality of means. The software usually produces the ANOVA table and in addition provides a p value for the test. When the p value is given, the null hypothesis is rejected if the p value is less than the preset level of significance. EXAMPLE 12.9 Two different Minitab commands will now be discussed which produce a one-way ANOVA table. The data for the number of rsi for the three keyboard designs is reproduced below. The Minitab command aovonewaycl -c3 requires that the data for the three designs be put into three separate columns. These data were put into columns cl, c2, and c3. Row design 1 design2 design3 1 10 24 17 2 10 22 17 3 8 24 15 4 10 24 19 5 12 26 17 MTB > aovoneway c1 -c3 One-way Analysis of Variance Analysis of Variance Source DF SS MS Factor 2 490.00 245.00 Error 12 24.00 2.00 Total 14 514.00 Level N Mean St Dev design1 5 10.00 1.414 design2 5 24.00 1.414 design3 5 17.00 1.414 F P 122.50 O.OO0
  • 290.
    CHAP. 121 ANALYSISOF VARIANCE (ANOVA) 281 The above ANOVA table is the same as the one given in Table 12.7. The Minitab output shows the p value associated with the test under the column labeled P. Recall from our previous discussion of p value, that the p value is the area to the right of F*= 122.50 for an F distribution having df, = 2 and df2= 12. Or, stated in words, if the null hypothesis is true, i.e., if pI= p 2 = p3,the probability of obtaining an F value this large or larger is .OOO. For this reason, the null hypothesis would be rejected. The second Minitab command which produces a one-way ANOVA table is oneway response in c2 treatment in cl. The treatment groups are identified in one column and the responses are given in another column. The ouput is the same for both commands. The required data setup differs for the two commands. Row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 design 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 rsi 10 10 8 10 12 24 22 24 24 26 17 17 15 19 17 MTB >oneway response in c2 treatment in cl One-way Analysisof Variance Analysis of Variance for rsi Source DF SS MS F P Design 2 490.00 245.00 122.50 O.OO0 Error 12 24.00 2.00 Total 14 514.00 EXAMPLE 12.10 Before ending our discussion of the one-way ANOVA, it is instructive to compare the ANOVA tables for the two sets of data given in Tables 12.4 and 12.5 and illustrated in Figs. 12-4 and 12-5 as well as Figs. 12-6 and 12-7. The ANOVA for the data in Table 12.4 is given above in Example 12.9. The Minitab output for the ANOVA Table corresponding to the data in Table 12.5 is One-way Analysis of Variance Analysis of Variance Source DF SS MS F P Factor 2 490.0 245.0 3.30 0.072 Error 12 890.0 74.2 Total 14 1380.0 The ANOVA for the data in Table 12.5 does not indicate a difference in the three designs for a = .05, since p = 0.072. Thus, the statistical test for the hypotheses concerning the means confirms what Figs. 12-4 and 12-5 as well as Figs. 12-6and 12-7 suggested. LOGIC BEHIND A TWO-WAY ANOVA Suppose that rather than considering only the effect of the factor keyboard design, we would also like to consider the effect of the factor seating design on repetitive stress injuries (rsi). The one-
  • 291.
    282 ANALYSIS OFVARIANCE (ANOVA) [CHAP. 12 Seating design I 2 3 way ANOVA allows us to analyze the effect of only one factor on the response variable, rsi. A two- way ANOVA allows one to analyze the effects of two factors on a response variable. Suppose there are three different keyboard designsand three different seating designs we are interested in testing. A 3 x 3factorialdesign is one in which each keyboard design is matched with each seatingdesign for a total of nine treatment combinations. Such a design allows one to test the main eflects for keyboard designs and seating designs as well as test for interaction between the two factors. Suppose 18 businesses of comparable size are selected for the study. A randomization scheme is used and each of the nine treatments are used at two businesses. The number of rsi are recorded at each location and the results are shown in Table 12.9. 1 2 3 10,12 20,22 16, 18 14, 16 25,27 20,22 898 18, 16 14, 16 Table 12.9 Seating design 1 2 3 Column mean I Kevboard design I Keyboard design 1 2 3 Row mean 11 21 17 16.33 15 26 21 20.67 8 17 15 13.33 11.33 21.33 17.67 21.0 - 18.5 - 16.0 - 13.5 - 11.0 - EXAMPLE 12.11 Table 12.10 is a table of means for the results shown in Table 12.9. The mean for each treatment (Keyboard-Seating design combination) is given as well as the marginal means. It is clear that keyboard design 2 with seating design 2 resulted in the largest mean number of rsi. Also, keyboard design 1 with seating design 3 resulted in the smallest mean number of rsi. I Table 12.10 A graphical technique for analyzing the results of the experiment is a main effects plot. A Minitab main effects plot is shown in Fig. 12-8.The main effects plot for the factor, seating, is a plot of the row means given in Table 12.10. Each of these means is calculated for six businesses. The main effects plot for the factor, keyboard, is a plot of the column means given in Table 12.10. Each of these means is calculated for six businesses. Main Effects Plot - Means for rsi 1 I I seating I 1 1 keyboard Fig. 12-8
  • 292.
    CHAP. 121 Seating designI 2 3 1 18,20 20,22 16,18 2 14, 16 25,27 20,22 + 3 8, 8 18, 16 14, 16 ANALYSIS OF VARIANCE (ANOVA) 283 Another plot used to help explain the experimental results is called the interaction plot. A Minitab interaction plot is shown in Fig. 12-9. The interaction plot indicates that regardless of the seating design used, keyboard design 1 results in the smallest number of rsi, keyboard design 2 results in the largest number of rsi, and the number of rsi for keyboard design 3 is between the other two. When the line segments are roughly parallel, as they are here, we say that there is no interactionbetween the factors. Interaction Plot - Means for rsi seating 1 2 keyboard Fig. 12-9 EXAMPLE 12.12 Suppose the experiment described in 12.11. 3 Example 12.11 resulted in the data shown in Table Table 12.11 1 Kevboard desien 1 The Minitab main effects plot is shown in Fig. 12-10and the Minitab interaction plot is shown in Fig. 12-11. An understanding of interaction is accomplished by comparing Fig. 12-11 with Fig. 12-9. Recall that in Fig. 12-9, we were able to conclude that regardless of the seating design used, keyboard design 1 results in the smallest number of rsi, keyboard design 2 results in the largest number of rsi, and the number of rsi for keyboard design 3 is between the other two. Notice that this is not true in Fig. 12-11. In particular, when seating design 1 is used the smallest mean number of rsi occur for keyboard design 3, not keyboard design 1. In this study, the factors keyboard design and seating design are said to interact. Note that the line segments on the interaction plot are not parallel. The keyboard design which produces the minimum number of rsi depends on the seating design used. We cannot conclude that keyboard design 1 always produces the minimum number of rsi. When interaction is present, the main effects must be interpreted carefully. The discussion and examples in this section are intended to provide some insight into the logic behind a two-factorfactorial design. In the next section the ideas will be generalized and the analysis
  • 293.
    284 ANALYSIS OFVARIANCE (ANOVA) [CHAP. 12 of variance associated with the two-factordesign will be discussed.This section is intended to make the general discussioneasierto understand. Main Effects Plot - Means for rsi 21 19 17 15 13 25 20 15 10 I 1 I 1 1 seating keyboard Fig. 12-10 Interaction Plot - Means for rsi seating ' 1 O 2 * 3 1 2 3 - -- - - - . I 1 I 2 keyboard I 3 Fig. 12-11
  • 294.
    CHAP. 121 Gender Male Female ANALYSIS OFVARIANCE (ANOVA) Age Race Disability 200, 175,215 150, 125, 135 100,95, 115 185,210,225 130, 145, 115 90,80, 110 285 Source df ss MS = SS/df F Statistic Error n - ab SSE MSE Treatment ab- 1 SSTR MSTR F i Total n - 1 SST SUM OF SQUARES,MEAN SQUARES,AND DEGREES OF FREEDOM FOR A TWO-WAYANOVA Let A and B be two factors whose influence on a response variable we wish to investigate. Let a be the number of levels of factor A and let b be the number of levels of factor B. Each level of factor A is combined with each level of factor B to form a total of ab treatments. Equal size samples are taken from each of the ab treatments. If each sample is of size m, the total number of sample values is n = mab. EXAMPLE 12.13 Table 12.12 gives the compensatory awards (in thousands)for wrongful firing based on age, race, or disability for males and females. Either bias type or gender may be called factor A. Suppose we let factor A be bias type. Then A has three levels and B has two levels. There are six populations or treatments. The terms population, treatment, or treatment combination are used for the six groups. We shall use the term treatment. The six treatments are: (age bias and male), (race bias and male), (disability bias and male), (age bias and female), (race bias and female), and (disability bias and female). The values for a, b, m, and n are 3, 2, 3, and 18, respectively. Table 12.12 I Bias type I Initially, a two factor factorial design may be analyzed using a one-way ANOVA. The number of treatments is k = ab, and the total sample size is n. The one-way ANOVA for the two factor design is given in Table 12.13. The Treatment sum of squares, SSTR, is now partitioned in three parts as shown in formula (Z2.Z2), where SSA is calledfactor A sum o f squares, SSB is calledfactcr B sum of squares, and SSAB is called interaction sum o f squares. SSTR = SSA +SSB +SSAB (12.12) The treatment degrees of freedom, ab - I, is also partitioned into three parts as follows: df for A = a - 1, df for B = b - 1, df for AB = (a - l)(b - 1) The degrees of freedom for A, B, and AB add up to the degrees of freedom for treatment as given by formula ( I2.13): ab - 1 = (a - 1)+ (b - 1)+ (a - l)(b - I) (12.13) Mean squares are also defined for A, B, and AB. The factor A mean square, MSA, is given by formula (12.14): SSA MSA = - a-1 (12.14)
  • 295.
    286 ANALYSIS OFVARIANCE (ANOVA) L Source df ss MS F Statistic Bias type 2 SSA MSA F A Sex 1 SSB MSB FB Interaction 2 SSAB MSAB FAB Error 12 SSE MSE Total 17 SST : The factor B mean square,MSB, is given by SSB MSB = - b- 1 The interaction mean square,MSAB, is given by SSAB (a - I)(b - 1) MSAB = [CHAP. 12 (12.1.5) (12.16) The formulas for the sum of squares for A, B, and AB will not be discussed since the analysis of factorial designs are almost always performed by computer software. EXAMPLE 12.14 In Example 12.13,the total degrees of freedom is n - 1 = 18- 1 = 17, the treatment degrees of freedom is ab - 1 = 6 - 1 = 5, and the error degrees of freedom is n - ab = 18 - 6 = 12. Furthermore, the 5 degrees of freedom for treatments is partitioned into a - 1 = 3 - 1 = 2 for factor A, b - 1 = 2 - 1 = 1 for factor B, and (a - l)(b - I)= 2 x 1 = 2 for interaction. BUILDING TWO-WAYANOVA TABLES The sum of squares, mean squares, degrees of freedom, and test statistics for a factorial design are summarized in a two-way ANOVA table. The structure for a two-way ANOVA table is given in Table 12.14. Table 12.14 Source Factor A Factor B Interaction (a -l)(b - 1) SSAB Error n - ab Total n - 1 SST MS = SS/df ' 7 1 F Statistic FA= MSA/MSE F B = MSBMSE FAB = MSABMSE EXAMPLE 12.15 A two-way ANOVA table for the data given in Table 12.12 is given in Table 12.15. Table 12.15 SAMPLING DISTRIBUTIONSFOR THETWO-WAYANOVA When there is no interaction between factors A and B, the test statistic FABhas an F distribution with dfl = (a - l)(b - 1)and df2 = n - ab. When there are no differences among the means for the main effect A, the test statistic FAhas an F distribution with dfl = a - I and df2 = n - ab. When there are no differences among the means for the main effect B, the test statistic Fg has an F distribution with dfl = b - 1 and df2 = n - ab.
  • 296.
    CHAP. 121 ANALYSISOF VARIANCE (ANOVA) 287 Seating design 1 2 3 When there are no differences among the means for the factor A, we say there is no main effect due to factor A. Otherwise, we say there is a main effect due to factor A. Similar terminology is used for factor B. 1 2 3 10,12 20,22 16, 18 14, 16 25,27 20,22 8. 8 18. 16 14. 16 EXAMPLE 12.16 In Example 12.15, the test statistic F A has an F distribution with df, = 2 and dfi = 12. The test statistic F B has an F distribution with dfl = 1 and df2= 12. The test statistic F A B has an F distribution with dfl = 2 and dfi = 12. These distributions may be used to test for significant interaction and main effects and is discussed in the next section when the computer analysis of the data is discussed. TESTING HYPOTHESIS CONCERNINGMAIN EFFECTS AND INTERACTION The steps for testing interaction and main effects in a two-factor factorial design is summarized in Table 12.16. Steps for Testing for Interaction and Main Effects Using the Two-way ANOVA Procedure Step 1: State the null and alternative hypothesis for interaction and main effects as follows: Ho:There is no interaction between factors A and B Ha:Factors A and B interact Ho:There are no differences among the means for main effect A Ha: At least two of the main effect A means differ Ho:There are no differences among the means for main effect B Ha:At least two of the main effect B means differ Step 2: Use the F distribution table and the level of significance a to determine the rejection regions. Step 3: Build the ANOVA table and from the table determine the computed values of the test statistics. Step 4: State your conclusions. The null hypothesis is rejected if the computed value of the test statistic falls in the rejection region. Otherwise, the null hypothesis is not rejected. EXAMPLE 12.17 Table 12.9gave the number of repetitive stress injuries (rsi)for a 3 by 3 factorial design and the table is reproduced below. The Minitab analysis for this experiment is also shown. I Keyboard design I Two different Minitab commands will be discussed to analyze the data. The first to be discussed is the command Twoway. First of all note the form of the data as given below. The data in the above table must be put in the form shown before performing the analysis. Each data line identifies the row, the column, and the data value in that row and column.
  • 297.
    288 ANALYSIS OFVARIANCE (ANOVA) [CHAP. 12 Row 1 2 3 4 5 6 7 9 10 11 12 13 14 15 16 1 7 18 a seating 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 keyboard 1 1 2 2 3 3 1 1 2 2 3 3 1 1 2 2 3 3 rsi 10 12 20 22 16 18 14 16 25 27 20 22 8 18 16 14 16 a The command Twoway 'rsi' 'seating' 'keyboard';starts with the keyword Twoway, followed by the response and then the row and column names. The subcommand Means 'seating' 'keyboard'. gives row and column means and confidence intervals. MTB > Twoooay Irmil lmoatiagl lkoyboardl; SUBC > Moanm lmoati~gl lkoyboardl. Two-way Analysis of Variance Analysis of Variance for rsi Source DF ss MS seating 2 163.11 81.56 keyboard 2 307.11 153.56 Interaction 4 4 . 0 9 1.22 Error 9 16.00 1.78 Total 17 491.11 The test statistic value for the null hypothesis b:No interaction between factors A and B is easily computed, since MSAB = 1.22and MSE = 1.78.The computed value for FAB* is 1.2211.78= 0.69. The critical value for a = .05 is F.o5(4, 9) = 3.63. Since FA^* does not exceed 3.63, the null hypothesis is not rejected. Therefore, interactionis not significant.This confirms what the interaction plot in Fig. 12-9indicates. Suppose we call keyboard design factor A. The test statistic value for the null hypothesis &:There are no differences among the means for main effect A is easily computed, since MSA = 153.56and MSE = 1.78.The computed value for FA* is 153.56/1.78= 86.27. The critical value for a = .05 is F.m(2, 9) = 4.26. Since FA* far exceeds 4.26, we reject the null hypothesis and concludethat the means for the three keyboard designsdiffer. The seating design is factor B.The test statistic value for the null hypothesis H o :There are no differences among the means for main effect B is easily computed, since MSB = 81.56 and MSE = 1.78.The computed value for FB* is 81.56/1.78 = 45.82. The critical value for a = .05 is F.os(2, 9) = 4.26. Since FA* far exceeds
  • 298.
    CHAP. 121 Seatingdesign 1 2 3 ANALYSIS OFVARIANCE (ANOVA) 1 2 3 18,20 20,22 16, 18 14,16 25,21 20,22 8,8 18, 16 14,16 289 4.26, we reject the null hypothesis and conclude that the means for the three seating designs differ. The main effects plot in Fig. 12-8 indicates that seating design 3 and keyboard design 1 is a good combination for reducing repetitive stress injuries. A second Minitab command for doing the analysis is the command ANOVA 'rsi' = 'seating' 'keyboard' 'seating'*'keyboard'. The output below is produced by this command. This output may be preferable, since it provides p values. By considering the p values, we see immediately that interaction is not significant, but that both main effects are significant. MTB > ANOVA 'rain = laeatl~gn lkeyboardn lmoati~gn*'koyb~ardl Analysis of Variance(BalancedDesigns) Factor Type Levels Values seating fixed 3 1 2 3 keyboard fixed 3 1 2 3 Analysis of Variance for rsi Source DF ss MS F P seating 2 163.111 81.556 45.87 0.000 keyboard 2 307.111 153.556 86.37 0.000 seating*keyboard 4 4 . aa9 1.222 0.69 0.619 Total 17 491.111 Error 9 16.000 1.778 EXAMPLE 12.18 The data from Table 12.11are reproduced below. I Kevboard design I The Minitab analysisfor these data is shown below. MTB > ANOVA 'rsi' = 'seating' 'keyboard' 'seating'*'keyboard' Analysis of Variance(BalancedDesigns) Factor Type Levels Values seating fixed 3 1 2 3 keyboard fixed 3 1 2 3 Analysis of Variance for rsi Source DF ss MS F P seating 2 177.333 88.667 49.88 0.000 keyboard 2 161.333 80.667 45.38 0.000 seating*keyboard 4 65.333 16.333 9.19 0.003 Error 9 16.000 1.778 Total 17 420.000 From this output, we immediatelysee that there is significant interaction. This confirmswhat the interactionplot shows in Fig 12-11. It would be a good idea for you to review the discussion given in Example 12.12 concerning the nature of the interaction. Remember, when interaction is present, the main effects must be interpreted carefully. EXAMPLE 12.19 The 3 by 2 factorial data for the factors bias type and gender given in Table 12.12 are reproduced on the next page.
  • 299.
    290 Gender Male Female ANALYSIS OF VARIANCE(ANOVA) Age Race Disability 200, 175,215 150, 125, 135 100,95, 115 185.210.225 130. 145. 115 90.80. 110 [CHAP. 12 I Bias tvue I The Minitab output for the Table 12.12data is given below. MTB > ANOVA 'award' = 'gender' 'biastype' 'gender'*'biastype' Analysis of Variance (Balanced Designs) Factor Type Levels Values gender fixed 2 1 2 biastype fixed 3 1 2 3 Analysis of Variance for award Source DF ss MS F P gender 1 2 2 . 2 2 2 . 2 0.09 0 . 7 7 4 biastype 2 3 3 1 4 4 . 4 1 6 5 7 2 . 2 6 4 . 5 0 0 . 0 0 0 gender*biastype 2 3 4 4 . 4 1 7 2 . 2 0 . 6 7 0 . 5 3 0 Error 1 2 3 0 8 3 . 3 2 5 6 . 9 Total 1 7 3 6 5 9 4 . 4 The p values indicate that there is no significant interaction. There is no significant effect due to gender. However, the mean awards differ according to the type of bias. Solved Problems F DISTRIBUTION 12.1 What does it mean to say that the F distribution curve is asymptotic to the horizontal axis of the rectangular coordinate system? Ans. This means that as we move to the right from the origin along the horizontal axis, the F distribution curve approaches the horizontal axis, but never touches it. F TABLE 12.2 Find the following using the F distribution in Appendix 5. (a) Foi(7, 3) (b) Fod4, 6) Ans. ( a ) 21.67 (6) 4.53 12.3 Find the following using the F distribution in Appendix 5 (4 F99(7,3 (6) F95(4,6) 12.4 Find the critical value of F for the following. (a) df = (4, 8) and area in the right tail equal to .01. (b) df = (6,6) and area in the right tail equal to .05. Ans. (a) 7.01 (h) 4.28
  • 300.
    CHAP. 121 ANALYSISOF VARIANCE (ANOVA) Treatment 1 Treatment 2 4 6 3 8 -2 8 2 10 0 10 -1 12 291 Treatment 3 I 1 6 16 I 1 14 8 LOGIC BEHIND A ONE-WAY ANOVA Treatment 1 -10 10 4 -4 0 6 mean = 1 12.5 Eighteen individuals, diagnosed as having mild high blood pressure, were randomly divided into three groups of six each and were assigned to one of three treatment groups. The treatment 1 group served as a control and made no changes in their diet. The individuals in the treatment 2 group replaced 50% of their diet with fruits and vegetables. The individuals in treatment group 3 replaced 50% of their diet with fruits and vegetables and reduced fat calories to 15% of their daily total. After six months, the change in diastolic blood pressure was measured and the results are given in Table 12.17. The change in blood pressure was determined by subtracting the reading after the six-month study from the initial reading. Treatment 2 Treatment 3 0 0 16 22 8 11 9 5 19 16 2 12 mean = 9 mean = 11 I I I mean = 1 mean =9 mean = I 1 A Minitab dotplot of the data given in Table 12.17 is shown in Fig 12-12. Describe in words what the dotplot suggests concerning the three treatments. trtment 3 Fig. 12-12 Ans. The plot suggests that the means for treatments 2 and 3 are greater than the mean for treatment 1. It is not clear whether the means for treatments 2 and 3 are different. 12.6 Table 12.18 is another set of data obtained for the same experiment described in Problem 12.5. A Minitab dotplot of the data is shown in Fig. 12-13. Describe in words what the dotplot suggests concerning the three treatments.
  • 301.
    292 ANALYSIS OFVARIANCE (ANOVA) [CHAP. 12 A Minitab dotplot of the data is shown in Fig. 12-13. Describe in words what the dotplot suggests concerning the three treatments. Fig. 12-13 Ans. It is not clear whether the treatment means differ or not since the data points overlap for the three treatments. SUM OF SQUARES, MEAN SQUARES, AND DEGREES OF FREEDOM FOR A ONE-WAY ANOVA 1 2 . 7 For the data in Table 12.17, find SST, SSTR, and SSE using both the defining and the shortcut formulas. After finding the sum of squares, find MSTR and MSE. Also give the degrees of freedom for total, treatments, and error. Ans. Definingformulas: SSTR = nl(Xl - F;)2+ n2(X2 - X)’ + n3(Xk - Sr)2= 6(I - 7)’ + 6(9 - 7)’+ 6(I I - 7)2 = 336 SSE = (nl - 1)s: + (nz - 1)s; +(n3- 1)s: = 5 x 2.3662+5 x 2.09g2+ 5 x 3.688’ = 118 SST= SSTR + SSE = 336 + I18 = 454 Shortcut formulas: SSE= SST- SSTR = 454- 3 3 6 4 18 Degreesof freedom:for total = n - I = 17, for treatments= k - 1 = 2, for error = n - k = 15 = 168 SSTR MSTR= - k-1 SSE n-k MSE = -= 7.87 12.8 For the data in Table 12.18,find SST, SSTR, and SSE using both the defining and the shortcut formulas. After finding the sum of squares, find MSTR and MSE. Also give the degrees of freedom for total, treatments, and error.
  • 302.
    CHAP. 121 ANALYSISOF VARIANCE (ANOVA) 293 Ans. Defining formulas: SSTR=nl(S11-Y')2+ I I ~ ( X ~ - X ) ~ + n3(Xk -X)*= 6(1 -7)2+6(9-7)2+6(11-7)2=336 SSE = (nl - 1)s: +(n2 - 1)s; + (n3- 1)s; = 5 x 7,2392+5 x 7.4832+5 x 7.7972= 846 SST = SSTR + SSE = 336 +846 = I182 Shortcutformulas: SSEzSST-SSTR= 1182-336=846 Degrees of freedom: for total = n - 1 = 17,for treatments= k - 1 = 2, for error = n -k = 15 MSTR= - SSTR - - 168 k-1 = 56.4 MSE= - SSE n - k SAMPLING DISTRIBUTION FOR THE ONE-WAYANOVA TEST STATISTIC 12.9 (a) For the experiment described in problem 12.5,what conditions are necessary in order that MSTR F = -have an F distribution with dfl = 2 and df2= 15? MSE (b) What is the computed value of F, F*, for the data in problem 12.5? (c) What is the critical value for testing equal means at a = .05? Would you reject the null hypothesis at this significance level? Ans. 12.10 (a) (b) (4 (a) The populations represented by the three samples are normally distributed. The three popula- (6) The computed value of the test statistic is F*= -= 21.35. (c) F,os(2, 15)= 3.68. Reject the null hypothesisbecause F* > 3.68, tions have equal variances. The null hypothesis is assumed to be true. 168 7.87 For the experiment described in problem 12.6, what conditions are necessary in order that MSTR MSE F=- have an F distribution with df, = 2 and df;!= 15? What is the computed value of F, F*, for the data in problem 12.6? What is the critical value for testing equal means at a = .05? Would you reject the null hypothesis at this significance level?
  • 303.
    294 Source Df Treatment 2 Error15 Total 17 ANALYSIS OF VARIANCE (ANOVA) [CHAP. 12 J ss MS = SSfdf F Statistic 336 168 21.35 118 7.87 d Ans. (a) The populations represented by the three samples are normally distributed. The three populations have equal variances. The null hypothesis is assumed to be true. Source Df ss Treatment 2 336 Error 15 846 Total 17 (6) The computed value of the test statistic is F* = -- - 2.98. (c) F,05(2,15)= 3.68. Do not reject the null hypothesis,since F* does not exceed 3.68. 56.4 MS = SS/df F Statistic 168 2.98 56.4 BUILDING ONE-WAY ANOVA TABLES AND TESTING THE EQUALITY OF MEANS 12.11 Refer to Problems 12.5, 12.7, and 12.9. Use the results in these three problems to build the one-way ANOVA table for this experiment. Ans. The one-way ANOVA table is shown below. The ANOVA table is used to systematically summarize the lengthy computations for the procedure and to give the computed value of the test statistic. By referring to the F distribution table, we see that the null hypothesis of equal means for the three treatments would be rejected at both the .05 and the -01levels. Replacing 50% of their diet with fruits and vegetables appears to reduce the blood pressure for individuals with mild high blood pressure based on the results of this study. ANOVA Table for the Data in Table 12.17 12.12 Refer to problems 12.6, 12.8, and 12.10. Use the results in these three problems to build the one-way ANOVA table for this experiment. Am. The one-way ANOVA table is shown below. The ANOVA table is used to systematically summarize the lengthy computations for the procedure and to give the computed value of the test statistic. By referring to the F distribution table, we see that the null hypothesis of equal means for the three treatments would not be rejected at either the .05 or the .01 levels. Based on the results of this study, we cannot claim that replacing 50% of their diets by fruits and vegetables will reduce the blood pressures for individuals with mild high blood pressure. LOGIC BEHIND A TWO-WAY ANOVA 12.13 A medical study utilized a 3 by 2 factorial design. One factor was the type of surgical procedure and the other factor was the temperature of the patients’ environment during surgery. Some patients were kept warm with blankets and intravenous fluids and others were kept cool. Four patients were available for each surgical procedure-temperature combination. For each patient, the length of hospital stay was recorded and the results of the study are given in Table 12.19.
  • 304.
    CHAP. 121 Temperature Warm Cool ANALYSIS OFVARIANCE (ANOVA) Surgical Procedure . 1 2 3 2,393, 4 5,6,8,5 9,9, 10, 12 45,596 9,10,11, 10 13,14, 14, 15 295 Temperature Warm Cool Column mean Surgical Procedure 1 2 3 Row mean 3 6 10 6.3 5 1 0 14 9.7 4 8 12 The mean for each treatment (Surgical Procedure-Temperature combination), as well as marginal means, is given in Table 12.20. Table 12.20 A Minitab main effects plot is shown in Fig. 12-14 and an interaction plot is shown in Fig. 12-15. Main EffectsPlot - M e a n sfor Days 12 10 Days 8 6 4 / Surgical Procedure Fig. 12-14 Explain in words the interaction plot and the main effects plot shown in Figures 12-14 and 12-15. Ans. The solid line segment in Fig. 12-15 corresponds to patients who were kept warm during surgery and the dashed line corresponds to patients who were kept cool during surgery. The interaction plot indicates that regardless of surgical procedure, the mean length of hospital stay is less when the patient is kept warm during surgery since the solid line is always below the dashed line. This indicates that there is no interaction between temperature and the type of surgical procedure. The main effects plot for temperature shows that the mean length of stay for patients kept warm during surgery is less than the mean length of stay for those kept cool during surgery. The main effects plot for surgical procedure contrasts the mean lengths of stay for the three surgical procedures.
  • 305.
    296 Temperature Warm Cool 14 1 2 3 293,394 9, 10, 11, 10 9,9, 10, 12 4,59596 5,6,8,5 13, 14, 14, 15 9 Days Temperature 1 2 3 Row mean Warm 3 10 10 7.67 Cool 5 6 14 8.33 Column mean 4 8 12 _I L 4 ANALYSIS OF VARIANCE (ANOVA) Interaction Plot - Means for days I I I 1 2 3 Surgical Procedure [CHAP. 12 T W . I ’ 2 1 - - 2 Fig. 12-15 12.14 Suppose the data in the design described in Problem 12.13are as shown in Table 12.21. Table 12.21 I Surgical Procedure I The mean for each treatment (Surgical Procedure-Temperature combination), as well as marginal means, is given in Table 12.22. Table 12.22 I Surgical Procedure I Explain in words the interaction plot and the main effects plot shown in Figs. 12-16 and 12-17.
  • 306.
    CHAP. 121 Days 12 10 8 6 4 ANALYSIS OFVARIANCE (ANOVA) MainEffectsPlot- MeansforDays 297 Fig. 12-16 I I I Interaction Plot -Meansfor days 14 9 Days 4 I 1 / / f / / / / / / ' / I 2 I 3 ten^p ' 1 ' 2 1 -- 2 - Surgical Procedure Fig. 12-17 Ans. The interaction plot in Fig. 12-17 illustrates a totally different situation than the interaction plot in Fig. 12-15. The interaction plot in Fig. 12-17 indicates that keeping the patient warm during surgery shortens the length of hospital stay for patients undergoing surgical procedures 1 and 3, but lengthens the stay for those undergoing procedure 2. The response lines are not parallel and we say there is interaction present. This was not true in Fig. 12-15. When interaction is present, the main effects are more difficult to explain.
  • 307.
    298 ANALYSIS OFVARIANCE (ANOVA) [CHAP. 12 Source Procedure Temperature Interaction Error Total SUM OF SQUARES, MEAN SQUARES, AND DEGREES OF FREEDOM FOR A TWO-WAY ANOVA df ss MS F Statistic 2 SSA MSA F A 1 SSB MSB F B 2 SSAB MSAB FAB 18 SSE MSE 23 SST 12.15 Refer to problem 12.13. By initially considering the two-way design as a one-way design, give the degrees of freedom for total, treatments, and error. Then, break the degrees of freedom for treatments down into degrees of freedom for temperature, surgical procedure, and temperature-surgical procedure interaction. Ans. As a one-way design, the total degrees of freedom is n - 1 = 24 - 1 = 23. The degrees of freedom for treatments is ab - 1 = 2 x 3 - 1 = 5 . The degrees of freedom for error is n - ab = 24 - 6 = 18. The 5 degrees of freedom for treatments is partitioned into a - 1 = 2 - 1 = 1 degree of freedom for temperature, b - 1 = 3 - 1 = 2 degrees of freedom for surgical procedures, and (a - I)(b - 1) = 1 x 2 = 2 degrees of freedom for interaction. 12.16 Sixty plots were used in an agricultural experiment. Four varieties of wheat and five different levels of fertilizer were used in a 4 x 5 factorial experiment. Each variety of wheat and each fertilizer level combination was used on three randomly chosen plots. List the 20 treatments, give the complete breakdown for the total degrees of freedom, and express the total sum of squares as a sum of four separate sums of squares. Ans. One treatment would be the first variety of wheat combined with the first level of fertilizer, represented as (V 1,Fl). The other 19 treatments are: (V1, F2), (V1,F3), (V1,F4), (V1, F5), (V2, Fl), (V2, F2), (V2, F3), (V2, F4), (V2, F5),(V3, Fl), (V3, F2), (V3, F3), W3, F4), (V3, F5), W4, FI), (V4, F2), (V4, F3), (V4, F4), and (V4, F5). The total degrees of freedom is n - 1 = 60 - 1 = 59. Let factor A be variety of wheat and let factor B be fertilizer level. The degrees of freedom for A is a - 1 = 4 - 1 = 3, the degrees of freedom for B is b - 1 = 5 - 1 = 4, the degrees of freedom for interaction is (a - I ) x (b - 1) = 3 x 4 = 12. The degrees of freedom for error is n - ab = 60 - 20 = 40. Note that 59 = 3 +4 + 12 +40. SST = SSA + SSB + SSAB + SSE. BUILDING TWO-WAY ANOVA TABLES 12.17 Give the general structure for the two-way ANOVA table for the data in Table 12.19. Ans. The results are given in Table 12.23. 12.18 Give the general structure for the two-way ANOVA table for the experiment described in problem 12.16. Ans. The results are given in Table 12.24.
  • 308.
    CHAP. 121 Source Dfss MS Variety 3 SSA MSA Fertilizer level 4 SSB MSB Interaction 12 SSAB MSAB Error 40 SSE MSE Total 59 SST ANALYSIS OF VARIANCE (ANOVA) F Statistic F A FB FAB 299 SAMPLING DISTRIBUTIONS FOR THE TWO-WAY ANOVA 12.19 Find the critical values for FA,Fg, and FAB in problem 12.17.Use a = .05. Ans. Factor A is the surgical procedure and factor B is the temperature of the patients' environment. The test statistic F A has an F distribution with dfi = 2 and df2 = 18. The critical value for FA is 3.55. The test statistic F B has an F distribution with dfl = 1 and df2= 18. The critical value for FB is 4.41. The test statistic F A B has an F distribution with df, = 2 and df2 = 18 and the critical value for FAB is 3.55. 12.20 Find the critical values for FA,Fg, and FAB in problem 12.18. Use a= .01. Ans. Factor A is the variety of wheat and factor B is the level of fertilizer. The test statistic F A has an F distribution with dfl = 3 and df2= 40. The critical value for F A is 4.31. The test statistic FBhas an F distribution with dfl = 4 and dfz = 40. The critical value for F B is 3.83. The test statistic FAB has an F distribution with dfl = 12 and dfi = 40 and the critical value for FAB is 2.66 TESTING HYPOTHESIS CONCERNING MAIN EFFECTS AND INTERACTION 12.21 The Minitab output for the data given in Table 12.19 is shown below. After reviewing Problems 12.13, 12.15, 12.17, and 12.19, as well as the Minitab output, test for main effects as well as interaction at a = .05. Analysis of Variance for days Source DF ss MS F P Temp 1 66.667 66.667 6 0 . 0 0 0.000 Surgproc 2 256.000 128.000 115.20 0.000 Temp*Surgproc 2 5.333 2.667 2.40 0.119 Error 1 8 2 0 . 0 0 0 1.111 Total 23 348.000 Ans. The Minitab output confirms the results discussed in problems 12.13, 12.15, 12.17, and 12.19.The p value for interaction, 0.119, is greater than a and interaction is not significant. This makes the interpretation of main effects easier. The p values for temperature and surgical procedure are both 0.000.The mean length of hospital stay in days differs for the three surgical procedures and for the two operation temperatures at which the patients are kept. 12.22 The Minitab output for the data given in Table 12.21 is shown below. After reviewing Problem 12.14, as well as the Minitab output, test for main effects as well as interaction at a = .05.
  • 309.
    300 ANALYSIS OFVARIANCE (ANOVA) [CHAP. 12 Analysis of Variance for days Source DF ss MS F P Temp 1 2.667 2.667 2.40 0.139 Surgproc 2 256.000 128.000 115.20 0.000 Temp*Surgproc 2 69.333 34.667 31.20 0 . 0 0 0 Error 1 8 20.000 1.111 Total 23 348.000 Ans. Since the interaction is significant, i. e., the p value = 0.0oO < a,we explain the nature of the interaction rather than make broad generalizations about the main effects. In Fig. 12-17, the solid line segment corresponds to the patients who were kept warm during surgery and the line segment made up of dashes corresponds to patients who were kept cool during surgery. The interaction plot suggests that the hospital stay is shorter if the patient is kept warm during surgery for procedures 1 and 3. However, it appears that for procedure 2, keeping the patient warm during surgery increases the hospital stay. Supplementary Problems F DISTRIBUTION 12.23 Fill in the following blanks with the appropriate distribution. Choose from the words standard normal, student t, Chi-square, and F. (4The - - distribution is symmetricalabout zero and has standard deviation equal to one. (6) The (c) The (6)The shapeofthe distribution is skewed to the right. The shape of the distribution curve is distribution is symmetrical about zero. The shape of the distribution curve is determined by the number of degrees of freedom. determined by the number of degrees of freedom. distribution is determined by two separate degrees of freedom. Ans. (a)standard normal (6)Chi-square (c) student t (d)F F TABLE 12.24 Find the following using the F distribution in Appendix 5. (a)F.Ol(2, 8) (6)F.05(2,8) Ans. (a) 8.65 (6)4.46 12.25 Find the following using the F distribution in Appendix 5. (a)F.W@,2) (6)F.95t8,2) Ans. (a)0.1156 (6)0.2242 12.26 The random variable Frabo has an F distribution with dfl = 5 and df2= 5. Find two values a and b such that P(a < Fmbo < b) = .90. Ans. a = 0.1980 and b = 5.05
  • 310.
    CHAP. 121 ANALYSISOF VARIANCE (ANOVA) IRS 40 45 5s 50 45 45 50 so 40 60 301 Garbage collection Long distance service 55 60 60 65 65 70 so 70 55 70 55 7s 60 7s 70 70 65 80 60 80 LOGIC BEHIND A ONE-WAY ANOVA 12.27 Thirty individuals were randomly divided into 3 groups of 10 each. Each member of one group completed a questionnaire concerning the Internal Revenue Service (IRS). The score on the questionnaire is called the Customer Satisfaction Index (CSI). The higher the score, the greater the satisfaction. A second set of CSI scores were obtained for garbage collection from the second group, and a third set of scores were obtained for long distance telephone service, The scores are given in Table 12.25. A Minitab boxplot for the data in Table 12.25 is shown in Fig. 12-18. Describe what the boxplot suggests concerning the mean CSI scores for the IRS, garbage collection, and long distance phone service. mean = 48.0 I mean = 59.5 I mean = 71.5 - - - - - - - * ------- I------ Long Distance Service ------- Fig. 12-18 Ans. Even though the whiskers overlap for the three different groups, the boxes that contain the middle 50%of the data, do not overlap for the three groups. Note that the plus signs are above the median values and are fairly spread apart. The data suggest that the mean CSI scores for the three populations are probably different. An analysis of variance may be performed to test the null hypothesis of equal population means. 12.28 Suppose the survey described in problem 12.27resulted in the data given in Table 12.26. Describe what the boxplot in Fig. 12-19 suggests concerning the mean CSI scores for the IRS, garbage collection, and long distance phone service.
  • 311.
    302 ANALYSIS OFVARIANCE (ANOVA) IRS 15 45 30 70 45 45 30 70 40 90 mean = 48 Table 12.26 Garbage collection 20 50 65 50 25 85 60 90 65 85 mean = 59.5 ~~ Long distance service 35 65 70 80 60 75 75 80 80 95 mean = 7 1.5 [CHAP. 12 * Fig. 12-19 Ans. The boxplots for the three sets of CSI scores shown in Fig. 12-19 exhibit a considerable amount of overlap. This is the type of results we might expect if there is no difference in the three population means. The boxplots indicate that the assumption that the three populations have normal distributions with equal variances should be checked. SUM OF SQUARES,MEAN SQUARES,AND DEGREESOF FREEDOM FOR A ONE-WAY ANOVA 12.29 For the data in Table 12.25, find SST, SSTR, and SSE using both the defining and the shortcut formulas. After finding the sum of squares, find MSTR and MSE. Also give the degrees of freedom for total, treatments, and error. Am. SSTR = 2761.7 SSE = 1035.0 SST = 3796.7 MSTR = 1380.8 MSE = 38.3 Degrees of freedom: for total = 29, for treatments= 2, and for error = 27. 12.30 For the data in Table 12.26, find SST, SSTR, and SSE using both the defining and the shortcut formulas. After finding the sum of squares, find MSTR and MSE. Also give the degrees of freedom for total, treatments, and error. Ans. SSTR = 2761.7 SSE =12085 SST = 14847 MSTR = 1380.8 MSE = 448 Degrees of freedom: for total = 29, for treatments = 2, and for error = 27.
  • 312.
    CHAP. 121 ANALYSISOF VARIANCE (ANOVA) South 15.5 13.0 19.5 18.8 17.5 19.0 16.6 14.9 mean = 16.85 303 North West 17.8 15.5 20.0 17.0 21.3 22.0 24.3 18.4 18.8 19.9 19.0 21.5 19.5 20.5 22.0 23.4 mean = 20.66 mean= 19.05 SAMPLINGDISTRIBUTIONFOR THE ONE-WAYANOVA TESTSTATISTIC 12.31 Refer to problems 12.27 and 12.29. If the populations of CSI responses concerning the IRS, garbage collection, and long distance phone service are normally distributed, have equal variances and equal MSTR MSE means, what is the distribution of F = - ? What is the a = .05 critical value for testing the null hypothesis that the population means are equal? Give the computed value of the test statistic and your conclusion. MSTR MSE Ans. The test statistic F = -has an F distribution with dfl = 2 and dfi = 27. The critical value, F,o5(2, 27), is between 3.39 and 3.32. The computed value of the test statistic is F* = 36.05. The null hypothesis is rejected. 12.32 Refer to problems 12.28 and 12.30. If the populations of CSI responses concerning the IRS, garbage collection, and long distance phone service are normally distributed, have equal variances and equal means, what is the distribution of F = - ? What is the a = .05 critical value for testing the null hypothesis that the population means are equal? Give the computed value of the test statistic and your conclusion. MSTR MSE MSTR MSE Ans. The test statistic F = -has an F distribution with dfl = 2 and df2 = 27. The critical value, F.o5(2,27), is between 3.39 and 3.32. The computed value of the test statistic is F* = 3.08. The null hypothesis is not rejected. BUILDING ONE-WAY ANOVA TABLESAND TESTING THE EQUALITY OF MEANS 12.33 Table 12.27 gives the cost in thousands of dollars for randomly selected weddings with approximately 100 guests for three geographical regions in the U.S. Build a one-way ANOVA and test the null hypothesis that the mean costs for such weddings do not differ for the three regions. Test at a 5% level of significance. Ans. The Minitab output for the above data is as follows: Analysis of Variance Source DF ss MS F P Factor 2 64.55 3 2 . 2 7 6 . 2 6 0.007 Error 2 1 1 0 8 . 2 0 5.15 Total 23 172.75
  • 313.
    304 I Urban ANALYSIS OFVARIANCE (ANOVA) PhysicianAssistant Nurse Practitioner 4 5 51,52,48,54 42,44,47,49,50 [CHAP. 12 The p value is less than the preset alpha. The mean costs differ for the three regions. Such weddings appear to cost less in the South than in the North or West on the average. 12.34 A sociological study compared the time spent watching TV per week for children ages 2 to 11 for the years 1994, 1995, 1996, and 1997. The times in hours per week are shown in Table 12.28. Is there a difference in the means for the four years? Test at a 5% level of significance. 1994 22 24 25 18 19 25 17 24 15 25 25 25 25 16 20 20 17 22 25 16 mean = 21.3 Table 1995 21 19 25 20 18 15 19 21 19 20 16 25 22 24 20 20 18 18 18 20 mean= 19.9 12.28 1996 20 24 21 18 15 19 24 24 18 17 23 23 20 16 20 16 15 19 24 18 mean = 19.7 ~ 1997 25 20 18 20 15 21 20 15 23 21 20 15 19 25 25 23 16 22 21 15 mean = 20.0 AM. The Minitaboutputfor the one-way ANOVA is shown below. Analysis of Variance Source DF ss M S F P F a c t o r 3 30.1 1 0 . 0 E r r o r 76 802.7 10.6 Total 79 832.8 0.95 0.421 The means seem to have decreased since 1994. However, the p value = .421 indicates that the means are not different. LOCIC BEHINDA TWO-WAY ANOVA 12.35 Table 12.29 gives the salaries in thousands of dollars for 20 individuals. Half are Nurse Practitioners and half are Physician Assistants. In addition, they are classified as practicing in a rural or an urban setting. Table 12.29 I RWal [ 46,50,50,50,52 I 45,48,49,47,48 I The interaction plots for the data given in Table 12.29 are given in Figs. 12-20and 12-21.Explain the plots in words.
  • 314.
    CHAP. 121 49- Mean 48- 47 - 5 0 4 9 Mean 48 47 1 --2 - ----. & 4 - - & c # - c - /I--- ) - - / - - I I I 1 ANALYSIS OF VARIANCE (ANOVA) InteractionPlot - Meansfor salary l.kbdRl . 1 8 2 - 1 -- 2 I 2 305 PNNP Fig. 12-20 Interaction Plot - Means for salary P A N 1 Ubatml Fig. 12-21 Ans. The solid line in Fig. 12-20 corresponds to urban health care workers. The solid line shows that urban Physician Assistants in the study had a greater mean salary than urban Nurse Practitioners. The dashed line correspondsto rural health care workers. This line shows that the mean salary for rural Physician Assistants in the study was also greater than the mean for Nurse Practitioners. The solid line in Fig. 12-21 corresponds to Physician Assistants and the dashed line corresponds to Nurse Practitioners. This Figure shows that the mean for urban Physician Assistants exceeds the mean for rural Physician Assistants. The dashed line shows the opposite to be true for Nurse Practitioners.
  • 315.
    306 ANALYSIS OFVARIANCE (ANOVA) [CHAP. 12 Low Moisture High Moisture 1 2 . 3 6 Table 12.30 gives the results for a 2 x 2 factorial design. The yield in kilograms of okra per plot is given for 28 different plots. Each Fertilizer-Moisture combination was applied to 7 different plots and the total yield was recorded for each plot. The interaction plot for the data in Table 12.30 is shown in Fig. 12-22.Explain, in words, the nature of the interaction. Low Fertilizer High Fertilizer 8, 8, 8,9, 10,6, 10 1, 1,292, 2,393 2,3,4,5,6,4,4 9,9, 11, 8, 8,9, 8 Table 12.30 Interaction Plot = Means for yield Moisture . 1 ' 2 1 -- 2 - I 1 I Fertilizer 2 Fig. 12-22 Ans. The solid line corresponds to the low level of moisture. This line indicates that at the low level of moisture, the mean yield is increased when the level of fertilizer is increased. The dashed line corresponds to the high level of moisture. This line indicates that at the high level of moisture, the yield is decreased when the level of fertilizer is increased. Whether or not this interaction is significant is determined by performing an analysis of variance. SUM OF SQUARES,MEAN SQUARES,AND DEGREES OF FREEDOM FOR A TWO-WAYANOVA 12.37 12.38 In problem 12.35, let factor A be the type of health professional and let the two levels of factor A be Physician Assistant and Nurse Practitioner. Let factor B be the population setting and let the two levels of factor B be urban and rural. Give the degrees of freedom for the following sources of variation: total, A, B, AB, and error. Ans. The degrees of freedom for the sources are: total df = 19,A df = 1, B df = 1,AB df = 1,error df = 16. In problem 12.36, let factor A be fertilizer level and let the two levels of factor A be low and high. Let factor B be the moisture level and let the two levels of factor B be low and high. Give the degrees of freedom for the following sources of variation: total, A, B, AB, and error.
  • 316.
    CHAP. 121 ANALYSISOF VARIANCE (ANOVA) 307 Source df A 1 B 1 AB 1 Error 16 Total 19 Ans. The degrees of freedom for the sources are: total df = 27, A df = 1, B df = 1, AB df = 1,error df = 24. ss MS F Statistic SSA MSA FA SSB MSB F B SSAB MSAB F A B SSE MSE SST BUILDINGTWO-WAY ANOVA TABLES Source Df A 1 B 1 AB 1 Error 24 Tota1 27 12.39 12.40 ss MS F Statistic SSA MSA F A SSB MSB FB SSAB MSAB FAB SSE MSE SST Give the general structure for the two-way ANOVA table for the data in Table 12.29. Name the factors and levels as given in problem 12.37. Ans. The results are given in Table 12.31. SAMPLINGDISTRIBUTIONSFOR THETWO-WAY ANOVA 12.41 Give the critical values for the test statistics FA, FB, and FAB in problem 12.39. Use a = .05. Ans. The critical value for all three is 4.49. 12.42 Give the critical values for the test statistics FA, FB,and FABin problem 12.40.Use a = .05. Ans. The critical value for all three is 4.26. TESTING HYPOTHESIS CONCERNINGMAIN EFFECTS AND INTERACTION 12.43 The Minitab output for the data given in Table 12.29is shown below. After reviewing problems 12.35, 12.37, 12.39, and 12.41, as well as the below Minitab output, test for main effects as well as interaction at a = .05. Analysis of Variance for salary Source DF ss MS F P Urban/ru 1 0.450 0.450 0 . 0 6 0.812 PA/NP 1 42.050 42.050 5.44 0.033 Urban/ru*PA/NP 1 2 . 4 5 0 2 . 4 5 0 0 . 3 2 0 . 5 8 1 Error 1 6 1 2 3 . 6 0 0 7.725 Total 1 9 1 6 8 . 5 5 0
  • 317.
    308 ANALYSIS OFVARIANCE (ANOVA) [CHAP. 12 Ans. The p values indicate that at the a = .05 level of significance, the interaction is not significant. There is no significant difference in the mean salaries between urban and rural settings. However, the mean salary for Physician Assistants exceeds the mean salary for Nurse Practitioners. 12.44 The Minitab output for the data given in Table 12.30is shown below. After reviewing Problems 12.36, 12.38, 12.40, and 12.42, as well as the below Minitab output, test for main effects as well as interaction at a = .05. Analysis of Variance for yield Source DF ss MS F P Moisture 1 4 . 3 2 1 4 . 3 2 1 3 . 1 8 0 . 0 8 7 Fertlzer 1 1 0 - 3 2 1 1 0 . 3 2 1 7 . 6 1 0 . 0 1 1 Moisture*Fertlzer 1 2 2 2 . 8 9 3 2 2 2 . 8 9 3 1 6 4 . 2 4 0 . 0 0 0 Error 2 4 3 2 . 5 7 1 1 . 3 5 7 Total 2 7 2 7 0 . 1 0 7 Am. Because of the significant moisture-fertilizer interaction, the main effects must be interpreted in the presence of this interaction.
  • 318.
    Chapter 13 X 2 4 8 10 16 20 24 30 Regression andCorrelation Y 12 13 15 16 19 21 23 26 STRAIGHTLINES 2 5 - 20- W c 4 f 15 - 10 --3 Suppose eight individuals, employed at large companies were interviewed, and the number of years of service, x, and the number of annual paid days off, y, were determined for each. The results are shown in Table 13.1. 0 8 8 8 I 1 1 Table 13.1 A scatterplot for these data is shown in Fig. 13-1. 8 8 Fig. 13-1 The data plotted in Fig. 13-1have a very special property. The points fall perfectly on a straight line. The equation of the line is y = 11.O +0 .5 ~ . The number 11.O is called the y intercept.This is the value of y when x = 0. The number 0.5 is called the slope of the line. The equation may also be expressed as y = 0 . 5 ~ + 11.O. When straight lines are studied in algebra, the slope-intercept form of a line is given as y = mx + b. When lines are discussed in statistics, the equation is often written as
  • 319.
    310 REGRESSION ANDCORRELATION [CHAP. 13 y = PO+plx. In this form, POis the y intercept and +0 . 5 ~ and the points from Table 13.1on the same graph. is the slope.Figure 13-2shows the line y = 11.O 25 10 Fig. 13-2 If the relationship between years of service and annual paid days off for all employees of large companies were described by the equation y = 11.O + OSx, then we could predict perfectly the number of annual paid days off if we knew the number of years of service. Assuming the equation described the relationship perfectly, a person with x = 14 years of service would receive y = 11 + .5 x 14 = 18annual paid days off. EXAMPLE 13.1 The line y = 1.7- 4 . 5 ~ has y intercept 1.7and slope -4.5. When x = 2, the corresponding y value is y = 1.7 - 4.5(2) = -7.3. That is, the point (2, -7.3) falls on the line whose equation is y = 1.7 - 4.5~. LINEAR REGRESSION MODEL Table 13.2 contains a more realistic data set than that shown in Table 13.1. In Table 13.2, x represents the number of years of service, and y represents the annual paid days off. These data were obtained from 20 individuals at a large company. Figure 13-3 is a plot of the data. These points clearly do not fall along a straight line. However, the data do indicate a linear trend. That is, a line could be fit to the points such that the variation of the points about the line is small. A line is shown which provides a good fit to the data. A linear regression model assumes that some dependent or response variable, represented by y, is related to an independent variable, represented by x, by the relationship shown in formula (13.I). y = p o + p * x + e (13.I) The error term, e, is a normally distributed random variable with mean equal to 0 and standard deviation equal to 0. Each value of x determines a population of y values.
  • 320.
    CHAP. 131 X 16 20 20 20 24 24 24 30 30 30 REGRESSION ANDCORRELATION Y 20 20 22 21 23 24 22 27 25 26 311 L Basic Properties of the Linear Regression Model 1. The regression model is y = P O + plx + e, where P O is the y intercept of the line given by PO+ plx, PI is the slope of the line given by P O + Plx, and e is the error or deviation of the actual y value from the line given by PO+ plx. 2. The error term e is a random variable with a mean of 0, i.e., E(e) = 0. 3. The expected value of y is E(y) = PO+ plx. 4. The variance of y equals dand is the same for all values of x. 5. The values of e are independent. 6. The error term e is normally distributed. Tab1 I I X Y 2 10 2 14 2 12 4 11 4 15 8 14 8 16 10 14 10 18 16 18 I X I Y 10 14 12 11 15 14 16 14 18 18 25 20 Days off 15 10 I 0 0 0 0 I 10 I 2 0 1 30 Years Fig. 13-3 The linear regression model, given in formula (13.1), describes the relationship between y and x for some population. The intercept, PO,and the slope, PI, are unknown parameters and must be estimated from sample data. The next section discusses the technique used to estimate P O and P I . Table 13.3contains a summary of the basic properties of the linear regression model.
  • 321.
    312 REGRESSION ANDCORRELATION [CHAP. 13 EXAMPLE 13.2 Consider the linear regression model y = 2.5 + 3x + e, where the normal random variable e has mean equal to 0 and standard deviation equal to 1.2. The mean response for y when x = 2 is E(y) = 2.5 + 3(2) = 8.5 and the mean response for y when x = 4 is E(y) = 2.5 + 3(4) = 14.5. In this example we are assuming that we know the values for POand PI.However, this is not usually the case. We usually have to estimate these two parameters from sample data taken from the population. LEAST SQUARES LINE The line shown in Fig. 13-3is called the least-squares line. The least-squares line is determined such that the sum of the squares of the deviations of the data points from the line is a minimum. For point (xi, yi), the deviation from the fitted line, whose equation is represented as 9 = bo + blx, is given by di = yi - (bo + blxi). The estimates bo and bl are determined so that the sum of squares of deviations about the line is minimized. The sum of squares of deviations is given by (13.2) The estimated y intercept, bo, is also represented by the symbols a and a. and the estimated slope is represented by b and b,.We shall use bo and bl as the symbols for the estimates of POand .Using calculus, it may be shown that the estimate for PI is given by (13.3) where S,, is given by sx,= zx2-- (Cx)’ (13.4) n and Sx, is given by (13.5) The estimate for POis given by EXAMPLE 13.3 The data from Table 13.2 is reproduced below. Recall that x represents the number of years of service and y represents the annual paid days off. r 2 4 4 8 8 10 10 16 Y 10 14 12 11 15 14 16 14 18 18 X 16 20 20 20 24 24 24 30 30 30 Y 20 20 22 21 23 24 22 27 25 26
  • 322.
    CHAP. 131 REGRESSIONAND CORRELATION 313 For these data, Cxy = 2 x 10+2 x 14 +2 x 12 + - - +30 x 27 +30 x 25 +30 x 26 = 6600. Xx=2+2+2+...+30+30+30=304,andX= % = 15.2. Cy = 10+ 14 + 12 + - . - +27 +25 +26 = 372, and 7 = % = 18.6. Cx2=4+4 + 4 + * * * +900 +900 +900 = 6512, Zy2= 100+ 196 + 144+ * * * +729 + 625 +676 = 7426. S,, =Ex2- @ = 6512- 4620.8 = 1891.2, S,, = Zxy - (Xx)('y) = 6600 - 5654.4 =945.6. n n Sxy 945.6 - b l = - = - =0.5andbo= y - b l % =18.6-0.5x 15.2=11.0. sx, 1891.2 The equation of the line is 9 = bo + blx = 11.0+ 0 . 5 ~ .The fitted line is referred to by several different names. The fitted line is called the line o f best fit, the least-squares line, the estimated regression line, and theprediction line. The computations for bo and bl are rarely performed by hand in practice. Computer software is normally used to find the equation. This is illustrated in Example 13.4. EXAMPLE 13.4 The Minitab procedure for computing the equation of the regression line is shown in Fig. 13-4. The command Regress 'Daysow 1 'Years' requires that the column containing the dependent variable values be named after the word Regress, followed by the number of independent variables, followed by the column containing the independent variable values. The regression equation is printed out as the first line of output in Fig. 13-4. The remaining output will be discussed in later sections. MTB > Regress 'Daysoff' 1 'Years'; SUBC> Constant. Regression Analysis The regression equation is Daysoff = 1 1 . 0 + 0.500 Years Predictor Coef St Dev T P Constant 1 1 . 0 0 0 0 0.5703 1 9 . 2 9 0.000 Years 0 . 5 0 0 0 0.0316 1 5 . 8 2 0.000 s = 1.374 R-Sq = 93.3% R-Sq(adj) = 9 2 . 9 % Analysis of Variance Source DF ss MS F P Error 1 8 34.00 1 . 8 9 Total 1 9 506.80 Regression 1 472.00 472.00 2 5 0 . 3 1 0.000 Fig. 13-4 The amount of computation saved by this procedure is hard to appreciate. Before the availability of computer software, these computations were done by use of a hand held calculator . ERROR SUM OF SQUARES The error sum o f squares, denoted by SSE, is given by formula (13.7): SSE =Z(yi - Yi)*
  • 323.
    314 REGRESSION ANDCORRELATION [CHAP. 13 The differences, (yi - yi), are called residuals. A residual measures the deviation of an observed data point from the estimated regression line. If the estimated regression line fits the data points perfectly, as is the case for the data in Table 13.1,then SSE = 0. The more the variability of the data points away from the line, the larger the value for SSE. The computation of SSE is illustrated in Example 13.5. EXAMPLE 13.5 The data in Table 13.2 gives 20 observations for the number of years of service, x, and the number of annual paid days off, y. The equation of the estimated regression line is found to be 9 = bo + blx = 11.0+ 0 . 5 ~ in Example 13.3. The computation of the residuals and SSE is shown in Table 13.4. Note that the sum of the residuals, Z(yi - ii), is equal to zero. The error sum of squares, SSE, is equal to 34. The error sum of squares is shown in Fig. 13-4 as part of the Minitab output. It is shown under the Analysis of Variance portion of the output and is located at the intersection of the Error row and the SS column. Employee 1 2 3 4 5 6 7 8 9 10 I 1 12 13 14 15 16 17 18 19 20 Xi 2 2 2 4 4 8 8 10 10 16 16 20 20 20 24 24 24 30 30 30 Ta Yi 10 14 12 11 15 14 16 14 18 18 20 20 22 21 23 24 22 27 25 26 le 13.4 9 i = 1 1 +0.5xi 12 12 12 13 13 15 15 16 16 19 19 21 21 21 23 23 23 26 26 26 -2 2 0 -2 2 -I 1 -2 2 - 1 I -1 1 0 0 1 -1 1 -I 0 C(y1 -f i ) = 0 4 4 0 4 4 1 1 4 4 1 1 1 1 0 0 1 1 1 1 0 C(y, - iJ2 = 34 A convenient formula for computing SSE that is equivalent to formula (13.7) is given in formula (13.8).The computation of S,, is the same as that for S , , using the y values instead of the x values. s:, SSE = sYy-- sxx (13.8) EXAMPLE 13.6 To illustrate the computation of SSE in Example 13.5using the computation formula, recall that in Example 13.3,we found that S,, = 1891.2and S,, = 945.6. The computation of S,, is now illustrated: 945.6* =34 S,, = Cy2- oz= 7426 - - 3722 = 506.8 and SSE = sYy-- sty= 506.8 - - n 20 s x x 1891.2
  • 324.
    CHAP. 131 REGRESSIONAND CORRELATION 315 STANDARDDEVIATION OF ERRORS The standard deviation of the error term in the model y = +plx +e is represented by <T and the variance of the error term is 0'. The variance 0' is estimated by S2,where S2 is given by formula (13.9): SSE s=- - - n-2 The square root of S2is called the standard deviation o f errors, and is given by formula (13.10): (13.9) s =J""" n-2 (13.10) EXAMPLE 13.7 The standard deviation of errors for the data given in Table 13.2 is found by recalling that n = 20 and in Example 13.6, we found that SSE = 34, The standard deviation of errors is: The value for S is shown in the Minitab output given in Fig. 13-4. TOTAL SUM OF SQUARES Suppose we were requested to estimate the mean number of annual paid days off given by large companies, but that we did not know the years of service, x, for each sampled employee. Our best estimate would be the mean of the 20 values for y given in Table 13.2.The mean of these 20 values is y = 18.6days per year. The accuracy of the estimate is related to the variation of the individual y values about the mean. The sum of squares about the mean is called the total sum o f squares, and is given by - (13.I I ) EXAMPLE 13.8 By referring to Example 13.6, it is seen that the total sum of squares is equal to Syy. Therefore, from Example 13.6, we see that SST = 506.8. The total sum of squares is given in the Minitab output shown in Fig. 13-4. REGRESSIONSUM OF SQUARES When the regression line is not used in estimating the mean annual paid days off, the total sum of squares measures the variation about the mean, 18.6,as discussed in the previous section. When the regression line is used, there is still some unexplained variation about the regression line. This unexplained variation is given by SSE. This implies that the regression line explains an amount of variation equal to SST - SSE. This explained variation is called the regression sum of squares. The regression sum of squares is given by formula (I3.12): SSR = SST- SSE (13.12) EXAMPLE 13.9 When the mean, 7 = 18.6days per year, is used to estimate the mean number of days off per year, the variation of the y values about this mean is SST = 506.8.When the number of years of service is taken
  • 325.
    316 REGRESSION ANDCORRELATION [CHAP. 13 into account, the estimated regression line was found to be f = 11.0+ 0 . 5 ~ . There is variation about this line that is not explained by the estimated regression line. This unexplained variation is given by SSE = 34. This implies that the regression line explained SST - SSE = 506.8 - 34 = 472.8 or 93.3% of the variation of the values about the mean. The regression sum of squares is SSR = 472.8. The regression sum of squares is directly computable by the formula SSR = E(ii -7)’ (13.13) EXAMPLE 13.10 Table 13.5 illustrates the computation of the regression sum of squares using formula (13.13).Note that SSR = 472.80. In Examples 13.5 and 13.8, we found that SSE = 34.0 and SST = 506.8. We see that SST = SSR + SSE. These three sum of squares is shown under the SS column of the analysis of variance portion of Fig. 13-4. Employee I 2 3 4 5 6 7 8 9 10 I 1 12 13 14 15 16 17 18 19 20 Xi 2 2 2 4 4 8 8 10 10 16 16 20 20 20 24 24 24 30 30 30 Tab Yi 10 14 12 I 1 15 14 16 14 18 18 20 20 22 21 23 24 22 27 25 26 COEFFICIENT OF DETERMINATION The coeficient o f determination is defined by 12 12 12 13 13 15 15 16 16 19 19 21 21 21 23 23 23 26 26 26 SSR r2=- SST -6.6 -6.6 -6.6 -5.6 -5.6 -3.6 -3.6 -2.6 -2.6 0.4 0.4 2.4 2.4 2.4 4.4 4.4 4.4 7.4 7.4 7.4 sum = 0 (9,-9’ 43.56 43.56 43.56 31.36 3I .36 12.96 12.96 6.76 6.76 0.16 0.16 5.76 5.76 5.76 19.36 19.36 19.36 54.76 54.76 54.76 sum = 472.80 (13.14) When the data points fall perfectly on a straight line as in Fig. 13-1, SSE = 0 and therefore SST = SSR. In this case, the coefficient of determination is equal to 1. When the estimated regression line explains none of the variation in y, SSR = 0, and the coefficient of determination is equal to 0. When the coefficient of determination is expressed as a percentage, it can be thought of as the percentage of the total sum of squares that can be explained using the estimated regression equation. EXAMPLE 13.11 For the data in Table 13.2 and discussed in the previous examples, the coefficient of x 100%= - 472*8 x 100=93.3%. 2 SSR determination is r = - SST 506.8
  • 326.
    CHAP. 131 REGRESSIONAND CORRELATION The coefficient of determination may be determined by the use of formula (13.15): s:y r2= - s x x s y y 317 (13.15) EXAMPLE 13.12 For the data in Table 13.2, it was found in Example 13.3that Sx, = 1891.2 and S,, = 945.6. Also in Example 13.6 we found that Sy,= 506.8. Therefore, the coefficient of determination is found as follows: r2= -- s& - 945*62 = .933or 93.3% sXx syy 1891.2x 506.8 The value for r2is also given in Fig. 13-4. MEAN, STANDARD DEVIATION, AND SAMPLING DISTRIBUTIONOF THE SLOPE OF THE ESTIMATEDREGRESSIONEQUATION The slope of the estimated regression line, bl, is a statistic and has a sampling distribution. That is, if several different surveys concerning the number of years of service, x, and the number of annual paid days off, y, were conducted, and the estimated regression line found in each case, the values for the slope and the y intercept would not all be equal. The estimated slope, bl, has a sampling distribution which is normally distributed. The mean of bl is PI and the standard deviation of bl is given in the following: (13.16) Since 0 is unknown, the standard deviation of errors is substituted for 0 to obtain the standard error for bl. The standard error for b1 is given in formula (13.17): (13.17) S - s b l - J s , , EXAMPLE 13.13 For the data given in Table 13.2, the value for S,, was found to equal 1891.2 in Example 13.3 and in Example 13.7, the value for S was found to equal 1.374. The standard error for bl is therefore equal = 0.0316. This value is given in the computer output in Fig. 13-4 at the intersection S 1.374 to Sb,= - J s X x = 4 m of the Years row and the St Dev column. The standard error for bl is needed to set confidence intervals on and test hypothesis concerning the slope of the population regression line. INFERENCES CONCERNINGTHE SLOPEOF THE POPULATION REGRESSION LINE A (I - a)x 100% confidence interval for is obtained by using the distribution theory given in the previous section. The confidence interval is given in formula (13.18), where ta/2 is obtained from the Student t distribution using n - 2 degrees of freedom.
  • 327.
    318 REGRESSION ANDCORRELATION [CHAP. 13 . Steps for Testing the Value of the Slope of the Population Regression Line Step 1: The null hypothesis is H o :p1 = plo and the alternative hypothesis is either lower, upper, or two-tailed. Step 2: Use the Student t distribution table and the level of significance a to determine the rejection regions. The degrees of freedom is n - 2. EXAMPLE 13.14 Consider the continuing Example concerning the number of annual paid days off as a function of the number of years of service. Suppose we wish to determine a 95% confidence interval for pl.The values for bl and s b , were obtained in Examples 13.3 and 13.13.They may also be obtained from the computer printout in Fig. 13-4.The values are b, = 0.5 and Sb,= 0.0316.The value for t . 0 2 ~is obtained from the Student t table using df = 20 - 2 = 18.The value is t.025 = 2.101. The 95% margin of error is k tansbl = k 2.101 x 0.0316 =f0.066.The confidence interval extends from 0.5 - 0.066=0.434 to 0.5 +0.066= 0.566. The steps for testing an hypothesis about the value of the slope of the population regression line are given in Table 13.6. b, -PI0 Step 3: The test statistic is computed as t* = - . 'bl I I Step 4: State your conclusions. The null hypothesis is rejected if the computed value of Ithe test statistic falls in the rejection region. Otherwise, the null hypothesis is not rejected. EXAMPLE 13.15 One of the most often tested hypothesis concerning Pr is that it equals zero. This is equivalent to testing the hypothesis that x does not determine y and that there is no significant relationship between x and y. Suppose we wish to test Ho: PI= 0 vs. Ha:PI# 0 at significance level a = .05 using the data in Table 13.2. Recall from Example 13.3 that bl = 0.5 and from Example 13.13 that sbl = 0.0316. The value for pl0in Table 13.6 is 0. The critical values are determined by finding the t value with .025 area in the right tail and df = 18.The values are f2.101. The computed value of the test statistic is found as follows: Since this value falls far beyond the value 2.101, the null hypothesis is rejected. Note that the value 15.82 is found in the printout in Fig. 13-4 as the t value corresponding to the predictor years and the corresponding p value is given as O.OO0. Most researchers would use the information given in the printout to test this hypothesis. ESTIMATION AND PREDICTION IN LINEAR REGRESSION Suppose we are interested in estimating the mean number of annual paid days off for all employees of large companies who have 20 years of service at such companies. Recalling basic property 3 in Table 13.3,we have E(y) = PO+ plx. The mean number of annual paid days off would be equal to P O + 2OP1. Since we do not know POand p1 ,we would use b o and bl to estimate P O and P1.Thepoint estimate for the mean would be b o + 20bI. The standard error of this estimate is needed to set a confidence interval on the mean or to test an hypothesis about this mean. A general expression for a confidence interval when predicting the mean response will now be given. A 100(1 - a)% confidence interval for the mean response E(y) = PO+ Plxo is given by formula (13.19):
  • 328.
    CHAP. 131 REGRESSIONAND CORRELATION 319 EXAMPLE 13.16 The data in Table 13.2 will be used to obtain a 95% confidence interval for the mean number of annual paid days off for employees having 20 years of service. In Example 13.3, the estimated regression equation was found to be 9 = bo + b,x = 11.0 + 0 . 5 ~ . The point estimate of the mean is 9 = 11.0 + OS(20) = 2I days. The standard error of this point estimated is found by recalling from Example 13.3 that X = 15.2, S,, = 1891.2, and n = 20. In Example 13.7, we found that S = 1.374. Also, xo = 20. The standard error is equal to For df = n - 2 = 18degrees of freedom, the student t value is t.02s = 2.101.The 95% margin of error is = 2.101 x 0.343= 0.72 s x x The 95% confidence interval for the mean number of annual days off for employees having 20 years of service extends from 21 - 0.72 = 20.28 days to 21 +0.72 = 21.72days. Suppose we wished to predict the number of annual days off for a single individual having 20 years of service rather than the mean number for all employees having 20 years of service. The prediction is still determined by using the estimated regression line as in the case of estimating the mean. However, the standard error is larger because a single observation is more uncertain than the mean of the population distribution. A 100(1 - a)% prediction interval when predicting a single observation y at x = xo given by formula (23.20): 1 (xO-X)’ bo +blxo k ta,,S 1+-+ i . s x x (13.20) EXAMPLE 13.17 A 95% prediction interval for the number of annual days off for an individual having 20 years of service is found as follows. The point estimate is the same as that found when estimating the mean in Example 13.16,21 years. The standard error associated with the prediction is found as follows: t = 1 . 3 7 4 , / - = 1.42days The margin of error associated with this prediction is 2.101 x 1.42= 2.98. The prediction interval extends from 21 - 2.98 = 18.02days to 21 +2.98 = 23.98 days. EXAMPLE 13.18 In the Minitab output shown in Fig. 13-4, if the subcommand Predict 20; is added to the commands, the following additional output is obtained. F i t StDev F i t 9 5 . 0 % C I 95.0% P I 21.000 0.343 ( 2 0 . 2 8 0 , 2 1 . 7 2 0 ) ( 1 8 . 0 2 3 , 2 3 . 9 7 7 ) The fit value is the value obtained by substituting 20 into the estimated regression equation. The StDev Fit value is the standard error associated with estimating the mean. In addition, the 95% confidence interval and the 95% prediction interval are also given. LINEAR CORRELATION COEFFICIENT The linear correlation coeficient, also known as simply the correlation coefirient, is a measure of the strength of the linear association between two variables. The sample correlation coefficient is
  • 329.
    320 REGRESSION ANDCORRELATION [CHAP. 13 X 2 given by formula (13.21). The correlation coefficient is also sometimes referred to as the Pearsun correlation coeflcient. SXY (13.21) Y 12 EXAMPLE 13.19 The correlation coefficient for the data in Table 13.2 will be determined. In Example 13.3 we found that S,, = 945.6 and S,, = 1891.2. In Example 13.6, we found that S,, = 506.8. The correlation coefficient is therefore found as follows. = 0.966. S X Y = 945.6 = ,/%41891.2x 506.8 The Minitab computation for r is as shown below. The x values are put into column 1 and the y values are put into column 2 and the command corr cl c2 is used to obtain the correlation. MTB > corr cl c2 Correlations(Pearson) Correlation of Years and Daysoff = 0.966 The basic properties of the sample correlation coefficient are given in Table 13.7. Table 13.7 I I1. The value of r is always between -1 and +1, i.e., -1 I r I +l. 2. The magnitude of r indicates the strength of the linear relationship, and the sign of r indicates whether the relation is direct or inverse. If r is positive, then y tends to increase linearly as x increases, with the tendency being greater the closer that r is to 1. If r is negative, then y tends to decrease linearly as x increases, with the tendency being greater the closer that r is to -1. If all the points on the scatter diagram lie perfectly on a straight line with a positive slope, the r = +l. If all the points on the scatter diagram lie perfectly on a straight line with a negative slope, then r = -1. I 3. A value of r close to -1 or +1 represents a strong linear relationship. I4. A value of r close to 0 represents a weak linear relationship. If the formula for r is applied to every pair of (x, y) values in the population, we obtain the population correlation coeficient. The Greek letter p represents the population correlation coefficient. EXAMPLE 13.20 The data given in Table 13.1 and the corresponding scatter plot shown in Fig, 13-1 are reproduced below and on the next page. 4 8 10 16 20 24 13 15 16 19 21 23 I 30 I 26
  • 330.
    CHAP. 131 25 - 20- Days off 15 - 10 1 REGRESSION AND CORRELATION 0 I I I 321 Step 1: The null hypothesis is H o :p = 0 and the alternativehypothesis is either lower, upper, or two tailed. Step 2: Use the Student t distribution table and the level of significance a to , determinethe rejection regions. The degrees of freedom is n - 2. Step 3: The test statistic is computed as t*=r - Step 4: State your conclusions.The null hypothesisis rejected if the computed value of the test statistic falls in the rejection region. Otherwise, the null hypothesis is not rejected. /z* Years For the data shown in the table, we have: n = 8, Cx = 114, Cy = 145, Cx2= 2316, Cy2= 2801, and Cxy = 2412. s,, = Ex2- - (zx)2 = 2316 - 1624.5= 691.5, S,, = Cy2- - (zy)2 - -2801 - 2628.125 = 172.875, and n n s,, = zxy - (Cx)(zy) = 2412 - 2066.25 = 345.75. The correlation coefficient is then found as follows: n = 1.0 S X Y = 345.75 = Jsl(xsyy 46915 x 172.875 Since the points fall exactly on a straight line with a positive slope, the correlation coefficient is equal to 1. INFERENCE CONCERNING THE POPULATION CORRELATION COEFFICIENT One of the most important uses of the sample correlation coefficient is the determination of whether or not a correlation exists between two population variables. Recall that r measures the correlation between x and y in the sample and p is the corresponding population correlation. In particular, we are often interested in testing the null hypothesis that p =0 vs. one of three alternatives. The alternative p f 0 states that a correlation is present. The alternative p < 0 states that an inverse correlation exists, and the alternative p > 0 states that a direct correlation exists. Table 13.8 gives the steps for testing a hypothesis concerningp. Table 1 3 . 8 I Steps for Testing for No PopulationCorrelation
  • 331.
    322 REGRESSION ANDCORRELATION [CHAP. 13 EXAMPLE 13.21 The data in Table 13.2 and the computed value of r in Example 13.19 will be used to test whether or not there is a positive correlation between the years of service and the number of annual paid days off for employees of large companies. The null hypothesis is Ho: p = 0 and the alternative hypothesis is Ha:p > 0. The computed value of the test statistic is: Because of this large value oft*, we would reject the null and conclude that a positive correlation exists in the population. Solved Problems STRAIGHTLINES 1 3 . 1 Find the slope and y intercept of the following straight lines. Am. In order to find the slope and y intercept, each equation will be put into the form y = mx + b. The number m is the slope and the number b is the y intercept. (a) The slope is m = 1.5 and the y intercept is b = -2.5. (6) The slope is m = -2.5 and the y intercept is b = 3.0. (c) The given line is equivalent to y = 2x - 3 and slope is m = 2 and the y intercept is b = -3. (d)The given line is equivalent to y = .5x + .25 and the slope is m = .5 and the y intercept is b = .25. 13.2 Determine which of the following points fall on the line y = -2x +4. (a) (2,O) (b) (0, 2) (c) (50,96) (d) (-10,24) (e) (14.23, -24.46) AIZS. The points given in (a),(4,and (e)fall on the line since they satisfy the equation. LINEAR REGRESSIONMODEL 13.3 Identify the values of PO, is a normal random with mean 0 and standard 1.2. ,and CT in the linear regression model y = 12.1 +3 . 5 ~ +e, where e Arts. PO=12.l,p,=3.5andO= 1.2. 13.4 Consider the linear regression model y = - 1 . 5 ~+ 5.6 + e, where e is a normal random variable with mean 0 and variance d = 4. Determine the mean and standard deviation of y when x = 2 and when x = 4.5. Ans. When x = 2, y = -1.5 (2) + 5.6 + e = 2.6 + e. E(y) = E(2.6 + e) = 2.6. Since 2.6 is a constant, the variance of y is the same as the variance of e or 4. Therefore the standard deviation of y is 2 when x = 2. When x = 4.5, y = -1.5(4.5) + 5.6 + e = -1.15 + e. E(y) = E(-1.15 +- e) = -1.15. Since -1.15 is a constant, the variance of y is the same as the variance of e or 4. Therefore the standard deviation of y is 2 when x = 4.5. Notice that the standard deviation of y is 2, regardless of the value of x.
  • 332.
    CHAP. 131 REGRESSIONAND CORRELATION 323 LEAST SQUARES LINE 1 3 . 5 The number of hours spent per week viewing TV, y, and the number of years of education, x, were recorded for 10 randomly selected individuals. The results are given in Table 13.9. In addition, this table gives computations needed in finding bo and bl. Find the least-squares line for these data. Ans. S,,= Ex2- -= 2085 - 1988.1= 96.9 S,, = Zxy - (cx)(cy) = 1351- 1494.6=-143.6 n n - -143.6 x = 14.1 7 = 10.6 bl = -- "' - -=--1.4819 bo= 7 -b15?= 10.6-(-1.4819)(14.1) S X X 96.9 bo = 10.6+20.8948 = 31.4948 The equation of the least-squares line is 9 = b o +blx = 31.495- 1.482~. X 12 14 11 16 16 18 12 20 10 12 Y Table 13.9 I X2 10 9 15 8 5 4 20 4 16 15 I44 196 121 256 256 324 144 400 100 144 Y2 100 81 225 64 25 16 400 16 256 225 XY I20 126 I65 128 80 72 240 80 160 180 Cx= 141 I Cy= 106 I Zx2=2085 I Zyz= 1408 I ZXY=1351 13.6 The Minitab output for the data in problem 13.5 is shown in Fig. 13-5. Give the estimated regression line. MTB > Regress 'TVhours' 1 'Yearsedu'; SUBC> Constant; SUBC> Predict 15. Regression Analysis The regression equation is TVhours = 3 1 . 5 - 1 . 4 8 Yearsedu Predictor Coef StDev T P Constant 3 1 . 4 9 5 4.388 7 . 1 8 0.000 Yearsedu -1.4819 0.3039 -4.88 0.000 S = 2.992 R-Sq = 74.8% R-Sq(adj) = 71.7% Analysis of Variance Source DF ss MS F P Regression 1 2 1 2 . 8 1 2 1 2 . 8 1 23.78 0.000 Error 8 7 1 . 5 9 8.95 Total 9 284.40 Fit St Dev Fit 9 . 2 6 6 0.985 ( 6 . 9 9 5 , 1 1 . 5 3 8 ) ( 2 . 0 0 2 , 1 6 . 5 3 1 ) 95.0% CI 95.0% PI Fig. 13-5
  • 333.
    324 REGRESSION ANDCORRELATION [CHAP. 13 Ans. From Fig. 13-5,we see that the estimated regression equation is given in the following portion of the output in Fig. 13-5. Note that this is the same equation as the least-squares line found in problem 13.5. The regression equation is TVhours = 31.5 - 1.48 Yearsedu Predictor Coef StDev T P Constant 31.495 4.388 7 . 1 8 0 * 000 Yearsedu -1.4819 0.3039 - 4 . 8 8 0.000 ERROR SUM OF SQUARES 13.7 Compute the error sum of squares for the data in Table 13.9using both the defining formula and the computational formula. Ans. Table 13.10gives the details of the computation of SSE = C(y, - ii)2. The columns from left to right give the following information: Individual or subject, Number of years of education, Number of hours spent per week watching TV, The fitted or estimated value using the least squares line, The residual, and The squares of the residuals. Table 13.10 Subject 1 2 3 4 5 6 7 8 9 10 Xi 12 14 I 1 16 16 18 12 20 10 12 Yi 10 9 15 8 5 4 20 4 16 15 9i = 31.495- 1.482~ 13.7121 10.7482 15.1940 7.7843 7.7843 4.8204 13.7121 1.U66 16.6760 13.7121 (yi - f i -3.71207 -1.74819 -0.1940 1 0.21569 -2.7843 1 -0.82043 6.28793 2.14345 I .28793 -0.67 595 13.7795 3.0562 0.0376 0.0465 7.7524 0.6731 39.5380 4.5944 0.4569 1.6588 The error sum of squares may also be computed using SSE = syy--’”. From problem 13.5, we have SX L X S,, = 96.9 and S,, = -143.6. In addition, S,, = Cy2- = 1408- 1123.6= 284.4. Using these values, n = 71.593. (- 143.6)2 we find SSE= sY-- ’”= 284.4- Sxx 96.9 13.8 Using the Minitab output shown in Fig. 13-5, locate SSE and compare it with the result found in problem 13.7. Am. The error sum of squares is shown in bold in the following portion of Fig. 13-5. Analysis of Variance Source DF ss MS F P Error 8 71.59 8 . 9 5 Total 9 284.40 Regression 1 2 1 2 . 8 1 2 1 2 . 8 1 2 3 - 7 8 0 . 0 0 0
  • 334.
    CHAP. 131 REGRESSIONAND CORRELATION ' Subject 1 2 3 4 5 6 7 8 9 10 325 Xi Yi ii = 31.495- 1.482~ (Yi - y > (yi - YY 12 10 13.7121 3.11207 9.6850 9 10.7482 0.14819 0.0220 14 11 15 15.1940 4.59401 21.1050 16 8 7.7843 -2.8 1569 7.9281 16 5 7.7843 -2.8 1569 7.9281 18 4 4.8204 -5.77957 33.4034 12 20 13.7121 3.11207 9.6850 20 4 1.8566 -8.74345 76.4479 10 16 16.6760 6.07595 36.9172 12 15 13.7121 3.11207 9.6850 Sum = 0 Sum = 212.81 STANDARDDEVIATION OF ERRORS 13.9 Use the computed value for SSE in problems 13.7 and 13.8 to find the standard deviation of errors, S. Ans. The standard deviation of errors is given by the formula S = j,"s"; -- - /?-= 2.9915. 13.10 Locate the standard deviation of errors in Fig. 13-5 and confirm that it is the same as that computed in problem 13.9. Ans. The following row taken from Fig. 13-5gives the value for S. 8 = 2.992 R-Sq = 74.8% R-Sq(adj) = 7 1 . 7 % TOTAL SUM OF SQUARES 13.11 Find the total sum of squares for the data given in Table 13.9. 1M2 Ans. The total sum of squares is given by SST = C(yi - y)2= Zy2- &f = 1408- -= 1408- n 10 1123.6= 284.4. The values for Cy2and Zy are found at the bottom of Table 13.9. 13.12 Locate the Total sum of squares in Fig. 13-5 and confirm that it is the same as that found in problem 13.11. Ans. The following portion of Fig. 13-5 gives the total sum of squares. It is the same as that found in problem 13.11. Source DF ss MS F P Regression 1 2 1 2 . 8 1 2 1 2 . 8 1 2 3 . 7 8 0.000 E r r o r 8 71.59 8.95 Total 9 284.40 REGRESSIONSUM OF SQUARES 13.13 Find the regression sum of squares for the data given in Table 13.9 by subtraction as well as by direct computation. Ans. The regression sum of squares is given by SSR = SST - SSE = 284.4 - 71.593 = 212.807, where SST is computed in problems 13.1I and 13.12and SSE is computed in problems 13.7 and 13.8. The direct computation is illustrated in Table 13.11. The mean of the y values is 10.6.
  • 335.
    326 REGRESSION ANDCORRELATION [CHAP. 13 13.14 Locate the regression sum of squares in Fig. 13-5 and confirm that you get the same value as that computed in problem 13.13. Ans. The regression sum of squares is shown in the following portion of output selected from Fig. 13-5. Source DF ss MS F P ROgrO68iOn 1 212.81 2 1 2 . 8 1 2 3 . 7 8 0.000 Error a 7 1 . 5 9 8.95 Total 9 284.40 COEFFICIENT OF DETERMINATION 13.15 13.16 Calculate the coefficient of determination for the results given in Table 13.9 and explain its meaning. Ans. The coefficient of determination is given by r2 = - - - -= 0.748. The estimated regression equation accounts for about 75% of the variation in TV viewing time. SSR 21281 SST 284.4 Locate the coefficient of determination in Fig. 13-5 and confirm that you get the same value as that computed in problem 13.15. Ans. The r2value shown in the following row, taken from Fig. 13-5,is the same as that in problem 13.15. S = 2 . 9 9 2 R-8q m 74.8% R-Sq(adj) = 71.7% MEAN, STANDARDDEVIATION,AND SAMPLINGDISTRIBUTION OF THE SLOPE OF THE ESTIMATED REGRESSION EQUATION 13.17 Find the standarderror for bl using the data in Table 13.9. 2'992 - 0.304. The value for S is found in The standard error for bl is given by sb, = -- S Ans. J s x X - 3 z - problem 13.9and the value for S,, is given in problem 13.5. 13.18 Locate the value for the standard error for bl in Fig. 13-5and compare it with the value computed in problem 13.17. Ans. The standard error of bl is shown in bold print in the following selection taken from Fig. 13-5.It is the same as that computed in problem 13.17. Predictor Coef St Dev T P Constant 3 1 . 4 9 5 4 . 3 8 8 7 . 1 8 0.000 Yeareedu -1.4819 0.3039 - 4 . 8 8 0.000 INFERENCES CONCERNING THE SLOPE OF THE POPULATION REGRESSION LINE 13.19 Use the data in Table 13.9to test Ho: =0 vs. Ha: <0 at significancelevel a = .05. Ans. First, let us determine the critical value. The degrees of freedom is df = n - 2 = 8. The t value for a right tail area equal to -05 is t,os = 1.860. Since this is a lower-tail test, the critical value is
  • 336.
    CHAP. 131 REGRESSIONAND CORRELATION 327 -1.860. The computed value of the test statistic is t* = U. From problem 13.5, we know that bl = -1.4819, and from problem 13.17, s,,,= 0.304. The hypothesized value for the slope is plo= 0. The computed value of the test statistic is therefore t* = = -4.87. Since this value is smaller than -1.860, we reject the null hypothesis and conclude that the slope is less than zero. This indicates that the number of hours spent viewing TV and the number of years of education are inversely related. Sb* -1.48 19-0 .304 13.20 Locate the computed value of the test statistic in problem 13.19 in the Minitab output given in Fig. 13-5. Ans. The following selection from Fig. 13-5 gives the computed test statistic as -4.88 with a corresponding p value equal to O.OO0. Predictor Coef StDev T P Constant 31.495 4.388 7 . 1 8 0.000 Yearsedu -1-4819 0.3039 -4.00 0.000 ESTIMATIONAND PREDICTIONIN LINEAR REGRESSION 13.21 Find a 95%confidence interval for the mean number of hours spent per week watching TV for all individuals with 15 years of education and a 95% prediction interval for an individual with 15years of education using the data in Table 13.9. Ans. The 95% confidence interval for the mean is given by bo + blx, k t,,,S point estimate for the mean .is . The = bo + bl(15) = 31.495 - 1.482(15) = 9.265, The margin of I error associated with this estimate is t,,,S .The value for t.025 is found by using 8 degrees of freedom to be 2.306. The following values have been found in previous problems: S = 2.992, n = 10, X= 14.1, S,, = 96.9, and ~0 = 15. Therefore, the margin of error is: 2.306x 2.992 x ,/= = 2.27. The 95% confidence interval for the mean number of hours watching TV for all individuals having 15 years of education extends from 9.265 - 2.27 = 6.995 to 9.265 + 2.27 = 11.535. The 95% prediction interval for an individual is given by The point estimate is the same as given above when setting a confidence interval on the mean and the t value is the same as above. The margin of error is computed as follows: (15- 14.1), 2.306x 2.992 x ,/z= 1+-+ 7.264. The 95% prediction interval extends from 9.265 - 7.264 = 2.001 to 9.265 + 7.264 = 16.529. Note that the prediction interval is much wider than the confidence interval for the mean. 13.22 Locate the 95%confidence interval and the 95% prediction interval in Fig. 13-5and compare with the results in problem 13.21.
  • 337.
    328 REGRESSION ANDCORRELATION [CHAP. 13 Ans. The following portion of the output given in Fig. 13-5 gives the 95% confidence interval and prediction interval. The results are the same except for a small amount of round off error. F i t S t D e v F i t 95.0% C I 95.0% PI 9 . 2 6 6 0 . 9 8 5 (6.995, 11.538) (2.002, 16.531) LINEAR CORRELATIONCOEFFICIENT 13.23 Compute the linear correlation coefficient for the data in Table 13.9. Ans. The correlation coefficient is given by the formula r = . In problem 13.5, we found that S,, = 96.9 and S , , = -143.6. In problem 13.7, we found that S,, = 284.4. The correlation sxy .Is..syy -143.6 4 - = coefficient is equal to 13.24 Determine the correlation coefficient using the computer output in Fig. 13-5. Ans. In problem 13.16, we found that r2 = ,748. Therefore r = &4 % = k 365. The negative sign is taken since we know that the variables are inversely related. We know they are inversely related since the slope of the estimated regression line is negative. Therefore r = -.865. INFERENCE CONCERNING THE POPULATION CORRELATION COEFFICIENT 13.25 13.26 Use the data in Table 13.9to test Ho: p = 0 vs. Ha:p c 0 at a = .01. Ans. The critical value is determined by noting that df = n - 2 = 10- 2 = 8 and that t,ol= 2.896. Since this is a lower-tail test, the critical value is -2.896. The test statistic is computed as = -4.876. Since this value is less than the critical value, we 8 J1-.748225 t * = r d s = -.865 conclude that the two variables are negatively correlated. A sample of 100 federal taxpayers were interviewed and the annual income and cost to prepare their return was determined for each. The correlation coefficient between the two variables was found to be 0.37. Do these results indicate a positive correlation between these two variables in the population? Use a = .05. Ans. Because of the large sample size, we may use the standard normal distribution table to find the critical value. The critical value is 1.645. The computed value of the test statistic is t* = r i % = .37,/& = 3.94. Since the computed value of the test statistic exceeds the critical 1 - r value, we conclude that there is a positive population correlation. Supplementary Problems STRAIGHTLINES 13.27 Figure 13-6shows the line whose equation is y =4 +2x and 8 points, labeled A through H. Points A, C, F, and G have the following coordinates: A:(2, 6), C:(2, lO), G:(5, 16), and F(5, 12). Find the
  • 338.
    CHAP. 131 REGRESSIONAND CORRELATION 329 coordinates for the points B, D, E, and H. Also find the distance between the following pairs of points: A and B, B and C, H and G, and H and F. Find the sum of the squares of the distances between the four pairs of points. I I I 2 3 4 5 X Fig. 13-6 Ans. B: (2,8),D: (3, lO), E: (4, 12),and H: (5, 14) All four of the distances are equal to 2 units. The sum of the squares of the distances is equal to 16. 13.28 Use the coordinates of the points B and H found in problem 13.27to find the slope of the straight line passing through these points and compare this with the slope of the line y = 4 + 2x. Y 2 - Y 1 14-8 Ans. The slope of the line is m = -= -= 2. The slope of the line y =4 +2x is m = 2. x Z - X ~ 5 - 2 LINEAR REGRESSION MODEL 13.29 A linear regression model has a slope of 12.5 and a y intercept equal to 19.2. The error term has a standard deviation equal to 2.5.Give the equation of the linear regression model. Ans. y = 19.2+ 1 2 . 5 ~ +e 13.30 For the linear regression model described in problem 13.29,find the mean values for y when x equals 2 and 5. Ans. 44.2 and 81.7 LEAST-SQUARES LINE 13.31 Table 13.12 gives the household income in thousands of dollars, x, and cost of filing federal taxes in dollars, y, for 20 randomly selected federal taxpayers. Find the following: Zx, Ex2,Zy, Zy2, Zxy, S,,, S,,, S,,, bl, and bo.
  • 339.
    330 REGRESSION ANDCORRELATION [CHAP. 13 X 17.5 37.5 47.5 25.O 55.5 35.O 15.5 12.0 32.0 42.3 Table Y 35.0 60.5 88.5 70.5 125.0 63.0 30.0 30.0 65.O 80.0 13.12 X 28.0 22.5 25.O 29.5 65.O 51.0 39.3 33.0 45.O 75.O Y 75.0 70.0 60.0 65.O 150.0 100.0 75.O 40.0 75.0 200.0 Ans. Cx = 733.10, Cx2= 31,992, Cy = 1557.50, Cy2= 153,407, Cxy = 68,864, S,, = 5120.2195, S,, = 32116.6875, S,, = 11773.8375, bl = 2.2995, bo = -6.4132. 13.32 The Minitab output for the regression of y on x for the data in Table 13.12 is shown in Fig. 13-7. Give the portion of the output that shows the values for boand bl. MTB > Regress 'Cost' 1 'Income'; S U B 0 Predict 4 0 . RegressionAnalysis The rogrerraion oquation i8 Cost = - 6.42 + 2.30 Income Predictor Coef St D e v T P Constant - 6 . 4 2 0 9.353 - 0 . 6 9 0 . 5 0 1 Income 2.2997 0.2339 9.83 0.000 S = 1 6 . 7 3 R-Sq = 84.3% R-Sq(adj) = 8 3 . 4 % Analysis of Variance Source DF ss MS F P Error 1 8 5040 280 Total 1 9 32116 Regression 1 27076 27076 9 6 . 7 0 0 . 0 0 0 I St Dev Fit 95.0% CI 95.0% PI Fit ~ 8 5 . 5 7 3 . 8 2 ( 7 7 . 5 3 , 9 3 . 6 0 ) ( 4 9 . 5 0 , 1 2 1 . 6 4 ) Fig. 13-7 Ans. The estimate regression equation is shown in bold print. The differences between these values and those given in problem 13.31 are due to round off error. ERROR SUM OF SQUARES 13.33 Use the results given in problem 13.31 to find SSE for the regression line fit to the data in Table 13.12. Ans. SSE = syy -- '" = 32I 16.6875- 27073.6927 = 5042.9948 sxx 13.34 The Minitab output for the regression of y on x for the data in Table 13.12 is shown in Fig. 13-7. Give the portion of the output that shows the value for SSE.
  • 340.
    CHAP. 131 REGRESSIONAND CORRELATION 331 Ans. Error 1 8 5040 2 8 0 The difference in the values for SSE in problems 13.33and 13.34 are due to round-off error. STANDARDDEVIATION OF ERRORS 13.35 Use the result found in problem 13.33 to find the standard deviation of errors for the regression line fit to the data in Table 13.12. Ans. S = ,/E = ,/y = 16.7382 13.36 The Minitab output for the regression of y on x for the data in Table 13.12 is shown in Fig. 13-7. Give the portion of the output that shows the value for the standard deviation of errors. Ans. S = 16.73 R-Sq = 8 4 . 3 % R-Sq(adj) = 8 3 . 4 % TOTAL SUM OF SQUARES 13.37 Use the results given in problem 13.31 to find the total sum of squares for the data given in Table 13.12. Ans. SST = S,, = 32116.6875 13.38 The Minitab output for the regression of y on x for the data in Table 13.12 is shown in Fig. 13-7. Give the portion of the output that shows the value for the total sum of squares. Ans. Total 1 9 32116 REGRESSIONSUM OF SQUARES 13.39 Use the results of problems 13.33 and 13.37to find the regression sum of squares for the data in Table 13.12. Ans. SSR = SST - SSE = 32116.6875- 5042.9948 = 27073.6927 13.40 The Minitab output for the regression of y on x for the data in Table 13.12 is shown in Fig. 13-7. Give the portion of the output that shows the value for the regression sum of squares. Ans. Regression 1 27076 27076 96.70 0.000 The difference in the answers given in problems 13.39and 13.40is due to round-off error. COEFFICIENTOF DETERMINATION 13.41 13.42 Use the total sum of squares found in problem 13.37 and the regression sum of squares found in problem 13.39to find the coefficient of determination for the linear regression model applied to the data in Table 13.12. SSR - 27073.6927 = o.843 SST 32116.6875 Ans. r2 = -- The Minitab output for the regression of y on x for the data in Table 13.12 is shown in Fig. 13-7. Give the portion of the output that shows the value of the coefficient of determination. Am. s = 1 6 . 7 3 R-Sq p 04.3% R-Sq(adj) = 8 3 . 4 %
  • 341.
    332 REGRESSION ANDCORRELATION [CHAP. 13 MEAN, STANDARDDEVIATION, AND SAMPLINGDISTRIBUTIONOF THE SLOPE OF THE ESTIMATEDREGRESSIONEQUATION 13.43 13.44 Find the standard error for bl ,the slope of the estimated regression equation found in problem 13.31. - 16*7382 = 0.2339 S - Ans. sb,--- Jsxx 4 - The Minitab output for the regression of y on x for the data in Table 13.12 is shown in Fig. 13-7. Give the portion of the output that shows the value standard error for bl. Ans. Income 2.2997 0 . 2 3 3 9 9.83 0.000 INFERENCESCONCERNINGTHE SLOPEOF THE POPULATIONREGRESSIONLINE 13.45 Using the data in Table 13.12, test the hypothesis that the slope of the line in the population regression model is equal to 0. Use level of significance a = .05. Arts, b -P Sb, The critical values are f 2.101. The computed value of the test statistic is t* = = 2'2995-0 = 9.8311.The slope of the regression line is different from 0. .2339 13.46 Give the portion of Fig. 13-7that is used to test the null hypothesis that the slope of the regression line is equal to 0. The regression model is y = PO+ Plx + e, where x = household income in thousands, and y = cost of filing federal taxes. Assume a = .01. Ans. Predictor Coef St Dev T P Constant - 6 . 4 2 0 9.353 - 0 . 6 9 0 . 5 0 1 Income 2 . 2 9 9 7 0.2339 9.83 0 . 0 0 0 The computed test statistic is seen to be the same as in problem 13.45. The p value, O.OO0, indicates that the null hypothesis should be rejected. ESTIMATIONAND PREDICTION IN LINEAR REGRESSION 13.47 Find a 95%confidence interval for the mean cost of filing federal income taxes for all those individuals with household income equal to $40,000 and a 95%prediction interval for an individual with household income equal to $40,000 using the data in Table 13.12. Ans. The 95% confidence interval for the mean is given by bo+blxo rf: t,,,S .The point estimate of the mean is -6.4 I32 +2.2995 x 40 = 85.5668. The margin of error is 5 2.101 x 16.7382 x il+ (40-36.655)2 or +, 8.0336. The 95% confidence interval extends from 85.5668 20 5120.2195 - 8.0336 = $77.53 to 85.5668 + 8.0336 = $93.60. The 95%prediction interval for an individual is given by bo + blxo k t,,,S .The point estimate for the individual is 85.5668. The margin of error is k 2.101 x 16.7382 x (40-36'655)2 or f 36.0729. The 95% 20 5120.2195 prediction interval extends from 85.5668- 36.0729 = $49.49 to 85.5668 + 36.0729 = $121.64.
  • 342.
    CHAP. 131 REGRESSIONAND CORRELATION 333 13.48 Locate the 95% confidence interval and the 95% prediction interval in Fig. 13-7 and compare with the results in problem 13.47. Ans. Fit St D e v Fit 95.0% CI 95.0% PI 8 5 . 5 7 3 . 8 2 (77.53, 93.60) (49.50, 1 2 1 . 6 4 ) The 95% confidence interval and the 95% prediction intervals are seen to be the same as those found in problem 13.47. LINEAR CORRELATIONCOEFFICIENT 13.49 Calculate the correlation coefficient between the household income and the cost of filing federal taxes using the data in Table 13.12. = 0.918. S X Y - 11773.8375 Arts. The correlation coefficient is given by r = ,/= - J5120.2195~ 32116.6875 13.50 Using the output shown in Fig. 13-7, determine the correlation coefficient for the data given in Table 13.12. Ans. S = 16.73 R-Sq =84.3% R-Sq(adj) = 83.4% Since r2= 343, r = 4.843 = 0.918. INFERENCECONCERNINGTHE POPULATIONCORRELATIONCOEFFICIENT 13.51 Use the computed value for r in problem 13.49 to test the hypothesis Ho: p = 0 vs. Ha: p # 0 at level of significance a = .05, where p represents the correlation between household income and the cost of filing federal taxes in the population. Ans. The test statistic is computed as t*= r iz- - - .918J ' " = 9.82. The critical values are 1-.842724 *2.101. The null hypothesis is rejected at a = .05. 13.52 How can the Minitab output in Fig. 13-7be used to test the hypothesis Ho: p = 0 vs. Ha: p f 0 at level of significance a = .05, where p represents the correlation between household income and the cost of filing federal taxes in the population. Ans. The hypothesis Ho: p = 0 vs. Ha: p # 0 is equivalent to the hypothesis &: The following portion of the output is used to test &: = 0 vs. Ha: p1# 0. = 0 vs. Ha: PI +0. Predictor Coef S t D e v T P Constant -6.420 9.353 -0.69 0 . 5 0 1 Income 2.2997 0.2339 9.03 0 .000 The p value = 0.000indicates that the null hypothesis is rejected.
  • 343.
    Chapter 14 Value 17.213.5 , 15.5 12.5 13.5 16.0 11.5 13.5 Rank 9 4 7 2 4 8 1 4 Nonpararnetric 14.3 18.0 ' 6 10 Statistics Value I Sign NONPARAMETRIC METHODS thin thick thin thick thin thin thin thick thin thin I - - - - - + + - - + Many of the hypotheses tests discussed in previous chapters required various assumptions such as normality of the characteristic being measured or equality of population variances in two sample tests for example. What can we do if the assumptions are not satisfied? In addition, many research studies involve low-level data such as nominal or ordinal data to which previously discussed procedures do not apply. In a taste test, the response may be Pepsi or Coke when asked which of two colas are preferred. In situations such as these, nonparametric statistical methods are often used. Nonparametric tests often replace the raw data with ranks, signs, or both ranks and signs. The nonparametric tests then analyze the resulting ranks and/or signs. Since the original data are not actually used, one criticism of these procedures is that they are wasteful of information. EXAMPLE 14.1 Table 14.1 gives a set of data and the corresponding ranks associated with the data values. The smallest number receives a rank of 1, the next a rank of 2, and so on. If two or more values are tied, they receive the average of the ranks that would normally occur. Note that 13.5 occurred 3 times and the ranks would have been 3, 4, and 5. The average of 3, 4, and 5 is 4 and so 4 was assigned to each occurrence of the value 13.5.The sum of the ranks 1 through n is always equal to - . If n = 10, then the sum of the ranks is 1OX 11 = 55. This sum is obtained even if ties occur in the data. If the ranks are added in Table 14.1, the sum 2 n(n +1) 2 55 is obtained. The replacement of data with ranks will be utilized in many of the following sections. EXAMPLE 14.2 In a taste test, each individual was asked to taste a pizza with a thick crust and a pizza with a thin crust, and to state which one they preferred. A coin was flipped to determine which one they tasted first in order to eliminate the order of tasting as a factor. Table 14.2 gives the response for each individual and each response is coded as + if the response was thick and - if the response was thin. The results seem to indicate that the thin crust is preferred over the thick crust. But are these sample results significant'?That is, what do these results tell us about the set of preferences in the population? Seventy percent of the sample preferred thin over thick crust. What are the chances of these results if in fact there is no difference in preference in the population? The sign test in the next section will help us answer this question. Table 14.2 SIGN TEST The sign test is one of the simplest and easiest to understand of the nonparametric tests. In Example 14.2, the purpose of the taste test is to decide if there is a difference in taste preference 334
  • 344.
    CHAP. 141 NONPARAMETRICSTATISTICS 335 Price Sign between thin and thick pizza crust. The null hypothesis is that there is no difference in preference for the two types of crusts. The research hypothesis might be one-tail or two-tail. Suppose it is a two-tail research hypothesis. As usual, we proceed under the assumption that the null hypothesis is true and only reject that assumption if our sample results lead us to reject it. If there is no difference in preference, then the probability of a + on any trial is .5 and the probability of a - is also .5. The number of + signs in the 10trials, x, is a binomial random variable. The p value associated with the outcome x = 3 is computed by finding P(x 5 3) and then doubling the result since this is a two-tail test. Using the binomial distribution tables in Appendix 1 with n = 10and p = .5, we find that P ( x 5 3 ) is equal to .0010 + .0098 + .0439 + .1172 = .1719 and the p value = 2 x .1719 = 0.3438. At the conventional level of significance cx = .05 we cannot reject the null hypothesis since the p value is not less than the preselected level of significance. We see from this discussion that the sign test uses the binomial distribution to perform the sign test. The p value may be computed by using Minitab. The computation using Minitab is as follows: 110 130 102 125 140 112 117 130 200 119 + + - + + + + + + + MTB > cdf 3; SUBC> binomial n = 10 p = . 5 . Cumulative Distribution Function Binomial with n = 10 and p = 0.500000 X P ( x <= x) 3 . 0 0 0 . 1 7 1 9 The p value is then found by doubling 0.1719to get 0.3438 as found above. EXAMPLE 14.3 In order to test the hypothesis that the median price for a home in a city is $105,000, a sample of 10 recently sold homes is obtained. If the selling price exceeds $105,000, a plus is recorded. If the price is less than $105,000, a negative sign is recorded. If the price is equal to $105,OOO, the selling price is not used in the analysis and the sample size is reduced. Table 14.3 gives the selling prices in thousands and the signs are recorded as described. Suppose the alternative hypothesis is that the median selling price exceeds $105,000.If x represents the number of + signs, then the p value = P(x 29).The computation of the p value is as follows. MTB > cdf 8; S U B 0 binomial n = 10 p = . 5 . Cumulative Distribution Function Binomial with n = 10 and p = 0.500000 X P ( x <= x) 8 . 0 0 0 . 9 8 9 3 Since P(x 1 9 )= 1 - P(x S 8)= 1 - .9893 = 0.0107, the p value is 0.0107. The null hypothesis is rejected and we conclude that the median price exceeds $105,000at level of significance a = .05. The normal approximation to the binomial distribution may be used to determine the p value also. It is recommended that the student review this approximation which is discussed in Chapter 6. The approximation is valid provided that np 15 and nq 2 5. np and nq both equal 5 in this example. The mean is computed as p = np = 10 x .5 = 5 and the standard deviation is computed as 0 = 6= d m = 1.581. The z value is computed as z = - = 2.21. The p value is the area to the right of 2.21 under the standard normal curve. P(z > 2.21) = .5 - .4864 = .0136. The normal approximation to the binomial distribution provides a reasonably good result if n is 10 or greater. 8.5 -5 1.58I
  • 345.
    336 NONPARAMETRICSTATISTICS [CHAP.14 Patient Beforereading Afterreading Difference Many books recommend that n be 20 or 25, but if the continuity correction is used the result will be fairly good for n equal to 10or more. 1 2 3 4 5 6 7 8 9 1 0 1 1 1 2 1 3 1 4 1 5 95 90 98 99 87 95 95 97 90 90 99 '95 95 92 96 85 80 91 88 90 90 88 84 93 91 90 90 97 94 86 10 10 7 11 , -3 5 7 13 -3 -1 9 5 -2 -2 10 EXAMPLE 14.4 The sign test described in Example 14.3 may be performed by using the sign test procedure in Minitab. The prices given in Table 14.3 are entered in column cl and the following commands are used. The subcommand Alternative 1. indicates that the alternative is upper tailed. MTB > STest 105 C1; SUBC> Alternative 1. Sign Test for Median Sign test of median = 105.0 vs. > 105.0 N Below Equal Above P Median c1 10 1 0 9 0.0107 1 2 2 .0 Note that the same p value is given here as was found in Example 14.3. EXAMPLE 14.5 The sign test is also used to test for differences in matched paired experiments. gives the blood pressure readings before and six weeks after starting hypertensive medication as differences in the readings. Table 14.4 Table 14.4 well as the The research hypothesis is that the medication will reduce the blood pressure. The difference is found by subtracting the after reading from the before reading. If the medication is effective, the majority of the differences will be positive. If x represents the number of positive signs in the 15 differences and if the null hypothesis that the medication has no effect is true, then x will have a binomial distribution with n = 15 and p = .5. The null hypothesis should be rejected for large values of x. That is, as this study is described, this is an upper-tail test. The Minitab solution is as follows. If the medication is not effective, the median difference should be zero. The differences are entered into column 1. 10 10 7 11 - 3 5 7 13 - 3 -1 9 5 -2 -2 10 MTB > STest median = 0.0 data in C1; SUBC> Alternative 1. Sign Test for Median Sign test of median = 0.00000 vs. > 0.00000 c1 15 5 0 10 0.1509 7.000 N Below Equal Above P Median The p value is seen to equal 0.1509.This indicates that the null hypothesis would not be rejected at level of significance a = .05. A dotplot of the differences is shown in Fig. 14-1. The dotplot illustrates a weakness of the sign test. Note that the 5 minus values are much smaller in absolute value than are the absolute values of the 10 positive differences. The sign test does not utilize this additional information. The signed-rank test discussed in the next section uses this additional information. Fig. 14-1
  • 346.
    CHAP. 141 NONPARAMETRICSTATISTICS 337 The sign test is a nonparametric alternative to the one sample t test. The one sample t test assumes that the characteristic being measured has a normal distribution. The sign test does not require this assumption. If the selling prices of the homes discussed in Example 14.3 are normally distributed, then the one sample t test could be used to test the null hypothesis that the mean is $105,000. WILCOXON SIGNED-RANKTEST FOR TWO DEPENDENT SAMPLES In Example 14.5,it was noted that the sign test did not make use of the information that the 10 positive differences were all larger in absolute value than the absolute value of any of the 5 negative differences. The sign test used only the information that there were 10 pluses and 5 negatives. The data in Table 14.4 will be used to describe the Wilcoxon signed-rank test. Table 14.5 gives the computationsneeded for the Wilcoxon signed-ranktest. ~~~ ~ Before reading 95 90 98 99 87 95 95 97 90 90 99 95 95 92 96 After reading 85 80 91 88 90 90 88 84 93 91 90 90 97 94 Tab1 Difference, D 10 10 7 11 -3 5 7 13 -3 -1 9 5 -2 -2 10 14.5 ID1 10 10 7 11 3 5 7 13 3 1 9 5 2 2 Rank of I D 1 12 12 14 8.5 4.5 6.5 8.5 15 4.5 1 10 6.5 2.5 2.5 12 Signed rank 12 12 8.5 14 -4.5 6.5 8.5 15 -4.5 -1 10 -2.5 -2.5 12 6.5 The first column gives the before blood pressure readings, the second column gives the after readings, the third column gives the difference D = Before - After, the fourth column gives the absolute values of the differences, the fifth column gives the ranks of the absolute differences,and the sixth column gives the signed-rank which restores the sign of the difference to the rank of the absolute difference.Two sum of ranks are defined as follows: W+= sum of positive ranks and W- = absolute value of sum of negative ranks. For the data in Table 14.5,we have the following: W+= 12+ 12+8.5+ 14+6.5+8.5+15+ 10+6.5+ 12= 105 W- = 4.5 +4.5 + 1+2.5 +2.5 = 15 To help in understanding the logic behind the signed-rank test, it is helpful to note some interesting facts concerningthe above data. The sum of the ranks of the absolute differences must equal the sum of the first 15 positive integers. Using the formula given in Example 14.1, the sum of the ranks 1 through 15 is - l5 l6 - - 120. Note that W+ + W- = 120. If the null hypothesis is true and the medication has no effect, we would expect a 60,60 split in the two sum of ranks. That is if Ho is true, we would expect to find W+ = 60 and W- = 60. If the medication is effective for every patient, we would expect W+= 120and W- = 0. Either of the rank sums, W+ or W-, may be used to perform the hypothesis test. Tables for the Wilcoxon test statistic are available for determining critical. values for 2
  • 347.
    338 NONPARAMETRICSTATISTICS [CHAP.14 the Wilcoxon signed-rank test. However, statistical software would most often be used to perform the test in real world applications. The Minitab solution is given in Example 14.6. EXAMPLE 14.6 The differences in column 3 of Table 14.5 are entered into column cl. The command west 0 . 0 C1; indicates that the Wilcoxon signed-rank test is to be used to test that the median of the differences is 0 and that the differences are in column c1. The subcommand Alternative 1.indicates that the test is upper-tailed. The output gives the Wilcoxon statistic, W+,computed in the above discussion and most importantly it gives the p value for the test. The probability of obtaining a value of W+equal to 105or larger is 0.006.This p value is much less than .05 and indicates that the medication is effective. Note that the p value obtained when using the sign test in Example 14.5 is 0.1509. The Wilcoxon signed-rank test is more powerful than the sign test as indicated by this example. D i f f 10 10 7 11 -3 5 7 13 -3 -1 9 5 -2 - 2 10 MTB > WTest 0 . 0 C1; S U B 0 Alternative 1. Wilcoxon Signed Rank Test Test of median = 0.000000 vs. median > 0.000000 N for Wilcoxon Estimated N Test Statistic P Median D i f f 15 15 105.0 0.006 5.000 A normal approximation procedure is also used for the Wilcoxon signed-rank test procedure when the sample size is I5 or more. The test statistic is given by formula (14.I ) , where W is either W' or W- depending on the nature of the alternative hypothesis, and n is the number of nonzero differences. W -n(n +1)/4 Jn(n +1)(2n+1)/24 Z = (14,I ) EXAMPLE 14.7 The normal approximation will be applied to the blood pressure data in Table 14.5 and the results compared to the results in Example 14.6.The computed test statistic is as follows: 105- 15(16)/4 = 2.56 z = 4150(31)/24 The p value is P(z > 2.56) = .5 - ,4948 = 0.0052.The p value in Example 14.6 is equal to 0.006. Note that the two values are very close. The difference may be due to round-off error or it may be due the Minitab software using the exact distribution for the test statistic rather than the normal approximation. Like the sign test, the Wilcoxon signeL rank test is a nonparametric alternative to the one sample t test. However, the Wilcoxon signed-rank test is usually more powerful than the sign test. There are situations such as taste preference tests where the sign test is applicable but the Wilcoxon signed-rank test is not. WILCOXON RANK-SUM TEST FOR TWO INDEPENDENT SAMPLES Consider an experiment designed to compare the lifetimes of rats restricted to two different diets. Eight rats were randomly divided into two groups of four each. The rats in the control group were allowed to consume all the food they desired and the rats in the experimental group were allowed
  • 348.
    CHAP. 141 Control Experimental NONPARAMETRIC STATISTICS 3.43.6 4.0 2.5 3.7 4.2 4.2 4.5 339 Control only 80% of the food that they normally consumed. The experiment was continued until all rats expired and their lifetimes are given in Table 14.6. 3.4 (2) 3.6 (3) 4.0 (5) 2.5 (1) Experimental 3.7 (4) 4.2 (6.5) 4.2 (6.5) 4.5 (8) . A nonparametric procedure for comparing the two groups was developed by Wilcoxon. Mann and Whitney developed an equivalent but slightly different nonparametric procedure. Some books discuss the Wilcoxort rank-srun test and some books discuss the Marzrz-Whitrzey U test. Both procedures make use of a ranking procedure we will now describe. Imagine that the data values are combined together and ranked. Table 14.7 gives the same data as shown in Table 14.6 along with the combined rankings shown in parentheses. Table 14.7 We let R I be the sum of ranks for group 1, the control group, and R2 be the sum of ranks for group 2. For the data in Table 14.7, we have RI = 11 and R 2 = 25. The sum of all eight ranks must equal - * = 36. Note that R1 + R2 = 36. 2 Now, consider some hypothetical situations. Suppose there is no difference in the lifetimes of the control and experimental diets. We would expect R I = I8 and R2 = 18. Suppose the experimental diet is highly superior to the control diet, then we might expect all rats in the experimental group to be alive after all the rats in the control diet have died, and as a result obtain RI = 10and R2 = 26. Critical values for the testing the hypothesis of no difference in the two groups may be obtained from a table of the Mann-Whitney statistic or a table of the Wilcoxon rank-sum statistic. These tables are available in most elementary statistics texts. Most users of Statistics will use a computer software package to evaluate their results. Example 14.8 illustrates the use of Minitab to evaluate the results in the experiment described above. EXAMPLE 14.8 The Minitab analysis for the data in Table 14.6 is shown below. Note that the data are stored in separate columns. The command Mann-Whitney 95.0 C1 C2; will perform a Mann-Whitney test on the data in columns c I and c2. The subcommand Alternative -1. will test the alternative that the median lifetime for the control group is less than the median lifetime for the experimental group. The output line W = 11 is the value for R I computed in the above discussion of this experiment. The output line Test of ETAl = ETA2 vs. ETAl < ETA2 is significant at 0.0303 gives the p value as 0.0303 for the lower-tail alternative hypothesis. At level of significance a = .05, the null hypothesis would be rejected since the p value < a. This rather small study seems to indicate that restricting the diet prolongs the lifetime of such rats. Row Control Expermtl 1 3 . 4 3 . 7 2 3 . 6 4.2 3 4 . 0 4 . 2 4 2.5 4 . 5 MTB > Mann-Whitney 95.0 C1 C2; SUBC> Alternative -1.
  • 349.
    340 NONPARAMETRIC STATISTICS IRI= 184 I R 3 = 69 [CHAP. 14 Mann-WhitneyConfidence Interval and Test Control N = 4 Median = 3.500 Expermtl N = 4 Median = 4.200 Point estimate f o r ETA1-ETA2 is -0.700 97.0 Percent CI for ETA1-ETA2 is ( - 2 . 0 0 0 , 0 . 3 0 0 ) w - 11.0 Teat of ETA1 = ETA2 va. ETA1 < ETA2 i a significant at 0.0303 The t e s t is significant at 0.0295 (adjusted for ties) A normal approximation procedure is also used for the Wilcoxon rank-sum test procedure when the two sample sizes are large. If both sample sizes, nl and n2, are 10 or more, the approximation is usually good. The test statistic is given by formula (14.2) where RIis the rank-sum associated with sample size nl . R, - n,(n, +n2 +1) / 2 Jn,n2(nl +n2+1) / 12 Z = (14.2) EXAMPLE 14.9 A test instrument which measures the level of type A personality behavior was administered to a group of lawyers and a group of doctors. The results and the combined rankings are given in Table 14.8. The higher the score on the instrument, the higher the level of type A behavior. The null hypothesis is that the medians are equal for the two groups, and the alternative hypothesis is that the medians are unequal. Tab1 Lawyers 28 (5) 51 (20) 54 (21) 34 (10) 33 (9) 38 (12) 44 (15) 45 (16) 47 (17) 48 (18) 56 (22) 50 (19) 14.8 Doctors 20 (1) 26 (4) 32 (8) 36 (11) 40 (13) 23 (2) 24 (3) 30 (6.5) 30 (6.5) 42 (14) We note that nl = 12, n2 = 10,and R1= 184.The computed value of the test statistic is found as follows; 184-12(12+10+1)/2 =3.03 z*= 412 x 10x (12 +10+ 1) / 12 The null hypothesis is rejected for level of significance a = .05, since the critical values are k 1.96 and the computed value of the test statistic exceeds 1.96. The median score for lawyers exceeds the median score for doctors. In closing this section, we note that the parametric equivalent test for the Wilcoxon rank-sum test and the Mann-Whitney U test is the two-sample t test discussed in Chapter 10. KRUSKAL-WALLIS TEST In Chapter 12, the one-way ANOVA was discussed as a statistical procedure for comparing several populations in terms of means. The procedure assumed that the samples were obtained from
  • 350.
    CHAP. 141 NONPARAMETRICSTATISTICS 341 Whites 15 12 13 14 9.5 RI = 63.5 normally distributed populations with equal variances. The Kruskal-Wallis test is a nonparametric statistical method for comparing several populations. Consider a study designed to compare the net worth of three different groups of senior citizens (ages 51- 61). The net worth (in thousands) for each senior citizen in the study is shown in Table 14.9.The null hypothesis is that the distributions of net worths are the same for the three groups and the alternative hypothesis is that the distributions are different. African-Americans Hispanics 4 1.5 11 7 1.5 9.5 5 3 8 6 R2 = 29.5 R3 = 27 Table 14.9 205 225 110 The data are combined into one group of 15 numbers and ranked. The rankings are shown in Table 14.10.The sums of the ranks for each of the three samples are shown as RI,R2, and R3. The sum of the ranks from 1 to 15 is equal to 120. If the null hypothesis were true, we would expect the sum of ranks for each of the three groups to equal 40. That is, if H o is true we would expect that RI = R2 = R3 = 40. The Kruskal-Wallis test statistic measures the extent to which the rank sums for the samples differs from what would be expected if the null were true. The Kruskal-Wallis test statistic is given by formula (24.3),where k = the number of populations, n,= the number of items in sample i, Ri = sum of ranks for sample i, and n =the total number of items in all samples. (24.3) W has an approximate Chi-square distribution with df = k - 1 when all sample sizes are 5 or more. EXAMPLE 14.10 The computed value for W for the data in Table 14.9is determined as follows: The critical value from the Chi-square table with df = 2 and a = .05 is 5.991. The population distributions are not the same, since W* exceeds the critical value. Suppose the sample rank sums were each the same. That is, suppose RI= R2 = R3= 40. The computed value for W would then equal the following: It is clear that this is always a one-tailed test. That is, the null is rejected only for large values of the computed test statistic.
  • 351.
    342 NONPARAMETRICSTATISTICS [CHAP.14 t Basic Properties of the Spearman Rank Correlation Coefficient 1. The value of rs is always between -1 and +1, i.e., -1 I rs I+1 2. A value of rs near +1 indicates a strong positive association between the rankings. A value of rs near -I indicates a strong negative association between the rankings. 3. The Spearman rank correlation coefficient may be applied to data at the ordinal level of measurement or above. EXAMPLE 14.11 The Minitab analysis for the data in Table 14.9 is given below. The value for H is the same as that for W* found in Example 14.10. The p value is 0.016, indicating that the null hypothesis would be rejected for a = .05, the same conclusion reached in Example 14.10. Row Group Networth 1 1 250 2 1 1 7 5 3 1 1 3 0 4 1 2 05 5 1 225 6 2 80 7 2 1 5 0 8 2 7 0 9 2 90 10 2 110 1 1 3 7 0 1 2 3 100 1 3 3 130 1 4 3 75 1 5 3 95 MTB > Kruskal-Wallis 'Networth' 'Group'. Kruskal-Wallis Test Kruskal-Wallis Test on Networth Group N Median Ave Rank Z 1 5 205.00 1 2 . 7 2.88 2 5 9 0 . 0 0 5 . 9 - 1 . 2 9 3 5 9 5 . 0 0 5 . 4 - 1 . 5 9 Overall 15 8.0 H = 8.31 DF = 2 P = 0.016 RANK CORRELATION The Pearson correlationcoefficient was introduced in chapter 13 as a measure of the strength of the linear relationship of two variables. The Pearson correlation coefficient is defined by formula (14.4).The Spearman rank correlation coefzcient is computed by replacing the observations for x and y by their ranks and then applying formula (14.4)to the ranks. The Spearman rank correlation coefficient is representedby rs (14.4) EXAMPLE 14.12 Table 14.12 gives the number of tornadoes and number of deaths due to tornadoes per month for a sample of 10 months. In addition, the ranks are shown for the values for each variable.
  • 352.
    CHAP. 141 NONPARAMETRICSTATISTICS 343 Number of tornadoes, x 35 85 90 94 50 76 98 104 88 128 Table 14.12 Rank of x 1 4 6 7 2 3 8 9 5 10 Number of deaths, y 0 0 2 1 0 3 4 4 5 7 Rank of y 2 2 5 4 2 6 7.5 7.5 9 10 The formula for the Pearson correlation coefficient given in formula (14.4)is applied to the ranks of x and the ranks of y. C X = 1 + 4 + 6 + . * . +l0=55,CxZ= 1 + 1 6 + 3 6 + - + 1oO=385,Cy=55,Cy2=4+4+25+...+ loo= 382.5,E~y= 1 ~ 2 + 4 ~ 2 + 6 ~ 5 + * . * + 1 0 ~ 1 0 = 3 6 3 . 5 . (W@Y) - = 385 - 302.5 = 82.5, S,, = Cy2- - (cy)2 - - 382.5 - 302.5 = 80, and S,, = Zxy - - (EX)* s,, = Ex2- - n n n 363.5 - 302.5 = 61. = 0.75 XY 61 Rs= , / - = 9825x80 EXAMPLE 14.13 The Minitab solution for Example 14.12 is shown below. The raw data are entered into columns cl and c2. The rank command is then used to put the ranks into columns c3 and c4. Then the Pearson correlation coefficient is requested for columns c3 and c4. Row 1 2 3 4 5 6 7 8 9 10 tornado deaths 3 5 0 85 0 90 2 94 1 50 0 7 6 3 98 4 1 0 4 4 88 5 1 2 8 7 MTB > rank cl put into c3 MTB > rank c2 put into c 4 MTB > print c3 c4 Row C3 C4 1 1 2.0 2 4 2.0 3 6 5 . 0 4 7 4.0 5 2 2.0 6 3 6 . 0 7 8 7 . 5 8 9 7 . 5 9 5 9 . 0 1 0 1 0 1 0 . 0 MTB > correlation c3 c4 Correlation of c3 and c4 = 0.739
  • 353.
    344 NONPARAMETRIC STATISTICS[CHAP. 14 The population Spearman rank correlation coeficient is represented by ps.In order to test the null hypothesis that ps= 0, we use the result that J"-lr, has an approximate standard normal distribution when the null hypothesis is true and n 2 10. Some books suggest a larger value of n in order to use the normal approximation. EXAMPLE 14.14 The data in Table 14.12 may be used to test the null hypothesis &:ps= 0 vs. Ha: ps+0 at level of significance a = .05. The critical values are k 1.96.The computed value of the test statistic is found as follows: z* = 410-1 x .75 = 2.25. We conclude that there is a positive association between the number of tornadoes per month and the number of tornadodeaths per month since z* > 1.96. RUNS TEST FOR RANDOMNESS Suppose 3.5-ounce bars of soap are being sampled and their true weights determined. If the weight is above 3.5 ounces, then the letter A is recorded. If the weight is below 3.5 ounces, then a B is recorded. Suppose that in the last 10 bars the result was AAAAABBBBB. This would indicate nonrandomness in the variation about the 3.5 ounces. A result such as ABABABABAB would also indicate nonrandomness. The first outcome is said to have R = 2 runs. The second outcome is said to have R = 10runs. There are nl = 5 A's and n2 = 5 B's. A run is a sequence of the same letter written one or more times. There is a different letter (or no letter)before and after the sequence. A very large or very small number of runs would imply nonrandomness. In general, suppose there are nl symbols of one type and n2 symbols of a second type and R runs in the sequence of nl + n2 symbols. When the occurrence of the symbols is random, the mean value of R is given by formula (14.5): When the occurrence of the symbols is random, the standard deviation of R is given by formula (14.6): 2nln2(2n2n2-n1- n2) a = J (n1 +n2I2(n1+n,-l) (14.6) When both nl and n2 are greater than 10, R is approximately normally distributed. The size requirements for nl and n2 in order that R be normally distributed vary from book to book. We shall use the requirement that both exceed 10.From the proceeding discussion, we conclude that when the occurrence of the symbols is random and both nl and n2 exceed 10, then the expression in formula (14.7)has a standard normal distribution. When the normal approximation is not appropriate, several Statistics books give a table of critical values for the runs test. EXAMPLE 14.15 Bars of soap are selected from the productionline and the letter A is recorded if the weight of the bar exceeds 3.5 ounces and B is recorded if the weight is below 3.5 ounces. The following sequence was obtained. AAABBAAAABBABABAAABBBBAABBBABABBAA
  • 354.
    CHAP. 141 NONPARAMETRICSTATISTICS 345 These results are used to test the null hypothesis that the occurrence of A’s and B’s is random vs. the alternative hypothesis that the occurrences are not random. The level of significance is a = .01. The critical values are f2.58. We note that nl = the number of A’s = 18 and n2 = the number of B’s = 16. The number of runs is R = +1 = 2nln2 + I = 2 x 1 8 ~ 1 6 17. Assuming the null hypothesis is true, the mean number of runs is p = - 17.9411. Assuming the null hypothesis is true, the standard deviation of the number of runs is : “1 + “ 2 34 2n1n2(2n2n2- n l -n2) 2 ~ 1 8 ~ 1 6 x 5 4 2 O = d342x33=2-8607 R -p 17-17.9411 The computed value of the test statistic is Z* = -= = -.33. 0 2.8607 There is no reason to doubt the randomness of the observations. The process seems to be varying randomly about the 3.5 ounces, the target value for the bars of soap. EXAMPLE 14.16 The Minitab solution to Example 14.15 is shown below. The symbols A and B are coded as 1 and -1 respectively. The 0. in the statement Rune 0 C1, is the mean of -1 and 1. The output gives the observed number of runs, the mean number of runs, the number of A’s and B’s, and the p value = 0.7422. The output is the same as that obtained in Example 14.15. Cl 1 1 1 -1 -1 1 1 1 1 -1 -1 1 -1 1 -1 1 1 1 -1 -1 -1 -1 1 1 -1 -1 -1 1 -1 1 -1 -1 1 1 MTB > Runs 0 C1. RunsTest The observed number of runs = 17 The expected number of runs = 17.9412 18 Observations above K 16 below The test is significant at 0.7422 Cannot reject at alpha = 0.05 Solved Problems NONPARAMETRICMETHODS 14.1 14.2 A table (not shown) gives the number of students per computer for each of the 50 states and the District of Columbia. If the data values in the table are replaced by their ranks, what is the sum of the ranks? - 51x52 Ans. The ranks will range from 1 to 51. The sum of the integers from 1 to 51 inclusive is 7 - L 1,326.This same sum is obtained even if some of the ranks are tied ranks. A group of 30 individuals are asked to taste a low fat yogurt and one that is not low fat, and to indicate their preference. The participants are unaware which one is low fat and which one is not. A plus sign is recorded if the low fat yogurt is preferred and a minus sign is recorded if the one that is not low fat is preferred. Assuming there is no difference in taste for the two types of yogurt, what is the probability that all 30 prefer the low fat yogurt? Ans. Under the assumption that there is no difference in taste, the plus and minus signs are randomly assigned. There are z3O = 1,073,741,824 different arrangements of the 30 plus and minus signs.
  • 355.
    346~ NONPARAMETRIC STATISTICS[CHAP. 14 ~ Couple 1 2 3 4 5 6 7 8 9 1 0 1 1 1 2 1 3 i 4 ! 5 ~ - - - - - - - - 3 - I Husband 35 16 17 25 30 32 28 31 27 15 19 22 33 30 18 , Wife 25 20 18 20 25 25 25 I 1 9 30 20 15 20 27 28 2 0 , Difference 10 -4 -1 5 5 7 - 3 12 , -3 i -5 4 , 2 6 ~ 2 -2 The probability that all 30 are plus signs is1divided by 1,073,741,824,or 9.313225746x 10-l'. If this outcome occurred, we would surely reject the hypothesis of no difference in the taste of the two types of pgurt. SIGN TEST 14.3 Find the probability that 20 or more prefer the low fat yogurt in problem 14.2 if there is no difference in the taste. Use the normal approximationto the binomial distribution to determine yoiii answer. Ans. Assuming that there is no difference in taste, the 30 responses represent 30 independent trials with p = q = .5, where p is the probability of a + and q is the probability of a -. The mean number of + signs is p = np = 15 and the standard deviation is 0 = &= 2.739. The z value corresponding to 20 p!us signs is z = = 1.64.The probability of 20 or more plus signs is approximated 19.5-15 2.739 by P (Z > I .64) = .5 - .4495= ,0505. 14.4 A sociotogical study involving married couples, where both husband and wife worked full time, recorded the yeariy income for each in thousands of dollars. The results are shown in Table 14.13.Use the sign test to test the research hypothesis that husbands have higher salaries. Use level of significance a = .05. Ans. A high number of plus signs among the differences supports the research hypothesis. There are 10 plus signs in the 15 differences. The p value is equal to the probability of 10or more pius signs in the 15 differences. The p value is computed assuming the probability of a plus sign is .5 for any one of the differences, since the p value is computed assuming the null hypothesis is true. From the binemial distribution, we find that the p value = .0916 + .@I17 + .0139 + .0032 + .OOO5'+ .OOOO = 0.1509. Since the p value exceeds the preset leve! of significance, we cannot reject the null hypothesis. WILCOXON SIGNED-RANKTEST FOR TW-0DEPENDENT SAMPLES 14.5 Use the Wilcoxon signed-rank test and the data in Table M.13 to test the research hypothesis that husbands have higher salaries. Determine the values for W+ and W- and then use the normal approximation procedure to perform the test. Use level of significance a = .05 Am. From Table 14.14, we see that W+= 15 + 10 + 10 + 13 + 5.5 i 14 + 7.5 + 3 + 12 + 3 = 93, and W-= 7.5 + l + 5.5 + 10 + 3 = 27. Since Difference = Husband salary - Wife saiary, large values for W+ or small values for 'AT-will lend support to the research hypothesis that husbands have higher salaries. If w' is used, the test will be upper-tailed. If W- is used, the test will be lower- tailed. The test statistic used in the normal approximation is:
  • 356.
    CHAP. 141 NONPARAMETRICSTATISTICS 347 Husband 35 16 17 25 30 32 28 31 27 15 19 22 33 30 18 Wife 25 20 18 20 25 25 25 19 30 20 15 20 27 28 20 Tablt Difference. D 10 -4 -1 5 5 7 3 12 -3 -5 4 2 6 2 -2 14.14 ID1 10 4 1 5 5 7 3 12 3 5 4 2 6 2 2 Rank of ID I 15 7.5 1 10 10 13 5.5 14 5.5 10 7.5 3 12 3 3 Signed rank 15 -7.5 -1 10 10 13 5.5 14 -5.5 -10 7.5 3 12 3 -3 27 -60 If we use W-, then 2 = - = -1.87, and the p value = .5 - .4693 = .0307. If we use W+, then 2 = 17.607 93 -60 -- - 1.87, and the p value = .0307. It is clear that we may use either W+or W-. In either case, we 17.607 reject the null hypothesis since the p value < .05. 14.6 The Minitab output for the solution to problem 14.5 is shown below. Answer the following questions by referring to this output. (a) Explain the various parts of the commandWTest 0.0 I Diff ; (b) Why was the subcommandAlternative 1 used? (c) What is the number 93.0shown under Wilcoxon statistic? (4 How does the p value given in the Minitab output compare with the p value found in problem 14.5? MTB > print cl Data Display Diff 10 -4 -1 5 5 7 4 2 6 2 -2 3 12 -3 - 5 MTB > WTest 0.0 IDiff'; SUBC> Alternative 1. Wilcoxon Signed Rank Test Test of median = 0.000000 vs. median > 0.000000 N for Wilcoxon Estimated N Test Statistic P Median Diff 1 5 1 5 93.0 0.032 2 . 7 5 0 Ans. (a) Wtest indicates that the Wilcoxon signed-rank test is to be performed. The value 0.0 indicates that we are testing that the median difference is assumed to be 0.0.The median difference will be 0.0 if the null hypothesis is true. Diff indicates that we are performing the test for the differences in column 1. (6) The subcommand Alternative 1.indicates an upper-tailed test. Since Diff = Husband salary - Wife salary, a positive median will support the research hypothesis that the husband salaries are greater than the wife salaries. (c) The Wilcoxon statistic is the same as W+ found in problem 14.5.
  • 357.
    348 NONPARAMETRICSTATISTICS [CHAP.14 (6)The p values are very close. The p value based on the normal approximation is 0.0307 and the Minitab p value is 0.032. WILCOXON RANK-SUM TEST FOR TWO INDEPENDENT SAMPLES 14.7 A study compared the household giving for two different Protestant denominations. Table 14.15 gives the yearly household giving in hundreds of dollars for individuals in both samples. The ranks are shown in parentheses beside the household giving. Use the normal approximation to the Wilcoxon rank-sum test statistic to test the research hypothesis that the household giving differs for the two denominations.Use level of significancea = .Ol. Tabk Denomination 1 9.3 (5) 13.5(20) 13.7 (21) 10.2(10) 10.0(9) 10.7(12) 11.3(15) 11.4 (16) 11.5(17) 12.3(18) 14.2(22) 12.5(19) 14.15 Denomination 2 7.0 (1) 9.0(4) 9.7 (8) 10.4(11) 10.9(13) 8.0 (2) 8.5 (3) 9.5 (6.5) 9.5 (6.5) 11.1(14) 14.5 (23) I RI = 184 I Rz = 92 Ans. The sample sizes are nl = 12 and n2= 1I. The rank sum associated with sample 1 is RI= 184.The test statistic is computed as follows: The critical values are f2.58. Since the computed value of the test statistic does not exceed the right side critical value, the null hypothesis is not rejected. The p value is computed as follows. The area to the right of 2.46 is .5 - .4931 = .0069, and since the alternative is two sided, the p value = 2 x .0069= 0.0138. If the p value approach to testing is used, we do not reject since the p value is not less than the preset a. 14.8 The Minitab output for the solution to problem 14.7 is shown below. Answer the following questions by referring to this output. (a) Explain the command Mann-Whitney Denoml I Denan2 ; (b) What does the subcommandAlternative 0 mean? (c) What is the line W = 184 giving you? (d)Explain the line Test of ETA1 = ETA2 vs. ETA1 not = ETA2 is significantat 0.0151. Row Denoml Denom2 Row D e n o m l Denom2 1 9 . 3 7 . 0 7 11.3 8 . 5 2 1 3 . 5 9 . 0 a 1 1 . 4 9 . 5 3 1 3 . 7 9.7 9 1 1 . 5 9 . 5 4 1 0 . 2 1 0 . 4 1 0 1 2 . 3 11.1 5 1 0 . 0 1 0 . 9 1 1 1 4 . 2 1 4 . 5 6 1 0 . 7 8 . 0 1 2 1 2 . 5
  • 358.
    CHAP. 141 NONPARAMETRICSTATISTICS IRS Garbage collection Long distance service , 40 55 60 45 60 65 55 65 70 50 50 70 45 55 70 45 55 75 50 60 75 50 70 70 40 65 80 349 IRS 1.5 4.0 11.5 7.5 4.0 4.0 7.5 7.5 1.5 16.0 MTB > Mann-Whitney 'Denomlu 'Den0m2~; S U B 0 Alternative 0 . Mann-Whitney Confidence Interval and Test Denoml N = 12 Median = 11.450 Denom2 N = 11 Median = 9.500 Point estimate f o r ETA1-ETA2 is 2.000 95.5 Percent CI for ETA1-ETA2 is (0.499,3.501) W = 184.0 Test of ETAl = ETA2 vs. ETAl not = ETA2 is significant at 0.0151 The test is significant at 0.0150 (adjusted f o r ties) Garbage collection Long distance service 11.5 16.0 16.0 20.0 20.0 24.0 7.5 24.0 11.5 24.0 11.5 27.5 16.0 27.5 24.0 24.0 20.0 29.5 16.0 29.5 Ans, (a) This command requests a Mann-Whitney test for the data in columns named Denoml and Denom2. (6) This subcommand indicates that the hypothesis is two-tailed. (c) This is the same as R1, the sum of ranks for sample 1. (6)This line gives the p value for the test as 0.0151 . This value is close to the p value obtained in problem 14.7. KRUSKAL-WALLIS TEST 14.9 Thirty individualswere randomly divided into 3 groups of 10each. Each member of one group completed a questionnaire concerning the Internal Revenue Service (IRS). The score on the questionnaire is called the Customer Satisfaction Index (CSI). The higher the score, the greater the satisfaction. A second set of CSI scores were obtained for garbage collection from the second group, and a third set of scores were obtained for long distance telephone service. The scores are given in Table 14.16.Perform a Kruskal-Wallis test to determine if the population distributionsdiffer. Use significance level a = .01. Ans. The data in Table 14.16 are combined and ranked as one group. The resulting ranks are given in Table 14.17. The following rank sums are obtained for the three groups: RI= 65.0, R2 = 154.0, and R3= 246.0. Table 14.17
  • 359.
    350 BMI Age 22.5 27 24.632 28.7 45 30.1 49 18.5 19 20.0 22 24.5 31 25.O 27 27.5 25 30.0 44 NONPARAMETRIC STATISTICS BMI Age 19.0 44 17.5 18 32.5 29 22.4 29 28.8 40 21.3 39 25.O 21 19.0 20 29.7 52 16.7 19 [CHAP. 14 The computed value of the Kruskal-Wallis test statistic is found as follows: -3X31 ~ 2 1 . 1 3 8 The critical value is found by using the Chi-square distribution table with df = 3 - 1 = 2 to equal 9.210. The null hypothesis is rejected since the computed value of the test statistic exceeds the critical value. 14.10 The Minitab analysis for problem 14.9is shown below. Answer the following questions. (a) Explain the command Kruekal-Wallis 'CSI 'Group' (b) If the average ranks are multiplied by 10, what do you obtain? (c) WhatdoestherowH = 2 1 . 1 4 DF = 2 P = 0 . 0 0 0 give you? MTB > Kruskal-Wallis 'CSI' 'Group'. Kruskal-WallisTest Kruskal-Wallis Test on C S I Group N Median Ave Rank Z 1 1 0 5 . 7 5 0 6 . 5 - 3 . 9 6 2 1 0 1 6 . 0 0 0 1 5 . 4 - 0 . 0 4 3 1 0 24.000 2 4 . 6 4 . 0 0 Overall 3 0 1 5 . 5 H = 2 1 . 1 4 DF = 2 P = 0.000 H = 2 1 . 4 8 DF = 2 P = 0.000 (adjusted for ties) Ans. (a) The command asks for a Kruskal-Wallis test for the data in the column called CSI. The column called Group identifies the three samples. (b) RI= 65, R2 = 154, and R3 = 246. (c) This row gives the computed value of the test statistic, the degrees of freedom for the Chi- square distribution, and the p value = 0.0o0. RANK CORRELATION 14.11 The body mass index (BMI) for an individual is found as follows: Using your weight in pounds and your height in inches, multiply your weight by 705, divide the result by your height, and divide again by your height. The desirable body mass index varies between 19 and 25. Table 14.18 gives the body mass index and the age for 20 individuals. Find the Spearman rank correlation coefficient for data shown in the table.
  • 360.
    CHAP. 141 NONPARAMETRICSTATISTICS 351 Ans. Table 14.19contains the ranks for the BMI values and the ranks for the ages given in Table 14.18. The formula for the Pearson correlation coefficient given in formula (14.4)is applied to the ranks of the BMI values and the ranks of the ages. Let x represent the ranks of the BMI values and y represent the ranks of the ages. Cx=9.0+ 11.0+ 15.0+...+17.0+ 1.O=210,Cx2=81+ 121 + 2 2 5 + . . . + 2 8 9 + 1 ~ 2 8 6 9 , Cy = 8.5 + 13.0+ 18.0 + * * + 20 +2.5 = 210, Cy2= 72.25 + 169+ 324 +* * . +400 +6.25 = 2868, C x y = 9 ~ 8 , 5 + 11 x 13+ 1 5 ~ 1 8 + . . . + 1 7 ~ 2 0 + 1 ~2.5=2646.5 (zy)2 - 2868 - 2205 = 663 S,, = Ex2- -= 2869 - 2205 = 664, S,, = Cy2- -- (W2 n n s,, = cxy - (Cx)(cy) = 2646.5 -2205 = 441.5 n 441.5 , - , = 0.665 sxY - r , = , - - BMI 9.0 11.0 15.0 19.0 3.0 6.0 10.0 12.5 14.0 18.0 Table Age 8.5 13.0 18.0 19.0 2.5 6.0 12.0 8.5 7.O 16.5 d664 x 663 14.19 BMI 4.5 2.0 20.0 8.O 16.0 7.0 12.5 4.5 17.0 1.o AGE 16.5 1.o 10.5 10.5 15.0 14.0 5.O 4.0 20.0 2.5 14.12 Use the results found in problem 14.11 to test the null hypothesis Ho: ps= 0 vs. Ha: psf 0 at level of significance a = .05. Ans. The critical values are k1.96. The computed value of the test statistic is found as follows: z* = J G x .665 = 2.90. We conclude that there is a positive association between age and body mass index. RUNS TEST FOR RANDOMNESS 14.13 The gender of the past 30 individuals hired by a personnel office are as follows, where M represents a male and F represents a female. FFFMMMMFMFMFFFFMMMMMMMFFMMMFFF Test for randomness in hiring using level of significance a = .05. Ans. There are nl = 14 F’s and n2 =16 M’s. There are R = 11 runs. If the hiring is random, the mean number of runs is l4 nl + n 2 30 l6 +1= 15.933 p = -+I= 2n1n2 and the standard deviation of the number of runs is 2nln2(2n,n2- n l - n 2 ) - 2 ~ 1 4 ~ 1 6 x 4 1 8 (T= / = - d=2
  • 361.
    352 NONPARAMETRIC STATISTICS[CHAP. 14 The computed value of the test statistic is Z* = %- - - 5*933= -1.84. The critical values 0 2.679 are fl.96, and since the computed value of the test statistic is not less than -1.96, we are unable to reject randomness in hiring with respect to gender. 14.14 The Minitab analysis for the runs test is shown below. Compute the p value correspondingto the value Z* =-1.84in problem 14.13and compare it with the p value given in the Minitab output. c1 0 0 0 1 1 1 1 0 1 0 1 0 0 0 0 1 1 1 1 1 1 1 0 0 1 1 1 0 0 0 MTB > Runs .5 C1. The observed number of runs = 11 The expected number of runs = 15.9333 16 Observations above K 14 below The test is significant at 0.0658 Cannot reject at alpha = 0.05 Ans. The area to the left of Z*= -1.84 is .5 - .4671 = .0329. Since the alternative hypothesis is two- tailed, the p value = 2 x .0329 = 0.0658. This is the same as the p value given in the Minitab output. Supplementary Problems NONPARAMETRICMETHODS 14.15 14.16 Three mathematical formulas for dealing with ranks are often utilized in nonparmetric methods. These formulas are used for summing powers of ranks. They are as follows: n(n +1) 1 + 2 + 3 + . . . + n = - 2 n(n +1)(2n+1) 6 l2+22+3*+- ' +n2 = n2( n + ~ ) ~ l3+23+ 33+. . +n3 = 4 Use the above special formulas to find the sum, sum of squares, and sum of cubes for the ranks 1 through 10. Am. sum = 55, sum of squares = 385, and sum of cubes = 3,025. Nonparametric methods often deal with arrangements of plus and minus signs. The number of different possible arrangements consisting of a plus signs and b minus signs is given by or this formula to determine the number of different arrangements consisting of 2 plus signs and 3 negative signs and list the arrangements. 5 ! Ans. The number of arrangements is = - - - 10.The arrangements are as follows: (32! x 3!
  • 362.
    CHAP. 141 35 66 NONPARAMETRIC STATISTICS 7050 52 88 44 52 26 26 353 + + ---, +-+--, +--+-, + - - - +, -++--, -+-+-, -+--+, --++-, --+-+, --- ++ SIGN TEST 14.17 14.18 The police chief of a large city claims that the median response time for all 911calls is 15 minutes. In a random sample of 250 such calls it was found that 149 of the calls exceeded 15 minutes in response time. All the other response times were less than 15 minutes. Do these results refute the claim? Use level of significance .05. Assume a two-tailed alternative hypothesis. Ans. Assuming the claim is correct, we would expect p.= np = 250 x .5 = 125calls on the average. The standard deviation is equal CJ = & = 7.906. The z value corresponding to 149 calls is 2.97. Since the computed z value exceeds 1.96, the results refute the claim. The claim is made that the median number of movies seen per year per person is 35. Table 14.20 gives the number of movies seen last year by a random sample of 25 individuals. Table 14.20 45 29 23 25 25 Test the null hypothesis that the median is 35 vs. the alternative that the median is not 35.Use a = .05. Am. Table 14.21 gives a plus sign if the value in the corresponding cell in Table 14.20 exceeds 35, a 0 if the corresponding cell value equals 35 and a minus sign if the corresponding cell value is less than 35. There are 12plus signs, 10minus signs, and 3 zeros. Table 14.21 The below Minitab output indicates a p value = 0.8318, indicating that there is no statistical evidence to reject the claim that the median is 35. MTB > print cl 1 9 39 45 35 66 44 12 29 70 44 37 3 1 23 50 52 29 3 5 25 52 26 35 41 25 aa 2 6 MTB > STest 35 'movies'; SUBC> Alternative 0. SignTest for Median Sign test of median = 35.00 vs. not = 35.00 N Below Equal Above P Median Movies 2 5 10 3 1 2 0.8318 35.00
  • 363.
    354 NONPARAMETRIC STATISTICS[CHAP. 14 WILCOXON SIGNED-RANK TEST FOR TWO DEPENDENT SAMPLES 14.19 Use the normal approximation to the Wilcoxon sign-rank test statistic to solve problem 14.18. Note that this application of the Wilcoxon sign-rank test does not involve dependent samples. We are testing the median value of a population using a single sample. Ans. Table 14.22 shows the computation of the signed ranks. Note that 0 differences are not ranked and are omitted from the analysis. The sum of positive ranks is W+= 150.5,and the absolute value of the sum of the negative ranks is W = 102.5. The test statistic used in the normal approximation is: I50.5-22~ 2 3 / 4 = 0.779 - - W -n(n +1)/4 Jn(n +1)(2n+1)/24 z*= 422 x 23x 45 / 24 Since the computed value of the test statistic does not exceed 1.96, the null hypothesis is not rejected. X = Number 19 39 45 35 66 44 12 29 70 44 37 31 23 50 52 29 35 25 52 26 35 41 25 88 26 D = X - 3 5 -16 4 10 0 31 9 -23 -6 35 9 2 -4 -12 15 17 -6 0 -10 17 -9 0 6 -10 53 -9 Table 14.22 ID1 16 4 10 0 31 9 23 6 35 9 2 4 12 1s 17 6 0 10 17 9 0 6 10 53 9 Rank of ID I 16.0 2.5 12.0 20.0 8.5 19.0 5.o 21.0 8.5 1.O 2.5 14.0 15.0 17.5 5.O 12.0 17.5 8.5 5.O 12.0 22.0 8.5 Not used Not used Not used Signed rank -16.0 2.5 12.0 20.0 8.5 -19.0 -5 .O 21.o 8.5 1.o -2.5 -14.0 15.0 17.5 -5 .O -12.0 17.5 -8.5 5.0 -12.0 22.0 -8.5 14.20 The Minitab output for problem 14.19 is shown below. Answer the following questions. (a) Explain the command m o s t 35 'movies' ; (6) Explain the subcommand Alternative 0 . (c) Explain the output. (d) Compute the p value corresponding to the value Z* = 0.779 in problem 14.29 and compare it with the p value given in the Minitab output. MTB > print cl 1 9 3 9 45 3 5 66 44 1 2 2 9 7 0 44 37 3 1 23 50 52 29 3 5 2 5 52 26 35 4 1 2 5 88 2 6
  • 364.
    CHAP. 141 NONPARAMETRICSTATISTICS Men I 3 4 6 8.5 8.5 1 1 11 15.5 19 355 Women 2 6 6 11 13 14 15.5 17 18 20 MTB > WTest 35 'movies'; SUBC> Alternative 0 . Wilcoxon Signed RankTest Test of median = 3 5 . 0 0 vs. median not = 35.00 N f o r Wilcoxon Estimated N Test Statistic P Median Movies 25 22 1 5 0 . 5 0.445 3 7 . 5 0 Ans. (a) Wtest indicates a Wilcoxon sign-rank test. The number 35 is the median value to be tested. Movies indicates the column containing the sample data. (h) The subcommand indicates that the research hypothesis is two-tailed. (c) The output indicates that the sample size is 25. However only 22 of the values were used since there were 3 differences that were 0. The value for the Wilcoxon Statistic is the same as W+in problem 14.19.The p value is 0.445. (d) P value = 2 x (.5 - .2823) = 0.4354 WILCOXON RANK-SUM TEST FOR TWO INDEPENDENT SAMPLES 14.21 The leading cause of posttraumatic stress disorder (PTSD) among men is wartime combat and among women the leading causes are rape and sexual molestation. The duration of symptoms was measured for a group of men and a group of women. The times of duration in years for both groups are given in Table 14.23. Use the Wilcoxon rank-sum test to test the research hypothesis that the median times of duration differ for men and women. Use level of significance a = .01. Use the normal approximation procedure to perform the test. Table Men 2.5 3.3 3.5 4.0 5.O 5.0 6.5 6.5 10.0 15.0 14.23 Women 3.0 4.0 4.O 6.5 7.0 8.5 10.0 12.5 13.0 17.5 Ans. Table 14.24gives the rankings when the two samples are combined and ranked together. Table 14.24 RI = 87.5 I R2 = 122.5
  • 365.
    356 NONPARAMETRIC STATISTICS[CHAP. 14 The sample sizes are nl = 10and n2 = 10.The rank sum associated with sample 1 is RI= 87.5. The test statistic is computed as follows: RI-nl(n1 +n2+1)/2 875-10(10+10+1)/2 Jnln2(nl +n2 +I )/ 12 = -1.32 - Z = - JIOX I O X (10+ 10+ I)/ 12 The critical values are & 2.58. Since the computed value of the test statistic does not exceed the left side critical value, the null hypothesis is not rejected. The p value is computed as follows. The area to the left of -I .32 is .5 - .4066 = .0934,and since the alternative is two sided, the p value = 2 x .0934 = 0.1868. If the p value approach to testing is used, we do not reject the null since the p value is not less than the preset a. 14.22 The Minitab solution for problem 14.21 is shown below. Answer the following questions. (a) Explain the command line Mann-Whitney 95.0 I Men I (6) Explain the subcommand Alternative 0. (c) Discuss the output. Women' ; MTB > print Row Men 1 2.5 2 3.3 3 3.5 4 4.0 5 5 . 0 6 5 . 0 7 6.5 8 6.5 9 10.0 10 15.0 cl c2 Women 3.0 4.0 4.0 6.5 7 . 0 8 . 5 1 0 . 0 12.5 13.0 17.5 MTB > Mann-Whitney 95.0 'Men' lWomen'; SUBC> Alternative 0. Mann-WhitneyConfidenceIntervaland Test Men N = 10 Median = 5.000 Women N = 10 Median = 7.750 Point estimate for ETA1-ETA2 is -2.250 95.5 Percent CI f o r ETA1-ETA2 is (-6.503,1.002) W = 8 7 . 5 Test of ETAl = ETA2 vs. ETAl not = ETA2 is significant at 0.1988 The test is significant at 0.1971 (adjusted for ties) Cannot reject at alpha = 0.05 Ans. (a) The command line requests a Mann-Whitney analysis using the data in the columns labeled Men and Women. Set a 95%confidence interval on the difference in the medians for the two populations. (6) The subcommand indicates that the test is two-tailed. (c) The output gives a 95% confidence interval on the difference in the two population medians. The p value is given to be 0.1988. We are unable to reject the null hypothesis. KRUSKAL-WALLIS TEST 14.23 The cost of a meal, including drink, tax and tip was determined for ten randomly selected individuals in New York City, Chicago, Boston, and San Francisco. The results are shown in Table 14.25. Test the null hypothesis that the four population distributions of costs are the same for the four cities vs. the hypothesis that the distributions are different. Use level of significance a = .05.
  • 366.
    CHAP. 141 NONPARAMETRICSTATISTICS 357 New York 21.oo 23.00 23.50 25.OO 25.50 30.00 30.00 33.50 35.00 40.00 Table Chicago 15.OO 17.50 19.00 2I s o 22.00 23.75 25.00 26.00 27.50 33.00 14.25 Boston 15.50 16.00 18.50 19.50 20.00 22.50 25.25 26.50 30.00 37.50 San Francisco 16.50 18.25 21.25 24.25 26.25 27.75 29.00 31.25 35.50 42.25 Ans. Table 14.26gives the results when the four samples are combined and ranked together. Table 14.26 New York 11.0 16.0 17.0 20.5 23.0 31.0 31.0 35.0 36.0 39.0 Chicago 1.o 5.O 8.0 13.0 14.0 18.0 20.5 24.0 27.O 34.0 Boston 2.0 3.0 7.0 9.0 10.0 15.0 22.0 26.0 31.0 38.0 San Francisco ~~ ~ ~ 4.0 6.0 12.0 19.0 25.O 28.0 29.0 33.0 37.0 40.0 I RI =259.50 I R3= 164.50 I R I =163.00 I L=233.00 The critical value is 7.815. We cannot conclude that the distributions are different. 14.24 The Minitab output for the Kruskal-Wallistest in problem 14.23is shown below. (a) Explain the command line. (b) Explain the output. MTB > Kruakal-Wallis 'cost' 'city'. Kruskal-WallisTest Kruskal-Wallis T e s t on cost C i t y N Median A v e Rank Z 1 1 0 27.75 26.0 1.70 2 1 0 2 2 . 8 8 1 6 . 5 -1.27 3 1 0 21.25 16.3 - 1 . 3 1 4 1 0 27.00 23.3 0.87 Overall 40 2 0 . 5 H = 5 . 2 4 DF = 3 P = 0.155 H = 5.24 DF = 3 P = 0.155 (adjusted for ties) Ans. (a) The command line Kruskal-Wallis cost city'.requests a Kruskal-Wallis test (6) The output lists the 4 cities, the median cost for each city, and the mean rank for each city. In for the responses in the column called cost and the cities identified in the column called city. addition, the computed Kruskal-Wallis test statistic is H = 5.24. The p value is 0.155.
  • 367.
    358 NONPARAMETRIC STATISTICS[CHAP. 14 Percent Fat Calories 40 35 33 29 35 36 30 36 33 28 39 RANK CORRELATION Lead 13 12 1 1 8 13 15 I 1 14 12 9 15 14.25 Table 14.27 gives the percent of calories from fat and the micrograms of lead per deciliter of blood for a sample of preschoolers. Find the Spearman rank correlation coefficient for data shown in the table. 28.2 28.2 27.9 27.8 27.9 27.8 28.1 28.2 28.1 28.0 27.9 27.8 27.8 28.0 28.3 28.1 28.2 28.3 28.1 27.7 28.4 28.2 28.3 28.3 27.6 27.8 27.9 28.1 28.3 28.5 Ans. The Spearman correlation coefficient = 0.919. A A B B B B A A A A B B A A A B A A A A A B B B B B B A A A - - - - - - - - - - 14.26 Test for a positive correlation between the percent of calories from fat and the level of lead in the blood using the results in problem 14.25. A A B B B B A A A A B B A A A B A A A A A B B B B B B A A A - - - - - - - - - - Ans. The computed test statistic is 3.05 and the a = .05 critical value is 1.65. We conclude that a positive correlation exists. RUNS TEST FOR RANDOMNESS 14.27 The first 100decimal places of TI contain 51 even digits (0,2,4,6, 8) and 49 odd digits. The number of runs is 43. Are the occurrences of even and odd digits random? Am. Assuming randomness, p = 50.98, 0 = 4.9727, and z* = -1.60. Critical values = k 1.96. The occurrences are random. 14.28 The weights in grams of the last 30 containers of black pepper selected from a filling machine are shown in Table 14.28.The weights are listed in order by rows. Are the weights varying randomly about the mean‘? Am. The mean of the 30 observations is 28.06. Table 14.29 shows an A if the observation is above 28.06 and a B if the observation is below 28.06. Table 14.29 The sequence AABBBBAAAABBAAABAAAAABBBBBBAAA has 17 A’s, I3 B’s, and 9 runs. If the weights are varying randomly about the mean, the mean number of runs is 15.7333 and the standard deviation is 2.6414. The computed value of z is z* = -2.55. The critical values for a = .05 are k I .96. Randomness about the mean is rejected.
  • 368.
    Appendix 1 Binomial Probabilities P nx .OS .lO .20 .30 .40 S O .60 .70 .80 .90 .95 .9500 .0500 .9000 .loo0 .8OOo .2000 .7000 .3000 .6000 SO00 .4000 SO00 .4O00 .6O00 .3OOo .7OOo .2000 .go00 .1OOo .9OOo .0500 .9500 1 0 1 2 0 1 2 .9025 .0950 .0025 .8100 .1800 .o100 ,6400 .3200 ,0400 .4900 .4200 .0900 .3600 ,2500 .4800 SO00 .1600 .2500 .1600 .4800 .3600 .woo .4200 .4900 .0400 .3200 .6400 .0100 .1800 .8100 .0025 .0950 .9025 3 0 1 2 3 3574 .1354 ,0071 .o001 .7290 .2430 .0270 .oo10 .5 120 .3840 .0960 .0080 .3430 .4410 .1890 .0270 .2160 .1250 .4320 .3750 ,2880 .3750 .0640 .1250 .0640 .2880 .4320 .2160 .0270 .1890 .4410 .3430 .0080 .0960 .3840 .5 120 .0010 ,0270 .2430 .7290 .0001 .0071 .1354 3574 .8145 .1715 .O135 ,0005 .OoOo .6561 ,2916 .0486 .0036 .ooo1 .4096 .4096 .I536 .0256 .0016 .2401 .4116 .2646 .0756 .0081 .1296 .0625 .3456 .2500 .3456 .3750 .1536 ,2500 .0256 .0625 .0256 .1536 .3456 ,3456 .1296 .0081 .0756 .2646 .4116 .2401 .0016 .0256 .1536 .4096 .4096 .OOo1 .0036 .0486 .2916 .6561 .m .0005 .O135 .1715 .8145 4 0 1 2 3 4 5 0 1 2 3 4 5 .7738 .2036 .0214 .0011 .oooo .moo S905 ,3280 .0729 .0081 .o004 .oooo .3271 .4096 .2048 .0512 .0064 .WO3 .1681 .3602 .3087 .1323 .0283 .0024 .0778 .0312 .2592 .1562 .3456 .3125 .2304 .3125 .0768 .1562 .0102 .0312 .o102 .0768 .2304 .3456 .2592 .0778 .0024 .0284 .1323 .3087 .3601 .1681 .0003 .0064 .0512 .2048 .4096 .3277 .m .o005 .0081 .0729 .3281 S905 .m .m .0011 .0214 .2036 .7738 6 0 1 2 3 4 5 6 .7351 .2321 .0305 .0021 .WO1 .m .o000 .5314 .3543 .0984 .O146 .oo12 .ooo1 .m .2621 .3932 .2458 .0819 .O154 .0015 .oO01 .1176 .3025 .3241 .1852 .0595 .o102 .oO07 .0467 .0156 .1866 .0937 .3110 .2344 .2765 .3125 .1382 .2344 .0369 .0937 .0041 .0156 .0041 ,0369 .1382 .2765 .3110 .1866 .0467 .oO07 .o102 .0595 .1852 .3241 .3025 .1176 .o001 .0015 .O154 .OS19 .2458 .3932 .2621 .m .o001 .0012 .O146 .0984 .3543 S314 .oooo .m .0001 .0021 .0305 .2321 .7351 7 0 1 2 3 4 5 6 7 .6983 .2573 .0406 .0036 .WO2 .oooo .m .m .4783 .3720 .1240 .0230 ,0026 .0002 .oooo .0000 2097 .3670 .2753 .I 147 .0287 .0043 .o004 .m .0824 .2471 .3177 .2269 .0972 .0250 .0036 .0002 .0280 .0078 .1306 .0547 .2613 .1641 .2903 .2734 .1935 .2734 .0774 .1641 .0172 .0547 .0016 .0078 .0016 .0172 .0774 .1935 .2903 .2613 .1306 .0280 .o002 .0036 .0250 .0972 .2269 .3177 .2471 .0824 .m .o004 . w 3 .0287 .I 147 .2753 .3670 .2097 .m .m .o002 .0026 .0230 .1240 .3720 .4783 .m .oooo .m .0002 .0036 .0406 .2573 .6983 8 0 1 2 3 4 .6634 .2793 .05 15 .0054 .0004 .4305 .3826 .1488 .0331 .0046 .1678 .3355 .2936 .1468 .0459 .0576 .1977 .2965 .2541 .1361 .0168 .0039 .0896 .0312 .2090 .1094 .2787 .2187 .2322 .2734 .o007 .0079 .0413 .123Y .2322 .WO1 .oo12 .o100 .0467 .1361 .oOOo .o001 .0011 .0092 .0459 .oOOo .ooO0 .oOoO .o004 .0046 .m .m .m .m .0004 359
  • 369.
    360 BINOMIALPROBABILITIES [APPENDIX1 P n x .05 .lO .20 .30 .40 S O .60 .70 .80 .90 .95 .m .m .m .0000 .oO04 .m .m .moo .0092 ,0011 .o001 .oooo .0467 .o100 .0012 .oO01 .1239 .0413 .0079 .0007 .2187 .I094 .0312 ,0039 .2787 .2090 .0896 .O168 .2541 .2965 .1977 .0576 .1468 .2936 .3355 .1678 .0331 .1488 ,3826 .4305 .0054 .0515 .2793 .6634 9 0 1 2 3 4 5 6 7 8 9 .6302 .2985 .0629 .0077 .0006 .oooo .m .m .oOOo .m .3874 .3874 ,1722 .0446 .0074 .0008 .ooo1. .m .moo .0000 .1342 ,3020 .3020 .1762 .0661 .O165 ,0028 .o003 .0000 .oo00 .0404 .1556 ,2668 .2668 .1715 .0735 .0210 .0039 .WO4 .oooo ,0101 .0605 .1612 .2508 .2508 .1672 .0743 .0212 .0035 .oO03 .0020 .O176 .0703 ,1641 .2461 .2461 .1641 .0703 .0176 ,0020 .0003 .0035 .0212 .0743 .1672 .2508 .2508 .1612 .0605 .0101 .m .oO04 .0039 .0210 .0735 .1715 .2668 ,2668 .1556 .0404 .oooo .oooo .oO03 ,0028 .O165 .0661 .1762 .3020 .3020 ,1342 .oooo .m .m ,0001 .0008 .0074 ,0446 .1722 .3874 .3874 .0000 .0000 .m .m .m ,0006 .0077 .0629 .2985 .6302 .0282 .1211 .2335 .2668 .2001 .1029 .0368 .0090 .OO 14 .ooo1 .oo00 .0060 .0403 .1209 .2150 .2508 .2007 .1115 .0425 .OI06 .0016 .0001 .0010 .0098 .0439 .I 172 .2051 .2461 .2051 .I 172 .0439 .0098 .0010 .o001 .OO16 .0106 ,0425 .1115 .2007 .2508 .2150 .1209 .0403 .#60 .moo .WO1 DO14 .0090 .0368 .1029 .2001 .2668 .2335 .1211 .0282 .m .moo .o001 .oO08 .0055 .0264 .0881 .2013 .3020 .2684 .1074 10 0 1 2 3 4 5 6 7 8 9 10 S987 .3151 .0746 .O105 .0010 .o001 .m .0000 .0000 .m .oooo .3487 .3874 .1937 .0574 .0112 .0015 .o001 .m .moo .0000 .oooo .1074 .2684 .3020 .2013 .0881 .0264 ,0055 .0008 .ooo1 .m .oooo .oooo .oOOo .OoOo .woo .o001 .0015 .0112 .0574 .I937 .3874 .3487 .OoOo .0000 .oooo .m .m .ooo1 .0010 .O105 .0746 .3151 5987 .0036 .0266 .0887 .1774 .2365 .2207 .1471 .0701 .0234 .0052 .OOo7 .oOOo .oooo .oO07 ,0052 .0234 .0701 .1471 .2207 -2365 .1774 .0887 ,0266 .0036 .ooO0 .oooo .0005 .0037 .O173 .0566 .1321 .2201 .2568 .1998 .0932 .O198 .m .0000 .oooo .0002 .0017 .0097 .0388 .I 107 .2215 .2953 ,2362 .OM9 .oooo .m .moo .0000 .m .0000 .0001 .0014 .0137 .0867 .3293 5688 1 1 0 1 2 3 4 5 6 7 8 9 10 11 S688 .3293 .0867 .0137 .0014 .WO1 .m .OooO .0000 .oooo .m .m .3138 .3835 .2131 .0710 .0158 .0025 .0003 .m .oOOo .oooo .oOOo .m .0859 .2362 .2953 .2215 .I 107 .0388 .0097 .0017 .o002 .oooo .m .oo00 .O198 .0932 ,1998 .2568 .2201 .1321 .0566 .0173 .0037 .0005 .moo .oooo .0005 .0054 .0269 .Of306 .1611 .2256 .2256 .1611 .0806 .0269 .0054 .0005 .0000 .oooo .oooo .m .0000 .o003 .0025 .0158 .0710 .2131 .3835 .3138 12 0 1 2 3 4 5 6 7 8 9 ,5404 .3413 .0988 .O173 .0021 .o002 .m .m .oooo .OoOo .2824 ,3766 .2301 .0852 .0213 .0038 .0005 .oOOo .OoOo .oooo .0687 ,2062. .2835 .2362 .1329 .0532 .OI55 .0033 .0005 .oO01 .0138 .0712 .1678 .2397 .2311 .1585 .0792 .0291 .0078 .OO15 .0022 .O174 ,0639 .1419 .2128 .2270 .1766 .1009 .0420 .O125 .0002 ,0029 .0161 .0537 ,1208 .1934 .2256 .1934 .1208 .0537 .oooo .o003 .0025 .O125 .0420 ,1009 .1766 ,2270 .2128 .1419 .m .oooo .0002 DO15 .0078 .0291 .0792 .1585 .2311 .2397 .m .oOOo .oooo .o001 .oO05 .0033 .O155 .0532 .1329 ,2362 .oooo .oooo .oooo .oooo .OOoo .oooo .ooo5 .0038 .0213 .0852 .0000 .oooo .woo .0000 .m .OoOo .oooo .0002 .0021 .0173
  • 370.
    APPENDIX I] BINOMIALPROBABILITIES361 P n x .05 .lO .20 30 .40 S O .60 .70 .80 .90 .95 1 0 11 12 .oooo .OOoo .oooo .m .oooo .oooo .m .oOOo .m .o002 .m .OoOo .0025 . 0 1 6 1 .OOo3 .0029 .oooo .OOo2 .0639 .O174 .0022 .1678 .07 12 .0138 .2835 .2062 .0687 .2301 .3766 .2824 .0988 .34 13 5404 .5 133 .35 12 . 1 1 0 9 .02 1 4 .0028 .OOo3 .m .m .moo .m .0000 .GO00 .oOOo .m .2542 .3672 .2448 .0997 .0277 ,0055 .OOo8 *OOo1 .m .OOoO .oOOo ,0000 .oooo .OOoo .0550 .1787 .2680 .2457 .1535 .069 1 .0230 .0058 .0011 .0001 .oo00 .oOOo .m .m . 0 0 9 7 .0540 .1388 .2181 .2337 .1803 .1030 .0442 .0142 .0034 .o 00 6 .o001 .oooo .m .0013 . o 0 0 1 .0113 . 0 0 1 6 .0453 .0095 .1107 .0349 .I845 .0873 .2214 .1571 .1968 .2095 .1312 .2095 .0656 .1571 .0243 ,0873 .0065 .0349 .0012 .0095 .OOo1 .0016 .m . o 0 0 1 .m .ooo1 .0012 .0065 .0243 .0656 ,1312 .I968 .2214 ,1845 .1107 .0453 .0113 ,0013 .oOOo .OOoo .OOo1 .OOo6 .0034 .0142 0442 .I030 .1803 .2337 .2181 .1388 .0540 .0097 .m .m .m .m . o 0 0 1 .0011 .0058 .0230 . 0 6 9 1 .1535 .2457 .2680 .1787 .0550 .m .oooo .m .m .m .m .OOo1 .OOo8 .0055 .0277 .0997 .2448 .3672 .2542 .OoOo .OoOo .0000 .m .m .m .m .m .0003 .0028 .02 1 4 . I 1 0 9 .35 12 .5 133 13 0 1 2 3 4 5 6 7 8 9 1 0 11 12 13 1 4 0 1 2 3 4 5 6 7 8 9 10 11 12 13 1 4 .4877 .3593 .I229 .0259 .0037 .o004 .m .oooo .OOoO .oooo .m .m .m .moo .OOoO ,2288 .3559 ,2570 .1142 .0349 .0078 .OO13 .OOo2 .m .moo .0000 .OOoo .m .oOOo .0000 .0440 .1539 .2501 .2501 .1720 .0860 .0322 .#92 .0020 .oO03 .m .OoOo .OoOo .oooo .0000 .0068 .0 40 7 .1134 .1943 .2290 .1963 .1262 . 0 6 18 .0232 .0066 . 0 0 1 4 .o002 .m .oooo .o000 .OOo8 . o 0 0 1 ,0073 . 0 0 0 9 .0317 .0056 .0845 .0222 .1549 . 0 6 1 1 .2066 .1222 .2066 .1833 .1574 .2095 .0918 .1833 .MO8 .1222 .0136 . 0 6 1 1 .0033 .0222 .OOo5 .0056 .oO01 ,0009 .moo .o001 .0000 .OOO1 .OOo5 .oo33 .0136 9408 . 0 9 18 * 1574 .2066 .2066 .1549 .0845 .03 17 .073 .WO8 .m .m .OOoO .OOo2 .OO 1 4 . 0 0 6 6 .0232 .06 I8 .1262 .1963 .2290 .1943 .I134 .MO7 .0068 .oooo .OOOo .moo .m .0000 .o003 .oo20 .0092 .0322 .0860 .1720 .2501 .2501 .1539 .0440 .0000 .moo .m .moo .oooo .m .m .OOo2 .0013 .0078 .0349 .1142 .2570 .3559 .2288 .OoOo .m .m .m .oOOo .m .moo .m .oOOo .0004 .0037 .0259 .1229 .3593 .4877 1 5 0 1 2 3 4 5 6 7 8 9 1 0 11 12 13 1 4 15 .4633 .3658 .1348 .0307 . 0 0 4 9 .OOo6 .oOOo ‘.0000 .moo .oooo .oooo .m .moo .oooo .m .oooo .2059 .3432 .2669 .1285 .0428 .O105 .0019 .o003 .m .oooo .oooo .0000 .0000 .oooo .moo .m .0352 .1319 .2309 .250 1 .1876 .1032 .0430 .0138 .0035 . 0 0 0 7 .ooo1 .m .m .oooo .OoOo .oooo . w 7 .0305 . 0 9 1 6 .1700 .2 186 .206 1 .1472 .0811 .0348 ,0116 .0030 . 0 0 0 6 .OOo1 .oooo .0000 .m .0005 .OOOO .0047 .0005 .0219 .0032 .0634 .0139 .1268 .0417 .1859 . 0 9 1 6 .2066 .1527 .1771 .1964 .1181 .I964 .0612 .I527 .0245 . 0 9 1 6 -0074 . 0 4 1 7 .0016 .0139 .0003 .0032 .0000 .WO5 .m .oOOo .OOoo .0000 .0003 .0016 .0074 .0245 . 0 6 12 .1181 .1771 .2066 .1859 .I268 .0634 .02 1 9 .0047 .OOo5 .m .m .m .OOo1 .m .0030 .0116 .0348 .OS1 1 .1472 .206 1 .2 186 .1700 . 0 9 1 6 .0305 .0047 .m .m .oOOo .m .oOOo .0001 .WO7 .oo35 .0138 0430 .1032 .1876 .250 1 .2309 .1319 .0352 .OOoo .oooo .m .0000 .m .oOOo .moo .oo00 .o003 .0019 .O105 .0428 -1285 .2669 .3432 .2059 .m .m .moo .moo .oOOo .oooo .m .m .oOoO .0000 .w06 .0049 .0307 .1348 .3658 .4633
  • 371.
    362 BINOMIALPROBABILITIES [APPENDIX1 P n x .05 .lO .20 .30 .40 S O .60 .70 .80 .90 .95 16 0 1 2 .4401 ,3706 .1463 .0359 .0061 .o008 .oO01 .m .m .m .m . o O O O .m .OOOO .oOOo .OOOO .m .1853 ,3294 .2745 ,1423 .0514 .O137 .0028 .oO04 .0001 .oOOo .m .m .oooo .m .oooo .m .OoOo .0281 .1126 .2111 .2463 .2001 .1201 .0550 .O197 .0055 .0012 .0002 .m .m .m .m .m .m .0033 .0228 .0732 .1465 .2040 .2099 .1649 .lOlO .0487 .O185 ,0056 .0013 .WO2 .oooo .m .m .m .o003 .0030 .O150 .0468 .1014 .1623 .1983 .1889 .1417 .0840 .0392 .O142 . W O .o008 .o001 .oooo .oooo .m .o002 .0018 .0085 .0278 .0667 .1222 .1746 .1964 .1746 .1222 ,0666 .0278 ,0085 .oo18 ,0002 .moo .oooo .m .ooo1 .0008 ,0040 .O142 .0392 .0840 .1417 .1889 .1983 ,1623 .1014 .0468 .O150 .0030 ,0003 .oooo .m .m .m .o002 .0013 .0056 .O185 .0487 .1010 ,1649 .2099 .2040 .1465 ,0732 .0228 .0033 .m .0000 .0000 .m .m .m .o002 .0012 .0055 .O197 .OS0 .1201 .2001 .2463 .2111 .1126 .0281 .m .oOOo .m .m .m .o000 .0000 .oooo .OOo1 .0004 .0028 .O137 .0514 .1423 .2745 .3294 .1853 .oooo .oooo .oooo .0000 .m .m .m .m .m .m .0001 .0008 .0061 ,0359 .1463 .3706 .4401 3 4 5 6 7 8 9 10 11 12 13 14 15 16 .1668 .3150 .2800 .1556 ,0605 .O175 .0039 .OOO7 .o001 .m .moo .0000 .oOOo .oooo . o O O O .OOOO .oooo .m .0225 .0957 .1914 ,2393 .2093 .1361 .0680 .0267 .0084 .0021 .0004 .o001 .m .m .oOOo .oooo .m .oOOo .0023 .O169 .058 1 .1245 ,1868 .2081 .1794 .1201 ,0644 .0276 .0095 .0026 .0006 .o001 .m .oooo .moo .m .WO2 .0019 .o102 .0341 ,0796 .1379 .1839 .1927 .1606 .1070 .0571 .0242 .0081 .002I .0004 ,0001 .o000 .oOOo .m .ooo1 .0010 .0052 .O182 .@I72 .0944 .1484 .1855 .1855 .1484 .0944 .0472 .0182 ,0052 .0010 .o001 .moo .m .oooo .o001 .OOO4 .0021 .0081 .0242 .0571 .1070 ,1606 ,1927. .1839 .1379 .0796 .0341- .o102 .OO19 .WO2 .0000 .m .moo .0000 .o001 .o006 .0026 .0095 .0276 .OM4 .1201 .1784 .2081 .1868 .1245 .0581 .O169 DO23 .0000 .0000 .m .moo .0000 .m .o001 .0004 .0021 .0084 .0267 ,0680 .1361 .2093 .2393 .1914 .0957 .0225 .oooo .m .oooo .oooo .oooo .oooo .OOOo .0000 .m .o001 .o007 .0039 ,0175 .0605 .1556 .2800 .3150 ,1668 .m .oooo .oooo, .m .moo .oooo .oooo .m .m .oooo .m .0001 .0010 .0076 .0415 ,1575 .3741 ,4181 17 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 .4181 .3741 .1575 .0415 .0076 .0010 .o001 .oooo .oooo .m .m .m .m .oooo .oOOO .m .oooo .0000 18 0 1 2 3 4 5 6 7 8 9 10 11 12 .3972 .3763 .1683 .0473 .0093 .0014 .o002 .m .woo .oooo .0000 .m .o000 .1501 ,3002 .2835 .1680 ,0700 .0218 .0052 .oo10 .0002 .m .oooo .moo .m .0180 .081I .1723 .2297 .2153 .1507 .OS16 -0350 .o120 .0033 .oO08 ,000 1 .0000 .OO16 .O126 -0458 .1046 .1681 .2017 .1873 .1376 .OS11 .0386 .O149 . W 6 .oo12 .o001 .oo12 ,0069 .0246 .0614 .1146 .1655 .1892 .1734 .1284 ,0771 ,0374 .O145 .m .ooo1 .0006 .0031 .0117 .0327 ,0708 ,1214 .1669 .1855 .1669 .1214 .0708 .oooo .oooo .woo .o002 .oo11 .0045 .O145 .0374 .077f .1284 .1734 .1892 .1655 .m .m .oooo .oooo .0000 .o002 .0012 .0046 .O149 .0386 .OS11 .1376 ,1873 .oOOo .oooo .0000 .oooo .0000 .oooo .m .o001 .0008 .0033 .o120 .0350 .0816 .m .oooo .m .moo .0000 .oooo .m .0000 .oooo .m .o002 .0010 .0052 .moo .oooo .0000 .0000 .m .0000 .oOOo .m .OOoo .moo .oooo .0000 .oO02
  • 372.
    APPENDIX 11 BINOMIALPROBABILITIES 363 P - .20 ~ ~~~ .30 .40 S O .60 .70 .80 .90 .95 n x .05 .I0 .0000 .OoOo .oooo .oooo .oooo .moo .0002 .0045 .0327 .1146 .oooO .0011 .0117 .0614 .OOOO .0002 .0031 .0246 .OOOO .OOOO .0006 .0069 .moo .moo .OoO1 .0012 .oooo .oooo .moo .o001 .2017 .I681 .1046 0 4 5 8 .O126 .0016 .I507 .2153 .2297 .1723 .Of311 .0180 -02I8 .0700 .1680 .2835 .3002 .1501 .OO14 .m93 .0473 .I683 .3763 .3972 13 14 15 16 17 18 .oooo .oooo .moo .oooo .m .moo .oooo .oooo .oooo .oooo .oooo .oooo .3774 .3774 .I787 .0533 .0112 .0018 .o002 .m .oooo .oOoO .oooo .oooo .oooo .moo .moo .0000 .OoOo .oooo .oooo .oooo .1351 .2852 .2852 .1796 .0798 .0266 .0069 .OO14 .OoO2 .m .oooo .oooo .oooo .oooo ,0000 .oooo .oooo .o000 .m .OoOo .O 144 .0685 .I540 .2182 .2182 .I636 .0955 ,0443 .O166 ,0051 .0013 .0003 .oooo .oooo .oooo .oooo .oooo .m .moo .oooo .0011 ,0093 .0358 .0869 .I491 .I916 ,1916 .I525 .0981 .0514 .0220 .oo77 .0022 .0005 .ooo1 .oooo .oooo .0000 .oooo .OoOo .ooo1 .0008 .0046 .O175 ,0467 ,0933 .1451 .I797 .1797 .1464 .0976 .0532 .0237 .0085 .0024 .0005 .ooo1 .moo .oooo .oooo .oooo .oooo .WO3 .OOI 8 .0074 .0222 .05 18 .0961 .I442 .I762 .I762 .1442 .0961 .05 18 .0222 .0074 .0018 .0003 .m .OoOo .moo .oooo .oooo .oooI .0005 .0024 .0085 .0237 .0532 .0976 .I464 .I797 .I797 .1451 .0933 .0467 .0175 .0046 ,0008 .ooo1 .oooo .oooo .oooo .oOOo .OOOO .o001 .OOO5 .0022 .0077 .0220 .0514 .0981 .I525 .I916 .I916 .I491 .0869 .0358 .0093 .0011 .moo .0000 .OoOo .m .oooo .moo .oOOo .OoOo .0003 .0013 .0051 .O166 .0443 .0955 .I636 .2182 .2182 .1540 .0685 .O144 ,0000 .oooo .oooo .m .m .m .m .m .m .oOoO .oOoO .WO2 .OO14 .0069 .0266 .0798 .1796 .2852 .2852 .I351 .oooo .m .m .oooo .oooo .oooo .moo .oooo .OoOo .oOOo .oooo .m .m .w02 .OOI8 .0112 .0533 .1787 .3774 .3774 19 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0 1 2 3 4 5 6 7 8 9 10 I 1 12 13 14 15 16 17 18 19 20 .3585 .3774 .I887 .0596 .OI33 .0022 .0003 .m .m .m .OoOo .oooo .oooo .moo .m .oooo .oooo .oooo .oooo .oooo .moo .I216 ,2702 ,2852 .1901 .0898 .0319 .0089 ,0020 .oO04 .ooo1 ,0000 .oooo .oooo .oooo ,0000 .oooo .0000 ,0000 .oooo .oooo .oooo .0115 .0576 .I369 .2054 .2182 .I746 .1091 .0545 .0222 .0074 .0020 .0005 .ooo1 .oooo .m .oooo .oooo .oooo .oooo .oooo .oooo .0008 .0068 .0278 .07 16 .I304 .I789 .1916 .I643 .1144 .0654 .0308 .0120 .0039 .ooI0 .0002 .0000 .moo .m .oooo .OoOo .OoOo .oooo .0005 .0031 .O123 .0350 .0746 ,1244 .I659 .I797 .1597 .I 171 .0710 .0355 .OI46 .0049 .0013 .0003 .oOOo .oooo .oOoO .m .oooo .oooo .0002 .0011 9046 .O148 ,0370 .0739 .1201 .I602 .1762 .1602 .I201 ,0739 .0370 .0148 .OM6 .0011 .OOO2 .m .o000 .0000 .OoOo .oooo .oooo .0003 .0013 ,0049 .O146 .0355 .0710 .I 171 .I597 .I797 .I659 .I 244 .0746 .0350 .0123 .0031 .o005 .oooo .oooo .oooo .oooo .oooo .m .OOoo .WO2 .0010 .0039 .o120 .0308 .0654 .I 144 ,1643 .1916 .1789 .I304 .0716 .0278 .0068 .0008 .oooo .oooo .OoOo .m .m .oooo .oooo .oooo .ooo1 .OOO5 .0020 .0074 .0222 .0545 .I091 .1746 .2182 ,2054 .I369 ,0576 .0115 .oooo .oooo .oooo .oOOo .m .oooo .oooo .oooo .oooo .moo .oooo .oO01 . m 4 .0020 .0089 .0319 .OS98 .1901 ,2852 .2702 .I216 .oooo .oooo .oooo .oooo .oooo .oooo .oooo .m .oom .oooo .m .m .oooo .moo .0003 .0022 .0133 .0596 .1887 .3774 .3585
  • 373.
    Appendix 2 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.o 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.O - Areas Underthe Standard Normal Curve from 0 to Z .oo .o1 .02 .03 .04 .05 .oooo .0398 ,0793 .1179 .1554 .1915 .2257 ,2580 .2881 .3159 .3413 .3643 .3849 .4032 .4192 .4332 .4452 .4554 .4641 .4713 .4772 .4821 .4861 .4893 .4918 .4938 .4953 .4965 .4974 .498I .4987 .0040 ,0438 .0832 .1217 .1591 .1950 .2291 .2611 .2910 .3438 .3665 .3869 .4049 .4207 .4345 .4463 ,4564 ,4649 .4719 .4778 .4826 .4864 .4896 .4920 .4940 .4955 .4966 .4975 .4982 .4987 .31a6 .0080 .0478 .0871 .1255 .1628 .1985 .2324 .2642 .2939 .3212 .3461 .3686 .3888 .4066 .4222 .4357 .4474 .4573 .4656 .4726 .4783 .4830 .4868 .4898 .4922 .4941 .4956 4967 .4976 .4982 .4987 .o120 .0517 .0910 .1293 .1664 .2019 .2357 .2673 .2967 ,3238 .3485 .3708 .3907 .4082 .4236 .4370 .4484 .4582 ,4664 .4732 .4788 .4834 .4871 .4901 .4925 .4943 .4957 .4968 .4977 .4983 .4998 .O160 .OS7 .0948 .1331 .1700 .2054 . n a g .2704 .2995 .3264 .3508 .3729 .3925 .4099 .4251 .4382 .4495 .4591 .4671 .4738 .4793 .4838 .4875 .4904 .4927 .4945 .4959 .4969 .4977 .4984 .4988 .O199 .0596 .0987 .1368 .1736 .2088 .2422 .2734 ,3023 .3289 .3531 .3749 .3944 .4115 .4265 .4394 .4505 .4599 .4678 .4744 .4798 .4842 .4878 .4906 .4929 .4946 .4960 .4970 .4978 .4994 .4989 .06 .0239 .0636 .1026 .1406 .1772 .2123 .2454 .2764 .3051 .3315 .3554 .3770 .3962 .4131 .4279 .4406 .4515 .4608 .4686 .4750 .4803 .4846 .4881 .4909 .4931 .4948 .4961 .4971 .4979 .4985 .4989 .07 .OS .0279 -0675 .1064 .1443 .1808 .2157 .2486 ,2794 .3078 .3340 .3577 .3790 .3980 .4147 .4292 .4418 .4525 .46I6 ,4693 .4756 .4808 .4850 .4884 ,4911 .4932 .4949 .4962 .4972 .4979 .4985 .4989 .0319 .0714 .1103 .1480 .1844 .2190 .2517 .2823 .3106 .3365 .3599 .3810 .3997 .4162 ,4306 .4429 .4535 .4625 .4699 .4761 .4812 .4854 .4887 .4913 .4934 .4951 .4963 .4973 .4980 .4986 .4990 .09 .0359 .0753 .1141 .1517 .1879 .2224 .2549 .2852 .3133 .3389 .3621 .3830 .4015 .4177 .4319 .4441 .4545 .4633 .4706 ,4767 .4817 .4857 .4890 .4916 .4936 .4952 .4964 .4974 .4981 .4986 .4990
  • 374.
    Appendix 3 The entriesin the table are the critical values of t for the specified number of degrees of freedom and areas in the right tail. df 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 Areas in the Right Tail under the t Distribution Curve .01 .05 .025 .01 .005 ,001 3.078 1.886 1.638 1.533 1.476 1.440 1.415 1.397 1.385 1.372 1.363 1.356 1.350 1.345 1.341 1.337 1.333 1.330 1.328 1.325 1.323 1.321 1.319 1.318 1.316 1.315 1.314 1.313 1.311 1.310 6.314 2.920 2.353 2.132 2.015 1.943 1.895 1.860 1.833 1.812 1.796 1.782 1.771 1.761 1.753 1.746 1.740 1.734 1.729 1.725 1.721 1.717 1.714 1.711 1.708 1.706 1.703 1.701 1.699 1.697 12.706 4.303 3.182 2.776 2.571 2.447 2.365 2.306 2.262 2.228 2.201 2.179 2.160 2.145 2.131 2.120 2.110 2.101 2.093 2.086 2.080 2.074 2.069 2.064 2.060 2.056 2.052 2.048 2.045 2.042 31.821 6.965 4.541 3.747 3.365 3.143 2.998 2.896 2.821 2.764 2.718 2.681 2.650 2.624 2.602 2.583 2.567 2.552 2.539 2.528 2.518 2.508 2.500 2.492 2.485 2.479 2.473 2.467 2.462 2.457 63.657 9.925 5.841 4.604 4.032 3.707 3.499 3.355 3.250 3.169 3.106 3.055 3.012 2.977 2.947 2.921 2.898 2.878 2.861 2.845 2.831 2.819 2.807 2.797 2.787 2.779 2.771 2.763 2.756 2.750 318.309 22.327 10.215 7.173 5.893 5.208 4.785 4.501 4.297 4.144 4.025 3.930 3.852 3.787 3.733 3.686 3.646 3.610 3.579 3.552 3.527 3.505 3.485 3.467 3.450 3.435 3.421 3.408 3.396 3.385
  • 375.
    Appendix 4 The entriesin the table are the critical values of x2for the specifieddegrees of freedom and areas in the right tail. df 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 40 50 60 70 80 ~ ~~~ ~~~ Area in the Right Tail under the Chi-squareDistribution Curve .995 .990 .975 .950 .900 ,100 .OS0 ,025 .010 .005 0.ooo 0.010 0.072 0.207 0.412 0.676 0.989 1.344 1.735 2.156 2.603 3.074 3.565 4.075 4.601 5.142 5.697 6.265 6.844 7.434 8.034 8.643 9.260 9.886 10.520 11.160 I 1.808 12.461 13.121 13.787 20.707 27.991 35.534 43.215 51.172 0.000 0.020 0.115 0.297 0.554 0.872 1.239 1.646 2.088 2.558 3.053 3.571 4.107 4.660 5.229 5.812 6.408 7.015 7.633 8.260 8.897 9.542 10.196 10.856 11.524 12.198 12.879 13.565 14.256 14.953 22.164 29.707 37.485 45.442 53.540 0.001 0.051 0.216 0.484 0.831 1.237 1.690 2.180 2.700 3.247 3.816 4.404 5.009 5.629 6.262 6.908 7.564 8.231 8.907 9.591 10.283 10.982 1I .689 12.401 13.120 13.844 14.573 15.308 16.047 16.791 24.433 32.357 40.482 48.758 57.153 0.004 0.103 0.352 0.711 1.145 I .635 2.167 2.733 3.325 3.940 4.575 5.226 5.892 6.571 7.261 7.962 8.672 9.390 10.117 10.851 11.591 2.338 3.091 3.848 4.611 5.379 16.151 16.928 17.708 18.493 26.509 34.764 43.188 51.739 60.39I 0.016 0.21I 0.584 1.064 1.610 2.204 2.833 3.490 4.168 4.865 5.578 6.304 7.042 7.790 8.547 9.312 10.085 10.865 11651 12.443 13.240 14.041 14.848 15.659 16.473 17.292 18.114 18.939 19.768 20.599 29.051 46.459 55.329 64.278 37.689 2.706 4.605 6.251 7.779 9.236 10.645 12.017 13.362 14.684 15.987 17.275 18.549 19.812 21.064 22.307 23.542 24.769 25.989 27.204 28.412 29.615 30.813 32.007 33.196 34.382 35.563 36.741 37.916 39.087 40.256 51.805 63.167 74.397 85.527 96.578 3.841 5.991 7.815 9.488 11.070 12.592 14.067 15.507 16.919 18.307 19.675 21.026 22.362 23.685 24.996 26.296 27.587 28.869 30.144 31.410 32.671 33.924 35.172 36.415 37.652 38.885 40.113 41.337 42.557 43.773 55.758 67.505 79.082 90.531 101279 5.024 7.378 9.348 11.143 12.833 14.449 16.013 17.535 19.023 20.483 2 1.920 23.337 24.736 26.119 27.488 28.845 30.191 31S26 32.852 34.170 35.479 36.781 38.076 39.364 40.646 41.923 43.195 44.461 45.722 46.979 59.342 71.420 83.298 95.023 106.629 6.635 7.879 9.210 10.597 1.345 12.838 3.277 14.860 5.086 16.750 6.812 18.548 8.475 20.278 20.090 21.666 23.209 24.725 26.217 27.688 29.141 30.578 32.000 33.409 34.805 36.191 37.566 38.932 40.289 41638 42.980 44.314 45.642 46.963 48.278 49.588 50.892 63.691 76.154 88.379 100.425 112.329 2 1.955 23.589 25.188 26.757 28.300 29.819 31.319 32.801 34.267 35.718 37.156 38.582 39.997 41.401 42.796 44.181 45.559 46.928 48.290 49.645 50.993 52.336 53.672 66.766 79.490 91.952 104.215 116.321
  • 376.
    Appendix 5 The areain the right tail under the F distributioncurve is equal to 0.01. d f ? 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 30 40 50 100 - 1 4052 98.50 34.12 21.20 16.26 13.75 12.25 11.26 10.56 10.04 9.65 9.33 9.07 8.86 8.68 8.53 8.40 8.29 8.18 8.10 8.02 7.95 7.88 7.82 7.77 7.56 7.31 7.17 6.90 - 2 5000 99.00 30.82 18.00 13.27 10.92 9.55 8.65 8.02 7.56 7.21 6.93 6.70 6.51 6.36 6.23 6.11 6.01 5.93 5.85 5.78 5.72 5.66 5.61 5.57 5.39 5.18 5.06 4.82 - - 3 5403 99.17 29.46 16.69 12.06 9.78 8.45 7.59 6.99 6.55 6.22 5.95 5.74 5.56 5.42 5.29 5.18 5.09 5.01 4.94 4.87 4.82 4.76 4.72 4.68 4.5I 4.31 4.20 3.98 - 4 5625 99.25 28.71 15.98 11.39 9.15 7.85 7.01 6.42 5.99 5.67 5.41 5.21 5.04 4.89 4.77 4.67 4.58 4.50 4.43 4.37 4.31 4.26 4.22 4.18 4.02 3.83 3.72 3.51 - 5 5764 99.30 28.24 15.52 10.97 8.75 7.46 6.63 6.06 5.64 5.32 5.06 4.86 4.69 4.56 4.44 4.34 4.25 4.17 4.10 4.04 3.99 3.94 3.90 3.85 3.70 3.51 3.41 3.21 - 6 5859 99.33 27.91 15.21 10.67 8.47 7.19 6.37 5.80 5.39 5.07 4.82 4.62 4.46 4.32 4.20 4.10 4.01 3.94 3.87 3.81 3.76 3.71 3.67 3.63 3.47 3.29 3.19 2.99 - - d 7 5928 99.36 27.67 14.98 10.46 8.26 6.99 6.18 5.61 5.20 4.89 4.64 4.44 4.28 4.14 4.03 3.93 3.84 3.77 3.70 3.64 3.59 3.54 3.50 3.46 33 0 3.12 3.02 2.82 - I 8 5981 99.37 27.49 14.80 10.29 8.10 6.84 6.03 5.47 5.06 4.74 4.50 4.30 4.14 4.00 3.89 3.79 3.71 3.63 3.56 3.51 3.45 3.41 3.36 3.32 3.17 2.99 2.89 2.69 - - 9 6022 99.39 27.35 14.66 10.16 7.98 6.72 5.91 5.35 4.94 4.63 4.39 4.19 4.03 3.89 3.78 3.68 3.60 3.52 3.46 3.40 3.35 3.30 3.26 3.22 3.07 2.89 2.78 2.59 - - 10 6056 99.40 27.23 14.55 10.05 7.87 6.62 5.81 5.26 4.85 4.54 4.30 4.10 3.94 3.80 3.69 3.59 3.51 3.43 3.37 3.31 3.26 3.21 3.17 3.13 2.98 2.80 2.70 2.50 - 11 6083 99.41 27.13 14.45 9.96 7.79 6.54 5.73 5.18 4.77 4.46 4.22 4.02 3.86 3.73 3.62 3.52 3.43 3.36 3.29 3.24 3.18 3.14 3.09 3.06 2.91 2.73 2.63 2.43 - 12 6106 99.42 27.05 14.37 9.89 7.72 6.47 5.67 5.11 4.71 4.40 4.16 3.96 3.80 3.67 3.55 3.46 3.37 3.30 3.23 3.17 3.12 3.07 3.03 2.99 2.84 2.66 2.56 2.37 - 15 6157 99.43 26.87 14.20 9.72 7.56 6.31 5.52 4.96 4.56 4.25 4.01 3.82 3.66 3.52 3.41 3.31 3.23 3.15 3.09 3.03 2.98 2.93 2.89 2.85 2.70 2.52 2.42 2.22 - - 20 6209 99.45 26.69 14.02 9.55 7.40 6.16 5.36 4.81 4.41 4.10 3.86 3.66 3.51 3.37 3.26 3.16 3.08 3.OO 2.94 2.88 2.83 2.78 2.74 2.70 2.55 2.37 2.27 2.07 -
  • 377.
    368 AREA INRIGHT TAIL UNDER F DISTRIBUTION CURVE [APPENDIX 5 The area in the right tail under the F distributioncurve is equal to 0.05. - - clfi 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 30 40 50 100 - 1 161.5 18.51 10.13 7.71 6.61 5.99 5.59 5.32 5.12 4.96 4.84 4.75 4.67 4.60 4.54 4.49 4.45 4.41 4.38 4.35 4.32 4.30 4.28 4.26 4.24 4.17 4.08 4.03 3.94 - 2 199.5 19.00 9.55 6.94 5.79 5.14 4.74 4.46 4.26 4.10 3.98 3.89 3.81 3.74 3.68 3.63 3.59 3.55 3.52 3.49 3.47 3.44 3.42 3.40 3.39 3.32 3.23 3.18 3.09 - 3 215.7 19.16 9.28 6.59 5.41 4.76 4.35 4.07 3.86 3.71 3.59 3.49 3.41 3.34 3.29 3.24 3.20 3.16 3.13 3.10 3.07 3.05 3.03 3.01 2.99 2.92 2.84 2.79 2.70 4 224.6 19.25 9.12 6.39 5.19 4.53 4.12 3.84 3.63 3.48 3.36 3.26 3.18 3.1I 3.06 3.01 2.96 2.93 2.90 2.87 2.84 2.82 2.80 2.78 2.76 2.69 2.61 2.56 2.46 5 230.2 19.30 9.01 6.26 5.05 4.39 3.97 3.69 3.48 3.33 3-20 3.11 3.03 2.96 2.90 2.85 2.81 2.77 2.74 2.71 2.68 2.66 2.64 2.62 2.60 2.53 2.45 2.40 2.31 6 234.0 19.33 8.94 6.16 4.95 4.28 3.87 3.58 3.37 3.22 3.09 3.oo 2.92 2.85 2.79 2.74 2.70 2.66 2.63 2.60 2.57 2.55 2.53 2.51 2.49 2.42 2.34 2.29 2.19 - d 7 236.8 19.35 8.89 6.09 4.88 4.21 3.79 3.50 3.29 3.14 3.01 2.91 2.83 2.76 2.71 2.66 2.61 2.58 2.54 2.51 2.49 2.46 2.44 2.42 2.40 2.33 2.25 2.20 2.10 - L 8 238.9 19.37 8.85 6.04 4.82 4.15 3.73 3.44 3.23 3.07 2.95 2.85 2.77 2.70 2.64 2.59 2.55 2.51 2.48 2.45 2.42 2.40 2.37 2.36 2.34 2.27 2.18 2.13 2.03 - 9 240.5 19.38 8.81 6.00 4.77 4.10 3.68 3.39 3.18 3.02 2.90 2.80 2.71 2.65 2.59 2.54 2.49 2.46 2.42 2.39 2.37 2.34 2.32 2.30 2.28 2.21 2.12 2.07 1.97 - 10 241.9 19.40 8.79 5.96 4.74 4.06 3.64 3.35 3.14 2.98 2.85 2.75 2.67 2.60 2.54 2.49 2.45 2.41 2.38 2.35 2.32 2.30 2.27 2.25 2.24 2.16 2.08 2.03 1.93 - 11 243.O 19.40 8.76 5.94 4.70 4.03 3.60 3.31 3.10 2.94 2.82 2.72 2.63 2.57 2.51 2.46 2.41 2.37 2.34 2.31 2.28 2.26 2.24 2.22 2.20 2.13 2.04 1.99 1.89 - 12 243.9 19.41 8.74 5.91 4.68 4.00 3.57 3.28 3.07 2.91 2.79 2.69 2.60 2.53 2.48 2.42 2.38 2.34 2.31 2.28 2.25 2.23 2.20 2.18 2.16 2.09 2.00 1.95 1.85 15 246.0 19.43 8.70 5.86 4.62 3.94 3.51 3.22 3.01 2.85 2.72 2.62 2.53 2.46 2.40 2.35 2.31 2.27 2.23 2.20 2.18 2.15 2.13 2.16 2.09 2.01 1.92 1.87 1.77 - 20 248.0 19.45 8.66 5.80 4.56 3.87 3.44 3.15 2.94 2.77 2.65 2.54 2.46 2.39 2.33 2.28 2.23 2.19 2.16 2.12 2.10 2.07 2.05 2.03 2.01 1.93 1.84 1.78 1.68
  • 378.
    Index Addition rule forthe union of two events, 72 Adjacent values, 57 Alternative hypothesis, 185 I I 1 I I 1 I I I 1 3ar graph, 15 3ayes rule, 72 3ell-shaped histogram, 20,4 1 3etween samples variation, 275 3etween treatments mean square, 276 3imoda1,trimodal, 4 1 3inomial experiment, 93 3inomial probability formula, 93 3inomial random variable, 93 3ox-and-whisker plot, 50 Census, 140 Central limit theorem for the sample proportion, Central limit theorem, 146 Chebyshev's theorem, 46 Chi-square distribution, 249 Chi-square independence test, 253 Chi-square tables, 250 Class mark, 17 Class midpoint, 17 Class width, 17 Classes, 17 Cluster sampling, 142 Coefficient of determination, 316 Coefficient of variation, 47 Collectively exhaustive events, 73 Combinations, 74 Complementary events, 70 Compound event, 64 Computer selected simple random samples, 141 Conditional probability, 68 Confidence interval, 166 Confidence interval for: 150 mean response in linear regression, 318 population variance, 259 difference in means, 212 difference in population proportions, 231 mean difference of dependent samples, 226 mean, large sample, 167 mean, small sample, 171 population proportion, large sample, 173 slope of the population regression line, 317 Confidence level, 166 Contingency tables, 254 Continuous data, 14, Continuous random variable, 89, 113 Continuous variable, 3 Correlation coefficient, 319 Counting rule, 63 Critical values, 186 Cumulative frequency distribution, 21 Cumulative percentages, 21 Cumulative relative frequency, 2I Data set, 2 Degrees of freedom for denominator, F distribution, 272 for error, one-way ANOVA, 277 for numerator, F distribution, 272 for total, one-way ANOVA, 277 for treatments, one-way ANOVA, 277 for t distribution, 169 Dependent events, 69 Dependent variable, 310 Descriptive Statistics, 1 Deviations from the mean, 43 Discrete data, 14 Discrete random variable, 89 Discrete variable, 3 Empirical rule, 46 Error mean square, 276 Error sum of squares, 276,313 Error term, 310 Estimated regression line, 313 Estimated standard error of the difference in sample means, 213,215 Estimation, 166 Estimation of mean difference for dependent samples, 226 Estimation of the difference in means, large independent samples, 212 Estimation of the difference in means, small samples, 215, 221 Estimation of the difference in population proportions, 231 Event, 64 Expected frequencies, 251 Experiment, 63 Explained variation, 315 Exponential probability distribution, 127 F distribution, 272 F table, 273 Factor sum of squares, 285 Factorial design, 282 Finite population correction factor, 145 Frequency distribution, 14 Goodness-of-fit test, 250 Grouped data, 40 369
  • 379.
    370 INDEX Histogram, 20 Hypergeometricprobability formula, 99 Independent events, 69 Independent samples, 2 1I Independent variable, 310 Inferential Statistics, I Interaction, 282 Interaction plot, 283 Interaction sum of squares, 285 Interquartile range, 49 Intersection of events, 70 Interval estimate, 166 Interval level of measurement, 5 Joint probability, 68 Kruskal-Wallis test, 340-342 Least squares line, 3 12 Left-tailed test, 185 Level of significance, I87 Levels of a factor, 285 Levels of measurement, 4 Line of best fit, 313 Linear regression model, 310 Lower boundary, 17 Lower class limit, 17 Lower inner fence, 57 Lower outer fence, 57 Main effects, 282 Mann-Whitney U test, 339 Marginal probability, 68 Maximum error of estimate, 168 Mean, 40 Mean and standard deviation for the uniform Mean and standard deviation of the sample Mean and variance of a binomial random variable, Mean for grouped data, 45 Mean of a discrete random variable, 91 Mean of the sample mean, 144 Measures of central tendency, 40 Measures of dispersion, 42 Measures of position, 48 Median class, 45 Median for grouped data, 45 Median, 4 1 Modal class, 45 Mode for grouped data, 45 Mode, 41 Modified boxplot, 50 distribution, 114 proportion, I49 96 Mu1tinomial probability distribution, 250 Multiplication rule for the intersection of two Mutually exclusive events, 69 events, 71 Nominal level of measurement, 4 Nonparametric methods, 334 Nonrejection region, 186 Normal approximation to the binomial, 125- 126 Normal curve pdf, 117 Normal probability distribution, 1 15 Null hypothesis, 185 Observation, 2 Observed frequencies, 251 Ogive, 21 One-tailed test, 185 One-way ANOVA table, 279 Operating characteristic curve, 194 Ordinal level of measurement, 5 Outcomes, 63 Paired or matched samples, 224 Parameter, 148 Pearson correlation coefficient, 320 Percentage, 15 Percentile for an observation, 48 Percentiles, deciles, and quartiles, 48 Permutations, 74 Pie chart, 16 Point estimate, 166 Poisson probability formula, 97 Population, 1 Population correlation coefficient, 320 Population proportion, 148 Population Spearman rank correlation coefficient, Population standard deviation, 42 Possible outliers, 58 Prediction interval when predicting a single Prediction line, 313 Probability, 65 Probability density function (pdf), 113 ProbabiIity distribution, 90 Probability, classical definition, 65 Probability, relative frequency definition, 67 Probability, subjective definition, 67 Probable outliers, 58 Pth percentile, 48 P value, 196-198 344 observation, 3I9 Qualitative data, 14 Qualitative variable, 4 Quantitative data, 14 Quantitative variable, 3 Random number tables, 140
  • 380.
    INDEX 371 Random variable,89 Range for grouped data, 45 Range, 42 Rank, 334 Ratio level of measurement, 6 Raw data, 14 Regression sum of squares, 3 15 Rejection regions, 186 Relative frequency, I5 Research hypothesis, I85 Residuals, 314 Response variable, 282 Right-tailed test, I85 Runs test for randomness, 344 Sample proportion, 148 Sample size determination for estimating the mean, Sample size determination for estimating the Sample space, 63 Sample standard deviation. 43 Sample, 1 Sampling distribution of the sample mean, 142 Sampling distribution of the sample proportion, Sampling distribution of the sample variance, 257 Sampling error, 144 Scales of measurement. 4 Scatter plot, 304, Shortcut formulas for computing variances, 43 Sign test, 334-336 Signed rank, 337 Simple event, 64 Simple random sampling, 140 Single-valued class. I9 Skewed to the left histogram, 20 Skewed to the right histogram, 20 Spearman rank correlation coefficient. 342 Standard deviation. 42 Standard deviation of a discrete random variable, Standard deviation of errors, 3I5 Standard error of the mean, 144 Standard error of the proportion, 149 Standard normal distribution table, I 17 Standardizing a normal distribution, I20 Statistic, 148 Statistical software packages, 7 Statistics. 1 Stem-and-leaf display, 22 174 population proportion, I75 148 92 Stratified sampling, 142 Sum of squares of the deviations, 43 Sum of the deviations, 43 Sum of the squares and sum squared, 43 Summation notation, 6 Synimetric histogram, 20 Systematic random sampling, I42 T distribution, 169 Table of binomial probabilities, 95 Test statistic, 186 Testing a hypothesis concerning: difference in means, large samples, 213 difference in means, small samples, 216, 223 difference in population proportions, 232 mean difference for dependent samples, 228 main effects and interaction, 299 population correlation coefficient, 32I population mean, large sample, 19I population mean, small sample, 199 population proportion, large sample, 200 population variance, 260 Total sum of squares, 277 Treatment sum of squares, 276 Tree diagram, 63 Two-tailed test, 185 Two-way ANOVA table, 286 Type I and Type I1 errors, 187 Type I1 errors. calculating, 193 Ungrouped data, 40 Uniform or rectangular histogram, 20 Uniform probability distribution, 113 Union of events, 71 Upper boundary, 17 Upper class limit, 17 Upper inner fence, 57 Upper outer fence, 57 Variable, 2 Variance for grouped data, 46 Variance of a discrete random variable, 92 Variance, 42 Venn diagram, 64 Wilcoxon rank-sum test, 338-340 Wilcoxon signed-rank test, 337-338 Within samples variation, 275 Z score, 47