Statistics for the Sciences
Charles Peters
Contents
1 Background
1.1 Populations, Samples and Variables
1.2 Types of Variables
1.3 Random Experiments and Sample Spaces
1.4 Computing in Statistics
1.5 Exercises
2 Descriptive and Graphical Statistics
2.1 Location Measures
2.1.1 The Mean
2.1.2 The Median and Other Quantiles
2.1.3 Trimmed Means
2.1.4 Grouped Data
2.1.5 Histograms
2.1.6 Robustness
2.1.7 The Five Number Summary
2.1.8 The Mode
2.1.9 Exercises
2.2 Measures of Variability or Scale
2.2.1 The Variance and Standard Deviation
2.2.2 The Coefficient of Variation
2.2.3 The Mean and Median Absolute Deviation
2.2.4 The Interquartile Range
2.2.5 Boxplots
2.2.6 Exercises
2.3 Jointly Distributed Variables
2.3.1 Side by Side Boxplots
2.3.2 Scatterplots
2.3.3 Covariance and Correlation
2.3.4 Exercises
3 Probability
3.1 Basic Definitions. Equally Likely Outcomes
3.2 Combinations of Events
3.2.1 Exercises
3.3 Rules for Probability Measures
3.4 Counting Outcomes. Sampling with and without Replacement
3.4.1 Exercises
3.5 Conditional Probability
3.5.1 Relating Conditional and Unconditional Probabilities
3.5.2 Bayes’ Rule
3.6 Independent Events
3.6.1 Exercises
3.7 Replications of a Random Experiment
4 Discrete Distributions
4.1 Random Variables
4.2 Discrete Random Variables
4.3 Expected Values
4.3.1 Exercises
4.4 Bernoulli Random Variables
4.4.1 The Mean and Variance of a Bernoulli Variable
4.5 Binomial Random Variables
4.5.1 The Mean and Variance of a Binomial Distribution
4.5.2 Exercises
4.6 Hypergeometric Distributions
4.6.1 The Mean and Variance of a Hypergeometric Distribution
4.7 Poisson Distributions
4.7.1 The Mean and Variance of a Poisson Distribution
4.7.2 Exercises
4.8 Jointly Distributed Variables
4.8.1 Covariance and Correlation
4.9 Multinomial Distributions
4.9.1 Exercises
5 Continuous Distributions
5.1 Density Functions
5.2 Expected Values and Quantiles for Continuous Distributions
5.2.1 Expected Values
5.2.2 Quantiles
5.2.3 Exercises
5.3 Uniform Distributions
5.4 Exponential Distributions and Their Relatives
5.4.1 Exponential Distributions
5.4.2 Gamma Distributions
5.4.3 Weibull Distributions
5.4.4 Exercises
5.5 Normal Distributions
5.5.1 Tables of the Standard Normal Distribution
5.5.2 Other Normal Distributions
5.5.3 The Normal Approximation to the Binomial Distribution
5.5.4 Exercises
6 Joint Distributions and Sampling Distributions
6.1 Introduction
6.2 Jointly Distributed Continuous Variables
6.2.1 Mixed Joint Distributions
6.2.2 Covariance and Correlation
6.2.3 Bivariate Normal Distributions
6.3 Independent Random Variables
6.3.1 Exercises
6.4 Sums of Random Variables
6.4.1 Simulating Random Samples
6.5 Sample Sums and the Central Limit Theorem
6.5.1 Exercises
6.6 Other Distributions Associated with Normal Sampling
6.6.1 Chi Square Distributions
6.6.2 Student t Distributions
6.6.3 The Joint Distribution of the Sample Mean and Variance
6.6.4 Exercises
7 Statistical Inference for a Single Population
7.1 Introduction
7.2 Estimation of Parameters
7.2.1 Estimators
7.2.2 Desirable Properties of Estimators
7.3 Estimating a Population Mean
7.3.1 Confidence Intervals
7.3.2 Small Sample Confidence Intervals for a Normal Mean
7.3.3 Exercises
7.4 Estimating a Population Proportion
7.4.1 Choosing the Sample Size
7.4.2 Confidence Intervals for p
7.4.3 Exercises
7.5 Estimating Quantiles
7.5.1 Exercises
7.6 Estimating the Variance and Standard Deviation
7.7 Hypothesis Testing
7.7.1 Test Statistics, Type 1 and Type 2 Errors
7.8 Hypotheses About a Population Mean
7.8.1 Tests for the mean when the variance is unknown
7.9 p-values
7.9.1 Exercises
7.10 Hypotheses About a Population Proportion
7.10.1 Exercises
8 Regression and Correlation
8.1 Examples of Linear Regression Problems
8.2 Least Squares Estimates
8.2.1 The “lm” Function in R
8.2.2 Exercises
8.3 Distributions of the Least Squares Estimators
8.3.1 Exercises
8.4 Inference for the Regression Parameters
8.4.1 Confidence Intervals for the Parameters
8.4.2 Hypothesis Tests for the Parameters
8.4.3 Exercises
8.5 Correlation
8.5.1 Confidence intervals for ρ
8.5.2 Exercises
9 Inference from Multiple Samples
9.1 Comparison of Two Population Means
9.1.1 Large Samples
9.1.2 Comparing Two Population Proportions
9.1.3 Samples from Normal Distributions
9.1.4 Exercises
9.2 Paired Observations
9.2.1 Crossover Studies
9.2.2 Exercises
9.3 More than Two Independent Samples: Single Factor Analysis of Variance
9.3.1 Example Using R
9.3.2 Multiple Comparisons
9.3.3 Exercises
9.4 Two-Way Analysis of Variance
9.4.1 Interactions Between the Factors
9.4.2 Exercises
10 Analysis of Categorical Data
10.1 Multinomial Distributions
10.1.1 Estimators and Hypothesis Tests for the Parameters
10.1.2 Multinomial Probabilities That Are Functions of Other Parameters
10.1.3 Exercises
10.2 Testing Equality of Multinomial Probabilities
10.3 Independence of Attributes: Contingency Tables
10.3.1 Exercises
11 Miscellaneous Topics
11.1 Multiple Linear Regression
11.1.1 Inferences Based on Normality
11.1.2 Using R’s “lm” Function for Multiple Regression
11.1.3 Factor Variables as Predictors
11.1.4 Exercises
11.2 Nonparametric Methods
11.2.1 The Signed Rank Test
11.2.2 The Wilcoxon Rank Sum Test
11.2.3 Exercises
11.3 Bootstrap Confidence Intervals
11.3.1 Exercises
Chapter 1
Background
Statistics is the art of summarizing data, depicting data, and
extracting information from it. Statistics
and the theory of probability are distinct subjects, although
statistics depends on probability to
quantify the strength of its inferences. The probability used in
this course will be developed in Chapter
3 and throughout the text as needed. We begin by introducing
some basic ideas and terminology.
1.1 Populations, Samples and Variables
A population is a set of individual elements whose collective
properties are the subject of investigation.
Usually, populations are large collections whose individual
members cannot all be examined in detail.
In statistical inference a manageable subset of the population is
selected according to certain sampling
procedures and properties of the subset are generalized to the
entire population. These generalizations
are accompanied by statements quantifying their accuracy and
reliability. The selected subset is called
a sample from the population.
Examples:
(a) the population of registered voters in a congressional
district,
(b) the population of U.S. adult males,
(c) the population of currently enrolled students at a certain
large urban university,
(d) the population of all transactions in the U.S. stock market
for the past month,
(e) the population of all peak temperatures at points on the
Earth’s surface over a given time interval.
Some samples from these populations might be:
(a) the voters contacted in a pre-election telephone poll,
(b) adult males interviewed by a TV reporter,
(c) the dean’s list,
(d) transactions recorded on the books of Smith Barney,
(e) peak temperatures recorded at several weather stations.
Clearly, for these particular samples, some generalizations from
sample to population would be highly
questionable.
A population variable is an attribute that has a value for each
individual in the population. In
other words, it is a function from the population to some set of
possible values. It may be helpful to
imagine a population as a spreadsheet with one row or record
for each individual member. Along the
ith row, the values of a number of attributes of the ith
individual are recorded in different columns.
The column headings of the spreadsheet can be thought of as the
population variables. For example,
if the population is the set of currently enrolled students at the
urban university, some of the variables
are academic classification, number of hours currently enrolled,
total hours taken, grade point average,
gender, ethnic classification, major, and so on. Variables, such
as these, that are defined for the same
population are said to be jointly observed or jointly distributed.
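The spreadsheet picture corresponds closely to a data frame in R, which stores one row per individual and one column per variable. The following is only a minimal sketch; the student records and column names are invented for illustration.

```r
# Hypothetical records for four enrolled students (invented values).
students <- data.frame(
  classification = c("freshman", "junior", "senior", "junior"),
  hours_enrolled = c(15, 12, 9, 16),
  gpa            = c(3.2, 2.9, 3.7, 3.4)
)

# Each row is one individual; each column is one variable.
nrow(students)    # number of individuals recorded
names(students)   # the jointly observed variables
students[2, ]     # all recorded attributes of the 2nd individual
```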
1.2 Types of Variables
Variables are classified according to the kinds of values they
have. The three basic types are numeric
variables, factor variables, and ordered factor variables.
Numeric variables are those for which arithmetic
operations such as addition and subtraction make sense.
Numeric variables are often related to
a scale of measurement and expressed in units, such as meters,
seconds, or dollars. Factor variables
are those whose values are mere names, to which arithmetic
operations do not apply. Factors usually
have a small number of possible values. These values might be
designated by numbers. If they are, the
numbers that represent distinct values are chosen merely for
convenience. The values of factors might
also be letters, words, or pictorial symbols. Factor variables are
sometimes called nominal variables
or categorical variables. Ordered factor variables are factors
whose values are ordered in some natural
and important way. Ordered factors are also called ordinal
variables. Some textbooks have a more
elaborate classification of variables, with various subtypes. The
three types above are enough for our
purposes.
Examples: Consider the population of students currently
enrolled at a large university. Each student
has a residency status, either resident or nonresident.
Residency status is an unordered factor
variable. Academic classification is an ordered factor with
values “freshman”, “sophomore”, “junior”,
“senior”, “post-baccalaureate” and “graduate student”. The
number of hours enrolled is a numeric
variable with integer values. The distance a student travels from
home to campus is a numeric variable
expressed in miles or kilometers. Home area code is an
unordered factor variable whose values
are designated by numbers.
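In R, the three types of variables can be represented roughly as follows. This is a sketch under the assumption that a handful of invented student values is enough to show the idea; the object names are hypothetical.

```r
# Unordered factor: the labels are mere names.
residency <- factor(c("resident", "nonresident", "resident"))

# Ordered factor: the levels argument fixes the natural order.
class_levels <- c("freshman", "sophomore", "junior", "senior",
                  "post-baccalaureate", "graduate student")
classification <- factor(c("junior", "freshman", "senior"),
                         levels = class_levels, ordered = TRUE)

# Numeric variable: arithmetic is meaningful.
hours <- c(15L, 12L, 9L)

# Area codes are digits used as names, so they are stored as a factor.
area_code <- factor(c(713, 281, 713))

is.ordered(residency)                  # FALSE
is.ordered(classification)             # TRUE
classification[1] > classification[2]  # TRUE: junior ranks above freshman
mean(hours)                            # 12
```

Storing area codes as a factor prevents meaningless arithmetic, such as averaging them, from being carried out by accident.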
1.3 Random Experiments and Sample Spaces
An experiment can be something as simple as flipping a coin or
as complex as conducting a public
opinion poll. A random experiment is one with the following
two characteristics:
(1) The experiment can be replicated an indefinite number of
times under essentially the same experimental
conditions.
(2) There is a degree of uncertainty in the outcome of the
experiment. The outcome may vary
from replication to replication even though experimental
conditions are the same.
When we say that an experiment can be replicated under the
same conditions, we mean that controllable
or observable conditions that we think might affect the
outcome are the same. There may be
hidden conditions that affect the outcome, but we cannot
account for them. Implicit in (1) is the idea
that replications of a random experiment are independent, that
is, the outcomes of some replications
do not affect the outcomes of others. Obviously, a random
experiment is an idealization of a real
experiment. Some simple experiments, such as tossing a coin,
approach this ideal closely while more
complicated experiments may not.
The sample space of a random experiment is the set of all its
possible outcomes. We use the Greek
capital letter Ω (omega) to denote the sample space. There is
some degree of arbitrariness in the
description of Ω. It depends on how the outcomes of the
experiment are represented symbolically.
Examples:
(a) Toss a coin. Ω = {H,T}, where “H” denotes a head and “T” a
tail. Another way of representing
the outcome is to let the number 1 denote a head and 0 a
tail (or vice-versa). If we do this,
then Ω = {0, 1}. In the latter representation the outcome of the
experiment is just the number of heads.
(b) Toss a coin 5 times, i.e., replicate the experiment in (a) 5
times. An outcome of this experiment
is a 5 term sequence of heads and tails. A typical outcome might
be indicated by (H,T,T,H,H), or
by (1,0,0,1,1). Even for this little experiment it is cumbersome
to list all the outcomes, so we use a
shorter notation
$$\Omega = \{(x_1, x_2, x_3, x_4, x_5) \mid x_i = 0 \text{ or } x_i = 1 \text{ for each } i\}.$$
(c) Select a student randomly from the population of all
currently enrolled students. The sample
space is the same as the population. The word “randomly” is
vague. We will define it later.
(d) Repeat the Michelson-Morley experiment to measure the
speed of the Earth relative to the ether
(which doesn’t exist, as we now know). The outcome of the
experiment could conceivably be any
nonnegative number, so we take Ω = [0,∞) = {x | x is a real
number and x ≥ 0}. Uncertainty arises
from the fact that this is a very delicate experiment with several
sources of unpredictable error.
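The sample space in example (b) is small enough to enumerate by computer. The sketch below uses the 0/1 representation of tails and heads; the simulated replication assumes a fair coin, which the text has not yet specified, and `set.seed()` is used only to make the illustration reproducible.

```r
# Sample space for 5 tosses: every sequence of 0s (tails) and 1s (heads).
omega <- expand.grid(x1 = 0:1, x2 = 0:1, x3 = 0:1, x4 = 0:1, x5 = 0:1)

nrow(omega)       # 32 = 2^5 outcomes
head(omega, 3)    # the first few outcomes

# One replication of the experiment: five independent tosses of a fair coin.
set.seed(1)
outcome <- sample(0:1, size = 5, replace = TRUE)
outcome
sum(outcome)      # number of heads in this replication
```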
1.4 Computing in Statistics
Even moderately large data sets cannot be managed effectively
without a computer and computer
software. Furthermore, much of applied statistics is exploratory
in nature and cannot be carried out
by hand, even with a calculator. Spreadsheet programs, such as
Microsoft Excel, are designed to
manipulate data in tabular form and have functions for
performing the common tasks of statistics. In
addition, many add-ins are available, some of them free, for
enhancing the graphical and statistical
capabilities of spreadsheet programs. Some of the exercises and
examples in this text make use of
Excel with its built-in data analysis package. Because it is so
common in the business world, it is
important for students to have some experience with Excel or a
similar program.
The disadvantages of spreadsheet programs are their
dependence on the spreadsheet data format
with cell ranges as input for statistical functions, their lack of
flexibility, and their relatively poor
graphics. Many highly sophisticated packages for statistics and
data analysis are available. Some of
the best known commercial packages are Minitab, SAS, SPSS,
Splus, Stata, and Systat. The package
used in this text is called R. It is an open source implementation
of the same language used in Splus
and may be downloaded free at
http://www.r-project.org .
After downloading and installing R we recommend that you
download and install another free package
called Rstudio. It can be obtained from
http://www.rstudio.com .
Rstudio makes importing data into R much easier and makes it
easier to integrate R output with
other programs. Detailed instructions on using R and Rstudio
for the exercises will be provided.
Data files used in this course are from four sources. Some are
local in origin and come from student
or course data at the University of Houston. Others are
simulated but made to look as realistic as
possible. These and others are available at
http://www.math.uh.edu/~charles/data .
Many data sets are included with R in the datasets library and
other contributed packages. We will
refer to them frequently. The main external sources of data are
the data archives maintained by the
Journal of Statistics Education,
www.amstat.org/publications/jse
and the Statistical Science Web:
http://www.statsci.org/datasets.html .
1.5 Exercises
1. Go to http://www.math.uh.edu/~charles/data. Examine the
data set “Air Pollution Filter Noise”.
Identify the variables and give their types.
2. Highlight the data in Air Pollution Filter Noise. Include the
column headings but not the language
preceding the column headings. Copy and paste the data into a
plain text file, for example with
Notepad in Windows. Import the text file into Excel or another
spreadsheet program. Create a new
folder or directory named “math3339” and save both files there.
3. Start R by double clicking on the big blue R icon on your
desktop. Click on the file menu at the
top of the R GUI window. Select “change dir...”. In the
window that opens next, find the name of
the directory where you saved the text file and double click on
the name of that directory. Suppose
that you named your file “apfilternoise”. (Name it anything you
like.) Import the file into R with the
command
> apfilternoise = read.table("apfilternoise.txt", header = TRUE)
and display it with the command
> apfilternoise
Click on the file menu at the top again and select “Exit”. At the
prompt to save your workspace, click
“Yes”. If you open the folder where your work was saved you
will see another big blue R icon. If you
double click on it, R will start again and your previously saved
workspace will be restored.
If you use Rstudio for this exercise you can import apfilternoise
into R by clicking on the “Import
Dataset” tab. This will open a window on your file system and
allow you to select the file you saved
in Exercise 2. The dialog box allows you to rename the data and
make other minor changes before
importing the data as a data frame in R.
4. If you are using Rstudio, click on the “Packages” tab and
then the word “datasets”. Find the data
set “airquality” and click on it. Read about it. If you are using R
alone, type
> help(airquality)
at the command prompt > in the Console window.
Then type
> airquality
to view the data. Could “Month” and “Day” be considered
ordered factors rather than numeric variables?
5. A random experiment consists of throwing a standard 6-sided
die and noting the number of spots
on the upper face. Describe the sample space of this experiment.
6. An experiment consists of replicating the experiment in
exercise 5 four times. Describe the sample
space of this experiment. How many possible outcomes does
this experiment have?
Chapter 2
Descriptive and Graphical Statistics
A large part of a statistician’s job consists of summarizing and
presenting important features of data.
Simply looking at a spreadsheet with 1000 rows and 50 columns
conveys very little information. Most
likely, the user of the data would rather see numerical and
graphical summaries of how the values
of different variables are distributed and how the variables are
related to each other. This chapter
concerns some of the most important ways of summarizing data.
2.1 Location Measures
2.1.1 The Mean
Suppose that x is the name of a numeric variable whose values
are recorded either for the entire
population or for a sample from that population. Let the n
recorded values of x be denoted by
x1, x2, . . . , xn. These are not necessarily distinct numbers. The
mean or average of these values is
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
When the values of x for the entire population are included, it is
customary to denote this quantity
by µ(x) and call it the population mean. The mean is called a
location measure partly because it is
taken as a representative or central value of x. More
importantly, it behaves in a certain way if we
change the scale of measurement for values of x. Imagine that x
is temperature recorded in degrees
Celsius and we decide to change the unit of measurement to
degrees Fahrenheit. If yi denotes the
Fahrenheit temperature of the ith individual, then yi = 1.8xi +
32. In effect, we have defined a new
variable y by the equation y = 1.8x + 32. The means of the new
and old variables have the same
relationship as the individual measurements have.
$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{1}{n} \sum_{i=1}^{n} (1.8 x_i + 32) = 1.8\bar{x} + 32$$
In general, if a and b > 0 are constants and y = a + bx, then ȳ = a + bx̄. Other location measures introduced
Other location measures introduced
below behave in the same way.
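The change of scale from Celsius to Fahrenheit can be checked numerically in R. The temperatures below are invented for illustration; the median, used in the last line, is one of the location measures introduced later in this chapter.

```r
# Hypothetical temperatures in degrees Celsius (invented values).
celsius <- c(20, 25, 30, 15, 10)

# Convert each value to degrees Fahrenheit: y = 1.8x + 32.
fahrenheit <- 1.8 * celsius + 32

mean(celsius)               # 20
mean(fahrenheit)            # 68
1.8 * mean(celsius) + 32    # 68, the same value

# The median transforms in the same way.
all.equal(median(fahrenheit), 1.8 * median(celsius) + 32)
```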
When there are repeated values of x, there is an equivalent
formula for the mean. Let the m distinct
values of x be denoted by v1, . . . , vm. Let ni be the number of
times vi is repeated and let fi = ni/n.
Note that $\sum_{i=1}^{m} n_i = n$ and $\sum_{i=1}^{m} f_i = 1$. Then the average is given by
$$\bar{x} = \sum_{i=1}^{m} f_i v_i$$
The number ni is the frequency of the value vi and fi is its
relative frequency.
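As an illustrative check of the relative frequency formula, the following R sketch computes the mean both ways. The sample is invented, and the names `counts`, `v` and `f` are chosen here to mirror the notation above.

```r
# Hypothetical sample with repeated values (invented).
x <- c(2, 3, 3, 5, 2, 3, 5, 5, 5, 2)
n <- length(x)

counts <- table(x)                 # frequencies n_i of the distinct values
v <- as.numeric(names(counts))     # distinct values v_i
f <- as.numeric(counts) / n        # relative frequencies f_i

sum(f)        # 1
sum(f * v)    # 3.5
mean(x)       # 3.5
```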
2.1.2 The Median and Other Quantiles
Let x be a numeric variable with values x1, x2, . . . , xn.
Arrange the values in increasing order x(1) ≤
x(2) ≤ . . . ≤ x(n). The median of x is a number median(x) such
that at least half the values of x
are ≤ median(x) and at least half the values of x are ≥
median(x). This conveys the essential idea
but unfortunately it may define an interval of numbers rather
than a single number. The ambiguity is
usually resolved by taking the median to be the midpoint of that
interval. Thus, if n is odd, n = 2k+1,
where k is a nonnegative integer,
$$\mathrm{median}(x) = x_{(k+1)},$$
while if n is even, n = 2k,
$$\mathrm{median}(x) = \frac{x_{(k)} + x_{(k+1)}}{2}.$$
Let p ∈ (0, 1) be a number between 0 and 1. The pth quantile of
x is more commonly known as the
100pth percentile; e.g., the 0.8 quantile is the same as the 80th
percentile. We define it as a number
q(x, p) such that the fraction of values of x that are ≤ q(x, p) is
at least p and the fraction of values of
x that are ≥ q(x, p) is at least 1−p. For example, at least 80
percent of the values of x are ≤ the 80th
percentile of x and at least 20 percent of the values of x are ≥
its 80th percentile. Again, this may
not define a unique number q(x, p). Software packages have
rules for resolving the ambiguity, but the
details are usually not important.
The median is the 50th percentile, i.e., the 0.5 quantile. The
25th and 75th percentiles are called the
first and third quartiles. The 10th, 20th, 30th, etc. percentiles
are called the deciles. The median is a
location measure as defined in the preceding section.
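In R, the functions `median()` and `quantile()` compute these quantities. The sketch below uses invented data; the `type` argument of `quantile()` selects among several rules for resolving the ambiguity mentioned above, so different types may give slightly different answers.

```r
# Odd number of values: the median is the middle order statistic.
x_odd <- c(7, 1, 4, 9, 3)          # sorted: 1 3 4 7 9
median(x_odd)                      # 4

# Even number of values: the median is the midpoint of the two middle values.
x_even <- c(7, 1, 4, 9, 3, 10)     # sorted: 1 3 4 7 9 10
median(x_even)                     # 5.5

# Quartiles, median and 80th percentile under the default rule.
quantile(x_even, probs = c(0.25, 0.5, 0.75, 0.8))

# The same 80th percentile under a different rule.
quantile(x_even, probs = 0.8, type = 1)
```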
2.1.3 Trimmed Means
Trimmed means of a variable x are obtained by finding the
mean of the values of x excluding a given
percentage of the largest and smallest values. For example, the
5% trimmed mean is the mean of the
values of x excluding the largest 5% of the values and the
smallest 5% of the values. In other words, it
is the mean of all the values between the 5th and 95th
percentiles of x. A trimmed mean is a location
measure.
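In R a trimmed mean can be requested through the trim argument of mean(). The sketch below uses made-up data with one wild value to compare the plain mean, the 5% trimmed mean and the median.

```r
# Hypothetical data (not from the text): the integers 1..19 plus one wild value.
x <- c(1:19, 1000)

plain   <- mean(x)               # uses all 20 values
trimmed <- mean(x, trim = 0.05)  # drops 5% of the values (here 1 of 20) from each end
med     <- median(x)

print(c(mean = plain, trimmed = trimmed, median = med))
```

With 20 values, trimming 5% removes exactly one value from each end, so the trimmed mean is the average of 2, ..., 19.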
2.1.4 Grouped Data
Sometimes large data sets are summarized by grouping values.
Let x be a numeric variable with values
x1, x2, . . . , xn. Let c0 < c1 < . . . < cm be numbers such that
all the values of x are between c0 and
cm. For each i, let ni be the number of values of x (including
repetitions) that are in the interval
(ci−1, ci], i.e., the number of indices j such that ci−1 < xj ≤ ci.
A frequency table of x is a table
showing the class intervals (ci−1, ci] along with frequencies ni
with which the data values fall into each
interval. Sometimes additional columns are included showing
the relative frequencies fi = ni/n, the
cumulative relative frequencies $F_i = \sum_{j \le i} f_j$, and the midpoints of the intervals.
Example 2.1. The data below are 50 measured reaction times in
response to a sensory stimulus,
arranged in increasing order. A frequency table is shown below
the data.
0.12 0.30 0.35 0.37 0.44 0.57 0.61 0.62 0.71 0.80 0.88 1.02 1.08
1.12 1.13 1.17 1.21 1.23 1.35 1.41 1.42
1.42 1.46 1.50 1.52 1.54 1.60 1.61 1.68 1.72 1.86 1.90 1.91 2.07
2.09 2.16 2.17 2.20 2.29 2.32 2.39 2.47
2.60 2.86 3.43 3.43 3.77 3.97 4.54 4.73
Interval  Midpoint  ni   fi    Fi
(0,1]     0.5       11   0.22  0.22
(1,2]     1.5       22   0.44  0.66
(2,3]     2.5       11   0.22  0.88
(3,4]     3.5        4   0.08  0.96
(4,5]     4.5        2   0.04  1.00
If only a frequency table like the one above is given, the mean
and median cannot be calculated
exactly. However, they can be estimated. If we take the
midpoint of an interval as a stand-in for all
the values in that interval, then we can use the formula in the
preceding section for calculating a mean
with repeated values. Thus, in the example above, we would
estimate the mean as
0.22(0.5) + 0.44(1.5) + 0.22(2.5) + 0.08(3.5) + 0.04(4.5) = 1.78.
Estimating the median is a bit more difficult. By examining the
cumulative frequencies Fi, we see that
22% of the data is less than or equal to 1 and 66% of the data is
less than or equal to 2. Therefore,
the median lies between 1 and 2. That is, it is 1 + a certain
fraction of the distance from 1 to 2. A
reasonable guess at that fraction is given by linear interpolation
between the cumulative frequencies
at 1 and 2. In other words, we estimate the median as
$$1 + \frac{0.50 - 0.22}{0.66 - 0.22}\,(2 - 1) \approx 1.636.$$
A cruder estimate of the median is just the midpoint of the
interval that contains the median, in
this case 1.5. We leave it as an exercise to calculate the mean
and median from the data of Example
2.1 and to compare them to these estimates.
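The two grouped-data estimates can be reproduced in R from the frequency table alone. The sketch below is one way to organize the calculation; the variable names are chosen for illustration.

```r
# Class boundaries and frequencies copied from the frequency table of Example 2.1.
breaks <- 0:5
n_i <- c(11, 22, 11, 4, 2)

f_i <- n_i / sum(n_i)                             # relative frequencies
F_i <- cumsum(f_i)                                # cumulative relative frequencies
mid <- (breaks[-1] + breaks[-length(breaks)]) / 2 # interval midpoints

# Mean: treat every value as if it sat at the midpoint of its interval.
mean_est <- sum(f_i * mid)

# Median: find the first interval whose cumulative frequency reaches 0.5,
# then interpolate linearly inside that interval.
k <- which(F_i >= 0.5)[1]
F_prev <- if (k > 1) F_i[k - 1] else 0
median_est <- breaks[k] +
  (0.5 - F_prev) / (F_i[k] - F_prev) * (breaks[k + 1] - breaks[k])

print(c(mean_est = mean_est, median_est = median_est))
```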
2.1.5 Histograms
The figure below is a histogram of the reaction times.
> reacttimes=read.table("reacttimes.txt",header=T)
> hist(reacttimes$Times,breaks=0:5,xlab="Reaction Times",main=" ")
[Figure: absolute frequency histogram of the reaction times. Horizontal axis: Reaction Times (0 to 5). Vertical axis: Frequency (0 to 20).]
The histogram is a graphical depiction of the grouped data. The
end points ci of the class intervals
are shown on the horizontal axis. This is an absolute frequency
histogram because the heights of the
vertical bars above the class intervals are the absolute
frequencies ni. A relative frequency histogram
would show the relative frequencies fi. A density histogram has
bars whose heights are the relative
frequencies divided by the lengths of the corresponding class
intervals. Thus, in a density histogram
the area of the bar is equal to the relative frequency. If all class
intervals have the same length, these
types of histograms all have the same shape and convey the
same visual information.
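When the class intervals have unequal lengths the distinction matters. The R sketch below uses made-up data and unequal intervals to check that, in a density histogram, bar area equals relative frequency. It calls hist() with plot=FALSE so that only the computed quantities are returned.

```r
# Hypothetical data and unequal class intervals (not from the text).
x <- c(0.2, 0.7, 1.1, 1.4, 1.8, 2.5, 3.9, 4.6)
breaks <- c(0, 1, 2, 5)

# plot = FALSE returns the histogram object without drawing it.
h <- hist(x, breaks = breaks, plot = FALSE)

rel_freq <- h$counts / sum(h$counts)  # relative frequencies f_i
widths   <- diff(h$breaks)            # lengths of the class intervals
areas    <- h$density * widths        # bar areas in a density histogram

print(rbind(rel_freq = rel_freq, density = h$density, area = areas))
```

The last interval is three times as long as the others, so its density bar is lower than its relative frequency would suggest, while its area still matches.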
2.1.6 Robustness
A robust measure of location is one that is not affected by a few
extremely large or extremely small
values. Values of a numeric variable that lie a great distance
from most of the other values are
called outliers. Outliers might be the result of mistakes in
measuring or recording data, perhaps from
misplacing a decimal point. The mean is not a robust location
measure. It can be affected significantly
by a single extreme outlier if that outlying value is extreme
enough. Thus, if there is any doubt about
the quality of the data, the median or a trimmed mean might be
preferred to the mean as a reliable
location measure. The median is very insensitive to outliers. A
5% trimmed mean is insensitive to
outliers that make up no more than 5% of the data values.
2.1.7 The Five Number Summary
The five number summary is a convenient way of summarizing
numeric data. The five numbers are the
minimum value, the first quartile (25th percentile), the median,
the third quartile (75th percentile),
and the maximum value. Sometimes the mean is also included,
which makes it a six number summary.
Example 2.2. The natural logarithms y of the data values x in
Example 2.1 are, to two places:
-2.12 -1.20 -1.05 -0.99 -0.82 -0.56 -0.49 -0.48 -0.34 -0.22 -0.13
0.02 0.08 0.11 0.12 0.16 0.19 0.21 0.30
0.34 0.35 0.35 0.38 0.40 0.42 0.43 0.47 0.48 0.52 0.54 0.62 0.64
0.65 0.73 0.74 0.77 0.78 0.79 0.83 0.84
0.87 0.90 0.96 1.05 1.23 1.23 1.33 1.38 1.51 1.55
It is sometimes advantageous to transform data in some way,
i.e., to define a new variable y as a
function of the old variable x. In this case, we have transformed
the reaction times x with the
natural logarithm transformation. We might want to do this
so that we can more easily apply
certain statistical inference procedures you will learn about
later. The six number summary of the
transformed data y is:
> reacttimes=read.table("reacttimes.txt",header=T)
> summary(log(reacttimes$Times))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-2.12000  0.08605  0.42520  0.33710  0.78500  1.55400
2.1.8 The Mode
The mode of a variable is its most frequently occurring value.
With numeric variables the mode is
less important than the mean and median for descriptive
purposes or for statistical inference. For
factor variables the mode is the most natural way of choosing a
"most representative" value. We hear
this frequently in the media, in statements such as "Financial
problems are the most common cause
of marital strife". For grouped numeric data the modal class
interval is the class interval having the
highest absolute or relative frequency. In Example 2.1, the modal
class interval is the interval (1,2].
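Base R has no function that returns the mode in this sense, but it can be read off a frequency table. The sketch below uses a made-up factor for illustration.

```r
# Hypothetical factor data (not from the text).
grades <- factor(c("A", "B", "B", "C", "B", "A"))

counts <- table(grades)                         # frequency of each level
mode_value <- names(counts)[which.max(counts)]  # level with the highest frequency

print(counts)
print(mode_value)
```

Note that which.max() returns only the first maximum, so ties need to be checked separately.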
2.1.9 Exercises
1. Find the mean and median of the reaction time data in
Example 2.1.
2. Find the quartiles of the reaction time data. There is more
than one acceptable answer.
3. The 40th value x40 of the reaction time data has a value of
2.32. Replace that with 232.0.
Recalculate the mean and median. Comment.
4. Construct a frequency table like the one in Example 2.1 for the
log-transformed reaction times
of Example 2.2. Use 5 class intervals of equal length beginning at
-3 and ending at 2. Draw an absolute
frequency histogram.
5. Estimate the mean and median of the grouped log-transformed
reaction times by using the techniques
discussed in Example 2.1. Compare your answers to the
summary in Example 2.2.
6. Repeat exercises 1, 2, and the histogram of exercise 4 by
using R.
7. Let x be a numeric variable with values x1, . . . , xn−1, xn.
Let $\bar{x}_n$ be the average of all n values
and let $\bar{x}_{n-1}$ be the average of x1, . . . , xn−1. Show that
$$\bar{x}_n = \left(1 - \frac{1}{n}\right)\bar{x}_{n-1} + \frac{1}{n}\,x_n.$$
What happens if $x_n \to \infty$ while all the other values of x are fixed?
2.2 Measures of Variability or Scale
2.2.1 The Variance and Standard Deviation
Let x be a population variable with values x1, x2, . . . , xn.
Some of the values might be repeated. The
variance of x is
$$\mathrm{var}(x) = \sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu(x))^2.$$
The standard deviation of x is
$$\mathrm{sd}(x) = \sigma = \sqrt{\mathrm{var}(x)}.$$
When x1, x2, . . . , xn are values of x from a sample rather than
the entire population, we modify the
definition of the variance slightly, use a different notation, and
call these objects the sample variance
and standard deviation.
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s = \sqrt{s^2}.$$
The reason for modifying the definition for the sample variance
has to do with its properties as an
estimate of the population variance.
Alternative, algebraically equivalent formulas for the variance and
sample variance are
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \mu(x)^2, \qquad s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right).$$
These are sometimes easier to use for hand computation.
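The R sketch below checks these formulas numerically on a small made-up data set. It relies on the fact that R's var() and sd() use the sample (n − 1) definitions, so the population versions have to be computed by hand.

```r
# Hypothetical data (not from the text).
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Population variance: divide the sum of squared deviations by n.
pop_var     <- mean((x - mean(x))^2)
pop_var_alt <- mean(x^2) - mean(x)^2        # shortcut formula

# Sample variance: R's var() divides by n - 1.
samp_var     <- var(x)
samp_var_alt <- (sum(x^2) - n * mean(x)^2) / (n - 1)  # shortcut formula

print(c(pop_var = pop_var, pop_var_alt = pop_var_alt,
        samp_var = samp_var, samp_var_alt = samp_var_alt))
```

The shortcut formulas are convenient by hand, but with floating point data whose mean is large relative to its spread they can lose accuracy, so the deviation form is generally the safer one in code.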
The standard deviation σ is called a measure of scale because of
the way it behaves under linear
transformations of the data. If a new variable y is defined by y
= a + bx, where a and b are constants,
then sd(y) = |b| sd(x). For example, the standard deviation of
Fahrenheit temperatures is 1.8 times the
standard deviation of Celsius temperatures. The transformation
y = a + bx can be thought of as a
rescaling operation, or a choice of a different system of
measurement units, and the standard deviation
takes account of it in a natural way.
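A short R check of the scaling rule, using made-up Celsius temperatures. The second transformation has a negative slope to show why the absolute value |b| appears.

```r
# Hypothetical Celsius temperatures (not from the text).
celsius <- c(-5, 0, 12, 21, 30)

fahrenheit <- 32 + 1.8 * celsius   # y = a + b x with a = 32, b = 1.8
reversed   <- 10 - 2 * celsius     # y = a + b x with a = 10, b = -2

ratio_f <- sd(fahrenheit) / sd(celsius)  # equals |1.8|
ratio_r <- sd(reversed) / sd(celsius)    # equals |-2|

print(c(ratio_f = ratio_f, ratio_r = ratio_r))
```

The shift a has no effect on the standard deviation; only the magnitude of the slope b matters.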
2.2.2 The Coefficient of Variation
For a variable that has only positive values, it may be more
important to measure the relative variability
than the absolute variability. That is, the amount of
variation should be compared to the mean
value of the variable. The coefficient of variation for a
population variable is defined as
$$\mathrm{cv}(x) = \frac{\mathrm{sd}(x)}{\mu(x)}.$$
For a sample of values of x we substitute the sample standard
deviation s and the sample average x̄.
2.2.3 The Mean and Median Absolute Deviation
Suppose that you must choose a single number c to represent all
the values of a variable x as accurately
as possible. One measure of the overall error with which c
represents the values of x is
$$g(c) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - c)^2}.$$
In the exercises, you are asked to show that this expression is
minimized when c = x̄ . In other words,
the single number which most accurately represents all the
values is, by this criterion, the mean of the
variable. Furthermore, the minimum possible overall error, by
this criterion, is the standard deviation.
However, this is not the only reasonable criterion. Another is
$$h(c) = \frac{1}{n}\sum_{i=1}^{n} |x_i - c|.$$
It can be shown that this criterion is minimized when c =
median(x). The minimum value of h(c) is
called the mean absolute deviation from the median. It is a scale
measure which is somewhat more
robust (less affected by outliers) than the standard deviation, but
still not very robust. A related very
robust measure of scale is the median absolute deviation from
the median, or mad:
mad(x) = median(|x−median(x)|).
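The two criteria and the resulting scale measures can be explored numerically. The R sketch below evaluates g and h on a grid of candidate values for a made-up data set with one outlier. The function names g and h follow the text; everything else is illustrative. One point to be aware of is that R's built-in mad() multiplies by the constant 1.4826 by default, so constant = 1 is needed to match the definition given here.

```r
# Hypothetical data with one outlier (not from the text).
x <- c(1, 2, 3, 4, 100)

g <- function(center) sqrt(mean((x - center)^2))  # root mean squared error
h <- function(center) mean(abs(x - center))       # mean absolute error

# Evaluate both criteria on a grid of candidate values.
grid <- seq(min(x), max(x), by = 0.5)
best_g <- grid[which.min(sapply(grid, g))]  # grid point minimizing g
best_h <- grid[which.min(sapply(grid, h))]  # grid point minimizing h

mean_abs_dev <- h(median(x))          # mean absolute deviation from the median
mad_text     <- mad(x, constant = 1)  # median absolute deviation as defined in the text
mad_default  <- mad(x)                # R's default rescales by 1.4826

print(c(best_g = best_g, mean = mean(x), best_h = best_h, median = median(x)))
print(c(pop_sd = g(mean(x)), mean_abs_dev = mean_abs_dev,
        mad_text = mad_text, mad_default = mad_default))
```

The minimum of g is the population standard deviation (denominator n), which is slightly smaller than the value returned by sd(). The outlier inflates the standard deviation and the mean absolute deviation but leaves the mad almost untouched.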
2.2.4 The Interquartile Range
The interquartile range of a variable x is the difference between
its 75th and 25th percentiles.
IQR(x) = q(x, .75)− q(x, .25).
It is a robust measure of scale which is important in the
construction and interpretation of boxplots,
discussed below.
All of these measures of scale are valid for comparison of the
"spread" or variability of numeric variables
about a central value. In general, the greater their values, the
more spread out the values of the variable
are. Of course, the standard deviation, median absolute
deviation, and interquartile range of a variable
will be different numbers and one must be careful to compare
like measures.
2.2.5 Boxplots
Boxplots are also called box and whisker diagrams. Essentially,
a boxplot is a graphical representation
of the five number summary. The boxplot below depicts the
sensory response data of the preceding
section without the log transformation.
> reacttimes=read.table("reacttimes.txt",header=T)
> boxplot(reacttimes$Times,horizontal=T,xlab="Reaction Times")
> summary(reacttimes)
Times
Min. :0.120
1st Qu.:1.090
Median :1.530
Mean :1.742
3rd Qu.:2.192
Max. :4.730
[Figure: horizontal boxplot of the reaction times. Axis: Reaction Times (0 to 4).]
The central box in the diagram encloses the middle 50% of the
numeric data. Its left and right boundaries
mark the first and third quartiles. The boldface middle line
in the box marks the median of
the data. Thus, the interquartile range is the distance between
the left and right boundaries of the
central box. For construction of a boxplot, an outlier is defined
as a data value whose distance from
the nearest quartile is more than 1.5 times the interquartile
range. Outliers are indicated by isolated
points (tiny circles in this boxplot). The dashed lines extending
outward from the quartiles are called
the whiskers. They extend from the quartiles to the most
extreme values in either direction that are
not outliers.
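The outlier rule can be applied by hand in R. The sketch below types in the 50 reaction times of Example 2.1 and uses the quartiles returned by quantile(). This is a sketch of the rule as stated above; R's boxplot() itself computes the box from Tukey's hinges, which can differ slightly from these quartiles.

```r
# Reaction times from Example 2.1 (50 values, in increasing order).
times <- c(0.12, 0.30, 0.35, 0.37, 0.44, 0.57, 0.61, 0.62, 0.71, 0.80,
           0.88, 1.02, 1.08, 1.12, 1.13, 1.17, 1.21, 1.23, 1.35, 1.41,
           1.42, 1.42, 1.46, 1.50, 1.52, 1.54, 1.60, 1.61, 1.68, 1.72,
           1.86, 1.90, 1.91, 2.07, 2.09, 2.16, 2.17, 2.20, 2.29, 2.32,
           2.39, 2.47, 2.60, 2.86, 3.43, 3.43, 3.77, 3.97, 4.54, 4.73)

q   <- quantile(times, c(0.25, 0.75), names = FALSE)  # first and third quartiles
iqr <- q[2] - q[1]

# Values farther than 1.5 * IQR from the nearest quartile are outliers.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr
outliers <- times[times < lower_fence | times > upper_fence]

# Whiskers end at the most extreme values that are not outliers.
whisker_ends <- range(times[times >= lower_fence & times <= upper_fence])

print(c(Q1 = q[1], Q3 = q[2], IQR = iqr))
print(outliers)
print(whisker_ends)
```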
This boxplot shows a number of interesting things about the
response time data.
(a) The median is about 1.5. The interquartile range is slightly
more than 1.
(b) The three largest values are outliers. They lie a long way
from most of the data. They might call
for special investigation or explanation.
(c) The distribution of values is not symmetric about the
median. The values in the lower half
of the data are more crowded together than those in the upper
half. This is shown by comparing the
distances from the median to the two quartiles, by the lengths of
the whiskers and by the presence of
outliers at the upper end.
The asymmetry of the distribution of values is also evident in
the histogram of the preceding section.
2.2.6 Exercises
1. Find the variance and standard deviation of the response time
data. Treat it as a sample from a
larger population.
2. Find the interquartile range and the median absolute
deviation for the response time data.
3. In the response time data, replace the value x40 = 2.32 by
232.0. Recalculate the standard
deviation, the interquartile range and the median absolute
deviation and compare with the answers
from problems 1 and 2.
4. Make a boxplot of the log-transformed reaction time data. Is
the transformed data more symmetrically
distributed than the original data?
5. Show that the function g(c) in section 2.2.3 is minimized
when c = µ(x). Hint: Minimize $g(c)^2$.
6. Find the variance, standard deviation, IQR, mean absolute
deviation and median absolute deviation
of the variable "Ozone" in the data set "airquality". Use
R or RStudio. You can address the
variable Ozone directly if you attach the airquality data frame
to the search path as follows:
> attach(airquality)
The R functions you will need are "sd" for standard deviation,
"var" for variance, "IQR" for the
interquartile range, and "mad" for the median absolute
deviation. Ozone contains missing values (NA), so pass na.rm=TRUE
to each of these functions; otherwise the result is NA or an error.
Note also that "mad" multiplies by the constant 1.4826 by default;
use constant=1 to obtain the median absolute deviation as defined
in Section 2.2.3. There is no built-in function in R
for the mean absolute deviation, but it is easy to obtain it.
> mean(abs(Ozone-median(Ozone,na.rm=TRUE)),na.rm=TRUE)
2.3 Jointly Distributed Variables
When two or more variables are jointly distributed, or jointly
observed, it is important to understand
how they are related and how closely they are related. We will
first consider the case where one
variable is numeric and the other is a factor.
2.3.1 Side by Side Boxplots
Boxplots are particularly useful in quickly comparing the values
of two or more sets of numeric data
with a common scale of measurement and in investigating the
relationship between a factor variable
and a numeric variable. The figure below compares placement
test scores for each of the letter grades
in a sample of 179 students who took a particular math course
in the same semester under the same
instructor. The two jointly observed population variables are the
placement test score and the letter
grade received. The figure separates test scores according to the
letter grade and shows a boxplot for
each group of students. One would expect to see a decrease in
the median test score as the letter grade
decreases and that is confirmed by the picture. However, the
decrease in median test scores from a
letter grade of B to a grade of F is not very dramatic, especially
compared to the size of the IQRs.
This suggests that the placement test is not especially good at
predicting a student’s final grade in
the course. Notice the two outliers. The outlier for the "W"
group is clearly a mistake in recording
data because the scale of scores only went to 100.
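> # stringsAsFactors=T makes Grade a factor, which R 4.0 and later no longer do by default
> test.vs.grade=read.csv("test.vs.grade.csv",header=T,stringsAsFactors=T)
> attach(test.vs.grade)
> plot(Test~Grade,varwidth=T)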
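[Figure: side by side boxplots of placement test score by letter grade. Horizontal axis: Grade (A, B, C, D, F, W). Vertical axis: Test (40 to 120).]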
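Since the placement test file is not reproduced here, the same kind of comparison can be tried on a data set that ships with R. The sketch below uses the built-in airquality data, treating Month as the grouping variable and Ozone as the numeric variable; this substitution is only for illustration. With plot = FALSE, boxplot() returns the numbers behind each box instead of drawing them.

```r
# Built-in data: daily ozone readings, grouped by month (5 = May, ..., 9 = September).
stats <- boxplot(Ozone ~ Month, data = airquality, plot = FALSE)

# Row 3 of stats$stats holds the medians; rows 2 and 4 hold the box edges (hinges).
medians      <- setNames(stats$stats[3, ], stats$names)
hinge_spread <- setNames(stats$stats[4, ] - stats$stats[2, ], stats$names)

print(rbind(median = medians, hinge_spread = hinge_spread))
```

Dropping plot = FALSE draws the side by side boxplots themselves. Comparing the change in medians across groups with the size of the boxes is the same reading applied to the placement test figure above.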
2.3.2 Scatterplots
Suppose x and y are two jointly distributed numeric variables.
Whether we consider the entire
population or a sample from the population, we have the same
number n of observed values for
each variable. If we plot the n points (x1, y1), (x2, y2), . . . ,
(xn, yn) in a Cartesian plane, we obtain
a scatterplot or a scatter diagram of the two variables. Below
are the first 6 rows of the "Payroll"
data set. The column labeled "payroll" is the total monthly
payroll in thousands of dollars for each
company listed. The column "employees" is the number of
employees in each company and "industry"
indicates which of two related industries the company is in. A
scatterplot of all 50 values of the two
variables "payroll" and "employees" is also shown.
> # stringsAsFactors=T makes industry a factor, so that col=industry works in R 4.0 and later
> Payroll=read.table("Payroll.txt",header=T,stringsAsFactors=T)
> Payroll[1:6,]
  payroll employees industry
1  190.67        85        A
2  233.58       109        A
3  244.04       130        B
4  351.41       166        A
5  298.60       154        B
6  241.43       124        B
> attach(Payroll)
> plot(payroll~employees,col=industry)
[Figure: scatterplot of payroll against employees, colored by industry. Horizontal axis: employees (50 to 150). Vertical axis: payroll (150 to 350).]
The scatterplot shows that in general the more employees a
company has, the higher its monthly
payroll. Of course this is expected. It also shows that the
relationship between the number of
employees and the payroll is quite strong. For any given number
of employees, the variation in
payrolls for that number is small compared to the overall
variation in payrolls for all employment
levels. In this plot, the data from industry A is in black and that
from industry B is red. The plot
shows that for employees ≥ 100, payrolls for industry A are
generally greater than those for industry
B at the same level of employment.
2.3.3 Covariance and Correlation
If x and y are jointly distributed numeric variables, we define
their covariance as
$$\mathrm{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu(x))(y_i - \mu(y)).$$
If x and y come from samples of size n rather than the whole
population, replace the denominator
n by n − 1 and the population means µ(x), µ(y) by the sample
means x̄ , ȳ to obtain the sample
covariance. The sign of the covariance reveals something about
the relationship between x and y. If
the covariance is negative, values of x greater than µ(x) tend to
be accompanied by values of y less
than µ(y). Values of x less than µ(x) tend to go with values of y
greater than µ(y), so x and y tend
to deviate from their means in opposite directions. If cov(x, y)
> 0, they tend to deviate in the same
direction. The strength of these tendencies is not expressed by
the covariance because its magnitude
depends on the variability of each of the variables about its
mean. To correct this, we divide each
deviation in the sum by the standard deviation of the variable.
The resulting quantity is called the
correlation between x and y:
$$\mathrm{cor}(x, y) = \frac{\mathrm{cov}(x, y)}{\mathrm{sd}(x)\,\mathrm{sd}(y)}.$$
The correlation between payroll and employees in the example
above is 0.9782 (97.82 %).
Theorem 2.1. The correlation between x and y satisfies −1 ≤
cor(x, y) ≤ 1. cor(x, y) = 1 if and
only if there are constants a and b > 0 such that y = a+ bx.
cor(x, y) = −1 if and only if y = a+ bx
with b < 0.
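The definitions and the theorem can be checked numerically in R on a small made-up data set. R's cov() and sd() use the sample (n − 1) versions; the sketch shows that the correlation comes out the same whichever denominator is used, as long as it is used consistently.

```r
# Hypothetical paired data (not from the text).
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Population versions (denominator n), computed by hand.
pop_sd  <- function(v) sqrt(mean((v - mean(v))^2))
pop_cov <- mean((x - mean(x)) * (y - mean(y)))
r_pop   <- pop_cov / (pop_sd(x) * pop_sd(y))

# Sample versions (denominator n - 1), as computed by R.
samp_cov <- cov(x, y)
r_samp   <- samp_cov / (sd(x) * sd(y))

# Exact linear relationships give correlation +1 or -1.
r_up   <- cor(x, 3 + 2 * x)
r_down <- cor(x, 3 - 2 * x)

print(c(pop_cov = pop_cov, samp_cov = samp_cov, r_pop = r_pop, r_samp = r_samp))
print(c(r_up = r_up, r_down = r_down))
```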
A correlation close to 1 indicates a strong positive linear relationship
(tending to vary in the same direction
from their means) between x and y while a correlation close to
−1 indicates a strong negative linear relationship.
A correlation close to 0 indicates that there is little or no
linear relationship between x and y. In
this case, x and y are said to be (nearly) uncorrelated. There
might be a relationship between x and
y but it would be nonlinear. The picture below shows a
scatterplot of two variables that are clearly
related but very nearly uncorrelated.
> xs=runif(500,0,3*pi)
> ys=sin(xs)+rnorm(500,0,.15)
> cor(xs,ys)
[1] 0.004200081
> plot(xs,ys)
[Figure: scatterplot of ys against xs. Horizontal axis: xs (0 to 8). Vertical axis: ys (−1.0 to 1.5).]
Some sample scatterplots of variables with different population
correlations are shown below.
[Figure: four scatterplot panels titled cor(x,y)=0, cor(x,y)=0.3, cor(x,y)=−0.5 and cor(x,y)=0.9.]
2.3.4 Exercises
1. With the Air Pollution Filter Noise data, construct side by
side boxplots of the variable NOISE for
the different levels of the factor SIZE. Comment. Do the same
for NOISE and TYPE.
2. With the Payroll data, construct side by side boxplots of
"employees" versus "industry" and "payroll"
versus "industry". Are these boxplots as informative as the
color coded scatterplot in Section
2.3.2?
3. If you are using RStudio, click on the "Packages" tab, then the
checkbox next to the library MASS.
Click on the word MASS and then the data set "mammals" and
read about it. If you are using R
alone, in the Console window at the prompt > type
> data(mammals,package="MASS")
View the data with
> mammals
Make a scatterplot with the following commands and comment
on the result.
> attach(mammals)
> plot(body,brain)
Also make a scatterplot of the log transformed body and brain
weights.
> plot(log(body),log(brain))
A recently discovered hominid species, Homo floresiensis, had an
estimated average body weight of
25 kg. Based on the scatterplots, what would you guess its brain
weight to be?
4. Let x and y be jointly distributed numeric variables and let z
= a + by, where a and b are
constants. Show that cov(x, z) = b ∗ cov(x, y). Show that if b >
0, cor(x, z) = cor(x, y). What
happens if b < 0?
Chapter 3
Probability
3.1 Basic Definitions. Equally Likely Outcomes
Let a random experiment with sample space Ω be given. Recall
from Chapter 1 that Ω is the set
of all possible outcomes of the experiment. An event is a subset
of Ω. A probability measure is a
function which assigns numbers between 0 and 1 to events. If
the sample space Ω, the collection
of events, and the probability measure are all specified, they
constitute a probability model of the
random experiment.
The simplest probability models have a finite sample space Ω.
The collection of events is the collection
of all subsets of Ω and the probability of an event is
simply the proportion of all possible
outcomes that correspond to that event. In such models, we say
that the experiment has equally likely
outcomes. If the sample space has N elements, then each
elementary event {ω} consisting of a single
outcome has probability $1/N$. If E is a subset of Ω, then
$$\Pr(E) = \frac{\#(E)}{N}.$$
Here we introduce some notation that will be used throughout
this text. The probability measure
for a random experiment is most often denoted by the
abbreviation Pr, sometimes with subscripts.
Events will be denoted by upper case Latin letters near the
beginning of the alphabet. The expression
#(E) denotes the number of elements of the subset E.
Example 3.1. The Payroll data consists of 50 observations of 3
variables, "payroll", "employees" and
"industry". Suppose that a random experiment is to choose one
record from the Payroll data and
suppose that the experiment has equally likely outcomes. Then,
as the summary below shows, the
probability that industry A is selected is
$$\Pr(\text{industry} = A) = \frac{27}{50} = 0.54.$$
> # stringsAsFactors=T makes industry a factor, so that summary() counts its levels in R 4.0 and later
> Payroll=read.table("Payroll.txt",header=T,stringsAsFactors=T)
> summary(Payroll)
    payroll        employees      industry
 Min.   :129.1   Min.   : 26.00   A:27
 1st Qu.:167.8   1st Qu.: 71.25   B:23
 Median :216.1   Median :108.50
 Mean   :228.2   Mean   :106.42
 3rd Qu.:287.8   3rd Qu.:143.25
 Max.   :354.8   Max.   :172.00
In this example we use another common and convenient
notational convention. The event whose
probability we want is described in quasi-natural language as
"industry=A" rather than with the
formal but too cumbersome {ω ∈ Payroll | industry(ω) = A}. The
description "industry=A" refers to
the set of all possible outcomes of the experiment for which the
variable "industry" has the value "A".
This sort of informal description of an event will be used again
and again.
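This informal way of describing events carries over directly to R, where a condition such as industry == "A" produces a logical vector marking the outcomes that belong to the event. Since the Payroll file is not reproduced here, the sketch below builds a stand-in containing only an industry column with the counts shown in the summary (27 A and 23 B); the stand-in is an assumption made for illustration.

```r
# Stand-in for the industry column of the Payroll data (counts taken from the summary).
industry <- factor(rep(c("A", "B"), times = c(27, 23)))
N <- length(industry)        # number of equally likely outcomes

in_event <- industry == "A"  # TRUE exactly for the outcomes in the event "industry = A"
pr_A <- sum(in_event) / N    # #(E) / N

print(pr_A)
```

Because TRUE counts as 1 and FALSE as 0, mean(in_event) gives the same probability in one step.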
The assumption of equally likely outcomes is an assumption
about the selection procedure for obtaining
one record from the data. It is conceivable that a
selection method is employed for which
this assumption is not valid. If so, we should be able to discover
that it is invalid by replicating the
experiment sufficiently many times. This is a basic principle of
classical statistical inference. It relies
on a famous result of mathematical probability theory called the
law of large numbers. One version
of it is loosely stated as follows:
Law of Large Numbers: Let E be an event associated with a
random experiment and let Pr be the
probability measure of a true probability model of the
experiment. Suppose the experiment is replicated
n times and let
$$\widehat{\Pr}(E) = \frac{1}{n} \times \#\{\text{replications in which } E \text{ occurs}\}.$$
Then $\widehat{\Pr}(E) \to \Pr(E)$ as $n \to \infty$.
$\widehat{\Pr}(E)$ is called the empirical probability of E.
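The law of large numbers can be watched at work in a simulation. The R sketch below replicates the experiment of Example 3.1 on a stand-in population with 27 A's and 23 B's and computes the empirical probability of the event "industry = A" for increasing numbers of replications. The seed and the replication counts are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, only to make the run reproducible

# Stand-in population: 27 records from industry A and 23 from industry B.
industry <- rep(c("A", "B"), times = c(27, 23))
true_pr <- mean(industry == "A")

# Replicate the experiment n times and record the fraction of A's.
n_values <- c(10, 100, 1000, 100000)
empirical <- sapply(n_values, function(n) {
  draws <- sample(industry, size = n, replace = TRUE)  # n independent replications
  mean(draws == "A")
})

print(data.frame(n = n_values, empirical = empirical, true = true_pr))
```

The empirical probabilities for small n can be well away from 0.54, while those for large n should settle close to it.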
3.2 Combinations of Events
Events are related to other events by familiar set operations. Let
E1, E2, . . . be a finite or infinite
sequence of events. The union of E1 and E2 is the event
E1 ∪ E2 = {ω ∈ Ω|ω ∈ E1 or ω ∈ E2}.
More generally,
$$\bigcup_i E_i = E_1 \cup E_2 \cup \cdots = \{\omega \in \Omega \mid \omega \in E_i \text{ for some } i\}.$$
The intersection of E1 and E2 is the event
E1 ∩ E2 = {ω ∈ Ω|ω ∈ E1 and ω ∈ E2},
and, in general,
$$\bigcap_i E_i = E_1 \cap E_2 \cap \cdots = \{\omega \in \Omega \mid \omega \in E_i \text{ for all } i\}.$$
Sometimes we omit the intersection symbol ∩ and simply
conjoin the symbols for the events in an
intersection. In other words,
E1E2 . . . En = E1 ∩ E2 ∩ . . . ∩ En.
The complement of the event E is the event
∼E = {ω ∈ Ω|ω /∈ E}.
∼E occurs if and only if E does not occur. The event E1∼E2
occurs if and only if E1 occurs and E2
does not occur.
Finally, the entire sample space Ω is an event with complement
φ, the empty event. The empty
event never occurs. We need the empty event because it is
possible to formulate a perfectly sensible
description of an event which happens never to be satisfied. For
example, if Ω = Payroll the event
”employees < 25” is never satisfied, so it is the empty event.
We also have the subset relation between events. E1 ⊆ E2
means that if E1 occurs, then E2 occurs,
or in more familiar language, E1 is a subset of E2. For any
event E, it is true that φ ⊆ E ⊆ Ω.
E2 ⊇ E1 means the same as E1 ⊆ E2.
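For a finite sample space these operations correspond to base R's set functions. The sketch below uses a made-up sample space of the integers 1 to 10 and two made-up events.

```r
# Hypothetical sample space and events (not from the text).
omega <- 1:10
E1 <- omega[omega %% 2 == 0]  # "the outcome is even"
E2 <- omega[omega > 6]        # "the outcome is greater than 6"

E1_or_E2  <- union(E1, E2)      # E1 ∪ E2
E1_and_E2 <- intersect(E1, E2)  # E1 ∩ E2, also written E1E2
not_E1    <- setdiff(omega, E1) # complement ∼E1
E1_not_E2 <- setdiff(E1, E2)    # E1∼E2: E1 occurs and E2 does not

empty     <- omega[omega > 10]       # a sensible description that no outcome satisfies
is_subset <- all(E1_and_E2 %in% E1)  # E1E2 ⊆ E1

print(sort(E1_or_E2))
print(E1_and_E2)
print(not_E1)
print(E1_not_E2)
```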
3.2.1 Exercises
1. A random experiment consists of throwing a pair of dice, say
a red die and a green die, simultaneously.
They are standard 6-sided dice with one to six dots on
different faces. Describe the sample space.
2. For the same experiment, let E be the event that the sum of
the numbers of spots on the two dice
is an odd number. Write E as a subset of the sample space, i.e.,
list the outcomes in E.
3. List the outcomes in the event F = ”the sum of the spots is a
multiple of 3”.
4. Find ∼F , E ∪ F , EF = E ∩ F , and E∼F .
5. Assume that the outcomes of this experiment are equally
likely. Find the probability of each
of the events in # 4.
6. Show that for any events E1 and E2, if E1 ⊆ E2 then ∼E2 ⊆ ∼E1.
7. Load the "mammals" data set into your R workspace. In
RStudio you can click on the "Packages"
tab and then on the checkbox next to MASS. Without
RStudio, type
> data(mammals,package="MASS")
Attach the mammals data frame to your R search path with
> attach(mammals)
A random experiment is to choose one of the species listed in
this data set. All outcomes are equally
likely. You can obtain a list of the species in the event ”body >
200” with the command
> subset(mammals,body>200)
What is the probability of this event, i.e., what is the
probability that you randomly select a species
with a body weight greater than 200 kg?
8. What are the species in the event that the ratio of brain
weight to body weight is greater than 0.02?
Remember that brain weight is recorded in grams and body
weight in kilograms, so body weight must
be multiplied by 1000 to make the two weights comparable.
What is the probability of that event?
3.3 Rules for Probability Measures
The assumption of equally likely outcomes is the starting point
for the construction of many probability
models. There are many random experiments for which
this assumption is wrong. No matter
what other considerations are involved in choosing a probability
measure for a model of a random
experiment, there are certain rules that it must satisfy. They are:
1. 0 ≤ Pr(E) ≤ 1 for each event E.
2. Pr(Ω) = 1.
3. If E1, E2, . . . is a finite or infinite sequence of events such
that EiEj = φ for i ≠ j, then $\Pr\left(\bigcup_i E_i\right) = \sum_i \Pr(E_i)$.
If EiEj = φ for all i ≠ j we say that the events E1, E2,
. . . are pairwise disjoint.
These are the basic rules. There are other properties that may be
derived from them as theorems.
4. Pr(E∼F ) = Pr(E)− Pr(EF ) for all events E and F . In
particular, Pr(∼E) = 1− Pr(E)
5. Pr(φ) = 0.
6. Pr(E ∪ F ) = Pr(E) + Pr(F )− Pr(EF ) for all events E and F .
7. If E ⊆ F , then Pr(E) ≤ Pr(F ).
8. If E1 ⊆ E2 ⊆ . . . is an infinite sequence, then $\Pr\left(\bigcup_i E_i\right) = \lim_{i \to \infty} \Pr(E_i)$.
9. If E1 ⊇ E2 ⊇ . . . is an infinite sequence, then $\Pr\left(\bigcap_i E_i\right) = \lim_{i \to \infty} \Pr(E_i)$.
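Several of these rules can be verified directly in a small equally likely model. The R sketch below uses a made-up sample space of the integers 1 to 12; the names Pr, Eset and Fset are introduced only for this illustration.

```r
# Hypothetical equally likely model (not from the text).
omega <- 1:12
N <- length(omega)
Pr <- function(A) length(A) / N   # Pr(E) = #(E) / N

Eset <- omega[omega %% 2 == 0]    # multiples of 2
Fset <- omega[omega %% 3 == 0]    # multiples of 3
EF   <- intersect(Eset, Fset)     # multiples of 6

# Rule 4: Pr(E∼F) = Pr(E) - Pr(EF), and Pr(∼E) = 1 - Pr(E).
rule4  <- isTRUE(all.equal(Pr(setdiff(Eset, Fset)), Pr(Eset) - Pr(EF)))
rule4b <- isTRUE(all.equal(Pr(setdiff(omega, Eset)), 1 - Pr(Eset)))

# Rule 6: Pr(E ∪ F) = Pr(E) + Pr(F) - Pr(EF).
rule6 <- isTRUE(all.equal(Pr(union(Eset, Fset)), Pr(Eset) + Pr(Fset) - Pr(EF)))

# Rule 7: EF ⊆ E, so Pr(EF) <= Pr(E).
rule7 <- Pr(EF) <= Pr(Eset)

print(c(rule4 = rule4, rule4b = rule4b, rule6 = rule6, rule7 = rule7))
```

A numerical check like this is no substitute for a proof, but it is a handy way to catch a misremembered rule.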
3.4 Counting Outcomes. Sampling with and without Replacement
Suppose a random experiment with sample space Ω is replicated
n times. The result is a sequence
(ω1, ω2, . . . , ωn), where ωi ∈ Ω is the outcome of the ith
replication. This sequence is the outcome of
a so-called compound experiment, the sequential replications
of the basic experiment. The sample
space of this compound experiment is the n-fold Cartesian
product $\Omega^n = \Omega \times \Omega \times \cdots \times \Omega$. Now
suppose that the basic experiment is to choose one member of a
finite population with N elements.
We may identify the sample space Ω with the population.
Consider an outcome (ω1, ω2, . . . , ωn) of
the replicated experiment. There are N possibilities for ω1 and
for each of those there are N possibilities
for ω2 and for each pair ω1, ω2 there are N possibilities
for ω3, and so on. In all, there are
$N \times N \times \cdots \times N = N^n$ possibilities for the entire sequence
(ω1, ω2, . . . , ωn). If all outcomes of the
compound experiment are equally likely, then each has
probability $1/N^n$. Moreover, it can be shown
that the compound experiment has equally likely outcomes if
and only if the basic experiment has
equally likely outcomes, each with probability $1/N$.
Definition: An ordered random sample of size n with
replacement from a population of size N is a
randomly chosen sequence of length n of elements of the
population, where repetitions are possible
and each outcome (ω1, ω2, . . . , ωn) has probability $1/N^n$.
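For a very small population the sample space of ordered samples with replacement can be listed outright. The R sketch below does this for a made-up population of N = 3 elements and samples of size n = 2, then draws one such sample with sample(). The population and seed are arbitrary.

```r
# Hypothetical population with N = 3 elements (not from the text).
population <- c("a", "b", "c")
N <- length(population)
n <- 2

# Every ordered sequence of length n, repetitions allowed.
all_samples <- expand.grid(first = population, second = population)
count <- nrow(all_samples)   # should equal N^n
prob_each <- 1 / N^n         # probability of each sequence

# Draw one ordered random sample with replacement.
set.seed(2)  # arbitrary seed
one_sample <- sample(population, size = n, replace = TRUE)

print(all_samples)
print(c(count = count, prob_each = prob_each))
print(one_sample)
```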
Now suppose that we sample one element ω1 from the
population, with all N outcomes equally likely.
Next, we sample one element ω2 from the population excluding
the one already chosen. That is, we
randomly select one element from Ω ∼ {ω1} with all the
remaining N − 1 elements being equally
likely. Next, we randomly select one element ω3 from the N
− 2 elements of Ω ∼ {ω1, ω2}, and so
on until at last we select ωn from the remaining N − (n − 1)
elements of the population. The result is
a nonrepeating sequence (ω1, ω2, . . . , ωn) of length n from the
population. A nonrepeating sequence
of length n is also called a permutation of length n from the N
objects of the population. The total
number of such permutations is
$$N \times (N-1) \times \cdots \times (N-n+1) = \frac{N!}{(N-n)!}.$$
Obviously, we must have
n ≤ N for this to make sense. The number of permutations of
length N from a set of N objects is
N!. It can be shown that, with the sampling scheme described
above, all permutations of length n
are equally likely to result. Each has probability $(N-n)!/N!$ of
occurring.
Definition: An ordered random sample of size $n$ without replacement from a population of size $N$
is a randomly chosen nonrepeating sequence of length $n$ from the population where each outcome
$(\omega_1, \omega_2, \ldots, \omega_n)$ has probability $\frac{(N-n)!}{N!}$.
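The permutation count can be checked numerically. The sketch below is one possible way to do it in R; the helper name `n_perm` and the values of `N` and `n` are illustrative choices, not part of the text.

```r
# Number of nonrepeating sequences (permutations) of length n from N objects,
# computed as N * (N - 1) * ... * (N - n + 1).
n_perm <- function(N, n) {
  stopifnot(n <= N)
  prod(N - seq_len(n) + 1)
}

N <- 5
n <- 3
n_perm(N, n)                      # product form
factorial(N) / factorial(N - n)   # factorial form, same count

# One ordered random sample of size n without replacement from 1, ..., N.
set.seed(1)
sample(N, size = n, replace = FALSE)

# Probability of any one particular nonrepeating sequence.
1 / n_perm(N, n)
```

The product form avoids computing large factorials, which matters once $N$ is more than a few hundred.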
Most of the time when sampling without replacement from a finite population, we do not care about
the order of appearance of the elements of the sample. Two nonrepeating sequences with the same
elements in different order will be regarded as equivalent. In other words, we are concerned only
with the resulting subset of the population. Let us count the number of subsets of size $n$ from a
set of $N$ objects. Temporarily, let $C$ denote that number. Each subset of size $n$ can be ordered
in $n!$ different ways to give a nonrepeating sequence. Thus, the number of nonrepeating sequences
of length $n$ is $C$ times $n!$. So, $\frac{N!}{(N-n)!} = C \times n!$, i.e.,
$C = \frac{N!}{n!\,(N-n)!} = \binom{N}{n}$. This is the same binomial coefficient $\binom{N}{n}$
that appears in the binomial theorem: $(a+b)^N = \sum_{n=0}^{N} \binom{N}{n} a^n b^{N-n}$.
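The counting argument can be verified for small cases. In this R sketch, `combn` lists the subsets explicitly and `choose` is the built-in binomial coefficient; the particular values of `N`, `n`, `a` and `b` are arbitrary choices for illustration.

```r
N <- 6
n <- 2

# combn() returns every subset of size n as a column, so ncol() counts them.
n_subsets <- ncol(combn(N, n))

# Formula derived in the text, and the built-in binomial coefficient.
by_formula <- factorial(N) / (factorial(n) * factorial(N - n))
by_choose <- choose(N, n)
c(n_subsets, by_formula, by_choose)

# Binomial theorem check for arbitrary a and b.
a <- 2
b <- 3
k <- 0:N
all.equal(sum(choose(N, k) * a^k * b^(N - k)), (a + b)^N)
```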
Definition: A simple random sample of size $n$ from a population of size $N$ is a randomly chosen
subset of size $n$ from the population, where each subset has the same probability of being chosen,
namely $1/\binom{N}{n}$.
A simple random sample may be obtained by choosing objects
from the population sequentially, in
the manner described above, and then ignoring the order of their
selection.
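A minimal R sketch of this procedure follows: draw sequentially without replacement, then discard the order. The five-element population `pop` is hypothetical, and the repeated-sampling check is only a rough empirical illustration that the subsets come up about equally often.

```r
pop <- c("a", "b", "c", "d", "e")   # hypothetical population, N = 5
n <- 2

# Draw sequentially without replacement, then ignore the order of selection.
set.seed(2)
srs <- sort(sample(pop, size = n))

# Empirical check: over many draws, each of the choose(5, 2) = 10 subsets
# should appear with relative frequency close to 1/10.
reps <- 10000
draws <- replicate(reps, paste(sort(sample(pop, size = n)), collapse = ""))
freq <- table(draws) / reps
round(freq, 2)
```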
Example: The Birthday Problem
There are $N = 365$ days in a year. (Ignore leap years.) Suppose $n = 23$ people are chosen
randomly and their birthdays recorded. What is the probability that at least two of them have the
same birthday?
Solution: Arbitrarily numbering the people involved from 1 to $n$, their birthdays form an ordered
sample, with replacement, from the set of 365 birthdays. Therefore, each sequence has probability
$1/N^n$ of occurring. No two people have the same birthday if and only if the sequence is actually
nonrepeating. The number of nonrepeating sequences of birthdays is $N(N-1)\cdots(N-n+1)$.
Therefore, the event "No two people have the same birthday" has probability
$$\frac{N(N-1)\cdots(N-n+1)}{N^n} = \frac{N(N-1)\cdots(N-n+1)}{N \times N \times \cdots \times N} = \left(1-\frac{1}{N}\right)\left(1-\frac{2}{N}\right)\cdots\left(1-\frac{n-1}{N}\right).$$
With n = 23 and N = 365 we can find this in R as follows:
> prod(1-(1:22)/365)
[1] 0.4927028
So, there is about a 49% probability that no two people in a
random selection of 23 have the same
birthday. In other words, the probability that at least two share
a birthday is about 51%.
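The exact answer can be cross-checked by simulation. The following R sketch repeatedly draws 23 birthdays with replacement and records whether any birthday repeats; the seed and the number of replications `reps` are arbitrary choices, so the estimate will vary slightly from run to run.

```r
set.seed(123)
n <- 23
N <- 365
reps <- 20000

# Each replication: draw n birthdays with replacement, check for any repeat.
# anyDuplicated() returns 0 when all values are distinct.
shared <- replicate(reps, anyDuplicated(sample(N, n, replace = TRUE)) > 0)
estimate <- mean(shared)
estimate

# Exact probability of at least one shared birthday, from the product formula.
exact <- 1 - prod(1 - (1:(n - 1)) / N)
exact
```

The simulated proportion should land close to the exact value of about 0.507.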
An important, intuitively obvious principle in statistics is that if the sample size $n$ is very
small in comparison to the population size $N$, a sample taken without replacement may be regarded
as one taken with replacement, if it is mathematically convenient to do so. A sample of size 100
taken with replacement from a population of 100,000 has very little chance of containing a repeated
element. The probability of a repetition is about 5%.
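The 5% figure can be reproduced with the same product formula used in the birthday problem, now with $n = 100$ and $N = 100{,}000$. A short R check:

```r
n <- 100
N <- 100000

# Probability that a sample of size n drawn with replacement has no repeats.
p_no_repeat <- prod(1 - (1:(n - 1)) / N)

# Probability of at least one repetition.
p_repeat <- 1 - p_no_repeat
p_repeat
```

The result is a little under 0.05, consistent with the rounded value quoted above.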
3.4.1 Exercises
1. A red 6-sided die and a green 6-sided die are thrown simultaneously. The outcomes of this
experiment are equally likely. What is the probability that at least one of the dice lands with a 6
on its upper face?
2. A hand of 5-card draw poker is a simple random sample from
the standard deck of 52 cards. What
is the probability that a 5-card draw hand contains the ace of
hearts?
3. How many 5-card draw poker hands are there? In 5-card stud poker, the cards are dealt
sequentially and the order of appearance is important. How many 5-card stud poker hands are there?
4. Everybody in Ourtown is a fool or a knave or possibly both.
70% of the citizens are fools and 85%
are knaves. One citizen is randomly selected to be mayor. What
is the probability that the mayor is
both a fool and a knave?
5. A Martian year has 669 days. An R program for calculating
the probability of no repetitions in a
sample with replacement of n birthdays from a year of N days is
given below.
> birthdays <- function(n, N) prod(1 - seq_len(n - 1) / N)
To invoke this function with, for example, n=12 and N=400
simply type
> birthdays(12,400)
Check that the program gives the right answer for N=365 and
n=23. Then use it to find the number
n of Martians that must be sampled in order for the probability
of a repetition to be at least 0.5.
6. A standard deck of 52 cards has four queens. Two cards are
randomly drawn in succession, without
replacement, from a standard deck. What is the probability that
the first card is a queen? What is
the probability that the second card is a queen? If three cards
are drawn, what is the probability that
the third is a queen? Make a general conjecture. Prove it if you
can. (Hint: Does the probability
change if "queen" is replaced by "king" or "seven"?)
3.5 Conditional Probability
Definition: Let $A$ and $B$ be events with $\Pr(B) > 0$. The conditional probability of $A$, given
$B$, is:
$$\Pr(A \mid B) = \frac{\Pr(AB)}{\Pr(B)}. \tag{3.1}$$
$\Pr(A)$ itself is called the unconditional probability of $A$.
Example 3.2. R includes a tabulation by various factors of the
2201 passengers and crew on the
Titanic. Read about it by typing
> help(Titanic)
We are going to look at these factors two at a time, starting with
the steerage class of the passengers
and whether they survived or not.
> apply(Titanic,c(1,4),sum)
      Survived
Class   No Yes
  1st  122 203
  2nd  167 118
  3rd  528 178
  Crew 673 212
Suppose that a passenger or crew member is selected randomly. The unconditional probability that
that person survived is $\frac{711}{2201} = 0.323$.
> apply(Titanic,4,sum)
No Yes
1490 711
> apply(Titanic,1,sum)
1st 2nd 3rd Crew
325 285 706 885
Let us calculate the conditional probability of survival, given that the person selected was in a
first class cabin. If $A$ = "survived" and $B$ = "first class", then
$$\Pr(AB) = \frac{203}{2201} = 0.0922$$
and
$$\Pr(B) = \frac{325}{2201} = 0.1477.$$
Thus,
$$\Pr(A \mid B) = \frac{0.0922}{0.1477} = 0.625.$$
First class passengers had about a 62% chance of survival. For random sampling from a finite
population such as this, we can use the counts of occurrences of the events rather than their
probabilities because the denominators in $\Pr(AB)$ and $\Pr(B)$ cancel.
$$\Pr(A \mid B) = \frac{\#(AB)}{\#(B)} = \frac{203}{325} = 0.625$$
For comparison, look at the conditional probabilities of survival for the other classes.
$$\Pr(\text{survived} \mid \text{second class}) = \frac{118}{285} = 0.414$$
$$\Pr(\text{survived} \mid \text{third class}) = \frac{178}{706} = 0.252$$
$$\Pr(\text{survived} \mid \text{crew}) = \frac{212}{885} = 0.240$$
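All four conditional probabilities can also be obtained in one step by dividing each row of the class-by-survival table by its row total. The sketch below uses base R's `prop.table` on the built-in `Titanic` data; the variable names `counts` and `cond` are illustrative.

```r
# Class-by-survival counts from the built-in Titanic table.
counts <- apply(Titanic, c(1, 4), sum)

# Divide each row by its row total: Pr(survival status | class).
cond <- prop.table(counts, margin = 1)

# Conditional probability of survival for each class, to three decimals.
round(cond[, "Yes"], 3)
```

The values should agree with the hand calculations above for first class, second class, third class and crew.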
3.5.1 Relating Conditional and Unconditional Probabilities
The defining equation (3.1) for conditional probability can be written as
$$\Pr(AB) = \Pr(A \mid B)\Pr(B), \tag{3.2}$$
which is often more useful, especially when $\Pr(A \mid B)$ is easily determined from the
description of the experiment. There is an even more useful result sometimes called the law of
total probability. Let $B_1, B_2, \ldots, B_k$ be pairwise disjoint events such that each
$\Pr(B_i) > 0$ and $\Omega = B_1 \cup B_2 \cup \cdots \cup B_k$. Let $A$ be another event. Then,
$$\Pr(A) = \sum_{i=1}^{k} \Pr(A \mid B_i)\Pr(B_i). \tag{3.3}$$
This is quite easy to show since $A = (AB_1) \cup \cdots \cup (AB_k)$ is a union of pairwise
disjoint events and $\Pr(AB_i) = \Pr(A \mid B_i)\Pr(B_i)$.
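As a numerical illustration of (3.3), the Titanic data from Example 3.2 can be used with $A$ = "survived" and the $B_i$ taken to be the four classes (1st, 2nd, 3rd, Crew), which partition the 2201 people. This R sketch is only a check of the identity on that data set; the variable names are illustrative.

```r
counts <- apply(Titanic, c(1, 4), sum)   # class x survival counts
total <- sum(counts)                     # 2201 people in all

pr_class <- rowSums(counts) / total                       # Pr(B_i)
pr_surv_given_class <- counts[, "Yes"] / rowSums(counts)  # Pr(A | B_i)

# Right-hand side of (3.3): sum over classes of Pr(A | B_i) Pr(B_i).
rhs <- sum(pr_surv_given_class * pr_class)

# Left-hand side: Pr(A) computed directly as 711 / 2201.
lhs <- sum(counts[, "Yes"]) / total

c(lhs = lhs, rhs = rhs)
```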
Example 3.3. Diagnostic Tests:
Let $D$ denote the presence of a disease in a randomly selected member of a given population.
Suppose that there is a diagnostic test for the disease and let $T$ denote the event that a random
subject tests positive, that is, that the test indicates the disease. The conditional probability
$\Pr(T \mid D)$ is called the sensitivity of the test. The conditional probability
$\Pr(\sim T \mid \sim D)$ is called the specificity of the test. The unconditional probability
$\Pr(D)$ is called the prevalence of the disease in the population. A good test will have both a
high sensitivity and a high specificity, although there is usually a trade-off between the two.
The unconditional probability that a randomly chosen subject tests positive for the disease is
$$\Pr(T) = \Pr(T \mid D)\Pr(D) + \Pr(T \mid \sim D)\Pr(\sim D).$$
Suppose that the disease is rare, $\Pr(D) = 0.02$, and that the sensitivity of the test is
$\Pr(T \mid D) = 0.95$ with specificity $\Pr(\sim T \mid \sim D) = 0.85$. The false positive rate
for the test is $\Pr(T \mid \sim D) = 1 - \Pr(\sim T \mid \sim D) = 0.15$. The unconditional
probability of a positive test result is
$$\Pr(T) = 0.95 \times 0.02 + 0.15 \times 0.98 = 0.166.$$
So 16.6% of the population will test positive for the disease, even though only 2% have it.
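The arithmetic in this example can be wrapped in a small R function so that other prevalences are easy to try. The function name `pr_positive` and its argument names are hypothetical, introduced here only for illustration.

```r
# Unconditional probability of a positive test, by the law of total probability.
# prevalence  = Pr(D), sensitivity = Pr(T | D), specificity = Pr(~T | ~D).
pr_positive <- function(prevalence, sensitivity, specificity) {
  false_pos <- 1 - specificity   # Pr(T | ~D)
  sensitivity * prevalence + false_pos * (1 - prevalence)
}

# Values from Example 3.3.
pr_T <- pr_positive(prevalence = 0.02, sensitivity = 0.95, specificity = 0.85)
pr_T

# The function is vectorized, so several prevalences can be compared at once.
pr_positive(c(0.01, 0.02, 0.10), sensitivity = 0.95, specificity = 0.85)
```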
3.5.2 Bayes’ Rule
Bayes' rule is named for Thomas Bayes, an eighteenth century clergyman and part-time
mathematician. As given below, it is merely a relationship between conditional probabilities but it
is associated with Bayesian inference, a distinct philosophy and methodology of statistical
practice. Bayes' rule is often described as a rule for calculating conditional "posterior"
probabilities from unconditional "prior" probabilities.
Bayes' Rule: Let $A$ and $B_1, B_2, \ldots, B_k$ be given as in the law of total probability (3.3)
and assume $\Pr(A) > 0$. Then for each $i$,
$$\Pr(B_i \mid A) = \frac{\Pr(A \mid B_i)\Pr(B_i)}{\Pr(A)}, \tag{3.4}$$
where $\Pr(A)$ is calculated as in (3.3).
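To see (3.4) at work, it can be applied to the diagnostic test of Example 3.3 with $A = T$ (a positive test), $B_1 = D$ and $B_2 = \sim D$. The R sketch below is one possible implementation; the function name `bayes` and the labels `D` and `notD` are illustrative assumptions.

```r
# Posterior probabilities Pr(B_i | A) from likelihoods Pr(A | B_i)
# and priors Pr(B_i), where the B_i partition the sample space.
bayes <- function(likelihood, prior) {
  stopifnot(length(likelihood) == length(prior))
  joint <- likelihood * prior   # Pr(A | B_i) Pr(B_i) = Pr(A B_i)
  joint / sum(joint)            # divide by Pr(A), computed as in (3.3)
}

# Example 3.3: sensitivity 0.95, false positive rate 0.15, prevalence 0.02.
post <- bayes(likelihood = c(D = 0.95, notD = 0.15),
              prior      = c(D = 0.02, notD = 0.98))
round(post, 4)
```

With these inputs the posterior probability of disease given a positive test is 0.019/0.166, roughly 11%, so most positive results come from subjects without the disease.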
Example 3.4. Urn 1 contains 3 red balls and 5 white balls. Urn 2
contains 6 red balls and 3 white
balls. A fair coin is tossed (meaning that heads and tails are
equally likely). If a head turns up, a ball
is randomly selected from Urn 1. If a tail comes up, a ball is
randomly selected from Urn 2. Given
that a white ball was selected, what is the probability that it
came from Urn 1?
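One way to approach this question numerically is to apply Bayes' rule (3.4) with $B_1$ = "Urn 1", $B_2$ = "Urn 2" and $A$ = "white ball", and to compare the result with a simulation of the experiment. The R sketch below is a rough check under that reading of the problem; the seed and number of replications are arbitrary.

```r
# Bayes' rule: priors 1/2 each, Pr(white | Urn 1) = 5/8, Pr(white | Urn 2) = 3/9.
exact <- (5 / 8 * 1 / 2) / (5 / 8 * 1 / 2 + 3 / 9 * 1 / 2)
exact

# Simulation of the two-stage experiment.
set.seed(42)
reps <- 100000
urn <- sample(1:2, reps, replace = TRUE)   # fair coin chooses the urn
p_white <- ifelse(urn == 1, 5 / 8, 3 / 9)  # chance of white in the chosen urn
white <- runif(reps) < p_white             # TRUE when a white ball is drawn

# Among the draws that gave a white ball, the fraction that came from Urn 1.
estimate <- mean(urn[white] == 1)
estimate
```

The Bayes' rule value is 15/23, about 0.652, and the simulated proportion should be close to it.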

Blooming Together_ Growing a Community Garden Worksheet.docx
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 

Chapter 1 Background

Statistics is the art of summarizing data, depicting data, and extracting information from it. Statistics and the theory of probability are distinct subjects, although statistics depends on probability to quantify the strength of its inferences. The probability used in this course will be developed in Chapter 3 and throughout the text as needed. We begin by introducing some basic ideas and terminology.

1.1 Populations, Samples and Variables

A population is a set of individual elements whose collective properties are the subject of investigation. Usually, populations are large collections whose individual members cannot all be examined in detail. In statistical inference a manageable subset of the population is selected according to certain sampling procedures and properties of the subset are generalized to the entire population. These generalizations are accompanied by statements quantifying their accuracy and reliability. The selected subset is called a sample from the population.

Examples: (a) the population of registered voters in a congressional district, (b) the population of U.S. adult males, (c) the population of currently enrolled students at a certain large urban university, (d) the population of all transactions in the U.S. stock market
for the past month, (e) the population of all peak temperatures at points on the Earth’s surface over a given time interval.

Some samples from these populations might be: (a) the voters contacted in a pre-election telephone poll, (b) adult males interviewed by a TV reporter, (c) the dean’s list, (d) transactions recorded on the books of Smith Barney, (e) peak temperatures recorded at several weather stations. Clearly, for these particular samples, some generalizations from sample to population would be highly questionable.

A population variable is an attribute that has a value for each individual in the population. In other words, it is a function from the population to some set of possible values. It may be helpful to imagine a population as a spreadsheet with one row or record for each individual member. Along the ith row, the values of a number of attributes of the ith individual are recorded in different columns. The column headings of the spreadsheet can be thought of as the population variables. For example, if the population is the set of currently enrolled students at the urban university, some of the variables are academic classification, number of hours currently enrolled,
total hours taken, grade point average, gender, ethnic classification, major, and so on. Variables, such as these, that are defined for the same population are said to be jointly observed or jointly distributed.

1.2 Types of Variables

Variables are classified according to the kinds of values they have. The three basic types are numeric variables, factor variables, and ordered factor variables.

Numeric variables are those for which arithmetic operations such as addition and subtraction make sense. Numeric variables are often related to a scale of measurement and expressed in units, such as meters, seconds, or dollars.

Factor variables are those whose values are mere names, to which arithmetic operations do not apply. Factors usually have a small number of possible values. These values might be designated by numbers. If they are, the numbers that represent distinct values are chosen merely for convenience. The values of factors might also be letters, words, or pictorial symbols. Factor variables are sometimes called nominal variables or categorical variables.

Ordered factor variables are factors whose values are ordered in some natural and important way. Ordered factors are also called ordinal variables.

Some textbooks have a more elaborate classification of variables, with various subtypes. The three types above are enough for our purposes.

Examples: Consider the population of students currently enrolled at a large university. Each student has a residency status, either resident or nonresident. Residency status is an unordered factor variable. Academic classification is an ordered factor with
values “freshman”, “sophomore”, “junior”, “senior”, “post-baccalaureate” and “graduate student”. The number of hours enrolled is a numeric variable with integer values. The distance a student travels from home to campus is a numeric variable expressed in miles or kilometers. Home area code is an unordered factor variable whose values are designated by numbers.

1.3 Random Experiments and Sample Spaces

An experiment can be something as simple as flipping a coin or as complex as conducting a public opinion poll. A random experiment is one with the following two characteristics:

(1) The experiment can be replicated an indefinite number of times under essentially the same experimental conditions.
(2) There is a degree of uncertainty in the outcome of the experiment. The outcome may vary from replication to replication even though experimental conditions are the same.

When we say that an experiment can be replicated under the same conditions, we mean that controllable or observable conditions that we think might affect the outcome are the same. There may be hidden conditions that affect the outcome, but we cannot
account for them. Implicit in (1) is the idea that replications of a random experiment are independent, that is, the outcomes of some replications do not affect the outcomes of others. Obviously, a random experiment is an idealization of a real experiment. Some simple experiments, such as tossing a coin, approach this ideal closely while more complicated experiments may not.

The sample space of a random experiment is the set of all its possible outcomes. We use the Greek capital letter Ω (omega) to denote the sample space. There is some degree of arbitrariness in the description of Ω. It depends on how the outcomes of the experiment are represented symbolically.

Examples:

(a) Toss a coin. $\Omega = \{H, T\}$, where “H” denotes a head and “T” a tail. Another way of representing the outcome is to let the number 1 denote a head and 0 a tail (or vice-versa). If we do this, then $\Omega = \{0, 1\}$. In the latter representation the outcome of the experiment is just the number of heads.

(b) Toss a coin 5 times, i.e., replicate the experiment in (a) 5 times. An outcome of this experiment is a 5 term sequence of heads and tails. A typical outcome might be indicated by (H,T,T,H,H), or by (1,0,0,1,1). Even for this little experiment it is cumbersome to list all the outcomes, so we use a shorter notation

$$\Omega = \{(x_1, x_2, x_3, x_4, x_5) \mid x_i = 0 \text{ or } x_i = 1 \text{ for each } i\}.$$

(c) Select a student randomly from the population of all
currently enrolled students. The sample space is the same as the population. The word “randomly” is vague. We will define it later.

(d) Repeat the Michelson-Morley experiment to measure the speed of the Earth relative to the ether (which doesn’t exist, as we now know). The outcome of the experiment could conceivably be any nonnegative number, so we take

$$\Omega = [0, \infty) = \{x \mid x \text{ is a real number and } x \ge 0\}.$$

Uncertainty arises from the fact that this is a very delicate experiment with several sources of unpredictable error.

1.4 Computing in Statistics

Even moderately large data sets cannot be managed effectively without a computer and computer software. Furthermore, much of applied statistics is exploratory in nature and cannot be carried out by hand, even with a calculator. Spreadsheet programs, such as Microsoft Excel, are designed to manipulate data in tabular form and have functions for performing the common tasks of statistics. In addition, many add-ins are available, some of them free, for enhancing the graphical and statistical capabilities of spreadsheet programs. Some of the exercises and examples in this text make use of Excel with its built-in data analysis package. Because it is so common in the business world, it is important for students to have some experience with Excel or a similar program. The disadvantages of spreadsheet programs are their dependence on the spreadsheet data format with cell ranges as input for statistical functions, their lack of flexibility, and their relatively poor
graphics. Many highly sophisticated packages for statistics and data analysis are available. Some of the best known commercial packages are Minitab, SAS, SPSS, Splus, Stata, and Systat. The package used in this text is called R. It is an open source implementation of the same language used in Splus and may be downloaded free at http://www.r-project.org. After downloading and installing R we recommend that you download and install another free package called Rstudio. It can be obtained from http://www.rstudio.com. Rstudio makes importing data into R much easier and makes it easier to integrate R output with other programs. Detailed instructions on using R and Rstudio for the exercises will be provided.

Data files used in this course are from four sources. Some are local in origin and come from student or course data at the University of Houston. Others are simulated but made to look as realistic as possible. These and others are available at http://www.math.uh.edu/~charles/data.
Many data sets are included with R in the datasets library and other contributed packages. We will refer to them frequently. The main external sources of data are the data archives maintained by the Journal of Statistics Education, www.amstat.org/publications/jse, and the Statistical Science Web, http://www.statsci.org/datasets.html.

1.5 Exercises

1. Go to http://www.math.uh.edu/~charles/data. Examine the data set “Air Pollution Filter Noise”. Identify the variables and give their types.

2. Highlight the data in Air Pollution Filter Noise. Include the column headings but not the language preceding the column headings. Copy and paste the data into a plain text file, for example with Notepad in Windows. Import the text file into Excel or another spreadsheet program. Create a new folder or directory named “math3339” and save both files there.

3. Start R by double clicking on the big blue R icon on your desktop. Click on the file menu at the top of the R Gui window. Select “change dir . . . ”. In the window that opens next, find the name of the directory where you saved the text file and double click on the name of that directory. Suppose that you named your file “apfilternoise”. (Name it anything you like.) Import the file into R with the command

    > apfilternoise=read.table("apfilternoise.txt",header=TRUE)

and display it with the command

    > apfilternoise

Click on the file menu at the top again and select “Exit”. At the prompt to save your workspace, click “Yes”. If you open the folder where your work was saved you will see another big blue R icon. If you double click on it, R will start again and your previously saved workspace will be restored.

If you use Rstudio for this exercise you can import apfilternoise into R by clicking on the “Import Dataset” tab. This will open a window on your file system and allow you to select the file you saved in Exercise 2. The dialog box allows you to rename the data and make other minor changes before importing the data as a data frame in R.

4. If you are using Rstudio, click on the “Packages” tab and then the word “datasets”. Find the data set “airquality” and click on it. Read about it. If you are using R
alone, type

    > help(airquality)

at the command prompt > in the Console window. Then type

    > airquality

to view the data. Could “Month” and “Day” be considered ordered factors rather than numeric variables?

5. A random experiment consists of throwing a standard 6-sided die and noting the number of spots on the upper face. Describe the sample space of this experiment.

6. An experiment consists of replicating the experiment in Exercise 5 four times. Describe the sample space of this experiment. How many possible outcomes does this experiment have?

Chapter 2 Descriptive and Graphical Statistics

A large part of a statistician’s job consists of summarizing and presenting important features of data. Simply looking at a spreadsheet with 1000 rows and 50 columns conveys very little information. Most likely, the user of the data would rather see numerical and
  • 23. graphical summaries of how the values of different variables are distributed and how the variables are related to each other. This chapter concerns some of the most important ways of summarizing data.
2.1 Location Measures
2.1.1 The Mean
Suppose that x is the name of a numeric variable whose values are recorded either for the entire population or for a sample from that population. Let the n recorded values of x be denoted by $x_1, x_2, \ldots, x_n$. These are not necessarily distinct numbers. The mean or average of these values is
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
When the values of x for the entire population are included, it is customary to denote this quantity by $\mu(x)$ and call it the population mean. The mean is called a location measure partly because it is taken as a representative or central value of x. More importantly, it behaves in a certain way if we change the scale of measurement for values of x. Imagine that x is temperature recorded in degrees Celsius and we decide to change the unit of measurement to degrees Fahrenheit. If $y_i$ denotes the
  • 24. Fahrenheit temperature of the ith individual, then $y_i = 1.8x_i + 32$. In effect, we have defined a new variable y by the equation $y = 1.8x + 32$. The means of the new and old variables have the same relationship as the individual measurements have:
$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{n}\sum_{i=1}^{n} (1.8x_i + 32) = 1.8\bar{x} + 32.$$
In general, if a and b > 0 are constants and $y = a + bx$, then $\bar{y} = a + b\bar{x}$. Other location measures introduced below behave in the same way.
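The rule $\bar{y} = a + b\bar{x}$ can be checked numerically in R. The sketch below uses five made-up Celsius readings; the values and the names celsius and fahrenheit are assumptions for illustration, not data from the text.

```r
# Hypothetical Celsius readings, invented for this illustration.
celsius <- c(12.5, 15.0, 17.2, 21.4, 9.8)

# Change of units: y = a + b*x with a = 32 and b = 1.8.
fahrenheit <- 1.8 * celsius + 32

mean_c <- mean(celsius)
mean_f <- mean(fahrenheit)

# The mean of the transformed values agrees with the transformed mean
# (all.equal compares up to floating point rounding).
all.equal(mean_f, 1.8 * mean_c + 32)

# The median, another location measure, follows the same rule.
all.equal(median(fahrenheit), 1.8 * median(celsius) + 32)
```

Both comparisons should report TRUE, which is what makes the mean and the median location measures in the sense described above.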
  • 25. When there are repeated values of x, there is an equivalent formula for the mean. Let the m distinct values of x be denoted by $v_1, \ldots, v_m$. Let $n_i$ be the number of times $v_i$ is repeated and let $f_i = n_i/n$. Note that $\sum_{i=1}^{m} n_i = n$ and $\sum_{i=1}^{m} f_i = 1$. Then the average is given by
$$\bar{x} = \sum_{i=1}^{m} f_i v_i.$$
The number $n_i$ is the frequency of the value $v_i$ and $f_i$ is its relative frequency.
2.1.2 The Median and Other Quantiles
Let x be a numeric variable with values $x_1, x_2, \ldots, x_n$. Arrange the values in increasing order $x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)}$. The median of x is a number median(x) such that at least half the values of x are ≤ median(x) and at least half the values of x are ≥ median(x). This conveys the essential idea but unfortunately it may define an interval of numbers rather than a single number. The ambiguity is usually resolved by taking the median to be the midpoint of that interval. Thus, if n is odd, n = 2k+1, where k is a nonnegative integer,
  • 26. $\mathrm{median}(x) = x_{(k+1)}$, while if n is even, n = 2k, $\mathrm{median}(x) = \frac{x_{(k)} + x_{(k+1)}}{2}$.
Let p ∈ (0, 1) be a number between 0 and 1. The pth quantile of x is more commonly known as the 100pth percentile; e.g., the 0.8 quantile is the same as the 80th percentile. We define it as a number q(x, p) such that the fraction of values of x that are ≤ q(x, p) is at least p and the fraction of values of x that are ≥ q(x, p) is at least 1 − p. For example, at least 80 percent of the values of x are ≤ the 80th percentile of x and at least 20 percent of the values of x are ≥ its 80th percentile. Again, this may not define a unique number q(x, p). Software packages have rules for resolving the ambiguity, but the details are usually not important. The median is the 50th percentile, i.e., the 0.5 quantile. The 25th and 75th percentiles are called the first and third quartiles. The 10th, 20th, 30th, etc. percentiles are called the deciles. The median is a location measure as defined in the preceding section.
2.1.3 Trimmed Means
Trimmed means of a variable x are obtained by finding the mean of the values of x excluding a given
  • 27. percentage of the largest and smallest values. For example, the 5% trimmed mean is the mean of the values of x excluding the largest 5% of the values and the smallest 5% of the values. In other words, it is the mean of all the values between the 5th and 95th percentiles of x. A trimmed mean is a location measure.
2.1.4 Grouped Data
Sometimes large data sets are summarized by grouping values. Let x be a numeric variable with values $x_1, x_2, \ldots, x_n$. Let $c_0 < c_1 < \ldots < c_m$ be numbers such that all the values of x are between $c_0$ and $c_m$. For each i, let $n_i$ be the number of values of x (including repetitions) that are in the interval $(c_{i-1}, c_i]$, i.e., the number of indices j such that $c_{i-1} < x_j \le c_i$. A frequency table of x is a table showing the class intervals $(c_{i-1}, c_i]$ along with the frequencies $n_i$ with which the data values fall into each interval. Sometimes additional columns are included showing the relative frequencies $f_i = n_i/n$, the cumulative relative frequencies $F_i = \sum_{j \le i} f_j$, and the midpoints of the intervals.
Example 2.1. The data below are 50 measured reaction times in response to a sensory stimulus,
  • 28. arranged in increasing order. A frequency table is shown below the data.
0.12 0.30 0.35 0.37 0.44 0.57 0.61 0.62 0.71 0.80
0.88 1.02 1.08 1.12 1.13 1.17 1.21 1.23 1.35 1.41
1.42 1.42 1.46 1.50 1.52 1.54 1.60 1.61 1.68 1.72
1.86 1.90 1.91 2.07 2.09 2.16 2.17 2.20 2.29 2.32
2.39 2.47 2.60 2.86 3.43 3.43 3.77 3.97 4.54 4.73

Interval  Midpoint  ni   fi    Fi
(0,1]     0.5       11   0.22  0.22
(1,2]     1.5       22   0.44  0.66
(2,3]     2.5       11   0.22  0.88
(3,4]     3.5        4   0.08  0.96
(4,5]     4.5        2   0.04  1.00

If only a frequency table like the one above is given, the mean and median cannot be calculated exactly. However, they can be estimated. If we take the midpoint of an interval as a stand-in for all the values in that interval, then we can use the formula in the preceding section for calculating a mean with repeated values. Thus, in the example above, we would estimate the mean as
0.22(0.5) + 0.44(1.5) + 0.22(2.5) + 0.08(3.5) + 0.04(4.5) = 1.78
Estimating the median is a bit more difficult. By examining the cumulative frequencies Fi, we see that 22% of the data is less than or equal to 1 and 66% of the data is less than or equal to 2. Therefore, the median lies between 1 and 2. That is, it is 1 + a certain fraction of the distance from 1 to 2. A reasonable guess at that fraction is given by linear interpolation between the cumulative frequencies at 1 and 2. In other words, we estimate the median as
  • 29. $$1 + \frac{0.50 - 0.22}{0.66 - 0.22}(2 - 1) = 1.636.$$
A cruder estimate of the median is just the midpoint of the interval that contains the median, in this case 1.5. We leave it as an exercise to calculate the mean and median from the data of Example 2.1 and to compare them to these estimates.
2.1.5 Histograms
The figure below is a histogram of the reaction times.
> reacttimes=read.table("reacttimes.txt",header=T)
> hist(reacttimes$Times,breaks=0:5,xlab="Reaction Times",main=" ")
  • 30. [Figure: absolute frequency histogram of the reaction times; horizontal axis "Reaction Times" (0 to 5), vertical axis "Frequency" (0 to 20).]
The histogram is a graphical depiction of the grouped data. The end points $c_i$ of the class intervals are shown on the horizontal axis. This is an absolute frequency histogram because the heights of the vertical bars above the class intervals are the absolute frequencies $n_i$. A relative frequency histogram would show the relative frequencies $f_i$. A density histogram has bars whose heights are the relative frequencies divided by the lengths of the corresponding class intervals. Thus, in a density histogram the area of the bar is equal to the relative frequency. If all class intervals have the same length, these types of histograms all have the same shape and convey the same visual information.
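The grouped-data estimates of Section 2.1.4 and the density histogram described above can be reproduced in a few lines of R. This is only a sketch: the 50 reaction times of Example 2.1 are typed in directly so that it does not depend on the file reacttimes.txt, and the variable names (times, breaks, n_i, f_i, F_i, mid) are chosen here for illustration.

```r
# The 50 reaction times of Example 2.1, typed in directly.
times <- c(0.12, 0.30, 0.35, 0.37, 0.44, 0.57, 0.61, 0.62, 0.71, 0.80,
           0.88, 1.02, 1.08, 1.12, 1.13, 1.17, 1.21, 1.23, 1.35, 1.41,
           1.42, 1.42, 1.46, 1.50, 1.52, 1.54, 1.60, 1.61, 1.68, 1.72,
           1.86, 1.90, 1.91, 2.07, 2.09, 2.16, 2.17, 2.20, 2.29, 2.32,
           2.39, 2.47, 2.60, 2.86, 3.43, 3.43, 3.77, 3.97, 4.54, 4.73)
breaks <- 0:5                       # class boundaries c0 < c1 < ... < c5

# cut() forms right-closed intervals (c[i-1], c[i]] by default.
n_i <- as.vector(table(cut(times, breaks)))   # absolute frequencies
f_i <- n_i / length(times)                    # relative frequencies
F_i <- cumsum(f_i)                            # cumulative relative frequencies
mid <- (head(breaks, -1) + tail(breaks, -1)) / 2   # interval midpoints

# Grouped-data estimate of the mean: sum of f_i times the midpoints.
grouped_mean <- sum(f_i * mid)

# Grouped-data estimate of the median: linear interpolation inside the
# first interval whose cumulative relative frequency reaches 0.5.
k <- which(F_i >= 0.5)[1]
F_below <- if (k > 1) F_i[k - 1] else 0
grouped_median <- breaks[k] +
  (0.5 - F_below) / (F_i[k] - F_below) * (breaks[k + 1] - breaks[k])

# Density histogram: bar height = relative frequency / interval length.
h <- hist(times, breaks = breaks, freq = FALSE,
          xlab = "Reaction Times", main = "")
```

Here grouped_mean and grouped_median correspond to the hand calculations 1.78 and 1.636 in Section 2.1.4. Because every class interval has length 1, the bar heights stored in h$density coincide with the relative frequencies f_i.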
  • 31. 2.1.6 Robustness
A robust measure of location is one that is not affected by a few extremely large or extremely small values. Values of a numeric variable that lie a great distance from most of the other values are called outliers. Outliers might be the result of mistakes in measuring or recording data, perhaps from misplacing a decimal point. The mean is not a robust location measure. It can be affected significantly by a single outlier if that outlying value is extreme enough. Thus, if there is any doubt about the quality of the data, the median or a trimmed mean might be preferred to the mean as a reliable location measure. The median is very insensitive to outliers. A 5% trimmed mean is insensitive to outliers that make up no more than 5% of the data values.
2.1.7 The Five Number Summary
The five number summary is a convenient way of summarizing numeric data. The five numbers are the minimum value, the first quartile (25th percentile), the median, the third quartile (75th percentile), and the maximum value. Sometimes the mean is also included, which makes it a six number summary.
Example 2.2. The natural logarithms y of the data values x in Example 2.1 are, to two places:
-2.12 -1.20 -1.05 -0.99 -0.82 -0.56 -0.49 -0.48 -0.34 -0.22
-0.13  0.02  0.08  0.11  0.12  0.16  0.19  0.21  0.30  0.34
 0.35  0.35  0.38  0.40  0.42  0.43  0.47  0.48  0.52  0.54
 0.62  0.64  0.65  0.73  0.74  0.77  0.78  0.79  0.83  0.84
 0.87  0.90  0.96  1.05  1.23  1.23  1.33  1.38  1.51  1.55
  • 32. It is sometimes advantageous to transform data in some way, i.e., to define a new variable y as a function of the old variable x. In this case, we have transformed the reaction times x with the natural logarithm transformation. We might want to do this so that we can more easily apply certain statistical inference procedures you will learn about later. The six number summary of the transformed data y is:
> reacttimes=read.table("reacttimes.txt",header=T)
> summary(log(reacttimes$Times))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-2.12000  0.08605  0.42520  0.33710  0.78500  1.55400
2.1.8 The Mode
The mode of a variable is its most frequently occurring value. With numeric variables the mode is less important than the mean and median for descriptive purposes or for statistical inference. For factor variables the mode is the most natural way of choosing a "most representative" value. We hear this frequently in the media, in statements such as "Financial problems are the most common cause of marital strife". For grouped numeric data the modal class interval is the class interval having the highest absolute or relative frequency. In Example 2.1, the modal class interval is the interval (1,2].
2.1.9 Exercises
1. Find the mean and median of the reaction time data in
  • 33. Example 2.1.
2. Find the quartiles of the reaction time data. There is more than one acceptable answer.
3. The 40th value $x_{40}$ of the reaction time data has a value of 2.32. Replace that with 232.0. Recalculate the mean and median. Comment.
4. Construct a frequency table like the one in Example 2.1 for the log-transformed reaction times of Example 2.2. Use 5 class intervals of equal length beginning at -3 and ending at 2. Draw an absolute frequency histogram.
5. Estimate the mean and median of the grouped log-transformed reaction times by using the techniques discussed in Example 2.1. Compare your answers to the summary in Example 2.2.
6. Repeat exercises 1, 2, and the histogram of exercise 4 by using R.
7. Let x be a numeric variable with values $x_1, \ldots, x_{n-1}, x_n$. Let $\bar{x}_n$ be the average of all n values and let $\bar{x}_{n-1}$ be the average of $x_1, \ldots, x_{n-1}$. Show that
$$\bar{x}_n = \left(1 - \frac{1}{n}\right)\bar{x}_{n-1} + \frac{1}{n}x_n.$$
  • 34. What happens if $x_n \to \infty$ while all the other values of x are fixed?
2.2 Measures of Variability or Scale
2.2.1 The Variance and Standard Deviation
Let x be a population variable with values $x_1, x_2, \ldots, x_n$. Some of the values might be repeated. The variance of x is
$$\mathrm{var}(x) = \sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu(x))^2.$$
The standard deviation of x is
$$\mathrm{sd}(x) = \sigma = \sqrt{\mathrm{var}(x)}.$$
When $x_1, x_2, \ldots, x_n$ are values of x from a sample rather than the entire population, we modify the definition of the variance slightly, use a different notation, and call these objects the sample variance and standard deviation.
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s = \sqrt{s^2}.$$
  • 35. The reason for modifying the definition for the sample variance has to do with its properties as an estimate of the population variance.
Alternate algebraically equivalent formulas for the variance and sample variance are
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \mu(x)^2,$$
  • 36. $$s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right).$$
These are sometimes easier to use for hand computation. The standard deviation σ is called a measure of scale because of the way it behaves under linear transformations of the data. If a new variable y is defined by $y = a + bx$, where a and b are constants, then $\mathrm{sd}(y) = |b|\,\mathrm{sd}(x)$. For example, the standard deviation of Fahrenheit temperatures is 1.8 times the standard deviation of Celsius temperatures. The transformation $y = a + bx$ can be thought of as a rescaling operation, or a choice of a different system of measurement units, and the standard deviation takes account of it in a natural way.
2.2.2 The Coefficient of Variation
For a variable that has only positive values, it may be more important to measure the relative variability than the absolute variability. That is, the amount of variation should be compared to the mean value of the variable. The coefficient of variation for a population variable is defined as
$$\mathrm{cv}(x) = \frac{\mathrm{sd}(x)}{\mu(x)}.$$
  • 37. For a sample of values of x we substitute the sample standard deviation s and the sample average $\bar{x}$.
2.2.3 The Mean and Median Absolute Deviation
Suppose that you must choose a single number c to represent all the values of a variable x as accurately as possible. One measure of the overall error with which c represents the values of x is
$$g(c) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - c)^2}.$$
In the exercises, you are asked to show that this expression is minimized when $c = \bar{x}$. In other words, the single number which most accurately represents all the values is, by this criterion, the mean of the variable. Furthermore, the minimum possible overall error, by this criterion, is the standard deviation. However, this is not the only reasonable criterion. Another is
$$h(c) = \frac{1}{n}\sum_{i=1}^{n} |x_i - c|.$$
  • 38. It can be shown that this criterion is minimized when c = median(x). The minimum value of h(c) is called the mean absolute deviation from the median. It is a scale measure which is somewhat more robust (less affected by outliers) than the standard deviation, but still not very robust. A related very robust measure of scale is the median absolute deviation from the median, or mad:
$$\mathrm{mad}(x) = \mathrm{median}(|x - \mathrm{median}(x)|).$$
2.2.4 The Interquartile Range
The interquartile range of a variable x is the difference between its 75th and 25th percentiles.
$$\mathrm{IQR}(x) = q(x, 0.75) - q(x, 0.25).$$
It is a robust measure of scale which is important in the construction and interpretation of boxplots, discussed below. All of these measures of scale are valid for comparison of the
  • 39. "spread" or variability of numeric variables about a central value. In general, the greater their values, the more spread out the values of the variable are. Of course, the standard deviation, median absolute deviation, and interquartile range of a variable will be different numbers and one must be careful to compare like measures.
2.2.5 Boxplots
Boxplots are also called box and whisker diagrams. Essentially, a boxplot is a graphical representation of the five number summary. The boxplot below depicts the sensory response data of the preceding section without the log transformation.
> reacttimes=read.table("reacttimes.txt",header=T)
> boxplot(reacttimes$Times,horizontal=T,xlab="Reaction Times")
> summary(reacttimes)
     Times
 Min.   :0.120
 1st Qu.:1.090
 Median :1.530
 Mean   :1.742
 3rd Qu.:2.192
 Max.   :4.730
  • 40. [Figure: horizontal boxplot of the reaction times; horizontal axis "Reaction Times" (0 to 4).]
The central box in the diagram encloses the middle 50% of the numeric data. Its left and right boundaries mark the first and third quartiles. The boldface middle line in the box marks the median of the data. Thus, the interquartile range is the distance between the left and right boundaries of the central box. For construction of a boxplot, an outlier is defined as a data value whose distance from the nearest quartile is more than 1.5 times the interquartile range. Outliers are indicated by isolated points (tiny circles in this boxplot). The dashed lines extending outward from the quartiles are called the whiskers. They extend from the quartiles to the most extreme values in either direction that are not outliers. This boxplot shows a number of interesting things about the response time data.
(a) The median is about 1.5. The interquartile range is slightly more than 1.
(b) The three largest values are outliers. They lie a long way from most of the data. They might call
  • 41. for special investigation or explanation.
(c) The distribution of values is not symmetric about the median. The values in the lower half of the data are more crowded together than those in the upper half. This is shown by comparing the distances from the median to the two quartiles, by the lengths of the whiskers and by the presence of outliers at the upper end.
The asymmetry of the distribution of values is also evident in the histogram of the preceding section.
2.2.6 Exercises
1. Find the variance and standard deviation of the response time data. Treat it as a sample from a larger population.
2. Find the interquartile range and the median absolute deviation for the response time data.
3. In the response time data, replace the value $x_{40} = 2.32$ by 232.0. Recalculate the standard deviation, the interquartile range and the median absolute deviation and compare with the answers from problems 1 and 2.
  • 42. 4. Make a boxplot of the log-transformed reaction time data. Is the transformed data more symmetrically distributed than the original data?
5. Show that the function g(c) in section 2.2.3 is minimized when $c = \mu(x)$. Hint: Minimize $g(c)^2$.
6. Find the variance, standard deviation, IQR, mean absolute deviation and median absolute deviation of the variable "Ozone" in the data set "airquality". Use R or RStudio. You can address the variable Ozone directly if you attach the airquality data frame to the search path as follows:
> attach(airquality)
The R functions you will need are "sd" for standard deviation, "var" for variance, "IQR" for the interquartile range, and "mad" for the median absolute deviation. Ozone contains missing values (NA), so each of these functions needs the argument na.rm=T. Also, by default "mad" multiplies the median absolute deviation by the constant 1.4826; use the argument constant=1 to match the definition in section 2.2.3. There is no built-in function in R for the mean absolute deviation, but it is easy to obtain it.
> mean(abs(Ozone-median(Ozone,na.rm=T)),na.rm=T)
2.3 Jointly Distributed Variables
When two or more variables are jointly distributed, or jointly observed, it is important to understand how they are related and how closely they are related. We will first consider the case where one variable is numeric and the other is a factor.
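As a small illustration of a numeric variable observed jointly with a factor, the sketch below uses the built-in airquality data frame, treating Month as a factor and Ozone as the numeric variable. The copy aq, the month labels and the names med_by_month and iqr_by_month are choices made here for illustration; they are not from the text.

```r
# airquality ships with R in the "datasets" package.
aq <- airquality

# Treat Month (coded 5 to 9) as a factor with readable labels.
aq$Month <- factor(aq$Month, levels = 5:9,
                   labels = c("May", "Jun", "Jul", "Aug", "Sep"))

# Location and scale of Ozone within each level of the factor.
# na.rm = TRUE drops the missing Ozone readings before summarizing.
med_by_month <- tapply(aq$Ozone, aq$Month, median, na.rm = TRUE)
iqr_by_month <- tapply(aq$Ozone, aq$Month, IQR, na.rm = TRUE)
med_by_month
iqr_by_month

# One boxplot per level of the factor, drawn on a common scale.
# varwidth = TRUE makes box widths reflect the group sizes.
boxplot(Ozone ~ Month, data = aq, varwidth = TRUE,
        xlab = "Month", ylab = "Ozone")
```

Comparing the group medians against the sizes of the group IQRs gives a rough sense of how strongly the numeric variable depends on the factor.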
  • 43. 2.3.1 Side by Side Boxplots
Boxplots are particularly useful in quickly comparing the values of two or more sets of numeric data with a common scale of measurement and in investigating the relationship between a factor variable and a numeric variable. The figure below compares placement test scores for each of the letter grades in a sample of 179 students who took a particular math course in the same semester under the same instructor. The two jointly observed population variables are the placement test score and the letter grade received. The figure separates test scores according to the letter grade and shows a boxplot for each group of students. One would expect to see a decrease in the median test score as the letter grade decreases and that is confirmed by the picture. However, the decrease in median test scores from a letter grade of B to a grade of F is not very dramatic, especially compared to the size of the IQRs. This suggests that the placement test is not especially good at predicting a student's final grade in the course. Notice the two outliers. The outlier for the "W" group is clearly a mistake in recording data because the scale of scores only went to 100. (In R version 4.0.0 and later, the argument stringsAsFactors=T is needed so that Grade is read as a factor.)
> test.vs.grade=read.csv("test.vs.grade.csv",header=T,stringsAsFactors=T)
> attach(test.vs.grade)
> plot(Test~Grade,varwidth=T)
  • 44. [Figure: side by side boxplots of Test (vertical axis, 40 to 120) for each level of Grade (A, B, C, D, F, W).]
2.3.2 Scatterplots
Suppose x and y are two jointly distributed numeric variables. Whether we consider the entire population or a sample from the population, we have the same number n of observed values for each variable. If we plot the n points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ in a Cartesian plane, we obtain a scatterplot or a scatter diagram of the two variables. Below are the first 6 rows of the "Payroll" data set. The column labeled "payroll" is the total monthly payroll in thousands of dollars for each
  • 45. company listed. The column "employees" is the number of employees in each company and "industry" indicates which of two related industries the company is in. A scatterplot of all 50 values of the two variables "payroll" and "employees" is also shown. (In R version 4.0.0 and later, the argument stringsAsFactors=T is needed so that "industry" is read as a factor and can be used for coloring.)
> Payroll=read.table("Payroll.txt",header=T,stringsAsFactors=T)
> Payroll[1:6,]
  payroll employees industry
1  190.67        85        A
2  233.58       109        A
3  244.04       130        B
4  351.41       166        A
5  298.60       154        B
6  241.43       124        B
> attach(Payroll)
> plot(payroll~employees,col=industry)
  • 46. [Figure: scatterplot of payroll (vertical axis, 150 to 350) against employees (horizontal axis, 50 to 150), colored by industry.]
The scatterplot shows that in general the more employees a company has, the higher its monthly payroll. Of course this is expected. It also shows that the relationship between the number of employees and the payroll is quite strong. For any given number of employees, the variation in payrolls for that number is small compared to the overall variation in payrolls for all employment levels. In this plot, the data from industry A is in black and that from industry B is in red. The plot shows that for employees ≥ 100, payrolls for industry A are generally greater than those for industry
  • 47. B at the same level of employment.
2.3.3 Covariance and Correlation
If x and y are jointly distributed numeric variables, we define their covariance as
$$\mathrm{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu(x))(y_i - \mu(y)).$$
If x and y come from samples of size n rather than the whole population, replace the denominator n by n − 1 and the population means $\mu(x)$, $\mu(y)$ by the sample means $\bar{x}$, $\bar{y}$ to obtain the sample covariance. The sign of the covariance reveals something about the relationship between x and y. If the covariance is negative, values of x greater than $\mu(x)$ tend to be accompanied by values of y less than $\mu(y)$. Values of x less than $\mu(x)$ tend to go with values of y greater than $\mu(y)$, so x and y tend to deviate from their means in opposite directions. If cov(x, y) > 0, they tend to deviate in the same
  • 48. direction. The strength of these tendencies is not expressed by the covariance because its magnitude depends on the variability of each of the variables about its mean. To correct this, we divide each deviation in the sum by the standard deviation of the variable. The resulting quantity is called the correlation between x and y:
$$\mathrm{cor}(x, y) = \frac{\mathrm{cov}(x, y)}{\mathrm{sd}(x)\,\mathrm{sd}(y)}.$$
The correlation between payroll and employees in the example above is 0.9782 (97.82%).
Theorem 2.1. The correlation between x and y satisfies −1 ≤ cor(x, y) ≤ 1. cor(x, y) = 1 if and only if there are constants a and b > 0 such that y = a + bx. cor(x, y) = −1 if and only if y = a + bx with b < 0.
A correlation close to 1 indicates a strong positive relationship (tending to vary in the same direction from their means) between x and y while a correlation close to −1 indicates a strong negative relationship. A correlation close to 0 indicates that there is little or no linear relationship between x and y. In this case, x and y are said to be (nearly) uncorrelated. There might be a relationship between x and y but it would be nonlinear. The picture below shows a scatterplot of two variables that are clearly related but very nearly uncorrelated.
> xs=runif(500,0,3*pi)
  • 49. > ys=sin(xs)+rnorm(500,0,.15)
> cor(xs,ys)
[1] 0.004200081
> plot(xs,ys)
[Figure: scatterplot of ys (vertical axis, −1.0 to 1.5) against xs (horizontal axis, 0 to 8).]
  • 50. Some sample scatterplots of variables with different population correlations are shown below.
  • 51. [Figure: four sample scatterplots with panel titles cor(x,y)=0, cor(x,y)=0.3, cor(x,y)=−0.5 and cor(x,y)=0.9.]
  • 52. 2.3.4 Exercises
1. With the Air Pollution Filter Noise data, construct side by side boxplots of the variable NOISE for the different levels of the factor SIZE. Comment. Do the same for NOISE and TYPE.
2. With the Payroll data, construct side by side boxplots of "employees" versus "industry" and "payroll" versus "industry". Are these boxplots as informative as the color coded scatterplot in Section 2.3.2?
3. If you are using RStudio click on the "Packages" tab, then the checkbox next to the library MASS. Click on the word MASS and then the data set "mammals" and read about it. If you are using R alone, in the Console window at the prompt > type
> data(mammals,package="MASS")
View the data with
  • 53. > mammals
Make a scatterplot with the following commands and comment on the result.
> attach(mammals)
> plot(body,brain)
Also make a scatterplot of the log transformed body and brain weights.
> plot(log(body),log(brain))
A recently discovered hominid species Homo floresiensis had an estimated average body weight of 25 kg. Based on the scatterplots, what would you guess its brain weight to be?
4. Let x and y be jointly distributed numeric variables and let z = a + by, where a and b are constants. Show that cov(x, z) = b ∗ cov(x, y). Show that if b > 0, cor(x, z) = cor(x, y). What happens if b < 0?
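As a numerical companion to Section 2.3.3, the sketch below computes the population covariance and correlation directly from their definitions and compares them with R's built-in cov and cor, which use the sample (n − 1) denominator. The data values and the helper pop_sd are made up for illustration.

```r
# Hypothetical paired observations, invented for this illustration.
x <- c(2, 4, 5, 7, 9, 12)
y <- c(1.5, 3.9, 4.2, 8.1, 8.8, 13.0)
n <- length(x)

# Population covariance and standard deviation (denominator n).
pop_cov <- mean((x - mean(x)) * (y - mean(y)))
pop_sd  <- function(v) sqrt(mean((v - mean(v))^2))
pop_cor <- pop_cov / (pop_sd(x) * pop_sd(y))

# cov() uses the denominator n - 1, so it differs from pop_cov
# by the factor n / (n - 1).
all.equal(cov(x, y), pop_cov * n / (n - 1))

# In the correlation those factors cancel, so cor() agrees with pop_cor.
all.equal(cor(x, y), pop_cor)

# Theorem 2.1: an exact linear relationship gives correlation +1 or -1,
# depending on the sign of the slope b.
cor(x, 3 + 2 * x)
cor(x, 3 - 2 * x)
```

The last two calls should return 1 and −1 up to floating point rounding, while pop_cor for these made-up data lies strictly between them.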
  • 54. Chapter 3 Probability
3.1 Basic Definitions. Equally Likely Outcomes
Let a random experiment with sample space Ω be given. Recall from Chapter 1 that Ω is the set of all possible outcomes of the experiment. An event is a subset of Ω. A probability measure is a function which assigns numbers between 0 and 1 to events. If the sample space Ω, the collection of events, and the probability measure are all specified, they constitute a probability model of the random experiment.
The simplest probability models have a finite sample space Ω. The collection of events is the collection of all subsets of Ω and the probability of an event is simply the proportion of all possible outcomes that correspond to that event. In such models, we say that the experiment has equally likely outcomes. If the sample space has N elements, then each elementary event {ω} consisting of a single outcome has probability $\frac{1}{N}$. If E is a subset of Ω, then
$$\Pr(E) = \frac{\#(E)}{N}.$$
Here we introduce some notation that will be used throughout this text. The probability measure for a random experiment is most often denoted by the
  • 55. abbreviation Pr, sometimes with subscripts. Events will be denoted by upper case Latin letters near the beginning of the alphabet. The expression #(E) denotes the number of elements of the subset E.
Example 3.1. The Payroll data consists of 50 observations of 3 variables, "payroll", "employees" and "industry". Suppose that a random experiment is to choose one record from the Payroll data and suppose that the experiment has equally likely outcomes. Then, as the summary below shows, the probability that industry A is selected is
$$\Pr(\text{industry} = A) = \frac{27}{50} = 0.54.$$
(In R version 4.0.0 and later, the argument stringsAsFactors=T is needed for the summary to show the counts for "industry".)
> Payroll=read.table("Payroll.txt",header=T,stringsAsFactors=T)
> summary(Payroll)
    payroll        employees      industry
 Min.   :129.1   Min.   : 26.00   A:27
 1st Qu.:167.8   1st Qu.: 71.25   B:23
  • 56.  Median :216.1   Median :108.50
 Mean   :228.2   Mean   :106.42
 3rd Qu.:287.8   3rd Qu.:143.25
 Max.   :354.8   Max.   :172.00
In this example we use another common and convenient notational convention. The event whose probability we want is described in quasi-natural language as "industry=A" rather than with the formal but too cumbersome $\{\omega \in \text{Payroll} \mid \text{industry}(\omega) = A\}$. The description "industry=A" refers to the set of all possible outcomes of the experiment for which the variable "industry" has the value "A". This sort of informal description of an event will be used again and again.
The assumption of equally likely outcomes is an assumption about the selection procedure for obtaining one record from the data. It is conceivable that a selection method is employed for which this assumption is not valid. If so, we should be able to discover that it is invalid by replicating the experiment sufficiently many times. This is a basic principle of classical statistical inference. It relies on a famous result of mathematical probability theory called the law of large numbers. One version of it is loosely stated as follows:
Law of Large Numbers: Let E be an event associated with a random experiment and let Pr be the probability measure of a true probability model of the experiment. Suppose the experiment is replicated n times and let
  • 57. $$\widehat{\Pr}(E) = \frac{1}{n} \times \#\{\text{replications in which } E \text{ occurs}\}.$$
Then $\widehat{\Pr}(E) \to \Pr(E)$ as $n \to \infty$. $\widehat{\Pr}(E)$ is called the empirical probability of E.
3.2 Combinations of Events
Events are related to other events by familiar set operations. Let $E_1, E_2, \ldots$ be a finite or infinite sequence of events. The union of $E_1$ and $E_2$ is the event
$$E_1 \cup E_2 = \{\omega \in \Omega \mid \omega \in E_1 \text{ or } \omega \in E_2\}.$$
More generally,
$$\bigcup_i E_i = E_1 \cup E_2 \cup \ldots = \{\omega \in \Omega \mid \omega \in E_i \text{ for some } i\}.$$
The intersection of $E_1$ and $E_2$ is the event
$$E_1 \cap E_2 = \{\omega \in \Omega \mid \omega \in E_1 \text{ and } \omega \in E_2\},$$
and, in general,
$$\bigcap_i E_i = E_1 \cap E_2 \cap \ldots = \{\omega \in \Omega \mid \omega \in E_i \text{ for all } i\}.$$
Sometimes we omit the intersection symbol ∩ and simply conjoin the symbols for the events in an
  • 58. intersection. In other words,
$$E_1 E_2 \ldots E_n = E_1 \cap E_2 \cap \ldots \cap E_n.$$
The complement of the event E is the event
$${\sim}E = \{\omega \in \Omega \mid \omega \notin E\}.$$
∼E occurs if and only if E does not occur. The event $E_1{\sim}E_2$ occurs if and only if $E_1$ occurs and $E_2$ does not occur. Finally, the entire sample space Ω is an event with complement φ, the empty event. The empty event never occurs. We need the empty event because it is possible to formulate a perfectly sensible description of an event which happens never to be satisfied. For example, if Ω = Payroll the event "employees < 25" is never satisfied, so it is the empty event.
We also have the subset relation between events. $E_1 \subseteq E_2$ means that if $E_1$ occurs, then $E_2$ occurs, or in more familiar language, $E_1$ is a subset of $E_2$. For any event E, it is true that φ ⊆ E ⊆ Ω. $E_2 \supseteq E_1$ means the same as $E_1 \subseteq E_2$.
3.2.1 Exercises
1. A random experiment consists of throwing a pair of dice, say a red die and a green die, simultaneously. They are standard 6-sided dice with one to six dots on different faces. Describe the sample space.
2. For the same experiment, let E be the event that the sum of the numbers of spots on the two dice is an odd number. Write E as a subset of the sample space, i.e.,
  • 59. list the outcomes in E.
3. List the outcomes in the event F = "the sum of the spots is a multiple of 3".
4. Find ∼F, E ∪ F, EF = E ∩ F, and E∼F.
5. Assume that the outcomes of this experiment are equally likely. Find the probability of each of the events in exercise 4.
6. Show that for any events $E_1$ and $E_2$, if $E_1 \subseteq E_2$ then ${\sim}E_2 \subseteq {\sim}E_1$.
7. Load the "mammals" data set into your R workspace. In RStudio you can click on the "Packages" tab and then on the checkbox next to MASS. Without RStudio, type
> data(mammals,package="MASS")
Attach the mammals data frame to your R search path with
> attach(mammals)
A random experiment is to choose one of the species listed in this data set. All outcomes are equally likely. You can obtain a list of the species in the event "body > 200" with the command
  • 60. > subset(mammals, body>200)
    What is the probability of this event, i.e., what is the probability that you randomly select a species with a body weight greater than 200 kg?
    8. What are the species in the event that the ratio of brain weight to body weight is greater than 0.02? Remember that brain weight is recorded in grams and body weight in kilograms, so body weight must be multiplied by 1000 to make the two weights comparable. What is the probability of that event?
    3.3 Rules for Probability Measures
    The assumption of equally likely outcomes is the starting point for the construction of many probability models. However, for many random experiments this assumption is wrong. No matter what other considerations are involved in choosing a probability measure for a model of a random experiment, there are certain rules that it must satisfy. They are:
    1. 0 ≤ Pr(E) ≤ 1 for each event E.
    2. Pr(Ω) = 1.
    3. If E1, E2, . . . is a finite or infinite sequence of events such that EiEj = φ for i ≠ j, then $\Pr\left(\bigcup_i E_i\right) = \sum_i \Pr(E_i)$.
    If EiEj = φ for all i ≠ j we say that the events E1, E2, . . . are pairwise disjoint. These are the basic rules. There are other properties that may be
  • 61. derived from them as theorems.
    4. Pr(E∼F) = Pr(E) − Pr(EF) for all events E and F. In particular, Pr(∼E) = 1 − Pr(E).
    5. Pr(φ) = 0.
    6. Pr(E ∪ F) = Pr(E) + Pr(F) − Pr(EF) for all events E and F.
    7. If E ⊆ F, then Pr(E) ≤ Pr(F).
    8. If E1 ⊆ E2 ⊆ . . . is an infinite sequence, then $\Pr\left(\bigcup_i E_i\right) = \lim_{i\to\infty} \Pr(E_i)$.
    9. If E1 ⊇ E2 ⊇ . . . is an infinite sequence, then $\Pr\left(\bigcap_i E_i\right) = \lim_{i\to\infty} \Pr(E_i)$.
    3.4 Counting Outcomes. Sampling with and without Replacement
    Suppose a random experiment with sample space Ω is replicated n times. The result is a sequence (ω1, ω2, . . . , ωn), where ωi ∈ Ω is the outcome of the ith replication. This sequence is the outcome of a so-called compound experiment, namely the sequential replications of the basic experiment. The sample space of this compound experiment is the n-fold Cartesian product $\Omega^n = \Omega \times \Omega \times \cdots \times \Omega$. Now
  • 62. suppose that the basic experiment is to choose one member of a finite population with N elements. We may identify the sample space Ω with the population. Consider an outcome (ω1, ω2, . . . , ωn) of the replicated experiment. There are N possibilities for ω1, and for each of those there are N possibilities for ω2, and for each pair ω1, ω2 there are N possibilities for ω3, and so on. In all, there are $N \times N \times \cdots \times N = N^n$ possibilities for the entire sequence (ω1, ω2, · · · , ωn). If all outcomes of the compound experiment are equally likely, then each has probability $1/N^n$. Moreover, it can be shown that the compound experiment has equally likely outcomes if and only if the basic experiment has equally likely outcomes, each with probability $1/N$.
    Definition: An ordered random sample of size n with replacement from a population of size N is a randomly chosen sequence of length n of elements of the population, where repetitions are possible and each outcome (ω1, ω2, · · · , ωn) has probability $1/N^n$.
    Now suppose that we sample one element ω1 from the population, with all N outcomes equally likely. Next, we sample one element ω2 from the population excluding the one already chosen. That is, we randomly select one element from Ω∼{ω1} with all the remaining N − 1 elements being equally likely. Next, we randomly select one element ω3 from the N − 2 elements of Ω∼{ω1, ω2}, and so on until at last we select ωn from the remaining N − (n − 1) elements of the population. The result is a nonrepeating sequence (ω1, ω2, · · · , ωn) of length n from the population. A nonrepeating sequence of length n is also called a permutation of length n from the N objects of the population. The total
  • 63. number of such permutations is $N \times (N-1) \times \cdots \times (N-n+1) = \frac{N!}{(N-n)!}$. Obviously, we must have n ≤ N for this to make sense. The number of permutations of length N from a set of N objects is N!. It can be shown that, with the sampling scheme described above, all permutations of length n are equally likely to result. Each has probability $\frac{(N-n)!}{N!}$ of occurring.
    Definition: An ordered random sample of size n without replacement from a population of size N is a randomly chosen nonrepeating sequence of length n from the population, where each outcome (ω1, ω2, · · · , ωn) has probability $\frac{(N-n)!}{N!}$.
    Most of the time when sampling without replacement from a finite population, we do not care about the order of appearance of the elements of the sample. Two nonrepeating sequences with the same elements in different order will be regarded as equivalent. In other words, we are concerned only with the resulting subset of the population. Let us count the number of subsets of size n from a set of N objects. Temporarily, let C denote that number. Each subset of size n can be ordered in n! different ways to give a nonrepeating sequence. Thus, the number of nonrepeating sequences of length n is C times n!. So, $\frac{N!}{(N-n)!} = C \times n!$, i.e., $C = \frac{N!}{n!\,(N-n)!} = \binom{N}{n}$.
  • 64. This is the same binomial coefficient $\binom{N}{n}$ that appears in the binomial theorem: $(a+b)^N = \sum_{n=0}^{N} \binom{N}{n} a^n b^{N-n}$.
    Definition: A simple random sample of size n from a population of size N is a randomly chosen subset of size n from the population, where each subset has the same probability of being chosen, namely $1/\binom{N}{n}$
  • 65. . A simple random sample may be obtained by choosing objects from the population sequentially, in the manner described above, and then ignoring the order of their selection.
    Example: The Birthday Problem
    There are N = 365 days in a year. (Ignore leap years.) Suppose n = 23 people are chosen randomly and their birthdays recorded. What is the probability that at least two of them have the same birthday?
    Solution: Arbitrarily numbering the people involved from 1 to n, their birthdays form an ordered sample, with replacement, from the set of 365 birthdays. Therefore, each sequence has probability $1/N^n$ of occurring. No two people have the same birthday if and only if the sequence is actually nonrepeating. The number of nonrepeating sequences of birthdays is $N(N-1)\cdots(N-n+1)$. Therefore, the event "No two people have the same birthday" has probability
  • 66. $\frac{N(N-1)\cdots(N-n+1)}{N^n} = \frac{N(N-1)\cdots(N-n+1)}{N \times N \times \cdots \times N} = \left(1-\frac{1}{N}\right)\left(1-\frac{2}{N}\right)\cdots\left(1-\frac{n-1}{N}\right).$
    With n = 23 and N = 365 we can find this in R as follows:
    > prod(1-(1:22)/365)
    [1] 0.4927028
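As a cross-check, the same probability can be computed from the permutation count $N!/(N-n)!$ divided by $N^n$. The following is a minimal sketch of that check; the helper name `no_match_prob` is hypothetical (it is not part of the text), and it works on the log scale with base R's `lfactorial` on the assumption that `factorial(365)` would overflow double precision.

```r
# Hypothetical helper: probability that n birthdays drawn with replacement
# from N equally likely days are all different.
# Computes exp( log(N!) - log((N-n)!) - n*log(N) ) to avoid overflow.
no_match_prob <- function(n, N) {
  exp(lfactorial(N) - lfactorial(N - n) - n * log(N))
}

p_product   <- prod(1 - (1:22) / 365)   # product form used in the text
p_factorial <- no_match_prob(23, 365)   # factorial (permutation-count) form

# The two forms should agree up to floating-point rounding.
all.equal(p_product, p_factorial)
```

Both forms give approximately 0.4927.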
  • 67. So, there is about a 49% probability that no two people in a random selection of 23 have the same birthday. In other words, the probability that at least two share a birthday is about 51%.
    An important, intuitively obvious principle in statistics is that if the sample size n is very small in comparison to the population size N, a sample taken without replacement may be regarded as one taken with replacement, if it is mathematically convenient to do so. A sample of size 100 taken with replacement from a population of 100,000 has very little chance of repeating itself. The probability of a repetition is about 5%.
    3.4.1 Exercises
    1. A red 6-sided die and a green 6-sided die are thrown simultaneously. The outcomes of this experiment are equally likely. What is the probability that at least one of the dice lands with a 6 on its upper face?
    2. A hand of 5-card draw poker is a simple random sample from the standard deck of 52 cards. What
  • 68. is the probability that a 5-card draw hand contains the ace of hearts?
    3. How many 5-card draw poker hands are there? In 5-card stud poker, the cards are dealt sequentially and the order of appearance is important. How many 5-card stud poker hands are there?
    4. Everybody in Ourtown is a fool or a knave or possibly both. 70% of the citizens are fools and 85% are knaves. One citizen is randomly selected to be mayor. What is the probability that the mayor is both a fool and a knave?
    5. A Martian year has 669 days. An R program for calculating the probability of no repetitions in a sample with replacement of n birthdays from a year of N days is given below.
  • 69. > birthdays <- function(n, N) prod(1 - seq_len(n - 1)/N)
    To invoke this function with, for example, n=12 and N=400 simply type
    > birthdays(12, 400)
    Check that the program gives the right answer for N=365 and n=23. Then use it to find the smallest number n of Martians that must be sampled in order for the probability of a repetition to be at least 0.5.
    6. A standard deck of 52 cards has four queens. Two cards are randomly drawn in succession, without replacement, from a standard deck. What is the probability that the first card is a queen? What is the probability that the second card is a queen? If three cards are drawn, what is the probability that the third is a queen? Make a general conjecture. Prove it if you can. (Hint: Does the probability change if "queen" is replaced by "king" or "seven"?)
    3.5 Conditional Probability
    Definition: Let A and B be events with Pr(B) > 0. The
  • 70. conditional probability of A, given B, is:
    $\Pr(A \mid B) = \frac{\Pr(AB)}{\Pr(B)}.$ (3.1)
    Pr(A) itself is called the unconditional probability of A.
    Example 3.2. R includes a tabulation by various factors of the 2201 passengers and crew on the Titanic. Read about it by typing
    > help(Titanic)
    We are going to look at these factors two at a time, starting with the class of each person (1st, 2nd, 3rd, or crew) and whether they survived or not.
    > apply(Titanic, c(1,4), sum)
          Survived
    Class   No Yes
      1st  122 203
      2nd  167 118
      3rd  528 178
      Crew 673 212
  • 71. Suppose that a passenger or crew member is selected randomly. The unconditional probability that that person survived is 711/2201 = 0.323.
    > apply(Titanic, 4, sum)
      No  Yes
    1490  711
    > apply(Titanic, 1, sum)
     1st  2nd  3rd Crew
     325  285  706  885
  • 72. Let us calculate the conditional probability of survival, given that the person selected was in a first class cabin. If A = "survived" and B = "first class", then Pr(AB) = 203/2201 = 0.0922 and Pr(B) = 325/2201 = 0.1477. Thus,
  • 73. Pr(A | B) = 0.0922/0.1477 = 0.625.
    First class passengers had about a 62% chance of survival. For random sampling from a finite population such as this, we can use the counts of occurrences of the events rather than their probabilities because the denominators in Pr(AB) and Pr(B) cancel.
    Pr(A | B) = #(AB)/#(B) = 203/325 = 0.625
    For comparison, look at the conditional probabilities of survival for the other classes.
    Pr(survived | second class) = 118/285
  • 74. = 0.414
    Pr(survived | third class) = 178/706 = 0.252
    Pr(survived | crew) = 212/885 = 0.240
    3.5.1 Relating Conditional and Unconditional Probabilities
    The defining equation (3.1) for conditional probability can be written as
    Pr(AB) = Pr(A | B) Pr(B), (3.2)
  • 75. which is often more useful, especially when Pr(A | B) is easily determined from the description of the experiment. There is an even more useful result sometimes called the law of total probability. Let B1, B2, · · · , Bk be pairwise disjoint events such that each Pr(Bi) > 0 and Ω = B1 ∪ B2 ∪ · · · ∪ Bk. Let A be another event. Then,
    $\Pr(A) = \sum_{i=1}^{k} \Pr(A \mid B_i)\,\Pr(B_i).$ (3.3)
    This is quite easy to show since A = (AB1) ∪ · · · ∪ (ABk) is a union of pairwise disjoint events and Pr(ABi) = Pr(A | Bi) Pr(Bi).
    Example 3.3. Diagnostic Tests: Let D denote the presence of a disease in a randomly selected member of a given population. Suppose that there is a diagnostic test for the disease and let T denote the event that a random subject tests
  • 76. positive, that is, that the test indicates the disease. The conditional probability Pr(T | D) is called the sensitivity of the test. The conditional probability Pr(∼T | ∼D) is called the specificity of the test. The unconditional probability Pr(D) is called the prevalence of the disease in the population. A good test will have both a high sensitivity and a high specificity, although there is usually a trade-off between the two. The unconditional probability that a randomly chosen subject tests positive for the disease is
    Pr(T) = Pr(T | D) Pr(D) + Pr(T | ∼D) Pr(∼D).
    Suppose that the disease is rare, Pr(D) = 0.02, and that the sensitivity of the test is Pr(T | D) = 0.95 with specificity Pr(∼T | ∼D) = 0.85. The false positive rate for the test is Pr(T | ∼D) = 1 − Pr(∼T | ∼D) = 0.15. The unconditional probability of a positive test result is
    Pr(T) = 0.95 × 0.02 + 0.15 × 0.98 = 0.166.
    Thus 16.6% of the population will test positive for the disease, even though only 2% have it.
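The arithmetic of Example 3.3 can be reproduced in R. This is only a sketch: the variable names are chosen here for illustration, and the numerical values are the ones stated in the example.

```r
# Values taken from Example 3.3
prevalence  <- 0.02   # Pr(D)
sensitivity <- 0.95   # Pr(T | D)
specificity <- 0.85   # Pr(~T | ~D)

# False positive rate: Pr(T | ~D) = 1 - Pr(~T | ~D)
false_positive_rate <- 1 - specificity

# Law of total probability (3.3) over the partition {D, ~D}
p_positive <- sensitivity * prevalence +
  false_positive_rate * (1 - prevalence)
p_positive
```

Changing `prevalence` while holding the other two values fixed shows how strongly the share of positive tests depends on how common the disease is.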
  • 77. 3.5.2 Bayes' Rule
    Bayes' rule is named for Thomas Bayes, an eighteenth century clergyman and part-time mathematician. As given below, it is merely a relationship between conditional probabilities, but it is associated with Bayesian inference, a distinct philosophy and methodology of statistical practice. Bayes' rule is often described as a rule for calculating conditional "posterior" probabilities from unconditional "prior" probabilities.
    Bayes' Rule: Let A and B1, B2, · · · , Bk be given as in the law of total probability (3.3) and assume Pr(A) > 0. Then for each i,
    $\Pr(B_i \mid A) = \frac{\Pr(A \mid B_i)\,\Pr(B_i)}{\Pr(A)},$ (3.4)
    where Pr(A) is calculated as in (3.3).
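To see (3.3) and (3.4) working together, the sketch below applies them to the diagnostic test of Example 3.3. The function name `bayes_posterior` and its argument names are hypothetical, introduced only for this illustration; the numbers are those given in the example.

```r
# Hypothetical helper: posterior probabilities Pr(B_i | A) computed from
# priors Pr(B_i) and likelihoods Pr(A | B_i), following (3.3) and (3.4).
bayes_posterior <- function(prior, likelihood) {
  stopifnot(length(prior) == length(likelihood),
            isTRUE(all.equal(sum(prior), 1)))
  joint <- likelihood * prior   # Pr(A | B_i) Pr(B_i) = Pr(A B_i)
  joint / sum(joint)            # divide by Pr(A), the sum in (3.3)
}

# Example 3.3: the partition is {D, ~D} and A is the event "tests positive".
posterior <- bayes_posterior(prior      = c(D = 0.02, notD = 0.98),
                             likelihood = c(D = 0.95, notD = 0.15))
posterior
```

With these inputs Pr(D | T) = 0.019/0.166, which is roughly 0.11. Because the disease is rare, most positive results in this example come from subjects who do not have it.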
  • 78. Example 3.4. Urn 1 contains 3 red balls and 5 white balls. Urn 2 contains 6 red balls and 3 white balls. A fair coin is tossed (meaning that heads and tails are equally likely). If a head turns up, a ball is randomly selected from Urn 1. If a tail comes up, a ball is randomly selected from Urn 2. Given that a white ball was selected, what is the probability that it came from Urn 1?