Chapter 1
Background
Statistics is the art of summarizing data, depicting data, and
extracting information from it. Statistics
and the theory of probability are distinct subjects, although
statistics depends on probability to
quantify the strength of its inferences. The probability used in
this course will be developed in Chapter
3 and throughout the text as needed. We begin by introducing
some basic ideas and terminology.
1.1 Populations, Samples and Variables
A population is a set of individual elements whose collective
properties are the subject of investigation.
Usually, populations are large collections whose individual
members cannot all be examined in detail.
In statistical inference a manageable subset of the population is
selected according to certain sampling
procedures and properties of the subset are generalized to the
entire population. These generalizations
are accompanied by statements quantifying their accuracy and
reliability. The selected subset is called
a sample from the population.
Examples:
(a) the population of registered voters in a congressional
district,
(b) the population of U.S. adult males,
(c) the population of currently enrolled students at a certain
large urban university,
(d) the population of all transactions in the U.S. stock market
for the past month,
(e) the population of all peak temperatures at points on the
Earth’s surface over a given time interval.
Some samples from these populations might be:
(a) the voters contacted in a pre-election telephone poll,
(b) adult males interviewed by a TV reporter,
(c) the dean’s list,
(d) transactions recorded on the books of Smith Barney,
(e) peak temperatures recorded at several weather stations.
Clearly, for these particular samples, some generalizations from
sample to population would be highly
questionable.
A population variable is an attribute that has a value for each
individual in the population. In
other words, it is a function from the population to some set of
possible values. It may be helpful to
imagine a population as a spreadsheet with one row or record
for each individual member. Along the
ith row, the values of a number of attributes of the ith
individual are recorded in different columns.
The column headings of the spreadsheet can be thought of as the
population variables. For example,
if the population is the set of currently enrolled students at the
urban university, some of the variables
are academic classification, number of hours currently enrolled,
total hours taken, grade point average,
gender, ethnic classification, major, and so on. Variables, such
as these, that are defined for the same
population are said to be jointly observed or jointly distributed.
1.2 Types of Variables
Variables are classified according to the kinds of values they
have. The three basic types are numeric
variables, factor variables, and ordered factor variables.
Numeric variables are those for which arithmetic
operations such as addition and subtraction make sense.
Numeric variables are often related to
a scale of measurement and expressed in units, such as meters,
seconds, or dollars. Factor variables
are those whose values are mere names, to which arithmetic
operations do not apply. Factors usually
have a small number of possible values. These values might be
designated by numbers. If they are, the
numbers that represent distinct values are chosen merely for
convenience. The values of factors might
also be letters, words, or pictorial symbols. Factor variables are
sometimes called nominal variables
or categorical variables. Ordered factor variables are factors
whose values are ordered in some natural
and important way. Ordered factors are also called ordinal
variables. Some textbooks have a more
elaborate classification of variables, with various subtypes. The
three types above are enough for our
purposes.
Examples: Consider the population of students currently
enrolled at a large university. Each student
has a residency status, either resident or nonresident.
Residency status is an unordered factor
variable. Academic classification is an ordered factor with
values “freshman”, “sophomore”, “junior”,
“senior”, “post-baccalaureate” and “graduate student”. The
number of hours enrolled is a numeric
variable with integer values. The distance a student travels from
home to campus is a numeric variable
expressed in miles or kilometers. Home area code is an
unordered factor variable whose values
are designated by numbers.
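The three variable types map naturally onto R's data frame columns. The following is a minimal sketch, assuming a handful of invented student records (the column names and values are hypothetical and are not taken from any data set used in this text). It shows an unordered factor, an ordered factor, a numeric variable, and a factor whose values happen to be designated by numbers.

```r
# Hypothetical records for five students (invented for illustration).
students <- data.frame(
  residency = factor(c("resident", "nonresident", "resident",
                       "resident", "nonresident")),
  classification = factor(
    c("freshman", "junior", "senior", "sophomore", "junior"),
    levels = c("freshman", "sophomore", "junior", "senior"),
    ordered = TRUE                      # the levels have a natural order
  ),
  hours = c(15L, 12L, 9L, 16L, 13L),    # numeric variable with integer values
  area_code = factor(c(713, 281, 713, 832, 281))  # numbers used only as names
)

str(students)                        # one column per population variable
students$classification < "senior"   # comparisons make sense for ordered factors
mean(students$hours)                 # arithmetic makes sense for numeric variables
```

Storing the area code as a factor keeps R from treating it as a quantity, so an accidental average of area codes cannot happen silently.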
1.3 Random Experiments and Sample Spaces
An experiment can be something as simple as flipping a coin or
as complex as conducting a public
opinion poll. A random experiment is one with the following
two characteristics:
(1) The experiment can be replicated an indefinite number of
times under essentially the same experimental
conditions.
(2) There is a degree of uncertainty in the outcome of the
experiment. The outcome may vary
from replication to replication even though experimental
conditions are the same.
When we say that an experiment can be replicated under the
same conditions, we mean that controllable
or observable conditions that we think might affect the
outcome are the same. There may be
hidden conditions that affect the outcome, but we cannot
account for them. Implicit in (1) is the idea
that replications of a random experiment are independent, that
is, the outcomes of some replications
do not affect the outcomes of others. Obviously, a random
experiment is an idealization of a real
experiment. Some simple experiments, such as tossing a coin,
approach this ideal closely while more
complicated experiments may not.
The sample space of a random experiment is the set of all its
possible outcomes. We use the Greek
capital letter Ω (omega) to denote the sample space. There is
some degree of arbitrariness in the
description of Ω. It depends on how the outcomes of the
experiment are represented symbolically.
Examples:
(a) Toss a coin. Ω = {H,T}, where “H” denotes a head and “T” a
tail. Another way of representing
the outcome is to let the number 1 denote a head and 0 a
tail (or vice-versa). If we do this,
then Ω = {0, 1}. In the latter representation the outcome of the
experiment is just the number of heads.
(b) Toss a coin 5 times, i.e., replicate the experiment in (a) 5
times. An outcome of this experiment
is a 5 term sequence of heads and tails. A typical outcome might
be indicated by (H,T,T,H,H), or
by (1,0,0,1,1). Even for this little experiment it is cumbersome
to list all the outcomes, so we use a
shorter notation
Ω = {(x1, x2, x3, x4, x5) | xi = 0 or xi = 1 for each i} .
(c) Select a student randomly from the population of all
currently enrolled students. The sample
space is the same as the population. The word “randomly” is
vague. We will define it later.
(d) Repeat the Michelson-Morley experiment to measure the
speed of the Earth relative to the ether
(which doesn’t exist, as we now know). The outcome of the
experiment could conceivably be any
nonnegative number, so we take Ω = [0,∞) = {x | x is a real
number and x ≥ 0}. Uncertainty arises
from the fact that this is a very delicate experiment with several
sources of unpredictable error.
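For a small experiment such as example (b), the sample space can be listed by computer. The sketch below assumes the 0/1 coding of tails and heads and uses the base R function expand.grid to build every 5-term sequence. The object name omega is our own choice.

```r
# Each of the 5 tosses is coded 1 for a head and 0 for a tail.
omega <- expand.grid(x1 = 0:1, x2 = 0:1, x3 = 0:1, x4 = 0:1, x5 = 0:1)

nrow(omega)             # number of outcomes in the sample space, 2^5
head(omega)             # the first few outcomes, one per row
table(rowSums(omega))   # how many outcomes have 0, 1, ..., 5 heads
```

Each row of omega is one outcome (x1, x2, x3, x4, x5), which matches the set notation used in example (b).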
1.4 Computing in Statistics
Even moderately large data sets cannot be managed effectively
without a computer and computer
software. Furthermore, much of applied statistics is exploratory
in nature and cannot be carried out
by hand, even with a calculator. Spreadsheet programs, such as
Microsoft Excel, are designed to
manipulate data in tabular form and have functions for
performing the common tasks of statistics. In
addition, many add-ins are available, some of them free, for
enhancing the graphical and statistical
capabilities of spreadsheet programs. Some of the exercises and
examples in this text make use of
Excel with its built-in data analysis package. Because it is so
common in the business world, it is
important for students to have some experience with Excel or a
similar program.
The disadvantages of spreadsheet programs are their
dependence on the spreadsheet data format
with cell ranges as input for statistical functions, their lack of
flexibility, and their relatively poor
graphics. Many highly sophisticated packages for statistics and
data analysis are available. Some of
the best known commercial packages are Minitab, SAS, SPSS,
Splus, Stata, and Systat. The package
used in this text is called R. It is an open source implementation
of the same language used in Splus
and may be downloaded free at
http://www.r-project.org .
After downloading and installing R we recommend that you
download and install another free package
called Rstudio. It can be obtained from
http://www.rstudio.com .
Rstudio makes importing data into R much easier and makes it
easier to integrate R output with
other programs. Detailed instructions on using R and Rstudio
for the exercises will be provided.
Data files used in this course are from four sources. Some are
local in origin and come from student
or course data at the University of Houston. Others are
simulated but made to look as realistic as
possible. These and others are available at
http://www.math.uh.edu/~charles/data .
Many data sets are included with R in the datasets library and
other contributed packages. We will
refer to them frequently. The main external sources of data are
the data archives maintained by the
Journal of Statistics Education,
http://www.amstat.org/publications/jse
and the Statistical Science Web,
http://www.statsci.org/datasets.html .
1.5 Exercises
1. Go to http://www.math.uh.edu/~charles/data. Examine the
data set “Air Pollution Filter Noise”.
Identify the variables and give their types.
2. Highlight the data in Air Pollution Filter Noise. Include the
column headings but not the language
preceding the column headings. Copy and paste the data into a
plain text file, for example with
Notepad in Windows. Import the text file into Excel or another
spreadsheet program. Create a new
folder or directory named “math3339” and save both files there.
3. Start R by double clicking on the big blue R icon on your
desktop. Click on the file menu at the
top of the R GUI window. Select “Change dir...”. In the
window that opens next, find the name of
the directory where you saved the text file and double click on
the name of that directory. Suppose
that you named your file “apfilternoise”. (Name it anything you
like.) Import the file into R with the
command
> apfilternoise=read.table("apfilternoise.txt",header=T)
and display it with the command
> apfilternoise
Click on the file menu at the top again and select “Exit”. At the
prompt to save your workspace, click
“Yes”. If you open the folder where your work was saved you
will see another big blue R icon. If you
double click on it, R will start again and your previously saved
workspace will be restored.
If you use Rstudio for this exercise you can import apfilternoise
into R by clicking on the “Import
Dataset” tab. This will open a window on your file system and
allow you to select the file you saved
in Exercise 2. The dialog box allows you to rename the data and
make other minor changes before
importing the data as a data frame in R.
4. If you are using Rstudio, click on the “Packages” tab and
then the word “datasets”. Find the data
set “airquality” and click on it. Read about it. If you are using R
alone, type
> help(airquality)
at the command prompt > in the Console window.
Then type
> airquality
to view the data. Could “Month” and “Day” be considered
ordered factors rather than numeric variables?
5. A random experiment consists of throwing a standard 6-sided
die and noting the number of spots
on the upper face. Describe the sample space of this experiment.
6. An experiment consists of replicating the experiment in
exercise 5 four times. Describe the sample
space of this experiment. How many possible outcomes does
this experiment have?
Chapter 2
Descriptive and Graphical Statistics
A large part of a statistician’s job consists of summarizing and
presenting important features of data.
Simply looking at a spreadsheet with 1000 rows and 50 columns
conveys very little information. Most
likely, the user of the data would rather see numerical and
graphical summaries of how the values
of different variables are distributed and how the variables are
related to each other. This chapter
concerns some of the most important ways of summarizing data.
2.1 Location Measures
2.1.1 The Mean
Suppose that x is the name of a numeric variable whose values
are recorded either for the entire
population or for a sample from that population. Let the n
recorded values of x be denoted by
x1, x2, . . . , xn. These are not necessarily distinct numbers. The
mean or average of these values is
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
When the values of x for the entire population are included, it is
customary to denote this quantity
by µ(x) and call it the population mean. The mean is called a
location measure partly because it is
taken as a representative or central value of x. More
importantly, it behaves in a certain way if we
change the scale of measurement for values of x. Imagine that x
is temperature recorded in degrees
Celsius and we decide to change the unit of measurement to
degrees Fahrenheit. If yi denotes the
Fahrenheit temperature of the ith individual, then yi = 1.8xi +
32. In effect, we have defined a new
variable y by the equation y = 1.8x + 32. The means of the new
and old variables have the same
relationship as the individual measurements have.
$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{n}\sum_{i=1}^{n}(1.8x_i + 32) = 1.8\bar{x} + 32.$$
In general, if a and b > 0 are constants and y = a+bx, ȳ = a+bx̄ .
Other location measures introduced
below behave in the same way.
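The defining property of a location measure can be checked numerically. The sketch below uses a few made-up Celsius temperatures (not data from this text) and confirms that the mean and the median both transform in the same way as the individual values.

```r
celsius <- c(12.5, 18.0, 21.3, 25.1, 30.4)   # made-up temperatures
fahrenheit <- 1.8 * celsius + 32             # y = a + b x with a = 32, b = 1.8

mean(fahrenheit)
1.8 * mean(celsius) + 32

# all.equal() compares the two sides up to floating point rounding error.
all.equal(mean(fahrenheit), 1.8 * mean(celsius) + 32)
all.equal(median(fahrenheit), 1.8 * median(celsius) + 32)
```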
When there are repeated values of x, there is an equivalent
formula for the mean. Let the m distinct
values of x be denoted by v1, . . . , vm. Let ni be the number of
times vi is repeated and let fi = ni/n.
Note that $\sum_{i=1}^{m} n_i = n$ and $\sum_{i=1}^{m} f_i = 1$. Then the average is given by
$$\bar{x} = \sum_{i=1}^{m} f_i v_i.$$
The number ni is the frequency of the value vi and fi is its
relative frequency.
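The frequency form of the mean is easy to verify in R. This is a sketch on made-up values. The names n_i, v and f mirror the symbols in the formula and are our own choices.

```r
x <- c(2, 3, 3, 5, 2, 3, 7, 5, 3, 2)   # made-up values with repeats

n_i <- table(x)                    # frequency of each distinct value
v <- as.numeric(names(n_i))        # the distinct values v_1, ..., v_m
f <- as.numeric(n_i) / length(x)   # relative frequencies f_i = n_i / n

sum(f)        # the relative frequencies sum to 1
sum(f * v)    # mean computed from the frequency formula
mean(x)       # ordinary mean, for comparison
```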
2.1.2 The Median and Other Quantiles
Let x be a numeric variable with values x1, x2, . . . , xn.
Arrange the values in increasing order x(1) ≤
x(2) ≤ . . . ≤ x(n). The median of x is a number median(x) such
that at least half the values of x
are ≤ median(x) and at least half the values of x are ≥
median(x). This conveys the essential idea
but unfortunately it may define an interval of numbers rather
than a single number. The ambiguity is
usually resolved by taking the median to be the midpoint of that
interval. Thus, if n is odd, n = 2k+1, where k is a nonnegative integer,
$$\mathrm{median}(x) = x_{(k+1)},$$
while if n is even, n = 2k,
$$\mathrm{median}(x) = \frac{x_{(k)} + x_{(k+1)}}{2}.$$
Let p ∈ (0, 1) be a number between 0 and 1. The pth quantile of
x is more commonly known as the
100pth percentile; e.g., the 0.8 quantile is the same as the 80th
percentile. We define it as a number
q(x, p) such that the fraction of values of x that are ≤ q(x, p) is
at least p and the fraction of values of
x that are ≥ q(x, p) is at least 1−p. For example, at least 80
percent of the values of x are ≤ the 80th
percentile of x and at least 20 percent of the values of x are ≥
its 80th percentile. Again, this may
not define a unique number q(x, p). Software packages have
rules for resolving the ambiguity, but the
details are usually not important.
The median is the 50th percentile, i.e., the 0.5 quantile. The
25th and 75th percentiles are called the
first and third quartiles. The 10th, 20th, 30th, etc. percentiles
are called the deciles. The median is a
location measure as defined in the preceding section.
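To see the ambiguity in practice, the sketch below computes a first quartile under two of the rules offered by R's quantile function and checks each answer against the definition of q(x, p) given above. The data are made up, and is_quantile is a hypothetical helper written for this illustration, not a built-in R function.

```r
x <- c(3, 1, 4, 1, 5, 9, 2, 6)    # made-up values, n = 8

# The definition in the text: at least p of the values are <= q
# and at least 1 - p of the values are >= q.
is_quantile <- function(q, x, p) {
  mean(x <= q) >= p && mean(x >= q) >= 1 - p
}

median(x)    # midpoint of the two middle values, since n is even

q7 <- quantile(x, probs = 0.25, names = FALSE)            # default rule (type = 7)
q2 <- quantile(x, probs = 0.25, type = 2, names = FALSE)  # averages at ties

c(q7, q2)                   # the two rules give different numbers
is_quantile(q7, x, 0.25)    # does the first satisfy the definition?
is_quantile(q2, x, 0.25)    # does the second?
```

For these data both answers satisfy the definition, which illustrates why the choice of rule rarely matters.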
2.1.3 Trimmed Means
Trimmed means of a variable x are obtained by finding the
mean of the values of x excluding a given
percentage of the largest and smallest values. For example, the
5% trimmed mean is the mean of the
values of x excluding the largest 5% of the values and the
smallest 5% of the values. In other words, it
is the mean of all the values between the 5th and 95th
percentiles of x. A trimmed mean is a location
measure.
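In R, a trimmed mean is obtained with the trim argument of mean. The sketch below, on made-up values, computes the 5% trimmed mean by hand and with the built-in function. It relies on the fact that mean drops floor(n * trim) values from each end of the sorted data, so the trimming fraction is effectively rounded down when n * trim is not a whole number.

```r
x <- c(8, 21, 13, 17, 10, 25, 14, 19, 11, 16,
       12, 23, 15, 18, 9, 27, 20, 22, 24, 26)   # made-up values, n = 20

k <- floor(0.05 * length(x))    # number of values dropped from each end
by_hand <- mean(sort(x)[(k + 1):(length(x) - k)])
built_in <- mean(x, trim = 0.05)

c(by_hand = by_hand, built_in = built_in)
```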
2.1.4 Grouped Data
Sometimes large data sets are summarized by grouping values.
Let x be a numeric variable with values
x1, x2, . . . , xn. Let c0 < c1 < . . . < cm be numbers such that
all the values of x are between c0 and
cm. For each i, let ni be the number of values of x (including
repetitions) that are in the interval
(ci−1, ci], i.e., the number of indices j such that ci−1 < xj ≤ ci.
A frequency table of x is a table
showing the class intervals (ci−1, ci] along with frequencies ni
with which the data values fall into each
interval. Sometimes additional columns are included showing
the relative frequencies fi = ni/n, the
cumulative relative frequencies $F_i = \sum_{j \le i} f_j$, and the midpoints of the intervals.
Example 2.1. The data below are 50 measured reaction times in
response to a sensory stimulus,
arranged in increasing order. A frequency table is shown below
the data.
0.12 0.30 0.35 0.37 0.44 0.57 0.61 0.62 0.71 0.80 0.88 1.02 1.08
1.12 1.13 1.17 1.21 1.23 1.35 1.41 1.42
1.42 1.46 1.50 1.52 1.54 1.60 1.61 1.68 1.72 1.86 1.90 1.91 2.07
2.09 2.16 2.17 2.20 2.29 2.32 2.39 2.47
2.60 2.86 3.43 3.43 3.77 3.97 4.54 4.73
Interval Midpoint ni fi Fi
(0,1] 0.5 11 0.22 0.22
(1,2] 1.5 22 0.44 0.66
(2,3] 2.5 11 0.22 0.88
(3,4] 3.5 4 0.08 0.96
(4,5] 4.5 2 0.04 1.00
If only a frequency table like the one above is given, the mean
and median cannot be calculated
exactly. However, they can be estimated. If we take the
midpoint of an interval as a stand-in for all
the values in that interval, then we can use the formula in the
preceding section for calculating a mean
with repeated values. Thus, in the example above, we would
estimate the mean as
0.22(0.5) + 0.44(1.5) + 0.22(2.5) + 0.08(3.5) + 0.04(4.5) = 1.78
Estimating the median is a bit more difficult. By examining the
cumulative frequencies Fi, we see that
22% of the data is less than or equal to 1 and 66% of the data is
less than or equal to 2. Therefore,
the median lies between 1 and 2. That is, it is 1 + a certain
fraction of the distance from 1 to 2. A
reasonable guess at that fraction is given by linear interpolation
between the cumulative frequencies
at 1 and 2. In other words, we estimate the median as
$$1 + \frac{0.50 - 0.22}{0.66 - 0.22}(2 - 1) = 1.636.$$
A cruder estimate of the median is just the midpoint of the
interval that contains the median, in
this case 1.5. We leave it as an exercise to calculate the mean
and median from the data of Example
1 and to compare them to these estimates.
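The frequency table and both grouped estimates can be reproduced in R. The sketch below types in the 50 reaction times from the example so that it is self-contained, and uses cut to assign each value to a class interval of the form (c_{i-1}, c_i]. The variable names are our own.

```r
# Reaction times from Example 2.1.
x <- c(0.12, 0.30, 0.35, 0.37, 0.44, 0.57, 0.61, 0.62, 0.71, 0.80,
       0.88, 1.02, 1.08, 1.12, 1.13, 1.17, 1.21, 1.23, 1.35, 1.41,
       1.42, 1.42, 1.46, 1.50, 1.52, 1.54, 1.60, 1.61, 1.68, 1.72,
       1.86, 1.90, 1.91, 2.07, 2.09, 2.16, 2.17, 2.20, 2.29, 2.32,
       2.39, 2.47, 2.60, 2.86, 3.43, 3.43, 3.77, 3.97, 4.54, 4.73)

breaks <- 0:5
n_i <- table(cut(x, breaks = breaks))   # counts in (0,1], (1,2], ..., (4,5]
f_i <- as.numeric(n_i) / length(x)      # relative frequencies
F_i <- cumsum(f_i)                      # cumulative relative frequencies
mid <- (head(breaks, -1) + tail(breaks, -1)) / 2   # interval midpoints

data.frame(interval = names(n_i), midpoint = mid,
           n = as.numeric(n_i), f = f_i, cumF = F_i)

mean_est <- sum(f_i * mid)              # grouped estimate of the mean

# Linear interpolation inside the first interval whose cumulative
# relative frequency reaches 0.5.
k <- which(F_i >= 0.5)[1]
F_below <- if (k > 1) F_i[k - 1] else 0
median_est <- breaks[k] +
  (0.5 - F_below) / (F_i[k] - F_below) * (breaks[k + 1] - breaks[k])

c(mean_est = mean_est, median_est = median_est)
```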
2.1.5 Histograms
The figure below is a histogram of the reaction times.
> reacttimes=read.table("reacttimes.txt",header=T)
> hist(reacttimes$Times,breaks=0:5,xlab="Reaction Times",main=" ")
[Figure: absolute frequency histogram of the reaction times. Horizontal axis: Reaction Times, 0 to 5. Vertical axis: Frequency, 0 to 20.]
The histogram is a graphical depiction of the grouped data. The
end points ci of the class intervals
are shown on the horizontal axis. This is an absolute frequency
histogram because the heights of the
vertical bars above the class intervals are the absolute
frequencies ni. A relative frequency histogram
would show the relative frequencies fi. A density histogram has
bars whose heights are the relative
frequencies divided by the lengths of the corresponding class
intervals. Thus, in a density histogram
the area of the bar is equal to the relative frequency. If all class
intervals have the same length, these
types of histograms all have the same shape and convey the
same visual information.
2.1.6 Robustness
A robust measure of location is one that is not affected by a few
extremely large or extremely small
values. Values of a numeric variable that lie a great distance
from most of the other values are
called outliers. Outliers might be the result of mistakes in
measuring or recording data, perhaps from
misplacing a decimal point. The mean is not a robust location
measure. It can be affected significantly
by a single extreme outlier if that outlying value is extreme
enough. Thus, if there is any doubt about
the quality of the data, the median or a trimmed mean might be
preferred to the mean as a reliable
location measure. The median is very insensitive to outliers. A
5% trimmed mean is insensitive to
outliers that make up no more than 5% of the data values.
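A small experiment makes the point concrete. The sketch below uses made-up values and simulates a misplaced decimal point by recording one value of 15 as 1500. It then compares the three location measures before and after the error.

```r
x <- c(12, 15, 11, 14, 13, 16, 12, 15, 14, 13,
       11, 17, 13, 14, 12, 15, 16, 13, 14, 15)   # made-up values, n = 20
y <- x
y[20] <- 1500    # misplaced decimal point: 15 recorded as 1500

rbind(
  clean   = c(mean = mean(x), median = median(x), trimmed = mean(x, trim = 0.05)),
  outlier = c(mean = mean(y), median = median(y), trimmed = mean(y, trim = 0.05))
)
```

The single bad value is 5% of the data, so the 5% trimmed mean discards it, while the ordinary mean is pulled far away from every correctly recorded value.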
2.1.7 The Five Number Summary
The five number summary is a convenient way of summarizing
numeric data. The five numbers are the
minimum value, the first quartile (25th percentile), the median,
the third quartile (75th percentile),
and the maximum value. Sometimes the mean is also included,
which makes it a six number summary.
Example 2.2. The natural logarithms y of the data values x in
Example 1 are, to two places:
-2.12 -1.20 -1.05 -0.99 -0.82 -0.56 -0.49 -0.48 -0.34 -0.22 -0.13
0.02 0.08 0.11 0.12 0.16 0.19 0.21 0.30
0.34 0.35 0.35 0.38 0.40 0.42 0.43 0.47 0.48 0.52 0.54 0.62 0.64
0.65 0.73 0.74 0.77 0.78 0.79 0.83 0.84
0.87 0.90 0.96 1.05 1.23 1.23 1.33 1.38 1.51 1.55
It is sometimes advantageous to transform data in some way,
i.e., to define a new variable y as a
function of the old variable x. In this case, we have transformed
the reaction times x with the
natural logarithm transformation. We might want to do this
so that we can more easily apply
certain statistical inference procedures you will learn about
later. The six number summary of the
transformed data y is:
> reacttimes=read.table("reacttimes.txt",header=T)
> summary(log(reacttimes$Times))
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.12000 0.08605 0.42520 0.33710 0.78500 1.55400
2.1.8 The Mode
The mode of a variable is its most frequently occurring value.
With numeric variables the mode is
less important than the mean and median for descriptive
purposes or for statistical inference. For
factor variables the mode is the most natural way of choosing a
“most representative” value. We hear
this frequently in the media, in statements such as “Financial
problems are the most common cause
of marital strife”. For grouped numeric data the modal class
interval is the class interval having the
highest absolute or relative frequency. In Example 1, the modal
class interval is the interval (1,2].
2.1.9 Exercises
1. Find the mean and median of the reaction time data in
Example 1.
2. Find the quartiles of the reaction time data. There is more
than one acceptable answer.
3. The 40th value x40 of the reaction time data has a value of
2.32. Replace that with 232.0.
Recalculate the mean and median. Comment.
4. Construct a frequency table like the one in Example 1 for the
log-transformed reaction times
of Example 2. Use 5 class intervals of equal length beginning at
-3 and ending at 2. Draw an absolute
frequency histogram.
5. Estimate the mean and median of the grouped log-transformed
reaction times by using the techniques
discussed in Example 1. Compare your answers to the
summary in Example 2.
6. Repeat exercises 1, 2, and the histogram of exercise 4 by
using R.
7. Let x be a numeric variable with values $x_1, \ldots, x_{n-1}, x_n$.
Let $\bar{x}_n$ be the average of all n values
and let $\bar{x}_{n-1}$ be the average of $x_1, \ldots, x_{n-1}$. Show that
$$\bar{x}_n = \left(1 - \frac{1}{n}\right)\bar{x}_{n-1} + \frac{1}{n}x_n.$$
What happens if $x_n \to \infty$ while all the other values of x are fixed?
2.2 Measures of Variability or Scale
2.2.1 The Variance and Standard Deviation
Let x be a population variable with values x1, x2, . . . , xn.
Some of the values might be repeated. The
variance of x is
$$\mathrm{var}(x) = \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu(x))^2.$$
The standard deviation of x is
$$\mathrm{sd}(x) = \sigma = \sqrt{\mathrm{var}(x)}.$$
When x1, x2, . . . , xn are values of x from a sample rather than
the entire population, we modify the
definition of the variance slightly, use a different notation, and
call these objects the sample variance
and standard deviation.
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad s = \sqrt{s^2}.$$
The reason for modifying the definition for the sample variance
has to do with its properties as an
estimate of the population variance.
Alternate algebraically equivalent formulas for the variance and
sample variance are
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \mu(x)^2,$$
$$s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right).$$
These are sometimes easier to use for hand computation.
The standard deviation σ is called a measure of scale because of
the way it behaves under linear
transformations of the data. If a new variable y is defined by y
= a+ bx, where a and b are constants,
sd(y) = |b|sd(x). For example, the standard deviation of
Fahrenheit temperatures is 1.8 times the
standard deviation of Celsius temperatures. The transformation
y = a + bx can be thought of as a
rescaling operation, or a choice of a different system of
measurement units, and the standard deviation
takes account of it in a natural way.
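The formulas of this section can be checked against R's built-in functions. The sketch below uses made-up values. Note that R's var and sd compute the sample versions, with divisor n − 1, so the population variance has to be computed directly.

```r
x <- c(4, 7, 6, 9, 4, 12)    # made-up values
n <- length(x)

pop_var  <- mean((x - mean(x))^2)             # divides by n
samp_var <- sum((x - mean(x))^2) / (n - 1)    # divides by n - 1

all.equal(samp_var, var(x))                   # var() is the sample variance
all.equal(pop_var, mean(x^2) - mean(x)^2)     # alternate formula, population
all.equal(samp_var, (sum(x^2) - n * mean(x)^2) / (n - 1))   # alternate, sample

# Rescaling y = a + b x multiplies the standard deviation by |b|.
y <- 32 + 1.8 * x
all.equal(sd(y), 1.8 * sd(x))
```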
2.2.2 The Coefficient of Variation
For a variable that has only positive values, it may be more
important to measure the relative variability
than the absolute variability. That is, the amount of
variation should be compared to the mean
value of the variable. The coefficient of variation for a
population variable is defined as
$$\mathrm{cv}(x) = \frac{\mathrm{sd}(x)}{\mu(x)}.$$
For a sample of values of x we substitute the sample standard
deviation s and the sample average x̄ .
2.2.3 The Mean and Median Absolute Deviation
Suppose that you must choose a single number c to represent all
the values of a variable x as accurately
as possible. One measure of the overall error with which c
represents the values of x is
$$g(c) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - c)^2}.$$
In the exercises, you are asked to show that this expression is
minimized when c = x̄ . In other words,
the single number which most accurately represents all the
values is, by this criterion, the mean of the
variable. Furthermore, the minimum possible overall error, by
this criterion, is the standard deviation.
However, this is not the only reasonable criterion. Another is
$$h(c) = \frac{1}{n}\sum_{i=1}^{n}|x_i - c|.$$
It can be shown that this criterion is minimized when c =
median(x). The minimum value of h(c) is
called the mean absolute deviation from the median. It is a scale
measure which is somewhat more
robust (less affected by outliers) than the standard deviation, but
still not very robust. A related very
robust measure of scale is the median absolute deviation from
the median, or mad:
$$\mathrm{mad}(x) = \mathrm{median}(|x - \mathrm{median}(x)|).$$
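One practical caution when computing this in R: the built-in function mad multiplies the median absolute deviation by the constant 1.4826 unless told otherwise. The sketch below, on made-up values, shows how to recover the definition given here by setting constant = 1.

```r
x <- c(2, 4, 4, 5, 7, 9, 30)    # made-up values with one outlier

mean(abs(x - median(x)))      # mean absolute deviation from the median
median(abs(x - median(x)))    # median absolute deviation, as defined above
mad(x, constant = 1)          # the same quantity from the built-in function
mad(x)                        # default: the value above times 1.4826
```

The default constant rescales the result so that it is comparable to a standard deviation for normally distributed data. For the purely descriptive use in this section, constant = 1 matches the formula.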
2.2.4 The Interquartile Range
The interquartile range of a variable x is the difference between
its 75th and 25th percentiles.
$$\mathrm{IQR}(x) = q(x, 0.75) - q(x, 0.25).$$
It is a robust measure of scale which is important in the
construction and interpretation of boxplots,
discussed below.
All of these measures of scale are valid for comparison of the
“spread” or variability of numeric variables
about a central value. In general, the greater their values, the
more spread out the values of the variable
are. Of course, the standard deviation, median absolute
deviation, and interquartile range of a variable
will be different numbers and one must be careful to compare
like measures.
2.2.5 Boxplots
Boxplots are also called box and whisker diagrams. Essentially,
a boxplot is a graphical representation
of the five number summary. The boxplot below depicts the
sensory response data of the preceding
section without the log transformation.
> reacttimes=read.table("reacttimes.txt",header=T)
> boxplot(reacttimes$Times,horizontal=T,xlab="Reaction Times")
> summary(reacttimes)
Times
Min. :0.120
1st Qu.:1.090
Median :1.530
Mean :1.742
3rd Qu.:2.192
Max. :4.730
[Figure: horizontal boxplot of the reaction times. Axis: Reaction Times, 0 to 4.]
The central box in the diagram encloses the middle 50% of the
numeric data. Its left and right boundaries
mark the first and third quartiles. The boldface middle line
in the box marks the median of
the data. Thus, the interquartile range is the distance between
the left and right boundaries of the
central box. For construction of a boxplot, an outlier is defined
as a data value whose distance from
the nearest quartile is more than 1.5 times the interquartile
range. Outliers are indicated by isolated
points (tiny circles in this boxplot). The dashed lines extending
outward from the quartiles are called
the whiskers. They extend from the quartiles to the most
extreme values in either direction that are
not outliers.
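The outlier rule can be applied directly to the reaction times. The sketch below types in the data from Example 2.1, computes the two cutoffs 1.5 interquartile ranges beyond the quartiles, and lists the values flagged as outliers. It also calls boxplot.stats, the function that boxplot relies on, which locates the quartiles by a slightly different rule (Tukey's hinges) and can therefore give slightly different cutoffs in general.

```r
# Reaction times from Example 2.1.
x <- c(0.12, 0.30, 0.35, 0.37, 0.44, 0.57, 0.61, 0.62, 0.71, 0.80,
       0.88, 1.02, 1.08, 1.12, 1.13, 1.17, 1.21, 1.23, 1.35, 1.41,
       1.42, 1.42, 1.46, 1.50, 1.52, 1.54, 1.60, 1.61, 1.68, 1.72,
       1.86, 1.90, 1.91, 2.07, 2.09, 2.16, 2.17, 2.20, 2.29, 2.32,
       2.39, 2.47, 2.60, 2.86, 3.43, 3.43, 3.77, 3.97, 4.54, 4.73)

q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)   # first and third quartiles
iqr <- q[2] - q[1]             # interquartile range, same as IQR(x)
lower <- q[1] - 1.5 * iqr      # cutoff below the first quartile
upper <- q[2] + 1.5 * iqr      # cutoff above the third quartile

flagged <- x[x < lower | x > upper]      # values treated as outliers
flagged
range(x[x >= lower & x <= upper])        # where the whiskers end
boxplot.stats(x)$out                     # outliers as identified for boxplot()
```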
This boxplot shows a number of interesting things about the
response time data.
(a) The median is about 1.5. The interquartile range is slightly
more than 1.
(b) The three largest values are outliers. They lie a long way
from most of the data. They might call
for special investigation or explanation.
(c) The distribution of values is not symmetric about the
median. The values in the lower half
of the data are more crowded together than those in the upper
half. This is shown by comparing the
distances from the median to the two quartiles, by the lengths of
the whiskers and by the presence of
outliers at the upper end.
The asymmetry of the distribution of values is also evident in
the histogram of the preceding section.
2.2.6 Exercises
1. Find the variance and standard deviation of the response time
data. Treat it as a sample from a
larger population.
2. Find the interquartile range and the median absolute
deviation for the response time data.
3. In the response time data, replace the value x40 = 2.32 by
232.0. Recalculate the standard
deviation, the interquartile range and the median absolute
deviation and compare with the answers
from problems 1 and 2.
4. Make a boxplot of the log-transformed reaction time data. Is
the transformed data more symmetrically
distributed than the original data?
5. Show that the function g(c) in section 2.2.3 is minimized
when c = µ(x). Hint: Minimize g(c)2.
6. Find the variance, standard deviation, IQR, mean absolute
deviation and median absolute deviation
of the variable “Ozone” in the data set “airquality”. Use
R or Rstudio. You can address the
variable Ozone directly if you attach the airquality data frame
to the search path as follows:
> attach(airquality)
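The R functions you will need are “sd” for standard deviation,
“var” for variance, “IQR” for the
interquartile range, and “mad” for the median absolute
deviation. By default, “mad” multiplies its result by 1.4826; call it with constant=1
to get the median absolute deviation as defined in section 2.2.3. The variable Ozone
contains missing values (NA), so each of these functions needs the argument na.rm=T.
There is no built-in function in R
for the mean absolute deviation, but it is easy to obtain it.
> mean(abs(Ozone-median(Ozone,na.rm=T)),na.rm=T)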
2.3 Jointly Distributed Variables
When two or more variables are jointly distributed, or jointly
observed, it is important to understand
how they are related and how closely they are related. We will
first consider the case where one
variable is numeric and the other is a factor.
2.3.1 Side by Side Boxplots
Boxplots are particularly useful in quickly comparing the values
of two or more sets of numeric data
with a common scale of measurement and in investigating the
relationship between a factor variable
and a numeric variable. The figure below compares placement
test scores for each of the letter grades
in a sample of 179 students who took a particular math course
in the same semester under the same
instructor. The two jointly observed population variables are the
placement test score and the letter
grade received. The figure separates test scores according to the
letter grade and shows a boxplot for
each group of students. One would expect to see a decrease in
the median test score as the letter grade
decreases and that is confirmed by the picture. However, the
decrease in median test scores from a
letter grade of B to a grade of F is not very dramatic, especially
compared to the size of the IQRs.
This suggests that the placement test is not especially good at
predicting a student’s final grade in
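the course. Notice the two outliers. The outlier for the "W"
group is clearly a mistake in recording
data because the scale of scores only went to 100.
> # stringsAsFactors=TRUE makes Grade a factor, as the boxplot formula requires
> test.vs.grade=read.csv("test.vs.grade.csv",header=TRUE,stringsAsFactors=TRUE)
> attach(test.vs.grade)
> # varwidth=TRUE makes box widths proportional to the square roots of the group sizes
> plot(Test~Grade,varwidth=TRUE)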
[Figure: side-by-side boxplots of Test (vertical axis, ticks from 40 to 120) for each Grade (A, B, C, D, F, W).]
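The comparison that side-by-side boxplots make visually can also be tabulated. The following is a minimal sketch, not the analysis behind the figure: because the placement test file is not reproduced here, it uses R's built-in ToothGrowth data as a stand-in, so the names len and supp belong to that data set and not to the example above.

```r
# Stand-in data: ToothGrowth ships with R (numeric 'len', factor 'supp').
data(ToothGrowth)

# Median and IQR of the numeric variable within each level of the factor.
group.median <- tapply(ToothGrowth$len, ToothGrowth$supp, median)
group.iqr <- tapply(ToothGrowth$len, ToothGrowth$supp, IQR)
print(rbind(median = group.median, IQR = group.iqr))

# The five-number summaries behind each box; plot = FALSE skips the drawing.
bp <- boxplot(len ~ supp, data = ToothGrowth, varwidth = TRUE, plot = FALSE)
print(bp$names)
print(bp$stats)  # row 3 holds the group medians
```

Dropping plot = FALSE draws the boxes themselves. Comparing the difference between group medians with the sizes of the IQRs is the same judgment made for the letter grades above.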
2.3.2 Scatterplots
Suppose x and y are two jointly distributed numeric variables.
Whether we consider the entire
population or a sample from the population, we have the same
number n of observed values for
each variable. If we plot the n points (x1, y1), (x2, y2), . . . ,
(xn, yn) in a Cartesian plane, we obtain
a scatterplot or a scatter diagram of the two variables. Below
are the first 6 rows of the "Payroll"
data set. The column labeled "payroll" is the total monthly
payroll in thousands of dollars for each
company listed. The column "employees" is the number of
employees in each company and "industry"
indicates which of two related industries the company is in. A
scatterplot of all 50 values of the two
variables "payroll" and "employees" is also shown.
> Payroll=read.table("Payroll.txt",header=TRUE,stringsAsFactors=TRUE)
> Payroll[1:6,]
  payroll employees industry
1  190.67        85        A
2  233.58       109        A
3  244.04       130        B
4  351.41       166        A
5  298.60       154        B
6  241.43       124        B
> attach(Payroll)
> # industry is a factor, so col=industry colors the points by its integer codes
> plot(payroll~employees,col=industry)
[Figure: scatterplot of payroll (vertical axis, ticks from 150 to 350) versus employees (horizontal axis, ticks from 50 to 150), colored by industry.]
The scatterplot shows that in general the more employees a
company has, the higher its monthly
payroll. Of course this is expected. It also shows that the
relationship between the number of
employees and the payroll is quite strong. For any given number
of employees, the variation in
payrolls for that number is small compared to the overall
variation in payrolls for all employment
levels. In this plot, the data from industry A is in black and that
from industry B is red. The plot
shows that for employees ≥ 100, payrolls for industry A are
generally greater than those for industry
B at the same level of employment.
2.3.3 Covariance and Correlation
If x and y are jointly distributed numeric variables, we define
their covariance as
\[ \operatorname{cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu(x))(y_i - \mu(y)). \]
If x and y come from samples of size n rather than the whole
population, replace the denominator
n by n − 1 and the population means µ(x), µ(y) by the sample
means x̄ , ȳ to obtain the sample
covariance. The sign of the covariance reveals something about
the relationship between x and y. If
the covariance is negative, values of x greater than µ(x) tend to
be accompanied by values of y less
than µ(y). Values of x less than µ(x) tend to go with values of y
greater than µ(y), so x and y tend
to deviate from their means in opposite directions. If
cov(x, y) > 0, they tend to deviate in the same
direction. The strength of these tendencies is not expressed by
the covariance because its magnitude
depends on the variability of each of the variables about its
mean. To correct this, we divide each
deviation in the sum by the standard deviation of the variable.
The resulting quantity is called the
correlation between x and y:
\[ \operatorname{cor}(x, y) = \frac{\operatorname{cov}(x, y)}{\operatorname{sd}(x)\,\operatorname{sd}(y)}. \]
The correlation between payroll and employees in the example
above is 0.9782 (97.82 %).
Theorem 2.1. The correlation between x and y satisfies −1 ≤
cor(x, y) ≤ 1. cor(x, y) = 1 if and
only if there are constants a and b > 0 such that y = a+ bx.
cor(x, y) = −1 if and only if y = a+ bx
with b < 0.
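The definitions of covariance and correlation, and the two extreme cases in Theorem 2.1, can be checked numerically. The sketch below uses made-up vectors x and y (they are not from any data set in this text) and treats them as samples, so the denominator is n − 1; that is also the convention R's built-in cov, sd and cor follow.

```r
# Made-up sample data, for illustration only.
x <- c(2, 4, 5, 7, 9)
y <- c(1, 3, 2, 6, 8)
n <- length(x)

# Sample covariance: deviations from the sample means, denominator n - 1.
cov.by.hand <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Correlation: covariance divided by the product of the standard deviations.
cor.by.hand <- cov.by.hand / (sd(x) * sd(y))

# Each pair should agree with the built-in function.
print(c(cov.by.hand, cov(x, y)))
print(c(cor.by.hand, cor(x, y)))

# Theorem 2.1: an exact linear relation y = a + bx gives correlation
# 1 when b > 0 and -1 when b < 0.
print(cor(x, 3 + 2 * x))
print(cor(x, 3 - 2 * x))
```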
A correlation close to 1 indicates a strong positive relationship
(tending to vary in the same direction
from their means) between x and y while a correlation close to
−1 indicates a strong negative relationship.
A correlation close to 0 indicates that there is little or no
linear relationship between x and y. In
this case, x and y are said to be (nearly) uncorrelated. There
might be a relationship between x and
y but it would be nonlinear. The picture below shows a
scatterplot of two variables that are clearly
related but very nearly uncorrelated.
> xs=runif(500,0,3*pi)
[Figure: scatterplot of ys versus xs.]
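A smaller, deterministic illustration of the same phenomenon may help. It is not the data behind the figure; the vectors below are hypothetical. If x is symmetric about 0 and y = x^2, then y is completely determined by x, yet the positive and negative deviations of x cancel in the covariance.

```r
# Hypothetical data: x symmetric about 0 and y an exact function of x.
x <- -5:5
y <- x^2

# y depends entirely on x, but the relationship is not linear,
# so the covariance and the correlation are zero up to rounding.
print(cov(x, y))
print(cor(x, y))
```

A scatterplot of these points, plot(x, y), shows a parabola: a strong relationship that the correlation does not detect.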
Some sample scatterplots of variables with different population
correlations are shown below.
[Figure: sample scatterplots; the surviving panel titles are cor(x,y)=0 and cor(x,y)=0.9.]
2.3.4 Exercises
1. With the Air Pollution Filter Noise data, construct side by
side boxplots of the variable NOISE for
the different levels of the factor SIZE. Comment. Do the same
for NOISE and TYPE.
2. With the Payroll data, construct side by side boxplots of
"employees" versus "industry" and "payroll" versus "industry".
Are these boxplots as informative as the
color coded scatterplot in Section
2.3.2?
3. If you are using Rstudio click on the "Packages" tab, then the
checkbox next to the library MASS.
Click on the word MASS and then the data set "mammals" and
read about it. If you are using R
alone, in the Console window at the prompt > type
> data(mammals,package="MASS")
View the data with
> mammals
Make a scatterplot with the following commands and comment
on the result.
> attach(mammals)
> plot(body,brain)
Also make a scatterplot of the log transformed body and brain
weights.
> plot(log(body),log(brain))
A recently discovered hominid species, Homo floresiensis, had an
estimated average body weight of
25 kg. Based on the scatterplots, what would you guess its brain
weight to be?
4. Let x and y be jointly distributed numeric variables and let z
= a + by, where a and b are
constants. Show that cov(x, z) = b ∗ cov(x, y). Show that if b >
0, cor(x, z) = cor(x, y). What
happens if b < 0?
Chapter 3
Probability
3.1 Basic Definitions. Equally Likely Outcomes
Let a random experiment with sample space Ω be given. Recall
from Chapter 1 that Ω is the set
of all possible outcomes of the experiment. An event is a subset
of Ω. A probability measure is a
function which assigns numbers between 0 and 1 to events. If
the sample space Ω, the collection
of events, and the probability measure are all specified, they
constitute a probability model of the
random experiment.
The simplest probability models have a finite sample space Ω.
The collection of events is the collection
of all subsets of Ω and the probability of an event is
simply the proportion of all possible
outcomes that correspond to that event. In such models, we say
that the experiment has equally likely
outcomes. If the sample space has N elements, then each
elementary event {ω} consisting of a single
outcome has probability 1/N. If E is a subset of Ω, then
\[ \Pr(E) = \frac{\#(E)}{N}. \]
Here we introduce some notation that will be used throughout
this text. The probability measure
for a random experiment is most often denoted by the
abbreviation Pr, sometimes with subscripts.
Events will be denoted by upper case Latin letters near the
beginning of the alphabet. The expression
#(E) denotes the number of elements of the subset E.
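The counting formula Pr(E) = #(E)/N translates directly into R. The following sketch uses a single roll of a fair six-sided die as an illustrative experiment of our own choosing; the names Omega, E and pr.E are not used elsewhere in the text.

```r
# Sample space for one roll of a fair six-sided die (illustrative choice).
Omega <- 1:6
N <- length(Omega)

# The event E = "the roll is even", written as a subset of Omega.
E <- Omega[Omega %% 2 == 0]

# Equally likely outcomes: Pr(E) = #(E) / N.
pr.E <- length(E) / N
print(pr.E)
```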
Example 3.1. The Payroll data consists of 50 observations of 3
variables, "payroll", "employees" and
"industry". Suppose that a random experiment is to choose one
record from the Payroll data and
suppose that the experiment has equally likely outcomes. Then,
as the summary below shows, the
probability that industry A is selected is
\[ \Pr(\text{industry} = A) = \frac{27}{50} = 0.54. \]
> Payroll=read.table("Payroll.txt",header=TRUE,stringsAsFactors=TRUE)
> summary(Payroll)
    payroll        employees      industry
 Min.   :129.1   Min.   : 26.00   A:27
 1st Qu.:167.8   1st Qu.: 71.25   B:23
 Median :216.1   Median :108.50
 Mean   :228.2   Mean   :106.42
 3rd Qu.:287.8   3rd Qu.:143.25
 Max.   :354.8   Max.   :172.00
In this example we use another common and convenient
notational convention. The event whose
probability we want is described in quasi-natural language as
"industry=A" rather than with the
formal but too cumbersome {ω ∈ Payroll | industry(ω) = A}. The
description "industry=A" refers to
the set of all possible outcomes of the experiment for which the
variable "industry" has the value "A".
This sort of informal description of an event will be used again
and again.
The assumption of equally likely outcomes is an assumption
about the selection procedure for obtaining
one record from the data. It is conceivable that a
selection method is employed for which
this assumption is not valid. If so, we should be able to discover
that it is invalid by replicating the
experiment sufficiently many times. This is a basic principle of
classical statistical inference. It relies
on a famous result of mathematical probability theory called the
law of large numbers. One version
of it is loosely stated as follows:
Law of Large Numbers: Let E be an event associated with a
random experiment and let Pr be the
probability measure of a true probability model of the
experiment. Suppose the experiment is replicated n times and let
\[ \widehat{\Pr}(E) = \frac{1}{n} \times \#(\text{replications in which } E \text{ occurs}). \]
Then P̂r(E) → Pr(E) as n → ∞.
P̂r(E) is called the empirical probability of E.
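The law of large numbers can be watched at work in a simulation. The sketch below is only an illustration under assumptions of our own: the experiment is a roll of a fair die, the event E is "the roll is a six", and the seed and the number of replications are arbitrary.

```r
set.seed(1)  # arbitrary seed so that the run is reproducible

# Replicate the experiment "roll a fair die" n times.
n <- 100000
rolls <- sample(1:6, n, replace = TRUE)

# Event E = "the roll is a six"; its probability in the model is 1/6.
occurred <- rolls == 6

# Empirical probability of E after the first 10, 100, ..., n replications.
checkpoints <- c(10, 100, 1000, 10000, n)
p.hat <- cumsum(occurred)[checkpoints] / checkpoints
print(rbind(replications = checkpoints, empirical = p.hat))
print(1 / 6)
```

The empirical probabilities fluctuate for small numbers of replications and settle near 1/6 as the number grows.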
3.2 Combinations of Events
Events are related to other events by familiar set operations. Let
E1, E2, . . . be a finite or infinite
sequence of events. The union of E1 and E2 is the event
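\[ E_1 \cup E_2 = \{\omega \in \Omega \mid \omega \in E_1 \text{ or } \omega \in E_2\}. \]
More generally,
\[ \bigcup_i E_i = E_1 \cup E_2 \cup \cdots = \{\omega \in \Omega \mid \omega \in E_i \text{ for some } i\}. \]
The intersection of E1 and E2 is the event
\[ E_1 \cap E_2 = \{\omega \in \Omega \mid \omega \in E_1 \text{ and } \omega \in E_2\}, \]
and, in general,
\[ \bigcap_i E_i = E_1 \cap E_2 \cap \cdots = \{\omega \in \Omega \mid \omega \in E_i \text{ for all } i\}. \]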
Sometimes we omit the intersection symbol ∩ and simply
conjoin the symbols for the events in an
intersection. In other words,
E1E2 . . . En = E1 ∩ E2 ∩ . . . ∩ En.
The complement of the event E is the event
∼E = {ω ∈ Ω | ω ∉ E}.
∼E occurs if and only if E does not occur. The event E1∼E2
occurs if and only if E1 occurs and E2
does not occur.
Finally, the entire sample space Ω is an event with complement
φ, the empty event. The empty
event never occurs. We need the empty event because it is
possible to formulate a perfectly sensible
description of an event which happens never to be satisfied. For
example, if Ω = Payroll the event
"employees < 25" is never satisfied, so it is the empty event.
We also have the subset relation between events. E1 ⊆ E2
means that if E1 occurs, then E2 occurs,
or in more familiar language, E1 is a subset of E2. For any
event E, it is true that φ ⊆ E ⊆ Ω.
E2 ⊇ E1 means the same as E1 ⊆ E2.
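For a finite sample space these operations correspond to R's set functions. The following is a minimal sketch with events of our own choosing for one roll of a die; the names Omega, E1 and E2 are illustrative.

```r
# Sample space: one roll of a six-sided die (illustrative choice).
Omega <- 1:6
E1 <- c(2, 4, 6)   # "the roll is even"
E2 <- c(4, 5, 6)   # "the roll is at least 4"

print(union(E1, E2))       # the union of E1 and E2
print(intersect(E1, E2))   # the intersection, also written E1E2
print(setdiff(Omega, E1))  # the complement of E1
print(setdiff(E1, E2))     # E1 occurs and E2 does not
print(all(E1 %in% Omega))  # subset relation: E1 is a subset of Omega
print(length(intersect(E1, setdiff(Omega, E1))))  # E1 and its complement share no outcomes
```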
3.2.1 Exercises
1. A random experiment consists of throwing a pair of dice, say
a red die and a green die, simultaneously.
They are standard 6-sided dice with one to six dots on
different faces. Describe the sample space.
2. For the same experiment, let E be the event that the sum of
the numbers of spots on the two dice
is an odd number. Write E as a subset of the sample space, i.e.,
list the outcomes in E.
3. List the outcomes in the event F = "the sum of the spots is a
multiple of 3".
4. Find ∼F , E ∪ F , EF = E ∩ F , and E∼F .
5. Assume that the outcomes of this experiment are equally
likely. Find the probability of each
of the events in Exercise 4.
6. Show that for any events E1 and E2, if E1 ⊆ E2 then ∼E2
⊆ ∼E1.
7. Load the "mammals" data set into your R workspace. In
Rstudio you can click on the "Packages"
tab and then on the checkbox next to MASS. Without
Rstudio, type
> data(mammals,package="MASS")
Attach the mammals data frame to your R search path with
> attach(mammals)
A random experiment is to choose one of the species listed in
this data set. All outcomes are equally
likely. You can obtain a list of the species in the event
"body > 200" with the command
> subset(mammals,body>200)
What is the probability of this event, i.e., what is the
probability that you randomly select a species
with a body weight greater than 200 kg?
8. What are the species in the event that the ratio of brain
weight to body weight is greater than 0.02?
Remember that brain weight is recorded in grams and body
weight in kilograms, so body weight must
be multiplied by 1000 to make the two weights comparable.
What is the probability of that event?
3.3 Rules for Probability Measures
The assumption of equally likely outcomes is the starting point
for the construction of many probability models.
There are many random experiments for which
this assumption is wrong. No matter
what other considerations are involved in choosing a probability
measure for a model of a random
experiment, there are certain rules that it must satisfy. They are:
1. 0 ≤ Pr(E) ≤ 1 for each event E.
2. Pr(Ω) = 1.
3. If E1, E2, . . . is a finite or infinite sequence of events such
that EiEj = φ for i ≠ j, then
\( \Pr(\bigcup_i E_i) = \sum_i \Pr(E_i) \).
If EiEj = φ for all i ≠ j we say that the events E1, E2,
. . . are pairwise disjoint.
These are the basic rules. There are other properties that may be
derived from them as theorems.
4. Pr(E∼F ) = Pr(E)− Pr(EF ) for all events E and F . In
particular, Pr(∼E) = 1− Pr(E)
5. Pr(φ) = 0.
6. Pr(E ∪ F ) = Pr(E) + Pr(F )− Pr(EF ) for all events E and F .
7. If E ⊆ F , then Pr(E) ≤ Pr(F ).
8. If E1 ⊆ E2 ⊆ . . . is an infinite sequence, then \( \Pr(\bigcup_i E_i) = \lim_{i\to\infty} \Pr(E_i) \).
9. If E1 ⊇ E2 ⊇ . . . is an infinite sequence, then \( \Pr(\bigcap_i E_i) = \lim_{i\to\infty} \Pr(E_i) \).
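Rules 4 and 6 can be checked on a small model with equally likely outcomes. The sketch below uses the 36 outcomes of a red and a green die; the events A and B and the helper Pr are our own illustrative choices, not notation used elsewhere in the text.

```r
# Sample space for a red and a green die: 36 equally likely outcomes.
Omega <- expand.grid(red = 1:6, green = 1:6)

# With equally likely outcomes, the probability of an event is the
# proportion of outcomes (rows) for which its indicator is TRUE.
Pr <- function(event) mean(event)

A <- Omega$red == 6                         # "the red die shows 6"
B <- (Omega$red + Omega$green) %% 2 == 0    # "the sum of the spots is even"

# Rule 4: Pr(A and not B) = Pr(A) - Pr(AB), and Pr(not A) = 1 - Pr(A).
print(c(Pr(A & !B), Pr(A) - Pr(A & B)))
print(c(Pr(!A), 1 - Pr(A)))

# Rule 6: Pr(A or B) = Pr(A) + Pr(B) - Pr(AB).
print(c(Pr(A | B), Pr(A) + Pr(B) - Pr(A & B)))
```

Each printed pair should agree, since both entries compute the same probability in two ways.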
3.4 Counting Outcomes. Sampling with and without Replacement
Suppose a random experiment with sample space Ω is replicated
n times. The result is a sequence
(ω1, ω2, . . . , ωn), where ωi ∈ Ω is the outcome of the ith
replication. This sequence is the outcome of
a so-called compound experiment, namely the sequential replications
of the basic experiment. The sample
space of this compound experiment is the n-fold Cartesian
product Ω^n = Ω × Ω × · · · × Ω. Now
suppose that the basic experiment is to choose one member of a
finite population with N elements.
We may identify the sample space Ω with the population.
Consider an outcome (ω1, ω2, . . . , ωn) of
the replicated experiment. There are N possibilities for ω1 and
for each of those there are N possibilities
for ω2 and for each pair ω1, ω2 there are N possibilities
for ω3, and so on. In all, there are
N × N × · · · × N = N^n possibilities for the entire sequence (ω1,
ω2, · · · , ωn). If all outcomes of the
compound experiment are equally likely, then each has
probability 1/N^n. Moreover, it can be shown
that the compound experiment has equally likely outcomes if
and only if the basic experiment has
equally likely outcomes, each with probability 1/N.
Definition: An ordered random sample of size n with
replacement from a population of size N is a
randomly chosen sequence of length n of elements of the
population, where repetitions are possible
and each outcome (ω1, ω2, · · · , ωn) has probability 1/N^n.
Now suppose that we sample one element ω1 from the
population, with all N outcomes equally likely.
Next, we sample one element ω2 from the population excluding
the one already chosen. That is, we
randomly select one element from Ω ∼ {ω1} with all the
remaining N − 1 elements being equally
likely. Next, we randomly select one element ω3 from the
N − 2 elements of Ω ∼ {ω1, ω2}, and so
on until at last we select ωn from the remaining N − (n − 1)
elements of the population. The result is
a nonrepeating sequence (ω1, ω2, · · · , ωn) of length n from the
population. A nonrepeating sequence
of length n is also called a permutation of length n from the N
objects of the population. The total
number of such permutations is
N × (N − 1) × · · · × (N − n + 1) = N!/(N − n)!. Obviously, we must have
n ≤ N for this to make sense. The number of permutations of
length N from a set of N objects is
N!. It can be shown that, with the sampling scheme described
above, all permutations of length n
are equally likely to result. Each has probability (N − n)!/N! of
occurring.
Definition: An ordered random sample of size n without
replacement from a population of size N
is a randomly chosen nonrepeating sequence of length n from
the population where each outcome
(ω1, ω2, · · · , ωn) has probability (N − n)!/N!.
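For a small population the two kinds of ordered samples can be listed and counted outright. The sketch below is an illustration with values of our own choosing, N = 5 and n = 3; expand.grid builds every sequence of length n and duplicated flags the ones with repetitions.

```r
N <- 5   # population size (illustrative)
n <- 3   # sample size (illustrative)

# Every ordered sample with replacement: one row per sequence.
with.replacement <- expand.grid(rep(list(1:N), n))
print(c(nrow(with.replacement), N^n))

# Keep only the nonrepeating sequences, i.e. the permutations of length n.
nonrepeating <- apply(with.replacement, 1, function(s) !any(duplicated(s)))
print(c(sum(nonrepeating), factorial(N) / factorial(N - n)))

# One random draw of each kind.
print(sample(1:N, n, replace = TRUE))    # ordered sample with replacement
print(sample(1:N, n, replace = FALSE))   # ordered sample without replacement
```

With these values the first pair is 125 and 125 and the second is 60 and 60, matching N^n and N!/(N − n)!.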
Most of the time when sampling without replacement from a
finite population, we do not care about
the order of appearance of the elements of the sample. Two
nonrepeating sequences with the same
elements in different order will be regarded as equivalent. In
other words, we are concerned only with
the resulting subset of the population. Let us count the number
of subsets of size n from a set of N
objects. Temporarily, let C denote that number. Each subset of
size n can be ordered in n! different
ways to give a nonrepeating sequence. Thus, the number of
nonrepeating sequences of length n is C
times n!. So, N!/(N − n)! = C × n!, i.e.,
\[ C = \frac{N!}{n!\,(N-n)!} = \binom{N}{n}. \]
This is the same binomial coefficient \( \binom{N}{n} \)
that appears in the binomial theorem:
\[ (a+b)^N = \sum_{n=0}^{N} \binom{N}{n} a^n b^{N-n}. \]
Definition: A simple random sample of size n from a population
of size N is a randomly chosen subset
of size n from the population, where each subset has the same
probability of being chosen, namely \( 1/\binom{N}{n} \).
A simple random sample may be obtained by choosing objects
from the population sequentially, in
the manner described above, and then ignoring the order of their
selection.
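In R the binomial coefficient is choose(N, n), the subsets themselves are listed by combn, and sample draws without replacement by default. The following sketch uses illustrative values N = 5 and n = 3 and an arbitrary seed; sorting the draw is simply one way of ignoring the order of selection.

```r
N <- 5   # population size (illustrative)
n <- 3   # sample size (illustrative)

# Number of subsets of size n: N! / (n! (N - n)!), which is choose(N, n).
print(c(factorial(N) / (factorial(n) * factorial(N - n)), choose(N, n)))

# combn() lists every subset as a column; there are choose(N, n) of them.
subsets <- combn(N, n)
print(ncol(subsets))

# A simple random sample: draw without replacement, then ignore the order.
set.seed(2)  # arbitrary seed for reproducibility
srs <- sort(sample(1:N, n))
print(srs)

# Binomial theorem check for arbitrary a and b.
a <- 2
b <- 3
print(c((a + b)^N, sum(choose(N, 0:N) * a^(0:N) * b^(N - 0:N))))
```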
Example: The Birthday Problem
There are N = 365 days in a year. (Ignore leap years.) Suppose
n = 23 people are chosen randomly
and their birthdays recorded. What is the probability that
at least two of them have the same
birthday?
Solution: Arbitrarily numbering the people involved from 1 to n, their
birthdays form an ordered sample,
with replacement, from the set of 365 birthdays. Assuming all
birthdays are equally likely, each sequence has probability 1/N^n of
occurring. No two people have the same birthday if and only if
the sequence is actually nonrepeating.
The number of nonrepeating sequences of birthdays is
N(N − 1) · · · (N − n + 1). Therefore, the event
"No two people have the same birthday" has probability
\[ \frac{N(N-1)\cdots(N-n+1)}{N^n} = \frac{N(N-1)\cdots(N-n+1)}{N \times N \times \cdots \times N} = \left(1-\frac{1}{N}\right)\left(1-\frac{2}{N}\right)\cdots\left(1-\frac{n-1}{N}\right). \]
With n = 23 and N = 365 we can find this in R as follows:
> prod(1-(1:22)/365)
[1] 0.4927028
So, there is about a 49% probability that no two people in a
random selection of 23 have the same
birthday. In other words, the probability that at least two share
a birthday is about 51%.
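The result can be cross-checked by simulation, in the spirit of the law of large numbers. The sketch below is one possible check, not part of the original solution; the function name shared.birthday, the seed and the number of replications are arbitrary choices.

```r
set.seed(3)  # arbitrary seed for reproducibility

# One replication: draw n birthdays with replacement from N equally
# likely days and report whether any birthday is repeated.
shared.birthday <- function(n = 23, N = 365) {
  any(duplicated(sample(N, n, replace = TRUE)))
}

# Empirical probability of at least one shared birthday.
p.hat <- mean(replicate(20000, shared.birthday()))
print(p.hat)

# Exact value from the product formula, for comparison.
p.exact <- 1 - prod(1 - (1:22) / 365)
print(p.exact)
```

The empirical value should land close to the exact one, with a discrepancy of the size expected from 20000 replications.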
An important, intuitively obvious principle in statistics is that if
the sample size n is very small in
comparison to the population size N , a sample taken without
replacement may be regarded as one
taken with replacement, if it is mathematically convenient to do
so. A sample of size 100 taken with
replacement from a population of 100,000 has very little chance
of repeating itself. The probability of
a repetition is about 5%.
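The figure quoted above follows from the same product formula as in the birthday problem, with the population playing the role of the year. A minimal sketch of the computation:

```r
N <- 100000  # population size
n <- 100     # sample size

# Probability that an ordered sample with replacement has no repetition.
p.no.repeat <- prod(1 - (1:(n - 1)) / N)

# Probability of at least one repetition.
p.repeat <- 1 - p.no.repeat
print(p.repeat)
```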
3.4.1 Exercises
1. A red 6-sided die and a green 6-sided die are thrown
simultaneously. The outcomes of this experiment
are equally likely. What is the probability that at least
one of the dice lands with a 6 on its
upper face?
2. A hand of 5-card draw poker is a simple random sample from
the standard deck of 52 cards. What
is the probability that a 5-card draw hand contains the ace of
hearts?
3. How many 5-card draw poker hands are there? In 5-card stud
poker, the cards are dealt sequentially
and the order of appearance is important. How many 5-card stud
poker hands are there?
4. Everybody in Ourtown is a fool or a knave or possibly both.
70% of the citizens are fools and 85%
are knaves. One citizen is randomly selected to be mayor. What
is the probability that the mayor is
both a fool and a knave?
5. A Martian year has 669 days. An R program for calculating
the probability of no repetitions in a
sample with replacement of n birthdays from a year of N days is
given below.
> birthdays=function(n,N) prod(1-seq_len(n-1)/N)
To invoke this function with, for example, n=12 and N=400
simply type
> birthdays(12,400)
Check that the program gives the right answer for N=365 and
n=23. Then use it to find the number
n of Martians that must be sampled in order for the probability
of a repetition to be at least 0.5.
6. A standard deck of 52 cards has four queens. Two cards are
randomly drawn in succession, without
replacement, from a standard deck. What is the probability that
the first card is a queen? What is
the probability that the second card is a queen? If three cards
are drawn, what is the probability that
the third is a queen? Make a general conjecture. Prove it if you
can. (Hint: Does the probability
change if "queen" is replaced by "king" or "seven"?)
3.5 Conditional Probability
Definition: Let A and B be events with Pr(B) > 0. The
conditional probability of A, given B, is:
\[ \Pr(A \mid B) = \frac{\Pr(AB)}{\Pr(B)}. \tag{3.1} \]
Pr(A) itself is called the unconditional probability of A.
Example 3.2. R includes a tabulation by various factors of the
2201 passengers and crew on the
Titanic. Read about it by typing
> help(Titanic)
We are going to look at these factors two at a time, starting with
the class of each person (first, second or third class, or crew)
and whether they survived or not.
> apply(Titanic,c(1,4),sum)
      Survived
Class   No Yes
  1st  122 203
  2nd  167 118
  3rd  528 178
  Crew 673 212
Suppose that a passenger or crew member is selected randomly.
The unconditional probability that
that person survived is 711/2201 = 0.323.
> apply(Titanic,4,sum)
  No  Yes
1490  711
> apply(Titanic,1,sum)
 1st  2nd  3rd Crew
 325  285  706  885
Let us calculate the conditional probability of survival, given
that the person selected was in a first
class cabin. If A = "survived" and B = "first class", then
\[ \Pr(AB) = \frac{203}{2201} = 0.0922 \]
and
\[ \Pr(B) = \frac{325}{2201} = 0.1477. \]
Thus,
\[ \Pr(A \mid B) = \frac{0.0922}{0.1477} = 0.625. \]
First class passengers had about a 62% chance of survival. For
random sampling from a finite population
such as this, we can use the counts of occurrences of the
events rather than their probabilities
because the denominators in Pr(AB) and Pr(B) cancel.
\[ \Pr(A \mid B) = \frac{\#(AB)}{\#(B)} = \frac{203}{325} = 0.625 \]
For comparison, look at the conditional probabilities of survival
for the other classes.
\[ \Pr(\text{survived} \mid \text{second class}) = \frac{118}{285} = 0.414 \]
\[ \Pr(\text{survived} \mid \text{third class}) = \frac{178}{706} = 0.252 \]
\[ \Pr(\text{survived} \mid \text{crew}) = \frac{212}{885} = 0.240 \]
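All of these conditional probabilities can be obtained in one step by dividing each row of the class-by-survival table by its row total. The sketch below uses prop.table for this; the names counts and cond are our own.

```r
# Class-by-survival counts from the built-in Titanic table.
counts <- apply(Titanic, c(1, 4), sum)

# Dividing each row by its total gives Pr(Survived = column | Class = row).
cond <- prop.table(counts, margin = 1)
print(round(cond, 3))

# Unconditional probabilities of not surviving and surviving.
uncond <- colSums(counts) / sum(counts)
print(round(uncond, 3))
```

The "Yes" column of the first table reproduces the four conditional survival probabilities above, and each row sums to 1.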
3.5.1 Relating Conditional and Unconditional Probabilities
The defining equation (3.1) for conditional probability can be
written as
\[ \Pr(AB) = \Pr(A \mid B)\Pr(B), \tag{3.2} \]
which is often more useful, especially when Pr(A|B) is easily
determined from the description of the
experiment. There is an even more useful result sometimes
called the law of total probability. Let
B1, B2, · · · , Bk be pairwise disjoint events such that each
Pr(Bi) > 0 and Ω = B1 ∪ B2 ∪ · · · ∪ Bk.
Let A be another event. Then,
\[ \Pr(A) = \sum_{i=1}^{k} \Pr(A \mid B_i)\Pr(B_i). \tag{3.3} \]
This is quite easy to show since A = (AB1) ∪ · · · ∪ (ABk) is a
union of pairwise disjoint events and
Pr(ABi) = Pr(A|Bi)Pr(Bi).
Example 3.3. Diagnostic Tests:
Let D denote the presence of a disease in a randomly selected
member of a given population. Suppose
that there is a diagnostic test for the disease and let T denote
the event that a random subject tests
positive, that is, that the test indicates the disease. The
conditional probability Pr(T |D) is called the
sensitivity of the test. The conditional probability Pr(∼T |∼D)
is called the specificity of the test. The
unconditional probability Pr(D) is called the prevalence of the
disease in the population. A good test
will have both a high sensitivity and a high specificity, although
there is usually a trade-off between
the two. The unconditional probability that a randomly chosen
subject tests positive for the disease
is
Pr(T ) = Pr(T |D)Pr(D) + Pr(T |∼D)Pr(∼D)
Suppose that the disease is rare, Pr(D) = 0.02, and that the
sensitivity of the test is Pr(T |D) =
0.95 with specificity Pr(∼T |∼D) = 0.85. The false positive rate
for the test is Pr(T |∼D) = 1 −
Pr(∼T |∼D) = 0.15. The unconditional probability of a positive
test result is
Pr(T ) = 0.95× 0.02 + 0.15× 0.98 = 0.166
16.6% of the population will test positive for the disease, even
though only 2% have it.
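The arithmetic of this example is easy to script, which makes it simple to try other prevalences or test characteristics. A minimal sketch, with variable names of our own choosing:

```r
prevalence  <- 0.02   # Pr(D)
sensitivity <- 0.95   # Pr(T | D)
specificity <- 0.85   # Pr(not T | not D)

# False positive rate Pr(T | not D).
false.positive <- 1 - specificity

# Law of total probability over the two disjoint cases D and not D.
pr.T <- sensitivity * prevalence + false.positive * (1 - prevalence)
print(pr.T)
```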
3.5.2 Bayes’ Rule
Bayes’ rule is named for Thomas Bayes, an eighteenth century
clergyman and part-time mathematician.
As given below, it is merely a relationship between
conditional probabilities but it is associated
with Bayesian inference, a distinct philosophy and methodology
of statistical practice. Bayes’ rule is
often described as a rule for calculating conditional "posterior"
probabilities from unconditional "prior"
probabilities.
Bayes’ Rule: Let A and B1, B2, · · · , Bk be given as in the law
of total probability (3.3) and assume
Pr(A) > 0. Then for each i,
\[ \Pr(B_i \mid A) = \frac{\Pr(A \mid B_i)\Pr(B_i)}{\Pr(A)}, \tag{3.4} \]
where Pr(A) is calculated as in (3.3).
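As an illustration of (3.4), the sketch below applies Bayes' rule to the diagnostic test numbers of Example 3.3, with the partition B1 = D, B2 = ∼D and A = T. The vector names prior, likelihood and posterior are our own.

```r
# Numbers from Example 3.3 (diagnostic test).
prior      <- c(D = 0.02, notD = 0.98)   # Pr(B_i): disease present or absent
likelihood <- c(D = 0.95, notD = 0.15)   # Pr(T | B_i)

# Denominator of (3.4), computed by the law of total probability (3.3).
pr.T <- sum(likelihood * prior)

# Bayes' rule: posterior probabilities Pr(B_i | T).
posterior <- likelihood * prior / pr.T
print(round(posterior, 4))
```

Under these numbers a positive test raises the probability of disease from the prior 2% to roughly 11%, because most positive results come from the much larger disease-free group.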
Example 3.4. Urn 1 contains 3 red balls and 5 white balls. Urn 2
contains 6 red balls and 3 white
balls. A fair coin is tossed (meaning that heads and tails are
equally likely). If a head turns up, a ball
is randomly selected from Urn 1. If a tail comes up, a ball is
randomly selected from Urn 2. Given
that a white ball was selected, what is the probability that it
came from Urn 1?