Basic Statistics for Beginners 123458921

Elementary Statistics for the
Biological and Life Sciences
STAT 205
University of South Carolina
Columbia, SC
© 2011, University of South Carolina. All rights reserved, except where previous rights
exist. No part of this material may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means — electronic, mechanical, photoreproduction,
recording, or scanning — without the prior written consent of the University of South
Carolina.

Motivation: why analyze data?
 Clinical trials/drug development:
compare existing treatments with new
methods to cure disease.
 Agriculture: enhance crop yields,
improve pest resistance
 Ecology: study how ecosystems
develop/respond to environmental
impacts
 Lab studies: learn more about
biological tissue/cellular activity

Chapter 2: Description of
Populations and Samples
Selected tables and figures from Samuels, M. L., and Witmer, J. A., Statistics for the
Life Sciences, 3rd Ed. © 2003, Prentice Hall, Upper Saddle River, NJ. Used by
permission.

Statistics is:
 Statistics is the science of
• collecting,
• summarizing,
• analyzing, and
• interpreting
data.
 Goal: to understand the underlying
biological phenomena that generate
the data.

Random Variables
 Data are generated by some random
process or phenomenon.
 Any observed datum represents the
outcome of a Random Variable.
 NOTATION: upper case letter, W, X, Y, etc.

Types of Random
Variables
 Qualitative
• Categorical (e.g., blood type: A, B, AB, O)
• Ordinal (e.g., therapy response: none,
some, cured)
 Quantitative
• Discrete (e.g., number of nests – 0,1,2,…)
• Continuous (e.g., cholesterol conc. – 220.2,
210.4, 180.9, etc.)

Random Samples
 We take data as samples from a larger
population.
 DEF’N: A SAMPLE is a collection of
‘subjects’ upon which we measure one
or more variables.
 DEF’N: The SAMPLE SIZE is the
number of subjects in a sample.
NOTATION: n.

Observations
 DEF’N: The OBSERVATIONAL UNIT is
the type of subject being sampled.
Example: observational units could be
(i) baby, (ii) moth, (iii) Petri dish, etc.
 DEF’N: An OBSERVATION is a
recorded outcome of a variable from a
random sample.
NOTATION: lower case letter, x, y, etc.

Frequency Distributions
 DEF’N: A FREQUENCY DISTRIBUTION is a
summary display of the frequencies of
occurrence of each value in a sample.
 For continuous (Ex. 2.4, 2.6, 2.7, & 2.8) or
categorical (Ex. 2.1, 2.2, 2.3, & 2.5) data.
 DEF’N: A RELATIVE FREQUENCY is a raw
frequency divided by n:
$$\text{Rel. Freq.} = \frac{\text{Freq}}{n}$$
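As a rough illustration of the two definitions above, the sketch below tabulates frequencies and relative frequencies. The blood-type sample and the helper name `relative_frequencies` are hypothetical and are not taken from the text.

```python
from collections import Counter

def relative_frequencies(sample):
    """Return {value: (frequency, relative frequency)} for a sample."""
    n = len(sample)                      # sample size
    counts = Counter(sample)             # raw frequency of each value
    return {value: (freq, freq / n) for value, freq in sorted(counts.items())}

# Hypothetical categorical sample (not from the text).
blood_types = ["A", "O", "O", "B", "A", "O", "AB", "A", "O", "O"]
table = relative_frequencies(blood_types)
for value, (freq, rel) in table.items():
    print(f"{value:>2}  freq={freq}  rel. freq={rel:.2f}")
```

The relative frequencies always sum to 1, which is a quick check on any such table.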

Example 2.4
Ex. 2.4: Y = no. of piglets surviving 21
days (litter size).
A sample of n=36 pigs (sows) generated
the data in Table 2.4.

Dot Plot
 A DOT PLOT is a simple graphic where
dots indicate observed data in a sample.
 Ex. 2.4: Fig. 2.4 gives the dot plot for the
litter size data:

Histogram
 A HISTOGRAM is a simple bar chart where
the bars replace the dots in a dot plot.
 Ex. 2.4 (cont’d): Fig. 2.5 gives the
histogram for the litter size data.

Stemplot
 A STEMPLOT (a.k.a. STEM-LEAF
DIAGRAM) is a dot plot (often drawn on its
side) with data information replacing the
dots.
 The ‘stems’ are the core values of the data,
set in common groups.
 The ‘leaves’ are the last digits of each
datum.

Example 2.8
 Ex. 2.8: Y = radish growth. Data in Table 2.8:
 (Ordered) stemplot in Fig. 2.15:

Frequency Distn’s
 Frequency distributions come in varied
shapes:
• Symmetric & bell-shaped
• Symmetric, not bell-shaped
• Asymmetric & skewed right (rt. tail longer)
• Asymmetric & skewed left (left tail longer)
• Bimodal (two distinct clumps)
 We use histograms, etc., to visualize these
shapes in the data.

Histogram for continuous
data
 For cont. data, histogram is defined by
constructing bins to toss data in.
 Let y(1) be the smallest (min) and y(n) be the
largest (max) values in the data set.
 Divide the interval [y(1), y(n)] into, say, 5 to 20
bins of equal width (more bins for more data,
e.g. about n/5 bins in total).
 Count how many obs. in each bin...that's it!
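The binning recipe above can be sketched as follows. The data values are hypothetical (only the first three appear earlier as sample cholesterol readings), and placing the maximum in the last bin is one possible convention, assumed here.

```python
def histogram_counts(data, num_bins):
    """Split [min, max] into equal-width bins and count observations per bin."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for y in data:
        k = int((y - lo) / width)
        k = min(k, num_bins - 1)   # assumed convention: the maximum goes in the last bin
        counts[k] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts

# Hypothetical cholesterol concentrations.
chol = [180.9, 210.4, 220.2, 195.0, 201.3, 188.7, 215.6, 199.9, 207.2, 192.4]
edges, counts = histogram_counts(chol, num_bins=4)
print(counts)
```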

Descriptive Statistics
 DEF’N: The SAMPLE MEAN is the
arithmetic average of a set of n data values.
 NOTATION:
$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{y_1 + y_2 + \cdots + y_n}{n}$$
 The sample mean is often viewed as a kind
of ‘balance point’ in the data.

Example 2.15
 Ex. 2.15: Y = weight gain (lb) of lambs on
special diet. Data: {11, 13, 19, 2, 10, 1}
 n = 6:
$$\bar{y} = \frac{11 + 13 + 19 + 2 + 10 + 1}{6} = \frac{56}{6} = 9.33 \text{ lb}$$
 See Fig. 2.27.
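A minimal sketch of the same calculation, using the lamb data from Ex. 2.15. It also checks the ‘balance point’ idea: deviations from the mean sum to zero (up to rounding error). Variable names are illustrative only.

```python
lamb_gain = [11, 13, 19, 2, 10, 1]   # weight gain (lb), Ex. 2.15

n = len(lamb_gain)
y_bar = sum(lamb_gain) / n                    # sample mean
deviations = [y - y_bar for y in lamb_gain]   # distances from the balance point

print(round(y_bar, 2))                 # 9.33
print(abs(sum(deviations)) < 1e-9)     # True: deviations cancel out
```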

Sample Median
 DEF’N: The SAMPLE MEDIAN is the
value of the data nearest to their
middle. It splits the data in half.
 Find the median by ordering the
data, and calculating their middle
point (n odd) or the average of their
two middle points (n even).
 NOTATION: Q2

New notation
 Original data: y1, y2,…, yn.
 Ordered data: y(1), y(2),…, y(n).
 Example: y1=3.7, y2=-2.0, y3=-7.5, y4=2.1.
 Then: y(1)=-7.5, y(2)=-2.0, y(3)=2.1, y(4)=3.7.
 If n is odd then Q2 is middle ordered value.
 If n is even then Q2 is average of middle
two ordered values.

Example 2.17
 Ex. 2.17: (2.15 cont’d) Lamb weight gain.
n = 6 is even, so find Q2 as the avg. of the two
middle points.
 Ordered data: y(1) = 1, y(2) = 2,
y(3) = 10, y(4) = 11, y(5) = 13, y(6) = 19.
$$Q_2 = \frac{10 + 11}{2} = 10.5 \text{ lb}$$
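The odd/even rule for the median can be sketched as below. The function name `sample_median` is hypothetical; Python's built-in `statistics.median` follows the same rule.

```python
def sample_median(data):
    """Middle ordered value (n odd) or average of the two middle values (n even)."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

lamb_gain = [11, 13, 19, 2, 10, 1]   # Ex. 2.15 / 2.17
print(sample_median(lamb_gain))      # 10.5
```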

Example 2.19
Ex. 2.19: Y = cricket singing times.
Data in Table 2.10:

Skewness
 Mean & median indicate skewness:
• If data are skewed right, mean > median.
• If data are skewed left, mean < median.
• If data are symmetric, mean ≈ median.
 Both the mean and the median are useful
summary measures of location. The
median is slightly more ROBUST to
extreme values of yi, but of course, the
mean is easier to calculate.

Quartiles
DEF’N: The QUARTILES of a distribution are
points that separate the data into quarters or
fourths:
• The first quartile separates the lower 25% of
the data from the upper 75%. NOTATION: Q1
• The second quartile separates the lower 50%
of the data from the upper 50%. NOTATION: Q2
• The third quartile separates the lower 75% of
the data from the upper 25%. NOTATION: Q3

Example 2.20
 Ex. 2.20: Y = Systolic blood pressure
(mm Hg) in men; n= 7.
 Ordered data:
y(1) = 113, y(2) = 124, y(3) = 124,
y(4) = 132,
y(5) = 146, y(6) = 151, y(7) = 170.
 Q1 = 124
 Q2 = 132
 Q3 = 151
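One way to reproduce these quartiles is to take the median of each half of the ordered data, leaving out the median itself when n is odd. That convention is an assumption here (it matches the values above); software packages often use other interpolation rules and may give slightly different answers.

```python
from statistics import median

def quartiles(data):
    """Q1, Q2, Q3 via the median-of-halves convention (assumed here)."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # When n is odd, the median itself is excluded from both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

blood_pressure = [151, 124, 132, 170, 146, 124, 113]   # mm Hg, Ex. 2.20
q1, q2, q3 = quartiles(blood_pressure)
print(q1, q2, q3)   # 124 132 151
```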

IQR
 DEF’N: The INTER-QUARTILE RANGE is
IQR = Q3 – Q1
 DEF’N: The MINIMUM is the smallest value
of a data set or distribution.
NOTATION: y(1)
 DEF’N: The MAXIMUM is the largest value
of a data set or distribution.
NOTATION: y(n)

Five Number Summary
 DEF’N: The FIVE NUMBER SUMMARY is
{y(1), Q1, Q2, Q3, y(n)}
 DEF’N: A BOXPLOT is a graphic plot of the
5-no. summary, with a box spanning the
IQR and bridging the quartiles:
[Boxplot diagram: whiskers extend to y(1) and y(n); the box spans Q1 to Q3, with a line at Q2.]

Example 2.22
Ex. 2.22: Y = radish growth data from Ex.
2.8. Five-no. summary is {8, 15, 21, 30, 37}.
Boxplot is given in Fig. 2.30:

Example 2.23
Ex. 2.23: Y = radish
growth data over three
different growth
regimes (see Ex. 2.9).
In Fig. 2.32, we use
boxplots for comparative
purposes.

Outliers
 DEF’N: An OUTLIER is an obsv’n that differs
dramatically from the rest of the data.
Formally: Yi is an outlier if
Yi < Q1 – (1.5 × IQR) (below the “lower fence”) or
Yi > Q3 + (1.5 × IQR) (above the “upper fence”)

Example 2.25
 Ex. 2.25: Y = radish growth data in full light
(from Ex. 2.23). The ordered data are:
3, 5, 5, 7, 7, 8, 9, 10, 10, 10, 10, 14, 20, 21
 IQR = Q3 – Q1 = 10 – 7 = 3
 Upper fence = Q3 + (1.5 × IQR)
= 10 + (1.5)(3) = 14.5
 Lower fence = Q1 – (1.5 × IQR)
= 7 – (1.5)(3) = 2.5
 y = 20 and y = 21 are outliers (both exceed 14.5;
no value falls below 2.5).
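The fence calculation for the radish data can be sketched as follows. The quartile convention (median of each half, dropping the median when n is odd) is an assumption, and `outlier_fences` is a hypothetical helper name.

```python
from statistics import median

def outlier_fences(data):
    """Return (lower fence, upper fence) using Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if n % 2 else ordered[half:])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

radish = [3, 5, 5, 7, 7, 8, 9, 10, 10, 10, 10, 14, 20, 21]   # Ex. 2.25
lower, upper = outlier_fences(radish)
outliers = [y for y in radish if y < lower or y > upper]
print(lower, upper, outliers)   # 2.5 14.5 [20, 21]
```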

Dispersion
 DEF’N: The SAMPLE RANGE is
Range = Y(n) – Y(1) = Max. – Min.
 DEF’N: The SAMPLE VARIANCE is
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2$$
 DEF’N: The SAMPLE STANDARD
DEVIATION (SD) is $S = \sqrt{S^2}$
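A short sketch of the variance and SD formulas. Reusing the lamb weight-gain data from Ex. 2.15 is a choice made for this illustration; the slides do not work this particular calculation. The result is cross-checked against `statistics.variance`, which also divides by n - 1.

```python
from math import sqrt
from statistics import variance

lamb_gain = [11, 13, 19, 2, 10, 1]   # data reused from Ex. 2.15 (illustration only)

n = len(lamb_gain)
y_bar = sum(lamb_gain) / n
s2 = sum((y - y_bar) ** 2 for y in lamb_gain) / (n - 1)   # sample variance
s = sqrt(s2)                                              # sample SD
sample_range = max(lamb_gain) - min(lamb_gain)            # max - min

print(round(s2, 2), round(s, 2), sample_range)
```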

The Empirical Rule
The sample mean and the sample SD
are useful in describing data sets. The
EMPIRICAL RULE states that
• ~68% of the data lie between $\bar{Y} - S$ and $\bar{Y} + S$
• ~95% of the data lie between $\bar{Y} - 2S$ and $\bar{Y} + 2S$
• >99% of the data lie between $\bar{Y} - 3S$ and $\bar{Y} + 3S$

Example 2.36
Ex. 2.36: Suppose Y = pulse rate after 5 mins.
of exercise. For n = 28 subjects, we find $\bar{Y}$ =
98 (beats/min) and S = 13.4 (beats/min).
Thus, e.g., from the empirical rule we expect
~95% of the data to lie between
98 – (2)(13.4) = 98 – 26.8 = 71.2 beats/min
and
98 + (2)(13.4) = 98 + 26.8 = 124.8 beats/min.

Inference
 DEF’N: The POPULATION is the larger
group of subjects (organisms, plots,
regions, ecosystems, etc.) on which we
wish to draw inferences.
 DEF’N: A PARAMETER is a quantified
population characteristic. E.g., the popl’n
mean is µ, the popl’n SD is σ.
 DEF’N: A STATISTIC is a sample quantity
used to estimate a popl’n parameter.

Proportions
 DEF’N: The POPULATION PROPORTION
is the proportion of subjects exhibiting a
particular trait or outcome in the popl’n.
(It generalizes to the probability that any
popl’n element will exhibit the trait.)
NOTATION: p
 DEF’N: The SAMPLE PROPORTION is the
number of sample elements exhibiting the
trait, divided by the sample size, n.
NOTATION: $\hat{p}$

Chapter 3: Random
Sampling, Probability, and the
Binomial Distribution

Random Samples
 DEF’N: A SIMPLE RANDOM SAMPLE of n
items is a data set where
(a) every popl’n element has an equal chance
of selection, and
(b) every popl’n element is chosen
independently of every other element.
 This draws upon the larger concept of
RANDOMIZATION: selection of data that
avoids sources of possible bias.

Random Sampling
To choose a random sample:
1. assign each popl’n element a unique
code (or set of codes);
2. from a random number table (Table 1,
p. 670) or via computer, in a systematic
manner select n random digits whose
range corresponds to the codes assigned
above; and
3. select every element if its code appears in
step (2), ignoring repeated codes or those
with no assignment.

Example 3.1
Ex. 3.1: Simple random sample of size n =
6 from population of 75 elements.
1. label each element 01, 02, …, 75
2. select random digits from a source such
as Table 1, a TI-84, or R.
3. choose elements for the sample if they
correspond to the selected random digits
(ignore repeats and drop-outs)
See Table 3.1.

 The sample uses elements 23, 38, 59, 21, 08, 09
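When a computer is used instead of a random number table, the three steps might look like the sketch below. The seed value is arbitrary and hypothetical, so the elements drawn will not match the sample listed above. `random.sample` draws without replacement, which plays the role of "ignore repeats".

```python
import random

# Step 1: label each of the 75 population elements 01, 02, ..., 75.
population = [f"{i:02d}" for i in range(1, 76)]

# Steps 2-3: draw n = 6 distinct labels at random.
rng = random.Random(205)              # hypothetical seed, for reproducibility only
sample = rng.sample(population, k=6)  # sampling without replacement

print(sample)
```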

Probability
 DEF’N: A PROBABILITY is the chance of
some event, E, occurring in a specified
manner. NOTATION: P{E}
 We often view probabilities from a
Relative Frequency Interpretation:
$$P\{E\} = \frac{\#\text{ ways } E \text{ occurs}}{\#\text{ total events}}$$

Example 3.12
Ex. 3.12: Toss a fair coin twice. We know
P{H} = 1/2 (see Ex. 3.8). What is P{HH}?
 Consider all possible outcomes:
HH, HT, TH, TT
 If each outcome is equally likely, then
$$P\{HH\} = \frac{\#\,HH}{\#\text{ all outcomes}} = \frac{1}{4}$$

Probability Rules
 Rule 1: 0 ≤ P{E} ≤ 1.
 Rule 2: The entirety of events has
probability = 1. That is, if E1, ..., Ek are
all the possible events, ∑P{Ei} = 1.
(here, E1, ..., Ek are disjoint!)
 Rule 3: (The Complement Rule):
If $E^c = \{\text{not } E\}$, then $P\{E^c\} = 1 - P\{E\}$.

Example 3.19
 Ex. 3.19: U.S. Blood types:
P{O} = 0.44 P{A} = 0.42
P{B} = 0.10 P{AB} = 0.04
 Note: (1) all are between 0 and 1 ✓
and (2) P{O} + P{A} + P{B} + P{AB}
= 0.44 + 0.42 + 0.10 + 0.04
= 1.00 ✓
 So, e.g., $P\{O^c\}$ = 1 – P{O} = 1 – 0.44 = 0.56

Probability (cont’d)
 DEF’N: Two events, E1 and E2, are
DISJOINT (a.k.a MUTUALLY EXCLUSIVE) if
they cannot occur simultaneously.
 DEF’N: The UNION of two events, E1 and
E2, is the event that E1 or E2 (or both)
occurs.
 DEF’N: The INTERSECTION of two
events, E1 and E2, is the event that both E1 and
E2 occur.

Venn Diagrams
 A useful graphic to conceptualize how
events interrelate is the Venn Diagram.
 For example, Fig. 3.8 shows a Venn Diagram
with 2 intersecting events, E1 and E2:

Probability Rules (cont’d)
 We often denote the entirety of events as
the Sample Space, S. Conversely, the
Null Space is $\varnothing = S^c$.
 Rule 4: If E1 and E2 are disjoint, then
P{E1 or E2} = P{E1} + P{E2}.
 Rule 5: If E1 and E2 are any two events,
then
P{E1 or E2} = P{E1} + P{E2} – P{E1 and E2}.

Example 3.20
Ex. 3.20: Hair/Eye color of 1770 men. We
have the following distribution of traits:
So, e.g., P{Black Hair} = 500/1770, etc.

Find P{Black Hair OR Red Hair}.
Clearly, E1 = {Black Hair} and
E2 = {Red Hair} are disjoint,
so from Rule 4,
P{Black Hair OR Red Hair}
= P{Black Hair} + P{Red Hair}
= 500/1770 + 70/1770 = 570/1770
= 0.32.

Now, find P{Black Hair OR Blue Eyes}.
Here, E1 = {Black Hair} and
E3 = {Blue Eyes} are NOT disjoint,
so apply Rule 5:
P{Black Hair OR Blue Eyes}
= P{Black Hair} + P{Blue Eyes}
– P{Black Hair AND Blue Eyes}
= 500/1770 + 1050/1770 – 200/1770
= 1350/1770 = 0.76.
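Rules 4 and 5 can be checked with exact fractions, using only the counts quoted in this example (500 black hair, 70 red hair, 1050 blue eyes, 200 with both black hair and blue eyes, out of 1770). Variable names are illustrative.

```python
from fractions import Fraction

N = 1770                                 # men in the study
p_black = Fraction(500, N)
p_red = Fraction(70, N)
p_blue = Fraction(1050, N)
p_black_and_blue = Fraction(200, N)

# Rule 4: disjoint events, so probabilities simply add.
p_black_or_red = p_black + p_red
# Rule 5: overlapping events, so subtract the intersection once.
p_black_or_blue = p_black + p_blue - p_black_and_blue

print(round(float(p_black_or_red), 2), round(float(p_black_or_blue), 2))   # 0.32 0.76
```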

Probability (cont’d)
 DEF’N: Two events, E1 and E2, are
INDEPENDENT if knowledge that E1 occurs
does not affect P{E2} and vice versa.
If two events are not independent, they are
DEPENDENT.
 DEF’N: A CONDITIONAL PROBABILITY is
the probability that one event occurs, given
that the other has already occurred.
NOTATION: P{E1 | E2}.

Probability Rules (cont’d)
 Rule 6: If E1 and E2 are independent, then
P{E1 and E2} = P{E1} × P{E2}.
 Rule 7: If E1 and E2 are any two events, then
P{E1 and E2} = P{E1} × P{E2 | E1}
= P{E2} × P{E1 | E2}.
 Consequences:
• if E1 and E2 are independent, then
P{E1} = P{E1 | E2} and P{E2} = P{E2 | E1}
• also, P{E2 | E1} = P{E1 and E2}/P{E1} if P{E1}≠0.

Examples 3.21–3.22
Exs. 3.21–3.22 (3.20, cont’d): Hair/Eye color
of 1770 men.
Refer back to Table 3.3. There, we saw
P{Blue Eyes AND Black Hair} = 200/1770,
while P{Black Hair} = 500/1770. So,
$$P\{\text{Blue Eyes} \mid \text{Black Hair}\} = \frac{P\{\text{Blue Eyes AND Black Hair}\}}{P\{\text{Black Hair}\}} = \frac{200/1770}{500/1770} = \frac{200}{500} = 0.40$$

Example 3.25
Ex. 3.25 (3.20, cont’d): Hair/Eye color of 1770
men.
In Table 3.3, there is no evidence of independence
between Hair & Eye color. So, e.g.,
$$P\{\text{Red Hair AND Brown Eyes}\} = P\{\text{Red Hair}\} \times P\{\text{Brown Eyes} \mid \text{Red Hair}\} = \frac{70}{1770} \times \frac{20}{70} = \frac{20}{1770}$$
which agrees with the display in Table 3.3.

Bayes’ rule
 Bayes’ rule is a powerful identity for
obtaining conditional probabilities:
 P{A|B} = P{B|A}P{A} / P{B}.
 By the law of total probability, can get
$P\{B\} = P\{B \mid A\}P\{A\} + P\{B \mid A^c\}P\{A^c\}$.
 Useful in diagnostic screening
applications.

Diagnostic tests
 Say a test is positive or negative: T+ or T-.
 A subject has the disease or not: D+ or D-.
 P(D+) is the prevalence of the disease.
 P(T+|D+) is the sensitivity of the test.
 P(T-|D-) is the specificity of the test.
 P(D+|T+) and P(D-|T-) are the predictive
values positive and negative, respectively.

Screening for hepatitis C at an
STD clinic
 (Weisbord, Trepka, Zhang, Smith, and Brewer,
2003). At an STD clinic in Miami, Florida,
patients were screened for hepatitis C using
CDC screening criteria in the form of a
questionnaire.
 Study concluded P(T+ |D+) = 0.61, P(T− |D−)
= 0.91 and P(D+) = 0.047.

Law of total probability for
P(T+)
 P(T+)=P(T+|D+)P(D+)+P(T+|D−)P(D−)
=P(T+|D+)P(D+)+[1−P(T−|D−)][1−P(D+)]
= 0.61×0.047+(1−0.91)(1−0.047)
= 0.114.
 Say the CDC criteria tell me I’m at risk for
hepatitis C, i.e. my questionnaire yields T+.
 What is the probability that I really have it?

To C or not to C?
 P(D+|T+) = P(T+|D+)P(D+)/P(T+)
= 0.61 × 0.047 / 0.114
= 0.25.
 There’s still only a 1 in 4 chance I’ve got
hepatitis C. But this is much larger than
P(D+)=0.047, the probability before knowing
T+.
 Better get a blood test.
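The two-step calculation (law of total probability, then Bayes' rule) can be sketched as a small function. The function and argument names are hypothetical; the inputs are the sensitivity, specificity, and prevalence quoted from the study.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Return (P(T+), P(D+|T+)) from sensitivity, specificity and prevalence."""
    # Law of total probability: P(T+) = P(T+|D+)P(D+) + P(T+|D-)P(D-)
    p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    # Bayes' rule: P(D+|T+) = P(T+|D+)P(D+) / P(T+)
    ppv = sensitivity * prevalence / p_pos
    return p_pos, ppv

p_pos, ppv = positive_predictive_value(sensitivity=0.61, specificity=0.91, prevalence=0.047)
print(round(p_pos, 3), round(ppv, 2))   # 0.114 0.25
```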

Density Curves
 DEF’N: A RANDOM VARIABLE is a
measured outcome of some random
process.
 When a random variable is discrete, it is
usually straightforward to interpret
probabilities associated with it.
 For instance, if Y = {# leaves on tree}:
 P{Y = 122} = 0.42 is interpretable
 P{Y = 18} = 0.02 is interpretable
 but P{Y=120.472} is not interpretable.

Probability Histogram
A probability histogram is used to
visualize discrete probability masses:
Notice: each “mass” has area=probability,
and all masses sum to 1.
[Figure: probability histogram of P{Y = k} against k = 1, …, 9; vertical axis from 0 to 0.5.]

Continuous Random Variables
 By contrast, a continuous random variable
has a different probability interpretation.
 Extending the probability histogram to the
continuous case, we say Y has a
PROBABILITY DENSITY CURVE, where area
still represents probability.

Continuous Random Variables
Consequences of the continuous probability
model:
• P{Y = a} = 0 = P{Y = b} (area of a line is zero)
• So, P{Y ≤ a} = P{Y < a} + P{Y = a} = P{Y < a}
• And for that matter:
P{a ≤ Y ≤ b} = P{a < Y ≤ b}
= P{a ≤ Y < b} = P{a < Y < b}
(all if Y is continuous).

Example 3.30
Ex. 3.30: Y = diameter (in.) of tree trunk.
• Suppose the density has the form given in
Fig. 3.13:
• Then, for example, P{Y > 8} =
P{8 < Y ≤ 10} + P{Y > 10} = 0.12 + 0.07 = 0.19

Mean and Expected Value
 DEF’N: If Y is a discrete random variable,
its POPULATION MEAN is given by
µY = ∑yiP{Y = yi}
(where the sum is taken over all possible
yi’s)
 More generally, the EXPECTED VALUE of Y
is E(Y) = ∑yiP{Y = yi}.

Example 3.35
Ex. 3.35: Y = # tail vertebrae in fish.
From Table 3.4 we find
yi          20    21    22    23
P{Y = yi}   .03   .51   .40   .06
So, E(Y) = ∑yiP{Y = yi}
= (20)(.03) + (21)(.51) + (22)(.40) + (23)(.06)
= … = 21.49.

Variance
 DEF’N: If Y is a discrete random variable,
its POPULATION VARIANCE is given by
$$\sigma_Y^2 = \sum (y_i - \mu_Y)^2 P\{Y = y_i\}$$
One can show this is also
$$\sigma_Y^2 = E(Y^2) - \{E(Y)\}^2 = E(Y^2) - \mu_Y^2$$
 From this, the POPULATION STANDARD
DEVIATION of Y is $\sigma_Y = (\sigma_Y^2)^{1/2}$.

Example 3.37
Ex. 3.37: (3.35, cont’d). From Table 3.4 we
were given the values of P{Y = yi}.
Recall µY = 21.49.
So,
$$\sigma_Y^2 = \sum (y_i - \mu_Y)^2 P\{Y = y_i\} = (20-21.49)^2(.03) + (21-21.49)^2(.51) + (22-21.49)^2(.40) + (23-21.49)^2(.06) = \dots = 0.4299.$$

So $\sigma_Y^2 = 0.4299$.
But, it’s a lot easier to use
$$\sigma_Y^2 = E(Y^2) - \mu_Y^2 = \{(20)^2(.03) + (21)^2(.51) + (22)^2(.40) + (23)^2(.06)\} - (21.49)^2 = 462.25 - 461.8201 = 0.4299.$$
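Both variance formulas, along with the expected value from Ex. 3.35, can be verified with the sketch below. The dictionary layout is just one convenient way to hold the probabilities from Table 3.4.

```python
# P{Y = y} for the number of tail vertebrae (Table 3.4).
dist = {20: 0.03, 21: 0.51, 22: 0.40, 23: 0.06}

mu = sum(y * p for y, p in dist.items())                       # E(Y)
var_definition = sum((y - mu) ** 2 * p for y, p in dist.items())
var_shortcut = sum(y ** 2 * p for y, p in dist.items()) - mu ** 2   # E(Y^2) - mu^2

print(round(mu, 2), round(var_definition, 4), round(var_shortcut, 4))   # 21.49 0.4299 0.4299
```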

Rules of Expected Value
 E(·) is a mathematical operator.
 It has certain general properties:
• Rule E1: E(aX + bY) = aE(X) + bE(Y)
= aµX + bµY
• Rule E2: E(a + bY) = a + bE(Y) = a + bµY
(a “linear operator”)

Rules of Variance
The special variance operator also has
certain general properties:
• Rule E3: If X and Y are independent, then
$\sigma_{X+Y}^2 = \sigma_X^2 + \sigma_Y^2$.
• Rule E4: If X and Y are independent, then
$\sigma_{X-Y}^2 = \sigma_X^2 + \sigma_Y^2$.
• General rule: If X and Y are independent,
then
$\sigma_{aX+bY}^2 = a^2\sigma_X^2 + b^2\sigma_Y^2$.

Example 3.41
Ex. 3.41: X = mass of cylinder from balance.
Y = mass of cylinder from 2nd balance.
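Suppose $\sigma_X = 0.03$ and $\sigma_Y = 0.04$. Then, if we
calculate the difference between the two
weighings, X – Y, and the weighings are independent, we know
$$\sigma_{X-Y} = \sqrt{\sigma_X^2 + \sigma_Y^2} = \sqrt{0.03^2 + 0.04^2} = \sqrt{0.0009 + 0.0016} = \sqrt{0.0025} = 0.05$$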
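A small sketch of the general variance rule, assuming X and Y are independent. The helper name `sd_linear_combination` is hypothetical. Note that the sum X + Y and the difference X - Y have the same SD, because the coefficients are squared.

```python
from math import sqrt

def sd_linear_combination(a, sd_x, b, sd_y):
    """SD of aX + bY for independent X and Y: sqrt(a^2 var(X) + b^2 var(Y))."""
    return sqrt(a ** 2 * sd_x ** 2 + b ** 2 * sd_y ** 2)

sd_diff = sd_linear_combination(1, 0.03, -1, 0.04)   # X - Y
sd_sum = sd_linear_combination(1, 0.03, 1, 0.04)     # X + Y
print(round(sd_diff, 4), round(sd_sum, 4))           # 0.05 0.05
```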

Independent Trials
 DEF’N: The INDEPENDENT TRIALS
MODEL occurs when
(i) n independent trials are studied
(ii) each trial results in a single binary obsv’n
(iii) each trial’s success has (constant)
probability: P{success} = p
Notice that if P{success} = p, P{failure} = 1–p.
 We call this a BInS (Binary / Indep. / n is
const. / Same p) setting.

Example 3.43
Ex 3.43: Suppose 39% of organisms in a
popl’n exhibit a mutant trait. Sample n=5
organisms randomly and check for
mutation:
• Binary? ✓ (mutant vs. non-mutant)
• Indep.? ✓ (if no bias in sampling)
• n const.? ✓ (n=5)
• Same p? ✓ (p = 0.39)

Binomial Distribution
 DEF’N: In a BInS setting, if we let
Y = {# successes} then Y has a
BINOMIAL DISTRIBUTION.
 NOTATION: Y ~ Bin(n,p).
 The binomial probability function is
$$P\{Y = j\} = {}_nC_j\, p^j (1-p)^{n-j} \quad (j = 0, 1, \dots, n).$$

Binomial Coefficient
 In the binomial probability function
$$P\{Y = j\} = {}_nC_j\, p^j (1-p)^{n-j}$$
the BINOMIAL COEFFICIENT is
$${}_nC_j = \frac{n!}{j!\,(n-j)!}$$
 Also, j! is the FACTORIAL OPERATOR:
j! = j(j–1)(j–2)…(2)(1)
 We define 0! = 1.

Factorial Operator
Example of factorial operator: at n = 5,
5! = (5)(4)(3)(2)(1) = 120
4! = (4)(3)(2)(1) = 24
3! = (3)(2)(1) = 6
2! = (2)(1) = 2
So:
j      0    1    2    3    4    5
nCj    1    5    10   10   5    1
(Also see Table 3.6.)
Values of nCj are given in Table 2 (p. 674)

Table 3.6

Example 3.45
Ex 3.45 (Ex. 3.43 cont’d): Y ~ Bin(5 , 0.39);
So $P\{Y = 3\} = {}_5C_3 (.39)^3 (.61)^2$
= (10)(.0593)(.3721) = 0.22.
Can also find this via TI-84 or R.
Table 3.7 gives the full distribution.
Figure 3.15 gives a
probability histogram.
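Besides a TI-84 or R, the binomial probability function is easy to sketch in Python with `math.comb` for the binomial coefficient. The function name `binomial_pmf` is hypothetical.

```python
from math import comb

def binomial_pmf(j, n, p):
    """P{Y = j} for Y ~ Bin(n, p)."""
    return comb(n, j) * p ** j * (1 - p) ** (n - j)

n, p = 5, 0.39                                              # Ex. 3.43 / 3.45
coefficients = [comb(n, j) for j in range(n + 1)]           # nCj for j = 0..5
distribution = [binomial_pmf(j, n, p) for j in range(n + 1)]

print(coefficients)                      # [1, 5, 10, 10, 5, 1]
print(round(binomial_pmf(3, n, p), 2))   # 0.22
```

The probabilities in `distribution` sum to 1, as every probability distribution must.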

Binomial Mean & Variance
 If Y ~ Bin(n,p), the population mean and
variance are:
$\mu_Y = np$ and $\sigma_Y^2 = np(1-p)$
 Ex. 3.49: Y = {# Rh+ in BInS sample}. We’re
given p = P{Rh+} = 0.85. So, if n = 6, we
expect $\mu_Y$ = (6)(0.85) = 5.1 Rh+ in the
sample, with $\sigma_Y^2$ = (6)(.85)(.15) = 0.765, so
that $\sigma_Y = \sqrt{.765} = 0.87$.
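As a sanity check, the shortcut formulas np and np(1-p) can be compared with the mean and variance computed directly from the binomial probabilities. This is only a numerical check for n = 6, p = 0.85, not a proof.

```python
from math import comb, sqrt

n, p = 6, 0.85   # Ex. 3.49
pmf = [comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(n + 1)]

mean = sum(j * pj for j, pj in enumerate(pmf))               # sum of y * P{Y = y}
var = sum((j - mean) ** 2 * pj for j, pj in enumerate(pmf))  # sum of (y - mu)^2 * P{Y = y}

print(round(mean, 2), round(var, 3), round(sqrt(var), 2))    # 5.1 0.765 0.87
```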

Chapter 4:
The Normal Distribution

Normal Distribution
 DEF’N: A continuous random variable
Y has a NORMAL DISTRIBUTION if its
probability density can be written as
$$f(y) = \frac{1}{\sigma_Y\sqrt{2\pi}}\, e^{-(y-\mu_Y)^2/(2\sigma_Y^2)}$$
over –∞ < y < ∞.
 NOTATION: $Y \sim N(\mu_Y, \sigma_Y^2)$
 The mean and variance of a normal dist’n
are $E(Y) = \mu_Y$ and $E[(Y - \mu_Y)^2] = \sigma_Y^2$.

Normal Dist’n Examples
 The Normal distribution appears in many
biological contexts:
 Ex. 4.1: Y = serum cholesterol (mg/dL)
 Ex. 4.2: Y = eggshell thickness (mm)
 Ex. 4.3: Y = nerve cell interspike times (ms)

Normal Curve
The Normal density curve is
(i) continuous over –∞ < y < ∞
(ii) symmetric about y = µ
(iii) unimodal, and hence “bell-shaped”

Figure 4.7
Since each $(\mu, \sigma^2)$ pair indexes a different
Normal dist’n, this represents a rich family
of curves:

Standard Normal
 DEF’N: The STANDARDIZATION FORMULA
for $Y \sim N(\mu, \sigma^2)$ is
$Z = (Y - \mu)/\sigma$
This is often called a ‘Z-score’.
 If $Y \sim N(\mu, \sigma^2)$, then Z ~ N(0,1) and we say Z
has a STANDARD NORMAL dist’n.
 Std. Normal probab’s are tabulated in Table
3 (p. 675) and on text’s inside front cover.

(Portion of) Table 3, p.675

Example: (p. 124) Suppose Z ~ N(0,1).
Find P{Z ≤ 1.53}.
In Table 3: read down to row 1.5 and across to
column 0.03, giving P{Z ≤ 1.53} = 0.9370.
Hint: “always draw the picture”
[Figure: standard Normal curve with the area P(Z ≤ z) shaded.]

P(a < Z ≤ b)
 If Z ~ N(0,1), and we find P{Z ≤ 1.53} = 0.937,
notice then that P{Z > 1.53} = 1 – 0.937
= 0.063.
 Example: (p. 125) Suppose Z ~ N(0,1); then
P{–1.20 < Z ≤ 0.80} = P{Z ≤ 0.80} – P{Z ≤ –1.20}
= 0.7881 – 0.1151 = 0.6730.
(See Fig. 4.11)
 Can also find Std. Normal probabilities using TI-84
or R!
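In Python, the standard library's `statistics.NormalDist` gives the same standard Normal probabilities without a table. Because it does not round intermediate table entries, the last digit can differ slightly from table-based answers.

```python
from statistics import NormalDist

Z = NormalDist()                       # standard Normal: mean 0, SD 1

p_le = Z.cdf(1.53)                     # P{Z <= 1.53}
p_gt = 1 - p_le                        # P{Z > 1.53}, by the complement rule
p_between = Z.cdf(0.80) - Z.cdf(-1.20) # P{-1.20 < Z <= 0.80}

print(round(p_le, 4), round(p_gt, 4), round(p_between, 3))   # 0.937 0.063 0.673
```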

Empirical Rule, revisited
 If Z ~ N(0,1), it mimics the empirical rule
very closely: P{–1 < Z < 1} ≈ 0.68,
P{–2 < Z < 2} ≈ 0.95, and P{–3 < Z < 3} ≈ 0.997.
 The same effect holds for any $Y \sim N(\mu, \sigma^2)$.

Example 4.5
Ex. 4.5: Y = length of herrings (mm).
Suppose Y ~ N(54, 20.25). Then we know
$$Z = \frac{Y - 54}{\sqrt{20.25}} = \frac{Y - 54}{4.5} \sim N(0,1)$$
(a) What % of fish are less than 60 mm long?
$$P[Y < 60] = P\left[\frac{Y - 54}{4.5} < \frac{60 - 54}{4.5}\right] = P\left[Z < \frac{6}{4.5}\right] = P[Z < 1.33] = 0.9082$$

Y = length of herrings ~ N(54, 20.25).
(c) What % of fish are between 51 and 60 mm
long?
$$P[51 < Y < 60] = P\left[\frac{51 - 54}{4.5} < \frac{Y - 54}{4.5} < \frac{60 - 54}{4.5}\right] = P\left[\frac{-3}{4.5} < Z < \frac{6}{4.5}\right] = P[-.67 < Z < 1.33]$$
$$= P[Z \le 1.33] - P[Z < -.67] = 0.9082 - 0.2514 = 0.6568$$

Std. Normal Tail Areas
 We can also INVERT the std. Normal table
(Table 3):
 Z ~ N(0,1), so find P{Z < 1.96} = 0.975. Then we
know P{Z > 1.96} = 1 – 0.975 = 0.025.
 So, 2.5% of std. normal popl’n exceeds 1.96.

$z_\alpha$
More generally, if we find some number
$z_\alpha$ such that $P\{Z \le z_\alpha\} = 1 - \alpha$, we know
$P\{Z > z_\alpha\} = \alpha$ and vice versa:

Std. Normal Critical Point
 DEF’N: The UPPER-α CRITICAL POINT
from Z ~ N(0,1) is the value $z_\alpha$ such that
$P\{Z > z_\alpha\} = \alpha$.
 Find $z_\alpha$ by:
• carefully inverting Table 3
• reading off the bottom row (df = ∞) of
Table 4 (p. 677)
• using TI-84 Normal dist’n calculator or R

Percentiles
 DEF’N: The point of a distribution below
which p% lies is the pth PERCENTILE of
the dist’n.
 If Z ~ N(0,1), $z_\alpha$ is the 100(1 – α)th percentile
of Z.
 We often ask what value is the pth
percentile of a biological population (see
Ex. 4.6).

Example 4.6

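Ex. 4.6 (4.5, cont’d): Y = length of herrings ~ N(54, 20.25).
Find the 70th percentile of Y.
 We want to find y* such that P{Y < y*} = 0.70.
This is
$$P\left[\frac{Y - 54}{4.5} < \frac{y^* - 54}{4.5}\right] = P\left[Z < \frac{y^* - 54}{4.5}\right]$$
 Now, from Table 3 we find P{Z < 0.52} =
0.6985 is close to 0.70. This tells us to
equate (approximately) 0.52 and (y*–54)/4.5:
⇒ y* – 54 ≈ (0.52)(4.5)
⇒ y* ≈ (0.52)(4.5) + 54 = 56.34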
Example 4.6 (conclusion)
So, we find that approximately 70% (69.85%,
exactly) of herring are less than 56.34 mm
long.
Notice also that we derived the critical point
$z_{0.30}$ ≈ 0.52. (More precisely, we found $z_{0.3015}$
= 0.52.)
Using TI-84, we can find $z_{0.30}$ = 0.5244: this
yields the exact value y* = (0.5244)(4.5) + 54
= 56.36 for Example 4.6.
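Inverting the Normal table can also be done in Python with `NormalDist.inv_cdf`, shown here as a sketch alongside the TI-84 route. An upper-α critical point is found by asking for the 100(1 - α)th percentile.

```python
from statistics import NormalDist

Z = NormalDist()                    # standard Normal

z_030 = Z.inv_cdf(0.70)             # upper-0.30 critical point = 70th percentile of Z
y_star = 54 + 4.5 * z_030           # undo the standardization: y* = mu + sigma * z
z_0025 = Z.inv_cdf(0.975)           # upper-0.025 critical point

print(round(z_030, 4), round(y_star, 2), round(z_0025, 2))   # 0.5244 56.36 1.96
```

Equivalently, `NormalDist(mu=54, sigma=4.5).inv_cdf(0.70)` returns y* in one step.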
