STAT 101, Module 6: Probability, Expected Values, Variance
(Book: chapter 4)
Reasons for wanting probability statements
• Quantify uncertainty: “For the manager the probability of a future
event presents a level of knowledge” (p. 79).
• Quantification makes uncertainty “predictable” in that probabilities
describe what happens in large numbers. For example, a company
can figure an assumption of 5.5% faulty shipments from its supplier
into its operations planning.
• Many statements about “rates” are hidden probability statements:
“The take-rate of our services from a telemarketing campaign is
2.1%.” This means that in the past and presumably in the future about
21 out of 1000 targeted households sign up for the offered service.
We can say the estimated probability of taking the service after
targeting by telemarketing is 0.021.
• Quantification with probabilities can give insight. Some examples
found by googling “probability of successful corporate take-over”:
o “The probability of merger success is inversely related to
analyst coverage at the time of announcement.”
o “The probability of takeover success rose to 71.3% when a
bidder initially held the shares of the target firm.”
Probability as an idealization
• Probability is an idealization. It is to statistics what straight lines are
to geometry, point masses to Newtonian mechanics, point charges to
electrodynamics,…
• The idealization comes about for relative frequencies when the
number of cases goes to infinity. The limit is the probability. (Book:
p.90)
• Probabilities do not “exist”, just like straight lines don’t “exist”.
Relative frequencies and taut ropes exist, but probabilities and
straight lines do not “exist”. Both are constructed by a thought
process called “idealization”.
• Why do we need idealizations? Idealizations simplify our thinking
and focus on aspects of reality that are essential in a given context.
o In geometry we don’t want to talk about the thickness of a line.
Thickness matters for taut ropes, but when we do geometry
we “idealize away” thickness by thinking of a line as “infinitely
thin”.
o In statistics we need probability to have a target that is
estimated by relative frequencies. We can think of the
probability as the relative frequency if we had an infinite
amount of data.
Like geometry does away with the thickness of taut ropes, probability
does away with limitations of finite data samples.
Foundations of frequentist probability
• Probabilities are numbers between 0 and 1 (inclusive) assigned to
outcomes of random events.
• A random event can have one of two outcomes, which we may
characterize as “yes” or “no”.
• Examples of random events:
o head on a coin toss,
o 6 on the rolling of a die,
o ticket 806400058584 is drawn in blind drawing from a well-
tumbled box of lottery tickets,
o any trader committing insider trading in 2007,
o any targeted household signing up for my firm’s services,
o a randomly caught bear is tagged and found next season again,
o GM stock gains by more than 5% in a given week,
o the S&P loses more than 3% in a given day,…
• The term “random” does not mean haphazard; it means “repeatable
in principle, but with unpredictable outcome”.
• The term “unpredictable” is relative to what we know; if we know
more, less is unpredictable, and vice versa.
For a random trader on Wall Street, we would not know whether he/she
would do insider trading; if the trader is my dad or mom, I might be
certain that he/she wouldn’t. Hence for a “random trader” we might
estimate the probability of insider trading to be 1%; for mom or dad there
is no probability because there is no repeatability, and I know.
• It should be possible to “independently repeat” the random event.
Independence means knowing the outcome of the first realization of
the event should not help in guessing the outcome of a second
realization of the event.
Independence of realizations of a random event is sometimes difficult to
justify. For example, if it rains today, chances are probably elevated that it
will rain tomorrow, too. Hence there is information in the outcome of the
first realization for the outcome of the second realization.
• A counter-example: the event that a whale jumps on my yacht.
Why is this not a random event? It only defines the ‘yes’ side of an
event but not the ‘no’ side. To actually have an event in our sense,
one has to be able to determine when it did not occur. The ‘no’
side of an event is needed so we have a notion of total number of
trials/observations/measurements.
To make “a whale jumps on my yacht” a well-specified event, one would
have to modify it to something like “the event that a whale jumps on my
yacht on a given day”, or “the event that a whale jumps on a given yacht
throughout its lifetime”, or “the event that a whale jumps on any yacht
anywhere on a given day”, or “the event that a whale jumps on a given
yacht on a given day”. These specifications define the repeatability as
across days, or yachts, or combinations of days and yachts. Day-to-day
repetitions may be justified as independent if one thinks that the next day’s
jumping of a whale on a yacht is not linked to today’s behaviors of whales
and yachts.
A similar thought applies to events such as 5% daily gains of GM stock
and 3% daily losses of the S&P index. Is it so clear that tomorrow’s
likelihood of these events is unrelated to today’s movements of GM stock
or the S&P index?
• The preceding thoughts show that there exist deep problems in
applying probability.
There exist more free-wheeling notions of probability, called “subjective
probability”. It formalizes strength of belief and can be founded on
betting: how much money one is willing to bet on an outcome implies a
belief in a certain probability of the outcome. Our notion of probability,
however, is “frequentist”.
• Definition: Probability is the limit of relative frequencies of “yes”
outcomes when the number of observed outcomes goes to infinity.
Recall the definition of relative frequency:
rel.freq.(yes) = #yes / (#yes + #no)
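This frequentist definition can be watched in action with a small simulation (a Python sketch; the course uses JMP, so this is only an illustration): the relative frequency of heads in simulated fair coin flips settles near the probability ½ as the number of flips grows.

```python
import random

random.seed(1)  # fixed seed so the run is reproducible

def rel_freq_heads(n):
    """Relative frequency of 'yes' (= heads) in n simulated fair coin flips."""
    heads = sum(random.random() < 0.5 for _ in range(n))
    return heads / n  # #yes / (#yes + #no)

# As n grows, the relative frequency approaches the probability 0.5.
for n in (100, 10_000, 1_000_000):
    print(n, rel_freq_heads(n))
```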
Do we ever know probabilities?
• No we don’t. But for thought experiments we often assume them.
• Example: Flipping a coin; we assume P(head) = P(tail) = ½.
This is actually taken as the definition or model of a fair coin.
Similarly, we’d take P(any particular card) = 1/52 as the definition of
a well-shuffled deck. Finally, P(any of 1…6) = 1/6 is the definition of
a fair die.
• For a finite sequence of independent repetitions of a random event, we
can estimate the probability by the relative frequency. Later we will
learn how to estimate the precision of this estimate with so-called
“confidence intervals”.
PS: In past long-run experiments with flipping coins, relative frequencies
have never been seen as close to ½ as they should be for a truly fair
coin. We can say that fair coins do not exist. Yet the deviations from
fairness are so small that they don’t matter.
Notations, definitions, and properties of probabilities
• A sample space is the set of all possible outcomes
of a random experiment. Notation: Ω
• A random event A is a subset of the set Ω of possible outcomes.
• The set of outcomes that are not in A is denoted Aᶜ (the complement of A).
Note that this is also a random event.
• Events A and B are said to be disjoint if A∩ B = Ø.
That is, events A and B cannot occur simultaneously.
• The probability of a random event A is written P(A).
• Probabilities must satisfy the following axioms:
o P(A) ≥ 0
o P(Ω) = 1
o P(A or B) = P(A) + P(B) if A and B are disjoint.
• Illustrations, using the example S&P losses and gains:
o P(daily loss ≥ 3%) ≥ 0
o P(daily loss or gain or same) = 1
o P(daily loss ≥ 3% or daily gain ≥ 3%)
= P(daily loss ≥ 3%) + P(daily gain ≥ 3%)
• The axioms are direct consequences of the fact that probabilities are
limits of relative frequencies. The same properties hold for observed
relative frequencies:
o #(daily market losses ≥ 3% in last 1000 trading days)/1000 ≥ 0
o #(daily losses or gains or same in…)/1000 = 1
o #(daily losses ≥ 3%... or daily gains ≥ 3% in...)/1000
= #(daily losses ≥ 3% in...)/1000 + #(daily gains ≥ 3% in...)/1000
Properties of probabilities (Book: Sec. 4.3)
• Again, note that all of the following properties also hold for relative
frequencies.
• Complement rule:
P(Aᶜ) = 1 – P(A)
Proof: A and Aᶜ are disjoint and (A or Aᶜ) = Ω, hence P(A) + P(Aᶜ) = P(Ω) = 1
Illustration: P(daily loss ≥ 3%)
= 1 – P(daily loss < 3% or a daily gain or same)
• Monotonicity rule:
A contained in B => P(A) ≤ P(B)
Proof: B = A or (B w/o A), where A and (B w/o A) are disjoint. Hence P(B) =
P(A) + P(B w/o A) ≥ P(A) because P(B w/o A) ≥ 0.
Illustration: P(daily loss ≥ 5%) ≤ P(daily loss ≥ 3%)
• General addition rule: For any two random events A and B, not
necessarily disjoint, we have
P(A or B) = P(A) + P(B) – P(A and B)
Important: The conjunction “or” is always used in the inclusive sense, meaning
“A or B or both”, not “either A or B but not both”. In the inclusive sense, “or”
expresses set-theoretic union.
Proof: The intersection C = (A and B) is double-counted by P(A)+P(B).
Draw a Venn-diagram to make this clear. For a formal proof, write:
A = (A w/o B) or (A and B)
B = (B w/o A) or (A and B)
(A or B) = (A w/o B) or (A and B) or (B w/o A)
and note that the events on the right hand sides are all disjoint, hence their
probabilities add up.
Illustration: P(earthquake or hurricane)
= P(earthquake) + P(hurricane) – P(earthquake and hurricane)
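The general addition rule can also be checked by direct counting. A Python sketch on the fair-die sample space, with made-up events A = (even outcome) and B = (outcome ≥ 4):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}   # sample space of a fair die
A = {2, 4, 6}                # event: even outcome
B = {4, 5, 6}                # event: outcome >= 4

def prob(event):
    """Probability under the fair-die model: equally likely outcomes."""
    return Fraction(len(event), len(omega))

lhs = prob(A | B)                          # P(A or B), inclusive union
rhs = prob(A) + prob(B) - prob(A & B)      # general addition rule
print(lhs, rhs)   # 2/3 2/3
```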
• The summation axiom can be generalized to more than two disjoint
events. Here is the generalization to three:
P(A1 or A2 or A3) = P(A1) + P(A2) + P(A3)
if A1, A2, A3 are pairwise disjoint.
Example:
A1 = (S&P drops between 0 and 3% today)
A2 = (S&P stays same today)
A3 = (S&P rises between 0 and 3% today)
• Partitioning rule: If B1, B2, B3 partition Ω in the sense that
B1 or B2 or B3 = Ω and
(B1 and B2) = (B1 and B3) = (B2 and B3) = Ø ,
and if A is an arbitrary event, then
P(A) = P(A and B1) + P(A and B2) + P(A and B3)
This formula generalizes of course to partitions of Ω consisting of
more than three events.
Example: B1 = market drop, B2 = market unchanged, B3 = market rise
A = (GM rises by more than 1%)
Conditional probability (Book: p. 99, Sec. 4.4)
• Start with relative frequencies and pictures, a familiar mosaic plot:
[Mosaic plot in JMP: SURVIVED (yes/no) by CLASS (1st, 2nd, 3rd, crew); the
vertical axis from 0.00 to 1.00 shows the proportion surviving within each class.]
The skinny spine on the right shows in blue the overall relative
frequency of survival, across classes.
The spines of the mosaic, however, show the relative frequencies of
survival conditional on being in a given class.
Because concepts and properties of relative frequencies carry over to
their limits for infinitely many cases (Titanic with infinitely many
passengers, imagine!), the concept of conditional relative frequency
carries over to the concept of conditional probability.
P(A|B) := P(A and B) / P(B)
That is, restrict the event A (“survival”) to the subset B (“1st Class”)
by forming (A and B), and renormalize the probability to
P(A and B)/P(B), such that the event B has conditional probability 1:
P(B | B) = 1.
• Example: What is the conditional probability of an outcome of A =
{1} conditional on the outcomes B = {1,2,3}, assuming a fair die?
• Example: What is the conditional probability of an outcome of A =
{Ace} conditional on the outcome B = {spade}, assuming a well-
shuffled deck?
• Example: What is the conditional probability of a daily market drop of
over 10% given that the market drop is over 3%? (There are no numbers
given here; this is only a small conceptual point about two events where one
implies the other.)
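A quick way to work the first two examples, using the definition P(A|B) = P(A and B)/P(B) with exact fractions (a Python sketch):

```python
from fractions import Fraction

def cond_prob(p_a_and_b, p_b):
    """Conditional probability P(A|B) = P(A and B) / P(B)."""
    return p_a_and_b / p_b

# Fair die: A = {1}, B = {1,2,3}. A lies inside B, so P(A and B) = P(A) = 1/6.
p_die = cond_prob(Fraction(1, 6), Fraction(3, 6))
print(p_die)    # 1/3

# Well-shuffled deck: A = (ace), B = (spade). Only the ace of spades is in both.
p_card = cond_prob(Fraction(1, 52), Fraction(13, 52))
print(p_card)   # 1/13
```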
Ramifications of conditional probability (Book: Sec. 4.4, 4.5)
• Sometimes the conditional probability P(A|B) is given, and so is P(B).
We can then calculate P(A and B):
P(A and B) = P(A|B) · P(B)
Example: Assume if it rains today it rains tomorrow with probability
0.6. The probability of rain is 0.1 on any day. What is the probability
that it rains on any two consecutive days?
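A worked answer to the rain example, using P(A and B) = P(A|B) · P(B):

```python
p_rain_today = 0.1            # P(rain on any given day)
p_rain_given_rain = 0.6       # P(rain tomorrow | rain today)

# P(rain today and rain tomorrow) = P(A|B) * P(B) = 0.6 * 0.1
p_both_days = p_rain_given_rain * p_rain_today
print(round(p_both_days, 4))  # 0.06
```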
• In other applications, the conditional probability P(A|B) is given, and
so is P(A and B). Then one can calculate P(B):
P(B) = P(A and B) / P(A|B)
Example: If you know a plausible example, let me know…
• A famous application of conditional probability is Bayes’ rule. In its
simplest case, it allows us to infer P(B|A) from P(A|B), given also
P(A) and P(B). (Book: Sec. 4.5)
P(A|B) = P(B|A) · P(A) / P(B)
Example: The conditional probability of getting disease Y given gene
X is 0.8. The probability of someone having gene X is 0.0001, and
the probability of having disease Y is 0.04. What is the conditional
probability of having gene X given one has disease Y?
Instead of applying the above formula, it might be easier to go in steps
from P(disease Y | gene X) to P(gene X and disease Y) to P(gene X | disease Y):
P(gene X) = 0.0001
P(disease Y | gene X) = 0.8
P(gene X and disease Y) = 0.8 · 0.0001 = 0.00008
P(gene X | disease Y)
= P(gene X and disease Y) / P(disease Y)
= 0.00008 / 0.04 = 8/4000 = 0.002
Are you surprised? It seemed that if the gene causes the disease with
such high probability, having the disease should be a pretty good
indicator for the gene? What’s the hook?
Exercise: Change the numbers in the example. What happens if the
probability of having disease Y is 0.0001? What if it were 0.00001?
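The step-by-step calculation above, and the exercise variants, can be traced in a short Python sketch:

```python
def bayes_in_steps(p_gene, p_disease_given_gene, p_disease):
    """P(gene | disease) via P(gene and disease) = P(disease|gene) * P(gene)."""
    p_gene_and_disease = p_disease_given_gene * p_gene
    return p_gene_and_disease / p_disease

# Numbers from the example above:
print(round(bayes_in_steps(0.0001, 0.8, 0.04), 6))      # 0.002

# Exercise variant: P(disease Y) = 0.0001 raises the answer to 0.8.
print(round(bayes_in_steps(0.0001, 0.8, 0.0001), 6))    # 0.8
# With P(disease Y) = 0.00001 the formula gives 8 > 1: those inputs are
# inconsistent, since P(disease) cannot be below P(gene and disease) = 0.00008.
```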
• In the above Titanic example, one might think it should be possible
to calculate P(survival) from the mosaic plot, that is, from
the conditional probabilities P(survival|class). Actually, that is only
true if one is also given the probabilities of the classes, P(class). Then
it works as follows (strictly speaking for relative frequencies, not
probabilities):
P(survival) = P(survival | 1st class) · P( 1st class)
+ P(survival | 2nd class) · P( 2nd class)
+ P(survival | 3rd class) · P( 3rd class)
+ P(survival | crew) · P( crew)
Proof: The four terms on the right side equal
= P(survival and 1st class) + P(survival and 2nd class)
+ P(survival and 3rd class) + P(survival and crew)
These four events are disjoint and their union is simply “survival”. QED
Numerically, this works out as follows:
0.3230 = 0.6246·0.1477 + 0.4140·0.1295 + 0.2521·0.3208 + 0.2395·0.4021
The numbers can be gleaned from the table below the mosaic plot in JMP.
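The same sum can be reproduced in a couple of lines of Python (numbers as given above):

```python
# Conditional survival probabilities and class proportions from the text.
p_surv_given = {"1st": 0.6246, "2nd": 0.4140, "3rd": 0.2521, "crew": 0.2395}
p_class      = {"1st": 0.1477, "2nd": 0.1295, "3rd": 0.3208, "crew": 0.4021}

# Marginalization: P(survival) = sum over classes of P(survival|class)*P(class)
p_survival = sum(p_surv_given[c] * p_class[c] for c in p_class)
print(round(p_survival, 4))   # ≈ 0.323
```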
The above example can be restated in general terms as follows:
P(A) = P(A|B1) · P(B1) + P(A|B2) · P(B2)
+ P(A|B3) · P(B3) + P(A|B4) · P(B4)
For two conditioning events, which we illustrate with B = (1st class)
and Bᶜ = (not 1st class), and A = (survival), this simplifies as follows:
P(A) = P(A|B) · P(B) + P(A|Bᶜ) · P(Bᶜ)
These are called marginalization formulas because they turn
conditional probabilities of A into the marginal (= plain) probability
of A.
Here is an example from the Titanic with conditioning on sex:
[Mosaic plot in JMP: SURVIVED (yes/no) by SEX (female, male).]

Contingency table (JMP: SEX By SURVIVED; each cell lists Count, Total %,
Col %, Row %). Counts with row percentages in parentheses:

                    SURVIVED
            no              yes             Total
female       126 (26.81)     344 (73.19)      470 (21.35)
male        1364 (78.80)     367 (21.20)     1731 (78.65)
Total       1490 (67.70)     711 (32.30)     2201
It takes some effort to decode the JMP table (each cell stacks Count, Total %,
Col %, and Row %), but the row percentages and the sex proportions give the
necessary pieces:
P(survival) = 0.3230
= P(survived|female) • P(female) + P(survived|male) • P(male)
= 0.7319 • 0.2135 + 0.2120 • 0.7865
• Here is a hairy but typical legal application that involves both Bayes’
rule and the probability summation of the preceding bullet. It shows
how methods of detection with very small error probabilities can fail
to be conclusive (from http://en.wikipedia.org/wiki/Bayes_rule):
Assume a company is testing its employees for drugs, and assume the test
is 99% accurate; also assume half a percent of employees take drugs.
What is the probability of taking drugs given a positive test result?
A first problem is to make sense of the term “99% accurate”. We take it
to mean this:
P(pos | drug) = 0.99 true positives
P(neg | no drug) = 0.99 true negatives
We infer the following, which is useful later:
P(pos | no drug) = 0.01 false positives
P(neg | drug) = 0.01 false negatives
Then we know this:
P(drug) = 0.005 marginal prob of drug use
P(no drug) = 0.995 marginal prob of no drug use
The question is this:
P(drug | pos) = ? prob of drug use given a positive test
Solution:
P(drug | pos) = P(drug and pos) / P(pos)
The numerator is easy:
P(drug and pos) = P(pos | drug) · P(drug) = 0.99 · 0.005 = 0.00495
The denominator is messy and requires collecting the pieces with the
marginalization formula:
P(pos) = P(pos | drug) · P(drug) + P(pos | no drug) · P(no drug)
= 0.99 · 0.005 + 0.01 · 0.995
= 0.0149 marginal prob of a positive
Finally:
P(drug | pos) = P(drug and pos) / P(pos)
= 0.00495 / 0.0149
= 0.3322148
Now that’s a disappointment! The probability of catching a drug user with
a positive drug test is only 1 in 3!!! Two thirds of the positives are law-
abiding citizens!!!
Intuitively, what is going on? The gist is this: In order to catch a small
minority of half a percent (the drug users), the detection method must have
an error rate of much less than half a percent. The rate of false positives in
the example is one percent, hence out of the majority of 99.5% non-drug
users the test finds almost 1% false positives, which is large compared to
the 0.5% actual drug users. Sure, the 0.5% drug users get reliably
detected, but the true positives get swamped by the false positives…
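The whole drug-test calculation fits in a few lines; a Python sketch of the marginalization and Bayes steps above:

```python
p_drug = 0.005             # marginal prob of drug use
p_pos_given_drug = 0.99    # true positive rate
p_pos_given_clean = 0.01   # false positive rate

# Marginalization: P(pos) = P(pos|drug)P(drug) + P(pos|no drug)P(no drug)
p_pos = p_pos_given_drug * p_drug + p_pos_given_clean * (1 - p_drug)

# Bayes: P(drug|pos) = P(pos|drug)P(drug) / P(pos)
p_drug_given_pos = p_pos_given_drug * p_drug / p_pos
print(round(p_drug_given_pos, 4))   # ≈ 0.3322
```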
• Based on the above example, what do you think about a DNA test
whose error rate is one in a million when used in a murder case?
What would happen if the DNA test were administered to all adults in
this country?
Independent events (Book: p.113)
• Definition: Two events A and B are said to be independent if
P(A and B) = P(A) · P(B)
Note: This is a definition! Some pairs of events will be independent,
others won’t!
Examples: 1) A = (head now) and B = (head next) are usually
considered independent events, unless someone cheats.
2) Whether a rise in the S&P on any day is independent from a rise
the previous day cannot be known without doing data analysis. Most
likely one will find some deviation from independence, although it
might be small.
• Independence in terms of conditional probability:
If P(B) ≠ 0, then A and B are independent if and only if
P(A|B) = P(A)
This makes intuitive sense: Knowing that B occurred does not give us
any information about the frequency of A occurring.
• What would this mean for the relative frequencies of survival and 1st
class? It means that P(survival | 1st class ) = P(survival).
In other words, it doesn’t matter whether one was in 1st class or not;
the survival probability is the same. (This is only a thought
experiment, not the truth!)
• What does it mean for two flips of a coin?
P(head now | head before) = P(head)
That is, any coin flip is independent of the preceding coin flip. We
usually assume that coin flips are “independent”, that is, no one is
rigging the flip such that a head is more or less likely next time.
(The idea that observing many consecutive heads makes a tail more likely next
time is a typical form of magical thinking and applied by many people when
playing the lottery: “I’m getting close to winning the lottery any time now…”)
• Source of confusion: “Independent” is not the same as “disjoint”. In
fact, two disjoint events A and B with positive probability are never
independent:
P(A and B) = P(Ø) = 0 ≠ P(A) · P(B) > 0
Intuitively, knowing that B occurred gives a lot of knowledge about
the frequency of A occurring: zero!
Example: A = (spade), B = (heart).
• The notion of independence shows the advantage of using idealized
probabilities as opposed to empirical relative frequencies. In any
finite series of pairs of coin flips, the product condition will rarely be
satisfied. Yet in the limit, for infinitely many pairs of coin flips, the
product formula should hold.
• The notion of more than two independent events is a little messy: We
say A1, A2, …, An are independent if the product formula holds for all
subsets of the events, of all sizes. For three events we ask that all of
the following hold:
P(A1 and A2) = P(A1) · P(A2)
P(A1 and A3) = P(A1) · P(A3)
P(A2 and A3) = P(A2) · P(A3)
P(A1 and A2 and A3) = P(A1) · P(A2) · P(A3)
For n events, we’d have close to 2^n conditions (2^n – n – 1, to be exact),
the last one being:
P(A1 and A2 and … and An) = P(A1) · P(A2) · … · P(An)
Example: The probability of 10 heads in 10 coin flips is 0.5^10 = 1/1024 ≈ 0.001.
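The product formula for the 10-heads example, together with a Monte Carlo counterpart (a Python sketch):

```python
import random

random.seed(7)

# Product formula for 10 independent fair flips: P(10 heads) = 0.5**10
p_exact = 0.5 ** 10
print(p_exact)            # 0.0009765625  (= 1/1024)

# Monte Carlo check: fraction of 10-flip runs that come up all heads.
trials = 200_000
all_heads = sum(
    all(random.random() < 0.5 for _ in range(10))
    for _ in range(trials)
)
print(all_heads / trials) # close to 0.001
```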
Random Variables (Book Chap. 5)
• A random variable is a random outcome that is a number.
Random variables are written as X, Y, Z,…
(italicized letters from the end of the alphabet;
for random events we use roman letters from the beginning.)
• Compare:
o A random event is a random binary outcome , “yes” or “no”.
o A random variable is a random number outcome.
• Examples:
o The number of heads in 10 flips of a coin
o The number of Aces in a hand after dealing cards from a well-
shuffled deck
o The number of days with an S&P loss in a year
o The trading volume (number of shares traded) on the NY Stock
Exchange in a given day
o A daily return on Google stock
[return = (price today – price yesterday)/(price yesterday)
rt = (pt – pt-1) / pt-1 , often expressed as a percent: rt · 100%.]
o The number of takers of your company’s service offering after a
telemarketing campaign
o Income of any household in a random sample of households
o Average household income of a random sample of households
o GPA of any student in a random sample of students
o Average of GPAs across students in a random sample
o Daily temperature changes in a given location
• Characteristics:
Like random events, random variables incorporate the ideas of
unpredictable outcomes that are independently repeatable, at
least in principle.
• Connection between random variables and random events:
o Random variables can be used to define random events such as
A = (number of heads in 10 coin flips ≤ 5)
A = (today’s Google return ≤ – 0.03)
A = (today’s daily temperature change ≤ –5 °F)
o So-called dummy variables can be formed by assigning 1 and
0 to “yes” and “no” of a random event:
X = 1 if A = “yes”
0 if A = “no”
• Types of random variables:
o Discrete random variables: There are only finitely many
possible values that X can produce as outcomes, or else the
outcomes are integers.
o Continuous random variables: The possible outcomes can be
anywhere on the real number line or in an interval thereof.
Exercise: Classify the above examples according to discrete and continuous.
Expected Values of Random Variables (Book p.140ff)
• Motivation: If a random variable X is drawn N times, that is, if we
observe X1, X2, X3, X4,…, XN, then the mean is
(1/N) · ( X1 + X2 + X3 + … + XN ) .
We are now going to show that this mean has a limit as N → ∞.
To this end, consider a simplified way of calculating the mean,
assuming N is large and also assuming X takes on only few possible
values x1,…,xC.
For concreteness, keep in mind the example of N rolls of a die, so that the
possible values are 1,2,3,4,5,6.
Now we can bundle the observed outcomes X1, X2, X3, X4,…, XN by
simply counting the frequencies of the possible values:
n1 = #{Xi = x1 , i = 1,…,N }
n2 = #{Xi = x2 , i = 1,…,N }
…
nC = #{Xi = xC , i = 1,…,N }
For the rolls of a die, this would be the six frequencies of the values
1,2,3,4,5,6.
Note that n1 + n2 +…+ nC = N. We can rewrite the sum as follows:
X1 +X2 + X3 + … + XN = x1 · n1 + x2 · n2 + … + xC · nC
The relative frequencies are fi = ni / N and note that
(1/N) · ( X1 + X2 + X3 + … + XN ) = x1 · f1 + x2 · f2 + … + xC · fC
If you don’t grasp this formula, look at “Sim Dice and Coin Flips.JMP”,
do Tables > Sort > (select ‘Throws of Dice’) > OK. Then think about
how to calculate the mean simply: multiply 1, 2, …, 6 by how many times
you see 1s, 2s, …, 6s, add up, then divide by N=20. Well, ‘how many times’/
20 are just the relative frequencies f1, f2,…, f6. For example, last time I
checked, the numbers were ( 1· 3 + 2· 0 + 3· 7 + 4· 3 + 5· 3 + 6· 4 ) / 20 =
3.75, close to 3.5.
In the limit for N → ∞ the relative frequencies will go to the
respective probabilities: fi → pi (= 1/6 for rolling a fair die)
Hence, again in the limit, we have:
(1/N) · ( X1 + X2 + X3 + X4 + … + XN )
→ x1 · P(X=x1) + x2 · P(X=x2) + … + xC · P(X=xC)
This is called the law of large numbers. It is really just a
glorification of the convergence of relative frequencies to
probabilities: fi → pi. It also motivates the following…
• Definition of the expected value of a discrete random variable X:
E(X) = x1 · P(X=x1) + x2 · P(X=x2) + x3 · P(X=x3) + …
where x1, x2, x3,… are the possible outcomes of X.
This can in principle be an infinite sum if C=∞.
Note that each Ai = (X = xi) is a random event and hence is assumed
to have a probability.
(Expected values of continuous random variables will be discussed later.)
Alternate notation: the Greek letter μ if it is understood which
random variable is meant; if not, one may write μX instead.
E(X) = μ = μX
• Examples:
o X = 1 for (head) and 0 for (tail) in fair coin flipping:
E(X) = 1 · ½ + 0 · ½ = 0.5
o X = 1,…, 6 when throwing a fair die:
E(X) = 1 · 1/6 + 2 · 1/6 + … + 6 · 1/6 = 3.5
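The two examples, computed from the definition with exact fractions (a Python sketch):

```python
from fractions import Fraction

def expected_value(dist):
    """E(X) = sum of x * P(X=x) over the possible outcomes."""
    return sum(x * p for x, p in dist.items())

coin = {1: Fraction(1, 2), 0: Fraction(1, 2)}    # dummy variable for a fair coin
die  = {x: Fraction(1, 6) for x in range(1, 7)}  # fair die

print(expected_value(coin))   # 1/2
print(expected_value(die))    # 7/2
```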
• Interpretation:
o The expected value of X is the probability-weighted average of
its outcomes. (That’s just stating the formula in words.)
o The expected value of X is the long-run average of outcomes
of X when X is drawn repeatedly and independently.
In a simplifying way, we can think of the expected value as
“the mean if we had an infinite amount of data”.
Therefore the expected value is to the mean what the probability is to
the relative frequency.
• Special case: If we take a dummy random variable X of a random
event A (X = 1 when A is observed, 0 else), then: E(X) = P(A) .
• Experiment: Look at “Sim Dice and Coin Flips.JMP” again, which
has a formula for simulating the throw of 20 dice, or throwing a die 20
times. Do Distribution on this column to see what its mean is. It will
be close but not very close to 3.5. Then increase the number of rows
(Rows > Add Rows…), for example, to 100,000. Now the mean
will be close to 3.5. Note also how the bars of the histogram become
much more equal, illustrating the convergence of the relative
frequencies of 1, 2, …, 6 to the probabilities 1/6. ― By the way, the
same can be done for the flips of a coin where X=1 for head and 0 for
tail.
• Source of confusion: The term “expected value” is unfortunate. It
suggests that this is a value one expects to observe. But it is often the
case that the expected value is not among the possible outcome values
of the random variable. Example: X = outcome of a fair die. The
expected value is E(X) = 3.5, but the possible values are only the
integers from 1 to 6. Same for X = 1/0 for head/tail in a coin flip.
Properties of the Expected Value Operation
• Sums and multiples of random variables: In what follows, we need
to have a sense in which we can add two random variables and
multiply them with numbers. Let’s explain by example:
o Imagine a game in which we flip a coin and roll a die. We say
the total outcome is what the die shows, plus 1 if the coin
shows head. This can obviously be described as forming a new
random variable Z by adding up the outcomes of two other
random variables X (roll of the die) and Y (coin flip, head=1).
o A real-life example is the addition of stock holdings when
forming investment portfolios. One thinks of the prices as
random variables (actually: “random processes” because they
keep changing over time).
Multiplication of a random variable with a constant occurs when we
convert currency. For example, X = Google stock price in $, might
need to be converted to €, hence the converted random variable might
be something like Y = 0.75 · X.
• Linearity property of expected values under addition and
multiplication:
E( X + Y ) = E(X) + E(Y)
E( c X ) = c E(X)
Other simple properties follow from it:
E( X + a ) = E(X) + a
E( a X + b Y ) = a E(X) + b E(Y)
This last property alone is so powerful that it is equivalent to the first
two properties.
The math proof is just a grinding piece of algebra, starting with the definition
applied to E(X+Y) and E(cX), respectively.
The intuitions are simple, though: Forming the mean of a portfolio of two stocks
should be possible by adding the means of the two stocks. In addition, converting
a stock price series from $ to € and taking the mean should be the same as
converting the mean from $ to € .
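Linearity can be sanity-checked on simulated data; here with hypothetical weights a = 2 and b = 3 applied to a die X and a coin dummy Y (a Python sketch):

```python
import random

random.seed(5)

def mean(values):
    return sum(values) / len(values)

N = 500_000
X = [random.randint(1, 6) for _ in range(N)]   # die rolls, E(X) = 3.5
Y = [random.randint(0, 1) for _ in range(N)]   # coin dummies, E(Y) = 0.5

a, b = 2.0, 3.0
lhs = mean([a * x + b * y for x, y in zip(X, Y)])  # mean of a*X + b*Y
rhs = a * mean(X) + b * mean(Y)                    # a*E(X) + b*E(Y), estimated
print(lhs, rhs)   # equal, and both near 2*3.5 + 3*0.5 = 8.5
```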
Population Variance and Standard Deviation (Book p.142ff)
• If the mean has a counterpart for N → ∞, then so will the standard
deviation and the variance (= s²).
Definition: Abbreviating μ = E(X), one defines
V(X) = (x1–μ)² · P(X=x1) + (x2–μ)² · P(X=x2) + …
or, equivalently:
V(X) = E( (X – μ)² )
Either way, the variance is “the expected value of the squared
deviation from the expected value of X.” This is similar to the
variance for finite data, which was “the average squared deviation
from the mean.”
Alternate notation: Greek letter σ² = V(X)
Convention for any type of parameter (location, dispersion,…):
o For the “population” (N → ∞) one uses Greek letters such as σ².
o For the “sample” one uses Roman letters such as s².
• Limit: s² → σ² as N → ∞ , just as x̄ → µ .
This is another application of the “law of large numbers”.
• Properties of variances: V(cX) = c² V(X) and V(X+c) = V(X).
• The population standard deviation is defined as σ = V(X)^½ .
This is a measure of dispersion for the “population” (= infinite amount
of data):
σ(cX) = |c| σ(X) and σ(X+c) = σ(X).
Recall: If one stretches an object by a factor 2 (= c), its width grows
by a factor 2, and if one moves an object, its width stays the same.
• Examples: As usual, stylized coin tossing and rolling of dice are
most accessible.
o Biased coin: head → X=1, tail → X = 0, P(X=1) = p
E(X) = 1· p + 0· (1–p) = p
V(X) = (1–p)² · p + (0–p)² · (1–p) = p(1–p)
σ(X) = (p(1–p))^½
For p = ½ we get V(X) = ¼ and σ(X) = ½ .
o Fair die:
E(X) = 3.5
V(X) = (1–3.5)²/6 + (2–3.5)²/6 + (3–3.5)²/6 +
(4–3.5)²/6 + (5–3.5)²/6 + (6–3.5)²/6
= 2.916667
σ(X) = 1.707825 (Nothing pretty about these numbers.)
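The coin and die calculations, done from the definition with exact fractions (a Python sketch):

```python
from fractions import Fraction

def expected_value(dist):
    """E(X) = sum of x * P(X=x)."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """V(X) = E((X - mu)^2), the probability-weighted squared deviation."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

die = {x: Fraction(1, 6) for x in range(1, 7)}       # fair die
print(variance(die))                 # 35/12  (= 2.916667)
print(float(variance(die)) ** 0.5)   # ≈ 1.7078

p = Fraction(1, 2)                   # fair coin as the p = 1/2 biased coin
coin = {1: p, 0: 1 - p}
print(variance(coin))                # 1/4
```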
• JMP Experiments: These dry calculations can again be illustrated
with “Monte Carlo” or simulation experiments. The larger N, the
closer the estimates are to the theoretical population values. Again,
go to “Sim Dice and Coin Flips.JMP”, add for example 999,980 rows
to the existing 20 for a full million, then do Distribution on both
variables, and you’ll get not only means but standard deviations for
fair coin flips and throws of dice. You can edit the formula for the
coin flip column and replace ‘0.5’ with another value between 0 and 1
to see how an unfair/biased coin flips.
• Limit: s → σ as N → ∞ , which follows from the continuity of the
square root operation, since we already know this to be true for the variance.
• Importance of variance: We mentioned in Module 2 that the sample
variance is of importance in finance as a measure of volatility of
investments. Actually, the population variance is of even greater
importance because it lends itself to wonderful algebra for optimizing
portfolios (investment mixes) by maximizing return subject to a
bound on the volatility/variance ― the famous Markowitz Theory. If
we denote by R = w1·R1 + w2·R2 + w3·R3 the hypothetical return of a
stock portfolio that mixes three stocks with returns R1, R2, R3 in
proportions w1, w2, w3, the theory would try to maximize E(R)
subject to V(R) ≤ c by adjusting the proportions suitably.
Population Covariance and Correlation
• Motivation: We have seen that many statistically relevant quantities
computed from sampled data have limits if the sample size N goes to
infinity:
relative frequency → probability
sample mean → expected value (population mean)
sample variance → population variance
sample standard deviation → population standard deviation
The same holds for covariances and correlations computed from
samples: in the limit, they approach the population covariances and
population correlations. These will measure the strength of linear
association for an infinity of data.
• Bivariate probability: For population covariance to make sense, one
has to have a notion of joint probability P(X=xi, Y=yj). The reason is
that we will need to look at products such as X·Y, for example, where
the possible values are xi · yj. Their probabilities are obviously
P(X=xi, Y=yj). An expected value E(X·Y), for example, is then the
sum over all values xi · yj multiplied with their probabilities.
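A tiny numerical sketch (the joint probabilities below are made up): store P(X=xi, Y=yj) in a table and form the probability-weighted sum of the products xi · yj:

```python
# Hypothetical joint distribution: joint[(x, y)] = P(X=x, Y=y)
joint = {(0, 0): 0.4, (0, 1): 0.1,
         (1, 0): 0.2, (1, 1): 0.3}

# E(X*Y) = sum of x*y * P(X=x, Y=y) over all value combinations
E_XY = sum(x * y * p for (x, y), p in joint.items())
print(E_XY)  # only the (1, 1) cell contributes: 1*1*0.3 = 0.3
```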
• Population covariance: For two random variables X and Y that can
be observed together, abbreviate µX = E(X) and µY = E(Y) and
define
C(X,Y) = E((X–μX) · (Y–μY)) .
If we write this as a probability-weighted sum, it is a little messy
because we have to sum over all possible combinations of values
taken on by X and Y:
C(X,Y) = (x1–μX) · (y1–μY) P(X=x1, Y=y1) +
(x2–μX) · (y1–μY) P(X=x2, Y=y1) +
(x1–μX) · (y2–μY) P(X=x1, Y=y2) +
(x2–μX) · (y2–μY) P(X=x2, Y=y2) + …
It is important that the covariance is linear in both of its arguments:
C(X+Y,Z) = C(X,Z) + C(Y,Z)
C(aX,Y) = a C(X,Y)
C(X,Y+Z) = C(X,Y) + C(X,Z)
C(X,bY) = bC(X,Y)
These formulas are at the root of some simple but miraculous
calculations that will show us at what rate relative frequencies and
means approach their limits, the probabilities and expected values, as
N → ∞. In other words, these calculations will show how the
precision of relative frequencies and means increases as the sample
size increases.
For now note the following:
o C(X,X) = V(X)
o sample covariance → population covariance as N → ∞
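The “messy” probability-weighted sum can be written out for a small made-up joint distribution, and the identity C(X,X) = V(X) checked numerically:

```python
# Hypothetical joint distribution P(X=x, Y=y) over two 0/1 variables
joint = {(0, 0): 0.4, (0, 1): 0.1,
         (1, 0): 0.2, (1, 1): 0.3}

mu_X = sum(x * p for (x, y), p in joint.items())  # E(X) = 0.5
mu_Y = sum(y * p for (x, y), p in joint.items())  # E(Y) = 0.4

# C(X,Y) = E((X - muX)(Y - muY)) as a sum over all value combinations
C_XY = sum((x - mu_X) * (y - mu_Y) * p for (x, y), p in joint.items())

# C(X,X) = V(X): substitute X for Y in the definition
C_XX = sum((x - mu_X) * (x - mu_X) * p for (x, y), p in joint.items())
V_X = sum((x - mu_X) ** 2 * p for (x, y), p in joint.items())
print(C_XY, C_XX, V_X)  # C_XX equals V_X
```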
• Population Correlation:
c(X,Y) = C(X,Y) / ( σ(X) σ(Y) )
Everything we know about sample correlation carries over to the
population correlation:
–1 ≤ c(X,Y) ≤ +1
c(X,Y) = +1 if Y = aX + b with a > 0
c(X,Y) = –1 if Y = aX + b with a < 0
Again: sample correlation → population correlation as N → ∞
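The two boundary cases c = ±1 are easy to check numerically with a small made-up sample, using the sample versions of covariance and correlation (which converge to the population versions as N → ∞):

```python
import math

xs = [1, 2, 3, 4, 5]              # a small made-up sample
ys = [3 * x + 2 for x in xs]      # Y = aX + b with a = 3 > 0
zs = [-3 * x + 2 for x in xs]     # Y = aX + b with a = -3 < 0

def corr(u, v):
    # correlation = covariance divided by the product of the SDs
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    c = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return c / (su * sv)

print(corr(xs, ys), corr(xs, zs))  # +1.0 and -1.0
```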
Continuous versus Discrete Probability
• So far we have worked with probability on finite sample spaces Ω, but
as the last example of portfolio applications shows, numeric outcomes
such as investment returns do not usually form a finite or otherwise
discrete set of numbers. Instead, it is more natural to think of them as
outcomes that can theoretically be any real number.
Real numbers: any number with arbitrarily many decimals, possibly
infinitely many decimals. Why infinitely many decimals? So we won’t
have problems exponentiating, taking logarithms, powers, roots,….
• This thought makes it desirable to introduce continuous random
variables. A problem we encounter quickly is that any single
continuous numeric outcome should be thought of as having zero
probability, assuming we could observe it to infinite precision:
P(X = 0.0378736…) = 0, for example.
Only non-trivial intervals such as (0.0378… ≤ X ≤ 0.101009…) would
have non-zero probabilities. The idea is then to describe such
probabilities with integrals or areas under so-called density functions
f(x):
P(a ≤ X ≤ b) = ∫ f(x) dx, integrated from a to b
Required properties: f(x) ≥ 0 and ∫ f(x) dx = 1, integrated over
the whole real line (−∞ to +∞).
Remember: The integral is the area under the curve from the lower to
the upper limit.
Intuition: Think of density functions as idealized histograms obtained
when N → ∞ and the bin width goes to zero.
• Examples of density functions:
o The density function of a random variable that is ‘uniformly
distributed’ in an interval (c,d) is
f(x) = 1/(d−c) for c ≤ x ≤ d and f(x) = 0 otherwise.
For example, for a random variable that is uniformly distributed
in the unit interval (c=0, d=1), the density function is
f(x) = 1 for 0 ≤ x ≤ 1 and f(x) = 0 otherwise.
o The density function of a random variable with a ‘pyramid
distribution’ is
f(x) = min(1+x,1–x) for –1 ≤ x ≤ 1 and f(x) = 0
otherwise.
o The density function of an ‘exponentially distributed’ random
variable is
f(x) = e^(−x) for x ≥ 0 and f(x) = 0 otherwise.
o The density function of a (standard) normally distributed
random variable is
f(x) = (1/√(2π)) e^(−x²/2)
This is the famous “bellcurve”. There is some deep math
behind this function, and we will mention some later (“central
limit theorem”). The distracting factor 1/√(2π) is there to
make the integral equal to 1 over the whole real number line.
All four density functions are illustrated in the figure below. In all
four cases the total area below the curve is 1, as it should be.
[Figure: the four density functions (uniform, pyramid, exponential, standard normal)]
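A quick numerical check that each of the four densities integrates to 1, using a midpoint Riemann sum (the integration limits for the exponential and the bell curve are truncated where the tails are negligible):

```python
import math

def area(f, a, b, n=100_000):
    # midpoint Riemann sum approximating the area under f from a to b
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

def uniform(x): return 1.0 if 0 <= x <= 1 else 0.0
def pyramid(x): return min(1 + x, 1 - x) if -1 <= x <= 1 else 0.0
def expo(x): return math.exp(-x) if x >= 0 else 0.0
def bell(x): return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

areas = [area(uniform, 0, 1), area(pyramid, -1, 1),
         area(expo, 0, 40), area(bell, -8, 8)]
print([round(a, 4) for a in areas])  # each approximately 1
```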
JMP simulation of a uniformly distributed random variable: Go to the JMP
table Sim Uniform and Normal.JMP and look at the Distribution of the two
columns. You can add more rows and see the law of large numbers at work
again: The histograms will approach the shapes of the above curves. For
10,000 cases one sees a reasonably good likeness:
[Histograms of the simulated uniform column (scale 0 to 1) and normal column (scale −4 to 3)]
(One can tell JMP to plot the histograms horizontally with (red diamond)
> Histogram Options > Vertical to remove the check mark.)
Normally Distributed Random Variables (Book: Sec. 6.3)
• In nature and in the human world, many types of measurements and
quantitative observations are approximately normally distributed,
meaning their histograms look like approximations to the bellcurve, if
one allows it to be shifted and stretched. In order to shift and stretch
it, we first have to find out what its location and dispersion parameters
are because “shifting” means changing the location and “stretching”
means changing the dispersion. Miraculously, these are the expected
value (idealized mean) and population standard deviation:
E(X) = 0 and σ(X) = 1.
The first is rather obvious because the bellcurve is symmetric about zero,
but the second requires an integration by parts, which we will gladly skip.
Because of the zero mean and unit SD property, this is called the
“standard normal distribution”, as opposed to the following:
• We now need normally distributed random variables with arbitrary
expected values and population standard deviations. Here is a trick
for stretching and shifting a random variable with zero population
mean (=expected value) and unit population standard deviation to
another random variable that has population mean µ and population
SD σ:
Y = σX+µ
Following the rules of linearity and the rules for variances, we find
E(Y) = µ and V(Y) = σ².
Conversely, if we are given a random variable Y with these
properties, then
Z = (Y − µ) / σ
will have zero population mean and unit population SD. There is a
reason for calling this new random variable ‘Z’: It is what we may
call the “z-score random variable” because it does for N → ∞ what
forming z-scores does for columns of datasets. (Book p.147.)
Definition: A random variable Y is said to be normally distributed
with population mean µ and population SD σ if its z-score Z
variable is normally distributed with population mean 0 and
population SD 1. One writes shorthand Y ~ N(µ, σ²).
Obviously, Z ~ N(0,1).
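The stretch-and-shift trick and its z-score inverse can be checked by simulation; this is a sketch where N and the parameters µ, σ are arbitrary:

```python
import random
import statistics

random.seed(2)
mu, sigma = 10.0, 3.0
Z = [random.gauss(0, 1) for _ in range(100_000)]  # standard normal draws
Y = [sigma * z + mu for z in Z]                   # Y = sigma * X + mu

# Y should have mean ~ mu and SD ~ sigma
print(statistics.mean(Y), statistics.stdev(Y))

# z-scoring Y recovers mean ~ 0 and SD ~ 1
Z_back = [(y - mu) / sigma for y in Y]
print(statistics.mean(Z_back), statistics.stdev(Z_back))
```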
The formula for the density function of Y is not very insightful, but here is:
( y − µ )2
1 −
2σ 2
f(y) = e
2π σ
• The empirical rule for the normal distribution: If Y ~ N(µ, σ²),
o the probability of observing a value from Y in the interval
(µ–σ, µ+σ) is 0.6826895 or about 2/3.
“About 2 out of 3 observations fall within one SD of the mean.”
o the probability of observing a value from Y in the interval
(µ–2σ, µ+2σ) is 0.9544997 or about 95%.
“About 19 out of 20 observations fall within two SDs of the mean.”
o the probability of observing a value from Y in the interval
(µ–3σ, µ+3σ) is 0.9973002 or about 99.7%.
“About 997 out of 1000 observations fall within three SDs of the mean.”
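The three probabilities of the empirical rule can be reproduced by simulation; a sketch in which µ, σ, and N are arbitrary:

```python
import random

random.seed(3)
mu, sigma = 5.0, 2.0
Y = [random.gauss(mu, sigma) for _ in range(100_000)]

# Fraction of observations within k SDs of the mean, for k = 1, 2, 3
fracs = [sum(mu - k * sigma <= y <= mu + k * sigma for y in Y) / len(Y)
         for k in (1, 2, 3)]
print(fracs)  # approximately [0.683, 0.954, 0.997]
```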
Now we have a better concept of what “residuals in the order of RMSE” means:
If the RMSE is an estimate of the residual SD and if the residuals are
approximately normally distributed, we’d expect about 2/3 of the residuals within
±RMSE, about 95% within ±2RMSE, and “almost all” within ±3RMSE. Recall
that residuals always have zero mean. Using the empirical rule for residuals does
assume, however, that the residuals are approximately normally distributed, which
is not always the case.
How do we ever know whether a batch of numbers is approximately normally
distributed? Good question. A natural suggestion: look at the histogram. There
is a better graphical method, though, and we will look into it next. The normal
distribution is practically and theoretically so important that people devised a
more sensitive method than the histogram.
• The normal quantile plot or normal Q-Q plot: The idea is to
compare the quantiles of a column of numbers with the quantiles of a
theoretical normal distribution. (Book: p. 205ff)
Yes, not only means and SDs but quantiles as well exist for “populations”.
Again, population quantiles are the limits of sample quantiles as N → ∞.
The theoretical 30% quantile of the standard normal distribution, for
example, is –0.5244005… It has the property that the area under the
bellcurve from –∞ to –0.5244005… is 0.30.
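In Python, the standard library’s statistics.NormalDist provides these theoretical quantiles as the inverse of the cumulative area under the bell curve:

```python
from statistics import NormalDist

std = NormalDist(mu=0, sigma=1)
q30 = std.inv_cdf(0.30)  # theoretical 30% quantile of the standard normal
print(q30)               # approximately -0.5244

# Check: the area under the bell curve from -infinity to q30 is 0.30
print(std.cdf(q30))
```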
Here are examples of normal Q-Q plots generated in JMP:
[Figure: Normal Quantile Plot of simulated standard normal data]
This is a fine plot. If the data are approximately normally distributed,
the black dots remain within the “double funnel” of dashed red curves.
If the dots go outside, we have evidence against the assumption of
approximately normally distributed data. An example: the prices of
the Honda Accord data.
[Figure: Normal Quantile Plot of the Honda Accord prices (0 to 20,000)]
The skewness is an obvious violation because normally distributed
data are approximately symmetrically distributed about the mean.
Here is another example which fails normality due to discreteness:
100 rolls of a fair die. The six possible values cause plateaux in the
normal quantile plot that reach outside the double funnel. You can
create this plot by adding 80 rows to “Sim Dice and Coin Flips.JMP”.
[Figure: Normal Quantile Plot of 100 rolls of a fair die (values 1 to 6)]
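The coordinates a normal Q-Q plot draws can be sketched directly: sort the data and pair each point with a theoretical normal quantile at the plotting position (i − 0.5)/n. This is one common convention; JMP’s exact convention may differ slightly.

```python
import random
from statistics import NormalDist

random.seed(4)
data = sorted(random.gauss(0, 1) for _ in range(200))  # simulated normal data

n = len(data)
# Theoretical quantiles at plotting positions (i - 0.5)/n, i = 1..n
theo = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
pairs = list(zip(theo, data))  # plot these points to get the Q-Q plot
print(pairs[:3])

# For approximately normal data the points hug the line y = x
```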
JMP: You can add the Q-Q plots by clicking the red diamond of the
histogram and selecting “Normal Quantile Plot”, which is the third item
down. Homework 5 will ask you to set the JMP preferences such that a
normal Q-Q plot is provided by default whenever you do Distribution of a
quantitative variable.