ST318
Probability Theory
Keegan Kang
Spring 2013
Second Edition
Contents
0 Measures 3
1 Axiomatic Probability Theory 10
2 Independence 12
3 Tail σ−algebra and Kolmogorov’s 0 − 1 law 16
4 Integration 24
5 Expectations 27
6 Inequalities 29
7 Convergence of Random Variables 35
8 Characteristic Functions and the Central Limit Theorem 43
9 Conditional Expectation & Martingales 46
10 Filtrations, martingales and stopping times 50
Notes on the First Edition
Thanks to Pierre Tai and Nico Prokop for pointing out the many typos within.
Keegan Kang
Notes on the Second Edition
These notes were written for the 2010-2011 course, so might not be directly relevant to
our course. There have been changes made since the first edition, but these are almost
exclusively cosmetic.
Iain Carson
0 Measures
Definition 0.1 – σ-algebra
F is a σ−algebra if it satisfies the following properties:
• Ω ∈ F
• if A ∈ F, then Aᶜ ∈ F
• if {Ai}i≥0 ⊆ F, then ⋃_{i≥0} Ai ∈ F
If we have F, then (Ω, F) is a measurable space.
Example 0.1 – Examples of σ−algebras on a set Ω
• smallest σ−algebra: {∅, Ω}
• largest σ−algebra: the power set 2^Ω
It is also possible to generate other σ−algebras on Ω.
Take a subset A of Ω, i.e. A ⊆ Ω. We know A ∈ 2^Ω.
We look at σ({A}), which is the smallest σ−algebra generated by A:
σ({A}) = ⋂_{F⊇{A}} F = {∅, Ω, A, Aᶜ}     (0.1)
(the intersection runs over σ−algebras F containing {A}; it is over a non-empty family because 2^Ω is such a σ−algebra)
To say that (0.1) is the smallest σ−algebra generated by A, we need to check that:
• (0.1) fulfills the axioms of a σ−algebra
• (0.1) is contained in every σ−algebra containing A
which are trivial.
Therefore, we can take any arbitrary collection C of subsets with C ⊆ 2^Ω to generate a σ−algebra, and σ(C) = ⋂_{F⊇C} F, where the intersection runs over all σ−algebras F containing C.
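On a finite Ω this generation procedure can be carried out mechanically: start from C and repeatedly close under complements and unions until nothing new appears. The sketch below (plain Python; sets are encoded as frozensets, and all names are illustrative rather than taken from these notes) computes σ(C) this way.

from itertools import combinations

def generate_sigma_algebra(omega, collection):
    """Close `collection` under complement and union until stable.
    On a finite omega this yields sigma(collection)."""
    omega = frozenset(omega)
    family = {frozenset(), omega} | {frozenset(s) for s in collection}
    changed = True
    while changed:
        changed = False
        for a in list(family):
            comp = omega - a
            if comp not in family:
                family.add(comp); changed = True
        for a, b in combinations(list(family), 2):
            u = a | b
            if u not in family:
                family.add(u); changed = True
    return family

# sigma({A}) on Omega = {1,2,3,4} with A = {1,2}: expect {emptyset, Omega, A, A complement}
print(sorted(map(sorted, generate_sigma_algebra({1, 2, 3, 4}, [{1, 2}]))))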
Definition 0.2 – Borel σ−algebra (for R)
The Borel σ−algebra is the smallest σ−algebra containing all open sets of the topological space. So when Ω = R, B(R) is the smallest σ−algebra generated by the open intervals in R:
B(R) = σ(J : J open interval in R) = σ((−∞, x] : x ∈ Q)
Consider σ({m} : m ∈ Q). Is σ({m} : m ∈ Q) = B(R)? No, it is not.
Proof. We know that B(R) contains uncountable sets whose complements are also uncountable (for example, bounded intervals of positive length).
So if we can show that every set in σ({m} : m ∈ Q) is countable, or has a countable complement, then σ({m} : m ∈ Q) ≠ B(R).
But sets of rational points are countable. We can construct a bijection from the set of all rationals to a subset of N.
Define f : Q → N as follows.
• For each q ∈ Q⁺, write q = m/n where m, n ∈ Z, m, n > 0, hcf(m, n) = 1.
• For each q ∈ Q⁻, write q = m/n where m, n ∈ Z, m < 0, n > 0, hcf(|m|, n) = 1.
Write:
f(q) = 2^m · 3^n if q > 0,   5^|m| · 7^n if q < 0,   1 if q = 0.
This is an injection from Q into N, so there exists a bijection between Q and a particular subset of N; hence Q is countable.
Since every set in σ({m} : m ∈ Q) is countable or has countable complement, we conclude that σ({m} : m ∈ Q) ≠ B(R).
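The injection above is easy to test numerically. A short sketch (Python; names illustrative) encodes rationals in lowest terms and checks that distinct rationals receive distinct codes:

from fractions import Fraction

def encode(q: Fraction) -> int:
    """Injection Q -> N from the text: 2^m 3^n (q>0), 5^|m| 7^n (q<0), 1 (q=0)."""
    if q == 0:
        return 1
    m, n = q.numerator, q.denominator   # Fraction stores lowest terms, n > 0
    return 2**m * 3**n if q > 0 else 5**(-m) * 7**n

qs = {Fraction(p, r) for p in range(-6, 7) for r in range(1, 7)}
codes = [encode(q) for q in qs]
assert len(codes) == len(set(codes))    # distinct rationals -> distinct naturals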
We also want to show that B(R) = σ((−∞, x], x ∈ Q).
Proof. To show that B(R) = σ((−∞, x], x ∈ Q), we need to show that:
F := σ((−∞, x], x ∈ Q) ⊆ B(R)     (†)
F := σ((−∞, x], x ∈ Q) ⊇ B(R)     (††)
(†)
It is enough to show that (−∞, x] ∈ B(R) ∀ x ∈ Q.
This is true since (−∞, x]ᶜ = (x, ∞) is open, so (−∞, x] ∈ B(R) as B(R) is closed under complements; hence F ⊆ B(R).
(††)
It is enough to show that F contains J for every J = (a, b) ⊆ R, because B(R) is the smallest σ−algebra containing all open intervals, so any σ−algebra containing all open intervals contains B(R).
(a, b) ∈ F ⇔ R \ (a, b) ∈ F ⇔ (−∞, a] ∪ [b, ∞) ∈ F
Write ♣ = (−∞, a] and ♥ = [b, ∞). We just need to show that ♣ ∈ F and ♥ ∈ F.
It is obvious that ♣ ∈ F if a ∈ Q. Otherwise, we construct a decreasing sequence of rationals ai which tends to a, i.e. ai ↓ a, and write ♣ = (−∞, a] = ⋂_{i∈N} (−∞, ai] ∈ F. Hence (−∞, a] ∈ F for all a ∈ R.
We now consider ♥. We have ♥ = ⋂_{i∈N} (−∞, b − 1/i]ᶜ ∈ F if b ∈ Q. Otherwise, we construct a decreasing sequence of rationals bi which tends to b, i.e. bi ↓ b, and write:
♥ = [b, ∞) = ⋂_{n∈N} ⋃_{i∈N} (−∞, bi − 1/n]ᶜ ∈ F
Hence [b, ∞) ∈ F for all b ∈ R.
We have thus shown that (a, b) ∈ F, so (††) holds.
Hence B(R) = σ((−∞, x], x ∈ Q).
There is a fundamental question: When are two measures equal?
Let (Ω, F) be a measurable space and let µ, ν be two measures on (Ω, F). When does the
equality hold? i.e. When does µ(F) = ν(F) ∀ F ∈ F?
Definition 0.3 – d-system
Let Ω be a set. A collection of subsets D ⊆ 2^Ω is a d-system if:
i) Ω ∈ D
ii) If A ⊆ B and A, B ∈ D, then B − A ∈ D
iii) If Am ∈ D ∀ m ∈ N and Am ⊆ Am+1 ∀ m ∈ N, then ⋃_{m∈N} Am ∈ D
Remarks:
i) In literature, the d−system is also called a Dynkin system, or a λ−system.
ii) Every σ−algebra is a d−system.
iii) If µ, ν are finite measures on (Ω, F) such that µ(Ω) = ν(Ω), then
D = {F ∈ F | µ(F) = ν(F)}
is a d−system.
iv) For any collection I ⊆ 2^Ω, the smallest d−system d(I) containing I is given by
d(I) = ⋂_{D⊇I, D a d−system} D
Proof. Proof of ii):
Axiom i) follows by definition.
Axiom ii) is satisfied, since B − A = B ∩ Aᶜ and is thus in the σ−algebra.
Axiom iii) is satisfied: let Bn = A1 ∪ . . . ∪ An ∀ n ∈ N. Then Bm ⊆ Bm+1 ∀ m ∈ N, and ⋃_{m∈N} Bm = ⋃_{m∈N} Am ∈ D.
Proof of iii):
Axiom i) follows since Ω ∈ F by definition of σ−algebra, and µ(Ω) = ν(Ω) by assumption, so Ω ∈ D.
To prove Axiom ii), we need to show that A ⊆ B, A, B ∈ D ⇒ µ(B − A) = ν(B − A).
Rewrite µ(B − A) as µ(B) − µ(A) and similarly ν(B − A) as ν(B) − ν(A). We can do so since
• these are finite measures, hence µ(A), µ(B) are finite
• A ⊆ B, therefore µ(B − A) = µ(B) − µ(A) and ν(B − A) = ν(B) − ν(A)
But then we know µ(A) = ν(A) and µ(B) = ν(B). So Axiom ii) holds.
To prove Axiom iii), we need to show that:
Am ∈ D ∀ m ∈ N and Am ⊆ Am+1 ∀ m ∈ N ⇒ µ(⋃_{m∈N} Am) = ν(⋃_{m∈N} Am)
By continuity of measures, we can write:
µ(⋃_{m∈N} Am) = lim_{m↑∞} µ(Am)   and   ν(⋃_{m∈N} Am) = lim_{m↑∞} ν(Am)
But then we know lim_{m↑∞} µ(Am) = lim_{m↑∞} ν(Am). So Axiom iii) holds.
Proof of iv) (to show that d(I) = ⋂_{D⊇I, D a d−system} D is a d−system):
We need to show that the intersection is over a non-empty family, that d(I) is the smallest d−system containing I, and that it satisfies the axioms of a d−system.
The family is non-empty since 2^Ω is a d−system containing I. By construction, every d−system containing I contains d(I), so once we know d(I) is itself a d−system it is the smallest one.
To show d(I) satisfies the axioms of a d−system, let (Dk) be the family of d−systems containing I, so that d(I) = ⋂_k Dk.
Axiom 1: Ω ∈ Dk ∀ k ⇒ Ω ∈ ⋂_k Dk.
Axiom 2: Suppose A, B ∈ ⋂_k Dk with A ⊆ B. Then A, B ∈ Dk ∀ k, hence (B − A) ∈ Dk ∀ k, and thus (B − A) ∈ ⋂_k Dk.
Axiom 3: Suppose A1, A2, . . . ∈ ⋂_k Dk with Am ⊆ Am+1 ∀ m ∈ N. Then Am ∈ Dk ∀ k, so ⋃_m Am ∈ Dk ∀ k, and thus ⋃_m Am ∈ ⋂_k Dk.
So d(I) satisfies the axioms of a d−system.
Definition 0.4 – π−system
Let I ⊆ 2^Ω. Then I is a π−system if A, B ∈ I ⇒ A ∩ B ∈ I.
Example 0.2 – Examples of π−systems on R.
Consider the set R, and take I1 = {(−∞, x] : x ∈ R}. This is a π−system.
I2 = {(−∞, x] : x ∈ Q} is a π−system as well.
Proof. Suppose we have two sets A and B, with A, B ∈ I1. Without loss of generality, take
a ≤ b. Then:
A = (−∞, a]
B = (−∞, b]
and therefore:
A ∩ B = (−∞, a]
So A ∩ B = A ∈ I1. The same holds for I2. So I1 and I2 are both π−systems.
The Borel σ−algebra on R is generated by the π−system I1 (or by I2). In other words, B(R) = σ(I1) = σ(I2).
Remark 0.1
A collection C ⊆ 2Ω
is a σ−algebra ⇔ C is a d−system and π−system.
Proof.
(⇒)
We have proved above (remark ii) after Definition 0.3) that every σ−algebra is a d−system. We need to show that a σ−algebra is a π−system as well, i.e. that ∀ A, B ∈ F, A ∩ B ∈ F.
Take A, B ∈ F. Then:
A, B ∈ F ⇒ Aᶜ, Bᶜ ∈ F   by Axiom 2 of σ−algebra
        ⇒ Aᶜ ∪ Bᶜ ∈ F   by Axiom 3 of σ−algebra
        ⇒ (Aᶜ ∪ Bᶜ)ᶜ ∈ F   by Axiom 2 of σ−algebra
        ⇒ A ∩ B ∈ F   by De Morgan's Laws
(⇐)
We then need to show that the definitions of a π−system and a d−system fulfill the axioms
of a σ−algebra.
Axiom 1 is satisfied due to axiom 1 of the d−system, i.e. Ω ∈ D, so Ω ∈ C.
Axiom 2: Choose A ∈ C. Since A ⊆ Ω and Ω, A ∈ C, axiom 2 of the d−system gives Ω − A = Aᶜ ∈ C. So A ∈ C ⇒ Aᶜ ∈ C.
Axiom 3: Take A1, A2 ∈ C. We wish to show A1 ∪ A2 ∈ C.
We have proven that Axiom 2 of a σ−algebra is satisfied, so A1ᶜ, A2ᶜ ∈ C. By definition of a π−system, A1ᶜ ∩ A2ᶜ ∈ C. Again, by using Axiom 2, we have (A1ᶜ ∩ A2ᶜ)ᶜ = A1 ∪ A2 ∈ C.
By induction, every finite union A1 ∪ . . . ∪ An ∈ C; these finite unions increase to ⋃_{i∈N} Ai, so axiom iii) of the d−system gives ⋃_{i∈N} Ai ∈ C. Hence Axiom 3 is satisfied.
Therefore if C is both a d−system and a π−system, then C is a σ−algebra.
Theorem 0.1 – Monotone Class Theorem for Sets
If I ⊆ 2Ω
is a π−system, then d(I) = σ(I). In other words, the smallest d−system generated
by I coincides with the σ−algebra σ(I) generated by I.
Proof. Need to show that:
• d(I) ⊆ σ(I)
• d(I) ⊇ σ(I)
To show d(I) ⊆ σ(I):
We have proven that every σ−algebra is a d−system. So it follows that:
d(I) = ⋂_{D⊇I, D a d−system} D ⊆ ⋂_{F⊇I, F a σ−algebra} F = σ(I)
Hence d(I) ⊆ σ(I).
To show (d(I) ⊇ σ(I)):
To prove this, we note Remark 0.1 and show that d(I) is a d−system and a π−system.
Then d(I) would be a σ−algebra, and d(I) ⊇ σ(I).
Define the family of sets:
D1 = {B ∈ d(I) | B ∩ C ∈ d(I) ∀ C ∈ I}
We wish to show that D1 is a d−system, and is in fact d(I).
First note that I ⊆ D1 since B ∈ I ⊆ d(I) ⇒ B ∩C ∈ d(I) ∀ C ∈ I (this is how we defined
our D1).
Secondly, we show that D1 satisfies the axioms of a d−system.
Axiom 1: ∀ C ∈ I, Ω ∩ C = C ∈ d(I), hence Ω ∈ D1.
Axiom 2: Consider the equality (B − A) ∩ C = (B ∩ C) − (A ∩ C) which holds for every set
A, B, C, given that A ⊆ B.
Pick A, B ∈ D1 which satisfies A ⊆ B, and we want to show that B − A ∈ D1.
Since A, B ∈ d(I), which is a d−system, then B − A ∈ d(I). It suffices to check if (B −
A) ∩ C ∈ d(I) ∀ C ∈ I.
Since A, B ∈ D1, both A ∩ C and B ∩ C lie in d(I), and (A ∩ C) ⊆ (B ∩ C); using the above equality and axiom ii) of the d−system, (B − A) ∩ C = (B ∩ C) − (A ∩ C) ∈ d(I), and therefore (B − A) ∈ D1.
Axiom 3: Given Am ∈ D1 such that Am ⊆ Am+1 ∀ m ∈ N, we wish to show that (⋃_m Am) ∩ C ∈ d(I) ∀ C ∈ I.
Note that Am ∩ C ∈ d(I) ∀ m ∈ N and (Am ∩ C) ⊆ (Am+1 ∩ C) ∀ m, so by axiom iii) of the d−system, (⋃_m Am) ∩ C = ⋃_m (Am ∩ C) ∈ d(I) ∀ C ∈ I.
Now, we have satisfied the axioms for D1 to be a d−system, and since I ⊆ D1, we can write:
I ⊆ D1 ⊆ d(I) ⇒ D1 = d(I) (0.2)
since D1 contains I, and d(I) is the smallest d−system containing I.
Now consider the family of sets:
D2 = {B ∈ d(I) | B ∩ C ∈ d(I) ∀ C ∈ d(I)}
We want to show that I ⊆ D2: but for B ∈ I and C ∈ d(I) = D1, we have B ∩ C ∈ d(I) by (0.2), so every B ∈ I lies in D2.
D2 is also a d−system (similar proof to above, and using the same inequality), and therefore,
we can also write:
I ⊆ D2 ⊆ d(I) ⇒ D2 = d(I)
This implies that d(I) is a π−system.
We have thus shown that d(I) is both a π−system and a d−system, and therefore a
σ−algebra, hence d(I) ⊇ σ(I).
Therefore we have shown: d(I) ⊆ σ(I) and d(I) ⊇ σ(I) and hence d(I) = σ(I).
We can use the Monotone class theorem (Theorem 0.1) to state certain relations between
measures.
Proposition 0.1
1) Let µ, ν be two measures on a measurable space (Ω, F) such that µ(Ω) = ν(Ω) < ∞. If
µ(C) = ν(C) ∀ C ∈ I where I is a π−system in F, then µ and ν coincide on the smallest
σ−algebra generated by I, i.e. σ(I).
2) Any two probability measures that agree on a π−system must agree on the σ−algebra
generated by this π−system.
Proof. (of part 1)
We define the set:
D = {A ∈ F | µ(A) = ν(A)}.
This is a d−system (by the same proof as for part iii) of the remarks above). Since
µ(C) = ν(C) ∀ C ∈ I ⇒ I ⊆ D,
the σ−algebra generated by I satisfies σ(I) = d(I) ⊆ D by the Monotone class theorem (Theorem 0.1), so µ and ν agree on σ(I).
Example 0.3
Let P and P′ be two probability measures on (R, B(R)). If:
P(−∞, x] = P′(−∞, x] ∀ x ∈ R (or ∀ x ∈ Q),
then P and P′ coincide on B(R). This follows by Proposition 0.1 and the fact that
B(R) = σ({(−∞, x] | x ∈ R}).
Hence the cumulative distribution function of a probability measure P on (R, B(R)),
F : R → [0, 1], F(x) = P(−∞, x],
uniquely determines the measure P.
1 Axiomatic Probability Theory
We know from previous Probability courses that Ω is our sample space, i.e. all outcomes of
a random experiment.
Example 1.1
i) Toss a coin twice. Then Ω = {HT, TH, HH, TT}.
ii) An infinite sequence of coin tosses. Then Ω = {ω : N → {T, H}}. Here, |Ω| = ∞, and is
uncountable.
Proof. Proof that Example 1.1 ii) is uncountable.
We attempt a proof by contradiction. Suppose that Ω is countable (infinite). Then we can
enumerate out all possible ω. But if we find an ω not in the list, we get a contradiction, and
hence Ω is uncountable.
Note: Ω = {ω : N → {T, H}} is simply the set of all sequences ω of tosses. So, assume we have enumerated all our ω, say:
ω1 = ω11ω12ω13ω14 . . .
ω2 = ω21ω22ω23ω24 . . .
ω3 = ω31ω32ω33ω34 . . .
ω4 = ω41ω42ω43ω44 . . .
... =
...
with ωij = H or T.
So, we construct a sequence ω* say, such that ω*(i) ≠ ωii for i = 1, 2, 3, . . . (flip the i-th toss of the i-th listed sequence). Then ω* is not in the above list, and therefore Ω is uncountable.
Having defined ii) in Example 1.1, we wish to know the probability of the coin landing H (or T) on the i-th toss. We thus want:
(Ω, F) : F = σ({ω | ω(i) = H}, {ω | ω(i) = T} : i ∈ N)
We thus need a probability measure on F.
It is possible to embed Ω → [0, 1], where a Lebesgue measure has been constructed such that:
P[{ω | ω(i) = H}] = 1/2 ∀ i ∈ N
Definition 1.1 – Random Object / Variable / Vector
Given a probability space (Ω, F, P), a random object X in a measurable space (S, Σ) is a measurable function X : (Ω, F) → (S, Σ), i.e. the pre-image X⁻¹(Σ) ⊆ F.
If X takes values in R and X ∈ mB(R), then X is a random variable.
If X : Ω → Rⁿ and X ∈ mB(Rⁿ), then X is a random vector.
Recall: X ∈ mB(R) means that X is a measurable function with respect to B(R).
Example 1.2 – Defining a random variable
Recall part ii) of Example 1.1, where we defined Ω = {ω : N → {H, T}} and our σ−algebra
to be F = σ({ω(i) = T}, {ω(i) = H} : i ∈ N).
We can define our random variable to be:
Xi(ω) = 1 if ω(i) = H,   0 if ω(i) = T
Now Xi : Ω → {0, 1}, and it is a random variable.
Proof. (that Xi is a random variable)
Let F = {Ω, ∅, A, Aᶜ} where A is the event that Heads occurs at the i-th toss (so Aᶜ is the event that Tails occurs). Then Xi⁻¹({1}) = A and Xi⁻¹({0}) = Aᶜ, which are both in F.
We define a random variable Sn = Σ_{i=1}^n Xi, which is the number of heads obtained in n tosses. This is a random variable as well, since a sum of random variables is a random variable.
From the lecturer's Measure Theory notes, it follows that {lim_{n→∞} (1/n) Sn = p} ∈ F for any p ∈ [0, 1]. If p ∉ [0, 1], then this event is obviously ∅.
Definition 1.2 – Law of a random variable
Given a random variable X on (Ω, F, P), the law of X is the probability measure P_X on (R, B(R)) given by:
P_X[B] = P[X⁻¹(B)] ∀ B ∈ B(R)
It is enough to know P_X[(−∞, x]] = P[X ∈ (−∞, x]] ∀ x ∈ R (or ∀ x ∈ Q).
2 Independence
Definition 2.1 – Independence
Let (Ω, F, P) be a probability space and let Gi ⊆ F be sub−σ−algebras for i ∈ N. The family of sub−σ−algebras Gi, i ∈ N, is independent if for any choice of events Gi ∈ Gi, i ∈ N, and any family {i1, i2, . . . , ik} ⊆ N of distinct indices:
P[⋂_{j=1}^k Gij] = ∏_{j=1}^k P[Gij]     (2.1)
Remarks:
1. (2.1) has to hold for all finite subsets {i1, . . . , ik} ⊆ N.
2. If we have a finite family G1, . . . , Gn of sub−σ−algebras, condition (2.1) collapses to:
P[⋂_{i=1}^n Gi] = ∏_{i=1}^n P[Gi]
where Gi ∈ Gi for i = 1, 2, . . . , n.
3. Random variables X and Y are independent if and only if σ(X), σ(Y ) are independent.
We may wish to ask what is σ(X). While we know σ(X) = X−1
(B(R)), what does it
mean intuitively? σ(X) can be intuitively thought of as “information we can obtain
about the outcome of the random experiment by knowing X(ω), but not knowing ω”.
For random variables X, Y , σ(X) and σ(Y ) are independent if and only if:
P[X ∈ A, Y ∈ B] = P[X ∈ A] · P[Y ∈ B] ∀ A, B ∈ B(R)
4. Let E1, E2, . . . be events in F. They are independent if σ(E1), σ(E2), . . . are independent, where:
σ(E1) = {Ω, ∅, E1, E1ᶜ}
σ(E2) = {Ω, ∅, E2, E2ᶜ}
and so on.
Proof. Prove that E1, E2, . . . are independent if and only if:
P[⋂_{j=1}^m Eij] = ∏_{j=1}^m P[Eij] for every finite set {i1, i2, . . . , im} ⊆ N of distinct indices
(⇐)
This follows from Definition 2.1, showing that σ(E1), σ(E2), . . . are independent, but
remark 4 shows that this means E1, E2, . . . are independent.
(⇒) Exercise
5. Pairwise independence does not imply independence.
Example 2.1 – Example of above statement
Take Ω = {1, 2, 3, 4}, F = 2^Ω, A = {1, 2}, B = {1, 3}, C = {2, 3}, and define the probability measure P[{ω}] = 1/4 for ω ∈ Ω. Note that the pairs A, B and A, C and B, C are independent, since:
P[A ∩ B] = P[{1}] = 1/4 = P[A] · P[B] = (1/2) · (1/2)
P[A ∩ C] = P[{2}] = 1/4 = P[A] · P[C] = (1/2) · (1/2)
P[B ∩ C] = P[{3}] = 1/4 = P[B] · P[C] = (1/2) · (1/2)
However, A, B, C are not independent, since:
P[A ∩ B ∩ C] = 0 ≠ P[A] · P[B] · P[C] = 1/8
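This finite example can be checked mechanically. The sketch below (Python; all names illustrative) enumerates the uniform measure on Ω = {1, 2, 3, 4} and verifies pairwise but not mutual independence:

from itertools import combinations

omega = {1, 2, 3, 4}
P = lambda E: len(E & omega) / len(omega)   # uniform probability measure on omega

A, B, C = {1, 2}, {1, 3}, {2, 3}
events = {"A": A, "B": B, "C": C}

for (na, Ea), (nb, Eb) in combinations(events.items(), 2):
    assert P(Ea & Eb) == P(Ea) * P(Eb), (na, nb)   # pairwise independent

print(P(A & B & C), P(A) * P(B) * P(C))   # 0.0 vs 0.125: not mutually independent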
Theorem 2.1
See Probability with Martingales, Williams D.W. page 39.
Let (Ω, F, P) be a probability space and sub−σ−algebras H, G ⊆ F be generated by
π−systems I and J respectively. In other words:
σ(I) = H, σ(J ) = G
Then H and G are independent if and only if I and J are independent in the sense
A ∈ I, B ∈ J ⇒ P[A ∩ B] = P[A] · P[B]
Proof. Our goal is to prove:
P[H ∩ G] = P[H] · P[G] ∀ H ∈ H, G ∈ G (2.2)
Fix A ∈ I and consider the following two measures on F:
F → P[F ∩ A]
F → P[F] · P[A]
These measures have equal mass given by P[A].
By assumption, we have that the two measures coincide on J . Therefore by Proposi-
tion 0.1, we have
P[F ∩ A] = P[F] · P[A] ∀ F ∈ σ(J ) = G (2.3)
To show (2.2) , we define two measures. Fix G ∈ G and let
F → P[G ∩ F]
F → P[G] · P[F]
These two measures coincide on the π−system I by (2.3). Hence, as before, the two measures coincide on σ(I) = H.
Remarks:
1. Let X, Y be random variables on (Ω, F, P). Then:
X and Y are independent
⇔ P[X ∈ A, Y ∈ B] = P[X ∈ A] · P[Y ∈ B] ∀ A, B ∈ B(R) (by Theorem 2.1)
⇔ P[X ≤ x, Y ≤ y] = P[X ≤ x] · P[Y ≤ y] ∀ x, y ∈ R
We claim this since {(−∞, x] : x ∈ R} is a π−system in B(R) which generates
B(R). Hence {X ≤ x : x ∈ R} is a π−system in σ(X) which generates σ(X)
since σ(X) = X−1
(B(R)). So the π−systems π(X), π(Y ) are independent implies
σ(X), σ(Y ) independent.
2. Similarly, random variables X1, . . . , Xn are independent if and only if
P[Xi ≤ xi : 1 ≤ i ≤ n] = ∏_{i=1}^n P[Xi ≤ xi] ∀ xi ∈ R, i = 1, 2, . . . , n
3. If X is independent of Y and X is independent of Z, then it does not follow that X
is independent of (Y, Z).
Example 2.2
Let X = IA, Y = IB and Z = IC. Let A = {2, 3}, B = {1, 2}, C = {1, 3} be subsets in
Ω = {1, 2, 3, 4}, F = 2Ω
.
Recall that A, B independent, B, C independent. However X is not independent of
(Y, Z).
Definition 2.2 – Joint law (of random variables)
Let X, Y be random variables on (Ω, F, P) and let B(R²) be the Borel σ−algebra on R².
The joint law of X and Y is given by
P_(X,Y)[A] = P[(X, Y ) ∈ A] ∀ A ∈ B(R²)
Remarks:
1. Recall that:
B(R²) = B(R) ⊗ B(R)     (2.4)
Exercise: prove (2.4), i.e. that B(R²) = σ(U × V : U, V ∈ B(R)) = B(R) ⊗ B(R).
(2.4) implies that B(R²) is generated by the π−system {(−∞, x] × (−∞, y] : x, y ∈ R}.
Proposition 2.1
The following statements are equivalent:
a) X and Y are independent
b) P_(X,Y) = P_X ⊗ P_Y
c) Define F_XY(x, y) = P[X ≤ x, Y ≤ y] ∀ (x, y) ∈ R². Then F_XY(x, y) = F_X(x) F_Y(y).
Furthermore, if (X, Y ) has a density, i.e. there exists f_XY : R² → [0, ∞) such that:
P_(X,Y)[A] = ∫_A f_XY(x, y) dx ⊗ dy
then statements a), b), c) are further equivalent to:
d) f_XY(x, y) = f_X(x) f_Y(y) ∀ (x, y) ∈ R², where f_X, f_Y are the densities of X and Y respectively.
Remark:
Note in d) that the existence of f_XY implies the existence of the densities of the factors X and Y , and:
f_X(x) = ∫_R f_XY(x, y) dy,   f_Y(y) = ∫_R f_XY(x, y) dx
Proof. Based on B(R²) = B(R) ⊗ B(R). This implies that B(R²) is generated by the π−system {(−∞, x] × (−∞, y] : x, y ∈ R}. Apply Theorem 2.1.
3 Tail σ−algebra and Kolmogorov’s 0 − 1 law
Definition 3.1 – Tail σ−algebra
Let {Fn : n ∈ N} be a collection of σ−algebras. The tail σ−algebra T is given by:
T = ⋂_{n∈N} Tn,   where Tn = σ(Fn, Fn+1, . . .) = σ(⋃_{k≥n} Fk).
Remarks:
1. T is a σ−algebra that depends only on the tail events of a sequence of experiments, where the outcome of the n-th experiment is described by the σ−algebra Fn.
2. Note that the tail σ−algebra depends on the choice of {Fn : n ∈ N}.
Example 3.1
Let X1, X2, . . . be a sequence of random variables on (Ω, F, P) and define Fn := σ(Xn).
Then Tn = σ(Xn, Xn+1, . . .) ∀ n ∈ N and T = ⋂_{n∈N} Tn. We define the following events:
F1 = {∃ lim_{n→∞} Xn} = {ω ∈ Ω : lim_{n→∞} Xn(ω) exists}
F2 = {Σ_{n∈N} Xn exists}
F3 = {lim_{n→∞} (X1 + X2 + . . . + Xn)/n exists}
F4 = {Σ_{n∈N} Xn exists and Σ_{n∈N} Xn = 0}
Then F1, F2, F3 are contained within the tail σ−algebra of the sequence X1, X2, . . . .
Proof. It helps to think of tail events intuitively as those events whose occurrence or not is unaffected by altering any finite number of random variables in the sequence.
We claim that F1 ∈ T = ⋂_{n∈N} Tn. It is enough to show that F1 ∈ Tn ∀ n.
This is clear because the limit of a sequence Xk, k ∈ N only depends on (Xn+k)k∈N ∀ n ∈ N.
In other words, for a sequence to have a limit, we look at the tail of the sequence, i.e. we
can first discard the first finitely many terms.
Similarly, F2 ∈ Tn ∀ n ∈ N ⇒ F2 ∈ T .
To show F3 ∈ T is slightly trickier.
Let ξ = lim sup_{n→∞} Sn/n, where Sn = X1 + . . . + Xn.
We need to show the following:
i) ξ(ω) is well defined ∀ ω ∈ Ω and ξ ∈ mσ(X1, X2, . . .).
ii) ξ ∈ mTn ∀ n.
Consider i). We know that ξ(ω) exists in [−∞, ∞] since every sequence of real numbers has a lim sup (see Probability Theory Ex Sheet 1 Q1c).
Recall that ξ = inf_{k∈N} sup_{n≥k} Sn/n and hence:
{ξ ≥ a} = ⋂_{k∈N} {sup_{n≥k} Sn/n ≥ a} ∈ σ(Xi, i ∈ N)
which implies ξ ∈ mσ(Xi, i ∈ N). Here we used the fact that {(−∞, a], a ∈ R} is a π−system which generates B(R).
We now wish to prove ii).
Let S′k := Sn+k − Sn = Σ_{i=n+1}^{n+k} Xi, which is σ(Xn+1, . . . , Xn+k)-measurable. Then:
S′k/k = (Sn+k/(n + k)) · ((n + k)/k) − Sn/k
Letting k → ∞ with n fixed, the factor (n + k)/k → 1 and Sn/k → 0, hence
lim sup_{k→∞} S′k/k = lim sup_{k→∞} Sn+k/(n + k)
Therefore ξ = lim sup_{n→∞} Sn/n = lim sup_{k→∞} Sn+k/(n + k) = lim sup_{k→∞} S′k/k ∈ mTn ∀ n. The same argument applies to the lim inf, so F3 = {lim sup Sn/n = lim inf Sn/n ∈ R} ∈ Tn ∀ n, i.e. F3 ∈ T .
Now consider F4. We know that F2 ∈ T because F2 does not depend on the first finitely many terms.
F4 ∈ G, where G = σ(X1, X2, . . .), but F4 is not in general in the tail σ−algebra T . This is because the event F4 clearly depends on the value of X1 (and possibly on the first finitely many terms): changing X1 changes whether Σ_{n∈N} Xn = 0.
So F4 is not necessarily in T .
Theorem 3.1 – Kolmogorov’s 0-1 Law
Let {Fn : n ∈ N} be a sequence of independent sub−σ−algebras in (Ω, F, P). Then the tail σ−algebra T = ⋂_{n∈N} Tn, where Tn = σ(Fn+1, Fn+2, . . .), satisfies the following two properties:
i) ∀ F ∈ T , P[F] = 0 or P[F] = 1
ii) ∀ random variables ξ ∈ mT , ∃ c ∈ [−∞, ∞] such that P[ξ = c] = 1.
Proof. We start by proving i).
Define Hn := σ(F1, F2, . . . , Fn).
Step 1: We claim that Hn and Tn are independent.
Consider:
In = {⋂_{i=1}^n Fi : Fi ∈ Fi, i = 1, 2, . . . , n}
Jn = {⋂_{i=1}^l Fn+i : Fn+i ∈ Fn+i, i = 1, 2, . . . , l, l ∈ N}
Both are π−systems, as they are closed under intersection.
In generates Hn, since Fi ⊆ In ∀ i = 1, . . . , n and In ⊆ Hn.
Similarly, Jn is a π−system that generates Tn, since Jn ⊇ Fn+i ∀ i ∈ N and Jn ⊆ Tn.
For A ∈ In and B ∈ Jn we have P[A ∩ B] = P[A] · P[B], since {Fk : k ∈ N} are independent sub−σ−algebras. By Theorem 2.1, Hn and Tn are therefore independent.
Step 2: We claim Hn and T are independent.
Since T ⊆ Tn ∀ n ∈ N, T is independent of Hn ∀ n.
Step 3: We claim that T is independent of σ(⋃_{n∈N} Hn).
Since T is independent of Hn ∀ n, T is independent of ⋃_{n∈N} Hn, which by Theorem 2.1 implies that T is independent of σ(⋃_{n∈N} Hn). Here we have used the fact that ⋃_{n∈N} Hn is a π−system, since (Hn)n∈N is an increasing sequence of σ−algebras.
Step 4: We claim that T is independent of T .
Note that σ(⋃_{n∈N} Hn) = σ(Fi : i ∈ N) and hence T ⊆ σ(⋃_{n∈N} Hn).
So any F ∈ T ⊆ σ(⋃_{n∈N} Hn) is independent of itself. So:
P[F] = P[F ∩ F] = (P[F])²
But (P[F])² = P[F] with P[F] ∈ [0, 1] ⇒ P[F] = 0 or P[F] = 1.
We now prove part ii).
By part i), we have
P[ξ ≤ x] ∈ {0, 1} ∀ x ∈ R
Let c := sup{x : P[ξ ≤ x] = 0}, with the convention sup ∅ = −∞, so c is well defined in [−∞, ∞].
Then there are three cases.
If c = −∞, this implies that P[ξ ≤ x] = 1 ∀ x ∈ R ⇒ ξ = −∞ (P−a.s.).
Similarly, if c = +∞, this implies that P[ξ ≤ x] = 0 ∀ x ∈ R ⇒ ξ = +∞ (P−a.s.).
Suppose c ∈ R.
We then have P[ξ ≤ c − 1/n] = 0 ∀ n, and hence:
P[⋃_{n∈N} {ξ ≤ c − 1/n}] = lim_{n→∞} P[ξ ≤ c − 1/n] = P[ξ < c] = 0
We also have P[ξ ≤ c + 1/n] = 1 ∀ n, and hence:
P[⋂_{n∈N} {ξ ≤ c + 1/n}] = lim_{n→∞} P[ξ ≤ c + 1/n] = P[ξ ≤ c] = 1
Therefore P[ξ = c] = 1.
Definition 3.2 – Infinitely often (i.o.)
Let (En)n∈N be a sequence of events in (Ω, F, P). The event that En happens for infinitely many n ∈ N is given by:
lim sup_{n→∞} En := ⋂_{m∈N} ⋃_{n≥m} En = (En i.o.) (infinitely often)
Definition 3.3 – Eventually (ev)
Let (En)n∈N be a sequence of events in (Ω, F, P). The event that En happens for all n ≥ m, for some m ∈ N, is given by:
lim inf_{n→∞} En := ⋃_{m∈N} ⋂_{n≥m} En = (En ev) (eventually)
Remarks:
1. We can also write: lim sup_{n→∞} En = {ω ∈ Ω : ∀ m ∈ N ∃ n(ω) ≥ m s.t. ω ∈ En(ω)}.
2. Similarly, lim inf_{n→∞} En = {ω ∈ Ω : ∃ m(ω) ∈ N s.t. ∀ n ≥ m(ω) we have ω ∈ En}.
3. (En i.o.)ᶜ = (Enᶜ ev). To see this, note that (⋂_{m∈N} ⋃_{n≥m} En)ᶜ = ⋃_{m∈N} ⋂_{n≥m} Enᶜ.
4. (En i.o.), (En ev) ∈ T , where T is the tail σ−algebra of the family Fn = σ(En). To see this, recall that T = ⋂_{n∈N} σ(Fn, Fn+1, . . .) and note that (En i.o.) ∈ σ(Fm, Fm+1, . . .) ∀ m ∈ N, since (En i.o.) = ⋂_{k∈N} ⋃_{n≥k} En = ⋂_{k≥m} ⋃_{n≥k} En; this is because the sequence of events (⋃_{n≥k} En)k∈N is decreasing.
Lemma 3.1 – The first Borel-Cantelli lemma
Let (En)n∈N be a sequence of events in (Ω, F, P) with Σ_{n∈N} P[En] < ∞.
Then P[En i.o.] = P[lim sup En] = 0.
Proof. We have lim sup_{n→∞} En = ⋂_{m∈N} Am, where Am = ⋃_{n≥m} En.
Since Am ⊇ Am+1 ∀ m ∈ N, continuity from above gives P[En i.o.] = lim_{m→∞} P[Am].
But note that:
0 ≤ P[Am] ≤ Σ_{n≥m} P[En] → 0 as m → ∞
since the series Σ_n P[En] converges. This concludes the proof.
Remarks:
1. The first Borel-Cantelli Lemma is very important. It is for example used in the con-
struction of Brownian motion.
2. Let (Ω, F, P) be a probability space and let Q be a probability measure on (Ω, F). We say that Q is absolutely continuous with respect to P (written Q ≪ P) if ∀ F ∈ F, P[F] = 0 ⇒ Q[F] = 0.
We claim that if Q ≪ P, then ∀ ε > 0 ∃ δ > 0 s.t. ∀ F ∈ F with P[F] < δ we have Q[F] < ε.
Proof. We argue by contradiction, showing that the negation of the statement leads to a contradiction.
The negation is:
∃ ε > 0 s.t. ∀ δ > 0 ∃ Fδ ∈ F with P[Fδ] < δ and Q[Fδ] ≥ ε
Hence ∀ n ∈ N, pick δn = 2⁻ⁿ and let Fn ∈ F satisfy P[Fn] < 2⁻ⁿ and Q[Fn] ≥ ε.
Let F = lim sup_{n→∞} Fn.
We have P[F] = 0 by Borel Cantelli Lemma 1 (Lemma 3.1), since Σ_{n∈N} P[Fn] < ∞.
But Q[F] = lim_{m→∞} Q[⋃_{n≥m} Fn] ≥ ε ∀ m.
This implies that Q[F] ≥ ε. But if P[F] = 0, then Q[F] = 0 as well, since Q ≪ P.
Hence we have a contradiction.
Lemma 3.2 – The second Borel-Cantelli lemma
Let (En)n∈N in (Ω, F, P) be a sequence of independent events with Σ_{n∈N} P[En] = ∞.
Then P[En i.o.] = P[lim sup En] = 1.
Proof. Note that (En i.o.)ᶜ = ⋃_{m∈N} ⋂_{n≥m} Enᶜ.
So if we can show that this complement has probability 0, we are done. Now:
P[⋂_{n≥m} Enᶜ] = lim_{k→∞} P[⋂_{n=m}^k Enᶜ]   by continuity from above of the measure (P[Ω] = 1 < ∞)
             = lim_{k→∞} ∏_{n=m}^k P[Enᶜ]   by independence of the En (and hence of the Enᶜ)
             = lim_{k→∞} ∏_{n=m}^k (1 − P[En])
             ≤ lim_{k→∞} exp(−Σ_{n=m}^k P[En])   by the inequality 1 − x ≤ e⁻ˣ ∀ x
             = 0
since Σ_n P[En] = ∞. This implies that P[⋂_{n≥m} Enᶜ] = 0 ∀ m ∈ N. Therefore we have:
P[(En i.o.)ᶜ] = P[⋃_{m∈N} ⋂_{n≥m} Enᶜ] ≤ Σ_{m∈N} P[⋂_{n≥m} Enᶜ] = 0
and hence P[En i.o.] = 1.
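As a sanity check, one can simulate the second Borel-Cantelli lemma: with independent events En of probability 1/n the sum of probabilities diverges, so almost every sample path should contain arbitrarily late occurrences. A minimal sketch (Python with numpy; the horizon N is an arbitrary choice for illustration):

import numpy as np

rng = np.random.default_rng(0)
N = 200_000                               # simulation horizon
n = np.arange(1, N + 1)
occurred = rng.random(N) < 1.0 / n        # independent events E_n with P[E_n] = 1/n

# sum of P[E_n] ~ log N diverges, so by BC2 occurrences keep appearing
print("number of E_n that occurred:", occurred.sum())
print("largest n with E_n occurring:", n[occurred].max())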
Remarks:
1. Let (En)n∈N be the sequence of independent events. Then P [lim sup En] is either 0 or
1 by Kolmogorov’s 0 − 1 Law.
2. Furthermore, P[lim sup En] = 1 ⇔ Σ_{n∈N} P[En] = ∞, by combining Borel Cantelli Lemmas 1 and 2 (Lemmas 3.1 and 3.2).
Example 3.2
1. Let X ∼ N(0, 1). Then the following inequality holds for all x > 0:
f(x)/(x + x⁻¹) < P[X > x] < f(x)/x,   where f(x) = (1/√(2π)) e^(−x²/2)
This follows by noting that x ∫_x^∞ f(y) dy < ∫_x^∞ y f(y) dy, that f′(x) = −x f(x), and that
d/dx (f(x)/x) = −f(x) (1 + 1/x²).
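These Gaussian tail (Mills-ratio) bounds are easy to verify numerically; the sketch below (Python with scipy, purely illustrative) compares both bounds with the exact tail probability.

import numpy as np
from scipy.stats import norm

def f(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # standard normal density

for x in [0.5, 1.0, 2.0, 3.0, 5.0]:
    lower = f(x) / (x + 1 / x)
    upper = f(x) / x
    tail = norm.sf(x)                               # exact P[X > x]
    assert lower < tail < upper
    print(f"x={x:>4}: {lower:.3e} < {tail:.3e} < {upper:.3e}")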
2. Let Xn ∼ N(0, 1) be independent and let L = lim sup_{n→∞} Xn/√(2 log n). Show that P[L > 1] = 0.
Proof. Define En(a) := {Xn > (1 + a)√(2 log n)}, a ∈ R.
Note that {L > 1 + 1/k} ⊆ lim sup_{n→∞} En(1/(2k)) ∀ k ∈ N.
We want to show that P[lim sup En(1/(2k))] = 0 using Borel Cantelli Lemma 1 (Lemma 3.1).
Using part 1,
P[En(1/(2k))] < f((1 + 1/(2k))√(2 log n)) / ((1 + 1/(2k))√(2 log n)) = (1/√(2π)) · n^(−(1 + 1/(2k))²) / ((1 + 1/(2k))√(2 log n))
Since Σ_{n≥2} n^(−α)/√(log n) < ∞ for any α > 1, Borel Cantelli Lemma 1 (Lemma 3.1) gives P[lim sup En(1/(2k))] = 0, and hence P[L > 1 + 1/k] = 0.
Hence {L > 1} = ⋃_{k∈N} {L > 1 + 1/k} ⇒ P[L > 1] = 0. This is also equivalent to P[L ≤ 1] = 1.
3. Prove that P[L = 1] = 1.
Proof. It is sufficient to show that P[L < 1 − ε] = 0 ∀ ε > 0.
Recall that {L < 1} = ⋃_{n=2}^∞ {L < 1 − 1/n}, so if we can show that each {L < 1 − 1/n} has probability 0, then by countable union {L < 1} has probability 0.
We pick ε > 0, and consider the events:
En(a) = {Xn/√(2 log n) > 1 + a}
Then {L < 1 − ε} ⊆ (En(−ε)ᶜ ev) = (En(−ε) i.o.)ᶜ.
We want to show that P[lim sup En(−ε)] = 1.
So we need to prove that Σ_{n∈N} P[En(−ε)] = ∞, by showing that P[En(−ε)] ≥ an for some sequence an > 0 with Σ an = ∞ (using the inequalities in part 1).
Exercise: Find such a sequence an.
Note that the En(−ε) are independent since the random variables Xn are independent. Therefore Borel Cantelli Lemma 2 (Lemma 3.2) gives P[lim sup En(−ε)] = 1 ⇒ P[L < 1 − ε] = 0.
Exercise: Show that L ∈ mT .
Example 3.3
Let Xn ∼ N(0, 1) be independent random variables, and let Sn = X1 + . . . + Xn. Show that:
i) Sn/√n ∼ N(0, 1)
ii) lim inf Sn/n = lim sup Sn/n = 0 (which implies that lim Sn/n exists and equals 0)
Note that ii) is the strong Law of Large Numbers for N(0, 1) random variables.
For i), it is easy to check that E[Sn/√n] = 0 by properties of expectations. So all we need now is to check that Var[Sn/√n] = 1. This holds since Var[Sn/√n] = (1/√n)² Var[Sn] = (1/n) Σ_{i=1}^n Var[Xi] = 1 by independence, and a sum of independent normal random variables is normal.
For ii), we consider the events En = {|Sn| ≤ 2√(n log n)} and claim that P[En ev] = 1.
This claim is useful as it gives a bound for Sn: −2√(n log n) ≤ Sn ≤ 2√(n log n) for all large n, with probability 1.
Therefore we have:
−2√(n log n)/n ≤ Sn/n ≤ 2√(n log n)/n
for all large n (P−a.s.). But as n → ∞ both outer terms tend to 0, so lim Sn/n = 0, which proves ii). It then suffices to prove the claim.
To prove the claim, note that (En ev)ᶜ = (Enᶜ i.o.). We wish to apply Borel Cantelli Lemma 1 (Lemma 3.1), and hence we need to show that Σ_{n∈N} P[Enᶜ] is finite.
We need to find an upper bound P[Enᶜ] = P[|Sn|/√n ≥ 2√(log n)] ≤ an, say, such that Σ an < ∞.
Exercise: Find this upper bound.
We can then use Borel Cantelli Lemma 1 (Lemma 3.1) to show that P[Enᶜ i.o.] = 0.
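A quick simulation of ii) (Python with numpy; illustrative only) shows Sn/n collapsing to 0 while staying inside the 2√(n log n)/n envelope used in the claim:

import numpy as np

rng = np.random.default_rng(1)
N = 100_000
x = rng.standard_normal(N)
n = np.arange(1, N + 1)
s = np.cumsum(x)                          # partial sums S_n

envelope = 2 * np.sqrt(n * np.log(np.maximum(n, 2))) / n
inside = np.abs(s / n) <= envelope
print("S_N / N =", s[-1] / N)             # close to 0
print("fraction of n with |S_n/n| inside the envelope:", inside.mean())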
4 Integration
(Ω, F, µ) is a measure space, and mF = {f : Ω → R s.t. f⁻¹(B(R)) ⊆ F}.
The Lebesgue integral is first defined for f ∈ (mF)⁺, where:
f ∈ (mF)⁺ ⇔ f ∈ mF and f ≥ 0 µ−a.s.
Let f = Σ_{i=1}^n ai I_{Ai}, with Ai ∈ F and ai ≥ 0, be a simple function; its integral is ∫_Ω f dµ = Σ_{i=1}^n ai µ(Ai).
For general f ∈ (mF)⁺, we find a sequence (fn)n∈N of simple functions such that fn(ω) ↑ f(ω) ∀ ω ∈ Ω.
(Here ↑ means that fn(ω) is a monotone increasing sequence which converges to f(ω).)
We define:
∫_Ω f dµ = lim_{n→∞} ∫_Ω fn dµ     (4.1)
We need to check that (4.1) is a good definition, and hence need to check:
i) the limit in (4.1) exists (which is true since fn ≤ fn+1 ∀ n ⇒ ∫_Ω fn dµ ≤ ∫_Ω fn+1 dµ);
ii) if gn(ω) ↑ f(ω) ∀ ω ∈ Ω for another sequence of simple functions gn, then ∫_Ω gn dµ → ∫_Ω f dµ as n → ∞; in other words, the definition is independent of the approximating sequence (fn)n∈N.
Exercise: check ii).
Theorem 4.1 – Monotone convergence theorem
Take f, fn ∈ (mF)⁺ such that fn(ω) ≤ fn+1(ω) ∀ n ∈ N, ω ∈ Ω, and f(ω) = lim_{n→∞} fn(ω).
Then ∫_Ω f dµ = lim_{n→∞} ∫_Ω fn dµ.
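The staircase approximation behind (4.1) can be made concrete. The sketch below (Python with numpy; the dyadic truncation αn is the standard choice of simple function, not something prescribed by these notes) builds fn = αn(f) for f(x) = x² on ([0, 1], Lebesgue) and shows the integrals increasing towards ∫₀¹ x² dx = 1/3.

import numpy as np

def alpha(n, y):
    """Dyadic truncation: round y down to the grid 2^-n and cap at n."""
    return np.minimum(np.floor(y * 2**n) / 2**n, n)

x = np.linspace(0.0, 1.0, 1_000_001)   # fine grid standing in for ([0,1], Lebesgue)
f = x**2

for n in [1, 2, 4, 8, 16]:
    fn = alpha(n, f)                    # simple function, fn <= f, increasing in n
    print(n, fn.mean())                 # ~ integral of fn, increasing towards 1/3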
Properties of the Lebesgue Integral
• (Linearity) For a, b ≥ 0 and g, h ∈ (mF)⁺:
∫_Ω (ag + bh) dµ = a ∫_Ω g dµ + b ∫_Ω h dµ
• f ∈ (mF)⁺ ⇒ ∫_Ω f dµ ≥ 0 (from (4.1), since it is true for simple functions).
Definition 4.1 – Integrable
We define:
L¹(Ω, F, µ) = {f ∈ mF s.t. ∫_Ω f⁺ dµ < ∞ and ∫_Ω f⁻ dµ < ∞}
where f⁺ = max{f, 0} and f⁻ = max{−f, 0}. We then say f ∈ mF is integrable.
We have:
• ∫_Ω f dµ := ∫_Ω f⁺ dµ − ∫_Ω f⁻ dµ
• |f| = f⁺ + f⁻
Therefore:
|∫_Ω f dµ| = |∫_Ω f⁺ dµ − ∫_Ω f⁻ dµ| ≤ ∫_Ω f⁺ dµ + ∫_Ω f⁻ dµ = ∫_Ω |f| dµ
Lemma 4.1 – Fatou's lemma
Let (fn)n∈N be a sequence in (mF)⁺. Then we have:
∫_Ω lim inf_{n→∞} fn dµ ≤ lim inf_{n→∞} ∫_Ω fn dµ
Proof. Recall that lim inf_{n→∞} fn = lim_{n→∞} gn, where gn = inf{fn, fn+1, . . .}.
Note that (gn)n∈N is non-decreasing and gn ≤ fn ∀ n ∈ N.
By the Monotone Convergence Theorem (Theorem 4.1), we have:
∫_Ω lim inf_{n→∞} fn dµ = ∫_Ω lim_{n→∞} gn dµ
                       = lim_{n→∞} ∫_Ω gn dµ
                       = lim inf_{n→∞} ∫_Ω gn dµ   (if the limit exists, it equals the lim inf)
                       ≤ lim inf_{n→∞} ∫_Ω fn dµ   (using ∫_Ω gn dµ ≤ ∫_Ω fn dµ ∀ n ∈ N)
Theorem 4.2 – Dominated convergence theorem
Let fn, f ∈ mF and assume that ∃ g ∈ L¹(Ω, F, µ) s.t. |fn| ≤ g ∀ n ∈ N and lim_{n→∞} fn(ω) = f(ω) ∀ ω ∈ Ω. Then ∫_Ω f dµ = lim_{n→∞} ∫_Ω fn dµ.
Proof. Note that f ∈ L¹ since |f| ≤ g: indeed f⁺ ≤ g ⇒ ∫_Ω f⁺ dµ ≤ ∫_Ω g dµ, so f⁺, f⁻ ∈ L¹ and are bounded by g. We wish to show that ∫_Ω |f − fn| dµ → 0.
Note that:
Fn := 2g − |f − fn| ≥ g − |f| + g − |fn| ≥ 0 ∀ n
By Fatou's lemma (Lemma 4.1) applied to (Fn)n∈N, we get:
∫_Ω lim inf Fn dµ ≤ lim inf ∫_Ω Fn dµ     (4.2)
For the LHS of (4.2), since |f − fn| → 0 pointwise:
∫_Ω lim inf_{n→∞} Fn dµ = ∫_Ω (2g − 0) dµ = ∫_Ω 2g dµ
For the RHS of (4.2):
lim inf_{n→∞} ∫_Ω Fn dµ = ∫_Ω 2g dµ − lim sup_{n→∞} ∫_Ω |f − fn| dµ
Rearranging (4.2), we have:
∫_Ω 2g dµ ≤ ∫_Ω 2g dµ − lim sup_{n→∞} ∫_Ω |f − fn| dµ
and since ∫_Ω 2g dµ is finite, this implies lim sup_{n→∞} ∫_Ω |f − fn| dµ ≤ 0, i.e. lim sup_{n→∞} ∫_Ω |f − fn| dµ = 0.
Since |f − fn| is non-negative and lim inf ≤ lim sup, it follows that lim_{n→∞} ∫_Ω |fn − f| dµ = 0.
Therefore |∫_Ω f dµ − ∫_Ω fn dµ| ≤ ∫_Ω |f − fn| dµ → 0.
5 Expectations
We take (Ω, F, P) to be our probability space, and X a random variable, which means that X ∈ mF.
If X ≥ 0, then E[X] = ∫_Ω X dP = ∫_Ω X(ω) P[dω].
For X ∈ mF, we say X ∈ L¹(Ω, F, P) if E[X⁺], E[X⁻] < ∞, where:
X⁺ = max{X, 0},   X⁻ = max{−X, 0}
The expectation of X is then given by E[X] = E[X⁺] − E[X⁻].
Proposition 5.1
Let X be a random variable on (Ω, F, P) and let g : R → R be Borel measurable. Then g(X) is in L¹(Ω, F, P) ⇔ g ∈ L¹(R, B, P_X), where P_X[A] = P[X ∈ A] ∀ A ∈ B(R). In that case we have:
E[g(X)] = ∫_R g(x) P_X[dx]     (5.1)
Remarks:
1. If X is a continuous random variable, i.e. P_X ≪ γ_L (the Lebesgue measure), equivalently P_X[A] = ∫_A f_X(x) dx, then by Proposition 5.1 we have E[g(X)] = ∫_R g(x) f_X(x) dx.
2. If X is a discrete random variable, e.g. X ∈ N with probability 1, then E[g(X)] = Σ_{k∈N} g(k) P[X = k], where P_X[{k}] = P[X = k].
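Both remarks can be checked numerically. The sketch below (Python with numpy/scipy; the choices g(x) = x² and X ∼ Exp(1) are arbitrary examples, not from the notes) computes E[g(X)] once against the density, as in remark 1, and once by averaging g(X(ω)) over simulated outcomes.

import numpy as np
from scipy import integrate

g = lambda x: x**2
f_X = lambda x: np.exp(-x)               # density of X ~ Exp(1) on [0, inf)

# E[g(X)] via the law of X (remark 1): integral of g(x) f_X(x) dx
law_value, _ = integrate.quad(lambda x: g(x) * f_X(x), 0, np.inf)

# E[g(X)] via simulation on Omega: average of g(X(omega))
rng = np.random.default_rng(2)
mc_value = g(rng.exponential(1.0, size=1_000_000)).mean()

print(law_value, mc_value)               # both close to E[X^2] = 2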
Proof. (of Proposition 5.1)
We show that (5.1) holds in stages: first for indicator functions, then simple functions, then non-negative functions, and finally all integrable g.
Let g = I_A, A ∈ B(R). Then (5.1) holds, since E[I_A(X)] = P[X ∈ A] = P_X[A]. (Indicator functions)
By linearity of integrals, and since simple functions are finite weighted sums of indicator functions, (5.1) holds for simple g as well. (Simple functions)
Assume g ≥ 0 and let 0 ≤ gn ↑ g be a monotone sequence of simple functions converging to g. We have E[gn(X)] = ∫_R gn(x) P_X[dx]. Since gn(X) is a simple random variable on Ω with gn(X) ↑ g(X), the Monotone Convergence Theorem (Theorem 4.1) gives E[g(X)] = lim_{n→∞} E[gn(X)] and ∫_R gn dP_X ↑ ∫_R g dP_X, so (5.1) holds for g ≥ 0. (Non-negative functions)
Lastly, take g ∈ L¹(R, B, P_X) and apply (5.1) to g⁺, g⁻. Then by linearity of the integral, (5.1) holds in general. (All integrable g)
Lemma 5.1
X ∈ (mF)⁺ and E[X] = 0 ⇒ P[X = 0] = 1 (⇔ P[X > 0] = 0).
Proof. Note that {X > 0} = ⋃_{n∈N} {X > 1/n}.
We attempt a proof by contradiction, and assume that P[X > 0] > 0.
P[X > 0] > 0 ⇒ ∃ n ∈ N s.t. P[X > 1/n] > 0.
Then we have:
E[X] = ∫_Ω X dP
     = ∫_Ω X I_{X>1/n} dP + ∫_Ω X I_{X≤1/n} dP
     ≥ ∫_Ω X I_{X>1/n} dP
     ≥ ∫_Ω (1/n) I_{X>1/n} dP
     = (1/n) P[X > 1/n]
     > 0
which is a contradiction. Therefore P[X > 0] = 0.
6 Inequalities
Definition 6.1 – Convex function (in R)
A function f : I → R, where I ⊆ R is an interval (either open or closed) is convex if
∀ p ∈ (0, 1) and x, y ∈ I, we have f(px + (1 − p)y) ≤ pf(x) + (1 − p)f(y).
If f is a convex function, then f is continuous. Exercise: Prove by contradiction.
Example 6.1 – Examples of convex functions
1. x → |x|
2. x → x2
3. x → eθx
∀ θ ∈ R
Example 6.2 – Examples of non-convex functions
1. x → −|x| (concave function)
2. x → sin x (neither convex or concave) Exercise: Prove this.
Proposition 6.1
If f is both convex and concave on R, there exists a, b ∈ R such that f(x) = ax + b ∀x ∈ R.
Exercise: Prove this.
Exercise: Prove that a concave function is continuous (this follows from a convex function
is continuous).
Proposition 6.2
If f : I → R is in C²(I), then f is convex ⇔ f″(x) ≥ 0 ∀ x ∈ I.
Proof.
(⇒)
Using Taylor's theorem, we can expand, for ε > 0:
f(x + ε) = f(x) + ε f′(x) + (ε²/2) f″(ξ₊), where ξ₊ ∈ (x, x + ε)
and
f(x − ε) = f(x) − ε f′(x) + (ε²/2) f″(ξ₋), where ξ₋ ∈ (x − ε, x)
Adding these, we can write:
f″(x) = lim_{ε↓0} (f(x + ε) + f(x − ε) − 2f(x)) / ε²     (6.1)
since ξ₊, ξ₋ → x as ε ↓ 0 and f″ is continuous.
Assume f is convex.
Then we can write x = p(x − ε) + (1 − p)(x + ε) with p = 1/2.
By convexity, we can write:
f(x) = f((1/2)(x − ε) + (1/2)(x + ε)) ≤ (1/2) f(x − ε) + (1/2) f(x + ε)
This gives f(x + ε) + f(x − ε) − 2f(x) ≥ 0, and since ε² > 0, (6.1) implies f″(x) ≥ 0.
(⇐) Exercise
Theorem 6.1 – Markov's inequality
Let (Ω, F, P) be a probability space. Take X ∈ mF, and let g : I → [0, ∞] be a non-decreasing B−measurable function, where I ⊆ R is an interval such that P[X ∈ I] = 1.
Then g(c) · P[X ≥ c] ≤ E[g(X)] ∀ c ∈ I.
Note that E[g(X)] exists (it may be +∞) since g(X) ∈ (mF)⁺.
Proof.
g(c) · P[X ≥ c] = E[g(c) · I_{X≥c}]
              ≤ E[g(X) · I_{X≥c}]   since on {X ≥ c} we have g(X) ≥ g(c), as g is non-decreasing
              ≤ E[g(X)]   since g(X) ≥ 0.
Example 6.3 – Examples of using Markov's inequality
Suppose X ∈ mF and ε > 0. Then:
P[|X| ≥ ε] ≤ E[|X|]/ε     (6.2)
and
P[|X| ≥ ε] ≤ E[X²]/ε²     (6.3)
(6.2) follows by applying Markov's inequality (Theorem 6.1) to the random variable |X| with g : [0, ∞] → [0, ∞], x → x.
(6.3) follows by applying Markov's inequality (Theorem 6.1) to the random variable |X| with g : [0, ∞] → [0, ∞], x → x². (6.3) is also known as Chebyshev's inequality.
Theorem 6.2 – Jensen's inequality
Let (Ω, F, P) be a probability space. Let X be a random variable such that P[X ∈ I] = 1, where I ⊆ R is an interval. Let g : I → R be a convex function such that E[g(X)] < ∞ and E[|X|] < ∞. Then g(E[X]) ≤ E[g(X)].
Proof. Since g is convex, we have g(x) = sup_{n∈N} {an x + bn} ∀ x ∈ I for some sequences (an)n∈N, (bn)n∈N. Hence:
g(X) ≥ an X + bn ∀ n ∈ N
⇒ E[g(X)] ≥ an E[X] + bn ∀ n ∈ N
⇒ E[g(X)] ≥ sup_{n∈N} {an E[X] + bn} = g(E[X])
Remarks:
1. If the assumptions of Jensen's inequality (Theorem 6.2) hold and g is concave, then we get the reverse inequality g(E[X]) ≥ E[g(X)].
2. If a random variable X takes two values x, y ∈ I with p = P[X = x] and 1 − p = P[X = y], then Jensen's inequality (Theorem 6.2) is just the definition of convexity of g, i.e. g(px + (1 − p)y) ≤ pg(x) + (1 − p)g(y).
3. Under the assumptions above, we have E[X] ∈ I.
Exercise: Prove this. Hint: If I = (a, b), then P[X < b] = 1 and P[X > a] = 1, which gives a < E[X] < b.
Hence g(E[X]) is well defined.
Definition 6.2 – Lᵖ space
We define Lᵖ(Ω, F, P) to be {X ∈ mF : E[|X|ᵖ] < ∞}.
p = 1, 2 are the most common, but for every p ≥ 1 we get a vector space.
Proof. (Lᵖ, p ≥ 1, is a vector space)
We first note that:
(x + y)ᵖ ≤ (2 max{x, y})ᵖ ≤ 2ᵖ (xᵖ + yᵖ) ∀ x, y ≥ 0     (6.4)
Take X, Y ∈ Lᵖ. We need to show that E[|αX + βY|ᵖ] < ∞ for α, β ∈ R. So:
E[|αX + βY|ᵖ] ≤ E[(|αX| + |βY|)ᵖ]   by the triangle inequality
            ≤ 2ᵖ (E[|α|ᵖ |X|ᵖ] + E[|β|ᵖ |Y|ᵖ])   using (6.4)
            < ∞
Definition 6.3 – ‖X‖_p
We define ‖X‖_p := (E[|X|ᵖ])^(1/p) for X ∈ Lᵖ, p ≥ 1.
Note that this is not quite a norm on Lᵖ: the first property fails, since ‖X‖_p = 0 only implies X = 0 (P−a.s.), not X = 0 everywhere.
Theorem 6.3 – Cauchy-Schwarz inequality
Take X, Y ∈ L². Then XY ∈ L¹ and:
|E[XY]| ≤ E[|XY|] ≤ (E[X²] E[Y²])^(1/2)
Furthermore, we have equality |E[XY]| = (E[X²] E[Y²])^(1/2) if and only if there exist a, b ∈ R with |a| + |b| > 0 such that aX + bY = 0 (P−a.s.).
Proof. We first note that:
0 ≤ (X + λY)² ∀ λ ∈ R     (6.5)
and X + λY ∈ L².
Hence XY = (1/2)[(X + Y)² − X² − Y²] ∈ L¹.
From (6.5), we have:
0 ≤ X² + 2λXY + λ²Y² ⇒ 0 ≤ E[X²] + 2λ E[XY] + λ² E[Y²] ∀ λ ∈ R
Note that if E[Y²] = 0, then Y = 0 (P−a.s.) and the Cauchy-Schwarz inequality (Theorem 6.3) holds trivially. So WLOG we assume E[Y²] > 0.
Differentiating the quadratic in λ, the minimum is attained at λ = −E[XY]/E[Y²].
Substituting this value of λ, we get:
0 ≤ E[X²] − 2 E[XY]²/E[Y²] + E[XY]²/E[Y²] ⇒ E[XY]² ≤ E[X²] E[Y²]
which proves the inequality (applied to |X|, |Y| it also gives E[|XY|] ≤ (E[X²] E[Y²])^(1/2), and |E[XY]| ≤ E[|XY|] always holds).
If equality holds, then:
0 = E[(X + λY)²] for λ = −E[XY]/E[Y²]
which implies that X + λY = 0 (P−a.s.), so we may take a = 1, b = λ.
If Y = 0 (P−a.s.), equality holds trivially and we may take a = 0, b = 1.
As L²(Ω, F, P) is a vector space, we define the 'inner product' ⟨X, Y⟩ = E[XY]. This is well defined since X, Y ∈ L² ⇒ XY ∈ L¹.
Then the Cauchy-Schwarz inequality takes the form:
|⟨X, Y⟩| ≤ ‖X‖₂ ‖Y‖₂,   where ‖X‖₂ = (E[|X|²])^(1/2).
Note that the triangle inequality ‖X + Y‖₂ ≤ ‖X‖₂ + ‖Y‖₂ holds for X, Y ∈ L² by the Cauchy-Schwarz inequality (Theorem 6.3).
Proof.
‖X + Y‖₂² = E[(X + Y)²] = E[X²] + E[Y²] + 2E[XY]
         ≤ E[X²] + E[Y²] + 2 (E[X²])^(1/2) (E[Y²])^(1/2)   by Cauchy-Schwarz
         = (‖X‖₂ + ‖Y‖₂)²
Theorem 6.4 – Monotonicity of Lᵖ norms
Given X ∈ Lᵖ(Ω, F, P), p ≥ 1, with ‖X‖_p = (E[|X|ᵖ])^(1/p), then for 1 ≤ p ≤ r < ∞ and any Y ∈ Lʳ(Ω, F, P) we have ‖Y‖_p ≤ ‖Y‖_r. In particular, Lʳ ⊆ Lᵖ.
Proof. Note that g(x) = x^(r/p) is convex on [0, ∞). Then we have:
g(E[|Y|ᵖ]) ≤ E[g(|Y|ᵖ)]   by Jensen's inequality
⇒ (E[|Y|ᵖ])^(r/p) ≤ E[|Y|ʳ]
⇒ ‖Y‖_p ≤ ‖Y‖_r   by taking r-th roots on both sides
Remark:
Note that Theorem 6.4 holds for probability measures only. Exercise: Find f ∈ L²(R, B, γ_L) such that f ∉ L¹(R, B, γ_L).
Recap of definitions from probability (all well defined if X, Y ∈ L²):
• Cov[X, Y ] = E[(X − E[X])(Y − E[Y ])]
• Var[X] = Cov[X, X]
• |Cov[X, Y ]| ≤ (Var[X] Var[Y ])^(1/2) (by the Cauchy-Schwarz inequality)
• ρ(X, Y ) = Cov[X, Y ] / (Var[X] Var[Y ])^(1/2) ∈ [−1, 1] (the correlation between two random variables)
Theorem 6.5 – Independence
If random variables X, Y ∈ L1
(Ω, F, P) and X and Y are independent, then XY ∈
L1
(Ω, F, P). Furthermore E[XY ] = E[X]E[Y ].
Remarks:
1. Let f, g : R → R ∈ mB and (independent) X, Y as in Theorem 6.5. Then if
E [f(X)] , E [g(Y )] are finite, we have:
E [f(X)g(Y )] = E [f(X)] E [g(Y )] (6.6)
Exercise: Prove that X, Y independent ⇒ f(X), g(Y ) independent. Use the fact that
f(X), g(Y ) ∈ mF.
To prove (6.6), we apply Theorem 6.5 to f(X) and g(Y ).
2. If X, Y are independent in L2
⇒ Cov[X, Y ] = 0. So:
Cov[X, Y ] = E [(X − E[X])(Y − E[Y ])]
= E [X − E[X]] E [Y − E[Y ]]
= 0
Example 6.4
Take E[X] = 0, E[|X|3
] < ∞. In other words, X ∈ L3
and E[X] = 0. If E[X3
] = 0, then
Cov[X, X2
] = 0 and X and X2
are not independent.
Prove this.
Proof. (Sketch proof of Theorem 6.5; Exercise: write out the full proof.)
Note that it is enough to prove the theorem for X, Y ∈ L¹ ∩ (mF)⁺, since X = X⁺ − X⁻ and Y = Y⁺ − Y⁻, the positive and negative parts are independent (X⁺ = max{X, 0} and Y⁺ = max{Y, 0} are functions of X and Y respectively), and we can then use linearity of expectation.
So assume X, Y ≥ 0 and note that α⁽ⁿ⁾(X) ↑ X and α⁽ⁿ⁾(Y ) ↑ Y ∀ ω ∈ Ω, where α⁽ⁿ⁾ : R → R is given by:
α⁽ⁿ⁾(x) := 0 if x = 0;   (i − 1)2⁻ⁿ if (i − 1)2⁻ⁿ < x ≤ i 2⁻ⁿ ≤ n, i ∈ N;   n if x > n
Then note (show this as an exercise):
1. α⁽ⁿ⁾(X) is a simple random variable.
2. α⁽ⁿ⁾(X), α⁽ⁿ⁾(Y ) are independent.
3. E[α⁽ⁿ⁾(X) α⁽ⁿ⁾(Y )] = E[α⁽ⁿ⁾(X)] E[α⁽ⁿ⁾(Y )] ∀ n.
4. α⁽ⁿ⁾(X) α⁽ⁿ⁾(Y ) ↑ XY as n → ∞.
5. Theorem 6.5 follows by applying the Monotone Convergence Theorem (Theorem 4.1) to 3.
7 Convergence of Random Variables
Let (Xn)n∈N be a sequence of random variables on (Ω, F, P).
Definition 7.1 – Converges almost surely
The sequence (Xn)n∈N converges to a random variable X almost surely if the set
{lim_{n→∞} Xn = X} = {ω ∈ Ω : lim_{n→∞} Xn(ω) = X(ω)}
has probability 1, i.e. P[lim_{n→∞} Xn = X] = 1.
Definition 7.2 – Converges in probability
The sequence (Xn)n∈N converges in probability to a random variable X if:
∀ ε > 0, lim_{n→∞} P[|Xn − X| > ε] = 0
Definition 7.3 – Converges in Lᵖ
Let Xn ∈ Lᵖ(Ω, F, P), p ≥ 1, ∀ n. Then the sequence (Xn)n∈N converges in Lᵖ to a random variable X ∈ Lᵖ if E[|Xn − X|ᵖ] → 0 as n → ∞.
Notation: Xn → X in ‖·‖_p.
Definition 7.4 – Cauchy (in Lᵖ)
A sequence (Xn)n∈N is Cauchy in Lᵖ if:
∀ ε > 0 ∃ N ∈ N s.t. ‖Xn − Xm‖_p < ε ∀ n, m > N
Definition 7.5 – Converges in distribution
The sequence (Xn)n∈N converges in distribution to a random variable X if:
lim_{n→∞} P[Xn ≤ x] = P[X ≤ x] for every x ∈ R at which the cdf F_X(y) = P[X ≤ y] is continuous.
Convergence in distribution is also known as weak convergence.
Notation: Xn →d X or Xn →w X.
Remarks:
1. Note that if Xn →d X, then the random variables (Xn)n∈N, X need not be defined on the same probability space. For the other modes of convergence, (Xn)n∈N and X have to be defined on the same probability space.
2. (Xn)n∈N in Lᵖ is Cauchy if and only if sup_{n,m≥N} ‖Xn − Xm‖_p → 0 as N → ∞.
Lemma 7.1
Convergence in probability implies almost sure convergence along a subsequence. In other words, if (Xn)n∈N converges in probability to X, with (Xn)n∈N, X ∈ mF on a probability space (Ω, F, P), then there exists a subsequence (Xkn)n∈N s.t. Xkn → X a.s.
Proof. (Idea of proof; examinable.)
Let (εn)n∈N be a decreasing sequence of positive real numbers with εn ↓ 0.
Then ∀ n ∈ N, ∃ kn ∈ N s.t. P[|Xkn − X| > εn] < 2⁻ⁿ (since Xn → X in probability).
WLOG, we can assume that kn < kn+1 ∀ n ∈ N.
Now we prove that (Xkn)n∈N tends to X almost surely, using Borel Cantelli Lemma 1 (Lemma 3.1).
Note that ∀ ω ∈ Ω, we have:
(Xkn(ω))n∈N converges to X(ω) ⇔ ω ∈ ⋂_{m∈N} lim inf_{n→∞} {|Xkn − X| ≤ 1/m}     (7.1)
Fix m; then note:
lim inf_{n→∞} {|Xkn − X| ≤ 1/m} ⊇ lim inf_{n→∞} {|Xkn − X| ≤ εn}   (since εn ↓ 0)
                              = (lim sup_{n→∞} {|Xkn − X| > εn})ᶜ
Now:
Σ_{n∈N} P[|Xkn − X| > εn] < ∞
⇒ P[lim sup_{n→∞} {|Xkn − X| > εn}] = 0   by Borel Cantelli Lemma 1 (Lemma 3.1)
⇒ P[lim inf_{n→∞} {|Xkn − X| ≤ 1/m}] = 1 ∀ m ∈ N
⇒ P[⋂_{m∈N} lim inf_{n→∞} {|Xkn − X| ≤ 1/m}] = 1
⇒ P[{Xkn → X}] = 1   by (7.1)
Remark:
If (Xn)n∈N, X are random variables on (Ω, F, P) and Xn ∈ mG ∀ n ∈ N, where G ⊆ F is a sub−σ−algebra, then Xn → X in probability implies X ∈ mG (up to modification on a P−null set).
Proof. By Lemma 7.1, ∃ a subsequence (Xkn)n∈N s.t. Xkn → X a.s. as n → ∞, which gives X ∈ mG since Xkn ∈ mG ∀ n ∈ N.
Proposition 7.1
A sequence of random variables (Xn)n∈N converges to X in distribution (equivalently, converges weakly) if and only if lim_{n→∞} E[f(Xn)] = E[f(X)] for every f : R → R continuous and bounded.
Proof. (⇒)
Let Fn(x) = P[Xn ≤ x] and F(x) = P[X ≤ x] be the cdfs of Xn and X respectively.
Let ([0, 1], B, γ_L) be a probability space and define random variables:
Yn(ω) := inf{z ∈ R : ω ≤ Fn(z)} ∀ ω ∈ [0, 1]
Y (ω) := inf{z ∈ R : ω ≤ F(z)} ∀ ω ∈ [0, 1]
Exercise: Show that Yn, Y ∈ mB.
Note: Yn(ω) ≤ y ⇔ ω ≤ Fn(y) ∀ y ∈ R. Exercise: Show this.
Therefore:
γ_L(Yn ≤ y) = Fn(y) = P[Xn ≤ y]
and E[f(Xn)] = E[f(Yn)]. A similar equality holds for X and Y .
Now:
Xn →d X ⇒ Fn(x) → F(x) ∀ x ∈ R \ {points of discontinuity of F}
        ⇒ Yn → Y γ_L−a.s.
Hence E[f(Yn)] → E[f(Y )] = E[f(X)] as n → ∞ by the Dominated Convergence Theorem (Theorem 4.2), since f(Yn) → f(Y ) (f is continuous) and |f(Yn)| ≤ sup_{x∈R} |f(x)| < ∞.
(⇐) is left as homework.
Theorem 7.1 – Modes of Convergence
The implications between the modes of convergence of random variables are:
a) almost sure convergence implies convergence in probability.
b) Lp
convergence (for p ≥ 1) implies convergence in probability.
c) convergence in probability implies convergence in distribution.
Proof.
a) Let (Xn)n∈N converge almost surely to X, and pick ε > 0. We need to prove that P[|Xn − X| > ε] → 0 as n → ∞.
Let An := {|Xn − X| > ε} and note that (An i.o.) ⊆ {Xn does not converge to X}, so P[An i.o.] = 0 by our initial assumption. Then:
0 = P[An i.o.]
  = P[lim sup_{n→∞} An]
  = P[⋂_{m∈N} Bm]   where Bm = ⋃_{n≥m} An
  = lim_{m→∞} P[Bm]   since Bm ⊇ Bm+1 ∀ m ∈ N
  = inf_{m∈N} P[Bm]   since (P[Bm])m∈N is decreasing
  ≥ inf_{m∈N} sup_{n≥m} P[An]   since Bm ⊇ An ∀ n ≥ m
  = lim_{m→∞} sup_{n≥m} P[An] = lim sup_{n→∞} P[An] ≥ 0
This implies that lim_{n→∞} P[An] = 0.
b) Let (Xn)n∈N converge in Lᵖ to X.
Pick ε > 0; then we apply Markov's inequality (Theorem 6.1) with f(x) = xᵖ, f : R⁺ → R⁺, to get:
0 ≤ εᵖ P[|Xn − X| > ε] ≤ E[|Xn − X|ᵖ]
Hence lim_{n→∞} P[|Xn − X| > ε] = 0, since lim_{n→∞} E[|Xn − X|ᵖ] = 0 by assumption.
c) Let Xn → X in probability, and pick f : R → R continuous and bounded.
We need to check that E[f(Xn)] → E[f(X)] as n → ∞ (this gives convergence in distribution by Proposition 7.1).
We argue by contradiction. Suppose not: then ∃ ε > 0 and an increasing subsequence (kn)n∈N, kn ∈ N, s.t. |E[f(Xkn)] − E[f(X)]| > ε ∀ n.
Denote Yn := Xkn ∀ n ∈ N. Then note that:
Xkn → X in probability ⇒ ∃ a subsequence (Yln)n∈N of (Yn)n∈N s.t. Yln → X a.s. as n → ∞, by Lemma 7.1
⇒ f(Yln) → f(X) a.s. as n → ∞, as f is continuous
⇒ lim_{n→∞} E[f(Yln)] = E[f(X)], by the Dominated Convergence Theorem (Theorem 4.2), as f is bounded
⇒ |E[f(Yln)] − E[f(X)]| < ε ∀ n ≥ N₀ for some N₀ ∈ N
This contradicts the choice of (kn)n∈N.
Corollary 7.1
Xn → X in probability if and only if every subsequence (Xkn )n∈N of (Xn)n∈N has a further
subsequence that converges almost surely to X.
Proof.
(⇒)
This follows from Lemma 7.1, since Xkn → X in probability.
(⇐)
We prove the contrapositive.
Assume that (Xn)n∈N does not converge to X in probability. Then:
∃ ε, δ > 0 and an increasing k : N → N s.t. P[Ak(n)] ≥ δ ∀ n ∈ N, where Ak(n) = {|Xk(n) − X| > ε}
We claim no subsequence of (Xk(n))n∈N converges to X almost surely.
Let l : N → k(N) be increasing. We must show that the subsequence (Xl(n))n∈N of (Xk(n))n∈N does not converge to X almost surely.
Note that:
P[Al(n) i.o.] = P[lim sup_{n→∞} Al(n)] ≥ lim sup_{n→∞} P[Al(n)]   (by the reverse Fatou inequality, cf. Lemma 4.1)
            ≥ δ > 0
so (Xl(n))n∈N cannot converge to X almost surely. This proves the contrapositive.
Corollary 7.2 – Continuous mapping theorem
Let (Xn)n∈N converge to X in probability (or respectively converge to X in distribution), and
let f : R → R be a continuous function. Then (f(Xn))n∈N converges to f(X) in probability
(or respectively in distribution).
Proof.
Convergence in probability:
By Corollary 7.1, f(Xn) → f(X) in probability, since every subsequence (f(Xk(n)))n∈N has a further subsequence that tends to f(X) almost surely (the corresponding subsequence of Xk(n) converges to X almost surely, and f is continuous).
Convergence in distribution:
By Proposition 7.1, f(Xn) → f(X) in distribution, since E[g(f(Xn))] → E[g(f(X))] as n → ∞ for every g : R → R continuous and bounded (g ∘ f is itself continuous and bounded).
Theorem 7.2 – Weak law of large numbers
Let {Yn : n ∈ N} be independent, identically distributed random variables with Yn ∈ L², and let µ = E[Yn]. (µ is finite since L² ⊆ L¹.) Define Xn = (1/n) Σ_{i=1}^n Yi. Then Xn → µ in probability as n → ∞.
Proof. For every ε > 0, we have:
ε² P[|Xn − µ| ≥ ε] ≤ E[(Xn − µ)²]   by Markov's inequality (Theorem 6.1)
                 = Var[Xn]
                 = (1/n²) Var[Σ_{i=1}^n Yi]
                 = Var[Y1]/n → 0 as n → ∞
using independence in the second-to-last step. Therefore Xn → µ in probability.
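A quick simulation of the weak law (Python with numpy; the uniform distribution, the tolerance ε and the number of repetitions are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(4)
mu, eps = 0.5, 0.05                        # Y_i ~ Uniform(0,1), so mu = 1/2
for n in [10, 100, 1_000, 10_000]:
    samples = rng.random((20_000, n))      # 20,000 independent copies of (Y_1, ..., Y_n)
    x_n = samples.mean(axis=1)             # X_n = (Y_1 + ... + Y_n) / n
    print(n, (np.abs(x_n - mu) > eps).mean())   # estimate of P[|X_n - mu| > eps], decreasing in n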
Theorem 7.3 – Strong law of large numbers
Let {Yn : n ∈ N} be independent, identically distributed random variables with Yn ∈ L⁴, and let µ = E[Yn]. (µ is finite since L⁴ ⊆ L¹.) With Xn = (1/n) Σ_{i=1}^n Yi as above, Xn → µ almost surely as n → ∞.
Proof. Without loss of generality, we assume µ = 0; otherwise we could consider Yn − µ. Expanding the fourth power and using independence and E[Yi] = 0 (so all terms containing a lone factor Yi vanish):
E[Xn⁴] = (1/n⁴) (Σ_{i=1}^n E[Yi⁴] + 6 Σ_{1≤i<j≤n} E[Yi² Yj²])
      ≤ (1/n⁴) (n c + 6 (n(n − 1)/2) E[Y1² Y2²])   for some constant c > 0
      ≤ d/n²   for some constant d > 0
Thus E[Σ_{n=1}^∞ Xn⁴] ≤ Σ_{n=1}^∞ d/n² < ∞, so Σ_{n=1}^∞ Xn⁴ < ∞ (P−a.s.), hence Xn⁴ → 0 (P−a.s.).
Therefore Xn → 0 almost surely as n → ∞.
Theorem 7.4 – Completeness of Lp
The space Lᵖ(Ω, F, P) is complete for any p ≥ 1. In other words, any Cauchy sequence (Xn)n∈N in Lᵖ has a limit in Lᵖ: there exists a random variable X in Lᵖ such that Xn → X in ‖·‖_p.
Proof. Exercise
Remarks:
1. The proof of this theorem uses the Borel-Cantelli lemma, Fatou's lemma, etc. See the notes for details.
2. If p = 2, we define ⟨X, Y⟩ = E[XY] for X, Y ∈ L². Pythagoras' Theorem says: if X, Y ∈ L² satisfy ⟨X, Y⟩ = 0 (i.e. they are orthogonal), then:
‖X + Y‖₂² = ‖X‖₂² + ‖Y‖₂²     (7.2)
where ‖X‖₂ = ⟨X, X⟩^(1/2) = (E[X²])^(1/2).
Proof.
‖X + Y‖₂² = ⟨X + Y, X + Y⟩ = ‖X‖₂² + ‖Y‖₂² + 2⟨X, Y⟩ = ‖X‖₂² + ‖Y‖₂²   since ⟨X, Y⟩ = 0
3. In probabilistic language, if ⟨X, Y⟩ = 0 and E[X] = E[Y] = 0, then Cov[X, Y] = 0. Furthermore, Var[X + Y] = Var[X] + Var[Y]. This is equivalent to (7.2).
4. Parallelogram Law: for all X, Y ∈ L²,
(1/2) (‖X + Y‖₂² + ‖X − Y‖₂²) = ‖X‖₂² + ‖Y‖₂²
Proof. Exercise
Theorem 7.5
Let (Ω, F, P) be a probability space and G ⊆ F a sub−σ−algebra of F. Then L²(Ω, G, P) is a complete subspace of L²(Ω, F, P), and ∀ X ∈ L²(Ω, F, P) there exists Y ∈ L²(Ω, G, P) such that the following hold:
i) ‖X − Y‖₂ = inf{‖X − Z‖₂ : Z ∈ L²(Ω, G, P)}
ii) E[(X − Y)Z] = 0 ∀ Z ∈ L²(Ω, G, P)
Furthermore, i) and ii) are equivalent, and Y′ ∈ L²(Ω, G, P) satisfies i) or ii) if and only if Y′ = Y (P−a.s.).
Note that ii) ⇔ ⟨X − Y, Z⟩ = 0 ∀ Z ∈ L²(Ω, G, P).
Proof. We need to show that L2
(Ω, G, P) is complete. Then:
Take (Xn)n∈N in L2
(Ω, G, P) Cauchy ⇒ (Xn)n∈N is Cauchy in L2
(Ω, F, P)
⇒ Xn
· 2
−→ X ∈ L2
(Ω, F, P)
⇒ Xn
P
−→ X by Theorem 7.1
⇒ ∃ subsequence Xkn −−−−→
n→∞
X a.s.
⇒ X ∈ mG
⇒ X ∈ L2
(Ω, G, P)
There ∃ (Yn)n∈N ∈ L2
(Ω, G, P) such that:
X − Yn 2 → d := inf X − Z : Z ∈ L2
(Ω, G, P)
We apply the parallelogram law to X − Ym, X − Yn:
Ym − Yn
2
2 = 2 ( Ym − X 2
2 + X − Yn
2
2) − 4 X − (Ym + Yn)/2 2
2
≤ 2 ( Ym − X 2
2 + X − Yn
2
2) − 4d2
≤ 2(d2
+ d2
) − 4d2
as n, m → ∞
= 0
Hence (Yn)n∈N is Cauchy in L2
(Ω, G, P) such that Yn − Y 2 −−−−→
n→∞
0.
Note: d ≤ X − Y 2 ≤ X − Yn 2 + Yn − Y 2.
For every n ∈ N ⇒ d ≤ X − Y ≤ d.
Hence i) holds.
Now we show that i) ⇒ ii) by contradiction.
Assume ∃Z ∈ L2
(Ω, G, P) such that < X − Y, Z > > 0 and Z 2 = 1.
Then Y + < X − Y, Z > Z ∈ L2
(Ω, G, P).
X − (Y + < X − Y, Z > Z 2
2 = X − Y 2
2+ < X − Y, Z >2
Z 2
− 2 < X − Y, Z >2
= X − Y 2
2− < X − Y, Z >2
< X − Y 2
2
This is a contradiction because of i): we know that X−Y 2 = inf { X − Z 2 : Z ∈ L2
(Ω, G, P)}.
But X − Y 2 is the smallest element, and we cannot have anything smaller than that.
Hence i) ⇒ ii).
To see that ii) ⇒ i), note that:
X − Z 2
2 = |(X − Y ) + (Y − Z) 2
2
= X − Y 2
2 + Y − Z 2
2 by Pythagoras Theorem since Y − Z ∈ L2
(Ω, G, P)
≥ X − Y 2
2
So ii) ⇒ i).
If Y satisfies ii), then:
a = X − Y 2
2
= X − Y 2
2 + Y − Y 2
2
≥ X − Y 2
2
= b
By i), a = b (since there can only be one infimum), hence Y − Y 2
2 = 0.
This implies that Y = Y (P−a.s.), because E Y − Y 2
2 = 0.
8 Characteristic Functions and the Central Limit
Theorem
Definition 8.1 – Characteristic function
Let X be a random variable taking values in R with cumulative distribution function F = F_X and law µ (i.e. µ is the measure on (R, B) such that µ((a, b]) = F(b) − F(a) ∀ a ≤ b ∈ R). The characteristic function of X is the function φ : R → C given by:
φ(θ) = E[e^{iθX}] = E[cos(θX)] + i E[sin(θX)] = ∫_R e^{iθx} µ(dx) = ∫_R e^{iθx} dF(x)
Remarks:
1. X ∼ Y ⇒ φ_X = φ_Y, where φ_X is the characteristic function of X and φ_Y is the characteristic function of Y .
2. φ(θ) is well-defined for every θ ∈ R, since |e^{iθx}| = (sin²(θx) + cos²(θx))^(1/2) = 1 ∀ x, θ ∈ R. Hence e^{iθX} ∈ L¹.
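The characteristic function can be estimated empirically by averaging e^{iθX} over samples. The sketch below (Python with numpy; the sample size and grid of θ values are arbitrary) compares the empirical characteristic function of N(0, 1) samples with the known φ(θ) = e^{−θ²/2} used later in the CLT proof.

import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(200_000)
theta = np.linspace(-3, 3, 7)

phi_hat = np.exp(1j * np.outer(theta, x)).mean(axis=1)   # empirical E[exp(i*theta*X)]
phi = np.exp(-theta**2 / 2)                              # characteristic function of N(0,1)
print(np.max(np.abs(phi_hat - phi)))                     # small Monte Carlo error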
Theorem 8.1
Let φ = φX be the characteristic function of a random variable X. Then:
1. φ(0) = 1 (by definition).
2. |φ(θ)| ≤ 1.
3. θ → φ(θ) is continuous ∀ θ ∈ R.
Exercise: Prove this using DCT.
4. φ_{−X}(θ) = φ_X(−θ) = φ_X(θ)* (the complex conjugate) ∀ θ ∈ R.
5. φ_{aX+b}(θ) = e^{iθb} φ_X(aθ) ∀ a, b ∈ R.
6. If E[|X|ⁿ] < ∞ for some n ∈ N, then φ_X⁽ⁿ⁾(0) = iⁿ E[Xⁿ].
Exercise: Prove this using DCT.
Theorem 8.2
If X and Y are independent, then φX+Y (θ) = φX(θ)φY (θ) ∀ θ ∈ R.
Remark:
If E[e^{iαX+iβY}] = E[e^{iαX}] E[e^{iβY}] ∀ α, β ∈ R, then X and Y are independent.
Theorem 8.3 – Levy's inversion formula
Let φ be the characteristic function of a random variable X with law µ and cumulative distribution function F. Then for a < b:
lim_{T→∞} (1/(2π)) ∫_{−T}^{T} ((e^{−iθa} − e^{−iθb})/(iθ)) φ(θ) dθ = (1/2) µ({a}) + µ((a, b)) + (1/2) µ({b})
                                                              = (1/2)(F_X(b) + F_X(b⁻)) − (1/2)(F_X(a) + F_X(a⁻))
where F(a⁻) = lim_{x↑a} F(x).
Proof. Elementary. Exercise.
Remarks:
If φ ∈ L¹(R, B, γ_L), then Levy's inversion formula (Theorem 8.3) implies that X has a density f_X : R → R⁺:
(1/(2π)) ∫_R ((e^{−iθa} − e^{−iθb})/(iθ)) φ_X(θ) dθ = F_X(b) − F_X(a) = ∫_a^b f_X(y) dy
Furthermore, we have f_X(x) = (1/(2π)) ∫_R e^{−iθx} φ_X(θ) dθ.
Theorem 8.4 – Levy's convergence theorem
Let Fn, n ∈ N, be a sequence of cumulative distribution functions with characteristic functions:
φn(θ) = ∫_R e^{iθx} dFn(x)
Suppose that:
• g(θ) = lim_{n→∞} φn(θ) exists ∀ θ ∈ R
• g is continuous at 0.
Then g is the characteristic function of some cumulative distribution function F, i.e. g(θ) = ∫_R e^{iθx} dF(x), and Fn →d F (i.e. Fn(x) → F(x) ∀ x ∈ R s.t. F is continuous at x).
Proof. Given in Williams: Probability with Martingales.
Theorem 8.5 – Central limit theorem
Let (Xn)n∈N be a sequence of independent identically distributed random variables such that E[X1] = 0 and E[X1²] = σ² < ∞, with σ > 0. Let Sn = Σ_{i=1}^n Xi and Gn = Sn/(σ√n). Then Gn →d N(0, 1).
Remark:
If Xi ∼ N(0, 1) for each i ∈ N, then Gn ∼ N(0, 1) ∀ n ∈ N.
Proof. Note that φ_{Gn}(θ) = E[e^{(iθ/(σ√n)) Σ_{i=1}^n Xi}] = (φ_{X1}(θ/(σ√n)))ⁿ, since the Xi are independent and identically distributed.
Since E[X1²] < ∞, we have the expansion:
φ_{X1}(θ) = 1 + iθ E[X1] + ((iθ)²/2!) E[X1²] + o(θ²) = 1 − (θ²/2) σ² + o(θ²)
(the first-order term vanishes because E[X1] = 0). Hence:
φ_{Gn}(θ) = (φ_{X1}(θ/(σ√n)))ⁿ = (1 − θ²/(2n) + o(θ²/(σ²n)))ⁿ
Using the result proved in the course that (1 + bn/n)ⁿ → e^b as n → ∞ whenever bn → b ∈ R, we have:
lim_{n→∞} φ_{Gn}(θ) = φ(θ) = e^{−θ²/2}
It is well known that ∫_R e^{iθx} (1/√(2π)) e^{−x²/2} dx = e^{−θ²/2}, i.e. this is the characteristic function of N(0, 1).
Therefore, by Levy's convergence theorem (Theorem 8.4), Gn →d N(0, 1).
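A simulation makes the statement concrete: standardised sums of i.i.d. (here uniform) random variables have an empirical distribution close to N(0, 1). A sketch (Python with numpy/scipy; the distribution, n and the number of repetitions are arbitrary choices):

import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(6)
n, reps = 500, 20_000
x = rng.random((reps, n)) - 0.5           # X_i ~ Uniform(-1/2, 1/2): mean 0, variance 1/12
sigma = np.sqrt(1 / 12)
g_n = x.sum(axis=1) / (sigma * np.sqrt(n))   # G_n = S_n / (sigma sqrt(n))

print(kstest(g_n, norm.cdf))              # Kolmogorov-Smirnov distance to N(0,1) is small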
9 Conditional Expectation & Martingales
Example 9.1
Let X be a random variable on (Ω, F, P) that takes values in A = {x1, x2, . . . , xm}, P[X ∈ A] = 1, and let Y be a random variable on (Ω, F, P) such that P[Y ∈ B] = 1, B = {y1, . . . , yn}.
In particular, we assume that P[Y = yi] > 0 ∀ i = 1, . . . , n. We have:
E[X | Y = yi] = Σ_{j=1}^m xj · P[X = xj | Y = yi] = Σ_{j=1}^m xj · P[X = xj, Y = yi]/P[Y = yi] = F(yi),   F : B → R.
In other words, E[X | Y ] = F(Y ).
Note that:
E[I_{Y=yi} F(Y )] = P[Y = yi] · Σ_{j=1}^m xj P[X = xj | Y = yi] = Σ_{j=1}^m xj · P[X = xj, Y = yi] = E[X · I_{Y=yi}]
Remarks:
1. To define E [X | Y ], X and Y have to be defined on the same probability space.
2. Note that E [X | Y ] is a random variable in mσ(Y ) such that E [E [X | Y ] · IG] =
E [X · IG] ∀ G ∈ σ(Y ).
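To make the discrete formula above concrete, the following sketch (an added illustration; the joint distribution is an arbitrary made-up example) computes F(yi) = E[X | Y = yi] from a joint probability table and verifies the identity E[I{Y=yi} F(Y)] = E[X · I{Y=yi}].

```python
# Sketch: E[X | Y] on a finite space, computed from a joint probability table.
import numpy as np

x_vals = np.array([0.0, 1.0, 2.0])                 # values of X
y_vals = np.array([10.0, 20.0])                    # values of Y
# joint[j, i] = P[X = x_j, Y = y_i]; arbitrary example, entries sum to 1
joint = np.array([[0.10, 0.15],
                  [0.20, 0.25],
                  [0.05, 0.25]])

p_y = joint.sum(axis=0)                            # P[Y = y_i]
F = (x_vals[:, None] * joint).sum(axis=0) / p_y    # F(y_i) = E[X | Y = y_i]

for i, y in enumerate(y_vals):
    lhs = p_y[i] * F[i]                            # E[ I_{Y=y_i} F(Y) ]
    rhs = (x_vals * joint[:, i]).sum()             # E[ X I_{Y=y_i} ]
    print(f"y={y}: E[X | Y=y]={F[i]:.4f}, lhs={lhs:.4f}, rhs={rhs:.4f}")
```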
Definition 9.1 – Version of conditional expectation
Let X be a random variable in L1(Ω, F, P) and let G ⊆ F be a sub−σ−algebra. If X̂ satisfies:

i) X̂ ∈ mG

ii) E[X̂ · IG] = E[X · IG] ∀ G ∈ G

then X̂ is a version of the conditional expectation E[X | G] of X given G.
We denote X̂ = E[X | G] (P−a.s.).
Remarks:
1. If X ∈ mG satisfies ii) in Definition 9.1, then X = X̂ (P−a.s.).

Proof. Note that {X > X̂}, {X̂ > X} ∈ G and that:

    0 ≤ E[(X − X̂) · I{X>X̂}] = E[X · I{X>X̂}] − E[X̂ · I{X>X̂}]
                             = E[X · I{X>X̂}] − E[X · I{X>X̂}]   by ii) in Def 9.1
                             = 0

    ⇒ X ≤ X̂ (P−a.s.)

Similarly, by looking at the event {X̂ > X}, we find that X̂ ≤ X (P−a.s.).
Therefore, this implies that X̂ = X (P−a.s.).

2. Note that for ii) in Definition 9.1, we implicitly assume that E[X · IG] is well-defined.
Hence we use the fact that X ∈ L1 (since |X · IG| ≤ |X|).

3. In Definition 9.1, we can also assume that X ∈ (mF)+ and drop X ∈ L1.
Theorem 9.1 – Conditional Expectation
Let X ∈ L1(Ω, F, P) or X ∈ (mF)+. Then the conditional expectation E[X | G] exists and is
unique (P−a.s.) (i.e. if X̃ and X̂ are both versions of E[X | G], then X̃ = X̂ (P−a.s.)).

Proof. We consider 3 cases: X ∈ L2, X ∈ (mF)+, and X ∈ L1.

Case 1: If X ∈ L2, then there exists a unique Y ∈ L2(Ω, G, P) such that
E[(X − Y) · IG] = 0 ∀ G ∈ G.
This is equivalent to E[IG · Y] = E[X · IG] ∀ G ∈ G, which implies that Y is a version of
E[X | G].

Case 2: If X ∈ (mF)+, then let Xn = min{X, n}.
Note that Xn ∈ L2 and Xn ↗ X as n → ∞ almost surely, so X̂n := E[Xn | G] exists by Case 1.
Now we have 0 ≥ E[X̂n · I{X̂n<0}] = E[Xn · I{X̂n<0}] ≥ 0 by ii) in Definition 9.1.
This implies that E[X̂n · I{X̂n<0}] = 0, which in turn implies that P[X̂n < 0] = 0.
Hence we have 0 ≤ X̂n ≤ X̂n+1.
To prove X̂n ≤ X̂n+1 (P−a.s.), note that Xn+1 − Xn ≥ 0 (P−a.s.), and that E[Xn+1 − Xn | G] =
X̂n+1 − X̂n implies that X̂n+1 − X̂n ≥ 0.
Hence X̂ := lim_{n→∞} X̂n exists and X̂ ∈ mG.
We then have E[X̂ · IG] = E[X · IG] ∀ G ∈ G by the Monotone Convergence Theorem (Theorem 4.1),
since Xn ↗ X.

Case 3: If X ∈ L1, write X = X+ − X− with X+, X− ∈ (mF)+ ∩ L1 and set X̂ := E[X+ | G] − E[X− | G];
by Case 2 and linearity, X̂ satisfies i) and ii) of Definition 9.1.
Remarks:

1. X ∈ L1 ⇒ X̂ ∈ L1(Ω, G, P).

Proof. We can write X̂ = X̂+ − X̂−, with X̂+ = max{0, X̂} and X̂− = max{0, −X̂}.
We need to show that X̂+, X̂− ∈ L1(Ω, G, P).
Note that X̂+ = X̂ · I{X̂≥0}, where {X̂ ≥ 0} ∈ G.
Then E[I{X̂≥0} · X̂] = E[X · I{X̂≥0}] ∈ R (finite, since |X · I{X̂≥0}| ≤ |X| and X ∈ L1).
A similar argument implies E[X̂−] ∈ R.

2. If X ∈ (mF)+, then X̂ ≥ 0 (P−a.s.).

Proof. Take {X̂ < 0} ∈ G and note that:

    0 ≤ E[X · I{X̂<0}] = E[X̂ · I{X̂<0}] ≤ 0

This implies that P[X̂ < 0] = 0, i.e. X̂ ≥ 0 (P−a.s.).

3. If X̃ ∈ mG satisfies E[X̃ · IG] = E[X · IG] ∀ G ∈ G, then X̃ = X̂ (P−a.s.).

Proof. To prove this, note the following (using ii) for both X̃ and X̂):

    E[(X̃ − X̂) · I{X̃>X̂}] = E[X · I{X̃>X̂}] − E[X · I{X̃>X̂}] = 0                  (9.1)

This implies that P[X̃ > X̂] = 0.
(9.1) holds if X ∈ L1(Ω, F, P).
Similarly, one can show that P[X̂ > X̃] = 0.
Therefore the statement follows if X ∈ L1(Ω, F, P).
However, if X ≥ 0 and E[X] = ∞, then an approximation argument and (9.1) yield
the statement.
Exercise.
Hint: apply (9.1) to Xn = min{X, n}.
Theorem 9.2
Let (Ω, F, P) be our probability space, and let X, Y ∈ mF. Take G, H sub−σ−algebras in
F. Then:
a) If X ∈ mG and either X ∈ L1 or X ∈ (mG)+, then E[X | G] = X (P−a.s.).

b) If X, Y ∈ L1(Ω, F, P) and a, b ∈ R, then E[aX + bY | G] = aE[X | G] + bE[Y | G].

c) If X ∈ L1(Ω, F, P) or X ∈ (mF)+, then E[E[X | G]] = E[X].

d) If X ∈ mG and either X, Y ∈ L2(Ω, F, P) or X, Y ∈ (mF)+, then E[XY | G] =
X E[Y | G] (P−a.s.).

e) If X ∈ L1(Ω, F, P) or X ∈ (mF)+, and H is independent of σ(X), then E[X | H] =
E[X] (P−a.s.).

f) (Tower Property): Let H ⊆ G and X ∈ L1 or X ∈ (mF)+. Then:

    E[E[X | G] | H] = E[X | H]

g) If X ≥ 0, then E[X | G] ≥ 0 (P−a.s.).

h) (Jensen's Inequality): If φ : R → R is convex and φ(X), X ∈ L1(Ω, F, P), then:

    E[φ(X) | G] ≥ φ(E[X | G]) (P−a.s.)

i) Let f : R² → R be B(R²)-measurable, let X ∈ mG and Y be independent of G, and suppose
f(X, Y) ∈ L1(Ω, F, P). Then g(x) := E[f(x, Y)] (for fixed x ∈ R) defines a Borel measurable
map g : R → R which satisfies E[f(X, Y) | G] = g(X) (P−a.s.).
Proof. For some parts:

c) This is clear from the definition: E[IG · E[X | G]] = E[IG · X] ∀ G ∈ G, so take G = Ω.

d) If X = IG, G ∈ G, then statement d) follows from the definition of conditional
expectation, since E[IG · Y | G] = IG · E[Y | G]:
for any A ∈ G, we need to see that E[IA · IG · E[Y | G]] = E[IA IG · Y].
However, this holds since E[I_{A∩G} · E[Y | G]] = E[I_{A∩G} · Y] ∀ A ∈ G and IA IG = I_{A∩G}.
The general statement d) then holds by approximating X ∈ mG by simple functions and proving
E[IA · X · E[Y | G]] = E[IA · XY] ∀ A ∈ G.

e) We need to prove that ∀ H ∈ H we have E[IH · E[X]] = E[IH · X].
But E[IH · X] = E[IH] · E[X] = E[IH · E[X]] since IH and X are independent.

f) Pick H ∈ H, write X̂ = E[E[X | G] | H], and note that:

    E[IH · X̂] = E[E[IH E[X | G] | H]]         by d)
              = E[E[E[IH · X | G] | H]]        by d), and the fact that H ∈ H ⊆ G
              = E[IH · X]                      by applying c) twice

This implies the Tower Property.
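On a finite probability space, conditioning on a sub-σ-algebra generated by a partition is just probability-weighted averaging over its blocks, so properties such as the Tower Property can be checked directly. The sketch below is an added illustration with an arbitrary random variable and two nested partitions (H coarser than G).

```python
# Sketch: Tower Property E[E[X|G]|H] = E[X|H] on a finite space,
# where G and H are generated by partitions and H is coarser than G.
import numpy as np

p = np.array([0.1, 0.2, 0.15, 0.05, 0.3, 0.2])     # P on Omega = {0, ..., 5}
X = np.array([1.0, -2.0, 0.5, 4.0, 3.0, -1.0])     # an arbitrary random variable

G_labels = np.array([0, 0, 1, 1, 2, 2])            # partition generating G
H_labels = np.array([0, 0, 0, 0, 1, 1])            # coarser partition generating H

def cond_exp(values, labels, probs):
    """E[values | sigma(labels)]: probability-weighted average over each block."""
    out = np.empty_like(values)
    for lab in np.unique(labels):
        block = (labels == lab)
        out[block] = np.sum(values[block] * probs[block]) / np.sum(probs[block])
    return out

lhs = cond_exp(cond_exp(X, G_labels, p), H_labels, p)   # E[ E[X | G] | H ]
rhs = cond_exp(X, H_labels, p)                          # E[X | H]
print(lhs, rhs, np.allclose(lhs, rhs))                  # the two versions agree
```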
10 Filtrations, martingales, and stopping times
Here, our time index set is T ∈ {N, Z+}, where Z+ = N ∪ {0}.
.
Definition 10.1 – Filtration
Let (Ω, F, P) be a probability space. A filtration indexed by T is a non-decreasing sequence
of σ−algebras (Ft)t∈T on (Ω, F, P) i.e. we have Fs ⊆ Ft ⊆ F ∀ s, t ∈ T s.t. s ≤ t.
Definition 10.2 – (Stochastic) Process
A process (Xt)t∈T = X is a collection of random variables on (Ω, F, P).
Definition 10.3 – Adapted
The process X = (Xt)t∈T is adapted to the filtration (Ft)t∈T if Xt ∈ mFt.
Definition 10.4 – Filtered probability space
(Ω, F, P) with filtration (Ft)t∈T is called a filtered probability space (Ω, F, (Ft)t∈T, P).
Definition 10.5 – Martingale
A process M = (Mt)t∈T is a martingale on a filtered probability space (Ω, F, (Ft)t∈T, P) if:
a) M is adapted to (Ft)t∈T, Mt ∈ mFt.
b) Mt ∈ L1
(Ω, F, P) ∀ t ∈ T, i.e. E [|Mt|] < ∞ ∀ t.
c) For any s ≤ t, s, t ∈ T, we have E [Mt | Fs] = Ms (P−a.s.).
Definition 10.6 – Submartingale
M = (Mt)t∈T is a submartingale if a) and b) hold in Definition 10.5 and
E[Mt | Fs] ≥ Ms (P−a.s.) for all s ≤ t.

Definition 10.7 – Supermartingale
M = (Mt)t∈T is a supermartingale if a) and b) hold in Definition 10.5 and
E[Mt | Fs] ≤ Ms (P−a.s.) for all s ≤ t.
Remarks:
1. Note that c) in Definition 10.5 is equivalent to E [(Mn+1 − Mn) · IA] = 0 ∀ A ∈ Fn
and all n ∈ T.
Proof. The condition E[(Mn+1 − Mn) · IA] = 0 ∀ A ∈ Fn is exactly E[Mn+1 | Fn] = Mn (P−a.s.),
so we need to show that this one-step property implies E[Mn+k | Fn] = Mn (P−a.s.) for every k.
For example:

    E[Mn+2 | Fn] = E[E[Mn+2 | Fn+1] | Fn]   by the Tower Property
                 = E[Mn+1 | Fn]
                 = Mn

and the general case follows by induction on k.
Exercise: Show that in Definition 10.7, E[Mt | Fs] ≤ Ms (P−a.s.) is equivalent to
E[(Mn+1 − Mn) · IA] ≤ 0 ∀ A ∈ Fn ∀ n ∈ T.
Example 10.1
1. Let X ∈ L1(Ω, F, P) on a filtered probability space (Ω, F, (Ft)t∈T, P). Then Mt = E[X | Ft]
is a martingale; c) from Definition 10.5 follows from the Tower Property of conditional
expectation.
2. Let Xi, i ∈ N, be iid random variables such that P[Xi = 1] = p, P[Xi = −1] = 1 − p,
with p ∈ (0, 1). Define Mk = Σ_{i=1}^k Xi. Claim: M = (Mk)k∈N is a supermartingale if and
only if p ≤ 1/2, and a submartingale if and only if p ≥ 1/2. Hence M is a martingale if and
only if p = 1/2.
Proof. Here, we use Fk = σ(X1, . . . , Xk), and show that the properties in the definition
of a martingale are satisfied.
a) M is adapted to (Fk).
Mk ∈ mFk holds; moreover Fk = σ(M1, . . . , Mk), since there exists an invertible matrix A ∈ R^{k×k}
such that

    (X1, X2, . . . , Xk)ᵀ = A (M1, M2, . . . , Mk)ᵀ   and   A^{−1} (X1, X2, . . . , Xk)ᵀ = (M1, M2, . . . , Mk)ᵀ,

with

    A^{−1} = [ 1 0 . . . 0
               1 1 . . . 0
               . . .
               1 1 . . . 1 ]

(the lower-triangular matrix of ones, since Mj = X1 + . . . + Xj).
So there is a bijection between these two vectors, which implies Fk = σ(M1, . . . , Mk).
b) Mk ∈ L1 ∀ k ∈ N.
This holds since Xi, i = 1, . . . , k, are in L1: indeed Xi = ±1, so |Mk| ≤ k < ∞, i.e. Mk is
bounded.
c) We have:

    E[Mk+1 | Fk] = E[Xk+1 + Mk | Fk] = Mk + E[Xk+1]   as Mk ∈ mFk and Xk+1 is independent of Fk
                 = Mk + p·(1) + (1 − p)·(−1)
                 = Mk + 2p − 1

So for p ∈ (0, 1/2] we have a supermartingale, and for p ∈ [1/2, 1) we have a submartingale.
This proves the equivalences above.
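A quick simulation sketch (added for illustration; the values of p and the sample size are arbitrary): the conditional increment E[Mk+1 − Mk | Fk] = 2p − 1 does not depend on the past, so the sample mean of the increments estimates 2p − 1, and the drift vanishes exactly at p = 1/2.

```python
# Sketch: increments of the +/-1 walk have (conditional) mean 2p - 1.
import numpy as np

rng = np.random.default_rng(3)
for p in (0.3, 0.5, 0.7):
    steps = rng.choice([1.0, -1.0], size=1_000_000, p=[p, 1 - p])
    print(p, steps.mean(), 2 * p - 1)   # sample mean of increments vs 2p - 1
```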
Definition 10.8 – Stopping time
Let (Ω, F, (Ft)t∈T, P) be a filtered probability space. A random variable τ : Ω → T ∪ {∞}
is a stopping time relative to the filtration (Ft)t∈T if {τ ≤ t} ∈ Ft ∀ t ∈ T.
Remark:
In the case T = Z+, τ is a stopping time if and only if {τ = t} ∈ Ft ∀ t ∈ Z+ (since
{τ = t} = {τ ≤ t} \ {τ ≤ t − 1} and {τ ≤ t} = ∪_{k=0}^{t} {τ = k}).
Example 10.2
1. Let M = (Mk)k∈N, Mk = Σ_{i=1}^k Xi as before. Then τa = inf{t ∈ N : Mt = a}, a ∈ Z, is a
stopping time: {τa ≤ t} = ∪_{k=1}^{t} Mk^{−1}({a}), where each Mk^{−1}({a}) ∈ Fk ⊆ Ft for k ≤ t,
hence {τa ≤ t} ∈ Ft.

2. Every constant time t ∈ T is a stopping time.
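The following sketch (an added illustration with arbitrary level a, horizon and number of runs) simulates the hitting time τa for the symmetric walk and estimates P[τa ≤ t]; note that whether τa ≤ t can be decided from the path up to time t, which is exactly the requirement {τa ≤ t} ∈ Ft.

```python
# Sketch: simulate the hitting time tau_a = inf{t : M_t = a} for the symmetric walk.
import numpy as np

rng = np.random.default_rng(4)
a, t_max, reps = 3, 200, 20_000
hit_count = 0
for _ in range(reps):
    m = np.cumsum(rng.choice([1, -1], size=t_max))   # path M_1, ..., M_{t_max}
    if np.any(m == a):                               # tau_a <= t_max on this path
        hit_count += 1
print("estimate of P[tau_a <= 200]:", hit_count / reps)
```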
Example 10.3
Suppose we have M0 = 0, Mk =
k
i=1
Xi, with Xi iid Bernoulli random variables, P[Xi =
1] = p, P[Xi = −1] = 1 − p, p ∈ (0, 1).
We have H1 = 1, Hk = 2k−1
I{Xi=−1:i=1,...,k−1}.
Note that Hk ∈ mFk−1.
Let Nk =
k
i=1
Hi (Mi − Mi−1)
Xi
be the gains process (N = (Nk)k∈N).
Note Nk =∈ mFk, Fk = σ(X1, . . . , Xk).
Also, Nk ∈ L1
∀ p and if p ≥ 1
2
, then E [Nk+1 | Fk] = Nk (Exercise: Check this)
Furthermore, N is a supermartingale if and only if p ≤ 1
2
.
Let:
τ = inf {t ∈ N | Mt > Mt−1}
= inf {t ∈ N | Xt = 1, Xi = −1 ∀ i = 1, . . . , t − 1}
Exercise: Show τ is a stopping time, i.e. show P[τ = n] = p(1 − p)n−1
n ∈ N.
Note that:
Nk =
k
i=1
Hi(Mi − Mi−1) =
1 − 2k
Xi = −1, i = 1, 2, . . . , k
1 ∃ i = {1, . . . , k} s.t. Xi = 1
which implies that Nτ = 1 (P−a.s.).
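A simulation sketch of this doubling strategy (added for illustration; p and the number of runs are arbitrary): on every run the stopped gains equal Nτ = 1, even though the gains just before τ, namely 1 − 2^{τ−1}, can be very negative.

```python
# Sketch: the doubling strategy always stops with gains N_tau = 1,
# at the cost of potentially very large losses before tau.
import numpy as np

rng = np.random.default_rng(5)
p = 0.5
taus, pre_tau_gains = [], []
for _ in range(10_000):
    n, stake, gains = 0, 1, 0
    while True:
        n += 1
        if rng.random() < p:          # X_n = +1: win the current stake, stop
            gains += stake
            break
        gains -= stake                # X_n = -1: lose the stake and double it
        stake *= 2
    taus.append(n)
    pre_tau_gains.append(1 - 2 ** (n - 1))   # gains just before the winning toss
print("N_tau on the last run:", gains)                     # always equals 1
print("mean tau:", np.mean(taus), "worst pre-tau gains:", min(pre_tau_gains))
```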
52

More Related Content

What's hot

Proofs by contraposition
Proofs by contrapositionProofs by contraposition
Proofs by contrapositionAbdur Rehman
 
Linear transformation.ppt
Linear transformation.pptLinear transformation.ppt
Linear transformation.pptRaj Parekh
 
21 monotone sequences x
21 monotone sequences x21 monotone sequences x
21 monotone sequences xmath266
 
Asymptotic Notations
Asymptotic NotationsAsymptotic Notations
Asymptotic NotationsRishabh Soni
 
CMSC 56 | Lecture 14: Representing Relations
CMSC 56 | Lecture 14: Representing RelationsCMSC 56 | Lecture 14: Representing Relations
CMSC 56 | Lecture 14: Representing Relationsallyn joy calcaben
 
Differential equations
Differential equationsDifferential equations
Differential equationsSeyid Kadher
 
Recurrence relations
Recurrence relationsRecurrence relations
Recurrence relationsIIUM
 
Euler and improved euler method
Euler and improved euler methodEuler and improved euler method
Euler and improved euler methodSohaib Butt
 
Unit 1: Topological spaces (its definition and definition of open sets)
Unit 1:  Topological spaces (its definition and definition of open sets)Unit 1:  Topological spaces (its definition and definition of open sets)
Unit 1: Topological spaces (its definition and definition of open sets)nasserfuzt
 
Tensor 1
Tensor  1Tensor  1
Tensor 1BAIJU V
 
Presentation on Numerical Integration
Presentation on Numerical IntegrationPresentation on Numerical Integration
Presentation on Numerical IntegrationTausif Shahanshah
 
Ordinary Differential Equations And Their Application: Modeling: Free Oscilla...
Ordinary Differential Equations And Their Application: Modeling: Free Oscilla...Ordinary Differential Equations And Their Application: Modeling: Free Oscilla...
Ordinary Differential Equations And Their Application: Modeling: Free Oscilla...jani parth
 

What's hot (20)

Proofs by contraposition
Proofs by contrapositionProofs by contraposition
Proofs by contraposition
 
Linear transformation.ppt
Linear transformation.pptLinear transformation.ppt
Linear transformation.ppt
 
21 monotone sequences x
21 monotone sequences x21 monotone sequences x
21 monotone sequences x
 
Time complexity
Time complexityTime complexity
Time complexity
 
The integral
The integralThe integral
The integral
 
Asymptotic Notations
Asymptotic NotationsAsymptotic Notations
Asymptotic Notations
 
Proof by contradiction
Proof by contradictionProof by contradiction
Proof by contradiction
 
CMSC 56 | Lecture 14: Representing Relations
CMSC 56 | Lecture 14: Representing RelationsCMSC 56 | Lecture 14: Representing Relations
CMSC 56 | Lecture 14: Representing Relations
 
Primality
PrimalityPrimality
Primality
 
Differential equations
Differential equationsDifferential equations
Differential equations
 
Presentation binomial theorem
Presentation binomial theoremPresentation binomial theorem
Presentation binomial theorem
 
Recurrence relations
Recurrence relationsRecurrence relations
Recurrence relations
 
Euler and improved euler method
Euler and improved euler methodEuler and improved euler method
Euler and improved euler method
 
Lasso regression
Lasso regressionLasso regression
Lasso regression
 
Unit 1: Topological spaces (its definition and definition of open sets)
Unit 1:  Topological spaces (its definition and definition of open sets)Unit 1:  Topological spaces (its definition and definition of open sets)
Unit 1: Topological spaces (its definition and definition of open sets)
 
Tensor 1
Tensor  1Tensor  1
Tensor 1
 
Presentation on Numerical Integration
Presentation on Numerical IntegrationPresentation on Numerical Integration
Presentation on Numerical Integration
 
Ordinary Differential Equations And Their Application: Modeling: Free Oscilla...
Ordinary Differential Equations And Their Application: Modeling: Free Oscilla...Ordinary Differential Equations And Their Application: Modeling: Free Oscilla...
Ordinary Differential Equations And Their Application: Modeling: Free Oscilla...
 
Supremum And Infimum
Supremum And InfimumSupremum And Infimum
Supremum And Infimum
 
Discrete Math Lecture 03: Methods of Proof
Discrete Math Lecture 03: Methods of ProofDiscrete Math Lecture 03: Methods of Proof
Discrete Math Lecture 03: Methods of Proof
 

Viewers also liked

Sampling, Statistics and Sample Size
Sampling, Statistics and Sample SizeSampling, Statistics and Sample Size
Sampling, Statistics and Sample Sizeclearsateam
 
Intro probability 4
Intro probability 4Intro probability 4
Intro probability 4Phong Vo
 
law of large number and central limit theorem
 law of large number and central limit theorem law of large number and central limit theorem
law of large number and central limit theoremlovemucheca
 
Lecture slides stats1.13.l09.air
Lecture slides stats1.13.l09.airLecture slides stats1.13.l09.air
Lecture slides stats1.13.l09.airatutor_te
 
Probability Theory and Mathematical Statistics
Probability Theory and Mathematical StatisticsProbability Theory and Mathematical Statistics
Probability Theory and Mathematical Statisticsmetamath
 
Sampling distribution
Sampling distributionSampling distribution
Sampling distributionDanu Saputra
 

Viewers also liked (6)

Sampling, Statistics and Sample Size
Sampling, Statistics and Sample SizeSampling, Statistics and Sample Size
Sampling, Statistics and Sample Size
 
Intro probability 4
Intro probability 4Intro probability 4
Intro probability 4
 
law of large number and central limit theorem
 law of large number and central limit theorem law of large number and central limit theorem
law of large number and central limit theorem
 
Lecture slides stats1.13.l09.air
Lecture slides stats1.13.l09.airLecture slides stats1.13.l09.air
Lecture slides stats1.13.l09.air
 
Probability Theory and Mathematical Statistics
Probability Theory and Mathematical StatisticsProbability Theory and Mathematical Statistics
Probability Theory and Mathematical Statistics
 
Sampling distribution
Sampling distributionSampling distribution
Sampling distribution
 

Similar to Probability theory

Ch1 sets and_logic(1)
Ch1 sets and_logic(1)Ch1 sets and_logic(1)
Ch1 sets and_logic(1)Kwonpyo Ko
 
Problems and solutions inmo-2012
Problems and solutions  inmo-2012Problems and solutions  inmo-2012
Problems and solutions inmo-2012askiitians
 
On Some Geometrical Properties of Proximal Sets and Existence of Best Proximi...
On Some Geometrical Properties of Proximal Sets and Existence of Best Proximi...On Some Geometrical Properties of Proximal Sets and Existence of Best Proximi...
On Some Geometrical Properties of Proximal Sets and Existence of Best Proximi...BRNSS Publication Hub
 
Recursive Definitions in Discrete Mathmatcs.pptx
Recursive Definitions in Discrete Mathmatcs.pptxRecursive Definitions in Discrete Mathmatcs.pptx
Recursive Definitions in Discrete Mathmatcs.pptxgbikorno
 
Mid semexam | Theory of Computation | Akash Anand | MTH 401A | IIT Kanpur
Mid semexam | Theory of Computation | Akash Anand | MTH 401A | IIT KanpurMid semexam | Theory of Computation | Akash Anand | MTH 401A | IIT Kanpur
Mid semexam | Theory of Computation | Akash Anand | MTH 401A | IIT KanpurVivekananda Samiti
 
schaums-probability.pdf
schaums-probability.pdfschaums-probability.pdf
schaums-probability.pdfSahat Hutajulu
 
Andrei rusu-2013-amaa-workshop
Andrei rusu-2013-amaa-workshopAndrei rusu-2013-amaa-workshop
Andrei rusu-2013-amaa-workshopAndries Rusu
 
Residual Quotient and Annihilator of Intuitionistic Fuzzy Sets of Ring and Mo...
Residual Quotient and Annihilator of Intuitionistic Fuzzy Sets of Ring and Mo...Residual Quotient and Annihilator of Intuitionistic Fuzzy Sets of Ring and Mo...
Residual Quotient and Annihilator of Intuitionistic Fuzzy Sets of Ring and Mo...AIRCC Publishing Corporation
 
Discrete Mathematics and Its Applications 7th Edition Rose Solutions Manual
Discrete Mathematics and Its Applications 7th Edition Rose Solutions ManualDiscrete Mathematics and Its Applications 7th Edition Rose Solutions Manual
Discrete Mathematics and Its Applications 7th Edition Rose Solutions ManualTallulahTallulah
 
Assignments for class XII
Assignments for class XIIAssignments for class XII
Assignments for class XIIindu thakur
 
20200911-XI-Maths-Sets-2 of 2-Ppt.pdf
20200911-XI-Maths-Sets-2 of 2-Ppt.pdf20200911-XI-Maths-Sets-2 of 2-Ppt.pdf
20200911-XI-Maths-Sets-2 of 2-Ppt.pdfMridulDhamija
 

Similar to Probability theory (20)

Ch1 sets and_logic(1)
Ch1 sets and_logic(1)Ch1 sets and_logic(1)
Ch1 sets and_logic(1)
 
Math
MathMath
Math
 
ch3.ppt
ch3.pptch3.ppt
ch3.ppt
 
Problems and solutions inmo-2012
Problems and solutions  inmo-2012Problems and solutions  inmo-2012
Problems and solutions inmo-2012
 
Lemh1a1
Lemh1a1Lemh1a1
Lemh1a1
 
Lemh1a1
Lemh1a1Lemh1a1
Lemh1a1
 
7_AJMS_246_20.pdf
7_AJMS_246_20.pdf7_AJMS_246_20.pdf
7_AJMS_246_20.pdf
 
On Some Geometrical Properties of Proximal Sets and Existence of Best Proximi...
On Some Geometrical Properties of Proximal Sets and Existence of Best Proximi...On Some Geometrical Properties of Proximal Sets and Existence of Best Proximi...
On Some Geometrical Properties of Proximal Sets and Existence of Best Proximi...
 
Recursive Definitions in Discrete Mathmatcs.pptx
Recursive Definitions in Discrete Mathmatcs.pptxRecursive Definitions in Discrete Mathmatcs.pptx
Recursive Definitions in Discrete Mathmatcs.pptx
 
Number theory
Number theoryNumber theory
Number theory
 
Mid semexam | Theory of Computation | Akash Anand | MTH 401A | IIT Kanpur
Mid semexam | Theory of Computation | Akash Anand | MTH 401A | IIT KanpurMid semexam | Theory of Computation | Akash Anand | MTH 401A | IIT Kanpur
Mid semexam | Theory of Computation | Akash Anand | MTH 401A | IIT Kanpur
 
schaums-probability.pdf
schaums-probability.pdfschaums-probability.pdf
schaums-probability.pdf
 
Andrei rusu-2013-amaa-workshop
Andrei rusu-2013-amaa-workshopAndrei rusu-2013-amaa-workshop
Andrei rusu-2013-amaa-workshop
 
Residual Quotient and Annihilator of Intuitionistic Fuzzy Sets of Ring and Mo...
Residual Quotient and Annihilator of Intuitionistic Fuzzy Sets of Ring and Mo...Residual Quotient and Annihilator of Intuitionistic Fuzzy Sets of Ring and Mo...
Residual Quotient and Annihilator of Intuitionistic Fuzzy Sets of Ring and Mo...
 
Discrete Mathematics and Its Applications 7th Edition Rose Solutions Manual
Discrete Mathematics and Its Applications 7th Edition Rose Solutions ManualDiscrete Mathematics and Its Applications 7th Edition Rose Solutions Manual
Discrete Mathematics and Its Applications 7th Edition Rose Solutions Manual
 
SETS
SETSSETS
SETS
 
SET THEORY
SET THEORYSET THEORY
SET THEORY
 
Assignments for class XII
Assignments for class XIIAssignments for class XII
Assignments for class XII
 
Chpt 2-sets v.3
Chpt 2-sets v.3Chpt 2-sets v.3
Chpt 2-sets v.3
 
20200911-XI-Maths-Sets-2 of 2-Ppt.pdf
20200911-XI-Maths-Sets-2 of 2-Ppt.pdf20200911-XI-Maths-Sets-2 of 2-Ppt.pdf
20200911-XI-Maths-Sets-2 of 2-Ppt.pdf
 

Recently uploaded

Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Recently uploaded (20)

Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Probability theory

  • 2. Contents 0 Measures 3 1 Axiomatic Probability Theory 10 2 Independence 12 3 Tail σ−algebra and Kolmogorov’s 0 − 1 law 16 4 Integration 24 5 Expectations 27 6 Inequalities 29 7 Convergence of Random Variables 35 8 Characteristic Functions and the Central Limit Theorem 43 9 Conditional Expectation & Martingales 46 10 Filtrations, martingales and stopping times 50 Notes on the First Edition Thanks to Pierre Tai and Nico Prokop for pointing out the many typos within. Keegan Kang Notes on the Second Edition These notes were written for the 2010-2011 course, so might not be directly relevant to our course. There have been changes made since the first edition, but these are almost exclusively cosmetic. Iain Carson 2
  • 3. 0 Measures Definition 0.1 – σ-algebra F is a σ− algebra if it satisfies the following properties: • Ω ∈ F • if A ∈ F, then AC ∈ F • if {Ai}∞ i=0 ∈ F, then ∞ i=0 ∈ F If we have F, then (Ω, F) is a measurable space. Example 0.1 – Examples of σ−algebras on a set Ω smallest σ−algebra (∅, Ω) largest σ−algebra power set 2Ω It is also possible to generate other σ−algebras on Ω. Take a subset A of Ω, i.e. A ⊆ Ω. We know {A} ∈ 2Ω . We look at σ({A}), which is the smallest σ−algebra generated by A. σ({A}) = F⊇{A} F this is non-empty because {A} ⊆ 2Ω = {∅, Ω, A, AC } (0.1) To say that (0.1) is the smallest σ−algebra generated by A, we need to check that: ˆ (0.1) fulfills the axioms of a σ−algebra ˆ (0.1) is contained in every σ−algebra containing A which are trivial. Therefore, we can take any arbitrary collection C of subsets where C ⊆ 2Ω to generate σ−algebras, and σ(C) = F⊇C F, where F are σ−algebras. Definition 0.2 – Borel σ−algebra (for R) The Borel σ−algebra is the smallest σ−algebra containing all open sets in the topological space. So when we get Ω = R, then B(R) is the smallest σ−algebra generated by open intervals in R. B(R) = σ (J : J open interval in R) = σ((−∞, x], x ∈ Q) Consider σ({m} : m ∈ Q). Is σ({m} : m ∈ Q) = B(R)? No, it is not. Proof. We know that intervals (sets) in B(R) are uncountable (and their complements are uncountable as well). So if we can show that the sets in σ({m} : m ∈ Q) are countable, or the complements of the sets are countable, then σ({m} : m ∈ Q) = B(R). 3
  • 4. But the sets of all rational points are countable. We can construct a bijection from the set of all rationals to a subset of N. Define f : Q → N as follows. • For each q ∈ Q+ , write q = m n where m, n ∈ Z, m, n > 0, hcf(m, n) = 1. • For each q ∈ Q− , write q = m n where m, n ∈ Z, m < 0, n > 0, hcf(|m|, n) = 1. Write: f(q) =    2m 3n q > 0 5|m| 7n q < 0 1 q = 0 This is an injection from Q to a subset of N, so there exists a bijection between Q and a particular subset of N, hence Q is countable. Therefore σ({m} : m ∈ Q) = B(R). We also want to show that B(R) = σ((−∞, x], x ∈ Q). Proof. To show that B(R) = σ((−∞, x], x ∈ Q), we need to show that: F = σ((−∞, x], x ∈ Q) ⊆ B(R) (†) F = σ((−∞, x], x ∈ Q) ⊇ B(R) (††) (†) It is enough to show that (−∞, x] ∈ B(R) ∀ x ∈ Q. This is true since (−∞, x]C = (x, ∞) ∈ B(R) ⇒ F ∈ B(R), as B(R) is closed under complements. (††) It is enough to show that F contains J ∀ J = (a, b) ⊆ R because B(R) is the smallest σ−algebra containing all open intervals. Then any σ−algebra containing all open intervals contains B(R). (a, b) ∈ F ⇔ R (a, b) ∈ F ⇔ (−∞, a] [b, ∞) ♥ ∈ F We just need to show that and ♥ ∈ F. It is obvious that ∈ F if a ∈ Q. Otherwise, we construct a decreasing sequence of rationals ai which tends to a, i.e. ai a, and write = (−∞, a] = i∈N (−∞, ai] ∈ F. Hence (−∞, a] ∈ F for a ∈ R. We now consider ♥. We have ♥ = i −∞, b − 1 i ∈ F if b ∈ Q. Otherwise, we construct a decreasing sequence of rationals bi which tends to b, i.e. bi b, and write: ♥ = [b, ∞) = n i −∞, bi − 1 n ∈ F 4
  • 5. Hence [b, ∞) ∈ F for all b ∈ R. We thus have shown that (a, b) ∈ F and (††) holds. Hence B(R) = σ((−∞, x], x ∈ Q). There is a fundamental question: When are two measures equal? Let (Ω, F) be a measurable space and let µ, ν be two measures on (Ω, F). When does the equality hold? i.e. When does µ(F) = ν(F) ∀ F ∈ F? Definition 0.3 – d-system Let Ω be a set. A collection of subsets D ⊆ 2Ω is a d-system if: i) Ω ∈ D ii) If A ⊆ B and A, B ∈ D, then B − A ∈ D iii) If Am ∈ D ∀ m ∈ N and Am ⊆ Am+1 ∀ m ∈ N, then m∈N Am ∈ D Remarks: i) In literature, the d−system is also called a Dynkin system, or a λ−system. ii) Every σ−algebra is a d−system. iii) If µ, ν are finite measures on (Ω, F), such that µ(Ω) = ν(Ω), then D = {F ∈ F | µ(F) = ν(F)} iii) is a d−system. iv) For any collection I ⊆ 2Ω , the smallest d−system d(I) that contains I is given by iv) d(I) = d systems D⊇I D Proof. Proof of ii) Axiom i) follows by definition. Axiom ii) is satisfied, since B − A = (B ∩ A ) and is thus in the σ−algebra. Axiom iii) is satisfied - let A1 ∪ . . . ∪ An = Bn ∀ n ∈ N. Then Bm ⊆ Bm+1 ∀ m ∈ N, and Bm = m∈N Am ∈ D. Proof of iii) Axiom i) follows since Ω ∈ F by definition of σ−algebra. To prove Axiom ii), we need to show that A ⊆ B, A, B ∈ D ⇒ µ(B − A) = ν(B − A). Rewrite µ(B − A) as µ(B) − µ(A) and similarly ν(B − A) as ν(B) − ν(A). We can do so since ˆ these are finite measures and hence µ(A), µ(B) are finite ˆ A ⊆ B therefore µ(B − A) = µ(B) − µ(A) and ν(B − A) = ν(B) − ν(A) But then we know µ(A) = ν(A) and µ(B) = ν(B). So Axiom ii) holds. To prove Axiom iii), we need to show that: Am ∈ D ∀ m ∈ N and Am ⊆ Am+1 ∀ m ∈ N ⇒ µ m∈N Am = ν m∈N Am 5
  • 6. By continuity of measures, we can write: µ m∈N Am = lim m↑∞ µ(Am) ν m∈N Am = lim m↑∞ ν(Am) But then we know lim m↑∞ µ(Am) = lim m↑∞ ν(Am). So Axiom iii) holds. Proof of iv) (to show that d(I) = d systems D⊇I D is a d−system) We need to show that d(I) is non empty, that it is the smallest d−system, and that it satisfies the axioms of a d−system. d(I) is non empty as the set B = {I, I , ∅, Ω} contains I, and B fulfills the axioms of a d−system. Furthermore, all other d−systems containing I must contain B, and hence B is the smallest d−system. To show d(I) satisfies the axioms of a d−system, let Dk be an index of d−systems containing I. Axiom 1: Ω ∈ Dk ∀ k ∈ N ⇒ Ω ∈ Dk Axiom 2: Suppose we have B ∈ Dk ∀ k ∈ N. Then A ⊆ B ⇒ A ∈ Dk ∀ k ∈ N as well. Hence (B − A) ∈ Dk ∀ k ∈ N and thus we have A ⊆ B, A, B ∈ Dk ⇒ (B − A) ∈ Dk. Axiom 3: Suppose we have A1, A2, . . . ∈ Dk ∀ k ∈ N, with Am ⊆ Am+1 ∀ m ∈ N. Then Am ∈ Dk ∀ k ∈ N. Thus we have A1, A2, . . . ∈ Dk, Am ⊆ Am+1 ∀ m ⇒ Am ∈ Dk ∀ k ∈ N. So d(I) satisfies the axioms of a d−system. Definition 0.4 – π−system Let I ⊆ 2Ω . Then I is a π−system if ∀ A, B ∈ I ⇒ A ∩ B ∈ I. Example 0.2 – Examples of π−systems on R. Consider the set R, and take I1 = {(−∞, x] : x ∈ R}. This is a π−system. I2 = {(−∞, x] : x ∈ Q} is a π−system as well. Proof. Suppose we have two sets A and B, with A, B ∈ I1. Without loss of generality, take a ≤ b. Then: A = (−∞, a] B = (−∞, b] and therefore: A ∩ B = (−∞, a] So A ∩ B = A ∈ I1. The same holds for I2. So I1 and I2 are both π−systems. 6
  • 7. The Borel σ−algebra on R is generated by the π−system(I1) (or I2). In other words, B(R) = σ(I). Remark 0.1 A collection C ⊆ 2Ω is a σ−algebra ⇔ C is a d−system and π−system. Proof. (⇒) We have proved that a σ−algebra is a d−system on Page 5. We need to show that a σ−algebra is a π−system as well, and want to show that ∀ A, B ∈ F, A ∩ B ∈ F. Take A, B ∈ F. Then: A, B ∈ F ⇒ A , B ∈ F by Axiom 2 of σ− algebra ⇒ A ∪ B ∈ F by Axiom 3 of σ−algebra ⇒ A ∪ B ∈ F by Axiom 2 of σ−algebra ⇒ A ∩ B ∈ F by De Morgan’s Laws (⇐) We then need to show that the definitions of a π−system and a d−system fulfill the axioms of a σ−algebra. Axiom 1 is satisfied due to axiom 1 of the d−system, i.e. Ω ∈ D, so Ω ∈ C. Axiom 2: Choose A ∈ C. Since A ∪ A = Ω, then A ⊆ Ω, Ω, A ∈ C, B − A = A ∈ C (by axiom 2 of d−system). So this implies that A ∈ C ⇒ A ∈ C. Axiom 3: Take A1, A2 ∈ C. Wish to show A1 ∪ A2 ∈ C. If we can do so, then by induction, A1, A2, . . . ∈ C, Ai ∈ C. We have proven that Axiom 2 of a σ−algebra is satisfied, so A1, A2 ∈ C. By definition of π−system, A1 ∩ A2 ∈ C. Again, by using Axiom 2 of a σ−algebra, we have A1 ∩ A2 = (A1 ∪ A2) ∈ C. Hence by induction, Axiom 3 is satisfied. Therefore if C is both a d−system and a π−system, then C is a σ−algebra. Theorem 0.1 – Monotone Class Theorem for Sets If I ⊆ 2Ω is a π−system, then d(I) = σ(I). In other words, the smallest d−system generated by I coincides with the σ−algebra σ(I) generated by I. Proof. Need to show that: • d(I) ⊆ σ(I) • d(I) ⊇ σ(I) To show (d(I) ⊆ σ(I)): We have proven that every σ−algebra is a d−system. So it follows that: d(I) = d systems D⊇I D ⊆ F⊇I F σ−algebra F = σ(I) 7
  • 8. Hence d(I) ⊆ σ(I). To show (d(I) ⊇ σ(I)): To prove this, we note Remark 0.1 and show that d(I) is a d−system and a π−system. Then d(I) would be a σ−algebra, and d(I) ⊇ σ(I). Define the family of sets: D1 = {B ∈ d(I) | B ∩ C ∈ d(I) ∀ C ∈ I} We wish to show that D1 is a d−system, and is in fact d(I). First note that I ⊆ D1 since B ∈ I ⊆ d(I) ⇒ B ∩C ∈ d(I) ∀ C ∈ I (this is how we defined our D1). Secondly, we show that D1 satisfies the axioms of a d−system. Axiom 1: ∀ C ∈ I, Ω ∩ C = C ∈ d(I), hence Ω ∈ D1. Axiom 2: Consider the equality (B − A) ∩ C = (B ∩ C) − (A ∩ C) which holds for every set A, B, C, given that A ⊆ B. Pick A, B ∈ D1 which satisfies A ⊆ B, and we want to show that B − A ∈ D1. Since A, B ∈ d(I), which is a d−system, then B − A ∈ d(I). It suffices to check if (B − A) ∩ C ∈ d(I) ∀ C ∈ I. Since A ⊆ B, we have (B ∩ C) ⊇ (A ∩ C), and using the above inequality, we have (B ∩ C) − (A ∩ C) ∈ d(I), and therefore (B − A) ∈ D1. Axiom 3: Give Am ∈ D1, such that Am ⊆ Am+1 ∀ m ∈ N, we wish to show that ( Am)∩C ∈ d(I) ∀ c ∈ I. Note that Am ∩ C ∈ d(I) ∀ m ∈ N, so (Am ∩ C) ⊆ (Am+1 ∩ C) ∀ m. Therefore ( Am) ∩ C ∈ d(I) ∀ c ∈ I. Now, we have satisfied the axioms for D1 to be a d−system, and since I ⊆ D1, we can write: I ⊆ D1 ⊆ d(I) ⇒ D1 = d(I) (0.2) since D1 contains I, and d(I) is the smallest d−system containing I. Now consider the family of sets: D2 = {B ∈ d(I) | B ∩ C ∈ d(I) ∀ C ∈ d(I)} This is the set of subsets in Ω which is in D1. We want to show that I ∈ D2: But B ∈ I, C ∈ d(I) ⇒ B ∩ C ∈ d(I) using (0.2). D2 is also a d−system (similar proof to above, and using the same inequality), and therefore, we can also write: I ⊆ D2 ⊆ d(I) ⇒ D2 = d(I) This implies that d(I) is a π−system. We have thus shown that d(I) is both a π−system and a d−system, and therefore a σ−algebra, hence d(I) ⊇ σ(I). Therefore we have shown: d(I) ⊆ σ(I) and d(I) ⊇ σ(I) and hence d(I) = σ(I). 8
  • 9. We can use the Monotone class theorem (Theorem 0.1) to state certain relations between measures. Proposition 0.1 1) Let µ, ν be two measures on a measurable space (Ω, F) such that µ(Ω) = ν(Ω) < ∞. If µ(C) = ν(C) ∀ C ∈ I where I is a π−system in F, then µ and ν coincide on the smallest σ−algebra generated by I, i.e. σ(I). 2) Any two probability measures that agree on a π−system must agree on the σ−algebra generated by this π−system. Proof. (of part 1) We define the set: D = {A ∈ F | µ(A) = ν(A)}. This is a d−system (using same proof for part iii) of remarks on Page 5. Since: µ(C) = ν(C) ∀ C ∈ I ⇒ I ⊆ D, this implies that the σ−algebra generated by I in D, σ(I) = d(I) ⊆ D by the Monotone class theorem (Theorem 0.1). Example 0.3 Let P and P be two probability measures on (R, B(R)). If: P(−∞, x] = P (−∞, x] ∀ x ∈ R (or ∈ Q), then P and P coincides on B(R). This follows by Proposition 0.1 and therefore: B(R) = σ({(−∞, x] | x ∈ R}). Hence a cumulative distribution function of a probability measure P on (R, B(R)): F : R → [0, 1] such that F(x) = P(−∞, x] uniquely defines the measure P. 9
  • 10. 1 Axiomatic Probability Theory We know from previous Probability courses that Ω is our sample space, i.e. all outcomes of a random experiment. Example 1.1 i) Toss a coin twice. Then Ω = {HT, TH, HH, TT}. ii) An infinite sequence of coin tosses. Then Ω = {ω : N → {T, H}}. Here, |Ω| = ∞, and is uncountable. Proof. Proof that Example 1.1 ii) is uncountable. We attempt a proof by contradiction. Suppose that Ω is countable (infinite). Then we can enumerate out all possible ω. But if we find an ω not in the list, we get a contradiction, and hence Ω is uncountable. Note: What Ω = {ω : N → {T, H}} means is simply the set of all ω. So, assume we have enumerated out all our ω, say: ω1 = ω11ω12ω13ω14 . . . ω2 = ω21ω22ω23ω24 . . . ω3 = ω31ω32ω33ω34 . . . ω4 = ω41ω42ω43ω44 . . . ... = ... with ωij = H or T. So, we construct a sequence ωk say, such that ωkk = ωii, where i = 1, 2, 3, . . .. ωk is not in the above list, and therefore Ω is uncountable. Having defined ii) in Example 1.1, we wish to know the probability of the coin landing H (or T) on the ith toss. We thus want: (Ω, F) : F = σ({ω | ω(i) = H}, {ω | ω(i) = T} : i ∈ N) We thus need a probability measure on F. It is possible to embed Ω → [0, 1] where a Lebesgue measure has been constructed such that: P[{ω | ω(i) = H}] = 1 2 , ∀ i ∈ N Definition 1.1 – Random Object / Variable / Vector Given a probability space (Ω, F, P), a random object X in a measurable space (S, Σ) is a measurable function X : (Ω, F) → (S, Σ) i.e. the pre-image X−1 (Σ) ⊆ F. If X ∈ R and X ∈ mB(R), then X is a random variable. If X : Ω → Rn , X ∈ mB(Rn ), then X is a random vector. Recall: X ∈ mB(R) means that X is a measurable function with respect to B(R). 10
  • 11. Example 1.2 – Defining a random variable Recall part ii) of Example 1.1, where we defined Ω = {ω : N → {H, T}} and our σ−algebra to be F = σ({ω(i) = T}, {ω(i) = H} : i ∈ N). We can define our random variable to be: Xi(ω) = 1 ω(i) = H 0 ω(i) = T Now Xi : Ω → {0, 1}, and it is a random variable. Proof. (that Xi is a random variable) Let F = {Ω, ∅, A, A } where A is the event that Heads occurs at ith toss (therefore A is when tails occurs). So X−1 (1) = A and X−1 (0) = A , which are both in F. We define a random variable Sn = n i=1 Xi, which is the number of heads obtained in n tosses. This is a random variable as well, since the sum of random variables is a random variable. From the notes (lecturer’s Measure Theory notes), it follows that lim n→∞ 1 n Sn = p ∈ F for any p ∈ [0, 1]. If p /∈ [0, 1], then we obviously get ∅. Definition 1.2 – Law of a random variable Given a random variable X on (Ω, F, P), the law of a random variable is the probability measure PX on (R, B(R)) given by: PX [B(R)] = P X−1 (B(R)) It is enough to know PX [(−∞, x)] = P[X ∈ (−∞, x)] ∀ x ∈ R (or Q). 11
  • 12. 2 Independence Definition 2.1 – Independence Let (Ω, F, P) be a probability space and let Gi ∈ F be sub−σ-algebras for i ∈ N. The family of sub−σ−algebras Gi, i ∈ N is independent if for any sequence of events, Gi ∈ Gi, i ∈ N, we have for a family {i1, i2, . . . , ik} ⊆ N of distinct indices: P k j=1 Gij = k j=1 P [Gij] (2.1) Remarks: 1. (2.1) has to hold for all finite subsets {i1, . . . , ik} ⊆ N. 2. If we have a finite family G1, . . . , Gn of sub−σ−algebras, the condition (2.1) collapses to: P n i=1 Gi = n i=1 P [Gi] where Gi ∈ Gi for i = 1, 2, . . . , m. 3. Random variables X and Y are independent if and only if σ(X), σ(Y ) are independent. We may wish to ask what is σ(X). While we know σ(X) = X−1 (B(R)), what does it mean intuitively? σ(X) can be intuitively thought of as “information we can obtain about the outcome of the random experiment by knowing X(ω), but not knowing ω”. For random variables X, Y , we have σ(X) and σ(Y ) independent if and only if: P[X ∈ A, X ∈ B] = P[X ∈ A] · P[X ∈ B] for A, B ∈ B(R) 4. Let E1, E2, . . . be events in F. They are independent if σ(E1), σ(E2) . . . are indepen- dent, where: σ(E1) = {Ω, ∅, E1, E1} σ(E2) = {Ω, ∅, E2, E2} ... = ... Proof. Prove that E1, E2, . . . are independent if and only if: P m j=1 Eij = m j=1 P [Eij] ∀ {i1, i2, . . . , im} (⇐) This follows from Definition 2.1, showing that σ(E1), σ(E2), . . . are independent, but remark 4 shows that this means E1, E2, . . . are independent. (⇒) Exercise 12
  • 13. 5. Pairwise independence does not imply independence. Example 2.1 – Example of above statement Take Ω = {1, 2, 3, 4}, F = 2Ω , A = {1, 2}, B = {1, 3}, C = {2, 3}, and define the prob- ability measure P[ω] = 1 4 for ω ∈ Ω. Note that A, B, A, C and B, C are independent, since: P[A ∩ B] = P[{1}] = 1 4 P[A] · P[B] = 1 2 · 1 2 = 1 4 P[A ∩ C] = P[{2}] = 1 4 P[A] · P[C] = 1 2 · 1 2 = 1 4 P[B ∩ C] = P[{3}] = 1 4 P[B] · P[C] = 1 2 · 1 2 = 1 4 However, A, B, C are not independent, since: P[A ∩ B ∩ C] = 0 = P[A] · P[B] · P[C] = 1 8 Theorem 2.1 See Probability with Martingales, Williams D.W. page 39. Let (Ω, F, P) be a probability space and sub−σ−algebras H, G ⊆ F be generated by π−systems I and J respectively. In other words: σ(I) = H, σ(J ) = G Then H and G are independent if and only if I and J are independent in the sense A ∈ I, B ∈ J ⇒ P[A ∩ B] = P[A] · P[B] Proof. Our goal is to prove: P[H ∩ G] = P[H] · P[G] ∀ H ∈ H, G ∈ G (2.2) Fix A ∈ I and consider the following two measures on F: F → P[F ∩ A] F → P[F] · P[A] These measures have equal mass given by P[A]. By assumption, we have that the two measures coincide on J . Therefore by Proposi- tion 0.1, we have P[F ∩ A] = P[F] · P[A] ∀ F ∈ σ(J ) = G (2.3) To show (2.2) , we define two measures. Fix G ∈ G and let F → P[G ∩ F] F → P[G] · P[F] These two measures coincide on the π−system I by (2.3). Hence as before, the two measure coincide on σ(I) = H. 13
  • 14. Remarks: 1. Let X, Y be random variables on (Ω, F, P). Then: X and Y are independent ⇔ P[X ∈ A, Y ∈ B] = P[X ∈ A] · P[Y ∈ B] ∀ A, B ∈ B(R) (by Theorem 2.1) ⇔ P[X ≤ x, Y ≤ y] = P[X ≤ x] · P[Y ≤ y] ∀ x, y ∈ R We claim this since {(−∞, x] : x ∈ R} is a π−system in B(R) which generates B(R). Hence {X ≤ x : x ∈ R} is a π−system in σ(X) which generates σ(X) since σ(X) = X−1 (B(R)). So the π−systems π(X), π(Y ) are independent implies σ(X), σ(Y ) independent. 2. Similarly, X1, . . . , Xn random variables are independent if and only if P [Xi ≤ xi : 1 ≤ i ≤ n] = n i=1 P [Xi ≤ xi] ∀ xi ∈ R, i = 1, 2, . . . 3. If X is independent of Y and X is independent of Z, then it does not follow that X is independent of (Y, Z). Example 2.2 Let X = IA, Y = IB and Z = IC. Let A = {2, 3}, B = {1, 2}, C = {1, 3} be subsets in Ω = {1, 2, 3, 4}, F = 2Ω . Recall that A, B independent, B, C independent. However X is not independent of (Y, Z). Definition 2.2 – Joint law (of random variables) Let X, Y be random variables on (Ω, F, P) and let B(R2 ) be the Borel σ−algebra on R2 . The joint law of X and Y is given by P(X,Y ) [A] = P [(x, y) ∈ A] ∀ A ∈ B(R2 ) Remarks: 1. Recall that: B(R2 ) = B(R) ⊗ B(R) (2.4) Prove (2.4), i.e. that B(R) ⊗ B(R) = σ (U × V : U, V ∈ B(R)) (2.4) implies that B(R2 ) is generated by the π−system {(−∞, x]×(−∞, y] : x, y ∈ R} 14
  • 15. Proposition 2.1 The following statements are equivalent: a) X and Y are independent b) PX,Y = PX ⊗ PY c) Define FXY (x, y) = P[X ≤ x, Y ≤ y] ∀ (x, y) ∈ R2 . Then FXY (xy) = FX(x)FY (y). Furthermore, if (x, y) has a density, i.e. there exists fXY :R2 → [0, ∞) such that: PX,Y [A] = A fXY (xy) dx ⊗ dy then statements a), b), c) are further equivalent to: d) fXY (xy) = fX(x)fY (y) ∀ x, y ∈ R2 where fX, fY are the densities of X and Y respec- tively. Remark: Note in d), the existence of fXY implies the existence of the densities of the factors X and Y and: fX(x) = R fXY (x, y) dy fY (y) = R fXY (x, y) dx Proof. Based on B(R2 ) = B(R) ⊗ B(R). This implies that B(R2 ) is generated by the π−system {(−∞, x] × (−∞, y] : x, y ∈ R}. Apply Theorem 2.1. 15
  • 16. 3 Tail σ−algebra and Kolmogorov’s 0 − 1 law Definition 3.1 – Tail σ−algebra Let {Fn : n ∈ N} be a collection of σ−algebras. A tail σ−algebra T is given by: T = n∈N Tn where Tn = σ (Fn, Fn+1, . . .) = σ k≥n Fk . Remarks: 1. T is a σ−algebra that depends on the tail events of a sequence of experiments where the outcome of the nth experiment is given by the σ−algebra Fn. 2. Note that the tail σ−algebra depends on the choice of {Fn : n ∈ N}. Example 3.1 Let X1, X2, . . . be a sequence of random variables on (Ω, F, P) and define Fn := σ(Xn). Then Tn = σ(Xn, Xn+1, . . .) ∀ n ∈ N and T = n∈N Tn. We define the following events: F1 = ∃ lim n→∞ Xn = ω ∈ Ω : lim n→∞ Xn(ω) exists F2 = n∈N Xn exists F3 = lim n→∞ X1 + X2 + . . . + Xn n exists F4 = n∈N Xn exists and n∈N Xn = 0 Then F1, F2, F3 are contained within the tail σ−algebra of the sequences X1, X2, . . .. Proof. It helps to intuitively think of tail events as those events whose ocurrence or not is not affected by altering any finite number of random variables in the sequence. Claim that F1 ∈ T = n∈N Tn. It is enough to show that F1 ∈ Tn ∀ n. This is clear because the limit of a sequence Xk, k ∈ N only depends on (Xn+k)k∈N ∀ n ∈ N. In other words, for a sequence to have a limit, we look at the tail of the sequence, i.e. we can first discard the first finitely many terms. Similarly, F2 ∈ Tn ∀ n ∈ N ⇒ F2 ∈ T . To show F3 ∈ T is slightly trickier. Let ξ = lim sup n→∞ Sn n . 16
  • 17. We need to show the following: i) ξ(ω) is well defined ∀ ω ∈ Ω and ξ ∈ mσ(X1, X2 . . .). ii) ξ ∈ mTn ∀ n. Consider i). We know that ξ(ω) exists in [−∞, ∞] since every sequence of real numbers has a lim sup (See Probability Theory Ex Sheet 1 Q1c). Recall that ξ = inf k∈N sup n≥k Sn n and hence: {ξ ≥ a} = sup n≥k Sn n ≥ a ∈ σ(Xi, i ∈ N) which implies ξ ∈ mσ(Xi, i ∈ N). Here we used the fact that {(−∞, a], a ∈ R} is a π−system which generates B(R). We now wish to prove ii) Let Sk := Sn+k − Sn = n+k i=n+1 Xi ∈ σ (Xn+1, . . . , Xn+k). Then: Sk k = Sn+k n + k · n + k k − Sn k lim k→∞ Sk k = Sn+k n + k · (1) − 0 = Sn+k n + k Therefore, we have lim sup k→∞ Sk k = lim sup k→∞ Sn+k n + k ∈ mTn. Now consider F4. We knew that F2 ∈ T because F2 does not depend on the first finitely many terms. F4 ∈ mG, where G = σ(X1, X2, . . .), but is not in the tail σ−algebra T . This is because the event given by F4 clearly depends on the value of X1 (and possibly the first finitely many terms). If X1 is different, Xi may or may not be 0. So F4 is not necessarily in T . Theorem 3.1 – Kolmogorov’s 0-1 Law Let {Fn : n ∈ N} be a sequence of independent sub−σ−algebras in (Ω, F, P). Then the tail σ−algebra T = n∈N Tn where Tn = σ(Fn+1, Fn+2, . . .) satisfies the following two properties: i) ∀ F ∈ T ⇒ P[F] = 0 or P[F] = 1 ii) ∀ random variables ξ ∈ mT ∃ c ∈ [−∞, ∞] such that P[ξ = c] = 1. Proof. We start by proving i). Define Hn := σ(F1, F2, . . . , Fn). 17
  • 18. Step 1: We claim that Hn and Tn are independent. We know that: In = n i=1 Fi : Fi ∈ Fi, i = 1, 2, . . . , n Jn = l i=1 Fn+i : Fn+i ∈ Fn+i, i = 1, 2, . . . l, l ∈ N Both are π−systems as they are closed under intersection. In generates Hn since Fi ⊆ In ∀ i = 1, . . . , n and In ⊆ Hn. Similarly, Jn is a π−system that generates Tn, since Jn ⊇ Fn+i ∀ i ∈ N and Jn ⊆ Tn. So it is clear that ∀ A ∈ In, ∀ B ∈ Jn ⇒ P[A ∩ B] = P[A] · P[B] since we have {Fk : k ∈ N} independent sub−σ−algebras. So Hn and Tn are independent. Step 2: We claim Hn and T are independent. Since T ⊆ Tn ∀ n ∈ N, then T is independent of Hn ∀ n. Step 3: We claim that T is independent of σ n∈N Hn . Since T is independent of Hn, then this implies that T is independent of n∈N Hn which further implies that T is independent of σ n∈N Hn by Theorem 2.1. Here, we have used the fact that n∈N Hn is a π−system since this is an increasing sequence of σ−algebras. Step 4: Claim that T is independent of T . Note that σ n∈N Hn = σ (Fi : i ∈ N) and hence T ⊆ σ n∈N Hn . So for F ∈ T ⊆ σ n∈N Hn , we must have that F is independent of itself. So: P[F] = P[F ∩ F] = (P[F])2 But (P[F])2 = P[F] for F ∈ [0, 1] → P[F] = 0 or P[F] = 1. We now prove part ii). By part i), we have P [ξ ≤ x] = 0 1 ∀ x ∈ R Let c := sup {x : P[ξ ≤ x] = 0}. Define sup ∅ == ∞, so c is well defined on [−∞, ∞]. Then there are three cases. 18
  • 19. If c = −∞, this implies that P [ξ ≤ x] = 1 ∀ x ∈ R ⇒ ξ = −∞ (P−a.s.) Similarly, if c = +∞, this implies that P [ξ ≤ x] = 0 ∀ x ∈ R ⇒ ξ = +∞ (P−a.s.) Suppose c ∈ R. We then have P ξ ≤ c − 1 n = 0 ∀ n, and hence: P n∈N ξ ≤ c − 1 n = lim n→∞ P ξ ≤ c − 1 n = P [ξ < c] = 0 We also have P ξ ≤ c + 1 n = 1 ∀ n, and hence: P n∈N ξ ≤ c + 1 n = P lim n→∞ ξ ≤ c + 1 n = P [ξ ≤ c] = 1 Therefore P [ξ = c] = 1 (P−a.s.) Definition 3.2 – Infinitely often (i.o.) Let (En)n∈N be a sequence of events in (Ω, F, P). The event that En happens for infinitely many n ∈ N is given by: lim sup n→∞ En := m∈N n≥m En = En i.o. (infinitely often) Definition 3.3 – Eventually (ev) Let (En)n∈N be a sequence of events in (Ω, F, P). The event that En happens for all n ≥ m for some m ∈ N is given as: lim inf n→∞ En := m∈N n≥m En = En ev (eventually) Remarks: 1. We can also write: lim sup n→∞ En = ω ∈ Ω, ∀ m ∈ N ∃n(ω) ≥ m s.t. ω ∈ En(ω) . 2. Similarly, lim inf n→∞ En = {ω ∈ Ω, ∃ m(ω) ∈ N s.t. ∀ n ≥ m(ω) we have ω ∈ En}. 3. (En i.o.) = En ev . To see this, note that m∈N n≥m En = m∈N n≥m En . 4. (En i.o.) , (En ev) ∈ T , where T is the tail σ−algebra of the family Fn = σ(Fn). To see this, recall that T = n∈N σ(Fn, Fn+1, . . .) and note that (En i.o.) ∈ σ(Fm, Fm+1) ∀ m ∈ N since (En i.o.) = k∈N n≥k En. This is because the sequence of events n≥k En k∈N is decreasing. 19
  • 20. Lemma 3.1 – The first Borel-Cantelli lemma Let (En)n∈N be a sequence of events in (Ω, F, P) and let n∈N P [En] < ∞. Then P [En i.o.] = P [lim sup En] = 0. Proof. We have lim sup n→∞ En = m∈N Am, where Am = m≥n En. Since Am ⊆ Am+1 ∀ m ∈ N, we find P [En i.o.] = lim m→∞ P [Am]. But note that: 0 ≤ P [Am] ≤ n≥m P[En] → 0 as m → ∞ So this concludes the proof. Remarks: 1. The first Borel-Cantelli Lemma is very important. It is for example used in the con- struction of Brownian motion. 2. Let (Ω, F, P) be a probability space and let Q be a probability measure on (Ω, F). We say that Q is absolutely continuous with respect to P(Q P) if ∀ F ∈ F such that P[F] = 0 ⇒ Q[F] = 0. We claim that if Q P, then ∀ > 0, ∃ δ > 0 s.t. ∀ F ∈ F with P[F] < δ ⇒ Q(F) < . Proof. We begin a proof by contradiction, by showing that the converse statement leads to a contradiction. Our converse statement is thus: ∃ > 0 s.t. ∀ δ, ∃ Fδ ∈ F s.t. P [Fδ] < δ and Q [Fδ] ≥ Hence ∀ n ∈ N, pick δn = 2−n and let Fn ∈ F satisfy P [Fn] < 2−n and Q[Fn] ≥ . Let F = lim sup n→∞ Fn. We have P[F] = 0 by Borel Cantelli Lemma 1 (Lemma 3.1) since n∈N P [Fn] < ∞. But Q[F] = lim m→∞ Q n≥m Fn ≥ ∀ m. This implies that Q[F] ≥ . But if P[F] = 0, then Q[F] = 0 as well since Q P. Hence we have a contradiction. 20
  • 21. Lemma 3.2 – The second Borel-Cantelli lemma Let (En)n∈N in (Ω, F, P) be a sequence of independent events with n∈N P [En] = ∞. Then P [En i.o.] = P [lim sup En] = 1. Proof. Note that (En i.o.) = m∈N n≥m En. So if we can show that this has probability 0, we are done. Then: P n≥m En = lim k→∞ P k n=m En by monotonicity of measure P[Ω] = 1 = lim k→∞ k n=m P En by independence of En = lim k→∞ k n=m (1 − P[En]) ≤ lim k→∞ e − k n=m P [En] by inequality 1 − x ≤ e−x ∀x = 0 This implies that P n≥m En = 0 ∀ m ∈ N. Therefore we have: P [En i.o.] = P m∈N n≥m En ≤ m∈N P m≥n En = 0 Therefore we are done. Remarks: 1. Let (En)n∈N be the sequence of independent events. Then P [lim sup En] is either 0 or 1 by Kolmogorov’s 0 − 1 Law. 2. Furthermore, P [lim sup En] = 1 ⇔ n∈N P[En] = ∞ by Borel Cantelli Lemma 2 (Lemma 3.2) . Example 3.2 z 1. Let X ∼ N(0, 1). Then the following inequality holds: f(x) x + x−1 < P[X > x] < f(x) x , f(x) = 1 √ 2π e−x2 2 ∀ x > 0 This follows by noting that: x ∞ x f(y) dy < ∞ x yf(y) dy and that f (x) = −xf(x) and d f(x) x dx = −f(x) 1 + 1 x2 . 21
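A quick numerical check of the tail bounds in part 1 of Example 3.2 (this sketch is not from the notes; the grid of x values is an arbitrary choice, and scipy's norm.sf is used for the exact tail probability P[X > x]).

```python
import numpy as np
from scipy.stats import norm

# Check of the standard normal tail bounds
#   f(x)/(x + 1/x) < P[X > x] < f(x)/x,   f(x) = exp(-x^2/2)/sqrt(2*pi),  x > 0.
xs = np.array([0.5, 1.0, 2.0, 3.0, 4.0])
f = norm.pdf(xs)
tail = norm.sf(xs)                       # P[X > x]
lower, upper = f / (xs + 1.0 / xs), f / xs
assert np.all(lower < tail) and np.all(tail < upper)
for x, lo, t, up in zip(xs, lower, tail, upper):
    print(f"x={x:3.1f}   {lo:.3e} < {t:.3e} < {up:.3e}")
```

The bounds become very tight as x grows, which is what makes them useful in parts 2 and 3 of the example.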
  • 22. 2. Let Xn ∼ N(0, 1) independent and let L = lim sup n→∞ Xn √ 2 log n . Show that P[L > 1] = 0. Proof. Define En(a) := Xn > (1 + a) √ 2 log n , a ∈ R. Note that L > 1 + 1 k ⊆ lim sup n→∞ En 1 2k ∀ k ∈ N. We want to show that P lim sup En 1 2k = 0 by Borel Cantelli Lemma 1 (Lemma 3.1) . P En 1 2k < 1 √ 2π exp{−1 2 1 + 1 2k 2 2 log n} 1 + 1 2k √ 2 log n using part 1 = 1 √ 2π 1 + 1 2k · exp − 1 + 1 2k 2 √ 2 log n Since n∈N n−α √ log n < ∞ for any α > 1, then Borel Cantelli Lemma 1 (Lemma 3.1) ⇒ P L > 1 + 1 k = 0. Hence {L > 1} = k∈N L > 1 + 1 k ⇒ P[L > 1] = 0. This is also equivalent to P [L ≤ 1] = 1. 3. Prove that P[L = 1] = 1. Proof. It is sufficient to show that P [L < 1 − ] = 0 ∀ > 0. Recall that {L < 1} = ∞ n=2 L < 1 − 1 n so if we can show {L < 1 − 1 n } has P = 0 a.s. then through our operations of countable union, we have {L < 1} has P = 0 a.s. We pick > 0, and consider the set: En(a) = Xn √ 2 log n > 1 + a Then {L < 1 − } = En(− ) ev = (En(− ) i.o.) . We want to show that P [lim sup En(− )] = 1. So we need to prove that n∈N P [En(− )] = ∞ by showing that P [En(− )] ≥ an for some sequence where an > 0 such that an = ∞ (using the inequalities in part 1). Exercise: Find such a sequence an. Note that En(− ) are independent since random variables Xn are independent. There- fore Borel Cantelli Lemma 2 (Lemma 3.2) ⇒ P [lim sup En(− )] = 1 ⇒ P [L < 1 − ] = 0. Exercise: Show that L ∈ mT . 22
• 23. Example 3.3
Let Xn ∼ N(0, 1) be independent random variables, and let Sn = X1 + . . . + Xn. Show that:
i) Sn/√n ∼ N(0, 1)
ii) lim inf Sn/n = lim sup Sn/n = 0 (which implies that lim Sn/n exists and equals 0)
Note that ii) is the strong Law of Large Numbers for N(0, 1) random variables.
For i): Sn is Gaussian, being a sum of independent Gaussians, and it is easy to check that E[Sn/√n] = 0 by properties of expectations. So all we need now is to check that Var[Sn/√n] = 1. This holds since Var[Sn/√n] = (1/√n)² Var[Sn] = (1/n) Σi=1,...,n Var[Xi] = 1.
For ii) we consider the set En = { |Sn| ≤ 2√(n log n) }, and claim that P[En ev] = 1. This claim is useful as it gives a bound on Sn, i.e. −2√(n log n) ≤ Sn ≤ 2√(n log n) for all large n with probability 1. Therefore we have:
−2√(n log n)/n ≤ Sn/n ≤ 2√(n log n)/n for large n (P−a.s.)
But as n → ∞ both bounds tend to 0, so Sn/n → 0 (P−a.s.), which proves ii).
It then suffices to prove the claim. To prove the claim, note that (En ev)ᶜ = (Eᶜn i.o.), where Eᶜn denotes the complement of En. We wish to apply Borel Cantelli Lemma 1 (Lemma 3.1), and hence we need to show that Σn∈N P[Eᶜn] is finite. We need to find an upper bound on P[Eᶜn] = P[ |Sn|/√n ≥ 2√(log n) ] ≤ an, say, such that Σ an < ∞. Exercise: Find this upper bound. We can then use Borel Cantelli Lemma 1 (Lemma 3.1) to show that P[Eᶜn i.o.] = 0. 23
  • 24. 4 Integration (Ω, F, µ) is a measure space, mF = {f : Ω → R s.t. f−1 (B(R)) ⊆ F}. The Lebesgue integral is first defined for f ∈ (mF)+ where: f ∈ (mF)+ ⇔ f ∈ mF and f ≥ 0 µ a.s. Let f = n i=1 aiIAi , AI ∈ F, ai ≥ 0 be a simple function with Ω f dµ = n i=1 aiµ(Ai). For general f ∈ (mF)+ , we find (fn)n∈N of simple functions such that fn(ω) f(ω) ∀ ω ∈ Ω. (Here, means that fn(ω) is a monotone increasing sequence which converges to f(ω)) We define: Ω f dµ = lim n→∞ Ω fn dµ (4.1) We need to check that (4.1) is a good definition, and hence need to check: i) lim exists in (4.1) (which is true since fn ≤ fn+1 ∀ n ⇒ Ω fn dµ ≤ Ω fn+1 dµ. ii) gn(ω) f(ω), gn simple functions, then ∀ ω ∈ Ω: Ω gn dµ −→ n→∞ Ω f dµ ii) In other words we need to check that the definition is independent of sequences (fn)n∈N. ii) Exercise: check this. Theorem 4.1 – Monotone convergence theorem Take f, fn ∈ (mF)+ such that fn(ω) ≤ fn+1(ω) ∀ n ∈ N, ω ∈ Ω and f(ω) = lim n→∞ fn(ω). Then we have Ω f dµ = lim n→∞ Ω fn dµ. Properties of Lebesgue Integral • (Linearity) - For a, b ≥ 0, g, h ∈ (mF)+ , then: Ω (ag + bh) dµ = a Ω g dµ + b Ω h dµ • f ∈ (mF)+ ⇒ Ω f dµ ≥ 0 (from (4.1)) since it is true for simple functions. 24
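The sketch below is not part of the notes: it illustrates the staircase construction fn ↗ f behind (4.1), under the illustrative assumptions f(x) = x² on [0, 1] and Lebesgue measure approximated by a fine uniform grid. The simple functions are the dyadic truncations used later (in the proof of Theorem 6.5) and their integrals increase to ∫ f dµ = 1/3.

```python
import numpy as np

# Dyadic "staircase" simple functions alpha_n(f) increasing to f, and their
# integrals increasing to the Lebesgue integral of f (assumed f(x) = x^2 on [0,1],
# Lebesgue measure approximated by averaging over a fine grid).
def alpha_n(y, n):
    # simple function: dyadic steps of size 2^-n, truncated at level n
    return np.minimum(np.floor(y * 2**n) / 2**n, n)

x = np.linspace(0.0, 1.0, 1_000_001)   # grid standing in for ([0,1], Lebesgue)
f = x**2
for n in [1, 2, 4, 8, 16]:
    print(n, alpha_n(f, n).mean())      # increases towards int_0^1 x^2 dx = 1/3
print("limit:", f.mean())
```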
  • 25. Definition 4.1 – Integrable We define: L1 (Ω, F, µ) =    f ∈ mF s.t. Ω f+ dµ, Ω f− dµ < ∞    where f+ = max{f, 0}, f− = max{−f, 0}. Then we say f ∈ mF is integrable. We have: • Ω f dµ := Ω f+ dµ − Ω f− dµ • |f| = f+ + f− Therefore we have: Ω f dµ = Ω f+ dµ − Ω f− dµ ≤ Ω f+ dµ + Ω f− dµ = Ω |f| dµ Lemma 4.1 – Fatou’s lemma Let (fn)n∈N be a sequence in (mF)+ . Then we have: Ω lim inf n→∞ fn dµ ≤ lim inf Ω fn dµ Proof. Recall that lim inf n→∞ fn = lim n→∞ gn where gn = inf{fn, fn+1, . . .}. Note that (gn)n∈N is non-decreasing and gn ≤ fn ∀ n ∈ N. By Monotone Convergence Theorem (Theorem 4.1), we have: Ω lim inf n→∞ fn dµ = Ω lim n→∞ gn dµ = lim n→∞ Ω gn dµ = lim inf n→∞ Ω gn dµ since if lim exist, then lim inf exists ≤ lim inf n→∞ Ω fn dµ using Ω f dµ ≥ Ω g dµ ∀ n ∈ N 25
  • 26. Theorem 4.2 – Dominated convergence theorem Let fn, f ∈ (mF) and assume that ∃ g ∈ L1 (Ω, F, µ) s.t. |fn| ≤ g ∀ n ∈ N and lim n→∞ fn(ω) = f(ω) ∀ ω ∈ Ω. Then Ω f dµ = lim n→∞ Ω fn dµ. Proof. Note that f ∈ L1 since |f| ≤ g. Here we use f+ ≤ g ⇒ Ω f+ dµ ≤ Ω g dµ, so f+ , f− ∈ L1 and bounded by g. We wish to show that Ω |f − fn| dµ → 0. Note that: 2g − |f − fn| Fn ≥ g − |f| + g − |fn| ≥ 0 ∀ n By Fatou’s lemma (Lemma 4.1) applied to (Fn)n∈N, we get: Ω lim inf Fn dµ ≤ lim inf Ω Fn dµ (4.2) We also know that the terms on the LHS and RHS of (4.2): Ω lim inf n→∞ Fn dµ = Ω 2g − lim inf n→∞ |f − fn| dµ Ω Fn dµ = Ω 2g dµ − Ω |f − fn|dµ = Ω 2g − 0 dµ = Ω 2g dµ Rearranging (4.2), we have: Ω 2g dµ ≤ Ω 2g dµ − lim sup n→∞ Ω |f − fn| dµ ≥0 This implies that lim sup Ω |f − fn| dµ = 0. We know that lim sup n→∞ Ω |fn−f| dµ = 0 ⇒ lim n→∞ Ω |fn−f| dµ = 0 since lim inf n→∞ Ω |fn−f| dµ = 0 as well as |f − fn| is non-negative and lim inf ≤ lim sup. Therefore Ω f dµ − Ω fn dµ ≤ Ω |f − fn| dµ → 0. 26
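A numerical contrast (not from the notes) between a dominated and an undominated sequence, with Lebesgue measure on (0, 1) approximated by a fine grid: gn(x) = xⁿ is dominated by the constant 1 and its integrals converge to 0 = ∫ lim gn dµ, while the classical example fn = n·I(0,1/n) converges to 0 pointwise but keeps ∫ fn dµ = 1, so the conclusion of the Dominated Convergence Theorem fails without an integrable dominating g.

```python
import numpy as np

# Dominated vs. undominated convergence on (0,1) (Lebesgue measure approximated
# by a fine grid): g_n = x^n is dominated by 1, f_n = n * 1_{(0,1/n)} is not.
x = np.linspace(0.0, 1.0, 2_000_001)[1:-1]   # interior points of (0,1)
for n in [1, 10, 100, 1000]:
    g_n = x**n                               # dominated by g = 1, integral -> 0
    f_n = n * (x < 1.0 / n)                  # no integrable dominating function
    print(n, g_n.mean(), f_n.mean())         # second column stays near 1
```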
  • 27. 5 Expectations We take (Ω, F, P) to be our probability space, and X a random variable, which implies that X ∈ mF. If X ≥ 0, then E[X] = Ω X dP = Ω X(ω) P[dω]. For X ∈ mF, we say X ∈ L1 (Ω, F, P) if E [X+ ] , E [X− ] < ∞ where: X+ = max{X, 0}, X− = max{−X, 0} Then expectation of X is given by E[X] = E [X+ ] − E [X− ]. Proposition 5.1 Let X be a random variable on (Ω, F, P) and let g : F → R be Borel measurable. Then g(X) is in L1 (Ω, F, P) ⇔ g ∈ L1 (R, B, PX ) where PX [A] = P[X ∈ A] ∀ A ∈ B(R). Then we have: E [g(X)] = R g(x)PX [dx] (5.1) Remarks: 1. If X is a continuous random variable, i.e. PX γL = Lebesgue measure ⇔ PX [A] = A fX(x) dx, then by Proposition 5.1, we have E [g(X)] = R g(x)fX(x) dx. 2. If X is a discrete random variable, e.g. X ∈ N with probability 1, then E [g(X)] = k∈N g(k)P[X = k] where PX [k] = P[X = k]. Proof. of Proposition 5.1. Here we want to show that this holds, starting from indicator random variable, simple random variable, non-negative random variable, to all random variables. Let g = IA, A ∈ B(R). Then (5.1) holds, since E [IA(x)] = P[X ∈ A] = PX [A]. (Indicator random variables) By linearity of integrals, and that simple random variables are finitely weighted sums of indicator functions, then (5.1) holds for simple random variables as well. (Simple random variables) Assume g ≥ 0 and let 0 ≤ gn g be a sequence of simple random variables that is monotonic and converges to g. We have E [gn(X)] = R gn(X)PX [dx]. Then Monotone Convergence Theorem (Theorem 4.1) implies (5.1) for g ≥ 0. This is because gn(X) is a simple random variable on Ω, lim n→∞ gn(x) = g(x), gn(x) g(x), so Monotone Convergence Theorem (Theorem 4.1) tells us that E [g(X)] = lim n→∞ E [gn(X)] and R gndPX R g dPX . So (5.1) holds for non-negative random variables. (Non-negative random variables) 27
• 28. Lastly, take g ∈ L1(R, B, PX), and apply (5.1) to g⁺ and g⁻. Then by linearity of the integral, (5.1) holds for all integrable g. (All random variables)
Lemma 5.1
X ∈ (mF)⁺ and E[X] = 0 ⇒ P[X = 0] = 1 (⇔ P[X > 0] = 0)
Proof. Note that {X > 0} = ∪n∈N {X > 1/n}. We attempt a proof by contradiction, and assume that P[X > 0] > 0. Then there exists n ∈ N s.t. P[X > 1/n] > 0, since otherwise countable subadditivity would give P[X > 0] = 0. Then we have:
E[X] = ∫Ω X dP = ∫Ω X·I{X>1/n} dP + ∫Ω X·I{X≤1/n} dP ≥ ∫Ω X·I{X>1/n} dP ≥ ∫Ω (1/n)·I{X>1/n} dP = (1/n) P[X > 1/n] > 0
which is a contradiction. Therefore P[X > 0] = 0, i.e. P[X = 0] = 1. 28
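Referring back to Remark 1 after Proposition 5.1, the following sketch (not part of the notes; the choices X ∼ N(0, 1) and g(x) = x² are illustrative) compares the Monte Carlo estimate of E[g(X)] with the integral ∫ g(x) fX(x) dx computed by quadrature.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# E[g(X)] two ways, for the assumed choice X ~ N(0,1), g(x) = x^2:
# a sample average of g(X) versus  int g(x) f_X(x) dx , both close to 1.
rng = np.random.default_rng(1)
X = rng.standard_normal(1_000_000)
mc = np.mean(X**2)
quadrature, _ = quad(lambda x: x**2 * norm.pdf(x), -np.inf, np.inf)
print(mc, quadrature)
```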
  • 29. 6 Inequalities Definition 6.1 – Convex function (in R) A function f : I → R, where I ⊆ R is an interval (either open or closed) is convex if ∀ p ∈ (0, 1) and x, y ∈ I, we have f(px + (1 − p)y) ≤ pf(x) + (1 − p)f(y). If f is a convex function, then f is continuous. Exercise: Prove by contradiction. Example 6.1 – Examples of convex functions 1. x → |x| 2. x → x2 3. x → eθx ∀ θ ∈ R Example 6.2 – Examples of non-convex functions 1. x → −|x| (concave function) 2. x → sin x (neither convex or concave) Exercise: Prove this. Proposition 6.1 If f is both convex and concave on R, there exists a, b ∈ R such that f(x) = ax + b ∀x ∈ R. Exercise: Prove this. Exercise: Prove that a concave function is continuous (this follows from a convex function is continuous). Proposition 6.2 If f : I → R is in C2 (I), then f is convex ⇔ f (x) ≥ 0 ∀ x ∈ I. Proof. ⇒ Using Taylor’s theorem, we can expand: f(x + ) = f(x) + f (x) + 2 2 f (ξx), where ξx ∈ (x, x + ) and f(x − ) = f(x) − f (x) + 2 2 f (ξx), where ξx ∈ (x, x + ) Then we can write: f (x) = f(x + ) + f(x − ) − 2f(x) 2 (6.1) as when 0, then ξx → x. Assume f is convex. Then we can write x = p(x − ) + (1 − p)(x + ) where p = 1 2 . 29
  • 30. By convexity, we can write: f(x) = f 1 2 (x − ) + 1 2 (x + ) ≤ 1 2 f(x − ) + 1 2 f(x + ) This gives f(x+ )+f(x− )−2f(x) ≥ 0, and since 2 > 0, this implies that (6.1) (f (x)) ≥ 0. ⇐ Exercise Theorem 6.1 – Markov’s inequality Let (Ω, F, P) be a probability space. Then take X ∈ mF, and g : I → [0, ∞] a non- decreasing B−measurable function where I ⊆ R is an interval such that P[X ∈ I] = 1. Then g(c) · P[X ≥ c] ≤ E [g(X)] ∀ c ∈ I. Note here that E [g(X)] exists (which may be +∞) since g(X) ∈ (mF)+ . Proof. g(c) · P[X ≥ c] = E [g(c) · IX≥c] ≤ E [g(X) · IX≥c] since on {X ≥ c} we have g(X) ≥ g(c) as g non-decreasing ≤ E [g(X)] this holds since g(X) ≥ 0. Example 6.3 – Examples of using Markov’s inequality Suppose x ∈ mF, > 0. Then: P [|x| ≥ ] ≤ E [|x|] (6.2) and P [|x| ≥ ] ≤ E [x2 ] 2 (6.3) (6.2) follows by applying Markov’s inequality (Theorem 6.1) to the random variable |X| and having g : [0, ∞] → [0, ∞], with x → x. (6.3) follows by applying Markov’s inequality (Theorem 6.1) to the random variable |X| and having g : [0, ∞] → [0, ∞], with x → x2 . (6.3) is also known as Chebyshev’s inequality. Theorem 6.2 – Jensen’s inequality Let (Ω, F, P) be a probability space. Let X be a random variable such that P[X ∈ I] = 1, where I ⊆ R is an interval. Let g : I → R be a convex function such that E [g(X)] < ∞ and E [|x|] < ∞. Then g (E [X]) ≤ E [g(X)]. Proof. Since g is convex, we have g(x) = sup n∈N {anx + bn} ∀ x ∈ I and some sequences (an)n∈N, (bn)n∈N. Hence: g(X) ≥ anX + bn ∀ n ∈ N ⇒ E [g(X)] ≥ anE[X] + bn ⇒ E [g(X)] ≥ sup n∈N {anE[X] + bn} = g (E[X]) 30
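A Monte Carlo check of Example 6.3 (this is an added sketch, not part of the notes; the choice X ∼ Exponential(1) and the values of ε are illustrative): the empirical tail probability should sit below both the Markov bound (6.2) and the Chebyshev-type bound (6.3).

```python
import numpy as np

# Markov (6.2) and Chebyshev-type (6.3) bounds checked by simulation,
# assuming X ~ Exponential(1), so E|X| = 1 and E[X^2] = 2.
rng = np.random.default_rng(2)
X = rng.exponential(scale=1.0, size=1_000_000)
for eps in [1.0, 2.0, 4.0]:
    p = np.mean(np.abs(X) >= eps)
    markov = np.mean(np.abs(X)) / eps
    cheby = np.mean(X**2) / eps**2
    print(eps, p, markov, cheby)
    assert p <= markov + 1e-6
```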
  • 31. Remarks: 1. If assumptions of Jensen’s inequality (Theorem 6.2) hold and g is concave, then we get the inequality g (E[X]) ≥ E [g(X)]. 2. If a random variable X takes two values, x, y ∈ I with p = P[X = x] and 1 − p = P[X = y], then Jensen’s inequality (Theorem 6.2) is just the definition of convexity of g. i.e. g (px + (1 − p)y) ≤ pg(x) + (1 − p)g(y). 3. Under assumptions above, we have E[X] ∈ I. Exercise: Prove this. Hint: If I = (a, b), then P[X < b] = 1 and P[X > a] = 1. We have E[X] < E[b] = b. Hence g (E[X]) is well defined. Definition 6.2 – Lp space We define Lp (Ω, F, P) to be {X ∈ mF : E [|x|p ] < ∞}. p = 1, 2 are the most common, but for p ≥ 1 we get a vector space. Proof. (p ≥ 1) is a vector space) We first note that: (x + y)p ≤ (2 max{x, y})p ≤ 2p (xp + yp ) ∀ x, y ≥ 0 (6.4) Take X, Y ∈ Lp . We need to show that E [|αX + βY |p ] < ∞ for α, β ∈ R. So: E [|αX + βY |p ] ≤ E [(|αX| + |βY |)p ] by inequality ≤ 2p (E [|α|p |X|p ] + E [|β|p |Y |p ]) using (6.4) < ∞ Definition 6.3 – x p We define x p := (E [|X|p ]) 1 p for X ∈ Lp , p ≥ 1. Note that this is not a norm - the first property fails. Theorem 6.3 – Cauchy-Schwarz inequality Take X, Y ∈ L2 . Then XY ∈ L1 and: |E[XY ]| ≤ E [|XY |] ≤ E[X2 ]E[Y 2 ] 1 2 Furthermore, we have equality if there exists a, b ∈ R s.t. |a| + |b| > 0 and aX + bY = 0 (P−a.s.) Proof. We first note that: 0 ≤ (X + λY )2 ∀ λ ∈ R (6.5) and X + λY ∈ L2 . 31
  • 32. Hence XY = 1 2 [(X + Y )2 − X2 − Y 2 ] ∈ L1 . From (6.5), we have: 0 ≤ X2 + 2λXY + λ2 Y 2 ⇒ E[0] ≤ E[X2 ] + 2λE[XY ] + λ2 E[Y 2 ] ∀ λ ∈ R We can differentiate the above function and find λ which gives the minimum value, which is λ = − E[XY ] E[Y 2] . Note that if E[Y 2 ] = 0 ⇒ Y = 0 (P−a.s.), then Cauchy-Schwarz inequality (Theorem 6.3) holds. So WLOG, we assume Y = 0. Substituting the value of λ, we get: 0 ≤ E[X2 ] − 2 E[XY ]2 E[Y 2] + E[XY ]2 E[Y 2] ⇒ E[XY ]2 ≤ E[X2 ]E[Y 2 ] This satisfies our theorem. However, if we have equality, then we know that: 0 = E (X + λY )2 for λ = E[XY ] E[Y 2] This implies that X + λY = 0 (P−a.s.) If Y is 0 (P−a.s.), then E[X2 ] = 0, and therefore X = 0 (P−a.s.) As L2 (Ω, F, P) is a vector space, and we define the ‘inner product’: < X, Y > = E[XY ]. This is well defined since X, Y ∈ L2 ⇒ XY ∈ L1 . Then the Cauchy-Schwarz inequality takes the form: | < X, Y > | ≤ X 2 Y 2 where X 2 = E [|X|2 ] 1 2 . Note that the inequality X+Y 2 ≤ X 2+ Y 2 holds for X, Y ∈ L2 by Cauchy-Schwarz inequality (Theorem 6.3). Proof. X + Y 2 2 = E [(X + Y )2 ] = E[X2 ] + E[Y 2 ] + 2E[XY ] ≤ E[X2 ] + E[Y 2 ] + 2E[X2 ] 1 2 E[Y 2 ] 1 2 by applying Cauchy-Schwarz inequality = ( X 2 + Y 2)2 32
  • 33. Theorem 6.4 – Monotonicity of Lp norms Given X ∈ Lp (Ω, F, P), p ≥ 1; X p = E [|X|p ] 1 p , then for 1 ≤ p ≤ r < ∞, we have for any Y ∈ Lr (Ω, F, P), Y p ≤ Y r. In particular, Lr ⊆ Lp . Proof. Note that g(x) = x r p is convex on [0, ∞). Then we have: g (E [|Y |p ]) ≤ E [g (|Y |p )] by Jensen’s inequality ⇒ E [|Y |p ] r p ≤ E [|Y |r ] ⇒ Y p ≤ Y r by taking rth root on both sides Remark: Note that Theorem 6.4 holds for probability measures only. Exercise: Find f ∈ L2 (R, B, γL) such that f /∈ L1 (R, B, γL). Recap of Definitions from probability: •Cov[X, Y ] = E [(X − E[X])(Y − E[Y ])] is well defined • Var[X] = Cov[X, X] • |Cov[X, Y ]| ≤ Var[X]Var[Y ] (by Cauchy-Schwarz inequality) • ρ(X, Y ) = Cov[X, Y ] Var[X]Var[Y ] ∈ [−1, 1] (correlation between 2 random variables) These concepts are well defined if X, Y ∈ L2 . Theorem 6.5 – Independence If random variables X, Y ∈ L1 (Ω, F, P) and X and Y are independent, then XY ∈ L1 (Ω, F, P). Furthermore E[XY ] = E[X]E[Y ]. Remarks: 1. Let f, g : R → R ∈ mB and (independent) X, Y as in Theorem 6.5. Then if E [f(X)] , E [g(Y )] are finite, we have: E [f(X)g(Y )] = E [f(X)] E [g(Y )] (6.6) Exercise: Prove that X, Y independent ⇒ f(X), g(Y ) independent. Use the fact that f(X), g(Y ) ∈ mF. To prove (6.6), we apply Theorem 6.5 to f(X) and g(Y ). 2. If X, Y are independent in L2 ⇒ Cov[X, Y ] = 0. So: Cov[X, Y ] = E [(X − E[X])(Y − E[Y ])] = E [X − E[X]] E [Y − E[Y ]] = 0 33
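A small simulation (not in the notes; the choice of two independent N(0, 1) samples and the dependent pair (X, X + Z) is illustrative) of Theorem 6.5 and Remark 2: for an independent pair the product of expectations matches the expectation of the product and the covariance is near zero, while a dependent pair shows a clearly nonzero covariance.

```python
import numpy as np

# Monte Carlo check of Theorem 6.5 and Remark 2 for assumed X, Z iid N(0,1).
rng = np.random.default_rng(3)
X, Z = rng.standard_normal(1_000_000), rng.standard_normal(1_000_000)
print(np.mean(X * Z), np.mean(X) * np.mean(Z))   # both near 0: E[XZ] = E[X]E[Z]
print(np.cov(X, Z)[0, 1])                        # near 0 for the independent pair
print(np.cov(X, X + Z)[0, 1])                    # near Var[X] = 1 for a dependent pair
```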
  • 34. Example 6.4 Take E[X] = 0, E[|X|3 ] < ∞. In other words, X ∈ L3 and E[X] = 0. If E[X3 ] = 0, then Cov[X, X2 ] = 0 and X and X2 are not independent. Prove this. Proof. Sketch proof of Theorem 6.5 Exercise: Write out full proof. Note that it is enough to prove theorem for X, Y ∈ L1 ∩ (mF)+ since X = X+ − X− , Y = Y + − Y − . Note that X+ and Y + are independent since X+ and Y + are some functions of X, Y (max{X, 0}, max{Y, 0}) and use linearity of expectation. Assume X, Y ≥ 0 and note that α(n) (X) X, α(n) (Y ) (Y ) ∀ ω ∈ Ω for α(n) : R → R given by: α(n) (x) :=    0 x = 0 (i − 1)2−n (i − 1)2−n < x ≤ i2−n ≤ r, i ∈ N n x > n Then note that: (Show this as an exercise) 1. α(n) (X) is a simple function. 2. α(n) (X), α(n) (Y ) are independent. 3. E α(n) (X)α(n) (Y ) = E α(n) (X) E α(n) (Y ) ∀ n. 4. α(n) (X)α(n) (Y ) XY as n → ∞. 5. Theorem 6.5 follows by Monotone Convergence Theorem (Theorem 4.1) on 3. 34
  • 35. 7 Convergence of Random Variables Let (Xn)n∈N be a sequence of random variables on (Ω, F, P). Definition 7.1 – Converge almost surely The sequence (Xn)n∈N converges to a random variable X almost surely if the set: lim n→∞ Xn = X = ω ∈ Ω lim n→∞ Xn(ω) = X(ω) has probability 1, i.e. P lim n→∞ Xn = X = 1. Definition 7.2 – Converges in probability The sequence (Xn)n∈N converges in probability to a random variable X if: ∀ > 0, lim n→∞ P [|Xn − X| > ] = 0 Definition 7.3 – Converges in Lp Let Xn ∈ Lp (Ω, F, P), p ≥ 1 ∀ n. Then the sequence (Xn)n∈N converges in Lp to a random variable X ∈ Lp if E [|Xn − X|p ] −−−−→ n→∞ 0. Notation: Xn · p −→ X. Definition 7.4 – Cauchy (in Lp ) A sequence (Xn)n∈N is Cauchy in Lp if: ∀ > 0 ∃ N ∈ N s.t. Xn − Xm p < ∀ n, m > N Definition 7.5 – Converges in distribution The sequence (Xn)n∈N converges in distribution to a random variable X if: lim n→∞ P [Xn ≤ x] = P [X ≤ x] ∀ x ∈ R such that the cdf FX(y) = P [X ≤ y] is continuous. Convergence in distribution is also known as weak convergence. Notation: Xn d −→ X or Xn w −→ X. Remarks: 1. Note that if Xn d −→ X, then the random variables (Xn)n∈N, X need not be defined on the same probability space. For other modes of convergence, (Xn)n∈N, X have to be defined on the same probability space. 2. (Xn)n∈N in Lp is Cauchy if and only if sup n,m≥N Xn − Xm −−−−→ n→∞ 0. 35
  • 36. Lemma 7.1 Convergence in probability implies almost sure convergence along the subsequence. In other words, if (Xn)n∈N converges in probability to X, with (Xn)n∈N, X ∈ (mF) (on probability space (Ω, F, P), then there exists a subsequence (Xkn )n∈N s.t. Xkn a.s. −→ X. Proof. Idea of proof examinable Let ( n)n∈N be a decreasing sequence of positive real numbers such that n 0. Then ∀ n ∈ N, ∃ kn ∈ N s.t. P [|Xkn − X| > n] < 2−n (since Xn P −→ X). WLOG, we can assume that kn < kn+1 ∀ n ∈ N. Now we prove that (Xkn )n∈N tends to X almost surely, using Borel Cantelli Lemma 1 (Lemma 3.1). Note that ∀ ω ∈ Ω, we have: (Xkn (ω))n∈N converges to X(ω) ⇔ ω ∈ m∈N lim inf n→∞ |Xkn − X| ≤ 1 m (7.1) Fix m, then note: lim inf n→∞ |Xkn − X| ≤ 1 m ⊇ lim inf n→∞ {|Xkn − X| ≤ n} since n 0 = lim sup n→∞ {|Xkn − X| > n} Now: n∈N P [|Xkn − X| > n] < ∞ ⇒ P lim sup n→∞ {|Xkn − X| > n} = 0 by Borel Cantelli Lemma 1 (Lemma 3.1) ⇒ P lim inf n→∞ |Xkn − X| ≤ 1 m = 1 ∀ m ∈ N ⇒ P m∈N lim inf n→∞ |Xkn − X| ≤ 1 m = 1 ⇒ P [{Xkn → X}] = 1 by (7.1) Remark: If (Xn)n∈N, X are random variables on (Ω, F, P) and Xn ∈ mG ∀ n ∈ N, where G ⊆ F, then if Xn P −→ X, we also have X ∈ mG. Proof. By Lemma 7.1, ∃ subsequence (Xkn )n∈N s.t. Xkn a.s. −−−−→ n→∞ X ⇒ X ∈ mG since Xkn ∈ mG ∀ n ∈ N. 36
  • 37. Proposition 7.1 A sequence of random variables (Xn)n∈N converges to X in distribution (or converges weakly) if and only if lim n→∞ E [f(Xn)] = E [f(X)] ∀ f : R → R continuous and bounded. Proof. (⇒) Let Fn(x) = P[Xn ≤ x], F(x) = P[X ≤ x] be the cdf of Xn and X respectively. Let ([0, 1], B, γL) be a probability space and define random variables: Yn(ω) := inf {z ∈ R : ω ≤ Fn(z)} ∀ ω ∈ [0, 1] Y (ω) := inf {z ∈ R : ω ≤ F(z)} ∀ ω ∈ [0, 1] Exercise: Show that Yn(ω), Y (ω) ∈ mB. Note: Yn(ω) ≤ y ⇔ ω ≤ Fn(y) ∀ y ∈ R. Exercise: Show this. Therefore: γL (Yn ≤ y) = Fn(y) = P [Xn ≤ y] and E [f(Xn)] = E [f(Yn)]. A similar equality holds for X and Y . Now: Xn d −→ X ⇒ Fn(x) → F(x) ∀ x ∈ R {points of discontinuity of F} ⇒ Yn → Y γL a.s. Hence E [f(Yn)] −−−−→ n→∞ E [f(Y )] = E [f(X)] by Dominated Convergence Theorem (Theo- rem 4.2) since f(Yn) → f(Y ) as f is continuous and |f(Yn)| ≤ sup x∈R |f(x)| < ∞. ⇐ as homework. Theorem 7.1 – Modes of Convergence The implications between modes of converges of random variables are: a) almost sure convergence implies convergence in probability. b) Lp convergence (for p ≥ 1) implies convergence in probability. c) convergence in probability implies convergence in distribution. Proof. a) Let (Xn)n∈N converge almost surely to X, and pick > 0. We need to prove that P [|Xn − X| > ] −−−−→ n→∞ 0. Let An := {|Xn − X| > } and note that P [An i.o.] = P [Xn does not converge to X] = 0 by our initial assumption. Then: 37
  • 38. 0 = P [An i.o.] = P lim sup n→∞ An = P m∈N Bm where Bm = n≥m An = lim m→∞ P [Bm] since Bm ⊇ Bm+1 ∀ m ∈ N = inf m∈N P[Bm] since (P[Bm])m∈N decreasing ≥ inf m∈N sup n≥m P[An] since P[Bm] ⊇ P[An] ∀ n ≥ m = lim m→∞ sup n≥m P[An] ≥ 0 This implies that lim m→∞ P[Am] = 0. b) Let (Xn)n∈N converge in Lp to X. Pick > 0, then we apply Markov’s inequality (Theorem 6.1) to f(x) = xp ; f : R+ → R+ to get: 0 ≤ p P [|Xn − X| > ] ≤ E [|Xn − X|p ] Hence lim n→∞ P [|Xn − X| > ] = 0 since lim n→∞ E [|Xn − X|p ] = 0 by assumption. c) Let Xn P −→ X, and pick f : R → R continuous and bounded. We need to check that E [f(Xn)] −−−−→ n→∞ E [f(X)]. We argue by contradiction. Contrapositive statement: ∃ > 0 and an increasing subsequence (kn)n∈N, kn ∈ N s.t. |E [f (Xkn )] − E [f(X)]| > . We denote Yn := Xkn ∀ n ∈ N. Then note that: Xkn P −→ X ⇒ ∃ subsequence of (Yn)n∈N , say (Yln )n∈N s.t. Yln −−−−→ n→∞ X a.s. by Lemma 7.1 ⇒ f (Yln ) −−−−→ n→∞ f(X) a.s. as f continuous ⇒ lim n→∞ E [f (Yln )] = E [f(X)] by Dominated Convergence Theorem (Theorem 4.2) as f bounded ⇒ |E [f (Yln )] − E [f(X)]| < ∀ n ≥ N0 ∈ N This is a contradiction. 38
  • 39. Corollary 7.1 Xn → X in probability if and only if every subsequence (Xkn )n∈N of (Xn)n∈N has a further subsequence that converges almost surely to X. Proof. (⇒) This follows from Lemma 7.1 since Xkn P −→ X. (⇐) We will prove the negation of the statement. Assume that (Xn)n∈N does not converge to X in probability, and: ∃ , δ > 0 and k : N → N s.t. P[|Xk(n) − X| > Ak(n) ] ≥ δ ∀ n ∈ N Then no subsequence of Xk(n) n∈N converges to X almost surely. Let l : N → k(N) be an increasing function. We must show that this subsequence Xl(n) n∈N of Xk(n) n∈N does not converge to X almost surely. Note that: P Al(n) i.o. = P lim sup n→∞ Al(n) ≥ lim sup n→∞ P Al(n) by Fatou’s lemma (Lemma 4.1) ≥ δ > 0 Therefore the negation is proved. Corollary 7.2 – Continuous mapping theorem Let (Xn)n∈N converge to X in probability (or respectively converge to X in distribution), and let f : R → R be a continuous function. Then (f(Xn))n∈N converges to f(X) in probability (or respectively in distribution). Proof. Converges in probability By Corollary 7.1, since f(Xn) P −−−−→ n→∞ f(X) if and only if every subsequence f Xk(n) n∈N has a further subsequence that tends to f(X) almost surely because f is continuous. Converges in distribution By Proposition 7.1, since f(Xn) d −−−−→ n→∞ f(X) if and only if E [g (f(Xn))] −−−−→ n→∞ E [g (f(X))] for every g : R → R that is continuous and bounded. 39
  • 40. Theorem 7.2 – Weak law of large numbers Let Yn ∈ L2 and {Yn : n ∈ N} independent and let µ = E[Yn] identically distributed. µ is finite since L2 ⊂ L1 . Define Xn = 1 n n i=1 Yi. Then Xn P −−−−→ n→∞ µ. Proof. For every > 0, we have: 2 P [|Xn − µ| ≥ ] ≤ E [(Xn − µ)2 ] by Markov’s inequality (Theorem 6.1) = Var [Xn] = 1 n2 Var n i=1 Yi = Var [Y1] n → 0 as n → ∞ Therefore Xn P −−−−→ n→∞ µ. Theorem 7.3 – Strong law of large numbers Let Yn ∈ L4 and {Yn : n ∈ N} independent and let µ = E[Yn] identically distributed. µ is finite since L4 ⊂ L1 . Then Xn−−−−→ n→∞ µ almost surely. Proof. Without loss of generality, we assume µ = 0. Otherwise, we could consider Yn = Yn − µ. Then: E [X4 n] = 1 n4 n i=1 E Y 4 i + 6 n 1≤i<j≤n E Y 2 i Y 2 j ≤ 1 n4 nc + 6 n(n − 1) 2 E Y 2 1 Y 2 2 for some constant c > 0 ≤ d n2 for some constant d > 0 Thus this implies E ∞ n=1 X4 n ≤ ∞ n=1 d n2 < ∞. Therefore Xn−−−−→ n→∞ 0 almost surely. Theorem 7.4 – Completeness of Lp The space Lp (Ω, F, P) is complete for any p ≥ 1. In other words, any Cauchy sequence (Xn)n∈N in Lp has a limit in Lp . In other words, there exists a random variable X in Lp such that Xn · p −→ X. Proof. Exercise 40
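A simulation of the law of large numbers in Theorems 7.2 and 7.3 (an added sketch, not part of the notes; the choice Yi ∼ Uniform(0, 1) with µ = 1/2 and the sample sizes are illustrative): the running sample mean drifts towards µ.

```python
import numpy as np

# Sample means X_n = (1/n) sum_{i<=n} Y_i for assumed Y_i ~ Uniform(0,1), mu = 1/2.
rng = np.random.default_rng(4)
Y = rng.uniform(size=1_000_000)
running_mean = np.cumsum(Y) / np.arange(1, Y.size + 1)
for n in [10, 1_000, 100_000, 1_000_000]:
    print(n, running_mean[n - 1])
print("deviation at n = 10^6:", abs(running_mean[-1] - 0.5))
```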
  • 41. Remarks: 1. Proof of this theorem uses Borel Cantelli lemma, Fatou’s lemma, etc. See notes for details. 2. If p = 2, we define < X, Y >= E[XY ] where X, Y ∈ L2 . Pythagoras Theorem says: If X, Y ∈ L2 satisfy < X, Y >= 0 (i.e. they are orthogonal), then: X + Y 2 2 = X 2 2 + Y 2 2 (7.2) where X 2 = √ < X, X > = (E[X2 ]) 1 2 . Proof. X + Y 2 2 = < X + Y, X + Y > = X 2 2 + Y 2 2 + 2 < X, Y > 0 3. In probabilistic language, if < X, Y >= 0 and E[X] = E[Y ] = 0, then Cov[X, Y ] = 0. Furthermore, Var[X + Y ] = Var[X] + Var[Y ]. This is equivalent to (7.2). 4. Parallelogram Law: 1 2 X + Y 2 2 + X − Y 2 2 = X 2 2 + Y 2 2 ∀ X, Y ∈ Lp Proof. Exercise Theorem 7.5 Let (Ω, F, P) be a probability space and G ⊆ F a sub−σ−algebra of F. Then L2 (Ω, G, P) is a complete subspace of L2 (Ω, F, P) and ∀ X ∈ L2 (Ω, F, P), there exists Y ∈ L2 (Ω, G, P) such that the following holds: i) X − Y 2 = inf { X − Z 2 : Z ∈ L2 (Ω, G, P)} ii) E [(X − Y )Z] = 0 ∀ Z ∈ L2 (Ω, G, P) Furthermore, i) and ii) are equivalent and Y ∈ L2 (Ω, G, P) satisfies i) or ii) if and only if Y = Y (P−a.s.) Note that ii) ⇔< (X − Y, Z >= 0 ∀Z ∈ L2 (Ω, G, P). Proof. We need to show that L2 (Ω, G, P) is complete. Then: Take (Xn)n∈N in L2 (Ω, G, P) Cauchy ⇒ (Xn)n∈N is Cauchy in L2 (Ω, F, P) ⇒ Xn · 2 −→ X ∈ L2 (Ω, F, P) ⇒ Xn P −→ X by Theorem 7.1 ⇒ ∃ subsequence Xkn −−−−→ n→∞ X a.s. ⇒ X ∈ mG ⇒ X ∈ L2 (Ω, G, P) There ∃ (Yn)n∈N ∈ L2 (Ω, G, P) such that: X − Yn 2 → d := inf X − Z : Z ∈ L2 (Ω, G, P) 41
  • 42. We apply the parallelogram law to X − Ym, X − Yn: Ym − Yn 2 2 = 2 ( Ym − X 2 2 + X − Yn 2 2) − 4 X − (Ym + Yn)/2 2 2 ≤ 2 ( Ym − X 2 2 + X − Yn 2 2) − 4d2 ≤ 2(d2 + d2 ) − 4d2 as n, m → ∞ = 0 Hence (Yn)n∈N is Cauchy in L2 (Ω, G, P) such that Yn − Y 2 −−−−→ n→∞ 0. Note: d ≤ X − Y 2 ≤ X − Yn 2 + Yn − Y 2. For every n ∈ N ⇒ d ≤ X − Y ≤ d. Hence i) holds. Now we show that i) ⇒ ii) by contradiction. Assume ∃Z ∈ L2 (Ω, G, P) such that < X − Y, Z > > 0 and Z 2 = 1. Then Y + < X − Y, Z > Z ∈ L2 (Ω, G, P). X − (Y + < X − Y, Z > Z 2 2 = X − Y 2 2+ < X − Y, Z >2 Z 2 − 2 < X − Y, Z >2 = X − Y 2 2− < X − Y, Z >2 < X − Y 2 2 This is a contradiction because of i): we know that X−Y 2 = inf { X − Z 2 : Z ∈ L2 (Ω, G, P)}. But X − Y 2 is the smallest element, and we cannot have anything smaller than that. Hence i) ⇒ ii). To see that ii) ⇒ i), note that: X − Z 2 2 = |(X − Y ) + (Y − Z) 2 2 = X − Y 2 2 + Y − Z 2 2 by Pythagoras Theorem since Y − Z ∈ L2 (Ω, G, P) ≥ X − Y 2 2 So ii) ⇒ i). If Y satisfies ii), then: a = X − Y 2 2 = X − Y 2 2 + Y − Y 2 2 ≥ X − Y 2 2 = b By i), a = b (since there can only be one infimum), hence Y − Y 2 2 = 0. This implies that Y = Y (P−a.s.), because E Y − Y 2 2 = 0. 42
  • 43. 8 Characteristic Functions and the Central Limit Theorem Definition 8.1 – Characteristic function Let X be a random variable taking values in R with cumulative distribution function F = FX and law µ (i.e. µ is a measure on (R, B) such that µ(a, b) = F(b) − F(a) ∀ a ≤ b ∈ R). The characteristic function of X is given by φ : R → C such that: φ(θ) = E eiθX = E [cos(θX)] + iE [sin(θX)] = R eiθx µ(dx) = R eiθx dF(x) Remarks: 1. X ∼ Y ⇒ φX = φY where φX is the characteristic function of X and φY is the characteristic function of Y . 2. φ(θ) is well-defined for every θ ∈ R since eiθx = sin2 (θx) + cos2(θx) = 1 ∀ x, θ ∈ R. Hence eiθX ∈ L1 . Theorem 8.1 Let φ = φX be the characteristic function of a random variable X. Then: 1. φ(0) = 1 (by definition). 2. |φ(θ)| ≤ 1. 3. θ → φ(θ) is continuous ∀ θ ∈ R. Exercise: Prove this using DCT. 4. φ−X(θ) = φX(−θ) = φX(θ) ∀ θ ∈ R. 5. φaX+b(θ) = eiθb φX(aθ) ∀ a, b ∈ R. 6. If E [|X|n ] < ∞ for some n ∈ N, then φ (n) X (0) = in E[Xn ]. Exercise: Prove this using DCT. Theorem 8.2 If X and Y are independent, then φX+Y (θ) = φX(θ)φY (θ) ∀ θ ∈ R. Remark: If E eiαX+iβY = E eiαX E eiβY ∀ α, β ∈ R, then X and Y are independent. 43
  • 44. Theorem 8.3 – Levy’s inversion formula Let φ be a characteristic function of a random variable X with law µ and cumulative distribution function F. Then: lim T→∞ 1 2π T −T e−iθa − e−iθb iθ φ(θ)dθ = 1 2 µ({a}) + µ(a, b) + 1 2 µ({b}) = − 1 2 FX(a) + FX(a− ) + 1 2 FX(b) + FX(b− ) where F(a− ) = lim x a F(x). Proof. Elementary. Exercise Remarks: If φ ∈ L1 (R, B, γL), then Levy’s inversion formula (Theorem 8.3) implies that X has a density fX : R → R+ : 1 2π R e−iθa − eiθb iθ φX(θ) dθ = FX(b) − FX(a) = b a fX(y) dy Furthermore, we have fX(x) = 1 2π R e−iθx φX(θ) dθ. Theorem 8.4 – Levy’s convergence theorem Let Fn, n ∈ N be a sequence of cumulative distribution functions with characteristic function: φn(θ) = R eiθx dFn(x) Suppose that: • g(θ) = lim n→∞ φn(θ) ∀ θ ∈ R •g is continuous at 0. Then g is a characteristic function of some cumulative distribution function F i.e. g(θ) = R eiθx dF(x) and Fn d −→ F (i.e. Fn(x) → F(x) ∀ x ∈ R s.t. F is continuous at x). Proof. Proof given in Williams: Probability with Martingales. Theorem 8.5 – Central limit theorem Let (Xn)n∈N be a sequence of independent identically distributed random variables such that E [X2 1 ] = σ2 < ∞ and E [X1] = 0. Let Sn = n i=1 Xi and Gn = 1 √ nσ Sn. Then Gn d −→ N(0, 1). 44
• 45. Remark: If Xi ∼ N(0, 1) for each i ∈ N, then Gn ∼ N(0, 1) ∀ n ∈ N.
Proof. Note that φGn(θ) = E[ exp( i θ/(σ√n) Σi=1,...,n Xi ) ] = ( φX1( θ/(σ√n) ) )ⁿ, since the Xi are independent and identically distributed.
Note that since E[X1²] < ∞, we have:
φX1(θ) = 1 + iθ E[X1] + (iθ)²/2! E[X1²] + o(θ²) = 1 − σ²θ²/2 + o(θ²), using E[X1] = 0.
Hence φGn(θ) = ( φX1( θ/(σ√n) ) )ⁿ = ( 1 − θ²/(2n) + o( (θ/(σ√n))² ) )ⁿ.
Using the result proved in the course that (1 + bn/n)ⁿ → e^b as n → ∞ whenever bn → b ∈ R, we have:
lim n→∞ φGn(θ) = φ(θ) = e^(−θ²/2)
It is well known that ∫R e^(iθx) (1/√(2π)) e^(−x²/2) dx = e^(−θ²/2), i.e. e^(−θ²/2) is the characteristic function of N(0, 1). Since this limit is continuous at 0, Levy's convergence theorem (Theorem 8.4) gives Gn →d N(0, 1). 45
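A simulation of the Central Limit Theorem (an added sketch, not from the notes; the choice Xi ∼ Uniform(−1, 1), for which E[X1] = 0 and σ² = 1/3, as well as n and the number of trials, are illustrative): the empirical cdf of Gn = Sn/(σ√n) should be close to the standard normal cdf Φ.

```python
import numpy as np
from scipy.stats import norm

# CLT simulation for assumed X_i ~ Uniform(-1,1): G_n = S_n / (sigma sqrt(n)) vs N(0,1).
rng = np.random.default_rng(6)
n, trials = 200, 100_000
sigma = np.sqrt(1.0 / 3.0)
G = rng.uniform(-1.0, 1.0, size=(trials, n)).sum(axis=1) / (sigma * np.sqrt(n))
for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(x, np.mean(G <= x), norm.cdf(x))   # empirical P[G_n <= x] vs Phi(x)
```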
  • 46. 9 Conditional Expectation & Martingales Example 9.1 Let X be a random variable on (Ω, F, P) that takes values in A = {X1, X2, . . . , Xm}, P [X ∈ A] = 1, and let Y be a random variable on (Ω, F, P) such that P [Y ∈ B] = 1, B = {y1, . . . , yn}. In particular, we assume that P [Y = Yi] > 0 ∀ i = 1, . . . , n. We have: E [X | Y = yi] = m j=1 xj · P [X = xj | Y = yi] = m j=1 xj · P [X = xj, Y = yi] P [Y = yi] = F(Yi), F : B → R. In other words, E [X | Y ] = F(Y ). Note that: E I{Y =yi}F(Y ) = P [Y = yi] · m j=1 xjP [X = xj | Y = yi] = m j=1 xj · P [X = xj, Y = yi] = E X · I{Y =yi} Remarks: 1. To define E [X | Y ], X and Y have to be defined on the same probability space. 2. Note that E [X | Y ] is a random variable in mσ(Y ) such that E [E [X | Y ] · IG] = E [X · IG] ∀ G ∈ σ(Y ). Definition 9.1 – Version of conditional expectation Let X be a random variable on L1 (Ω, F, P) and let G ⊆ F be a sub−σ−algebra. If ˆX satisfies: i) ˆX ∈ mG ii) E ˆX · IG = E [X · IG] ∀ G ∈ G then ˆX is a version of conditional expectation E [X | G] of X given G. We denote ˆX = E [X | G] (P−a.s.). 46
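A numeric check of Example 9.1 (not part of the notes; the finite value sets and the joint pmf below are invented for illustration): compute F(yi) = E[X | Y = yi] from the joint probabilities and verify the identity E[I{Y=yi} F(Y)] = E[X · I{Y=yi}].

```python
import numpy as np

# Example 9.1 with an assumed joint pmf on x-values {0,1,2} and y-values {0,1}:
# joint[j, i] = P[X = x_j, Y = y_i].
x_vals = np.array([0.0, 1.0, 2.0])
y_vals = np.array([0.0, 1.0])
joint = np.array([[0.10, 0.20],
                  [0.30, 0.15],
                  [0.05, 0.20]])
p_y = joint.sum(axis=0)                            # marginal of Y
F = (x_vals[:, None] * joint).sum(axis=0) / p_y    # F(y_i) = E[X | Y = y_i]
for i, y in enumerate(y_vals):
    lhs = p_y[i] * F[i]                            # E[ 1_{Y=y_i} F(Y) ]
    rhs = (x_vals[:, None] * joint)[:, i].sum()    # E[ X 1_{Y=y_i} ]
    print(y, F[i], lhs, rhs)                       # lhs == rhs for every y_i
```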
  • 47. Remarks: 1. If X ∈ mG satisfies ii) in Definition 9.1, then X = ˆX (P−a.s.). Proof. Note that X > ˆX , ˆX > X ∈ G and that: 0 ≤ E X − ˆX · I{X− ˆX} = E X · I{X− ˆX} − E ˆX · I{X− ˆX} = E X · I{X− ˆX} − E X · I{X− ˆX} by ii) in Def 9.1 = 0 ⇒ X ≤ ˆX (P − a.s.) Similarly, by looking at the event ˆX > X , we find that ˆX ≤ X (P−a.s.). Therefore, this implies that ˆX = X (P−a.s.). 2. Note that for ii) in Definition 9.1, we implicitly assume that E [X · IG] is well-defined. Hence we use the fact that X ∈ L1 . (|XIG| ≤ |X|) 3. In Definition 9.1, we can also assume that X ∈ (mF)+ and drop X ∈ L1 . Theorem 9.1 – Conditional Expectation Let X ∈ L1 (Ω, F, P) or X ∈ (mF)+ . Then conditional expectation E [X |G] exists and is unique (P−a.s.). (i.e. if X, ˆX are both versions of E [X |G], then X = ˆX (P−a.s.)) Proof. We consider 3 cases. X ∈ L2 , X ∈ L1 , and X ∈ (mF)+ . Case 1: If X ∈ L2 , then this implies there exists a unique Y ∈ L2 (Ω, G, P) such that E [(X − Y ) · IG] = 0 ∀ G ∈ G. This is equivalent to E [IG · Y ] = E [X · IG] ∀ G ∈ G, which implies that Y is a version of E [X | G]. Case 2: If X ∈ (mF)+ , then let Xn = min{X, n}. Note that Xn ∈ L2 , Xn X as n → ∞ almost surely. Now we have 0 ≥ E ˆX · I{ ˆXn<0} = E Xn · I{ ˆXn<0} ≥ 0 by ii) in Definition 9.1. This implies that E ˆXn · I{ ˆXn<0} = 0 which in turn implies that P ˆXn < 0 = 0. Hence we have ˆXn = E [Xn | G] and 0 ≤ ˆXn ≤ ˆXn+1. To prove ˆXn ≤ ˆXn+1 (P−a.s.), note that Xn+1−Xn ≥ 0 (P−a.s.), and that E [Xn+1 − Xn | G] = ˆXn+1 − ˆXn implies that ˆXn+1 − ˆXn ≥ 0. Hence there ∃ ˆX = lim n→∞ ˆXn ∈ mG since Xn X. We then have E ˆX · IG = E [X · IG] ∀ G ∈ G by Monotone Convergence Theorem (The- orem 4.1). 47
  • 48. Remarks: 1. X ∈ L1 ⇒ ˆX ∈ L1 (Ω, G, P) Proof. We can write ˆX = ˆX+ − ˆX− , with ˆX = max 0, ˆX , ˆX− = min 0, − ˆX . We need to show that ˆX+ , ˆX− ∈ L1 (Ω, G, P). Note that ˆX+ = ˆX · I{ ˆX≥0}, where ˆX ≥ 0 ∈ G. Then E I{ ˆX≥0} · ˆX = E X · I{ ˆX≥0} ∈ R (finite) A similar argument implies E ˆX− ∈ R. 2. If X ∈ (mF)+ ⇒ ˆX ≥ 0 (P−a.s.). Proof. Take ˆX < 0 ∈ G and note that: 0 ≤ E X · I{ ˆX<0} = E ˆX · I{ ˆX<0} ≤ 0 This implies that P ˆX < 0 = 0 ⇔ ˆX ≥ 0 (P−a.s.). 3. If ∃ X ∈ mG and satisfies E [X · IG] = E XIG ∀ G ∈ G, then X = ˆX (P−a.s.). Proof. To prove this, note the following: E X − ˆX · I{X> ˆX} = E X · I{X> ˆX} − E X · I{X> ˆX} = 0 (9.1) This implies that P X > ˆX = 0. (9.1) holds if X ∈ L1 (Ω, F, P). Similarly, one can show that P ˆX > X = 0. Therefore the statement follows if X ∈ L1 (Ω, F, P). However, if X ≥ 0, and E[X] > ∞, then an approximation argument and (9.1) yields the statement. Exercise. Hint: apply (9.1) to Xn = min {X, n}. 48
  • 49. Theorem 9.2 Let (Ω, F, P) be our probability space, and let X, Y ∈ mF. Take G, H sub−σ−algebras in F. Then: a) If X ∈ mG and X ∈ L1 or X ∈ (mG)+ , then E [X |G] = X (P−a.s.). b) If X, Y ∈ L1 (Ω, F, P), and a, b ∈ R, then E [aX + bY | G] = aE [X | G] + bE [Y | G]. c) X ∈ L1 (Ω, F, P) on X ∈ (mF)+ , then E [E [X | G]] = E[X]. d) X ∈ mG and assume either X, Y ∈ L2 (Ω, F, P) or X, Y ∈ (mF)+ , then E[XY | G] = XE [Y | G] (P−a.s.). e) If X ∈ L1 (Ω, F, P) or X ∈ (mF)+ and H is independent of σ(X), then E[X |H] = E[X] (P−a.s.). f) (Tower Property): Let H ⊆ G and X ∈ L1 or X ∈ (mF)+ . Then: E [E[X | G] | H] = E[X | H] g) If X ≥ 0, then E[X | G] ≥ 0 (P−a.s.). h) (Jensen’s Inequality): If φ : R → R is convex such that φ(X), X ∈ L1 (Ω, F, P), then: E [φ(X) | G] ≥ φ (E [X |G]) (P − a.s.) i) Let f : R2 → R be B(R2 ) measurable, and X ∈ mG and Y independent of G, and f(XY ) ∈ L1 (Ω, F, P). Then g(X) : E [f(x, Y )] , x ∈ R (fixed x) defines a Borel measurable map g : R → R which satisfies E [f(X, Y ) | G] = g(X) (P−a.s.). Proof. For some parts: c) This is clear from the definition. E [IG · E [X | G]] = E [IG · X] ∀ G ∈ G by taking G = Ω. d) If X = IG, G ∈ G, then our statement d) follows by the definition of conditional expectation since E [IG · Y | G] = IG · E [Y | G]. For any A ∈ G, we need to see that E [IA · IG · E [Y | G]] = E [IAIG · Y ]. However, this holds since E [IA∩G · E [Y |G]] = E [IA∩G · Y ] ∀ A ∈ G and IAIG = IA∩G. Our statement d) holds by approximating X ∈ mG by simple functions and proving E [IA · X · E [Y | G]] = E [IA · XY ] ∀ A ∈ G. e) We need to prove that ∀ H ∈ H, we have E [IH · E [X]] = E [IH · X]. But E [IH · E[X]] = E [IH] · E[X] since IH, X are independent. f) Pick H ∈ H and note that: E IH · ˆX = E [E [IHE [X | G] | H]] by d) = E [E [E [IH · X | G] | H]] by d), and that H ∈ H ⊆ G = E [IH · X] by applying c) twice This implies the Tower Property. 49
• 50. 10 Filtrations, martingales and stopping times
Here, our time index set is T ∈ {N, Z+}, where Z+ = N ∪ {0}.
Definition 10.1 – Filtration
Let (Ω, F, P) be a probability space. A filtration indexed by T is a non-decreasing family of σ−algebras (Ft)t∈T on (Ω, F, P), i.e. Fs ⊆ Ft ⊆ F ∀ s, t ∈ T s.t. s ≤ t.
Definition 10.2 – (Stochastic) Process
A process X = (Xt)t∈T is a collection of random variables on (Ω, F, P).
Definition 10.3 – Adapted
The process X = (Xt)t∈T is adapted to the filtration (Ft)t∈T if Xt ∈ mFt ∀ t ∈ T.
Definition 10.4 – Filtered probability space
(Ω, F, P) together with a filtration (Ft)t∈T is called a filtered probability space (Ω, F, (Ft)t∈T, P).
Definition 10.5 – Martingale
A process M = (Mt)t∈T is a martingale on a filtered probability space (Ω, F, (Ft)t∈T, P) if:
a) M is adapted to (Ft)t∈T, i.e. Mt ∈ mFt ∀ t ∈ T.
b) Mt ∈ L1(Ω, F, P) ∀ t ∈ T, i.e. E[|Mt|] < ∞ ∀ t.
c) For any s ≤ t, s, t ∈ T, we have E[Mt | Fs] = Ms (P−a.s.).
Definition 10.6 – Submartingale
M = (Mt)t∈T is a submartingale if a) and b) hold in Definition 10.5 and E[Mt | Fs] ≥ Ms (P−a.s.) ∀ s ≤ t.
Definition 10.7 – Supermartingale
M = (Mt)t∈T is a supermartingale if a) and b) hold in Definition 10.5 and E[Mt | Fs] ≤ Ms (P−a.s.) ∀ s ≤ t.
Remarks:
1. Note that for T = Z+, c) in Definition 10.5 is equivalent to E[(Mn+1 − Mn) · IA] = 0 ∀ A ∈ Fn and all n ∈ T.
Proof. The one-step condition says E[Mn+1 | Fn] = Mn ∀ n, so we need to show that it implies E[Mn+k | Fn] = Mn (P−a.s.) for every k ∈ N. For example:
E[Mn+2 | Fn] = E[ E[Mn+2 | Fn+1] | Fn ] by the Tower Property
= E[Mn+1 | Fn] = Mn
and the general case follows by induction on k. 50
  • 51. Exercise: Show that in Definition 10.7, E [Mt | IS] ≤ Ms (P−a.s.) is equivalent to E [(Mn+1 − Mn) · IA] ≤ 0 ∀ A ∈ Fn ∀ n ∈ T. Example 10.1 1. X ∈ L1 (Ω, F, P) on a filtered probability space (Ω, F, (Ft)t∈T, P). Then Mt = E [X | Ft] is a martingale. c) from Definition 10.5 follows from the Tower Property of condi- tional expectation. 2. Let Xi, i ∈ N be iid random variables such that P [Xi = 1] = p, P [Xi = −1] = 1 − p, with p ∈ (0, 1). Define Mk = k i=1 Xi. Claim M = (Mk)k∈N is a supermartingale if and only if p ≤ 1 2 , M is a submartingale if p ≥ 1 2 . Hence M is a martingale if and only if p = 1 2 . Proof. Here, we use Fk = σ(X1, . . . , Xk), and show that the properties in the definition of a martingale are satisfied. a) M is adapted to (Fk). Mk ∈ Fk is true since Fk = σ(M1, . . . , Mk) since there exists matrix A ∈ Rk×k such that:      X1 X2 ... Xk      = A      M1 M2 ... Mk      and A−1      X1 X2 ... Xk      =      M1 M2 ... Mk      with A−1 =      1 0 . . . 0 1 1 . . . 0 ... ... ... ... 1 1 . . . 1      So there is a bijection between these two vectors which implies Fk = σ(M1, . . . , Mk). b) Mk ∈ L1 ∀ k ∈ N. This holds since Xi, i = 1, . . . , k are in L1 . (because Xi = ±1, so Xi ≤ k < ∞ which is bounded) c) We have: E[ Mk+1 Xk+1+Mk | Fk] = Mk + E [Xk] as Mk ∈ mFk and Xk + 1 is independent on Fk = Mk + p(1) − (1 − p)(1) = Mk + 2p − 1 So for p ∈ 0, 1 2 , we have a supermartingale, and for p ∈ 1 2 , 1 , we have a submartin- gale. This proves the equivalences above. Definition 10.8 – Stopping time Let (Ω, F, (Ft)t∈T, P) be a filtered probability space. A random variable τ : Ω → T ∪ {∞} is a stopping time relative to the filtration (Ft)t∈T if {τ ≤ t} ∈ Ft ∀ t ∈ T. 51
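Returning to Example 10.1(2), the following sketch (not part of the notes; the values of p and the sample size are illustrative) estimates the one-step conditional drift E[Mk+1 − Mk | Fk]. Since Xk+1 is independent of Fk, this drift equals the unconditional mean 2p − 1, so M is a martingale exactly when p = 1/2.

```python
import numpy as np

# One-step drift of the +/-1 random walk: E[M_{k+1} - M_k | F_k] = 2p - 1,
# so M is a supermartingale for p <= 1/2, a submartingale for p >= 1/2.
rng = np.random.default_rng(7)
for p in [0.3, 0.5, 0.7]:
    steps = rng.choice([1, -1], size=1_000_000, p=[p, 1 - p])
    print(p, "empirical drift:", steps.mean(), "  2p - 1 =", 2 * p - 1)
```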
  • 52. Remark: In case T = Z+ , then τ is a stopping time if and only if {τ = t} ∈ Ft ∀ t ∈ Z+ (since {τ = t} = {τ ≤ t} {τ ≤ t − 1} and {τ ≤ t} = t k=0 {τ = k}). Example 10.2 1. Let M = (Mk)k∈N, Mk = k i=1 Xi as before. Then τa = inf {t ∈ N : Mt = a} (a ∈ Z). Note that {τa ≤ t} = t k=1 M−1 k ({a}) ∈Fk ∈ Ft and Fk ⊆ Ft ∀ k ≤ t. 2. Every constant time t ∈ T is a stopping time. Example 10.3 Suppose we have M0 = 0, Mk = k i=1 Xi, with Xi iid Bernoulli random variables, P[Xi = 1] = p, P[Xi = −1] = 1 − p, p ∈ (0, 1). We have H1 = 1, Hk = 2k−1 I{Xi=−1:i=1,...,k−1}. Note that Hk ∈ mFk−1. Let Nk = k i=1 Hi (Mi − Mi−1) Xi be the gains process (N = (Nk)k∈N). Note Nk =∈ mFk, Fk = σ(X1, . . . , Xk). Also, Nk ∈ L1 ∀ p and if p ≥ 1 2 , then E [Nk+1 | Fk] = Nk (Exercise: Check this) Furthermore, N is a supermartingale if and only if p ≤ 1 2 . Let: τ = inf {t ∈ N | Mt > Mt−1} = inf {t ∈ N | Xt = 1, Xi = −1 ∀ i = 1, . . . , t − 1} Exercise: Show τ is a stopping time, i.e. show P[τ = n] = p(1 − p)n−1 n ∈ N. Note that: Nk = k i=1 Hi(Mi − Mi−1) = 1 − 2k Xi = −1, i = 1, 2, . . . , k 1 ∃ i = {1, . . . , k} s.t. Xi = 1 which implies that Nτ = 1 (P−a.s.). 52
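To close, a simulation of the doubling strategy in Example 10.3 (an added sketch, not from the notes; p = 1/2 and the number of simulated paths are illustrative): the stake Hk = 2^(k−1) is doubled after each loss, and on every path the gain at the stopping time τ equals 1, even though the running losses before τ can be large. This is the classical illustration of why such a stopped supermartingale strategy is not a "free lunch": the intermediate exposure is unbounded.

```python
import numpy as np

# Doubling strategy of Example 10.3 with p = 1/2: stake 2^{k-1} until the first
# +1 step (time tau); the terminal gain N_tau equals 1 on every simulated path.
rng = np.random.default_rng(8)
for _ in range(5):
    N, stake, k = 0, 1, 0
    while True:
        k += 1
        X = rng.choice([1, -1])       # fair coin, p = 1/2
        N += stake * X
        if X == 1:
            break
        stake *= 2                    # double the stake after every loss
    print("tau =", k, "  N_tau =", N, "  worst running loss =", 1 - stake)
```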