Course Sampler From ATI Professional Development Short Course

                         Fundamentals of Engineering Probability
                  Visualization Techniques & MATLAB Case Studies


                                               Instructor:
                                  Dr. Ralph E. Morganstern




ATI Course Schedule:             http://www.ATIcourses.com/schedule.htm

ATI's Engineering Probability:   http://www.aticourses.com/Fundamentals_of_Engineering_Probability.htm
www.ATIcourses.com

Boost Your Skills with On-Site Courses Tailored to Your Needs

                                                              349 Berkshire Drive
                                                              Riva, Maryland 21140
                                                              Telephone 1-888-501-2100 / (410) 965-8805
                                                              Fax (410) 956-5785
                                                              Email: ATI@ATIcourses.com

The Applied Technology Institute specializes in training programs for technical professionals. Our courses keep you
current in the state-of-the-art technology that is essential to keep your company on the cutting edge in today’s highly
competitive marketplace. Since 1984, ATI has earned the trust of training departments nationwide, and has presented
on-site training at the major Navy, Air Force and NASA centers, and for a large number of contractors. Our training
increases effectiveness and productivity. Learn from the proven best.

For a Free On-Site Quote Visit Us At: http://www.ATIcourses.com/free_onsite_quote.asp

For Our Current Public Course Schedule Go To: http://www.ATIcourses.com/schedule.htm
Fundamental Probability Concepts
          • Probabilistic Interpretation of Random Experiments (P)
              – Outcomes: sample space
              – Events: collection of outcomes (set theoretic)
              – Probability Measure: assign number “probability” P ∈ [0,1] to event
          • Dfn#1-Sample Space (S): Fine-grained enumeration (atomic - parameters)
              – List all possible outcomes of a random experiment
              – ME - Mutually exclusive - Disjoint “atomic”
              – CE - Collectively exhaustive - Covers all outcomes
          • Dfn#2- Event Space (E): Coarse-grained enumeration (re-group into sets)
              – ME & CE List of Events
        [Figure: Venn diagram of the sample space S (all outcomes) built from atomic outcomes (disjoint by dfn).
        Events A, B, C are ME but not CE; events A, B, C, D are both ME & CE.]



Discrete parameters uniquely define the coordinates of the Sample Space (S), and the collection of all
parameter coordinate values defines all the atomic outcomes. As such, atomic outcomes are mutually
exclusive (ME) and collectively exhaustive (CE) and constitute a fundamental representation of the Sample
Space S.
By taking ranges of the parameters such as A, B, C, and D, one can define a more useful Event Space, which
should consist of ME and CE events that cover all outcomes in S without overlap, as shown in the figure.




Fair Dice Event Space Representations
           • Coordinate Representation:
                   – Pair of 6-sided dice
                   – S = {(d1,d2): d1,d2 = 1,2,…,6}
                   – 36 outcomes, ordered pairs
                   [Figure: (d1,d2) grid, 1 to 6 on each axis, marking the events A: d1 = 3, d2 = arb.;
                    B: d1 + d2 = 7; and C: d1 = d2]

           • Matrix Representation:
                   – Cartesian product: {d1} x {d2} = d1 d2^T, the 6x6 array of ordered pairs
                     [(1,1) (1,2) … (1,6); (2,1) (2,2) … (2,6); … ; (6,1) (6,2) … (6,6)]

           • Tree Representation:
                   – Start; branch on d1 = 1,…,6, then on d2 = 1,…,6
                   – 36 leaves, the ordered pairs (1,1), (1,2), …, (6,6)

           • Polynomial Generator for Sum (2 dice):

               (x^1 + x^2 + x^3 + x^4 + x^5 + x^6)^2
                   = 1x^2 + 2x^3 + 3x^4 + 4x^5 + 5x^6 + 6x^7 + 5x^8 + 4x^9 + 3x^10 + 2x^11 + 1x^12

               Exponents of the base polynomial represent 6-sided die face numbers;
               exponents of the square represent pair sums; coefficients represent the # of ways.


It is helpful to have simple visual representations of Sample and Event Spaces.
For a pair of 6-sided dice, the coordinate, matrix, and tree representations are all useful. In addition,
the polynomial generator for the sum of a pair of 6-sided dice immediately gives the probability of each sum.
Squaring the polynomial (x^1+x^2+x^3+x^4+x^5+x^6)^2 yields a generator polynomial whose exponents represent
all possible sums for a pair of 6-sided dice, S = {2,3,4,5,6,7,8,9,10,11,12}, and whose coefficients C =
{1,2,3,4,5,6,5,4,3,2,1} represent the number of ways each sum can occur. Dividing the coefficients C by
the total # of outcomes 6^2 = 36 yields the probability “distribution” for the pair of dice.
Venn diagrams for two or three events are useful; for example, the coordinate representation in the top
figure can be used to visualize the following events:
 A = {d1 = 3, d2 arbitrary}, B = {d1 + d2 = 7}, and C = {d1 = d2}
Once we display these three events on the coordinate diagram their intersection properties are obvious, viz.,
both A & B and A & C intersect, albeit at different points, while B & C do not intersect (no point
corresponds to sum = 7 with equal dice values). More than three intersecting sets become problematic for
Venn diagrams, as the advantage of visualization is muddled somewhat by the increasing number of
overlapping regions in these cases (see next two slides).
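
Since squaring the die polynomial is equivalent to convolving its coefficient vector with itself, the
distribution can be generated in a few lines of MATLAB. The following is a minimal sketch of the idea
(not from the course materials):

    % Dice-sum distribution via the generating polynomial:
    % squaring the die polynomial = convolving its coefficients.
    die  = ones(1, 6);        % coefficients of x^1..x^6 (one way per face)
    ways = conv(die, die);    % coefficients of x^2..x^12: {1 2 3 4 5 6 5 4 3 2 1}
    sums = 2:12;
    prob = ways / 36;         % divide by total # outcomes 6^2 = 36
    disp([sums; ways; prob])  % each column: sum, # ways, probability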




Venn Diagram for 4 Sets
             4C0 = (4C1 4-Singles) - (4C2 6-Pairs) + (4C3 4-Triples) - (4C4 1-Quadruple)


             [Figure: Venn diagram of 4 sets A, B, C, D showing all 15 labeled regions:
              singles A, B, C, D; pairs AB, AC, AD, BC, BD, CD; triples ABC, ABD, ACD, BCD;
              and the quadruple ABCD.]


As we go to Venn diagrams with more than 3 sets the labeling of regions becomes a practical limitation to
their use. In this case of 4 sets A,B,C, D, the labeling is still pretty straightforward and usable.
The 4 singles A,B,C,D are labeled in an obvious manner at the edge of each circle.
The 6 pairs AB,AC,AD,BC,BD,CD are labeled at the intersection of two circles. The 4 triples ABC, ABD,
BCD, ACD are labeled within “curved triangular areas” corresponding to the intersections of three circles.
The 1 quadruple ABCD is labeled within the unique “curved quadrilateral area” corresponding to the
intersection of all four circles.




Trivial Computation of Probabilities of Events
         Ex#1 Pair of Dice
             S = {(d1,d2): d1,d2 = 1,2,…,6}
             E1 = {(d1,d2): d1 + d2 ≥ 10}        P(E1) = 6/36 = 1/6
             E2 = {(d1,d2): d1 + d2 = 7}         P(E2) = 6/36 = 1/6
             [Figure: (d1,d2) grid with diagonals sum = d1 + d2 = 2,…,12; E1 is the corner
              region with sum ≥ 10, E2 the diagonal with sum = 7]

         Ex#2 Two Spins on Calibrated Wheel
             S = {(s1,s2): s1,s2 ∈ [0,1]}
             E1 = {(s1,s2): s1 + s2 ≥ 1.5}       --> P(E1) = area(E1)/1 = (0.5)^2/2 = 1/8
             E2 = {(s1,s2): s2 ≤ 0.25}           --> P(E2) = 1·(0.25)/1 = 0.25
             E3 = {(s1,s2): s1 = 0.85, s2 = 0.35} --> P(E3) = 0/1 = 0
             [Figure: unit square in the (s1,s2)-plane showing E1 (corner triangle above
              s1 + s2 = 1.5), E2 (horizontal strip s2 ≤ 0.25), and E3 (a single point)]


For equally likely atomic events the probability of any Event is easily computed as (# atomic
outcomes in Event)/(total # outcomes). For a pair of dice, the total # of outcomes is 6·6 = 36, and hence
simply counting the # of points in E and dividing by 36 yields P(E), etc.
Two spins on a calibrated wheel [0, 1) can be represented by the unit square in the (s1, s2)-plane, and an
analogous calculation can be performed to obtain the probability of an event E by dividing the area
covered by the event by the area of the event space (“1”): P(E) = area(E)/1.
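
Both examples are easy to verify numerically; the sketch below counts grid points for the dice events
and uses a simple Monte Carlo (an assumption of this sketch, not the slide's method) for the wheel-spin
areas:

    % Ex#1: count equally likely outcomes on the 6x6 dice grid
    [d1, d2] = meshgrid(1:6, 1:6);          % all 36 ordered pairs
    P_E1 = sum(d1(:) + d2(:) >= 10) / 36    % = 6/36 = 1/6
    P_E2 = sum(d1(:) + d2(:) == 7)  / 36    % = 6/36 = 1/6

    % Ex#2: Monte Carlo check of the areas on the unit square
    s = rand(1e6, 2);                       % 1e6 spin pairs (s1, s2)
    P_w1 = mean(sum(s, 2) >= 1.5)           % ~ (0.5)^2/2 = 1/8
    P_w2 = mean(s(:, 2) <= 0.25)            % ~ 0.25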




DeMorgan's Formulas - Finite Unions and Intersections

         i) Compl(Union) = Intersec(Compls):    (E1 ∪ E2 ∪ … ∪ En)^c = E1^c ∩ E2^c ∩ … ∩ En^c

         ii) Compl(Intersec) = Union(Compls):   (E1 ∩ E2 ∩ … ∩ En)^c = E1^c ∪ E2^c ∪ … ∪ En^c

         Useful Forms:

         i') Union expressed as an Intersection:    (A ∪ B)^c = A^c B^c
                 ((A ∪ B)^c)^c = A ∪ B = (A^c B^c)^c

         ii') Intersection expressed as a Union:    (AB)^c = A^c ∪ B^c
                 ((AB)^c)^c = AB = (A^c ∪ B^c)^c

         [Figure - Visualization of i': panels show A ∪ B and (A ∪ B)^c; shading A^c and B^c and
          intersecting the two grey areas yields the single grey area A^c B^c, with A and B excluded;
          taking its complement (A^c B^c)^c yields the white area, i.e., A ∪ B]



DeMorgan's Laws for the complement of finite unions and intersections state that
i)   the complement of a union equals the intersection of the complements, and
ii)  the complement of an intersection equals the union of the complements.
The alternate forms obtained by taking the complements of the original equations are often more useful
   because they give a direct decomposition of the union and the intersection of two or more sets:
i')  the union equals the complement of the (intersection of complements);
ii') the intersection equals the complement of the (union of complements).

A graphical construction of A ∪ B = (A^c B^c)^c is also shown in the figure.
A^c and B^c are the two shaded areas in the middle panes, which exclude the A and B (white) ovals, respectively.
Intersecting these two shaded areas and taking the complement leaves the white oval area, which is A ∪ B.
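
The laws can also be spot-checked with logical indicator vectors in MATLAB; a minimal sketch with
illustrative sets of my own choosing:

    % De Morgan check with logical indicators over a toy universe S = 1..10
    S = 1:10;
    A = ismember(S, [1 2 3 4]);     % indicator of (hypothetical) set A
    B = ismember(S, [3 4 5 6]);     % indicator of (hypothetical) set B
    isequal(~(A | B), ~A & ~B)      % (A U B)^c = A^c B^c    -> true (1)
    isequal(~(A & B), ~A | ~B)      % (AB)^c    = A^c U B^c  -> true (1)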




Set Algebra Summary Graphic

            Union           A ∪ B = A ∪ A^c B = B ∪ B^c A

            Intersection    A ∩ B = A·B = AB
                            x ∈ AB iff x ∈ A & x ∈ B

            Difference      A - B ≡ A ∩ B^c = AB^c
                            x ∈ A - B iff x ∈ A and x ∉ B

            DeMorgans       A ∪ B = (A^c B^c)^c
                            AB = (A^c ∪ B^c)^c
                            (A ∪ B)^c = A^c B^c means: complement of (at least one) = (not any)

            [Figure: overlapping sets A and B showing the union A ∪ B, the intersection AB, and
             the differences "A-B" = B^c A and "B-A" = A^c B]


This summary graphic illustrates the set algebra for two sets A, B and their union, intersection, and
difference.
DeMorgan's Law can be interpreted as saying the complement of ("at least one") is ("not any").
Associativity and commutativity of the two operations allow extension to more than two sets.
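
MATLAB's built-in set functions mirror this algebra directly; a small sketch with hypothetical
example sets:

    % Union, intersection, and differences with MATLAB set functions
    A = [1 2 3 4];  B = [3 4 5 6];    % illustrative sets
    union(A, B)                       % A U B = [1 2 3 4 5 6]
    intersect(A, B)                   % AB    = [3 4]
    setdiff(A, B)                     % A - B = AB^c  = [1 2]
    setdiff(B, A)                     % B - A = A^c B = [5 6]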




Basic Counting Principles
        Principle #0: Take case n = 3 or 4; generalize to n
            Example - Binomial Expansion: (a+b)^3 --> (a+b)^n

        Principle #1: Product Rule for Sub-experiments:  n = n1 · n2 ⋯ nm = ∏(k=1..m) nk
            Generate a "tree" of outcomes.
            Example (deck of cards): 13 numbers, each with 4 suits {H, D, S, C}; # ways: 13 · 4 = 52
            Repetitions allowed:
                Licenses, 6 bins: 26 · 26 · 26 · 10 · 10 · 10 = 26^3 · 10^3
                Binary digits, 16 bins: 2 · 2 · 2 ⋯ 2 = 2^16 = 65,536

        Principle #2: Perm of n distinguishable obj taken k at a time ("fill k bins"):
            nPk = (n)k = n!/(n-k)!
            k = n (arrange all books: 11 travel, 5 cooking, 4 gardening): 11! · 5! · 4! · 3!
                (the 3! permutes the groups)
            k < n (11 travel books in 5 bins): 11 · 10 · 9 · 8 · 7

        Principle #3: Perm of n obj taken n with r groups of indistinguishable objects:
            # distinguishable sequences = n!/(n1! · n2! ⋯ nr!)
            Arrange letters of "TOOL": 4!/(2!·1!·1!) = 12
            Arrange {4 "r", 3 "s", 2 "o", 1 "t"}: 10!/(4!·3!·2!·1!) = 12,600

        Principle #4: Combination of n objects taken k (order not important!):
            nCk = n!/(k!(n-k)!),  k ≤ n
            = Principle #3 with the {taken, not taken} orderings not counted
            Committee of 4 from 22 people: 22C4 = 22!/((22-4)!·4!) = 22!/(18!·4!) = 7315
            Committee of 3 {2M, 1F} from {6M, 3F}: 6C2 · 3C1 = (6·5/2!) · 3 = 45



Outcomes must be distinguished by labels. They are characterized by either i) distinct orderings or ii)
distinct groupings. A grouping consists of objects with distinct labels; changing the order within a group is
not a new group, but it is a new permutation. The four basic counting principles for groups of distinguishable
objects are summarized and examples of each are displayed in the table.
Principle #0: This is practical advice to solve a problem with n = 2, 3, 4 objects first and then generalize
the "solution pattern" to general n.
Principle #1: This product rule is best understood in terms of the multiplicative nature of outcomes as we
"branch out" on a tree. For a single draw from a deck of cards there are 13 "number" branches and, in
turn, each of these has 4 "suit" branches, yielding 13·4 = 52 distinguishable cards or outcomes.
Principle #2: Permutation (ordering) of n objects taken k at a time is best understood by setting up "k
containers," putting one of "n" in the first, one of "n-1" in the second, ..., and finally one of "n-k+1" in the
kth container. The total # of ways is obtained by the product rule as n·(n-1)·...·(n-k+1) = n!/(n-k)!
Principle #3: Permutation of all "n" objects consisting of "r" groups of indistinguishable objects, e.g.,
{3 "t", 4 "s", 5 "u"}. If all objects were distinguishable the result would be n! permutations; however,
permutations within the r groups do not create new outcomes, and therefore we divide by factorials of the
numbers in each group to obtain n!/(n1!·n2!·...·nr!).
Principle #4: Combination of n objects taken k at a time is related to Principles #2 and #3. There are n!
permutations; ignoring permutations within the r = 2 groups {"taken", "not taken"} yields n!/(k!(n-k)!).
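
The worked numbers in the table can be reproduced directly with factorial and nchoosek; a short
MATLAB sketch:

    % Principle #2: 11 travel books in 5 bins = 11*10*9*8*7
    factorial(11) / factorial(11 - 5)                                % = 55440
    % Principle #3: distinguishable arrangements
    factorial(4)  / (factorial(2)*factorial(1)*factorial(1))        % 'TOOL' = 12
    factorial(10) / (factorial(4)*factorial(3)*factorial(2)*factorial(1))  % = 12600
    % Principle #4: committees
    nchoosek(22, 4)                     % committee of 4 from 22 = 7315
    nchoosek(6, 2) * nchoosek(3, 1)     % committee {2M,1F} from {6M,3F} = 45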




Counting with Replacement
         Select "B" from Alphabet and Replace
             - Refills drop down: always have 26 letters to choose from
             [Figure: candy-machine diagram; after any letter A…Z is drawn, an identical refill drops down]

         Permutation of "n" obj with replacement taken "k" at a time:
             /nPk = (# replaceable objects)^(# draws) = n^k
             Bins:  n  n  n … n   over bin # 1, 2, 3, …, k
             Example (n = 2, k = 3): the tree of draws from {A, B} gives 2^3 = 8 distinct orderings,
             which collapse into 4 distinct groupings:
                 {AAA}                  3 "A"
                 {AAB}, {ABA}, {BAA}    2 "A" & 1 "B"
                 {ABB}, {BAB}, {BBA}    2 "B" & 1 "A"
                 {BBB}                  3 "B"

         Combination of "n" obj with replacement taken "k" at a time:
             /nCk = (effective # objects = n + (k-1), draw k) = (n+k-1)Ck = (n+k-1)C(n-1)
             Note: "k" can be larger than "n"

         Example: From 2 objects {A, B} choose 3 with replacement (Only Way!)
             After each draw of an A or B, "drop down a replacement"; add 1 after each draw except
             the last: (effective # objects) = 2 + (3-1) = 4
             /2C3 = (2+3-1)C3 = 4C3 = 4!/(3!·1!) = 4 outcomes: {AAA}, {BBB}, {ABB}, {AAB}



Counting permutations and combinations with replacement is analogous to a candy machine purchase in
which a new object drops down to replace the one that has been drawn, thus giving the same number of
choices in each draw.
Permutation of n obj taken k at a time with replacement: each of the k draws has the same number of
outcomes n because of replacement; the result is n·n·n ⋯ n = n^k and is written as nPk with an "over-slash"
on the permutation symbol. The case n = 2, k = 3 of 3 draws with 2 replaceable objects {A, B} shows the
slash-2P3 = 2^3 = 8 permutations that result.
Combination of n obj taken k at a time with replacement: for n = 2, k = 3, "2 take 3" does not make any
sense without replacement. However, with replacement it does, since each draw except the last drops down an
identical item, and hence the number of items to choose from becomes n + (k-1), with slash-nCk = (n+(k-1))Ck.
The tree verifies this formula and explicitly shows that there are 4 distinct groupings {3A, 3B, 2A1B, 1A2B},
exactly the number of combinations with replacement given by the general formula
slash-2C3 = (2+(3-1))C3 = 4C3 = 4.
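
The formula and the brute-force tree count can both be reproduced in MATLAB; a minimal sketch for the
n = 2, k = 3 example:

    % Combinations with replacement: n objects taken k at a time -> C(n+k-1, k)
    n = 2;  k = 3;
    nchoosek(n + k - 1, k)                  % = 4 distinct groupings
    % Brute-force check: enumerate all n^k ordered draws, treat each
    % sorted draw as a multiset (grouping), and count distinct ones
    [a, b, c] = ndgrid(1:n, 1:n, 1:n);      % all 2^3 = 8 ordered triples
    draws = sort([a(:) b(:) c(:)], 2);      % sort each row -> grouping
    size(unique(draws, 'rows'), 1)          % = 4 distinct groupings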




II) Fundamentals of Probability


                   1.   Axioms
                   2.   Formulations: Classical, Frequentist, Bayesian, Ad Hoc
                   3.   Adding Probabilities: Inclusion / Exclusion, CE & ME
                   4.   Application of Venn Diagrams & Trees
                   5.   Conditional Probability & Bayes’ “Inverse Probability”
                   6.   Independent versus Disjoint Events
                   7.   System Reliability Analysis





As a theory, Probability is based on a small set of axioms which set forth fundamental properties of its
construction.
In practice, probability may be formulated theoretically, experimentally, or subjectively, but must always
obey the basic Axioms.
Evaluating probabilities for events is naturally developed in terms of their unions and intersections using
Venn Diagrams, Trees, and Inclusion/Exclusion techniques.
Conditional probabilities, their inverses (Bayes' theorem), and the dependence between two or more events
flow naturally from the basic axioms of probability.
System reliability analysis utilizes all of these fundamental concepts.




Inclusion / Exclusion Ideas
         ME Events A, B - Disjoint, AB = φ:
             P(A ∪ B) = P(A) + P(B)        No intersections: "Add Prob"
             [Figure: two non-intersecting circles A and B]

         Not Disjoint, AB ≠ φ - Intersect ("CE, not ME"); "recast" as a disjoint union ("CE & ME"):
             [Figure: overlapping circles A and B with intersection AB, redrawn as disjoint pieces A and B-A]
             P(A ∪ B) = P(A) + P(B-A) = P(A) + P(BA^c)
             B = B·S = B·(A ∪ A^c) = BA ∪ BA^c, so P(BA^c) = P(B) - P(AB)

         Intersection "AB" counted twice!!  P(A ∪ B) ≠ P(A) + P(B)
         Subtract P(AB) from the sum; count it only once:

             P(A ∪ B) = P(A) + P(B) - P(AB)

         Generalization by induction: let D = B ∪ C
             P(A ∪ B ∪ C) = P(A ∪ D) = P(A) + P(D) - P(AD) = P(A) + P(B ∪ C) - P(A·(B ∪ C))
                          = P(A) + {P(B) + P(C) - P(BC)} - {P(AB) + P(AC) - P(ABAC)}    (note ABAC = ABC)

             P(A ∪ B ∪ C) = P(A) + P(B) + P(C) - P(AB) - P(AC) - P(BC) + P(ABC)     Inclusion / Exclusion
                            add singles           subtract pairs         add triples



It is important to realize that although probabilities are simply numbers that add, the probability of the
union of two events P(A U B) is not equal to the sum of the individual probabilities for the two events,
P(A) + P(B).
This is because points in the overlap region AB are counted twice; to correct for this, one needs to subtract
out once the double-counted points in the overlap, yielding P(A U B) = P(A) + P(B) - P(AB).
Only in the case of non-intersection, AB = φ, does the simple sum of probabilities hold.
The generalization for a union of three or more sets alternates inclusion and exclusion; for A, B, C the
probability P(A U B U C) adds the singles, subtracts the pairs, and adds the triple as shown.
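
The three-event formula is easy to spot-check by counting, reusing the events A, B, C from the earlier
fair-dice slide; a minimal MATLAB sketch:

    % Verify P(A U B U C) = singles - pairs + triple on the dice grid
    [d1, d2] = meshgrid(1:6, 1:6);
    A = (d1 == 3);  B = (d1 + d2 == 7);  C = (d1 == d2);
    P = @(E) sum(E(:)) / 36;               % counting probability
    lhs = P(A | B | C);
    rhs = P(A) + P(B) + P(C) - P(A&B) - P(A&C) - P(B&C) + P(A&B&C);
    [lhs, rhs]                             % identical values (16/36)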




Venn Diagram Application: Inclusion/Exclusion
         Given the following information, find how many club members play at least one sport, T or S or B.

             Club: 36 T, 28 S, 18 B
             [Figure: Venn diagram with overlap counts TS (22), TB (12), SB (9), TSB (4)]

             Let N = Total # members (unknown)

             Write probabilities as P(T) = 36/N; P(S) = 28/N; P(B) = 18/N; etc.

         Method 1: Substitute into the formula for the union:
             P(T ∪ S ∪ B) = P(T) + P(S) + P(B) - P(TS) - P(TB) - P(BS) + P(TBS)
                          = 36/N + 28/N + 18/N - 22/N - 12/N - 9/N + 4/N
                          = 43/N
             Thus 43 of the "N" club members play at least one sport. (N is irrelevant.)

         Method 2: Disjoint Union - Graphical:
             T ∪ S ∪ B = T ∪ ST^c ∪ BT^cS^c
             [Figure: second Venn diagram with disjoint region counts: T-only 6, S-only 1, B-only 1,
              TS-only 18, TB-only 8, SB-only 5, TSB 4; the pieces ST^c (6) and BT^cS^c (1)]
             P(T ∪ S ∪ B) = P(T) + P(ST^c) + P(BT^cS^c) = 36/N + 6/N + 1/N = 43/N



This example illustrates the ease with which a Venn diagram can display the probabilities associated with the
various intersections of 3 sets T, S, and B.
The number of elements in each of the 7 distinct regions is easily read off the figure; they are required to
establish the total number in the union T U S U B via the inclusion/exclusion formula.
Another method of finding P(T U S U B) is to decompose the union T U S U B into a union of disjoint sets
T* U S* U B*, for which the probability is additive, i.e., P(T* U S* U B*) = P(T*) + P(S*) + P(B*).
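
The arithmetic of both methods is trivial to verify in MATLAB (a small sketch of the two computations):

    % Club example: both methods give the same count of 43
    T = 36;  S = 28;  B = 18;  TS = 22;  TB = 12;  SB = 9;  TSB = 4;
    method1 = T + S + B - TS - TB - SB + TSB     % inclusion/exclusion = 43
    STc   = S - TS;                              % S but not T      = 6
    BTcSc = B - TB - SB + TSB;                   % B but not T or S = 1
    method2 = T + STc + BTcSc                    % disjoint union   = 43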




Matching Problem – 1
        "N" men throw hats onto the floor; each man in turn randomly draws a hat.
        a) No Matches - Find the probability that none draws his own hat.
           Let Event Ei = ith man chooses his own hat; compute:  P(0 matches) = 1 - P(E1 ∪ E2 ∪ … ∪ EN)

           [Figure: hats 1 | 2 | 3 | … | k | k+1 | … | N assigned to men i1 | i2 | … | iN; n of the
            "Ei"s choose their own hats, and the remaining (N-n) do not matter (matched or not matched)]

           Probability that M1 & M2 & … & Mn draw their own hats, irrespective of what the other men draw:
               P(Ei1 Ei2 ⋯ Ein) = (# perms)/(Total # perms) = (N-n)!/N!

           Total # of "n-tuple" selections from N: NCn

           Sum the joint probabilities over all n-tuples (all n-tuples equally likely):
               Σ(n-tuples) P(Ei1 Ei2 ⋯ Ein) = NCn · (N-n)!/N! = [N!/(n!(N-n)!)] · [(N-n)!/N!] = 1/n!

           For N = 3:
               P(0 matches) = 1 - P(E1 ∪ E2 ∪ E3)
                            = 1 - {Σ(1-tuples) P(Ei1) - Σ(pairs) P(Ei1 Ei2) + Σ(triples) P(Ei1 Ei2 Ei3)}
                            = 1 - {1 - 1/2! + 1/3!} = 1/3

           For general N:
               P(0 matches) = 1 - P(E1 ∪ E2 ∪ … ∪ EN)
                            = 1/2! - 1/3! + 1/4! - 1/5! + ⋯ + (-1)^N (1/N!)  -->  e^(-1) as N --> ∞

        b) k Matches:
               P(k matches) = {1/2! - 1/3! + 1/4! - ⋯ + (-1)^(N-k) [1/(N-k)!]} / k!  -->  e^(-1)/k!
                              as N --> ∞

           Poisson with success rate λ = 1/N and "time intvl" t = N samples: a = λ·t = (1/N)·N = 1



Here is an example that requires the inclusion/exclusion expansion for a large number of intersecting sets.
Since it becomes increasingly difficult to use Venn diagrams for a large number of intersecting sets, we
must use the set-theoretic expansion to compute the probability. We shall spend some time on this problem
as it is very rich in probability concepts.
The problem statement is simple enough: "N men throw their hats onto the floor; each man in turn
randomly draws a hat."
a) What is the probability that no man draws his own hat?
b) What is the probability of exactly k matches?
Key ideas: define Event Ei = ith man selects his own hat,
              then take the union of the N sets E1 U E2 U ... U EN, and
                    P(no matches) = 1 - P(E1 U E2 U ... U EN)
The expansion of P(E1 U E2 U ... U EN) involves addition and subtraction of P(singles), P(pairs),
P(triples), etc. (The events Ei are not ME, so you cannot simply sum up the P(Ei) for k singles to
obtain an answer to part b).)
This slide shows a key part of the proof, which establishes the very simple result that the sum over singles
is 1/1!; the sum over pairs is 1/2!; the sum over triples is 1/3!; the sum over 4-tuples is 1/4!; ...; and the
sum over N-tuples is 1/N!.
The limit for large N approaches a Poisson distribution with success rate for each draw λ = 1/N and data
length t = N, i.e., parameter a = λt = 1.
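
A quick Monte Carlo sketch in MATLAB (parameters chosen for illustration) confirms both limits:

    % Monte Carlo estimate of the matching (derangement) probabilities
    N = 10;  trials = 1e5;
    matches = zeros(trials, 1);
    for t = 1:trials
        matches(t) = sum(randperm(N) == 1:N);   % # men who draw own hat
    end
    P0 = mean(matches == 0)                     % ~ e^-1      = 0.3679
    P1 = mean(matches == 1)                     % ~ e^-1 / 1! = 0.3679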




Man-Hat Problem n = 3: Tree/Table Counting

        Tree#1: drawing order M#1 (Drw#1), M#2 (Drw#2), M#3 (Drw#3); branch probabilities are 1/3 on
        the first draw, 1/2 on the second, and 1 on the forced third draw.

            Branch   Draws (M#1, M#2, M#3)   ME Outcome           Match Outcome   # Matches
            Br#1     1, 2, 3                 {E1 E2 E3}           triple          3
                     1, 3, 2                 {E1 E2^c E3^c}       single          1
            Br#2     2, 1, 3                 {E1^c E2^c E3}       single          1
                     2, 3, 1                 {E1^c E2^c E3^c}     no match        0
            Br#3     3, 1, 2                 {E1^c E2^c E3^c}     no match        0
                     3, 2, 1                 {E1^c E2 E3^c}       single          1

            P(Ei):  P(E1) = 1/3,  P(E2) = 2/6,  P(E3) = 2/6

          From Table:                     From Tree:
          Prob[0 matches] = 2/6           Prob[Sgls]  = P[E1] = P[E2] = P[E3] = 1/3
          Prob[1 match]   = 3/6           Prob[Dbls]  = P[E1 E2] = (1/3)(1/2) = 1/6
          Prob[2 matches] = 0/6 = 0       Prob[Trpls] = P[E1 E2 E3] = (1/3)(1/2) = 1/6
          Prob[3 matches] = 1/6           Alternate trees yield: P[E1 E3] = P[E2 E3] = 1/6

          Connection (Matches & Events):
          Prob[0 matches] = 1 - Pr[E1 ∪ E2 ∪ E3]
                          = 1 - {Sum[Sngls] - Sum[Dbls] + Sum[Trpls]}
                          = 1 - {3(1/3) - 3(1/6) + 1(1/6)} = 2/6


This slide shows the complete tree and associated table for the Man-Hat problem, in which n=3 men
throw their hats into the center of a room and then randomly select a hat. The drawing order is fixed as
Man#1, Man#2, Man#3, and the 1st column of nodes, labeled with the circled hat numbers 1, 2, 3, shows the
event E1 in which Man#1 draws his own hat and the complementary event E1c in which Man#1 does not draw
his own hat. The 2nd column of nodes, corresponding to the two hats remaining in each branch, shows the
event E2 in which Man#2 draws his own hat; note that E2 has two contributions of 1/6 summing to 1/3.
Similarly, the 3rd draw yields the event E3 in two positions, again summing to 1/3.
The tree yields ME & CE outcomes expressed as composite states such as {E1E2E3}, {E1E2cE3c}, etc., or
equivalently in terms of the number of matches in the next column. The nodal sequence in the tree can be
translated into the table on the right, which is analogous to the table we used on the previous slide. The
number of matches can be counted directly from the table as shown.
The lower half of the slide compares the "# of matches" events with the "compound events" formed from
the Ei's {no-matches, singles, pairs, and triples}. The connection between these two types of events is
based on the common event "no-matches," i.e., the inclusion/exclusion expansion of the expression
1 - P(E1 U E2 U E3) in terms of singles, doubles, and triples yields P(0-matches).
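As a quick numerical cross-check of these match-count probabilities, one can simulate the hat draws directly. The MATLAB sketch below (variable names are illustrative, not from the course materials) tallies the number of matches over many random hat assignments and reproduces the 2/6, 3/6, 0, 1/6 pattern of the table:

    % Monte Carlo check of the n=3 man-hat match-count probabilities
    n = 3; trials = 1e6;
    counts = zeros(1, n+1);                 % tallies for 0,1,...,n matches
    for k = 1:trials
        hats = randperm(n);                 % one random hat assignment
        m = sum(hats == 1:n);               % # of men who drew their own hat
        counts(m+1) = counts(m+1) + 1;
    end
    disp(counts / trials)                   % approx [.333  .500  .000  .167]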




Conditional Probability - Definition & Properties

  • Definition of Conditional Probability
        P(A | Ŝ) ≡ P(A Ŝ)/P(Ŝ)   ( = 2/3 in the previous example)

  • In terms of atomic events si we can formally write
        A = ∪(si ∈ A) si
        P(A | Ŝ) = P(A Ŝ)/P(Ŝ) = P(∪(si ∈ A) si Ŝ)/P(Ŝ)
                 = Σ(si ∈ A) P(si Ŝ)/P(Ŝ) = (# pts in Ŝ & A)/(# pts in Ŝ)

  • Note: in case Ŝ = S it reduces to P(A), as it must.

  • Asymmetry of Conditional Probability:  Not Symmetrical!
        P(B | A) = P(BA)/P(A)   (fraction of BA over A, "Given A")
        P(A | B) = P(BA)/P(B)   (fraction of BA over B, "Given B")
    [Venn diagram: overlapping sets A and B with intersection BA; the same
     region BA is renormalized by A in one case and by B in the other.]

                                                                 82      INDEX



The formal definition of conditional probability follows directly from the renormalization concept discussed
on the previous slide. It is simply the joint probability defined on the intersection of the set A and S-cap,
P(AS-cap), divided by the normalizing probability P(S-cap).
It can also be written explicitly in terms of a sum over atomic events, as given in the second equation.
Conditional probability is not symmetric because the joint probability on the intersection of A and B is
divided by the probability of the conditioning set, which is P(A) in one case and P(B) in the other. This is
also easily visualized using Venn diagrams, where the "shape divisions" are obviously different in the two cases.
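For equally likely atomic outcomes, the count-ratio form of the definition is easy to exercise numerically. The short MATLAB sketch below uses an illustrative example (a six-outcome space with A = even outcomes and S-cap = outcomes of 4 or more; these sets are not from the slides):

    % P(A|S-cap) = (# pts in S-cap & A)/(# pts in S-cap) for equally likely outcomes
    s    = 1:6;                       % atomic outcomes
    A    = mod(s, 2) == 0;            % event A: even outcomes {2,4,6}
    Scap = s >= 4;                    % conditioning set S-cap: {4,5,6}
    P_A_given_Scap = sum(A & Scap) / sum(Scap)    % = 2/3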




Examples - Coin Flips, 4-Sided Dice

  Example#1: Three Coin Flips
  Given the first flip is H, find Prob(#H > #T).
  [Tree diagram: Flip#1, Flip#2, Flip#3 generate the 8 outcomes
   {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}; the conditioning set Ŝ is the
   upper (first-flip-H) branch of the tree, and the event #H > #T is circled.]

      P(Ŝ) = 4/8;  P(HHH) = P(HHT) = P(HTH) = 1/8

      P(nH > nT | H) = [P(HHH) + P(HHT) + P(HTH)]/P(Ŝ) = (3/8)/(4/8) = 3/4

  Example#2: 4-Sided Dice
  Given the first "die" d1 = 4, find the probability of event A: "d2 = 4",
  i.e., P(d2 = 4 | d1 = 4) = ?
  [Tree and coordinate representation of the 16 outcomes (d1, d2); the reduced
   sample space Ŝ = {d1 = 4} is the column of points (4,1), (4,2), (4,3), (4,4).]

      P(Ŝ) = P(d1 = 4) = 4/16;  P(4,4) = 1/16

      P(d2 = 4 | d1 = 4) = P(4,4)/P(Ŝ) = (1/16)/(4/16) = 1/4

                                                                 83      INDEX


Here are two examples illustrating conditional probability.
The first involves a series of three coin flips, and a tree shows all possible outcomes for the original space S.
The reduced set of outcomes conditions on the statement "the 1st flip is a head" (red circle); S-cap takes
only the upper branch of the tree and leads to a reduced set of outcomes. The conditional probability is
computed either by considering outcomes in this conditioning space S-cap, or by computing the probability
for S (the whole tree) and then renormalizing by the probability for S-cap (the upper branch).
The second example involves the throw of a pair of 4-sided dice and asks for the probability that d2 = 4 given
that d1 = 4, P(d2 = 4 | d1 = 4). The answer is obtained directly from the definition of conditional probability
and is illustrated using a tree and a coordinate representation of the dice sample space with a Venn diagram
overlay for the event (d1, d2) = (4,4) (green) and the subspace S-cap = {d1 = 4} (red rectangle).
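Both answers can be verified by brute-force enumeration. A minimal MATLAB sketch for the coin-flip example (the logical masks and names are illustrative):

    % Enumerate the 8 equally likely 3-flip outcomes and verify
    % P(#H > #T | first flip = H) = 3/4
    flips = dec2bin(0:7) == '1';            % 8x3 logical array; col 1 = Flip#1, true = H
    nH    = sum(flips, 2);                  % heads count per outcome
    Scap  = flips(:, 1);                    % conditioning set: first flip is H
    A     = nH > (3 - nH);                  % event #H > #T
    P = sum(A & Scap) / sum(Scap)           % = 0.75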




Probability of Winning in the "Game of Craps"

  Rules for the "Game of Craps"
    First Throw - dice sum = (d1+d2):        Subsequent Throws - dice sum = (d1+d2):
      2, 3, 12 - "Lose" (L)                    "Point" - "Win" (W)
      7, 11    - "Win" (W)                     7       - "Lose" (L)
      Other (O) - first time defines           Other (O) - "Throw Again"
                  your "Point" ( = "5", say)

  [Tree diagram: Thr#1 branches over the sums 2..12 (2, 3, 12 go to L; 7, 11 go
   to W; the others become the Point). For Point = 5, every subsequent throw
   (Thr#2, Thr#3, Thr#4, ...) branches to W with probability 4/36, to L with
   probability 6/36, and to O ("throw again") with probability 26/36.]

      S = d1+d2    #Ways    Prob
        2, 12        1      1/36
        3, 11        2      2/36
        4, 10        3      3/36
        5, 9         4      4/36
        6, 8         5      5/36
        7            6      6/36

      P(W | 5) = 4/36 + (26/36)(4/36) + (26/36)²(4/36) + (26/36)³(4/36) + ...
               = (4/36) · 1/(1 - 26/36) = 2/5

      P(W) = P(7) + P(11) + Σ(Points) P(W | Point) P(Point)
           = 6/36 + 2/36 + 2·[ P(W|4)P(4) + P(W|5)P(5) + P(W|6)P(6) ]
           = 6/36 + 2/36 + 2·[ (1/3)(3/36) + (2/5)(4/36) + (5/11)(5/36) ] = .4929

                                                                 85      INDEX



Here we compute the probability of winning the game of craps previously described, using the rules for the 1st
and subsequent throws given in the box and illustrated by the tree. Since there are 36 equally likely
outcomes, the number of ways for the two dice to sum to either 2 or 12 is 1, giving probability 1/36; for 3 or
11 it is 2/36; and the probabilities for the remaining sums can be read directly off the sum-axis coordinate
representation and are displayed in the table on the right.
We have labeled the partial tree "given the point 5" with the conditional probabilities derived from the table.
The probabilities for the three outcomes W ("5"), L ("7"), and Other (not "5 or 7") can be read off the table as
P(5) = 4/36, P(7) = 6/36, P(Other) = 1 - (4+6)/36 = 26/36. Note that these are actually conditional probabilities,
but since the throws are independent the conditionals are the same as the a priori values taken from the table.
P(W|5) is obtained by summing all paths that lead to a win on this "infinite tree." Thus the 2nd throw
yields W with probability 4/36, the 3rd throw yields W with probability P(Other)P(5) = (26/36)(4/36),
the 4th throw yields W with probability P(Other)²P(5) = (26/36)²(4/36), and so on, leading to an infinite
geometric series which sums to (4/36)·1/(1 - 26/36) = 2/5.
The total probability of winning is the sum of winning on the 1st throw ("7" or "11") plus winning on the
subsequent throws for each possible "point." The infinite sum for the other points is obtained in a manner
similar to that for "5" (taking points by pairs in the table leads to the factor of two), and the final result is
.4929, i.e., a 49.3% chance of winning!
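Since the geometric series collapses to P(W|point) = P(point)/[P(point) + P(7)], the whole calculation fits in a few MATLAB lines. A sketch (the index helper Psum is illustrative):

    % P(Win) at craps from the table of sum probabilities
    p = [1 2 3 4 5 6 5 4 3 2 1] / 36;       % P(sum = s) for s = 2..12
    Psum = @(s) p(s - 1);                    % helper: probability of sum s
    PW = Psum(7) + Psum(11);                 % win on the first throw
    for point = [4 5 6 8 9 10]               % geometric series: P(W|pt) = P(pt)/(P(pt)+P(7))
        PW = PW + Psum(point)^2 / (Psum(point) + Psum(7));
    end
    PW                                       % = 0.4929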




Visualization of Joint, Conditional, & Total Probability

  Binary Comm Signal - 2 Levels {0,1}                   x = 0, 1  (sent)
  Binary Decision - {R0, R1} = {"0" rcvd, "1" rcvd}     y = R0, R1 (rcvd)

  Joint Probability (Symmetric)
      P(0, R0) = P(R0, 0)
      "0" sent & R0 ("0" rcvd)  =  R0 ("0" rcvd) & "0" sent

  Conditional Probability (Non-Symmetric)
      P(0 | R0) ≠ P(R0 | 0)
      "0" sent given R0 ("0" rcvd)  vs  R0 ("0" rcvd) given "0" sent

  [Figure: a Signal Plane (sent 0/1) overlaid on a Detection Plane (rcvd R0/R1)
   yields an Outcome Plane with four joint regions 0R0, 0R1, 1R0, 1R1.]

  Total Probability
      P(0)  = P(0, R0) + P(0, R1)    (sum the joint up on R0, R1)
      P(R0) = P(R0, 0) + P(R0, 1)    (sum the joint across on 0, 1)

  Conditional Probability Requires Total Probability P(0), P(R0), etc.
      P(R0 | 0) ≡ P(R0, 0)/P(0)  = P(R0, 0)/[P(R0, 0) + P(R1, 0)]    re-normalize the
      P(0 | R0) ≡ P(R0, 0)/P(R0) = P(R0, 0)/[P(R0, 0) + P(R0, 1)]    joint probability

                                                                 88      INDEX



Another way to visualize the communication channel is as an overlay of a Signal Plane, divided (equally)
into "0"s and "1"s, and a Detection Plane, which characterizes how the "0"s and "1"s are detected. When we
overlay the two planes we obtain an Outcome Plane with four distinct regions whose areas represent the
probabilities of the four product (joint) states {0R0, 0R1, 1R0, 1R1} (similar to the tree outputs).
In this representation the total probability of a "0", P(0), can be thought of as decomposed into two parts
summed vertically over the "0"-half of the bottom plane, shown by the break arrow: P(0) = P(0,R0) + P(0,R1).
[Note: summing on the "1"-half of the bottom plane yields P(1) = P(1,R0) + P(1,R1).]
Similarly, the total probability P(R0) can be thought of as decomposed into two parts summed horizontally
over the "R0"-portion of the bottom plane, shown by the break arrow: P(R0) = P(R0,0) + P(R0,1); likewise
we have P(R1) = P(R1,0) + P(R1,1).
The Total Probability of a given state is obtained by performing such sums over all joint states.
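As a sketch of these sums, the joint table can be built from assumed channel numbers and collapsed by row and column. The figures below assume P(0) = P(1) = .5 and the channel statistics P(R0|0) = .95, P(R1|1) = .90 used on the log-odds slide that follows:

    % Total probabilities as row/column sums of the joint table
    %             R0        R1
    Pjoint = [ .5*.95   .5*.05 ;     % "0" sent: P(0,R0), P(0,R1)
               .5*.10   .5*.90 ];    % "1" sent: P(1,R0), P(1,R1)
    Psent = sum(Pjoint, 2)           % P(0), P(1): sum the joint on R0, R1
    Prcvd = sum(Pjoint, 1)           % P(R0), P(R1): sum the joint across 0, 1
    P0_given_R0 = Pjoint(1,1) / Prcvd(1)   % re-normalized joint = .475/.525 = .905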




Log-Odds Ratio - Add & Subtract Measurement Information

  Revisit Binary Comm Channel      P(R0|0) = .95   P(R1|1) = .90     P(0) = .5
                                   P(R1|0) = .05   P(R0|1) = .10     P(1) = .5
                                   (Note: E = "1", Ec = "0")

  Relation between L1 and P(1|R1):
      L1 ≡ ln[ P(1|R1)/(1 - P(1|R1)) ]  ⇒  e^L1 = P(1|R1)/(1 - P(1|R1))
                                        ⇒  P(1|R1) = e^L1/(1 + e^L1)

      L1 = ln[ P(1)/P(1c) ] + ln[ P(R1|1)/P(R1|1c) ] = ln[ P(1)/P(0) ] + ln[ P(R1|1)/P(R1|0) ]
                                                            ≡ L0              ≡ ΔL1

  Additive Meas Updates for L:
      Lnew = Lold + ΔL_R1,  where Lold = ln[ P(1)/P(0) ],  ΔL_R1 = ln[ P(R1|1)/P(R1|0) ]

  Updates:
    Meas#1: R1
      Lold = ln(.5/.5) = 0;  ΔL_R1 = ln(.90/.05) = +2.8903
      Lnew = 0 + 2.8903 = 2.8903;  P(1|R1) = e^2.8903/(1 + e^2.8903) = .947
    Meas#2: R0
      ΔL_R0 = ln[ P(R0|1)/P(R0|0) ] = ln(.10/.95) = -2.25129
      Lnew = 2.8903 + (-2.25129) = .63901;  P(1|R1R0) = e^.63901/(1 + e^.63901) = .655
    Alternate Meas#2: R1
      ΔL_R1 = ln[ P(R1|1)/P(R1|0) ] = ln(.90/.05) = +2.8903
      Lnew = 2.8903 + 2.8903 = 5.7806;  P(1|R1R1) = e^5.7806/(1 + e^5.7806) = .997

                                                                 96      INDEX



Revisiting the binary communication channel, we now compute updates using the log-odds ratio, for which
the updates are additive. The update equation starts from the initial log-odds ratio, which is
Lold = ln[P(1)/P(1c)] = ln(.5/.5) = 0 for this communication channel. There are two measurement types, R1
and R0, and each adds an increment ΔL determined by its measurement statistics, viz.,
R1: ΔL_R1 = ln[P(R1|1)/P(R1|1c)] = ln(.90/.05) = +2.8903 (positive, "confirming")
R0: ΔL_R0 = ln[P(R0|1)/P(R0|1c)] = ln(.10/.95) = -2.25129 (negative, "refuting")
The table illustrates how easy it is to accumulate the results of two measurements, R1 followed by R0, by just
adding the two ΔLs to obtain
Lnew = 0 + 2.8903 - 2.25129 = .63901,
or alternately R1 followed by R1 to obtain
Lnew = 0 + 2.8903 + 2.8903 = 5.7806.
These log-odds ratios are converted to actual probabilities by computing P = e^Lnew/(1 + e^Lnew), yielding .655
and .997 for the above two cases.
If we want to find the number of R1 measurements needed to give a .99999 probability of "1", we need only
convert .99999 to L = ln[.99999/(1 - .99999)] = 11.51 and divide the result by 2.8903 to find 3.98, so that
4 R1 measurements are sufficient.
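The whole update arithmetic fits in a few MATLAB lines (a sketch; L2P is an illustrative helper converting a log-odds value back to a probability):

    % Additive log-odds updates for the binary channel
    L2P   = @(L) exp(L) ./ (1 + exp(L));    % log-odds -> probability
    Lold  = log(.5/.5);                     % prior log-odds = 0
    dL_R1 = log(.90/.05);                   % +2.8903, "confirming" increment
    dL_R0 = log(.10/.95);                   % -2.2513, "refuting" increment
    P1_after_R1     = L2P(Lold + dL_R1)             % = .947
    P1_after_R1_R0  = L2P(Lold + dL_R1 + dL_R0)     % = .655
    P1_after_two_R1 = L2P(Lold + 2*dL_R1)           % = .997
    n_R1_needed = ceil(log(.99999/(1 - .99999)) / dL_R1)   % = 4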




Discrete Random Variables (RV) –Key Concepts
          •   Discrete RVs: A series of measurements of random events
          •   Characteristics: “Moments:” Mean and Std Deviation
          •   Prob Mass Fcn: (PMF), Joint, Marginal, Conditional PMFs
          •   Cumulative Distr Fcn: (CDF) i) Btwn 0 and 1, ii) Non-decreasing
          •   Independence of two RVs
          •   Transformations - Derived RVs
          •   Expected Values (for given PMF)
          •   Relationships Btwn two RVs: Correlations
          •   Common PMFs Table
          •   Applications of Common PMFs
          •   Sums & Convolution: Polynomial Multiplication
          •   Generating Function: Concept & Examples


                                                                                         122    INDEX



This slide gives a glossary of some of the key concepts involving random variables (RVs) which we shall
discuss in detail in this section. Physical phenomena are always subject to some random component, so
RVs must appear in any realistic model, and their statistical properties provide a framework for analyzing
multiple experiments that use the same model. These concepts provide the rich environment that allows
analysis of complex random systems with several RVs, by defining the distributions associated with their
sums and the transformations of these distributions inherent in the mathematical equations used to model
the system.
At any instant, an RV takes on a single random value and represents one sample from the underlying
distribution defined by its probability mass function (PMF). Often we need to know the probability for some
range of values of an RV, and this is found by summing the individual probability values of the PMF; the
cumulative distribution function (CDF) is defined to handle such sums. The CDF formally characterizes the
discrete RV in terms of a quasi-continuous function that ranges between [0,1] and has a unique inverse.
Distributions can also be characterized by single numbers rather than PMFs or CDFs, and this leads to the
concepts of mean values, standard deviations, correlations between pairs of RVs, and expected values.
There are a number of fundamental PMFs used to describe physical phenomena, and these common PMFs
will be compared and illustrated through examples. Finally, the relationship between the sum of two RVs
and the concept of convolution, and the generating function for RVs, will be discussed.
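As a small illustration of the PMF-to-CDF summing just described, the MATLAB sketch below uses the two-dice sum PMF developed on the next slide (chosen here only for illustration):

    % CDF from a PMF by cumulative summing; range probabilities by partial sums
    s    = 2:8;                            % values of the sum of two 4-sided dice
    pmf  = [1 2 3 4 3 2 1] / 16;           % p_S(s)
    cdfS = cumsum(pmf)                     % non-decreasing, rises from 1/16 to 1
    P_4_to_6 = sum(pmf(s >= 4 & s <= 6))   % P(4 <= S <= 6) = 10/16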




Transformation of Sample Space: Sum & Difference - 4-Sided Dice

  Fair 4-sided dice thrown twice:      RVs: Sum "S" = d1 + d2 & Absolute Difference "D" = |d2 - d1|
  Uniform PMF pD1D2(d1,d2) = 1/16      Find new PMF pDS(d,s) = ?

  [Figure, four panels: (i) the 4x4 (d1,d2) grid with each point labeled by its
   "D/S" values, e.g., the opposite corners (1,4) and (4,1) are both labeled 3/5;
   (ii) the grid rotated to (D,S) coordinates; (iii) the folded (s,|d|) grid, in
   which the fold-over doubles the off-axis values to 2/16 and leaves certain
   points, e.g., (s,|d|) = (3,0), unoccupied ("missing"); collapsing values down
   along the s-axis gives pS(s), e.g., pS(6), and collapsing along the d-axis
   gives pD(|d|), e.g., pD(1), pD(3); (iv) a 3-D bar chart of pSD(s,|d|).]

  Resulting joint PMF pSD(s,|d|) and its marginals (entries in 1/16ths):

      |d| \ s     2    3    4    5    6    7    8   |  pD(|d|)
        0         1    -    1    -    1    -    1   |   4/16
        1         -    2    -    2    -    2    -   |   6/16
        2         -    -    2    -    2    -    -   |   4/16
        3         -    -    -    2    -    -    -   |   2/16
      ----------------------------------------------
      pS(s)      1/16 2/16 3/16 4/16 3/16 2/16 1/16

                                                                 125     INDEX



In the game with 4-sided dice, we are interested in the distribution of the sum random variable S = D1 + D2,
pS(s), and not the joint distribution pD1D2(d1,d2). This slide and several to follow illustrate the procedure for
obtaining the desired "marginal" (or collapsed) distribution pS(s). In the process, we shall develop the
relationship between distributions under transformation of coordinates, and define conditional and
marginal distributions involving a pair of RVs {D1, D2}.
We start with the 2- and 3-dimensional dice representations of equally likely outcomes of 1/16, as shown on
the left. Recall that the points (d1, d2) for dice outcomes may alternately be expressed as points (s,d) in their
sum and difference coordinates, where s = d1 + d2 and d = d2 - d1. These coordinate axes are shown in the
top left figure, where the sum and difference each take on 7 values: s = {2,3,4,5,6,7,8} and
d = {-3,-2,-1,0,1,2,3}.
We consider a slightly different transformation, s = d1 + d2 and |d| = |d2 - d1|, where the absolute difference
|d| takes on only 4 values {0,1,2,3}; this has the effect of doubling the probability values for |d| = {1,2,3} by
folding the negative difference values over onto the positive ones. If we label each point in this figure by
its "|d|/s" values, we see, for example, that the points (d1,d2) = (1,4) and (d1,d2) = (4,1) at opposite corners of
the grid are both now labeled with |d|/s = 3/5. Labeling all points in this manner and rotating the figure
clockwise 90° so that D is up and S is to the right (central figure), we have found the new joint distribution
pSD(s,|d|), as illustrated in the two right figures where points are now labeled by (s,|d|) values. Note that the
new distribution has doubled the positive-d values to 2/16 each and that certain coordinate points, e.g.,
(s,|d|) = (3,0), are not occupied (green). The marginal distribution pS(s), defined as the sum of the joint
distribution pSD(s,|d|) over all |d| values, is easily picked off the upper right figure by collapsing values
down along the s-axis. Similarly, the distribution pD(|d|) is defined as the sum of the joint distribution
pSD(s,|d|) over all s values. The table shows the results.
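The fold-over and collapse operations are easy to verify by direct enumeration; a minimal MATLAB sketch:

    % Joint PMF p_SD(s,|d|) for two 4-sided dice, built by enumeration,
    % then collapsed to the marginals
    pSD = zeros(9, 4);                            % rows: s = 0..8, cols: |d| = 0..3
    for d1 = 1:4
        for d2 = 1:4
            s = d1 + d2;  d = abs(d2 - d1);
            pSD(s+1, d+1) = pSD(s+1, d+1) + 1/16; % each outcome carries 1/16
        end
    end
    pS = sum(pSD, 2)'   % collapse over |d|: [0 0 1 2 3 4 3 2 1]/16 for s = 0..8
    pD = sum(pSD, 1)    % collapse over s:   [4 6 4 2]/16 for |d| = 0..3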




Common PMFs and Properties - 1

  General:  E[X] = Σx x·pX(x);   var(X) = E[X²] - E[X]²

  Bernoulli ("atomic" RV; 1 trial; X = x successes, "0" or "1")
      pX(x) = p for X = 1 (success);  pX(x) = 1 - p = q for X = 0 (failure)
      E[X]  = 0·(1-p) + 1·p = p
      E[X²] = 0²·(1-p) + 1²·p = p;   var(X) = p - p² = p(1-p) = pq

  Binomial (n independent Bernoulli trials; X = x successes;
            "how many successes x in n trials?")
      pX(x) = nCx p^x q^(n-x),  x = 0, 1, ..., n
      E[X] = Σ(x=0..n) x·nCx p^x q^(n-x) = np;   var(X) = npq

  Geometric (one sequence; X = x trials for 1 success;
             "how many trials x for 1 success?")
      pX(x) = p q^(x-1) for x = 1, 2, ...;  0 otherwise
      E[X] = Σ(x=1..∞) x·p q^(x-1) = p (d/dq) Σ q^x = p (d/dq)[1/(1-q)]
           = p/(1-q)² = 1/p;   var(X) = q/p²
      (As p decreases, the expected number of trials x for 1 success must increase.)

  Negative Binomial (many sequences; X = x trials for r successes;
            "how many trials x for r successes?"; Geometric = Neg Binom for r = 1)
      pX(x) = (x-1)C(r-1) p^(r-1) q^(x-r) · p,  x = r, r+1, r+2, ..., ∞
              [(r-1) successes in (x-1) trials, then success on the next trial]
      E[X] = Σ(x=r..∞) x·(x-1)C(r-1) p^r q^(x-r) = r/p;   var(X) = r·q/p²
      (As p decreases, the expected number of trials x for r successes must increase.)

                                                                 137     INDEX

This table and the one to follow compare some common probability distributions and explore their
fundamental properties and how they relate to one another. A brief description is given under the "RV
Name" column, followed by the PMF formula and figure in column 2; formulas for the mean and variance are
shown in the last two columns.
The Bernoulli RV X answers the question "what is the result of a single Bernoulli trial?" It takes on
only two values, namely "1" = Success with probability p and "0" = Fail with probability q = 1-p.
The Binomial RV X answers the question "how many successes X in n Bernoulli trials?" It takes on
values corresponding to the number of successes in n independent Bernoulli trials; the sum RV
X = X1 + X2 + ... + Xn of n Bernoulli RVs has \binom{n}{x} tree paths for X = x successes, yielding the
PMF \binom{n}{x} p^x q^{n-x} as shown.
The Geometric RV X answers the question "how many Bernoulli trials X for 1 success?" It takes on
values from 1 to infinity and is the sum of x-1 failed Bernoulli trials followed by one successful trial;
there is only one tree path with X = x trials yielding 1 success, and so the PMF is q^{x-1} p as shown.
The Negative Binomial RV X answers the question "how many Bernoulli trials X for r successes?" It
takes on values from r to infinity and is the sum of r Geometric random variables, X = G1 + G2 + ... + Gr.
Each tree path consists of (r-1) successes in the first (x-1) trials followed by one final success, and so
carries probability p^{r-1} q^{x-r} \cdot p; there are \binom{x-1}{r-1} such paths, giving the PMF
\binom{x-1}{r-1} p^{r-1} q^{x-r} \cdot p = \binom{x-1}{r-1} p^r q^{x-r} for x = r, r+1, ..., \infty, as shown.
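As a quick numerical check of these formulas, the following minimal MATLAB sketch (base
functions only; the values p = 0.25, n = 10, r = 3 are illustrative choices, not values from
the slides) tabulates the three non-trivial PMFs and compares their means against 1/p and r/p:

    % Tabulate the Binomial, Geometric and Negative Binomial PMFs discussed
    % above and verify normalization and means numerically.
    p = 0.25;  q = 1 - p;          % illustrative success/failure probabilities
    n = 10;  r = 3;                % number of trials / required successes

    xB = 0:n;                      % Binomial: x successes in n trials
    pmfBinom = arrayfun(@(x) nchoosek(n, x) * p^x * q^(n - x), xB);

    xG = 1:200;                    % Geometric: x trials for the 1st success
    pmfGeom = q.^(xG - 1) * p;

    xNB = r:200;                   % Negative Binomial: x trials for r successes
    pmfNegBin = arrayfun(@(x) nchoosek(x - 1, r - 1) * p^r * q^(x - r), xNB);

    fprintf('sums: %.6f  %.6f  %.6f (all ~1)\n', ...
            sum(pmfBinom), sum(pmfGeom), sum(pmfNegBin));
    fprintf('E[Geom] ~ %.4f (1/p = %.4f);  E[NegBin] ~ %.4f (r/p = %.4f)\n', ...
            sum(xG .* pmfGeom), 1/p, sum(xNB .* pmfNegBin), r/p);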




Bernoulli/Binomial Tree Structures

RV Name / PMF / Tree

Bernoulli, 1 Trial (X = x successes, "0" or "1"; the "atomic" RV):
  p_X(x) = p        for X = 1 (success)
         = 1-p = q  for X = 0 (failure)                         (q+p)
  Tree:  START --q--> F : x = 0, probability q
         START --p--> S : x = 1, probability p

Binomial, 2 Trials ("how many successes x in 2 independent Bernoulli trials?"):
  p_X(x) = \binom{2}{x} p^x q^{2-x} ,  x = 0, 1, 2              (q+p)^2
  [PMF figure: bars at x = 0, 1, 2; vertical scale 1/4, 1/2]
  Tree (a second Bernoulli tree appended to each output node of the first trial):
         {FF} : x = 0, probability q^2    (the 2C0 = 1 path)
         {FS} : x = 1, probability qp  }
         {SF} : x = 1, probability pq  }  (the 2C1 = 2 paths)
         {SS} : x = 2, probability p^2    (the 2C2 = 1 path)

  (q+p)^2 = q^2 + 2pq + p^2
          = \binom{2}{0} p^0 q^2 + \binom{2}{1} p^1 q^1 + \binom{2}{2} p^2 q^0

                                                                           138    INDEX



The RVs of the last slide are grouped in pairs {Bernoulli, Binomial} and {Geometric, Negative Binomial}
for a reason. The sum of many independent Bernoulli trials generates a Binomial distribution, and similarly
the sum of many independent Geometric trials generates the Negative Binomial distribution. This slide
and the next give a graphical construction of the trees for these two groups of paired distributions by
repeatedly applying the basic tree structure of the underlying Bernoulli or Geometric tree, as
appropriate.
In the first panel we show the PMF properties for the Bernoulli on the left; on the right we display the
Bernoulli tree structure, where the upper branch q = Pr[Fail] goes to the state X = 0 and the lower branch
p = Pr[Success] goes to the state X = 1.
In the second panel we show the PMF properties for a simple n = 2 trial Binomial. The corresponding tree
structure for this Binomial is obtained by appending a second Bernoulli tree to each output node of the first
trial, thus yielding the 4 output states {{FF}, {FS}, {SF}, {SS}}. We see that there is \binom{2}{0} = 1 tree
path leading to {FF} with probability q^2, there are \binom{2}{1} = 2 tree paths leading to {FS} and {SF}
each with probability pq, and there is \binom{2}{2} = 1 tree path leading to {SS} with probability p^2,
precisely as expected from the Binomial PMF for n = 2.
This can be continued for n = 3, 4, ... by repeatedly appending a Bernoulli tree to each new node. Further,
we see that this structure for n = 2 is represented algebraically by (q+p)^2, inasmuch as the direct
expansion gives 1 = q^2 + 2pq + p^2; expanding the expression (q+p)^n corresponding to n Bernoulli trials
yields the appropriate Binomial expansion for general exponent n.
Thus the Binomial is represented by the repetitive tree structure, or equivalently by the repeated
multiplication of the algebraic structure 1 = (q+p) by itself n times to obtain 1^n = (q+p)^n.
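A brute-force enumeration of the full tree makes this concrete. The following MATLAB sketch
(n = 4 and p = 0.3 are illustrative values) walks every leaf of the depth-n Bernoulli tree,
accumulates the path probabilities p^x q^(n-x) by success count, and confirms that the totals
collapse to the Binomial PMF:

    % Enumerate all 2^n paths through the repeated Bernoulli tree and
    % confirm the path counts/probabilities reproduce the Binomial PMF.
    n = 4;  p = 0.3;  q = 1 - p;
    pmf = zeros(1, n + 1);                  % accumulates Pr[x successes], x = 0..n
    for leaf = 0:2^n - 1                    % each leaf = one path through the tree
        path = bitget(leaf, 1:n);           % 0 = Fail branch, 1 = Success branch
        x = sum(path);                      % successes along this path
        pmf(x + 1) = pmf(x + 1) + p^x * q^(n - x);  % every such path has prob p^x q^(n-x)
    end
    binom = arrayfun(@(x) nchoosek(n, x) * p^x * q^(n - x), 0:n);
    disp(max(abs(pmf - binom)))             % ~0: the tree reproduces the Binomial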




Geometric/NegBinomial Tree Structures

RV Name / PMF / Tree

Geometric, X = x Trials, 1 Success ("how many trials x for 1 success"; one infinite sequence):
  p_X(x) = p q^{x-1} ,  x = 1, 2, ...
         = 0            otherwise                               [(1-q)^{-1} p]
  [PMF figure: bars p_X(x) vs x = 0, 1, 2, 3, 4, 5, ...; vertical scale 1/16 to 1/2]
  Tree:  at each node the branch --p--> S ends the sequence with a success;
         the branch --q--> F spawns another Bernoulli trial, indefinitely.

Negative Binomial, X = x Trials, 2 Successes (r = 2; many infinite sequences):
  p_X(x) = \binom{x-1}{2-1} p^{2-1} q^{x-2} \cdot p ,  x = 2, 3, 4, ..., \infty     [(1-q)^{-1} p]^2
           [(2-1) successes in (x-1) trials, then a success on the next trial]
  [PMF figure: bars p_X(x) vs x = 0, 1, 2, 3, 4, 5, ...; vertical scale 1/16 to 1/4]
  Tree:  a full Geometric tree appended to every first-success node, giving a
         doubly infinite tree structure.

  p^2 (1-q)^{-2} = p \left\{ 1 + (-2)(-q) + \frac{(-2)(-3)}{2!}(-q)^2 + \frac{(-2)(-3)(-4)}{3!}(-q)^3 + ... \right\} p
                 = \left\{ \binom{1}{1} p + \binom{2}{1} p q + \binom{3}{1} p q^2 + \binom{4}{1} p q^3 + ... \right\} p

                                                                           139    INDEX



This slide first gives a graphical construction of the Geometric tree from an infinite number of Bernoulli
trials and then shows how the Negative Binomial tree is the result of appending a Geometric tree to
itself, in a manner similar to that of the last slide. In the first panel we repeat the PMF properties of the
Geometric RV. On the right side of this panel we display the Geometric tree structure, whose branches end
in a single success. This tree has a Bernoulli trial appended to each failure node and is constructed from
an infinite number of Bernoulli trials. The 1st Bernoulli trial yields X = 1 with p = Pr[Success], and this
ends the lower branch; its upper branch yields X = 0 with q = Pr[Fail]; this failure node spawns a 2nd
Bernoulli trial which again leads to X = 1 or X = 0, and this process continues indefinitely. It accurately
describes the probabilities for a single success in 1, 2, 3, ..., \infty trials and is algebraically
represented by the expression 1 = (1-q)^{-1} p, which expands to [1 + q + q^2 + q^3 + ...] p, corresponding
to exactly 0, 1, 2, 3, ... "failures before a single success."
In the second panel we show the PMF properties for an r = 2 Negative Binomial; on the right we display
the Negative Binomial tree structure obtained by applying the basic Geometric tree to each node
(an infinite number of them) corresponding to a 1st success. This leads to a doubly infinite tree structure
for the r = 2 Negative Binomial, which gives the number of trials X = x required for r = 2 successes. We can
verify the first few terms of the Negative Binomial expansion given under PMF in the lower panel using the
tree. This process may be extended to r = 3, 4, ... successes by repeatedly applying the Geometric tree to
each success node. For r = 2, direct expansion of the algebraic identity 1^2 = [(1-q)^{-1} p]^2 yields
{ \binom{1}{1} p + \binom{2}{1} p q + \binom{3}{1} p q^2 + \binom{4}{1} p q^3 + ... } p, in agreement with
the r = 2 Negative Binomial terms in the table. In an analogous fashion, expansion of 1^r = [(1-q)^{-1} p]^r
yields results for the r-success Negative Binomial. Note that the "Negative" modifier to Binomial is a
natural designation in view of the (1-q)^{-1} term in the algebraic structure.
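The identity 1^r = [(1-q)^{-1} p]^r can be checked numerically by truncating the Negative
Binomial sum. A minimal MATLAB sketch (p = 0.4 and r = 3 are illustrative values):

    % Verify that sum_{x = r..inf} C(x-1, r-1) p^r q^(x-r) reproduces
    % [(1-q)^(-1) p]^r = 1 for the r-fold Geometric tree.
    p = 0.4;  q = 1 - p;  r = 3;
    partial = 0;
    for x = r:200                           % truncate the infinite sum
        partial = partial + nchoosek(x - 1, r - 1) * p^r * q^(x - r);
    end
    fprintf('partial sum = %.12f  (target [(1-q)^{-1} p]^r = %.12f)\n', ...
            partial, ((1 - q)^(-1) * p)^r);  % both ~1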




Bernoulli, Geometric, Binomial & Negative Binomial PMFs

  • Bernoulli RV as probability "indicator" for outcomes of a series of
    experiments representing two different event types, namely,
    E1: "Success in 1 trial"                  X  = Bernoulli RV
    E2: "N1 is # trials for 1st success"      N1 = Geometric RV

Bernoulli (single RV, two outcomes):
  p_X(x) = p        for X = 1 (success)
         = 1-p = q  for X = 0 (failure)
  1 = # trials ;  0, 1 = # successes
  E(X) = \mu_X = p ;  var(X) = \sigma_X^2 = pq

Bernoulli Process (1 Bernoulli trial for event E1):   p_X(x) = p

Binomial b(k; n, p)  (n = # trials, k = # successes; K = # successes in n trials):
  Sum of n independent Bernoulli RVs X:   K = \sum_{i=1}^{n} X_i
  p_K(k) = \binom{n}{k} p^k q^{n-k}
  E(K) = \mu_K = np ;  var(K) = \sigma_K^2 = npq

Geometric Process (n_1 Bernoulli trials for event E2):   p_{N_1}(n_1) = p q^{n_1 - 1}

Neg. Binomial bn(n_r; r, p)  (n_r = # trials for r successes):
  Sum of r independent Geometric RVs N1:   N_r = \sum_{i=1}^{r} (N_1)_i
  p_{N_r}(n_r) = \binom{n_r - 1}{r - 1} p^r q^{n_r - r}
  E[N_r] = \mu_{N_r} = r E[N_1] = r \frac{1}{p}
  var(N_r) = \sigma_{N_r}^2 = r var(N_1) = r \frac{q}{p^2}

                                                                           140


The Bernoulli RV X is the basic building block for other RVs (the "atomic" RV) and has a PMF
with only two outcomes: X = 1 with probability p and X = 0 with probability q = 1-p. We have seen
that n such Bernoulli variables, when added, yield a Binomial PMF {b(x; n, p), x = 0, 1, 2, ..., n} which
gives the number of successes x in n trials.
We have also seen that this Binomial PMF can be understood by repeatedly appending the Bernoulli tree
graph to each of its nodes (repeated independent trials), thereby constructing a tree with 2^n outcomes
corresponding to the n Bernoulli trials, each with two possible outcomes.
Alternatively, the Geometric PMF can be constructed by repeatedly appending a Bernoulli tree graph, but
this time only to the failure node, an infinite number of times, thereby constructing a tree with an infinite
number of outcomes, all of which correspond to x-1 failures and exactly 1 success for x = 1, 2, ..., \infty.
Just as the Bernoulli tree graph is a building block for the Binomial tree graph, the infinite Geometric
tree graph is a building block for the Negative Binomial. The Negative Binomial tree graph for r = 2
successes is constructed by appending a Geometric tree graph to itself, but this time only to the success
nodes, resulting in a doubly infinite tree graph corresponding to exactly x-2 failures and exactly 2
successes for x = 2, 3, ..., \infty. Repeating this process r times yields the r-fold infinite tree graph
corresponding to exactly x-r failures and exactly r successes for x = r, r+1, ..., \infty.
The mathematical transformations relating the Bernoulli, Binomial, Geometric and Negative Binomial RVs
are shown in this slide.
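The "sum" relationships on this slide are easy to confirm by simulation. A minimal MATLAB
sketch (p = 0.3, n = 8, r = 3 and the sample size are illustrative choices):

    % Monte Carlo check: Binomial = sum of n Bernoulli RVs,
    % Negative Binomial = sum of r Geometric RVs.
    rng(1);  p = 0.3;  q = 1 - p;  n = 8;  r = 3;  N = 1e5;

    K = sum(rand(n, N) < p, 1);                % n Bernoulli trials per column
    fprintf('E[K]  = %.3f (np  = %.3f);  var(K)  = %.3f (npq    = %.3f)\n', ...
            mean(K), n*p, var(K), n*p*q);

    G  = ceil(log(rand(r, N)) ./ log(q));      % Geometric draws via inverse CDF
    Nr = sum(G, 1);                            % total trials for r successes
    fprintf('E[Nr] = %.3f (r/p = %.3f);  var(Nr) = %.3f (rq/p^2 = %.3f)\n', ...
            mean(Nr), r/p, var(Nr), r*q/p^2);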




Common PMFs and Properties-2

RV Name / PMF / Mean / Variance
  (Mean:  E[X] = \sum_x x \cdot p_X(x) ;  Variance:  var(X) = E[X^2] - E[X]^2)

Hyper-geometric  (X = x successes; N = fixed population, m = tagged items,
n = test sample drawn without replacement; x drawn from the m "marked",
n-x from the N-m "unmarked"):
  p_X(x) = \binom{m}{x} \binom{N-m}{n-x} / \binom{N}{n} ,   x_min <= x <= x_max
         = 0                                                otherwise
  with m \in [1, N] ;  n \in [1, N] ;  x_min = max(0, m+n-N) ,  x_max = min(m, n) ,
  as allowed by the combinations without replacement.

  E[X] = n \cdot \frac{m}{N} = n \cdot p ,  where p = m/N is the "initial"
  probability of drawing a marked item.
  var(X) = n \cdot \frac{(N-n)}{(N-1)} \cdot \frac{m}{N} \cdot \frac{(N-m)}{N}
         = \frac{(N-n)}{(N-1)} \cdot n \cdot p \cdot q

  The PMF derives from the Binomial (Vandermonde) identity (written for n <= m <= N):
  \binom{N}{n} = \binom{m+(N-m)}{n}
              = \binom{m}{0}\binom{N-m}{n} + \binom{m}{1}\binom{N-m}{n-1} + ...
                + \binom{m}{x}\binom{N-m}{n-x} + ... + \binom{m}{n}\binom{N-m}{0}

Poisson  (X = x successes):
  p_X(x) = (a^x / x!) / e^a ,  x = 0, 1, 2, ..., \infty
         = 0                   otherwise
  E[X] = a ;  var(X) = a
  Limit of the Binomial:  a = \lim_{n \to \infty, p \to 0} (n \cdot p) = \lambda t
  = (average arrival rate) * (time)

Zeta (Zipf)  (X = x, x = 1, 2, ...):
  p_X(x; s) = (1/x^s) / \zeta(s) ,  x = 1, 2, ... ;  s > 1
            = 0                     otherwise
  Normalization:  \sum_{x=1}^{\infty} C / x^s = 1  =>  C = 1 / \zeta(s) ,
  where \zeta(s) = \sum_{x=1}^{\infty} 1/x^s is the Riemann zeta function.

  E[X; s] = \frac{1}{\zeta(s)} \sum_{x=1}^{\infty} x \cdot \frac{1}{x^s}
          = \frac{1}{\zeta(s)} \sum_{x=1}^{\infty} \frac{1}{x^{s-1}}
          = \frac{\zeta(s-1)}{\zeta(s)}
  Var(X; s) = \frac{1}{\zeta(s)} \sum_{x=1}^{\infty} x^2 \cdot \frac{1}{x^s} - E[X; s]^2
            = \frac{\zeta(s-2)}{\zeta(s)} - \left(\frac{\zeta(s-1)}{\zeta(s)}\right)^2

  Example (s = 3.5):  E[X; 3.5] = \zeta(2.5)/\zeta(3.5) = 1.191 ;
  Var(X; 3.5) = \zeta(1.5)/\zeta(3.5) - (\zeta(2.5)/\zeta(3.5))^2 = 0.901

                                                                           141    INDEX



This second part of the Common PMFs table shows the Hyper-geometric, Poisson and Riemann Zeta (or
Zipf) PMFs.
The Hyper-geometric RV X answers the question "how many successes (defectives) X are obtained
with n test samples (trials without replacement) from a production run (sample space) that contains m
defective and N-m working items?" X takes on values corresponding to the number of successes
(defectives) in n dependent Bernoulli trials; the distribution is best understood in terms of the
Binomial identity \binom{N}{n} = \binom{m}{0}\binom{N-m}{n} + ... + \binom{m}{x}\binom{N-m}{n-x} + ...
+ \binom{m}{m}\binom{N-m}{n-m}, which, when divided by \binom{N}{n}, yields the distribution
\binom{m}{x}\binom{N-m}{n-x} / \binom{N}{n}, where X takes on values x = [x_min, x_max] with
x_min = max(0, m+n-N) and x_max = min(m, n), as allowed by the combinations without replacement.
The Poisson RV X answers the question "how many successes X in n Bernoulli trials with n very
large?" We shall discuss this in more detail in the second part of the course, where we pair it with a
continuous distribution. For now it is sufficient to know that it represents the limiting behavior of the
Binomial PMF as n -> \infty and p -> 0 with a = \lim(n p) held fixed; a = \lambda t is called the Poisson
parameter, where \lambda is a "rate" and t is a time interval for the data run. The PMF is the single term
a^x / x! in the power-series expansion of e^a, divided by e^a, i.e.
p_X(x) = (a^x / x!) / e^a for x = 0, 1, 2, 3, .... The Poisson RV has many applications in physics and
engineering.
The Riemann Zeta RV X has applications to language processing and prime number theory, and its
properties are given in the table. Note that the exponent must satisfy s > 1 in order to avoid the harmonic
series, which does not converge and therefore cannot satisfy the sum-to-unity condition on the PMF.
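Both the Poisson limit and the zeta moments are easy to probe numerically. A minimal MATLAB
sketch (the values a = 2 and s = 3.5 and the truncation points are illustrative choices):

    % (1) Poisson as the n -> inf, p -> 0 limit of the Binomial with a = n*p fixed.
    a = 2;  x = 0:10;
    poisson = a.^x ./ factorial(x) ./ exp(a);
    for n = [10 100 1000]
        p = a / n;
        % Binomial PMF via gammaln to avoid overflow warnings from nchoosek:
        binom = exp(gammaln(n+1) - gammaln(x+1) - gammaln(n-x+1) ...
                    + x*log(p) + (n-x)*log(1-p));
        fprintf('n = %4d: max |Binomial - Poisson| = %.2e\n', ...
                n, max(abs(binom - poisson)));
    end

    % (2) Zeta(Zipf) moments for s = 3.5 by truncating the zeta series.
    s = 3.5;  k = 1:1e6;                    % the s-2 = 1.5 series converges slowly,
    zs = sum(k.^-s);  z1 = sum(k.^-(s-1));  % so the variance below is accurate
    z2 = sum(k.^-(s-2));                    % only to a few parts in 1e3
    EX = z1/zs;  VX = z2/zs - EX^2;
    fprintf('E[X] = %.4f (table: 1.191);  Var(X) = %.4f\n', EX, VX);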




Chapter 5 – Continuous RVs

Probability Density Function (PDF)

  Event E = {x : a <= x <= b} :
    Pr[x \in E] = \int_E f_X(x) dx = \int_a^b f_X(x) dx
  [Figure: density f_X(x) with the shaded area Pr[a <= x <= b] between x = a and x = b]

    Pr[x = 2.0] = \int_{2.0}^{2.0} f_X(x) dx = 0
  Probability at a point = 0, except for a δ-fcn at that point.

  Mixed Continuous & Discrete Outcomes – Dirac δ-fcn:
    f_X(x) = \alpha \delta(x - x_0) + \frac{\beta}{(b-a)}
    \int_a^b \alpha \delta(x - x_0) dx = \int_{x_0-\epsilon}^{x_0+\epsilon} \alpha \delta(x - x_0) dx = \alpha
  [Figure: uniform level \beta/(b-a) on [a, b] plus the spike \alpha \delta(x - x_0) at x_0]

  Sampled Continuous Fcn g(x):
    f_X(x) = \sum_{k=0}^{n} \alpha_k \delta(x - x_k)
    \alpha_k = \int_a^b g(x) \delta(x - x_k) dx = g(x_k)
  [Figure: spikes \alpha_k \delta(x - x_k) at x_0, x_1, ..., x_n tracing the curve g(x)]

2/24/2012                                                                  3

In discrete probability an RV is characterized by its probability mass function (PMF) pX(x), which
specifies the amount of probability associated with each point in the discrete sample space. Continuous
probability generalizes this concept to a probability density function (PDF) fX(x) defined over a
continuous sample space. Just as the sum of pX(x) over the whole sample space must be unity, the
integral of fX(x) over the whole sample space must also be unity. An event E is defined by a sum or
integral over a portion of the sample space, as shown by the shaded area in the upper figure between x = a
and x = b.
The middle panel gives an example of a mixed distribution containing a continuous uniform distribution
β/(b-a) and a Dirac δ-function α·δ(x-x0) corresponding to a discrete contribution at the point x0.
The uniform distribution is shown as a continuous horizontal line at "height" β/(b-a) between a and
b, and the Dirac δ-function is shown with an arrow corresponding to a probability mass "α" accumulated
at the single point x = x0. The integral over the continuous part gives (b-a)·β/(b-a) = β, and the integral
of the Dirac δ-function α·δ(x-x0) over any interval containing x0 yields α. Thus, in order for this
expression to be a valid probability density function, we require that the sum of the two contributions be
unity: α + β = 1.
Consider the continuous curve fX(x) = g(x) in the bottom panel and take the sum of products αk·δ(x-xk).
Is this a valid discrete "PMF"? Only if the sum of the contributions αk is unity. Does it represent a
digital sampling of g(x)? No; in order to actually write down an appropriate "sampled" version of g(x),
we need to develop a "sampling" transformation Yk = Yk(X) for k = 0, 1, 2, ..., n so as to transform the
original continuous fX(x) to a discrete fY(yk) (see slide #26).




Cumulative Distribution Function (CDF)

  F_X(x) = Pr[X <= x] = \int_{x'=-\infty}^{x} f_X(x') dx'
  (the probability density PDF integrates to yield the CDF)

  Bdy Values :           F_X(-\infty) = 0 ;  F_X(+\infty) = 1
  Monotone Non-decr. :   F_X(b) >= F_X(a)  if  b >= a
  Prob Interpretation :  Pr[a <= x <= b] = F_X(b) - F_X(a)
  Density PDF :          \frac{d}{dx} F_X(x) = f_X(x)
    or,  dF_X(x) = F_X(x + dx) - F_X(x) = f_X(x) dx

  [Figures, left pair: PDF with two unit-height boxes on [0, 1/2] and [1, 3/2];
   its CDF ramps 0 -> 1/2, plateaus at 1/2, then ramps 1/2 -> 1.
   Right pair: PDF with constant level 1/2 on [0, 3/2] plus the spike (1/4) δ(x-1);
   its CDF ramps 0 -> 1/2 on [0, 1], jumps by 1/4 at x = 1, then ramps 3/4 -> 1.]

2/24/2012                                                                  7

The cumulative distribution function (CDF) for a continuous probability density function fX(x) is defined
in a manner similar to that for discrete distributions pX(x), except that the cumulative sum over a discrete
set is replaced by an integral over all X less than or equal to a value x. This integral yields a function
FX(x) = Pr[X <= x] which has the following important properties:
(i) FX(x) always starts at 0 and ends at 1,
(ii) FX(x) is continuous (except for jumps where fX(x) contains δ-functions, as in case (ii) below),
(iii) FX(x) is non-decreasing,
(iv) FX(x) is invertible, i.e., FX^{-1}(x) exists, on any region where fX(x) > 0, and
(v) the density fX(x) = d/dx{FX(x)} (since the exact differential dFX(x) = FX(x+dx) - FX(x) = fX(x)dx).
It is important to note all five properties of FX(x), as they have important consequences.
The figure shows the relationship between the density fX(x) and the cumulative distribution FX(x) for two
cases: (i) two regions of constant density (two "boxes") and (ii) one region of constant density plus a
delta function (one "box" and an arrow "spike").
In case (i) FX(x) ramps from a value of 0 to ½ over the region [0, ½] from the 1st constant-density box,
then remains constant at ½ over the region [½, 1], and finally ramps from ½ to 1 over [1, 3/2] from the 2nd
constant-density box. Note that the slopes of the two ramps are both "1" in this case and that the total
area under the density curve is 1·[1/2 - 0] + 1·[3/2 - 1] = 1.
In case (ii) FX(x) ramps from a value of 0 to ½ over the region [0, 1] by virtue of the constant "½" density
box, then jumps by "¼" because of the delta function, and finally continues its ramp from the value ¾ to
1. Note that this is simply the superposition of a constant density of "½" plus a delta function ¼·δ(x-1),
and again the total area under the density curve is ½·[3/2 - 0] + ¼ = 1.
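The box-density case is easy to reproduce numerically. A minimal MATLAB sketch (the grid
resolution is an arbitrary choice) builds the case (i) CDF by cumulative integration:

    % Integrate the two-box density of case (i) to obtain its CDF.
    x = linspace(-0.25, 1.75, 2001);
    f = 1.0 * ((x >= 0 & x <= 0.5) | (x >= 1 & x <= 1.5));   % two unit boxes
    F = cumtrapz(x, f);                                      % F_X(x) = integral of f
    fprintf('F(0.5) = %.3f, F(1.0) = %.3f, F(1.5) = %.3f\n', ...
            interp1(x, F, [0.5, 1.0, 1.5]));
    % Expected: 0.500, 0.500, 1.000 (ramp, plateau, ramp).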



Transformations of Continuous RVs

  • Transformation of density PDFs in 1 dimension
  • Transformation of joint density PDFs in 2 or more dimensions
  • Two Methods:
      1) CDF Method:
         Step #1) First find the CDF F_X(x) by integrating f_X(x)
         Step #2) Invert the y = g(x) transformation:  y = g(x)  =>  x = g^{-1}(y)
                  and use it to write F_Y(y) = Pr[Y <= y] in terms of the known F_X(x)
                  (Note: y = g(x) may not be "one-to-one"  =>  "multiplicity")
         Step #3) Differentiate wrt y:
                  f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{d}{dy} \int_{y'=-\infty}^{y} f_Y(y') dy'

      2) Jacobian Method: Transform the PDF using derivatives;
         express everything in terms of the variable y:
                  f_Y(y) dy = f_X(x) dx ,   y = g(x)
                  f_Y(y) = \frac{f_X(x)}{|dy/dx|} = \frac{f_X(g^{-1}(y))}{|g'(g^{-1}(y))|}
         (Note the absolute value.)

2/24/2012                                                                  14

It is very important to understand how probability densities change under a transformation of coordinates
y = g(x). We have seen several examples of such coordinate transformations for discrete variables,
namely,
(i) Dice: transform from individual dice coordinates (d1, d2) to the sum and difference coordinates (s, d),
corresponding to a 90-degree rotation of coordinates, and
(ii) Dice: transform from individual dice coordinates (d1, d2) to the minimum and maximum coordinates
(z, w), corresponding to corner-shaped surfaces of constant minimum or maximum values.
There are two methods for transforming the densities of RVs, namely (i) the CDF method and (ii) the
Jacobian method. While both are quite useful for 1-dimensional PDFs fX(x), the Jacobian method is
best for transforming joint RVs.
The CDF method involves three distinct steps as indicated on the slide, namely (i) compute the CDF FX(x),
(ii) relate FY(y) = Pr[Y <= y] to FX(x), then invert the transformation x = g^{-1}(y) and substitute to find
FY(y) with a redefined y domain, and (iii) differentiate wrt y to obtain the transformed probability
density fY(y) for the RV Y. Note that if the function is not one-to-one and therefore not invertible, it
must be broken up into intervals on which it is invertible, and the appropriate "fold-over" multiplicities
must be accounted for.
The Jacobian method uses derivatives of the transformation to transfer densities from the original set of
RVs to the new one; the Jacobian accounts for linear, areal, and volume changes between the coordinates.
In one dimension the Jacobian is simply a derivative and is obtained by transferring the probability in the
interval x to x+dx, fX(x)dx, to the probability in the interval y to y+dy, fY(y)dy. Equating the two
expressions yields fY(y) = fX(x) / |dy/dx| = fX(g^{-1}(y)) / |dy/dx|. Note that the absolute value is
necessary since fY(y) must always be greater than or equal to zero.
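The 1-D Jacobian rule is easy to sanity-check by simulation. A minimal MATLAB sketch, using
the illustrative monotone transform y = g(x) = e^x with X ~ Uniform(0, 1), for which the rule
predicts f_Y(y) = f_X(ln y)/|dy/dx| = 1/y on [1, e]:

    % Compare an empirical histogram of Y = exp(X) with f_Y(y) = 1/y.
    rng(3);
    y = exp(rand(1, 1e6));                       % transform the samples directly
    edges = linspace(1, exp(1), 21);
    centers = (edges(1:end-1) + edges(2:end)) / 2;
    hY = histcounts(y, edges, 'Normalization', 'pdf');   % empirical density
    fY = 1 ./ centers;                           % Jacobian-method prediction
    fprintf('max |empirical - formula| = %.4f\n', max(abs(hY - fY)));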



Method #1: Transformation of Continuous RV – CDF Method

  Resistance X = R:
    f_R(r) = 1/200   for 900 <= r <= 1100
           = 0       otherwise

  Step #1: Compute F_X(x)
    F_R(r) = Pr[R <= r] = \int_{r'=-\infty}^{r} f_R(r') dr'
           = 0                 for r < 900
           = (r - 900)/200     for 900 <= r <= 1100
           = 1                 for r > 1100
  [Figure: PDF box of height 1/200 on [900, 1100]; CDF ramp from 0 at r = 900 to 1 at r = 1100]

  Conductance Y = 1/R.  Step #2: Transform to F_Y(y)
    F_Y(y) = Pr[Y <= y] = Pr[R >= 1/y] = 1 - Pr[R <= 1/y] = 1 - F_R(1/y)
           = 1 - 0 = 1                     for 1/y < 900
           = 1 - (1/y - 900)/200           for 900 <= 1/y <= 1100
           = 1 - 1 = 0                     for 1/y > 1100

  Step #3: Differentiate F_Y(y)
    f_Y(y) = \frac{d}{dy} F_Y(y)
           = 0                   for y < 1/1100
           = \frac{1}{200 y^2}   for 1/1100 <= y <= 1/900
           = 0                   for y > 1/900
  [Figure: PDF f_Y(y) falling from 6050 at y = 1/1100 to 4050 at y = 1/900;
   CDF F_Y(y) rising from 0 to 1 over the same interval]

2/24/2012                                                                  15

The resistance X = R of a circuit has a uniform probability density function fR(r) = 1/200 between 900 and
1100 ohms, as shown in the top panel; the corresponding CDF FR(r) is the ramp function starting at "0"
for R <= 900 and reaching "1" at R = 1100 and beyond, as shown. The detailed analytic function is given in
the slide and represents the result of Step #1 of the CDF method.
The problem is to find the PDF for the conductance Y = 1/X = 1/R. We first write down the definition of
FY(y) for a given value Y = y and then re-express it in terms of R = 1/Y:
FY(y) = Pr[Y <= y] = Pr[R >= (1/y)] = 1 - Pr[R <= (1/y)]
      = 1 - FR(1/y)
This last expression is now evaluated in the lower panel of the slide by substituting r = 1/y into the
expression for FR(r) of the upper panel. Note that the resulting expression has been written down by
direct substitution and the intervals have been left in terms of 1/y. (This constitutes Step #2 of the
method.)
Finally, differentiating FY(y) wrt y, we find (Step #3) the desired PDF fY(y); we have also "flipped" the
"1/y" interval specifications and reordered the resulting "y" intervals in the customary increasing order.
As seen in this example, the CDF method requires careful attention to the definition of FY(y) in terms of
the cumulative probability of the variable Y. Since Y = 1/R, this leads to FY(y) = 1 - FR(1/y) and a
reverse ordering of the inequalities for the intervals.
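The derived density is easy to confirm by simulation. A minimal MATLAB sketch (the sample
size and bin count are illustrative choices):

    % Simulate R ~ Uniform(900, 1100), form Y = 1/R, and compare the empirical
    % density of Y with the derived f_Y(y) = 1/(200 y^2).
    rng(4);
    R = 900 + 200 * rand(1, 1e6);
    Y = 1 ./ R;
    edges = linspace(1/1100, 1/900, 21);
    centers = (edges(1:end-1) + edges(2:end)) / 2;
    hY = histcounts(Y, edges, 'Normalization', 'pdf');
    fY = 1 ./ (200 * centers.^2);          % derived density, ~6050 down to ~4050
    fprintf('max relative error vs histogram = %.2f%%\n', ...
            100 * max(abs(hY - fY) ./ fY));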




Method #2: Transformation of Continuous RV – Derivative (Jacobian) Method

    f_R(r) = 1/200   for 900 <= r <= 1100
           = 0       otherwise

    f_Y(y) dy = f_R(r) dr   =>  find f_Y(y)

    f_Y(y) = f_R(r) \left|\frac{dr}{dy}\right| = \frac{f_R(r)}{|dy/dr|}
           = \frac{(1/200)}{|-1/r^2|} \bigg|_{r = 1/y} = \frac{(1/200)}{y^2}

    f_Y(y) = \frac{1}{200 y^2}   for  \frac{1}{1100} <= y <= \frac{1}{900}

  [3-D figure: the uniform density f_X(x) = 1/200 stands in the x-z plane over
   x = R in [900, 1100]; the transformation hyperbola xy = 1 (y = 1/R) lies in
   the x-y plane with slope dy/dx; the transformed density f_Y(y), running from
   6050 down to 4050, stands in the z-y plane.]

  Note: f_Y(y) is large where the slope is small, and vice versa. The same
  differential area (probability) is mapped via the hyperbola to yield the
  tall thin and short fat strip areas shown for f_Y(y).

2/24/2012                                                                  16

The Jacobian Method is much more straightforward and moreover has a very intuitive visualization in
the 3-dimensional plot shown on this slide. The uniform probability density function fR(r) = 1/200 between
900 and 1100 ohms is written explicitly in the first boxed equation. The Jacobian method just takes the
constant fR(r) = 1/200 and divides it by the magnitude of the derivative |dy/dr| = |−1/r²| = y² to yield directly
fY(y) = 1/(200y²) for y ε [1/1100, 1/900].
The 3-dimensional plot shows exactly what is going on:
i) The original uniform distribution fX(x) = 1/200 is displayed as a vertical rectangle in the x-z plane.
ii) Sample strips at either end with width "dx" have the same small probability dP = fX(x)dx as shown. At
R = 900, the density fX(x) is divided by the large slope |dy/dx|, yielding a smaller magnitude for fY(y) as
illustrated, but this is compensated by a proportionately larger "dy" and thus transfers the same small
probability dP = fY(y)dy.
iii) Conversely, the strip at R = 1100 is divided by a small slope |dy/dx| and yields a larger magnitude for
fY(y), which is compensated by a proportionately smaller "dy", again transferring the same dP.
iv) The end point values of the transformed density fY(y) are illustrated in the figure. The strip width "dx"
cuts the x-y transformation curve at two red points whose "dy" width is small at x = 1100 and
large at x = 900, as determined by the slope of the curve. The shape between these end points is a
result of the smoothly varying slope of the transformation hyperbola shown in the x-y plane.
Thus the slope of the transformation curve (hyperbola xy=constant in this case) in the x-y plane
determines how each “dx” strip of the uniform distribution fX(x)=1/200 in the x-z plane transfers to the
new density fY(y) shown in the z-y plane. This 3-dimensional representation de-mystifies the nature of
the transformation of probability densities and makes it quite natural and intuitive for 1-dimensional
density functions. It is easily extended to two-dimensional joint distributions.
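The same result can be confirmed by brute force. The short MatLab sketch below (illustrative, with an
arbitrary sample size) histograms Y = 1/R for uniform R and overlays the Jacobian answer:

    % Monte Carlo check of the Jacobian result fY(y) = 1/(200 y^2)
    N = 1e6;                            % number of samples (arbitrary)
    r = 900 + 200*rand(N, 1);           % R ~ Uniform(900, 1100), fR(r) = 1/200
    y = 1./r;                           % transformed variable Y = 1/R
    histogram(y, 100, 'Normalization', 'pdf'); hold on
    yy = linspace(1/1100, 1/900, 200);
    plot(yy, 1./(200*yy.^2), 'r', 'LineWidth', 2)   % analytic density
    xlabel('y'); ylabel('f_Y(y)'); legend('simulated', '1/(200y^2)')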



                                                                                                                                          16
Transformation of Continuous RV – Example 3 “Multiplicity Factor”
           Gaussian PDF:
               fX(x) = (1/√(2π)) e^(−x²/2) ;  −∞ < x < +∞

           Find PDF for Y = X² :  not a 1-1 mapping, (−∞, ∞) → (0, ∞),
           so the density is doubled ("fold-over") at each y.

               fY(y) = 2 · fX(x)/|dy/dx| = 2 · [(1/√(2π)) e^(−y/2)] / (2√y)

                     = (1/√(2πy)) e^(−y/2)   for 0 < y < +∞

           General Rule:
               fY(y) = α · fX(x) / |dy/dx|
               α = multiplicity factor ("fold-over")

           [Figures: the upper panel shows the parabola y = x² folding (−∞, ∞)
           onto (0, ∞), with "double density" points where −x and +x map to the
           same y. The 3-D panel shows the Gaussian fX(x) = (1/√(2π)) e^(−x²/2)
           in the x-z plane, the curve y = x² in the x-y plane, and the
           resulting density (1/√(2πy)) e^(−y/2) in the y-z plane, built from
           two equal contributions from −x and +x.]
        2/24/2012                                                                                                     18

The transformation of a Gaussian PDF under the transformation Y = X² is easily computed using the
Jacobian method provided one incorporates a multiplicity factor α, as shown in the boxed density
equation. The multiplicity factor arises because there are two contributions to the same y-value, one
from −x and the other from +x, as illustrated in the upper figure; thus folding the parabola across the x = 0
symmetry line yields twice the density on positive x, and this corresponds to a multiplicity factor α = 2 in
the boxed density transformation equation.
The 3-D plot shows the original Gaussian density function (grey) in the x-z plane, the transformation y = x²
in the x-y plane, and the resulting distribution shown as a dashed curve in the y-z plane. The two thin
vertical slices at −x and +x are mapped to the same y-value and hence double the density contribution to
fY(y) as shown.
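A MatLab sketch of the fold-over (again illustrative rather than from the course notes) squares standard
normal samples and overlays the α = 2 result fY(y) = e^(−y/2)/√(2πy):

    % Fold-over check: Y = X^2 for X ~ N(0,1)
    N = 1e6;
    x = randn(N, 1);
    y = x.^2;                           % both -x and +x map to the same y
    histogram(y, 0:0.1:6, 'Normalization', 'pdf'); hold on
    yy = 0.05:0.01:6;
    plot(yy, exp(-yy/2)./sqrt(2*pi*yy), 'r', 'LineWidth', 2)
    xlabel('y'); ylabel('f_Y(y)'); legend('simulated', 'Jacobian with \alpha = 2')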




                                                                                                                                         18
Analog to Digital (A/D) Converter - Series of Step Functions
                   Continuous Representation of Discrete "sampled" Distributions

                   A/D converter mapping fcn:   Y = g(X) = k+1 for k < x ≤ k+1
                   (staircase over x ∈ [−3, 3]; step outputs y = −2, −1, 0, 1, 2, 3)

                   Mapped density:   fY(y) = Σk αk · δ(y − yk)

          a) Exponential:
                fX(x) = a e^(−ax) for x ≥ 0 ;  0 for x < 0
                αk = ∫ from k−1 to k of a e^(−ax) dx = −e^(−ax) evaluated from k−1 to k
                   = e^(−ak)(e^a − 1) ;  k = 1, 2, ...
                fY(y) = Σ from k=1 to ∞ of e^(−ak)(e^a − 1) · δ(y − k)
                For a = 0.1:  e^(−0.1k)(e^(0.1) − 1) = 0.105·e^(−0.1k)
                    k:   1      2      3     ...   11
                    αk:  0.095  0.086  0.078 ...   0.035
                [Figure: stems αk δ(y − k) decaying from ≈0.1 over y = 0 to 20]

          b) Gaussian:
                fX(x) = (1/√(2π)) e^(−x²/2) ;  −∞ < x < ∞
                αk = ∫ from k−1 to k of (1/√(2π)) e^(−x²/2) dx = φ(k) − φ(k−1) ,
                where φ(k) ≡ ∫ from −∞ to k of (1/√(2π)) e^(−x²/2) dx ;  k ∈ (−∞, ∞)
                fY(y) = Σk αk · δ(y − yk)
                [Figure: stems αk with a Gaussian envelope]

          c) Uniform:
                fX(x) = 1/10 for 0 ≤ x ≤ 10 ;  0 otherwise
                αk = ∫ from k−1 to k of (1/10) dx = [k − (k−1)]/10 = 1/10 ;  k = 1, 2, ..., 10
                fY(y) = Σ from k=1 to 10 of (1/10) · δ(y − k)
                [Figure: ten equal stems of height 1/10 at y = 1, ..., 10]
        2/24/2012                                                                                                                                                                      26

In discussing the half-wave rectifier on the last slide we found that the effect of a “zero” slope
transformation function was to pile up all the probability in the x-interval into a single δ-function at the
constant y=“0” value associated with that part of the transformation. Here we extend that concept to a
“sample & hold” type mapping function typical of an Analog to Digital (A/D) converter. The specific
mapping function y=g(x) = k+1 for k < x ≤ k+1 is illustrated in the grey box as a series of horizontal
steps over the entire range of x [-3, 3]; the y-values for these steps range from y=-2 to y=+3. Each
horizontal (zero-slope) line accumulates the integral of fX(x) from x=k to k+1 onto its associated y-value
shown as a red circle with the point of a δ-function arrow pointing up out of the page and having an
amplitude given by the integral for that interval denoted by the symbol αk.
The table shows several examples of a digitally sampled representation for a) Exponential, b) Gaussian,
and c) Uniform distributions in the three columns. The rows of the table give the specific continuous
densities for each, the computations for the amplitudes of the discrete digital samples αk, the resulting
sum of δ-functions, and finally a plot showing arrows of different lengths to represent the δ-functions of
the sampled distributions.
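For the exponential column, the amplitudes αk and their normalization are easy to verify numerically; the
MatLab sketch below assumes the decay rate a = 0.1 used in the table:

    % delta-function amplitudes for the sampled exponential, a = 0.1
    a = 0.1; k = 1:20;
    alpha = exp(-a*k)*(exp(a) - 1);     % alpha_k = integral of a*exp(-a*x) over (k-1, k]
    stem(k, alpha)                      % k = 1 gives 0.095, k = 2 gives 0.086, ...
    xlabel('y = k'); ylabel('\alpha_k')
    sum(exp(-a*(1:1000))*(exp(a) - 1))  % geometric series: all the probability sums to 1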




                                                                                                                                                                                                             26
Order Statistics - General Case n Random Variables
          General Case n Variables: X1, X2, ..., Xn RVs
          Assume RVs are Independent and Identically Distributed (IID):
              fX1X2...Xn(x1, x2, ..., xn) = fX(x1) · fX(x2) · ... · fX(xn)

          Reorder {X1, X2, ..., Xn} as follows:
              Y1 = smallest of {X1, X2, ..., Xn}
              Y2 = next smallest of {X1, X2, ..., Xn}
              ...
              Yj = jth smallest of {X1, X2, ..., Xn}   (the jth "order statistic")
              ...
              Yn = largest of {X1, X2, ..., Xn}
              Y1 < Y2 < ... < Yj < ... < Yn ;  each has the same PDF fX(y) in the variable "y"

          Find the PDF of the jth "order statistic":
              Pr[y ≤ Yj ≤ y + dy] = fYj(y) dy ;  j = 1, 2, ..., n

          [Figure: the interval (y, y+dy) carries the differential probability
          fX(y)dy; the (j−1) RVs Y1 | Y2 | ... | Yj−1 lie below y, each with
          P[Yk ≤ y] = FX(y), contributing [FX(y)]^(j−1); the (n−j) RVs
          Yj+1 | Yj+2 | ... | Yn lie above y, each with P[Yk > y] = 1 − FX(y),
          contributing [1 − FX(y)]^(n−j).]

          Diff'l Prob. of "one sequence" for the jth order statistic:
              (FX(y))^(j−1) · fX(y)dy · (1 − FX(y))^(n−j)

          Case n = 3 {Min, Mdl, Max}; Y2 = "Mdl" statistic.
          Y2 could be any one of {X1, X2, X3}.
          There are 3! = 6 orderings; however, we partition into 3 groups and
          permutations within a group are irrelevant:
              j = 1 (Min): [φ | Y1 | Y2 Y3]   3!/(0! 1! 2!) = 3 :
                  [φ | X1 | X2 X3], [φ | X2 | X1 X3], [φ | X3 | X1 X2]
              j = 2 (Mdl): [Y1 | Y2 | Y3]     3!/(1! 1! 1!) = 6 :
                  [X2 | X1 | X3], [X3 | X1 | X2], [X1 | X2 | X3],
                  [X3 | X2 | X1], [X1 | X3 | X2], [X2 | X3 | X1]
              j = 3 (Max): [Y1 Y2 | Y3 | φ]   3!/(2! 1! 0!) = 3 :
                  [X2 X3 | X1 | φ], [X1 X3 | X2 | φ], [X1 X2 | X3 | φ]

          2/24/2012                                                        48

Order Statistics for the general case of n IID Random Variables is detailed on this slide. The n IID RVs
{X1, X2,..., Xn} are re-ordered from the smallest Y1 to the largest Yn and the jth Y in the sequence Yj is
called the “jth order statistic”. Again we fix a value Y=y and consider the continuous range of re-ordered
Y-values illustrated in the figure: the small interval from y to y+dy contains the differential probability
for the jth order statistic Yj given by fX(y)dy; all Y-values less than this belong to the Y1 through Yj-1 and
those greater belong to Yj+1 through Yn as shown in the inset figure. Now for each of the Ys on the left we
have the probability Pr[Y1 ≤ y] = FX(y), Pr[Y2 ≤ y] = FX(y), ..., Pr[Yj−1 ≤ y] = FX(y), and because they are
IID the total probability of those on the left is Pr[Yleft ≤ y] = [FX(y)]^(j−1); similarly on the right we find
Pr[Yright > y] = [1 − FX(y)]^(n−j). So for the reordered Ys the differential probability is just the product of these
three terms multiplied by a multiplicity factor α, viz.,
                dP = Pr[y ≤ Yj ≤ y+dy] = fYj(y) dy = α [FX(y)]^(j−1) fX(y) [1 − FX(y)]^(n−j) dy
The multiplicity factor α results from the number of re-orderings of {X1, X2,..., Xn} for each order
statistic Yj ; arguments for n=3 and n=4 are illustrated on this slide and the next. These arguments look
(in turn) at each order statistic min, middle(s), and max and compute in each case the number of distinct
arrangements of {X1, X2,..., Xn} that yield the three groups relative to the “separation point” Y=y and
arrive at multinomial forms dependent upon the orderings for each statistic. The specific multiplicity
factors for the cases for n=3,4 are easily found to be
                α = 3!/[(j−1)! 1! (3−j)!]  (n = 3) ;   α = 4!/[(j−1)! 1! (4−j)!]  (n = 4)
and the final results for the PDF of the jth order statistic fYj(y) in these cases are
                fYj(yj) = 3!/[(j−1)! 1! (3−j)!] · [FX(yj)]^(j−1) fX(yj) [1 − FX(yj)]^(3−j)  for j = 1, 2, 3     (n = 3)
                fYj(yj) = 4!/[(j−1)! 1! (4−j)!] · [FX(yj)]^(j−1) fX(yj) [1 − FX(yj)]^(4−j)  for j = 1, 2, 3, 4  (n = 4)
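These formulas are easy to spot-check by simulation. The MatLab sketch below (ours, for Uniform(0,1)
samples, where FX(y) = y and fX(y) = 1) compares the middle statistic of n = 3 with fY2(y) = 6 y (1 − y):

    % order-statistic PDF check: middle of n = 3 Uniform(0,1) samples (j = 2)
    N = 1e5; n = 3; j = 2;
    X = rand(N, n);
    Y = sort(X, 2);                     % each row reordered: Y1 < Y2 < Y3
    histogram(Y(:, j), 50, 'Normalization', 'pdf'); hold on
    yy = 0:0.01:1;
    plot(yy, 6*yy.*(1 - yy), 'r', 'LineWidth', 2)   % 3!/(1!1!1!) * y * 1 * (1-y)
    xlabel('y'); ylabel('f_{Y_2}(y)')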




                                                                                                                                                                  48
Random Processes – Introduction - Lec#4
         • Time Series Data = Physical Measurements in time
         • Random Process = Sequence of random variable realizations
              – Geiger Counter Sequence of “detections” - Poisson Process
              – Communication Binary Bit Stream - Bernoulli Process “ 01001…”
              – E&M Propagation Phase (I-Q components) - Gaussian Process
         • Arrival Event: Success =“arrival” (of an event in time)
              – Interarrival Times for Random Processes
                    • Not only interested in how many successes K ("arrivals") there are
                    • But also interested in the "specific time of arrivals," e.g., TK = time of kth arrival
              – DSP Chip Interrupts: time between interrupts is used for data processing
              – Waiting on Telephone: "you are the 10th customer in line and ...
                your wait will be approximately 7 minutes"

                    Random Process       Number of Arrivals    Interarrival Times
                    Geiger Counter       Poisson               Exponential
                    Binary Bit Stream    Bernoulli             Geometric




        2/24/2012                                                                                61

Observations of physical processes produce measurements over time which almost always have
components described by a random process. Some examples are Geiger counter detections (Poisson
Process), Binary bit streams (Bernoulli Process) and Electromagnetic wave I, Q Phase components
(Gaussian Process).
Because these processes take place over time, the notion of a "success" is translated to an "arrival" at a
specific time. Moreover, we are not only interested in how many successes K there are, but also their
specific arrival times, i.e., we would like to know the time of the kth arrival Tk. This has application to
many physical processes such as the timing of DSP chip interrupts relative to their “clock cycles” and the
queuing of customers in a telephone answering system. In both cases you want to make sure the system
can handle the “load” in an appropriate manner; for the DSP chip you need to minimize the number of
times you are near the leading or trailing “edge” of the timing pulse in order to avoid errors, while for the
telephone answering service, the 10th customer would like to know how long he must wait in the queue
before being served.
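As a concrete illustration of the table's first row, the MatLab sketch below (illustrative, with an assumed
rate λ = 2 per second) builds a Poisson arrival stream from exponential interarrival times:

    % Poisson process from exponential interarrival times, rate lambda = 2
    lambda = 2; N = 1e5;
    dt = -log(rand(N, 1))/lambda;       % interarrival times ~ Exponential(lambda)
    T  = cumsum(dt);                    % arrival times T1, T2, ... (Tk = kth arrival)
    K  = sum(T <= 10);                  % arrivals in [0, 10]: Poisson, mean = 20
    histogram(dt, 50, 'Normalization', 'pdf'); hold on
    tt = 0:0.01:4;
    plot(tt, lambda*exp(-lambda*tt), 'r')   % exponential interarrival density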




                                                                                                                61
Multi-User Digital Communication “CDMA” Arrival Slots
           •    Two signals s1, s2 ; decode s1 or s2 in a given time slot
           •    a priori Prob:  P[s1] = 3/4 ;  P[s2] = 1/4
           •    Decoding Statistics:
                    decoded "1":      P[1|s1] = 2/3 ;  P[1|s2] = 2/3
                    not decoded "0":  P[0|s1] = 1/3 ;  P[0|s2] = 1/3

           [Tree diagram for one time slot:
                P[s1,1] = P[1|s1]·P[s1] = (2/3)(3/4) = 1/2   → s1 decoded, "success", p1 = 1/2
                P[s1,0] = (1/3)(3/4) = 1/4                   → s1 not decoded, "failure"
                P[s2,1] = P[1|s2]·P[s2] = (2/3)(1/4) = 1/6   → failure for s1
                P[s2,0] = (1/3)(1/4) = 1/12                  → failure for s1
            All outcomes other than {s1, 1} are failures for s1, so q1 = 1/2.]

           Nr time slots ("trials"), r decodes of s1, with p1 = q1 = 1/2:
                pNr(n) = C(n−1, r−1) p^r q^(n−r)

           1) Pr[ 1st decode in 4th slot ]:
                Pr[N1 = k] = pN1(k) = q^(k−1) p  ⇒  Pr[N1 = 4] = pN1(4) = (1/2)^3 (1/2) = 1/16

           2) Pr[ 4th decode in 10th slot | 3 decodes in 1st 6 time slots ]:
                No memory — restart with slots 7 to 10; need one decode in 4 slots:
                Pr[N1 = 4] = pN1(4) = q^3 p = (1/2)^3 (1/2) = 1/16

           3) Pr[ 2nd decode in 4th slot ]:
                Pr[Nr = n] = pNr(n) = C(n−1, r−1) p^r q^(n−r)
                ⇒ Pr[N2 = 4] = pN2(4) = C(3, 1) p^2 q^2 = 3 (1/2)^4 = 3/16

           4) Pr[ 2nd decode in 4th slot | no decodes in 1st 2 time slots ]:
                No memory of the failures in slots 1 & 2 — "renewal"; need r = 2
                in the two remaining slots 3 & 4:
                Pr[N2 = 2] = pN2(2) = p^2 = (1/2)^2 = 1/4
                { "means" N2 > 2 }
                Pr[N2 = 4 | N2 > 2] = Pr[N2 = 4, N2 > 2] / Pr[N2 > 2] = pN2(4) / (1 − pN2(2))
                                    = (3/16) / (1 − 1/4) = 1/4
          2/24/2012                                                                                                                                       78

This example illustrates renewal properties and time slot arrivals of the Geometric and Negative Binomial RV distributions.
In a multiuser environment the digital signals from multiple transmitters can occupy the same signal processing time slot so
long as they can be distinguished by their modulation characteristics. Code Division Multiple Access (CDMA) uses a
pseudorandom code that is unique to each user to “decode” the proper signal source.
Consider two signals s1 and s2 being processed in the same time slot with a priori “system usage” given by P[s1] = ¾ and P[s2]
= ¼ ; further let “1” denote successful and “0” denote unsuccessful decodes respectively. Given that each signal has the same
2/3 probability of a successful decode P[1|s1] = P[1|s2] = 2/3, we can use the tree to find the single trial probability of success
for decoding each signal.
For signal s1 we see that the end state {s1, 1} represents a successful decode and has p1 = 1/2; all other states {s1, 0}, {s2, 1},
{s2, 0} represent failure to decode signal s1, with probability q1 = 1/4 + 1/6 + 1/12 = 1/2. Similarly for signal s2 we see that the
end state {s2, 1} represents a successful decode of s2 and has p2 = 1/6; all other states {s2, 0}, {s1, 1}, {s1, 0} represent failure to
decode signal s2, with probability q2 = 1/12 + 1/2 + 1/4 = 10/12 = 5/6.
We consider successive decodes of s1 as independent trials with probability of success p1=1/2 . Thus, the probability of
having r- successful decodings of s1 in Nr signal processing slots “trials” is given by the Negative Binomial PMF
                pNr(n) = C(n−1, r−1) p1^r q1^(n−r)   with n = r, r+1, r+2, ...   and p1 = q1 = 1/2
1) Pr of 1st decode (r = 1) in 4th slot (N1 = 4) is pN1(4) = C(3, 0) p1^1 q1^3 = 1·(1/2)^4 = 1/16
2) Pr of 4th decode (r = 4) in 10th slot (N4 = 10) given 3 previous decodes in the 1st 6 slots is found by "restarting" the process with
slots #7, 8, 9, 10, so we need only one decode (r = 1) in 4 slots, i.e., N1 = 4, which is identical to part 1) and yields
Pr[N4 = 10 | N3 = 6] = pN1(4) = C(3, 0) p1^1 q1^3 = 1·(1/2)^4 = 1/16
3) Pr of 2nd decode (r = 2) in 4th slot (N2 = 4) is pN2(4) = C(3, 1) p1^2 q1^2 = 3·(1/2)^4 = 3/16
4) Pr of 2nd decode (r = 2) in 4th slot given the 1st two slots were not decoded is found by "restarting" the process with slots #3, 4,
so we need r = 2 in the two remaining slots, N2 = 2, which means two successes in two trials, so we have
pN2(2) = C(1, 1) p1^2 q1^0 = 1·(1/2)^2 = 1/4
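Part 3) is easily checked by Monte Carlo. The MatLab sketch below (illustrative) counts runs in which the second decode of s1
lands exactly in the 4th slot, with single-slot success probability p1 = 1/2:

    % Monte Carlo check of part 3): Pr[N2 = 4] = 3*(1/2)^4 = 3/16
    p1 = 1/2; Ntrials = 1e6;
    slots = rand(Ntrials, 4) < p1;                    % "1" = successful decode of s1
    hit = (sum(slots, 2) == 2) & (slots(:, 4) == 1);  % 2nd decode exactly at slot 4
    fprintf('estimate %.4f   vs   3/16 = %.4f\n', mean(hit), 3/16)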




                                                                                                                                                                                  78
Binary Communication with Noise
            Gaussian under Linear Transformation Y = eX + f :
                X : N(µX, σX²)  →  Y : N(eµX + f, e²σX²) ,
                i.e.,  µY ≡ eµX + f  and  σY² ≡ e²σX²

            [Block diagram: Binary Generator → Modulator (+a for "1", −a for "0")
             → Channel adds noise X : N(0,1) → Threshold Detector
                 Y1 = a + X : N(a, 1)    →  d1 = detect "1"
                 Y0 = −a + X : N(−a, 1)  →  d0 = detect "0" ]

            Threshold Detector (threshold y = c):
                Y > c  ⇒  detect "+a" or "1"
                Y ≤ c  ⇒  detect "−a" or "0"

            [Figure: conditional densities fY|A(y|−a) and fY|A(y|+a) centered at
             −a and +a with the threshold y = c between them; the hatched area
             y ≤ c under the "+a" curve is the Type I error "Missed Detection",
             and the hatched area y > c under the "−a" curve is the Type II
             error "False Positive".]

          Prob of an Error for Detecting a "1":
                P(Er "1") = P(Y ≤ c | +a) P(+a) + P(Y > c | −a) P(−a)
                    Type I Error "Missed Detection":  does not exceed the threshold
                                                      but belongs to the "+a" distribution
                    Type II Error "False Positive":   exceeds the threshold
                                                      but belongs to the "−a" distribution
        2/24/2012                                                                                                             97

Consider the Binary communication channel depicted in the upper sketch: A binary sequence of “1”s and
“0”s is generated and then amplitude modulated by a positive amplitude +a for “1” and –a for “0” as
illustrated by the “square wave pulse train” at the modulator. Zero mean unit variance Gaussian noise
N(0,1) is added by the “channel” and the (signal + noise) outputs are two distinct Gaussian RVs : Y1= a
+X ~ N(+a, 1) and Y0=–a +X ~ N(-a,1) about two different means as shown in the probability density
plot. This output is presented to a Threshold detector which attempts to detect the original sequence of
"1"s and "0"s by setting a threshold Y = c (vertical dashed line) and assigning a "1" to Y-values to the
right and a "0" to Y-values to the left of the threshold.
Considering the detection of “1” we see that two types of error can occur as follows:
Type I Missed Detection: P(Y≤c | +a) The larger hatched area on the left with Y<c which belongs to the
N(+a,1) curve but is rejected because it does not exceed the threshold “c”
Type II False Positive: P(Y>c | -a) The smaller hatched area on the right with Y>c which belongs to the
“0” N(-a,1) curve but is falsely detected as “1” because it exceeds the threshold “c”
The total probability for an error in detecting a "1" is the sum of each conditional multiplied by its a
priori probability, as shown in the bottom equation. The total probability for an error in detecting a "0" is written
down in an analogous fashion as a sum of conditionals multiplied by their a priori probabilities (not shown).
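The two error terms are just Gaussian tail areas, so they can be evaluated with erfc in MatLab; the sketch
below assumes illustrative values a = 1, c = 0, and equal a prioris:

    % error probability for detecting a "1": threshold c, levels +/-a, N(0,1) noise
    a = 1; c = 0; P_plus = 0.5; P_minus = 0.5;  % assumed a prioris
    Q = @(x) 0.5*erfc(x/sqrt(2));               % Gaussian tail Pr[N(0,1) > x]
    P_miss  = Q(a - c);                         % Type I:  P(Y <= c | +a)
    P_false = Q(c + a);                         % Type II: P(Y >  c | -a)
    P_err1  = P_miss*P_plus + P_false*P_minus   % total error in detecting a "1"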




                                                                                                                                            97
Common PDFs - “Continuous” and Properties
             Columns: RV Name ; PDF ; Generating Fcn φ(s) = E[e^(Xs)] ;
             Mean E[X] = ∫ x·fX(x) dx ; Variance var(X) = E[X²] − E[X]²

             Uniform
                 PDF:       fX(x) = 1/(b−a) for a ≤ x ≤ b ;  0 otherwise
                 Gen Fcn:   (e^(sb) − e^(sa)) / [s(b−a)]
                 Mean:      (a+b)/2
                 Variance:  (b−a)²/12

             Exponential  (λ > 0)
                 PDF:       fT(t) = λe^(−λt) for t ≥ 0 ;  0 for t < 0
                 Gen Fcn:   λ/(λ−s)
                 Mean:      1/λ   ("exponential wait")
                 Variance:  1/λ²

             Gamma / r-Erlang  (r = integer ; λ = arrival rate)
                 PDF:       fTr(t) = λe^(−λt)(λt)^(r−1)/(r−1)! for t ≥ 0 ;  0 for t < 0
                            (exponential for r = 1; for r > 1 peaks at tmax = (r−1)/λ)
                 Gen Fcn:   (λ/(λ−s))^r
                 Mean:      r/λ ;  E[T1] = 1/λ, E[T2] = 2/λ, E[T3] = 3/λ
                            (for r = 3: three "exponential waits",
                             E[T3] = 1/λ + 1/λ + 1/λ)
                 Variance:  r/λ²

             Normal N(µ, σ²)
                 PDF:       fX(x) = (1/(√(2π)·σ)) e^(−(x−µ)²/2σ²) ;  −∞ < x < ∞
                            (Gaussian, peaks at x = µ)
                 Gen Fcn:   e^(µs + (σs)²/2)
                 Mean:      µ
                 Variance:  σ²

             Rayleigh  (x > 0 ; a > 0)
                 PDF:       fX(x) = a²x e^(−a²x²/2)   (zero at x = 0, peaks at x = 1/a)
                 Gen Fcn:   1 + (s/a) e^((s/a)²/2) √(π/2) [1 + erf((s/a)/√2)]
                 Mean:      (1/a)·√(π/2)
                 Variance:  (2 − π/2)/a²
        2/24/2012                                                                                                                                                                                                                   101

This table compares some common continuous probability distributions and explores their fundamental
properties and how they relate to one another. A brief description is given under the "RV Name" column,
followed by the PDF formula and figure in col#2, the generating function in col#3, and formulas for the
mean and variance in the last two columns.
The Uniform Distribution has a constant magnitude 1/(b-a) over the interval [a,b]; the mean is at the
center of the distribution (a+b)/2 and the variance is (b-a)2/12 .
The Exponential Distribution decays exponentially with time from an initial probability density λ at
t = 0. The mean time for an arrival is E[T] = 1/λ, which equals the e-folding time of the exponential. Its
variance is 1/λ². The complementary cumulative exponential distribution is the probability that the first arrival T1
occurs outside a fixed time interval [0, t]; it equals the probability that the discrete number of Poisson
arrivals within [0, t] is zero, that is, Pr(T1 > t) = Pr(K(t) = 0).
The r-Erlang / Gamma Distributions for r > 1 all rise from zero to reach a maximum at (r−1)/λ and then
decay almost exponentially ~t^(r−1)e^(−λt) to zero. The mean occurs after a wait of one exponential mean
wait time 1/λ for r = 1, two 1/λ waits for r = 2, and r 1/λ waits for any r. The variance is r times the
exponential variance 1/λ². The complementary cumulative r-Erlang distribution is the probability that the rth arrival
time Tr occurs outside a fixed time interval [0, t]; this equals the probability that the discrete number of
Poisson arrivals K(t) ≤ (r−1), i.e., Pr(Tr > t) = Pr(K(t) ≤ r−1). The Gamma density is a generalization of the
rth Erlang density obtained by replacing (r−1)! with Γ(r), making it valid for non-integer values of r.
The Gaussian (Normal) Distribution is the most universal distribution in the sense that the Central
Limit Theorem requires sums of many IID RVs to approach the Gaussian distribution.
The Rayleigh Distribution results from the product of two independent Gaussians when expressed in
polar coordinates and integrated over the angular coordinate. The probability density is zero at x = 0 and
peaks at x = 1/a before it drops towards zero with a "Gaussian-like" shape for x > 0. It is compared with
the Gaussian, which is symmetric about x = 0.
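The r-Erlang row, for example, can be reproduced by summing exponential waits, as in this illustrative
MatLab sketch with r = 3 and λ = 2:

    % r-Erlang as the sum of r exponential waits (r = 3, lambda = 2)
    r = 3; lambda = 2; N = 1e5;
    T = sum(-log(rand(N, r))/lambda, 2);   % three exponential waits per sample
    histogram(T, 60, 'Normalization', 'pdf'); hold on
    tt = 0:0.01:6;
    plot(tt, lambda*exp(-lambda*tt).*(lambda*tt).^(r-1)/factorial(r-1), 'r')
    fprintf('mean %.3f (r/lambda = %.2f), var %.3f (r/lambda^2 = %.2f)\n', ...
            mean(T), r/lambda, var(T), r/lambda^2)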



                                                                                                                                                                                                                                                          101
Consequences of Central Limit Theorem
           Discrete Uniform PMF:
               pX(x) = (1/11) Σi δ(x − xi) ;  xi = −.5, −.4, ..., 0, ..., .4, .5
               [Figure: 11 equal stems of height 1/11 at x = −.5 to .5]

        Generate a uniform sequence of N = 1000 points {Xi}:
               {Xi}:  .2 | .5 | −.1 | .3 | −.2 | −.1 | −.1 | .4 | −.3 | .1 | −.5 | −.1 | ...

        Sum of n uniform variates Xi:   Zn = Σ from i=1 to n of Xi ;  n = 2, 4, 8, 12
               n = 2:    .7     .2     −.3     .5     −.2     −.6
               n = 4:        .9            .2            −.8
               n = 8:              1.1
               n = 12:                  .2

        Plot the frequency of occurrence:  fZn(z) ≈ pZn(z)

           [Figure: frequency-of-occurrence curves for n = 2, 4, 12 vs. z over
           roughly [−2.0, 2.0], with the original uniform PMF pX(x) = 1/11 shown
           as a dashed rectangle. The curves give the "shape" of the frequency
           of occurrence for discrete points spaced 0.1 apart.]

           Central Limit Thm ⇒ generates a Gaussian as n = 2, 4, 8, 12, ... grows large

        2/24/2012                                                        109

The Discrete Uniform PMF with values at the 11 discrete points x = {−.5, −.4, −.3, −.2, −.1, 0, .1,
.2, .3, .4, .5} can be expressed as a sum of 11 δ-functions with magnitude 1/11 at each of these points, as
shown in the figure. This can also be thought of as the result of a "sample and hold" transform (see
Slide#26) of a Continuous Uniform PDF fY(y) = 1/1.1 ranging along the y-axis from y = −.6 to y = +.5; for
example, the term (1/11)·δ(x − (−.5)) is the δ-function located at x = −.5 generated by integrating the
continuous PDF from y = −.6 to y = −.5, which gives an accumulated probability of .1/(.5 − (−.6)) = 1/11 at
the correct x-location.
Suppose that a sequence of 1000 numbers from the discrete set {-.5, -.4, -.3, -.2,-.1, 0, .1, .2,.3,.4.,.5} are
randomly generated on a computer to create the data run notionally illustrated in the 2nd panel . Now we
can create sum variables Zn consisting of the sum of n =2 or n= 4 or n= 8, or n=12 of these samples
from the discrete uniform PMF. According to the CLT, as we increase “n”, the resulting frequency
distribution of the sum variables “Zn“s should approach a Gaussian. The notional illustration shows what
we should expect. The dashed rectangle shows the bounds of the original uniform discrete PMF and the
other curves show the march towards a Gaussian. Note that unlike a Gaussian all these distribution are
zero outside a finite interval determined by the number of variables that are summed. The triangle shape
is the sum of two RVs and obviously the min and max are [-1, 1] for Z2 ; the Z12 RV on the other hand,
covers the range from [-6, 6]; the range increases as we sum more variables, but only as n-> ∞ does the
sum variable fully capture the small Gaussian “tails” for large |x| as required by the CLT.
This result can also be thought of in terms of an n-fold convolution of the IID RVs Xk k=1,2,...,n which
also spreads out with each new convolution in the sequence. The next slide shows the results of a MatLab
simulation of this CLT approach to a Gaussian and a plot of the results confirming the notional sketch
shown on this slide. (The MatLab script is given on the notes page of the next slide.)
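For readers who want to reproduce the experiment before reaching that slide, a minimal sketch along
these lines can be written in a few MatLab statements (the variable names, bin edges, and plotting
choices here are our own illustrative assumptions, not the course script):

    % Sum n discrete-uniform variates and plot the frequency of occurrence.
    N  = 1000;                         % samples per run
    xi = -0.5:0.1:0.5;                 % the 11 discrete outcomes
    edges = -6:0.1:6;                  % bins at the 0.1 spacing of the sums
    for n = [2 4 8 12]
        X  = xi(randi(11, n, N));      % n x N draws from the discrete uniform PMF
        Zn = sum(X, 1);                % N samples of the sum variable Z_n
        f  = histc(Zn, edges) / N;     % frequency of occurrence f_Zn(z)
        plot(edges, f); hold on;       % curves march toward a Gaussian shape
    end
    legend('n = 2', 'n = 4', 'n = 8', 'n = 12'); xlabel('z'); ylabel('f_{Zn}(z)');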




Examples Using Markov & Chebyshev Bounds

    Markov: the Prob that the “value” of RV X exceeds “r” times its mean is at most 1/r:

        P[X ≥ r µX] ≤ 1/r        or        P[X ≥ c] ≤ E[X]/c = µX/c

    Note that for r = 1 the bound is “1” or 100%; thus useful bounds require r > 1.
    [Figure: Markov bound plotted versus c, with markers at µX, 1.5µX, 2µX, 3µX]

    Example: Kindergarten class mean height = 42”. Find a bound on the Prob of a
    student being taller than 63”:

        µX = 42 ;  r·42 = 63  ⇒  r = 1.5 ;  Pr[X ≥ 1.5·42] ≤ 1/1.5 = 66.7%     (Markov)

    Chebyshev: the Prob that the “deviation” of RV X exceeds “r” times its std dev
    σX is at most 1/r²:

        P[|X − µX| ≥ r σX] ≤ 1/r²        or        P[|X − µX| ≥ k] ≤ σX²/k²

    Ross Ex. 7-2a) Factory production
    a) Given mean = 50, find a bound on the Prob that production exceeds 75, i.e., Prob[X ≥ 75]:

        P[X ≥ 75] ≤ E[X]/c = 50/75 = .667                                      (Markov)

        Note an upper bound: at most 66.7%

    b) Given also variance = 25, find a bound on the Prob that production is between 40 and 60:

        P[|X − 50| ≥ 10] ≤ 25/10² = .25                                        (Chebyshev)
        ⇒  1 − P[|X − 50| ≥ 10] ≥ 1 − .25 = .75

        Note a lower bound: at least 75%

    2/24/2012                                                                  121

Here are two examples of the application of the Markov and Chebyshev Bounds. The two forms for each
are stated on the LHS of the slide for reference purposes. The decision to use one or the other of these
bounds depends upon what type of information we have about the distribution. Thus if the RV X takes on
only positive values and we only know its mean, µX, then we must use the Markov bound. On the other
hand, if the RV X takes on both positive and negative values and we know the mean, µX, and variance,
σX², then we must use the Chebyshev bound. If in the latter case the RV X takes on only positive values,
then we could use either the Chebyshev or the Markov bound, but we would choose Chebyshev over Markov
because it uses more of the information and hence will always be a tighter upper bound. Neither of these
bounds is very tight because the information about the distribution is very limited; knowing the actual
distribution itself always yields the best bounds.
1) The mean height in a Kindergarten Class is µX = 42” and we are asked “what is the probability of a
student being taller than 63”?” Short of knowing the actual distribution, the best we can do is use the
Markov inequality to find an upper bound Pr[X ≥ 63] ≤ 42/63 = .67 or 67%. This is also easily computed if
we realize that the tail is the region beyond 63” = 1.5(42”), so r = 1.5 and the answer is 1/1.5 = 2/3 = .67.
2) The factory production has a mean output µX = 50 units and we are asked
(a) “what is the probability that production exceeds 75 units?” This again involves a positive quantity X,
the number of units, and we choose the Markov bound for 1.5(50) = 75 units, so again r = 1.5 and the
resulting probability bound is 67%.
(b) If we are also given the variance of the production, σX² = 25, the additional information allows us to
use the Chebyshev bound to find the probability in the tails on either side of the mean of 50. Thus we
find the probability in the 2-sigma tails (r = 2), to the left of 50 − 10 and to the right of 50 + 10, to be
Pr[Tails] ≤ 1/2² = 25%. Hence the production within the bounds [40, 60] is the complementary probability
                          Pr[40 ≤ X ≤ 60] = 1 − Pr[Tails] ≥ 1 − .25 = .75 or at least 75%
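As a quick numeric sanity check, the two bounds can be evaluated and compared against a Monte-Carlo
run in a few lines. This is a sketch only; the Gaussian production model N(50, 25) used for the
comparison is our own assumption, not part of the example:

    % Markov and Chebyshev bounds for the factory example, plus a Monte-Carlo
    % comparison under an assumed N(50,25) production model.
    mu = 50; sigma2 = 25; c = 75;
    markovBound    = mu / c               % P[X >= 75] <= .667
    chebyshevBound = sigma2 / 10^2        % P[|X-50| >= 10] <= .25
    X = mu + sqrt(sigma2)*randn(1, 1e6);  % hypothetical production samples
    mean(X >= c)                          % far below the Markov bound
    1 - mean(abs(X - mu) >= 10)           % well above the .75 lower bound

As the last two outputs show, the bounds are loose: the assumed Gaussian concentrates far more
probability near the mean than the worst-case distributions the bounds must cover.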




Transformation of Variables & General Bivariate Normal Distribution

                                                          Mean               Covariance
    X a bivariate normal      mX = E[X] = 0               mX = [0; 0]        KXX = [1 0; 0 1]
    (indep comp) N(0,1)       KXX = E[X X^T] = I

    Linear Xform to Y         Y = A X + b                 mY = b = [b1; b2]  KYY = A A^T

    Computation mY:     mY = E[Y] = E[A X + b] = A E[X] + b = b           (since E[X] = 0)

    Computation KYY:    KYY = E[(Y − mY)(Y − mY)^T] = E[(Y − b)(Y − b)^T]
                            = E[(A X)(A X)^T] = A E[X X^T] A^T = A A^T    (since E[X X^T] = I)

    Determinant KYY:    det KYY = det A · det A^T = (det A)²  ⇒  det A = √(det KYY)

    A is the Jacobian:  det[∂yi/∂xj] = det{Aij}  ⇒  J(y/x) = det A = √(det KYY)

    Note also:          (A^-1)^T (A^-1) = (A A^T)^-1 = KYY^-1

    New Prob Density:
        fY(y1, y2) = fX(x1, x2) / |J(y/x)|
                   = [1 / (2π √(det KYY))] exp{ −½ [A^-1(y − b)]^T [A^-1(y − b)] }

    General Bivariate Normal Distribution:
        fY(y) = [1 / (2π √(det KYY))] exp{ −½ (y − mY)^T KYY^-1 (y − mY) }
        (components no longer independent, nor zero mean & unit variance)

    2/24/2012                                                                  132

We introduced the Bivariate Gaussian distribution for the case of two independent N(0,1) Gaussians
(with the same variance =1) and arrived at a zero mean vector mX and a diagonal covariance matrix KXX
=diag(1,1) corresponding to a pair of uncorrelated Gaussian RVs and displayed in the first line of the
table. The second line of the table shows the results of making a linear transformation of variables
Y=AX+b from the X1 X2 coordinates to the new Y1 Y2 coordinates; note that the vector b =[b1,b2]T
represents the displaced origin of the Y1 Y2 coordinates relative to X =[0,0]T. We see that the new mean
vector is no longer zero but rather mY = b and the new covariance KYY =AAT no longer has unit
variances along the diagonal, but, in general, now has non-zero off-diagonal elements as well. The fact
that this linear transformation yields non-zero off-diagonal elements in the covariance matrix means that
the new RVs Y1 Y2 are no longer uncorrelated.
The computations supporting these table entries are straightforward. The new mean is obtained by taking
the expectation E[Y]= E[AX+ b] and using the fact that the original mean E[X] is zero to give mY =
E[Y]= b . Substituting this value b for mY in the covariance expression KYY = E[(Y-b)(Y-b)T] yields
KYY = E[(AX)(AX)T] = A E[XXT] AT =A AT since E[XXT] =KXX = I (i.e., the identity matrix diag(1,1)).
In order to find the new Bivariate density fY1,Y2(y1,y2) we need to divide fX1,X2(x1,x2) by the Jacobian
determinant J(y/x) and replace X by A^-1(Y − b). This Jacobian is found by differentiating the
transformation Y = AX + b to find J = det[∂Y/∂X] = det(A); note that this is easily verified by writing out
the two equations explicitly and differentiating y1 and y2 with respect to x1 and x2 to obtain the partials
∂yi/∂xj = aij and then taking the determinant to find the Jacobian. Taking det(KYY) = det(A A^T) and
using the fact that det(A) = det(A^T), we find that det A = (det KYY)^½. Finally, substituting
this and X = A^-1(Y − b) yields the general Bivariate Normal Distribution fY(y) given in the grey boxed
equation at the bottom of the slide. Be careful to note that the inverse KYY^-1 occurs in the exponential
quadratic form while (det KYY)^½ occurs in the denominator; also observe the
“shorthand” vector notation for the bivariate density fY(y) in place of the more explicit fY1,Y2(y1,y2).
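The mY = b and KYY = A A^T results are easy to confirm numerically; a short sketch, where the
particular A and b below are arbitrary choices for illustration:

    % Build a correlated bivariate normal from independent N(0,1) components.
    A = [2 0; 1 1];  b = [3; -1];  N = 1e5;   % illustrative transformation
    X = randn(2, N);                          % independent N(0,1) pair: KXX = I
    Y = A*X + repmat(b, 1, N);                % linear transformation Y = A*X + b
    mean(Y, 2)                                % approaches b = [3; -1]
    cov(Y')                                   % approaches A*A' = [4 2; 2 2]

The non-zero off-diagonal entry of cov(Y') shows directly how the linear transformation has made the
components correlated.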




Bivariate Gaussian Distribution & Level Surfaces

    fY1Y2(y1, y2) = [1 / (2π √(det KYY))] exp{ −½ y^T KYY^-1 y }

    K = [σ1²  ρσ1σ2 ; ρσ1σ2  σ2²] ;   det(K) = σ1² σ2² (1 − ρ²) ≥ 0

    −1 < ρ < +1 :  Ellipse in y1–y2 space; y1 & y2 are dependent:
                   fY1Y2(y1, y2) ≠ fY1(y1) · fY2(y2)

    ρ = 0 :        Diagonal terms only; either ellipse or circle, with principal
                   axes along y1 & y2 ;  fY1Y2(y1, y2) = fY1(y1) · fY2(y2)  (independent)

    ρ = ±1 :       Degenerate case: ellipse collapses to a straight line along one
                   of the “principal axes”: y2 = ±ρ y1 = ±y1 ; y1 & y2 are
                   “extremely dependent” (correlated or anti-correlated)

    [Figure tableau, 3 columns (positive, negative, no correlation) by 2 rows:
     Top row (σ1 > σ2): ρ > 0 ellipses with principal axes at +45° to y1; ρ < 0
     with principal axes at −45°; ρ = 0 ellipse along the y1, y2 principal axes.
     Bottom row: degenerate ellipses ρ = +1 (+45° line) and ρ = −1 (−45° line),
     where the ellipse areas collapse to a line; circle for ρ = 0 with σ1 = σ2
     (arbitrary orientation). At right, the Gaussian probability surface
     fY1Y2(y1, y2) with its 2d ellipses.]

    2/24/2012                                                                  135

The bivariate density fY(y) = fY1,Y2(y1,y2) is completely determined by its mean vector mY and its
covariance matrix KYY as given by the equations on the upper right. Consider the Bivariate Gaussian
density, which is plotted as a 2d surface relative to its mean vector components mY1 and mY2 taken as the
origin. The level surfaces represented by cuts parallel to the y1-y2 plane are the ellipses given by the
quadratic form equation of the last slide. The structure of these ellipses is shown in the tableau
consisting of 3 columns for positive, negative, and zero correlation coefficient ρ and 2 rows
corresponding to general (top row) and degenerate cases.
The general cases in the top row have unequal sigmas σ1 > σ2, and as we go across the row we have an
ellipse with positive correlation (ρ > 0), one with negative correlation (ρ < 0), and an ellipse along its
principal axes with no correlation (ρ = 0). The (red) arrows show the directions of the principal axes of
the ellipse in each case; the zero correlation case on the extreme right has the principal axes coinciding
with y1 and y2, while the negative correlation case has its principal axes rotated at −45° to the y1-axis and
the positive correlation case has its principal axes rotated at +45° to the y1-axis.
The bottom row illustrates the two degenerate cases ρ = +1 and ρ = −1, in which the ellipse “collapses” to a
straight line corresponding to complete correlation or anti-correlation (opposite variations of Y1 and Y2)
respectively, and the degenerate uncorrelated case ρ = 0, in which the principal-axis ellipse above it
degenerates into a circle because the two sigmas are equal (σ1 = σ2).
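The tableau is straightforward to reproduce numerically; a sketch, where the sigma and ρ values are
our own illustrative choices:

    % Level curves of a zero-mean bivariate Gaussian for three correlations.
    s1 = 2; s2 = 1;                             % unequal sigmas, as in the top row
    [y1, y2] = meshgrid(linspace(-6, 6, 200));
    for rho = [0.8 -0.8 0]
        K  = [s1^2 rho*s1*s2; rho*s1*s2 s2^2];
        Ki = inv(K);
        q  = Ki(1,1)*y1.^2 + 2*Ki(1,2)*y1.*y2 + Ki(2,2)*y2.^2;  % y' K^{-1} y
        f  = exp(-q/2) / (2*pi*sqrt(det(K)));
        figure; contour(y1, y2, f); axis equal;
        title(sprintf('rho = %.1f', rho));
    end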




Ellipses of Concentration

    1D Gaussian Distribution: described by two scalars, the mean µX & Var(X);
    intuitive. The normalized & centered RV Y = (X − µX)/σX gives the standardized
    distribution, and tabulating the single CDF integral

        Φ(y) = ∫_{t = −∞}^{y} (1/√(2π)) e^{−t²/2} dt

    characterizes all 1D Gaussians.
    [Figure: Prob density fX(x) with ±σX about µX, and the standardized density fY(y) about 0]

    2D Gaussian Distributions: described by a vector & matrix, the mean vector mX
    & covariance KXX. The vector mX and matrix KXX are not very intuitive!

        fX(x1, x2) = [1 / (2π √(det KXX))] exp{ −½ x^T KXX^-1 x }

    [Figure: Gaussian probability surface over (x1, x2) with its 2d ellipse “level curves”]

    “Level curves” of a zero-mean 2D Gaussian surface with covariance KXX:

        x^T KXX^-1 x = [1/(1 − ρ²)] [ x1²/σX1² − 2ρ x1 x2/(σX1 σX2) + x2²/σX2² ] = c² = const.

    2/24/2012                                                                  138

The 1-dimensional Gaussian distribution is completely described by two scalars: the mean µX and the
variance σX². The tabulation of a single integral for the cumulative distribution function FY(y) shown in
the left box is sufficient to characterize all Gaussians X: N(µX, σX²) if we first transform to a
standardized Gaussian RV Y via Y = (X − µX)/σX. The Gaussian integral representing the probability
distribution for the standardized RV, Pr[Y ≤ y] = FY(y), is used so often that it is denoted as the
“Normal Integral” Φ(y).
We would like to extend this concept of a single tabulated integral to describe all 2-dimensional Gaussian
distributions; however, as we have seen, the Bivariate Gaussian distribution requires more than just the
means and variances of two Gaussians, as we must also characterize their “co-variation” by specifying
their correlation coefficient ρ. Thus we must specify the two elements of the mean vector µX and all three
elements of the (symmetric) covariance matrix KXX in order to completely characterize a Bivariate
Gaussian fX1X2(x1,x2), given in the right box of the slide.
We have seen that the level “surfaces” (actually curves) of the Gaussian PDF are ellipses centered about
the mean vector coordinates µX1 and µX2 and described by the quadratic form x^T KXX^-1 x in the exponent of
the PDF. The explicit equation for the level curves with zero mean is obtained by setting this term equal
to an arbitrary positive constant c², as given by the equation in the slide. These ellipses are called ellipses
of concentration because the area contained within them measures the concentration of probability for the
specific “cut” through the PDF surface. In the next few slides we will show how this leads to a single
tabulated function for the Bivariate Gaussian that is analogous to Φ(x) for the Normal Distribution.
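A level curve can also be traced directly by mapping the unit circle through a matrix square root of the
covariance, since x = c K^½ u with |u| = 1 satisfies x^T K^-1 x = c². A sketch, with an illustrative
covariance and scale of our own choosing:

    % Trace the ellipse of concentration x' KXX^{-1} x = c^2 for a chosen c.
    K = [4 1.5; 1.5 1];  c = 1.52;            % illustrative covariance and scale
    t = linspace(0, 2*pi, 200);
    E = c * sqrtm(K) * [cos(t); sin(t)];      % maps the unit circle onto the ellipse
    plot(E(1,:), E(2,:)); axis equal; xlabel('x_1'); ylabel('x_2');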




Gaussian & Bivariate (2d) Gaussian Distributions Compared

    Probability for x to be within an ellipse “scaled by c”:

        Prob( x^T Kxx^-1 x < c² ) = FC(c) = 1 − e^{−c²/2} = α

    Note: the inverse covariance Kxx^-1 determines the ellipse.
    [Figure: Gaussian probability surface with the α = 68.3% prob region “slice”
     and the corresponding 2d ellipse containing 68.3% in the x1–x2 plane]

    Scale factor c in terms of % concentration:    c = √( −2 ln(1 − α) )

    Equivalent 1d sigma table:

        1d sigma    α (%)    c
        1-σ         68.3     1.52        (so 1-σ ≈ c = 1.52)
        2-σ         95.4     2.48
        3-σ         99.7     3.41

    [Figure: Prob density fX(x) with the 68.3% area between µX ± σX]

    2/24/2012                                                                  141

On the last slide we found that the 2d probabilities are described in terms of ellipses of concentration
specified by the axis scale parameter c, which is related to the percentage of events contained within the
ellipse by the expression shown in the slide. This CDF is in fact a Rayleigh distribution with the “radial
distance r” replaced by the ellipse scale parameter “c”.
Setting this probability within the ellipse (parameterized by the value “c”) equal to α allows us to solve
for the value of c in the boxed equation. Using this equation, we compute the table which displays the
values of the ellipse scaling parameter “c” corresponding to the standard values of 1-σ (68.3%), 2-σ
(95.4%), and 3-σ (99.7%) associated with a 1-dimensional Gaussian distribution.
These ellipses are used to specify equivalent “standard deviations” for the Bivariate Gaussian, and
extending this tabulation to all probabilities allows us to define a standard Bivariate Normal function
Ψ(c) similar to Φ(x) for the Normal Gaussian.
The two figures illustrate this equivalence by showing the c = 1.52 cut through the Bivariate Gaussian
surface yielding an equivalent “1-σ” ellipse containing α = 68.3% of the probability, and then notionally
comparing the ellipse with the “1-σ” area under the standard Gaussian curve.
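The table values follow directly from the boxed equation c = √(−2 ln(1 − α)); for instance:

    % Ellipse scale factor c for the standard 1-, 2-, and 3-sigma concentrations.
    alpha = [0.683 0.954 0.997];
    c = sqrt(-2*log(1 - alpha))    % returns approximately [1.52 2.48 3.41]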




Closure Under Bayesian Updates - Summary

    Summary: Started with a pair of N(0,1) RVs X & Y with correlation ρ:

        X = [X; Y] ;  µX ≡ E[X; Y] = [0; 0] ;  KXY = [1 ρ; ρ 1]

    1) The joint distribution is a correlated Gaussian in X and Y:

        fXY(x, y) = [1 / (2π √(1 − ρ²))] exp{ −(x² − 2ρxy + y²) / (2(1 − ρ²)) }

    2) The marginal fY(y) is found to be N(0,1):

        fY(y) = e^{−y²/2} / √(2π)

    3) The Bayes’ update fX|Y(x|y) is Gaussian, N(ρy, 1 − ρ²):

        fX|Y(x|y) = [1 / √(2π(1 − ρ²))] exp{ −(x − ρy)² / (2(1 − ρ²)) }

    4) Pick off the “conditional” mean & variance from fX|Y(x|y):

        µX|Y ≡ E[X|Y] = ρy ;  Var(X|Y) = 1 − ρ²

    The conditional mean represents an “estimate of X given measurement Y”, with
    Var(X|Y) obtained from the Bayes’-updated Gaussian.

    Generalize: start with a general Gaussian vector with non-zero means & variances:

        X = [X; Y] ;  µ = [µX; µY] ;  KXY = [σX²  ρσXσY ; ρσXσY  σY²]

    The conditional mean and variance represent the Bayes’ update equation:

        µX|Y ≡ E[X|Y] = µX + ρ (σX/σY) (y − µY)
        Var(X|Y) = σX² (1 − ρ²) ;  σX|Y = σX √(1 − ρ²)

    Note 1: In the “Gaussian Arena” we do not need to work with distributions
    directly, since both 1) linear Xfms & 2) the Bayes’ update equation yield
    Gaussian vector results (surrogates for the joint and conditional
    distributions respectively).

    Note 2: Y is irrelevant for ρ = 0. X & Y indep => conditionals do not depend
    upon the value of y: µX|Y = µX & Var(X|Y) = σX².

    2/24/2012                                                                  151

Closure Under Bayesian Updates started with a pair of correlated N(0,1) Gaussian RVs with correlation
coefficient ρ and resulted in a Gaussian conditional distribution fX|Y(x|y) with conditional mean µX|Y
= E[X|Y] = ρy and conditional variance Var(X|Y) = σX|Y² = 1 − ρ².
If instead we start with a pair of correlated Gaussian RVs having different means and variances, given by
the mean vector µ and covariance matrix KXY shown in the middle panel of the slide, the same procedure
yields the general result for a Gaussian with a
                          conditional mean E[X|Y] = µX|Y = µX + ρσX(y − µY)/σY , and
                          conditional variance Var(X|Y) = σX|Y² = σX²(1 − ρ²)
given in the boxed equation.
The lower panel interprets these results in terms of a two-dimensional “Gaussian Arena” in which the
input and output are related by the underlying joint Gaussian distribution, which remains Gaussian for all
possible linear coordinate transformations and even maintains its Gaussian character when one of the
variables is conditioned on the other. Thus the Gaussian vector remains Gaussian under both linear
transformations and Bayes’ updates. Also note that if the correlation is zero (ρ = 0) then the input and
output variables are independent, as is evident in the boxed equations, which reduce to statements that the
conditional mean is equal to the mean, µX|Y = µX, and the conditional variance is equal to the variance,
σX|Y² = σX².
We note in passing that because the quadratic form in the joint Gaussian is symmetric in the X and Y
variables, we could just as well have computed the output Y conditioned on the input X to find analogous
results with X and Y interchanged in the forward Bayesian relation.
A visual interpretation of this result will be given in the next slide, and further insight into the role of the
communication channel and its inverse will be given in the slides after that.
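The general conditional-moment formulas are easy to verify by simulation: generate the pair, keep only
samples in a thin slice around a chosen y0, and compute the sample mean and variance of X in the slice.
A sketch, where all parameter values and the slice width are arbitrary illustrative choices:

    % Verify E[X|Y=y0] = muX + rho*(sX/sY)*(y0-muY) and Var(X|Y) = sX^2*(1-rho^2).
    rho = 0.7; muX = 1; muY = -2; sX = 2; sY = 3; N = 1e6;
    Y = muY + sY*randn(1, N);
    X = muX + rho*sX*(Y - muY)/sY + sX*sqrt(1 - rho^2)*randn(1, N);
    y0 = 0;  sel = abs(Y - y0) < 0.05;    % keep samples in a thin y0-slice
    mean(X(sel))                          % ~ muX + rho*(sX/sY)*(y0 - muY) = 1.93
    var(X(sel))                           % ~ sX^2*(1 - rho^2) = 2.04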


General Case: Visualization of Conditional Mean

    Bayesian Update conditions X on Y:
        given a priori  µX ; σX²   yields a posteriori
        µX|Y = µX + (ρσX/σY)(y − µY) ;  σX|Y² = (1 − ρ²) σX²

    The conditional distribution is Gaussian, with conditional mean µX|Y and
    conditional variance σX|Y² (the “y0-slice” through the density, fX|Y(x)).

    Choose an arbitrary y0 ; the line y = y0 is tangent to an ellipse whose
    maximum is ymax = y0 (= +c in standardized coordinates).

    Recall the covariance-ellipse construction extremum (standardized coordinates):

        x̃² − 2ρ x̃ ỹ + ỹ² = (1 − ρ²) c² ;   x̃ = (x − µX)/σX ,  ỹ = (y − µY)/σY

    We found the corresponding x-value to be

        x̃0 ≡ x̃(ỹ0) = ρ ỹ0  ⇒  (x0 − µX)/σX = ρ (y0 − µY)/σY

        x0 = µX + ρ σX (y0 − µY)/σY = µX|Y=y0
        the “mean conditioned on the y0-slice”, with the “origin at” (µX, µY)

    Special cases, with ρ = E[XY]/(σX σY) and µX|Y = µX + ρ (σX/σY)(y − µY):

        If ρ = 0  :  µX|Y = µX , indep. (Y is irrelevant)
        If ρ = +1 :  µX|Y = µX + (σX/σY)(y − µY) , direct correlation
        If ρ = −1 :  µX|Y = µX − (σX/σY)(y − µY) , inverse correlation

    [Figure: elliptical contours centered at (µX, µY); the line y = y0 is tangent
     to one ellipse, with the slice Gaussian fX|Y(x) of width σX|Y drawn above it
     and the perpendicular dropped to x0 = µX|Y=y0 on the x-axis. For the
     degenerate ellipse ρ = +1 the distribution on the y = y0 line is a single
     unique point µX|Y=y0 with zero variance!]

    2/24/2012                                                                  152

The results for the conditional mean and variance can be understood graphically as follows. Starting with
the Bivariate Gaussian density, we draw the elliptical contours corresponding to the horizontal cuts
through the density surface, centered at the mean coordinates µX and µY indicated by the black dot at the
center. If we choose a fixed value y = y0, the line parallel to the x-axis is tangent to one of the ellipses,
and hence y0 represents the maximum y-value for that ellipse, as shown by the red dot. This line also
results from a vertical plane y = y0 cutting through the distribution, and the Gaussian cut through the
distribution is shown above the contours.
The x-coordinate corresponding to this maximum is obtained by dropping a perpendicular onto the x-axis
at a value x0 = µX|Y=y0 as shown in the figure. Recalling the calculation used for the covariance ellipse
construction, the x0-value corresponding to this maximum at y = y0 is given in standardized coordinates
as x̃0 = ρ ỹ0, which is converted to the coordinates of the figure by letting x̃0 -> (x0 − µX)/σX and
ỹ0 -> (y0 − µY)/σY to yield (x0 − µX)/σX = ρ(y0 − µY)/σY, or x0 = µX + ρσX(y0 − µY)/σY, which is exactly
the statement that x0 is the conditional mean µX|Y=y0.
The three special cases ρ = 0, +1, −1 shown in the bottom panel are:
(i) ρ = 0, no correlation, corresponds to a coordinate system along the principal axes of the ellipse, for
which a constant y = y0 cut will always yield a conditional mean µX|Y=y0 = µX
(ii) ρ = +1, complete positive correlation, corresponds to the case where the ellipse collapses to a straight
line; the conditional distribution is a single point with zero variance on the line with slope (σY/σX), as
shown, and yields a conditional mean µX|Y=y0 = µX + σX(y0 − µY)/σY
(iii) ρ = −1, complete negative correlation, corresponds to the case where the ellipse collapses to a straight
line; the conditional distribution is a single point with zero variance on the line with slope (−σY/σX) (not
shown), and yields a conditional mean µX|Y=y0 = µX − σX(y0 − µY)/σY
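The tangent construction itself can be drawn in a few lines, using the fact that the ellipse
x^T K^-1 x = c² (centered at the means) reaches its maximum y-value µY + c σY, so the ellipse tangent
to y = y0 has c = (y0 − µY)/σY. A sketch, with all parameter values being illustrative assumptions:

    % Draw the ellipse tangent to y = y0 and mark the conditional mean x0.
    muX = 0; muY = 0; sX = 2; sY = 1; rho = 0.6; y0 = 1.5;
    K  = [sX^2 rho*sX*sY; rho*sX*sY sY^2];
    c  = (y0 - muY)/sY;                        % ellipse whose maximum y equals y0
    x0 = muX + rho*(sX/sY)*(y0 - muY);         % conditional mean mu_{X|Y=y0}
    t  = linspace(0, 2*pi, 200);
    E  = c*sqrtm(K)*[cos(t); sin(t)] + repmat([muX; muY], 1, 200);
    plot(E(1,:), E(2,:), 'b', x0, y0, 'r.', 'MarkerSize', 18); hold on;
    plot([min(E(1,:)) max(E(1,:))], [y0 y0], 'k--'); axis equal;   % the y0-slice

The red dot lands exactly on the tangent point of the dashed y0-line, visualizing x0 = µX|Y=y0.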



Rationale for “Inverse Channel” & Generating Correlated RVs

    Given Y: an N(0,1) RV, generate X: an N(0,1) RV correlated with Y with coefficient ρ.

    Inverse Channel Method:  X = ρY + V

        Y = N(0,1) input  →  [× ρ]  →  (+)  →  X = N(0,1) output
                                       ↑
                              V = N(0, 1 − ρ²) noise

    (i)   Generate samples of RV “Y” using a standard method (e.g., sum 12 uniform
          variates on [−0.5, 0.5]) to yield N(0,1).
    (ii)  Generate zero-mean Gaussian noise “V” with variance 1 − ρ² to yield N(0, 1 − ρ²).
    (iii) Multiply each RV sample “Y” by the desired correlation coefficient ρ.
    (iv)  Add the noise sample “V” to obtain the output “X”, which is N(0,1) and has
          the desired correlation coefficient correl(X,Y) = ρ.

    Rationale for “X = ρY + V”:
    (i)  If noise is not added, X = ρY:  Var(X) = Var(ρY) = ρ² Var(Y) = ρ² ≠ 1.
    (ii) If uncorrelated noise is added, X = ρY + “V” with the appropriate
         Var(V) = 1 − ρ² to offset the correlated contribution to Var(X), then
         Var(X) = Var(ρY + V) = ρ² Var(Y) + Var(V) + 2 Cov(Y,V)
                = ρ² · 1 + (1 − ρ²) + 0 = 1.

    Special cases of “X = ρY + V” ; −1 ≤ ρ ≤ +1 :

        ρ = 0 :      No correlation between X & Y.
                     0·Y + N(0, 1 − 0²) = N(0,1) → X
                     X is simply the uncorrelated noise sample N(0,1).
        ρ = ±1 :     Full correlation/anti-correlation (degenerate ellipse or st. line).
                     ±1·Y + N(0, 1 − (±1)²) = ±Y → X
                     X is simply the ±Y value.
        −1 < ρ < 1 : General correlation.
                     ρ·Y + N(0, 1 − ρ²) → X
                     X results from multiplying Y by the correlation ρ and adding
                     noise with variance 1 − ρ².

    2/24/2012                                                                  155

The last couple of slides considered the inverse channel and its relation to a Bayesian update which starts
with an a priori value of the mean µX and variance σX² and then updates their values as a result of an
actual “measurement Y”. The conditional mean and variance formulas that we found comported both with
the Bayesian Update equation for conditional probability densities and with those obtained by
constructing an inverse channel which creates an input X from an output Y. In this slide and the next we
consider this important “coincidence” in some detail.
The box on the left uses the inverse channel model as a computer program flow diagram to actually
generate a RV X ~ N(0,1) from a linear combination of Y ~ N(0,1) and noise V ~ N(0, 1 − ρ²). Note that the
input and output RVs are both N(0,1) Gaussians with unit variance, yet the noise must have a variance
that is less than unity for this to work.
The rationale is simple enough, for consider what might be your first impulse for generating a pair of
correlated RVs: setting X = ρY (upper right box). Taking the expectations E[X] and E[X²] we find µX
= ρµY = ρ·0 = 0 and σX² = ρ²σY² = ρ² ≠ 1, which does not agree with the assumption that both X
and Y are N(0,1). Agreement is possible only if we add zero-mean noise with variance (1 − ρ²), because
when added to ρ² it yields the desired unit variance for the RV X.
The special cases of no correlation (ρ = 0) and full positive and negative correlation (ρ = ±1) are
explicitly shown to be in agreement with this model. For no correlation the model gives X as just N(0,1)
random noise which takes on values completely independent of the y-values. On the other hand, for
full positive (or negative) correlation the model gives X as N(0,1) which takes on values that are exactly
the same as those for Y (or −Y). In the general case −1 < ρ < +1 the model gives X as an N(0,1) RV which
tracks Y more closely for correlations near +1 and tracks the noise more closely for correlations nearer to
zero, thus giving the expected intermediate behavior.
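The flow diagram translates almost line-for-line into code; a sketch of steps (i)-(iv), where ρ = 0.8
and the sample count are arbitrary choices:

    % Inverse-channel generation of X ~ N(0,1) with correl(X,Y) = rho.
    rho = 0.8;  N = 1e5;
    Y = sum(rand(12, N) - 0.5, 1);       % step (i): ~N(0,1) via 12 summed uniforms
    V = sqrt(1 - rho^2) * randn(1, N);   % step (ii): noise with variance 1-rho^2
    X = rho*Y + V;                       % steps (iii)-(iv): X = rho*Y + V
    corrcoef(X, Y)                       % off-diagonal entries ~ rho
    var(X)                               % ~ 1, as required for N(0,1)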




Multilinear Gaussian Distribution

    n-dimensional Gaussian vector X = [X1, X2, ..., Xn]^T :

        fX(x) = [1 / ((2π)^{n/2} √(det KXX))] exp{ −½ (x − µX)^T KXX^-1 (x − µX) }

    Matrix components:

        (KXX)rc = E[(Xr − µXr)(Xc − µXc)] ;  r, c = 1, 2, ..., n

        KXX = [ K11  K12  K13  ...  K1n
                K21  K22  K23  ...  K2n
                K31  K32  K33  ...  K3n
                 .    .    .   Krc   .
                Kn1  Kn2  Kn3  ...  Knn ]

    Moment Generating Fcn:

        φX(t) = E[e^{X^T t}] = e^{ ½ t^T KXX t + µX^T t } ;  t = [t1, t2, ..., tn]^T

    Still Gaussian after a linear transformation (see next slide):

        Y = AX + b ;  µY = A µX + b ;  KYY = A KXX A^T

        fY(y) = [1 / ((2π)^{n/2} √(det KYY))] exp{ −½ (y − µY)^T KYY^-1 (y − µY) }   Gaussian

    The 1st and 2nd moments, vector µX & covariance KXX, uniquely define the
    multivariate Gaussian.

    Details:
        µY = E[Y] = E[AX + b] = A µX + b
        Y − µY = (AX + b) − (A µX + b) = A (X − µX)
        KYY = E[(Y − µY)(Y − µY)^T] = E[A (X − µX) (A (X − µX))^T]
            = A E[(X − µX)(X − µX)^T] A^T = A KXX A^T       (since E[...] = KXX)

    2/24/2012                                                                  157

The extension to Multilinear Gaussian distributions, or Gaussian Vectors, is straightforward; taking the
product of “n” independent N(µXi, σXi²) Gaussians, symbolized by the vector X = [X1, X2, ..., Xn]^T, yields an
n-dimensional Gaussian characterized by an n-dimensional mean vector µX and an n x n covariance matrix
KXX whose diagonals equal the variances of the individual RVs and whose off-diagonal elements are all
zero.
Even if we start with independent RVs, a linear transformation of the form Y = AX + b produces
correlations, and the off-diagonal terms of the new covariance matrix are no longer zero. The
transformation leaves the Gaussian structure the same, but the mean and covariance become µY = AµX +
b and KYY = A KXX A^T respectively.
The Gaussian always has the form fX(x) = (2π)^{-n/2} (det KXX)^{-1/2} exp(−½ q) with the scalar quadratic q =
(x − µX)^T KXX^-1 (x − µX). The row-column components of the covariance matrix are determined by the
expected values of the “row-col” pair products of centered deviations. The moment generating function
generalizes to φX(t) = E[exp(X^T t)] = exp(½ t^T KXX t + µX^T t) with t = [t1, t2, ..., tn]^T.
Note that we have reverted to the old notation in which the components of the Gaussian vectors are
labeled by indexed quantities Xi and the new components under a coordinate transformation are Yi. This
is temporary, however, because we shall want to consider communication channels with a number of
inputs and a number of outputs and partition the n-dimensional Gaussian vector into these two distinct
types of components in order to define the conditional distribution µX|Y in a useful manner.
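The transformation rules are again simple to confirm numerically; a sketch, where the dimension,
matrices, and offset below are arbitrary illustrative choices:

    % An n-dimensional Gaussian stays Gaussian under Y = A*X + b, with
    % muY = A*muX + b and KYY = A*KXX*A'.
    muX = [1; 0; -1];  KXX = diag([1 4 9]);    % independent components (n = 3)
    A = [1 2 0; 0 1 1; 1 0 1];  b = [5; 5; 5];
    N = 1e5;
    X = repmat(muX, 1, N) + sqrtm(KXX)*randn(3, N);  % samples of the vector X
    Y = A*X + repmat(b, 1, N);
    mean(Y, 2)                                 % ~ A*muX + b
    cov(Y')                                    % ~ A*KXX*A', no longer diagonal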




Partitioned Multivariate Gaussian & Xfm to Block Diagonal

Partition: [X(1) | X(2)]^T   {Comm channel with multiple inputs: "X" = X(1) & outputs: "Y" = X(2)}

2 x 1 partitioned vectors and 2 x 2 partitioned matrix:

$$x = \begin{bmatrix} x_{(1)} \\ x_{(2)} \end{bmatrix},\quad
x_{(1)} = \begin{bmatrix} x_1 \\ \vdots \\ x_k \end{bmatrix},\quad
x_{(2)} = \begin{bmatrix} x_{k+1} \\ \vdots \\ x_n \end{bmatrix};\qquad
\mu = \begin{bmatrix} \mu_{(1)} \\ \mu_{(2)} \end{bmatrix} \text{ partitioned the same way}$$

$$K = \left[\begin{array}{ccc|ccc}
K_{11} & \cdots & K_{1k} & K_{1,k+1} & \cdots & K_{1n} \\
\vdots & k\times k & \vdots & \vdots & k\times(n-k) & \vdots \\
K_{k1} & \cdots & K_{kk} & K_{k,k+1} & \cdots & K_{kn} \\ \hline
K_{k+1,1} & \cdots & K_{k+1,k} & K_{k+1,k+1} & \cdots & K_{k+1,n} \\
\vdots & (n-k)\times k & \vdots & \vdots & (n-k)\times(n-k) & \vdots \\
K_{n1} & \cdots & K_{nk} & K_{n,k+1} & \cdots & K_{nn}
\end{array}\right]
= \begin{bmatrix} K_{(1)(1)} & K_{(1)(2)} \\ K_{(2)(1)} & K_{(2)(2)} \end{bmatrix}$$

Perform linear Xfm in "partitioned form":

$$\begin{bmatrix} y_{(1)} \\ y_{(2)} \end{bmatrix}
= \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
\begin{bmatrix} x_{(1)} \\ x_{(2)} \end{bmatrix},
\qquad \text{where } A = \begin{bmatrix} I_{k,k} & B_{k,(n-k)} \\ 0_{(n-k),k} & I_{(n-k),(n-k)} \end{bmatrix}
= \begin{bmatrix} I_k & B \\ 0 & I_{n-k} \end{bmatrix}$$

Now drop the parentheses notation for partitioned components!!

Find the "B" matrix so that the new KYY is block diagonal:

$$A K_{XX} A^T
= \begin{bmatrix} I_k & B \\ 0 & I_{n-k} \end{bmatrix}\cdot
\begin{bmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{bmatrix}\cdot
\begin{bmatrix} I_k & 0 \\ B^T & I_{n-k} \end{bmatrix}
= \begin{bmatrix} K_{11}+BK_{21} & K_{12}+BK_{22} \\ K_{21} & K_{22} \end{bmatrix}\cdot
\begin{bmatrix} I_k & 0 \\ B^T & I_{n-k} \end{bmatrix}$$

$$= \begin{bmatrix} K_{11}+BK_{21}+K_{12}B^T+BK_{22}B^T & K_{12}+BK_{22} \\ K_{21}+K_{22}B^T & K_{22} \end{bmatrix}$$

Setting the two off-diagonal blocks to zero:

$$K_{21} + K_{22}B^T = 0 \quad (1) \qquad\qquad K_{12} + BK_{22} = 0 \quad (2)$$

                                                                                                                                     159

Consider a multi-dimensional communication channel partitioned into two sets as follows:
"X": k inputs X(1) = [X1, X2, ..., Xk]^T and "Y": (n-k) outputs X(2) = [Xk+1, Xk+2, ..., Xn]^T. The
mean vector and covariance matrix are partitioned in the same manner to yield the 2 x 1 partitioned
vector X(I) and the 2 x 2 partitioned covariance matrix K(I)(J). Note that the partition dimensions of K(I)(J) are
specifically as follows:
Row#1 [K11 : K12] = [ k x k : k x (n-k) ]
Row#2 [K21 : K22] = [ (n-k) x k : (n-k) x (n-k) ].
Now let's perform a linear transformation to a new coordinate system according to the equation Y = AX + b,
where it is understood that Y(I), X(I), and b(I) are all partitioned in the same manner as 2 x 1
column vectors and the matrix A(I)(J) is partitioned into a 2 x 2 matrix which corresponds to the partitioning
of the original covariance matrix K(I)(J), as shown in detail on the slide. The transformed covariance
matrix KYY is defined by the product of n x n matrices A KXX A^T; in partitioned form we
instead have a product of three 2 x 2 block matrices. The sub-matrices in the partition of A(I)(J) are chosen as
follows: A(1)(J) = [ Ik,k : Bk,(n-k) ] and A(2)(J) = [ 0(n-k),k : I(n-k),(n-k) ] (labeled by their dimensions). The problem
is to find the k x (n-k) matrix B such that the new covariance matrix KYY is block diagonal; taking the product
of the three partitioned matrices A KXX A^T results in the 2 x 2 block matrix shown at the bottom of the slide.
Forcing the two "off-diagonal" partitions to be zero yields two conditions on the matrix B and
its transpose B^T as follows:
                                        (1) K21 + K22 B^T = 0 ; (2) K12 + B K22 = 0
Note that the partitioned components are those of the original matrix KXX, so for example K21 is the 2,1
partition component, i.e., (KXX)21. On the next slide we formally solve for B and B^T and write down the
explicit form of the block diagonal matrix KYY with just 2 components, namely (KYY)11 and (KYY)22.
This will allow us to factor the multivariate Gaussian and prove a very elegant generalization of Bayes'
update for the conditional mean and conditional covariance known as the Gauss-Markov Theorem.



                                                                                                                                                                                           159
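Condition (2) immediately gives B = -K12 K22^(-1) (and condition (1) gives the consistent transpose, since K21 = K12^T and K22 is symmetric). The following minimal MATLAB sketch, which is illustrative and not from the course materials, builds a random partitioned covariance, solves for B, and confirms that A KXX A^T comes out block diagonal.

% Minimal sketch (illustrative): solve K12 + B*K22 = 0 for B and verify
% that the transformed covariance A*Kxx*A' is block diagonal.
k = 2;  m = 2;  n = k + m;             % k inputs, m = n-k outputs
R = randn(n);  Kxx = R*R' + n*eye(n);  % a random symmetric positive-definite Kxx
K12 = Kxx(1:k, k+1:n);
K21 = Kxx(k+1:n, 1:k);
K22 = Kxx(k+1:n, k+1:n);

B = -K12/K22;                          % B = -K12*inv(K22), from condition (2)
A = [eye(k), B; zeros(m,k), eye(m)];   % partitioned transformation matrix

Kyy = A*Kxx*A';
disp(norm(Kyy(1:k, k+1:n)))            % ~0: off-diagonal block vanishes
disp(norm(Kyy(1:k,1:k) - (Kxx(1:k,1:k) - K12/K22*K21)))  % (Kyy)11 = K11 - K12*inv(K22)*K21

The last line previews the result referenced above: with this choice of B, the upper diagonal block reduces to K11 - K12 K22^(-1) K21, the form that reappears as the conditional covariance on the next slide.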
Gauss-Markov Theorem

Updating Gaussian Vectors under Bayes' Rule

Given X and Y are jointly Gaussian random input and output vectors with dim k and n-k respectively,
combine them to form an n-dim vector with partitioned mean and covariance as follows:

$$\underset{n\times 1}{\begin{bmatrix} X_{(k)} \\ Y_{(n-k)} \end{bmatrix}},\qquad
\underset{n\times 1}{\mu} \equiv \begin{bmatrix} \mu_{X\,(k)} \\ \mu_{Y\,(n-k)} \end{bmatrix},\qquad
\underset{n\times n}{K} \equiv \begin{bmatrix} \underset{k\times k}{K_{XX}} & \underset{k\times(n-k)}{K_{XY}} \\[4pt] \underset{(n-k)\times k}{K_{YX}} & \underset{(n-k)\times(n-k)}{K_{YY}} \end{bmatrix}$$

The Gauss-Markov Theorem states that the conditional PDF of "X given Y" is also Gaussian
with conditional mean & covariance given by

$$\mu_{X|Y} = \mu_X + K_{XY}\,K_{YY}^{-1}\,(y-\mu_Y)
\qquad\qquad
K_{X|Y} = K_{XX} - K_{XY}\,K_{YY}^{-1}\,K_{YX}$$

(dimensions: $k\times 1 = k\times 1 + [k\times(n-k)]\,[(n-k)\times(n-k)]\,[(n-k)\times 1]$ and
$k\times k = k\times k - [k\times(n-k)]\,[(n-k)\times(n-k)]\,[(n-k)\times k]$)

Note: Although the covariance K is symmetric, the blocks themselves are not, i.e., $K_{XY} \neq K_{YX}$
(their dimensions, $k\times(n-k)$ and $(n-k)\times k$, generally differ). Symmetry of K requires the
following relationship for the off-diagonal blocks: $K_{XY}^T = K_{YX}$ (both $(n-k)\times k$).

                                                                                                                                     163

The results of the last section for the n-dimensional multivariate Gaussian are now cast in a form more
suitable for a communication channel. We introduce new notation in which the 1st partition of the
Gaussian vector consists of the k inputs X(k) = [X1, ..., Xk]^T and the 2nd partition consists of the (n-k)
outputs Y(n-k) = [Y1, ..., Yn-k]^T. The mean vector and covariance matrix are partitioned in the natural manner
shown on the slide.
In this notation, the Gauss-Markov Theorem states that the conditional PDF of "vector X given vector Y"
is also Gaussian, with conditional mean and covariance given by the two boxed equations. This is
identical to the results of the previous slide, only in a new notation.
Note that a possible source of confusion would be to equate the partitions X(k) and Y(n-k) (whose dimensions
k + (n-k) add up to n) with the transformation of coordinates Y = AX used to transform between two n-
dimensional coordinate systems, from X to the canonical coordinates Y.
Also note that even though the full n x n covariance matrix is symmetric, Krc = Kcr with respect to its
indices (i.e., K = K^T), this is no longer true for the partitioned components, K(R)(C) ≠ K(C)(R), as evidenced
by the fact that KXY ≠ KYX: they usually do not even have the same dimensions. The symmetry of the
full matrix instead requires that blocks with transposed partition indices be transposes of one another, i.e., KXY^T =
KYX, which is possible because these two matrices do have the same dimensions.
The Gauss-Markov Theorem is the basis for using the conditional mean estimator µX|Y to update the a
priori mean value µX = E[X] of a k-dimensional state vector X by using an (n-k)-dimensional
measurement vector Y. The state and measurement vectors must be part of the same multivariate
Gaussian distribution, or equivalently they must be components of a partitioned Gaussian vector whose
means, variances, and correlations are given by the partitioned n-dimensional mean vector and
covariance matrix shown at the top of the slide. They indeed form a Gaussian "arena".



                                                                                                                                                                                         163
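A minimal MATLAB sketch of the two boxed update equations follows; the numbers are illustrative, not from the course. Given a prior state (µX, KXX), a cross-covariance KXY, and a scalar measurement y with covariance KYY, it computes the conditional mean and conditional covariance.

% Minimal sketch (illustrative values): Bayes' update of a 2-dim state
% from a single scalar measurement using the Gauss-Markov equations.
muX = [0; 0];          % a priori state mean (k = 2)
Kxx = [4 1; 1 2];      % a priori state covariance
Kxy = [1; 0.5];        % state-measurement cross-covariance (Kyx = Kxy')
muY = 0;  Kyy = 3;     % measurement mean and covariance (n-k = 1)
y   = 2.5;             % one measurement realization

muX_given_y = muX + Kxy/Kyy*(y - muY)  % conditional (a posteriori) mean
Kx_given_y  = Kxx - Kxy/Kyy*Kxy'       % conditional covariance

Note that Kx_given_y is smaller than Kxx in the positive-definite sense: incorporating the measurement reduces (or at worst preserves) the state uncertainty, which is the essence of the update.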
Gauss-Markov Estimator

New RVs:

$$\mu_{X|Y} \;\rightarrow\; \hat{X} = \mu_X + K_{XY}K_{YY}^{-1}(Y-\mu_Y) \qquad \text{(Estimator RV)}$$
$$e = X - \hat{X} = X - \left[\mu_X + K_{XY}K_{YY}^{-1}(Y-\mu_Y)\right] \qquad \text{(Error RV)}$$

Note: The "Estimator" and the "Error" depend upon the specific values of X = "x" and Y = "y"
and hence generate samples of two new random variables $\hat{X}$ & $e$ whose statistics can be
inferred from those of X and Y.

The following remarkable properties can be shown for these RVs. The error $e$ and the conditional
mean estimator $\hat{X}$ satisfy:

1) $E[e\hat{X}^T] = 0$ & $E[eY^T] = 0$: $e \perp \hat{X}$ & $e \perp Y$ ("orthogonal"), i.e., $e$ is
uncorrelated with the estimator $\hat{X}$ and the data Y.

2) $K_{\hat{X}Y} = K_{XY}$: the estimator $\hat{X}$ and the RV X have the same correlation with the
measurements Y.

3) The distributions for $\hat{X}$ and $e$ satisfy the "Pythagorean right triangle relationship" shown
in the figure:

$$\hat{X} \sim N(\mu_X,\; \underbrace{K_{XY}K_{YY}^{-1}K_{YX}}_{\equiv\, Q}) = N(\mu_X, Q)
\qquad\qquad
e \sim N(0,\; \underbrace{K_{XX} - K_{XY}K_{YY}^{-1}K_{YX}}_{\equiv\, P}) = N(0, P)$$

[Figure: right triangle with the random variable $X \sim N(\mu_X, K_{XX})$ on the hypotenuse, the
Gauss-Markov estimator $\hat{X} \sim N(\mu_X, Q)$ along the base, and the error $e \sim N(0, P)$
perpendicular to it; $X = \hat{X} + e$.]

Gaussian means & variances add:

$$N(\mu_X, K_{XX}) = N(\mu_X, Q) + N(0, P)$$

Recall for scalar X & Y: $Y = \rho X + V$ gives $N(0,1) = N(0,\rho^2) + N(0,1-\rho^2)$.

                                                                                                                                     164

The conditional mean is evaluated for a specific "realization" of the Gaussian RVs X = "x" and Y = "y";
looking at many realizations allows us to consider the conditional mean µX|Y as a random variable
itself. Thus we replace the specific realizations µX|Y and "y" in the update equation by RVs denoted
respectively X-hat and Y, as shown in the first equation. The difference between the true state X
and the conditional mean estimate of that state, X-hat, is then a RV that represents the estimation error
e = X - X-hat, as shown in the second equation.
These two equations can be shown to have the following remarkable properties: 1) the error is
uncorrelated with both the estimator X-hat and the data Y, 2) the estimator X-hat and the true state X
correlate with the measurements in the same way, and 3) the distributions for the RVs X-hat and e
satisfy a "Pythagorean right triangle relationship" between their Gaussian designations.
Looking at the figure, the true state X ~ N(µX, KXX) is on the hypotenuse, the estimator X-hat ~ N(µX, Q)
with Q = KXY KYY^(-1) KYX lies in the plane, and the error e ~ N(0, P) with P = KXX - KXY KYY^(-1) KYX is
perpendicular to the plane. The vector relation X = X-hat + e forms the right triangle, and the
means and variances add so that
                       µX = µX + 0 and KXX = Q + P = (KXY KYY^(-1) KYX) + (KXX - KXY KYY^(-1) KYX).
For the normal distributions this may be written in the suggestive form
                                                     N(µX, KXX) = N(µX, Q) + N(0, P) .
Also recall that this relationship showed up for the scalar case of a single input X and single output Y in the
form Y = ρX + V (where V = e is the noise, and solving for the error gives e = Y - ρX):
                                                            N(0,1) = N(0,ρ²) + N(0,1-ρ²)




                                                                                                                                        164
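The three properties are easy to confirm by Monte Carlo. Here is a minimal MATLAB sketch for the scalar case; the values are illustrative and not from the course. It draws joint samples of (X, Y), forms the estimator X-hat and error e, and checks the orthogonality relations and the Pythagorean variance split KXX = Q + P.

% Minimal sketch (scalar case, illustrative values): check e is uncorrelated
% with Xhat and Y, and that var(Xhat) = Q and var(e) = P, with Q + P = Kxx.
N = 2e5;
muX = 1;  muY = 0;
Kxx = 2;  Kyy = 3;  Kxy = 1.5;
K = [Kxx Kxy; Kxy Kyy];                          % joint covariance (Kyx = Kxy)
Z = repmat([muX; muY], 1, N) + chol(K,'lower')*randn(2, N);
X = Z(1,:);  Y = Z(2,:);

Xhat = muX + Kxy/Kyy*(Y - muY);                  % estimator RV
e    = X - Xhat;                                 % error RV

Q = Kxy/Kyy*Kxy;  P = Kxx - Q;
disp(mean(e.*(Xhat - muX)))                      % ~0: e uncorrelated with Xhat
disp(mean(e.*(Y - muY)))                         % ~0: e uncorrelated with Y
disp([var(Xhat) Q; var(e) P])                    % sample variances match Q and P; Q + P = Kxx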
To learn more please attend this ATI course


    Please post your comments and questions to our blog:
        http://www.aticourses.com/blog/

     Sign up for ATI's monthly Course Schedule Updates:
http://www.aticourses.com/email_signup_page.html

Fundamentals of Engineering Probability Visualization Techniques & MatLab Case Studies

  • 1.
    Course Sampler FromATI Professional Development Short Course Fundamentals of Engineering Probability Visualization Techniques & MatLab Case Studies Instructor: Dr. Ralph E. Morganstern ATI Course Schedule: http://www.ATIcourses.com/schedule.htm ATI's Engineering Probability: http://www.aticourses.com/Fundamentals_of_Engineering_Probability.htm
  • 2.
    www.ATIcourses.com Boost Your Skills 349 Berkshire Drive Riva, Maryland 21140 with On-Site Courses Telephone 1-888-501-2100 / (410) 965-8805 Tailored to Your Needs Fax (410) 956-5785 Email: ATI@ATIcourses.com The Applied Technology Institute specializes in training programs for technical professionals. Our courses keep you current in the state-of-the-art technology that is essential to keep your company on the cutting edge in today’s highly competitive marketplace. Since 1984, ATI has earned the trust of training departments nationwide, and has presented on-site training at the major Navy, Air Force and NASA centers, and for a large number of contractors. Our training increases effectiveness and productivity. Learn from the proven best. For a Free On-Site Quote Visit Us At: http://www.ATIcourses.com/free_onsite_quote.asp For Our Current Public Course Schedule Go To: http://www.ATIcourses.com/schedule.htm
  • 3.
    Fundamental Probability Concepts • Probabilistic Interpretation of Random Experiments (P) – Outcomes: sample space – Events: collection of outcomes (set theoretic) – Probability Measure: assign number “probability” P ε [0,1] to event • Dfn#1-Sample Space (S): Fine-grained enumeration (atomic - parameters) – List all possible outcomes of a random experiment – ME - Mutually exclusive - Disjoint “atomic” – CE - Collectively exhaustive - Covers all outcomes • Dfn#2- Event Space (E): Coarse-grained enumeration (re-group into sets) – ME & CE List of Events S (all outcomes) Atomic Outcomes Events: A,B,C ME but not CE A D (Disjoint by dfn) Events: A,B,C ,D both ME & CE C B 14 INDEX Discrete parameters uniquely define the coordinates of the Sample Space (S) and the collection of all parameter coordinate values defines all the atomic outcomes. As such atomic outcomes are mutually exclusive (ME) and collectively exhaustive (CE) and constitute a fundamental representation of the Sample Space S. By taking ranges of the parameters such as A, B, C, and D, one can define a more useful Event Space which should consist of ME and CE events which cover all outcomes in S without overlap as shown in the figure. 14
  • 4.
    Fair Dice EventSpace Representations d2 • Coordinate Representation: 6 – Pair 6-sided dice 5 A: d1=3, d2 =arb. 4 – S={(d1,d2): d1,d2 = 1,2,…,6} 3 2 C: d1=d2 – 36 Outcomes Ordered pairs 1 d1 1 2 3 4 5 6 B: d1+d =7 • Matrix Representation: 1  [1 2 3 4 5 6]  (1,1) (1,2) (1,3) (1,4) (1,5) (1,2 )  6   (2,1) (2,2) (2,3) (2,4) (2,5) (2,6) – Cartesian Product: 2   3   = (3,1)  (3,2) (3,3) (3,4) (3,5) (3,6)  – {d1} x {d2} = d1 d2T 4 (4,1) (4,2) (4,3) (4,4) (4,5) (4,6)   (5,1) (5,2) (5,3) (5,4) (5,5) (5,6) 5     (6,1)  (6,2) (6,3) (6,4) (6,5) (6,6)  6 • Tree Representation: d2 d1 (1,1) (1,2) 1 (1,3) 36 Outcomes (1,4) Ordered Pairs 2 (1,5) 3 (1,6) • Polynomial Generator for Sum Start 4 2 Dice 5 (6,1) (6,2) ( x1 + x 2 + x3 + x 4 + x5 + x 6 ) 2 = 1x 2 + 2 x3 + 3 x 4 + 4 x5 + 5 x 6 + 6 x 7 6 (6,3) (6,4) Exponents represent + 5 x8 + 4 x9 + 3 x10 + 2 x11 + 1x12 (6,5) (6,6) 6-sided die face numbers Exponents represent pair sums Coefficients represent #ways 16 It is helpful to have simple visual representations of Sample and Event Spaces For a pair of 6-sided dice, coordinate, matrix, and tree representations are all useful representations. Also the polynomial generator for the sum of a pair of 6-sided dice immediately gives probabilities for each sum. Squaring the polynomial (x1+x2+x3+x4+x5 +x6)2 yields a generator polynomial whose exponents represent all possible sums for a pair of 6-sided dice S={2,3,4,5,6,7,8,9,10,11,12}and whose coefficients C= {1,2,3,4,5,6,5,4,3,2,1} represent the number of ways each sum can occur. Dividing by the coefficients C by the total #outcomes 62 = 36 yields the probability “distribution” for the pair of dice. Venn diagrams for two or three events are useful; for example, the coordinate representation in the top figure can be used to visualize the following events A: {d1 = 3 and d2 = arbitrary, B= {d1 + d2 = 7}, and C= {d1 = d2} Once we display these two events on the coordinate diagram their intersection properties are obvious, viz., both A & B and A & C intersect, albeit at different points, while B & C do not intersect (no point corresponding to sum=7 and equal dice values). More than three intersecting sets, become problematic for Venn diagrams as the advantage of visualization is muddled somewhat by the increasing number of overlapping regions in theses cases (see next two slides). 16
  • 5.
    Venn Diagram for4 Sets 4C = (4C1 4-Singles) – (4C2 6-Pairs) + (4C3 4-Triples ) - ( 4C4 1-Quadruple) 0 A B AB BD AC ABD ABC AD ABCD BC ACD BCD CD D C 17 As we go to Venn diagrams with more than 3 sets the labeling of regions becomes a practical limitation to their use. In this case of 4 sets A,B,C, D, the labeling is still pretty straightforward and usable. The 4 singles A,B,C,D are labeled in an obvious manner at the edge of each circle. The 6 pairs AB,AC,AD,BC,BD,CD are labeled at the intersection of two circles. The 4 triples ABC, ABD, BCD, ACD are labeled within “curved triangular areas” corresponding to the intersections of three circles. The 1 quadruple ABCD is labeled within the unique “curved quadrilateral area” corresponding to the intersection of all four circles. 17
  • 6.
    Trivial Computation ofProbabilities of Events sum = d1 + d2 d2 Ex#1 Pair of Dice E1 S={(d1,d2): d1,d2 = 1,2,…,6} 6 12 5 11 E2 10 E1={(d1,d2): d1+d2 ¥ 10} 4 9 8 P(E1)=6/36=1/6 3 7 6 2 5 E2={(d1,d2): d1+ d2 = 7} 4 P(E2)=6/36=1/6 1 3 2 d1 1 2 3 4 5 6 Ex#2 Two Spins on Calibrated Wheel S={(s1,s2): s1,s2 ε [0,1]} s2 E1={(s1,s2): s1+s2 ¥ 1.5}--> P(E1) = ----- =.52/2=1/8 1 1 E1 0.5 E3 E2={(s1,s2): s2 § .25} --> P(E2)=1(.25)/1=.25 E2 0 s1 E3={(s1,s2): s1= .85; s2= .35}--> P(E3)=0/1=0 0 0.5 1 20 For equally likely atomic events the probability of any outcome Event is easily computed as the (#atomic outcomes in Event)/(total # outcomes). For a pair of dice, the total # of outcomes is 6*6=36 and hence simple counting of the # points in E /36 yields P(E), etc. Two spins on a calibrated wheel [0, 1) can be represented by the unit square in the (s1 , s2)-plane and an analogous calculation can be performed to obtain the probability for the event E by dividing the area covered by the event by the area of the event space (“1”): P(E)= area(E)/ 1. 20
  • 7.
    DeMorgans’ Formulas -Finite Unions and Intersections i) Compl(Union) = Intersec(Compls): ( E1 ∪ E2 ∪ c ∪ En ) c = E1 ∩ E2 ∩ c ∩ En c c c c ii) Compl(Intersec) = Union(Compls): ( E1 ∩ E2 ∩ ∩ En ) c = E1 ∪ E2 ∪ ∪ En Useful Forms: A∪ B i’) Union expressed ( A ∪ B) c = Ac B c Visualization Compl(Union) Intersec(Compl) ( A ∪ B)c as an Intersection (( A ∪ B) ) c c = A ∪ B = ( Ac B c ) c A Ac Intersect grey areas B Bc Ac & B c ii’) Intersection ( AB) c = Ac ∪ B c Ac B c Yields one Union(Compl) grey area Ac B c expressed as a Union Compl(Intersec) with A and B excluded (( AB) ) c c ( = AB = Ac ∪ B c )c Taking its complement ( Ac B c )c yields white area, i.e., A ∪ B 24 INDEX DeMorgan’s Laws for the complement of finite unions and intersections states that i) The complement of unions equals the intersections of the complements, and ii) The complement of intersections equals the union of complements The alternate forms obtained by taking the complements of the original equations are often more useful because they give a direct decomposition of the union and the intersection of two or more sets i’) The union equals the complement of the (intersection of complements) ii’) The intersection equals the complement of the (union of complements) A graphical construction of A U B = (Ac Bc)c is also shown in the figure.. Ac and Bc are the two shaded areas in the middle planes which exclude A and B respectively (white) ovals Intersecting these two shaded areas and taking the complement leaves the white oval areas which is A U B 24
  • 8.
    Set Algebra SummaryGraphic Union A ∪ B = A ∪ Ac B = B ∪ Bc A Union AUB “A-B” “B-A” Intersection A ∩ B = A ⋅ B = AB A Bc A AB B Ac B x ∈ AB iff x ∈ A & x ∈ B Intersection Difference A − B ≡ A ∩ B c = AB c x ∈ A − B iff x ∈ A and x ∉ B Differences DeMorgans A ∪ B = ( Ac B c )c ( A ∪ B )c = Ac B c means ( ) c AB = Ac ∪ B c complement of (At least one) = (not any) 27 This summary graphic illustrates the set algebra for two sets A , B and their union intersection and difference. DeMorgans Law can be interpreted as saying “the complement of (“at least one”) is “not any” Associativity and commutivity of the two operations allows extension to more than two sets. 27
  • 9.
    Basic Counting Principles Principle #0: Take Case n=3-4; generalize to n Binomial Expansion: (a+b)3 (a+b)n Repetitions Allowed Principle #1: Product Rule for Sub-experiments: 6- Bins = 263 ⋅103 m Num Suit Licenses ⋅ nm = ∏ nk 26 26 26 10 10 10 n = n1 ⋅ n2 H 1 D S C H 5 16- Bins k =1 Start 2 D Binary S 2 216 = 65,536 C H 13 D 2 2 2 2 ... 2 Generate “tree” of outcomes S C Digits #ways: 13 * 4 = 52 No Repetitions Principle #2: Perm n distinguish-obj take k k=n Arrange 11 Travel 5 Cooking 4 Garden All Books n! “Fill k-bins” 11! 5! 4! n Pk = (n) k = 3! Permute Groups (n − k )! k<n 11 Travel Books in 5 bins 11| 10| 9 |8 |7 Principle #3:Perm n-obj take n with r - Arrange 4! groups of indistinguishable objects Letters “TOOL” = 12 2!⋅1!⋅1! hable  # Distinguis n! 10!  Sequences  = n !⋅n !⋅ ⋅ n !   r − groups {4”r”, 3”s”, 2”o”, 1 “t”} 4!⋅3!⋅2!⋅1! = 12,600   1 2 r Principle #4: Combination of n-objects take k Committee of 4 22! 22! C4 = 22 = = 7315 from 22 people (22 − 4)!4! 18!4! n n! n Ck =   =  k  k ! ( n − k )! k ≤ n Order not   Committee of 3 {2M, 1F} 6⋅5 important! from {6M, 3F} 6 C2 ⋅3 C1 = ⋅ 3 = 45 2! = Principle #3 with {taken , not taken} not counted 28 INDEX Outcomes must be distinguished by labels. They are characterized by either i) distinct orderings or ii) distinct groupings. A grouping consists of objects with distinct labels; changing order within a group is not a new group, but is a new permutation. The four basic counting principles for groups of distinguishable objects are summarized and examples of each are displayed in the table. Principle#0: This is practical advice to solve a problem with n= 2,3,4 objects first and then generalize the “solution pattern” to general n. Principle#1: This product rule is best understood in terms of the multiplicative nature of outcomes as we “branch out” on a tree. For a a single draw from a deck of cards there are 13 “number” branches and, in turn, each of these has 4 “suit” branches yielding 13*4 =52 distinguishable cards or outcomes. Principle#2: Permutation (ordering) of n objects take k at a time is best understood by setting up “k- containers” putting one of “n” in the first, one of “n-1” , ... and finally one of “n-k+1” in the kth container. The total #ways is obtained by the product rule as n*(n-1)*...*(n-k+1) = n!/(n-k)! Principle#3: Permutation of all ”n” objects consisting of “r “ groups of indistinguishable objects {3 t , 4 s 5 u}. If all objects were distinguishable then the result would be n! permutations; however permutations within the r groups does not create new outcomes and therefore we divide by factorials of the numbers in each group to obtain n!/(n1! n2! ... nr!) Principle#4: Combination of n objects take k is related to Principles#2, #3. There are n! permutations; ignoring permutations within r= 2 groups {“taken” , “not taken”} yields n!/(n! (n-k)!) 28
  • 10.
    Counting with Replacement Refills Drop Down Select “B” from Alphabet and Replace A A B B ... Y Y Z Z Always have 26 letters to choose from A A B B Y Y Z Z 23 =8 distinct 4 distinct Permutation of “n” obj with (# drws) orderings groupings replacement taken “k” at a time n Pk = # replaceable objects = nk A {AAA} 3 “A” B {AAB} 2 “A”& 1”B” A n n n n n…n A {ABA} 2 “A”& 1”B” A B B {ABB} 2 “B”& 1”A” Bin# 1 2 3 …k S A A {BAA} 2 “A”& 1”B” B n=2 , k=3 B B {BAB} 2 “B”& 1”A” A {BBA} 2 “B”& 1”A” B {BBB} 3 “B” Combination of “n” obj with replacement taken “k” at a time effective # objects  n + k − 1  n + k − 1 n Ck = / n + (k-1) = n + k −1 Ck =  =  Note: “k” can be larger than “n” (draw k)  k   n −1  Example: From 2 objects {A, B} choose 3 with replacement (Only Way!) After each draw of an A or B “drop 4 Outcomes down a replacement” add 1 after each A B A/B A/B {AAA},{BBB} draw except last 4! {ABB},{AAB} (effective # objects) = 2 +(3-1)=4 2 C3 = 2+3−1C3 = 4C3 = / =4 3! 1! 41 INDEX Counting permutations and combinations with replacement is analogous to a candy machine purchase in which a new object drops down to replace the one that has been drawn, thus giving the same number of choices in each draw. Permutation of n obj taken k at a time with replacement: Each of the k draws has the same number of outcomes n because of replacement, the result is n*n*n... *n = nk and is written nPk with an “over-slash” on the permutation symbol. The case n=2, k=3 of 3 draws with 2 replaceable objects {A,B} shows the slash- 2 P3 =23 = 8 permutations that result. Combination of n obj taken k at a time with replacement: For n=2, k=3, 2 take 3 does not make any sense. However, with replacement, it does since each draw except the last drops down an identical item and hence the number of items to choose from becomes n +(k-1) and slash-nCk = n+(k-1)Ck. The tree verifies this formula and explicitly shows that there are 4 distinct groupings {3A, 3B, 2A1B, 1A2B} exactly the number of combinations with replacement given by the general formula slash-2C3 = 2+(3-1)C3 = 4C3 =4 41
  • 11.
    II) Fundamentals ofProbability 1. Axioms 2. Formulations: Classical, Frequentist, Bayesian, Ad Hoc 3. Adding Probabilities: Inclusion / Exclusion, CE & ME 4. Application of Venn Diagrams & Trees 5. Conditional Probability & Bayes’ “Inverse Probability” 6. Independent versus Disjoint Events 7. System Reliability Analysis 47 As a theory, Probability is based on a small set of axioms which set forth fundamental properties of construction. In practice, probability may be formulated theoretically, experimentally, or subjectively, but must always obey the basic Axioms. Evaluating probabilities for events, is naturally developed in terms of their unions and intersections using Venn Diagrams, Trees and Inclusion/Exclusion techniques. Conditional probabilities, their inverses (Bayes’ theorem), and the dependence between two or more events flow naturally from the basic axioms of probability. System reliability analysis utilizes all these fundamental concepts 47
  • 12.
    Inclusion / ExclusionIdeas ME Events A,B - Disjoint AB= φ A B P(A∩B) = P(A) + P(B) No intersections ”Add Prob” No intersections Intersect: “CE, not ME” “Recast” as Disjoint Union “CE & ME” Not Disjoint AB∫φ A A B-A B ∫ AB P(A∩B) = P(A) + P(B-A) = P(A) + P(BAc) Intersection “AB” Counted Twice!! P(A∩B) ∫ P(A) + P(B) B = B ⋅ S = B ⋅ ( A ∪ Ac ) = BA ∪ BAc Subtract “P(AB)” from sum; count only once A BAC B P ( A ∪ B ) = P ( A) + P ( B ) − P ( AB ) AB P( BAc ) = P( B) − P( AB) Generalization by Induction: let D = B ∪ C P ( A ∪ B ∪ C ) = P ( A ∪ D ) = P ( A) + P ( D) − P ( AD ) = P ( A) + P ( B ∪ C ) − P( A ⋅ ( B ∪ C )) = P ( A) + {P ( B ) + P (C ) − P ( BC )} − {P ( AB ) + P ( AC ) − P ( ABAC )} Inclusion / P ( A ∪ B ∪ C ) = P ( A) + P ( B ) + P (C ) − P ( AB ) − P ( AC ) − P ( BC ) + P ( ABC ) Exclusion add singles subtract pairs add triples 54 INDEX It is important to realize that although probabilities are simply numbers that add, the probability of the union of two events P(A U B) is not equal to the sum of individual probabilities for the two events P(A) + P(B). This is because points in this overlap region AB are counted twice; to correct for this one needs to subtract out “once” the double counted points in the overlap yielding P(A U B) = P(A) + P(B)-P(AB). Only in the case of non-intersection AB = φ does the simple sum of probabilities hold. The generalization for a union of three or more sets alternates inclusion and exclusion; for A,B,C the probability P(AUBUC) adds the singles, subtracts the doubles and adds the triple as shown. 54
  • 13.
    Venn Diagram Application:Inclusion/Exclusion Given following information find how many club members play at least one sport T or S or B T (36) TS (22) S (28) Club: 36 T , 28 S, 18 B TSB (4) SB (9) Let N= Total # members (unknown) TB (12) 36 28 18 B (18) Write Probabilities as P(T) = ; P(S) = ; P(B) = ; etc. N N N CLUB Method 1: Subs into Formula for Union P ( T ∪ S ∪ B) = P (T ) + P( S ) + P( B ) − P (TS ) − P(TB ) − P ( BS ) + P (TBS ) 36 28 18 22 12 9 4 TS (22) = + + − − − + T (36) STc (6) N N N N N N N 43 18 1 = Thus 43 of “N” Club Members play 6 TSB N at least one sport. (N is irrelevant) (4) 5 8 SB (9) Method 2: Disjoint Union - Graphical TB (12) 1 T ∪ S ∪ B = T ∪ ST ∪ BT Sc c c BTcSc (1) CLUB P(T ∪ S ∪ B) = P(T ) + P( ST c ) + P( BT c S c ) 36 6 1 43 = + + = N N N N 68 INDEX This example illustrates the ease by which a Venn diagram can display the probabilities associated with the various intersections of 3 sets T, S, and B. The number of elements in each of the 7 distinct regions is easily read off the figure; they are required to establish the total number in their union T U S U B via the inclusion/exclusion formula. Another method of finding P(T U S U B ) is to decompose the union T U S U B into a union of disjoint sets T* U S* U B* for which the probability is additive, i.e., P(T* U S* U B* ) = P(T*) + P(U*) + P(B*). 68
  • 14.
    Matching Problem –1 “N” men throw hats onto floor; Each man in turn randomly draws a hat a) No Matches - Find Probability None draw own hat. Let Event Ei = ith man chooses his own hat ; compute: P(0 − matches) = 1 − P( E1 ∪ E2 ∪ ∪ EN ) 1|2|3|… | k | k+1 | … |N Hats i1 |i2 | i3 | … | in in+1 | in+2 | in+3 | … | iN Men Probability that M1 & M2 &...&Mn irrespective of what n “Ei s” choose own hats (N-n) Does not Matter draw own hats other men draw (Matched or Not Matched ) Total # of“n-tuple” N # perms ( N − n)!   P( Ei1 Ei2 Ein ) = = selections from N n Total# perms N!    N  ( N − n)! N! ( N − n)! 1 Sum Joint Probabilities ∑ P( Ei1 Ei2 Ein ) =   ⋅ = n !( N − n)! N ! = over all “n-tuples” n −tuples All n-tuples Eq. Likely n N! n!   P (0 − Matches ) = 1 − P ( E1 ∪ E2 ∪ E3 ) = 1 −  ∑ P ( Ei1 ) − ∑ P ( Ei1 Ei2 ) + ∑ P( Ei1 Ei2 Ei3 ) = 1 − {1 − 2! + 3!} = 1 1 1 3 1− tuples pairs triples  P(0 − matches) = 1 − P( E1 ∪ E2 ∪ ∪ EN ) = 1 − 1 + 1 − 2! 3! 4! 5! 1 + + ( −1) N 1 N!  e−1 N →∞ → b) k- Matches Poisson with success rate λ=1/N & “time  k! ⋅ e−1 →1 1 1 1 N −k 1   − + + + ( −1)  ( N − k )!  P(k matches) =  2! 3! 4! N→∞ intvl” t = N samples; a=λ *t =(1/N)*N =1 k! 69 INDEX Here is an example that requires the inclusion/exclusion expansion for a large number of intersecting sets. Since it becomes increasingly difficult to use Venn diagrams for a large number of intersecting sets, we must use the set theoretic expansion to compute the probability. We shall spend some time on this problem as it is very rich in probability concepts. The problem statement is simple enough: “N men throw their hats onto the floor; each man in turn randomly draws a hat. “ a) What is the probability that no man draws his own hat? b) What is the probability of exactly k-matches? Key ideas: define Event Ei = ith man selects his own hat then take union of N sets E1 U E2 U ... U EN and P(no-matches)=1- P(E1 U E2 U ... U EN) The expansion of the P(E1 U E2 U ... U EN) involves addition and subtraction of P(singles), P(pairs), P(triples), etc. ( The events Ei are CE but not ME so you cannot simply sum up the P(Ei ) for k singles to obtain an answer to part b)) . This slide shows a key part of the proof which establishes the very simple result that the sum over singles, P(singles) = 1/(1!); sum over pairs is P(pairs)= 1/(2!) ; sum over triples is P(triples)=1/(3!); sum over 4- tuples, P(4-tuples) = 1/(4!); ... sum over N-tuples, P(N-tuple) = 1/(N!). Limit as N large approaches a Poisson Distribution with success rate for each draw λ=1/N and data length t =N i.e., parameter a =λ t =1 69
  • 15.
    Man Hat Problemn =3 Tree/Table Counting M#1 M#2 M#3 M.E. Match Tree#1 Drw#1 Drw#2 Drw#3 Outcomes Outcomes M#1 M#2 M#3 #Matches E2 1 2 3 E3 {E1 E2 E3 } triple 1 2 3 3 1/2 Br#1 1 E1 1/2 3 2 c {E1 E2 E3 } c single 1 3 2 1 1/3 1 1/2 1 1 3 E3 {E1c E2 c E3 } single 2 1 3 1 E1C Br#2 Start 1/3 2 1/2 1 1 c c {E1 E2 E3 } c No-match 2 3 1 0 3 1/3 1/2 3 1 1 2 c c {E1 E2 E3 } c No-match 3 1 2 0 Br#3 E1C 1/2 2 E2 1 1 c {E1 E2 E3 } c single 3 2 1 1 P(Ei) = 1/3 2/6 2/6 From Table: From Tree: Connection: Matches & Events Prob[0-matches]=2/6 Prob[0-matches]=1-Pr[E1 U E2 U E3] Prob[1-matches]=3/6 Prob[Sgls]=P[E1]=P[E2]=P[E3]=1/3 =1-{Sum[Sngls]-Sum[Dbls]+Sum[Trpls]} Prob[2-matches]=0/6=0 Prob[Dbls] = P[E1E2]=(1/3)(1/2)=1/6 =1-{3(1/3) -3(1/6)+1(1/6)}=2/6 Prob[3-matches]=1/6 Prob[Trpls] = P[E1E2E3]=(1/3)(1/2)=1/6 Alternate Trees Yield: P[E1E3]= P[E2E3]=1/6 75 This slide shows the complete the tree and associated table for the Man - Hat problem in which n=3 men throw their hats in the center of a room and then randomly select a hat. The drawing order is fixed as Man#1, Man#2, Man #3, and the 1st column of nodes labeled as circled 1, 2, 3 shows the event E1 in which the Man#1draws his own hat, and the complementary event E1c i.e., Man#1 does not draw his own hat . The 2nd column of nodes corresponds to the remaining two hats in each branch shows the event E2 in which the Man#2 draws his own hat; note that E2 has two contributions of 1/6 summing to 1/3. Similarly, the 3rd draw results in the event E3 in two positions shown again summing to 1/3. The tree yields ME & CE outcomes expressed as composite states such as {E1E2E3}, {E1E2cE3c, etc., or equivalently in terms of the number of matches in the next column. The nodal sequence in the tree can be translated into the table on the right which is analogous to the table we used on the previous slide. The number of matches can be counted directly from the table as shown. The lower half of the slide compares the “ # of matches” events with the “compound events” formed from the “Ei”s{ no-matches, singles, pairs, and triples }. The connection between these two types of events is based on the common event “no-matches,” i.e., the inclusion/exclusion expansion of the expression [1- P(E1U E2U E3) ] in terms of singles doubles and triples yields P(0-matches). 75
  • 16.
    Conditional Probability -Definition & Properties ˆ P ( AS )  2 • Definition of Conditional Probability ˆ P( A | S ) ≡ =  ˆ P( S )  3 • In terms of atomic events si we can formally write ˆ ˆ P( ∪ si S ) ∑ P( s S ) ˆ i (# pts in Sˆ & A) A = ∪ si ˆ ) = P ( A S ) = si ∈ A = si ∈ A = si ∈ A P( A | S ˆ P( S ) ˆ P( S ) ˆ P( S ) (# pts in Sˆ ) ˆ • Note in case S = S it reduces to P(A) as it must A B •Asymmetry of Conditional Probability BA P(BA) P ( BA)  fraction  BA P ( B | A) = = = P ( A)  BA over A    A Given A Not Symmetrical! P( BA)  fraction  BA P( A | B) = = = P( B)  BA over B    Given B B 82 INDEX The formal definition of conditional probability follows directly from the renormalization concept discussed on the previous slide. It is simply the joint probability defined on the intersection of the set A and S-cap, P(AS-cap) divided by the normalizing probability P(S-cap). It can also be written explicitly in terms of a sum over atomic events given in the second equation. Conditional probability is not symmetric because the joint probability on the intersection of A and B is divided by probability of the conditioning set which is P(A) in one case and P(B) in the other. This is also easily visualized using Venn diagrams where the “shape division” are obviously different in the two cases. 82
  • 17.
    Examples - CoinFlips, 3-Sided Dice nH > nT Flip#3 Example#1: Three Coin Flips Flip#2 H {HHH} Given the first flip is H, Find Flip#1 H T {HHT} ˆ S Prob #H > #T H {HTH} H T T {HTT} #H > #T S 4 1 1 1 ˆ P ( S ) = ; P( HHH ) = ; P ( HHT ) = ; P( HTH ) = S H H {THH} 8 8 8 8 T T T {THT} 3 P ( HHH ) + P ( HHT ) + P ( HTH ) 3 = 8= H {TTH} P (nH > nT | H ) = ˆ) P( S 4 4 T 8 {TTT} Example#2: 4-Sided Dice Given the first “die” d1= 4” d1 d2 Find Prob of Event A: “d2= 4” 1 P(d2=4| d1= 4)=? 2 S S 3 (4,1) ˆ 4 1 4 (4,2) ˆ S P ( S ) = P( d1 = 4) = ; P( 4,4) = (4,3) 16 16 d2 (4,4) A 1 4 P(4,4) 1 P (d 2 = 4 | d1 = 4) = = 16 = ˆ P( S ) 4 4 3 ˆ S Reduced 16 2 Sample space 1 d1 1 2 3 4 83 Here are two examples illustrating conditional probability. The first involves a series of three coin flips and a tree shows all possible outcomes for the original space S. The reduced set of outcomes conditions on the statement “ 1st draw is a head (red circle)” and S-cap only takes the upper branch of the tree and leads to a reduced set of outcomes. The conditional probability is computed either by considering outcomes in this conditioning space S-cap or by computing the probability for S (the whole tree) and then renormalizing by the probability for S-cap ( upper branch). The second example involves the throw of a pair 4-sided dice and asks for the probability that d2 =4 given that d1=4, P(d2 =4 | d1 =4 ). The answer is obtained directly from the definition of conditional probability and is illustrated using a tree and a coordinate representation of the dice sample space with a Venn diagram overlay for the event (d1, d2) = (4,4) (green) and the subspace S-cap {d1=4} (red rectangle). 83
  • 18.
    Probability of Winningin the “Game of Craps” Rules for the “Game of Craps” First Throw - dice sum=(d1+d2) Subsequent Throws - dice sum=(d1+d2) 2, 3, 12 - “Lose” (L) “Point” - “Win” (W) 7, 11 - “Win” (W) 7 “Lose” (L) Other (O) - first time defines your “Point” = “5” say Other (O) “Throw Again” Thr#1 2 L Thr#2 Thr#3 Thr#4 4 S=d1+d2 #Ways #Prob 3 L 36 5 2, 12 1 1/36 4 W 4 6 5 Point L 3, 11 2 2/36 36 7 36 5 W 6 26 4 P Start O 6 o 4, 10 3 3/36 7 W 7 L 36 36 36 5 W i 8 26 6 n 5, 9 4 4/36 O 9 7 L t 36 36 s 6, 8 5 5/36 10 26 O 11 7 6 6/36 W 36 12 L   4  1  2 2 3 4 4  26  4  26  4  26  P (W | 5) = +  +   +   + =  = 36 36  36  36  36  36  36  36  1 − 26  5    36  P(W ) = P(7) + P(11) + ∑ P(W | Point )P(Point ) Points 6 2   = + + 2  P(W | 4) P (4) + P(W | 5) P (5) + P (W | 6) P(6) = .4929 36 36  1/ 3  3 / 36 2/5 4 / 36 5 / 11 5 / 36   85 INDEX Here we compute the probability of winning the game of craps previously described by the rules for the 1st and subsequent throws given in the box and illustrated by the tree. Since there are 36 equally likely outcomes the #ways for the two dice summing to either 2 or 12 is obviously 1/36, for 3 or 11 it is 2/36, and the remaining sums of two dice can be read directly off the sum axis coordinate representation and are displayed in the table on the right. We have labeled the partial tree “given the point 5” by their conditional probabilities derived from the table. The probability for the three outcomes W(“5”), L (“7”), “Other (not “5 or 7”) can be read off the table as P(5)= 4/36, P(7)=6/36, P(Other)= 1-(4+6)/36 =26/36. Note that these are actually conditional probabilities; but since the throws are independent the conditionals are the same as the a prioris as taken from the table. The P(W|5) is obtained by summing all paths that lead to a win on this “infinite tree”. Thus the 2nd throw yields W with probability 4/36 and the 3rd throw yields W with probability P(5|Other)P(5)=(26/36)(4/36), and the 4th throw yields W with probability P(5|Other,Other)P(5)=(26/36)2 (4/36), ... leading to an infinite geometric series which sums to (4/36)*1/(1-26/36)=2/5. The total probability of winning is the sum of winning on the 1st throw (“7” or “11”) plus winning on the subsequent throws for each possible “point.” The infinite sum for the other points is obtained in a similar manner to that for “5” and (taking points by pairs in the table leads to the factor of two) the final result is shown to be .4929, i.e., a 49.3% chance of winning! 85
  • 19.
    Visualization of Joint,Conditional, & Total Probability Binary Comm Signal - 2 Levels {0,1} Binary Decision - {R0, R1}={(“0” rcvd , “1” rcvd} x = 0,1 Joint Probability (Symmetric) 0 1 sent P(0,R0) = P(R0,0) ovly R1 “0” sent & R0 (“0” rcvd ) & y =R0 ,R1 R0 rcvd R0 (“0” rcvd ) “0” sent Conditional Probability 0R1 (Non-Symmetric) R0 ,R1 1R1 Joint P(0|R0) ∫ P(R0|0) 0R0 1R0 “0” sent given R0 (“0” rcvd ) x = 0 ,1 P(0) = P(0, R0 ) + P(0, R1 ) P(R0 ) = P(R0 ,0) + P(R0 ,1) R0 (“0” rcvd ) given “0” sent Total Probability P(0) Total Probability P(R0) sum up joint on R0,R1 sum across joint on 0,1 Conditional Probability P( R0 ,0) P( R0 ,0) P ( R0 | 0) ≡ = Requires Total Probability P ( 0) P( R0 ,0) + P( R0 ,1) Re-normalize Joint Probability P(0), P(R0), etc. P( R0 ,0) P ( R0 ,0) P (0 | R0 ) ≡ = P ( R0 ) P ( R0 ,0) + P ( R0 ,1) 88 INDEX Another way to visualize the communication channel is in terms of an overlay of a Signal Plane divided (equally) into “0”s and “1”s and a Detection Plane which characterizes how the “0”s and “1”s are detected and is structured as shown so that when we overlay the two planes we obtain an Outcome Plane with four distinct regions whose areas represent probabilities of the four product (joint) states { 0R0, 0R1, 1R0, 1R1} (similar to the tree outputs). In this representation the total probability of a “0” P(0) can be thought of as decomposed into two parts summed vertically over the “0”-half of the bottom plane shown by the break arrow P(0) = P(0,R0) + P(0,R1) [Note: summing on the “1”-half of the bottom plane yields P(1) = P(1,R0) + P(1,R1).] Similarly the total probability P(R0) can be thought of as decomposed into two parts summed horizontally over the “R0”-portion of the bottom plane shown by the break arrow P(R0) = P(R0,0) + P(R0,1); similarly we have P(R1) = P(R1,0) + P(R1,1). The Total Probability of a given state is obtained by performing such sums over all joint states. 88
  • 20.
    Log-Odds Ratio -Add & Subtract Measurement Information Note: Revisit Binary Comm Channel P( R0 | 0) = .95 P ( R1 | 1) = .90 P(0)=.5 E = “1” P( R1 | 0) = .05 P ( R0 | 1) = .10 P(1)=.5 Ec = “0” Relation between  P (1 | R1 )  P (1 | R1 ) e L1 L1 ≡ ln 1 − P(1 | R )  ⇒ e = 1 − P(1 | R ) ⇒  L1 P(1 | R1 ) = L1 and P(1|R1)  1  1 1 + e L1  P(1 | R1 )   P (1)   P ( R1 | 1)   P(1)   P( R1 | 1)  L1 ≡ ln 1 − P(1 | R )  = ln 1 − P(1)  + ln 1 − P( R | 1)  = ln P(0)  + ln P ( R | 0)            1     1     1  ≡ L0 ≡ ∆L1  P( R1 | 1)  Additive Meas Updates for L Lnew = Lold + ∆LR1  P (1)   P(0)  ; ∆LR1 = ln P( R | 0)  Lold = ln       1  Updates Meas#1: R1 Meas#2: R0 Alternate Meas#2: R1  .5   P( R0 | 1)   .10   P( R1 |1)  Lold = ln  = 0 ∆LR0 = ln  .90   .5   P( R | 0)  = ln .95     ∆LR1 = ln   = ln    0   P( R1 | 0)   .05   .9  = −2.25129 ∆LR1 = ln  = +2.8903  .05  Lnew = Lold + ∆LR0 Lnew = Lold + ∆LR1 = 2.8903 = 2.8903 + (−2.25129) = .63901 = 2.8903 + 2.8903 = 5.7806 Lnew = 0 + 2.8903 e 2.8903 e.63901 e 5.7806 P(1 | R1 ) = = .947 P(1 | R1 R0 ) = = .655 P (1 | R1 R0 ) = = .997 1 + e 2.8903 1 + e.63901 1 + e 5.7806 96 INDEX Revisiting the binary communication channel we now compute updates using the log odds ratio which are additive updates. The update equation simply starts from the initial log odds ratio which is Lold=ln[P(1)/P(1c)] =ln(.5/.5)=0 for the communication channel. There are two measurement types R1 and R0 and each adds an increment ∆L determined by its measurement statistics, viz., R1: ∆LR1 =ln[(P(R1|1)/P(R1|1c)]=ln(.90/.05) = +2.8903 (positive “confirming”) R0: ∆LR0 = ln[(P(R0|1)/P(R0|1c)]=ln(.10/.95)= -2.25129. (negative “refuting”) The table illustrates how easy it is to accumulate the results of two measurements R1 followed by R0 by just adding the two ∆Ls to obtain Lnew= 0+2.8903-2.25129=.63901, or alternately R1 followed by R1 to obtain Lnew=0+2.8903+2.8903=5.7806. These log odds ratios are converted to actual probabilities by computing P= eLnew / (1+ eLnew ) yielding .655 and .997 for the above two cases. If we want to find the number of R1 measurements needed to give .99999 probability of “1” we need only convert .99999 to an L =ln[(.99999)/(1-.99999)] =11.51 and divide the result by 2.8903 to find 3.98 so that 4 R1 measurements are sufficient. 96
  • 21.
    Discrete Random Variables(RV) –Key Concepts • Discrete RVs: A series of measurements of random events • Characteristics: “Moments:” Mean and Std Deviation • Prob Mass Fcn: (PMF), Joint, Marginal, Conditional PMFs • Cumulative Distr Fcn: (CDF) i) Btwn 0 and 1, ii) Non-decreasing • Independence of two RVs • Transformations - Derived RVs • Expected Values (for given PMF) • Relationships Btwn two RVs: Correlations • Common PMFs Table • Applications of Common PMFs • Sums & Convolution: Polynomial Multiplication • Generating Function: Concept & Examples 122 INDEX This slide gives a glossary of some of the key concepts involving random variables (RVs) which we shall discuss in detail in this section. Physical phenomena are always subject to some random components so that RVs must appear in any realistic model and hence their statistical properties provide a framework for analysis of multiple experiments using the same model. These concepts provide the rich environment that allows analysis of complex random systems with several RVs by defining the distributions associated with their sums and transformations of these distributions inherent in the mathematical equations that are used to model the system. At any instant, a RV takes on a single random value and represents one sample from the underlying RV distribution defined by its probability mass function (PMF). Often we need to know the probability for some range of values of a RV and this is found by summing the individual probability values of the PMF; thus a cumulative distribution function (CDF) is defined to handle such sums. The CDF formally characterizes the discrete RV in terms of a quasi-continuous function that ranges between [0,1] and which has a unique inverse. Distributions can also be characterized by single numbers rather than PMFs or CDFs and this leads to concepts of mean values, standard deviations, correlations between pairs of RVs and expected values. There are a number of fundamental PMFs used to describe physical phenomena and these common PMFs will be compared and illustrated through examples. Finally, the relationship between the sum of two RVs and the concept of convolution and the generating function for RVs will be discussed. 122
  • 22.
    Transformation of SampleSpace: Sum & Difference - 4-Sided Dice Fair 4-sided dice thrown twice: RVs: Sum= “S” & Absolute Difference “D” Uniform PMF pD1D2 (d1,d2) = 1/16 Find New PMF pDS(d,s) = ? Labels: D/S=3/5 d pS(6) Collapse on s- d2 S=d1+d2 Rotated to D, S “missing” axis points D=|d2-d1| Coordinates 4 3/5 2/6 1/7 0/8 4 3 2/16 Collapse on d-axis pD(3) 2/4 1/5 0/6 1/7 2/16 2/16 d2 2 3 D 2/16 2/16 2/16 Collapse on 1 4 1/3 d-axis pD(1) 0/4 1/5 2/6 2 Fold over s 1/16 1/16 1/16 1/16 3 0 3/ 5 0/2 1/3 2/4 3/5 D/S=3/5 S-Axis 2 2/ 0 1 2 3 4 5 6 7 8 2/ 1 6 4 1 1/ 1/ d1 7 1/ 5 3 0/ d 0/ 8 6 0/ 1 2 3 4 0/ S 2 4 pSD ( s, d ) 4 1/ 7 1/ 1/ 5 3 1 pD1D2(d1,d2) d 2/ 2/ 3 6 4 2 2 3/ 2 4 5 3 1 3 D /S 1/16 =3 0 /5 2 4 d1 2/16 2/16 2/16 1 Absolute Difference Doubles 1/ 2/ 2/ 2/ 6 1 1/16 0 6 1 6 1 6 1 Values above S-Axis 1 2/16 2/16 2 1/ 6 1 2/ 6 1 2/ 6 1 3 1/16 1 4 2/ 2/16 2 5 1/ 6 1 6 1 3 6 1/16 4 7 1/ 6 1 8 1/16 d 1 s 125 INDEX In the game with 4-sided dice, we are interested in the distribution of the sum random variable S = D1 + D2 , pS(s) and not the joint distribution pD1,D2(d1d2). This slide and several to follow illustrate the procedure for obtaining the desired “marginal” (or collapsed ) distribution pS(s). In the process, we shall develop the relationship between distributions under transformation of coordinates, and define conditional, and marginal, distributions involving a pair of RVs {D1,D2}. We start with the 2- and 3-dimensional dice representations of equally likely outcomes of 1/16 as shown on the left. Recall that the points (d1, d2) for dice outcomes may alternately be expressed by points (s,d) their sum and difference coordinates, where s = d1+ d2 and d = d2 - d1 . These coordinate axes are shown in the top left figure where the sum and difference each take on 7 values: s={2,3,4,5,6,7,8} and d={-3,-2,- 1,0,1,2,3} We consider a slightly different transformation s = d1+ d2 and |d| = |d2 - d1| and now the absolute difference |d| takes on only 4 values {0,1,2,3}; this has the effect of doubling the probability values of {1,2,3} by folding over the negative difference values onto and doubling them. If we label each point in this figure by the “|d |/ s” values we see for example that the points (d1d2) =(1,4) and (d1d2) =(4,1) at opposite corners of the grid are both now labeled with |d| / s = 3 / 5 . Labeling all points in this manner and rotating the figure clockwise 90o so D is up and S is to the right (central figure) we have found the new joint distribution pSD(s,|d|) as illustrated in the two right figures where points are now labeled by (s,|d|) values. Note that the new distribution has doubled the positive d values to 2/16 each and that certain coordinate points (s,|d|)=(3,0) are not occupied (green). The marginal distribution pS(s) defined as the sum of the joint distribution pSD(s,|d|) over all |d| values and is easily picked off the upper right figure by collapsing values down along the s-axis. Similarly, the distribution pD(|d|) defined as the sum of the joint distribution pSD(s,|d|) over all s-values. The table shows the results. 125
Common PMFs and Properties – 1

(Throughout, E[X] = sum over x of x p_X(x) and var(X) = E[X^2] - E[X]^2.)

Bernoulli – 1 trial, the "atomic" RV ("0" or "1": what is the result of a single trial?)
  PMF: p_X(x) = p for x = 1 (success), 1-p = q for x = 0 (failure)
  Mean: E[X] = 0*(1-p) + 1*p = p
  Variance: E[X^2] = 0^2*(1-p) + 1^2*p = p, so var(X) = p - p^2 = p(1-p) = pq

Binomial – n trials (how many successes x in n independent Bernoulli trials?)
  PMF: p_X(x) = C(n,x) p^x q^(n-x), x = 0, 1, ..., n
  Mean: E[X] = sum over x of x C(n,x) p^x q^(n-x) = np
  Variance: var(X) = npq

Geometric – trials to 1 success (how many trials x for 1 success? one sequence)
  PMF: p_X(x) = p q^(x-1) for x = 1, 2, ..., inf; 0 otherwise
  Mean: E[X] = sum over x of x p q^(x-1) = p d/dq [sum over x of q^x] = p d/dq [1/(1-q)] = p/(1-q)^2 = 1/p
  Variance: var(X) = q/p^2
  (As p decreases, the expected number of trials x for 1 success must increase.)

Negative Binomial – trials to r successes (how many trials x for r successes? many sequences: (r-1) successes in (x-1) trials, then a success on the next trial)
  PMF: p_X(x) = C(x-1, r-1) p^r q^(x-r), x = r, r+1, r+2, ..., inf
  Mean: E[X] = r/p
  Variance: var(X) = r q/p^2
  (The Negative Binomial reduces to the Geometric for r = 1; as p decreases, the expected number of trials x for r successes must increase.)

This table and the one to follow compare some common probability distributions and explore their fundamental properties and how they relate to one another. A brief description is given under the "RV Name" column, followed by the PMF formula and figure in column 2; formulas for the mean and variance are shown in the last two columns.

The Bernoulli RV X answers the question "what is the result of a single Bernoulli trial?" It takes on only two values, namely "1" = Success with probability p and "0" = Fail with probability q = 1-p.

The Binomial RV X answers the question "how many successes X in n Bernoulli trials?" It takes on values corresponding to the number of successes x in n independent Bernoulli trials; the sum RV X = X1 + X2 + ... + Xn of n Bernoulli RVs has C(n,x) tree paths for X = x successes, yielding the PMF C(n,x) p^x q^(n-x) as shown.

The Geometric RV X answers the question "how many Bernoulli trials X for 1 success?" It takes on values from 1 to infinity and corresponds to x-1 failed Bernoulli trials followed by one successful trial; there is only one tree path with X = x trials yielding 1 success, and so the PMF is q^(x-1) p, as shown.

The Negative Binomial RV X answers the question "how many Bernoulli trials X for r successes?" It takes on values from r to infinity and is the sum of r Geometric random variables, X = G1 + G2 + ... + Gr. There are C(x-1, r-1) tree paths with (r-1) successes in the first (x-1) trials followed by one final success, and so the PMF is C(x-1, r-1) p^(r-1) q^(x-r) * p = C(x-1, r-1) p^r q^(x-r) with x = r, r+1, ..., inf, as shown.
137
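A quick numerical check of the Geometric row (a sketch; the value p = 0.3 is an illustrative choice, and the infinite support is truncated at 200 trials, which is more than enough at this p):

    % Numerical check of the Geometric mean 1/p and variance q/p^2
    p = 0.3;  q = 1 - p;
    x = 1:200;                       % truncated support
    pmf = p * q.^(x-1);
    mu = sum(x .* pmf)               % ~ 1/p    = 3.3333
    v  = sum(x.^2 .* pmf) - mu^2     % ~ q/p^2  = 7.7778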
Bernoulli/Binomial Tree Structures

Bernoulli (1 trial): p_X(1) = p (success), p_X(0) = q (failure). Tree: from START, the upper branch F has probability q and ends in state x = 0; the lower branch S has probability p and ends in state x = 1. Algebraic structure: (q + p).

Binomial (2 trials): p_X(x) = C(2,x) p^x q^(2-x), x = 0, 1, 2. Tree: appending a second Bernoulli tree to each output node gives the four end states {FF}, {FS}, {SF}, {SS} with probabilities q^2, qp, pq, p^2 and success counts 0, 1, 1, 2. Algebraic structure: (q+p)^2 = q^2 + 2pq + p^2 = C(2,0) p^0 q^2 + C(2,1) p^1 q^1 + C(2,2) p^2 q^0.

The RVs of the last slide are grouped in pairs, {Bernoulli, Binomial} and {Geometric, Negative Binomial}, for a reason: the sum of many independent Bernoulli trials generates a Binomial distribution, and similarly the sum of many independent Geometric trials generates the Negative Binomial distribution. This slide and the next give a graphical construction of the trees for these two groups of paired distributions, by repeatedly applying the basic tree structure of the underlying Bernoulli or Geometric tree as appropriate.

In the first panel we show the PMF properties for the Bernoulli on the left; on the right we display the Bernoulli tree structure, where the upper branch q = Pr[Fail] goes to the state X = 0 and the lower branch p = Pr[Success] goes to the state X = 1.

In the second panel we show the PMF properties for a simple n = 2 trial Binomial. The corresponding tree structure is obtained by appending a second Bernoulli tree to each output node of the first trial, thus yielding the 4 output states {FF}, {FS}, {SF}, {SS}. We see that there are C(2,0) tree paths leading to {FF} with probability p^0 q^2, C(2,1) = 2 tree paths leading to {FS} and {SF} with probability p^1 q^1 each, and C(2,2) tree paths leading to {SS} with probability p^2 q^0, which is precisely as expected from the Binomial PMF for n = 2. This can be continued for n = 3, 4, ... by repeatedly appending a Bernoulli tree to each new node.

Further, we see that this structure for n = 2 is represented algebraically by (q+p)^2, inasmuch as the direct expansion gives 1 = q^2 + 2qp + p^2; expanding the expression (q+p)^n corresponding to n Bernoulli trials obviously yields the appropriate Binomial expansion for general exponent n. Thus the Binomial is represented by the repetitive tree structure, or by the repeated multiplication of the algebraic structure 1 = (q+p) by itself n times to obtain 1^n = (q+p)^n.
138
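Because (q+p)^n is just the n-fold product of the polynomial (q+p), the Binomial PMF can be built in MATLAB by repeated polynomial multiplication, i.e., convolution. A sketch (p = 0.4 and n = 4 are illustrative choices):

    % Binomial PMF as repeated polynomial multiplication (convolution)
    p = 0.4;  q = 1 - p;
    bern = [q p];               % Bernoulli PMF over x = 0, 1
    pmf = 1;  n = 4;
    for k = 1:n
        pmf = conv(pmf, bern);  % coefficients of (q+p)^k
    end
    % pmf(x+1) = C(n,x) p^x q^(n-x) for x = 0..n; sum(pmf) = 1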
Geometric/Negative Binomial Tree Structures

Geometric: p_X(x) = p q^(x-1), x = 1, 2, .... Tree: one infinite sequence; each failure node F (probability q) spawns a new Bernoulli trial, and each branch ends at the first success S (probability p). Algebraic structure: 1 = (1-q)^(-1) p = p (1 + q + q^2 + q^3 + ...).

Negative Binomial (r = 2): p_X(x) = C(x-1, 1) p^2 q^(x-2), x = 2, 3, 4, ..., inf: (2-1) successes in (x-1) trials followed by a success on the next trial. Tree: many infinite sequences, obtained by appending a Geometric tree to each success node. Algebraic structure: 1 = [(1-q)^(-1) p]^2; the binomial series gives p^2 (1-q)^(-2) = p^2 (1 + 2q + 3q^2 + 4q^3 + ...) = {C(1,1) p + C(2,1) p q + C(3,1) p q^2 + C(4,1) p q^3 + ...} p.

This slide first gives a graphical construction of the Geometric tree from an infinite number of Bernoulli trials, and then shows how the Negative Binomial tree results from appending a Geometric tree to itself in a manner similar to that of the last slide.

In the first panel we repeat the PMF properties of the Geometric RV. On the right side of this panel we display the Geometric tree structure, whose branches end in a single success. This tree has a Bernoulli trial appended to each failure node and is constructed from an infinite number of Bernoulli trials. The 1st Bernoulli trial yields X = 1 with p = Pr[Success], and this ends the lower branch; its upper branch yields a failure with q = Pr[Fail]; this failure node spawns a 2nd Bernoulli trial, which again leads to success or failure, and the process continues indefinitely. It accurately describes the probabilities for a single success in 1, 2, 3, ... trials and is algebraically represented by the expression 1 = (1-q)^(-1) p, which expands to [1 + q + q^2 + q^3 + ...] * p, corresponding to exactly 0, 1, 2, 3, ... "failures before a single success."

In the second panel we show the PMF properties for an r = 2 Negative Binomial; on the right we display the Negative Binomial tree structure, obtained by applying the basic Geometric tree to each node (an infinite number of them) corresponding to a 1st success. This leads to a doubly infinite tree structure for the r = 2 Negative Binomial, which gives the number of trials X = x required for r = 2 successes. We can verify the first few terms in the Negative Binomial expansion given under PMF in the lower panel using the tree. This process may be extended to r = 3, 4, ... successes by repeatedly applying the Geometric tree to each success node.

For r = 2, direct expansion of the algebraic identity 1^2 = [(1-q)^(-1) p]^2 yields {C(1,1) p + C(2,1) p q + C(3,1) p q^2 + C(4,1) p q^3 + ...} p, in agreement with the r = 2 Negative Binomial terms in the table. In an analogous fashion, expansion of 1^r = [(1-q)^(-1) p]^r yields results for the r-success Negative Binomial. Note that the "Negative" modifier to Binomial is a natural designation in view of the (1-q)^(-1) term in the algebraic structure.
139
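Since the r = 2 Negative Binomial is the distribution of a sum of two independent Geometric RVs, its PMF is the convolution of the Geometric PMF with itself; a sketch (infinite supports truncated at 30 trials):

    % Negative Binomial (r = 2) as the convolution of two Geometric PMFs
    p = 1/2;  q = 1/2;
    x = 1:30;
    geo = p * q.^(x-1);          % Geometric PMF on trials 1..30 (truncated)
    nb2 = conv(geo, geo);        % index k corresponds to total trials x = k+1
    nb2(3)                       % Pr[X = 4] = 3/16, matching C(3,1) p^2 q^2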
Bernoulli, Geometric, Binomial & Negative Binomial PMFs

The Bernoulli RV serves as a probability "indicator" for the outcomes of a series of experiments, representing two different event types: E1, "success in 1 trial" (X = Bernoulli RV), and E2, "N1 is the number of trials for the 1st success" (N1 = Geometric RV).

Bernoulli process (single RV, two outcomes; 1 Bernoulli trial for event E1): p_X(1) = p, p_X(0) = q = 1-p; E[X] = p, var(X) = pq.
Binomial b(k; n, p) (K = number of successes in n trials): the sum of n independent Bernoulli RVs, K = X1 + ... + Xn; p_K(k) = C(n,k) p^k q^(n-k); E(K) = np, var(K) = npq.
Geometric process (n1 Bernoulli trials for event E2): p_{N1}(n1) = p q^(n1 - 1).
Negative Binomial bn(nr; r, p) (Nr = number of trials for r successes): the sum of r independent Geometric RVs, Nr = (N1)_1 + ... + (N1)_r; p_{Nr}(nr) = C(nr - 1, r - 1) p^r q^(nr - r); E[Nr] = r E[N1] = r/p, var(Nr) = r var(N1) = r q/p^2.

The Bernoulli RV X is the basic building block for other RVs (the "atomic" RV) and has a PMF with only two outcomes, X = 1 with probability p and X = 0 with probability q = 1-p. We have seen that n such Bernoulli variables, when added, yield a Binomial PMF {b(x; n, p), x = 0, 1, 2, ..., n}, which gives the number of successes x in n trials. We have also seen that this Binomial PMF can be understood by repeatedly appending the Bernoulli tree graph to each of its nodes (repeated independent trials), thereby constructing a tree with 2^n outcomes corresponding to the n Bernoulli trials, each with two possible outcomes.

Alternately, the Geometric PMF can be constructed by repeatedly appending a Bernoulli tree graph, but this time only to the failure node, an infinite number of times, thereby constructing a tree with an infinite number of outcomes, all of which correspond to x-1 failures and exactly 1 success for x = 1, 2, ..., inf.

Just as the Bernoulli tree graph is a building block for the Binomial tree graph, the infinite Geometric tree graph is a building block for the Negative Binomial. The Negative Binomial tree graph for r = 2 successes is constructed by appending a Geometric tree graph to itself, but this time only to the success nodes, resulting in a doubly infinite tree graph corresponding to exactly x-2 failures and exactly 2 successes for x = 2, 3, ..., inf. Repeating this process r times yields the r-fold infinite tree graph corresponding to exactly x-r failures and exactly r successes for x = r, r+1, ..., inf. The mathematical transformations relating the Bernoulli, Binomial, Geometric, and Negative Binomial are shown in this slide.
140
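These sum relationships are easy to confirm by simulation; a sketch that draws Geometric samples by the standard inverse-CDF construction and checks E[Nr] = r/p and var(Nr) = r q/p^2 (p = 0.5 and r = 3 are illustrative choices):

    % Monte Carlo check: sum of r Geometric RVs (trials to r successes)
    p = 0.5;  q = 1 - p;  r = 3;  N = 1e5;
    G  = ceil(log(rand(N, r)) / log(q));   % Geometric(p) samples on {1,2,...}
    Nr = sum(G, 2);                        % trials needed for r successes
    [mean(Nr)  var(Nr)]                    % ~ [r/p, r*q/p^2] = [6, 6]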
Common PMFs and Properties – 2

Hypergeometric – x successes in n test samples drawn without replacement from a fixed population of N items, of which m are "marked" (tagged) and N-m are unmarked:
  PMF: p_X(x) = C(m,x) C(N-m, n-x) / C(N,n) for max(0, n-(N-m)) <= x <= min(m,n); 0 otherwise; m in [1,N], n in [1,N].
  Mean: E[X] = n (m/N) = n p, where p = m/N is the "initial" probability of drawing a marked item.
  Variance: var(X) = [(N-n)/(N-1)] n p q.
  The PMF derives from the Binomial identity C(N,n) = C(m,0)C(N-m,n) + C(m,1)C(N-m,n-1) + ... + C(m,x)C(N-m,n-x) + ... .

Poisson – x successes, as a limiting case of the Binomial:
  PMF: p_X(x) = a^x / (x! e^a) for x = 0, 1, 2, ..., inf; 0 otherwise.
  Mean: E[X] = a;  Variance: var(X) = a;  where a = lim(n p) as n -> inf, p -> 0; a = lambda*t = (average arrival rate)*(time).

Zeta (Zipf):
  PMF: p_X(x; s) = (1/x^s) / zeta(s), x = 1, 2, ...; s > 1, where zeta(s) is the Riemann zeta function.
  Mean: E[X; s] = zeta(s-1)/zeta(s);  Variance: Var(X; s) = zeta(s-2)/zeta(s) - [zeta(s-1)/zeta(s)]^2.
  Example for s = 3.5: E[X] = zeta(2.5)/zeta(3.5) = 1.191; Var(X) = zeta(1.5)/zeta(3.5) - E[X]^2 = .856.

This second part of the Common PMFs table shows the Hypergeometric, Poisson, and Riemann Zeta (or Zipf) PMFs.

The Hypergeometric RV X answers the question "how many successes (defectives) X are obtained with n test samples (trials without replacement) from a production run (sample space) that contains m defective and N-m working items?" X takes on values corresponding to the number of successes (defectives) in n dependent Bernoulli trials. The distribution is best understood in terms of the Binomial identity C(N,n) = C(m,0)C(N-m,n) + ... + C(m,x)C(N-m,n-x) + ... + C(m,m)C(N-m,n-m), which, when divided by C(N,n), yields the distribution C(m,x)C(N-m,n-x)/C(N,n). X takes on values x in [x_min, x_max], where x_min = max(0, n-(N-m)) and x_max = min(n,m), as allowed by the combinations without replacement.

The Poisson RV X answers the question "how many successes X in n Bernoulli trials with n very large?" We shall discuss this in more detail in the second part of the course, where we pair it with a continuous distribution. For now it is sufficient to know that it represents the limiting behavior of the Binomial PMF as n -> inf, and its terms represent single terms in the expansion of e^a, where a = lambda*t is called the Poisson parameter, lambda is a "rate," and t is the time interval for the data run. The PMF is therefore the ratio of a single term in the expansion of e^a to e^a itself, p_X(x) = (a^x/x!)/e^a for x = 0, 1, 2, 3, .... The Poisson RV has many applications in physics and engineering.

The Riemann Zeta RV X has applications to language processing and prime number theory, and its properties are given in the table. Note that the exponent must satisfy s > 1 in order to avoid the harmonic series, which does not converge and therefore cannot satisfy the sum-to-unity condition on the PMF.
141
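The Poisson limit is easy to see numerically; a sketch comparing the Binomial PMF for large n and small p against the Poisson PMF with a = n*p held fixed (a = 2 and n = 1000 are illustrative choices; gammaln avoids overflow in the binomial coefficient):

    % Poisson as the n -> inf, p -> 0 limit of the Binomial (a = n*p fixed)
    a = 2;  x = 0:10;
    poisson = a.^x ./ (factorial(x) * exp(a));
    n = 1000;  p = a/n;
    logC  = gammaln(n+1) - gammaln(x+1) - gammaln(n-x+1);   % log C(n,x)
    binom = exp(logC + x*log(p) + (n-x)*log(1-p));
    max(abs(poisson - binom))     % small: the two PMFs agree for large n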
Chapter 5 – Continuous RVs: Probability Density Function (PDF)

For an event E = {x : a <= x <= b}: Pr[x in E] = integral over E of f_X(x) dx = integral from a to b of f_X(x) dx.
The probability at a single point is zero, e.g., Pr[x = 2.0] = integral from 2.0 to 2.0 of f_X(x) dx = 0, except where a Dirac delta-function alpha*delta(x - x0) sits at the point.

Mixed continuous & discrete outcomes (Dirac delta-function):
  f_X(x) = alpha*delta(x - x0) + beta/(b-a) on [a, b], with the integral of alpha*delta(x - x0) over any interval containing x0 equal to alpha.

Sampled continuous function g(x):
  f_X(x) = sum over k = 0..n of alpha_k * delta(x - x_k), with alpha_k = integral of g(x) delta(x - x_k) dx = g(x_k).

In discrete probability an RV is characterized by its probability mass function (PMF) p_X(x), which specifies the amount of probability associated with each point in the discrete sample space. Continuous probability generalizes this concept to a probability density function (PDF) f_X(x) defined over a continuous sample space. Just as the sum of p_X(x) over the whole sample space must be unity, the integral of f_X(x) over the whole sample space must also be unity. An event E is defined by a sum or integral over a portion of the sample space, as shown by the shaded area in the upper figure between x = a and x = b.

The middle panel gives an example of a mixed distribution containing a continuous uniform distribution beta/(b-a) and a Dirac delta-function alpha*delta(x - x0) corresponding to a discrete contribution at the point x0. The uniform distribution is shown as a continuous horizontal line at "height" beta/(b-a) between a and b, and the Dirac delta-function is shown with an arrow corresponding to a probability mass alpha accumulated at the single point x = x0. The integral over the continuous part gives (b-a) * beta/(b-a) = beta, and the integral of the Dirac delta-function alpha*delta(x - x0) over any interval containing x0 yields alpha. Thus, in order for this expression to be a valid probability density function, we require the sum of the two contributions to be unity: alpha + beta = 1.

Consider the continuous curve f_X(x) = g(x) in the bottom panel and take the sum of products alpha_k*delta(x - x_k). Is this a valid discrete "PMF"? Only if the sum of the contributions alpha_k is unity. Does it represent a digital sampling of g(x)? No; in order to actually write down an appropriate "sampled" version of g(x), we need to develop a "sampling" transformation Y_k = Y_k(X) for k = 0, 1, 2, ..., n so as to transform the original continuous f_X(x) to a discrete f_Y(y_k) (see slide #26).
3
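Computing an event probability from a density is a one-liner in MATLAB; a sketch with a unit exponential density as an illustrative choice:

    % Probability of an event as the integral of a density
    f  = @(x) exp(-x) .* (x >= 0);    % unit exponential PDF
    Pr = integral(f, 0.5, 2.0)        % Pr[0.5 <= X <= 2.0] = e^-0.5 - e^-2 ~ 0.4712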
Cumulative Distribution Function (CDF)

F_X(x) = Pr[X <= x] = integral from -inf to x of f_X(x') dx' (the PDF integrates to yield the CDF).
Boundary values: F_X(-inf) = 0; F_X(+inf) = 1.
Monotone non-decreasing: F_X(b) >= F_X(a) if b >= a.
Probability interpretation: Pr[a <= x <= b] = F_X(b) - F_X(a).
Density from the CDF: d/dx F_X(x) = f_X(x), or dF_X(x) = F_X(x+dx) - F_X(x) = f_X(x) dx.

[Figure: two cases. (i) A PDF with two unit-height "boxes" on [0, 1/2] and [1, 3/2] and its ramp-flat-ramp CDF. (ii) A PDF with constant density 1/2 on [0, 3/2] plus a delta-function (1/4)*delta(x-1), and the corresponding CDF with a jump of 1/4 at x = 1.]

The cumulative distribution function (CDF) for a continuous probability density function f_X(x) is defined in a manner similar to that for discrete distributions p_X(x), except that the cumulative sum over a discrete set is replaced by an integral over all X less than or equal to a value x. This integral yields a function of x, F_X(x) = Pr[X <= x], which has the following important properties:
(i) F_X(x) always starts at 0 and ends at 1,
(ii) F_X(x) is continuous,
(iii) F_X(x) is non-decreasing,
(iv) F_X(x) is invertible, i.e., F_X^(-1)(x) exists, and
(v) the density is f_X(x) = d/dx F_X(x) (since the exact differential dF_X(x) = F_X(x+dx) - F_X(x) = f_X(x) dx).
It is important to note all five properties of F_X(x), as they have important consequences.

The figure shows the relationship between the density f_X(x) and the cumulative distribution F_X(x) for two cases: (i) two regions of constant density (two "boxes"), and (ii) one region of constant density plus a delta function (one "box" and an arrow "spike").

In case (i), F_X(x) ramps from a value of 0 to 1/2 over the region [0, 1/2] from the 1st constant-density box, then remains constant at 1/2 over the region [1/2, 1], and finally ramps from 1/2 to 1 over the 2nd constant-density box. Note that the slopes of the two ramps are both 1 in this case, and that the total area under the density curve is 1*[1/2 - 0] + 1*[3/2 - 1] = 1.

In case (ii), F_X(x) ramps from a value of 0 to 1/2 over the region [0, 1] by virtue of the constant "1/2" density box, then jumps by 1/4 because of the delta function, and finally continues its ramp from the value 3/4 to 1. Note that this is simply the superposition of a constant density of 1/2 plus a delta function (1/4)*delta(x-1), and again the total area under the density curves is (1/2)*[3/2 - 0] + 1/4 = 1.
7
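Property (v) also runs the other way: numerically integrating a density recovers its CDF. A sketch for case (i), the two unit-height boxes:

    % CDF of the two-box density by cumulative numerical integration
    x = linspace(0, 1.5, 1501);
    f = double((x <= 0.5) | (x >= 1));   % unit-height boxes on [0,1/2] and [1,3/2]
    F = cumtrapz(x, f);                  % ramps to 1/2, flat on [1/2,1], ramps to 1
    F(end)                               % total probability ~ 1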
Transformations of Continuous RVs

• Transformation of density PDFs in 1 dimension
• Transformation of joint density PDFs in 2 or more dimensions
• Two methods:

1) CDF Method:
   Step 1: Find the CDF F_X(x) by integrating f_X(x).
   Step 2: Invert the transformation y = g(x) => x = g^(-1)(y) and use it to write F_Y(y) = Pr[Y <= y] in terms of the known F_X(x). (Note: y = g(x) may not be one-to-one: "multiplicity.")
   Step 3: Differentiate with respect to y: f_Y(y) = d/dy F_Y(y).

2) Jacobian Method: transform the PDF using derivatives. From f_Y(y) dy = f_X(x) dx with y = g(x):
   f_Y(y) = f_X(x) / |dy/dx| = f_X(g^(-1)(y)) / |g'(g^(-1)(y))|   (note the absolute value; express everything in terms of y).

It is very important to understand how probability densities change under a transformation of coordinates y = g(x). We have seen several examples of such coordinate transformations for discrete variables, namely, (i) dice: transform from individual dice coordinates (d1, d2) to the sum and difference coordinates (s, d), corresponding to a 90-degree rotation of coordinates, and (ii) dice: transform from individual dice coordinates (d1, d2) to the minimum and maximum coordinates (z, w), corresponding to corner-shaped surfaces of constant minimum or maximum values.

There are two methods for transforming the densities of RVs, namely (i) the CDF method and (ii) the Jacobian method. While both are quite useful for 1-dimensional PDFs f_X(x), the Jacobian method is best for transforming joint RVs.

The CDF method involves three distinct steps as indicated on the slide, namely (i) compute the CDF F_X(x), (ii) relate F_Y(y) = Pr[Y <= y] to F_X(x), then invert the transformation, x = g^(-1)(y), and substitute to find F_Y(y) with a redefined y domain, and (iii) differentiate with respect to y to obtain the transformed probability density f_Y(y) for the RV Y. Note that if the function is multi-valued and therefore not invertible, it must be broken up into intervals on which it is invertible, and appropriate "fold-over" multiplicities must be accounted for.

The Jacobian method uses derivatives of the transformation to transfer densities from the original set of RVs to the new one; the Jacobian accounts for linear, areal, and volume changes between the coordinates. In one dimension the Jacobian is simply a derivative and is obtained by transferring the probability in the interval x to x+dx, f_X(x)dx, to the probability in the interval y to y+dy, f_Y(y)dy. Equating the two expressions yields f_Y(y) = f_X(x)/|dy/dx| = f_X(g^(-1)(y))/|dy/dx|. Note that the absolute value is necessary since f_Y(y) must always be greater than or equal to zero.
14
Method #1 – Transformation of a Continuous RV: CDF Method

Resistance X = R. Step 1: Compute F_R(r).
  f_R(r) = 1/200 for 900 <= r <= 1100; 0 otherwise.
  F_R(r) = Pr[R <= r] = integral from -inf to r of f_R(r') dr' = 0 for r < 900; (r - 900)/200 for 900 <= r <= 1100; 1 for r > 1100.

Conductance Y = 1/R. Step 2: Transform to F_Y(y).
  F_Y(y) = Pr[Y <= y] = Pr[R >= 1/y] = 1 - Pr[R <= 1/y] = 1 - F_R(1/y)
         = 1 - 0 = 1 for 1/y < 900;  1 - (1/y - 900)/200 for 900 <= 1/y <= 1100;  1 - 1 = 0 for 1/y > 1100.

Step 3: Differentiate F_Y(y).
  f_Y(y) = d/dy F_Y(y) = 0 for y < 1/1100;  1/(200 y^2) for 1/1100 <= y <= 1/900;  0 for y > 1/900.

The resistance X = R of a circuit has a uniform probability density function f_R(r) = 1/200 between 900 and 1100 ohms, as shown in the top panel; the corresponding CDF F_R(r) is the ramp function starting at 0 for R <= 900 and reaching 1 at R = 1100 and beyond, as shown. The detailed analytic function is given on the slide and represents the result of Step 1 of the CDF method.

The problem is to find the PDF for the conductance Y = 1/X = 1/R. We first write down the definition of F_Y(y) for a given value Y = y and then re-express it as a function of R = 1/Y:
  F_Y(y) = Pr[Y <= y] = Pr[R >= 1/y] = 1 - Pr[R <= 1/y] = 1 - F_R(1/y).
This last expression is evaluated in the lower panel of the slide by substituting r = 1/y into the expression for F_R(r) from the upper panel. Note that the resulting expression has been written down by direct substitution, and the intervals have been left in terms of 1/y. (This constitutes Step 2 of the method.) Finally, differentiating F_Y(y) with respect to y, we find (Step 3) the desired PDF f_Y(y); we have also "flipped" the 1/y interval specifications and reordered the resulting y intervals in the customary increasing order.

As seen in this example, the CDF method requires careful attention to the definition of F_Y(y) in terms of the cumulative probability of the variable Y. Since Y = 1/R, this leads to F_Y(y) = 1 - F_R(1/y) and a reverse ordering of the inequalities for the intervals.
15
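A quick Monte Carlo check of the result (a sketch; the 'pdf' histogram normalization assumes a reasonably recent MATLAB release):

    % Monte Carlo check of f_Y(y) = 1/(200 y^2) for Y = 1/R, R ~ U[900, 1100]
    N = 1e6;
    R = 900 + 200*rand(N, 1);
    Y = 1 ./ R;
    histogram(Y, 50, 'Normalization', 'pdf');  hold on
    y = linspace(1/1100, 1/900, 200);
    plot(y, 1 ./ (200*y.^2))                   % analytic density from the CDF method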
Method #2 – Transformation of a Continuous RV: Derivative (Jacobian) Method

f_R(r) = 1/200 for 900 <= r <= 1100; 0 otherwise. From f_Y(y) dy = f_R(r) dr:
  f_Y(y) = f_R(r)/|dy/dr|, with y = 1/r and |dy/dr| = |-1/r^2| = y^2,
so f_Y(y) = (1/200)/y^2 = 1/(200 y^2) for 1/1100 <= y <= 1/900.

[Figure: a 3-dimensional plot showing the uniform density f_X(x) = 1/200 in the x-z plane, the hyperbola xy = 1 with its slope dy/dx in the x-y plane, and the transformed density f_Y(y) (with end-point values 4050 and 6050) in the z-y plane. Note: f_Y(y) is large where the slope is small, and vice versa; the same differential area (probability) is mapped via the hyperbola to yield the tall thin and short fat strip areas shown for f_Y(y).]

The Jacobian method is much more straightforward, and moreover has a very intuitive visualization in the 3-dimensional plot shown on this slide. The uniform probability density function f_R(r) = 1/200 between 900 and 1100 ohms is written explicitly in the first boxed equation. The Jacobian method just takes the constant f_R(r) = 1/200 and divides it by the magnitude of the derivative |dy/dr| = |-1/r^2| = y^2 to yield directly f_Y(y) = 1/(200 y^2) for y in [1/1100, 1/900].

The 3-dimensional plot shows exactly what is going on:
i) The original uniform distribution f_X(x) = 1/200 is displayed as a vertical rectangle in the x-z plane.
ii) Sample strips at either end with width dx have the same small probability dP = f_X(x) dx, as shown. At R = 900, the density f_X(x) is divided by the large slope |dy/dx|, yielding a smaller magnitude for f_Y(y) as illustrated, but this is compensated by a proportionately larger dy, and thus transfers the same small probability dP = f_Y(y) dy.
iii) Conversely, the strip at R = 1100 is divided by a small slope |dy/dx| and yields a larger magnitude for f_Y(y), which is compensated by a proportionately smaller dy, again transferring the same dP.
iv) The end-point values of the transformed density f_Y(y) are illustrated in the figure. The strip width dx cuts the x-y transformation curve at two red points whose dy width is small at x = 1100 and large at x = 900, as determined by the slope of the curve. The shape in between these end points is a result of the smoothly varying slope of the transformation hyperbola shown in the x-y plane.

Thus the slope of the transformation curve (the hyperbola xy = constant in this case) in the x-y plane determines how each dx strip of the uniform distribution f_X(x) = 1/200 in the x-z plane transfers to the new density f_Y(y) shown in the z-y plane. This 3-dimensional representation de-mystifies the nature of the transformation of probability densities and makes it quite natural and intuitive for 1-dimensional density functions. It is easily extended to two-dimensional joint distributions.
16
Transformation of a Continuous RV – Example 3: "Multiplicity Factor"

Gaussian PDF: f_X(x) = (1/sqrt(2 pi)) e^(-x^2/2), -inf < x < +inf. Find the PDF for Y = X^2.
The mapping (-inf, inf) -> (0, inf) is not one-to-one: both -x and +x map to the same y, giving two equal contributions ("double density points"), so the fold-over doubles the density:
  f_Y(y) = 2 f_X(x)/|dy/dx| = 2 (1/sqrt(2 pi)) e^(-y/2) / (2 sqrt(y)) = (1/sqrt(2 pi y)) e^(-y/2) for 0 < y < +inf.
General rule: f_Y(y) = alpha * f_X(x)/|dy/dx|, where alpha is the multiplicity ("fold-over") factor.

The transformation of a Gaussian PDF under the transformation Y = X^2 is easily computed using the Jacobian method, provided one incorporates a multiplicity factor alpha, as shown in the boxed density equation. The multiplicity factor arises because there are two contributions to the same y-value, one from -x and the other from +x, as illustrated in the upper figure; folding the parabola across the x = 0 symmetry line yields twice the density on positive x, and this corresponds to a multiplicity factor alpha = 2 in the boxed density transformation equation.

The 3-d plot shows the original Gaussian density function (grey) in the x-z plane, the transformation y = x^2 in the x-y plane, and the resulting distribution shown as a dashed curve in the y-z plane. The two thin vertical slices at -x and +x are mapped to the same y-value and hence double the density contribution to f_Y(y), as shown.
18
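The result (which is the chi-square density with one degree of freedom) is easy to confirm by simulation; a sketch:

    % Monte Carlo check of the fold-over rule for Y = X^2, X ~ N(0,1)
    N = 1e6;
    Y = randn(N, 1).^2;
    histogram(Y, 100, 'Normalization', 'pdf');  hold on
    y = linspace(0.05, 6, 200);
    plot(y, exp(-y/2) ./ sqrt(2*pi*y))          % f_Y(y) with multiplicity alpha = 2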
Analog-to-Digital (A/D) Converter – Series of Step Functions: Continuous Representation of Discrete "Sampled" Distributions

A/D converter mapping function: y = g(x) = k+1 for k < x <= k+1 (a staircase of horizontal steps over the range of x).
Mapped density: f_Y(y) = sum over k of alpha_k * delta(y - y_k).

a) Exponential: f_X(x) = a e^(-ax) for x >= 0, 0 for x < 0;
   alpha_k = integral from k-1 to k of a e^(-ax) dx = e^(-ak)(e^a - 1), k = 1, 2, ...;
   f_Y(y) = sum over k = 1..inf of e^(-ak)(e^a - 1) delta(y - k).
   For a = 0.1: alpha_k = e^(-0.1k)(e^0.1 - 1) = .105 e^(-0.1k); e.g., alpha_1 = 0.095, alpha_2 = 0.086, alpha_3 = 0.078, ..., alpha_11 = 0.035.

b) Gaussian: f_X(x) = (1/sqrt(2 pi)) e^(-x^2/2), -inf < x < inf;
   alpha_k = integral from k-1 to k of (1/sqrt(2 pi)) e^(-x^2/2) dx = phi(k) - phi(k-1), where phi(k) = integral from -inf to k of (1/sqrt(2 pi)) e^(-x^2/2) dx.

c) Uniform: f_X(x) = 1/10 for 0 <= x <= 10, 0 otherwise;
   alpha_k = integral from k-1 to k of (1/10) dx = 1/10, k = 1, 2, ..., 10; f_Y(y) = sum over k = 1..10 of (1/10) delta(y - k).

In discussing the half-wave rectifier on the last slide, we found that the effect of a "zero-slope" transformation function was to pile up all the probability in the x-interval into a single delta-function at the constant y = 0 value associated with that part of the transformation. Here we extend that concept to a "sample & hold" type of mapping function typical of an analog-to-digital (A/D) converter. The specific mapping function y = g(x) = k+1 for k < x <= k+1 is illustrated in the grey box as a series of horizontal steps over the entire range of x in [-3, 3]; the y-values for these steps range from y = -2 to y = +3. Each horizontal (zero-slope) line accumulates the integral of f_X(x) over its interval onto its associated y-value, shown as a red circle with a delta-function arrow pointing up out of the page, having an amplitude given by the integral for that interval, denoted by the symbol alpha_k.

The table shows examples of a digitally sampled representation for a) exponential, b) Gaussian, and c) uniform distributions in the three columns. The rows of the table give the specific continuous densities for each, the computations for the amplitudes of the discrete digital samples alpha_k, the resulting sum of delta-functions, and finally a plot showing arrows of different lengths to represent the delta-functions of the sampled distributions.
26
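For the exponential column, the sampled weights and their normalization can be checked directly; a sketch:

    % Sampled ("A/D") weights for the exponential density, a = 0.1
    a = 0.1;  k = 1:11;
    alpha = exp(-a*k) * (exp(a) - 1)      % 0.095, 0.086, 0.078, ..., 0.035
    sum(exp(-a*(1:1e4)) * (exp(a) - 1))   % the (truncated) sum of all weights is ~ 1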
Order Statistics – General Case: n Random Variables

Let X1, X2, ..., Xn be independent and identically distributed (IID) RVs, so f_{X1 X2 ... Xn}(x1, x2, ..., xn) = f_X(x1) f_X(x2) ... f_X(xn). Reorder {X1, X2, ..., Xn} as Y1 = smallest, Y2 = next smallest, ..., Yj = jth smallest (the "jth order statistic"), ..., Yn = largest, so that Y1 < Y2 < ... < Yn.

Fix a value y. For the jth order statistic to fall in [y, y+dy], (j-1) of the RVs must lie below y, each with probability F_X(y); (n-j) must lie above, each with probability 1 - F_X(y); and one must fall in [y, y+dy], with differential probability f_X(y) dy. For "one sequence":
  Pr[y <= Yj <= y+dy] = f_{Yj}(y) dy proportional to (F_X(y))^(j-1) * f_X(y) dy * (1 - F_X(y))^(n-j), j = 1, 2, ..., n.

Multiplicity for the case n = 3 ({Min, Mdl, Max}): there are 3! = 6 orderings, partitioned into 3 groups, and permutations within a group are irrelevant:
  j = 1 (Min): 3!/(0! 1! 2!) = 3;  j = 2 (Mdl): 3!/(1! 1! 1!) = 6;  j = 3 (Max): 3!/(2! 1! 0!) = 3.

Order statistics for the general case of n IID random variables are detailed on this slide. The n IID RVs {X1, X2, ..., Xn} are re-ordered from the smallest, Y1, to the largest, Yn, and the jth Y in the sequence, Yj, is called the "jth order statistic." Again we fix a value Y = y and consider the continuous range of re-ordered Y-values illustrated in the figure: the small interval from y to y+dy contains the differential probability f_X(y) dy for the jth order statistic Yj; all Y-values less than this belong to Y1 through Yj-1, and those greater belong to Yj+1 through Yn, as shown in the inset figure.

Now for each of the Ys on the left we have the probability Pr[Y1 <= y] = F_X(y), Pr[Y2 <= y] = F_X(y), ..., Pr[Yj-1 <= y] = F_X(y), and because they are IID, the total probability for those on the left is [F_X(y)]^(j-1); similarly, on the right we find [1 - F_X(y)]^(n-j). So for the reordered Ys the differential probability is just the product of these three terms multiplied by a multiplicity factor alpha, viz.,
  dP = Pr[y <= Yj <= y+dy] = f_{Yj}(y) dy = alpha [F_X(y)]^(j-1) f_X(y) [1 - F_X(y)]^(n-j) dy.

The multiplicity factor alpha counts the number of re-orderings of {X1, X2, ..., Xn} consistent with each order statistic Yj; arguments for n = 3 and n = 4 are illustrated on this slide and the next. These arguments look (in turn) at each order statistic, min, middle(s), and max, and compute in each case the number of distinct arrangements of {X1, X2, ..., Xn} that yield the three groups relative to the "separation point" Y = y, arriving at multinomial forms dependent upon the orderings for each statistic. The specific multiplicity factors for the cases n = 3, 4 are easily found to be
  alpha = 3!/[(j-1)! 1! (3-j)!]  and  alpha = 4!/[(j-1)! 1! (4-j)!],
and the final results for the PDF of the jth order statistic in these cases are
  f_{Yj}(y) = 3!/[(j-1)!(3-j)!] [F_X(y)]^(j-1) f_X(y) [1 - F_X(y)]^(3-j) for j = 1, 2, 3 (n = 3);
  f_{Yj}(y) = 4!/[(j-1)!(4-j)!] [F_X(y)]^(j-1) f_X(y) [1 - F_X(y)]^(4-j) for j = 1, 2, 3, 4 (n = 4).
48
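A simulation makes the formula concrete: for n = 3 IID U(0,1) RVs, the middle order statistic (j = 2) should have density 6 F(y) f(y) (1 - F(y)) = 6 y (1-y). A sketch:

    % Monte Carlo check: median of 3 IID U(0,1) RVs has PDF 6*y*(1-y)
    N = 1e5;
    X  = rand(N, 3);
    Y2 = median(X, 2);                          % the 2nd order statistic of each row
    histogram(Y2, 50, 'Normalization', 'pdf');  hold on
    y = linspace(0, 1, 200);
    plot(y, 6*y.*(1-y))                         % alpha = 3!/(1!1!1!) = 6, F = y, f = 1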
Random Processes – Introduction – Lec #4

• Time series data = physical measurements in time
• Random process = sequence of random-variable realizations
  – Geiger counter: sequence of "detections" – Poisson process
  – Communication binary bit stream "01001..." – Bernoulli process
  – E&M propagation phase (I-Q components) – Gaussian process
• Arrival event: success = "arrival" (of an event in time)
• Interarrival times for random processes
  – We are not only interested in how many successes K ("arrivals") there are,
  – but also in the "specific time of arrivals," e.g., T_K = time of the kth arrival
  – DSP chip interrupts: the time between interrupts is used for data processing
  – Waiting on the telephone: "you are the 10th customer in line and your wait will be approximately 7 minutes"

  Process             Random Number of Arrivals   Interarrival Times
  Geiger Counter      Poisson                     Exponential
  Binary Bit Stream   Bernoulli                   Geometric

Observations of physical processes produce measurements over time which almost always have components described by a random process. Some examples are Geiger counter detections (Poisson process), binary bit streams (Bernoulli process), and electromagnetic-wave I, Q phase components (Gaussian process). Because these processes take place over time, the notion of a "success" is translated to an "arrival" at a specific time. Moreover, we are not only interested in how many successes K there are, but also in their specific arrival times, i.e., we would like to know the time of the kth arrival, Tk. This has application to many physical processes, such as the timing of DSP chip interrupts relative to their "clock cycles" and the queuing of customers in a telephone answering system. In both cases you want to make sure the system can handle the "load" in an appropriate manner: for the DSP chip you need to minimize the number of times you are near the leading or trailing "edge" of the timing pulse in order to avoid errors, while for the telephone answering service, the 10th customer would like to know how long he must wait in the queue before being served.
61
Multi-User Digital Communication "CDMA" – Arrival Slots

Two signals s1, s2; decode s1 or s2 in a given time slot.
• a priori probabilities: P[s1] = 3/4; P[s2] = 1/4.
• decoding statistics: decoded "1": P[1|s1] = 2/3, P[1|s2] = 2/3; not decoded "0": P[0|s1] = 1/3, P[0|s2] = 1/3.

Tree end-state probabilities: P[s1,1] = P[1|s1] P[s1] = (2/3)(3/4) = 1/2 ("success" for s1, p1 = 1/2); P[s1,0] = 1/4; P[s2,1] = P[1|s2] P[s2] = (2/3)(1/4) = 1/6; P[s2,0] = 1/12. For decoding s1: p1 = q1 = 1/2.

With Nr = the number of time slots ("trials") for r decodes of s1: p_{Nr}(n) = C(n-1, r-1) p^r q^(n-r).

1) Pr[1st decode in 4th slot]: Pr[N1 = 4] = p_{N1}(4) = q^3 p = (1/2)^3 (1/2) = 1/16.
2) Pr[4th decode in 10th slot | 3 decodes in 1st 6 time slots]: no memory, so restart at slots 7-10; Pr = p_{N1}(4) = q^3 p = 1/16.
3) Pr[2nd decode in 4th slot]: Pr[N2 = 4] = p_{N2}(4) = C(3,1) p^2 q^2 = 3 (1/2)^4 = 3/16.
4) Pr[2nd decode in 4th slot | no decodes in 1st 2 time slots]: no memory of the failures in slots 1 & 2, so restart at slots 3 & 4; Pr[N2 = 2] = p^2 = 1/4. Equivalently, by the "renewal" argument ("no decodes in the 1st 2 slots" means N2 > 2):
   Pr[N2 = 4 | N2 > 2] = p_{N2}(4)/(1 - p_{N2}(2)) = (3/16)/(1 - 1/4) = 1/4.

This example illustrates renewal properties and time-slot arrivals of the Geometric and Negative Binomial RV distributions. In a multi-user environment the digital signals from multiple transmitters can occupy the same signal-processing time slot so long as they can be distinguished by their modulation characteristics. Code Division Multiple Access (CDMA) uses a pseudorandom code that is unique to each user to "decode" the proper signal source.

Consider two signals s1 and s2 being processed in the same time slot with a priori "system usage" given by P[s1] = 3/4 and P[s2] = 1/4; further, let "1" denote a successful and "0" an unsuccessful decode. Given that each signal has the same 2/3 probability of a successful decode, P[1|s1] = P[1|s2] = 2/3, we can use the tree to find the single-trial probability of success for decoding each signal. For signal s1, the end state {s1,1} represents a successful decode and has p1 = 1/2; all other states {s1,0}, {s2,1}, {s2,0} represent failure to decode signal s1, with probability q1 = 1/4 + 1/6 + 1/12 = 1/2. Similarly, for signal s2, the end state {s2,1} represents a successful decode of s2 and has p2 = 1/6; all other states {s2,0}, {s1,1}, {s1,0} represent failure to decode signal s2, with probability q2 = 1/12 + 1/2 + 1/4 = 10/12 = 5/6.

We consider successive decodes of s1 as independent trials with probability of success p1 = 1/2. Thus, the probability of having r successful decodings of s1 in Nr signal-processing slots ("trials") is given by the Negative Binomial PMF p_{Nr}(n) = C(n-1, r-1) p1^r q1^(n-r) with n = r, r+1, r+2, ....

With p1 = q1 = 1/2:
1) The probability of the 1st decode (r = 1) in the 4th slot (N1 = 4) is p_{N1}(4) = C(3,0) p q^3 = (1/2)^4 = 1/16.
2) The probability of the 4th decode (r = 4) in the 10th slot (N4 = 10), given 3 previous decodes in the 1st 6 slots, is found by "restarting" the process at slots 7, 8, 9, 10, so we need only one decode (r = 1) in 4 slots, i.e., N1 = 4, which is identical to part 1): Pr[N4 = 10 | N3 = 6] = p_{N1}(4) = 1/16.
3) The probability of the 2nd decode (r = 2) in the 4th slot (N2 = 4) is p_{N2}(4) = C(3,1) p^2 q^2 = 3 (1/2)^4 = 3/16.
4) The probability of the 2nd decode (r = 2) in the 4th slot, given the 1st two slots were not decoded, is found by "restarting" the process at slots 3 and 4, so we need r = 2 in the two remaining slots, N2 = 2, which means two successes in two trials: p_{N2}(2) = C(1,1) p^2 q^0 = (1/2)^2 = 1/4.
78
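The four answers are one-liners in MATLAB; a sketch:

    % CDMA example: Negative Binomial slot-arrival probabilities, p = q = 1/2
    p = 1/2;  q = 1/2;
    pN = @(n, r) nchoosek(n-1, r-1) * p^r * q^(n-r);
    pN(4, 1)                      % 1st decode in slot 4: 1/16
    pN(4, 2)                      % 2nd decode in slot 4: 3/16
    pN(4, 2) / (1 - pN(2, 2))     % renewal: Pr[N2 = 4 | N2 > 2] = 1/4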
Binary Communication with Noise

A Gaussian RV stays Gaussian under a linear transformation: X ~ N(mu_X, sigma_X^2) and Y = eX + f imply Y ~ N(e mu_X + f, e^2 sigma_X^2), i.e., mu_Y = e mu_X + f and sigma_Y^2 = e^2 sigma_X^2.

System: binary generator ("1"/"0") -> modulator (+a / -a) -> channel adds noise X ~ N(0,1) -> threshold detector.
  "1": Y1 = a + X ~ N(+a, 1);  "0": Y0 = -a + X ~ N(-a, 1).
Threshold detector at y = c: Y > c detects "+a" or "1"; Y <= c detects "-a" or "0".

Two error types for detecting a "1":
  Type I, "missed detection": P(Y <= c | +a) does not exceed the threshold but belongs to the "+a" distribution.
  Type II, "false positive": P(Y > c | -a) exceeds the threshold but belongs to the "-a" distribution.
Probability of an error for detecting a "1": P(Err "1") = P(Y <= c | +a) P(+a) + P(Y > c | -a) P(-a).

Consider the binary communication channel depicted in the upper sketch: a binary sequence of "1"s and "0"s is generated and then amplitude-modulated by a positive amplitude +a for "1" and -a for "0," as illustrated by the "square-wave pulse train" at the modulator. Zero-mean, unit-variance Gaussian noise N(0,1) is added by the "channel," and the (signal + noise) outputs are two distinct Gaussian RVs, Y1 = a + X ~ N(+a, 1) and Y0 = -a + X ~ N(-a, 1), about two different means, as shown in the probability density plot. This output is presented to a threshold detector, which attempts to detect the original sequence of "1"s and "0"s by setting a threshold Y = c (vertical dashed line) and assigning a "1" to Y-values to the right and a "0" to Y-values to the left of the threshold.

Considering the detection of "1," we see that two types of error can occur, as follows:
Type I, missed detection, P(Y <= c | +a): the larger hatched area on the left with Y <= c, which belongs to the N(+a,1) curve but is rejected because it does not exceed the threshold c.
Type II, false positive, P(Y > c | -a): the smaller hatched area on the right with Y > c, which belongs to the "0" N(-a,1) curve but is falsely detected as "1" because it exceeds the threshold c.

The total probability for an error in detecting a "1" is the sum of each conditional multiplied by its a priori probability, as shown in the bottom equation. The total probability for an error in detecting a "0" is written down in an analogous fashion as a sum of conditionals multiplied by their a priori probabilities (not shown).
97
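The error probability is a two-line computation with the standard normal CDF, which base MATLAB expresses through erfc. A sketch with assumed illustrative values a = 1, c = 0, and equal priors:

    % Error probability for detecting a "1" (assumed: a = 1, c = 0, equal priors)
    a = 1;  c = 0;  P1 = 0.5;  P0 = 0.5;
    Phi  = @(z) 0.5*erfc(-z/sqrt(2));            % standard normal CDF
    Perr = Phi(c - a)*P1 + (1 - Phi(c + a))*P0   % both terms are Phi(-1): Perr ~ 0.1587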
Common PDFs ("Continuous") and Properties

(Columns: PDF; generating function phi(s) = E[e^(Xs)]; mean E[X] = integral of x f_X(x) dx; variance var(X) = E[X^2] - E[X]^2.)

Uniform:
  PDF: f_X(x) = 1/(b-a) for a <= x <= b; 0 otherwise.
  Generating function: (e^(sb) - e^(sa))/(s(b-a));  Mean: (a+b)/2;  Variance: (b-a)^2/12.

Exponential ("exponential wait"):
  PDF: f_T(t) = lambda e^(-lambda t) for t >= 0; 0 for t < 0; lambda > 0.
  Generating function: lambda/(lambda - s);  Mean: 1/lambda;  Variance: 1/lambda^2.

Gamma / r-Erlang (r = integer; arrival rate lambda > 0):
  PDF: f_{Tr}(t) = lambda e^(-lambda t) (lambda t)^(r-1)/(r-1)! for t >= 0; 0 for t < 0; peaks at t_max = (r-1)/lambda.
  Generating function: [lambda/(lambda - s)]^r;  Mean: E[Tr] = r/lambda (e.g., E[T1] = 1/lambda, E[T2] = 2/lambda, E[T3] = 1/lambda + 1/lambda + 1/lambda: three "exponential waits");  Variance: r/lambda^2.

Normal (Gaussian) N(mu, sigma):
  PDF: f_X(x) = (1/(sqrt(2 pi) sigma)) e^(-(x-mu)^2/(2 sigma^2)), -inf < x < inf.
  Generating function: e^(mu s + (sigma s)^2/2);  Mean: mu;  Variance: sigma^2.

Rayleigh:
  PDF: f_X(x) = a^2 x e^(-a^2 x^2/2) for x > 0, a > 0; peaks at x = 1/a.
  Generating function: 1 + sqrt(pi/2) (s/a) e^((s/a)^2/2) [1 + erf(s/(a sqrt(2)))];  Mean: (1/a) sqrt(pi/2);  Variance: (2 - pi/2)/a^2.

This table compares some common continuous probability distributions and explores their fundamental properties and how they relate to one another. A brief description is given under the "RV Name" column, followed by the PDF formula and figure in column 2, the generating function in column 3, and formulas for the mean and variance in the last two columns.

The Uniform distribution has constant magnitude 1/(b-a) over the interval [a,b]; the mean is at the center of the distribution, (a+b)/2, and the variance is (b-a)^2/12.

The Exponential distribution decays exponentially with time from an initial probability density lambda at t = 0. The mean time for an arrival is E[T] = 1/lambda, which equals the e-folding time of the exponential; its variance is 1/lambda^2. The cumulative exponential distribution gives the probability that the first arrival T1 occurs outside a fixed time interval [0,t]; it equals the probability that the discrete number of Poisson arrivals K(t) = 0 within the interval [0,t], that is, Pr(T1 > t) = Pr(K(t) = 0).

The r-Erlang / Gamma distributions for r > 1 all rise from zero to reach a maximum at (r-1)/lambda and then decay almost exponentially, ~t^(r-1) e^(-lambda t), to zero. The mean arrival time is one exponential mean wait time 1/lambda for r = 1, two 1/lambda waits for r = 2, and r waits of 1/lambda for any r. The variance is r times the exponential variance 1/lambda^2. The cumulative r-Erlang distribution gives the probability that the rth arrival time Tr occurs outside a fixed time interval [0,t]; this equals the probability that the discrete number of Poisson arrivals K(t) <= r-1, i.e., Pr(Tr > t) = Pr(K(t) <= r-1). The Gamma density is a generalization of the rth Erlang density obtained by replacing (r-1)! with Gamma(r), making it valid for non-integer values of r.

The Gaussian (Normal) distribution is the most universal distribution in the sense that the Central Limit Theorem requires sums of many IID RVs to approach the Gaussian distribution.

The Rayleigh distribution results from the product of two independent Gaussians when expressed in polar coordinates and integrated over the angular coordinate. The probability density is zero at x = 0 and peaks at x = 1/a before dropping toward zero with a "Gaussian-like" shape for x > 0. It is compared with the Gaussian, which is symmetric about x = 0.
101
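The tabulated moments are easy to spot-check numerically; a sketch for the r-Erlang row (r = 3 and lambda = 2 are illustrative choices):

    % Numerical check of the r-Erlang mean r/lambda and variance r/lambda^2
    r = 3;  lam = 2;
    f  = @(t) lam*exp(-lam*t) .* (lam*t).^(r-1) / factorial(r-1);
    m1 = integral(@(t) t.*f(t), 0, Inf)          % ~ r/lam   = 1.5
    m2 = integral(@(t) t.^2.*f(t), 0, Inf);
    m2 - m1^2                                    % ~ r/lam^2 = 0.75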
Consequences of the Central Limit Theorem

Discrete uniform PMF: p_X(x) = (1/11) sum over i of delta(x - x_i), x_i = -.5, -.4, ..., 0, ..., .4, .5.
Generate a uniform sequence of N = 1000 points {X_i}, e.g., .2 | .5 | -.1 | .3 | -.2 | -.1 | ....
Form sums of n uniform variates, Z_n = X_1 + ... + X_n for n = 2, 4, 8, 12, and plot the frequency of occurrence f_{Zn}(z) ~ p_{Zn}(z).
[Figure: the curves give the "shape" of the frequency of occurrence for discrete points spaced 0.1 apart; as n = 2, 4, 8, 12 grows, the shape approaches a Gaussian, as the Central Limit Theorem requires.]

The discrete uniform PMF with values at 11 discrete points ranging over x = {-.5, -.4, -.3, -.2, -.1, 0, .1, .2, .3, .4, .5} can be expressed as a sum of 11 delta-functions with magnitude 1/11 at each of these points, as shown in the figure. This can also be thought of as the result of a "sample and hold" transform (see slide #26) of a continuous uniform PDF f_Y(y) = 1/1.1 ranging along the y-axis from y = -.6 to y = +.5; for example, the term (1/11) delta(x - (-.5)) is the delta-function located at x = -.5, generated by integrating the continuous PDF from y = -.6 to y = -.5, which gives an accumulated probability of .1/1.1 = 1/11 at the correct x-location.

Suppose that a sequence of 1000 numbers from the discrete set {-.5, -.4, ..., .4, .5} is randomly generated on a computer to create the data run notionally illustrated in the 2nd panel. Now we can create sum variables Z_n consisting of the sum of n = 2, 4, 8, or 12 of these samples from the discrete uniform PMF. According to the CLT, as we increase n, the resulting frequency distribution of the sum variables Z_n should approach a Gaussian. The notional illustration shows what we should expect: the dashed rectangle shows the bounds of the original uniform discrete PMF, and the other curves show the march towards a Gaussian.

Note that, unlike a Gaussian, all these distributions are zero outside a finite interval determined by the number of variables summed. The triangle shape is the sum of two RVs, and obviously the min and max are [-1, 1] for Z_2; the Z_12 RV, on the other hand, covers the range [-6, 6]; the range increases as we sum more variables, but only as n -> inf does the sum variable fully capture the small Gaussian "tails" at large |x|, as required by the CLT. This result can also be thought of in terms of an n-fold convolution of the IID RVs X_k, k = 1, 2, ..., n, which also spreads out with each new convolution in the sequence. The next slide shows the results of a MatLab simulation of this CLT approach to a Gaussian and a plot of the results confirming the notional sketch shown on this slide. (The MatLab script is given on the notes page of the next slide.)
109
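The course's own script appears with the next slide; in the same spirit, a minimal sketch of the experiment looks like this:

    % Sketch of the CLT experiment: sums of n discrete-uniform samples
    vals = -0.5:0.1:0.5;                        % the 11 equally likely values
    N = 1000;
    for n = [2 4 8 12]
        Z = sum(vals(randi(11, N, n)), 2);      % N realizations of Z_n
        histogram(Z, 'Normalization', 'pdf');  hold on  % shape marches toward a Gaussian
    end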
Examples Using Markov & Chebyshev Bounds

Markov (for X >= 0): the probability that the "value" of RV X exceeds r times its mean is at most 1/r:
  P[X >= r mu_X] <= 1/r, or P[X >= c] <= E[X]/c = mu_X/c.
  (Note that for r = 1 the Markov bound is 1, or 100%; useful bounds require r > 1.)
Kindergarten example: class mean height = 42". Find a bound on the probability of a student being taller than 63": r * 42 = 63 => r = 1.5, so Pr[X >= 1.5 * 42] <= 1/1.5 = 66.7%.

Chebyshev: the probability that the "deviation" of RV X exceeds r times its standard deviation sigma_X is at most 1/r^2:
  P[|X - mu_X| >= r sigma_X] <= 1/r^2, or P[|X - mu_X| >= k] <= sigma_X^2/k^2.
Ross Ex. 7-2a) Factory production:
a) Given mean = 50, bound the probability that production exceeds 75: P[X >= 75] <= 50/75 = .667 (Markov). Note: an upper bound, at most 66.7%.
b) Given also variance = 25, bound the probability that production is between 40 and 60: P[|X - 50| >= 10] <= 25/10^2 = .25 (Chebyshev), so 1 - P[|X - 50| >= 10] >= 1 - .25 = .75. Note: a lower bound, at least 75%.

Here are two examples of the application of the Markov and Chebyshev bounds; the two forms of each are stated on the left-hand side of the slide for reference. The decision to use one or the other of these bounds depends upon what type of information we have about the distribution. If the RV X takes on only positive values and we only know its mean, mu_X, then we must use the Markov bound. On the other hand, if the RV X takes on both positive and negative values and we know the mean, mu_X, and variance, sigma_X^2, then we must use the Chebyshev bound. If in the latter case the RV X takes on only positive values, then we could use either bound, but we would choose Chebyshev over Markov because it uses more of the information and hence will always be a tighter upper bound. Neither of these bounds is very tight, because the information about the distribution is very limited; knowing the actual distribution itself always yields the best bounds.

1) The mean height in a kindergarten class is mu_X = 42", and we are asked "what is the probability of a student being taller than 63"?" Short of knowing the actual distribution, the best we can do is use the Markov inequality to find an upper bound, Pr[X >= 63] <= 42/63 = .67, or 67%. This is also easily computed by noting that the tail is the region beyond 63" = 1.5 * (42"), so r = 1.5 and the answer is 1/1.5 = 2/3 = .67.

2) The factory production has a mean output mu_X = 50 units, and we are asked (a) "what is the probability of an output of 75 units or more?" This again involves a positive quantity X, the number of units, and we choose the Markov bound for 1.5*(50) = 75 units, so again r = 1.5 and the resulting bound is 67%. (b) If we are also given the variance of the production, sigma_X^2 = 25, the additional information allows us to use the Chebyshev bound to find the probability in the tails on either side of the mean of 50. Thus we bound the probability in the 2-sigma tails (r = 2, k = 10) to the left of 50-10 and to the right of 50+10 as Pr[Tails] <= 1/2^2 = 25%. Hence the production within the bounds [40, 60] has the complementary probability Pr[40 <= X <= 60] = 1 - Pr[Tails] >= 1 - .25 = .75, or at least 75%.
121
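The arithmetic for both examples fits in a few MATLAB lines; a sketch:

    % Markov and Chebyshev bounds for the two examples
    markov_height = 42/63           % P[height >= 63]  <= 0.667
    markov_prod   = 50/75           % P[X >= 75]       <= 0.667
    cheby_tails   = 25/10^2         % P[|X-50| >= 10]  <= 0.25
    lower_bound   = 1 - cheby_tails % P[40 <= X <= 60] >= 0.75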
Transformation of Variables & the General Bivariate Normal Distribution

X is bivariate normal with independent N(0,1) components: mean m_X = E[X] = [0; 0], covariance K_XX = E[X X^T] = I (the 2x2 identity).

Linear transform to Y = AX + b (with b = [b1; b2]):
  Mean: m_Y = E[AX + b] = A E[X] + b = b.
  Covariance: K_YY = E[(Y - m_Y)(Y - m_Y)^T] = E[(AX)(AX)^T] = A E[X X^T] A^T = A A^T.
  Determinant: det K_YY = det A * det A^T = (det A)^2, so det A = sqrt(det K_YY).
  A is the Jacobian: det[dy_i/dx_j] = det{A_ij}, so J(y/x) = det A = sqrt(det K_YY).

New probability density (using (A^(-1))^T A^(-1) = (A A^T)^(-1) = K_YY^(-1)):
  f_Y(y) = f_X(x)/|J(y/x)| = (1/(2 pi sqrt(det K_YY))) exp{-(1/2) [A^(-1)(y - b)]^T [A^(-1)(y - b)]}
         = (1/(2 pi sqrt(det K_YY))) exp{-(1/2) (y - m_Y)^T K_YY^(-1) (y - m_Y)},
the general bivariate normal distribution (the components are no longer independent, nor do they have zero means and unit variances).

We introduced the bivariate Gaussian distribution for the case of two independent N(0,1) Gaussians (with the same variance, 1) and arrived at a zero mean vector m_X and a diagonal covariance matrix K_XX = diag(1,1), corresponding to a pair of uncorrelated Gaussian RVs, displayed in the first line of the table. The second line of the table shows the results of making a linear transformation of variables Y = AX + b from the X1, X2 coordinates to the new Y1, Y2 coordinates; note that the vector b = [b1, b2]^T represents the displaced origin of the Y1, Y2 coordinates relative to X = [0,0]^T. We see that the new mean vector is no longer zero but rather m_Y = b, and the new covariance K_YY = A A^T no longer has unit variances along the diagonal but, in general, now has non-zero off-diagonal elements as well. The fact that this linear transformation yields non-zero off-diagonal elements in the covariance matrix means that the new RVs Y1, Y2 are no longer uncorrelated.

The computations supporting these table entries are straightforward. The new mean is obtained by taking the expectation E[Y] = E[AX + b] and using the fact that the original mean E[X] is zero to give m_Y = b. Substituting this value for m_Y in the covariance expression K_YY = E[(Y - b)(Y - b)^T] yields K_YY = E[(AX)(AX)^T] = A E[X X^T] A^T = A A^T, since E[X X^T] = K_XX = I (i.e., the identity matrix diag(1,1)).

In order to find the new bivariate density f_{Y1,Y2}(y1,y2), we need to divide f_{X1,X2}(x1,x2) by the Jacobian determinant J(Y,X) and replace X by A^(-1)(Y - b). This Jacobian is found by differentiating the transformation Y = AX + b to find J = det[dY/dX] = det(A); note that this is easily verified by writing out the two equations explicitly and differentiating y1 and y2 with respect to x1 and x2 to obtain the partials dy_i/dx_j = a_ij, and then taking the determinant. Taking det(K_YY) = det(A A^T) and using the fact that det(A) = det(A^T), we find det A = (det K_YY)^(1/2). Finally, substituting this and X = A^(-1)(Y - b) yields the general bivariate normal distribution f_Y(y) given in the grey boxed equation at the bottom of the slide. Be careful to note that the inverse K_YY^(-1) occurs in the exponential quadratic form, while the matrix K_YY occurs in the denominator as det(K_YY)^(1/2); also observe the "shorthand" vector notation f_Y(y) in place of the more explicit f_{Y1,Y2}(y1,y2).
132
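The mean and covariance relations are easy to verify by simulation; a sketch (A and b are illustrative choices, and the implicit expansion in A*X + b assumes MATLAB R2016b or later):

    % Correlated bivariate normal from a linear map of independent N(0,1) pairs
    A = [2 0; 1 1];  b = [1; -1];
    N = 1e5;
    X = randn(2, N);          % independent standard-normal components
    Y = A*X + b;              % implicit expansion adds b to every column
    mean(Y, 2)                % ~ b
    cov(Y')                   % ~ A*A' = [4 2; 2 2]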
Bivariate Gaussian Distribution & Level Surfaces

f_{Y1Y2}(y1, y2) = (1/(2 pi sqrt(det K_YY))) e^(-(1/2) y^T K_YY^(-1) y), with
  K = [sigma1^2, rho sigma1 sigma2; rho sigma1 sigma2, sigma2^2],  det(K) = sigma1^2 sigma2^2 (1 - rho^2) >= 0.

The level surfaces are ellipses in y1-y2 space:
• -1 < rho < +1: f_{Y1Y2}(y1,y2) != f_{Y1}(y1) f_{Y2}(y2); y1 and y2 are dependent.
• rho = 0: diagonal terms only; either an ellipse or a circle, with principal axes along y1 and y2; f_{Y1Y2}(y1,y2) = f_{Y1}(y1) f_{Y2}(y2): independent.
• rho = +-1: degenerate case; the ellipse collapses to a straight line along one of the principal axes, y2 = +-y1; y1 and y2 are "extremely dependent" (correlated or anti-correlated).

[Figure tableau: columns for positive (rho > 0), negative (rho < 0), and no (rho = 0) correlation. Top row (sigma1 > sigma2): ellipses with principal axes at +45 degrees, at -45 degrees, and along y1, y2, respectively. Bottom row: degenerate ellipses collapsed to the +45 and -45 degree lines for rho = +1 and rho = -1, and a circle for rho = 0 with sigma1 = sigma2.]

The bivariate density f_Y(y) = f_{Y1,Y2}(y1,y2) is completely determined by its mean vector m_Y and its covariance matrix K_YY, as given by the equations on the upper right. Consider the bivariate Gaussian density plotted as a 2-d surface relative to its mean vector components m_Y1 and m_Y2 taken as the origin. The level surfaces, represented by cuts parallel to the y1-y2 plane, are the ellipses given by the quadratic-form equation of the last slide.

The structure of these ellipses is shown in the tableau, consisting of 3 columns for positive, negative, and zero correlation coefficient rho, and 2 rows corresponding to the general (top row) and degenerate cases. The general cases in the top row have unequal sigmas, sigma1 > sigma2; going across the row we have an ellipse with positive correlation (rho > 0), one with negative correlation (rho < 0), and an ellipse along its principal axes with no correlation (rho = 0). The (red) arrows show the directions of the principal axes of the ellipse in each case: the zero-correlation case on the extreme right has the principal axes coinciding with y1 and y2, while the negative-correlation case has its principal axes rotated at -45 degrees to the y1-axis and the positive-correlation case has its principal axes rotated at +45 degrees to the y1-axis.

The bottom row illustrates the two degenerate cases rho = +1 and rho = -1, in which the ellipse "collapses" to a straight line corresponding to complete correlation or anti-correlation (opposite variations of Y1 and Y2), respectively, and the degenerate uncorrelated case rho = 0, in which the principal-axis ellipse above it degenerates into a circle because the two sigmas are equal (sigma1 = sigma2).
135
Ellipses of Concentration

• 1D Gaussian distribution described by two scalars, mean µX & Var(X) - intuitive;
  tabulate the CDF of the normalized & centered (standardized) RV Y = (X - µX)/σX:
    \Phi(y) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{y} e^{-t^2/2}\, dt
• 2D Gaussian distributions described by vector & matrix: mean vector mX & covariance KXX.
  The vector mX and matrix KXX are not very intuitive!
    f_X(x_1, x_2) = \frac{1}{2\pi\sqrt{\det K_{XX}}}\, e^{-\frac{1}{2} x^T K_{XX}^{-1} x}
• "Level curves" of the zero-mean 2D Gaussian surface with covariance KXX are 2-d ellipses:
    x^T K_{XX}^{-1} x = \frac{1}{1-\rho^2}\left[\frac{x_1^2}{\sigma_{X_1}^2} - \frac{2\rho x_1 x_2}{\sigma_{X_1}\sigma_{X_2}} + \frac{x_2^2}{\sigma_{X_2}^2}\right] = c^2 = \text{const.}

[Figure: 1-d probability density fX(x) with ±σX about µX and its standardized density fY(y);
Gaussian probability surface with elliptical level curves in the x1-x2 plane.]

The 1-dimensional Gaussian distribution is completely described by two scalars: the mean µX and the variance σX². The tabulation of a single integral for the cumulative distribution function FY(y), shown in the left box, is sufficient to characterize all Gaussians X: N(µX, σX²) if we first transform to a standardized Gaussian RV Y via Y = (X - µX)/σX. The Gaussian integral representing the probability distribution for the standardized RV, Pr[Y ≤ y] = FY(y), is used so often that it is denoted the "Normal Integral" Φ(y). We would like to extend this concept of a single tabulated integral to describe all 2-dimensional Gaussian distributions; however, as we have seen, the bivariate Gaussian distribution requires more than just the means and variances of two Gaussians, as we must also characterize their "co-variation" by specifying their correlation coefficient ρ. Thus we must specify the two elements of the mean vector µX and all three distinct elements of the (symmetric) covariance matrix KXX in order to completely characterize a bivariate Gaussian fX1X2(x1,x2), given in the right box of the slide. We have seen that the level "surfaces" (actually curves) of the Gaussian PDF are ellipses centered about the mean vector coordinates µX1 and µX2 and described by the quadratic form xT KXX-1 x in the exponent of the PDF. The explicit equation for the level curves with zero mean is obtained by setting this term equal to an arbitrary positive constant c², as given by the equation on the slide. These ellipses are called ellipses of concentration because the area contained within them measures the concentration of probability for the specific "cut through" the PDF surface. In the next few slides we will show how this leads to a single tabulated function for the bivariate Gaussian that is analogous to Φ(y) for the normal distribution. 138
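Since Φ has no closed form, it is tabulated; in MATLAB the base erf function serves the same role. A minimal sketch, assuming illustrative values of µX, σX, and the query point x, of standardizing and then evaluating Φ:

  % Minimal sketch: standardize X ~ N(muX, sigmaX^2) and evaluate the
  % tabulated Normal Integral via base MATLAB's erf.
  Phi = @(y) 0.5*(1 + erf(y/sqrt(2)));   % Phi(y) = Pr[Y <= y] for Y ~ N(0,1)
  muX = 10; sigmaX = 2; x = 13;          % illustrative values
  y = (x - muX)/sigmaX;                  % standardized value y = (x - muX)/sigmaX
  probX = Phi(y)                         % Pr[X <= x] from the single table Phi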
Gaussian & Bivariate (2d) Gaussian Distributions Compared

• Probability for x to fall within an ellipse "scaled by c":
    \text{Prob}(x^T K_{xx}^{-1} x < c^2) = F_C(c) = 1 - e^{-c^2/2} = \alpha
  Note: the inverse covariance K_xx^{-1} determines the ellipse.
• Scale factor c in terms of % concentration α:
    c = \sqrt{-2 \ln(1 - \alpha)}
• Equivalent 1-d sigma table:

    1d sigma    α (%)    c
    1-σ         68.3     1.52
    2-σ         95.4     2.48
    3-σ         99.7     3.41

[Figure: the c = 1.52 "slice" through the bivariate Gaussian surface defines a 2-d ellipse
containing 68.3% of the probability, compared with the 1-σ (68.3%) area under the 1-d
Gaussian density fX(x) about µX.]

On the last slide we found that 2-d probabilities are described in terms of ellipses of concentration specified by the axis scale parameter c, which is related to the percentage of events contained within the ellipse by the expression shown on the slide. This CDF is in fact a Rayleigh distribution with "radial distance r" replaced by the ellipse scale parameter "c". Setting this probability within the ellipse (parameterized by the value "c") equal to α allows us to solve for the value of c in the boxed equation. Using this equation, we compute the table which displays the values of the ellipse scaling parameter "c" corresponding to the standard values of 1-σ (68.3%), 2-σ (95.4%), and 3-σ (99.7%) associated with a 1-dimensional Gaussian distribution. These ellipses are used to specify equivalent "standard deviations" for the bivariate Gaussian, and extending this tabulation to all probabilities allows us to define a standard bivariate normal function Ψ(c) similar to Φ(x) for the normal Gaussian. The two figures illustrate this equivalence by showing the c = 1.52 cut through the bivariate Gaussian surface yielding an equivalent "1-σ" ellipse containing α = 68.3% of the probability, and then notionally comparing that ellipse with the "1-σ" area under the standard Gaussian curve. 141
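The boxed relation is easy to verify by Monte Carlo. This minimal MATLAB sketch (the covariance K and sample size are illustrative assumptions) computes c for α = 68.3% and checks that the c-scaled ellipse contains that fraction of the samples:

  % Minimal sketch: c = sqrt(-2*log(1-alpha)) and a Monte Carlo check that
  % the c-ellipse contains the fraction alpha of bivariate Gaussian samples.
  alpha = 0.683;
  c = sqrt(-2*log(1 - alpha))            % ~1.52, the equivalent "1-sigma" scale
  K = [4 1.2; 1.2 1];  N = 1e5;          % illustrative covariance and sample size
  X = chol(K, 'lower')*randn(2, N);      % zero-mean samples with covariance K
  q = sum(X .* (K\X), 1);                % quadratic form x'*inv(K)*x per sample
  fracInside = mean(q < c^2)             % -> alpha = 0.683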
Closure Under Bayesian Updates - Summary

Summary: started with a pair of N(0,1) RVs X & Y with correlation ρ:
    \mu \equiv E\left[ [X, Y]^T \right] = [0, 0]^T  ;  K_{XY} = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}
1) The joint distribution is a correlated Gaussian in X and Y:
    f_{XY}(x, y) = \frac{1}{2\pi\sqrt{1-\rho^2}}\, e^{-\frac{x^2 - 2\rho x y + y^2}{2(1-\rho^2)}}
2) The marginal fY(y) is found to be N(0,1):
    f_Y(y) = \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}
3) The Bayes' update fX|Y(x|y) is Gaussian, N(ρy, 1-ρ²):
    f_{X|Y}(x|y) = \frac{1}{\sqrt{2\pi(1-\rho^2)}}\, e^{-\frac{(x - \rho y)^2}{2(1-\rho^2)}}
4) Pick off the "conditional" mean & variance from fX|Y(x|y):
    µ_X|Y ≡ E[X|Y] = ρy  ;  Var(X|Y) = 1 - ρ²
   The conditional mean represents an "estimate of X given measurement Y", with Var(X|Y)
   obtained from the Bayes-updated Gaussian.

Generalize: start with a general Gaussian vector with non-zero means and variances:
    \mu = [\mu_X, \mu_Y]^T  ;  K_{XY} = \begin{bmatrix} \sigma_X^2 & \rho\sigma_X\sigma_Y \\ \rho\sigma_X\sigma_Y & \sigma_Y^2 \end{bmatrix}
Conditional mean and variance (represents the Bayes' update equation):
    µ_X|Y ≡ E[X|Y] = µX + ρ (σX/σY)(y - µY)
    Var(X|Y) = σX²(1 - ρ²)  ;  σ_X|Y = σX √(1 - ρ²)

Note 1: In the "Gaussian Arena" we do not need to work with distributions directly, since both
1) linear transforms and 2) the Bayes' update equation yield Gaussian vector results
(surrogates for the joint and conditional distributions respectively).
Note 2: Y is irrelevant for ρ = 0: X & Y independent ⇒ the conditionals do not depend upon
the value of y: µ_X|Y = µX and Var(X|Y) = σX².

Closure under Bayesian updates started with a pair of correlated N(0,1) Gaussian RVs with correlation coefficient ρ and resulted in a Gaussian conditional distribution fX|Y(x|y) whose conditional mean is µX|Y = E[X|Y] = ρy and whose conditional variance is Var(X|Y) = σX|Y² = 1 - ρ². If instead we start with a pair of correlated Gaussian RVs having different means and variances, given by the mean vector µX and covariance matrix KXY shown in the middle panel of the slide, we obtain the general result: a Gaussian with conditional mean E[X|Y] = µX|Y = µX + ρσX(y - µY)/σY and conditional variance Var(X|Y) = σX|Y², given in the boxed equation. The lower panel interprets these results in terms of a two-dimensional "Gaussian Arena" in which the input and output are related by the underlying joint Gaussian distribution, which remains Gaussian for all possible linear coordinate transformations and even maintains its Gaussian character when one of the variables is conditioned on the other. Thus the Gaussian vector remains Gaussian under both linear transformations and Bayes' updates. Also note that if the correlation is zero (ρ = 0) then the input and output variables are independent, as is evident in the boxed equations, which reduce to statements that the conditional mean equals the mean, µX|Y = µX, and the conditional variance equals the variance, σX|Y² = σX². We note in passing that because the quadratic form in the joint Gaussian is symmetric in the X and Y variables, we could just as well have computed the output Y conditioned on the input X to find analogous results with X and Y interchanged, corresponding to the forward Bayesian relation. A visual interpretation of this result will be given on the next slide, and further insight into the role of the communication channel and its inverse will be given in the slides after that. 151
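A minimal MATLAB sketch of the N(0,1) case (ρ, y0, the slice width, and the sample size are illustrative assumptions; the correlated pair is built with the X = ρY + V construction discussed a few slides later): conditioning on a thin slice around Y = y0 reproduces the conditional mean ρy0 and variance 1 - ρ²:

  % Minimal sketch: check E[X|Y=y0] = rho*y0 and Var(X|Y) = 1 - rho^2
  % by conditioning samples on a thin slice around y0.
  rho = 0.7; y0 = 1.0; N = 1e6;              % illustrative values
  Y = randn(N, 1);                           % Y ~ N(0,1)
  X = rho*Y + sqrt(1 - rho^2)*randn(N, 1);   % correlated N(0,1) partner
  slice = abs(Y - y0) < 0.05;                % samples with Y near y0
  condMean = mean(X(slice))                  % -> rho*y0 = 0.7
  condVar  = var(X(slice))                   % -> 1 - rho^2 = 0.51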
General Case: Visualization of Conditional Mean

Bayesian update conditions X on Y: a priori (µX ; σX²) yields a posteriori
    µ_X|Y = µX + ρ (σX/σY)(y - µY)  ;  σ_X|Y² = (1 - ρ²) σX²
The distribution fX|Y(x) is Gaussian with conditional mean µ_X|Y and conditional variance σ_X|Y².

The "y0-slice": choose an arbitrary y0; the line y = y0 is tangent to an ellipse whose maximum
is ymax = y0. Recall the covariance ellipse construction in standardized coordinates,
    \tilde{x}^2 - 2\rho \tilde{x}\tilde{y} + \tilde{y}^2 = (1 - \rho^2) c^2  ;
    \tilde{x} = (x - \mu_X)/\sigma_X ,  \tilde{y} = (y - \mu_Y)/\sigma_Y     ("origin at" (µX, µY)),
where we found the x-value corresponding to the extremum of the y = y0 slice to be
    \tilde{x}_0 \equiv \tilde{x}(\tilde{y}_0) = \rho \tilde{y}_0
    ⇒  (x_0 - \mu_X)/\sigma_X = \rho (y_0 - \mu_Y)/\sigma_Y
    ⇒  x_0 = \mu_X + \rho\,\sigma_X (y_0 - \mu_Y)/\sigma_Y = \mu_{X|Y=y_0} ,
i.e., x0 is the mean "conditioned on the y0-slice". (Here ρ = E[XY]/(σX·σY) for the zero-mean case.)

Special cases of µ_X|Y = µX + ρ (σX/σY)(y - µY):
• ρ = 0:   µ_X|Y = µX                        independent (Y is irrelevant)
• ρ = +1:  µ_X|Y = µX + (σX/σY)(y - µY)      direct correlation
• ρ = -1:  µ_X|Y = µX - (σX/σY)(y - µY)      inverse correlation
For the degenerate ellipse (ρ = ±1) the y = y0 slice meets the collapsed line at a single
unique point: the conditional distribution has zero variance!

[Figure: elliptical contours centered at (µX, µY); the line y = y0 is tangent to one ellipse,
with a perpendicular dropped from the tangent point to x0 = µX|Y=y0 on the x-axis; the
Gaussian "y0-slice" fX|Y(x) with spread ±σX|Y is drawn above the contours. Inset: the
degenerate ellipse ρ = +1.]

The results for the conditional mean and variance can be understood graphically as follows. Starting with the bivariate Gaussian density, we draw the elliptical contours corresponding to horizontal cuts through the density surface, centered at the mean coordinates µX and µY indicated by the black dot at the center. If we choose a fixed value y = y0, the line parallel to the x-axis is tangent to one of the ellipses, and hence y0 represents the maximum y-value for that ellipse, as shown by the red dot. This line also results from a vertical plane y = y0 cutting through the distribution, and the Gaussian cut through the distribution is shown above the contours. The x-coordinate corresponding to this maximum is obtained by dropping a perpendicular onto the x-axis at the value x0 = µX|Y=y0, as shown in the figure. Recalling the calculation used for the covariance ellipse construction, the x0-value corresponding to this maximum at y = y0 is given in standardized coordinates by x̃0 = ρỹ0, which is converted to the coordinates of the figure by letting x̃0 -> (x0 - µX)/σX and ỹ0 -> (y0 - µY)/σY to yield (x0 - µX)/σX = ρ(y0 - µY)/σY, or x0 = µX + ρσX(y0 - µY)/σY, which is exactly the statement that x0 is the conditional mean µX|Y=y0.
The three special cases ρ = 0, +1, -1 shown in the bottom panel are:
(i) ρ = 0, no correlation: corresponds to a coordinate system along the principal axes of the ellipse, for which a constant y = y0 cut will always yield a conditional mean µX|Y=y0 = µX.
(ii) ρ = +1, complete positive correlation: corresponds to the case where the ellipse collapses to a straight line; the conditional distribution is a single point with zero variance on the line with slope (σY/σX), as shown, and yields a conditional mean µX|Y=y0 = µX + σX(y0 - µY)/σY.
(iii) ρ = -1, complete negative correlation: corresponds to the case where the ellipse collapses to a straight line; the conditional distribution is a single point with zero variance on the line with slope (-σY/σX) (not shown), and yields a conditional mean µX|Y=y0 = µX - σX(y0 - µY)/σY. 152
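The tangency construction says the y0-slice of the density peaks exactly at the conditional mean. A minimal MATLAB sketch (all numeric values are illustrative assumptions) that locates the peak of the slice by grid search and compares it with the formula:

  % Minimal sketch: the y = y0 slice of the bivariate density peaks at
  % x0 = muX + rho*(sigmaX/sigmaY)*(y0 - muY), the conditional mean.
  muX = 1; muY = -1; sigmaX = 2; sigmaY = 1; rho = 0.6; y0 = 0.5;  % illustrative
  K = [sigmaX^2          rho*sigmaX*sigmaY;
       rho*sigmaX*sigmaY sigmaY^2         ];
  Kinv = inv(K);
  x  = linspace(-6, 8, 4001);                % grid along the slice
  d1 = x - muX;  d2 = y0 - muY;
  q  = Kinv(1,1)*d1.^2 + 2*Kinv(1,2)*d1*d2 + Kinv(2,2)*d2^2;  % quadratic form
  [~, iMin] = min(q);                        % density max = quadratic-form min
  xPeak = x(iMin)                            % peak of the slice
  x0 = muX + rho*(sigmaX/sigmaY)*(y0 - muY)  % conditional mean; should match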
Rationale for "Inverse Channel" & Generating Correlated RVs

Rationale for "X = ρY + V": given Y: N(0,1), generate an RV X: N(0,1) correlated to Y with coefficient ρ.
(i) If noise is not added, X = ρY:
    Var(X) = Var(ρY) = ρ² Var(Y) = ρ² ≠ 1
(ii) If uncorrelated noise is added, X = ρY + "V", with the appropriate Var(V) = (1 - ρ²)
     chosen to make up the variance deficit, then
    Var(X) = Var(ρY + V) = ρ² Var(Y) + Var(V) + 2 Cov(Y, V) = ρ²·1 + (1 - ρ²) + 0 = 1

Inverse Channel Method: X = ρY + V ; Y = N(0,1) input → X = N(0,1) output ; V = N(0, 1-ρ²) noise ; -1 ≤ ρ ≤ +1
(i) Generate samples of the RV "Y" using a standard method (e.g., sum 12 uniform variates on [-0.5, 0.5]) to yield N(0,1).
(ii) Generate zero-mean Gaussian noise "V" with variance 1 - ρ² to yield N(0, 1-ρ²).
(iii) Multiply each RV sample "Y" by the desired correlation coefficient ρ.
(iv) Add the noise sample "V" to obtain the output "X", which is N(0,1) and has the desired correlation coefficient: correl(X,Y) = ρ.
(A minimal MATLAB sketch of this recipe follows the notes below.)

Special cases of "X = ρY + V":
• ρ = 0: no correlation between X & Y: 0·Y + N(0, 1-0²) = N(0,1) → X;
    X is simply the uncorrelated noise sample N(0,1).
• ρ = ±1: full correlation/anti-correlation (degenerate ellipse or straight line):
    ±1·Y + N(0, 1-(±1)²) = ±Y → X; X is simply the ±Y value.
• -1 < ρ < 1: general correlation: ρ·Y + N(0, 1-ρ²) → X;
    X results from multiplying Y by the correlation ρ and adding noise with variance (1 - ρ²).

The last couple of slides considered the inverse channel and its relation to a Bayesian update, which starts with an a priori value of the mean µX and variance σX² and then updates their values as a result of an actual "measurement Y". The conditional mean and variance formulas that we found comported both with the Bayesian update equation for conditional probability densities and with those obtained by constructing an inverse channel which creates an input X from an output Y. In this slide and the next we consider this important "coincidence" in some detail. The box on the left uses the inverse channel model as a computer program flow diagram to actually generate an RV X ~ N(0,1) from a linear combination of Y ~ N(0,1) and noise V ~ N(0, 1-ρ²). Note that the input and output RVs are both N(0,1) Gaussians with unit variance, yet the noise must have a variance less than unity for this to work. The rationale is simple enough: consider what might be your first impulse for generating a pair of correlated RVs, namely setting X = ρY without noise (upper right box); taking the expectations E[X] and E[X²] we find µX = ρµY = ρ·0 = 0 and σX² = ρ²σY² = ρ² ≠ 1, which does not agree with the assumption that both X and Y are N(0,1). Agreement is possible only if we add zero-mean noise with variance (1 - ρ²), because when added to ρ² it yields the desired unit variance for the RV X. The special cases of no correlation (ρ = 0) and full positive and negative correlation (ρ = ±1) are explicitly shown to be in agreement with this model. For no correlation the model gives X as just N(0,1) random noise, which takes on values completely independent of the y-values. On the other hand, for full positive (or negative) correlation the model gives X as N(0,1) taking on values that are exactly the same as those for Y (or -Y). In the general case -1 < ρ < +1 the model gives X as an N(0,1) RV which tracks Y more closely for correlations near +1 and tracks the noise more closely for correlations nearer to zero, thus giving the expected intermediate behavior. 155
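A minimal MATLAB sketch of recipe steps (i)-(iv) above (ρ and the sample size are illustrative choices; the 12-uniform sum is the slide's own suggestion for step (i)):

  % Minimal sketch of the inverse-channel recipe (i)-(iv): X = rho*Y + V.
  rho = 0.9; N = 1e6;                      % illustrative values
  Y = sum(rand(12, N) - 0.5, 1)';          % (i) sum 12 uniforms on [-0.5,0.5] -> ~N(0,1)
  V = sqrt(1 - rho^2)*randn(N, 1);         % (ii) noise V ~ N(0, 1-rho^2)
  X = rho*Y + V;                           % (iii)-(iv) scale by rho, add noise
  varX = var(X)                            % -> 1
  C = corrcoef(X, Y);  rhoHat = C(1,2)     % -> rho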
Multilinear Gaussian Distribution

n-dimensional Gaussian vector X = [X1, X2, ..., Xn]^T:
    f_X(x) = \frac{1}{(2\pi)^{n/2}\sqrt{\det K_{XX}}}\, e^{-\frac{1}{2}(x - \mu_X)^T K_{XX}^{-1}(x - \mu_X)}
Matrix components (full symmetric n×n matrix):
    (K_{XX})_{rc} = E[(X_r - \mu_{X_r})(X_c - \mu_{X_c})] ;  r, c = 1, 2, ..., n
Moment generating function:
    \varphi_X(t) = E[e^{X^T t}] = e^{\frac{1}{2} t^T K_{XX} t + \mu_X^T t} ;  t = [t_1, t_2, ..., t_n]^T
Still Gaussian after a linear transformation Y = AX + b (see next slide):
    \mu_Y = A\mu_X + b  ;  K_{YY} = A K_{XX} A^T
    f_Y(y) = \frac{1}{(2\pi)^{n/2}\sqrt{\det K_{YY}}}\, e^{-\frac{1}{2}(y - \mu_Y)^T K_{YY}^{-1}(y - \mu_Y)}
The 1st and 2nd moments (mean vector µX & covariance KXX) uniquely define the multivariate Gaussian.
Details:
    \mu_Y = E[Y] = E[AX + b] = A\mu_X + b
    Y - \mu_Y = AX + b - (A\mu_X + b) = A(X - \mu_X)
    K_{YY} = E[(Y - \mu_Y)(Y - \mu_Y)^T] = E[A(X - \mu_X)(X - \mu_X)^T A^T]
           = A\, E[(X - \mu_X)(X - \mu_X)^T]\, A^T = A K_{XX} A^T

The extension to multilinear Gaussian distributions, or Gaussian vectors, is straightforward: taking the product of "n" independent N(µX, σX²) Gaussians, symbolized by the vector X = [X1, X2, ..., Xn]T, yields an n-dimensional Gaussian characterized by an n-dimensional mean vector µX and an n×n covariance matrix KXX whose diagonal elements equal the variances of the individual RVs and whose off-diagonal elements are all zero. Even if we start with independent RVs, a linear transformation of the form Y = AX + b produces correlations, and the off-diagonal terms of the new covariance matrix are no longer zero. The transformation leaves the Gaussian structure the same, but the mean and covariance become µY = AµX + b and KYY = AKXX AT respectively. The Gaussian always has the form fX(x) = (2π)-n/2 (det KXX)-1/2 exp(-½ q) with the scalar quadratic q = [x - µX]T KXX-1 [x - µX]. The row-column components of the covariance matrix are determined by the expected values of the "row-col" pair products of centered deviations. The moment generating function generalizes to φX(t) = E[exp(XT t)] = exp(½ tT KXX t + µXT t) with t = [t1, t2, ..., tn]T. Note that we have reverted to the old notation in which the components of the Gaussian vectors are labeled by indexed quantities Xi and the new components under a coordinate transformation are Yi. This is temporary, however, because we shall want to consider communication channels with a number of inputs and a number of outputs and partition the n-dimensional Gaussian vector into these two distinct types of components in order to define the conditional distribution µX|Y in a useful manner. 157
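A minimal MATLAB sketch evaluating the boxed n-dimensional density directly from µX and KXX (the 3-dimensional mean, covariance, and evaluation point are illustrative assumptions):

  % Minimal sketch: evaluate the n-dimensional Gaussian density at a point x.
  mu = [0; 1; -1];                          % illustrative mean vector
  K  = [2 0.5 0; 0.5 1 0.3; 0 0.3 1.5];     % illustrative symmetric positive-definite K
  x  = [0.2; 0.8; -0.5];                    % evaluation point
  n  = length(mu);
  d  = x - mu;
  fx = exp(-0.5*(d'*(K\d))) / ((2*pi)^(n/2)*sqrt(det(K)))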
Partitioned Multivariate Gaussian & Xfm to Block Diagonal

Partition: [X(1) | X(2)]^T   {Comm channel with multiple inputs "X" = X(1) & outputs "Y" = X(2)}
2×1 partitioned vectors:
    x = [x(1) ; x(2)] ,  µ = [µ(1) ; µ(2)] ,  with x(1) = [x1, ..., xk]^T and x(2) = [x(k+1), ..., xn]^T
2×2 partitioned matrix:
    K = \begin{bmatrix} K_{(1)(1)} & K_{(1)(2)} \\ K_{(2)(1)} & K_{(2)(2)} \end{bmatrix}
    with block dimensions k×k, k×(n-k), (n-k)×k, and (n-k)×(n-k) respectively.
Perform the linear transform in "partitioned form":
    [y(1) ; y(2)] = A [x(1) ; x(2)] ,  where
    A = \begin{bmatrix} I_k & B_{k,(n-k)} \\ 0_{(n-k),k} & I_{(n-k)} \end{bmatrix}
(Now drop the parentheses notation for the partitioned components.)
Find the "B" matrix so that the new KYY is block diagonal:
    A K_{XX} A^T = \begin{bmatrix} I & B \\ 0 & I \end{bmatrix}
                   \begin{bmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{bmatrix}
                   \begin{bmatrix} I & 0 \\ B^T & I \end{bmatrix}
                 = \begin{bmatrix} K_{11} + B K_{21} & K_{12} + B K_{22} \\ K_{21} & K_{22} \end{bmatrix}
                   \begin{bmatrix} I & 0 \\ B^T & I \end{bmatrix}
                 = \begin{bmatrix} K_{11} + B K_{21} + (K_{12} + B K_{22}) B^T & K_{12} + B K_{22} \\ K_{21} + K_{22} B^T & K_{22} \end{bmatrix}
Forcing the two off-diagonal blocks to zero:
    (1)  K_{21} + K_{22} B^T = 0        (2)  K_{12} + B K_{22} = 0

Consider a multi-dimensional communication channel partitioned into two sets as follows: "X": k inputs X(1) = [X1, X2, ..., Xk]T and "Y": (n-k) outputs X(2) = [Xk+1, Xk+2, ..., Xn]T. The mean vector and covariance matrix are partitioned in the same manner to yield the 2×1 partitioned vector X(I) and 2×2 partitioned covariance matrix K(I)(J). Note that the partition dimensions of K(I)(J) are specifically as follows: Row #1 [K11 : K12] = [k×k : k×(n-k)]; Row #2 [K21 : K22] = [(n-k)×k : (n-k)×(n-k)]. Now let's perform a linear transformation to a new coordinate system according to the equation Y = AX + b, where it is understood that Y(I), X(I), and b(I) are all partitioned in the same manner as 2×1 column vectors, and the matrix A(I)(J) is partitioned into a 2×2 block matrix corresponding to the partitioning of the original covariance matrix K(I)(J), as shown in detail on the slide. The transformed covariance matrix KYY is defined by the product of n×n matrices A KXX AT; in partitioned form we instead have a product of three 2×2 block matrices. The sub-matrices in the partition of A(I)(J) are chosen as follows: A(1)(J) = [Ik,k : Bk,(n-k)] and A(2)(J) = [0(n-k),k : I(n-k),(n-k)] (labeled by their dimensions). The problem is to find the matrix B such that the new covariance matrix KYY is block diagonal; taking the product of the three partitioned matrices A KXX AT results in the 2×2 block matrix shown at the bottom of the slide. Forcing the two "off-diagonal" partitions (circled) to be zero yields two conditions on the matrix B and its transpose BT, namely (1) K21 + K22 BT = 0 and (2) K12 + B K22 = 0. Note that the partitioned components are those of the original matrix KXX, so for example K21 is the 2,1 partition component, (KXX)21. On the next slide we formally solve for B and BT and write down the explicit form of the block diagonal matrix KYY with just two non-zero blocks, namely (KYY)11 and (KYY)22. This will allow us to factor the multivariate Gaussian and prove a very elegant generalization of Bayes' update for the conditional mean and conditional covariance known as the Gauss-Markov theorem. 159
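Solving condition (2) gives B = -K12 K22^-1 (condition (1) is its transpose). A minimal MATLAB sketch (the 4-dimensional covariance with k = 2 is an illustrative assumption) that builds A and confirms the off-diagonal blocks of A KXX AT vanish:

  % Minimal sketch: block-diagonalize a partitioned covariance with
  % B = -K12*inv(K22), so that A*K*A' has zero off-diagonal blocks.
  K = [4   1   0.8 0.2;                      % illustrative symmetric covariance
       1   3   0.5 0.6;
       0.8 0.5 2   0.4;
       0.2 0.6 0.4 1  ];
  k = 2; n = 4;                              % k inputs, n-k outputs
  K12 = K(1:k, k+1:n);  K22 = K(k+1:n, k+1:n);
  B = -K12/K22;                              % solves K12 + B*K22 = 0
  A = [eye(k) B; zeros(n-k, k) eye(n-k)];
  KYY = A*K*A'                               % off-diagonal blocks ~ 0
  % The upper-left block is the Schur complement K11 - K12*inv(K22)*K21.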
Gauss-Markov Theorem: Updating Gaussian Vectors under Bayes' Rule

Given X and Y are jointly Gaussian random input and output vectors with dimensions k and n-k
respectively, combine them to form an n-dim vector with partitioned mean and covariance:
    \begin{bmatrix} X \\ Y \end{bmatrix} (n×1) ,
    \mu \equiv \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix} (k×1 over (n-k)×1) ,
    K \equiv \begin{bmatrix} K_{XX} & K_{XY} \\ K_{YX} & K_{YY} \end{bmatrix}
    (blocks k×k, k×(n-k), (n-k)×k, (n-k)×(n-k))
The Gauss-Markov Theorem states that the conditional PDF of "X given Y" is also Gaussian,
with conditional mean & covariance given by
    \mu_{X|Y} = \mu_X + K_{XY} K_{YY}^{-1} (y - \mu_Y)        (k×1)
    K_{X|Y} = K_{XX} - K_{XY} K_{YY}^{-1} K_{YX}              (k×k)
Note: although the covariance K is symmetric, the blocks themselves are not, i.e.,
K_XY ≠ K_YX for the off-diagonal blocks (their dimensions, k×(n-k) vs. (n-k)×k, generally
differ). Symmetry of K requires the relationship (K_XY)^T = K_YX.

The results of the last section for the n-dimensional multivariate Gaussian are now cast in a form more suitable for a communication channel. We introduce new notation in which the 1st partition of the Gaussian vector consists of the k inputs X = [X1, ..., Xk]T and the 2nd partition consists of the n-k outputs Y = [Y1, ..., Yn-k]T. The mean vector µX and covariance matrix KXX are partitioned in the natural manner shown on the slide. In this notation, the Gauss-Markov theorem states that the conditional PDF of "vector X given vector Y" is also Gaussian, with conditional mean and covariance given by the two boxed equations. This is identical to the results of the previous slide, only in new notation. Note that a possible source of confusion is to equate the partitions X and Y (whose dimensions k and n-k add to n) with the transformation of coordinates Y = AX used to transform between two n-dimensional coordinate systems, from X to the canonical coordinates Y. Also note that even though the full n×n covariance matrix is symmetric, Krc = Kcr with respect to its indices (i.e., K = KT), this is no longer true for the partitioned components, K(R)(C) ≠ K(C)(R), as evidenced by the fact that KXY ≠ KYX; they usually do not even have the same dimensions. The symmetry of the full matrix requires that blocks with transposed partition indices be transposes of one another, i.e., KXYT = KYX, which is possible because these two matrices do have compatible (transposed) dimensions. The Gauss-Markov theorem is the basis for using the conditional mean estimator µX|Y to update the a priori mean value µX = E[X] of a k-dimensional state vector X by using an (n-k)-dimensional measurement vector Y. The state and measurement vectors must be part of the same multivariate Gaussian distribution, or equivalently they must be components of a partitioned Gaussian vector whose means, variances, and correlations are given by the partitioned n-dimensional mean vector and covariance matrix shown at the top of the slide. They indeed form a Gaussian "Arena". 163
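A minimal MATLAB sketch of the two boxed update equations (the k = 2 state, scalar measurement, and all numeric values are illustrative assumptions):

  % Minimal sketch: Gauss-Markov update of a k-dim state X given a measurement y.
  muX = [0; 0];  muY = 1;                    % illustrative: k = 2, n-k = 1
  KXX = [2 0.3; 0.3 1];
  KXY = [0.8; 0.4];  KYX = KXY';  KYY = 1.5;
  y = 2.2;                                   % realized measurement
  muX_given_Y = muX + KXY*(KYY\(y - muY))    % conditional (updated) mean
  KX_given_Y  = KXX - KXY*(KYY\KYX)          % conditional covariance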
Gauss-Markov Estimator

New RVs:
    Estimator RV:  \mu_{X|Y} \to \hat{X} = \mu_X + K_{XY} K_{YY}^{-1} (Y - \mu_Y)
    Error RV:      e = X - \hat{X} = X - [\mu_X + K_{XY} K_{YY}^{-1} (Y - \mu_Y)]
Note: the "estimator" and the "error" depend upon the specific values of X = "x" and Y = "y",
and hence generate samples of two new random variables X̂ & e whose statistics can be
inferred from those of X and Y.
The error e and the conditional mean estimator X̂ satisfy the following remarkable properties:
1)  E[e X̂^T] = 0  &  E[e Y^T] = 0 ,  i.e.,  e ⊥ X̂  &  e ⊥ Y :
    the error e is uncorrelated with ("orthogonal" to) both the estimator X̂ and the data Y.
2)  K_X̂Y = K_XY :  the estimator X̂ and the RV X have the same correlation with the measurements Y.
3)  The distributions for X̂ and e satisfy a "Pythagorean right-triangle relationship":
        X̂ ~ N(µ_X, Q) ,  Q ≡ K_XY K_YY^{-1} K_YX
        e ~ N(0, P) ,    P ≡ K_XX - K_XY K_YY^{-1} K_YX
        X = X̂ + e ,     X ~ N(µ_X, K_XX)
    Gaussian means & variances add:  N(µ_X, K_XX) = N(µ_X, Q) + N(0, P)
Recall for scalar X & Y:  Y = ρX + V  gives  N(0,1) = N(0, ρ²) + N(0, 1-ρ²)

The conditional mean is evaluated for a specific "realization" of the Gaussian RVs X = "x" and Y = "y", and hence looking at many realizations allows us to consider the conditional mean µX|Y as a random variable itself. Thus we replace the specific realizations µX|Y and "y" in the update equation by RVs, denoted respectively X-hat and Y, as shown in the first equation. The difference between the true state X and the conditional mean estimate X-hat is then an RV representing the estimation error, e = X - X-hat, as shown in the second equation. These two equations can be shown to have the following remarkable properties: 1) the error is uncorrelated with either the estimator X-hat or the data Y; 2) the estimator X-hat and the true state X correlate with the measurements in the same way; and 3) the distributions of the RVs X-hat and e satisfy a "Pythagorean right-triangle relationship" between their Gaussian designations. Looking at the figure: the true state X ~ N(µX, KXX) lies on the hypotenuse, the estimator X-hat ~ N(µX, Q) with Q = KXY KYY-1 KYX lies in the plane, and the error e ~ N(0, P) with P = KXX - KXY KYY-1 KYX is perpendicular to the plane. The vector relation X = X-hat + e forms the right triangle, and the means and variances add so that µX = µX + 0 and KXX = Q + P = (KXY KYY-1 KYX) + (KXX - KXY KYY-1 KYX). For the normal distributions this may be written in the suggestive form N(µX, KXX) = N(µX, Q) + N(0, P). Also recall that this relationship showed up for the scalar case of a single input X and single output Y in the form Y = ρX + V (where V plays the role of the error/noise, e = Y - ρX): N(0,1) = N(0, ρ²) + N(0, 1-ρ²). 164
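A minimal MATLAB Monte Carlo sketch of properties 1) and 3) for a scalar input/output pair (all numeric values are illustrative assumptions): the error is orthogonal to the estimator and the data, and the variances add as Q + P = KXX:

  % Minimal sketch: orthogonality of the error and the Pythagorean variance split.
  muX = 1; muY = 0; KXX = 4; KXY = 1.5; KYY = 2;   % illustrative scalar case
  N = 1e6;
  L = chol([KXX KXY; KXY KYY], 'lower');           % jointly Gaussian (X, Y) samples
  Z = [muX; muY]*ones(1, N) + L*randn(2, N);
  X = Z(1,:);  Y = Z(2,:);
  Xhat = muX + KXY*(Y - muY)/KYY;                  % Gauss-Markov estimator
  e = X - Xhat;                                    % estimation error
  cov_e_Xhat = mean(e .* (Xhat - muX))             % ~0: e orthogonal to Xhat
  cov_e_Y    = mean(e .* (Y - muY))                % ~0: e orthogonal to data Y
  QplusP     = var(Xhat) + var(e)                  % -> KXX = Q + P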
To learn more, please attend this ATI course.

Please post your comments and questions to our blog:
http://www.aticourses.com/blog/

Sign up for ATI's monthly Course Schedule Updates:
http://www.aticourses.com/email_signup_page.html