Probability Review
Thursday Sep 13
Probability Review
• Events and Event spaces
• Random variables
• Joint probability distributions
• Marginalization, conditioning, chain rule,
Bayes Rule, law of total probability, etc.
• Structural properties
• Independence, conditional independence
• Mean and Variance
• The big picture
• Examples
Sample space and Events
• Sample Space Ω: the set of possible results of an experiment
• If you toss a coin twice, Ω = {HH, HT, TH, TT}
• Event: a subset of Ω
• First toss is heads = {HH, HT}
• S: event space, a set of events
• Closed under finite union and complement
• Entails other binary operations: union, difference, etc.
• Contains the empty event and Ω
Probability Measure
• Defined over (Ω, S) s.t.
• P(α) ≥ 0 for all α in S
• P(Ω) = 1
• If α, β are disjoint, then
• P(α U β) = P(α) + P(β)
• We can deduce other axioms from the above ones
• Ex: P(α U β) for non-disjoint events:
P(α U β) = P(α) + P(β) – P(α ∩ β)
Visualization
• We can go on and define conditional
probability, using the above visualization
Conditional Probability
P(F|H) = Fraction of worlds in which H is true that also
have F true
P(F | H) = P(F ∩ H) / P(H)
Rule of total probability
[Diagram: an event A overlapping a partition B1, …, B7 of the sample space]

p(A) = Σi P(Bi) P(A | Bi)
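A minimal Python sketch (not part of the original slides) of conditional probability and the rule of total probability on a made-up joint table; all numbers are illustrative.

import numpy as np

# Toy joint distribution P(A, B) over a binary event A and a partition B1..B3.
# Rows: A false / A true; columns: B1, B2, B3. Numbers are illustrative and sum to 1.
joint = np.array([[0.10, 0.25, 0.15],
                  [0.20, 0.05, 0.25]])

P_B = joint.sum(axis=0)            # P(Bi), marginalizing out A
P_A_given_B = joint[1] / P_B       # P(A | Bi) = P(A, Bi) / P(Bi)

# Law of total probability: P(A) = sum_i P(Bi) P(A | Bi)
P_A_total = np.sum(P_B * P_A_given_B)
print(P_A_total, joint[1].sum())   # both give the same P(A)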
From Events to Random Variable
• For almost all of the semester we will be dealing with RVs
• A concise way of specifying attributes of outcomes
• Modeling students (Grade and Intelligence):
• Ω: all possible students
• What are events
• Grade_A = all students with grade A
• Grade_B = all students with grade B
• Intelligence_High = … with high intelligence
• Very cumbersome
• We need “functions” that map from Ω to an
attribute space.
• P(G = A) = P({student ϵ Ω : G(student) = A})
Random Variables
[Diagram: Ω mapped by I: Intelligence (high, low) and G: Grade (A+, A, B)]
P(I = high) = P( {all students whose intelligence is high})
Discrete Random Variables
• Random variables (RVs) which may take on
only a countable number of distinct values
– E.g. the total number of tails X you get if you flip
100 coins
• X is a RV with arity k if it can take on exactly
one value out of {x1, …, xk}
– E.g. the possible values that X can take on are 0, 1,
2, …, 100
Probability of Discrete RV
• Probability mass function (pmf): P(X = xi)
• Easy facts about pmf
 Σi P(X = xi) = 1
 P(X = xi∩X = xj) = 0 if i ≠ j
 P(X = xi U X = xj) = P(X = xi) + P(X = xj) if i ≠ j
 P(X = x1 U X = x2 U … U X = xk) = 1
Common Distributions
• Uniform: X ∼ U[1, …, N]
– X takes values 1, 2, …, N
– P(X = i) = 1/N
– E.g. picking balls of different colors from a box
• Binomial: X ∼ Bin(n, p)
– X takes values 0, 1, …, n
– P(X = i) = C(n, i) p^i (1 − p)^(n−i)
– E.g. coin flips
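A small sketch of the binomial pmf using scipy.stats; n and p below are arbitrary illustrative values.

import numpy as np
from scipy.stats import binom

n, p = 10, 0.3                     # illustrative parameters
k = np.arange(n + 1)
pmf = binom.pmf(k, n, p)           # P(X = k) = C(n, k) p^k (1-p)^(n-k)
print(pmf.sum())                   # the pmf sums to 1 (up to float error)
print(binom.pmf(3, n, p))          # probability of exactly 3 successes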
Continuous Random Variables
• Probability density function (pdf) instead of
probability mass function (pmf)
• A pdf is any function f(x) that describes the
probability density in terms of the input
variable x.
Probability of Continuous RV
• Properties of pdf
– f(x) ≥ 0 for all x
– ∫ f(x) dx = 1, integrating over all x
• Actual probability can be obtained by taking the integral of the pdf
– E.g. the probability of X being between 0 and 1 is
P(0 ≤ X ≤ 1) = ∫₀¹ f(x) dx
Cumulative Distribution Function
• FX(v) = P(X ≤ v)
• Discrete RVs
– FX(v) = Σ{vi ≤ v} P(X = vi)
• Continuous RVs
– FX(v) = ∫ f(x) dx, integrating from −∞ to v
– d/dx FX(x) = f(x)
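A quick numerical sketch, assuming a standard normal pdf purely for illustration, showing that integrating the pdf over [0, 1] matches the CDF difference F(1) − F(0).

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Take f to be the standard normal pdf (any valid pdf would do).
f = norm.pdf

# P(0 <= X <= 1) as the integral of the pdf from 0 to 1
prob, _ = quad(f, 0.0, 1.0)

# The same probability via the CDF: F(1) - F(0)
print(prob, norm.cdf(1.0) - norm.cdf(0.0))   # the two values agree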
Common Distributions
• Normal: X ∼ N(μ, σ²)
– f(x) = (1 / (σ √(2π))) exp(−(x − μ)² / (2σ²))
– E.g. the height of the entire population
Multivariate Normal
• Generalization to higher dimensions of the
one-dimensional normal
• f(x1, …, xd) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp(−(1/2) (x − μ)^T Σ^(−1) (x − μ))
where μ is the mean vector and Σ is the covariance matrix
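A short sketch (illustrative mean vector and covariance matrix, not from the slides) evaluating the multivariate normal density both from the closed-form expression above and with scipy.

import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])                    # mean vector (illustrative)
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])               # covariance matrix (illustrative)

x = np.array([0.5, 0.5])

# Density from the closed-form expression on the slide
d = len(mu)
diff = x - mu
dens = (np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)
        / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma)))

# Same value from scipy
print(dens, multivariate_normal.pdf(x, mean=mu, cov=Sigma))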
Probability Review
• Events and Event spaces
• Random variables
• Joint probability distributions
• Marginalization, conditioning, chain rule,
Bayes Rule, law of total probability, etc.
• Structural properties
• Independence, conditional independence
• Mean and Variance
• The big picture
• Examples
Joint Probability Distribution
• Random variables encode attributes
• Not all possible combinations of attributes are equally
likely
• Joint probability distributions quantify this
• P( X= x, Y= y) = P(x, y)
• Generalizes to N-RVs
• Σx Σy P(X = x, Y = y) = 1
• ∫x ∫y fX,Y(x, y) dx dy = 1
Chain Rule
• Always true
• P(x, y, z) = p(x) p(y|x) p(z|x, y)
= p(z) p(y|z) p(x|y, z)
=…
Conditional Probability
P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)
But we will always write it this way:
P(x | y) = p(x, y) / p(y)
(where x and y stand for the events X = x and Y = y)
Marginalization
• We know p(X, Y), what is P(X=x)?
• We can use the law of total probability, why?
p(x) = Σy P(x, y) = Σy P(y) P(x | y)

[Diagram: event A overlapping a partition B1, …, B7, as on the total probability slide]
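A small numpy sketch of marginalization on a made-up joint table, computing p(x) both as Σy P(x, y) and as Σy P(y) P(x | y).

import numpy as np

# Toy joint P(X, Y): rows index x, columns index y (illustrative numbers, sum to 1).
P_xy = np.array([[0.10, 0.20, 0.05],
                 [0.30, 0.15, 0.20]])

# Marginalization: p(x) = sum_y P(x, y)
P_x = P_xy.sum(axis=1)

# Equivalently, p(x) = sum_y P(y) P(x | y)
P_y = P_xy.sum(axis=0)
P_x_given_y = P_xy / P_y                      # columns are P(x | y)
print(P_x, (P_x_given_y * P_y).sum(axis=1))   # same marginal both ways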
Marginalization Cont.
• Another example
p(x) = Σy Σz P(x, y, z) = Σy Σz P(y, z) P(x | y, z)
Bayes Rule
• We know that P(rain) = 0.5
• If we also know that the grass is wet, how
does this affect our belief about whether it
rained or not?

P(rain | wet) = P(rain) P(wet | rain) / P(wet)

P(x | y) = P(x) P(y | x) / P(y)
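A tiny sketch of Bayes rule on the rain/wet example; the slide only gives P(rain) = 0.5, so the likelihoods P(wet | rain) and P(wet | no rain) below are assumed values for illustration.

# Bayes rule on the rain / wet-grass example.
P_rain = 0.5                       # from the slide
P_wet_given_rain = 0.9             # assumed likelihoods, illustration only
P_wet_given_dry = 0.2

# P(wet) by the law of total probability
P_wet = P_wet_given_rain * P_rain + P_wet_given_dry * (1 - P_rain)

# P(rain | wet) = P(rain) P(wet | rain) / P(wet)
P_rain_given_wet = P_rain * P_wet_given_rain / P_wet
print(P_rain_given_wet)            # ~0.82: wet grass raises our belief in rain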
Bayes Rule cont.
• You can condition on more variables
P(x | y, z) = P(x | z) P(y | x, z) / P(y | z)
Probability Review
• Events and Event spaces
• Random variables
• Joint probability distributions
• Marginalization, conditioning, chain rule,
Bayes Rule, law of total probability, etc.
• Structural properties
• Independence, conditional independence
• Mean and Variance
• The big picture
• Examples
Independence
• X is independent of Y means that knowing Y
does not change our belief about X.
• P(X|Y=y) = P(X)
• P(X=x, Y=y) = P(X=x) P(Y=y)
• The above should hold for all x, y
• It is symmetric and written as X ⊥ Y
Independence
• X1, …, Xn are independent if and only if
• If X1, …, Xn are independent and identically
distributed we say they are iid (or that they
are a random sample) and we write
P(X1 ∈ A1, …, Xn ∈ An) = ∏i=1…n P(Xi ∈ Ai)
X1, …, Xn ∼ P
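A short sketch checking the factorization numerically: a joint table built as a product of marginals (illustrative numbers) satisfies P(X = x, Y = y) = P(X = x) P(Y = y).

import numpy as np

# Joint table for two binary RVs built as a product of marginals,
# so X and Y are independent by construction (numbers illustrative).
P_x = np.array([0.3, 0.7])
P_y = np.array([0.6, 0.4])
P_xy = np.outer(P_x, P_y)          # P(X=x, Y=y) = P(X=x) P(Y=y)

# Independence check: the joint equals the product of its own marginals
marg_x = P_xy.sum(axis=1)
marg_y = P_xy.sum(axis=0)
print(np.allclose(P_xy, np.outer(marg_x, marg_y)))   # True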
CI: Conditional Independence
• RVs are rarely independent, but we can still
leverage local structural properties like
Conditional Independence.
• X ⊥ Y | Z if once Z is observed, knowing the
value of Y does not change our belief about X
• P(rain ⊥ sprinkler’s on | cloudy)
• P(rain ⊥ sprinkler’s on | wet grass)
Conditional Independence
• P(X=x | Z=z, Y=y) = P(X=x | Z=z)
• P(Y=y | Z=z, X=x) = P(Y=y | Z=z)
• P(X=x, Y=y | Z=z) = P(X=x| Z=z) P(Y=y| Z=z)
We call these factors: a very useful concept!
Probability Review
• Events and Event spaces
• Random variables
• Joint probability distributions
• Marginalization, conditioning, chain rule,
Bayes Rule, law of total probability, etc.
• Structural properties
• Independence, conditional independence
• Mean and Variance
• The big picture
• Examples
Mean and Variance
• Mean (Expectation): μ = E[X]
– Discrete RVs: E[X] = Σi vi P(X = vi)
– Continuous RVs: E[X] = ∫ x f(x) dx
• Expectation of a function g(X):
– Discrete RVs: E(g(X)) = Σi g(vi) P(X = vi)
– Continuous RVs: E(g(X)) = ∫ g(x) f(x) dx
Mean and Variance
• Variance:
– Discrete RVs: V[X] = Σi (vi − μ)² P(X = vi)
– Continuous RVs: V[X] = ∫ (x − μ)² f(x) dx
– Equivalently: Var(X) = E((X − μ)²) = E(X²) − μ²
• Covariance:
Cov(X, Y) = E((X − μx)(Y − μy)) = E(XY) − μx μy
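A quick simulation sketch (synthetic data, arbitrary parameters) checking the identities Var(X) = E[(X − μ)²] = E[X²] − μ² and Cov(X, Y) = E[XY] − μx μy.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=100_000)   # illustrative sample
y = 0.5 * x + rng.normal(size=100_000)

mu = x.mean()
print(x.var(), np.mean((x - mu) ** 2))             # Var(X) = E[(X - mu)^2]
print(x.var(), np.mean(x ** 2) - mu ** 2)          # Var(X) = E[X^2] - mu^2
print(np.mean(x * y) - x.mean() * y.mean(),        # Cov = E[XY] - mu_x mu_y
      np.cov(x, y, bias=True)[0, 1])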
Mean and Variance
• Correlation:
ρ(X, Y) = Cov(X, Y) / (σx σy)
−1 ≤ ρ(X, Y) ≤ 1
Properties
• Mean
– E[X + Y] = E[X] + E[Y]
– E[aX] = a E[X]
– If X and Y are independent, E[XY] = E[X] E[Y]
• Variance
– V[aX + b] = a² V[X]
– If X and Y are independent,
V(X + Y) = V(X) + V(Y)
Some more properties
• The conditional expectation of Y given X when
the value of X = x is:
• The Law of Total Expectation or Law of
Iterated Expectation:

E[Y | X = x] = ∫ y p(y | x) dy

E[Y] = E[E[Y | X]] = ∫ E[Y | X = x] p(x) dx
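A small discrete sketch (made-up joint table) verifying the law of total expectation, E[Y] = E[E[Y | X]].

import numpy as np

# Discrete toy joint P(X, Y) with X in {0,1} (rows) and Y in {1,2,3} (columns).
P_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])
y_vals = np.array([1.0, 2.0, 3.0])

P_x = P_xy.sum(axis=1)
E_Y_given_x = (P_xy * y_vals).sum(axis=1) / P_x    # E[Y | X = x]

# Law of total expectation: E[Y] = E_X[ E[Y | X] ]
print((P_xy.sum(axis=0) * y_vals).sum(),           # E[Y] directly
      (P_x * E_Y_given_x).sum())                   # iterated expectation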
Some more properties
• The Law of Total Variance:
Var(Y) = Var(E(Y | X)) + E(Var(Y | X))
Probability Review
• Events and Event spaces
• Random variables
• Joint probability distributions
• Marginalization, conditioning, chain rule,
Bayes Rule, law of total probability, etc.
• Structural properties
• Independence, conditional independence
• Mean and Variance
• The big picture
• Examples
The Big Picture
[Diagram: Model → Data via Probability; Data → Model via Estimation/learning]
Statistical Inference
• Given observations from a model
– What (conditional) independence assumptions
hold?
• Structure learning
– If you know the family of the model (e.g.,
multinomial), what are the values of the
parameters? MLE, Bayesian estimation.
• Parameter learning
Probability Review
• Events and Event spaces
• Random variables
• Joint probability distributions
• Marginalization, conditioning, chain rule,
Bayes Rule, law of total probability, etc.
• Structural properties
• Independence, conditional independence
• Mean and Variance
• The big picture
• Examples
Monty Hall Problem
• You're given the choice of three doors: Behind one
door is a car; behind the others, goats.
• You pick a door, say No. 1
• The host, who knows what's behind the doors, opens
another door, say No. 3, which has a goat.
• Do you want to pick door No. 2 instead?
[Diagram: game tree after you pick door 1 — if the car is behind door 2 or 3 the host must reveal the single remaining goat; if the car is behind door 1 the host reveals Goat A or Goat B]
Monty Hall Problem: Bayes Rule
• Ci: the car is behind door i, i = 1, 2, 3
• P(Ci) = 1/3
• Hij: the host opens door j after you pick door i
• P(Hij | Ck) =
0    if i = j or j = k
1/2  if i = k
1    if i ≠ k and j ≠ k
Monty Hall Problem: Bayes Rule cont.
• WLOG, i=1, j=3
• P(C1 | H13) = P(H13 | C1) P(C1) / P(H13)
• P(H13 | C1) P(C1) = (1/2) × (1/3) = 1/6
Monty Hall Problem: Bayes Rule cont.
• P(H13) = P(H13, C1) + P(H13, C2) + P(H13, C3)
= P(H13 | C1) P(C1) + P(H13 | C2) P(C2)
= 1/6 + 1/3 = 1/2
• P(C1 | H13) = (1/6) / (1/2) = 1/3
Monty Hall Problem: Bayes Rule cont.
• P(C1 | H13) = (1/6) / (1/2) = 1/3
• P(C2 | H13) = 2/3 > P(C1 | H13) = 1/3
⇒ You should switch!
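A simulation sketch of the Monty Hall game (door indices and number of trials are arbitrary) confirming that staying wins about 1/3 of the time and switching about 2/3.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
wins_stay, wins_switch = 0, 0

for _ in range(n):
    car = rng.integers(3)                  # door hiding the car
    pick = rng.integers(3)                 # your initial pick
    # Host opens a door that is neither your pick nor the car
    host = next(d for d in range(3) if d != pick and d != car)
    # Switching means taking the door that is neither your pick nor the host's
    switch = next(d for d in range(3) if d != pick and d != host)
    wins_stay += (pick == car)
    wins_switch += (switch == car)

print(wins_stay / n, wins_switch / n)      # ~1/3 vs ~2/3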
   
Information Theory
• P(X) encodes our uncertainty about X
• Some variables are more uncertain than others
• How can we quantify this intuition?
• Entropy: average number of bits required to encode X
[Figure: two example distributions P(X) and P(Y) with different amounts of uncertainty]

HP(X) = Ep[log 1/P(x)] = Σx P(x) log 1/P(x) = −Σx P(x) log P(x)
Information Theory cont.
• Entropy: average number of bits required to encode X
• We can define conditional entropy similarly
• i.e. once Y is known, we only need H(X,Y) – H(Y) bits
• We can also define a chain rule for entropies (not surprising)

HP(X | Y) = Ep[log 1/p(x | y)] = HP(X, Y) − HP(Y)

HP(X, Y, Z) = HP(X) + HP(Y | X) + HP(Z | X, Y)
Mutual Information: MI
• Remember independence?
• If X ⊥ Y then knowing Y won’t change our belief about X
• Mutual information can help quantify this! (not the only
way though)
• MI:
• “The amount of uncertainty in X which is removed by
knowing Y”
• Symmetric
• I(X; Y) = 0 iff X and Y are independent!

IP(X; Y) = HP(X) − HP(X | Y)

I(X; Y) = Σx Σy p(x, y) log [ p(x, y) / (p(x) p(y)) ]
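A short sketch computing entropy and mutual information for a made-up joint table, using H(X | Y) = H(X, Y) − H(Y) from the previous slide and checking that it agrees with the direct sum over the joint.

import numpy as np

# Toy joint P(X, Y) (illustrative numbers, sum to 1).
P_xy = np.array([[0.30, 0.10],
                 [0.10, 0.50]])
P_x = P_xy.sum(axis=1)
P_y = P_xy.sum(axis=0)

def H(p):
    """Entropy in bits: -sum p log2 p (ignoring zero entries)."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Mutual information two ways: H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y), and the direct sum.
I_via_entropy = H(P_x) + H(P_y) - H(P_xy.ravel())
I_direct = (P_xy * np.log2(P_xy / np.outer(P_x, P_y))).sum()
print(I_via_entropy, I_direct)                      # the two agree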
Chi Square Test for Independence
(Example)
Republican Democrat Independent Total
Male 200 150 50 400
Female 250 300 50 600
Total 450 450 100 1000
• State the hypotheses
H0: Gender and voting preferences are independent.
Ha: Gender and voting preferences are not independent.
• Choose significance level
Say, 0.05
Chi Square Test for Independence
• Analyze sample data
• Degrees of freedom =
(|g| − 1) * (|v| − 1) = (2 − 1) * (3 − 1) = 2
• Expected frequency count =
Eg,v = (ng * nv) / n
Em,r = (400 * 450) / 1000 = 180000/1000 = 180
Em,d= (400 * 450) / 1000 = 180000/1000 = 180
Em,i = (400 * 100) / 1000 = 40000/1000 = 40
Ef,r = (600 * 450) / 1000 = 270000/1000 = 270
Ef,d = (600 * 450) / 1000 = 270000/1000 = 270
Ef,i = (600 * 100) / 1000 = 60000/1000 = 60
Republican Democrat Independent Total
Male 200 150 50 400
Female 250 300 50 600
Total 450 450 100 1000
Chi Square Test for Independence
• Chi-square test statistic
• Χ² = Σg,v (Og,v − Eg,v)² / Eg,v
• Χ² = (200 − 180)²/180 + (150 − 180)²/180 + (50 − 40)²/40 +
(250 − 270)²/270 + (300 − 270)²/270 + (50 − 60)²/60
• Χ² = 400/180 + 900/180 + 100/40 + 400/270 + 900/270 + 100/60
• Χ² = 2.22 + 5.00 + 2.50 + 1.48 + 3.33 + 1.67 = 16.2
Republican Democrat Independent Total
Male 200 150 50 400
Female 250 300 50 600
Total 450 450 100 1000
Chi Square Test for Independence
• P-value
– Probability of observing a sample statistic as
extreme as the test statistic
– P(Χ² ≥ 16.2) = 0.0003
• Since the P-value (0.0003) is less than the
significance level (0.05), we reject the null
hypothesis
• There is a relationship between gender and
voting preference
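The same test can be run in one call with scipy.stats.chi2_contingency; a sketch using the observed counts from the table above.

import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the slide (rows: Male, Female; cols: Rep, Dem, Ind).
observed = np.array([[200, 150, 50],
                     [250, 300, 50]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, dof, p_value)   # ~16.2 with 2 degrees of freedom, p ~ 0.0003
print(expected)             # the expected counts computed on the earlier slide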
Acknowledgment
• Carlos Guestrin recitation slides:
http://www.cs.cmu.edu/~guestrin/Class/10708/recitations/r1/Probability_and_St
atistics_Review.ppt
• Andrew Moore Tutorial:
http://www.autonlab.org/tutorials/prob.html
• Monty hall problem:
http://en.wikipedia.org/wiki/Monty_Hall_problem
• http://www.cs.cmu.edu/~guestrin/Class/10701-F07/recitation_schedule.html
• Chi-square test for independence
http://stattrek.com/chi-square-test/independence.aspx
