This document provides an overview of Bayesian networks and probabilistic graphical models (PGMs). It outlines the goals of learning how to build graphical models using graph theory and perform inference under uncertainty using probability theory. It also lists some example PGM models like Markov random fields, hidden Markov models, dynamic Bayesian networks, naive Bayes models, and applications in computer vision. Finally, it provides the table of contents and references for further self-study on PGMs and Bayesian networks.
This document provides an introduction to Bayesian networks. It begins by explaining Bayesian networks using a medical example about determining the likelihood a patient has anthrax given various observed symptoms. It then provides a probability primer covering random variables, conditional probability, and independence. The document defines Bayesian networks as consisting of a directed acyclic graph and conditional probability tables at each node. It explains how Bayesian networks compactly represent joint probability distributions and allow for inference queries. The challenges of exact versus approximate inference in large networks are also noted.
Artificial Intelligence (AI), specifically deep learning, is revolutionizing industries, products, and core capabilities by delivering dramatically enhanced experiences. However, the deep neural networks of today use too much memory, compute, and energy. Plus, to make AI truly ubiquitous, networks need to run on the end device within a tight power and thermal budget. One approach to help address these issues is quantization, which attempts to reduce the number of bits used for weight parameters and activation calculations without sacrificing model accuracy. This presentation covers: why quantization is important, existing quantization challenges, Qualcomm AI Research's existing quantization research, and how developers and researchers can take advantage of quantization on Qualcomm Snapdragon.
This document contains notes from a machine learning discussion. It includes:
1. An introduction to BakFoo Inc. CEO Yuta Kashino's background in astrophysics, Python, and realtime data platforms.
2. References to papers and researchers in Bayesian deep learning and probabilistic programming, including Edward library creators Dustin Tran and Blei Lab.
3. An overview of how Edward combines TensorFlow for deep learning with probabilistic programming to perform Bayesian modeling, inference via VI and MCMC, and criticisms.
Learning to Balance: Bayesian Meta-Learning for Imbalanced and Out-of-distrib... (MLAI2)
While tasks in realistic settings may come with varying numbers of instances and classes, existing meta-learning approaches for few-shot classification assume that the number of instances per task and class is fixed. Due to this restriction, they learn to utilize the meta-knowledge equally across all tasks, even when the number of instances per task and class varies widely. Moreover, they do not consider the distributional difference in unseen tasks, on which the meta-knowledge may be less useful depending on task relatedness. To overcome these limitations, we propose a novel meta-learning model that adaptively balances the effect of meta-learning and task-specific learning within each task. Through learning of the balancing variables, we can decide whether to obtain a solution by relying on the meta-knowledge or on task-specific learning. We formulate this objective in a Bayesian inference framework and tackle it using variational inference. We validate our Bayesian Task-Adaptive Meta-Learning (Bayesian TAML) on two realistic task- and class-imbalanced datasets, on which it significantly outperforms existing meta-learning approaches. A further ablation study confirms the effectiveness of each balancing component and of the Bayesian learning framework.
This document provides an overview of pattern recognition techniques. It begins with an introduction to pattern recognition and its applications. It then outlines the syllabus, which includes topics like design principles, statistical pattern recognition, parameter estimation methods, principal component analysis, linear discriminant analysis, and classification techniques. Under each topic, it provides further details and explanations.
This document discusses causal discovery and its application to analyzing predictive models. It introduces causal discovery as the unsupervised learning of causal relations from data to estimate causal structures like directed acyclic graphs under certain assumptions. The document then discusses using causal discovery to analyze the mechanisms of predictive models by combining causal models with predictive models to model how interventions on features affect predictions. An example using an auto MPG dataset demonstrates how this approach can suggest which variable has the greatest intervention effect on MPG predictions.
[DL輪読会] Domain Adaptive Faster R-CNN for Object Detection in the Wild (Deep Learning JP)
The document discusses domain adaptive faster R-CNN for object detection. It proposes a method to adapt a model trained on labeled data from a source domain to detect objects in an unlabeled target domain. The method uses an end-to-end deep learning model with two stages. First, it reduces differences in image distributions between the source and target domains. Then it performs object detection on the target domain images using the adapted model.
This document discusses feature extraction and selection methods for principal component analysis. It provides an introduction to principal component analysis and how it can be used for dimensionality reduction by transforming correlated variables into a set of uncorrelated variables. The document serves as a tutorial on feature extraction, selection, and principal component analysis.
The document discusses different types of generative models including auto-regressive models, variational auto-encoders, and generative adversarial networks. It provides examples of each type of model and highlights some of their features and issues during training. Specific models discussed in more detail include PixelRNNs, DCGANs, WGANs, BEGANs, Pix2Pix, and CycleGANs. The document aims to introduce deep generative models and their applications.
This tutorial extensively covers the definitions, nuances, challenges, and requirements for the design of interpretable and explainable machine learning models and systems in healthcare. We discuss many uses in which interpretable machine learning models are needed in healthcare and how they should be deployed. Additionally, we explore the landscape of recent advances to address the challenges of model interpretability in healthcare and also describe how one would go about choosing the right interpretable machine learning algorithm for a given problem in healthcare.
This document summarizes recent advances in human pose estimation using deep learning methods. It first discusses traditional approaches like pictorial structures. It then covers several deep learning methods including global/holistic view using joint regression, local appearance using body part detection, and combining global and local information. Other methods discussed are using motion features and pose estimation in videos. Evaluation metrics like PCP and PDJ are also introduced. The document outlines many key papers in this area and provides examples of network architectures and results.
This document discusses a framework for mix-and-match tuning for self-supervised semantic segmentation. It proposes training with a proxy task of colorization before semantic segmentation to learn better representations. However, colorization alone may not discriminate high-level semantics well. The proposed method addresses this by taking features from colorization and mixing and matching local patches with unique labels in a graph-based framework for semantic segmentation. Evaluation shows improved mean IoU and per-class IoU over classic self-taught learning approaches.
Vector quantization maps high-dimensional vectors to codewords from a finite codebook. Each codeword defines a Voronoi region containing vectors closest to that codeword. The Lloyd and LBG algorithms are commonly used to optimize the codebook for a given dataset by iteratively clustering vectors and recomputing codeword averages. Tree-structured vector quantization improves efficiency by recursively partitioning the codebook into binary groups defined by test vectors. This reduces the number of distance comparisons needed at the cost of potential increases in distortion and storage requirements.
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/sep-2019-alliance-vitf-facebook
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Raghuraman Krishnamoorthi, Software Engineer at Facebook, delivers the presentation "Quantizing Deep Networks for Efficient Inference at the Edge" at the Embedded Vision Alliance's September 2019 Vision Industry and Technology Forum. Krishnamoorthi gives an overview of practical deep neural network quantization techniques and tools.
Part 2 of the Deep Learning Fundamentals Series, this session discusses Tuning Training (including hyperparameters, overfitting/underfitting), Training Algorithms (including different learning rates, backpropagation), Optimization (including stochastic gradient descent, momentum, Nesterov Accelerated Gradient, RMSprop, Adaptive algorithms - Adam, Adadelta, etc.), and a primer on Convolutional Neural Networks. The demos included in these slides are running on Keras with TensorFlow backend on Databricks.
Transfer Learning and Fine-tuning Deep Neural Networks (PyData)
This document outlines Anusua Trivedi's talk on transfer learning and fine-tuning deep neural networks. The talk covers traditional machine learning versus deep learning, using deep convolutional neural networks (DCNNs) for image analysis, transfer learning and fine-tuning DCNNs, recurrent neural networks (RNNs), and case studies applying these techniques to diabetic retinopathy prediction and fashion image caption generation.
Continual Learning is one of the most promising research areas for shifting machine learning from solving a single task to something closer to general intelligence.
Machine learning (and especially deep neural network research) has shown outstanding results in the past 10 years, bringing us to the deep learning era, where learning models are everywhere and interact with many aspects of our lives.
However, machine learning has an enormous issue that completely separates it from biological learning: machines cannot learn continuously.
This is the so-called catastrophic forgetting problem, and continual learning tries to address it, making artificial intelligence able to learn continually for the entire duration of its "life".
Speaker: Taesung Park (Ph.D. student, UC Berkeley)
Date: June 2017
Taesung Park is a Ph.D. student at UC Berkeley in AI and computer vision, advised by Prof. Alexei Efros.
His research interest lies between computer vision and computational photography, such as generating realistic images or enhancing photo quality. He received a B.S. in mathematics and an M.S. in computer science from Stanford University.
Abstract:
Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs.
However, for many tasks, paired training data will not be available.
We present an approach for learning to translate an image from a source domain X to a target domain Y in the absence of paired examples.
Our goal is to learn a mapping G: X → Y such that the distribution of images from G(X) is indistinguishable from the distribution Y using an adversarial loss.
Because this mapping is highly under-constrained, we couple it with an inverse mapping F: Y → X and introduce a cycle consistency loss to push F(G(X)) ≈ X (and vice versa).
Qualitative results are presented on several tasks where paired training data does not exist, including collection style transfer, object transfiguration, season transfer, photo enhancement, etc.
Quantitative comparisons against several prior methods demonstrate the superiority of our approach.
The document discusses pattern recognition and classification. It begins by defining pattern recognition as a method for determining what something is based on data like images, audio, or text. It then provides examples of common types of pattern recognition like image recognition and speech recognition. It notes that while pattern recognition comes easily to humans, it can be difficult for computers which lack abilities like unconscious, high-speed, high-accuracy recognition. The document then discusses the basic principle of computer-based pattern recognition as classifying inputs into predefined classes based on their similarity to training examples.
This document discusses Gaussian mixture models (GMMs) and their use in applications like speaker recognition and language identification. GMMs represent a probability density function as a weighted sum of Gaussian distributions. GMM parameters are estimated from training data using Expectation-Maximization or Maximum A Posteriori estimation. GMMs are computationally inexpensive and well-suited for text-independent tasks without strong prior knowledge of content.
This document provides an overview of Bayesian networks. It defines Bayesian networks as acyclic directed graphs combined with a joint probability distribution. Each node represents a variable, and edges represent conditional dependencies between variables. The document discusses how Bayesian networks can be used to model complex processes and learn from observations. It provides examples of different types of network structures and conditional dependencies. The document also describes software for working with Bayesian networks and gives an example of how a Bayesian network was developed to help doctors determine the optimal treatment for stomach lymphoma.
This material, prepared by Nishida Geio of Osaka University, summarizes normalization techniques.
It starts with why normalization is needed,
then explains the equations behind Batch, Weight, and Layer Normalization,
and finishes with a well-organized comparison of the three methods.
The explanation of how training proceeds uses the Fisher Information Matrix, so it will likely be of interest only to those who want to study the topic in depth.
This document discusses feature selection concepts and methods. It defines features as attributes that determine which class an instance belongs to. Feature selection aims to select a relevant subset of features by removing irrelevant, redundant and unnecessary data. This improves learning accuracy, model performance and interpretability. The document categorizes feature selection algorithms as filter, wrapper or embedded methods based on how they evaluate feature subsets. It also discusses concepts like feature relevance, search strategies, successor generation and evaluation measures used in feature selection algorithms.
The document is a presentation about monocular human pose estimation using Bayesian networks. It includes:
- An outline with sections on introduction, approach overview, model learning, pose estimation, feature extraction, experiments and conclusions.
- Discussion of applications of human motion capture such as animation, games, medical diagnosis and visual surveillance.
- Comparison of different sensor approaches for human pose estimation including active markers, passive markers and markerless methods using cameras.
- Description of the proposed approach which uses Bayesian networks to represent the articulated human body and estimate 2D and 3D joint positions through representation, learning and inference steps.
This document discusses approximate inference in Bayesian networks using sampling methods. It introduces random number generation, which is important for sampling algorithms. Random number generators in programming languages typically generate uniform random numbers, but different distributions are needed for sampling Bayesian networks. The document covers generating random numbers from univariate and multivariate distributions to estimate probabilities for approximate inference in Bayesian networks.
The document discusses probabilistic inference over time using Bayesian networks. It introduces the concepts of temporal models and the four types of inference in such models: filtering, prediction, smoothing, and most likely explanation. It outlines the goals of learning uncertainty in temporal models and examining hidden Markov models, Kalman filtering, particle filtering, and dynamic Bayesian networks. The document provides an overview of its structure and references related background units on probabilistic graphical models and inference.
This document introduces Bayesian networks and uncertainty inference with discrete variables. It discusses the goal of reviewing advanced statistical concepts like statistical inference and pattern recognition. The contents cover topics like acting under uncertainty, basic probability, marginal probability, inference using full joint distributions, independence, and Bayes' rule. Self-study materials on related topics are also referenced.
The document discusses exact inference in Bayesian networks. It begins by stating that the goal is to efficiently compute the sum-product form of the inference formula. It then lists some related topics that will be covered in subsequent units, such as approximate inference algorithms. The document outlines the structure of the related lecture notes, which will cover topics like variable elimination, belief propagation, and junction trees for exact inference. It also provides references for further self-study on probabilistic inference in graphical models.
This document provides an overview of statistics concepts for image processing and pattern recognition. It reviews key statistical measures including histograms, measures of central tendency (mean, median, mode), variance, frequency distributions, covariance, correlation, and charts/graphs. The goal is to review basic statistics concepts that will be useful for subsequent units on uncertainty inference. Key concepts covered include histograms, probability density functions, measures of central tendency, variance as a measure of dispersion, and expected values.
My slides for an academia talk about embedded vision, given in 2010. Some of our research results are also presented in this presentation.
A few slides contain Chinese characters.
Computer vision and pattern recognition algorithms are important for IoT applications like smart homes and healthcare that involve large camera networks. Academic expertise is needed for accuracy and efficiency, while industrial concerns focus on system integration, configuration and management. The presentation describes a large-scale video surveillance system using heterogeneous information fusion and visualization across a university campus. It also discusses implementing system self-awareness through fault, environment and context awareness, and presents methods for real-time camera anomaly detection.
This document discusses parallelizing computer vision algorithms using GPGPU computing. It begins with an introduction to multicore computing and GPUs. It explains that as CPU clock speeds can no longer increase due to power constraints, the industry has shifted to multicore CPUs and GPUs to continue improving performance. Computer vision algorithms are well-suited to parallelization on GPUs due to their massive data processing needs. The document reviews GPU architectures from Nvidia, Qualcomm, AMD, and ARM that can be used to accelerate computer vision. It also discusses parallel programming frameworks for GPUs like CUDA, OpenCL, and OpenACC.
The document discusses embedded computer vision and presents examples of embedded computer vision systems developed by Wang, Yuan-Kai and his team. It describes research in embedded computer vision using CPUs, DSPs and FPGAs. It also outlines challenges in embedded computer vision and provides examples of projects including an entertainment robot, vision sensor network, video surveillance system, and wearable camera.
This document discusses parallel computing with GPUs and CUDA. It begins by explaining that the multicore era requires parallel computing approaches. It then provides an overview of GPU architecture and programming with CUDA. Specific examples of using GPUs for image restoration, feature extraction, and video processing are mentioned.
This document describes a unit on uncertainty inference using continuous distributions. It covers Bayesian networks and Gaussian distributions, including univariate, bivariate, and multivariate Gaussian distributions. The key concepts covered are the Gaussian distribution parameters of mean and covariance matrix, properties of Gaussian distributions like axis-aligned and spherical Gaussians, and applications like using Gaussians for noise modeling in images. Self-study references on statistics and artificial intelligence are also provided.
This document provides an introduction to probability and probability distributions for Bayesian networks. It begins with a review of basic probability concepts like events, axioms of probability, and theorems derived from the axioms. It then discusses random variables, including discrete, continuous, and random vector variables. Examples of random variables in image processing and computer vision are provided. The document concludes with an overview of probability distributions as a set of probabilities assigned to a random variable or vector.
This is a presentation for an academia talk about cloud computing for intelligent video surveillance (VSaaS), given in 2010. Some of our research results are also presented in this presentation.
This document discusses intelligent video surveillance and sousveillance. It covers topics such as video surveillance market trends, important crime cases solved using CCTV footage, and technology used in intelligent video surveillance systems. Computer vision algorithms are used to add intelligence to video surveillance, going beyond just monitoring to visual surveillance. The document also presents examples of intelligent surveillance applications and research from universities and companies.
05 probabilistic graphical models
1. Bayesian Networks
Unit 5: Probabilistic Graphical Models (PGM)
Wang, Yuan-Kai, 王元凱
ykwang@mails.fju.edu.tw
http://www.ykwang.tw
Department of Electrical Engineering, Fu Jen Univ.
輔仁大學電機工程系
2006~2011
Reference this document as:
Wang, Yuan-Kai, "Probabilistic Graphical Models," Lecture Notes of Wang, Yuan-Kai, Fu Jen University, Taiwan, 2011.
2. Goal of This Unit
• Learn how to
  – Build a graphical model (network model) by graph theory
  – Perform inference under uncertainty according to probability theory
• Theory of Bayesian networks
  – Conditional independence
  – D-Separation
  – Basic algorithm:
    • Variable Elimination
• Introduce some BN models
  – MRF, HMM, DBN, Naïve Bayes, …
3. Related Units
• Background
  – Statistical inference
  – Graph theory
• Next units
  – Exact inference algorithms
  – Approximate inference algorithms
4. References for Self-Study
• Chapter 14, Artificial Intelligence: A Modern Approach, 2nd ed., S. Russell & P. Norvig, Prentice Hall, 2003
• E. Charniak, "Bayesian networks without tears," AI Magazine
• T. A. Stephenson, "An introduction to Bayesian network theory and usage," IDIAP Research Report IDIAP-RR-00-03, 2000
• B. D'Ambrosio, "Inference in Bayesian networks," AI Magazine, 1999
• M. I. Jordan & Y. Weiss, "Probabilistic inference in graphical models"
5. Contents
1. Representing Uncertain Knowledge
2. Various PGM Models
3. Conditional Independence
4. Inference
5. Applications on Computer Vision
6. Summary
7. References
6. Example – Car Diagnosis
[Figure-only slide: a car-diagnosis Bayesian network.]
7. Examples on Computer Vision
[Figure: a Bayesian network for articulated human body modeling, with four layers of nodes: anthropological measurements A (hand, forearm, upper-arm, head, and torso sizes), joints J (left/right wrists, elbows, shoulders, and the neck), components C (left/right hands, forearms, upper arms, head, and torso), and observations O (Oij).]
8. Where Do PGMs Come From?
• Common problems in real life are
  – Complex and uncertain
9. Graph + Probability
• A graph has
  – Nodes + edges
• Two kinds of graph
  – Directed graph
  – Undirected graph
• Probability has
  – Random variables → nodes
  – Probabilities → edges
• Directed graph: conditional probability P(X|Y)
• Undirected graph: joint probability P(X,Y)
10. Probabilistic Modeling of Problems (1/2)
• Usually a node has two semantics
  – Cause
  – Effect
• Causal relationships between nodes
  – Probabilistic
  – Conditional probability P(Y|X): P(Effect|Cause)
  – X and Y are not independent
  – Directed graph
[Figure: Burglary and Earthquake point to Alarm with P(A|B,E); Alarm points to John Calls with P(J|A) and to Mary Calls with P(M|A).]
11. Probabilistic Modeling of Problems (2/2)
• If nodes have no causal semantics but happen together (influence each other)
  – Probabilistic
  – Joint probability P(X,Y)
  – X and Y are not independent
  – Undirected graph
[Figure: two nodes, Student X and Student Y, linked by an undirected edge carrying P(X,Y).]
12. Cause/Effect → Class/Feature (1/2)
• In pattern recognition / computer vision
  – Cause → class
  – Effect → feature
[Figure: a Face Expression class node with feature children Eyebrow Motion and Mouth Motion, quantified by P(f1|class) and P(f2|class); the features compare a facial expression image against a base image with neutral expression.]
13. Cause/Effect → Class/Feature (2/2)
• Face detection: 2-class classification
[Figure: a Face object node with feature children Skin Color and Eye Pattern, quantified by P(f1|class) and P(f2|class).]
14. Cause/Effect → State/Observation
• In video analysis (tracking)
  – Cause → state
  – Effect → observation
[Figure: a temporal chain of real locations x(t-1) → x(t) → x(t+1) with transition probability P(xt|xt-1); each real location emits an observed location z with P(zt|xt). Real position: xt; detected position: zt; predicted position: x(t+1).]
15. What Are PGMs Good For?
• Application domains: medicine, bioinformatics, speech recognition, computer vision, text classification, stock market, computer troubleshooting
• Typical queries
  – Classification: P(class|feature)
  – Prediction: P(Effect|Cause) = ?
  – Diagnosis: P(Cause|Effect) = ?
16. Three Problems in PGM
• Representation
  – Given a problem
  – Build its graphical model (construction of a Bayesian network)
• Inference
  – Given a set of evidence nodes
  – Get probabilities of other node(s)
• Learning
  – Learn the CPTs of a BN
  – Learn the graphical structure of a BN
[Figure: the tracking network of real and observed locations; learning estimates P(xt|xt-1) and P(zt-1|xt-1) from observed (x, z) data pairs.]
17. Structure of Related Lecture Notes
• PGM representation (problem → network structure)
  – Unit 5: Probabilistic Graphical Models
  – Unit 9: Hybrid BN
  – Units 10~15: Naïve Bayes, MRF, HMM, DBN, Kalman filter
• Inference (answering queries)
  – Unit 6: Exact inference
  – Unit 7: Approximate inference
  – Unit 8: Temporal inference
• Learning structure and parameters (the CPTs P(B), P(E), P(A|B,E), P(J|A), P(M|A)) from data
  – Units 16~: MLE, EM
[Figure: the burglary network (B, E → A → J, M) used to illustrate representation, inference, and learning.]
18. Section 1: Representing Uncertain Knowledge
19. Review (1/3): Bayes' Theorem

    P(h | e) = P(e | h) P(h) / P(e)

  (posterior = likelihood × prior / probability of evidence)
• The probability of a hypothesis h can be updated when evidence e has been obtained
• It is usually not necessary to calculate P(e) directly
  – It can be obtained by normalizing the posterior probabilities P(hi | e) (see the sketch below)
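As a small illustration of the normalization remark above, here is a minimal sketch in Python (not from the slides; the three hypotheses and their numbers are invented for the example) that computes posteriors without evaluating P(e) directly:

    def posteriors(priors, likelihoods):
        """P(h_i | e) is proportional to P(e | h_i) * P(h_i); normalizing replaces computing P(e)."""
        unnormalized = [p * l for p, l in zip(priors, likelihoods)]
        z = sum(unnormalized)  # this normalizing constant equals P(e)
        return [u / z for u in unnormalized]

    # Three hypotheses with priors P(h_i) and likelihoods P(e | h_i):
    print(posteriors([0.5, 0.3, 0.2], [0.1, 0.7, 0.4]))  # ~ [0.147, 0.618, 0.235]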
20. Review (2/3): Marginalization

    P(X) = Σ_{h∈H} P(X, h)
21. Review (3/3)
• Full joint probability distribution (FJD)
  – Can answer any question P(X|E=e):
    P(X|E=e) ∝ Σh P(X, e, h)
  – But becomes intractably large as the number of variables grows
• Independence and conditional independence among random variables (CPTs)
  – CPTs ≡ FJD
  – But can greatly reduce the number of probabilities that need to be specified
22. A Simple Bayesian Network
• 1 FJD = 2 CPTs
  – P(Cavity, Toothache) = P(Toothache|Cavity) × P(Cavity)
  – P(X,Y) = P(X|Y)P(Y) = P(Y|X)P(X)
• A graphical model can represent
  – Causal relationships (Cavity → Toothache)
  – Joint relationships
• CPTs: P(C) = 0.002; P(T|C): C=T → 0.70, C=F → 0.01 (a small sketch follows)
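A minimal sketch in Python (assuming the CPT values on this slide) showing how the two CPTs reproduce the full joint P(Cavity, Toothache) = P(Toothache|Cavity) P(Cavity):

    p_cavity = {True: 0.002, False: 0.998}            # P(C)
    p_tooth_given_cavity = {True: 0.70, False: 0.01}  # P(T=true | C)

    def joint(cavity, toothache):
        """P(Cavity=cavity, Toothache=toothache) = P(T|C) * P(C)."""
        p_t = p_tooth_given_cavity[cavity]
        if not toothache:
            p_t = 1.0 - p_t
        return p_t * p_cavity[cavity]

    # The four joint entries sum to 1, and e.g. P(cavity, toothache) = 0.70 * 0.002 = 0.0014
    assert abs(sum(joint(c, t) for c in (True, False) for t in (True, False)) - 1.0) < 1e-12
    print(joint(True, True))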
23. A Burglary Network
• The graph is directed and acyclic; nodes are (random) variables
• Structure: Burglary and Earthquake are parents of Alarm; Alarm is the parent of John Calls and Mary Calls
• CPTs:
  – P(B) = 0.001; P(E) = 0.002
  – P(A|B,E): B=T,E=T → 0.95; B=T,E=F → 0.95; B=F,E=T → 0.29; B=F,E=F → 0.001
  – P(J|A): A=T → 0.90; A=F → 0.05
  – P(M|A): A=T → 0.70; A=F → 0.01
• A conditional probability distribution quantifies the effects of the parents on a node
24. Compact Representation
• If all n nodes have at most k parents: O(2^k · n) vs. O(2^n) parameters
• The burglary network and its CPTs from the previous slide illustrate this: five CPTs with 10 rows in total, instead of 2^5 − 1 = 31 joint entries
25. Formal Definition of a BN
• Directed Acyclic Graph (DAG)
  – Nodes: random variables
  – Edges: direct influence between 2 variables
• CPTs: quantify the dependency of a variable on its parents, P(X|Parent(X))
  – Ex: P(C|A,B), P(D|A)
• A priori distribution for each node with no parents
  – Ex: P(A) and P(B)
[Figure: a DAG over nodes A, B, C, D, E, with A and B as parents of C and A as a parent of D.] (A sketch of this definition follows.)
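A sketch in Python of this formal definition (the five-node DAG and all CPT numbers are invented for illustration): a BN is just a DAG plus one CPT per node, keyed by the values of Parent(X), and the joint is the product of the CPTs:

    from itertools import product

    # parents[X] lists Parent(X); cpt[X] maps a tuple of parent values to P(X=True | parents)
    parents = {"A": [], "B": [], "C": ["A", "B"], "D": ["A"], "E": ["C"]}
    cpt = {
        "A": {(): 0.3},
        "B": {(): 0.6},
        "C": {(a, b): 0.9 if a and b else 0.1 for a, b in product((True, False), repeat=2)},
        "D": {(a,): 0.7 if a else 0.2 for a in (True, False)},
        "E": {(c,): 0.5 if c else 0.05 for c in (True, False)},
    }

    def joint(assign):
        """P(x1..xn) = product over nodes of P(xi | pa(xi))."""
        p = 1.0
        for x, pa in parents.items():
            p_true = cpt[x][tuple(assign[u] for u in pa)]
            p *= p_true if assign[x] else 1.0 - p_true
        return p

    print(joint({"A": True, "B": False, "C": True, "D": False, "E": True}))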
26. Conditional Independence in the Directed Acyclic Graph
• The topology of the network encodes dependency/independence:
  – Weather is independent of the other variables
  – Cavity has direct influence on Toothache and Catch
  – Toothache and Catch are conditionally independent given Cavity
27. Conditional Probability Table (CPT)
• Example CPTs: P(W) = 0.001; P(C) = 0.02
  – P(T|C): C=T → 0.90; C=F → 0.05
  – P(Catch|C): C=T → 0.70; C=F → 0.01
• Each node stores P(Xi|Parent(Xi)), also written P(Xi|Pa(Xi))
28. Causality and Bayesian Networks
• Not every BN describes causal relationships between the variables
• Consider the dependence between Lung Cancer (L) and the X-ray test (X)
• A BN with causality: L → X
  – P(l) = 0.001; P(x|l) = 0.6; P(x|¬l) = 0.02
• Another BN represents the same distribution and independencies without causality: X → L
  – P(x) = 0.02058; P(l|x) = 0.02915; P(l|¬x) = 0.00041
  (verified numerically in the sketch below)
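A quick Python check (a sketch, using only the numbers on this slide) that the reversed network X → L encodes the same distribution as the causal network L → X, via marginalization and Bayes' theorem:

    p_l = 0.001          # P(l)
    p_x_given_l = 0.6    # P(x | l)
    p_x_given_not_l = 0.02

    p_x = p_x_given_l * p_l + p_x_given_not_l * (1 - p_l)   # marginalization
    p_l_given_x = p_x_given_l * p_l / p_x                   # Bayes' theorem
    p_l_given_not_x = (1 - p_x_given_l) * p_l / (1 - p_x)

    print(round(p_x, 5), round(p_l_given_x, 5), round(p_l_given_not_x, 5))
    # -> 0.02058 0.02915 0.00041, matching the reversed network's CPTs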
29. Example – Construction of BN (1/3)
• I have a burglar alarm installed at home
• I am at work
• Neighbor John calls to say my alarm is ringing
• But neighbor Mary doesn't call
• Sometimes it's set off by minor earthquakes
• Is there a burglar?
30. Example – Construction of BN (2/3)
• Step 1: Find the random variables
  – Burglary, Earthquake, Alarm, JohnCalls, MaryCalls
• Step 2: Represent the causal relationships among the random variables
  – A burglary can set the alarm off
  – An earthquake can set the alarm off
  – The alarm can cause Mary to call
  – The alarm can cause John to call
• Step 3: Quantify the network topology with probabilities (CPTs)
31. Example – Construction of BN (3/3)
• 5 Boolean random variables + 5 CPTs (a Python sketch of this network follows)
  – P(B) = 0.001; P(E) = 0.002
  – P(A|B,E): B=T,E=T → 0.95; B=T,E=F → 0.95; B=F,E=T → 0.29; B=F,E=F → 0.001
  – P(J|A): A=T → 0.90; A=F → 0.05
  – P(M|A): A=T → 0.70; A=F → 0.01
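A sketch in Python of this network (the CPT numbers are exactly those above) that evaluates one entry of the full joint via the chain-rule factorization, e.g. P(j, m, a, ¬b, ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e):

    P_B, P_E = 0.001, 0.002
    P_A = {(True, True): 0.95, (True, False): 0.95,
           (False, True): 0.29, (False, False): 0.001}      # P(A=t | B, E)
    P_J = {True: 0.90, False: 0.05}                         # P(J=t | A)
    P_M = {True: 0.70, False: 0.01}                         # P(M=t | A)

    def joint(b, e, a, j, m):
        p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
        p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
        p *= (P_J[a] if j else 1 - P_J[a]) * (P_M[a] if m else 1 - P_M[a])
        return p

    # Both neighbors call, the alarm rang, but no burglary and no earthquake:
    print(joint(False, False, True, True, True))  # ~ 0.000628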
32. Marginalization in a Bayesian Network

    P(b, e, a, j) = Σ_{h∈H} P(b, e, a, j, h) = Σ_{M∈{m,¬m}} P(b, e, a, j, M)

    P(b, e) = Σ_{h∈H} P(b, e, h) = Σ_{A∈{a,¬a}} Σ_{J∈{j,¬j}} Σ_{M∈{m,¬m}} P(b, e, A, J, M)

[Network: Burglary, Earthquake → Alarm → John Calls, Mary Calls; both sums are computed in the sketch below.]
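Continuing the Python sketch from the previous slide, both marginalizations are brute-force sums of joint() over the hidden variables:

    from itertools import product

    # P(b, e, a, j) = sum over M of P(b, e, a, j, M)
    p_beaj = sum(joint(True, True, True, True, m) for m in (True, False))

    # P(b, e) = sum over A, J, M of P(b, e, A, J, M)
    p_be = sum(joint(True, True, a, j, m)
               for a, j, m in product((True, False), repeat=3))

    print(p_beaj)  # 0.001 * 0.002 * 0.95 * 0.90 = 1.71e-06
    print(p_be)    # 0.001 * 0.002 = 2e-06; summing out the descendants gives 1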
33. Markov Chain, Conditional Probability, Independence, and Directed Edge
• Markov chain: L → X, with conditional probability P(X|L)
  – L and X are dependent, not independent
• A Markov chain has a conditional probability, is not independent, and has a directed edge
34. Common Causes
• Network (a DAG): Smoking → Bronchitis, Smoking → Lung Cancer
• Markov condition: I(B, L | S), i.e. P(b | l, s) = P(b | s)
• Given S→B and S→L, and "Joe is a smoker":
  – Compare "Joe has Bronchitis" vs. "Joe has Lung Cancer"
  – "Joe has Bronchitis" will not give us any more information about the probability of "Joe has Lung Cancer"
35. Bayesian Networks Unit : Probabilistic Graphical Models p. 35
Common Effects
Burglary Earthquake
Alarm
It is a DAG
• Markov condition: I(B, E), i.e. P(b | e) = P(b)
• Burglary and Earthquake are independent of
each other
• However they are conditionally dependent given
Alarm
• If the alarm has gone off, news that there had
been an earthquake would ‘explain away’ the
idea that a burglary had taken place
36. Bayesian Networks Unit : Probabilistic Graphical Models p. 36
Markov Assumption
• Markov chain vs. independence
• A random variable X is independent of its
non-descendants, given its parents Pa(X)
– Formally, I(X, NonDesc(X) | Pa(X))
[Figure: X with parents Y1 and Y2; ancestors above the parents, descendants below X, non-descendants beside X]
37. Bayesian Networks Unit : Probabilistic Graphical Models p. 37
Markov Assumption Example
• In this example
(network: Earthquake → Radio, Earthquake → Alarm ← Burglary, Alarm → Call):
– I ( E, B )
– I ( B, {E, R} )
– I ( R, {A, B, C} | E )
– I ( A, R | B, E )
– I ( C, {B, E, R} | A )
38. Bayesian Networks Unit : Probabilistic Graphical Models p. 38
Joint Probability Distribution
• Note that our joint distribution with 5 variables can
be represented as:
$$P(e,b,r,a,c) = P(e)\,P(b \mid e)\,P(r \mid e,b)\,P(a \mid e,b,r)\,P(c \mid e,b,r,a)$$
But due to the Markov condition we have, for example,
$$P(c \mid e,b,r,a) = P(c \mid a)$$
The joint probability distribution can be expressed as
$$P(e,b,r,a,c) = P(e)\,P(b \mid e)\,P(r \mid e)\,P(a \mid e,b)\,P(c \mid a)$$
• Ex: the probability that someone has a smoking history,
has lung cancer but not bronchitis, suffers from fatigue, and
tests positive in an X-ray test is
$$P(s, \neg b, l, f, x) = 0.2 \times 0.75 \times 0.003 \times 0.5 \times 0.6 = 0.000135$$
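A quick check of this worked example, reading each factor off the factorization above; which CPT entry each number corresponds to is my reading of the slide.

```python
# Verifying the slide's product; the factor-to-CPT pairing is an assumption.
p_s     = 0.2     # P(s): smoking history
p_not_b = 0.75    # P(not b | s): no bronchitis
p_l     = 0.003   # P(l | s): lung cancer
p_f     = 0.5     # P(f | ...): fatigue
p_x     = 0.6     # P(x | l): positive X-ray

print(p_s * p_not_b * p_l * p_f * p_x)  # 0.000135
```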
39. Bayesian Networks Unit : Probabilistic Graphical Models p. 39
40. Bayesian Networks Unit : Probabilistic Graphical Models p. 40
Representing the Joint Distribution
• For a BN with nodes X1, X2, …, Xn
$$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid pa(x_i))$$
(the full joint distribution, FJD, on the left; the n CPTs on the right)
• An enormous saving can be made in the
number of values required for the joint distribution
– For n binary variables, the FJD requires 2^n − 1 values
– For a BN with n binary variables where each node has
at most k parents, the CPTs require fewer than 2^k · n values
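A quick sketch of this saving for binary variables; the counting follows the slide, while the function names are mine.

```python
# Compare FJD storage with CPT storage for n binary variables.
def fjd_size(n):
    return 2**n - 1               # full joint distribution

def bn_size(n, k):
    return (2**k) * n             # upper bound: each CPT has <= 2^k rows

for n in (5, 10, 20, 30):
    print(n, fjd_size(n), bn_size(n, k=2))
# e.g. n=30: 1,073,741,823 values for the FJD vs. at most 120 CPT entries
```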
41. Bayesian Networks Unit : Probabilistic Graphical Models p. 41
Exercise (1/2)
[Figure: a DAG with edges S → G; S → U; D → U; G → E; U → E; U → H]
$$P(s, d, g, u, e^A, h^C) = P(s)\,P(d)\,P(g \mid s)\,P(u \mid s,d)\,P(e^A \mid g,u)\,P(h^C \mid u)$$
42. Bayesian Networks Unit : Probabilistic Graphical Models p. 42
Exercise (2/2)
• P(a, b, c, d, e)   (network: a → b, a → c; b, c → d; c → e)
= P(e | a, b, c, d) P(a, b, c, d)   (by the product rule)
= P(e | c) P(a, b, c, d)   (by the conditional independence assumption)
= P(e | c) P(d | a, b, c) P(a, b, c)
= P(e | c) P(d | b, c) P(c | a, b) P(a, b)
= P(e | c) P(d | b, c) P(c | a) P(b | a) P(a)
43. Bayesian Networks Unit : Probabilistic Graphical Models p. 43
Exercises
• Facial Expression Recognition
• Face Detection
• Face Tracking
• Body Segmentation
Using GeNIe: http://genie.sis.pitt.edu/
44. Bayesian Networks Unit : Probabilistic Graphical Models p. 44
Another Example : Water-Sprinkler
• Network: Cloudy → Sprinkler, Cloudy → Rain; Sprinkler, Rain → WetGrass

P(C) = 0.5

 C  P(S|C)          C  P(R|C)
 T  0.1             T  0.8
 F  0.5             F  0.2

 S  R  P(W|S,R)
 T  T  0.99
 T  F  0.9
 F  T  0.9
 F  F  0.0
45. Bayesian Networks Unit : Probabilistic Graphical Models p. 45
Inference in Water-Sprinkler (1/2)
• If the grass is wet (WetGrass=True)
– Two possible explanations : rain or sprinkler
– Which is the more likely?
$$\Pr(S{=}T \mid W{=}T) = \frac{\Pr(S{=}T, W{=}T)}{\Pr(W{=}T)} = \frac{\sum_{c,r} \Pr(C{=}c, R{=}r, S{=}T, W{=}T)}{\Pr(W{=}T)} = \frac{0.2781}{0.6471} = 0.430$$
$$\Pr(R{=}T \mid W{=}T) = \frac{\Pr(R{=}T, W{=}T)}{\Pr(W{=}T)} = \frac{\sum_{c,s} \Pr(C{=}c, S{=}s, R{=}T, W{=}T)}{\Pr(W{=}T)} = \frac{0.4581}{0.6471} = 0.708$$
The grass is more likely to be wet because of the rain.
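These two posteriors can be reproduced by brute-force enumeration over the joint, as the following sketch shows; the encoding is mine, the CPT values are the slide's.

```python
from itertools import product

# Enumeration over the water-sprinkler joint distribution.
P_C = 0.5
P_S = {True: 0.1, False: 0.5}                       # P(S=t | C)
P_R = {True: 0.8, False: 0.2}                       # P(R=t | C)
P_W = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.0}     # P(W=t | S, R)

def bern(p, v):
    return p if v else 1.0 - p

def joint(c, s, r, w):
    return bern(P_C, c) * bern(P_S[c], s) * bern(P_R[c], r) * bern(P_W[(s, r)], w)

p_w  = sum(joint(c, s, r, True) for c, s, r in product([True, False], repeat=3))
p_sw = sum(joint(c, True, r, True) for c, r in product([True, False], repeat=2))
p_rw = sum(joint(c, s, True, True) for c, s in product([True, False], repeat=2))

print(p_w)         # 0.6471
print(p_sw / p_w)  # ~ 0.430
print(p_rw / p_w)  # ~ 0.708
```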
46. Bayesian Networks Unit : Probabilistic Graphical Models p. 46
Inference in Water-Sprinkler (2/2)
(Same water-sprinkler network and CPTs as above.)

Time needed for the calculations:
• Using the Bayes chain rule:
$$\Pr(C,R,S,W) = \Pr(C)\Pr(R \mid C)\Pr(S \mid R,C)\Pr(W \mid R,C,S) \;\Rightarrow\; 2 \times 4 \times 8 \times 16 = 1024$$
• Using conditional independence properties:
$$\Pr(C,R,S,W) = \Pr(C)\Pr(R \mid C)\Pr(S \mid C)\Pr(W \mid R,S) \;\Rightarrow\; 2 \times 4 \times 4 \times 8 = 256$$
47. Bayesian Networks Unit : Probabilistic Graphical Models p. 47
Inference (1/5)
P(E=t|C=t) = 0.1,  P(B=t|C=t) = 0.7
[Figure: bar charts of the posterior beliefs over Earthquake and Burglary in the network Earthquake → Radio, Earthquake → Alarm ← Burglary, Alarm → Call, given the evidence C=t]

 E   B   P(A=t|E,B)  P(A=f|E,B)
 e   b   0.9         0.1
 e   ¬b  0.2         0.8
 ¬e  b   0.9         0.1
 ¬e  ¬b  0.01        0.99

Evidence: C = t
48. Bayesian Networks Unit : Probabilistic Graphical Models p. 48
Inference (2/5)
P(E=t|C=t) = 0.1,  P(B=t|C=t) = 0.7
[Figure: the same network and bar charts, now with the additional evidence R=t entering at the Radio node]
Evidence: C = t, R = t
49. Bayesian Networks Unit : Probabilistic Graphical Models p. 49
Inference (3/5)
P(E=t|C=t) = 0.1,  P(B=t|C=t) = 0.7
P(E=t|C=t,R=t) = 0.97,  P(B=t|C=t,R=t) = 0.1
[Figure: bar charts before and after adding the evidence R=t; the belief in Earthquake rises while the belief in Burglary drops]
Evidence: C = t, R = t
50. Bayesian Networks Unit : Probabilistic Graphical Models p. 50
Inference (4/5)
P(E=t|C=t) = 0.1,  P(B=t|C=t) = 0.7
P(E=t|C=t,R=t) = 0.97,  P(B=t|C=t,R=t) = 0.1
[Figure: the same bar charts, highlighting how the radio report raises the Earthquake belief and lowers the Burglary belief]
Evidence: C = t, R = t
Explaining away effect
51. Bayesian Networks Unit : Probabilistic Graphical Models p. 51
Inference (5/5)
P(E=t|C=t) = 0.1,  P(B=t|C=t) = 0.7
P(E=t|C=t,R=t) = 0.97,  P(B=t|C=t,R=t) = 0.1
[Figure: the same network and bar charts as the previous slide]
“Probability theory is nothing but common sense reduced to calculation”
– Pierre Simon Laplace
52. Bayesian Networks Unit : Probabilistic Graphical Models p. 52
2. Various PGM Models
Taxonomy
[Figure: a taxonomy of PGM models, including factor graphs and naïve Bayes]
53. Bayesian Networks Unit : Probabilistic Graphical Models p. 53
Directed vs. Undirected
Directed (Bayesian networks)        Undirected (Markov networks)
[Figure: two small graphs over x1, x2, y1, y2, one with directed edges and one with undirected edges]
$$p(\mathbf{x}, \mathbf{y}) = \prod_i p(x_i \mid \mathbf{x}_{pa(i)}) \prod_j p(y_j \mid \mathbf{x}_{pa(j)}) \qquad p(\mathbf{x}, \mathbf{y}) = \frac{1}{Z} \prod_a \psi_a(\mathbf{x}, \mathbf{y})$$
54. Bayesian Networks Unit : Probabilistic Graphical Models p. 54
Naive Bayes Model
• Strong (naive) assumption about the problem
– A single cause directly influences a number
of effects
– All effects are conditionally independent,
given the cause
$$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid pa(x_i))$$
$$P(\mathit{Cause}, \mathit{Effect}_1, \ldots, \mathit{Effect}_n) = P(\mathit{Cause}) \prod_i P(\mathit{Effect}_i \mid \mathit{Cause})$$
2n+1 probabilities: O(n)
More details in another unit.
55. Bayesian Networks Unit : Probabilistic Graphical Models p. 55
Naïve Bayesian Classifier (NBC)
• Use Naïve Bayes for classification
$$P(\mathit{Class} \mid \mathit{Feature}_1, \ldots, \mathit{Feature}_n) \propto P(\mathit{Feature}_1, \ldots, \mathit{Feature}_n, \mathit{Class}) = P(\mathit{Class}) \prod_{i=1}^{n} P(\mathit{Feature}_i \mid \mathit{Class})$$
[Figure: a Class node with children Feature 1 … Feature n; two examples are shown, a Face object node with Skin Color and Eye Pattern features, and a Face Expression node with Eyebrow Motion and Mouth Motion features]
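A minimal naïve Bayes classifier sketch in this spirit; the class priors and feature likelihoods below are invented for illustration, not taken from the slides.

```python
import math

# Naive Bayes posterior over classes for boolean feature observations.
prior = {"smile": 0.6, "surprise": 0.4}              # P(class), illustrative
likelihood = {                                       # P(feature=true | class)
    "smile":    {"eyebrow_raised": 0.2, "mouth_open": 0.7},
    "surprise": {"eyebrow_raised": 0.9, "mouth_open": 0.8},
}

def posterior(features):
    """Return P(class | features) by normalizing P(class) * prod P(f | class)."""
    scores = {}
    for c in prior:
        s = math.log(prior[c])                       # log-space for stability
        for f, v in features.items():
            p = likelihood[c][f]
            s += math.log(p if v else 1.0 - p)
        scores[c] = math.exp(s)
    z = sum(scores.values())                         # normalization: P(e)
    return {c: s / z for c, s in scores.items()}

print(posterior({"eyebrow_raised": True, "mouth_open": True}))
```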
56. Bayesian Networks Unit : Probabilistic Graphical Models p. 56
Temporal Causality
Represented by Bayesian Networks
• Temporal causality
– In many systems, data arrives sequentially
– Causality must be dealt with over time
• Dynamic Bayes nets (DBNs) can be used
to model such time-series (sequence)
data
• Special cases of DBNs include
– State-space models (Kalman filter)
– Hidden Markov models (HMMs)
57. Bayesian Networks Unit : Probabilistic Graphical Models p. 57
State Space Models (SSM)
• Hidden Markov Model
• Kalman Filter
[Figure: chain X1 → X2 → X3 → … → XT, with an observation Yt attached to each Xt, for t = 1, 2, 3, …]
$$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid pa(x_i))$$
$$P(X_1, \ldots, X_T, Y_1, \ldots, Y_T) = P(X_{1:T}, Y_{1:T}) = P(X_1)P(Y_1 \mid X_1)\,P(X_2 \mid X_1)P(Y_2 \mid X_2) \cdots P(X_T \mid X_{T-1})P(Y_T \mid X_T) = \prod_{i=1}^{T} P(X_i \mid X_{i-1})\,P(Y_i \mid X_i), \text{ where } P(X_1 \mid X_0) \equiv P(X_1)$$
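Here is a minimal sketch of this factorization for a discrete HMM; the two-state transition and emission matrices are illustrative assumptions, not values from the slides.

```python
import numpy as np

# Joint probability P(x_{1:T}, y_{1:T}) under the SSM factorization above.
pi = np.array([0.6, 0.4])            # P(X_1)
A  = np.array([[0.7, 0.3],           # A[i, j] = P(X_t = j | X_{t-1} = i)
               [0.2, 0.8]])
B  = np.array([[0.9, 0.1],           # B[i, k] = P(Y_t = k | X_t = i)
               [0.3, 0.7]])

def joint_prob(states, obs):
    """P(X_{1:T} = states, Y_{1:T} = obs) = prod_t P(X_t|X_{t-1}) P(Y_t|X_t)."""
    p = pi[states[0]] * B[states[0], obs[0]]
    for t in range(1, len(states)):
        p *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
    return p

print(joint_prob([0, 0, 1], [0, 1, 1]))
```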
58. Bayesian Networks Unit : Probabilistic Graphical Models p. 58
DBN (1/2)
More complex temporal models than the HMM & Kalman filter
[Figure: a two-slice template (slice 1 DAG, slice 2 DAG) that is repeated to unroll the network over t = 1, 2, 3, 4, 5]
59. Bayesian Networks Unit : Probabilistic Graphical Models p. 59
DBN (2/2)
[Figure: a DBN unrolled over t = 1, 2, 3, 4, 5]
$$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid pa(x_i))$$
60. Bayesian Networks Unit : Probabilistic Graphical Models p. 60
Bayesian SSM
61. Bayesian Networks Unit : Probabilistic Graphical Models p. 61
Factorial SSM
• Multiple hidden sequences
• Avoid exponentially large hidden space
62. Bayesian Networks Unit : Probabilistic Graphical Models p. 62
Example: Markov Random Field
• Typical application: image region labelling
[Figure: a grid MRF with hidden labels y_i and observed pixels x_i]
63. Bayesian Networks Unit : Probabilistic Graphical Models p. 63
Example: Conditional Random Field
[Figure: a conditional random field with label nodes y whose potentials are conditioned on the observations x_i]
64. Bayesian Networks Unit : Probabilistic Graphical Models p. 64
Markov Random Fields (1/2)
Undirected graph
65. Bayesian Networks Unit : Probabilistic Graphical Models p. 65
MRF (2/2)
[Figure: a pairwise MRF with hidden labels y and observed image x; parameters are tied across the grid]
• Compatibility with neighbors
• Local evidence (compatibility with the image)
66. Bayesian Networks Unit : Probabilistic Graphical Models p. 66
3. Conditional Independencies
• A Bayesian network / probabilistic graphical
model G represents a set of Markov
independencies of a distribution P
• There is a factorization theorem
$$P(X_1, \ldots, X_n) = \prod_i P(X_i \mid Pa_i)$$
• This section inspects deeper meanings of
conditional independence for
– The factorization theorem
– Inference algorithms in later units
67. Bayesian Networks Unit : Probabilistic Graphical Models p. 67
Conditional Independence
• Dependencies
– Two connected nodes
influence each other
• Independent
– Example: I(B;E)
• Conditional Independent
– Example
• I(J;M|A)?
• I(B;E|A)?
– d-separation
68. Bayesian Networks Unit : Probabilistic Graphical Models p. 68
D-Separation
• It is a rule describing the influences
between nodes
69. Bayesian Networks Unit : Probabilistic Graphical Models p. 69
Serial (Intermediate Cause)
[Chain: B → A → M]
• Indirect causal effect, no evidence
• Clearly a burglary will affect whether Mary calls
• The same situation holds for an indirect evidential
effect, because independence is symmetric
• If I(B;M|A) then I(M;B|A)
70. Bayesian Networks Unit : Probabilistic Graphical Models p. 70
Diverging (Common Cause)
[Figure: J ← A → M]
• Influence can flow from John's call to
Mary's call if we don't know whether
or not there is an alarm
• But I(J;M|A)
71. Bayesian Networks Unit : Probabilistic Graphical Models p. 71
Converging (Common Effect)
[Figure: E → A ← B]
• Influence can't flow from Earthquake to
Burglary if we don't know whether or not
there is an alarm
• So I(E;B)
• A special structure that causes independence:
the v-structure
72. Bayesian Networks Unit : Probabilistic Graphical Models p. 72
Independence of Two Events
73. Bayesian Networks Unit : Probabilistic Graphical Models p. 73
D-Separation for 3 Nodes
74. Bayesian Networks Unit : Probabilistic Graphical Models p. 74
Path Blockage (1/3)
• Three cases:
– Common cause
– Intermediate cause
– Common effect
[Figure: the common-cause case R ← E → A; the path is blocked when E is given, unblocked (active) otherwise]
75. Bayesian Networks Unit : Probabilistic Graphical Models p. 75
Path Blockage (2/3)
• Three cases:
– Common cause
– Intermediate cause
– Common effect
[Figure: the intermediate-cause case E → A → C; the path is blocked when A is given, unblocked (active) otherwise]
76. Bayesian Networks Unit : Probabilistic Graphical Models p. 76
Path Blockage (3/3)
Three cases:
– Common cause
– Intermediate cause
– Common effect
[Figure: the common-effect case E → A ← B, with C a descendant of A; the path is blocked when neither A nor any of its descendants is given, unblocked (active) when A or a descendant such as C is given]
77. Bayesian Networks Unit : Probabilistic Graphical Models p. 77
General Case
78. Bayesian Networks Unit : Probabilistic Graphical Models p. 78
D-Separation in General
• X is d-separated from Y, given Z,
– if all paths from a node in X to a node in Y
are blocked, given Z
• Checking d-separation can be done efficiently
(linear time in the number of edges)
– Bottom-up phase:
mark all nodes whose descendants are in Z
– X-to-Y phase:
traverse (BFS) all edges on paths from X
to Y and check whether they are blocked
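Here is a sketch of such a check in the spirit of the two-phase algorithm above ("Bayes-ball" style reachability); the graph encoding and function names are mine, and the example network is this unit's Earthquake/Burglary/Radio/Alarm/Call DAG.

```python
from collections import deque

# d-separation via reachability: phase 1 marks Z and its ancestors, phase 2
# does a BFS over (node, direction) states, where "up" means the node was
# entered from a child and "down" means it was entered from a parent.
graph = {"E": ["R", "A"], "B": ["A"], "R": [], "A": ["C"], "C": []}

def parents(g, n):
    return [p for p in g if n in g[p]]

def d_separated(g, x, y, z):
    """Return True if x and y are d-separated given the set of nodes z."""
    # Phase 1 (bottom-up): mark z and all ancestors of z.
    anc, stack = set(), list(z)
    while stack:
        n = stack.pop()
        if n not in anc:
            anc.add(n)
            stack.extend(parents(g, n))
    # Phase 2: traverse active trails starting from x.
    visited, frontier = set(), deque([(x, "up")])
    while frontier:
        n, d = frontier.popleft()
        if (n, d) in visited:
            continue
        visited.add((n, d))
        if n == y:
            return False                 # reached y along an active trail
        if d == "up" and n not in z:
            for p in parents(g, n):      # trail continues toward parents
                frontier.append((p, "up"))
            for c in g[n]:               # common cause: branch to children
                frontier.append((c, "down"))
        elif d == "down":
            if n not in z:
                for c in g[n]:           # serial chain passes through
                    frontier.append((c, "down"))
            if n in anc:                 # collider active: n or one of its
                for p in parents(g, n):  # descendants is observed
                    frontier.append((p, "up"))
    return True

print(d_separated(graph, "R", "B", set()))       # True
print(d_separated(graph, "R", "B", {"A"}))       # False
print(d_separated(graph, "R", "B", {"E", "A"}))  # True
```

On this network the three calls reproduce the answers worked out in Example 1 below.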
79. Bayesian Networks Unit : Probabilistic Graphical Models p. 79
Paths (1/2)
• Intuition: dependency must "flow" along
paths in the graph
• A path is a sequence of neighboring variables
[Network: Earthquake → Radio, Earthquake → Alarm ← Burglary, Alarm → Call]
Examples:
• R ← E → A ← B
• C ← A ← E → R
80. Bayesian Networks Unit : Probabilistic Graphical Models p. 80
Paths (2/2)
• For a path between two end nodes X and Y,
the path is either
– an active path:
• if we can find dependency between X & Y
– or a blocked path:
• if we cannot find dependency between X & Y
• X & Y are then conditionally independent
• X & Y are d-separated
• We want to classify the situations in which
paths are active
81. Bayesian Networks Unit : Probabilistic Graphical Models p. 81
D-Separation Example 1 (1/3)
[Network: E → R, E → A ← B, A → C]
– d-sep(R,B)?
82. Bayesian Networks Unit : Probabilistic Graphical Models p. 82
D-Separation Example 1 (2/3)
– d-sep(R,B) = yes
– d-sep(R,B|A)?
[Network: E → R, E → A ← B, A → C]
83. Bayesian Networks Unit : Probabilistic Graphical Models p. 83
D-Separation Example 1 (3/3)
– d-sep(R,B) = yes
– d-sep(R,B|A) = no
– d-sep(R,B|E,A)?
[Network: E → R, E → A ← B, A → C]
84. Bayesian Networks Unit : Probabilistic Graphical Models p. 84
D-Separation Example 2
85. Bayesian Networks Unit : Probabilistic Graphical Models p. 85
D-Separation Example 3
86. Bayesian Networks Unit : Probabilistic Graphical Models p. 86
d-separation: Car Start Problem
• 1. ‘Start’ and ‘Fuel’ are dependent on each other.
• 2. ‘Start’ and ‘Clean Spark Plugs’ are dependent on each other.
• 3. ‘Fuel’ and ‘Fuel Meter Standing’ are dependent on each other.
• 4. ‘Fuel’ and ‘Clean Spark Plugs’ are conditionally dependent on
each other given the value of ‘Start’.
• 5. ‘Fuel Meter Standing’ and ‘Start’ are conditionally
independent given the value of ‘Fuel’.
87. Bayesian Networks Unit : Probabilistic Graphical Models p. 87
Exercises
[Figures: a face-tracking temporal model with real locations x_{t-1}, x_t, x_{t+1} and observed locations z_{t-1}, z_t, transition P(x_t|x_{t-1}) and observation P(z_{t-1}|x_{t-1}); and a facial-expression model with Eyebrow Motion and Mouth Motion features]
88. Bayesian Networks Unit : Probabilistic Graphical Models p. 88
4. Inference
• 4.1 What Is Inference
• 4.2 How Inference
• 4.3 Inference Methods
89. Bayesian Networks Unit : Probabilistic Graphical Models p. 89
4.1 What Is Inference
90. Bayesian Networks Unit : Probabilistic Graphical Models p. 90
Exercises (1/2)
• Face detection; facial expression recognition
[Figure: two naïve Bayes models, a Face object node with Skin Color and Eye Pattern features, and a Face Expression node with Eyebrow Motion and Mouth Motion features]
91. Bayesian Networks Unit : Probabilistic Graphical Models p. 91
Exercises (2/2)
• Face tracking
[Figure: a temporal model with real locations x_{t-1}, x_t, x_{t+1} and observed locations z_{t-1}, z_t; transition P(x_t|x_{t-1}), observation P(z_{t-1}|x_{t-1})]
Real position: x_t; detected position: z_t, with observation model P(z_t | x_t); predicted position: x̄_{t+1}
92. Bayesian Networks Unit : Probabilistic Graphical Models p. 92
3 Kinds of Variables in Inference
• Remember the general inference procedure
in the previous unit (the uncertainty inference unit)
• Let P(X|E=e) be the query
– X is the query variable
– E is the set of evidence variables
– e is the observed values of E
– H is the remaining unobserved variables
(hidden variables)
[Figure: the Asia network with nodes V, S, T, L, A, B, X, D]
93. Bayesian Networks Unit : Probabilistic Graphical Models p. 93
The Burglary Example
Query: P(Burglary | JohnCalls = true)
• Query variable X: Burglary
• Evidence variables E=e: JohnCalls = true
• Hidden variables H: Earthquake, Alarm, MaryCalls
[Network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls, Alarm → MaryCalls]
94. Bayesian Networks Unit : Probabilistic Graphical Models p. 94
The Asia Example
• Query P(L | v, s, d)
– Query variable: L
– Evidence variables: V=true, S=true, D=true
– Hidden variables: T, X, A, B
[Figure: the Asia network with nodes V, S, T, L, A, B, X, D]
95. Bayesian Networks Unit : Probabilistic Graphical Models p. 95
arg max P(X|e)
• For P(X | e), if X is a Boolean variable,
P(X | e) comprises 2 probabilities, e.g.
P(X=true | e) = 0.8
P(X=false | e) = 0.2
• arg max_x P(X=x|e) yields a decision:
max{P(X=true | e) = 0.8, P(X=false | e) = 0.2} → X = true
96. Bayesian Networks Unit : Probabilistic Graphical Models p. 96
Five Types of Queries in Inference
• For a probabilistic graphical model G
• Given a set of evidence E=e
• Query the PGM with
– P(e): likelihood query
– arg max P(e):
maximum likelihood query
– P(X|e): posterior belief query
– arg max_x P(X=x|e): (single query variable)
maximum a posteriori (MAP) query
– arg max_{x_1,…,x_t} P(X_1=x_1, …, X_t=x_t | e):
most probable explanation (MPE) query,
also called Viterbi decoding
97. Bayesian Networks Unit : Probabilistic Graphical Models p. 97
Likelihood Query P(e) (1/2)
Probability of evidence
[Figure: frames e_1, e_2, …, e_t of an input video feed an HMM for "Surprise", with hidden states X_1, X_2, …, X_t and observations E_1, E_2, …, E_t; the query is P(E_{1:t} = e_{1:t})]
98. Bayesian Networks Unit : Probabilistic Graphical Models p. 98
Likelihood Query P(e) (2/2)
• Marginalization of all hidden variables
$$P(e) = \sum_{h \in H} P(E = e, H = h)$$
$$P(E_{1:t} = e_{1:t}) = \sum_{X_1} \cdots \sum_{X_t} P(E_{1:t} = e_{1:t}, X_1, \ldots, X_t) = \sum_{X_1} \cdots \sum_{X_t} \prod_{i=1}^{t} P(X_i \mid X_{i-1})\,P(E_i \mid X_i), \text{ where } P(X_1 \mid X_0) \equiv P(X_1)$$
[Figure: HMM chain X_1 → X_2 → … → X_t with observations E_1, E_2, …, E_t]
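Rather than enumerating all state sequences, this sum can be computed incrementally with the forward recursion. A minimal sketch for a discrete HMM follows; the two-state matrices are illustrative assumptions.

```python
import numpy as np

# Likelihood query P(e_{1:t}) via the forward recursion, which performs the
# marginalization above in O(t * |X|^2) instead of enumerating |X|^t sequences.
pi = np.array([0.6, 0.4])           # P(X_1)
A  = np.array([[0.7, 0.3],          # A[i, j] = P(X_t = j | X_{t-1} = i)
               [0.2, 0.8]])
B  = np.array([[0.9, 0.1],          # B[i, k] = P(E_t = k | X_t = i)
               [0.3, 0.7]])

def likelihood(obs):
    """Return P(E_{1:t} = obs) by summing out X_1..X_t incrementally."""
    alpha = pi * B[:, obs[0]]           # alpha_1(x) = P(X_1 = x, e_1)
    for e in obs[1:]:
        alpha = (alpha @ A) * B[:, e]   # alpha_t(x) = P(X_t = x, e_{1:t})
    return alpha.sum()                  # P(e_{1:t}) = sum_x alpha_t(x)

print(likelihood([0, 1, 1]))
```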
99. Bayesian Networks Unit : Probabilistic Graphical Models p. 99
Maximum Likelihood Query
arg max P(e)
[Figure: the input video e_{1:t} is evaluated by two HMMs, a "Surprise" HMM with parameters P_S(X_t|X_{t-1}), P_S(E_i|X_i) and a "Cry" HMM with parameters P_C(X_t|X_{t-1}), P_C(E_i|X_i); the class with the larger likelihood, max{P_Surprise(e_{1:t}), P_Cry(e_{1:t})}, is selected]
100. Bayesian Networks Unit : Probabilistic Graphical Models p. 100
Maximum Likelihood Query
arg max P(e)
• Likelihood query P(E=e)
– Step 1 (Bayes theorem): P(E=e)
– Step 2 (marginalization of all hidden variables):
$$P(E=e) = \sum_{h \in H} P(E=e, H=h)$$
• Query arg max P(E=e)
– Steps 1 and 2 as above, then
$$\arg\max \sum_{h \in H} P(E=e, H=h)$$
101. Bayesian Networks Unit : Probabilistic Graphical Models p. 101
Posterior Belief Query P(X|e)
• Usually applied to tracking
– using temporal PGM models
• Query types
– Filtering: P(X_t | E_1=e_1, …, E_t=e_t) = P(X_t | e_{1:t})
– Prediction: P(X_{t+1} | e_{1:t})
– Smoothing: P(X_{t-k} | e_{1:t})
(fixed-lag smoothing)
[Figure: HMM chain X_1, X_2, …, X_t, X_{t+1} with observations E_1, E_2, …, E_t]
102. Bayesian Networks Unit : Probabilistic Graphical Models p. 102
P(X|e) – Filtering (1/2)
• P(X_t | e_{1:t})
[Figure: HMM chain X_1, X_2, …, X_t with observations E_1, E_2, …, E_t; real position x_i, detected position e_i with observation model P(z_t | x_t), filtered position x'_t]
103. Bayesian Networks Unit : Probabilistic Graphical Models p. 103
P(X|e) – Filtering (2/2)
• Inference of the query P(X_t | e_{1:t}):
Step 1 (Bayes theorem):
$$P(X_t \mid e_{1:t}) = \frac{P(X_t, e_{1:t})}{P(e_{1:t})} \propto P(X_t, e_{1:t})$$
Step 2 (marginalization of all hidden variables):
$$P(X_t, e_{1:t}) = \sum_{X_1} \cdots \sum_{X_{t-1}} P(X_t, e_{1:t}, X_1, \ldots, X_{t-1})$$
Step 3 (chaining by conditional independence):
$$= \sum_{X_1 \cdots X_{t-1}} \prod_{i=1}^{t} P(X_i \mid X_{i-1})\,P(e_i \mid X_i)$$
104. Bayesian Networks Unit : Probabilistic Graphical Models p. 104
P(X|e) – Prediction (1/2)
• P(X_{t+k} | e_{1:t}) for k > 0; for k = 1:
[Figure: chain X_1, X_2, …, X_t, X_{t+1} with observations E_1, …, E_t; real position x_i, detected position e_i, predicted position x'_{t+1}]
105. Bayesian Networks Unit : Probabilistic Graphical Models p. 105
P(X|e) – Prediction (2/2)
• Inference of the query P(X_{t+1} | e_{1:t}):
Step 1 (Bayes theorem):
$$P(X_{t+1} \mid e_{1:t}) = \frac{P(X_{t+1}, e_{1:t})}{P(e_{1:t})} \propto P(X_{t+1}, e_{1:t})$$
Step 2 (marginalization of all hidden variables):
$$P(X_{t+1}, e_{1:t}) = \sum_{X_1} \cdots \sum_{X_t} P(X_{t+1}, e_{1:t}, X_1, \ldots, X_t)$$
Step 3 (chaining by conditional independence):
$$= \sum_{X_1 \cdots X_t} P(X_{t+1} \mid X_t) \prod_{i=1}^{t} P(X_i \mid X_{i-1})\,P(e_i \mid X_i)$$
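A minimal sketch of one-step (or k-step) prediction: filter up to time t with the forward recursion, then push the belief through the transition model. The HMM matrices are illustrative assumptions.

```python
import numpy as np

# Prediction P(X_{t+k} | e_{1:t}) from the filtered belief at time t.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.2, 0.8]])   # P(X_t = j | X_{t-1} = i)
B  = np.array([[0.9, 0.1], [0.3, 0.7]])   # P(E_t = k | X_t = i)

def filter_belief(obs):
    """P(X_t | e_{1:t}) as a normalized forward message."""
    alpha = pi * B[:, obs[0]]
    for e in obs[1:]:
        alpha = (alpha @ A) * B[:, e]
    return alpha / alpha.sum()

def predict(obs, k=1):
    """P(X_{t+k} | e_{1:t}): apply the transition model k times."""
    belief = filter_belief(obs)
    for _ in range(k):
        belief = belief @ A               # sum_x P(X'|x) P(x | e_{1:t})
    return belief

print(predict([0, 1, 1], k=1))
```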
106. Bayesian Networks Unit : Probabilistic Graphical Models p. 106
P(X|e) – Smoothing (1/3)
• P(X_k | e_{1:t}) for 1 ≤ k < t
[Figure: chain X_1, X_2, …, X_k, …, X_t with observations E_1, E_2, …, E_k, …, E_t; real position x_t, detected position z_t, smoothed position x̂_t]
107. Bayesian Networks Unit : Probabilistic Graphical Models p. 107
P(X|e) – Smoothing (2/3)
• Inference of the query P(X_k | e_{1:t}):
Step 1 (Bayes theorem):
$$P(X_k \mid e_{1:t}) = \frac{P(X_k, e_{1:t})}{P(e_{1:t})} \propto P(X_k, e_{1:t})$$
Step 2 (marginalization of all hidden variables):
$$P(X_k, e_{1:t}) = \sum_{X_1 \cdots X_{k-1},\, X_{k+1} \cdots X_t} P(X_k, e_{1:t}, X_1, \ldots, X_{k-1}, X_{k+1}, \ldots, X_t)$$
Step 3 (chaining by conditional independence):
$$= \sum_{X_1 \cdots X_{k-1},\, X_{k+1} \cdots X_t} \prod_{i=1}^{t} P(X_i \mid X_{i-1})\,P(e_i \mid X_i)$$
108. Bayesian Networks Unit : Probabilistic Graphical Models p. 108
P(X|e) – Smoothing (3/3)
• Fixed-lag smoothing
109. Bayesian Networks Unit : Probabilistic Graphical Models p. 109
MAP Query (1/2)
• arg max_x P(X=x|e)
• Usually applied to classification
– Find the most likely class X=x,
given the evidence e (the features)
[Figure: a facial-expression model with Eyebrow Motion and Mouth Motion features; X = {Surprise, Smile, …}. If P(X=Smile|e) is the maximum among {P(X=Surprise|e), P(X=Smile|e), …}, then Smile = arg max_x P(X=x|e)]
110. Bayesian Networks Unit : Probabilistic Graphical Models p. 110
MAP Query (2/2)
• MAP query arg max_x P(X=x|E=e)
Step 1 (Bayes theorem):
$$\arg\max_x P(X{=}x \mid e) = \arg\max_x \frac{P(X{=}x, e)}{P(e)} = \arg\max_x P(X{=}x, e)$$
Step 2 (marginalization of all hidden variables):
$$= \arg\max_x \sum_{h \in H} P(X{=}x, e, H{=}h)$$
111. Bayesian Networks Unit : Probabilistic Graphical Models p. 111
MPE Query
• Also called Viterbi decoding
• arg max_{x_{1:t}} P(X_1=x_1, …, X_t=x_t | e_{1:t})
• = arg max_{x_{1:t}} P(X_{1:t} | e_{1:t})
• = smoothing for X_{1:t-1} + filtering for X_t
[Figure: HMM chain X_1, X_2, …, X_t with observations E_1, E_2, …, E_t]
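A minimal sketch of MPE/Viterbi decoding for a discrete HMM: the sums of the forward recursion are replaced by maximizations, and back-pointers recover the best sequence. The matrices are illustrative assumptions.

```python
import numpy as np

# Viterbi decoding: arg max over state sequences of P(x_{1:t}, e_{1:t}).
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.2, 0.8]])   # P(X_t = j | X_{t-1} = i)
B  = np.array([[0.9, 0.1], [0.3, 0.7]])   # P(E_t = k | X_t = i)

def viterbi(obs):
    """Return the most probable state sequence and its joint probability
    P(x_{1:t}, e_{1:t}); the arg max equals that of P(x_{1:t} | e_{1:t})."""
    delta = pi * B[:, obs[0]]                 # best score ending in each state
    back = []
    for e in obs[1:]:
        scores = delta[:, None] * A           # scores[i, j]: via i into j
        back.append(scores.argmax(axis=0))    # best predecessor of each j
        delta = scores.max(axis=0) * B[:, e]
    path = [int(delta.argmax())]
    for bp in reversed(back):                 # follow back-pointers
        path.append(int(bp[path[-1]]))
    return path[::-1], float(delta.max())

print(viterbi([0, 1, 1]))
```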
112. Bayesian Networks Unit : Probabilistic Graphical Models p. 112
Exercises
• Face Detection
• Facial Expression Recognition
• Face Tracking
• Body Segmentation
X = {Surprise, Smile, …}
[Figures: the facial-expression model with Eyebrow Motion and Mouth Motion features, and the face-tracking temporal model with real locations x_{t-1}, x_t, x_{t+1}, observed locations z_{t-1}, z_t, transition P(x_t|x_{t-1}) and observation P(z_{t-1}|x_{t-1})]
113. Bayesian Networks Unit : Probabilistic Graphical Models p. 113
4.2 How Inference
• Inference of the query P(X | E=e):
Step 1 (Bayes theorem):
$$P(X \mid E{=}e) = \frac{P(X, E{=}e)}{P(E{=}e)} \propto P(X, E{=}e)$$
Step 2 (marginalization of all hidden variables):
$$P(X, E{=}e) = \sum_{h \in H} P(X, E{=}e, H{=}h)$$
Step 3 (chaining by conditional independence):
$$= \sum_{h \in H} \prod_{i=1}^{n} P(X_i \mid Pa(X_i))$$
114. Bayesian Networks Unit : Probabilistic Graphical Models p. 114
The 4th Step of Inference
Steps 1-3 give
$$P(X \mid E{=}e) \propto \sum_{h \in H} \prod_{i=1}^{n} P(X_i \mid Pa(X_i))$$
• Step 4: compute the sum-product
– This needs an efficient algorithm
– First, we explain the computation of
the sum-product by an enumeration algorithm
• easy but not efficient
– Then, more efficient methods are
explained in the next two units
115. Bayesian Networks Unit : Probabilistic Graphical Models p. 115
The Burglary Example (1/3)
• A posterior query on the burglary
network
– P(B | j, m)
– = P(B, j, m) / P(j, m)
– ∝ P(B, j, m)
– = Σ_e Σ_a P(B, e, a, j, m)
• E and A are hidden variables
• This would use the full joint distribution table
116. Bayesian Networks Unit : Probabilistic Graphical Models p. 116
The Burglary Example (2/3)
• Rewrite the full joint entries using
products of CPT entries
– P(B | j, m)
– ∝ Σ_E Σ_A P(B, E, A, j, m)
– = Σ_E Σ_A P(j, m, A, B, E)
– = Σ_E Σ_A P(j | m, A, B, E) P(m | A, B, E) P(A | B, E) P(B | E) P(E)   (chain rule)
– = Σ_e Σ_a P(B) P(e) P(a | B, e) P(j | a) P(m | a)   (conditional independence)
– (all probabilities are CPT entries)
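As a sketch, the following enumeration carries out exactly this computation with the CPT values given earlier in this unit (note that the slide uses P(A | b, ¬e) = 0.95).

```python
from itertools import product

# P(B=t | j, m) by summing the CPT product over the hidden E and A.
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.95,
       (False, True): 0.29, (False, False): 0.001}   # P(A=t | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(J=t | A)
P_M = {True: 0.70, False: 0.01}                      # P(M=t | A)

def bern(p, v):
    return p if v else 1.0 - p

def posterior_burglary():
    """P(B=t | j, m): enumerate e, a; normalize over both values of B."""
    score = {}
    for b in (True, False):
        score[b] = sum(bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
                       * P_J[a] * P_M[a]
                       for e, a in product((True, False), repeat=2))
    z = score[True] + score[False]        # normalization: P(j, m)
    return score[True] / z

print(posterior_burglary())   # ~ 0.286 with these CPT values
```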