Bayesian Networks
        Unit 7 Approximate Inference
            in Bayesian Networks
                    Wang, Yuan-Kai, 王元凱
                      ykwang@mails.fju.edu.tw
                       http://www.ykwang.tw

       Department of Electrical Engineering, Fu Jen Univ.
                     輔仁大學電機工程系

                                 2006~2011

                      Reference this document as:
    Wang, Yuan-Kai, "Approximate Inference in Bayesian Networks,"
    Lecture Notes of Wang, Yuan-Kai, Fu Jen University, Taiwan, 2011.



                             Goal of This Unit
         • P(X|e) inference for Bayesian networks
         • Why approximate inference
               – Exact inference is too slow because of
                 exponential complexity
         • Using approximate approaches
               – Sampling methods
                    • Likelihood weighting sampling
                    • Markov Chain Monte Carlo sampling
               – Loopy belief propagation
               – Variational method




                              Related Units
         • Background
               – Probabilistic graphical model
               – Exact inference in BN
         • Next units
               – Probabilistic inference over time







                             Self-Study References
         • Chapter 14, Artificial Intelligence: A Modern
           Approach, 2nd ed., S. Russell & P. Norvig, Prentice
           Hall, 2003.
         • Inference in Bayesian networks, B. D'Ambrosio, AI
           Magazine, 1999.
         • Probabilistic inference in graphical models, M. I.
           Jordan & Y. Weiss.
         • An introduction to MCMC for machine learning,
           C. Andrieu, N. de Freitas, A. Doucet, & M. I. Jordan,
           Machine Learning, vol. 50, pp. 5-43, 2003.
         • Computational Statistics Handbook with Matlab,
           W. L. Martinez and A. R. Martinez, Chapman &
           Hall/CRC, 2002.
               – Chapter 3 Sampling Concepts
               – Chapter 4 Generating Random Variables



             Structure of Related Lecture Notes
         (Overview diagram: Problem → Structure → Data, illustrated with the
         burglary network B, E, A, J, M and its CPTs P(B), P(E), P(A|B,E),
         P(J|A), P(M|A).)
         • PGM Representation
               – Unit 5: BN
               – Unit 9: Hybrid BN
               – Units 10~15: Naïve Bayes, MRF, HMM, DBN, Kalman filter
         • Query Inference
               – Unit 6: Exact inference
               – Unit 7: Approximate inference
               – Unit 8: Temporal inference
         • Structure & Parameter Learning
               – Units 16~: MLE, EM




                                        Contents
          1.   Sampling
          2.   Random Number Generator
          3.   Stochastic Simulation
          4.   Markov Chain Monte Carlo
          5.   Loopy Belief Propagation
          6.   Variational Methods
          7.   Implementation
          8.   Summary
          9.   References





                              4 Steps of Inference
         • Step 1: Bayes' theorem
               P(X | E=e) = P(X, E=e) / P(E=e) ∝ P(X, E=e)
         • Step 2: Marginalization
               P(X, E=e) = Σ_{h∈H} P(X, E=e, H=h)
         • Step 3: Conditional independence
               = Σ_{h∈H} Π_{i=1~n} P(Xi | Pa(Xi))
         • Step 4: Product-sum computation (Enumeration)
               – Exact inference
               – Approximate inference



           Five Types of Queries in Inference
         • For a probabilistic graphical model G
         • Given a set of evidence E=e
         • Query the PGM with
              – P(e): Likelihood query
              – arg max P(e):
                Maximum likelihood query
              – P(X|e): Posterior belief query
              – arg max_x P(X=x|e): (single query variable)
                Maximum a posteriori (MAP) query
              – arg max_{x1,…,xk} P(X1=x1, …, Xk=xk|e):
                Most probable explanation (MPE) query


                             Approximate Inference
                              v.s. Exact Inference
         • Exact inference: P(X|E) = 0.71828
              – Gets the exact probability value
              – Uses inference steps derived from the
                probability rules
              – Needs exponential time complexity
         • Approximate inference: P(X|E) ≈ 0.71
              – Gets an approximate probability value
              – Uses sampling theory
              – Needs only polynomial time complexity;
                fast computation




                    Why Approximate Inference
         • Large treewidth
               – Large, highly connected graphical models
               – Treewidth may be large (>40) even in sparse
                 networks
         • In many applications, an approximation is
           sufficient
               – Example: P(X = x|e) = 0.3183098861
               – P(X = x|e) ≈ 0.3 may be a good enough
                 approximation
               – e.g., we take action only if P(X=x|e) > 0.5




                               1. Sampling
         • 1.1 What Is Sampling
         • 1.2 Sampling for Inference







                             Basic Idea of Sampling
         • Why sampling
                – Estimate values by random number
                  generation
         1. Sampling
                – = Random number generation
                – Draw N samples from a known distribution P,
                  i.e., generate N random numbers that follow P
         2. Estimation
                – Compute an approximate probability P̂ that
                  approximates the real posterior probability
                  P(X|E)




                             1.1 What Is Sampling
         • A very simple example with one random
           variable: coin toss
               – Tossing the coin gives head or tail
               – It is a Boolean R.V.
                  • Coin = head or tail
               – If the coin is unbiased, head and tail have
                 equal probability
                  • A prior probability distribution
                    P(Coin) = <0.5, 0.5>
                  • Uniform distribution
               – Assume we have a coin but do not
                 know whether it is biased



                             Sampling of Coin Toss
         • Sampling in this example
           = flipping the coin N times
               – e.g., N = 1000
               – One flip → one sample
               – Ideally: 500 heads, 500 tails
                    • P(head) = 500/1000 = 0.5
                      P(tail) = 500/1000 = 0.5
               – In practice, say: 501 heads, 499 tails
                    • P(head) = 501/1000 = 0.501
                      P(tail) = 499/1000 = 0.499
         • After the sampling,
               – We can estimate the probability distribution
               – and check whether the coin is biased



                    Sampling & Estimation (Math)
        • For a Boolean random variable X
              – P(X) is the prior distribution
                = <P(x), P(¬x)>
              – Use a sampling algorithm to generate N
                samples
              – Let N(x) be the number of samples in which x is
                true, and N(¬x) the number in which x is false
                        N(x)/N ≈ P̂(x),    N(¬x)/N ≈ P̂(¬x)
                        lim_{N→∞} N(x)/N = P(x),    lim_{N→∞} N(¬x)/N = P(¬x)



                      1.2 Sampling for Inference
         • Given a Bayesian network G including
           (X1, …, Xn)
               – We get a joint probability distribution
                 P(X1, …, Xn) = Π_i P(Xi | Pa(Xi))
         • For a query P(X|E=e)
               – P(X|e) = α Σ_h Π_i P(Xi | Pa(Xi))
               – It is hard to compute exactly
                    • Needs time exponential in the number of Xi
               – We will use sampling to approximate it





                    Compute P(X|e) by Sampling
         • Sampling (explained in Sections 2, 3, 4)
               – Generate N samples from
                 P(X1, …, Xn) = Π_i P(Xi | Pa(Xi))
         • Estimation
               – Use the N samples to estimate
                 P(X,e) ≈ N(X,e)/N
               – Use the N samples to estimate P(e) ≈ N(e)/N
               – Estimate P(X|e) by P(X,e) / P(e)






                    What Is Sampling Algorithm
         • The algorithm to
               – Generate samples from a known
                 probability distribution P
               – Estimate the approximate probability P̂







                    Various Sampling Algorithms
        • Stochastic simulation (Section 3)
              – Direct sampling
              – Rejection sampling
                    • Reject samples disagreeing with the evidence
              – Likelihood weighting
                    • Use the evidence to weight samples
        • Markov chain Monte Carlo (MCMC) (Section 4)
              – Sample from a stochastic process whose
                stationary distribution is the true posterior





              2. Random Number Generator

         • Very important for sampling algorithm
         • Introduce basic concepts related to
           sampling of Bayesian networks
         • Subsections
               – 2.1 Univariate
               – 2.2 Multivariate






              RNG In Programming Languages
         • Random number generator (RNG)
               – C/C++: rand()
               – Java: Math.random()
               – Matlab: rand()
         • Why should we discuss it?
               – These functions generate random numbers
                 with a uniform distribution
               – How do we generate
                  • Gaussian, …
                  • Multivariate, dependent random
                    variables
                  • Distributions with no closed form?



              Generate a Random Number (1/2)
         • Example in C
               – int i = rand();
               – Returns an integer in 0 ~ RAND_MAX (at least 32767)
         • Generate a random number
           between 1 and n (n < RAND_MAX)
               – int i = 1 + ( rand() % n );
               – (rand() % n) returns a number between 0
                 and n − 1
               – Adding 1 shifts the result to between 1
                 and n
               – This generates integers, not real numbers



              Generate a Random Number (2/2)
         • Ex: integer between 1 and 6
               – 1 + ( rand() % 6 )
         • Ex: real number between 0 and 1
               – double r = (double)rand() / RAND_MAX;
                 (the cast avoids integer division, which would
                 almost always yield 0)
         • Exercise
               – Real number between 10 and 20 (one possible
                 solution is sketched below)
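
         • One possible solution sketch for the exercise (ours, not
           from the notes), reusing the cast-to-double idiom above:

           #include <stdlib.h>

           /* A real number uniformly distributed in [10, 20). */
           double uniform_10_20(void)
           {
               double u = (double)rand() / ((double)RAND_MAX + 1.0); /* u in [0, 1) */
               return 10.0 + u * (20.0 - 10.0);
           }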






                      Generate
            Many Random Numbers Repeatedly
         • Using loop for repeated generation
               – for (int i=0; i<1000; i++)
                  { rand(); }
               – int i, j[1000];
                 for (i=0; i<1000; i++)
                  { j[i] = 1 + rand() % 6; }

                    rand() generates a number uniformly
                            Uniform distribution



               Why Generate Random Numbers
         • Simulate random behavior
         • Make random decision
         • Estimate some values







                Random Behavior/Decision (1/2)
         • Flip a coin for decision (Boolean)
               – Fair: each face has equal probability
               – int coin_face;
                 if (rand() > RAND_MAX/2)
                         coin_face = 1;
                    else coin_face = 0;
               – int coin_face;
                 coin_face = rand() % 2;
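
         • The same pattern extends to a biased coin; a minimal
           sketch (the helper name bernoulli is ours):

           #include <stdlib.h>

           /* Return 1 with probability p, 0 with probability 1-p. */
           int bernoulli(double p)
           {
               return ((double)rand() / RAND_MAX) < p;
           }

           /* e.g., a coin that lands heads 70% of the time: */
           /* int coin_face = bernoulli(0.7); */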






         Random Behavior/Decision (2/2)
       • Random decision among multiple choices
             – Discrete random variable
       • Ex: roll a die (uniform distribution)
             – Fair: each face has equal probability
       • int die_face;    // Random variable
         die_face = 1 + rand() % 6;   // faces 1..6







                                  Estimation
         • If we can simulate a random behavior
         • We can estimate some values
               – First, we repeat the random behavior
               – Then we estimate the value







                         Example: The Coin Toss
         • Flip the coin 1000 times to estimate the
           fairness of the coin
               – int coin_face;           // Random variable
                 int frequency[2] = {0, 0};
                 for (i=0; i<1000; i++)
                 { coin_face = rand() % 2;    // uniform over {0, 1}
                   frequency[coin_face]++;
                 }
         (The resulting frequency histogram over coin faces 0 and 1
         is approximately uniform.)



         Example : Area of Circle (Estimation)
         • double x, y;  // Two random variables
           int N=1000, NCircle=0;
           double Area;
           for (i=0; i<N; i++)
           { x = (double)rand() / RAND_MAX;   // x and y are independent,
             y = (double)rand() / RAND_MAX;   // each uniform in [0, 1]
             if ( (x*x + y*y) <= 1 )          // (x, y) is one sample
                NCircle = NCircle + 1;
           }
           Area = 4.0 * NCircle / N;  // estimates the area of the unit circle (π)



       Multiple Dependent Random Variables
         • Markov Chain: n random variables
                       X1               ...            Xk         ...                     Xn

         • Bayesian Networks: 5 random variables
                             Burglary            Earthquake


                                        Alarm                            What is a sample ?

                             John Calls          Mary Calls

                               Variables are dependent



                                      Sampling
         • It is to randomly generate a sample
               – For a random variable X or                                                Univariate
                 A set of random variables X1, …, Xn                                       Multivariate
                  • Boolean, Discrete, Continuous
                  • Multivariate
                        – Independent, dependent
               – According to a probability distribution P(X)
                  • Discrete X: Histogram
                  • Continuous X:
                        – Uniform, Gaussian, or
                        – Any distribution: Gaussian mixture models



                              Sub-Sections for
                             Generating a Sample
         • 2.1 Univariate
               – Uniform, Gaussian, Gaussian mixture
         • 2.2 Multivariate
               – Uniform
               – Gaussian
                    • Independent, dependent
               – Any distribution
                    • Gaussian mixture
                        – Independent, dependent
                    • Bayesian network




                             2.1 Univariate
         • For a random variable X
               – Boolean, discrete, continuous, hybrid
         • We know P(X) is
               – Uniform, Gaussian, Gaussian mixture
         • Generate a sample X according to P(X)







                             Uniform Generator
         • Every programming language provides
           a rand()/random() function that generates
           a uniformly distributed number
               – Integer within [0, MAX)
         • Sampling a Boolean uniform number
               – rand() % 2
         • Sampling a discrete uniform number
           within [0, d)
               – rand() % d
         • Sampling a continuous uniform number
               – Within [0, 1): rand() / (double)MAX
               – Within [a, b): a + (rand() / (double)MAX)*(b - a)



                    Example : Uniform Generator
         • x=rand(1,10000);
         • h=hist(x,20);
         • bar(h);
         (Bar chart of the 20-bin histogram: each bin holds roughly
         500 samples, i.e. an approximately flat histogram.)



                        Gaussian Generator (1/2)
         • Samples from a Gaussian can be obtained
           from uniformly distributed samples
         • There are functions in C/Java/Matlab to
           randomly generate a univariate
           Gaussian real number with (μ, σ) = (0, 1)
               – C: Numerical Recipes in C
               – Java: Random.nextGaussian()
               – Matlab: randn()
         • Suppose this function is called Gaussian()




                      Gaussian Generator (2/2)
         • Sampling a continuous Gaussian
           number with (μ, σ)
               – (Gaussian() * σ) + μ
         • Sampling a discrete Gaussian number
           with (μ, σ)? (One possible answer is sketched below.)
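
         • One plausible answer (our sketch, not from the notes): draw
           a continuous Gaussian number and round it to the nearest
           integer. Box–Muller is one standard way to build Gaussian()
           from two uniform draws:

           #include <math.h>
           #include <stdlib.h>
           #define PI 3.14159265358979323846

           /* Standard normal (mu=0, sigma=1) via the Box-Muller transform. */
           double Gaussian(void)
           {
               double u1 = ((double)rand() + 1.0) / ((double)RAND_MAX + 1.0); /* (0,1] */
               double u2 = ((double)rand() + 1.0) / ((double)RAND_MAX + 1.0);
               return sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
           }

           /* A discrete Gaussian sample: round a continuous N(mu,sigma) draw. */
           int discrete_gaussian(double mu, double sigma)
           {
               return (int)floor(mu + Gaussian() * sigma + 0.5);
           }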







           Example : Gaussian Generator (1/2)
         • Pseudo code
               – Assume Gaussian() is a pseudo function that
                 generates standard Gaussian numbers
               – double x[10000];
                 for (i=0; i<10000; i++)
                   x[i] = Gaussian();
               – for (i=0; i<10000; i++)
                   x[i] = μ + Gaussian() * σ;






           Example : Gaussian Generator (2/2)
         • Matlab
               – x=randn(1,10000);
               – h=hist(x,20);
               – bar(h);
               (Bar chart: a bell-shaped histogram peaking near
               the middle bins.)
         • Java
               – Random r = new Random();
                 double[] x = new double[10000];
                 for (i=0; i<10000; i++)
                   x[i] = r.nextGaussian();



              Gaussian Mixture Generator (1/2)
         • Random variable X with a Gaussian distribution
               – P(X) = N(X; μ, σ)
         • Random variable Y with a Gaussian
           mixture
               – P(Y) = Σm πm N(Y; μm, σm)







              Gaussian Mixture Generator (2/2)
         • Generate N samples of X
               – for (i=0; i<N; i++)
                   x[i] = (Gaussian() * σ) + μ;
         • Generate N samples of Y with a mixture
           of M Gaussians (a per-sample alternative is
           sketched below)
               – Each Gaussian m has weight πm and parameters μm, σm
               – for (m=0; m<M; m++)
                   for (i=0; i<N*πm; i++)
                      y[m][i] = (Gaussian() * σm) + μm;
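
         • The loop above assigns exactly N*πm samples to component m.
           An alternative sketch (ours) picks the component at random
           for every sample, which also works when N*πm is not an
           integer; Gaussian() is the standard-normal generator
           sketched earlier:

           #include <stdlib.h>

           double Gaussian(void);  /* standard-normal generator, as sketched earlier */

           /* One draw from a 1-D Gaussian mixture: pick component m with
              probability pi[m], then sample from N(mu[m], sigma[m]). */
           double sample_mixture(int M, const double pi[],
                                 const double mu[], const double sigma[])
           {
               double u = (double)rand() / RAND_MAX;  /* uniform in [0, 1] */
               double cum = 0.0;
               int m = M - 1;                         /* fallback for rounding */
               for (int i = 0; i < M; i++) {
                   cum += pi[i];
                   if (u <= cum) { m = i; break; }
               }
               return mu[m] + Gaussian() * sigma[m];
           }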




                    Example : Gaussian Mixture
                            Generator
         •   N=10000; pi1=0.8; pi2=0.2;
         •   mu1=0; mu2=15; sigma1=3; sigma2=5;
         •   x1 = mu1 + randn(1,N*pi1) * sigma1;
         •   x2 = mu2 + randn(1,N*pi2) * sigma2;
         •   x = [x1, x2];
         •   h=hist(x,50);
         •   bar(h);
         (Histogram: a tall mode near 0 and a smaller, wider mode near 15.)




                             2.2 Multivariate
         • For random variables X1,… ,Xn
               – Boolean, discrete, continuous, hybrid
         • We know P(X1,… ,Xn) is
               – Uniform, Gaussian, Gaussian mixture, any
                 distribution
         • Generate a sample (X1,… ,Xn) according
           to P(X1,… ,Xn)
               – Independent
               – Dependent




          Multivariate Boolean Uniform Generator
         • Boolean random variables X1,… ,Xn
         • int X[n]; // A sample
           for (i=0; i<n; i++)
             X[i] = rand() % 2;







          Multivariate Discrete Uniform Generator

         • Discrete random variables X1, …, Xn
               – Each with d discrete values: [0, d−1]
               – Each Xi is uniformly distributed
               – X1, …, Xn must be independent
         • int X[n]; // A sample
           for (i=0; i<n; i++)
             X[i] = rand() % d;





               Multivariate Gaussian Generator
                     - Independent (1/2)
         • Pseudo codes
         • For n random variables X = (X1, …, Xn)
               – Gaussian: N(X; μ, Σ)
                    • Mean vector: μ
                    • Covariance matrix: Σ = [σij]
         • X1, …, Xn are independent
               – σij = 0 for i ≠ j
         • Generating a sample of X
            = generating each Xi independently



               Multivariate Gaussian Generator
                     - Independent (2/2)
         • Generate a sample of X = (X1, …, Xn) with
           μi = 0, σii = 1, σij = 0 for i ≠ j
               – double X[n]; // a sample
                 for (i=0; i<n; i++)
                    X[i] = Gaussian();
         • Generate a sample of X = (X1, …, Xn) with
           μi ≠ 0, σii ≠ 1, σij = 0 for i ≠ j
               – double X[n]; // a sample
                 for (i=0; i<n; i++)
                    X[i] = μi + Gaussian() * sqrt(σii);  // σii is the variance of Xi




                             Example – Matlab (1/2)
         μX = (0, 0)ᵀ,  ΣX = [1 0; 0 1]

         mx=[0 0]';
         Cx=[1 0; 0 1];
         x1=-3:0.1:3;
         x2=-3:0.1:3;
         for i=1:length(x1),
           for j=1:length(x2),
             f(i,j)=(1/(2*pi*det(Cx)^(1/2)))*exp((-1/2)*...
               ([x1(i) x2(j)]-mx')*inv(Cx)*([x1(i);x2(j)]-mx));
           end
         end
         mesh(x1,x2,f)
         pause;
         contour(x1,x2,f)
         pause




                             Example – Matlab (2/2)
         • Randomly generate 1000 samples for
           μX = (0, 0)ᵀ,  ΣX = [1 0; 0 1]

             y1=randn(1,1000);
             y2=randn(1,1000);
             plot(y1,y2,'.');





               Multivariate Gaussian Generator
                      - Dependent (1/4)
        • For n random variables X = (X1, …, Xn)
             – Gaussian: N(X; μ, Σ)
                    • Mean vector: μ
                    • Covariance matrix: Σ = [σij]
                      – Σ is a positive definite matrix
                         • Symmetric, and all eigenvalues (pivots) > 0
                      – For a general matrix A: A = LDU
                         • L: lower triangular, U: upper triangular,
                           D: diagonal matrix of pivots
                      – For a symmetric matrix S: S = LDLᵀ
                      – For a positive definite matrix:
                        Σ = LDLᵀ = (L D^{1/2})(L D^{1/2})ᵀ = PPᵀ
                      – This is called the Cholesky decomposition
        • X1, …, Xn are dependent
             – σij ≠ 0


           Multivariate Gaussian Generator
                   - Dependent (2/4)
        • Generate a sample of X with μ, Σ
             – Perform the Cholesky decomposition of Σ
                    • Cholesky decomposition is the pivot decomposition
                      of a positive definite matrix
                    • Σ = PPᵀ, with P lower triangular
             – Generate independent Gaussian Y = (Y1, …, Yn)
               with μi = 0, σi = 1
             – X = PY + μ





               Multivariate Gaussian Generator
                      - Dependent (3/4)
         • Pseudo code to generate a sample of X
           with μ, Σ
               – Matrix Σ;
                 Vector μ;
                 Vector X(n), Y(n); // a sample

                 Matrix P = chol(Σ)';  // lower-triangular Cholesky factor
                                       // (Matlab's chol returns the upper factor)
                 for (i=0; i<n; i++) Y(i) = Gaussian();
                 X = P*Y + μ;
         (A plain-C version of both steps is sketched below.)
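
         • For readers without Matlab, a plain-C sketch of both steps
           (names are ours; the standard Cholesky–Banachiewicz
           recursion). Unlike Matlab's chol(), it returns the
           lower-triangular factor P with Σ = PPᵀ directly:

           #include <math.h>

           /* Lower-triangular Cholesky factor P of a symmetric positive
              definite n x n matrix S (row-major): S = P * P^T.
              Returns 0 on success, -1 if S is not positive definite. */
           int cholesky(int n, const double *S, double *P)
           {
               for (int i = 0; i < n; i++) {
                   for (int j = 0; j <= i; j++) {
                       double sum = S[i*n + j];
                       for (int k = 0; k < j; k++)
                           sum -= P[i*n + k] * P[j*n + k];
                       if (i == j) {
                           if (sum <= 0.0) return -1;
                           P[i*n + i] = sqrt(sum);
                       } else {
                           P[i*n + j] = sum / P[j*n + j];
                       }
                   }
                   for (int j = i + 1; j < n; j++) P[i*n + j] = 0.0;
               }
               return 0;
           }

           /* X = P*Y + mu, with P lower triangular. */
           void gaussian_transform(int n, const double *P, const double *Y,
                                   const double *mu, double *X)
           {
               for (int i = 0; i < n; i++) {
                   X[i] = mu[i];
                   for (int j = 0; j <= i; j++)
                       X[i] += P[i*n + j] * Y[j];
               }
           }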



               Multivariate Gaussian Generator
                      - Dependent (4/4)
        • Proof
              – For n random variables X = (X1, …, Xn) with μ, Σ
              – Generate n independent, zero-mean, unit-variance
                normal random variables
                Y = (Y1, …, Yn)ᵀ,  μY = (0, …, 0)ᵀ,  ΣY = I (the identity matrix)
              – Take X = PY + μ, where Σ = PPᵀ

          Cov(X) = E{(X − μ)(X − μ)ᵀ} = E{(PY)(PY)ᵀ} = E{P Y Yᵀ Pᵀ}
                 = P E{Y Yᵀ} Pᵀ = P Pᵀ = Σ



                             Example – Matlab (1/4)
      Assume
         μX = (0, 0)ᵀ
         ΣX = [1 1/2; 1/2 1],  P = [1 0; 1/2 √3/2]

       Matlab:
       mx=[0 0]';
       Cx=[1 1/2; 1/2 1];
       P=chol(Cx)';  % transpose: Matlab's chol returns the upper factor




                             Example – Matlab (2/4)
         • Randomly generate 1000 samples for
           μX = (0, 0)ᵀ,  ΣX = [1 1/2; 1/2 1]
         • mx=zeros(2,1000);
           y1=randn(1,1000);
           y2=randn(1,1000);
           y=[y1;y2];
           P=[1, 0; 1/2, sqrt(3)/2];
           x=P*y+mx;
           x1=x(1,:);
           x2=x(2,:);
           plot(x1,x2,'.');
           r=corrcoef(x1',x2');



                             Example – Matlab (3/4)
      Assume
         μX = (5, 5)ᵀ
         ΣX = [1 0.9; 0.9 1],  P = [1 0; 9/10 √19/10]

       Matlab:
        • mx=[5 5]';
        • Cx=[1 9/10; 9/10 1];
        • P=chol(Cx)';  % transpose: Matlab's chol returns the upper factor




                             Example – Matlab (4/4)
         • Randomly generate 1000 samples for
           μX = (5, 5)ᵀ,  ΣX = [1 0.9; 0.9 1]
         • mx=5*ones(2,1000);
           y1=randn(1,1000);
           y2=randn(1,1000);
           y=[y1;y2];
           P=[1, 0; 9/10, sqrt(19)/10];
           x=P*y+mx;
           x1=x(1,:);
           x2=x(2,:);
           plot(x1,x2,'.');
           r=corrcoef(x1',x2');


                    Multivariate Gaussian Mixture
                              Generator
       • Generate N samples of X with a mixture of M
         Gaussians (Matlab-like pseudo code)
             – for (m=0; m<M; m++)
               { Matrix P=chol(Σm)'; %lower Cholesky factor
                 for (i=0; i<N*πm; i++)
                  { %Generate n independent normally distributed
                    % R.V. (μ=0, σ=1), as a column vector
                    y = randn(n, 1)
                    % Transform y into x
                    x=P*y+μm
                  }
               }



                             Example – Matlab (1/4)
         • Combine the previous two Gaussians:
           π1 = 0.5, π2 = 0.5
           μ1 = (0, 0)ᵀ,  Σ1 = [1 1/2; 1/2 1]
           μ2 = (5, 5)ᵀ,  Σ2 = [1 0.9; 0.9 1]
         (Scatter plot of the samples: two overlapping elongated
         clusters, centered at (0,0) and (5,5).)




                             Example – Matlab (2/4)
         • pi1=0.5; pi2=0.5; N=2000;
           mx1=zeros(2,pi1*N); Cx1=[1 1/2; 1/2 1];
           P1=chol(Cx1)'; %P1=[1, 0; 1/2, sqrt(3)/2]
           y1_1=randn(1,pi1*N); y1_2=randn(1,pi1*N);
           y1=[y1_1;y1_2];
           x1=P1*y1+mx1; x1_1=x1(1,:); x1_2=x1(2,:);

           mx2=5*ones(2,pi2*N); Cx2=[1 9/10; 9/10 1];
           P2=chol(Cx2)'; %P2=[1, 0; 9/10, sqrt(19)/10]
           y2_1=randn(1,pi2*N); y2_2=randn(1,pi2*N);
           y2=[y2_1;y2_2];
           x2=P2*y2+mx2; x2_1=x2(1,:); x2_2=x2(2,:);

           z1=[x1_1,x2_1]; z2=[x1_2,x2_2];
           plot(z1,z2,'.');



                             Example – Matlab (3/4)
         • Combine the previous two Gaussians:
           π1 = 0.2, π2 = 0.8
           μ1 = (0, 0)ᵀ,  Σ1 = [1 1/2; 1/2 1]
           μ2 = (5, 5)ᵀ,  Σ2 = [1 0.9; 0.9 1]
         (Scatter plot: the cluster at (5,5) now holds most of the samples.)





                             Example – Matlab (4/4)
         • pi1=0.2; pi2=0.8; N=2000;
           mx1=zeros(2,pi1*N); Cx1=[1 1/2; 1/2 1];
           P1=chol(Cx1)'; %P1=[1, 0; 1/2, sqrt(3)/2]
           y1_1=randn(1,pi1*N); y1_2=randn(1,pi1*N);
           y1=[y1_1;y1_2];
           x1=P1*y1+mx1; x1_1=x1(1,:); x1_2=x1(2,:);

           mx2=5*ones(2,pi2*N); Cx2=[1 9/10; 9/10 1];
           P2=chol(Cx2)'; %P2=[1, 0; 9/10, sqrt(19)/10]
           y2_1=randn(1,pi2*N); y2_2=randn(1,pi2*N);
           y2=[y2_1;y2_2];
           x2=P2*y2+mx2; x2_1=x2(1,:); x2_2=x2(2,:);

           z1=[x1_1,x2_1]; z2=[x1_2,x2_2];
           plot(z1,z2,'.');



                                     Exercise
         • Write a program to randomly generate
           1000 samples of a 3-dimensional Gaussian
           with μ=(5,10,−3), Σ=(2,1,3;4,2,2;3,1,2)
               – Note: a covariance matrix must be symmetric
                 positive definite; check (and if necessary
                 symmetrize) Σ before applying the Cholesky
                 decomposition







                             Any Distribution
         • For random variables X1, …, Xn
               – Boolean, discrete, continuous, hybrid
         • We know P(X1, …, Xn) has no closed-form
           formula
               – Independent: P(X1, …, Xn) = P(X1) … P(Xn)
               – Dependent:
                 P(X1, …, Xn) = Π P(Xi | Parent(Xi))
         • Generate a sample (X1, …, Xn) according to
           P(X1, …, Xn)
               – Independent: generate each Xi from P(Xi)
               – Dependent: generate each Xi from P(Xi | Parent(Xi))
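
         • Every case above reduces to one primitive: drawing from a
           discrete distribution, e.g. the CPT row P(Xi | Parent(Xi))
           selected by the already-sampled parent values. A minimal
           inverse-CDF sketch in C (the function name is ours):

           #include <stdlib.h>

           /* Draw an index in [0, d) from a discrete distribution p[0..d-1]
              (nonnegative entries summing to 1) by inverting the cumulative sum. */
           int sample_discrete(int d, const double p[])
           {
               double u = (double)rand() / RAND_MAX;
               double cum = 0.0;
               for (int i = 0; i < d; i++) {
                   cum += p[i];
                   if (u <= cum) return i;
               }
               return d - 1;  /* guard against floating-point rounding */
           }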




               Two Boolean R.V.s - Independent
         • X1, X2 have distributions:
              – P(X1) = <0.67, 0.33>, P(X2) = <0.75, 0.25>
         • int X1, X2;
           for (i=0; i<1000; i++)
           { if (rand() > RAND_MAX/3)   // X1 = 1 with probability 2/3 ≈ 0.67
                X1 = 1;
             else X1 = 0;
             if (rand() > RAND_MAX/4)   // X2 = 1 with probability 3/4 = 0.75
                X2 = 1;
             else X2 = 0;
           }
         (Bar charts of P(X1) and P(X2) over the values 0 and 1.)



                    Two Boolean R.V.s - Dependent
         • X1, X2 have distributions:
            – P(X1) = <0.67, 0.33>
            – P(X2|X1=T) = <0.75, 0.25>, P(X2|X1=F) = <0.8, 0.2>
         • Generate a sample (x1, x2)
           if (rand() > RAND_MAX/3) x1 = 1;      // P(x1=1) = 2/3
           else x1 = 0;
           if (x1==1)
              if (rand() > RAND_MAX/4) x2 = 1;   // P(x2=1|x1=1) = 0.75
              else x2 = 0;
           else // x1==0
              if (rand() > RAND_MAX/5) x2 = 1;   // P(x2=1|x1=0) = 0.8
              else x2 = 0;




                             Markov Chain
         • Markov Chain: n random variables
                       X1      ...              Xk          ...                     Xn







                                 Bayesian Network
         • Example: 5 random variables
               Burglary              Earthquake


                             Alarm


               John Calls            Mary Calls








                       3. Stochastic Simulation
         • Also called
               – Monte Carlo Methods
               – Sampling Methods
         • Sub-sections
               – 3.1 Direct sampling
               – 3.2 Rejection sampling
               – 3.3 Likelihood weighting




                  3.1 Direct Sampling
         • Generate N samples randomly
         • For the inference P(X|E)
               – P(X|E) = P(X∧E) / P(E)
               – Get N(E) & N(X∧E) from the N
                 samples
                    • N(E): number of samples consistent with E
                    • N(X∧E): number of samples consistent with both X and E
               – P(E) ≈ N(E) / N,
                 P(X∧E) ≈ N(X∧E) / N
               – P(X|E) ≈ N(X∧E) / N(E)   (a counting sketch in C follows)
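
         • A counting sketch of this estimator in C (ours; it assumes
           the N samples are stored as rows of 4 boolean values, as in
           the sprinkler example below, and that the caller supplies
           the evidence and query tests):

           /* Estimate P(X | E) from N complete samples by counting.
              matches_e tests whether a sample agrees with the evidence E;
              matches_xe tests whether it agrees with both X and E. */
           double direct_estimate(int N, const int samples[][4],
                                  int (*matches_e)(const int *),
                                  int (*matches_xe)(const int *))
           {
               int Ne = 0, Nxe = 0;
               for (int i = 0; i < N; i++) {
                   if (matches_e(samples[i]))  Ne++;
                   if (matches_xe(samples[i])) Nxe++;
               }
               return Ne > 0 ? (double)Nxe / Ne : 0.0;  /* undefined if N(E) = 0 */
           }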




                     Example (1/4)
         • For the sprinkler network
               – Estimate P(r|¬w)
                 by direct sampling
               – 4 random variables
               – A sample =
                 (c, s, r, w)







                                    Example (2/4)
         • Generate 1000 samples
            Cloudy   Sprinkler   Rain   WetGrass
              T          T         T        F
              F          T         T        F
              F          F         T        T
              T          T         T        F
              T          T         T        F
             ...        ...       ...      ...
              F          T         T        F




                          Example (3/4)
         • P(r|¬w) = P(r,¬w) / P(¬w)
           N(¬w): number of samples with WetGrass=False
           N(r∧¬w): number of samples with Rain=True and WetGrass=False
           Estimate: N(r∧¬w) / N(¬w)
            Cloudy   Sprinkler   Rain   WetGrass
              T          T         T        F
              F          T         T        F
              F          F         T        T
              T          T         F        F
             ...        ...       ...      ...
              F          T         T        F



                                    Example (4/4)
         • P(R|¬w)
              – = P(R,¬w) / P(¬w)
              – = < P(r∧¬w)/P(¬w), P(¬r∧¬w)/P(¬w) >
            Cloudy   Sprinkler   Rain   WetGrass
              T          T         T        F
              F          T         T        F
              F          F         T        T
              T          T         F        F
             ...        ...       ...      ...
              F          T         T        F

               How to Generate a Sample
            for the Bayesian Network? (1/3)
         • The sprinkler Bayesian network
     A sample is an atomic event:
     (cloudy, sprinkler, rain, wetgrass)
     = (T, F, T, T)

       • Assume a sampling order:
         [ Cloudy, Sprinkler, Rain, WetGrass ]



                       How to Generate a Sample
                    for the Bayesian Network? (2/3)
         • int C, S, R, W;
           for (i=0; i<1000; i++)
           { if (rand() > RAND_MAX/2) C = T;
                else C = F;
             if (rand() > RAND_MAX/2) S = T;
                else S = F;
             if (rand() > RAND_MAX/2) R = T;
                else R = F;
             if (rand() > RAND_MAX/2) W = T;
                else W = F;
            }
         (Incorrect implementation: every variable is sampled from
         <0.5, 0.5>, ignoring the CPTs and the dependencies among
         the variables.)


                       How to Generate a Sample
                    for the Bayesian Network? (3/3)
       • int C, S, R, W;
         for (i=0; i<1000; i++)
         { if (rand() > RAND_MAX/2) C = T;
           else C = F;
           if (C==T)
               if (rand() > RAND_MAX*0.9)
                    S = T;
               else S = F;
           else // C==F
               if (rand() > RAND_MAX/2)
                    S = T;
               else S = F;
           ...
         }
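
        • Completing the pattern for all four variables gives one full
          prior sample per call. A sketch assuming the textbook
          sprinkler CPTs (the entries not shown on these slides, e.g.
          P(s|¬c)=0.5 and P(w|¬s,¬r)=0.0, are the standard Russell &
          Norvig values):

          #include <stdlib.h>

          static int flip(double p)                 /* 1 with probability p */
          {
              return ((double)rand() / RAND_MAX) < p;
          }

          /* Sample (C, S, R, W) in topological order. */
          void sample_sprinkler(int *C, int *S, int *R, int *W)
          {
              *C = flip(0.5);                       /* P(c) = 0.5            */
              *S = flip(*C ? 0.1 : 0.5);            /* P(s|c),  P(s|~c)      */
              *R = flip(*C ? 0.8 : 0.2);            /* P(r|c),  P(r|~c)      */
              if (*S && *R)      *W = flip(0.99);   /* P(w|s,r)              */
              else if (*S || *R) *W = flip(0.9);    /* P(w|s,~r) = P(w|~s,r) */
              else               *W = flip(0.0);    /* P(w|~s,~r)            */
          }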

                           An Example
                    Generating One Sample (1/8)
         • The sampling algorithm
1. Sample from P(Cloudy) = <0.5, 0.5>
   – Suppose it returns true
2. Sample from P(Sprinkler|Cloudy=true) = <0.1, 0.9>
   – Suppose it returns false
3. Sample from P(Rain|Cloudy=true) = <0.8, 0.2>
   – Suppose it returns true
4. Sample from P(WetGrass|Sprinkler=false, Rain=true) = <0.9, 0.1>
   – Suppose it returns true
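As a worked check (not on the original slide), the probability that prior sampling generates exactly this event is the product of the four conditionals used above: P(c) · P(¬s|c) · P(r|c) · P(w|¬s,r) = 0.5 × 0.9 × 0.8 × 0.9 = 0.324.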

                           An Example
                    Generating One Sample (2/8)
Samples:   C  S  R  W   (none generated yet)
(Figure: the sprinkler Bayesian network)

                           An Example
                    Generating One Sample (3/8)
Random sampling: Cloudy
Return: Cloudy = true
Samples:   C  S  R  W
           c

                           An Example
                    Generating One Sample (4/8)
Samples:   C  S  R  W
           c
Random sampling, given Cloudy = true:
1. Sprinkler
2. Rain

                           An Example
                    Generating One Sample (5/8)
Samples:   C  S  R  W
           c  ¬s
Random sampling: Sprinkler, given Cloudy = true
Return: Sprinkler = false

                           An Example
                    Generating One Sample (6/8)
Samples:   C  S  R  W
           c  ¬s  r
Random sampling: Rain, given Cloudy = true
Return: Rain = true

                           An Example
                    Generating One Sample (7/8)
Samples:   C  S  R  W
           c  ¬s  r
Random sampling: WetGrass, given Rain = true, Sprinkler = false

                           An Example
                    Generating One Sample (8/8)
Samples:   C  S  R  W
           c  ¬s  r  w
Random sampling: WetGrass, given Rain = true, Sprinkler = false
Return: WetGrass = true



                             The Algorithm (1/2)
         • To generate one sample
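The PRIOR-SAMPLE algorithm figure did not survive extraction. Below is a minimal C sketch of the procedure for the sprinkler network; the CPT values for Cloudy, Sprinkler, and Rain are those quoted on the nearby slides, while the full WetGrass CPT (0.99 / 0.90 / 0.90 / 0.00) is an assumption taken from the standard textbook version of this network, and the helper names are illustrative:

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  /* Return 1 with probability p, using the C library rand(). */
  static int bernoulli(double p) {
      return rand() < p * ((double)RAND_MAX + 1.0);
  }

  /* One pass of prior sampling: sample each variable in
     topological order, conditioned on its sampled parents. */
  static void prior_sample(int *C, int *S, int *R, int *W) {
      *C = bernoulli(0.5);                      /* P(c) = 0.5            */
      *S = bernoulli(*C ? 0.1 : 0.5);           /* P(s|c), P(s|~c)       */
      *R = bernoulli(*C ? 0.8 : 0.2);           /* P(r|c), P(r|~c)       */
      if (*S && *R)      *W = bernoulli(0.99);  /* P(w|s,r), assumed     */
      else if (*S || *R) *W = bernoulli(0.90);  /* P(w|s,~r) = P(w|~s,r) */
      else               *W = bernoulli(0.00);  /* P(w|~s,~r), assumed   */
  }

  int main(void) {
      int C, S, R, W;
      srand((unsigned)time(NULL));
      prior_sample(&C, &S, &R, &W);
      printf("(C,S,R,W) = (%d,%d,%d,%d)\n", C, S, R, W);
      return 0;
  }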







                  The Algorithm (2/2)
• In the previous example
     – We got a sample [true, false, true, true] of the
       Bayesian network using the PRIOR-SAMPLE algorithm
• Sampling a Bayesian network
     – Repeat the sampling N times
     – We get N samples
• We can use the N samples to compute
  any query probability in the Bayesian network



                 How It Works (1/2)
• Why can any probability be answered from the samples?
     – The N samples effectively form a full joint
       distribution table (FJD)

  Samples:             FJD:
  C  S  R  W           C  S  R  W    P
  T  T  T  F           T  T  T  F   0.02
  F  T  T  F           F  T  T  F   0.13
  F  F  T  T           F  F  T  T   0.04
  T  T  F  F           T  T  F  F   0.15
  ...                  ...
  F  T  T  F



                             Why It Works (2/2)
• A sample is an atomic event (x1, ..., xn)
• P(x1, ..., xn) ≈ N(x1, ..., xn) / N
• Therefore, an FJD is generated from the N samples
• Note: N < 2^n (far fewer samples than the 2^n entries of the exact FJD)







                       Exercise: Direct Sampling
Query: What is the probability that a student studied, given that they pass the exam?
(Figure: network with smart and study as parents of prepared; smart, prepared, and fair as parents of pass)
p(smart) = .8    p(study) = .6    p(fair) = .9

p(prep|…)     smart   ¬smart
  study        .9      .7
  ¬study       .5      .1

p(pass|…)        smart           ¬smart
              prep   ¬prep    prep   ¬prep
  fair         .9     .7       .7     .2
  ¬fair        .1     .1       .1     .1



                    Problems of Direct Sampling
         • It needs to generate very many
           samples in order to obtain the
           approximate FJD
         • For a query of conditional
           probability P(X|e)
               – Can we just approximate the
                 conditional probability?
               – Yes, the following two algorithms will
                 do this



                             3.2 Rejection Sampling
• P̂(X|e) is estimated from samples agreeing with e
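The REJECTION-SAMPLING algorithm figure is missing here; the following minimal sketch estimates P(Rain | Sprinkler = true), assuming the prior_sample() routine from the earlier sketch (both are illustrations, not the unit's own code):

  /* Rejection sampling: draw prior samples, discard those that
     disagree with the evidence, and count Rain among the rest. */
  double rejection_rain_given_s(int num_samples) {
      int C, S, R, W, n_agree = 0, n_rain = 0;
      for (int i = 0; i < num_samples; i++) {
          prior_sample(&C, &S, &R, &W);
          if (!S) continue;              /* reject: disagrees with e */
          n_agree++;
          if (R) n_rain++;
      }
      /* Caller must check n_agree > 0; it may be 0 if P(e) is small. */
      return (double)n_rain / n_agree;
  }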







                                An Example
         • Estimate P(Rain|Sprinkler=true)
           using 100 samples
– 27 samples have Sprinkler = true
     – Of these, 8 have Rain = true and 19 have Rain = false
     ⇒ P̂(Rain|Sprinkler=true) = Normalize(<8, 19>) = <0.296, 0.704>
• Similar to a basic real-world empirical estimation procedure



                    Analysis of Rejection Sampling

  P̂(X|e) = N(X, e) / N(e) ≈ P(X, e) / P(e) = P(X|e)
• Hence rejection sampling returns consistent posterior estimates
• Problem: expensive if P(e) is small
     – P(e) drops off exponentially with the number of evidence
       variables! With 10 binary evidence variables whose observed
       values each match about half the time, only about one sample
       in 2^10 ≈ 1000 survives rejection.





               3.3 Likelihood Weighting
         • Avoids the inefficiency of rejection
           sampling
               – By generating only events consistent
                 with the evidence variables e
• Idea
     – Fix the evidence variables
     – Randomly sample only the hidden variables to generate a sample event
     – Weight each sample event by the likelihood it accords the evidence
        • Events have different weights



                  An Example (1/9)
         • Query P(Rain|sprinkler, wetgrass)







                             An Example (2/9)
1. Set the weight w = 1.0
2. Sample from P(Cloudy) = <0.5, 0.5>
   • Suppose it returns true
3. The evidence Sprinkler = true, so we set
   w = w × P(sprinkler|cloudy) = 1 × 0.1 = 0.1
4. Sample from P(Rain|cloudy) = <0.8, 0.2>
   • Suppose it returns true
5. The evidence WetGrass = true, so we set
   w = w × P(wetgrass|sprinkler, rain) = 0.1 × 0.99 = 0.099
⇒ A sample event (true, true, true, true) with weight 0.099



                             An Example (3/9)




                       =1.0




                             An Example (4/9)




                       =1.0




                             An Example (5/9)




                       =1.0




                             An Example (6/9)




                    =1.0  0.1



                             An Example (7/9)




                    =1.0  0.1



                             An Example (8/9)




                    =1.0  0.1



                             An Example (9/9)




          =1.0  0.1  0.99
           = 0.099




                   The Algorithm (1/2)
• The example generates a sample event (true, true, true, true)
  for the query P(Rain|sprinkler, wetgrass)
• Repeat the sampling N times
     – We get N sample events
     – Each event has a likelihood weight w
     – w1 = Σ of the weights of events with rain = true,
       w2 = Σ of the weights of events with rain = false
• P(Rain|sprinkler, wetgrass) = < w1/(w1+w2), w2/(w1+w2) >



                             The Algorithm (2/2)
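The LIKELIHOOD-WEIGHTING algorithm figure did not survive extraction. A minimal self-contained C sketch for the running query P(Rain | sprinkler = true, wetgrass = true), assuming the same CPT values as the earlier prior-sampling sketch:

  #include <stdio.h>
  #include <stdlib.h>

  static int bernoulli(double p) {           /* 1 with probability p */
      return rand() < p * ((double)RAND_MAX + 1.0);
  }

  /* One weighted sample: hidden variables (Cloudy, Rain) are sampled
     from their CPTs; evidence (Sprinkler=T, WetGrass=T) is fixed and
     contributes its likelihood to the weight. */
  static double weighted_sample_rain(int *rain) {
      double w = 1.0;
      int C = bernoulli(0.5);               /* sample Cloudy           */
      w *= C ? 0.1 : 0.5;                   /* evidence s: P(s|C)      */
      int R = bernoulli(C ? 0.8 : 0.2);     /* sample Rain             */
      w *= R ? 0.99 : 0.90;                 /* evidence w: P(w|s,R),
                                               assumed CPT values      */
      *rain = R;
      return w;
  }

  int main(void) {
      double w1 = 0.0, w2 = 0.0;            /* weight sums for r, ~r   */
      for (int i = 0; i < 100000; i++) {
          int R;
          double w = weighted_sample_rain(&R);
          if (R) w1 += w; else w2 += w;
      }
      printf("P(Rain|s,w) ~= <%.3f, %.3f>\n",
             w1 / (w1 + w2), w2 / (w1 + w2));
      return 0;
  }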







                    Exercise: Likelihood Weighting
Query: What is the probability that a student studied, given that they pass the exam?
(Figure: network with smart and study as parents of prepared; smart, prepared, and fair as parents of pass)
p(smart) = .8    p(study) = .6    p(fair) = .9

p(prep|…)     smart   ¬smart
  study        .9      .7
  ¬study       .5      .1

p(pass|…)        smart           ¬smart
              prep   ¬prep    prep   ¬prep
  fair         .9     .7       .7     .2
  ¬fair        .1     .1       .1     .1



                     Analysis (1/3)
• Why does the algorithm work for P(X|E=e)?
• Let the sampling probability of WEIGHTED-SAMPLE be S_WS
     – The evidence variables E are fixed at e
     – All the other variables: Z = {X} ∪ Y
     – The algorithm samples each variable in Z given its parent values:
       S_WS(z, e) = Π_{i=1..l} P(z_i | parents(Z_i))



                        Analysis (2/3)
• The likelihood weight w for a given sample (z, e) = (x, y, e) is
     w(z, e) = Π_{i=1..m} P(e_i | parents(E_i))
• The weighted probability of a sample (z, e) = (x, y, e) is
     S_WS(z, e) · w(z, e)
     = Π_{i=1..l} P(z_i | parents(Z_i)) · Π_{i=1..m} P(e_i | parents(E_i))
     = P(x, y, e)
  (by the chain rule: P(x_1, ..., x_n) = Π_{i=1..n} P(x_i | parents(X_i)))




                                   Analysis (3/3)
  P̂(x|e) = α Σ_y N_WS(x, y, e) · w(x, y, e)
          ≈ α' Σ_y S_WS(x, y, e) · w(x, y, e)
          = α' Σ_y P(x, y, e)
          = α' P(x, e) ∝ P(x|e)
So the algorithm works



                      Discussions
• Likelihood weighting is efficient because it uses all the samples generated
• However, its performance degrades as the number of evidence variables increases, because
     – Most samples will have very low weights, and
     – The weighted estimate will be dominated by the tiny fraction of
       samples that give the evidence more than an infinitesimal likelihood




                       4. Inference by MCMC
         • Key idea
               – Sampling process as a Markov Chain
                    • Next sample depends on the previous one
               – Approximate any posterior distribution
         • "State" of network
           = current assignment to all variables
         • Generate next state
               – by sampling one variable given Markov
                 blanket
         • Sample each variable in turn, keeping
           evidence fixed



                             The Markov Chain
• With Sprinkler = true, WetGrass = true, there are four states:
  the joint assignments of the non-evidence variables Cloudy and Rain
(Figure: the four states, with transitions between states that differ in one variable)



                       Markov Blanket Sampling
         • Markov blanket of Cloudy is
               – Sprinkler and Rain
         • Markov blanket of Rain is
               – Cloudy, Sprinkler, and WetGrass
• The probability given the Markov blanket is calculated as follows:
     – P(x'i | MB(Xi)) ∝ P(x'i | Parents(Xi))
       × Π_{Zj ∈ Children(Xi)} P(zj | Parents(Zj))



                             An Example (1/2)
         • Estimate P(Rain|sprinkler,wetgrass)
         • Loop for N times
               – Sample Cloudy or Rain given its
                 Markov blanket
         • Count number of times Rain=true
           and Rain=false in the samples






                             An Example (2/2)
         • E.g., visit 100 states
               – 31 have Rain=true,
               – 69 have Rain=false
         • P(Rain|sprinkler,wetgrass)
           = Normalize(<31, 69>)
           = <0.31, 0.69>






                             The Algorithm
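The MCMC algorithm figure is missing from this extraction. A minimal self-contained C sketch of Gibbs sampling for P(Rain | sprinkler = true, wetgrass = true) on the sprinkler network, under the same assumed CPTs as before; each step resamples Cloudy or Rain from its distribution given its Markov blanket:

  #include <stdio.h>
  #include <stdlib.h>

  static int bernoulli(double p) {           /* 1 with probability p */
      return rand() < p * ((double)RAND_MAX + 1.0);
  }

  static double gibbs_rain(int num_steps) {
      int C = 1, R = 1;                      /* arbitrary initial state */
      int n_rain = 0;
      for (int t = 0; t < num_steps; t++) {
          if (rand() % 2) {
              /* Resample Cloudy given its blanket {Sprinkler, Rain}:
                 P(C | s, R) is proportional to P(C) P(s|C) P(R|C). */
              double pt = 0.5 * 0.1 * (R ? 0.8 : 0.2);  /* C = true  */
              double pf = 0.5 * 0.5 * (R ? 0.2 : 0.8);  /* C = false */
              C = bernoulli(pt / (pt + pf));
          } else {
              /* Resample Rain given its blanket {Cloudy, S, W}:
                 P(R | C, s, w) is proportional to P(R|C) P(w|s,R). */
              double pt = (C ? 0.8 : 0.2) * 0.99;       /* R = true  */
              double pf = (C ? 0.2 : 0.8) * 0.90;       /* R = false */
              R = bernoulli(pt / (pt + pf));
          }
          if (R) n_rain++;                   /* count visited states */
      }
      return (double)n_rain / num_steps;
  }

  int main(void) {
      printf("P(rain|s,w) ~= %.3f\n", gibbs_rain(100000));
      return 0;
  }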







                              Why it works
         • Skipped
               – Details in pp. 517-518 in the AIMA 2e
                 textbook







                               Sub-Sections
         • 4.1 Markov chain theory
         • 4.2 Two MCMC sampling algorithms







            4.1 Markov Chain Theory
• Suppose X1, X2, … take some set of values
     – w.l.o.g., these values are 1, 2, ...
• A Markov chain is a process that corresponds to the network:
     X1 → X2 → X3 → … → Xn → …
• To quantify the chain, we need to specify
     – Initial probability: P(X1)
     – Transition probability: P(Xt+1|Xt)
• A Markov chain has a stationary transition probability:
  P(Xt+1|Xt) is the same for all times t



                             Irreducible Chains
• A state j is accessible from state i if there is an n
  such that P(Xn = j | X1 = i) > 0
     – There is a positive probability of reaching j from i
       after some number of steps
• A chain is irreducible if every state is accessible from every state




                             Ergodic Chains
• A state i is positively recurrent if there is a finite expected
  time to get back to state i after being in state i
     – If X has a finite number of states, it suffices that
       i is accessible from itself
• A chain is ergodic if it is irreducible and every state is
  positively recurrent





                              (A)periodic Chains
• A state i is periodic if there is an integer d > 1 such that
     P(Xn = i | X1 = i) = 0 whenever n is not divisible by d
• Intuition: state i can recur only every d steps
• A chain is aperiodic if it contains no periodic state






                             Stationary Probabilities
Thm:
• If a chain is ergodic and aperiodic, then the limit
     lim_{n→∞} P(Xn = j | X1 = i)
  exists and does not depend on i
• Moreover, let P*(X = j) = lim_{n→∞} P(Xn = j | X1 = i);
  then P*(X) is the unique probability distribution satisfying
     P*(X = j) = Σ_i P(Xt+1 = j | Xt = i) · P*(X = i)
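A worked example (illustrative numbers, not from the slides): for a two-state chain with P(Xt+1 = 2 | Xt = 1) = 0.3 and P(Xt+1 = 1 | Xt = 2) = 0.6, solving P*(1) = 0.7·P*(1) + 0.6·P*(2) together with P*(1) + P*(2) = 1 gives P*(1) = 2/3 and P*(2) = 1/3.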





                 Stationary Probabilities
         • The probability P*(X) is the stationary
           probability of the process
         • Regardless of the starting point, the
           process will converge to this probability

         • The rate of convergence depends on
           properties of the transition probability






                               Sampling from the
                             Stationary Probability
         • This theory suggests how to sample from
           the stationary probability:
               – Set X1 = i, for some random/arbitrary i
               – For t = 1, 2, …, n
                  • Sample a value xt+1 for Xt+1 from
                   P(Xt+1|Xt=xt)
               – return xn
         • If n is large enough, then this is a sample
           from P*(X)
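A minimal C sketch of this procedure for an assumed two-state chain (the same illustrative transition probabilities as the worked example above):

  #include <stdlib.h>

  static int bernoulli(double p) {           /* 1 with probability p */
      return rand() < p * ((double)RAND_MAX + 1.0);
  }

  /* Simulate the chain for n steps and return the final state,
     which for large n is approximately distributed as P*(X). */
  static int sample_stationary(int n) {
      int x = 1;                                    /* arbitrary start */
      for (int t = 1; t < n; t++) {
          if (x == 1) x = bernoulli(0.3) ? 2 : 1;   /* P(X=2|X=1)=0.3  */
          else        x = bernoulli(0.6) ? 1 : 2;   /* P(X=1|X=2)=0.6  */
      }
      return x;                              /* ~ P* = <2/3, 1/3> here */
  }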




                       Designing Markov Chains
         • How do we construct the right chain to
           sample from?
               – Ensuring aperiodicity and irreducibility is
                 usually easy

         • Problem is ensuring the desired
           stationary probability






                       Designing Markov Chains
Key tool (the detailed-balance criterion):
• If the transition probability satisfies
     P(Xt+1 = j | Xt = i) / P(Xt+1 = i | Xt = j) = Q(X = j) / Q(X = i)
  whenever P(Xt+1 = j | Xt = i) > 0,
  then P*(X) = Q(X)
• This gives a local criterion for checking that the chain
  will have the right stationary distribution





                             MCMC Methods
         • We can use these results to sample from
             P(X1,…,Xn|e)
         Idea:
         • Construct an ergodic & aperiodic
           Markov Chain such that
           P*(X1,…,Xn) = P(X1,…,Xn|e)
         • Simulate the chain n steps to get a
           sample





                             MCMC Methods
         Notes:
• The Markov chain variable Y takes as values the assignments
  to all variables that are consistent with the evidence:
     V(Y) = { (x1, ..., xn) ∈ V(X1) × … × V(Xn) | x1, ..., xn satisfy e }
• For simplicity, we will denote such a state using the
  vector of variables






                      4.2 Two MCMC Sampling
                            Algorithms
         • Gibbs Sampler
         • Metropolis-Hastings Sampler







                             Gibbs Sampler
• One of the simplest MCMC methods
• Each transition changes the state of one Xi
• The transition probability is defined by P itself,
  as a stochastic procedure:
     – Input: a state x1,…,xn
     – Choose i at random (uniform probability)
     – Sample x'i from P(Xi | x1, …, xi-1, xi+1, …, xn, e)
     – Let x'j = xj for all j ≠ i
     – Return x'1,…,x'n



                    Correctness of Gibbs Sampler
         • How do we show correctness?







                    Correctness of Gibbs Sampler
• By the chain rule,
     P(x1,…,xi-1, xi, xi+1,…,xn | e)
     = P(x1,…,xi-1, xi+1,…,xn | e) · P(xi | x1,…,xi-1, xi+1,…,xn, e)
• Thus, for the transition we get
     P(x1,…,xi-1, xi, xi+1,…,xn | e) / P(x1,…,xi-1, x'i, xi+1,…,xn | e)
     = P(xi | x1,…,xi-1, xi+1,…,xn, e) / P(x'i | x1,…,xi-1, xi+1,…,xn, e)
• Since we choose i from the same distribution at each stage,
  this procedure satisfies the ratio criterion



                             Gibbs Sampling for
                             Bayesian Network
         • Why is the Gibbs sampler “easy” in BNs?
         • Recall that the Markov blanket of a
           variable separates it from the other
           variables in the network
               – P(Xi | X1,…,Xi-1,Xi+1,…,Xn) = P(Xi |
                    Mbi )
         • This property allows us to use local
           computations to perform sampling in
           each transition



                               Gibbs Sampling in
                               Bayesian Networks
• How do we evaluate P(Xi | x1,…,xi-1, xi+1,…,xn)?
• Let Y1, …, Yk be the children of Xi
     – By the definition of Mbi, the parents of Yj are in Mbi ∪ {Xi}
• It is easy to show that
     P(xi | Mbi) = [ P(xi | Pai) · Π_j P(yj | pa_yj) ]
                   / [ Σ_{x'i} P(x'i | Pai) · Π_j P(yj | pa_yj) ]
  (in the denominator, each pa_yj has xi replaced by x'i)




                             Metropolis-Hastings
• More general than Gibbs (Gibbs is a special case of M-H)
• Uses an arbitrary proposal distribution q(x'|x) that makes
  the chain ergodic and aperiodic (e.g., uniform)
• The transition to x' happens with probability
     α(x'|x) = min(1, [P(x') q(x|x')] / [P(x) q(x'|x)])
• Useful when computing P(x) exactly is infeasible: only the
  ratio P(x')/P(x) is needed, so any normalizing constant cancels
• Requires: q(x'|x) = 0 implies P(x') = 0 or q(x|x') = 0
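A minimal sketch of one Metropolis-Hastings step over the four (Cloudy, Rain) states of the sprinkler example, with a uniform (symmetric) proposal so that q cancels in the ratio; unnormalized_p() is a hypothetical helper built from the assumed CPTs used in the earlier sketches:

  #include <stdlib.h>

  static int bernoulli(double p) {           /* 1 with probability p */
      return rand() < p * ((double)RAND_MAX + 1.0);
  }

  /* P(C, R, s, w) up to the normalizing constant P(s, w). */
  static double unnormalized_p(int C, int R) {
      return 0.5 * (C ? 0.1 : 0.5)                       /* P(C) P(s|C) */
                 * (C ? (R ? 0.8 : 0.2) : (R ? 0.2 : 0.8))  /* P(R|C)   */
                 * (R ? 0.99 : 0.90);                    /* P(w|s,R)    */
  }

  /* One M-H step: propose (C', R') uniformly, accept with
     probability min(1, P(x')/P(x)); q cancels by symmetry. */
  static void mh_step(int *C, int *R) {
      int Cp = rand() % 2, Rp = rand() % 2;
      double a = unnormalized_p(Cp, Rp) / unnormalized_p(*C, *R);
      if (a >= 1.0 || bernoulli(a)) { *C = Cp; *R = Rp; }  /* accept */
  }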



                             Sampling Strategy
• How do we collect the samples?
Strategy I:
• Run the chain M times, each for N steps
     – Each run starts from a different starting point
• Return the last state in each run
(Figure: M independent chains run in parallel)



                             Sampling Strategy
Strategy II:
• Run one chain for a long time
• After some "burn-in" period, sample points every fixed number of steps
(Figure: one long chain; after the burn-in period, M samples are taken at regular intervals)
         Fu Jen University      Department of Electrical Engineering              Wang, Yuan-Kai Copyright
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn
07 approximate inference in bn

More Related Content

What's hot

Probabilistic Reasoning
Probabilistic ReasoningProbabilistic Reasoning
Probabilistic ReasoningJunya Tanaka
 
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...Simplilearn
 
Inference in Bayesian Networks
Inference in Bayesian NetworksInference in Bayesian Networks
Inference in Bayesian Networksguestfee8698
 
Lecture9 - Bayesian-Decision-Theory
Lecture9 - Bayesian-Decision-TheoryLecture9 - Bayesian-Decision-Theory
Lecture9 - Bayesian-Decision-TheoryAlbert Orriols-Puig
 
Variational Autoencoder
Variational AutoencoderVariational Autoencoder
Variational AutoencoderMark Chang
 
Genetic Algorithms - Artificial Intelligence
Genetic Algorithms - Artificial IntelligenceGenetic Algorithms - Artificial Intelligence
Genetic Algorithms - Artificial IntelligenceSahil Kumar
 
Independent Component Analysis
Independent Component AnalysisIndependent Component Analysis
Independent Component AnalysisTatsuya Yokota
 
Machine Learning with Decision trees
Machine Learning with Decision treesMachine Learning with Decision trees
Machine Learning with Decision treesKnoldus Inc.
 
Adaptive Resonance Theory
Adaptive Resonance TheoryAdaptive Resonance Theory
Adaptive Resonance TheoryNaveen Kumar
 
Pattern recognition and Machine Learning.
Pattern recognition and Machine Learning.Pattern recognition and Machine Learning.
Pattern recognition and Machine Learning.Rohit Kumar
 
Inference in First-Order Logic
Inference in First-Order Logic Inference in First-Order Logic
Inference in First-Order Logic Junya Tanaka
 

What's hot (20)

Uncertainty in AI
Uncertainty in AIUncertainty in AI
Uncertainty in AI
 
Bayesian network
Bayesian networkBayesian network
Bayesian network
 
03 Machine Learning Linear Algebra
03 Machine Learning Linear Algebra03 Machine Learning Linear Algebra
03 Machine Learning Linear Algebra
 
Probabilistic Reasoning
Probabilistic ReasoningProbabilistic Reasoning
Probabilistic Reasoning
 
Bayesian networks
Bayesian networksBayesian networks
Bayesian networks
 
Naive Bayes Presentation
Naive Bayes PresentationNaive Bayes Presentation
Naive Bayes Presentation
 
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
 
Multi Layer Network
Multi Layer NetworkMulti Layer Network
Multi Layer Network
 
Inference in Bayesian Networks
Inference in Bayesian NetworksInference in Bayesian Networks
Inference in Bayesian Networks
 
Bayesian networks
Bayesian networksBayesian networks
Bayesian networks
 
Lecture9 - Bayesian-Decision-Theory
Lecture9 - Bayesian-Decision-TheoryLecture9 - Bayesian-Decision-Theory
Lecture9 - Bayesian-Decision-Theory
 
Variational Autoencoder
Variational AutoencoderVariational Autoencoder
Variational Autoencoder
 
Genetic Algorithms - Artificial Intelligence
Genetic Algorithms - Artificial IntelligenceGenetic Algorithms - Artificial Intelligence
Genetic Algorithms - Artificial Intelligence
 
Independent Component Analysis
Independent Component AnalysisIndependent Component Analysis
Independent Component Analysis
 
Machine Learning with Decision trees
Machine Learning with Decision treesMachine Learning with Decision trees
Machine Learning with Decision trees
 
First order logic
First order logicFirst order logic
First order logic
 
Adaptive Resonance Theory
Adaptive Resonance TheoryAdaptive Resonance Theory
Adaptive Resonance Theory
 
Pattern recognition and Machine Learning.
Pattern recognition and Machine Learning.Pattern recognition and Machine Learning.
Pattern recognition and Machine Learning.
 
Inference in First-Order Logic
Inference in First-Order Logic Inference in First-Order Logic
Inference in First-Order Logic
 
Graph coloring using backtracking
Graph coloring using backtrackingGraph coloring using backtracking
Graph coloring using backtracking
 

Similar to 07 approximate inference in bn

Information processing with artificial spiking neural networks
Information processing with artificial spiking neural networksInformation processing with artificial spiking neural networks
Information processing with artificial spiking neural networksAdvanced-Concepts-Team
 
Learning Bayesian Networks
Learning Bayesian NetworksLearning Bayesian Networks
Learning Bayesian Networksguestfee8698
 
Subspace Identification
Subspace IdentificationSubspace Identification
Subspace Identificationaileencv
 
Bayesian probabilistic interference
Bayesian probabilistic interferenceBayesian probabilistic interference
Bayesian probabilistic interferencechauhankapil
 
Bayesian probabilistic interference
Bayesian probabilistic interferenceBayesian probabilistic interference
Bayesian probabilistic interferencechauhankapil
 
Feasibility of moment tensor inversion for a single-well microseismic data us...
Feasibility of moment tensor inversion for a single-well microseismic data us...Feasibility of moment tensor inversion for a single-well microseismic data us...
Feasibility of moment tensor inversion for a single-well microseismic data us...Oleg Ovcharenko
 
Data Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methodsData Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methodsSalah Amean
 
Csss2010 20100803-kanevski-lecture2
Csss2010 20100803-kanevski-lecture2Csss2010 20100803-kanevski-lecture2
Csss2010 20100803-kanevski-lecture2hasan_elektro
 
2016Nonlinear inversion of electrical resistivity imaging.pdf
2016Nonlinear inversion of electrical resistivity imaging.pdf2016Nonlinear inversion of electrical resistivity imaging.pdf
2016Nonlinear inversion of electrical resistivity imaging.pdfDUSABEMARIYA
 
Artificial neural networks
Artificial neural networksArtificial neural networks
Artificial neural networksstellajoseph
 

Similar to 07 approximate inference in bn (20)

06 exact inference in bn
06 exact inference in bn06 exact inference in bn
06 exact inference in bn
 
08 probabilistic inference over time
08 probabilistic inference over time08 probabilistic inference over time
08 probabilistic inference over time
 
05 probabilistic graphical models
05 probabilistic graphical models05 probabilistic graphical models
05 probabilistic graphical models
 
03 Uncertainty inference(discrete)
03 Uncertainty inference(discrete)03 Uncertainty inference(discrete)
03 Uncertainty inference(discrete)
 
02 Statistics review
02 Statistics review02 Statistics review
02 Statistics review
 
Information processing with artificial spiking neural networks
Information processing with artificial spiking neural networksInformation processing with artificial spiking neural networks
Information processing with artificial spiking neural networks
 
AI Lesson 29
AI Lesson 29AI Lesson 29
AI Lesson 29
 
Lesson 29
Lesson 29Lesson 29
Lesson 29
 
Learning Bayesian Networks
Learning Bayesian NetworksLearning Bayesian Networks
Learning Bayesian Networks
 
Dx25751756
Dx25751756Dx25751756
Dx25751756
 
Subspace Identification
Subspace IdentificationSubspace Identification
Subspace Identification
 
Bayesian probabilistic interference
Bayesian probabilistic interferenceBayesian probabilistic interference
Bayesian probabilistic interference
 
Bayesian probabilistic interference
Bayesian probabilistic interferenceBayesian probabilistic interference
Bayesian probabilistic interference
 
04 Uncertainty inference(continuous)
04 Uncertainty inference(continuous)04 Uncertainty inference(continuous)
04 Uncertainty inference(continuous)
 
Feasibility of moment tensor inversion for a single-well microseismic data us...
Feasibility of moment tensor inversion for a single-well microseismic data us...Feasibility of moment tensor inversion for a single-well microseismic data us...
Feasibility of moment tensor inversion for a single-well microseismic data us...
 
Data Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methodsData Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methods
 
Jovan-DPG-Poster
Jovan-DPG-PosterJovan-DPG-Poster
Jovan-DPG-Poster
 
Csss2010 20100803-kanevski-lecture2
Csss2010 20100803-kanevski-lecture2Csss2010 20100803-kanevski-lecture2
Csss2010 20100803-kanevski-lecture2
 
2016Nonlinear inversion of electrical resistivity imaging.pdf
2016Nonlinear inversion of electrical resistivity imaging.pdf2016Nonlinear inversion of electrical resistivity imaging.pdf
2016Nonlinear inversion of electrical resistivity imaging.pdf
 
Artificial neural networks
Artificial neural networksArtificial neural networks
Artificial neural networks
 

More from IEEE International Conference on Intelligent Information Hiding and Multimedia Signal Processing

More from IEEE International Conference on Intelligent Information Hiding and Multimedia Signal Processing (11)

Computer Vision in the Age of IoT
Computer Vision in the Age of IoTComputer Vision in the Age of IoT
Computer Vision in the Age of IoT
 
2014/07/17 Parallelize computer vision by GPGPU computing
2014/07/17 Parallelize computer vision by GPGPU computing2014/07/17 Parallelize computer vision by GPGPU computing
2014/07/17 Parallelize computer vision by GPGPU computing
 
Towards Embedded Computer Vision - New @ 2013
Towards Embedded Computer Vision - New @ 2013Towards Embedded Computer Vision - New @ 2013
Towards Embedded Computer Vision - New @ 2013
 
老師與教學助理的互動經驗分享 1010217
老師與教學助理的互動經驗分享 1010217老師與教學助理的互動經驗分享 1010217
老師與教學助理的互動經驗分享 1010217
 
Parallel Vision by GPGPU/CUDA
Parallel Vision by GPGPU/CUDAParallel Vision by GPGPU/CUDA
Parallel Vision by GPGPU/CUDA
 
Markov Random Field (MRF)
Markov Random Field (MRF)Markov Random Field (MRF)
Markov Random Field (MRF)
 
01 Probability review
01 Probability review01 Probability review
01 Probability review
 
Monocular Human Pose Estimation with Bayesian Networks
Monocular Human Pose Estimation with Bayesian NetworksMonocular Human Pose Estimation with Bayesian Networks
Monocular Human Pose Estimation with Bayesian Networks
 
Towards Embedded Computer Vision邁向嵌入式電腦視覺
Towards Embedded Computer Vision邁向嵌入式電腦視覺Towards Embedded Computer Vision邁向嵌入式電腦視覺
Towards Embedded Computer Vision邁向嵌入式電腦視覺
 
Intelligent Video Surveillance with Cloud Computing
Intelligent Video Surveillance with Cloud ComputingIntelligent Video Surveillance with Cloud Computing
Intelligent Video Surveillance with Cloud Computing
 
Intelligent Video Surveillance and Sousveillance
Intelligent Video Surveillance and SousveillanceIntelligent Video Surveillance and Sousveillance
Intelligent Video Surveillance and Sousveillance
 

Recently uploaded

Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaVirag Sontakke
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxJiesonDelaCerna
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitolTechU
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
  • 5. Structure of Related Lecture Notes
[Diagram: the burglary network (B, E → A → J, M) with CPTs P(B), P(E), P(A|B,E), P(J|A), P(M|A), linking the lecture units: Representation (Unit 5: BN; Unit 9: Hybrid BN; Units 10~15: Naïve Bayes, MRF, HMM, DBN, Kalman filter), Structure/Parameter Learning from data (Units 16~: MLE, EM), and Query Inference (Unit 6: Exact inference; Unit 7: Approximate inference; Unit 8: Temporal inference).]
  • 6. Contents
– 1. Sampling (p. 11)
– 2. Random Number Generator (p. 20)
– 3. Stochastic Simulation (p. 70)
– 4. Markov Chain Monte Carlo (p. 113)
– 5. Loopy Belief Propagation (p. 145)
– 6. Variational Methods (p. 146)
– 7. Implementation (p. 147)
– 8. Summary (p. 148)
– 9. References (p. 151)
  • 7. 4 Steps of Inference
– Step 1 (Bayes theorem): $P(X \mid E{=}e) = \frac{P(X, E{=}e)}{P(E{=}e)} = \alpha\, P(X, E{=}e)$
– Step 2 (Marginalization): $= \alpha \sum_{h \in H} P(X, E{=}e, H{=}h)$
– Step 3 (Conditional independence): $= \alpha \sum_{h \in H} \prod_{i=1 \sim n} P(X_i \mid Pa(X_i))$
– Step 4 (Product-sum computation, i.e. enumeration): exact inference or approximate inference
  • 8. Five Types of Queries in Inference
– For a probabilistic graphical model G, given a set of evidence E=e, query the PGM with:
– P(e): likelihood query
– arg max P(e): maximum likelihood query
– P(X|e): posterior belief query
– $\arg\max_x P(X{=}x \mid e)$: maximum a posteriori (MAP) query (single query variable)
– $\arg\max_{x_1 \ldots x_k} P(X_1{=}x_1, \ldots, X_k{=}x_k \mid e)$: most probable explanation (MPE) query
  • 9. Approximate Inference vs. Exact Inference
– Exact inference: P(X|E) = 0.71828. Gets the exact probability value, using the inference steps derived from probabilistic formulas; needs exponential time complexity.
– Approximate inference: P(X|E) ≈ 0.71. Gets an approximate probability value, using sampling theory; needs only polynomial time complexity, i.e. fast computation.
  • 10. Why Approximate Inference
– Large treewidth: large, highly connected graphical models; treewidth may be large (>40) in sparse networks.
– In many applications an approximation is sufficient. Example: P(X=x|e) = 0.3183098861; maybe P(X=x|e) ≈ 0.3 is a good enough approximation, e.g. when we take action only if P(X=x|e) > 0.5.
  • 11. 1. Sampling
– 1.1 What Is Sampling
– 1.2 Sampling for Inference
  • 12. Basic Idea of Sampling
– Why sampling: estimate some values by random number generation.
– 1. Sampling → random number generation: draw N samples from a known distribution P, i.e. generate N random numbers from a known distribution S.
– 2. Estimation: compute an approximate probability $\hat{P}$, which approximates the real posterior probability P(X|E).
  • 13. 1.1 What Is Sampling
– A very simple example with a random variable: coin toss. Tossing the coin gives head or tail, so Coin is a Boolean R.V.: coin = head or tail.
– If it is an unbiased coin, head and tail have equal probability: a prior probability distribution P(Coin) = <0.5, 0.5>, a uniform distribution.
– Assume we have a coin but we do not know whether it is unbiased.
  • 14. Sampling of Coin Toss
– Sampling in this example = flipping the coin many times N, e.g. N=1000; one flip → one sample.
– Ideally: 500 heads, 500 tails, so P(head) = 500/1000 = 0.5 and P(tail) = 500/1000 = 0.5.
– Practically: 501 heads, 499 tails, so P(head) = 501/1000 = 0.501 and P(tail) = 499/1000 = 0.499.
– After the sampling we can estimate the probability distribution and check whether the coin is biased.
  • 15. Sampling & Estimation (Math)
– For a Boolean random variable X, P(X) is the prior distribution = <P(x), P(¬x)>.
– Use a sampling algorithm to generate N samples; let N(x) be the number of samples where x is true, and N(¬x) the number where x is false. Then
$\frac{N(x)}{N} = \hat{P}(x), \qquad \frac{N(\neg x)}{N} = \hat{P}(\neg x)$
$\lim_{N \to \infty} \frac{N(x)}{N} = P(x), \qquad \lim_{N \to \infty} \frac{N(\neg x)}{N} = P(\neg x)$
  • 16. 1.2 Sampling for Inference
– Given a Bayesian network G over (X1, …, Xn), we get a joint probability distribution P(X1, …, Xn) = Π P(Xi|Pa(Xi)).
– For a query P(X|E=e): P(X|e) = α Σ Π P(Xi | Parent(Xi)). It is hard to compute, needing time exponential in the number of Xi; we will try to use sampling to compute it.
  • 17. Compute P(X|e) by Sampling
– Sampling (explained in Sections 2, 3, 4): generate N samples of P(X1, …, Xn) = Π P(Xi|Pa(Xi)).
– Estimation: use the N samples to estimate P(X,e) ≈ N(X,e)/N and P(e) ≈ N(e)/N, then estimate P(X|e) by P(X,e) / P(e).
  • 18. What Is a Sampling Algorithm
– An algorithm to generate samples from a known probability distribution P and to estimate the approximate probability $\hat{P}$.
  • 19. Various Sampling Algorithms
– Stochastic simulation (Section 3): direct sampling; rejection sampling (reject samples disagreeing with evidence); likelihood weighting (use evidence to weight samples).
– Markov chain Monte Carlo (MCMC, Section 4): sample from a stochastic process whose stationary distribution is the true posterior.
  • 20. 2. Random Number Generator
– Very important for sampling algorithms; introduces basic concepts related to sampling of Bayesian networks.
– Subsections: 2.1 Univariate; 2.2 Multivariate.
  • 21. RNG in Programming Languages
– Random number generator (RNG): C/C++: rand(); Java: random(); Matlab: rand().
– Why should we discuss it? These functions generate random numbers with a uniform distribution. How do we generate: Gaussian, …; multivariate, dependent random variables; non-closed-form distributions?
  • 22. Generate a Random Number (1/2)
– Example in C: int i = rand(); returns 0 ~ RAND_MAX (32767); it generates integers.
– Generate a random number between 1 and n (n < 32767): int i = 1 + (rand() % n); here (rand() % n) returns a number between 0 and n-1, and adding 1 makes the random number lie between 1 and n. It generates integers, not real numbers.
  • 23. Generate a Random Number (2/2)
– Ex: integer between 1 and 6: 1 + (rand() % 6)
– Ex: real number between 0 and 1: double x = (double)rand() / RAND_MAX; (the cast is needed: rand() / RAND_MAX in integer arithmetic yields only 0 or 1)
– Exercise: real number between 10 and 20.
  • 24. Generate Many Random Numbers Repeatedly
– Using a loop for repeated generation:
    for (int i=0; i<1000; i++) { rand(); }
    int i, j[1000];
    for (i=0; i<1000; i++) { j[i] = 1 + rand() % 6; }
– rand() generates numbers uniformly: a uniform distribution.
  • 25. Why Generate Random Numbers
– Simulate random behavior; make random decisions; estimate some values.
  • 26. Random Behavior/Decision (1/2)
– Flip a coin for a decision (Boolean); fair: each face has equal probability.
    int coin_face;
    if (rand() > RAND_MAX/2) coin_face = 1; else coin_face = 0;
– Or simply:
    int coin_face;
    coin_face = rand() % 2;
  • 27. Random Behavior/Decision (2/2)
– Random decision among multiple choices: a discrete random variable with a uniform distribution. Ex: roll a die; fair: each face has equal probability.
    int die_face; // random variable
    die_face = 1 + rand() % 6; // faces 1..6, as on p. 23
  • 28. Estimation
– If we can simulate a random behavior, we can estimate some values: first repeat the random behavior, then estimate the value.
  • 29. Example: The Coin Toss
– Flip the coin 1000 times to estimate the fairness of the coin:
    int coin_face;          // random variable
    int frequency[2] = {0, 0};
    for (i=0; i<1000; i++) {
        coin_face = rand() % 2;
        frequency[coin_face]++;
    }
[Figure: frequency histogram over coin faces 0 and 1, roughly uniform.]
  • 30. Example: Area of Circle (Estimation)
– Monte Carlo estimate; x and y are independent random variables, and we call (x, y) a sample:
    double x, y;            // two random variables
    int N = 1000, NCircle = 0;
    double Area;
    for (i=0; i<N; i++) {
        x = (double)rand() / RAND_MAX;
        y = (double)rand() / RAND_MAX;
        if ((x*x + y*y) <= 1) NCircle = NCircle + 1;
    }
    Area = 4.0 * NCircle / N;   // floating-point division, not integer
  • 31. Multiple Dependent Random Variables
– Markov chain: n random variables X1 → … → Xk → … → Xn.
– Bayesian network: 5 random variables Burglary, Earthquake → Alarm → JohnCalls, MaryCalls. The variables are dependent; what is a sample here?
  • 32. Sampling
– Sampling is randomly generating a sample for a random variable X (univariate) or a set of random variables X1, …, Xn (multivariate); Boolean, discrete, or continuous; independent or dependent.
– According to a probability distribution P(X): discrete X: histogram; continuous X: uniform, Gaussian, or any distribution (e.g. Gaussian mixture models).
  • 33. Sub-Sections for Generating a Sample
– 2.1 Univariate: uniform, Gaussian, Gaussian mixture.
– 2.2 Multivariate: uniform; Gaussian (independent, dependent); any distribution (Gaussian mixture, independent or dependent; Bayesian network).
  • 34. 2.1 Univariate
– For a random variable X (Boolean, discrete, continuous, hybrid) where we know P(X) is uniform, Gaussian, or a Gaussian mixture: generate a sample of X according to P(X).
  • 35. Uniform Generator
– Every programming language provides a rand()/random() function to generate a uniformly distributed number: an integer within [0, MAX).
– Sampling a Boolean uniform number: rand() % 2
– Sampling a discrete uniform number within [0, d): rand() % d
– Sampling a continuous uniform number within [0, 1): (double)rand() / RAND_MAX; within [a, b): a + ((double)rand() / RAND_MAX) * (b - a)
  • 36. Example: Uniform Generator
    x=rand(1,10000);
    h=hist(x,20);
    bar(h);
[Figure: bar chart of the 20 histogram bins, all near 500 counts.]
  • 37. Gaussian Generator (1/2)
– Sampling a Gaussian can be obtained from the uniform distribution.
– There are functions in C/Java/Matlab to randomly generate a univariate Gaussian real number with (μ, σ) = (0, 1): C: Numerical Recipes in C; Java: Random.nextGaussian(); Matlab: randn(). Suppose this function is called Gaussian().
  • 38. Gaussian Generator (2/2)
– Sampling a continuous Gaussian number with (μ, σ): (Gaussian() * σ) + μ
– Sampling a discrete Gaussian number with (μ, σ)?
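Since Gaussian() is only assumed above, here is a minimal C sketch of one standard way to implement it, the Box-Muller transform, which turns two uniform draws into one N(0,1) draw (the function name Gaussian() follows the slides; everything else is our assumption):

    #include <stdlib.h>
    #include <math.h>

    /* Box-Muller: return one sample from N(0,1), using rand()
       as the underlying uniform generator. */
    double Gaussian(void)
    {
        const double PI = 3.14159265358979;
        double u1, u2;
        do {
            u1 = (double)rand() / RAND_MAX;  /* uniform in [0,1] */
        } while (u1 == 0.0);                 /* avoid log(0) */
        u2 = (double)rand() / RAND_MAX;
        return sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
    }

Then mu + Gaussian() * sigma gives a sample from N(mu, sigma^2), as used on the next slide.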
  • 39. Example: Gaussian Generator (1/2)
– Pseudo code, assuming Gaussian() is a pseudo function that generates standard Gaussian numbers:
    double x[10000];
    for (i=0; i<10000; i++) x[i] = Gaussian();
    for (i=0; i<10000; i++) x[i] = mu + Gaussian() * sigma;
  • 40. Example: Gaussian Generator (2/2)
– Matlab:
    x=randn(1,10000);
    h=hist(x,20);
    bar(h);
– Java:
    Random r = new Random();
    double[] x = new double[10000];
    for (i=0; i<10000; i++) x[i] = r.nextGaussian();
[Figure: bell-shaped histogram of the 10000 samples.]
  • 41. Gaussian Mixture Generator (1/2)
– Random variable X with a Gaussian: P(X) = N(X; μ, σ).
– Random variable Y with a Gaussian mixture: $P(Y) = \sum_m \pi_m N(Y; \mu_m, \sigma_m)$.
  • 42. Gaussian Mixture Generator (2/2)
– Generate N samples of X:
    for (i=0; i<N; i++) x[i] = (Gaussian() * sigma) + mu;
– Generate N samples of Y from a mixture of M Gaussians, where each Gaussian m has weight pi_m and parameters mu_m, sigma_m:
    for (m=0; m<M; m++)
        for (i=0; i<N*pi_m; i++)
            y[m][i] = (Gaussian() * sigma_m) + mu_m;
  • 43. Example: Gaussian Mixture Generator
    N=10000; pi1=0.8; pi2=0.2;
    mu1=0; mu2=15; sigma1=3; sigma2=5;
    x1 = mu1 + randn(1,N*pi1) * sigma1;
    x2 = mu2 + randn(1,N*pi2) * sigma2;
    x = [x1, x2];
    h=hist(x,50);
    bar(h);
[Figure: bimodal histogram with a tall mode near 0 and a smaller mode near 15.]
  • 44. 2.2 Multivariate
– For random variables X1, …, Xn (Boolean, discrete, continuous, hybrid) where we know P(X1, …, Xn) is uniform, Gaussian, Gaussian mixture, or any distribution: generate a sample (X1, …, Xn) according to P(X1, …, Xn); independent or dependent.
  • 45. Multivariate Boolean Uniform Generator
– Boolean random variables X1, …, Xn:
    int X[n]; // a sample
    for (i=0; i<n; i++) X[i] = rand() % 2;
  • 46. Multivariate Discrete Uniform Generator
– Discrete random variables X1, …, Xn, each with d discrete values in [0, d-1]; each Xi is uniformly distributed and X1, …, Xn must be independent:
    int X[n]; // a sample
    for (i=0; i<n; i++) X[i] = rand() % d;
  • 47. Multivariate Gaussian Generator - Independent (1/2)
– For n random variables X = (X1, …, Xn), Gaussian N(X; μ, Σ) with mean vector μ and covariance matrix Σ = [σij].
– X1, …, Xn independent means σij = 0 for i ≠ j, so generating a sample of X = generating each Xi independently.
  • 48. Multivariate Gaussian Generator - Independent (2/2)
– Generate a sample of X = (X1, …, Xn) with μi = 0, σii = 1, σij = 0 for i ≠ j:
    double X[n]; // a sample
    for (i=0; i<n; i++) X[i] = Gaussian();
– Generate a sample with μi ≠ 0, σii ≠ 1, σij = 0 for i ≠ j (σii is a variance, so scale by its square root):
    double X[n]; // a sample
    for (i=0; i<n; i++) X[i] = mu[i] + Gaussian() * sqrt(sigma[i][i]);
  • 49. Example – Matlab (1/2)
– Plot the density of a 2-D Gaussian with $\mu_X = (0,0)^T$, $\Sigma_X = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$:
    mx=[0 0]';
    Cx=[1 0; 0 1];
    x1=-3:0.1:3;
    x2=-3:0.1:3;
    for i=1:length(x1),
      for j=1:length(x2),
        f(i,j)=(1/(2*pi*det(Cx)^(1/2)))*exp((-1/2)*([x1(i) x2(j)]-mx')*inv(Cx)*([x1(i);x2(j)]-mx));
      end
    end
    mesh(x1,x2,f)
    pause;
    contour(x1,x2,f)
    pause
  • 50. Example – Matlab (2/2)
– Randomly generate 1000 samples for $\mu_X = (0,0)^T$, $\Sigma_X = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$:
    y1=randn(1,1000);
    y2=randn(1,1000);
    plot(y1,y2,'.');
  • 51. Multivariate Gaussian Generator - Dependent (1/4)
– For n random variables X = (X1, …, Xn), Gaussian N(X; μ, Σ) with mean vector μ and covariance matrix Σ = [σij]; X1, …, Xn dependent means σij ≠ 0.
– Σ is a positive definite matrix: symmetric, with all eigenvalues (pivots) > 0.
– For a general matrix A: A = LDU (L: lower triangular, U: upper triangular, D: diagonal matrix of pivots). For a symmetric matrix S: S = LDL^T. For a positive definite matrix: $\Sigma = LDL^T = (L\sqrt{D})(L\sqrt{D})^T = PP^T$. This is called the Cholesky decomposition.
  • 52. Multivariate Gaussian Generator - Dependent (2/4)
– To generate a sample of X with μ, Σ: perform the Cholesky decomposition of Σ (pivot decomposition for a positive definite matrix), Σ = PP^T; generate independent Gaussian Y = (Y1, …, Yn) with μi = 0, σi = 1; then X = PY + μ.
  • 53. Multivariate Gaussian Generator - Dependent (3/4)
– Pseudo code to generate a sample of X with μ, Σ:
    Matrix Sigma; Vector mu;
    Vector X(n), Y(n); // a sample
    Matrix P = chol(Sigma)'; // Cholesky decomposition; note Matlab's chol
                             // returns the upper-triangular R with R'R = Sigma,
                             // so transpose to get the lower factor P, PP' = Sigma
    for (i=0; i<n; i++) Y(i) = Gaussian();
    X = P*Y + mu;
  • 54. Multivariate Gaussian Generator - Dependent (4/4)
– Proof: for X = (X1, …, Xn) with μ, Σ, generate n independent, zero-mean, unit-variance normal random variables $Y = (Y_1, \ldots, Y_n)^T$, with $\mu_Y = (0, \ldots, 0)^T$ and $\Sigma_Y = I$.
– Take X = PY + μ, where Σ = PP^T. Then the covariance matrix of X is
$E[(X-\mu)(X-\mu)^T] = E[(PY)(PY)^T] = E[P Y Y^T P^T] = P\,E[YY^T]\,P^T = PP^T = \Sigma$
  • 55. Example – Matlab (1/4)
– Assume $\mu_X = (0,0)^T$, $\Sigma_X = \begin{bmatrix} 1 & 1/2 \\ 1/2 & 1 \end{bmatrix}$, so $P = \begin{bmatrix} 1 & 0 \\ 1/2 & \sqrt{3}/2 \end{bmatrix}$.
– Matlab:
    mx=[0 0]';
    Cx=[1 1/2; 1/2 1];
    P=chol(Cx)'; % transpose: chol returns the upper-triangular factor
  • 56. Example – Matlab (2/4)
– Randomly generate 1000 samples for that μ_X, Σ_X:
    mx=zeros(2,1000);
    y1=randn(1,1000);
    y2=randn(1,1000);
    y=[y1;y2];
    P=[1, 0; 1/2, sqrt(3)/2];
    x=P*y+mx;
    x1=x(1,:); x2=x(2,:);
    plot(x1,x2,'.');
    r=corrcoef(x1',x2');
  • 57. Example – Matlab (3/4)
– Assume $\mu_X = (5,5)^T$, $\Sigma_X = \begin{bmatrix} 1 & 0.9 \\ 0.9 & 1 \end{bmatrix}$, so $P = \begin{bmatrix} 1 & 0 \\ 9/10 & \sqrt{19}/10 \end{bmatrix}$.
– Matlab:
    mx=[5 5]';
    Cx=[1 9/10; 9/10 1];
    P=chol(Cx)'; % transpose: chol returns the upper-triangular factor
  • 58. Example – Matlab (4/4)
– Randomly generate 1000 samples for that μ_X, Σ_X:
    mx=5*ones(2,1000);
    y1=randn(1,1000);
    y2=randn(1,1000);
    y=[y1;y2];
    P=[1, 0; 9/10, sqrt(19)/10];
    x=P*y+mx;
    x1=x(1,:); x2=x(2,:);
    plot(x1,x2,'.');
    r=corrcoef(x1',x2');
  • 59. Multivariate Gaussian Mixture Generator
– Generate N samples of X from a mixture of M Gaussians (Matlab-like pseudo code):
    for (m=0; m<M; m++) {
        Matrix P = chol(Sigma_m)'; // Cholesky decomposition (lower factor)
        for (i=0; i<N*pi_m; i++) {
            // generate n independent normally distributed R.V.s (mu=0, sigma=1)
            y = randn(n, 1);
            // transform y into x
            x = P*y + mu_m;
        }
    }
  • 60. Example – Matlab (1/4)
– Combine the previous two Gaussians with π1 = 0.5, π2 = 0.5: $\mu_1 = (0,0)^T$, $\Sigma_1 = \begin{bmatrix} 1 & 1/2 \\ 1/2 & 1 \end{bmatrix}$; $\mu_2 = (5,5)^T$, $\Sigma_2 = \begin{bmatrix} 1 & 0.9 \\ 0.9 & 1 \end{bmatrix}$.
[Figure: scatter plot of the two equally weighted clusters, one centered at (0,0), one at (5,5).]
  • 61. Example – Matlab (2/4)
    pi1=0.5; pi2=0.5; N=2000;
    mx1=zeros(2,pi1*N);
    Cx1=[1 1/2; 1/2 1];
    P1=chol(Cx1)'; %P1=[1, 0; 1/2, sqrt(3)/2]
    y1_1=randn(1,pi1*N); y1_2=randn(1,pi1*N);
    y1=[y1_1;y1_2];
    x1=P1*y1+mx1;
    x1_1=x1(1,:); x1_2=x1(2,:);
    mx2=5*ones(2,pi2*N);
    Cx2=[1 9/10; 9/10 1];
    P2=chol(Cx2)'; %P2=[1, 0; 9/10, sqrt(19)/10]
    y2_1=randn(1,pi2*N); y2_2=randn(1,pi2*N);
    y2=[y2_1;y2_2];
    x2=P2*y2+mx2;
    x2_1=x2(1,:); x2_2=x2(2,:);
    z1=[x1_1,x2_1]; z2=[x1_2,x2_2];
    plot(z1,z2,'.');
  • 62. Example – Matlab (3/4)
– Combine the previous two Gaussians with π1 = 0.2, π2 = 0.8 (same μ1, Σ1, μ2, Σ2 as above).
[Figure: scatter plot; the cluster at (5,5) now dominates.]
  • 63. Example – Matlab (4/4)
    pi1=0.2; pi2=0.8; N=2000;
    (the rest of the code is identical to p. 61, just with these new mixture weights)
  • 64. Exercise
– Write a program to randomly generate 1000 samples of a 3-dimensional Gaussian with μ = (5, 10, -3), Σ = (2,1,3; 4,2,2; 3,1,2). (Check first that Σ is a valid covariance matrix, i.e. symmetric positive definite, before applying chol.)
  • 65. Any Distribution
– For random variables X1, …, Xn (Boolean, discrete, continuous, hybrid) where P(X1, …, Xn) has no closed-form formula:
– Independent: P(X1, …, Xn) = P(X1) … P(Xn); generate each Xi by P(Xi).
– Dependent: P(X1, …, Xn) = Π P(Xi | Parent(Xi)); generate each Xi by P(Xi | Parent(Xi)).
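When P(Xi) or P(Xi | Parent(Xi)) is given only as a table of probabilities, a common way to draw one value is roulette-wheel (inverse-CDF) sampling: walk the cumulative sum until it passes a uniform draw. A minimal C sketch (the function name sample_discrete is ours, for illustration):

    #include <stdlib.h>

    /* Draw one value in {0, ..., k-1} from a probability table p[0..k-1]. */
    int sample_discrete(const double p[], int k)
    {
        double u = (double)rand() / RAND_MAX;  /* uniform in [0,1] */
        double cum = 0.0;
        for (int i = 0; i < k; i++) {
            cum += p[i];
            if (u <= cum) return i;
        }
        return k - 1;  /* guard against floating-point round-off */
    }

For a dependent variable, the caller first selects the CPT row matching the already-sampled parent values, which is exactly what the two examples below do by hand.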
  • 66. Two Boolean R.V.s - Independent
– X1, X2 have distributions P(X1) = <0.67, 0.33> and P(X2) = <0.75, 0.25>:
    int X1, X2;
    for (i=0; i<1000; i++) {
        if (rand() > RAND_MAX/3) X1 = 1; else X1 = 0;
        if (rand() > RAND_MAX/4) X2 = 1; else X2 = 0;
    }
[Figure: bar charts of P(X1), with mass 0.67, and P(X2), with mass 0.75.]
  • 67. Two Boolean R.V.s - Dependent
– X1, X2 have distributions P(X1) = <0.67, 0.33>, P(X2|X1=T) = <0.75, 0.25>, P(X2|X1=F) = <0.8, 0.2>.
– Generate a sample (x1, x2):
    if (rand() > RAND_MAX/3) x1 = 1; else x1 = 0;
    if (x1==1) {
        if (rand() > RAND_MAX/4) x2 = 1; else x2 = 0;
    } else { // x1==0
        if (rand() > RAND_MAX/5) x2 = 1; else x2 = 0;
    }
  • 68. Markov Chain
– Markov chain: n random variables X1 → … → Xk → … → Xn.
  • 69. Bayesian Network
– Example: 5 random variables Burglary, Earthquake → Alarm → JohnCalls, MaryCalls.
  • 70. 3. Stochastic Simulation
– Also called Monte Carlo methods or sampling methods.
– Sub-sections: 3.1 Direct sampling; 3.2 Rejection sampling; 3.3 Likelihood weighting.
  • 71. 3.1 Direct Sampling
– Generate N samples randomly. For the inference P(X|E): P(X|E) = P(X∧E) / P(E).
– Get N(E) and N(X∧E) from the N samples, where N(E) = no. of samples with E and N(X∧E) = no. of samples with X and E. Then P(E) = N(E)/N and P(X∧E) = N(X∧E)/N, so P(X|E) = N(X∧E) / N(E).
  • 72. Example (1/4)
– For the sprinkler network: estimate P(w|r) by direct sampling; 4 random variables; a sample = (c, s, r, w).
  • 73. Example (2/4)
– Generate 1000 samples:
    Cloudy  Sprinkler  Rain  WetGrass
      T        T        T       F
      F        T        T       F
      F        F        T       T
      T        T        T       F
      T        T        T       F
     ...      ...      ...     ...
      F        T        T       F
  • 74. Example (3/4)
– P(r | ¬w) = P(r, ¬w) / P(¬w). Let N¬w = no. of samples with WetGrass=False, and Nr∧¬w = no. of samples with Rain=True and WetGrass=False; the estimate is Nr∧¬w / N¬w.
[Table: the same 1000 samples (C, S, R, W).]
  • 75. Example (4/4)
– P(R | ¬w) = P(R, ¬w) / P(¬w) = < P(r ∧ ¬w)/P(¬w), P(¬r ∧ ¬w)/P(¬w) >.
[Table: the same 1000 samples (C, S, R, W).]
  • 76. How to Generate a Sample for the Bayesian Network? (1/3)
– The sprinkler Bayesian network: a sample is an atomic event (cloudy, sprinkler, rain, wetgrass) = (T, F, T, T).
– Assume a sampling order: [Cloudy, Sprinkler, Rain, WetGrass].
  • 77. How to Generate a Sample for the Bayesian Network? (2/3)
– Incorrect implementation (ignores the CPTs and samples every variable as a fair coin):
    int C, S, R, W;
    for (i=0; i<1000; i++) {
        if (rand() > RAND_MAX/2) C = T; else C = F;
        if (rand() > RAND_MAX/2) S = T; else S = F;
        if (rand() > RAND_MAX/2) R = T; else R = F;
        if (rand() > RAND_MAX/2) W = T; else W = F;
    }
  • 78. How to Generate a Sample for the Bayesian Network? (3/3)
– Correct approach: sample each variable from its CPT given the already-sampled parents:
    int C, S, R, W;
    for (i=0; i<1000; i++) {
        if (rand() > RAND_MAX/2) C = T; else C = F;
        if (C==T) {
            if (rand() > RAND_MAX*0.9) S = T; else S = F;  // P(s|c)=0.1
        } else { // C==F
            if (rand() > RAND_MAX/2) S = T; else S = F;    // P(s|~c)=0.5
        }
        ...
    }
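Filling in the elided part, a self-contained C sketch of prior sampling for the whole sprinkler network, plus the direct-sampling estimate of P(r|¬w) from pp. 73-75. CPT values not stated on the slides (P(r|¬c)=0.2, P(w|s,¬r)=0.9, P(w|¬s,¬r)=0.0) are assumptions taken from the standard sprinkler network:

    #include <stdlib.h>

    /* One uniform draw in [0,1]. */
    static double U(void) { return (double)rand() / RAND_MAX; }

    /* Sample a Boolean that is true with probability p. */
    static int flip(double p) { return U() < p; }

    /* Prior-Sample for the sprinkler network, topological order C,S,R,W. */
    static void prior_sample(int *C, int *S, int *R, int *W)
    {
        *C = flip(0.5);
        *S = flip(*C ? 0.10 : 0.50);         /* P(s|c)=0.1,  P(s|~c)=0.5 */
        *R = flip(*C ? 0.80 : 0.20);         /* P(r|c)=0.8,  P(r|~c)=0.2 */
        if (*S && *R)      *W = flip(0.99);  /* P(w|s,r) = 0.99          */
        else if (*S || *R) *W = flip(0.90);  /* P(w|s,~r) = P(w|~s,r) = 0.9 */
        else               *W = flip(0.00);  /* P(w|~s,~r) = 0.0         */
    }

    /* Direct-sampling estimate of P(r | ~w) = N(r,~w) / N(~w). */
    double estimate_r_given_not_w(int N)
    {
        int C, S, R, W, n_nw = 0, n_r_nw = 0;
        for (int i = 0; i < N; i++) {
            prior_sample(&C, &S, &R, &W);
            if (!W) { n_nw++; if (R) n_r_nw++; }
        }
        return (double)n_r_nw / n_nw;
    }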
  • 79. An Example Generating One Sample (1/8)
– The sampling algorithm:
– 1. Sample from P(Cloudy) = <0.5, 0.5>; suppose it returns true.
– 2. Sample from P(Sprinkler|Cloudy=true) = <0.1, 0.9>; suppose it returns false.
– 3. Sample from P(Rain|Cloudy=true) = <0.8, 0.2>; suppose it returns true.
– 4. Sample from P(WetGrass|Sprinkler=false, Rain=true) = <0.9, 0.1>; suppose it returns true.
  • 80. An Example Generating One Sample (2/8)
– Start with an empty sample row (C, S, R, W).
  • 81. An Example Generating One Sample (3/8)
– Random sampling of Cloudy returns Cloudy=true; sample so far: (c).
  • 82. An Example Generating One Sample (4/8)
– Given Cloudy=true, next randomly sample 1. Sprinkler and 2. Rain.
  • 83. An Example Generating One Sample (5/8)
– Random sampling of Sprinkler given Cloudy=true returns Sprinkler=false; sample so far: (c, ¬s).
  • 84. An Example Generating One Sample (6/8)
– Random sampling of Rain given Cloudy=true returns Rain=true; sample so far: (c, ¬s, r).
  • 85. An Example Generating One Sample (7/8)
– Random sampling of WetGrass given Rain=true, Sprinkler=false.
  • 86. An Example Generating One Sample (8/8)
– Returns WetGrass=true; the completed sample: (c, ¬s, r, w).
  • 87. The Algorithm (1/2)
– To generate one sample. [Figure: Prior-Sample pseudocode, not captured in this transcript.]
  • 88. The Algorithm (2/2)
– In the previous example we got a sample [true, false, true, true] of the Bayesian network using Prior-Sample.
– The sampling of a Bayesian network: repeat the sampling N times to get N samples; we can then use the N samples to compute any query probability in the Bayesian network.
  • 89. How It Works (1/2)
– Why can any probability be answered from the sampling? The N samples actually form a full joint distribution table (FJD):
    Samples (C, S, R, W): (T,T,T,F), (F,T,T,F), (F,F,T,T), (T,T,F,F), …, (F,T,T,F)
    FJD: P(T,T,T,F) = 0.02, P(F,T,T,F) = 0.13, P(F,F,T,T) = 0.04, P(T,T,F,F) = 0.15, …
  • 90. Why It Works (2/2)
– A sample is an atomic event (x1, …, xn), and P(x1, …, xn) ≈ N(x1, …, xn) / N.
– Therefore an FJD is generated from the N samples. Note: N < 2^n.
  • 91. Exercise: Direct Sampling
– Query: what is the probability that a student studied, given that they pass the exam?
– Network: smart → prepared ← study; smart → pass ← prepared, fair.
– p(smart) = .8, p(study) = .6, p(fair) = .9
– p(prep | smart, study):
                 smart   ¬smart
      study       .9       .7
      ¬study      .5       .1
– p(pass | smart, prep, fair):
                 smart: prep  ¬prep    ¬smart: prep  ¬prep
      fair              .9     .7              .7     .2
      ¬fair             .1     .1              .1     .1
  • 92. Problems of Direct Sampling
– It needs to generate very many samples to obtain an approximate FJD.
– For a query of conditional probability P(X|e): can we just approximate the conditional probability? Yes, the following two algorithms do this.
  • 93. 3.2 Rejection Sampling
– $\hat{P}(X \mid e)$ is estimated from the samples agreeing with e. [Figure: Rejection-Sampling pseudocode, not captured in this transcript.]
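A minimal sketch of rejection sampling for the p. 94 query P(Rain | Sprinkler=true), reusing the hypothetical prior_sample() helper from the sketch after p. 78: samples disagreeing with the evidence are simply thrown away.

    /* Rejection sampling: keep only samples consistent with S = true. */
    double reject_rain_given_s(int N)
    {
        int C, S, R, W, kept = 0, rain = 0;
        for (int i = 0; i < N; i++) {
            prior_sample(&C, &S, &R, &W);
            if (!S) continue;            /* reject: disagrees with evidence */
            kept++;
            if (R) rain++;
        }
        return (double)rain / kept;      /* ~ P(rain | s); needs kept > 0 */
    }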
  • 94. An Example
– Estimate P(Rain|Sprinkler=true) using 100 samples: 27 samples have Sprinkler=true; of these, 8 have Rain=true and 19 have Rain=false.
– P(Rain|Sprinkler=true) = Normalize(<8, 19>) = <0.296, 0.704>.
– Similar to a basic real-world empirical estimation procedure.
  • 95. Analysis of Rejection Sampling
– $\hat{P}(X \mid e) = \frac{N(X, e)}{N(e)} \approx \frac{P(X, e)}{P(e)} = P(X \mid e)$
– Hence rejection sampling returns consistent posterior estimates.
– Problem: expensive if P(e) is small, and P(e) drops off exponentially with the number of evidence variables!
  • 96. 3.3 Likelihood Weighting
– Avoids the inefficiency of rejection sampling by generating only events consistent with the evidence variables e.
– Idea: fix the evidence variables and randomly sample only the hidden variables to generate a sample event; weight each sample event by the likelihood it accords the evidence. Events have different weights.
  • 97. An Example (1/9)
– Query P(Rain|sprinkler, wetgrass).
  • 98. An Example (2/9)
– 1. Set the weight w = 1.0.
– 2. Sample from P(Cloudy) = <0.5, 0.5>; suppose it returns true.
– 3. The evidence Sprinkler=true, so set w = w × P(sprinkler|cloudy) = 1 × 0.1 = 0.1.
– 4. Sample from P(Rain|cloudy) = <0.8, 0.2>; suppose it returns true.
– 5. The evidence WetGrass=true, so set w = w × P(wetgrass|sprinkler, rain) = 0.1 × 0.99 = 0.099.
– Result: a sample event (true, true, true, true) with weight 0.099.
  • 99.-105. An Example (3/9)-(9/9)
– [Figures: the sprinkler network at each step of the p. 98 walk-through; the weight evolves as w = 1.0, then w = 1.0 × 0.1, then w = 1.0 × 0.1 × 0.99 = 0.099.]
  • 106. The Algorithm (1/2)
– The example generates a sample event (true, true, true, true) for the query P(Rain|sprinkler, wetgrass).
– Repeat the sampling N times to get N sample events, each with a likelihood weight w. Let $w_1 = \sum_{rain=true} w$ and $w_2 = \sum_{rain=false} w$; then P(Rain|sprinkler, wetgrass) = < w1/(w1+w2), w2/(w1+w2) >.
  • 107. The Algorithm (2/2)
– [Figure: Likelihood-Weighting and Weighted-Sample pseudocode, not captured in this transcript.]
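A minimal C sketch of likelihood weighting for the running query P(Rain | sprinkler, wetgrass), matching the p. 98 walk-through. It reuses the hypothetical flip() helper and the CPT values from the earlier prior-sampling sketch (some of those values are our assumptions):

    /* Likelihood weighting: evidence S = W = true is clamped; only C and R
       are sampled, and each event is weighted by P(evidence | parents). */
    double lw_rain_given_s_w(int N)
    {
        double w_rain = 0.0, w_not = 0.0;
        for (int i = 0; i < N; i++) {
            double w = 1.0;
            int C = flip(0.5);              /* sample Cloudy from P(C)      */
            w *= C ? 0.10 : 0.50;           /* evidence S=true: P(s | C)    */
            int R = flip(C ? 0.80 : 0.20);  /* sample Rain from P(R | C)    */
            w *= R ? 0.99 : 0.90;           /* evidence W=true: P(w | s, R) */
            if (R) w_rain += w; else w_not += w;
        }
        return w_rain / (w_rain + w_not);   /* w1 / (w1 + w2), as on p. 106 */
    }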
  • 108. Exercise: Likelihood Weighting
– Same network and CPTs as the direct-sampling exercise on p. 91. Query: what is the probability that a student studied, given that they pass the exam?
  • 109. Analysis (1/3)
– Why does the algorithm work for P(X|E=e)? Let the sampling probability for Weighted-Sample be S_WS. The evidence variables E are fixed at e; all the other variables are Z = {X} ∪ Y. The algorithm samples each variable in Z given its parent values:
$S_{WS}(z, e) = \prod_{i=1}^{l} P(z_i \mid parents(Z_i))$
  • 110. Analysis (2/3)
– The likelihood weight w for a given sample (z, e) = (x, y, e) is
$w(z, e) = \prod_{i=1}^{m} P(e_i \mid parents(E_i))$
– The weighted probability of a sample (z, e) = (x, y, e) is
$S_{WS}(z, e)\, w(z, e) = \prod_{i=1}^{l} P(z_i \mid parents(Z_i)) \prod_{i=1}^{m} P(e_i \mid parents(E_i)) = P(x, y, e)$
since $P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid parents(X_i))$.
  • 111. Analysis (3/3)
$\hat{P}(x \mid e) = \alpha \sum_y N_{WS}(x, y, e)\, w(x, y, e) \approx \alpha' \sum_y S_{WS}(x, y, e)\, w(x, y, e) = \alpha' \sum_y P(x, y, e) = \alpha' P(x, e) = P(x \mid e)$
– So the algorithm works.
  • 112. Discussions
– Likelihood weighting is efficient because it uses all the samples generated.
– However, its performance degrades as the number of evidence variables increases: most samples will have very low weights, and the weighted estimate will be dominated by the tiny fraction of samples that accord more than an infinitesimal likelihood to the evidence.
  • 113. 4. Inference by MCMC
– Key idea: treat the sampling process as a Markov chain (the next sample depends on the previous one); it can approximate any posterior distribution.
– "State" of the network = current assignment to all variables. Generate the next state by sampling one variable given its Markov blanket; sample each variable in turn, keeping the evidence fixed.
  • 114. The Markov Chain
– With Sprinkler=true and WetGrass=true, there are four states. [Figure: the four (Cloudy, Rain) states and the transitions between them.]
  • 115. Markov Blanket Sampling
– The Markov blanket of Cloudy is {Sprinkler, Rain}; the Markov blanket of Rain is {Cloudy, Sprinkler, WetGrass}.
– The probability given the Markov blanket is calculated, up to normalization over x'_i, as
$P(x'_i \mid MB(X_i)) = \alpha\, P(x'_i \mid Parents(X_i)) \prod_{Z_j \in Children(X_i)} P(z_j \mid Parents(Z_j))$
  • 116. An Example (1/2)
– Estimate P(Rain|sprinkler, wetgrass): loop for N times, sampling Cloudy or Rain given its Markov blanket; count the number of times Rain=true and Rain=false in the samples.
  • 117. An Example (2/2)
– E.g., visit 100 states: 31 have Rain=true, 69 have Rain=false.
– P(Rain|sprinkler, wetgrass) = Normalize(<31, 69>) = <0.31, 0.69>.
  • 118. The Algorithm
– [Figure: MCMC-Ask (Gibbs sampling) pseudocode, not captured in this transcript.]
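A minimal C sketch of the MCMC loop for P(Rain | sprinkler, wetgrass): the state is (C, R) with S = W = true clamped, and each step resamples one variable from its Markov-blanket distribution as on p. 115. flip() and the CPT values are as in the earlier sketches, some of them our assumptions:

    /* Gibbs sampling: count how often Rain = true across visited states. */
    double gibbs_rain_given_s_w(int N)
    {
        int C = flip(0.5), R = flip(0.5);  /* arbitrary initial state */
        int rain = 0;
        for (int i = 0; i < N; i++) {
            /* Resample C from P(C | MB) proportional to P(C) P(s|C) P(R|C). */
            double pc  = 0.5 * 0.10 * (R ? 0.80 : 0.20);
            double pnc = 0.5 * 0.50 * (R ? 0.20 : 0.80);
            C = flip(pc / (pc + pnc));
            /* Resample R from P(R | MB) proportional to P(R|C) P(w|s,R). */
            double pr  = (C ? 0.80 : 0.20) * 0.99;
            double pnr = (C ? 0.20 : 0.80) * 0.90;
            R = flip(pr / (pr + pnr));
            if (R) rain++;
        }
        return (double)rain / N;  /* ~ P(rain | s, w), cf. <0.31, 0.69> on p. 117 */
    }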
  • 119. Why It Works
– Skipped; details on pp. 517-518 of the AIMA 2e textbook.
  • 120. Sub-Sections
– 4.1 Markov chain theory
– 4.2 Two MCMC sampling algorithms
  • 121. 4.1 Markov Chain Theory
– Suppose X1, X2, … take some set of values; w.l.o.g. these values are 1, 2, ….
– A Markov chain is a process that corresponds to the network X1 → X2 → X3 → … → Xn.
– To quantify the chain we need to specify the initial probability P(X1) and the transition probability P(Xt+1|Xt). A Markov chain has stationary transition probability: P(Xt+1|Xt) is the same for all times t.
  • 122. Irreducible Chains
– A state j is accessible from state i if there is an n such that P(Xn = j | X1 = i) > 0: there is a positive probability of reaching j from i after some number of steps.
– A chain is irreducible if every state is accessible from every state.
  • 123. Ergodic Chains
– A state i is positively recurrent if there is a finite expected time to get back to state i after being in state i. If X has a finite number of states, it suffices that i is accessible from itself.
– A chain is ergodic if it is irreducible and every state is positively recurrent.
  • 124. (A)periodic Chains
– A state i is periodic if there is an integer d > 1 such that P(Xn = i | X1 = i) = 0 whenever n is not divisible by d. Intuition: state i may occur only every d steps.
– A chain is aperiodic if it contains no periodic state.
  • 125. Stationary Probabilities
– Thm: if a chain is ergodic and aperiodic, then the limit $\lim_{n \to \infty} P(X_n \mid X_1 = i)$ exists and does not depend on i.
– Moreover, let $P^*(X = j) = \lim_{n \to \infty} P(X_n = j \mid X_1 = i)$; then P*(X) is the unique probability satisfying
$P^*(X = j) = \sum_i P(X_{t+1} = j \mid X_t = i)\, P^*(X = i)$
  • 126. Stationary Probabilities
– The probability P*(X) is the stationary probability of the process: regardless of the starting point, the process will converge to this probability. The rate of convergence depends on properties of the transition probability.
  • 127. Sampling from the Stationary Probability
– This theory suggests how to sample from the stationary probability:
– Set X1 = i, for some random/arbitrary i.
– For t = 1, 2, …, n: sample a value xt+1 for Xt+1 from P(Xt+1|Xt = xt).
– Return xn. If n is large enough, this is a sample from P*(X).
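The recipe above in a minimal C sketch for a finite-state chain, where T[i][j] = P(X_{t+1}=j | X_t=i): after enough steps the returned state is approximately distributed as P*(X). All names here are ours, for illustration:

    #define K 4   /* number of states, e.g. the four states on p. 114 */

    /* Run the chain n steps from a random initial state; return x_n. */
    int chain_sample(double T[K][K], int n)
    {
        int x = rand() % K;                   /* arbitrary X1 = i */
        for (int t = 0; t < n; t++) {
            double u = (double)rand() / RAND_MAX, cum = 0.0;
            int next = K - 1;
            for (int j = 0; j < K; j++) {     /* inverse-CDF over row x */
                cum += T[x][j];
                if (u <= cum) { next = j; break; }
            }
            x = next;
        }
        return x;
    }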
  • 128. Designing Markov Chains
– How do we construct the right chain to sample from? Ensuring aperiodicity and irreducibility is usually easy; the problem is ensuring the desired stationary probability.
  • 129. Designing Markov Chains
– Key tool: if the transition probability satisfies
$\frac{P(X_{t+1} = j \mid X_t = i)}{P(X_{t+1} = i \mid X_t = j)} = \frac{Q(X = j)}{Q(X = i)} \quad \text{whenever } P(X_{t+1} = j \mid X_t = i) > 0$
then P*(X) = Q(X).
– This gives a local criterion for checking that the chain will have the right stationary distribution.
  • 130. MCMC Methods
– We can use these results to sample from P(X1, …, Xn|e). Idea: construct an ergodic & aperiodic Markov chain such that P*(X1, …, Xn) = P(X1, …, Xn|e), then simulate the chain n steps to get a sample.
  • 131. MCMC Methods
– Notes: the Markov chain variable Y takes as values assignments to all variables that are consistent with the evidence:
$V(Y) = \{ (x_1, \ldots, x_n) \in V(X_1) \times \cdots \times V(X_n) \mid (x_1, \ldots, x_n) \text{ satisfies } e \}$
– For simplicity, we will denote such a state using the vector of variables.
  • 132. 4.2 Two MCMC Sampling Algorithms
– Gibbs sampler; Metropolis-Hastings sampler.
  • 133. Gibbs Sampler
– One of the simplest MCMC methods. Each transition changes the state of one Xi; the transition probability is defined by P itself as a stochastic procedure:
– Input: a state x1, …, xn.
– Choose i at random (uniform probability).
– Sample x'i from P(Xi | x1, …, xi-1, xi+1, …, xn, e).
– Let x'j = xj for all j ≠ i; return x'1, …, x'n.
  • 134. Correctness of Gibbs Sampler
– How do we show correctness?
  • 135. Correctness of Gibbs Sampler
– By the chain rule, P(x1, …, xi-1, xi, xi+1, …, xn | e) = P(x1, …, xi-1, xi+1, …, xn | e) P(xi | x1, …, xi-1, xi+1, …, xn, e).
– Thus we get the transition ratio
$\frac{P(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n \mid e)}{P(x_1, \ldots, x_{i-1}, x'_i, x_{i+1}, \ldots, x_n \mid e)} = \frac{P(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n, e)}{P(x'_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n, e)}$
– Since we choose i from the same distribution at each stage, this procedure satisfies the ratio criterion.
  • 136. Gibbs Sampling for Bayesian Networks
– Why is the Gibbs sampler "easy" in BNs? Recall that the Markov blanket of a variable separates it from the other variables in the network: P(Xi | X1, …, Xi-1, Xi+1, …, Xn) = P(Xi | Mb_i).
– This property allows us to use local computations to perform the sampling in each transition.
  • 137. Gibbs Sampling in Bayesian Networks
– How do we evaluate P(Xi | x1, …, xi-1, xi+1, …, xn)? Let Y1, …, Yk be the children of Xi; by the definition of Mb_i, the parents of each Yj are in Mb_i ∪ {Xi}. It is easy to show that
$P(x_i \mid Mb_i) = \frac{P(x_i \mid Pa_i) \prod_j P(y_j \mid pa_{y_j})}{\sum_{x'_i} P(x'_i \mid Pa_i) \prod_j P(y_j \mid pa_{y_j})}$
  • 138. Metropolis-Hastings
– More general than Gibbs (Gibbs is a special case of M-H).
– Proposal distribution: an arbitrary q(x'|x) that is ergodic and aperiodic (e.g., uniform).
– The transition to x' happens with probability $\alpha(x' \mid x) = \min\left(1, \frac{P(x')\, q(x \mid x')}{P(x)\, q(x' \mid x)}\right)$.
– Useful when computing P(x) exactly is infeasible, e.g. when P(x) is known only up to a normalizing constant.
– q(x'|x) = 0 implies P(x') = 0 or q(x|x') = 0.
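One Metropolis-Hastings step as a minimal C sketch, with a uniform (hence symmetric) proposal so the q terms cancel in the acceptance ratio; p(x) may be unnormalized, which is exactly why M-H helps when P(x) itself is infeasible to compute. The names are ours, for illustration:

    /* One M-H transition on states {0, ..., K-1}; p returns an
       unnormalized target probability. */
    int mh_step(int x, int K, double (*p)(int))
    {
        int xp = rand() % K;            /* propose x' from uniform q(.|x)  */
        double a = p(xp) / p(x);        /* = P(x')q(x|x') / (P(x)q(x'|x))  */
        if (a >= 1.0 || (double)rand() / RAND_MAX < a)
            return xp;                  /* accept the move */
        return x;                       /* reject: chain stays at x */
    }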
  • 139. Sampling Strategy
– How do we collect the samples? Strategy I: run the chain M times, each for N steps, with each run starting from a different state point; return the last state in each run (M chains).
  • 140. Sampling Strategy
– Strategy II: run one chain for a long time; after some "burn-in" period, sample a point every fixed number of steps (M samples from one chain).