Part 1: 2016-01-20
Part 2: 2016-02-10
Tomasz Kuśmierczyk
Session 5: Sampling & MCMC
Approximate and Scalable Inference for Complex
Probabilistic Models in Recommender Systems
Part 2: Inference Techniques
MCMC = Markov Chain Monte Carlo
MCMC ⊂ Sampling
Literature / Credits
● Szymon Jaroszewicz, lectures on “Selected Advanced Topics in Machine Learning”
● Daphne Koller, lectures on “Probabilistic Graphical Models” (https://class.coursera.org/pgm-003/lecture)
● Patrick Lam, slides: http://www.people.fas.harvard.edu/~plam/teaching/methods/convergence/convergence_print.pdf
● Bishop, Pattern Recognition and Machine Learning, ch. 11
● MacKay, David JC. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003. (http://www.inference.phy.cam.ac.uk/itprnn/book.pdf)
● R & JAGS online tutorials…
● …
Basics & motivation
Motivation: Monte Carlo for integration
http://mlg.eng.cam.ac.uk/zoubin/tut06/mcmc.pdf
Non-trivial posterior distribution (e.g., for BNs)
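A minimal Monte Carlo integration sketch in R (a hypothetical example, not from the slides): estimate E[f(X)] by averaging f over draws from p(x).
# Monte Carlo: E[f(X)] = integral f(x) p(x) dx  ~  (1/N) * sum_t f(x_t)
f <- function(x) x^2        # quantity of interest
x <- rnorm(100000)          # N draws from p(x) = N(0, 1)
mean(f(x))                  # ~ 1, the true value of E[X^2]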
Sampling vs Variational Inference (previous seminar)
http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Sampling continued ...
● The accuracy of sampling-based estimates depends only on the variance of the quantity being estimated
● It does not depend directly on the dimensionality (having many variables is not a problem)
● In some cases we are able to break the curse of dimensionality
but
● Sampling gets much more difficult in higher dimensions
● Variance often increases as the dimension grows
● The accuracy of sampling-based methods improves only with the square root of the number of samples (standard error ~ 1/√N; see the sketch below)
Jaroszewicz
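A quick sketch of the last point: the standard error of a Monte Carlo mean shrinks only as 1/√N.
se <- function(N) sd(replicate(200, mean(rnorm(N))))   # empirical std. error of the MC mean
se(100)    # ~ 0.10
se(400)    # ~ 0.05: 4x the samples buys only 2x the accuracy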
Sampling techniques - basic cases
● uniform -> pseudo-random number generator
● discrete distributions -> range matching against cumulative probabilities using a uniform draw (O(log K) time for K outcomes; see the sketch below)
● continuous -> inverse CDF (inverse transform sampling; see the sketch below)
● various ‘tricks’
● ...
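A minimal R sketch of the discrete and inverse-CDF cases (hypothetical distributions):
# Inverse transform: for Exp(rate), F^{-1}(u) = -log(1 - u) / rate
u <- runif(10000)
x <- -log(1 - u) / 2                  # draws from Exp(rate = 2)

# Discrete distribution: match a uniform draw against cumulative probabilities
probs <- c(0.1, 0.6, 0.3)
cum <- cumsum(probs)
draw <- function() findInterval(runif(1), cum) + 1   # index of the sampled outcome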
Sampling techniques (e.g., for BN posteriors)
● Ancestral Sampling (no evidence)
● Probabilistic Logic Sampling (like AS, but samples inconsistent with the evidence are discarded -> few samples generated)
● Likelihood weighting (estimates may be inaccurate + other problems)
● Importance Sampling
● (Adaptive) Rejection Sampling
● Sampling-Importance-Resampling
● Metropolis
● Metropolis-Hastings
● Gibbs Sampling
● Hamiltonian (hybrid) Sampling
● Slice sampling
● and more...
Monte Carlo without Markov Chains
A few remarks
● there is no difference between sampling from normalized and non-normalized distributions
● non-normalized distributions are easy to evaluate for BNs
● in most cases (e.g., rejection sampling) we work with non-normalized distributions
● for simplicity, p(x) is used in the notation, but nothing changes for complicated posterior distributions
● the 1D case is presented, but the methods also work in the multi-dimensional case
Rejection sampling
Jaroszewicz, Bishop
[Figure: target p(x) lying under the scaled proposal envelope c·q(x)]
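A minimal rejection-sampling sketch in R (a hypothetical 1D target with a uniform proposal, not Bishop's figure): propose x ~ q, accept with probability p(x)/(c·q(x)).
p_tilde <- function(x) exp(-x^2 / 2)      # unnormalized target on [-3, 3]
q_dens  <- 1 / 6                          # Uniform(-3, 3) proposal density
c_const <- 6                              # c * q = 1 >= max p_tilde, so the envelope holds
sample_one <- function() {
  repeat {
    x <- runif(1, -3, 3)                                        # propose from q
    if (runif(1) < p_tilde(x) / (c_const * q_dens)) return(x)   # accept w.p. p/(c q)
  }
}
draws <- replicate(10000, sample_one())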
Rejection sampling - proof
Jaroszewicz
Selection of c?
● c should be as small as possible, to keep the rejection rate low
● but p(x) <= c q(x) must hold everywhere
● Adaptive Rejection Sampling for log-concave distributions
○ log-concave = the logarithm of the density is concave
Adaptive Rejection Sampling
Jaroszewicz
Rejection Sampling problems
● part of the samples is rejected
● a tight “envelope” helps a bit
but
● in many dimensions (when there are many variables) the curse of dimensionality must be taken into account
● see Bishop’s example (for rejection sampling):
○ p(x) ~ N(0, s1)
○ q(x) ~ N(0, 1.01·s1)
○ D = 1000
○ -> acceptance rate ≈ 1/20000 (see the check below)
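A quick check of Bishop's number: the optimal c is the ratio of the proposal and target standard deviations raised to the power D, and the acceptance rate is 1/c.
# c = (1.01 * s1 / s1)^D = 1.01^1000
1.01^1000        # c ~ 20959
1 / 1.01^1000    # acceptance rate ~ 4.8e-05, i.e. roughly 1 in 20000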
Markov Chains
What is a Markov Chain?
● A triple <S, P0, T>: a (possibly infinite) set S of states, an initial distribution P0 over states, and a transition matrix P (T)
● transition matrix - a matrix of probabilities Pij (Tij) of moving from state si at time t to state sj at time t+1
● Markov property = the next state depends only on the current one
Jaroszewicz
Markov Chains - distribution over states
Jaroszewicz
Markov Chains - stationary distribution
Jaroszewicz
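A minimal sketch of the two slides above, for a hypothetical 3-state chain: iterating p_{t+1} = p_t P drives the state distribution to the stationary distribution pi = pi P.
P <- matrix(c(0.5, 0.25, 0.25,
              0.2, 0.50, 0.30,
              0.3, 0.30, 0.40), nrow = 3, byrow = TRUE)  # rows sum to 1
p <- c(1, 0, 0)                   # start deterministically in state 1
for (step in 1:100) p <- p %*% P  # iterate the distribution over states
p                                 # approximate stationary distribution
p %*% P                           # unchanged: p (numerically) solves pi = pi P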
Stationarity example
Daphne Koller
Stationarity from regularity
● If there exists k such that, for every pair of states <si, sj>, the probability of getting from si to sj in exactly k steps is > 0 (the MC is regular), then the MC converges to a unique stationary distribution
● Sufficient conditions for regularity:
○ there is a path between every pair of states
○ for every state, there is a self-transition
Stationarity of irreducible, aperiodic MC
● Irreducible, aperiodic Markov chains always converge to a unique stationary distribution
Reducibility
Jaroszewicz
Periodicity
Jaroszewicz
Why talk about Markov Chains -> MCMC
the idea is that:
● the Markov Chain “jumps” over states
● states determine (BN) samples (that are later used for Monte Carlo)
○ for example: state ⇔ sample
but we need:
● the Markov Chain to converge to a stationary distribution (to be proved every time)
● the distribution of the generated samples to equal the required distribution (the BN posterior)
Properties
● Very general purpose
● Often easy to implement
● Good theoretical guarantees as t -> ∞
but:
● Lots of tunable parameters / design choices
● Can be quite slow to converge
● Difficult to tell whether it’s working
Metropolis-Hastings derivation
on the blackboard:
1. From detailed balance to stationarity
2. Proposal distribution and acceptance probability
3. From detailed balance to conditions on the acceptance probability
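Since the derivation stays on the blackboard, here is a minimal random-walk Metropolis sketch in R (a hypothetical unnormalized 1D target; with a symmetric proposal the q terms cancel, so the acceptance ratio reduces to p̃(x')/p̃(x)):
p_tilde <- function(x) exp(-x^2 / 2) * (1 + sin(3 * x)^2)   # unnormalized target
mh <- function(n, x0 = 0, step = 1) {
  x <- numeric(n); x[1] <- x0
  for (t in 2:n) {
    prop <- rnorm(1, mean = x[t - 1], sd = step)    # symmetric random-walk proposal
    a <- p_tilde(prop) / p_tilde(x[t - 1])          # Metropolis acceptance ratio
    x[t] <- if (runif(1) < a) prop else x[t - 1]    # accept, or repeat current state
  }
  x
}
samples <- mh(10000)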
Part 2
Dawn of Statistical Renaissance
Gibbs sampling
Gibbs sampling: Algorithm
Daphne Koller
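A minimal Gibbs sketch in R for a textbook case (a bivariate normal with correlation rho; an illustration, not Koller's algorithm slide): sample each variable in turn from its full conditional.
rho <- 0.8; n <- 10000
x <- numeric(n); y <- numeric(n)
for (t in 2:n) {
  x[t] <- rnorm(1, mean = rho * y[t - 1], sd = sqrt(1 - rho^2))  # x | y
  y[t] <- rnorm(1, mean = rho * x[t],     sd = sqrt(1 - rho^2))  # y | x
}
cor(x, y)   # approaches rho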
Does it work? - often
Under certain conditions, the stationary distribution of this Markov chain is the joint
distribution of the Bayesian network:
● A probability distribution P(X) is positive, if P(X = x) > 0 for all x ∈ Dom(X).
● Theorem: If all conditional distributions in a Bayesian network are positive
(all probabilities are > 0) then a Gibbs sampler converges to the joint
distribution of the Bayesian network.
Gibbs properties
● Can handle evidence even with very low probability
● Works for all kinds of models, e.g. Markov networks, continuous variables
● Works very well in many practical cases
● overall a very powerful and useful technique
● very popular nowadays
● has become another Swiss army knife for probabilistic inference
but
● Samples are not statistically independent (the statistics gets difficult)
● Hard to give guarantees on results
Jaroszewicz
Gibbs problems - more exploratory chains needed
Jaroszewicz
Gibbs sampling: example
Bayesian PMF using MCMC
https://www.cs.toronto.edu/~amnih/papers/bpmf.pdf
Bayesian PMF using MCMC
Some useful formulas:
on the blackboard ...
Diagnostics
You never know with randomness...
Practical problems
● We only want to use samples drawn from a distribution close to p(x) - when the chain is already ‘mixing’
● At early iterations (before the chain has converged) we may be far from p(x) - we need ‘burn-in’ iterations
● Samples are correlated - we need thinning (take only every n-th sample; see the sketch below)
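A post-processing sketch (assuming `samples` is a vector of draws, e.g. from the Metropolis sketch earlier):
burnin <- 1000
thin   <- 10
kept <- samples[seq(burnin + 1, length(samples), by = thin)]  # drop burn-in, keep every 10th draw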
Diagnostics
● Visual Inspection
● Geweke Diagnostic
○ tests whether the burn-in is sufficient
● Gelman and Rubin Diagnostic
○ may detect problems with disconnected sample spaces
● Raftery and Lewis Diagnostic
○ estimates the number of iterations and burn-in needed from a pilot run
● Heidelberg and Welch Diagnostic
○ test statistic for stationarity of the distribution
http://www.people.fas.harvard.edu/~plam/teaching/methods/convergence/convergence_print.pdf
Visual inspection
http://www.people.fas.harvard.edu/~plam/teaching/methods/convergence/convergence_print.pdf
Multimodal distribution: hard to get from one mode to another. The chain is not mixing.
Autocorrelation (correlation between lagged samples)
http://www.people.fas.harvard.edu/~plam/teaching/methods/convergence/convergence_print.pdf
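In R the autocorrelation of the draws can be inspected directly (a sketch, `samples` as above):
acf(samples, lag.max = 50)                     # base-R autocorrelation plot
# coda::autocorr.plot(coda::mcmc(samples))     # the MCMC-specific equivalent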
Geweke Diagnostic
● takes two non-overlapping parts of the Markov chain
● compares the means of both parts using a difference-of-means test
● to see if the two parts of the chain come from the same distribution (null hypothesis)
● the test statistic is a standard Z-score, with the standard errors adjusted for autocorrelation
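In R this is a single call in the coda package (a sketch, `samples` as above):
library(coda)
geweke.diag(mcmc(samples))   # by default compares the first 10% with the last 50%
                             # |z| > 2 suggests the burn-in was insufficient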
Gelman and Rubin Diagnostic
1. Run m ≥ 2 chains of length 2n from overdispersed starting values.
2. Discard the first n draws in each chain.
3. Calculate the within-chain and between-chain variance.
http://www.people.fas.harvard.edu/~plam/teaching/methods/convergence/convergence_print.pdf
Gelman and Rubin Diagnostic 2
4. Calculate the estimated variance of the parameter as a weighted
sum of the within-chain and between-chain variance.
5. Calculate the potential scale reduction factor.
When R is high (perhaps greater than 1.1 or 1.2), we should run our chains longer to improve convergence to the stationary distribution.
http://www.people.fas.harvard.edu/~plam/teaching/methods/convergence/convergence_print.pdf
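Again a single call in coda (a sketch; `chain1` and `chain2` are hypothetical vectors of draws from over-dispersed starting values):
library(coda)
chains <- mcmc.list(mcmc(chain1), mcmc(chain2))
gelman.diag(chains)   # potential scale reduction factor; values near 1 are good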
Probabilistic programming
Probabilistic programming language
a programming language designed to:
● describe probabilistic models
● perform inference automatically even on complicated models
for example:
● PyMC
● BUGS / JAGS
● BayesPy
https://en.wikipedia.org/wiki/Probabilistic_programming_language
What’s inside?
● BUGS - Adaptive Rejection (AR) sampling
● JAGS - Slice Sampler (one variable at a time)
JAGS PMF-like example: model file
model { ######### START ###########
  # Uniform priors on standard deviations; JAGS dnorm takes precision = 1/sd^2
  sv ~ dunif(0,100)
  su ~ dunif(0,100)
  s ~ dunif(0,100)
  tau <- 1/(s*s)
  tauv <- 1/(sv*sv)
  tauu <- 1/(su*su)
  ...
  ...
  # item latent factors
  for (j in 1:M) {
    for (d in 1:D) {
      v[j,d] ~ dnorm(0, tauv)
    }
  }
  # user latent factors
  for (i in 1:N) {
    for (d in 1:D) {
      u[i,d] ~ dnorm(0, tauu)
    }
  }
  # ratings: sigmoid of the dot product u_i . v_j, observed with Gaussian noise
  for (j in 1:M) {
    for (i in 1:N) {
      mu[i,j] <- inprod(u[i,], v[j,])
      r3[i,j] <- 1/(1+exp(-mu[i,j]))
      r[i,j] ~ dnorm(r3[i,j], tau)
    }
  }
} ############# END ############
JAGS PMF-like example: Parameters preparation
n.chains = 1
n.iter = 5000
n.burnin = n.iter   # used as the adaptation length below
n.thin = 1 #max(1, floor((n.iter - n.burnin)/1000))
D = 10              # number of latent dimensions
lu = 0.05           # user regularization constant (sets the prior scale)
lv = 0.05           # item regularization constant
n.cluster=n.chains
model.file = "models/pmf_hypnorm3.bug"
N = dim(train)[1]   # number of users
M = dim(train)[2]   # number of items
start.s = sd(train[!is.na(train)])   # observed-ratings sd as the starting noise scale
start.su = sqrt(start.s^2/lu)
start.sv = sqrt(start.s^2/lv)
jags.data = list(N=N, M=M, D=D, r=train)
jags.params = c("u", "v", "s", "su", "sv")   # nodes to monitor
jags.inits = list(s=start.s, su=start.su, sv=start.sv,
                  u=matrix( rnorm(N*D,mean=0,sd=start.su), N, D),
                  v=matrix( rnorm(M*D,mean=0,sd=start.sv), M, D))
JAGS PMF-like example: running (sampling)
library(rjags)
model = jags.model(model.file, jags.data, n.chains=n.chains, n.adapt=n.burnin)  # adaptation serves as burn-in here
#update(model)
samples = jags.samples(model, jags.params, n.iter=n.iter, thin=n.thin)
JAGS PMF-like example: retrieving samples
# jags.samples returns arrays indexed [row, latent dim, iteration, chain]
per.chain = dim(samples$u)[3]
iterations = per.chain * dim(samples$u)[4]   # total kept draws across chains
user_sample = function(i, k) {samples$u[i, , (k-1)%%per.chain+1, ceiling(k/per.chain)]}  # D-vector u_i at draw k
item_sample = function(j, k) {samples$v[j, , (k-1)%%per.chain+1, ceiling(k/per.chain)]}  # D-vector v_j at draw k
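A hypothetical usage sketch (the helper `predict_rating` is not from the repository): a Monte Carlo estimate of one rating, averaging the model's sigmoid link over the retained draws.
predict_rating = function(i, j) {
  preds = sapply(1:iterations, function(k) {
    mu = sum(user_sample(i, k) * item_sample(j, k))  # dot product u_i . v_j
    1/(1+exp(-mu))                                   # same link as r3 in the model
  })
  mean(preds)   # posterior mean prediction
}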
Why it’s good, why it’s bad?
● fast prototyping
● less control
Results on MovieLens 100k
RMSE = 0.943 (comparable to SGD)
More on https://github.com/tkusmierczyk/pmf-jags
Thank you!
