EXACT MCMC ON BIG DATA: 
THE TIP OF AN ICEBERG 
University of Helsinki 
Gianvito Siciliano 
(2014 - Probabilistic Models for Big Data Seminar)
AGENDA 
1. MCMC intro: 
• Bayesian Inference 
• Sampling methods (Gibbs, MH) 
2. MCMC and Big Data 
• Issues 
• Approximate solutions (SGLD, SGFS, MH Test) 
3. Firefly Monte Carlo 
4. Conclusions
BAYESIAN MODELING 
• To obtain quantities of interest from the posterior we usually need to evaluate an 
integral of this form: 
$$E[f(\theta) \mid X] = \int f(\theta)\, P(\theta \mid X)\, d\theta$$
• The problem is that these integrals are usually impossible to evaluate analytically. 
• Bayes' rule lets us express the posterior over parameters in terms of the prior 
and likelihood terms: 
$$P(\theta \mid X) \propto \prod_{i=1}^{N} P(x_i \mid \theta)\, P(\theta)$$
MCMC 
• Monte Carlo: simulation to draw 
quantities of interest from the 
distribution 
• Markov Chain: stochastic process in 
which future states are independent of 
past states given the present state. 
• Hence, MCMC is a class of methods with 
which we can simulate draws that are 
slightly dependent and come 
approximately from the posterior 
distribution.
HOW TO SAMPLE? 
In Bayesian statistics, there are generally two algorithms you can use to draw 
pseudo-random samples from a distribution: 
Gibbs Sampler 
Metropolis-Hastings algorithm. 
Used to sample from a joint distribution when 
we know the full conditional distribution 
of each parameter: 
JD = p(θ1, . . . , θk) 
The full conditional distribution is the 
distribution of a parameter conditional on 
the known information and all the other 
parameters: 
FCD = p(θj | θ−j, X) 
Used when… 
• the posterior doesn’t look like any distribution 
we know (no conjugacy) 
• the posterior consists of more than 2 
parameters (grid approximations become 
intractable) 
• some (or all) of the full conditionals do not 
look like any distribution we know (no 
Gibbs sampling for the parameters whose full 
conditionals we don’t know)
Gibbs Sampler 
1. Pick a vector of starting values θ(0). 
2. Start with any θ (order does not matter). Draw a value θ1(1) from the full conditional p(θ1 | θ2(0), θ3(0), y). 
3. Draw a value θ2(1) (again, order does not matter) from the full conditional p(θ2 | θ1(1), θ3(0), y). Note that we must use the updated value θ1(1). 
4. Repeat (for all parameters) until we get M draws, with each draw being a vector θ(t). 
5. Optional burn-in and/or thinning.
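A minimal runnable sketch of this recipe, assuming a toy target whose full conditionals are known in closed form: a bivariate normal with correlation ρ = 0.8. The model, the value of ρ, and all function names are illustrative assumptions, not from the slides.

```python
import numpy as np

def gibbs_bivariate_normal(n_draws, rho=0.8, seed=0):
    """Gibbs sampling for (theta1, theta2) ~ N(0, [[1, rho], [rho, 1]]).

    Each full conditional is univariate normal:
      theta1 | theta2 ~ N(rho * theta2, 1 - rho**2)  (and symmetrically).
    """
    rng = np.random.default_rng(seed)
    theta1, theta2 = 0.0, 0.0           # step 1: starting values theta^(0)
    draws = np.empty((n_draws, 2))
    sd = np.sqrt(1.0 - rho**2)
    for t in range(n_draws):
        # step 2: draw theta1 from p(theta1 | theta2)
        theta1 = rng.normal(rho * theta2, sd)
        # step 3: draw theta2 from p(theta2 | theta1), using the UPDATED theta1
        theta2 = rng.normal(rho * theta1, sd)
        draws[t] = theta1, theta2       # step 4: store the vector theta^(t)
    return draws

samples = gibbs_bivariate_normal(5000)[1000:]   # step 5: drop burn-in
print(samples.mean(axis=0), np.corrcoef(samples.T)[0, 1])
```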
MH Algorithm 
1. Choose a starting value θ(0). 
2. At iteration t, draw a candidate θ∗ from a jumping distribution Jt(θ∗ | θ(t−1)). 
3. Compute the acceptance ratio: 
r = [p(θ∗ | y) / Jt(θ∗ | θ(t−1))] / [p(θ(t−1) | y) / Jt(θ(t−1) | θ∗)] 
4. Accept θ∗ as θ(t) with probability min(r, 1). If θ∗ is not accepted, set θ(t) = θ(t−1). 
5. Repeat steps 2-4 M times to get M draws from p(θ | y), with optional burn-in and/or thinning.
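A minimal sketch of the same recipe, assuming a toy one-dimensional target known only up to a constant and a Gaussian random-walk jumping distribution; both are illustrative choices, not from the slides. With a symmetric walk the Jt terms cancel, which is exactly the simplification shown on the next slide.

```python
import numpy as np

def log_post(theta):
    # Unnormalized log posterior log p(theta | y); toy choice: standard normal.
    return -0.5 * theta**2

def metropolis_hastings(n_draws, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    theta = 0.0                          # step 1: starting value theta^(0)
    draws = np.empty(n_draws)
    for t in range(n_draws):
        # step 2: candidate from the jumping distribution J_t
        cand = rng.normal(theta, step)
        # step 3: acceptance ratio (J is symmetric here, so its terms cancel)
        log_r = log_post(cand) - log_post(theta)
        # step 4: accept with probability min(r, 1), else keep theta^(t-1)
        if np.log(rng.uniform()) < log_r:
            theta = cand
        draws[t] = theta                 # step 5 (burn-in/thinning) up to caller
    return draws
```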
MH Algorithm (symmetric proposal) 
When the jumping distribution is symmetric, Jt(θ∗ | θ(t−1)) = Jt(θ(t−1) | θ∗), 
the proposal terms cancel and the acceptance ratio reduces to: 
r = p(θ∗ | y) / p(θ(t−1) | y) 
Steps 1, 2, 4 and 5 are unchanged from the previous slide.
MCMC and BIG DATA 
Propose: $\theta' \sim Q(\theta' \mid \theta)$ 
Accept with probability 
$$\alpha = \min\left[1,\; \frac{Q(\theta \mid \theta')\, P(\theta') \prod_{i=1}^{N} P(x_i \mid \theta')}{Q(\theta' \mid \theta)\, P(\theta) \prod_{i=1}^{N} P(x_i \mid \theta)}\right]$$ 
If accepted: $\theta \leftarrow \theta'$ 
• The canonical MCMC algorithm proposes samples from a distribution Q and 
accepts/rejects each proposal with a rule that must examine the likelihood 
of every data item 
• All the data are processed at each iteration, so the 
run time may be excessive: with N = 10^8 data points and 10^5 iterations, 
that is already 10^13 likelihood evaluations!
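To make the bottleneck concrete, here is a hedged sketch of the log acceptance ratio of canonical MH; all function names are placeholders. Every single accept/reject decision touches all N data points:

```python
import numpy as np

def full_data_log_accept(theta, cand, data, log_lik, log_prior, log_q_ratio):
    """Log acceptance ratio of canonical MH: touches ALL N data points.

    log_q_ratio = log Q(theta | cand) - log Q(cand | theta).
    """
    log_r = log_prior(cand) - log_prior(theta) + log_q_ratio
    for x in data:                        # O(N) likelihood evaluations
        log_r += log_lik(cand, x) - log_lik(theta, x)
    return min(0.0, log_r)                # log alpha = log min(1, r)
```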
MCMC APPROXIMATE SOLUTIONS FOR BIG DATA 
IDEA 
• Assume that you have T units of computation to achieve the lowest 
possible error. 
• Your MCMC procedure has a knob to control the bias/variance 
tradeoff 
So, during the sampling phase… 
Turn left => SLOW: small bias, high variance 
Turn right => FAST: strong bias, low variance
SGLD & SGFS: knob = stepsize 
Stochastic Gradient Langevin Dynamics 
Langevin dynamics based on stochastic gradients 
[Welling & Teh, ICML 2011] 
• The idea is to extend the stochastic gradient descent optimization algorithm with Gaussian noise injected via Langevin dynamics (see the sketch after this list). 
• One advantage of SGLD is that the entire data set never has to be held in memory 
• Disadvantages: 
• it has to read a fresh mini-batch from external data at each iteration 
• gradients are computationally expensive 
• it needs a suitable preconditioning matrix to set the step size of the transition operator. 
Stochastic Gradient Fisher Scoring 
[Ahn et al., ICML 2012] 
Builds on SGLD and tries to improve on its predecessor with a three-phase procedure: 
1. Burn-in: large stepsize. 
2. Once the target distribution is reached: keep the large stepsize and sample from the asymptotic Gaussian approximation of the posterior. 
3. Further annealing: smaller stepsize to generate increasingly accurate samples from the true posterior. 
• With this approach the algorithm tries to reduce bias during the burn-in phase and then starts sampling to reduce variance.
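A minimal sketch of one SGLD update following the Welling & Teh (2011) rule; the gradient callbacks, the data layout, and the stepsize handling are illustrative assumptions.

```python
import numpy as np

def sgld_step(theta, data, grad_log_prior, grad_log_lik, eps, batch_size, rng):
    """One Stochastic Gradient Langevin Dynamics update.

    theta_{t+1} = theta_t
                  + (eps / 2) * (grad log prior
                                 + (N / n) * sum of minibatch lik. gradients)
                  + Normal(0, eps) noise.
    `data` is an array of observations; gradients may be scalars or arrays.
    """
    N = len(data)
    batch = data[rng.choice(N, size=batch_size, replace=False)]
    grad = grad_log_prior(theta)
    grad += (N / batch_size) * sum(grad_log_lik(theta, x) for x in batch)
    noise = rng.normal(0.0, np.sqrt(eps), size=np.shape(theta))
    return theta + 0.5 * eps * grad + noise
```

With a decaying stepsize schedule εt = a(b + t)^−γ the injected Gaussian noise eventually dominates the minibatch-gradient noise and the iterates approximate posterior samples; a constant stepsize is exactly the bias/variance knob described above.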
MH TEST: knob = confidence 
CUTTING THE MH ALGORITHM BUDGET 
[Korattikara et al., ICML 2014] 
…by conducting sequential hypothesis tests to decide whether to accept or reject a given sample, making the majority of these 
decisions based on only a small fraction of the data (a sketch follows) 
• Works directly on the accept/reject step of the MH algorithm 
• Accepts a proposal with a given confidence 
• Applicable to problems where computing gradients is impossible
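A rough sketch of the sequential-test idea, assuming the standard reformulation of the MH accept rule as a comparison of the mean log-likelihood ratio μ against a threshold μ0. The stopping rule below is a plain t-test and only loosely follows the paper's exact statistic (finite-population corrections are omitted); all names are placeholders.

```python
import numpy as np
from scipy import stats

def approx_mh_accept(theta, cand, log_lik, data, log_prior_prop_ratio,
                     eps=0.05, batch_size=100, rng=None):
    """Approximate MH test: decide accept/reject from a fraction of the data.

    Exact MH accepts iff  mu > mu0,  where
      mu  = (1/N) * sum_i [log l_i(cand) - log l_i(theta)]
      mu0 = (1/N) * (log u - log [prior and proposal ratio]).
    Estimate mu on a growing minibatch; stop once a t-test is confident
    (at level eps) about which side of mu0 it falls on.
    """
    rng = rng or np.random.default_rng()
    N = len(data)
    mu0 = (np.log(rng.uniform()) - log_prior_prop_ratio) / N
    perm = rng.permutation(N)
    diffs = []
    n = 0
    while n < N:
        idx = perm[n:n + batch_size]
        diffs.extend(log_lik(cand, data[i]) - log_lik(theta, data[i])
                     for i in idx)
        n = len(diffs)
        mean = np.mean(diffs)
        sd = max(np.std(diffs, ddof=1), 1e-12)
        t = (mean - mu0) / (sd / np.sqrt(n))
        # Confident enough (or out of data)? Decide with n <= N points.
        if 2 * (1 - stats.t.cdf(abs(t), df=n - 1)) < eps or n >= N:
            return mean > mu0
```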
FIREFLY EXACT SOLUTION 
ISSUE 1: the prohibitive cost of evaluating every likelihood term at every iteration (for 
big data sets) 
ISSUE 2: the preceding procedures construct an approximate transition operator (using 
subsets of the data) 
GOAL: obtain an exact procedure that leaves the true full-data posterior distribution 
invariant! 
HOW: query the likelihood of only a potentially small subset of the data at each 
iteration, yet simulate from the exact posterior 
IDEA: introduce a collection of Bernoulli variables that turn on (and off) the data points 
for which the likelihoods must be calculated
FLYMC: HOW IT WORKS 
Assuming we have: 
1. a target distribution and 2. a likelihood function 
Computing all N likelihoods at every iteration is the bottleneck! 
3. Assume that each likelihood term Ln can be bounded by a cheaper lower bound Bn 
5. Each zn then has a Bernoulli distribution conditioned on θ and xn 
6. Augment the posterior with these N variables
FLYMC: WHY EXACT? WHY FIREFLY? 
Why exact? 
• Marginalizing the zn out of the augmented joint leaves the marginal distribution of θ 
unchanged: it is still the correct posterior given in equation 1 
Why firefly? 
• From this joint distribution we evaluate only those 
likelihood terms for which zn = 1 (the bright terms), so data points 
light up and go dark across iterations
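The equations these slides refer to were embedded as images and did not survive extraction. The following LaTeX reconstruction of equations 1, 3, 5, and 6, based on the Firefly Monte Carlo paper the talk presents, is a plausible restoration rather than a verbatim copy of the slides.

```latex
% Target posterior (eq. 1) and likelihood terms:
p(\theta \mid x) \propto p(\theta) \prod_{n=1}^{N} L_n(\theta),
\qquad L_n(\theta) = p(x_n \mid \theta)

% Cheap lower bound (eq. 3):
0 \le B_n(\theta) \le L_n(\theta)

% Conditional Bernoulli distribution of each z_n (eq. 5):
p(z_n \mid x_n, \theta) =
  \left(\frac{L_n(\theta) - B_n(\theta)}{L_n(\theta)}\right)^{z_n}
  \left(\frac{B_n(\theta)}{L_n(\theta)}\right)^{1 - z_n}

% Augmented posterior (eq. 6); summing each z_n over {0, 1} gives
% (L_n - B_n) + B_n = L_n and recovers eq. 1 exactly:
p(\theta, z \mid x) \propto p(\theta) \prod_{n=1}^{N}
  \bigl(L_n(\theta) - B_n(\theta)\bigr)^{z_n}\, B_n(\theta)^{1 - z_n}
```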
FLYMC: THE REDUCED SPACE 
• We simulate the Markov 
chain on the augmented (θ, zn) space: 
zn = 0 => dark point (no likelihood computed) 
zn = 1 => bright point (likelihood computed) 
• If the bounds are tight, the Markov chain 
will tend to occupy zn = 0, so most points stay dark
ALGORITHM IMPL.
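The algorithm figure on this slide was also lost in extraction; below is a hedged Python sketch of one FlyMC-style iteration for illustration. It alternates (a) a Metropolis step on θ that touches only the bright points plus the cached product of bounds, and (b) Gibbs resampling of a random subset of the zn from their conditional Bernoullis (eq. 5). All helper names and the subset-resampling fraction are assumptions in the spirit of the method, not the paper's exact pseudocode.

```python
import numpy as np

def flymc_iteration(theta, z, data, log_L, log_B, log_B_total,
                    propose, log_prior, frac_resample=0.1, rng=None):
    """One FlyMC-style sweep: update theta given z, then resample some z_n.

    log_B_total(theta) = sum_n log B_n(theta); available in closed form for
    exponential-family bounds, so the dark points cost O(1) in total.
    """
    rng = rng or np.random.default_rng()
    bright = np.flatnonzero(z)

    def log_joint(th):
        # p(theta, z | x) = p(theta) * prod_n B_n(theta)
        #                   * prod_{bright} (L_n - B_n) / B_n   (eq. 6)
        s = log_prior(th) + log_B_total(th)
        for n in bright:
            # A real implementation would compute log(L_n - B_n) stably.
            s += (np.log(np.exp(log_L(th, data[n]))
                         - np.exp(log_B(th, data[n])))
                  - log_B(th, data[n]))
        return s

    cand = propose(theta, rng)            # symmetric proposal assumed
    if np.log(rng.uniform()) < log_joint(cand) - log_joint(theta):
        theta = cand

    # Gibbs-resample a random subset of the z_n from eq. 5.
    k = max(1, int(frac_resample * len(data)))
    for n in rng.choice(len(data), size=k, replace=False):
        p_bright = 1.0 - np.exp(log_B(theta, data[n]) - log_L(theta, data[n]))
        z[n] = rng.uniform() < p_bright
    return theta, z
```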
FLYMC: LOWER BOUND 
The lower bound Bn(θ) on each data point’s likelihood Ln(θ) should 
satisfy 2 properties: 
• Tightness, which determines the number of bright data points (M is the average number): 
• The product of the bounds must be easy to compute (e.g., using scaled exponential-family lower bounds, as sketched below) 
With this setting we achieve a speedup of roughly N/M over the O(ND) per-iteration evaluation time of regular MCMC
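As a sketch of why the second property matters, assume each bound takes a scaled exponential-family form; this parameterization is an assumption based on the paper's rationale, not a formula from the slides. The product over all N points then collapses onto cached sufficient statistics:

```latex
% Scaled exponential-family bound:
B_n(\theta) = \exp\{\, a(\theta) + b(\theta)^{\top} s(x_n) \,\}

% The full product then needs only the precomputed statistic
% S = \sum_{n=1}^{N} s(x_n):
\prod_{n=1}^{N} B_n(\theta) = \exp\{\, N\, a(\theta) + b(\theta)^{\top} S \,\}

% i.e. O(1) in N per iteration once S is cached.
```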
MAP-OPTIMISATION 
…in order to find an approximate maximum a posteriori (MAP) value of θ and to construct Bn to 
be tight there. 
The algorithm variants used in the experiments are: 
• Untuned FlyMC, with the choice ε = 1.5 for all data points. 
• MAP-tuned FlyMC, which runs a gradient-descent optimization to find an ε value for 
each data point (this yields bounds that are tighter near the MAP value of 
θ). 
• Regular full-posterior MCMC (for comparison)
EXPERIMENTS 
Expectation: 
• slower mixing (per iteration) 
• faster iterating 
Results: 
• FlyMC offers a speedup of at 
least one order of magnitude 
compared with regular MCMC
CONCLUSIONS 
FlyMC is an exact procedure that has the true full-data posterior as its target 
The introduction of the binary latent variables is a simple and efficient idea 
The lower bound is a requirement, and it can be difficult to obtain for many 
problems
Acknowledgements 
Dr. Antti Honkela 
Dr. Arto Klami 
Reviewers
Thank you! 
(gianvito.siciliano@gmail.com)
