Probabilistic modeling in deep learning

Dzianis Dus
Lead Data Scientist at InData Labs

How will we spend the next 60 minutes?
Thinking about the following topics:

1. What does “probabilistic modeling” mean?

2. Why is it cool (sometimes)?

3. How can we use it to build:

a. More robust and powerful models

b. Models with predefined properties

c. Models without overfitting (o_O)

d. Infinite ensembles of models (o_O)

4. Deep Learning

Problem statement: Empirical way
Suppose that we want to solve a classical regression problem:

Typical approach:

1. Choose a functional family for F(...)
2. Choose an appropriate loss function
3. Choose an optimization algorithm
4. Minimize loss on (X, Y)
5. ...
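The recipe above can be sketched in a few lines. This is only a minimal illustration, assuming a linear family F(x) = w * x + b, MSE loss and plain gradient descent; the synthetic data and the names `fit_linear_mse`, `lr` and `steps` are made up for the example.

```python
import numpy as np

def fit_linear_mse(x, y, lr=0.1, steps=2000):
    """Fit F(x) = w * x + b by gradient descent on the MSE loss."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        residual = (w * x + b) - y             # F(x) - y
        grad_w = 2.0 * np.mean(residual * x)   # d(MSE)/dw
        grad_b = 2.0 * np.mean(residual)       # d(MSE)/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * x - 1.0 + rng.normal(0.0, 0.1, size=200)  # hypothetical ground truth
w_hat, b_hat = fit_linear_mse(x, y)
```

Every choice here (family, loss, optimizer) is made by hand; nothing in the recipe says why MSE is the right loss.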

Problem statement: Probabilistic way
Define a “probability model” (describes how your data was generated):

Having a model, you can calculate the “likelihood” of your data:

We are working with i.i.d. data

Sharing the same variance

Data log-likelihood:
Maximum likelihood estimation:

MSE Loss minimization
For i.i.d. data sharing the same variance!
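Written out, the chain of steps above looks roughly as follows. This is a sketch under the stated assumptions (i.i.d. data, Gaussian noise with one shared variance); the notation F(x_i, θ), σ² and N is chosen here for illustration.

```latex
\begin{aligned}
y_i &= F(x_i, \theta) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \\
p(Y \mid X, \theta) &= \prod_{i=1}^{N} \mathcal{N}\!\left(y_i \mid F(x_i, \theta), \sigma^2\right) \\
\log p(Y \mid X, \theta) &= -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y_i - F(x_i, \theta)\right)^2 \\
\theta_{\mathrm{MLE}} &= \arg\max_{\theta} \log p(Y \mid X, \theta) = \arg\min_{\theta} \frac{1}{N}\sum_{i=1}^{N}\left(y_i - F(x_i, \theta)\right)^2
\end{aligned}
```

The first term of the log-likelihood and the factor 1/(2σ²) do not depend on θ, which is why only the sum of squares survives in the last line.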

Log-Likelihood maximization = Empirical loss minimization

1. MAE minimization = likelihood maximization of i.i.d. Laplace-distributed variables

2. For each empirically stated problem there exists an appropriate probability model

3. Empirical loss is often just a special case of a wider probability model

4. Wider model = wider opportunities!
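The MAE / Laplace correspondence from point 1 can be checked numerically. A small sketch, assuming a constant prediction `mu`, a fixed Laplace scale `b = 1` and toy data invented for the example: the value that minimizes MAE is the same one that maximizes the Laplace log-likelihood (and it is the median).

```python
import numpy as np

def mae(y, mu):
    """Mean absolute error of a constant prediction mu."""
    return np.mean(np.abs(y - mu))

def laplace_log_likelihood(y, mu, b=1.0):
    """Log-likelihood of i.i.d. Laplace(mu, b) observations."""
    return np.sum(-np.log(2.0 * b) - np.abs(y - mu) / b)

y = np.array([1.0, 2.0, 2.5, 7.0, 40.0])   # toy data with an outlier
candidates = np.linspace(0.0, 45.0, 451)   # grid of constant predictions, step 0.1

best_mae = candidates[np.argmin([mae(y, m) for m in candidates])]
best_ll = candidates[np.argmax([laplace_log_likelihood(y, m) for m in candidates])]
```

The log-likelihood is just a constant minus N * MAE / b, so both searches must land on the same grid point.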

Probabilistic modeling: Wider opportunities for Flo
Suppose that we have:
1. N unique users in the training set
2. For each user we’ve collected a time series of user states (on a daily basis):
3. For each user we’ve collected a time series of cycle lengths:
4. We predict time series of lengths Y based on time series of states X

We want to maximize data likelihood:

Probability that user i will have
a cycle of length y at day j

Just another notation

Cycle length of user i at day j has
a Gaussian distribution

Parameters of the distribution at day j
depend on model parameters
and all features up to day j

Can be easily modeled with a deep RNN!

Note that:
We don’t need any labels to predict variance!
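Why no variance labels are needed can be seen from the loss itself. A minimal sketch, assuming the network has two output heads (mean `mu` and log-variance `log_var`) and using made-up cycle lengths: the Gaussian negative log-likelihood is computed from the observed `y` only, and minimizing it pushes the predicted variance towards the actual spread of the residuals.

```python
import numpy as np

def gaussian_nll(y, mu, log_var):
    """Mean negative log-likelihood of y under N(mu, exp(log_var))."""
    return np.mean(0.5 * (np.log(2.0 * np.pi) + log_var
                          + (y - mu) ** 2 / np.exp(log_var)))

rng = np.random.default_rng(0)
y = rng.normal(loc=28.0, scale=3.0, size=5000)   # made-up cycle lengths, in days
mu = np.full_like(y, 28.0)                       # pretend the mean head is already good

# Scan constant log-variance predictions; the NLL alone picks the right scale.
grid = np.linspace(0.0, 4.0, 401)
best_log_var = grid[np.argmin([gaussian_nll(y, mu, lv) for lv in grid])]
best_std = np.exp(0.5 * best_log_var)            # close to the true spread of 3 days
```

In a real model `mu` and `log_var` would be per-day outputs of the RNN and the same loss would be minimized by gradient descent instead of a grid scan.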

Real life example:

Parameter estimation theory
Estimation theory is a branch of statistics that deals with estimating the values of
parameters based on measured empirical data that has a random component.
© Wikipedia

Commonly used estimators:
● Maximum likelihood estimator (MLE) - the Ugly
● Maximum a posteriori estimator (MAP) - the Bad
● Bayesian estimator - the Good

We are here: MLE

The way we go: MLE, then MAP, then Bayesian

Maximum a posteriori estimator
Until now, we’ve been talking about the Maximum Likelihood Estimator:

Now assume that a prior distribution over parameters exists:

Then we can apply Bayes’ rule:

Posterior distribution
over model parameters

Data likelihood for specific parameters
(could be modeled with Deep Network!)

Prior distribution over parameters
(describes our prior knowledge and/or
our desires for the model)

Bayesian evidence
A powerful method for model selection!

As a rule this integral is intractable :(
(You can never integrate this)

The core idea of the Maximum a Posteriori Estimator:

Doesn’t depend on model parameters

The only (but powerful!)
difference from MLE
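In formulas, the core idea can be sketched like this. The notation (θ for model parameters, X and Y for the data) is an assumption of this sketch, picked to match the usual textbook form.

```latex
\begin{aligned}
p(\theta \mid X, Y) &= \frac{p(Y \mid X, \theta)\, p(\theta)}{p(Y \mid X)}, \qquad p(Y \mid X) = \int p(Y \mid X, \theta)\, p(\theta)\, d\theta \\
\theta_{\mathrm{MAP}} &= \arg\max_{\theta} \log p(\theta \mid X, Y) \\
&= \arg\max_{\theta} \left[ \log p(Y \mid X, \theta) + \log p(\theta) - \log p(Y \mid X) \right] \\
&= \arg\max_{\theta} \left[ \log p(Y \mid X, \theta) + \log p(\theta) \right]
\end{aligned}
```

The evidence p(Y | X) does not depend on θ and drops out; the log-prior term log p(θ) is the only difference from MLE.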

1. MAP estimates model parameters as the mode of the posterior distribution

2. MAP estimation with a non-informative (flat) prior = MLE

3. MAP restricts the search space of possible models
4. With MAP you can put restrictions not only on model weights but also on many
interactions inside the network

Probabilistic modeling: Regularization
Regularization is a process of introducing additional information in order to
solve an ill-posed problem or prevent overfitting. © Wikipedia

In other words: restrict the model to have predefined properties.

It is closely connected to “prior distributions” on weights / activations / …
… and to MAP estimation!

Weights decay (or L2 regularization):
Appropriate probability model:
Model log-likelihood:

Data log-likelihood
(we’ve already calculated this)

Doesn’t depend on
model parameters

Squared L2 norm
of parameters

Regularization constant
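The link between a Gaussian prior and weight decay can be verified on a linear model. A minimal sketch, assuming Gaussian noise with std `sigma`, a zero-mean Gaussian prior with std `tau` on each weight, and synthetic data; under these assumptions the regularization constant is lambda = sigma^2 / tau^2 and the ridge solution is exactly the MAP estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
sigma, tau = 0.5, 1.0                        # noise std and prior std (assumed known)
y = X @ w_true + rng.normal(0.0, sigma, size=50)

def neg_log_posterior_grad(w):
    """Gradient of -log p(y | X, w) - log p(w) for Gaussian likelihood and prior."""
    return -(X.T @ (y - X @ w)) / sigma**2 + w / tau**2

# Ridge regression with regularization constant lambda = sigma^2 / tau^2
lam = sigma**2 / tau**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# The gradient of the negative log-posterior vanishes at the ridge solution.
grad_at_ridge = neg_log_posterior_grad(w_ridge)
```

A narrower prior (smaller `tau`) means a larger `lam`, i.e. stronger weight decay.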

So, it is clear that:

1. Laplace distribution as a prior = L1 regularization

2. It can be shown that Dropout is also a particular form of probability model …

3. … a Bayesian one :) …

4. … and therefore can be used not only as a regularization technique!

5. Do you want to pack your network weights into few kilobytes?

6. OK, all you need is MAP!
MAP is all you need!

Weights packing: Empirical way
Song Han and others - Deep Compression: Compressing Deep Neural Networks with Pruning,
Trained Quantization and Huffman Coding (2015)
Modern neural networks can be dramatically compressed:

Weights packing: Soft-Weight Sharing
1. Define the prior distribution of weights as a Gaussian Mixture Model

Mixture of Gaussians =

2. For one of the Gaussian components force:

3. Maybe define a Gamma prior for variances (for numerical stability)
4. Just find the MAP estimate for both the model parameters and the free mixture parameters!

Karen Ullrich - Soft Weight-Sharing For Neural Network Compression (2017)
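The penalty that such a prior adds to the loss can be sketched as follows. This is an illustration only, not the paper's implementation: the function `gmm_neg_log_prior`, the three-component mixture and all its numbers are hypothetical, and the forced component is assumed here to be centred at zero with a large mixing proportion.

```python
import numpy as np

def gmm_neg_log_prior(w, pis, mus, sigmas):
    """-log p(w) under a 1-D Gaussian mixture prior, summed over all weights."""
    w = np.asarray(w, dtype=float)[:, None]                  # shape (n_weights, 1)
    log_comp = (np.log(pis) - 0.5 * np.log(2.0 * np.pi * sigmas**2)
                - 0.5 * ((w - mus) / sigmas) ** 2)           # shape (n_weights, K)
    m = log_comp.max(axis=1, keepdims=True)                  # stable log-sum-exp
    log_p = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    return -log_p.sum()

# Hypothetical mixture: a dominant component pinned at zero plus two free ones.
pis = np.array([0.90, 0.05, 0.05])
mus = np.array([0.0, -0.5, 0.5])
sigmas = np.array([0.05, 0.05, 0.05])

clustered = np.array([0.0, 0.01, -0.5, 0.5, 0.49])    # weights near the centres
scattered = np.array([0.2, -0.3, 0.25, -0.2, 0.3])    # weights between the centres

penalty_clustered = gmm_neg_log_prior(clustered, pis, mus, sigmas)
penalty_scattered = gmm_neg_log_prior(scattered, pis, mus, sigmas)
```

Adding this penalty to the data loss and optimizing weights together with `mus`, `sigmas` and `pis` pulls weights towards a few shared values (mostly zero), which is what makes the network easy to prune and quantize.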

Maximum a posteriori estimation
1. Pretty cool and powerful technique
2. You can build hierarchical models (put priors on priors of priors of…)
3. You can put priors on activations of layers (sparse autoencoders)
4. Leads to “Empirical Bayes”
5. Thinking how to restrict your model? Try to find appropriate prior!

True Bayesian Modeling: Recap
1. The posterior can be found easily in the case of conjugate distributions

2. But for most real-life models the denominator is intractable

3. In MAP the denominator is totally ignored

4. Can we find a good approximation of the posterior?

True Bayesian Modeling: Approximation
Two main ideas:
1. MCMC (Markov chain Monte Carlo) - a tricky one
2. Variational Inference - a “Black Magic” one
Other ideas exist:
1. Monte Carlo Dropout
2. Stochastic gradient Langevin dynamics
3. ...

True Bayesian Modeling: MCMC
1. The key idea is to construct a Markov chain which has the posterior distribution as
its equilibrium distribution

2. Then you can burn in the Markov chain (let it converge to equilibrium) and then
sample from the posterior distribution

3. Sounds tricky, but it is a well-defined procedure

4. PyMC3 = Bayesian Modeling and Probabilistic Machine Learning in Python

5. Unfortunately, it does not scale well

6. So, you can’t explicitly apply it to complex models (like Neural Networks)

7. But implicit scaling is possible: Bayesian learning via stochastic gradient
Langevin dynamics (2011)
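The construct / burn in / sample procedure from points 1 and 2 can be shown on a toy problem. A minimal sketch using random-walk Metropolis (one of the simplest MCMC algorithms, written by hand here rather than with PyMC3); the data, the N(0, 10^2) prior and the step size are assumptions of the example, and the model is conjugate so the exact answer is known.

```python
import numpy as np

def metropolis(log_post, theta0, n_steps, step_size, rng):
    """Random-walk Metropolis: returns the chain of visited parameter values."""
    chain = np.empty(n_steps)
    theta, lp = theta0, log_post(theta0)
    for t in range(n_steps):
        proposal = theta + step_size * rng.normal()
        lp_new = log_post(proposal)
        # Accept with probability min(1, p(proposal) / p(theta)).
        if np.log(rng.uniform()) < lp_new - lp:
            theta, lp = proposal, lp_new
        chain[t] = theta
    return chain

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=20)             # toy data, known unit variance

def log_post(mu):
    # Unnormalized log posterior: N(0, 10^2) prior on mu + Gaussian likelihood.
    return -0.5 * mu**2 / 100.0 - 0.5 * np.sum((data - mu) ** 2)

chain = metropolis(log_post, theta0=0.0, n_steps=20000, step_size=0.5, rng=rng)
samples = chain[2000:]                           # drop the burn-in part

# Conjugate case, so the exact posterior mean is available for comparison.
exact_mean = data.sum() / (len(data) + 1.0 / 100.0)
```

Note that only the unnormalized posterior is needed: the intractable denominator cancels in the acceptance ratio. Each step still touches the whole dataset, which is where the scalability problem comes from.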

True Bayesian Modeling: Variational Inference
True posterior:

Modeled with Deep Neural Network

Intractable integral :(

Let’s find a good approximation:

Explicitly define distribution family
for approximation
(e.g. multivariate Gaussian)

Variational parameters
(e.g. mean vector, covariance matrix)

Speaking mathematically:

Kullback-Leibler divergence
(a measure of dissimilarity between distributions)

True posterior is unknown :(

Achtung!
A lot of math
is coming!

Rewrite this using
Bayes rule:

Doesn’t depend on theta!
(After integration)

Parameters
of integration

So, it is a constant!

Has no effect on
minimization problem

Group this together

Multiply by (-1)

KL
divergence

It is an expectation
over q(...)

Equivalent problems!
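The steps annotated above can be written out as one derivation. This is a sketch in notation assumed here: D stands for the data, θ for the network weights, q(θ | φ) for the approximate posterior with variational parameters φ.

```latex
\begin{aligned}
\mathrm{KL}\!\left(q(\theta \mid \varphi)\,\|\,p(\theta \mid D)\right)
&= \int q(\theta \mid \varphi) \log \frac{q(\theta \mid \varphi)}{p(\theta \mid D)}\, d\theta \\
&= \int q(\theta \mid \varphi) \log \frac{q(\theta \mid \varphi)\, p(D)}{p(D \mid \theta)\, p(\theta)}\, d\theta \\
&= \underbrace{\int q(\theta \mid \varphi) \log \frac{q(\theta \mid \varphi)}{p(\theta)}\, d\theta}_{\mathrm{KL}(q(\theta \mid \varphi)\,\|\,p(\theta))}
 - \underbrace{\int q(\theta \mid \varphi) \log p(D \mid \theta)\, d\theta}_{\mathbb{E}_{q(\theta \mid \varphi)}[\log p(D \mid \theta)]}
 + \log p(D) \\[4pt]
\arg\min_{\varphi} \mathrm{KL}\!\left(q(\theta \mid \varphi)\,\|\,p(\theta \mid D)\right)
&= \arg\max_{\varphi} \left\{ \mathbb{E}_{q(\theta \mid \varphi)}\!\left[\log p(D \mid \theta)\right] - \mathrm{KL}\!\left(q(\theta \mid \varphi)\,\|\,p(\theta)\right) \right\}
\end{aligned}
```

The term log p(D) is a constant with respect to φ, so it has no effect on the minimization; multiplying the rest by (-1) turns the minimization into the maximization in the last line.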

Likelihood of your data
(your Neural Network works here!)

Prior on network weights
(you define this!)

Approximate posterior
(you define the form of this!)

We want to optimize this with respect to the
approximate posterior parameters!

We need to calculate the gradient of this

Gradient calculation:

Rewrite this as expectation
(for convenience)

Ooops...

Modeled with
Deep Network!

This integral is intractable too :(
(God damn!)

If it were an expectation over q(...), we could
approximate it using the Monte Carlo
method!

This is just equal to 1!

This is gradient of log(q(...))!

Luke,
log derivative
trick!

Can be approximated
with Monte Carlo!
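The log derivative trick can be tried on a case where the answer is known. A minimal sketch, assuming a one-dimensional Gaussian q with mean `mu` and std `sigma`, and a stand-in function f(theta) = theta^2 in place of the network's log-likelihood; the function name `score_function_grad` is made up for the example.

```python
import numpy as np

def score_function_grad(f, mu, sigma, n_samples, rng):
    """Monte Carlo estimate of d/dmu E_{theta ~ N(mu, sigma^2)}[f(theta)]."""
    theta = rng.normal(mu, sigma, size=n_samples)     # samples from q
    grad_log_q = (theta - mu) / sigma**2              # d/dmu log q(theta | mu, sigma)
    return np.mean(f(theta) * grad_log_q)             # E_q[f * grad log q]

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.5
estimate = score_function_grad(lambda t: t**2, mu, sigma, n_samples=200_000, rng=rng)
exact = 2.0 * mu   # since E[theta^2] = mu^2 + sigma^2
```

The estimator is unbiased but noisy, which is one reason a lot of samples (or variance reduction tricks) are needed in practice.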

Bayesian Networks: Step by step
Define functional family for approximate posterior (e.g. Gaussian):

Solve the optimization problem (with doubly stochastic gradient ascent):

Having the approximate posterior,
you can sample network weights (as many as you want)!
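Sampling weights and turning them into predictions can be sketched like this. The "network" is a tiny linear model and the variational parameters `post_mean` and `post_std` are hypothetical numbers standing in for the result of the optimization step; a factorized Gaussian posterior is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fitted variational parameters of a factorized Gaussian posterior
# over the two weights (slope, intercept) of a tiny linear "network".
post_mean = np.array([2.0, -1.0])
post_std = np.array([0.3, 0.1])

def predict(weights, x):
    """Apply every sampled (slope, intercept) pair to every input."""
    return weights[..., 0:1] * x + weights[..., 1:2]

x = np.array([0.0, 1.0, 5.0])
weight_samples = post_mean + post_std * rng.normal(size=(1000, 2))  # 1000 networks
preds = predict(weight_samples, x)       # shape (1000, 3): one row per sampled network

pred_mean = preds.mean(axis=0)           # ensemble prediction
pred_std = preds.std(axis=0)             # spread across networks = confidence measure
```

Each row of `preds` comes from a different network, so averaging rows is an ensemble and the spread between rows gives a per-input uncertainty; here it grows with the distance of `x` from zero.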

Bayesian Networks: Pros and Cons
As a result you have:
1. Infinite ensemble of Neural Networks!
2. No overfitting problem (in the classical sense)!
3. No adversarial examples problem!
4. Measure of prediction confidence!
5. ...
No free lunch:
1. A lot of work is still hidden in “scalability” and “convergence”!
2. Very (very!) expensive predictions!

Bayesian Networks Examples: BRNN
Meire Fortunato and others - Bayesian Recurrent Neural Networks (2017)

Bayesian Networks Examples: SegNet
Alex Kendall and others - Bayesian SegNet: Model Uncertainty in
Deep Convolutional Encoder-Decoder Architectures for Scene Understanding (2016)

Bayesian Networks in (near) Production: UBER
Lingxue Zhu - Deep and Confident Prediction for Time Series at Uber (2017)
How it works:
1. LSTM network
2. Monte Carlo Dropout
3. Prediction of daily completed trips
4. Anomaly detection for various metrics
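The Monte Carlo Dropout step can be illustrated on a toy network. This is a sketch, not Uber's setup: a small dense network with random "trained" weights is swapped in for the LSTM, and the names `forward_with_dropout` and `p_drop` are made up for the example. The only essential idea is that dropout stays switched on at prediction time and the forward pass is repeated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "trained" weights of a one-hidden-layer network.
W1 = rng.normal(size=(1, 64))
b1 = np.zeros(64)
W2 = rng.normal(size=(64, 1)) / 8.0

def forward_with_dropout(x, p_drop, rng):
    """One stochastic forward pass with dropout left ON at prediction time."""
    h = np.maximum(x @ W1 + b1, 0.0)               # ReLU hidden layer
    mask = rng.uniform(size=h.shape) >= p_drop     # keep units with prob 1 - p_drop
    h = h * mask / (1.0 - p_drop)                  # inverted dropout scaling
    return h @ W2

x = np.array([[0.5], [2.0]])
passes = np.stack([forward_with_dropout(x, 0.2, rng) for _ in range(200)])  # (200, 2, 1)

mc_mean = passes.mean(axis=0)   # prediction
mc_std = passes.std(axis=0)     # uncertainty estimate
```

A large `mc_std` for an input is the kind of signal that can be used for anomaly detection: the observed value is compared against the predicted mean plus or minus a few standard deviations.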

Bayesian Networks in (near) Production: Flo
Predicted distributions of cycle length for 40 independent users:
Switched to Empirical Bayes for now.

Speech Summary
1. Probabilistic modeling is a powerful tool with a strong mathematical background
2. Many techniques are currently not widely used in Deep Learning
3. You can improve many aspects of your model using the same framework
4. Scalability, stability of convergence and inference cost are the main constraints
5. The future of Deep Learning looks Bayesian...
… (for the moment, for me)

Thank you for your attention!
I hope you have a lot of questions :)
Dzianis Dus
Lead Data Scientist at InData Labs
