2. Frequentist vs Bayesian
In a debugging problem, a frequentist function given the argument 'My code
passed all X tests; is my code bug free?' would return YES
A Bayesian function given the argument 'Often my code has bugs. My code
passed all X tests; is my code bug free?' would return YES with prob 0.8
and NO with prob 0.2
The additional argument in the Bayesian version – 'Often my code has bugs' –
is called the prior
The prior is our belief about the situation
3. Why probabilistic
Number of instances as evidence – N
As N → ∞, Bayesian results (often) align with frequentist results
For small N, inference is unstable: frequentist estimates have higher
variance and wider confidence intervals, and Bayesian inference excels
For large N, to quote Andrew Gelman – N can never be large enough
Bayes' formula is fundamental
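The debugging example above is just Bayes' formula at work. A minimal sketch in plain Python; the prior and likelihood numbers are assumed for illustration (only the 0.8/0.2 conclusion appears in the slides):

```python
# Bayes' rule: posterior = likelihood * prior / evidence.
# All input probabilities below are illustrative assumptions.
p_bug = 0.5                 # prior: "often my code has bugs" (assumed)
p_pass_given_bug = 0.25     # buggy code can still pass all X tests (assumed)
p_pass_given_ok = 1.0       # bug-free code always passes

# Evidence: total probability of passing all tests
p_pass = p_pass_given_ok * (1 - p_bug) + p_pass_given_bug * p_bug

# Posterior probability the code is bug free given it passed
p_ok_given_pass = p_pass_given_ok * (1 - p_bug) / p_pass
print(round(p_ok_given_pass, 2))  # 0.8 -> "YES with prob 0.8, NO with prob 0.2"
```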
6. Metropolis Rule
[Figure: a jump from (θ0, P0) towards (θ1, P1) in a high-P region]
• If P1 > P0, accept the jump
• If P1 < P0, make the jump with probability P1/P0
• Successful jumps form a chain called a Markov chain
• The algorithm is endless – it orbits the true solution but never stops at it
• How to make jumps? θ1 = θ0 + N(0, Δθ)
Metropolis–Hastings – jump with probability min(q, 1), where
q = [P(θ1) / J(θ1 | θ0)] / [P(θ0) / J(θ0 | θ1)]
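The rule above can be sketched as a minimal random-walk Metropolis sampler in plain Python; the standard-normal target, the seed, and the step size are illustrative assumptions, not from the slides:

```python
import random, math

def metropolis(log_p, theta0, step, n_samples):
    """Random-walk Metropolis: propose theta1 = theta0 + N(0, step),
    accept with probability min(1, P(theta1)/P(theta0))."""
    theta, chain = theta0, []
    for _ in range(n_samples):
        proposal = theta + random.gauss(0.0, step)
        # log acceptance ratio log(P1/P0); symmetric jumps cancel the J terms
        if math.log(random.random()) < log_p(proposal) - log_p(theta):
            theta = proposal          # accept the jump
        chain.append(theta)           # the chain of states is a Markov chain
    return chain

# Target: standard normal (an unnormalised log-density is enough)
random.seed(0)
chain = metropolis(lambda t: -0.5 * t * t, theta0=0.0, step=1.0, n_samples=20000)
mean = sum(chain) / len(chain)
print(round(mean, 1))  # close to 0, the mean of the target
```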
7. Gibbs Sampling
Assume θ1, θ2, θ3 are the parameters of the posterior
• Define P(θ1, θ2, θ3)
• Sample θ1^0, θ2^0, θ3^0 from the prior
• For t in 1:T
  • θ1^t ~ P(θ1 | θ2^(t-1), θ3^(t-1))
  • θ2^t ~ P(θ2 | θ1^t, θ3^(t-1))
  • θ3^t ~ P(θ3 | θ1^t, θ2^t)
• We need to know these conditional probability distributions
• Gibbs sampling is thus a special case of the Metropolis rule
• Successful jumps form a chain called a Markov chain
• The algorithm is endless – it orbits the true solution but never stops at it
• Practically not always feasible
It is mathematically proven that this algorithm asymptotically converges to the solution
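The loop above can be sketched for a two-parameter case. A minimal plain-Python example, assuming a bivariate Gaussian target with correlation rho, chosen because its full conditionals are known Gaussians (the target and seed are my assumptions):

```python
import random

# Gibbs sampling for a 2-D standard Gaussian with correlation rho:
# each full conditional is itself Gaussian, so we can sample it directly.
rho = 0.8
random.seed(1)
x, y = 0.0, 0.0           # theta^0 (here just a fixed start point)
xs, ys = [], []
for t in range(20000):
    # x^t ~ P(x | y^(t-1)) = N(rho*y, 1 - rho^2)
    x = random.gauss(rho * y, (1 - rho ** 2) ** 0.5)
    # y^t ~ P(y | x^t) = N(rho*x, 1 - rho^2)
    y = random.gauss(rho * x, (1 - rho ** 2) ** 0.5)
    xs.append(x)
    ys.append(y)

# Sample estimate of E[xy]; means are ~0 and variances ~1, so this is ~rho
n = len(xs)
corr = sum(a * b for a, b in zip(xs, ys)) / n
print(round(corr, 1))  # close to rho = 0.8
```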
8. Variational Inference
Information Theory
• Information = - log(p(x))
• Entropy = - ∑ p(x)log(p(x))
• Differential Entropy = - ∫ p(x)log(p(x))dx
KL Divergence
• Measures the "distance" between two probability distributions (not a true metric)
• KL(p||q) = [− ∑ p(x) log q(x)] − [− ∑ p(x) log p(x)] = − ∑ p(x) log(q(x)/p(x))
• KL ≥ 0
• KL(p||q) ≠ KL(q||p)
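The two properties are easy to verify numerically. A minimal sketch for discrete distributions (the example distributions p and q are my choice):

```python
import math

def kl(p, q):
    """KL(p||q) = -sum p(x) log(q(x)/p(x)) for discrete distributions."""
    return -sum(pi * math.log(qi / pi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # fair coin
q = [0.9, 0.1]   # heavily biased coin
print(round(kl(p, q), 3), round(kl(q, p), 3))  # both >= 0, and not equal
```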
Using KL Divergence
• We have p(x, z) but we want to know p(z|x) → call it p′
• So we create an approximating distribution q(z)
• KL(q||p′) + L = log p(x)
• L (the evidence lower bound) is a function of p(x, z) and q(z)
• Minimizing the KL is the same as maximizing L
• After some neat math, q comes out in the exponential family → neat and
convenient
VI vs MCMC
• VI is deterministic and is an approximation
• MCMC is a sampling-based solution; with finitely many samples it is also an
approximation, but it is asymptotically exact
• Generally, MCMC solutions are considered more accurate
9. Coin Toss
You toss a coin 100 times and see 60 heads. Is it a fair coin?
>>> x_train
array([1, 1, 0, 1, 0, 1, 1, ……., 0, 1], dtype=int32)
>>> sum(x_train == 0)
40
>>> sum(x_train == 1)
60
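With a Uniform(0, 1) prior on the fairness, this question has a closed-form answer, since Uniform(0, 1) is Beta(1, 1) and the Beta prior is conjugate to the Bernoulli. A minimal sketch in plain Python; the Monte Carlo check of P(pheads > 0.5) is my addition, not from the slides:

```python
import random

# Conjugate analysis: Uniform(0,1) prior = Beta(1,1); after 60 heads and
# 40 tails the posterior on pheads is Beta(1 + 60, 1 + 40) = Beta(61, 41).
heads, tails = 60, 40
a, b = 1 + heads, 1 + tails

post_mean = a / (a + b)
print(round(post_mean, 3))  # 0.598: the coin leans towards heads

# Monte Carlo estimate of the posterior probability that pheads > 0.5
random.seed(0)
draws = [random.betavariate(a, b) for _ in range(50000)]
p_biased = sum(d > 0.5 for d in draws) / len(draws)
print(round(p_biased, 2))  # roughly 0.98: a fair coin looks unlikely
```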
10. How do we build a probabilistic model?
Posit a generative model
Start with a simple story about how the data is generated
What probability distribution could explain coin tosses like the ones
observed? – Forward Thinking
Infer Model Parameters
Infer the specifics of the story based on observations
Given the model and the data, how likely is it that the coin has a particular fairness? –
Backward Thinking
Criticize the model
Can the simple story explain the observations? Can we improve the story?
11. Model Building
Process of stating our beliefs about how the data could have been
generated
Models are simplified descriptions of the data
Models can be declared as abstract mathematical descriptions or as code
Models allow for simulation of data
12. Generative Model for a coin toss
Model expressed in terms of probability distribution
P(params, data) = p(params) x p(data | params)
params: fairness of the coin
data: coin tosses
p(params): prior probability of a certain fairness
p(data|params): conditional probability that the data is observed given that the coin
has a certain fairness
p(params, data): joint probability that both the data is observed and the coin has a
certain fairness
13. Generative Model for coin toss
Our story in Edward:
>>> pheads = Uniform(low=0., high=1.)
>>> c = Bernoulli(probs=pheads, sample_shape=100)
Uniform: prior
Bernoulli: conditional
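The same story can be simulated without a PPL, which makes the factorization p(params) × p(data|params) concrete. A minimal plain-Python sketch (the seed and variable names are mine):

```python
import random

# Forward simulation of the generative story:
random.seed(0)
pheads = random.uniform(0.0, 1.0)   # draw fairness from the prior p(params)
# 100 Bernoulli tosses conditional on that fairness: p(data|params)
tosses = [int(random.random() < pheads) for _ in range(100)]
print(pheads, sum(tosses))  # the simulated fairness and the number of heads
```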
14. Real Life Scenarios
We take an open dataset on the climate of major cities across the world from
Kaggle. Click here for the dataset
Using Bayesian inference, we try to answer the following questions:
Do the major cities show temperature variability above a threshold in their
weather patterns?
What is the probability that a randomly chosen year has a temperature
increase greater than a threshold?
PPL framework – PyMC3
Solution available in a Jupyter Notebook – Click here for the Notebook
The Notebook will continue to be edited
15. Real Life Scenarios (Contd.)
We take an open dataset on district-wise education metrics in India from
Kaggle. Click here for the dataset
Using Bayesian ML, we perform regression and classification on the above
problem
PPL frameworks – PyMC3 and Edward
Solutions available in Jupyter Notebooks – Click here and here for the
Notebooks
The Notebooks will continue to be edited
19. Probabilistic – Pros / Criticism
Pros:
• Can work with small / medium data
• Research on black-box interpretability
• Nuanced risk functions – good inference and decision theory, not just prediction
Criticism:
• No free lunch
• Bad throughput
• Skill sets required – statistical analysis, ML/DL, advanced probability and statistics
20. Closing tips for theoretical and practical starting points
• Best is to build models from scratch
• Or use the weights of an existing model as priors and continue
• Hypothesis testing in a probabilistic way
• Outlier analysis
• Comparison of a traditional vs a probabilistic NN
• PyMC3 and Edward
• Stan is getting popular
• Edward 2.0 will be compatible with the latest TF
• PyMC4 will be built on TF
• Learn MCMC
• MCMC algorithms like:
  • Gibbs Sampling
  • Metropolis–Hastings
• Variational Inference
• How to create priors
• Model evaluation