Bayesian Inference :
From the Kalman Filter to Optimization
김홍배 박사님 (Dr. Hong-Bae Kim)
Bayes Rule
$$P(\mathrm{hypothesis} \mid \mathrm{data}) \;=\; \frac{P(\mathrm{data} \mid \mathrm{hypothesis})\,P(\mathrm{hypothesis})}{P(\mathrm{data})}$$
• Bayes rule tells us how to do inference about hypotheses from data.
• Learning and prediction can be seen as forms of inference.
Given information, estimate the hypothesis.
Rev'd Thomas Bayes (1702-1761)
Data: observation; Hypothesis: model
Countbayesie.com/blog/2016/5/1/a-guide-to-Bayesian-statistics
Contents :
- Learning: Maximum a Posteriori Estimator (MAP)
- Prediction: Kalman Filter and its implementation
- Optimization: Bayesian Optimization and its application
Learning :
Cost to minimize: the cross-entropy error function

$$J(\theta) \;=\; -\frac{1}{m}\sum_{i=1}^{m} \log P\!\left(y^{(i)} \mid x^{(i)}; \theta\right)$$

Logistic regression: this cost is the negative log-likelihood, so minimizing it gives the maximum likelihood estimator (MLE):

$$\theta^{*} \;=\; \arg\max_{\theta} \sum_{i} \log P\!\left(y^{(i)} \mid x^{(i)}; \theta\right) \;=\; \arg\min_{\theta}\, J(\theta)$$

• This approach is very ill-conditioned in nature → sensitive to noise and model error
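A minimal NumPy sketch of minimizing this cross-entropy cost by gradient descent (the design matrix X and binary labels y are illustrative placeholders, not defined on the slides); the minimizer is the maximum-likelihood estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(theta, X, y):
    # J(theta) = -(1/m) * sum( y*log(p) + (1-y)*log(1-p) )
    p = sigmoid(X @ theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def mle_fit(X, y, lr=0.1, n_iters=1000):
    # Gradient descent on J(theta): the minimizer is the MLE.
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y) / len(y)   # gradient of the cross-entropy
        theta -= lr * grad
    return theta
```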
Learning : Regularized Logistic Regression

Now assume that a prior distribution over the parameters exists. Then we can apply Bayes' rule (written out in the block below). The successive slides highlight each term of the rule in turn:
- Posterior distribution over the model parameters
- Data likelihood for specific parameters (could be modeled with a deep network!)
- Prior distribution over the parameters (describes our prior knowledge and/or our desires for the model)
- Bayesian evidence: a powerful method for model selection!

As a rule, the evidence integral is intractable :( (in general you can never compute it in closed form).
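For reference, Bayes' rule over the model parameters, written in the same notation as the MAP slide below (the slides show it only as an image):

$$P(\theta \mid x, y) \;=\; \frac{P(y \mid x, \theta)\,P(\theta)}{P(y \mid x)},
\qquad
P(y \mid x) \;=\; \int P(y \mid x, \theta)\,P(\theta)\,d\theta$$

The denominator is the Bayesian evidence; the integral over all parameter values is what is intractable in general.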
Learning : Regularized Logistic Regression (MAP)

The core idea of the Maximum a Posteriori (MAP) estimator:

$$J_{\mathrm{MAP}}(\theta) \;=\; -\log P(\theta \mid x, y) \;=\; -\log P(y \mid x, \theta) - \log P(\theta) + \log P(y)
\;=\; J_{\mathrm{MLE}}(\theta) + \frac{1}{2\sigma_w^{2}}\sum_i \theta_i^{2} + \mathrm{const}$$

$$\theta^{*}_{\mathrm{MAP}} \;=\; \arg\max_{\theta}\,\bigl(\log P(y \mid x, \theta) + \log P(\theta)\bigr) \;=\; \arg\min_{\theta}\, J_{\mathrm{MAP}}(\theta)$$

This is the loss function for the posterior distribution over the model parameters, assuming a Gaussian prior on the weights: the Gaussian prior becomes an L2 (weight-decay) penalty, which is exactly regularized logistic regression.
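A minimal NumPy sketch of this equivalence (the data matrix X, labels y, and prior scale sigma_w are illustrative placeholders, not from the slides): the MAP gradient is simply the MLE gradient plus the gradient of the Gaussian-prior (L2) term.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_fit(X, y, sigma_w=1.0, lr=0.1, n_iters=1000):
    """MAP estimate with a Gaussian N(0, sigma_w^2) prior on each weight.

    Equivalent to L2-regularized logistic regression:
    J_MAP(theta) = J_MLE(theta) + (1 / (2 * sigma_w**2)) * sum(theta_i**2) + const
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        p = sigmoid(X @ theta)
        grad_mle = X.T @ (p - y) / m            # gradient of the cross-entropy (MLE) term
        grad_prior = theta / (m * sigma_w**2)   # Gaussian prior term, scaled to the per-example loss
        theta -= lr * (grad_mle + grad_prior)
    return theta
```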
Learning : Variational Inference

The true posterior over the parameters (whose likelihood may itself be modeled with a deep neural network) involves an intractable integral :( So let's find a good approximation instead:
- Explicitly define a distribution family for the approximation (e.g., a multivariate Gaussian)
- Introduce variational parameters (e.g., a mean vector and a covariance matrix)
- Measure the mismatch with the Kullback-Leibler divergence (a measure of dissimilarity between distributions)

Speaking mathematically, we want to minimize the KL divergence between the approximation and the true posterior. But the true posterior is unknown :( (the standard way out is sketched below).
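A minimal sketch of the standard resolution (not spelled out on these slides): because $\log P(y \mid x)$ does not depend on the variational parameters $\phi$, minimizing the KL divergence to the unknown posterior is equivalent to maximizing the evidence lower bound (ELBO), which only involves quantities we can evaluate or sample:

$$\mathrm{KL}\bigl(q_{\phi}(\theta)\,\|\,P(\theta \mid x, y)\bigr)
\;=\; \log P(y \mid x)
\;-\;\underbrace{\Bigl(\mathbb{E}_{q_{\phi}(\theta)}\bigl[\log P(y \mid x, \theta)\bigr] - \mathrm{KL}\bigl(q_{\phi}(\theta)\,\|\,P(\theta)\bigr)\Bigr)}_{\mathrm{ELBO}(\phi)}$$

So maximizing the ELBO over $\phi$ minimizes the KL divergence without ever evaluating the true posterior.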
Prediction : Kalman Filter
Autonomous Mobile Robot Design, Dr. Kostas Alexis (CSE)

Kalman Filter: A Primer

Consider a time-discrete stochastic process (a Markov chain). The Kalman filter estimates the state $x_t$ of a discrete-time controlled process governed by a linear stochastic difference equation, together with (linear) measurements of the state, where both the process noise and the measurement noise are Gaussian.
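The slide's equations are shown only as images; the standard linear-Gaussian form they refer to is (noise-covariance naming conventions vary between texts):

$$x_t = A_t\,x_{t-1} + B_t\,u_t + \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, R_t)$$
$$z_t = C_t\,x_t + \delta_t, \qquad \delta_t \sim \mathcal{N}(0, Q_t)$$

with state $x_t$, control input $u_t$, and measurement $z_t$.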
Prediction : Bayes Filter Algorithm

For each step, do:
• Apply the motion model (prediction)
• Apply the sensor model (correction); the normalizer is a constant with respect to the state

(The recursion is written out below.)
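The standard Bayes-filter recursion the slide refers to, in the usual belief notation:

$$\overline{\mathrm{bel}}(x_t) \;=\; \int p(x_t \mid u_t, x_{t-1})\,\mathrm{bel}(x_{t-1})\,dx_{t-1}
\qquad\text{(motion model)}$$
$$\mathrm{bel}(x_t) \;=\; \eta\, p(z_t \mid x_t)\,\overline{\mathrm{bel}}(x_t)
\qquad\text{(sensor model, with state-independent normalizer } \eta\text{)}$$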
Prediction : From Bayes Filter to Kalman Filter

For each step, do:
• Apply the motion model
• Apply the sensor model

When the models are linear and the noise is Gaussian, both steps can be carried out in closed form over Gaussian beliefs; that special case is the Kalman filter.
Prediction : Kalman Filter Algorithm

$K_t$: the Kalman gain, which trades off the covariance of the state estimate against the covariance of the measurement noise:
• $K_t \approx 1$ when the measurement-noise covariance is small relative to the state covariance (trust the measurement)
• $K_t \approx 0$ when the measurement-noise covariance is large (trust the prediction), e.g., while passing through a tunnel where the measurement degrades
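A minimal scalar sketch of the predict/update cycle (a constant-state model with A = C = 1; variable names are illustrative), which also shows how the gain behaves at the two extremes above:

```python
import numpy as np

def kalman_1d(zs, R=0.1, Q=1.0, x0=0.0, p0=1.0):
    """Minimal scalar Kalman filter (A = C = 1, no control input).

    zs : sequence of measurements
    R  : process-noise variance,  Q : measurement-noise variance
    """
    x, p = x0, p0
    estimates = []
    for z in zs:
        # Predict: propagate the state and inflate its covariance by the process noise.
        x_pred, p_pred = x, p + R
        # Update: the gain weighs the predicted covariance against the measurement noise.
        K = p_pred / (p_pred + Q)     # K -> 1 if Q << p_pred, K -> 0 if Q >> p_pred
        x = x_pred + K * (z - x_pred)
        p = (1.0 - K) * p_pred
        estimates.append(x)
    return np.array(estimates)
```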
Prediction : Implementation of the Kalman Filter

GPS-aided IMU:
- A gyro has drift, bias, and alignment error
- GPS, vision, or kinematics can compensate for these inherent problems
“ASSESSMENT OF INTEGRATED GPS/INS FOR THE EX-171 EXTENDED
RANGE GUIDED MUNITION”, AIAA-98-4416
Prediction : Implementation of the Kalman Filter
• Equation of the error dynamics (the state)
• Measurement model (the output)
Prediction : Implementation of the Kalman Filter

[Block diagram: the system, driven by input u and disturbed by process noise w and measurement noise v, produces output y; the Kalman filter fuses y with complementary data (GPS, kinematics, vision, etc.) to produce the state and output estimates.]
LQG (Linear Quadratic Gaussian) controller = LQR (Linear Quadratic Regulator) + Kalman Filter
Optimization : Bayesian Optimization

An optimal data-sampling strategy!

Bayesian Optimization = Surrogate Model (Gaussian Process) + Parameter Exploration or Exploitation (Acquisition Function)

Automatic gain tuning based on Gaussian Process global optimization (= Bayesian Optimization)
Optimization : Bayesian Optimization

A short introduction to Gaussian Processes: regression, classification & optimization
Why GPs?
- They provide closed-form predictions!
- They are effective for small-data problems
- And they are explainable!
How do we deal with many parameters and little data?
1. Regularization
   e.g., smoothing, an L1 penalty, dropout in neural nets, a large K for K-nearest neighbors
2. Standard Bayesian approach
   - specify the probability of the data given the weights, P(D|W)
   - specify a prior over the weights given a hyper-parameter α, P(W|α)
   - find the posterior over the weights given the data, P(W|D, α)
   With little data, a strong weight prior constrains the inference.
3. Gaussian processes
   place a prior over functions, p(f), directly rather than over model parameters, p(w) (formalized in the sketch below)
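A minimal sketch of what a prior over functions means formally; the squared-exponential kernel here is just one common, illustrative choice:

$$f(x) \sim \mathcal{GP}\bigl(m(x),\, k(x, x')\bigr),
\qquad
k(x, x') = \sigma_f^{2}\,\exp\!\left(-\frac{(x - x')^{2}}{2\ell^{2}}\right)$$

For any finite set of inputs $x_1, \dots, x_n$, the function values are then jointly Gaussian: $\bigl(f(x_1), \dots, f(x_n)\bigr) \sim \mathcal{N}\bigl(m(X),\, K(X, X)\bigr)$.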
Functions: the relationship between input and output.
A distribution of functions over the range of the input X and the output f
→ a prior over functions, with no constraints yet.
[Figure: sample functions drawn from the GP prior over f(x).]
• A GP specifies a prior over functions, f(x)
• Suppose we have a set of observations: D = {(x1, y1), (x2, y2), (x3, y3), …, (xn, yn)}

Standard Bayesian approach:
• p(f|D) ∝ p(D|f) p(f)

One view of Bayesian inference:
• generate samples from the prior
• discard all samples inconsistent with our data, leaving the samples of interest (the posterior)
• The Gaussian process allows us to do this analytically (the predictive equations are given below).

Gaussian Process Approach
[Figure: GP prior samples and posterior samples after conditioning on the observations.]
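The analytic conditioning referred to above, written in standard GP-regression form (assuming additive observation noise with variance $\sigma_n^{2}$; the notation is not from the slides):

$$\mu_{*} \;=\; K(X_{*}, X)\,\bigl[K(X, X) + \sigma_n^{2} I\bigr]^{-1} y$$
$$\Sigma_{*} \;=\; K(X_{*}, X_{*}) \;-\; K(X_{*}, X)\,\bigl[K(X, X) + \sigma_n^{2} I\bigr]^{-1} K(X, X_{*})$$

These closed-form predictions are exactly the point made under "Why GPs?".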
Gaussian Process Approach
→ a Bayesian data-modeling technique that accounts for uncertainty
→ Bayesian kernel regression machines
Procedure to sample from a GP
1. Assume the input X and the function values f are jointly Gaussian distributed.
2. Compute the covariance matrix K for a given $X = x_1, \dots, x_n$.
3. Compute the SVD or Cholesky decomposition of K to get orthogonal basis functions: $K = A S B^{T} = L L^{T}$.
4. Compute a sample function $f_i = A S^{1/2} u_i$ or $f_i = L u_i$, where $u_i$ is a random vector with zero mean and unit variance, and L is the lower-triangular part of the Cholesky decomposition of K.
[Figure: sampled functions from the prior and from the posterior; a code sketch of this procedure follows.]
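A minimal NumPy sketch of this sampling procedure (the kernel and the input grid are illustrative choices, not from the slides):

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, sigma_f=1.0):
    # Squared-exponential covariance between two sets of 1-D inputs.
    d = x1[:, None] - x2[None, :]
    return sigma_f**2 * np.exp(-0.5 * (d / length_scale) ** 2)

def sample_gp_prior(x, n_samples=5, jitter=1e-8):
    """Draw sample functions from a zero-mean GP prior at the inputs x.

    Implements the procedure above: build K, take its Cholesky factor L,
    and map unit-variance Gaussian vectors u through f = L u.
    """
    K = rbf_kernel(x, x)
    L = np.linalg.cholesky(K + jitter * np.eye(len(x)))  # jitter for numerical stability
    u = np.random.randn(len(x), n_samples)               # zero mean, unit variance
    return L @ u                                          # each column is one sampled function

# Example: five prior samples on a grid
x = np.linspace(-5.0, 5.0, 100)
f_samples = sample_gp_prior(x)
```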
A simple PD control example:
$$J = \theta_1\,\bigl(r(t) - y(t)\bigr) + \theta_2\,y(t)$$
What are the globally optimal gains θ that give the minimum cost J?
A simple PD control example: the procedure of Bayesian Optimization
1. Start from the GP prior, before observing any data.
2. Form the GP posterior after (here) five noisy evaluations.
3. Choose the next parameters θnext at the maximum of the acquisition function.
Repeat until you find a globally optimal θ (a code sketch of this loop follows).
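A minimal sketch of this loop for a one-dimensional gain, using scikit-learn's GP regressor as the surrogate and expected improvement as a stand-in acquisition function (the slides use entropy search, introduced next; the cost function, bounds, and grid are illustrative):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def expected_improvement(mu, sigma, best_j):
    # EI for minimization: expected improvement over the best cost seen so far.
    sigma = np.maximum(sigma, 1e-12)
    z = (best_j - mu) / sigma
    return (best_j - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def bayes_opt(cost, bounds=(0.0, 10.0), n_init=5, n_iters=20):
    """cost: a function theta -> noisy evaluation of J(theta)."""
    thetas = np.random.uniform(*bounds, size=n_init)    # a few initial noisy evaluations
    js = np.array([cost(t) for t in thetas])
    grid = np.linspace(*bounds, 500)
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
    for _ in range(n_iters):
        gp.fit(thetas.reshape(-1, 1), js)               # GP posterior = surrogate model
        mu, sigma = gp.predict(grid.reshape(-1, 1), return_std=True)
        theta_next = grid[np.argmax(expected_improvement(mu, sigma, js.min()))]
        thetas = np.append(thetas, theta_next)          # evaluate the chosen parameters
        js = np.append(js, cost(theta_next))
    return thetas[np.argmin(js)]                        # best gain found
```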
Acquisition Function and Entropy Search

Information gain I: the mutual information for an observed data point,
$$I \;=\; H\bigl(x^{*} \mid D_t\bigr) \;-\; H\bigl(x^{*} \mid D_t \cup \{x, y\}\bigr)$$
→ a reduction of the uncertainty about the location of the optimum $x^{*}$, obtained by selecting points $(x, y)$ that are expected to cause the largest reduction in the entropy of the distribution $H\bigl(x^{*} \mid D_t\bigr)$.