1. Large-scale Inverse Problems
Tania Bakhos, Peter Kitanidis
Institute for Computational Mathematical Engineering, Stanford University
Arvind K. Saibaba
Department of Electrical and Computer Engineering, Tufts University
June 28, 2015
Bakhos, Kitanidis, Saibaba Large-Scale Inverse Problems June 28, 2015 1 / 114
2. Outline
1 Introduction
2 Linear Inverse Problems
3 Geostatistical Approach
Bayes’ theorem
Coin toss example
Covariance modeling
Non-Gaussian priors
4 Data Assimilation
Application: CO2 monitoring
5 Uncertainty quantification
MCMC
6 Concluding remarks
5. What is an Inverse Problem?
Forward problem: Parameters s → Model h(s) → Data y → Quantities of Interest
Inverse problem: from the Data y, recover the Parameters s
6. Inverse problems: Applications
Inverse problems in the geosciences:
CO2 monitoring in the subsurface
Contaminant source identification
Climate change
Hydraulic tomography
7. Inverse problems: Applications
Inverse problems in other fields:
Medical imaging
Non-destructive testing
Neuroscience
Image deblurring
13. Transient Hydraulic Tomography
Results from a field experiment conducted at the Boise Hydrogeophysical Research Site (BHRS)².
Figure 1: Hydraulic head measurements at observation wells (left) and log10 estimate of the hydraulic conductivity (right)
²Cardiff, Barrash and Kitanidis - Water Resources Research 47(12), 2011.
14. CSEM: Oil Exploration
Source: Morten et al., 72nd EAGE Conference 2010, Barcelona; and Newman et al., Geophysics 72(2), 2010.
15. Monitoring CO2 emissions
Atmospheric transport model
Observations from monitoring stations, satellite observations, etc.
Source: Anna Michalak's plenary talk
https://www.pathlms.com/siam/courses/1043/sections/1257
16. Application: Global Seismic Inversion
Bui-Thanh, Tan, et al. SISC 35.6 (2013): A2494-A2523.
19. Need for Uncertainty Quantification
"Uncertainty quantification (UQ) is the science of quantitative characterization and reduction of uncertainties in applications. It tries to determine how likely certain outcomes are if some aspects of the system are not exactly known." - Wikipedia
"... how do we quantify uncertainties in the predictions of our large-scale simulations, given limitations in observational data, computational resources, and our understanding of physical processes?"⁶
"Well, what I'm saying is that there are known knowns and that there are known unknowns. But there are also unknown unknowns; things we don't know that we don't know." - Gin Rummy, paraphrasing D. Rumsfeld
⁶Bui et al. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society Press, 2012
21. Statistical framework for inverse problems
Estimate model parameters (and uncertainties) from data.
Propagate uncertainties forward to predict quantities of interest and their uncertainties.
Optimal experiment design: what experimental conditions yield the most information?
Challenge: this framework is often intractable because the problem is
mathematically ill-posed (sensitive to noise),
computationally challenging,
and the data provide insufficient information.
23. Opportunities and challenges
Central question in our research
How can we exploit structure to overcome the curse of dimensionality and develop scalable algorithms for statistical inverse problems?
What do we mean by scalable? Scalable in the
amount of data,
discretization of the unknown random field,
number of processors.
24. Sessions at SIAM Geosciences
Plenary talks
IP1 The Seismic Inverse Problem: Towards Wave Equation Based Velocity Estimation
Fons ten Kroode, Shell Research, The Netherlands
McCaw Hall, 8:30-9:15 AM (Monday)
Contributed Talks
CP 3: Inverse Modeling
4:30 PM - 6:30 PM, Monday June 29th, Room: Fisher Conference Center room #5
25. Minisymposia at SIAM Geosciences
MS 54 Recent Advances in Geophysical Inverse Problems
Tania Bakhos, Peter Kitanidis, Arvind Saibaba
9:30 AM - 11:30 AM, Thursday July 2, Room: Bechtel Conference Center - Main Hall
MS 12 Bayesian Methods for Large-scale Geophysical Inverse Problems
Omar Ghattas, Noemi Petra, Georg Stadler
2:00 PM - 4:00 PM, Monday June 29, Room: Fisher Conference Center room #4
MS 2, MS 9, MS 15 Full-waveform Inversion
William Symes, Hughes Djikpesse
9:30 AM - 11:30 AM, 2:00 - 4:00 PM and 4:30 - 6:30 PM, Room: Fisher Conference Center room #1
MS 19 Full Waveform Inversion
MS 36 3D Elastic Waveform Inversion: Challenges in Modeling and Inversion
MS 58 Forward and Inverse Problems in Geodesy, Geodynamics, and Geomagnetism
MS 46 Data Assimilation in Subsurface Applications: Advances in Model Uncertainty Quantification
27. Introduction
What is an inverse problem?
Forward problem: Compute the output given a system and an input.
Inverse problem: Compute either the input or the system given the output.
Hansen, PC. Discrete inverse problems: insight and algorithms. Vol. 7. SIAM, 2010
28. Example
Figure 2: Magnetization inside the Mt. Vesuvius volcano, from measurements of the magnetic field
Hansen, Per Christian. Discrete inverse problems: insight and algorithms. Vol. 7. SIAM, 2010
29. Challenges
Inverse problems are ill-posed: they do not satisfy the three conditions for well-posedness.
Existence: the problem must have at least one solution.
Uniqueness: the problem must have only one solution.
Stability: the solution must depend continuously on the data.
The mathematical term "well-posed problem" stems from a definition given by Jacques Hadamard.
30. Image processing
Consider the equation
b = A x + e
Notation:
b: observations - the blurry image (given)
x: true image, which we want to estimate
A: blurring operator (given)
e: noise in the data
Forward problem: given the true image x and the blurring matrix A, we get the blurred image b.
What is the inverse problem? The opposite of the forward problem: given b and A, we compute x (the true image).
35. Review of basic linear algebra
A square real matrix U ∈ R^{n×n} is orthogonal if its inverse equals its transpose, i.e. U U^T = I and U^T U = I.
A real symmetric matrix A = A^T has a spectral decomposition A = U Λ U^T, where U is orthogonal and Λ = diag(λ_1, ..., λ_n) is a diagonal matrix whose entries are the eigenvalues of A.
A real square matrix that is not symmetric can be diagonalized by two orthogonal matrices via the singular value decomposition (SVD), A = U Σ V^T, where Σ is a diagonal matrix whose entries are the singular values of A.
36. Need for regularization
Perturbation theory
Would like to solve:   A x = b                 (1)
Instead solving:       A (x + δx) = b + e      (2)
Subtracting equation (1) from equation (2):
A δx = e  ⇒  δx = A^{-1} e
Can show the following bounds:
‖δx‖_2 ≤ ‖A^{-1}‖_2 ‖e‖_2    and    ‖b‖_2 ≤ ‖A‖_2 ‖x‖_2
Important result:
‖δx‖_2 / ‖x‖_2 ≤ ‖A‖_2 ‖A^{-1}‖_2 · ‖e‖_2 / ‖b‖_2 = cond(A) · ‖e‖_2 / ‖b‖_2
The more ill-conditioned the blurring operator A is, the worse the reconstruction.
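The bound can be checked numerically. A minimal sketch (toy 2 × 2 matrix; all values made up for illustration):

```python
import numpy as np

# Nearly singular A: a tiny perturbation e of the data b produces an O(1)
# change dx in the solution, within the bound ||dx||/||x|| <= cond(A) ||e||/||b||.
A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
x = np.array([1.0, 1.0])
b = A @ x

e = 1e-4 * np.array([1.0, -1.0])   # noise in the data
dx = np.linalg.solve(A, e)         # A dx = e  =>  dx = A^{-1} e

rel_err = np.linalg.norm(dx) / np.linalg.norm(x)
bound = np.linalg.cond(A) * np.linalg.norm(e) / np.linalg.norm(b)
print(rel_err <= bound)            # True
```

Here ‖e‖_2/‖b‖_2 is about 5·10⁻⁵, yet the relative error in x is of order one: the amplification is governed by cond(A).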
37. TSVD
Regularization controls the amplification of noise.
Truncated SVD: discard all the singular values that are smaller than a chosen number.
The naive solution is given by
x = A^{-1} b = V Σ^{-1} U^T b = Σ_{i=1}^{N} (u_i^T b / σ_i) v_i
For TSVD we truncate the singular values, so the solution is given by
x_k = Σ_{i=1}^{k} (u_i^T b / σ_i) v_i,    k < N
This yields the same solution as imposing a minimum 2-norm constraint on the least squares problem min_x ‖A x − b‖_2.
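A minimal TSVD sketch (toy 1-D Gaussian blurring matrix, noise level, and truncation index all assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
t = np.arange(n)
A = np.exp(-0.5 * ((t[:, None] - t[None, :]) / 2.0) ** 2)  # blurring operator
x_true = np.zeros(n)
x_true[20:30] = 1.0                                        # "image" to recover
b = A @ x_true + 1e-3 * rng.standard_normal(n)             # noisy blurred data

U, s, Vt = np.linalg.svd(A)

def tsvd_solve(k):
    """Keep only the k largest singular values."""
    return Vt[:k].T @ ((U[:, :k].T @ b) / s[:k])

x_naive = tsvd_solve(n)   # naive inverse: small singular values amplify noise
x_reg = tsvd_solve(10)    # truncated: noise suppressed
```

Here the naive inverse is dominated by amplified noise while the truncated solution stays close to x_true; the choice k = 10 is arbitrary and would be picked by a parameter-choice rule in practice.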
40. TSVD
Figure 3 : Exact image (top left), TSVD k = 658 (top right), k = 218 (bottom left) and
k = 7243 (bottom right)
658 was too low (over-smoothed) and 7243 too high (under-smoothed).
Hansen, PC. Discrete inverse problems: insight and algorithms. Vol. 7. SIAM, 2010
41. Selective SVD
A variant of the TSVD is the SSVD, where we only include components that contribute significantly to the regularized solution. Given a threshold τ,
x = Σ_{|u_i^T b| > τ} (u_i^T b / σ_i) v_i
This method is advantageous when some of the components u_i^T b corresponding to large singular values are small.
42. Tikhonov regularization
Least squares objective function:
x̂ = arg min_x ‖A x − b‖_2^2 + α^2 ‖x‖_2^2
where α is a regularization parameter.
The first term ‖A x − b‖_2^2 measures how well the solution predicts the noisy data, sometimes referred to as "goodness-of-fit".
The second term ‖x‖_2^2 measures the regularity of the solution.
The balance between the two terms is controlled by the parameter α.
43. Relation between Tikhonov and TSVD
The solution to the Tikhonov problem is given by
x_α = (A^T A + α^2 I)^{-1} A^T b
If we replace A by its SVD,
x_α = (V Σ^2 V^T + α^2 V V^T)^{-1} V Σ U^T b
    = V (Σ^2 + α^2 I)^{-1} Σ U^T b
    = Σ_{i=1}^{n} φ_i^α (u_i^T b / σ_i) v_i
where φ_i^α = σ_i^2 / (σ_i^2 + α^2) are called filter factors.
Note:
φ_i^α ≈ 1 if σ_i ≫ α,    φ_i^α ≈ σ_i^2 / α^2 if σ_i ≪ α
φ_i^TSVD = 1 if i ≤ k,    φ_i^TSVD = 0 if i > k
44. Relation between Tikhonov and TSVD
For each k in TSVD there exists an α such that the solution to the Tikhonov
problem and the solution based on TSVD are approximately equal.
Hansen, PC. Discrete inverse problems: insight and algorithms. Vol. 7. SIAM, 2010
45. Choice of parameter α
How do we choose the optimal α?
The L-curve is a log-log plot of the norm of the regularized solution versus the residual norm. The best parameter lies at the corner of the L (the point of maximum curvature).
Hansen, PC. Discrete inverse problems: insight and algorithms. Vol. 7. SIAM, 2010
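A sketch of the L-curve scan (toy random system; in practice the points are plotted on a log-log scale and the corner is located):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((40, 40))
b = rng.standard_normal(40)

# One (residual norm, solution norm) point per candidate alpha.
for alpha in [1e-3, 1e-2, 1e-1, 1.0]:
    x = np.linalg.solve(A.T @ A + alpha**2 * np.eye(40), A.T @ b)
    print(alpha, np.linalg.norm(A @ x - b), np.linalg.norm(x))
# As alpha grows, the residual norm rises and the solution norm falls.
```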
46. General form of Tikhonov regularization
The Tikhonov formulation can be generalized to
min_x ‖A x − b‖_2^2 + α^2 ‖L x‖_2^2
where L is a discrete smoothing operator. Common choices are the discrete first and second derivative operators.
47. Comparison of regularization methods
Figure 4: The original image (top left) and blurred image (top right). Tikhonov regularization (bottom left) and TSVD (bottom right).
http://www2.compute.dtu.dk/~pcha/DIP/chap8.pdf
48. Summary
Regularization suppresses components from noise and enforces regularity on the
computed solution.
Figure 5 : Illustration of why regularization is needed
49. Geophysical model problem
Unknown mass with density f(t) located at depth d below the surface.
No mass outside the source.
We measure the vertical component of the gravity field, g(s).
Figure 6: Gravity surveying example problem.
Hansen, PC. Discrete inverse problems: insight and algorithms. Vol. 7. SIAM, 2010
50. Geophysical model problem
The magnitude of the gravity field at s due to the mass element f(t) dt is
f(t) dt / (d^2 + (s − t)^2)
and its direction is from the point at s to the point at t. The vertical component is
dg = (sin θ / r^2) f(t) dt,    r^2 = d^2 + (s − t)^2
Using sin θ = d/r and integrating, we get the forward problem:
g(s) = ∫_0^1 d / (d^2 + (s − t)^2)^{3/2} f(t) dt
51. Geophysical model problem
Swapping the roles of the known and unknown quantities in the forward problem, we get the inverse problem:
∫_0^1 K(s, t) f(t) dt = g(s),    K(s, t) = d / (d^2 + (s − t)^2)^{3/2}
where f(t) is the quantity we wish to estimate, given measurements of g(s).
54. Large-scale inverse problems
The SVD is infeasible for large-scale problems: O(N^3) cost.
Apply iterative methods to the linear system
(A^T A + α^2 I) x(α) = A^T b
Generate a sequence of vectors (Krylov subspace):
K_k(A^T A, A^T b) := span{ A^T b, (A^T A) A^T b, ..., (A^T A)^{k−1} A^T b }
Lanczos bidiagonalization (LBD):
A V_k = U_k B_k
A^T U_k = V_k B_k^T + β_k v_{k+1} e_k^T
with U_k^T U_k = I and V_k^T V_k = I, and B_k the k × k lower bidiagonal matrix with diagonal entries α_1, ..., α_k and subdiagonal entries β_1, ..., β_{k−1}.
The singular values of B_k converge to the singular values of A (typically the largest ones converge first).
55. Large-scale iterative solvers
CGLS
The LBD can be rewritten as
(A^T A + α^2 I) V_k = V_k (B_k B_k^T + α^2 I)
Find x_k = V_k y_k such that
y_k = (B_k B_k^T + α^2 I)^{-1} ‖b‖_2 e_1
obtained by a Galerkin projection on the residual.
LSQR
Find x_k = V_k y_k by solving a k × k system of equations:
y_k = arg min_y ‖ [B_k; β_k e_k^T] y − ‖b‖_2 e_1 ‖_2^2 + α^2 ‖y‖_2^2
Solve a small regularized least squares problem at each step.
Additionally, the regularization parameter α can be estimated at each iteration.
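The matrix-free flavor of these solvers can be illustrated with a bare-bones conjugate-gradient iteration on the regularized normal equations (a sketch that touches A only through matvecs; this is not the CGLS/LSQR recurrences above, and the sizes are made up):

```python
import numpy as np

def cg_normal_eqs(A, b, alpha, iters):
    """CG on (A^T A + alpha^2 I) x = A^T b, using only matvecs with A, A^T."""
    x = np.zeros(A.shape[1])
    g = A.T @ b                      # residual of the normal equations at x = 0
    p = g.copy()
    gg = g @ g
    for _ in range(iters):
        Ap = A @ p
        step = gg / (Ap @ Ap + alpha**2 * (p @ p))
        x += step * p
        g -= step * (A.T @ Ap + alpha**2 * p)
        gg_new = g @ g
        p = g + (gg_new / gg) * p
        gg = gg_new
    return x

rng = np.random.default_rng(5)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)
x = cg_normal_eqs(A, b, alpha=0.1, iters=30)
```

The result agrees with a direct solve of (A^T A + α^2 I) x = A^T b, but no matrix beyond A itself is ever formed.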
56. Semi-convergence behavior
Standard convergence criteria for iterative solvers, based on the residual, do not work well for inverse problems because the measurements are corrupted by noise. We need different stopping criteria / regularization methods.
From http://www.math.vt.edu/people/jmchung/resources/CSGF07.pdf
65. Bayesian analysis: Uniform prior
Bayes' rule:
p(π | x_1, ..., x_{n+1}) = p(x_1, ..., x_{n+1} | π) p(π) / p(x_1, ..., x_{n+1})
Applying Bayes' rule:
p(π | x_1, ..., x_{n+1}) ∝ π^{Σ x_k} (1 − π)^{n+1−Σ x_k} · I_{0<π<1}
Summary of the distribution:
Conditional mean: (n/(n+2)) (Σ x_k / n) + 1/(n+2)
Maximum: Σ x_k / n
We can approximate the distribution by a Gaussian (Laplace's approximation):
p(π | x_1, ..., x_{n+1}) ≈ N(μ, σ^2),    μ = Σ x_k / n,    σ^2 = μ(1 − μ)/n
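A quick numerical check of the summary above, with a hypothetical n = 10 tosses and k = 7 heads: under the uniform prior the posterior is Beta(k+1, n−k+1), whose mean (k+1)/(n+2) matches the weighted form (n/(n+2))(k/n) + 1/(n+2).

```python
# Riemann-sum check of the posterior mean under the uniform prior (toy values).
n, k = 10, 7
N = 200000
grid = [i / N for i in range(1, N)]
w = [p**k * (1 - p)**(n - k) for p in grid]          # unnormalized posterior
mean = sum(p * wi for p, wi in zip(grid, w)) / sum(w)

closed_form = (n / (n + 2)) * (k / n) + 1 / (n + 2)  # = (k + 1) / (n + 2)
print(abs(mean - closed_form) < 1e-4)                # True
```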
66. Prior: Beta distribution
π follows a Beta(α, β) distribution:
p(π) ∝ π^{α−1} (1 − π)^{β−1}
The Beta distribution is analytically tractable; it is an example of a conjugate prior.
69. Bayesian Analysis: Beta prior
Applying Bayes' rule:
p(π | x_1, ..., x_{n+1}) ∝ π^{Σ x_k} (1 − π)^{n+1−Σ x_k} · π^{α−1} (1 − π)^{β−1}
                         = π^{Σ x_k + α − 1} (1 − π)^{n+1−Σ x_k + β − 1}
Conditional mean:
E[π | x_1, ..., x_{n+1}] = ∫_0^1 π p(π | x_1, ..., x_{n+1}) dπ
= (n/(n+α+β)) (Σ x_k / n) + ((α+β)/(n+α+β)) (α/(α+β))
a weighted average of the sample mean and the prior mean. Observe that this gives the right limit as n → ∞.
70. Inverse problems: Bayesian viewpoint
Consider the measurement equation
y = h(s) + v,    v ∼ N(0, Γ_noise)
Notation:
y: observations or measurements (given)
s: model parameters, which we want to estimate
h(s): parameter-to-observation map (given)
v: additive i.i.d. Gaussian noise
Using Bayes' rule, the posterior pdf is
p(s | y) ∝ p(y | s) · p(s)
where p(y | s) is the data misfit ("how well the model reproduces the data") and p(s) is the prior ("prior knowledge of the unknown field": smoothness, sparsity, etc.).
73. Geostatistical approach
Let s(x) be the parameter field we wish to recover:
s(x) = Σ_{k=1}^{p} f_k(x) β_k  +  ε(x)
       (deterministic term)      (random term)
Possible choices for f_k(x):
Low-order polynomials: f_1 = 1, f_2 = x, f_3 = x^2, etc.
Zonation model: f_k is nonzero only in certain regions.
Several possible choices for ε(x):
We will assume Gaussian random fields.
We revisit this assumption later in this talk.
75. Gaussian Random Fields
GRFs are multidimensional generalizations of Gaussian processes.
Definition
A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
A Gaussian process is completely specified by its mean function and covariance function:
μ(x) := E[f(x)]
κ(x, y) := E[(f(x) − μ(x))(f(y) − μ(y))]
The GP is denoted as f(x) ∼ N(μ(x), κ(x, y)).
Figure 9: Samples from Gaussian random fields
79. Geostatistical approach
Model priors as Gaussian random fields:
s | β ∼ N(Xβ, Γ_prior),    p(β) ∝ 1
Posterior distribution
Applying Bayes' theorem:
p(s, β | y) ∝ p(y | s, β) p(s | β) p(β)
            ∝ exp( −(1/2) ‖y − h(s)‖^2_{Γ_noise^{-1}} − (1/2) ‖s − Xβ‖^2_{Γ_prior^{-1}} )
Maximum a posteriori (MAP) estimate:
(ŝ, β̂) = arg min_{s,β} − log p(s, β | y)
        = arg min_{s,β} (1/2) ‖y − h(s)‖^2_{Γ_noise^{-1}} + (1/2) ‖s − Xβ‖^2_{Γ_prior^{-1}}
82. MAP Estimate - Linear Inverse Problems
Maximum a posteriori (MAP) estimate, for h(s) = Hs:
(ŝ, β̂) = arg min_{s,β} (1/2) ‖y − Hs‖^2_{Γ_noise^{-1}} + (1/2) ‖s − Xβ‖^2_{Γ_prior^{-1}}
Obtained by solving the system of equations
[ H Γ_prior H^T + Γ_noise   HX ] [ ξ̂ ]   [ y ]
[ (HX)^T                    0  ] [ β̂ ] = [ 0 ]
ŝ = X β̂ + Γ_prior H^T ξ̂
Solved using a matrix-free Krylov solver:
requires fast ways to compute Hx and Γ_prior x,
with a preconditioner²¹ using a low-rank representation of Γ_prior.
²¹Preconditioned iterative solver developed in Saibaba and Kitanidis, WRR 2012.
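At toy size the saddle-point system above can simply be assembled and solved densely (H, X, the covariances, and the data are all made up here; the solver cited in the talk is matrix-free and preconditioned):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, p = 5, 12, 2                      # data points, unknowns, drift terms
H = rng.standard_normal((m, n))         # linear forward operator
X = rng.standard_normal((n, p))         # drift (trend) basis
G_prior = np.eye(n)                     # stand-in prior covariance
G_noise = 0.01 * np.eye(m)              # stand-in noise covariance
y = rng.standard_normal(m)

M = np.block([[H @ G_prior @ H.T + G_noise, H @ X],
              [(H @ X).T, np.zeros((p, p))]])
rhs = np.concatenate([y, np.zeros(p)])
sol = np.linalg.solve(M, rhs)
xi, beta = sol[:m], sol[m:]
s_hat = X @ beta + G_prior @ H.T @ xi   # MAP estimate
```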
83. Interpolation using Gaussian Processes²²
The posterior is Gaussian with
μ_post(x*) = κ(x*, x) (κ(x, x) + σ^2 I)^{-1} y(x)
cov_post(x*, x*) = κ(x*, x*) − κ(x*, x) (κ(x, x) + σ^2 I)^{-1} κ(x, x*)
²²Gaussian Processes for Machine Learning, Rasmussen and Williams
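These formulas drop straight into code. A 1-D sketch with a squared-exponential kernel and made-up data points:

```python
import numpy as np

def kernel(X, Y):
    # squared-exponential covariance, unit variance and length scale (assumed)
    return np.exp(-0.5 * (X[:, None] - Y[None, :])**2)

x = np.array([0.0, 1.0, 2.0, 3.0])     # training locations
y = np.sin(x)                           # noise-free toy observations
sigma2 = 1e-4                           # small nugget / noise variance

xs = np.array([1.5])                    # prediction point
K = kernel(x, x) + sigma2 * np.eye(len(x))
Ks = kernel(xs, x)
mu = Ks @ np.linalg.solve(K, y)                            # posterior mean
var = kernel(xs, xs) - Ks @ np.linalg.solve(K, Ks.T)       # posterior variance
print(mu[0])  # roughly sin(1.5)
```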
84. Application: CO2 monitoring
Challenge: real-time monitoring of CO2 concentration.
Time series of noisy seismic traveltime tomography data.
288 measurements and 234 × 217 unknowns.
A.K. Saibaba, Ambikasaran, Li, Darve, Kitanidis, Oil and Gas Science and Technology 67.5 (2012): 857.
85. Matérn covariance family
The Matérn class of covariance kernels:
κ(x, y) = ((α r)^ν / (2^{ν−1} Γ(ν))) K_ν(α r),    α > 0, ν > 0
where r = ‖x − y‖_2 is the radial distance between the points x and y, and K_ν is the modified Bessel function of the second kind.
Examples: exponential kernel (ν = 1/2), Gaussian kernel (ν → ∞).
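For ν = 1/2 the family reduces to the exponential kernel exp(−αr). A quick check (toy 1-D points; all values assumed) that the resulting covariance matrix is a valid one, i.e. symmetric positive definite:

```python
import numpy as np

alpha = 2.0
x = np.linspace(0.0, 1.0, 20)
r = np.abs(x[:, None] - x[None, :])   # pairwise distances
K = np.exp(-alpha * r)                # Matern kernel with nu = 1/2

assert np.allclose(K, K.T)            # symmetric
print(np.linalg.eigvalsh(K).min())    # smallest eigenvalue is positive
```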
89. Fast covariance evaluations
Consider the Gaussian priors s | β ∼ N(Xβ, Γ_prior).
Covariance matrices are dense - expensive to store and compute. For example, a dense 10^6 × 10^6 matrix costs 7.45 TB.
Typically, we only need to evaluate Γ_prior x and Γ_prior^{-1} x.
91. Fast covariance evaluations
Consider the Gaussian priors s | β ∼ N(Xβ, Γ_prior).
Standard approaches:
FFT based methods
Fast Multipole Method
Hierarchical matrices
Kronecker tensor product approximations
Compared to the naive O(N^2):
Storage cost: O(N log^α N)    Matvec cost: O(N log^β N)
92. Toeplitz Matrices
A Toeplitz matrix T is an N × N matrix with entries T_ij = t_{i−j}, i.e. a matrix of the form
    [ t_0      t_{-1}  t_{-2}  ...  t_{-(N-1)} ]
    [ t_1      t_0     t_{-1}       ...        ]
T = [ t_2      t_1     t_0          ...        ]
    [ ...                           ...        ]
    [ t_{N-1}  ...                  t_0        ]
Suppose the points are x_i = i h and y_j = j h for i, j = 1, ..., N. Then:
Stationary kernels: Q_ij = κ(x_i, y_j) = κ((i − j) h)
Translation-invariant kernels: Q_ij = κ(x_i, y_j) = κ(|i − j| h)
Need to store only O(N) entries, compared to O(N^2) entries.
93. FFT based methods
Toeplitz matrices arise from stationary covariance kernels on regular grids.
Periodic embedding:
[ c b a ]        [ c b a a b ]
[ b c b ]  =>    [ b c b a a ]
[ a b c ]        [ a b c b a ]
                 [ a a b c b ]
                 [ b a a b c ]
The circulant embedding is diagonalizable by the Fourier basis.
Matrix-vector products for Toeplitz matrices: O(N log N).
Restricted to regular, equispaced grids.
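The O(N log N) matvec works by embedding the Toeplitz matrix in a circulant and applying the FFT. A sketch with an assumed exponential covariance on a regular 1-D grid:

```python
import numpy as np

n = 64
t = np.exp(-0.1 * np.arange(n))        # first column of a symmetric Toeplitz T
T = t[np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])]
x = np.random.default_rng(3).standard_normal(n)

# Embed T in a circulant of size 2n - 2; its action is a circular convolution,
# which the FFT diagonalizes.
c = np.concatenate([t, t[-2:0:-1]])
y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x, len(c))).real[:n]

print(np.allclose(y, T @ x))  # True
```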
94. H-matrix formulation: An Intuitive Explanation
Consider, for x_i, y_i = (i − 1)/(N − 1), i = 1, ..., N,
κ_α(x, y) = 1 / (|x − y| + α),    α > 0
Figure 10: blockwise ranks; α = 10^{-6}, ε = 10^{-6}, N = M = 256
95. H-matrix formulation: An Intuitive Explanation
Consider, for x_i, y_i = (i − 1)/(N − 1), i = 1, ..., N,
κ(x, y) = exp(−|x − y|)
Figure 11: blockwise ranks; ε = 10^{-6}, N = M = 256
96. Exponentially decaying singular values of off-diagonal blocks
κ_α(x, y) = 1 / (|x − y| + α),    α > 0    (1)
Figure 12: First 32 singular values of off-diagonal sub-blocks of the matrix corresponding to the non-overlapping segments (left) [0, 0.5] × [0.5, 1] and (right) [0, 0.25] × [0.75, 1.0]
The decay of the singular values can be related to the smoothness of the kernel.
97. Prof. SVD - Gene Golub
Original photograph, followed by rank-10, rank-20, and rank-100 approximations.
105. Hierarchical-matrices²⁴
Hierarchical separation of space.
Low-rank sub-blocks for well-separated clusters.
Mild restrictions on the types of permissible kernels.
Levels 0 through 3: the matrix is recursively partitioned into full-rank blocks (near the diagonal) and low-rank blocks.
²⁴Hackbusch - 2000, Grasedyck and Hackbusch - 2003, Bebendorf - 2008
110. Quasi-linear geostatistical approach
Maximum a posteriori estimate:
arg min_{s,β} (1/2) ‖y − h(s)‖^2_{Γ_noise^{-1}} + (1/2) ‖s − Xβ‖^2_{Γ_prior^{-1}}
Algorithm 2: Quasi-linear geostatistical approach
1: while not converged do
2:   Solve the system of equations²⁶
       [ J_k Γ_prior J_k^T + Γ_noise   J_k X ] [ ξ_{k+1} ]   [ y − h(s_k) + J_k s_k ]
       [ (J_k X)^T                     0     ] [ β_{k+1} ] = [ 0                    ]
     where the Jacobian J_k = ∂h/∂s evaluated at s = s_k
3:   Update s_{k+1} = X β_{k+1} + Γ_prior J_k^T ξ_{k+1}
4: end while
²⁶Preconditioned iterative solver developed in Saibaba and Kitanidis, WRR 2012.
111. MAP Estimate - Quasi-linear Inverse Problems
At each step, linearize to get a local Gaussian approximation: the nonlinear problem is solved as a sequence of linear inverse problems.
116. Non-Gaussian priors
Gaussian random fields often produce smooth reconstructions.
Discontinuous reconstructions are often needed, e.g., for facies detection or tumor localization.
Several possibilities:
Total variation regularization
Level set approach
Markov random fields
Wavelet-based reconstructions
This only scratches the surface; many more techniques (and a large literature) are available.
117. Total variation regularization
Total variation in 1D:
TV(f) = \sup \sum_{k=1}^{n-1} |f(x_{k+1}) - f(x_k)|
where the supremum is taken over partitions x_1 < x_2 < \dots < x_n.
A measure of the arc length of a curve.
Animation: Wikipedia. Figure: Kaipio and Somersalo, Statistical and Computational Inverse Problems, Vol. 160, Springer Science & Business Media, 2006.
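On a fixed grid, the discrete 1D total variation is just the sum of absolute successive differences; a minimal sketch:

```python
# Discrete 1D total variation: TV(f) = sum_k |f(x_{k+1}) - f(x_k)|.
import numpy as np

def total_variation(f):
    """Total variation of sampled values f on an ordered grid."""
    return np.sum(np.abs(np.diff(f)))

x = np.linspace(0.0, 1.0, 101)
print(total_variation(np.sin(2 * np.pi * x)))  # smooth swing: TV ~ 4
step = np.where(x < 0.5, 0.0, 1.0)
print(total_variation(step))                   # single unit jump: TV = 1
```

Note that the smooth sinusoid and a staircase with the same total rise have comparable TV; this is why TV regularization tolerates jumps while still penalizing oscillation.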
118. Total Variation Regularization
MAP estimate (penalize discontinuous changes):
\min_s \frac{1}{2}\|y - h(s)\|^2_{\Gamma_{\text{noise}}^{-1}} + \alpha \int_\Omega |\nabla s|\, ds, \qquad |\nabla s| \approx \sqrt{\nabla s \cdot \nabla s + \varepsilon}
Figure 13: Inverse wave propagation problem. (left) Cross-sections of inverted and target models; (right) surface model of the target.
Akcelik, Biros and Ghattas, Supercomputing, ACM/IEEE 2002 Conference, IEEE, 2002.
119. Level Set approach
s(x) = c_f(x)\, H(\phi(x)) + c_b(x)\,(1 - H(\phi(x))), \qquad H(x) = \frac{1}{2}(1 + \operatorname{sign}(x))
Figure 14: Image courtesy of Wikipedia
Topologically flexible: able to recover multiple connected components.
Evolve the shape by minimizing an objective function.
120. Bayesian Level set approach
Level set function:
s(x) = c_f(x)\, H(\phi(x)) + c_b(x)\,(1 - H(\phi(x))), \qquad H(x) = \frac{1}{2}(1 + \operatorname{sign}(x))
Employ a Gaussian random field as the prior for \phi(x).
Groundwater flow:
-\nabla \cdot (\kappa \nabla u(x)) = f(x), \quad x \in \Omega; \qquad u = 0, \quad x \in \partial\Omega
Transformation s = \log \kappa.
Iglesias et al., preprint arXiv:1504.00313.
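The level-set parameterization above is straightforward to realize on a grid; the shape (a disc) and the contrast values below are illustrative.

```python
# Sketch of the level-set field s = c_f H(phi) + c_b (1 - H(phi)).
import numpy as np

H = lambda x: 0.5 * (1.0 + np.sign(x))        # Heaviside via sign

x, y = np.meshgrid(np.linspace(-1, 1, 64), np.linspace(-1, 1, 64))
phi = 0.5 - np.sqrt(x**2 + y**2)              # zero level set: circle r = 0.5
c_f, c_b = 2.0, 1.0                           # foreground/background values
s = c_f * H(phi) + c_b * (1.0 - H(phi))       # piecewise-constant field
```

Moving the zero contour of phi moves the discontinuity in s, which is why a Gaussian prior on phi yields a prior over sharp-interface fields.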
125. 4D Var Filtering
Consider the dynamical system
\frac{\partial v}{\partial t} = F(v) + \eta, \qquad v(x, 0) = v_0(x)
3D Var filtering:
J_3(v) := \|y(T) - h(v(x; T))\|^2_{\Gamma_{\text{noise}}^{-1}} + \frac{1}{2}\|v_0(x) - v_0^*(x)\|^2_{\Gamma_{\text{prior}}^{-1}}
4D Var filtering:
J_4(v) := \sum_{k=1}^{N_t} \|y(t_k) - h(v(x; t_k))\|^2_{\Gamma_{\text{noise}}^{-1}} + \frac{1}{2}\|v_0(x) - v_0^*(x)\|^2_{\Gamma_{\text{prior}}^{-1}}
Optimization problem:
\hat{v}_0 := \arg\min_{v_0} J_k(v), \quad k = 3, 4
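The 4D Var objective is evaluated by propagating the initial state through the model and accumulating weighted data misfits; a minimal sketch with illustrative names (the propagator, observation map, and weights are placeholders).

```python
# Toy evaluation of the 4D Var objective J_4: background penalty on the
# initial condition plus weighted misfits at each observation time.
import numpy as np

def j4(v0, v0_star, ys, propagate, h, G_noise_inv, G_prior_inv):
    """Evaluate J_4 for initial state v0; ys[k] is the data at step k+1."""
    cost = 0.5 * (v0 - v0_star) @ G_prior_inv @ (v0 - v0_star)
    v = v0
    for y in ys:
        v = propagate(v)                   # advance the state one step
        r = y - h(v)                       # data misfit at this time
        cost += r @ G_noise_inv @ r        # weighted-norm term
    return cost
```

Minimizing this over v0 (in practice with adjoint-computed gradients) gives the 4D Var estimate of the initial condition.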
126. Application: Contaminant source identification
Transport equations:
\frac{\partial c}{\partial t} + v \cdot \nabla c = D \nabla^2 c, \qquad D\nabla c \cdot n = 0, \qquad c(x, 0) = c_0(x)
Estimate the initial conditions from measurements of the contaminant field.
Akcelik, Volkan, et al. Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, IEEE Computer Society, 2005.
127. Linear Dynamical System
[Figure: hidden Markov model diagram. States s_{k-1}, s_k, s_{k+1} evolve through the transition operator F with system noise u_k; each state is observed through H_k with measurement noise v_k, producing data y_{k-1}, y_k, y_{k+1}.]
129. State Evolution equations
Linear evolution equations
s_{k+1} = F_k s_k + u_k, \quad u_k \sim N(0, \Gamma_{\text{prior}})
y_{k+1} = H_{k+1} s_{k+1} + v_k, \quad v_k \sim N(0, \Gamma_{\text{noise}})
obtained by discretizing a PDE.
Nonlinear evolution equations
s_{k+1} = f(s_k) + u_k, \quad u_k \sim N(0, \Gamma_{\text{prior}})
y_{k+1} = h(s_{k+1}) + v_k, \quad v_k \sim N(0, \Gamma_{\text{noise}})
can be linearized (extended Kalman filter) or handled as is (ensemble filtering).
130. Kalman Filter
[Figure: update-predict cycle. The current state N(\hat{s}_{k|k}, \Sigma_{k|k}) is advanced by the transition matrix F_k with system noise w_k \sim N(0, \Gamma_{\text{sys}}) and corrected through the observation operator H_k with measurement noise v_k \sim N(0, \Gamma_{\text{noise}}), yielding N(\hat{s}_{k+1|k+1}, \Sigma_{k+1|k+1}).]
All variables are modeled as Gaussian random variables,
completely specified by the mean and covariance matrix.
The Kalman filter provides a recursive way to update state knowledge and predictions.
131. Standard implementation of Kalman Filter
Predict:
\hat{s}_{k+1|k} = F_k \hat{s}_{k|k}
\Sigma_{k+1|k} = F_k \Sigma_{k|k} F_k^T + \Gamma_{\text{prior}} \qquad O(N^3)
Update:
S_k = H_k \Sigma_{k+1|k} H_k^T + \Gamma_{\text{noise}} \qquad O(n_m N^2)
K_k = \Sigma_{k+1|k} H_k^T S_k^{-1} \qquad O(n_m N^2)
\hat{s}_{k+1|k+1} = \hat{s}_{k+1|k} + K_k (y_k - H_k \hat{s}_{k+1|k}) \qquad O(n_m N)
\Sigma_{k+1|k+1} = \left(\Sigma_{k+1|k}^{-1} + H_k^T \Gamma_{\text{noise}}^{-1} H_k\right)^{-1} \qquad O(n_m N^2 + N^3)
N: number of unknowns; n_m: number of measurements.
Storage cost is O(N^2) and computational cost is O(N^3).
This cost is prohibitively expensive for large-scale implementations.
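The dense predict/update recursions can be sketched directly in a few lines of linear algebra; variable names here are illustrative, and the covariance update uses the (I - K H) form, which is algebraically equivalent to the information form in the recursion above.

```python
# Dense Kalman filter: one predict + update step.
import numpy as np

def kf_step(s, Sigma, y, F, H, G_sys, G_noise):
    """Advance the state mean s and covariance Sigma by one cycle."""
    # Predict
    s_pred = F @ s
    Sigma_pred = F @ Sigma @ F.T + G_sys             # O(N^3)
    # Update
    S = H @ Sigma_pred @ H.T + G_noise               # innovation covariance
    K = Sigma_pred @ H.T @ np.linalg.inv(S)          # Kalman gain
    s_new = s_pred + K @ (y - H @ s_pred)
    Sigma_new = (np.eye(len(s)) - K @ H) @ Sigma_pred
    return s_new, Sigma_new
```

The O(N^3) dense products and inverses here are exactly the terms that make the standard implementation infeasible at large N, motivating the hierarchical-matrix and low-rank variants discussed in this talk.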
134. Ensemble Kalman Filter
The EnKF is a Monte Carlo approximation of the Kalman filter.
Ensemble of state variables: X = [x_1, \dots, x_N].
The realizations in the ensemble are propagated individually:
can reuse legacy codes; easily parallelizable.
To update the filter, compute statistics based on the ensemble.
Unlike the Kalman filter, it can be readily applied to nonlinear problems.
The ensemble mean and covariance can be computed as
E[X] = \frac{1}{N}\sum_{k=1}^{N} x_k, \qquad C = \frac{1}{N-1} A A^T
where A is the mean-subtracted ensemble.
135. Application: Real-time CO2 monitoring
Sources fire a pulse; receivers measure the time delay.
Measurements: travel time of each source-receiver pair.
6 sources x 48 receivers = 288 measurements.
Assumption: rays travel along straight-line paths,
t_{sr} = \int_{\text{source}}^{\text{recv}} \underbrace{\frac{1}{v(x)}}_{\text{slowness}} \, d\ell + \text{noise}
Model problem for reflection seismology, CT scanning, etc.
137. Random Walk Forecast model
The evolution of CO2 can be modeled as
s_{k+1} = F_k s_k + u_k, \quad u_k \sim N(0, \Gamma_{\text{prior}})
y_{k+1} = H_{k+1} s_{k+1} + v_k, \quad v_k \sim N(0, \Gamma_{\text{noise}})
Random walk assumption: F_k = I.
A useful modeling assumption when measurements can be acquired rapidly.
Applications: electrical impedance tomography, electrical resistivity tomography, seismic travel-time tomography.
Treat \Gamma_{\text{prior}} using the hierarchical matrix approach.
A.K. Saibaba, E.L. Miller, P.K. Kitanidis, A fast Kalman filter for time-lapse electrical resistivity tomography, Proceedings of IGARSS 2014, Montreal.
138. Results: Kalman Filter
Figure 15 : True and estimated CO2-induced changes in slowness (reciprocal of velocity)
between two wells for the grid size 234 × 219 at times 3, 30 and 60 hours respectively.
139. Comparison of costs of different algorithms
Grid size 59 x 55.
\Gamma_{\text{prior}} is constructed using the kernel \kappa(r) = \theta \exp(-\sqrt{r}).
\Gamma_{\text{noise}} = \sigma^2 I with \sigma^2 = 10^{-4}.
Saibaba, Arvind K., et al. Inverse Problems 31.1 (2015): 015009.
140. Error in the reconstruction
\Gamma_{\text{prior}} is constructed using the kernel \kappa(r) = \theta \exp(-\sqrt{r}).
\Gamma_{\text{noise}} = \sigma^2 I with \sigma^2 = 10^{-4}.
Saibaba, Arvind K., et al. Inverse Problems 31.1 (2015): 015009.
141. Conditional Realizations
Figure 16 : Conditional realizations of CO2-induced changes in slowness (reciprocal of
velocity) between two wells for the grid size 59 × 55 at times 3, 30 and 60 hours
respectively.
Saibaba, Arvind K., et al. Inverse Problems 31.1 (2015): 015009.
143. Inverse problems: Bayesian viewpoint
Consider the measurement equation
y = h(s) + v, \quad v \sim N(0, \Gamma_{\text{noise}})
Notation:
y: observations or measurements (given).
s: model parameters we want to estimate.
h(s): parameter-to-observation map (given).
v: additive i.i.d. Gaussian noise.
Using Bayes' rule, the posterior pdf is
p(s|y) \propto \underbrace{p(y|s)}_{\text{data misfit}} \; \underbrace{p(s)}_{\text{prior}}
The posterior distribution is the Bayesian solution to the inverse problem.
145. Bayesian Inference: Quantifying uncertainty
Maximum a posteriori (MAP) estimate: \arg\max_s p(s|y).
Conditional mean:
s_{\text{CM}} = E_{s|y}[s] = \int s\, p(s|y)\, ds
Credibility intervals: find sets C(y) such that
P[s \in C(y) \mid y] = 1 - \alpha
Sample realizations from the posterior.
146. Linear Inverse Problems
Recall that the posterior distribution is given by
p(s|y) \propto \exp\left(-\frac{1}{2}\|y - Hs\|^2_{\Gamma_{\text{noise}}^{-1}} - \frac{1}{2}\|s - \mu\|^2_{\Gamma_{\text{prior}}^{-1}}\right)
Posterior distribution:
s|y \sim N(s_{\text{MAP}}, \Gamma_{\text{post}})
\Gamma_{\text{post}} = \left(\Gamma_{\text{prior}}^{-1} + H^T \Gamma_{\text{noise}}^{-1} H\right)^{-1} = \Gamma_{\text{prior}} - \Gamma_{\text{prior}} H^T \left(H \Gamma_{\text{prior}} H^T + \Gamma_{\text{noise}}\right)^{-1} H \Gamma_{\text{prior}}
s_{\text{MAP}} = \Gamma_{\text{post}}\left(H^T \Gamma_{\text{noise}}^{-1} y + \Gamma_{\text{prior}}^{-1} \mu\right)
Observe that \Gamma_{\text{post}} \preceq \Gamma_{\text{prior}}.
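Both expressions for the posterior covariance can be verified numerically on a small random problem; the sizes and matrices below are illustrative.

```python
# Gaussian linear posterior: the two Gamma_post formulas agree, and
# Gamma_post <= Gamma_prior in the Loewner (positive-semidefinite) order.
import numpy as np

rng = np.random.default_rng(2)
N, m = 10, 4
H = rng.standard_normal((m, N))
L = rng.standard_normal((N, N))
G_prior = L @ L.T + np.eye(N)          # SPD prior covariance
G_noise = 0.01 * np.eye(m)
mu = np.zeros(N)
y = rng.standard_normal(m)

# Information form and Woodbury (data-space) form of the posterior covariance.
G_post1 = np.linalg.inv(np.linalg.inv(G_prior)
                        + H.T @ np.linalg.inv(G_noise) @ H)
G_post2 = (G_prior - G_prior @ H.T
           @ np.linalg.inv(H @ G_prior @ H.T + G_noise) @ H @ G_prior)
s_map = G_post1 @ (H.T @ np.linalg.inv(G_noise) @ y
                   + np.linalg.inv(G_prior) @ mu)
```

The data-space form only inverts an m x m matrix, which is why it is preferred when the number of measurements is much smaller than the number of unknowns.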
147. Application: CO2 monitoring
Variance = \operatorname{diag}(\Gamma_{\text{post}}) = \operatorname{diag}\left(\left(\Gamma_{\text{prior}}^{-1} + H^T \Gamma_{\text{noise}}^{-1} H\right)^{-1}\right)
A.K. Saibaba, Ambikasaran, Li, Darve, Kitanidis, Oil and Gas Science and Technology 67.5 (2012): 857.
148. Nonlinear Inverse Problems
Linearize the forward operator (at the MAP point):
h(s) = h(s_{\text{MAP}}) + \frac{\partial h}{\partial s}(s - s_{\text{MAP}}) + O\left(\|s - s_{\text{MAP}}\|_2^2\right)
Groundwater flow equations:
-\nabla \cdot (\kappa(x) \nabla \phi) = Q\, \delta(x - x_{\text{source}}), \quad x \in \Omega; \qquad \phi = 0, \quad x \in \partial\Omega_D
Inverse problem:
Estimate the hydraulic conductivity \kappa from discrete measurements of \phi.
To make the problem well posed, work with s = \log \kappa.
Saibaba, Arvind K., et al., Advances in Water Resources 82 (2015): 124-138.
149. Nonlinear Inverse Problems
Figure 17: (left) Reconstruction of log conductivity; (right) posterior variance.
Saibaba, Arvind K., et al., Advances in Water Resources 82 (2015): 124-138.
150. Monte Carlo sampling
Suppose X has density p(x) and we are interested in f(X):
E[f(X)] = \int f(x)\, p(x)\, dx = \lim_{N\to\infty} \frac{1}{N}\sum_{k=1}^{N} f(x_k)
Approximate using sample averages:
E[f(X)] \approx \frac{1}{N}\sum_{k=1}^{N} f(x_k)
Here p(x) is understood to be the posterior distribution.
If samples are easy to generate, the procedure is straightforward:
use the central limit theorem to generate confidence intervals.
However, generating samples from p(x) may not be straightforward.
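A sample-average sketch for a case where sampling is easy: X standard normal and f(x) = x^2, so the exact answer is E[X^2] = 1.

```python
# Monte Carlo estimate of E[f(X)] by a sample average.
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(100_000)     # i.i.d. draws from p(x) = N(0, 1)
est = np.mean(x ** 2)                # sample average of f(x) = x^2; -> 1
```

By the central limit theorem the error of this estimate shrinks like 1/sqrt(N), independent of the dimension of X.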
151. Acceptance-rejection sampling
Approximate the distribution by one that is easier to sample from:
\frac{\text{points under curve}}{\text{points generated}} \times \text{box area} \;\longrightarrow\; \int_A^B f(x)\, dx \quad \text{as } n \to \infty
From the PyMC2 website: http://pymc-devs.github.io/pymc/theory.html
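The box-counting identity above, sketched for an unnormalized Gaussian bump; the interval and the bound f_max are illustrative choices.

```python
# Acceptance-rejection: throw uniform points into a bounding box and keep
# those that land under the curve of the unnormalized density f.
import numpy as np

rng = np.random.default_rng(5)
f = lambda x: np.exp(-x ** 2)            # unnormalized target on [A, B]
A, B, f_max = -2.0, 2.0, 1.0             # box: [A, B] x [0, f_max]

xs = rng.uniform(A, B, 200_000)
ys = rng.uniform(0.0, f_max, 200_000)
accepted = xs[ys < f(xs)]                # accepted points ~ f (normalized)
# fraction under the curve times box area approximates the integral of f
integral = accepted.size / xs.size * (B - A) * f_max
```

The accepted x-values are exact (correlated-free) samples from the normalized density, but the acceptance rate collapses when f is peaked relative to its bounding box, which is what motivates MCMC next.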
152. Markov chains
Consider a sequence of random variables X_1, X_2, \dots satisfying
p(X_{t+1} = x_{t+1} \mid X_t = x_t, \dots, X_1 = x_1) = p(X_{t+1} = x_{t+1} \mid X_t = x_t)
The future depends only on the present, not the past!
Under some conditions, the chain has a stationary distribution.
153. Implementation
Create a Markov chain whose stationary distribution is p(x):
1. Draw a proposal y from q(y \mid x_n).
2. Calculate the acceptance ratio
\alpha(x_n, y) = \min\left\{1, \frac{p(y)\, q(x_n \mid y)}{p(x_n)\, q(y \mid x_n)}\right\}
3. Accept/reject:
x_{n+1} = y with probability \alpha(x_n, y); otherwise x_{n+1} = x_n.
If q(x, y) = q(y, x), then \alpha(x_n, y) = \min\{1, p(y)/p(x_n)\}.
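A minimal random-walk Metropolis sketch for the symmetric-proposal case, so the acceptance ratio reduces to min{1, p(y)/p(x_n)}; the target and the proposal scale below are illustrative.

```python
# Random-walk Metropolis targeting a standard normal (symmetric proposal).
import numpy as np

rng = np.random.default_rng(4)
log_p = lambda x: -0.5 * x * x           # unnormalized log target

x = 0.0
samples = []
for _ in range(50_000):
    y = x + rng.normal(scale=1.0)        # symmetric proposal q(y|x)
    # accept with probability min{1, p(y)/p(x)}, done in log space
    if np.log(rng.uniform()) < log_p(y) - log_p(x):
        x = y                            # accept the proposal
    samples.append(x)                    # on rejection, repeat current state
samples = np.array(samples)
```

Note that only ratios of p appear, so the target never needs to be normalized — the property that makes MCMC usable for Bayesian posteriors.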
155. Properties of MCMC sampling
Ergodic theorem for expectations:
\lim_{N\to\infty} \frac{1}{N}\sum_{k=1}^{N} f(x_k) = \int_\Omega f(x)\, p(x)\, dx
However, the samples x_k are no longer i.i.d., so the estimator has higher variance than plain Monte Carlo sampling.
Popular sampling strategies:
Metropolis-Hastings
Gibbs samplers
Hamiltonian MCMC
Adaptive MCMC with delayed rejection (DRAM)
Metropolis-adjusted Langevin algorithm (MALA)
156. Curse of dimensionality
What is the probability that a uniform random point in a hypercube lands in the inscribed hypersphere?
In dimension n = 100, the probability is less than 2 x 10^{-70}.
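The quoted probability is the volume ratio of the inscribed unit n-ball to the cube [-1, 1]^n, namely pi^{n/2} / (Gamma(n/2 + 1) 2^n), which can be evaluated stably with log-gamma:

```python
# Probability that a uniform point in [-1, 1]^n lies in the inscribed
# unit ball: pi^{n/2} / (Gamma(n/2 + 1) * 2^n), computed in log space.
import math

def hit_probability(n):
    """Volume ratio of the unit n-ball to the cube [-1, 1]^n."""
    return math.exp(0.5 * n * math.log(math.pi)
                    - math.lgamma(0.5 * n + 1.0)
                    - n * math.log(2.0))

print(hit_probability(2))     # pi/4, about 0.785
print(hit_probability(100))   # below 2e-70
```

This is why naive rejection-style schemes, which rely on hitting a small region by chance, fail in high dimensions and structured proposals become essential.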
157. Stochastic Newton MCMC
Martin, James, et al. SISC 34.3 (2012): A1460-A1487.
159. Opportunities
Theoretical and numerical
“Big data” meets “Big Models”
Model reduction
Posterior uncertainty quantification
Applications
New application areas, new technologies that generate inverse problems
Combining multiple modalities to make better predictions
Software that transcends application areas
160. Resources for learning Inverse Problems
Books
Hansen, Per Christian. Discrete inverse problems: insight and algorithms.
Vol. 7. SIAM, 2010.
Hansen, Per Christian. Rank-deficient and discrete ill-posed problems:
numerical aspects of linear inversion. Vol. 4. SIAM, 1998.
Hansen, Per Christian, James G. Nagy, and Dianne P. O'Leary. Deblurring
images: matrices, spectra, and filtering. Vol. 3. SIAM, 2006.
Tarantola, Albert. Inverse problem theory and methods for model parameter
estimation. SIAM, 2005.
Kaipio, Jari, and Erkki Somersalo. Statistical and computational inverse
problems. Vol. 160. Springer Science & Business Media, 2006.
Vogel, Curtis R. Computational methods for inverse problems. Vol. 23.
SIAM, 2002.
Kitanidis, P.K. Introduction to geostatistics: applications in hydrogeology.
Cambridge University Press, 1997.
Cressie, Noel. Statistics for spatial data. John Wiley & Sons, 2015.