ECE 285
Image and video restoration
Chapter VI – Sparsity, shrinkage and wavelets
Charles Deledalle
March 11, 2019
1
Motivations
(a) y = x + w   (b) z = F y   (c) λ_i² / (λ_i² + σ²)   (d) ẑ_i = (λ_i² / (λ_i² + σ²)) z_i   (e) x̂ = F⁻¹ ẑ
Wiener filter (LMMSE in the Fourier domain)
• Assume Fourier coefficients to be decorrelated (white),
• Modulate frequencies based on the mean power spectral density λ_i².
Limits
• Linear: no adaptation to the content ⇒
Unable to preserve edges,
Blurry solutions.
2
Motivations
Facts and consequences
• Assume Fourier coefficients to be decorrelated (white)
• Removing Gaussian noise ⇒ need to be adaptive ⇒ Non linear
• Assuming Gaussian noise + Gaussian prior ⇒ Linear
Deductive reasoning
Fourier coefficients of clean images are not Gaussian distributed
How are Fourier coefficients distributed?
3
Motivations – Distribution of Fourier coefficients
How are Fourier coefficients distributed?
1. Perform whitening with DFT
Var[x] = L = E Λ E*,  with  E = (1/√n) F  and  diag(Λ) = (λ_1², …, λ_n²) = n⁻¹ MPSD
4
Motivations – Distribution of Fourier coefficients
How are Fourier coefficients distributed?
2. Look at the histogram
• The histogram of η has a symmetric bell shape around 0.
• It has a peak at 0 (a large number of Fourier coefficients are zero).
• It has large/heavy tails (many coefficients are “outliers”/abnormal).
(a) x (b) Whitening η of x (c) Histogram of η
5
Motivations – Distribution of Fourier coefficients
How are Fourier coefficients distributed?
3. Look for the distribution that best fits (in log scale)
• Gaussian: bell shape √, peak ×, tail ×
• Laplacian: bell shape ×, peak √, tail √
• Student: bell shape √, peak ×, tail √ (heavier)
• Others: alpha-stable and generalized Gaussian distributions
(a) Whitening η of x (b) Histogram of η (c) Log-histogram of η
6
Motivations – Distribution of Fourier coefficients
Model expression (zero mean, variance = 1)
• Gaussian: bell shape √, peak ×, tail ×
p(η_i) = (1/√(2π)) exp(−η_i²/2)
• Laplacian: bell shape ×, peak √, tail √
p(η_i) = (1/√2) exp(−√2 |η_i|)
• Student: bell shape √, peak ×, tail √ (heavier)
p(η_i) = (1/Z) (1 / ((2r − 2) + η_i²))^(r + 1/2)
(Z: normalization constant, r > 1 controls the tails)
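As a quick sanity check of these expressions, the three unit-variance densities can be evaluated numerically; the sketch below is illustrative only (the Student normalization Z is computed numerically here, and r is assumed > 1).

import numpy as np

def gaussian_pdf(eta):
    return np.exp(-eta**2 / 2) / np.sqrt(2 * np.pi)

def laplacian_pdf(eta):
    return np.exp(-np.sqrt(2) * np.abs(eta)) / np.sqrt(2)

def student_pdf(eta, r=3.0):
    # Unnormalized density (1 / ((2r - 2) + eta^2))^(r + 1/2); Z computed numerically
    grid = np.linspace(-100, 100, 200001)
    Z = np.trapz((1.0 / ((2 * r - 2) + grid**2))**(r + 0.5), grid)
    return (1.0 / ((2 * r - 2) + eta**2))**(r + 0.5) / Z

eta = np.linspace(-10, 10, 2001)
for pdf in (gaussian_pdf, laplacian_pdf, student_pdf):
    print(pdf.__name__, "mass on [-10, 10]:", np.trapz(pdf(eta), eta))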
How do they look in multiple-dimensions?
7
Motivations – Distribution of Fourier coefficients
• Gaussian prior
• images are concentrated in an elliptical cluster,
• outliers are rare (images outside the cluster).
• Peaky & heavy-tailed priors: shape between a diamond and a star.
• union of subspaces: most images lie in one of the branches of the star,
• sparsity: most of their coefficients η_i are zeros,
• robustness: outlier coefficients are frequent.
8
Shrinkage functions
Shrinkage functions
Consider the following Gaussian denoising problem
• Let y ∈ R^n and x ∈ R^n be two random vectors such that
y | x ∼ N(x, σ² Id_n),  E[x] = 0  and  Var[x] = L = E Λ E*
• Let η = Λ^(−1/2) E* x (whitening / decorrelation of x)
Goal: estimate x from y
assuming a non-Gaussian prior pη for η.
(such as Laplacian or Student)
9
Shrinkage functions
Bayesian shrinkage functions
• Assume ηi are also independent and identically distributed (iid).
• Then, the MMSE and MAP estimators both read as
x̂ = E ẑ (come back),  where ẑ_i = s(z_i; λ_i, σ) (shrinkage)  and  z = E* y (change of basis)
• The function zi → s(zi; λi, σ) is called shrinkage function.
• Unlike the LMMSE, s will depend on the prior distribution of ηi.
• As for the LMMSE, the solution can be computed in the eigenspace.
• We say that the estimator is separable in the eigenspace (ex: Fourier).
10
Shrinkage functions
Remark
independence ⇒ uncorrelatedness
¬uncorrelatedness ⇒ ¬independence
correlation ⇒ dependence
⇒ Whitening is a necessary step toward independence, but not a sufficient one.
(Except in the Gaussian case)
How are the shrinkage functions defined for the MMSE and MAP?
11
Shrinkage functions
• Recall that the MMSE is the posterior mean
x̂ = ∫_{R^n} x p(x|y) dx = ∫_{R^n} x p(y|x) p(x) dx / ∫_{R^n} p(y|x) p(x) dx
MMSE Shrinkage functions
• Under the previous assumptions
x̂ = E ẑ (come back),  where ẑ_i = s(z_i; λ_i, σ) (shrinkage)  and  z = E* y (change of basis)
with s(z; λ, σ) = ∫_R z̃ exp(−(z − z̃)²/(2σ²)) p_η(z̃/λ) dz̃ / ∫_R exp(−(z − z̃)²/(2σ²)) p_η(z̃/λ) dz̃
where pη is the prior distribution on the entries of η.
• Separability: the n-dimensional integration reduces to n separate 1d integrations.
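A minimal numerical sketch of this posterior-mean shrinkage (not the slides' code): the two 1d integrals are approximated on a grid, with `prior` standing for any of the unit-variance densities p_η.

import numpy as np

def mmse_shrink(z, lbd, sig, prior):
    # Posterior mean: ratio of two 1d integrals over candidate values t
    t = np.linspace(-30 * lbd, 30 * lbd, 20001)
    w = np.exp(-(z - t)**2 / (2 * sig**2)) * prior(t / lbd)
    return np.trapz(t * w, t) / np.trapz(w, t)

laplacian = lambda u: np.exp(-np.sqrt(2) * np.abs(u)) / np.sqrt(2)
print(mmse_shrink(5.0, lbd=1.0, sig=1.0, prior=laplacian))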
12
Shrinkage functions
• Recall that the MAP is the optimization problem
x̂ ∈ argmax_{x ∈ R^n} p(x|y) = argmin_{x ∈ R^n} [− log p(y|x) − log p(x)]
MAP Shrinkage functions
• Under the previous assumptions
x̂ = E ẑ (come back),  where ẑ_i = s(z_i; λ_i, σ) (shrinkage)  and  z = E* y (change of basis)
with s(z; λ, σ) = argmin_{z̃ ∈ R} (z − z̃)²/(2σ²) − log p_η(z̃/λ)
where p_η is the prior distribution on the entries of η.
• Separability: the n-dimensional optimization reduces to n separate 1d optimizations.
13
Shrinkage functions
Example (Gaussian noise + Gaussian prior)
• MMSE Shrinkage
s(z; λ, σ) = ∫_R z̃ exp(−(z − z̃)²/(2σ²) − z̃²/(2λ²)) dz̃ / ∫_R exp(−(z − z̃)²/(2σ²) − z̃²/(2λ²)) dz̃ = (λ²/(λ² + σ²)) z
• MAP Shrinkage
s(z; λ, σ) = argmin_{z̃ ∈ R} (z − z̃)²/(2σ²) + z̃²/(2λ²) = (λ²/(λ² + σ²)) z
• Gaussian prior: MAP = MMSE = Linear shrinkage.
• We retrieve the LMMSE as expected.
14
Shrinkage functions
Gaussian noise + Gaussian prior
(Plots of the shrinkage function for SNR = λ/σ = 4, λ/σ = 1, and λ/σ = 1/4.)
15
Posterior mean – Shrinkage functions – Examples
Example (Gaussian noise + Laplacian prior)
• MMSE Shrinkage
s(z; λ, σ) = ∫ z̃ exp(−(z − z̃)²/(2σ²) − √2|z̃|/λ) dz̃ / ∫ exp(−(z − z̃)²/(2σ²) − √2|z̃|/λ) dz̃
= z − γ [erf((z − γ)/(√2 σ)) − exp(2γz/σ²) erfc((γ + z)/(√2 σ)) + 1] / [erf((z − γ)/(√2 σ)) + exp(2γz/σ²) erfc((γ + z)/(√2 σ)) + 1],   γ = √2 σ²/λ
• MAP Shrinkage (soft-thresholding)
s(z; λ, σ) = argmin_{z̃ ∈ R} (z − z̃)²/(2σ²) + √2|z̃|/λ = { 0 if |z| < γ;  z − γ if z > γ;  z + γ if z < −γ } = Soft-T(z, γ)
Non-Gaussian prior: MAP ≠ MMSE → non-linear shrinkage.
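The soft-thresholding above is straightforward to implement; a small NumPy sketch (illustrative, not the course's code):

import numpy as np

def soft_thresh(z, gamma):
    # Kills |z| <= gamma, shrinks the rest by gamma, preserves the sign
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0)

print(soft_thresh(np.array([-3.0, 0.5, 2.0]), 1.0))  # [-2.  0.  1.]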
16
Shrinkage functions
Gaussian noise + Laplacian prior
(Plots of the shrinkage function for SNR = λ/σ = 4, λ/σ = 1, and λ/σ = 1/4.)
17
Posterior mean – Shrinkage functions – Examples
Example (Gaussian noise + Student prior)
• MMSE Shrinkage
No simple expression, requires 1d numerical integration
• MAP Shrinkage
No simple expression, requires 1d numerical optimization
For efficiency, the 1d functions can be evaluated offline and stored in a look-up table.
18
Shrinkage functions
Gaussian noise + Student prior
(Plots of the shrinkage function for SNR = λ/σ = 4, λ/σ = 1, and λ/σ = 1/4.)
19
Posterior mean – Shrinkage functions – Examples
(Plots of three example shrinkage functions; SNR = λ/σ = 4, λ/σ = 1/2, λ/σ = 1/2.)
• Coefficients are shrunk towards zero • Signs are preserved
• Non-Gaussian priors lead to non-linear filtering:
• sparsity: small coefficients are shrunk (likely due to noise)
• robustness: large coefficients are preserved (likely encoding signal)
• The larger the SNR = λ/σ, the closer the shrinkage is to the identity.
20
Posterior mean – Shrinkage functions – Examples
Interpretation
(Plots of three shrinkage functions illustrating sparsity, robustness, and the transition.)
Sparsity: zero for small values.
Robustness: remain close to the identity for large values.
Transition: bias/variance tradeoff.
Can we design our own shrinkage according to what we want?
21
Shrinkage functions
Shrinkage functions (a.k.a, thresholding functions)
• Pick a shrinkage function s satisfying
• Shrink: |s(z)| ≤ |z| (non-expansive)
• Preserve sign: z · s(z) ≥ 0
• Kill low SNR: lim_{λ/σ → 0} s(z; λ, σ) = 0
• Keep high SNR: lim_{λ/σ → ∞} s(z; λ, σ) = z
• Increasing: z_1 ≤ z_2 ⇔ s(z_1) ≤ s(z_2)
• Beyond Bayesian: No need to relate s to a prior distribution pη.
22
Shrinkage functions
A few examples (among many others)
(Plots of three thresholding functions: Hard-thresholding, MCP, SCAD.)
• Though not necessarily related to a prior distribution,
• Often related to a penalized least square problem, ex:
Hard-T(z) = argmin_{z̃ ∈ R} (z − z̃)² + τ² 1{z̃ ≠ 0} = { 0 if |z| < τ;  z otherwise }
• Hard-thresholding: similar behavior to Student’s shrinkage.
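Hard-thresholding can be sketched the same way (illustrative code, with tau the threshold):

import numpy as np

def hard_thresh(z, tau):
    # Keeps coefficients with |z| >= tau unchanged, zeroes the others
    return np.where(np.abs(z) < tau, 0, z)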
23
Shrinkage functions
Link with penalized least square (1/2)
• D = L^(1/2) = E Λ^(1/2) is an orthogonal dictionary of n atoms/words
D = (d_1, d_2, …, d_n) with ||d_i|| = λ_i and ⟨d_i, d_j⟩ = 0 (for i ≠ j)
• Goal: look for the n coefficients η_i such that x̂ is close to y
x̂ = Dη = Σ_{i=1}^n η_i d_i = "linear combination of the orthogonal atoms d_i of D"
• Choosing η_i = ⟨d_i/||d_i||², y⟩, i.e., η = Λ^(−1/2) E* y, is optimal:
x̂ = y
but it also reconstructs the noise component.
• Idea: penalize the coefficients to prevent reconstructing the noise.
24
Shrinkage functions
Link with penalized least square (2/2)
• Penalization on the coefficients controls shrinkage and sparsity:
• (1/2)||y − Dη||²₂ + (τ²/2)||η||²₂  ⇒  ẑ_i = (λ_i²/(λ_i² + τ²)) z_i
• (1/2)||y − Dη||²₂ + τ||η||₁  ⇒  ẑ_i = Soft-T(z_i, γ_i) with γ_i = τ/λ_i
• (1/2)||y − Dη||²₂ + (τ²/2)||η||₀  ⇒  ẑ_i = Hard-T(z_i, γ_i) with γ_i = τ/λ_i
ℓ0 pseudo-norm: ||η||₀ = lim_{p→0} (Σ_{i=1}^n |η_i|^p)^(1/p) = "# of non-zero coefficients"
Sparsity: ||η||₀ small compared to n
25
Posterior mean – Shrinkage in the Fourier domain
Shrinkage in the discrete Fourier domain
(a) x (b) y = x + w
sig = 20
y = x + sig * np.random.randn(*x.shape)
n1, n2 = y.shape[:2]
n = n1 * n2
lbd = np.sqrt(prior_mpsd(n1, n2) / n)
z = nf.fft2(y, axes=(0, 1)) / np.sqrt(n)
zhat = shrink(z, lbd, sig)
xhat = np.real(nf.ifft2(zhat, axes=(0, 1))) * np.sqrt(n)
z = F y / √n
ẑ_i = s(z_i; λ_i, σ)
x̂ = √n F⁻¹ ẑ
26
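The helpers `shrink` and `prior_mpsd` are not shown on the slide. A plausible sketch (an assumption, not the course code): `prior_mpsd` would return the mean power spectral density learned on clean images, and `shrink` could be either the linear (Wiener) rule or a soft-thresholding of the modulus, for example:

import numpy as np

def shrink_wiener(z, lbd, sig):
    # Linear/LMMSE rule: lambda^2 / (lambda^2 + sigma^2) * z
    return lbd**2 / (lbd**2 + sig**2) * z

def shrink_soft(z, lbd, sig):
    # Soft-thresholding of the modulus (phase preserved), gamma = sqrt(2) sigma^2 / lambda
    gamma = np.sqrt(2) * sig**2 / lbd
    mag = np.maximum(np.abs(z), 1e-12)
    return np.maximum(1 - gamma / mag, 0) * z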
Shrinkage functions – Fourier domain – Results
(a) x (b) y (σ = 20) (c) Linear/Wiener
(=Gaussian)
(d) Soft-T
(=Laplacian)
(e) Hard-T
(≈Student)
Bias ←−−−−−−−−−−−−−−−→ Variance
27
Shrinkage functions – Fourier domain – Results
(a) x (b) y (σ = 40) (c) Linear/Wiener
(=Gaussian)
(d) Soft-T
(=Laplacian)
(e) Hard-T
(≈Student)
Bias ←−−−−−−−−−−−−−−−→ Variance
27
Shrinkage functions – Fourier domain – Results
(a) x (b) y (σ = 60) (c) Linear/Wiener
(=Gaussian)
(d) Soft-T
(=Laplacian)
(e) Hard-T
(≈Student)
Bias ←−−−−−−−−−−−−−−−→ Variance
27
Shrinkage functions – Fourier domain – Results
(a) x (b) y (σ = 120) (c) Linear/Wiener
(=Gaussian)
(d) Soft-T
(=Laplacian)
(e) Hard-T
(≈Student)
Bias ←−−−−−−−−−−−−−−−→ Variance
27
Posterior mean – Limits of shrinkage in the Fourier domain
Limits of shrinkage in the discrete Fourier domain
(a) x (b) y convolution kernels
(c) Linear
(=Gaussian)
(d) Soft-T
(=Laplacian)
(e) Hard-T
(≈Student)
• Linear shrinkage (Wiener)
⇒ Non-adaptive,
• Non-linear shrinkage
⇒ Adaptive convolution,
• Adapts to the frequency content,
• but not to the spatial content.
ẑ_i = s(z_i; τ, σ) = (s(z_i; τ, σ)/z_i) × z_i   (element-wise product)
⇔ x̂ = ν(y) ∗ y,  a spatial average adapted to the spectrum of y.
28
Motivations
Consequences
• Modulating Fourier coefficients ⇒ Non spatially adaptive
• Assuming Fourier coefficients to be white+sparse ⇒ Shrinkage in Fourier
Deductive reasoning
Need another representation for sparsifying clean images
E = (1/√n) F, the (n × n) DFT matrix: its columns are the Fourier atoms.
What transform makes the signal white and sparse, and captures both spatial and spectral content?
29
Wavelet transforms
Introduction to wavelets – Haar (1d case) [Alfréd Haar (1909)]
Id = [ 1 0 0 0 ;
      0 1 0 0 ;
      0 0 1 0 ;
      0 0 0 1 ]
and
F = [ 1       1              1              1            ;
      1  e^(−2πi·1/4)   e^(−2πi·2/4)   e^(−2πi·3/4) ;
      1  e^(−2πi·2/4)   e^(−2πi·4/4)   e^(−2πi·6/4) ;
      1  e^(−2πi·3/4)   e^(−2πi·6/4)   e^(−2πi·9/4) ]
30
Introduction to wavelets – Haar (1d case) [Alfréd Haar (1909)]
H^(1st) = (1/√2) [ 1  1  0  0 ;
                   0  0  1  1 ;
                   1 −1  0  0 ;
                   0  0  1 −1 ]
and
H^(2nd) = [ 1/2   1/2   1/2   1/2 ;
            1/2   1/2  −1/2  −1/2 ;
            1/√2 −1/√2   0     0  ;
             0     0    1/√2 −1/√2 ]
31
Introduction to wavelets – Haar (2d case)
(a) H1st
(4 × 4 image) (b) x (c) H1st
x
2d Haar representation
4 sub-bands:
• Coarse sub-band
• Vertical detailed sub-band
• Horizontal detailed sub-band
• Diagonal detailed sub-band
32
Introduction to wavelets – Haar (2d case)
(a) H1st
(4 × 4 image) (b) x (c) H2nd
x
Multi-scale 2d Haar representation
• Repeat recursively J times
• Dyadic decomposition
• Multi-scale representation
• Related to scale spaces
32
Introduction to wavelets – Haar (2d case)
(a) H1st
(4 × 4 image) (b) x (c) H3rd
x
Multi-scale 2d Haar representation
• Repeat recursively J times
• Dyadic decomposition
• Multi-scale representation
• Related to scale spaces
33
Introduction to wavelets – Haar (2d case)
(a) H1st
(4 × 4 image) (b) x (c) H4th
x
Multi-scale 2d Haar representation
• Repeat recursively J times
• Dyadic decomposition
• Multi-scale representation
• Related to scale spaces
33
Introduction to wavelets – Haar transform - Filter bank
34
Introduction to wavelets – Haar transform - Separability
Properties of the 2d Haar transform
• Separable: 1d Haar transforms along the horizontal and then the vertical direction
• First: perform low-pass and high-pass filtering
• Next: decimate by a factor of 2
Can we choose other low and high pass filters
to get a better transform?
35
Discrete wavelets
Discrete wavelet transform (DWT) (1/3) (1d and n even)
• Let h ∈ R^n (with periodic boundary conditions) satisfying
Σ_{i=0}^{n−1} h_i = 0,   Σ_{i=0}^{n−1} h_i² = 1,   and   Σ_{i=0}^{n−1} h_i h_{i+2k} = 0 for all integers k ≠ 0
Example (Haar as a particular case)
h = (1/√2) (0 … 0  −1  +1  0 … 0)
36
Discrete wavelets
Discrete wavelet transform (DWT) (2/3) (1d and n even)
• Define the high- and low-pass filters H : R^n → R^n and G : R^n → R^n as
(Hx)_k = (h ∗ x)_k = Σ_{i=0}^{n−1} h_i x_{k−i}
(Gx)_k = (g ∗ x)_k = Σ_{i=0}^{n−1} g_i x_{k−i}   where g_i = (−1)^i h_{n−1−i}
• Note: necessarily Σ_{i=0}^{n−1} g_i = √2
Example (Haar as a particular case)
h = (1/√2) (0 … 0  −1  +1  0 … 0)
g = (1/√2) (0 … 0  +1  +1  0 … 0)
37
Discrete wavelets
• Define the decimation by 2 of a matrix M ∈ R^{n×n} as
M↓2 = "M(1:2:end, :)" ∈ R^{n/2×n}
i.e., the matrix obtained by removing every other row.
• M↓2 x: apply M to x and then remove every other entry.
Discrete wavelet transform (DWT) (3/3) (1d and n even)
Let W = [ G↓2 ; H↓2 ] ∈ R^{n×n}. Then
• x → W x: orthonormal discrete wavelet transform,
• columns of W: orthonormal discrete wavelet basis,
• z = W x: wavelet coefficients of x.
38
Multi-scale discrete wavelets
Multi-scale DWT (1d and n a multiple of 2^J) [Mallat, 1989]
Defined recursively as  W^(J) = [ W^(J−1)  0 ; 0  Id ] W
39
Multi-scale discrete wavelets
Implementation of 2D DWT (n1 and n2 multiples of 2^J)
def dwt(x, J, h, g):  # 2d and multi-scale
    if J == 0:
        return x
    n1, n2 = x.shape[:2]
    m1, m2 = (int(n1 / 2), int(n2 / 2))
    z = dwt1d(x, h, g)
    z = flip(dwt1d(flip(z), h, g))
    z[:m1, :m2] = dwt(z[:m1, :m2], J - 1, h, g)
    return z

def dwt1d(x, h, g):  # 1d and 1 scale
    coarse = convolve(x, g)
    detail = convolve(x, h)
    z = np.concatenate((coarse[::2, :], detail[::2, :]), axis=0)
    return z
Use separability
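The helpers `flip` and `convolve` are not defined on the slides. One plausible sketch (an assumption): `flip` swaps the two spatial axes so the 1d transform can be reused along the other direction, and `convolve` is a periodic convolution along axis 0.

import numpy as np

def flip(x):
    # Swap the two spatial axes so dwt1d can process the other direction
    return np.swapaxes(x, 0, 1)

def convolve(x, f):
    # Periodic (circular) convolution along axis 0: out[k] = sum_i f[i] x[k-i]
    out = np.zeros_like(np.asarray(x, dtype=float))
    for i, fi in enumerate(f):
        out += fi * np.roll(x, i, axis=0)
    return out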
40
Multi-scale discrete wavelets
Multi-scale Inverse DWT (1d and n a multiple of 2^J)
Defined recursively as  (W^(J))⁻¹ = W⁻¹ [ (W^(J−1))⁻¹  0 ; 0  Id ]
where W⁻¹ = W* = [ G*↑2   H*↑2 ] ∈ R^{n×n},
M↑2: remove every other column of M;
M↑2 x: insert zeros between the entries of x and then apply M.
41
Multi-scale discrete wavelets
Implementation of 2D IDWT (n1 and n2 multiples of 2^J)
def idwt(z, J, h, g):  # 2d and multi-scale
    if J == 0:
        return z
    n1, n2 = z.shape[:2]
    m1, m2 = (int(n1 / 2), int(n2 / 2))
    x = z.copy()
    x[:m1, :m2] = idwt(x[:m1, :m2], J - 1, h, g)
    x = flip(idwt1d(flip(x), h, g))
    x = idwt1d(x, h, g)
    return x

def idwt1d(z, h, g):  # 1d and 1 scale
    n1 = z.shape[0]
    m1 = int(n1 / 2)
    coarse, detail = np.zeros(z.shape), np.zeros(z.shape)
    coarse[::2, :], detail[::2, :] = z[:m1, :], z[m1:, :]
    x = convolve(coarse, g[::-1]) + convolve(detail, h[::-1])
    return x
42
Discrete wavelets – Limited support
Discrete wavelet with limited support
• Consider a high pass filter with finite support of size m = 2p (even).
For instance, for m = 4:
H = [ h2 h3  0  ⋯  0  h0 h1 ;
      h0 h1 h2 h3  0  ⋯  0  ;
       0 h0 h1 h2 h3  0  ⋯  ;
       ⋮              ⋱     ;
       0  ⋯  0  h0 h1 h2 h3 ;
      h2 h3  0  ⋯  0  h0 h1 ]
• Then h defines a wavelet transform if it satisfies the three conditions
Σ_i h_i = 0,   Σ_i h_i² = 1,   and   Σ_i h_i h_{i+2k} = 0 for k = 1 to p − 1
• This system has 2p unknowns and 1 + p independent equations.
• If p = 1, 2p = 1 + p, this implies that the solution is unique (Haar).
• Otherwise, one has p − 1 degrees of freedom.
43
Discrete wavelets – Daubechies’ wavelets
Daubechies’ wavelets (1988)
• Daubechies suggests adding the p − 1 constraints
Σ_{i=0}^{2p−1} i^q h_i = 0  for q = 1 to p − 1   (vanishing q-order moments)
• For p = 2, the (orthonormal) Daubechies wavelets are defined by
h_0² + h_1² + h_2² + h_3² = 1,   h_0 + h_1 + h_2 + h_3 = 0,   h_0 h_2 + h_1 h_3 = 0,   h_1 + 2h_2 + 3h_3 = 0
⇔  h = ± (1/√2) ( (1+√3)/4,  (3+√3)/4,  (3−√3)/4,  (1−√3)/4 )
• The corresponding DWT is referred to as Daubechies-2 (or Db2).
As for the Fourier transform, there also exists a continuous version.
44
Continuous wavelets
Continuous wavelet transform (CWT) (1d)
• Continuum of locations t ∈ R and scales a > 0,
• Continuous wavelet transform of x : R → R
c(a, t)  (wavelet coefficient)  = ∫_{−∞}^{+∞} ψ*_{a,t}(t′) x(t′) dt′ = ⟨x, ψ_{a,t}⟩   (signal, wavelet)
where * is the complex conjugate.
• ψ_{a,t}: daughter wavelets, translated and scaled versions of Ψ
ψ_{a,t}(t′) = (1/√a) Ψ((t′ − t)/a)
• Ψ: the mother wavelet, satisfying
∫_{−∞}^{+∞} Ψ(t) dt = 0   (zero mean)   and   ∫_{−∞}^{+∞} |Ψ(t)|² dt = 1 < ∞   (unit norm / square-integrable)
45
Continuous wavelets
Inverse CWT (1d)
• The inverse continuous wavelet transform is given by
x(t) = (1/C_Ψ) ∫_{−∞}^{+∞} ∫_0^{+∞} (1/|a|²) c(a, t′) ψ_{a,t′}(t) da dt′
with C_Ψ = ∫_0^{+∞} |Ψ̂(u)|²/u du,  where Ψ̂ is the Fourier transform of Ψ.
Relation between CWT/DWT (1d)
• The DWT can be seen as a discretization of the CWT
• Dyadic discretization in scale: a = 1, 2, 4, …, 2^J
• Uniform discretization in time at scale j with step 2^j: t = 1:2^j:n
46
Continuous wavelets
Twin-scale relation (1d)
• The CWT is orthogonal (inverse = adjoint) if and only if Ψ satisfies
Ψ(t) = √2 Σ_{i=0}^{m−1} h_i Φ(2t − i)   and   Φ(t) = √2 Σ_{i=0}^{m−1} g_i Φ(2t − i)
where h and g are the high- and low-pass filters defining a DWT.
• Φ is called the father wavelet or scaling function.
• Note: potentially m = ∞.
The twin-scale relation makes it possible to define a CWT from a DWT and vice-versa.
The CWT may not have a closed form (it can be approximated by the cascade algorithm).
47
Continuous and discrete wavelets
Haar/Db1
(Plots: father wavelet, mother wavelet, low-pass filter, high-pass filter.)
Popular wavelets are:
Haar (1909); Gabor wavelet (1946); Mexican hat/Marr wavelet (1980);
Morlet wavelet (1984); Daubechies (1988); Meyer wavelet (1990);
Binomial quadrature mirror filter (1990); Coiflets (1991); Symlets (1992).
Some classical wavelet transforms are not orthogonal.
Bi-orthogonal: Non orthogonal but invertible (ex: for symmetric wavelets).
Two filters for the direct, and two others for the inverse.
48
Continuous and discrete wavelets
Db2
(Plots: father wavelet, mother wavelet, low-pass filter, high-pass filter.)
48
Continuous and discrete wavelets
Db4
(Plots: father wavelet, mother wavelet, low-pass filter, high-pass filter.)
48
Continuous and discrete wavelets
Db8
(Plots: father wavelet, mother wavelet, low-pass filter, high-pass filter.)
48
Continuous and discrete wavelets
Meyer
(Plots: father wavelet, mother wavelet, low-pass filter, high-pass filter.)
48
Continuous and discrete wavelets
Coif4
(Plots: father wavelet, mother wavelet, low-pass filter, high-pass filter.)
48
Continuous and discrete wavelets
Sym4
(Plots: father wavelet, mother wavelet, low-pass filter, high-pass filter.)
48
Wavelets and sparsity
Wavelets perform image compression
• Haar encodes constant signals with one coefficient,
• Db-p encodes (p-1)-order polynomials with p coefficients.
Consequences:
• Polynomial/Smooth signals are encoded with very few coefficients,
• Coarse coefficients encode the smooth underlying signal,
• Detailed coefficients encode non-smooth content of the signal,
• Typical signals are concentrated on few coefficients,
• The remaining coefficients capture only noise components.
⇒ Heavy tailed distribution with a peak at zero,
i.e., wavelets favor sparsity.
49
Wavelets as a sparsifying transform
Fourier
(a) x (b) F x (c) λ (d) (F x)i/λi
Wavelets
(e) x (f) W x (g) λ (h) (W x)i/λi
Fourier (u_i, v_i: frequency of component i)
• E* = F / √n
• λ_i² = n⁻¹ MPSD, and ∞ if i = 0
• Arbitrary DC component
Wavelets (j_i: scale of component i)
• E* = W
• λ_i = α 2^{j_i − 1}, and ∞ if j_i = J
• Arbitrary coarse component
50
Distribution of wavelet coefficients
(a) x (b) ηi = (W x)i/λi
(c) Histogram of η (d) Histogram of η
51
Shrinkage in the wavelet domain
Shrinkage in the discrete wavelet domain
(a) y
sig = 20
y = x + sig * nr.randn(*x.shape)
z = im.dwt(y, 3, h, g)
zhat = shrink(z, lbd, sig)
xhat = im.idwt(zhat, 3, h, g)
z = W y
ẑ_i = s(z_i; λ_i, σ)
x̂ = W⁻¹ ẑ
52
(The same pipeline is then shown with Haar and Daubechies wavelets combined with LMMSE, Soft-T, and Hard-T shrinkage.)
Shrinkage in the wavelet domain
(a) y (σ = 20) (b) Db2+LMMSE (c) Db2+Soft-T (d) Db2+Hard-T (e) Haar+Hard-T
For large noise: Blocky effects and Ringing artifacts
53
Shrinkage in the wavelet domain
(a) y (σ = 40) (b) Db2+LMMSE (c) Db2+Soft-T (d) Db2+Hard-T (e) Haar+Hard-T
For large noise: Blocky effects and Ringing artifacts
53
Shrinkage in the wavelet domain
(a) y (σ = 60) (b) Db2+LMMSE (c) Db2+Soft-T (d) Db2+Hard-T (e) Haar+Hard-T
For large noise: Blocky effects and Ringing artifacts
53
Shrinkage in the wavelet domain
(a) y (σ = 120) (b) Db2+LMMSE (c) Db2+Soft-T (d) Db2+Hard-T (e) Haar+Hard-T
For large noise: Blocky effects and Ringing artifacts
53
Undecimated wavelet transforms
Limits of the discrete wavelet transform
• While Fourier shrinkage is translation invariant:
ψ(y^τ) = ψ(y)^τ   where y^τ(s) = y(s + τ)
• Wavelet shrinkage is not translation invariant.
• This is due to the decimation step:
W = [ G↓2 ; H↓2 ] ∈ R^{n×n}   where M↓2 = "M(1:2:end, :)"
• This explains the blocky artifacts that we observe.
54
Undecimated discrete wavelet transform (UDWT)
Figure 1 – Haar DWT
• Haar transform groups pixels by clusters of 4.
• Blocks are treated independently of each other.
• When similar neighboring blocks are shrunk differently, this becomes clearly visible in the image.
• This arises all the more as the noise level increases.
What if we do not decimate?
⇒ UDWT, aka, stationary or translation-invariant wavelet transform.
55
Undecimated discrete wavelet transform (UDWT)
1-scale DWT
• For a 4 × 4 image:
4 × 4 coefficients.
• For n pixels: K = n coefficients.
1-scale UDWT
• For a 4 × 4 image:
8 × 8 coefficients.
• For n pixels: K = 4n coeffs.
What about multi-scale? 56
Undecimated discrete wavelet transform (UDWT)
A trous algorithm (with holes) (Holschneider et al., 1989)
Instead of decimating the coefficients at each scale j, upsample the filters h and g by injecting 2^j − 1 zeros between their entries.
57
Undecimated discrete wavelet transform (UDWT)
DWT: Mallat’s dyadic pyramidal multi-resolution scheme
UDWT: à trous algorithm – G:p: inject p zeros between the filter coefficients
Multi-scale: K = (1 + J(2^d − 1)) n coefficients (J: #scales, d = 2 for images)
58
Undecimated discrete wavelet transform (UDWT)
Implementation of 2D UDWT (A trous algorithm)
def udwt(x, J, h, g):
    if J == 0:
        return x[:, :, None]
    tmph = flip(convolve(flip(x), h)) / 2
    tmpg = flip(convolve(flip(x), g)) / 2
    z = np.stack((convolve(tmpg, h),
                  convolve(tmph, g),
                  convolve(tmph, h)), axis=2)
    coarse = convolve(tmpg, g)
    h2 = interleave0(h)
    g2 = interleave0(g)
    z = np.concatenate((udwt(coarse, J - 1, h2, g2), z), axis=2)
    return z
Linear complexity.
Can be easily modified to reduce memory usage.
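`interleave0` is not shown either; a plausible sketch (an assumption) is the à trous upsampling that inserts one zero between consecutive filter entries:

import numpy as np

def interleave0(f):
    # interleave0([a, b, c]) -> [a, 0, b, 0, c]
    out = np.zeros(2 * len(f) - 1)
    out[::2] = f
    return out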
59
Undecimated discrete wavelet transform (UDWT)
(a) DWT (J=2) (b) UDWT Coarse Sc. (c) Detailed Scale #1 (d) Detailed Scale #2
(e) Detailed Scale #3 (f) Detailed Scale #4 (g) Detailed Scale #5 (h) Detailed Scale #6
What about its inverse transform?
60
Undecimated discrete wavelet transform (UDWT)
DWT – Wavelet basis – and inverse DWT
• The DWT W ∈ R^{n×n} has n columns and n rows.
• The n columns/rows of W are orthonormal.
• The inverse DWT is W⁻¹ = W*.
• One-to-one relationship between an image and its wavelet coefficients.
UDWT – Redundant wavelet dictionary
• The UDWT W̄ ∈ R^{K×n} has K = (1 + J(2^d − 1)) n rows and n columns.
• The rows of ¯W cannot be linearly independent: not a basis.
• They are said to form a redundant/overcomplete wavelet dictionary.
• Since ¯W is non square, it is not invertible.
Note: redundant dictionaries necessarily favor sparsity.
61
Undecimated discrete wavelet transform (UDWT)
Pseudo-inverse UDWT
• Nevertheless, the n columns are orthonormal, hence W̄* = W̄⁺.
• It satisfies W̄⁺ W̄ = Id_n, but W̄ W̄⁺ ≠ Id_K.
• image → (W̄) → coefficients → (W̄⁺) → back to the original image,
• coefficients → (W̄⁺) → image → (W̄) → not necessarily the same coefficients.
• Satisfies the Parseval equality
⟨W̄x, W̄y⟩ = ⟨x, W̄*W̄y⟩ = ⟨x, W̄⁺W̄y⟩ = ⟨x, y⟩
• In the vocabulary of linear algebra, W̄ is called a tight frame.
Consequence: an algorithm for W̄⁺ can be obtained.
62
Undecimated discrete wavelet transform (UDWT)
Implementation of 2D Inverse UDWT
def iudwt(z, J, h, g):
    if J == 0:
        return z[:, :, 0]
    h2 = interleave0(h)
    g2 = interleave0(g)
    coarse = iudwt(z[:, :, :-3], J - 1, h2, g2)
    tmpg = convolve(coarse, g[::-1]) + \
           convolve(z[:, :, -3], h[::-1])
    tmph = convolve(z[:, :, -2], g[::-1]) + \
           convolve(z[:, :, -1], h[::-1])
    x = (flip(convolve(flip(tmpg), g[::-1])) +
         flip(convolve(flip(tmph), h[::-1]))) / 2
    return x
Linear complexity again.
Can also be easily modified to reduce memory usage.
Can we be more efficient?
63
Multi-scale discrete wavelets
Filter bank
• The UDWT of x for subband k, x → (W̄x)_k, is linear and translation invariant (LTI)
⇒ it is a convolution.
• The UDWT is a filter bank:
a set of band-pass filters that separates
the input image into multiple components.
• Each filter can be represented by its frequential response.
• Direct and inverse transform: implementation in the Fourier domain.
64
Undecimated discrete wavelet transform (UDWT)
Filters / Product / Subbands
(a) Coarse 2 (b) Details 2 (c) Details 2 (d) Details 2 (e) Details 1 (f) Details 1 (g) Details 1
Haar with J = 2 levels of decomposition
65
Undecimated discrete wavelet transform (UDWT)
Haar / Db2 / Db4 / Db8
(a) Coarse 2 (b) Details 2 (c) Details 2 (d) Details 2 (e) Details 1 (f) Details 1 (g) Details 1
Haar: band pass with side lobes. Db8: closer to ideal band pass.
66
Undecimated discrete wavelet transform (UDWT)
Db4 with J = 6 levels of decomposition
How to create such a filter bank?
67
Undecimated discrete wavelet transform (UDWT)
UDWT: Creation of the filter bank (offline)
def udwt_create_fb(n1, n2, J, h, g, ndim=3):
    if J == 0:
        return np.ones((n1, n2, 1, *[1] * (ndim - 2)))
    h2 = interleave0(h)
    g2 = interleave0(g)
    fbrec = udwt_create_fb(n1, n2, J - 1, h2, g2, ndim=ndim)
    gf1 = nf.fft(fftpad(g, n1), axis=0)
    hf1 = nf.fft(fftpad(h, n1), axis=0)
    gf2 = nf.fft(fftpad(g, n2), axis=0)
    hf2 = nf.fft(fftpad(h, n2), axis=0)
    fb = np.zeros((n1, n2, 4), dtype=np.complex128)
    fb[:, :, 0] = np.outer(gf1, gf2) / 2
    fb[:, :, 1] = np.outer(gf1, hf2) / 2
    fb[:, :, 2] = np.outer(hf1, gf2) / 2
    fb[:, :, 3] = np.outer(hf1, hf2) / 2
    fb = fb.reshape(n1, n2, 4, *[1] * (ndim - 2))
    fb = np.concatenate((fb[:, :, 0:1] * fbrec, fb[:, :, -3:]),
                        axis=2)
    return fb
68
Undecimated discrete wavelet transform (UDWT)
UDWT: Direct transform using the filter bank (online)
def fb_apply(x, fb):
    x = nf.fft2(x, axes=(0, 1))
    z = fb * x[:, :, None]
    z = np.real(nf.ifft2(z, axes=(0, 1)))
    return z
UDWT: Inverse transform using the filter bank (online)
def fb_adjoint(z, fb, squeeze=True):
    z = nf.fft2(z, axes=(0, 1))
    x = (np.conj(fb) * z).sum(axis=2)
    x = np.real(nf.ifft2(x, axes=(0, 1)))
    return x
Much more efficient than previous implementation when J > 1
69
Reconstruction with the UDWT
Shrinkage with UDWT
• Consider a denoising problem y = x + w with noise variance σ².
• Shrink the K > n coefficients independently:
x̂ = W̄⁺ ẑ (pseudo-inverse),  where ẑ_i = s(z_i; λ_i, σ_i) (shrinkage)  and  z = W̄ y (redundant representation)
Rule of thumb for soft-thresholding:
• For the orthonormal DWT W: increase λ_i as (√2)^{d(j_i − 1)}.
• For the tight-frame UDWT W̄: increase λ_i as 2^{d(j_i − 1/2)}.
(j_i: scale of coefficient i, d = 2 for images)
70
Reconstruction with the UDWT
(a) y (b) DWT(3)+Haar+HT (c) DWT(3)+Db2+HT
(e) UDWT(3)+Db2+HT 71
Reconstruction with the UDWT
(a) y (b) DWT(3)+Haar+HT (c) DWT(3)+Db2+HT
(d) UDWT(3)+Haar+HT (e) UDWT(3)+Db2+HT (f) UDWT(3)+Db8+HT 71
Reconstruction with the UDWT
(a) y (b) DWT(3)+Haar+HT (c) DWT(3)+Db2+HT
(d) UDWT(1)+Db2+HT (e) UDWT(3)+Db2+HT (f) UDWT(5)+Db2+HT 71
Reconstruction with the UDWT
(a) y (b) DWT(3)+Haar+HT (c) DWT(3)+Db2+HT
(d) UDWT(3)+Db2+Linear (e) UDWT(3)+Db2+HT (f) UDWT(3)+Db2+ST 71
Reconstruction with the UDWT
(a) y (σ = 20) (b) UDWT+Lin. (c) UDWT+HT (d) DWT+HT 72
Reconstruction with the UDWT
(a) y (σ = 40) (b) UDWT+Lin. (c) UDWT+HT (d) DWT+HT 72
Reconstruction with the UDWT
(a) y (σ = 60) (b) UDWT+Lin. (c) UDWT+HT (d) DWT+HT 72
Reconstruction with the UDWT
(a) y (σ = 120) (b) UDWT+Lin. (c) UDWT+HT (d) DWT+HT 72
Reconstruction with the UDWT
x̂ = W̄⁺ ẑ (pseudo-inverse),  where ẑ_i = s(z_i; λ_i, σ_i) (shrink K coefficients)  and  z = W̄ y (redundant representation)
Connection with Bayesian shrinkage?
• Since the rows of W̄ are linearly dependent, the coefficients z_i are necessarily correlated (non-white).
• We shrink the K > n coefficients independently, even though they cannot be assumed independent.
• This estimator has no Bayesian interpretation; it does not correspond to the MMSE or MAP.
How to use the UDWT in the Bayesian context?
73
Reconstruction with the UDWT
Bayesian analysis model
Whitening model: consider η = Λ^(−1/2) W x (η: coefficients)
such that E[η] = 0_n and Var[η] = Id_n.
Analysis: images can be transformed into white coefficients.
Does not make sense when the rows of W̄ are redundant.
Bayesian synthesis model
Generative model: consider x = W̄⁺ Λ^(1/2) η (η: code)
such that E[η] = 0_K and Var[η] = Id_K.
Synthesis: images can be generated from a white code.
Always well-founded.
74
Reconstruction with the UDWT
Forward model: y = x + w
Maximum a Posteriori for the Synthesis model
• Instead of looking for x, consider the MAP for the code η
η̂ ∈ argmax_{η ∈ R^K} p(η|y) = argmin_{η ∈ R^K} [− log p(y|η) − log p(η)] = argmin_{η ∈ R^K} (1/2)||y − W̄⁺ Λ^(1/2) η||²₂ − log p(η)
• Once you get η̂, generate the image x̂ as
x̂ = W̄⁺ Λ^(1/2) η̂
What interpretation?
75
Reconstruction with the UDWT
Penalized least square with redundant dictionary
• Consider the redundant wavelet dictionary D = W̄⁺ Λ^(1/2)
D = (d_1, d_2, …, d_K)  (linearly dependent atoms),  ||d_i|| = λ_i,  K > n
• Goal: look for a code η ∈ R^K such that x̂ is close to y
x̂ = Dη = Σ_{i=1}^K η_i d_i = "linear combination of the redundant atoms d_i of D"
• Since D is redundant, different codes η produce the same image x.
• Penalize each η_i independently to select a relevant one
η̂ ∈ argmin_{η ∈ R^K} (1/2)||y − W̄⁺ Λ^(1/2) η||²₂ − Σ_{i=1}^K log p(η_i)
What choice for log p(ηi)? 76
Reconstruction with the UDWT
Penalized least square with redundant dictionary
• (1/2)||y − Dη||²₂ + (τ²/2)||η||²₂,  with ||η||²₂ = Σ_i η_i²   ← Ridge regression
• (1/2)||y − Dη||²₂ + τ||η||₁,  with ||η||₁ = Σ_i |η_i|   ← LASSO
• (1/2)||y − Dη||²₂ + (τ²/2)||η||₀,  with ||η||₀ = Σ_i 1{η_i ≠ 0}   ← Sparse regression
(Plot of the three penalties.)
When D is redundant, these problems are no longer separable.
They require large-scale optimization techniques. 77
Regularizations and optimization
Ridge regression
Ridge/Smooth regression
• Convex energy: E(η) = (1/2)||y − Dη||²₂ + (τ²/2)||η||²₂
• Gradient: ∇E(η) = D*(Dη − y) + τ² η
• Optimality conditions: η̂ = (D*D + τ² Id_K)⁻¹ D* y
• For UDWT: this is an LTI filter ≡ convolution (non-adaptive)
(a) y (b) Linear shrink (c) Ridge (d) Difference
Ridge ≡ Linear shrinkage (except if D is orthogonal).
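For a generic dictionary stored as a dense matrix, the ridge closed form above is a one-liner (a sketch; in practice D = W̄⁺Λ^(1/2) is applied through filter banks rather than stored):

import numpy as np

def ridge(y, D, tau):
    # eta_hat = (D* D + tau^2 Id_K)^{-1} D* y
    K = D.shape[1]
    return np.linalg.solve(D.T @ D + tau**2 * np.eye(K), D.T @ y)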
78
Sparse regression
Sparse regression / 0 regularization (1/3)
• Energy: E(η) = (1/2)||y − Dη||²₂ + (τ²/2)||η||₀
• Penalty: ||η||₀ = # of non-zero elements in η
• Non-convex: 0.5 = (1/2)(||0||₀ + ||1||₀) < ||0.5||₀ = 1
• Produces optimal sparse solutions adapted to the signal
• But non-differentiable and discontinuous
(Plot of the ℓ0 penalty.)
79
Sparse regression
Sparse regression / 0 regularization (2/3)
• If D is orthogonal: the solution is given by hard-thresholding.
• Otherwise, the exact solution is obtained by brute force:
• For each possible support I ⊆ {1, …, K} (set of non-zero coefficients),
• solve the least-squares estimation problem
argmin_{(η_i)_{i∈I}} (1/2)||y − Σ_{i∈I} η_i d_i||²₂
• Pick the solution that minimizes E.
• NP-hard combinatorial problem: #subsets = Σ_{k=0}^K (K choose k) = 2^K
80
Sparse regression
Sparse regression / 0 regularization (3/3)
• Sub-optimal solutions can be obtained by greedy algorithms.
• Matching pursuit (MP) (Mallat, 1993):
1. Initialization: r ← y, η ← 0, k ← 0
2. Choose i maximizing |D*r|_i = |⟨d_i, r⟩|
3. Compute α = ⟨r, d_i⟩ / ||d_i||²₂
4. Update r ← r − α d_i
5. Update η_i = α
6. Update k ← k + 1
7. Go back to step 2 while E(η) = (1/2)||r||²₂ + (τ²/2) k decreases
• Many iterations: complexity O(kn), with k the sparsity of the solution.
• Each iteration requires computing an inverse UDWT.
• Extensions: OMP (Tropp & Gilbert, 2007), CoSaMP (Needel & Tropp, 2009)
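A minimal matching-pursuit sketch for a dictionary given as a dense matrix D with columns d_i (illustrative; the UDWT case applies D and D* through the transforms above rather than storing the matrix):

import numpy as np

def matching_pursuit(y, D, tau, max_iter=1000):
    r, eta, k = y.copy(), np.zeros(D.shape[1]), 0
    energy = 0.5 * r @ r
    for _ in range(max_iter):
        c = D.T @ r                               # correlations <d_i, r>
        i = np.argmax(np.abs(c))                  # best matching atom
        alpha = c[i] / (D[:, i] @ D[:, i])
        r_new = r - alpha * D[:, i]
        new_energy = 0.5 * r_new @ r_new + 0.5 * tau**2 * (k + 1)
        if new_energy >= energy:                  # stop when E no longer decreases
            break
        r, energy, k = r_new, new_energy, k + 1
        eta[i] = alpha
    return eta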
81
Least Absolute Shrinkage and Selection Operator (LASSO)
Convex relaxation: Take the best of both worlds: sparsity and convexity
LASSO / 1 regularization (Tibshirani 1996)
• Convex energy: E(η) = (1/2)||y − Dη||²₂ + τ||η||₁
• Non-smooth penalty: ||η||₁ = Σ_{i=1}^K |η_i|
• If D is orthogonal: the solution is given by soft-thresholding.
• Also produces sparse solutions adapted to the signal
(Plot of the ℓ1 penalty.)
82
Least Absolute Shrinkage and Selection Operator
(a) Input (b) ST+UDWT (1s) (c) LASSO+UDWT (30s) (d) Difference
Though the solutions look alike, their codes η are very different.
83
Least Absolute Shrinkage and Selection Operator
Input / ST+UDWT / LASSO+UDWT
(a) Image (b) Coarse 5 (c) Scale 5 (d) Scale 5 (e) Scale 5 (f) Scale 1 (g) Scale 1
The LASSO creates much sparser codes than ST only.
84
Least Absolute Shrinkage and Selection Operator
Why use the LASSO if shrinkage in the UDWT provides similar results?
• Shrinkage in the UDWT domain can only be applied to denoising problems.
• The LASSO can be adapted to inverse problems:
x̂ = Dη̂  with  η̂ ∈ argmin_{η ∈ R^K} (1/2)||y − HDη||²₂ + τ||η||₁
But it requires solving a non-smooth convex optimization problem.
Solution: use sub-differential and Fermat’s rule.
85
Non-smooth convex optimization
Definition (Sub-differential)
• Let f : R^n → R be a convex function. u ∈ R^n is a sub-gradient of f at x* if, for all x ∈ R^n,
f(x) ≥ f(x*) + ⟨u, x − x*⟩.
• The sub-differential is the set of sub-gradients
∂f(x*) = {u ∈ R^n : ∀x ∈ R^n, f(x) ≥ f(x*) + ⟨u, x − x*⟩}.
If the sub-gradient is unique, f is differentiable and ∂f(x) = {∇f(x)}.
86
Non-smooth convex optimization
Theorem (Fermat’s rule)
Let f : R^n → R be a convex function. Then
x* ∈ argmin_{x ∈ R^n} f(x)  ⇔  0_n ∈ ∂f(x*)
If f is also differentiable, this corresponds to the standard rule ∇f(x*) = 0_n.
Minimizers are the only points with a horizontal tangent.
87
Non-smooth convex optimization
Function (abs):  f : R → R,  x → |x|
Sub-differential (sign):
∂f(x*) = {−1} if x* ∈ (−∞, 0);  {+1} if x* ∈ (0, ∞);  [−1, 1] if x* = 0
88
Non-smooth convex optimization
Proximal operator
• Let f : R^n → R be a convex function (+ some technical conditions). The proximal operator of f is
Prox_f(x) = argmin_{z ∈ R^n} (1/2)||z − x||²₂ + f(z)
• Remark: this minimization problem always has a unique solution, so the proximal operator is unambiguously a function R^n → R^n.
• Always non-expansive:
||Prox_f(x_1) − Prox_f(x_2)|| ≤ ||x_1 − x_2||
• Can be interpreted as a denoiser/shrinkage for the regularity f.
Property
Prox_{γf}(x) = (Id + γ∂f)⁻¹ x
89
Non-smooth convex optimization
Proof.
z = argmin_{z} (1/2)||z − x||²₂ + γf(z)  ⇔  0 ∈ ∂((1/2)||z − x||²₂ + γf(z))
⇔  0 ∈ ∂((1/2)||z − x||²₂) + γ∂f(z)
⇔  0 ∈ z − x + γ∂f(z)
⇔  x ∈ z + γ∂f(z)
⇔  x ∈ (Id + γ∂f)(z)
⇔  z = (Id + γ∂f)⁻¹ x
Even though ∂f(x) is a set, the pre-image by Id + γ∂f is unique.
90
Non-smooth convex optimization
Soft-thresholding
Prox_{γ|·|}(x) = argmin_{z ∈ R} (1/2)(z − x)² + γ|z| = (Id + γ∂|·|)⁻¹ x
= { x − γ if x > γ;  x + γ if x < −γ;  0 otherwise }
91
Non-smooth convex optimization
Proximal operator of simple functions
Name                         f(x)                        Prox_γf(x)
Indicator of convex set C    0 if x ∈ C, ∞ otherwise     Proj_C(x)
Square                       (1/2)||x||²₂                x / (1 + γ)
Abs                          ||x||₁                      Soft-T(x, γ)
Euclidean                    ||x||₂                      (1 − γ / max(||x||₂, γ)) x
Square + Affine              (1/2)||Ax + b||²₂           (Id + γA*A)⁻¹ (x − γA*b)
Separability, x = (x_1; x_2) g(x_1) + h(x_2)             (Prox_γg(x_1); Prox_γh(x_2))
More exhaustive list: http://proximity-operator.net
92
Non-smooth convex optimization
Proximal minimization
• Let f : R^n → R be a convex function (+ some technical conditions). Then, whatever the initialization x⁰ and γ > 0, the sequence
x^{k+1} = Prox_γf(x^k)
converges towards a global minimizer of f, where
Prox_γf(x^k) = (Id + γ∂f)⁻¹ x^k = argmin_z (1/2)||z − x^k||²₂ + γf(z)
Compared to gradient descent
• No need to be differentiable,
• No need to have a Lipschitz gradient,
• Works whatever the parameter γ,
• Requires solving an optimization problem at each step.
93
Non-smooth convex optimization
Gradient descent:
read x^k on the x-axis and evaluate its image by the function x − γ∇F(x).
94
Non-smooth convex optimization
Proximal minimization:
Look at the set x + γ∂F(x)
94
Non-smooth convex optimization
Proximal minimization:
read x^k on the y-axis and evaluate its pre-image by x + γ∇F(x).
95
Non-smooth convex optimization
Proximal minimization:
the larger γ the faster, but the inversion becomes harder (ill-conditioned).
95
Non-smooth convex optimization
Toy example
• Consider the smoothing regularization problem
F(x) = (1/2)||∇x||²₂,₂
• Its sub-gradient is thus given by
∂F(x) = {∇F(x)} = {−∆x}
• The proximal minimization reads as
x^{k+1} = (Id + γ∂F)⁻¹ x^k = (Id − γ∆)⁻¹ x^k
• This is exactly the implicit Euler scheme for the heat equation.
96
Non-smooth convex optimization
Can we apply proximal minimization for the LASSO?
Proximal splitting methods
• The proximal operator may not have a closed form.
• Computing it may be as difficult as solving the original problem
• Solution: use proximal splitting methods, a family of techniques
developed for non-smooth convex problems.
• Idea: split the problem into subproblems, that involve
• gradient descent steps for smooth terms,
• proximal steps for simple convex terms.
97
Non-smooth convex optimization
min_{x ∈ R^n} {E(x) = F(x) + G(x)}
Proximal forward-backward algorithm
• Assume F is convex and differentiable with L-Lipschitz gradient
||∇F(x_1) − ∇F(x_2)||₂ ≤ L ||x_1 − x_2||₂,  for all x_1, x_2.
• Assume G is convex and simple, i.e., its prox is known in closed form
Prox_γG(x) = argmin_z (1/2)||z − x||²₂ + γG(z)
• The proximal forward-backward algorithm reads
x^{k+1} = Prox_γG(x^k − γ∇F(x^k))
• For 0 < γ < 2/L, it converges to a minimizer of E = F + G.
Aka, explicit-implicit scheme by analogy with PDE discretization schemes.
98
Non-smooth convex optimization
The LASSO problem:  E(η) = (1/2)||y − Aη||²₂ [= F(η)] + τ||η||₁ [= G(η)],  with ||η||₁ = Σ_i |η_i| and A = HD
Iterative Soft-Thresholding Algorithm (ISTA) (Daubechies, 2004)
• F is convex and differentiable with L-Lipschitz gradient
∇F(η) = A*(Aη − y)  with  L = ||A||²₂
• G is convex and simple, in fact separable:
Prox_γG(η)_i = Soft-T(η_i, γτ)
• The proximal forward-backward algorithm reads, for 0 < γ < 2/L,
η^{k+1} = Soft-T(η^k − γ(A*Aη^k − A*y), γτ)
and is known as the Iterative Soft-Thresholding Algorithm (ISTA).
• Finally: x̂ = Dη̂
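A minimal ISTA sketch with A as a dense matrix (illustrative; in practice A = HD is applied through FFTs and filter banks):

import numpy as np

def soft_thresh(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0)

def ista(y, A, tau, n_iter=300):
    L = np.linalg.norm(A, 2)**2          # Lipschitz constant ||A||_2^2
    gamma = 1.0 / L                      # any 0 < gamma < 2/L works
    eta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ eta - y)
        eta = soft_thresh(eta - gamma * grad, gamma * tau)
    return eta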
99
Non-smooth convex optimization
Preconditioned ISTA (1/2)
• Remark:
η̂ ∈ argmin_{η ∈ R^K} (1/2)||y − Aη||²₂ + τ||η||₁,   A = H W̄⁺ Λ^(1/2) = HD
  ∈ argmin_{η ∈ R^K} (1/2)||y − H W̄⁺ Λ^(1/2) η||²₂ + τ||η||₁
• Λ^(1/2) is invertible: bijection between z = Λ^(1/2) η and η = Λ^(−1/2) z
• Solving for η is equivalent to solving a weighted LASSO for z
ẑ ∈ argmin_{z ∈ R^K} (1/2)||y − H W̄⁺ z||²₂ + τ||Λ^(−1/2) z||₁
  ∈ argmin_{z ∈ R^K} (1/2)||y − Bz||²₂ + Σ_{i=1}^K (τ/λ_i)|z_i|,   B = H W̄⁺
• In practice, this equivalent problem has better conditioning.
100
Non-smooth convex optimization
Equivalent to:  E(z) = (1/2)||y − Bz||²₂ [= F(z)] + τ||Λ^(−1/2) z||₁ [= G(z) = Σ_i (τ/λ_i)|z_i|],  with B = H W̄⁺
Preconditioned ISTA (2/2)
∇F(z) = B*(Bz − y)  with  L = ||B||²₂
Prox_γG(z)_i = Soft-T(z_i, γτ/λ_i)
• ISTA becomes, for 0 < γ < 2/L,
z^{k+1} = Soft-T(z^k − γ(B*Bz^k − B*y), γτ/λ_i)
• Finally: x̂ = W̄⁺ ẑ
• Leads to larger steps γ, better conditioning, and faster convergence.
101
Non-smooth convex optimization
z^{k+1} = Prox_γG(z^k − γ∇F(z^k)) = Soft-T(z^k − γ(B*Bz^k − B*y), γτ/λ_i),  with B = H W̄⁺
Bredies & Lorenz (2007): E(z^k) − E(z*) decays with rate O(1/k)
Fast ISTA (FISTA)
z^{k+1} = Prox_γG(z̃^k − γ∇F(z̃^k))
z̃^{k+1} = z^{k+1} + ((t_k − 1)/t_{k+1}) (z^{k+1} − z^k)
t_{k+1} = (1 + √(1 + 4 t_k²)) / 2,   t_0 = 1
Beck & Teboulle (2009): E(z^k) − E(z*) decays with rate O(1/k²)
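The corresponding FISTA iteration is a small change around the same step (a self-contained sketch with a dense A):

import numpy as np

def fista(y, A, tau, n_iter=300):
    L = np.linalg.norm(A, 2)**2
    gamma, t = 1.0 / L, 1.0
    eta_prev = np.zeros(A.shape[1])
    z = eta_prev.copy()
    for _ in range(n_iter):
        w = z - gamma * (A.T @ (A @ z - y))                        # gradient step on F
        eta = np.sign(w) * np.maximum(np.abs(w) - gamma * tau, 0)  # soft-thresholding
        t_next = 0.5 * (1 + np.sqrt(1 + 4 * t**2))
        z = eta + (t - 1) / t_next * (eta - eta_prev)              # momentum step
        eta_prev, t = eta, t_next
    return eta_prev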
102
Non-smooth convex optimization
(a) Input y: motion blur + noise (σ = 2)   (b) Convergence profiles of ISTA and FISTA (energy gap vs. iterations, log scale)
(c) Deconvolution ISTA(300)+UDWT (d) Deconvolution FISTA(300)+UDWT 103
Non-smooth convex optimization
Input / ISTA(300) / FISTA(300)
(a) Image (b) Scale 4 (c) Scale 4 (d) Scale 4 (e) Scale 3 (f) Scale 3 (g) Scale 3
FISTA converges faster: sparser codes given a limited time budget
104
Sparsity: synthesis vs analysis
Sparse reconstruction: synthesis vs analysis
Sparse synthesis model with UDWT
• LASSO: η̂ ∈ argmin_{η ∈ R^K} (1/2)||y − H W̄⁺ Λ^(1/2) η||²₂ + τ||η||₁
• Using the change of variable η = Λ^(−1/2) z:
ẑ ∈ argmin_{z ∈ R^K} (1/2)||y − H W̄⁺ z||²₂ + τ||Λ^(−1/2) z||₁
Sparse analysis model with UDWT
• What about
x̂ ∈ argmin_{x ∈ R^n} (1/2)||y − Hx||²₂ + τ||Λ^(−1/2) W̄ x||₁ ?
• The change of variable η = Λ^(−1/2) W̄ x is not one-to-one.
• The two problems are not equivalent (unless W̄ is invertible).
105
Sparse reconstruction: synthesis vs analysis
106
Sparse reconstruction: synthesis vs analysis
Analysis versus synthesis (Elad, Milanfar, Rubinstein, 2007)
Generative: generate good images
x̂ = Dη̂  with  η̂ ∈ argmin_{η ∈ R^K} (1/2)||y − HDη||²₂ + τ||η||_p^p,  p ≥ 0
Synthesis: images are linear combinations of a few columns of D.
Bayesian interpretation: MAP for the sparse code η.
Discriminative: discriminate between good and bad images
x̂ ∈ argmin_{x ∈ R^n} (1/2)||y − Hx||²₂ + τ||Γx||_p^p,  p ≥ 0
Analysis: images are correlated with a few rows of Γ.
Bayesian interpretation: MAP for x with an improper Gibbs prior.
107
Sparse reconstruction: synthesis vs analysis
η̂ ∈ argmin_{η ∈ R^K} (1/2)||y − HDη||²₂ + τ||η||_p^p   (ℓp-synthesis)
x̂ ∈ argmin_{x ∈ R^n} (1/2)||y − Hx||²₂ + τ||Γx||_p^p   (ℓp-analysis)
Common properties
p value      Solution         Problem
p = 0        Optimal sparse   Non-convex & discontinuous (NP-hard)
0 < p < 1    Sparse           Non-convex & continuous but non-smooth
p = 1        Sparse           Convex & continuous but non-smooth
p > 1        Smooth           Convex & differentiable
p = 2        Linear           Quadratic
• Γ square and invertible ⇒ equivalent for D = Γ⁻¹.
• Γ full-rank and p = 2 ⇒ equivalent for D = Γ⁺.
• LTI dictionaries ⇒ redundant filter bank.
108
Sparse reconstruction: synthesis vs analysis
η̂ ∈ argmin_{η ∈ R^K} (1/2)||y − HDη||²₂ + τ||η||_p^p   (ℓp-synthesis)
x̂ ∈ argmin_{x ∈ R^n} (1/2)||y − Hx||²₂ + τ||Γx||_p^p   (ℓp-analysis)
Synthesis
• D: synthesis dictionary.
• Atoms need to span images.
⇒ Low- & high-pass filters
⇒ Im[D] ≈ R^n
• Redundancy favors sparsity.
• K-dimensional problem (> n).
• Prior separable.
Analysis
• Γ: analysis dictionary.
• Atoms need to sparsify images.
⇒ High-pass filters only
⇒ Ker[Γ] ≠ {0} (⊃ DC, coarse)
• Redundancy decreases sparsity.
• n-dimensional problem (< K).
• Prior non-separable.
Quiz: what analysis dictionary is LTI and not too redundant?
109
Sparse reconstruction: synthesis vs analysis
η̂ ∈ argmin_{η ∈ R^K} (1/2)||y − HDη||²₂ + τ||η||_p^p   (ℓp-synthesis)
x̂ ∈ argmin_{x ∈ R^n} (1/2)||y − Hx||²₂ + τ||Γx||_p^p   (ℓp-analysis)
Link between analysis models and variational methods
• p = 2: Analysis model = Tikhonov regularization.
• p = 1 & Γ = ∇: Analysis model = anisotropic Total-Variation (TV)
TV filter bank = horizontal and vertical gradients
Spatial filter bank / Spectral f.b. (real part) / Spectral f.b. (imaginary part)
Can we use proximal forward-backward for the ℓ1-analysis prior?
110
Non-smooth optimization
x̂ ∈ argmin_{x ∈ R^n} (1/2)||y − Hx||²₂ [= F(x)] + τ||Γx||₁ [= G(x)]   (ℓ1-analysis)
Proximal forward-backward for the ℓ1-analysis problem?
• F is convex and differentiable
• G is convex but not simple (not separable)
→ cannot use proximal forward-backward
• Exception: for denoising, H = Id_n (see Chambolle's algorithm, 2004)
Need another proximal optimization technique.
111
Non-smooth optimization
min_{x ∈ R^n} {E(x) = F(x) + G(x)}
Alternating direction method of multipliers (ADMM) (∼1970)
• Assume F and G are convex and simple (+ some mild conditions).
• For any initialization x⁰, x̃⁰ and d⁰, the ADMM algorithm reads as
x^{k+1} = Prox_γF(x̃^k + d^k)
x̃^{k+1} = Prox_γG(x^{k+1} − d^k)
d^{k+1} = d^k − x^{k+1} + x̃^{k+1}
• For γ > 0, x^k converges to a minimizer of E = F + G.
Fast version: FADMM, similar idea as for FISTA (Goldstein et al., 2014).
Related concepts: Lagrange multipliers, Duality, Legendre transform.
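The iteration translates directly into code once the two proximal operators are available as callables (a generic sketch; `prox_f` and `prox_g` are hypothetical helpers):

import numpy as np

def admm(prox_f, prox_g, x0, gamma=1.0, n_iter=100):
    x_tilde = x0.copy()
    d = np.zeros_like(x0)
    for _ in range(n_iter):
        x = prox_f(x_tilde + d, gamma)       # x update
        x_tilde = prox_g(x - d, gamma)       # x~ update
        d = d - x + x_tilde                  # dual/residual update
    return x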
How to use it for ℓ1-analysis priors?
112
Non-smooth optimization
x̂ ∈ argmin_{x ∈ R^n} (1/2)||y − Hx||²₂ + τ||Γx||₁   (ℓ1-analysis)
ADMM + variable splitting (1/3)
• Define: X = (x; z) ∈ R^{n+K}
• Consider: E(X) = F(X) + G(X) with
F(x; z) = (1/2)||y − Hx||²₂ + τ||z||₁
G(x; z) = 0 if Γx = z, ∞ otherwise
• Remark 1: minimizing E solves the ℓ1-analysis problem.
• Remark 2: F and G are convex and simple ⇒ ADMM applies.
113
Non-smooth optimization
x̂ ∈ argmin_{x ∈ R^n} (1/2)||y − Hx||²₂ + τ||Γx||₁   (ℓ1-analysis)
ADMM + variable splitting (2/3)
Applying the formulas from slide 92:
F(x; z) = (1/2)||y − Hx||²₂ + τ||z||₁
→ Prox_γF(x; z) = ( (Id_n + γH*H)⁻¹ (x + γH*y) ;  Soft-T(z, γτ) )
G(x; z) = 0 if Γx = z, ∞ otherwise   (indicator of the convex set C = {(x, z) : Γx = z})
→ Prox_γG(x; z) = ( Id_n ; Γ ) (Id_n + Γ*Γ)⁻¹ (x + Γ*z)   (projection onto C)
114
Non-smooth optimization
x̂ ∈ argmin_{x ∈ R^n} (1/2)||y − Hx||²₂ + τ||Γx||₁   (ℓ1-analysis)
ADMM + variable splitting (3/3)
x^{k+1} = (Id_n + γH*H)⁻¹ (x̃^k + d_x^k + γH*y)
z^{k+1} = Soft-T(z̃^k + d_z^k, γτ)
x̃^{k+1} = (Id_n + Γ*Γ)⁻¹ (x^{k+1} − d_x^k + Γ*(z^{k+1} − d_z^k))
z̃^{k+1} = Γ x̃^{k+1}
d_x^{k+1} = d_x^k − x^{k+1} + x̃^{k+1}
d_z^{k+1} = d_z^k − z^{k+1} + z̃^{k+1}
If H is a blur and Γ a filter bank, (Id_n + γH*H)⁻¹ and (Id_n + Γ*Γ)⁻¹ can be computed in the Fourier domain in O(n log n).
115
Non-smooth optimization
x̂ ∈ argmin_{x ∈ R^n} (1/2)||y − Hx||²₂ + τ||Γx||₁   (ℓ1-analysis)
Application to Total-Variation: Γ = ∇
x^{k+1} = (Id_n + γH*H)⁻¹ (x̃^k + d_x^k + γH*y)
z^{k+1} = Soft-T(z̃^k + d_z^k, γτ)
x̃^{k+1} = (Id_n + ∇*∇)⁻¹ (x^{k+1} − d_x^k + ∇*(z^{k+1} − d_z^k))
z̃^{k+1} = ∇ x̃^{k+1}
d_x^{k+1} = d_x^k − x^{k+1} + x̃^{k+1}
d_z^{k+1} = d_z^k − z^{k+1} + z̃^{k+1}
with ∇* = −div and ∇*∇ = −∆
116
Non-smooth optimization
Since ∇* = −div and ∇*∇ = −∆, the x̃ update above can equivalently be written as
x̃^{k+1} = (Id_n − ∆)⁻¹ (x^{k+1} − d_x^k − div(z^{k+1} − d_z^k))
with the other updates unchanged.
116
Non-smooth optimization
x̂ ∈ argmin_{x ∈ R^n} (1/2)||y − Hx||²₂ + τ||Γx||₁   (ℓ1-analysis)
Application to sparse analysis with UDWT: Γ = Λ^(−1/2) W̄
x^{k+1} = (Id_n + γH*H)⁻¹ (x̃^k + d_x^k + γH*y)
z^{k+1} = Soft-T(z̃^k + d_z^k, γτ/λ_i)
x̃^{k+1} = (Id_n + W̄*W̄)⁻¹ (x^{k+1} − d_x^k + W̄*(z^{k+1} − d_z^k))
z̃^{k+1} = W̄ x̃^{k+1}
d_x^{k+1} = d_x^k − x^{k+1} + x̃^{k+1}
d_z^{k+1} = d_z^k − z^{k+1} + z̃^{k+1}
Tight frame: W̄*W̄ = Id_n
117
Non-smooth optimization
ˆx ∈ argmin_{x ∈ R^n}  (1/2) ||y − Hx||_2^2 + τ ||Γx||_1      (ℓ1-analysis)

Application to sparse analysis with UDWT: Γ = Λ^{−1/2} W̄

    x^{k+1}  = (Id_n + γH*H)^{-1}(x̃^k + d_x^k + γH*y)
    z^{k+1}  = Soft-T(z̃^k + d_z^k, γτ/λ_i)
    x̃^{k+1} = (1/2) ( x^{k+1} − d_x^k + W̄*(z^{k+1} − d_z^k) )
    z̃^{k+1} = W̄ x̃^{k+1}
    d_x^{k+1} = d_x^k − x^{k+1} + x̃^{k+1}
    d_z^{k+1} = d_z^k − z^{k+1} + z̃^{k+1}

Tight frame: W̄*W̄ = Id_n, hence (Id_n + W̄*W̄)^{-1} = (1/2) Id_n.
117
Sparse analysis – Results
Deconvolution with UDWT (5 levels, Db2)
(a) Blurry image y (noise σ = 2) (b) Synthesis (FISTA) (c) Analysis (FADMM)
118
Sparse analysis – Results
(Rows: Synthesis / Analysis)
(a) 1 level (b) 2 levels (c) 3 levels (d) 4 levels (e) 5 levels
Analysis requires fewer decomposition levels
⇒ leads to faster algorithms.
119
Sparse analysis – Results
(a) Noisy (σ = 40) (b) Analysis UDWT(4) (c) + block thresholding (orientations + colors) (d) Difference
• As for TV, group coefficients across orientations/colors using the ℓ2,1 norm: ||Γx||_{2,1}
• The soft-thresholding becomes the group soft-thresholding:

    Prox_{γ||·||_{2,1}}(z)_i =  z_i − γ z_i / ||z_i||_2    if ||z_i||_2 > γ
                                0                          otherwise
120
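A minimal NumPy sketch of this group soft-thresholding (the data layout, with grouped coefficients stored along the first axis, is an assumption):

import numpy as np

def group_soft_thresh(z, gamma, eps=1e-12):
    # Group soft-thresholding: axis 0 of z indexes the coefficients grouped
    # together (e.g. gradient orientations and/or color channels).
    norm = np.sqrt(np.sum(z ** 2, axis=0, keepdims=True))      # ||z_i||_2 per group
    scale = np.maximum(1 - gamma / np.maximum(norm, eps), 0)   # (1 - gamma/||z_i||_2)_+
    return scale * z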
Shrinkage, Sparsity and Wavelets – What’s next?
Reminder from last class:
Modeling the distribution of images is complex (large number of degrees of freedom).
Applying the LMMSE on patches → increases performance.
Next class:
What if we use sparse priors, not for the distribution of images,
but for the distribution of patches?
121
Shrinkage, Sparsity and Wavelets – Further reading
For further reading
Sparsity, shrinkage and recovery guarantee:
• Donoho & Johnstone (1994); Moulin & Liu (1999); Donoho and Elad (2003);
Gribonval and Nielsen (2003); Candès and Tao (2005); Zhang (2008); Candès
and Romberg (2007).
• Book: Statistical Learning with Sparsity (Hastie, Tibshirani, Wainwright, 2015).
Wavelet related transforms:
• Warblet/Chirplet (Mann, Mihovilovic et al., 1991–1992), Curvelet (Candès &
Donoho, 2000), Noiselet (Coifman, 2001), Contourlet (Do & Vetterli, 2002),
Ridgelet (Do & Vetterli, 2003), Shearlets (Kanghui et al., 2005), Bandelet (Le
Pennec, Peyré, Mallat, 2005), Empirical wavelets (Gilles, 2013).
• Book: A wavelet tour of signal processing (Mallat, 2008)
Non-smooth convex optimization:
• Douglas-Rachford splitting (Combettes & Pesquet, 2007), Split Bregman
(Goldstein & Osher, 2009), Primal-Dual (Chambolle & Pock, 2011), Generalized
FB (Raguet et al., 2013), Condat algorithm (2014).
• Book: Convex Optimization (Boyd, 2004).
122
Questions?
Next class: Patch models and dictionary learning
Sources and images courtesy of
• L. Condat
• G. Peyré
• J. Salmon
• Wikipedia
122
More Related Content

What's hot

Double Integral Powerpoint
Double Integral PowerpointDouble Integral Powerpoint
Double Integral Powerpoint
oaishnosaj
 
DISTINGUISH BETWEEN WALSH TRANSFORM AND HAAR TRANSFORMDip transforms
DISTINGUISH BETWEEN WALSH TRANSFORM AND HAAR TRANSFORMDip transformsDISTINGUISH BETWEEN WALSH TRANSFORM AND HAAR TRANSFORMDip transforms
DISTINGUISH BETWEEN WALSH TRANSFORM AND HAAR TRANSFORMDip transforms
NITHIN KALLE PALLY
 
Optimal multi-configuration approximation of an N-fermion wave function
 Optimal multi-configuration approximation of an N-fermion wave function Optimal multi-configuration approximation of an N-fermion wave function
Optimal multi-configuration approximation of an N-fermion wave function
jiang-min zhang
 
Lesson18 Double Integrals Over Rectangles Slides
Lesson18   Double Integrals Over Rectangles SlidesLesson18   Double Integrals Over Rectangles Slides
Lesson18 Double Integrals Over Rectangles Slides
Matthew Leingang
 
Lesson 19: Double Integrals over General Regions
Lesson 19: Double Integrals over General RegionsLesson 19: Double Integrals over General Regions
Lesson 19: Double Integrals over General RegionsMatthew Leingang
 
inverse z-transform ppt
inverse z-transform pptinverse z-transform ppt
inverse z-transform ppt
mihir jain
 
Crib Sheet AP Calculus AB and BC exams
Crib Sheet AP Calculus AB and BC examsCrib Sheet AP Calculus AB and BC exams
Crib Sheet AP Calculus AB and BC exams
A Jorge Garcia
 
5.4 Saddle-point interpretation, 5.5 Optimality conditions, 5.6 Perturbation ...
5.4 Saddle-point interpretation, 5.5 Optimality conditions, 5.6 Perturbation ...5.4 Saddle-point interpretation, 5.5 Optimality conditions, 5.6 Perturbation ...
5.4 Saddle-point interpretation, 5.5 Optimality conditions, 5.6 Perturbation ...
RyotaroTsukada
 
2.3 Operations that preserve convexity & 2.4 Generalized inequalities
2.3 Operations that preserve convexity & 2.4 Generalized inequalities2.3 Operations that preserve convexity & 2.4 Generalized inequalities
2.3 Operations that preserve convexity & 2.4 Generalized inequalities
RyotaroTsukada
 
Newton's Forward/Backward Difference Interpolation
Newton's Forward/Backward  Difference InterpolationNewton's Forward/Backward  Difference Interpolation
Newton's Forward/Backward Difference Interpolation
VARUN KUMAR
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
The Statistical and Applied Mathematical Sciences Institute
 
Applied Calculus Chapter 4 multiple integrals
Applied Calculus Chapter  4 multiple integralsApplied Calculus Chapter  4 multiple integrals
Applied Calculus Chapter 4 multiple integrals
J C
 
Lesson 10: The Chain Rule (slides)
Lesson 10: The Chain Rule (slides)Lesson 10: The Chain Rule (slides)
Lesson 10: The Chain Rule (slides)
Matthew Leingang
 
Murphy: Machine learning A probabilistic perspective: Ch.9
Murphy: Machine learning A probabilistic perspective: Ch.9Murphy: Machine learning A probabilistic perspective: Ch.9
Murphy: Machine learning A probabilistic perspective: Ch.9
Daisuke Yoneoka
 
DOUBLE INTEGRALS PPT GTU CALCULUS (2110014)
DOUBLE INTEGRALS PPT GTU CALCULUS (2110014) DOUBLE INTEGRALS PPT GTU CALCULUS (2110014)
DOUBLE INTEGRALS PPT GTU CALCULUS (2110014)
Panchal Anand
 
Lesson 21: Curve Sketching (slides)
Lesson 21: Curve Sketching (slides)Lesson 21: Curve Sketching (slides)
Lesson 21: Curve Sketching (slides)
Matthew Leingang
 
Lesson 27: Integration by Substitution (slides)
Lesson 27: Integration by Substitution (slides)Lesson 27: Integration by Substitution (slides)
Lesson 27: Integration by Substitution (slides)
Matthew Leingang
 

What's hot (19)

Double Integral Powerpoint
Double Integral PowerpointDouble Integral Powerpoint
Double Integral Powerpoint
 
DISTINGUISH BETWEEN WALSH TRANSFORM AND HAAR TRANSFORMDip transforms
DISTINGUISH BETWEEN WALSH TRANSFORM AND HAAR TRANSFORMDip transformsDISTINGUISH BETWEEN WALSH TRANSFORM AND HAAR TRANSFORMDip transforms
DISTINGUISH BETWEEN WALSH TRANSFORM AND HAAR TRANSFORMDip transforms
 
Optimal multi-configuration approximation of an N-fermion wave function
 Optimal multi-configuration approximation of an N-fermion wave function Optimal multi-configuration approximation of an N-fermion wave function
Optimal multi-configuration approximation of an N-fermion wave function
 
Lesson18 Double Integrals Over Rectangles Slides
Lesson18   Double Integrals Over Rectangles SlidesLesson18   Double Integrals Over Rectangles Slides
Lesson18 Double Integrals Over Rectangles Slides
 
Lesson 19: Double Integrals over General Regions
Lesson 19: Double Integrals over General RegionsLesson 19: Double Integrals over General Regions
Lesson 19: Double Integrals over General Regions
 
Double integration
Double integrationDouble integration
Double integration
 
inverse z-transform ppt
inverse z-transform pptinverse z-transform ppt
inverse z-transform ppt
 
Chapter 4 (maths 3)
Chapter 4 (maths 3)Chapter 4 (maths 3)
Chapter 4 (maths 3)
 
Crib Sheet AP Calculus AB and BC exams
Crib Sheet AP Calculus AB and BC examsCrib Sheet AP Calculus AB and BC exams
Crib Sheet AP Calculus AB and BC exams
 
5.4 Saddle-point interpretation, 5.5 Optimality conditions, 5.6 Perturbation ...
5.4 Saddle-point interpretation, 5.5 Optimality conditions, 5.6 Perturbation ...5.4 Saddle-point interpretation, 5.5 Optimality conditions, 5.6 Perturbation ...
5.4 Saddle-point interpretation, 5.5 Optimality conditions, 5.6 Perturbation ...
 
2.3 Operations that preserve convexity & 2.4 Generalized inequalities
2.3 Operations that preserve convexity & 2.4 Generalized inequalities2.3 Operations that preserve convexity & 2.4 Generalized inequalities
2.3 Operations that preserve convexity & 2.4 Generalized inequalities
 
Newton's Forward/Backward Difference Interpolation
Newton's Forward/Backward  Difference InterpolationNewton's Forward/Backward  Difference Interpolation
Newton's Forward/Backward Difference Interpolation
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Applied Calculus Chapter 4 multiple integrals
Applied Calculus Chapter  4 multiple integralsApplied Calculus Chapter  4 multiple integrals
Applied Calculus Chapter 4 multiple integrals
 
Lesson 10: The Chain Rule (slides)
Lesson 10: The Chain Rule (slides)Lesson 10: The Chain Rule (slides)
Lesson 10: The Chain Rule (slides)
 
Murphy: Machine learning A probabilistic perspective: Ch.9
Murphy: Machine learning A probabilistic perspective: Ch.9Murphy: Machine learning A probabilistic perspective: Ch.9
Murphy: Machine learning A probabilistic perspective: Ch.9
 
DOUBLE INTEGRALS PPT GTU CALCULUS (2110014)
DOUBLE INTEGRALS PPT GTU CALCULUS (2110014) DOUBLE INTEGRALS PPT GTU CALCULUS (2110014)
DOUBLE INTEGRALS PPT GTU CALCULUS (2110014)
 
Lesson 21: Curve Sketching (slides)
Lesson 21: Curve Sketching (slides)Lesson 21: Curve Sketching (slides)
Lesson 21: Curve Sketching (slides)
 
Lesson 27: Integration by Substitution (slides)
Lesson 27: Integration by Substitution (slides)Lesson 27: Integration by Substitution (slides)
Lesson 27: Integration by Substitution (slides)
 

Similar to IVR - Chapter 6 - Sparsity, shrinkage and wavelets

IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb)
IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb) IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb)
IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb)
Charles Deledalle
 
Optimization introduction
Optimization introductionOptimization introduction
Optimization introduction
helalmohammad2
 
Density theorems for Euclidean point configurations
Density theorems for Euclidean point configurationsDensity theorems for Euclidean point configurations
Density theorems for Euclidean point configurations
VjekoslavKovac1
 
Nonconvex Compressed Sensing with the Sum-of-Squares Method
Nonconvex Compressed Sensing with the Sum-of-Squares MethodNonconvex Compressed Sensing with the Sum-of-Squares Method
Nonconvex Compressed Sensing with the Sum-of-Squares Method
Tasuku Soma
 
CI_L01_Optimization.pdf
CI_L01_Optimization.pdfCI_L01_Optimization.pdf
CI_L01_Optimization.pdf
SantiagoGarridoBulln
 
267 handout 2_partial_derivatives_v2.60
267 handout 2_partial_derivatives_v2.60267 handout 2_partial_derivatives_v2.60
267 handout 2_partial_derivatives_v2.60
Ali Adeel
 
Distributed Architecture of Subspace Clustering and Related
Distributed Architecture of Subspace Clustering and RelatedDistributed Architecture of Subspace Clustering and Related
Distributed Architecture of Subspace Clustering and Related
Pei-Che Chang
 
Maximizing Submodular Function over the Integer Lattice
Maximizing Submodular Function over the Integer LatticeMaximizing Submodular Function over the Integer Lattice
Maximizing Submodular Function over the Integer Lattice
Tasuku Soma
 
Instantons and Chern-Simons Terms in AdS4/CFT3
Instantons and Chern-Simons Terms in AdS4/CFT3Instantons and Chern-Simons Terms in AdS4/CFT3
Instantons and Chern-Simons Terms in AdS4/CFT3
Sebastian De Haro
 
Randomness conductors
Randomness conductorsRandomness conductors
Randomness conductorswtyru1989
 
Klt
KltKlt
Ch02 fuzzyrelation
Ch02 fuzzyrelationCh02 fuzzyrelation
Ch02 fuzzyrelation
Hoàng Đức
 
Hierarchical matrices for approximating large covariance matries and computin...
Hierarchical matrices for approximating large covariance matries and computin...Hierarchical matrices for approximating large covariance matries and computin...
Hierarchical matrices for approximating large covariance matries and computin...
Alexander Litvinenko
 
Lattices, sphere packings, spherical codes
Lattices, sphere packings, spherical codesLattices, sphere packings, spherical codes
Lattices, sphere packings, spherical codes
wtyru1989
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
The Statistical and Applied Mathematical Sciences Institute
 
IVR - Chapter 1 - Introduction
IVR - Chapter 1 - IntroductionIVR - Chapter 1 - Introduction
IVR - Chapter 1 - Introduction
Charles Deledalle
 
Edge detection
Edge detectionEdge detection
Edge detection
Edi Supriadi
 
Basic Cal - Quarter 1 Week 1-2.pptx
Basic Cal - Quarter 1 Week 1-2.pptxBasic Cal - Quarter 1 Week 1-2.pptx
Basic Cal - Quarter 1 Week 1-2.pptx
jamesvalenzuela6
 
03 convexfunctions
03 convexfunctions03 convexfunctions
03 convexfunctions
Sufyan Sahoo
 
Mathematics
MathematicsMathematics
Mathematics
IITIANS FOR YOUTH
 

Similar to IVR - Chapter 6 - Sparsity, shrinkage and wavelets (20)

IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb)
IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb) IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb)
IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb)
 
Optimization introduction
Optimization introductionOptimization introduction
Optimization introduction
 
Density theorems for Euclidean point configurations
Density theorems for Euclidean point configurationsDensity theorems for Euclidean point configurations
Density theorems for Euclidean point configurations
 
Nonconvex Compressed Sensing with the Sum-of-Squares Method
Nonconvex Compressed Sensing with the Sum-of-Squares MethodNonconvex Compressed Sensing with the Sum-of-Squares Method
Nonconvex Compressed Sensing with the Sum-of-Squares Method
 
CI_L01_Optimization.pdf
CI_L01_Optimization.pdfCI_L01_Optimization.pdf
CI_L01_Optimization.pdf
 
267 handout 2_partial_derivatives_v2.60
267 handout 2_partial_derivatives_v2.60267 handout 2_partial_derivatives_v2.60
267 handout 2_partial_derivatives_v2.60
 
Distributed Architecture of Subspace Clustering and Related
Distributed Architecture of Subspace Clustering and RelatedDistributed Architecture of Subspace Clustering and Related
Distributed Architecture of Subspace Clustering and Related
 
Maximizing Submodular Function over the Integer Lattice
Maximizing Submodular Function over the Integer LatticeMaximizing Submodular Function over the Integer Lattice
Maximizing Submodular Function over the Integer Lattice
 
Instantons and Chern-Simons Terms in AdS4/CFT3
Instantons and Chern-Simons Terms in AdS4/CFT3Instantons and Chern-Simons Terms in AdS4/CFT3
Instantons and Chern-Simons Terms in AdS4/CFT3
 
Randomness conductors
Randomness conductorsRandomness conductors
Randomness conductors
 
Klt
KltKlt
Klt
 
Ch02 fuzzyrelation
Ch02 fuzzyrelationCh02 fuzzyrelation
Ch02 fuzzyrelation
 
Hierarchical matrices for approximating large covariance matries and computin...
Hierarchical matrices for approximating large covariance matries and computin...Hierarchical matrices for approximating large covariance matries and computin...
Hierarchical matrices for approximating large covariance matries and computin...
 
Lattices, sphere packings, spherical codes
Lattices, sphere packings, spherical codesLattices, sphere packings, spherical codes
Lattices, sphere packings, spherical codes
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
IVR - Chapter 1 - Introduction
IVR - Chapter 1 - IntroductionIVR - Chapter 1 - Introduction
IVR - Chapter 1 - Introduction
 
Edge detection
Edge detectionEdge detection
Edge detection
 
Basic Cal - Quarter 1 Week 1-2.pptx
Basic Cal - Quarter 1 Week 1-2.pptxBasic Cal - Quarter 1 Week 1-2.pptx
Basic Cal - Quarter 1 Week 1-2.pptx
 
03 convexfunctions
03 convexfunctions03 convexfunctions
03 convexfunctions
 
Mathematics
MathematicsMathematics
Mathematics
 

More from Charles Deledalle

Chapter 1 - Introduction
Chapter 1 - IntroductionChapter 1 - Introduction
Chapter 1 - Introduction
Charles Deledalle
 
MLIP - Chapter 6 - Generation, Super-Resolution, Style transfer
MLIP - Chapter 6 - Generation, Super-Resolution, Style transferMLIP - Chapter 6 - Generation, Super-Resolution, Style transfer
MLIP - Chapter 6 - Generation, Super-Resolution, Style transfer
Charles Deledalle
 
MLIP - Chapter 5 - Detection, Segmentation, Captioning
MLIP - Chapter 5 - Detection, Segmentation, CaptioningMLIP - Chapter 5 - Detection, Segmentation, Captioning
MLIP - Chapter 5 - Detection, Segmentation, Captioning
Charles Deledalle
 
MLIP - Chapter 4 - Image classification and CNNs
MLIP - Chapter 4 - Image classification and CNNsMLIP - Chapter 4 - Image classification and CNNs
MLIP - Chapter 4 - Image classification and CNNs
Charles Deledalle
 
MLIP - Chapter 3 - Introduction to deep learning
MLIP - Chapter 3 - Introduction to deep learningMLIP - Chapter 3 - Introduction to deep learning
MLIP - Chapter 3 - Introduction to deep learning
Charles Deledalle
 
MLIP - Chapter 2 - Preliminaries to deep learning
MLIP - Chapter 2 - Preliminaries to deep learningMLIP - Chapter 2 - Preliminaries to deep learning
MLIP - Chapter 2 - Preliminaries to deep learning
Charles Deledalle
 
IVR - Chapter 4 - Variational methods
IVR - Chapter 4 - Variational methodsIVR - Chapter 4 - Variational methods
IVR - Chapter 4 - Variational methods
Charles Deledalle
 
IVR - Chapter 5 - Bayesian methods
IVR - Chapter 5 - Bayesian methodsIVR - Chapter 5 - Bayesian methods
IVR - Chapter 5 - Bayesian methods
Charles Deledalle
 
IVR - Chapter 7 - Patch models and dictionary learning
IVR - Chapter 7 - Patch models and dictionary learningIVR - Chapter 7 - Patch models and dictionary learning
IVR - Chapter 7 - Patch models and dictionary learning
Charles Deledalle
 
IVR - Chapter 3 - Basics of filtering II: Spectral filters
IVR - Chapter 3 - Basics of filtering II: Spectral filtersIVR - Chapter 3 - Basics of filtering II: Spectral filters
IVR - Chapter 3 - Basics of filtering II: Spectral filters
Charles Deledalle
 

More from Charles Deledalle (10)

Chapter 1 - Introduction
Chapter 1 - IntroductionChapter 1 - Introduction
Chapter 1 - Introduction
 
MLIP - Chapter 6 - Generation, Super-Resolution, Style transfer
MLIP - Chapter 6 - Generation, Super-Resolution, Style transferMLIP - Chapter 6 - Generation, Super-Resolution, Style transfer
MLIP - Chapter 6 - Generation, Super-Resolution, Style transfer
 
MLIP - Chapter 5 - Detection, Segmentation, Captioning
MLIP - Chapter 5 - Detection, Segmentation, CaptioningMLIP - Chapter 5 - Detection, Segmentation, Captioning
MLIP - Chapter 5 - Detection, Segmentation, Captioning
 
MLIP - Chapter 4 - Image classification and CNNs
MLIP - Chapter 4 - Image classification and CNNsMLIP - Chapter 4 - Image classification and CNNs
MLIP - Chapter 4 - Image classification and CNNs
 
MLIP - Chapter 3 - Introduction to deep learning
MLIP - Chapter 3 - Introduction to deep learningMLIP - Chapter 3 - Introduction to deep learning
MLIP - Chapter 3 - Introduction to deep learning
 
MLIP - Chapter 2 - Preliminaries to deep learning
MLIP - Chapter 2 - Preliminaries to deep learningMLIP - Chapter 2 - Preliminaries to deep learning
MLIP - Chapter 2 - Preliminaries to deep learning
 
IVR - Chapter 4 - Variational methods
IVR - Chapter 4 - Variational methodsIVR - Chapter 4 - Variational methods
IVR - Chapter 4 - Variational methods
 
IVR - Chapter 5 - Bayesian methods
IVR - Chapter 5 - Bayesian methodsIVR - Chapter 5 - Bayesian methods
IVR - Chapter 5 - Bayesian methods
 
IVR - Chapter 7 - Patch models and dictionary learning
IVR - Chapter 7 - Patch models and dictionary learningIVR - Chapter 7 - Patch models and dictionary learning
IVR - Chapter 7 - Patch models and dictionary learning
 
IVR - Chapter 3 - Basics of filtering II: Spectral filters
IVR - Chapter 3 - Basics of filtering II: Spectral filtersIVR - Chapter 3 - Basics of filtering II: Spectral filters
IVR - Chapter 3 - Basics of filtering II: Spectral filters
 

Recently uploaded

ACRP 4-09 Risk Assessment Method to Support Modification of Airfield Separat...
ACRP 4-09 Risk Assessment Method to Support Modification of Airfield Separat...ACRP 4-09 Risk Assessment Method to Support Modification of Airfield Separat...
ACRP 4-09 Risk Assessment Method to Support Modification of Airfield Separat...
Mukeshwaran Balu
 
Swimming pool mechanical components design.pptx
Swimming pool  mechanical components design.pptxSwimming pool  mechanical components design.pptx
Swimming pool mechanical components design.pptx
yokeleetan1
 
DfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributionsDfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributions
gestioneergodomus
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
ydteq
 
Ethernet Routing and switching chapter 1.ppt
Ethernet Routing and switching chapter 1.pptEthernet Routing and switching chapter 1.ppt
Ethernet Routing and switching chapter 1.ppt
azkamurat
 
[JPP-1] - (JEE 3.0) - Kinematics 1D - 14th May..pdf
[JPP-1] - (JEE 3.0) - Kinematics 1D - 14th May..pdf[JPP-1] - (JEE 3.0) - Kinematics 1D - 14th May..pdf
[JPP-1] - (JEE 3.0) - Kinematics 1D - 14th May..pdf
awadeshbabu
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
Aditya Rajan Patra
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Christina Lin
 
14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application
SyedAbiiAzazi1
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
obonagu
 
Low power architecture of logic gates using adiabatic techniques
Low power architecture of logic gates using adiabatic techniquesLow power architecture of logic gates using adiabatic techniques
Low power architecture of logic gates using adiabatic techniques
nooriasukmaningtyas
 
digital fundamental by Thomas L.floydl.pdf
digital fundamental by Thomas L.floydl.pdfdigital fundamental by Thomas L.floydl.pdf
digital fundamental by Thomas L.floydl.pdf
drwaing
 
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTCHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
jpsjournal1
 
Water billing management system project report.pdf
Water billing management system project report.pdfWater billing management system project report.pdf
Water billing management system project report.pdf
Kamal Acharya
 
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTSHeap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Soumen Santra
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
ClaraZara1
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
Rahul
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单专业办理
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单专业办理一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单专业办理
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单专业办理
zwunae
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
Victor Morales
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 

Recently uploaded (20)

ACRP 4-09 Risk Assessment Method to Support Modification of Airfield Separat...
ACRP 4-09 Risk Assessment Method to Support Modification of Airfield Separat...ACRP 4-09 Risk Assessment Method to Support Modification of Airfield Separat...
ACRP 4-09 Risk Assessment Method to Support Modification of Airfield Separat...
 
Swimming pool mechanical components design.pptx
Swimming pool  mechanical components design.pptxSwimming pool  mechanical components design.pptx
Swimming pool mechanical components design.pptx
 
DfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributionsDfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributions
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
 
Ethernet Routing and switching chapter 1.ppt
Ethernet Routing and switching chapter 1.pptEthernet Routing and switching chapter 1.ppt
Ethernet Routing and switching chapter 1.ppt
 
[JPP-1] - (JEE 3.0) - Kinematics 1D - 14th May..pdf
[JPP-1] - (JEE 3.0) - Kinematics 1D - 14th May..pdf[JPP-1] - (JEE 3.0) - Kinematics 1D - 14th May..pdf
[JPP-1] - (JEE 3.0) - Kinematics 1D - 14th May..pdf
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
 
14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
 
Low power architecture of logic gates using adiabatic techniques
Low power architecture of logic gates using adiabatic techniquesLow power architecture of logic gates using adiabatic techniques
Low power architecture of logic gates using adiabatic techniques
 
digital fundamental by Thomas L.floydl.pdf
digital fundamental by Thomas L.floydl.pdfdigital fundamental by Thomas L.floydl.pdf
digital fundamental by Thomas L.floydl.pdf
 
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTCHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
 
Water billing management system project report.pdf
Water billing management system project report.pdfWater billing management system project report.pdf
Water billing management system project report.pdf
 
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTSHeap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单专业办理
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单专业办理一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单专业办理
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单专业办理
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 

IVR - Chapter 6 - Sparsity, shrinkage and wavelets

  • 1. ECE 285 Image and video restoration Chapter VI – Sparsity, shrinkage and wavelets Charles Deledalle March 11, 2019 1
  • 2. Motivations (a) y = x + w (b) z = F y (c) λ2 i λ2 i +σ2 (d) ˆzi = λ2 i λ2 i +σ2 zi (e) ˆx = F −1 ˆz Wiener filter (LMMSE in the Fourier domain) • Assume Fourier coefficients to be decorrelated (white), • Modulate frequencies based on the mean power spectral density λ2 i . Limits • Linear: no adaptation to the content ⇒ Unable to preserve edges, Blurry solutions. 2
  • 3. Motivations Facts and consequences • Assume Fourier coefficients to be decorrelated (white) • Removing Gaussian noise ⇒ need to be adaptive ⇒ Non linear • Assuming Gaussian noise + Gaussian prior ⇒ Linear Deductive reasoning Fourier coefficients of clean images are not Gaussian distributed How are Fourier coefficients distributed? 3
  • 4. Motivations – Distribution of Fourier coefficients How are Fourier coefficients distributed? 1. Perform whitening with DFT Var[x] = L = EΛE∗ with E = 1√ n F diag(Λ) = (λ2 1, . . . , λ2 n) = n−1 MPSD 4
  • 5. Motivations – Distribution of Fourier coefficients How are Fourier coefficients distributed? 2. Look at the histogram • The histogram of η has a symmetric bell shape around 0. • It has a peak at 0 (a large number of Fourier coefficients are zero). • It has large/heavy tails (many coefficients are “outliers”/abnormal). (a) x (b) Whitening η of x (c) Histogram of η 5
  • 6. Motivations – Distribution of Fourier coefficients How are Fourier coefficients distributed? 3. Look for the distribution that best fits (in log scale) • Gaussian: bell shape √ , peak ×, tail × • Laplacian: bell shape ×, peak √ , tail √ • Student: bell shape √ , peak ×, tail √ (heavier) • Others: alpha stables and generalized Gaussian distributions (a) Whitening η of x (b) Histogram of η (c) Log-histogram of η 6
  • 7. Motivations – Distribution of Fourier coefficients Model expression (zero mean, variance = 1) • Gaussian: bell shape √ , peak ×, tail × p(ηi) = 1 √ 2π exp − η2 i 2 • Laplacian: bell shape ×, peak √ , tail √ p(ηi) = 1 √ 2 exp − √ 2|ηi| • Student: bell shape √ , peak ×, tail √ (heavier) p(ηi) = 1 Z 1 (2r − 2) + η2 i r+1/2 (Z normalization constant, r > 1 controls the tails) How do they look in multiple-dimensions? 7
  • 8. Motivations – Distribution of Fourier coefficients • Gaussian prior • images are concentrated in an elliptical cluster, • outliers are rare (images outside the cluster). • Peaky & heavy tailed priors: shape between a diamond and a star.    • union of subspaces: most images lie in one of the branches of the star, • sparsity: most of their coefficients ηi are zeros, • robustness: outlier coefficients are frequent. 8
  • 10. Shrinkage functions Consider the following Gaussian denoising problem • Let y ∈ Rn and x ∈ Rp be two random vectors such that y | x ∼ N(x, σ2 Idn) E[x] = 0 and Var[x] = L = EΛE∗ • Let η = Λ−1/2 E∗ x (whitening / decorrelation of x) Goal: estimate x from y assuming a non-Gaussian prior pη for η. (such as Laplacian or Student) 9
  • 11. Shrinkage functions Bayesian shrinkage functions • Assume ηi are also independent and identically distributed (iid). • Then, the MMSE and MAP estimators both read as ˆx = Eˆz Come back where ˆzi = s(zi; λi, σ) shrinkage and z = E∗ y Change of basis • The function zi → s(zi; λi, σ) is called shrinkage function. • Unlike the LMMSE, s will depend on the prior distribution of ηi. • As for the LMMSE, the solution can be computed in the eigenspace. • We say that the estimator is separable in the eigenspace (ex: Fourier). 10
  • 12. Shrinkage functions Remark independence ⇒ uncorrelation ¬uncorrelation ⇒ ¬independence correlation ⇒ dependence ⇒ Whitening is a necessarily step for independence but not a sufficient one. (Except in the Gaussian case) How are the shrinkage functions defined for the MMSE and MAP? 11
  • 13. Shrinkage functions • Recall that the MMSE is the posterior mean ˆx = Rn xp(x|y) dx = Rn xp(y|x)p(x) dx Rn p(y|x)p(x) dx MMSE Shrinkage functions • Under the previous assumptions ˆx = Eˆz Come back where ˆzi = s(zi; λi, σ) shrinkage and z = E∗ y Change of basis with s(z; λ, σ) = R ˜z exp −(z−˜z)2 2σ2 pη ˜z λ d˜z R exp −(z−˜z)2 2σ2 pη ˜z λ d˜z where pη is the prior distribution on the entries of η. • Separability: n dimensional optimization −→ n × 1d integrations. 12
  • 14. Shrinkage functions • Recall that the MAP is the optimization problem ˆx ∈ argmax x∈Rn p(x|y) = argmin x∈Rn [− log p(y|x) − log p(x)] MAP Shrinkage functions • Under the previous assumptions ˆx = Eˆz Come back where ˆzi = s(zi; λi, σ) shrinkage and z = E∗ y Change of basis with s(z; λ, σ) = argmin ˜z∈R (z − ˜z)2 2σ2 − log pη ˜z λ where pη is the prior distribution on the entries of η. • Separability: n dimensional integration −→ n × 1d optimisations. 13
  • 15. Shrinkage functions Example (Gaussian noise + Gaussian prior) • MMSE Shrinkage s(z; λ, σ) = R ˜z exp −(z−˜z)2 2σ2 − ˜z2 2λ2 d˜z R exp −(z−˜z)2 2σ2 − ˜z2 2λ2 d˜z = λ2 λ2 + σ2 z • MAP Shrinkage s(z; λ, σ) = argmin ˜z∈R (z − ˜z)2 2σ2 + ˜z2 2λ2 = λ2 λ2 + σ2 z • Gaussian prior: MAP = MMSE = Linear shrinkage. • We retrieve the LMMSE as expected. 14
  • 16. Shrinkage functions Gaussian noise + Gaussian prior -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 SNR = λ/σ = 4 λ/σ = 1 λ/σ = 1/4 15
  • 17. Posterior mean – Shrinkage functions – Examples Example (Gaussian noise + Laplacian prior) • MMSE Shrinkage s(z; λ, σ) = ˜z exp −(z−˜z)2 2σ2 − √ 2|˜z| λ d˜z exp −(z−˜z)2 2σ2 − √ 2|˜z| λ d˜z = z − γ erf z−γ√ 2σ − exp 2γz σ2 erfc γ+z√ 2σ + 1 erf z−γ√ 2σ + exp 2γz σ2 erfc γ+z√ 2σ + 1 , γ = √ 2σ2 λ • MAP Shrinkage (soft-thresholding) s(z; λ, σ) = argmin ˜z∈R (z − ˜z)2 2σ2 + √ 2|˜z| λ =    0 if |z| < γ z − γ if z > γ z + γ if z < −γ Soft-T(z,γ) Non-gaussian prior: MAP = MMSE → Non-linear shrinkage. 16
  • 18. Shrinkage functions Gaussian noise + Laplacian prior -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 SNR = λ/σ = 4 λ/σ = 1 λ/σ = 1/4 17
  • 19. Posterior mean – Shrinkage functions – Examples Example (Gaussian noise + Student prior) • MMSE Shrinkage No simple expression, requires 1d numerical integration • MAP Shrinkage No simple expression, requires 1d numerical optimization For efficiency, the 1d functions can be evaluated offline and stored in a look-up-table. 18
  • 20. Shrinkage functions Gaussian noise + Student prior -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 SNR = λ/σ = 4 λ/σ = 1 λ/σ = 1/4 19
  • 21. Posterior mean – Shrinkage functions – Examples -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 SNR = λ/σ = 4 λ/σ = 1/2 λ/σ = 1/2 • Coefficients are shrunk towards zero • Signs are preserved • Non-Gaussian priors leads to non-linear filtering: • sparsity: small coefficients are shrunk (likely due to noise) • robustness: large coefficients are preserved (likely encoding signal) • Larger SNR = λ σ ⇒ shrinkage becomes close to identity. 20
  • 22. Posterior mean – Shrinkage functions – Examples Interpretation -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 Sparsity: zero for small values. Robustness: remain close to the identity for large values. Transition: bias/variance tradeoff. Can we design our own shrinkage according to what we want? 21
  • 23. Shrinkage functions Shrinkage functions (a.k.a, thresholding functions) • Pick a shrinkage function s satisfying • Shrink: |s(z)| |z| (non-expansive) • Preserve sign: z · s(z) 0 • Kill low SNR: lim λ σ →0 s(z; λ, σ) = 0 • Keep high SNR: lim λ σ →∞ s(z; λ, σ) = z • Increasing: z1 z2 ⇔ s(z1) s(z2) • Beyond Bayesian: No need to relate s to a prior distribution pη. 22
  • 24. Shrinkage functions A few examples (among many others) -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 -10 -5 0 5 10 -10 -8 -6 -4 -2 0 2 4 6 8 10 Hard-thresholding MCP SCAD • Though not necessarily related to a prior distribution, • Often related to a penalized least square problem, ex: Hard-T(z) = argmin ˜z∈R (z − ˜z)2 + τ2 1{˜z=0} = 0 if |z| < τ z otherwise • Hard-thresholding: similar behavior to Student’s shrinkage. 23
  • 25. Shrinkage functions Link with penalized least square (1/2) • D = L1/2 = EΛ1/2 is an orthogonal dictionary of n atoms/words D = (d1, d2, . . . , dn) with ||di|| = λi and di, dj = 0 (for i = j) • Goal: Look for the n coefficients ηi, such that ˆx close to y ˆx = Dη = n i=1 ηidi = “linear comb. of the orthogonal atoms di of D” • Choosing ηi = di ||di||2 , y , i.e., η = Λ−1/2 E∗ y, is optimal: ˆx = y but, it also reconstructs the noise component. • Idea: penalize the coeffs to prevent from reconstructing the noise. 24
  • 26. Shrinkage functions Link with penalized least square (2/2) • Penalization on the coefficients controls shrinkage and sparsity: • 1 2 ||y − Dη||2 2 + τ2 2 ||η||2 2 ⇒ ˆzi = λ2 i λ2 i + τ2 zi • 1 2 ||y − Dη||2 2 + τ||η||1 ⇒ ˆzi = Soft-T (zi, γi) with γi = τ λi • 1 2 ||y − Dη||2 2 + τ2 2 ||η||0 ⇒ ˆzi = Hard-T (zi, γi) with γi = τ λi 0 pseudo-norm: ||η||0 = lim p→0 n i=1 ηp i 1/p = “# of non-zero coefficients” Sparsity: ||η||0 small compared to n 25
  • 27. Posterior mean – Shrinkage in the Fourier domain Shrinkage in the discrete Fourier domain (a) x (b) y = x + w sig = 20 y = x + sig * np.random.randn(x.shape) n1, n2 = y.shape[:2] n = n1 * n2 lbd = np.sqrt(prior_mpsd(n1, n2) / n) z = nf.fft2(y, axes=(0, 1)) / np.sqrt(n) zhat = shrink(z, lbd, sig) xhat = np.real(nf.ifft2(zhat, axes=(0, 1))) * np.sqrt(n) z = F y/ √ n ˆzi = s(zi; λi, σ) ˆx = √ nF −1 ˆz 26
  • 28. Posterior mean – Shrinkage in the Fourier domain Shrinkage in the discrete Fourier domain (a) x (b) λ sig = 20 y = x + sig * np.random.randn(x.shape) n1, n2 = y.shape[:2] n = n1 * n2 lbd = np.sqrt(prior_mpsd(n1, n2) / n) z = nf.fft2(y, axes=(0, 1)) / np.sqrt(n) zhat = shrink(z, lbd, sig) xhat = np.real(nf.ifft2(zhat, axes=(0, 1))) * np.sqrt(n) z = F y/ √ n ˆzi = s(zi; λi, σ) ˆx = √ nF −1 ˆz 26
  • 29. Posterior mean – Shrinkage in the Fourier domain Shrinkage in the discrete Fourier domain (a) x (b) z sig = 20 y = x + sig * np.random.randn(x.shape) n1, n2 = y.shape[:2] n = n1 * n2 lbd = np.sqrt(prior_mpsd(n1, n2) / n) z = nf.fft2(y, axes=(0, 1)) / np.sqrt(n) zhat = shrink(z, lbd, sig) xhat = np.real(nf.ifft2(zhat, axes=(0, 1))) * np.sqrt(n) z = F y/ √ n ˆzi = s(zi; λi, σ) ˆx = √ nF −1 ˆz 26
  • 30. Posterior mean – Shrinkage in the Fourier domain Shrinkage in the discrete Fourier domain (a) x (b) z (c) Linear/Wiener (=Gaussian) (d) Soft-T (=Laplacian) (e) Hard-T (≈Student) sig = 20 y = x + sig * np.random.randn(x.shape) n1, n2 = y.shape[:2] n = n1 * n2 lbd = np.sqrt(prior_mpsd(n1, n2) / n) z = nf.fft2(y, axes=(0, 1)) / np.sqrt(n) zhat = shrink(z, lbd, sig) xhat = np.real(nf.ifft2(zhat, axes=(0, 1))) * np.sqrt(n) z = F y/ √ n ˆzi = s(zi; λi, σ) ˆx = √ nF −1 ˆz 26
  • 31. Posterior mean – Shrinkage in the Fourier domain Shrinkage in the discrete Fourier domain (a) x (b) y = x + w (c) Linear/Wiener (=Gaussian) (d) Soft-T (=Laplacian) (e) Hard-T (≈Student) sig = 20 y = x + sig * np.random.randn(x.shape) n1, n2 = y.shape[:2] n = n1 * n2 lbd = np.sqrt(prior_mpsd(n1, n2) / n) z = nf.fft2(y, axes=(0, 1)) / np.sqrt(n) zhat = shrink(z, lbd, sig) xhat = np.real(nf.ifft2(zhat, axes=(0, 1))) * np.sqrt(n) z = F y/ √ n ˆzi = s(zi; λi, σ) ˆx = √ nF −1 ˆz 26
  • 32. Shrinkage functions – Fourier domain – Results (a) x (b) y (σ = 20) (c) Linear/Wiener (=Gaussian) (d) Soft-T (=Laplacian) (e) Hard-T (≈Student) Bias ←−−−−−−−−−−−−−−−→ Variance 27
  • 33. Shrinkage functions – Fourier domain – Results (a) x (b) y (σ = 40) (c) Linear/Wiener (=Gaussian) (d) Soft-T (=Laplacian) (e) Hard-T (≈Student) Bias ←−−−−−−−−−−−−−−−→ Variance 27
  • 34. Shrinkage functions – Fourier domain – Results (a) x (b) y (σ = 60) (c) Linear/Wiener (=Gaussian) (d) Soft-T (=Laplacian) (e) Hard-T (≈Student) Bias ←−−−−−−−−−−−−−−−→ Variance 27
  • 35. Shrinkage functions – Fourier domain – Results (a) x (b) y (σ = 120) (c) Linear/Wiener (=Gaussian) (d) Soft-T (=Laplacian) (e) Hard-T (≈Student) Bias ←−−−−−−−−−−−−−−−→ Variance 27
  • 36. Posterior mean – Limits of shrinkage in the Fourier domain Limits of shrinkage in the discrete Fourier domain (a) x (b) y convolution kernels (c) Linear (=Gaussian) (d) Soft-T (=Laplacian) (e) Hard-T (≈Student) • Linear shrinkage (Wiener) ⇒ Non-adaptive, • Non-linear shrinkage ⇒ Adaptive convolution, • Adapts to the frequency content, • but not to the spatial content. ˆzi = s(zi; τ, σ) = s(zi; τ, σ) zi × zi element-wise product ⇔ ˆx = ν(y) ∗ y spatial average adapted to the spectrum of y. 28
  • 37. Motivations Consequences • Modulating Fourier coefficients ⇒ Non spatially adaptive • Assuming Fourier coefficients to be white+sparse ⇒ Shrinkage in Fourier Deductive reasoning Need another representation for sparsifying clean images E = 1 √ n         ×n         DFT: F ←−−−− Columns were the Fourier atoms What transform can make signal white and sparse and captures both spatial and spectral contents? 29
  • 39. Introduction to wavelets – Haar (1d case) [Alfr´ed Haar (1909)] Id =      1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1      and F =      1 1 1 1 1 e−2πi1/4 e−2πi2/4 e−2πi3/4 1 e−2πi2/4 e−2πi4/4 e−2πi6/4 1 e−2πi3/4 e−2πi8/4 e−2πi9/4      30
  • 40. Introduction to wavelets – Haar (1d case) [Alfr´ed Haar (1909)] H1st = 1 √ 2      1 1 0 0 0 0 1 1 1 −1 0 0 0 0 1 −1      and H2nd =      1/2 1 1 1 1 1/2 1 1 −1 −1 1/ √ 2 1 −1 0 0 1/ √ 2 0 0 1 −1      31
  • 41. Introduction to wavelets – Haar (2d case) (a) H1st (4 × 4 image) (b) x (c) H1st x 2d Haar representation 4 sub-bands    • Coarse sub-band • Vertical detailed sub-band • Horizontal detailed sub-band • Diagonal detailed sub-band 32
  • 42. Introduction to wavelets – Haar (2d case) (a) H1st (4 × 4 image) (b) x (c) H2nd x Multi-scale 2d Haar representation • Repeat recursively J times • Dyadic decomposition • Multi-scale representation • Related to scale spaces 32
  • 43. Introduction to wavelets – Haar (2d case) (a) H1st (4 × 4 image) (b) x (c) H3rd x Multi-scale 2d Haar representation • Repeat recursively J times • Dyadic decomposition • Multi-scale representation • Related to scale spaces 33
  • 44. Introduction to wavelets – Haar (2d case) (a) H1st (4 × 4 image) (b) x (c) H4th x Multi-scale 2d Haar representation • Repeat recursively J times • Dyadic decomposition • Multi-scale representation • Related to scale spaces 33
  • 45. Introduction to wavelets – Haar transform - Filter bank 34
  • 46. Introduction to wavelets – Haar transform - Separability Properties of the 2d Haar transform • Separable: 1d Haar transforms in horizontal and next vertical direction • First: perform a low pass and high pass filtering • Next: perform decimation by a factor of 2 Can we choose other low and high pass filters to get a better transform? 35
  • 47. Discrete wavelets Discrete wavelet transform (DWT) (1/3) (1d and n even) • Let h ∈ Rn (with periodical boundary conditions) satisfying n−1 i=0 hi = 0 n−1 i=0 h2 i = 1 and n−1 i=0 hihi+2k = 0 for all integer k = 0 Example (Haar as a particular case) h = 1 √ 2 (0 . . . 0 − 1 + 1 0 . . . 0) 36
  • 48. Discrete wavelets Discrete wavelet transform (DWT) (2/3) (1d and n even) • Define the high and low pass filters H : Rn → Rn and G : Rn → Rn as (Hx)k = (h ∗ x)k = n−1 i=0 hixk−i (Gx)k = (g ∗ x)k = n−1 i=0 gixk−i where gi = (−1)i hn−1−i • Note: necessarily n−1 i=0 gi = √ 2 Example (Haar as a particular case) h = 1 √ 2 (0 . . . 0 − 1 + 1 0 . . . 0) g = 1 √ 2 (0 . . . 0 + 1 + 1 0 . . . 0) 37
  • 49. Discrete wavelets • Define the decimation by 2 of a matrix M ∈ Rn×n as M↓2 = “M(1:2:end, :)” ∈ Rn/2×n i.e., the matrix obtained by removing every two rows. • M↓2 x: apply M to x and next remove every two entries. Discrete wavelet transform (DWT) (3/3) (1d and n even) Let W = G↓2 H ↓2 ∈ Rn×n Then    • x → W x: orthonormal discrete wavelet transform, • Columns of W : orthonormal discrete wavelet basis, • z = W x: wavelet coefficients of x. 38
  • 50. Multi-scale discrete wavelets Multi-scale DWT (1d and n multiple of 2J ) [Mallat, 1989] Defined recursively as W J-th = W (J-1)-th O 0 Id W 39
  • 51. Multi-scale discrete wavelets Implementation of 2D DWT (n1 and n2 multiple of 2J ) def dwt(x, J, h, g): # 2d and multi-scale if J == 0: return x n1, n2 = x.shape[:2] m1, m2 = (int(n1 / 2), int(n2 / 2)) z = dwt1d(x, h, g) z = flip(dwt1d(flip(z), h, g)) z[:m1, :m2] = dwt(z[:m1, :m2], J - 1, h, g) return z def dwt1d(x, h, g): # 1d and 1scale coarse = convolve(x, g) detail = convolve(x, h) z = np.concatenate((coarse[::2, :], detail[::2, :]), axis=0) return z.shape Use separability 40
  • 52. Multi-scale discrete wavelets Multi-scale Inverse DWT (1d and n multiple of 2J ) Defined recursively as (W J-th )−1 = W −1 (W (J-1)-th )−1 O 0 Id where W −1 = W ∗ = G∗ ↑2 H∗ ↑2 ∈ Rn×n and M↑2: remove every two columns. M↑2 x : insert 0 every two entries in x and next apply M. 41
  • 53. Multi-scale discrete wavelets Implementation of 2D IDWT (n1 and n2 multiple of 2J ) def idwt(z, J, h, g): # 2d and multi-scale if J == 0: return z n1, n2 = z.shape[:2] m1, m2 = (int(n1 / 2), int(n2 / 2)) x = z.copy() x[:m1, :m2] = idwt(x[:m1, :m2], J - 1, h, g) x = flip(idwt1d(flip(x), h, g)) x = idwt1d(x, h, g, axis=0) return x def idwt1d(z, h, g): # 1d and 1scale n1 = z.shape[0] m1 = int(n1 / 2) coarse, detail = np.zeros(z.shape), np.zeros(z.shape) coarse[::2, :], detail[::2, :] = z[:m1, :], z[m1:, :] x = convolve(coarse, g[::-1]) + convolve(detail, h[::-1]) return x 42
  • 54. Discrete wavelets – Limited support Discrete wavelet with limited support • Consider a high pass filter with finite support of size m = 2p (even). For instance for m = 4 H =               h2 h3 0 . . . 0 h0 h1 h0 h1 h2 h3 0 . . . 0 0 h0 h1 h2 h3 0 . . . ... 0 . . . 0 h0 h1 h2 h3 h2 h3 0 . . . 0 h0 h1               • Then h defines a wavelet transform if it satisfies the three conditions hi = 0 and h2 i = 1 and hihi+2k = 0 for k = 1 to p − 1 • This system has 2p unknowns and 1 + p independent equations. • If p = 1, 2p = 1 + p, this implies that the solution is unique (Haar). • Otherwise, one has p − 1 degrees of freedom. 43
  • 55. Discrete wavelets – Daubechies’ wavelets Daubechies’ wavelets (1988) • Daubechies suggests adding the p − 1 constraints 2p−1 i=0 iq hi = 0 for q = 1 to p − 1 (vanishing q-order moments) • For p = 2, the (orthonormal) Daubechies’ wavelets are defined as    h2 0 + h2 1 + h2 2 + h2 3 = 1 h0 + h1 + h2 + h3 = 0 h0h2 + h1h3 = 0 h1 + 2h2 + 3h3 = 0 ⇔ h = ± 1 √ 2      1+ √ 3 4 3+ √ 3 4 3− √ 3 4 1− √ 3 4      • The corresponding DWT is referred to as Daubechies-2 (or Db2). As for the Fourier transform, there also exists a continuous version. 44
  • 56. Continuous wavelets Continuous wavelet transform (CWT) (1d) • Continuum of locations t ∈ R and scales a > 0, • Continuous wavelet transform of x : R → R c(a, t) wavelet coefficient = +∞ −∞ ψ∗ a,t t x(t ) dt = x signal , ψa,t wavelet where ∗ is the complex conjugate. • ψa,t: daughter wavelets, translated and scaled versions of Ψ ψa,t(t ) = 1 √ a Ψ t − t a • Ψ: the mother wavelets satisfying +∞ −∞ Ψ(t) dt = 0 and +∞ −∞ |Ψ(t)|2 dt = 1 < ∞ (zero-mean) (unit-norm / square-integrable) 45
  • 57. Continuous wavelets Inverse CWT (1d) • The inverse continuous wavelet transform is given by x(t) = 1 CΨ +∞ −∞ +∞ 0 1 |a|2 c(a, t )ψa,t t da dt with CΨ = +∞ 0 |ˆΨ(u)|2 u du where ˆΨ is the Fourier transform of Ψ. Relation between CWT/DWT (1d) • The DWT can be seen as the discretization of the CWT • Diadic discretization in scale: a = 1, 2, 4, . . . , 2J • Uniform discretization in time at scale j with step 2j : t = 1:2j :n 46
  • 58. Continuous wavelets Twin-scale relation (1d) • The CWT is orthogonal (inverse = adjoint), if and only if Ψ satisfies Ψ(t) = √ 2 m−1 i=0 hiΦ(2t − i) and Φ(t) = √ 2 m−1 i=0 giΦ(2t − i) where h and g are high- and low-pass filters defining a DWT. • Φ is called father wavelet or scaling function. • Note: potentially m = ∞. Twin-scale relation: allows to define a CWT from DWT and vice-versa. The CWT may not have a closed form (approximated by the cascade algorithm) 47
  • 59. Continuous and discrete wavelets Haar/Db1 -1 -0.5 0 0.5 1 -1 -0.5 0 0.5 1 Father wavelet -1 -0.5 0 0.5 1 -1 -0.5 0 0.5 1 Mother wavelet -1 -0.5 0 0.5 1 Low pass filter -1 0 1 2 -1 -0.5 0 0.5 1 High pass filter -1 0 1 2 Popular wavelets are: Haar (1909); Gabor wavelet (1946); Mexican hat/Marr wavelet (1980); Morlet wavelet (1984); Daubechies (1988); Meyer wavelet (1990); Binomial quadrature mirror filter (1990); Coiflets (1991); Symlets (1992). Some classical wavelet transforms are not orthogonal. Bi-orthogonal: Non orthogonal but invertible (ex: for symmetric wavelets). Two filters for the direct, and two others for the inverse. 48
  • 60. Continuous and discrete wavelets Db2 -2 -1 0 1 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 Father wavelet -2 -1 0 1 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 Mother wavelet -1 -0.5 0 0.5 1 Low pass filter -1 0 1 2 3 4 -1 -0.5 0 0.5 1 High pass filter -1 0 1 2 3 4 Popular wavelets are: Haar (1909); Gabor wavelet (1946); Mexican hat/Marr wavelet (1980); Morlet wavelet (1984); Daubechies (1988); Meyer wavelet (1990); Binomial quadrature mirror filter (1990); Coiflets (1991); Symlets (1992). Some classical wavelet transforms are not orthogonal. Bi-orthogonal: Non orthogonal but invertible (ex: for symmetric wavelets). Two filters for the direct, and two others for the inverse. 48
  • 61. Continuous and discrete wavelets Db4 -2 0 2 4 -1 -0.5 0 0.5 1 Father wavelet -4 -2 0 2 4 -1 -0.5 0 0.5 1 Mother wavelet -1 -0.5 0 0.5 1 Low pass filter 0 2 4 6 8 -1 -0.5 0 0.5 1 High pass filter 0 2 4 6 8 Popular wavelets are: Haar (1909); Gabor wavelet (1946); Mexican hat/Marr wavelet (1980); Morlet wavelet (1984); Daubechies (1988); Meyer wavelet (1990); Binomial quadrature mirror filter (1990); Coiflets (1991); Symlets (1992). Some classical wavelet transforms are not orthogonal. Bi-orthogonal: Non orthogonal but invertible (ex: for symmetric wavelets). Two filters for the direct, and two others for the inverse. 48
  • 62. Continuous and discrete wavelets Db8 0 2 4 6 8 -1 -0.5 0 0.5 1 Father wavelet -5 0 5 -1 -0.5 0 0.5 1 Mother wavelet -1 -0.5 0 0.5 1 Low pass filter 0 5 10 15 -1 -0.5 0 0.5 1 High pass filter 0 5 10 15 Popular wavelets are: Haar (1909); Gabor wavelet (1946); Mexican hat/Marr wavelet (1980); Morlet wavelet (1984); Daubechies (1988); Meyer wavelet (1990); Binomial quadrature mirror filter (1990); Coiflets (1991); Symlets (1992). Some classical wavelet transforms are not orthogonal. Bi-orthogonal: Non orthogonal but invertible (ex: for symmetric wavelets). Two filters for the direct, and two others for the inverse. 48
  • 63. Continuous and discrete wavelets Meyer -5 0 5 -1 -0.5 0 0.5 1 Father wavelet -5 0 5 -1 -0.5 0 0.5 1 Mother wavelet -1 -0.5 0 0.5 1 Low pass filter 0 20 40 60 -1 -0.5 0 0.5 1 High pass filter 0 20 40 60 Popular wavelets are: Haar (1909); Gabor wavelet (1946); Mexican hat/Marr wavelet (1980); Morlet wavelet (1984); Daubechies (1988); Meyer wavelet (1990); Binomial quadrature mirror filter (1990); Coiflets (1991); Symlets (1992). Some classical wavelet transforms are not orthogonal. Bi-orthogonal: Non orthogonal but invertible (ex: for symmetric wavelets). Two filters for the direct, and two others for the inverse. 48
  • 64. Continuous and discrete wavelets Coif4 -2 0 2 4 6 8 -1 -0.5 0 0.5 1 Father wavelet -5 0 5 -1 -0.5 0 0.5 1 Mother wavelet -1 -0.5 0 0.5 1 Low pass filter 0 5 10 15 20 -1 -0.5 0 0.5 1 High pass filter 0 5 10 15 20 Popular wavelets are: Haar (1909); Gabor wavelet (1946); Mexican hat/Marr wavelet (1980); Morlet wavelet (1984); Daubechies (1988); Meyer wavelet (1990); Binomial quadrature mirror filter (1990); Coiflets (1991); Symlets (1992). Some classical wavelet transforms are not orthogonal. Bi-orthogonal: Non orthogonal but invertible (ex: for symmetric wavelets). Two filters for the direct, and two others for the inverse. 48
  • 65. Continuous and discrete wavelets Sym4 -4 -2 0 2 4 -1 -0.5 0 0.5 1 Father wavelet -4 -2 0 2 4 -1 -0.5 0 0.5 1 Mother wavelet -1 -0.5 0 0.5 1 Low pass filter 0 2 4 6 8 -1 -0.5 0 0.5 1 High pass filter 0 2 4 6 8 Popular wavelets are: Haar (1909); Gabor wavelet (1946); Mexican hat/Marr wavelet (1980); Morlet wavelet (1984); Daubechies (1988); Meyer wavelet (1990); Binomial quadrature mirror filter (1990); Coiflets (1991); Symlets (1992). Some classical wavelet transforms are not orthogonal. Bi-orthogonal: Non orthogonal but invertible (ex: for symmetric wavelets). Two filters for the direct, and two others for the inverse. 48
  • 66. Wavelets and sparsity Wavelets perform image compression • Haar encodes constant signals with one coefficient, • Db-p encodes (p-1)-order polynomials with p coefficients. Consequences: • Polynomial/Smooth signals are encoded with very few coefficients, • Coarse coefficients encode the smooth underlying signal, • Detailed coefficients encode non-smooth content of the signal, • Typical signals are concentrated on few coefficients, • The remaining coefficients capture only noise components. 49
  • 67. Wavelets and sparsity Wavelets perform image compression • Haar encodes constant signals with one coefficient, • Db-p encodes (p-1)-order polynomials with p coefficients. Consequences: • Polynomial/Smooth signals are encoded with very few coefficients, • Coarse coefficients encode the smooth underlying signal, • Detailed coefficients encode non-smooth content of the signal, • Typical signals are concentrated on few coefficients, • The remaining coefficients capture only noise components. ⇒ Heavy tailed distribution with a peak at zero, i.e., wavelets favor sparsity. 49
  • 68. Wavelets as a sparsifying transform Fourier (a) x (b) F x (c) λ (d) (F x)i/λi Fourier (ui, vi freq. of component i) • E∗ = F / √ n • λ2 i = n−1 MPSD and ∞ if i = 0 • Arbitrary DC component 50
• 69. Wavelets as a sparsifying transform Fourier (a) x (b) F x (c) λ (d) (F x)i/λi Wavelets (e) x (f) W x (g) λ (h) (W x)i/λi Fourier (ui, vi freq. of component i) • E∗ = F/√n • λi² = n⁻¹ MPSD, and ∞ if i = 0 • Arbitrary DC component Wavelets (ji scale of component i) • E∗ = W • λi = α 2^(ji−1), and ∞ if ji = J • Arbitrary coarse component 50
• 70. Distribution of wavelet coefficients (a) x (b) ηi = (W x)i/λi (c) Histogram of η (d) Log-histogram of η 51
• 71. Shrinkage in the wavelet domain Shrinkage in the discrete wavelet domain (a) y
    sig = 20
    y = x + sig * nr.randn(*x.shape)
    z = im.dwt(y, 3, h, g)
    zhat = shrink(z, lbd, sig)
    xhat = im.idwt(zhat, 3, h, g)
z = W y, ˆzi = s(zi; λi, σ), ˆx = W⁻¹ ˆz 52
• 72. Shrinkage in the wavelet domain (same code and formulas as slide 71) (a) y (b) z (Haar) 52
• 73. Shrinkage in the wavelet domain (same code and formulas as slide 71) (a) y (b) ˆz (Haar+LMMSE) 52
• 74. Shrinkage in the wavelet domain (same code and formulas as slide 71) (a) y (b) ˆz (Haar+LMMSE) (c) ˆx 52
• 75. Shrinkage in the wavelet domain (same code and formulas as slide 71) (a) y (b) ˆz (Daubechies+LMMSE) (c) ˆx 52
• 76. Shrinkage in the wavelet domain (same code and formulas as slide 71) (a) y (b) ˆz (Daubechies+Soft-T) (c) ˆx 52
• 77. Shrinkage in the wavelet domain (same code and formulas as slide 71) (a) y (b) ˆz (Daubechies+Hard-T) (c) ˆx 52
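The shrink routine called in the snippet on slide 71 is not spelled out on these slides. The sketch below gives one possible implementation of the three shrinkage rules illustrated above, assuming z is the array of wavelet coefficients, lbd holds the prior standard deviations λi (broadcastable against z), sig is the noise level σ, and the names shrink_lmmse, shrink_soft and shrink_hard are hypothetical.
    import numpy as np

    def shrink_lmmse(z, lbd, sig):
        # linear (Wiener-like) shrinkage: scale each coefficient by lbd^2 / (lbd^2 + sig^2)
        return lbd**2 / (lbd**2 + sig**2) * z

    def shrink_soft(z, lbd, sig, tau=3.0):
        # soft-thresholding with threshold tau * sig (lbd kept only for a common signature)
        t = tau * sig
        return np.sign(z) * np.maximum(np.abs(z) - t, 0)

    def shrink_hard(z, lbd, sig, tau=3.0):
        # hard-thresholding: zero out coefficients smaller than tau * sig, keep the others
        t = tau * sig
        return z * (np.abs(z) > t)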
  • 78. Shrinkage in the wavelet domain (a) y (σ = 20) (b) Db2+LMMSE (c) Db2+Soft-T (d) Db2+Hard-T (e) Haar+Hard-T For large noise: Blocky effects and Ringing artifacts 53
  • 79. Shrinkage in the wavelet domain (a) y (σ = 40) (b) Db2+LMMSE (c) Db2+Soft-T (d) Db2+Hard-T (e) Haar+Hard-T For large noise: Blocky effects and Ringing artifacts 53
  • 80. Shrinkage in the wavelet domain (a) y (σ = 60) (b) Db2+LMMSE (c) Db2+Soft-T (d) Db2+Hard-T (e) Haar+Hard-T For large noise: Blocky effects and Ringing artifacts 53
  • 81. Shrinkage in the wavelet domain (a) y (σ = 120) (b) Db2+LMMSE (c) Db2+Soft-T (d) Db2+Hard-T (e) Haar+Hard-T For large noise: Blocky effects and Ringing artifacts 53
• 83. Limits of the discrete wavelet transform • While Fourier shrinkage is translation invariant: ψ(yτ) = ψ(y)τ where yτ(s) = y(s + τ) • Wavelet shrinkage is not translation invariant. • This is due to the decimation step: W = [(G)↓2 ; (H)↓2] ∈ R^(n×n) where M↓2 = "M(1:2:end, :)" • This explains the blocky artifacts that we observe. 54
• 84. Undecimated discrete wavelet transform (UDWT) Figure 1 – Haar DWT • The Haar transform groups pixels in clusters of 4. • Blocks are treated independently of each other. • When similar neighboring blocks are shrunk differently, the difference becomes clearly visible in the image. • This is all the more pronounced as the noise level grows. What if we do not decimate? ⇒ UDWT, aka stationary or translation-invariant wavelet transform. 55
  • 85. Undecimated discrete wavelet transform (UDWT) 1-scale DWT • For a 4 × 4 image: 4 × 4 coefficients. • For n pixels: K = n coefficients. 1-scale UDWT • For a 4 × 4 image: 8 × 8 coefficients. • For n pixels: K = 4n coeffs. What about multi-scale? 56
• 86. Undecimated discrete wavelet transform (UDWT) À trous algorithm ("with holes") (Holschneider et al., 1989) Instead of decimating the coefficients at each scale j, upsample the filters h and g by injecting 2^j − 1 zeros between consecutive entries. 57
• 87. Undecimated discrete wavelet transform (UDWT) DWT: Mallat's dyadic pyramidal multi-resolution scheme UDWT: À trous algorithm – G↑p: inject p zeros between consecutive filter coefficients Multi-scale: K = (1 + J(2^d − 1))n coeffs (J: #scales, d = 2 for images) 58
• 88. Undecimated discrete wavelet transform (UDWT) Implementation of 2D UDWT (À trous algorithm)
    def udwt(x, J, h, g):
        if J == 0:
            return x[:, :, None]
        tmph = flip(convolve(flip(x), h)) / 2
        tmpg = flip(convolve(flip(x), g)) / 2
        z = np.stack((convolve(tmpg, h), convolve(tmph, g), convolve(tmph, h)), axis=2)
        coarse = convolve(tmpg, g)
        h2 = interleave0(h)
        g2 = interleave0(g)
        z = np.concatenate((udwt(coarse, J - 1, h2, g2), z), axis=2)
        return z
Linear complexity. Can be easily modified to reduce memory usage. 59
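The helpers flip, convolve and interleave0 are assumed to be defined elsewhere (e.g., in the course code). A minimal sketch of interleave0, under the assumption that it inserts one zero between consecutive filter taps, so that applying it recursively across scales produces the 2^j − 1 zeros of the à trous scheme:
    import numpy as np

    def interleave0(h):
        # [h0, h1, h2] -> [h0, 0, h1, 0, h2]
        h = np.asarray(h)
        out = np.zeros(2 * len(h) - 1, dtype=h.dtype)
        out[::2] = h
        return out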
  • 89. Undecimated discrete wavelet transform (UDWT) (a) DWT (J=2) (b) UDWT Coarse Sc. (c) Detailed Scale #1 (d) Detailed Scale #2 (e) Detailed Scale #3 (f) Detailed Scale #4 (g) Detailed Scale #5 (h) Detailed Scale #6 What about its inverse transform? 60
• 90. Undecimated discrete wavelet transform (UDWT) DWT – Wavelet basis – and inverse DWT • The DWT W ∈ R^(n×n) has n columns and n rows. • The n columns/rows of W are orthonormal. • The inverse DWT is W⁻¹ = W∗. • One-to-one relationship between an image and its wavelet coefficients. UDWT – Redundant wavelet dictionary • The UDWT ¯W ∈ R^(K×n) has K = (1 + J(2^d − 1))n rows and n columns. • The rows of ¯W cannot be linearly independent: not a basis. • They are said to form a redundant/overcomplete wavelet dictionary. • Since ¯W is not square, it is not invertible. Note: redundant dictionaries necessarily favor sparsity. 61
• 91. Undecimated discrete wavelet transform (UDWT) Pseudo-inverse UDWT • Nevertheless, the n columns are orthonormal, hence: ¯W∗ = ¯W⁺ • It satisfies ¯W⁺ ¯W = Idn, but ¯W ¯W⁺ ≠ IdK • image → ¯W → coefficients → ¯W⁺ → back to the original image, • coefficients → ¯W⁺ → image → ¯W → not necessarily the same coefficients. • Satisfies the Parseval equality ⟨ ¯W x, ¯W y⟩ = ⟨x, ¯W∗ ¯W y⟩ = ⟨x, ¯W⁺ ¯W y⟩ = ⟨x, y⟩ • In the vocabulary of linear algebra: ¯W is called a tight frame. Consequence: an algorithm for ¯W⁺ can be obtained. 62
• 92. Undecimated discrete wavelet transform (UDWT) Implementation of 2D Inverse UDWT
    def iudwt(z, J, h, g):
        if J == 0:
            return z[:, :, 0]
        h2 = interleave0(h)
        g2 = interleave0(g)
        coarse = iudwt(z[:, :, :-3], J - 1, h2, g2)
        tmpg = convolve(coarse, g[::-1]) + convolve(z[:, :, -3], h[::-1])
        tmph = convolve(z[:, :, -2], g[::-1]) + convolve(z[:, :, -1], h[::-1])
        x = (flip(convolve(flip(tmpg), g[::-1]))
             + flip(convolve(flip(tmph), h[::-1]))) / 2
        return x
Linear complexity again. Can also be easily modified to reduce memory usage. Can we be more efficient? 63
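A quick numerical sanity check of the tight-frame properties of slide 91 (a sketch, assuming the udwt/iudwt implementations above, a circular convolve, and Haar filters normalized so that |H(ω)|² + |G(ω)|² = 2; the exact values printed depend on the boundary and alignment conventions of convolve):
    import numpy as np

    x = np.random.rand(64, 64)                # any test image
    h = np.array([1.0, 1.0]) / np.sqrt(2)     # low-pass filter (assumed normalization)
    g = np.array([1.0, -1.0]) / np.sqrt(2)    # high-pass filter (assumed normalization)

    z = udwt(x, 3, h, g)                      # K = (1 + 3*3) n redundant coefficients
    print(np.sum(z**2), np.sum(x**2))         # Parseval: both energies should match
    xrec = iudwt(z, 3, h, g)                  # pseudo-inverse: W+ W x should give back x
    print(np.max(np.abs(x - xrec)))           # ~0 up to convolve's boundary handling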
• 93. Multi-scale discrete wavelets Filter bank • The UDWT of x for subband k, x → ( ¯W x)k, is linear and translation invariant (LTI) ⇒ it is a convolution. • The UDWT is a filter bank: a set of band-pass filters that separates the input image into multiple components. • Each filter can be represented by its frequency response. • Direct and inverse transforms: implementation in the Fourier domain. 64
• 94. Undecimated discrete wavelet transform (UDWT) [figure rows: filters, products, subbands] (a) Coarse 2 (b) Details 2 (c) Details 2 (d) Details 2 (e) Details 1 (f) Details 1 (g) Details 1 Haar with J = 2 levels of decomposition 65
• 95. Undecimated discrete wavelet transform (UDWT) [figure rows: Haar, Db2, Db4, Db8] (a) Coarse 2 (b) Details 2 (c) Details 2 (d) Details 2 (e) Details 1 (f) Details 1 (g) Details 1 Haar: band-pass with side lobes. Db8: closer to an ideal band-pass. 66
  • 96. Undecimated discrete wavelet transform (UDWT) Db4 with J = 6 levels of decomposition How to create such a filter bank? 67
• 97. Undecimated discrete wavelet transform (UDWT) UDWT: Creation of the filter bank (offline)
    def udwt_create_fb(n1, n2, J, h, g, ndim=3):
        if J == 0:
            return np.ones((n1, n2, 1, *[1] * (ndim - 2)))
        h2 = interleave0(h)
        g2 = interleave0(g)
        fbrec = udwt_create_fb(n1, n2, J - 1, h2, g2, ndim=ndim)
        gf1 = nf.fft(fftpad(g, n1), axis=0)
        hf1 = nf.fft(fftpad(h, n1), axis=0)
        gf2 = nf.fft(fftpad(g, n2), axis=0)
        hf2 = nf.fft(fftpad(h, n2), axis=0)
        fb = np.zeros((n1, n2, 4), dtype=np.complex128)
        fb[:, :, 0] = np.outer(gf1, gf2) / 2
        fb[:, :, 1] = np.outer(gf1, hf2) / 2
        fb[:, :, 2] = np.outer(hf1, gf2) / 2
        fb[:, :, 3] = np.outer(hf1, hf2) / 2
        fb = fb.reshape(n1, n2, 4, *[1] * (ndim - 2))
        fb = np.concatenate((fb[:, :, 0:1] * fbrec, fb[:, :, -3:]), axis=2)
        return fb
68
• 98. Undecimated discrete wavelet transform (UDWT) UDWT: Direct transform using the filter bank (online)
    def fb_apply(x, fb):
        x = nf.fft2(x, axes=(0, 1))
        z = fb * x[:, :, None]
        z = np.real(nf.ifft2(z, axes=(0, 1)))
        return z
UDWT: Inverse transform using the filter bank (online)
    def fb_adjoint(z, fb, squeeze=True):
        z = nf.fft2(z, axes=(0, 1))
        x = (np.conj(fb) * z).sum(axis=2)
        x = np.real(nf.ifft2(x, axes=(0, 1)))
        return x
Much more efficient than the previous implementation when J > 1 69
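A possible end-to-end use of this Fourier-domain implementation (a sketch; it assumes the functions above with their helpers and some filter pair h, g, and passes ndim=2 so that the filter bank broadcasts against a grayscale image, which is my reading of the ndim argument):
    import numpy as np

    n1, n2, J = 256, 256, 4
    fb = udwt_create_fb(n1, n2, J, h, g, ndim=2)  # offline: build the filter bank once
    x = np.random.rand(n1, n2)                    # any grayscale image of matching size
    z = fb_apply(x, fb)                           # direct UDWT: shape (n1, n2, 1 + 3*J)
    xrec = fb_adjoint(z, fb)                      # adjoint (pseudo-inverse for a tight frame)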
• 99. Reconstruction with the UDWT Shrinkage with UDWT • Consider a denoising problem y = x + w with noise variance σ². • Shrink the K ≫ n coefficients independently. ˆx = ¯W⁺ ˆz (pseudo-inverse) where ˆzi = s(zi; λi, σi) (shrinkage) and z = ¯W y (redundant representation) Rule of thumb for soft-thresholding: • For the orthonormal DWT W: increase λi as √2^(d(ji−1)). • For the tight-frame UDWT ¯W: increase λi as 2^(d(ji−1/2)). (ji scale for coefficient i, d = 2 for images). 70
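A small sketch of these scalings, assuming scales are indexed j = 1 (finest) to J (coarsest), d = 2 for images, and a hypothetical base constant alpha; the resulting λ_j would then be used in the shrinkage s(zi; λ_j, σ) for all detail coefficients at scale j:
    import numpy as np

    J, d, alpha = 4, 2, 1.0
    scales = np.arange(1, J + 1)                           # j = 1 (finest) ... J (coarsest)
    lbd_dwt  = alpha * np.sqrt(2) ** (d * (scales - 1))    # orthonormal DWT: sqrt(2)^(d(j-1))
    lbd_udwt = alpha * 2.0 ** (d * (scales - 0.5))         # tight-frame UDWT: 2^(d(j-1/2))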
  • 100. Reconstruction with the UDWT (a) y (b) DWT(3)+Haar+HT (c) DWT(3)+Db2+HT (e) UDWT(3)+Db2+HT 71
  • 101. Reconstruction with the UDWT (a) y (b) DWT(3)+Haar+HT (c) DWT(3)+Db2+HT (d) UDWT(3)+Haar+HT (e) UDWT(3)+Db2+HT (f) UDWT(3)+Db8+HT 71
  • 102. Reconstruction with the UDWT (a) y (b) DWT(3)+Haar+HT (c) DWT(3)+Db2+HT (d) UDWT(1)+Db2+HT (e) UDWT(3)+Db2+HT (f) UDWT(5)+Db2+HT 71
  • 103. Reconstruction with the UDWT (a) y (b) DWT(3)+Haar+HT (c) DWT(3)+Db2+HT (d) UDWT(3)+Db2+Linear (e) UDWT(3)+Db2+HT (f) UDWT(3)+Db2+ST 71
  • 104. Reconstruction with the UDWT (a) y (σ = 20) (b) UDWT+Lin. (c) UDWT+HT (d) DWT+HT 72
  • 105. Reconstruction with the UDWT (a) y (σ = 40) (b) UDWT+Lin. (c) UDWT+HT (d) DWT+HT 72
  • 106. Reconstruction with the UDWT (a) y (σ = 60) (b) UDWT+Lin. (c) UDWT+HT (d) DWT+HT 72
  • 107. Reconstruction with the UDWT (a) y (σ = 120) (b) UDWT+Lin. (c) UDWT+HT (d) DWT+HT 72
• 108. Reconstruction with the UDWT ˆx = ¯W⁺ ˆz (pseudo-inverse) where ˆzi = s(zi; λi, σi) (shrink K coefficients) and z = ¯W y (redundant representation) Connection with Bayesian shrinkage? • Since the rows of ¯W are linearly dependent, the coefficients zi are necessarily correlated (non-white). • We shrink the K ≫ n coefficients independently, even though they cannot be assumed independent. • This estimator has no Bayesian interpretation; it does not correspond to the MMSE or MAP. How to use the UDWT in the Bayesian context? 73
• 109. Reconstruction with the UDWT Bayesian analysis model Whitening model: Consider η = Λ^(−1/2) W x (η coeffs) such that E[η] = 0n and Var[η] = Idn Analysis: images can be transformed into white coefficients. Meaningless when the rows of W are redundant. Bayesian synthesis model Generative model: Consider x = ¯W⁺ Λ^(1/2) η (η code) such that E[η] = 0K and Var[η] = IdK Synthesis: images can be generated from a white code. Always well-founded. 74
  • 110. Reconstruction with the UDWT Forward model: y = x + w Maximum a Posteriori for the Synthesis model • Instead of looking for x, consider the MAP for the code η ˆη ∈ argmax η∈RK p(η|y) = argmin η∈RK [− log p(y|η) − log p(η)] = argmin η∈RK 1 2 ||y − ¯W + Λ1/2 η||2 2 − log p(η) • Once you get ˆη , generate the image ˆx as ˆx = ¯W + Λ1/2 ˆη What interpretation? 75
• 111. Reconstruction with the UDWT Penalized least squares with a redundant dictionary • Consider the redundant wavelet dictionary D = ¯W⁺ Λ^(1/2) D = (d1, d2, . . . , dK) (linearly dependent atoms), ||di|| = λi, K ≫ n • Goal: look for a code η ∈ R^K such that ˆx is close to y ˆx = Dη = Σ_{i=1}^{K} ηi di = "linear combination of the redundant atoms di of D" • Since D is redundant, different codes η produce the same image x. • Penalize each ηi independently to select a relevant one ˆη ∈ argmin_{η∈R^K} 1/2 ||y − ¯W⁺ Λ^(1/2) η||²₂ − Σ_{i=1}^{K} log p(ηi) What choice for log p(ηi)? 76
• 112. Reconstruction with the UDWT Penalized least squares with a redundant dictionary • 1/2 ||y − Dη||²₂ + τ²/2 ||η||²₂, with ||η||²₂ = Σi ηi² ← Ridge regression • 1/2 ||y − Dη||²₂ + τ||η||₁, with ||η||₁ = Σi |ηi| ← LASSO • 1/2 ||y − Dη||²₂ + τ²/2 ||η||₀, with ||η||₀ = Σi 1{ηi ≠ 0} ← Sparse regression When D is redundant, these problems are no longer separable. They require large-scale optimization techniques. 77
• 114. Ridge regression Ridge/Smooth regression • Convex energy: E(η) = 1/2 ||y − Dη||²₂ + τ²/2 ||η||²₂ • Gradient: ∇E(η) = D∗(Dη − y) + τ²η • Optimality condition: ˆη = (D∗D + τ² IdK)⁻¹ D∗y • For the UDWT: this is an LTI filter ≡ convolution (non-adaptive) (a) y (b) Linear shrink (c) Ridge (d) Difference Ridge ≡ Linear shrinkage (except if D is orthogonal). 78
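A small sketch of this closed-form ridge solution with a toy explicit dictionary D (for the UDWT one would never form D explicitly but would instead apply the equivalent LTI filter in the Fourier domain); the sizes and data below are placeholders:
    import numpy as np

    rng = np.random.default_rng(0)
    n, K, tau = 50, 200, 0.5
    D = rng.standard_normal((n, K))    # toy redundant dictionary (K > n atoms as columns)
    y = rng.standard_normal(n)         # toy observation

    eta = np.linalg.solve(D.T @ D + tau**2 * np.eye(K), D.T @ y)   # ridge code
    xhat = D @ eta                                                 # reconstructed signal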
• 115. Sparse regression Sparse regression / ℓ0 regularization (1/3) • Energy: E(η) = 1/2 ||y − Dη||²₂ + τ²/2 ||η||₀ • Penalty: ||η||₀ = #non-zero elements in η • Non-convex: 0.5 = 1/2 (||0||₀ + ||1||₀) < ||0.5||₀ = 1 • Produces optimal sparse solutions adapted to the signal • But non-differentiable and discontinuous 79
• 116. Sparse regression Sparse regression / ℓ0 regularization (2/3) • If D is orthogonal: solution given by Hard-Thresholding. • Otherwise, the exact solution is obtained by brute force: • For all possible supports I ⊆ {1, . . . , K} (set of non-zero coefficients) • Solve the least-squares estimation problem: argmin_{(ηi)i∈I} 1/2 ||y − Σ_{i∈I} ηi di||²₂ • Pick the solution that minimizes E. • NP-hard combinatorial problem: #subsets = Σ_{k=0}^{K} (K choose k) = 2^K 80
• 117. Sparse regression Sparse regression / ℓ0 regularization (3/3) • Sub-optimal solutions can be obtained by greedy algorithms. • Matching pursuit (MP): (Mallat, 1993) 1 Initialization: r ← y, η ← 0, k ← 0 2 Choose i maximizing |D∗r|i = |⟨di, r⟩| 3 Compute α = ⟨r, di⟩/||di||²₂ 4 Update r ← r − α di 5 Update ηi ← ηi + α 6 Update k ← k + 1 7 Back to step 2 while E(η) = 1/2 ||r||²₂ + τ²/2 k decreases • Lots of iterations: complexity O(kn), with k the sparsity of the solution. • Each iteration requires computing an inverse UDWT. • Extensions: OMP (Tropp & Gilbert, 2007), CoSaMP (Needell & Tropp, 2009) 81
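A minimal sketch of matching pursuit with an explicit dictionary D (columns = atoms); a fixed iteration budget replaces the energy-decrease stopping rule of the slide, and repeated selections of the same atom simply accumulate in η:
    import numpy as np

    def matching_pursuit(y, D, n_iter=50):
        # D: (n, K) dictionary whose columns are the atoms d_i
        eta = np.zeros(D.shape[1])
        r = y.copy()                              # residual
        norms2 = np.sum(D**2, axis=0)             # ||d_i||_2^2 for all atoms
        for _ in range(n_iter):
            corr = D.T @ r                        # <d_i, r> for all atoms
            i = np.argmax(np.abs(corr))           # most correlated atom
            alpha = corr[i] / norms2[i]
            eta[i] += alpha                       # accumulate the coefficient
            r -= alpha * D[:, i]                  # update the residual
        return eta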
• 118. Least Absolute Shrinkage and Selection Operator (LASSO) Convex relaxation: take the best of both worlds, sparsity and convexity. LASSO / ℓ1 regularization (Tibshirani, 1996) • Convex energy: E(η) = 1/2 ||y − Dη||²₂ + τ||η||₁ • Non-smooth penalty: ||η||₁ = Σ_{i=1}^{K} |ηi| • If D is orthogonal: solution given by Soft-Thresholding. • Also produces sparse solutions adapted to the signal 82
  • 119. Least Absolute Shrinkage and Selection Operator (a) Input (b) ST+UDWT (1s) (c) LASSO+UDWT (30s) (d) Difference Though the solutions look alike, their codes η are very different. 83
• 120. Least Absolute Shrinkage and Selection Operator [figure rows: Input, ST+UDWT, LASSO+UDWT] (a) Image (b) Coarse 5 (c) Scale 5 (d) Scale 5 (e) Scale 5 (f) Scale 1 (g) Scale 1 The LASSO creates much sparser codes than ST only. 84
• 121. Least Absolute Shrinkage and Selection Operator Why use the LASSO if shrinkage in the UDWT domain provides similar results? • Shrinkage in the UDWT domain can only be applied to denoising problems. • The LASSO can be adapted to inverse problems: ˆx = Dˆη with ˆη ∈ argmin_{η∈R^K} 1/2 ||y − HDη||²₂ + τ||η||₁ But it requires solving a non-smooth convex optimization problem. Solution: use the sub-differential and Fermat's rule. 85
• 122. Non-smooth convex optimization Definition (Sub-differential) • Let f : R^n → R be a convex function; u ∈ R^n is a sub-gradient of f at x∗ if, for all x ∈ R^n, f(x) ≥ f(x∗) + ⟨u, x − x∗⟩. • The sub-differential is the set of sub-gradients ∂f(x∗) = {u ∈ R^n : ∀x ∈ R^n, f(x) ≥ f(x∗) + ⟨u, x − x∗⟩}. If the sub-gradient is unique, f is differentiable and ∂f(x) = {∇f(x)}. 86
• 126. Non-smooth convex optimization Theorem (Fermat's rule) Let f : R^n → R be a convex function; then x∗ ∈ argmin_{x∈R^n} f(x) ⇔ 0n ∈ ∂f(x∗) If f is also differentiable, this corresponds to the standard rule ∇f(x∗) = 0n. Minimizers are the only points with a horizontal tangent 87
• 127. Non-smooth convex optimization Function (abs): f : R → R, x → |x| Sub-differential (sign): ∂f(x∗) = {−1} if x∗ ∈ (−∞, 0), {+1} if x∗ ∈ (0, ∞), [−1, 1] if x∗ = 0 88
• 135. Non-smooth convex optimization Proximal operator • Let f : R^n → R be a convex function (+ some technical conditions). The proximal operator of f is Proxf(x) = argmin_{z∈R^n} 1/2 ||z − x||²₂ + f(z) • Remark: this minimization problem always has a unique solution, so the proximal operator is unambiguously a function R^n → R^n. • Always non-expansive: ||Proxf(x1) − Proxf(x2)|| ≤ ||x1 − x2|| • Can be interpreted as a denoiser/shrinkage for the regularity f. Property: Proxγf(x) = (Id + γ∂f)⁻¹ x 89
• 136. Non-smooth convex optimization Proof. z = argmin_z 1/2 ||z − x||²₂ + γf(z) ⇔ 0 ∈ ∂(1/2 ||z − x||²₂) + γ∂f(z) ⇔ 0 ∈ z − x + γ∂f(z) ⇔ x ∈ z + γ∂f(z) ⇔ x ∈ (Id + γ∂f)(z) ⇔ z = (Id + γ∂f)⁻¹ x Even though ∂f(x) is a set, the pre-image by Id + γ∂f is unique. 90
• 137. Non-smooth convex optimization Soft-thresholding Proxγ|·|(x) = argmin_{z∈R} 1/2 (z − x)² + γ|z| = (Id + γ∂|·|)⁻¹ x = x − γ if x > γ; x + γ if x < −γ; 0 otherwise 91
• 138. Non-smooth convex optimization Proximal operators of simple functions (Name: f(x) → Proxγf(x)) • Indicator of a convex set C: 0 if x ∈ C, ∞ otherwise → ProjC(x) • Square: 1/2 ||x||²₂ → x/(1 + γ) • Abs: ||x||₁ → Soft-T(x, γ) • Euclidean: ||x||₂ → (1 − γ/max(||x||₂, γ)) x • Square+Affine: 1/2 ||Ax + b||²₂ → (Id + γA∗A)⁻¹(x − γA∗b) • Separability: for x = (x1, x2), g(x1) + h(x2) → (Proxγg(x1), Proxγh(x2)) More exhaustive list: http://proximity-operator.net 92
• 139. Non-smooth convex optimization Proximal minimization • Let f : R^n → R be a convex function (+ some technical conditions). Then, whatever the initialization x0 and γ > 0, the sequence x^(k+1) = Proxγf(x^k) converges towards a global minimizer of f. Proxγf(x^k) = (Id + γ∂f)⁻¹ x^k = argmin_z 1/2 ||z − x^k||²₂ + γf(z) Compared to gradient descent • No need to be differentiable, • No need to have a Lipschitz gradient, • Works whatever the parameter γ, • Requires solving an optimization problem at each step. 93
• 140. Non-smooth convex optimization Gradient descent: read x^k on the x-axis and evaluate its image by the function x − γ∇F(x). 94
  • 141. Non-smooth convex optimization Proximal minimization: Look at the set x + γ∂F(x) 94
• 142. Non-smooth convex optimization Proximal minimization: read x^k on the y-axis and evaluate its pre-image by x + γ∂F(x). 95
  • 143. Non-smooth convex optimization Proximal minimization: the larger γ the faster, but the inversion becomes harder (ill-conditioned). 95
• 144. Non-smooth convex optimization Toy example • Consider the smoothing regularization problem F(x) = 1/2 ||∇x||²_{2,2} • Its sub-gradient is thus given by ∂F(x) = {∇F(x)} = {−∆x} • The proximal minimization reads as x^(k+1) = (Id + γ∂F)⁻¹ x^k = (Id − γ∆)⁻¹ x^k • This is exactly the implicit Euler scheme for the heat equation. 96
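A sketch of one such proximal step x^(k+1) = (Id − γ∆)⁻¹ x^k computed in the Fourier domain, assuming periodic boundary conditions so that the discrete 5-point Laplacian is diagonalized by the 2D DFT:
    import numpy as np

    def implicit_heat_step(x, gamma):
        # eigenvalues of -Delta (5-point stencil, periodic): 4 - 2cos(w1) - 2cos(w2) >= 0
        n1, n2 = x.shape
        w1 = 2 * np.pi * np.fft.fftfreq(n1)[:, None]
        w2 = 2 * np.pi * np.fft.fftfreq(n2)[None, :]
        lap = 4 - 2 * np.cos(w1) - 2 * np.cos(w2)
        return np.real(np.fft.ifft2(np.fft.fft2(x) / (1 + gamma * lap)))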
  • 145. Non-smooth convex optimization Can we apply proximal minimization for the LASSO? Proximal splitting methods • The proximal operator may not have a closed form. • Computing it may be as difficult as solving the original problem • Solution: use proximal splitting methods, a family of techniques developed for non-smooth convex problems. • Idea: split the problem into subproblems, that involve • gradient descent steps for smooth terms, • proximal steps for simple convex terms. 97
• 146. Non-smooth convex optimization min_{x∈R^n} {E(x) = F(x) + G(x)} Proximal forward-backward algorithm • Assume F is convex and differentiable with L-Lipschitz gradient ||∇F(x1) − ∇F(x2)||₂ ≤ L||x1 − x2||₂, for all x1, x2. • Assume G is convex and simple, i.e., its prox is known in closed form ProxγG(x) = argmin_z 1/2 ||z − x||²₂ + γG(z) • The proximal forward-backward algorithm reads x^(k+1) = ProxγG(x^k − γ∇F(x^k)) • For 0 < γ < 2/L, it converges to a minimizer of E = F + G. Aka explicit-implicit scheme, by analogy with PDE discretization schemes. 98
• 147. Non-smooth convex optimization The LASSO problem: E(η) = 1/2 ||y − Aη||²₂ (= F(η)) + τ||η||₁ (= G(η) = Σi |ηi|), A = HD Iterative Soft-Thresholding Algorithm (ISTA) (Daubechies, 2004) • F is convex and differentiable with L-Lipschitz gradient ∇F(η) = A∗(Aη − y) with L = ||A||²₂ • G is convex and simple, in fact separable: ProxγG(η)i = Soft-T(ηi, γτ) • The proximal forward-backward algorithm reads, for 0 < γ < 2/L, η^(k+1) = Soft-T(η^k − γ(A∗Aη^k − A∗y), γτ) and is known as the Iterative Soft-Thresholding Algorithm (ISTA). • Finally: ˆx = Dˆη 99
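A minimal sketch of ISTA with an explicit operator A (here a plain matrix; for the UDWT one would instead apply the filter-bank operators of slide 98 and their adjoints):
    import numpy as np

    def soft_threshold(x, t):
        return np.sign(x) * np.maximum(np.abs(x) - t, 0)

    def ista(y, A, tau, n_iter=300):
        L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient of F
        gamma = 1.0 / L
        eta = np.zeros(A.shape[1])
        for _ in range(n_iter):
            grad = A.T @ (A @ eta - y)         # gradient step on the data term
            eta = soft_threshold(eta - gamma * grad, gamma * tau)   # prox step on the l1 term
        return eta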
• 148. Non-smooth convex optimization Preconditioned ISTA (1/2) • Remark ˆη ∈ argmin_{η∈R^K} 1/2 ||y − Aη||²₂ + τ||η||₁, A = HD = H ¯W⁺ Λ^(1/2) ∈ argmin_{η∈R^K} 1/2 ||y − H ¯W⁺ Λ^(1/2) η||²₂ + τ||η||₁ • Λ^(1/2) invertible: bijection between z = Λ^(1/2) η and η = Λ^(−1/2) z • Solving for η is equivalent to solving a weighted LASSO for z ˆz ∈ argmin_{z∈R^K} 1/2 ||y − H ¯W⁺ z||²₂ + τ||Λ^(−1/2) z||₁ ∈ argmin_{z∈R^K} 1/2 ||y − Bz||²₂ + Σ_{i=1}^{K} (τ/λi)|zi|, B = H ¯W⁺ • In practice, this equivalent problem has better conditioning. 100
• 149. Non-smooth convex optimization Equivalent to: E(z) = 1/2 ||y − Bz||²₂ (= F(z)) + τ||Λ^(−1/2) z||₁ (= G(z) = Σi (τ/λi)|zi|), B = H ¯W⁺ Preconditioned ISTA (2/2) ∇F(z) = B∗(Bz − y) with L = ||B||²₂ ProxγG(z)i = Soft-T(zi, γτ/λi) • ISTA becomes, for 0 < γ < 2/L, z^(k+1) = Soft-T(z^k − γ(B∗Bz^k − B∗y), γτ/λi) • Finally: ˆx = ¯W⁺ ˆz • Leads to larger steps γ, better conditioning, and faster convergence. 101
• 150. Non-smooth convex optimization z^(k+1) = ProxγG(z^k − γ∇F(z^k)) = Soft-T(z^k − γ(B∗Bz^k − B∗y), γτ/λi) with B = H ¯W⁺ Bredies & Lorenz (2007): E(z^k) − E(z⋆) decays with rate O(1/k) Fast ISTA (FISTA) z^(k+1) = ProxγG(˜z^k − γ∇F(˜z^k)) ˜z^(k+1) = z^(k+1) + ((tk − 1)/t_(k+1)) (z^(k+1) − z^k) t_(k+1) = (1 + √(1 + 4tk²))/2, t0 = 1 Beck & Teboulle (2009): E(z^k) − E(z⋆) decays with rate O(1/k²) 102
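The same iteration with the FISTA momentum (a sketch, again with an explicit matrix A standing in for the preconditioned operator B):
    import numpy as np

    def soft_threshold(x, t):
        return np.sign(x) * np.maximum(np.abs(x) - t, 0)

    def fista(y, A, tau, n_iter=300):
        L = np.linalg.norm(A, 2) ** 2
        gamma = 1.0 / L
        eta = np.zeros(A.shape[1])
        eta_tilde, t = eta.copy(), 1.0
        for _ in range(n_iter):
            grad = A.T @ (A @ eta_tilde - y)
            eta_new = soft_threshold(eta_tilde - gamma * grad, gamma * tau)
            t_new = (1 + np.sqrt(1 + 4 * t**2)) / 2            # momentum schedule
            eta_tilde = eta_new + (t - 1) / t_new * (eta_new - eta)
            eta, t = eta_new, t_new
        return eta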
• 151. Non-smooth convex optimization (a) Input y: motion blur + noise (σ = 2) (b) Convergence profiles (ISTA vs FISTA) (c) Deconvolution ISTA(300)+UDWT (d) Deconvolution FISTA(300)+UDWT 103
• 152. Non-smooth convex optimization [figure rows: Input, ISTA(300), FISTA(300)] (a) Image (b) Scale 4 (c) Scale 4 (d) Scale 4 (e) Scale 3 (f) Scale 3 (g) Scale 3 FISTA converges faster: sparser codes given a limited time budget 104
  • 154. Sparse reconstruction: synthesis vs analysis Sparse synthesis model with UDWT • LASSO: ˆη ∈ argmin η∈RK 1 2 ||y − H ¯W + Λ1/2 η||2 2 + τ||η||1 • Using the change of variable η = Λ−1/2 z: ˆz ∈ argmin z∈RK 1 2 ||y − H ¯W + z||2 2 + τ||Λ−1/2 z||1 Sparse analysis model with UDWT • What about? ˆx ∈ argmin x∈Rn 1 2 ||y − Hx||2 2 + τ||Λ−1/2 ¯W x||1 • The change of variable η = Λ−1/2 ¯W x is not one-to-one. • The two problems are not equivalent (unless ¯W is invertible). 105
• 156. Sparse reconstruction: synthesis vs analysis Analysis versus synthesis (Elad, Milanfar, Rubinstein, 2007) Generative: generate good images ˆx = Dˆη with ˆη ∈ argmin_{η∈R^K} 1/2 ||y − HDη||²₂ + τ||η||p^p, p ≥ 0 Synthesis: images are linear combinations of a few columns of D. Bayesian interpretation: MAP for the sparse code η. Discriminative: discriminate between good and bad images ˆx ∈ argmin_{x∈R^n} 1/2 ||y − Hx||²₂ + τ||Γx||p^p, p ≥ 0 Analysis: images are correlated with a few rows of Γ. Bayesian interpretation: MAP for x with an improper Gibbs prior. 107
• 157. Sparse reconstruction: synthesis vs analysis ˆη ∈ argmin_{η∈R^K} 1/2 ||y − HDη||²₂ + τ||η||p^p (ℓp^p-synthesis) ˆx ∈ argmin_{x∈R^n} 1/2 ||y − Hx||²₂ + τ||Γx||p^p (ℓp^p-analysis) Common properties: • p = 0: optimal sparse solution; non-convex and discontinuous problem (NP-hard) • 0 < p < 1: sparse solution; non-convex, continuous but non-smooth problem • p = 1: sparse solution; convex, continuous but non-smooth problem • p > 1: smooth solution; convex and differentiable problem • p = 2: linear solution; quadratic problem • Γ square and invertible ⇒ equivalent for D = Γ⁻¹. • Γ full-rank and p = 2 ⇒ equivalent for D = Γ⁺. • LTI dictionaries ⇒ redundant filter bank. 108
• 158. Sparse reconstruction: synthesis vs analysis ˆη ∈ argmin_{η∈R^K} 1/2 ||y − HDη||²₂ + τ||η||p^p (ℓp^p-synthesis) ˆx ∈ argmin_{x∈R^n} 1/2 ||y − Hx||²₂ + τ||Γx||p^p (ℓp^p-analysis) Synthesis • D: synthesis dictionary. • Atoms need to span images. ⇒ Low- and high-pass filters ⇒ Im[D] ≈ R^n • Redundancy favors sparsity. • K-dimensional problem (> n). • Prior separable. Analysis • Γ: analysis dictionary. • Atoms need to sparsify images. ⇒ High-pass filters only ⇒ Ker[Γ] ≠ {0} (contains DC, coarse components) • Redundancy decreases sparsity. • n-dimensional problem (< K). • Prior non-separable. Quiz: What analysis dictionary is LTI and not too redundant? 109
• 159. Sparse reconstruction: synthesis vs analysis ˆη ∈ argmin_{η∈R^K} 1/2 ||y − HDη||²₂ + τ||η||p^p (ℓp^p-synthesis) ˆx ∈ argmin_{x∈R^n} 1/2 ||y − Hx||²₂ + τ||Γx||p^p (ℓp^p-analysis) Link between analysis models and variational methods • p = 2: analysis model = Tikhonov regularization. • p = 1 & Γ = ∇: analysis model = anisotropic Total-Variation (TV) [figure: TV filter bank = horizontal and vertical gradients; spatial filter bank; spectral filter bank (real and imaginary parts)] Can we use proximal forward-backward for the ℓ1-analysis prior? 110
  • 160. Non-smooth optimization ˆx ∈ argmin x∈Rn 1 2 ||y − Hx||2 2 F (x) + τ||Γx||1 G(x) ( 1-analysis) Proximal forward-backward for the 1-analysis problem? • F convex and differentiable • G convex but not simple (not separable) −→ cannot use proximal forward backward • Exception: for denoising H = Idn (see: Chambolle algorithm, 2004) Need another proximal optimization technique. 111
  • 161. Non-smooth optimization min x∈Rn {E(x) = F(x) + G(x)} Alternating direction method of multipliers (ADMM) (∼1970) • Assume F and G are convex and simple (+ some mild conditions). • For any initialization x0 , ˜x0 and d0 , the ADMM algorithm reads as xk+1 = ProxγF (˜xk + dk ) ˜xk+1 = ProxγG(xk+1 − dk ) dk+1 = dk − xk+1 + ˜xk+1 • For γ > 0, xk converges to a minimizer of E = F + G. Fast version: FADMM, similar idea as for FISTA (Goldstein et al., 2014). Related concepts: Lagrange multipliers, Duality, Legendre transform. How to use it for 1 analysis priors? 112
• 162. Non-smooth optimization ˆx ∈ argmin_{x∈R^n} 1/2 ||y − Hx||²₂ + τ||Γx||₁ (ℓ1-analysis) ADMM + variable splitting (1/3) • Define: X = (x, z) ∈ R^(n+K) • Consider: E(X) = F(X) + G(X) with: F(x, z) = 1/2 ||y − Hx||²₂ + τ||z||₁ and G(x, z) = 0 if Γx = z, ∞ otherwise • Remark 1: minimizing E solves the ℓ1-analysis problem. • Remark 2: F and G are convex and simple ⇒ ADMM applies. 113
• 163. Non-smooth optimization ˆx ∈ argmin_{x∈R^n} 1/2 ||y − Hx||²₂ + τ||Γx||₁ (ℓ1-analysis) ADMM + variable splitting (2/3) Applying the formulas from slide 92: F(x, z) = 1/2 ||y − Hx||²₂ + τ||z||₁ −→ ProxγF(x, z) = ((Idn + γH∗H)⁻¹(x + γH∗y), Soft-T(z, γτ)) G(x, z) = 0 if Γx = z, ∞ otherwise (indicator of the convex set C = {(x, z) ; Γx = z}) −→ ProxγG(x, z) = ((Idn + Γ∗Γ)⁻¹(x + Γ∗z), Γ(Idn + Γ∗Γ)⁻¹(x + Γ∗z)) (projection onto C) 114
  • 164. Non-smooth optimization ˆx ∈ argmin x∈Rn 1 2 ||y − Hx||2 2 + τ||Γx||1 ( 1-analysis) ADMM + Variable splitting (3/3) xk+1 = (Idn + γH∗ H)−1 (˜xk + dk x + γH∗ y) zk+1 = Soft-T(˜zk + dk z , γτ) ˜xk+1 = (Idn + Γ∗ Γ)−1 (xk+1 − dk x + Γ∗ (zk+1 − dk z )) ˜zk+1 = Γ˜xk+1 dk+1 x = dk x − xk+1 + ˜xk+1 dk+1 z = dk z − zk+1 + ˜zk+1 If H is a blur, and Γ a filter bank, (Idn + γH∗ H)−1 and (Idn + Γ∗ Γ)−1 can be computed in the Fourier domain in O(n log n). 115
• 165. Non-smooth optimization ˆx ∈ argmin_{x∈R^n} 1/2 ||y − Hx||²₂ + τ||Γx||₁ (ℓ1-analysis) Application to Total-Variation Γ = ∇ x^(k+1) = (Idn + γH∗H)⁻¹(˜x^k + d_x^k + γH∗y) z^(k+1) = Soft-T(˜z^k + d_z^k, γτ) ˜x^(k+1) = (Idn + ∇∗∇)⁻¹(x^(k+1) − d_x^k + ∇∗(z^(k+1) − d_z^k)) ˜z^(k+1) = ∇˜x^(k+1) d_x^(k+1) = d_x^k − x^(k+1) + ˜x^(k+1) d_z^(k+1) = d_z^k − z^(k+1) + ˜z^(k+1) with ∇∗ = −div and ∇∗∇ = −∆ 116
• 166. Non-smooth optimization ˆx ∈ argmin_{x∈R^n} 1/2 ||y − Hx||²₂ + τ||Γx||₁ (ℓ1-analysis) Application to Total-Variation Γ = ∇ x^(k+1) = (Idn + γH∗H)⁻¹(˜x^k + d_x^k + γH∗y) z^(k+1) = Soft-T(˜z^k + d_z^k, γτ) ˜x^(k+1) = (Idn − ∆)⁻¹(x^(k+1) − d_x^k − div(z^(k+1) − d_z^k)) ˜z^(k+1) = ∇˜x^(k+1) d_x^(k+1) = d_x^k − x^(k+1) + ˜x^(k+1) d_z^(k+1) = d_z^k − z^(k+1) + ˜z^(k+1) with ∇∗ = −div and ∇∗∇ = −∆ 116
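A sketch of these updates for TV-regularized deconvolution, assuming a periodic (circular) blur so that both matrix inverses are diagonal in the Fourier domain; kernel_f (the 2D DFT of the blur kernel), the step gamma and the iteration count are placeholders, and the gradient/divergence are the usual periodic forward differences:
    import numpy as np

    def grad(u):                     # forward differences, periodic boundaries
        return np.stack((np.roll(u, -1, 0) - u, np.roll(u, -1, 1) - u), axis=-1)

    def div(p):                      # minus the adjoint of grad
        return (p[..., 0] - np.roll(p[..., 0], 1, 0)
                + p[..., 1] - np.roll(p[..., 1], 1, 1))

    def soft_threshold(x, t):
        return np.sign(x) * np.maximum(np.abs(x) - t, 0)

    def admm_tv_deconv(y, kernel_f, tau, gamma=1.0, n_iter=200):
        n1, n2 = y.shape
        w1 = 2 * np.pi * np.fft.fftfreq(n1)[:, None]
        w2 = 2 * np.pi * np.fft.fftfreq(n2)[None, :]
        lap = 4 - 2 * np.cos(w1) - 2 * np.cos(w2)       # eigenvalues of grad* grad = -Delta
        Hty_f = np.conj(kernel_f) * np.fft.fft2(y)      # H* y in the Fourier domain
        x, xt = y.copy(), y.copy()
        zt = grad(y)
        z = np.zeros_like(zt)
        dx, dz = np.zeros_like(y), np.zeros_like(zt)
        for _ in range(n_iter):
            # x-update: (Id + gamma H*H)^(-1) (xt + dx + gamma H* y)
            x = np.real(np.fft.ifft2((np.fft.fft2(xt + dx) + gamma * Hty_f)
                                     / (1 + gamma * np.abs(kernel_f) ** 2)))
            # z-update: soft-thresholding of the split gradient variable
            z = soft_threshold(zt + dz, gamma * tau)
            # xt-update: (Id - Delta)^(-1) (x - dx - div(z - dz))
            xt = np.real(np.fft.ifft2(np.fft.fft2(x - dx - div(z - dz)) / (1 + lap)))
            zt = grad(xt)
            dx = dx - x + xt
            dz = dz - z + zt
        return xt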
• 167. Non-smooth optimization ˆx ∈ argmin_{x∈R^n} 1/2 ||y − Hx||²₂ + τ||Γx||₁ (ℓ1-analysis) Application to sparse analysis with UDWT Γ = Λ^(−1/2) ¯W x^(k+1) = (Idn + γH∗H)⁻¹(˜x^k + d_x^k + γH∗y) z^(k+1) = Soft-T(˜z^k + d_z^k, γτ/λi) ˜x^(k+1) = (Idn + ¯W∗ ¯W)⁻¹(x^(k+1) − d_x^k + ¯W∗(z^(k+1) − d_z^k)) ˜z^(k+1) = ¯W ˜x^(k+1) d_x^(k+1) = d_x^k − x^(k+1) + ˜x^(k+1) d_z^(k+1) = d_z^k − z^(k+1) + ˜z^(k+1) Tight frame: ¯W∗ ¯W = Idn 117
• 168. Non-smooth optimization ˆx ∈ argmin_{x∈R^n} 1/2 ||y − Hx||²₂ + τ||Γx||₁ (ℓ1-analysis) Application to sparse analysis with UDWT Γ = Λ^(−1/2) ¯W x^(k+1) = (Idn + γH∗H)⁻¹(˜x^k + d_x^k + γH∗y) z^(k+1) = Soft-T(˜z^k + d_z^k, γτ/λi) ˜x^(k+1) = 1/2 (x^(k+1) − d_x^k + ¯W∗(z^(k+1) − d_z^k)) ˜z^(k+1) = ¯W ˜x^(k+1) d_x^(k+1) = d_x^k − x^(k+1) + ˜x^(k+1) d_z^(k+1) = d_z^k − z^(k+1) + ˜z^(k+1) Tight frame: ¯W∗ ¯W = Idn 117
  • 169. Sparse analysis – Results Deconvolution with UDWT (5 levels, Db2) (a) Blurry image y (noise σ = 2) (b) Synthesis (FISTA) (c) Analysis (FADMM) 118
• 170. Sparse analysis – Results [figure rows: Synthesis, Analysis] (a) 1 level (b) 2 levels (c) 3 levels (d) 4 levels (e) 5 levels Analysis allows for fewer decomposition levels ⇒ leads to faster algorithms. 119
• 171. Sparse analysis – Results (a) Noisy (σ = 40) (b) Analysis UDWT(4) (c) +block (orien.+col.) (d) Difference • As for TV, group coefficients across orientations/colors using ℓ2,1 norms: ||Γz||2,1 • The soft-thresholding becomes the group soft-thresholding: (Proxγ||·||2,1(z))i = (1 − γ/||zi||₂) zi if ||zi||₂ > γ, 0 otherwise 120
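A sketch of the group soft-thresholding, assuming the coefficients are arranged so that the last axis indexes the members of each group (e.g., the orientations and/or color channels tied together):
    import numpy as np

    def group_soft_threshold(z, gamma):
        # z: (..., group_size); each group vector z_i along the last axis is shrunk jointly
        norms = np.linalg.norm(z, axis=-1, keepdims=True)
        scale = np.maximum(1 - gamma / np.maximum(norms, 1e-12), 0)  # 0 when ||z_i|| <= gamma
        return scale * z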
• 172. Shrinkage, Sparsity and Wavelets – What's next? Reminder from last class: modeling the distribution of images is complex (large number of degrees of freedom). Applying the LMMSE on patches → increases performance. Next class: What if we use sparse priors, not for the distribution of images, but for the distribution of patches? 121
• 173. Shrinkage, Sparsity and Wavelets – Further reading For further reading Sparsity, shrinkage and recovery guarantees: • Donoho & Johnstone (1994); Moulin & Liu (1999); Donoho and Elad (2003); Gribonval and Nielsen (2003); Candès and Tao (2005); Zhang (2008); Candès and Romberg (2007). • Book: Statistical Learning with Sparsity (Hastie, Tibshirani, Wainwright, 2015). Wavelet related transforms: • Warblet/Chirplet (Mann, Mihovilovic et al., 1991–1992), Curvelet (Candès & Donoho, 2000), Noiselet (Coifman, 2001), Contourlet (Do & Vetterli, 2002), Ridgelet (Do & Vetterli, 2003), Shearlets (Kanghui et al., 2005), Bandelet (Le Pennec, Peyré, Mallat, 2005), Empirical wavelets (Gilles, 2013). • Book: A Wavelet Tour of Signal Processing (Mallat, 2008). Non-smooth convex optimization: • Douglas-Rachford splitting (Combettes & Pesquet, 2007), Split Bregman (Goldstein & Osher, 2009), Primal-Dual (Chambolle & Pock, 2011), Generalized FB (Raguet et al., 2013), Condat algorithm (2014). • Book: Convex Optimization (Boyd, 2004). 122
• 174. Questions? Next class: Patch models and dictionary learning Sources and images courtesy of • L. Condat • G. Peyré • J. Salmon • Wikipedia 122