Decomposition and Denoising for moment sequences
using convex optimization
Badri Narayan Bhaskar
High Dimensional Inference (n ≪ p)
Statistics: microarray data, hyperspectral imaging, predicting ratings.

Sparse Vector Estimation (Gene Expressions)
An n × p matrix X of gene expressions (n samples, p genes) with a k-sparse coefficient vector: only a few active genes drive disease susceptibility (the response).
Low Rank Matrices: The Netflix Problem
The matrix of movie ratings (thousands of movies, millions of subscribers) admits a low rank decomposition: each rating is the inner product of a movie profile and a user profile, an "Action" score plus a handful of other features...
Low Rank Matrices: Topic Models
The term-document matrix (thousands of documents, millions of words) decomposes into a degree of topicality times a topic profile: "Politics" plus a handful of other features...
Simple objects

Objects              Notion of Simplicity   Atoms
Vectors              Sparsity               Canonical unit vectors
Matrices             Rank                   Rank-1 matrices
Bandlimited signals  No. of frequencies     Complex sinusoids
Linear systems       Number of poles        Single-pole systems
Every simple object decomposes over an atomic set A:

    x = Σ_{a ∈ A} c_a a

where the atoms a ∈ A play the role of "features" and the c_a are nonnegative weights.
Universal Regularizer
Chandrasekaran, Recht, Parrilo, Willsky (2010)

Objects              Notion of Simplicity   Atomic Norm
Vectors              Sparsity               ℓ1 norm
Matrices             Rank                   Nuclear norm
Bandlimited signals  No. of frequencies     ?
Linear systems       McMillan degree        ?

The atomic norm is the gauge of the convex hull of the atoms:

    ‖x‖_A = inf { t ≥ 0 : x ∈ t conv(A) }
          = inf { Σ_a c_a : x = Σ_a c_a a, c_a ≥ 0 }
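The first two rows of the table can be checked numerically; a minimal sketch (the example vector and matrix are mine, not from the slides):

```python
import numpy as np

# Two concrete instances of the atomic norm:
# - atoms = signed canonical unit vectors  ->  atomic norm = l1 norm
# - atoms = unit-norm rank-1 matrices      ->  atomic norm = nuclear norm
x = np.array([3.0, -4.0, 0.0, 1.0])
l1 = np.abs(x).sum()                    # weights of the decomposition over +-e_i

M = np.outer([1.0, 2.0], [2.0, 1.0])    # a rank-1 matrix
nuclear = np.linalg.svd(M, compute_uv=False).sum()

print(l1, nuclear)
```

For the rank-1 matrix M the nuclear norm is just its single singular value, ‖[1,2]‖·‖[2,1]‖ = 5.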
Moment Problems
When the atoms are indexed by a measure space, the decomposition x = Σ_a c_a a expresses x as the moments of a discrete measure: atomic moments.
Two running examples:

System Identification
[plot]

Line Spectral Estimation
[plot]
Decomposition
How many atoms do you need to represent x?
- If the composing atoms are on an exposed face, the decomposition is easy to find.
- Find a supporting hyperplane to certify it.
- Concentrate on recovering "good" objects with atoms on exposed faces.
Dual Certificate
[plot: gradient and subgradients of the atomic norm]

The dual problem is the semi-infinite program

    maximize_q  ⟨q, x⋆⟩   subject to  ‖q‖*_A ≤ 1

Its optimum value is the atomic norm; its solutions are subgradients at x.

Dual Characterization of the atomic norm

    ‖q‖*_A = sup_{a ∈ A} ⟨q, a⟩   (under mild conditions, Bonsall)

A dual optimal q certifies the support:
    ⟨q, a⟩ = 1 for a in the support,
    ⟨q, a⟩ < 1 otherwise.
Denoising
Observe y = x + w ∈ C^n with noise w ∼ N(0, σ² I_n).

AST (Atomic Soft Thresholding):

    minimize_x  ½‖y − x‖₂² + τ‖x‖_A

with tradeoff parameter τ.

Dual AST:

    maximize_q  τ⟨q, y⟩ − (τ²/2)‖q‖₂²   subject to  ‖q‖*_A ≤ 1
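For the sparse-vector atomic set, where ‖·‖_A is the ℓ1 norm, AST has the familiar closed form: entrywise soft thresholding. A minimal sketch of that special case (the input values are illustrative):

```python
import numpy as np

def soft_threshold(y, tau):
    """Closed-form AST solution when the atoms are signed canonical unit
    vectors, i.e. argmin_x 0.5*||y - x||_2^2 + tau*||x||_1."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

y = np.array([2.5, -0.3, 0.9, -4.0])
x_hat = soft_threshold(y, 1.0)
print(x_hat)  # entries shrink toward zero by tau; small ones vanish
```

Entries with |y_i| ≤ τ are set to zero, which is how the estimator selects the composing atoms.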
AST Results
Optimality conditions:

    τq̂ + x̂ = y,   ‖q̂‖*_A ≤ 1,   ⟨q̂, x̂⟩ = ‖x̂‖_A

The last two mean q̂ ∈ ∂‖x̂‖_A: we can find the composing atoms!
The first suggests τq̂ ≈ w, i.e. choose τ ≈ E‖w‖*_A.

Tradeoff = Gaussian Width
No cross validation needed: estimate the Gaussian width. The Gaussian width is also important to characterize convergence rates.
AST Convergence Rates
Compare the Lasso for an s-sparse signal:

    Fast rate:  (1/n)‖Xβ̂ − Xβ⋆‖₂² ≤ C σ² s log(p) / n
    Slow rate:  (1/n)‖Xβ̂ − Xβ⋆‖₂² ≲ σ ‖β⋆‖₁ √(log(p)/n)

Choose the appropriate tradeoff τ = η E(‖w‖*_A), η ≥ 1. With minimal assumptions on the atoms we get a universal, but slow, rate:

    (1/n)‖x̂ − x⋆‖₂² ≤ (τ/n)‖x⋆‖_A

With a "cone condition" (weaker than RIP, coherence, RSC, RE), faster rates of order τ²/n are possible.
Summary
Notion of simple models
Atomic norm unifies treatment of many models
Recovers tight published bounds
Bounds depend on Gaussian width
Faster rates with a local cone condition
Line Spectral Estimation
Time samples of a line spectrum:

    x_k = Σ_{j=1}^s c_j e^{i2π f_j k},   k = 0, …, n−1

Extrapolate the remaining moments! The samples are moments of the discrete measure μ = Σ_j c_j δ_{f_j}.
Super resolution
Time frequency dual
Extrapolate high frequency information from
low frequency samples
Just exchange time and frequency: same as line
spectral estimation
Classical Line Spectrum Estimation
(a crash course)

Nonlinear parameter estimation: x_k = Σ_{j=1}^s c_j e^{2πi f_j k}.

Define, for any vector x, the Hermitian Toeplitz matrix

    T_n(x) = [ x₁      x₂        …  x_n
               x₂*     x₁        …  x_{n−1}
               ⋮                 ⋱  ⋮
               x_n*    x_{n−1}*  …  x₁ ]

FACT (low rank and Toeplitz):

    T_n(x⋆) = Σ_{j=1}^s c_j v(f_j) v(f_j)*,   v(f) = (1, e^{i2πf}, …, e^{i2π(n−1)f})ᵀ

Classical techniques:
- PRONY: estimate s, root finding
- CADZOW: alternating projections
- MUSIC: low rank, plot the pseudospectrum
Sensitive to model order; no rigorous theory; only optimal asymptotically.
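The low-rank Toeplitz FACT is easy to verify numerically; a sketch with illustrative frequencies and amplitudes of my choosing:

```python
import numpy as np

n, s = 32, 3
freqs = np.array([0.1, 0.35, 0.72])
amps = np.array([1.0, 2.0, 0.5])

k = np.arange(n)
x = sum(c * np.exp(2j * np.pi * f * k) for c, f in zip(amps, freqs))

# Hermitian Toeplitz T_n(x): first row x_0, ..., x_{n-1} (0-indexed here),
# first column its conjugate, matching the slide's definition.
T = np.array([[x[j - i] if j >= i else np.conj(x[i - j]) for j in range(n)]
              for i in range(n)])

rank = np.linalg.matrix_rank(T, tol=1e-8)
print(rank)  # equals s, the number of frequencies
```

Since the frequencies are distinct and the amplitudes nonzero, the Vandermonde factors have full column rank and T_n(x) has rank exactly s.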
Convex Perspective
[plot: noisy time samples]

The line spectrum signal is a sparse combination of atoms:

    x = Σ_{j=1}^s c_j a(f_j)    (amplitudes c_j, frequencies f_j)

    x_k = Σ_{j=1}^s c_j e^{i2π f_j k},   a(f) = (1, …, e^{i2π f k}, …, e^{i2π f (n−1)})ᵀ

Atoms for the line spectrum:
    Positive amplitudes:  A₊ = { a(f) : f ∈ [0, 1] }
    Complex:              A  = { a(f) e^{iφ} : f ∈ [0, 1], φ ∈ [0, 2π] }
Convergence rates
For a signal with n samples of the form

    y_k = Σ_{j=1}^s c_j e^{2πi f_j k} + w_k

choose the tradeoff as τ ≈ σ√(n log n): with high probability, for n > 4, E‖w‖*_A is of order σ√(n log n), by Dudley's inequality and Bernstein's polynomial theorem. Using the global AST guarantee, we get

    E (1/n)‖x̂ − x⋆‖₂² ≤ C σ √(log(n)/n) Σ_{j=1}^s |c_j|
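The quantity E‖w‖*_A = E sup_f |⟨w, a(f)⟩| behind the choice of τ can be estimated by Monte Carlo, approximating the supremum over frequencies with a zero-padded FFT (a sketch; the grid size, trial count, and σ = 1 are my choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, grid, trials = 64, 4096, 50

# ||w||_A^* = sup_f |<w, a(f)>| with a(f)_k = exp(i 2 pi f k); approximate
# the sup over f in [0,1) by a dense frequency grid via a zero-padded FFT.
sups = []
for _ in range(trials):
    w = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)
    sups.append(np.abs(np.fft.fft(w, grid)).max())
est = np.mean(sups)

print(est, np.sqrt(n * np.log(n)))  # same order of magnitude
```

This is the "estimate the Gaussian width" recipe: no cross validation, just simulate noise and average the dual norm.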
Fast Rates
If every two frequencies are far enough apart (minimum separation condition)

    min_{p ≠ q} d(f_p, f_q) ≥ 4/n

and the tradeoff parameter is chosen correctly, τ = η σ√(n log n) with η > 1 large enough, then for s frequencies and n samples the MSE is

    E (1/n)‖x̂ − x⋆‖₂² ≤ C σ² s log(n)/n

This is not a global condition on the dictionary; it is local to the signal. The proof uses many properties of trigonometric polynomials and Fejér kernels derived by Candès and Fernandez-Granda.
Can't do much better
Our rate

    E (1/n)‖x̂ − x⋆‖₂² ≤ C σ² s log(n)/n

is only a logarithmic factor from the "oracle rate" C σ² s/n, and nearly matches the minimax bound C σ² s log(n/4s)/n on the best prediction rate for signals with 4/n minimum separation.
Minimum Separation and Exact Recovery
Result of Candès and Fernandez-Granda: with 4/n separation one can construct an explicit dual certificate of exact recovery; the convex hull of far enough vertices forms an exposed face. (Prony does not have this limitation. Then why bother? Robustness guarantees.)

Minimum separation is necessary, except for the positive case: there the convex hull of every collection of n/2 vertices is an exposed face, an infinite dimensional version of the cyclic polytope, which is neighbourly. No separation condition is needed.
Localizing Frequencies
[plots: dual polynomial value vs frequency, with a zoom near the support]

Dual Polynomial
Given the AST solution y = x̂ + τq̂, form the dual polynomial

    Q̂(f) = Σ_{j=0}^{n−1} q̂_j e^{−i2π j f}

It certifies the support:
    ⟨q̂, a(f)⟩ = 1 for a(f) in the support,
    ⟨q̂, a(f)⟩ < 1 otherwise.
Localization Guarantees
[plot: dual polynomial magnitude with true and estimated frequencies; near regions N_j of width 0.16/n around each true frequency f_j, the rest is the far region F]

Three localization metrics are controlled, with explicit constants C₁, C₂, C₃:
- Spurious amplitudes: the total amplitude Σ_{l: f̂_l ∈ F} |ĉ_l| placed in the far region is small.
- Weighted frequency deviation: Σ_{l: f̂_l ∈ N_j} |ĉ_l| d(f_j, f̂_l)² is small for each true frequency f_j.
- Near region approximation: |c_j − Σ_{l: f̂_l ∈ N_j} ĉ_l| is small.
How to actually solve AST

    minimize_x  ½‖y − x‖₂² + τ‖x‖_A

Semidefinite Characterizations of conv(A₊) and conv(A):

Positive amplitudes, A₊ = { a(f) : f ∈ [0,1] } (Carathéodory-Toeplitz):

    ‖x‖_{A₊} = x₀ if T_n(x) ⪰ 0, and +∞ otherwise.

Complex, A = { a(f) e^{iφ} : f ∈ [0,1], φ ∈ [0, 2π] } (Bounded Real Lemma):

    ‖x‖_A = inf { (1/2n) tr(T_n(u)) + t/2 :  [ T_n(u)  x ; x*  t ] ⪰ 0 }
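The positive-amplitude (Carathéodory-Toeplitz) characterization can be checked directly: for a positive combination of atoms, T_n(x) is PSD and ‖x‖_{A₊} = x₀ is the sum of the amplitudes. A sketch with frequencies and amplitudes of my choosing:

```python
import numpy as np

n = 16
freqs, amps = [0.12, 0.4, 0.83], [1.0, 0.5, 2.0]   # positive amplitudes
k = np.arange(n)
x = sum(c * np.exp(2j * np.pi * f * k) for c, f in zip(amps, freqs))

# Hermitian Toeplitz T_n(x) built from the moment sequence x_0, ..., x_{n-1}.
T = np.array([[x[j - i] if j >= i else np.conj(x[i - j]) for j in range(n)]
              for i in range(n)])

eigs = np.linalg.eigvalsh(T)
print(eigs.min())   # >= 0 up to rounding: T_n(x) is PSD
print(x[0].real)    # x_0 = sum of amplitudes = atomic norm in A+
```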
AST SDP Formulation
Solve with the Alternating Direction Method of Multipliers (ADMM):

    minimize_{t,u,x,Z}  ½‖x − y‖₂² + (τ/2)(t + u₁)
    subject to          Z = [ T(u)  x ; x*  t ],  Z ⪰ 0.

Alternating Direction Iterations

    (t^{l+1}, u^{l+1}, x^{l+1}) ← argmin_{t,u,x} L_ρ(t, u, x, Z^l, Λ^l)        (quadratic objective)
    Z^{l+1} ← argmin_{Z ⪰ 0} L_ρ(t^{l+1}, u^{l+1}, x^{l+1}, Z, Λ^l)            (projection onto the psd cone)
    Λ^{l+1} ← Λ^l + ρ ( Z^{l+1} − [ T(u^{l+1})  x^{l+1} ; x^{l+1*}  t^{l+1} ] )  (linear dual update)
Form the Augmented Lagrangian (ρ is the momentum parameter):

    L_ρ(t, u, x, Z, Λ) = ½‖x − y‖₂² + (τ/2)(t + u₁)
                         + ⟨Λ, Z − [ T(u)  x ; x*  t ]⟩           (Lagrangian term)
                         + (ρ/2) ‖ Z − [ T(u)  x ; x*  t ] ‖_F²   (augmented term)
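A minimal real-valued sketch of these iterations (symmetric Toeplitz instead of Hermitian, and the values of ρ, τ, the test signal, and the iteration count are my choices, not from the slides):

```python
import numpy as np

def toeplitz_sym(u):
    """Symmetric Toeplitz matrix with first column u."""
    n = len(u)
    return np.array([[u[abs(i - j)] for j in range(n)] for i in range(n)])

def admm_ast(y, tau, rho=1.0, iters=300):
    """ADMM for min 0.5||x-y||^2 + (tau/2)(t + u_1)
       s.t. Z = [[T(u), x], [x^T, t]], Z psd (real symmetric sketch)."""
    n = len(y)
    Z = np.zeros((n + 1, n + 1))
    Lam = np.zeros_like(Z)
    for _ in range(iters):
        S = Z + Lam / rho   # complete the square in the augmented Lagrangian
        # (t, u, x)-update: separable quadratics with closed forms.
        x = (y + 2 * rho * S[:n, n]) / (1 + 2 * rho)
        t = S[n, n] - tau / (2 * rho)
        u = np.array([np.trace(S[:n, :n], offset=d) / (n - d) for d in range(n)])
        u[0] -= tau / (2 * rho * n)
        M = np.block([[toeplitz_sym(u), x[:, None]],
                      [x[None, :], np.array([[t]])]])
        # Z-update: project M - Lam/rho onto the psd cone via eigendecomposition.
        w, V = np.linalg.eigh(M - Lam / rho)
        Z = (V * np.maximum(w, 0.0)) @ V.T
        # Linear dual update.
        Lam = Lam + rho * (Z - M)
    return x, Z, M

rng = np.random.default_rng(0)
y = np.cos(2 * np.pi * 0.1 * np.arange(8)) + 0.1 * rng.standard_normal(8)
x_hat, Z, M = admm_ast(y, tau=0.5)
residual = np.linalg.norm(Z - M)   # primal feasibility gap, shrinks toward 0
```

The u-update averages the diagonals of S (each offset d appears in n − d positions), with the τ/(2ρn) correction on the main diagonal coming from the (τ/2)u₁ term.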
Alternative Discretization Method
Can use a grid  A_N = { a(f, φ) : f = j/N, j = 1, …, N, φ ∈ [0, 2π) }.

We show

    (1 − 2πn/N) ‖x‖_{A_N} ≤ ‖x‖_A ≤ ‖x‖_{A_N}

via Bernstein's polynomial theorem or Fritz John; this works for general epsilon nets.

AST is then approximated by a Lasso: y = Fc⋆ + w, where the columns of F are the n atomic moment vectors at the N grid frequencies and c⋆ holds the amplitudes.
- Fast shrinkage algorithms
- Fourier projections are cheap
- Coherence is irrelevant
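Because the gridded dictionary and its adjoint are FFTs, proximal gradient (ISTA) iterations are cheap. A sketch with my own choices of grid size, tradeoff, and test signal:

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, mu = 64, 512, 1.0     # samples, grid size, tradeoff (illustrative)
k = np.arange(n)
y = (2 * np.exp(2j * np.pi * 0.1 * k) + np.exp(2j * np.pi * 0.33 * k)
     + 0.05 * (rng.standard_normal(n) + 1j * rng.standard_normal(n)))

# Gridded dictionary F[k, j] = exp(i 2 pi (j/N) k). F and F* are FFTs,
# and F F* = N I, so the gradient's Lipschitz constant is N.
def F(c):  return (N * np.fft.ifft(c))[:n]
def Fh(r): return np.fft.fft(np.concatenate([r, np.zeros(N - n)]))

def soft(c, thr):            # complex soft threshold (prox of the l1 norm)
    mag = np.maximum(np.abs(c), 1e-12)
    return np.where(mag > thr, (1 - thr / mag) * c, 0.0)

c = np.zeros(N, dtype=complex)
obj = []
for _ in range(200):         # ISTA with step size 1/N
    r = F(c) - y
    c = soft(c - Fh(r) / N, mu / N)
    obj.append(0.5 * np.linalg.norm(F(c) - y) ** 2 + mu * np.abs(c).sum())

peak = int(np.argmax(np.abs(c)))   # grid index of the dominant atom
```

With these choices the dominant recovered grid frequency lands next to the true f = 0.1 (grid index ≈ 51), up to the basis mismatch of the grid.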
Experimental setup
n=64 or 128 or 256 samples
no. of frequencies: n/16, n/8, n/4
SNR: -5 dB, 0 dB, ... , 20 dB
(Root) MUSIC, Matrix pencil, Cadzow: All fed true model
order.
Empirically estimated (not fed) noise variance for AST
Compared mean-squared-error and localization error
metrics, averaged over 20 trials.
Mean Squared Error: MSE vs SNR
[plot: MSE (dB) vs SNR (dB) for AST, Cadzow, MUSIC, and Lasso; 128 samples, 8 frequencies]
Mean Squared Error: Performance Profiles
[plot: P_s(β) vs β for AST, Lasso 2^15, MUSIC, Cadzow, and Matrix Pencil]

P(β) = fraction of experiments with MSE less than β × the minimum MSE; average MSE over 20 trials; n = 64, 128, 256; s = n/16, n/8, n/4; SNR = −5, 0, …, 20 dB.
Mean Squared Error: Effect of Gridding
[plot: performance profiles for Lasso with grid sizes 2^10 through 2^15]
Frequency Localization
[plots: performance profiles for AST, MUSIC, and Cadzow, and the three localization metrics m₁ (spurious amplitudes), m₂ (weighted frequency deviation), m₃ (near region approximation) vs SNR (dB) at k = 32]
Simple LTI Systems
Find a simple linear system from time series data u(t) → y(t) (e.g. vibration measurements):

    x[t + 1] = A x[t] + B u[t]
    y[t]     = C x[t] + D u[t]      (state space)

Existing approaches: nonlinear Kalman filter based estimation; parametric methods based on Hankel low rank formulations.
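The state-space recursion above is a one-line simulation loop; a minimal sketch with a toy 2-state system (the matrices are illustrative, not from the slides):

```python
import numpy as np

# x[t+1] = A x[t] + B u[t],  y[t] = C x[t] + D u[t]
A = np.array([[0.9, 0.2], [0.0, 0.5]])
B = np.array([[1.0], [0.5]])
C = np.array([[1.0, -1.0]])
D = np.array([[0.0]])

def simulate(u):
    x = np.zeros((2, 1))
    ys = []
    for ut in u:
        ys.append((C @ x + D * ut).item())
        x = A @ x + B * ut
    return np.array(ys)

y = simulate(np.r_[1.0, np.zeros(9)])   # impulse response
print(y[:3])
```

By hand: y[0] = D = 0, then x becomes B = (1, 0.5)ᵀ so y[1] = C B = 0.5, and y[2] = C A B = 0.75.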
Subspace Methods
With a little computation, the block Hankel matrix of outputs factors as

    Y = O X + T U

where

    Y = [ y₁ y₂ ⋯ y_k ; y₂ y₃ ⋯ y_{k+1} ; y₃ y₄ ⋯ y_{k+2} ; ⋮ ; y_ℓ y_{ℓ+1} ⋯ y_{ℓ+k−1} ]   (Output)
    O = [ C ; CA ; CA² ; ⋮ ; CA^{ℓ−1} ]                                                      (Observability)
    X = [ x₁ x₂ … x_k ]                                                                      (State)
    T = [ D 0 … 0 ; CB D 0 … 0 ; CAB CB D … 0 ; ⋮ ; CA^{ℓ−2}B … D ]                          (System Matrix)
    U = [ u₁ u₂ ⋯ u_k ; u₂ u₃ ⋯ u_{k+1} ; u₃ u₄ ⋯ u_{k+2} ; ⋮ ; u_ℓ u_{ℓ+1} ⋯ u_{ℓ+k−1} ]   (Input)

Projecting out the inputs, Y U⊥ = O X U⊥, and the product O X is low rank:
- SVD: subspace ID (n4sid)
- Nuclear norm (Vandenberghe et al.)
SISO
The Hankel operator G maps past inputs to future outputs:

    G : ℓ₂(−∞, 0) → ℓ₂[0, ∞)

Fact: rank of the Hankel operator = McMillan degree.
Relax and use the Hankel nuclear norm? Not clear how to compute...
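The rank = McMillan degree fact is easy to see numerically: form the Hankel matrix of the impulse response of a finite-dimensional system and check its rank (poles and residues below are my choices):

```python
import numpy as np

# Impulse response of a 2-pole SISO system: h_k = sum_p r_p * pole_p^k.
poles = [0.7, -0.4]
res = [1.0, 2.0]
k = np.arange(1, 41)
h = sum(r * p ** k for r, p in zip(res, poles))

m = 20
H = np.array([[h[i + j] for j in range(m)] for i in range(m)])  # Hankel matrix

print(np.linalg.matrix_rank(H, tol=1e-10))  # 2 == McMillan degree
```

Each pole contributes one rank-1 term p^{i+j} = p^i p^j to the Hankel matrix, so the rank counts the poles.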
Convex Perspective: SISO Systems
Example:

    G(z) = 2(1 − 0.7²)/(z − 0.7) + 5(1 − 0.5²)/(z − (0.3 + 0.4i)) + 5(1 − 0.5²)/(z − (0.3 − 0.4i))

[plots: poles in the unit disk and samples of the frequency response]

Atoms are single pole systems over a disk D_ρ of radius ρ < 1:

    A = { (1 − |w|²)/(z − w) : w ∈ D_ρ }

Observe linear measurements of the transfer function, y = L(G(z)) + w, and regularize with atomic norms:

    minimize_G  ‖y − L(G)‖₂² + μ‖G‖_A

Theorem:  (π/8) ‖G‖_A ≤ ‖Γ_G‖₁ ≤ ‖G‖_A,  where ‖Γ_G‖₁ is the Hankel nuclear norm.
Atomic Norm Regularization
How to solve

    minimize_G  ‖y − L(G)‖₂² + μ‖G‖_A ?

Equivalent to

    minimize_{c,x}  ‖y − x‖₂² + μ Σ_{w ∈ D_ρ} |c_w| / (1 − |w|²)
    subject to      x = Σ_{w ∈ D_ρ} c_w L( 1/(z − w) )

Discretize and set Φ_{kℓ} := L_k( (1 − |w_ℓ|²)/(z − w_ℓ) ) to get a Lasso:

    minimize_c  ½‖y − Φc‖₂² + μ‖c‖₁
DAST Analysis
Measure uniformly spaced samples of the frequency response:

    Φ_{kℓ} = (1 − |w_ℓ|²) / (e^{2πik/n} − w_ℓ)

Solve, over a net on the disk,

    x̂ = argmin_x  ½‖y − x‖₂² + μ‖x‖_{A_ε}

and set Ĝ(z) = Σ_ℓ ĉ_ℓ (1 − |w_ℓ|²)/(z − w_ℓ), where the coefficients ĉ_ℓ correspond to the support of x̂. Then we have

    ‖G − Ĝ‖²_{H₂} ≤ C₀ ‖Γ_G‖₁ / ((1 − ρ) √n)

with a constant C₀, the Hankel nuclear norm ‖Γ_G‖₁, the stability radius ρ, and n measurements.

Example: 6 poles, ρ = 0.95, σ = 1e-1, n = 30 gives H₂ error ≈ 2e-2.
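Building the DAST dictionary Φ over a net on the disk is a one-liner with broadcasting; a sketch (the net resolution is my choice, not from the slides):

```python
import numpy as np

n, rho = 30, 0.95
z = np.exp(2j * np.pi * np.arange(n) / n)   # uniform frequency samples

# Net on the disk D_rho: a few radii crossed with a few angles.
radii = np.linspace(0.1, rho, 8)
angles = 2 * np.pi * np.arange(16) / 16
net = np.array([r * np.exp(1j * a) for r in radii for a in angles])

# Phi[k, l] = (1 - |w_l|^2) / (e^{2 pi i k / n} - w_l)
Phi = (1 - np.abs(net) ** 2)[None, :] / (z[:, None] - net[None, :])
print(Phi.shape)   # (n, net size)
```

Feeding Φ and the measured frequency response into any Lasso solver gives the DAST estimate.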
Summary
Optimal bounds for denoising using convex
methods.
Better empirical performance than classical
approaches
Local conditions on signals instead of global
dictionary conditions
Discretization works!
Acknowledgements: Results derived in
collaboration with Tang, Shah and Prof. Recht.
Future Work
• Better convergence results for discretization.
• Why does weighted mean heuristic work so well for
discretization?
• Better algorithms for System Identification
• Greedy techniques
• More atomic sets. General localization guarantees.
Referenced Publications
B, Recht. "Atomic Norm Denoising for Line Spectral Estimation", 49th Allerton Conference on Controls and Systems, 2011.
B, Tang, Recht. "Atomic Norm Denoising for Line Spectral Estimation", submitted to IEEE Transactions on Signal Processing.
Tang, B, Recht. "Near Minimax Line Spectral Estimation", to be submitted.
Shah, B, Tang, Recht. "Linear System Identification using Atomic Norm Minimization", Control and Decision Conference, 2012.

Other Publications
Tang, B, Shah, Recht. "Compressed Sensing off the Grid", submitted to IEEE Transactions on Information Theory.
Dasarathy, Shah, B, Nowak. "Covariance Sketching", 50th Allerton Conference on Controls and Systems, 2012.
Gubner, B, Hao. "Multipath-Cluster Channel Models", IEEE Conference on Ultrawideband Systems, 2012.
Wittenberg, Alimadhi, B, Lau. "ei.RxC: Hierarchical Multinomial-Dirichlet Ecological Inference Model", in Zelig: Everyone's Statistical Software.
Tang, B, Recht. "Sparse Recovery over Continuous Dictionaries: Just Discretize", submitted to Asilomar Conference, 2013.
Cone Condition
Define the inflated descent cone, for γ ∈ (0, 1),

    C_γ(x⋆) = { z : ‖x⋆ + z‖_A ≤ ‖x⋆‖_A + γ‖z‖_A }

The cone condition requires

    inf { ‖z‖₂ / ‖z‖_A : z ∈ C_γ(x⋆), z ≠ 0 } > 0

i.e. no direction in C_γ(x⋆) has vanishing ℓ₂ norm relative to its atomic norm.
Minimum separation is necessary
Using two sinusoids:

    ‖a(f₁) − a(f₂)‖_A ≤ n^{1/2} ‖a(f₁) − a(f₂)‖₂ ≤ 2πn² |f₁ − f₂|

since ‖a(f₁) − a(f₂)‖₂ ≈ 2πn^{3/2} |f₁ − f₂|. So when the separation is lower than 1/(4n²), the atomic norm can't recover it. One can empirically show failure with less than 1/n separation and an adversarial signal.
More Related Content

What's hot

Speech signal time frequency representation
Speech signal time frequency representationSpeech signal time frequency representation
Speech signal time frequency representationNikolay Karpov
 
Control of Discrete-Time Piecewise Affine Probabilistic Systems using Reachab...
Control of Discrete-Time Piecewise Affine Probabilistic Systems using Reachab...Control of Discrete-Time Piecewise Affine Probabilistic Systems using Reachab...
Control of Discrete-Time Piecewise Affine Probabilistic Systems using Reachab...
Leo Asselborn
 
2014.10.dartmouth
2014.10.dartmouth2014.10.dartmouth
2014.10.dartmouth
Qiqi Wang
 
Dynamic Compression Transmission for Energy Harvesting Networks
Dynamic Compression Transmission for Energy Harvesting NetworksDynamic Compression Transmission for Energy Harvesting Networks
Dynamic Compression Transmission for Energy Harvesting Networks
Cristiano Tapparello
 
Fulltext
FulltextFulltext
Fulltext
PrabirDas55
 
Presentation M2 internship rare-earth nickelates
Presentation M2 internship rare-earth nickelatesPresentation M2 internship rare-earth nickelates
Presentation M2 internship rare-earth nickelatesYiteng Dang
 
Vibration energy harvesting under uncertainty
Vibration energy harvesting under uncertaintyVibration energy harvesting under uncertainty
Vibration energy harvesting under uncertaintyUniversity of Glasgow
 
Nonparametric Density Estimation
Nonparametric Density EstimationNonparametric Density Estimation
Nonparametric Density Estimationjachno
 
Minimax optimal alternating minimization \\ for kernel nonparametric tensor l...
Minimax optimal alternating minimization \\ for kernel nonparametric tensor l...Minimax optimal alternating minimization \\ for kernel nonparametric tensor l...
Minimax optimal alternating minimization \\ for kernel nonparametric tensor l...
Taiji Suzuki
 
Time-Frequency Representation of Microseismic Signals using the SST
Time-Frequency Representation of Microseismic Signals using the SSTTime-Frequency Representation of Microseismic Signals using the SST
Time-Frequency Representation of Microseismic Signals using the SST
UT Technology
 
Pdpta11 G-Ensemble for Meteorological Prediction Enhancement
Pdpta11 G-Ensemble for Meteorological Prediction EnhancementPdpta11 G-Ensemble for Meteorological Prediction Enhancement
Pdpta11 G-Ensemble for Meteorological Prediction Enhancement
Hisham Ihshaish
 
Introduction to Quantum Monte Carlo
Introduction to Quantum Monte CarloIntroduction to Quantum Monte Carlo
Introduction to Quantum Monte Carlo
Claudio Attaccalite
 
DSP_FOEHU - MATLAB 01 - Discrete Time Signals and Systems
DSP_FOEHU - MATLAB 01 - Discrete Time Signals and SystemsDSP_FOEHU - MATLAB 01 - Discrete Time Signals and Systems
DSP_FOEHU - MATLAB 01 - Discrete Time Signals and Systems
Amr E. Mohamed
 
Ilhabela
IlhabelaIlhabela
Ilhabela
Biagio Lucini
 
SOCG: Linear-Size Approximations to the Vietoris-Rips Filtration
SOCG: Linear-Size Approximations to the Vietoris-Rips FiltrationSOCG: Linear-Size Approximations to the Vietoris-Rips Filtration
SOCG: Linear-Size Approximations to the Vietoris-Rips Filtration
Don Sheehy
 
Chester Nov08 Terry Lynch
Chester Nov08 Terry LynchChester Nov08 Terry Lynch
Chester Nov08 Terry LynchTerry Lynch
 
Transformations of continuous-variable entangled states of light
Transformations of continuous-variable entangled states of lightTransformations of continuous-variable entangled states of light
Transformations of continuous-variable entangled states of light
Ondrej Cernotik
 
Wave diffraction
Wave diffractionWave diffraction
Wave diffraction
eli priyatna laidan
 

What's hot (20)

Speech signal time frequency representation
Speech signal time frequency representationSpeech signal time frequency representation
Speech signal time frequency representation
 
Control of Discrete-Time Piecewise Affine Probabilistic Systems using Reachab...
Control of Discrete-Time Piecewise Affine Probabilistic Systems using Reachab...Control of Discrete-Time Piecewise Affine Probabilistic Systems using Reachab...
Control of Discrete-Time Piecewise Affine Probabilistic Systems using Reachab...
 
8 lti psd
8 lti psd8 lti psd
8 lti psd
 
2014.10.dartmouth
2014.10.dartmouth2014.10.dartmouth
2014.10.dartmouth
 
Dynamic Compression Transmission for Energy Harvesting Networks
Dynamic Compression Transmission for Energy Harvesting NetworksDynamic Compression Transmission for Energy Harvesting Networks
Dynamic Compression Transmission for Energy Harvesting Networks
 
Fulltext
FulltextFulltext
Fulltext
 
Presentation M2 internship rare-earth nickelates
Presentation M2 internship rare-earth nickelatesPresentation M2 internship rare-earth nickelates
Presentation M2 internship rare-earth nickelates
 
Vibration energy harvesting under uncertainty
Vibration energy harvesting under uncertaintyVibration energy harvesting under uncertainty
Vibration energy harvesting under uncertainty
 
Nonparametric Density Estimation
Nonparametric Density EstimationNonparametric Density Estimation
Nonparametric Density Estimation
 
20130722
2013072220130722
20130722
 
Minimax optimal alternating minimization \\ for kernel nonparametric tensor l...
Minimax optimal alternating minimization \\ for kernel nonparametric tensor l...Minimax optimal alternating minimization \\ for kernel nonparametric tensor l...
Minimax optimal alternating minimization \\ for kernel nonparametric tensor l...
 
Time-Frequency Representation of Microseismic Signals using the SST
Time-Frequency Representation of Microseismic Signals using the SSTTime-Frequency Representation of Microseismic Signals using the SST
Time-Frequency Representation of Microseismic Signals using the SST
 
Pdpta11 G-Ensemble for Meteorological Prediction Enhancement
Pdpta11 G-Ensemble for Meteorological Prediction EnhancementPdpta11 G-Ensemble for Meteorological Prediction Enhancement
Pdpta11 G-Ensemble for Meteorological Prediction Enhancement
 
Introduction to Quantum Monte Carlo
Introduction to Quantum Monte CarloIntroduction to Quantum Monte Carlo
Introduction to Quantum Monte Carlo
 
DSP_FOEHU - MATLAB 01 - Discrete Time Signals and Systems
DSP_FOEHU - MATLAB 01 - Discrete Time Signals and SystemsDSP_FOEHU - MATLAB 01 - Discrete Time Signals and Systems
DSP_FOEHU - MATLAB 01 - Discrete Time Signals and Systems
 
Ilhabela
IlhabelaIlhabela
Ilhabela
 
SOCG: Linear-Size Approximations to the Vietoris-Rips Filtration
SOCG: Linear-Size Approximations to the Vietoris-Rips FiltrationSOCG: Linear-Size Approximations to the Vietoris-Rips Filtration
SOCG: Linear-Size Approximations to the Vietoris-Rips Filtration
 
Chester Nov08 Terry Lynch
Chester Nov08 Terry LynchChester Nov08 Terry Lynch
Chester Nov08 Terry Lynch
 
Transformations of continuous-variable entangled states of light
Transformations of continuous-variable entangled states of lightTransformations of continuous-variable entangled states of light
Transformations of continuous-variable entangled states of light
 
Wave diffraction
Wave diffractionWave diffraction
Wave diffraction
 

Similar to Decomposition and Denoising for moment sequences using convex optimization

MVPA with SpaceNet: sparse structured priors
MVPA with SpaceNet: sparse structured priorsMVPA with SpaceNet: sparse structured priors
MVPA with SpaceNet: sparse structured priors
Elvis DOHMATOB
 
EC8553 Discrete time signal processing
EC8553 Discrete time signal processing EC8553 Discrete time signal processing
EC8553 Discrete time signal processing
ssuser2797e4
 
Lecture 3 sapienza 2017
Lecture 3 sapienza 2017Lecture 3 sapienza 2017
Lecture 3 sapienza 2017
Franco Bontempi Org Didattica
 
OPTIMIZED RATE ALLOCATION OF HYPERSPECTRAL IMAGES IN COMPRESSED DOMAIN USING ...
OPTIMIZED RATE ALLOCATION OF HYPERSPECTRAL IMAGES IN COMPRESSED DOMAIN USING ...OPTIMIZED RATE ALLOCATION OF HYPERSPECTRAL IMAGES IN COMPRESSED DOMAIN USING ...
OPTIMIZED RATE ALLOCATION OF HYPERSPECTRAL IMAGES IN COMPRESSED DOMAIN USING ...
Pioneer Natural Resources
 
Spectral-, source-, connectivity- and network analysis of EEG and MEG data
Spectral-, source-, connectivity- and network analysis of EEG and MEG dataSpectral-, source-, connectivity- and network analysis of EEG and MEG data
Spectral-, source-, connectivity- and network analysis of EEG and MEG data
Robert Oostenveld
 
Jere Koskela slides
Jere Koskela slidesJere Koskela slides
Jere Koskela slides
Christian Robert
 
KAUST_talk_short.pdf
KAUST_talk_short.pdfKAUST_talk_short.pdf
KAUST_talk_short.pdf
Chiheb Ben Hammouda
 
CS-ChapterI.pdf
CS-ChapterI.pdfCS-ChapterI.pdf
CS-ChapterI.pdf
ssuser5e58621
 
Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3
Fabian Pedregosa
 
MLP輪読スパース8章 トレースノルム正則化
MLP輪読スパース8章 トレースノルム正則化MLP輪読スパース8章 トレースノルム正則化
MLP輪読スパース8章 トレースノルム正則化
Akira Tanimoto
 
kactl.pdf
kactl.pdfkactl.pdf
kactl.pdf
Rayhan331
 
Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Cat...
Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Cat...Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Cat...
Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Cat...
Yandex
 
Input analysis
Input analysisInput analysis
Input analysis
Bhavik A Shah
 
Bayesian modelling and computation for Raman spectroscopy
Bayesian modelling and computation for Raman spectroscopyBayesian modelling and computation for Raman spectroscopy
Bayesian modelling and computation for Raman spectroscopy
Matt Moores
 
129966863283913778[1]
129966863283913778[1]129966863283913778[1]
129966863283913778[1]威華 王
 
Dereverberation in the stft and log mel frequency feature domains
Dereverberation in the stft and log mel frequency feature domainsDereverberation in the stft and log mel frequency feature domains
Dereverberation in the stft and log mel frequency feature domains
Takuya Yoshioka
 
Bayesian inference on mixtures
Bayesian inference on mixturesBayesian inference on mixtures
Bayesian inference on mixtures
Christian Robert
 
Anomaly Detection in Sequences of Short Text Using Iterative Language Models
Anomaly Detection in Sequences of Short Text Using Iterative Language ModelsAnomaly Detection in Sequences of Short Text Using Iterative Language Models
Anomaly Detection in Sequences of Short Text Using Iterative Language Models
Cynthia Freeman
 
Estimating structured vector autoregressive models
Estimating structured vector autoregressive modelsEstimating structured vector autoregressive models
Estimating structured vector autoregressive models
Akira Tanimoto
 

Similar to Decomposition and Denoising for moment sequences using convex optimization (20)

MVPA with SpaceNet: sparse structured priors
MVPA with SpaceNet: sparse structured priorsMVPA with SpaceNet: sparse structured priors
MVPA with SpaceNet: sparse structured priors
 
EC8553 Discrete time signal processing
EC8553 Discrete time signal processing EC8553 Discrete time signal processing
EC8553 Discrete time signal processing
 
Lecture 3 sapienza 2017
Lecture 3 sapienza 2017Lecture 3 sapienza 2017
Lecture 3 sapienza 2017
 
OPTIMIZED RATE ALLOCATION OF HYPERSPECTRAL IMAGES IN COMPRESSED DOMAIN USING ...
OPTIMIZED RATE ALLOCATION OF HYPERSPECTRAL IMAGES IN COMPRESSED DOMAIN USING ...OPTIMIZED RATE ALLOCATION OF HYPERSPECTRAL IMAGES IN COMPRESSED DOMAIN USING ...
OPTIMIZED RATE ALLOCATION OF HYPERSPECTRAL IMAGES IN COMPRESSED DOMAIN USING ...
 
Spectral-, source-, connectivity- and network analysis of EEG and MEG data
Spectral-, source-, connectivity- and network analysis of EEG and MEG dataSpectral-, source-, connectivity- and network analysis of EEG and MEG data
Spectral-, source-, connectivity- and network analysis of EEG and MEG data
 
Jere Koskela slides
Jere Koskela slidesJere Koskela slides
Jere Koskela slides
 
KAUST_talk_short.pdf
KAUST_talk_short.pdfKAUST_talk_short.pdf
KAUST_talk_short.pdf
 
CS-ChapterI.pdf
CS-ChapterI.pdfCS-ChapterI.pdf
CS-ChapterI.pdf
 
Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3
 
MLP輪読スパース8章 トレースノルム正則化
MLP輪読スパース8章 トレースノルム正則化MLP輪読スパース8章 トレースノルム正則化
MLP輪読スパース8章 トレースノルム正則化
 
kactl.pdf
kactl.pdfkactl.pdf
kactl.pdf
 
Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Cat...
Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Cat...Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Cat...
Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Cat...
 
Input analysis
Input analysisInput analysis
Input analysis
 
Bayesian modelling and computation for Raman spectroscopy
Bayesian modelling and computation for Raman spectroscopyBayesian modelling and computation for Raman spectroscopy
Bayesian modelling and computation for Raman spectroscopy
 
129966863283913778[1]
129966863283913778[1]129966863283913778[1]
129966863283913778[1]
 
Lecture1
Lecture1Lecture1
Lecture1
 
Dereverberation in the stft and log mel frequency feature domains
Dereverberation in the stft and log mel frequency feature domainsDereverberation in the stft and log mel frequency feature domains
Dereverberation in the stft and log mel frequency feature domains
 
Bayesian inference on mixtures
Bayesian inference on mixturesBayesian inference on mixtures
Bayesian inference on mixtures
 
Anomaly Detection in Sequences of Short Text Using Iterative Language Models
Anomaly Detection in Sequences of Short Text Using Iterative Language ModelsAnomaly Detection in Sequences of Short Text Using Iterative Language Models
Anomaly Detection in Sequences of Short Text Using Iterative Language Models
 
Estimating structured vector autoregressive models
Estimating structured vector autoregressive modelsEstimating structured vector autoregressive models
Estimating structured vector autoregressive models
 

Recently uploaded

Hemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxHemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptx
muralinath2
 
role of pramana in research.pptx in science
role of pramana in research.pptx in sciencerole of pramana in research.pptx in science
role of pramana in research.pptx in science
sonaliswain16
 
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptxBody fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
muralinath2
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
moosaasad1975
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
kejapriya1
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
University of Rennes, INSA Rennes, Inria/IRISA, CNRS
 
in vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptxin vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptx
yusufzako14
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
muralinath2
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Sérgio Sacani
 
GBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram StainingGBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram Staining
Areesha Ahmad
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Erdal Coalmaker
 
Comparative structure of adrenal gland in vertebrates
Decomposition and Denoising for moment sequences using convex optimization

  • 1. Decomposition and Denoising for moment sequences using convex optimization Badri Narayan Bhaskar
  • 2. High Dimensional Inference: n ≪ p statistics. Examples: microarray data, hyperspectral imaging, predicting ratings.
  • 3. Sparse Vector Estimation. Gene expressions: the response (disease susceptibility) equals Xβ, where the n × p matrix X holds p gene expressions for n samples and the coefficient vector β is k-sparse — only a few active genes.
  • 4. Low Rank Matrices: The Netflix Problem. The movie-ratings matrix (millions of subscribers × thousands of movies) decomposes into a user profile times a movie profile per feature (e.g., Action), plus a handful of other features.
  • 5. Low Rank Matrices: Topic Models. The term-document matrix (millions of words × thousands of documents) decomposes into a topic profile times a degree of topicality per topic (e.g., Politics), plus a handful of other features.
  • 6. Simple objects. Decomposition: x = Σ_{a∈A} c_a a, where the atomic set A collects the atoms ("features") and the weights c_a are nonnegative.
  Objects | Notion of simplicity | Atoms
  Vectors | Sparsity | Canonical unit vectors
  Matrices | Rank | Rank-1 matrices
  Bandlimited signals | No. of frequencies | Complex sinusoids
  Linear systems | Number of poles | Single-pole systems
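As a toy illustration of x = Σ_{a∈A} c_a a (a hypothetical sketch, not from the talk): a 2-sparse vector in R^5 written as a nonnegative combination of signed canonical unit vectors, the atoms for sparsity.

```python
def unit_vector(i, sign, p):
    """Signed canonical unit vector (sign * e_i) in R^p: an atom for sparsity."""
    v = [0.0] * p
    v[i] = float(sign)
    return v

p = 5
# x = 3 * e_1 + 2 * (-e_3): two atoms with nonnegative weights c_a = 3 and 2
weighted_atoms = [(3.0, unit_vector(1, +1, p)), (2.0, unit_vector(3, -1, p))]
x = [sum(c * a[j] for c, a in weighted_atoms) for j in range(p)]

print(x)                                  # [0.0, 3.0, 0.0, -2.0, 0.0]
print(sum(c for c, _ in weighted_atoms))  # 5.0 — the l1 norm of x
```

The sum of the weights, 5.0, is exactly the ℓ₁ norm of x, previewing the next slide's atomic norm.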
  • 7. Universal Regularizer (Chandrasekaran, Recht, Parrilo, Willsky, 2010). The atomic norm is the gauge of conv(A):
  ‖x‖_A = inf{t ≥ 0 : x ∈ t·conv(A)} = inf{Σ_a c_a : x = Σ_{a∈A} c_a a, c_a ≥ 0}.
  Objects | Notion of simplicity | Atomic norm
  Vectors | Sparsity | ℓ₁ norm
  Matrices | Rank | nuclear norm
  Bandlimited signals | No. of frequencies | ?
  Linear systems | McMillan degree | ?
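To see the table's matrix row concretely — rank-1 atoms give the nuclear norm — here is a small numerical check with hypothetical weights (numpy assumed). With orthonormal factors the atomic decomposition is tight, so the sum of the weights equals the sum of the singular values.

```python
import numpy as np

# X built from two rank-1 atoms u_i v_i^T (unit-norm factors) with weights 3 and 2.
U = np.eye(4)[:, :2]     # orthonormal u_1, u_2
V = np.eye(3)[:, :2]     # orthonormal v_1, v_2
X = 3.0 * np.outer(U[:, 0], V[:, 0]) + 2.0 * np.outer(U[:, 1], V[:, 1])

# Nuclear norm = sum of singular values = 3 + 2 = 5 = the atomic norm here.
nuclear_norm = np.linalg.svd(X, compute_uv=False).sum()
print(nuclear_norm)   # 5.0
```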
  • 8. Moment Problems. Here the atoms live in a measure space, and x is a nonnegative combination of atomic moments — the moments of a discrete measure. Motivating applications: system identification and line spectral estimation (illustrative plots omitted).
  • 9. Decomposition. How many atoms do you need to represent x? When the composing atoms lie on an exposed face of conv(A), the decomposition is easy to find: a supporting hyperplane certifies it. We therefore concentrate on recovering "good" objects whose atoms lie on exposed faces.
  • 10. Dual Certificate. Dual characterization of the atomic norm — a semi-infinite problem:
  maximize_q ⟨q, x⋆⟩ subject to ‖q‖*_A ≤ 1,
  where ‖q‖*_A = sup_{a∈A} ⟨q, a⟩ (under mild conditions; Bonsall). The optimal value is the atomic norm, and the solutions are subgradients of ‖·‖_A at x⋆. The certificate: ⟨q, a⟩ = 1 for a in the support, ⟨q, a⟩ < 1 otherwise.
  • 11. Denoising. Observe y = x⋆ + w ∈ ℂⁿ with w ~ N(0, σ²Iₙ). AST (Atomic norm Soft Thresholding):
  minimize_x ½‖y − x‖₂² + τ‖x‖_A, with tradeoff parameter τ.
  Dual AST: maximize_q τ⟨q, y⟩ − (τ²/2)‖q‖₂² subject to ‖q‖*_A ≤ 1.
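In the simplest instance — signed unit-vector atoms, where ‖·‖_A is the ℓ₁ norm — AST has a closed form: coordinate-wise soft thresholding, hence the name. An illustrative sketch (not the talk's code):

```python
def soft_threshold(y, tau):
    """Solve argmin_x 0.5*(v - x)^2 + tau*|x| independently per coordinate of y."""
    out = []
    for v in y:
        if v > tau:
            out.append(v - tau)    # shrink toward zero from above
        elif v < -tau:
            out.append(v + tau)    # shrink toward zero from below
        else:
            out.append(0.0)        # small coordinates are set exactly to zero
    return out

x_hat = soft_threshold([3.0, -0.2, 1.5], tau=1.0)
print(x_hat)   # [2.0, 0.0, 0.5]
```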
  • 12. AST Results. Optimality conditions: τq̂ + x̂ = y, ‖q̂‖*_A ≤ 1, ⟨q̂, x̂⟩ = ‖x̂‖_A — which means q̂ ∈ ∂‖x̂‖_A, so we can find the composing atoms! They also suggest τq̂ ≈ w, hence τ ≈ E‖w‖*_A: the right tradeoff is a Gaussian width. No cross-validation needed — just estimate the Gaussian width, which is also the key quantity in characterizing convergence rates.
  • 13. AST Convergence Rates. Comparison with the Lasso: for an s-sparse signal, the fast rate is (1/n)‖X(β̂ − β⋆)‖₂² ≤ Cσ² s log(p)/n, while the slow rate is of order σ√(log(p)/n) times the ℓ₁ norm of the signal. For AST, choosing the appropriate tradeoff τ = E‖w‖*_A gives a universal but slow rate under minimal assumptions on the atoms: (1/n)‖x̂ − x⋆‖₂² ≤ (τ/n)‖x⋆‖_A. With a "cone condition" — weaker than RIP, coherence, RSC, or RE — faster rates are possible.
  • 14. Summary. A notion of simple models; the atomic norm unifies the treatment of many models; recovers tight published bounds; bounds depend on the Gaussian width; faster rates with a local cone condition.
  • 15. Line Spectral Estimation. Time samples x_k = Σ_{j=1}^s c_j e^{i2πf_j k}, k = 0, …, n−1, with line spectrum μ = Σ_j c_j δ_{f_j}. Extrapolate the remaining moments!
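A minimal generator for such samples (hypothetical frequencies and amplitudes, standard library only):

```python
import cmath

def line_spectral_samples(freqs, amps, n):
    """x_k = sum_j c_j * exp(i*2*pi*f_j*k): moments of mu = sum_j c_j delta_{f_j}."""
    return [sum(c * cmath.exp(2j * cmath.pi * f * k) for f, c in zip(freqs, amps))
            for k in range(n)]

x = line_spectral_samples(freqs=[0.1, 0.35], amps=[1.0, 0.5], n=8)
print(len(x))      # 8
print(abs(x[0]))   # 1.5 — x_0 is the total mass c_1 + c_2
```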
  • 16. Super-resolution. The time-frequency dual: extrapolate high-frequency information from low-frequency samples. Just exchange time and frequency — it is the same problem as line spectral estimation.
  • 17. Classical Line Spectrum Estimation (a crash course). Nonlinear parameter estimation: x_k = Σ_{j=1}^s c_j e^{2πi f_j k}. For any vector x, define the Hermitian Toeplitz matrix T_n(x) whose first row is (x₁, x₂, …, x_n) and whose entries below the diagonal are the conjugates of those above it. FACT: T_n(x⋆) = Σ_{j=1}^s c_j v(f_j) v(f_j)*, where v(f) = (1, e^{i2πf}, …, e^{i2π(n−1)f})ᵀ — low rank and Toeplitz.
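The FACT is easy to verify numerically — a sketch with s = 2 hypothetical frequencies (numpy assumed): the Toeplitz matrix built from noiseless samples has rank exactly s.

```python
import numpy as np

n, freqs, amps = 6, [0.12, 0.4], [1.0, 0.7]
k = np.arange(n)
x = sum(c * np.exp(2j * np.pi * f * k) for f, c in zip(freqs, amps))

# Hermitian Toeplitz T_n(x): entry (i, j) is x[i-j] on/below the diagonal,
# and the conjugate of x[j-i] above it.
T = np.empty((n, n), dtype=complex)
for i in range(n):
    for j in range(n):
        T[i, j] = x[i - j] if i >= j else np.conj(x[j - i])

print(np.linalg.matrix_rank(T, tol=1e-8))   # 2 = number of frequencies
```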
  • 18. Classical techniques exploit the low-rank Toeplitz FACT above. PRONY: estimate s, then root finding. CADZOW: alternating projections. MUSIC: low-rank decomposition, plot the pseudospectrum. Drawbacks: sensitive to model order, no rigorous theory, optimal only asymptotically.
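A bare-bones MUSIC sketch on noiseless data, with the true model order fed in (as the slide notes these methods require; toy values, numpy assumed). Eigenvectors of T_n(x) outside the signal subspace are orthogonal to every a(f_j), so the pseudospectrum 1/‖E_noiseᴴ a(f)‖ peaks at the true frequencies.

```python
import numpy as np

n, s = 16, 2
true_f, amps = np.array([0.15, 0.4]), np.array([1.0, 0.8])
k = np.arange(n)
x = (amps * np.exp(2j * np.pi * np.outer(k, true_f))).sum(axis=1)

# Hermitian Toeplitz matrix of the samples (PSD, rank s).
T = np.empty((n, n), dtype=complex)
for i in range(n):
    for j in range(n):
        T[i, j] = x[i - j] if i >= j else np.conj(x[j - i])

# Noise subspace: eigenvectors for the n - s smallest eigenvalues.
_, V = np.linalg.eigh(T)
E_noise = V[:, : n - s]

grid = np.linspace(0.0, 1.0, 2000, endpoint=False)
A = np.exp(2j * np.pi * np.outer(k, grid))                   # a(f) on a fine grid
pseudo = 1.0 / np.linalg.norm(E_noise.conj().T @ A, axis=0)  # MUSIC pseudospectrum

est = np.sort(grid[np.argsort(pseudo)[-s:]])                 # the s largest peaks
print(est)   # close to [0.15, 0.4]
```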
  • 19. Convex Perspective. Line spectrum x = Σ_{j=1}^s c_j a(f_j), with amplitudes c_j, frequencies f_j, and atoms a(f) = (1, e^{i2πf}, …, e^{i2πf(n−1)})ᵀ. Positive amplitudes: A₊ = {a(f) : f ∈ [0, 1]}. Complex: A = {a(f)e^{iφ} : f ∈ [0, 1], φ ∈ [0, 2π]}.
  • 20. Convergence rates. For a signal with n samples of the form y_k = Σ_{j=1}^s c_j e^{2πi f_j k} + w_k, choose the tradeoff as τ ≈ σ√(n log n): with high probability, for n > 4, Dudley's inequality and Bernstein's polynomial theorem bound E‖w‖*_A at this level. The global AST guarantee then gives E(1/n)‖x̂ − x⋆‖₂² ≤ C σ √(log(n)/n) Σ_{j=1}^s |c_j|.
  • 21. Fast Rates. If every two frequencies are far enough apart — the minimum separation condition min_{p≠q} d(u_p, u_q) ≥ 4/n — and the tradeoff parameter is chosen correctly, τ = η σ√(n log n) with η > 1 large enough, then for s frequencies and n samples the MSE satisfies E(1/n)‖x̂ − x⋆‖₂² ≤ C σ² s log(n)/n. This is not a global condition on the dictionary; it is local to the signal. The proof uses many properties of trigonometric polynomials and Fejér kernels derived by Candès and Fernández-Granda.
  • 22. Can’t do much better. Our rate E(1/n)‖x̂ − x⋆‖₂² ≤ C σ² s log(n)/n is only a logarithmic factor from the "oracle rate" C σ² s/n, and nearly matches the minimax lower bound on the best prediction rate for signals with 4/n minimum separation: E(1/n)‖x̂ − x⋆‖₂² ≥ C σ² s log(n/4s)/n.
  • 23. Minimum Separation and Exact Recovery. Result of Candès and Fernández-Granda: 4/n separation means one can construct an explicit dual certificate of exact recovery — the convex hulls of far-enough vertices are exposed faces. Some minimum separation is necessary. Prony does not have this limitation. (Then why bother? Robustness guarantees.)
  • 24. …except in the positive case: the convex hull of every collection of n/2 vertices is an exposed face. This is an infinite-dimensional version of the cyclic polytope, which is neighbourly. No separation condition is needed.
  • 25. Localizing Frequencies. From the dual solution, y = x̂ + τq̂, form the dual polynomial Q̂(f) = Σ_{j=0}^{n−1} q̂_j e^{−i2πjf}. Since ⟨q̂, a⟩ = 1 for atoms in the support and ⟨q̂, a⟩ < 1 otherwise, the frequencies are localized where |Q̂(f)| reaches 1 (dual polynomial plots omitted).
  • 26. Localization Guarantees. Three localization metrics, each controlled at the level k²(n)/n (for constants C₁, C₂, C₃): spurious amplitudes — the mass placed on frequencies far (more than 0.16/n) from every true one, Σ_{l : f̂_l ∈ F} |ĉ_l| ≤ C₁ k²(n)/n; weighted frequency deviation — Σ_{l : f̂_l ∈ N_j} |ĉ_l| min_{f_j ∈ T} d(f_j, f̂_l)² ≤ C₂ k²(n)/n; and near-region approximation — |c_j − Σ_{l : f̂_l ∈ N_j} ĉ_l| ≤ C₃ k²(n)/n.
  • 27. How to actually solve AST: semidefinite characterizations of conv(A₊) and conv(A) for minimize_x ½‖y − x‖₂² + τ‖x‖_A. Positive amplitudes (Carathéodory–Toeplitz): ‖x‖_{A₊} = x₀ if T_n(x) ⪰ 0, and +∞ otherwise. Complex (Bounded Real Lemma): ‖x‖_A = inf_{u,t} { (1/2n) tr(T_n(u)) + t/2 : [ T_n(u), x ; x*, t ] ⪰ 0 }.
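The positive-amplitudes characterization can be sanity-checked numerically (toy measure with positive weights, numpy assumed): samples of a positive discrete measure give a PSD Toeplitz matrix, and x₀ is the total mass — the atomic norm.

```python
import numpy as np

n, freqs, amps = 8, [0.2, 0.7], [1.5, 0.5]      # positive amplitudes
k = np.arange(n)
x = sum(c * np.exp(2j * np.pi * f * k) for f, c in zip(freqs, amps))

# Hermitian Toeplitz matrix of the samples.
T = np.empty((n, n), dtype=complex)
for i in range(n):
    for j in range(n):
        T[i, j] = x[i - j] if i >= j else np.conj(x[j - i])

print(np.linalg.eigvalsh(T).min() >= -1e-9)     # True: T_n(x) is PSD
print(abs(x[0].real - sum(amps)) < 1e-12)       # True: x_0 = total mass = 2.0
```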
  • 28. Alternating Direction Method of Multipliers. AST SDP formulation:
  minimize_{t,u,x,Z} ½‖x − y‖₂² + (τ/2)(t + u₁) subject to Z = [ T(u), x ; x*, t ], Z ⪰ 0.
  Form the augmented Lagrangian with momentum parameter ρ — the Lagrangian term plus a quadratic penalty:
  L_ρ(t, u, x, Z, Λ) = ½‖x − y‖₂² + (τ/2)(t + u₁) + ⟨Λ, Z − [T(u), x; x*, t]⟩ + (ρ/2)‖Z − [T(u), x; x*, t]‖_F².
  Alternating direction iterations: (t^{l+1}, u^{l+1}, x^{l+1}) ← argmin L_ρ (a quadratic objective); Z^{l+1} ← argmin_{Z⪰0} L_ρ (a projection onto the PSD cone); Λ^{l+1} ← Λ^l + ρ(Z^{l+1} − [T(u^{l+1}), x^{l+1}; x^{l+1}*, t^{l+1}]) (a linear dual update).
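The same three-step ADMM pattern — quadratic step, prox/projection step, linear dual update — is easiest to see on the smallest atomic-norm instance, ℓ₁ denoising, where a closed form is available to check against. This is a hypothetical sketch, not the talk's SDP solver (numpy assumed):

```python
import numpy as np

def admm_l1_denoise(y, tau, rho=1.0, iters=200):
    """ADMM for min 0.5*||x - y||^2 + tau*||z||_1  s.t.  x = z (scaled dual u)."""
    x = np.zeros_like(y)
    z = np.zeros_like(y)
    u = np.zeros_like(y)
    for _ in range(iters):
        x = (y + rho * (z - u)) / (1.0 + rho)                    # quadratic step
        v = x + u
        z = np.sign(v) * np.maximum(np.abs(v) - tau / rho, 0.0)  # prox of tau*||.||_1
        u = u + x - z                                            # linear dual update
    return z

y = np.array([3.0, -0.2, 1.5])
x_hat = admm_l1_denoise(y, tau=1.0)
print(np.allclose(x_hat, [2.0, 0.0, 0.5], atol=1e-6))   # True: matches soft threshold
```

The iterates converge to the soft-thresholded solution, which is the known closed form for this problem.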
  • 29. Alternative Discretization Method. Can use a grid: A_N = {a(f, φ) : f = j/N, j = 1, …, N, φ ∈ [0, 2π)}. We show (1 − 2πn/N)‖x‖_{A_N} ≤ ‖x‖_A ≤ ‖x‖_{A_N} (via Bernstein's polynomial theorem or Fritz John; works for general epsilon nets). AST is then approximated by a Lasso: n samples, N gridded frequencies, y = (atomic moments)·c⋆ + w. Fast shrinkage algorithms apply, Fourier projections are cheap, and coherence is irrelevant.
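The sandwich says little is lost on a fine grid. Relatedly, the nearest grid atom approximates an off-grid atom better and better as N grows — a quick numerical check with a hypothetical off-grid frequency (numpy assumed):

```python
import numpy as np

n, f0 = 32, 1.0 / 3.0            # an off-grid frequency
k = np.arange(n)
atom = lambda f: np.exp(2j * np.pi * f * k)

errs = []
for N in (64, 256, 1024):        # finer and finer grids
    f_grid = np.round(f0 * N) / N                       # nearest grid frequency
    errs.append(np.linalg.norm(atom(f0) - atom(f_grid)) / np.sqrt(n))

print(errs[0] > errs[1] > errs[2])   # True: error shrinks as the grid refines
```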
  • 30. Experimental setup. n = 64, 128, or 256 samples; number of frequencies n/16, n/8, n/4; SNR = −5 dB, 0 dB, …, 20 dB. (Root) MUSIC, matrix pencil, and Cadzow are all fed the true model order; AST uses an empirically estimated (not fed) noise variance. Compared mean-squared error and localization-error metrics, averaged over 20 trials.
  • 31. Mean Squared Error: MSE (dB) vs. SNR (dB) for AST, Cadzow, MUSIC, and Lasso; 128 samples, 8 frequencies (plot omitted).
  • 32. Mean Squared Error: performance profiles. P_s(β) = fraction of experiments with MSE less than β × the minimum MSE; average MSE over 20 trials, n = 64, 128, 256, s = n/16, n/8, n/4, SNR = −5, 0, …, 20 dB. Methods: AST, Lasso 2¹⁵, MUSIC, Cadzow, matrix pencil (plot omitted).
  • 33. Mean Squared Error: effect of gridding. Performance profiles for Lasso with grid sizes 2¹⁰ through 2¹⁵ (plot omitted).
  • 34. Frequency Localization: performance profiles and SNR sweeps (k = 32) for the three localization metrics — spurious amplitudes (m₁), weighted frequency deviation (m₂), and near-region approximation (m₃) — comparing AST, MUSIC, and Cadzow (plots omitted).
  • 35. Simple LTI Systems. Find a simple linear system from time-series data (e.g., vibration: input u(t), output y(t)). State space: x[t+1] = Ax[t] + Bu[t], y[t] = Cx[t] + Du[t]. Existing approaches: nonlinear Kalman-filter-based estimation, parametric methods based on Hankel matrices, low-rank formulations.
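For reference, the state-space recursion in code — a scalar single-pole toy system with hypothetical values a = 0.7, b = 1, c = 2, d = 0, driven by an impulse:

```python
# y[t] = c*x[t] + d*u[t],  x[t+1] = a*x[t] + b*u[t]
a, b, c, d = 0.7, 1.0, 2.0, 0.0
u = [1.0] + [0.0] * 5            # impulse input
state, y = 0.0, []
for t in range(len(u)):
    y.append(c * state + d * u[t])
    state = a * state + b * u[t]

print(y)   # impulse response: 0 at t=0, then 2 * 0.7^(t-1)
```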
  • 36. Subspace Methods. With a little computation, the block-Hankel matrix of outputs factors as Y = OX + TU: Y stacks shifted output rows, O = [C; CA; CA²; …; CA^{ℓ−1}] is the extended observability matrix, X = [x₁ x₂ … x_k] collects the states, T is the block lower-triangular Toeplitz matrix of Markov parameters (D; CB, D; CAB, CB, D; …), and U stacks shifted input rows. Projecting out the inputs, Y U^⊥ = O X U^⊥ is low rank: an SVD yields subspace ID (n4sid); there are also nuclear-norm formulations (Vandenberghe et al.).
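The low-rank fact behind these methods — Hankel matrix rank equals McMillan degree — in a few lines, with hypothetical one- and two-pole impulse responses (numpy assumed):

```python
import numpy as np

def hankel_rank(h, m):
    """Rank of the m x m Hankel matrix H[i, j] = h[i + j] of Markov parameters."""
    H = np.array([[h[i + j] for j in range(m)] for i in range(m)])
    return np.linalg.matrix_rank(H, tol=1e-8)

h1 = [2.0 * 0.7 ** t for t in range(12)]                       # one pole at 0.7
h2 = [2.0 * 0.7 ** t + 1.0 * (-0.5) ** t for t in range(12)]   # poles at 0.7, -0.5

print(hankel_rank(h1, 6), hankel_rank(h2, 6))   # 1 2
```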
  • 37. SISO. The Hankel operator Γ_G : ℓ₂(−∞, 0] → ℓ₂[0, ∞) maps past inputs to future outputs. Fact: rank of the Hankel operator = McMillan degree. Relax and use the Hankel nuclear norm? Not clear how to compute…
  • 38. Convex Perspective on SISO Systems. Example: G(z) = 2(1 − 0.7²)/(z − 0.7) + 5(1 − 0.5²)/(z − (0.3 + 0.4i)) + 5(1 − 0.5²)/(z − (0.3 − 0.4i)). Atoms: single-pole systems, A = { (1 − |w|²)/(z − w) : w ∈ D_ρ ⊂ D }. Observe y = L(G(z)) + w and regularize with the atomic norm: minimize_G ‖y − L(G)‖₂² + μ‖G‖_A. Theorem: (π/8)‖G‖_A ≤ ‖Γ_G‖₁ ≤ ‖G‖_A.
  • 39. Atomic Norm Regularization. How to solve minimize_G ‖y − L(G)‖₂² + μ‖G‖_A? Equivalent to minimize_{c,x} ‖y − x‖₂² + μ Σ_{w∈D_ρ} |c_w|/(1 − |w|²) subject to x = Σ_{w∈D_ρ} c_w L(1/(z − w)). Discretize and set Φ_{kℓ} := L_k((1 − |w_ℓ|²)/(z − w_ℓ)); this becomes a Lasso: minimize_c ½‖y − Φc‖₂² + μ‖c‖₁.
  • 40. DAST Analysis. Measure n uniformly spaced samples of the frequency response, Φ_{kℓ} = (1 − |w_ℓ|²)/(e^{2πik/n} − w_ℓ), over a net {w_ℓ} on the disk, and solve x̂ = argmin_x ½‖y − x‖₂² + μ‖x‖_{A_ε}. Set Ĝ(z) = Σ_ℓ ĉ_ℓ (1 − |w_ℓ|²)/(z − w_ℓ), where the coefficients correspond to the support of x̂. Then ‖G − Ĝ‖²_{H2} ≤ γ₀ ‖Γ_G‖₁ / ((1 − ρ)√n), with constant γ₀, Hankel nuclear norm ‖Γ_G‖₁, stability radius ρ, and number of measurements n.
  • 41. Example: 6 poles, ρ = 0.95, σ = 1e-1, n = 30; H2 error = 2e-2 (plot omitted).
  • 42. Summary. Optimal bounds for denoising using convex methods; better empirical performance than classical approaches; local conditions on signals instead of global dictionary conditions; discretization works! Acknowledgements: results derived in collaboration with Tang, Shah, and Prof. Recht.
  • 44. Future Work • Better convergence results for discretization. • Why does weighted mean heuristic work so well for discretization? • Better algorithms for System Identification • Greedy techniques • More atomic sets. General localization guarantees.
  • 45. Referenced Publications. B, Recht, “Atomic Norm Denoising for Line Spectral Estimation,” 49th Allerton Conference on Controls and Systems, 2011. B, Tang, Recht, “Atomic Norm Denoising for Line Spectral Estimation,” submitted to IEEE Transactions on Signal Processing. Tang, B, Recht, “Near Minimax Line Spectral Estimation,” to be submitted. Shah, B, Tang, Recht, “Linear System Identification using Atomic Norm Minimization,” Control and Decision Conference, 2012.
  • 46. Other Publications. Tang, B, Shah, Recht, “Compressed Sensing off the Grid,” submitted to IEEE Transactions on Information Theory. Dasarathy, Shah, B, Nowak, “Covariance Sketching,” 50th Allerton Conference on Controls and Systems, 2012. Gubner, B, Hao, “Multipath-Cluster Channel Models,” IEEE Conference on Ultrawideband Systems, 2012. Wittenberg, Alimadhi, B, Lau, “ei.RxC: Hierarchical Multinomial-Dirichlet Ecological Inference Model,” in Zelig: Everyone’s Statistical Software. Tang, B, Recht, “Sparse Recovery over Continuous Dictionaries: Just Discretize,” submitted to Asilomar Conference, 2013.
  • 47. Cone Condition. Inflated descent cone: C_γ(x⋆) = {z : ‖x⋆ + z‖_A ≤ ‖x⋆‖_A + γ‖z‖_A}, with γ ∈ (0, 1); C₀ is the usual descent cone. The condition: no direction in C_γ has zero atomic norm.
  • 48. Minimum separation is necessary. Using two sinusoids: since ‖a(f₁) − a(f₂)‖₂ ≈ 2πn^{3/2}|f₁ − f₂|, we get ‖a(f₁) − a(f₂)‖_A ≤ n^{1/2}‖a(f₁) − a(f₂)‖₂ ≤ 2πn²|f₁ − f₂|. So when the separation is below 1/(4n²), the atomic norm cannot recover the pair. One can empirically show failure with less than 1/n separation and an adversarial signal.
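The ℓ₂ distance in this argument is easy to verify numerically — with a hypothetical n and spacing (numpy assumed), ‖a(f₁) − a(f₂)‖₂ for nearby frequencies is of order n^{3/2}|f₁ − f₂|, with a constant below 2π:

```python
import numpy as np

n, f1, df = 64, 0.3, 1e-4
k = np.arange(n)
a = lambda f: np.exp(2j * np.pi * f * k)

dist = np.linalg.norm(a(f1) - a(f1 + df))
bound = 2 * np.pi * n ** 1.5 * df     # the slide's 2*pi*n^{3/2}*|f1 - f2| scale
print(bound / 4 <= dist <= bound)     # True: the distance is of this order
```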