2019 Fall Series: Postdoc Seminars - Special Guest Lecture, Attacking the Curse of Dimensionality Using Sums of Separable Functions - Martin Mohlenkamp, September 11, 2019
Naive computations involving a function of many variables suffer from the curse of dimensionality: the computational cost grows exponentially with the number of variables. One approach to bypassing the curse is to approximate the function as a sum of products of functions of one variable and compute in this format. When the variables are indices, a function of many variables is called a tensor, and this approach is to approximate and use the tensor in the (so-called) canonical tensor format. In this talk I will describe how such approximations can be used in numerical analysis and in machine learning.
1. Attacking the Curse of Dimensionality
using Sums of Separable Functions
Martin J. Mohlenkamp
Department of Mathematics
http://www.ohiouniversityfaculty.com/mohlenka/
SAMSI, September 2019
2. Abstract
Naive computations involving a function of many variables suffer from the curse of dimensionality: the computational cost grows exponentially with the number of variables. One approach to bypassing the curse is to approximate the function as a sum of products of functions of one variable and compute in this format. When the variables are indices, a function of many variables is called a tensor, and this approach is to approximate and use the tensor in the (so-called) canonical tensor format. In this talk I will describe how such approximations can be used in numerical analysis and in machine learning.
3. Goals of this Talk
Show you a tool that you may find useful.
Hint at other things I know that you may find useful.
Not Goals
Convince you that this tool is better than other methods.
Show that I am great.
4. The Curse of Dimensionality (discrete setting)
d     Name     Notation               Storage
1     Vector   v_j                    $
2     Matrix   A_{jk}                 $$
3     Tensor   T_{jkm}                $$$
>3    Tensor   T(j_1, ..., j_d)       $^d
(The slide's "Visual" column showed pictures for d ≤ 3 and a "?" for d > 3.)
The cost to do anything, even store the object,
grows exponentially in the dimension d.
5. The Curse of Dimensionality (function setting)
To approximate a function $f(x_1, x_2, \ldots, x_d)$ that has smoothness $p$ to accuracy $\epsilon$ costs
$$\epsilon^{-d/p} = \left(\epsilon^{-1/p}\right)^d = \$^d.$$
This curse is unavoidable for general function spaces (smoothness classes).
If a method seems to avoid it, look for
“constants” that grow exponentially in d,
inductive proofs that require d! terms, and
assumptions that imply a vanishing set of functions as d increases.
(Exercise: Think about how this applies to Monte Carlo methods.)
6. Philosophy
Naturally occurring functions of many variables are not general.
If a method can match what really occurs in some application,
then it can avoid the curse.
Non-trivial, non-circular characterizations of the set of functions that a
given method can match are hard. (I know of none.)
Instead we start from inspiration:
Neural networks are inspired by the visual cortex of cats.
The following method is inspired by partial differential equations in
physics (e.g. heat flow).
7. Approximation by Sums of Separable Tensors/Functions
In dimension d, a rank r approximation of a tensor T is
$$T(j_1, j_2, \ldots, j_d) \approx G(j_1, \ldots, j_d) = \sum_{l=1}^{r} \prod_{i=1}^{d} G_i^l(j_i),$$
or equivalently
$$T \approx G = \sum_{l=1}^{r} G^l = \sum_{l=1}^{r} \bigotimes_{i=1}^{d} G_i^l.$$
Instead of $^d, storage is r·d·$, which is no longer exponential.
To do functions, just change notation:
$$f(x_1, x_2, \ldots, x_d) \approx g(x_1, \ldots, x_d) = \sum_{l=1}^{r} \prod_{i=1}^{d} g_i^l(x_i).$$
With large enough r this can approximate anything within $\epsilon$.
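To make the format concrete, here is a minimal numpy sketch (my illustration, not code from the talk; the name `cp_to_full` is made up) that expands a tensor stored in canonical format and compares the storage counts:

```python
import numpy as np

def cp_to_full(factors):
    """Expand a canonical-format tensor, given as a list of d factor
    matrices G_i of shape (n, r), into the full d-way array
    T(j_1, ..., j_d) = sum_l prod_i G_i[j_i, l].
    For illustration only: the point of the format is to never do this."""
    T = np.zeros(tuple(F.shape[0] for F in factors))
    for l in range(factors[0].shape[1]):
        term = factors[0][:, l]
        for F in factors[1:]:
            term = np.multiply.outer(term, F[:, l])  # build the rank-1 term
        T += term
    return T

d, n, r = 5, 10, 3
factors = [np.random.rand(n, r) for _ in range(d)]
print("separable storage:", r * d * n)  # r·d·$
print("full storage:     ", n ** d)     # $^d
assert cp_to_full(factors).shape == (n,) * d
```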
8. Basic Computational Paradigm
1. Start with operators/matrices and functions/vectors that can be represented within $\epsilon$ with low rank.
2. Do linear algebra operations with them, e.g. (see the sketch after this list)
$$\tilde{g} = Lg = \sum_{l=1}^{r} \sum_{m=1}^{r_1} \prod_{i=1}^{d} \left(L_i^l\, g_i^m\right)(x_i).$$
The computational cost is $\mathcal{O}(d \cdot r \cdot r_1)$, which is linear in d rather than exponential.
3. Adaptively re-minimize the rank of the output of each operation, controlling the approximation error.
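As a sketch of step 2 (again my illustration, under the assumption that the operator is itself stored as a sum of separable terms): applying the operator touches each direction independently, so the cost is linear in d, while the output rank multiplies.

```python
import numpy as np

def apply_separable_op(op_terms, g_factors):
    """Apply L = sum_{l=1}^{r} L_1^l x ... x L_d^l to g of rank r1 in
    canonical format.  op_terms: list of r lists of d matrices (n x n);
    g_factors: list of d factor matrices (n x r1).
    Returns the factors of Lg, of rank r*r1; the cost is linear in d."""
    d = len(g_factors)
    out = [[] for _ in range(d)]
    for term in op_terms:            # operator terms l = 1..r
        for i in range(d):           # act on each direction separately
            out[i].append(term[i] @ g_factors[i])
    # stacking columns realizes the double sum over (l, m)
    return [np.hstack(cols) for cols in out]

d, n, r, r1 = 4, 8, 2, 3
L = [[np.random.rand(n, n) for _ in range(d)] for _ in range(r)]
g = [np.random.rand(n, r1) for _ in range(d)]
print([F.shape for F in apply_separable_op(L, g)])  # four factors, each (8, 6)
```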
9. Example: Power Method
The slide shows a flow diagram: apply $L$ (rank $r$) to $g_1$ (rank $r_1$), multiplying to get $\tilde{g}$ of rank $r \cdot r_1$; reduce the rank $r \cdot r_1 \to r_2$ to obtain $g_2$; apply $L$ (rank $r$) to $g_2$ (rank $r_2$); and so on.
10. Reducing the Rank
We wish to (well) approximate
$$\tilde{g} = \sum_{m=1}^{R} \prod_{i=1}^{d} \tilde{g}_i^m \quad\text{by}\quad g = \sum_{l=1}^{r} \prod_{i=1}^{d} g_i^l,$$
with r small(er).
This is NP-hard, but we can try optimization algorithms:
From an initial g, iteratively modify $\{g_i^l\}$ to reduce the error $\|\tilde{g} - g\|_2^2$.
You can try your favorite generic method:
Newton’s method and variations
gradient descent and variations
GMRES, BFGS, other acronyms
etc.
Often any method will do, but sometimes all of them struggle.
(I have worked years on challenges with this optimization problem.)
11. Alternating Least Squares (ALS)
This optimization problem has a multilinear structure we can use.
Loop until the error is small enough or r seems insufficient:
    Loop through the directions k = 1, . . . , d.
    Fix $\{g_i^l\}$ for $i \neq k$, and solve a linear least-squares problem for new $g_k^l$.
The normal equations are
$$\begin{pmatrix} \prod_{i\neq k}\langle g_i^1, g_i^1\rangle & \cdots & \prod_{i\neq k}\langle g_i^1, g_i^r\rangle \\ \vdots & \ddots & \vdots \\ \prod_{i\neq k}\langle g_i^r, g_i^1\rangle & \cdots & \prod_{i\neq k}\langle g_i^r, g_i^r\rangle \end{pmatrix} \begin{pmatrix} g_k^1 \\ \vdots \\ g_k^r \end{pmatrix} = \begin{pmatrix} \sum_{q=1}^{R} \tilde{g}_k^q \prod_{i\neq k}\langle g_i^1, \tilde{g}_i^q\rangle \\ \vdots \\ \sum_{q=1}^{R} \tilde{g}_k^q \prod_{i\neq k}\langle g_i^r, \tilde{g}_i^q\rangle \end{pmatrix}.$$
ALS is old, simple, stepwise robust, adaptable, and widely used,
but does not make the underlying optimization problem any easier.
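A minimal numpy sketch of this ALS sweep for the rank-reduction problem of the previous slide (my illustration: no regularization, restarts, or stopping tests, all of which matter in practice):

```python
import numpy as np

def als_reduce(T_factors, r, n_passes=50):
    """Approximate a canonical-format tensor of rank R (T_factors: list
    of d matrices, each n x R) by one of rank r, sweeping the directions
    and solving the normal equations above in each direction k.
    Assumes the Gram matrix stays invertible."""
    G = [np.random.rand(F.shape[0], r) for F in T_factors]
    for _ in range(n_passes):
        for k in range(len(T_factors)):
            A = np.ones((r, r))                      # prod_{i!=k} <g_i^l, g_i^m>
            B = np.ones((r, T_factors[0].shape[1]))  # prod_{i!=k} <g_i^l, ~g_i^q>
            for i, (Gi, Ti) in enumerate(zip(G, T_factors)):
                if i != k:
                    A *= Gi.T @ Gi
                    B *= Gi.T @ Ti
            # normal equations: A @ (new g_k values).T = B @ T_k.T
            G[k] = np.linalg.solve(A, B @ T_factors[k].T).T
    return G
```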
12. Extended Computational Paradigm
(developed mainly for quantum mechanics)
Some symmetries can be enforced implicitly in the inner product.
Example: The antisymmetrizer $\mathcal{A}$ creates the beast
$$\mathcal{A} \prod_{i=1}^{N} \phi_i(\gamma_i) = \frac{1}{N!} \begin{vmatrix} \phi_1(\gamma_1) & \phi_1(\gamma_2) & \cdots & \phi_1(\gamma_N) \\ \phi_2(\gamma_1) & \phi_2(\gamma_2) & \cdots & \phi_2(\gamma_N) \\ \vdots & \vdots & & \vdots \\ \phi_N(\gamma_1) & \phi_N(\gamma_2) & \cdots & \phi_N(\gamma_N) \end{vmatrix},$$
but inner products with it are computed simply as
$$\left\langle \mathcal{A}\prod_i \tilde{\phi}_i,\ \mathcal{A}\prod_i \phi_i \right\rangle = \frac{|L|}{N!} \quad\text{with}\quad L(i,j) = \langle \tilde{\phi}_i, \phi_j \rangle.$$
13. Extended Computational Paradigm
If L does not have low rank but $\langle Lg_1, g \rangle$ is computable, then you cannot use the basic paradigm
$$g_1 \xrightarrow{\text{apply } L} Lg_1 = \tilde{g} \xrightarrow{\text{reduce rank}} g_2,$$
but you can sometimes still run ALS to form g.
Example: the electron-electron interaction (multiplication) operator
$$W = \frac{1}{2} \sum_{i=1}^{N} \sum_{j \neq i} \frac{1}{\|r_i - r_j\|}$$
cannot be written with small r, but
$$\left\langle \mathcal{A} W \prod_i \tilde{\phi}_i,\ \mathcal{A} \prod_i \phi_i \right\rangle$$
is computable (formula suppressed).
14. Extended Computational Paradigm
If you know why your function cannot be written with small r, you might be able to extend the sum-of-separable format.
Example: To capture the interelectron cusp, we can use
$$\mathcal{A} \sum_{p=0}^{P} \left( \frac{1}{2} \sum_{m \neq n} w_p(|\gamma_m - \gamma_n|) \right) \sum_{q=1}^{r_p} \prod_{i=1}^{N} \phi_i^{p,q}(\gamma_i).$$
Example: To scale to large systems (composed of subsystems) we can use
$$\mathcal{A} \sum_{q=1}^{r} \prod_{k=1}^{K} \left( \sum_{q_k=1}^{r_k} \prod_{i_k=1}^{N_k} \phi_{k,i_k}^{q,q_k}(\gamma_{k,i_k}) \right).$$
15. Conclusions, Part I
Sums of separable functions give a tractable way to represent (some)
functions of many variables.
You can compute with them, to solve PDEs etc.
There are various extensions.
(There are difficulties too, which I skip.)
16. Multivariate Regression
Beginning with scattered data in high dimensions
$$\mathcal{D} = \left\{ (\mathbf{x}^j, y^j) = (x_1^j, \cdots, x_d^j;\ y^j) \right\}_{j=1}^{N},$$
define an empirical inner product between functions
$$\langle f, g \rangle = \sum_{j=1}^{N} f(\mathbf{x}^j)\, g(\mathbf{x}^j),$$
which also works between a function and our data,
$$\left\langle \{(\mathbf{x}^j, y^j)\}_{j=1}^{N},\ g \right\rangle = \sum_{j=1}^{N} y^j g(\mathbf{x}^j).$$
The (empirical) least-squares error is then
$$\left\| \{(\mathbf{x}^j, y^j)\}_{j=1}^{N} - g \right\|^2 = \sum_{j=1}^{N} \left( y^j - g(\mathbf{x}^j) \right)^2.$$
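In code the empirical inner product is a one-liner (a sketch; here `X` holds the points $\mathbf{x}^j$ as rows and `f`, `g` act on such arrays):

```python
import numpy as np

def empirical_inner(f, g, X):
    """<f, g> = sum_j f(x^j) g(x^j), with X of shape (N, d)."""
    return np.sum(f(X) * g(X))

def data_inner(y, g, X):
    """<{(x^j, y^j)}, g> = sum_j y^j g(x^j)."""
    return np.sum(y * g(X))
```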
17. Regression with a Sum of Separable Functions
Construct g(x) such that $g(\mathbf{x}^j) \approx y^j$ with
$$g(\mathbf{x}) = \sum_{l=1}^{r} \prod_{i=1}^{d} g_i^l(x_i).$$
We can use an ALS approach (see the sketch below):
Loop until you are happy or the metaparameters seem inappropriate:
    Loop through the directions k = 1, . . . , d.
    Fix $\{g_i^l\}$ for $i \neq k$, and update $\{g_k^l\}_l$ to reduce (minimize) the error
    $$\sum_{j=1}^{N} \left( y^j - \sum_{l=1}^{r} g_k^l(x_k^j) \prod_{i \neq k} g_i^l(x_i^j) \right)^2.$$
If we choose each $g_k^l$ to be a linear combination of some basis functions, then we get a linear least-squares problem in its coefficients. Otherwise (and for other loss functions) it is nonlinear.
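Here is a compact sketch of that ALS regression loop (my illustration under stated assumptions: each $g_k^l$ is a polynomial in one variable, squared loss, data scaled to [0, 1], a fixed number of passes):

```python
import numpy as np

def als_regression(X, y, r, n_basis=5, n_passes=20):
    """Fit g(x) = sum_l prod_i g_i^l(x_i) to data X (N, d), y (N,)
    by alternating linear least squares on the basis coefficients."""
    N, d = X.shape
    # basis functions evaluated at the data: B[i] is (N, n_basis)
    B = [np.vander(X[:, i], n_basis, increasing=True) for i in range(d)]
    # coefficients C[i] (n_basis, r); start near the constant function
    C = [np.vstack([np.ones(r), 0.01 * np.random.randn(n_basis - 1, r)])
         for _ in range(d)]
    F = [B[i] @ C[i] for i in range(d)]        # factor values, (N, r)
    for _ in range(n_passes):
        for k in range(d):
            P = np.ones((N, r))                # product over i != k
            for i in range(d):
                if i != k:
                    P *= F[i]
            # linear least squares in the coefficients of the g_k^l
            A = np.einsum('nm,nl->nml', B[k], P).reshape(N, -1)
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            C[k] = coef.reshape(n_basis, r)
            F[k] = B[k] @ C[k]
    return C

def predict(C, X, n_basis=5):
    """Evaluate the fitted sum of separable functions at new points."""
    F = np.ones((X.shape[0], C[0].shape[1]))
    for i, Ci in enumerate(C):
        F *= np.vander(X[:, i], n_basis, increasing=True) @ Ci
    return F.sum(axis=1)
```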
18. Comments
The usual issues (noise, local minima, over-fitting) and
standard techniques (regularization, cross-validation) apply.
The cost for an optimization pass is linear in both d and N,
so the method is feasible for large data sets in high dimensions.
As of 2009, this regression method was competitive on a standard set
of benchmark problems (see the paper).
As of 2010, a classification method based on these principles was
competitive on a standard set of benchmark problems (see a paper by
Jochen Garcke).
19. Regression on Molecules and Materials
$\mathcal{D} = \{(\sigma^j, y^j)\}_{j=1}^{N}$, where $\sigma^j$ is a material/molecular structure, which is an unordered set of atoms $a = (t, r)$, where t is a species type (e.g. t = Mo), and r is a location in 3-dimensional space.
A structure can be mapped to a set $V_\sigma$ whose elements $(w, v)$ are a weight w and an ordered list of atoms v called a view.
The set $V_\sigma$ is invariant under rotations, translations, and the order the atoms are given in.
21. Regression with Consistent Functions
From a function g on ordered lists of atoms, we can build a function on structures that is rotation and translation invariant by defining
$$Cg(\sigma) = \sum_{(w,v) \in V_\sigma} w\, g(v).$$
We can then attempt to minimize the least-squares error
$$\|\mathcal{D} - Cg\|^2 = \frac{1}{N} \sum_{j=1}^{N} \left( y^j - Cg(\sigma^j) \right)^2 = \frac{1}{N} \sum_{j=1}^{N} \left( y^j - \sum_{(w,v) \in V_{\sigma^j}} w\, g(v) \right)^2.$$
If
$$g([a_1, a_2, \ldots]) := g([a_1, a_2, \ldots, a_d]) = \sum_{l=1}^{r} \prod_{i=1}^{d} g_i^l(a_i),$$
then ALS can be run. Each $g_i^l$ is a function of $a = (t, r)$, so its domain is several copies of $\mathbb{R}^3$, which is tractable.
22. Conclusions, Part II
Sums of separable functions give a tractable way to represent (some)
functions of many variables.
You can do regression with them, for machine learning etc.
There are various extensions.
(There are difficulties too, which I skip.)
23. Examples: Gaussians and Radial Functions
$$a \exp\left(-b \|\mathbf{x}\|^2\right) = a \prod_{i=1}^{d} \exp\left(-b x_i^2\right)$$
If $\phi(y) \approx \sum_{l=1}^{r} a_l e^{-b_l y^2}$ for $0 \le y$, then
$$\phi(\|\mathbf{x}\|) \approx \sum_{l=1}^{r} a_l \exp\left(-b_l \sum_{i=1}^{d} x_i^2\right) = \sum_{l=1}^{r} a_l \prod_{i=1}^{d} \exp\left(-b_l x_i^2\right),$$
with rank r independent of d (but be careful about where the ≈ is used).
This construction is especially useful for Green's functions such as $1/\|r\|$.
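A sketch of how this pays off on a grid (the $a_l$, $b_l$ below are placeholder values, not a fitted Gaussian-sum approximation): the full $n^d$ grid of values of $\phi(\|\mathbf{x}\|)$ is represented by $r \cdot d \cdot n$ numbers.

```python
import numpy as np

# Placeholder Gaussian-sum parameters; in practice a_l, b_l are fitted
# so that phi(y) ≈ sum_l a_l exp(-b_l y^2) on the range of interest.
a = np.array([0.6, 0.3, 0.1])
b = np.array([0.5, 5.0, 50.0])

n, d = 100, 6
t = np.linspace(-1.0, 1.0, n)
# One (r x n) array per direction; together they encode phi(||x||)
# on the full n^d grid without ever forming it.
factors = [np.exp(-np.outer(b, t ** 2)) for _ in range(d)]

def value_at(idx):
    """Assemble phi(||x||) at the grid multi-index idx from the factors."""
    prod = a.copy()
    for i, j in enumerate(idx):
        prod = prod * factors[i][:, j]
    return prod.sum()
```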
24. Example: Linear Model
If we can write
$$\phi(t) \approx \sum_{l=1}^{r} \alpha_l \exp(\beta_l t),$$
then the linear model has
$$\phi\left(\sum_{i=1}^{d} a_i x_i + b\right) \approx \sum_{l=1}^{r} \alpha_l \exp\left(\beta_l \left(\sum_{i=1}^{d} a_i x_i + b\right)\right) = \sum_{l=1}^{r} \alpha_l \exp(\beta_l b) \prod_{i=1}^{d} \exp(\beta_l a_i x_i).$$
Properties of $\phi$ matter, but the orientation of the axes does not.
(Although if only one $a_i$ is nonzero, then r = 1.)
25. Example: Additive Model
$$f(\mathbf{x}) = \sum_{i=1}^{d} f_i(x_i) = \left.\frac{d}{dt} \prod_{i=1}^{d} \bigl(1 + t f_i(x_i)\bigr)\right|_{t=0} = \lim_{h \to 0} \frac{1}{2h}\left[ \prod_{i=1}^{d} \bigl(1 + h f_i(x_i)\bigr) - \prod_{i=1}^{d} \bigl(1 - h f_i(x_i)\bigr) \right].$$
At r = 2 the minimization problem is ill-posed.
Ill-posedness can allow useful approximations.
There can be large cancellations and ill-conditioning.
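A quick numerical look at that limit (the $f_i$ are my example choices): the rank-2 difference quotient reproduces the additive function well, but the two products are both close to 1 and nearly cancel, which previews the conditioning issue.

```python
import numpy as np

# Rank-2 finite-difference representation of f(x) = sum_i f_i(x_i).
fis = [np.sin, np.cos, np.tanh]          # example f_i, so d = 3
x = np.array([0.3, 1.1, -0.7])
h = 1e-5

exact = sum(f(xi) for f, xi in zip(fis, x))
plus = np.prod([1 + h * f(xi) for f, xi in zip(fis, x)])
minus = np.prod([1 - h * f(xi) for f, xi in zip(fis, x)])
print(exact, (plus - minus) / (2 * h))   # agree to roughly 1e-10
# plus and minus differ only in their last digits: the cancellation
# (and hence the loss of precision) grows as h shrinks.
```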
26. Example: Sine of the sum of several variables
As long as $\sin(\alpha_k - \alpha_j) \neq 0$ for all $j \neq k$,
$$\sin\left(\sum_{j=1}^{d} x_j\right) = \sum_{j=1}^{d} \sin(x_j) \prod_{k=1,\, k \neq j}^{d} \frac{\sin(x_k + \alpha_k - \alpha_j)}{\sin(\alpha_k - \alpha_j)},$$
which is rank d.
Ordinary trigonometric expansions yield $r = 2^{d-1}$.
Over the complex numbers, r = 2. The field matters.
The representation is not unique. (For generic tensors it is.)
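The identity is easy to check numerically (a quick sketch):

```python
import numpy as np

# Check the rank-d identity for sin(x_1 + ... + x_d).
d = 4
alpha = 0.7 * np.arange(1, d + 1)   # any alphas with sin(a_k - a_j) != 0
x = np.random.rand(d)

lhs = np.sin(x.sum())
rhs = 0.0
for j in range(d):
    term = np.sin(x[j])
    for k in range(d):
        if k != j:
            term *= np.sin(x[k] + alpha[k] - alpha[j]) / np.sin(alpha[k] - alpha[j])
    rhs += term
print(lhs, rhs)                      # agree to machine precision
```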
27. Example: Do not add Constraints!
If $\{g_j\}_{j=1}^{2d}$ form an orthonormal set and
$$g(\mathbf{x}) = \prod_{i=1}^{d} g_i(x_i) + \prod_{i=1}^{d} \bigl(g_i(x_i) + g_{i+d}(x_i)\bigr),$$
then an orthogonality constraint would force us to multiply out,
$$g(\mathbf{x}) = \prod_{i=1}^{d} g_i(x_i) + g_1(x_1) \prod_{i=2}^{d} \bigl(g_i(x_i) + g_{i+d}(x_i)\bigr) + g_{1+d}(x_1) \prod_{i=2}^{d} \bigl(g_i(x_i) + g_{i+d}(x_i)\bigr) = \cdots$$
and have $r = 2^d$ instead of $r = 2$.
28. Final Thoughts
There are no theorems that this approach is good,
but there are intriguing examples.
There are not many alternatives for computing in high dimensions.
(There are alternative tensor formats.)
See http://www.ohiouniversityfaculty.com/mohlenka/
for papers.
Talk with me if any of this seems useful for you.