The main machine learning algorithms are built upon various mathematical foundations such as statistics, optimization, and probability. Will this also hold true for Artificial Intelligence? In this presentation, I will showcase some recent examples of interactions between machine learning and mathematics.
Colloquium @ CEREMADE (October 3, 2023)
6. Part 1: statistical physics for machine learning
- A simple version of the Approximate Message Passing (AMP) algorithm
- Gap between information-theoretically optimal and computationally feasible estimators
- Running example: matrix model
  • connection to random matrix theory
  • sparse PCA, community detection, Z2 synchronization, submatrix localization, hidden clique...
14. AMP and its state evolution
Given a matrix W ∈ R^{n×n} and scalar functions f_t : R → R, let x^0 ∈ R^n and
x^{t+1} = W f_t(x^t) − b_t f_{t−1}(x^{t−1}) ∈ R^n,   where   b_t = (1/n) Σ_{i=1}^n f_t'(x^t_i) ∈ R.
If W ∼ GOE(n), the f_t are Lipschitz and the components of x^0 are i.i.d. ∼ X_0 with E[X_0^2] = 1, then for any nice test function Ψ : R^t → R,
(1/n) Σ_{i=1}^n Ψ(x^1_i, ..., x^t_i) → E[Ψ(Z_1, ..., Z_t)],
where (Z_1, ..., Z_t) =^d (σ_1 G_1, ..., σ_t G_t), with G_s ∼ N(0, 1) i.i.d.
(Bayati, Montanari '11)
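For illustration (not part of the slides), here is a minimal NumPy sketch of the iteration above, assuming f_t = tanh for every t and i.i.d. N(0, 1) entries for x^0, together with a Monte-Carlo check of the standard state-evolution recursion σ_{t+1}^2 = E[f_t(σ_t G)^2], G ∼ N(0, 1), whose first step mirrors the sanity check on the next slide:

import numpy as np

rng = np.random.default_rng(0)
n, T = 3000, 6

A = rng.normal(size=(n, n)) / np.sqrt(n)
W = (A + A.T) / np.sqrt(2)            # W ~ GOE(n): symmetric, off-diagonal variance 1/n

f = np.tanh                           # illustrative Lipschitz choice for all f_t
df = lambda x: 1.0 - np.tanh(x) ** 2  # derivative f_t'

x = rng.normal(size=n)                # x^0 with i.i.d. entries, E[X_0^2] = 1
f_prev = np.zeros(n)                  # convention: f_{-1}(x^{-1}) = 0
sigma2 = 1.0                          # state-evolution variance sigma_t^2 (starts at E[X_0^2])

for t in range(T):
    b = df(x).mean()                  # Onsager coefficient b_t = (1/n) sum_i f_t'(x_i^t)
    x, f_prev = W @ f(x) - b * f_prev, f(x)        # x^{t+1} = W f_t(x^t) - b_t f_{t-1}(x^{t-1})
    g = rng.normal(size=500_000)
    sigma2 = np.mean(f(np.sqrt(sigma2) * g) ** 2)  # sigma_{t+1}^2 = E[f_t(sigma_t G)^2]
    print(t + 1, np.var(x), sigma2)   # empirical variance of x^{t+1} vs. its prediction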
15. Sanity check
We have x^1 = W f_0(x^0), so that
x^1_i = Σ_j W_ij f_0(x^0_j),
where W_ij ∼ N(0, 1/n) i.i.d. (ignoring diagonal terms).
Hence x^1 is a centred Gaussian vector whose entries have variance
(1/n) Σ_j f_0(x^0_j)^2 ≈ E[f_0(X_0)^2] = σ_1^2.
18. AMP proof of Wigner's semicircle law
Consider AMP with linear functions f_t(x) = x, so that
x^1 = W x^0
x^2 = W x^1 − x^0 = (W^2 − Id) x^0
x^3 = W x^2 − x^1 = (W^3 − 2W) x^0,
so x^t = P_t(W) x^0 with
P_0(x) = 1,   P_1(x) = x,   P_{t+1}(x) = x P_t(x) − P_{t−1}(x).
The {P_t} are Chebyshev polynomials, orthonormal w.r.t. the semicircle density µ_SC(x) = (1/2π) √((4 − x^2)_+).
When (1/n) ‖x^0‖^2 = 1, we have (1/n) ⟨x^s, x^t⟩ ≈ tr P_s(W) P_t(W), where tr denotes the normalized trace (1/n) Tr.
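A quick numerical check of the orthonormality claim (illustrative, not on the slides): build the P_t from the recursion and integrate P_s P_t against µ_SC on [−2, 2]; the resulting Gram matrix should be close to the identity.

import numpy as np
from numpy.polynomial import polynomial as P

def cheb(t):
    # coefficients of P_t in the monomial basis, from P_{t+1} = x P_t - P_{t-1}
    p_prev, p = np.array([1.0]), np.array([0.0, 1.0])    # P_0 = 1, P_1 = x
    if t == 0:
        return p_prev
    for _ in range(t - 1):
        p_prev, p = p, P.polysub(P.polymulx(p), p_prev)
    return p

xs = np.linspace(-2, 2, 200_001)
mu_sc = np.sqrt(np.clip(4 - xs**2, 0, None)) / (2 * np.pi)   # semicircle density
for s in range(4):
    row = [np.trapz(P.polyval(xs, cheb(s)) * P.polyval(xs, cheb(t)) * mu_sc, xs)
           for t in range(4)]
    print(np.round(row, 3))                                  # rows of the identity matrix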
21. AMP proof of Wigner's semicircle law
x^{t+1} = W x^t − x^{t−1}
In this case, AMP state evolution gives
(1/n) ⟨x^s, x^t⟩ → E[Z_s Z_t] = 1(s = t).
Since (1/n) ⟨x^s, x^t⟩ ≈ tr P_s(W) P_t(W), the polynomials P_t are orthonormal w.r.t. the limiting empirical spectral distribution of W, which must therefore be µ_SC.
Credit: Zhou Fan.
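The same conclusion can be checked directly on a sampled GOE matrix (a sketch, assuming ±1 entries for x^0 so that ‖x^0‖^2/n = 1): run the linear iteration and look at the Gram matrix of the iterates.

import numpy as np

rng = np.random.default_rng(1)
n, T = 3000, 6
A = rng.normal(size=(n, n)) / np.sqrt(n)
W = (A + A.T) / np.sqrt(2)                    # W ~ GOE(n)

x0 = rng.choice([-1.0, 1.0], size=n)          # ||x^0||^2 / n = 1
iters = [x0, W @ x0]                          # x^0, x^1
for _ in range(T - 1):
    iters.append(W @ iters[-1] - iters[-2])   # x^{t+1} = W x^t - x^{t-1}

G = np.array([[xs @ xt / n for xt in iters[1:]] for xs in iters[1:]])
print(np.round(G, 2))                         # approximately the identity matrix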
28. Explaining the Onsager term
x^{t+1} = W x^t − x^{t−1}
The first iteration with an Onsager term appears for t = 2.
Then we have x^2 = W x^1 − x^0 = W^2 x^0 − x^0, so that
x^2_1 = Σ_i W_{1i}^2 x^0_1 + Σ_{i, j≠1} W_{1i} W_{ij} x^0_j − x^0_1,
where the double sum is approximately N(0, 1). Since Σ_i W_{1i}^2 ≈ 1, the Onsager term −x^0_1 cancels the bias Σ_i W_{1i}^2 x^0_1, leaving an approximately Gaussian iterate.
The Onsager term is very similar to the Itô correction in stochastic calculus.
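A small numerical illustration of this cancellation, assuming ±1 entries for x^0 (not specified on the slides): the empirical correlation of W^2 x^0 with x^0 is close to 1, reflecting the Σ_i W_{1i}^2 ≈ 1 bias, while after subtracting the Onsager term x^0 it is close to 0.

import numpy as np

rng = np.random.default_rng(2)
n = 3000
A = rng.normal(size=(n, n)) / np.sqrt(n)
W = (A + A.T) / np.sqrt(2)                    # W ~ GOE(n)
x0 = rng.choice([-1.0, 1.0], size=n)

without = W @ (W @ x0)                        # W^2 x^0: biased towards x^0
with_onsager = without - x0                   # x^2 = W^2 x^0 - x^0: bias removed

print((without @ x0) / n, (with_onsager @ x0) / n)   # ~1 and ~0 respectively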
29. Part 1: statistical physics for machine learning
- A simple version of the AMP algorithm
- Gap between information-theoretically optimal and computationally feasible estimators
- Running example: matrix model
  • connection to random matrix theory
  • sparse PCA, community detection, Z2 synchronization, submatrix localization, hidden clique...
30. Low-rank matrix estimation
"Spiked Wigner" model:
Y = √(λ/n) X Xᵀ + Z,
where Y is the observation, √(λ/n) X Xᵀ is the signal and Z is the noise, with
• X: vector of dimension n with entries X_i i.i.d. ∼ P_0, E[X_1] = 0, E[X_1^2] = 1;
• Z_{i,j} = Z_{j,i} i.i.d. ∼ N(0, 1);
• λ: signal-to-noise ratio;
• λ and P_0 known by the statistician.
Goal: recover the low-rank matrix X Xᵀ from Y.
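For concreteness, one instance of the model can be generated as follows; the Rademacher prior P_0 = Unif{−1, +1} and the convention for the diagonal of Z are illustrative assumptions.

import numpy as np

def spiked_wigner(n, lam, rng):
    X = rng.choice([-1.0, 1.0], size=n)          # signal: E[X_1] = 0, E[X_1^2] = 1
    Z = rng.normal(size=(n, n))
    Z = (Z + Z.T) / np.sqrt(2)                   # symmetric noise, Z_ij = Z_ji ~ N(0, 1) off-diagonal
    Y = np.sqrt(lam / n) * np.outer(X, X) + Z    # observations
    return Y, X

Y, X = spiked_wigner(2000, 2.0, np.random.default_rng(3))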
32. Principal component analysis (PCA)
Spectral estimator: estimate X using the eigenvector x̂_n associated with the largest eigenvalue µ_n of Y/√n.
B.B.P. phase transition:
• if λ ≤ 1:  µ_n → 2 and X · x̂_n → 0 almost surely as n → ∞;
• if λ > 1:  µ_n → √λ + 1/√λ > 2 and |X · x̂_n| → √(1 − 1/λ) > 0 almost surely as n → ∞.
(Baik, Ben Arous, Péché '05)
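A short simulation (again with a Rademacher signal, an illustrative assumption) makes the transition visible: below λ = 1 the top eigenvalue of Y/√n stays near 2 and the overlap is negligible, while above λ = 1 both follow the B.B.P. formulas.

import numpy as np

rng = np.random.default_rng(3)
n = 2000
for lam in [0.5, 1.0, 2.0, 4.0]:
    X = rng.choice([-1.0, 1.0], size=n)
    Z = rng.normal(size=(n, n)); Z = (Z + Z.T) / np.sqrt(2)
    Y = np.sqrt(lam / n) * np.outer(X, X) + Z
    evals, evecs = np.linalg.eigh(Y / np.sqrt(n))
    mu_n = evals[-1]                             # largest eigenvalue of Y / sqrt(n)
    overlap2 = (evecs[:, -1] @ X) ** 2 / n       # |<x_hat_n, X>|^2 / ||X||^2
    # predictions: mu_n -> 2 if lam <= 1, else sqrt(lam) + 1/sqrt(lam);
    # overlap2 -> 0 if lam <= 1, else 1 - 1/lam
    print(lam, round(mu_n, 3), round(overlap2, 3))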
35. Questions
• PCA fails when λ ≤ 1, but is it still possible to recover the signal?
• When λ > 1, is PCA optimal?
• More generally, what is the best achievable estimation performance in both regimes?
39. A scalar denoising problem
Consider the scalar observation Y = √γ X_0 + Z, where X_0 ∼ P_0 and Z ∼ N(0, 1). The best estimator of X_0 in mean-square error is the posterior mean E[X_0 | Y].
40. Bayes-optimal AMP
We define mmse(γ) = E[(X_0 − E[X_0 | √γ X_0 + Z])^2] and the recursion
q_0 = 1 − λ^{−1},   q_{t+1} = 1 − mmse(λ q_t).
With the optimal denoiser g_{P_0}(y, γ) = E[X_0 | √γ X_0 + Z = y], AMP is defined by
x^{t+1} = √(λ/n) Y f_t(x^t) − λ b_t f_{t−1}(x^{t−1}),
where f_t(y) = g_{P_0}(y / √(λ q_t), λ q_t).
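As a sketch of how the recursion behaves, take the illustrative Rademacher prior P_0 = Unif{−1, +1}: the posterior mean is E[X_0 | √γ X_0 + Z = y] = tanh(√γ y), and by symmetry mmse(γ) = 1 − E[tanh(γ + √γ Z)^2]. The recursion can then be iterated by Monte-Carlo:

import numpy as np

rng = np.random.default_rng(4)
Z = rng.normal(size=1_000_000)             # Monte-Carlo sample for the Gaussian expectation

def mmse(g):
    # mmse(g) for the Rademacher prior: 1 - E[ tanh(g + sqrt(g) Z)^2 ]
    return 1.0 - np.mean(np.tanh(g + np.sqrt(g) * Z) ** 2)

lam = 2.0                                  # any lambda > 1 so that q_0 = 1 - 1/lambda > 0
q = 1.0 - 1.0 / lam                        # q_0 = 1 - lambda^{-1}
for _ in range(25):
    q = 1.0 - mmse(lam * q)                # q_{t+1} = 1 - mmse(lambda q_t)
print(q)                                   # limiting value of q_t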
46. Proof ideas: a planted spin system
P(X = x | Y) = (1/Z_n) P_0(x) e^{H_n(x)},
where
H_n(x) = Σ_{i≤j} [ √(λ/n) Y_{i,j} x_i x_j − (λ/2n) x_i^2 x_j^2 ].
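For completeness, this expression follows from Bayes' rule by expanding the Gaussian likelihood of Y given X = x (up to the convention for the diagonal entries; terms that do not depend on x are absorbed into the normalization Z_n):

\[
\mathbb{P}(X = x \mid Y)
\;\propto\; P_0(x) \prod_{i \le j} \exp\!\Big(-\tfrac12 \big(Y_{i,j} - \sqrt{\tfrac{\lambda}{n}}\, x_i x_j\big)^2\Big)
\;\propto\; P_0(x) \exp\!\Big(\sum_{i \le j} \sqrt{\tfrac{\lambda}{n}}\, Y_{i,j}\, x_i x_j \;-\; \tfrac{\lambda}{2n}\, x_i^2 x_j^2\Big).
\]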
Two-step proof:
• Lower bound: Guerra's interpolation technique, adapted in (Korada, Macris '09), (Krzakala, Xu, Zdeborová '16):
  Y = √t · √(λ/n) X Xᵀ + Z,    Y' = √(1 − t) · √λ X + Z'.
• Upper bound: cavity computations (Mézard, Parisi, Virasoro '87); Aizenman-Sims-Starr scheme (Aizenman, Sims, Starr '03), (Talagrand '10).
48. Part 1: conclusion
AMP is an iterative denoising algorithm which is optimal when the energy landscape is simple.
Main references for this tutorial: (Montanari, Venkataramanan '21), (L. Miolane '19).
Many recent research directions: universality, structured matrices, community detection... and new applications outside electrical engineering, for instance in ecology.
Deep learning, the new kid on the block:
51. From stochastic localization to sampling thanks to AMP
Target distribution µ. Diffusion process:
y_t = t x* + B_t,   with (x* ∼ µ) ⊥⊥ B,
µ_t(·) = P(x* ∈ · | y_t),   µ_0 = µ → µ_∞ = δ_{x*}.
There exists a Brownian motion G such that y_t solves the SDE
dy_t = m_t(y_t) dt + dG_t,   where m_t(y) = E[x* | y_t = y].
Idea: use AMP for sampling (El Alaoui, Montanari, Sellke '22), (Montanari, Wu '23).
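To see the mechanism in the simplest possible case (a toy example, not from the slides), take the scalar target µ = ½ δ_{+1} + ½ δ_{−1}: a direct computation gives m_t(y) = E[x* | y_t = y] = tanh(y), and an Euler-Maruyama discretisation of the SDE produces ±1 samples with the right frequencies. In the high-dimensional spiked model, m_t has no closed form, and approximating it with AMP is exactly the idea of the references above.

import numpy as np

rng = np.random.default_rng(5)
dt, T, m = 1e-3, 20.0, 2000
y = np.zeros(m)                                # m independent runs of the localization SDE
for _ in range(int(T / dt)):
    # dy_t = m_t(y_t) dt + dG_t, with m_t(y) = tanh(y) for the two-point target
    y += np.tanh(y) * dt + np.sqrt(dt) * rng.normal(size=m)
print(np.mean(np.sign(y) == 1.0))              # ~0.5: each run localizes on +1 or -1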
57. Lessons learned from AI winters: Common Task Framework (CTF)
Performance Assessment of Automatic Speech Recognizers (Pallett '85):
"Definitive tests to fully characterize automatic speech recognizer or system performance cannot be specified at present. However, it is possible to design and conduct performance assessment tests that make use of widely available speech data bases, use test procedures similar to those used by others, and that are well documented. These tests provide valuable benchmark data and informative, though limited, predictive power."
60. The Bitter Lesson by Rich Sutton
The biggest lesson that can be read from 70 years of AI research is
that general methods that leverage computation are
ultimately the most effective, and by a large margin (...)
Seeking an improvement that makes a difference in the shorter
term, researchers seek to leverage their human knowledge of the
domain, but the only thing that matters in the long run is the
leveraging of computation (...) the human-knowledge approach
tends to complicate methods in ways that make them less suited to
taking advantage of general methods leveraging computation.
62. Is human-led mathematics over?
If it turns out that some Langlands-like questions can be answered
with the use of computation, there is always the possibility that
the mathematical community will interpret this as a demonstration
that, in hindsight, the Langlands program is not as deep as we
thought it was. There is always room to say, “Aha! Now we see
that it is just a matter of computation.” (Avigad ’22)