The main machine learning algorithms are built upon various mathematical foundations such as statistics, optimization, and probability. Will this also hold true for Artificial Intelligence? In this presentation, I will showcase some recent examples of interactions between machine learning and mathematics.
Colloquium @ CEREMADE (October 3, 2023)
6. Part 1: statistical physics for machine learning
- A simple version of the Approximate Message Passing (AMP) algorithm
- Gap between information-theoretically optimal and computationally feasible estimators
- Running example: matrix model
  • connection to random matrix theory
  • sparse PCA, community detection, Z2 synchronization, submatrix localization, hidden clique...
14. AMP and its state evolution
Given a matrix W ∈ R^{n×n} and scalar functions f_t : R → R, let x^0 ∈ R^n and
x^{t+1} = W f_t(x^t) − b_t f_{t−1}(x^{t−1}) ∈ R^n,   where   b_t = (1/n) Σ_{i=1}^n f_t'(x^t_i) ∈ R.
If W ∼ GOE(n), the f_t are Lipschitz and the components of x^0 are i.i.d. ∼ X_0 with E[X_0^2] = 1, then for any nice test function Ψ : R^t → R,
(1/n) Σ_{i=1}^n Ψ(x^1_i, ..., x^t_i) → E[Ψ(Z_1, ..., Z_t)],
where (Z_1, ..., Z_t) =^d (σ_1 G_1, ..., σ_t G_t), with G_s ∼ N(0, 1) i.i.d.
(Bayati, Montanari '11)
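For illustration (not part of the slides), here is a minimal NumPy sketch of the iteration above, assuming f_t = tanh for every t and i.i.d. N(0, 1) entries for x^0, together with a Monte-Carlo check of the standard state-evolution recursion σ_{t+1}^2 = E[f_t(σ_t G)^2], G ∼ N(0, 1), whose first step mirrors the sanity check on the next slide:

import numpy as np

rng = np.random.default_rng(0)
n, T = 3000, 6

A = rng.normal(size=(n, n)) / np.sqrt(n)
W = (A + A.T) / np.sqrt(2)            # W ~ GOE(n): symmetric, off-diagonal variance 1/n

f = np.tanh                           # illustrative Lipschitz choice for all f_t
df = lambda x: 1.0 - np.tanh(x) ** 2  # derivative f_t'

x = rng.normal(size=n)                # x^0 with i.i.d. entries, E[X_0^2] = 1
f_prev = np.zeros(n)                  # convention: f_{-1}(x^{-1}) = 0
sigma2 = 1.0                          # state-evolution variance sigma_t^2 (starts at E[X_0^2])

for t in range(T):
    b = df(x).mean()                  # Onsager coefficient b_t = (1/n) sum_i f_t'(x_i^t)
    x, f_prev = W @ f(x) - b * f_prev, f(x)        # x^{t+1} = W f_t(x^t) - b_t f_{t-1}(x^{t-1})
    g = rng.normal(size=500_000)
    sigma2 = np.mean(f(np.sqrt(sigma2) * g) ** 2)  # sigma_{t+1}^2 = E[f_t(sigma_t G)^2]
    print(t + 1, np.var(x), sigma2)   # empirical variance of x^{t+1} vs. its prediction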
15. Sanity check
We have x^1 = W f_0(x^0), so that
x^1_i = Σ_j W_ij f_0(x^0_j),
where W_ij ∼ N(0, 1/n) i.i.d. (ignoring diagonal terms).
Hence x^1 is a centred Gaussian vector whose entries have variance
(1/n) Σ_j f_0(x^0_j)^2 ≈ E[f_0(X_0)^2] = σ_1^2.
18. AMP proof of Wigner's semicircle law
Consider AMP with linear functions f_t(x) = x, so that
x^1 = W x^0
x^2 = W x^1 − x^0 = (W^2 − Id) x^0
x^3 = W x^2 − x^1 = (W^3 − 2W) x^0,
so x^t = P_t(W) x^0 with
P_0(x) = 1,   P_1(x) = x,   P_{t+1}(x) = x P_t(x) − P_{t−1}(x).
The {P_t} are Chebyshev polynomials, orthonormal w.r.t. the semicircle density µ_SC(x) = (1/2π) √((4 − x^2)_+).
When (1/n) ‖x^0‖^2 = 1, we have (1/n) ⟨x^s, x^t⟩ ≈ tr P_s(W) P_t(W), where tr denotes the normalized trace (1/n) Tr.
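A quick numerical check of the orthonormality claim (illustrative, not on the slides): build the P_t from the recursion and integrate P_s P_t against µ_SC on [−2, 2]; the resulting Gram matrix should be close to the identity.

import numpy as np
from numpy.polynomial import polynomial as P

def cheb(t):
    # coefficients of P_t in the monomial basis, from P_{t+1} = x P_t - P_{t-1}
    p_prev, p = np.array([1.0]), np.array([0.0, 1.0])    # P_0 = 1, P_1 = x
    if t == 0:
        return p_prev
    for _ in range(t - 1):
        p_prev, p = p, P.polysub(P.polymulx(p), p_prev)
    return p

xs = np.linspace(-2, 2, 200_001)
mu_sc = np.sqrt(np.clip(4 - xs**2, 0, None)) / (2 * np.pi)   # semicircle density
for s in range(4):
    row = [np.trapz(P.polyval(xs, cheb(s)) * P.polyval(xs, cheb(t)) * mu_sc, xs)
           for t in range(4)]
    print(np.round(row, 3))                                  # rows of the identity matrix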
21. AMP proof of Wigner's semicircle law
x^{t+1} = W x^t − x^{t−1}
In this case, AMP state evolution gives
(1/n) ⟨x^s, x^t⟩ → E[Z_s Z_t] = 1(s = t).
Since (1/n) ⟨x^s, x^t⟩ ≈ tr P_s(W) P_t(W), the polynomials P_t are orthonormal w.r.t. the limiting empirical spectral distribution of W, which must therefore be µ_SC.
Credit: Zhou Fan.
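The same conclusion can be checked directly on a sampled GOE matrix (a sketch, assuming ±1 entries for x^0 so that ‖x^0‖^2/n = 1): run the linear iteration and look at the Gram matrix of the iterates.

import numpy as np

rng = np.random.default_rng(1)
n, T = 3000, 6
A = rng.normal(size=(n, n)) / np.sqrt(n)
W = (A + A.T) / np.sqrt(2)                    # W ~ GOE(n)

x0 = rng.choice([-1.0, 1.0], size=n)          # ||x^0||^2 / n = 1
iters = [x0, W @ x0]                          # x^0, x^1
for _ in range(T - 1):
    iters.append(W @ iters[-1] - iters[-2])   # x^{t+1} = W x^t - x^{t-1}

G = np.array([[xs @ xt / n for xt in iters[1:]] for xs in iters[1:]])
print(np.round(G, 2))                         # approximately the identity matrix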
28. Explaining the Onsager term
x^{t+1} = W x^t − x^{t−1}
The first iteration with an Onsager term appears for t = 2.
Then we have x^2 = W x^1 − x^0 = W^2 x^0 − x^0, so that
x^2_1 = Σ_i W_{1i}^2 x^0_1 + Σ_{i, j≠1} W_{1i} W_{ij} x^0_j − x^0_1,
where the double sum is approximately N(0, 1). Since Σ_i W_{1i}^2 ≈ 1, the Onsager term −x^0_1 cancels the bias Σ_i W_{1i}^2 x^0_1, leaving an approximately Gaussian iterate.
The Onsager term is very similar to the Itô correction in stochastic calculus.
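A small numerical illustration of this cancellation, assuming ±1 entries for x^0 (not specified on the slides): the empirical correlation of W^2 x^0 with x^0 is close to 1, reflecting the Σ_i W_{1i}^2 ≈ 1 bias, while after subtracting the Onsager term x^0 it is close to 0.

import numpy as np

rng = np.random.default_rng(2)
n = 3000
A = rng.normal(size=(n, n)) / np.sqrt(n)
W = (A + A.T) / np.sqrt(2)                    # W ~ GOE(n)
x0 = rng.choice([-1.0, 1.0], size=n)

without = W @ (W @ x0)                        # W^2 x^0: biased towards x^0
with_onsager = without - x0                   # x^2 = W^2 x^0 - x^0: bias removed

print((without @ x0) / n, (with_onsager @ x0) / n)   # ~1 and ~0 respectively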
29. Part 1: statistical physics for machine learning
- A simple version of the AMP algorithm
- Gap between information-theoretically optimal and computationally feasible estimators
- Running example: matrix model
  • connection to random matrix theory
  • sparse PCA, community detection, Z2 synchronization, submatrix localization, hidden clique...
30. Low-rank matrix estimation
"Spiked Wigner" model:
Y = √(λ/n) X Xᵀ + Z,
where Y is the observation, √(λ/n) X Xᵀ is the signal and Z is the noise, with
• X: vector of dimension n with entries X_i i.i.d. ∼ P_0, E[X_1] = 0, E[X_1^2] = 1;
• Z_{i,j} = Z_{j,i} i.i.d. ∼ N(0, 1);
• λ: signal-to-noise ratio;
• λ and P_0 known by the statistician.
Goal: recover the low-rank matrix X Xᵀ from Y.
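For concreteness, one instance of the model can be generated as follows; the Rademacher prior P_0 = Unif{−1, +1} and the convention for the diagonal of Z are illustrative assumptions.

import numpy as np

def spiked_wigner(n, lam, rng):
    X = rng.choice([-1.0, 1.0], size=n)          # signal: E[X_1] = 0, E[X_1^2] = 1
    Z = rng.normal(size=(n, n))
    Z = (Z + Z.T) / np.sqrt(2)                   # symmetric noise, Z_ij = Z_ji ~ N(0, 1) off-diagonal
    Y = np.sqrt(lam / n) * np.outer(X, X) + Z    # observations
    return Y, X

Y, X = spiked_wigner(2000, 2.0, np.random.default_rng(3))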
32. Principal component analysis (PCA)
Spectral estimator: estimate X using the eigenvector x̂_n associated with the largest eigenvalue µ_n of Y/√n.
B.B.P. phase transition:
• if λ ≤ 1:  µ_n → 2 and X · x̂_n → 0 almost surely as n → ∞;
• if λ > 1:  µ_n → √λ + 1/√λ > 2 and |X · x̂_n| → √(1 − 1/λ) > 0 almost surely as n → ∞.
(Baik, Ben Arous, Péché '05)
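A short simulation (again with a Rademacher signal, an illustrative assumption) makes the transition visible: below λ = 1 the top eigenvalue of Y/√n stays near 2 and the overlap is negligible, while above λ = 1 both follow the B.B.P. formulas.

import numpy as np

rng = np.random.default_rng(3)
n = 2000
for lam in [0.5, 1.0, 2.0, 4.0]:
    X = rng.choice([-1.0, 1.0], size=n)
    Z = rng.normal(size=(n, n)); Z = (Z + Z.T) / np.sqrt(2)
    Y = np.sqrt(lam / n) * np.outer(X, X) + Z
    evals, evecs = np.linalg.eigh(Y / np.sqrt(n))
    mu_n = evals[-1]                             # largest eigenvalue of Y / sqrt(n)
    overlap2 = (evecs[:, -1] @ X) ** 2 / n       # |<x_hat_n, X>|^2 / ||X||^2
    # predictions: mu_n -> 2 if lam <= 1, else sqrt(lam) + 1/sqrt(lam);
    # overlap2 -> 0 if lam <= 1, else 1 - 1/lam
    print(lam, round(mu_n, 3), round(overlap2, 3))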
35. Questions
• PCA fails when λ ≤ 1, but is it still possible to recover the signal?
• When λ > 1, is PCA optimal?
• More generally, what is the best achievable estimation performance in both regimes?
39. A scalar denoising problem
Consider the scalar observation Y = √γ X_0 + Z, where X_0 ∼ P_0 and Z ∼ N(0, 1). The best estimator of X_0 in mean-square error is the posterior mean E[X_0 | Y].
40. Bayes-optimal AMP
We define mmse(γ) = E[(X_0 − E[X_0 | √γ X_0 + Z])^2] and the recursion
q_0 = 1 − λ^{−1},   q_{t+1} = 1 − mmse(λ q_t).
With the optimal denoiser g_{P_0}(y, γ) = E[X_0 | √γ X_0 + Z = y], AMP is defined by
x^{t+1} = √(λ/n) Y f_t(x^t) − λ b_t f_{t−1}(x^{t−1}),
where f_t(y) = g_{P_0}(y / √(λ q_t), λ q_t).
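As a sketch of how the recursion behaves, take the illustrative Rademacher prior P_0 = Unif{−1, +1}: the posterior mean is E[X_0 | √γ X_0 + Z = y] = tanh(√γ y), and by symmetry mmse(γ) = 1 − E[tanh(γ + √γ Z)^2]. The recursion can then be iterated by Monte-Carlo:

import numpy as np

rng = np.random.default_rng(4)
Z = rng.normal(size=1_000_000)             # Monte-Carlo sample for the Gaussian expectation

def mmse(g):
    # mmse(g) for the Rademacher prior: 1 - E[ tanh(g + sqrt(g) Z)^2 ]
    return 1.0 - np.mean(np.tanh(g + np.sqrt(g) * Z) ** 2)

lam = 2.0                                  # any lambda > 1 so that q_0 = 1 - 1/lambda > 0
q = 1.0 - 1.0 / lam                        # q_0 = 1 - lambda^{-1}
for _ in range(25):
    q = 1.0 - mmse(lam * q)                # q_{t+1} = 1 - mmse(lambda q_t)
print(q)                                   # limiting value of q_t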
46. Proof ideas: a planted spin system
P(X = x | Y) = (1/Z_n) P_0(x) e^{H_n(x)},
where
H_n(x) = Σ_{i≤j} [ √(λ/n) Y_{i,j} x_i x_j − (λ/2n) x_i^2 x_j^2 ].
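For completeness, this expression follows from Bayes' rule by expanding the Gaussian likelihood of Y given X = x (up to the convention for the diagonal entries; terms that do not depend on x are absorbed into the normalization Z_n):

\[
\mathbb{P}(X = x \mid Y)
\;\propto\; P_0(x) \prod_{i \le j} \exp\!\Big(-\tfrac12 \big(Y_{i,j} - \sqrt{\tfrac{\lambda}{n}}\, x_i x_j\big)^2\Big)
\;\propto\; P_0(x) \exp\!\Big(\sum_{i \le j} \sqrt{\tfrac{\lambda}{n}}\, Y_{i,j}\, x_i x_j \;-\; \tfrac{\lambda}{2n}\, x_i^2 x_j^2\Big).
\]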
Two-step proof:
• Lower bound: Guerra's interpolation technique, adapted in (Korada, Macris '09), (Krzakala, Xu, Zdeborová '16):
  Y = √t · √(λ/n) X Xᵀ + Z,    Y' = √(1 − t) · √λ X + Z'.
• Upper bound: cavity computations (Mézard, Parisi, Virasoro '87); Aizenman-Sims-Starr scheme (Aizenman, Sims, Starr '03), (Talagrand '10).
48. Part 1: conclusion
AMP is an iterative denoising algorithm which is optimal when the energy landscape is simple.
Main references for this tutorial: (Montanari, Venkataramanan '21), (L. Miolane '19).
Many recent research directions: universality, structured matrices, community detection... and new applications outside electrical engineering, for instance in ecology.
Deep learning, the new kid on the block:
51. From stochastic localization to sampling thanks to AMP
Target distribution µ. Diffusion process:
y_t = t x* + B_t,   with (x* ∼ µ) ⊥⊥ B,
µ_t(·) = P(x* ∈ · | y_t),   µ_0 = µ → µ_∞ = δ_{x*}.
There exists a Brownian motion G such that y_t solves the SDE
dy_t = m_t(y_t) dt + dG_t,   where m_t(y) = E[x* | y_t = y].
Idea: use AMP for sampling (El Alaoui, Montanari, Sellke '22), (Montanari, Wu '23).
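To see the mechanism in the simplest possible case (a toy example, not from the slides), take the scalar target µ = ½ δ_{+1} + ½ δ_{−1}: a direct computation gives m_t(y) = E[x* | y_t = y] = tanh(y), and an Euler-Maruyama discretisation of the SDE produces ±1 samples with the right frequencies. In the high-dimensional spiked model, m_t has no closed form, and approximating it with AMP is exactly the idea of the references above.

import numpy as np

rng = np.random.default_rng(5)
dt, T, m = 1e-3, 20.0, 2000
y = np.zeros(m)                                # m independent runs of the localization SDE
for _ in range(int(T / dt)):
    # dy_t = m_t(y_t) dt + dG_t, with m_t(y) = tanh(y) for the two-point target
    y += np.tanh(y) * dt + np.sqrt(dt) * rng.normal(size=m)
print(np.mean(np.sign(y) == 1.0))              # ~0.5: each run localizes on +1 or -1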
57. Lessons learned from AI winters: Common Task Framework (CTF)
Performance Assessment of Automatic Speech Recognizers (Pallett '85):
"Definitive tests to fully characterize automatic speech recognizer or system performance cannot be specified at present. However, it is possible to design and conduct performance assessment tests that make use of widely available speech data bases, use test procedures similar to those used by others, and that are well documented. These tests provide valuable benchmark data and informative, though limited, predictive power."
60. The Bitter Lesson by Rich Sutton
The biggest lesson that can be read from 70 years of AI research is
that general methods that leverage computation are
ultimately the most effective, and by a large margin (...)
Seeking an improvement that makes a difference in the shorter
term, researchers seek to leverage their human knowledge of the
domain, but the only thing that matters in the long run is the
leveraging of computation (...) the human-knowledge approach
tends to complicate methods in ways that make them less suited to
taking advantage of general methods leveraging computation.
62. Is human-led mathematics over?
If it turns out that some Langlands-like questions can be answered
with the use of computation, there is always the possibility that
the mathematical community will interpret this as a demonstration
that, in hindsight, the Langlands program is not as deep as we
thought it was. There is always room to say, “Aha! Now we see
that it is just a matter of computation.” (Avigad ’22)