“Statistical Physics Studies of Machine Learning Problems” by Lenka Zdeborová, Researcher @ CNRS
Abstract: We discuss insights into the following questions: What makes problems studied in machine learning and in statistical physics related? How can this relation be used to better understand the performance and limitations of machine learning systems? What happens when a phase transition is found in a computational problem? How do phase transitions influence algorithmic hardness?
2. A long history of physics influencing machine learning.
Examples:
Gibbs-Bogoliubov-Feynman, 1960s: the physics behind variational inference.
Hopfield model ’82. Spin-glass models of neural networks: Amit, Gutfreund, Sompolinsky ’85.
Boltzmann machine: Hinton, Sejnowski ’86, named after the Boltzmann distribution.
Gardner ’87: maximum storage capacity of neural networks (related to the VC dimension).
SVMs by Boser, Guyon, Vapnik ’92, inspired by Krauth, Mézard ’87.
Many papers on neural networks in the physics literature in the ’80s and ’90s.
PHYSICS IN MACHINE LEARNING
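For context on the first item in the list above, the Gibbs-Bogoliubov-Feynman inequality bounds the true free energy F of a Hamiltonian H by that of any tractable trial Hamiltonian H_0 (notation here is the standard one, not from the slides):

```latex
F \;\le\; F_0 + \langle H - H_0 \rangle_0
```

Minimising the right-hand side over a tractable family of H_0 is the same structure as variational inference, where the variational free energy is minus the evidence lower bound (ELBO).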
5. THE PUZZLE OF GENERALIZATION
According to PAC bounds (via VC dimension or Rademacher complexity), neural networks that generalize well should not be able to fit random labels; yet experiments show they can (ICLR’16).
6. THEORETICAL QUESTIONS IN DEEP LEARNING
Why the lack of overfitting?
“More parameters = more overfitting” does not seem to hold in deep learning.
7. SAMPLE COMPLEXITY
CIFAR-10: 50,000 training samples. How many samples are really needed?
How low is the optimal sample complexity? Are we achieving it?
If not, is it because of architectures or algorithms?
9. THEORETICAL-PHYSICS ROADMAP
1. Experimental observation or fundamental hypothesis.
2. An unreasonably simple model for which the toughest questions can be understood mathematically.
3. Generalize to more realistic models; this relies on universality (= important laws of nature rarely depend on many details).
10. MODELS
H = -J \sum_{(ij) \in E} S_i S_j, \qquad P(\{S_i\}_{i=1,\dots,N}) = \frac{e^{-\beta H}}{Z}

The Ising model of the magnetism of materials.
In data science, models are used to fit the data (e.g. linear regression: what is the best straight line that captures the dependence of y on x?). In physics, models are the main tool for understanding.
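As a minimal illustration (not from the slides), the Boltzmann distribution P ∝ e^{-βH} of such an Ising Hamiltonian can be sampled with Metropolis dynamics; this sketch assumes a 1D ring of spins and ferromagnetic coupling J = 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(S, J=1.0):
    """H = -J * sum over nearest-neighbour bonds on a 1D ring (each bond once)."""
    return -J * float(np.sum(S * np.roll(S, 1)))

def metropolis_step(S, beta, J=1.0):
    """Flip a random spin with probability min(1, e^{-beta * dE})."""
    N = len(S)
    i = rng.integers(N)
    # energy change from flipping spin i (its two ring neighbours)
    dE = 2.0 * J * S[i] * (S[(i - 1) % N] + S[(i + 1) % N])
    if dE <= 0 or rng.random() < np.exp(-beta * dE):
        S[i] = -S[i]
    return S

S = rng.choice([-1, 1], size=64)   # random initial configuration
for _ in range(20_000):
    metropolis_step(S, beta=1.5)   # low temperature: spins tend to align
```

At low temperature the chain drifts toward low-energy (aligned) configurations; at high temperature it samples near-random spins.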
11. MODELS
P(\{S_i\}_{i=1,\dots,N}) = \frac{e^{-\beta H}}{Z}, \qquad H = -\sum_{(ijk) \in E} J_{ijk} S_i S_j S_k, \qquad J_{ijk} \sim \mathcal{N}(0, 1)

The p-spin model of the glass transition.
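A hedged sketch of this Hamiltonian (p = 3), assuming the edge set E is all triplets i < j < k; the sizes and seed are arbitrary:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
N = 12
S = rng.choice([-1, 1], size=N)

# 3-spin Hamiltonian: H = -sum_{(ijk) in E} J_ijk S_i S_j S_k,
# here with E = all triplets i < j < k and couplings J_ijk ~ N(0, 1)
triplets = list(combinations(range(N), 3))
J = rng.normal(size=len(triplets))

def p_spin_energy(S):
    return -sum(Jt * S[i] * S[j] * S[k] for Jt, (i, j, k) in zip(J, triplets))
```

Because p = 3 is odd, flipping every spin flips the sign of the energy, so unlike the 2-spin case the model has no global spin-flip symmetry.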
12. IS THIS USEFUL IN MACHINE LEARNING?
Example: a single-layer neural network = generalized linear regression.
Given data X and labels y, find weights w such that

y_\mu = \varphi\left(\sum_{i=1}^{p} X_{\mu i} w_i\right), \qquad \mu = 1, \dots, n, \quad i = 1, \dots, p,

where \varphi is a (possibly noisy) activation function.
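The model above, written out as a data-generating sketch; the slide leaves φ generic, so tanh here is just one arbitrary choice, and the sizes are placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 50                # n samples, p input dimensions

X = rng.normal(size=(n, p))   # data matrix
w = rng.normal(size=p)        # weight vector
phi = np.tanh                 # example activation; the slide's phi is generic

y = phi(X @ w)                # y_mu = phi(sum_i X_{mu i} w_i)
```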
13. TEACHER-STUDENT MODEL
Take X_{\mu i} random i.i.d. Gaussian, and teacher weights w^*_i random i.i.d. from a prior P_w.
Create labels

y_\mu = \varphi\left(\sum_{i=1}^{p} X_{\mu i} w^*_i\right).

Goal: compute the best possible generalisation error achievable with n samples of dimension p.
High-dimensional regime: p \to \infty, n \to \infty, n/p = \Omega(1).
Gardner, Derrida ’89; Györgyi ’90.
14. TEACHER-STUDENT MODEL (continued)
What did we win? The posterior

P(w \mid X, y)

is tractable with the replica and cavity methods, developed in the theory of spin glasses.
Gardner, Derrida ’89; Györgyi ’90.
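As a toy special case (only to illustrate what "posterior" means here, not the replica computation): with linear φ, Gaussian output noise, and a standard Gaussian prior P_w, the posterior P(w|X, y) is Gaussian and its mean is the ridge estimator. All sizes and the noise level below are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma2 = 300, 100, 0.1

X = rng.normal(size=(n, p)) / np.sqrt(p)                     # data
w_star = rng.normal(size=p)                                  # teacher weights
y = X @ w_star + rng.normal(scale=np.sqrt(sigma2), size=n)   # noisy linear labels

# Posterior mean under prior N(0, I) and noise variance sigma2:
# w_hat = (X^T X + sigma2 I)^{-1} X^T y  (ridge regression)
w_hat = np.linalg.solve(X.T @ X + sigma2 * np.eye(p), X.T @ y)

mse_hat = np.mean((w_hat - w_star) ** 2)   # error of the posterior mean
mse_zero = np.mean(w_star ** 2)            # error of the trivial estimate w = 0
```

For general non-linear φ the posterior has no such closed form, which is where the replica and cavity methods come in.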
15. NEW W.R.T. 1990
Optimal generalisation error for any non-linearity and prior on the weights.
Proof of the replica formula for the optimal generalisation error.
Approximate message passing (AMP) provably reaching the optimal generalisation error (outside of the hard region).
Barbier, Krzakala, Macris, Miolane, LZ, COLT’18, arXiv:1708.03395.
20. INCLUDING HIDDEN VARIABLES
Data X, labels y; first-layer weights w are learned, second-layer weights v1 & v2 are fixed.
Architecture: p input units, K hidden units, one output unit, L = 3 layers; n training samples.
Limit: K = O(1), p \to \infty, n \to \infty, \alpha = n/p = \Omega(1).
This is the committee machine, a model from Schwarze ’92.
Proof of the replica formula, and approximate message passing: Aubin, Maillard, Barbier, Macris, Krzakala, LZ ’19, spotlight at NeurIPS’18.
21. PHASE TRANSITIONS
Specialization phase transition = hidden units specialise to correlate with specific features.
Example: K = 2, with the convention sign(0) = 0.
[Figure: generalization error ϵ_g(α) and overlap q versus α, comparing AMP (q00, q01, ϵ_g) with state evolution (SE q00, SE q01, SE ϵ_g); the specialization transition is marked.]
y_\mu = \mathrm{sign}\left[\mathrm{sign}\left(\sum_i X_{\mu i} w_{i,1}\right) + \mathrm{sign}\left(\sum_i X_{\mu i} w_{i,2}\right)\right]
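A generative sketch of this K = 2 committee machine (sizes are arbitrary; note that numpy's `sign` already follows the slide's convention sign(0) = 0):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, K = 500, 40, 2

X = rng.normal(size=(n, p))
W = rng.normal(size=(p, K))   # one teacher weight vector per hidden unit

# inner signs are the two hidden-unit activations; the outer sign votes them,
# with np.sign(0) = 0 when the two hidden units disagree
y = np.sign(np.sign(X @ W).sum(axis=1))
```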
22. PHASE TRANSITIONS
Large algorithmic gap for K \gg 1:

y_\mu = \mathrm{sign}\left[\sum_{a=1}^{K} \mathrm{sign}\left(\sum_i X_{\mu i} w_{i,a}\right)\right]

IT threshold: n > 7.65 \, K p.
Algorithmic threshold: n > \mathrm{const} \cdot K^2 p.
[Figure: generalization error ϵ_g(α), with α = (# of samples)/(# hidden units × input size); Bayes-optimal vs AMP curves, phases of non-specialized and specialized hidden units, the computational gap, and the discontinuous specialization transition.]
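A back-of-envelope comparison of the two thresholds from the slide; the algorithmic constant is unspecified there, so C = 1.0 below is a placeholder assumption:

```python
def it_threshold(K, p):
    # information-theoretic threshold from the slide: n > 7.65 * K * p
    return 7.65 * K * p

def algorithmic_threshold(K, p, C=1.0):
    # best known algorithms (AMP) need n > C * K^2 * p; C is a placeholder
    return C * K**2 * p

p = 1000
gap_10 = algorithmic_threshold(10, p) / it_threshold(10, p)
gap_100 = algorithmic_threshold(100, p) / it_threshold(100, p)
# the ratio scales as C * K / 7.65: the computational gap widens linearly in K
```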
23. Regimes for reaching good generalisation error, as a function of the number of samples: impossible → hard → doable (doable today).
Our goal: quantify this in more realistic models, and design algorithms working in the doable region.
24. REFERENCES
LZ, F. Krzakala, Statistical Physics of Inference: Thresholds and Algorithms, Advances in Physics (2016), arXiv:1511.02476.
J. Barbier, N. Macris, L. Miolane, F. Krzakala, LZ, Phase Transitions, Optimal Errors and Optimality of Message-Passing in Generalized Linear Models, arXiv:1708.03395, COLT’18.
B. Aubin, A. Maillard, J. Barbier, F. Krzakala, N. Macris, LZ, The committee machine: Computational to statistical gaps in learning a two-layers neural network, arXiv:1806.05451, NeurIPS’18.
25. Thank you for your attention!