The statistical physics of learning: typical learning curves
Michael Biehl, www.cs.rug.nl/~biehl
AMALEA workshop, September 12, 2022
• a little bit of history
• optimization and statistical physics (in a nutshell)
• machine learning as a special case, disorder average
• annealed approximation, high-temperature limit, replica trick
• typical learning curves in student/teacher scenarios
• a very simple example: single unit, linear regression
• nonlinear, layered neural networks:
  - phase transitions in soft committee machines
  - the role of the activation function
• outlook / ongoing projects
Statistical Physics of Neural Networks

capacity of feed-forward networks:
Elizabeth Gardner (1957-1988). The space of interactions in neural networks. J. Phys. A 21: 257 (1988)

dynamics, attractor neural networks:
John Hopfield. Neural networks and physical systems with emergent collective computational abilities. PNAS 79(8): 2554 (1982)

learning of a rule:
Geza Györgyi, Naftali Tishby. Statistical theory of learning a rule. In: Neural Networks and Spin Glasses, World Scientific, 31-36 (1990)

reviews: annealed approximation, high-T limit, replica trick etc.:
S. Seung, H. Sompolinsky, N. Tishby. Statistical mechanics of learning from examples. Phys. Rev. A 45: 6056 (1992)
stochastic optimization

objective/cost/energy function H(w) for many degrees of freedom w

discrete, e.g. Metropolis algorithm:
• suggest a (small) change, e.g. a „single spin flip“ wj → −wj for a random j
• acceptance of the change:
  - always if the energy decreases, ΔH ≤ 0
  - with probability exp(−ΔH/T) if ΔH > 0
• the temperature T controls the acceptance rate for „uphill“ moves

continuous, e.g. Langevin dynamics:
• continuous temporal change, „noisy gradient descent“:
  dw/dt = −∇H(w) + η(t)
• with delta-correlated white noise η(t), ⟨ηj(t) ηk(t′)⟩ = 2T δjk δ(t − t′)
  (spatial + temporal independence)
• the temperature T controls the noise level, i.e. the random deviation from the gradient

(a minimal numerical sketch of both schemes follows below)
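A minimal numerical sketch of both schemes; the toy energies (a periodic ±1 spin chain for Metropolis, a quadratic H(w) for Langevin) are my illustrative choices, not examples from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# discrete: Metropolis single-spin-flip for w_j = ±1,
# toy energy H(w) = -sum_j w_j w_{j+1} (periodic 1d chain)
def metropolis_step(w, T):
    j = rng.integers(len(w))
    dH = 2.0 * w[j] * (w[j - 1] + w[(j + 1) % len(w)])  # energy change of the flip
    # accept always if dH <= 0, with probability exp(-dH/T) for "uphill" moves
    if dH <= 0 or rng.random() < np.exp(-dH / T):
        w[j] = -w[j]
    return w

w = rng.choice([-1, 1], size=100)
for _ in range(20000):
    w = metropolis_step(w, T=0.5)

# continuous: Langevin dynamics dw/dt = -grad H(w) + eta(t),
# toy energy H(w) = |w|^2 / 2, hence grad H(w) = w
def langevin(w, grad_H, T=0.1, dt=1e-2, steps=5000):
    for _ in range(steps):
        noise = np.sqrt(2.0 * T * dt) * rng.standard_normal(w.shape)  # white noise
        w = w - dt * grad_H(w) + noise
    return w

w_cont = langevin(rng.standard_normal(50), grad_H=lambda w: w)
print("magnetization:", w.mean(), "  <w_j^2> (Langevin, ~T):", (w_cont**2).mean())
```

At low T the Metropolis chain mostly moves downhill; for the quadratic H the Langevin dynamics equilibrates to ⟨wj²⟩ ≈ T.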
thermal equilibrium

Markov chain / continuous dynamics → stationary density of configurations:

P(w) = (1/Z) exp[−β H(w)],   β = 1/T

normalization Z = ∫ dμ(w) exp[−β H(w)]: the „Zustandssumme“, partition function

Gibbs-Boltzmann density of states:
• physics: thermal equilibrium of a physical system at temperature T
• optimization: formal equilibrium situation, control parameter T

limiting cases:
• T → ∞, β → 0: the energy is irrelevant, every state contributes equally
• T → 0, β → ∞: only the lowest energy (ground state) contributes
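A small numerical illustration of the two limits, for a hand-picked set of toy energy levels:

```python
import numpy as np

H = np.array([0.0, 0.5, 1.0, 3.0])            # toy energies H(w), my assumption

def gibbs(H, beta):
    w = np.exp(-beta * (H - H.min()))          # shift by the ground state for stability
    return w / w.sum()                         # divide by the partition function Z

for beta in [0.0, 1.0, 100.0]:                 # beta = 1/T
    print(f"beta = {beta:6.1f}   P(w) = {np.round(gibbs(H, beta), 4)}")
# beta -> 0: every state contributes equally; beta large: only the ground state
```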
free energy

thermal averages in equilibrium, for instance ⟨⋯⟩T

re-write Z as an integral over all possible energies,
Z = ∫ dE ω(E) exp[−βE], with ω(E) ∼ volume of states with energy E

assume extensive energy, proportional to the system size N: E = N e and, with the entropy density s(e) = (1/N) ln ω,

Z = ∫ dE exp[−Nβ (e − s(e)/β)]

in large systems (N → ∞), ln Z is dominated by the minimum of the free energy (density)

f = e − s(e)/β ∼ −ln Z/(βN)
remark: saddle point integration

for a function φ(e) with a maximum at e₀, consider the thermodynamic limit N → ∞:

(1/N) ln ∫ de exp[N φ(e)] → φ(e₀)

hence −ln Z/(βN) is given by the minimum of the free energy f = e − s(e)/β
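A quick numerical check of this argument with a toy φ(e) (my choice; maximum value 0, attained at e₀ = 0.3):

```python
import numpy as np

phi = lambda e: -(e - 0.3) ** 2
e = np.linspace(-2.0, 2.0, 20001)
de = e[1] - e[0]

for N in [10, 100, 1000, 10000]:
    # Riemann sum of exp[N phi(e)]; factor out the maximum for numerical stability
    log_int = N * phi(e).max() + np.log(np.exp(N * (phi(e) - phi(e).max())).sum() * de)
    print(f"N = {N:6d}   (1/N) ln ∫ de e^(N phi) = {log_int / N:+.6f}   (max phi = 0)")
```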
machine learning

special case machine learning: choice of adaptive weights w, e.g. all weights in a neural network

cost function, defined w.r.t. the training data ID = {ξ^μ, S(ξ^μ)}_{μ=1}^P:

H(w) = Σ_{μ=1}^P ε(w, ξ^μ)

sum over examples, e.g. input vectors ξ^μ and target labels S(ξ^μ) (supervised);
ε(...): cost or error measure per example, e.g. the classification error

interpretation of training:
• weights are the outcome of some stochastic optimization process with energy-dependent stationary density P(w)
• formal (thermal) equilibrium
• ⟨⋯⟩T: thermal averages (over the stochastic training)
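As a concrete illustration, a sketch of this training energy for a toy supervised data set; the per-example error ε is chosen here as the classification error of a simple perceptron (my choice, not the deck's):

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 50, 200
xi = rng.standard_normal((P, N))           # input vectors xi^mu
w_star = rng.standard_normal(N)
S = np.sign(xi @ w_star)                   # target labels S(xi^mu) from a teacher

def H(w):
    return np.sum(np.sign(xi @ w) != S)    # number of misclassified examples

print("H(random w) =", H(rng.standard_normal(N)), "of", P)
print("H(teacher)  =", H(w_star))          # realizable rule: zero training error
```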
disorder average

• the energy/cost function is defined for one particular set of examples;
  typical properties: additional average ⟨⋯⟩_ID over random training data ID

• typical properties on average over data sets: derivatives of the
  quenched free energy ∼ ⟨ln Z⟩_ID yield the averages;
  difficult: replica trick, approximations

• student / teacher scenarios
  - define/control the complexity of the target rule and the learning system
  - represent the target by a teacher network

• simplest assumptions:
  - independent input vectors of i.i.d. components
  - noise-free training labels provided by the teacher network
example: training of a single, linear unit

input data: independent, identically distributed random components with

⟨ξ^μ_j⟩ = 0;   ⟨ξ^μ_j ξ^ν_k⟩ = δ_jk δ_μν

e.g. ξ^μ_j = ±1 (with equal prob.) or P(ξ^μ_j) = (1/√(2π)) exp[−(ξ^μ_j)²/2]

student output: g(x) with pre-activation (local potential) x := (1/√N) Σ_j w_j ξ_j
teacher output: g(y) with pre-activation (local potential) y := (1/√N) Σ_j w*_j ξ_j

weight vectors w, w* with w²/N = Q = 𝒪(1) and w*²/N = Q* = 𝒪(1), so that x, y ∼ 𝒪(1)

cost function, energy, e.g. linear regression: g(z) = z, ε(x, y) = (1/2)(x − y)² and

H(w) = Σ_{μ=1}^P ε(x^μ, y^μ)

extensive quantity: H ∝ P = αN
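A sketch of this set-up (sizes and random seed are my choices), checking that the pre-activations are 𝒪(1) and reading off the order parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 1000, 5000

def spherical(N):
    v = rng.standard_normal(N)
    return v * np.sqrt(N) / np.linalg.norm(v)    # enforce w^2 = N, i.e. Q = 1

w, w_star = spherical(N), spherical(N)
xi = rng.choice([-1.0, 1.0], size=(P, N))        # i.i.d. ±1 input components

x = xi @ w / np.sqrt(N)                          # student pre-activations x^mu
y = xi @ w_star / np.sqrt(N)                     # teacher pre-activations y^mu

Q, Q_star, R = w @ w / N, w_star @ w_star / N, w @ w_star / N
print(f"Q = {Q:.3f}   Q* = {Q_star:.3f}   R = {R:+.3f}")
print(f"var(x) = {x.var():.3f}   var(y) = {y.var():.3f}   <xy> = {np.mean(x*y):+.3f}")

H = 0.5 * np.sum((x - y) ** 2)                   # linear regression energy
print(f"H/P = {H/P:.3f}   vs   (Q + Q* - 2R)/2 = {(Q + Q_star - 2*R)/2:.3f}")
```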
partition function, training at T = 1/β

Z = ∫ Π_j dw_j δ(w² − N) exp[−β Σ_μ ε(x^μ, y^μ)]   (spherical measure, written dμ(w) below)

Annealed Approximation: ln ⟨Z⟩_ID instead of ⟨ln Z⟩_ID

caution: ⟨ln Z⟩_ID ≤ ln ⟨Z⟩_ID does not imply „≈“ or similarity of the extrema!

⟨Z⟩_ID = ∫ dμ(w) ⟨ exp[−β Σ_μ ε( w·ξ^μ/√N , w*·ξ^μ/√N )] ⟩_ID

traditional approach: integral representation of δ(x^μ − w·ξ^μ/√N), δ(y^μ − w*·ξ^μ/√N),
explicit computation of the averages …, elimination of conjugate variables …

short-cut: exploit the Central Limit Theorem for i.i.d. input components and N → ∞:

x^μ = (1/√N) Σ_{j=1}^N w_j ξ^μ_j,   y^μ = (1/√N) Σ_{j=1}^N w*_j ξ^μ_j
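A numerical illustration of the short-cut (my addition): for ±1 components, x = w·ξ/√N approaches a Gaussian as N grows, checked here via the excess kurtosis:

```python
import numpy as np

rng = np.random.default_rng(3)
P = 100000                                       # number of sampled inputs
for N in [2, 10, 100]:
    w = rng.standard_normal(N)
    w *= np.sqrt(N) / np.linalg.norm(w)          # normalize to w^2 = N
    xi = rng.choice([-1.0, 1.0], size=(P, N))
    x = xi @ w / np.sqrt(N)
    kurt = np.mean(x**4) / np.mean(x**2) ** 2 - 3.0   # 0 for a Gaussian
    print(f"N = {N:4d}   var(x) = {x.var():.3f}   excess kurtosis = {kurt:+.3f}")
```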
disorder average

joint normal density P(x^μ, y^μ) of the local potentials, fully specified by

⟨x^μ⟩ = (1/√N) Σ_{j=1}^N w_j ⟨ξ^μ_j⟩ = 0,   ⟨y^μ⟩ = 0

⟨(x^μ)²⟩ = (1/N) Σ_{j,k=1}^N w_j w_k ⟨ξ^μ_j ξ^μ_k⟩ = (1/N) Σ_{j=1}^N w_j²

⟨(y^μ)²⟩ = (1/N) Σ_{j=1}^N (w*_j)²,   ⟨x^μ y^μ⟩ = (1/N) Σ_{j,k=1}^N w_j w*_k ⟨ξ^μ_j ξ^μ_k⟩ = (1/N) Σ_{j=1}^N w_j w*_j

set of order parameters:

(1/N) Σ_j w_j² = Q (= 1),   (1/N) Σ_j w_j w*_j = R,   (1/N) Σ_j (w*_j)² = Q* (= 1)

macroscopic properties of the trained network instead of microscopic details
annealed free energy

⟨Z⟩_ID = ∫ dR exp[N (G_o(R) − α G_1(R))]   with αN = P

(⟨⋯⟩_ID factorizes w.r.t. j = 1, 2, …, N and μ = 1, 2, …, P)

entropy term: G_o(R) = (1/N) ln ∫ Π_j dw_j δ(N − w²) δ(NR − w·w*)
(N-dim. geometry, independent of model details)

energy term: G_1(R) = −ln ∫ dx dy P(x, y) exp[−β ε(x, y)]
(model, training)

saddle-point integration for N → ∞, annealed free energy:

−β f_ann = (1/N) ln ⟨Z⟩_ID = extr_R [G_o(R) − α G_1(R)]
the entropy term

G_o(R) = (1/N) ln ∫ Π_j dw_j δ(N − w²) δ(NR − w·w*) ≈ … = (1/2) ln(1 − R²)

the hard way:
- integral representation of the delta-functions, introducing a conjugate variable R̂
- saddle-point integration for large N w.r.t. R, R̂

geometry: for fixed overlap R of the normalized w with w*, the accessible sphere has radius r = √(1 − R²), hence the volume scales as V ∼ (1 − R²)^{N/2} and

G_o(R) = (1/N) ln V ∼ (1/2) ln(1 − R²)

general result for a set of vectors, with matrix 𝒞 of pairwise dot-products and norms [R. Urbanczik]:
G_o = (1/2) ln det 𝒞;   here 𝒞 = (1 R; R 1)
the energy term

G_1(R) = −ln ∫ dx dy / (2π √(1 − R²)) exp[−(x² + y² − 2Rxy) / (2(1 − R²))] exp[−β ε(x, y)]

linear regression (single linear student and teacher), ε(x, y) = (1/2)(x − y)²;
elementary Gaussian integrals give

G_1(R) = (1/2) ln[1 + 2β(1 − R)]

annealed free energy (+ irrelevant constants and terms that vanish for N → ∞):

−(βf) = −(1/2) α ln[1 + 2β(1 − R)] + (1/2) ln(1 − R²)

∂(βf)/∂R = 0  ⇒  R/(1 − R²) = αβ/(1 + 2β(1 − R))  →  R(α) at a given β
learning curves (linear regression)

typical success of training, i.e. student/teacher similarity as a function of the training set size:
[figure: R vs. α for β = 0.1, 1, 10, 100, 1000]

generalization error and training error (shown for β = 1):

ϵ_g = (1 − R),   ϵ_t = (1/α) ∂(βf)/∂β = (1 − R)/(1 + 2β(1 − R))

[figure: ϵ_g and ϵ_t vs. α for β = 1]
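A sketch reproducing these curves numerically: solve the saddle-point condition for R(α) by bisection (my implementation choice) and evaluate ϵ_g and ϵ_t:

```python
import numpy as np

def R_of_alpha(alpha, beta):
    f = lambda R: R / (1 - R**2) - alpha * beta / (1 + 2 * beta * (1 - R))
    lo, hi = 0.0, 1.0 - 1e-12      # physical solution in [0, 1); f(lo) < 0, f(hi) > 0
    for _ in range(200):           # plain bisection
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

beta = 1.0
for alpha in [0.5, 1.0, 2.0, 5.0, 10.0]:
    R = R_of_alpha(alpha, beta)
    eps_g = 1.0 - R
    eps_t = (1.0 - R) / (1.0 + 2.0 * beta * (1.0 - R))
    print(f"alpha = {alpha:5.1f}   R = {R:.4f}   eps_g = {eps_g:.4f}   eps_t = {eps_t:.4f}")
```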
remark: interpretation of the AA

partition function in the AA:

⟨Z⟩_ID = ∫ dμ(w) ⟨exp[−β H(w)]⟩_ID = ∫ dμ(w) ∫ dμ({ξ^μ}_{μ=1}^P) exp[−β H({ξ^μ}, w)]

interpretation: partition sum of a system in which weights and data are degrees of freedom that can be optimized (annealed) w.r.t. H

correct treatment: the data constitutes frozen disorder in H

observation/folklore: the AA works (qualitatively) well in realizable cases, e.g. student and teacher of the same complexity and noise-free data; it fails in unrealizable cases (noise, mismatch), because the (hypothetical) system can „adapt the data to the task“, which yields over-optimistic results
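A brute-force toy illustration (my construction, far smaller than anything in the theory) of the gap ⟨ln Z⟩_ID ≤ ln ⟨Z⟩_ID, for a binary perceptron with enumerable weight space:

```python
import numpy as np

rng = np.random.default_rng(4)
beta, N, P, n_sets = 2.0, 9, 12, 2000

# all 2^N binary weight vectors w in {-1, +1}^N (brute-force enumeration)
w_all = np.array(np.meshgrid(*([[-1.0, 1.0]] * N))).reshape(N, -1).T

lnZ = np.empty(n_sets)
for s in range(n_sets):
    xi = rng.choice([-1.0, 1.0], size=(P, N))
    w_star = rng.choice([-1.0, 1.0], size=N)
    S = np.sign(xi @ w_star)                    # N odd, so no zero pre-activations
    H = (np.sign(xi @ w_all.T) != S[:, None]).sum(axis=0)   # training errors per w
    lnZ[s] = np.log(np.exp(-beta * H).sum())

print("quenched <ln Z> =", lnZ.mean())
print("annealed ln <Z> =", np.log(np.exp(lnZ).mean()), " (Jensen: >= <ln Z>)")
```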
proper disorder average: replica trick/method

replica trick, formally: n non-interacting „copies“ of the system (replicas):

⟨ln Z⟩_ID = lim_{n→0} (⟨Z^n⟩_ID − 1)/n = lim_{n→0} ∂⟨Z^n⟩_ID/∂n = lim_{n→0} (1/n) ln ⟨Z^n⟩_ID

⟨Z^n⟩_ID = ∫ Π_{a=1}^n dμ(w^a) ⟨exp[−β Σ_μ Σ_a ε( w^a·ξ^μ/√N , w*·ξ^μ/√N )]⟩_ID

(integration over the joint density P(x^μ_1, x^μ_2, …, x^μ_n, y^μ))

the data set average introduces effective interactions between the replicas;
… saddle-point integration for ⟨Z^n⟩_ID; the quenched free energy involves the order parameters

R^a = w^a·w*/N,   q^{ab} = w^a·w^b/N

and requires analytic continuation for n ∈ ℝ and n → 0

mathematical subtleties, replica symmetry-breaking …
Marc Mézard, Giorgio Parisi (*), Miguel Virasoro. Spin Glass Theory and Beyond (1987).   (*) Nobel Prize 2021
historical :-) examples of perceptron learning curves

student: S = sign(w·ξ),   teacher: S* = sign(w*·ξ)

perceptron, zero-temperature training from noise-free, linearly separable data:
• Gibbs student, optimal generalization, maximum stability: ϵ_g ∝ α^{−1}
• Adaline, ε(x, y) = (1/2) [x − sign(y)]²: ϵ_g ∝ α^{−1/2}

more in the literature:
- label noise
- teacher weight noise
- variational optimization of the cost function
- weight decay
- worst-case training
- …

(a simulation sketch follows below)
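For comparison, a simulation sketch (my addition; it uses the plain Rosenblatt perceptron rule, not the Gibbs, optimal or maximum-stability algorithms above) measuring ϵ_g = arccos(R)/π from the student/teacher overlap R:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 200
w_star = rng.standard_normal(N)

for alpha in [1, 2, 4, 8]:
    P = alpha * N
    xi = rng.standard_normal((P, N))
    S = np.sign(xi @ w_star)                   # noise-free teacher labels
    w = np.zeros(N)
    for _ in range(100):                       # sweeps of Rosenblatt updates
        wrong = np.sign(xi @ w) != S
        if not wrong.any():
            break
        for mu in np.flatnonzero(wrong)[:50]:  # update a batch of errors
            w += S[mu] * xi[mu] / np.sqrt(N)
    R = w @ w_star / (np.linalg.norm(w) * np.linalg.norm(w_star))
    print(f"alpha = {alpha}   R = {R:.3f}   eps_g = {np.arccos(R)/np.pi:.3f}")
```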
training at high temperatures

the AA becomes exact in the limit T → ∞ (the replicas decouple)

energy term for β → 0:

G_1(R) = −ln ∫ dx dy P(x, y) exp[−β ε(x, y)]
       ≈ −ln ∫ dx dy P(x, y) (1 − β ε(x, y))
       ≈ −ln [1 − β ⟨ε(x, y)⟩_{x,y}] ≈ β ϵ_g

the generalization error! (average over arbitrary input)

free energy: βf ≈ (αβ) ϵ_g − G_o(R)

only meaningful if (αβ) = 𝒪(1): for β → 0, T → ∞ this requires α = P/N → ∞,
i.e. one learns almost nothing from infinitely many examples

here: P and T cannot be varied independently;
ϵ_g and ϵ_t are indistinguishable (input space is sampled perfectly)
layered networks: “soft committee machines” (SCM)

adaptive student: N inputs, K hidden units;
teacher parameterizes the target

consider specific activation functions, e.g. sigmoidal / ReLU

thermodynamic limit N → ∞: description in terms of order parameters
(student/teacher and student/student overlaps);
site symmetry / hidden unit specialization
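A minimal SCM forward pass (my notation: fixed hidden-to-output weights of +1, adaptive first layer W of shape (K, N)), with a Monte Carlo estimate of the generalization error; the deck's sigmoidal g is erf-like, tanh serves as a stand-in here:

```python
import numpy as np

def scm(W, xi, g):
    N = xi.shape[-1]
    return g(xi @ W.T / np.sqrt(N)).sum(axis=-1)   # sigma(xi) = sum_k g(w_k·xi/sqrt(N))

rng = np.random.default_rng(6)
N, K, P = 100, 3, 10000
W_student = rng.standard_normal((K, N))
W_teacher = rng.standard_normal((K, N))            # teacher with M = K hidden units
xi = rng.standard_normal((P, N))                   # Monte Carlo test inputs

for name, g in [("sigmoidal (tanh)", np.tanh), ("ReLU", lambda z: np.maximum(z, 0.0))]:
    eps_g = 0.5 * np.mean((scm(W_student, xi, g) - scm(W_teacher, xi, g)) ** 2)
    print(f"{name:16s}   eps_g ≈ {eps_g:.4f}")
```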
typical SCM learning curves (high-T)

express generalization error ϵ_g and entropy s as functions of {R, S, Q, C};
ϵ_g as a function of the training set size from the high-T free energy:

sigmoidal: discontinuous phase transition (K > 2)
- un-specialized (R = S) | anti-specialized (R < S) vs. specialized (R > S)
- the poor-performing phase persists for large data sets

ReLU: continuous transition
- anti-specialized (R < S) vs. specialized (R > S), meeting at R = S
- similar performances, lower (free) energy barrier
layered networks (SCM): references

sigmoidal activation, high-T, arbitrary K:
Phase Transitions in Soft-Committee Machines. M. Biehl, E. Schlösser, M. Ahr. Europhysics Letters 44: 261-267 (1998)

sigmoidal activation, replica, large K = M → ∞:
Statistical Physics and Practical Training of Soft-Committee Machines. M. Ahr, M. Biehl, R. Urbanczik. European Physical Journal B 10: 583-588 (1999)

ReLU activation, high-T, arbitrary K:
Hidden Unit Specialization in Layered Neural Networks: ReLU vs. Sigmoidal Activation. E. Oostwal, M. Straat, M. Biehl. Physica A 564: 125517 (2021)

challenges:
- more general activation functions
- overfitting/underfitting
- low temperatures: AA, replica
- many layers (deep networks)
- non-trivial (realistic) input densities [Zdeborová, Goldt, Mézard, …]
on-going & future work

The Role of the Activation Function in Feedforward Learning Systems (RAFFLES)
NWO-funded project, Frederieke Richert

Robust Learning of Sparse Representations: Brain-inspired Inhibition and Statistical Physics Analysis
2 PhD projects funded by the Groningen Cognitive Systems and Materials Centre CogniGron, in collaboration with George Azzopardi
- study network architectures and training schemes which favor sparse activity and sparse connectivity
- consider activation functions which relate to hardware-realizable adaptive systems

see: www.cs.rug.nl/~biehl (link to description and application form, deadline: 29 September 2022)
MiWoCI 2022: no statistical physics :-)
www.cs.rug.nl/~biehl   m.biehl@rug.nl   twitter: @michaelbiehl13

More Related Content

Similar to stat-phys-AMALEA.pdf

Algorithmic Thermodynamics
Algorithmic ThermodynamicsAlgorithmic Thermodynamics
Algorithmic Thermodynamics
Sunny Kr
 
Unit iii update
Unit iii updateUnit iii update
Unit iii update
Indira Priyadarsini
 
nber_slides.pdf
nber_slides.pdfnber_slides.pdf
nber_slides.pdf
ssuser05b736
 
Neural Networks: Least Mean Square (LSM) Algorithm
Neural Networks: Least Mean Square (LSM) AlgorithmNeural Networks: Least Mean Square (LSM) Algorithm
Neural Networks: Least Mean Square (LSM) Algorithm
Mostafa G. M. Mostafa
 
Functional specialization in human cognition: a large-scale neuroimaging init...
Functional specialization in human cognition: a large-scale neuroimaging init...Functional specialization in human cognition: a large-scale neuroimaging init...
Functional specialization in human cognition: a large-scale neuroimaging init...
Ana Luísa Pinho
 
Mathematics Colloquium, UCSC
Mathematics Colloquium, UCSCMathematics Colloquium, UCSC
Mathematics Colloquium, UCSC
dongwook159
 
Learning to Reconstruct
Learning to ReconstructLearning to Reconstruct
Learning to Reconstruct
Jonas Adler
 
Hodgkin-Huxley & the nonlinear dynamics of neuronal excitability
Hodgkin-Huxley & the nonlinear  dynamics of neuronal excitabilityHodgkin-Huxley & the nonlinear  dynamics of neuronal excitability
Hodgkin-Huxley & the nonlinear dynamics of neuronal excitability
SSA KPI
 
Hmm and neural networks
Hmm and neural networksHmm and neural networks
Hmm and neural networks
Janani Ramasamy
 
Materials Modelling: From theory to solar cells (Lecture 1)
Materials Modelling: From theory to solar cells  (Lecture 1)Materials Modelling: From theory to solar cells  (Lecture 1)
Materials Modelling: From theory to solar cells (Lecture 1)
cdtpv
 
danreport.doc
danreport.docdanreport.doc
danreport.docbutest
 
E05731721
E05731721E05731721
E05731721
IOSR-JEN
 
CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...
CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...
CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...
The Statistical and Applied Mathematical Sciences Institute
 
Introduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and AlgorithmsIntroduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and Algorithms
NBER
 
A New Neural Network For Solving Linear Programming Problems
A New Neural Network For Solving Linear Programming ProblemsA New Neural Network For Solving Linear Programming Problems
A New Neural Network For Solving Linear Programming Problems
Jody Sullivan
 
SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Rea...
SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Rea...SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Rea...
SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Rea...
Tobias Wunner
 
A walk through the intersection between machine learning and mechanistic mode...
A walk through the intersection between machine learning and mechanistic mode...A walk through the intersection between machine learning and mechanistic mode...
A walk through the intersection between machine learning and mechanistic mode...
JuanPabloCarbajal3
 

Similar to stat-phys-AMALEA.pdf (20)

Algorithmic Thermodynamics
Algorithmic ThermodynamicsAlgorithmic Thermodynamics
Algorithmic Thermodynamics
 
Unit iii update
Unit iii updateUnit iii update
Unit iii update
 
nber_slides.pdf
nber_slides.pdfnber_slides.pdf
nber_slides.pdf
 
Neural Networks: Least Mean Square (LSM) Algorithm
Neural Networks: Least Mean Square (LSM) AlgorithmNeural Networks: Least Mean Square (LSM) Algorithm
Neural Networks: Least Mean Square (LSM) Algorithm
 
Functional specialization in human cognition: a large-scale neuroimaging init...
Functional specialization in human cognition: a large-scale neuroimaging init...Functional specialization in human cognition: a large-scale neuroimaging init...
Functional specialization in human cognition: a large-scale neuroimaging init...
 
Ann
Ann Ann
Ann
 
eviewsOLSMLE
eviewsOLSMLEeviewsOLSMLE
eviewsOLSMLE
 
Mathematics Colloquium, UCSC
Mathematics Colloquium, UCSCMathematics Colloquium, UCSC
Mathematics Colloquium, UCSC
 
Learning to Reconstruct
Learning to ReconstructLearning to Reconstruct
Learning to Reconstruct
 
Hodgkin-Huxley & the nonlinear dynamics of neuronal excitability
Hodgkin-Huxley & the nonlinear  dynamics of neuronal excitabilityHodgkin-Huxley & the nonlinear  dynamics of neuronal excitability
Hodgkin-Huxley & the nonlinear dynamics of neuronal excitability
 
Hmm and neural networks
Hmm and neural networksHmm and neural networks
Hmm and neural networks
 
Materials Modelling: From theory to solar cells (Lecture 1)
Materials Modelling: From theory to solar cells  (Lecture 1)Materials Modelling: From theory to solar cells  (Lecture 1)
Materials Modelling: From theory to solar cells (Lecture 1)
 
danreport.doc
danreport.docdanreport.doc
danreport.doc
 
E05731721
E05731721E05731721
E05731721
 
20120140503023
2012014050302320120140503023
20120140503023
 
CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...
CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...
CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...
 
Introduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and AlgorithmsIntroduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and Algorithms
 
A New Neural Network For Solving Linear Programming Problems
A New Neural Network For Solving Linear Programming ProblemsA New Neural Network For Solving Linear Programming Problems
A New Neural Network For Solving Linear Programming Problems
 
SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Rea...
SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Rea...SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Rea...
SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Rea...
 
A walk through the intersection between machine learning and mechanistic mode...
A walk through the intersection between machine learning and mechanistic mode...A walk through the intersection between machine learning and mechanistic mode...
A walk through the intersection between machine learning and mechanistic mode...
 

More from University of Groningen

Interpretable machine learning in endocrinology, M. Biehl, APPIS 2024
Interpretable machine learning in endocrinology, M. Biehl, APPIS 2024Interpretable machine learning in endocrinology, M. Biehl, APPIS 2024
Interpretable machine learning in endocrinology, M. Biehl, APPIS 2024
University of Groningen
 
ESE-Eyes-2023.pdf
ESE-Eyes-2023.pdfESE-Eyes-2023.pdf
ESE-Eyes-2023.pdf
University of Groningen
 
APPIS-FDGPET.pdf
APPIS-FDGPET.pdfAPPIS-FDGPET.pdf
APPIS-FDGPET.pdf
University of Groningen
 
prototypes-AMALEA.pdf
prototypes-AMALEA.pdfprototypes-AMALEA.pdf
prototypes-AMALEA.pdf
University of Groningen
 
Evidence for tissue and stage-specific composition of the ribosome: machine l...
Evidence for tissue and stage-specific composition of the ribosome: machine l...Evidence for tissue and stage-specific composition of the ribosome: machine l...
Evidence for tissue and stage-specific composition of the ribosome: machine l...
University of Groningen
 
The statistical physics of learning revisted: Phase transitions in layered ne...
The statistical physics of learning revisted: Phase transitions in layered ne...The statistical physics of learning revisted: Phase transitions in layered ne...
The statistical physics of learning revisted: Phase transitions in layered ne...
University of Groningen
 
Interpretable machine-learning (in endocrinology and beyond)
Interpretable machine-learning (in endocrinology and beyond)Interpretable machine-learning (in endocrinology and beyond)
Interpretable machine-learning (in endocrinology and beyond)
University of Groningen
 
Biehl hanze-2021
Biehl hanze-2021Biehl hanze-2021
Biehl hanze-2021
University of Groningen
 
2020: Prototype-based classifiers and relevance learning: medical application...
2020: Prototype-based classifiers and relevance learning: medical application...2020: Prototype-based classifiers and relevance learning: medical application...
2020: Prototype-based classifiers and relevance learning: medical application...
University of Groningen
 
2020: Phase transitions in layered neural networks: ReLU vs. sigmoidal activa...
2020: Phase transitions in layered neural networks: ReLU vs. sigmoidal activa...2020: Phase transitions in layered neural networks: ReLU vs. sigmoidal activa...
2020: Phase transitions in layered neural networks: ReLU vs. sigmoidal activa...
University of Groningen
 
2020: So you thought the ribosome was constant and conserved ...
2020: So you thought the ribosome was constant and conserved ... 2020: So you thought the ribosome was constant and conserved ...
2020: So you thought the ribosome was constant and conserved ...
University of Groningen
 
Prototype-based classifiers and their applications in the life sciences
Prototype-based classifiers and their applications in the life sciencesPrototype-based classifiers and their applications in the life sciences
Prototype-based classifiers and their applications in the life sciences
University of Groningen
 
Prototype-based models in machine learning
Prototype-based models in machine learningPrototype-based models in machine learning
Prototype-based models in machine learning
University of Groningen
 
2013: Sometimes you can trust a rat - The sbv improver species translation ch...
2013: Sometimes you can trust a rat - The sbv improver species translation ch...2013: Sometimes you can trust a rat - The sbv improver species translation ch...
2013: Sometimes you can trust a rat - The sbv improver species translation ch...
University of Groningen
 
2013: Prototype-based learning and adaptive distances for classification
2013: Prototype-based learning and adaptive distances for classification2013: Prototype-based learning and adaptive distances for classification
2013: Prototype-based learning and adaptive distances for classification
University of Groningen
 
2015: Distance based classifiers: Basic concepts, recent developments and app...
2015: Distance based classifiers: Basic concepts, recent developments and app...2015: Distance based classifiers: Basic concepts, recent developments and app...
2015: Distance based classifiers: Basic concepts, recent developments and app...
University of Groningen
 
2016: Classification of FDG-PET Brain Data
2016: Classification of FDG-PET Brain Data2016: Classification of FDG-PET Brain Data
2016: Classification of FDG-PET Brain Data
University of Groningen
 
2016: Predicting Recurrence in Clear Cell Renal Cell Carcinoma
2016: Predicting Recurrence in Clear Cell Renal Cell Carcinoma2016: Predicting Recurrence in Clear Cell Renal Cell Carcinoma
2016: Predicting Recurrence in Clear Cell Renal Cell Carcinoma
University of Groningen
 
2017: Prototype-based models in unsupervised and supervised machine learning
2017: Prototype-based models in unsupervised and supervised machine learning2017: Prototype-based models in unsupervised and supervised machine learning
2017: Prototype-based models in unsupervised and supervised machine learning
University of Groningen
 
June 2017: Biomedical applications of prototype-based classifiers and relevan...
June 2017: Biomedical applications of prototype-based classifiers and relevan...June 2017: Biomedical applications of prototype-based classifiers and relevan...
June 2017: Biomedical applications of prototype-based classifiers and relevan...
University of Groningen
 

More from University of Groningen (20)

Interpretable machine learning in endocrinology, M. Biehl, APPIS 2024
Interpretable machine learning in endocrinology, M. Biehl, APPIS 2024Interpretable machine learning in endocrinology, M. Biehl, APPIS 2024
Interpretable machine learning in endocrinology, M. Biehl, APPIS 2024
 
ESE-Eyes-2023.pdf
ESE-Eyes-2023.pdfESE-Eyes-2023.pdf
ESE-Eyes-2023.pdf
 
APPIS-FDGPET.pdf
APPIS-FDGPET.pdfAPPIS-FDGPET.pdf
APPIS-FDGPET.pdf
 
prototypes-AMALEA.pdf
prototypes-AMALEA.pdfprototypes-AMALEA.pdf
prototypes-AMALEA.pdf
 
Evidence for tissue and stage-specific composition of the ribosome: machine l...
Evidence for tissue and stage-specific composition of the ribosome: machine l...Evidence for tissue and stage-specific composition of the ribosome: machine l...
Evidence for tissue and stage-specific composition of the ribosome: machine l...
 
The statistical physics of learning revisted: Phase transitions in layered ne...
The statistical physics of learning revisted: Phase transitions in layered ne...The statistical physics of learning revisted: Phase transitions in layered ne...
The statistical physics of learning revisted: Phase transitions in layered ne...
 
Interpretable machine-learning (in endocrinology and beyond)
Interpretable machine-learning (in endocrinology and beyond)Interpretable machine-learning (in endocrinology and beyond)
Interpretable machine-learning (in endocrinology and beyond)
 
Biehl hanze-2021
Biehl hanze-2021Biehl hanze-2021
Biehl hanze-2021
 
2020: Prototype-based classifiers and relevance learning: medical application...
2020: Prototype-based classifiers and relevance learning: medical application...2020: Prototype-based classifiers and relevance learning: medical application...
2020: Prototype-based classifiers and relevance learning: medical application...
 
2020: Phase transitions in layered neural networks: ReLU vs. sigmoidal activa...
2020: Phase transitions in layered neural networks: ReLU vs. sigmoidal activa...2020: Phase transitions in layered neural networks: ReLU vs. sigmoidal activa...
2020: Phase transitions in layered neural networks: ReLU vs. sigmoidal activa...
 
2020: So you thought the ribosome was constant and conserved ...
2020: So you thought the ribosome was constant and conserved ... 2020: So you thought the ribosome was constant and conserved ...
2020: So you thought the ribosome was constant and conserved ...
 
Prototype-based classifiers and their applications in the life sciences
Prototype-based classifiers and their applications in the life sciencesPrototype-based classifiers and their applications in the life sciences
Prototype-based classifiers and their applications in the life sciences
 
Prototype-based models in machine learning
Prototype-based models in machine learningPrototype-based models in machine learning
Prototype-based models in machine learning
 
2013: Sometimes you can trust a rat - The sbv improver species translation ch...
2013: Sometimes you can trust a rat - The sbv improver species translation ch...2013: Sometimes you can trust a rat - The sbv improver species translation ch...
2013: Sometimes you can trust a rat - The sbv improver species translation ch...
 
2013: Prototype-based learning and adaptive distances for classification
2013: Prototype-based learning and adaptive distances for classification2013: Prototype-based learning and adaptive distances for classification
2013: Prototype-based learning and adaptive distances for classification
 
2015: Distance based classifiers: Basic concepts, recent developments and app...
2015: Distance based classifiers: Basic concepts, recent developments and app...2015: Distance based classifiers: Basic concepts, recent developments and app...
2015: Distance based classifiers: Basic concepts, recent developments and app...
 
2016: Classification of FDG-PET Brain Data
2016: Classification of FDG-PET Brain Data2016: Classification of FDG-PET Brain Data
2016: Classification of FDG-PET Brain Data
 
2016: Predicting Recurrence in Clear Cell Renal Cell Carcinoma
2016: Predicting Recurrence in Clear Cell Renal Cell Carcinoma2016: Predicting Recurrence in Clear Cell Renal Cell Carcinoma
2016: Predicting Recurrence in Clear Cell Renal Cell Carcinoma
 
2017: Prototype-based models in unsupervised and supervised machine learning
2017: Prototype-based models in unsupervised and supervised machine learning2017: Prototype-based models in unsupervised and supervised machine learning
2017: Prototype-based models in unsupervised and supervised machine learning
 
June 2017: Biomedical applications of prototype-based classifiers and relevan...
June 2017: Biomedical applications of prototype-based classifiers and relevan...June 2017: Biomedical applications of prototype-based classifiers and relevan...
June 2017: Biomedical applications of prototype-based classifiers and relevan...
 

Recently uploaded

(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
Scintica Instrumentation
 
role of pramana in research.pptx in science
role of pramana in research.pptx in sciencerole of pramana in research.pptx in science
role of pramana in research.pptx in science
sonaliswain16
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
pablovgd
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
Nistarini College, Purulia (W.B) India
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
fafyfskhan251kmf
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
kejapriya1
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Erdal Coalmaker
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
yqqaatn0
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Ana Luísa Pinho
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
muralinath2
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
moosaasad1975
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
Sérgio Sacani
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
AlaminAfendy1
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
SAMIR PANDA
 
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptxBody fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
muralinath2
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
muralinath2
 
in vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptxin vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptx
yusufzako14
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
YOGESH DOGRA
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
silvermistyshot
 

Recently uploaded (20)

(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
 
role of pramana in research.pptx in science
role of pramana in research.pptx in sciencerole of pramana in research.pptx in science
role of pramana in research.pptx in science
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
 
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptxBody fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
 
in vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptxin vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptx
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
 

stat-phys-AMALEA.pdf

  • 1. 1 The statistical physics of learning: typical learning curves www.cs.rug.nl/~biehl Michael Biehl AMALEA workshop, September 12, 2022
  • 2. The statistical physics of learning: typical learning curves
  Michael Biehl, AMALEA workshop, September 12, 2022 (www.cs.rug.nl/~biehl)
  • a little bit of history
  • optimization and statistical physics (in a nutshell)
  • machine learning as a special case, disorder average
  • annealed approximation, high-temperature limit, replica trick
  • typical learning curves in student/teacher scenarios
  • a very simple example: single unit, linear regression
  • nonlinear, layered neural networks:
    - phase transitions in soft committee machines
    - the role of the activation function
  • outlook / ongoing projects
  • 3-6. Statistical Physics of Neural Networks
  capacity of feed-forward networks: Elizabeth Gardner (1957-1988). The space of interactions in neural networks. J. Phys. A 21: 257 (1988)
  dynamics, attractor neural networks: John Hopfield. Neural networks and physical systems with emergent collective computational abilities. PNAS 79(8): 2554 (1982)
  learning of a rule: Géza Györgyi, Naftali Tishby. Statistical theory of learning a rule. In: Neural Networks and Spin Glasses, World Scientific, 31-36 (1990)
  reviews (annealed approximation, high-T limit, replica trick etc.): H.S. Seung, H. Sompolinsky, N. Tishby. Statistical mechanics of learning from examples. Phys. Rev. A 45: 6056 (1992)
  • 8-15. stochastic optimization
  objective/cost/energy function $H$ for many degrees of freedom
  discrete degrees of freedom, e.g. spins $S_j = \pm 1$: Metropolis algorithm
  • suggest a (small) change, e.g. a "single spin flip" $S_j \to -S_j$ for a random j
  • acceptance of the change: always if $\Delta H \le 0$, with probability $\exp(-\beta\,\Delta H)$ if $\Delta H > 0$
  • the formal temperature $T = 1/\beta$ controls the acceptance rate for "uphill" moves
  continuous degrees of freedom, e.g. weights $w \in \mathbb{R}^N$: Langevin dynamics
  • continuous temporal change, "noisy gradient descent": $\frac{dw}{dt} = -\nabla_w H(w) + \eta(t)$
  • with delta-correlated white noise (spatial + temporal independence), $\langle \eta_j(t)\,\eta_k(t') \rangle = 2T\,\delta_{jk}\,\delta(t-t')$
  • the temperature T controls the noise level, i.e. the random deviation from plain gradient descent
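  As a concrete illustration of the two schemes on this slide, here is a minimal numpy sketch; the 1d Ising chain, the quadratic cost, the step size and the seed are assumptions chosen for illustration, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_step(S, H, beta):
    """Single-spin-flip Metropolis update for discrete S_j = +/-1."""
    j = rng.integers(len(S))            # pick a random component j
    S_new = S.copy(); S_new[j] *= -1    # suggest a small change: flip spin j
    dH = H(S_new) - H(S)
    # accept always if dH <= 0, with probability exp(-beta*dH) otherwise
    return S_new if (dH <= 0 or rng.random() < np.exp(-beta * dH)) else S

def langevin_step(w, gradH, T, dt=1e-2):
    """Noisy gradient descent: dw = -grad H(w) dt + sqrt(2 T dt) * white noise."""
    return w - gradH(w) * dt + np.sqrt(2 * T * dt) * rng.standard_normal(len(w))

# discrete toy example: 1d Ising chain energy (illustration only)
H_spin = lambda S: -np.sum(S[:-1] * S[1:])
S = rng.choice([-1, 1], size=20)
for _ in range(5000):
    S = metropolis_step(S, H_spin, beta=2.0)

# continuous toy example: quadratic cost H(w) = |w - w0|^2 / 2
w0 = np.ones(10)
w = rng.standard_normal(10)
for _ in range(5000):
    w = langevin_step(w, lambda v: v - w0, T=0.01)
print(0.5 * np.sum((w - w0) ** 2))   # fluctuates near the minimum, spread set by T
```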
  • 16-20. thermal equilibrium
  Markov chain / continuous dynamics: stationary density of configurations
  $P(w) = \frac{1}{Z} \exp[-\beta H(w)]$, normalization $Z = \int d\mu(w)\, \exp[-\beta H(w)]$ ("Zustandssumme", partition function)
  Gibbs-Boltzmann density of states:
  • physics: thermal equilibrium of a physical system at temperature T
  • optimization: formal equilibrium situation, control parameter T
  • $T \to \infty,\ \beta \to 0$: the energy is irrelevant, every state contributes equally
  • $T \to 0,\ \beta \to \infty$: only the lowest energy (ground state) contributes
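  A quick numerical check of the stationary density, as a sketch with an assumed four-state toy energy: the empirical histogram of a long Metropolis chain approaches the Gibbs-Boltzmann density.

```python
import numpy as np

rng = np.random.default_rng(1)

E = np.array([0.0, 0.5, 1.0, 2.0])   # assumed toy energies H(w) of four states
beta = 1.5

# exact Gibbs-Boltzmann density P = exp(-beta*E)/Z
P_exact = np.exp(-beta * E)
P_exact /= P_exact.sum()              # Z = sum of the Boltzmann weights

# long Metropolis chain with symmetric uniform proposals
state, counts = 0, np.zeros(len(E))
for _ in range(200_000):
    prop = rng.integers(len(E))
    dE = E[prop] - E[state]
    if dE <= 0 or rng.random() < np.exp(-beta * dE):
        state = prop
    counts[state] += 1

print(np.round(P_exact, 3))
print(np.round(counts / counts.sum(), 3))   # agrees up to sampling noise
```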
  • 22-26. thermal averages
  in equilibrium, e.g. $\langle A \rangle_T = \frac{1}{Z}\int d\mu(w)\, A(w)\, \exp[-\beta H(w)]$; free energy $F = -\frac{1}{\beta}\ln Z$
  rewrite Z as an integral over all possible energies: $Z = \int dE\ \mathcal{N}(E)\, e^{-\beta E}$, where $\mathcal{N}(E)$ ~ vol. of states with energy E
  assume extensive energy, proportional to the system size N: $E = N e$, $\ln \mathcal{N}(E) = N s(e)$, hence
  $Z = \int dE \exp[-N\beta\,(e - s(e)/\beta)]$
  in large systems ($N \to \infty$), $\ln Z$ is dominated by the minimum of the free energy (density) $f = e - s/\beta \sim -\ln Z/(\beta N)$
  • 27-29. remark: saddle-point integration
  for a function $g(x)$ with maximum in $x_0$, consider the thermodynamic limit $N \to \infty$:
  $\frac{1}{N}\ln \int dx\, e^{N g(x)} \to g(x_0)$, i.e. the integral is dominated by the maximum of the exponent
  hence $\ln Z$ is given by the minimum of the free energy $f = e - s(e)/\beta$
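  A minimal numeric illustration of the saddle-point (Laplace) argument, with an assumed example function $g(x) = x - x^2$ whose maximum is $g(1/2) = 1/4$:

```python
import numpy as np

g = lambda x: x - x**2            # assumed example; maximum g(1/2) = 0.25
gmax = 0.25

x = np.linspace(-2.0, 3.0, 200_001)
for N in (10, 100, 1000):
    # factor out exp(N*gmax) for numerical stability, then integrate
    I = np.trapz(np.exp(N * (g(x) - gmax)), x)
    print(N, gmax + np.log(I) / N)   # (1/N) ln of the integral approaches gmax
```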
  • 30-32. machine learning as a special case
  machine learning: choice of adaptive parameters $w$, e.g. all weights in a neural network
  cost function defined w.r.t. the training set of input vectors and target labels (supervised), $ID = \{\xi^\mu, S(\xi^\mu)\}_{\mu=1}^P$:
  $H(w) = \sum_{\mu=1}^{P} \epsilon(w, \xi^\mu)$, a sum over examples with cost or error measure $\epsilon(\ldots)$ per example, e.g. the classification error
  interpretation of training:
  • the weights are the outcome of some stochastic optimization process with energy-dependent stationary $P(w)$
  • formal (thermal) equilibrium
  • $\langle \cdots \rangle_T$: thermal averages (over the stochastic training)
  • 33-37. disorder average
  • the energy/cost function is defined for one particular set of examples ID; typical properties require an additional average over random training data ID
  • typical properties on average over data sets: derivatives of the quenched free energy $\sim \langle \ln Z \rangle_{ID}$ yield the averages; this is difficult: replica trick, approximations
  • student/teacher scenarios:
    - define/control the complexity of the target rule and of the learning system
    - represent the target by a teacher network
  • simplest assumptions:
    - independent input vectors of i.i.d. components
    - noise-free training labels provided by the teacher network
  • 38-40. example: training of a single, linear unit
  input data: independent, identically distributed random components with $\langle \xi_j^\mu \rangle = 0$, $\langle \xi_j^\mu \xi_k^\nu \rangle = \delta_{jk}\delta^{\mu\nu}$,
  e.g. $\xi_j^\mu = \pm 1$ (with equal prob.) or Gaussian $P(\xi_j^\mu) = \frac{1}{\sqrt{2\pi}} \exp[-\frac{1}{2}(\xi_j^\mu)^2]$
  student output $g(x)$ with pre-activation (local potential) $x = \frac{1}{\sqrt N}\sum_j w_j \xi_j$, teacher output $g(y)$ with $y = \frac{1}{\sqrt N}\sum_j w_j^* \xi_j$; $x, y \sim \mathcal{O}(1)$
  weight vectors $w, w^*$ with $w^2/N = Q = \mathcal{O}(1)$, $w^{*2}/N = Q^* = \mathcal{O}(1)$
  cost function/energy, e.g. linear regression: $g(z) = z$, $\epsilon(x, y) = \frac{1}{2}(x - y)^2$, $H(w) = \sum_{\mu=1}^{P} \epsilon(x^\mu, y^\mu)$
  extensive quantity: $H \propto P = \alpha N$
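  A small simulation of this setup, as a sketch (system sizes and the seed are arbitrary choices), confirming that the pre-activations are $\mathcal{O}(1)$ with second moments given by the overlaps, and that H is extensive:

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 1000, 5000

# teacher and student weights, normalized so that Q = |w|^2/N = 1
w_star = rng.standard_normal(N); w_star *= np.sqrt(N) / np.linalg.norm(w_star)
w      = rng.standard_normal(N); w      *= np.sqrt(N) / np.linalg.norm(w)

xi = rng.choice([-1.0, 1.0], size=(P, N))   # i.i.d. binary inputs

x = xi @ w      / np.sqrt(N)                # student pre-activations, O(1)
y = xi @ w_star / np.sqrt(N)                # teacher pre-activations, O(1)

R = w @ w_star / N                          # student/teacher overlap
print("R =", R)
print("var(x), var(y), cov(x,y):", x.var(), y.var(), np.mean(x * y))
# CLT: (x, y) approx. jointly Gaussian with covariance [[Q, R], [R, Q*]]

# linear regression cost: H/P = mean of eps = (Q - 2R + Q*)/2 = 1 - R here
eps = 0.5 * (x - y) ** 2
print("H/P =", eps.mean())
```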
  • 41-45. partition function, training at temperature $T = 1/\beta$
  $Z = \int \prod_j dw_j\, \delta(w^2 - N)\, \exp[-\beta \sum_\mu \epsilon(x^\mu, y^\mu)]$, with the spherical measure abbreviated as $d\mu(w)$
  Annealed Approximation (AA): evaluate $\ln \langle Z \rangle_{ID}$ instead of $\langle \ln Z \rangle_{ID}$
  note: Jensen's inequality $\langle \ln Z \rangle_{ID} \le \ln \langle Z \rangle_{ID}$ does not imply "$\approx$" or similarity of the extrema!
  $\langle Z \rangle_{ID} = \int d\mu(w) \left\langle \exp\left[-\beta \sum_\mu \epsilon\left(\frac{w\cdot\xi^\mu}{\sqrt N}, \frac{w^*\cdot\xi^\mu}{\sqrt N}\right)\right]\right\rangle_{ID}$
  traditional approach: integral representation of $\delta\left(x^\mu - \frac{w\cdot\xi^\mu}{\sqrt N}\right)$, $\delta\left(y^\mu - \frac{w^*\cdot\xi^\mu}{\sqrt N}\right)$, explicit computation of the averages, elimination of conjugate variables, ...
  short-cut: exploit the Central Limit Theorem; for i.i.d. input components and $N \to \infty$ the local potentials $x^\mu = \frac{1}{\sqrt N}\sum_{j=1}^N w_j \xi_j^\mu$ and $y^\mu = \frac{1}{\sqrt N}\sum_{j=1}^N w_j^* \xi_j^\mu$ become jointly Gaussian
  • 46-47. the joint normal density $P(x^\mu, y^\mu)$ of the local potentials is fully specified by the disorder average:
  $\langle x^\mu \rangle = \frac{1}{\sqrt N}\sum_j w_j \langle \xi_j^\mu \rangle = 0$, $\quad \langle (x^\mu)^2 \rangle = \frac{1}{N}\sum_{j,k} w_j w_k \langle \xi_j^\mu \xi_k^\mu \rangle = \frac{1}{N}\sum_j w_j^2$
  $\langle y^\mu \rangle = 0$, $\quad \langle (y^\mu)^2 \rangle = \frac{1}{N}\sum_j (w_j^*)^2$, $\quad \langle x^\mu y^\mu \rangle = \frac{1}{N}\sum_{j,k} w_j w_k^* \langle \xi_j^\mu \xi_k^\mu \rangle = \frac{1}{N}\sum_j w_j w_j^*$
  set of order parameters: $Q = \frac{1}{N}\sum_j w_j^2\ (= 1)$, $\quad R = \frac{1}{N}\sum_j w_j w_j^*$, $\quad Q^* = \frac{1}{N}\sum_j (w_j^*)^2\ (= 1)$
  macroscopic properties of the trained network instead of microscopic details
  • 48-50. annealed free energy
  the average $\langle \cdots \rangle_{ID}$ factorizes w.r.t. $j = 1, 2, \ldots, N$ and $\mu = 1, 2, \ldots, P$:
  $\langle Z \rangle_{ID} = \int dR\, \exp[N\,(G_o(R) - \alpha\, G_1(R))]$ with $\alpha N = P$
  entropy term: $G_o(R) = \frac{1}{N} \ln \int \prod_j dw_j\, \delta(N - w^2)\, \delta(NR - w\cdot w^*)$; N-dim. geometry, independent of the model details
  energy term: $G_1(R) = -\ln \int dx\, dy\, P(x, y)\, \exp[-\beta\, \epsilon(x, y)]$; depends on model and training
  saddle-point integration for $N \to \infty$: $-\beta f_{ann} = \frac{1}{N}\ln \langle Z \rangle_{ID} = \mathrm{extr}_R\, [G_o(R) - \alpha\, G_1(R)]$
  • 51-53. the entropy term
  $G_o(R) = \frac{1}{N} \ln \int \prod_j dw_j\, \delta(N - w^2)\, \delta(NR - w\cdot w^*) \approx \cdots = \frac{1}{2}\ln(1 - R^2)$
  the hard way: integral representation of the delta functions, introducing a conjugate variable $\hat R$; saddle-point integration for large N w.r.t. $R, \hat R$
  geometry: at fixed overlap R, the student is confined to a sphere of (relative) radius $r = \sqrt{1 - R^2}$ around its projection onto $w^*$; the accessible volume scales as $V \sim (1 - R^2)^{N/2}$, so $G_o(R) = \frac{1}{N}\ln V \sim \frac{1}{2}\ln(1 - R^2)$
  general result [R. Urbanczik]: for a set of vectors with matrix $\mathcal{C}$ of pairwise dot products and norms, $G_o = \frac{1}{2}\ln\det\mathcal{C}$; here $\mathcal{C} = \begin{pmatrix} 1 & R \\ R & 1 \end{pmatrix}$
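  The geometric argument can be checked by sampling; a sketch (N, the sample size and the binning are arbitrary choices): for w drawn uniformly from the sphere, the overlap R has log-density $\frac{N-3}{2}\ln(1-R^2)$ + const, which matches $G_o(R) = \frac{1}{2}\ln(1-R^2)$ per degree of freedom for large N.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 20, 500_000

# w uniform on the unit sphere (equivalent to w^2 = N after rescaling);
# teacher direction fixed to the first axis without loss of generality
w = rng.standard_normal((M, N))
w /= np.linalg.norm(w, axis=1, keepdims=True)
R = w[:, 0]                                    # overlaps R

hist, edges = np.histogram(R, bins=60, range=(-0.6, 0.6), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
emp = np.log(hist)                             # empirical log-density
theo = 0.5 * (N - 3) * np.log(1 - centers**2)  # exact exponent on the sphere
print(np.round(emp - theo, 2))                 # nearly constant in R
```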
  • 54-57. the energy term
  $G_1(R) = -\ln \int \frac{dx\, dy}{2\pi\sqrt{1 - R^2}} \exp\left[-\frac{x^2 + y^2 - 2Rxy}{2(1 - R^2)}\right] \exp[-\beta\,\epsilon(x, y)]$
  linear regression (single linear student and teacher), $\epsilon(x, y) = \frac{1}{2}(x - y)^2$: elementary Gaussian integrals give
  $G_1(R) = \frac{1}{2}\ln[1 + 2\beta(1 - R)]$
  annealed free energy (up to an irrelevant constant and terms that vanish for $N \to \infty$):
  $-(\beta f) = -\frac{1}{2}\alpha \ln[1 + 2\beta(1 - R)] + \frac{1}{2}\ln(1 - R^2)$
  $\frac{\partial(\beta f)}{\partial R} = 0 \ \Rightarrow\ \frac{R}{1 - R^2} = \frac{\alpha\beta}{1 + 2\beta(1 - R)} \ \to\ R(\alpha)$ at a given $\beta$
  • 58-59. learning curves (linear regression)
  [figure: order parameter R vs. $\alpha$ for $\beta$ = 0.1, 1, 10, 100, 1000]
  typical success of training, i.e. student/teacher similarity, as a function of the training set size
  generalization error and training error:
  $\epsilon_g = (1 - R)$, $\qquad \epsilon_t = \frac{1}{\alpha}\frac{\partial(\beta f)}{\partial\beta} = \frac{1 - R}{1 + 2\beta(1 - R)}$
  [figure: $\epsilon_g$ and $\epsilon_t$ vs. $\alpha$ at $\beta = 1$]
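  These curves can be reproduced by solving the stationarity condition numerically; a sketch (the $\alpha$ grid and the value of $\beta$ are arbitrary):

```python
import numpy as np
from scipy.optimize import brentq

def R_of_alpha(alpha, beta):
    """Solve R/(1-R^2) = alpha*beta/(1 + 2*beta*(1-R)) for R in [0, 1)."""
    f = lambda R: R / (1 - R**2) - alpha * beta / (1 + 2 * beta * (1 - R))
    return brentq(f, 0.0, 1.0 - 1e-12)   # sign change guarantees a root

beta = 1.0
for alpha in (0.5, 1.0, 2.0, 5.0, 10.0):
    R = R_of_alpha(alpha, beta)
    eps_g = 1.0 - R
    eps_t = (1.0 - R) / (1.0 + 2.0 * beta * (1.0 - R))
    print(f"alpha={alpha:5.1f}  R={R:.3f}  eps_g={eps_g:.3f}  eps_t={eps_t:.3f}")
# R grows monotonically towards 1; eps_t stays below eps_g, both decay with alpha
```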
  • 60-63. remark: interpretation of the AA
  partition function in the AA:
  $\langle Z \rangle_{ID} = \int d\mu(w)\, \langle \exp[-\beta H(w)] \rangle_{ID} = \int d\mu(w) \int d\mu(\{\xi^\mu\}_{\mu=1}^P)\, \exp[-\beta H(\{\xi^\mu\}, w)]$
  interpretation: partition sum of a system in which weights and data are degrees of freedom that can be optimized (annealed) w.r.t. H
  correct treatment: the data constitute frozen (quenched) disorder in H
  observation/folklore: the AA works (qualitatively) well in realizable cases, e.g. student and teacher of the same complexity trained on noise-free data; the AA fails in unrealizable cases (noise, mismatch), because the hypothetical system can "adapt the data to the task", which yields over-optimistic results
  • 64-68. proper disorder average: the replica trick/method
  $\langle \ln Z \rangle_{ID} = \lim_{n\to 0} \frac{\langle Z^n \rangle_{ID} - 1}{n} = \lim_{n\to 0} \frac{\partial \langle Z^n \rangle_{ID}}{\partial n} = \lim_{n\to 0} \frac{1}{n}\ln \langle Z^n \rangle_{ID}$
  formally, for integer n: n non-interacting "copies" of the system (replicas),
  $\langle Z^n \rangle_{ID} = \int \prod_{a=1}^{n} d\mu(w^a)\, \left\langle \exp\left[-\beta \sum_\mu \sum_a \epsilon\left(\frac{w^a\cdot\xi^\mu}{\sqrt N}, \frac{w^*\cdot\xi^\mu}{\sqrt N}\right)\right]\right\rangle_{ID}$, i.e. integration over the joint density $P(x_1^\mu, x_2^\mu, \ldots, x_n^\mu, y^\mu)$
  the data set average introduces effective interactions between the replicas; saddle-point integration for $\langle Z^n \rangle_{ID}$ involves the order parameters $R_a = w^a\cdot w^*/N$, $q_{ab} = w^a\cdot w^b/N$; the quenched free energy requires analytic continuation to $n \in \mathbb{R}$ and $n \to 0$
  mathematical subtleties, replica symmetry breaking, ...: Marc Mézard, Giorgio Parisi (*), Miguel Virasoro. Spin Glass Theory and Beyond (1987). (*) Nobel Prize 2021
  • 69-71. historical :-) examples of perceptron learning curves
  student $S = \mathrm{sign}(w\cdot\xi)$, teacher $S^* = \mathrm{sign}(w^*\cdot\xi)$
  perceptron, zero-temperature training from noise-free, linearly separable data; curves for the Gibbs student, optimal generalization, Adaline with $\epsilon(x, y) = \frac{1}{2}[x - \mathrm{sign}(y)]^2$, and maximum stability; asymptotic decays $\epsilon_g \propto \alpha^{-1}$ and $\epsilon_g \propto \alpha^{-1/2}$
  more in the literature: label noise, teacher weight noise, variational optimization of the cost function, weight decay, worst-case training, ...
  • 72-76. training at high temperatures
  the AA becomes exact in the limit $T \to \infty$ (the replicas decouple)
  energy term for $\beta \to 0$:
  $G_1(R) = -\ln \int dx\, dy\, P(x, y)\, e^{-\beta\epsilon(x, y)} \approx -\ln \int dx\, dy\, P(x, y)\,(1 - \beta\,\epsilon(x, y)) \approx -\ln[1 - \beta\,\langle\epsilon(x, y)\rangle_{\{x,y\}}] \approx \beta\,\epsilon_g$
  the generalization error (average error over arbitrary inputs) appears!
  free energy: $\beta f \approx (\alpha\beta)\,\epsilon_g - G_o(R)$, only meaningful if $(\alpha\beta) = \mathcal{O}(1)$: $\beta \to 0$, $T \to \infty$ together with $\alpha = P/N \to \infty$, i.e. learn almost nothing from infinitely many examples
  caveats: here P and T cannot be varied independently; $\epsilon_g$ and $\epsilon_t$ are indistinguishable (the input space is sampled perfectly)
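  For the linear example above, $\epsilon_g = 1 - R$ and the high-T free energy can be minimized directly; a sketch (the $\tilde\alpha = \alpha\beta$ values are arbitrary) comparing the numerical minimum of $\tilde\alpha(1 - R) - \frac{1}{2}\ln(1 - R^2)$ with its closed-form stationary point:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def beta_f(R, alpha_tilde):
    """High-T free energy of the linear unit: (alpha*beta)*eps_g - Go(R)."""
    return alpha_tilde * (1.0 - R) - 0.5 * np.log(1.0 - R**2)

for at in (0.5, 1.0, 2.0, 5.0):
    res = minimize_scalar(beta_f, args=(at,), bounds=(0.0, 1.0 - 1e-9),
                          method="bounded")
    # stationarity alpha_tilde = R/(1-R^2), solved in closed form:
    R_exact = (np.sqrt(1.0 + 4.0 * at**2) - 1.0) / (2.0 * at)
    print(at, res.x, R_exact)    # numerical minimum matches the closed form
```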
  • 77-80. layered networks: "soft committee machines" (SCM)
  adaptive student: N inputs, K hidden units; a teacher network parameterizes the target
  consider specific activation functions, e.g. sigmoidal / ReLU
  thermodynamic limit $N \to \infty$: description in terms of order parameters, student/teacher overlaps $R_{ik} = w_i\cdot w_k^*/N$ and student/student overlaps $Q_{ik} = w_i\cdot w_k/N$
  site symmetry / hidden-unit specialization ansatz: $R_{ik} = R\,\delta_{ik} + S\,(1 - \delta_{ik})$, $Q_{ik} = Q\,\delta_{ik} + C\,(1 - \delta_{ik})$
  • 81-83. typical SCM learning curves (high-T)
  from the high-T free energy: express the generalization error $\epsilon_g$ and the entropy s as functions of $\{R, S, Q, C\}$; study $\epsilon_g$ as a function of the training set size
  sigmoidal activation: discontinuous phase transition (for K > 2) between an unspecialized / anti-specialized branch ($R = S$ resp. $R < S$) and a specialized branch ($R > S$); the poorly performing phase persists for large data sets
  ReLU activation: continuous transition from the unspecialized state ($R = S$) into specialized ($R > S$) and anti-specialized ($R < S$) branches with similar performances; lower (free) energy barrier
  • 84-85. layered networks (SCM): references
  sigmoidal activation, high-T, arbitrary K: Phase Transitions in Soft-Committee Machines. M. Biehl, E. Schlösser, M. Ahr. Europhysics Letters 44: 261-267 (1998)
  sigmoidal activation, replica, large $K = M \to \infty$: Statistical Physics and Practical Training of Soft-Committee Machines. M. Ahr, M. Biehl, R. Urbanczik. Eur. Phys. J. B 10: 583-588 (1999)
  ReLU activation, high-T, arbitrary K: Hidden Unit Specialization in Layered Neural Networks: ReLU vs. Sigmoidal Activation. E. Oostwal, M. Straat, M. Biehl. Physica A 564: 125517 (2021)
  challenges: more general activation functions; overfitting/underfitting; low temperatures (AA, replica); many layers (deep networks); non-trivial (realistic) input densities [Zdeborová, Goldt, Mézard, ...]
  • 86-89. on-going & future work
  The Role of the Activation Function in Feedforward Learning Systems (RAFFLES): NWO-funded project, Frederieke Richert
  Robust Learning of Sparse Representations: Brain-inspired Inhibition and Statistical Physics Analysis: 2 PhD projects funded by the Groningen Cognitive Systems and Materials Centre CogniGron, in collaboration with George Azzopardi
  - study network architectures and training schemes which favor sparse activity and sparse connectivity
  - consider activation functions which relate to hardware-realizable adaptive systems
  see www.cs.rug.nl/~biehl (link to description and application form, deadline: 29 September 2022)
  • 90. MiWoCI 2022: no statistical physics :-)
  • 91. www.cs.rug.nl/~biehl · m.biehl@rug.nl · twitter: @michaelbiehl13