The statistical physics of learning: typical learning curves
Michael Biehl, www.cs.rug.nl/~biehl
AMALEA workshop, September 12, 2022
• a little bit of history
• optimization and statistical physics (in a nutshell)
• machine learning as a special case, disorder average
• annealed approximation, high-temperature limit, replica trick
• typical learning curves in student/teacher scenarios
• a very simple example: single unit, linear regression
• nonlinear, layered neural networks:
  - phase transitions in soft committee machines
  - the role of the activation function
• outlook / ongoing projects
Statistical Physics of Neural Networks

capacity of feed-forward networks:
Elizabeth Gardner (1957-1988). The space of interactions in neural networks. J. Phys. A 21: 257 (1988)

dynamics, attractor neural networks:
John Hopfield. Neural networks and physical systems with emergent collective computational abilities. PNAS 79(8): 2554 (1982)

learning of a rule:
Geza Györgyi, Naftali Tishby. Statistical theory of learning a rule. In: Neural Networks and Spin Glasses, World Scientific, 31-36 (1990)

reviews: annealed approximation, high-T limit, replica trick etc.:
S. Seung, H. Sompolinsky, N. Tishby. Statistical mechanics of learning from examples. Phys. Rev. A 45: 6056 (1992)
stochastic optimization

objective/cost/energy function H(w) for many degrees of freedom w

discrete, e.g. Metropolis algorithm:
• suggest a (small) change, e.g. a „single spin flip“ wj → −wj for a random j
• acceptance of the change:
  - always if the energy decreases, ΔH ≤ 0
  - with probability exp(−ΔH/T) if ΔH > 0
• the temperature T controls the acceptance rate for „uphill“ moves

continuous, e.g. Langevin dynamics:
• continuous temporal change, „noisy gradient descent“:
  dw/dt = −∇H(w) + η(t)
• with delta-correlated white noise η(t), ⟨ηj(t) ηk(t′)⟩ = 2T δjk δ(t − t′)
  (spatial + temporal independence)
• the temperature T controls the noise level, i.e. the random deviation from the gradient

(a minimal numerical sketch of both schemes follows below)
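A minimal numerical sketch of both schemes; the toy energies (a periodic ±1 spin chain for Metropolis, a quadratic H(w) for Langevin) are my illustrative choices, not examples from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# discrete: Metropolis single-spin-flip for w_j = ±1,
# toy energy H(w) = -sum_j w_j w_{j+1} (periodic 1d chain)
def metropolis_step(w, T):
    j = rng.integers(len(w))
    dH = 2.0 * w[j] * (w[j - 1] + w[(j + 1) % len(w)])  # energy change of the flip
    # accept always if dH <= 0, with probability exp(-dH/T) for "uphill" moves
    if dH <= 0 or rng.random() < np.exp(-dH / T):
        w[j] = -w[j]
    return w

w = rng.choice([-1, 1], size=100)
for _ in range(20000):
    w = metropolis_step(w, T=0.5)

# continuous: Langevin dynamics dw/dt = -grad H(w) + eta(t),
# toy energy H(w) = |w|^2 / 2, hence grad H(w) = w
def langevin(w, grad_H, T=0.1, dt=1e-2, steps=5000):
    for _ in range(steps):
        noise = np.sqrt(2.0 * T * dt) * rng.standard_normal(w.shape)  # white noise
        w = w - dt * grad_H(w) + noise
    return w

w_cont = langevin(rng.standard_normal(50), grad_H=lambda w: w)
print("magnetization:", w.mean(), "  <w_j^2> (Langevin, ~T):", (w_cont**2).mean())
```

At low T the Metropolis chain mostly moves downhill; for the quadratic H the Langevin dynamics equilibrates to ⟨wj²⟩ ≈ T.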
thermal equilibrium

Markov chain / continuous dynamics → stationary density of configurations:

P(w) = (1/Z) exp[−β H(w)],   β = 1/T

normalization Z = ∫ dμ(w) exp[−β H(w)]: the „Zustandssumme“, partition function

Gibbs-Boltzmann density of states:
• physics: thermal equilibrium of a physical system at temperature T
• optimization: formal equilibrium situation, control parameter T

limiting cases:
• T → ∞, β → 0: the energy is irrelevant, every state contributes equally
• T → 0, β → ∞: only the lowest energy (ground state) contributes
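A small numerical illustration of the two limits, for a hand-picked set of toy energy levels:

```python
import numpy as np

H = np.array([0.0, 0.5, 1.0, 3.0])            # toy energies H(w), my assumption

def gibbs(H, beta):
    w = np.exp(-beta * (H - H.min()))          # shift by the ground state for stability
    return w / w.sum()                         # divide by the partition function Z

for beta in [0.0, 1.0, 100.0]:                 # beta = 1/T
    print(f"beta = {beta:6.1f}   P(w) = {np.round(gibbs(H, beta), 4)}")
# beta -> 0: every state contributes equally; beta large: only the ground state
```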
free energy

thermal averages in equilibrium, for instance ⟨⋯⟩T

re-write Z as an integral over all possible energies,
Z = ∫ dE ω(E) exp[−βE], with ω(E) ∼ volume of states with energy E

assume extensive energy, proportional to the system size N: E = N e and, with the entropy density s(e) = (1/N) ln ω,

Z = ∫ dE exp[−Nβ (e − s(e)/β)]

in large systems (N → ∞), ln Z is dominated by the minimum of the free energy (density)

f = e − s(e)/β ∼ −ln Z/(βN)
remark: saddle point integration

for a function φ(e) with a maximum at e₀, consider the thermodynamic limit N → ∞:

(1/N) ln ∫ de exp[N φ(e)] → φ(e₀)

hence −ln Z/(βN) is given by the minimum of the free energy f = e − s(e)/β
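A quick numerical check of this argument with a toy φ(e) (my choice; maximum value 0, attained at e₀ = 0.3):

```python
import numpy as np

phi = lambda e: -(e - 0.3) ** 2
e = np.linspace(-2.0, 2.0, 20001)
de = e[1] - e[0]

for N in [10, 100, 1000, 10000]:
    # Riemann sum of exp[N phi(e)]; factor out the maximum for numerical stability
    log_int = N * phi(e).max() + np.log(np.exp(N * (phi(e) - phi(e).max())).sum() * de)
    print(f"N = {N:6d}   (1/N) ln ∫ de e^(N phi) = {log_int / N:+.6f}   (max phi = 0)")
```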
machine learning

special case machine learning: choice of adaptive weights w, e.g. all weights in a neural network

cost function, defined w.r.t. the training data ID = {ξ^μ, S(ξ^μ)}_{μ=1}^P:

H(w) = Σ_{μ=1}^P ε(w, ξ^μ)

sum over examples, e.g. input vectors ξ^μ and target labels S(ξ^μ) (supervised);
ε(...): cost or error measure per example, e.g. the classification error

interpretation of training:
• weights are the outcome of some stochastic optimization process with energy-dependent stationary density P(w)
• formal (thermal) equilibrium
• ⟨⋯⟩T: thermal averages (over the stochastic training)
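As a concrete illustration, a sketch of this training energy for a toy supervised data set; the per-example error ε is chosen here as the classification error of a simple perceptron (my choice, not the deck's):

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 50, 200
xi = rng.standard_normal((P, N))           # input vectors xi^mu
w_star = rng.standard_normal(N)
S = np.sign(xi @ w_star)                   # target labels S(xi^mu) from a teacher

def H(w):
    return np.sum(np.sign(xi @ w) != S)    # number of misclassified examples

print("H(random w) =", H(rng.standard_normal(N)), "of", P)
print("H(teacher)  =", H(w_star))          # realizable rule: zero training error
```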
disorder average

• the energy/cost function is defined for one particular set of examples;
  typical properties: additional average ⟨⋯⟩_ID over random training data ID

• typical properties on average over data sets: derivatives of the
  quenched free energy ∼ ⟨ln Z⟩_ID yield the averages;
  difficult: replica trick, approximations

• student / teacher scenarios
  - define/control the complexity of the target rule and the learning system
  - represent the target by a teacher network

• simplest assumptions:
  - independent input vectors of i.i.d. components
  - noise-free training labels provided by the teacher network
example: training of a single, linear unit

input data: independent, identically distributed random components with

⟨ξ^μ_j⟩ = 0;   ⟨ξ^μ_j ξ^ν_k⟩ = δ_jk δ_μν

e.g. ξ^μ_j = ±1 (with equal prob.) or P(ξ^μ_j) = (1/√(2π)) exp[−(ξ^μ_j)²/2]

student output: g(x) with pre-activation (local potential) x := (1/√N) Σ_j w_j ξ_j
teacher output: g(y) with pre-activation (local potential) y := (1/√N) Σ_j w*_j ξ_j

weight vectors w, w* with w²/N = Q = 𝒪(1) and w*²/N = Q* = 𝒪(1), so that x, y ∼ 𝒪(1)

cost function, energy, e.g. linear regression: g(z) = z, ε(x, y) = (1/2)(x − y)² and

H(w) = Σ_{μ=1}^P ε(x^μ, y^μ)

extensive quantity: H ∝ P = αN
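A sketch of this set-up (sizes and random seed are my choices), checking that the pre-activations are 𝒪(1) and reading off the order parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 1000, 5000

def spherical(N):
    v = rng.standard_normal(N)
    return v * np.sqrt(N) / np.linalg.norm(v)    # enforce w^2 = N, i.e. Q = 1

w, w_star = spherical(N), spherical(N)
xi = rng.choice([-1.0, 1.0], size=(P, N))        # i.i.d. ±1 input components

x = xi @ w / np.sqrt(N)                          # student pre-activations x^mu
y = xi @ w_star / np.sqrt(N)                     # teacher pre-activations y^mu

Q, Q_star, R = w @ w / N, w_star @ w_star / N, w @ w_star / N
print(f"Q = {Q:.3f}   Q* = {Q_star:.3f}   R = {R:+.3f}")
print(f"var(x) = {x.var():.3f}   var(y) = {y.var():.3f}   <xy> = {np.mean(x*y):+.3f}")

H = 0.5 * np.sum((x - y) ** 2)                   # linear regression energy
print(f"H/P = {H/P:.3f}   vs   (Q + Q* - 2R)/2 = {(Q + Q_star - 2*R)/2:.3f}")
```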
partition function, training at T = 1/β

Z = ∫ Π_j dw_j δ(w² − N) exp[−β Σ_μ ε(x^μ, y^μ)]   (spherical measure, written dμ(w) below)

Annealed Approximation: ln ⟨Z⟩_ID instead of ⟨ln Z⟩_ID

caution: ⟨ln Z⟩_ID ≤ ln ⟨Z⟩_ID does not imply „≈“ or similarity of the extrema!

⟨Z⟩_ID = ∫ dμ(w) ⟨ exp[−β Σ_μ ε( w·ξ^μ/√N , w*·ξ^μ/√N )] ⟩_ID

traditional approach: integral representation of δ(x^μ − w·ξ^μ/√N), δ(y^μ − w*·ξ^μ/√N),
explicit computation of the averages …, elimination of conjugate variables …

short-cut: exploit the Central Limit Theorem for i.i.d. input components and N → ∞:

x^μ = (1/√N) Σ_{j=1}^N w_j ξ^μ_j,   y^μ = (1/√N) Σ_{j=1}^N w*_j ξ^μ_j
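A numerical illustration of the short-cut (my addition): for ±1 components, x = w·ξ/√N approaches a Gaussian as N grows, checked here via the excess kurtosis:

```python
import numpy as np

rng = np.random.default_rng(3)
P = 100000                                       # number of sampled inputs
for N in [2, 10, 100]:
    w = rng.standard_normal(N)
    w *= np.sqrt(N) / np.linalg.norm(w)          # normalize to w^2 = N
    xi = rng.choice([-1.0, 1.0], size=(P, N))
    x = xi @ w / np.sqrt(N)
    kurt = np.mean(x**4) / np.mean(x**2) ** 2 - 3.0   # 0 for a Gaussian
    print(f"N = {N:4d}   var(x) = {x.var():.3f}   excess kurtosis = {kurt:+.3f}")
```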
disorder average

joint normal density P(x^μ, y^μ) of the local potentials, fully specified by

⟨x^μ⟩ = (1/√N) Σ_{j=1}^N w_j ⟨ξ^μ_j⟩ = 0,   ⟨y^μ⟩ = 0

⟨(x^μ)²⟩ = (1/N) Σ_{j,k=1}^N w_j w_k ⟨ξ^μ_j ξ^μ_k⟩ = (1/N) Σ_{j=1}^N w_j²

⟨(y^μ)²⟩ = (1/N) Σ_{j=1}^N (w*_j)²,   ⟨x^μ y^μ⟩ = (1/N) Σ_{j,k=1}^N w_j w*_k ⟨ξ^μ_j ξ^μ_k⟩ = (1/N) Σ_{j=1}^N w_j w*_j

set of order parameters:

(1/N) Σ_j w_j² = Q (= 1),   (1/N) Σ_j w_j w*_j = R,   (1/N) Σ_j (w*_j)² = Q* (= 1)

macroscopic properties of the trained network instead of microscopic details
annealed free energy

⟨Z⟩_ID = ∫ dR exp[N (G_o(R) − α G_1(R))]   with αN = P

(⟨⋯⟩_ID factorizes w.r.t. j = 1, 2, …, N and μ = 1, 2, …, P)

entropy term: G_o(R) = (1/N) ln ∫ Π_j dw_j δ(N − w²) δ(NR − w·w*)
(N-dim. geometry, independent of model details)

energy term: G_1(R) = −ln ∫ dx dy P(x, y) exp[−β ε(x, y)]
(model, training)

saddle-point integration for N → ∞, annealed free energy:

−β f_ann = (1/N) ln ⟨Z⟩_ID = extr_R [G_o(R) − α G_1(R)]
the entropy term

G_o(R) = (1/N) ln ∫ Π_j dw_j δ(N − w²) δ(NR − w·w*) ≈ … = (1/2) ln(1 − R²)

the hard way:
- integral representation of the delta-functions, introducing a conjugate variable R̂
- saddle-point integration for large N w.r.t. R, R̂

geometry: for fixed overlap R of the normalized w with w*, the accessible sphere has radius r = √(1 − R²), hence the volume scales as V ∼ (1 − R²)^{N/2} and

G_o(R) = (1/N) ln V ∼ (1/2) ln(1 − R²)

general result for a set of vectors, with matrix 𝒞 of pairwise dot-products and norms [R. Urbanczik]:
G_o = (1/2) ln det 𝒞;   here 𝒞 = (1 R; R 1)
the energy term

G_1(R) = −ln ∫ dx dy / (2π √(1 − R²)) exp[−(x² + y² − 2Rxy) / (2(1 − R²))] exp[−β ε(x, y)]

linear regression (single linear student and teacher), ε(x, y) = (1/2)(x − y)²;
elementary Gaussian integrals give

G_1(R) = (1/2) ln[1 + 2β(1 − R)]

annealed free energy (+ irrelevant constants and terms that vanish for N → ∞):

−(βf) = −(1/2) α ln[1 + 2β(1 − R)] + (1/2) ln(1 − R²)

∂(βf)/∂R = 0  ⇒  R/(1 − R²) = αβ/(1 + 2β(1 − R))  →  R(α) at a given β
learning curves (linear regression)

typical success of training, i.e. student/teacher similarity as a function of the training set size:
[figure: R vs. α for β = 0.1, 1, 10, 100, 1000]

generalization error and training error (shown for β = 1):

ϵ_g = (1 − R),   ϵ_t = (1/α) ∂(βf)/∂β = (1 − R)/(1 + 2β(1 − R))

[figure: ϵ_g and ϵ_t vs. α for β = 1]
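A sketch reproducing these curves numerically: solve the saddle-point condition for R(α) by bisection (my implementation choice) and evaluate ϵ_g and ϵ_t:

```python
import numpy as np

def R_of_alpha(alpha, beta):
    f = lambda R: R / (1 - R**2) - alpha * beta / (1 + 2 * beta * (1 - R))
    lo, hi = 0.0, 1.0 - 1e-12      # physical solution in [0, 1); f(lo) < 0, f(hi) > 0
    for _ in range(200):           # plain bisection
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

beta = 1.0
for alpha in [0.5, 1.0, 2.0, 5.0, 10.0]:
    R = R_of_alpha(alpha, beta)
    eps_g = 1.0 - R
    eps_t = (1.0 - R) / (1.0 + 2.0 * beta * (1.0 - R))
    print(f"alpha = {alpha:5.1f}   R = {R:.4f}   eps_g = {eps_g:.4f}   eps_t = {eps_t:.4f}")
```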
remark: interpretation of the AA

partition function in the AA:

⟨Z⟩_ID = ∫ dμ(w) ⟨exp[−β H(w)]⟩_ID = ∫ dμ(w) ∫ dμ({ξ^μ}_{μ=1}^P) exp[−β H({ξ^μ}, w)]

interpretation: partition sum of a system in which weights and data are degrees of freedom that can be optimized (annealed) w.r.t. H

correct treatment: the data constitutes frozen disorder in H

observation/folklore: the AA works (qualitatively) well in realizable cases, e.g. student and teacher of the same complexity and noise-free data; it fails in unrealizable cases (noise, mismatch), because the (hypothetical) system can „adapt the data to the task“, which yields over-optimistic results
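A brute-force toy illustration (my construction, far smaller than anything in the theory) of the gap ⟨ln Z⟩_ID ≤ ln ⟨Z⟩_ID, for a binary perceptron with enumerable weight space:

```python
import numpy as np

rng = np.random.default_rng(4)
beta, N, P, n_sets = 2.0, 9, 12, 2000

# all 2^N binary weight vectors w in {-1, +1}^N (brute-force enumeration)
w_all = np.array(np.meshgrid(*([[-1.0, 1.0]] * N))).reshape(N, -1).T

lnZ = np.empty(n_sets)
for s in range(n_sets):
    xi = rng.choice([-1.0, 1.0], size=(P, N))
    w_star = rng.choice([-1.0, 1.0], size=N)
    S = np.sign(xi @ w_star)                    # N odd, so no zero pre-activations
    H = (np.sign(xi @ w_all.T) != S[:, None]).sum(axis=0)   # training errors per w
    lnZ[s] = np.log(np.exp(-beta * H).sum())

print("quenched <ln Z> =", lnZ.mean())
print("annealed ln <Z> =", np.log(np.exp(lnZ).mean()), " (Jensen: >= <ln Z>)")
```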
proper disorder average: replica trick/method

replica trick, formally: n non-interacting „copies“ of the system (replicas):

⟨ln Z⟩_ID = lim_{n→0} (⟨Z^n⟩_ID − 1)/n = lim_{n→0} ∂⟨Z^n⟩_ID/∂n = lim_{n→0} (1/n) ln ⟨Z^n⟩_ID

⟨Z^n⟩_ID = ∫ Π_{a=1}^n dμ(w^a) ⟨exp[−β Σ_μ Σ_a ε( w^a·ξ^μ/√N , w*·ξ^μ/√N )]⟩_ID

(integration over the joint density P(x^μ_1, x^μ_2, …, x^μ_n, y^μ))

the data set average introduces effective interactions between the replicas;
… saddle-point integration for ⟨Z^n⟩_ID; the quenched free energy involves the order parameters

R^a = w^a·w*/N,   q^{ab} = w^a·w^b/N

and requires analytic continuation for n ∈ ℝ and n → 0

mathematical subtleties, replica symmetry-breaking …
Marc Mézard, Giorgio Parisi (*), Miguel Virasoro. Spin Glass Theory and Beyond (1987).   (*) Nobel Prize 2021
historical :-) examples of perceptron learning curves

student: S = sign(w·ξ),   teacher: S* = sign(w*·ξ)

perceptron, zero-temperature training from noise-free, linearly separable data:
• Gibbs student, optimal generalization, maximum stability: ϵ_g ∝ α^{−1}
• Adaline, ε(x, y) = (1/2) [x − sign(y)]²: ϵ_g ∝ α^{−1/2}

more in the literature:
- label noise
- teacher weight noise
- variational optimization of the cost function
- weight decay
- worst-case training
- …

(a simulation sketch follows below)
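For comparison, a simulation sketch (my addition; it uses the plain Rosenblatt perceptron rule, not the Gibbs, optimal or maximum-stability algorithms above) measuring ϵ_g = arccos(R)/π from the student/teacher overlap R:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 200
w_star = rng.standard_normal(N)

for alpha in [1, 2, 4, 8]:
    P = alpha * N
    xi = rng.standard_normal((P, N))
    S = np.sign(xi @ w_star)                   # noise-free teacher labels
    w = np.zeros(N)
    for _ in range(100):                       # sweeps of Rosenblatt updates
        wrong = np.sign(xi @ w) != S
        if not wrong.any():
            break
        for mu in np.flatnonzero(wrong)[:50]:  # update a batch of errors
            w += S[mu] * xi[mu] / np.sqrt(N)
    R = w @ w_star / (np.linalg.norm(w) * np.linalg.norm(w_star))
    print(f"alpha = {alpha}   R = {R:.3f}   eps_g = {np.arccos(R)/np.pi:.3f}")
```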
training at high temperatures

the AA becomes exact in the limit T → ∞ (the replicas decouple)

energy term for β → 0:

G_1(R) = −ln ∫ dx dy P(x, y) exp[−β ε(x, y)]
       ≈ −ln ∫ dx dy P(x, y) (1 − β ε(x, y))
       ≈ −ln [1 − β ⟨ε(x, y)⟩_{x,y}] ≈ β ϵ_g

the generalization error! (average over arbitrary input)

free energy: βf ≈ (αβ) ϵ_g − G_o(R)

only meaningful if (αβ) = 𝒪(1): for β → 0, T → ∞ this requires α = P/N → ∞,
i.e. one learns almost nothing from infinitely many examples

here: P and T cannot be varied independently;
ϵ_g and ϵ_t are indistinguishable (input space is sampled perfectly)
layered networks: “soft committee machines” (SCM)

adaptive student: N inputs, K hidden units;
teacher parameterizes the target

consider specific activation functions, e.g. sigmoidal / ReLU

thermodynamic limit N → ∞: description in terms of order parameters
(student/teacher and student/student overlaps);
site symmetry / hidden unit specialization
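A minimal SCM forward pass (my notation: fixed hidden-to-output weights of +1, adaptive first layer W of shape (K, N)), with a Monte Carlo estimate of the generalization error; the deck's sigmoidal g is erf-like, tanh serves as a stand-in here:

```python
import numpy as np

def scm(W, xi, g):
    N = xi.shape[-1]
    return g(xi @ W.T / np.sqrt(N)).sum(axis=-1)   # sigma(xi) = sum_k g(w_k·xi/sqrt(N))

rng = np.random.default_rng(6)
N, K, P = 100, 3, 10000
W_student = rng.standard_normal((K, N))
W_teacher = rng.standard_normal((K, N))            # teacher with M = K hidden units
xi = rng.standard_normal((P, N))                   # Monte Carlo test inputs

for name, g in [("sigmoidal (tanh)", np.tanh), ("ReLU", lambda z: np.maximum(z, 0.0))]:
    eps_g = 0.5 * np.mean((scm(W_student, xi, g) - scm(W_teacher, xi, g)) ** 2)
    print(f"{name:16s}   eps_g ≈ {eps_g:.4f}")
```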
typical SCM learning curves (high-T)

express generalization error ϵ_g and entropy s as functions of {R, S, Q, C};
ϵ_g as a function of the training set size from the high-T free energy:

sigmoidal: discontinuous phase transition (K > 2)
- un-specialized (R = S) | anti-specialized (R < S) vs. specialized (R > S)
- the poor-performing phase persists for large data sets

ReLU: continuous transition
- anti-specialized (R < S) vs. specialized (R > S), meeting at R = S
- similar performances, lower (free) energy barrier
layered networks (SCM): references

sigmoidal activation, high-T, arbitrary K:
Phase Transitions in Soft-Committee Machines. M. Biehl, E. Schlösser, M. Ahr. Europhysics Letters 44: 261-267 (1998)

sigmoidal activation, replica, large K = M → ∞:
Statistical Physics and Practical Training of Soft-Committee Machines. M. Ahr, M. Biehl, R. Urbanczik. European Physical Journal B 10: 583-588 (1999)

ReLU activation, high-T, arbitrary K:
Hidden Unit Specialization in Layered Neural Networks: ReLU vs. Sigmoidal Activation. E. Oostwal, M. Straat, M. Biehl. Physica A 564: 125517 (2021)

challenges:
- more general activation functions
- overfitting/underfitting
- low temperatures: AA, replica
- many layers (deep networks)
- non-trivial (realistic) input densities [Zdeborová, Goldt, Mézard, …]
on-going & future work

The Role of the Activation Function in Feedforward Learning Systems (RAFFLES)
NWO-funded project, Frederieke Richert

Robust Learning of Sparse Representations: Brain-inspired Inhibition and Statistical Physics Analysis
2 PhD projects funded by the Groningen Cognitive Systems and Materials Centre CogniGron, in collaboration with George Azzopardi
- study network architectures and training schemes which favor sparse activity and sparse connectivity
- consider activation functions which relate to hardware-realizable adaptive systems

see: www.cs.rug.nl/~biehl (link to description and application form, deadline: 29 September 2022)
MiWoCI 2022: no statistical physics :-)
www.cs.rug.nl/~biehl   m.biehl@rug.nl   twitter: @michaelbiehl13

More Related Content

Similar to stat-phys-AMALEA.pdf

Algorithmic Thermodynamics
Algorithmic ThermodynamicsAlgorithmic Thermodynamics
Algorithmic Thermodynamics
Sunny Kr
 
Unit iii update
Unit iii updateUnit iii update
Unit iii update
Indira Priyadarsini
 
nber_slides.pdf
nber_slides.pdfnber_slides.pdf
nber_slides.pdf
ssuser05b736
 
Neural Networks: Least Mean Square (LSM) Algorithm
Neural Networks: Least Mean Square (LSM) AlgorithmNeural Networks: Least Mean Square (LSM) Algorithm
Neural Networks: Least Mean Square (LSM) Algorithm
Mostafa G. M. Mostafa
 
Functional specialization in human cognition: a large-scale neuroimaging init...
Functional specialization in human cognition: a large-scale neuroimaging init...Functional specialization in human cognition: a large-scale neuroimaging init...
Functional specialization in human cognition: a large-scale neuroimaging init...
Ana Luísa Pinho
 
Mathematics Colloquium, UCSC
Mathematics Colloquium, UCSCMathematics Colloquium, UCSC
Mathematics Colloquium, UCSC
dongwook159
 
Learning to Reconstruct
Learning to ReconstructLearning to Reconstruct
Learning to Reconstruct
Jonas Adler
 
Hodgkin-Huxley & the nonlinear dynamics of neuronal excitability
Hodgkin-Huxley & the nonlinear  dynamics of neuronal excitabilityHodgkin-Huxley & the nonlinear  dynamics of neuronal excitability
Hodgkin-Huxley & the nonlinear dynamics of neuronal excitability
SSA KPI
 
Hmm and neural networks
Hmm and neural networksHmm and neural networks
Hmm and neural networks
Janani Ramasamy
 
Materials Modelling: From theory to solar cells (Lecture 1)
Materials Modelling: From theory to solar cells  (Lecture 1)Materials Modelling: From theory to solar cells  (Lecture 1)
Materials Modelling: From theory to solar cells (Lecture 1)
cdtpv
 
danreport.doc
danreport.docdanreport.doc
danreport.docbutest
 
E05731721
E05731721E05731721
E05731721
IOSR-JEN
 
CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...
CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...
CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...
The Statistical and Applied Mathematical Sciences Institute
 
Introduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and AlgorithmsIntroduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and Algorithms
NBER
 
A New Neural Network For Solving Linear Programming Problems
A New Neural Network For Solving Linear Programming ProblemsA New Neural Network For Solving Linear Programming Problems
A New Neural Network For Solving Linear Programming Problems
Jody Sullivan
 
SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Rea...
SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Rea...SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Rea...
SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Rea...
Tobias Wunner
 
A walk through the intersection between machine learning and mechanistic mode...
A walk through the intersection between machine learning and mechanistic mode...A walk through the intersection between machine learning and mechanistic mode...
A walk through the intersection between machine learning and mechanistic mode...
JuanPabloCarbajal3
 

Similar to stat-phys-AMALEA.pdf (20)

Algorithmic Thermodynamics
Algorithmic ThermodynamicsAlgorithmic Thermodynamics
Algorithmic Thermodynamics
 
Unit iii update
Unit iii updateUnit iii update
Unit iii update
 
nber_slides.pdf
nber_slides.pdfnber_slides.pdf
nber_slides.pdf
 
Neural Networks: Least Mean Square (LSM) Algorithm
Neural Networks: Least Mean Square (LSM) AlgorithmNeural Networks: Least Mean Square (LSM) Algorithm
Neural Networks: Least Mean Square (LSM) Algorithm
 
Functional specialization in human cognition: a large-scale neuroimaging init...
Functional specialization in human cognition: a large-scale neuroimaging init...Functional specialization in human cognition: a large-scale neuroimaging init...
Functional specialization in human cognition: a large-scale neuroimaging init...
 
Ann
Ann Ann
Ann
 
eviewsOLSMLE
eviewsOLSMLEeviewsOLSMLE
eviewsOLSMLE
 
Mathematics Colloquium, UCSC
Mathematics Colloquium, UCSCMathematics Colloquium, UCSC
Mathematics Colloquium, UCSC
 
Learning to Reconstruct
Learning to ReconstructLearning to Reconstruct
Learning to Reconstruct
 
Hodgkin-Huxley & the nonlinear dynamics of neuronal excitability
Hodgkin-Huxley & the nonlinear  dynamics of neuronal excitabilityHodgkin-Huxley & the nonlinear  dynamics of neuronal excitability
Hodgkin-Huxley & the nonlinear dynamics of neuronal excitability
 
Hmm and neural networks
Hmm and neural networksHmm and neural networks
Hmm and neural networks
 
Materials Modelling: From theory to solar cells (Lecture 1)
Materials Modelling: From theory to solar cells  (Lecture 1)Materials Modelling: From theory to solar cells  (Lecture 1)
Materials Modelling: From theory to solar cells (Lecture 1)
 
danreport.doc
danreport.docdanreport.doc
danreport.doc
 
E05731721
E05731721E05731721
E05731721
 
20120140503023
2012014050302320120140503023
20120140503023
 
CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...
CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...
CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...
 
Introduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and AlgorithmsIntroduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and Algorithms
 
A New Neural Network For Solving Linear Programming Problems
A New Neural Network For Solving Linear Programming ProblemsA New Neural Network For Solving Linear Programming Problems
A New Neural Network For Solving Linear Programming Problems
 
SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Rea...
SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Rea...SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Rea...
SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Rea...
 
A walk through the intersection between machine learning and mechanistic mode...
A walk through the intersection between machine learning and mechanistic mode...A walk through the intersection between machine learning and mechanistic mode...
A walk through the intersection between machine learning and mechanistic mode...
 

More from University of Groningen

Interpretable machine learning in endocrinology, M. Biehl, APPIS 2024
Interpretable machine learning in endocrinology, M. Biehl, APPIS 2024Interpretable machine learning in endocrinology, M. Biehl, APPIS 2024
Interpretable machine learning in endocrinology, M. Biehl, APPIS 2024
University of Groningen
 
ESE-Eyes-2023.pdf
ESE-Eyes-2023.pdfESE-Eyes-2023.pdf
ESE-Eyes-2023.pdf
University of Groningen
 
APPIS-FDGPET.pdf
APPIS-FDGPET.pdfAPPIS-FDGPET.pdf
APPIS-FDGPET.pdf
University of Groningen
 
prototypes-AMALEA.pdf
prototypes-AMALEA.pdfprototypes-AMALEA.pdf
prototypes-AMALEA.pdf
University of Groningen
 
Evidence for tissue and stage-specific composition of the ribosome: machine l...
Evidence for tissue and stage-specific composition of the ribosome: machine l...Evidence for tissue and stage-specific composition of the ribosome: machine l...
Evidence for tissue and stage-specific composition of the ribosome: machine l...
University of Groningen
 
The statistical physics of learning revisted: Phase transitions in layered ne...
The statistical physics of learning revisted: Phase transitions in layered ne...The statistical physics of learning revisted: Phase transitions in layered ne...
The statistical physics of learning revisted: Phase transitions in layered ne...
University of Groningen
 
Interpretable machine-learning (in endocrinology and beyond)
Interpretable machine-learning (in endocrinology and beyond)Interpretable machine-learning (in endocrinology and beyond)
Interpretable machine-learning (in endocrinology and beyond)
University of Groningen
 
Biehl hanze-2021
Biehl hanze-2021Biehl hanze-2021
Biehl hanze-2021
University of Groningen
 
2020: Prototype-based classifiers and relevance learning: medical application...
2020: Prototype-based classifiers and relevance learning: medical application...2020: Prototype-based classifiers and relevance learning: medical application...
2020: Prototype-based classifiers and relevance learning: medical application...
University of Groningen
 
2020: Phase transitions in layered neural networks: ReLU vs. sigmoidal activa...
2020: Phase transitions in layered neural networks: ReLU vs. sigmoidal activa...2020: Phase transitions in layered neural networks: ReLU vs. sigmoidal activa...
2020: Phase transitions in layered neural networks: ReLU vs. sigmoidal activa...
University of Groningen
 
2020: So you thought the ribosome was constant and conserved ...
2020: So you thought the ribosome was constant and conserved ... 2020: So you thought the ribosome was constant and conserved ...
2020: So you thought the ribosome was constant and conserved ...
University of Groningen
 
Prototype-based classifiers and their applications in the life sciences
Prototype-based classifiers and their applications in the life sciencesPrototype-based classifiers and their applications in the life sciences
Prototype-based classifiers and their applications in the life sciences
University of Groningen
 
Prototype-based models in machine learning
Prototype-based models in machine learningPrototype-based models in machine learning
Prototype-based models in machine learning
University of Groningen
 
2013: Sometimes you can trust a rat - The sbv improver species translation ch...
2013: Sometimes you can trust a rat - The sbv improver species translation ch...2013: Sometimes you can trust a rat - The sbv improver species translation ch...
2013: Sometimes you can trust a rat - The sbv improver species translation ch...
University of Groningen
 
2013: Prototype-based learning and adaptive distances for classification
2013: Prototype-based learning and adaptive distances for classification2013: Prototype-based learning and adaptive distances for classification
2013: Prototype-based learning and adaptive distances for classification
University of Groningen
 
2015: Distance based classifiers: Basic concepts, recent developments and app...
2015: Distance based classifiers: Basic concepts, recent developments and app...2015: Distance based classifiers: Basic concepts, recent developments and app...
2015: Distance based classifiers: Basic concepts, recent developments and app...
University of Groningen
 
2016: Classification of FDG-PET Brain Data
2016: Classification of FDG-PET Brain Data2016: Classification of FDG-PET Brain Data
2016: Classification of FDG-PET Brain Data
University of Groningen
 
2016: Predicting Recurrence in Clear Cell Renal Cell Carcinoma
2016: Predicting Recurrence in Clear Cell Renal Cell Carcinoma2016: Predicting Recurrence in Clear Cell Renal Cell Carcinoma
2016: Predicting Recurrence in Clear Cell Renal Cell Carcinoma
University of Groningen
 
2017: Prototype-based models in unsupervised and supervised machine learning
2017: Prototype-based models in unsupervised and supervised machine learning2017: Prototype-based models in unsupervised and supervised machine learning
2017: Prototype-based models in unsupervised and supervised machine learning
University of Groningen
 
June 2017: Biomedical applications of prototype-based classifiers and relevan...
June 2017: Biomedical applications of prototype-based classifiers and relevan...June 2017: Biomedical applications of prototype-based classifiers and relevan...
June 2017: Biomedical applications of prototype-based classifiers and relevan...
University of Groningen
 

More from University of Groningen (20)

Interpretable machine learning in endocrinology, M. Biehl, APPIS 2024
Interpretable machine learning in endocrinology, M. Biehl, APPIS 2024Interpretable machine learning in endocrinology, M. Biehl, APPIS 2024
Interpretable machine learning in endocrinology, M. Biehl, APPIS 2024
 
ESE-Eyes-2023.pdf
ESE-Eyes-2023.pdfESE-Eyes-2023.pdf
ESE-Eyes-2023.pdf
 
APPIS-FDGPET.pdf
APPIS-FDGPET.pdfAPPIS-FDGPET.pdf
APPIS-FDGPET.pdf
 
prototypes-AMALEA.pdf
prototypes-AMALEA.pdfprototypes-AMALEA.pdf
prototypes-AMALEA.pdf
 
Evidence for tissue and stage-specific composition of the ribosome: machine l...
Evidence for tissue and stage-specific composition of the ribosome: machine l...Evidence for tissue and stage-specific composition of the ribosome: machine l...
Evidence for tissue and stage-specific composition of the ribosome: machine l...
 
The statistical physics of learning revisted: Phase transitions in layered ne...
The statistical physics of learning revisted: Phase transitions in layered ne...The statistical physics of learning revisted: Phase transitions in layered ne...
The statistical physics of learning revisted: Phase transitions in layered ne...
 
Interpretable machine-learning (in endocrinology and beyond)
Interpretable machine-learning (in endocrinology and beyond)Interpretable machine-learning (in endocrinology and beyond)
Interpretable machine-learning (in endocrinology and beyond)
 
Biehl hanze-2021
Biehl hanze-2021Biehl hanze-2021
Biehl hanze-2021
 
2020: Prototype-based classifiers and relevance learning: medical application...
2020: Prototype-based classifiers and relevance learning: medical application...2020: Prototype-based classifiers and relevance learning: medical application...
2020: Prototype-based classifiers and relevance learning: medical application...
 
2020: Phase transitions in layered neural networks: ReLU vs. sigmoidal activa...
2020: Phase transitions in layered neural networks: ReLU vs. sigmoidal activa...2020: Phase transitions in layered neural networks: ReLU vs. sigmoidal activa...
2020: Phase transitions in layered neural networks: ReLU vs. sigmoidal activa...
 
2020: So you thought the ribosome was constant and conserved ...
2020: So you thought the ribosome was constant and conserved ... 2020: So you thought the ribosome was constant and conserved ...
2020: So you thought the ribosome was constant and conserved ...
 
Prototype-based classifiers and their applications in the life sciences
Prototype-based classifiers and their applications in the life sciencesPrototype-based classifiers and their applications in the life sciences
Prototype-based classifiers and their applications in the life sciences
 
Prototype-based models in machine learning
Prototype-based models in machine learningPrototype-based models in machine learning
Prototype-based models in machine learning
 
2013: Sometimes you can trust a rat - The sbv improver species translation ch...
2013: Sometimes you can trust a rat - The sbv improver species translation ch...2013: Sometimes you can trust a rat - The sbv improver species translation ch...
2013: Sometimes you can trust a rat - The sbv improver species translation ch...
 
2013: Prototype-based learning and adaptive distances for classification
2013: Prototype-based learning and adaptive distances for classification2013: Prototype-based learning and adaptive distances for classification
2013: Prototype-based learning and adaptive distances for classification
 
2015: Distance based classifiers: Basic concepts, recent developments and app...
2015: Distance based classifiers: Basic concepts, recent developments and app...2015: Distance based classifiers: Basic concepts, recent developments and app...
2015: Distance based classifiers: Basic concepts, recent developments and app...
 
2016: Classification of FDG-PET Brain Data
2016: Classification of FDG-PET Brain Data2016: Classification of FDG-PET Brain Data
2016: Classification of FDG-PET Brain Data
 
2016: Predicting Recurrence in Clear Cell Renal Cell Carcinoma
2016: Predicting Recurrence in Clear Cell Renal Cell Carcinoma2016: Predicting Recurrence in Clear Cell Renal Cell Carcinoma
2016: Predicting Recurrence in Clear Cell Renal Cell Carcinoma
 
2017: Prototype-based models in unsupervised and supervised machine learning
2017: Prototype-based models in unsupervised and supervised machine learning2017: Prototype-based models in unsupervised and supervised machine learning
2017: Prototype-based models in unsupervised and supervised machine learning
 
June 2017: Biomedical applications of prototype-based classifiers and relevan...
June 2017: Biomedical applications of prototype-based classifiers and relevan...June 2017: Biomedical applications of prototype-based classifiers and relevan...
June 2017: Biomedical applications of prototype-based classifiers and relevan...
 

Recently uploaded

(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
Scintica Instrumentation
 
role of pramana in research.pptx in science
role of pramana in research.pptx in sciencerole of pramana in research.pptx in science
role of pramana in research.pptx in science
sonaliswain16
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
pablovgd
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
Nistarini College, Purulia (W.B) India
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
fafyfskhan251kmf
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
kejapriya1
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Erdal Coalmaker
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
yqqaatn0
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Ana Luísa Pinho
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
muralinath2
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
moosaasad1975
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
Sérgio Sacani
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
AlaminAfendy1
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
SAMIR PANDA
 
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptxBody fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
muralinath2
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
muralinath2
 
in vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptxin vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptx
yusufzako14
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
YOGESH DOGRA
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
silvermistyshot
 

Recently uploaded (20)

(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
 
role of pramana in research.pptx in science
role of pramana in research.pptx in sciencerole of pramana in research.pptx in science
role of pramana in research.pptx in science
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
 
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptxBody fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
 
in vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptxin vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptx
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
 

stat-phys-AMALEA.pdf

  • 1. 1 The statistical physics of learning: typical learning curves www.cs.rug.nl/~biehl Michael Biehl AMALEA workshop, September 12, 2022
  • 2. The statistical physics of learning: typical learning curves
  Michael Biehl, AMALEA workshop, September 12, 2022 (www.cs.rug.nl/~biehl)
  • a little bit of history
  • optimization and statistical physics (in a nutshell)
  • machine learning as a special case, disorder average
  • annealed approximation, high-temperature limit, replica trick
  • typical learning curves in student/teacher scenarios
  • a very simple example: single unit, linear regression
  • nonlinear, layered neural networks:
    - phase transitions in soft committee machines
    - the role of the activation function
  • outlook / ongoing projects
  • 3-6. Statistical Physics of Neural Networks
  capacity of feed-forward networks: Elizabeth Gardner (1957-1988). The space of interactions in neural networks. J. Phys. A 21: 257 (1988)
  dynamics, attractor neural networks: John Hopfield. Neural networks and physical systems with emergent collective computational abilities. PNAS 79(8): 2554 (1982)
  learning of a rule: Géza Györgyi, Naftali Tishby. Statistical theory of learning a rule. In: Neural Networks and Spin Glasses, World Scientific, 31-36 (1990)
  reviews (annealed approximation, high-T limit, replica trick etc.): H.S. Seung, H. Sompolinsky, N. Tishby. Statistical mechanics of learning from examples. Phys. Rev. A 45: 6056 (1992)
  • 8-15. stochastic optimization
  objective/cost/energy function $H$ for many degrees of freedom
  discrete degrees of freedom, e.g. spins $S_j = \pm 1$: Metropolis algorithm
  • suggest a (small) change, e.g. a "single spin flip" $S_j \to -S_j$ for a random j
  • acceptance of the change: always if $\Delta H \le 0$, with probability $\exp(-\beta\,\Delta H)$ if $\Delta H > 0$
  • the formal temperature $T = 1/\beta$ controls the acceptance rate for "uphill" moves
  continuous degrees of freedom, e.g. weights $w \in \mathbb{R}^N$: Langevin dynamics
  • continuous temporal change, "noisy gradient descent": $\frac{dw}{dt} = -\nabla_w H(w) + \eta(t)$
  • with delta-correlated white noise (spatial + temporal independence), $\langle \eta_j(t)\,\eta_k(t') \rangle = 2T\,\delta_{jk}\,\delta(t-t')$
  • the temperature T controls the noise level, i.e. the random deviation from plain gradient descent
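  As a concrete illustration of the two schemes on this slide, here is a minimal numpy sketch; the 1d Ising chain, the quadratic cost, the step size and the seed are assumptions chosen for illustration, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_step(S, H, beta):
    """Single-spin-flip Metropolis update for discrete S_j = +/-1."""
    j = rng.integers(len(S))            # pick a random component j
    S_new = S.copy(); S_new[j] *= -1    # suggest a small change: flip spin j
    dH = H(S_new) - H(S)
    # accept always if dH <= 0, with probability exp(-beta*dH) otherwise
    return S_new if (dH <= 0 or rng.random() < np.exp(-beta * dH)) else S

def langevin_step(w, gradH, T, dt=1e-2):
    """Noisy gradient descent: dw = -grad H(w) dt + sqrt(2 T dt) * white noise."""
    return w - gradH(w) * dt + np.sqrt(2 * T * dt) * rng.standard_normal(len(w))

# discrete toy example: 1d Ising chain energy (illustration only)
H_spin = lambda S: -np.sum(S[:-1] * S[1:])
S = rng.choice([-1, 1], size=20)
for _ in range(5000):
    S = metropolis_step(S, H_spin, beta=2.0)

# continuous toy example: quadratic cost H(w) = |w - w0|^2 / 2
w0 = np.ones(10)
w = rng.standard_normal(10)
for _ in range(5000):
    w = langevin_step(w, lambda v: v - w0, T=0.01)
print(0.5 * np.sum((w - w0) ** 2))   # fluctuates near the minimum, spread set by T
```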
  • 16-20. thermal equilibrium
  Markov chain / continuous dynamics: stationary density of configurations
  $P(w) = \frac{1}{Z} \exp[-\beta H(w)]$, normalization $Z = \int d\mu(w)\, \exp[-\beta H(w)]$ ("Zustandssumme", partition function)
  Gibbs-Boltzmann density of states:
  • physics: thermal equilibrium of a physical system at temperature T
  • optimization: formal equilibrium situation, control parameter T
  • $T \to \infty,\ \beta \to 0$: the energy is irrelevant, every state contributes equally
  • $T \to 0,\ \beta \to \infty$: only the lowest energy (ground state) contributes
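  A quick numerical check of the stationary density, as a sketch with an assumed four-state toy energy: the empirical histogram of a long Metropolis chain approaches the Gibbs-Boltzmann density.

```python
import numpy as np

rng = np.random.default_rng(1)

E = np.array([0.0, 0.5, 1.0, 2.0])   # assumed toy energies H(w) of four states
beta = 1.5

# exact Gibbs-Boltzmann density P = exp(-beta*E)/Z
P_exact = np.exp(-beta * E)
P_exact /= P_exact.sum()              # Z = sum of the Boltzmann weights

# long Metropolis chain with symmetric uniform proposals
state, counts = 0, np.zeros(len(E))
for _ in range(200_000):
    prop = rng.integers(len(E))
    dE = E[prop] - E[state]
    if dE <= 0 or rng.random() < np.exp(-beta * dE):
        state = prop
    counts[state] += 1

print(np.round(P_exact, 3))
print(np.round(counts / counts.sum(), 3))   # agrees up to sampling noise
```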
  • 22-26. thermal averages
  in equilibrium, e.g. $\langle A \rangle_T = \frac{1}{Z}\int d\mu(w)\, A(w)\, \exp[-\beta H(w)]$; free energy $F = -\frac{1}{\beta}\ln Z$
  rewrite Z as an integral over all possible energies: $Z = \int dE\ \mathcal{N}(E)\, e^{-\beta E}$, where $\mathcal{N}(E)$ ~ vol. of states with energy E
  assume extensive energy, proportional to the system size N: $E = N e$, $\ln \mathcal{N}(E) = N s(e)$, hence
  $Z = \int dE \exp[-N\beta\,(e - s(e)/\beta)]$
  in large systems ($N \to \infty$), $\ln Z$ is dominated by the minimum of the free energy (density) $f = e - s/\beta \sim -\ln Z/(\beta N)$
  • 27-29. remark: saddle-point integration
  for a function $g(x)$ with maximum in $x_0$, consider the thermodynamic limit $N \to \infty$:
  $\frac{1}{N}\ln \int dx\, e^{N g(x)} \to g(x_0)$, i.e. the integral is dominated by the maximum of the exponent
  hence $\ln Z$ is given by the minimum of the free energy $f = e - s(e)/\beta$
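  A minimal numeric illustration of the saddle-point (Laplace) argument, with an assumed example function $g(x) = x - x^2$ whose maximum is $g(1/2) = 1/4$:

```python
import numpy as np

g = lambda x: x - x**2            # assumed example; maximum g(1/2) = 0.25
gmax = 0.25

x = np.linspace(-2.0, 3.0, 200_001)
for N in (10, 100, 1000):
    # factor out exp(N*gmax) for numerical stability, then integrate
    I = np.trapz(np.exp(N * (g(x) - gmax)), x)
    print(N, gmax + np.log(I) / N)   # (1/N) ln of the integral approaches gmax
```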
  • 30-32. machine learning as a special case
  machine learning: choice of adaptive parameters $w$, e.g. all weights in a neural network
  cost function defined w.r.t. the training set of input vectors and target labels (supervised), $ID = \{\xi^\mu, S(\xi^\mu)\}_{\mu=1}^P$:
  $H(w) = \sum_{\mu=1}^{P} \epsilon(w, \xi^\mu)$, a sum over examples with cost or error measure $\epsilon(\ldots)$ per example, e.g. the classification error
  interpretation of training:
  • the weights are the outcome of some stochastic optimization process with energy-dependent stationary $P(w)$
  • formal (thermal) equilibrium
  • $\langle \cdots \rangle_T$: thermal averages (over the stochastic training)
  • 33-37. disorder average
  • the energy/cost function is defined for one particular set of examples ID; typical properties require an additional average over random training data ID
  • typical properties on average over data sets: derivatives of the quenched free energy $\sim \langle \ln Z \rangle_{ID}$ yield the averages; this is difficult: replica trick, approximations
  • student/teacher scenarios:
    - define/control the complexity of the target rule and of the learning system
    - represent the target by a teacher network
  • simplest assumptions:
    - independent input vectors of i.i.d. components
    - noise-free training labels provided by the teacher network
  • 38-40. example: training of a single, linear unit
  input data: independent, identically distributed random components with $\langle \xi_j^\mu \rangle = 0$, $\langle \xi_j^\mu \xi_k^\nu \rangle = \delta_{jk}\delta^{\mu\nu}$,
  e.g. $\xi_j^\mu = \pm 1$ (with equal prob.) or Gaussian $P(\xi_j^\mu) = \frac{1}{\sqrt{2\pi}} \exp[-\frac{1}{2}(\xi_j^\mu)^2]$
  student output $g(x)$ with pre-activation (local potential) $x = \frac{1}{\sqrt N}\sum_j w_j \xi_j$, teacher output $g(y)$ with $y = \frac{1}{\sqrt N}\sum_j w_j^* \xi_j$; $x, y \sim \mathcal{O}(1)$
  weight vectors $w, w^*$ with $w^2/N = Q = \mathcal{O}(1)$, $w^{*2}/N = Q^* = \mathcal{O}(1)$
  cost function/energy, e.g. linear regression: $g(z) = z$, $\epsilon(x, y) = \frac{1}{2}(x - y)^2$, $H(w) = \sum_{\mu=1}^{P} \epsilon(x^\mu, y^\mu)$
  extensive quantity: $H \propto P = \alpha N$
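  A small simulation of this setup, as a sketch (system sizes and the seed are arbitrary choices), confirming that the pre-activations are $\mathcal{O}(1)$ with second moments given by the overlaps, and that H is extensive:

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 1000, 5000

# teacher and student weights, normalized so that Q = |w|^2/N = 1
w_star = rng.standard_normal(N); w_star *= np.sqrt(N) / np.linalg.norm(w_star)
w      = rng.standard_normal(N); w      *= np.sqrt(N) / np.linalg.norm(w)

xi = rng.choice([-1.0, 1.0], size=(P, N))   # i.i.d. binary inputs

x = xi @ w      / np.sqrt(N)                # student pre-activations, O(1)
y = xi @ w_star / np.sqrt(N)                # teacher pre-activations, O(1)

R = w @ w_star / N                          # student/teacher overlap
print("R =", R)
print("var(x), var(y), cov(x,y):", x.var(), y.var(), np.mean(x * y))
# CLT: (x, y) approx. jointly Gaussian with covariance [[Q, R], [R, Q*]]

# linear regression cost: H/P = mean of eps = (Q - 2R + Q*)/2 = 1 - R here
eps = 0.5 * (x - y) ** 2
print("H/P =", eps.mean())
```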
  • 41-45. partition function, training at temperature $T = 1/\beta$
  $Z = \int \prod_j dw_j\, \delta(w^2 - N)\, \exp[-\beta \sum_\mu \epsilon(x^\mu, y^\mu)]$, with the spherical measure abbreviated as $d\mu(w)$
  Annealed Approximation (AA): evaluate $\ln \langle Z \rangle_{ID}$ instead of $\langle \ln Z \rangle_{ID}$
  note: Jensen's inequality $\langle \ln Z \rangle_{ID} \le \ln \langle Z \rangle_{ID}$ does not imply "$\approx$" or similarity of the extrema!
  $\langle Z \rangle_{ID} = \int d\mu(w) \left\langle \exp\left[-\beta \sum_\mu \epsilon\left(\frac{w\cdot\xi^\mu}{\sqrt N}, \frac{w^*\cdot\xi^\mu}{\sqrt N}\right)\right]\right\rangle_{ID}$
  traditional approach: integral representation of $\delta\left(x^\mu - \frac{w\cdot\xi^\mu}{\sqrt N}\right)$, $\delta\left(y^\mu - \frac{w^*\cdot\xi^\mu}{\sqrt N}\right)$, explicit computation of the averages, elimination of conjugate variables, ...
  short-cut: exploit the Central Limit Theorem; for i.i.d. input components and $N \to \infty$ the local potentials $x^\mu = \frac{1}{\sqrt N}\sum_{j=1}^N w_j \xi_j^\mu$ and $y^\mu = \frac{1}{\sqrt N}\sum_{j=1}^N w_j^* \xi_j^\mu$ become jointly Gaussian
  • 46-47. the joint normal density $P(x^\mu, y^\mu)$ of the local potentials is fully specified by the disorder average:
  $\langle x^\mu \rangle = \frac{1}{\sqrt N}\sum_j w_j \langle \xi_j^\mu \rangle = 0$, $\quad \langle (x^\mu)^2 \rangle = \frac{1}{N}\sum_{j,k} w_j w_k \langle \xi_j^\mu \xi_k^\mu \rangle = \frac{1}{N}\sum_j w_j^2$
  $\langle y^\mu \rangle = 0$, $\quad \langle (y^\mu)^2 \rangle = \frac{1}{N}\sum_j (w_j^*)^2$, $\quad \langle x^\mu y^\mu \rangle = \frac{1}{N}\sum_{j,k} w_j w_k^* \langle \xi_j^\mu \xi_k^\mu \rangle = \frac{1}{N}\sum_j w_j w_j^*$
  set of order parameters: $Q = \frac{1}{N}\sum_j w_j^2\ (= 1)$, $\quad R = \frac{1}{N}\sum_j w_j w_j^*$, $\quad Q^* = \frac{1}{N}\sum_j (w_j^*)^2\ (= 1)$
  macroscopic properties of the trained network instead of microscopic details
  • 48-50. annealed free energy
  the average $\langle \cdots \rangle_{ID}$ factorizes w.r.t. $j = 1, 2, \ldots, N$ and $\mu = 1, 2, \ldots, P$:
  $\langle Z \rangle_{ID} = \int dR\, \exp[N\,(G_o(R) - \alpha\, G_1(R))]$ with $\alpha N = P$
  entropy term: $G_o(R) = \frac{1}{N} \ln \int \prod_j dw_j\, \delta(N - w^2)\, \delta(NR - w\cdot w^*)$; N-dim. geometry, independent of the model details
  energy term: $G_1(R) = -\ln \int dx\, dy\, P(x, y)\, \exp[-\beta\, \epsilon(x, y)]$; depends on model and training
  saddle-point integration for $N \to \infty$: $-\beta f_{ann} = \frac{1}{N}\ln \langle Z \rangle_{ID} = \mathrm{extr}_R\, [G_o(R) - \alpha\, G_1(R)]$
  • 51-53. the entropy term
  $G_o(R) = \frac{1}{N} \ln \int \prod_j dw_j\, \delta(N - w^2)\, \delta(NR - w\cdot w^*) \approx \cdots = \frac{1}{2}\ln(1 - R^2)$
  the hard way: integral representation of the delta functions, introducing a conjugate variable $\hat R$; saddle-point integration for large N w.r.t. $R, \hat R$
  geometry: at fixed overlap R, the student is confined to a sphere of (relative) radius $r = \sqrt{1 - R^2}$ around its projection onto $w^*$; the accessible volume scales as $V \sim (1 - R^2)^{N/2}$, so $G_o(R) = \frac{1}{N}\ln V \sim \frac{1}{2}\ln(1 - R^2)$
  general result [R. Urbanczik]: for a set of vectors with matrix $\mathcal{C}$ of pairwise dot products and norms, $G_o = \frac{1}{2}\ln\det\mathcal{C}$; here $\mathcal{C} = \begin{pmatrix} 1 & R \\ R & 1 \end{pmatrix}$
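  The geometric argument can be checked by sampling; a sketch (N, the sample size and the binning are arbitrary choices): for w drawn uniformly from the sphere, the overlap R has log-density $\frac{N-3}{2}\ln(1-R^2)$ + const, which matches $G_o(R) = \frac{1}{2}\ln(1-R^2)$ per degree of freedom for large N.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 20, 500_000

# w uniform on the unit sphere (equivalent to w^2 = N after rescaling);
# teacher direction fixed to the first axis without loss of generality
w = rng.standard_normal((M, N))
w /= np.linalg.norm(w, axis=1, keepdims=True)
R = w[:, 0]                                    # overlaps R

hist, edges = np.histogram(R, bins=60, range=(-0.6, 0.6), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
emp = np.log(hist)                             # empirical log-density
theo = 0.5 * (N - 3) * np.log(1 - centers**2)  # exact exponent on the sphere
print(np.round(emp - theo, 2))                 # nearly constant in R
```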
  • 54-57. the energy term
  $G_1(R) = -\ln \int \frac{dx\, dy}{2\pi\sqrt{1 - R^2}} \exp\left[-\frac{x^2 + y^2 - 2Rxy}{2(1 - R^2)}\right] \exp[-\beta\,\epsilon(x, y)]$
  linear regression (single linear student and teacher), $\epsilon(x, y) = \frac{1}{2}(x - y)^2$: elementary Gaussian integrals give
  $G_1(R) = \frac{1}{2}\ln[1 + 2\beta(1 - R)]$
  annealed free energy (up to an irrelevant constant and terms that vanish for $N \to \infty$):
  $-(\beta f) = -\frac{1}{2}\alpha \ln[1 + 2\beta(1 - R)] + \frac{1}{2}\ln(1 - R^2)$
  $\frac{\partial(\beta f)}{\partial R} = 0 \ \Rightarrow\ \frac{R}{1 - R^2} = \frac{\alpha\beta}{1 + 2\beta(1 - R)} \ \to\ R(\alpha)$ at a given $\beta$
  • 58-59. learning curves (linear regression)
  [figure: order parameter R vs. $\alpha$ for $\beta$ = 0.1, 1, 10, 100, 1000]
  typical success of training, i.e. student/teacher similarity, as a function of the training set size
  generalization error and training error:
  $\epsilon_g = (1 - R)$, $\qquad \epsilon_t = \frac{1}{\alpha}\frac{\partial(\beta f)}{\partial\beta} = \frac{1 - R}{1 + 2\beta(1 - R)}$
  [figure: $\epsilon_g$ and $\epsilon_t$ vs. $\alpha$ at $\beta = 1$]
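  These curves can be reproduced by solving the stationarity condition numerically; a sketch (the $\alpha$ grid and the value of $\beta$ are arbitrary):

```python
import numpy as np
from scipy.optimize import brentq

def R_of_alpha(alpha, beta):
    """Solve R/(1-R^2) = alpha*beta/(1 + 2*beta*(1-R)) for R in [0, 1)."""
    f = lambda R: R / (1 - R**2) - alpha * beta / (1 + 2 * beta * (1 - R))
    return brentq(f, 0.0, 1.0 - 1e-12)   # sign change guarantees a root

beta = 1.0
for alpha in (0.5, 1.0, 2.0, 5.0, 10.0):
    R = R_of_alpha(alpha, beta)
    eps_g = 1.0 - R
    eps_t = (1.0 - R) / (1.0 + 2.0 * beta * (1.0 - R))
    print(f"alpha={alpha:5.1f}  R={R:.3f}  eps_g={eps_g:.3f}  eps_t={eps_t:.3f}")
# R grows monotonically towards 1; eps_t stays below eps_g, both decay with alpha
```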
  • 60-63. remark: interpretation of the AA
  partition function in the AA:
  $\langle Z \rangle_{ID} = \int d\mu(w)\, \langle \exp[-\beta H(w)] \rangle_{ID} = \int d\mu(w) \int d\mu(\{\xi^\mu\}_{\mu=1}^P)\, \exp[-\beta H(\{\xi^\mu\}, w)]$
  interpretation: partition sum of a system in which weights and data are degrees of freedom that can be optimized (annealed) w.r.t. H
  correct treatment: the data constitute frozen (quenched) disorder in H
  observation/folklore: the AA works (qualitatively) well in realizable cases, e.g. student and teacher of the same complexity trained on noise-free data; the AA fails in unrealizable cases (noise, mismatch), because the hypothetical system can "adapt the data to the task", which yields over-optimistic results
  • 64-68. proper disorder average: the replica trick/method
  $\langle \ln Z \rangle_{ID} = \lim_{n\to 0} \frac{\langle Z^n \rangle_{ID} - 1}{n} = \lim_{n\to 0} \frac{\partial \langle Z^n \rangle_{ID}}{\partial n} = \lim_{n\to 0} \frac{1}{n}\ln \langle Z^n \rangle_{ID}$
  formally, for integer n: n non-interacting "copies" of the system (replicas),
  $\langle Z^n \rangle_{ID} = \int \prod_{a=1}^{n} d\mu(w^a)\, \left\langle \exp\left[-\beta \sum_\mu \sum_a \epsilon\left(\frac{w^a\cdot\xi^\mu}{\sqrt N}, \frac{w^*\cdot\xi^\mu}{\sqrt N}\right)\right]\right\rangle_{ID}$, i.e. integration over the joint density $P(x_1^\mu, x_2^\mu, \ldots, x_n^\mu, y^\mu)$
  the data set average introduces effective interactions between the replicas; saddle-point integration for $\langle Z^n \rangle_{ID}$ involves the order parameters $R_a = w^a\cdot w^*/N$, $q_{ab} = w^a\cdot w^b/N$; the quenched free energy requires analytic continuation to $n \in \mathbb{R}$ and $n \to 0$
  mathematical subtleties, replica symmetry breaking, ...: Marc Mézard, Giorgio Parisi (*), Miguel Virasoro. Spin Glass Theory and Beyond (1987). (*) Nobel Prize 2021
  • 69-71. historical :-) examples of perceptron learning curves
  student $S = \mathrm{sign}(w\cdot\xi)$, teacher $S^* = \mathrm{sign}(w^*\cdot\xi)$
  perceptron, zero-temperature training from noise-free, linearly separable data; curves for the Gibbs student, optimal generalization, Adaline with $\epsilon(x, y) = \frac{1}{2}[x - \mathrm{sign}(y)]^2$, and maximum stability; asymptotic decays $\epsilon_g \propto \alpha^{-1}$ and $\epsilon_g \propto \alpha^{-1/2}$
  more in the literature: label noise, teacher weight noise, variational optimization of the cost function, weight decay, worst-case training, ...
  • 72-76. training at high temperatures
  the AA becomes exact in the limit $T \to \infty$ (the replicas decouple)
  energy term for $\beta \to 0$:
  $G_1(R) = -\ln \int dx\, dy\, P(x, y)\, e^{-\beta\epsilon(x, y)} \approx -\ln \int dx\, dy\, P(x, y)\,(1 - \beta\,\epsilon(x, y)) \approx -\ln[1 - \beta\,\langle\epsilon(x, y)\rangle_{\{x,y\}}] \approx \beta\,\epsilon_g$
  the generalization error (average error over arbitrary inputs) appears!
  free energy: $\beta f \approx (\alpha\beta)\,\epsilon_g - G_o(R)$, only meaningful if $(\alpha\beta) = \mathcal{O}(1)$: $\beta \to 0$, $T \to \infty$ together with $\alpha = P/N \to \infty$, i.e. learn almost nothing from infinitely many examples
  caveats: here P and T cannot be varied independently; $\epsilon_g$ and $\epsilon_t$ are indistinguishable (the input space is sampled perfectly)
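  For the linear example above, $\epsilon_g = 1 - R$ and the high-T free energy can be minimized directly; a sketch (the $\tilde\alpha = \alpha\beta$ values are arbitrary) comparing the numerical minimum of $\tilde\alpha(1 - R) - \frac{1}{2}\ln(1 - R^2)$ with its closed-form stationary point:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def beta_f(R, alpha_tilde):
    """High-T free energy of the linear unit: (alpha*beta)*eps_g - Go(R)."""
    return alpha_tilde * (1.0 - R) - 0.5 * np.log(1.0 - R**2)

for at in (0.5, 1.0, 2.0, 5.0):
    res = minimize_scalar(beta_f, args=(at,), bounds=(0.0, 1.0 - 1e-9),
                          method="bounded")
    # stationarity alpha_tilde = R/(1-R^2), solved in closed form:
    R_exact = (np.sqrt(1.0 + 4.0 * at**2) - 1.0) / (2.0 * at)
    print(at, res.x, R_exact)    # numerical minimum matches the closed form
```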
  • 77-80. layered networks: "soft committee machines" (SCM)
  adaptive student: N inputs, K hidden units; a teacher network parameterizes the target
  consider specific activation functions, e.g. sigmoidal / ReLU
  thermodynamic limit $N \to \infty$: description in terms of order parameters, student/teacher overlaps $R_{ik} = w_i\cdot w_k^*/N$ and student/student overlaps $Q_{ik} = w_i\cdot w_k/N$
  site symmetry / hidden-unit specialization ansatz: $R_{ik} = R\,\delta_{ik} + S\,(1 - \delta_{ik})$, $Q_{ik} = Q\,\delta_{ik} + C\,(1 - \delta_{ik})$
  • 81-83. typical SCM learning curves (high-T)
  from the high-T free energy: express the generalization error $\epsilon_g$ and the entropy s as functions of $\{R, S, Q, C\}$; study $\epsilon_g$ as a function of the training set size
  sigmoidal activation: discontinuous phase transition (for K > 2) between an unspecialized / anti-specialized branch ($R = S$ resp. $R < S$) and a specialized branch ($R > S$); the poorly performing phase persists for large data sets
  ReLU activation: continuous transition from the unspecialized state ($R = S$) into specialized ($R > S$) and anti-specialized ($R < S$) branches with similar performances; lower (free) energy barrier
  • 84-85. layered networks (SCM): references
  sigmoidal activation, high-T, arbitrary K: Phase Transitions in Soft-Committee Machines. M. Biehl, E. Schlösser, M. Ahr. Europhysics Letters 44: 261-267 (1998)
  sigmoidal activation, replica, large $K = M \to \infty$: Statistical Physics and Practical Training of Soft-Committee Machines. M. Ahr, M. Biehl, R. Urbanczik. Eur. Phys. J. B 10: 583-588 (1999)
  ReLU activation, high-T, arbitrary K: Hidden Unit Specialization in Layered Neural Networks: ReLU vs. Sigmoidal Activation. E. Oostwal, M. Straat, M. Biehl. Physica A 564: 125517 (2021)
  challenges: more general activation functions; overfitting/underfitting; low temperatures (AA, replica); many layers (deep networks); non-trivial (realistic) input densities [Zdeborová, Goldt, Mézard, ...]
  • 86-89. on-going & future work
  The Role of the Activation Function in Feedforward Learning Systems (RAFFLES): NWO-funded project, Frederieke Richert
  Robust Learning of Sparse Representations: Brain-inspired Inhibition and Statistical Physics Analysis: 2 PhD projects funded by the Groningen Cognitive Systems and Materials Centre CogniGron, in collaboration with George Azzopardi
  - study network architectures and training schemes which favor sparse activity and sparse connectivity
  - consider activation functions which relate to hardware-realizable adaptive systems
  see www.cs.rug.nl/~biehl (link to description and application form, deadline: 29 September 2022)
  • 90. MiWoCI 2022: no statistical physics :-)
  • 91. www.cs.rug.nl/~biehl · m.biehl@rug.nl · twitter: @michaelbiehl13