A tutorial given at the AMALEA workshop 2022.
This talk presents the statistical-physics-based theory of machine learning in terms of simple example systems. As a recent application, the occurrence of phase transitions in layered networks is discussed.
The statistical physics of learning:
typical learning curves

Michael Biehl
www.cs.rug.nl/~biehl
AMALEA workshop, September 12, 2022

• a little bit of history
• optimization and statistical physics (in a nutshell)
• machine learning as a special case, disorder average
• annealed approximation, high-temperature limit, replica trick
• typical learning curves in student/teacher scenarios
• a very simple example: single unit, linear regression
• nonlinear, layered neural networks:
  - phase transitions in soft committee machines
  - the role of the activation function
• outlook / ongoing projects
Statistical Physics of Neural Networks

capacity of feed-forward networks:
Elizabeth Gardner (1957-1988). The space of interactions in neural networks. J. Phys. A 21: 257 (1988)

dynamics, attractor neural networks:
John Hopfield. Neural networks and physical systems with emergent collective computational abilities. PNAS 79(8): 2554 (1982)

learning of a rule:
Geza Györgyi, Naftali Tishby. Statistical theory of learning a rule. In: Neural Networks and Spin Glasses, World Scientific, 31-36 (1990)

reviews (annealed approximation, high-T limit, replica trick etc.):
S. Seung, H. Sompolinsky, N. Tishby. Statistical mechanics of learning from examples. Phys. Rev. A 45: 6056 (1992)
stochastic optimization

objective/cost/energy function H(w) for many degrees of freedom w, either discrete (e.g. binary w_j = ±1) or continuous (e.g. w ∈ ℝ^N)

Metropolis algorithm (discrete):
• suggest a (small) change w → w′, e.g. a „single spin flip“ w_j → −w_j for a random j
• acceptance of the change:
  - always, if the energy does not increase (ΔH = H(w′) − H(w) ≤ 0)
  - with probability exp(−β ΔH), if ΔH > 0
• the temperature T = 1/β controls the acceptance rate for „uphill“ moves

Langevin dynamics (continuous):
• continuous temporal change, „noisy gradient descent“:
  dw/dt = −∇_w H(w) + η(t)
• with delta-correlated white noise η(t) (spatial + temporal independence), ⟨η_j(t) η_k(t′)⟩ = 2T δ_jk δ(t − t′)
• the temperature T controls the noise level, i.e. the random deviation from the gradient
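As a minimal illustration (added here, not part of the original slides), the following Python sketch implements both stochastic update rules; the Ising-chain and quadratic toy energies and all parameter values are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)

def metropolis_step(w, H, beta):
    """One Metropolis update: propose a single spin flip w_j -> -w_j and
    accept it always if the energy decreases, else with prob. exp(-beta*dH)."""
    j = rng.integers(len(w))
    w_new = w.copy()
    w_new[j] *= -1
    dH = H(w_new) - H(w)
    if dH <= 0 or rng.random() < np.exp(-beta * dH):
        return w_new
    return w

def langevin_step(w, grad_H, T, dt=1e-2):
    """Euler step of dw/dt = -grad H(w) + noise, with delta-correlated
    white noise whose strength is set by the temperature T."""
    noise = np.sqrt(2 * T * dt) * rng.standard_normal(len(w))
    return w - dt * grad_H(w) + noise

# toy discrete example: one-dimensional Ising chain, H(w) = -sum_j w_j w_{j+1}
H = lambda w: -np.sum(w * np.roll(w, 1))
w = rng.choice([-1.0, 1.0], size=50)
for _ in range(2000):
    w = metropolis_step(w, H, beta=2.0)
print("Metropolis final energy:", H(w))

# toy continuous example: quadratic bowl, H(w) = |w|^2 / 2
v = rng.standard_normal(10)
for _ in range(2000):
    v = langevin_step(v, grad_H=lambda w: w, T=0.1)
print("Langevin final energy:", 0.5 * v @ v)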
thermal equilibrium

Markov chain / continuous dynamics → stationary density of configurations:

P(w) = (1/Z) exp[−β H(w)]

normalization Z = ∫ dμ(w) exp[−β H(w)]: the „Zustandssumme“, partition function

Gibbs-Boltzmann density of states
• physics: thermal equilibrium of a physical system at temperature T
• optimization: formal equilibrium situation, control parameter T

T → ∞, β → 0: energy is irrelevant, every state contributes equally
T → 0, β → ∞: only the lowest energy (ground state) contributes
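An added sketch makes the two limits concrete: Gibbs-Boltzmann weights for a toy set of energy levels (the levels themselves are an arbitrary choice).

import numpy as np

E = np.array([0.0, 0.5, 1.0, 2.0])  # toy energy levels (arbitrary)

def gibbs_weights(E, beta):
    """Gibbs-Boltzmann probabilities P_i = exp(-beta E_i) / Z."""
    w = np.exp(-beta * (E - E.min()))  # shift by E.min() for numerical stability
    return w / w.sum()

for beta in [0.0, 1.0, 100.0]:
    print(f"beta = {beta:6.1f}:", np.round(gibbs_weights(E, beta), 3))
# beta = 0   (T -> infinity): uniform, every state contributes equally
# beta = 100 (T -> 0):        all weight on the ground state E = 0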
thermal averages in equilibrium, for instance ⟨⋯⟩_T

re-write Z as an integral over all possible energies:
Z = ∫ dE ρ(E) exp[−βE], where ρ(E) ∼ vol. of states with energy E

assume extensive energy, proportional to the system size N: E = Ne, ln ρ(E) = N s(e)

Z = ∫ dE exp[ −Nβ (e − s(e)/β) ]

in large systems (N → ∞), ln Z is dominated by the minimum of the free energy
f = e − s(e)/β ∼ −ln Z/(βN)

remark: saddle point integration
for a function Φ(x) with a maximum in x_o, in the thermodynamic limit N → ∞:
(1/N) ln ∫ dx exp[N Φ(x)] → Φ(x_o),
hence −ln Z/(βN) is given by the minimum of the free energy f = e − s(e)/β
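For completeness, the standard Laplace (saddle-point) argument behind this remark, written out as an added math block:

\int dx\, e^{N\Phi(x)}
  \;\approx\; e^{N\Phi(x_0)}\!\int dx\, e^{-\frac{N}{2}|\Phi''(x_0)|(x-x_0)^2}
  \;=\; e^{N\Phi(x_0)}\sqrt{\frac{2\pi}{N|\Phi''(x_0)|}},
\qquad
\frac{1}{N}\ln\!\int dx\, e^{N\Phi(x)} \xrightarrow{\,N\to\infty\,} \Phi(x_0).

Applied to Z with Φ(e) = −β(e − s(e)/β), the integral is dominated by the minimum of f(e) = e − s(e)/β.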
machine learning

special case machine learning: choice of adaptive w, e.g. all weights in a neural network

cost function, defined w.r.t. a data set ID = {ξ^μ, S(ξ^μ)}_{μ=1}^P:

H(w) = Σ_{μ=1}^P ε(w, ξ^μ)

sum over examples, e.g. input vectors and target labels (supervised); ε(...) is the cost or error measure per example, e.g. the classification error

interpretation of training:
• the weights are the outcome of some stochastic optimization process with energy-dependent stationary P(w)
• formal (thermal) equilibrium
• ⟨...⟩_T: thermal averages (over the stochastic training)
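A minimal added sketch of this setup, assuming a linear student g(z) = z and the quadratic per-example error used in the linear-regression example later on:

import numpy as np

def H(w, xi, targets):
    """Training energy H(w) = sum_mu eps(w, xi^mu): quadratic per-example
    error of a linear unit, with the 1/sqrt(N) scaling used on these slides."""
    N = xi.shape[1]
    outputs = xi @ w / np.sqrt(N)      # student outputs g(x) = x
    return 0.5 * np.sum((outputs - targets) ** 2)

# toy usage: noise-free labels provided by a teacher vector w_star
rng = np.random.default_rng(0)
N, P = 100, 50
xi = rng.choice([-1.0, 1.0], size=(P, N))
w_star = rng.standard_normal(N)
targets = xi @ w_star / np.sqrt(N)
print("H at the teacher:", H(w_star, xi, targets))  # = 0 for noise-free data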
disorder average

• the energy/cost function is defined for one particular set of examples; typical properties require an additional average over random training data ID
• typical properties on average over data sets: derivatives of the quenched free energy ∼ ⟨ln Z⟩_ID yield the averages
  difficult: replica trick, approximations
• student / teacher scenarios:
  - define/control the complexity of the target rule and of the learning system
  - represent the target by a teacher network
• simplest assumptions:
  - independent input vectors of i.i.d. components
  - noise-free training labels provided by the teacher network
example: training of a single, linear unit

input data: independent, identically distributed random components
⟨ξ^μ_j⟩ = 0;  ⟨ξ^μ_j ξ^ν_k⟩ = δ_jk δ_μν
e.g. ξ^μ_j = ±1 (with equal prob.) or P(ξ^μ_j) = (1/√(2π)) exp[−(ξ^μ_j)²/2]

student output: g(x) with x := (1/√N) Σ_j w_j ξ_j
teacher output: g(y) with y := (1/√N) Σ_j w*_j ξ_j

pre-activations, local potentials: x, y ∼ 𝒪(1)
weight vectors w, w*:  w²/N = Q = 𝒪(1),  w*²/N = Q* = 𝒪(1)

cost function, energy, e.g. lin. regression: g(z) = z, ε(x, y) = ½ (x − y)²

H(w) = Σ_{μ=1}^P ε(x^μ, y^μ), an extensive quantity: H ∝ P = αN
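An added numerical sanity check of this setup (the sizes and the particular student/teacher vectors are arbitrary choices): for large N the pre-activations are 𝒪(1), with second moments given by Q, Q* and the overlap R = w·w*/N introduced on the following slides.

import numpy as np

rng = np.random.default_rng(1)
N, P = 1000, 20000

w_star = rng.standard_normal(N)                  # teacher weights
w = 0.6 * w_star + 0.8 * rng.standard_normal(N)  # a correlated "student"
xi = rng.choice([-1.0, 1.0], size=(P, N))        # binary i.i.d. inputs

x = xi @ w / np.sqrt(N)       # student pre-activations
y = xi @ w_star / np.sqrt(N)  # teacher pre-activations

print("Q  = w.w/N:   ", w @ w / N,          " vs <x^2>:", x.var())
print("Q* = w*.w*/N: ", w_star @ w_star / N, " vs <y^2>:", y.var())
print("R  = w.w*/N:  ", w @ w_star / N,      " vs <xy>: ", np.mean(x * y))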
partition function, training at T = 1/β

Z = ∫ Π_j dw_j δ(w² − N) exp[ −β Σ_μ ε(x^μ, y^μ) ], with measure dμ(w) := Π_j dw_j δ(w² − N)

Annealed Approximation: ln ⟨Z⟩_ID instead of ⟨ln Z⟩_ID
note: ⟨ln Z⟩_ID ≤ ln ⟨Z⟩_ID does not imply “≈” or similarity of the extrema!

⟨Z⟩_ID = ∫ dμ(w) ⟨ exp[ −β Σ_μ ε( w·ξ^μ/√N , w*·ξ^μ/√N ) ] ⟩_ID

traditional approach: integral representation of δ(x^μ − w·ξ^μ/√N), δ(y^μ − w*·ξ^μ/√N),
explicit computation of averages …, elimination of conjugate variables …

short-cut: exploit the Central Limit Theorem for i.i.d. input components and N → ∞:
x^μ = (1/√N) Σ_{j=1}^N w_j ξ^μ_j,  y^μ = (1/√N) Σ_{j=1}^N w*_j ξ^μ_j
disorder average: joint normal density P(x^μ, y^μ) of the local potentials, fully specified by

⟨x^μ⟩ = (1/√N) Σ_{j=1}^N w_j ⟨ξ^μ_j⟩ = 0
⟨(x^μ)²⟩ = (1/N) Σ_{j,k=1}^N w_j w_k ⟨ξ^μ_j ξ^μ_k⟩ = (1/N) Σ_{j=1}^N w_j²
⟨y^μ⟩ = 0,  ⟨(y^μ)²⟩ = (1/N) Σ_{j=1}^N (w*_j)²
⟨x^μ y^μ⟩ = (1/N) Σ_{j,k=1}^N w_j w*_k ⟨ξ^μ_j ξ^μ_k⟩ = (1/N) Σ_{j=1}^N w_j w*_j

set of order parameters:
(1/N) Σ_j w_j² = Q (= 1),  (1/N) Σ_j w_j w*_j = R,  (1/N) Σ_j (w*_j)² = Q* (= 1)

macroscopic properties of the trained network instead of microscopic details
annealed free energy

⟨Z⟩_ID = ∫ dR exp[ N (G_o(R) − α G_1(R)) ], with αN = P

⟨⋯⟩_ID factorizes w.r.t. j = 1, 2, …, N and μ = 1, 2, …, P

entropy term (N-dim. geometry, independent of model details):
G_o(R) = (1/N) ln ∫ Π_j dw_j δ(N − w²) δ(NR − w·w*)

energy term (model, training):
G_1(R) = −ln ∫ dx dy P(x, y) exp[ −β ε(x, y) ]

saddle-point integration for N → ∞:
−β f_ann = (1/N) ln ⟨Z⟩_ID = extr_R [ G_o(R) − α G_1(R) ]
the entropy term

G_o(R) = (1/N) ln ∫ Π_j dw_j δ(1 − w²) δ(R − w·w*) ≈ … = ½ ln(1 − R²)

the hard way: integral representation of the delta functions, introducing a conjugate variable R̂; saddle-point integration for large N w.r.t. R, R̂

geometry: the component of w along w* is fixed to R; the remaining sphere has radius r = √(1 − R²) and volume V ∼ (1 − R²)^(N/2), so

G_o(R) = (1/N) ln V ∼ ½ ln(1 − R²)

general result [R. Urbanczik]: for a set of vectors with matrix 𝒞 of pairwise dot-products and norms, G_o = ½ ln det 𝒞; here
𝒞 = ( 1 R ; R 1 )
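An added one-line check that the general determinant formula reproduces the two-vector result:

import numpy as np

# G_o = 0.5 * ln det C with C = [[1, R], [R, 1]] should equal 0.5 * ln(1 - R^2)
for R in [0.0, 0.3, 0.9]:
    C = np.array([[1.0, R], [R, 1.0]])
    print(R, 0.5 * np.log(np.linalg.det(C)), 0.5 * np.log(1 - R**2))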
the energy term

G_1(R) = −ln ∫ (dx dy)/(2π √(1 − R²)) exp[ −(x² + y² − 2Rxy)/(2(1 − R²)) ] exp[−β ε(x, y)]

linear regression (single linear student and teacher): ε(x, y) = ½ (x − y)²
elementary Gaussian integrals give G_1(R) = ½ ln[1 + 2β(1 − R)]

annealed free energy (+ irrelevant constants and terms that vanish for N → ∞):
−(βf) = −½ α ln[1 + 2β(1 − R)] + ½ ln(1 − R²)

∂(βf)/∂R = 0 ⇒ R/(1 − R²) = αβ/(1 + 2β(1 − R)) → R(α) at a given β
learning curves (linear regression)

typical success of training, i.e. student/teacher similarity as a function of the training set size

[figure: R vs. α for β = 0.1, 1, 10, 100, 1000; R increases with α and approaches 1 for large β]

generalization error and training error:

ε_g = (1 − R)
ε_t = (1/α) ∂(βf)/∂β = (1 − R)/(1 + 2β(1 − R))

[figure: ε_g and ε_t vs. α at β = 1]
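These curves are easy to reproduce; an added numerical sketch (plain bisection, arbitrary parameter values) solves the saddle-point equation for R(α) and evaluates ε_g and ε_t from the formulas above:

import numpy as np

def R_of_alpha(alpha, beta, tol=1e-12):
    """Solve R/(1-R^2) = alpha*beta / (1 + 2*beta*(1-R)) by bisection:
    the left side grows from 0 to infinity as R runs from 0 to 1."""
    f = lambda R: R / (1 - R**2) - alpha * beta / (1 + 2 * beta * (1 - R))
    lo, hi = 0.0, 1.0 - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

beta = 1.0
for alpha in [0.5, 1.0, 2.0, 5.0, 10.0]:
    R = R_of_alpha(alpha, beta)
    eps_g = 1 - R
    eps_t = (1 - R) / (1 + 2 * beta * (1 - R))
    print(f"alpha={alpha:5.1f}  R={R:.4f}  eps_g={eps_g:.4f}  eps_t={eps_t:.4f}")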
remark: interpretation of the AA

partition function in the AA:
⟨Z⟩_ID = ∫ dμ(w) ⟨ exp[−β H(w)] ⟩_ID = ∫ dμ(w) ∫ dμ({ξ^μ}_{μ=1}^P) exp[ −β H({ξ^μ}, w) ]

interpretation: the partition sum of a system in which weights and data are degrees of freedom that can be optimized (annealed) w.r.t. H

correct treatment: the data constitute frozen disorder in H

observation/folklore: the AA works (qualitatively) well in realizable cases, e.g. student and teacher of the same complexity and noise-free data; it fails in unrealizable cases (noise, mismatch), because the (hypothetical) system can “adapt the data to the task”, which yields over-optimistic results
proper disorder average: replica trick/method

replica trick:
⟨ln Z⟩_ID = lim_{n→0} (⟨Z^n⟩_ID − 1)/n = lim_{n→0} ∂⟨Z^n⟩/∂n = lim_{n→0} (1/n) ln ⟨Z^n⟩_ID

formally: n non-interacting „copies“ of the system (replicas):

⟨Z^n⟩_ID = ∫ Π_{a=1}^n dμ(w^a) ⟨ exp[ −β Σ_μ Σ_a ε( w^a·ξ^μ/√N , w*·ξ^μ/√N ) ] ⟩_ID

integration over the joint density P(x^μ_1, x^μ_2, …, x^μ_n, y^μ)

the data set average introduces effective interactions between the replicas;
⟨Z^n⟩_ID involves the order parameters R_a = w^a·w*/N, q_ab = w^a·w^b/N

… saddle point integration for N → ∞; the quenched free energy requires an analytic continuation for n ∈ ℝ and n → 0

mathematical subtleties, replica symmetry-breaking …
Marc Mezard, Giorgio Parisi (*), Miguel Virasoro: Spin Glass Theory and Beyond (1987); (*) Nobel Prize 2021
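Why the first identity holds (an added one-line expansion, standard textbook reasoning): for small n, Z^n = e^{n ln Z} ≈ 1 + n ln Z, so

\langle Z^n \rangle_{ID} = \langle e^{\,n\ln Z} \rangle_{ID}
  = 1 + n\,\langle \ln Z \rangle_{ID} + \mathcal{O}(n^2)
\;\Longrightarrow\;
\langle \ln Z \rangle_{ID}
  = \lim_{n\to 0}\frac{\langle Z^n\rangle_{ID}-1}{n}
  = \lim_{n\to 0}\frac{1}{n}\ln\langle Z^n\rangle_{ID}.

The trick is to evaluate ⟨Z^n⟩_ID for integer n and continue the result to n → 0.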
historical :-) examples of perceptron learning curves

student: S = sign(w·ξ), teacher: S* = sign(w*·ξ)

perceptron, zero temperature training from noise-free, linearly separable data

[figure: ε_g vs. α for the Gibbs student, optimal generalization, maximum stability, and Adaline with ε(x, y) = ½ [x − sign(y)]²; the asymptotic decays ε_g ∝ α^(−1) and ε_g ∝ α^(−1/2) are indicated]

more in the literature:
- label noise
- teacher weight noise
- variational optimization of the cost function
- weight decay
- worst case training
- …
training at high temperatures

the AA becomes exact in the limit T → ∞ (the replicas decouple)

energy term, expanded for β → 0:
G_1(R) = −ln ∫ dx dy P(x, y) exp[−β ε(x, y)]
       ≈ −ln ∫ dx dy P(x, y) (1 − β ε(x, y))
       ≈ −ln [1 − β ⟨ε(x, y)⟩_{x,y}] ≈ β ε_g

the generalization error (arbitrary input)!

free energy: βf ≈ (αβ) ε_g − G_o(R)

only meaningful if (αβ) = 𝒪(1): β → 0, T → ∞ together with α = P/N → ∞, i.e. learn almost nothing from infinitely many examples

here: P and T cannot be varied independently; ε_g and ε_t are indistinguishable (input space is sampled perfectly)
layered networks: “soft committee machines” (SCM)

adaptive student: N inputs, K hidden units
a teacher network parameterizes the target rule

consider specific activation functions, e.g. sigmoidal / ReLU

thermodynamic limit: description in terms of order parameters (student/teacher and student/student overlaps); site symmetry / hidden unit specialization
typical SCM learning curves (high-T)

ε_g as a function of the training set size from the high-T free energy:
express the generalization error ε_g and the entropy s as functions of the order parameters {R, S, Q, C}

sigmoidal activation: discontinuous phase transition (K > 2) between an unspecialized / anti-specialized phase (R = S / R < S) and a specialized phase (R > S); the poor-performing phase persists for large data sets

ReLU activation: continuous transition from anti-specialized (R < S) through unspecialized (R = S) to specialized (R > S); similar performances, lower (free) energy barrier
layered networks (SCM): references

sigmoidal activation, high-T, arbitrary K:
M. Biehl, E. Schlösser, M. Ahr. Phase Transitions in Soft-Committee Machines. Europhysics Letters 44: 261-267 (1998)

sigmoidal activation, replica, large K = M → ∞:
M. Ahr, M. Biehl, R. Urbanczik. Statistical Physics and Practical Training of Soft-Committee Machines. Eur. Phys. J. B 10: 583-588 (1999)

ReLU activation, high-T, arbitrary K:
E. Oostwal, M. Straat, M. Biehl. Hidden Unit Specialization in Layered Neural Networks: ReLU vs. Sigmoidal Activation. Physica A 564: 125517 (2021)

challenges:
- more general activation functions
- overfitting/underfitting
- low temperatures: AA, replica
- many layers (deep networks)
- non-trivial (realistic) input densities [Zdeborova, Goldt, Mezard, …]
on-going & future work

The Role of the Activation Function in Feedforward Learning Systems (RAFFLES)
NWO-funded project, Frederieke Richert

Robust Learning of Sparse Representations: Brain-inspired Inhibition and Statistical Physics Analysis
2 PhD projects funded by the Groningen Cognitive Systems and Materials Centre CogniGron, in collaboration with George Azzopardi:
- study network architectures and training schemes which favor sparse activity and sparse connectivity
- consider activation functions which relate to hardware-realizable adaptive systems

see: www.cs.rug.nl/~biehl (link to description and application form, deadline: 29 September 2022)