The document discusses statistical physics approaches to machine learning and neural networks. It provides an overview of key concepts from statistical physics that are relevant to modeling learning processes, such as stochastic optimization, thermal equilibrium, free energy, and annealed approximations. As examples, it summarizes analyses of perceptron learning curves and phase transitions in training Ising perceptrons and soft committee machines. The statistical physics perspective aims to characterize typical properties and behaviors of learning systems.
"The statistical physics of learning revisited: Phase transitions in layered neural networks"
University of Groningen
Physics Colloquium at the University of Leipzig, Germany, June 29, 2021
24 slides, ca. 45 minutes
A tutorial given at the AMALEA workshop 2022.
This talk presents the statistical physics based theory of machine learning in terms of simple example systems. As a recent application, the occurrence of phase transitions in layered networks is discussed.
1. MiWoCI IEEE 2018 1
The statistical physics of learning - revisited
www.cs.rug.nl/~biehl
Michael Biehl
Bernoulli Institute for
Mathematics, Computer Science
and Artificial Intelligence
University of Groningen / NL
2. machine learning theory?
Computational Learning Theory
performance bounds & guarantees
independent of
- specific task
- statistical properties of data
- details of the training
...
Statistical Physics of Learning:
typical properties & phenomena
for models of specific
- systems/network architectures
- statistics of data and noise
- training algorithms / cost functions
...
4. news from the stone age of neural networks
Statistical Physics of Neural Networks: two ground-breaking papers
Training, feed-forward networks: Elizabeth Gardner (1957-1988). The space of interactions in neural networks. J. Phys. A 21:257-270 (1988)
Dynamics, attractor neural networks: John Hopfield. Neural networks and physical systems with emergent collective computational abilities. PNAS 79(8):2554-2558 (1982)
5. overview
From stochastic optimization (Monte Carlo, Langevin dynamics)
.... to thermal equilibrium: temperature, free energy, entropy, ...
(.... and back): formal application to optimization
Machine learning: typical properties of large learning systems
training: stochastic optimization of (many) weights, guided by a data-dependent cost function
randomized data (frozen disorder)
models: student/teacher scenarios
Examples: perceptron classifier, "Ising" perceptron, layered networks
analysis: order parameters, disorder average, replica trick, annealed approximation, high temperature limit
Outlook
6. stochastic optimization
objective/cost/energy function H(W) with many degrees of freedom W, either discrete (e.g. binary weights) or continuous

Metropolis algorithm (discrete W):
• suggest a (small) change of the configuration, e.g. a "single spin flip" of component j, for a random j
• compute the resulting change ΔH of the cost function
• acceptance of the change:
- always if ΔH ≤ 0
- with probability exp(-β ΔH) if ΔH > 0
• the temperature-like parameter controls the acceptance rate for "uphill" moves

Langevin dynamics (continuous W):
• continuous temporal change, a "noisy gradient descent"
• with delta-correlated white noise (spatial + temporal independence)
• the noise strength controls the noise level, i.e. the random deviation from the gradient
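The Metropolis branch of the slide can be illustrated with a minimal, self-contained sketch; the toy energy function and all parameter values below are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def H(S, J):
    """Toy energy: negative alignment of the binary configuration S with a fixed pattern J."""
    return -float(S @ J)

def metropolis_step(S, J, beta, rng):
    """One single-spin-flip Metropolis update at inverse temperature beta."""
    j = rng.integers(len(S))      # pick a random component j
    S_new = S.copy()
    S_new[j] *= -1                # suggest a (small) change: flip one spin
    dH = H(S_new, J) - H(S, J)    # compute the change of the cost function
    # accept always if dH <= 0, otherwise with probability exp(-beta * dH)
    if dH <= 0 or rng.random() < np.exp(-beta * dH):
        return S_new
    return S

N = 100
J = np.ones(N)                        # ground state of the toy energy
S = rng.choice([-1.0, 1.0], size=N)   # random initial configuration
for _ in range(20000):
    S = metropolis_step(S, J, beta=2.0, rng=rng)
print(H(S, J))  # close to the minimum -N at this low temperature
```

At low temperature (large β) "uphill" flips are rarely accepted and the chain settles near the ground state; at β = 0 every proposal is accepted and all configurations become equally likely.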
7. thermal equilibrium
Markov chain / continuous dynamics
stationary density of configurations: Gibbs-Boltzmann density of states
normalization Z: "Zustandssumme", partition function
• physics: thermal equilibrium of a physical system at temperature T
• optimization: formal equilibrium situation, control parameter T
note: additional constraints can be imposed on the weights, for instance: normalization
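Written out, the Gibbs-Boltzmann density and partition function referred to above take the standard form (restated here for reference; W denotes the weight configuration):

```latex
P(W) = \frac{1}{Z}\, e^{-\beta H(W)}, \qquad
Z = \int d\mu(W)\; e^{-\beta H(W)}, \qquad \beta = 1/T ,
```

where dμ(W) is the (possibly constrained) measure over configurations, e.g. restricted to normalized weights.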
8. thermal averages and entropy
the role of Z: thermal averages <...>T in equilibrium, e.g. <H>T, can be expressed as derivatives of ln Z
the volume of states with energy E defines the (microcanonical) entropy per degree of freedom
assume extensive energy, proportional to the system size N
re-write Z as an integral over all possible energies
9. Darwin-Fowler, aka saddle point integration
the integrand is a function with a sharp maximum; consider the thermodynamic limit N → ∞
ln Z / N is given by the minimum of the free energy (per degree of freedom)
f = e - s(e) / β
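The rewriting as an energy integral and its saddle-point (Darwin-Fowler) evaluation can be spelled out as follows (standard steps; Ω(E) denotes the volume of states with energy E):

```latex
Z = \int dE\; \Omega(E)\, e^{-\beta E}
  \;\sim\; \int de\; e^{\,N\,[\,s(e) - \beta e\,]},
\qquad e = E/N, \quad s(e) = \tfrac{1}{N}\ln \Omega(Ne),
```

so that for N → ∞ the integral is dominated by the saddle point:

```latex
\frac{1}{N}\,\ln Z \;\longrightarrow\; -\beta \min_e \left[\, e - s(e)/\beta \,\right] = -\beta f .
```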
10. free energy and temperature
in large systems (thermodynamic limit) ln Z is dominated by the states with minimal free energy
T controls the competition between smaller energies and a larger number of available states
T → 0: singles out the lowest energy (groundstate); Metropolis: only down-hill moves, Langevin: true gradient descent
T → ∞: all states occur with equal probability, independent of energy; Metropolis: accept all random changes, Langevin: noise term suppresses the gradient
T = 1/β is the temperature at which <H>T = N e₀
assumption: ergodicity (all states can be reached in the dynamics)
11. statistical physics & optimization
theory of stochastic optimization by means of statistical physics:
- development of algorithms (e.g. Simulated Annealing)
- analysis of problem properties, even in the absence of practical algorithms (number of groundstates, minima, ...)
- applicable in many different contexts, universality
12. machine learning
special case machine learning: choice of adaptive parameters W, e.g. all weights in a neural network, prototype components in LVQ, centers in an RBF network, ...
cost function: defined w.r.t. the given data set, a sum over examples with feature vectors xμ and target labels σμ (if supervised)
costs or error measure ε(...) per example, e.g. the number of misclassifications
training:
• consider the weights as the outcome of a stochastic optimization process
• formal (thermal) equilibrium given by the Gibbs-Boltzmann density
• <...>T: thermal average over the training process for a particular data set
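As a concrete instance of such a cost function, here is a sketch with an error-counting ε per example for a simple perceptron; the function names and data are illustrative, assuming labels provided by a teacher vector B:

```python
import numpy as np

def perceptron_cost(W, X, labels):
    """Cost function E(W): sum over examples of an error measure per example;
    here eps simply counts misclassifications of the perceptron output sign(W.x)."""
    predictions = np.sign(X @ W)
    return int(np.sum(predictions != labels))

rng = np.random.default_rng(1)
N, P = 50, 200
X = rng.standard_normal((P, N))        # feature vectors x^mu
B = rng.standard_normal(N)             # teacher weights defining the target
labels = np.sign(X @ B)                # target labels sigma^mu
print(perceptron_cost(B, X, labels))   # the teacher itself has zero training error
print(perceptron_cost(-B, X, labels))  # the reversed teacher misclassifies everything
```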
13. quenched average over training data
• note: the energy/cost function is defined for one particular data set; typical properties follow from an additional average over randomized data
• typical properties on average over randomized data sets: derivatives of the quenched free energy <ln Z>ID yield the relevant averages
• the simplest assumption: i.i.d. input vectors with i.i.d. components
• training labels given by a target function, for instance provided by a teacher network
• student/teacher scenarios control the complexity of the target rule and the learning system, and make it possible to analyse training by (stochastic) optimization
14. average over training data: the "replica trick"
n non-interacting "copies" of the system (replicas); the quenched average introduces effective interactions between the replicas
... saddle point integration for <Zⁿ>ID yields the quenched free energy
requires analytic continuation to n → 0
mathematical subtleties: replica symmetry breaking, order parameter functions, ...
Marc Mezard, Giorgio Parisi, Miguel Virasoro: Spin Glass Theory and Beyond, World Scientific (1987)
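The identity behind the replica trick can be stated compactly (standard form):

```latex
\langle \ln Z \rangle_{\rm ID}
= \lim_{n\to 0} \frac{\langle Z^n \rangle_{\rm ID} - 1}{n},
\qquad
\langle Z^n \rangle_{\rm ID}
= \Big\langle \textstyle\prod_{a=1}^{n} Z^{(a)} \Big\rangle_{\rm ID} ,
```

where ⟨Zⁿ⟩ is evaluated for integer n, the disorder average couples the n replicas, and the result is continued analytically to n → 0.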
15. annealed approximation and high-T limit
annealed approximation: average <Z>ID in the exponent; becomes exact (=) in the high-temperature limit β ≈ 0 (the replicas decouple)
• independent single examples: the average factorizes over the data set
• saddle point integration: <ln Z>ID / N is dominated by the minimum of the free energy
• extensive number of examples (proportional to the number of weights N), with βα finite
• the generalization error plays the role of the energy (rather than the training error)
"learn almost nothing..." (high T) "...from infinitely many examples"
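In formulas, the annealed approximation interchanges the logarithm and the disorder average (a standard simplification):

```latex
\langle \ln Z \rangle_{\rm ID} \;\approx\; \ln \langle Z \rangle_{\rm ID},
\qquad\text{with}\quad
\langle \ln Z \rangle_{\rm ID} \;\le\; \ln \langle Z \rangle_{\rm ID}
```

by Jensen's inequality; for independent single examples ⟨Z⟩ factorizes over the data set, and in the high-temperature limit β ≈ 0 the approximation becomes exact.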
16. example: perceptron training
• student: linearly separable classification sign(J·ξ) with weight vector J
• teacher: target rule sign(B·ξ) with weight vector B
• training data: input vectors ξμ with independent components of zero mean and unit variance
• Central Limit Theorem (CLT), for large N: the fields J·ξ and B·ξ are jointly normally distributed; their joint statistics, given by the overlaps of J and B, fully specify the generalization error
17. example: perceptron training
i.i.d. isotropic data, geometry: the angle between student vector J and teacher vector B determines the generalization error
• or, more intuitively: the order parameter R (normalized overlap of J and B) fully specifies εg
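The geometric picture can be checked by a quick Monte Carlo experiment (not from the slides): for unit vectors J and B with overlap R = J·B, the probability that student and teacher disagree on a random isotropic input is arccos(R)/π:

```python
import numpy as np

rng = np.random.default_rng(2)
N, n_inputs = 200, 20000

B = rng.standard_normal(N)
B /= np.linalg.norm(B)                 # unit teacher vector
R = 0.8                                # prescribed student-teacher overlap
perp = rng.standard_normal(N)
perp -= (perp @ B) * B                 # component orthogonal to B
perp /= np.linalg.norm(perp)
J = R * B + np.sqrt(1 - R**2) * perp   # unit student vector with J.B = R

xi = rng.standard_normal((n_inputs, N))           # isotropic random inputs
disagree = np.mean(np.sign(xi @ J) != np.sign(xi @ B))
print(disagree, np.arccos(R) / np.pi)  # the two values agree up to MC fluctuations
```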
18. example: perceptron training
• entropy: all weights with order parameter R form a hypersphere with radius ~ (1-R²)^(1/2) and volume ~ (1-R²)^(N/2) (+ irrelevant constants)
- or: exponential representation of the δ-functions + saddle point integration...
note: the result carries over to more general overlap matrices C (many students and teachers)
• high-T free energy as a function of R
• re-scaled number of examples α̃ = βα
19. example: perceptron training
• "physical state": (arg-)minimum of the free energy f(R)
• typical learning curves: R and εg as functions of the re-scaled number of examples
• perfect generalization is achieved only asymptotically, as the re-scaled number of examples grows
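Assuming the standard high-temperature result for the spherical perceptron, f(R) = α̃ εg(R) - s(R) with α̃ = βα, εg = arccos(R)/π and entropy s(R) = ½ ln(1 - R²) (up to constants), the typical overlap follows from minimizing f over R. A numerical sketch, compared against the closed form obtained from ∂f/∂R = 0:

```python
import numpy as np

def free_energy(R, alpha_t):
    """High-T free energy per weight (up to constants): f = alpha_t * eps_g(R) - s(R)."""
    eps_g = np.arccos(R) / np.pi        # generalization error of the perceptron
    s = 0.5 * np.log(1.0 - R**2)        # entropy of the spherical weight shell
    return alpha_t * eps_g - s

R_grid = np.linspace(0.0, 0.999999, 200001)

def R_star(alpha_t):
    """Typical overlap: argmin of the free energy over R."""
    return R_grid[np.argmin(free_energy(R_grid, alpha_t))]

# stationarity df/dR = 0 gives the closed form R = (a/pi) / sqrt(1 + (a/pi)^2)
for a in [1.0, 5.0, 20.0]:
    closed = (a / np.pi) / np.sqrt(1.0 + (a / np.pi) ** 2)
    print(a, R_star(a), closed)
```

R grows monotonically with α̃ and approaches 1 only asymptotically, reproducing the smooth learning curve of the spherical perceptron.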
20. perceptron learning curve
a very simple model:
- linearly separable rule (teacher)
- i.i.d. isotropic random data
- high temperature stochastic training
with perfect generalization only in the limit of infinitely many (re-scaled) examples
(typical learning curve, on average over random linearly separable data sets of a given size)
Modifications/extensions:
- noisy data, unlearnable rules
- low-T results (annealed, replica...)
- unsupervised learning
- structured input data (clusters)
- large margin perceptron and SVM
- variational optimization of the energy function (i.e. the training algorithm)
- binary weights ("Ising perceptron")
21. example: Ising perceptron
• student: binary weights Jj = ±1
• teacher: binary weights Bj = ±1
• generalization error unchanged: εg(R) as for the continuous perceptron
• entropy: probability for alignment/misalignment of the components; entropy of mixing N(1+R)/2 aligned and N(1-R)/2 misaligned components
22. example: Ising perceptron
• competing minima in the free energy f(R)
• for small re-scaled α̃ = βα: a single minimum with poor generalization, R < 1
• for intermediate α̃: co-existing phases of poor/perfect generalization; the lower minimum is stable, the higher minimum is meta-stable
• for large α̃: only one minimum (R=1), the "system freezes" in the perfect configuration
"first order phase transition" to perfect generalization
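The competition between the minima can be reproduced numerically. Assuming the high-temperature free energy f(R) = α̃ arccos(R)/π - s(R) with the mixing entropy s(R) = -[(1+R)/2 ln((1+R)/2) + (1-R)/2 ln((1-R)/2)], a scan over α̃ = βα locates the discontinuous jump of the global minimum (a sketch; the grid and scan range are arbitrary choices):

```python
import numpy as np

def f_ising(R, alpha_t):
    """High-T free energy of the Ising perceptron: energy term minus mixing entropy."""
    eps_g = np.arccos(R) / np.pi
    p = (1.0 + R) / 2.0                         # fraction of aligned components
    with np.errstate(divide="ignore", invalid="ignore"):
        s = -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
    return alpha_t * eps_g - np.nan_to_num(s)   # s -> 0 as R -> 1

R_grid = np.linspace(0.0, 1.0, 100001)

def global_min_R(alpha_t):
    return R_grid[np.argmin(f_ising(R_grid, alpha_t))]

# scan alpha_t: the global minimum jumps discontinuously from R < 1 to R = 1
alphas = np.linspace(1.0, 2.5, 151)
R_stars = np.array([global_min_R(a) for a in alphas])
jump_at = alphas[np.argmax(R_stars > 0.99)]
print(jump_at)  # location of the first-order transition in alpha_t
```

Below the transition the poor-generalization minimum at R < 1 is the global one; above it the global minimum jumps to R = 1, a first-order transition to perfect generalization.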
23. Monte Carlo results (no prior knowledge)
results carry over (qualitatively) to low (zero) temperature training: e.g. the nature of the phase transitions etc.
(figure: first order phase transition between local and global minima of equal f, with finite size effects)
24. soft committee machine
adaptive student: N input units, K hidden units; teacher: M hidden units
order parameters: macroscopic properties of the student network, i.e. the overlaps among the student weight vectors and with the teacher; model parameters: the hidden-unit weight vectors
training: minimization of the data-dependent cost function
25. soft committee machine
exploit the thermodynamic limit: by the CLT the hidden-unit fields are normally distributed with zero means and a covariance matrix given by the order parameters (+ constant)
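The hidden-unit fields and their second-order statistics, which define the order parameters, can be written as (standard definitions for this model class; ξ is the input, J_i the student and B_n the teacher weight vectors):

```latex
x_i = \frac{\mathbf{J}_i \cdot \boldsymbol{\xi}}{\sqrt{N}}, \quad
y_n = \frac{\mathbf{B}_n \cdot \boldsymbol{\xi}}{\sqrt{N}}, \qquad
\langle x_i x_k \rangle = Q_{ik} = \frac{\mathbf{J}_i \cdot \mathbf{J}_k}{N}, \quad
\langle x_i y_n \rangle = R_{in} = \frac{\mathbf{J}_i \cdot \mathbf{B}_n}{N}, \quad
\langle y_n y_m \rangle = T_{nm} = \frac{\mathbf{B}_n \cdot \mathbf{B}_m}{N},
```

with student units i, k = 1, ..., K and teacher units n, m = 1, ..., M.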
26. soft committee machine: hidden unit specialization
K=M=2: symmetry breaking phase transition (2nd order)
K=M > 2 (e.g. K=5): 1st order phase transition with metastable states
27. soft committee machine
adaptive student vs. teacher
• initial training phase: unspecialized hidden unit weights, all student units represent the "mean teacher"
• transition to specialization makes perfect agreement possible
28. soft committee machine
adaptive student vs. teacher
• initial training phase: unspecialized hidden unit weights, all student units represent the "mean teacher"
• transition to specialization makes perfect agreement possible
• successful training requires a critical number of examples
• the hidden unit permutation symmetry has to be broken (equivalent permutations of the student's hidden units)
29. large hidden layer: many hidden units
the unspecialized state remains meta-stable up to a critical number of examples
perfect generalization without prior knowledge impossible with order O(NK) examples?
30. what's next?
network architecture and design:
• activation functions (ReLU etc.)
• deep networks
• tree-like architectures as models of convolution & pooling
dynamics of network training:
• online training by stochastic g.d.
• math. description in terms of ODE
• learning rates, momentum etc.
• regularization, e.g. drop-out, weight decay etc.
other topics:
• concept drift: time-dependent statistics of data and target
... a lot more & new ideas to come