Transforming the BCPNN
Learning Rule for Spiking Units
to a Learning Rule
for Non-Spiking Units
A N T O I N E B E R G E L
Master of Science Thesis
Stockholm, Sweden 2010
Master’s Thesis in Biomedical Engineering (30 ECTS credits)
at the School of Computer Science and Engineering
Royal Institute of Technology year 2010
Supervisor at CSC was Örjan Ekeberg
Examiner was Anders Lansner
TRITA-CSC-E 2010:059
ISRN-KTH/CSC/E--10/059--SE
ISSN-1653-5715
Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE-100 44 Stockholm, Sweden
URL: www.kth.se/csc
Abstract
The Bayesian Confidence Propagation Neural Network (BCPNN) model
has been developed over the past thirty years for specific tasks such as
classification, content-addressable memory and data mining. It
uses a Bayesian-Hebbian learning rule, which performs fairly well
both as a counter model and as a continuously operating
incremental learner. This learning rule has never been operational
in networks of spiking units: one is bound to use the outcome of the
learning with non-spiking units and to transpose it to the spiking context
afterwards, which is highly restrictive.
The aim of this Master's thesis project is to transform the existing BCPNN
learning rule for non-spiking units, including the bias term, to the
domain of spiking neural networks based on the Hodgkin-Huxley cell
model. The goal is a model running in NEURON which
can exhibit the same features observed with non-spiking units. The
secondary goal of this thesis is to compare the new learning rule to the old
one, as well as to other spike-timing dependent plasticity (STDP) learning
rules.
To achieve this goal, we introduce a new version of the BCPNN
learning rule which can account for spiking input activities. This learning
rule is based on synaptic traces, used as local variables to keep
track of the frequency of and timing between spikes. It includes three
stages of processing, all based on low-pass filtering with three different
temporal dynamics, in order to estimate the probabilities
used to compute the Bayesian weights and biases. The Bayesian
weights are mapped to a synaptic conductance, updated according to
the values of these synaptic traces, and the bias term is mapped to an
activity-regulated potassium channel.
We present results of the new spiking version of the BCPNN learning
rule in single-synapse learning and retrieval. We implement two main
models: the first based on abstract units in MATLAB, the second
based on Hodgkin-Huxley spiking units in NEURON. The latter model
accounts for spike-frequency adaptation and can be used to study the effect
of the exact timing between presynaptic and postsynaptic spikes under
repeated stimulation.
Acknowledgements
I would first like to thank Anders Lansner for allowing me to work at the
department of Computational Biology and Neuroscience at KTH, for devoting time
and patience to assuming both roles of supervisor and examiner of this Master's
thesis, and for always helping me, guiding me and giving me the best
conditions in which to produce valuable work. This first step into research at a high-level
scientific department has been a very enriching experience, which I will always
remember. I would also like to thank Örjan Ekeberg, for agreeing to tutor this
Master's thesis from abroad at first, and later for all the precious comments about
the report, presentation and structure of this work.
This past year, at the department, I have had the chance to meet a lot of
people from different backgrounds and countries. They have contributed to making
the working atmosphere very special, warm and welcoming: Claudia,
who has been here from the very beginning; Charles, for his ping-pong and chess
skills when a break was needed; Aymeric, Dave, Simon, Pawel, Pierre and all the
others, for introducing me to new bars and restaurants. I want to give special
thanks to Mikael, for interesting talks; to Pradeep and David, for their availability,
kindness and help with NEURON; and finally to Bernhard, who has not only
always been eager to answer my numerous questions and investigate new problems with
me, but has also been a devoted friend, who offered me tremendous support and help when
time pressure was high.
I cannot cite all the people I have met these past two years, but I want
to say how getting to know all of them, and all the conversations and moments we
had together, have changed me and made me realise that there are no geographic
borders to friendship and love. So, I want to thank Natasha, for the time she spent
improving the language of my report, and simply for always being supportive
and making me feel that she was here with me, though on the other side of the world.
This year would have been so different without my lovely room-mates Birte,
Isabel, Stefan F., Stefan T. and Volker, for August Mondays among other things, and
my two French buddies Fred and Joseph, for lunch breaks, poker sessions and crazy
parties. I want to give special thanks to my two Italian friends, who showed that
beyond neighbourly rivalry we just have so much in common and so much to share:
Enrico, the craziest person I have ever lived with, and Sara, probably the best pizza
and focaccia cook ever.
Finally, I want to thank my parents, who have always helped me with all the
problems one can encounter when studying abroad for two years: I know how lucky
I am to have them with me and I hope they can measure the respect I have for them.
A little word to my siblings, my sister Karen and my brother Samuel, whom I will
be very happy to meet and live with again.
Contents
1 Introduction 1
1.1 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 The BCPNN Model 5
2.1 Context and Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Bayesian Confidence Propagation . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Using Neurons as probability estimators . . . . . . . . . . . . 8
2.2.2 Derivation of Network Architecture . . . . . . . . . . . . . . . 9
2.2.3 Bayesian-Hebbian Learning . . . . . . . . . . . . . . . . . . . 9
2.3 Gradual Development of the BCPNN model . . . . . . . . . . . . . . 9
2.3.1 Naive Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . 9
2.3.2 Higher Order Bayesian Model . . . . . . . . . . . . . . . . . . 11
2.3.3 Graded units . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.4 Recurrent Network . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 BCPNN Learning Implementations . . . . . . . . . . . . . . . . . . . 14
2.4.1 Counter Model . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.2 Incremental Learning . . . . . . . . . . . . . . . . . . . . . . 15
2.5 Performance Evaluation and Applications . . . . . . . . . . . . . . . 16
3 A spiking BCPNN Learning Rule 19
3.1 Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.1 Synaptic traces as local state variables . . . . . . . . . . . . . 21
3.2.2 Spike-timing Dependence . . . . . . . . . . . . . . . . . . . . 22
3.2.3 Delayed-Reward Learning . . . . . . . . . . . . . . . . . . . . 23
3.2.4 Long-term Memory . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.5 Probabilistic features . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Biological relevance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4 Abstract Units Implementation 27
4.1 Pattern presentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1.1 Non-spiking Pattern Presentation . . . . . . . . . . . . . . . . 28
4.1.2 Spiking frequency-based Pattern Presentation . . . . . . . . . 28
4.1.3 Spiking Poisson-generated Pattern Presentation . . . . . . . . 29
4.2 Learning Rule Implementation . . . . . . . . . . . . . . . . . . . . . 31
4.3 Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5 Hodgkin-Huxley Spiking Implementation in NEURON 35
5.1 Cell Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.1.1 Hodgkin Huxley Model . . . . . . . . . . . . . . . . . . . . . 35
5.1.2 Spike Frequency Adaptation . . . . . . . . . . . . . . . . . . . 37
5.2 Pattern presentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.3 Learning Rule Implementation . . . . . . . . . . . . . . . . . . . . . 39
5.3.1 Synaptic Integration . . . . . . . . . . . . . . . . . . . . . . . 39
5.3.2 Bias term . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.4 Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6 Results 47
6.1 Abstract units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.1.1 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.1.2 Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.2 Hodgkin-Huxley Spiking Units . . . . . . . . . . . . . . . . . . . . . 51
6.2.1 Steady-State Current Discharge . . . . . . . . . . . . . . . . . 51
6.2.2 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.2.3 Parameter tuning . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.2.4 Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.2.5 Spike Timing Dependence . . . . . . . . . . . . . . . . . . . . 60
7 Discussion 63
7.1 Model Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
7.1.1 Learning Rule Parameters . . . . . . . . . . . . . . . . . . . . 63
7.1.2 Pattern Variability . . . . . . . . . . . . . . . . . . . . . . . . 65
7.1.3 Learning-Inference Paradigm . . . . . . . . . . . . . . . . . . 65
7.2 Comparison to other learning rules . . . . . . . . . . . . . . . . . . . 66
7.2.1 Spiking vs Non-spiking Learning Rule . . . . . . . . . . . . . 66
7.2.2 Spike-timing dependence and real data . . . . . . . . . . . . . 68
7.2.3 Sliding threshold and BCM Rule . . . . . . . . . . . . . . . . 69
7.3 Further Developments and limitations . . . . . . . . . . . . . . . . . 71
7.3.1 Network implementation . . . . . . . . . . . . . . . . . . . . . 71
7.3.2 RSNP cells and inhibitory input . . . . . . . . . . . . . . . . 71
7.3.3 Hypercolumns, basket cell and lateral inhibition . . . . . . . 72
7.3.4 Parallel computing . . . . . . . . . . . . . . . . . . . . . . . . 73
8 Conclusion 75
Bibliography 77
Appendices 79
A NMODL files 81
A.1 Synapse modelisation . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
A.2 A-Type Potassium Channel . . . . . . . . . . . . . . . . . . . . . . . 83
B Hodgkin-Huxley Delayed Rectifier Model 87
B.1 Voltage Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
B.2 Equations for Gating Variables . . . . . . . . . . . . . . . . . . . . . 87
C NEURON stimulations parameters 89
Chapter 1
Introduction
1.1 Context
Since Hebb's theory in 1949, synaptic plasticity (the ability of the synaptic
connection between two neurons to change its strength according to a certain
conjunction of presynaptic and postsynaptic events) has been thought to be the biological
substrate for high-level cognitive functions like learning and memory. This idea
is actually much older and was formalized by the Spanish neuroanatomist Santiago
Ramón y Cajal in 1894, who suggested “a mechanism of learning that did not re-
quire the formation of new neurons”, but proposed that “memories might instead be
formed by strengthening the connections between existing neurons to improve the
effectiveness of their communication” [29]. Hebb went a step further by proposing
his ideas about the existence of a metabolic growth process associating neurons that
tend to have correlated firing activity [13].
For the brain to be able to form, store and retrieve memories, as well as learn spe-
cific tasks, the biological changes at the synapse level need to be long-lasting. This
persistent increase or decrease in synaptic strength is called long-term potentiation
(LTP) or long-term depression (LTD), and is considered the key mechanism un-
derlying learning and memory. The biological mechanisms responsible for long-term
potentiation are not exactly known, but specific protein synthesis, second-messenger
systems and N-methyl-D-aspartate (NMDA) receptors are thought to play a critical
role in its formation [20].
In 1995, Fuster defined memory as “a functional property of each and all of
the areas of the cerebral cortex, and thus of all cortical systems”. He distinguishes
several types of memory: short-term/long-term, sensory/motor, declarative/non-
declarative and individual/phyletic. He proposes that all memory is associative and
that its strength depends on the number of associations we make to a specific word or
mental object [11]. He introduced several key concepts, such as working memory, as
a gateway to long-term memory awaiting consolidation, and active memory,
as a cortical network of neurons whose activity is above a certain baseline.
His perception-action cycle, suggesting a constant flow of information between
sensory and motor memory, has also proved to be of interest for future
experimentation.
More recently, investigations have focused on spike-timing-dependent plasticity
(STDP), which refers to synaptic changes sensitive to the exact timing of action po-
tentials in two connected neurons: one refers to pre-post timing, or positively-
correlated timing, when the presynaptic neuron fires a few milliseconds before the
postsynaptic neuron, and to post-pre timing, or negatively-correlated timing, when
it is the other way around. STDP has become a popular subject since the ex-
perimental work of Bi and Poo [6], who first demonstrated the strong influence of
the exact timing of presynaptic and postsynaptic spikes (typically a time window of
20 ms for cultured hippocampal neurons) on synaptic long-term potentiation. Their
work with cultures of hippocampal neurons, supported by the work of others, e.g.
Rubin et al. 2005 and Mayr et al. 2009 [30, 25], has resulted in the formulation of
STDP-type learning rules [27, 9].
One must be aware, however, that these rules are considered rather crude ap-
proximations by relevant experimentalists. There is a constant duality between the two
possible ways to approach neuroscience: some aim to understand the biological
mechanisms at the cell and membrane level, so that they can build models to
reproduce them, whereas others aim to reproduce cell behaviour for applications
and fit their models to experimental data rather than to theory. Both approaches
have their justification and they are likely complementary.
However, even though results are emerging, our understanding of the mechanisms of the brain is
still partial and a great deal remains to be done.
In this project, we focus on the Bayesian Confidence Propagation Neural Net-
work (BCPNN) model, which was first studied by Lansner and Ekeberg (1989)
[22] and Kononenko (1989) [18]. Its main features are a network architecture di-
rectly derived from Bayes' rule and unit activities representing the probabilities of
stochastic events. The BCPNN model is thoroughly described in Chapter 2.
1.2 Motivations
In 2003, Sandberg et al. proposed that “a possible future extension of the
existing BCPNN model would be to implement the model using spiking neurons to
further examine its generality and properties, such as the effect of spike synchrony
in memory reset and the effects of AHP modulation on network dynamics” [32]. At
that time, the model had just been improved from a counter model to a continuously
operating incremental learning model. In this respect, the work presented here
continues what has already been done and addresses the need for such a learning
rule operating in a spiking context.
Artificial neurons are a very crude approximation of real neurons: given input
from other neurons, they generate an output through an activity function. Spiking
neurons, however, mimic the behaviour of real neurons: in particular, they exhibit
spikes (they “fire” and take a high positive value) only when their potential crosses
a threshold, and only for a very short amount of time. These neurons simulate the all-or-
nothing behaviour and action potentials observed in real neurons [20]. The variables
attached to them, such as membrane voltage, capacitance and synaptic conductance,
have a real biological meaning.
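To make this contrast concrete, here is a minimal sketch of the simplest kind of spiking unit, a leaky integrate-and-fire neuron. This is a hypothetical illustration only: it is far simpler than the Hodgkin-Huxley model used later in this thesis, and all parameter values are arbitrary.

```python
def lif_simulate(input_current, dt=0.1, tau=10.0, v_rest=-65.0,
                 v_thresh=-50.0, v_reset=-70.0, v_peak=30.0):
    """Leaky integrate-and-fire neuron (illustrative parameters, in mV/ms).

    The membrane potential integrates the input current and decays towards
    rest; when it crosses the threshold, the unit emits a brief all-or-nothing
    spike (a short, high positive value) and resets. Euler integration.
    """
    v = v_rest
    trace, spikes = [], []
    for step, i_ext in enumerate(input_current):
        v += dt * (-(v - v_rest) + i_ext) / tau
        if v >= v_thresh:
            trace.append(v_peak)   # the "spike": high value for one time step
            spikes.append(step)
            v = v_reset            # after the spike, the potential resets
        else:
            trace.append(v)
    return trace, spikes
```

With a constant suprathreshold current the unit fires regularly; with no input it stays silently at rest, illustrating the all-or-nothing behaviour described above.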
Since large-scale implementations of neural networks are often based on
spiking units, it is valuable to have such a formulation in order to run on-line
learning in large-scale spiking neural networks as well. The project aims to end up
with a network model in which the new online learning rule operates, and to use it to test
some of the emergent phenomena. Evaluating the model by comparing it to the
original BCPNN rule, to other STDP rules, and to some experimental data on LTP
[30] is our prime motivation. Because of its specific features (both Hebbian and
Bayesian), the BCPNN learning rule can serve as a reference for other
STDP learning rules to be implemented in the future. With regard to the bias
term, a comparison can also be made with the threshold in the BCM learning rule,
developed in 1982 by Bienenstock, Cooper and Munro [7].
Transforming the BCPNN learning rule to a spiking-neuron environ-
ment is challenging and has never been done before. This opens up
the extent of our work tremendously, so the scope of this Master's thesis must be
limited for the sake of simplicity. We narrow our work to two main objectives.
The first is the comparison to other existing learning rules, as explained above.
The second, somewhat more abstract, is to reconcile the probabilistic features of
the original BCPNN learning rule with the spike-timing dependent features developed
in STDP models (Bi and Poo 1998, Morrison 2008, Clopath 2008) [6, 27, 9]. The
new learning rule presented in Chapter 3 is built to take STDP-like features into
account, and we aim to fit our model to existing experimental data relating to
the spike-timing dependent plasticity window (Bi and Poo 1998) [6] and intrinsic
excitability (Jung et al. 2009) [19], following a phenomenological approach to the
problem.
A further improvement of our work would be to modify the learning rule so
that it could run on parallel computers in a large-scale context. This work is not
meant to state decisive results, or to exhaustively study one specific feature of the
BCPNN model, but rather to initiate the conversion of the BCPNN model to a
spiking-unit environment.
1.3 Outline
We will first present, in Chapter 2, the basics of the BCPNN model and its
mathematical context, from its most basic form (Naive Bayes Classifier) to more re-
cent ones (Higher Order Model, Recurrent Network). We will also review the existing
implementations (counter model, incremental learning) and their applications. In
Chapter 3, the ‘spiking’ version of the learning rule is presented, with its new features and
their biological motivation. The two following chapters contain the core
of the thesis: we describe how we implemented the new learning rule
with abstract units in MATLAB (Chapter 4) and in a spiking context in NEURON
(Chapter 5). The results are presented in Chapter 6, covering single-synapse learning,
the network implementation and the phenomenological approach to fitting STDP data.
Dependence on model parameters and comparisons to other existing learning rules
are discussed in Chapter 7. Finally, Chapter 8 is dedicated to further developments
and to the conclusion.
Chapter 2
The BCPNN Model
2.1 Context and Definitions
Artificial Neural Networks
An artificial neural network (ANN) is a computational model that aims to
simulate the structure and/or functional aspects of biological neural networks. It
consists of a group of computational units, connected by weighted links through
which activation values are transmitted.
The reader can find documentation about ANNs in the literature, and the pur-
pose here is not to discuss neural networks in a general fashion. Still, we think it is
valuable in our context to recall the main features of artificial neural networks.
Nodes The functional unit or node of the network is its basic constituting
element. Even if it originally has a biological equivalent, such as a neuron or,
more recently, a minicolumn, it is an abstract unit, which means that the
variables attached to it are artificial and do not have an intrinsic biological
meaning. A node i is assigned a random variable xi that can be binary,
discrete or continuous. It takes its input from other units xj and generates an
output yi.
Activity Function The activity function or transfer function is the function giv-
ing the input-output relationship for one node. Common activity functions
include linear, thresholded and sigmoid functions. The input-output rela-
tionship for one unit i is given by $y_i = \varphi(\beta_i + \sum_{j=1}^{n} \omega_{ij} x_j)$, where $\varphi$ is the
activity function, $\omega_{ij}$ the weight between unit i and unit j, and $\beta_i$ the bias of
unit i.
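As a sketch, the input-output relationship of a single node can be written directly in code. The choice of tanh as activity function and the example values below are assumptions for illustration, not anything prescribed by the BCPNN model.

```python
import math

def node_output(x, w, beta, phi=math.tanh):
    """Single node: y_i = phi(beta_i + sum_j w_ij * x_j).

    x: activities of the input units x_j; w: weights w_ij into this node;
    beta: the node's bias; phi: the activity (transfer) function.
    """
    total = beta + sum(w_j * x_j for w_j, x_j in zip(w, x))
    return phi(total)

x = [0.0, 1.0, 1.0]     # input activities from other units
w = [0.5, -0.2, 0.8]    # weights w_ij (illustrative values)
y = node_output(x, w, beta=0.1)
```

Swapping `phi` for a linear or thresholded function reproduces the other common activity functions mentioned above.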
Learning Rule The learning rule is an algorithm that modifies connections be-
tween units, the so-called weights, in response to the presentation of an input
pattern. It is often the key point of the implementation, because it determines
the response of the network to specific input, hence its applications. Classical
learning rules include Perceptron Learning, Delta rule and Error Backpropa-
gation.
Network Architecture A network can have several topologies. It can be com-
posed of layers (single-layer or multi-layer networks) that communicate in
only one direction (feedforward network) or in both directions (recurrent
network). Connections between units in the network can be
sparse or all-to-all. Networks can include one or several hidden layers (internal
computational units which are not accessible from the network interface, but
are used to create a specific internal representation of the data).
Input and Output units In a feedforward network, a network receives informa-
tion from input units and proposes an interpretation available at the output
units. In a recurrent network, though, the difference between input and output
units is less clear : input consists of an activation of a set of units representing
an input pattern and an output pattern is read from the activity of the units
after a phase called relaxation.
Learning and Retrieval
A network can be used in one of two modes: learning or retrieval.
During the learning phase, the network input units are clamped to certain values
(clamping means that the units are assigned values set by the operator through a
controlled process) for a certain amount of time (a set of input unit values represents
an input pattern). During clamping, the learning rule operates so that the weights
are updated and retain the information contained in the pattern that has been
presented. In other words, during learning the network adapts to reality
(the clamped input pattern) and changes its internal connections to remember it in
the future: learning is said to be stimulus-driven.
During the retrieval phase, the weights of the network are assumed to be fixed,
keeping the internal structure of the network unchanged. Distorted, incomplete or
different patterns from the ones used during learning are presented to the network,
and an output pattern is generated. In the case of layered networks, inference
is realized by feeding a pattern to the input units and collecting the result at the output
units. In other words, the network interprets the input data using its internal
representation, or knowledge.
For a recurrent network, however, the input pattern is fed to all input units (all
units in the network except the hidden units), and the network starts a phase
called relaxation. Relaxation of the network consists of taking a pattern as input
and incrementally updating the units' activities according to an inference rule;
this stops when stability is reached, i.e. when the change in the units' activities is
sufficiently small. When the weight matrix is symmetric, convergence is guaranteed and
relaxation always converges to a stable attractor state [16].
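The relaxation procedure just described can be sketched as follows. The logistic update used here is a generic placeholder inference rule for illustration, not the BCPNN inference rule (which uses an exponential transfer function and is derived later in this chapter); the stopping tolerance is likewise an arbitrary choice.

```python
import math

def relax(W, pattern, n_steps=100, tol=1e-6):
    """Relax a recurrent network: update all units' activities from the
    weighted input of the others until the change between steps is small.

    W: symmetric weight matrix (list of lists); pattern: initial activities
    in [0, 1]. Uses a placeholder logistic inference rule.
    """
    y = list(pattern)
    for _ in range(n_steps):
        new_y = []
        for i in range(len(y)):
            s = sum(W[i][j] * y[j] for j in range(len(y)))
            new_y.append(1.0 / (1.0 + math.exp(-s)))   # logistic update
        change = max(abs(a - b) for a, b in zip(new_y, y))
        y = new_y
        if change < tol:       # stability reached: stop relaxing
            break
    return y
```

With a symmetric W, the activities settle into a fixed point, mirroring the convergence to a stable attractor state mentioned above.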
For correct knowledge to be acquired, one must learn a pattern (learning phase)
and then check whether the pattern has been stored correctly (retrieval phase). It is
important, however, to alternate these two phases, so that the information stored by the network
is constantly updated and corrected. One must ensure that the network does
not learn its own interpretation of the data, by shutting off the learning phase from
time to time.
Hebb’s postulate
Introduced by Donald Hebb in 1949, Hebb's postulate, also called cell assembly
theory, is one of the earliest rules about synaptic plasticity. It was formulated
as follows:
When an axon of cell A is near enough to excite a cell B and repeatedly
or persistently takes part in firing it, some growth process or metabolic
change takes place in one or both cells such that A’s efficiency, as one
of the cells firing B, is increased [13].
The theory is often summarized as “cells that fire together, wire together” and
is commonly evoked to explain some types of associative learning in which simulta-
neous activation of cells leads to pronounced increases in synaptic strength. Such
learning is known as Hebbian learning. The general idea is that cells or groups of cells
that are repeatedly active at the same time will tend to become associated, so that
activity in one facilitates activity in the other [1]. Work in the laboratory of Eric
Kandel has provided evidence for the involvement of Hebbian learning mechanisms
at synapses in the marine gastropod Aplysia californica [21].
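In its simplest rate-based form, Hebbian learning can be sketched as a weight increment proportional to the product of presynaptic and postsynaptic activities; the function name and learning rate below are illustrative, not taken from any particular model in this thesis.

```python
def hebbian_update(w, pre, post, eta=0.01):
    """'Cells that fire together, wire together': increase the weight w
    in proportion to the co-activity of the pre- and postsynaptic units.

    pre, post: unit activities; eta: learning rate (illustrative value).
    """
    return w + eta * pre * post
```

Note that only simultaneous activity strengthens the connection: if either unit is silent, the weight is left unchanged, which is the associative character described above.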
Associative Memory
Fuster describes associative memory as “a system of memory, usually constituted
by associations between stimuli and reinforcement” [11], as opposed to recognition
or episodic memories. However, according to him, association is an attribute of all
memories, from the root of their genesis to their evocation. More widespread is the
definition of auto-associative and hetero-associative memories as forms of neural
networks that enable one to retrieve entire memories from only a tiny sample of
them. Hetero-associative networks can produce output patterns of a different size
than the input pattern (mapping from a pattern x to a pattern y with a
non-square connection matrix W), whereas auto-associative networks work with
a fixed pattern size (mapping of the same pattern x with a square connection
matrix W).
The Hopfield network (Hopfield 1982 [16]) is the most widely implemented auto-
associative memory network and serves as a content-addressable memory with binary
threshold units. Under the restrictions wii = 0 (no unit has a connection
with itself) and wij = wji (connections are symmetric), convergence to a local min-
imum of a certain energy function is guaranteed. During learning, the connection
matrix W is modified to allow for attractor dynamics, so that relaxation of the
network causes the input pattern to converge towards the closest attractor state.
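A minimal sketch of a Hopfield network with the classical outer-product (Hebbian) learning rule and synchronous threshold updates is given below. It illustrates the restrictions just stated (wii = 0, symmetric W) and the attractor behaviour; it is not any specific implementation from this thesis.

```python
def hopfield_train(patterns):
    """Outer-product rule: w_ij = (1/P) * sum_p p_i * p_j, with w_ii = 0.

    patterns: list of patterns with entries +1/-1. W is symmetric
    by construction, so relaxation converges to an attractor state.
    """
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j] / len(patterns)
    return W

def hopfield_recall(W, state, n_steps=10):
    """Threshold updates until a fixed point (attractor) is reached."""
    s = list(state)
    for _ in range(n_steps):
        new_s = [1 if sum(W[i][j] * s[j] for j in range(len(s))) >= 0 else -1
                 for i in range(len(s))]
        if new_s == s:          # fixed point: the attractor state
            break
        s = new_s
    return s
```

Feeding a distorted version of a stored pattern to `hopfield_recall` pulls it back to the closest stored pattern, which is exactly the content-addressable behaviour described above.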
2.2 Bayesian Confidence Propagation
2.2.1 Using Neurons as probability estimators
The main idea underlying the BCPNN learning rule is to use neurons as proba-
bility estimators. The input and output unit activities represent probabilities. Each
neuron is used to estimate its probability of firing in a given context, i.e. know-
ing the information carried by the activities of the other neurons in the network.
Confidence propagation relies on the fact that the conditional probability P(yi|x)
of a given neuron yi firing given the context x is a better approximation than the a
priori probability P(yi). By updating units in this way, one propagates the confidence
of one unit to the other units in the network.
Figure 2.1: Using Neurons as probability estimators
The BCPNN learning rule is based on a probabilistic view of learning and re-
trieval: input and output unit activities represent, respectively, confidences
of feature detection (the input to unit i from unit j is a number between 0 and 1
representing the confidence that xj is part of the pattern) and posterior probabil-
ities of outcomes (the output of unit j is a number between 0 and 1 representing
the probability of outcome xj given the pattern context).
One drawback of using neurons as probability estimators is
that we have to separate the signal. Indeed, observing the absence of an
attribute in a given vector is different from not observing
this attribute at all. If we map one attribute to only one unit, the
BCPNN model will interpret zero input to this unit as an absence of information
about this attribute, and it will compute the a posteriori probabilities of the other units,
discarding the input from this unit. To solve this problem, we need to separate the
data, i.e. we need to create units for all possible values of an attribute. In the case
of binary units, this corresponds to having two units a and ā for attribute A. When
no observation is made on this attribute, the network will discard the input from both
of these units.
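The separation of one binary attribute into two complementary units can be sketched as follows; the function name and the encoding convention (a first, ā second) are illustrative choices.

```python
def encode_binary_attribute(value):
    """Split one binary attribute A into two units (a, a_bar).

    True  -> (1, 0): attribute observed present
    False -> (0, 1): attribute observed absent
    None  -> (0, 0): no observation, so the network discards both units
    """
    if value is None:
        return (0, 0)
    return (1, 0) if value else (0, 1)
```

The key point is the third case: zero input on both units now unambiguously means "no information", instead of being confused with an observed absence.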
2.2.2 Derivation of Network Architecture
The Bayesian Confidence Propagation Neural Network (BCPNN) has been de-
veloped gradually (Lansner and Ekeberg 1989, Lansner and Holst 1996, Sandberg
et al. 2002, Sandberg et al. 2003) [22, 23, 31, 32]. Starting from Bayes' theorem
(equation 2.1), we derive a network architecture, meaning that we identify the terms
in our mathematical formulae with weights ωij, biases βj, inputs xi and output unit
activities yj. The purpose of the learning phase is then to update weights and
biases so that their values fit those in the mathematical derivation of the network.
Depending on the complexity of the training set, the network architecture can
be a single-layer network (see Naive Bayes Classifier), a multi-layer network (see
Higher Order Model) or a fully-connected network (see Recurrent Network).
2.2.3 Bayesian-Hebbian Learning
The BCPNN learning rule derived in the next section uses Bayesian weights
and biases (equation 2.4). It exploits the statistical properties of the attributes in
the training set (frequencies of activation of one attribute xi and of co-activation of
two attributes xi and xj) in order to estimate the probabilities P(xi) and P(xi, xj)
used to update the weights and biases. It also shows Hebbian features, because
it reinforces connections between simultaneously active units, weakens connections
between units that are independent of one another, and makes connections between
anti-correlated units inhibitory.
As we shall see later, when applied to a recurrent attractor network,
it gives a symmetric weight matrix and allows for fixed-point attractor dynamics.
The update of the weights in the network resembles what has been proposed as
rules for biological synaptic plasticity (Wahlgren and Lansner 2001) [33].
2.3 Gradual Development of the BCPNN model
2.3.1 Naive Bayes Classifier
The Naive Bayes Classifier (NBC) aims to calculate the probabilities of the
attributes yj given a set x = (x1, x2, ..., xi, ..., xn) of observed attributes. Both are
assumed to be discrete (for now, we only consider binary inputs). The main assump-
tion in this case is the independence assumption, which states that the attributes xi
are independent, $P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i)$, and conditionally independent given
$y_j$, $P(x_1, \ldots, x_n \mid y_j) = \prod_{i=1}^{n} P(x_i \mid y_j)$.
CHAPTER 2. THE BCPNN MODEL
Bayes' theorem is given by the following equation for two random variables x and y:

$$P(y|x) = \frac{P(x|y)\,P(y)}{P(x)} \qquad (2.1)$$
Using this and the Independence Assumption, we can calculate the conditional
probability πj of the attribute yj given the observed attributes xi:

$$\pi_j = P(y_j|\mathbf{x}) = \frac{P(\mathbf{x}|y_j)\,P(y_j)}{P(\mathbf{x})}
= P(y_j)\prod_{i=1}^{n}\frac{P(x_i|y_j)}{P(x_i)}
= P(y_j)\prod_{i=1}^{n}\frac{P(x_i, y_j)}{P(x_i)\,P(y_j)}$$
Now, we assume that we only have partial knowledge of the attributes xi. We
are given completely known observations xi when i ∈ A ⊆ {1, ..., n} and have no
information at all about the attributes xk when k ∈ {1, ..., n}\A. Then, we get

$$\pi_j = P(y_j \mid x_i,\, i \in A) = P(y_j)\prod_{i \in A}\frac{P(x_i, y_j)}{P(x_i)\,P(y_j)}$$
Then, taking the logarithm of the last expression, we obtain:

$$\log(\pi_j) = \log(P(y_j)) + \sum_{i \in A}\log\frac{P(x_i, y_j)}{P(x_i)\,P(y_j)}
= \log(P(y_j)) + \sum_{i=1}^{n} o_i \log\frac{P(x_i, y_j)}{P(x_i)\,P(y_j)} \qquad (2.2)$$

where the indicator variable oi equals 1 if i ∈ A (which means that the ith attribute
xi is known) and equals 0 otherwise.
We finally end up with the following equation

$$\log(\pi_j) = \beta_j + \sum_{i=1}^{n}\omega_{ij}\, o_i \qquad (2.3)$$

with

$$\omega_{ij} = \log\frac{P(y_j, x_i)}{P(y_j)\,P(x_i)}, \qquad \beta_j = \log(P(y_j)) \qquad (2.4)$$

This can be implemented as a single-layer feedforward neural network, with input
layer activations oi, weights ωij and biases βj. In this way, the single-layer feedforward
neural network calculates the posterior probabilities πj given the input attributes,
using an exponential transfer function.
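For concreteness, the estimation of the Bayesian weights and biases (equation 2.4) and the inference step of equation 2.3 can be sketched in a few lines of Python. This is a minimal illustration, not code from the thesis: the helper names are our own, probabilities are estimated by relative frequencies, and a simple ε floor stands in for the more careful zero-counter conventions discussed in section 2.4.1.

```python
import math

def train_nbc(patterns, classes, eps=1e-6):
    """Estimate the Bayesian weights and biases of eq. 2.4 from a binary
    training set, using relative frequencies as probability estimates.
    The eps floor avoids taking the logarithm of zero."""
    Z = len(patterns)
    n, m = len(patterns[0]), len(classes[0])
    c_x = [sum(p[i] for p in patterns) for i in range(n)]
    c_y = [sum(c[j] for c in classes) for j in range(m)]
    c_xy = [[sum(p[i] * c[j] for p, c in zip(patterns, classes))
             for j in range(m)] for i in range(n)]
    beta = [math.log(max(c_y[j] / Z, eps)) for j in range(m)]
    w = [[math.log(max(c_xy[i][j] * Z, eps * Z) /
                   max(c_x[i] * c_y[j], eps * Z))
          for j in range(m)] for i in range(n)]
    return w, beta

def infer(o, w, beta):
    """Posterior pi_j = exp(beta_j + sum_i w_ij o_i)  (eqs. 2.2-2.3);
    o_i is the input activation (1 for an observed active attribute)."""
    return [math.exp(beta[j] + sum(w[i][j] * o[i] for i in range(len(o))))
            for j in range(len(beta))]

# Toy training set: attribute 0 marks class 0, attribute 1 marks class 1.
w, beta = train_nbc([[1, 0], [1, 0], [0, 1], [0, 1]],
                    [[1, 0], [1, 0], [0, 1], [0, 1]])
posterior = infer([1, 0], w, beta)
```

With this toy set, observing the first attribute assigns the higher posterior to the first class, and the weight between co-active units comes out positive (log 2), in line with the Hebbian character discussed below.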
The weights and biases given in equation 2.4 are called Bayesian weights.
We can point out the Hebbian character of these weights: ωij ≈ 0 when xi and
yj are independent (weak connection between independent units); ωij ≈ log(1/p) > 0
when the units xi and yj are strongly correlated, since in this case P(xi, yj) ≈
P(xi) ≈ P(yj) ≈ p > 0 (strong connection between simultaneously active units); and
ωij → −∞ when they are anti-correlated, because in this case P(xi, yj) → 0 (strong
inhibitory connection between anti-correlated units).
The bias term βi gives a measure of the intrinsic excitability of unit i, as we
shall see later in detail. We observe that βi → 0 when pi → 1, so that the bias term
has no effect on the computation when unit xi is strongly activated, and βi → −∞ when
pi → 0, thus muting the information carried by unit xi when it has seldom been
activated. This process is democratic in the sense that it gives more importance to
the units that have ‘a lot to say’ and shuts off the ones not taking part in pattern
activation, considered irrelevant for learning and inference.
2.3.2 Higher Order Bayesian Model
The problem encountered in practical applications is that the Independence
Assumption is often violated, because it is too restrictive. The standard way to deal
with this, as when facing a non-linearly separable training set, is to introduce a
hidden layer with an internal representation in which the classes are separable. Here,
we use a structure of the hidden layer consisting of feature detectors organized in
hypercolumns.
Starting from the previous model, we assume independence between all attributes
and conditional independence given yj:

$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i)
\quad \text{and} \quad
P(x_1, \ldots, x_n \mid y_j) = \prod_{i=1}^{n} P(x_i \mid y_j)$$
However, if two variables xi and xj are found not to be independent, they can
be merged into a joint variable xij, giving:

$$P(x_1, \ldots, x_n) = P(x_1)\cdots P(x_{ij})\cdots P(x_n)$$

and a similar method may be used for the conditional probabilities. This means
that in the network we get one unit for each combination of outcomes of the original
variables xi and xj. For example, if two groups of units corresponding to the primary
features $A = \{a, \bar{a}\}$ and $B = \{b, \bar{b}\}$ are not independent, we insert in their place a
group of complex units $AB = \{ab, a\bar{b}, \bar{a}b, \bar{a}\bar{b}\}$ making up a composite feature. The
hypercolumn structure thus formed produces a decorrelated representation, to which the
Bayesian model is applicable.
We note that all formulae above are unchanged. We have just introduced a
hidden layer that increases internal computation but the external environment is
unchanged. The structure of our network now resembles the structure in figure 2.2.
This process relies on a measure of independence of the attributes xi of an input
pattern x. A partially heuristic method (Lansner and Holst 1996) [23] is to merge
two columns if the measure of correlation (like the mutual information) between
them is high:

$$I_{ij} = \sum_{x_i \in X_i,\; x_j \in X_j} P(x_i, x_j)\,
\log\frac{P(x_i, x_j)}{P(x_i)\,P(x_j)} \qquad (2.5)$$
A major drawback of this method is that the number of units increases expo-
nentially with their order, i.e. how many input attributes they combine (Lansner
and Holst 1996, Holst 1997) [23, 15].
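The merging criterion of equation 2.5 can be estimated directly from the training set. The sketch below is our own illustration (the helper name `mutual_information` and the tuple representation of samples are assumptions, not from the thesis): it computes the empirical mutual information between two attributes from occurrence and co-occurrence counts.

```python
import math

def mutual_information(samples, i, j):
    """Empirical mutual information I_ij (eq. 2.5) between attributes
    i and j, estimated from occurrence and co-occurrence counts."""
    n = len(samples)
    ci, cj, cij = {}, {}, {}
    for s in samples:
        ci[s[i]] = ci.get(s[i], 0) + 1
        cj[s[j]] = cj.get(s[j], 0) + 1
        cij[(s[i], s[j])] = cij.get((s[i], s[j]), 0) + 1
    # sum over the observed outcome pairs; unseen pairs contribute zero
    return sum((c / n) * math.log((c / n) / ((ci[a] / n) * (cj[b] / n)))
               for (a, b), c in cij.items())

dependent = [(0, 0), (1, 1)] * 5                 # attributes always agree
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]   # all combinations equally likely
```

Two columns whose mutual information exceeds a chosen threshold would then be merged into one hypercolumn of joint units, as described above.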
Figure 2.2: Architecture of the BCPNN with a hidden unit for internal decorrelated
representation
2.3.3 Graded units
Thus far, we have only considered binary inputs. However, it is also valuable
for the network to handle graded input: for instance, if an attribute is unknown
or its value uncertain, graded input would be a ‘confidence’ value between 0
(no) and 1 (yes). This cannot be coded directly as a graded input activity between
zero and one, because this would be interpreted as a probability in the BCPNN.
Thus we use a kind of soft interval coding to a set of graded values.
Suppose that each attribute i can take $M_i$ different values; $x_{ii'}$ is then a
binary variable describing whether the ith attribute takes its i'th value
($x_{ii'} = 1 \Leftrightarrow x_i = i'$). Making the necessary relabellings in the previous formulae, we
get

$$\pi_{jj'} = P(y_{jj'} \mid x_{ik}) = P(y_{jj'})
\prod_{i=1}^{n} \frac{P(x_{ik}, y_{jj'})}{P(x_{ik})\,P(y_{jj'})}$$

where for each attribute $i \in \{1, \ldots, n\}$ a unique value $x_{ik}$ is known, with $k \in
\{1, \ldots, M_i\}$. Similarly it follows that

$$\pi_{jj'} = P(y_{jj'}) \prod_{i=1}^{n} \prod_{i'=1}^{M_i}
\left(\frac{P(x_{ii'}, y_{jj'})}{P(x_{ii'})\,P(y_{jj'})}\right)^{o_{ii'}}$$

with indicators $o_{ii'} = 1$ if $i' = k$ and zero otherwise. $o_{ii'}$ can be seen as a
degenerate probability $o_{X_i}(x_{ii'}) = \delta_{x_{ik}}(x_{ii'}) = P_{X_i}(x_{ii'})$ of the stochastic variable
$X_i$, which is zero for all $x_{ii'}$ except for the known value $x_{ik}$ (Sandberg et al. 2002)
[31].
Taking the logarithm of the previous expression leads to

$$\log(\pi_{jj'}) = \log(P(y_{jj'})) + \sum_{i=1}^{n}
\log\left(\sum_{i'=1}^{M_i} \frac{P(x_{ii'}, y_{jj'})}{P(x_{ii'})\,P(y_{jj'})}\; o_{ii'}\right) \qquad (2.6)$$
The corresponding network now has a modular structure. The units ii' in the
network, where $i' \in \{1, \ldots, M_i\}$, explicitly representing the values $x_{ii'}$ of $X_i$, may be
viewed as a hypercolumn, as discussed above. By definition, the units of a hypercolumn
i have a normalized total activity $\sum_{i'=1}^{M_i} o_{ii'} = 1$ (the variable xi can only take
one value k at a time).
Transforming these equations to the network setting yields

$$h_{jj'} = \beta_{jj'} + \sum_{i=1}^{n} \log\left(\sum_{i'=1}^{M_i} \omega_{ii'jj'}\; o_{ii'}\right) \qquad (2.7)$$

with

$$\omega_{ii'jj'} = \frac{P(y_{jj'}, x_{ii'})}{P(y_{jj'})\,P(x_{ii'})}, \qquad
\beta_{jj'} = \log(P(y_{jj'})) \qquad (2.8)$$

where $h_{jj'}$ is the support of unit jj', $\beta_{jj'}$ is the bias term and $\omega_{ii'jj'}$ is the weight.
$\pi_{jj'} = f(h_{jj'}) = \exp(h_{jj'})$ can be identified as the output of unit jj', representing the
confidence (heuristic or approximate probability) that attribute j has value j' given
the current context. We also need to normalize the output within each hypercolumn:

$$\hat{\pi}_{jj'} = \frac{\exp(h_{jj'})}{\sum_{j'} \exp(h_{jj'})}$$
Figure 2.3: Architecture of the BCPNN with a hidden unit and an additive summation layer for graded input handling
Figure 2.3 shows a ‘pi-sigma network’, able to handle graded input. The notion
of a support unit is used to update the units simultaneously rather than one by one:
the calculations are first stored in the support units of all units, and the transfer
function is then applied to update all units at once.
2.3.4 Recurrent Network
Now, because both the input $o_{ii'}$ and the output $\hat{\pi}_{jj'}$ of the network represent
probabilities, we can feed the output back into the network as input, creating a fully
recurrent network architecture, which can work as an autoassociative memory. The
currently observed probability $o_{ii'} = P_{X_i}(x_{ii'})$ is used as an initial approximation of
the true probability of $X_i$ and used to calculate a posterior probability, using the
learning parameters $\beta_{jj'}$ and $\omega_{ii'jj'}$, which tends to be a better approximation. This
is then fed back and the process is iterated until a consistent state is reached, which
is guaranteed because the weight matrix is symmetric. The reader should note that
we have now incorporated the $y_{jj'}$ among the $x_{ii'}$, thus dropping the distinction
between input and output units.
In the recurrent network, activations can be updated either discretely or continuously.
In the discrete case, $\hat{\pi}_{jj'}(t + 1)$ is calculated from $\hat{\pi}_{ii'}(t)$, or equivalently,
$h_{jj'}(t + 1)$ from $h_{ii'}(t)$, using one iteration of the update rule

$$h_{jj'}(t+1) = \beta_{jj'} + \sum_{i=1}^{n}
\log\left(\sum_{i'=1}^{M_i} \omega_{ii'jj'}\, f(h_{ii'}(t))\right) \qquad (2.9)$$
In the continuous case $h_{jj'}(t)$ is updated according to a differential equation,
making the approach towards an attractor state continuous:

$$\tau_c \frac{dh_{jj'}}{dt} = \beta_{jj'} + \sum_{i=1}^{n}
\log\left(\sum_{i'=1}^{M_i} \omega_{ii'jj'}\, f(h_{ii'}(t))\right) - h_{jj'}(t) \qquad (2.10)$$

where τc is the ‘membrane time constant’ of each unit. Input to the network is
introduced by clamping the activation of the relevant units (representing known
events or attributes). As the network is updated, the activation spreads, creating
the a posteriori beliefs of the other attribute values.
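The discrete relaxation of equation 2.9 can be sketched as follows. This is a minimal sketch rather than the thesis' implementation: the helper names `relax` and `softmax` are ours, input is introduced by clamping whole hypercolumns, and for simplicity the sum over hypercolumns skips a unit's own column.

```python
import math

def softmax(hs):
    """Normalised transfer function within one hypercolumn."""
    m = max(hs)
    e = [math.exp(v - m) for v in hs]
    s = sum(e)
    return [v / s for v in e]

def relax(pi0, w, beta, clamped, n_steps=20, eps=1e-6):
    """Iterate the discrete update rule (eq. 2.9).  pi0[i][ip] is the
    initial activity of unit ip in hypercolumn i, w[i][ip][j][jp] the
    pi-sigma weight, and clamped[i] marks columns held fixed as input."""
    n = len(pi0)
    pi = [col[:] for col in pi0]
    for _ in range(n_steps):
        h = [[beta[j][jp] + sum(
                 math.log(max(sum(w[i][ip][j][jp] * pi[i][ip]
                              for ip in range(len(pi[i]))), eps))
                 for i in range(n) if i != j)
              for jp in range(len(pi[j]))] for j in range(n)]
        pi = [pi0[j] if clamped[j] else softmax(h[j]) for j in range(n)]
    return pi

# Two hypercolumns of two units each; stored patterns (a, c) and (b, d)
# with probability 1/2 give pi-sigma weights 2 between correlated units
# and ~0 between anti-correlated ones (eq. 2.8).
eps = 1e-6
w = [[[[0.0] * 2 for _ in range(2)] for _ in range(2)] for _ in range(2)]
w[0][0][1][0] = w[0][1][1][1] = 2.0   # a-c and b-d connections
w[0][0][1][1] = w[0][1][1][0] = eps
w[1][0][0][0] = w[1][1][0][1] = 2.0   # symmetric back-connections
w[1][0][0][1] = w[1][1][0][0] = eps
beta = [[math.log(0.5)] * 2 for _ in range(2)]

# Clamp hypercolumn 0 to unit a; hypercolumn 1 starts undecided.
final = relax([[1.0, 0.0], [0.5, 0.5]], w, beta, clamped=[True, False])
```

Starting from an undecided second hypercolumn, the relaxation completes the stored pattern (a, c), illustrating the autoassociative behaviour described above.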
2.4 BCPNN Learning Implementations
2.4.1 Counter Model
This model has been developed and described in (Lansner and Ekeberg 1989) [22].
The purpose is to collect statistics of unit activity and co-activity of pairs of units,
in order to estimate the probabilities P(xi) and joint probabilities P(xi, xj) used to
calculate the ωij and βj values. An input pattern consists of a stimulus strength in
the range [0, 1] for each unit in the network. Here, the network is entirely ‘stimulus-driven’
during learning; otherwise the network would first interpret the input and
then learn its own interpretation, which is to be avoided. This allows a reduction
in computing time during learning, because no time is used to infer from the data
(no internal computation).
The basic idea behind the counter model is to estimate the probabilities P(xi),
P(xj) and P(xi, xj) by counting occurrences and co-occurrences in the training set.
With an estimate of the form $p = c/Z$, we obtain

$$\beta_i = \log(P(x_i)) = \log\frac{c_i}{Z}
\quad \text{and} \quad
\omega_{ij} = \log\frac{P(x_i, x_j)}{P(x_i)\,P(x_j)} = \log\frac{c_{ij}\, Z}{c_i\, c_j} \qquad (2.11)$$

where

$$Z = \sum_{\alpha} \kappa(\alpha), \qquad
c_i = \sum_{\alpha} \kappa(\alpha)\,\pi_i, \qquad
c_{ij} = \sum_{\alpha} \kappa(\alpha)\,\pi_i \pi_j \qquad (2.12)$$
Here, πi is the output of unit i (for the pattern indexed by α), α is an index over
the patterns in the training set, and κ is the significance attributed to a certain
learning event. It provides a mechanism for over-representing subjectively important
learning examples and ignoring unimportant ones. This technique is similar to boosting
as used in classification, i.e. the over-representation of hard examples in order to
increase the accuracy of the classifier. Special care has to be taken when counters
come out as zero. In the case when ci or cj is zero, ωij is also set to zero. If ci and cj
are both non-zero but cij is zero, ωij is set to a large negative value, log(1/Z). The
same is done for βi when ci is zero.
The counter model provides a simple and fast implementation of BCPNN learning,
but when the maximum capacity of the network is reached, catastrophic forgetting
occurs (i.e. all memories are lost when the system is over-loaded).
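A sketch of the counter model (equations 2.11 and 2.12), including the per-pattern significance κ(α) and the zero-counter conventions described above; the helper name `counter_model` is ours, and the toy data are illustrative only.

```python
import math

def counter_model(patterns, kappa=None):
    """Counter-model estimates of the biases and weights (eqs. 2.11-2.12).
    patterns: list of activity vectors pi in [0, 1];
    kappa: optional significance weight per training pattern."""
    n = len(patterns[0])
    if kappa is None:
        kappa = [1.0] * len(patterns)
    Z = sum(kappa)
    c = [sum(k * p[i] for k, p in zip(kappa, patterns)) for i in range(n)]
    cc = [[sum(k * p[i] * p[j] for k, p in zip(kappa, patterns))
           for j in range(n)] for i in range(n)]
    # zero-counter conventions as described in the text above
    beta = [math.log(c[i] / Z) if c[i] > 0 else math.log(1 / Z)
            for i in range(n)]
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if c[i] == 0 or c[j] == 0:
                w[i][j] = 0.0                 # undefined -> no influence
            elif cc[i][j] == 0:
                w[i][j] = math.log(1 / Z)     # never co-active -> inhibitory
            else:
                w[i][j] = math.log(cc[i][j] * Z / (c[i] * c[j]))
    return w, beta

# Units 0 and 1 are always co-active; unit 2 is active in the other patterns.
w, beta = counter_model([[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]])
```

The resulting weight matrix is symmetric, with a positive weight between the co-active units and a negative one between the anti-correlated ones, as expected from the Hebbian interpretation of the Bayesian weights.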
2.4.2 Incremental Learning
In order to avoid catastrophic forgetting, incremental learning using exponentially
weighted running averages has been implemented (Sandberg et al. 2002, Sandberg et
al. 2003) [31, 32]. The idea is to introduce intrinsic weight decay (forgetting) in
the network, so that the system never becomes over-loaded. A time constant α is
used to control the time-scale of this weight decay, allowing for short-term working
memory behaviour as well as for long-term memory.
A continuously operating network will need to learn incrementally during operation.
In order to achieve this, $P(x_{ii'})(t)$ and $P(x_{ii'}, x_{jj'})(t)$ need to be estimated
given the information $\{x(t'),\ t' < t\}$. The estimate should have the following
properties:
1. It should converge towards $P(x_{ii'})(t)$ and $P(x_{ii'}, x_{jj'})(t)$ in a stationary environment.
2. It should give more weight to recent than to remote information.
3. It should smooth or filter out noise and adapt to longer trends, in other words
to the lower frequency components of a non-stationary environment.
(1) is the prime constraint: our estimate needs to converge to these probabilities
because they are needed to compute the Bayesian weights and biases. (2) makes the
model operate as a ‘palimpsest memory’, meaning that recent memories constantly
overwrite old ones; thus a pattern has to be reviewed in order not to be forgotten. (3)
is a stability constraint in a non-stationary environment. The low-pass filtering
operation will be investigated again in Chapter 3.
The incremental Bayesian learning rule proposed here achieves this by approximating
$P(x_{ii'})(t)$ and $P(x_{ii'}, x_{jj'})(t)$ with the exponentially smoothed running averages
$\Lambda_{ii'}$ of the activity $\hat{\pi}_{ii'}$ and $\Lambda_{ii'jj'}$ of the coincident activity $\hat{\pi}_{ii'}\hat{\pi}_{jj'}$. The continuous-time
version of the update and learning rule takes the following form:

$$\tau_c \frac{dh_{ii'}(t)}{dt} = \beta_{ii'} + \sum_{j=1}^{n}
\log\left(\sum_{j'=1}^{M_j} \omega_{ii'jj'}(t)\, \hat{\pi}_{jj'}(t)\right) - h_{ii'}(t) \qquad (2.13)$$

$$\hat{\pi}_{ii'}(t) = \frac{\exp(h_{ii'})}{\sum_{i'} \exp(h_{ii'})} \qquad (2.14)$$

$$\frac{d\Lambda_{ii'}(t)}{dt} = \alpha\left(\left[(1 - \lambda_0)\,\hat{\pi}_{ii'}(t) + \lambda_0\right] - \Lambda_{ii'}(t)\right) \qquad (2.15)$$

$$\frac{d\Lambda_{ii'jj'}(t)}{dt} = \alpha\left(\left[(1 - \lambda_0^2)\,\hat{\pi}_{ii'}(t)\,\hat{\pi}_{jj'}(t) + \lambda_0^2\right] - \Lambda_{ii'jj'}(t)\right) \qquad (2.16)$$

$$\omega_{ii'jj'}(t) = \frac{\Lambda_{ii'jj'}(t)}{\Lambda_{ii'}(t)\,\Lambda_{jj'}(t)} \qquad (2.17)$$

$$\beta_{ii'}(t) = \log(\Lambda_{ii'}(t)) \qquad (2.18)$$
The above probability estimates converge towards the correct values given stationary
inputs and sufficiently large time constants. Since the weights of the network
depend more on recent than on old data, it appears likely that a Hopfield-like network
with the above learning rule would exhibit palimpsest properties.
Special care has to be taken to avoid logarithms of zero values (see Sandberg et al.
2002) [31]. In addition, the parameter α provides a means to control the temporal
dynamics of the learning phase (from short-term working memory to long-term
memory). It also allows us to switch off learning when the network needs to be used
in retrieval mode, allowing for changes in the network activity without corresponding
weight changes, because when α = 0 the running averages ‘freeze’ at their current
values.
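To illustrate equations 2.15–2.18, the sketch below integrates the running averages with forward Euler for two units driven by perfectly correlated on/off activity. The parameter values and the helper name `incremental_step` are our own choices, and equations 2.13–2.14 are skipped since the activities are imposed directly. With a long averaging time constant, the weight should settle near log 2, the Bayesian value for two units that are each active half of the time and always together.

```python
import math

def incremental_step(L_i, L_j, L_ij, pi_i, pi_j, alpha, lam0, dt=1.0):
    """One forward-Euler step of eqs. 2.15-2.16, followed by the weight
    and bias read-out of eqs. 2.17-2.18."""
    L_i += dt * alpha * (((1 - lam0) * pi_i + lam0) - L_i)
    L_j += dt * alpha * (((1 - lam0) * pi_j + lam0) - L_j)
    L_ij += dt * alpha * (((1 - lam0 ** 2) * pi_i * pi_j + lam0 ** 2) - L_ij)
    w = math.log(L_ij / (L_i * L_j))
    beta_i = math.log(L_i)
    return L_i, L_j, L_ij, w, beta_i

alpha, lam0 = 0.001, 0.01          # slow averaging, small probability floor
L_i = L_j = L_ij = 0.5             # arbitrary starting estimates
for t in range(20000):
    act = 1.0 if (t // 50) % 2 == 0 else 0.0   # both units on/off together
    L_i, L_j, L_ij, w, beta_i = incremental_step(
        L_i, L_j, L_ij, act, act, alpha, lam0)
```

With a smaller α the estimates track longer-term statistics; with a larger α the running averages follow recent activity more closely, giving the palimpsest behaviour discussed above.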
2.5 Performance Evaluation and Applications
Performance Evaluation
There are many criteria available to evaluate the performance of a model. Of
course, no model is better than the others on every level, nor is any designed for
every purpose. Nevertheless, in order to be accepted and developed in the future, a
model needs to possess some basic features: robustness, reasonable execution time
and stability are required for a model to be considered efficient. Here, we present the
main criteria we use to evaluate the BCPNN model.
Frequency of correct retrieval This is the most commonly used criterion to evaluate the
performance of the network. Feeding a list of input patterns to the network, we
want to know how well the network learns them, by counting the occurrences
of successfully completed patterns after learning. An important parameter
is the age of the pattern, because recent patterns tend to be retrieved more
accurately than old ones. The number of patterns, their complexity and their
time of presentation must be taken into account too.
Storage Capacity The storage capacity is the number of patterns that a network
can store. The capacity of the Hopfield network has been investigated in (Hopfield
1982) [16]. In our counter model, the capacity is fixed, thus it is susceptible to
catastrophic forgetting, whereas the incremental learner has a capacity dependent
on its spontaneous forgetting (short-term memories with fast weight-decay
dynamics are protected from catastrophic forgetting because capacity is
hardly ever reached, whereas long-term memories are more exposed to it).
Noise Tolerance In reality, patterns fed to the network are always a little noisy,
and it is important that the attractor dynamics of the network overcome
this. To test this, we feed distorted patterns to the network and count the
frequency of retrieval of the original ones. A special case is that of competing
interpretations, when a mixture of two stored patterns is fed to the network.
Convergence speed The convergence speed for the relaxation of the network is also
an important trait of our model. Inference has to be fast enough so that
testing patterns does not take too long and, on the other hand, it has to use
small enough timesteps not to skip any attractor state with a narrow
convergence domain. Convergence time increases substantially for distorted
and ambiguous patterns, because they are distant from the stable attractors in the
attractor space (Lansner and Ekeberg 1989) [22].
Applications
The domain of applications of the Bayesian Confidence Propagation Neural
Network is wide. Because of its statistically-based method of unsupervised learning,
it can be implemented in a series of different contexts. We present some of its
applications here.
Classification The BCPNN is primarily designed to evaluate probabilities from a set
of observed features or attributes, so it is natural to use the BCPNN
for classification tasks, which aim to label a pattern and assign it to a
corresponding class. The network architecture is single- or
multi-layered, depending on the complexity of the data set. The input units
correspond to the attributes, and the output units to the class units. BCPNN
and classification has been investigated exhaustively (Holst 1997) [15].
Content-addressable memory When used in a recurrent network architecture,
the BCPNN model performs quite well as a content-addressable memory. It
takes into account the statistical properties of the data and performs better
with patterns whose attributes can be considered independent, like pixel
grey-levels in an image, letters in a list of words or digits in a list of numbers.
The capacity has to be large enough to avoid memory overloading.
Because of its associative character, BCPNN memory networks can perform
pattern completion (restoring a pattern from only a sample of it) and pattern
rivalry (deciding between ambiguous patterns or a mixture of two existing ones).
Good examples of pattern rivalry are found in optical illusions and ambiguous
images.
Pharmacovigilance and Data Mining The BCPNN has been used for highlighting
drug-ADR pairs for clinical review in the WHO ADR database as
part of the routine signal detection process (Bate et al. 1998, Lindquist et al.
2000). The recurrent BCPNN has also been implemented as a tool for unsupervised
pattern recognition; it has been tested on theoretical data and shown
effective in finding known syndromes in the data reported on haloperidol in the
WHO database (Bate et al. 2001, Orre et al. 2003). More recently, Ahmed
et al. revisited Bayesian pharmacovigilance signal detection methods in a
multiple-comparison setting (Ahmed et al. 2009).
Chapter 3
A spiking BCPNN Learning Rule
In this chapter, we introduce the new ‘spiking’ version of the BCPNN learning
rule. We give its mathematical formulation and discuss its specific features and how
they account for biologically observed phenomena.
In order to have a mapping from the original BCPNN learning to the spiking
version of it, we need to match one descriptor of the activity of the biological neurons
to the input and output of the abstract units. The most natural choice seems to be
the frequency or rate of firing of one neuron. Thus the range [0, 1] of the units in
the non-spiking network will be mapped to a range [0, fmax] where fmax represents
the maximum firing frequency of one neuron.
3.1 Formulation
The version of the learning rule that we are going to implement in a spiking-neuron
context has the following form:

$$\frac{dz_i}{dt} = \frac{y_i - z_i}{\tau_i}, \qquad z_i^0 = \frac{1}{M_i} \qquad (3.1)$$

$$\frac{dz_j}{dt} = \frac{y_j - z_j}{\tau_j}, \qquad z_j^0 = \frac{1}{M_j} \qquad (3.2)$$

In this first stage of processing (equations 3.1 and 3.2), we filter the presynaptic
and postsynaptic variables yi and yj, which exhibit a ‘spiking-binary’ behaviour most
of the time, with low-pass filters of respective time constants τi and τj (note that
they can be different). The resulting variables zi and zj are called the primary synaptic
traces. Mi and Mj are the numbers of units in the pre-hypercolumn and the post-hypercolumn
respectively, and are only used in a network context. In single-synapse
learning, we set Mi = Mj = 10. The typical range of τi and τj is 5 to 20 ms.
$$\frac{de_i}{dt} = \frac{z_i - e_i}{\tau_e}, \qquad e_i^0 = \frac{1}{M_i} \qquad (3.3)$$

$$\frac{de_j}{dt} = \frac{z_j - e_j}{\tau_e}, \qquad e_j^0 = \frac{1}{M_j} \qquad (3.4)$$

$$\frac{de_{ij}}{dt} = \frac{z_i z_j - e_{ij}}{\tau_e}, \qquad e_{ij}^0 = \frac{1}{M_i M_j} \qquad (3.5)$$

In the second stage of processing (equations 3.3, 3.4 and 3.5), we filter the
primary traces zi and zj with a low-pass filter of time constant τe (note that it is the
same for the three equations). The typical range of τe is 100 to 1,000 ms. The resulting
variables ei, ej and eij are called the secondary synaptic traces. We note the introduction
of a secondary mutual trace eij, which keeps a trace of the mutual activity
of yi and yj and will later be used to compute P(xi, xj). Note that a mutual trace
is impossible to obtain at the first stage of processing, since the direct product yiyj is
zero most of the time. This is because yi and yj are ‘spiking’ variables and thus
equal zero except on the occurrence of a spike, so yiyj would be non-zero only when
yi and yj spike at the exact same time, which almost never happens.
$$\frac{dp_i}{dt} = \kappa\,\frac{e_i - p_i}{\tau_p}, \qquad p_i^0 = \frac{1}{M_i} \qquad (3.6)$$

$$\frac{dp_j}{dt} = \kappa\,\frac{e_j - p_j}{\tau_p}, \qquad p_j^0 = \frac{1}{M_j} \qquad (3.7)$$

$$\frac{dp_{ij}}{dt} = \kappa\,\frac{e_{ij} - p_{ij}}{\tau_p}, \qquad p_{ij}^0 = \frac{1}{M_i M_j} \qquad (3.8)$$

In the third and last stage of processing (equations 3.6, 3.7 and 3.8), we filter
the secondary traces ei, ej and eij with a low-pass filter of time constant τp (again the
same for the three equations). The typical range of τp is 1,000 to 10,000 ms. The
resulting variables pi, pj and pij are called the tertiary synaptic traces. We also note
the presence of a mutual tertiary trace pij that is a direct approximation of P(xi, xj).
$$\beta_i = \begin{cases} \log(\varepsilon) & \text{if } p_i < \varepsilon \\ \log(p_i) & \text{otherwise} \end{cases} \qquad (3.9)$$

$$\omega_{ij} = \begin{cases} \log(\varepsilon) & \text{if } \dfrac{p_{ij}}{p_i p_j} < \varepsilon \\[2ex] \log\left(\dfrac{p_{ij}}{p_i p_j}\right) & \text{otherwise} \end{cases} \qquad (3.10)$$
The equations for updating the weights and biases (equations 3.9 and 3.10)
are the classical Bayesian weight and bias equations. Note that these equations
change a little in the case of ‘pi-sigma’ higher-order networks with graded input
(equations 2.7 and 2.8); because we deal only with binary input, we keep these
equations unchanged. When pi takes a small value, it is set to a minimum value
ε in order to avoid a logarithm of zero. The same is done when pij/(pipj) becomes
too small. We also note the presence of the parameter κ. It is a global ‘print-now’
signal that regulates the update of the tertiary traces, while leaving unchanged the
internal structure of the network (primary and secondary traces). We will explain
its function in further detail later.
The spiking version of the BCPNN learning rule is the set of these 10 equations.
It relies on 3 stages of processing that perform the same operation (low-pass filtering)
with different temporal dynamics. The parameters that can be controlled are
the time constants τi, τj, τe and τp, the initial values of the traces and the print-now
signal κ.
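The full rule can be simulated directly with forward-Euler integration. The sketch below is our own illustration (spike trains are idealised as 1 ms binary pulses, dt = 1 ms, and the function name and parameter values are assumptions): it shows that repeated near-coincident pre- and post-activity drives the weight of equation 3.10 above the value obtained from uncorrelated trains.

```python
import math

def spiking_bcpnn(pre, post, T, tau_z=(10.0, 10.0), tau_e=200.0,
                  tau_p=2000.0, Mi=10, Mj=10, kappa=1.0, eps=1e-4):
    """Forward-Euler integration (dt = 1 ms) of the three-stage learning
    rule (eqs. 3.1-3.10).  pre/post are sets of spike times in ms;
    y = 1 during the millisecond of a spike, else 0."""
    zi, zj = 1.0 / Mi, 1.0 / Mj                          # primary traces
    ei, ej, eij = 1.0 / Mi, 1.0 / Mj, 1.0 / (Mi * Mj)    # secondary traces
    pi, pj, pij = ei, ej, eij                            # tertiary traces
    for t in range(T):
        yi = 1.0 if t in pre else 0.0
        yj = 1.0 if t in post else 0.0
        zi += (yi - zi) / tau_z[0]
        zj += (yj - zj) / tau_z[1]
        ei += (zi - ei) / tau_e
        ej += (zj - ej) / tau_e
        eij += (zi * zj - eij) / tau_e
        pi += kappa * (ei - pi) / tau_p
        pj += kappa * (ej - pj) / tau_p
        pij += kappa * (eij - pij) / tau_p
    beta = math.log(max(pi, eps))
    w = math.log(max(pij / (pi * pj), eps))
    return w, beta

# Correlated trains (post follows pre by 5 ms) versus uncorrelated trains
# (post offset by 50 ms, well outside the primary time constants).
pre = set(range(0, 5000, 100))
w_corr, beta_corr = spiking_bcpnn(pre, {t + 5 for t in pre}, 5000)
w_unc, _ = spiking_bcpnn(pre, {t + 50 for t in pre}, 5000)
```

The mutual trace pij, and with it the weight, grows faster when the pre and post spikes fall within the spike-timing window set by the primary time constants.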
3.2 Features
3.2.1 Synaptic traces as local state variables
The implementation of local synaptic state variables such as synaptic traces
in the above learning rule is a common approach in STDP learning rules [27, 25].
These variables are used to keep a trace or memory of presynaptic or postsynaptic
events such as the occurrence of a spike. In addition, low-pass filtering enables us to
manipulate continuous variables rather than ‘spiking variables’, which is problematic
when we want to estimate, for example, a joint probability P(xi, xj), since the
direct product of two spiking variables is likely to be zero, due to the ‘impulse’
nature of a spike. Indeed, a spike has a very short duration and is often described
as a discontinuous variable that is non-zero only on the occurrence of a spike.
Scaling these variables between 0 and 1 is very useful because it makes their
quantitative use easier. One can deal with different types of synaptic traces.
Additive trace The additive trace increments the local state variable x(t) by a
constant value A. The particularity of this trace is that it can exceed 1
when many events occur in a short time. It is implemented by the following
equation

$$\frac{dx}{dt} = -\frac{x}{\tau} + \sum_{t_s} A\,\delta(t - t_s)$$

where ts denotes the time of occurrence of a spike.

Saturated trace The saturated trace resets the local state variable x(t) to a fixed
value of 1. This trace always stays in the range [0, 1] and it keeps only
the history of the most recent spike, because it is invariably reset to 1 on the
occurrence of a spike. It is implemented by the following equation

$$\frac{dx}{dt} = -\frac{x}{\tau} + \sum_{t_s} (1 - x^-)\,\delta(t - t_s)$$

where ts denotes the time of occurrence of a spike and x− is the value of x just
before the occurrence of the spike.

Proportional trace Here, the local state variable x(t) is updated by a value proportional
to its deviation from 1. This trace always stays in the range [0, 1] and it
realizes a synthesis of the effects of the two traces above: it keeps the value
of x(t) close to 1 when many spikes occur in a short time, and it is easy to
evaluate the time of the last spike by looking at the exponential decay
at a time t. The proportional trace is the one we use later. It is implemented
by the following equation

$$\frac{dx}{dt} = -\frac{x}{\tau} + \sum_{t_s} k(1 - x^-)\,\delta(t - t_s)$$

with ts and x− as described above, and k the proportion of the update. Typically
we use k ∈ [0.5, 0.8]. Figure 3.1 shows the dynamics of the 3 different synaptic
trace types.
Figure 3.1: Different types of synaptic traces - The upper plot shows a
spike train and the lower plot displays the three different synaptic traces:
the black, blue and red curves correspond respectively to the additive, saturated
and proportional traces
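The three trace types can be compared in a small Euler simulation (our own sketch; dt = 1 ms, and the parameter values A = 0.4 and k = 0.65 are illustrative choices within the ranges discussed above):

```python
def step_traces(x_add, x_sat, x_prop, spike, tau=10.0, A=0.4, k=0.65):
    """One Euler step (dt = 1 ms) of the additive, saturated and
    proportional traces; `spike` is True during a spike time step."""
    x_add -= x_add / tau
    x_sat -= x_sat / tau
    x_prop -= x_prop / tau
    if spike:
        x_add += A                    # additive: may exceed 1
        x_sat += 1.0 - x_sat          # saturated: reset to 1
        x_prop += k * (1.0 - x_prop)  # proportional: partial step towards 1
    return x_add, x_sat, x_prop

# A burst of four spikes 2 ms apart, then silence.
xa = xs = xp = 0.0
max_a = max_s = max_p = 0.0
for t in range(20):
    xa, xs, xp = step_traces(xa, xs, xp, spike=t in (0, 2, 4, 6))
    max_a, max_s, max_p = max(max_a, xa), max(max_s, xs), max(max_p, xp)
```

During the burst, the additive trace climbs above 1, the saturated trace is pinned at 1, and the proportional trace approaches but never reaches 1, reproducing the behaviour shown in figure 3.1.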
3.2.2 Spike-timing Dependence
The first stage of processing of our learning rule (equations 3.1 and 3.2)
creates the primary synaptic traces. These variables, with very fast
dynamics, are used as recorders of the spikes: on the occurrence of a spike they are
set to a certain value (since we use proportional traces, this value is proportional to
the deviation between 1 and the value of the synaptic trace just before the spike)
and then decay exponentially until another spike occurs. Proportional traces convey two
pieces of information: the history of the last spike, by looking at the current decay (if
the last spike occurred recently, the trace is steep and decays fast), and the global history
of past events (when numerous spikes occur in a short period of time, the trace
value comes close to 1).
The dynamics of the primary traces zi and zj are controlled by the time constants
τi and τj. Since these constants can be different, pre-post timing can be promoted
over post-pre timing, or the other way around. For instance, if we set τi = 20 ms
and τj = 1 ms, then zj will decay much faster than zi. Then, if a postsynaptic spike
occurs 10 ms after a presynaptic spike, the product zizj will be non-zero shortly
after the occurrence of the postsynaptic spike. On the other hand, if a presynaptic
spike occurs 10 ms after a postsynaptic spike, the product zizj will still be
zero because of the fast decay of zj. By setting τj to a small value compared to τi,
we have given priority to pre-post timing (see figure 3.2).
The values of these two time constants define a spike-timing window (see
Bi and Poo 1998 [6]), whose width and symmetry can be controlled by manipulating
these constants.
Figure 3.2: Different effects of pre-post and post-pre timing on the primary
synaptic traces - The upper plot corresponds to a regular post-pre-post spike
train. Since the primary traces have different time constants (τi = 50 ms and τj = 5 ms),
pre-post timing is promoted over post-pre timing: the resulting product zizj
(not displayed here) is much bigger after pre-post timing than after post-pre
timing.
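This asymmetry can be checked numerically. The sketch below is our own code, using saturated traces for simplicity and the time constants of figure 3.2 (τi = 50 ms, τj = 5 ms); it compares the peak of the product zizj for the two spike orders:

```python
def trace_product_peak(pre_t, post_t, tau_i=50.0, tau_j=5.0, T=60):
    """Peak of z_i * z_j for one presynaptic spike at pre_t and one
    postsynaptic spike at post_t (Euler steps of 1 ms; each trace is
    reset to 1 on its spike)."""
    zi = zj = 0.0
    peak = 0.0
    for t in range(T):
        zi -= zi / tau_i
        zj -= zj / tau_j
        if t == pre_t:
            zi = 1.0
        if t == post_t:
            zj = 1.0
        peak = max(peak, zi * zj)
    return peak

pre_post = trace_product_peak(pre_t=10, post_t=20)   # pre 10 ms before post
post_pre = trace_product_peak(pre_t=20, post_t=10)   # post 10 ms before pre
```

With these constants, the pre-post ordering yields a product several times larger than the reverse ordering, which is what biases the mutual trace eij towards pre-post events.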
3.2.3 Delayed-Reward Learning
It can be a little puzzling to realize that our learning rule has three stages
of processing while we always perform the same operation (low-pass
filtering). However, these three filtering procedures fulfil three very specific and
different tasks. As observed in previous models (Bi and Poo 1998, Rubin et al. 2005,
Morrison et al. 2008, Mayr et al. 2009) [6, 30, 27, 25], the exact spike timing between
presynaptic and postsynaptic spikes plays a crucial role in LTP. Moreover, a time window
of 20 ms before and after a postsynaptic spike seems to exist, such that no long-lasting
change occurs if the delay between spikes is greater than 20 ms.
However, the activity in the network needs to be long-lasting and to reverberate
on a much greater time-scale. In the context of delayed-reward learning [28] and
reinforcement learning, the reward, which triggers the induction of LTP, occurs with
a delay on a time-scale of hundreds of milliseconds to seconds. Worse, this delay
is not predictable, so one cannot know when the reward and the actual learning
will take place. In order to solve this problem, we include secondary traces that
extend the reverberation of activity in the network.
Then, when a spike occurs, activity is recorded in the primary and secondary
traces. After a few hundred milliseconds, the activity has disappeared from the primary
traces, but is still reverberating in the secondary traces ei, ej and eij (equations 3.3,
3.4 and 3.5). Thus, if the print-now signal, representing the reward, is set to 1, the
secondary traces still convey the information and learning can take place.
Figure 3.3: Temporal dynamics of the different synaptic traces - Thin curves
correspond to the primary traces, thicker curves to the secondary ones and bold
curves to the tertiary traces. Blue corresponds to presynaptic traces, red to
postsynaptic ones and black to mutual traces. The temporal dynamics are slowest
for the tertiary traces, which build up and decrease slowly. The combination of
these three levels of processing enables us to achieve different goals.
It is important to stress that both of these traces are required if we want to
account for the following phenomena: the existence of a spike-timing window on
the order of tens of milliseconds (about 20 ms for spike delays) outside of which
no significant weight change takes place, and the fact that the reward enhancing
the learning process comes with a delay on a time-scale of hundreds of milliseconds.
As we will see later, there are biological equivalents to this print-now signal and to
the delayed synaptic traces.
Figure 3.3 shows the temporal dynamics of the primary, secondary and tertiary
traces for a pattern stimulation followed by no activity.
3.2.4 Long-term Memory
Finally, the third stage of processing (equations 3.6, 3.7 and 3.8) computes
synaptic state variables that have much slower dynamics. Typically, the pi, pj and
pij account for long-term memory, meaning that they store events that have been
repeated on a set of actions and experiments.
We assume that our learning rule operates in the context of delayed reward
learning and take the example of an animal, a rat for instance, presented with
several buttons that open doors, behind which some food (reward) is present or ab-
sent. The primary traces' activities with fast dynamics record the precise spike
timing when activity spreads in the network as a consequence of taking actions (stim-
ulus, button pressing). The secondary traces account for the delayed arrival of
the reward, which comes as a delayed result of action-taking. If the rat accesses the
reward, then the 'print-now signal' is set to 1 and long-term memory is triggered.
The tertiary traces are activated when the delayed reward has been obtained several
times and the stimulus has been reinforced. This means that pi, pj and pij build
up when the activities of the secondary traces have repeatedly been above a certain
baseline. Then, reinforcement occurs and memories can be stored.
It may seem singular, however, that the print-now signal κ shows up at this stage of
processing. It could equally have appeared in equations 3.3, 3.4 and 3.5, but
the biological equivalent of the print-now signal suggests that the metabolic changes
occur even if it is not activated, whereas the weights are overwritten only if the
print-now signal is active. Thus, it makes more sense for it to appear right before the
weight update.
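This gating can be made concrete with a small sketch. The update below is a plausible first-order form consistent with the description above, written in Python for illustration; it is not the thesis' exact equations 3.6 to 3.8, and the variable names are assumptions.

```python
def update_p_trace(p, e, kappa, tau_p, dt):
    """One Euler step of a tertiary (long-term) trace: p relaxes toward
    the secondary trace e only while the print-now signal kappa is
    active, so long-term storage is gated by the (delayed) reward.
    Plausible form consistent with the text, not the thesis code."""
    return p + kappa * (e - p) / tau_p * dt
```

With kappa = 0 the long-term trace is frozen, so no lasting change is consolidated; with kappa = 1 it slowly tracks the secondary trace on the time-scale tau_p.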
3.2.5 Probabilistic features
It is important to keep in mind that our spiking version of the BCPNN learning
rule is not just another implementation of a pair-based STDP learning rule. Indeed,
the state variables that we calculate represent probabilities and their values have
an intrinsic meaning of their own. This is the main reason why feeding graded
input to the network is not trivial: the network interprets activities as
probabilities. As discussed previously, input to the units represents the confidence
of feature detection and the output represents the posterior probability of outcome.
In the original counter model, P(xi) and P(xi, xj) were quite easy to approxi-
mate by counting occurrences and co-occurrences of the features within the training
set. Due to the spiking structure of the input variables yi and yj, it is a bit trickier
to evaluate the probabilities P(xi) and P(xi, xj). The use of synaptic traces allows
us to create mutual traces eij and pij that convey the information about correlation
between spikes.
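The counter-model estimate mentioned above can be sketched as follows. This is an illustrative Python version (the thesis uses MATLAB), using the standard BCPNN log-ratio expression for the weight and log-probability for the bias; the eps floor is an assumption to avoid log(0).

```python
import math

def bcpnn_counts_to_weight(n_i, n_j, n_ij, n_total, eps=1e-6):
    """Counter-model estimate of a BCPNN weight and bias: probabilities
    are approximated by relative frequencies of occurrences and
    co-occurrences within the training set (a sketch)."""
    p_i = max(n_i / n_total, eps)
    p_j = max(n_j / n_total, eps)
    p_ij = max(n_ij / n_total, eps)
    w_ij = math.log(p_ij / (p_i * p_j))   # zero for independent features
    beta_j = math.log(p_j)
    return w_ij, beta_j

# Independent features give a weight near zero; co-active ones a positive weight.
w_indep, _ = bcpnn_counts_to_weight(50, 50, 25, 100)  # P(i,j) = P(i)P(j)
w_corr, _ = bcpnn_counts_to_weight(50, 50, 50, 100)   # always co-active
```

The spiking rule replaces these counts with the time-averaged traces pi, pj and pij, but the weight and bias expressions keep the same probabilistic interpretation.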
3.3 Biological relevance
This new version of the BCPNN learning rule shows biological relevance on
several levels. The first is the use of synaptic traces, which are thought to
have a direct biological meaning. For instance, when a presynaptic spike arrives
at a synapse, there is a quantified release of neurotransmitters. Depending on the
nature of the synapse, the additive trace or the saturated trace might be used: the
first when the amount of transmitters is small compared to the synapse size, so that
the occurrence of a new spike has an additive effect because enough free receptors
are available for synaptic transmission, and the second when the quantity of neuro-
transmitters released reaches the maximum capacity of the synapse, which means
that the synapse saturates all of its available receptors on the occurrence of each
presynaptic spike.
Another direct equivalent is the 'print-now signal', which can be seen as the
concentration of a memory modulator such as dopamine, thought to have a direct
enhancing effect on learning and memory when present in high quantities. The
delayed-reward mechanism indeed has a direct biological relevance and has been
observed experimentally (Potjans et al. 2009) [28].
As explained before, the mixture of variables with slow and fast temporal dynam-
ics makes sense and fits what has been observed. The concentration of calcium
ions at the postsynaptic site is thought to play a key role in synaptic plasticity [30],
with much faster dynamics than the protein synthesis governing the transition from
early-LTP to late-LTP [9].
Clopath et al. [9] present a model to account for the transition from early to late-
LTP, containing three different phases of Tag-Trigger-Consolidation. A synapse can
be in one of the three following states: untagged, tagged for LTP (high state) or
tagged for LTD (low state), depending on presynaptic and postsynaptic events. If
the total number of tagged synapses exceeds a threshold, a trigger process occurs
and opens the way for consolidation (long-lasting changes in synaptic efficacy). What
is similar in our model is the three different temporal dynamics. The secondary
mutual trace eij can be seen as an equivalent to a tagging procedure: if its value
stays above a threshold for a long enough time, then metabolic changes, such as
specific protein synthesis, occur, allowing for conversion from working memory to
long-term permanent memory.
Chapter 4
Abstract Units Implementation
In the next two chapters, we present different implementations of the spiking
version of the BCPNN learning rule presented previously. The first implementation
consists of abstract units in MATLAB and serves as a gateway towards spiking
neuron models in NEURON. For each model, we explain how we present patterns
to the cells, implement the learning rule and use the model in retrieval mode.
Due to its ability to handle vectors and matrices, MATLAB serves as a con-
venient computational tool for building artificial neural networks. The built-in
functions allow a great variety of 2D and 3D graphic display. One can also import
data computed elsewhere into MATLAB and process it as desired.
But MATLAB loses much of its computational power when it has to process data
procedurally, which is the case for our differential equations. In our learning rule,
we have to update and compute multiple variables at each time step, because we
deal with three sets of first-order linear differential equations (equations 3.1 to 3.8).
Since these computations cannot be gathered in a matrix and all treated in batch
fashion, MATLAB is structurally inefficient for our task.
However, we can use it for single-synapse learning (only two units: one presy-
naptic and one postsynaptic) on reasonable time-scales (between 1,000 ms and
10,000 ms) and exploit its graphical display facilities, which is why we first
implemented our learning rule in MATLAB. The aim is qualitative: displaying
weights and biases corresponding to different input patterns and giving an insight
into the synapse's internal dynamics (time-courses of the primary, secondary and
tertiary traces).
4.1 Pattern presentation
In this section, we explain how we presented patterns to the units; in other
words, how input is fed to the network. We have three ways to present patterns:
non-spiking, frequency-based spiking and Poisson-generated spiking. It is to be
noted that throughout the following chapters we focus on single-synapse learning,
meaning that we deal with two units (presynaptic and postsynaptic) connected by
a single synapse.
4.1.1 Non-spiking Pattern Presentation
As a starting point of our investigations and a reference for our further results,
we will test our learning rule by feeding patterns in a process similar to what has
been done before with non-spiking units (Sandberg et al. 2002) [31]. To achieve
this, we clamp the input to the presynaptic and postsynaptic units yi and yj to the
respective values ki and kj during a presentation time of some tens of
milliseconds. The values ki and kj can take either binary values or a continuous value
in the range [0, 1] (graded input). Patterns are fed to the network sequentially.
For instance, if the set of input patterns we want to learn is {(1, 1), (0, 0), (1, 1), (0, 1), (1, 1)},
then yi will be clamped to the set of values (1, 0, 1, 0, 1) and yj will be
clamped to (1, 0, 1, 1, 1). The input variables yi and yj are 'stepped' and disconti-
nuous (see Figure 4.1a). Hence, abstract units are artificial, because no biologically
observed variable takes constant values or exhibits such a discontinuous time-course.
The time of presentation is important because it needs to be long enough for
the primary traces to retain pattern activities (the longer the pattern is seen, the
stronger the memory), but it is also valuable to impose some resting time between
patterns. Indeed, during each pattern presentation, the network needs to adapt to
it and rearrange its internal structure. In addition, between patterns, it needs to
rest for a short while, so that the internal variables with fast dynamics return to their
baseline. An analogy is that when we learn different things, we always need some
adaptation to jump from one thing to another. We will expand on this in the
Discussion section.
On the other hand, when we want to teach a concept to our network through a
temporal series of patterns, the time-scale of the learning phase needs to be smaller
than the dynamics of the long-term memory traces pi, pj and pij; otherwise the
synapse forgets what has been fed to it in the past. If the long-term memory time-
constant τp equals 1 second, then after 5 seconds past events will be discarded. So,
in this case, it does not make sense to have a learning procedure that takes longer
than 5 seconds. In a nutshell, learning procedures should not outlast the forgetting
time of our long-term memory.
In MATLAB, the function generate_pattern_nonspiking generates a driving
input x(t) from a series of parameters: delay, the resting time between pattern
presentations; dur, the duration of presentation of one pattern; T, the length of
the output; and pattern, a vector containing the values for the driving input x(t).
Figure 4.1a shows the input activity of an abstract unit fed with the pattern x =
[1, 0, 1, 0, 0.5, 1, 0.25, 0, 0.75].
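The stepped driving input can be sketched as follows. This is an illustrative Python analogue of the MATLAB helper described above, with assumed argument roles (the thesis code itself is not reproduced here); time is sampled in dt-millisecond bins and the rest value between patterns is taken to be 0.

```python
def generate_pattern_nonspiking(pattern, delay, dur, T, dt=1.0):
    """Build a stepped driving input x(t): hold each pattern value for
    `dur` ms, separated by `delay` ms of rest (x = 0), sampled every
    dt ms up to time T (a sketch of the MATLAB helper)."""
    n = int(T / dt)
    x = [0.0] * n
    t = 0.0
    for value in pattern:
        start = int(t / dt)
        stop = min(int((t + dur) / dt), n)
        for k in range(start, stop):
            x[k] = value
        t += dur + delay
    return x

x = generate_pattern_nonspiking([1, 0, 1, 0, 0.5], delay=50, dur=100, T=800)
```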
4.1.2 Spiking frequency-based Pattern Presentation
Because of its biological irrelevance, the previous pattern presentation scheme
is limited. This time, we try to mimic the 'spiking behaviour' of membrane voltage
observed in real experiments. Our spike generation in MATLAB is still artificial,
but we are making progress in imitating spiking behaviour. So, we build up
artificial spiking voltages by setting the input variable yi to 1 on the occurrence of
a spike and to zero otherwise. If ts denotes the time of occurrence of a spike for unit i,
then our input variable yi can be rewritten

yi(t) = Σ_{ts} δ(t − ts)
Pattern presentation to the input units is now based on their firing frequency
rather than on a fixed stepped value. The idea is to realise a linear mapping from
a value of xi between 0 and 1 (representing the confidence of feature detection
developed in previous chapters) to a frequency fi. To achieve this, the value 1 for
xi will be mapped to a maximum frequency fmax and other values between 0 and
1 to a directly proportional value in the range [0, fmax] (i.e. 0.5 will be mapped
to fmax/2, 0.25 will be mapped to fmax/4, and so on). By doing this, we have just
created an input filter that converts a graded input xi(t) between 0 and 1 into a
spiking time-dependent variable yi(t). We will later refer to the stepped value xi(t)
as the driving input and to yi(t) as the actual input activity, the first being used
only for pattern presentation and the latter to compute the synaptic traces, hence
the weights and biases.
An important feature of the frequency-based pattern presentation is that it
allows us to easily control the timing between presynaptic and postsynaptic spikes.
This offers an implementation possibility when we want to investigate the effects of
exact spike timing on the weight modification in our learning rule.
In MATLAB, the function generate_frequency_spiking generates an input
activity y(t) from a driving input x(t). The series of parameters is similar to the
previous section and includes a value fmax, which corresponds to the maximum
output frequency (when x(t) takes a value of 1). In order to generate spikes, we
discretize the time-scale into intervals of 1 millisecond: when a spike occurs at a
specific time t0, the value y(t0) is simply set to 1. Figure 4.1b shows the input
activity of an abstract unit fed with the pattern x = [1, 0, 1, 0, 0.5, 1, 0.25, 0, 0.75].
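The frequency-based generation can be sketched as below. This is an illustrative Python analogue of the MATLAB helper (names and the phase-handling details are assumptions): the driving value is mapped linearly to a firing frequency, and spikes are emitted at evenly spaced times in 1 ms bins.

```python
def generate_frequency_spiking(x, fmax, dt=1.0):
    """Regular, frequency-based spike generation: a driving value x(t)
    in [0, 1] maps linearly to f = x * fmax (Hz); y(t) is 1 in the dt-ms
    bin containing a spike and 0 otherwise (a sketch)."""
    y = [0.0] * len(x)
    next_spike = 0.0
    for k, xk in enumerate(x):
        f = xk * fmax
        if f <= 0:
            next_spike = k * dt  # no firing: keep the phase ready
            continue
        if k * dt >= next_spike:
            y[k] = 1.0
            next_spike = k * dt + 1000.0 / f  # inter-spike interval in ms
    return y

# x = 1 with fmax = 50 Hz gives one spike every 20 ms
y = generate_frequency_spiking([1.0] * 100, fmax=50.0)
```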
4.1.3 Spiking Poisson-generated Pattern Presentation
We make one more step in the direction of mimicking neural-like data by im-
plementing Poisson spike trains to feed the input units. In the cortex, the timing of
successive action potentials is highly irregular and we can view the irregular inter-
spike interval as a random process. This implies that an instantaneous estimate of
the spike rate can be obtained by averaging the pooled responses of many individ-
ual neurons, but the precise timing of individual spikes conveys little information.
The benefit of the Poisson process for spike generation is that it adds randomness and
discards the determinism in our simulation (each random seed will give different
spike trains). Thus, we focus on the parameters underlying this random process
rather than modeling precise coincidences of presynaptic and postsynaptic events.
Figure 4.1: Abstract Units Pattern Presentations corresponding to the pattern x =
[1, 0, 1, 0, 0.5, 1, 0.25, 0, 0.75] - (a) Non-Spiking Pattern Presentation; (b) Spiking
Frequency-based Pattern Presentation; (c) Spiking Poisson-generated Pattern
Presentation
We assume here that the generation of each spike depends only on an underlying
signal r(t) that we will refer to as an instantaneous firing rate. It follows that the
generation of each spike is independent of all the other spikes, which is called the
spike independence hypothesis. Moreover, we will make the assumption that the firing
rate r(t) is constant over time (actually r(t) is updated in steps, but for one pattern
we can suppose that r(t) = r). The Poisson process is then said to be homogeneous.
In a Poisson process, the probability that n events occur in ∆t with an instantaneous
rate r is given by the formula:

P({n spikes during ∆t}) = e^(−r∆t) (r∆t)^n / n!     (4.1)
By setting n = 0 and ∆t = τ, we obtain P({next spike occurs after τ}) = e^(−rτ) and
it follows that

P({next spike occurs before τ}) = 1 − e^(−rτ)     (4.2)
One way to implement a Poisson spike train is to use equation 4.2: we generate
a random number between 0 and 1, and the inter-spike interval is given by the value
of τ that satisfies the equality. The drawback of this method is that the spike
train has to be created sequentially. Alternatively, we can create a whole Poisson
spike train at once as follows.
The average spike count between t1 and t2 can be defined from the instantaneous
firing rate by n = ∫_{t1}^{t2} r(t) dt and, for sufficiently small intervals t1 = t − δt/2
and t2 = t + δt/2, the average spike count can be approximated by n = r(t)δt = rδt
under the homogeneous Poisson process hypothesis. Furthermore, when δt is small
enough, the average spike count equals the probability of the firing of a single spike:

P({one spike occurs during the interval (t − δt/2, t + δt/2)}) = rδt     (4.3)
Now, assuming δt is small enough (usually 1 ms), if we want to create a spike
train of arbitrary length T using 4.3, we need to generate T/δt random numbers pi
between 0 and 1. Then, if pi < rδt, we generate a spike at the time corresponding
to the index of pi; if not, no spike is generated.
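This bin-wise scheme can be sketched as follows. It is an illustrative Python analogue of the MATLAB helper introduced later in this section (names assumed); each 1 ms bin fires with probability r·δt, and a fixed seed makes the train reproducible.

```python
import random

def generate_poisson_spiking(x, fmax, dt_ms=1.0, seed=0):
    """Whole-train Poisson generation per equation 4.3: for each dt bin,
    draw a uniform number p and emit a spike when p < r * dt, with the
    rate r = x(t) * fmax. Same seed -> same train (a sketch)."""
    rng = random.Random(seed)
    y = []
    for xk in x:
        r = xk * fmax / 1000.0  # spikes per millisecond
        y.append(1.0 if rng.random() < r * dt_ms else 0.0)
    return y

y1 = generate_poisson_spiking([1.0] * 10000, fmax=50.0, seed=42)
y2 = generate_poisson_spiking([1.0] * 10000, fmax=50.0, seed=42)
# identical trains for the same seed; roughly 500 spikes expected in 10 s at 50 Hz
```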
The Poisson spike generation is an intermediate stage towards the NEURON
implementations. It allows us to account for random rate-based spike generation.
This is valuable, because the process is easy to implement and gives us an idea of
whether our model responds well to noisy or random data. Later, some noisy spike
trains may be added to our data so that it resembles what is observed in vivo.
In MATLAB, the function generate_poisson_spiking generates an input ac-
tivity y(t) from a driving input x(t). The series of parameters is similar to the
previous section and the rate r is set to the same value as the frequency fmax used
before. We stress the fact that the Poisson generation of spike trains is based on a
random process. Thus, each seed gives a different input activity y(t) for the same
driving input x(t). By setting the same seed in two runs, they become identical.
Figure 4.1c shows the input activity of an abstract unit fed with the pattern
x = [1, 0, 1, 0, 0.5, 1, 0.25, 0, 0.75].
4.2 Learning Rule Implementation
In order to solve the differential equations in MATLAB, we used the solver
ode45. Its use is quite straightforward, except that it relies on function handles,
which makes it tricky to control intrinsic equation parameters like the time-constant
τi or the print-now signal κ. If the implementation of the learning rule follows the
set of equations 3.1 to 3.10, a non-negligible phenomenon arises: spikes are modeled
by a sum of unit impulse functions in MATLAB and the solver is likely to miss
them, because at each time-step, in order to evaluate the derivative at a point, it
uses points in its neighbourhood. Not only are the spiking variables highly
discontinuous, but they are also zero most of the time, which prevents the solver
ode45 from detecting any activity.
A solution to this problem is to introduce a 'duration' δt for the spikes (typically
δt equals 1 to 2 milliseconds), so that the mathematical modelisation of a spike
switches from an impulse function to a rectangular pulse of width δt centered at ts.
But in that case, 1/τi is an upper bound for dzi/dt (see equation 3.1), which results
in a small increase of the primary trace zi(t). This propagates to the secondary
and to the tertiary traces, which, as a result, hardly exceed 0.001; this is highly
undesirable, because they are supposed to represent probabilities of activation.
To bypass these problems, we decided to split the set of equations 3.1 to 3.10
into two phases. First, we update the primary traces with the help of an auxiliary
function generate_primary_trace, which solves equation 4.4:

Zs = zi⁻(ts) + r(1 − zi⁻(ts))   if xi(ts) = 1
zi(t) = Zs e^(−(t−ts)/τi)     (4.4)

where ts records the occurrence of the last spike and Zs is updated according to the
proportional trace update.
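The event-driven solution of equation 4.4 can be sketched in a few lines. This is an illustrative Python version of generate_primary_trace (the thesis code is MATLAB; argument names are assumptions): the trace jumps proportionally toward 1 at each spike and decays in closed form in between, so the solver cannot miss narrow spikes.

```python
import math

def generate_primary_trace(spike_times, tau_i, r, T_ms, dt=1.0):
    """Event-driven primary trace per equation 4.4: on each spike,
    Z_s = z + r * (1 - z) (proportional update); between spikes the
    trace decays as Z_s * exp(-(t - t_s) / tau_i). A sketch."""
    spikes = sorted(spike_times)
    z, t_last, idx = 0.0, 0.0, 0
    trace = []
    for k in range(int(T_ms / dt)):
        t = k * dt
        while idx < len(spikes) and spikes[idx] <= t:
            z *= math.exp(-(spikes[idx] - t_last) / tau_i)  # decay up to spike
            z += r * (1.0 - z)                              # proportional update
            t_last = spikes[idx]
            idx += 1
        trace.append(z * math.exp(-(t - t_last) / tau_i))
    return trace

trace = generate_primary_trace([10, 12, 14, 16], tau_i=10.0, r=0.3, T_ms=100)
```

Because the update is proportional to (1 − z), the trace is bounded in [0, 1] and can legitimately be read as a probability of activation, unlike the vanishing values obtained with the rectangular-pulse workaround.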
The set of equations 3.3 to 3.8 is solved separately using the solver ode45. Special
care has to be taken with the time-step increment in order to find a trade-off between
computing time and accuracy. The weight update 3.10 and bias update 3.9 are
straightforward.
Finally, it is important to mention that we have implemented 'off-line learning',
in the sense that weights and biases are updated independently of each other.
Everything happens as if there were no connection at all between the cells. This is
not really a constraint during learning, but on the contrary rather convenient.
It remains to be investigated when learning should occur and when inference
should take over. In our model of abstract units, though, the boundary between
learning and inference is clear, because they are governed by different formulae used
in different contexts.
4.3 Retrieval
If the learning phase is central in our implementation, it is also crucial to check
that the stored patterns can be retrieved correctly. The acquired knowledge should
be easily retrievable, especially when we use the BCPNN as an auto- or hetero-
associative memory.
Thus, in this section, we assume that a certain learning phase has already oc-
curred and that the weights ωij and bias βj are set. Our goal is to present an
incomplete pattern and to check whether the network is able to complete it correctly.
Since we only deal with one synapse, input will be fed to the presynaptic unit and
output will be collected at the postsynaptic unit.
Because we have three different pattern presentation schemes in our abstract
units model, inference is done in three different fashions. In all cases, however,
the retrieval phase aims to realise an input-output mapping from unit i to unit j.
Quantitative results are presented in the next chapter; here we focus on the method
that enables us to achieve this.
Non-Spiking Inference
This case is the simplest because the activity of a unit is constant over time
(for the duration of one pattern presentation). In other words, because there is no
difference between the driving input xi(t) and the input activity yi(t), the input-
output mapping is straightforward. Assuming that unit i is fed an input pattern
corresponding to the driving input xi = ki, we first compute the support value hj
of unit j with hj = ωij xi + βj, and then we update the output unit activity with
xj = e^hj. Finally, the input-output mapping is realized by equation 4.5:

xj = e^(ωij xi + βj)     (4.5)
In order to produce the input-output relationship curve, we compute the output
xj according to equation 4.5 for a set of input values xi, regularly spaced between 0
and 1. We end up with an output vector y mapped to an input vector x. Note
that the above equation is the same as the equation presented in Chapter 2
(section 2.3.4), with only two units. If the learning phase has been successful, xj is
nothing but the a posteriori probability of unit j, knowing unit i.
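The curve-generation just described can be sketched directly from equation 4.5. This is a minimal Python illustration (the thesis uses MATLAB); the example weight and bias values are made up.

```python
import math

def nonspiking_inference(w_ij, beta_j, n_points=11):
    """Input-output mapping of equation 4.5, x_j = exp(w_ij * x_i + beta_j),
    sampled for regularly spaced inputs x_i in [0, 1] (a sketch)."""
    xs = [k / (n_points - 1) for k in range(n_points)]
    ys = [math.exp(w_ij * x + beta_j) for x in xs]
    return xs, ys

# With w_ij = log 2 and beta_j = log 0.5 (illustrative values),
# x_i = 0 maps to 0.5 and x_i = 1 maps to exp(log 2 + log 0.5) = 1.0
xs, ys = nonspiking_inference(math.log(2.0), math.log(0.5))
```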
Spiking Inference
In the spiking frequency-based and Poisson-generated pattern presentation schemes,
the input activity yi(t) is not constant over time. A value ki of the driving input
xi(t) corresponds to a firing frequency fi in one case, and to a firing rate in the
other. Thus, the process of inference for one value of xi is not given by a direct
calculation like equation 4.5; instead, it depends on the time-course of the spiking
input activity yi(t), governed by the driving input value xi of unit i. This input
activity yi(t) needs to be processed to calculate a corresponding output value xj.
In order to map an input value xi to a number xj between 0 and 1, we proceed
as follows:
1. We generate a regular spiking input yi(t) with frequency fi (FS) or a Poisson
spike train with rate fi (PS), during a time Tinf equal to 5 seconds. The firing
frequency or rate obeys fi = xi · fmax with xi ∈ [0, 1].

2. We compute a support activity sj(t) according to the relation
sj(t) = ωij yi(t) + βj

3. The support activity sj(t) is then low-pass filtered by a filter with a high
time-constant τf and slow update value k:

dŝj/dt = (k(sj − βj) − ŝj) / τf

4. We take the exponential of the filtered support activity ŝj(t).

5. xj is finally set to the mean stationary value of the output activity yj(t∞).
Figure 4.2: Spiking Inference with abstract units - Different stages of processing
Figure 4.2 shows these different stages of processing. This technique, despite its
apparent complexity, gives a good fit to the previous non-spiking mapping. The key
procedure occurs at step 3, when we filter the support activity sj(t). This
variable is equal to the bias term βj when the input unit does not spike, and it is
set to the value βj + ωij on the occurrence of a spike. When we filter with a specific
low-pass filter (high time constant, small update rate), we can generate a filtered
support activity ŝj(t), which works as an additive trace. Hence, the value of ŝj(t)
at the end of the stimulation gives a measure of the firing frequency of the cell.
The direction of update k · (sj − βj) is proportional to the weight value ωij, which
allows negative or positive build-up according to the sign of ωij. Typically, we use
τf = 500 ms and k = 1/fmax.
Step 4 is needed to keep the inference equation homogeneous. It is crucial that
ŝj(t) stays in the range ]−∞, 0], because we want to get a value of xj between 0 and
1. This can be controlled either by the value k, or by modifying the filtering equation
as in the case of saturated traces (see Chapter 3). The biological model is composed
of steps 3 and 4, because we can draw an analogy between these processes and what
occurs at the synapse level. In short, the filtering accounts for synaptic integration
with low release of neurotransmitters and slow degradation. The exponentiation in
step 4 is observed in the current-frequency mapping of a cell (called the current-
discharge relationship).
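The filter-and-exponentiate pipeline of steps 2 to 5 can be sketched as follows. This is an illustrative Python reading of the procedure (the thesis code is MATLAB); the scaling that keeps ŝj in ]−∞, 0] is discussed in the text and deliberately omitted here, so only the ordering of frequencies is meaningful in this sketch.

```python
import math

def spiking_inference(y, w_ij, beta_j, fmax, tau_f=500.0, dt=1.0):
    """Sketch of the spiking inference steps: the support
    s_j(t) = w_ij * y_i(t) + beta_j is low-pass filtered with time
    constant tau_f and update gain k = 1/fmax (step 3), then
    exponentiated (step 4); the final value approximates the
    stationary output (step 5). Parameter roles are assumptions."""
    k = 1.0 / fmax
    s_hat = 0.0
    for y_t in y:
        s = w_ij * y_t + beta_j                           # step 2: support
        s_hat += (k * (s - beta_j) - s_hat) * dt / tau_f  # step 3: filter
    return math.exp(s_hat)                                # steps 4-5

def regular_spikes(freq_hz, T_ms):
    """Regular 0/1 spike train in 1 ms bins (FS scheme helper)."""
    period = 1000.0 / freq_hz
    return [1.0 if (t % period) < 1.0 else 0.0 for t in range(int(T_ms))]

# With a positive weight, a higher input frequency yields a larger output.
out_low = spiking_inference(regular_spikes(10, 5000), 1.0, -1.0, 50.0)
out_high = spiking_inference(regular_spikes(50, 5000), 1.0, -1.0, 50.0)
```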
For the Poisson-generated spike trains, the underlying random process gives dif-
ferent output at each run. Thus, we have to compute average values, after repeating
the same inference process over several runs (between 5 and 10). There is a
trade-off between discarding the randomness by increasing the number of runs and
the computing time for the simulations. It is also important to keep the randomness
introduced with the Poisson process, because it accounts for the irregular spiking
observed in real neurons.
Chapter 5
Hodgkin-Huxley Spiking
Implementation in NEURON
NEURON is a simulation environment for modeling individual neurons and
neural networks. It was primarily developed by Michael Hines, John W. Moore,
and Ted Carnevale at Yale and Duke. Documentation about NEURON and how to
implement models in NEURON is given in the NEURON book [8].
NEURON, which is associated with the object-oriented NMODL language, offers
an efficient means to run simulations of highly connected networks of neurons. Built
on the C language, it does not suffer from procedural processing of data and uses
efficient and fast algorithms to solve differential equations. The computing time of
the abstract units model is thereby reduced by a factor of 10.
5.1 Cell Model
5.1.1 Hodgkin Huxley Model
In 1952, Alan Lloyd Hodgkin and Andrew Huxley proposed a model to explain
the ionic mechanisms underlying the initiation and propagation of action potentials
in the squid giant axon [14]. They received the Nobel Prize in Physiology or
Medicine in 1963 for this work, and the model has since been referred to as the
Hodgkin-Huxley model. It describes how action potentials in neurons are initiated
and propagated with the help of a set of nonlinear ordinary differential equations
that approximates the electrical characteristics of excitable cells such as neurons
and cardiac myocytes [2].
The main idea behind the Hodgkin-Huxley formalism is to give an electrical
equivalent to each biological component of the cell that plays a role in the trans-
mission of action potentials, which is the support of signaling within the cell. The
components of a typical Hodgkin-Huxley model, shown in Figure 5.1, include:
• A capacitance Cm, representing the lipid bilayer. A cell, considered as a whole,
is electrically neutral, but the neighbourhood surrounding the cell membrane
Figure 5.1: Hodgkin-Huxley model of a cell
is not. Membrane voltage is the consequence of the accumulation of charged
particles on the two sides of that bilayer, which is impermeable to ions. A typical
value for Cm is 1 nF.
• Nonlinear electrical conductances gn(Vm, t), representing voltage-gated ion
channels. Their behaviour is described by gating variables that describe open,
closed and inactivated states (see Appendix for the equations). These conduc-
tances are both voltage- and time-dependent: gn(Vm, t), where n denotes a
specific ion species. In addition, they exhibit fast dynamics, because they ac-
count for the cell's regenerative properties involved in the propagation of action
potentials.
• A linear conductance gleak for passive leak channels; these channels are not
ion-selective, are always open and contribute to the resting membrane
potential. A typical value for gleak is 20 µS·cm−2.
• Generators En, describing the electrochemical gradients driving the flow of
ions, the values of which are determined from the Nernst potentials of the ionic
species of interest.
This model can be extended by modeling ion pumps with the help of current
sources (the sodium-potassium ion pump is responsible for the equilibrium of
concentrations inside and outside the cell). More elaborate models include chloride
and calcium voltage-gated currents; however, we only deal here with two ionic
currents, sodium and potassium, and one leakage channel.
Furthermore, our cell model will contain additional channels (see figure 5.3.2): a
slow-dynamics voltage-gated potassium channel accounting for spike-frequency
adaptation (see section 5.1.2) and an activity-dependent potassium channel modeling
intrinsic excitability (see section 5.3.2).
As a convention, we will use I > 0 when ions flow from the outside to the inside
of the cell, so that, in the normal cell dynamics, the sodium current takes positive
values and the potassium current takes negative values. The voltage equation is
given by the relation between the applied current Iapp, the capacitive current Ic and
the sum of the ionic and leak currents Iion:

Iapp = Ic + Iion = Cm dVm/dt + INa + IK + Ileak     (5.1)

We see that when Iapp > 0, then dVm/dt > 0 and the membrane voltage becomes more
positive (depolarization). The detailed dynamics of the voltage and gating variable
equations are given in the Appendix.
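Equation 5.1 can be integrated numerically with a simple forward Euler scheme. The Python sketch below uses the standard textbook squid-axon parameters and rate functions (with the usual convention of outward ionic currents written as g(V − E)); these are illustrative values, not necessarily those of our NEURON model.

```python
import math

def hh_step(V, m, h, n, I_app, dt):
    """One forward-Euler step of equation 5.1 for a single-compartment
    Hodgkin-Huxley cell with Na+, K+ and leak channels. Textbook
    squid-axon parameters (mS/cm^2, mV, uA/cm^2) are assumed."""
    g_na, g_k, g_l = 120.0, 36.0, 0.3
    e_na, e_k, e_l = 50.0, -77.0, -54.387
    c_m = 1.0

    def a_m(v):
        if abs(v + 40.0) < 1e-7: return 1.0  # limit value at the singularity
        return 0.1 * (v + 40.0) / (1.0 - math.exp(-(v + 40.0) / 10.0))
    def b_m(v): return 4.0 * math.exp(-(v + 65.0) / 18.0)
    def a_h(v): return 0.07 * math.exp(-(v + 65.0) / 20.0)
    def b_h(v): return 1.0 / (1.0 + math.exp(-(v + 35.0) / 10.0))
    def a_n(v):
        if abs(v + 55.0) < 1e-7: return 0.1  # limit value at the singularity
        return 0.01 * (v + 55.0) / (1.0 - math.exp(-(v + 55.0) / 10.0))
    def b_n(v): return 0.125 * math.exp(-(v + 65.0) / 80.0)

    i_na = g_na * m**3 * h * (V - e_na)
    i_k = g_k * n**4 * (V - e_k)
    i_l = g_l * (V - e_l)
    dV = (I_app - i_na - i_k - i_l) / c_m
    m += dt * (a_m(V) * (1 - m) - b_m(V) * m)
    h += dt * (a_h(V) * (1 - h) - b_h(V) * h)
    n += dt * (a_n(V) * (1 - n) - b_n(V) * n)
    return V + dt * dV, m, h, n

# 100 ms of constant current injection produces repetitive firing;
# spikes are counted as upward crossings of 0 mV.
V, m, h, n = -65.0, 0.05, 0.6, 0.32
spikes, above = 0, False
for _ in range(int(100.0 / 0.01)):
    V, m, h, n = hh_step(V, m, h, n, I_app=10.0, dt=0.01)
    if V > 0.0 and not above:
        spikes += 1
    above = V > 0.0
```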
5.1.2 Spike Frequency Adaptation
Spike-frequency adaptation is a type of neural adaptation that plays a key role
in the regulation of neuronal firing frequency. It is characterized by an increase of
the interspike interval when a neuron is current-clamped. Among other mechanisms,
various ionic currents modulating spike generation cause this type of neural adapta-
tion: voltage-gated potassium currents (M-type currents), the interplay of calcium
currents and intracellular calcium dynamics with calcium-gated potassium channels
(AHP-type currents), and the slow recovery from inactivation of the fast sodium
current (Benda et al. 2003) [35]. Spike-frequency adaptation can account for the
findings in burst firing (Azouz et al. 2000) [5].
Figure 5.2: Spike-frequency Adaptation : Membrane voltage and state variable p
In our model, spike-frequency adaptation is taken into account by adding a
slow-dynamics voltage-gated potassium channel. The conductance of this channel
is non-linear and depends on the membrane voltage Vm. It is described by an activation
variable p that works in a similar way to an additive synaptic trace (see Figure 5.2).
The dynamics of the channel are given by the following equations:

gkim(Vm, t) = gkim · p     (5.2)

dp/dt = (p∞(Vm, t) − p) / τ(Vm, t)     (5.3)
Figure 5.2 describes the build-up of the trace p and the conductance gkim,
which is responsible for the increase of the interspike interval along the stimulation.
Because of its slow decay, the delay between stimulations must be much longer than
the stimulation itself for the p variable to return to baseline. The slow dynamics of
this channel suggest that repeated strong transient stimulation has a stronger effect
than long-lasting stimulation.
5.2 Pattern presentation
Now that our cells are no longer modeled by artificial units, but instead by
complex spiking Hodgkin-Huxley units, input and output need to be matched to
real variables. As presented above, the Hodgkin-Huxley model is based on the
relation between the membrane potential Vm and the individual ionic currents Ii. Thus,
it is natural to feed input to a cell by injecting a certain amount of current Iapp
into it and to read the output as the membrane potential firing frequency f. To
achieve this, we will use current electrodes to present patterns to the network:
some current is injected inside the cell membrane, which depolarizes it and triggers
action potentials, and the membrane voltage is recorded as the difference of potential
between two electrodes, one inside and the other outside the cell (see Kandel 1995
[20] about the current-clamp technique).
The input-output relationship (analogous to the activity function in
artificial networks) is thus given by mapping the injected current Iapp to the membrane
firing frequency f. The curve giving the firing frequency of one unit versus the in-
jected input current is called the steady-state current discharge. This curve is presented
for our units in the next chapter (see figure 6.6). For weak currents, no active firing
is triggered: the depolarization induced by the current injection is too small for the
membrane to reach threshold, and no action potential is recorded. For cur-
rents that are too strong, the Hodgkin-Huxley voltage-gated potassium channels
fail to repolarize the cell and the membrane voltage stabilizes at a
supra-threshold value. We must therefore feed input currents within a range
[0, Imax] over which the steady-state current discharge is approximately linear.
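Working in that linear range makes the inversion from target frequency to injected current straightforward. As a minimal sketch, assume a linear fit f = a (Iapp − Ithr) above threshold; the slope and threshold values below are placeholders to be read off the measured curve (figure 6.6), not model constants:

```python
def current_for_frequency(f_target, a=2.0, i_threshold=0.05):
    """Invert an assumed linear f-I relation f = a * (I - i_threshold)
    to get the current producing firing frequency f_target.
    Units are illustrative (Hz in, arbitrary current units out)."""
    if f_target <= 0.0:
        return 0.0  # silent unit: stay below threshold, inject nothing
    return i_threshold + f_target / a

# With these placeholder parameters, a unit asked to fire at 40 Hz
# receives a current of 20.05 units.
print(current_for_frequency(40.0))  # 20.05
```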
During learning, we feed input patterns sequentially. Presentation is entirely
frequency-based: an input value between 0 and 1 is mapped to a
firing frequency. The current-frequency relationship is then used to find the current
clamp value that yields the desired frequency. Assume the set of
input patterns we want to learn is (1, 1), (0, 0), (1, 1), (0, 1), (1, 1); then unit
i must fire with the frequency sequence (fmax, 0, fmax, 0, fmax) and unit
j with (fmax, 0, fmax, fmax, fmax). Using the steady-state current discharge curve,
we inject the corresponding currents to obtain the desired firing frequencies.
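The frequency assignment in this example can be sketched as follows; the pattern set comes from the text, while the concrete 50 Hz value standing in for fmax is an arbitrary placeholder:

```python
F_MAX = 50.0  # target firing frequency for an active unit (Hz, illustrative)

def pattern_frequencies(patterns):
    """Turn a sequence of binary (x_i, x_j) input patterns into the
    target firing frequency sequence for each of the two units."""
    freqs_i = [F_MAX if xi == 1 else 0.0 for xi, _ in patterns]
    freqs_j = [F_MAX if xj == 1 else 0.0 for _, xj in patterns]
    return freqs_i, freqs_j

patterns = [(1, 1), (0, 0), (1, 1), (0, 1), (1, 1)]
fi, fj = pattern_frequencies(patterns)
print(fi)  # [50.0, 0.0, 50.0, 0.0, 50.0]
print(fj)  # [50.0, 0.0, 50.0, 50.0, 50.0]
```

Each target frequency would then be converted to an injected current via the steady-state current discharge curve, as described above.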
5.3 Learning Rule Implementation
In this section, we present how weights and biases are represented and updated
during learning in the NEURON spiking environment. We use the object-oriented
NMODL language to create new mechanisms for NEURON simulations. The
details of the code are given in the Appendix.
5.3.1 Synaptic Integration
Modeling the synapse
In the artificial context, the weight ωij between two units quantifies the strength
of the connection between them. If ωij is high, the connection between the two
units is strong and they influence one another significantly. Conversely, if
ωij is close to zero, the connection is very weak and the corresponding units behave
as if they were not connected. The simplest way to represent this in our spiking
context is to map ωij to a synaptic conductance gij between units i and j. This
conductance is time-dependent and closely tied to presynaptic and
postsynaptic events.
We therefore create a synapse model whose intrinsic properties fulfill the
weight update equation 3.10 of our spiking learning rule, and call it a BCPNN
Synapse. It is defined as a point process in NMODL, which means that one can
create as many instances of this mechanism as needed, provided each is given a
location (a section in NEURON). All local variables of the section it is
attached to become available to the point process (membrane voltage, ionic cur-
rents, etc.). By convention, we always place a synapse on the soma of the
postsynaptic cell.
Conductance Expression
In our model, the synaptic conductance gij(t) is the product of three quantities:
gij(t) = gmax · gcomp(pi, pj, pij, t) · αi(yi, t) (5.4)
gmax is the maximum conductance of the synapse: it regulates the synapse's strength (its abil-
ity to conduct current) and can temporarily be set to zero if one wants to perform
off-line learning.
gcomp(pi, pj, pij, t) is directly computed from the tertiary traces pi, pj and pij simi-
  • 1. Transforming the BCPNN Learning Rule for Spiking Units to a Learning Rule for Non-Spiking Units A N T O I N E B E R G E L Master of Science Thesis Stockholm, Sweden 2010
  • 2. Transforming the BCPNN Learning Rule for Spiking Units to a Learning Rule for Non-Spiking Units A N T O I N E B E R G E L Master’s Thesis in Biomedical Engineering (30 ECTS credits) at the School of Computer Science and Engineering Royal Institute of Technology year 2010 Supervisor at CSC was Örjan Ekeberg Examiner was Anders Lansner TRITA-CSC-E 2010:059 ISRN-KTH/CSC/E--10/059--SE ISSN-1653-5715 Royal Institute of Technology School of Computer Science and Communication KTH CSC SE-100 44 Stockholm, Sweden URL: www.kth.se/csc
  • 3. Abstract The Bayesian Confidence Propagation Neural Network (BCPNN) Model has been developed in the past thirty years for specific tasks like, among others, classification, content-addressable memory and data mining. It uses a Bayesian-Hebbian learning rule, which exhibits fairly good per- formances, both as an counter model and in a continously operating incremental learner. This learning rule has never been up and running in spiking units networks and one is bound to use the outcome of the learning for non-spiking units and to transpose it to the spiking context afterwards, which is highly restrictive. The aim of Master Thesis Project is to transform the existing BCPNN learning rule for non-spiking units, including the bias term, to the domain of spiking neural networks based on the Hodgkin-Huxley cell model. One aims to have a modelisation running in NEURON, which can exhibit the same features observed with non-spiking units. The sec- ondary goal of this paper is to compare the new learning rule to the old one, and also with other Spike-Timing Dependent Plasticity learning rules. To achieve such a goal, we introduce a new version of the BCPNN learning rule, which can account for spiking input activities. This learn- ing rule is based on the use of synaptic traces as local variables to keep a trace of the frequency and timing between spikes. It includes three stages of processing, all based on low-pass filtering with three differ- ent temporal dynamics, in order to give an evaluation of the probabil- ities used to compute the Bayesian weights and biases. The Bayesian weights are mapped to a synaptic conductance, updated according to the values of these synpatic traces, and we map the bias term to an activity-regulated potassium channel. We exhibit results of the new spiking version of the BCPNN learning rule in single-synapse learning and retrieval. 
We implement two main models : the first based on abstract units in MATLAB and another one based on Hodgkin-Huxley spiking units in NEURON. The last model accounts for spike-frequency adaptation and can be used to study the ef- fect of exact timing between presynaptic and postsynaptic spikes, under repeated stimulations.
  • 4.
  • 5. Acknowledgements I would first like to thank Anders Lansner for allowing me to work at the department Computational Biology and Neuroscience at KTH, for devoting time and patience to assuming both roles of supervisor and examiner of this Master Thesis, and for always helping me, guiding me and finally leaving me in the best conditions to produce valuable work. This first step into research at a high-level scientific department has been a very enriching experience, which I will always remember. I would also like to thank Örjan Ekeberg, for accepting to tutor this Master Thesis from abroad at first, and later for all the precious comments about the report, presentation and structure of this work. This passed year, at the department, I have had the chance to meet a lot of people, from different backgrounds and countries. They have contributed to make the atmosphere of work very special and especially warm and welcoming : Claudia, who has been here from the very beginning, Charles, for his ping-pong and chess skills when a break was needed, Aymeric, Dave, Simon, Pawel, Pierre and all the others for making me discover new bars and restaurants. I want to give a special thank to Mikael, for interesting talk, Pradeep and David, for their disponibility, kindness and help with NEURON, and finally to Bernhard, who has been not only always eager to answer my numerous questions and investigate new problems with me, but also a devoted friend, who proposed me tremendous support and help, when time pressure was high. I cannot cite all the people that I have met these past two years, but I want to say how getting to know all of them, all the conversations and moments we had together, have changed me and made me realise that there exist no geographic border to friendship and love. 
So, I want to thank Natasha, for the time she spent on improving the language in my report, and simply for being always supportive and making me feel that she was here with me, though at the other side of the world. This year would have been so much different without my lovely room-mates Birte, Isabel, Stefan F., Stefan T. and Volker, for August mondays among other things, my two French buddies Fred and Joseph, for lunchbreaks, poker sessions and crazy parties. I want to give a special thank to my two Italian friends who showed that beyond neighbour rivalry, we just have so much in common and so much to share : Enrico, the craziest person I have ever lived with and Sara, probably the best pizza and focaccia cooker ever. Finally, I want to thank my parents who have always helped me with all the
  • 6. problems one can encounter when studying abroad for two years : I know how lucky I am to have them with me and I hope they measure the respect I have for them. A little word to my syblings, my sister Karen and my brother Samuel, who I will be very happy to meet and live with again.
  • 7. Contents 1 Introduction 1 1.1 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 The BCPNN Model 5 2.1 Context and Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Bayesian Confidence Propagation . . . . . . . . . . . . . . . . . . . . 8 2.2.1 Using Neurons as probability estimators . . . . . . . . . . . . 8 2.2.2 Derivation of Network Architecture . . . . . . . . . . . . . . . 9 2.2.3 Bayesian-Hebbian Learning . . . . . . . . . . . . . . . . . . . 9 2.3 Gradual Development of the BCPNN model . . . . . . . . . . . . . . 9 2.3.1 Naive Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . 9 2.3.2 Higher Order Bayesian Model . . . . . . . . . . . . . . . . . . 11 2.3.3 Graded units . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.4 Recurrent Network . . . . . . . . . . . . . . . . . . . . . . . . 14 2.4 BCPNN Learning Implementations . . . . . . . . . . . . . . . . . . . 14 2.4.1 Counter Model . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.4.2 Incremental Learning . . . . . . . . . . . . . . . . . . . . . . 15 2.5 Performance Evaluation and Applications . . . . . . . . . . . . . . . 16 3 A spiking BCPNN Learning Rule 19 3.1 Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.2 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.2.1 Synaptic traces as local state variables . . . . . . . . . . . . . 21 3.2.2 Spike-timing Dependence . . . . . . . . . . . . . . . . . . . . 22 3.2.3 Delayed-Reward Learning . . . . . . . . . . . . . . . . . . . . 23 3.2.4 Long-term Memory . . . . . . . . . . . . . . . . . . . . . . . . 25 3.2.5 Probabilistic features . . . . . . . . . . . . . . . . . . . . . . . 25 3.3 Biological relevance . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . 26 4 Abstract Units Implementation 27 4.1 Pattern presentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
  • 8. 4.1.1 Non-spiking Pattern Presentation . . . . . . . . . . . . . . . . 28 4.1.2 Spiking frequency-based Pattern Presentation . . . . . . . . . 28 4.1.3 Spiking Poisson-generated Pattern Presentation . . . . . . . . 29 4.2 Learning Rule Implementation . . . . . . . . . . . . . . . . . . . . . 31 4.3 Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 5 Hodgkin-Huxley Spiking Implementation in NEURON 35 5.1 Cell Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 5.1.1 Hodgkin Huxley Model . . . . . . . . . . . . . . . . . . . . . 35 5.1.2 Spike Frequency Adaptation . . . . . . . . . . . . . . . . . . . 37 5.2 Pattern presentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 5.3 Learning Rule Implementation . . . . . . . . . . . . . . . . . . . . . 39 5.3.1 Synaptic Integration . . . . . . . . . . . . . . . . . . . . . . . 39 5.3.2 Bias term . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 5.4 Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 6 Results 47 6.1 Abstract units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 6.1.1 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 6.1.2 Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 6.2 Hodgkin-Huxley Spiking Units . . . . . . . . . . . . . . . . . . . . . 51 6.2.1 Steady-State Current Discharge . . . . . . . . . . . . . . . . . 51 6.2.2 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 6.2.3 Parameter tuning . . . . . . . . . . . . . . . . . . . . . . . . . 58 6.2.4 Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 6.2.5 Spike Timing Dependence . . . . . . . . . . . . . . . . . . . . 60 7 Discussion 63 7.1 Model Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 7.1.1 Learning Rule Parameters . . . . . . . . . . . . . . . . . . . . 63 7.1.2 Pattern Variability . . . . . . . . . . . . . . . 
. . . . . . . . . 65 7.1.3 Learning-Inference Paradigm . . . . . . . . . . . . . . . . . . 65 7.2 Comparison to other learning rules . . . . . . . . . . . . . . . . . . . 66 7.2.1 Spiking vs Non-spiking Learning Rule . . . . . . . . . . . . . 66 7.2.2 Spike-timing dependence and real data . . . . . . . . . . . . . 68 7.2.3 Sliding threshold and BCM Rule . . . . . . . . . . . . . . . . 69 7.3 Further Developments and limitations . . . . . . . . . . . . . . . . . 71 7.3.1 Network implementation . . . . . . . . . . . . . . . . . . . . . 71 7.3.2 RSNP cells and inhibitory input . . . . . . . . . . . . . . . . 71 7.3.3 Hypercolumns, basket cell and lateral inhibition . . . . . . . 72 7.3.4 Parallel computing . . . . . . . . . . . . . . . . . . . . . . . . 73 8 Conclusion 75
  • 9. Bibliography 77 Appendices 79 A NMODL files 81 A.1 Synapse modelisation . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 A.2 A-Type Potassium Channel . . . . . . . . . . . . . . . . . . . . . . . 83 B Hodgkin-Huxley Delayed Rectifier Model 87 B.1 Voltage Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 B.2 Equations for Gating Variables . . . . . . . . . . . . . . . . . . . . . 87 C NEURON stimulations parameters 89
  • 10.
  • 11. Chapter 1 Introduction 1.1 Context Since 1949 with Hebb’s theory, synaptic plasticity (the ability for the synap- tic connection between two neurons to change its strength according to a certain conjunction of presynaptic and postsynaptic events) is thought to be the biological substrate for high level cognitive functions like learning and memory. This idea is actually much older and was fomalized by the Spanish neuroanatomist Santiago Ramón y Cajal in 1894, who suggested “a mechanism of learning that did not re- quire the formation of new neurons”, but proposed that “memories might instead be formed by strengthening the connections between existing neurons to improve the effectiveness of their communication” [29]. Hebb went a step further by proposing his ideas about the existence of a metabolic growth process associating neurons that tend to have a correlated firing activity [13]. For the brain to be able to form, store and retrieve memories, as well as learn spe- cific tasks, the biological changes at the synapse level need to be long-lasting. This is called long-term potentiation (LTP) or depression (LTD) which means a persistent increase or decrease in synaptic strength which is said to be the key mechanism un- derlying learning and memory. The biological mechanisms responsible for long-term potentiation are not exactly known, but specific protein synthesis, second-messenger systems and N-methyl D-aspartate (NMDA) receptors are thought to play a critical role in its formation [20]. In 1995, Fuster defined memory as “a functional property of each and all of the areas of the cerebral cortex, and thus of all cortical systems”. He distinguishes several types of memories : short-term/long-term, sensory/motor, declarative/non- declarative and individual/phyletic. He proposes that all memory is associative and its strength depends on the number of associations we make to a specific word or mental object [11]. 
He introduced several key concepts like working memory, as a gateway to long-term memory waiting to be consolidated, and active memory as a cortical network of neurons with an activity that is above a certain baseline. Also, his perception-action cycle suggesting a constant flow of information between 1
  • 12. CHAPTER 1. INTRODUCTION sensory and motor memory, has been proved to be a matter of interest for future experimentation. More recently, investigations have focused on spike-timing-dependent plasticity (STDP), which refers to synaptic changes sensitive to the exact timing of action po- tentials between two connected neurons : one refers to pre-post timing or positively- correlated timing, when the presynaptic neuron fires a few milliseconds before the postsynaptic neuron, and to post-pre timing or negatively-correlated timing, when it goes the other way around. STDP has become a popular subject since the ex- perimental work of Bi and Poo [6] who first demonstrated the strong influence of exact timing (typically a time-window of 20 ms for cultured hippocampal neurons) of presynaptic and postsynaptic spikes on synaptic long-term potentiation. Their work with culture of hippocampal neurons, seconded by the work from others, e.g. Rubin et al. 2005 and Mayr et al. 2009 [30, 25], has resulted in the formulation of STDP type learning rules [27, 9]. One must be aware, however, that these rules are considered rather crude ap- proximations by relevant experimentalists. There is a constant duality of the two possible ways to approach neuroscience : some aim to understand the biological mechanisms at the cell and membrane level, so that they can build up models to reproduce them, whereas others aim to reproduce cell behaviour for applications and fit their model to experimental data, rather than to theory. Both approaches have their justification and it is likely that both approaches are complementary. However, if some results arise, our understanding of the mechanisms of the brain is still partial and a great deal remains to be done. In this project, we focus on the Bayesian Confidence Propagation Neural Net- work (BCPNN) model, which has been first studied by Lansner and Ekerberg (1989) [22] and Kononenko (1989) [18]. 
Its main features are a network architecture di- rectly derived from Bayes Rule and unit activities representing the probability of stochastic events. The BCPNN model will be thoroughly described in Chapter 2. 1.2 Motivations In 2003, Sandberg et al. proposed that “a possible future extension of the existing BCPNN model would be to implement the model using spiking neurons to further examine its generality and properties, such as the effect of spike synchrony in memory reset and the effects of AHP modulation on network dynamics” [32]. At that time, the model had just been improved from a counter model to a continously operating incremental learning model. In this respect, the work presented is in the continuity of what has already been done and seeds the need to have such a learning rule operating in spiking context. Artificial neurons are a very crude approximation of real neurons : given input from other neurons they generate an output through an activity function. Spiking neurons, however, mimic the behaviour of real neurons : in particular, they exhibit spikes (they “fire” and take a high positive value) only when their potential crosses 2
  • 13. 1.3. OUTLINE a threshold and for a very short amount of time. These neurons simulate all-or- nothing behaviour and action potentials observed in real neurons [20]. The variables attached to them, such as membrane voltage, capacitance, synaptic conductance, have a real biological meaning. Since the large-scale implementations of neural networks are often based on spiking units, it is valuable to have such a formulation to be able to run on-line learning also in large-scale spiking neural networks. The project aims to end up with a network model with the new online learning rule operating and use it to test some of the emergent phenomena. Evaluating the model by comparing it to the original BCPNN rule, other STDP rules, as well as some experimental data on LTP [30], is our prime motivation. Because of its specific features (both Hebbian and Bayesian), the BCPNN learning rule can always be used as a reference for other STDP learning rules to be implemented in the future. With regard to the bias term a comparison can also be made with the threshold in the BCM learning rule, developed in 1982 by Bienenstock, Cooper and Munroe [7]. The transformation of the BCPNN learning rule to a spiking neuron environ- ment is somewhat challenging and has never been done before. This opens up tremendously the extent of our work and the scope of this Master Thesis is to be limited, for the sake of simplicity. We narrow our work to two main objectives : the comparison to other existing learning rules, as explained above, is the first. The second, somewhat more abstract, is to conciliate the probabilistic features of the original BCPNN learning rule and spike-timing dependent features developed in STDP models (Bi and Poo 1998, Morrisson 2008, Clopath 2008) [6, 27, 9]. 
The new learning rule presented in Chapter 3 is built to take into account STDP-like features, and we aim to fit our model to existing experimental data, relating to the spike-timing dependent plasticity window (Bi and Poo 1998) [6] and intrinsic excitibality (Jung et al. 2009) [19], following a phenomenological approach to the problem. An further improvement of our work would be to modify our learning rule so that it could run on parallel computers in a large-scale context. This work is not meant to state decisive results, or to study exhaustively one specific feature of the BCPNN model, but rather to trigger the conversion of the BCPNN model to spiking unit environment. 1.3 Outline We will first redefine, in Chapter 2, the basics of the BCPNN model and its mathematical context, from its most basic form (Naive Bayes Classifier) to more re- cent ones (Higher Order Model, Recurrent Network). We will also relate the existing implementations (counter model, incremental learning) and their applications. In Chapter 3, the ‘spiking’ version of the learning rule is presented, its new features and their biological motivation. The including two following chapters contain the core of the thesis : we develop how we implemented the new learning rule respectively 3
  • 14. CHAPTER 1. INTRODUCTION with abstract units in MATLAB (Chapter 4) and in a spiking context in NEURON (Chapter 5). The results are presented in Chapter 6, both single-synapse, network implementation and phemenological approach to fit STDP data. Dependence on model parameters and comparisons to other existing learning rules are discussed in Chapter 7. Finally, Chapter 8 is dedicated to further developements and to the conclusion. 4
  • 15. Chapter 2 The BCPNN Model 2.1 Context and Definitions Artificial Neural Networks An artificial neural network (ANN) is a computational model that aims to simulate the structure and/or functional aspects of biological neural networks. It consists of a group of computational units, connected by weighted links through which activation values are transmitted The reader can find documentation about ANNs in the literature and the pur- pose here is not to discuss Neural Networks in a general fashion. Still, we think it’s valuable in our context to relate the main features of artificial neural networks. Nodes The functional unit or node of the network is the basic element constituting it. Even if, in the first place, it has a biological equivalent, like a neuron or more recently a minicolumn, it is an abstract unit, which means that the variables attached to it are artificial and do not have an intrinsic biological meaning. A node i is assigned a random variable xi that can be either binary, discrete or continous. It takes its input from other units xj and generates an output yi. Activity Function The activity function or transfer function is the function giv- ing the input-output relationship for one node. Common activity functions include linear, thresholded and sigmoïd functions. The input-output rela- tionship for one unit i is given by yj = φ(βi + n j=1 ωijxj) where φ is the activity function, ωij the weight between unit i and unit j and βi the bias of unit i. Learning Rule The learning rule is an algorithm that modifies connections be- tween units, the so-called weights, in response to the presentation of an input pattern. It is often the key point of the implementation, because it determines the response of the network to specific input, hence its applications. Classical 5
  • 16. learning rules include Perceptron Learning, the Delta rule and Error Backpropagation. Network Architecture A network can have several topologies. It can be composed of layers (single-layer, multi-layer networks) that can communicate in only one direction (feedforward network) or in both directions (backpropagation or recurrent network). Connections between units in the network can be sparse or all-to-all. They can include one or several hidden layers (internal computational units which are not accessible from the network interface, but are used to create a specific internal representation of the data). Input and Output units In a feedforward network, a network receives information from input units and proposes an interpretation available at the output units. In a recurrent network, though, the difference between input and output units is less clear: input consists of an activation of a set of units representing an input pattern, and an output pattern is read from the activity of the units after a phase called relaxation. Learning and Retrieval A network can be used in one of the two following modes: learning or retrieval. During the learning phase, the network input units are clamped to a certain value (clamping means that the units are assigned a value set by the operator through a controlled process) during a certain amount of time (a set of input units represents an input pattern). During clamping, the learning rule operates, so that the weights are updated and retain the information contained in the pattern that has been presented. In other words, during learning, the network adapts to reality (the clamped input pattern) and changes its internal connections to remember it in the future: learning is said to be stimulus-driven. During the retrieval phase, the weights of the network are assumed to be fixed, keeping the internal structure of the network unchanged.
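The node input-output relationship defined above, y_i = φ(β_i + Σ_j ω_ij x_j), can be illustrated with a short sketch (an illustration only, not code from the thesis; the sigmoid choice and the name `unit_output` are assumptions):

```python
import math

def unit_output(x, w, beta):
    """Output of one node: y_i = phi(beta_i + sum_j w_ij * x_j),
    here with a logistic (sigmoid) activity function phi."""
    support = beta + sum(w_j * x_j for w_j, x_j in zip(w, x))
    return 1.0 / (1.0 + math.exp(-support))

# Example: one unit receiving three inputs
y = unit_output(x=[1.0, 0.0, 1.0], w=[0.5, -0.2, 0.8], beta=-1.0)
```

Any of the activity functions mentioned above (linear, thresholded, sigmoid) could be substituted for the logistic function used here.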
Distorted, incomplete or different patterns than the ones used during learning are presented to the network and an output pattern is generated. In the case of layered networks, the inference is realized by feeding a pattern to the input units and collecting it at the output units. In other words, the network interprets the input data, using its internal representation or knowledge. For a recurrent network, however, the input pattern is fed to all input units (all units in the network except for the hidden units), and the network starts a phase called relaxation. Relaxation of the network consists of taking a pattern as input and incrementally updating the units’ activities according to an inference rule; this stops when stability is reached, i.e. when the change in the units’ activities is sufficiently small. When the weight matrix is symmetric, convergence is guaranteed and relaxation always converges to a stable attractor state [16]. For correct knowledge to be acquired, one must learn a pattern (learning phase) and then check if the pattern has been stored correctly (retrieval). It is important
  • 17. however, to alternate these two phases, so that the information stored by the network is constantly updated and corrected. To ensure that the network does not learn its own interpretation of the data, the learning phase must be shut off from time to time. Hebb’s postulate Introduced by Donald Hebb in 1949, Hebb’s postulate, also called cell assembly theory, is one of the earliest rules about synaptic plasticity. It has been formulated as follows: When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased [13]. The theory is often summarized as “cells that fire together, wire together” and is commonly evoked to explain some types of associative learning in which simultaneous activation of cells leads to pronounced increases in synaptic strength. Such learning is known as Hebbian learning. The general idea is that cells or groups of cells that are repeatedly active at the same time will tend to become associated, so that activity in one facilitates activity in the other [1]. Work in the laboratory of Eric Kandel has provided evidence for the involvement of Hebbian learning mechanisms at synapses in the marine gastropod Aplysia californica [21]. Associative Memory Fuster describes associative memory as “a system of memory, usually constituted by associations between stimuli and reinforcement” [11], as opposed to recognition or episodic memories. However, according to him, association is an attribute of all memories, from the root of their genesis to their evocation. More widespread is the definition of auto-associative and hetero-associative memories as a form of neural networks that enables one to retrieve entire memories from only a tiny sample.
Hetero-associative networks can produce output patterns of a different size than that of the input pattern (mapping from a pattern x to a pattern y with a non-square connection matrix W), whereas auto-associative networks work with a fixed pattern size (mapping of the same pattern x with a square connection matrix W). The Hopfield network (Hopfield 1982 [16]) is the most widely implemented auto-associative memory network and serves as a content-addressable memory with binary threshold units. Under the following restrictions: w_ii = 0 (no unit has a connection with itself) and w_ij = w_ji (connections are symmetric), convergence to a local minimum of a certain energy function is guaranteed. During learning, the connection matrix W is modified to allow for attractor dynamics, so that relaxation of the network causes the input pattern to converge towards the closest attractor state.
  • 18. 2.2 Bayesian Confidence Propagation 2.2.1 Using Neurons as probability estimators The main idea underlying the BCPNN learning rule is to use neurons as probability estimators. The input and output unit activities represent probabilities. The neuron is used to estimate its probability of firing in a given context, i.e. knowing the information carried by the activities of the other neurons in the network. Confidence propagation relies on the fact that the conditional probability P(y_i|x) of a given neuron y_i firing given the context x is a better approximation than the a priori probability P(y_i). By updating units like this, one propagates the confidence of one unit to the other units in the network. Figure 2.1: Using Neurons as probability estimators The BCPNN learning rule is based on a probabilistic view of learning and retrieval; input unit and output unit activities represent, respectively, confidence of feature detection (the input to unit i from unit j is a number between 0 and 1 representing the confidence that x_j is a part of this pattern) and posterior probabilities of outcomes (the output of unit j is a number between 0 and 1 representing the probability of outcome of x_j given the pattern context). One drawback of using neurons as probability estimators is that we have to separate the signal. Indeed, observing the absence of an attribute in a given vector is different from the absence of an observation of this attribute. However, if we map each attribute to a single unit, the BCPNN model will interpret zero input to this unit as an absence of information on this attribute, and it will compute the a posteriori probabilities of the other units, discarding the input from this unit. To solve this problem, we need to separate the data, i.e. we need to create units for all possible values of an attribute. In the case
  • 19. of binary units, this corresponds to having two units a and ā for attribute A. When no observation is made on this attribute, the network will discard input from both of these units. 2.2.2 Derivation of Network Architecture The Bayesian Confidence Propagation Neural Network (BCPNN) has been developed gradually (Lansner and Ekeberg 1989, Lansner and Holst 1996, Sandberg et al. 2002, Sandberg et al. 2003) [22, 23, 31, 32]. Starting from Bayes’ theorem (equation 2.1), we derive a network architecture, meaning that we identify the terms in our mathematical formulae with weights ω_ij, biases β_j, inputs x_i and output unit activities y_j. The purpose of the learning phase will then be to update the weights and biases so that their values fit those in the mathematical derivation of the network. Depending on the complexity of the training set we use, the network architecture can be a single-layer (see Naive Bayes Classifier), multi-layer (see Higher Order Model) or fully-connected network (see Recurrent Network). 2.2.3 Bayesian-Hebbian Learning The BCPNN learning rule derived in the next section uses Bayesian weights and biases (equation 2.4). It exploits the statistical properties of the attributes in the training set (frequencies of activation of one attribute x_i and of co-activation of two attributes x_i and x_j) in order to evaluate the probabilities P(x_i) and P(x_i, x_j) used to update the weights and biases. It also shows Hebbian features, because it reinforces connections between simultaneously active units, weakens connections between units independent from one another, and makes connections inhibitory between anti-correlated units. As we shall see later in this thesis, when applied to a recurrent attractor network, it gives a symmetric weight matrix and allows for fixed-point attractor dynamics.
The update of the weights in the network resembles what has been proposed as rules for biological synaptic plasticity (Wahlgren and Lansner 2001) [33]. 2.3 Gradual Development of the BCPNN model 2.3.1 Naive Bayes Classifier The Naive Bayesian Classifier (NBC) aims to calculate the probabilities of the attributes y_j given a set x = (x_1, x_2, ..., x_i, ..., x_n) of observed attributes. Both are assumed to be discrete (for now, we only consider binary inputs). The main assumption in this case is the Independence Assumption, which states that the attributes x_i are independent, P(x_1, ..., x_n) = ∏_{i=1}^{n} P(x_i), and conditionally independent given y_j, P(x_1, ..., x_n | y_j) = ∏_{i=1}^{n} P(x_i | y_j).
  • 20. Bayes’ theorem is given by the following equation for two random variables x and y: P(y|x) = P(x|y)P(y) / P(x) (2.1). Using this and the Independence Assumption, we can calculate the conditional probability π_j of the attribute y_j given the observed attributes x_i: π_j = P(y_j|x) = P(x|y_j)P(y_j) / P(x) = P(y_j) ∏_{i=1}^{n} P(x_i|y_j) / P(x_i) = P(y_j) ∏_{i=1}^{n} P(x_i, y_j) / (P(x_i)P(y_j)). Now, we assume that we only have partial knowledge of the attributes x_i. We are given completely known observations x_i when i ∈ A ⊆ {1, ..., n} and have no information at all about the attributes x_k when k ∈ {1, ..., n} \ A. Then we get π_j = P(y_j | x_i, i ∈ A) = P(y_j) ∏_{i∈A} P(x_i, y_j) / (P(x_i)P(y_j)). Taking the logarithm of the last expression, we obtain: log(π_j) = log(P(y_j)) + Σ_{i∈A} log[P(x_i, y_j) / (P(x_i)P(y_j))] = log(P(y_j)) + Σ_{i=1}^{n} o_i log[P(x_i, y_j) / (P(x_i)P(y_j))] (2.2), where the indicator variable o_i equals 1 if i ∈ A (which means that the ith attribute x_i is known) and equals 0 otherwise. We finally end up with the following equation: log(π_j) = β_j + Σ_{i=1}^{n} ω_ij o_i (2.3), with ω_ij = log[P(y_j, x_i) / (P(y_j)P(x_i))] and β_j = log(P(y_j)) (2.4). This can be implemented as a single-layer feedforward neural network, with input layer activations o_i, weights ω_ij and biases β_j. In this way, the single-layer feedforward neural network calculates the posterior probabilities π_j given the input attributes, using an exponential transfer function. The weights and biases given in equation 2.4 are called Bayesian weights. We can point out the Hebbian character of these weights: ω_ij ∼ 0 when x_i and y_j are independent (weak connection between independent units); ω_ij ∼ log(1/p) > 0 when the units x_i and y_j are strongly correlated, since in this case P(x_i, y_j) ≈ P(x_i) ≈ P(y_j) ∼ p > 0 (strong connection between simultaneously active units); and ω_ij → −∞ when they are anti-correlated, because in this case P(x_i, y_j) → 0 (strong inhibitory connection between anti-correlated units).
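The Bayesian weights and biases of equation 2.4 can be estimated directly from relative frequencies in a training set. The following sketch illustrates this for the auto-associative case (a minimal illustration, not code from the thesis; `bayesian_parameters` is a name chosen here, and the example assumes every unit and every pair of units is active at least once so no logarithm of zero occurs):

```python
import math

def bayesian_parameters(patterns):
    """Bayesian biases beta_i = log P(x_i) and weights
    w_ij = log(P(x_i, x_j) / (P(x_i) P(x_j))) (cf. eq. 2.4),
    with probabilities estimated as relative frequencies."""
    Z, n = len(patterns), len(patterns[0])
    p = [sum(pat[i] for pat in patterns) / Z for i in range(n)]
    beta = [math.log(p[i]) for i in range(n)]
    w = [[math.log((sum(pat[i] * pat[j] for pat in patterns) / Z)
                   / (p[i] * p[j]))
          for j in range(n)] for i in range(n)]
    return beta, w

pats = [(1, 1, 0), (1, 1, 0), (1, 1, 1), (0, 0, 1)]
beta, w = bayesian_parameters(pats)
```

In this toy training set, units 0 and 1 are always co-active, so w[0][1] comes out positive (excitatory), while units 0 and 2 rarely co-occur, so w[0][2] comes out negative (inhibitory), illustrating the Hebbian character described above.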
The bias term β_i gives a measure of the intrinsic excitability of the unit x_i, as we shall see later in detail. We observe that β_i → 0 when p_i → 1, so that the bias term
  • 21. has no effect on computation when unit x_i is strongly activated, and β_i → −∞ when p_i → 0, thus muting the information carried by unit x_i when it has seldom been activated. This process is democratic in the sense that it gives more importance to the units that have ‘a lot to say’ and shuts off the ones not taking part in pattern activation, considered irrelevant for learning and inference. 2.3.2 Higher Order Bayesian Model The problem encountered in practical applications is that the Independence Assumption is often violated, because it is too restrictive. The standard way to deal with this, as when facing a non-linearly separable training set, is to introduce a hidden layer with an internal representation in which classes are separable. Here, we use a structure of the hidden layer consisting of feature detectors organized in hypercolumns. Starting from the previous model, we assume independence between all attributes and conditional independence given y_j: P(x_1, ..., x_n) = ∏_{i=1}^{n} P(x_i) and P(x_1, ..., x_n | y_j) = ∏_{i=1}^{n} P(x_i | y_j). However, if two variables x_i and x_j are found not to be independent, they can be merged into a joint variable x_ij, giving P(x_1, ..., x_n) = P(x_1)...P(x_ij)...P(x_n), and a similar method may be used for the conditional probabilities. This means that in the network we get one unit for each combination of outcomes of the original variables x_i and x_j. For example, if two groups of units corresponding to primary features A = {a, ā} and B = {b, b̄} are not independent, we insert in their place a group of complex units AB = {ab, ab̄, āb, āb̄} making up a composite feature. The hypercolumn structure formed produces a decorrelated representation, where the Bayesian model is applicable. We note that all the formulae above are unchanged. We have just introduced a hidden layer that increases internal computation, but the external environment is unchanged.
The structure of our network now resembles the structure in figure 2.2. This process relies on a measure of independence of the attributes x_i of an input pattern x. A partially heuristic method (Lansner and Holst 1996) [23] is to merge two columns if a measure of correlation between them (such as the mutual information) is high: I_ij = Σ_{x_i∈X_i, x_j∈X_j} P(x_i, x_j) log[P(x_i, x_j) / (P(x_i)P(x_j))] (2.5). A major drawback of this method is that the number of units increases exponentially with their order, i.e. how many input attributes they combine (Lansner and Holst 1996, Holst 1997) [23, 15].
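The mutual-information criterion of equation 2.5 is straightforward to estimate from co-occurrence counts. Below is a small sketch for two binary attributes (an illustration only; the function name and the frequency-based estimation are choices made here, not taken from the thesis):

```python
import math

def mutual_information(pairs):
    """I_ij = sum over joint outcomes of P(a,b) log(P(a,b) / (P(a)P(b)))
    (cf. eq. 2.5), estimated from a list of observed (a, b) pairs."""
    N = len(pairs)
    pa = {v: sum(1 for a, _ in pairs if a == v) / N for v in (0, 1)}
    pb = {v: sum(1 for _, b in pairs if b == v) / N for v in (0, 1)}
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            pab = sum(1 for x in pairs if x == (a, b)) / N
            if pab > 0:                       # 0 * log 0 contributes nothing
                mi += pab * math.log(pab / (pa[a] * pb[b]))
    return mi

mi_corr = mutual_information([(0, 0), (1, 1)] * 10)          # fully correlated: log 2
mi_ind = mutual_information([(0, 0), (0, 1), (1, 0), (1, 1)] * 5)  # independent: 0
```

Two fully correlated attributes yield I_ij = log 2 and would be merged into one hypercolumn, whereas independent attributes yield I_ij ≈ 0 and would be left separate.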
  • 22. Figure 2.2: Architecture of the BCPNN with a hidden unit for internal decorrelated representation 2.3.3 Graded units Thus far, we have only considered binary inputs. However, it is valuable too that the network handles graded input: for instance, if an attribute is unknown or its value uncertain, graded input would then be a ‘confidence’ value between 0 (no) and 1 (yes). This cannot be coded directly as a graded input activity between zero and one, because this would be interpreted as a probability in the BCPNN. Thus we use a kind of soft interval coding to a set of graded values. Suppose that each attribute i can take M_i different values; x_{ii′} would then be a binary variable describing the probability for the ith attribute to take the i′th value: {x_{ii′} = 1 ⇔ x_i = i′}. Making the necessary relabellings in the previous formulae, we get π_{jj′} = P(y_{jj′} | x_{ik}) = P(y_{jj′}) ∏_{i∈A} P(x_{ik}, y_{jj′}) / (P(x_{ik})P(y_{jj′})), where for each attribute i ∈ {1, ..., n} a unique value x_{ik} is known, with k ∈ {1, ..., M_i}. Similarly, it follows that π_{jj′} = P(y_{jj′}) ∏_{i=1}^{n} ∏_{i′=1}^{M_i} [P(x_{ii′}, y_{jj′}) / (P(x_{ii′})P(y_{jj′}))]^{o_{ii′}}, with indicators o_{ii′} = 1 if i′ = k and zero otherwise. o_{ii′} can be seen as a degenerate probability o_{X_i}(x_{ii′}) = δ_{x_{ik}}(x_{ii′}) = P_{X_i}(x_{ii′}) of the stochastic variable X_i, which is zero for all x_{ii′} except for the known value x_{ik} (Sandberg et al. 2002) [31].
  • 23. Taking the logarithm of the previous expression leads to log(π_{jj′}) = log(P(y_{jj′})) + Σ_{i=1}^{n} log[ Σ_{i′=1}^{M_i} (P(x_{ii′}, y_{jj′}) / (P(x_{ii′})P(y_{jj′}))) o_{ii′} ] (2.6). The corresponding network now has a modular structure. The units ii′ in the network, where i′ ∈ {1, ..., M_i}, explicitly representing the values x_{ii′} of X_i, may be viewed as a hypercolumn as discussed above. By definition, the units of a hypercolumn i have a normalized total activity Σ_{i′=1}^{M_i} o_{ii′} = 1 (the variable x_i can only have one value k at a time). Transforming these equations to the network setting yields h_{jj′} = β_{jj′} + Σ_{i=1}^{n} log[ Σ_{i′=1}^{M_i} ω_{ii′jj′} o_{ii′} ] (2.7), with ω_{ii′jj′} = P(y_{jj′}, x_{ii′}) / (P(y_{jj′})P(x_{ii′})) and β_{jj′} = log(P(y_{jj′})) (2.8), where h_{jj′} is the support of unit jj′, β_{jj′} is the bias term and ω_{ii′jj′} is the weight. π_{jj′} = f(h_{jj′}) = exp(h_{jj′}) can be identified as the output of unit jj′, representing the confidence (heuristic or approximate probability) that attribute j has value j′ given the current context. We also need to normalize the output within each hypercolumn: π̂_{jj′} = f(h_{jj′}) = exp(h_{jj′}) / Σ_{j″} exp(h_{jj″}). Figure 2.3: Architecture of the BCPNN with a hidden unit and an additive summation layer for graded input handling
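The per-hypercolumn normalization just described is a softmax over the supports of the column's units. A minimal sketch (the max-subtraction is a standard numerical-stability trick added here, not something the thesis specifies):

```python
import math

def hypercolumn_output(h):
    """Normalized output within one hypercolumn:
    pi_j' = exp(h_j') / sum_k exp(h_k) over the column's supports."""
    m = max(h)                      # subtract the max for numerical stability
    e = [math.exp(v - m) for v in h]
    s = sum(e)
    return [v / s for v in e]

out = hypercolumn_output([-1.2, 0.3, -0.5])
```

By construction the activities within the column sum to one, matching the constraint Σ_{i′} o_{ii′} = 1 stated above.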
  • 24. Figure 2.3 shows a ‘pi-sigma network’, able to handle graded input. The notion of a support unit is used to update the units simultaneously rather than one by one: calculations are first stored in the support units for all units, and the transfer function is then used to update all the units at once. 2.3.4 Recurrent Network Now, because both the input o_{ii′} and the output π̂_{jj′} of the network represent probabilities, we can feed the output back into the network as input, creating a fully recurrent network architecture, which can work as an auto-associative memory. The currently observed probability o_{ii′} = P_{X_i}(x_{ii′}) is used as an initial approximation of the true probability of X_{ii′} and used to calculate a posterior probability, using the learning parameters β_{jj′} and ω_{ii′jj′}, which tends to be a better approximation. This is then fed back and the process is iterated until a consistent state is reached, which is guaranteed because the weight matrix is symmetric. The reader should note that we have now incorporated the y_{jj′} among the x_{ii′}, thus dropping the notions of input and output units. In the recurrent network, activations can be updated either discretely or continuously. In the discrete case, π̂_{jj′}(t + 1) is calculated from π̂_{ii′}(t), or equivalently, h_{jj′}(t + 1) from h_{ii′}(t), using one iteration of the update rule h_{jj′}(t + 1) = β_{jj′} + Σ_{i=1}^{n} log[ Σ_{i′=1}^{M_i} ω_{ii′jj′} f(h_{ii′}(t)) ] (2.9). In the continuous case, h_{jj′}(t) is updated according to a differential equation, making the approach towards an attractor state continuous: τ_c dh_{jj′}/dt = β_{jj′} + Σ_{i=1}^{n} log[ Σ_{i′=1}^{M_i} ω_{ii′jj′} f(h_{ii′}(t)) ] − h_{jj′}(t) (2.10), where τ_c is the ‘membrane time constant’ of each unit. Input to the network is introduced by clamping the activation of the relevant units (representing known events or attributes). As the network is updated, the activation spreads, creating the a posteriori beliefs of the other attribute values.
2.4 BCPNN Learning Implementations 2.4.1 Counter Model This model has been developed and described previously (Lansner and Ekeberg 1989) [22]. The purpose is to collect statistics of unit activity and of co-activity of pairs of units, in order to estimate the probabilities P(x_i) and joint probabilities P(x_i, x_j) used to calculate the W and β_j values. An input pattern consists of a stimulus strength in the range [0, 1] for each unit in the network. Here, the network is entirely ‘stimulus-driven’ during learning; otherwise the network would first interpret the input and
  • 25. then learn its own interpretation, which is to be avoided. This allows a reduction in computing time during learning, because no time is used to infer from the data (no internal computation). The basic idea behind the counter model is to estimate the probabilities P(x_i), P(x_j) and P(x_i, x_j) by counting occurrences and co-occurrences in the training set. With an estimate of the form p = c/Z, we obtain β_i = log(P(x_i)) = log(c_i/Z) and ω_ij = log[P(x_i, x_j) / (P(x_i)P(x_j))] = log[c_ij Z / (c_i c_j)] (2.11), where Z = Σ_α κ(α), c_i = Σ_α κ(α) π_i, c_ij = Σ_α κ(α) π_i π_j (2.12). Here, π_i is the output of unit i, α is an index over the patterns in the training set, and κ is the significance attributed to a certain learning event. It provides a mechanism for over-representing subjectively important learning examples and ignoring unimportant ones. This technique is similar to boosting used in classification, which is the over-representation of hard examples in order to increase the accuracy of the classifier. Special care has to be taken when counters come out as zero. In the case when c_i or c_j is zero, w_ij is also set to zero. If c_i and c_j are both non-zero but c_ij is zero, w_ij is set to a large negative value, log(1/Z). The same holds for β_i when c_i is zero. The counter model provides a simple and fast implementation of BCPNN learning, but when the maximum capacity of the network is reached, catastrophic forgetting occurs (i.e. all memories are lost when the system is over-loaded). 2.4.2 Incremental Learning In order to avoid catastrophic forgetting, incremental learning using exponentially weighted running averages has been implemented (Sandberg et al. 2002, Sandberg et al. 2003) [31, 32]. The idea is to introduce intrinsic weight decay (forgetting) in the network, so that the system never becomes over-loaded. A time constant α is used to control the time-scale of this weight decay, allowing for short-term working memory behaviour as well as for long-term memory.
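Returning to the counter model of Section 2.4.1, the estimates of equations 2.11-2.12 together with the zero-count handling can be sketched as follows (a minimal illustration under the assumption of binary patterns; `counter_weights` is a name chosen here, not from the thesis):

```python
import math

def counter_weights(patterns, kappa=None):
    """Counter-model estimates (cf. eqs. 2.11-2.12):
    Z = sum_a kappa(a), c_i = sum_a kappa(a) pi_i, c_ij = sum_a kappa(a) pi_i pi_j,
    then beta_i = log(c_i / Z) and w_ij = log(c_ij Z / (c_i c_j)),
    with the zero-count handling described in the text."""
    if kappa is None:
        kappa = [1.0] * len(patterns)           # every pattern equally significant
    n = len(patterns[0])
    Z = sum(kappa)
    c = [sum(k * pat[i] for k, pat in zip(kappa, patterns)) for i in range(n)]
    cij = [[sum(k * pat[i] * pat[j] for k, pat in zip(kappa, patterns))
            for j in range(n)] for i in range(n)]
    beta = [math.log(c[i] / Z) if c[i] > 0 else math.log(1 / Z) for i in range(n)]
    w = [[0.0 if c[i] == 0 or c[j] == 0             # no statistics: no influence
          else math.log(1 / Z) if cij[i][j] == 0    # never co-active: strong inhibition
          else math.log(cij[i][j] * Z / (c[i] * c[j]))
          for j in range(n)] for i in range(n)]
    return beta, w

beta, w = counter_weights([(1, 1, 0, 0), (1, 0, 1, 0), (0, 1, 1, 0)])
```

Unit 3 is never active in this toy set, so its bias falls back to log(1/Z) and its weights to zero, exactly the special cases described above.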
A continuously operating network will need to learn incrementally during operation. In order to achieve this, P(x_{ii′})(t) and P(x_{ii′}, x_{jj′})(t) need to be estimated given the information {x(t′), t′ < t}. The estimate should have the following properties: 1. It should converge towards P(x_{ii′})(t) and P(x_{ii′}, x_{jj′})(t) in a stationary environment. 2. It should give more weight to recent than to remote information. 3. It should smooth or filter out noise and adapt to longer trends, in other words to lower-frequency components of a non-stationary environment.
  • 26. (1) is the prime constraint. Our estimate needs to converge to these probabilities because they are needed to compute the Bayesian weights and biases. (2) makes the model operate as a ‘palimpsest memory’, meaning that recent memories constantly overwrite old ones. Thus a pattern has to be reviewed in order not to be forgotten. (3) is a stability constraint in a non-stationary environment. The low-pass filtering operation is investigated again in Chapter 3. The incremental Bayesian learning rule proposed here achieves this by approximating P(x_{ii′})(t) and P(x_{ii′}, x_{jj′})(t) with the exponentially smoothed running averages Λ_{ii′} of the activity π̂_{ii′} and Λ_{ii′jj′} of the coincident activity π̂_{ii′}π̂_{jj′}. The continuous-time version of the update and learning rule takes the following form: τ_c dh_{ii′}(t)/dt = β_{ii′} + Σ_{j=1}^{n} log[ Σ_{j′=1}^{M_j} ω_{ii′jj′}(t) π̂_{jj′}(t) ] − h_{ii′}(t) (2.13); π̂_{ii′}(t) = exp(h_{ii′}) / Σ_{i″} exp(h_{ii″}) (2.14); dΛ_{ii′}(t)/dt = α([(1 − λ_0) π̂_{ii′}(t) + λ_0] − Λ_{ii′}(t)) (2.15); dΛ_{ii′jj′}(t)/dt = α([(1 − λ_0²) π̂_{ii′}(t) π̂_{jj′}(t) + λ_0²] − Λ_{ii′jj′}(t)) (2.16); ω_{ii′jj′}(t) = Λ_{ii′jj′}(t) / (Λ_{ii′}(t)Λ_{jj′}(t)) (2.17); β_{ii′}(t) = log(Λ_{ii′}(t)) (2.18). The above probability estimates converge towards the correct values given stationary inputs for sufficiently large time constants. Since the weights of the network depend more on recent than on old data, it appears likely that a Hopfield-like network with the above learning rule would exhibit palimpsest properties. Special care has to be taken to avoid logarithms of zero values (see Sandberg et al. 2002) [31]. In addition, the parameter α provides a means to control the temporal dynamics of the learning phase (from short-term working memory to long-term memory). It also allows us to switch off learning when the network needs to be used in retrieval mode, allowing for changes in the network activity without corresponding weight changes, because when α = 0 the running averages ‘freeze’ to their current values.
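The running-average equations 2.15-2.18 can be sketched with a simple forward-Euler integration (an illustrative sketch under assumptions made here: the Euler scheme, the parameter values, and the function names are not from the thesis):

```python
import math

def update_traces(L_i, L_j, L_ij, pi_i, pi_j, alpha, dt, lam0=0.01):
    """One Euler step of the running averages (cf. eqs. 2.15-2.16);
    lam0 keeps the estimates away from zero so the logarithms stay finite."""
    L_i  += dt * alpha * (((1 - lam0) * pi_i + lam0) - L_i)
    L_j  += dt * alpha * (((1 - lam0) * pi_j + lam0) - L_j)
    L_ij += dt * alpha * (((1 - lam0 ** 2) * pi_i * pi_j + lam0 ** 2) - L_ij)
    return L_i, L_j, L_ij

def weight_and_bias(L_i, L_j, L_ij):
    """Eqs. 2.17-2.18: w_ij = L_ij / (L_i L_j), beta_i = log(L_i)."""
    return L_ij / (L_i * L_j), math.log(L_i)

# Two units that are always active together (and silent together):
L_i = L_j = L_ij = 0.5
for step in range(20000):
    pi = 1.0 if (step // 100) % 2 == 0 else 0.0   # co-active half the time
    L_i, L_j, L_ij = update_traces(L_i, L_j, L_ij, pi, pi, alpha=0.01, dt=1.0)
w, b = weight_and_bias(L_i, L_j, L_ij)
```

Because the two units are perfectly correlated, the co-activity average exceeds the product of the individual averages and the resulting weight w is greater than one, i.e. an excitatory Bayesian weight.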
2.5 Performance Evaluation and Applications Performance Evaluation There are many criteria available to evaluate the performance of a model. Of course, no model is better than the others on every level, nor is it designed for every purpose. Nevertheless, in order to be accepted and developed in the future, a
  • 27. model needs to exhibit some basic features: robustness, reasonable execution time and stability are required to consider a model efficient. Here, we present the main criteria we use to evaluate the BCPNN model. Frequency of correct retrieval This is the most widely used criterion to evaluate the performance of the network. Feeding a list of input patterns to the network, we want to know how well the network learns them, by counting the occurrences of successfully completed patterns after learning. An important parameter is the age of the pattern, because recent patterns tend to be retrieved more accurately than old ones. The number of patterns, their complexity and their time of presentation are to be taken into account too. Storage Capacity The storage capacity is the number of patterns that a network can store. The capacity of the Hopfield network has been investigated (Hopfield 1982) [16]. In our counter model, the capacity is fixed, so it is susceptible to catastrophic forgetting, whereas the incremental learner has a capacity dependent on its spontaneous forgetting (short-term memories with fast weight-decay dynamics are protected from catastrophic forgetting because capacity is hardly ever reached, whereas long-term memories are more exposed to it). Noise Tolerance In reality, patterns fed to the network are always a little noisy, and it is important that the attractor dynamics of the network overcome this. To test this, we feed distorted patterns to the network and count the frequency of retrieval of the original ones. A special case is that of competing interpretations, when a mixture of two stored patterns is fed to the network. Convergence speed The convergence speed for relaxation of the network is also an important trait of our model.
Inference has to be fast enough so that testing patterns does not take too long and, on the other hand, it has to use small enough timesteps so as not to skip any attractor state with a narrow convergence domain. Convergence time increases substantially for distorted and ambiguous patterns, because they are distant from stable attractors in the attractor space (Lansner and Ekeberg 1989) [22]. Applications The domain of applications of the Bayesian Confidence Propagation Neural Network is wide. Because of its statistically-based method of unsupervised learning, it can be implemented in a series of different contexts. We present some of its applications here. Classification The BCPNN is first designed to evaluate probabilities from a set of observed features or attributes, so it is natural that the BCPNN is used for classification tasks, which aim to label a pattern and assign it to a corresponding class. The network architecture in this case is single- or multi-layered, depending on the complexity of the data set. The input units
  • 28. correspond to the attributes, and the output units to the class units. The use of the BCPNN for classification has been investigated exhaustively (Holst 1997) [15]. Content-addressable memory When used in a recurrent network architecture, the BCPNN model performs quite well as a content-addressable memory. It takes into account the statistical properties of the data and performs better with patterns whose attributes can be considered independent, like pixel grey-levels in an image, letters in a list of words or digits in a list of numbers. The capacity has to be large enough to avoid memory overloading. Because of its associative character, BCPNN memory networks can perform pattern completion (restoring a pattern from only a sample of it) and pattern rivalry (deciding between ambiguous patterns or a mixture of two existing ones). A good example of pattern rivalry is found in optical illusions and ambiguous images. Pharmacovigilance and Data Mining The BCPNN has been used for highlighting drug-ADR pairs for clinical review in the WHO ADR database as part of the routine signal detection process (Bate et al. 1998, Lindquist et al. 2000). The recurrent BCPNN has also been implemented as a tool for unsupervised pattern recognition; it has been tested on theoretical data and shown effective in finding known syndromes in all haloperidol-reported data in the WHO database (Bate et al. 2001, Orre et al. 2003). More recently, Ahmed et al. revisited Bayesian pharmacovigilance signal detection methods in a multiple-comparison setting (Ahmed et al. 2009).
  • 29. Chapter 3 A spiking BCPNN Learning Rule In this chapter, we introduce the new ‘spiking’ version of the BCPNN learning rule. We give its mathematical formulation and discuss its specific features and how they account for biologically observed phenomena. In order to have a mapping from the original BCPNN learning rule to the spiking version of it, we need to match one descriptor of the activity of the biological neurons to the input and output of the abstract units. The most natural choice seems to be the frequency or firing rate of one neuron. Thus the range [0, 1] of the units in the non-spiking network will be mapped to a range [0, f_max], where f_max represents the maximum firing frequency of one neuron. 3.1 Formulation The version of the learning rule that we are going to implement in a spiking neuron context has the following form: dz_i/dt = (y_i − z_i)/τ_i, z_i(0) = 1/M_i (3.1); dz_j/dt = (y_j − z_j)/τ_j, z_j(0) = 1/M_j (3.2). In this first stage of processing (equations 3.1 and 3.2), we filter the presynaptic and postsynaptic variables y_i and y_j, which exhibit a ‘spiking-binary’ behaviour most of the time, with low-pass filters of respective time constants τ_i and τ_j (note that they can be different). The resulting variables z_i and z_j are called the primary synaptic traces. M_i and M_j are the numbers of units in the pre- and post-hypercolumn respectively, and are only used in a network context. In single-synapse
  • 30. learning, we set M_i = M_j = 10. The typical range of τ_i and τ_j is 5 to 20 ms. de_i/dt = (z_i − e_i)/τ_e, e_i(0) = 1/M_i (3.3); de_j/dt = (z_j − e_j)/τ_e, e_j(0) = 1/M_j (3.4); de_ij/dt = (z_i z_j − e_ij)/τ_e, e_ij(0) = 1/(M_i M_j) (3.5). In the second stage of processing (equations 3.3, 3.4 and 3.5), we filter the primary traces z_i and z_j with a low-pass filter of time constant τ_e (note that it is the same for the three equations). The typical range of τ_e is 100 to 1,000 ms. The resulting variables e_i, e_j and e_ij are called the secondary synaptic traces. We note the introduction of a secondary mutual trace e_ij, which keeps a trace of the mutual activity of y_i and y_j and will later be used to compute P(x_i, x_j). Note that a mutual trace is impossible to obtain at the first stage of processing, since the direct product y_i y_j is zero most of the time. This is because y_i and y_j are ‘spiking’ variables and thus equal zero except on the occurrence of a spike, so y_i y_j would be non-zero only when y_i and y_j spike at the exact same time, which almost never happens. dp_i/dt = κ(e_i − p_i)/τ_p, p_i(0) = 1/M_i (3.6); dp_j/dt = κ(e_j − p_j)/τ_p, p_j(0) = 1/M_j (3.7); dp_ij/dt = κ(e_ij − p_ij)/τ_p, p_ij(0) = 1/(M_i M_j) (3.8). In the third and last stage of processing (equations 3.6, 3.7 and 3.8), we filter the secondary traces e_i, e_j and e_ij with a low-pass filter of time constant τ_p (note that it is the same for the three equations). The typical range of τ_p is 1,000 to 10,000 ms. The resulting variables p_i, p_j and p_ij are called the tertiary synaptic traces. We also note the presence of a mutual tertiary trace p_ij that is a direct approximation of P(x_i, x_j). β_i = log(ε) if p_i < ε, log(p_i) otherwise (3.9); ω_ij = log(ε) if p_ij/(p_i p_j) < ε, log(p_ij/(p_i p_j)) otherwise (3.10). The equations for updating the weights and biases (equations 3.9 and 3.10) are the classical Bayesian weight and bias equations. Note that these equations change a little in the case of ‘pi-sigma’ higher-order networks with graded input (equations 2.7 and 2.8).
Because we deal only with binary input, we keep these equations unchanged. When pi takes a small value it is set to a minimum value ε in order to avoid a logarithm of zero. The same is done when \( \frac{p_{ij}}{p_i p_j} \) becomes 20
  • 31. 3.2. FEATURES too small. We note the presence of the parameter κ. It is a global ‘print-now’ signal that regulates the update of the tertiary traces, while leaving the internal structure of the network (primary and secondary traces) unchanged. We will explain its function in further detail later. The spiking version of the BCPNN learning rule is the set of these 10 equations. It relies on 3 stages of processing that perform the same operation (low-pass filtering) with different temporal dynamics. The parameters that can be controlled are the time constants τ_i, τ_j, τ_e and τ_p, the initial values of the traces and the print-now signal κ. 3.2 Features 3.2.1 Synaptic traces as local state variables The implementation of local synaptic state variables such as synaptic traces in the above learning rule is a common approach in STDP learning rules [27, 25]. These variables are used to keep a trace or memory of presynaptic or postsynaptic events such as the occurrence of a spike. In addition, low-pass filtering enables us to manipulate continuous variables rather than ‘spiking’ variables, which is problematic when we want to estimate, for example, a joint probability P(x_i, x_j), since the direct product of two spiking variables is likely to be zero, due to the ‘impulse’ nature of a spike. Indeed, a spike has a very short duration and is often described as a discontinuous variable that is non-zero only on the occurrence of a spike. Scaling these variables between 0 and 1 is very useful because it makes their quantitative use easier. One can deal with different types of synaptic traces. Additive trace The additive trace updates the local state variable x(t) by a constant value A. The particularity of this trace is that it can exceed 1 when many events occur in a short time. It is implemented by the following equation

\[ \frac{dx}{dt} = -\frac{x}{\tau} + \sum_{t_s} A \, \delta(t - t_s) \]

where t_s denotes the time of occurrence of a spike.
Saturated trace The saturated trace resets the local state variable x(t) to a constant value A (here equal to 1). This trace is always in the range [0, 1] and it keeps only the history of the most recent spike, because it is invariably reset to 1 on the occurrence of a spike. It is implemented by the following equation

\[ \frac{dx}{dt} = -\frac{x}{\tau} + \sum_{t_s} (1 - x^-) \, \delta(t - t_s) \]

where t_s denotes the time of occurrence of a spike and x^- is the value of x just before the occurrence of the spike. 21
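A minimal event-driven sketch of the two traces described above (in Python rather than the thesis's MATLAB; names and the value A = 0.3 are our assumptions): between spikes a trace decays exponentially, and on each spike it is updated according to its type.

```python
import math

def decay(x, dt, tau):
    """Exponential decay between spikes: dx/dt = -x/tau."""
    return x * math.exp(-dt / tau)

def additive_update(x, A=0.3):
    """Additive trace: jump by a constant A (the trace may exceed 1)."""
    return x + A

def saturated_update(x):
    """Saturated trace: reset to 1 on each spike, i.e. x + (1 - x)."""
    return 1.0

# Drive both traces with ten spikes 5 ms apart (tau = 20 ms)
spikes = [10.0 + 5.0 * k for k in range(10)]
xa = xs = 0.0
t_prev = 0.0
for ts in spikes:
    xa = additive_update(decay(xa, ts - t_prev, 20.0))
    xs = saturated_update(decay(xs, ts - t_prev, 20.0))
    t_prev = ts
```

With this burst, the additive trace climbs above 1 while the saturated trace is pinned at 1 after every spike, which is exactly the qualitative difference shown in Figure 3.1.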
  • 32. CHAPTER 3. A SPIKING BCPNN LEARNING RULE Proportional trace Here, the local state variable x(t) is updated by a value proportional to its deviation from 1. This trace is always in the range [0, 1] and it realizes a synthesis of the effects of the two traces above. It keeps the value of x(t) close to 1 when many spikes occur in a short time, and the time of the last spike is easy to evaluate by looking at the exponential decay at time t. The proportional trace is the one we use later. It is implemented by the following equation

\[ \frac{dx}{dt} = -\frac{x}{\tau} + \sum_{t_s} k \, (1 - x^-) \, \delta(t - t_s) \]

with t_s and x^- as described above, and k the proportion of update. Typically we use k ∈ [0.5, 0.8]. Figure 3.1 shows the dynamics of the 3 different synaptic trace types. Figure 3.1: Different types of synaptic traces - The upper figure corresponds to a spike train and the lower figure displays the three different synaptic traces: the black, blue and red curves correspond respectively to the additive, saturated and proportional traces. 3.2.2 Spike-timing Dependence The first stage of processing of our learning rule (equations 3.1 and 3.2) allows us to create the primary synaptic traces. These variables with very fast dynamics are used as recorders of the spikes: on the occurrence of a spike they are set to a certain value (since we use proportional traces, this value is proportional to the deviation between 1 and the value of the synaptic trace just before the spike) and decay exponentially until another spike occurs. Proportional traces convey two 22
  • 33. 3.2. FEATURES pieces of information: the history of the last spike, by looking at the current decay (if the last spike occurred recently, the trace is steep and decays fast), and the global history of past events (when numerous spikes occur in a short period of time, the trace value comes close to 1). The dynamics of the primary traces z_i and z_j are controlled by the time constants τ_i and τ_j. Since these constants can be different, pre-post timing can be promoted over post-pre timing, and the other way around. For instance, if we set τ_i = 20 ms and τ_j = 1 ms, then z_j will decay much faster than z_i. Then, if a postsynaptic spike occurs 10 ms after a presynaptic spike, the product z_i z_j will be non-zero shortly after the occurrence of the postsynaptic spike. On the other hand, if a presynaptic spike occurs 10 ms after a postsynaptic spike, the product z_i z_j will still be zero because of the fast decay of z_j. By setting τ_j to a small value compared to τ_i, we have given priority to pre-post timing (see figure 3.2). The values of these two time constants define a spike-timing window (see Bi and Poo 1998 [6]), whose width and symmetry can be controlled by manipulating these constants. Figure 3.2: Different effects of pre-post and post-pre timing on the primary synaptic traces - The upper figure corresponds to a regular spike train post-pre-post. Since the primary traces have different time constants (τ_i = 50 ms and τ_j = 5 ms), pre-post timing is promoted over post-pre timing, because the resulting product z_i z_j (not displayed here) is much bigger after regular pre-post timing than after post-pre timing. 3.2.3 Delayed-Reward Learning It can be a little puzzling to realize that our learning rule has three stages of processing of the data while we always perform the same operation (low-pass 23
  • 34. CHAPTER 3. A SPIKING BCPNN LEARNING RULE filtering). However, these three filtering procedures perform three very specific and different tasks. As observed in previous models (Bi and Poo 1998, Rubin et al. 2005, Morrison et al. 2008, Mayr et al. 2009) [6, 30, 27, 25], exact spike timing between presynaptic and postsynaptic spikes plays a crucial role in LTP. Moreover, a time window of 20 ms before and after a postsynaptic spike seems to exist, such that no long-lasting change occurs if the delay between spikes is greater than 20 ms. However, the activity in the network needs to be long-lasting and to reverberate on a much greater time scale. In the context of delayed-reward learning [28] and reinforcement learning, the reward, which triggers the induction of LTP, occurs with a delay on a time scale of hundreds of milliseconds to seconds. Worse, this delay is not predictable, so one cannot know when the reward and the actual learning will take place. In order to solve this problem, we include secondary traces that extend the reverberation of activity in the network. Then, when a spike occurs, activity is recorded in the primary and secondary traces. After a few hundred milliseconds, the activity has disappeared from the primary traces, but is still reverberating in the secondary traces e_i, e_j and e_ij (equations 3.3, 3.4 and 3.5). Thus, if the print-now signal, representing the reward, is set to 1, the secondary traces convey the information and learning can still take place. Figure 3.3: Temporal dynamics of the different synaptic traces - Thin curves correspond to the primary traces, thicker curves to the secondary ones and bold curves to the tertiary traces. Blue corresponds to presynaptic traces, red corresponds to postsynaptic variables and black corresponds to mutual traces - The temporal dynamics are the slowest for the tertiary traces, which build up and decrease slowly.
The combination of these three levels of processing enables us to achieve different goals. It is important to stress that both of these traces are required if we want to account for the following phenomena: the existence of a spike-timing window on the order of tens of milliseconds (about 20 ms for spike delays) outside of which 24
  • 35. 3.2. FEATURES no significant weight change takes place, and the fact that the reward enhancing the learning process comes with a delay on a time scale of hundreds of milliseconds. As we will see later, there are biological equivalents to this print-now signal and these delayed synaptic traces. Figure 3.3 shows the temporal dynamics of the primary, secondary and tertiary traces for a pattern stimulation followed by no activity. 3.2.4 Long-term Memory Finally, the third stage of processing (equations 3.6, 3.7 and 3.8) computes synaptic state variables that have much slower dynamics. Typically, p_i, p_j and p_ij account for long-term memory, meaning that they store events that have been repeated over a set of actions and experiments. We assume that our learning rule operates in the context of delayed-reward learning, and we take the example of an animal, a rat for instance, being offered several buttons to open doors, behind which some food (reward) is present or absent. The primary traces, with their fast dynamics, record the precise spike timing when activity spreads in the network as a consequence of taking actions (stimulus, button pressing). The secondary traces account for the delayed delivery of the reward, which comes as a delayed result of action-taking. If the rat accesses the reward, then the ‘print-now’ signal is set to 1 and long-term memory is triggered. The tertiary traces are activated when the delayed reward has been obtained several times and the stimulus has been reinforced. This means that p_i, p_j and p_ij build up when the activities of the secondary traces have repeatedly been above a certain baseline. Then, reinforcement occurs and memories can be stored. It is singular, however, that the print-now signal κ shows up at this stage of processing.
It could have appeared similarly in equations 3.3, 3.4 and 3.5, but the biological equivalent of the print-now signal suggests that the metabolic changes occur even if it is not activated, whereas the weights are overwritten only if the print-now signal is active. Thus, it makes more sense for it to appear right before the weight update. 3.2.5 Probabilistic features It is important to keep in mind that our spiking version of the BCPNN learning rule is not just another implementation of an STDP pair-based learning rule. Indeed, the state variables that we compute represent probabilities, and their values have an intrinsic meaning of their own. This is the main reason why feeding graded input to the network is not trivial: the network interprets activities as probabilities. As discussed previously, the input to the units represents the confidence of feature detection and the output represents the posterior probability of the outcome. In the original counter model, P(x_i) and P(x_i, x_j) were quite easy to approximate by counting occurrences and co-occurrences of the features within the training set. Due to the spiking structure of the input variables y_i and y_j, it is a bit trickier 25
  • 36. CHAPTER 3. A SPIKING BCPNN LEARNING RULE to evaluate the probabilities P(x_i) and P(x_i, x_j). The use of synaptic traces allows us to create mutual traces e_ij and p_ij that convey information about the correlation between spikes. 3.3 Biological relevance This new version of the BCPNN learning rule shows biological relevance on different levels. The first is the use of synaptic traces, which are thought to have a direct biological meaning. For instance, when a presynaptic spike arrives at a synapse, there is a quantified release of neurotransmitters. According to the nature of the synapse, the additive trace or the saturated trace might be used: the first when the amount of transmitter is small compared to the synapse size, so that the occurrence of a new spike has an additive effect because enough free receptors are available for synaptic transmission, and the second when the quantity of neurotransmitter released reaches the maximum capacity of the synapse, which means that the synapse saturates all of its available receptors on the occurrence of each presynaptic spike. Another direct equivalent is the ‘print-now’ signal, which can be seen as the concentration of a memory modulator such as dopamine, which is thought to have a direct enhancing effect on learning and memory when present in high quantities. The delayed-reward mechanism indeed has a direct biological relevance and has been observed experimentally (Potjans et al. 2009) [28]. As explained before, the mixture of variables with slow and fast temporal dynamics makes sense and fits what has been observed. The concentration of calcium ions at the postsynaptic site is thought to play a key role in synaptic plasticity [30], with much faster dynamics than the protein synthesis governing the transition from early-LTP to late-LTP [9]. Clopath et al. [9] present a model to account for the transition from early to late LTP, containing three different phases of Tag-Trigger-Consolidation.
A synapse can be in one of the three following states: untagged, tagged for LTP (high state) or tagged for LTD (low state), depending on presynaptic and postsynaptic events. If the total number of tagged synapses exceeds a threshold, a trigger process occurs and opens up for consolidation (long-lasting changes in synaptic efficacy). What is similar in our model is the three different temporal dynamics. The secondary mutual trace e_ij can be seen as an equivalent of a tagging procedure: if its value stays above a threshold for a long enough time, then metabolic changes, such as specific protein synthesis, occur, allowing for conversion from working memory to long-term permanent memory. 26
  • 37. Chapter 4 Abstract Units Implementation In the next two chapters, we present different implementations of the spiking version of the BCPNN learning rule presented previously. The first implementation consists of abstract units in MATLAB and serves as a gateway towards spiking neuron models in NEURON. For each model, we explain how we present patterns to the cells, implement the learning rule and use the model in retrieval mode. Due to its ability to handle vectors and matrices, MATLAB serves as a convenient computational tool for building artificial neural networks. Its built-in functions allow a great variety of 2D and 3D graphic displays. One can also import data computed elsewhere into MATLAB and process it as needed. But MATLAB loses much of its computational power when it has to process data procedurally, which is the case for our differential equations. In our learning rule, we have to update and compute multiple variables at each time step, because we deal with three sets of first-order linear differential equations (equations 3.1 to 3.8). Since these computations cannot be gathered in a matrix and all treated in batch fashion, MATLAB is structurally inefficient for our task. However, we can use it for single-synapse learning (only two units: one presynaptic and one postsynaptic) on reasonable time scales (between 1,000 ms and 10,000 ms) and exploit its graphical display facilities, which is the reason why we first implemented our learning rule in MATLAB. The aim is qualitative: displaying weights and biases corresponding to different input patterns and giving an insight into the synapse's internal dynamics (the time courses of the primary, secondary and tertiary traces). 4.1 Pattern presentation In this section, we explain how we presented patterns to the units; in other words, how input is fed to the network. We have three ways to present patterns: non-spiking, frequency-based spiking and Poisson-generated spiking.
Note that throughout the following chapters we focus on single-synapse learning, meaning that we deal with two units (presynaptic and postsynaptic) connected by 27
  • 38. CHAPTER 4. ABSTRACT UNITS IMPLEMENTATION a single synapse. 4.1.1 Non-spiking Pattern Presentation As a starting point for our investigations, and as a reference for our further results, we test our learning rule by feeding patterns in a process similar to what has been done before with non-spiking units (Sandberg et al. 2002) [31]. To achieve this, we clamp the inputs to the presynaptic and postsynaptic units y_i and y_j to the respective values k_i and k_j during a presentation time of some tens of milliseconds. The values k_i and k_j can take only binary values, or a continuous value in the range [0, 1] (graded input). Patterns are fed to the network sequentially. For instance, if the set of input patterns we want to learn is {(1, 1), (0, 0), (1, 1), (0, 1), (1, 1)}, then y_i will be clamped to the set of values (1, 0, 1, 0, 1) and y_j will be clamped to (1, 0, 1, 1, 1). The input variables y_i and y_j are ‘stepped’ and discontinuous (see Figure 4.1a). Hence, abstract units are artificial, because no biologically observed variable takes constant values or exhibits such a discontinuous time course. The time of presentation is important because it needs to be long enough for the primary traces to retain pattern activities (the longer the pattern is seen, the stronger the memory), but it is also valuable to impose some resting time between patterns. Indeed, during each pattern presentation, the network needs to adapt to it and rearrange its internal structure. In addition, between patterns, it needs to rest for a short while, so that the internal variables with fast dynamics return to their baseline. An analogy is that when we learn different things, we always need some adaptation to jump from one thing to another. We will expand on this in the Discussion section.
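The clamped, stepped driving input described above can be sketched as follows (a hypothetical Python analogue of the thesis's MATLAB routine; the function name mirrors the one used in the thesis, but the signature and defaults are our assumptions):

```python
def generate_pattern_nonspiking(pattern, dur, delay, dt=1.0):
    """Build a stepped driving input x(t): each value in `pattern` is held
    for `dur` ms, followed by `delay` ms of rest at zero (times in ms)."""
    x = []
    for value in pattern:
        x += [value] * int(dur / dt)   # clamp to the pattern value
        x += [0.0] * int(delay / dt)   # resting period between patterns
    return x

# Example: five patterns, 50 ms presentation, 20 ms rest
x = generate_pattern_nonspiking([1, 0, 1, 0, 0.5], dur=50, delay=20)
```

The resulting vector is piecewise constant, which is exactly the ‘stepped’ discontinuous time course of Figure 4.1a.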
On the other hand, when we want to teach a concept to our network through a temporal series of patterns, the time scale of the learning phase needs to be smaller than the dynamics of the long-term memory traces p_i, p_j and p_ij, otherwise the synapse forgets what has been fed to it in the past. If the long-term memory time constant τ_p equals 1 second, then after 5 seconds past events will be discarded. So, in this case, it does not make sense to have a learning procedure that takes longer than 5 seconds. In a nutshell, learning procedures should not outlast the forgetting time of our long-term memory. In MATLAB, the function generate_pattern_nonspiking generates a driving input x(t) from a series of parameters: delay, the resting time between pattern presentations, dur, the duration of presentation of one pattern, T, the length of the output, and pattern, a vector containing the values for the driving input x(t). Figure 4.1a shows the input activity of an abstract unit fed with the pattern x = [1, 0, 1, 0, 0.5, 1, 0.25, 0, 0.75]. 4.1.2 Spiking frequency-based Pattern Presentation Because of its biological irrelevance, the previous pattern presentation scheme is limited. This time, we try to mimic the ‘spiking behaviour’ of membrane voltage 28
  • 39. 4.1. PATTERN PRESENTATION observed in real experiments. Our spike generation in MATLAB is still artificial, but we are making progress in imitating spiking behaviour. So, we build up artificial spiking voltages by setting the input variable y_i to 1 on the occurrence of a spike and to zero otherwise. If t_s denotes the time of occurrence of a spike for unit i, then our input variable y_i can be rewritten

\[ y_i(t) = \sum_{t_s} \delta(t - t_s) \]

Pattern presentation to the input units is now based on their firing frequency rather than on a fixed stepped value. The idea is to realise a linear mapping from a value of x_i between 0 and 1 (representing the confidence of feature detection developed in previous chapters) to a frequency f_i. To achieve this, the value 1 for x_i is mapped to a maximum frequency f_max, and other values between 0 and 1 to a directly proportional value in the range [0, f_max] (i.e. 0.5 is mapped to f_max/2, 0.25 to f_max/4, and so on). By doing this, we have just created an input filter that transcribes the graded input x_i(t) between 0 and 1 into a spiking time-dependent variable y_i(t). We will later refer to the stepped value x_i(t) as the driving input and to y_i(t) as the actual input activity, the first being used only for pattern presentation and the latter to compute the synaptic traces, hence the weights and biases. An important feature of the frequency-based pattern presentation is that it allows us to easily control the timing between presynaptic and postsynaptic spikes. This offers an implementation possibility when we want to investigate the effects of exact spike timing on the weight modification in our learning rule. In MATLAB, the function generate_frequency_spiking generates an input activity y(t) from a driving input x(t). The series of parameters is similar to that of the previous section and includes a value fmax, which corresponds to the maximum output frequency (when x(t) takes a value of 1).
In order to generate spikes, we discretize the time scale into intervals of 1 millisecond: when a spike occurs at a specific time t_0, the value y(t_0) is simply set to 1. Figure 4.1b shows the input activity of an abstract unit fed with the pattern x = [1, 0, 1, 0, 0.5, 1, 0.25, 0, 0.75]. 4.1.3 Spiking Poisson-generated Pattern Presentation We make one more step in the direction of mimicking neural-like data by implementing Poisson spike trains to feed the input units. In the cortex, the timing of successive action potentials is highly irregular, and we can view the irregular inter-spike interval as a random process. This implies that an instantaneous estimate of the spike rate can be obtained by averaging the pooled responses of many individual neurons, but that the precise timing of individual spikes conveys little information. The benefit of the Poisson process for spike generation is that it adds randomness and discards the determinism in our simulation (each random seed will give different spike trains). Thus, we focus on the parameters underlying this random process rather than modeling precise coincidences of presynaptic and postsynaptic events. 29
  • 40. CHAPTER 4. ABSTRACT UNITS IMPLEMENTATION (a) Non-Spiking Pattern Presentation (b) Spiking Frequency-based Pattern Presentation (c) Spiking Poisson-generated Pattern Presentation Figure 4.1: Abstract Units Pattern Presentations corresponding to the pattern x = [1, 0, 1, 0, 0.5, 1, 0.25, 0, 0.75] We assume here that the generation of each spike depends only on an underlying signal r(t) that we will refer to as an instantaneous firing rate. It follows that the generation of each spike is independent of all the other spikes, which is called the spike independence hypothesis. Furthermore, we make the assumption that the firing rate r(t) is constant over time (actually r(t) is updated in steps, but for one pattern we can suppose that r(t) = r). The Poisson process is then said to be homogeneous. In a Poisson process, the probability that n events occur in ∆t with an instantaneous rate r is given by the formula:

\[ P(\{n \text{ spikes during } \Delta t\}) = e^{-r\Delta t} \, \frac{(r\Delta t)^n}{n!} \tag{4.1} \]

By setting n = 0 and ∆t = τ, we obtain P({next spike occurs after τ}) = e^{−rτ}, and it follows that

\[ P(\{\text{next spike occurs before } \tau\}) = 1 - e^{-r\tau} \tag{4.2} \]

One way to implement a Poisson spike train is to use equation 4.2: we generate a random number between 0 and 1, and the inter-spike interval is given by the value of τ that realizes the identity. The drawback of this method is that the spike train has to be created sequentially. We can create a whole Poisson spike train at once by doing as follows. 30
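This whole-train method can be sketched as follows (a Python sketch, not the thesis's MATLAB code; it assumes 1 ms bins, in which a spike occurs with probability rδt, the standard binning approximation of a homogeneous Poisson process):

```python
import random

def generate_poisson_spiking(rate_hz, duration_ms, dt_ms=1.0, seed=None):
    """Whole-train Poisson generation: in each bin of width dt the
    probability of a spike is approximately r*dt (homogeneous process)."""
    rng = random.Random(seed)
    p_spike = rate_hz * dt_ms / 1000.0      # r*dt, with r in spikes/s
    n_bins = int(duration_ms / dt_ms)
    return [1 if rng.random() < p_spike else 0 for _ in range(n_bins)]

# 50 Hz for 5 s; fixing the seed makes the train reproducible
train = generate_poisson_spiking(rate_hz=50, duration_ms=5000, seed=1)
```

Fixing the seed reproduces the same train across runs, which matches the seeding behaviour described for the MATLAB routine.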
  • 41. 4.2. LEARNING RULE IMPLEMENTATION The average spike count between t_1 and t_2 can be defined from the instantaneous firing rate by

\[ \bar{n} = \int_{t_1}^{t_2} r(t) \, dt \]

and for sufficiently small intervals, t_1 = t − δt/2 and t_2 = t + δt/2, the average spike count can be approximated by \( \bar{n} = r(t)\,\delta t = r\,\delta t \) under the homogeneous Poisson process hypothesis. Furthermore, when δt is small enough, the average spike count equals the probability of the firing of a single spike

\[ P(\{\text{one spike occurs during the interval } (t - \tfrac{\delta t}{2},\, t + \tfrac{\delta t}{2})\}) = r\,\delta t \tag{4.3} \]

Now, assuming δt is small enough (usually 1 ms), if we want to create a spike train of arbitrary length T using 4.3, we need to generate T/δt random numbers p_i between 0 and 1. Then if p_i < rδt, we generate a spike at the time corresponding to the index of p_i; if not, no spike is generated. The Poisson spike generation is an intermediate stage towards the NEURON implementations. It allows us to account for random rate-based spike generation. This is valuable, because this process is easy to implement and gives us an idea of whether our model responds well to noisy or random data. Later, some noisy spike trains may be added to our data so that it resembles what is observed in vivo. In MATLAB, the function generate_poisson_spiking generates an input activity y(t) from a driving input x(t). The series of parameters is similar to that of the previous section, and the rate r is set to the same value as the frequency fmax used before. We stress the fact that the Poisson generation of spike trains is based on a random process. Thus, each seed gives a different input activity y(t) for the same driving input x(t); setting the same seed in two runs makes them identical. Figure 4.1c shows the input activity of an abstract unit fed with the pattern x = [1, 0, 1, 0, 0.5, 1, 0.25, 0, 0.75]. 4.2 Learning Rule Implementation In order to solve the differential equations in MATLAB, we used the solver ode45.
Its use is quite straightforward, except that it relies on function handles, which makes it tricky to control intrinsic equation parameters, like the time constant τ_i or the print-now signal κ. If the implementation of the learning rule follows the set of equations 3.1 to 3.10, a non-negligible phenomenon arises: spikes are modeled by a sum of unit impulse functions in MATLAB, and the solver is likely to miss them, because at each time step, in order to evaluate the derivative at a point, it uses points in its neighbourhood. Not only are the spiking variables highly discontinuous, but they are also zero most of the time, which prevents the solver ode45 from detecting any activity. A solution to this problem is to introduce a ‘duration’ δt for the spikes (typically δt equals 1 to 2 milliseconds), so that the mathematical model of a spike switches from an impulse function to a rectangular (door) function of width δt centered at t_s. But in that case, 1/τ_i is an upper bound for dz_i/dt (see equation 3.1), which results in only a small increase of the primary trace z_i(t). This propagates to the secondary 31
  • 42. CHAPTER 4. ABSTRACT UNITS IMPLEMENTATION and to the tertiary traces, which, as a result, hardly exceed 0.001. This is highly undesirable, because they are supposed to represent probabilities of activation. To bypass these problems, we decided to split the set of equations 3.1 to 3.10 into two phases. First, we update the primary traces with the help of an auxiliary function generate_primary_trace, which solves equation 4.4:

\[ Z_s = z_i^- + r\,(1 - z_i^-) \ \text{ if } x_i(t) = 1, \qquad z_i(t) = Z_s \, e^{-\frac{t - t_s}{\tau_i}} \tag{4.4} \]

where t_s records the occurrence of the last spike and Z_s is updated according to the proportional trace update. The set of equations 3.3 to 3.8 is then solved separately using the solver ode45. Special care has to be taken with the time-step increment in order to find a trade-off between computing time and accuracy. The weight update 3.10 and bias update 3.9 are straightforward. Finally, it is important to mention that we have implemented ‘off-line learning’, in the sense that weights and biases are updated independently of each other. Everything happens as if there were no connection at all between the cells. This is not a constraint during learning, but on the contrary rather convenient. What remains to be investigated is when learning should occur and when inference should take over. In our model of abstract units, though, the boundary between learning and inference is clear, because they are governed by different formulae used in different contexts. 4.3 Retrieval If the learning phase is central in our implementation, it is also crucial to check that the stored patterns can be retrieved correctly. The acquired knowledge should be easily retrieved, especially when we use the BCPNN as an auto- or hetero-associative memory. Thus, in this section, we assume that a certain learning phase has already occurred and that the weights ω_ij and bias β_j are set. Our goal is to present an incomplete pattern and to check if the network is able to complete it correctly.
Since we only deal with one synapse, input will be fed to the presynaptic unit and output will be collected at the postsynaptic unit. Because we have three different pattern presentation schemes in our abstract units model, inference is done in three different fashions. In all cases, however, the retrieval phase aims to realise an input-output mapping from unit i to unit j. Quantitative results are presented in the next chapter; here we focus on the method that enables us to achieve this. Non-Spiking Inference This case is the simplest, because the activity of a unit is constant over time (for the duration of one pattern presentation). In other words, because there is no 32
  • 43. 4.3. RETRIEVAL difference between the driving input x_i(t) and the input activity y_i(t), the input-output mapping is straightforward. Assuming that unit i is fed an input pattern corresponding to the driving input x_i = k_i, we first compute the support value h_j of unit j with h_j = ω_ij x_i + β_j, and then we update the output unit activity with x_j = e^{h_j}. Finally, the input-output mapping is realized by equation 4.5:

\[ x_j = e^{\omega_{ij} x_i + \beta_j} \tag{4.5} \]

In order to produce the input-output relationship curve, we compute the output x_j according to equation 4.5 for a set of input values x_i, regularly spaced between 0 and 1. We end up with an output vector y mapped to an input vector x. Note that the above equation is the same as the equation presented in Chapter 2 (section 2.3.4), with only two units. If the learning phase has been successful, x_j is nothing but the a posteriori probability of unit j, knowing unit i. Spiking Inference In the Spiking Frequency-based and Poisson-generated pattern presentation schemes, the input activity y_i(t) is not constant over time. A value k_i of the driving input x_i(t) corresponds to a firing frequency f_i in one case, and to a firing rate in the other. Thus, the inference for one value of x_i is not given by a direct calculation like equation 4.5; instead, it depends on the time course of the spiking input activity y_i(t), governed by the driving input value x_i of unit i. This input activity y_i(t) needs to be processed to calculate a corresponding output value x_j. In order to map an input value x_i to a number x_j between 0 and 1, we proceed as follows: 1. We generate a regular spiking input y_i(t) with frequency f_i (FS) or a Poisson spike train with rate f_i (PS), during a time T_inf equal to 5 seconds. The firing frequency or rate obeys f_i = x_i · f_max with x_i ∈ [0, 1]. 2. We compute a support activity s_j(t) according to the relation s_j(t) = ω_ij y_i(t) + β_j. 3.
The support activity s_j(t) is then low-pass filtered by a filter with a high time constant τ_f and slow update value k:

\[ \frac{d\hat{s}_j}{dt} = k\,(s_j - \beta_j) - \frac{\hat{s}_j}{\tau_f} \]

4. We take the exponential of the filtered support activity ŝ_j(t). 5. x_j is finally set to the mean stationary value of the output activity y_j(t_∞). 33
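The five steps above can be sketched end to end (a Python sketch, not the thesis's MATLAB code; it uses the regular-spiking variant of step 1, the values τ_f = 500 ms and k = 1/f_max quoted in the thesis, and our own reading of steps 4 and 5, in which the bias is recombined with the filtered support before exponentiation):

```python
import math

def spiking_inference(x_i, w_ij, beta_j, f_max=100.0, T=5000.0, dt=1.0,
                      tau_f=500.0):
    """Steps 1-5: regular spike train at f_i = x_i * f_max, support
    activity, low-pass filtering, exponentiation, stationary mean."""
    k = 1.0 / f_max
    period = math.inf if x_i == 0 else 1000.0 / (x_i * f_max)  # ms per spike
    s_hat, next_spike = 0.0, period
    out = []
    for step in range(int(T / dt)):
        t = step * dt
        y = 1.0 if t >= next_spike else 0.0               # step 1: spike train
        if y:
            next_spike += period
        s = w_ij * y + beta_j                             # step 2: support
        s_hat += dt * (k * (s - beta_j) - s_hat / tau_f)  # step 3: filter
        out.append(math.exp(s_hat + beta_j))              # step 4: exponential
    return sum(out[-1000:]) / 1000.0                      # step 5: mean value
```

With w_ij > 0, a higher driving input x_i yields a larger stationary output, reproducing the monotone input-output mapping that the non-spiking equation 4.5 gives directly.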
  • 44. CHAPTER 4. ABSTRACT UNITS IMPLEMENTATION Figure 4.2: Spiking Inference with abstract units - Different stages of processing. Figure 4.2 shows these different stages of processing. This technique, despite its apparent complexity, gives a good fit to the previous non-spiking mapping. The key procedure occurs at step 3, when we filter the support activity s_j(t). This variable is equal to the bias term β_j when the input unit does not spike, and it is set to the value β_j + ω_ij on the occurrence of a spike. When we filter with a specific low-pass filter (high time constant, small update rate), we can generate a filtered support activity ŝ_j(t), which works as an additive trace. Hence, the value of ŝ_j(t) at the end of the stimulation gives a measure of the firing frequency of the cell. The direction of update k(s_j − β_j) is proportional to the weight value ω_ij, which allows negative or positive build-up according to the sign of ω_ij. Typically, we use τ_f = 500 ms and k = 1/f_max. Step 4 is needed to keep the inference equation homogeneous. It is crucial that ŝ_j(t) stays in the range ]−∞, 0], because we want a value of x_j between 0 and 1. This can be controlled either by the value of k, or by modifying the filtering equation as in the case of saturated traces (see Chapter 3). The biological model is composed of steps 3 and 4, because we can draw an analogy between these processes and what occurs at the synapse level. Briefly, the filtering accounts for synaptic integration with low release of neurotransmitters and slow degradation. The exponentiation in step 4 is observed in the current-to-frequency mapping of a cell (called the current-discharge relationship). For the Poisson-generated spike trains, the underlying random process gives different output at each run. Thus, we have to compute average values, after repeating the same inference process over several runs (between 5 and 10 runs).
There is a trade-off between discarding the randomness by increasing the number of runs and the computing time of the simulations. It is also important to keep the randomness introduced by the Poisson process, because it accounts for the irregular spiking observed in real neurons.
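The Poisson spike generation and run-averaging just described can be sketched as follows; the bin-based generation scheme, function names, and default parameter values are assumptions for illustration:

```python
import random

def poisson_spike_train(rate_hz, duration_ms, dt_ms=1.0, rng=None):
    """Generate a 0/1 spike train: P(spike in a bin) = rate * dt."""
    rng = rng or random.Random()
    p = rate_hz * dt_ms / 1000.0
    return [1 if rng.random() < p else 0
            for _ in range(int(duration_ms / dt_ms))]

def mean_rate_over_runs(rate_hz, duration_ms, n_runs=10, seed=0):
    """Average the empirical firing rate over several runs (5-10 in the text)."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_runs):
        train = poisson_spike_train(rate_hz, duration_ms, rng=rng)
        rates.append(sum(train) / duration_ms * 1000.0)  # spikes/s
    return sum(rates) / n_runs
```

Averaging over more runs tightens the estimate around the nominal rate, at the cost of proportionally longer simulations, which is exactly the trade-off noted above.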
Chapter 5

Hodgkin-Huxley Spiking Implementation in NEURON

NEURON is a simulation environment for modeling individual neurons and neural networks. It was primarily developed by Michael Hines, John W. Moore, and Ted Carnevale at Yale and Duke. Documentation about NEURON and how to implement models in NEURON is given in the NEURON book [8]. NEURON, which is associated with the object-oriented NMODL language, offers an efficient means to run simulations of highly connected networks of neurons. Built on the paradigm of the C language, it does not suffer from the overhead of purely procedural data processing and uses efficient, fast algorithms to solve differential equations. The computing time of the abstract units model is thereby reduced by a factor of 10.

5.1 Cell Model

5.1.1 Hodgkin-Huxley Model

In 1952, Alan Lloyd Hodgkin and Andrew Huxley proposed a model to explain the ionic mechanisms underlying the initiation and propagation of action potentials in the squid giant axon [14]. They received the Nobel Prize in Physiology or Medicine in 1963 for this work, and the model has since been referred to as the Hodgkin-Huxley model. It describes how action potentials in neurons are initiated and propagated with the help of a set of nonlinear ordinary differential equations which approximates the electrical characteristics of excitable cells such as neurons and cardiac myocytes [2].

The main idea behind the Hodgkin-Huxley formalism is to give an electrical equivalent to each biological component of the cell that plays a role in the transmission of action potentials, which is the support of signaling within the cell. The components of a typical Hodgkin-Huxley model, shown in Figure 5.1, include:

• A capacitance Cm, representing the lipid bilayer. A cell, considered as a whole, is electrically neutral, but the neighbourhood surrounding the cell membrane
Figure 5.1: Hodgkin-Huxley model of a cell

is not. Membrane voltage is the consequence of the accumulation of charged particles on both sides of that bilayer, which is impermeable to ions. A typical value for Cm is 1 nF.

• Nonlinear electrical conductances gn(Vm, t), representing voltage-gated ion channels. Their behaviour is described by gating variables that capture open, closed and inactivated states (see Appendix for equations). These conductances are both voltage- and time-dependent: gn(Vm, t), where n denotes a specific ion species. In addition, they exhibit fast dynamics, because they account for the regenerative properties of the cell implied in the propagation of action potentials.

• A linear conductance gleak for passive leak channels: channels that are not ion-selective, are always open, and contribute to the resting membrane potential. A typical value for gleak is 20 µS·cm−2.

• Generators En, describing the electrochemical gradients driving the flow of ions, the values of which are determined from the Nernst potentials of the ionic species of interest.

This model can be extended by modeling ion pumps with the help of current sources (the sodium-potassium pump is responsible for the concentration equilibrium inside and outside the cell). More elaborate models include chloride and calcium voltage-gated currents; however, we only deal here with two ionic currents, sodium and potassium, and one leakage channel.
Further, our cell model will contain additional channels (see figure 5.3.2): a slow-dynamics voltage-gated potassium channel accounting for spike-frequency adaptation (see section 5.1.2) and an activity-dependent potassium channel modeling intrinsic excitability (see section 5.3.2). As a convention, we will use I > 0 when ions flow from the outside to the inside of the cell, so that, in the normal cell dynamics, the sodium current takes positive values and the potassium current takes negative values. The voltage equation is given by the relation between the applied current Iapp, the capacitive current Ic and the sum of the ionic and leak currents Iion:

Iapp = Ic + Iion = Cm dVm/dt + INa + IK + Ileak (5.1)

We see that when Iapp > 0, then dVm/dt > 0 and the membrane voltage becomes more positive (depolarization). The detailed voltage and gating variable equations are given in the Appendix.

5.1.2 Spike Frequency Adaptation

Spike-frequency adaptation is a type of neural adaptation that plays a key role in the regulation of neuronal firing frequency. It is characterized by an increase of the interspike interval when a neuron is current-clamped. Among other mechanisms, various ionic currents modulating spike generation cause this type of neural adaptation: voltage-gated potassium currents (M-type currents), the interplay of calcium currents and intracellular calcium dynamics with calcium-gated potassium channels (AHP-type currents), and the slow recovery from inactivation of the fast sodium current (Benda et al. 2003) [35]. Spike-frequency adaptation can account for the findings on burst firing (Azouz et al. 2000) [5].

Figure 5.2: Spike-frequency adaptation: membrane voltage and state variable p
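As a rough illustration of the voltage equation (5.1) and its sign convention, a forward-Euler integration reduced to the leak current alone (active sodium and potassium channels omitted) shows that a positive applied current depolarizes the membrane. All names and parameter values here are placeholders, not the thesis's cell parameters:

```python
def integrate_vm(i_app_na, duration_ms, cm_nf=1.0, g_leak_us=0.02,
                 e_leak_mv=-65.0, dt_ms=0.01):
    """Forward-Euler on Cm*dVm/dt = Iapp - Ileak (active channels omitted).

    Ileak = g_leak * (Vm - E_leak); with nA, nF, uS and mV, the quotient
    nA / nF comes out in mV/ms, so the units are consistent.
    """
    vm = e_leak_mv  # start at rest
    for _ in range(int(duration_ms / dt_ms)):
        i_leak = g_leak_us * (vm - e_leak_mv)      # leak current, nA
        vm += dt_ms * (i_app_na - i_leak) / cm_nf  # voltage update, mV
    return vm
```

With Iapp = 0 the membrane sits at E_leak; a positive Iapp drives Vm up toward E_leak + Iapp/g_leak with time constant Cm/g_leak, consistent with the depolarization argument above.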
In our model, spike-frequency adaptation is taken into account by adding a slow-dynamics voltage-gated potassium channel. The conductance of this channel is nonlinear and depends on the membrane voltage Vm. It is described by an activation variable p, which works in a similar way to an additive synaptic trace (see figure 5.2). The dynamics of the channel are given by the following equations:

gkim(Vm, t) = gkim · p (5.2)

dp/dt = (p∞(Vm, t) − p) / τ(Vm, t) (5.3)

Figure 5.2 describes the build-up of the trace p and the conductance gkim, which is responsible for the increase of the interspike interval during the stimulation. Because of its slow decay, the delay between stimulations must be much longer than the stimulation itself for the p variable to return to baseline. The slow dynamics of this channel suggest that repeated strong transient stimulation has a greater effect than long-lasting stimulation.

5.2 Pattern Presentation

Now that our cells are no longer modeled by artificial units, but instead by complex spiking Hodgkin-Huxley units, input and output need to be matched to real variables. As presented above, the Hodgkin-Huxley model is based on the relation between the membrane potential Vm and the individual ionic currents Ii. Thus it is natural to feed input to a cell by injecting a certain amount of current Iapp into it and to read the output as the membrane firing frequency f. To achieve this, we will use current electrodes to present patterns to the network: some current is injected inside the cell membrane, which depolarizes it and triggers action potentials, and the membrane voltage is recorded as the difference of potential between two electrodes, one inside and the other outside the cell (see Kandel 1995 [20] about the current-clamp technique).
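The adaptation variable p of equation (5.3) relaxes toward a voltage-dependent target p∞ with a voltage-dependent time constant τ. A caricature with a two-level target (1 during a spike, 0 otherwise) and fixed, purely illustrative time constants reproduces the slow build-up and decay described for the trace:

```python
def adaptation_trace(spike_flags, tau_rise_ms=5.0, tau_decay_ms=300.0,
                     dt_ms=1.0):
    """Integrate dp/dt = (p_inf - p)/tau, with p_inf = 1 and a fast tau
    while the cell spikes, p_inf = 0 and a slow tau otherwise
    (a two-level caricature of the voltage dependence in eq. 5.3)."""
    p, history = 0.0, []
    for spiking in spike_flags:
        p_inf, tau = (1.0, tau_rise_ms) if spiking else (0.0, tau_decay_ms)
        p += dt_ms * (p_inf - p) / tau  # forward-Euler relaxation step
        history.append(p)
    return history
```

Because tau_decay is much longer than tau_rise, p decays far more slowly than it builds up, which is why the pause between stimulations must exceed the stimulation itself for p to return to baseline.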
So the input-output relationship (which is similar to the activity function in artificial networks) is given by mapping the injected current Iapp to the membrane firing frequency f. The curve giving the firing frequency of one unit versus the injected input current is called the steady-state current discharge. This curve is presented for our units in the next chapter (see figure 6.6). For weak currents, no active firing is triggered (the depolarization induced by the current injection is too small for the membrane to reach the threshold, and no action potential is recorded). For currents which are too strong, the Hodgkin-Huxley voltage-gated potassium channels become unable to repolarize the cell and the membrane voltage stabilizes at a supra-threshold value. Thus, we must feed input currents in a range [0, Imax] where the steady-state current discharge is approximately linear.

During learning, we feed input patterns sequentially. Learning is entirely frequency-based, meaning that an input value between 0 and 1 is mapped to a firing frequency. The current-frequency relationship is then used to find the current
clamp value in order to obtain the right frequency. Let us assume that the set of input patterns we want to learn is (1, 1), (0, 0), (1, 1), (0, 1), (1, 1); then unit i must fire with the frequency sequence (fmax, 0, fmax, 0, fmax) and unit j with (fmax, 0, fmax, fmax, fmax). Using the steady-state current discharge curve, we inject the corresponding currents in order to obtain the desired firing frequencies.

5.3 Learning Rule Implementation

In this section, we present how weights and biases are represented and updated during learning in the NEURON spiking environment. We use the object-oriented NMODL language to create new mechanisms for simulations in NEURON. The details of the code are given in the Appendix.

5.3.1 Synaptic Integration

Modeling the synapse

In the artificial context, the weight ωij between two units quantifies the strength of the connection between them. If ωij is high, then the connection between the two units is strong and they influence one another significantly. On the other hand, if ωij is close to zero, the connection is very weak and the corresponding units behave as if they were not connected. The simplest way to represent this in our spiking context is to map ωij to a synaptic conductance gij between units i and j. This conductance is time-dependent and closely related to presynaptic and postsynaptic events.

So we create a model of a synapse whose intrinsic properties fulfill the weight update equation (3.10) of our spiking learning rule, and call it a BCPNN Synapse. It is defined as a point process in NMODL, which means that one can implement as many instances of this mechanism as desired, as long as one specifies a location (a section in NEURON). All local variables associated with the section it has been attached to become available to the point process (membrane voltage, ionic currents, etc.). As a convention, we will always place a synapse on the postsynaptic cell soma.
Conductance Expression

In our model, the synaptic conductance gij(t) is a product of three quantities:

gij(t) = gmax · gcomp(pi, pj, pij, t) · αi(yi, t) (5.4)

gmax is the maximum conductance of the synapse: it regulates its strength (ability to conduct current) and can temporarily be set to zero if one wants to operate off-line learning. gcomp(pi, pj, pij, t) is directly computed from the tertiary traces pi, pj and pij simi-
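The product structure of equation (5.4) can be sketched directly. Since the exact expression for gcomp only follows in the text, a BCPNN-style ratio pij/(pi·pj) is assumed here purely for illustration, and αi is passed in as a precomputed presynaptic activation value:

```python
def synaptic_conductance(g_max, p_i, p_j, p_ij, alpha_i):
    """Eq. (5.4): g_ij = g_max * g_comp * alpha_i.

    g_comp is assumed here to be the weight-like ratio p_ij / (p_i * p_j)
    built from the traces (an assumption for this sketch, not the thesis's
    exact formula). Setting g_max = 0 switches the synapse off, which
    corresponds to the off-line learning mode mentioned above.
    """
    g_comp = p_ij / (p_i * p_j)
    return g_max * g_comp * alpha_i
```

Independent units (p_ij = p_i * p_j) then yield g_comp = 1, while correlated or anti-correlated traces scale the conductance up or down.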