Artificial Neural Network Implementation on FPGA – a Modular Approach
TVLSI-00648-2014
Abstract—In this paper, we present an FPGA implementation
of an artificial neural network. In addition, the current paper
emphasizes important FPGA design principles, which turn the
development of a neural network into a much more modular
procedure. In fact, these principles may be found extremely
useful for those who plan to implement a neural architecture on
an FPGA.
Our implementation was based on a multilayer perceptron
topology and used the back-propagation algorithm for training.
Thanks to a highly modular and parameterized structure, a
network with any number of layers and neurons can be
synthesized in a matter of minutes. This means that the system
can be quickly configured for solving any type of neural network
related problem, including examples that involve deep learning.
The design was fitted on a Xilinx Zynq-7000 at a 200 MHz clock
frequency. The peak performances measured were 5.46 MCUPS
during training and 8.24 MCPS during computation. The
implementation was successfully tested on the “Breast Cancer
Wisconsin” classification problem; the tests showed a 96% hit
ratio.
Index Terms—Artificial neural networks, Backpropagation,
Deep Learning, Design methodology, Feedforward neural
networks, Field programmable gate arrays, Learning systems,
Multi-layer neural network, Multilayer perceptrons, Neural
network hardware, Parallel architectures, Reconfigurable
architectures.
I. INTRODUCTION
The artificial neural networks, or ANNs, are
computational models that provide a unique way of data
processing. This computational model was inspired by the
animal nervous system and has been found useful in areas such as
pattern recognition, classification, approximation, etc. A
general structure of an ANN consists of a network of
interconnected artificial neurons. Moreover, the ANN is a
trainable machine, thus it is capable of learning patterns for
later recognition.
An interest in data classification, such as face recognition,
has significantly increased in the past few years. As the
available training sets are getting bigger, more and more
researchers are attracted to improving the ANN’s recognition
abilities. For instance, a significant research direction has
been conducted in the field of deep learning, carried out on
Deep Neural Networks, or DNNs [12].
K. Berestizhevsky and R. Levy are with the Department of Electrical
Engineering Systems, Tel Aviv University, Tel Aviv 69978, Israel (e-mail:
konsta9@mail.tau.ac.il; roeelev1@mail.tau.ac.il).
On the one hand, the ANN model provides an efficient,
decentralized way of computation. On the other hand, the
very fine-grained parallelism of this model makes it fit well
only on highly parallel hardware platforms. Previous works
have examined various platforms for neural architectures,
such as general-purpose computers, dedicated parallel
computers and ASICs [3]. Eventually, it seems that when dealing
with neural architectures, the FPGA platform provides the
best compromise among its competitors. This compromise is
in terms of time-to-market, cost, speed, area and reliability.
Extended comparison between the platforms was described
in [1].
In our study, we have been looking for a flexible solution
for designing a neural architecture. The platform of our
choice, based on the previously mentioned papers, was
decided to be the FPGA. Unfortunately, the early literature
review revealed a lack of properly described design
principles for neural architectures on FPGAs. One significant
work was conducted in [4] and provided additional
inspiration for the paper at hand. Hence, in this paper, we
describe a modular method for designing an ANN on an
FPGA.
The idea presented in this paper relies on a hierarchical
approach. In other words, we break the structure of ANN
down into hierarchical levels of abstraction. To begin with,
the base level is the single artificial neuron. Afterwards, the
following levels encapsulate one or more lower-level
components and lead the development towards the top level
of the abstraction, which is the software application that
controls the global actions of the network.
The paper is organized as follows. Chapter II describes the
background behind the hardware platform used, the ANN
theory and the main algorithms used. Chapter III presents the
hierarchical approach for a neural network’s design process.
Chapter IV extends this approach to concrete development
steps. Chapter V presents the conducted tests and the results.
Chapter VI concludes the study and suggests further
improvements.
II. BACKGROUND
A. FPGA Technology
Field Programmable Gate Arrays (FPGAs) are
programmable devices that can be configured to function as
custom digital circuits. The configuration of an FPGA device
is done by means of hardware description languages such as
Verilog or VHDL. Thus, by describing the logic and by
programming the FPGA with it, one can obtain an ASIC-like
hardware.
The FPGA hardware consists of 2D array of configurable
logical blocks (CLBs) that can be set to implement specific
logical functions. The connections between these CLBs are
configurable as well. Furthermore, every logical function can
be programmed into the FPGA as long as there are enough
CLBs and the connectivity between them is available.
The fact that an FPGA utilizes generic CLBs, located in fixed
positions on the chip, imposes a significant drawback:
every custom design is forced to fit onto this fixed
arrangement of blocks. As a result, an FPGA design
will always have much poorer performance than the same
logic on a properly designed ASIC. Nevertheless, the
development time for an FPGA design is measured in weeks,
whereas ASIC development times are measured in months.
Moreover, an appropriately parameterized FPGA design can
be reconfigured and re-programmed into the chip in a matter
of minutes.
In our study, we’ve used Xilinx Zynq™-7000 as the FPGA
platform. This platform is equipped with a dual core ARM
Cortex™-A9 and provides extensive programming abilities
as well as a convenient interface with a PC.
B. Artificial Neural Networks
Artificial neural networks (ANNs) are computational
models for solving various problems. These models can be
implemented both in software and in hardware. During the
last decades, ANNs were used for engineering purposes, such
as studying behavior and control in animals and machines,
pattern recognition, forecasting, and data compression. First,
we present a few important definitions:
ANN – Artificial Neural Network, consists of artificial
neurons, ordered in layers. The signals in the network are
flowing through these layers (see Fig. 2).
Artificial Neuron – the “activator”, or “node”, of the ANN.
A general activator consists of multiple input ports, a summing
junction, an activation function (threshold, ramp, sigmoidal,
etc.) and one output port (see Fig. 1).
MLP – multilayer perceptron. A form of feed-forward
neural network consisting of an input layer followed by two
or more layers of neurons, the last of which is the
output layer.
DNN – deep neural network, another way to refer to MLP
with more than a single hidden layer.
Learning Phase – the period of time when the ANN determines
its learnable parameters, based on a learning set. This phase
involves the training algorithm.
Calculation Phase – the period of time when the ANN is fed with
inputs and the forward computation on these inputs generates
the output of the network.
In nature, real neurons receive electrical signals
through synapses. When the received signals surpass a
threshold, the neuron emits a signal through its axon. The
artificial neuron attempts to replicate this natural behavior.
The electrical signals appear in the form of digital inputs, whose
strength is their numeric value multiplied by the weight of the
corresponding input edge. The activation of an artificial
neuron is done by applying a nonlinear function (such as
sigmoid) to the accumulated value of the neuron’s inputs. The
final calculated value of an artificial neuron appears on its
output port. Fig. 1 presents the mathematical abstraction of
an artificial neuron that receives inputs denoted by x0 to xn-1
and outputs the value denoted by y.
Fig. 1. Generic structure of an artificial neuron with n inputs (x0,…,xn-1).
These inputs are multiplied by the corresponding weights (w0,…,wn-1),
summed up and passed through an activation function f.
The ANN is comprised of interconnected layers, where
each layer consists of a certain number of artificial neurons.
During the calculation phase, the ANN is activated in a way
called forward computation. In this form of activity, the
signals are flowing from the input layer, through one or
more hidden layers.
Finally, the last computations are carried out in an output
layer. Fig. 2 presents a schematic chart of an ANN with L
layers and N neurons in each layer. In general, the
structure of the ANN can have a variable number of artificial
neurons in its layers.
Fig. 2. Generic structure of Feedforward ANN with L layers with N
artificial neurons in each layer.
C. Backpropagation Algorithm
Prior to executing any forward computation, the ANN
should be trained. The training algorithm of our choice is the
“Back-Propagation” algorithm [2]. It is one of the most
common training algorithms for ANNs, and many other models are based on it.
In our design, we have slightly modified the algorithm
flow described in [1]. Our change to the original algorithm
comes in a form of treating the biases as actual neurons.
In order to demonstrate the backpropagation algorithm, we
consider an MLP similar to the one presented in Fig. 2. The MLP
consists of L layers, having N_l neurons in the l-th layer. The
execution of the error backpropagation algorithm is an
iterative process. In our context, n denotes the current
iteration. The algorithm itself consists of the following five
steps:
1) Initialization of parameters
Prior to training, the following parameters must be set:
μ is defined as the learning rate, a constant scaling factor
used in the error correction during each iteration of the
algorithm. w_kj^l(n = 0) is the weight of the connection from
neuron j in the (l−1)-th layer to the k-th neuron in the l-th
layer. This weight is modified in every iteration, where
iteration n = 0 stands for the initialization. θ^l is the bias
value of the l-th layer. A bias can be thought of as an
additional neuron that has no inputs and always outputs a
constant value. In our model, the bias neuron is placed as the
last neuron in the input layer and in every hidden layer as
well.
All of the mentioned parameters are chosen based on
heuristics. For instance, the typical bias value is around
1.0, the algorithm step is a fractional number from an
interval (0,1) and the weights can be initialized randomly
to values between -1 and 1.
2) Training Example Input
For an ANN with N_0 inputs and N_{L−1} outputs, the training
set T is defined as follows:

T ≜ {(I, C) | I ∈ ℝ^{N_0}, C ∈ ℝ^{N_{L−1}}}    ( 1 )

where C is the correct output vector corresponding to an input
vector I. In every iteration, one training example (I, C) ∈ T is
fed into the network.
3) Forward Computation
Data signals from the neurons of the previous (l−1)-th layer
are propagated towards the neurons in the l-th layer.
During that process, each neuron in the hidden and the
output layers calculates the weighted sum ( 2 ) of its
inputs:

S_k^l = Σ_{j=0}^{N_{l−1}} o_j^{l−1} w_kj^l    ( 2 )
where l ∈ [1, …, L−1] denotes the layer number and
k ∈ [0, …, N_l − 1] denotes the neuron number. N_{l−1} is
the number of neurons in the (l−1)-th layer, not including
the bias neuron. S_k^l is the weighted sum of the k-th
neuron in the l-th layer. w_kj^l is the weight, as defined in
step (1). o_j^{l−1} is the output of the j-th neuron in the
(l−1)-th layer. As we have already mentioned, our
algorithm treats the biases as if they were actual neurons.
Therefore, the bias values that are passed to the next layer
are multiplied by the corresponding weights. The
advantage of this strategy is that the biasing becomes
adjustable during the learning phase. To sum up, the output
of each neuron is as follows:
o_k^l = { f(S_k^l),    k < N_l
        { θ^l,         k = N_l (bias neuron)    ( 3 )

where k ∈ [0, …, N_l] and l ∈ [1, …, L−1]. o_k^l is the
output of the k-th neuron in the l-th layer. θ^l is the bias of
the l-th layer. f(S_k^l) is the activation function, which
modifies the weighted sum S_k^l by passing it through a
nonlinear operation. Our algorithm uses a unipolar sigmoid as
the activation function. To be specific, the Log-sigmoid
function ( 4 ) is used:

f(x) = 1 / (1 + e^{−x})    ( 4 )
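For illustration, the forward computation of equations ( 2 )–( 4 ) can be sketched in a few lines of Python/NumPy. This is a software sketch only; the function names, seed and weight values below are our own and do not correspond to the hardware design:

```python
import numpy as np

def log_sigmoid(x):
    # Activation function of equation ( 4 )
    return 1.0 / (1.0 + np.exp(-x))

def forward(inputs, weights, biases):
    """Forward computation of equations ( 2 ) and ( 3 ).

    weights[l][k, j] is the weight from neuron j of layer l to
    neuron k of layer l + 1; the bias is appended as the last
    "neuron" of every non-output layer, as in equation ( 3 ).
    """
    o = np.append(inputs, biases[0])      # input layer + bias neuron
    for l, W in enumerate(weights):
        s = W @ o                         # weighted sums of equation ( 2 )
        o = log_sigmoid(s)                # activations f(S_k)
        if l < len(weights) - 1:          # hidden layers carry a bias
            o = np.append(o, biases[l + 1])
    return o

# Hypothetical 9-8-2 topology (cf. the test network of chapter V)
rng = np.random.default_rng(0)
weights = [rng.uniform(-1, 1, (8, 10)),   # hidden: 9 inputs + 1 bias
           rng.uniform(-1, 1, (2, 9))]    # output: 8 hidden + 1 bias
y = forward(rng.uniform(0, 1, 9), weights, biases=[1.0, 1.0])
```

Note that each weight matrix has one extra column for the bias neuron of the preceding layer, matching the summation bound in equation ( 2 ).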
4) Backward Computation
The backward computation compares the current
abilities of the network with the expected result.
Consequently, this comparison induces the appropriate
adjustment to the network’s weights. To put it another
way, the algorithm tries to minimize the error between
the 𝐶 (correct) value and the actual output value that was
determined during the forward computation.
The backward computation begins at the output layer
and proceeds towards the input layer. During the
backward computation, the following steps are
performed:
a) Local gradients calculation

E_k^l = { C_k − o_k^l,                               l = L−1
        { Σ_{j=1}^{N_{l+1}} w_jk^{l+1} δ_j^{l+1},    l ∈ [1, …, L−2]    ( 5 )

where E_k^l is the calculated error for the k-th neuron in the
l-th layer; for the output layer it is defined as the difference
between the correct value C_k and the actual neuron output o_k^l.
δ_j^{l+1} is the local gradient of the j-th neuron in the
(l+1)-th layer:

δ_k^l = E_k^l f′(S_k^l)    ( 6 )

where l ∈ [1, …, L−1] and f′(∙) is the derivative of the
activation function.
b) Weight change calculation

Δw_kj^l = μ δ_k^l o_j^{l−1}    ( 7 )

where k ∈ [0, …, N_l − 1] and j ∈ [0, …, N_{l−1}]. Δw_kj^l
defines the change in the weight of the connection from neuron
j in the (l−1)-th layer to neuron k in the l-th layer.
c) Weight update

w_kj^l(n+1) = w_kj^l(n) + Δw_kj^l(n)    ( 8 )

where k ∈ [0, …, N_l − 1] and j ∈ [0, …, N_{l−1}]. w_kj^l(n+1)
is the updated weight to be used in the next (i.e. the (n+1)-th)
iteration of the forward computation. Δw_kj^l(n) is the weight
change calculated in the n-th iteration of the backward
computation, and w_kj^l(n) is the weight used in the n-th
iteration of the forward and backward computations, where n is
the current iteration.
5) Iteration
Repeating steps 2–4 for every example (I, C) ∈ T is
considered one global iteration. One can choose to
continue training the MLP for one or more global
iterations until a predefined stopping criterion is met. Once
the learning phase is complete, the ANN can execute
forward computations on unknown inputs.
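The five steps above can be gathered into a short Python/NumPy reference. This is a software illustration of the algorithm only, not of the hardware; the helper names and index conventions are our own:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(I, C, weights, biases, mu=0.2):
    """One iteration of steps 2-4 on a single example (I, C).

    weights[l][k, j] is the weight from neuron j of layer l to
    neuron k of layer l + 1; the bias is the last "neuron" of
    every non-output layer. weights is updated in place.
    """
    # Step 3: forward computation, keeping every layer's outputs
    outputs = [np.append(I, biases[0])]
    for l, W in enumerate(weights):
        o = sigmoid(W @ outputs[-1])
        if l < len(weights) - 1:
            o = np.append(o, biases[l + 1])
        outputs.append(o)

    # Step 4a: local gradients, from the output layer backwards
    deltas = [None] * len(weights)
    for l in reversed(range(len(weights))):
        o = outputs[l + 1]
        n = weights[l].shape[0]                  # neurons, bias excluded
        if l == len(weights) - 1:
            E = C - o                            # equation ( 5 ), output layer
        else:
            E = weights[l + 1][:, :n].T @ deltas[l + 1]   # hidden layers
        deltas[l] = E * o[:n] * (1.0 - o[:n])    # equation ( 6 ), using ( 10 )

    # Steps 4b + 4c: weight change ( 7 ) and weight update ( 8 )
    for l in range(len(weights)):
        weights[l] += mu * np.outer(deltas[l], outputs[l])
    return weights
```

Repeating `train_step` over all examples of the training set constitutes one global iteration (step 5).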
D. Sigmoid and its Derivative Approximation
As we mentioned in section B, every artificial neuron must
implement a nonlinear activation function. The nonlinear
functions are problematic in hardware implementation,
therefore an approximation is required.
Log-sigmoid functions have been widely used in many
state-of-the-art ANNs [1]. Therefore, this type of
activation function became our choice as well.
The approximation that we used was the piecewise linear
approximation that was described in [5] and it is defined as
follows:
f_SigApprox(x) = { 1,              x > 2
                 { 0.25x + 0.5,    −2 ≤ x ≤ 2
                 { 0,              x < −2    ( 9 )
Note that the derivative of the sigmoid function, expressed in
terms of the sigmoid’s output value x, already has the following
convenient form:

f_SigDerivative(x) = x(1 − x)    ( 10 )
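As a sketch, the approximation ( 9 ) and the derivative form ( 10 ) can be written as follows (the function names are ours):

```python
import numpy as np

def sig_approx(x):
    # Piecewise-linear sigmoid approximation of equation ( 9 ):
    # clipping 0.25x + 0.5 to [0, 1] yields exactly the three cases
    return np.clip(0.25 * x + 0.5, 0.0, 1.0)

def sig_derivative(y):
    # Equation ( 10 ): the derivative expressed in terms of the
    # sigmoid's own output value y = f(x)
    return y * (1.0 - y)
```

Note that `sig_derivative` takes the activation's output rather than its input, which is what makes the form convenient in hardware: the neuron's stored output can be reused directly.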
III. ARCHITECTURE OVERVIEW
As mentioned in the introduction, the design was
broken down into four hierarchical levels (see Fig. 3). In this
chapter, we present a review of our architecture, with respect
to this hierarchical model.
The first level is the basic block of the network – the
artificial neuron. The artificial neuron is capable of forward
and backward computations as well as of storing its incoming
links’ weights and its latest output. The artificial neurons are
grouped together in layers, where every two adjacent layers
are interconnected. This type of layered structure forms the
second abstraction level, which is the artificial neural
network itself. Similarly, the ANN is capable of forward and
backward computations, where the adjacent layers activate
each other during the computational process.
The third level of abstraction deals with the board
environment and the processing system (PS) neighborhood.
This technological setup provides the input/output (I/O)
interface between the ANN and the PS, through the memory.
The top level of abstraction is the software application that
runs within the PS and controls the ANN. Practically
speaking, the ANN design must already be programmed
into the programmable logic (PL) in order to be controlled by
the PS.
The next chapter elaborates on how the previously
mentioned abstraction model is implemented.
Fig. 3. The 4-level abstraction model, as it is used in the current architecture.
IV. IMPLEMENTATION
This chapter presents the 5 steps of the design process that
was conducted. The work started with a high-level software
simulation. After the simulation, the design process followed
a bottom-up procedure with respect to the model described in
chapter III and Fig. 3.
A. Software Simulation
Designing a full neural architecture is a complicated
process that involves an enormous amount of algorithmic and
structural decisions. Accordingly, it would be a good practice
to try different high level designs, prior to a detailed
implementation. This approach provides the ability to
compare different designs without wasting precious time on
detailed implementation.
In order to test the high-level functionality, a high-level
simulator was designed. The simulator was written in Java
language and provided an input interface for the learning set
and the calculation inputs. The restriction of the simulator
was that it supported only an MLP with a single hidden layer.
The high-level simulator provided the simulation of the
first two levels in the hierarchical model presented in Fig. 3.
In other words, it simulated the artificial neural network only,
without the board environment. This provided the ability to
concentrate on the algorithmic correctness of the design.
The simulator application is initiated with two sets of
arguments. The first set of arguments describes the topology,
i.e. amounts of neurons in each layer and the bias values. The
second set of arguments consists of the learning parameters,
such as the training-set filename, learning rate (µ), number of
learning iteration per example and number of global
iterations. In addition, there is a special flag that determines
how the initial weights are obtained: they can either be
generated randomly, or be extracted from the VHDL module
of the implementation. The latter
option is extremely useful for conducting a comparison
between the software simulation and the hardware
implementation. Thus, by extracting the initial weights from
the hardware module, the initial conditions of the software
simulator and the hardware become identical.
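The two argument sets can be illustrated with a command-line sketch. The actual simulator was written in Java; the flag names, defaults and filename below are hypothetical placeholders, not the simulator's real interface:

```python
import argparse

# Hypothetical CLI mirroring the simulator's two argument sets
parser = argparse.ArgumentParser(description="MLP software simulator")
# First set: topology arguments
parser.add_argument("--neurons", type=int, nargs="+", default=[9, 8, 2],
                    help="number of neurons in each layer")
parser.add_argument("--biases", type=float, nargs="+", default=[1.0, 1.0],
                    help="bias value of every non-output layer")
# Second set: learning parameters
parser.add_argument("--training-set", default="breast-cancer.data",
                    help="training-set filename (placeholder)")
parser.add_argument("--mu", type=float, default=0.2, help="learning rate")
parser.add_argument("--iters-per-example", type=int, default=5)
parser.add_argument("--global-iters", type=int, default=2000)
parser.add_argument("--weights-from-vhdl", action="store_true",
                    help="extract initial weights from the VHDL module "
                         "instead of randomizing them")

args = parser.parse_args([])   # defaults only, for demonstration
```

The `--weights-from-vhdl` switch corresponds to the flag described above, which aligns the simulator's initial conditions with the hardware's.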
The software simulator was trained using parts of the
training sets of “Breast-Cancer-Wisconsin” and “Lenses
classification” [9] problems. After the training was complete,
the simulator was tested on examples from the data sets
that were not previously fed during the training phase.
The importance of the software simulation manifested itself
in a number of crucial design decisions, described briefly
below. First, the network topology was chosen to be an MLP
with its neurons fully connected between adjacent layers.
Secondly, the back-propagation training algorithm was tested
with a stopping criterion based on a fixed number of
iterations. Third, the data representation was chosen to be the
IEEE-754 32-bit floating-point. According to the studies
described in [6], this data representation has shown high
precision results. Finally, the last significant decision was
made regarding the sigmoid approximation as already shown
in chapter II, section D.
Another important use of software simulation was during
the validation of the ANN logical design. This process is
described in section C below.
B. Artificial Neuron
The basic building block of our FPGA neural network is
the artificial neuron. In our text, we use the terms
Activator and Artificial Neuron interchangeably.
The activator is in charge of both the calculations and the
storage. On the one hand, the activator handles both forward
and backward computations, including the training algorithm.
These computations are carried out by the efficient utilization
of a single 32-bit Acc-Mult core [7]. A computational
transaction is initiated by an appropriate signal from the
network control. When the activator completes the
transaction, an appropriate signal is outputted to the net, as
well as the activator’s valid floating-point output. On the other
hand, every activator is in charge of storing its input weights
and its latest calculated output.
Fig. 4. Activator’s higher-level state machine. It provides two main types of
functionality – the forward and the backward computation. The lower-level
state machines are initiated by the FWDi and the BCKi states.
The activator’s control is comprised of two hierarchical
levels. The lower level involves various state machines.
These state machines are in charge of atomic tasks, such as
the computation of the derivatives, or the sigmoid activation.
The higher control level schedules the execution of the
previously mentioned lower level machines. As a matter of
fact, the higher control level was implemented by a sole state
machine and it is shown in Fig. 4. The top level of the
activator’s schematic is shown in Fig. A.1.
At this point, it is important to emphasize the modularity
of components within the activator. As presented in Fig. 4,
every atomic operation is encapsulated by a lower-level state
machine. Therefore, any of these state machines can be easily
replaced with a different one, whereas the rest of the activator
control can remain with no changes.
An additional important principle to emphasize is module
parameterization. Parameterized modules contribute to the
reusability of the sources. As a trivial example, consider the
mux unit, which was instantiated six times, each instance with
a different configuration. Another example is the select unit,
which is in charge of scheduling the inputs and the weights
for serial accumulation, and which appears in two differently
configured instances. These two components are likely to be
designed as parameterized modules. In our design, we
utilized the VHDL generics in order to create parametrically
defined modules.
C. Network
Our ANN’s hardware structure was divided into datapath
and control components. This section elaborates on these
components of the ANN.
The ANN’s datapath connects the Activator modules
together and forms a complex computational network. The
activators are arranged in layers, where every couple of
adjacent layers have their activators fully connected. For
example, given two adjacent layers 1 and 2, every activator
in layer 1 connects its output to every activator in layer 2. For
additional illustration, see Fig. 5, or for a detailed schematic
view see Fig. A.2. Auxiliary connections between the
neurons are used for error back propagation.
During the network activity, at every given moment, at
most one layer is active (computing). When all the activators
in a given layer report ‘done’, the current layer is halted and
the next layer begins the job. This sequential calculation
method is inevitable, since every layer depends upon the
previous one. This dependence appears both in forward and
backward computations, and can be observed in equations (
2 ) and ( 5 ) respectively.
Fig. 5. ANN’s datapath is formed by interconnecting Activators. The ANN’s
controller is in charge of initiating the forward / backward computations of
the network.
The input values, such as the learning step parameter and
the correct values for training, are received from the Load-
Store Machine (will be described in the next section). These
parameters are passed through the datapath directly to the
activators. The outputs of the output layer are declared as the
network output and are passed back to the Load-Store
Machine.
The ANN’s control is a relatively simple state machine. It
receives an instruction from the Load-Store Machine and
initiates the network. The received instruction can be: ‘Calc’ –
only the forward computation will be run on the network; or
‘Learn’ – forward and error back-propagation will be run on
the network.
The generation of the network structure involves an
enormous amount of wiring. It is clear that the best way to
deal with this task is scripting. A net-generating script will
not only prevent man-made mistakes, but also make it
simpler to rearrange the network structure if needed. Indeed,
every classification problem has a unique network structure
(number of layers, activators in every layer, etc.). Therefore,
there is a need for a parameterized network generator.
Using a simple Python scripting, we have created an
automated network generator. The script was in charge of
generating the network datapath, control and the Load-Store-
Machine VHDL codes. The arguments to the script were: the
bias values, the number of layers in the network and a list of
activator amounts per layer. An additional important role of this
script was the weight randomization. When the script writes
down the VHDL lines, it hardwires the initial random floating
point values in the ‘Weights’ module instances of every
activator. The randomization is an important step towards a
successful convergence of the learning phase.
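A generator of this kind can be sketched as follows. The entity name, generics and port list below are hypothetical placeholders for illustration, not the actual VHDL sources of the design:

```python
import random

def generate_network_vhdl(layer_sizes, biases, seed=42):
    """Emit VHDL-like activator instantiations for a fully
    connected MLP, with hardwired random initial weights.

    layer_sizes : neurons per layer, e.g. [9, 8, 2]
    biases      : bias value of every non-output layer
    """
    rng = random.Random(seed)
    lines = ["-- layer biases: %s" % (biases,)]
    for l in range(1, len(layer_sizes)):
        fan_in = layer_sizes[l - 1] + 1      # previous layer + bias neuron
        for k in range(layer_sizes[l]):
            # Hardwire random initial weights into the instance
            w = ", ".join("%.6f" % rng.uniform(-1.0, 1.0)
                          for _ in range(fan_in))
            lines.append(
                "activator_l%d_n%d : entity work.Activator\n"
                "  generic map (N_INPUTS => %d,\n"
                "               INIT_WEIGHTS => (%s))\n"
                "  port map (clk => clk, rst => rst);" % (l, k, fan_in, w))
    return "\n".join(lines)

vhdl = generate_network_vhdl([9, 8, 2], biases=[1.0, 1.0])
```

Because the topology lives entirely in the script arguments, regenerating the network for a different classification problem amounts to a single rerun of the script.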
In order to validate the design, we have used various test
benches. The output of these test benches was compared to
the software simulator (see section A), to ensure functional
consistency.
D. Board Environment
The hardware platform of the research was Xilinx
ZedBoard Zynq Evaluation Kit along with Vivado 2014.2
and SDK 2014 CAD tools. The ZedBoard consists of the PS
and the PL. The objective of this section is to describe the
interface between the PS and the PL via the shared memory
of the board.
In the previous section, the whole logical design of the
ANN was introduced. In this section, we encapsulate the
ANN’s design by a Load Store Machine (LSM) module. The
LSM module is in charge of the following. (i) Receiving
directives from the memory. (ii) Guiding the network
controller with corresponding control signals. (iii) Reporting
the calculation results back to the memory.
Fig. 6. Complete view of a board configuration. The interface between the
programmable logic (PL) and the board infrastructure is in the form of the
three data channels DataIn, CFG and DataOut.
The LSM is connected to the board’s infrastructure by
means of 3 AXI ports named DataIn, Cfg and DataOut. Each
of these AXI ports maps a certain region of the board’s
DRAM to a single 128KB BRAM module in order to allow a
fast access. The port DataIn is used for feeding data samples
for the input layer of the network. The Cfg port is used for
triggering the network’s operation and configuring its
learning parameters, such as the learning step size, number of
iterations, etc. The ports DataIn and Cfg are used by LSM for
read only operations. As for the output, the port DataOut
provides the LSM with the ability to report upon a completion
of a task and to write back the values of the output layer.
The top module of the hardware design is called
NeuralNetworkTop and is illustrated in Fig. 6. It is important
to point out that the PL fabric clock was set to 200 MHz as part
of the board configuration.
E. Software Application
In the previous sections, a full logical design and a board
configuration were described. The only missing part of the
puzzle is the software application that runs on the PS
and supplies the actual data of our choice to the ANN. The
software application’s objective is to provide an interface for
a human user, letting them choose a training set, train the
network and consequently query it with new
inputs. This section describes how the software application
completes the whole design of the project. It is important to
notice that this application runs on a PS within the board. For
our convenience, the board can be connected to the PC and
its console window can be observed on a PC as well.
The application was written in C language, compiled for
ARM processor, and then exported as an ELF file for
execution on the PS of the board. The board support packages
and the hardware platform project must be compiled with the
application. The process of the compilation and of the export
was carried out using the SDK 2014 tool, which is a part of
the Xilinx evaluation kit described in section D.
The functionality of the software application is as follows.
(i) Copy the training dataset and the training configurations
to DataIn and Cfg regions respectively. (ii) Initiate the
learning job and wait for completion. (iii) Read calculation
input from the user and initiate a calculation job. (iv) Display
the calculation results and return to step (iii). A demonstration
of an application run is shown in Fig. 7.
Fig. 7. Console view of running the software application and entering one
calculation example (the 9 floating-point numbers). The calculation session
on the artificial neural network outputted the values 0.0, 1.0.
V. TESTS
A. Test Setup
The hardware platforms for the analysis were the
ZedBoard Zynq Evaluation Kit and a PC equipped with Intel
i7 3770 CPU @ 3.4 GHz, 16GB of RAM.
The neural problem of our choice was “Breast-Cancer-
Wisconsin” [9]. This problem is described as follows: given a
patient’s 9 parameters, predict whether the patient has a benign
or a malignant breast tumor. Our neural network was
synthesized in the structure of an MLP with one hidden layer,
having 9 input, 8 hidden and 2 output neurons. The outputs
were encoded as “0, 1” for “benign” and “1, 0” for
“malignant”. The bias values were set to 1.0 in each layer.
Two main characteristics were analyzed. The first
characteristic was the precision of the neural computation,
whereas the second characteristic was the speed of the
training and calculation phases. These two metrics are
described in sections B and C, respectively.
B. Precision
The precision stands for the fraction of the correct
answers returned by the neural network. For the precision
analysis, we chose a network that was trained with the 260
training examples of the Breast-Cancer training set. The
training configurations were set to 2000 global iterations and
5 iterations per example. The training algorithm’s step size
was 0.2.
In order to challenge the network, it was queried with 50
input examples that have not appeared in the training set.
Since the outputs of the network were not always round
values 1.0 and 0.0, we have decided to treat the outputs that
were below 0.5 as 0.0 and outputs above 0.5 as 1.0.
Eventually, the test showed 48 correctly answered queries,
therefore the precision was 96%.
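The thresholding rule and the resulting hit ratio reduce to the following computation (the query outputs below are made-up values, chosen only to reproduce the 48-of-50 ratio):

```python
def hit_ratio(outputs, expected, threshold=0.5):
    """Fraction of queries whose thresholded outputs match the
    expected encoding exactly."""
    hits = 0
    for out, exp in zip(outputs, expected):
        # Treat outputs below 0.5 as 0.0 and above 0.5 as 1.0
        binarized = [1.0 if o > threshold else 0.0 for o in out]
        hits += (binarized == exp)
    return hits / len(expected)

# Illustrative data: 48 correctly answered queries out of 50
outputs  = [[0.93, 0.08]] * 48 + [[0.41, 0.77]] * 2
expected = [[1.0, 0.0]] * 50
precision = hit_ratio(outputs, expected)
```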
C. Speed
When evaluating the computational speed of ANN, the
most commonly used metrics are MCPS and MCUPS [8].
The processing speed, i.e. multiply and accumulate
operations performed in unit time, is measured by MCPS
(Millions of Connections per Second). The learning speed,
which is the rate of weight updates, is measured by MCUPS
(Millions of Connection Updates per Second).
The performance in terms of MCPS and
MCUPS is calculated while ignoring external effects that
might have limited the system performance. Assuming a
200 MHz PL clock, the above metrics were extracted from the
simulation waveforms presented in Fig. 8.
Fig. 8. A waveform of the time interval between the assertion of fwd_0
signal (forward computation initiation) and the reception of f_done_0
signal (report of completion of a forward computation).
As can be seen in Fig. 8, the interval between the
forward computation’s initiation signal (fwd_0) and the
completion report (f_done_0) lasts 9.7 µs. During this
interval, the hidden layer, which consists of 8 activators,
processes 9 inputs + 1 bias input. Recall that this design uses
fully-connected topology between the input and the hidden
layers, therefore 80 connection operations were made during
the hidden layer’s forward computation. Thus, the processing
speed is given by:
Processing Speed = 80 / 9.7 µs = 8.24 MCPS
The next category to be tested is the learning speed. It is
important to note that the measurement of the learning speed
covers only the operation described by equation ( 8 ),
i.e. the weight update. This specific operation takes place
during state ‘7’ (or ‘BCK4’, as it appears in Fig. 4) of the
activator’s state machine. From Fig. 9, we derive
that the hidden layer’s weight update takes 14.65 µs.
Fig. 9. A waveform of the time period during which the activator control stays
in the ‘7’ state, which is responsible for the weight update.
Similarly to the forward computation, there were 80
connections, and therefore 80 weight values were updated
during this time interval. The derived learning
speed is given by:
Learning Speed = 80 / 14.65 µs = 5.46 MCUPS
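Both figures can be reproduced directly from the connection count and the measured intervals; the following sketch merely restates the arithmetic above in software.

```python
# Restating the MCPS/MCUPS arithmetic: 8 hidden activators, each with
# 9 inputs + 1 bias, give 80 connections; the intervals are the ones
# read from the waveforms in Fig. 8 and Fig. 9.

def millions_per_second(connections, interval_s):
    """Connections (or connection updates) per second, in millions."""
    return connections / interval_s / 1e6

connections = 8 * (9 + 1)                                 # 80 connections
processing = millions_per_second(connections, 9.7e-6)     # ~8.24 MCPS
learning = millions_per_second(connections, 14.65e-6)     # ~5.46 MCUPS
```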
VI. CONCLUSIONS
A. Goals vs. Results
The initial goal of the research was to create a flexible
neural design. As the development went on, additional goals
emerged, such as generalization of the design. Eventually,
the design became not only efficient, but also modular and
extensible. The modularity provides the ability to easily
assemble any neural structure, with any number of activators
and layers.
The functionality of the architecture proved correct
in the conducted tests. First, in order to ensure basic
correctness, limited tests were conducted on small training
sets. Afterward, more comprehensive tests were run in
order to measure the precision of the architecture; these
tests showed a precision of 96%.
As for the performance, the design was tested with a
limited number of activators (8 in the hidden layer). It showed
a performance similar to that of the designs reviewed in [8].
B. Improvement Suggestions and Further Work
In this section, we enumerate suggestions for future works
that can rely on our current design.
Deep Learning. Our design fits deep neural
architectures and can be instantiated as a DNN,
providing a platform for advanced research, such as [12].
Making the state machines more compact. In the current
design, some state machines use excess states to ensure
robustness. These redundant states create more transitions
during execution, resulting in longer latencies. Additional
effort can be made to examine whether these
redundant states are necessary.
Avoiding multiplication in non-arithmetic blocks. The
synthesis process creates “parasitic” multipliers in specific
areas of the design. These areas should be investigated and
the HDL code there rewritten in a form that does not
trigger multiplier synthesis. The goal is to have
only one multiplier per activator module.
Fixed-point accuracy. The neural computations are
performed mostly on fractional numbers between -1 and 1. In
our design we used the floating-point format, which
requires 32 bits per numeric value. Rather than
floating point, a less flexible format can be used. For
example, the Q15 fixed-point format requires only 16 bits and
still supplies good precision.
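As an illustration of the suggested Q15 format (one sign bit, 15 fractional bits, representable range [-1, 1)), a software sketch of quantization and multiplication might look like the following; the values are arbitrary examples.

```python
# Illustrative sketch of the Q15 fixed-point format suggested above:
# 16 bits total, one sign bit, 15 fractional bits, range [-1, 1).

Q = 15  # number of fractional bits

def to_q15(x):
    """Quantize a float in [-1, 1) to a signed 16-bit Q15 integer."""
    return max(-2**Q, min(2**Q - 1, round(x * 2**Q)))

def from_q15(q):
    """Convert a Q15 integer back to a float."""
    return q / 2**Q

def q15_mul(a, b):
    """Multiply two Q15 values; the 30-bit product is shifted back to Q15."""
    return (a * b) >> Q

w = to_q15(0.5)     # 16384
x = to_q15(-0.25)   # -8192
y = from_q15(q15_mul(w, x))  # -0.125, exact in this case
```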
Utilize on-chip learning. The idea of on-chip learning was
not implemented in our design. It implies running the
learning phase prior to optimizing the implementation,
which should reduce the number of actual
connections required for FPGA programming [10].
Utilize DSP features. Features such as multiply-accumulate
(MAC) [11] are likely to replace the sequential
forward computations. Another example is the dot product,
which can be useful for the weight update. Today, these DSP
features are already embedded in the evaluation boards;
hence additional research can be conducted in order to utilize
them as part of the neural network.
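As a software analogy (not the DSP primitive itself), a neuron's forward computation collapses into a single dot product, which is exactly the pattern a fused MAC chain evaluates one term per cycle; the weights and inputs below are illustrative values.

```python
# A neuron's forward step expressed as a dot product: the pattern a DSP
# multiply-accumulate (MAC) primitive would evaluate term by term.
# Weights and inputs are illustrative values, not from the design.
import math

def mac_dot(weights, inputs):
    """Accumulate w*x pairs, as a fused MAC chain would."""
    acc = 0.0
    for w, x in zip(weights, inputs):
        acc += w * x
    return acc

def neuron_forward(weights, inputs):
    """Pre-activation via MAC, followed by the sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-mac_dot(weights, inputs)))

weights = [0.5, -0.3, 0.8, 0.1]   # last weight multiplies the bias input
inputs = [1.0, 2.0, -1.0, 1.0]    # bias input fixed at 1.0
activation = neuron_forward(weights, inputs)
```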
Reducing the number of special registers in the activator.
Registers such as muDelta and error store
intermediate calculation results. These registers can be merged
into a single register that stores only the last calculation.
APPENDIX
Fig. A.2. Schematic view of an artificial neural network. This specific network comprises 3 input neurons, followed by 3 hidden neurons, followed by
2 output neurons. Additional logic between the layers triggers the next layer’s computation.
Fig. A.1. Artificial neuron’s top-level schematic. The main control is implemented in Activator_Ctl. The secondary state machines are implemented in
sigmoid, slope_calc_ctl, mult_ctl, Weights, Select_f and Select_b. The ACC_MULT unit is the core of the floating-point computations. Storage takes
place in the Weights unit.
ACKNOWLEDGMENT
We would like to thank Avi Efrati, the VLSI lab manager, for
introducing this research to us and for outstanding assistance
during the development. We also thank Guy Lampert, the
Xilinx application engineer, for his assistance during the board
configuration stage. In addition, we thank Xilinx
for its cooperation with Tel Aviv University and for supplying the
ZedBoard evaluation kits.
REFERENCES
[1] A. R. Omondi and J. C. Rajapakse, Eds., FPGA Implementations of
Neural Networks. New York, NY, USA: Springer, 2006.
[2] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning
representations by back-propagating errors,” Nature, vol. 323,
pp. 533–536, 1986.
[3] J. Misra and I. Saha, “Artificial neural networks in hardware: A
survey of two decades of progress,” Neurocomputing, vol. 74, no. 1,
pp. 239–255, 2010.
[4] A. Gomperts, A. Ukil, and F. Zurfluh, “Development and implementation
of parameterized FPGA-based general purpose neural networks for
online applications,” IEEE Transactions on Industrial Informatics,
vol. 7, no. 1, pp. 78–89, 2011.
[5] K. Basterretxea, J. M. Tarela, and N. Mastorakis, “Approximation of
sigmoid function and the derivative for artificial neurons,” in Advances
in Neural Networks and Applications. Athens: WSES Press, 2001,
pp. 397–401.
[6] S. Sahin, Y. Becerikli, and S. Yazici, “Neural network implementation
in hardware using FPGAs,” in Neural Information Processing. Springer
Berlin Heidelberg, 2006, pp. 1105–1112.
[7] R. Remadevi, “Design and simulation of floating point multiplier based
on VHDL,” International Journal of Engineering Research and
Applications, vol. 3, no. 2, pp. 283–286, 2013.
[8] P. Ienne, “Architectures for neuro-computers: Review and performance
evaluation,” EPFL Technical Report 93/21, 1993.
[9] K. Bache and M. Lichman, UCI Machine Learning Repository
[http://archive.ics.uci.edu/ml]. Irvine, CA: University of California,
School of Information and Computer Science, 2013.
[10] C.-J. Lin and C.-Y. Lee, “FPGA implementation of a recurrent neural
fuzzy network with on-chip learning for prediction and identification
applications,” Journal of Information Science and Engineering, vol. 25,
no. 2, pp. 575–589, 2009.
[11] N. Nedjah et al., “Dynamic MAC-based architecture of artificial neural
networks suitable for hardware implementation on FPGAs,”
Neurocomputing, vol. 72, no. 10, pp. 2171–2179, 2009.
[12] Y. Taigman, M. Yang, M. A. Ranzato, and L. Wolf, “Web-scale training
for face identification,” arXiv preprint arXiv:1406.5266, 2014.