Fast Neural Network Training on FPGA Using Quasi-Newton
Optimization Method
Abstract:
In this brief, a customized and pipelined hardware implementation of the quasi-Newton
(QN) method on a field-programmable gate array (FPGA) is proposed for fast onsite
training of artificial neural networks, targeting embedded applications. The architecture
is scalable to cope with different neural network sizes and supports batch-mode
training. Experimental results demonstrate the superior performance and power efficiency
of the proposed implementation over CPU, graphics processing unit (GPU), and existing
FPGA QN implementations.
Software Implementation:
ModelSim
Xilinx 14.2
Existing System:
Field-programmable gate arrays (FPGAs) have been considered a promising
alternative for implementing artificial neural networks (ANNs) in high-performance ANN
applications because of their massive parallelism, high reconfigurability (compared with
application-specific integrated circuits), and better energy efficiency (compared with
graphics processing units (GPUs)). However, the majority of existing FPGA-based
ANN implementations are static implementations for specific offline applications
without learning capability. While achieving high performance, static hardware
implementations of ANNs suffer from low adaptability across a wide range of applications.
Onsite training provides dynamic training flexibility for ANN implementations.
Currently, high-performance CPUs and GPUs are widely used for offline training, but they
are not suitable for onsite training, especially in embedded applications. Implementing
onsite training in hardware is complex for the following reasons. First, some software
features, such as floating-point number representations or advanced training techniques,
are impractical or expensive to implement in hardware. Second, the implementation of
batch-mode training increases the design complexity of the hardware. In batch training,
the ANN's weight update is based on the training error over all samples in the training
data set, which enables good convergence and effectively guards against sample data
perturbation attacks (batch and per-sample updates are contrasted in the sketch after this
paragraph). However, implementing batch training on an FPGA platform requires large data
buffers and fast computation of intermediate weights. There have been FPGA
implementations of ANNs with online training. These implementations used the
backpropagation (BP) learning algorithm with non-batch training. Although widely used,
the BP algorithm converges slowly, since it is essentially a steepest descent method.
Several advanced training algorithms, such as conjugate gradient, quasi-Newton (QN), and
Levenberg–Marquardt, have been proposed to speed up convergence and reduce training
time, at the cost of memory resources.
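To make the contrast concrete, the following is a minimal Python sketch of batch versus per-sample (non-batch) steepest-descent updates; the function names and the generic grad_fn interface are illustrative, not part of this brief:

import numpy as np

def batch_update(w, X, Y, grad_fn, lr):
    """Batch mode: one weight update from the training error
    accumulated over ALL samples in the data set."""
    g = grad_fn(w, X, Y)          # gradient over the whole batch
    return w - lr * g

def online_update(w, X, Y, grad_fn, lr):
    """Non-batch (online) mode, as in the cited BP implementations:
    one weight update per training sample."""
    for x, y in zip(X, Y):
        w = w - lr * grad_fn(w, x[None, :], y[None, :])
    return w

The batch variant is what requires the large data buffering and fast intermediate-weight computation noted above, since every update touches the whole training set.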
This brief provides a feasible solution to the challenges of implementing onsite training
in hardware by realizing the QN training algorithm, with batch-mode training support, on
a modern FPGA. Our previous work implemented the Davidon–Fletcher–Powell QN (DFP-QN)
algorithm on an FPGA and achieved a 17 times speedup over the software implementation.
To further improve performance, this brief proposes a hardware implementation of the
Broyden–Fletcher–Goldfarb–Shanno QN (BFGS-QN) algorithm on an FPGA for ANN training.
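Algorithm 1 itself is not reproduced in this brief. For reference, the standard BFGS update of the inverse-Hessian approximation B, which step 5 of Algorithm 1 computes, can be sketched in NumPy as follows (treating B as the inverse-Hessian approximation is an assumption based on the usual BFGS formulation):

import numpy as np

def bfgs_update(B, s, y):
    """Standard BFGS update of the inverse-Hessian approximation B.

    s: weight change   w_{k+1} - w_k   (n-vector)
    y: gradient change g_{k+1} - g_k   (n-vector)
    """
    sy = s @ y                     # scalar s^T y
    By = B @ y                     # matrix-by-vector product (an MVM)
    return (B
            + ((sy + y @ By) / sy**2) * np.outer(s, s)
            - (np.outer(By, s) + np.outer(s, By)) / sy)

The B @ y product here is exactly the n x n matrix-by-vector multiplication that dominates the hardware cost, as discussed in the B matrix computation section below.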
To the best of our knowledge, this is the first reported FPGA hardware accelerator of the
BFGS-QN algorithm implemented in a deeply pipelined fashion. The performance of the
implementation is analyzed quantitatively. The architecture is designed with low
hardware cost in mind, for scalable performance and for both onsite and offline training.
The designed accelerator is applied to ANN training and is able to cope with different
network sizes. The experimental results show that the proposed accelerator achieves a
performance improvement of up to 105 times over the CPU software implementation across
various neural network sizes. The hardware implementation is also applied to two
real-world applications, showing superior performance.
Disadvantages:
Lower performance
Lower power efficiency
Proposed System:
FPGA Hardware implementation
Because the computations of B, λ, and g are the three computationally intensive parts of
the BFGS-QN algorithm, we mainly describe how these three parts are designed.
B Matrix Computation
The B computation (BC) block derives an n × n matrix, where n is the total number of
weights, according to step 5 of Algorithm 1. The most computationally intensive
operations are matrix-by-vector multiplications (MVMs). For scalable performance and a
modularized architecture, each MVM is implemented as multiple vector-by-vector
multiplications (VVMs). Three on-chip RAM units, each storing n words, are used to
store the intermediate computation results, which are repeatedly used in subsequent computations.
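A minimal NumPy sketch of this decomposition follows (the row loop is illustrative; in hardware every dot product runs on a single deeply pipelined VVM unit):

import numpy as np

def mvm_as_vvms(B, v):
    """Matrix-by-vector multiplication (MVM) decomposed into n
    vector-by-vector multiplications (VVMs), one per matrix row,
    mirroring how the BC block reuses one pipelined VVM unit."""
    n = B.shape[0]
    out = np.empty(n)
    for i in range(n):
        out[i] = B[i, :] @ v       # one VVM: an n-element dot product
    return out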
Fig. 1. Flowchart of the GSS algorithm.
Fig. 2. Architecture of λ computation block. (a) Forward–backward block for determining a search
interval [a0, b0]. (b) GSS block. The operations in the dashed blocks share the same hardware.
In addition, an n-word first-in first-out (FIFO) buffer is used to match the two input
data streams of the final addition and to form Bk+1 row by row. The computation time of
the BC block is determined by L_mult, L_sub, and L_div, the latencies of the multiplier,
subtractor, and divider, respectively, and by T_vector, the number of execution cycles of
a VVM. A deeply pipelined VVM unit is implemented with T_vector = n + L_mult +
L_add(⌈log2 L_add⌉ + 2), where L_add is the adder latency and the last two terms account
for draining the pipeline.
Step Size λ Computation
An exact line search method is implemented to find λk according to step 2 of
Algorithm 1. The method comprises two submodules: the forward–backward algorithm, which
determines an initial search interval containing the minimizer, and the golden section
search (GSS) algorithm shown in Fig. 1, which reduces that interval iteratively. Fig. 2
shows the hardware architecture, where the computation time of each block is analyzed.
The architecture comprises a number of pipelined arithmetic operators and the objective
function E_T evaluation module.
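A minimal software sketch of the GSS loop of Fig. 1 follows, assuming the forward–backward block has already produced a bracket [a0, b0]; the tolerance parameter and function names are illustrative:

import numpy as np

GOLDEN = (np.sqrt(5.0) - 1.0) / 2.0    # golden ratio constant, ~0.618

def golden_section_search(f, a0, b0, tol=1e-6):
    """Shrink the bracket [a0, b0] around the minimizer of a unimodal
    objective f, reusing one interior function evaluation per step."""
    a, b = a0, b0
    x1 = b - GOLDEN * (b - a)
    x2 = a + GOLDEN * (b - a)
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 > f2:                    # minimizer lies in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + GOLDEN * (b - a)
            f2 = f(x2)
        else:                          # minimizer lies in [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = b - GOLDEN * (b - a)
            f1 = f(x1)
    return 0.5 * (a + b)

# Example: minimize E(lam) = (lam - 0.3)^2 over the bracket [0, 1].
lam = golden_section_search(lambda t: (t - 0.3) ** 2, 0.0, 1.0)

Each iteration evaluates the objective function only once, which is why a fast hardware objective-function block, described next, dominates the λ-computation time.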
Fig. 3. (a) Architecture of the objective function evaluation module. (b) Microarchitecture of the neural
model computation.
Fig. 4. Architecture of the GC unit.
The dashed blocks in Fig. 2(a) and (b) represent the same piece of hardware, shared by
the corresponding operations. The complex objective function (1) must be evaluated
frequently during the λ computation. Therefore, we also implement the objective function
evaluation in hardware, as a separate block, for high performance. This block can be
modified to suit different types of ANNs while the other parts of the implementation
remain the same. As shown in Fig. 3(a), the block is implemented in two parts: the first
part exercises the ANN to obtain its outputs, and the second part computes the training
error. Fig. 3(b) shows the structure of the first part. Two accumulation units are
implemented to obtain h and ŷ. Note that this accumulation unit is implemented
differently from the one in the VVM unit: it uses variable-length shift registers.
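The brief does not give the network equations, so purely as an illustration, the two-part evaluation of Fig. 3(a) might look as follows in software for a one-hidden-layer feedforward ANN with a sigmoid activation (the network shape, the activation, and the 0.5 factor in the error are assumptions):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def objective_E_T(W1, W2, X, Y):
    """Two-part evaluation as in Fig. 3(a): part 1 exercises the ANN
    on every training sample; part 2 accumulates the training error."""
    H = sigmoid(X @ W1)            # hidden activations h, all samples
    Y_hat = H @ W2                 # network outputs y_hat
    return 0.5 * np.sum((Y_hat - Y) ** 2)   # batch training error E_T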
Table I. Storage space required by each module.
Table II. FP units required in the implementation.
Resource Usage Analysis
During the computation, intermediate results are accessed in different patterns: 1) some
results, such as w and g, which feed multiple calculation modules, are read more than
once, and 2) other values, such as h, F(h), and δ, are read in a different order from
that in which they were written. Batch training must also be supported. Therefore, all
intermediate results are buffered in FPGA on-chip memories, enabling pipelined and
parallel computation and data reuse. FPGA on-chip dual-port RAMs are used to implement
the buffers. The storage space required by each module is shown in Table I. As n
increases, off-chip memory will be needed, and a properly designed memory hierarchy
spanning on-chip and off-chip memories is required to avoid a slowdown. In addition, all
training data are stored in off-chip memories for applications with a large number of
training samples. The overall data path of the BFGS-QN hardware design requires a number
of floating-point (FP) units, as shown in Table II. The number of required FP units is
independent of n.
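Tying the three blocks together, here is a minimal software reference for one BFGS-QN training iteration, reusing the bfgs_update sketch given earlier; grad_fn and line_search are illustrative stand-ins for the GC and λ-computation blocks:

def qn_train_step(w, g, B, grad_fn, line_search):
    """One BFGS-QN iteration over the training batch.

    w: weight vector, g: gradient at w, B: inverse-Hessian
    approximation maintained by the BC block."""
    d = -(B @ g)                       # search direction (an MVM)
    lam = line_search(w, d)            # forward-backward + GSS step size
    w_next = w + lam * d
    g_next = grad_fn(w_next)           # batch gradient (GC unit)
    B_next = bfgs_update(B, w_next - w, g_next - g)   # step 5 (BC block)
    return w_next, g_next, B_next

In the hardware, these steps are pipelined and their intermediate results are the buffered values (w, g, h, F(h), δ) listed above.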
Advantages:
Higher performance
Higher power efficiency