FPGA IMPLEMENTATION OF APPROXIMATE SOFTMAX FUNCTION FOR EFFICIENT CNN INFERENCE

e-ISSN: 2582-5208
International Research Journal of Modernization in Engineering Technology and Science
( Peer-Reviewed, Open Access, Fully Refereed International Journal )
Volume:03/Issue:09/September-2021 Impact Factor- 6.752 www.irjmets.com
Mohammed Abdullah Mubarak Alshahrani*1
*1Department of Electrical and Computer Engineering, King Abdulaziz University,
Jeddah, Makkah, Saudi Arabia.
ABSTRACT
Softmax function is an integral part of object detection frameworks based on most deep or shallow neural
networks. While the configuration of different operation layers in a neural network can be quite different,
softmax operation is fixed. With the recent advances in object detection approaches, especially with the
introduction of highly accurate convolutional neural networks, researchers and developers have suggested
different hardware architectures to speed up the overall operation of these compute-intensive algorithms.
Xilinx, one of the leading FPGA vendors, has recently introduced a deep neural network development kit for
exactly this purpose. However, due to the complex nature of softmax arithmetic hardware involving
exponential function, this functionality is only available for bigger devices. For smaller devices, this operation is
bound to be implemented in software. In this paper, a light-weight hardware implementation of this function
has been proposed which does not require too many logic resources when implemented on an FPGA device.
The proposed design is based on the analysis of the statistical properties of a custom convolutional neural
network when used for classification on a standard dataset, namely CIFAR-10. Specifically, instead of using a
brute-force approach to design a generic full-precision arithmetic circuit for the SoftMax function using real
numbers, an approximate integer-only design has been suggested for the limited range of operands encountered
in real-world scenarios. The approximate circuit uses fewer logic resources since it involves computing only the
first few terms of the series expansion of the exponential function. However, despite using fewer terms, the
function has been shown to work as well as the full-precision circuit for classification and leads to only minimal
error being introduced in the associated probabilities. The circuit has been synthesized using Hardware Description
Language (HDL) Coder and Vision HDL toolboxes in Simulink® by Mathworks® which provide higher level
abstraction of image processing and machine learning algorithms for quick deployment on a variety of target
hardware. The final design has been implemented on a Xilinx FPGA development board i.e. Zedboard which
contains the necessary hardware components such as USB, Ethernet and HDMI interfaces etc. to implement a
fully working system capable of processing a machine learning application in real-time.
Keywords: FPGA, High-Level Synthesis, Machine Learning, Convolutional Neural Networks, SoftMax.
I. INTRODUCTION
The rise of machine learning applications for object classification in images and videos has raised the demand for
real-time realization of such complex algorithms on dedicated hardware. In recent years, Convolutional
Neural Networks (CNN) have made their mark as the most effective machine learning algorithm for
classification in various domains. Most CNN architectures use SoftMax as the final layer before the
output layer. SoftMax is a normalization operation which transforms arbitrary input operands into
a probability distribution whose values sum to one. Mathematically, it is represented as,
$$\sigma(x)_i = \frac{e^{x_i}}{\sum_{k=1}^{j} e^{x_k}} \qquad (1)$$
where ‘x’ is a ‘j’-dimensional input vector and ‘$\sigma(x)_i$’ is the perceived output probability of its ‘i-th’ element.
Thus, for a classification operation with ‘j’ number of classes, the softmax function can be used to calculate the
relative confidence score of the classifier for each class. This can be understood from the standpoint of a neural
network [1] with certain hidden processing layers and an output classification layer such as that depicted in
Fig. 1.
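As a point of reference, eq. (1) can be sketched in a few lines of Python. This is an illustrative floating-point model rather than the hardware design; the function name `softmax` and the sample scores are assumptions chosen for the example. Subtracting the maximum input before exponentiation is a common numerical safeguard and leaves the result unchanged.

```python
import math

def softmax(x):
    """Full-precision softmax of eq. (1) for a list of real-valued scores."""
    m = max(x)  # shift by the maximum to avoid overflow; the ratios are unchanged
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for a three-class problem
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
print(probs)       # three probabilities, the largest belonging to the first class
print(sum(probs))  # sums to one up to floating-point rounding
```

The class with the largest input always receives the largest probability, which is why approximations of the exponent can change the confidence scores without necessarily changing the predicted class.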

Figure 1: Data flow through a typical neural network
Typically, the outputs of a neural network depicting a given class (e.g. a dog or a cat) are independent logistic
regression classifiers with values within the range [0, 1]. In order to convert these independent output values to
a probability vector, the SoftMax function given by eq. (1) is employed. Without this function, each class
will have an independent real-valued probability, but the results will not necessarily add up to one,
making them difficult to interpret. This function is a generalization of the logistic function used in
logistic regression classifiers so that it can be used for multiple classes. While other layers of neural networks
employ simpler multiplication and addition arithmetic units, SoftMax is particularly complex in nature due to
the inclusion of exponential and division operations. Thus, several hardware-based implementations suffer from
exorbitantly large resource utilization and poor performance due to long logic delay. To this end, many
researchers have proposed approximate implementations of this unit which necessarily result in loss of result
precision. Thus, there is a need to explore a variety of circuit design techniques to lower the computational
complexity of this inevitable component of neural networks while keeping the result precision at the acceptable
levels. Moreover, given the complex nature of the parent neural networks themselves, there is a need to
simplify the overall design process as well so that the design and test procedures could be completed within
reasonable time. This calls for employing high-level synthesis tools and developing the frameworks to make
this process easier for neural network experts less acquainted with the hardware design process. In this work,
we have designed a SoftMax implementation that is efficient in terms of resource utilization of hardware logic
circuits while keeping the result accuracy at the acceptable levels. The framework has been developed using
high-level synthesis and testing tools.
II. LITERATURE REVIEW
Given the importance of SoftMax function in deep neural networks, several research works have considered its
hardware implementation using different strategies. Recently, Li et al. [2] have described an FPGA-based
hardware approximate implementation using Look Up Tables (LUT) and piece-wise linear interpolation. This
work is based on a pipelined approach and uses multistage Wallace and other multiplier structures to speed up
the overall computation. Such an intricate structure is needed since the Softmax function requires multiple
exponential, addition, multiplication and division units. Similarly, Kouretas and Paliouras [3] have described
another approximate computing architecture with adaptive approximation to trade off complexity against
accuracy of the results. Realizing the high complexity, long critical path delay and associated overflow problem,
Yuan [4] has suggested using down-scaling and domain transformation to eliminate the aforementioned
problems. On the same pattern, Du et al. [5] have described a Softmax implementation based on LUTs and an
arithmetic unit to calculate natural logarithm using Maclaurin series expansion. Since the main computational
unit in the Softmax function is the exponent, its direct implementation using well-known hardware design
techniques is also relevant. Thus, a CORDIC algorithm-based FPGA implementation as suggested by Rekha and
Menon [6] and a short Taylor expansion-based implementation proposed by Jamro et al. [7] also provide
further insight into the problem of speeding up this expensive operation using a dedicated hardware
implementation. Other hardware implementation techniques for arithmetic units, such as Distributed
Arithmetic [8] and Common Sub-expression sharing [9], can also be considered for reducing the associated
circuit complexity.

III. METHODOLOGY
In this work, we propose to build a SoftMax hardware accelerator on an FPGA device and consider
various resource-saving techniques mentioned in the literature to arrive at a configuration
suitable for current-generation deep neural networks. Specifically, signal statistics have been analyzed to design
the most suitable hardware structure for this important function while conserving precious logic resources. For
this purpose, a popular standard dataset for image recognition, CIFAR-10 [11], has been selected. For
experimentation on this dataset, a custom CNN has been trained in the MATLAB environment. This deep neural
network uses residual links for faster and more efficient training [12] and is depicted in Figure 2.
Figure 2: Custom Residual CNN for CIFAR-10 Dataset
In this work, we have focused only on the hardware implementation of the Softmax layer which is an essential
component of the CNN architecture and determines the final output class as shown in Figure 2. However, given
the complexity of the exponent function of eq. (1), its corresponding hardware implementation is too complex.

Thus, to come up with a low complexity hardware, we have analyzed the signal statistics of the inputs to this
function for the CIFAR-10 dataset and their relation to the final detection accuracy of the whole CNN
architecture.
Figure 3: Confusion matrix for the CIFAR-10 dataset with full precision SoftMax function
Figure 3 shows the confusion matrix obtained when the custom CNN shown in Figure 2 is used to process the
CIFAR-10 dataset with the full-precision SoftMax implementation. The overall classification error is 9.54% on the
validation set and 2.62% on the training set.
Figure 4: Histogram of input values to the SoftMax function for CIFAR-10 dataset
Figure 4 illustrates the data statistics collected for the CIFAR-10 dataset when processed using the custom CNN
architecture shown in Figure 2. The statistics clearly show that the bulk of the input values lie
within a narrow range around 0, i.e. [-10, 10]. Thus, it does not seem appropriate to design a hardware circuit for
a generic wider range of inputs because that leads to a very complex hardware circuit. Since the exponent
function is the main complicated operation in eq. (1), we keep our discussion focused on this operation only. It
can be clearly seen in Figure 5 that if the whole input range of [-20, 40] is considered, the full-precision
exponent function has a very wide output range, i.e. [0, 2.5 × 10^17]. This requires enormous logic resources to
implement the functionality in hardware. Moreover, the input operands are real numbers with fractional parts
as well. This necessitates the use of floating- or fixed-point number formats, which adds to the circuit

complexity even more. Given these two important properties of the input operands, i.e. the range as well as
the format, various approximations have been applied systematically in this work to reduce the computational
load without affecting the classification accuracy beyond reasonable levels. Moreover, the exponent function
itself can be approximated using its series expansion with truncation. However, all these approximations,
i.e. limiting the range, quantizing to integer-only numbers and truncating the series expansion, have to be
analyzed for their combined effect on the result precision when applied to a real-world scenario. For this
purpose, as mentioned earlier, the custom CNN architecture with the approximate SoftMax function has been
evaluated on the standard CIFAR-10 dataset.
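To make the cost of a wide operand range concrete, the following sketch estimates how many bits an unsigned fixed-point format would need to hold the exponent output over a given input range. The helper name `fixed_point_bits` and the sizing rule (enough integer bits for the largest output, enough fractional bits for the smallest output to be at least one least-significant bit) are assumptions made for illustration, not figures taken from the design.

```python
import math

def fixed_point_bits(lo, hi):
    """Estimate (integer_bits, fractional_bits) needed to hold exp(x) for x in [lo, hi]."""
    # Integer bits: enough to represent the largest output exp(hi)
    int_bits = max(0, math.ceil(math.log2(math.exp(hi))))
    # Fractional bits: enough for the smallest output exp(lo) to be >= one LSB
    frac_bits = max(0, math.ceil(-math.log2(math.exp(lo))))
    return int_bits, frac_bits

for lo, hi in [(-20, 40), (-10, 10), (0, 15)]:
    int_bits, frac_bits = fixed_point_bits(lo, hi)
    print(f"x in [{lo}, {hi}]: exp(hi) = {math.exp(hi):.3e}, "
          f"{int_bits} integer bits, {frac_bits} fractional bits")
```

Under these assumptions the full range [-20, 40] needs 58 integer and 29 fractional bits, whereas a range restricted to [0, 15] needs 22 integer bits and no fractional bits.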
Figure 5: Exponent function plot for the range of input values to SoftMax function
As mentioned earlier, the main arithmetic units in the SoftMax function are the exponential and division
functions. The hardware design entry of complex functions such as SoftMax, and their integration within
a larger deep neural network, is however too complex to be handled using the conventional hardware
description language (HDL) approach. Thus, we have employed the high-level synthesis tool supplied by
Mathworks in Simulink, i.e. the HDL Coder toolbox. HDL Coder generates the HDL code (Verilog or VHDL)
automatically, which can then be incorporated as a hardware accelerator in a larger system that combines
software (CPU) and hardware accelerators, an approach called “Hardware-Software Co-Design”. One big advantage
of using Simulink for such designs is that the environment can be easily set up for simulation using real images
(datasets) and the functionality can be tested before finally incorporating it into the hardware. The whole
hardware-software co-design system has been implemented on an FPGA SoC, i.e. the Zedboard, which contains both
processor and programmable logic sections. The main application for the deep neural network runs on the processor
while the SoftMax accelerator has been implemented on the FPGA logic fabric, accessible to the software
through a standard bus interface, i.e. the AXI interconnect.
Figure 6: Approximate exponent function employed as a hardware accelerator within a
hardware-software co-design in an FPGA

Figure 7 shows the whole hardware system with the Zedboard FPGA platform to which a full hardware-software
co-design has been ported. This system uses the Ubuntu operating system to provide the user interface while the
hardware portion incorporates an accelerator for the desired functionality. The hardware and software work
seamlessly together through the standard bus interface as shown in Figure 6 above.
Figure 7: The Hardware-Software Co-Design implemented on Zedboard FPGA for deep neural network
processing on live video stream
IV. RESULTS AND DISCUSSION
As mentioned in the previous section, various approximations can be applied to the SoftMax operation to lower
its computational demand while keeping the result accuracy high. In this section, the effect of these different
approximation methods has been analyzed when applied to a real-world scenario.
Approximating SoftMax Function through Integer-Only Operations
The first approximation technique applied to the whole CNN inference framework on standard CIFAR-10
dataset is the conversion of input operands to the integer-only format by dropping the fractional part without
rounding as given by,
$$\hat{x}_i = \lfloor x_i \rfloor \qquad (2)$$
Although rounding might seem a better method than truncation, truncation did not noticeably degrade the overall
accuracy of the CNN classifier: very similar error rates of 9.56% and 2.89% were observed on the validation
set and the training set respectively. There were, however, small errors in the calculation of confidence
scores on the validation set, as depicted in the histogram of errors shown in Figure 8. It can be noticed that
almost all the confidence scores had zero error, with only a tiny percentage (0.21 %) showing errors as small as
0.01. Thus, it can be safely concluded that integer-only operation does not affect the result accuracy
significantly while reducing the computational load from floating-point operation to integer operation.
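The effect of the integer-only conversion of eq. (2) can be illustrated with a small floating-point model. The sketch below is not the author's test harness; the function names and the sample scores are hypothetical, and the softmax itself is still evaluated in full precision so that only the effect of dropping the fractional part is visible.

```python
import math

def softmax(x):
    """Full-precision softmax used as the reference."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def integer_only_softmax(x):
    """Softmax after dropping the fractional part of each input with floor, as in eq. (2)."""
    return softmax([math.floor(v) for v in x])

# Hypothetical inputs to the SoftMax layer for one image
scores = [3.7, 1.2, -0.4, 0.9]
exact = softmax(scores)
approx = integer_only_softmax(scores)

same_class = exact.index(max(exact)) == approx.index(max(approx))
max_err = max(abs(a - b) for a, b in zip(exact, approx))
print("same predicted class:", same_class)
print("largest confidence-score error:", max_err)
```

For a single hand-picked vector the confidence error can be larger than the dataset-level figures reported here, which are aggregated over the whole validation set.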
Figure 8: Histogram of errors in confidence score of the validation set introduced due
to integer-only operation

Approximating SoftMax Function through Limiting the Operand Range
The second approximation considered in this work is limiting the range of input operands. The first step in this
regard is limiting the operands to non-negative integers only, i.e. [0: ∞]. Later, the range is systematically reduced to
[0: 31], [0: 15] and [0: 7], corresponding to binary representations of 5 bits, 4 bits and 3 bits respectively.
These successive approximations on top of using integer-only operands lead to increasingly larger errors in
both classification accuracy and confidence scores. The errors have been reported in Figures 9 to 12 and Table
1. It can be observed that the result fidelity in both the classification accuracy and the confidence scores has
been largely preserved for integer ranges down to [0: 15] and takes a significant hit below that. Thus, if the
operands are limited to the very narrow range [0: 7], the classification error increases to 22.5% while the
confidence scores can have an error of up to 3.9%. The corresponding histograms of error show the same
trend. In Figure 12, it can be seen that drastically reducing the input operand range to [0: 7] leads to more
frequent non-zero errors. From this data, it can be concluded that the integer-only range [0: 15] can be
used safely with an acceptable level of error introduced due to the approximation in input operand
representation. This leads to significant savings in computation resources since only 4 bits are required
for number representation, compared to the original floating-point representation requiring at least 32 bits for
single precision.
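The range limiting described above can be sketched as a floor followed by saturation to an unsigned b-bit range. Saturation (rather than wrap-around) is an assumption made for this illustration, as are the helper name `clamp_to_bits` and the sample scores.

```python
import math

def clamp_to_bits(x, bits):
    """Floor x to an integer and saturate it to the unsigned range [0, 2**bits - 1]."""
    top = (1 << bits) - 1
    return min(max(math.floor(x), 0), top)

# Hypothetical inputs to the SoftMax layer
scores = [-3.2, 0.5, 6.8, 12.4, 19.9]

for bits in (5, 4, 3):
    limited = [clamp_to_bits(v, bits) for v in scores]
    print(f"{bits} bits, range [0: {(1 << bits) - 1}]: {limited}")
```

With 3 bits the two largest scores both saturate to 7 and can no longer be told apart, which gives some intuition for why the classification error rises sharply for the range [0: 7].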
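Figure 9: Histogram of errors in confidence score of the validation set introduced due to integer-only
operation limited to the range [0: ∞]
Figure 10: Histogram of errors in confidence score of the validation set introduced due to integer-only
operation limited to the range [0: 31]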

Approximating SoftMax Function through series implementation
One of the most common techniques used in the literature to approximate the exponent function is
truncation of its series expansion. Precisely, the Maclaurin series expansion of the exponent function is given as
follows,
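$$e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!} \qquad (3)$$
$$e^x \approx 1 + x \qquad (4)$$
$$e^x \approx 1 + x + \frac{x^2}{2} \qquad (5)$$
$$e^x \approx 1 + x + \frac{x^2}{2} + \frac{x^3}{6} \qquad (6)$$
$$e^x \approx 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \frac{x^4}{24} \qquad (7)$$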
The infinite series for the exponent function given by eq. (3) can be approximated by its first 2, 3, 4 or 5 terms as in
equations 4, 5, 6 and 7 respectively. The results of using these approximations along with the earlier
approximations, i.e. integer-only operation with range limited to [0: 15], have been reported in Figures 13 to 16
and Table 1. It can be noticed that although using only a two-term approximation does not lead to any
further degradation of classification accuracy, an error of about 10% is introduced in the confidence scores. Using a
three-term approximation leads to an error of about 5%, four terms give a 2.85 % error and five terms give a
1.8 % error. With each additional term, the complexity of the operation grows. We, however,
conclude that using three or four terms is sufficient since the error is within an acceptable range. Using five terms
gives a very low error but the additional complexity over four terms is not justified. To further reduce the
hardware complexity associated with the division operation, it is suggested to use the nearest power-of-two
coefficients in eq. (6) to give,
$$e^x \approx 1 + x + \frac{x^2}{2} + \frac{x^3}{8} \qquad (8)$$
As seen from the data in Table 1, this approximation does not affect the detection accuracy while only
marginally affecting the confidence scores.
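The truncated series of equations 4 to 7 can be modelled as follows. This is a floating-point sketch under the assumption that the inputs have already been reduced to non-negative integers in [0: 15]; the function names `exp_series` and `softmax_series` and the sample scores are hypothetical.

```python
def exp_series(x, terms):
    """Sum of the first `terms` terms of the Maclaurin series of exp(x)."""
    total, term = 0.0, 1.0
    for n in range(terms):
        total += term          # add x**n / n!
        term *= x / (n + 1)    # advance to x**(n+1) / (n+1)!
    return total

def softmax_series(x, terms):
    """Softmax with exp replaced by its truncated series (inputs assumed >= 0)."""
    exps = [exp_series(v, terms) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical integer inputs already limited to the range [0: 15]
scores = [9, 4, 0, 2]
for terms in (2, 3, 4, 5):
    probs = softmax_series(scores, terms)
    print(terms, "terms:", [round(p, 4) for p in probs])
```

Because every truncated polynomial has positive coefficients, it is increasing for non-negative inputs, so the largest input keeps the largest score. This is consistent with the classification error staying at 9.73 % in Table 1 while only the confidence scores change.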
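Figure 13: Histogram of errors in confidence score of the validation set introduced due to integer-only
operation limited to the range [0: 15] with 2 term approximation of exponent function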

Table 1. Comparison of different approximation techniques using error on CIFAR-10 Dataset
SN. | Method | Validation Classification Error | Validation Confidence Score Error
1 | Full Precision | 9.54 % | 5 × 10^-7 %
2 | Integer-Only | 9.56 % | 0.21 %
3 | Integer-Only [0: ∞] | 9.56 % | 0.24 %
4 | Integer-Only [0: 31] | 9.56 % | 0.24 %
5 | Integer-Only [0: 15] | 9.73 % | 0.31 %
6 | Integer-Only [0: 7] | 22.5 % | 3.9 %
7 | Integer-Only [0: 15], 2 series terms | 9.73 % | 10.33 %
8 | Integer-Only [0: 15], 3 series terms | 9.73 % | 5.1 %
9 | Integer-Only [0: 15], 4 series terms | 9.73 % | 2.85 %
10 | Integer-Only [0: 15], 5 series terms | 9.73 % | 1.8 %
11 | Integer-Only [0: 15], 4 series terms with power-of-two coefficients | 9.73 % | 3.0 %
The proposed approximate exponent function for the SoftMax operation in the custom CNN architecture has been
implemented in Simulink HDL Coder to generate its HDL code for use in the hardware-software co-design
shown in Figure 6 above. The schematic of this proposed design has been depicted in Figure 17. This is the
implementation of the proposed design given by eq. (8) and gives a classification error of 9.73 % and a
confidence score error of 3.0 % when tested on the CIFAR-10 dataset (Table 1).
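A software model of a shift-and-add version of the approximate exponent might look like the sketch below. It assumes that eq. (8) keeps the 1/2 coefficient and replaces 1/6 with its nearest power of two, 1/8, and that the divisions are realized as truncating right shifts; the actual circuit in Figure 17 may differ in bit widths and rounding. The function names are hypothetical.

```python
def exp_pow2(x):
    """Integer model of 1 + x + x^2/2 + x^3/8 for an integer x in [0, 15].

    Divisions by 2 and 8 become right shifts by 1 and 3 (assumed to truncate).
    """
    x2 = x * x
    x3 = x2 * x
    return 1 + x + (x2 >> 1) + (x3 >> 3)

def softmax_pow2(x):
    """Softmax built on the integer exponent model; the final division is left in software here."""
    exps = [exp_pow2(v) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 4-bit integer inputs
scores = [9, 4, 0, 2]
print([exp_pow2(v) for v in scores])
print([round(p, 4) for p in softmax_pow2(scores)])
print("largest output over [0: 15]:", exp_pow2(15))
```

Under these assumptions the largest output over the 4-bit input range is 549, so each approximate exponent fits in 10 bits, compared with the 58 integer bits needed for a full-precision exponent over the range [-20, 40].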
Figure 17: High-Level circuit diagram for the approximate exponent function using Simulink HDL Coder
V. CONCLUSION
An approximate circuit for implementation of SoftMax function as used in standard CNN architectures has been
described in this work. For this purpose, various approximation techniques related to the range and type of
operands and series expansion of exponent function have been employed. The considered techniques have
been motivated by the actual signal statistics gathered while processing a real world standard dataset i.e.
CIFAR-10. To test the setup, a custom CNN has been trained and tested with the proposed approximations
implemented in the full system. The results show that the proposed approximations lead to negligible loss in
the CNN’s detection accuracy as well as the confidence scores while reducing the circuit complexity significantly.
VI. REFERENCES
[1] Online link: Super Data Science, “Convolutional Neural Networks (CNN): Softmax & Cross-Entropy”
available at https://www.superdatascience.com/blogs/convolutional-neural-networks-cnn-softmax-crossentropy,
accessed 28th Nov, 2020.
[2] Z. Li, H. Li, X. Jiang, B. Chen, Y. Zhang and G. Du, "Efficient FPGA Implementation of Softmax Function for
DNN Applications," 2018 12th IEEE International Conference on Anti-counterfeiting, Security, and
Identification (ASID), Xiamen, China, 2018, pp. 212-216
[3] Kouretas, I.; Paliouras, V. Hardware Implementation of a Softmax-Like Function for Deep Learning.
Technologies 2020, 8, 46
[4] Yuan, B. “Efficient hardware architecture of softmax layer in deep neural network.” 2016 29th IEEE
International System-on-Chip Conference (SOCC) (2016): 323-326.
[5] Gaoming Du, Chao Tian, Zhenmin Li, Duoli Zhang, Yongsheng Yin, and Yiming Ouyang, “Efficient
Softmax Hardware Architecture for Deep Neural Networks”, in Proceedings of the 2019 Great Lakes
Symposium on VLSI (GLSVLSI '19).
[6] R. Rekha and K. P. Menon, "FPGA implementation of exponential function using cordic IP core for
extended input range," 2018 3rd IEEE International Conference on Recent Trends in Electronics,
Information & Communication Technology (RTEICT), Bangalore, India, 2018, pp. 597-600

[7] E. Jamro, K. Wiatr and M. Wielgosz, "FPGA Implementation of 64-Bit Exponential Function for HPC,"
2007 International Conference on Field Programmable Logic and Applications, Amsterdam, 2007, pp.
718-721
[8] NagaJyothi, Grande & Sriadibhatla, Sridevi. (2017). Distributed arithmetic architectures for FIR filters-
A comparative review. 2684-2690. 10.1109/WiSPNET.2017.8300250.
[9] Chip-Hong Chang and Mathias Faust, "A new common subexpression elimination algorithm for
realizing low-complexity higher order digital filters". Trans. Comp.-Aided Des. Integ. Cir. Sys. 29, 5 (May
2010), pp. 844–848.
[10] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, Kurt Keutzer,
“SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size”,
arXiv:1602.07360.
[11] Alex Krizhevsky, “Learning Multiple Layers of Features from Tiny Images”, 2009.
[12] https://www.mathworks.com/help/deeplearning/ug/train-residual-network-for-image-classification.html
