Analysis on Implementation of different CNN Architectures on FPGAs
Undergraduate Thesis (BITS F421T) | Author: Prayag Mohanty | BITS Pilani K.K. Birla Goa Campus
DOI: 10.13140/RG.2.2.31819.77604
Analysis on Implementation of different
CNN Architectures on FPGAs
UNDERGRADUATE THESIS
Submitted in partial fulfillment of the requirements
of BITS F421T Thesis
By
PRAYAG MOHANTY
ID No. 2020A3PS0566G
Under the supervision of:
Dr. AMALIN PRINCE A.
BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE PILANI, GOA CAMPUS
December 2023
Declaration of Authorship
I, Prayag Mohanty, declare that this Undergraduate Thesis titled, ‘Analysis on
implementation of different CNN Architectures on FPGA’ and the work presented in it
are my own. This was undertaken in the First Semester of 2023-24. I confirm that:
● This research was primarily conducted while I was a candidate for a research
degree at this University.
● Any portions of this thesis previously submitted for a degree or qualification at
this or another institution are explicitly identified.
● I consistently and clearly credit any consulted published works of others.
● All quotations are attributed to their original sources. With the exception of such
quotations, the content of this thesis is entirely my own original work.
● I have expressed my gratitude for all significant sources of assistance.
● If the thesis draws on work I conducted collaboratively with others, I have clearly
outlined each individual's contribution, including my own.
Signed:
Date: 12 / 12 / 23
Certificate
This is to certify that the thesis entitled, “Analysis on implementation of different CNN
Architectures on FPGA” and submitted by Prayag Mohanty ID No. 2020A3PS0566G in
partial fulfillment of the requirements of BITS F421T Thesis embodies the work done by
him under my supervision.
_____________________________
Supervisor
Dr. Amalin Prince A.
Professor, Dept. of EEE
BITS-Pilani K.K.Birla Goa Campus
Date: 12 / 12 / 23
“Knowledge is a tool, best shared. So is my thesis :) ”
-Prayag Mohanty
BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE PILANI, K.K.BIRLA GOA
CAMPUS
Abstract
Bachelor of Engineering (Hons.)
Analysis on implementation of different CNN Architectures on FPGA
by Prayag Mohanty
Convolutional Neural Networks (CNNs) are a special type of neural network that
excels at processing structured data such as images and signals. The usage of
Field-Programmable Gate Arrays (FPGAs) in high-performance computing has garnered
significant attention with the advent of Artificial Intelligence. This thesis investigates
the performance and resource utilization of various convolutional neural network (CNN)
models for implementation on Field-Programmable Gate Arrays (FPGAs). The primary
objective is to identify optimal CNN models for FPGA deployment based on their
performance, resource utilization, and other relevant parameters. Two prominent CNN
models, AlexNet and MobileNet, were chosen for analysis. Both models were
implemented on an FPGA platform. Resource utilization metrics, including logic
slices, memory blocks, and DSP slices, were monitored to assess
the hardware requirements of each model. The evaluation results demonstrate that
MobileNet exhibits significantly lower resource utilization compared to AlexNet while
maintaining a commendable level of performance. This suggests that MobileNet is a
more efficient option for deploying CNN models on FPGAs with limited hardware
resources. AlexNet, on the other hand, offers superior performance but at the expense of
higher resource consumption. This makes it a suitable choice for applications where
performance is paramount and resources are less restricted. This analysis provides
valuable insights into the suitability of different CNN models for FPGA implementation
based on their performance and resource utilization characteristics.
Keywords: Convolutional Neural Networks, FPGA, Performance, Resource Utilization,
AlexNet, MobileNet
Acknowledgements
The journey of completing this thesis has been a rewarding but challenging one, and I
would like to express my heartfelt gratitude to those who have supported me throughout
the process. First and foremost, I want to thank my family for their unwavering love and
support. Their constant encouragement and belief in me have been instrumental in
helping me overcome obstacles and persevere through difficulties. I am especially grateful
for the sacrifices they made to enable me to pursue my educational goals. I extend my
sincere thanks to my relatives and friends for their encouragement and understanding. I
owe a debt of immense gratitude to my thesis supervisor, Professor Amalin Prince A.,
whose guidance, expertise, and patience have been invaluable in shaping my research and
helping me refine my work. I am deeply grateful for his insightful feedback,
constructive criticism, and unwavering support throughout the research process. Finally, I
would like to express my sincere appreciation to my institute, BITS Pilani KK Birla Goa
Campus. The institution's excellent academic environment, equipment, and dedicated
faculty have provided me with the foundation and resources necessary to conduct my
research.
Thank you all for your invaluable contributions.
Contents
Declaration of Authorship i
Certificate ii
Abstract iv
Acknowledgements v
Contents vi
List of Figures viii
List of Tables ix
Abbreviations x
1 Introduction 1
1.1 Motivation.................................................................................................................. 1
1.2 Scope & Structure......................................................................................................2
2 Fundamentals 3
2.1 Current Work............................................................................................................ 3
2.1.1 Theoretical Background...................................................................................3
2.1.2 FPGA................................................................................................................ 3
2.2 Literature Review..................................................................................................... 4
2.2.1 AlexNet.........................................................................................................10
2.2.2 ResNet...........................................................................................................11
2.2.3 MobileNet..................................................................................................... 14
3 Design and Implementation 15
3.1 Design......................................................................................................................15
3.2 Implementation.......................................................................................................16
4 Hardware Implementation 20
4.1 Design Methodology................................................................................................20
4.2 HLS Methodology....................................................................................................20
4.3 Design Overview .................................................................................................... 20
4.4 Caching Strategy.....................................................................................................21
List of Tables
Table 1: Specifications of Zedboard ................................................................................. 19
Table 2: Resource Utilization of Final Design .................................................................. 21
Table 3: Hardware execution times of each AlexNet Layer ............................................. 24
Table 4: Simulation Model vs Hardware Implementation ............................................... 25
Table 5: Comparing AlexNet vs MobileNet ...................................................................... 25
Table 6: Comparison of other works to this work ............................................................ 26
Abbreviations
CNN Convolutional Neural Networks
FPGA Field Programmable Gate Arrays
AI Artificial Intelligence
ML Machine Learning
HLS High-Level Synthesis
DSP Digital Signal Processing
Dedicated to my family, friends, relatives
and electronics.
1. Introduction
1.1 Motivation
The field of high-performance computing (HPC) has witnessed a significant shift in recent
years, driven by the ever-increasing demand for processing power across diverse application
domains. This growth is fueled by advancements in various fields, including science,
engineering, finance, and healthcare, each requiring the ability to analyze and process
massive datasets in real-time. To address this growing demand, researchers have turned to
Field-Programmable Gate Arrays (FPGAs) as a promising alternative to traditional CPUs
and GPUs.
FPGAs offer several key advantages over traditional computing architectures. Their
reconfigurable nature allows them to be tailored to specific tasks, leading to significant
performance improvements compared to general-purpose CPUs. Additionally, FPGAs excel
in energy efficiency due to their parallel processing capabilities and optimized hardware
design. This combination of performance and efficiency makes FPGAs ideal candidates for
accelerating computationally intensive workloads in HPC.
Over the past few decades, the field of Artificial Intelligence (AI) has experienced
tremendous progress, revolutionizing numerous aspects of our lives. From image and
speech recognition to natural language processing and autonomous vehicles, AI has
demonstrably impacted various industries and scientific domains. This rapid advancement
is fueled by the increasing availability of computing resources and data, enabling the
development and deployment of complex machine learning algorithms and neural
networks.
However, the growing demand for AI applications necessitates the development of efficient
and scalable neural networks. Traditional software-based implementations often struggle to
handle the demands of real-time processing and resource limitations on mobile and
embedded systems. This is where FPGAs present a compelling solution. With their inherent
parallelism and hardware flexibility, FPGAs can be leveraged to implement efficient neural
networks that deliver superior performance and energy savings compared to software-based
approaches.
The motivation for this project stems from the desire to explore the potential of FPGAs in
accelerating Convolutional Neural Networks (CNNs), a class of neural networks widely
used in various AI applications, particularly image and video processing. CNNs excel in
extracting features and identifying patterns in images, making them instrumental for tasks
such as image recognition, object detection, and image segmentation.
My primary objective is to analyze and compare different CNN architectures available for
implementation on FPGAs. This analysis focuses on key performance metrics like resource
utilization, scalability, and real-time processing capabilities. The ultimate goal is to identify
and optimize a CNN model that delivers the best performance on the Zedboard, a popular
FPGA development platform.
Additionally, the potential for deploying CNNs on low-resource systems like smartphones
motivates this project. This enables the processing of sensitive data directly on the device,
eliminating the need for internet data transmission and ensuring data privacy.
Furthermore, integrating CNNs into embedded systems opens up exciting possibilities for
real-time applications in areas like robotics, autonomous vehicles, and smart home
technologies.
By exploring the implementation of various CNNs on FPGAs, this project aims to
contribute to the development of efficient and scalable AI solutions for resource-constrained
environments. The insights and findings will provide valuable knowledge and pave the way
for future research in the field of hardware-accelerated AI.
1.2 Scope & Structure
The prospect of creating a complete framework capable of analyzing data in real time
was appealing. However, given the task's complexity and my limited prior experience
with neural networks, the scope was narrowed to the following points.
1. The data set was restricted to numeric digits. This provides a simple starting
point before tackling other forms of information such as written language or signals.
2. Only individual pre-existing images were used, restricting the work to static
analysis. Although neural networks capable of analyzing video exist, their complexity
is considerably higher and their suitability for embedded systems is not yet well
established, which would have added risk to the project.
The project was broken down into two independent sub-problems that can be tackled
separately but, when combined, provide the desired overall outcome.
1. This work aims to develop a system configured to run as many layers as desired and
test it using a currently defined CNN configuration, AlexNet. This type of system would
allow a developer to scale a design to fit any size of FPGA.
2. Comparing two CNN architectures, AlexNet and MobileNet on the basis of their
measurable parameters like performance, speed, DSP slice, LUTs etc. on a Zedboard.
This would help determine the compatibility of these models on a sample Zedboard.
2. Fundamentals
2.1 Current Work
2.1.1 Background
Convolutional Neural Networks (CNNs) are a type of artificial intelligence that fall within
the field of machine learning and are also categorized as a deep learning technique.
Neural networks: Inspired by the human brain, neural networks are computational
structures composed of interconnected nodes called neurons. These neurons receive and
process information from each other, mimicking the way synapses in the brain facilitate
communication. This intricate network of connections, numbering in the millions, underlies
the complex thought processes and behavior observed in humans and other intelligent
beings.
Artificial neural networks mimic the way biological neurons interact: each building
block (usually referred to as a neuron) receives several inputs, combines them using
learned weights, and produces an output that is sent on to several other neurons.
Fig. 1 shows the hardware architecture of a neuron.
Fig.1 Neuron Architecture (Reddy, 2019)
A neuron receives multiple inputs, such as pixel values or sound samples, depending on
the application. It multiplies the input vector x by suitable weights w, adds a bias b,
and applies an activation function, producing the output σ(w⋅x + b).
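As a minimal illustration of this computation, a single neuron with a sigmoid activation can be written in a few lines of Python (the input values, weights, and bias below are purely illustrative):

```python
import math

def neuron(x, w, b):
    """Compute sigma(w . x + b) for a single artificial neuron,
    where sigma is the logistic (sigmoid) activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Example: three inputs with hand-picked weights and bias
output = neuron(x=[0.5, -1.0, 2.0], w=[0.4, 0.3, 0.1], b=0.2)
print(round(output, 4))   # sigmoid(0.3) ~ 0.5744
```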
Functionality: Neural networks excel at classifying inputs into predetermined categories.
This ability stems from assigned weights to each neuron within the network. A crucial step
called training determines the specific combination of weights that enables accurate
classification. During this phase, the network receives numerous inputs with known
outputs, and the weights are adjusted iteratively until an optimal configuration is achieved.
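This iterative weight adjustment can be illustrated with the classic perceptron update rule, here training a single neuron to reproduce the logical AND function (the learning rate, epoch count, and data set are illustrative choices, not taken from this thesis):

```python
# Train a single neuron on AND with the perceptron rule:
#   w <- w + lr * (target - prediction) * x
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b, lr = [0.0, 0.0], 0.0, 0.1

for epoch in range(20):                      # repeated presentation of known examples
    for x, target in data:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        err = target - pred                  # adjust weights toward the known output
        w = [wi + lr * err * xi for wi, xi in zip(w, x)]
        b += lr * err

preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0 for x, _ in data]
print(preds)   # after training, the neuron reproduces AND: [0, 0, 0, 1]
```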
Topology: To provide all neurons with a suitable structure for analyzing input data, they
can be organized in various ways. In our project, we will focus on networks where neurons
are arranged in ordered layers, only receiving input from the preceding layer and sending
output to the subsequent one. Consequently, the network's topology is defined by how the
layers are interconnected and the operations performed within each layer, often utilizing
previously learned weights.
Convolutional Neural Networks (CNNs) are a special type of neural networks that are
really good at working with 2D data, like images. They are commonly used for tasks like
identifying objects in images or labeling scenes.
Imagine a 256x256 image with three color channels (RGB). Feeding this pixel data into a
conventional neural network would require millions of weights, due to the typical
connectivity between neurons across layers. However, CNNs leverage the inherent spatial
locality of information in images. For instance, to identify a car in an image, analyzing
pixels in the top-right corner isn't crucial. Features like edges, lines, circles, and contours
provide enough context.
This is where convolutional layers come in. These specialized layers replace fully-connected
layers, allowing the network to focus on local information and extract meaningful features.
Each convolutional layer receives a stack of images as input and generates another stack as
output. These layers utilize small filters (kernels) to scan the input and extract features.
These filters, equipped with learned weights, help the network recognize patterns and
objects in the images.
In essence, CNNs employ convolutional layers to efficiently capture key features in images,
facilitating accurate image understanding and classification.
Convolutional Layer Details:
● Each layer receives a stack of ch_in 2D images with dimensions h_in × w_in,
referred to as input feature maps.
● Each layer outputs a stack of ch_out 2D images with dimensions h_out × w_out,
called output feature maps.
● Each layer utilizes a stack of ch_in × ch_out kernels (2D filters) with dimensions
k × k (typically ranging from 1×1 to 11×11) containing the trained weights.
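From these definitions, the output dimensions and weight count of a convolutional layer follow directly. A small helper sketches the arithmetic (the function name and the defaults of stride 1 and no padding are assumptions made for illustration):

```python
def conv_layer_stats(h_in, w_in, ch_in, ch_out, k, stride=1, pad=0):
    """Output feature-map size and trained-weight count for one
    convolutional layer (ch_in x ch_out kernels of size k x k)."""
    h_out = (h_in + 2 * pad - k) // stride + 1
    w_out = (w_in + 2 * pad - k) // stride + 1
    weights = ch_in * ch_out * k * k          # one k x k kernel per in/out channel pair
    return (h_out, w_out, ch_out), weights

# Example: 3-channel 32x32 input, 16 output channels, 3x3 kernels
shape, n_weights = conv_layer_stats(32, 32, 3, 16, 3)
print(shape, n_weights)   # (30, 30, 16) and 3*16*3*3 = 432 weights
```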
By focusing on local information and utilizing efficient convolutional layers, CNNs achieve
exceptional performance in image-related tasks, solidifying their position as a powerful tool
for image processing and computer vision applications.
Fig.2 Layers in a CNN model (Goodfellow,2016)
Activation and Pooling
Activation: Each linear activation is then passed through a non-linear activation function.
This stage, also known as the "detector stage," introduces non-linearity into the network,
allowing it to learn complex relationships between features. A popular choice for the
activation function is the rectified linear unit (ReLU), which outputs the input value if it is
positive, and zero otherwise.
Pooling: This stage further modifies the layer's output by applying a pooling function.
Pooling functions summarize the output within a specific neighborhood, often reducing the
dimensionality of the data. Common pooling functions include:
● Max pooling: Replaces each output with the maximum value within its rectangular
neighborhood.
● Average pooling: Replaces each output with the average value within its rectangular
neighborhood.
● L2-norm pooling: Replaces each output with the L2 norm of the values within its
rectangular neighborhood.
● Weighted average pooling: Replaces each output with a weighted average based on
the distance from the central pixel.
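The ReLU activation and the max/average pooling variants above can be sketched directly; here a toy non-overlapping 2x2 pooling window runs over a small feature map (the values are illustrative):

```python
def relu(x):
    """Rectified linear unit: pass positives through, clamp negatives to zero."""
    return max(0.0, x)

def pool2x2(fmap, reduce_fn):
    """Apply a pooling function over non-overlapping 2x2 neighborhoods."""
    return [
        [reduce_fn([fmap[i][j], fmap[i][j + 1],
                    fmap[i + 1][j], fmap[i + 1][j + 1]])
         for j in range(0, len(fmap[0]), 2)]
        for i in range(0, len(fmap), 2)
    ]

feature_map = [[1, -2, 3, 0],
               [4, 5, -6, 2],
               [0, 1, 2, 3],
               [2, -1, 0, 1]]

activated = [[relu(v) for v in row] for row in feature_map]
print(pool2x2(activated, max))                      # max pooling
print(pool2x2(activated, lambda n: sum(n) / 4))     # average pooling
```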
By performing these stages sequentially, CNNs extract and learn features from input data,
enabling them to perform complex tasks like image recognition and natural language
processing [4], as shown in Fig. 3 below.
Fig.3: A typical convolutional neural network layer's components (Goodfellow, 2016)
Convolutional networks (ConvNets) can be described using two distinct sets of terminology.
Left-hand View: This perspective treats the ConvNet as a collection of relatively complex layers, each containing
multiple "stages." Each kernel tensor directly corresponds to a network layer in this interpretation.
Right-hand View: This perspective presents the ConvNet as a sequence of simpler layers. Every processing step
within the network is considered its own individual layer. Consequently, not every "layer" possesses learnable
parameters.[4]
Practical Convolution
Convolution in the context of neural networks transcends a singular operation. It involves
the parallel application of multiple convolutions, leveraging the strength of extracting
diverse features across multiple spatial locations. A single kernel can only identify one type
of feature, limiting the richness of extracted information. By employing multiple kernels in
parallel, the network extracts a broader spectrum of features, enhancing its
representational power.
Neural networks often handle data with a richer structure than mere grids of real values.
The input typically consists of "vector-valued observations," where each data point holds
additional information beyond a single value. For instance, a color image presents red,
green, and blue intensity values at each pixel, creating a 3-dimensional tensor. One index
denotes the different channels (red, green, blue), while the other two specify the spatial
coordinates within each channel[4].
Software implementations of convolution often employ "batch mode," processing multiple
data samples simultaneously. This introduces an additional dimension (the "batch axis") to
the tensor, representing different examples within the batch. For clarity, we will disregard
the batch axis in our subsequent discussion [4].
A crucial element of convolutional networks is "multi-channel convolution," where both the
input and output possess multiple channels. This multi-channel nature introduces an
interesting property: the linear operations involved are not guaranteed to be commutative,
even with the implementation of "kernel flipping." Commutativity only holds true when
each operation involves the same number of input and output channels.
To illustrate these concepts, consider a 3-channel color image as the input to a convolutional
layer with multiple kernels. Each kernel extracts a specific type of feature from each
channel, resulting in multiple "feature maps." These feature maps, when combined, form
the output of the convolution operation[4].
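A direct, unoptimized sketch of multi-channel convolution along these lines: each output channel accumulates, over all input channels, the 2D correlation of that channel with its own k x k kernel. The tiny tensors below are illustrative, with stride 1 and no padding assumed:

```python
def multichannel_conv(inp, kernels):
    """inp: [ch_in][h][w]; kernels: [ch_out][ch_in][k][k].
    Returns [ch_out][h_out][w_out] feature maps."""
    ch_in, h, w = len(inp), len(inp[0]), len(inp[0][0])
    k = len(kernels[0][0])
    h_out, w_out = h - k + 1, w - k + 1
    out = []
    for kern in kernels:                       # one kernel stack per output channel
        fmap = [[0.0] * w_out for _ in range(h_out)]
        for c in range(ch_in):                 # accumulate over input channels
            for i in range(h_out):
                for j in range(w_out):
                    fmap[i][j] += sum(
                        kern[c][di][dj] * inp[c][i + di][j + dj]
                        for di in range(k) for dj in range(k))
        out.append(fmap)
    return out

# 2 input channels (3x3 each), 1 output channel, 2x2 kernels
inp = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
       [[1, 0, 1], [0, 1, 0], [1, 0, 1]]]
kernels = [[[[1, 0], [0, 1]], [[1, 1], [1, 1]]]]
print(multichannel_conv(inp, kernels))
```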
Training a Neural Network
Because training is computationally expensive, there are frameworks and tools available to
help with this process. Two popular ones are Caffe and Tensorflow.
In this thesis, different frameworks were explored gradually, starting with simpler ones and
gradually moving towards more advanced ones, as we had limited prior knowledge.
There exist two primary forms of training for neural networks:
1. Full training: In situations where an ample amount of data is accessible, it is possible
to train all the network weights to enhance results tailored to the specific application.
2. Transfer learning: Frequently, insufficient data is available to train all the weights
from the ground up. In such instances, a prevalent strategy involves employing a
pre-trained network designed for a distinct application. The majority of layer weights are
repurposed, with only the final layer being adjusted to align with the requirements of the
new application.
2.1.2 FPGAs
Field-Programmable Gate Arrays (FPGAs) are a type of integrated circuit that can be
reprogrammed and reconfigured countless times after they have been manufactured.
These devices form the foundation of reconfigurable computing, a computing approach
that emphasizes splitting applications into parallel, application-specific pipelines. FPGAs
have reconfigurable logic resources like LUT (Look-Up Tables), DSPs (Digital Signal
Processors), and BRAMs (Block RAMs). These resources can be connected and configured
in various ways, allowing the implementation of different electronic circuits. The allure
of reconfigurable computing lies in its ability to merge the rapidity of hardware with the
adaptability of software, essentially bringing together the most advantageous features of
both hardware and software.
Harnessing the computational power of FPGAs takes a leap forward with distributed
computing. This strategy clusters FPGAs, dividing problems into smaller tasks for
parallel processing. By working as a team, this distributed network unlocks significant
performance gains through parallelization.
This approach offers key benefits:
● Scalability: Easily add FPGAs to the cluster as computational demands grow.
● Efficiency: Shared resources and coordinated tasks optimize resource utilization.
● Flexibility: Adapt and optimize the configuration to meet specific needs.
● Performance: Parallelization boosts processing speed for quicker results.
Distributed FPGAs hold promise in various fields:
● HPC: Solve complex scientific and engineering problems faster.
● AI: Train and deploy AI models with the necessary power and scalability.
● Real-Time Applications: Meet the demanding requirements of latency-sensitive
fields like robotics and autonomous systems.
High-Level Synthesis and FPGAs
For over three decades, engineers have relied on Hardware Description Languages (HDLs)
to design electronic circuits implemented in FPGAs. This approach, while established,
requires a significant investment of time and expertise. Writing detailed descriptions of
each hardware component can be tedious and demands a deep understanding of the
underlying hardware structure.
However, a fresh and promising paradigm shift has emerged in recent years: High-Level
Synthesis (HLS). This innovative approach leverages the familiarity and convenience of
high-level languages like C to design hardware. Dedicated tools then translate this
high-level code into an equivalent hardware description in a lower-level language, known as
Register Transfer Level (RTL).
Several compelling advantages make HLS an increasingly attractive choice for hardware
design:
● Maturity and Stability: HLS tools have evolved significantly, offering improved
reliability and a clearer understanding of the generated hardware behavior.
● Efficiency and Performance: HLS can often produce hardware that rivals, or even
surpasses, the efficiency achieved by manually crafted HDL code. This efficiency
gain, combined with the significantly faster development cycle, makes HLS a
compelling option.
Given these undeniable benefits, HLS has been chosen as the technology of choice for this
thesis, paving the way for a more efficient and accessible approach to FPGA design.
2.2 Literature Review
Deep Learning
Deep learning utilizes artificial neural networks, inspired by the human brain, to perform
machine learning tasks. These networks consist of multiple layers organized hierarchically,
enabling them to learn complex patterns from data.
Each layer progressively builds upon the knowledge acquired by the previous layer. The
initial layers extract fundamental features, like edges or lines, from the input data.
Subsequent layers combine these basic features into more complex shapes and objects,
culminating in the identification of the desired target.
Imagine training a deep learning model to recognize hands in images. The initial layers
would learn to detect edges and lines, the building blocks of shapes. Moving up the
hierarchy, the network would combine these basic elements into more complex features, like
ovals and rectangles, which could represent fingers, palms, and wrists. Finally, the topmost
layers would recognize these combined features as specific to hands, allowing the network
to differentiate them from other objects.
While focusing on hand identification, the network simultaneously learns about other
objects present in the training data. This allows it to generalize its knowledge and apply it
to other contexts, recognizing hands in diverse environments and situations. This
hierarchical learning process, where simple features are gradually combined to form
complex representations, is the core of deep learning's success. It allows the network to
effortlessly handle complex tasks, making it a powerful tool for various applications.
ZynqNet, derived from the SqueezeNet topology [8] initially designed for embedded
systems, is tailored to be FPGA-friendly through modifications during development. The
topology comprises an initial convolutional layer, 8 identical fire modules, and a final
classification layer, each containing 3 convolutional layers. Notably, efforts were made to
align hyperparameters with power-of-two values.
Key points of improvement in this thesis include:
1. HW Definition: Zynqnet's original hardware accelerator is only partially implemented
on the Xilinx Zynq board [6], working closely with an ARM processor. In contrast, the
presented accelerator is fully hardware-designed, adapting to runtime layer variations
without software intervention.
2. Fixed Point: To mitigate FPGA overhead, fixed-point computations replace the 32-bit
floating-point implementation used in Zynqnet. The Ristretto tool [7] guides bit width and
fractional bits, applying manual fine-tuning.
3. Data vs. Mem: Significant size and memory reductions occur by reducing classification
items and employing 8-bit fixed-point weights. This optimization simplifies the system,
eliminating external memory access and prioritizing computation speed over memory
volume in the accelerator.
2.2.1 AlexNet
Introduced in 2012, AlexNet is a pioneering deep learning architecture developed using
the ImageNet database. Its authors trained a deep convolutional neural network on 1.2
million high-resolution images, each with dimensions of 224x224 RGB pixels (Li, F., et
al., 2017). Achieving a top-1 error rate of 37.5% and a top-5 error rate of 17.0%, the
network comprised 60 million parameters, 650,000 neurons, five convolutional layers
followed by ReLU and max-pooling layers, three fully connected layers, and a 1000-way
softmax classifier [9]. The architecture, illustrated below, marked the first prominent
use of the Rectified Linear Unit as an activation function, deviating from the conventional
sigmoid activation. This groundbreaking implementation secured victory in the ImageNet
LSVRC-2012 competition, with the entire network trained on two GTX 580 GPUs.
Fig.4: Visual representation of AlexNet architecture.
Illustration shows the layers used and their interconnectivity.(Krizhevsky, 2012)
Fig.5: Visual representation of AlexNet architecture.
Illustration shows the layers used and their interconnectivity. (Li, F., et. al, 2017)
2.2.2 VGGNet
In 2014, Karen Simonyan and Andrew Zisserman, researchers at the University of Oxford's
Visual Geometry Group, introduced VGGNet, a groundbreaking architecture that
significantly improved upon the capabilities of its predecessor, AlexNet. VGGNet's key
innovation was its increased depth, achieved by adding more convolutional layers. These
layers utilized smaller receptive fields, primarily 3x3 and 1x1 filters, enabling the network
to extract more detailed and nuanced features from the input images.
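The benefit of stacking small filters can be checked with a little arithmetic: two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 filter but with fewer weights. The sketch below illustrates this; the channel count is an arbitrary example, not a VGG-specific value.

```python
def receptive_field(num_layers, k=3):
    """Receptive field of num_layers stacked k x k convolutions, stride 1."""
    return 1 + num_layers * (k - 1)

def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution layer (biases ignored)."""
    return k * k * c_in * c_out

C = 64  # example channel count (assumption)
print(receptive_field(2))            # 5: two 3x3 layers see a 5x5 region
print(2 * conv_params(3, C, C))      # 73728 weights for two 3x3 layers
print(conv_params(5, C, C))          # 102400 weights for one 5x5 layer
```

The stacked variant also interposes an extra non-linearity between the two 3x3 layers, which is part of why the deeper configurations extract richer features.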
Simonyan and Zisserman tested various configurations of their network, all adhering to a
general design but differing in depth. They experimented with 11, 13, and 19 weight layers,
with each depth further divided into sub-configurations. Among these configurations,
VGG16 and VGG19 emerged as the top performers. VGG16 achieved a top-1 error rate of
27.3% and a top-5 error rate of 8.1%. VGG19, with its increased depth, further improved
upon these results, achieving a top-1 error rate of 25.5% and a top-5 error rate of 8.0% [7].
As expected, the increased depth of VGG16 and VGG19 led to a significant rise in the
number of parameters: VGG16 has 138 million parameters, while VGG19 has 144 million.
Figure 6 provides a visual comparison of VGG16, VGG19, and their predecessor AlexNet,
highlighting the significant architectural advancements made by VGGNet. This innovative
architecture took first place in the localization task of the 2014 ImageNet LSVRC challenge
(and second in classification), solidifying its place as a landmark achievement in the field of
deep learning.
Fig. 6: Visual representation of VGGNet architecture & AlexNet (right)
Illustration shows the layers used and their interconnectivity.
(Li, F., et. al, 2017)
2.2.3 ResNet
In 2015, a team from Microsoft, including Kaiming He, Xiangyu Zhang, Shaoqing Ren, and
Jian Sun, developed the ResNet architecture as an enhancement to VGGNet. Recognizing
the importance of network depth for accuracy, they addressed the "vanishing gradient"
problem during backpropagation by introducing "deep residual learning." This novel
framework incorporated "Shortcut Connections," hypothesized to simplify training and
optimization while overcoming the gradient issue (He et al., 2015; Li, F., et al., 2017).
Fig.7: Residual Learning: a building block of the ResNet architecture. (He et. al., 2015)
For their experimentation they constructed a 34-layer plain network with no shortcut
connections and a 34-layer network with shortcut connections, a ResNet. They also
configured several networks with incrementally increasing layer counts, from 34 layers to
152 layers. Overall, the 34-layer ResNet outperformed the 34-layer plain network, and the
top-5 error rate achieved by the 152-layer network in the 2015 ImageNet LSVRC
competition was 3.57%. This network architecture won the 2015 ImageNet LSVRC
challenge. (Li, F., et. al, 2017)
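The shortcut connection itself is straightforward to express. The following minimal NumPy sketch (with arbitrary small dimensions, not He et al.'s actual layer sizes) shows the y = F(x) + x structure of a residual block.

```python
import numpy as np

def residual_block(x, w1, w2):
    """y = relu(F(x) + x): two weighted transforms plus the identity shortcut."""
    relu = lambda v: np.maximum(v, 0.0)
    fx = relu(x @ w1) @ w2       # the residual function F(x)
    return relu(fx + x)          # the shortcut adds the input back

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8))
w1 = rng.standard_normal((8, 8)) * 0.1
w2 = rng.standard_normal((8, 8)) * 0.1
y = residual_block(x, w1, w2)
print(y.shape)   # (1, 8): the identity shortcut requires matching shapes
```

Because the shortcut is an identity path, gradients can flow through it unattenuated during backpropagation, which is what mitigates the vanishing-gradient problem in very deep stacks of such blocks.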
Fig.8 represents the building block of the ResNet Architecture.
2.2.4 MobileNet
MobileNets, which were originally developed by Google for mobile and embedded vision
applications [12], are distinguished by their use of depth-wise separable convolutions,
which reduce trainable parameters when compared to networks with regular convolutions
of the same depth. MobileNetV2 introduced linear bottlenecks and inverted residuals,
resulting in lightweight deep neural networks that are ideal for the scenario under
consideration in this work.
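The parameter savings of depth-wise separable convolutions are easy to quantify: a standard KxK convolution needs K*K*Cin*Cout weights, while the depth-wise separable version needs only K*K*Cin + Cin*Cout. A small sketch with example channel counts (the layer shape is an assumption for illustration):

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a regular k x k convolution."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depth-wise k x k filter per input channel, then 1x1 point-wise mixing."""
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 32, 64   # example layer shape
print(standard_conv_params(k, c_in, c_out))         # 18432
print(depthwise_separable_params(k, c_in, c_out))   # 2336 -- roughly 8x fewer
```

The reduction grows with the output channel count, which is why MobileNets stay lightweight even at useful depths.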
Fig. 9: Visual representation of MobileNet architecture
2.2.5 Other work
Several research groups have explored implementing Convolutional Neural Networks
(CNNs) on FPGAs, achieving impressive results in terms of performance and efficiency.
Here's a summary of five notable works:
1. Real-Time Video Object Recognition System (Neurocoms, South Korea, 2015)
● Architecture: Custom 5-layer CNN developed in Matlab.
● Input: Grayscale images (28x28).
● Platform: Xilinx KC705 evaluation board.
● Frequency: 250MHz.
● Power consumption: 3.1 watts.
● Resource utilization: 42,616 LUTs, 32 BRAMs, 326 DSP48s.
● Data format: 16-bit fixed point.
● Performance: Focused on frames per second.
Figure 10: Neurocoms work using 6 neurons and 2 receptor units
(Ahn, B,2015)
This paper describes a real-time video object recognition system implemented on an FPGA.
The system consists of a receiver, a feature map, and a detector. The receiver decodes and
pre-processes the video stream, the feature map extracts features using a CNN, and the
detector identifies objects by comparing the features to a database.
Key takeaways:
● Real-time performance
● Very efficient (3.1 watts power consumption)
● FPGA implementation enables high performance
2. Small CNN Implementation (Institute of Semiconductors, Chinese Academy of
Sciences, Beijing, China, 2015)
● Architecture: 3 convolutional layers with activation, 2 pooling layers, 1 softmax
classifier.
● Input: 32x32 images.
● Platform: Altera Arria V FPGA board.
● Frequency: 50MHz.
● Data format: 8-bit fixed point.
● Performance: Focused on images per second.
Figure 11: Chinese Academy logic architecture
(Li, H et.al., 2015)
This paper presents a small CNN implementation on an FPGA. The CNN consists of three
convolutional layers with activation, two pooling layers, and a softmax classifier. The input
images are 32x32 and the data format is 8-bit fixed point. The CNN is implemented on an
Altera Arria V FPGA board and operates at a frequency of 50MHz.
Key takeaways:
● The CNN achieves a frames per second rate of 50, which is sufficient for real-time
video processing.
● The CNN achieves an accuracy of 93.6% on the MNIST handwritten digit
classification task.
● The CNN uses 118K LUTs, 112K BRAMs, and 13K DSPs.
3. Angel-Eye System (Tsinghua University and Stanford University, 2016)
● Architecture: Array of custom processing elements.
● Platform: Xilinx Zynq XC7Z045.
● Frequency: 150MHz.
● Data format: 16-bit fixed point.
● Power consumption: 9.63 watts.
● Performance: 187.80 GFLOPS (VGG16 ConvNet).
● Custom compiler: Minimizes external memory access.
Figure 12: Angel-Eye (Left) Angel-Eye architecture. (Right) Processing Element
(Guo, K. et. al., 2016)
4. Customized Software Tools for CNN Accelerator (Purdue University, 2016)
● Platform: Xilinx Kintex-7 XC7K325T.
● Performance: 58-115 GFLOPS.
● Architecture: Custom software tools for optimization.
● Data format: Not specified.
5. Scalable FPGA Implementation of CNN (Arizona State University, 2016)
● Platform: Stratix-V GXA7.
● Frequency: 100MHz.
● Data format: 16-bit fixed point.
● Power consumption: 19.5 watts.
● Performance: 114.5 GFLOPS.
● Resource utilization: 256 DSPs, 112K LUTs, 2,330 BRAMs.
● Shared multiplier bank: Optimizes multiplication operations.
Summary
One major challenge in deploying Deep Learning (DL) models on FPGAs has been their
limited design size. The inherent trade-off between reconfigurability and density restricts
the implementation of large neural networks on FPGAs. However, advancements in
fabrication technology, particularly the use of smaller feature sizes, are enabling denser
FPGAs. Additionally, the integration of specialized computational units alongside the
general FPGA fabric enhances processing capabilities. These advancements are paving the
way for the implementation of complex DL models on single FPGA systems, opening up
new possibilities for hardware-accelerated AI.
3. Design and Implementation
3.1 Design
Let's delve deeper into the individual layers of a convolutional neural network:
1. Input: The network begins with the input image, typically represented as a 3D matrix
with dimensions representing width, height, and color channels (e.g., RGB). In this case,
the image size is 32x32 pixels with three color channels.
2. Convolutional Layer: This layer applies filters to the input image, extracting features
through localized dot product calculations. Applying 12 filters would result in a new 3D
volume with dimensions 32x32x12, where each element represents the activation of a
specific feature at a specific location.
3. ReLU Layer: The rectified linear unit (ReLU) layer applies a non-linear activation
function, typically max(0,x), to each element in the previous volume. This introduces
non-linearity and sparsity into the feature representation, enhancing the network's ability
to learn complex patterns. The volume size remains unchanged (32x32x12).
4. Max Pooling Layer: This layer performs downsampling by selecting the maximum value
within a predefined neighborhood in the input volume. By reducing the spatial dimensions
(e.g., by a factor of 2), the network can achieve translational invariance and reduce
computational complexity. In this case, the resulting volume would be 16x16x12.
5. Affine/Fully Connected Layer: This layer connects all neurons in the previous volume to
each output neuron, essentially performing a weighted sum followed by a bias addition.
This final step calculates the class scores for each possible category, resulting in a 1x1x10
volume where each element represents the score for a specific class.
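The five layers above can be sketched end-to-end in a few lines of NumPy. The filter count and dimensions follow the running 32x32x3 example; the weights are random placeholders, not a trained network.

```python
import numpy as np

def conv_same(x, kernels):
    """Naive 'same' convolution, stride 1: x is HxWxCin, kernels KxKxCinxCout."""
    H, W, _ = x.shape
    K = kernels.shape[0]
    p = K // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros((H, W, kernels.shape[3]))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + K, j:j + K, :]
            out[i, j] = np.tensordot(patch, kernels, axes=([0, 1, 2], [0, 1, 2]))
    return out

def maxpool2(x):
    """2x2 max pooling over the spatial dimensions."""
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

relu = lambda v: np.maximum(v, 0.0)

rng = np.random.default_rng(1)
img = rng.standard_normal((32, 32, 3))             # input: 32x32 RGB image
k = rng.standard_normal((3, 3, 3, 12)) * 0.1       # 12 filters of size 3x3x3
feat = relu(conv_same(img, k))                     # -> (32, 32, 12)
pooled = maxpool2(feat)                            # -> (16, 16, 12)
w_fc = rng.standard_normal((16 * 16 * 12, 10)) * 0.01
scores = pooled.reshape(-1) @ w_fc                 # -> (10,) class scores
print(feat.shape, pooled.shape, scores.shape)
```

The printed shapes trace exactly the 32x32x3 -> 32x32x12 -> 16x16x12 -> 10 progression described in the numbered steps.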
Sequential Processing and Parameter Learning:
Convolutional Neural Networks transform the input image through a series of layers,
gradually extracting features and building increasingly complex representations. While
some layers like ReLU and Max Pooling operate with fixed functions, others like
Convolutional and Fully Connected layers involve trainable parameters (weights and
biases). These parameters are adjusted through gradient descent optimization during
training, allowing the network to learn optimal representations based on labeled data.
3.2 Implementation
Having covered the foundational aspects of Deep Learning and reviewed prominent Deep
Convolutional Neural Network architectures, along with their implementations on FPGA,
let's delve into the specifics of this design. This section outlines the implementation of Deep
Convolutional Neural Networks in FPGA, discussing similarities & distinctions from prior
works, design goals, and the tools employed. Following this, we provide an overview of the
overall architecture intended for implementation on the FPGA. Due to a focus on hardware
implementation and constraints in time, along with the availability of pre-existing, trained
image system data for CNN, code was sourced from the internet. Finally, we
comprehensively examine four key sub-designs: the Convolutional/Affine Layer, ReLU
Layer, Max Pooling Layer, and Softmax Layer.
Similarities
In scrutinizing previous works where groups implemented DCNNs on FPGAs, numerous
similarities emerge between their implementations and the present work. Certain aspects
of DCNNs are inherently common across designs aimed at accelerating DCNNs.
Consequently, essential elements like required layers (e.g., convolution, ReLU, max pool)
and adder trees for summing channel products will not be explicitly discussed in this
section.
Bus Protocol
Firstly, prior works showcase designs employing sub-module intercommunication. Several
designs that utilized separate sub-modules in their overall architecture employed a
communication bus protocol. This approach leverages existing intellectual property from
FPGA manufacturers such as Intel or AMD, allowing the focus to be on the DCNN portion
of the task rather than the infrastructure. Additionally, hardware microprocessors or
implemented co-processors can communicate with the submodules, providing valuable
insights for both software and hardware developers during debugging and verification. The
drawback, however, is that a bus protocol introduces additional overhead to the design due
to handshaking between sub-modules for reliable communication. Moreover, the presence of
the bus protocol necessitates more signal routing, utilizing overall FPGA resources and
potentially leading to increased dwell time with no task being executed. Despite these
drawbacks, effective management can be achieved by carefully planning the overall design's
concept of operations.
DSP Slices
A prevalent aspect shared among prior works and the present study involves the utilization
of Digital Signal Processing (DSP) Slices. These dedicated hardware components excel in
performing multiply and add operations for both floating-point and fixed-precision
numbers. DSP slices outperform custom designs implemented in hardware description
language (HDL). FPGAs benefit from maximizing available DSP slices, enhancing the
speed of designs, especially in Deep Convolutional Neural Networks (DCNNs).
Data Format
In the software domain, Deep Learning research typically employs 64-bit double-precision
floating-point numbers for weight data. While some works have employed 32-bit
single-precision numbers, there is mounting evidence suggesting that reducing the bit size
and format can
significantly impact overall performance. A common alteration is the use of 16-bit
fixed-precision numbers. Alternatively, truncating the 32-bit single number to a 16-bit
"Half" precision number is proposed, presenting a potentially more effective design.
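The trade-off between these formats can be explored numerically. The sketch below compares worst-case rounding error for 16-bit half precision against a hypothetical Q4.12 fixed-point format over random weights in [-1, 1); the choice of format and range is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(42)
weights = rng.uniform(-1, 1, 1000).astype(np.float32)

half = weights.astype(np.float16).astype(np.float32)   # 16-bit "half" precision
scale = 1 << 12                                        # hypothetical Q4.12 format
fixed = (np.round(weights * scale) / scale).astype(np.float32)

half_err = float(np.max(np.abs(weights - half)))
fixed_err = float(np.max(np.abs(weights - fixed)))
print(half_err, fixed_err)   # both stay well below 1e-3 on this range
```

On a narrow range like this, fixed point gives a uniform error bound (half an LSB), while half precision trades accuracy near the extremes of the range for extra accuracy near zero.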
Scalability
Scalability, a crucial feature in previous works and this study, revolves around navigating
through the CNN. As witnessed in other works, the increasing size of software
implementations of DCNNs, exemplified by the 152-layer ResNet design, poses a challenge
for FPGA implementation. To address this, strategies involve implementing reusable
designs capable of performing the functions of all necessary layers in the DCNN
architecture.
Simple Interface
Unlike many previous works, considerable effort has been invested in creating a custom
compiler to completely describe a Deep Convolutional Neural Network in this design. The
aim is to make the DCNN accessible to both software and hardware designers by making
FPGA hardware programmable. The FPGA can be commanded through function calls in the
microprocessor, performing register writes to the FPGA implementation.
Flexible Design
Unlike prior works where CNN designs are tailored to specific hardware boards, this work
aims for a configurable number of DSPs depending on the FPGA in use. Each layer in the
CNN is modular and can interact through a bus protocol, allowing developers to insert
multiple instances of the Convolutional Layer, Affine Layer, and Max Pooling Layer.
Tools
Throughout the development process of implementing a CNN on an FPGA, various tools
were employed. The choice of utilizing Xilinx chips was influenced by their extensive usage
and the author's prior experience with Xilinx products. Consequently, the tools selected for
this development were drawn from the diverse set offered by AMD Xilinx. The central
design environment was the Xilinx Vivado 2021.3 package (refer to Figure 13), serving as
the primary design hub throughout the developmental phase. Within Vivado, each neural
network layer type was crafted as an AXI-capable submodule. Additionally, Vivado
facilitated integration with pre-existing Xilinx Intellectual Property (IP), such as the Zynq
SoC Co-Processor and Memory Interface Generator (MIG). Lastly, Vivado acted as a
platform for software development, enabling the creation of straightforward software to run
on the Zynq SoC.
Fig.13 Xilinx Vivado 2021.3 IDE
Hardware: The FPGA platform chosen was the Zedboard. Digilent's Zedboard
Development Kit provides a matrix of programmable logic blocks and programmable
interconnections. The Zedboard is built around a Xilinx Zynq-7000 SoC, which combines a
dual-core ARM Cortex-A9 processor with FPGA fabric.
Fig. 14 Digilent Zedboard Avnet AES-series Evaluation Kit Zynq-7000 System-on-Chip (SoC) (www.digilent.com)
Table 1: Specifications for Zedboard (www.digilent.com)

SPECIFICATION          DESCRIPTION
SoC Options            XC7Z020-CLG484-1
Memory                 512 MB DDR3; 256 Mb Quad-SPI Flash
Video Display          1080p HDMI; 8-bit VGA; 128 x 32 OLED
User Inputs            8 user switches and 7 user push buttons
Audio                  I2S Audio CODEC
Analog                 XADC header
Configuration Memory   256 Mb Quad-SPI Flash; SD card; onboard USB JTAG
Power                  12 VDC
Certification          CE; RoHS
Dimensions             5.3" x 6.3"
Ethernet               10/100/1000 Ethernet
USB                    USB 2.0
Communications         USB 2.0; USB-UART; 10/100/1000 Ethernet
User I/O               (see User Inputs)
Other                  PetaLinux BSP
4. Hardware Implementation
4.1 Design Methodology
In extensive code projects with multiple instances and increasing complexity, defining the
order and scope of steps, along with how they will be executed, is crucial—comprising the
project's methodology.
4.2 HLS Methodology
As detailed in Section 2.2, High Level Synthesis (HLS) is chosen for hardware
implementation due to its suitability. Xilinx® Vivado HLS is employed in this project,
following its three-step methodology:
1. Software Simulation: This involves testing code execution using a regular software
compiler and CPU, aided by a test bench.
2. Synthesis: Generating HDL files crucial for code implementation and HLS pragmas.
This step is critical, executed after successful software simulation.
3. Co-Simulation: The most significant step, testing synthesized code functionality using a
hardware simulation. It leverages the test bench from software simulation, comparing
outputs and ensuring hardware-software consistency.
Table 2: Simulation Model vs Hardware Implementation

Layer     SIM FOPS     HW FOPS      Diff
CONV1     0.7407 G     0.73530 G    0.74%
CONV2     126.897 M    113.796 M    12.89%
CONV3     35.158 M     29.106 M     27.66%
CONV4     26.645 M     20.830 M     27.91%
CONV5     26.574 M     20.763 M     27.99%
AFFINE1   176.322 M    113.884 M    54.83%
AFFINE2   87.677 M     38.077 M     130.26%
AFFINE3   33.919 M     20.229 M     83.23%
4.3 Design Overview
Fig.15 Final Top-Level FPGA Design
Before delving into the pipelined core and other enhancements, understanding the module's
top-level functionality is vital. The system comprises three modules and a group of
memories:
1. Pipelined Core: This module serves as the computational powerhouse, receiving layer
parameters, weight information, and input data from the Flow Control module. It executes
the necessary calculations and generates the desired outputs.
2. Convolution Flow Control: This module acts as the conductor, ensuring the proper
execution of the network topology. It determines whether update or classification tasks are
required and orchestrates access to all memory units and relevant layer parameters.
3. Memory Controller: This module acts as the memory interface, deciphering read/write
positions for data exchange with the memory units. It receives instructions from both the
Flow Control and Pipelined Core modules, ensuring smooth data flow and efficient memory
utilization.
By understanding the interactions and responsibilities of these modules, we gain a clear
understanding of how the system operates as a whole. This high-level perspective provides
a valuable foundation for delving deeper into the specific details of the individual
components and their contributions to the overall system performance.
4.4 Caching Strategy
An organized loop order and data-reuse optimization allow reused information to be stored
locally, avoiding the overhead of repeated on-chip memory accesses. Caches are needed for
the kernel and bias, the output, and the input.
1. Kernel and Bias Caches: Simplest caches loaded at the beginning and updated during
channel changes.
2. Output Cache: More complex due to irregular access pattern, loading bias and
computing ReLU for performance maximization.
3. Input Cache: Most complex, addressing reuse issues with a group of multiple registers
that displace information every iteration.
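The displacement behaviour of the input cache can be mimicked in software with a small line-buffer model. The sketch below is an illustrative analogy, not the actual HDL design: each arriving row displaces the oldest one, so every pixel is fetched from memory only once while all k x k windows are still produced.

```python
from collections import deque

def sliding_windows(row_stream, k=3):
    """Yield k x k windows from a stream of image rows using k line buffers.
    Each newly arriving row displaces the oldest one, so every input value is
    read from memory only once, mirroring the register-displacement idea."""
    lines = deque(maxlen=k)
    for row in row_stream:
        lines.append(row)
        if len(lines) == k:
            for j in range(len(row) - k + 1):
                yield [line[j:j + k] for line in lines]

image = [[r * 4 + c for c in range(4)] for r in range(4)]   # 4x4 test ramp
wins = list(sliding_windows(iter(image)))
print(len(wins))   # 4 windows for a 4x4 image with 3x3 kernels
print(wins[0])     # [[0, 1, 2], [4, 5, 6], [8, 9, 10]]
```

In hardware the deque corresponds to shift registers: the same reuse pattern, but with all k x k values available in parallel every cycle.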
Memory Controller: Arrays Merging
Adapting access patterns between layers and facilitating simultaneous access to multiple
elements are essential for varying memory requirements.
Fixed-Point Implementation
Following Ristretto's fixed-point analysis of the network, the bit width and fractional bits
are defined. Accordingly, Xilinx® Vivado HLS's fixed-point arithmetic type definition
(ap_fixed&lt;bit width, frac bits&gt;) is used. Runtime reconfiguration is managed using integers
and bit shifts for fixed-point operations, since Vivado HLS requires a compile-time
definition of fractional bits.
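The integer-plus-bit-shift trick can be illustrated as follows; the Q-format chosen here is an arbitrary example, not the network's actual configuration.

```python
def fixed_mul(a_q, b_q, frac_bits):
    """Multiply two fixed-point integers sharing frac_bits fractional bits.
    The raw product carries 2*frac_bits fractional bits, so one right shift
    restores the format -- the integer-plus-bit-shift approach."""
    return (a_q * b_q) >> frac_bits

frac = 8                           # runtime-selected Q-format (example)
a = int(1.5 * (1 << frac))         # 1.5  -> 384
b = int(2.25 * (1 << frac))        # 2.25 -> 576
print(fixed_mul(a, b, frac) / (1 << frac))   # 3.375
```

Because the shift amount is an ordinary integer, the fractional position can change from layer to layer at runtime even though the underlying storage type is fixed at compile time.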
Following is a detailed explanation of the four test benches shown in the diagrams:
1. Convolutional/Affine Layer Virtual Memory Test Bench
This test bench verifies the functionality of the convolutional/affine layer implementation
by comparing its outputs to the expected outputs generated by a reference software model.
The test bench loads the input and kernel data into virtual memory and then performs the
convolutions/affine operations. The outputs are then compared to the expected outputs to
ensure that the implementation is correct.
2. Convolutional/Affine Layer Block RAM Test Bench
This test bench is similar to the virtual memory test bench, but it stores the input and
kernel data in block RAM instead of virtual memory. This test bench is useful for verifying
the performance of the convolutional/affine layer implementation, as it can achieve higher
throughput by avoiding the overhead of accessing virtual memory.
3. Max Pool Layer Virtual Memory Test Bench
This test bench verifies the functionality of the max pool layer implementation by
comparing its outputs to the expected outputs generated by a reference software model. The
test bench loads the input data into virtual memory and then performs the max pooling
operation. The outputs are then compared to the expected outputs to ensure that the
implementation is correct.
4. Max Pool Layer Block RAM Test Bench
This test bench is similar to the virtual memory test bench, but it stores the input data in
block RAM instead of virtual memory. This test bench is useful for verifying the
performance of the max pool layer implementation, as it can achieve higher throughput by
avoiding the overhead of accessing virtual memory.
The diagram shows the four test benches connected to a common input and output
interface. This allows the test benches to be easily swapped in and out, depending on the
layer being tested.
Input and Output Interface: This interface provides a common way to load input data into
the test benches and to read the output data from the test benches. The interface can be
implemented using a variety of methods, such as FIFO buffers, DMA transfers, or
memory-mapped register access.
Virtual Memory: Virtual memory is used to store the input and kernel data for the
convolutional/affine layer and the max pool layer virtual memory test benches. Virtual
memory allows the test benches to access large amounts of data without having to load it
all into physical memory at once.
Block RAM: Block RAM is used to store the input data for the convolutional/affine layer and
the max pool layer block RAM test benches. Block RAM is a type of on-chip memory that is
faster than virtual memory, but it has a limited capacity.
Test Bench Control Logic:
The test bench control logic is responsible for loading the input and kernel data into the test
benches, performing the convolutions/affine operations or the max pooling operation, and
comparing the outputs to the expected outputs. The test bench control logic can be
implemented using a variety of different methods, such as a finite state machine, a
microcontroller, or a software program.
The four test benches described above are essential tools for verifying the functionality and
performance of convolutional neural network implementations on FPGAs. By using these
test benches, designers can ensure that their implementations are correct and that they
meet the desired performance requirements.
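The compare-against-reference pattern shared by all four test benches can be sketched in a few lines. The fixed-point "hardware" model and its parameters below are hypothetical stand-ins for the synthesized layer, not the thesis's actual implementation.

```python
import numpy as np

def reference_affine(x, w, b):
    """Software reference model in floating point."""
    return x @ w + b

def hardware_affine(x, w, b, frac_bits=10):
    """Stand-in for the synthesized fixed-point layer (hypothetical model)."""
    s = 1 << frac_bits
    xq = np.round(x * s).astype(np.int64)
    wq = np.round(w * s).astype(np.int64)
    bq = np.round(b * s).astype(np.int64)
    return ((xq @ wq >> frac_bits) + bq) / s

rng = np.random.default_rng(7)
x = rng.uniform(-1, 1, (1, 16))
w = rng.uniform(-1, 1, (16, 4))
b = rng.uniform(-1, 1, 4)

expected = reference_affine(x, w, b)
actual = hardware_affine(x, w, b)
assert np.allclose(expected, actual, atol=0.05), "mismatch beyond tolerance"
print("test bench passed")
```

The tolerance accounts for quantization error; a real co-simulation test bench compares the synthesized outputs against the software model in exactly this way.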
Performance Evaluation and Analysis
After implementing all optimization techniques, the accelerator was ready to classify
images using trained network weights. To simulate the hardware behavior, Xilinx® Vivado
HLS Co-simulation was employed. Images from the validation dataset, which achieved 73%
accuracy with Ristretto, were evaluated. The simulation process, spanning over 185 hours,
resulted in an overall 58% accuracy, requiring 26 million cycles per image.
With a relatively small critical path, a 100MHz clock can be utilized, enabling the
processing of approximately 4 frames per second. These results are deemed successful, as
the achieved accuracy meets the project's minimum threshold, and the performance
surpasses the lower limit by nearly fourfold. Consequently, no further modifications are
required, and the accelerator is prepared for deployment.
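The frame-rate figure follows directly from the cycle count and clock frequency:

```python
clock_hz = 100e6            # 100 MHz clock
cycles_per_image = 26e6     # ~26 million cycles per image (from co-simulation)

seconds_per_image = cycles_per_image / clock_hz
fps = 1.0 / seconds_per_image
print(seconds_per_image, fps)   # 0.26 s per image, roughly 3.85 fps
```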
Table 3: Resource Utilization of Final Design (AlexNet)

Resource   Utilization   Available   Utilization %
LUT        36527         53200       68.66
LUTRAM     2594          46200       5.61
FF         41198         106400      38.72
BRAM       54            140         38.22
DSP        35            220         16.08
IO         69            285         24.21
BUFG       7             32          21.88
MMCM       2             10          20
PLL        1             10          10

Resource Utilization Optimization
While the accelerator described in Section 3.2 is functional and implementable, the
pipelined core's low resource footprint (35 DSPs, 41,000 flip-flops, and 36,500 LUTs) allows
for potential modifications or duplications to reduce the pipeline depth. This situation is
particularly suited for HLS optimization, as it can sometimes surpass human design
capabilities (see Table 3).
Initially, Vivado generated two core instances with a 4-stage pipeline, requiring 26,596,261
cycles, due to different memory inputs. To improve this design, various configurations were
explored using the function_instantiate pragma, creating four core instances. By sharing
resources effectively, only 15% more DSPs, 27% more flip-flops, and 33% more LUTs were
utilized compared to the two-core implementation. This configuration enabled reducing two
of the four pipelines by one stage each. However, this modification resulted in a negligible
0.2% performance improvement, ultimately leading to its rejection.
Here are some parameters to compare different CNN implementations on FPGA:
● Throughput: Throughput is the amount of input data that can be processed per
unit time. It is an important parameter for measuring the performance of a CNN
implementation on an FPGA, and is measured here in FOPS.
● Latency: Latency is the time taken by the CNN to process one input data.
● Resource utilization: It is an important parameter to measure the efficiency of a
CNN implementation on an FPGA.
● Power consumption: It is a crucial parameter to measure the energy efficiency
of a CNN implementation on an FPGA.
● Accuracy: It is an important parameter to measure the effectiveness of a CNN
implementation on an FPGA.
● Flexibility: Flexibility is the ability of the CNN implementation to adapt to
different CNN models and configurations.
● Ease of use: how readily the implementation can be configured and programmed.
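Throughput in FOPS is simply the operation count divided by the execution time; a small sketch with hypothetical numbers:

```python
def throughput_fops(num_ops, seconds):
    """Operations per second for one layer or one whole network."""
    return num_ops / seconds

# Hypothetical layer: 50 million operations completed in 80 ms
ops, latency = 50e6, 0.080
print(throughput_fops(ops, latency))   # ~6.25e8, i.e. 0.625 GFOPS
```

This is the relationship behind the per-layer FOPS columns in the tables that follow: each layer's throughput reflects both its operation count and its measured execution time.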
Table 4: Hardware execution times of each AlexNet Layer

Layer     Start Time   End Time (us)   Epoch / Cycle      Total Time       FOPS
CONV1     0            71198.67        0x1161e / 0x43     71.19867 ms      0.7456 G
CONV2     0            547753.71       0x85BA9 / 0x47     547.75371 ms     108.806 M
CONV3     0            463776.90       0x713A0 / 0x5A     463.77690 ms     24.858 M
CONV4     0            697862.14       0xaa606 / 0x0E     697.86214 ms     16.551 M
CONV5     0            466757.25       0x71f45 / 0x19     466.75725 ms     16.543 M
AFFINE1   0            796440.32       0xc2718 / 0x20     796.44032 ms     110.922 M
AFFINE2   0            1018890.52      0xf8c0a / 0x34     1018.89052 ms    33.446 M
AFFINE3   0            4682.26         0x124A / 0x1A      4.68226 ms       17.769 M

Table 4 shows that the convolutional layers (CONV1-CONV5) are among the most time-consuming
layers in the network, accounting for a large share of the total execution time. This is
because convolutional layers perform a large number of floating-point operations.
The fully connected layers (AFFINE1-AFFINE3) also account for a significant portion of
the total execution time: they too perform a large number of floating-point operations, and
they require substantial memory bandwidth for their weights.
The table also shows that, for a given operation count, a layer's FOPS is inversely
proportional to its execution time: layers that take longest relative to their workload
achieve the lowest FOPS.
Overall, the table provides insight into the performance of the AlexNet CNN when
implemented on an FPGA: the convolutional layers consume much of the total execution
time, and per-layer throughput varies widely.
Here are some specific observations from the table:
● Among the convolutional layers, CONV1 finishes fastest (71.199 ms) while sustaining
the highest throughput, 0.7456 GFOPS; CONV4 takes the longest (697.862 ms).
● The AFFINE2 layer has the longest execution time overall, at 1018.891 ms.
● The AFFINE1 layer has the highest throughput among the fully connected layers, at
110.922 MFOPS.
● The AFFINE3 layer has the shortest execution time overall, at 4.682 ms.
Table 4 also reports a per-frame processing time of 796.4403 milliseconds for the AlexNet
CNN, which corresponds to approximately 1.25 frames per second.
Table 5: AlexNet vs MobileNet

Layer     Ops To Perform   AlexNet FOPS   MobileNet FOPS   Difference
CONV1     210249696        0.7407 G       2.9530 G         34878.85
CONV2     62332672         126.897 M      113.796 M        287.09
CONV3     13498752         35.158 M       29.106 M         42.42
CONV4     14537088         26.645 M       20.830 M         30.31
CONV5     9691392          26.574 M       20.763 M         30.15
AFFINE1   90701824         176.322 M      113.884 M        115.9
AFFINE2   38797312         87.677 M       38.077 M         24.67
AFFINE3   94720            33.919 M       20.229 M         19.76
It is important to note that the performance of a CNN implementation on an FPGA can be
affected by a variety of factors, such as the FPGA platform, the CNN architecture, and the
optimization techniques used. Table 5 only provides a comparison of two specific CNN
implementations on a specific FPGA platform.
Table 6: Comparison of other works to this work (AlexNet)

              Guo, K. et al.,  Ma, Y. et al.,  Zhang, C. et al.,  Espinosa, M.,    This Work
              2016             2016            2015               2019
FPGA          Zynq XC7Z045     Stratix-V GXA7  Virtex7 VX485T     Artix7 XC7A200T  Zedboard Zynq AES-Z7EV
Clock Freq    150 MHz          100 MHz         100 MHz            100 MHz          100 MHz
Data format   16-bit fixed     Fixed (8-16b)   32-bit float       32-bit float     32-bit fixed
Power         9.63 W (meas.)   19.5 W (meas.)  18.61 W (meas.)    1.5 W (est.)     0.9 W (est.)
FF            127653           ?               205704             103610           41198
LUT           182616           121000          186251             91865            36527
BRAM          486              1552            1024               139.5            54
DSP           780              256             2240               119              35
Performance   187.80 GFOPS     114.5 GFOPS     61.62 GFOPS        2.93 GFOPS       0.74 GFOPS
Methods of Improvement / Scope
This implementation of a Convolutional Neural Network in an AlexNet configuration is a
first-pass attempt and leaves a lot of room for improvement and optimization. There are
several ways the performance of this implementation could be increased, which would be
areas for future work. Looking at Table 6, we can see the differences in resource utilization
and performance between other recent works and this one. Although this implementation
achieved lower GFOPS performance, it uses far fewer chip resources than any of the other
implementations, and its estimated power consumption is far lower as well.
5. Conclusions
5.1 Results
While Deep Learning and Convolutional Neural Networks (CNNs) have traditionally
resided within the realm of Computer Science, with massive computations performed on
GPUs housed in desktop computers, their increasing power demands raise concerns about
efficiency. Existing FPGA implementations for CNNs primarily focus on accelerating the
convolutional layer and often have rigid structures limiting their flexibility.
This work aims to address these limitations by proposing a scalable and modular FPGA
implementation for CNNs. Unlike existing approaches, this design seeks to configure the
system for running an arbitrary number of layers, offering greater flexibility and
adaptability.
The proposed architecture was evaluated on publicly available CNN architectures like
AlexNet, ResNet, and MobileNet on a Zedboard platform. Performance analysis revealed
MobileNet as the fastest among the three, achieving an accuracy of 47.5%. This
demonstrates the system's potential for efficient and adaptable execution of diverse CNN
architectures.
This work paves the way for further research in scalable and flexible FPGA
implementations for CNNs, offering promising avenues for resource-efficient deep learning
beyond traditional computing platforms.
Appendix
// CNN sample layer model: one layer of NN neurons (only neuron 0 is shown here;
// the remaining instances follow the same pattern with their own weight/bias files).
module Layer_1
#(
    parameter NN             = 30,     // number of neurons in this layer
    parameter numWeight      = 784,    // weights per neuron
    parameter dataWidth      = 16,     // fixed-point data width
    parameter layerNum       = 1,
    parameter sigmoidSize    = 10,
    parameter weightIntWidth = 4,      // integer bits of the weight format
    parameter actType        = "relu"
)
(
    input                     clk,
    input                     rst,
    input                     weightValid,
    input                     biasValid,
    input  [31:0]             weightValue,
    input  [31:0]             biasValue,
    input  [31:0]             config_layer_num,
    input  [31:0]             config_neuron_num,
    input                     x_valid,
    input  [dataWidth-1:0]    x_in,
    output [NN-1:0]           o_valid,
    output [NN*dataWidth-1:0] x_out
);

neuron #(
    .numWeight(numWeight),
    .layerNo(layerNum),
    .neuronNo(0),
    .dataWidth(dataWidth),
    .sigmoidSize(sigmoidSize),
    .weightIntWidth(weightIntWidth),
    .actType(actType),
    .weightFile("w_1_0.mif"),
    .biasFile("b_1_0.mif")
) n_0 (
    .clk(clk),
    .rst(rst),
    .myinput(x_in),
    .weightValid(weightValid),
    .biasValid(biasValid),
    .weightValue(weightValue),
    .biasValue(biasValue),
    .config_layer_num(config_layer_num),
    .config_neuron_num(config_neuron_num),
    .myinputValid(x_valid),
    .out(x_out[0*dataWidth +: dataWidth]),  // neuron 0's slice of the output bus
    .outvalid(o_valid[0])
);

endmodule
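The sample layer above is parameterized with dataWidth=16 and weightIntWidth=4, i.e. a 16-bit fixed-point format with 4 integer bits (sign included) and 12 fractional bits. As a rough sketch of what that format implies for the weights, the Python snippet below quantizes a real value into such a code; the round-to-nearest and saturation policy shown here is an illustrative assumption, not taken from the thesis RTL:

```python
DATA_WIDTH = 16       # total bits, matching dataWidth in Layer_1
WEIGHT_INT_WIDTH = 4  # integer bits (incl. sign), matching weightIntWidth
FRAC_BITS = DATA_WIDTH - WEIGHT_INT_WIDTH  # 12 fractional bits

def quantize(w: float) -> int:
    """Round a real weight to the nearest 16-bit two's-complement code."""
    scaled = round(w * (1 << FRAC_BITS))
    lo, hi = -(1 << (DATA_WIDTH - 1)), (1 << (DATA_WIDTH - 1)) - 1
    return max(lo, min(hi, scaled))  # saturate instead of wrapping on overflow

def dequantize(q: int) -> float:
    """Map a quantized code back to its real value."""
    return q / (1 << FRAC_BITS)

# Example: 0.5 maps to code 2048 (0x0800) with 12 fractional bits.
print(quantize(0.5), dequantize(quantize(0.5)))
```

With 12 fractional bits the smallest representable step is 2^-12 ≈ 0.000244, and the representable range is [-8, 8), which is why the integer width must cover the largest trained weight magnitude.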
Due to space constraints, all the data, references and code can be accessed here: Thesis_Appendix
Bibliography
[1] D. M. Harris and S. L. Harris, Digital Design and Computer Architecture. Elsevier,
(2007)
[2] S.Authors, History of artificial intelligence
[3] Farabet, C., Martini, B., Akselrod, P., Talay, S., LeCun, Y., Culurciello, E.: Hardware
accelerated convolutional neural networks for synthetic vision systems. In: Circuits
and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on. pp.
257–260. IEEE (2010)
[4] Goodfellow, I., Bengio, Y., & Courville, A. Convolutional Networks. In Dietterich,
T. (Ed.), Deep Learning (pp. 326-339). Cambridge, Massachusetts: The MIT Press. (2016)
[5] D. Gschwend, ZynqNet: An FPGA-accelerated embedded convolutional neural network.
[6] Xilinx (2017). Zynq-7000 All Programmable SoC Family Product Tables and Product
SelectionGuide. Retrieved from
https://www.xilinx.com/support/documentation/selection-guides/zynq-7000-product-se
lection-guide.pdf
[7] Romén Neris, Adrián Rodríguez, Raúl Guerra. FPGA-Based Implementation of a
CNN Architecture for the On-Board Processing of Very High-Resolution Remote
Sensing Images, IEEE Journal of Selected Topics in Applied Earth Observations
and Remote Sensing, Vol. 15, 2022.
[8] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer,
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size,
arXiv:1602.07360, (2016)
[9] Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). ImageNet Classification with Deep
Convolutional Neural Networks. Advances in Neural Information Processing Systems,
25 (NIPS 2012). Retrieved from
https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neu
ral-networks.pdf
[10] Zisserman, A. & Simonyan, K. (2014). Very Deep Convolutional Networks For
Large-Scale Image Recognition. Retrieved from https://arxiv.org/pdf/1409.1556.pdf
[11] Li, F., et. al. CNN Architectures [PDF document]. Retrieved from Lecture Notes Online
Website: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture9.pdf
[12] Qiao, Y., Shen, J., Xiao, T., Yang, Q., Wen, M., & Zhang, C. FPGA-accelerated
deep convolutional neural networks for high throughput and energy efficiency.
Concurrency and Computation: Practice and Experience. John Wiley & Sons
Ltd. (May 6, 2016).
[13] Lacey, G., & Taylor, G., & Areibi, S. Deep Learning on FPGAs: Past, Present and
Future. Cornell University Library. https://arxiv.org/abs/1602.04283 (Feb. 13, 2016)
[14] Gomez, P. Implementation of a Convolutional Neural Network (CNN) on a FPGA for
Sign Language's Alphabet recognition. Archivo Digital UPM. Retrieved December 6,
2023, from https://oa.upm.es/53784/1/TFG_PABLO_CORREA_GOMEZ.pdf (2018, July)
[15] Espinosa, M. A. Implementation of Convolutional Neural Networks in FPGA for Image
Classification. ScholarWorks. Retrieved December 6, 2023, from
https://scholarworks.calstate.edu/downloads/hd76s209r (2019, Spring)
[16] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image
Recognition.Retrieved from https://arxiv.org/pdf/1512.03385.pdf
[17] Reddy, G. (2019, January 1). FPGA Implementation of Multiplier-Accumulator Unit
using Vedic multiplier and Reversible gates. Semantic Scholar.
https://www.semanticscholar.org/paper/FPGA-Implementation-of-Multiplier-Accumul
ator-Unit-Rajesh-Reddy/edab41b3600b2b51d6887042487bac32c80182b5
[18] Guo, K., & Sui, L., & Qiu, J., & Yao, S., & Han, S., & Wang, Y., & Yang, H. (July. 13,
2016). Angel-Eye: A Complete Design Flow for Mapping CNN onto Customized
Hardware. IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2016,
pp.24-29. doi:10.1109/ISVLSI.2016.129
[19] Ahn, B. (Oct. 01, 2015). Real-time video object recognition using convolutional neural
networks. International Joint Conference on Neural Networks (IJCNN), 2015.
doi:10.1109/IJCNN.2015.7280718