Deep Convolutional Network evaluation on the Intel Xeon Phi (Gaurav Raina)
With a sharp decline in camera cost and size, along with superior computing power available at increasingly low prices, computer vision applications are becoming ever present in our daily lives. Research shows that Convolutional Neural Networks (ConvNets) can outperform all other methods for computer vision tasks (such as object detection) in terms of accuracy and versatility.
One of the problems with these Neural Networks, which mimic the brain, is that they can be very demanding on the processor, requiring millions of computational nodes to function. Hence, it is challenging for Neural Network algorithms to achieve real-time performance on general-purpose embedded platforms.
Parallelization and vectorization are very effective ways to ease this problem and make it possible to implement such ConvNets on energy-efficient embedded platforms. This thesis presents the evaluation of a novel ConvNet for road speed sign detection on a breakthrough 57-core Intel Xeon Phi processor with 512-bit vector support. This mapping demonstrates that the parallelism inherent in the ConvNet algorithm can be effectively exploited by the 512-bit vector ISA and by utilizing the many-core paradigm.
Detailed evaluation shows that the best mappings require data-reuse strategies that exploit reuse at the cache and register level. These implementations are boosted by the use of low-level vector intrinsics (C-style functions that map directly onto Intel assembly instructions).
Ultimately we demonstrate an approach that can be used to accelerate Neural Networks on highly parallel many-core processors, with execution speedups of more than 12x on single-core performance alone.
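The data-reuse strategies described above target the convolution's multiply-accumulate loops. As an illustrative sketch (a NumPy toy, not the thesis implementation), a naive 2D convolution makes the overlapping windows visible; it is exactly this overlap that rewards register- and cache-level reuse, and the innermost multiply-accumulate is what a 512-bit vector ISA processes 16 floats at a time:

```python
import numpy as np

def conv2d_naive(image, kernel):
    """Naive valid-mode 2D convolution: the multiply-accumulate loops
    that vectorization and data-reuse strategies aim to accelerate."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=image.dtype)
    for y in range(oh):
        for x in range(ow):
            # each output pixel re-reads an overlapping kh x kw window,
            # which is why cache/register-level reuse pays off
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

img = np.arange(16, dtype=np.float32).reshape(4, 4)
k = np.ones((2, 2), dtype=np.float32)
print(conv2d_naive(img, k))
```

Adjacent output pixels share most of their input window, so keeping that window in registers (or L1 cache) avoids repeatedly fetching the same values from memory.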
Early Benchmarking Results for Neuromorphic Computing (Desmond Yuen)
An update on the Intel Neuromorphic Research Community’s growth and benchmark results, including the addition of new corporate members and numerous new benchmark results computed on Intel’s neuromorphic test chip, Loihi.
Trip down the GPU lane with Machine Learning (Renaldas Zioma)
What a Machine Learning professional should know about GPUs!
Brief outline of the deck:
* GPU architecture explained with simple images
* memory bandwidth cheat sheets for common hardware configurations,
* overview of GPU programming model
* an under-the-hood peek at the main building block of ML: matrix multiplication
* effect of mini-batch size on performance
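Two of the bullets above, matrix multiplication and mini-batch size, can be connected in a few lines (a NumPy illustration, not taken from the deck): a fully connected layer is a single matrix multiply, and the batch dimension determines how much work each multiply carries.

```python
import numpy as np

# A fully connected layer is one matrix multiplication:
# (batch, in_features) @ (in_features, out_features) -> (batch, out_features)
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 256)).astype(np.float32)

def dense(batch_x, W):
    return batch_x @ W  # a single GEMM, regardless of batch size

# Larger mini-batches turn many small matrix-vector products into one
# big matrix-matrix product, improving arithmetic intensity on a GPU.
x1 = rng.standard_normal((1, 512)).astype(np.float32)    # batch of 1
x64 = rng.standard_normal((64, 512)).astype(np.float32)  # batch of 64
print(dense(x1, W).shape, dense(x64, W).shape)
```

With a batch of 1 the weights are read once per single output row; with a batch of 64 the same weight read is amortized over 64 rows, which is why mini-batch size affects GPU throughput.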
I originally gave this talk at the internal Machine Learning Workshop at Unity Seattle.
High-quality PDF slides: http://bit.ly/2iQxm7X (on Dropbox)
Computing Performance: On the Horizon (2021) (Brendan Gregg)
Talk by Brendan Gregg for USENIX LISA 2021. https://www.youtube.com/watch?v=5nN1wjA_S30 . "The future of computer performance involves clouds with hardware hypervisors and custom processors, servers running a new type of BPF software to allow high-speed applications and kernel customizations, observability of everything in production, new Linux kernel technologies, and more. This talk covers interesting developments in systems and computing performance, their challenges, and where things are headed."
How Netflix Tunes EC2 Instances for Performance (Brendan Gregg)
CMP325 talk for AWS re:Invent 2017, by Brendan Gregg. "At Netflix we make the best use of AWS EC2 instance types and features to create a high performance cloud, achieving near bare metal speed for our workloads. This session will summarize the configuration, tuning, and activities for delivering the fastest possible EC2 instances, and will help other EC2 users improve performance, reduce latency outliers, and make better use of EC2 features. We'll show how we choose EC2 instance types, how we choose between EC2 Xen modes (HVM, PV, and PVHVM), and the importance of EC2 features such as SR-IOV for bare-metal performance. SR-IOV is used by EC2 enhanced networking, and recently for the new i3 instance type for enhanced disk performance as well. We'll also cover kernel tuning and observability tools, from basic to advanced. Advanced performance analysis includes the use of Java and Node.js flame graphs, and the new EC2 Performance Monitoring Counter (PMC) feature released this year."
AI is Impacting HPC Everywhere (Rob Farber)
In this deck from the Perth HPC Conference, Rob Farber from TechEnablement presents "AI is Impacting HPC Everywhere."
"The convergence of AI and HPC has created a fertile venue that is ripe for imaginative researchers — versed in AI technology — to make a big impact in a variety of scientific fields. From new hardware to new computational approaches, the true impact of deep- and machine learning on HPC is, in a word, “everywhere”. Just as technology changes in the personal computer market brought about a revolution in the design and implementation of the systems and algorithms used in high performance computing (HPC), so are recent technology changes in machine learning bringing about an AI revolution in the HPC community. Expect new HPC analytic techniques including the use of GANs (Generative Adversarial Networks) in physics-based modeling and simulation, as well as reduced precision math libraries such as NLAFET and HiCMA to revolutionize many fields of research. Other benefits of the convergence of AI and HPC include the physical instantiation of data flow architectures in FPGAs and ASICs, plus the development of powerful data analytic services."
Learn more: http://www.techenablement.com/
and
http://hpcadvisorycouncil.com/events/2019/australia-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUs (Jiannan Ouyang, PhD)
These slides were presented at the 12th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE’16).
Virtual Machine based approaches to workload consolidation, as seen in IaaS cloud as well as datacenter platforms, have long had to contend with performance degradation caused by synchronization primitives inside the guest environments. These primitives can be affected by virtual CPU preemptions by the host scheduler that can introduce delays that are orders of magnitude longer than those primitives were designed for. While a significant amount of work has focused on the behavior of spinlock primitives as a source of these performance issues, spinlocks do not represent the entirety of synchronization mechanisms that are susceptible to scheduling issues when running in a virtualized environment. In this paper we address the virtualized performance issues introduced by TLB shootdown operations. Our profiling study, based on the PARSEC benchmark suite, has shown that up to 64% of a VM's CPU time can be spent on TLB shootdown operations under certain workloads. In order to address this problem, we present a paravirtual TLB shootdown scheme named Shoot4U. Shoot4U completely eliminates TLB shootdown preemptions by invalidating guest TLB entries from the VMM and allowing guest TLB shootdown operations to complete without waiting for remote virtual CPUs to be scheduled. Our performance evaluation using the PARSEC benchmark suite demonstrates that Shoot4U can reduce benchmark runtime by up to 85% compared to an unmodified Linux kernel, and up to 44% over a state-of-the-art paravirtual TLB shootdown scheme.
Achieving Performance Isolation with Lightweight Co-Kernels (Jiannan Ouyang, PhD)
These slides were presented at the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '15).
Performance isolation is emerging as a requirement for High Performance Computing (HPC) applications, particularly as HPC architectures turn to in situ data processing and application composition techniques to increase system throughput. These approaches require the co-location of disparate workloads on the same compute node, each with different resource and runtime requirements. In this paper we claim that these workloads cannot be effectively managed by a single Operating System/Runtime (OS/R). Therefore, we present Pisces, a system software architecture that enables the co-existence of multiple independent and fully isolated OS/Rs, or enclaves, that can be customized to address the disparate requirements of next generation HPC workloads. Each enclave consists of a specialized lightweight OS co-kernel and runtime, which is capable of independently managing partitions of dynamically assigned hardware resources. Contrary to other co-kernel approaches, in this work we consider performance isolation to be a primary requirement and present a novel co-kernel architecture to achieve this goal. We further present a set of design requirements necessary to ensure performance isolation, including: (1) elimination of cross OS dependencies, (2) internalized management of I/O, (3) limiting cross enclave communication to explicit shared memory channels, and (4) using virtualization techniques to provide missing OS features. The implementation of the Pisces co-kernel architecture is based on the Kitten Lightweight Kernel and Palacios Virtual Machine Monitor, two system software architectures designed specifically for HPC systems. Finally we will show that lightweight isolated co-kernels can provide better performance for HPC applications, and that isolated virtual machines are even capable of outperforming native environments in the presence of competing workloads.
Talk by Brendan Gregg for YOW! 2021. "The pursuit of faster performance in computing is the driving reason for many new technologies and updates. This talk discusses performance improvements now underway that you will likely be adopting soon, for processors (including 3D stacking and cloud vendor CPUs), memory (including DDR5 and high-bandwidth memory [HBM]), disks (including 3D Xpoint as a 3D NAND accelerator), networking (including QUIC and eXpress Data Path [XDP]), runtimes, hypervisors, and more. The future of performance is increasingly cloud-based, with hardware hypervisors and custom processors, meaningful observability of everything down to cycle stalls (even as cloud guests), and high-speed syscall-avoiding applications that use eBPF, FPGAs, and io_uring. The talk also discusses where future performance improvements might be expected, with predictions for new technologies."
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/altera/embedded-vision-training/videos/pages/may-2015-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Deshanand Singh, Director of Software Engineering at Altera, presents the "Efficient Implementation of Convolutional Neural Networks using OpenCL on FPGAs" tutorial at the May 2015 Embedded Vision Summit.
Convolutional neural networks (CNNs) are becoming increasingly popular in embedded applications such as vision processing and automotive driver assistance systems. The structure of CNN systems is characterized by cascades of FIR filters and transcendental functions. FPGA technology offers a very efficient way of implementing these structures by allowing designers to build custom hardware datapaths that implement the CNN structure. One challenge of using FPGAs revolves around the design flow, which has traditionally been centered on tedious hardware description languages.
In this talk, Deshanand gives a detailed explanation of how CNN algorithms can be expressed in OpenCL and compiled directly to FPGA hardware. He gives details on code optimizations and provides comparisons with the efficiency of hand-coded implementations.
Kernel Recipes 2019 - XDP closer integration with network stack (Anne Nicolas)
XDP (eXpress Data Path) is the new programmable in-kernel fast-path, which is placed as a layer before the existing Linux kernel network stack (netstack).
We claim XDP is not kernel bypass, as it is a layer before the netstack and can easily fall through to it. In reality, it can easily be (ab)used to create a kernel-bypass situation, where none of the kernel facilities are used (in the form of BPF helpers and in-kernel tables). The main disadvantage of kernel bypass is the need to re-implement everything, even basic building blocks like routing tables and ARP protocol handling.
It is part of the concept, and of the speed gain, that XDP allows users to avoid calling parts of the kernel code. Users have the freedom to do kernel bypass and re-implement everything, but the kernel should provide access to more in-kernel tables via BPF helpers, so that users can leverage other parts of the Open Source ecosystem, like router daemons.
This talk is about how XDP can work in concert with the netstack, and proposes how we can take this even further. Crazy ideas, like using XDP frames to move SKB allocation out of driver code, will also be proposed.
dCUDA: Distributed GPU Computing with Hardware Overlap (inside-BigData.com)
Torsten Hoefler from ETH Zurich presented this deck at the Switzerland HPC Conference.
"Over the last decade, CUDA and the underlying GPU hardware architecture have continuously gained popularity in various high-performance computing application domains such as climate modeling, computational chemistry, or machine learning. Despite this popularity, we lack a single coherent programming model for GPU clusters. We therefore introduce the dCUDA programming model, which implements device-side remote memory access with target notification. To hide instruction pipeline latencies, CUDA programs over-decompose the problem and over-subscribe the device by running many more threads than there are hardware execution units. Whenever a thread stalls, the hardware scheduler immediately proceeds with the execution of another thread ready for execution. This latency-hiding technique is key to make best use of the available hardware resources. With dCUDA, we apply latency hiding at cluster scale to automatically overlap computation and communication. Our benchmarks demonstrate perfect overlap for memory bandwidth-bound tasks and good overlap for compute-bound tasks."
Watch the video presentation: http://wp.me/p3RLHQ-gCB
A Survey on GPU Systems Considering Their Performance on Different Applications (CSEIJ)
In this paper we study the NVIDIA graphics processing unit (GPU) along with its computational power and applications. Although these units are specially designed for graphics applications, we can employ their computational power for non-graphics applications too. The GPU offers high parallel processing power, low cost of computation, and short execution times, giving a good performance-per-energy ratio. This property of deploying GPUs for intensive computation of small sets of similar instructions has played a significant role in reducing CPU overhead. The GPU has several key advantages over the CPU architecture, as it provides high parallelism, intensive computation, and significantly higher throughput. It consists of thousands of hardware threads that execute programs in a SIMD fashion, so the GPU can be an alternative to the CPU in high-performance and supercomputing environments. The bottom line is that GPU-based general-purpose computing is a hot topic of research, and there is much to explore beyond graphics processing applications.
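The SIMD execution model the survey describes, one instruction applied to many data elements at once, can be sketched outside the GPU as well. A hedged NumPy illustration (not from the paper): a single vectorized expression replaces an explicit element-by-element loop, which is the same principle a GPU applies across thousands of hardware threads.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
b = np.array([10.0, 20.0, 30.0, 40.0], dtype=np.float32)

# scalar version: one multiply per iteration (how a plain CPU loop behaves)
scalar = [x * y for x, y in zip(a, b)]

# data-parallel version: a single operation over the whole array,
# analogous to a GPU issuing one instruction across many threads
vectorized = a * b
print(vectorized)
```

Both produce the same values; the difference is that the data-parallel form exposes all the element-wise work at once, which is what lets SIMD hardware execute it concurrently.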
Deep Convolutional Neural Network acceleration on the Intel Xeon Phi (Gaurav Raina)
With a sharp decline in camera cost and size along with superior computing power available at increasingly low prices, computer vision applications are becoming ever present in our daily lives. Research shows that Convolutional Neural Networks can outperform all other methods for computer vision tasks (such as object detection) in terms of accuracy and versatility.
One of the problems with these Neural Networks, which mimic the brain, is that they can be very demanding on the processor, requiring millions of computational nodes to function. Hence, it is challenging for Neural Network algorithms to achieve real-time performance on general purpose embedded platforms. Parallelization is one of the most effective ways to ease this problem and make it possible to implement such Neural Nets on energy efficient embedded platforms.
We present an evaluation of a novel Convolutional Neural Network for road speed sign detection on the new 57-core Xeon Phi processor with 512-bit vector support. This aims to demonstrate that the parallelism inherent in the algorithm can be effectively exploited by the 512-bit vector ISA and by utilizing the many-core paradigm.
Ultimately we demonstrate an approach that can be used to accelerate Neural Network based applications on massively parallel many-core processors, with speedups of more than 12x on single-core performance alone.
Accelerating Real Time Applications on Heterogeneous Platforms (IJMER)
In this paper we describe novel implementations of depth estimation from stereo images using feature extraction algorithms that run on the graphics processing unit (GPU), which is suitable for real-time applications like analyzing video in real-time vision systems. Modern graphics cards contain a large number of parallel processors and high-bandwidth memory for accelerating data computation operations. In this paper we give a general idea of how to accelerate real-time applications using heterogeneous platforms. We propose to use the added resources to apply more computationally involved optimization methods. This proposed approach will indirectly accelerate a database by producing better plan quality.
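Stereo depth estimation of the kind described above reduces, at its simplest, to finding the horizontal shift (disparity) that best aligns patches of the left and right images; nearby objects shift more than distant ones. A hedged toy sketch in NumPy (block matching by sum of absolute differences, not the paper's GPU implementation), where every pixel's search is independent and therefore maps naturally onto GPU threads:

```python
import numpy as np

def disparity_sad(left, right, max_disp, win=1):
    """Toy stereo matcher: for each pixel of the left image, find the
    horizontal shift into the right image that minimizes the sum of
    absolute differences (SAD) over a (2*win+1)-wide window."""
    h, w = left.shape
    disp = np.zeros((h, w), dtype=np.int32)
    for y in range(h):
        for x in range(win, w - win):
            best, best_d = np.inf, 0
            # only consider shifts that keep the window inside the image
            for d in range(min(max_disp, x - win) + 1):
                lw = left[y, x - win:x + win + 1].astype(np.float32)
                rw = right[y, x - d - win:x - d + win + 1]
                sad = np.abs(lw - rw).sum()
                if sad < best:
                    best, best_d = sad, d
            disp[y, x] = best_d
    return disp

# synthetic test: the right image is the left shifted by 2 pixels,
# so interior pixels should recover a disparity of 2
left = np.array([[0, 1, 2, 3, 4, 5, 6, 7]], dtype=np.float32)
right = np.roll(left, -2, axis=1)
print(disparity_sad(left, right, max_disp=3)[0])
```

Each (pixel, candidate-shift) pair is an independent SAD computation, which is exactly the kind of massively data-parallel workload that benefits from the many parallel processors and high memory bandwidth of a graphics card.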
How Netflix Tunes EC2 Instances for PerformanceBrendan Gregg
CMP325 talk for AWS re:Invent 2017, by Brendan Gregg. "
At Netflix we make the best use of AWS EC2 instance types and features to create a high performance cloud, achieving near bare metal speed for our workloads. This session will summarize the configuration, tuning, and activities for delivering the fastest possible EC2 instances, and will help other EC2 users improve performance, reduce latency outliers, and make better use of EC2 features. We'll show how we choose EC2 instance types, how we choose between EC2 Xen modes: HVM, PV, and PVHVM, and the importance of EC2 features such SR-IOV for bare-metal performance. SR-IOV is used by EC2 enhanced networking, and recently for the new i3 instance type for enhanced disk performance as well. We'll also cover kernel tuning and observability tools, from basic to advanced. Advanced performance analysis includes the use of Java and Node.js flame graphs, and the new EC2 Performance Monitoring Counter (PMC) feature released this year."
In this deck from the Perth HPC Conference, Rob Farber from TechEnablement presents: AI is Impacting HPC Everywhere.
"The convergence of AI and HPC has created a fertile venue that is ripe for imaginative researchers — versed in AI technology — to make a big impact in a variety of scientific fields. From new hardware to new computational approaches, the true impact of deep- and machine learning on HPC is, in a word, “everywhere”. Just as technology changes in the personal computer market brought about a revolution in the design and implementation of the systems and algorithms used in high performance computing (HPC), so are recent technology changes in machine learning bringing about an AI revolution in the HPC community. Expect new HPC analytic techniques including the use of GANs (Generative Adversarial Networks) in physics-based modeling and simulation, as well as reduced precision math libraries such as NLAFET and HiCMA to revolutionize many fields of research. Other benefits of the convergence of AI and HPC include the physical instantiation of data flow architectures in FPGAs and ASICs, plus the development of powerful data analytic services."
Learn more: http://www.techenablement.com/
and
http://hpcadvisorycouncil.com/events/2019/australia-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUsJiannan Ouyang, PhD
This slides were presented at the 12th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE’16).
Virtual Machine based approaches to workload consolidation, as seen in IaaS cloud as well as datacenter platforms, have long had to contend with performance degradation caused by synchronization primitives inside the guest environments. These primitives can be affected by virtual CPU preemptions by the host scheduler that can introduce delays that are orders of magnitude longer than those primitives were designed for. While a significant amount of work has focused on the behavior of spinlock primitives as a source of these performance issues, spinlocks do not represent the entirety of synchronization mechanisms that are susceptible to scheduling issues when running in a virtualized environment. In this paper we address the virtualized performance issues introduced by TLB shootdown operations. Our profiling study, based on the PARSEC benchmark suite, has shown that up to 64% of a VM's CPU time can be spent on TLB shootdown operations under certain workloads. In order to address this problem, we present a paravirtual TLB shootdown scheme named Shoot4U. Shoot4U completely eliminates TLB shootdown preemptions by invalidating guest TLB entries from the VMM and allowing guest TLB shootdown operations to complete without waiting for remote virtual CPUs to be scheduled. Our performance evaluation using the PARSEC benchmark suite demonstrates that Shoot4U can reduce benchmark runtime by up to 85% compared an unmodified Linux kernel, and up to 44% over a state-of-the-art paravirtual TLB shootdown scheme.
Achieving Performance Isolation with Lightweight Co-KernelsJiannan Ouyang, PhD
This slides were presented at the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '15)
Performance isolation is emerging as a requirement for High Performance Computing (HPC) applications, particularly as HPC architectures turn to in situ data processing and application composition techniques to increase system throughput. These approaches require the co-location of disparate workloads on the same compute node, each with different resource and runtime requirements. In this paper we claim that these workloads cannot be effectively managed by a single Operating System/Runtime (OS/R). Therefore, we present Pisces, a system software architecture that enables the co-existence of multiple independent and fully isolated OS/Rs, or enclaves, that can be customized to address the disparate requirements of next generation HPC workloads. Each enclave consists of a specialized lightweight OS co-kernel and runtime, which is capable of independently managing partitions of dynamically assigned hardware resources. Contrary to other co-kernel approaches, in this work we consider performance isolation to be a primary requirement and present a novel co-kernel architecture to achieve this goal. We further present a set of design requirements necessary to ensure performance isolation, including: (1) elimination of cross OS dependencies, (2) internalized management of I/O, (3) limiting cross enclave communication to explicit shared memory channels, and (4) using virtualization techniques to provide missing OS features. The implementation of the Pisces co-kernel architecture is based on the Kitten Lightweight Kernel and Palacios Virtual Machine Monitor, two system software architectures designed specifically for HPC systems. Finally we will show that lightweight isolated co-kernels can provide better performance for HPC applications, and that isolated virtual machines are even capable of outperforming native environments in the presence of competing workloads.
Talk by Brendan Gregg for YOW! 2021. "The pursuit of faster performance in computing is the driving reason for many new technologies and updates. This talk discusses performance improvements now underway that you will likely be adopting soon, for processors (including 3D stacking and cloud vendor CPUs), memory (including DDR5 and high-bandwidth memory [HBM]), disks (including 3D Xpoint as a 3D NAND accelerator), networking (including QUIC and eXpress Data Path [XDP]), runtimes, hypervisors, and more. The future of performance is increasingly cloud-based, with hardware hypervisors and custom processors, meaningful observability of everything down to cycle stalls (even as cloud guests), and high-speed syscall-avoiding applications that use eBPF, FPGAs, and io_uring. The talk also discusses where future performance improvements might be expected, with predictions for new technologies."
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/altera/embedded-vision-training/videos/pages/may-2015-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Deshanand Singh, Director of Software Engineering at Altera, presents the "Efficient Implementation of Convolutional Neural Networks using OpenCL on FPGAs" tutorial at the May 2015 Embedded Vision Summit.
Convolutional neural networks (CNN) are becoming increasingly popular in embedded applications such as vision processing and automotive driver assistance systems. The structure of CNN systems is characterized by cascades of FIR filters and transcendental functions. FPGA technology offers a very efficient way of implementing these structures by allowing designers to build custom hardware datapaths that implement the CNN structure. One challenge of using FPGAs revolves around the design flow that has been traditionally centered around tedious hardware description languages.
In this talk, Deshanand gives a detailed explanation of how CNN algorithms can be expressed in OpenCL and compiled directly to FPGA hardware. He gives detail on code optimizations and provides comparisons with the efficiency of hand-coded implementations.
Kernel Recipes 2019 - XDP closer integration with network stackAnne Nicolas
XDP (eXpress Data Path) is the new programmable in-kernel fast-path, which is placed as a layer before the existing Linux kernel network stack (netstack).
We claim XDP is not kernel-bypass, as it is a layer before and it can easily fall-through to netstack. Reality is that it can easily be (ab)used to create a kernel-bypass situation, where non of the kernel facilities are used (in form of BPF-helpers and in-kernel tables). The main disadvantage with kernel-bypass, is the need to re-implement everything, even basic building blocks, like routing tables and ARP protocol handling.
It is part of the concept and speed gain, that XDP allows users to avoid calling part of the kernel code. Users have the freedom to do kernel-bypass and re-implement everything, but the kernel should provide access to more in-kernel tables, via BPF-helpers, such that users can leverage other parts of the Open Source ecosystem, like router daemons etc.
This talk is about how XDP can work in-concert with netstack, and proposal on how we can take this even-further. Crazy ideas like using XDP frames to move SKB allocation out of driver code, will also be proposed.
dCUDA: Distributed GPU Computing with Hardware Overlapinside-BigData.com
Torsten Hoefler from ETH Zurich presented this deck at the Switzerland HPC Conference.
"Over the last decade, CUDA and the underlying GPU hardware architecture have continuously gained popularity in various high-performance computing application domains such as climate modeling, computational chemistry, or machine learning. Despite this popularity, we lack a single coherent programming model for GPU clusters. We therefore introduce the dCUDA programming model, which implements device-side remote memory access with target notification. To hide instruction pipeline latencies, CUDA programs over-decompose the problem and over-subscribe the device by running many more threads than there are hardware execution units. Whenever a thread stalls, the hardware scheduler immediately proceeds with the execution of another thread ready for execution. This latency-hiding technique is key to make best use of the available hardware resources. With dCUDA, we apply latency hiding at cluster scale to automatically overlap computation and communication. Our benchmarks demonstrate perfect overlap for memory bandwidth-bound tasks and good overlap for compute-bound tasks."
Watch the video presentation: http://wp.me/p3RLHQ-gCB
A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONScseij
In this paper we study NVIDIA graphics processing unit (GPU) along with its computational power and applications. Although these units are specially designed for graphics application we can employee there computation power for non graphics application too. GPU has high parallel processing power, low cost of computation and less time utilization; it gives good result of performance per energy ratio. This GPU deployment property for excessive computation of similar small set of instruction played a significant role in reducing CPU overhead. GPU has several key advantages over CPU architecture as it provides high parallelism, intensive computation and significantly higher throughput. It consists of thousands of hardware threads that execute programs in a SIMD fashion hence GPU can be an alternate to CPU in high performance environment and in supercomputing environment. The base line is GPU based general purpose computing is a hot topics of research and there is great to explore rather than only graphics processing application.
Deep Convolutional Neural Network acceleration on the Intel Xeon Phi (Gaurav Raina)
With a sharp decline in camera cost and size along with superior computing power available at increasingly low prices, computer vision applications are becoming ever present in our daily lives. Research shows that Convolutional Neural Networks can outperform all other methods for computer vision tasks (such as object detection) in terms of accuracy and versatility.
One of the problems with these Neural Networks, which mimic the brain, is that they can be very demanding on the processor, requiring millions of computational nodes to function. Hence, it is challenging for Neural Network algorithms to achieve real-time performance on general purpose embedded platforms. Parallelization is one of the most effective ways to ease this problem and make it possible to implement such Neural Nets on energy efficient embedded platforms.
We present an evaluation of a novel Convolutional Neural Network for road speed sign detection on the new 57-core Intel Xeon Phi processor with 512-bit vector support. This aims to demonstrate that the parallelism inherent in the algorithm can be effectively exploited by the 512-bit vector ISA and by utilizing the many-core paradigm.
Ultimately we demonstrate an approach which can be used to accelerate Neural Network based applications on massively parallel many-core processors, with speedups of more than 12x on single core performance alone.
Accelerating Real Time Applications on Heterogeneous Platforms (IJMER)
In this paper we describe novel implementations of depth estimation from stereo images using feature extraction algorithms that run on the graphics processing unit (GPU), suitable for real-time applications such as analyzing video in real-time vision systems. Modern graphics cards contain a large number of parallel processors and high-bandwidth memory for accelerating data computation operations. We give a general idea of how to accelerate real-time applications using heterogeneous platforms, and propose using the added resources to employ more computationally involved optimization methods. This approach will indirectly accelerate a database by producing better plan quality.
Design and Implementation of Quintuple Processor Architecture Using FPGA (IJERA Editor)
The advanced quintuple processor core is a design philosophy that has become mainstream in scientific and engineering applications. The increasing performance and gate capacity of recent FPGA devices permit complex logic systems to be implemented on a single programmable device. Embedded multiprocessors face a new problem with thread synchronization, caused by the distributed memory: when thread synchronization is violated, the processors can access the same value at the same time. Processor performance can be increased by adopting clock scaling techniques and microarchitectural enhancements. We therefore designed a new architecture called Advanced Concurrent Computing, implemented on an FPGA chip using VHDL. The Advanced Concurrent Computing architecture makes simultaneous use of both parallel and distributed computing. The full architecture of the quintuple processor core is designed to perform arithmetic, logical, shifting, and bit manipulation operations. The proposed advanced quintuple processor core contains homogeneous RISC processors augmented with pipelined processing units, a multi-bus organization, and I/O ports, along with the other functional elements required to implement embedded SoC solutions. Performance issues of the designed core, such as area, speed, power dissipation, and propagation delay, are analyzed at the 90nm process technology node using Xilinx tools.
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month’s highlights cover pseudo random number generation, the first-ever MONAI Bootcamp, upcoming GPU Hackathons and Bootcamps, and new resources!
Architecture exploration of recent GPUs to analyze the efficiency of hardware... (journalBEEI)
This study analyzes the efficiency of parallel computational applications with the adoption of recent graphics processing units (GPUs). We investigate the impact of the additional resources of the recent architecture on popular benchmarks, compared with the previous architecture. Our simulation results demonstrate that the Pascal GPU architecture improves performance by 273% on average compared to the older Fermi architecture. To evaluate the performance improvement attributable to specific hardware resources, we divide the hardware resources into two types: computing and memory resources. Computing resources have a bigger impact on performance improvement than memory resources in most benchmarks. For Hotspot and B+ tree, an architecture adopting only enhanced computing resources achieves performance gains similar to an architecture adopting both computing and memory resources. We also evaluate the influence of the number of warp schedulers per SM (Streaming Multiprocessor) on GPU performance, in relation to barrier waiting time. Based on these analyses, we propose a development direction for future generations of GPUs.
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month’s highlights cover the first remote GPU Hackathons, a complete schedule of upcoming events, using OpenACC for a biophysics problem, NVIDIA HPC SDK, GCC 10, new resources and more!
Professional Project - C++ OpenCL - Platform agnostic hardware acceleration for deep neural networks
1. University of Surrey
Faculty of engineering and physical sciences
Department of Computing
Final Year Project Report
19/05/2016
Title: Platform agnostic hardware acceleration
for deep neural networks
Student: Callum McMahon
URN: 6279333
Supervisor: Lillian Tang
2. Platform agnostic hardware acceleration for deep neural networks P a g e | 1
Contents
Abstract....................................................................................................................................... 3
Abbreviations .............................................................................................................................. 3
Introduction ................................................................................................................................. 4
Background ............................................................................................................................. 4
Objectives................................................................................................................................ 5
Literature Review ........................................................................................................................ 5
Pre-existing software packages............................................................................................... 5
Exploring Caffe’s OpenCL branch in more depth..................................................................... 5
Theoretical groundwork ........................................................................................................... 7
Multi Layer feed forward perceptron ..................................................................... 7
Modern Activation Functions and the Back Propagation algorithm....................................... 8
Weight regularization ........................................................................................................... 9
OpenCL learning resources and reference material............................................................... 11
System Design.......................................................................................................................... 11
Development environment..................................................................................................... 11
Essential Requirements......................................................................................................... 12
Implementation Deliverables.................................................................................................. 12
Technical Challenges ............................................................................................................ 12
Feeding the OpenCL device............................................................................................... 12
OpenCL kernel efficiency considerations ........................................................................... 14
Using clFFT ....................................................................................................................... 15
Implementation Schedule ...................................................................................................... 15
Design specification............................................................................................................... 15
Designing a flexible network architecture ........................................................................... 15
Validation tests................................................................................................................... 16
Class hierarchy .................................................................................................................. 17
Results...................................................................................................................................... 18
Requirement satisfaction ....................................................................................................... 18
Refer to system design, essential and optional requirements, page 11.................................. 18
Test validation Results........................................................................................................... 19
MNIST classification examples.............................................................................................. 20
Result Discussion.................................................................................................................. 20
Evaluation ................................................................................................................................. 21
Further Work ......................................................................................................................... 21
Conclusion............................................................................................................................. 21
Deployment guide ..................................................................................................................... 22
Bibliography .............................................................................................................................. 23
Appendices ............................................................................................................................... 25
A - Network validation architectures....................................................................................... 25
A.1. MNIST........................................................................................................................... 25
A.2. sin(a).............................................................................................................................. 25
A.3. sort(a, b, c, d, e) ............................................................................................................ 26
A.4. polynomial...................................................................................................................... 26
A.5. MNIST........................................................................................................................... 27
B – clFFT library experiment................................................................................... 27
B.1. Fourier transform and inverse Fourier transform via clFFT and OpenCL ........................ 27
B.2. Program outputs from B.1. Showing only the first column for succinctness. ................... 31
C – Gantt time plans.............................................................................................................. 32
Abstract
This report provides an overview of resources available for deep neural network machine
learning. Current state of the art software libraries employ massively vectorised training
pipelines, enabling highly parallel computation and hence faster training convergence. Graphics
processing units provide access to a greater threading capability than a typical central
processing unit. As such, a number of libraries have been developed with alternative fast native
GPU code paths. Current implementations are tightly integrated with the CUDA platform, a
proprietary programming model restricted to Nvidia GPUs.
In response, a basic cross-platform neural network library has been developed in C++,
demonstrating the feasibility of a single high performance platform agnostic code path. The
library has been built on top of the OpenCL programming framework. OpenCL is maintained by
a non-profit consortium group, Khronos, with implementations available on a number of devices
from different vendors.
Validation tests were performed on multilayer neural networks to assess training performance
and final network accuracy. Training consisted of multiple passes using back propagation and an
adaptive global learning rate.
A network consisting of two hidden linear rectifier layers was trained on the MNIST dataset, a well known set of labelled greyscale digit images. The best observed error was achieved with a total of 1,099,770 trainable parameters over 200 epochs, attaining a classification error of 4.5%. Each epoch consisted of 5000 stochastic samples and back propagation passes. Total training time was 53 minutes. Fast convergence was also observed using fewer training epochs: using 10 epochs, a classification error rate of 9.6% was observed, taking 164.6 seconds of training on an AMD Fury X.
Training on the Fury X was found to be approximately 5x faster than on the i7-6700k. The Fury X boasts approximately 72x the single-precision floating point performance of the i7-6700k, suggesting further optimisations can be made.
For demonstration purposes, Windows x64 has been explicitly targeted by this release; porting to another operating system would be trivial. The library has been written against OpenCL version 2.0 in order to take advantage of fine control over job queues. All recent CPUs and GPUs from AMD and Intel are OpenCL 2.0 capable. Currently Nvidia devices only support OpenCL 1.2, but 2.0 support is likely to come in the near future.
Abbreviations
CPU     Central Processing Unit
GPU     Graphics Processing Unit
CUDA    Compute Unified Device Architecture
OpenCL  Open Computing Language
clBLAS  OpenCL Basic Linear Algebra Subprograms
clFFT   OpenCL Fast Fourier Transform
ReLU    Rectified Linear Unit
LU      Linear Unit
SiU     Sigmoid Unit
Introduction
Background
The field of machine learning is currently experiencing renewed interest. Developments in deep neural network architectures and training methods have resulted in greatly improved model accuracy on difficult tasks. Refinements to techniques are being continually developed, with error rates as low as 15.2% being reported in difficult tasks such as speech recognition [1]. Companies are investing large sums into neural network research; see Facebook open-sourcing deep learning modules for Torch [2]. There have been a number of high-profile public successes, such as Alphabet’s AlphaGo, the first program ever to beat a professional Go player without a handicap [3].
Figure 1.1 Google trend data showing the popularity of search terms.
Note the rapid rise of "deep learning" searches.
Deep neural networks are an evolution of single hidden layer neural networks. Whilst the idea of a distributed computational network was conceived in the late fifties, inspired by biological models, it was not until the invention of back propagation in 1970 [4] that an effective network training method became available. 1985 saw the first proposal of introducing convolution layers [5]. Since then a large number of new methods have been introduced: weight decay [6], fast convolution layers using Fourier transforms [7], dropout [8], and long short-term memory networks [9].
Demand for increased computational performance has risen with the increasing complexity of
neural networks. In 1995 it was demonstrated that GPUs could be used to effectively train neural
networks [10]. Neural network optimisation is a massively parallel problem, and as such is well
suited to GPU architectures, which give access to a much larger number of threads than a
typical CPU.
GPU APIs were originally designed around a fixed pipeline for producing visual effects. Traditionally it has been very difficult to exploit GPU parallelism for general algorithm computation. However, graphics API pipelines have become increasingly generic in order to handle more intricate computer graphics methods [11][12]. Hardware vendors have subsequently released more generic compute platforms [13][14][15][16] that can run code against GPU hardware, designed for the needs of the scientific computing community. Nvidia's CUDA 1.0 was released in 2007, and OpenCL 1.0 in 2009. CUDA program kernels are written in a dialect of C++, while OpenCL 2.0 kernels are based on the C99 specification.
CUDA is currently the more mature of the two GPU compute platforms, boasting a wider selection of libraries [17]. This has directly translated into more widespread CUDA hardware acceleration for training deep neural networks. In contrast, OpenCL implementations are generally incomplete or non-existent (table 2.1). However, CUDA is a proprietary platform that will only run on Nvidia's GPU hardware [18]. OpenCL implementations exist across a range of hardware from different vendors, including both CPUs and GPUs [19]. OpenCL therefore has the potential to provide a single unified fast code path for training deep neural networks.
Objectives
1. Develop a basic deep learning library that utilises OpenCL for all intensive operations.
2. Develop an easy to use interface within C++.
3. Maintain compatibility across as many OpenCL platforms as possible.
4. Minimise external dependencies to ease setup and increase portability.
Literature Review
Pre-existing software packages
Software     Primary language interface   Other language interfaces   CUDA GPU support   OpenCL CPU / GPU support
Caffe        Python                       C++, Matlab                 Yes                Third-party branch from AMD, but only neared feature completion as of late August 2015.
Neon         Python                       -                           Yes                No.
Theano       Python                       -                           Yes                In development.
Tensorflow   Python                       C++ (graphs only)           Yes                In development.
Torch        Lua                          C                           Yes                Third-party branch in development.
Figure 2.1.1 An overview of popular deep learning software environments.
None of the popular deep learning libraries provide official OpenCL support. Caffe is the only library with a feature complete OpenCL branch.
Exploring Caffe’s OpenCL branch in more depth
There are a large number of dependencies [20] required for installation. Installation is restricted to Ubuntu 12.04 or later, and only AMD GPUs are currently supported. Building and deploying the full Caffe OpenCL stack was deemed outside the scope of this project. Test performance metrics are available on the GitHub page [21]; see Fig 2.2.1.
Platform Speed (images per second)
AMD W9100 & A10-7850k 255
AMD R9 Fury & A10-7850k 261
AMD R290X @1000MHz & A10-7850k 268
AMD S9150 @900MHz & Xeon E5-2640 227
Figure 2.2.1. Training performance using the well known AlexNet network. [22]
The network inputs used by AlexNet are images of 256x256 resolution. Multiplying the total number of pixels by the number of images processed per second, we can see that Caffe's OpenCL branch is capable of training on approximately 17,104,896 input values per second on an AMD R9 Fury.
Platform Speed (images per second)
AMD W9100 & A10-7850k 590
AMD R9 Fury & A10-7850k 699
AMD R290X @1000MHz & A10-7850k 606
AMD S9150 @900MHz & Xeon E5-2640 452
Figure 2.2.2. Recognition performance using AlexNet. [22]
Similarly, we can see that recognition processes approximately 45,809,664 input values per second.
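The arithmetic behind both throughput figures can be made explicit. A small sketch (the resolution and image rates are taken from Figures 2.2.1 and 2.2.2; the function name is illustrative, not part of any library):

```cpp
#include <cassert>
#include <cstdint>

// Input values processed per second: every pixel of every image.
std::int64_t inputs_per_second(std::int64_t width, std::int64_t height,
                               std::int64_t images_per_second) {
    return width * height * images_per_second;
}

// AlexNet inputs are 256x256 images; rates are for the AMD R9 Fury:
//   training:    inputs_per_second(256, 256, 261) == 17104896
//   recognition: inputs_per_second(256, 256, 699) == 45809664
```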
Theoretical groundwork
Multi Layer feed forward perceptron
The perceptron network was first proposed in 1958 by Frank Rosenblatt [24]. Perceptrons are connected into a directed graph: the perceptrons at the start of the graph correspond to the network's inputs, and perceptrons at the end of the graph to its outputs. Input values are passed into the input perceptrons. Each subsequent perceptron computes a weighted sum of the outputs from prior connected perceptrons. The summed value is then passed through an activation function, A(x), and passed on to the next set of perceptrons. This process continues until the network output is reached. Early networks were handcrafted by tweaking connection weight values; modern neural networks employ learning algorithms to automatically update weight values.

    A(x) = d(max{x, 0}) / dx

Figure 2.3.2. The Heaviside step function, defined above as the derivative of max{x, 0}, was the activation function originally used by Rosenblatt. It has since been replaced by differentiable functions. Differentiable activation functions allow gradient descent to be used to modify connection weights in such a way that the network can be taught to output a set of desired values for a given input.

Figure 2.3.1. A diagram showing how a single perceptron unit processes inputs within a network. This process is called a forward pass.
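The forward pass described above can be sketched in a few lines of plain C++. The real library performs these sums as OpenCL jobs; this scalar version, using max(x, 0) as an illustrative activation function, only shows the structure of the computation:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Activation function A(x); here max(x, 0) purely for illustration.
double activation(double x) { return std::max(x, 0.0); }

// Forward pass for one unit j: weighted sum of prior outputs, then A(x).
// weights[i] is w_ij, the weight from incoming unit i to this unit.
double forward_unit(const std::vector<double>& prior_outputs,
                    const std::vector<double>& weights) {
    double sum = 0.0;
    for (std::size_t i = 0; i < weights.size(); ++i)
        sum += prior_outputs[i] * weights[i];
    return activation(sum);
}

// Forward pass for a whole layer: one weight vector per unit.
std::vector<double> forward_layer(const std::vector<double>& prior_outputs,
                                  const std::vector<std::vector<double>>& layer_weights) {
    std::vector<double> out;
    for (const auto& w : layer_weights)
        out.push_back(forward_unit(prior_outputs, w));
    return out;
}
```

Repeating forward_layer over successive layers, feeding each layer's output into the next, yields the complete forward pass.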
Modern Activation Functions and the Back Propagation algorithm
Back propagation [4] is widely used as a training algorithm for neural networks; it is a class of gradient descent algorithm. It works by first performing a forward pass of the network. See [25] for an overview of the algorithm.

    p_j = A( Σ_{i=1}^{incoming weights} p_i · w_ij )

Where A() is an activation function, w_ij is a weight between units i and j, and p_x is the output of unit x. Here i is the index of the unit closest to the input layer.
The activation function must be differentiable so that an error gradient may be calculated. The sigmoid function is commonly used. The linear rectifier activation function has been shown to have better characteristics under some conditions [26]. The linear rectifier prevents the vanishing gradient problem experienced by the sigmoid activation function, where inputs of large magnitude have activation gradients of 0, or near 0, which in turn reduces the weight update deltas to 0, or near 0.
Sigmoid and its derivative:

    A(x) = 1 / (1 + e^(-x))

    dA(x)/dx = e^(-x) / (1 + e^(-x))² = A(x)(1 − A(x))

Linear rectifier and its derivative:

    A(x) = ln(1 + e^x)

    dA(x)/dx = 1 / (1 + e^(-x))
An error delta is calculated at each output unit by finding the difference between its output and a desired output value. The error deltas are propagated back through the network to the input layer, storing deltas at each unit. This is referred to as a backwards pass.

Delta error for output units:

    δ_j = (p_j − t_j) · dA(p_j)/dx

Where t_j denotes the j-th output unit's target value.

Delta error for inner units:

    δ_j = dA(p_j)/dx · Σ_{i=1}^{outgoing weights} δ_i · w_ij

Here w_ij is the weight from unit i in the previously visited layer to unit j in the current layer, i.e. i is the index of the unit closest to the output layer.

Finally, weights are moved by a value proportional to the error delta at the unit they provide inputs for. The direction of change is opposite to the sign of the delta. The deltas are proportional to the rate of change of the network's error with respect to the incoming weights.

    Δw_ij = −a · δ_j · p_i, where δ_j · p_i = dError/dw_ij

Where a is the learning rate, and i is again the index of the unit closest to the input layer.
The learning rate, a, must be small enough to allow the network to converge, yet large enough to give a reasonable training time. Small a values may also cause the network to get stuck in local error minima.
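For a single unit, the update rules above reduce to a few expressions. A minimal sketch using the sigmoid activation (function names are illustrative, not the library's API):

```cpp
#include <cassert>
#include <cmath>

// Sigmoid activation A(x) = 1 / (1 + e^(-x)) and its derivative A(x)(1 - A(x)).
double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }
double sigmoid_deriv(double x) { double a = sigmoid(x); return a * (1.0 - a); }

// Delta for an output unit: (p_j - t_j) * dA/dx, evaluated at the unit's summed input x_j.
double output_delta(double p_j, double t_j, double x_j) {
    return (p_j - t_j) * sigmoid_deriv(x_j);
}

// Weight update: w_ij moves against the error gradient, scaled by the learning rate a.
double weight_update(double a, double delta_j, double p_i) {
    return -a * delta_j * p_i;
}
```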
Weight regularization
Weight regularization is commonly applied in one of two forms: weight decay [6] or dropout [8]. Weight regularization is intended to prevent overfitting, whereby the network learns to exactly reproduce the training outputs rather than learning a generalized pattern. Overfitted networks perform poorly on validation test sets.

Weight decay modification to the weight update rule:

    Δw_ij = −a · δ_j · p_i − d · sign(δ_j · p_i)

Where d is a small decay factor, such that d ≪ a.

Weight decay may however reduce final network performance, as it creates moving global optima. It is preferable to use dropout where possible. The dropout modification is applied to the forward pass during training, giving each unit a small probability of outputting a value of 0.
    p_j = A( Σ_{i=1}^{incoming weights} p_i · w_ij )   if rnd(0.0, 1.0) ≥ d
    p_j = 0                                            if rnd(0.0, 1.0) < d

Where d is a small dropout probability such that 0.0 ≤ d < 1.0.
Dropout attempts to spread learned patterns across the network, rather than isolated groups of
units.
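Applied to a whole layer, the dropout rule amounts to one random comparison per unit. A hedged sketch (the RNG choice and function name are assumptions, not the library's implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Apply dropout to a layer's activations during training: each unit
// outputs 0 with probability d, otherwise passes its value through.
std::vector<double> apply_dropout(const std::vector<double>& activations,
                                  double d, std::mt19937& rng) {
    std::uniform_real_distribution<double> rnd(0.0, 1.0);  // samples in [0, 1)
    std::vector<double> out(activations.size());
    for (std::size_t j = 0; j < activations.size(); ++j)
        out[j] = (rnd(rng) < d) ? 0.0 : activations[j];
    return out;
}
```

During inference dropout is disabled, so the full set of units contributes to every output.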
Convolution Layers and Fast Convolutions
Convolution layers provide a method of introducing translation-resistant weights into the network [27]. Units within a convolution layer share weights in a spatial pattern, allowing the network to quickly generalize for inputs containing translated patterns. Stacked convolution layers can identify extremely complex patterns much more rapidly than a typical multi layer network; convolution networks have seen great success in many applications.
Figure 2.3.3 A diagram showing how the weights are shared across convolutional layer units.
Convolution operations can however be expensive for large kernels, being O(nk²), where n is the number of units in the convolutional layer and k is the kernel width. It has been recognised that the convolution theorem can be applied to give a greatly reduced computation time of O(n log n) for the forward pass [28].

    F(c ⊛ k) = F(c) · F(k)
    ∴ c ⊛ k = F⁻¹(F(c) · F(k))

The convolution theorem states that the Fourier transform of the convolution of two matrices is equal to the elementwise product of their Fourier transforms. Using the fast Fourier transform algorithm, F(c) and F(k) can be computed in O(n log n), where n is the number of elements in c or k (they must have the same number of elements). Similarly, the back propagation algorithm may also be modified to take advantage of this identity [28].
Delta errors for a convolutional output layer:

    δ_j = (p − t) ⊙ dA(p)/dp

Note that dA(p)/dp is the matrix of activation function derivatives for the output layer, multiplied elementwise (⊙) with the matrix of output errors.

Delta errors for a convolutional inner layer:

    δ_j = dA(l_j)/dp ⊙ (δ_i ∗ w_ij^T)

Where i and j are now indexes between network layers, rather than units, and l_i = p_i denotes the matrix of outputs for layer i. For the backwards pass, i is the index of the layer closest to the output layer.

Weight updates for a convolutional kernel:

    Δw_ij = −a(δ_j ∗ l_i), where δ_j ∗ l_i = dE/dw_ij

For the weight updates, i is the index of the layer closest to the input layer.
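The convolution-theorem identity can be verified numerically. The sketch below substitutes a naive O(n²) DFT for the real FFT that clFFT would supply, and checks the Fourier route against a direct circular convolution; all names here are illustrative:

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cd = std::complex<double>;
const double PI = std::acos(-1.0);

// Naive discrete Fourier transform; sign = -1 forward, +1 inverse (unscaled).
std::vector<cd> dft(const std::vector<cd>& x, int sign) {
    std::size_t n = x.size();
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t t = 0; t < n; ++t)
            out[k] += x[t] * std::polar(1.0, sign * 2.0 * PI * k * t / n);
    return out;
}

// Circular convolution via the convolution theorem: c ⊛ k = F⁻¹(F(c) · F(k)).
std::vector<double> conv_via_fourier(const std::vector<double>& c,
                                     const std::vector<double>& k) {
    std::size_t n = c.size();
    std::vector<cd> fc(c.begin(), c.end()), fk(k.begin(), k.end());
    fc = dft(fc, -1);
    fk = dft(fk, -1);
    std::vector<cd> prod(n);
    for (std::size_t i = 0; i < n; ++i) prod[i] = fc[i] * fk[i];  // elementwise product
    prod = dft(prod, +1);
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i) out[i] = prod[i].real() / n;  // inverse scaling
    return out;
}

// Direct O(n²) circular convolution for comparison.
std::vector<double> conv_direct(const std::vector<double>& c,
                                const std::vector<double>& k) {
    int n = static_cast<int>(c.size());
    std::vector<double> out(n, 0.0);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            out[i] += c[j] * k[((i - j) % n + n) % n];
    return out;
}
```

With a true FFT in place of the naive DFT, the Fourier route drops the cost from O(n²) to O(n log n), which is the saving the fast convolution layer exploits.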
OpenCL learning resources and reference material
Having never worked with OpenCL before, I ended up working through a number of tutorials and example programs. Listed below are all the resources I used.

Resource type       Name                                                     Location
PDF, specification  OpenCL 2.0 specification                                 https://www.khronos.org/registry/cl/specs/opencl-2.0.pdf
Website, reference  clBLAS manual and reference                              http://clmathlibraries.github.io/clBLAS/
Website, reference  clFFT manual and reference                               http://clmathlibraries.github.io/clFFT/
Book                Heterogeneous Computing with OpenCL 2.0, by David Kaeli, Perhaad Mistry, Dana Schaa and Dong Ping Zhang    http://developer.amd.com/partners/university-programs/heterogeneous-computing-with-opencl/
Website, tutorial   Oak Ridge laboratory, OpenCL vector addition tutorial    https://www.olcf.ornl.gov/tutorials/opencl-vector-addition/
Website, tutorial   AMD, Intro to OpenCL tutorial                            http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-resources/introductory-tutorial-to-opencl/

Figure 2.4.1. Learning resources
System Design
Development environment
The OpenCL specification is written against C++, which is consequently the language of choice for this project.
Windows was chosen as the development environment due to personal familiarity with the Visual Studio software package. Visual Studio 2015 is used to provide an up to date implementation of the C++11 specification. In keeping with the project objectives, Windows-specific code shall be restricted to the main.cpp file. All other code will be written with the standard template library in mind, and as such should compile under g++ and run on Linux.
Familiarisation with OpenCL showed that developing optimised kernels is difficult. Consequently, I decided to employ AMD's clBLAS library where possible; clBLAS provides a set of common basic linear algebra kernels. AMD also provides clFFT for computing fast Fourier transforms. clFFT was added as an additional dependency to assist in implementing fast convolution layers (Fig. 3.2.1).
Essential Requirements
1. A network class capable of:
a. Constructing multi layer feed forward neural networks. The programmer
should be able to easily specify the number of units within each layer.
b. Training neural networks. Training performance must be reported through
cross validation against test data.
c. Testing neural networks. A method must be implemented that returns the network's mean standard error across a batch of test data.
d. Processing inputs. A method must be implemented that allows the network to
accept a single set of inputs from the main program thread, returning the
corresponding output from the network.
2. A layer class that provides a logical ordering of network computational units.
3. An implementation of the back propagation training algorithm.
4. An implementation of the sigmoid activation function and its corresponding
differential.
5. A sample program capable of demonstrating network training and testing functionality
on different OpenCL devices.
6. Unit testing, testing trained Network accuracy by validating against a dataset
generated from a mathematical function.
Optional Requirements
1. Unit testing, testing trained Network accuracy by validating against a well known pre-constructed dataset.
2. Implementation of a convolutional layer and convolutional kernel classes. These must
provide:
a. Weight sharing across spatially separated neuron units.
b. Modification to the back propagation algorithm to handle shared weights.
3. An implementation of the linear rectifier activation function and its corresponding
differential.
4. Network regularization. Either through weight decay or dropout.
Implementation Deliverables
1. A Visual Studio 2015 C++ solution containing a working example of the developed deep
neural network library.
2. Headers and associated .cpp definitions with comments describing how the library works.
3. OpenCL kernel code.
4. clBLAS and clFFT included as dynamic link libraries.
Technical Challenges
Feeding the OpenCL device
OpenCL provides a high latency, high throughput bridge between the host device and the compute device. The host device and compute device share one or more queues: the host produces jobs and inserts them into a queue, and the compute device consumes job items from the queue. By default, OpenCL creates a serial queue, forcing the compute device to compute jobs in order. This is not ideal, as some jobs may take only a fraction of the compute device's resources. Setting CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE when creating the command queue will enable the device to consume jobs out of order.
Each job is associated with an event, which may be in one of four states: queued, submitted,
running, and complete. Jobs are also associated with event completion wait lists, allowing for
synchronization and dependency blocks. Ideally the work queue will be saturated, so that the
compute device can work on jobs continually.
Figure 3.1.1. A visualization of how the queue controls job consumption. The queue is saturated: there are
more jobs available for the compute device to consume, as shown by the line in red. The host device is
shown in green. Independent jobs are undertaken either in parallel or in an undetermined serial order.
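The wait-list semantics can be illustrated with a host-side simulation: a job becomes eligible only once every job in its wait list has completed, but eligible jobs may otherwise be consumed in any order. This is a conceptual sketch of the queue behaviour, not OpenCL API code:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// A job becomes eligible only when every job in its wait list has completed;
// eligible jobs may otherwise complete in any order. Assumes no dependency
// cycles. This simulates the queue semantics; it is not OpenCL API code.
struct Job {
    int id;
    std::vector<int> waitList;  // ids of jobs that must complete first
};

std::vector<int> consumeOutOfOrder(std::vector<Job> queue) {
    std::set<int> done;
    std::vector<int> order;
    while (!queue.empty()) {
        for (std::size_t i = 0; i < queue.size(); ++i) {
            bool ready = true;
            for (int dep : queue[i].waitList)
                if (!done.count(dep)) { ready = false; break; }
            if (ready) {  // consume the first eligible job, regardless of position
                done.insert(queue[i].id);
                order.push_back(queue[i].id);
                queue.erase(queue.begin() + static_cast<std::ptrdiff_t>(i));
                break;
            }
        }
    }
    return order;
}
```

Here job 2 waits on job 3, so an out-of-order queue consumes job 3 before job 2 even though it was enqueued later.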
Behaviour is undefined if the compute device attempts to write to or read from cl_mem buffers
being modified by the host device. The reverse is also true. Consequently, the queue must be
utilised to stall both host and compute device until read/write operations are finished. The
number of read and write operations between the host and compute device should be minimised
in order to prevent stalls; as such, as much data as possible should be kept device-side.
Figure 3.1.2. A forced synchronization point. The host is attempting to read the cl_mem holding the network’s
output. A backward gather compute job is available, but cannot be consumed until the host has finished its
read.
OpenCL kernel efficiency considerations
OpenCL kernels are small programs that run on the OpenCL compute device. OpenCL kernels
are compiled using an OpenCL device context by the host at program start-up. The host can
then queue the kernel binary to the compute device as part of a compute task. Similarly, the
OpenCL host can queue read or write operations to modify or view the contents of cl_mem
buffers held in the compute device’s global cache.
Figure 3.1.3. A depiction of the hardware differences exposed by OpenCL. OpenCL devices typically have
access to a much larger number of threads. An AMD Fury X GPU has access to 4096 threads.
The specification is designed with massive parallelism in mind. An instance of a submitted kernel
program is launched for each thread in the global work group. The global work group is
subdivided into equally sized local work groups. Each thread has access to a small but very fast
private memory, and a slower but larger memory shared across its work group. All threads have
access to the global memory cache. Threads may only communicate within their work group.
Task division is primarily achieved using the thread’s unique id, which lies in the range
0 <= x < global work group size. Kernel jobs are only marked as complete once all their threads
have finished; as such, a kernel is only as fast as its slowest thread.
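As an illustration of id-based task division, the following maps a thread's unique id onto the matrix element it should process. Task and taskForThread are illustrative names, shown here as plain C++; in an actual OpenCL kernel the id would come from get_global_id(0):

```cpp
#include <cassert>

// Maps a thread's unique id onto the matrix element it should process.
// In an actual OpenCL kernel the id would come from get_global_id(0).
struct Task { int row; int col; };

Task taskForThread(int globalId, int matrixWidth) {
    Task t;
    t.row = globalId / matrixWidth;
    t.col = globalId % matrixWidth;
    return t;
}
```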
It is also worth noting that GPUs often implement reduced instruction sets. Consequently some
function calls can have large overheads. For example, the modulo operator is expensive on
AMD GPU hardware.
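As a concrete example of working around an expensive modulo, a power-of-two divisor lets the operation be replaced by a cheap bitwise AND:

```cpp
#include <cassert>

// x % d can cost far more than a bitwise AND on some GPUs; when d is a
// power of two the two are equivalent: x % 8 == x & 7.
unsigned fastMod8(unsigned x) { return x & 7u; }
```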
Using clFFT
The clFFT library is relatively complex, yet I could only find three example programs. I
subsequently created a small program to see if I could successfully transform a real-valued 2D
matrix into the complex frequency domain, then back again to the spatial domain. The test was
successful. See Appendix B.1 for the code and B.2 for results.
Implementation Schedule
For the original implementation schedule, refer to Appendix C.1. A modified schedule was
created at the end of December 2015, after the initial project proposal was recognized to be
too complex for the given time frame. See Appendix C.2. Originally I had hoped to demonstrate
basic speech recognition capabilities; however, this would require that convolution features be
fully implemented. Other commitments meant that I was unsure whether or not convolution layer
functionality could be implemented in time. Instead I decided that the implementation would
benefit from a greater focus on testing core multi-layer network functionality and performance.
Design specification
Designing a flexible network architecture
Rather than adding computational units directly into the Layer class, it was decided to wrap them
within a pool class. This gives the programmer more flexibility when defining network
architecture, as shown by Fig 3.2.1. This was an early design decision, the result of designing a
way in which convolution layers and standard unit layers could be integrated in a
complementary fashion, rather than forcing the programmer to choose between one or the
other. Layers enforce the sequence in which the forward and backward passes visit units.
Passes are performed in parallel for pools in the same layer. MatrixPools are pools of standard
units with biases. ConvPools are pools of convolutional units arranged into a 2D matrix.
ConvPool units share a single bias between them for each incoming convolutional kernel.
Figure 3.2.1. A network architecture example that might be used.
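The layer/pool arrangement described above might be sketched as follows; the class names mirror the text, but the real library's interfaces will differ:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Illustrative names only; the real library's interfaces differ.
struct Pool {
    virtual ~Pool() {}
    virtual std::string kind() const = 0;
};
struct MatrixPool : Pool {  // standard units, one bias per unit
    std::string kind() const override { return "matrix"; }
};
struct ConvPool : Pool {  // convolutional units, one shared bias per kernel
    std::string kind() const override { return "conv"; }
};
struct Layer {
    // Pools in the same layer are processed in parallel during a pass;
    // layers enforce the order in which passes visit units.
    std::vector<std::unique_ptr<Pool>> pools;
};
```

Wrapping units in pools rather than placing them directly in the Layer lets a single layer mix standard and convolutional pools side by side.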
Validation tests
All training outputs are normalised into the range 0.0 to 1.0 such that they are compatible with
logistic units typically used by output layers. Linear rectifiers are not suitable for use in the
network output layer.
1. MNIST handwritten character recognition, 60,000 labelled training images, 10,000
labelled testing images [29]. Network input of 28x28 = 784 LU. Output of 10 SiLU, with
the index of the unit with the largest response corresponding to the digit’s classification.
Tests 2–4 generate random values a, b, c, d, e in the range 0.0 to 1.0.
2. sin(a), 1000 testing values, 200 training values. Network input of 1 LU. Output of 1 SiLU.
3. sort(a, b, c, d, e), sorting 5 parameters, 1000 testing values, 200 training values. Network
input of 5 LU. Output of 5 SiLU.
4. Polynomial, 3.0f*a*a + a + 7.0f*b + 1.0f, 1000 testing values, 200 training values. Network
input of 2 LU. Output of 1 SiLU.
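As an example of the normalisation described above, sin(a) has range -1 to 1 and can be brought into 0.0 to 1.0 with an affine map. normalisedSin is a hypothetical helper, not library code:

```cpp
#include <cassert>
#include <cmath>

// normalisedSin is a hypothetical helper: sin(a) has range -1..1, so an
// affine map brings it into 0.0..1.0 for the logistic output unit.
float normalisedSin(float a) { return (std::sin(a) + 1.0f) * 0.5f; }
```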
Class hierarchy
Figure 3.3.1. A UML diagram showing the basic relationship between network classes. Important field
members are shown. The Network class is intended to provide the primary interface used by the programmer.
Results
Requirement satisfaction
Refer to system design, essential and optional requirements, page 11.
1. a. Full compliance.
1. b. Full compliance.
1. c. Full compliance.
1. d. Full compliance.
2. Full compliance.
3. Full compliance.
4. Full compliance.
5. Full compliance.
6. Full compliance.
Optional Requirements
1. Full compliance, MNIST [29] handwritten digit dataset validation provided.
2. Partial compliance. clFFT tests completed. Interface and class structure for
convolution units and kernels added. No implementations currently present.
3. Full compliance. Linear rectifiers are used as the default activation function for hidden
layers.
4. No compliance. A test was conducted with weight decay, but it was not found to
increase network test validation accuracy. Consequently it was decided not to include the
weight modification change. Further testing is required.
Test validation Results
Table 4.1.1. Results from validation runs with varying epoch numbers. The initial learn rate for all tests was
0.001.
OpenCL Device | Validation test type | Training time (seconds) | Epochs | Network structure | Training sample selection | Training passes per epoch | Mean standard error | Classification error
AMD Fury X (8192 GFlops) | MNIST | 10.941 | 5 | Appendix A.1 | random | 2000 | 0.1198 | 0.1819
Intel i7-6700k (114 GFlops) | MNIST | 65.4442 | 5 | Appendix A.1 | random | 2000 | 0.1375 | 0.187
AMD Fury X (8192 GFlops) | MNIST | 21.6659 | 10 | Appendix A.1 | random | 2000 | 0.1167 | 0.1617
Intel i7-6700k (114 GFlops) | MNIST | 135.6973 | 10 | Appendix A.1 | random | 2000 | 0.1118 | 0.1614
AMD Fury X (8192 GFlops) | MNIST | 42.4356 | 20 | Appendix A.1 | random | 2000 | 0.1035 | 0.1509
Intel i7-6700k (114 GFlops) | MNIST | 262.8941 | 20 | Appendix A.1 | random | 2000 | 0.0873 | 0.1358
AMD Fury X (8192 GFlops) | Sin(x) | 9.3542 | 20 | Appendix A.2 | all | 800 | 0.0124 | N/A
Intel i7-6700k (114 GFlops) | Sin(x) | 19.0043 | 20 | Appendix A.2 | all | 800 | 0.0084 | N/A
AMD Fury X (8192 GFlops) | Sin(x) | 9.3163 | 20 | Appendix A.2 | all | 800 | 0.0334 | N/A
Intel i7-6700k (114 GFlops) | Sin(x) | 18.7893 | 20 | Appendix A.2 | all | 800 | 0.0293 | N/A
AMD Fury X (8192 GFlops) | Sin(x) | 9.2561 | 20 | Appendix A.2 | all | 800 | 0.1025 | N/A
Intel i7-6700k (114 GFlops) | Sin(x) | 18.0904 | 20 | Appendix A.2 | all | 800 | 0.0128 | N/A
AMD Fury X (8192 GFlops) | Sort(a, b, c, d, e) | 25.9588 | 20 | Appendix A.3 | all | 800 | 0.2039 | N/A
Intel i7-6700k (114 GFlops) | Sort(a, b, c, d, e) | 169.0974 | 20 | Appendix A.3 | all | 800 | 0.2178 | N/A
AMD Fury X (8192 GFlops) | Sort(a, b, c, d, e) | 25.6606 | 20 | Appendix A.3 | all | 800 | 0.191 | N/A
Intel i7-6700k (114 GFlops) | Sort(a, b, c, d, e) | 173.9321 | 20 | Appendix A.3 | all | 800 | 0.1462 | N/A
AMD Fury X (8192 GFlops) | Sort(a, b, c, d, e) | 25.6168 | 20 | Appendix A.3 | all | 800 | 0.1807 | N/A
Intel i7-6700k (114 GFlops) | Sort(a, b, c, d, e) | 169.9004 | 20 | Appendix A.3 | all | 800 | 0.1918 | N/A
AMD Fury X (8192 GFlops) | Polynomial | 17.789 | 20 | Appendix A.4 | all | 800 | 0.0209 | N/A
Intel i7-6700k (114 GFlops) | Polynomial | 90.2876 | 20 | Appendix A.4 | all | 800 | 0.0315 | N/A
AMD Fury X (8192 GFlops) | Polynomial | 17.508 | 20 | Appendix A.4 | all | 800 | 0.0185 | N/A
Intel i7-6700k (114 GFlops) | Polynomial | 90.7548 | 20 | Appendix A.4 | all | 800 | 0.0234 | N/A
AMD Fury X (8192 GFlops) | Polynomial | 17.5351 | 20 | Appendix A.4 | all | 800 | 0.0203 | N/A
Intel i7-6700k (114 GFlops) | Polynomial | 87.968 | 20 | Appendix A.4 | all | 800 | 0.0239 | N/A
AMD Fury X (8192 GFlops) | MNIST | 3183.586 | 200 | Appendix A.5 | random | 5000 | 0.027 | 0.0454
AMD Fury X (8192 GFlops) | MNIST | 164.564 | 10 | Appendix A.5 | random | 5000 | 0.0612 | 0.0956
MNIST classification examples
Figure 4.2.1. A randomly sampled 2 misclassified by the neural network as a 0.
Figure 4.2.2. A randomly sampled 2 that is correctly classified.
Figure 4.2.3. A randomly sampled 5 that is correctly classified.
Result Discussion
Taking the mean average ratio of i7-6700k run times over Fury X run times from table 4.1.1
gives a mean ratio of 4.97. This is low considering the Fury X has 8192 GFlops compared to
the i7-6700k’s 114, which would suggest a ratio closer to 72. It is possible that the task queue is
not saturated and that the OpenCL device is idling for a number of cycles, which would suggest
the main thread is causing throttling. Alternatively, it is possible that an OpenCL kernel is causing
a bottleneck due to poor optimisation. Further investigation is required.
Overall performance is acceptable on the Fury X, but has some way to go before it is
comparable with popular public libraries. The 10 epoch Fury X test with a 5000
sample rate completed training in 165 seconds, and had 1,099,770 trainable parameters. A
similar network was set up in Python using Theano, via Lasagne, to provide a
reference. The Theano network had 945,768 parameters, and achieved a training time of 44
seconds on an i7-6700k over 10 epochs. Final accuracy was relatively similar: my OpenCL
implementation achieved a misclassification rate of 10%, while Theano achieved an error of 8%.
Recognition rate was good, taking 15.5 seconds to recognise all 10,000 MNIST test images,
giving an image-per-second rate of 645. Multiplying out by the size of the input, 28x28 = 784, this
gives a total rate of 505,680 inputs processed per second. Caffe’s OpenCL branch is
approximately 90x faster at processing inputs, and significantly faster at training, though it is
worth noting that batching is used for the Caffe test results published on GitHub.
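The throughput figures quoted above follow from simple arithmetic:

```cpp
#include <cassert>

// 10,000 images in 15.5 seconds is 645 images per second; each image is
// 28x28 = 784 inputs, giving 505,680 inputs processed per second.
const int imagesPerSecond = static_cast<int>(10000.0 / 15.5);
const int inputsPerSecond = imagesPerSecond * 28 * 28;
```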
Training times on the i7-6700k could be quite long with my OpenCL implementation. For example,
the 10 epoch MNIST test with a 2000 sample rate took 136 seconds, despite the network having
only 218,842 trainable parameters.
A longer training session was undertaken using the network described in Appendix A.5,
achieving a good final error rate of 4.5%, the same as that achieved by a two-layer neural
network in a popular publication on document recognition [29][30]. The network also proved
accurate over the modelled mathematical functions sin(x), sort(a, b, c, d, e) and the polynomial
function, achieving best respective errors of 8.4%, 15% and 19%.
Evaluation
Further Work
1. Debugging performance issues.
2. Finishing integration of optional requirements.
3. Possibly worth investigating the removal of the majority of queue jobs by calling kernels
from the device. OpenCL 2.0 allows compute devices to make kernel calls. This feature
was not explored, as it adds significant design complexity: clBLAS would have to be
modified to handle custom kernel post/pre callbacks. clFFT supports this feature.
Conclusion
Considering the complexity of the project, I believe the outcome to be reasonable. A cross-
platform deep learning library was developed in C++ and demonstrated to work successfully on
a range of tasks. Though performance was not ideal, I am confident the bottlenecks could be
identified by isolating the execution times of the called OpenCL kernels.
Deployment guide
Hardware requirements:
OpenCL 2.0 compatible device
x64 Windows environment (tested on Windows 7, 8, and 10)
Software requirements:
AMD APP SDK 3.0 or greater
Building from source requires Visual Studio 2015 or newer
1. Proceed to http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-
parallel-processing-app-sdk/.
2. Download and install AMD APP SDK 3.0 for Windows 64-bit.
3. Unzip Code_Base.zip
Running the binary:
4. Proceed to the “./Backpropagation/Bin” folder
5. Run Backpropagation.exe
Compiling from source:
4. Proceed to the “./Backpropagation/Backpropagation” folder
5. Open Visual Studio 2015
6. Click File -> Open Project/Solution
7. Open Backpropagation.sln
8. Press Ctrl + F5 to compile and run
Bibliography
[1] Sainath, Tara N., Abdel-rahman Mohamed, Brian Kingsbury, and Bhuvana Ramabhadran. "Deep
convolutional neural networks for LVCSR." In Acoustics, Speech and Signal Processing (ICASSP), 2013
IEEE International Conference on, pp. 8614-8618. IEEE, 2013.
[2] https://research.facebook.com/blog/fair-open-sources-deep-learning-modules-for-torch/
[3] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche,
Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman,
Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray
Kavukcuoglu, Thore Graepel, and Demis Hassabis. "Mastering the Game of Go with Deep Neural
Networks and Tree Search." Nature 529, no. 7587 (2016): 484.
[4] Linnainmaa, Seppo. "The representation of the cumulative rounding error of an algorithm as a Taylor
expansion of the local rounding errors." Master's Thesis (in Finnish), Univ. Helsinki (1970): 6-7.
[5] Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. Learning internal representations by
error propagation. No. ICS-8506. CALIFORNIA UNIV SAN DIEGO LA JOLLA INST FOR COGNITIVE
SCIENCE, 1985.
[6] Rumelhart, D.E., Hinton, G.E. and Williams, R.J., 1988. Learning representations by back-propagating
errors. Cognitive modeling, 5(3), p.714.
[7] Mathieu, Michael, Mikael Henaff, and Yann LeCun. "Fast training of convolutional networks through
FFTs." arXiv preprint arXiv:1312.5851 (2013).
[8] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., 2014. Dropout: A simple
way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1),
pp.1929-1958.
[9] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8
(1997): 1735-1780.
[10] Martínez-Zarzuela, Mario, Francisco Javier Díaz Pernas, José Fernando Díez Higuera, and Míriam
Antón Rodríguez. "Fuzzy ART neural network parallel computing on the GPU." In Computational and
Ambient Intelligence, pp. 463-470. Springer Berlin Heidelberg, 2007.
[11] (Shader model 5 for DirectX), accessed 21/ 05/ 2016,
https://www.google.co.uk/search?q=shader+model+5&oq=shader+model+5&aqs=chrome..69i57.3354j0j7
&sourceid=chrome&ie=UTF-8
[12] John Kessenich, Dave Baldwin, Randi Rost, “The OpenGL Shader language”,
https://www.opengl.org/registry/doc/GLSLangSpec.4.50.pdf
[13] http://www.nvidia.co.uk/object/cuda-parallel-computing-uk.html, accessed 22/05/2016
[14] https://www.khronos.org/opencl/, accessed 22/05/2016
[15] http://developer.amd.com/tools-and-sdks/opencl-zone/, accessed 22/05/2016
[16] https://software.intel.com/en-us/intel-
opencl?cid=sem43700008896000156&intel_term=intel+openCL&gclid=CjwKEAjwsYW6BRCTzvu5y8DP
hi0SJABnGLlHWfkJo5tNdbBubNlnsqdz_nyHUSfm6SPPlECfXbtAgxoCSvXw_wcB&gclsrc=aw.ds,
accessed 22/05/2016
[17] https://developer.nvidia.com/gpu-accelerated-libraries, accessed 22/05/2016
[18] https://developer.nvidia.com/cuda-gpus, accessed 22/05/2016
[19] https://www.khronos.org/conformance/adopters/conformant-products#opencl, accessed 22/05/2016
[20] https://github.com/amd/OpenCL-caffe/wiki/How-to-set-up-clBLAS-and-OpenCL, accessed 22/05/2016
[21] https://github.com/amd/OpenCL-caffe, accessed 22/05/2016
[22] Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. Imagenet classification with deep convolutional
neural networks. In Advances in neural information processing systems (pp. 1097-1105).
[23] Kulkarni, Sanjeev, and Harman, Gilbert. "Multilayer Networks." In Wiley Series in Probability and
Statistics, 99-115. Hoboken, NJ, USA: John Wiley & Sons, 2011.
[24] Rosenblatt, Frank. "The perceptron: a probabilistic model for information storage and organization in
the brain." Psychological review 65, no. 6 (1958): 386.
[25] Narsky, Ilya, and Porter, Frank C. "Neural Networks." In Statistical Analysis Techniques in Particle
Physics, 251-63. Weinheim, Germany: Wiley‐VCH Verlag GmbH & KGaA, 2013. Chapter 12.
[26] Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Deep sparse rectifier neural networks."
In International Conference on Artificial Intelligence and Statistics, pp. 315-323. 2011.
[27] Simard, P.Y., Steinkraus, D. and Platt, J.C., 2003, August. Best practices for convolutional neural
networks applied to visual document analysis. In Seventh International Conference on Document Analysis and Recognition (p. 958). IEEE.
[28] Mathieu, M., Henaff, M. and LeCun, Y., 2013. Fast training of convolutional networks through
FFTs. arXiv preprint arXiv:1312.5851.
[29] http://yann.lecun.com/exdb/mnist/, accessed 22/05/2016
[30] LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. "Gradient-based learning applied to document
recognition." Proceedings of the IEEE 86, no. 11 (1998): 2278-2324.
Appendices
A - Network validation architectures
A.1. MNIST
Trainable parameters 218,842
A.2. sin(a)
Trainable parameters 387
A.3. sort(a, b, c, d, e)
Trainable parameters 36,259
A.4. polynomial
Trainable parameters 17,285
A.5. MNIST
Trainable parameters 1,099,770
B – clFFT library experiment
B.1. Fourier transform and inverse Fourier transform via clFFT and OpenCL
/* ************************************************************************
* Copyright 2013 Advanced Micro Devices, Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
* ************************************************************************/
/* ************************************************************************
* Copyright Callum McMahon
*
* Added inverse hermitian transform, showing how data can
* be transformed back to the spatial domain.
* Terminal outputs after the inverse should match the original dataset.
* ************************************************************************/
/* No need to explicitly include the OpenCL headers */
size_t buffer_size_y = ((N0+2) * N1) * sizeof(*Y);
X = (float *)malloc(buffer_size_x);
Y = (float *)malloc(buffer_size_y);
/* print input array just using the
* indices to fill the array with data */
printf("\nPerforming fft on an two dimensional array of size N0 x N1 : %ld x %ld\n", N0, N1);
int i, j;
i = j = 0;
for (i = 0; i<N0; ++i) {
for (j = 0; j<N1; ++j) {
float x = 0.5f;
float y = 0.5f;
unsigned idx = (j + i*N0);
X[idx] = sin(1.0f*(float)i) + cos(0.4f*(float)j);
printf("\n(%f) ", X[idx]);
}
printf("\n");
}
/* Prepare OpenCL memory objects and place data inside them. */
bufX = clCreateBuffer(ctx, CL_MEM_READ_WRITE, buffer_size_x, NULL, &err);
//CL_MEM_READ_ONLY
bufY = clCreateBuffer(ctx, CL_MEM_READ_WRITE, buffer_size_y, NULL, &err);
err = clEnqueueWriteBuffer(queue, bufX, CL_TRUE, 0, buffer_size_x, X, 0, NULL,
NULL);
/* Create a default plan for a complex FFT. */
err = clfftCreateDefaultPlan(&planHandle, ctx, dim, clLengths);
/* Set plan parameters. */
err = clfftSetPlanPrecision(planHandle, CLFFT_SINGLE);
err = clfftSetLayout(planHandle, CLFFT_REAL, CLFFT_HERMITIAN_INTERLEAVED);
err = clfftSetResultLocation(planHandle, CLFFT_OUTOFPLACE);
err = clfftSetPlanOutStride(planHandle, dim, clOutStrides);
err = clfftSetPlanInStride(planHandle, dim, clInStrides);
/* Bake the plan. */
err = clfftBakePlan(planHandle, 1, &queue, NULL, NULL);
/* Execute the plan. */
err = clfftEnqueueTransform(planHandle, CLFFT_FORWARD, 1, &queue, 0, NULL, NULL,
&bufX, &bufY, NULL);
/* Wait for calculations to be finished. */
err = clFinish(queue);
/* Fetch results of calculations. */
err = clEnqueueReadBuffer(queue, bufY, CL_TRUE, 0, buffer_size_y, Y, 0, NULL,
NULL);
/* print output array */
printf("\n\nfft result: \n");
i = j = 0;
for (i = 0; i<N0; ++i) {
for (j = 0; j<fac; ++j) {
unsigned idx = 2 * (j + i*fac);
printf("\n(%f) ", sqrt(Y[idx] * Y[idx] + Y[idx+1] * Y[idx+1]));
//fiddle with results to test
//Y[idx] += 0.01f*(float)idx;
}
printf("\n");
}
printf("\n");
//*****************
//reverse!
//*****************
printf("\n\n *** reverse ***\n\n");
//clOutStrides[0] = { 1, fac };
//clInStrides[0] = { 1, N0 };
err = clEnqueueWriteBuffer(queue, bufY, CL_TRUE, 0, buffer_size_y, Y, 0, NULL,
NULL);
/* Create a default plan for a complex FFT. */
err = clfftCreateDefaultPlan(&planHandle, ctx, dim, clLengths);
/* Set plan parameters. */
err = clfftSetPlanPrecision(planHandle, CLFFT_SINGLE);
err = clfftSetLayout(planHandle, CLFFT_HERMITIAN_INTERLEAVED, CLFFT_REAL);
err = clfftSetResultLocation(planHandle, CLFFT_OUTOFPLACE);
err = clfftSetPlanOutStride(planHandle, dim, clInStrides);
err = clfftSetPlanInStride(planHandle, dim, clOutStrides);
/* Bake the plan. */
err = clfftBakePlan(planHandle, 1, &queue, NULL, NULL);
/* Execute the plan. */
err = clfftEnqueueTransform(planHandle, CLFFT_FORWARD, 1, &queue, 0, NULL, NULL,
&bufY, &bufX, NULL);
/* Wait for calculations to be finished. */
err = clFinish(queue);
/* Fetch results of calculations. */
err = clEnqueueReadBuffer(queue, bufX, CL_TRUE, 0, buffer_size_x, X, 0, NULL,
NULL);
i = j = 0;
for (i = 0; i<N0; ++i) {
for (j = 0; j<N1; ++j) {
float x = 0.5f;
float y = 0.5f;
unsigned idx = (j + i*N0);
printf("\n(%f) ", X[idx]);
}
printf("\n");
}
//*****************
//reverse END
//*****************
/* Release OpenCL memory objects. */
clReleaseMemObject(bufX);
free(X);
clReleaseMemObject(bufY);
free(Y);
/* Release the plan. */
err = clfftDestroyPlan(&planHandle);
/* Release clFFT library. */
clfftTeardown();
/* Release OpenCL working objects. */
clReleaseCommandQueue(queue);
clReleaseContext(ctx);
getchar();
return ret;
}
B.2. Program outputs from B.1, showing only the first column for succinctness.
Platform found: Intel(R) OpenCL
Device found on the above platform: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Performing fft on an two dimensional array of size N0 x N1 : 8 x 8
(1.000000)
(0.921061)
(0.696707)
(0.362358)
(-0.029200)
(-0.416147)
(-0.737394)
(-0.942222)
fft result:
(11.271166)
(27.725875)
(11.865518)
(8.765699)
(8.040510)
*** reverse ***
(1.000000)
(0.921061)
(0.696707)
(0.362358)
(-0.029200)
(-0.416147)
(-0.737394)
(-0.942222)
C – Gantt time plans
C.1. Original Gantt time plan
C.2. Modified Gantt time plan