University of Surrey
Faculty of engineering and physical sciences
Department of Computing
Final Year Project Report
19/05/2016
Title: Platform agnostic hardware acceleration
for deep neural networks
Student: Callum McMahon
URN: 6279333
Supervisor: Lillian Tang
Contents

Abstract
Abbreviations
Introduction
    Background
    Objectives
Literature Review
    Pre-existing software packages
    Exploring Caffe's OpenCL branch in more depth
    Theoretical groundwork
        Multi-layer feed-forward perceptrons
        Modern activation functions and the back propagation algorithm
        Weight regularization
    OpenCL learning resources and reference material
System Design
    Development environment
    Essential requirements
    Implementation deliverables
    Technical challenges
        Feeding the OpenCL device
        OpenCL kernel efficiency considerations
        Using clFFT
    Implementation schedule
    Design specification
        Designing a flexible network architecture
        Validation tests
        Class hierarchy
Results
    Requirement satisfaction
    Test validation results
    MNIST classification examples
    Result discussion
Evaluation
    Further work
    Conclusion
Deployment guide
Bibliography
Appendices
    A - Network validation architectures
        A.1. MNIST
        A.2. sin(a)
        A.3. sort(a, b, c, d, e)
        A.4. polynomial
        A.5. MNIST
    B - clFFT library experiment
        B.1. Fourier transform and inverse Fourier transform via clFFT and OpenCL
        B.2. Program outputs from B.1
    C - Gantt time plans
Abstract
This report provides an overview of resources available for deep neural network machine learning. Current state of the art software libraries employ massively vectorised training pipelines, enabling highly parallel computation and hence faster training convergence. Graphics processing units provide access to a far greater threading capability than a typical central processing unit. As such, a number of libraries have been developed with alternative fast native GPU code paths. Current implementations are tightly integrated with the CUDA platform, a proprietary programming model restricted to Nvidia GPUs.

In response, a basic cross-platform neural network library has been developed in C++, demonstrating the feasibility of a single high performance platform agnostic code path. The library has been built on top of the OpenCL programming framework. OpenCL is maintained by Khronos, a non-profit consortium, with implementations available on a number of devices from different vendors.
Validation tests were performed on multilayer neural networks to assess training performance and final network accuracy. Training consisted of multiple passes using back propagation and an adaptive global learning rate.

A network consisting of two hidden linear rectifier layers was trained on the MNIST dataset, a well-known set of labelled greyscale digit images. The best observed error was achieved with a total of 1,099,770 trainable parameters over 200 epochs, attaining a classification error of 4.5%. Each epoch consisted of 5000 stochastic samples and back propagation passes. Total training time was 53 minutes. Fast convergence was also observed using fewer training epochs: with 10 epochs, a classification error rate of 9.6% was observed, taking 164.6 seconds of training on an AMD Fury X.

Training on the Fury X was found to be approximately 5x faster than on the i7-6700k. The Fury X boasts approximately 72x the single-precision floating point performance of the i7-6700k, suggesting further optimisations can be made.
For demonstration purposes, Windows x64 has been explicitly targeted by this release; porting to another operating system would be trivial. The library has been written against OpenCL version 2.0 in order to take advantage of fine control over job queues. All recent CPUs and GPUs from AMD and Intel are OpenCL 2.0 capable. Currently Nvidia devices only support OpenCL 1.2, but 2.0 support is likely to come in the near future.
Abbreviations

CPU     Central Processing Unit
GPU     Graphics Processing Unit
CUDA    Compute Unified Device Architecture
OpenCL  Open Computing Language
clBLAS  OpenCL Basic Linear Algebra Subprograms
clFFT   OpenCL Fast Fourier Transform
ReLU    Rectified Linear Unit
LU      Linear Unit
SiU     Sigmoid Unit
Introduction
Background
The field of machine learning is currently experiencing renewed interest. Developments in deep neural network architectures and training methods have resulted in greatly improved model learning accuracy for difficult tasks. Refinements to techniques are being continually developed, with error rates as low as 15.2% being reported in difficult tasks such as speech recognition [1]. Companies are investing large sums into neural network research; see, for example, Facebook open sourcing deep learning modules for Torch [2]. There have been a number of high profile public successes, such as Alphabet's AlphaGo, the first program to ever beat a professional Go player without a handicap [3].
Figure 1.1 Google trend data showing the popularity of search terms.
Note the rapid rise of "deep learning" searches.
Deep neural networks are an evolution of single hidden layer neural networks. Whilst the idea of a distributed computational network was conceived in the late fifties, inspired by biological models, it was not until the invention of back propagation in 1970 [4] that an effective network training method was available. 1985 saw the first proposal of introducing convolution layers [5]. Since then a large number of new methods have been introduced: weight decay [6], fast convolution layers using Fourier transforms [7], dropout [8], and long short-term memory networks [9].
Demand for increased computational performance has risen with the increasing complexity of neural networks. It has since been demonstrated that GPUs can be used to effectively train neural networks [10]. Neural network optimisation is a massively parallel problem, and as such is well suited to GPU architectures, which give access to a much larger number of threads than a typical CPU.
GPU APIs were originally designed around a fixed pipeline intended to produce visual effects. Traditionally it has been very difficult to exploit GPU parallelism for general algorithmic computation. However, graphics API pipelines have become increasingly generic to handle more intricate computer graphics methods [11][12]. Hardware vendors have subsequently released more generic compute platforms [13][14][15][16] that can run code against GPU hardware, designed for the needs of the scientific computing community. Nvidia CUDA 1.0 was released in 2007, and OpenCL 1.0 in 2009. CUDA kernels are written in a C++ dialect, while OpenCL C kernels are based on C99, with a C++14-based kernel language introduced in later OpenCL revisions.
CUDA is currently the more mature of the two GPU compute platforms, boasting a wider selection of libraries [17]. This has directly translated into more widespread CUDA hardware acceleration for training deep neural networks. In contrast, OpenCL implementations are generally incomplete or non-existent (table 2.1). However, CUDA is a proprietary platform that will only run on Nvidia's GPU hardware [18]. OpenCL implementations exist across a range of hardware from different vendors, including both CPUs and GPUs [19]. OpenCL has the potential to provide a single unified fast code path for training deep neural networks.
Objectives
- To develop a basic deep learning library that utilises OpenCL for all intensive operations.
- To develop an easy to use interface within C++.
- To maintain compatibility across as many OpenCL platforms as possible.
- To minimise external dependencies to ease setup and increase portability.
Literature Review
Pre-existing software packages
| Software | Primary language interface | Other language interfaces | CUDA GPU support | OpenCL CPU / GPU support |
|---|---|---|---|---|
| Caffe | Python | C++, Matlab | Yes | Third party branch from AMD, but only neared feature completion as of late August 2015. |
| Neon | Python | - | Yes | No. |
| Theano | Python | - | Yes | In development. |
| Tensorflow | Python | C++ (graphs only) | Yes | In development. |
| Torch | Lua | C | Yes | Third party branch in development. |
Figure 2.1.1 An overview of popular deep learning software environments.
None of the popular deep learning libraries provide official OpenCL support. Caffe is the only library with a feature complete OpenCL branch.
Exploring Caffe’s OpenCL branch in more depth
There a large number of dependencies [20] required for installation. Installations are restricted to
Ubuntu 12.04 or later. Only AMD GPUs are currently supported. Building and deploying the full
caffe OpenCL stack was deemed outside the scope of this project. Test performance metrics are
available on the github page [21], see Fig 2.2.1.
| Platform | Speed (images per second) |
|---|---|
| AMD W9100 & A10-7850k | 255 |
| AMD R9 Fury & A10-7850k | 261 |
| AMD R290X @1000MHz & A10-7850k | 268 |
| AMD S9150 @900MHz & Xeon E5-2640 | 227 |
Figure 2.2.1. Training performance using the well-known AlexNet network. [22]

The network inputs used by AlexNet were images of 256x256 resolution. Multiplying the total number of pixels per image by the number of images processed per second, we can see that Caffe's OpenCL branch is capable of training on approximately 17,104,896 inputs per second on an AMD R9 Fury (256 x 256 x 261).
| Platform | Speed (images per second) |
|---|---|
| AMD W9100 & A10-7850k | 590 |
| AMD R9 Fury & A10-7850k | 699 |
| AMD R290X @1000MHz & A10-7850k | 606 |
| AMD S9150 @900MHz & Xeon E5-2640 | 452 |
Figure 2.2.2. Recognition performance using AlexNet. [22]
Similarly, we can see that approximately 45,809,664 inputs per second can be processed for recognition on the R9 Fury (256 x 256 x 699).
Theoretical groundwork
Multi-layer feed-forward perceptrons

The perceptron network was first proposed in 1958 by Frank Rosenblatt [24]. Perceptrons are connected into a directed graph. The perceptrons at the start of the graph correspond to the network's inputs; perceptrons at the end of the graph, to its outputs. Input values are passed into the input perceptrons. Each subsequent perceptron computes a weighted sum of the outputs from prior connected perceptrons. The summed value is then passed through an activation function, $A(x)$, and passed on through to the next set of perceptrons. This process is continued until the network output is reached. These early networks were handcrafted by tweaking connection weight values. Modern neural networks employ learning algorithms to automatically update weight values.

Figure 2.3.1. A diagram showing how a single perceptron unit processes inputs within a network. This process is called a forward pass.

Figure 2.3.2. The Heaviside step function, $A(x) = \frac{d}{dx}\max\{x, 0\}$, was the activation function originally used by Rosenblatt. It has since been replaced by differentiable functions. Differentiable activation functions allow gradient descent to be used to modify connection weights in such a way that the network can be taught to output a set of desired values for a given input.
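To make the forward pass concrete, the following is a minimal C++ sketch of a single perceptron unit; the function names are illustrative assumptions, not the library's actual interface.

#include <vector>

// Sketch: one perceptron's forward pass. A weighted sum of the outputs of
// prior connected units is passed through an activation function A(x).
float forward(const std::vector<float>& priorOutputs,
              const std::vector<float>& weights,
              float bias,
              float (*A)(float))
{
    float sum = bias;
    for (std::size_t i = 0; i < priorOutputs.size(); ++i)
        sum += priorOutputs[i] * weights[i];
    return A(sum);
}

// Rosenblatt's original activation: the Heaviside step function.
float heaviside(float x) { return x >= 0.0f ? 1.0f : 0.0f; }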
Modern Activation Functions and the Back Propagation algorithm

Back propagation [4] is widely used as a training algorithm for neural networks. It is a class of gradient descent algorithm. It works by first performing a forward pass of the network. See [25] for an overview of the algorithm.

$$p_j = A\left(\sum_{i=1}^{\text{incoming weights}} p_i w_{ij}\right)$$

Where $A()$ is an activation function, $w_{ij}$ is a weight between units $i$ and $j$, and $p_x$ is the output of unit $x$. $i$ is the index of the unit closest to the input layer.
The activation function must be differentiable so that an error gradient may be calculated. The sigmoid function is commonly used. The linear rectifier activation function has been shown to have better characteristics under some conditions [26]. The linear rectifier prevents the vanishing gradient problem experienced by the sigmoid activation function, where inputs of large magnitude produce activation gradients of 0, or near 0, which in turn reduces the weight update deltas to 0, or near 0.
Sigmoid and its derivative:

$$A(x) = \frac{1}{1+e^{-x}}, \qquad \frac{dA(x)}{dx} = \frac{e^{-x}}{(1+e^{-x})^2} = A(x)\,(1 - A(x))$$

Linear rectifier and its derivative:

$$A(x) = \ln(1+e^{x}), \qquad \frac{dA(x)}{dx} = \frac{1}{1+e^{-x}}$$
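The two activation functions and their derivatives transcribe directly into C++; this is a minimal sketch of the formulas above, not code taken from the library.

#include <cmath>

// Sigmoid: A(x) = 1 / (1 + e^-x), with dA/dx = A(x)(1 - A(x)).
float sigmoid(float x)      { return 1.0f / (1.0f + std::exp(-x)); }
float sigmoidDeriv(float x) { float a = sigmoid(x); return a * (1.0f - a); }

// Linear rectifier as defined above: A(x) = ln(1 + e^x), dA/dx = sigmoid(x).
float rectifier(float x)      { return std::log(1.0f + std::exp(x)); }
float rectifierDeriv(float x) { return sigmoid(x); }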
An error delta is calculated at each output unit by finding the difference between its output and a desired output value. The error deltas are propagated back through the network to the input layer, storing deltas at each unit. This is referred to as a backwards pass.

Delta error for output units:

$$\delta_{p_j} = (p_j - t_j)\,\frac{dA(p_j)}{dx}$$

Where $t_j$ denotes the $j$th output unit's target value.

Delta error for inner units:

$$\delta_{p_j} = \frac{dA(p_j)}{dx} \sum_{i=1}^{\text{outgoing weights}} \delta_{p_i} w_{ij}$$

$w_{ij}$ is the weight from unit $i$ in the previously visited layer to $j$ in the current layer, i.e. $i$ is the index of the unit closest to the output layer.

Finally, weights are moved by a value proportional to the error delta at the unit they provide inputs for. The direction of change is opposite to the sign of the delta. The deltas are proportional to the rate of change of the network's error with respect to the incoming weights.

$$\Delta w_{ij} = -a\,\delta_j\,p_i = -a\,\frac{d\,Error}{dw_{ij}}$$

Where $a$ is the learning rate. $i$ is now the index of the unit closest to the input layer.
The learning rate, $a$, must be small enough to allow the network to converge, yet large enough to give a reasonable training time. Small $a$ values may also cause the network to get stuck in local error minima.
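As a minimal sketch of the update equations above (the data layout and names are assumptions for illustration, not the library's implementation):

#include <vector>

// Output-unit deltas: delta_j = (p_j - t_j) * A'(x_j).
std::vector<float> outputDeltas(const std::vector<float>& p,
                                const std::vector<float>& t,
                                const std::vector<float>& aDeriv)
{
    std::vector<float> d(p.size());
    for (std::size_t j = 0; j < p.size(); ++j)
        d[j] = (p[j] - t[j]) * aDeriv[j];
    return d;
}

// Weight update: w_ij -= a * delta_j * p_i, with weights stored as w[i][j].
void updateWeights(std::vector<std::vector<float>>& w,
                   const std::vector<float>& priorOutputs, // p_i
                   const std::vector<float>& deltas,       // delta_j
                   float learnRate)                        // a
{
    for (std::size_t i = 0; i < w.size(); ++i)
        for (std::size_t j = 0; j < w[i].size(); ++j)
            w[i][j] -= learnRate * deltas[j] * priorOutputs[i];
}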
Weight regularization
Weight regularization is commonly applied in one of two forms: weight decay [6], or dropout [8]. Weight regularization is intended to prevent overfitting, whereby the network learns to exactly reproduce the training outputs, rather than learning a generalized pattern. Overfitted networks perform poorly on validation test sets.

Weight decay modification to the weight update rule:

$$\Delta w_{ij} = -a\,\delta_j\,p_i - d\cdot\mathrm{sign}(\delta_j\,p_i)$$

Where $d$ is a small decay factor, such that $d \ll a$.

Weight decay may however reduce final network performance, as it creates moving global optima. It is preferable to use dropout where possible. The dropout modification is applied to the forward pass during training, giving each unit a small probability of outputting a value of 0.

$$p_j = \begin{cases} A\left(\sum_{i=1}^{\text{incoming weights}} p_i w_{ij}\right) & \text{rnd}(0.0, 1.0) \geq d \\ 0 & \text{rnd}(0.0, 1.0) < d \end{cases}$$

Where $d$ is a small dropout probability such that $0.0 \leq d < 1.0$.

Dropout attempts to spread learned patterns across the network, rather than isolating them within groups of units.
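In code, the dropout rule amounts to zeroing each unit's output with probability d during training forward passes; a minimal sketch, with names assumed for illustration:

#include <random>
#include <vector>

// Sketch: apply dropout to a layer's outputs during a training forward pass.
void applyDropout(std::vector<float>& outputs, float d, std::mt19937& rng)
{
    std::uniform_real_distribution<float> rnd(0.0f, 1.0f);
    for (float& o : outputs)
        if (rnd(rng) < d)
            o = 0.0f; // unit dropped for this pass
}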
Convolution Layers and Fast Convolutions

Convolution layers provide a method of introducing translation resistant weights into the network [27]. Units within a convolution layer share weights in a spatial pattern, allowing the network to quickly generalize for inputs containing translated patterns. Stacked convolution layers can identify extremely complex patterns much more rapidly than a typical multi-layer network; convolution networks have seen great success in many applications.
Figure 2.3.3 A diagram showing how the weights are shared across convolutional layer units.
Convolution operations can however be expensive for large kernels, being $O(nk^2)$, where $n$ is the number of units in the convolutional layer, and $k$ is the kernel width. It has been recognised that the convolution theorem can be applied to give a greatly reduced computation time of $O(n\log n)$ for the forward pass [28].

$$F(c * k) = F(c)\cdot F(k) \qquad \therefore \qquad c * k = F^{-1}\left(F(c)\cdot F(k)\right)$$

The convolution theorem shows that the Fourier transform of the convolution of two matrices is equal to the elementwise product of their Fourier transforms. Using the fast Fourier transform algorithm, $F(c)$ and $F(k)$ can be computed in $O(n\log n)$, where $n$ is the number of elements in $c$ or $k$ (they must have the same number of elements). Similarly, the back propagation algorithm may also be modified to take advantage of this identity [28].
Delta errors for the convolutional output layer:

$$\boldsymbol{\delta}_j = (\boldsymbol{p} - \boldsymbol{t})\,\frac{dA(\boldsymbol{p})}{dp}$$

Note $(\boldsymbol{p} - \boldsymbol{t})\,\frac{dA(\boldsymbol{p})}{dp}$ is the matrix of output errors multiplied element wise with the derivatives of the activation function.

Delta errors for a convolutional inner layer:

$$\boldsymbol{\delta}_j = \frac{dA(\boldsymbol{l}_j)}{dp}\,\left(\boldsymbol{\delta}_i * \boldsymbol{w}^{T}_{ij}\right)$$

Where $i$ and $j$ are now indexes between network layers, rather than units. For the backwards pass $i$ is the index of the layer closest to the output layer. $\boldsymbol{l}_i = \boldsymbol{p}_i$ denotes the matrix of outputs for layer $i$.

Weight updates for a convolutional kernel:

$$\Delta\boldsymbol{w}_{ij} = -a\,(\boldsymbol{\delta}_j * \boldsymbol{l}_i) = -a\,\frac{d\boldsymbol{E}}{d\boldsymbol{w}}$$

For the weight updates, $i$ is the index of the layer closest to the input layer.
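Between a clFFT forward transform and the corresponding inverse transform, the convolution-theorem step reduces to an elementwise complex multiplication of the two spectra. The kernel below is a sketch of that step only; it assumes interleaved real/imaginary pairs and is not one of the library's actual kernels.

// Sketch: out = F(c) . F(k), the elementwise complex product of two spectra.
__kernel void complexPointwiseMul(__global const float* C,
                                  __global const float* K,
                                  __global float* out)
{
    size_t idx = 2 * get_global_id(0); // each element is a (real, imag) pair
    float re = C[idx] * K[idx]     - C[idx + 1] * K[idx + 1];
    float im = C[idx] * K[idx + 1] + C[idx + 1] * K[idx];
    out[idx]     = re;
    out[idx + 1] = im;
}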
OpenCL learning resources and reference material
Having never worked with OpenCL before, I ended up working through a number of
tutorials and example programs. Listed below are all the resources I used.
| Resource type | Name | Location |
|---|---|---|
| PDF, specification | OpenCL 2.0 specification | https://www.khronos.org/registry/cl/specs/opencl-2.0.pdf |
| Website, reference | clBLAS manual and reference | http://clmathlibraries.github.io/clBLAS/ |
| Website, reference | clFFT manual and reference | http://clmathlibraries.github.io/clFFT/ |
| Book | Heterogeneous Computing with OpenCL 2.0, by David Kaeli, Perhaad Mistry, Dana Schaa and Dong Ping Zhang | http://developer.amd.com/partners/university-programs/heterogeneous-computing-with-opencl/ |
| Website, tutorial | Oak Ridge laboratory, OpenCL vector addition tutorial | https://www.olcf.ornl.gov/tutorials/opencl-vector-addition/ |
| Website, tutorial | AMD, Intro to OpenCL tutorial | http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-resources/introductory-tutorial-to-opencl/ |

Figure 2.4.1. Learning resources.
System Design
Development environment
The OpenCL specification is written against C, with official C++ bindings available; C++ is subsequently the language of choice for this project.

Windows was chosen as the development environment due to personal familiarity with the Visual Studio software package. Visual Studio 2015 is used to provide an up to date implementation of the C++11 specification. In keeping with the project objectives, Windows-specific code shall be restricted to the main.cpp file. All other code will be written with the C++ standard in mind, and as such should compile under g++ and run on Linux.
Familiarisation with OpenCL showed that developing optimised kernels is difficult. Consequently, I decided to employ AMD's clBLAS library where possible. clBLAS provides a set of common basic linear algebra kernels. AMD also provides clFFT for computing fast Fourier transforms. clFFT was added as an additional dependency so as to assist in implementing fast convolution layers (Fig. 3.2.1).
Essential Requirements
1. A network class capable of:
a. Constructing multi-layer feed-forward neural networks. The programmer
should be able to easily specify the number of units within each layer.
b. Training neural networks. Training performance must be reported through
cross validation against test data.
c. Testing neural networks. A method must be implemented that returns
the network's mean standard error across a batch of test data.
d. Processing inputs. A method must be implemented that allows the network to
accept a single set of inputs from the main program thread, returning the
corresponding output from the network.
2. A layer class that provides a logical ordering of network computational units.
3. An implementation of the back propagation training algorithm.
4. An implementation of the sigmoid activation function and its corresponding
differential.
5. A sample program capable of demonstrating network training and testing functionality
on different OpenCL devices.
6. Unit testing, validating trained network accuracy against a dataset
generated from a mathematical function.
Optional Requirements
1. Unit testing, validating trained network accuracy against a well-known
pre-constructed dataset.
2. Implementation of a convolutional layer and convolutional kernel classes. These must
provide:
a. Weight sharing across spatially separated neuron units.
b. Modification to the back propagation algorithm to handle shared weights.
3. An implementation of the linear rectifier activation function and its corresponding
differential.
4. Network regularization. Either through weight decay or dropout.
Implementation Deliverables
1. A Visual Studio 2015 C++ solution containing a working example of the developed deep
neural network library.
2. Headers and associated .cpp definitions with comments describing how the library works.
3. OpenCL kernel code.
4. clBLAS and clFFT included as dynamic link libraries.
Technical Challenges
Feeding the OpenCL device
OpenCL provides a high latency, high throughput bridge between the host device and the
compute device. The host device and compute device share one or more queues. The host
produces jobs and inserts them into a queue. The compute device consumes job items from the
queue. By default, OpenCL creates a serial queue, forcing the compute device to compute jobs
in order. This is not ideal, as some jobs may take only a fraction of the compute device's resources. Setting CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE when creating the command queue will enable the device to consume jobs out of order.
Each job is associated with an event, which may be in one of four states: queued, submitted, running, and complete. Jobs are also associated with event completion wait lists, allowing for synchronization and dependency blocks. Ideally the work queue will be saturated so that the compute device can be continually working on jobs.
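As a minimal sketch (error handling omitted), an out-of-order queue can be requested at queue creation time under OpenCL 2.0:

#include <CL/cl.h>

// Sketch: request an out-of-order command queue so the compute device may
// consume independent jobs in any order.
cl_command_queue makeOutOfOrderQueue(cl_context ctx, cl_device_id dev)
{
    const cl_queue_properties props[] = {
        CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
        0 // property list terminator
    };
    cl_int err = CL_SUCCESS;
    return clCreateCommandQueueWithProperties(ctx, dev, props, &err);
}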
Figure 3.1.1. A visualization of how the queue controls job consumption. The queue is saturated; there are more jobs available for the compute device to consume, as shown by the line in red. The host device is shown in green. Independent jobs are undertaken either in parallel, or in an undetermined serial order.
Behaviour is undefined if the compute device attempts to write or read from cl_mem buffers
being modified by the host device. The reverse is also true. Consequently, the queue must be
utilised to stall both host and compute device until read / write operations are finished. The
number of read and write operations between the host and compute device should be minimised
in order to prevent stalls. As such, as much data as possible should be kept device-side.
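For illustration, a non-blocking read lets the host continue queuing independent work and stall only at the point the result is actually required; the buffer and size names here (outputBuf, bytes, hostResults) are assumptions for the sketch:

cl_event readDone;
clEnqueueReadBuffer(queue, outputBuf, CL_FALSE /* non-blocking */, 0,
                    bytes, hostResults, 0, NULL, &readDone);
/* ...enqueue further independent jobs here... */
clWaitForEvents(1, &readDone); /* stall only once the data is needed */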
Figure 3.1.2. A forced synchronization point. The host is attempting to read the cl_mem holding the network’s
output. A backward gather compute job is available, but cannot be consumed until the host has finished its
read.
OpenCL kernel efficiency considerations
OpenCL kernels are small programs that run on the OpenCL compute device. OpenCL kernels are compiled using an OpenCL device context by the host at program start up. The host can then queue the kernel binary to the compute device as part of a compute task. Similarly, the OpenCL host can queue read or write operations to modify or view the contents of cl_mem buffers held in the compute device's global cache.
Figure 3.1.3. A depiction of the hardware differences exposed by OpenCL. OpenCL devices typically have access to a much larger number of threads. An AMD Fury X GPU has access to 4096 threads.
The specification is designed with massive parallelism in mind. An instance of a submitted kernel
program is launched for each thread in the global work group. The global work group is
subdivided into equally sized local work groups. Each thread has access to a small but very fast private memory, and a slower, but larger, work group local memory. All threads have access to the global memory. Threads may only communicate within their work group.
Task division is primarily achieved using the thread's unique id, which lies in the range 0 <= x < global work group size. Kernel jobs are only marked as complete once all their threads have finished; as such, a kernel is only as fast as its slowest thread.
It is also worth noting that GPUs often implement reduced instruction sets. Consequently some
function calls can have large overheads. For example, the modulo operator is expensive on
AMD GPU hardware.
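A minimal kernel sketch of this pattern, dividing work by global id (illustrative only, not one of the library's kernels):

__kernel void scale(__global float* data, const float factor)
{
    size_t id = get_global_id(0); /* 0 <= id < global work group size */
    data[id] *= factor;           /* one element per thread, no modulo */
}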
Using clFFT
The clFFT library is relatively complex, yet I could only find three example programs. I subsequently created a small program to see if I could successfully transform a real-valued 2D matrix into the complex frequency domain, then back again to the spatial domain. The test was successful. See Appendix B.1. for the code and B.2. for the results.
Implementation Schedule
For the original implementation schedule, refer to Appendix C.1. A modified schedule was created at the end of December 2015 after the initial project proposal was recognized to be too complex for the given time frame; see Appendix C.2. Originally I had hoped to demonstrate basic speech recognition capabilities; however this would require that convolution features be fully implemented. Other commitments meant that I was unsure whether or not convolution layer functionality could be implemented in time. Instead I decided that the implementation would benefit from a greater focus on testing core multi-layer network functionality and performance.
Design specification
Designing a flexible network architecture
Rather than adding computational units directly into the Layer class, it was decided to wrap them within a pool class. This gives the programmer more flexibility when defining a network architecture, as shown by Fig 3.2.1. This was an early design decision, the result of designing a way in which convolution layers and standard unit layers could be integrated in a complementary fashion, rather than forcing the programmer to choose between one or the other. Layers enforce the sequence in which the forward and backward passes visit units. Passes are performed in parallel for pools in the same layer. MatrixPools are pools of standard units with biases. ConvPools are pools of convolutional units arranged into a 2D matrix. ConvPool units share a single bias between them for each incoming convolutional kernel.
Figure 3.2.1. A network architecture example that might be used.
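As a purely hypothetical sketch of how such an architecture might be assembled (the class names match the report, but the constructors and methods are assumed, not the library's confirmed interface):

// Hypothetical usage sketch only; signatures are assumed.
Network net(clContext);

Layer& hidden = net.addLayer();
hidden.addPool(MatrixPool(300));   // standard units with biases
hidden.addPool(ConvPool(28, 28));  // convolutional units in a 2D matrix

Layer& output = net.addLayer();
output.addPool(MatrixPool(10));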
Validation tests
All training outputs are normalised into the range 0.0 to 1.0 such that they are compatible with the logistic units typically used by output layers. Linear rectifiers are not suitable for use in the network output layer. Tests 2 to 4 generate random values a, b, c, d, e in the range 0.0 to 1.0; a sketch of this data generation is given after the list.

1. MNIST handwritten character recognition, 60,000 labelled training images, 10,000 labelled testing images [29]. Network input of 28x28 = 784 LU. Output of 10 SiLU, with the index of the unit with the largest response corresponding to the digit's classification.
2. sin(a), 1000 testing values, 200 training values. Network input of 1 LU. Output of 1 SiLU.
3. sort(a, b, c, d, e), sorting 5 parameters, 1000 testing values, 200 training values. Network input of 5 LU. Output of 5 SiLU.
4. polynomial, 3.0f*a*a + a + 7.0f*b + 1.0f, 1000 testing values, 200 training values. Network input of 2 LU. Output of 1 SiLU.
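A minimal sketch of generating one normalised sample for the polynomial test (names are illustrative; the divisor of 12.0f is the polynomial's maximum on this domain, used here as an assumed normalisation):

#include <random>
#include <utility>
#include <vector>

// Sketch: one normalised (inputs, target) pair for the polynomial test.
std::pair<std::vector<float>, float> polynomialSample(std::mt19937& rng)
{
    std::uniform_real_distribution<float> rnd(0.0f, 1.0f);
    float a = rnd(rng), b = rnd(rng);
    // 3a^2 + a + 7b + 1 peaks at 12 when a = b = 1.
    float target = (3.0f * a * a + a + 7.0f * b + 1.0f) / 12.0f;
    return { { a, b }, target };
}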
Class hierarchy
Figure 3.3.1. A UML diagram showing the basic relationship between network classes. Important field
members are shown. The Network class is intended to provide the primary interface used by the programmer.
Results
Requirement satisfaction
Refer to the essential and optional requirements listed under System Design.
1. a. Full compliance.
1. b. Full compliance.
1. c. Full compliance.
1. d. Full compliance.
2. Full compliance.
3. Full compliance.
4. Full compliance.
5. Full compliance.
6. Full compliance.
Optional Requirements
1. Full compliance, MNIST [29] handwritten digit dataset validation provided.
2. Partial compliance. clFFT tests completed. Interface and class structure for
convolution units and kernels added. No implementations currently present.
3. Full compliance. Linear rectifiers are used as the default activation function for hidden
layers.
4. No compliance. A test was conducted with weight decay, but it was not found to increase network test validation accuracy. Consequently it was decided not to include the weight modification change. Further testing required.
Test validation results

Table 4.1.1. Results from validation runs with varying epoch numbers. The initial learn rate for all tests was 0.001.

| OpenCL device | Validation test type | Training time (seconds) | Epochs | Network structure | Training sample selection | Training passes per epoch | Mean standard error | Classification error |
|---|---|---|---|---|---|---|---|---|
| AMD Fury X (8192 GFLOPS) | MNIST | 10.941 | 5 | Appendix A.1 | random | 2000 | 0.1198 | 0.1819 |
| Intel i7-6700k (114 GFLOPS) | MNIST | 65.4442 | 5 | Appendix A.1 | random | 2000 | 0.1375 | 0.187 |
| AMD Fury X (8192 GFLOPS) | MNIST | 21.6659 | 10 | Appendix A.1 | random | 2000 | 0.1167 | 0.1617 |
| Intel i7-6700k (114 GFLOPS) | MNIST | 135.6973 | 10 | Appendix A.1 | random | 2000 | 0.1118 | 0.1614 |
| AMD Fury X (8192 GFLOPS) | MNIST | 42.4356 | 20 | Appendix A.1 | random | 2000 | 0.1035 | 0.1509 |
| Intel i7-6700k (114 GFLOPS) | MNIST | 262.8941 | 20 | Appendix A.1 | random | 2000 | 0.0873 | 0.1358 |
| AMD Fury X (8192 GFLOPS) | Sin(x) | 9.3542 | 20 | Appendix A.2 | all | 800 | 0.0124 | N/A |
| Intel i7-6700k (114 GFLOPS) | Sin(x) | 19.0043 | 20 | Appendix A.2 | all | 800 | 0.0084 | N/A |
| AMD Fury X (8192 GFLOPS) | Sin(x) | 9.3163 | 20 | Appendix A.2 | all | 800 | 0.0334 | N/A |
| Intel i7-6700k (114 GFLOPS) | Sin(x) | 18.7893 | 20 | Appendix A.2 | all | 800 | 0.0293 | N/A |
| AMD Fury X (8192 GFLOPS) | Sin(x) | 9.2561 | 20 | Appendix A.2 | all | 800 | 0.1025 | N/A |
| Intel i7-6700k (114 GFLOPS) | Sin(x) | 18.0904 | 20 | Appendix A.2 | all | 800 | 0.0128 | N/A |
| AMD Fury X (8192 GFLOPS) | Sort(a, b, c, d, e) | 25.9588 | 20 | Appendix A.3 | all | 800 | 0.2039 | N/A |
| Intel i7-6700k (114 GFLOPS) | Sort(a, b, c, d, e) | 169.0974 | 20 | Appendix A.3 | all | 800 | 0.2178 | N/A |
| AMD Fury X (8192 GFLOPS) | Sort(a, b, c, d, e) | 25.6606 | 20 | Appendix A.3 | all | 800 | 0.191 | N/A |
| Intel i7-6700k (114 GFLOPS) | Sort(a, b, c, d, e) | 173.9321 | 20 | Appendix A.3 | all | 800 | 0.1462 | N/A |
| AMD Fury X (8192 GFLOPS) | Sort(a, b, c, d, e) | 25.6168 | 20 | Appendix A.3 | all | 800 | 0.1807 | N/A |
| Intel i7-6700k (114 GFLOPS) | Sort(a, b, c, d, e) | 169.9004 | 20 | Appendix A.3 | all | 800 | 0.1918 | N/A |
| AMD Fury X (8192 GFLOPS) | Polynomial | 17.789 | 20 | Appendix A.4 | all | 800 | 0.0209 | N/A |
| Intel i7-6700k (114 GFLOPS) | Polynomial | 90.2876 | 20 | Appendix A.4 | all | 800 | 0.0315 | N/A |
| AMD Fury X (8192 GFLOPS) | Polynomial | 17.508 | 20 | Appendix A.4 | all | 800 | 0.0185 | N/A |
| Intel i7-6700k (114 GFLOPS) | Polynomial | 90.7548 | 20 | Appendix A.4 | all | 800 | 0.0234 | N/A |
| AMD Fury X (8192 GFLOPS) | Polynomial | 17.5351 | 20 | Appendix A.4 | all | 800 | 0.0203 | N/A |
| Intel i7-6700k (114 GFLOPS) | Polynomial | 87.968 | 20 | Appendix A.4 | all | 800 | 0.0239 | N/A |
| AMD Fury X (8192 GFLOPS) | MNIST | 3183.586 | 200 | Appendix A.5 | random | 5000 | 0.027 | 0.0454 |
| AMD Fury X (8192 GFLOPS) | MNIST | 164.564 | 10 | Appendix A.5 | random | 5000 | 0.0612 | 0.0956 |
MNIST classification examples
Figure 4.2.1. A randomly sampled 2 misclassified by the neural network as a 0.

Figure 4.2.2. A randomly sampled 2 that is correctly classified.

Figure 4.2.3. A randomly sampled 5 that is correctly classified.
Result Discussion
Taking the mean ratio of i7-6700k run times over Fury X run times from table 4.1.1 gives a mean speed-up of 4.97. This is low considering the Fury X has 8192 GFLOPS compared to the i7-6700k's 114, which would suggest a ratio closer to 72. It is possible that the task queue is not saturated, and that the OpenCL device is idling for a number of cycles, which would suggest the main thread is causing throttling. Alternatively, it is possible that an OpenCL kernel is causing a bottleneck due to poor optimisation. Further investigation is required.
Overall performance is acceptable on the Fury X, but has some way to go before it is comparable with popular public libraries. The 10 epoch Fury X test with a 5000 sample rate completed training in 165 seconds, and had 1,099,770 trainable parameters. A similar network was set up using Theano, via Python and Lasagne, to provide a reference. The Theano network had 945,768 parameters, and achieved a training time of 44 seconds on an i7-6700k over 10 epochs. Final accuracy was relatively similar: my OpenCL implementation achieved a misclassification rate of 10%, while Theano achieved an error of 8%.
Recognition rate was good, taking 15.5 seconds to recognise all 10,000 MNIST test images, giving an image per second rate of 645. Multiplying out by the size of the input, 28x28 = 784, this gives a total rate of 505,680 inputs processed per second. Caffe's OpenCL branch is approximately 90x faster at processing inputs, and significantly faster at training, though it is worth noting that batching is used for the Caffe test results published on GitHub.
The i7-6700k’s training could be quite long on my OpenCL implementation. For example, the 20
epoch MNIST test with a 2000 sample rate took 136 seconds, despite having only 218,842
training parameters.
A longer training session was undertaken using the network described in Appendix A.5, achieving a good final error rate of 4.5%, the same as that achieved by a two layer neural network in a popular publication on document recognition [29][30]. The network also proved accurate over the modelled mathematical functions, sin(x), sort(a, b, c, d, e) and the polynomial function, achieving best respective errors of 8.4%, 15% and 19%.
Evaluation
Further Work
1. Debugging performance issues.
2. Finishing integration of optional requirements.
3. Possibly worth investigating the removal of the majority of queue jobs by calling kernels
from the device. OpenCL 2.0 allows compute devices to make kernel calls. This feature
was not explored, as it adds significant design complexity. clBLAS would have to be
modified to handle custom kernel post / pre callback. clFFT supports this feature.
Conclusion
Considering the complexity of the project, I believe the outcome to be reasonable. A cross-platform deep learning library was developed in C++, and demonstrated to work successfully on a range of tasks. Though performance was not ideal, I am confident the bottlenecks could be identified by isolating the execution times of the called OpenCL kernels.
Deployment guide
Hardware requirements:
OpenCL 2.0 compatible device
x64 Windows environment (tested on Windows 7, 8 and 10)
Software requirements:
AMD APP SDK 3.0 or greater
Building from source requires Visual Studio 2015 or newer
1. Proceed to http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-
parallel-processing-app-sdk/.
2. Download and install AMD APP SDK 3.0 for Windows 64-bit.
3. Unzip Code_Base.zip
Running the binary:
4. Proceed to the “./Backpropagation/Bin” folder
5. Run Backpropagation.exe
Compiling from source:
4. Proceed to the “./Backpropagation/Backpropagation” folder
5. Open Visual Studio 2015
6. Click File -> Open Project/Solution
7. Open Backpropagation.sln
8. Press Ctrl + F5 to compile and run
Bibliography
[1] Sainath, Tara N., Abdel-rahman Mohamed, Brian Kingsbury, and Bhuvana Ramabhadran. "Deep convolutional neural networks for LVCSR." In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 8614-8618. IEEE, 2013.
[2] https://research.facebook.com/blog/fair-open-sources-deep-learning-modules-for-torch/
[3] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche,
Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman,
Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray
Kavukcuoglu, Thore Graepel, and Demis Hassabis. "Mastering the Game of Go with Deep Neural
Networks and Tree Search." Nature 529, no. 7587 (2016): 484.
[4] Linnainmaa, Seppo. "The representation of the cumulative rounding error of an algorithm as a Taylor
expansion of the local rounding errors." Master's Thesis (in Finnish), Univ. Helsinki (1970): 6-7.
[5] Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. Learning internal representations by
error propagation. No. ICS-8506. CALIFORNIA UNIV SAN DIEGO LA JOLLA INST FOR COGNITIVE
SCIENCE, 1985.
[6] Rumelhart, D.E., Hinton, G.E. and Williams, R.J., 1988. Learning representations by back-propagating
errors. Cognitive modeling, 5(3), p.714.
[7] Mathieu, Michael, Mikael Henaff, and Yann LeCun. "Fast training of convolutional networks through
FFTs." arXiv preprint arXiv:1312.5851 (2013).
[8] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), pp.1929-1958.
[9] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.
[10] Martínez-Zarzuela, Mario, Francisco Javier Díaz Pernas, José Fernando Díez Higuera, and Míriam
Antón Rodríguez. "Fuzzy ART neural network parallel computing on the GPU." In Computational and
Ambient Intelligence, pp. 463-470. Springer Berlin Heidelberg, 2007.
[11] (Shader model 5 for DirectX), accessed 21/05/2016, https://www.google.co.uk/search?q=shader+model+5&oq=shader+model+5&aqs=chrome..69i57.3354j0j7&sourceid=chrome&ie=UTF-8
[12] John Kessenich, Dave Baldwin, Randi Rost, “The OpenGL Shader language”,
https://www.opengl.org/registry/doc/GLSLangSpec.4.50.pdf
[13] http://www.nvidia.co.uk/object/cuda-parallel-computing-uk.html, accessed 22/05/2016
[14] https://www.khronos.org/opencl/, accessed 22/05/2016
[15] http://developer.amd.com/tools-and-sdks/opencl-zone/, accessed 22/05/2016
[16] https://software.intel.com/en-us/intel-
opencl?cid=sem43700008896000156&intel_term=intel+openCL&gclid=CjwKEAjwsYW6BRCTzvu5y8DP
hi0SJABnGLlHWfkJo5tNdbBubNlnsqdz_nyHUSfm6SPPlECfXbtAgxoCSvXw_wcB&gclsrc=aw.ds,
accessed 22/05/2016
[17] https://developer.nvidia.com/gpu-accelerated-libraries, accessed 22/05/2016
[18] https://developer.nvidia.com/cuda-gpus, accessed 22/05/2016
[19] https://www.khronos.org/conformance/adopters/conformant-products#opencl, accessed 22/05/2016
[20] https://github.com/amd/OpenCL-caffe/wiki/How-to-set-up-clBLAS-and-OpenCL, accessed 22/05/2016
[21] https://github.com/amd/OpenCL-caffe, accessed 22/05/2016
[22] Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. Imagenet classification with deep convolutional
neural networks. In Advances in neural information processing systems (pp. 1097-1105).
[23] Kulkarni, Sanjeev, and Harman, Gilbert. "Multilayer Networks." In Wiley Series in Probability and
Statistics, 99-115. Hoboken, NJ, USA: John Wiley & Sons, 2011.
[24] Rosenblatt, Frank. "The perceptron: a probabilistic model for information storage and organization in
the brain." Psychological review 65, no. 6 (1958): 386.
[25] Narsky, Ilya, and Porter, Frank C. "Neural Networks." In Statistical Analysis Techniques in Particle
Physics, 251-63. Weinheim, Germany: Wiley‐VCH Verlag GmbH & KGaA, 2013. Chapter 12.
[26] Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Deep sparse rectifier neural networks."
In International Conference on Artificial Intelligence and Statistics, pp. 315-323. 2011.
[27] Simard, P.Y., Steinkraus, D. and Platt, J.C., 2003, August. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of the Seventh International Conference on Document Analysis and Recognition (p. 958). IEEE.
[28] Mathieu, M., Henaff, M. and LeCun, Y., 2013. Fast training of convolutional networks through
FFTs. arXiv preprint arXiv:1312.5851.
[29] http://yann.lecun.com/exdb/mnist/, accessed 22/05/2016
[30] LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86, no. 11 (1998): 2278-2324.
Appendices
A - Network validation architectures
A.1. MNIST
Trainable parameters 218,842
A.2. sin(a)
Trainable parameters 387
A.3. sort(a, b, c, d, e)
Trainable parameters 36,259
A.4. polynomial
Trainable parameters 17,285
A.5. MNIST
Trainable parameters 1,099,770
B – clFFT library experiment
B.1. Fourier transform and inverse fourier transform via clFFT and OpenCL
/* ************************************************************************
* Copyright 2013 Advanced Micro Devices, Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
* ************************************************************************/
/* ************************************************************************
* Copyright Callum McMahon
*
* Added inverse hermitian transform, showing how data can
* be transformed back to the spatial domain.
* Terminal outputs after the inverse should match the original dataset.
* ************************************************************************/
/* No need to explicitly include the OpenCL headers */
#include <clFFT.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int main(void)
{
system("MODE CON COLS=80 LINES=1024");
cl_int err;
cl_platform_id platform = 0;
cl_device_id device = 0;
cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };
cl_context ctx = 0;
cl_command_queue queue = 0;
cl_mem bufX, bufY;
float *X, *Y;
cl_event event = NULL;
int ret = 0;
const size_t N0 = 8, N1 = 8;
char platform_name[128];
char device_name[128];
/* FFT library related declarations */
clfftPlanHandle planHandle;
clfftDim dim = CLFFT_2D;
size_t clLengths[2] = { N0, N1 };
int fac = ((N1 / 2) + 1);//=N1;
//size_t l = N0;
size_t clOutStrides[2] = { 1, fac };
size_t clInStrides[2] = { 1, N0 };
/* Setup OpenCL environment. */
err = clGetPlatformIDs(1, &platform, NULL);
size_t ret_param_size = 0;
err = clGetPlatformInfo(platform, CL_PLATFORM_NAME,
sizeof(platform_name), platform_name,
&ret_param_size);
printf("Platform found: %sn", platform_name);
err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
err = clGetDeviceInfo(device, CL_DEVICE_NAME,
sizeof(device_name), device_name,
&ret_param_size);
printf("Device found on the above platform: %sn", device_name);
props[1] = (cl_context_properties)platform;
ctx = clCreateContext(props, 1, &device, NULL, NULL, &err);
queue = clCreateCommandQueueWithProperties(ctx, device, 0, &err);
/* Setup clFFT. */
clfftSetupData fftSetup;
err = clfftInitSetupData(&fftSetup);
err = clfftSetup(&fftSetup);
/* Allocate host & initialize data. */
/* Only allocation shown for simplicity. */
size_t buffer_size_x = N0 * N1 * sizeof(*X);
size_t buffer_size_y = ((N0+2) * N1) * sizeof(*Y);
X = (float *)malloc(buffer_size_x);
Y = (float *)malloc(buffer_size_y);
/* print input array just using the
* indices to fill the array with data */
printf("nPerforming fft on an two dimensional array of size N0 x N1 : %ld x
%ldn", N0, N1);
int i, j;
i = j = 0;
for (i = 0; i<N0; ++i) {
for (j = 0; j<N1; ++j) {
float x = 0.5f;
float y = 0.5f;
unsigned idx = (j + i*N0);
X[idx] = sin(1.0f*(float)i) + cos(0.4f*(float)j);
printf("n(%f) ", X[idx]);
}
printf("n");
}
/* Prepare OpenCL memory objects and place data inside them. */
bufX = clCreateBuffer(ctx, CL_MEM_READ_WRITE, buffer_size_x, NULL, &err);
//CL_MEM_READ_ONLY
bufY = clCreateBuffer(ctx, CL_MEM_READ_WRITE, buffer_size_y, NULL, &err);
err = clEnqueueWriteBuffer(queue, bufX, CL_TRUE, 0, buffer_size_x, X, 0, NULL,
NULL);
/* Create a default plan for a complex FFT. */
err = clfftCreateDefaultPlan(&planHandle, ctx, dim, clLengths);
/* Set plan parameters. */
err = clfftSetPlanPrecision(planHandle, CLFFT_SINGLE);
err = clfftSetLayout(planHandle, CLFFT_REAL, CLFFT_HERMITIAN_INTERLEAVED);
err = clfftSetResultLocation(planHandle, CLFFT_OUTOFPLACE);
err = clfftSetPlanOutStride(planHandle, dim, clOutStrides);
err = clfftSetPlanInStride(planHandle, dim, clInStrides);
/* Bake the plan. */
err = clfftBakePlan(planHandle, 1, &queue, NULL, NULL);
/* Execute the plan. */
err = clfftEnqueueTransform(planHandle, CLFFT_FORWARD, 1, &queue, 0, NULL, NULL,
&bufX, &bufY, NULL);
/* Wait for calculations to be finished. */
err = clFinish(queue);
/* Fetch results of calculations. */
err = clEnqueueReadBuffer(queue, bufY, CL_TRUE, 0, buffer_size_y, Y, 0, NULL,
NULL);
/* print output array */
printf("nnfft result: n");
i = j = 0;
for (i = 0; i<N0; ++i) {
for (j = 0; j<fac; ++j) {
unsigned idx = 2 * (j + i*fac);
printf("n(%f) ", sqrt(Y[idx] * Y[idx] + Y[idx+1] * Y[idx+1]));
//fiddle with restults to test
//Y[idx] += 0.01f*(float)idx;
Platform agnostic hardware acceleration for deep neural networks P a g e | 30
}
printf("n");
}
printf("n");
//*****************
//reverse!
//*****************
printf("\n\n *** reverse ***\n\n");
//clOutStrides[0] = { 1, fac };
//clInStrides[0] = { 1, N0 };
err = clEnqueueWriteBuffer(queue, bufY, CL_TRUE, 0, buffer_size_y, Y, 0, NULL,
NULL);
/* Create a default plan for a complex FFT. */
err = clfftCreateDefaultPlan(&planHandle, ctx, dim, clLengths);
/* Set plan parameters. */
err = clfftSetPlanPrecision(planHandle, CLFFT_SINGLE);
err = clfftSetLayout(planHandle, CLFFT_HERMITIAN_INTERLEAVED, CLFFT_REAL);
err = clfftSetResultLocation(planHandle, CLFFT_OUTOFPLACE);
err = clfftSetPlanOutStride(planHandle, dim, clInStrides);
err = clfftSetPlanInStride(planHandle, dim, clOutStrides);
/* Bake the plan. */
err = clfftBakePlan(planHandle, 1, &queue, NULL, NULL);
/* Execute the plan. */
err = clfftEnqueueTransform(planHandle, CLFFT_FORWARD, 1, &queue, 0, NULL, NULL,
&bufY, &bufX, NULL);
/* Wait for calculations to be finished. */
err = clFinish(queue);
/* Fetch results of calculations. */
err = clEnqueueReadBuffer(queue, bufX, CL_TRUE, 0, buffer_size_x, X, 0, NULL,
NULL);
i = j = 0;
for (i = 0; i<N0; ++i) {
for (j = 0; j<N1; ++j) {
unsigned idx = (j + i*N0);
printf("\n(%f) ", X[idx]);
}
printf("\n");
}
//*****************
//reverse END
//*****************
/* Release OpenCL memory objects. */
clReleaseMemObject(bufX);
free(X);
clReleaseMemObject(bufY);
free(Y);
/* Release the plan. */
err = clfftDestroyPlan(&planHandle);
/* Release clFFT library. */
clfftTeardown();
/* Release OpenCL working objects. */
clReleaseCommandQueue(queue);
clReleaseContext(ctx);
getchar();
return ret;
}
B.2. Program outputs from B.1. Showing only the first column for succinctness.
Platform found: Intel(R) OpenCL
Device found on the above platform: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Performing fft on an two dimensional array of size N0 x N1 : 8 x 8
(1.000000)
(0.921061)
(0.696707)
(0.362358)
(-0.029200)
(-0.416147)
(-0.737394)
(-0.942222)
fft result:
(11.271166)
(27.725875)
(11.865518)
(8.765699)
(8.040510)
*** reverse ***
(1.000000)
(0.921061)
(0.696707)
(0.362358)
(-0.029200)
(-0.416147)
(-0.737394)
(-0.942222)
C – Gantt time plans
C.1. Original Gantt time plan
C.2. Modified Gantt time plan

Professional Project - C++ OpenCL - Platform agnostic hardware acceleration for deep neural networks

Abstract

This report provides an overview of resources available for deep neural network machine learning. Current state-of-the-art software libraries employ massively vectorised training pipelines, enabling highly parallel computation and hence faster training convergence. Graphics processing units provide access to far greater threading capability than a typical central processing unit. As such, a number of libraries have been developed with alternative fast native GPU code paths. Current implementations are tightly integrated with the CUDA platform, a proprietary programming model restricted to Nvidia GPUs. In response, a basic cross-platform neural network library has been developed in C++, demonstrating the feasibility of a single high-performance, platform-agnostic code path. The library has been built on top of the OpenCL programming framework. OpenCL is maintained by a non-profit consortium, Khronos, with implementations available on a number of devices from different vendors.

Validation tests were performed on multilayer neural networks to assess training performance and final network accuracy. Training consisted of multiple passes using back propagation and an adaptive global learning rate. A network consisting of two hidden linear rectifier layers was trained on the MNIST dataset, a well-known set of labelled greyscale digit images. The best observed error was achieved with a total of 1,099,770 trainable parameters over 200 epochs, attaining a classification error of 4.5%. Each epoch consisted of 5000 stochastic samples and back propagation passes. Total training time was 53 minutes. Fast convergence was also observed using fewer training epochs: with 10 epochs, a classification error rate of 9.6% was observed, taking 164.6 seconds of training on an AMD Fury X. Training on the Fury X was found to be approximately 5x faster than on the i7-6700k. The Fury X boasts approximately 72x the single precision floating point performance of the i7-6700k, suggesting further optimisations can be made.

For demonstration purposes, Windows x64 has been explicitly targeted by this release; porting to another operating system would be trivial. The library has been written against OpenCL version 2.0 in order to take advantage of fine control over job queues. All recent CPUs and GPUs from AMD and Intel are OpenCL 2.0 capable. Currently Nvidia devices only support OpenCL 1.2, but 2.0 support is likely to come in the near future.

Abbreviations

CPU - Central Processing Unit
GPU - Graphics Processing Unit
CUDA - Compute Unified Device Architecture
OpenCL - Open Computing Language
clBLAS - OpenCL Basic Linear Algebra Subprograms
clFFT - OpenCL Fast Fourier Transform
LU - Linear Unit
ReLU - Rectified Linear Unit
SiU - Sigmoid Unit
Introduction

Background

The field of machine learning is currently experiencing renewed interest. Developments in deep neural network architectures and training methods have resulted in greatly improved model learning accuracy for difficult tasks. Refinements to techniques are continually being developed, with error rates as low as 15.2% being reported in difficult tasks such as speech recognition [1]. Companies are investing large sums into neural network research; see, for example, Facebook open-sourcing deep learning modules for Torch [2]. There have been a number of high-profile public successes, such as Alphabet's AlphaGo, the first program ever to beat a professional Go player without a handicap [3].

Figure 1.1. Google trend data showing the popularity of search terms. Note the rapid rise of "deep learning" searches.

Deep neural networks are an evolution of single hidden layer neural networks. Whilst the idea of a distributed computational network was conceived in the late fifties, inspired by biological models, it was not until the invention of back propagation in 1970 [4] that an effective network training method was available. 1985 saw the first proposal of introducing convolution layers [5]. Since then a large number of new methods have been introduced: weight decay [6], fast convolution layers using Fourier transforms [7], dropout [8], and long short term memory networks [9].

Demand for increased computational performance has risen with the increasing complexity of neural networks. It has been demonstrated that GPUs can be used to effectively train neural networks [10]. Neural network optimisation is a massively parallel problem, and as such is well suited to GPU architectures, which give access to a much larger number of threads than a typical CPU. GPU APIs were originally designed around a fixed pipeline for producing visual effects, and traditionally it was very difficult to exploit GPU parallelism for general algorithmic computation. However, graphics API pipelines have become increasingly generic in order to handle more intricate computer graphics methods [11][12]. Hardware vendors have subsequently released more general compute platforms [13][14][15][16] that can run code against GPU hardware, designed for the needs of the scientific computing community. Nvidia CUDA 1.0 was released in 2007, OpenCL 1.0 in 2009. Both OpenCL and CUDA program kernels are based on the C++14 specification.
CUDA is currently the more mature of the two GPU compute platforms, boasting a wider selection of libraries [17]. This has directly translated into more widespread CUDA hardware acceleration for training deep neural networks. In contrast, OpenCL implementations are generally incomplete or non-existent (Figure 2.1.1). However, CUDA is a proprietary platform that will only run on Nvidia's GPU hardware [18]. OpenCL implementations exist across a range of hardware from different vendors, including both CPUs and GPUs [19]. OpenCL therefore has the potential to provide a single unified fast code path for training deep neural networks.

Objectives

- Develop a basic deep learning library that utilises OpenCL for all intensive operations.
- Develop an easy to use interface within C++.
- Maintain compatibility across as many OpenCL platforms as possible.
- Minimise external dependencies to ease setup and increase portability.

Literature Review

Pre-existing software packages

| Software | Primary language interface | Other language interfaces | CUDA GPU support | OpenCL CPU / GPU support |
| --- | --- | --- | --- | --- |
| Caffe | Python | C++, Matlab | Yes | Third party branch from AMD, but only neared feature completion as of late August 2015. |
| Neon | Python | | Yes | No. |
| Theano | Python | | Yes | In development. |
| Tensorflow | Python | C++ (graphs only) | Yes | In development. |
| Torch | Lua | C | Yes | Third party branch in development. |

Figure 2.1.1. An overview of popular deep learning software environments.

None of the popular deep learning libraries provide official OpenCL support. Caffe is the only library with a feature-complete OpenCL branch.

Exploring Caffe's OpenCL branch in more depth

There are a large number of dependencies [20] required for installation. Installations are restricted to Ubuntu 12.04 or later, and only AMD GPUs are currently supported. Building and deploying the full Caffe OpenCL stack was deemed outside the scope of this project. Test performance metrics are available on the GitHub page [21]; see Fig 2.2.1.
| Platform | Speed (images per second) |
| --- | --- |
| AMD W9100 & A10-7850k | 255 |
| AMD R9 Fury & A10-7850k | 261 |
| AMD R290X @1000MHz & A10-7850k | 268 |
| AMD S9150 @900MHz & Xeon E5-2640 | 227 |

Figure 2.2.1. Training performance using the well known AlexNet network [22].

The network inputs used by AlexNet were images of 256x256 resolution. Multiplying the total number of pixels by the number of images processed per second shows that Caffe's OpenCL branch is capable of training on approximately 17,104,896 input values per second on an AMD R9 Fury (261 x 256 x 256).

| Platform | Speed (images per second) |
| --- | --- |
| AMD W9100 & A10-7850k | 590 |
| AMD R9 Fury & A10-7850k | 699 |
| AMD R290X @1000MHz & A10-7850k | 606 |
| AMD S9150 @900MHz & Xeon E5-2640 | 452 |

Figure 2.2.2. Recognition performance using AlexNet [22].

Similarly, approximately 45,809,664 input values per second can be processed during recognition (699 x 256 x 256).
Theoretical groundwork

Multi layer feed forward perceptron

The perceptron network was first proposed in 1958 by Frank Rosenblatt [24]. Perceptrons are connected into a directed graph. The perceptrons at the start of the graph correspond to the network's inputs; perceptrons at the end of the graph, to its outputs. Input values are passed into the input perceptrons. Each subsequent perceptron computes a weighted sum of the outputs from prior connected perceptrons. The summed value is then passed through an activation function, A(x), and passed on to the next set of perceptrons. This process is continued until the network output is reached. Early networks were handcrafted by tweaking connection weight values; modern neural networks employ learning algorithms to automatically update weight values.

$$A(x) = \frac{d}{dx}\max\{x, 0\}$$

Figure 2.3.2. The Heaviside step function, written above as the derivative of the ramp function, was the activation function originally used by Rosenblatt. It has since been replaced by differentiable functions. Differentiable activation functions allow gradient descent to be used to modify connection weights in such a way that the network can be taught to output a set of desired values for a given input.

Figure 2.3.1. A diagram showing how a single perceptron unit processes inputs within a network. This process is called a forward pass.
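As a concrete illustration of the forward pass described above, the following is a minimal C++ sketch of a single perceptron unit. It is not taken from the project code; the sigmoid activation and the presence of a bias term are assumptions.

#include <cmath>
#include <vector>

// Minimal sketch: weighted sum of incoming activations, then an activation function.
float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

float forwardPass(const std::vector<float>& inputs,
                  const std::vector<float>& weights,
                  float bias)
{
    float sum = bias;
    for (std::size_t i = 0; i < inputs.size(); ++i)
        sum += inputs[i] * weights[i];   // p_i * w_ij
    return sigmoid(sum);                 // A(sum) becomes this unit's output p_j
}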
Modern Activation Functions and the Back Propagation algorithm

Back propagation [6] is widely used as a training algorithm for neural networks; it is a class of gradient descent algorithm. It works by first performing a forward pass of the network. See [25] for an overview of the algorithm.

$$p_j = A\!\left(\sum_{i=1}^{\text{incoming weights}} p_i\, w_{ij}\right)$$

where $A()$ is an activation function, $w_{ij}$ a weight between units $i$ and $j$, and $p_x$ is the output of unit $x$. Here $i$ is the index of the unit closest to the input layer.

The activation function must be differentiable so that an error gradient may be calculated. The sigmoid function is commonly used. The linear rectifier activation function has been shown to have better characteristics under some conditions [26]. The linear rectifier avoids the vanishing gradient problem experienced by the sigmoid activation function, where units receiving inputs of large magnitude have activation gradients of 0, or near 0, which in turn reduces the weight update deltas to 0, or near 0.

Sigmoid and its derivative:

$$A(x) = \frac{1}{1+e^{-x}}, \qquad \frac{dA(x)}{dx} = \frac{e^{-x}}{(1+e^{-x})^2} = A(x)\,\big(1 - A(x)\big)$$

Linear rectifier (smooth form) and its derivative:

$$A(x) = \ln(1 + e^{x}), \qquad \frac{dA(x)}{dx} = \frac{1}{1+e^{-x}}$$

An error delta is calculated at each output unit by finding the difference between its output and a desired output value. The error deltas are propagated back through the network to the input layer, storing deltas at each unit. This is referred to as a backwards pass.

Delta error for output units:

$$\delta_j = (p_j - t_j)\,\frac{dA(p_j)}{dx}$$

where $t_j$ denotes the $j$th output unit's target value.

Delta error for inner units:

$$\delta_j = \frac{dA(p_j)}{dx} \sum_{i=1}^{\text{outgoing weights}} \delta_i\, w_{ij}$$

where $w_{ij}$ is the weight from unit $i$ in the previously visited layer to unit $j$ in the current layer; i.e. during the backwards pass $i$ is the index of the unit closest to the output layer.

Finally, each weight is moved by a value proportional to the error delta at the unit it provides inputs for. The direction of change is opposite to the sign of the delta. The deltas are proportional to the rate of change of the network's error with respect to the incoming weights:

$$\Delta w_{ij} = -a\,\delta_j\, p_i, \qquad \delta_j\, p_i = \frac{\partial\,\mathrm{Error}}{\partial w_{ij}}$$

where $a$ is the learning rate, and $i$ is again the index of the unit closest to the input layer.
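To make the backwards pass concrete, here is a minimal C++ sketch of the delta computation and weight update for one fully connected layer, using the sigmoid derivative written in terms of the unit's output. Names and data layout are illustrative, not the project's actual classes.

#include <vector>

// Sketch: one backward step for a fully connected layer.
// 'downstreamDelta' and 'downstreamW' belong to the layer closer to the output;
// downstreamW[k][j] is the weight from unit j of this layer to downstream unit k.
void backpropLayer(const std::vector<float>& prevOut,                  // p_i
                   const std::vector<float>& out,                      // p_j
                   const std::vector<float>& downstreamDelta,
                   const std::vector<std::vector<float>>& downstreamW,
                   std::vector<std::vector<float>>& w,                 // w[j][i], incoming weights
                   std::vector<float>& delta,                          // deltas of this layer (written)
                   float learnRate)
{
    for (std::size_t j = 0; j < out.size(); ++j) {
        // Sum downstream deltas weighted by unit j's outgoing connections.
        float sum = 0.0f;
        for (std::size_t k = 0; k < downstreamDelta.size(); ++k)
            sum += downstreamDelta[k] * downstreamW[k][j];
        // Sigmoid derivative in terms of the output: A'(x) = p_j (1 - p_j).
        delta[j] = out[j] * (1.0f - out[j]) * sum;
        // Move each incoming weight against the gradient: dE/dw_ij = delta_j * p_i.
        for (std::size_t i = 0; i < prevOut.size(); ++i)
            w[j][i] -= learnRate * delta[j] * prevOut[i];
    }
}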
The learning rate, $a$, must be small enough to allow the network to converge, yet large enough to give a reasonable training time. Small $a$ values may also cause the network to get stuck in local error minima.

Weight regularization

Weight regularization is commonly applied in one of two forms: weight decay [6] or dropout [8]. Weight regularization is intended to prevent overfitting, whereby the network learns to exactly reproduce the training outputs rather than learning a generalized pattern. Overfitted networks perform poorly on validation test sets.

Weight decay modification to the weight update rule:

$$\Delta w_{ij} = -a\,\delta_j\,p_i - d \cdot \mathrm{sign}(\delta_j\,p_i)$$

where $d$ is a small decay factor, such that $d \ll a$. Weight decay may however reduce final network performance, as it creates moving global optima. It is preferable to use dropout where possible. The dropout modification is applied to the forward pass during training, giving each unit a small probability of outputting a value of 0:

$$p_j = \begin{cases} 0 & \mathrm{rnd}(0.0, 1.0) < d \\ A\!\left(\sum_{i=1}^{\text{incoming weights}} p_i\,w_{ij}\right) & \mathrm{rnd}(0.0, 1.0) \ge d \end{cases}$$

where $d$ is a small dropout probability such that $0.0 \le d < 1.0$. Dropout attempts to spread learned patterns across the whole network, rather than across isolated groups of units.

Convolution Layers and Fast Convolutions

Convolution layers provide a method of introducing translation resistant weights into the network [27]. Units within a convolution layer share weights in a spatial pattern, allowing the network to quickly generalize for inputs containing translated patterns. Stacked convolution layers can identify extremely complex patterns much more rapidly than a typical multi layer network; convolution networks have seen great success in many applications.

Figure 2.3.3. A diagram showing how the weights are shared across convolutional layer units.
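Returning to the dropout rule above: the following is a minimal sketch of how it folds into the forward pass, reusing the hypothetical forwardPass helper from the earlier sketch. The use of the C++11 <random> facilities is an assumption.

#include <random>
#include <vector>

// Sketch: forward pass with dropout applied during training.
// With probability d the unit is silenced for this pass.
float forwardPassWithDropout(const std::vector<float>& inputs,
                             const std::vector<float>& weights,
                             float bias, float d, std::mt19937& rng)
{
    std::uniform_real_distribution<float> rnd(0.0f, 1.0f);
    if (rnd(rng) < d)
        return 0.0f;                           // unit dropped out this pass
    return forwardPass(inputs, weights, bias); // normal forward pass otherwise
}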
Convolution operations can however be expensive for large kernels, being $O(nk^2)$, where $n$ is the number of units in the convolutional layer and $k$ is the kernel width. It has been recognised that the convolution theorem can be applied to give a greatly reduced computation time of $O(n \log n)$ for the forward pass [28]:

$$F(c * k) = F(c) \odot F(k) \quad\therefore\quad c * k = F^{-1}\big(F(c) \odot F(k)\big)$$

The convolution theorem shows that the Fourier transform of the convolution of two matrices is equal to the elementwise product of their Fourier transforms. Using the fast Fourier transform algorithm, $F(c)$ and $F(k)$ can be computed in $O(n \log n)$, where $n$ is the number of elements in $c$ or $k$ (they must have the same number of elements). Similarly, the back propagation algorithm may also be modified to take advantage of this identity [28].

Delta errors for a convolutional output layer:

$$\boldsymbol{\delta}_j = (\boldsymbol{p} - \boldsymbol{t}) \odot \frac{dA(\boldsymbol{p})}{d\boldsymbol{p}}$$

Note that $\frac{dA(\boldsymbol{p})}{d\boldsymbol{p}}$ denotes the matrix of the output layer multiplied elementwise with the derivatives of the activation function.

Delta errors for a convolutional inner layer:

$$\boldsymbol{\delta}_j = \frac{dA(\boldsymbol{l}_j)}{d\boldsymbol{p}} \odot \big(\boldsymbol{\delta}_i * \boldsymbol{w}^{T}_{ij}\big)$$

where $i$ and $j$ are now indexes between network layers, rather than units; for the backwards pass $i$ is the index of the layer closest to the output layer, and $\boldsymbol{l}_i = \boldsymbol{p}_i$ denotes the matrix of outputs for layer $i$.

Weight updates for a convolutional kernel:

$$\Delta \boldsymbol{w}_{ij} = -a\,(\boldsymbol{\delta}_j * \boldsymbol{l}_i), \qquad \boldsymbol{\delta}_j * \boldsymbol{l}_i = \frac{d\boldsymbol{E}}{d\boldsymbol{w}}$$

For the weight updates, $i$ is the index of the layer closest to the input layer.
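The identity above reduces convolution to three transforms and an elementwise product. A minimal sketch of that pipeline follows; fft2d and ifft2d are hypothetical placeholders for transform routines (in this project they would be clFFT calls, see Appendix B), and only the elementwise step is spelled out.

#include <complex>
#include <vector>

using Spectrum = std::vector<std::complex<float>>;

// Hypothetical transform helpers; not real library functions.
Spectrum fft2d(const std::vector<float>& spatial, int w, int h);
std::vector<float> ifft2d(const Spectrum& freq, int w, int h);

// Sketch: convolve image c with kernel k (both padded to the same w x h)
// via the convolution theorem: c * k = F^-1(F(c) . F(k)).
std::vector<float> fftConvolve(const std::vector<float>& c,
                               const std::vector<float>& k, int w, int h)
{
    Spectrum fc = fft2d(c, w, h);   // O(n log n)
    Spectrum fk = fft2d(k, w, h);   // O(n log n)
    for (std::size_t i = 0; i < fc.size(); ++i)
        fc[i] *= fk[i];             // elementwise product, O(n)
    return ifft2d(fc, w, h);        // O(n log n)
}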
OpenCL learning resources and reference material

Having never worked with OpenCL before, I worked through a number of tutorials and example programs. Listed below are all the resources I used.

| Resource type | Name | Location |
| --- | --- | --- |
| PDF, specification | OpenCL 2.0 specification | https://www.khronos.org/registry/cl/specs/opencl-2.0.pdf |
| Website, reference | clBLAS manual and reference | http://clmathlibraries.github.io/clBLAS/ |
| Website, reference | clFFT manual and reference | http://clmathlibraries.github.io/clFFT/ |
| Book | Heterogeneous Computing with OpenCL 2.0, by David Kaeli, Perhaad Mistry, Dana Schaa and Dong Ping Zhang | http://developer.amd.com/partners/university-programs/heterogeneous-computing-with-opencl/ |
| Website, tutorial | Oak Ridge laboratory, OpenCL vector addition tutorial | https://www.olcf.ornl.gov/tutorials/opencl-vector-addition/ |
| Website, tutorial | AMD, Intro to OpenCL tutorial | http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-resources/introductory-tutorial-to-opencl/ |

Figure 2.4.1. Learning resources.

System Design

Development environment

The OpenCL specification is written against C++, which is consequently the language of choice for this project. Windows was chosen as the development environment due to personal familiarity with the Visual Studio software package. Visual Studio 2015 is used to provide an up to date implementation of the C++11 specification. In keeping with the project objectives, Windows-specific code shall be restricted to the main.cpp file. All other code will be written with the standard library in mind, and as such should compile under g++ and run on Linux.

Familiarisation with OpenCL showed that developing optimised kernels is difficult. Consequently, I decided to employ AMD's clBLAS library where possible. clBLAS provides a set of common basic linear algebra kernels. AMD also provides clFFT for computing fast Fourier transforms; clFFT was added as an additional dependency to assist in implementing fast convolution layers (Fig. 3.2.1).
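As an illustration of how clBLAS can carry the heavy lifting, the sketch below queues a single-precision matrix-vector product, the core of a fully connected forward pass. Buffer creation and error handling are elided, the row-major layout and variable names are illustrative, and this is a sketch of typical clBLAS usage rather than the project's actual call sites.

#include <clBLAS.h>

// Sketch: y = A * x on the compute device, for an M x N row-major weight matrix A.
void denseForward(cl_command_queue queue, cl_mem A, cl_mem x, cl_mem y,
                  size_t M, size_t N)
{
    cl_event done;
    // y := 1.0 * A * x + 0.0 * y, computed entirely device side.
    clblasSgemv(clblasRowMajor, clblasNoTrans, M, N,
                1.0f, A, 0 /*offA*/, N /*lda*/,
                x, 0 /*offx*/, 1 /*incx*/,
                0.0f, y, 0 /*offy*/, 1 /*incy*/,
                1, &queue, 0, nullptr, &done);
    clWaitForEvents(1, &done);
}

// Note: clblasSetup() must be called once at start up, clblasTeardown() at shutdown.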
Essential Requirements

1. A network class capable of:
   a. Constructing multi layer feed forward neural networks. The programmer should be able to easily specify the number of units within each layer.
   b. Training neural networks. Training performance must be reported through cross validation against test data.
   c. Testing neural networks. A method must be implemented that returns information on the network's mean standard error across a batch of test data.
   d. Processing inputs. A method must be implemented that allows the network to accept a single set of inputs from the main program thread, returning the corresponding output from the network.
2. A layer class that provides a logical ordering of network computational units.
3. An implementation of the back propagation training algorithm.
4. An implementation of the sigmoid activation function and its corresponding differential.
5. A sample program capable of demonstrating network training and testing functionality on different OpenCL devices.
6. Unit testing, validating trained network accuracy against a dataset generated from a mathematical function.

Optional Requirements

1. Unit testing, validating trained network accuracy against a well known pre-constructed dataset.
2. Implementation of convolutional layer and convolutional kernel classes. These must provide:
   a. Weight sharing across spatially separated neuron units.
   b. Modification to the back propagation algorithm to handle shared weights.
3. An implementation of the linear rectifier activation function and its corresponding differential.
4. Network regularization, either through weight decay or dropout.

Implementation Deliverables

1. A Visual Studio 2015 C++ solution containing a working example of the developed deep neural network library.
2. Headers and associated .cpp definitions with comments describing how the library works.
3. OpenCL kernel code.
4. clBLAS and clFFT included as dynamic link libraries.

Technical Challenges

Feeding the OpenCL device

OpenCL provides a high latency, high throughput bridge between the host device and the compute device. The host device and compute device share one or more queues: the host produces jobs and inserts them into a queue, and the compute device consumes job items from the queue. By default, OpenCL creates a serial queue, forcing the compute device to compute jobs in order. This is not ideal, as some jobs may take only a fraction of the compute device's resources. Setting CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE when creating the cl_queue enables the device to consume jobs out of order (see the sketch after this section). Each job is associated with an event, which may be in one of four states: queued, submitted, running, and complete. Jobs are also associated with event completion wait lists, allowing for synchronization and dependency blocks. Ideally the work queue will be saturated so that the compute device can work on jobs continually.

Figure 3.1.1. A visualization of how the queue controls job consumption. The queue is saturated; there are more jobs available for the compute device to consume, as shown by the line in red. The host device is shown in green. Independent jobs are undertaken either in parallel, or in an undetermined serial order.

Behaviour is undefined if the compute device attempts to write to or read from cl_mem buffers being modified by the host device; the reverse is also true. Consequently, the queue must be utilised to stall both host and compute device until read / write operations are finished. The number of read and write operations between the host and compute device should be minimised in order to prevent stalls. As such, as much data as possible should be kept device side.

Figure 3.1.2. A forced synchronization point. The host is attempting to read the cl_mem holding the network's output. A backward gather compute job is available, but cannot be consumed until the host has finished its read.
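A minimal host-side sketch of the queue patterns described above: an out-of-order queue, two jobs chained by an event, and a blocking read that stalls the host until the output buffer is safe to touch. Kernel and buffer creation are elided, and the kernel names are illustrative.

#include <CL/cl.h>

// Sketch: out-of-order queue with event-based dependencies (OpenCL 2.0 host API).
void runChainedJobs(cl_context ctx, cl_device_id dev,
                    cl_kernel forward, cl_kernel gather,
                    cl_mem outputBuf, float* hostOut, size_t bytes, size_t globalSize)
{
    cl_int err;
    cl_queue_properties props[] = {
        CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0
    };
    cl_command_queue queue = clCreateCommandQueueWithProperties(ctx, dev, props, &err);

    // The gather job must not start until the forward job completes.
    cl_event forwardDone, gatherDone;
    clEnqueueNDRangeKernel(queue, forward, 1, nullptr, &globalSize, nullptr,
                           0, nullptr, &forwardDone);
    clEnqueueNDRangeKernel(queue, gather, 1, nullptr, &globalSize, nullptr,
                           1, &forwardDone, &gatherDone);

    // Blocking read: stalls the host until the device has finished writing outputBuf.
    clEnqueueReadBuffer(queue, outputBuf, CL_TRUE, 0, bytes, hostOut,
                        1, &gatherDone, nullptr);
    clReleaseCommandQueue(queue);
}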
OpenCL kernel efficiency considerations

OpenCL kernels are small programs that run on the OpenCL compute device. OpenCL kernels are compiled against an OpenCL device context by the host at program start up. The host can then queue the kernel binary to the compute device as part of a compute task. Similarly, the OpenCL host can queue read or write operations to modify or view the contents of cl_mem buffers held in the compute device's global cache.

Figure 3.1.3. A depiction of the hardware differences exposed by OpenCL. OpenCL devices typically have access to a much larger number of threads; an AMD Fury X GPU has access to 4096 threads.
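A minimal sketch of the start-up compile flow just described, together with a trivial kernel that derives its task from its global thread id, the indexing scheme discussed below. The kernel body is illustrative only, not one of the project's actual kernels.

#include <CL/cl.h>

static const char* kSource = R"CLC(
__kernel void relu_forward(__global float* activations)
{
    // Each thread handles exactly one unit, selected by its global id.
    size_t id = get_global_id(0);
    activations[id] = fmax(activations[id], 0.0f);
}
)CLC";

// Sketch: compile once at start up, then the kernel can be queued repeatedly.
cl_kernel buildReluKernel(cl_context ctx, cl_device_id dev)
{
    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSource, nullptr, &err);
    clBuildProgram(prog, 1, &dev, "-cl-std=CL2.0", nullptr, nullptr);
    return clCreateKernel(prog, "relu_forward", &err);
}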
The specification is designed with massive parallelism in mind. An instance of a submitted kernel program is launched for each thread in the global work group. The global work group is subdivided into equally sized local work groups. Each thread has access to a small but very fast private memory cache, and a slower but larger work group memory cache; all threads have access to the global memory cache. Threads may only communicate within their work group. Task division is primarily achieved using the thread's unique id, which lies in the range 0 <= x < global work group size. Kernel jobs are only marked as complete once all their threads have finished; as such, a kernel is only as fast as its slowest thread. It is also worth noting that GPUs often implement reduced instruction sets, so some function calls can have large overheads. For example, the modulo operator is expensive on AMD GPU hardware.

Using clFFT

The clFFT library is relatively complex, yet I could only find three example programs. I subsequently created a small program to see if I could successfully transform a real valued 2D matrix into the complex frequency domain, then back again to the spatial domain. The test was successful; see Appendix B.1 for the code and B.2 for the results.

Implementation Schedule

For the original implementation schedule, refer to Appendix C.1. A modified schedule was created at the end of December 2015 after the initial project proposal was recognized to be too complex for the given time frame; see Appendix C.2. Originally I had hoped to demonstrate basic speech recognition capabilities; however, this would require convolution features to be fully implemented. Other commitments meant that I was unsure whether convolution layer functionality could be implemented in time. Instead I decided that the implementation would benefit from a greater focus on testing core multi layer network functionality and performance.

Design specification

Designing a flexible network architecture

Rather than adding computational units directly into the Layer class, it was decided to wrap them within a pool class. This gives the programmer more flexibility when defining network architecture, as shown by Fig 3.2.1. This was an early design decision, the result of designing a way in which convolution layers and standard unit layers could be integrated in a complementary fashion, rather than forcing the programmer to choose between one or the other. Layers enforce the sequence in which the forward and backward passes visit units. Passes are performed in parallel for pools in the same layer. MatrixPools are pools of standard units with biases. ConvPools are pools of convolutional units arranged into a 2D matrix; ConvPool units share a single bias between them for each incoming convolutional kernel.

Figure 3.2.1. A network architecture example that might be used.
Validation tests

All training outputs are normalised into the range 0.0 to 1.0 so that they are compatible with the logistic units typically used by output layers; linear rectifiers are not suitable for use in the network output layer. Tests 2-4 generate random values a, b, c, d, e in the range 0.0 to 1.0.

1. MNIST handwritten character recognition: 60,000 labelled training images, 10,000 labelled testing images [29]. Network input of 28x28 = 784 LU. Output of 10 SiU, with the index of the unit of largest response corresponding to the digit's classification.
2. sin(a): 1000 testing values, 200 training values. Network input of 1 LU. Output of 1 SiU.
3. sort(a, b, c, d, e), sorting 5 parameters: 1000 testing values, 200 training values. Network input of 5 LU. Output of 5 SiU.
4. polynomial, 3.0f*a*a + a + 7.0f*b + 1.0f: 1000 testing values, 200 training values. Network input of 2 LU. Output of 1 SiU.
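As an illustration of how such a test set can be produced, the following sketch generates one normalised sample for the polynomial test. Dividing by the function's maximum value (12 when a and b lie in [0, 1)) is an assumption, since the report does not spell out the normalisation used.

#include <random>
#include <utility>

// Sketch: one (input, target) pair for the polynomial validation test.
// The target is scaled into [0, 1] for the sigmoid output unit.
std::pair<std::pair<float, float>, float> polynomialSample(std::mt19937& rng)
{
    std::uniform_real_distribution<float> rnd(0.0f, 1.0f);
    float a = rnd(rng), b = rnd(rng);
    float y = 3.0f * a * a + a + 7.0f * b + 1.0f; // lies in [1, 12)
    return {{a, b}, y / 12.0f};                   // assumed normalisation
}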
Class hierarchy

Figure 3.3.1. A UML diagram showing the basic relationship between network classes. Important field members are shown. The Network class is intended to provide the primary interface used by the programmer.
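The UML diagram itself did not survive extraction. As a rough hypothetical reconstruction from the surrounding text (Network as the primary interface, Layers ordering Pools of units, with MatrixPool and ConvPool variants), the relationships might look like the skeleton below; the project's actual members and method names may well differ.

#include <memory>
#include <vector>

// Hypothetical reconstruction of the relationships described in Figure 3.3.1.
class Pool { /* computational units, weights, biases */ };
class MatrixPool : public Pool { /* standard units with individual biases */ };
class ConvPool   : public Pool { /* 2D grid of units sharing kernel weights */ };

class Layer {
    std::vector<std::unique_ptr<Pool>> pools; // pools in one layer run in parallel
};

class Network {
public:
    void train(/* training set, epochs, learning rate */);
    float test(/* test set */);                 // mean standard error
    std::vector<float> process(const std::vector<float>& inputs);
private:
    std::vector<Layer> layers;                  // visited in order by each pass
};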
Results

Requirement satisfaction

Refer to the Essential and Optional Requirements under System Design above.

Essential Requirements
1. a. Full compliance.
1. b. Full compliance.
1. c. Full compliance.
1. d. Full compliance.
2. Full compliance.
3. Full compliance.
4. Full compliance.
5. Full compliance.
6. Full compliance.

Optional Requirements
1. Full compliance. MNIST [29] handwritten digit dataset validation provided.
2. Partial compliance. clFFT tests completed. Interface and class structure for convolution units and kernels added. No implementations currently present.
3. Full compliance. Linear rectifiers are used as the default activation function for hidden layers.
4. No compliance. A test was conducted with weight decay, but it was not found to increase network test validation accuracy. Consequently it was decided not to include the weight modification change. Further testing required.
Test validation results

Table 4.1.1. Results from validation runs with varying epoch numbers. The initial learn rate for all tests was 0.001.

| OpenCL Device | Validation test type | Training time (seconds) | Epochs | Network structure | Training sample selection | Training passes per epoch | Mean standard error | Classification error |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AMD Fury X (8192 GFLOPS) | MNIST | 10.941 | 5 | Appendix A.1 | random | 2000 | 0.1198 | 0.1819 |
| Intel i7-6700k (114 GFLOPS) | MNIST | 65.4442 | 5 | Appendix A.1 | random | 2000 | 0.1375 | 0.187 |
| AMD Fury X (8192 GFLOPS) | MNIST | 21.6659 | 10 | Appendix A.1 | random | 2000 | 0.1167 | 0.1617 |
| Intel i7-6700k (114 GFLOPS) | MNIST | 135.6973 | 10 | Appendix A.1 | random | 2000 | 0.1118 | 0.1614 |
| AMD Fury X (8192 GFLOPS) | MNIST | 42.4356 | 20 | Appendix A.1 | random | 2000 | 0.1035 | 0.1509 |
| Intel i7-6700k (114 GFLOPS) | MNIST | 262.8941 | 20 | Appendix A.1 | random | 2000 | 0.0873 | 0.1358 |
| AMD Fury X (8192 GFLOPS) | Sin(x) | 9.3542 | 20 | Appendix A.2 | all | 800 | 0.0124 | N/A |
| Intel i7-6700k (114 GFLOPS) | Sin(x) | 19.0043 | 20 | Appendix A.2 | all | 800 | 0.0084 | N/A |
| AMD Fury X (8192 GFLOPS) | Sin(x) | 9.3163 | 20 | Appendix A.2 | all | 800 | 0.0334 | N/A |
| Intel i7-6700k (114 GFLOPS) | Sin(x) | 18.7893 | 20 | Appendix A.2 | all | 800 | 0.0293 | N/A |
| AMD Fury X (8192 GFLOPS) | Sin(x) | 9.2561 | 20 | Appendix A.2 | all | 800 | 0.1025 | N/A |
| Intel i7-6700k (114 GFLOPS) | Sin(x) | 18.0904 | 20 | Appendix A.2 | all | 800 | 0.0128 | N/A |
| AMD Fury X (8192 GFLOPS) | Sort(a, b, c, d, e) | 25.9588 | 20 | Appendix A.3 | all | 800 | 0.2039 | N/A |
| Intel i7-6700k (114 GFLOPS) | Sort(a, b, c, d, e) | 169.0974 | 20 | Appendix A.3 | all | 800 | 0.2178 | N/A |
| AMD Fury X (8192 GFLOPS) | Sort(a, b, c, d, e) | 25.6606 | 20 | Appendix A.3 | all | 800 | 0.191 | N/A |
| Intel i7-6700k (114 GFLOPS) | Sort(a, b, c, d, e) | 173.9321 | 20 | Appendix A.3 | all | 800 | 0.1462 | N/A |
| AMD Fury X (8192 GFLOPS) | Sort(a, b, c, d, e) | 25.6168 | 20 | Appendix A.3 | all | 800 | 0.1807 | N/A |
| Intel i7-6700k (114 GFLOPS) | Sort(a, b, c, d, e) | 169.9004 | 20 | Appendix A.3 | all | 800 | 0.1918 | N/A |
| AMD Fury X (8192 GFLOPS) | Polynomial | 17.789 | 20 | Appendix A.4 | all | 800 | 0.0209 | N/A |
| Intel i7-6700k (114 GFLOPS) | Polynomial | 90.2876 | 20 | Appendix A.4 | all | 800 | 0.0315 | N/A |
| AMD Fury X (8192 GFLOPS) | Polynomial | 17.508 | 20 | Appendix A.4 | all | 800 | 0.0185 | N/A |
| Intel i7-6700k (114 GFLOPS) | Polynomial | 90.7548 | 20 | Appendix A.4 | all | 800 | 0.0234 | N/A |
| AMD Fury X (8192 GFLOPS) | Polynomial | 17.5351 | 20 | Appendix A.4 | all | 800 | 0.0203 | N/A |
| Intel i7-6700k (114 GFLOPS) | Polynomial | 87.968 | 20 | Appendix A.4 | all | 800 | 0.0239 | N/A |
| AMD Fury X (8192 GFLOPS) | MNIST | 3183.586 | 200 | Appendix A.5 | random | 5000 | 0.027 | 0.0454 |
| AMD Fury X (8192 GFLOPS) | MNIST | 164.564 | 10 | Appendix A.5 | random | 5000 | 0.0612 | 0.0956 |
MNIST classification examples

Figure 4.2.1. A randomly sampled 2 misclassified by the neural network as a 0.
Figure 4.2.2. A randomly sampled 2 that is correctly classified.
Figure 4.2.3. A randomly sampled 5 that is correctly classified.

Result Discussion

Taking the mean average ratio of i7-6700k run times over Fury X run times from Table 4.1.1 gives a mean ratio of 4.97. This is low considering the Fury X has 8192 GFLOPS compared to the i7-6700k's 114, which would suggest a ratio closer to 72. It is possible that the task queue is not saturated and that the OpenCL device is idling for a number of cycles, which would suggest the main thread is causing throttling. Alternatively, it is possible that an OpenCL kernel is causing a bottleneck due to poor optimisation. Further investigation is required.

Overall performance is acceptable on the Fury X, but has some way to go before it is comparable with popular public libraries. The 10 epoch Fury X test with a 5000 sample rate completed training in 165 seconds, and had 1,099,770 trainable parameters. A similar network was set up in Python using Theano, via Lasagne, to provide a reference. The Theano network had 945,768 parameters, and achieved a training time of 44 seconds on an i7-6700k over 10 epochs. Final accuracy was relatively similar: my OpenCL implementation achieved a misclassification rate of 10%, while Theano achieved an error of 8%.

Recognition rate was good, taking 15.5 seconds to recognise all 10,000 MNIST test images, giving a rate of 645 images per second. Multiplying by the size of the input, 28x28 = 784, this gives a total rate of 505,680 inputs processed per second. Caffe's OpenCL branch is approximately 90x faster at processing inputs, and significantly faster at training, though it is worth noting that batching is used for the Caffe test results published on GitHub.

The i7-6700k's training times could be quite long on my OpenCL implementation. For example, the 20 epoch MNIST test with a 2000 sample rate took 136 seconds, despite having only 218,842 trainable parameters.

A longer training session was undertaken using the network described in Appendix A.5, achieving a good final error rate of 4.5%, the same as that achieved by a two layer neural network in a popular publication on document recognition [29][30]. The network also proved accurate over the modelled mathematical functions sin(a), sort(a, b, c, d, e) and the polynomial function, achieving best respective errors of 8.4%, 15% and 19%.

Evaluation

Further Work

1. Debugging performance issues.
2. Finishing integration of the optional requirements.
3. Possibly investigating the removal of the majority of queue jobs by calling kernels from the device. OpenCL 2.0 allows compute devices to make kernel calls. This feature was not explored, as it adds significant design complexity; clBLAS would have to be modified to handle custom kernel post / pre callbacks. clFFT already supports this feature.

Conclusion

Considering the complexity of the project, I believe the outcome to be reasonable. A cross platform deep learning library was developed in C++ and demonstrated to work successfully on a range of tasks. Though performance was not ideal, I am confident the bottlenecks could be identified by isolating the execution times of the called OpenCL kernels.
Deployment guide

Hardware requirements:
- OpenCL 2.0 compatible device
- x64 Windows environment (tested on Windows 7, 8 and 10)

Software requirements:
- AMD APP SDK 3.00 or greater
- Building from source requires Visual Studio 2015 or newer

1. Proceed to http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/.
2. Download and install AMD APP SDK 3.0 for Windows 64 bit.
3. Unzip Code_Base.zip.

Running the binary:
4. Proceed to the "./Backpropagation/Bin" folder.
5. Run Backpropagation.exe.

Compiling from source:
4. Proceed to the "./Backpropagation/Backpropagation" folder.
5. Open Visual Studio 2015.
6. Click File -> Open Project/Solution.
7. Open Backpropagation.sln.
8. Press Ctrl + F5 to compile and run.
  • 24. Platform agnostic hardware acceleration for deep neural networks P a g e | 23 Bibliography [1] Sainath, Tara N., Abdel-rahman Mohamed, Brian Kingsbury, and Bhuvana Ramabhadran. "Deep convolutional neural networks for LVCSR." InAcoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 8614-8618. IEEE, 2013. [2] https://research.facebook.com/blog/fair-open-sources-deep-learning-modules-for-torch/ [3] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. "Mastering the Game of Go with Deep Neural Networks and Tree Search." Nature 529, no. 7587 (2016): 484. [4] Linnainmaa, Seppo. "The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors." Master's Thesis (in Finnish), Univ. Helsinki (1970): 6-7. [5] Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. Learning internal representations by error propagation. No. ICS-8506. CALIFORNIA UNIV SAN DIEGO LA JOLLA INST FOR COGNITIVE SCIENCE, 1985. [6] Rumelhart, D.E., Hinton, G.E. and Williams, R.J., 1988. Learning representations by back-propagating errors. Cognitive modeling, 5(3), p.714. [7] Mathieu, Michael, Mikael Henaff, and Yann LeCun. "Fast training of convolutional networks through FFTs." arXiv preprint arXiv:1312.5851 (2013). [8] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., 2014. Dropout: A simple way to prevent neural networks from overfitting.The Journal of Machine Learning Research, 15(1), pp.1929-1958. [9] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory."Neural computation 9, no. 8 (1997): 1735-1780. [10] Martínez-Zarzuela, Mario, Francisco Javier Díaz Pernas, José Fernando Díez Higuera, and Míriam Antón Rodríguez. "Fuzzy ART neural network parallel computing on the GPU." In Computational and Ambient Intelligence, pp. 463-470. Springer Berlin Heidelberg, 2007. [11] (Shader model 5 for DirectX), accessed 21/ 05/ 2016, https://www.google.co.uk/search?q=shader+model+5&oq=shader+model+5&aqs=chrome..69i57.3354j0j7 &sourceid=chrome&ie=UTF-8 [12] John Kessenich, Dave Baldwin, Randi Rost, “The OpenGL Shader language”, https://www.opengl.org/registry/doc/GLSLangSpec.4.50.pdf [13] http://www.nvidia.co.uk/object/cuda-parallel-computing-uk.html, accessed 22/05/2016 [14] https://www.khronos.org/opencl/, accessed 22/05/2016 [15] http://developer.amd.com/tools-and-sdks/opencl-zone/, accessed 22/05/2016 [16] https://software.intel.com/en-us/intel- opencl?cid=sem43700008896000156&intel_term=intel+openCL&gclid=CjwKEAjwsYW6BRCTzvu5y8DP hi0SJABnGLlHWfkJo5tNdbBubNlnsqdz_nyHUSfm6SPPlECfXbtAgxoCSvXw_wcB&gclsrc=aw.ds, accessed 22/05/2016
[17] https://developer.nvidia.com/gpu-accelerated-libraries, accessed 22/05/2016
[18] https://developer.nvidia.com/cuda-gpus, accessed 22/05/2016
[19] https://www.khronos.org/conformance/adopters/conformant-products#opencl, accessed 22/05/2016
[20] https://github.com/amd/OpenCL-caffe/wiki/How-to-set-up-clBLAS-and-OpenCL, accessed 22/05/2016
[21] https://github.com/amd/OpenCL-caffe, accessed 22/05/2016
[22] Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
[23] Kulkarni, Sanjeev, and Harman, Gilbert. "Multilayer networks." In Wiley Series in Probability and Statistics, 99-115. Hoboken, NJ, USA: John Wiley & Sons, 2011.
[24] Rosenblatt, Frank. "The perceptron: A probabilistic model for information storage and organization in the brain." Psychological Review 65, no. 6 (1958): 386.
[25] Narsky, Ilya, and Porter, Frank C. "Neural networks." In Statistical Analysis Techniques in Particle Physics, 251-263. Weinheim, Germany: Wiley-VCH Verlag GmbH & Co. KGaA, 2013. Chapter 12.
[26] Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Deep sparse rectifier neural networks." In International Conference on Artificial Intelligence and Statistics, pp. 315-323. 2011.
[27] Simard, P.Y., Steinkraus, D. and Platt, J.C., 2003, August. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003), p. 958. IEEE.
[28] Mathieu, M., Henaff, M. and LeCun, Y., 2013. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851.
[29] http://yann.lecun.com/exdb/mnist/, accessed 22/05/2016
[30] LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86, no. 11 (1998): 2278-2324.
Appendices

A – Network validation architectures

A.1. MNIST
Trainable parameters: 218,842

A.2. sin(a)
Trainable parameters: 387
A.3. sort(a, b, c, d, e)
Trainable parameters: 36,259

A.4. polynomial
Trainable parameters: 17,285

A.5. MNIST
Trainable parameters: 1,099,770
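The parameter counts listed above follow directly from the layer dimensions: a fully connected layer with n inputs and m outputs contributes n × m weights plus m biases. The sketch below illustrates the calculation; the 784-256-10 topology is a placeholder chosen for illustration, not one of the exact architectures above.

/* Illustrative sketch: counting trainable parameters in a fully
 * connected network. The topology below is a placeholder, not one of
 * the architectures listed in this appendix. */
#include <stdio.h>

int main(void)
{
    const int layers[] = { 784, 256, 10 }; /* inputs, hidden, outputs */
    const int n_layers = sizeof(layers) / sizeof(layers[0]);

    long params = 0;
    for (int i = 0; i + 1 < n_layers; ++i) {
        /* n_in * n_out weights plus one bias per output neuron */
        params += (long)layers[i] * layers[i + 1] + layers[i + 1];
    }
    printf("Trainable parameters: %ld\n", params); /* prints 203530 */
    return 0;
}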
B – clFFT library experiment

B.1. Fourier transform and inverse Fourier transform via clFFT and OpenCL

/* ************************************************************************
 * Copyright 2013 Advanced Micro Devices, Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 * ************************************************************************/

/* ************************************************************************
 * Copyright Callum McMahon
 *
 * Added inverse Hermitian transform, showing how data can be
 * transformed back to the spatial domain. Terminal outputs after the
 * inverse should match the original dataset.
 * ************************************************************************/

/* No need to explicitly include the OpenCL headers */
#include <clFFT.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void)
{
    system("MODE CON COLS=80 LINES=1024");

    cl_int err;
    cl_platform_id platform = 0;
    cl_device_id device = 0;
    cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };
    cl_context ctx = 0;
    cl_command_queue queue = 0;
    cl_mem bufX, bufY;
    float *X, *Y;
    int ret = 0;
    const size_t N0 = 8, N1 = 8;
    char platform_name[128];
    char device_name[128];

    /* FFT library related declarations */
    clfftPlanHandle planHandle;
    clfftDim dim = CLFFT_2D;
    size_t clLengths[2] = { N0, N1 };
    int fac = (N1 / 2) + 1; /* Hermitian-packed length; valid here because N0 == N1 */
    size_t clOutStrides[2] = { 1, fac };
    size_t clInStrides[2] = { 1, N0 };

    /* Setup OpenCL environment. */
    err = clGetPlatformIDs(1, &platform, NULL);
    size_t ret_param_size = 0;
    err = clGetPlatformInfo(platform, CL_PLATFORM_NAME,
        sizeof(platform_name), platform_name, &ret_param_size);
    printf("Platform found: %s\n", platform_name);

    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
    err = clGetDeviceInfo(device, CL_DEVICE_NAME,
        sizeof(device_name), device_name, &ret_param_size);
    printf("Device found on the above platform: %s\n", device_name);

    props[1] = (cl_context_properties)platform;
    ctx = clCreateContext(props, 1, &device, NULL, NULL, &err);
    queue = clCreateCommandQueueWithProperties(ctx, device, 0, &err);

    /* Setup clFFT. */
    clfftSetupData fftSetup;
    err = clfftInitSetupData(&fftSetup);
    err = clfftSetup(&fftSetup);

    /* Allocate host memory & initialize data. */
    size_t buffer_size_x = N0 * N1 * sizeof(*X);
    size_t buffer_size_y = (N0 + 2) * N1 * sizeof(*Y); /* (N0/2 + 1) * N1 complex values in Hermitian layout */
    X = (float *)malloc(buffer_size_x);
    Y = (float *)malloc(buffer_size_y);

    /* Fill the input array with sample data and print it. */
    printf("\nPerforming fft on a two dimensional array of size N0 x N1 : %zu x %zu\n", N0, N1);
    int i, j;
    for (i = 0; i < N0; ++i) {
        for (j = 0; j < N1; ++j) {
            unsigned idx = (j + i * N0);
            X[idx] = sin(1.0f * (float)i) + cos(0.4f * (float)j);
            printf("\n(%f) ", X[idx]);
        }
        printf("\n");
    }

    /* Prepare OpenCL memory objects and place data inside them. */
    bufX = clCreateBuffer(ctx, CL_MEM_READ_WRITE, buffer_size_x, NULL, &err);
    bufY = clCreateBuffer(ctx, CL_MEM_READ_WRITE, buffer_size_y, NULL, &err);
    err = clEnqueueWriteBuffer(queue, bufX, CL_TRUE, 0, buffer_size_x, X, 0, NULL, NULL);

    /* Create a default plan for a real-to-Hermitian FFT. */
    err = clfftCreateDefaultPlan(&planHandle, ctx, dim, clLengths);

    /* Set plan parameters. */
    err = clfftSetPlanPrecision(planHandle, CLFFT_SINGLE);
    err = clfftSetLayout(planHandle, CLFFT_REAL, CLFFT_HERMITIAN_INTERLEAVED);
    err = clfftSetResultLocation(planHandle, CLFFT_OUTOFPLACE);
    err = clfftSetPlanOutStride(planHandle, dim, clOutStrides);
    err = clfftSetPlanInStride(planHandle, dim, clInStrides);

    /* Bake the plan. */
    err = clfftBakePlan(planHandle, 1, &queue, NULL, NULL);

    /* Execute the plan. */
    err = clfftEnqueueTransform(planHandle, CLFFT_FORWARD, 1, &queue,
        0, NULL, NULL, &bufX, &bufY, NULL);

    /* Wait for calculations to be finished. */
    err = clFinish(queue);

    /* Fetch results of calculations. */
    err = clEnqueueReadBuffer(queue, bufY, CL_TRUE, 0, buffer_size_y, Y, 0, NULL, NULL);

    /* Print the magnitude of each packed complex output value. */
    printf("\n\nfft result: \n");
    for (i = 0; i < N0; ++i) {
        for (j = 0; j < fac; ++j) {
            unsigned idx = 2 * (j + i * fac);
            printf("\n(%f) ", sqrt(Y[idx] * Y[idx] + Y[idx + 1] * Y[idx + 1]));
        }
        printf("\n");
    }
    printf("\n");

    /*****************
     * Reverse!
     *****************/
    printf("\n\n *** reverse ***\n\n");

    err = clEnqueueWriteBuffer(queue, bufY, CL_TRUE, 0, buffer_size_y, Y, 0, NULL, NULL);

    /* Create a default plan for the Hermitian-to-real inverse transform.
     * Note: the forward plan handle is simply overwritten here; strictly
     * it should be released with clfftDestroyPlan first. */
    err = clfftCreateDefaultPlan(&planHandle, ctx, dim, clLengths);

    /* Set plan parameters, with the layouts and strides swapped relative
     * to the forward transform. */
    err = clfftSetPlanPrecision(planHandle, CLFFT_SINGLE);
    err = clfftSetLayout(planHandle, CLFFT_HERMITIAN_INTERLEAVED, CLFFT_REAL);
    err = clfftSetResultLocation(planHandle, CLFFT_OUTOFPLACE);
    err = clfftSetPlanOutStride(planHandle, dim, clInStrides);
    err = clfftSetPlanInStride(planHandle, dim, clOutStrides);

    /* Bake the plan. */
    err = clfftBakePlan(planHandle, 1, &queue, NULL, NULL);

    /* Execute the plan. For real <-> Hermitian layouts clFFT derives the
     * transform direction from the layout, so this performs the inverse. */
    err = clfftEnqueueTransform(planHandle, CLFFT_FORWARD, 1, &queue,
        0, NULL, NULL, &bufY, &bufX, NULL);

    /* Wait for calculations to be finished. */
    err = clFinish(queue);

    /* Fetch results; the terminal output should match the original dataset. */
    err = clEnqueueReadBuffer(queue, bufX, CL_TRUE, 0, buffer_size_x, X, 0, NULL, NULL);

    for (i = 0; i < N0; ++i) {
        for (j = 0; j < N1; ++j) {
            unsigned idx = (j + i * N0);
            printf("\n(%f) ", X[idx]);
        }
        printf("\n");
    }

    /*****************
     * Reverse END
     *****************/

    /* Release OpenCL memory objects. */
    clReleaseMemObject(bufX);
    free(X);
    clReleaseMemObject(bufY);
    free(Y);

    /* Release the plan. */
    err = clfftDestroyPlan(&planHandle);

    /* Release clFFT library. */
    clfftTeardown();

    /* Release OpenCL working objects. */
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);

    getchar();
    return ret;
}

B.2. Program outputs from B.1, showing only the first column for succinctness.

Platform found: Intel(R) OpenCL
Device found on the above platform: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz

Performing fft on a two dimensional array of size N0 x N1 : 8 x 8

(1.000000)
(0.921061)
(0.696707)
(0.362358)
(-0.029200)
(-0.416147)
(-0.737394)
(-0.942222)

fft result:

(11.271166)
(27.725875)
(11.865518)
(8.765699)
(8.040510)

*** reverse ***

(1.000000)
(0.921061)
(0.696707)
(0.362358)
(-0.029200)
(-0.416147)
(-0.737394)
(-0.942222)
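One detail of B.1 worth highlighting is the size of the frequency-domain buffer. In clFFT's CLFFT_HERMITIAN_INTERLEAVED layout the first dimension is packed to N0/2 + 1 complex values, because the remaining spectrum of a real input is redundant; this is why the listing allocates (N0 + 2) * N1 floats for the output. The following is a small self-contained sketch of the sizing arithmetic, with variable names mirroring the listing:

/* Sketch of the buffer sizing used in B.1: a real-to-Hermitian 2D
 * transform of N0 x N1 real values stores (N0/2 + 1) x N1 interleaved
 * complex values, i.e. (N0 + 2) * N1 floats. */
#include <stdio.h>

int main(void)
{
    const size_t N0 = 8, N1 = 8;
    const size_t fac = N0 / 2 + 1;           /* packed length of the first dimension */
    const size_t real_floats = N0 * N1;      /* spatial-domain buffer */
    const size_t herm_floats = fac * N1 * 2; /* interleaved re/im pairs */

    printf("input : %zu floats\n", real_floats);               /* 64 */
    printf("output: %zu floats (= (N0+2)*N1)\n", herm_floats); /* 80 */
    return 0;
}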
C – Gantt time plans

C.1. Original Gantt time plan
C.2. Modified Gantt time plan