This document discusses GPU computing for machine learning. Machine learning algorithms are computationally expensive, and their requirements grow with the volume of data being processed. For inherently parallel problems such as those found in machine learning, GPUs provide significant performance gains over CPUs, and many machine learning algorithms have been implemented on GPUs with speedups of one to two orders of magnitude. However, most GPU implementations are closed-source; open-source implementations offer advantages such as reproducibility and fair comparison of algorithms.
1. deep learning
Algorithms and Applications
Bernardete Ribeiro, bribeiro@dei.uc.pt
University of Coimbra, Portugal
INIT/AERFAI Summer School on Machine Learning, Benicassim 22-26 June 2015
4. outline
∙ Motivation
∙ Graphics Processing Units (GPUs) Computing
∙ Machine Learning (ML) GPU algorithms
∙ Advantages of Open-Source in the ML field
∙ Open-source GPU ML library (GPUMLib)
∙ Overview of GPUMLib algorithms
∙ Conclusions
5. motivation
∙ The volume of data is increasing at an exponential rate
[Diagram: diverse data sources, including low-cost sensors, high-bandwidth networks, robotic systems, high-capacity storage devices, remote sensing, and commodity computing]
7. big data
∙ Nowadays, there are projects that can generate several petabytes of data per day [Hey et al., 2009]:
∙ Australian Square Kilometre Array of radio telescopes
∙ CERN's Large Hadron Collider
∙ Pan-STARRS array of celestial telescopes
10. data science
∙ Data is an asset, from which useful and valuable information can be extracted.
∙ Science is gradually moving toward being computational and data centric.
∙ To obtain information represents only a fraction of the time and effort needed to analyze it.
11. challenges
[Diagram: real data and artificial data from computer simulation models accumulate in persistent repositories; the challenge is to extract useful and relevant information from volumes of data that vastly exceed our capacity to analyze them]
12. potential solution
[Diagram: the same pipeline, with machine learning algorithms bridging the large volumes of accumulated data and the extraction of useful and relevant information]
16. computational resources
∙ Machine Learning (ML) algorithms are computationally expensive.
∙ Their computational requirements are usually proportional to the amount of data being processed.
∙ ML algorithms often demand prohibitive computational resources.
20. advanced computing
∙ Problems are becoming increasingly challenging and demanding (in some cases intractable by traditional CPU architectures).
∙ Toolkits supporting ML software development fail to meet the expectations in terms of computational performance.
∙ The scientific breakthroughs of the future will undoubtedly be powered by advanced computing capabilities that will allow researchers to manipulate and explore massive datasets [Hey et al., 2009].
∙ Pressure to shift development toward high-throughput parallel architectures (crucial for real-world applications).
22. graphical processing units (gpus)
∙ Highly parallel and programmable devices that can be used for general-purpose computing applications [Owens et al., 2008].
[Chart: peak floating-point performance (GFLOPS), 2001–2009, for AMD and NVIDIA GPUs versus Intel CPUs (dual-core and quad-core).]
23. gpus strengths
∙ Provide remarkable performance gains (compared to CPUs).
∙ Relatively inexpensive (serve the large gaming industry).
∙ Availability.
∙ Scalability.
[Chart: peak floating-point performance (GFLOPS), 2001–2009, for AMD and NVIDIA GPUs versus Intel CPUs (dual-core and quad-core).]
24. gpu vs cpu performance
Disparity between the GPU and CPU peak floating-point performance:
∙ GPU performance doubles every 12 months, while CPU performance doubles every 18 months [Zhongwen et al., 2005].
[Chart: peak floating-point performance (GFLOPS), 2001–2009, for AMD and NVIDIA GPUs versus Intel CPUs (dual-core and quad-core).]
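A back-of-the-envelope consequence of these growth rates (an illustrative calculation, not a figure from the talk): if GPU performance doubles every 12 months while CPU performance doubles every 18 months, then after t years the ratio of peak performances grows as 2^t / 2^(2t/3) = 2^(t/3). In other words, under these assumptions the GPU-to-CPU performance gap itself doubles roughly every three years.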
25. nvidia gpu architecture
[Diagram: an array of streaming multiprocessors, each with SIMT control and its own shared memory, fed by a thread scheduler and a host interface; a memory interface connects the chip to off-chip DRAM.]
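This organization maps directly onto the programming model. Below is a minimal, hypothetical CUDA sketch for illustration (not taken from GPUMLib): each thread block is scheduled onto one streaming multiprocessor and can stage data in that multiprocessor's fast on-chip shared memory; here a block reduces its slice of a vector to a single partial sum.

#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256

// Each thread block runs on one streaming multiprocessor (SM); its
// threads stage data in the SM's on-chip shared memory and reduce
// it to a single partial sum.
__global__ void blockSum(const float *x, float *partial, int n) {
    __shared__ float buf[BLOCK];                     // fast on-chip shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? x[i] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = buf[0];
}

int main() {
    const int n = 1 << 20, blocks = (n + BLOCK - 1) / BLOCK;
    float *x, *partial;
    cudaMallocManaged(&x, n * sizeof(float));        // memory visible to CPU and GPU
    cudaMallocManaged(&partial, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;
    blockSum<<<blocks, BLOCK>>>(x, partial, n);
    cudaDeviceSynchronize();                         // wait for the GPU
    float sum = 0.0f;
    for (int b = 0; b < blocks; ++b) sum += partial[b];
    printf("sum = %.0f (expected %d)\n", sum, n);
    cudaFree(x); cudaFree(partial);
    return 0;
}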
29. speedups
∙ GPUs are responsible for dramatic speedups in a wide range of areas for many problems.
∙ It is not uncommon to obtain speedups of one or two orders of magnitude:
∙ Tasks that would take years on the CPU can now be completed in days.
∙ Weeks of processing can be transformed into hours [Lopes and Ribeiro, 2009].
∙ Computations that would otherwise take hours can now be completed in a few seconds.
34. machine learning tools
∙ Caffe: Framework for convolutional neural network algorithms
∙ cuda-convnet: High-performance C++/CUDA implementation of convolutional neural networks
∙ Theano: Python library to define, optimize, and evaluate mathematical expressions
∙ Torch7: Scientific computing framework for machine learning algorithms
∙ cuBLAS: GPU-accelerated version of the complete standard BLAS library
∙ MATLAB: Easy-to-use HPC language integrating computation, visualization, and programming
∙ GPUMLib: GPU Machine Learning Library
35. companies using gpus for machine learning
http://www.nvidia.com/object/machine-learning.html
38. ml algorithms in gpu platform
∙ Large computational requirements.
∙ Algorithms should present a high degree of parallelism.
∙ Favor data throughput at the expense of the latency of individual operations.
39. gpu ml implementations
[Timeline figure, 2004–2012, split into closed-source and open-source implementations:]
∙ Multilayer Perceptrons, forward phase (Oh and Jung)
∙ Self-Organizing Maps (Campbell et al.; Luo et al.)
∙ Genetic Algorithms (Wong et al.; Yu et al.)
∙ Back-Propagation, two layers (Steinkrau et al.)
∙ Convolutional Neural Networks (Chellapilla et al.)
∙ Spiking Neural Networks (Bernhard and Keriven)
∙ Belief Propagation (Brunton et al.; Yang et al.)
∙ Fuzzy ART neural networks (Martínez-Zarzuela et al.)
∙ K-Means Clustering (Shalom et al.)
∙ Recurrent networks (Trebatický and Pospíchal)
∙ Decision Trees and Forests (Sharp)
∙ Neural-network-based text detection (Jang et al.)
∙ Linear Radial Basis Functions (Brandstetter and Artusi)
∙ Deep Belief Networks and Sparse Coding (Raina et al.)
∙ Back-Propagation, three layers (Guzhva et al.)
∙ Support Vector Machines (Catanzaro et al.)
∙ Genetic Algorithms (Langdon and Banzhaf)
∙ K-Nearest Neighbor (Garcia et al.)
∙ Spiking Neural Networks (Nageswaran et al.)
∙ Multiple Back-Propagation and Back-Propagation (Lopes and Ribeiro)
∙ Non-negative Matrix Factorization (Lopes and Ribeiro)
41. gpu implementations
∙ The number of GPU implementations of ML algorithms has increased substantially over the last few years.
∙ However, most of the implementations are not openly shared.
48. open source advantages
∙ Better reproducibility of experimental results;
∙ Fair comparison of algorithms;
∙ Quicker detection of errors;
∙ Quicker adoption of algorithms;
∙ Innovative applications and easier combination of advances;
∙ Faster adoption of ML methods in other disciplines and in industry;
∙ Cooperation among researchers [Sonnenburg et al., 2007].
55. cuda
∙ Represented a major step toward the simplification of the GPU programming model:
∙ Support for accessible programming interfaces and industry-standard languages, such as C and C++.
∙ Released by NVIDIA at the end of 2006; since then, numerous GPU implementations, spanning a wide range of applications, have been developed using this technology.
∙ While there are alternative options, such as OpenCL, Microsoft DirectCompute, and AMD Stream, so far CUDA is the only technology that has achieved wide adoption and usage [Stamatopoulos et al., 2012].
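As a concrete taste of this programming model, below is a minimal, generic CUDA example (an illustrative sketch; not code from this talk or from GPUMLib). A kernel is an almost ordinary C/C++ function marked __global__; it is executed by thousands of lightweight threads, each computing its own element index from built-in block and thread coordinates.

#include <cstdio>
#include <cuda_runtime.h>

// CUDA kernel: each thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));  // memory visible to CPU and GPU
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = i; b[i] = 2 * i; }
    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);  // 4 blocks of 256 threads
    cudaDeviceSynchronize();                       // wait for the kernel to finish
    printf("c[10] = %.0f (expected 30)\n", c[10]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}

Compiled with nvcc, the same C++ source targets both host and device, which is precisely the accessibility CUDA brought to GPU programming.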
62. neural selective input model (nsim)
[Diagram, physical model: a selective input neuron gates each input x_j^p through a multiplier with a selector r_j^p, producing the effective input x̃_j^p = r_j^p · x_j^p; the gated inputs x_1^p, x_2^p, x_3^p, ... feed a network with weights w_jk and bias θ_k, yielding outputs y_1^p and y_2^p.]
Conceptual models:
∙ Model 1, when x_3^p is missing: r_3^p = 0 (the third input is effectively disconnected)
∙ Model 2, when the value of x_3^p is known: r_3^p = 1 (the input passes through unchanged)
63. resource allocating network with long term memory
[Diagram: a network with an input layer (x1 to x4), a hidden layer, and an output layer (z1, z2), coupled to a long-term memory that stores generated memory items and retrieves them during learning (Generate & Store / Retrieve & Learn).]
80. non-negative matrix factorization (nmf)
V ≈ WH, where:
∙ V (D × N) holds the N samples, each with D original features
∙ W (D × r) holds the r basis vectors
∙ H (r × N) holds the N samples re-expressed with r new features
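For reference, factorizations of this form are commonly computed with the multiplicative update rules of Lee and Seung; the slides do not spell out which variant GPUMLib uses here, so take these as an illustrative formulation:

H ← H ⊙ (WᵀV) ⊘ (WᵀWH)
W ← W ⊙ (VHᵀ) ⊘ (WHHᵀ)

where ⊙ and ⊘ denote element-wise multiplication and division. Both updates keep W and H non-negative and do not increase the reconstruction error ‖V − WH‖²_F, so the algorithm simply iterates them many times (10,000 iterations in the experiment that follows). Each iteration is dominated by dense matrix products, which is exactly the workload GPUs excel at.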
81. yale and orl image datasets
∙ Yale
∙ Vtrain is composed of 4096 rows (64 × 64 pixels) and 150 columns (face images)
∙ Vtest is composed of 4096 rows and 15 columns
∙ AT&T (ORL)
∙ Vtrain is composed of 10304 rows (112 × 92 pixels) and 360 columns (face images)
∙ Vtest is composed of 10304 rows and 40 columns
82. time to perform 10,000 nmf iterations on the yale database
[Log-scale plot of time (from 10 s up to 3 h 46 m 40 s) against r = 20 to 120, for Vtrain and Vtest on the CPU and on the GPU. GPU speedups grow with r: from 55.6× up to 251.7× for Vtrain and from 6.6× up to 74.1× for Vtest.]
100. conclusions
∙ Parallel implementations of ML algorithms are crucial for the development of real-world ML applications.
∙ The GPU is particularly well positioned to fulfil this need, given its availability, high performance, and relatively low cost.
∙ Experimental results with GPUMLib algorithms show the potential and usefulness of this library.
∙ Problems involving larger datasets benefit the most from this architecture.
∙ To promote cooperation among researchers and benefit the field, open-source GPU ML algorithms are fundamental.
101. references
Hey, T., Tansley, S., and Tolle, K., editors (2009). The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research.
Lopes, N. and Ribeiro, B. (2009). Fast pattern classification of ventricular arrhythmias using graphics processing units. In Proceedings of the 14th Iberoamerican Conference on Pattern Recognition (CIARP 2009), LNCS 5856, pages 603–610. Springer.
Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., and Phillips, J. C. (2008). GPU computing. Proceedings of the IEEE, 96(5):879–899.
Sonnenburg, S., Braun, M. L., Ong, C. S., Bengio, S., Bottou, L., Holmes, G., LeCun, Y., Müller, K.-R., Pereira, F., Rasmussen, C. E., Rätsch, G., Schölkopf, B., Smola, A., Vincent, P., Weston, J., and Williamson, R. C. (2007). The need for open source software in machine learning. Journal of Machine Learning Research, 8:2443–2466.
Stamatopoulos, C., Chuang, T. Y., Fraser, C. S., and Lu, Y. Y. (2012). Fully automated image orientation in the absence of targets. In International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences (XXII ISPRS Congress), volume XXXIX-B5, pages 303–308.
Zhongwen, L., Hongzhi, L., Zhengping, Y., and Xincai, W. (2005). Self-organizing maps computing on graphic process unit. In Proceedings of the 13th European Symposium on Artificial Neural Networks, pages 557–562.