4. Convolutional Neural Net (CNN)
● A special kind of neural network optimized for 2D pattern recognition.
● 2D convolution of an input matrix f with a kernel matrix g, producing output o, is defined as o[m, n] = Σ_i Σ_j f[m + i, n + j] ⋅ g[i, j], where the sum runs over the kernel indices i, j (a sequential reference sketch follows below).
● Subsampling layer contains multiple kernels.
● Each kernel is shifted over valid but disjoint regions of the input
image
● Convolution of the image block and the kernel is computed at
each location.
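A minimal sequential sketch of this valid, strided convolution, assuming a row-major, single-channel layout; the function and parameter names are illustrative, not from the slides. With S = 1 it implements the definition above; with S equal to the kernel size it visits the disjoint regions used in subsampling.

    // Valid 2D convolution with stride S: the F x F kernel g slides over the
    // W x H input f, and each output element is the dot product of g with the
    // image block at that location: o[y][x] = sum_{i,j} f[y*S+j][x*S+i] * g[j][i].
    void conv2d_valid(const float* f, int W, int H,
                      const float* g, int F, int S,
                      float* o)   // output is ((W-F)/S + 1) x ((H-F)/S + 1)
    {
        int Wo = (W - F) / S + 1;
        int Ho = (H - F) / S + 1;
        for (int y = 0; y < Ho; ++y)
            for (int x = 0; x < Wo; ++x) {
                float acc = 0.0f;
                for (int j = 0; j < F; ++j)
                    for (int i = 0; i < F; ++i)
                        acc += f[(y * S + j) * W + (x * S + i)] * g[j * F + i];
                o[y * Wo + x] = acc;
            }
    }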
5. Convolutional Neural Net (CNN)
● Not fully connected; follows a biological neural connection model for feature extraction via a deep network.
● Convolution + subsampling via pooling (max, avg, stochastic) + fully connected layers (a pooling sketch follows below)
● Backpropagation with cross-entropy loss optimization
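A minimal sketch of the pooling-based subsampling step, assuming max pooling over non-overlapping P x P windows of a single feature map; names and layout are illustrative. Average or stochastic pooling would replace the max with a mean or a weighted random draw.

    #include <float.h>

    // Max pooling over non-overlapping P x P windows of a W x H feature map,
    // producing a (W/P) x (H/P) output. Row-major layout is assumed.
    void maxpool2d(const float* in, int W, int H, int P, float* out)
    {
        int Wo = W / P, Ho = H / P;
        for (int y = 0; y < Ho; ++y)
            for (int x = 0; x < Wo; ++x) {
                float m = -FLT_MAX;
                for (int j = 0; j < P; ++j)
                    for (int i = 0; i < P; ++i) {
                        float v = in[(y * P + j) * W + (x * P + i)];
                        if (v > m) m = v;
                    }
                out[y * Wo + x] = m;
            }
    }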
7. Convolutional Layer Summary
● Accepts a volume of size W1×H1×D1
● Requires four hyperparameters:
○ Number of filters K,
○ their spatial extent F,
○ the stride S,
○ the amount of zero padding P.
● Produces a volume of size W2×H2×D2 where:
○ W2=(W1−F+2P)/S+1
○ H2=(H1−F+2P)/S+1 (i.e. width and height are computed symmetrically)
○ D2=K (a worked example follows below)
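As a worked example with illustrative numbers (not from the slides): a 32×32×1 input (W1 = H1 = 32, D1 = 1) with K = 6 filters of extent F = 5, stride S = 1 and no padding (P = 0) gives W2 = H2 = (32 − 5 + 2⋅0)/1 + 1 = 28 and D2 = 6, i.e. a 28×28×6 output volume.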
8. Conv Layers are the Bottleneck
● With parameter sharing, a convolutional layer introduces F⋅F⋅D1 weights per filter, for a total of (F⋅F⋅D1)⋅K weights and K biases (a worked example follows below).
● In the output volume, the d-th depth slice (of size W2×H2) is the result of
performing a valid convolution of the d-th filter over the input volume with a
stride of S, and then offset by d-th bias.
● Calculating convolutions takes more than 95% of the total training time
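Worked example with the same illustrative numbers: each 5×5 filter over a D1 = 1 input has F⋅F⋅D1 = 25 shared weights, so K = 6 filters contribute (5⋅5⋅1)⋅6 = 150 weights plus 6 biases. Every one of those weights is nonetheless applied at all 28×28 output locations, which is why the convolutions, not the parameters, dominate training time.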
9. Conv Layers are the Bottleneck
● C1
○ ~4704 iterations
○ 38% time
● S2
○ ~1176 iterations
○ 3% time only
● C3
○ ~1600 iterations
○ 43% time
● S4
○ ~400 iterations
○ 1% time only
● C5
○ ~120 iterations
○ 12% time
● F6
○ 10 iterations
○ 1% time
10. Parallelizing Convolution
● Mapping Output Pixels to Threads
○ Each output pixel in the convolution layer can be computed independently.
○ Map each output pixel to a thread of a block (a kernel sketch follows below).
○ Split each image into smaller regions
■ Each region mapped to a separate block.
○ This alone will not give much performance gain -- it can even be slower than the single-block solution
■ False sharing.
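A minimal CUDA sketch of this pixel-to-thread mapping, assuming one single-channel image, valid convolution with stride 1, and row-major layout; each thread computes one output pixel and each block covers a 16×16 output region (all names are illustrative).

    // One thread per output pixel; each block covers a 16 x 16 region of the output.
    __global__ void conv2d_pixel_per_thread(const float* f, int W,
                                            const float* g, int F,
                                            float* o, int Wo, int Ho)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // output column
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // output row
        if (x >= Wo || y >= Ho) return;

        float acc = 0.0f;
        for (int j = 0; j < F; ++j)
            for (int i = 0; i < F; ++i)
                acc += f[(y + j) * W + (x + i)] * g[j * F + i];
        o[y * Wo + x] = acc;
    }

    // Launch: dim3 block(16, 16); dim3 grid((Wo + 15) / 16, (Ho + 15) / 16);
    // conv2d_pixel_per_thread<<<grid, block>>>(d_f, W, d_g, F, d_o, Wo, Ho);

Every thread here reads its input window straight from device memory, which is part of why this naive mapping gains little on its own.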
11. Parallelizing Convolution
● Shared Memory
○ The ratio of arithmetic operations to memory accesses is high.
■ Load the image into shared memory instead of accessing device memory directly (a cooperative-load sketch follows below).
○ Workload balancing among threads
■ Number of threads per block K < total number of pixels per image N
● Each thread will load N/K consecutive pixels
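A minimal CUDA sketch of the cooperative load, assuming one image per block: the K = blockDim.x threads first copy the N = W⋅H pixels into shared memory in chunks of roughly N/K consecutive pixels, then convolve out of shared memory. Names are illustrative, and the whole image must fit in shared memory (e.g. a small 32×32 input).

    // K threads cooperatively stage one W x H image in shared memory,
    // then compute the Wo x Ho valid-convolution output from that copy.
    __global__ void conv2d_shared(const float* f, int W, int H,
                                  const float* g, int F,
                                  float* o, int Wo, int Ho)
    {
        extern __shared__ float img[];                 // W * H floats per block
        int N = W * H;
        int K = blockDim.x;
        int chunk = (N + K - 1) / K;                   // ceil(N / K) pixels per thread
        int begin = threadIdx.x * chunk;
        int end   = min(N, begin + chunk);
        for (int p = begin; p < end; ++p)              // consecutive pixels per thread
            img[p] = f[p];
        __syncthreads();                               // image now resident on-chip

        for (int idx = threadIdx.x; idx < Wo * Ho; idx += K) {
            int x = idx % Wo, y = idx / Wo;
            float acc = 0.0f;
            for (int j = 0; j < F; ++j)
                for (int i = 0; i < F; ++i)
                    acc += img[(y + j) * W + (x + i)] * g[j * F + i];
            o[idx] = acc;
        }
    }

    // Launch for one image: conv2d_shared<<<1, K, W * H * sizeof(float)>>>(...);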
12. Naive Parallel - Batch Processing
● Instead of processing images one by one, group the images into batches of
256 images
● Each image mapped to one block
● With 5 kernels in the convolution layer, the batch convolution with one kernel is independent of the convolution with any other kernel (a launch sketch follows below).
○ The CPU can immediately continue processing that output data, thus hiding part of the memory latency.
● Doing that processing on the GPU can be even faster.
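A minimal CUDA sketch of the batched mapping, assuming 256 single-channel images per batch, one image per block, and one launch per convolution kernel; all names and sizes are illustrative. Because launches are asynchronous, the host can keep working (e.g. on outputs already copied back) while the GPU processes the batch.

    const int BATCH = 256;

    // Block b convolves image b of the batch with one F x F kernel (valid, stride 1).
    __global__ void conv2d_batched(const float* f, int W, int H,
                                   const float* g, int F,
                                   float* o, int Wo, int Ho)
    {
        const float* img = f + blockIdx.x * W * H;       // this block's image
        float*       out = o + blockIdx.x * Wo * Ho;     // this block's output plane
        for (int idx = threadIdx.x; idx < Wo * Ho; idx += blockDim.x) {
            int x = idx % Wo, y = idx / Wo;
            float acc = 0.0f;
            for (int j = 0; j < F; ++j)
                for (int i = 0; i < F; ++i)
                    acc += img[(y + j) * W + (x + i)] * g[j * F + i];
            out[idx] = acc;
        }
    }

    // Host side: one asynchronous launch per convolution kernel; the launches
    // are independent of each other, so the CPU is free to continue meanwhile.
    void convolve_layer(const float* d_imgs, const float* d_kernels, float* d_out,
                        int W, int H, int F, int Wo, int Ho, int numKernels)
    {
        for (int k = 0; k < numKernels; ++k)
            conv2d_batched<<<BATCH, 256>>>(d_imgs, W, H,
                                           d_kernels + k * F * F, F,
                                           d_out + (size_t)k * BATCH * Wo * Ho, Wo, Ho);
    }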
13. Result
# Blk/Img | Batch | Shared Mem. | Time (sec) | Time/Iter. (sec) | Speedup v. Seq. | Accuracy
        4 | No    | No          | 205.14     | 3.2              | 1.3             | 56.62%
        1 | No    | No          | 148.63     | 2.1              | 2.4             | 47.24%
        1 | No    | Yes         | 123.46     | 1.75             | 3.1             | 55.90%
        1 | Yes   | Yes         | 157.07     | 0.21             | 20.4            | 56.22%
14. Further Approach: Rehash + Fuse
● Combine the convolution layer and the subsampling layer into one to further speed up the process (a fused-kernel sketch follows below).
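A minimal CUDA sketch of one way to fuse the two layers, assuming valid convolution followed by average pooling over non-overlapping P×P windows: each thread produces one pooled output value directly, so the intermediate convolution map is never written to device memory. The names and the choice of average pooling are assumptions.

    // Fused convolution + subsampling: each thread computes one pooled output
    // value by convolving the P x P underlying positions and averaging them.
    __global__ void conv_subsample_fused(const float* f, int W,
                                         const float* g, int F, int P,
                                         float* o, int Wo, int Ho)   // pooled size
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // pooled column
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // pooled row
        if (x >= Wo || y >= Ho) return;

        float pooled = 0.0f;
        for (int pj = 0; pj < P; ++pj)                   // positions inside the pool window
            for (int pi = 0; pi < P; ++pi) {
                int cx = x * P + pi, cy = y * P + pj;    // conv-output coordinates
                float acc = 0.0f;
                for (int j = 0; j < F; ++j)
                    for (int i = 0; i < F; ++i)
                        acc += f[(cy + j) * W + (cx + i)] * g[j * F + i];
                pooled += acc;
            }
        o[y * Wo + x] = pooled / (P * P);                // average pooling
    }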
15. Further Approach: Fused Parallel
● Tapping into the producer-consumer relationship across layers in the Fused Parallel implementation.
16. Distributed Memory - MPI
● Broadcast the output before running the next layer
○ Between S2-C3 and S4-C5
● Selective send based on connectivity (an MPI sketch follows below)
● Couldn't implement this yet -- next?
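A minimal MPI sketch of the broadcast step, assuming rank 0 holds the subsampling output (e.g. S2) and every rank then computes its assigned share of the next convolutional layer (C3); the buffer name and size are illustrative.

    #include <mpi.h>

    // Broadcast the S2 output from rank 0 to all ranks before C3 is computed.
    // s2_out holds s2_size floats; after the call every rank has a full copy
    // and convolves only the C3 feature maps assigned to it.
    void broadcast_layer_output(float* s2_out, int s2_size, MPI_Comm comm)
    {
        MPI_Bcast(s2_out, s2_size, MPI_FLOAT, /*root=*/0, comm);
    }

    // A "selective send" based on the C3 connectivity table would replace the
    // broadcast with point-to-point transfers (MPI_Send / MPI_Recv) of only
    // the S2 maps that each rank's C3 feature maps are actually connected to.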