4. Convolutional Neural Net (CNN)
● A special kind of neural network optimized for 2D pattern recognition.
● 2D convolution of an input matrix f with a kernel matrix g, producing output o, is defined as o[m, n] = Σ_i Σ_j f[m + i, n + j] ⋅ g[i, j], where the sum runs over the kernel indices i, j (a sequential reference sketch follows below).
● Subsampling layer contains multiple kernels.
● Each kernel is shifted over valid but disjoint regions of the input
image
● Convolution of the image block and the kernel is computed at
each location.
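A minimal sequential sketch of this valid, strided convolution, assuming a row-major, single-channel layout; the function and parameter names are illustrative, not from the slides. With S = 1 it implements the definition above; with S equal to the kernel size it visits the disjoint regions used in subsampling.

    // Valid 2D convolution with stride S: the F x F kernel g slides over the
    // W x H input f, and each output element is the dot product of g with the
    // image block at that location: o[y][x] = sum_{i,j} f[y*S+j][x*S+i] * g[j][i].
    void conv2d_valid(const float* f, int W, int H,
                      const float* g, int F, int S,
                      float* o)   // output is ((W-F)/S + 1) x ((H-F)/S + 1)
    {
        int Wo = (W - F) / S + 1;
        int Ho = (H - F) / S + 1;
        for (int y = 0; y < Ho; ++y)
            for (int x = 0; x < Wo; ++x) {
                float acc = 0.0f;
                for (int j = 0; j < F; ++j)
                    for (int i = 0; i < F; ++i)
                        acc += f[(y * S + j) * W + (x * S + i)] * g[j * F + i];
                o[y * Wo + x] = acc;
            }
    }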
5. Convolutional Neural Net (CNN)
● Not fully connected; follows a biological neural connection model for feature extraction via a deep network.
● Convolution + subsampling via pooling (max, avg, stochastic) + fully connected layers (a pooling sketch follows below)
● Backpropagation with cross-entropy loss optimization
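A minimal sketch of the pooling-based subsampling step, assuming max pooling over non-overlapping P x P windows of a single feature map; names and layout are illustrative. Average or stochastic pooling would replace the max with a mean or a weighted random draw.

    #include <float.h>

    // Max pooling over non-overlapping P x P windows of a W x H feature map,
    // producing a (W/P) x (H/P) output. Row-major layout is assumed.
    void maxpool2d(const float* in, int W, int H, int P, float* out)
    {
        int Wo = W / P, Ho = H / P;
        for (int y = 0; y < Ho; ++y)
            for (int x = 0; x < Wo; ++x) {
                float m = -FLT_MAX;
                for (int j = 0; j < P; ++j)
                    for (int i = 0; i < P; ++i) {
                        float v = in[(y * P + j) * W + (x * P + i)];
                        if (v > m) m = v;
                    }
                out[y * Wo + x] = m;
            }
    }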
7. Convolutional Layer Summary
● Accepts a volume of size W1×H1×D1
● Requires four hyperparameters:
○ Number of filters K,
○ their spatial extent F,
○ the stride S,
○ the amount of zero padding P.
● Produces a volume of size W2×H2×D2 where:
○ W2=(W1−F+2P)/S+1
○ H2=(H1−F+2P)/S+1 (i.e. width and height are computed symmetrically)
○ D2=K (a worked example follows below)
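As a worked example with illustrative numbers (not from the slides): a 32×32×1 input (W1 = H1 = 32, D1 = 1) with K = 6 filters of extent F = 5, stride S = 1 and no padding (P = 0) gives W2 = H2 = (32 − 5 + 2⋅0)/1 + 1 = 28 and D2 = 6, i.e. a 28×28×6 output volume.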
8. Conv Layers are the Bottleneck
● With parameter sharing, a convolutional layer introduces F⋅F⋅D1 weights per filter, for a total of (F⋅F⋅D1)⋅K weights and K biases (a worked example follows below).
● In the output volume, the d-th depth slice (of size W2×H2) is the result of
performing a valid convolution of the d-th filter over the input volume with a
stride of S, and then offset by d-th bias.
● Calculating convolutions takes more than 95% of the total training time
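Worked example with the same illustrative numbers: each 5×5 filter over a D1 = 1 input has F⋅F⋅D1 = 25 shared weights, so K = 6 filters contribute (5⋅5⋅1)⋅6 = 150 weights plus 6 biases. Every one of those weights is nonetheless applied at all 28×28 output locations, which is why the convolutions, not the parameters, dominate training time.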
9. Conv Layers are the Bottleneck
● C1
○ ~4704 iterations
○ 38% time
● S2
○ ~1176 iterations
○ 3% time only
● C3
○ ~1600 iterations
○ 43% time
● S4
○ ~400 iterations
○ 1% time only
● C5
○ ~120 iterations
○ 12% time
● F6
○ 10 iterations
○ 1% time
10. Parallelizing Convolution
● Mapping Output Pixels to Threads
○ Each output pixel in the convolution layer can be computed independently.
○ Map each output pixel to a thread of a block (a kernel sketch follows below).
○ Split each image into smaller regions
■ Each region mapped to a separate block.
○ This alone will not give much performance gain -- it can even be slower than the single-block solution
■ False sharing.
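A minimal CUDA sketch of this pixel-to-thread mapping, assuming one single-channel image, valid convolution with stride 1, and row-major layout; each thread computes one output pixel and each block covers a 16×16 output region (all names are illustrative).

    // One thread per output pixel; each block covers a 16 x 16 region of the output.
    __global__ void conv2d_pixel_per_thread(const float* f, int W,
                                            const float* g, int F,
                                            float* o, int Wo, int Ho)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // output column
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // output row
        if (x >= Wo || y >= Ho) return;

        float acc = 0.0f;
        for (int j = 0; j < F; ++j)
            for (int i = 0; i < F; ++i)
                acc += f[(y + j) * W + (x + i)] * g[j * F + i];
        o[y * Wo + x] = acc;
    }

    // Launch: dim3 block(16, 16); dim3 grid((Wo + 15) / 16, (Ho + 15) / 16);
    // conv2d_pixel_per_thread<<<grid, block>>>(d_f, W, d_g, F, d_o, Wo, Ho);

Every thread here reads its input window straight from device memory, which is part of why this naive mapping gains little on its own.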
11. Parallelizing Convolution
● Shared Memory
○ The ratio of arithmetic operations to memory accesses is high.
■ Load the image into shared memory instead of accessing device memory directly (a cooperative-load sketch follows below).
○ Workload balancing among threads
■ Number of threads per block K < total number of pixels per image N
● Each thread will load N/K consecutive pixels
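A minimal CUDA sketch of the cooperative load, assuming one image per block: the K = blockDim.x threads first copy the N = W⋅H pixels into shared memory in chunks of roughly N/K consecutive pixels, then convolve out of shared memory. Names are illustrative, and the whole image must fit in shared memory (e.g. a small 32×32 input).

    // K threads cooperatively stage one W x H image in shared memory,
    // then compute the Wo x Ho valid-convolution output from that copy.
    __global__ void conv2d_shared(const float* f, int W, int H,
                                  const float* g, int F,
                                  float* o, int Wo, int Ho)
    {
        extern __shared__ float img[];                 // W * H floats per block
        int N = W * H;
        int K = blockDim.x;
        int chunk = (N + K - 1) / K;                   // ceil(N / K) pixels per thread
        int begin = threadIdx.x * chunk;
        int end   = min(N, begin + chunk);
        for (int p = begin; p < end; ++p)              // consecutive pixels per thread
            img[p] = f[p];
        __syncthreads();                               // image now resident on-chip

        for (int idx = threadIdx.x; idx < Wo * Ho; idx += K) {
            int x = idx % Wo, y = idx / Wo;
            float acc = 0.0f;
            for (int j = 0; j < F; ++j)
                for (int i = 0; i < F; ++i)
                    acc += img[(y + j) * W + (x + i)] * g[j * F + i];
            o[idx] = acc;
        }
    }

    // Launch for one image: conv2d_shared<<<1, K, W * H * sizeof(float)>>>(...);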
12. Naive Parallel - Batch Processing
● Instead of processing images one by one, group the images into batches of
256 images
● Each image mapped to one block
● With 5 kernels in the convolution layer, the batch convolution with one kernel is independent of the convolution with any other kernel (a launch sketch follows below).
○ The CPU can immediately continue processing that output data, thus hiding part of the memory latency.
● Doing that processing on the GPU can be even faster.
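A minimal CUDA sketch of the batched mapping, assuming 256 single-channel images per batch, one image per block, and one launch per convolution kernel; all names and sizes are illustrative. Because launches are asynchronous, the host can keep working (e.g. on outputs already copied back) while the GPU processes the batch.

    const int BATCH = 256;

    // Block b convolves image b of the batch with one F x F kernel (valid, stride 1).
    __global__ void conv2d_batched(const float* f, int W, int H,
                                   const float* g, int F,
                                   float* o, int Wo, int Ho)
    {
        const float* img = f + blockIdx.x * W * H;       // this block's image
        float*       out = o + blockIdx.x * Wo * Ho;     // this block's output plane
        for (int idx = threadIdx.x; idx < Wo * Ho; idx += blockDim.x) {
            int x = idx % Wo, y = idx / Wo;
            float acc = 0.0f;
            for (int j = 0; j < F; ++j)
                for (int i = 0; i < F; ++i)
                    acc += img[(y + j) * W + (x + i)] * g[j * F + i];
            out[idx] = acc;
        }
    }

    // Host side: one asynchronous launch per convolution kernel; the launches
    // are independent of each other, so the CPU is free to continue meanwhile.
    void convolve_layer(const float* d_imgs, const float* d_kernels, float* d_out,
                        int W, int H, int F, int Wo, int Ho, int numKernels)
    {
        for (int k = 0; k < numKernels; ++k)
            conv2d_batched<<<BATCH, 256>>>(d_imgs, W, H,
                                           d_kernels + k * F * F, F,
                                           d_out + (size_t)k * BATCH * Wo * Ho, Wo, Ho);
    }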
13. Result
# Blk/Img | Batch | Shared Mem. | Time (sec) | Time/Iter. (sec) | Speedup v. Seq. | Accuracy
        4 | No    | No          | 205.14     | 3.2              | 1.3             | 56.62%
        1 | No    | No          | 148.63     | 2.1              | 2.4             | 47.24%
        1 | No    | Yes         | 123.46     | 1.75             | 3.1             | 55.90%
        1 | Yes   | Yes         | 157.07     | 0.21             | 20.4            | 56.22%
14. Further Approach: Rehash + Fuse
● Combine the convolution layer and the subsampling layer into one to further speed up the process (a fused-kernel sketch follows below).
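A minimal CUDA sketch of one way to fuse the two layers, assuming valid convolution followed by average pooling over non-overlapping P×P windows: each thread produces one pooled output value directly, so the intermediate convolution map is never written to device memory. The names and the choice of average pooling are assumptions.

    // Fused convolution + subsampling: each thread computes one pooled output
    // value by convolving the P x P underlying positions and averaging them.
    __global__ void conv_subsample_fused(const float* f, int W,
                                         const float* g, int F, int P,
                                         float* o, int Wo, int Ho)   // pooled size
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // pooled column
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // pooled row
        if (x >= Wo || y >= Ho) return;

        float pooled = 0.0f;
        for (int pj = 0; pj < P; ++pj)                   // positions inside the pool window
            for (int pi = 0; pi < P; ++pi) {
                int cx = x * P + pi, cy = y * P + pj;    // conv-output coordinates
                float acc = 0.0f;
                for (int j = 0; j < F; ++j)
                    for (int i = 0; i < F; ++i)
                        acc += f[(cy + j) * W + (cx + i)] * g[j * F + i];
                pooled += acc;
            }
        o[y * Wo + x] = pooled / (P * P);                // average pooling
    }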
15. Further Approach: Fused Parallel
● Tapping into the producer-consumer relationship across layers in the Fused Parallel implementation.
16. Distributed Memory - MPI
● Broadcast the output before running the next layer
○ Between S2-C3 and S4-C5
● Selective send based on connectivity (an MPI sketch follows below)
● Couldn't implement this yet -- next?
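A minimal MPI sketch of the broadcast step, assuming rank 0 holds the subsampling output (e.g. S2) and every rank then computes its assigned share of the next convolutional layer (C3); the buffer name and size are illustrative.

    #include <mpi.h>

    // Broadcast the S2 output from rank 0 to all ranks before C3 is computed.
    // s2_out holds s2_size floats; after the call every rank has a full copy
    // and convolves only the C3 feature maps assigned to it.
    void broadcast_layer_output(float* s2_out, int s2_size, MPI_Comm comm)
    {
        MPI_Bcast(s2_out, s2_size, MPI_FLOAT, /*root=*/0, comm);
    }

    // A "selective send" based on the C3 connectivity table would replace the
    // broadcast with point-to-point transfers (MPI_Send / MPI_Recv) of only
    // the S2 maps that each rank's C3 feature maps are actually connected to.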