Describe basic neural network design, with a focus on Convolutional Neural Network (CNN) architecture. Explain why CPUs and GPUs cannot fully satisfy CNN hardware requirements. List three hardware examples, from Nvidia, Microsoft, and Google. Finally, highlight optimization approaches for CNN design.
7. Convolutional Neural Network
Software   | Developer                                 | Platform                           | Interface                     | Hardware
-----------|-------------------------------------------|------------------------------------|-------------------------------|---------------
Caffe      | Berkeley Vision and Learning Center       | Linux, OSX, Windows, Android       | C++, Python, Matlab           | CPU, GPU
MatConvNet | Oxford Visual Geometry Group              | Linux, OSX, Windows                | Matlab                        | CPU, GPU
Matlab     | MathWorks                                 | Linux, OSX, Windows                | Matlab                        | CPU, GPU
TensorFlow | Google Brain Team                         | Linux, OSX, Windows                | C++, Python                   | CPU, GPU, TPU
Torch 7    | R. Collobert, K. Kavukcuoglu, C. Farabet  | Linux, OSX, iOS, Android, Windows  | Lua, LuaJIT, C                | CPU, GPU
Theano     | Université de Montréal                    | Cross-platform                     | Python                        | CPU, GPU
CNTK       | Microsoft                                 | Linux, OSX, Windows                | Network Description Language  | CPU, GPU, FPGA
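As a concrete example of the interfaces listed above, here is a minimal sketch of a small CNN defined through the TensorFlow Python API; the layer sizes and the 28x28 grayscale input shape are illustrative assumptions, not values from the table. The same model definition can run on CPU, GPU, or TPU backends.

```python
import tensorflow as tf

# A small illustrative CNN: two conv/pool stages followed by a classifier.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()  # TensorFlow dispatches the same graph to CPU, GPU, or TPU
```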
8. CPU vs GPU
Architecture     | Central Processing Unit (CPU)           | Graphics Processing Unit (GPU)
-----------------|-----------------------------------------|------------------------------------------
Instruction Set  | Single Instruction Single Data (SISD)   | Single Instruction Multiple Data (SIMD)
Operation        | Sequential                              | Parallel
Processor Cores  | Few                                     | Many
Datapath         | Custom                                  | Synthesis
Clock Rate       | High                                    | Moderate
Bandwidth        | Medium                                  | Large
Power            | Moderate                                | High
Temperature      | Moderate                                | High
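The SISD vs SIMD row is the key difference for CNN workloads, whose inner loops are element-wise multiply-accumulates. A small Python sketch makes it concrete; NumPy's vectorized kernels stand in for data-parallel hardware, and timings are machine-dependent:

```python
import time
import numpy as np

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)

# SISD-style: one scalar operation per step, as on a single sequential core.
t0 = time.perf_counter()
out = np.empty_like(a)
for i in range(n):
    out[i] = a[i] * b[i]
t_scalar = time.perf_counter() - t0

# SIMD-style: one instruction applied across many data elements at once.
t0 = time.perf_counter()
out_vec = a * b
t_vector = time.perf_counter() - t0

print(f"scalar loop: {t_scalar:.3f} s, vectorized: {t_vector:.5f} s")
```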
9. Graphics Processing Unit (Nvidia)
Pascal Architecture
Flagship Chip: GP100
Process: TSMC 16 nm FinFET
Transistors: 15.3 billion
Streaming Multiprocessors (SM): 56 enabled (10 SM/GPC on the full die)
CUDA Cores: 3840 (64 cores/SM on the full die)
Base Clock: 1328 MHz
Boost Clock: 1480 MHz
FP32 Performance: 10.6 TFLOPS
FP64 Performance: 5.3 TFLOPS
Memory Interface: 4096-bit HBM2
Maximum Bandwidth: 720 GB/s
Maximum Power: 300 W
J. Walton, Nvidia Pascal P100 Architecture Deep Dive, PC Gamer, Apr 07, 2016.
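As a sanity check on the figures above, peak FP32 throughput follows from cores x clock x 2 FLOPs per fused multiply-add. Assuming the published Pascal figure of 64 FP32 cores per SM, the 10.6 TFLOPS number matches the 56 enabled SMs (3584 cores) rather than the full 3840-core die:

```python
# Back-of-envelope peak FP32 throughput for the P100 configuration above.
sm_count = 56               # enabled SMs (the full GP100 die has 60)
cores_per_sm = 64           # FP32 CUDA cores per Pascal SM
boost_clock_hz = 1.480e9    # 1480 MHz boost clock
flops_per_cycle = 2         # one fused multiply-add = 2 floating-point ops

peak_fp32 = sm_count * cores_per_sm * boost_clock_hz * flops_per_cycle
print(f"peak FP32: {peak_fp32 / 1e12:.1f} TFLOPS")  # -> 10.6 TFLOPS
```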
10. Catapult Fabric (Microsoft)
Purpose
Designed for neural network classification
Targets power reduction
Architecture
Field Programmable Gate Array (FPGA)
Software-configurable engine supports multiple layer configurations at runtime
A spatially distributed array of processing elements scales up to thousands of units (see the sketch below)
On-chip redistribution network with efficient data buffering minimizes off-chip memory traffic
Power dissipation is reduced to only 25 W
K. Ovtcharov, O. Ruwase, J.Y. Kim, J. Fowers, K. Strauss, E.S. Chung, Accelerating Deep Convolutional Networks Using Specialized Hardware, Microsoft Research, Feb 2015.
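A toy Python sketch of the spatially distributed processing-element (PE) idea referenced above: each PE computes one slice of a 1-D convolution, so throughput scales with the number of units while the input stays shared. This is purely illustrative, not the actual Catapult fabric design.

```python
import numpy as np

def pe_conv_slice(x, kernel, start, stop):
    """One PE: computes output rows [start, stop) of a valid 1-D convolution."""
    k = len(kernel)
    return [float(np.dot(x[i:i + k], kernel)) for i in range(start, stop)]

x = np.arange(32, dtype=np.float32)                    # input feature map
kernel = np.array([0.25, 0.5, 0.25], dtype=np.float32)
n_out = len(x) - len(kernel) + 1

n_pes = 4  # the fabric scales this to thousands of units
bounds = np.linspace(0, n_out, n_pes + 1, dtype=int)   # partition the output
out = []
for pe in range(n_pes):  # conceptually these slices run in parallel
    out.extend(pe_conv_slice(x, kernel, bounds[pe], bounds[pe + 1]))

# The distributed result matches a monolithic convolution.
assert np.allclose(out, np.convolve(x, kernel[::-1], mode="valid"))
```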
11. Tensor Processing Unit (Google)
Purpose
Supports the TensorFlow framework
Targets neural network classification
Architecture
Application Specific Integrated Circuit (ASIC)
Single Instruction Multiple Data (SIMD) Architecture
Low computational precision (see the quantization sketch below)
Better performance per watt
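A minimal sketch of the low-precision idea: quantize FP32 values to int8, accumulate in cheap integer arithmetic, then rescale. The symmetric linear scheme here is an illustrative assumption, not Google's actual quantization pipeline.

```python
import numpy as np

def quantize(x, n_bits=8):
    """Symmetric linear quantization of a float array to signed integers."""
    qmax = 2 ** (n_bits - 1) - 1          # 127 for int8
    scale = float(np.abs(x).max()) / qmax
    return np.round(x / scale).astype(np.int8), scale

weights = np.random.randn(256).astype(np.float32)
activations = np.random.randn(256).astype(np.float32)

qw, sw = quantize(weights)
qa, sa = quantize(activations)

# Integer multiply-accumulate (cheap in silicon), rescaled back to float.
dot_int8 = int(np.dot(qw.astype(np.int32), qa.astype(np.int32))) * sw * sa
dot_fp32 = float(np.dot(weights, activations))
print(f"fp32: {dot_fp32:.4f}  int8: {dot_int8:.4f}")  # small accuracy loss
```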