A short survey of the current state of field-programmable gate array (FPGA) usage in deep learning, covering dedicated-hardware efforts by companies such as Intel Nervana and Google (the TPU, or Tensor Processing Unit), compared with GPU usage in terms of energy consumption and performance.
Deep learning with FPGA
1. Deep Learning with FPGA
Drive towards dedicated Hardware for Efficient Learning
Ayush Singh
College of Computer and Information Sciences
Northeastern University
2. Intro
Deep Learning is a rapidly evolving machine learning technique
Deep Learning requires massive computation to reach acceptable accuracy
Modern models are highly complex (e.g., 11.2B connections and 5M params)
Traditionally, industry relied on the processing power of CPU infrastructure
Enter GPUs running the same code, with ASICs and FPGAs on the horizon
3. Evolution: CPU > GPU > FPGA <=> ASIC
As data and throughput demands increased, industry began looking for alternatives
GPUs became the heroes: good at parallel computation while running the same code
But GPUs are power hogs, with low precision and high cache miss rates (TLB, IRO)
Deep learning models matured and became increasingly network-specific
There is a drive to get faster results on embedded devices with limited resources
4. Field Programmable Gate Arrays
Hardware implementation of algorithms, typically faster than software
Latency orders of magnitude lower (nanoseconds vs microseconds on GPU)
Order-of-magnitude lower power consumption (20 W FPGA vs 200 W GPU)
Lower clock speeds (500 MHz vs 1348 MHz)
Programmed unconventionally, using HDLs such as Verilog rather than C++/CUDA
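A back-of-the-envelope sketch in Python of what these slide figures imply; the numbers are the deck's illustrative values (20 W vs 200 W, 500 MHz vs 1348 MHz), not measured benchmarks:

```python
# Back-of-the-envelope comparison using this deck's illustrative
# figures (not measured benchmarks).

fpga = {"power_w": 20, "clock_mhz": 500}
gpu = {"power_w": 200, "clock_mhz": 1348}

# Power ratio: how many FPGAs fit in one GPU's power budget.
power_ratio = gpu["power_w"] / fpga["power_w"]
print(f"GPU draws {power_ratio:.0f}x the power of the FPGA")   # 10x

# Clock ratio: the GPU clocks ~2.7x faster, yet the FPGA's
# fixed-function datapath can still win on work done per joule.
clock_ratio = gpu["clock_mhz"] / fpga["clock_mhz"]
print(f"GPU clock is {clock_ratio:.1f}x faster")               # 2.7x
```

The point of the arithmetic: a 10x power gap against only a ~2.7x clock gap is why performance-per-watt is the FPGA's selling point.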
5. Current State
Successful demonstrations of throughput and efficiency on custom chips
Gaining traction in industry (Baidu, Microsoft, Google, etc.)
Research volume is still a fraction of that on GPUs
Still trails GPUs in raw throughput by more than 2x (20 TFLOPS vs 55 TFLOPS)
Energy consumption 50-80x and throughput 20-40x better than CPU
6. Limitations
● Longer development time than GPUs (1 month vs 1 day)
● Limited block RAM and dynamic RAM
● Not cheap to manufacture at low production volumes
● Scarce domain-specific talent: imagine a hardware engineer who is also a CNN expert
● Speedup depends on using fixed-point instead of floating-point precision
● Lower memory bandwidth than GPUs (20 GB/s vs 780 GB/s)
● Porting source code to each new chip generation is painful
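The fixed-point bullet above is worth unpacking: FPGA speedups typically come from quantizing floating-point weights to narrow integers. A minimal pure-Python sketch of symmetric int8 quantization (the weight values and helper name are illustrative; real flows use vendor toolchains and DSP blocks):

```python
# Sketch of fixed-point quantization: map float weights to int8
# codes, then reconstruct and measure the error introduced.

def quantize_int8(weights):
    """Symmetric linear quantization of floats to int8 codes."""
    scale = max(abs(w) for w in weights) / 127.0   # one code = `scale` units
    q = [round(w / scale) for w in weights]        # int8 codes in [-127, 127]
    deq = [v * scale for v in q]                   # reconstructed floats
    return q, deq, scale

weights = [0.82, -0.31, 0.05, -0.77, 0.40]         # toy example values
q, deq, scale = quantize_int8(weights)
max_err = max(abs(w - d) for w, d in zip(weights, deq))

print(q)         # [127, -48, 8, -119, 62]
print(max_err)   # rounding error is bounded by scale / 2
assert max_err <= scale / 2 + 1e-12
```

Integer multiply-accumulate in 8 bits is far cheaper in FPGA fabric than 32-bit floating point, which is where the speedup (and the accuracy trade-off) comes from.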
7. Future
● Brings the power of deep learning to embedded systems and compute farms
● Flexibility for creative applications at the chip, server, and warehouse level
● Opens doors to research on compressing and optimizing ML techniques
● Eventual transition to ASICs, as happened in the Bitcoin era
● Emergence of new development platforms, e.g. OpenCL, DeepCL vs DeepCompute, CUDA
● Hybrid architecture: GPU as the main accelerator, FPGA as auxiliary, CPU to coordinate
8. Companies
Many recent startups have taken it upon
themselves to address this gap
● DeepPhi - FPGA
● Microsoft - FPGA
● Falcon Computing - FPGA
● Nervana - ASIC
● Wave Computing - ASIC
● Cognimem - ASIC
● Xilinx - Supplier
● Kintex - Supplier
● NVIDIA - GPU King
● INTEL with Nervana and Altera
● IBM with Xilinx
● Baidu with Nervana