A short survey of the current state of field-programmable gate array (FPGA) usage in deep learning, covering dedicated-hardware efforts by companies such as Intel Nervana and Google (the TPU, or Tensor Processing Unit), compared with GPU usage in terms of energy consumption and performance.
Deep learning with FPGA
1. Deep Learning with FPGA
Drive towards dedicated Hardware for Efficient Learning
Ayush Singh
College of Computer and Information Sciences
Northeastern University
2. Intro
Deep Learning is a rapidly evolving machine learning technique
Deep Learning requires massive computation to reach acceptable accuracy
Modern models are highly complex (e.g., 11.2B connections and 5M params)
Traditionally, industry relied on the processing power of CPU infrastructure
Enter GPUs running the same code, with ASICs and FPGAs on the horizon
3. Evolution: CPU > GPU > FPGA <=> ASIC
As data and throughput demands increased, industry began looking for alternatives
GPUs became the heroes: good at parallel computation while running the same code
But GPUs are power hogs, with low precision and high cache miss rates (TLB, IRO)
Deep learning models matured and became increasingly network-specific
There is a drive to get faster results on embedded devices with limited resources
4. Field Programmable Gate Arrays
Hardware implementation of algorithms, typically faster than software
Latency orders of magnitude lower (nanoseconds vs microseconds on GPU)
Order-of-magnitude lower power consumption (20 W FPGA vs 200 W GPU)
Lower clock speeds (500 MHz vs 1348 MHz)
Programmed unconventionally, using HDLs such as Verilog rather than C++/CUDA
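A back-of-the-envelope sketch in Python of what these slide figures imply; the numbers are the deck's illustrative values (20 W vs 200 W, 500 MHz vs 1348 MHz), not measured benchmarks:

```python
# Back-of-the-envelope comparison using this deck's illustrative
# figures (not measured benchmarks).

fpga = {"power_w": 20, "clock_mhz": 500}
gpu = {"power_w": 200, "clock_mhz": 1348}

# Power ratio: how many FPGAs fit in one GPU's power budget.
power_ratio = gpu["power_w"] / fpga["power_w"]
print(f"GPU draws {power_ratio:.0f}x the power of the FPGA")   # 10x

# Clock ratio: the GPU clocks ~2.7x faster, yet the FPGA's
# fixed-function datapath can still win on work done per joule.
clock_ratio = gpu["clock_mhz"] / fpga["clock_mhz"]
print(f"GPU clock is {clock_ratio:.1f}x faster")               # 2.7x
```

The point of the arithmetic: a 10x power gap against only a ~2.7x clock gap is why performance-per-watt is the FPGA's selling point.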
5. Current State
Successful demonstrations of throughput and efficiency on custom chips
Gaining traction in industry (Baidu, Microsoft, Google, etc.)
Research volume is still a fraction of that on GPUs
Still trails GPUs in raw throughput by more than 2x (20 TFLOPS vs 55 TFLOPS)
Energy consumption 50-80x and throughput 20-40x better than CPU
6. Limitations
● Longer development time than GPUs (1 month vs 1 day)
● Limited block RAM and dynamic RAM
● Not cheap to manufacture at low production volumes
● Scarce domain-specific talent: imagine a hardware engineer who is also a CNN expert
● Speedup depends on using fixed-point instead of floating-point precision
● Lower memory bandwidth than GPUs (20 GB/s vs 780 GB/s)
● Porting source code to each new chip generation is painful
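The fixed-point bullet above is worth unpacking: FPGA speedups typically come from quantizing floating-point weights to narrow integers. A minimal pure-Python sketch of symmetric int8 quantization (the weight values and helper name are illustrative; real flows use vendor toolchains and DSP blocks):

```python
# Sketch of fixed-point quantization: map float weights to int8
# codes, then reconstruct and measure the error introduced.

def quantize_int8(weights):
    """Symmetric linear quantization of floats to int8 codes."""
    scale = max(abs(w) for w in weights) / 127.0   # one code = `scale` units
    q = [round(w / scale) for w in weights]        # int8 codes in [-127, 127]
    deq = [v * scale for v in q]                   # reconstructed floats
    return q, deq, scale

weights = [0.82, -0.31, 0.05, -0.77, 0.40]         # toy example values
q, deq, scale = quantize_int8(weights)
max_err = max(abs(w - d) for w, d in zip(weights, deq))

print(q)         # [127, -48, 8, -119, 62]
print(max_err)   # rounding error is bounded by scale / 2
assert max_err <= scale / 2 + 1e-12
```

Integer multiply-accumulate in 8 bits is far cheaper in FPGA fabric than 32-bit floating point, which is where the speedup (and the accuracy trade-off) comes from.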
7. Future
● Brings the power of deep learning to embedded systems and compute farms
● Flexibility for creative applications at the chip, server, and warehouse level
● Opens doors to research on compressing and optimizing ML techniques
● Eventual transition to ASICs, as happened in the Bitcoin era
● Emergence of new development platforms, e.g. OpenCL, DeepCL vs DeepCompute, CUDA
● Hybrid architecture: GPU as the main accelerator, FPGA as auxiliary, CPU to coordinate
8. Companies
Many recent startups have taken it upon
themselves to address this gap
● DeepPhi - FPGA
● Microsoft - FPGA
● Falcon Computing - FPGA
● Nervana - ASIC
● Wave Computing - ASIC
● Cognimem - ASIC
● Xilinx - Supplier
● Kintex - Supplier
● NVIDIA - GPU King
● INTEL with Nervana and Altera
● IBM with Xilinx
● Baidu with Nervana