"Efficient Implementation of Convolutional Neural Networks using OpenCL on FPGAs," a Presentation From Altera


For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/altera/embedded-vision-training/videos/pages/may-2015-embedded-vision-summit

For more information about embedded vision, please visit:
http://www.embedded-vision.com

Deshanand Singh, Director of Software Engineering at Altera, presents the "Efficient Implementation of Convolutional Neural Networks using OpenCL on FPGAs" tutorial at the May 2015 Embedded Vision Summit.

Convolutional neural networks (CNNs) are becoming increasingly popular in embedded applications such as vision processing and automotive driver assistance systems. The structure of CNN systems is characterized by cascades of FIR filters and transcendental functions. FPGA technology offers a very efficient way of implementing these structures by allowing designers to build custom hardware datapaths that implement the CNN structure. One challenge of using FPGAs is the design flow, which has traditionally centered on tedious hardware description languages.

In this talk, Deshanand gives a detailed explanation of how CNN algorithms can be expressed in OpenCL and compiled directly to FPGA hardware. He details the code optimizations involved and compares the resulting efficiency with that of hand-coded implementations.


Slide 1: Efficient Implementation of Convolutional Neural Networks using OpenCL on FPGAs
Dr. Deshanand Singh, Director of Software Engineering
12 May 2015
Copyright © 2015 Altera
Slide 2: Convolutional Neural Network (CNN)
• Convolutional neural network
  • Feed-forward network inspired by biological processes
  • Composed of different layers; more layers increase accuracy
• Convolutional layer
  • Extracts different features from the input
  • Low-level features, e.g. edges, lines, corners
• Pooling layer
  • Reduces variance; invariant to small translations
  • Takes the max or average value of a feature over a region of the image
• Applications: classification and detection, image recognition/tagging
(Illustration: Prof. Hinton's CNN algorithm)
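The pooling layer described above is simple enough to model directly. Below is a minimal illustrative Python sketch of 2x2 max pooling (not code from the slides; the function name and pure-Python style are my own):

```python
def max_pool_2x2(img):
    """Downsample an image by taking the max of each non-overlapping
    2x2 block. Assumes height and width are even."""
    H, W = len(img), len(img[0])
    return [[max(img[y][x], img[y][x+1], img[y+1][x], img[y+1][x+1])
             for x in range(0, W, 2)]
            for y in range(0, H, 2)]

img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
pooled = max_pool_2x2(img)  # [[6, 8], [14, 16]]
```

Because only the maximum of each region survives, small translations of a feature within a 2x2 block leave the output unchanged, which is the invariance the slide refers to.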
Slide 3: Basic Building Block of the CNN
Image → Convolution → Pooling
Source: Li Deng, Deep Learning Technology Center, Microsoft Research, Redmond, WA, USA. A tutorial at the International Workshop on Mathematical Issues in Information Sciences.
Slide 4: Key Observation: Pipelining
• Dataflow through the CNN can proceed in pipelined fashion
• No need to wait until the entire execution is complete
• A new set of data can enter stage 1 as soon as that stage completes its execution
(Diagram: Layer 1 convolving kernel → Layer 1 pooling kernel → Layer 2 convolving kernel)
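The overlap the slide describes can be modeled with Python generators: each stage pulls items from the upstream stage as they become available, so the first stage starts on new data without waiting for the whole batch to finish. This is an illustrative sketch of the scheduling idea only (the stage functions and names are placeholders, not Altera code):

```python
def stage(name, f, upstream, log):
    """A pipeline stage: record each item it touches, apply f, pass it on."""
    for item in upstream:
        log.append((name, item))
        yield f(item)

log = []
source = iter([0, 1, 2])
conv1 = stage("conv1", lambda x: x + 10, source, log)  # stand-in for convolution
pool1 = stage("pool1", lambda x: x * 2, conv1, log)    # stand-in for pooling
out = list(pool1)  # [20, 22, 24]
```

Inspecting `log` shows the interleaving: "pool1" processes the first item before "conv1" has touched the second, i.e. downstream work begins before upstream work is finished, which is the pipelining observation of the slide.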
Slide 5: FPGAs: Compute Fabrics that Support Pipelining
• Basic element: a 1-bit configurable operation (can perform any 1-bit operation: AND, OR, NOT, ADD, SUB) feeding a 1-bit register that stores the result
• Basic elements are surrounded by a flexible interconnect and can be grouped to build wider functions: a 16-bit add, a 32-bit sqrt, or your custom 64-bit bit-shuffle and encode
• Memory blocks (20 Kb, with addr / data_in / data_out ports) can be configured and grouped using the interconnect to create various cache architectures
• Dedicated floating-point multiply and add blocks
• Blocks are connected into a custom datapath that matches your application
Slide 6: Programming FPGAs: SDK for OpenCL
• Users write software: an OpenCL host program and OpenCL kernels
• A C compiler builds the host binary, linked against the Altera OpenCL host library
• The Altera kernel compiler builds the device binary: an FPGA-specific compilation target
Slide 7: OpenCL Programming Model
• Host + accelerator programming model
  • Sequential host program runs on a microprocessor
  • Functions are offloaded onto a highly parallel accelerator device (accelerators with local memories, attached to a global memory)

  __kernel void sum(__global float *a,
                    __global float *b,
                    __global float *y)
  {
      int gid = get_global_id(0);
      y[gid] = a[gid] + b[gid];
  }

  main() {
      read_data( … );
      manipulate( … );
      clEnqueueWriteBuffer( … );
      clEnqueueNDRange(…, sum, …);
      clEnqueueReadBuffer( … );
      display_result( … );
  }
Slide 8: OpenCL Kernels
• Data-parallel function
  • Defines many parallel threads of execution
  • Each thread has an identifier specified by get_global_id
  • Contains keyword extensions to specify parallelism and memory hierarchy
• Executed by a compute object: CPU, GPU, FPGA, DSP, or other accelerators

  __kernel void sum(__global const float *a,
                    __global const float *b,
                    __global float *answer)
  {
      int xid = get_global_id(0);
      answer[xid] = a[xid] + b[xid];
  }

Example: a = {0,1,2,3,4,5,6,7}, b = {7,6,5,4,3,2,1,0} gives answer = {7,7,7,7,7,7,7,7}
Slide 9: Dataflow / Pipeline Architecture from OpenCL
• On each cycle, different portions of the pipeline are processing different threads
• While thread 2 is being loaded, thread 1 is being added and thread 0 is being stored
(Diagram: two Load units feeding an adder feeding a Store unit; thread IDs 0-8 of the vector-add example advance through the pipeline one stage per cycle)
Slide 10: Convolutions: Our Basic Building Block

  I_new(x, y) = \sum_{y'=-1}^{1} \sum_{x'=-1}^{1} I_old(x + x', y + y') \times F(x', y')
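The formula maps directly to code. Here is a minimal pure-Python reference implementation of the 3x3 convolution above (for illustration only; the slides implement this in OpenCL, and `F` is indexed with an offset of +1 so that the x' = -1 … 1 range maps onto array indices 0 … 2):

```python
def convolve3x3(img, F):
    """Apply a 3x3 filter F to img, skipping the 1-pixel border.
    Implements I_new(x,y) = sum over x',y' in [-1,1] of
    I_old(x+x', y+y') * F(x', y')."""
    H, W = len(img), len(img[0])
    out = [[0.0] * W for _ in range(H)]
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            acc = 0.0
            for yp in (-1, 0, 1):
                for xp in (-1, 0, 1):
                    acc += img[y + yp][x + xp] * F[yp + 1][xp + 1]
            out[y][x] = acc
    return out

img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
convolve3x3(img, identity)[1][1]  # 5: identity filter preserves the pixel
convolve3x3(img, box)[1][1]       # 45: box filter sums the neighbourhood
```

Each output pixel reads a 3x3 neighbourhood of the input, which is exactly the 9-reads-per-pixel access pattern the next slides set out to optimize.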
Slide 11: Processor (CPU/GPU) Implementation
• A cache can hide poor memory access patterns (CPU ↔ cache ↔ main memory)

  for (int y = 1; y < height - 1; ++y) {
      for (int x = 1; x < width - 1; ++x) {
          for (int y2 = -1; y2 <= 1; ++y2) {
              for (int x2 = -1; x2 <= 1; ++x2) {
                  i2[y][x] += i[y + y2][x + x2] * filter[y2 + 1][x2 + 1];
              }
          }
      }
  }
Slide 12: FPGA Implementation
• Example performance point: 1 pixel per cycle
• Cache requirements: 9 reads + 1 write per cycle — 9 read ports means expensive hardware!
• Power overhead
• Cost overhead: more built-in addressing flexibility than we need
• Why not customize the cache for the application? (Memory → custom "cache" → custom datapath)
Slide 13: Optimizing the "Cache"
• Start out with the initial picture, which is W pixels wide
• Remove all the lines that aren't in the neighborhood of the 3x3 window
• Take the remaining lines and arrange them as a 1D array of pixels
• The pixels at the edges aren't needed for the computation and can be dropped
• What happens when we move the window one pixel to the right? The whole array simply shifts by one pixel
Slide 14: Optimizing the "Cache" (continued)
(Diagram only: the remaining pixels arranged as a shift register spanning the window's rows)
Slide 15: Shift Registers in Software
• Managing data movement to match the FPGA's architectural strengths is key to obtaining high performance

  pixel_t sr[2*W + 3];
  while (keep_going) {
      // Shift data in
      #pragma unroll
      for (int i = 2*W + 2; i > 0; --i)
          sr[i] = sr[i - 1];
      sr[0] = data_in;

      // Tap output data: the nine pixels of the 3x3 window
      data_out = {sr[0],   sr[1],     sr[2],
                  sr[W],   sr[W+1],   sr[W+2],
                  sr[2*W], sr[2*W+1], sr[2*W+2]};
      // ...
  }
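The behaviour of this shift register is easy to check in software. The following illustrative Python model (function name and test image are my own, not from the slides) feeds pixels in raster order into a register of depth 2W+3 and emits the nine taps each cycle once the register is full:

```python
def shift_register_windows(pixels, W):
    """Model the 2*W+3 shift-register 'cache': one pixel enters per
    cycle; the taps sr[0..2], sr[W..W+2], sr[2W..2W+2] form a 3x3
    window over the image (most recent pixel at sr[0])."""
    depth = 2 * W + 3
    sr = [0] * depth
    windows = []
    for count, p in enumerate(pixels, 1):
        sr = [p] + sr[:-1]  # shift everything down, newest pixel in front
        if count >= depth:  # register full: taps are valid
            windows.append([sr[0], sr[1], sr[2],
                            sr[W], sr[W + 1], sr[W + 2],
                            sr[2 * W], sr[2 * W + 1], sr[2 * W + 2]])
    return windows

# 4x4 image, W = 4, pixels numbered 0..15 in raster order
wins = shift_register_windows(list(range(16)), W=4)
```

The first window is `[10, 9, 8, 6, 5, 4, 2, 1, 0]`: columns 0-2 of rows 2, 1 and 0, i.e. the 3x3 neighbourhood around pixel 5 (reversed within each row because the newest pixel sits at `sr[0]`). Only one new pixel is read per cycle, replacing the 9-read cache of the previous slide.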
Slide 16: Building a CNN for CIFAR-10 on an FPGA
• CIFAR-10 dataset
  • 60000 32x32 colour images in 10 classes, with 6000 images per class
  • 50000 training images and 10000 test images
  • Classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
• Many CNN implementations are available for CIFAR-10
  • cuda-convnet provides a baseline implementation that many works build upon
• On the FPGA: 2-D convolution (5x5) → max-pooling → fully connected
Slide 17: CIFAR-10 CNN: Traditional OpenCL Implementation
• High latency: requires access to global memory
• High memory bandwidth
• Requires host coordination to pass buffers from one kernel to another
• Dataflow: Conv 1 kernel → buffer → Pool 1 kernel → buffer → … → Conv 2 kernel, with every buffer in global memory (DDR)
• Stratix V implementation: processes 183 images per second for CIFAR-10
Slide 18: CIFAR-10 CNN: Kernel-to-Kernel Channels
• Low-latency communication between kernels
• Significantly lower memory bandwidth requirements
• Host is not involved in coordinating communication between kernels
• Only the first and last kernels use global-memory (DDR) buffers; the intermediate kernels (Conv 1 → Pool 1 → … → Conv N) communicate through channels
• Channel declaration — create a queue: value_type channel();
• Channel write — push data into the queue: void write_channel_altera(channel &ch, value_type data);
• Channel read — pop the first element from the queue: value_type read_channel_altera(channel &ch);

  channel int my_channel;
  write_channel_altera(my_channel, x);
  int y = read_channel_altera(my_channel);

• Stratix V implementation: processes 400 images per second for CIFAR-10
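Conceptually, a channel is a blocking FIFO between two concurrently running kernels. The sketch below models that with Python threads and `queue.Queue` (an analogy only — the stage functions are trivial stand-ins for convolution and pooling, and `None` is used as an end-of-stream sentinel, which is my convention, not Altera's):

```python
import queue
import threading

def kernel(fn, in_ch, out_ch):
    """A 'kernel' that drains its input channel, applies fn,
    and pushes results downstream until the sentinel arrives."""
    while True:
        x = in_ch.get()          # blocking read, like read_channel_altera
        if x is None:
            out_ch.put(None)     # forward end-of-stream
            return
        out_ch.put(fn(x))        # blocking write, like write_channel_altera

# Three channels linking producer -> conv -> pool -> consumer
ch1, ch2, ch3 = queue.Queue(), queue.Queue(), queue.Queue()
conv = threading.Thread(target=kernel, args=(lambda x: 2 * x, ch1, ch2))
pool = threading.Thread(target=kernel, args=(lambda x: x + 1, ch2, ch3))
conv.start()
pool.start()

for v in [1, 2, 3]:              # producer feeds the first channel
    ch1.put(v)
ch1.put(None)
conv.join()
pool.join()

results = []
while True:                      # consumer drains the last channel
    v = ch3.get()
    if v is None:
        break
    results.append(v)
# results == [3, 5, 7]
```

As on the slide, the intermediate values never touch a shared "global" buffer and no coordinator hands data between stages: each kernel runs concurrently and synchronizes only through its channels.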
Slide 19: CIFAR-10: FPGA Code
• The entire algorithm can be expressed in ~500 lines of OpenCL for the FPGA
• Kernels are written as standard building blocks that are connected together through channels
• The shift-register convolution building block is the most heavily used portion of this code
• The concept of multiple concurrent kernels executing simultaneously and communicating directly on a device is currently unique to FPGAs
  • Will be portable in OpenCL 2.0 through the concept of "OpenCL Pipes"

  #pragma OPENCL_EXTENSION cl_altera_channel : enable

  // Declaration of Channel API data types
  channel float prod_conv1_channel;
  channel float conv1_pool1_channel;
  channel float pool1_conv2_channel;
  channel float conv2_pool2_channel;
  channel float pool2_conv3_channel;
  channel float conv3_pool3_channel;
  channel float pool3_cons_channel;

  __kernel void convolutional_neural_network_prod(
      int batch_id_begin, int batch_id_end,
      __global const volatile float * restrict input_global)
  {
      for (...) {
          write_channel_altera(prod_conv1_channel, input_global[...]);
          write_channel_altera(prod_pool3_channel, input_global[...]);
      }
  }
Slide 20: Effect of Floating Point FPGAs
• Altera's Arria 10 family of FPGAs introduces DSP blocks with a dedicated floating-point mode
• Each DSP block includes an IEEE 754 single-precision floating-point multiplier and adder
• FPGA logic blocks (lookup tables and registers) are no longer needed to implement floating-point functions
• The massive resource savings allow many more processing pipelines to run simultaneously on the FPGA
• Arria 10 implementation: processes 6800 images per second for CIFAR-10
Slide 21: Lessons Learned
• CNN implementations are well suited to pipelining
• Exploiting pipelining on the FPGA requires some attention to coding style to overcome the assumptions inherent in writing "software"
• FPGAs do not have caches
  • Data reuse must be exploited in a more explicit way
• The concept of dataflow pipelining will not realize its full potential if we write intermediate results to memory
  • Bandwidth limitations begin to dominate compute
  • Use direct kernel-to-kernel communication, called channels, instead
• Native support for floating point on the FPGA allows an order-of-magnitude performance increase
Slide 22: Resources
• Altera's SDK for OpenCL: http://www.altera.com/products/software/opencl/opencl-index.html
• FPGA-optimized OpenCL examples (filters, convolutions, …): http://www.altera.com/support/examples/opencl/opencl.html
• CIFAR-10 dataset: http://www.cs.toronto.edu/~kriz/cifar.html
• cuda-convnet: https://code.google.com/p/cuda-convnet/