Published on

http://www.embedded-vision.com/platinum-members/auvizsystems/embedded-vision-training/videos/pages/may-2015-embedded-vision-summit

For more information about embedded vision, please visit:

http://www.embedded-vision.com

Nagesh Gupta, CEO and Founder of Auviz Systems, presents the "Trade-offs in Implementing Deep Neural Networks on FPGAs" tutorial at the May 2015 Embedded Vision Summit.

Video and images are a key part of Internet traffic (think of all the data generated by social networking sites such as Facebook and Instagram), and this trend continues to grow. Extracting usable information from video and images is thus a growing requirement in the data center. For example, object and face recognition are valuable for a wide range of uses, from social applications to security applications. Convolutional neural networks (CNNs) are currently the most popular form of deep neural networks used in data centers for such applications. 3D convolutions are a core part of CNNs. Nagesh presents alternative implementations of 3D convolutions on FPGAs and discusses the trade-offs among them.

Published in:
Technology


- 1. Trade-offs in Implementing Deep Neural Networks on FPGAs. Nagesh Gupta, 12 May 2015. Copyright © 2015 Auviz Systems
- 2. Auviz Systems
  • Startup that specializes in implementing and optimizing algorithms on FPGAs
  • Offers libraries of different classes of algorithms
    • AuvizCV: optimized OpenCV algorithms
    • AuvizLA: optimized BLAS
    • AuvizDNN: optimized deep neural networks
  • Develops custom algorithms in computer vision, linear algebra, deep learning and machine learning
  • Available as OpenCL function calls, which abstract the complexity of using an FPGA for software users
  • Visit our booth and see AlexNet running on a Xilinx FPGA!
- 3. The Time for Artificial Intelligence & Machine Learning
  • Sources: Cisco/Statista, Facebook research, IT Business Edge
- 4. Machine Learning Moving to the Data Center
  • Performance/watt
  • Programming model & use model
  • Microsoft Azure ML: provides machine learning as a service on the cloud
  • IBM Watson at Jeopardy: one of the best demonstrations of machine learning
  • Amazon AWS ML & Google Predictive Analytics: other machine learning services on the cloud
- 5. Convolutional Neural Networks (CNNs)
  • A form of deep neural networks, used for various "recognition" tasks
  • AlexNet [2], the CNN configuration shown below, was used to classify 1.2 million images
- 6. Components of AlexNet: Convolution Layers
  • A convolution layer has multiple stages
  • 3D convolutions
  • Activation: using the ReLU function, Max(x, 0)
  • Max pooling: sub-sampling function that selects the max value within a neighborhood
  [Figure labels: 3D Convolutions, Activation (ReLU), Sub-sampling (Max pooling)]
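
The activation and max-pooling stages on this slide can be sketched in a few lines of plain Python. This is only an illustration: the 4x4 feature map and the 2x2 window with stride 2 are made-up values chosen for readability, not AlexNet's actual pooling configuration, and the helper names `relu` and `max_pool` are hypothetical.

```python
def relu(x):
    """ReLU activation as stated on the slide: Max(x, 0)."""
    return max(x, 0.0)


def max_pool(fmap, size, stride):
    """Select the max value within each size x size neighborhood."""
    rows, cols = len(fmap), len(fmap[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            window = [fmap[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# Illustrative feature map (assumed values, not from the presentation).
fmap = [[-1.0, 2.0, 0.5, -3.0],
        [4.0, -2.0, 1.0, 0.0],
        [0.0, 1.5, -1.0, 2.5],
        [-0.5, 3.0, 2.0, -4.0]]

activated = [[relu(v) for v in row] for row in fmap]
pooled = max_pool(activated, size=2, stride=2)
print(pooled)
```

Both stages are element-wise or window-local, which is why they are cheap next to the 3D convolutions that precede them.
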
- 7. Dense Layers in AlexNet
  • Dense layers are fully connected: each output node is a function of all the input nodes
  • The first 2 dense layers can be represented as a matrix-vector multiplication operation
  • Layer 6 has 9216 inputs, which are multiplied with a weight matrix to create 4096 outputs
  • Layer 7 has 4096 inputs, which are multiplied with a different weight matrix to create 4096 outputs
  • The output layer uses SoftMax to classify the input image into one of 1000 classes
  [Figure labels: Layer 6, Layer 7, Output layer]
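
A dense layer as a matrix-vector product, followed by SoftMax, might look like the sketch below. The 2x3 weight matrix and the input vector are tiny made-up stand-ins for the 4096x9216 and 4096x4096 matrices named on the slide, and `dense` and `softmax` are hypothetical helper names.

```python
import math


def dense(weights, inputs):
    """Fully connected layer: each output is a weighted sum of all inputs."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative sizes only (assumed): 3 inputs, 2 output classes.
weights = [[1.0, 0.0, -1.0],
           [0.5, 0.5, 0.5]]
inputs = [2.0, 1.0, 0.0]

scores = dense(weights, inputs)
probs = softmax(scores)
predicted_class = probs.index(max(probs))
print(scores, probs, predicted_class)
```

Every weight is used exactly once per input vector, which is consistent with the deck's later observation that data transfers dominate in the dense layers.
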
- 8. Implementing 3D Convolutions
  • Sequential implementation
    • Implementation follows the convolution equations
    • Resource utilization will be very low, but the latency at 200 MHz will be 22 s for the 2nd layer
    • High-level synthesis (HLS) can be used to implement this, as shown in [3]
  • Better performance can be obtained by parallelizing the implementation
  [Figure labels: Weight matrices, Input feature maps, Output feature maps]
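
A sequential implementation that follows the convolution equations directly is essentially a six-deep loop nest performing one multiply-accumulate at a time. The sketch below assumes no padding and uses small made-up inputs; the function name `conv3d_sequential` and the data layout (`inputs[n][r][c]`, `weights[m][n][i][j]`) are illustrative choices, not taken from the presentation.

```python
def conv3d_sequential(inputs, weights, stride=1):
    """Convolve N input feature maps with M filters of size N x K x K.

    Returns the M output feature maps and the number of
    multiply-accumulate (MAC) steps performed.
    """
    n_in = len(inputs)
    rows, cols = len(inputs[0]), len(inputs[0][0])
    k = len(weights[0][0])
    out_rows = (rows - k) // stride + 1
    out_cols = (cols - k) // stride + 1
    outputs, macs = [], 0
    for m in range(len(weights)):            # output feature maps
        plane = []
        for r in range(out_rows):            # output rows
            out_row = []
            for c in range(out_cols):        # output columns
                acc = 0.0
                for n in range(n_in):        # input feature maps
                    for i in range(k):       # filter rows
                        for j in range(k):   # filter columns
                            acc += (weights[m][n][i][j]
                                    * inputs[n][r * stride + i][c * stride + j])
                            macs += 1
                out_row.append(acc)
            plane.append(out_row)
        outputs.append(plane)
    return outputs, macs


# Two 3x3 input maps and one 2x2x2 filter of ones (assumed values).
inputs = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
          [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]]
weights = [[[[1.0, 1.0], [1.0, 1.0]],
            [[1.0, 1.0], [1.0, 1.0]]]]

outputs, macs = conv3d_sequential(inputs, weights)
print(outputs, macs)
```

The MAC counter makes the cost of the sequential approach visible: it grows as M x R' x C' x N x K x K, so a real layer runs into hundreds of millions of steps.
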
- 9. Computations vs. Data Transfers in AlexNet
  [Chart: computations and data transfers, log scale from 1.E+00 to 1.E+09]
  • Computation latency, 2nd convolution layer: with 512 single-precision floating-point operations in parallel, the 2nd convolution layer takes 2.2 ms to complete at 200 MHz
  • Data transfer latency, 2nd convolution layer: with 64-bit DDR at 1.3 Gb/s, the single-precision floating-point data fetch latency is around 0.5 ms
  • 3D convolutions require more computations, while data transfers are higher for the dense layers
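
The latency arithmetic on this slide can be reproduced roughly as follows. The operation count (448 * 10^6) and the data volume (728 * 10^3 values) are taken from the backup table on slide 20. Everything else is an assumption made for illustration: that each of the 512 units retires one multiply and one add per cycle, that the pipeline is always full, and that each value is 4 bytes. The function name `compute_latency_s` is hypothetical.

```python
def compute_latency_s(operations, units, ops_per_unit_per_cycle, clock_hz):
    """Ideal compute latency in seconds, assuming the units never stall."""
    cycles = operations / (units * ops_per_unit_per_cycle)
    return cycles / clock_hz


LAYER2_OPERATIONS = 448e6    # from the slide 20 table
LAYER2_VALUES_MOVED = 728e3  # from the slide 20 table
CLOCK_HZ = 200e6             # from this slide
BYTES_PER_VALUE = 4          # single precision

# Assumption: 512 units, each doing a multiply and an add per cycle.
parallel_s = compute_latency_s(LAYER2_OPERATIONS, 512, 2, CLOCK_HZ)

# Effective bandwidth implied by the quoted ~0.5 ms fetch latency.
implied_bandwidth_bytes_per_s = LAYER2_VALUES_MOVED * BYTES_PER_VALUE / 0.5e-3

print(f"compute latency: {parallel_s * 1e3:.2f} ms")
print(f"implied bandwidth: {implied_bandwidth_bytes_per_s / 1e9:.2f} GB/s")
```

Under these assumptions the compute estimate comes out close to the 2.2 ms quoted on the slide, and the 0.5 ms fetch latency corresponds to an effective bandwidth of a little under 6 GB/s. Other readings of "512 operations" would give different numbers.
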
- 10. 3D Convolution: Parallel Implementation
  [Figure: 11x11 weight matrix x 11x11 input feature map = 1 output value]
  • An 11x11 weight matrix with 3 input feature maps requires 121*3 multiplications and 121*3 additions
  • With 363 multiply units and 363 adders, this can be done in 1 cycle
  • The FPGA resources required for each single-precision floating-point operation are 2-5 DSP blocks and 200-400 LUTs
  • Implementing this in parallel will require ~1200 DSPs and ~75000 LUTs
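
The resource estimate can be checked with a short calculation. The sketch below applies the slide's per-operation ranges (2-5 DSP blocks, 200-400 LUTs) once per multiply-add pair; that pairing is an assumption, since the slide does not say how the ~1200 DSP and ~75000 LUT totals were derived.

```python
KERNEL_SIZE = 11
INPUT_FEATURE_MAPS = 3
DSP_PER_UNIT = (2, 5)      # range quoted on the slide
LUT_PER_UNIT = (200, 400)  # range quoted on the slide

# One multiply-add unit per weight: 11 * 11 * 3 = 363.
units = KERNEL_SIZE * KERNEL_SIZE * INPUT_FEATURE_MAPS

dsp_low, dsp_high = units * DSP_PER_UNIT[0], units * DSP_PER_UNIT[1]
lut_low, lut_high = units * LUT_PER_UNIT[0], units * LUT_PER_UNIT[1]

print(f"units: {units}")
print(f"DSP blocks: {dsp_low} to {dsp_high}")
print(f"LUTs: {lut_low} to {lut_high}")
```

The slide's totals fall inside both ranges under this reading, and they cover a single output value per cycle, which is why fully parallel designs exhaust device resources quickly.
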
- 11. Increasing Throughput With Pipelining
  • Pipelining is a hardware concept to achieve higher throughput
  • Helpful with complex multi-cycle operations; works by registering intermediate results
  • Pipeline 3D convolutions on one dimension and parallelize the other
  • For example, convolve the weight matrix with an input feature map in parallel, and pipeline for different feature maps
  • Zhang et al. [3] convolve a set of input feature maps with a set of weight matrices in parallel and pipeline for the size of the input feature map
  [Figure: input feature maps, N x R x C; M weight filters of size N x K x K; output feature maps, M x R' x C'; tile sizes Tm, Tn, Tr, Tc]
- 12. Mapping 3D Convolutions into Matrix Multiplications
  • A simple way is to flatten the feature maps and create an array of feature maps; below is an illustration for the first layer of AlexNet
  • The weight matrices are flattened, and the input feature maps are rearranged so that each column holds the neighborhood required for one convolution
  [Figure: Y (96 x 3025, matrix of output feature maps) = W (96 x 363, matrix of weight coefficients) x X (363 x 3025, matrix of input feature maps), where 363 = 3 x 11 x 11 and 3025 = 55 x 55]
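
The rearrangement described on this slide is commonly called im2col. A minimal sketch on a toy problem is shown below; for the first AlexNet layer the same construction would give a 96 x 363 weight matrix W and a 363 x 3025 input matrix X. The helper names (`im2col`, `flatten_weights`, `matmul`) and the toy data are assumptions for illustration, and no padding is applied.

```python
def im2col(inputs, k, stride=1):
    """Build X: one row per (map, filter row, filter col), one column per output pixel."""
    rows, cols = len(inputs[0]), len(inputs[0][0])
    out_rows = (rows - k) // stride + 1
    out_cols = (cols - k) // stride + 1
    x = []
    for plane in inputs:
        for i in range(k):
            for j in range(k):
                x.append([plane[r * stride + i][c * stride + j]
                          for r in range(out_rows)
                          for c in range(out_cols)])
    return x


def flatten_weights(weights):
    """Build W: one row per filter, flattened in (map, row, col) order."""
    return [[w for plane in filt for row in plane for w in row]
            for filt in weights]


def matmul(a, b):
    """Plain matrix multiplication, Y = A x B."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# Toy problem (assumed values): two 3x3 input maps, one 2x2x2 filter of ones.
inputs = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
          [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]]
weights = [[[[1.0, 1.0], [1.0, 1.0]],
            [[1.0, 1.0], [1.0, 1.0]]]]

x_matrix = im2col(inputs, k=2)
w_matrix = flatten_weights(weights)
y_matrix = matmul(w_matrix, x_matrix)
print(len(w_matrix), len(w_matrix[0]), len(x_matrix), len(x_matrix[0]))
print(y_matrix)
```

Each row of Y is one output feature map in row-major order. The cost of this mapping is that overlapping neighborhoods are duplicated in X, trading extra memory for a regular matrix-multiply structure.
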
- 13. Implementation Challenges
  • A larger number of compute units exhausts the FPGA resources
    • Each compute unit takes a few hundred LUTs and 3-5 DSPs
  • Data must be organized so that the compute units run at full utilization
    • Need to read a lot of data in parallel
    • Data has to be stored on-chip to enable parallel access
  • Routing turns out to be a bigger challenge
  • Proper data organization, architecture and tools are the way to overcome these challenges
  [Chart: bits required per cycle (0 to 80000) vs. parallelism (256, 512, 768)]
- 14. Using Alternate Data Representations
  • Single-precision floating point
    • Uses 32 bits to represent each value
    • Requires more DSPs (3-5) to implement multiply/accumulate
  • Fixed point
    • A 16-bit fixed-point representation would suffice for many applications [4]
    • Stochastic rounding techniques perform similarly to the single-precision floating-point representation [5]
  • Half precision
    • Uses 16 bits to represent data
    • Significant reduction in routing and overall FPGA resources
  • Mixed representation
    • Use a fixed-point or half-precision representation for some layers and a single-precision representation for others
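
The difference between round-to-nearest and stochastic rounding for a fixed-point format can be sketched as follows. This is a simplified illustration of the general idea, not the exact scheme of [5]: saturation to the 16-bit range is ignored, only 2 fractional bits are used so the grid is easy to see, and the function names are hypothetical.

```python
import math
import random


def to_fixed_nearest(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


def to_fixed_stochastic(x, frac_bits, rng):
    """Round x up or down at random, with probability proportional to
    its distance from each neighbor, so the result is unbiased on average."""
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale


rng = random.Random(0)  # fixed seed so the sketch is repeatable
value, frac_bits = 0.3, 2  # grid spacing of 0.25 (assumed for illustration)

nearest = to_fixed_nearest(value, frac_bits)
samples = [to_fixed_stochastic(value, frac_bits, rng) for _ in range(10000)]
mean_of_samples = sum(samples) / len(samples)
print(nearest, sorted(set(samples)), mean_of_samples)
```

Round-to-nearest always maps 0.3 to the same grid point, so the error never averages out. Stochastic rounding lands on either neighbor, and its mean stays close to the original value, which is the intuition behind its use with low-precision arithmetic.
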
- 15. A Complete CNN on the FPGA Using OpenCL
  • OpenCL tools enable software programmers to use the FPGA accelerator without learning hardware methodologies
  • The programmer calls OpenCL functions to accelerate on the FPGA
  [Figure labels: Configure & setup, 3D Convolutions, Dense layers, Softmax]
- 16. Performance of AlexNet on FPGAs
  • FPGAs can achieve an impressive 14 images/sec/Watt, compared to 4 images/sec/Watt for high-end GPUs such as the Tesla K40
- 17. Summary
  • 3D convolutions are a key part of a CNN, and are compute intensive
  • In FPGAs, 3D convolutions can be implemented efficiently with a parallel and pipelined implementation
  • FPGA resources (gates and routing) will be the critical factors in achieving a highly parallel implementation
  • OpenCL implementation tools, such as Xilinx SDAccel, simplify the implementation task and provide a software flow
  • Alternate data representations can be used to reduce the complexity
  • Mixed data representations can simplify the computations without compromising on performance
  • FPGAs are capable of delivering high performance at a suitable power profile for the data center
- 18. References
  • [1] K. Ovtcharov et al., "Accelerating Deep Convolutional Neural Networks Using Specialized Hardware," Microsoft Research, 2015.
  • [2] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems, 2012.
  • [3] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao and J. Cong, "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks," FPGA'2015, 2015.
  • [4] C. Farabet, Y. LeCun, K. Kavukcuoglu, E. Culurciello, B. Martini, P. Akselrod and S. Talay, "Large-scale FPGA-based convolutional networks," in Machine Learning on Very Large Data Sets, 2011.
  • [5] S. Gupta, A. Agrawal, K. Gopalakrishnan and P. Narayanan, "Deep Learning with Limited Numerical Precision," arXiv preprint arXiv:1502.02551, 2015.
- 19. Deep Neural Networks in FPGAs. Nagesh Gupta, 12 May 2015
- 20. Computations vs. Data Transfers

  Convolution layers:

  | Input size | Input feature maps | Output feature maps | Filter size | Computations | Total data transfer |
  |------------|--------------------|---------------------|-------------|--------------|---------------------|
  | 224 x 224  | 3                  | 96                  | 11x11       | 110 * 10^6   | 255 * 10^3          |
  | 27 x 27    | 96                 | 256                 | 5x5         | 448 * 10^6   | 728 * 10^3          |
  | 13 x 13    | 256                | 384                 | 3x3         | 150 * 10^6   | 993 * 10^3          |
  | 13 x 13    | 384                | 384                 | 3x3         | 224 * 10^6   | 1457 * 10^3         |
  | 13 x 13    | 384                | 256                 | 3x3         | 150 * 10^6   | 959 * 10^3          |

  Dense layers:

  | Input data | Weight matrix | Computations | Data transfers |
  |------------|---------------|--------------|----------------|
  | 9216       | 9216 x 4096   | 38 * 10^6    | 38 * 10^6      |
  | 4096       | 4096 x 4096   | 16 * 10^6    | 16 * 10^6      |
  | 4096       | 4096 x 1000   | 4 * 10^6     | 4 * 10^6       |
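
The computation counts in the table on slide 20 can be approximated by counting multiply-accumulates per layer. The sketch below assumes the output sizes 55, 27, 13, 13 and 13 (the table lists input sizes, and only 55 x 55 is stated elsewhere in the deck, so the rest are assumptions based on the usual AlexNet configuration), and it ignores the two-group split used in the original AlexNet.

```python
# (output size, input feature maps, output feature maps, filter size)
# Output sizes other than 55 are assumptions, not values from the table.
CONV_LAYERS = [
    (55, 3, 96, 11),
    (27, 96, 256, 5),
    (13, 256, 384, 3),
    (13, 384, 384, 3),
    (13, 384, 256, 3),
]

# (inputs, outputs) for the three dense layers, as listed in the table.
DENSE_LAYERS = [(9216, 4096), (4096, 4096), (4096, 1000)]


def conv_macs(out_size, in_maps, out_maps, k):
    """One multiply-accumulate per weight per output pixel."""
    return out_size * out_size * out_maps * in_maps * k * k


conv_counts = [conv_macs(*layer) for layer in CONV_LAYERS]
dense_counts = [n_in * n_out for n_in, n_out in DENSE_LAYERS]

for count in conv_counts:
    print(f"conv:  {count / 1e6:7.1f} * 10^6")
for count in dense_counts:
    print(f"dense: {count / 1e6:7.1f} * 10^6")
```

Under these assumptions the counts land close to the table's figures (the first convolution layer comes out near 105 * 10^6 rather than 110 * 10^6). For the dense layers the computation count equals the number of weights, which matches the table's identical computation and data-transfer columns.
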
