Jingyi Jin, Software Architect, Intel Corp.
2
clCaffe*: Unleashing the Power of Intel Graphics for Deep Learning Acceleration
In this presentation, you will hear a story about how Intel graphics can accelerate deep learning applications. The method is simple and reproducible, with speedups of up to 4x over the original CPU performance. We introduce clCaffe*, an extension of the well-known Caffe* framework with the OpenCL™ standard. OpenCL enables the primitives of the convolutional neural network (CNN) pipeline to run on a GPU (graphics processing unit), an FPGA (field programmable gate array), or any device with OpenCL support. Once set up, Caffe users can seamlessly toggle to clCaffe to take advantage of Intel graphics acceleration. Compared with the original CPU path, Intel graphics delivers a 2.5x speedup (AlexNet* classification), or 4.0x (GoogleNet* classification), on 5th or 6th generation Intel® Core™ processors. Finally, we give a detailed analysis of clCaffe performance and identify the missing components in the Intel Graphics software stack that limit its deep learning performance.

  1. Jingyi Jin, Software Architect, Intel Corp.
  2. clCaffe*: Unleashing the Power of Intel Graphics for Deep Learning Acceleration
     Speaker:
     • Jingyi Jin, Ph.D., Software Architect
     • Visual & Parallel Computing Group, Intel Corp.
     Abstract:
     • In this work, I present OpenCL™ acceleration of a well-known deep learning framework, Caffe*, focusing on the convolution layer, which has been optimized with three different approaches: GEMM, spatial domain, and frequency domain. This work, clCaffe, greatly enhances the ability to leverage deep learning use cases on all types of OpenCL™ devices, particularly on small form factor devices, in which discrete GPUs are rare and integrated GPUs are far more common. We present performance results of clCaffe running on Intel Graphics. Our benchmark shows a 4.5x speedup on Intel Graphics, compared to the default CPU implementation in Caffe, for AlexNet* on the ImageNet* 1K-category dataset, or 4.0x (GoogleNet* classification) on 5th or 6th generation Intel® Core™ processors.
     *Other names and brands may be claimed as the property of others.
  3. Agenda
     Background & motivation
     clCaffe* framework
     • Development
     • Optimization
     • Results & use case
     Conclusion & future extension
  4. Neural Network
     [Figure: a perceptron with inputs x1, x2, weights w1, w2, activation f, and output y; from a fully connected NN to a convolutional NN with a shared convolution kernel]
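The perceptron in the figure computes y = f(w1·x1 + w2·x2 + b). A minimal NumPy sketch (the tanh activation, bias, and example values are illustrative assumptions, not from the slide):

```python
import numpy as np

def perceptron(x, w, b=0.0, f=np.tanh):
    """A single perceptron: activation f applied to the weighted sum of inputs."""
    return f(np.dot(w, x) + b)

# Two inputs x1, x2 with weights w1, w2, as in the figure.
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
y = perceptron(x, w)   # tanh(0.5*1.0 - 0.25*2.0) = tanh(0.0) = 0.0
```

A convolutional layer reuses the same small weight set (the convolution kernel) at every spatial position instead of giving every input its own weight.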
  5. Convolutional Neural Network
     [Figure: example AlexNet* topology, a chain of convolution layers (feature extraction) followed by fully connected layers (classification), labeling an input image as "cat"]
     ILSVRC: ImageNet* Large Scale Visual Recognition Challenge
  6. ImageNet* Large Scale Visual Recognition Challenge (ILSVRC)
     [Chart: ILSVRC classification error rate by year: 28% (2010), 26% (2011), 16% (2012, AlexNet*), 12% (2013), 6.60% (2014), 3.57% (2015); human error rate is 5.1%]
  7. Motivation
     Deep learning applications: medical image analysis, augmented reality, video surveillance, autonomous driving, military combat and tracking, image-based search engines.
     Now we build machines which can recognize!
  8. Deep learning race
     LeNet* (1998): digit recognition, 4 layers
     AlexNet* (2012): 16.4% error rate, 8 layers
     VGG-16 (2014): 7.5% error rate, 19 layers
     GoogleNet* (2014): 6.75% error rate, 22 layers
     ResNet* (2015): 3.57% error rate, 152 layers
     Call for Intel: how to best support this burst in compute demand?
  9. Intel's products for deep learning
     [Figure: Intel product line-up mapped to the two phases: training, and scoring/classification]
  10. Example Products with Processor Graphics
      The graphics architecture for many OEM desktop, laptop, 2-in-1, and tablet products: Apple* MacBook* Pro 13″ and 15″, Apple* iMac* 21.5″, Asus* Zenbook* Infinity, Gigabyte* Brix* Pro, Zotac* ZBOX* EI730, Sony* Vaio* Tap 21, JD.com Terran Force, Clevo* Niagara*, Microsoft* Surface Pro* 3, Asus* MeMO Pad* 7, Asus* Transformer Pad*, Lenovo* Miix* 2, Toshiba* Encore* 2 Tablet
  11. Example Chip Level Architecture: Intel® Core™ M
      [Diagram: Intel® Processor Graphics Gen8 (graphics, compute, & media) alongside CPU cores, a shared LLC, and the system agent]
      • Many different processor products, with different processor graphics configurations
      • Multiple CPU cores, shared LLC, system agent
      • Multiple clock domains, to target power where it's needed
      Key takeaway: Intel® Processor Graphics is a valuable compute resource in client platforms, waiting to be unleashed!
  12. Chip Level Architecture: 4 CPU cores & Intel® Iris™ Pro Graphics (48 EUs & eDRAM)
  13. Typical System Design
      Phase I: train on servers (offline tuning of the model's weights; takes weeks or months)
      Phase II: classify on clients (typically real-time)
      Deployment of the model with weights from servers to clients
      Deep learning frameworks: Caffe*, Theano, Torch, TensorFlow, CNTK, …
  14. Caffe*
      Open source framework for CNNs
      Written in C++, with CUDA* for GPU, and with command line, Python*, and MATLAB* interfaces
      Provides a complete toolkit for training, testing, benchmarking, fine-tuning, and deploying models
      Feature highlights:
      – Expressive: build nets through plaintext schemas, not code.
      – Speedy: fast implementations of state-of-the-art modules.
      – Modular: easy extension to new data formats and network layers.
      – Open: common code and reference models for reproducibility.
      – Wide test coverage: every module has an attached unit test.
      – Large community: big developer community and large pool of users.
      caffe.berkeleyvision.org / github.com/BVLC/caffe
  15. Caffe*
      [Stack diagram: CNN Applications → CNN framework (Caffe, C++) → CNN primitives library (cuDNN) → language / math library (CUDA*, cuBLAS, cuFFT; ATLAS, OpenBLAS, MKL BLAS) → HW (CPU, NVIDIA* GPUs)]
  16. clCaffe*
      [Stack diagram: clCaffe* (Caffe* + OpenCL™) adds an OpenCL code path (ViennaCL, clBLAS, ISAAC, …) targeting Intel pGfx and AMD GPUs, alongside the existing C++/CUDA* paths (MKL BLAS, ATLAS, OpenBLAS on CPU; cuBLAS, cuFFT, cuDNN on NVIDIA* GPUs)]
      BLAS: Basic Linear Algebra Subprograms
      OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos
  17. clCaffe* Development
      1. Enabling the OpenCL™ extension to Caffe*
      2. CPU/pGfx memory synchronization
         – Take advantage of the integrated SoC: zero-copy on memory buffers
      3. Implementation of primitive layers
      4. Passing conformance tests
      5. More testing
      6. Performance optimization
  18. clCaffe* initial profiling
      [Chart: convolution dominates the runtime profile.] Optimizing convolution is the key!
      Convolution approaches:
      • GEMM (General Matrix Multiply) based
      • Spatial domain based
      • FFT (Fast Fourier Transform) based
  19. GEMM based convolution
      Flatten the input data and kernels, then solve the convolution as a matrix multiplication problem:
      Step 1: data flattening (im2col)
      Step 2: matrix multiply, usually mapped onto a BLAS (Basic Linear Algebra Subprograms) call
      Step 3: data unflattening (col2im)
      <Image source: http://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/>
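The three steps can be sketched in NumPy for a single-channel input and one kernel (stride 1 and no padding are simplifying assumptions; Caffe's im2col also handles channels, stride, and padding):

```python
import numpy as np

def im2col(x, k):
    """Step 1: flatten each k-by-k patch of a 2-D input into one row."""
    H, W = x.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.empty((out_h * out_w, k * k))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[i:i + k, j:j + k].ravel()
    return cols

def conv_gemm(x, kernel):
    """Steps 2 and 3: one matrix multiply, then reshape back to the output map."""
    k = kernel.shape[0]
    H, W = x.shape
    out = im2col(x, k) @ kernel.ravel()   # the GEMM (a GEMV for a single kernel)
    return out.reshape(H - k + 1, W - k + 1)
```

With many kernels, the flattened kernels form the second matrix and the multiply becomes a true GEMM, which is why a fast BLAS dominates this approach's performance.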
  20. Spatial domain convolution
      Direct application of convolution in the spatial domain: the dot product of the input with the convolution kernel
      <Image source: https://developer.apple.com/library/ios/documentation/Performance/Conceptual/vImage/Art/kernel_convolution.jpg>
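A direct spatial loop for the same single-channel, stride-1 case (a sketch only; the production OpenCL kernels tile and vectorize this, which is what the auto-tuner optimizes):

```python
import numpy as np

def conv_spatial(x, kernel):
    """Direct convolution: dot product of every input window with the kernel.
    No kernel flip, i.e. cross-correlation, as CNN frameworks conventionally do."""
    k = kernel.shape[0]
    H, W = x.shape
    out = np.empty((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * kernel)
    return out
```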
  21. FFT based convolution
      Convert the input into the Fourier domain, where convolution becomes element-wise multiplication, reducing complexity from O(N²K²) to O(N² log₂ N), where N is the data size and K is the kernel size.
      [Figure: the input (227×227) and the kernel (11×11) are zero-padded in the spatial domain to 256×256, transformed by FFT (256×258), multiplied element-wise in the frequency domain, and inverse-transformed to produce the 55×55 output, replacing the spatial-domain sum of element-wise multiplications]
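The frequency-domain path can be sketched with NumPy's FFT; the kernel is flipped so the result matches the cross-correlation convention CNN layers use, and the padding here is the minimal choice for the example rather than the power-of-two padding shown in the figure:

```python
import numpy as np

def conv_fft(x, kernel):
    """Zero-pad, multiply element-wise in the Fourier domain, invert, crop."""
    H, W = x.shape
    k = kernel.shape[0]
    kern = kernel[::-1, ::-1]   # flip: FFT computes true convolution,
                                # CNN layers compute cross-correlation
    prod = np.fft.rfft2(x, (H, W)) * np.fft.rfft2(kern, (H, W))
    out = np.fft.irfft2(prod, (H, W))
    return out[k - 1:, k - 1:]  # keep only the 'valid' region
```

The element-wise product is cheap, but the transforms and extra buffers explain the cons listed on the next slide: the overhead dominates for small kernels or large strides.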
  22. Analysis of Convolution Approaches
      1. GEMM (default)
         Pros:
         • Generic and stable
         • Easy to implement (problem mapped onto a BLAS call)
         • Optimized solution if a good BLAS is provided
         Cons:
         • Additional memory to store the intermediate data
         • Relies heavily on an optimized BLAS
      2. Spatial domain
         Pros:
         • Avoids the additional memory copy
         • Speedy with optimized code
         Cons:
         • Relies on individually optimized kernels for the given parameters, or even the given HW architecture
      3. FFT domain
         Pros:
         • Lower computational complexity
         Cons:
         • Additional memory to save the FFT data
         • Overhead is big for small kernel sizes or large strides
  23. Spatial Convolution Auto-tuning
      • Performed the first time conv is called on the machine, and cached for future use
      • Finds the optimal kernel parameters and instantiates the fastest OpenCL™ kernel
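The tune-once-and-cache pattern described above can be sketched as follows. This is a simplified in-memory sketch with hypothetical names and a hypothetical benchmark interface; clCaffe caches per-machine so that later runs skip the search entirely:

```python
_tuning_cache = {}  # best-known kernel parameters per layer configuration

def tuned_params(config, candidates, benchmark):
    """First call for a config benchmarks every candidate and caches the winner;
    subsequent calls reuse the cached choice without re-benchmarking."""
    if config not in _tuning_cache:
        timings = {c: benchmark(c) for c in candidates}
        _tuning_cache[config] = min(timings, key=timings.get)
    return _tuning_cache[config]
```

In the real system the candidates would be OpenCL work-group and tiling parameters, and the benchmark would time the generated kernel on the actual GPU.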
  24. Hardware Configuration
                 5th Generation Intel® Core™ processor   |  14nm Intel® Atom™ processor
      CPU        Intel® Xeon® CPU E3-1200 v4 @ 3.40GHz   |  Intel® Atom™ x7 processor @ 1.60GHz
      GPU        Intel® Iris™ Pro 6200 w/ 48 EUs         |  Intel® HD Graphics w/ 16 EUs
      OS         CentOS* 7.1, kernel 3.10.0-229          |  Windows* 10
      OpenCL™    OpenCL Linux driver                     |  OpenCL Windows driver
      Copyright © 2016, Intel Corporation. All rights reserved.
  25. clCaffe* on 5th Generation Intel® Core™ Processors: AlexNet* classification
      [Chart, images/sec, higher is better.
       Caffe* on CPU with different BLAS libraries for GEMM convolution: CPU + ATLAS 8, CPU + OpenBLAS 10, CPU + MKL 65.
       clCaffe* on Intel GEN with different convolution approaches: spatial convolution 290, FFT convolution 60, GEMM convolution 89.]
      • AlexNet benchmark: forward only, batch size = 256
      • Experiment system: 5th Gen Intel® Core™ Processor 4+3e with Intel® Iris™ Pro 6200 (GT3e)
  26. clCaffe* on 5th Generation Intel® Core™ Processors: AlexNet* training
      [Chart, images/sec, higher is better.
       Caffe* on CPU with different BLAS libraries for GEMM convolution: CPU + ATLAS 4, CPU + OpenBLAS 5, CPU + MKL 28.
       clCaffe* on Intel GEN with different convolution approaches: spatial convolution 56, FFT convolution 19, GEMM convolution 28.]
      • AlexNet benchmark: batch size = 256
      • Experiment system: 5th Gen Intel® Core™ Processor 4+3e with Intel® Iris™ Pro 6200 (GT3e)
  27. Other Topologies: classification
      [Chart, img/sec, higher is better; clCaffe* on 5th Gen Intel® Core™ Processor GT3e using spatial convolution:
       Overfeat 91, VGG-A 55, GoogLeNet 77, AlexNet 290]
  28. clCaffe* on 14nm Intel® Atom™ processor: AlexNet* classification
      [Chart, img/sec, higher is better.
       Caffe* on CPU with GEMM convolution: CPU + MKL 6.
       clCaffe* on Intel GEN with different convolution approaches: spatial convolution 50, GEMM convolution 17.]
  29. Conclusion
      • Intel not only provides the silicon solution, but also builds a SW ecosystem around its HW for deep learning support
      • clCaffe* is an optimized, user-friendly DL solution on Intel® Processor Graphics
      • clCaffe delivers a 4.5x to 8.3x speedup over the default CPU path on the same system for classification based on AlexNet*
      • Intel® Processor Graphics is a valuable compute resource to be unleashed on client platforms
  30. clCaffe* release status
      Handed over to the Open Source team in Intel
      Externally available: https://github.com/01org/caffe/wiki/clCaffe
      Progressively optimized; further optimization plan:
      • GEMM convolution (for back propagation)
      • Winograd convolution
      Call for trial and open source contribution!
  31. clCaffe*
      [Stack diagram repeated from slide 16: the clCaffe (Caffe* + OpenCL™) code path (ViennaCL, clBLAS, ISAAC, …) on Intel pGfx and AMD GPUs, alongside the existing CPU and CUDA* paths]
  32. Future extension
      [Stack diagram: Caffe* gains an Intel® MKL-DNN primitives path, and the OpenCL™ path (ViennaCL, clBLAS, ISAAC) extends to Intel FPGA in addition to Intel pGfx, alongside the existing CPU and CUDA* paths]
  33. Future extension
      [Stack diagram: beyond Caffe*, other CNN frameworks (TensorFlow*, Torch*, …) sit on Intel® MKL-DNN and the OpenCL™ path (ViennaCL, ISAAC) targeting Intel pGfx and Intel FPGA]
  34. References
      clCaffe*: OpenCL™ accelerated Caffe* for Convolutional Neural Networks. J. Bottleson, S. Kim, J. Andrews, P. Bindu, D. N. Murthy, J. Jin. 25th International Heterogeneity in Computing Workshop, 2016.
      Caffe* OpenCL™ branch: https://github.com/BVLC/caffe/tree/opencl
      clCaffe* wiki: https://github.com/01org/caffe/wiki/clCaffe
      Intel® MKL-DNN tech preview: https://software.intel.com/en-us/articles/deep-neural-network-technical-preview-for-intel-math-kernel-library-intel-mkl
      Intel® Processor Graphics: https://software.intel.com/sites/default/files/Compute%20Architecture%20of%20Intel%20Processor%20Graphics%20Gen8.pdf
  35. Legal Notices and Disclaimers
      Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
      Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
      Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.
      Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
      This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
      No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K. All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos © 2016 Intel Corporation. Intel, the Intel logo, and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
