Optimizing Machine Learning Workloads on Intel® Platforms
Colfax International — colfaxresearch.com
November 2016
Disclaimer
While best efforts have been used in preparing this training, Colfax International makes no
representations or warranties of any kind and assumes no liabilities of any kind with respect to
the accuracy or completeness of the contents and specifically disclaims any implied warranties
of merchantability or fitness for a particular purpose. The publisher shall not be held
liable or responsible to any person or entity with respect to any loss or incidental or
consequential damages caused, or alleged to have been caused, directly or indirectly, by the
information or programs contained herein. No warranty may be created or extended by sales
representatives or written sales materials.
Colfax Research
http://colfaxresearch.com/
§2. Code Modernization
What is Code Modernization?
Code Modernization: optimizing software to better utilize the features available in modern computer architectures.
▷ Scalar Tuning: what goes on in the pipeline?
▷ Threading: do cores cooperate efficiently?
▷ Vectorization: is SIMD parallelism used well?
▷ Memory: is cache usage maximized or RAM access streamlined?
▷ Communication: can coordination in a distributed or heterogeneous system be improved?
Case Study: VGG-Net on Torch
[Bar chart: "Optimization of NeuralTalk2" (colfaxresearch.com), throughput in images/s after successive optimization steps: Original, Intel Compiler +MKL, Middleware Changes, User Code Changes, Parallel Strategy, MCDRAM as Cache.
▷ Intel® Xeon® processor E5-2650 v4 (2 sockets): 0.91, 1.5, 11, 15, 25 images/s
▷ Intel® Xeon Phi™ processor 7210 (KNL): 5.7, 10, 21, 28 images/s
Speedups annotated on the chart: 28x (Xeon) and 55x (Xeon Phi). Details: Colfax Research summary paper.]
Intel Python Performance
Relative performance of Intel Python on an Intel Xeon Phi (Knights Landing) processor, N = 5000 (CPython + SciPy = 1.0):

Benchmark                        CPython, SciPy   CPython, NumPy   Intel Python, SciPy
LU Decomposition                      1.0              3.5               29.0
Cholesky Decomposition                1.0              3.6               17.0
Singular Value Decomposition          1.0              1.1                8.3
DGEMM                                 1.0              7.0              154.0

Portal: software.intel.com/intel-distribution-for-python. See also: CR paper.
Three Approaches
High Level Approach: use high-level libraries that are pre-optimized for modern architectures.
▷ IntelCaffe, TensorFlow, Scikit-learn, etc.

Low Level Approach: apply code modernization techniques to frameworks/applications.
▷ Colfax Research website, HOW series, Intel Modern Code page, etc.

Middle Ground Approach: integrate pre-optimized kernels into frameworks/applications.
▷ Intel® MKL DNN primitives, Intel® DAAL, Intel® MKL-DNN, etc.
§3. The High Level Approach
Intel Libraries for Machine Learning
Caffe forward/backward performance, BVLC Caffe vs. IntelCaffe (mini-batch 64):

LeNet (CIFAR-10), k-images/s:
▷ Intel Xeon Phi processor: 0.15k (BVLC) vs. 13.27k (Intel)
▷ Broadwell Intel Xeon processor: 0.75k (BVLC) vs. 25.16k (Intel)

VGG-16 (ImageNet), images/s:
▷ Intel Xeon Phi processor: 0.91 (BVLC) vs. 54.40 (Intel)
▷ Broadwell Intel Xeon processor: 3.82 (BVLC) vs. 28.57 (Intel)
References for Intel Machine Learning Libraries
▷ Intel MKL (https://software.intel.com/en-us/intel-mkl)
▷ Intel® MKL-DNN (https://github.com/01org/MKL-DNN)
▷ IntelCaffe (https://github.com/intel/caffe)
▷ Intel Theano (https://github.com/intel/theano)
▷ Intel DAAL (https://software.intel.com/en-us/intel-daal)
▷ Intel Torch (https://github.com/xhzhao/Optimized-Torch)
▷ Intel Python (https://software.intel.com/en-us/intel-distribution-for-python)
  • Scikit-learn, NumPy, SciPy, etc.
▷ And more coming...
  • TensorFlow, CNTK, etc.
Intel Distribution for Python
[Stack diagram: packages such as SciPy and Caffe → Intel Distribution for Python → Intel Math Kernel Library and Intel DAAL.]

Portal: software.intel.com/intel-distribution-for-python. See also: CR paper.
§4. Low Level Approach
Optimization Areas
▷ Scalar Tuning: what goes on in the pipeline?
▷ Threading: do cores cooperate efficiently?
▷ Vectorization: is SIMD parallelism used well?
▷ Memory: is cache usage maximized or RAM access streamlined?
▷ Communication: can coordination in a distributed or heterogeneous system be improved?
Case Study: VGG-Net on Torch
[Bar chart repeated from §2 (Code Modernization): "Optimization of NeuralTalk2" throughput (images/s) across the same optimization steps, reaching 25 images/s on the Intel® Xeon® processor E5-2650 v4 (2 sockets) and 28 images/s on the Intel® Xeon Phi™ processor 7210 (KNL); annotated speedups of 28x and 55x. Details: Colfax Research summary paper.]
Base Torch Performance
[Plot: computed performance (images/s) of the baseline Torch run vs. batch count (10 to 60 images), 64 threads.]

By layer:
▷ ReLU: 66%
▷ Conv: 30%
▷ MaxPool: 3%
▷ Other: <1%
Performance After ReLU Optimization
[Plot: performance (images/s) vs. batch count (10 to 60 images), 64 threads, for the original code and the ReLU-optimized code. The ReLU optimization gave roughly a 160x boost to that layer.]

By layer:
▷ ReLU: 1%
▷ Conv: 85%
▷ MaxPool: 11%
▷ Other: 3%
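The ReLU fix in this case study amounted to replacing an inefficient element-wise implementation with a tight, threaded, vectorizable loop. Below is a minimal sketch of that kind of kernel, not the actual Torch patch; the function name, in-place float layout, and use of OpenMP are illustrative assumptions.

#include <cstddef>

// Hypothetical in-place forward ReLU over a contiguous float buffer.
// The loop is branch-free and unit-stride, so it threads across cores with
// OpenMP and auto-vectorizes on the SIMD lanes of Xeon and Xeon Phi.
void relu_forward_inplace(float* data, std::size_t n) {
#pragma omp parallel for simd
  for (std::size_t i = 0; i < n; ++i)
    data[i] = (data[i] > 0.0f) ? data[i] : 0.0f;
}

Compiled with -qopenmp (Intel) or -fopenmp (GCC), a loop like this is typically limited only by memory bandwidth.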
FALCON paper
https://colfaxresearch.com/falcon-library/
Learn More
Colfax Research
http://colfaxresearch.com/
→ HowSeries.com
§5. The Middle Ground Approach
Intel MKL and Intel MKL-DNN
slide credit: Intel Corp.
Stand-alone Example: Convolution
// Creating the MKL DNN convolution primitive
dnnPrimitive_t convFwd;
dnnConvolutionCreateForward_F32(&convFwd, NULL, dnnAlgorithmConvolutionDirect,
                                dim, input_dims, output_dims, filter_dims,
                                conv_strides, padding, dnnBorderZeros);

// Creating the needed data buffers (resource array)
void* conv_res[dnnResourceNumber];
conv_res[dnnResourceSrc]    = (void*) input;
conv_res[dnnResourceFilter] = (void*) filter;
conv_res[dnnResourceDst]    = (void*) output;

// Execute the workload
dnnExecute_F32(convFwd, conv_res);

For more: Intel MKL documentation on DNN primitives
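In a real application the return codes should be checked and the primitive released once it is no longer needed. A minimal sketch, assuming the MKL 2017 DNN primitives API (mkl_dnn.h) and the same variables as above:

// Check the status of primitive creation and execution, then clean up.
dnnError_t err = dnnConvolutionCreateForward_F32(&convFwd, NULL,
    dnnAlgorithmConvolutionDirect, dim, input_dims, output_dims, filter_dims,
    conv_strides, padding, dnnBorderZeros);
if (err != E_SUCCESS) { /* handle the error */ }

err = dnnExecute_F32(convFwd, conv_res);
if (err != E_SUCCESS) { /* handle the error */ }

dnnDelete_F32(convFwd);  // release the primitive when done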
Example Integration: IntelCaffe
GitHub link: https://github.com/intel/caffe/
Example layer implementations: caffe/src/caffe/layers/mkl_*.cpp

// Grabbing parameters from the Caffe layer
PoolingParameter pool_param = this->layer_param_.pooling_param();
channels_ = bottom[0]->channels();
height_   = bottom[0]->height();
width_    = bottom[0]->width();
num_      = bottom[0]->num();
// ... //
kernel_h_ = pool_param.kernel_h();  kernel_w_ = pool_param.kernel_w();
// ..... //

// Creating the math kernel from these parameters
status = dnnPoolingCreateForward<Dtype>( /* ... */ );
§6. Distributed Memory Computation
"FLOPs Are Cheap"?
Theoretical estimates for a two-socket Intel® Xeon® E5-2697 v3 system:

Performance        = 28 cores × 2.7 GHz × 4 vector lanes (256 bit / 64 bit) × 2 (FMA) × 2 FPUs ≈ 1.2 TFLOP/s
Required data rate = 1.2 TFLOP/s × 8 bytes ≈ 10 TB/s
OPA max bandwidth  = 12.5 GB/s ≈ 0.01 TB/s
Ratio              = 10 / 0.01 ≈ 1000 FLOPs per data element transferred

To put it briefly, the difficulty of distributed computation: in the time it takes to transfer one data element over the fabric, the processors can perform on the order of a thousand operations on it.
Distributed Computation for Neural Networks
[Diagram comparing two parallelization schemes across two nodes.
▷ Data parallel: each node runs its own Forward, Backward, Loss, and Update steps on its share of the data, and the gradients are gathered across nodes. The gradient is transferred, but not the data.
▷ Model parallel: the model is split across nodes, which exchange partial results during the Forward and Backward steps. The data is transferred, but not the gradient.]

A minimal sketch of the data-parallel exchange follows.
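As an illustration only (not part of the original slides), one data-parallel iteration can be sketched with plain MPI as below; the function name, learning_rate, and the assumption that every rank holds gradient and weight vectors of the same length are placeholders, with rank 0 acting as the master.

#include <mpi.h>
#include <vector>
#include <cstddef>

// One data-parallel iteration: every rank has already computed local_grad on
// its own mini-batch; rank 0 averages the gradients and applies an SGD step,
// then the updated weights are broadcast back to all ranks.
void data_parallel_step(std::vector<float>& weights,
                        std::vector<float>& local_grad,
                        float learning_rate, MPI_Comm comm) {
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  const int n = static_cast<int>(weights.size());

  // Gather all local gradients on the master rank: the gradient is
  // transferred between nodes, but the training data is not.
  std::vector<float> all_grads;
  if (rank == 0) all_grads.resize(static_cast<std::size_t>(n) * size);
  MPI_Gather(local_grad.data(), n, MPI_FLOAT,
             all_grads.data(), n, MPI_FLOAT, 0, comm);

  // Master averages the gradients and updates the weights.
  if (rank == 0) {
    for (int i = 0; i < n; ++i) {
      float g = 0.0f;
      for (int r = 0; r < size; ++r) g += all_grads[r * n + i];
      weights[i] -= learning_rate * g / size;
    }
  }

  // Broadcast the updated weights so every rank starts the next
  // iteration from the same model.
  MPI_Bcast(weights.data(), n, MPI_FLOAT, 0, comm);
}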
Caffe Scaling
[Figure: Caffe multi-node scaling. Source: Intel® Corporation, "Caffe* Training on Multi-node Distributed-memory Systems Based on Intel® Xeon® Processor E5 Family".]
Machine Learning Framework: Intel® DAAL
Algorithms in DAAL
Analysis
- Low Order Moments
- Quantile
- Correlation and Variance
- Cosine Distance Matrix
- Correlation Distance Matrix
- K-Means Clustering
- Principal Component Analysis
- Cholesky Decomposition
- Singular Value Decomposition
- QR Decomposition
- Expectation-Maximization
- Multivariate Outlier Detection
- Univariate Outlier Detection
- Association Rules
- Kernel Functions
- Quality Metrics

Training & prediction
- Regression
  - Linear/Ridge Regression
- Classification
  - Naive Bayes Classifier
  - Boosting
  - SVM
  - Multi-Class Classifier
- Neural Networks
Portal: DAAL page. See also: intro article, CR papers.
Computation Modes in DAAL

[Diagram: the three DAAL processing modes. Batch Mode: the full data set is processed in a single full computation that produces the final result. Online Mode: blocks of data arrive one at a time and are folded into a running computation before the final result is produced. Distributed Mode: each node performs a partial computation on its own data set, and the partial results are combined into the final result.]

Portal: DAAL page. See also: intro article, CR papers.
Communication Framework: MPI
Structure of MPI Applications: Hello World
#include "mpi.h"
#include <cstdio>
int main (int argc, char *argv[]) {
  MPI_Init (&argc, &argv);                  // Initialize MPI environment
  int rank, size, namelen;
  char name[MPI_MAX_PROCESSOR_NAME];
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);    // ID of current process
  MPI_Get_processor_name (name, &namelen);  // Hostname of node
  MPI_Comm_size (MPI_COMM_WORLD, &size);    // Number of processes
  printf ("Hello World from rank %d running on %s!\n", rank, name);
  if (rank == 0) printf("MPI World size = %d processes\n", size);
  MPI_Finalize ();                          // Terminate MPI environment
}
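With Intel MPI, an example like this would typically be compiled with the mpiicpc wrapper and launched with mpirun (for example, mpirun -np 4 ./hello); the exact commands depend on the MPI installation.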
Collective Communication: Gather
int MPI_Gather(void *sendbuf, int sendcnt, MPI_Datatype sendtype,
               void *recvbuf, int recvcnt, MPI_Datatype recvtype,
               int root, MPI_Comm comm);

[Diagram: several sender processes each hold a piece of data; MPI_Gather collects all pieces onto the single receiver (root) process.]
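In the distributed DAAL example later in this section, each rank serializes its partial result into a byte buffer that the master collects with MPI_Gather. A minimal sketch of that collection, assuming each rank has padded its buffer send_buf to a common length buf_len (rank, size, send_buf, and buf_len are assumptions taken from the surrounding context):

char* recv_buf = NULL;
if (rank == 0)                                   // only the root needs receive space
  recv_buf = new char[(size_t)buf_len * size];
MPI_Gather(send_buf, buf_len, MPI_CHAR,          // every rank sends buf_len bytes
           recv_buf, buf_len, MPI_CHAR,          // root receives buf_len bytes per rank
           0, MPI_COMM_WORLD);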
Collective Communication: Broadcast
int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
              int root, MPI_Comm comm);

[Diagram: one sender process holds the data; MPI_Bcast copies it to every receiver process.]
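The updated model travels in the opposite direction: every rank calls MPI_Bcast, and the root's buffer is copied to all other ranks. A minimal sketch, with wb_buf and wb_len (the serialized weights and their length in bytes) assumed from context:

// Rank 0 sends the contents of wb_buf; every other rank receives into it.
MPI_Bcast(wb_buf, wb_len, MPI_CHAR, 0, MPI_COMM_WORLD);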
Implementation
Example Distributed Image Processing: DAAL
▷ Algorithm <step1Local> is responsible for the forward/backward propagation.

training::Distributed<step1Local> local_net;     // local net algorithm
local_net.compute();                             // forward/backward
part_res = local_net.getPartialResult();         // getting partial result
local_net.input.get(training::inputModel)
    ->setWeightsAndBiases(wb);                   // Update the weights/bias

▷ Algorithm <step2Master> is responsible for accumulating the gradient.

training::Distributed<step2Master> master_net;   // master net algorithm
master_net.input.add(training::partialResults,   // Add partial result
                     0, part_res);
master_net.compute();                            // Accumulate gradients
wbModel = master_net.getPartialResult()          // Get current model
    ->get(training::resultFromMaster)
    ->get(training::model);
wb = wbModel->getWeightsAndBiases();             // Extract weights/bias
Example Distributed Image Processing (Part 1)
// Computation part of the node with the master net
// Local forward and backward propagation
local_net.compute();
part_res[master_node_id] = local_net.getPartialResult();

// ... Code to store the result into a buffer (char *) ... //

// Send the result to the master node
MPI_Gather(....);

// ... Code to reconstruct the partial result from the buffer ... //

// Accumulate the partial results from all nodes
for (int i = 0; i < num_nodes; i++)
  master_net.input.add(training::partialResults, i, part_res[i]);
master_net.compute();
Example Distributed Image Processing (Part 2)
// ... Continuing on the master compute ... //

// Extract the weights/bias from the master net
training::ModelPtr wbModel = master_net.getPartialResult()
    ->get(training::resultFromMaster)
    ->get(training::model);
NumericTablePtr wb = wbModel->getWeightsAndBiases();

// ... Code to store weights/bias into a buffer (char*) ... //

// Broadcast the weights/bias to all nodes //
MPI_Bcast(.....);

// ... Code to reconstruct the weights/bias from buffer ... //

// Update the weights on the local node
local_net.input.get(training::inputModel)->setWeightsAndBiases(wb);
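The elided "store into a buffer" and "reconstruct from the buffer" steps in Parts 1 and 2 can be implemented with DAAL's serialization archives. This is a minimal sketch, assuming the daal::data_management archive classes behave as in Intel's DAAL distributed-processing samples; exact type names may differ between DAAL versions.

// Serialize a partial result into a byte buffer before the MPI transfer.
daal::data_management::InputDataArchive in_arch;
part_res[i]->serialize(in_arch);
size_t length = in_arch.getSizeOfArchive();
daal::byte* buffer = new daal::byte[length];
in_arch.copyArchiveToArray(buffer, length);

// After the transfer, reconstruct the object from the received bytes
// (deserializing into a partial result object constructed beforehand).
daal::data_management::OutputDataArchive out_arch(buffer, length);
part_res[i]->deserialize(out_arch);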
Parallel Efficiency
[Plot: speedup of distributed LeNet training vs. number of nodes (1 to 4), compared against theoretical linear scaling. Parallel efficiency: 93% on 2 nodes, 91% on 3 nodes, 87% on 4 nodes.]

Further performance optimizations and model parallelism are coming soon...
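For reference, the parallel efficiency plotted here is presumably the measured speedup divided by the ideal linear speedup, E(N) = T(1) / (N × T(N)) for N nodes, so 87% on 4 nodes corresponds to a speedup of roughly 3.5x.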
§7. Final Words
Colfax Research
http://colfaxresearch.com/
Thank you for your attention!
Join us at Booth #2407 at SC16!