SlideShare a Scribd company logo
1 of 82
Download to read offline
1
HOW Series: Knights Landing
Hands-On Workshop on Performance Optimization for
Intel® Xeon Phi™
Processor Family x200
Colfax International — @colfaxintl
September 2016 - Rev. 3.0
colfaxresearch.com/knl-webinar Welcome © Colfax International, 2013–2016
2
About This Document
This document represents the ma-
terials of a Web-based training “In-
troduction to 2nd Generation Intel®
Xeon Phi™
Processors: Developer’s
Guide to Knights Landing” developed
and run by Colfax International.
© Colfax International, 2013–2016
colfaxresearch.com/knl-webinar/
colfaxresearch.com/knl-webinar About This Document © Colfax International, 2013–2016
3
Disclaimer
While best efforts have been used in preparing this training, Colfax International makes no
representations or warranties of any kind and assumes no liabilities of any kind with respect to
the accuracy or completeness of the contents and specifically disclaims any implied warranties
of merchantability or fitness of use for a particular purpose. The publisher shall not be held
liable or responsible to any person or entity with respect to any loss or incidental or
consequential damages caused, or alleged to have been caused, directly or indirectly, by the
information or programs contained herein. No warranty may be created or extended by sales
representatives or written sales materials.
colfaxresearch.com/knl-webinar About This Document © Colfax International, 2013–2016
4
Resources
colfaxresearch.com/knl-webinar Resources © Colfax International, 2013–2016
5
Colfax Research
http://colfaxresearch.com/
colfaxresearch.com/knl-webinar Resources © Colfax International, 2013–2016
6
HOW Series
Interested? Sign-up at:
colfaxresearch.com/how-series
colfaxresearch.com/knl-webinar Resources © Colfax International, 2013–2016
7
Developer Access Program (DAP)
Can’t wait to get your hands on Knights Landing?
Find out more at dap.xeonphi.com
or contact us at dap@colfax-intl.com.
colfaxresearch.com/knl-webinar Resources © Colfax International, 2013–2016
8
§2. Intel Architecture: Today and
Tomorrow
colfaxresearch.com/knl-webinar Intel Architecture: Today and Tomorrow © Colfax International, 2013–2016
9
Intel Architecture
colfaxresearch.com/knl-webinar Intel Architecture: Today and Tomorrow © Colfax International, 2013–2016
10
Servers with Intel Xeon Phi Processors
Bootable Host Processor
RHEL/CentOS/SUSE/Win
64 cores × 4 HT, 1.3 GHz
≤ 384 GiB DDR4, > 90 GB/s
16 GiB HBM, > 400 GB/s
PCIe bus for network
4 systems in 2U chassis
dap.xeonphi.com
colfaxresearch.com/knl-webinar Intel Architecture: Today and Tomorrow © Colfax International, 2013–2016
11
Workstations with Intel Xeon Phi Processors
Bootable Host Processor
RHEL/CentOS/SUSE/Win
64 cores × 4 HT, 1.3 GHz
≤ 384 GiB DDR4, > 90 GB/s
16 GiB HBM, > 400 GB/s
PCIe bus for add-in cards
Liquid- or air-cooled
dap.xeonphi.com
colfaxresearch.com/knl-webinar Intel Architecture: Today and Tomorrow © Colfax International, 2013–2016
12
Future Form-Factors
KNLF: KNL with Fabric
Fabric integrated on CPU
Intel OmniPath Architecture
Socket mount processor
*KNC image
KNL Coprocessor
PCIe add-in card
Requires host
Multiple KNLs in a system
colfaxresearch.com/knl-webinar Intel Architecture: Today and Tomorrow © Colfax International, 2013–2016
13
§3. Out-of-the-Box Performance
colfaxresearch.com/knl-webinar Out-of-the-Box Performance © Colfax International, 2013–2016
14
Intel® MKL
Intel® Math Kernel Library (MKL) — standard mathematical functions
optimized for Intel architecture.
BLAS
LAPACK
Sparse solvers
ScaLAPACK
Linear Algebra
Fast Fourier
Transform
Vector Math
Vector Random
Number Generators
Summary
Statistics
Data Fitting
Multidimentional
(up to 7D)
FFTW interfaces
Cluster FFT
Trigonometric
Hyperbolic
Exponential
Logarithmic
Power/Root
Rounding
Congruential
Recursive
Wichmann-Hill
Mersenne Twister
Sobol
Neiderreiter
Non-deterministic
Kurtosis
Variation
coefficitent
Quantiles, order
statistics
Min/max
Variance-
covariance
Splines
Interpolation
Cell search
colfaxresearch.com/knl-webinar Out-of-the-Box Performance © Colfax International, 2013–2016
15
DGEMM in MKL
C/C++
1 #include <mkl.h>
2 ...
3 cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
4 n, n, n, 1.0, A, n, B, n, 0.0, C, n);
Python
from scipy.linalg.blas import dgemm as dgemm
...
C=dgemm(alpha=1.0, a=A, b=B, overwrite_c=1, trans_b=1)
R
require(Matrix)
...
c <- a %*% b
colfaxresearch.com/knl-webinar Out-of-the-Box Performance © Colfax International, 2013–2016
16
Python Benchmark
N=500 1000 1200 1500 2000 3000 4000 5000
0
500
1000
1500
2000
Performance(GFLOP/s)
10 11 11 11 12 12 12 1260 69 71 72 77 75 73 83
280
701 745
968
1270
1530
1730
1850
colfaxresearch.com
DGEMM on Knights Landing Processors
CPython, SciPy
CPython, NumPy
Intel Python, SciPy
For more information on KNL with Intel Python, see this paper
colfaxresearch.com/knl-webinar Out-of-the-Box Performance © Colfax International, 2013–2016
17
§4. Importance of Code Optimization
colfaxresearch.com/knl-webinar Importance of Code Optimization © Colfax International, 2013–2016
18
Python Benchmark
What we did not tell you in the previous slide...
LU
Decomposition
Cholesky
Decomposition
Singular Value
Decomposition
DGEMM
0
20
40
60
80
100
120
140
160
180
RelativePerformance
1.0 1.0 1.0 1.03.5 3.6 1.1
7.0
29.0
17.0
8.3
154.0
colfaxresearch.com
Intel Python on Knights Landing Processors (N=5000)
CPython, SciPy
CPython, NumPy
Intel Python, SciPy
It takes good software to unlock the performance of KNL!
colfaxresearch.com/knl-webinar Importance of Code Optimization © Colfax International, 2013–2016
19
Optimization Areas
Scalar Tuning
what goes on in the pipeline?
Threading
do cores cooperate efficiently?
Vectorization
is SIMD parallelism used well?
Memory
is cache usage maximized or
RAM access streamlined?
Communication
can coordination in a distributed or
heterogeneous system be improved?
colfaxresearch.com/knl-webinar Importance of Code Optimization © Colfax International, 2013–2016
20
Application
1
Astrophysics:
planetary systems
galaxies
cosmological structures
2
Electrostatic systems:
molecules
crystals
This work: “toy model” with all-to-all
O n2
algorithm. Practical N-body sim-
ulations may use tree algorithms with
O nlogn complexity.
Source: APOD, credit: Debra Meloy
Elmegreen (Vassar College) et al., & the
Hubble Heritage Team (AURA/ STScI/ NASA)
colfaxresearch.com/knl-webinar Importance of Code Optimization © Colfax International, 2013–2016
21
N-body Simulation on Intel Architecture
Details in Chapter 23 of this book
Step 0:
Initial
Step 1:
Multi-threaded
Step 2:
Scalar Tuning
Step 3:
Vectorized w/SoA
Step 4:
Memory Optimization
0
500
1000
1500
2000
2500
3000
Performance,GFLOP/s
5.9
160 261
918 976
0.8
124 196 288
1712
2.1
316
443
1612
2875
N-Body Simulation Performance
Intel Xeon E5-2697 v3 (Haswell)
Intel Xeon Phi 7120A (1st gen, KNC)
Intel Xeon Phi 7220 (2nd gen, KNL)
The best way to prepare your code for KNL: optimize for Xeon or KNC
colfaxresearch.com/knl-webinar Importance of Code Optimization © Colfax International, 2013–2016
22
§5. Details of Architecture
colfaxresearch.com/knl-webinar Details of Architecture © Colfax International, 2013–2016
23
Cores in Intel Xeon Phi Processors
colfaxresearch.com/knl-webinar Cores in Intel Xeon Phi Processors © Colfax International, 2013–2016
24
KNL Die Organization: Tiles
Up to 36 tiles, each with 2 physical cores (72 total).
Distributed L2 cache across a mesh interconnect.
CORE
L2
CORE CORE
L2
CORE CORE
L2
CORE CORE
L2
CORE CORE
L2
CORE
DDR4CONTROLLER
DDR4CONTROLLER
CORE
L2
CORE
CORE
L2
CORE
CORE
L2
CORE
CORE
L2
CORE
CORE
L2
CORE
CORE
L2
CORE
CORE
L2
CORE CORE
L2
CORECORE
L2
CORE CORE
L2
CORE
CORE
L2
CORE
CORE
L2
CORE
CORE
L2
CORE CORE
L2
CORE
MCDRAM PCIe
≤384GiBsystemDDR4,~90GB/s
CORE
L2
CORE
≤ 16 GiB on-package MCDRAM, ~ 400 GB/s
MCDRAM
MCDRAM MCDRAM
CORE
L2
CORE CORE
L2
CORE
CORE
L2
CORECORE
L2
CORE
...
36 TILES
72 CORES
colfaxresearch.com/knl-webinar Cores in Intel Xeon Phi Processors © Colfax International, 2013–2016
25
KNL Cores
4-way hyper-threading (up to 4×72 = 288 logical processors)
L2 cache shared by 2 cores on a tile.
I-cache
Decode
+ Retire
Vector
ALU
Vector
ALU
Legacy
32 KiB L1 D-cache
1 MiB
L2
D-cache
I-cache
Decode
+ Retire
Vector
ALU
Vector
ALU
Legacy
32 KiB L1 D-cache
CORE CORETILE
to Mesh
HW+ SW
prefetch
HW+ SW
prefetch
Out-of-
order
Out-of-
order
colfaxresearch.com/knl-webinar Cores in Intel Xeon Phi Processors © Colfax International, 2013–2016
26
More Forgiving Cores than First Generation (KNC)
Based on Intel®
Atom cores (Silvermont microarchitecture)
Out-of-order cores:
Better latency masking from long latency operations
Back-to-back instructions from single thread:
Thread count requirement reduced to ≈ 70 (from ≈ 120 for KNC)
Advanced branch prediction:
Fewer cycles wasted to branch misprediction
Generally more forgiving to non-optimized code.
colfaxresearch.com/knl-webinar Cores in Intel Xeon Phi Processors © Colfax International, 2013–2016
27
Vector Instruction Support
colfaxresearch.com/knl-webinar Vector Instruction Support © Colfax International, 2013–2016
28
Short Vector Support
Vector instructions – one of the implementations of SIMD (Single
Instruction Multiple Data) parallelism.
+ =
+ =
+ =
+ =
+ =
VectorLength
Scalar Instructions Vector Instructions
4 1 5
0 3 3
-2 8 6
9 -7 2
4 1 5
0 3 3
-2 8 6
9 -7 2
colfaxresearch.com/knl-webinar Vector Instruction Support © Colfax International, 2013–2016
29
Dual VPU
Each core on KNL has two Vector Processing Units (VPUs).
Penalty from non-vectorized code
SP →512 bitregisters/32 bits ×2 VPUs = 32 SIMD lanes
DP →512 bitregisters/64 bits ×2 VPUs = 16 SIMD lanes
VPU 0
512-bit registers = 16 SP or 8 DP
ZMM registers
x32 per thread VPU 1
Decode
I-Cache
32/16 SIMD lanes (SP/DP)
colfaxresearch.com/knl-webinar Vector Instruction Support © Colfax International, 2013–2016
30
Supported Instruction Sets on KNL
Intel®
Advanced Vector
Extensions 512 (AVX-512)
512-bit vector registers.
Hardware gather/scatter, DP
transcendental functions support
and more.
Supported by non-Intel compilers
like GCC.
≤ Intel®
AVX2
Legacy mode operation.
Binary compatibility with Xeon.
Does not include IMCI (from KNC).
E5-2600
(SNB)
E5-2600v3
(HSW)
KNL
(Xeon Phi)
AVX-512PF
AVX-512ER
AVX-512CD
AVX-512F
BMI
AVX2
AVX
SSE
x87/MMX
TSX
BMI
AVX2
AVX
SSE
x87/MMX
AVX
SSE
x87/MMX
LegacySupport
no TSX
colfaxresearch.com/knl-webinar Vector Instruction Support © Colfax International, 2013–2016
31
AVX-512 Features
Knights Landing: first AVX-512 processor
AVX-512F (Fundamentals)
Extension of most AVX2 instructions to 512-bit vector registers.
AVX-512CD (Conflict Detection)
Efficient conflict detection (application: binning).
AVX-512ER (Exponential and Reciprocal)
Transcendental function (exp, rcp and rsqrt) support.
AVX-512PF (Prefetch)
Prefetch for scatter and gather.
White Paper: http://colfaxresearch.com/avx-512/
colfaxresearch.com/knl-webinar Vector Instruction Support © Colfax International, 2013–2016
32
Performance Considerations
Even if your code is vectorized, tuning may unlock more performance.
link to paper
Providing enough parallelism.
More consecutive vector
operations required to
overcome vectorization latency.
Loop pipelining and unrolling.
Double the pipeline stages to
populate.
Better vectorization patterns.
Avoid long latency operations
with unit-stride and unmasked
operations.
colfaxresearch.com/knl-webinar Vector Instruction Support © Colfax International, 2013–2016
33
High-Bandwidth Memory
colfaxresearch.com/knl-webinar High-Bandwidth Memory © Colfax International, 2013–2016
34
KNL Memory Organization
Direct access to on-package MCDRAM and system DDR4 (socket)
Use MCDRAM as cache, in flat mode, or as hybrid
CORE
CORE
REGISTERSREGISTERS
L1
cache
L1
cache
L2
cache
MCDRAM
(on package
RAM)
DDR4
RAM
(system memory)
Up to 16 GiB
over 400 GB/s
Up to 384 GiB
~ 90 GB/s (STREAM)
...
more cores
Intel Xeon Phi
Processor
Flat,
Cache
or
Hybrid
colfaxresearch.com/knl-webinar High-Bandwidth Memory © Colfax International, 2013–2016
35
MCDRAM Memory Modes
Flat Mode
MCDRAM treated as a
NUMA node
Users control what
goes to MCDRAM
CPU
DDR4
Addressable
MCDRAM
NUMA
Node 0
NUMA
Node 1
Cache Mode
MCDRAM treated as a
Last Level Cache (LLC)
MCDRAM is used
automatically
CPU
DDR4
MCDRAM
as cache
Hybrid Mode
Combination of Flat
and Cache
Ratio can be chosen in
the BIOS
Addressable
MCDRAM
NUMA
Node 0
NUMA
Node 1
DDR4
MCDRAM
as cache
CPU
colfaxresearch.com/knl-webinar High-Bandwidth Memory © Colfax International, 2013–2016
36
Running Applications in HBM with numactl
Finding information about the NUMA nodes in the system.
user@knl% # In Flat mode with All-to-All
user@knl% numactl -H
available: 2 nodes (0-1)
node 0 cpus: ... all cpus ...
node 0 size: 98207 MB
node 0 free: 94798 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 15991 MB
Binding the application to MCDRAM (Flat/Hybrid)
user@knl% gcc myapp.c -o runme -mavx512f -O2
user@knl% numactl --membind 1 ./runme
// ... Application running on MCDRAM ... //
colfaxresearch.com/knl-webinar High-Bandwidth Memory © Colfax International, 2013–2016
37
Allocation in HBM with Memkind Library
Manual allocation to MCDRAM possible with hbwmalloc and Memkind Library.
1 #include <hbwmalloc.h>
2 const int n = 1<<10;
3 // Allocation to MCDRAM
4 double* A = (double*) hbw_malloc(sizeof(double)*n);
5 // No replacement for _mm_malloc. Use posix_memalign
6 double* B;
7 int ret = hbw_posix_memalign((void*) B, 64, sizeof(double)*n);
8 .....
9 // Free with hbw_free
10 hbw_free(A); hbw_free(b);
Fortran Allocations.
1 REAL, ALLOCATABLE :: A(:)
2 !DEC$ ATTRIBUTES FASTMEM :: A
3 ALLOCATE (A(1:1024))
colfaxresearch.com/knl-webinar High-Bandwidth Memory © Colfax International, 2013–2016
38
Compilation with Memkind Library and hbwmalloc
To compile C/C++ applications:
user@knl% icpc -lmemkind foo.cc -o runme
user@knl% g++ -lmemkind foo.cc -o runme
To compile Fortran applications:
user@knl% ifort -lmemkind foo.f90 -o runme
user@knl% gfortran -lmemkind foo.f90 -o runme
Open source distribution of Memkind library can be found at:
http://memkind.github.io/memkind/
Colfax Publication on using MCDRAM and Memkind Library:
http://colfaxresearch.com/knl-mcdram/
colfaxresearch.com/knl-webinar High-Bandwidth Memory © Colfax International, 2013–2016
39
Flow Chart for Bandwidth-Bound Applications
Does your Application
fit in 16 GB?
Can you partition ≤16 GB
of BW-critical memory?
No
Yes Yes
No
Start
numactl Memkind Cache mode
Simply run the whole
program in MCDRAM
No code modification
required
Manually allocate
BW-critical memory to
MCDRAM
Memkind calls need to
be added.
Allow the OS to figure
out how to use
MCDRAM
No code modification
required
colfaxresearch.com/knl-webinar High-Bandwidth Memory © Colfax International, 2013–2016
40
Clustering Modes on KNL
colfaxresearch.com/knl-webinar Clustering Modes on KNL © Colfax International, 2013–2016
41
Clustering Modes: All-to-All
No affinity between the distributed Tag Directory (TD) and memory.
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
NUMAnode0
MCDRAMMCDRAM
MCDRAMMCDRAM
colfaxresearch.com/knl-webinar Clustering Modes on KNL © Colfax International, 2013–2016
42
Clustering Modes: Quadrant/Hemisphere
Tag Directory (TD) and memory reside in the same quadrant.
tile
TD
MCDRAM
MCDRAMMCDRAM
MCDRAM
tile
TD
tile
TD
tile
TD
NUMAnode0
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
colfaxresearch.com/knl-webinar Clustering Modes on KNL © Colfax International, 2013–2016
43
Clustering Modes: SNC-4/SNC-2
Will appear as 4 NUMA nodes (similar to quad-socket system).
tile
TD
MCDRAM
MCDRAMMCDRAM
MCDRAM
tile
TD
tile
TD
tile
TD
NUMAnode0
NUMAnode1
NUMAnode2
NUMAnode3
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
colfaxresearch.com/knl-webinar Clustering Modes on KNL © Colfax International, 2013–2016
44
How to Use Clustering modes
Thread affinity to match the memory/directory affinity.
Nested OpenMP
1 #pragma omp parallel
2 {
3 // ...
4 #pragma omp parallel
5 {
6 // ...
7 }
8 }
OpenMP Threads
Quadrants
KNL
user@knl% OMP_NUM_THREADS=4,72
user@knl% OMP_NESTED=1
MPI+OpenMP
1 stat = MPI_Init();
2 // ...
3 #pragma omp parallel
4 {
5 // ...
6 }
7 // ...
8 MPI_Finalize();
NUMA node 0
MPI rank 0
OpenMP
NUMA node 1
MPI rank 1
OpenMP
NUMA node 2
MPI rank 2
OpenMP
NUMA node 3
MPI rank 3
OpenMP
user@knl% mpirun -host knl 
> -np 4 ./myparallel_app
White Paper: http://colfaxresearch.com/knl-numa/
colfaxresearch.com/knl-webinar Clustering Modes on KNL © Colfax International, 2013–2016
45
“Get Ready for Intel’s Knights Landing” Guide
colfaxresearch.com/knl-webinar “Get Ready for Intel’s Knights Landing” Guide © Colfax International, 2013–2016
46
Get Ready For Intel®
Xeon Phi processors (Codenamed:
Knights Landing)
colfaxresearch.com/knl-ready
colfaxresearch.com/knl-webinar “Get Ready for Intel’s Knights Landing” Guide © Colfax International, 2013–2016
47
§6. Hands-On Demonstrations
colfaxresearch.com/knl-webinar Hands-On Demonstrations © Colfax International, 2013–2016
48
Running the STREAM Benchmark
colfaxresearch.com/knl-webinar Running the STREAM Benchmark © Colfax International, 2013–2016
49
STREAM Benchmark
Industry-standard tool for
memory bandwidth
measurement
4 tests: COPY, ADD,
SCALE and TRIAD
Download from Dr. John
McCalpin’s site:
www.cs.virginia.edu/stream/
Expected for KNL: ≈400 GB/s
colfaxresearch.com/knl-webinar Running the STREAM Benchmark © Colfax International, 2013–2016
50
STREAM Benchmark on Tuning
Set 1 thread per core on all platforms (-1 for offload)
Set affinity “scatter”: white paper (Colfax Research)
Tune prefetching: white paper (Intel)
In addition, secret sauce for your own STREAM-like application:
Parallel first touch (see NUMA discussion in Session 08)
Essential element – streaming stores: discussion
colfaxresearch.com/knl-webinar Running the STREAM Benchmark © Colfax International, 2013–2016
51
HBM in Knights Landing
Finding HBM (MCDRAM) in an Intel Xeon Phi processor x200 (KNL):
user@knl% # In Flat mode with All-to-All or Quadrant
user@knl% numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ... 249 250 251 252 253 254 255
node 0 size: 98207 MB
node 0 free: 94798 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 15991 MB
Binding the application to HBM (Flat/Hybrid)
user@knl% numactl --membind 1 ./runme
// ... Application running in HBM ... //
colfaxresearch.com/knl-webinar Running the STREAM Benchmark © Colfax International, 2013–2016
52
Vectorization with AVX-512
colfaxresearch.com/knl-webinar Vectorization with AVX-512 © Colfax International, 2013–2016
53
Demo: AVX-512CD and Histograms
1 // worker.cc from Colfax lab 4.04
2 void Histogram(const float* age, int* const hist,
3 const int n, const float group_width,
4 const int m) {
5
6 for (int i = 0; i < n; i++) {
7 const int j = (int) ( age[i] / group_width );
8 hist[j]++;
9 }
10 }
56.8 52.1 21.1 46.9 23.8 16.5 96.3 19.4
20.0 20.0 20.0 20.0 20.0 20.0 20.0 20.0
2.84 2.60 1.05 2.34 1.19 0.82 4.81 0.97
int
2 2 1 2 1 0 4 0
0 1 2 3 4hist:
=
÷
user@knl% cat worker.optrpt
....
remark ...: vectorization support: scatter was generated for the variable hist:
remark ...: vectorization support: gather was generated for the variable hist:
remark #15300: LOOP WAS VECTORIZED
colfaxresearch.com/knl-webinar Vectorization with AVX-512 © Colfax International, 2013–2016
54
Intel Compiler support for AVX-512
Intel Compiler versions ≥ 15.0 supports AVX-512 instruction set.
user@knl% icc -v
icc version 16.0.1 (gcc version 4.8.5 compatibility)
user@knl% icc -help
// ... truncated output ... //
-x<code>
...
MIC-AVX512
CORE-AVX512
COMMON-AVX512
-xMIC-AVX512 : for KNL (supports F, CD, ER, PF)
-xCORE-AVX512 : for future Xeon (supports F, CD, DQ, BW, VL)
-xCOMMON-AVX512 : common to KNL and Xeon (supports F, CD)
colfaxresearch.com/knl-webinar Vectorization with AVX-512 © Colfax International, 2013–2016
55
Strip-Mining for Vectorization
Vectorization improves performance on both platforms. However, more work is
needed to take advantage of the MIC architecture. See materials on multi-threading.
Scalar Serial Code Strength Reduction Vectorized Serial Code
(Strip-Mining)
0.0
0.2
0.4
0.6
0.8
1.0
Performance,billionvalues/s(higherisbetter)
0.373
0.588
0.937
0.017 0.045
0.1260.105 0.119 0.117
Intel Xeon Processor E5-2697 V2
Intel Xeon Phi Coprocessor 7120P (KNC)
Intel Xeon Phi Processor 7210 (KNL)
colfaxresearch.com/knl-webinar Vectorization with AVX-512 © Colfax International, 2013–2016
56
Using Reduction instead of Synchronization
Vectorized Serial Code
(Strip-Mining)
Vectorized Parallel Code
(Atomic Operations)
Vectorized Parallel Code
(Private Variables)
0
5
10
15
20
Performance,billionvalues/s(higherisbetter)
0.937
0.046
11.2
0.126 0.035
13.9
0.117 0.017
16.4
Intel Xeon Processor E5-2697 V2
Intel Xeon Phi Coprocessor 7120P (KNC)
Intel Xeon Phi Processor 7210 (KNL)
colfaxresearch.com/knl-webinar Vectorization with AVX-512 © Colfax International, 2013–2016
57
Memory Traffic Optimization
Vectorized Parallel Code
(Private Variables)
Parallel Initialization
(First-Touch Allocation)
0
5
10
15
20
25
Performance,billionvalues/s(higherisbetter)
11.2
22.1
13.9 13.6
16.4 16.4
Intel Xeon processor E5-2697 V2
Intel Xeon Phi coprocessor 7120P (KNC)
Intel Xeon Phi processor 7210 (KNL)
colfaxresearch.com/knl-webinar Vectorization with AVX-512 © Colfax International, 2013–2016
58
Tuning with Intel Math Kernel Library
colfaxresearch.com/knl-webinar Tuning with Intel Math Kernel Library © Colfax International, 2013–2016
59
Batch Processing
Feed multiple signals to MKL
Let the library decide on the parallel strategy.
1 MKL_Complex16* data =
2 (MKL_Complex16*) malloc(sizeof(MKL_Complex16)*fft_size*num_fft);
3
4 DFTI_DESCRIPTOR_HANDLE handle;
5 DftiCreateDescriptor(&handle, DFTI_DOUBLE, DFTI_COMPLEX, 1, (MKL_LONG)fft_size);
6 DftiSetValue(handle, DFTI_NUMBER_OF_TRANSFORMS, num_fft);
7 DftiSetValue(handle, DFTI_INPUT_DISTANCE, fft_size);
8 DftiSetValue(handle, DFTI_OUTPUT_DISTANCE, fft_size);
9 DftiSetValue(handle, DFTI_PLACEMENT, DFTI_INPLACE);
10 DftiCommitDescriptor(handle);
colfaxresearch.com/knl-webinar Tuning with Intel Math Kernel Library © Colfax International, 2013–2016
60
Nested Parallelism
... ... ...
Fine-grained
parallelism
Coarse-grained
parallelism
Nested
parallelism
throwing all threads
on one MKL function
using one thread
per MKL function
putting teams of threads
on several MKL functions
colfaxresearch.com/knl-webinar Tuning with Intel Math Kernel Library © Colfax International, 2013–2016
61
Thread Affinity Tuning
Xeon Xeon Phi
MKL uses 1 thread/core MKL uses 4 threads/core
OMP_NUM_THREADS=#phys. cores OMP_NUM_THREADS=4×#cores
KMP_AFFINITY=scatter KMP_AFFINITY=scatter
KMP_AFFINITY=compact (HT off) KMP_AFFINITY=compact
KMP_AFFINITY=compact,1 (HT on)
See also HOW series Session 8.
colfaxresearch.com/knl-webinar Tuning with Intel Math Kernel Library © Colfax International, 2013–2016
62
Thread Affinity: Scatter Pattern
Generally beneficial for bandwidth-bound applications.
HT 0 HT 1 HT 0 HT 1
Core 0 Core 1
Socket 0
HT 0 HT 1 HT 0 HT 1
Core 0 Core 1
Socket 1
Threads:
Cores:
0 2 1 3
HT 0 HT 1 HT 0 HT 1
Core 0 Core 1
Socket 0
HT 0 HT 1 HT 0 HT 1
Core 0 Core 1
Socket 1
Threads:
Cores:
0
HT 0 HT 1 HT 0 HT 1
Core 0
Xeon Phi chip
HT 0 HT 1 HT 0 HT 1
Core 1
Threads:
Cores:
0 61 122 183 1 62 123 184
...
MKL
on
Xeon:
MKL
on
Xeon Phi:
KMP_AFFINITY=scatter
KMP_AFFINITY=scatter
colfaxresearch.com/knl-webinar Tuning with Intel Math Kernel Library © Colfax International, 2013–2016
63
Data Container Tweaks
Parallel first-touch (see also HOW series Session 8):
1 #pragma omp parallel for
2 for(int i = 0; i < num_fft; i++)
3 for(int j = 0; j < fft_size; j++) {
4 data[i*fft_size+j].real = cos(k*j);
5 data[i*fft_size+j].imag = sin(k*j);
6 }
Special allocator (alignment – see also HOW series Session 6 and
allocation in HBM on KNL)
1 MKL_Complex16* data =
2 (MKL_Complex16*) mkl_malloc(sizeof(MKL_Complex16)*fft_size*num_fft, 64);
colfaxresearch.com/knl-webinar Tuning with Intel Math Kernel Library © Colfax International, 2013–2016
64
Complex-to-Complex 1D FFT Performance
Initial Batch
Mode
Affinity
Tuning
Parallel
First Touch
MKL
Allocator
0
50
100
150
200
250
300
350
400
450
DoublePrecisionGFLOP/s
10
56 65
145 156
1
148 155 155
239
5
113 113 113
383
1D Fast Fourier Transform Performance
Intel Xeon processor E5-2697 v3
Intel Xeon Phi coprocessor 7120P
Intel Xeon Phi processor 7210
5colfaxresearch.com/knl-webinar Tuning with Intel Math Kernel Library © Colfax International, 2013–2016
65
Machine Learning on Intel Architecture
colfaxresearch.com/knl-webinar Machine Learning on Intel Architecture © Colfax International, 2013–2016
66
Machine Learning: Optimized Middleware
http://colfaxresearch.com/isc16-neuraltalk
colfaxresearch.com/knl-webinar Machine Learning on Intel Architecture © Colfax International, 2013–2016
67
More Information: “HOW” Series
colfaxresearch.com/knl-webinar More Information: “HOW” Series © Colfax International, 2013–2016
68
HOW Series
Need more tutorials? Sign-up at:
colfaxresearch.com/how-series
colfaxresearch.com/knl-webinar More Information: “HOW” Series © Colfax International, 2013–2016
69
§7. Closing Words
colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016
70
Bottom Line
KNL is a highly-parallel successor of KNC, with
improvements in both performance and ease-of-use.
KEEP CALM
GO PARALLEL
AND
colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016
71
Get Ready For Intel®
Xeon Phi processors (Codenamed:
Knights Landing)
colfaxresearch.com/knl-ready
colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016
72
Colfax Research
http://colfaxresearch.com/
colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016
73
Developer Access Program (DAP)
Can’t wait to get your hands on Knights Landing?
Find out more at dap.xeonphi.com
or contact us at dap@colfax-intl.com.
colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016
74
Thank you for Tuning In!
colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016
75
§8. Back-up Slides
colfaxresearch.com/knl-webinar Back-up Slides © Colfax International, 2013–2016
76
Loop Collapse: Principle
Idea: combine iterations spaces of the inner loop and the outer loop.
1 #pragma omp parallel for collapse(2)
2 for (int i = 0; i < m; i++)
3 for (int j = 0; j < n; j++) {
4 // ...
5 // ...
6 }
1 #pragma omp parallel for
2 for (int c = 0; c < m*n; c++) {
3 i = c / n;
4 j = c % n;
5 // ...
6 }
colfaxresearch.com/knl-webinar Back-up Slides © Colfax International, 2013–2016
77
Exposing Parallelism: Strip-Mining and Loop Collapse
1 void sum_stripmine(const int m, const int n, long* M, long* s){
2 const int STRIP=1024;
3 assert(n%STRIP==0);
4 s[0:m]=0;
5 #pragma omp parallel
6 {
7 long total[m]; total[0:m]=0;
8 #pragma omp for collapse(2) schedule(guided)
9 for (int i=0; i<m; i++)
10 for (int jj=0; jj<n; jj+=STRIP)
11 #pragma vector aligned
12 for (int j=jj; j<jj+STRIP; j++)
13 total[i]+=M[i*n+j];
14 for (int i=0; i<m; i++) // Reduction
15 #pragma omp atomic
16 s[i]+=total[i];
17 } }
colfaxresearch.com/knl-webinar Back-up Slides © Colfax International, 2013–2016
78
Checking for AVX-512 support
Check /proc/cpuinfo for AVX-512 flags
user@knl% cat /proc/cpuinfo
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36
clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm
constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc
aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 fma
cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer xsave avx
f16c rdrand lahf_lm abm 3dnowprefetch arat epb xsaveopt pln pts dtherm
tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2
erms avx512f rdseed adx avx512pf avx512er avx512cd
Also possible through C/C++ function calls: see this blog.
colfaxresearch.com/knl-webinar Back-up Slides © Colfax International, 2013–2016
79
Using AVX-512: Two Approaches
Automatic Vectorization:
Vectorization w/ compiler.
Portable: just recompile.
Tuning with directives.
1 double A[vec_width], B[vec_width];
2 // ...
3
4
5 // This loop will be auto-vectorized
6 for(int i = 0; i < vec_width; i++)
7 A[i]+=B[i];
1 double A[vec_width], B[vec_width];
2 // ...
3 // This is explicitly vectorized
4 __m512d A_vec = _mm512_load_pd(A);
5 __m512d B_vec = _mm512_load_pd(B);
6 A_vec = _mm512_add_pd(A_vec,B_vec);
7 _mm512_store_pd(A,A_vec);
Explicit Vectorization:
Vectorization w/ intrinsics
Full control over instructions
Limited portability
colfaxresearch.com/knl-webinar Back-up Slides © Colfax International, 2013–2016
80
GCC support for AVX-512
GCC ≥ 4.9.1 supports AVX-512 instruction set.
user@knl% g++ -v
gcc version 4.9.2 (GCC)
user@knl% g++ foo.cc -mavx512f -mavx512er -mavx512cd -mavx512pf
Basic automatic vectorization support: add -O2 or -O3.
1 // ... foo.cc ... //
2 for(int i = 0; i < n; i++)
3 B[i] = A[i] + B[i];
user@knl% g++ -s foo.cc -mavx512f -03
user@knl% cat foo.s
...
vmovapd -16432(%rbp,%rax), %zmm0
vaddpd -8240(%rbp,%rax), %zmm0, %zmm0
vmovapd %zmm0, -8240(%rbp,%rax)
colfaxresearch.com/knl-webinar Back-up Slides © Colfax International, 2013–2016
81
HOW Series
Interested? Sign-up at:
colfaxresearch.com/how-series
colfaxresearch.com/knl-webinar Back-up Slides © Colfax International, 2013–2016
82
Textbook
ISBN: 978-0-9885234-0-1 (508 pages, Electronic or Print)
Parallel Programming
and Optimization with
Intel®
Xeon Phi™
Coprocessors
Handbook on the Development and
Optimization of Parallel Applications
for Intel®
Xeon®
Processors
and Intel®
Xeon Phi™
Coprocessors
© Colfax International, 2015
PARALLEL PROGRAMMING
AND OPTIMIZATION WITH
HANDBOOKONTHE
DEVELOPMENTAND
OPTIMIZATIONOF
PARALLEL
APPLICATIONSFOR
INTEL XEON
PROCESSORS
ANDINTEL
XEONPHI
COPROCESSORS
INTEL XEON PHI
COPROCESSORS
TMR
SECONDEDITION
C O L F A X I N T E R N A T I O N A L
ANDREY VLADIMIROV | RYO ASAI | VADIM KARPUSENKO
http://xeonphi.com/book
colfaxresearch.com/knl-webinar Back-up Slides © Colfax International, 2013–2016

More Related Content

What's hot

The Role of a Network Software Developer in Network Transformation
The Role of a Network Software Developer in Network TransformationThe Role of a Network Software Developer in Network Transformation
The Role of a Network Software Developer in Network TransformationMichelle Holley
 
TLDK - FD.io Sept 2016
TLDK - FD.io Sept 2016 TLDK - FD.io Sept 2016
TLDK - FD.io Sept 2016 Benoit Hudzia
 
Learning from ZFS to Scale Storage on and under Containers
Learning from ZFS to Scale Storage on and under ContainersLearning from ZFS to Scale Storage on and under Containers
Learning from ZFS to Scale Storage on and under Containersinside-BigData.com
 
Intel® Ethernet Update
Intel® Ethernet Update Intel® Ethernet Update
Intel® Ethernet Update Michelle Holley
 
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...Eric Van Hensbergen
 
OpenSPARC T1 Processor
OpenSPARC T1 ProcessorOpenSPARC T1 Processor
OpenSPARC T1 ProcessorDVClub
 
Overview of Intel® Omni-Path Architecture
Overview of Intel® Omni-Path ArchitectureOverview of Intel® Omni-Path Architecture
Overview of Intel® Omni-Path ArchitectureIntel® Software
 
Quieting noisy neighbor with Intel® Resource Director Technology
Quieting noisy neighbor with Intel® Resource Director TechnologyQuieting noisy neighbor with Intel® Resource Director Technology
Quieting noisy neighbor with Intel® Resource Director TechnologyMichelle Holley
 
Software Network Data Plane - Satisfying the need for speed - FD.io - VPP and...
Software Network Data Plane - Satisfying the need for speed - FD.io - VPP and...Software Network Data Plane - Satisfying the need for speed - FD.io - VPP and...
Software Network Data Plane - Satisfying the need for speed - FD.io - VPP and...Haidee McMahon
 
The HPE Machine and Gen-Z - BUD17-503
The HPE Machine and Gen-Z - BUD17-503The HPE Machine and Gen-Z - BUD17-503
The HPE Machine and Gen-Z - BUD17-503Linaro
 
DPDK Summit - 08 Sept 2014 - Intel - Networking Workloads on Intel Architecture
DPDK Summit - 08 Sept 2014 - Intel - Networking Workloads on Intel ArchitectureDPDK Summit - 08 Sept 2014 - Intel - Networking Workloads on Intel Architecture
DPDK Summit - 08 Sept 2014 - Intel - Networking Workloads on Intel ArchitectureJim St. Leger
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuninginside-BigData.com
 
Redesigning the LTE Packet Core
Redesigning the LTE Packet CoreRedesigning the LTE Packet Core
Redesigning the LTE Packet CoreMichelle Holley
 
High-Performance and Scalable Designs of Programming Models for Exascale Systems
High-Performance and Scalable Designs of Programming Models for Exascale SystemsHigh-Performance and Scalable Designs of Programming Models for Exascale Systems
High-Performance and Scalable Designs of Programming Models for Exascale Systemsinside-BigData.com
 
Durgam vahia open_sparc_fpga
Durgam vahia open_sparc_fpgaDurgam vahia open_sparc_fpga
Durgam vahia open_sparc_fpgaObsidian Software
 
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchDPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchJim St. Leger
 
Netsft2017 day in_life_of_nfv
Netsft2017 day in_life_of_nfvNetsft2017 day in_life_of_nfv
Netsft2017 day in_life_of_nfvIntel
 
DPDK Summit 2015 - Aspera - Charles Shiflett
DPDK Summit 2015 - Aspera - Charles ShiflettDPDK Summit 2015 - Aspera - Charles Shiflett
DPDK Summit 2015 - Aspera - Charles ShiflettJim St. Leger
 

What's hot (20)

The Role of a Network Software Developer in Network Transformation
The Role of a Network Software Developer in Network TransformationThe Role of a Network Software Developer in Network Transformation
The Role of a Network Software Developer in Network Transformation
 
Intel dpdk Tutorial
Intel dpdk TutorialIntel dpdk Tutorial
Intel dpdk Tutorial
 
TLDK - FD.io Sept 2016
TLDK - FD.io Sept 2016 TLDK - FD.io Sept 2016
TLDK - FD.io Sept 2016
 
Learning from ZFS to Scale Storage on and under Containers
Learning from ZFS to Scale Storage on and under ContainersLearning from ZFS to Scale Storage on and under Containers
Learning from ZFS to Scale Storage on and under Containers
 
Intel® Ethernet Update
Intel® Ethernet Update Intel® Ethernet Update
Intel® Ethernet Update
 
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
 
OpenSPARC T1 Processor
OpenSPARC T1 ProcessorOpenSPARC T1 Processor
OpenSPARC T1 Processor
 
Overview of Intel® Omni-Path Architecture
Overview of Intel® Omni-Path ArchitectureOverview of Intel® Omni-Path Architecture
Overview of Intel® Omni-Path Architecture
 
Quieting noisy neighbor with Intel® Resource Director Technology
Quieting noisy neighbor with Intel® Resource Director TechnologyQuieting noisy neighbor with Intel® Resource Director Technology
Quieting noisy neighbor with Intel® Resource Director Technology
 
Cloud Networking Trends
Cloud Networking TrendsCloud Networking Trends
Cloud Networking Trends
 
Software Network Data Plane - Satisfying the need for speed - FD.io - VPP and...
Software Network Data Plane - Satisfying the need for speed - FD.io - VPP and...Software Network Data Plane - Satisfying the need for speed - FD.io - VPP and...
Software Network Data Plane - Satisfying the need for speed - FD.io - VPP and...
 
The HPE Machine and Gen-Z - BUD17-503
The HPE Machine and Gen-Z - BUD17-503The HPE Machine and Gen-Z - BUD17-503
The HPE Machine and Gen-Z - BUD17-503
 
DPDK Summit - 08 Sept 2014 - Intel - Networking Workloads on Intel Architecture
DPDK Summit - 08 Sept 2014 - Intel - Networking Workloads on Intel ArchitectureDPDK Summit - 08 Sept 2014 - Intel - Networking Workloads on Intel Architecture
DPDK Summit - 08 Sept 2014 - Intel - Networking Workloads on Intel Architecture
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
Redesigning the LTE Packet Core
Redesigning the LTE Packet CoreRedesigning the LTE Packet Core
Redesigning the LTE Packet Core
 
High-Performance and Scalable Designs of Programming Models for Exascale Systems
High-Performance and Scalable Designs of Programming Models for Exascale SystemsHigh-Performance and Scalable Designs of Programming Models for Exascale Systems
High-Performance and Scalable Designs of Programming Models for Exascale Systems
 
Durgam vahia open_sparc_fpga
Durgam vahia open_sparc_fpgaDurgam vahia open_sparc_fpga
Durgam vahia open_sparc_fpga
 
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchDPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
 
Netsft2017 day in_life_of_nfv
Netsft2017 day in_life_of_nfvNetsft2017 day in_life_of_nfv
Netsft2017 day in_life_of_nfv
 
DPDK Summit 2015 - Aspera - Charles Shiflett
DPDK Summit 2015 - Aspera - Charles ShiflettDPDK Summit 2015 - Aspera - Charles Shiflett
DPDK Summit 2015 - Aspera - Charles Shiflett
 

Similar to HOW Series: Knights Landing

Optimize Machine Learning Workloads on Intel® Platforms
Optimize Machine Learning Workloads on Intel® PlatformsOptimize Machine Learning Workloads on Intel® Platforms
Optimize Machine Learning Workloads on Intel® PlatformsIntel® Software
 
Intel colfax optimizing-machine-learning-workloads
Intel colfax optimizing-machine-learning-workloadsIntel colfax optimizing-machine-learning-workloads
Intel colfax optimizing-machine-learning-workloadsTracy Johnson
 
Seminar Accelerating Business Using Microservices Architecture in Digital Age...
Seminar Accelerating Business Using Microservices Architecture in Digital Age...Seminar Accelerating Business Using Microservices Architecture in Digital Age...
Seminar Accelerating Business Using Microservices Architecture in Digital Age...PT Datacomm Diangraha
 
DPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet ProcessingDPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet ProcessingMichelle Holley
 
Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...
Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...
Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...Ceph Community
 
Give Your Confluent Platform Superpowers! (Sandeep Togrika, Intel and Bert Ha...
Give Your Confluent Platform Superpowers! (Sandeep Togrika, Intel and Bert Ha...Give Your Confluent Platform Superpowers! (Sandeep Togrika, Intel and Bert Ha...
Give Your Confluent Platform Superpowers! (Sandeep Togrika, Intel and Bert Ha...HostedbyConfluent
 
Omni path-fabric-software-architecture-overview
Omni path-fabric-software-architecture-overviewOmni path-fabric-software-architecture-overview
Omni path-fabric-software-architecture-overviewDESMOND YUEN
 
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesPerformance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesDr. Fabio Baruffa
 
A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersIntel® Software
 
Converged IO for HP ProLiant Gen8
Converged IO for HP ProLiant Gen8Converged IO for HP ProLiant Gen8
Converged IO for HP ProLiant Gen8IT Brand Pulse
 
Increasing Throughput per Node for Content Delivery Networks
Increasing Throughput per Node for Content Delivery NetworksIncreasing Throughput per Node for Content Delivery Networks
Increasing Throughput per Node for Content Delivery NetworksDESMOND YUEN
 
BPF & Cilium - Turning Linux into a Microservices-aware Operating System
BPF  & Cilium - Turning Linux into a Microservices-aware Operating SystemBPF  & Cilium - Turning Linux into a Microservices-aware Operating System
BPF & Cilium - Turning Linux into a Microservices-aware Operating SystemThomas Graf
 
Ceph Day Chicago - Deploying flash storage for Ceph without compromising perf...
Ceph Day Chicago - Deploying flash storage for Ceph without compromising perf...Ceph Day Chicago - Deploying flash storage for Ceph without compromising perf...
Ceph Day Chicago - Deploying flash storage for Ceph without compromising perf...Ceph Community
 
Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromisi...
Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromisi...Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromisi...
Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromisi...Ceph Community
 
InfiniBand for the enterprise
InfiniBand for the enterpriseInfiniBand for the enterprise
InfiniBand for the enterpriseAnas Kanzoua
 
Ceph Day SF 2015 - Deploying flash storage for Ceph without compromising perf...
Ceph Day SF 2015 - Deploying flash storage for Ceph without compromising perf...Ceph Day SF 2015 - Deploying flash storage for Ceph without compromising perf...
Ceph Day SF 2015 - Deploying flash storage for Ceph without compromising perf...Ceph Community
 
Ceph Day Beijing - Storage Modernization with Intel and Ceph
Ceph Day Beijing - Storage Modernization with Intel and CephCeph Day Beijing - Storage Modernization with Intel and Ceph
Ceph Day Beijing - Storage Modernization with Intel and CephDanielle Womboldt
 
Ceph Day Beijing - Storage Modernization with Intel & Ceph
Ceph Day Beijing - Storage Modernization with Intel & Ceph Ceph Day Beijing - Storage Modernization with Intel & Ceph
Ceph Day Beijing - Storage Modernization with Intel & Ceph Ceph Community
 

Similar to HOW Series: Knights Landing (20)

Optimize Machine Learning Workloads on Intel® Platforms
Optimize Machine Learning Workloads on Intel® PlatformsOptimize Machine Learning Workloads on Intel® Platforms
Optimize Machine Learning Workloads on Intel® Platforms
 
Intel colfax optimizing-machine-learning-workloads
Intel colfax optimizing-machine-learning-workloadsIntel colfax optimizing-machine-learning-workloads
Intel colfax optimizing-machine-learning-workloads
 
Seminar Accelerating Business Using Microservices Architecture in Digital Age...
Seminar Accelerating Business Using Microservices Architecture in Digital Age...Seminar Accelerating Business Using Microservices Architecture in Digital Age...
Seminar Accelerating Business Using Microservices Architecture in Digital Age...
 
DPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet ProcessingDPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet Processing
 
Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...
Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...
Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...
 
Give Your Confluent Platform Superpowers! (Sandeep Togrika, Intel and Bert Ha...
Give Your Confluent Platform Superpowers! (Sandeep Togrika, Intel and Bert Ha...Give Your Confluent Platform Superpowers! (Sandeep Togrika, Intel and Bert Ha...
Give Your Confluent Platform Superpowers! (Sandeep Togrika, Intel and Bert Ha...
 
Omni path-fabric-software-architecture-overview
Omni path-fabric-software-architecture-overviewOmni path-fabric-software-architecture-overview
Omni path-fabric-software-architecture-overview
 
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesPerformance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
 
A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing Clusters
 
Converged IO for HP ProLiant Gen8
Converged IO for HP ProLiant Gen8Converged IO for HP ProLiant Gen8
Converged IO for HP ProLiant Gen8
 
Increasing Throughput per Node for Content Delivery Networks
Increasing Throughput per Node for Content Delivery NetworksIncreasing Throughput per Node for Content Delivery Networks
Increasing Throughput per Node for Content Delivery Networks
 
BPF & Cilium - Turning Linux into a Microservices-aware Operating System
BPF  & Cilium - Turning Linux into a Microservices-aware Operating SystemBPF  & Cilium - Turning Linux into a Microservices-aware Operating System
BPF & Cilium - Turning Linux into a Microservices-aware Operating System
 
Ceph Day Chicago - Deploying flash storage for Ceph without compromising perf...
Ceph Day Chicago - Deploying flash storage for Ceph without compromising perf...Ceph Day Chicago - Deploying flash storage for Ceph without compromising perf...
Ceph Day Chicago - Deploying flash storage for Ceph without compromising perf...
 
IBM PureSystems
IBM PureSystemsIBM PureSystems
IBM PureSystems
 
Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromisi...
Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromisi...Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromisi...
Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromisi...
 
InfiniBand for the enterprise
InfiniBand for the enterpriseInfiniBand for the enterprise
InfiniBand for the enterprise
 
Ceph Day SF 2015 - Deploying flash storage for Ceph without compromising perf...
Ceph Day SF 2015 - Deploying flash storage for Ceph without compromising perf...Ceph Day SF 2015 - Deploying flash storage for Ceph without compromising perf...
Ceph Day SF 2015 - Deploying flash storage for Ceph without compromising perf...
 
GIST AI-X Computing Cluster
GIST AI-X Computing ClusterGIST AI-X Computing Cluster
GIST AI-X Computing Cluster
 
Ceph Day Beijing - Storage Modernization with Intel and Ceph
Ceph Day Beijing - Storage Modernization with Intel and CephCeph Day Beijing - Storage Modernization with Intel and Ceph
Ceph Day Beijing - Storage Modernization with Intel and Ceph
 
Ceph Day Beijing - Storage Modernization with Intel & Ceph
Ceph Day Beijing - Storage Modernization with Intel & Ceph Ceph Day Beijing - Storage Modernization with Intel & Ceph
Ceph Day Beijing - Storage Modernization with Intel & Ceph
 

Recently uploaded

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 

Recently uploaded (20)

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 

HOW Series: Knights Landing

  • 1. 1 HOW Series: Knights Landing Hands-On Workshop on Performance Optimization for Intel® Xeon Phi™ Processor Family x200 Colfax International — @colfaxintl September 2016 - Rev. 3.0 colfaxresearch.com/knl-webinar Welcome © Colfax International, 2013–2016
  • 2. 2 About This Document This document represents the ma- terials of a Web-based training “In- troduction to 2nd Generation Intel® Xeon Phi™ Processors: Developer’s Guide to Knights Landing” developed and run by Colfax International. © Colfax International, 2013–2016 colfaxresearch.com/knl-webinar/ colfaxresearch.com/knl-webinar About This Document © Colfax International, 2013–2016
  • 3. 3 Disclaimer While best efforts have been used in preparing this training, Colfax International makes no representations or warranties of any kind and assumes no liabilities of any kind with respect to the accuracy or completeness of the contents and specifically disclaims any implied warranties of merchantability or fitness of use for a particular purpose. The publisher shall not be held liable or responsible to any person or entity with respect to any loss or incidental or consequential damages caused, or alleged to have been caused, directly or indirectly, by the information or programs contained herein. No warranty may be created or extended by sales representatives or written sales materials. colfaxresearch.com/knl-webinar About This Document © Colfax International, 2013–2016
  • 6. 6 HOW Series Interested? Sign-up at: colfaxresearch.com/how-series colfaxresearch.com/knl-webinar Resources © Colfax International, 2013–2016
  • 7. 7 Developer Access Program (DAP) Can’t wait to get your hands on Knights Landing? Find out more at dap.xeonphi.com or contact us at dap@colfax-intl.com. colfaxresearch.com/knl-webinar Resources © Colfax International, 2013–2016
  • 8. 8 §2. Intel Architecture: Today and Tomorrow colfaxresearch.com/knl-webinar Intel Architecture: Today and Tomorrow © Colfax International, 2013–2016
  • 9. 9 Intel Architecture colfaxresearch.com/knl-webinar Intel Architecture: Today and Tomorrow © Colfax International, 2013–2016
  • 10. 10 Servers with Intel Xeon Phi Processors Bootable Host Processor RHEL/CentOS/SUSE/Win 64 cores × 4 HT, 1.3 GHz ≤ 384 GiB DDR4, > 90 GB/s 16 GiB HBM, > 400 GB/s PCIe bus for network 4 systems in 2U chassis dap.xeonphi.com colfaxresearch.com/knl-webinar Intel Architecture: Today and Tomorrow © Colfax International, 2013–2016
  • 11. 11 Workstations with Intel Xeon Phi Processors Bootable Host Processor RHEL/CentOS/SUSE/Win 64 cores × 4 HT, 1.3 GHz ≤ 384 GiB DDR4, > 90 GB/s 16 GiB HBM, > 400 GB/s PCIe bus for add-in cards Liquid- or air-cooled dap.xeonphi.com colfaxresearch.com/knl-webinar Intel Architecture: Today and Tomorrow © Colfax International, 2013–2016
  • 12. 12 Future Form-Factors KNLF: KNL with Fabric Fabric integrated on CPU Intel OmniPath Architecture Socket mount processor *KNC image KNL Coprocessor PCIe add-in card Requires host Multiple KNLs in a system colfaxresearch.com/knl-webinar Intel Architecture: Today and Tomorrow © Colfax International, 2013–2016
  • 13. 13 §3. Out-of-the-Box Performance colfaxresearch.com/knl-webinar Out-of-the-Box Performance © Colfax International, 2013–2016
  • 14. 14 Intel® MKL Intel® Math Kernel Library (MKL) — standard mathematical functions optimized for Intel architecture. BLAS LAPACK Sparse solvers ScaLAPACK Linear Algebra Fast Fourier Transform Vector Math Vector Random Number Generators Summary Statistics Data Fitting Multidimentional (up to 7D) FFTW interfaces Cluster FFT Trigonometric Hyperbolic Exponential Logarithmic Power/Root Rounding Congruential Recursive Wichmann-Hill Mersenne Twister Sobol Neiderreiter Non-deterministic Kurtosis Variation coefficitent Quantiles, order statistics Min/max Variance- covariance Splines Interpolation Cell search colfaxresearch.com/knl-webinar Out-of-the-Box Performance © Colfax International, 2013–2016
  • 15. 15 DGEMM in MKL C/C++ 1 #include <mkl.h> 2 ... 3 cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 4 n, n, n, 1.0, A, n, B, n, 0.0, C, n); Python from scipy.linalg.blas import dgemm as dgemm ... C=dgemm(alpha=1.0, a=A, b=B, overwrite_c=1, trans_b=1) R require(Matrix) ... c <- a %*% b colfaxresearch.com/knl-webinar Out-of-the-Box Performance © Colfax International, 2013–2016
  • 16. 16 Python Benchmark N=500 1000 1200 1500 2000 3000 4000 5000 0 500 1000 1500 2000 Performance(GFLOP/s) 10 11 11 11 12 12 12 1260 69 71 72 77 75 73 83 280 701 745 968 1270 1530 1730 1850 colfaxresearch.com DGEMM on Knights Landing Processors CPython, SciPy CPython, NumPy Intel Python, SciPy For more information on KNL with Intel Python, see this paper colfaxresearch.com/knl-webinar Out-of-the-Box Performance © Colfax International, 2013–2016
  • 17. 17 §4. Importance of Code Optimization colfaxresearch.com/knl-webinar Importance of Code Optimization © Colfax International, 2013–2016
  • 18. 18 Python Benchmark What we did not tell you in the previous slide... LU Decomposition Cholesky Decomposition Singular Value Decomposition DGEMM 0 20 40 60 80 100 120 140 160 180 RelativePerformance 1.0 1.0 1.0 1.03.5 3.6 1.1 7.0 29.0 17.0 8.3 154.0 colfaxresearch.com Intel Python on Knights Landing Processors (N=5000) CPython, SciPy CPython, NumPy Intel Python, SciPy It takes good software to unlock the performance of KNL! colfaxresearch.com/knl-webinar Importance of Code Optimization © Colfax International, 2013–2016
  • 19. 19 Optimization Areas Scalar Tuning what goes on in the pipeline? Threading do cores cooperate efficiently? Vectorization is SIMD parallelism used well? Memory is cache usage maximized or RAM access streamlined? Communication can coordination in a distributed or heterogeneous system be improved? colfaxresearch.com/knl-webinar Importance of Code Optimization © Colfax International, 2013–2016
  • 20. 20 Application 1 Astrophysics: planetary systems galaxies cosmological structures 2 Electrostatic systems: molecules crystals This work: “toy model” with all-to-all O n2 algorithm. Practical N-body sim- ulations may use tree algorithms with O nlogn complexity. Source: APOD, credit: Debra Meloy Elmegreen (Vassar College) et al., & the Hubble Heritage Team (AURA/ STScI/ NASA) colfaxresearch.com/knl-webinar Importance of Code Optimization © Colfax International, 2013–2016
  • 21. 21 N-body Simulation on Intel Architecture Details in Chapter 23 of this book Step 0: Initial Step 1: Multi-threaded Step 2: Scalar Tuning Step 3: Vectorized w/SoA Step 4: Memory Optimization 0 500 1000 1500 2000 2500 3000 Performance,GFLOP/s 5.9 160 261 918 976 0.8 124 196 288 1712 2.1 316 443 1612 2875 N-Body Simulation Performance Intel Xeon E5-2697 v3 (Haswell) Intel Xeon Phi 7120A (1st gen, KNC) Intel Xeon Phi 7220 (2nd gen, KNL) The best way to prepare your code for KNL: optimize for Xeon or KNC colfaxresearch.com/knl-webinar Importance of Code Optimization © Colfax International, 2013–2016
  • 22. 22 §5. Details of Architecture colfaxresearch.com/knl-webinar Details of Architecture © Colfax International, 2013–2016
  • 23. 23 Cores in Intel Xeon Phi Processors colfaxresearch.com/knl-webinar Cores in Intel Xeon Phi Processors © Colfax International, 2013–2016
  • 24. 24 KNL Die Organization: Tiles Up to 36 tiles, each with 2 physical cores (72 total). Distributed L2 cache across a mesh interconnect. CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE DDR4CONTROLLER DDR4CONTROLLER CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORECORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE MCDRAM PCIe ≤384GiBsystemDDR4,~90GB/s CORE L2 CORE ≤ 16 GiB on-package MCDRAM, ~ 400 GB/s MCDRAM MCDRAM MCDRAM CORE L2 CORE CORE L2 CORE CORE L2 CORECORE L2 CORE ... 36 TILES 72 CORES colfaxresearch.com/knl-webinar Cores in Intel Xeon Phi Processors © Colfax International, 2013–2016
  • 25. 25 KNL Cores 4-way hyper-threading (up to 4×72 = 288 logical processors) L2 cache shared by 2 cores on a tile. I-cache Decode + Retire Vector ALU Vector ALU Legacy 32 KiB L1 D-cache 1 MiB L2 D-cache I-cache Decode + Retire Vector ALU Vector ALU Legacy 32 KiB L1 D-cache CORE CORETILE to Mesh HW+ SW prefetch HW+ SW prefetch Out-of- order Out-of- order colfaxresearch.com/knl-webinar Cores in Intel Xeon Phi Processors © Colfax International, 2013–2016
  • 26. 26 More Forgiving Cores than First Generation (KNC) Based on Intel® Atom cores (Silvermont microarchitecture) Out-of-order cores: Better latency masking from long latency operations Back-to-back instructions from single thread: Thread count requirement reduced to ≈ 70 (from ≈ 120 for KNC) Advanced branch prediction: Fewer cycles wasted to branch misprediction Generally more forgiving to non-optimized code. colfaxresearch.com/knl-webinar Cores in Intel Xeon Phi Processors © Colfax International, 2013–2016
  • 27. 27 Vector Instruction Support colfaxresearch.com/knl-webinar Vector Instruction Support © Colfax International, 2013–2016
  • 28. 28 Short Vector Support Vector instructions – one of the implementations of SIMD (Single Instruction Multiple Data) parallelism. + = + = + = + = + = VectorLength Scalar Instructions Vector Instructions 4 1 5 0 3 3 -2 8 6 9 -7 2 4 1 5 0 3 3 -2 8 6 9 -7 2 colfaxresearch.com/knl-webinar Vector Instruction Support © Colfax International, 2013–2016
  • 29. 29 Dual VPU Each core on KNL has two Vector Processing Units (VPUs). Penalty from non-vectorized code SP →512 bitregisters/32 bits ×2 VPUs = 32 SIMD lanes DP →512 bitregisters/64 bits ×2 VPUs = 16 SIMD lanes VPU 0 512-bit registers = 16 SP or 8 DP ZMM registers x32 per thread VPU 1 Decode I-Cache 32/16 SIMD lanes (SP/DP) colfaxresearch.com/knl-webinar Vector Instruction Support © Colfax International, 2013–2016
  • 30. 30 Supported Instruction Sets on KNL Intel® Advanced Vector Extensions 512 (AVX-512) 512-bit vector registers. Hardware gather/scatter, DP transcendental functions support and more. Supported by non-Intel compilers like GCC. ≤ Intel® AVX2 Legacy mode operation. Binary compatibility with Xeon. Does not include IMCI (from KNC). E5-2600 (SNB) E5-2600v3 (HSW) KNL (Xeon Phi) AVX-512PF AVX-512ER AVX-512CD AVX-512F BMI AVX2 AVX SSE x87/MMX TSX BMI AVX2 AVX SSE x87/MMX AVX SSE x87/MMX LegacySupport no TSX colfaxresearch.com/knl-webinar Vector Instruction Support © Colfax International, 2013–2016
  • 31. 31 AVX-512 Features Knights Landing: first AVX-512 processor AVX-512F (Fundamentals) Extension of most AVX2 instructions to 512-bit vector registers. AVX-512CD (Conflict Detection) Efficient conflict detection (application: binning). AVX-512ER (Exponential and Reciprocal) Transcendental function (exp, rcp and rsqrt) support. AVX-512PF (Prefetch) Prefetch for scatter and gather. White Paper: http://colfaxresearch.com/avx-512/ colfaxresearch.com/knl-webinar Vector Instruction Support © Colfax International, 2013–2016
  • 32. 32 Performance Considerations Even if your code is vectorized, tuning may unlock more performance. link to paper Providing enough parallelism. More consecutive vector operations required to overcome vectorization latency. Loop pipelining and unrolling. Double the pipeline stages to populate. Better vectorization patterns. Avoid long latency operations with unit-stride and unmasked operations. colfaxresearch.com/knl-webinar Vector Instruction Support © Colfax International, 2013–2016
  • 33. 33 High-Bandwidth Memory colfaxresearch.com/knl-webinar High-Bandwidth Memory © Colfax International, 2013–2016
  • 34. 34 KNL Memory Organization Direct access to on-package MCDRAM and system DDR4 (socket) Use MCDRAM as cache, in flat mode, or as hybrid CORE CORE REGISTERSREGISTERS L1 cache L1 cache L2 cache MCDRAM (on package RAM) DDR4 RAM (system memory) Up to 16 GiB over 400 GB/s Up to 384 GiB ~ 90 GB/s (STREAM) ... more cores Intel Xeon Phi Processor Flat, Cache or Hybrid colfaxresearch.com/knl-webinar High-Bandwidth Memory © Colfax International, 2013–2016
  • 35. 35 MCDRAM Memory Modes Flat Mode MCDRAM treated as a NUMA node Users control what goes to MCDRAM CPU DDR4 Addressable MCDRAM NUMA Node 0 NUMA Node 1 Cache Mode MCDRAM treated as a Last Level Cache (LLC) MCDRAM is used automatically CPU DDR4 MCDRAM as cache Hybrid Mode Combination of Flat and Cache Ratio can be chosen in the BIOS Addressable MCDRAM NUMA Node 0 NUMA Node 1 DDR4 MCDRAM as cache CPU colfaxresearch.com/knl-webinar High-Bandwidth Memory © Colfax International, 2013–2016
  • 36. 36 Running Applications in HBM with numactl Finding information about the NUMA nodes in the system. user@knl% # In Flat mode with All-to-All user@knl% numactl -H available: 2 nodes (0-1) node 0 cpus: ... all cpus ... node 0 size: 98207 MB node 0 free: 94798 MB node 1 cpus: node 1 size: 16384 MB node 1 free: 15991 MB Binding the application to MCDRAM (Flat/Hybrid) user@knl% gcc myapp.c -o runme -mavx512f -O2 user@knl% numactl --membind 1 ./runme // ... Application running on MCDRAM ... // colfaxresearch.com/knl-webinar High-Bandwidth Memory © Colfax International, 2013–2016
  • 37. 37 Allocation in HBM with Memkind Library Manual allocation to MCDRAM possible with hbwmalloc and Memkind Library. 1 #include <hbwmalloc.h> 2 const int n = 1<<10; 3 // Allocation to MCDRAM 4 double* A = (double*) hbw_malloc(sizeof(double)*n); 5 // No replacement for _mm_malloc. Use posix_memalign 6 double* B; 7 int ret = hbw_posix_memalign((void*) B, 64, sizeof(double)*n); 8 ..... 9 // Free with hbw_free 10 hbw_free(A); hbw_free(b); Fortran Allocations. 1 REAL, ALLOCATABLE :: A(:) 2 !DEC$ ATTRIBUTES FASTMEM :: A 3 ALLOCATE (A(1:1024)) colfaxresearch.com/knl-webinar High-Bandwidth Memory © Colfax International, 2013–2016
  • 38. 38 Compilation with Memkind Library and hbwmalloc To compile C/C++ applications: user@knl% icpc -lmemkind foo.cc -o runme user@knl% g++ -lmemkind foo.cc -o runme To compile Fortran applications: user@knl% ifort -lmemkind foo.f90 -o runme user@knl% gfortran -lmemkind foo.f90 -o runme Open source distribution of Memkind library can be found at: http://memkind.github.io/memkind/ Colfax Publication on using MCDRAM and Memkind Library: http://colfaxresearch.com/knl-mcdram/ colfaxresearch.com/knl-webinar High-Bandwidth Memory © Colfax International, 2013–2016
  • 39. 39 Flow Chart for Bandwidth-Bound Applications Does your Application fit in 16 GB? Can you partition ≤16 GB of BW-critical memory? No Yes Yes No Start numactl Memkind Cache mode Simply run the whole program in MCDRAM No code modification required Manually allocate BW-critical memory to MCDRAM Memkind calls need to be added. Allow the OS to figure out how to use MCDRAM No code modification required colfaxresearch.com/knl-webinar High-Bandwidth Memory © Colfax International, 2013–2016
  • 40. 40 Clustering Modes on KNL colfaxresearch.com/knl-webinar Clustering Modes on KNL © Colfax International, 2013–2016
  • 41. 41 Clustering Modes: All-to-All No affinity between the distributed Tag Directory (TD) and memory. tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD NUMAnode0 MCDRAMMCDRAM MCDRAMMCDRAM colfaxresearch.com/knl-webinar Clustering Modes on KNL © Colfax International, 2013–2016
  • 42. 42 Clustering Modes: Quadrant/Hemisphere Tag Directory (TD) and memory reside in the same quadrant. tile TD MCDRAM MCDRAMMCDRAM MCDRAM tile TD tile TD tile TD NUMAnode0 tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD colfaxresearch.com/knl-webinar Clustering Modes on KNL © Colfax International, 2013–2016
  • 43. 43 Clustering Modes: SNC-4/SNC-2 Will appear as 4 NUMA nodes (similar to quad-socket system). tile TD MCDRAM MCDRAMMCDRAM MCDRAM tile TD tile TD tile TD NUMAnode0 NUMAnode1 NUMAnode2 NUMAnode3 tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD colfaxresearch.com/knl-webinar Clustering Modes on KNL © Colfax International, 2013–2016
  • 44. 44 How to Use Clustering modes Thread affinity to match the memory/directory affinity. Nested OpenMP 1 #pragma omp parallel 2 { 3 // ... 4 #pragma omp parallel 5 { 6 // ... 7 } 8 } OpenMP Threads Quadrants KNL user@knl% OMP_NUM_THREADS=4,72 user@knl% OMP_NESTED=1 MPI+OpenMP 1 stat = MPI_Init(); 2 // ... 3 #pragma omp parallel 4 { 5 // ... 6 } 7 // ... 8 MPI_Finalize(); NUMA node 0 MPI rank 0 OpenMP NUMA node 1 MPI rank 1 OpenMP NUMA node 2 MPI rank 2 OpenMP NUMA node 3 MPI rank 3 OpenMP user@knl% mpirun -host knl > -np 4 ./myparallel_app White Paper: http://colfaxresearch.com/knl-numa/ colfaxresearch.com/knl-webinar Clustering Modes on KNL © Colfax International, 2013–2016
  • 45. 45 “Get Ready for Intel’s Knights Landing” Guide colfaxresearch.com/knl-webinar “Get Ready for Intel’s Knights Landing” Guide © Colfax International, 2013–2016
  • 46. 46 Get Ready For Intel® Xeon Phi processors (Codenamed: Knights Landing) colfaxresearch.com/knl-ready colfaxresearch.com/knl-webinar “Get Ready for Intel’s Knights Landing” Guide © Colfax International, 2013–2016
  • 47. 47 §6. Hands-On Demonstrations colfaxresearch.com/knl-webinar Hands-On Demonstrations © Colfax International, 2013–2016
  • 48. 48 Running the STREAM Benchmark colfaxresearch.com/knl-webinar Running the STREAM Benchmark © Colfax International, 2013–2016
  • 49. 49 STREAM Benchmark Industry-standard tool for memory bandwidth measurement 4 tests: COPY, ADD, SCALE and TRIAD Download from Dr. John McCalpin’s site: www.cs.virginia.edu/stream/ Expected for KNL: ≈400 GB/s colfaxresearch.com/knl-webinar Running the STREAM Benchmark © Colfax International, 2013–2016
  • 50. 50 STREAM Benchmark on Tuning Set 1 thread per core on all platforms (-1 for offload) Set affinity “scatter”: white paper (Colfax Research) Tune prefetching: white paper (Intel) In addition, secret sauce for your own STREAM-like application: Parallel first touch (see NUMA discussion in Session 08) Essential element – streaming stores: discussion colfaxresearch.com/knl-webinar Running the STREAM Benchmark © Colfax International, 2013–2016
  • 51. 51 HBM in Knights Landing Finding HBM (MCDRAM) in an Intel Xeon Phi processor x200 (KNL): user@knl% # In Flat mode with All-to-All or Quadrant user@knl% numactl -H available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ... 249 250 251 252 253 254 255 node 0 size: 98207 MB node 0 free: 94798 MB node 1 cpus: node 1 size: 16384 MB node 1 free: 15991 MB Binding the application to HBM (Flat/Hybrid) user@knl% numactl --membind 1 ./runme // ... Application running in HBM ... // colfaxresearch.com/knl-webinar Running the STREAM Benchmark © Colfax International, 2013–2016
  • 52. 52 Vectorization with AVX-512 colfaxresearch.com/knl-webinar Vectorization with AVX-512 © Colfax International, 2013–2016
  • 53. 53 Demo: AVX-512CD and Histograms 1 // worker.cc from Colfax lab 4.04 2 void Histogram(const float* age, int* const hist, 3 const int n, const float group_width, 4 const int m) { 5 6 for (int i = 0; i < n; i++) { 7 const int j = (int) ( age[i] / group_width ); 8 hist[j]++; 9 } 10 } 56.8 52.1 21.1 46.9 23.8 16.5 96.3 19.4 20.0 20.0 20.0 20.0 20.0 20.0 20.0 20.0 2.84 2.60 1.05 2.34 1.19 0.82 4.81 0.97 int 2 2 1 2 1 0 4 0 0 1 2 3 4hist: = ÷ user@knl% cat worker.optrpt .... remark ...: vectorization support: scatter was generated for the variable hist: remark ...: vectorization support: gather was generated for the variable hist: remark #15300: LOOP WAS VECTORIZED colfaxresearch.com/knl-webinar Vectorization with AVX-512 © Colfax International, 2013–2016
  • 54. 54 Intel Compiler support for AVX-512 Intel Compiler versions ≥ 15.0 supports AVX-512 instruction set. user@knl% icc -v icc version 16.0.1 (gcc version 4.8.5 compatibility) user@knl% icc -help // ... truncated output ... // -x<code> ... MIC-AVX512 CORE-AVX512 COMMON-AVX512 -xMIC-AVX512 : for KNL (supports F, CD, ER, PF) -xCORE-AVX512 : for future Xeon (supports F, CD, DQ, BW, VL) -xCOMMON-AVX512 : common to KNL and Xeon (supports F, CD) colfaxresearch.com/knl-webinar Vectorization with AVX-512 © Colfax International, 2013–2016
  • 55. 55 Strip-Mining for Vectorization Vectorization improves performance on both platforms. However, more work is needed to take advantage of the MIC architecture. See materials on multi-threading. Scalar Serial Code Strength Reduction Vectorized Serial Code (Strip-Mining) 0.0 0.2 0.4 0.6 0.8 1.0 Performance,billionvalues/s(higherisbetter) 0.373 0.588 0.937 0.017 0.045 0.1260.105 0.119 0.117 Intel Xeon Processor E5-2697 V2 Intel Xeon Phi Coprocessor 7120P (KNC) Intel Xeon Phi Processor 7210 (KNL) colfaxresearch.com/knl-webinar Vectorization with AVX-512 © Colfax International, 2013–2016
  • 56. 56 Using Reduction instead of Synchronization Vectorized Serial Code (Strip-Mining) Vectorized Parallel Code (Atomic Operations) Vectorized Parallel Code (Private Variables) 0 5 10 15 20 Performance,billionvalues/s(higherisbetter) 0.937 0.046 11.2 0.126 0.035 13.9 0.117 0.017 16.4 Intel Xeon Processor E5-2697 V2 Intel Xeon Phi Coprocessor 7120P (KNC) Intel Xeon Phi Processor 7210 (KNL) colfaxresearch.com/knl-webinar Vectorization with AVX-512 © Colfax International, 2013–2016
  • 57. 57 Memory Traffic Optimization Vectorized Parallel Code (Private Variables) Parallel Initialization (First-Touch Allocation) 0 5 10 15 20 25 Performance,billionvalues/s(higherisbetter) 11.2 22.1 13.9 13.6 16.4 16.4 Intel Xeon processor E5-2697 V2 Intel Xeon Phi coprocessor 7120P (KNC) Intel Xeon Phi processor 7210 (KNL) colfaxresearch.com/knl-webinar Vectorization with AVX-512 © Colfax International, 2013–2016
  • 58. 58 Tuning with Intel Math Kernel Library colfaxresearch.com/knl-webinar Tuning with Intel Math Kernel Library © Colfax International, 2013–2016
  • 59. 59 Batch Processing Feed multiple signals to MKL Let the library decide on the parallel strategy. 1 MKL_Complex16* data = 2 (MKL_Complex16*) malloc(sizeof(MKL_Complex16)*fft_size*num_fft); 3 4 DFTI_DESCRIPTOR_HANDLE handle; 5 DftiCreateDescriptor(&handle, DFTI_DOUBLE, DFTI_COMPLEX, 1, (MKL_LONG)fft_size); 6 DftiSetValue(handle, DFTI_NUMBER_OF_TRANSFORMS, num_fft); 7 DftiSetValue(handle, DFTI_INPUT_DISTANCE, fft_size); 8 DftiSetValue(handle, DFTI_OUTPUT_DISTANCE, fft_size); 9 DftiSetValue(handle, DFTI_PLACEMENT, DFTI_INPLACE); 10 DftiCommitDescriptor(handle); colfaxresearch.com/knl-webinar Tuning with Intel Math Kernel Library © Colfax International, 2013–2016
  • 60. 60 Nested Parallelism ... ... ... Fine-grained parallelism Coarse-grained parallelism Nested parallelism throwing all threads on one MKL function using one thread per MKL function putting teams of threads on several MKL functions colfaxresearch.com/knl-webinar Tuning with Intel Math Kernel Library © Colfax International, 2013–2016
  • 61. 61 Thread Affinity Tuning Xeon Xeon Phi MKL uses 1 thread/core MKL uses 4 threads/core OMP_NUM_THREADS=#phys. cores OMP_NUM_THREADS=4×#cores KMP_AFFINITY=scatter KMP_AFFINITY=scatter KMP_AFFINITY=compact (HT off) KMP_AFFINITY=compact KMP_AFFINITY=compact,1 (HT on) See also HOW series Session 8. colfaxresearch.com/knl-webinar Tuning with Intel Math Kernel Library © Colfax International, 2013–2016
  • 62. 62 Thread Affinity: Scatter Pattern Generally beneficial for bandwidth-bound applications. HT 0 HT 1 HT 0 HT 1 Core 0 Core 1 Socket 0 HT 0 HT 1 HT 0 HT 1 Core 0 Core 1 Socket 1 Threads: Cores: 0 2 1 3 HT 0 HT 1 HT 0 HT 1 Core 0 Core 1 Socket 0 HT 0 HT 1 HT 0 HT 1 Core 0 Core 1 Socket 1 Threads: Cores: 0 HT 0 HT 1 HT 0 HT 1 Core 0 Xeon Phi chip HT 0 HT 1 HT 0 HT 1 Core 1 Threads: Cores: 0 61 122 183 1 62 123 184 ... MKL on Xeon: MKL on Xeon Phi: KMP_AFFINITY=scatter KMP_AFFINITY=scatter colfaxresearch.com/knl-webinar Tuning with Intel Math Kernel Library © Colfax International, 2013–2016
  • 63. 63 Data Container Tweaks Parallel first-touch (see also HOW series Session 8): 1 #pragma omp parallel for 2 for(int i = 0; i < num_fft; i++) 3 for(int j = 0; j < fft_size; j++) { 4 data[i*fft_size+j].real = cos(k*j); 5 data[i*fft_size+j].imag = sin(k*j); 6 } Special allocator (alignment – see also HOW series Session 6 and allocation in HBM on KNL) 1 MKL_Complex16* data = 2 (MKL_Complex16*) mkl_malloc(sizeof(MKL_Complex16)*fft_size*num_fft, 64); colfaxresearch.com/knl-webinar Tuning with Intel Math Kernel Library © Colfax International, 2013–2016
  • 64. 64 Complex-to-Complex 1D FFT Performance Initial Batch Mode Affinity Tuning Parallel First Touch MKL Allocator 0 50 100 150 200 250 300 350 400 450 DoublePrecisionGFLOP/s 10 56 65 145 156 1 148 155 155 239 5 113 113 113 383 1D Fast Fourier Transform Performance Intel Xeon processor E5-2697 v3 Intel Xeon Phi coprocessor 7120P Intel Xeon Phi processor 7210 5colfaxresearch.com/knl-webinar Tuning with Intel Math Kernel Library © Colfax International, 2013–2016
  • 65. 65 Machine Learning on Intel Architecture colfaxresearch.com/knl-webinar Machine Learning on Intel Architecture © Colfax International, 2013–2016
  • 66. 66 Machine Learning: Optimized Middleware http://colfaxresearch.com/isc16-neuraltalk colfaxresearch.com/knl-webinar Machine Learning on Intel Architecture © Colfax International, 2013–2016
  • 67. 67 More Information: “HOW” Series colfaxresearch.com/knl-webinar More Information: “HOW” Series © Colfax International, 2013–2016
  • 68. 68 HOW Series Need more tutorials? Sign-up at: colfaxresearch.com/how-series colfaxresearch.com/knl-webinar More Information: “HOW” Series © Colfax International, 2013–2016
  • 69. 69 §7. Closing Words colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016
  • 70. 70 Bottom Line KNL is a highly-parallel successor of KNC, with improvements in both performance and ease-of-use. KEEP CALM GO PARALLEL AND colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016
  • 71. 71 Get Ready For Intel® Xeon Phi processors (Codenamed: Knights Landing) colfaxresearch.com/knl-ready colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016
  • 73. 73 Developer Access Program (DAP) Can’t wait to get your hands on Knights Landing? Find out more at dap.xeonphi.com or contact us at dap@colfax-intl.com. colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016
  • 74. 74 Thank you for Tuning In! colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016
  • 75. 75 §8. Back-up Slides colfaxresearch.com/knl-webinar Back-up Slides © Colfax International, 2013–2016
  • 76. 76 Loop Collapse: Principle Idea: combine iterations spaces of the inner loop and the outer loop. 1 #pragma omp parallel for collapse(2) 2 for (int i = 0; i < m; i++) 3 for (int j = 0; j < n; j++) { 4 // ... 5 // ... 6 } 1 #pragma omp parallel for 2 for (int c = 0; c < m*n; c++) { 3 i = c / n; 4 j = c % n; 5 // ... 6 } colfaxresearch.com/knl-webinar Back-up Slides © Colfax International, 2013–2016
  • 77. 77 Exposing Parallelism: Strip-Mining and Loop Collapse 1 void sum_stripmine(const int m, const int n, long* M, long* s){ 2 const int STRIP=1024; 3 assert(n%STRIP==0); 4 s[0:m]=0; 5 #pragma omp parallel 6 { 7 long total[m]; total[0:m]=0; 8 #pragma omp for collapse(2) schedule(guided) 9 for (int i=0; i<m; i++) 10 for (int jj=0; jj<n; jj+=STRIP) 11 #pragma vector aligned 12 for (int j=jj; j<jj+STRIP; j++) 13 total[i]+=M[i*n+j]; 14 for (int i=0; i<m; i++) // Reduction 15 #pragma omp atomic 16 s[i]+=total[i]; 17 } } colfaxresearch.com/knl-webinar Back-up Slides © Colfax International, 2013–2016
  • 78. 78 Checking for AVX-512 support Check /proc/cpuinfo for AVX-512 flags user@knl% cat /proc/cpuinfo flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer xsave avx f16c rdrand lahf_lm abm 3dnowprefetch arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms avx512f rdseed adx avx512pf avx512er avx512cd Also possible through C/C++ function calls: see this blog. colfaxresearch.com/knl-webinar Back-up Slides © Colfax International, 2013–2016
  • 79. 79 Using AVX-512: Two Approaches Automatic Vectorization: Vectorization w/ compiler. Portable: just recompile. Tuning with directives. 1 double A[vec_width], B[vec_width]; 2 // ... 3 4 5 // This loop will be auto-vectorized 6 for(int i = 0; i < vec_width; i++) 7 A[i]+=B[i]; 1 double A[vec_width], B[vec_width]; 2 // ... 3 // This is explicitly vectorized 4 __m512d A_vec = _mm512_load_pd(A); 5 __m512d B_vec = _mm512_load_pd(B); 6 A_vec = _mm512_add_pd(A_vec,B_vec); 7 _mm512_store_pd(A,A_vec); Explicit Vectorization: Vectorization w/ intrinsics Full control over instructions Limited portability colfaxresearch.com/knl-webinar Back-up Slides © Colfax International, 2013–2016
  • 80. 80 GCC support for AVX-512 GCC ≥ 4.9.1 supports AVX-512 instruction set. user@knl% g++ -v gcc version 4.9.2 (GCC) user@knl% g++ foo.cc -mavx512f -mavx512er -mavx512cd -mavx512pf Basic automatic vectorization support: add -O2 or -O3. 1 // ... foo.cc ... // 2 for(int i = 0; i < n; i++) 3 B[i] = A[i] + B[i]; user@knl% g++ -s foo.cc -mavx512f -03 user@knl% cat foo.s ... vmovapd -16432(%rbp,%rax), %zmm0 vaddpd -8240(%rbp,%rax), %zmm0, %zmm0 vmovapd %zmm0, -8240(%rbp,%rax) colfaxresearch.com/knl-webinar Back-up Slides © Colfax International, 2013–2016
  • 81. 81 HOW Series Interested? Sign-up at: colfaxresearch.com/how-series colfaxresearch.com/knl-webinar Back-up Slides © Colfax International, 2013–2016
  • 82. 82 Textbook ISBN: 978-0-9885234-0-1 (508 pages, Electronic or Print) Parallel Programming and Optimization with Intel® Xeon Phi™ Coprocessors Handbook on the Development and Optimization of Parallel Applications for Intel® Xeon® Processors and Intel® Xeon Phi™ Coprocessors © Colfax International, 2015 PARALLEL PROGRAMMING AND OPTIMIZATION WITH HANDBOOKONTHE DEVELOPMENTAND OPTIMIZATIONOF PARALLEL APPLICATIONSFOR INTEL XEON PROCESSORS ANDINTEL XEONPHI COPROCESSORS INTEL XEON PHI COPROCESSORS TMR SECONDEDITION C O L F A X I N T E R N A T I O N A L ANDREY VLADIMIROV | RYO ASAI | VADIM KARPUSENKO http://xeonphi.com/book colfaxresearch.com/knl-webinar Back-up Slides © Colfax International, 2013–2016