SlideShare a Scribd company logo
Developer’s Guide to Knights Landing
Introduction to 2nd Generation
Intel®
Xeon Phi™
Processors
Colfax International — @colfaxintl
May 2016 - Rev. 1.2
colfaxresearch.com/knl-webinar Welcome © Colfax International, 2013–2016
About This Document
This document represents the ma-
terials of a Web-based training “In-
troduction to 2nd Generation Intel®
Xeon Phi™
Processors: Developer’s
Guide to Knights Landing” developed
and run by Colfax International.
© Colfax International, 2013-2016
colfaxresearch.com/knl-webinar/
colfaxresearch.com/knl-webinar About This Document © Colfax International, 2013–2016
Disclaimer
While best efforts have been used in preparing this training, Colfax International makes no
representations or warranties of any kind and assumes no liabilities of any kind with respect to
the accuracy or completeness of the contents and specifically disclaims any implied warranties
of merchantability or fitness of use for a particular purpose. The publisher shall not be held
liable or responsible to any person or entity with respect to any loss or incidental or
consequential damages caused, or alleged to have been caused, directly or indirectly, by the
information or programs contained herein. No warranty may be created or extended by sales
representatives or written sales materials.
colfaxresearch.com/knl-webinar About This Document © Colfax International, 2013–2016
Resources
colfaxresearch.com/knl-webinar Resources © Colfax International, 2013–2016
Colfax Research
http://colfaxresearch.com/
(already registered? get $10 off the book)
colfaxresearch.com/knl-webinar Resources © Colfax International, 2013–2016
HOW Series
Interested? Sign-up at:
colfaxresearch.com/how-series
colfaxresearch.com/knl-webinar Resources © Colfax International, 2013–2016
Developer Access Program (DAP)
Can’t wait to get your hands on Knights Landing?
Find out more at dap.xeonphi.com
or contact us at dap@colfax-intl.com.
colfaxresearch.com/knl-webinar Resources © Colfax International, 2013–2016
Get Ready For KNL
http://colfaxresearch.com/get-ready-for-intel-knights-landing-3-papers/
colfaxresearch.com/knl-webinar Resources © Colfax International, 2013–2016
§2. Intel Architecture: Today and
Tomorrow
colfaxresearch.com/knl-webinar Intel Architecture: Today and Tomorrow © Colfax International, 2013–2016
Intel Architecture
colfaxresearch.com/knl-webinar Intel Architecture: Today and Tomorrow © Colfax International, 2013–2016
Intel Xeon Phi Processors (2nd Gen)
2nd Generation of Intel Many Integrated Core (MIC) Architecture.
Specialized platform for demanding computing applications.
Bootable host proccesor or
coprocessor
3+ TFLOP/s DP
6+ TFLOP/s SP
Up to 16 GiB MCDRAM
MCDRAM bandwidth ≈5x DDR4
Binary compatible with Intel Xeon
Public disclosures
colfaxresearch.com/knl-webinar Intel Architecture: Today and Tomorrow © Colfax International, 2013–2016
Standard CPU Form Factor
Bootable Host Processor
No need for “host”. OS runs
on KNL processor.
Supports common OS.
No more PCIe bottleneck!
Direct access to ≤ 384 GiB
DDR4 RAM
Up to ≈90 GB/s DDR4
bandwidth
Access to PCIe Bus
colfaxresearch.com/knl-webinar Intel Architecture: Today and Tomorrow © Colfax International, 2013–2016
Future Releases
KNLF: KNL with Fabric
Fabric integrated on CPU
Intel OmniPath Architecture
Socket mount processor
*KNC image
KNL Coprocessor
PCIe add-in card
Requires host
Multiple KNLs in a system
colfaxresearch.com/knl-webinar Intel Architecture: Today and Tomorrow © Colfax International, 2013–2016
§3. Cores and Threading
colfaxresearch.com/knl-webinar Cores and Threading © Colfax International, 2013–2016
Importance of Parallelism
If KNL was a steam engine...
It would have a lot of furnaces.
Single-threaded code on KNL
is like having 1 fireman service 72 steam engines
colfaxresearch.com/knl-webinar Cores and Threading © Colfax International, 2013–2016
Implementing Multi-threading
You have to use a core to benefit from it!
Threading frameworks:
OpenMP
TBB
Cilk Plus
Pthreads
Hybrid with MPI
Any others supported by
Xeon Processors
OpenMP Threads
Quadrants
KNL
colfaxresearch.com/knl-webinar Cores and Threading © Colfax International, 2013–2016
Features of KNL Cores
colfaxresearch.com/knl-webinar Features of KNL Cores © Colfax International, 2013–2016
KNL Die Organization: Tiles
Up to 36 tiles, each with 2 physical cores (72 total).
Distributed L2 cache across a mesh interconnect.
CORE
L2
CORE CORE
L2
CORE CORE
L2
CORE CORE
L2
CORE CORE
L2
CORE
DDR4CONTROLLER
DDR4CONTROLLER
CORE
L2
CORE
CORE
L2
CORE
CORE
L2
CORE
CORE
L2
CORE
CORE
L2
CORE
CORE
L2
CORE
CORE
L2
CORE CORE
L2
CORECORE
L2
CORE CORE
L2
CORE
CORE
L2
CORE
CORE
L2
CORE
CORE
L2
CORE CORE
L2
CORE
MCDRAM PCIe
≤384GiBsystemDDR4,~90GB/s
CORE
L2
CORE
≤ 16 GiB on-package MCDRAM, ~ 400 GB/s
MCDRAM
MCDRAM MCDRAM
CORE
L2
CORE CORE
L2
CORE
CORE
L2
CORECORE
L2
CORE
...
36 TILES
72 CORES
colfaxresearch.com/knl-webinar Features of KNL Cores © Colfax International, 2013–2016
KNL Cores
4-way hyper-threading (up to 4×72 = 288 logical processors)
L2 cache shared by 2 cores on a tile.
I-cache
Decode
+ Retire
Vector
ALU
Vector
ALU
Legacy
32 KiB L1 D-cache
1 MiB
L2
D-cache
I-cache
Decode
+ Retire
Vector
ALU
Vector
ALU
Legacy
32 KiB L1 D-cache
CORE CORETILE
to Mesh
HW+ SW
prefetch
HW+ SW
prefetch
Out-of-
order
Out-of-
order
colfaxresearch.com/knl-webinar Features of KNL Cores © Colfax International, 2013–2016
More Forgiving Cores than First Generation (KNC)
Based on Intel®
Atom cores (Silvermont microarchitecture)
Out-of-order cores:
Better latency masking from long latency operations
Back-to-back instructions from single thread:
Thread count requirement reduced to ≈ 70 (from ≈ 120 for KNC)
Advanced branch prediction:
Fewer cycles wasted to branch misprediction
Generally more forgiving to non-optimized code.
colfaxresearch.com/knl-webinar Features of KNL Cores © Colfax International, 2013–2016
Performance Considerations
colfaxresearch.com/knl-webinar Performance Considerations © Colfax International, 2013–2016
Affinity and cpuinfo
Neighboring threads often work on nearby/same locations in memory.
→ Use thread pinning to have them share L2 cache!
You can use cpuinfo (part of Intel®
MPI) to find which cores share cache.
user@knl% cpuinfo
// ... cpuinfo output ... //
L2 1 MB (0,1,64,65,128,129,192,193)(2,3,66,67,130,131,194,195)(4,5,68, ...
Use KMP_AFFINITY for Intel Compilers or OMP_PROC_BIND for GCC compilers.
user@knl% export KMP_AFFINITY=compact
user@knl% export OMP_PROC_BIND=close
colfaxresearch.com/knl-webinar Performance Considerations © Colfax International, 2013–2016
Tuning Threading
Even if you do have multi-threaded code, look out for issues such as...
Insufficient parallelism Load imbalance
Learn more at: colfaxresearch.com/how-series
colfaxresearch.com/knl-webinar Performance Considerations © Colfax International, 2013–2016
§4. Vectorization
colfaxresearch.com/knl-webinar Vectorization © Colfax International, 2013–2016
Scalar Code and Our Fireman
Not using vector instructions in KNL
is like feeding a steam engine with a spoon.
colfaxresearch.com/knl-webinar Vectorization © Colfax International, 2013–2016
Short Vector Support
Vector instructions – one of the implementations of SIMD (Single
Instruction Multiple Data) parallelism.
+ =
+ =
+ =
+ =
+ =
VectorLength
Scalar Instructions Vector Instructions
4 1 5
0 3 3
-2 8 6
9 -7 2
4 1 5
0 3 3
-2 8 6
9 -7 2
colfaxresearch.com/knl-webinar Vectorization © Colfax International, 2013–2016
Vector Instructions on KNL
colfaxresearch.com/knl-webinar Vector Instructions on KNL © Colfax International, 2013–2016
Dual VPU
Each core on KNL has two Vector Processing Units (VPUs).
Penalty from non-vectorized code
SP →512 bitregisters/32 bits ×2 VPUs = 32 SIMD lanes
DP →512 bitregisters/64 bits ×2 VPUs = 16 SIMD lanes
VPU 0
512-bit registers = 16 SP or 8 DP
ZMM registers
x32 per thread VPU 1
Decode
I-Cache
32/16 SIMD lanes (SP/DP)
colfaxresearch.com/knl-webinar Vector Instructions on KNL © Colfax International, 2013–2016
Supported Instruction Sets on KNL
Intel®
Advanced Vector
Extensions 512 (AVX-512)
512-bit vector registers.
Hardware gather/scatter, DP
transcendental functions support
and more.
Supported by non-Intel compilers
like GCC.
≤ Intel®
AVX2
Legacy mode operation.
Binary compatibility with Xeon.
Does not include IMCI (from KNC).
E5-2600
(SNB)
E5-2600v3
(HSW)
KNL
(Xeon Phi)
AVX-512PF
AVX-512ER
AVX-512CD
AVX-512F
BMI
AVX2
AVX
SSE
x87/MMX
TSX
BMI
AVX2
AVX
SSE
x87/MMX
AVX
SSE
x87/MMX
LegacySupport
no TSX
colfaxresearch.com/knl-webinar Vector Instructions on KNL © Colfax International, 2013–2016
AVX-512 Features
Knights Landing: first AVX-512 processor
AVX-512F (Fundamentals)
Extension of most AVX2 instructions to 512-bit vector registers.
AVX-512CD (Conflict Detection)
Efficient conflict detection (application: binning).
AVX-512ER (Exponential and Reciprocal)
Transcendental function (exp, rcp and rsqrt) support.
AVX-512PF (Prefetch)
Prefetch for scatter and gather.
colfaxresearch.com/knl-webinar Vector Instructions on KNL © Colfax International, 2013–2016
Checking for AVX-512 support
Check /proc/cpuinfo for AVX-512 flags
user@knl% cat /proc/cpuinfo
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36
clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm
constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc
aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 fma
cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer xsave avx
f16c rdrand lahf_lm abm 3dnowprefetch arat epb xsaveopt pln pts dtherm
tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2
erms avx512f rdseed adx avx512pf avx512er avx512cd
Also possible through C/C++ function calls: see this blog.
colfaxresearch.com/knl-webinar Vector Instructions on KNL © Colfax International, 2013–2016
AVX-512CD: Histograms
1 // worker.cc from Colfax lab 4.04
2 void Histogram(const float* age, int* const hist,
3 const int n, const float group_width,
4 const int m) {
5
6 for (int i = 0; i < n; i++) {
7 const int j = (int) ( age[i] / group_width );
8 hist[j]++;
9 }
10 }
56.8 52.1 21.1 46.9 23.8 16.5 96.3 19.4
20.0 20.0 20.0 20.0 20.0 20.0 20.0 20.0
2.84 2.60 1.05 2.34 1.19 0.82 4.81 0.97
int
2 2 1 2 1 0 4 0
0 1 2 3 4hist:
=
÷
user@knl% cat worker.optrpt
....
remark ...: vectorization support: scatter was generated for the variable hist:
remark ...: vectorization support: gather was generated for the variable hist:
remark #15300: LOOP WAS VECTORIZED
colfaxresearch.com/knl-webinar Vector Instructions on KNL © Colfax International, 2013–2016
AVX-512ER: Transcendental Functions
Support for transcendental
functions.
Exponential.
Reciprocal.
Inverse Square Root.
Better precision.
Double Precision.
Max relative error:
2−23
(exp)
2−28
(rcp and rsqrt)
Source: APOD
colfaxresearch.com/knl-webinar Vector Instructions on KNL © Colfax International, 2013–2016
Programming Considerations
colfaxresearch.com/knl-webinar Programming Considerations © Colfax International, 2013–2016
Using AVX-512: Two Approaches
Automatic Vectorization:
Vectorization w/ compiler.
Portable: just recompile.
Tuning with directives.
1 double A[vec_width], B[vec_width];
2 // ...
3
4
5 // This loop will be auto-vectorized
6 for(int i = 0; i < vec_width; i++)
7 A[i]+=B[i];
1 double A[vec_width], B[vec_width];
2 // ...
3 // This is explicitly vectorized
4 __m512d A_vec = _mm512_load_pd(A);
5 __m512d B_vec = _mm512_load_pd(B);
6 A_vec = _mm512_add_pd(A_vec,B_vec);
7 _mm512_store_pd(A,A_vec);
Explicit Vectorization:
Vectorization w/ intrinsics
Full control over instructions
Limited portability
colfaxresearch.com/knl-webinar Programming Considerations © Colfax International, 2013–2016
Intel Compiler support for AVX-512
Intel Compiler versions ≥ 15.0 supports AVX-512 instruction set.
user@knl% icc -v
icc version 16.0.1 (gcc version 4.8.5 compatibility)
user@knl% icc -help
// ... truncated output ... //
-x<code>
...
MIC-AVX512
CORE-AVX512
COMMON-AVX512
-xMIC-AVX512 : for KNL (supports F, CD, ER, PF)
-xCORE-AVX512 : for future Xeon (supports F, CD, DQ, BW, VL)
-xCOMMON-AVX512 : common to KNL and Xeon (supports F, CD)
colfaxresearch.com/knl-webinar Programming Considerations © Colfax International, 2013–2016
GCC support for AVX-512
GCC ≥ 4.9.1 supports AVX-512 instruction set.
user@knl% g++ -v
gcc version 4.9.2 (GCC)
user@knl% g++ foo.cc -mavx512f -mavx512er -mavx512cd -mavx512pf
Basic automatic vectorization support: add -O2 or -O3.
1 // ... foo.cc ... //
2 for(int i = 0; i < n; i++)
3 B[i] = A[i] + B[i];
user@knl% g++ -s foo.cc -mavx512f -03
user@knl% cat foo.s
...
vmovapd -16432(%rbp,%rax), %zmm0
vaddpd -8240(%rbp,%rax), %zmm0, %zmm0
vmovapd %zmm0, -8240(%rbp,%rax)
colfaxresearch.com/knl-webinar Programming Considerations © Colfax International, 2013–2016
Performance Considerations
Even if your code is vectorized, tuning may unlock more performance.
Providing enough parallelism.
More consecutive vector
operations required to
overcome vectorization latency.
Loop pipelining and unrolling.
Double the pipeline stages to
populate.
Better vectorization patterns.
Avoid long latency operations
with unit-stride and unmasked
operations.
colfaxresearch.com/knl-webinar Programming Considerations © Colfax International, 2013–2016
§5. Memory Architecture
colfaxresearch.com/knl-webinar Memory Architecture © Colfax International, 2013–2016
Where’s the Fuel?
Cores can’t do work if the data is not there.
Source: wikipedia
KNL memory must be used efficiently to feed the cores.
colfaxresearch.com/knl-webinar Memory Architecture © Colfax International, 2013–2016
MCDRAM on KNL
colfaxresearch.com/knl-webinar MCDRAM on KNL © Colfax International, 2013–2016
KNL Memory Organization
Direct access to on package MCDRAM and system DDR4 (socket)
Use MCDRAM as cache, in flat mode, or as hybrid
CORE
CORE
REGISTERSREGISTERS
L1
cache
L1
cache
L2
cache
MCDRAM
(on package
RAM)
DDR4
RAM
(system memory)
Up to 16 GiB
over 400 GB/s
Up to 384 GiB
~ 90 GB/s (STREAM)
...
more cores
Intel Xeon Phi
Processor
Flat,
Cache
or
Hybrid
colfaxresearch.com/knl-webinar MCDRAM on KNL © Colfax International, 2013–2016
MCDRAM Memory Modes
Flat Mode
MCDRAM treated as a
NUMA node
Users control what
goes to MCDRAM
CPU
DDR4
Flat
MCDRAM
NUMA
NODE 0
NUMA
NODE 1
Cache Mode
MCDRAM treated as a
Last Level Cache (LLC)
MCDRAM is used
automatically
CPU
DDR4
Cache
MCDRAM
NUMA
NODE 0
Hybrid Mode
Combination of Flat
and Cache
Ratio can be chosen in
the BIOS
CPU
DDR4
NUMA
NODE 0
NUMA
NODE 1
Flat
MCDRAM
Cache
MCDRAM
colfaxresearch.com/knl-webinar MCDRAM on KNL © Colfax International, 2013–2016
MCDRAM Performance
Copyright © 2015, Intel Corporation. All rights reserved Avinash Sodani ISC 2015 Intel® Xeon Phi ™ Workshop.
DDR and MCDRAM Bandwidth vs. Latency
Bandwidth
Latency
MCDRAM
DDR
DDR BW Limit MCDRAM BW Limit
Conceptual diagram
MCDRAM latency more than DDR at low loads but much less at high loads
Source: Intel, ISC 2015 KNL keynote (pdf)
colfaxresearch.com/knl-webinar MCDRAM on KNL © Colfax International, 2013–2016
Programming with MCDRAM
colfaxresearch.com/knl-webinar Programming with MCDRAM © Colfax International, 2013–2016
Numactl
Finding information about the NUMA nodes in the system.
user@knl% # In Flat mode with All-to-All
user@knl% numactl -H
available: 2 nodes (0-1)
node 0 cpus: ... all cpus ...
node 0 size: 98207 MB
node 0 free: 94798 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 15991 MB
Binding the application to MCDRAM (Flat/Hybrid)
user@knl% gcc myapp.c -o runme -mavx512f -O2
user@knl% numactl --membind 1 ./runme
// ... Application running on MCDRAM ... //
colfaxresearch.com/knl-webinar Programming with MCDRAM © Colfax International, 2013–2016
Memkind Library and hbwmalloc
Manual allocation to MCDRAM possible with hbwmalloc and Memkind Library.
1 #include <hbwmalloc.h>
2 const int n = 1<<10;
3 // Allocation to MCDRAM
4 double* A = (double*) hbw_malloc(sizeof(double)*n);
5 // No replacement for _mm_malloc. Use posix_memalign
6 double* B;
7 int ret = hbw_posix_memalign((void*) B, 64, sizeof(double)*n);
8 .....
9 // Free with hbw_free
10 hbw_free(A); hbw_free(b);
Fortran Allocations.
1 REAL, ALLOCATABLE :: A(:)
2 !DEC$ ATTRIBUTES FASTMEM :: A
3 ALLOCATE (A(1:1024))
colfaxresearch.com/knl-webinar Programming with MCDRAM © Colfax International, 2013–2016
Compilation with Memkind Library and hbwmalloc
To compile C/C++ applications:
user@knl% icpc -lmemkind foo.cc -o runme
user@knl% g++ -lmemkind foo.cc -o runme
To compile Fortran applications:
user@knl% ifort -lmemkind foo.f90 -o runme
user@knl% gfortran -lmemkind foo.f90 -o runme
Open source distribution of Memkind library can be found at:
http://memkind.github.io/memkind/
colfaxresearch.com/knl-webinar Programming with MCDRAM © Colfax International, 2013–2016
Flow Chart for Bandwidth-Bound Applications
Does your Application
fit in 16 GB?
Can you partition ≤16 GB
of BW-critical memory?
No
Yes Yes
No
Start
numactl Memkind Cache mode
Simply run the whole
program in MCDRAM
No code modification
required
Manually allocate
BW-critical memory to
MCDRAM
Memkind calls need to
be added.
Allow the OS to figure
out how to use
MCDRAM
No code modification
required
colfaxresearch.com/knl-webinar Programming with MCDRAM © Colfax International, 2013–2016
Cluster Modes on KNL
colfaxresearch.com/knl-webinar Cluster Modes on KNL © Colfax International, 2013–2016
KNL Die Organization
Mesh interconnect relaxes data locality requirement [somewhat]
All-to-all, quadrant or sub-numa domain communication in mesh
CORE
L2
CORE CORE
L2
CORE CORE
L2
CORE CORE
L2
CORE CORE
L2
CORE
DDR4CONTROLLER
DDR4CONTROLLER
CORE
L2
CORE
CORE
L2
CORE
CORE
L2
CORE
CORE
L2
CORE
CORE
L2
CORE
CORE
L2
CORE
CORE
L2
CORE CORE
L2
CORECORE
L2
CORE CORE
L2
CORE
CORE
L2
CORE
CORE
L2
CORE
CORE
L2
CORE CORE
L2
CORE
MCDRAM PCIe
≤384GiBsystemDDR4,~90GB/s
CORE
L2
CORE
≤ 16 GiB on-package MCDRAM, ~ 400 GB/s
MCDRAM
MCDRAM MCDRAM
CORE
L2
CORE CORE
L2
CORE
CORE
L2
CORECORE
L2
CORE
...
36 TILES
72 CORES
colfaxresearch.com/knl-webinar Cluster Modes on KNL © Colfax International, 2013–2016
Cluster Modes: All-to-All
No affinity between the distributed Tag Directory (TD) and memory.
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
NUMAnode0
MCDRAMMCDRAM
MCDRAMMCDRAM
colfaxresearch.com/knl-webinar Cluster Modes on KNL © Colfax International, 2013–2016
Cluster Modes: Quadrant/Hemisphere
Tag Directory (TD) and memory reside in the same quadrant.
tile
TD
MCDRAM
MCDRAMMCDRAM
MCDRAM
tile
TD
tile
TD
tile
TD
NUMAnode0
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
colfaxresearch.com/knl-webinar Cluster Modes on KNL © Colfax International, 2013–2016
Cluster Modes: SNC-4/SNC-2
Will appear as 4 NUMA nodes (similar to quad-socket system).
tile
TD
MCDRAM
MCDRAMMCDRAM
MCDRAM
tile
TD
tile
TD
tile
TD
NUMAnode0
NUMAnode1
NUMAnode2
NUMAnode3
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
tile
TD
colfaxresearch.com/knl-webinar Cluster Modes on KNL © Colfax International, 2013–2016
Programming with Cluster Modes
colfaxresearch.com/knl-webinar Programming with Cluster Modes © Colfax International, 2013–2016
How to Use Cluster modes
Thread affinity to match the memory/directory affinity.
Nested OpenMP
1 #pragma omp parallel
2 {
3 // ...
4 #pragma omp parallel
5 {
6 // ...
7 }
8 }
OpenMP Threads
Quadrants
KNL
user@knl% OMP_NUM_THREADS=4,72
user@knl% OMP_NESTED=1
MPI+OpenMP
1 stat = MPI_Init();
2 // ...
3 #pragma omp parallel
4 {
5 // ...
6 }
7 // ...
8 MPI_Finalize();
NUMA node 0
MPI rank 0
OpenMP
NUMA node 1
MPI rank 1
OpenMP
NUMA node 2
MPI rank 2
OpenMP
NUMA node 3
MPI rank 3
OpenMP
user@knl% mpirun -host knl 
> -np 4 ./myparallel_app
colfaxresearch.com/knl-webinar Programming with Cluster Modes © Colfax International, 2013–2016
§6. Importance of Code Optimization
colfaxresearch.com/knl-webinar Importance of Code Optimization © Colfax International, 2013–2016
N-body Simulation on Intel Architecture
 unoptimized  Multi­threaded  Scaler Tuning  AoS­>SoA Tiled, Unrolled
 (unrll and jam) 
0
200
400
600
800
1000
1200
1400
1600
1800
 GFLOP/s
4.7
72.9
151.2
326.8
324.4
5.9
162.0
261.0
970.0
934.0
0.8
124.0
196.0
1190.0
1640.0
 
 Sandy Bridge E5­2670 (1st gen, SNB) 
 Haswell E5­2697 v3 (3rd gen, HSW)
 Intel Xeon Phi 7120A (1st gen, KNC)
Get Ready
The best way to prepare your code for KNL is to optimize for KNC
colfaxresearch.com/knl-webinar Importance of Code Optimization © Colfax International, 2013–2016
Learn More
colfaxresearch.com/knl-webinar Learn More © Colfax International, 2013–2016
HOW Series
Interested? Sign-up at:
colfaxresearch.com/how-series
colfaxresearch.com/knl-webinar Learn More © Colfax International, 2013–2016
Textbook
ISBN: 978-0-9885234-0-1 (508 pages, Electronic or Print)
Parallel Programming
and Optimization with
Intel®
Xeon Phi™
Coprocessors
Handbook on the Development and
Optimization of Parallel Applications
for Intel®
Xeon®
Processors
and Intel®
Xeon Phi™
Coprocessors
© Colfax International, 2015
PARALLEL PROGRAMMING
AND OPTIMIZATION WITH
HANDBOOKONTHE
DEVELOPMENTAND
OPTIMIZATIONOF
PARALLEL
APPLICATIONSFOR
INTEL XEON
PROCESSORS
ANDINTEL
XEONPHI
COPROCESSORS
INTEL XEON PHI
COPROCESSORS
TMR
SECONDEDITION
C O L F A X I N T E R N A T I O N A L
ANDREY VLADIMIROV | RYO ASAI | VADIM KARPUSENKO
http://xeonphi.com/book
colfaxresearch.com/knl-webinar Learn More © Colfax International, 2013–2016
§7. Closing Words
colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016
Colfax Research
http://colfaxresearch.com/
(already registered? get $10 off the book)
colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016
Developer Access Program (DAP)
Can’t wait to get your hands on Knights Landing?
Find out more at dap.xeonphi.com
or contact us at dap@colfax-intl.com.
colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016
Bottom Line
KNL is a highly-parallel successor of KNC, with
improvements in both performance and ease-of-use.
KEEP CALM
GO PARALLEL
AND
colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016
Thank you for Tuning In!
colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016

More Related Content

What's hot

OpenSPARC T1 Processor
OpenSPARC T1 ProcessorOpenSPARC T1 Processor
OpenSPARC T1 ProcessorDVClub
 
Intel the-latest-on-ofi
Intel the-latest-on-ofiIntel the-latest-on-ofi
Intel the-latest-on-ofi
Tracy Johnson
 
DPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet ProcessingDPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet Processing
Michelle Holley
 
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
Eric Van Hensbergen
 
Intel dpdk Tutorial
Intel dpdk TutorialIntel dpdk Tutorial
Intel dpdk Tutorial
Saifuddin Kaijar
 
The HPE Machine and Gen-Z - BUD17-503
The HPE Machine and Gen-Z - BUD17-503The HPE Machine and Gen-Z - BUD17-503
The HPE Machine and Gen-Z - BUD17-503
Linaro
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDK
Kernel TLV
 
Quieting noisy neighbor with Intel® Resource Director Technology
Quieting noisy neighbor with Intel® Resource Director TechnologyQuieting noisy neighbor with Intel® Resource Director Technology
Quieting noisy neighbor with Intel® Resource Director Technology
Michelle Holley
 
FD.io - The Universal Dataplane
FD.io - The Universal DataplaneFD.io - The Universal Dataplane
FD.io - The Universal Dataplane
Open Networking Summit
 
Software Network Data Plane - Satisfying the need for speed - FD.io - VPP and...
Software Network Data Plane - Satisfying the need for speed - FD.io - VPP and...Software Network Data Plane - Satisfying the need for speed - FD.io - VPP and...
Software Network Data Plane - Satisfying the need for speed - FD.io - VPP and...
Haidee McMahon
 
DPDK Integration: A Product's Journey - Roger B. Melton
DPDK Integration: A Product's Journey - Roger B. MeltonDPDK Integration: A Product's Journey - Roger B. Melton
DPDK Integration: A Product's Journey - Roger B. Melton
harryvanhaaren
 
Durgam vahia open_sparc_fpga
Durgam vahia open_sparc_fpgaDurgam vahia open_sparc_fpga
Durgam vahia open_sparc_fpgaObsidian Software
 
Challenges for Deploying a High-Performance Computing Application to the Cloud
Challenges for Deploying a High-Performance Computing Application to the CloudChallenges for Deploying a High-Performance Computing Application to the Cloud
Challenges for Deploying a High-Performance Computing Application to the Cloud
Intel® Software
 
Intel® RDT Hands-on Lab
Intel® RDT Hands-on LabIntel® RDT Hands-on Lab
Intel® RDT Hands-on Lab
Michelle Holley
 
Network Service Benchmarking
Network Service BenchmarkingNetwork Service Benchmarking
Network Service Benchmarking
Michelle Holley
 
Different approaches to performance enhancements in network virtualization fo...
Different approaches to performance enhancements in network virtualization fo...Different approaches to performance enhancements in network virtualization fo...
Different approaches to performance enhancements in network virtualization fo...
Michelle Holley
 
OpenPOWER Summit 2020 - OpenCAPI Keynote
OpenPOWER Summit 2020 -  OpenCAPI KeynoteOpenPOWER Summit 2020 -  OpenCAPI Keynote
OpenPOWER Summit 2020 - OpenCAPI Keynote
Allan Cantle
 
Performance challenges in software networking
Performance challenges in software networkingPerformance challenges in software networking
Performance challenges in software networking
Stephen Hemminger
 
Userspace networking
Userspace networkingUserspace networking
Userspace networking
Stephen Hemminger
 
Netsft2017 day in_life_of_nfv
Netsft2017 day in_life_of_nfvNetsft2017 day in_life_of_nfv
Netsft2017 day in_life_of_nfv
Intel
 

What's hot (20)

OpenSPARC T1 Processor
OpenSPARC T1 ProcessorOpenSPARC T1 Processor
OpenSPARC T1 Processor
 
Intel the-latest-on-ofi
Intel the-latest-on-ofiIntel the-latest-on-ofi
Intel the-latest-on-ofi
 
DPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet ProcessingDPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet Processing
 
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Arc...
 
Intel dpdk Tutorial
Intel dpdk TutorialIntel dpdk Tutorial
Intel dpdk Tutorial
 
The HPE Machine and Gen-Z - BUD17-503
The HPE Machine and Gen-Z - BUD17-503The HPE Machine and Gen-Z - BUD17-503
The HPE Machine and Gen-Z - BUD17-503
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDK
 
Quieting noisy neighbor with Intel® Resource Director Technology
Quieting noisy neighbor with Intel® Resource Director TechnologyQuieting noisy neighbor with Intel® Resource Director Technology
Quieting noisy neighbor with Intel® Resource Director Technology
 
FD.io - The Universal Dataplane
FD.io - The Universal DataplaneFD.io - The Universal Dataplane
FD.io - The Universal Dataplane
 
Software Network Data Plane - Satisfying the need for speed - FD.io - VPP and...
Software Network Data Plane - Satisfying the need for speed - FD.io - VPP and...Software Network Data Plane - Satisfying the need for speed - FD.io - VPP and...
Software Network Data Plane - Satisfying the need for speed - FD.io - VPP and...
 
DPDK Integration: A Product's Journey - Roger B. Melton
DPDK Integration: A Product's Journey - Roger B. MeltonDPDK Integration: A Product's Journey - Roger B. Melton
DPDK Integration: A Product's Journey - Roger B. Melton
 
Durgam vahia open_sparc_fpga
Durgam vahia open_sparc_fpgaDurgam vahia open_sparc_fpga
Durgam vahia open_sparc_fpga
 
Challenges for Deploying a High-Performance Computing Application to the Cloud
Challenges for Deploying a High-Performance Computing Application to the CloudChallenges for Deploying a High-Performance Computing Application to the Cloud
Challenges for Deploying a High-Performance Computing Application to the Cloud
 
Intel® RDT Hands-on Lab
Intel® RDT Hands-on LabIntel® RDT Hands-on Lab
Intel® RDT Hands-on Lab
 
Network Service Benchmarking
Network Service BenchmarkingNetwork Service Benchmarking
Network Service Benchmarking
 
Different approaches to performance enhancements in network virtualization fo...
Different approaches to performance enhancements in network virtualization fo...Different approaches to performance enhancements in network virtualization fo...
Different approaches to performance enhancements in network virtualization fo...
 
OpenPOWER Summit 2020 - OpenCAPI Keynote
OpenPOWER Summit 2020 -  OpenCAPI KeynoteOpenPOWER Summit 2020 -  OpenCAPI Keynote
OpenPOWER Summit 2020 - OpenCAPI Keynote
 
Performance challenges in software networking
Performance challenges in software networkingPerformance challenges in software networking
Performance challenges in software networking
 
Userspace networking
Userspace networkingUserspace networking
Userspace networking
 
Netsft2017 day in_life_of_nfv
Netsft2017 day in_life_of_nfvNetsft2017 day in_life_of_nfv
Netsft2017 day in_life_of_nfv
 

Viewers also liked

Trends towards the merge of HPC + Big Data systems
Trends towards the merge of HPC + Big Data systemsTrends towards the merge of HPC + Big Data systems
Trends towards the merge of HPC + Big Data systems
Igor José F. Freitas
 
Manual de jhon dere serie 300
Manual de jhon dere serie 300Manual de jhon dere serie 300
Manual de jhon dere serie 300
Andy Chunga
 
Automotiveairconditioningtrainingmanual
Automotiveairconditioningtrainingmanual Automotiveairconditioningtrainingmanual
Automotiveairconditioningtrainingmanual
abrahamjospher
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
Miklos Christine
 
Cisco ACI: A New Approach to Software Defined Networking
Cisco ACI: A New Approach to Software Defined NetworkingCisco ACI: A New Approach to Software Defined Networking
Cisco ACI: A New Approach to Software Defined Networking
Zivaro Inc
 
007 tubing andpipe
007 tubing andpipe007 tubing andpipe
007 tubing andpipe
John Hutchison
 
Robotics classes in mumbai
Robotics classes in mumbaiRobotics classes in mumbai
Robotics classes in mumbai
Vibrant Technologies & Computers
 
How Fannie Mae Leverages Data Quality to Improve the Business
How Fannie Mae Leverages Data Quality to Improve the BusinessHow Fannie Mae Leverages Data Quality to Improve the Business
How Fannie Mae Leverages Data Quality to Improve the Business
DLT Solutions
 
Robotics training in mumbai
Robotics training in mumbai Robotics training in mumbai
Robotics training in mumbai
Vibrant Technologies & Computers
 
Python classes in mumbai
Python classes in mumbaiPython classes in mumbai
Python classes in mumbai
Vibrant Technologies & Computers
 
How to Accelerate Backup Performance with Dell DR Series Backup Appliances
How to Accelerate Backup Performance with Dell DR Series Backup AppliancesHow to Accelerate Backup Performance with Dell DR Series Backup Appliances
How to Accelerate Backup Performance with Dell DR Series Backup Appliances
DLT Solutions
 
Cloud Ready Data: Speeding Your Journey to the Cloud
Cloud Ready Data: Speeding Your Journey to the CloudCloud Ready Data: Speeding Your Journey to the Cloud
Cloud Ready Data: Speeding Your Journey to the Cloud
DLT Solutions
 
High Dimensional Indexing using MongoDB (MongoSV 2012)
High Dimensional Indexing using MongoDB (MongoSV 2012)High Dimensional Indexing using MongoDB (MongoSV 2012)
High Dimensional Indexing using MongoDB (MongoSV 2012)Nicholas Knize, Ph.D., GISP
 
Reduce Networking/Infrastructure Cost With Multi-Vendor Equipment and Support...
Reduce Networking/Infrastructure Cost With Multi-Vendor Equipment and Support...Reduce Networking/Infrastructure Cost With Multi-Vendor Equipment and Support...
Reduce Networking/Infrastructure Cost With Multi-Vendor Equipment and Support...
Curvature
 
Dematic Logistics Review #5
Dematic Logistics Review #5Dematic Logistics Review #5
Dematic Logistics Review #5
scottcottingham
 
VVTA: RFP 2015-03: CNG Fuel Cylinder Replacement
VVTA: RFP 2015-03: CNG Fuel Cylinder ReplacementVVTA: RFP 2015-03: CNG Fuel Cylinder Replacement
VVTA: RFP 2015-03: CNG Fuel Cylinder Replacement
Victor Valley Transit Authority
 
Market Intelligence FY15 Defense Budget Briefing
 Market Intelligence FY15 Defense Budget Briefing Market Intelligence FY15 Defense Budget Briefing
Market Intelligence FY15 Defense Budget Briefing
immixGroup
 
Business Innovation Approach
Business Innovation ApproachBusiness Innovation Approach
Business Innovation Approach
Koen Klokgieters
 
Oshkosh 1070 f heavy equipment transporter, united kingdom
Oshkosh 1070 f heavy equipment transporter, united kingdomOshkosh 1070 f heavy equipment transporter, united kingdom
Oshkosh 1070 f heavy equipment transporter, united kingdomhindujudaic
 

Viewers also liked (20)

Logmein presentación
Logmein presentaciónLogmein presentación
Logmein presentación
 
Trends towards the merge of HPC + Big Data systems
Trends towards the merge of HPC + Big Data systemsTrends towards the merge of HPC + Big Data systems
Trends towards the merge of HPC + Big Data systems
 
Manual de jhon dere serie 300
Manual de jhon dere serie 300Manual de jhon dere serie 300
Manual de jhon dere serie 300
 
Automotiveairconditioningtrainingmanual
Automotiveairconditioningtrainingmanual Automotiveairconditioningtrainingmanual
Automotiveairconditioningtrainingmanual
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
 
Cisco ACI: A New Approach to Software Defined Networking
Cisco ACI: A New Approach to Software Defined NetworkingCisco ACI: A New Approach to Software Defined Networking
Cisco ACI: A New Approach to Software Defined Networking
 
007 tubing andpipe
007 tubing andpipe007 tubing andpipe
007 tubing andpipe
 
Robotics classes in mumbai
Robotics classes in mumbaiRobotics classes in mumbai
Robotics classes in mumbai
 
How Fannie Mae Leverages Data Quality to Improve the Business
How Fannie Mae Leverages Data Quality to Improve the BusinessHow Fannie Mae Leverages Data Quality to Improve the Business
How Fannie Mae Leverages Data Quality to Improve the Business
 
Robotics training in mumbai
Robotics training in mumbai Robotics training in mumbai
Robotics training in mumbai
 
Python classes in mumbai
Python classes in mumbaiPython classes in mumbai
Python classes in mumbai
 
How to Accelerate Backup Performance with Dell DR Series Backup Appliances
How to Accelerate Backup Performance with Dell DR Series Backup AppliancesHow to Accelerate Backup Performance with Dell DR Series Backup Appliances
How to Accelerate Backup Performance with Dell DR Series Backup Appliances
 
Cloud Ready Data: Speeding Your Journey to the Cloud
Cloud Ready Data: Speeding Your Journey to the CloudCloud Ready Data: Speeding Your Journey to the Cloud
Cloud Ready Data: Speeding Your Journey to the Cloud
 
High Dimensional Indexing using MongoDB (MongoSV 2012)
High Dimensional Indexing using MongoDB (MongoSV 2012)High Dimensional Indexing using MongoDB (MongoSV 2012)
High Dimensional Indexing using MongoDB (MongoSV 2012)
 
Reduce Networking/Infrastructure Cost With Multi-Vendor Equipment and Support...
Reduce Networking/Infrastructure Cost With Multi-Vendor Equipment and Support...Reduce Networking/Infrastructure Cost With Multi-Vendor Equipment and Support...
Reduce Networking/Infrastructure Cost With Multi-Vendor Equipment and Support...
 
Dematic Logistics Review #5
Dematic Logistics Review #5Dematic Logistics Review #5
Dematic Logistics Review #5
 
VVTA: RFP 2015-03: CNG Fuel Cylinder Replacement
VVTA: RFP 2015-03: CNG Fuel Cylinder ReplacementVVTA: RFP 2015-03: CNG Fuel Cylinder Replacement
VVTA: RFP 2015-03: CNG Fuel Cylinder Replacement
 
Market Intelligence FY15 Defense Budget Briefing
 Market Intelligence FY15 Defense Budget Briefing Market Intelligence FY15 Defense Budget Briefing
Market Intelligence FY15 Defense Budget Briefing
 
Business Innovation Approach
Business Innovation ApproachBusiness Innovation Approach
Business Innovation Approach
 
Oshkosh 1070 f heavy equipment transporter, united kingdom
Oshkosh 1070 f heavy equipment transporter, united kingdomOshkosh 1070 f heavy equipment transporter, united kingdom
Oshkosh 1070 f heavy equipment transporter, united kingdom
 

Similar to Developer's Guide to Knights Landing

Intel® Advanced Vector Extensions Support in GNU Compiler Collection
Intel® Advanced Vector Extensions Support in GNU Compiler CollectionIntel® Advanced Vector Extensions Support in GNU Compiler Collection
Intel® Advanced Vector Extensions Support in GNU Compiler Collection
DESMOND YUEN
 
Seminar Accelerating Business Using Microservices Architecture in Digital Age...
Seminar Accelerating Business Using Microservices Architecture in Digital Age...Seminar Accelerating Business Using Microservices Architecture in Digital Age...
Seminar Accelerating Business Using Microservices Architecture in Digital Age...
PT Datacomm Diangraha
 
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesPerformance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Dr. Fabio Baruffa
 
Achieving Performance Isolation with Lightweight Co-Kernels
Achieving Performance Isolation with Lightweight Co-KernelsAchieving Performance Isolation with Lightweight Co-Kernels
Achieving Performance Isolation with Lightweight Co-Kernels
Jiannan Ouyang, PhD
 
2014/09/02 Cisco UCS HPC @ ANL
2014/09/02 Cisco UCS HPC @ ANL2014/09/02 Cisco UCS HPC @ ANL
2014/09/02 Cisco UCS HPC @ ANL
dgoodell
 
Intel colfax optimizing-machine-learning-workloads
Intel colfax optimizing-machine-learning-workloadsIntel colfax optimizing-machine-learning-workloads
Intel colfax optimizing-machine-learning-workloads
Tracy Johnson
 
Optimize Machine Learning Workloads on Intel® Platforms
Optimize Machine Learning Workloads on Intel® PlatformsOptimize Machine Learning Workloads on Intel® Platforms
Optimize Machine Learning Workloads on Intel® Platforms
Intel® Software
 
Building efficient 5G NR base stations with Intel® Xeon® Scalable Processors
Building efficient 5G NR base stations with Intel® Xeon® Scalable Processors Building efficient 5G NR base stations with Intel® Xeon® Scalable Processors
Building efficient 5G NR base stations with Intel® Xeon® Scalable Processors
Michelle Holley
 
Simplify Networking for Containers
Simplify Networking for ContainersSimplify Networking for Containers
Simplify Networking for Containers
LinuxCon ContainerCon CloudOpen China
 
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP..."Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...
Edge AI and Vision Alliance
 
Snabbflow: A Scalable IPFIX exporter
Snabbflow: A Scalable IPFIX exporterSnabbflow: A Scalable IPFIX exporter
Snabbflow: A Scalable IPFIX exporter
Igalia
 
Summit 16: How to Compose a New OPNFV Solution Stack?
Summit 16: How to Compose a New OPNFV Solution Stack?Summit 16: How to Compose a New OPNFV Solution Stack?
Summit 16: How to Compose a New OPNFV Solution Stack?
OPNFV
 
Mpls conference 2016-data center virtualisation-11-march
Mpls conference 2016-data center virtualisation-11-marchMpls conference 2016-data center virtualisation-11-march
Mpls conference 2016-data center virtualisation-11-march
Aricent
 
186 devlin p-poster(2)
186 devlin p-poster(2)186 devlin p-poster(2)
186 devlin p-poster(2)vaidehi87
 
CIF16: Building the Superfluid Cloud with Unikernels (Simon Kuenzer, NEC Europe)
CIF16: Building the Superfluid Cloud with Unikernels (Simon Kuenzer, NEC Europe)CIF16: Building the Superfluid Cloud with Unikernels (Simon Kuenzer, NEC Europe)
CIF16: Building the Superfluid Cloud with Unikernels (Simon Kuenzer, NEC Europe)
The Linux Foundation
 
InAccel FPGA resource manager
InAccel FPGA resource managerInAccel FPGA resource manager
InAccel FPGA resource manager
Christoforos Kachris
 
100Gbps OpenStack For Providing High-Performance NFV
100Gbps OpenStack For Providing High-Performance NFV100Gbps OpenStack For Providing High-Performance NFV
100Gbps OpenStack For Providing High-Performance NFV
NTT Communications Technology Development
 
Stacks and Layers: Integrating P4, C, OVS and OpenStack
Stacks and Layers: Integrating P4, C, OVS and OpenStackStacks and Layers: Integrating P4, C, OVS and OpenStack
Stacks and Layers: Integrating P4, C, OVS and OpenStack
Open-NFP
 

Similar to Developer's Guide to Knights Landing (20)

Intel® Advanced Vector Extensions Support in GNU Compiler Collection
Intel® Advanced Vector Extensions Support in GNU Compiler CollectionIntel® Advanced Vector Extensions Support in GNU Compiler Collection
Intel® Advanced Vector Extensions Support in GNU Compiler Collection
 
Seminar Accelerating Business Using Microservices Architecture in Digital Age...
Seminar Accelerating Business Using Microservices Architecture in Digital Age...Seminar Accelerating Business Using Microservices Architecture in Digital Age...
Seminar Accelerating Business Using Microservices Architecture in Digital Age...
 
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesPerformance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
 
Achieving Performance Isolation with Lightweight Co-Kernels
Achieving Performance Isolation with Lightweight Co-KernelsAchieving Performance Isolation with Lightweight Co-Kernels
Achieving Performance Isolation with Lightweight Co-Kernels
 
2014/09/02 Cisco UCS HPC @ ANL
2014/09/02 Cisco UCS HPC @ ANL2014/09/02 Cisco UCS HPC @ ANL
2014/09/02 Cisco UCS HPC @ ANL
 
Intel colfax optimizing-machine-learning-workloads
Intel colfax optimizing-machine-learning-workloadsIntel colfax optimizing-machine-learning-workloads
Intel colfax optimizing-machine-learning-workloads
 
Optimize Machine Learning Workloads on Intel® Platforms
Optimize Machine Learning Workloads on Intel® PlatformsOptimize Machine Learning Workloads on Intel® Platforms
Optimize Machine Learning Workloads on Intel® Platforms
 
Building efficient 5G NR base stations with Intel® Xeon® Scalable Processors
Building efficient 5G NR base stations with Intel® Xeon® Scalable Processors Building efficient 5G NR base stations with Intel® Xeon® Scalable Processors
Building efficient 5G NR base stations with Intel® Xeon® Scalable Processors
 
Simplify Networking for Containers
Simplify Networking for ContainersSimplify Networking for Containers
Simplify Networking for Containers
 
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP..."Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...
 
SudheerV_resume_a
SudheerV_resume_aSudheerV_resume_a
SudheerV_resume_a
 
chameleon chip
chameleon chipchameleon chip
chameleon chip
 
Snabbflow: A Scalable IPFIX exporter
Snabbflow: A Scalable IPFIX exporterSnabbflow: A Scalable IPFIX exporter
Snabbflow: A Scalable IPFIX exporter
 
Summit 16: How to Compose a New OPNFV Solution Stack?
Summit 16: How to Compose a New OPNFV Solution Stack?Summit 16: How to Compose a New OPNFV Solution Stack?
Summit 16: How to Compose a New OPNFV Solution Stack?
 
Mpls conference 2016-data center virtualisation-11-march
Mpls conference 2016-data center virtualisation-11-marchMpls conference 2016-data center virtualisation-11-march
Mpls conference 2016-data center virtualisation-11-march
 
186 devlin p-poster(2)
186 devlin p-poster(2)186 devlin p-poster(2)
186 devlin p-poster(2)
 
CIF16: Building the Superfluid Cloud with Unikernels (Simon Kuenzer, NEC Europe)
CIF16: Building the Superfluid Cloud with Unikernels (Simon Kuenzer, NEC Europe)CIF16: Building the Superfluid Cloud with Unikernels (Simon Kuenzer, NEC Europe)
CIF16: Building the Superfluid Cloud with Unikernels (Simon Kuenzer, NEC Europe)
 
InAccel FPGA resource manager
InAccel FPGA resource managerInAccel FPGA resource manager
InAccel FPGA resource manager
 
100Gbps OpenStack For Providing High-Performance NFV
100Gbps OpenStack For Providing High-Performance NFV100Gbps OpenStack For Providing High-Performance NFV
100Gbps OpenStack For Providing High-Performance NFV
 
Stacks and Layers: Integrating P4, C, OVS and OpenStack
Stacks and Layers: Integrating P4, C, OVS and OpenStackStacks and Layers: Integrating P4, C, OVS and OpenStack
Stacks and Layers: Integrating P4, C, OVS and OpenStack
 

Recently uploaded

top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
vrstrong314
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
Ortus Solutions, Corp
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
Tier1 app
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
Srikant77
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Anthony Dahanne
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
wottaspaceseo
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
Adele Miller
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Globus
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
e20449
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
Ortus Solutions, Corp
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
Donna Lenk
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
Globus
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
informapgpstrackings
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
Globus
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 

Recently uploaded (20)

top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 

Developer's Guide to Knights Landing

  • 1. Developer’s Guide to Knights Landing Introduction to 2nd Generation Intel® Xeon Phi™ Processors Colfax International — @colfaxintl May 2016 - Rev. 1.2 colfaxresearch.com/knl-webinar Welcome © Colfax International, 2013–2016
  • 2. About This Document This document represents the ma- terials of a Web-based training “In- troduction to 2nd Generation Intel® Xeon Phi™ Processors: Developer’s Guide to Knights Landing” developed and run by Colfax International. © Colfax International, 2013-2016 colfaxresearch.com/knl-webinar/ colfaxresearch.com/knl-webinar About This Document © Colfax International, 2013–2016
  • 3. Disclaimer While best efforts have been used in preparing this training, Colfax International makes no representations or warranties of any kind and assumes no liabilities of any kind with respect to the accuracy or completeness of the contents and specifically disclaims any implied warranties of merchantability or fitness of use for a particular purpose. The publisher shall not be held liable or responsible to any person or entity with respect to any loss or incidental or consequential damages caused, or alleged to have been caused, directly or indirectly, by the information or programs contained herein. No warranty may be created or extended by sales representatives or written sales materials. colfaxresearch.com/knl-webinar About This Document © Colfax International, 2013–2016
  • 4. Resources colfaxresearch.com/knl-webinar Resources © Colfax International, 2013–2016
  • 5. Colfax Research http://colfaxresearch.com/ (already registered? get $10 off the book) colfaxresearch.com/knl-webinar Resources © Colfax International, 2013–2016
  • 6. HOW Series Interested? Sign-up at: colfaxresearch.com/how-series colfaxresearch.com/knl-webinar Resources © Colfax International, 2013–2016
  • 7. Developer Access Program (DAP) Can’t wait to get your hands on Knights Landing? Find out more at dap.xeonphi.com or contact us at dap@colfax-intl.com. colfaxresearch.com/knl-webinar Resources © Colfax International, 2013–2016
  • 8. Get Ready For KNL http://colfaxresearch.com/get-ready-for-intel-knights-landing-3-papers/ colfaxresearch.com/knl-webinar Resources © Colfax International, 2013–2016
  • 9. §2. Intel Architecture: Today and Tomorrow colfaxresearch.com/knl-webinar Intel Architecture: Today and Tomorrow © Colfax International, 2013–2016
  • 10. Intel Architecture colfaxresearch.com/knl-webinar Intel Architecture: Today and Tomorrow © Colfax International, 2013–2016
  • 11. Intel Xeon Phi Processors (2nd Gen) 2nd Generation of Intel Many Integrated Core (MIC) Architecture. Specialized platform for demanding computing applications. Bootable host proccesor or coprocessor 3+ TFLOP/s DP 6+ TFLOP/s SP Up to 16 GiB MCDRAM MCDRAM bandwidth ≈5x DDR4 Binary compatible with Intel Xeon Public disclosures colfaxresearch.com/knl-webinar Intel Architecture: Today and Tomorrow © Colfax International, 2013–2016
  • 12. Standard CPU Form Factor Bootable Host Processor No need for “host”. OS runs on KNL processor. Supports common OS. No more PCIe bottleneck! Direct access to ≤ 384 GiB DDR4 RAM Up to ≈90 GB/s DDR4 bandwidth Access to PCIe Bus colfaxresearch.com/knl-webinar Intel Architecture: Today and Tomorrow © Colfax International, 2013–2016
  • 13. Future Releases KNLF: KNL with Fabric Fabric integrated on CPU Intel OmniPath Architecture Socket mount processor *KNC image KNL Coprocessor PCIe add-in card Requires host Multiple KNLs in a system colfaxresearch.com/knl-webinar Intel Architecture: Today and Tomorrow © Colfax International, 2013–2016
  • 14. §3. Cores and Threading colfaxresearch.com/knl-webinar Cores and Threading © Colfax International, 2013–2016
  • 15. Importance of Parallelism If KNL was a steam engine... It would have a lot of furnaces. Single-threaded code on KNL is like having 1 fireman service 72 steam engines colfaxresearch.com/knl-webinar Cores and Threading © Colfax International, 2013–2016
  • 16. Implementing Multi-threading You have to use a core to benefit from it! Threading frameworks: OpenMP TBB Cilk Plus Pthreads Hybrid with MPI Any others supported by Xeon Processors OpenMP Threads Quadrants KNL colfaxresearch.com/knl-webinar Cores and Threading © Colfax International, 2013–2016
  • 17. Features of KNL Cores colfaxresearch.com/knl-webinar Features of KNL Cores © Colfax International, 2013–2016
  • 18. KNL Die Organization: Tiles Up to 36 tiles, each with 2 physical cores (72 total). Distributed L2 cache across a mesh interconnect. CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE DDR4CONTROLLER DDR4CONTROLLER CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORECORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE MCDRAM PCIe ≤384GiBsystemDDR4,~90GB/s CORE L2 CORE ≤ 16 GiB on-package MCDRAM, ~ 400 GB/s MCDRAM MCDRAM MCDRAM CORE L2 CORE CORE L2 CORE CORE L2 CORECORE L2 CORE ... 36 TILES 72 CORES colfaxresearch.com/knl-webinar Features of KNL Cores © Colfax International, 2013–2016
  • 19. KNL Cores 4-way hyper-threading (up to 4×72 = 288 logical processors) L2 cache shared by 2 cores on a tile. I-cache Decode + Retire Vector ALU Vector ALU Legacy 32 KiB L1 D-cache 1 MiB L2 D-cache I-cache Decode + Retire Vector ALU Vector ALU Legacy 32 KiB L1 D-cache CORE CORETILE to Mesh HW+ SW prefetch HW+ SW prefetch Out-of- order Out-of- order colfaxresearch.com/knl-webinar Features of KNL Cores © Colfax International, 2013–2016
  • 20. More Forgiving Cores than First Generation (KNC) Based on Intel® Atom cores (Silvermont microarchitecture) Out-of-order cores: Better latency masking from long latency operations Back-to-back instructions from single thread: Thread count requirement reduced to ≈ 70 (from ≈ 120 for KNC) Advanced branch prediction: Fewer cycles wasted to branch misprediction Generally more forgiving to non-optimized code. colfaxresearch.com/knl-webinar Features of KNL Cores © Colfax International, 2013–2016
  • 21. Performance Considerations colfaxresearch.com/knl-webinar Performance Considerations © Colfax International, 2013–2016
  • 22. Affinity and cpuinfo Neighboring threads often work on nearby/same locations in memory. → Use thread pinning to have them share L2 cache! You can use cpuinfo (part of Intel® MPI) to find which cores share cache. user@knl% cpuinfo // ... cpuinfo output ... // L2 1 MB (0,1,64,65,128,129,192,193)(2,3,66,67,130,131,194,195)(4,5,68, ... Use KMP_AFFINITY for Intel Compilers or OMP_PROC_BIND for GCC compilers. user@knl% export KMP_AFFINITY=compact user@knl% export OMP_PROC_BIND=close colfaxresearch.com/knl-webinar Performance Considerations © Colfax International, 2013–2016
  • 23. Tuning Threading Even if you do have multi-threaded code, look out for issues such as... Insufficient parallelism Load imbalance Learn more at: colfaxresearch.com/how-series colfaxresearch.com/knl-webinar Performance Considerations © Colfax International, 2013–2016
  • 25. Scalar Code and Our Fireman Not using vector instructions in KNL is like feeding a steam engine with a spoon. colfaxresearch.com/knl-webinar Vectorization © Colfax International, 2013–2016
  • 26. Short Vector Support Vector instructions – one of the implementations of SIMD (Single Instruction Multiple Data) parallelism. + = + = + = + = + = VectorLength Scalar Instructions Vector Instructions 4 1 5 0 3 3 -2 8 6 9 -7 2 4 1 5 0 3 3 -2 8 6 9 -7 2 colfaxresearch.com/knl-webinar Vectorization © Colfax International, 2013–2016
  • 27. Vector Instructions on KNL colfaxresearch.com/knl-webinar Vector Instructions on KNL © Colfax International, 2013–2016
  • 28. Dual VPU Each core on KNL has two Vector Processing Units (VPUs). Penalty from non-vectorized code SP →512 bitregisters/32 bits ×2 VPUs = 32 SIMD lanes DP →512 bitregisters/64 bits ×2 VPUs = 16 SIMD lanes VPU 0 512-bit registers = 16 SP or 8 DP ZMM registers x32 per thread VPU 1 Decode I-Cache 32/16 SIMD lanes (SP/DP) colfaxresearch.com/knl-webinar Vector Instructions on KNL © Colfax International, 2013–2016
  • 29. Supported Instruction Sets on KNL Intel® Advanced Vector Extensions 512 (AVX-512) 512-bit vector registers. Hardware gather/scatter, DP transcendental functions support and more. Supported by non-Intel compilers like GCC. ≤ Intel® AVX2 Legacy mode operation. Binary compatibility with Xeon. Does not include IMCI (from KNC). E5-2600 (SNB) E5-2600v3 (HSW) KNL (Xeon Phi) AVX-512PF AVX-512ER AVX-512CD AVX-512F BMI AVX2 AVX SSE x87/MMX TSX BMI AVX2 AVX SSE x87/MMX AVX SSE x87/MMX LegacySupport no TSX colfaxresearch.com/knl-webinar Vector Instructions on KNL © Colfax International, 2013–2016
  • 30. AVX-512 Features Knights Landing: first AVX-512 processor AVX-512F (Fundamentals) Extension of most AVX2 instructions to 512-bit vector registers. AVX-512CD (Conflict Detection) Efficient conflict detection (application: binning). AVX-512ER (Exponential and Reciprocal) Transcendental function (exp, rcp and rsqrt) support. AVX-512PF (Prefetch) Prefetch for scatter and gather. colfaxresearch.com/knl-webinar Vector Instructions on KNL © Colfax International, 2013–2016
  • 31. Checking for AVX-512 support Check /proc/cpuinfo for AVX-512 flags user@knl% cat /proc/cpuinfo flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer xsave avx f16c rdrand lahf_lm abm 3dnowprefetch arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms avx512f rdseed adx avx512pf avx512er avx512cd Also possible through C/C++ function calls: see this blog. colfaxresearch.com/knl-webinar Vector Instructions on KNL © Colfax International, 2013–2016
  • 32. AVX-512CD: Histograms 1 // worker.cc from Colfax lab 4.04 2 void Histogram(const float* age, int* const hist, 3 const int n, const float group_width, 4 const int m) { 5 6 for (int i = 0; i < n; i++) { 7 const int j = (int) ( age[i] / group_width ); 8 hist[j]++; 9 } 10 } 56.8 52.1 21.1 46.9 23.8 16.5 96.3 19.4 20.0 20.0 20.0 20.0 20.0 20.0 20.0 20.0 2.84 2.60 1.05 2.34 1.19 0.82 4.81 0.97 int 2 2 1 2 1 0 4 0 0 1 2 3 4hist: = ÷ user@knl% cat worker.optrpt .... remark ...: vectorization support: scatter was generated for the variable hist: remark ...: vectorization support: gather was generated for the variable hist: remark #15300: LOOP WAS VECTORIZED colfaxresearch.com/knl-webinar Vector Instructions on KNL © Colfax International, 2013–2016
  • 33. AVX-512ER: Transcendental Functions Support for transcendental functions. Exponential. Reciprocal. Inverse Square Root. Better precision. Double Precision. Max relative error: 2−23 (exp) 2−28 (rcp and rsqrt) Source: APOD colfaxresearch.com/knl-webinar Vector Instructions on KNL © Colfax International, 2013–2016
  • 34. Programming Considerations colfaxresearch.com/knl-webinar Programming Considerations © Colfax International, 2013–2016
  • 35. Using AVX-512: Two Approaches Automatic Vectorization: Vectorization w/ compiler. Portable: just recompile. Tuning with directives. 1 double A[vec_width], B[vec_width]; 2 // ... 3 4 5 // This loop will be auto-vectorized 6 for(int i = 0; i < vec_width; i++) 7 A[i]+=B[i]; 1 double A[vec_width], B[vec_width]; 2 // ... 3 // This is explicitly vectorized 4 __m512d A_vec = _mm512_load_pd(A); 5 __m512d B_vec = _mm512_load_pd(B); 6 A_vec = _mm512_add_pd(A_vec,B_vec); 7 _mm512_store_pd(A,A_vec); Explicit Vectorization: Vectorization w/ intrinsics Full control over instructions Limited portability colfaxresearch.com/knl-webinar Programming Considerations © Colfax International, 2013–2016
  • 36. Intel Compiler support for AVX-512 Intel Compiler versions ≥ 15.0 supports AVX-512 instruction set. user@knl% icc -v icc version 16.0.1 (gcc version 4.8.5 compatibility) user@knl% icc -help // ... truncated output ... // -x<code> ... MIC-AVX512 CORE-AVX512 COMMON-AVX512 -xMIC-AVX512 : for KNL (supports F, CD, ER, PF) -xCORE-AVX512 : for future Xeon (supports F, CD, DQ, BW, VL) -xCOMMON-AVX512 : common to KNL and Xeon (supports F, CD) colfaxresearch.com/knl-webinar Programming Considerations © Colfax International, 2013–2016
  • 37. GCC support for AVX-512 GCC ≥ 4.9.1 supports AVX-512 instruction set. user@knl% g++ -v gcc version 4.9.2 (GCC) user@knl% g++ foo.cc -mavx512f -mavx512er -mavx512cd -mavx512pf Basic automatic vectorization support: add -O2 or -O3. 1 // ... foo.cc ... // 2 for(int i = 0; i < n; i++) 3 B[i] = A[i] + B[i]; user@knl% g++ -s foo.cc -mavx512f -03 user@knl% cat foo.s ... vmovapd -16432(%rbp,%rax), %zmm0 vaddpd -8240(%rbp,%rax), %zmm0, %zmm0 vmovapd %zmm0, -8240(%rbp,%rax) colfaxresearch.com/knl-webinar Programming Considerations © Colfax International, 2013–2016
  • 38. Performance Considerations Even if your code is vectorized, tuning may unlock more performance. Providing enough parallelism. More consecutive vector operations required to overcome vectorization latency. Loop pipelining and unrolling. Double the pipeline stages to populate. Better vectorization patterns. Avoid long latency operations with unit-stride and unmasked operations. colfaxresearch.com/knl-webinar Programming Considerations © Colfax International, 2013–2016
  • 39. §5. Memory Architecture colfaxresearch.com/knl-webinar Memory Architecture © Colfax International, 2013–2016
  • 40. Where’s the Fuel? Cores can’t do work if the data is not there. Source: wikipedia KNL memory must be used efficiently to feed the cores. colfaxresearch.com/knl-webinar Memory Architecture © Colfax International, 2013–2016
  • 41. MCDRAM on KNL colfaxresearch.com/knl-webinar MCDRAM on KNL © Colfax International, 2013–2016
  • 42. KNL Memory Organization Direct access to on package MCDRAM and system DDR4 (socket) Use MCDRAM as cache, in flat mode, or as hybrid CORE CORE REGISTERSREGISTERS L1 cache L1 cache L2 cache MCDRAM (on package RAM) DDR4 RAM (system memory) Up to 16 GiB over 400 GB/s Up to 384 GiB ~ 90 GB/s (STREAM) ... more cores Intel Xeon Phi Processor Flat, Cache or Hybrid colfaxresearch.com/knl-webinar MCDRAM on KNL © Colfax International, 2013–2016
  • 43. MCDRAM Memory Modes Flat Mode MCDRAM treated as a NUMA node Users control what goes to MCDRAM CPU DDR4 Flat MCDRAM NUMA NODE 0 NUMA NODE 1 Cache Mode MCDRAM treated as a Last Level Cache (LLC) MCDRAM is used automatically CPU DDR4 Cache MCDRAM NUMA NODE 0 Hybrid Mode Combination of Flat and Cache Ratio can be chosen in the BIOS CPU DDR4 NUMA NODE 0 NUMA NODE 1 Flat MCDRAM Cache MCDRAM colfaxresearch.com/knl-webinar MCDRAM on KNL © Colfax International, 2013–2016
  • 44. MCDRAM Performance Copyright © 2015, Intel Corporation. All rights reserved Avinash Sodani ISC 2015 Intel® Xeon Phi ™ Workshop. DDR and MCDRAM Bandwidth vs. Latency Bandwidth Latency MCDRAM DDR DDR BW Limit MCDRAM BW Limit Conceptual diagram MCDRAM latency more than DDR at low loads but much less at high loads Source: Intel, ISC 2015 KNL keynote (pdf) colfaxresearch.com/knl-webinar MCDRAM on KNL © Colfax International, 2013–2016
  • 45. Programming with MCDRAM colfaxresearch.com/knl-webinar Programming with MCDRAM © Colfax International, 2013–2016
  • 46. Numactl Finding information about the NUMA nodes in the system. user@knl% # In Flat mode with All-to-All user@knl% numactl -H available: 2 nodes (0-1) node 0 cpus: ... all cpus ... node 0 size: 98207 MB node 0 free: 94798 MB node 1 cpus: node 1 size: 16384 MB node 1 free: 15991 MB Binding the application to MCDRAM (Flat/Hybrid) user@knl% gcc myapp.c -o runme -mavx512f -O2 user@knl% numactl --membind 1 ./runme // ... Application running on MCDRAM ... // colfaxresearch.com/knl-webinar Programming with MCDRAM © Colfax International, 2013–2016
  • 47. Memkind Library and hbwmalloc Manual allocation to MCDRAM possible with hbwmalloc and Memkind Library. 1 #include <hbwmalloc.h> 2 const int n = 1<<10; 3 // Allocation to MCDRAM 4 double* A = (double*) hbw_malloc(sizeof(double)*n); 5 // No replacement for _mm_malloc. Use posix_memalign 6 double* B; 7 int ret = hbw_posix_memalign((void*) B, 64, sizeof(double)*n); 8 ..... 9 // Free with hbw_free 10 hbw_free(A); hbw_free(b); Fortran Allocations. 1 REAL, ALLOCATABLE :: A(:) 2 !DEC$ ATTRIBUTES FASTMEM :: A 3 ALLOCATE (A(1:1024)) colfaxresearch.com/knl-webinar Programming with MCDRAM © Colfax International, 2013–2016
  • 48. Compilation with Memkind Library and hbwmalloc To compile C/C++ applications: user@knl% icpc -lmemkind foo.cc -o runme user@knl% g++ -lmemkind foo.cc -o runme To compile Fortran applications: user@knl% ifort -lmemkind foo.f90 -o runme user@knl% gfortran -lmemkind foo.f90 -o runme Open source distribution of Memkind library can be found at: http://memkind.github.io/memkind/ colfaxresearch.com/knl-webinar Programming with MCDRAM © Colfax International, 2013–2016
  • 49. Flow Chart for Bandwidth-Bound Applications Does your Application fit in 16 GB? Can you partition ≤16 GB of BW-critical memory? No Yes Yes No Start numactl Memkind Cache mode Simply run the whole program in MCDRAM No code modification required Manually allocate BW-critical memory to MCDRAM Memkind calls need to be added. Allow the OS to figure out how to use MCDRAM No code modification required colfaxresearch.com/knl-webinar Programming with MCDRAM © Colfax International, 2013–2016
  • 50. Cluster Modes on KNL colfaxresearch.com/knl-webinar Cluster Modes on KNL © Colfax International, 2013–2016
  • 51. KNL Die Organization Mesh interconnect relaxes data locality requirement [somewhat] All-to-all, quadrant or sub-numa domain communication in mesh CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE DDR4CONTROLLER DDR4CONTROLLER CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORECORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE CORE L2 CORE MCDRAM PCIe ≤384GiBsystemDDR4,~90GB/s CORE L2 CORE ≤ 16 GiB on-package MCDRAM, ~ 400 GB/s MCDRAM MCDRAM MCDRAM CORE L2 CORE CORE L2 CORE CORE L2 CORECORE L2 CORE ... 36 TILES 72 CORES colfaxresearch.com/knl-webinar Cluster Modes on KNL © Colfax International, 2013–2016
  • 52. Cluster Modes: All-to-All No affinity between the distributed Tag Directory (TD) and memory. tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD NUMAnode0 MCDRAMMCDRAM MCDRAMMCDRAM colfaxresearch.com/knl-webinar Cluster Modes on KNL © Colfax International, 2013–2016
  • 53. Cluster Modes: Quadrant/Hemisphere Tag Directory (TD) and memory reside in the same quadrant. tile TD MCDRAM MCDRAMMCDRAM MCDRAM tile TD tile TD tile TD NUMAnode0 tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD colfaxresearch.com/knl-webinar Cluster Modes on KNL © Colfax International, 2013–2016
  • 54. Cluster Modes: SNC-4/SNC-2 Will appear as 4 NUMA nodes (similar to quad-socket system). tile TD MCDRAM MCDRAMMCDRAM MCDRAM tile TD tile TD tile TD NUMAnode0 NUMAnode1 NUMAnode2 NUMAnode3 tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD tile TD colfaxresearch.com/knl-webinar Cluster Modes on KNL © Colfax International, 2013–2016
  • 55. Programming with Cluster Modes colfaxresearch.com/knl-webinar Programming with Cluster Modes © Colfax International, 2013–2016
  • 56. How to Use Cluster modes Thread affinity to match the memory/directory affinity. Nested OpenMP 1 #pragma omp parallel 2 { 3 // ... 4 #pragma omp parallel 5 { 6 // ... 7 } 8 } OpenMP Threads Quadrants KNL user@knl% OMP_NUM_THREADS=4,72 user@knl% OMP_NESTED=1 MPI+OpenMP 1 stat = MPI_Init(); 2 // ... 3 #pragma omp parallel 4 { 5 // ... 6 } 7 // ... 8 MPI_Finalize(); NUMA node 0 MPI rank 0 OpenMP NUMA node 1 MPI rank 1 OpenMP NUMA node 2 MPI rank 2 OpenMP NUMA node 3 MPI rank 3 OpenMP user@knl% mpirun -host knl > -np 4 ./myparallel_app colfaxresearch.com/knl-webinar Programming with Cluster Modes © Colfax International, 2013–2016
  • 57. §6. Importance of Code Optimization colfaxresearch.com/knl-webinar Importance of Code Optimization © Colfax International, 2013–2016
  • 58. N-body Simulation on Intel Architecture  unoptimized  Multi­threaded  Scaler Tuning  AoS­>SoA Tiled, Unrolled  (unrll and jam)  0 200 400 600 800 1000 1200 1400 1600 1800  GFLOP/s 4.7 72.9 151.2 326.8 324.4 5.9 162.0 261.0 970.0 934.0 0.8 124.0 196.0 1190.0 1640.0    Sandy Bridge E5­2670 (1st gen, SNB)   Haswell E5­2697 v3 (3rd gen, HSW)  Intel Xeon Phi 7120A (1st gen, KNC) Get Ready The best way to prepare your code for KNL is to optimize for KNC colfaxresearch.com/knl-webinar Importance of Code Optimization © Colfax International, 2013–2016
  • 59. Learn More colfaxresearch.com/knl-webinar Learn More © Colfax International, 2013–2016
  • 60. HOW Series Interested? Sign-up at: colfaxresearch.com/how-series colfaxresearch.com/knl-webinar Learn More © Colfax International, 2013–2016
  • 61. Textbook ISBN: 978-0-9885234-0-1 (508 pages, Electronic or Print) Parallel Programming and Optimization with Intel® Xeon Phi™ Coprocessors Handbook on the Development and Optimization of Parallel Applications for Intel® Xeon® Processors and Intel® Xeon Phi™ Coprocessors © Colfax International, 2015 PARALLEL PROGRAMMING AND OPTIMIZATION WITH HANDBOOKONTHE DEVELOPMENTAND OPTIMIZATIONOF PARALLEL APPLICATIONSFOR INTEL XEON PROCESSORS ANDINTEL XEONPHI COPROCESSORS INTEL XEON PHI COPROCESSORS TMR SECONDEDITION C O L F A X I N T E R N A T I O N A L ANDREY VLADIMIROV | RYO ASAI | VADIM KARPUSENKO http://xeonphi.com/book colfaxresearch.com/knl-webinar Learn More © Colfax International, 2013–2016
  • 62. §7. Closing Words colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016
  • 63. Colfax Research http://colfaxresearch.com/ (already registered? get $10 off the book) colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016
  • 64. Developer Access Program (DAP) Can’t wait to get your hands on Knights Landing? Find out more at dap.xeonphi.com or contact us at dap@colfax-intl.com. colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016
  • 65. Bottom Line KNL is a highly-parallel successor of KNC, with improvements in both performance and ease-of-use. KEEP CALM GO PARALLEL AND colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016
  • 66. Thank you for Tuning In! colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016