Optimizing Commercial Software
for Intel Xeon Phi Coprocessors:
Lessons Learned

Supercomputing Conference
Denver, Colorado, USA
November 17-22, 2013
©2013 Acceleware Ltd. All rights reserved.

Dan Cyca, Chief Technical Officer, Acceleware

In My Parallel Universe…


Small to medium-sized seismic companies aren’t limited by
computational resources when processing seismic data


Seismic Computing Requirements

[Figure: compute required by seismic imaging methods over time. Vertical axis: compute, from 100 GF to 1 EF. Horizontal axis: year, 1990 to 2015. Methods labeled from bottom to top: Kirchhoff Migration (Post SDM, PreSTM); WEM (paraxial WE approximation); RTM; FWI; Elastic Imaging; full WE approximation.]

Source: Total


RTM Overview

[Figure: the source wavefield is propagated forwards in time; the receiver data is propagated backwards in time]


RTM Introduction

Finite-difference code
Compute intensive:
– 10s of hours per seismic shot
Large memory footprint:
– 100 GB per shot
Large local storage requirement:
– 500 GB per shot
10,000s of shots


RTM: Computational Requirements

An RTM image is made by migrating and then stacking a large number of shots (typically between 10,000 and 100,000)
Migrating each shot requires two or three 3D wave propagations
Each shot migration requires large RAM (~100 GB) and temporary disk space (~500 GB)
Runtime per shot varies from a few minutes (low frequency isotropic) to several hours (high frequency anisotropic)
A typical compute cluster used for RTM has 100s of nodes
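
The figures above can be combined into a rough campaign estimate. The sketch below is only an illustration: the helper names are hypothetical, and it assumes each node migrates one shot at a time with no scheduling or I/O overhead, which a real cluster would not achieve exactly.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: wall-clock hours to migrate a whole survey when
// every node works on one independent shot at a time.
double campaignHours(double numShots, double hoursPerShot, double numNodes)
{
    // Shots are processed in "waves" of numNodes shots each.
    double const waves = std::ceil(numShots / numNodes);
    return waves * hoursPerShot;
}

// Hypothetical helper: temporary disk space (GB) in use across the cluster
// when every node holds the scratch data of one in-flight shot.
double scratchGB(double numNodes, double gbPerShot)
{
    return numNodes * gbPerShot;
}
```

For example, 10,000 shots at an assumed 1 hour each on 100 nodes would take about 100 hours and keep roughly 50,000 GB of temporary data on disk at any one time.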


In My Parallel Universe…

Small to medium-sized seismic companies aren’t limited by computational resources when processing seismic data
– We want to make RTM (1 PFlop) available to these companies
We’re delivering parallel software to run RTM on Xeon Phi systems


RTM: Wave Propagation

Finite-difference time domain technique
– 3D stencils
3D grid with millions of points
– Update the entire grid every time step
– 1000s of time steps
Memory footprint of 10-100 GB
Wavefield data from forward pass stored to disk to facilitate imaging


Parallelizing Single Shots

Finite-difference grid contains over 200 million cells per volume (2 GB)
Numerous volumes per shot (Earth model, wavefields and image)
One shot easily fits in a CPU compute node, but may be too large for a single Xeon Phi


Parallelizing Each Shot: Multiple Cards

[Figure: the volume divided into three partitions, assigned to Phi 0, Phi 1 and Phi 2]
The volume is partitioned into pieces that fit on a single Xeon Phi
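
One simple way to partition the volume is to cut it into contiguous slabs along a single axis. The sketch below assumes the cut is made along x and that slabs should be as equal as possible; the slides do not say which axis or balancing rule the actual software uses, and `Slab` and `partitionSlabs` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range [begin, end) of x-planes owned by one coprocessor.
struct Slab {
    std::size_t begin;
    std::size_t end;
};

// Splits nx planes into numCards contiguous slabs whose sizes differ by at
// most one plane. numCards must be greater than zero.
std::vector<Slab> partitionSlabs(std::size_t nx, std::size_t numCards)
{
    std::vector<Slab> slabs;
    std::size_t const base = nx / numCards;   // planes every card receives
    std::size_t const extra = nx % numCards;  // first `extra` cards get one more
    std::size_t begin = 0;
    for (std::size_t c = 0; c < numCards; ++c) {
        std::size_t const len = base + (c < extra ? 1 : 0);
        slabs.push_back({begin, begin + len});
        begin += len;
    }
    return slabs;
}
```

With 250 planes and three cards this gives slabs of 84, 83 and 83 planes. A real decomposition would also have to check that each slab, plus its boundary layers, fits in the card's memory.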


Parallelizing Each Shot: Multiple Cards

[Figure: boundary data transferred in both directions between the partitions on Phi 0 and Phi 1]
Boundaries must be transferred between partitions
Transfers can become a bottleneck unless they are done asynchronously with stencil calculations
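
To give a feel for the transfer volume, and for which work can overlap with it, the sketch below assumes an 8th-order stencil that reaches 4 cells to each side (as in the derivative kernel shown later) and a partition cut along x. All names are hypothetical, and the actual boundary width and layout are not stated in the slides.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: an 8th-order centred stencil reads 4 cells on each side.
std::size_t const kHaloWidth = 4;

// Bytes sent to one neighbouring partition per exchange, for a cut
// perpendicular to x through a volume with ny * nz cells per x-plane.
std::size_t haloBytesPerNeighbor(std::size_t ny, std::size_t nz,
                                 std::size_t bytesPerCell)
{
    return kHaloWidth * ny * nz * bytesPerCell;
}

// Half-open range [begin, end) of x-planes.
struct PlaneRange {
    std::size_t begin;
    std::size_t end;
};

// Planes whose stencil never touches a neighbour's data. These can be
// updated while the boundary transfer is still in flight; the remaining
// planes are updated once the transfer has completed.
PlaneRange interiorPlanes(PlaneRange slab)
{
    if (slab.end - slab.begin <= 2 * kHaloWidth) {
        return {slab.begin, slab.begin};  // slab too thin: empty interior
    }
    return {slab.begin + kHaloWidth, slab.end - kHaloWidth};
}
```

For a 150 x 1000 plane of 4-byte values, each exchange moves 2.4 MB per neighbour. Overlap helps only as long as the interior update takes at least as long as that transfer.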


Parallelizing Each Shot: Within Card

[Figure: the x/y plane divided among hardware threads (Core 0, Threads 0-3; Core 1, Threads 0-1; …), with each thread working along z]
Data in x and y are split over cores
Operations in z dimension are vectorized


Levels of Parallelism

Each shot is split over multiple Xeon Phi coprocessors (or Xeon nodes) using MPI
The partition on each Phi is split over cores using OpenMP
Operations on each thread are vectorized using the compiler’s autovectorizer


Kernel: 8th Order Spatial Derivative

    // x and y are split over threads; z is the contiguous, vectorized dimension
    #pragma omp parallel for
    for(size_t x = xMin; x < xMax; x++)
    {
        for(size_t y = yMin; y < yMax; y++)
        {
            size_t const idx = x*strideX + y*strideY;
            #pragma vector …
            for(size_t z = zMin; z < zMax; z++)
            {
                size_t const i = idx + z;
                // 9-point stencil along y: four neighbour pairs plus the centre point
                pVy[i] =
                    yCoeffs[0]*(pV[i-4*strideY]-pV[i+4*strideY]) +
                    yCoeffs[1]*(pV[i-3*strideY]-pV[i+3*strideY]) +
                    yCoeffs[2]*(pV[i-2*strideY]-pV[i+2*strideY]) +
                    yCoeffs[3]*(pV[i-1*strideY]-pV[i+1*strideY]) +
                    yCoeffs[4]*pV[i];
            }
        }
    }

Triple loop over dimensions
One-dimensional derivative: simple calculation with high memory bandwidth demand
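
For experimenting with the kernel outside the product code, the following is a minimal self-contained sketch. It is not the presenters' implementation: the coefficients are the standard 8th-order central-difference weights for a first derivative (the slide does not list its `yCoeffs`), the portable `#pragma omp simd` stands in for the Intel-specific `#pragma vector`, and `derivativeY` and its parameters are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Standard 8th-order central-difference weights for a first derivative,
// arranged for the (f[i-k] - f[i+k]) form with k = 4, 3, 2, 1. The last
// entry multiplies the centre point and is zero for this stencil.
// Assumption: the slide's yCoeffs may hold different values.
static double const kCoeffs[5] = {1.0/280.0, -4.0/105.0, 1.0/5.0, -4.0/5.0, 0.0};

// Writes d/dy of v into vy for all points at least 4 cells away from the
// y boundaries. Layout: z is contiguous, then y, then x. Requires ny > 8.
void derivativeY(std::vector<double> const& v, std::vector<double>& vy,
                 std::size_t nx, std::size_t ny, std::size_t nz, double h)
{
    std::size_t const strideY = nz;
    std::size_t const strideX = ny * nz;
    double const* pV = v.data();
    double* pVy = vy.data();

    #pragma omp parallel for
    for (std::size_t x = 0; x < nx; x++) {
        for (std::size_t y = 4; y < ny - 4; y++) {
            std::size_t const idx = x*strideX + y*strideY;
            #pragma omp simd
            for (std::size_t z = 0; z < nz; z++) {
                std::size_t const i = idx + z;
                pVy[i] = (kCoeffs[0]*(pV[i-4*strideY] - pV[i+4*strideY]) +
                          kCoeffs[1]*(pV[i-3*strideY] - pV[i+3*strideY]) +
                          kCoeffs[2]*(pV[i-2*strideY] - pV[i+2*strideY]) +
                          kCoeffs[3]*(pV[i-1*strideY] - pV[i+1*strideY]) +
                          kCoeffs[4]*pV[i]) / h;
            }
        }
    }
}
```

A quick sanity check is to fill `v` with a field that is linear in y: the computed derivative should then equal the slope at every interior point. Without OpenMP enabled the pragmas are ignored and the code runs serially.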

Tuning OpenMP

    #pragma omp parallel for collapse(2) schedule(static)
    for(size_t x = xMin; x < xMax; x++)
    {
        for(size_t y = yMin; y < yMax; y++)
        {
            size_t const idx = x*strideX + y*strideY;
            // The vectorization pragma must immediately precede the loop it applies to
            #pragma vector …
            for(size_t z = zMin; z < zMax; z++)
            {
                size_t const i = idx + z;
                // Derivative Calculations
            }
        }
    }

Many options available for OpenMP
– Tuning especially important on Phi (mostly because of high thread count)
Here we use static loop scheduling, because it has the lowest overhead
– It is also the most prone to load-balance issues

Tuning OpenMP

collapse(2) combines two adjacent for loops
Here, the X and Y dimensions are combined, e.g. X = 250, Y = 150
Work is divided more evenly among cores when there are more iterations
– 250 iterations on 240 threads (60*4) means 10 threads do double work while the other threads wait (about 1/2 of the time wasted)
– 250 x 150 = 37,500 iterations divide much better onto 240 threads (at most about 1/157 of the time wasted)
Improved Phi performance by 1.5x!

[Figure: separate X and Y iteration spaces combined into a single X*Y iteration space]
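
The load-balance arithmetic above can be checked with a few lines of code. This sketch assumes every iteration costs the same and that a static schedule hands each thread either the floor or the ceiling of iterations/threads; `idleFraction` is a hypothetical helper, not part of OpenMP.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Fraction of total thread-time spent waiting when `iterations` equal-cost
// iterations are divided as evenly as possible among `threads` threads.
double idleFraction(std::size_t iterations, std::size_t threads)
{
    // The busiest thread determines how long the parallel loop takes.
    std::size_t const maxPerThread = (iterations + threads - 1) / threads;
    double const capacity = static_cast<double>(maxPerThread) *
                            static_cast<double>(threads);
    return 1.0 - static_cast<double>(iterations) / capacity;
}
```

For 250 iterations on 240 threads the busiest threads run 2 iterations, so about 48% of the thread-time is idle. For 37,500 iterations the busiest threads run 157, and the idle share drops to about 0.5%, within the 1/157 bound quoted above.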

Tuning Thread Affinity

We programmatically set affinity with run-dependent logic
Isolating various tasks prevents over-subscription of cores

[Figure: thread-to-core assignment. Transfer threads: Core 0. Disk IO threads: Core 1. Propagation threads: Core 2 … Core 60. OS threads: Core 61.]
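
As an illustration of setting affinity programmatically, the sketch below builds a core plan shaped like the diagram and pins the calling thread with the Linux `sched_setaffinity` call. The plan layout, `CorePlan`, `makeCorePlan` and `pinCallingThread` are assumptions for this example; the slides do not show the actual run-dependent logic. The mapping from a physical core to logical CPU numbers is also system specific and is not handled here.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__linux__)
#include <sched.h>
#endif

// Hypothetical plan: one core for transfers, one for disk IO, one for the
// OS, and every core in between for wave propagation.
struct CorePlan {
    std::size_t transferCore;
    std::size_t diskCore;
    std::size_t firstPropagationCore;
    std::size_t lastPropagationCore;  // inclusive
    std::size_t osCore;
};

// Requires numCores >= 4 so that every role gets at least one core.
CorePlan makeCorePlan(std::size_t numCores)
{
    return {0, 1, 2, numCores - 2, numCores - 1};
}

// Pins the calling thread to a single logical CPU. Returns true on success,
// and false on failure or on platforms other than Linux.
bool pinCallingThread(std::size_t cpu)
{
#if defined(__linux__)
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // A pid of 0 applies the mask to the calling thread.
    return sched_setaffinity(0, sizeof(set), &set) == 0;
#else
    (void)cpu;
    return false;
#endif
}
```

Each worker thread would call `pinCallingThread` once at start-up with a CPU chosen from its role's range. OpenMP runtimes also offer their own affinity controls, which may be preferable for the propagation threads.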

Tuning Thread Affinity

Thread affinity settings improved scaling on multiple Phis and multiple CPU sockets

[Table: scaling without and with affinity changes, for dual Xeon Phi vs. single Phi and for dual Xeon sockets vs. single socket. Values shown: 1.3x, 1.9x, 1.9x, 1.7x]
– Different settings for Xeon Phi and Xeon

Tuning Memory Access

    #pragma omp parallel for collapse(2) schedule(static)
    for(size_t x = xMin; x < xMax; x++)
    {
        for(size_t y = yMin; y < yMax; y++)
        {
            size_t const idx = x*strideX + y*strideY;
            // Alignment hints: these must actually hold at run time
            __assume(strideX%16==0);
            __assume(strideY%16==0);
            __assume(idx%16==0);
            __assume_aligned(pV ,64);
            __assume_aligned(pVy ,64);
            #pragma vector always assert vecremainder
            #pragma ivdep
            // Streaming stores for pVy bypass the cache
            #pragma vector nontemporal (pVy)
            for(size_t z = zMin; z < zMax; z++)
            {
                size_t const i = idx + z;
                pVy[i] = (
                    yCoeffs*(pV[i-4*strideY]-pV[i+4*strideY])...
            }
        }
    }

Improved performance by 1.1x on both Xeon and Xeon Phi!
Give compiler hints about indexing so it knows when to use aligned reads/writes
pVy[i] is written once and should not be cached
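
The alignment hints are promises to the compiler, so the allocation code has to make them true. The sketch below shows one way to do that, assuming single-precision data (16 floats fill one 64-byte line) and a POSIX system; `padTo16`, `allocAligned64` and `isAligned64` are hypothetical helpers, not part of the presented software.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX)

// Rounds n up to the next multiple of 16 elements, so that strides built
// from padded dimensions satisfy stride % 16 == 0.
std::size_t padTo16(std::size_t n)
{
    return (n + 15) / 16 * 16;
}

// Allocates `count` floats starting on a 64-byte boundary.
// Returns nullptr on failure; release the memory with free().
float* allocAligned64(std::size_t count)
{
    void* p = nullptr;
    if (posix_memalign(&p, 64, count * sizeof(float)) != 0) {
        return nullptr;
    }
    return static_cast<float*>(p);
}

// True when p lies on a 64-byte boundary.
bool isAligned64(void const* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}
```

With the z dimension padded this way, `strideY` is a multiple of 16 and every row of the volume starts on a 64-byte boundary. Whether `idx % 16 == 0` also holds depends on `zMin` and the loop bounds, so it is worth asserting these conditions in debug builds.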


Current Performance Results

For anisotropic wave propagation, a Xeon Phi coprocessor delivers ~2.3x the performance of a single Xeon E5-2670 CPU
Same code-base and optimizations applied to Xeon and Xeon Phi


About Acceleware

Professional training
– Xeon Phi Coprocessor Optimization
– OpenCL
– OpenMP
– MPI
High performance consulting
– Feasibility Studies
– Porting and Optimization
– Algorithm parallelization
Accelerated software
– Oil and Gas
– Electromagnetics

Questions?
Come visit us in booth #1825!
Head Office
Tel: +1 403.249.9099
Email: services@acceleware.com
Viktoria Kaczur
Senior Account Manager
Tel: +1 403.249.9099 ext. 356
Cell: +1 403.671.4455
Email: viktoria.kaczur@acceleware.com
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsUnderstand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
 
Times Series Feature Extraction Methods of Wearable Signal Data for Deep Lear...
Times Series Feature Extraction Methods of Wearable Signal Data for Deep Lear...Times Series Feature Extraction Methods of Wearable Signal Data for Deep Lear...
Times Series Feature Extraction Methods of Wearable Signal Data for Deep Lear...
 

More from Intel IT Center

AI Crash Course- Supercomputing
AI Crash Course- SupercomputingAI Crash Course- Supercomputing
AI Crash Course- SupercomputingIntel IT Center
 
FPGA Inference - DellEMC SURFsara
FPGA Inference - DellEMC SURFsaraFPGA Inference - DellEMC SURFsara
FPGA Inference - DellEMC SURFsaraIntel IT Center
 
High Memory Bandwidth Demo @ One Intel Station
High Memory Bandwidth Demo @ One Intel StationHigh Memory Bandwidth Demo @ One Intel Station
High Memory Bandwidth Demo @ One Intel StationIntel IT Center
 
INFOGRAPHIC: Advantages of Intel vs. IBM Power on SAP HANA solutions
INFOGRAPHIC: Advantages of Intel vs. IBM Power on SAP HANA solutionsINFOGRAPHIC: Advantages of Intel vs. IBM Power on SAP HANA solutions
INFOGRAPHIC: Advantages of Intel vs. IBM Power on SAP HANA solutionsIntel IT Center
 
Disrupt Hackers With Robust User Authentication
Disrupt Hackers With Robust User AuthenticationDisrupt Hackers With Robust User Authentication
Disrupt Hackers With Robust User AuthenticationIntel IT Center
 
Strengthen Your Enterprise Arsenal Against Cyber Attacks With Hardware-Enhanc...
Strengthen Your Enterprise Arsenal Against Cyber Attacks With Hardware-Enhanc...Strengthen Your Enterprise Arsenal Against Cyber Attacks With Hardware-Enhanc...
Strengthen Your Enterprise Arsenal Against Cyber Attacks With Hardware-Enhanc...Intel IT Center
 
Harness Digital Disruption to Create 2022’s Workplace Today
Harness Digital Disruption to Create 2022’s Workplace TodayHarness Digital Disruption to Create 2022’s Workplace Today
Harness Digital Disruption to Create 2022’s Workplace TodayIntel IT Center
 
Don't Rely on Software Alone. Protect Endpoints with Hardware-Enhanced Security.
Don't Rely on Software Alone.Protect Endpoints with Hardware-Enhanced Security.Don't Rely on Software Alone.Protect Endpoints with Hardware-Enhanced Security.
Don't Rely on Software Alone. Protect Endpoints with Hardware-Enhanced Security.Intel IT Center
 
Achieve Unconstrained Collaboration in a Digital World
Achieve Unconstrained Collaboration in a Digital WorldAchieve Unconstrained Collaboration in a Digital World
Achieve Unconstrained Collaboration in a Digital WorldIntel IT Center
 
Intel® Xeon® Scalable Processors Enabled Applications Marketing Guide
Intel® Xeon® Scalable Processors Enabled Applications Marketing GuideIntel® Xeon® Scalable Processors Enabled Applications Marketing Guide
Intel® Xeon® Scalable Processors Enabled Applications Marketing GuideIntel IT Center
 
#NABshow: National Association of Broadcasters 2017 Super Session Presentatio...
#NABshow: National Association of Broadcasters 2017 Super Session Presentatio...#NABshow: National Association of Broadcasters 2017 Super Session Presentatio...
#NABshow: National Association of Broadcasters 2017 Super Session Presentatio...Intel IT Center
 
Identity Protection for the Digital Age
Identity Protection for the Digital AgeIdentity Protection for the Digital Age
Identity Protection for the Digital AgeIntel IT Center
 
Three Steps to Making a Digital Workplace a Reality
Three Steps to Making a Digital Workplace a RealityThree Steps to Making a Digital Workplace a Reality
Three Steps to Making a Digital Workplace a RealityIntel IT Center
 
Three Steps to Making The Digital Workplace a Reality - by Intel’s Chad Const...
Three Steps to Making The Digital Workplace a Reality - by Intel’s Chad Const...Three Steps to Making The Digital Workplace a Reality - by Intel’s Chad Const...
Three Steps to Making The Digital Workplace a Reality - by Intel’s Chad Const...Intel IT Center
 
Intel® Xeon® Processor E7-8800/4800 v4 EAMG 2.0
Intel® Xeon® Processor E7-8800/4800 v4 EAMG 2.0Intel® Xeon® Processor E7-8800/4800 v4 EAMG 2.0
Intel® Xeon® Processor E7-8800/4800 v4 EAMG 2.0Intel IT Center
 
Intel® Xeon® Processor E5-2600 v4 Enterprise Database Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Enterprise Database Applications ShowcaseIntel® Xeon® Processor E5-2600 v4 Enterprise Database Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Enterprise Database Applications ShowcaseIntel IT Center
 
Intel® Xeon® Processor E5-2600 v4 Core Business Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Core Business Applications ShowcaseIntel® Xeon® Processor E5-2600 v4 Core Business Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Core Business Applications ShowcaseIntel IT Center
 
Intel® Xeon® Processor E5-2600 v4 Financial Security Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Financial Security Applications ShowcaseIntel® Xeon® Processor E5-2600 v4 Financial Security Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Financial Security Applications ShowcaseIntel IT Center
 
Intel® Xeon® Processor E5-2600 v4 Telco Cloud Digital Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Telco Cloud Digital Applications ShowcaseIntel® Xeon® Processor E5-2600 v4 Telco Cloud Digital Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Telco Cloud Digital Applications ShowcaseIntel IT Center
 
Intel® Xeon® Processor E5-2600 v4 Tech Computing Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Tech Computing Applications ShowcaseIntel® Xeon® Processor E5-2600 v4 Tech Computing Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Tech Computing Applications ShowcaseIntel IT Center
 

More from Intel IT Center (20)

AI Crash Course- Supercomputing
AI Crash Course- SupercomputingAI Crash Course- Supercomputing
AI Crash Course- Supercomputing
 
FPGA Inference - DellEMC SURFsara
FPGA Inference - DellEMC SURFsaraFPGA Inference - DellEMC SURFsara
FPGA Inference - DellEMC SURFsara
 
High Memory Bandwidth Demo @ One Intel Station
High Memory Bandwidth Demo @ One Intel StationHigh Memory Bandwidth Demo @ One Intel Station
High Memory Bandwidth Demo @ One Intel Station
 
INFOGRAPHIC: Advantages of Intel vs. IBM Power on SAP HANA solutions
INFOGRAPHIC: Advantages of Intel vs. IBM Power on SAP HANA solutionsINFOGRAPHIC: Advantages of Intel vs. IBM Power on SAP HANA solutions
INFOGRAPHIC: Advantages of Intel vs. IBM Power on SAP HANA solutions
 
Disrupt Hackers With Robust User Authentication
Disrupt Hackers With Robust User AuthenticationDisrupt Hackers With Robust User Authentication
Disrupt Hackers With Robust User Authentication
 
Strengthen Your Enterprise Arsenal Against Cyber Attacks With Hardware-Enhanc...
Strengthen Your Enterprise Arsenal Against Cyber Attacks With Hardware-Enhanc...Strengthen Your Enterprise Arsenal Against Cyber Attacks With Hardware-Enhanc...
Strengthen Your Enterprise Arsenal Against Cyber Attacks With Hardware-Enhanc...
 
Harness Digital Disruption to Create 2022’s Workplace Today
Harness Digital Disruption to Create 2022’s Workplace TodayHarness Digital Disruption to Create 2022’s Workplace Today
Harness Digital Disruption to Create 2022’s Workplace Today
 
Don't Rely on Software Alone. Protect Endpoints with Hardware-Enhanced Security.
Don't Rely on Software Alone.Protect Endpoints with Hardware-Enhanced Security.Don't Rely on Software Alone.Protect Endpoints with Hardware-Enhanced Security.
Don't Rely on Software Alone. Protect Endpoints with Hardware-Enhanced Security.
 
Achieve Unconstrained Collaboration in a Digital World
Achieve Unconstrained Collaboration in a Digital WorldAchieve Unconstrained Collaboration in a Digital World
Achieve Unconstrained Collaboration in a Digital World
 
Intel® Xeon® Scalable Processors Enabled Applications Marketing Guide
Intel® Xeon® Scalable Processors Enabled Applications Marketing GuideIntel® Xeon® Scalable Processors Enabled Applications Marketing Guide
Intel® Xeon® Scalable Processors Enabled Applications Marketing Guide
 
#NABshow: National Association of Broadcasters 2017 Super Session Presentatio...
#NABshow: National Association of Broadcasters 2017 Super Session Presentatio...#NABshow: National Association of Broadcasters 2017 Super Session Presentatio...
#NABshow: National Association of Broadcasters 2017 Super Session Presentatio...
 
Identity Protection for the Digital Age
Identity Protection for the Digital AgeIdentity Protection for the Digital Age
Identity Protection for the Digital Age
 
Three Steps to Making a Digital Workplace a Reality
Three Steps to Making a Digital Workplace a RealityThree Steps to Making a Digital Workplace a Reality
Three Steps to Making a Digital Workplace a Reality
 
Three Steps to Making The Digital Workplace a Reality - by Intel’s Chad Const...
Three Steps to Making The Digital Workplace a Reality - by Intel’s Chad Const...Three Steps to Making The Digital Workplace a Reality - by Intel’s Chad Const...
Three Steps to Making The Digital Workplace a Reality - by Intel’s Chad Const...
 
Intel® Xeon® Processor E7-8800/4800 v4 EAMG 2.0
Intel® Xeon® Processor E7-8800/4800 v4 EAMG 2.0Intel® Xeon® Processor E7-8800/4800 v4 EAMG 2.0
Intel® Xeon® Processor E7-8800/4800 v4 EAMG 2.0
 
Intel® Xeon® Processor E5-2600 v4 Enterprise Database Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Enterprise Database Applications ShowcaseIntel® Xeon® Processor E5-2600 v4 Enterprise Database Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Enterprise Database Applications Showcase
 
Intel® Xeon® Processor E5-2600 v4 Core Business Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Core Business Applications ShowcaseIntel® Xeon® Processor E5-2600 v4 Core Business Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Core Business Applications Showcase
 
Intel® Xeon® Processor E5-2600 v4 Financial Security Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Financial Security Applications ShowcaseIntel® Xeon® Processor E5-2600 v4 Financial Security Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Financial Security Applications Showcase
 
Intel® Xeon® Processor E5-2600 v4 Telco Cloud Digital Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Telco Cloud Digital Applications ShowcaseIntel® Xeon® Processor E5-2600 v4 Telco Cloud Digital Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Telco Cloud Digital Applications Showcase
 
Intel® Xeon® Processor E5-2600 v4 Tech Computing Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Tech Computing Applications ShowcaseIntel® Xeon® Processor E5-2600 v4 Tech Computing Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Tech Computing Applications Showcase
 

Recently uploaded

What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 

Recently uploaded (20)

What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 

Optimizing Commercial Software for Intel Xeon Phi Coprocessors: Lessons Learned

  • 1. Optimizing Commercial Software for Intel® Xeon Phi™ Coprocessors: Lessons Learned. Supercomputing Conference, Denver, Colorado, USA, November 17-22, 2013. Dan Cyca, Chief Technical Officer, Acceleware. ©2013 Acceleware Ltd. All rights reserved.
  • 2. In My Parallel Universe… Small to medium-sized seismic companies aren’t limited by computational resources when processing seismic data.
  • 3. Seismic Computing Requirements (chart, source: Total). Vertical axis: compute requirement, from 100 GF to 1 EF. Horizontal axis: year, 1990 to 2015. Methods plotted in order of increasing compute requirement: Kirchhoff Migration (Post SDM, PreSTM), WEM, Paraxial WE approximation, RTM, FWI, Elastic Imaging, Full WE Approximation.
  • 4. RTM Overview (diagram). The source is propagated forwards in time; the receiver data is propagated backwards in time.
  • 5. RTM Introduction. Finite-difference code. Compute intensive: 10s of hours per seismic shot. Large memory footprint: 100 GB per shot. Large local storage requirement: 500 GB per shot. 10,000s of shots.
  • 6. RTM: Computational Requirements. An RTM image is made by migrating and then stacking a large number of shots (typically between 10,000 and 100,000). Migrating each shot requires two or three 3D wave propagations. Each shot migration requires large RAM (~100 GB) and temporary disk space (~500 GB). Runtime per shot varies from a few minutes (low frequency, isotropic) to several hours (high frequency, anisotropic). A typical compute cluster used for RTM has 100s of nodes.
  • 7. In My Parallel Universe… Small to medium-sized seismic companies aren’t limited by computational resources when processing seismic data: we want to make RTM (1 PFlop) available to these companies. We’re delivering parallel software to run RTM on Xeon Phi systems.
  • 8. RTM: Wave Propagation. Finite-difference time domain technique: 3D stencils. 3D grid with millions of points: the entire grid is updated every time step, for 1000s of time steps. Memory footprint of 10-100 GB. Wavefield data from the forward pass is stored to disk to facilitate imaging.
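
The "update the entire grid every time step" pattern can be sketched in one dimension. This is only an illustration, not the RTM code from the talk: it assumes a second-order leapfrog scheme for the scalar wave equation, fixed boundary values, and zero initial velocity, and the names `stepWavefield`, `propagate` and `courant2` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One leapfrog update of the 1D scalar wave equation u_tt = c^2 u_xx.
// On entry prev and curr hold the wavefield at steps n-1 and n;
// on return prev holds step n+1. courant2 = (c*dt/dx)^2, which must be
// at most 1 for this 1D scheme to be stable.
void stepWavefield(std::vector<float>& prev, const std::vector<float>& curr, float courant2)
{
    assert(prev.size() == curr.size());
    for (std::size_t i = 1; i + 1 < curr.size(); ++i)
    {
        float const laplacian = curr[i - 1] - 2.0f * curr[i] + curr[i + 1];
        prev[i] = 2.0f * curr[i] - prev[i] + courant2 * laplacian;
    }
}

// Advance nSteps time steps; every interior grid point is touched on every step.
std::vector<float> propagate(std::vector<float> curr, std::size_t nSteps, float courant2)
{
    std::vector<float> prev = curr;   // assumption: zero initial velocity
    for (std::size_t n = 0; n < nSteps; ++n)
    {
        stepWavefield(prev, curr, courant2);
        std::swap(prev, curr);        // curr is now the newest step, prev the one before
    }
    return curr;
}
```

In a 3D code the inner loop becomes a triple loop over the grid and the three-point Laplacian becomes a wider stencil, which is why the memory footprint and bandwidth dominate.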
  • 9. Parallelizing Single Shots. The finite-difference grid contains over 200 million cells per volume (2 GB). There are numerous volumes per shot (Earth model, wavefields and image). One shot easily fits in a CPU compute node, but may be too large for a single Xeon Phi.
  • 10. Parallelizing Each Shot: Multiple Cards (diagram: Phi 0, Phi 1, Phi 2). The volume is partitioned into pieces that fit on a single Xeon Phi.
  • 11. Parallelizing Each Shot: Multiple Cards (diagram: Phi 0 … Transfer … Phi 1). Boundaries must be transferred between partitions. Transfers can become a bottleneck unless they are done asynchronously with stencil calculations.
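
A minimal sketch of the partitioning described on the two slides above, assuming the volume is cut into slabs of planes along one axis and that each card also needs a few neighbouring planes (a halo) from the adjacent slabs. The `Slab` struct, the `partitionPlanes` function and the choice of a one-axis decomposition are assumptions for illustration; the asynchronous transfer itself is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Slab
{
    std::size_t begin;      // first plane owned by this card
    std::size_t end;        // one past the last owned plane
    std::size_t haloBegin;  // first plane that must be resident (owned + halo)
    std::size_t haloEnd;    // one past the last resident plane
};

// Split nPlanes planes as evenly as possible over nCards cards and record
// which extra boundary planes each card needs from its neighbours.
std::vector<Slab> partitionPlanes(std::size_t nPlanes, std::size_t nCards, std::size_t halo)
{
    assert(nCards > 0 && nCards <= nPlanes);
    std::vector<Slab> slabs;
    std::size_t const base = nPlanes / nCards;
    std::size_t const extra = nPlanes % nCards;   // the first `extra` cards get one more plane
    std::size_t begin = 0;
    for (std::size_t c = 0; c < nCards; ++c)
    {
        std::size_t const end = begin + base + (c < extra ? 1 : 0);
        Slab s;
        s.begin = begin;
        s.end = end;
        s.haloBegin = begin >= halo ? begin - halo : 0;   // clamp at the volume edge
        s.haloEnd = std::min(end + halo, nPlanes);
        slabs.push_back(s);
        begin = end;
    }
    return slabs;
}
```

For a stencil that reaches four cells in each direction, `halo` would be 4: for example `partitionPlanes(250, 3, 4)` gives the middle card planes 84 to 166 to update and planes 80 to 170 to hold. Only the halo planes change hands each time step, which is what makes overlapping the transfer with the interior stencil work worthwhile.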
  • 12. Parallelizing Each Shot: Within Card (diagram: the x/y plane is divided among hardware threads labelled Core 0, Thread 0 through Core 0, Thread 3, then Core 1, Thread 0, Core 1, Thread 1, and so on; z is the remaining axis). Data in x and y are split over cores. Operations in the z dimension are vectorized.
  • 13. Levels of Parallelism. Each shot is split over multiple Xeon Phi coprocessors (or Xeon nodes) using MPI. The partition on each Phi is split over cores using OpenMP. Operations on each thread are vectorized using the compiler’s autovectorizer.
  • 14. Kernel: 8th Order Spatial Derivative. Triple loop over dimensions. One-dimensional derivative: simple calculation with large memory bandwidth.

      #pragma omp parallel for
      for(size_t x = xMin; x < xMax; x++)
      {
        for(size_t y = yMin; y < yMax; y++)
        {
          size_t const idx = x*strideX + y*strideY;
          #pragma vector …
          for(size_t z = zMin; z < zMax; z++)
          {
            size_t const i = idx + z;
            pVy[i] = yCoeffs[0]*(pV[i-4*strideY]-pV[i+4*strideY]) +
                     yCoeffs[1]*(pV[i-3*strideY]-pV[i+3*strideY]) +
                     yCoeffs[2]*(pV[i-2*strideY]-pV[i+2*strideY]) +
                     yCoeffs[3]*(pV[i-1*strideY]-pV[i+1*strideY]) +
                     yCoeffs[4]*pV[i];
          }
        }
      }
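
A self-contained version of this kernel that compiles outside the Intel toolchain might look like the sketch below. Several things here are assumptions, not taken from the slides: the function name `derivativeY`, the layout (z contiguous, `strideY = nz`, `strideX = ny*nz`), the use of `#pragma omp simd` in place of the elided Intel-specific `#pragma vector …`, and the coefficient values, which are the standard 8th-order central-difference weights for unit grid spacing arranged to match the slide's `pV[i-k*strideY] - pV[i+k*strideY]` form.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// yCoeffs[0..3] multiply (pV[i-k*strideY] - pV[i+k*strideY]) for k = 4, 3, 2, 1.
// Assumed values: standard 8th-order central-difference weights, unit spacing.
static const float yCoeffs[5] = { 1.0f / 280.0f, -4.0f / 105.0f, 1.0f / 5.0f, -4.0f / 5.0f, 0.0f };

// First derivative in y of an nx*ny*nz volume stored with z contiguous.
// Only y in [4, ny-4) is written, so the stencil never reads out of bounds.
void derivativeY(const float* pV, float* pVy,
                 std::size_t nx, std::size_t ny, std::size_t nz)
{
    assert(ny > 8);
    std::size_t const strideY = nz;
    std::size_t const strideX = ny * nz;
    #pragma omp parallel for
    for (std::size_t x = 0; x < nx; x++)
    {
        for (std::size_t y = 4; y < ny - 4; y++)
        {
            std::size_t const idx = x * strideX + y * strideY;
            #pragma omp simd
            for (std::size_t z = 0; z < nz; z++)
            {
                std::size_t const i = idx + z;
                pVy[i] = yCoeffs[0] * (pV[i - 4 * strideY] - pV[i + 4 * strideY]) +
                         yCoeffs[1] * (pV[i - 3 * strideY] - pV[i + 3 * strideY]) +
                         yCoeffs[2] * (pV[i - 2 * strideY] - pV[i + 2 * strideY]) +
                         yCoeffs[3] * (pV[i - 1 * strideY] - pV[i + 1 * strideY]) +
                         yCoeffs[4] * pV[i];
            }
        }
    }
}
```

Each output value needs eight neighbouring reads but only a handful of multiply-adds, which is consistent with the slide's remark that the calculation is simple while the memory traffic is large. Without an OpenMP flag the pragmas are ignored and the loop runs serially.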
  • 15. Tuning OpenMP. Many options are available for OpenMP; tuning is especially important on Phi (mostly because of the high thread count). Here we use static loop scheduling, because it has the lowest overhead; it is also the most prone to load-balance issues.

      #pragma omp parallel for collapse(2) schedule(static)
      for(size_t x = xMin; x < xMax; x++)
      {
        for(size_t y = yMin; y < yMax; y++)
        {
          size_t const idx = x*strideX + y*strideY;
          #pragma vector …
          for(size_t z = zMin; z < zMax; z++)
          {
            size_t const i = idx + z;
            // Derivative Calculations
          }
        }
      }
  • 16. Tuning OpenMP (diagram: separate X and Y iteration spaces combined into X*Y). collapse(2) combines two adjacent for loops. Here, the X and Y dimensions are combined, e.g. X = 250, Y = 150. Work is broken more evenly onto cores when there are more iterations: 250 iterations on 240 threads (60*4) means 10 threads do double work while other threads wait (1/2 time wasted); 250 x 150 divides much better onto 240 threads (1/157 time wasted). Improved Phi performance by 1.5x!
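
The load-balance figures above can be checked with a back-of-envelope model. The sketch assumes every iteration costs the same and that the busiest thread sets the wall-clock time; `idleFraction` is a hypothetical helper, not something from the talk.

```cpp
#include <cassert>
#include <cstddef>

// Fraction of total thread-time spent idle when `iterations` equal-cost loop
// iterations are divided statically over `threads` threads.
double idleFraction(std::size_t iterations, std::size_t threads)
{
    assert(iterations > 0 && threads > 0);
    // The busiest thread runs ceil(iterations / threads) iterations.
    std::size_t const maxPerThread = (iterations + threads - 1) / threads;
    double const busy = static_cast<double>(iterations);
    double const available = static_cast<double>(threads) * static_cast<double>(maxPerThread);
    return 1.0 - busy / available;
}
```

Under this model `idleFraction(250, 240)` is about 0.48, in line with the "1/2 time wasted" figure, and `idleFraction(250 * 150, 240)` is about 0.005, which sits under the 1/157 bound quoted for the collapsed loop.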
  • 17. Tuning Thread Affinity (diagram: Transfer Threads on Core 0; Disk IO Threads on Core 1; Propagation Threads on Core 2 … Core 60; OS Threads on Core 61). We programmatically set affinity with run-dependent logic. Isolating the various tasks prevents over-subscription of cores.
  • 18. Tuning Thread Affinity. Thread affinity settings improved scaling on multiple Phis and multiple CPU sockets. Scaling table comparing "Without Affinity Changes" and "With Affinity Changes" for "Dual Xeon Phi vs. single Phi" and "Dual Xeon sockets vs. single socket"; values in slide order: 1.3x, 1.9x, 1.9x, 1.7x. Different settings are used for Xeon Phi and Xeon.
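
The task isolation shown in the affinity diagram can be sketched as a planning step that decides which role each core gets before any thread is pinned. This is an illustration under assumptions: the role layout (first core for transfers, second for disk IO, last for the OS, the rest for propagation) is one reading of the diagram, the names `Role`, `planCoreRoles` and `propagationThreads` are hypothetical, and the actual pinning call, which is platform specific, is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Role { Transfer, DiskIO, Propagation, OS };

// Assign one role per core: the first core to transfers, the second to disk IO,
// the last to the OS, and everything in between to wave propagation.
std::vector<Role> planCoreRoles(std::size_t numCores)
{
    assert(numCores >= 4);
    std::vector<Role> roles(numCores, Role::Propagation);
    roles[0] = Role::Transfer;
    roles[1] = Role::DiskIO;
    roles[numCores - 1] = Role::OS;
    return roles;
}

// Number of hardware threads left for propagation, given threads per core.
std::size_t propagationThreads(const std::vector<Role>& roles, std::size_t threadsPerCore)
{
    std::size_t cores = 0;
    for (Role r : roles)
    {
        if (r == Role::Propagation)
        {
            ++cores;
        }
    }
    return cores * threadsPerCore;
}
```

With the 62 core labels in the diagram (Core 0 to Core 61) and 4 threads per core, this plan leaves 59 cores, or 236 hardware threads, for propagation. Sizing the OpenMP team to that count rather than to every hardware thread is one way to avoid the over-subscription the slide mentions.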
  • 19. Tuning Memory Access. Give the compiler hints about indexing so it knows when to use aligned reads/writes. pVy[i] is written once and should not be cached. Improved performance by 1.1x on both Xeon and Xeon Phi!

      #pragma omp parallel for collapse(2) schedule(static)
      for(size_t x = xMin; x < xMax; x++)
      {
        for(size_t y = yMin; y < yMax; y++)
        {
          size_t const idx = x*strideX + y*strideY;
          __assume(strideX%16==0);
          __assume(strideY%16==0);
          __assume(idx%16==0);
          __assume_aligned(pV ,64);
          __assume_aligned(pVy ,64);
          #pragma vector always assert vecremainder
          #pragma ivdep
          #pragma vector nontemporal (pVy)
          for(size_t z = zMin; z < zMax; z++)
          {
            size_t const i = idx + z;
            pVy[i] = ( yCoeffs*(pV[i-4*strideY]-pV[i+4*strideY])...
          }
        }
      }
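
The alignment hints in this kernel are promises to the compiler, so they are only safe if the data really is laid out that way. The sketch below shows one way the preconditions could be met: pad the z extent to a multiple of 16 floats (64 bytes) so that `strideY`, `strideX` and `idx` are all multiples of 16, and start the volume on a 64-byte boundary. The helper names and the use of `std::align` over a slightly oversized `std::vector` are assumptions for illustration, not the talk's allocation code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Round n up to the next multiple of `multiple`.
std::size_t padToMultiple(std::size_t n, std::size_t multiple)
{
    assert(multiple > 0);
    return (n + multiple - 1) / multiple * multiple;
}

// True if p sits on a 64-byte boundary.
bool isAligned64(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}

// Resize `storage` with 64 bytes of slack and return a 64-byte aligned pointer
// into it that has room for `count` floats.
float* alignedStart(std::vector<float>& storage, std::size_t count)
{
    storage.assign(count + 16, 0.0f);   // 16 spare floats = 64 bytes of slack
    void* p = storage.data();
    std::size_t space = storage.size() * sizeof(float);
    void* const aligned = std::align(64, count * sizeof(float), p, space);
    assert(aligned != nullptr);
    return static_cast<float*>(aligned);
}
```

For example, a z extent of 250 would be padded to `strideY = padToMultiple(250, 16)`, which is 256, and `strideX = ny * strideY` then stays a multiple of 16 for any `ny`, so every row the inner loop starts on is 64-byte aligned.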
  • 20. Current Performance Results. For anisotropic wave propagation, a Xeon Phi coprocessor delivers ~2.3x the performance of a single Xeon E5-2670 CPU. The same code base and optimizations are applied to Xeon and Xeon Phi.
  • 21. About Acceleware. Professional training: Xeon Phi Coprocessor Optimization, OpenCL, OpenMP, MPI. High performance consulting: Feasibility Studies, Porting and Optimization, Algorithm parallelization. Accelerated software: Oil and Gas, Electromagnetics.
  • 22. Questions? Come visit us in booth #1825! Head Office: Tel: +1 403.249.9099, Email: services@acceleware.com. Viktoria Kaczur, Senior Account Manager: Tel: +1 403.249.9099 ext. 356, Cell: +1 403.671.4455, Email: viktoria.kaczur@acceleware.com

Editor's Notes

  1. Here we see an example of a single RTM shot as the simulation progresses in time. First, we propagate the source wavelet through the earth model. Next, we inject the receiver data into another wavefield using the same earth model. Finally, we see the image forming over time.
  2. Expand 3D to 2D, explain how finer granularity in scheduling helps, especially when there are more cores
  3. Marcel: is my fractional time correct? 250*150 / 240 = 156.25, so during the last iteration (#157), 0.25 of the threads are busy, others are waiting, and the overall slowdown is only one more loop, so 1/157 of the total time.