Optimizing Commercial Software for Intel Xeon Phi Coprocessors: Lessons Learned

In My Parallel Universe…
Small to medium-sized seismic companies aren't limited by computational resources when processing seismic data
Seismic Computing Requirements
[Chart (source: Total): compute requirements of seismic imaging algorithms, 1990–2015. Kirchhoff migration (post-SDM, PreSTM) ~100 GF; WEM ~10 TF; paraxial WE approximation ~100 TF; RTM ~1 PF; FWI ~10 PF; elastic imaging ~100 PF; full WE approximation ~1 EF.]
RTM Overview
[Diagram: the source wavelet is propagated forwards in time, and the receiver data are propagated backwards in time.]
RTM Introduction
Finite-difference code
Compute intensive: 10s of hours per seismic shot
Large memory footprint: ~100 GB per shot
Large local storage requirement: ~500 GB per shot
10,000s of shots
RTM: Computational Requirements
The RTM image is made by migrating and then stacking a large number of shots (typically between 10,000 and 100,000)
Migrating each shot requires two or three 3D wave propagations
Each shot migration requires large RAM (~100 GB) and temporary disk space (~500 GB)
Runtime per shot varies between a few minutes (low-frequency isotropic) and several hours (high-frequency anisotropic)
A typical compute cluster used for RTM will have 100s of nodes
In My Parallel Universe…
Small to medium-sized seismic companies aren't limited by computational resources when processing seismic data
– We want to make RTM (1 PFlop) available to these companies
We're delivering parallel software to run RTM on Xeon Phi systems
RTM: Wave Propagation
Finite-difference time domain technique
– 3D grid with millions of points
– 3D stencils
– Update the entire grid every time step
– 1000s of time steps
Memory footprint of 10–100 GB
Wavefield data from the forward pass are stored to disk to facilitate imaging
Parallelizing Single Shots
The finite-difference grid contains over 200 million cells per volume (2 GB)
Numerous volumes per shot (Earth model, wavefields, and image)
One shot easily fits in a CPU compute node, but may be too large for a single Xeon Phi
Parallelizing Each Shot: Multiple Cards
[Diagram: Phi 0, Phi 1, Phi 2.]
The volume is partitioned into pieces that fit on a single Xeon Phi
Parallelizing Each Shot: Multiple Cards
[Diagram: halo regions transferred between Phi 0 and Phi 1.]
Boundaries must be transferred between partitions
Transfers can become a bottleneck unless they are done asynchronously with the stencil calculations
Parallelizing Each Shot: Within a Card
[Diagram: columns of the x/y plane assigned to threads (Core 0, Threads 0–3; Core 1, Threads 0–1; …), with the z dimension running along the vector lanes.]
Data in x and y are split over cores
Operations in the z dimension are vectorized
Levels of Parallelism
Each shot is split over multiple Xeon Phi coprocessors (or Xeon nodes) using MPI
The partition on each Phi is split over cores using OpenMP
Operations on each thread are vectorized using the compiler's autovectorizer
Kernel: 8th Order Spatial Derivative
#pragma omp parallel for
for(size_t x = xMin; x < xMax; x++)
{
    for(size_t y = yMin; y < yMax; y++)
    {
        size_t const idx = x*strideX + y*strideY;
        #pragma vector …
        for(size_t z = zMin; z < zMax; z++)
        {
            size_t const i = idx + z;
            pVy[i] =
                yCoeffs[0]*(pV[i-4*strideY]-pV[i+4*strideY]) +
                yCoeffs[1]*(pV[i-3*strideY]-pV[i+3*strideY]) +
                yCoeffs[2]*(pV[i-2*strideY]-pV[i+2*strideY]) +
                yCoeffs[3]*(pV[i-1*strideY]-pV[i+1*strideY]) +
                yCoeffs[4]*pV[i];
        }
    }
}
Triple loop over the three dimensions
One-dimensional derivative: a simple calculation with large memory bandwidth
Tuning OpenMP
#pragma omp parallel for collapse(2) schedule(static)
for(size_t x = xMin; x < xMax; x++)
{
    for(size_t y = yMin; y < yMax; y++)
    {
        size_t const idx = x*strideX + y*strideY;
        #pragma vector …
        for(size_t z = zMin; z < zMax; z++)
        {
            size_t const i = idx + z;
            // Derivative Calculations
        }
    }
}
Many options are available for OpenMP
– Tuning is especially important on Xeon Phi (mostly because of the high thread count)
Here we use static loop scheduling because it has the lowest overhead
– It is also the most prone to load-balance issues
16. Optimizing Commercial Software
for Intel Xeon Phi Coprocessors:
Lessons Learned
Tuning OpenMP
collapse(2) combines two adjacent for loops
Here, the X and Y dimensions are combined, e.g. X = 250, Y = 150
Work is divided more evenly onto cores when there are more iterations
– 250 iterations on 240 threads (60*4) means 10 threads do double work while the other threads wait (1/2 of the time wasted)
– 250 x 150 = 37,500 iterations divide much better onto 240 threads (~1/157 of the time wasted)
Improved Phi performance by 1.5x!
Tuning Thread Affinity
We programmatically set affinity with run-dependent logic
Isolating the various tasks prevents over-subscription of cores
[Diagram: transfer threads pinned to core 0, disk-I/O threads to core 1, propagation threads to cores 2–60, and OS threads to core 61.]
Tuning Thread Affinity
Thread affinity settings improved scaling on multiple Phis and on multiple CPU sockets (different settings are used for Xeon Phi and Xeon):

Scaling                              | Without Affinity Changes | With Affinity Changes
Dual Xeon Phi vs. single Phi         | 1.3x                     | 1.9x
Dual Xeon sockets vs. single socket  | 1.7x                     | 1.9x
Tuning Memory Access
#pragma omp parallel for collapse(2) schedule(static)
for(size_t x = xMin; x < xMax; x++)
{
    for(size_t y = yMin; y < yMax; y++)
    {
        size_t const idx = x*strideX + y*strideY;
        __assume(strideX%16==0);
        __assume(strideY%16==0);
        __assume(idx%16==0);
        __assume_aligned(pV, 64);
        __assume_aligned(pVy, 64);
        #pragma vector always assert vecremainder
        #pragma ivdep
        #pragma vector nontemporal (pVy)
        for(size_t z = zMin; z < zMax; z++)
        {
            size_t const i = idx + z;
            pVy[i] =
                yCoeffs[0]*(pV[i-4*strideY]-pV[i+4*strideY]) + …
        }
    }
}
Improved performance by 1.1x on both Xeon and Xeon Phi!
Give the compiler hints about indexing so it knows when to use aligned reads/writes
pVy[i] is written once and should not be cached
Current Performance Results
For anisotropic wave propagation, a Xeon Phi coprocessor is ~2.3x faster than a single Xeon E5-2670 CPU
The same code-base and optimizations are applied to Xeon and Xeon Phi
About Acceleware
Professional training
– Xeon Phi Coprocessor Optimization
– OpenCL
– OpenMP
– MPI
High performance consulting
– Feasibility Studies
– Porting and Optimization
– Algorithm parallelization
Accelerated software
– Oil and Gas
– Electromagnetics
Questions?
Come visit us in booth #1825!
Head Office
Tel: +1 403.249.9099
Email: services@acceleware.com
Viktoria Kaczur
Senior Account Manager
Tel: +1 403.249.9099 ext. 356
Cell: +1 403.671.4455
Email: viktoria.kaczur@acceleware.com
Editor's Notes
Here we see an example of a single RTM shot as the simulation progresses in time. First, we propagate the source wavelet through the earth model. Next, we inject the receiver data into another wavefield using the same earth model. Finally, we see the image forming over time.
Expand 3D to 2D, explain how finer granularity in scheduling helps, especially when there are more cores
Marcel: is my fractional time correct? 250*150 / 240 = 156.25, so during the last iteration (#157), 0.25 of the threads are busy, others are waiting, and the overall slowdown is only one more loop, so 1/157 of the total time.