Optimizing Commercial Software for Intel Xeon Phi Coprocessors: Lessons Learned

In My Parallel Universe…
Small to medium-sized seismic companies aren't limited by computational resources when processing seismic data
Seismic Computing Requirements
[Chart (source: Total): compute requirements of seismic imaging algorithms, 1990–2015. Kirchhoff migration (post-SDM, PreSTM) ~100 GF; WEM ~10 TF; paraxial WE approximation ~100 TF; RTM ~1 PF; FWI ~10 PF; elastic imaging ~100 PF; full WE approximation ~1 EF.]
RTM Overview
[Diagram: the source wavelet is propagated forwards in time, and the receiver data are propagated backwards in time.]
RTM Introduction
Finite-difference code
Compute intensive: 10s of hours per seismic shot
Large memory footprint: ~100 GB per shot
Large local storage requirement: ~500 GB per shot
10,000s of shots
RTM: Computational Requirements
The RTM image is made by migrating and then stacking a large number of shots (typically between 10,000 and 100,000)
Migrating each shot requires two or three 3D wave propagations
Each shot migration requires large RAM (~100 GB) and temporary disk space (~500 GB)
Runtime per shot varies between a few minutes (low-frequency isotropic) and several hours (high-frequency anisotropic)
A typical compute cluster used for RTM will have 100s of nodes
In My Parallel Universe…
Small to medium-sized seismic companies aren't limited by computational resources when processing seismic data
– We want to make RTM (1 PFlop) available to these companies
We're delivering parallel software to run RTM on Xeon Phi systems
RTM: Wave Propagation
Finite-difference time domain technique
– 3D grid with millions of points
– 3D stencils
– Update the entire grid every time step
– 1000s of time steps
Memory footprint of 10–100 GB
Wavefield data from the forward pass are stored to disk to facilitate imaging
Parallelizing Single Shots
The finite-difference grid contains over 200 million cells per volume (2 GB)
Numerous volumes per shot (Earth model, wavefields, and image)
One shot easily fits in a CPU compute node, but may be too large for a single Xeon Phi
Parallelizing Each Shot: Multiple Cards
[Diagram: Phi 0, Phi 1, Phi 2.]
The volume is partitioned into pieces that fit on a single Xeon Phi
Parallelizing Each Shot: Multiple Cards
[Diagram: halo regions transferred between Phi 0 and Phi 1.]
Boundaries must be transferred between partitions
Transfers can become a bottleneck unless they are done asynchronously with the stencil calculations
Parallelizing Each Shot: Within a Card
[Diagram: columns of the x/y plane assigned to threads (Core 0, Threads 0–3; Core 1, Threads 0–1; …), with the z dimension running along the vector lanes.]
Data in x and y are split over cores
Operations in the z dimension are vectorized
Levels of Parallelism
Each shot is split over multiple Xeon Phi coprocessors (or Xeon nodes) using MPI
The partition on each Phi is split over cores using OpenMP
Operations on each thread are vectorized using the compiler's autovectorizer
Kernel: 8th Order Spatial Derivative
#pragma omp parallel for
for(size_t x = xMin; x < xMax; x++)
{
    for(size_t y = yMin; y < yMax; y++)
    {
        size_t const idx = x*strideX + y*strideY;
        #pragma vector …
        for(size_t z = zMin; z < zMax; z++)
        {
            size_t const i = idx + z;
            pVy[i] =
                yCoeffs[0]*(pV[i-4*strideY]-pV[i+4*strideY]) +
                yCoeffs[1]*(pV[i-3*strideY]-pV[i+3*strideY]) +
                yCoeffs[2]*(pV[i-2*strideY]-pV[i+2*strideY]) +
                yCoeffs[3]*(pV[i-1*strideY]-pV[i+1*strideY]) +
                yCoeffs[4]*pV[i];
        }
    }
}
Triple loop over the three dimensions
One-dimensional derivative: a simple calculation with large memory bandwidth
Tuning OpenMP
#pragma omp parallel for collapse(2) schedule(static)
for(size_t x = xMin; x < xMax; x++)
{
    for(size_t y = yMin; y < yMax; y++)
    {
        size_t const idx = x*strideX + y*strideY;
        #pragma vector …
        for(size_t z = zMin; z < zMax; z++)
        {
            size_t const i = idx + z;
            // Derivative Calculations
        }
    }
}
Many options are available for OpenMP
– Tuning is especially important on Xeon Phi (mostly because of the high thread count)
Here we use static loop scheduling because it has the lowest overhead
– It is also the most prone to load-balance issues
16. Optimizing Commercial Software
for Intel Xeon Phi Coprocessors:
Lessons Learned
Tuning OpenMP
collapse(2) combines two adjacent for loops
Here, the X and Y dimensions are combined, e.g. X = 250, Y = 150
Work is divided more evenly onto cores when there are more iterations
– 250 iterations on 240 threads (60*4) means 10 threads do double work while the other threads wait (1/2 of the time wasted)
– 250 x 150 = 37,500 iterations divide much better onto 240 threads (~1/157 of the time wasted)
Improved Phi performance by 1.5x!
Tuning Thread Affinity
We programmatically set affinity with run-dependent logic
Isolating the various tasks prevents over-subscription of cores
[Diagram: transfer threads pinned to core 0, disk-I/O threads to core 1, propagation threads to cores 2–60, and OS threads to core 61.]
Tuning Thread Affinity
Thread affinity settings improved scaling on multiple Phis and on multiple CPU sockets (different settings are used for Xeon Phi and Xeon):

Scaling                              | Without Affinity Changes | With Affinity Changes
Dual Xeon Phi vs. single Phi         | 1.3x                     | 1.9x
Dual Xeon sockets vs. single socket  | 1.7x                     | 1.9x
Tuning Memory Access
#pragma omp parallel for collapse(2) schedule(static)
for(size_t x = xMin; x < xMax; x++)
{
    for(size_t y = yMin; y < yMax; y++)
    {
        size_t const idx = x*strideX + y*strideY;
        __assume(strideX%16==0);
        __assume(strideY%16==0);
        __assume(idx%16==0);
        __assume_aligned(pV, 64);
        __assume_aligned(pVy, 64);
        #pragma vector always assert vecremainder
        #pragma ivdep
        #pragma vector nontemporal (pVy)
        for(size_t z = zMin; z < zMax; z++)
        {
            size_t const i = idx + z;
            pVy[i] =
                yCoeffs[0]*(pV[i-4*strideY]-pV[i+4*strideY]) + …
        }
    }
}
Improved performance by 1.1x on both Xeon and Xeon Phi!
Give the compiler hints about indexing so it knows when to use aligned reads/writes
pVy[i] is written once and should not be cached
Current Performance Results
For anisotropic wave propagation, a Xeon Phi coprocessor is ~2.3x faster than a single Xeon E5-2670 CPU
The same code-base and optimizations are applied to Xeon and Xeon Phi
About Acceleware
Professional training
– Xeon Phi Coprocessor Optimization
– OpenCL
– OpenMP
– MPI
High performance consulting
– Feasibility Studies
– Porting and Optimization
– Algorithm parallelization
Accelerated software
– Oil and Gas
– Electromagnetics
Questions?
Come visit us in booth #1825!
Head Office
Tel: +1 403.249.9099
Email: services@acceleware.com
Viktoria Kaczur
Senior Account Manager
Tel: +1 403.249.9099 ext. 356
Cell: +1 403.671.4455
Email: viktoria.kaczur@acceleware.com
Editor's Notes
Here we see an example of a single RTM shot as the simulation progresses in time. First, we propagate the source wavelet through the earth model. Next, we inject the receiver data into another wavefield using the same earth model. Finally, we see the image forming over time.
Expand 3D to 2D, explain how finer granularity in scheduling helps, especially when there are more cores
Marcel: is my fractional time correct? 250*150 / 240 = 156.25, so during the last iteration (#157), 0.25 of the threads are busy, others are waiting, and the overall slowdown is only one more loop, so 1/157 of the total time.