Ph.D. defense, December 21st, 2015
Contributions of hybrid architectures to depth imaging: a CPU, APU and GPU comparative study
Issam SAID
Energy supply and demand
• 40% more energy is needed by 2035
• No choice but Oil, Gas and Coal
• Sophisticated seismic methods
I. Said Ph.D. defense 12/21/2015 1/50
Seismic methods for Oil & Gas exploration
Acquisition Processing Interpretation
Shot = source activation + data collection (receivers)
Seismic survey
• Air-gun array
• Hydrophones
Shot record
I. Said Ph.D. defense 12/21/2015 2/50
Seismic methods for Oil & Gas exploration
Acquisition Processing
Noise attenuation → Demultiple → Interpolation → Imaging
Interpretation
⇒ Subsurface image
I. Said Ph.D. defense 12/21/2015 2/50
Seismic methods for Oil & Gas exploration
Acquisition Processing Interpretation
Calculate seismic attributes
• Dip
• Azimuth
• Coherence (courtesy of Total)
I. Said Ph.D. defense 12/21/2015 2/50
Reverse Time Migration (RTM)
• The reference computer-based imaging algorithm in the industry
• Repositions seismic events into their true location in the subsurface
• Sub-salt and steep-dip imaging
• Accurate (full two-way wave equation)
• Requires massive resources (compute and storage)
I. Said Ph.D. defense 12/21/2015 3/50
RTM workflow
Forward modeling (FWD) Backward modeling (BWD) Imaging condition
I. Said Ph.D. defense 12/21/2015 4/50
The underlying theory of the RTM algorithm
The RTM operator
$$\mathrm{Img}(x) = \int_0^{H}\!\int_0^{T} S_h(x,t)\, R_h(x,\,T-t)\;\mathrm{d}t\,\mathrm{d}h$$
The Cauchy problem
$$\frac{1}{c^2}\,\frac{\partial^2 u(x,t)}{\partial t^2} - \Delta u(x,t) = s(t) \;\;\text{in } \Omega, \qquad u(x,0) = 0, \qquad \frac{\partial u(x,0)}{\partial t} = 0$$
Boundary condition
$$u = 0 \;\;\text{on } \partial\Omega$$
I. Said Ph.D. defense 12/21/2015 5/50
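The zero-lag cross-correlation in the RTM operator boils down to a time-summed pointwise product of the source and receiver wavefields. A minimal C sketch, assuming both wavefields are available for every time step (array layout and names are illustrative, not the thesis code):

```c
#include <stddef.h>

/* Illustrative RTM imaging condition for one shot:
 * img(x) += S(x, t) * R(x, T - t), accumulated over all time steps.
 * nxyz is the number of grid points, nt the number of time steps. */
void imaging_condition(float *img, const float *S, const float *R,
                       size_t nxyz, int nt)
{
    for (int t = 0; t < nt; t++) {
        const float *src = S + (size_t)t * nxyz;            /* S(., t)     */
        const float *rcv = R + (size_t)(nt - 1 - t) * nxyz; /* R(., T - t) */
        for (size_t i = 0; i < nxyz; i++)
            img[i] += src[i] * rcv[i];
    }
}
```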
Finite Difference Time Domain for RTM
• Finite Difference Time Domain (8th order in space, 2nd order in time)
• Regular grids
• Perfectly Matched Layers (PML) as an absorbing boundary condition
$$U^{n+1}_{i,j,k} = 2\,U^{n}_{i,j,k} - U^{n-1}_{i,j,k} + c^2_{i,j,k}\,\Delta t^2\,\Delta U^{n}_{i,j,k} + c^2_{i,j,k}\,\Delta t^2\, s^{n}$$
• Heavy computation (hours to days of processing time)
• Terabytes of temporary data
• Requires High Performance Computing
I. Said Ph.D. defense 12/21/2015 6/50
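As an illustration of the update formula above, a minimal C sketch of the 2nd-order-in-time, 8th-order-in-space scheme on a regular grid, without PML and with a single grid spacing; array names and the omission of the point-source injection are assumptions made for brevity:

```c
#include <stddef.h>

/* One time step of U^{n+1} = 2U^n - U^{n-1} + c^2 dt^2 (Lap(U^n) + s^n),
 * with an 8th-order centered stencil (p = 8) and dx = dy = dz for brevity.
 * Halos of width 4 are assumed filled; a[0..4] are the centered
 * finite-difference coefficients (a[0] for the center point); the point
 * source s^n is omitted here. */
void fdtd_step(float *un1, const float *un, const float *unm1,
               const float *c2, const float a[5], float dt2, float inv_dx2,
               int nx, int ny, int nz)
{
    #define IDX(i, j, k) ((size_t)(i) + (size_t)nx * ((size_t)(j) + (size_t)ny * (size_t)(k)))
    for (int k = 4; k < nz - 4; k++)
        for (int j = 4; j < ny - 4; j++)
            for (int i = 4; i < nx - 4; i++) {
                float lap = 3.0f * a[0] * un[IDX(i, j, k)];  /* center, once per axis */
                for (int l = 1; l <= 4; l++)
                    lap += a[l] * (un[IDX(i + l, j, k)] + un[IDX(i - l, j, k)]
                                 + un[IDX(i, j + l, k)] + un[IDX(i, j - l, k)]
                                 + un[IDX(i, j, k + l)] + un[IDX(i, j, k - l)]);
                size_t x = IDX(i, j, k);
                un1[x] = 2.0f * un[x] - unm1[x] + c2[x] * dt2 * (lap * inv_dx2);
            }
    #undef IDX
}
```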
HPC solutions for RTM
CPU clusters are the reference
• Process large data sets across interconnected multi-core CPUs
• Advanced optimization techniques (vectorization, cache blocking)
Hardware accelerators and co-processors
• RTM is massively parallel
• GPU, FPGA, Intel Xeon Phi
• Dominance of GPUs:
• Huge compute power (up to 5 TFlop/s)
• High memory bandwidth (up to 300 GB/s)
• Possible PCI overheads (sustained bandwidth up to 12 GB/s)
• Data snapshotting
• MPI communications with neighbors (multi-GPU)
• Limited memory capacities
• A high-end GPU has only 12 GB at most
• CPU based compute nodes have 128 GB
• High power consumption: about 400 W for CPU+GPU (the GPU is not standalone)
I. Said Ph.D. defense 12/21/2015 7/50
GPU based solutions for RTM
• Possible software techniques to overcome RTM limits on GPUs:
• Temporal blocking (PCI overhead)
• Overlapping CPU-GPU transfers with computations (PCI overhead)
• Out-of-core algorithms (memory limitation)
• Extensive efforts and investments
• Hardware solution with an acceptable performance/efforts trade-off?
I. Said Ph.D. defense 12/21/2015 8/50
Towards unifying CPUs and GPUs
[Block diagrams: a multi-core CPU connected to a discrete GPU (compute units with local memories, register files and L1/L2 caches) through the PCI Express bus, versus the Accelerated Processing Unit (APU), which integrates a quad-core CPU module and a GPU module on the same die and accesses the system memory through the GARLIC and ONION buses]
I. Said Ph.D. defense 12/21/2015 9/50
Towards unifying CPUs and GPUs
Strengths
• No PCI Express bus
• Integrated GPUs can address the entire memory
• Low-power processors (95 W TDP at most), versus:
• CPUs: up to 150 W TDP
• GPUs: up to 300 W
Weaknesses
• Low compute power as compared to GPUs:
• Kaveri APU: 730 GFlop/s (integrated GPU)
• Phenom CPU: 150 GFlop/s
• Tahiti GPU: 3700 GFlop/s
• An order of magnitude less memory bandwidth than GPUs:
• APU: up to 25 GB/s
• GPU: up to 300 GB/s
I. Said Ph.D. defense 12/21/2015 9/50
Overview of contributions
[Contribution map: architectures (CPU, GPU, APU), data placement strategies, applications (matrix multiplication, finite difference stencils, modeling, RTM, hybrid strategy), programming models (OpenCL, OpenACC), and evaluation (performance and power efficiency, one node and large scale, strong and weak scaling, successive generations)]
Is the APU a valuable HPC solution for depth imaging that:
• may be more efficient than CPU solutions?
• is likely to overcome the limitations of GPU based solutions?
I. Said Ph.D. defense 12/21/2015 10/50
Evaluation of the APU technology
[Contribution map repeated as a roadmap for this part]
I. Said Ph.D. defense 12/21/2015 11/50
The APU memory subsystem
• Onion: coherent bus (slow)
• Garlic: non coherent bus (full memory bandwidth)
I. Said Ph.D. defense 12/21/2015 12/50
The APU memory subsystem
• c: regular CPU memory (size depends on the RAM)
I. Said Ph.D. defense 12/21/2015 12/50
The APU memory subsystem
• g: fixed size (512 MB to 4 GB)
• cg: explicit copy from CPU memory to GPU memory
• gc: explicit copy from GPU memory to CPU memory
I. Said Ph.D. defense 12/21/2015 12/50
The APU memory subsystem
• u: zero-copy and non coherent (read-only accesses from GPU cores)
• Fixed and limited size (up to 1 GB)
I. Said Ph.D. defense 12/21/2015 12/50
The APU memory subsystem
• z: zero-copy and coherent memory
• Variable size (up to the maximum CPU memory size)
I. Said Ph.D. defense 12/21/2015 12/50
Data placement strategies on APU
• OpenCL data copy kernel
• From buffer A to buffer B
• Store buffers A and B in different memory locations
• Evaluate different combinations, for example cggc (explicit copy) and zz (zero-copy)
I. Said Ph.D. defense 12/21/2015 13/50
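A hedged host-side sketch of how the two extreme strategies could be expressed with the standard OpenCL API; the thesis relies on the AMD OpenCL runtime, which also exposes vendor-specific flags, so the exact flags used there may differ:

```c
#include <CL/cl.h>

/* cggc (explicit copy): a buffer resident in the GPU partition, fed with an
 * explicit c -> g copy and drained with a g -> c copy after the kernels. */
cl_mem make_explicit_buffer(cl_context ctx, cl_command_queue q,
                            const float *host_src, size_t bytes, cl_int *err)
{
    cl_mem dev = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, err);
    clEnqueueWriteBuffer(q, dev, CL_TRUE, 0, bytes, host_src, 0, NULL, NULL);
    return dev;  /* later: clEnqueueReadBuffer(...) performs the g -> c copy */
}

/* zz (zero-copy): a host-allocated buffer that the integrated GPU reads and
 * writes in place, with no explicit transfer. */
cl_mem make_zero_copy_buffer(cl_context ctx, size_t bytes, cl_int *err)
{
    return clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                          bytes, NULL, err);
}
```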
Data placement benchmark results on APU
[Bar chart: copy time in ms (lower is better) for the cggc, zgc, ugc, zz and uz strategies, broken down into kernel (copy) time, GPU-to-CPU and CPU-to-GPU transfers; buffer size: 128 MB]
• Using zero-copy ⇒ about 60% of the maximum sustained bandwidth
• The most relevant strategies are selected: cggc, ugc and zz
I. Said Ph.D. defense 12/21/2015 14/50
Applicative benchmarks on APU
Matrix multiplication
• Compute bound algorithm
• Evaluate the sustained compute gap between GPUs and APUs
8th-order 3D finite difference stencil
• Memory bound algorithm
• Building block of the Reverse Time Migration
• Evaluate the APU memory performance
Impact of data placement strategies on the APU performance
I. Said Ph.D. defense 12/21/2015 15/50
Finite difference stencils
$$\Delta U^{n}_{i,j,k} = \frac{1}{\Delta x^2}\sum_{l=-p/2}^{p/2} a_l\, U^{n}_{i+l,j,k} + \frac{1}{\Delta y^2}\sum_{l=-p/2}^{p/2} a_l\, U^{n}_{i,j+l,k} + \frac{1}{\Delta z^2}\sum_{l=-p/2}^{p/2} a_l\, U^{n}_{i,j,k+l}, \qquad p = 8$$
• Compute complexity: O(N^3)
• Storage complexity: O(N^3)
• Data snapshotting (K ∈ [1, 10])
I. Said Ph.D. defense 12/21/2015 16/50
Stencils: implementation details
• 2D work-item grid on the 3D domain
• One column along the Z axis per work-item
• Register blocking when traversing the
Z dimension
• Implementations:
• scalar: global memory
• local scalar: local memory to exploit
memory access redundancies
• vectorized: global memory +
explicit vectorization
• local vectorized: local memory +
explicit vectorization
I. Said Ph.D. defense 12/21/2015 17/50
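A simplified OpenCL kernel sketch of this organization (a 2D work-item grid, each work-item sweeping its Z column and keeping the Z window in registers); the grid spacing is assumed cubic and the array names are illustrative, so this is not the tuned thesis kernel:

```c
// Scalar variant: global memory only, one (x, y) column per work-item,
// with the 9-point Z window kept in registers (register blocking).
// Illustrative assumptions: halo width 4, dx = dy = dz, a[0..4] coefficients.
__kernel void stencil3d_scalar(__global const float *u, __global float *v,
                               __constant float *a, const float inv_h2,
                               const int nx, const int ny, const int nz)
{
    const int i = get_global_id(0) + 4;
    const int j = get_global_id(1) + 4;
    if (i >= nx - 4 || j >= ny - 4) return;

    float w[9];                       // sliding Z window: planes k-4 .. k+4
    for (int l = 0; l < 9; l++)
        w[l] = u[i + nx * (j + ny * l)];

    for (int k = 4; k < nz - 4; k++) {
        float r = 3.0f * a[0] * w[4];
        for (int l = 1; l <= 4; l++) {
            r += a[l] * (w[4 + l] + w[4 - l]);                     // Z (registers)
            r += a[l] * (u[(i + l) + nx * (j + ny * k)]
                       + u[(i - l) + nx * (j + ny * k)]);          // X
            r += a[l] * (u[i + nx * ((j + l) + ny * k)]
                       + u[i + nx * ((j - l) + ny * k)]);          // Y
        }
        v[i + nx * (j + ny * k)] = r * inv_h2;

        for (int l = 0; l < 8; l++) w[l] = w[l + 1];               // slide window
        if (k + 5 < nz) w[8] = u[i + nx * (j + ny * (k + 5))];
    }
}
```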
Stencil computations on CPU
[Plot: GFlop/s (higher is better) for N×N×32 grids, N = 64 to 1024, comparing the scalar, vectorized, local vectorized and OpenMP implementations]
• Explicit vectorization helped to deliver the best performance (SSE)
• OpenCL ≥ OpenMP
I. Said Ph.D. defense 12/21/2015 18/50
Stencil computations on GPU
[Plot: GFlop/s (higher is better) for N×N×32 grids, N = 64 to 1024, comparing the scalar, local scalar, vectorized and local vectorized implementations]
• Scalar ≥ vectorized thanks to GCN (Graphics Core Next)
• Scalar code + OpenCL local memory offered the best performance
I. Said Ph.D. defense 12/21/2015 19/50
Stencil computations on APU
[Plot: GFlop/s (higher is better) for N×N×32 grids, N = 64 to 1024, comparing the scalar, local scalar, vectorized and local vectorized implementations]
• Local scalar gives the best performance numbers for N ≥ 128
• Vectorization is not needed thanks to GCN
I. Said Ph.D. defense 12/21/2015 20/50
Stencils: data placement strategies
• Fixed problem size (1024 × 1024 × 32)
• One snapshot every K computations (1 ≤ K ≤ 10)
• Select the best OpenCL implementations (scalar, local scalar)
• Combine them with data placement strategies: cggc, ugc, zz
[Plot: GFlop/s (higher is better) vs. K (computations per snapshot) for the 1024×1024×32 problem (128 MB), for scalar and local scalar combined with the cggc, ugc and zz strategies, plus the best configuration]
• Best: local scalar (zz) for 1 ≤ K ≤ 3 and (cggc) for 3 ≤ K ≤ 10
• Select explicit copy (cggc) and zero-copy (zz) for RTM
I. Said Ph.D. defense 12/21/2015 21/50
Stencils: performance comparison
[Plot: GFlop/s (higher is better) vs. K for the 1024×1024×32 problem (128 MB): CPU, GPU, APU, and a performance projection for an APU whose Onion bus is as fast as Garlic]
• APU > CPU ∀K
• GPU > APU, 2 ≤ K ≤ 10
• APU > GPU when performing one snapshot after each iteration
I. Said Ph.D. defense 12/21/2015 22/50
Stencils: conclusion
• APU can be an attractive solution:
• For a high rate of data snapshotting (finite difference)
• For medium sized problems (matrix multiplication)
• An order of magnitude of theoretical performance gap between GPU and APU:
• But only 3× to 4× in practice
• Considering performance only, the GPU remains the preferred solution
• Power is gaining interest in the HPC community (Green500)
• Power wall and Exascale
• What about power consumption?
I. Said Ph.D. defense 12/21/2015 23/50
Power measurement methodology
• Raritan PX (DPXR8A-16) PDU to monitor the power consumption
• Performance per Watt (PPW) metric
Methodology
• The power drawn by the system as a whole:
• Same functional hardware components for the 3 architectures
• CPU+GPU for GPU based solutions
• Importance of the electrical efficiency of the Power Supply Units (PSUs)
I. Said Ph.D. defense 12/21/2015 24/50
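Written out, the metric used in the power comparisons is simply:

```latex
% Performance per Watt (PPW), the whole system being measured at the PDU
\mathrm{PPW} = \frac{\text{sustained performance}\ [\mathrm{GFlop/s}]}
                    {\text{average power drawn by the whole system}\ [\mathrm{W}]}
\qquad [\mathrm{GFlop/s/W}]
```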
Stencils: power efficiency comparison
[Plot: GFlop/s/W (higher is better) vs. K for the 1024×1024×32 problem (128 MB) for CPU, GPU and APU; annotated power draws: up to 62 W, up to 222 W and up to 159 W]
• CPU offers a very low power efficiency (0.08 GFlop/s/W)
• The APU is 13% more power efficient than the GPU
• Higher gain for compute bound algorithm (matrix multiplication):
• Flops consume less power than memory accesses
I. Said Ph.D. defense 12/21/2015 25/50
RTM on one HPC node
[Contribution map repeated as a roadmap for this part]
I. Said Ph.D. defense 12/21/2015 26/50
One-node RTM GPU/APU implementations
• Multiple OpenCL kernels (PML):
• Reduce compute/memory divergence
• Stencils study conclusions:
• Stencil optimizations and auto-tuning
• scalar and local scalar
• Data placement strategies (APU)
• Imaging condition on CPU
• Evaluate:
• Kernels (kernels only)
• Full application (overall)
[Diagram: 3D physical domain (X, Y, Z) with a free surface on top and absorbing boundaries on the other faces]
Case study
• 3D SEG/EAGE Salt velocity model
• Compute grid that fits in one GPU compute node (less than 3 GB)
• Selective checkpointing frequency K=10
I. Said Ph.D. defense 12/21/2015 27/50
One-node RTM on GPU/APU
kernels only (GFlop/s) overall (GFlop/s) %loss
GPU 141.77 29.65 79%
APU(explicit copy) 32.42 15.93 50%
APU(zero-copy) 15.2 11.45 24%
GPU
• Best: scalar implementation
• Impact of PCI+IO (snapshotting) on performance
APU
• Best: scalar, using explicit data copies (cggc)
• Local memory is beneficial when using zero-copy memory objects
I. Said Ph.D. defense 12/21/2015 28/50
One-node RTM: performance comparison
[Bar chart: one-node RTM performance in GFlop/s (higher is better) per architecture (CPU, APU zero-copy, APU explicit copy, GPU), for kernels only and overall]
• Poor performance on the Phenom CPU (OpenCL)
• Gap between GPU and APU:
• 4.4× with kernels only
• Only 1.8× when considering the overall run time
I. Said Ph.D. defense 12/21/2015 29/50
One-node RTM: power efficiency comparison
[Bar chart: one-node power efficiency in GFlop/s/W (higher is better): CPU (137 W), APU zero-copy (62 W), APU explicit copy (62 W), GPU (198 W)]
• Performance numbers based on overall timings
• Poor power efficiency on the Phenom CPU (0.013 GFlop/s/W)
• APU can be more power efficient than the GPU:
• 1.80× (explicit copy)
• 1.23× (zero-copy)
I. Said Ph.D. defense 12/21/2015 30/50
One-node RTM: conclusion
• RTM(kernels only): huge gap between APU and GPU
• RTM(overall): the performance gap is reduced
• Performance + power: the APU is almost twice as efficient as the GPU
I. Said Ph.D. defense 12/21/2015 31/50
RTM on multi-node hybrid architectures
[Contribution map repeated as a roadmap for this part]
I. Said Ph.D. defense 12/21/2015 32/50
RTM on multi-node hybrid architectures
Motivations
• Real-world cases generate large amounts of data (on the order of 1 terabyte)
• Larger than one node memory capacities
• Impact of MPI communications on the PCI overhead (GPU)?
• Impact of zero-copy on MPI communications (APU)?
Clusters (located at Total), 16 nodes used per cluster:
• CPU cluster: 2× Intel Xeon E5-2670 per node
• APU cluster: 1× AMD A10-7850K (Kaveri) per node
• GPU cluster: 1× NVIDIA Tesla K40s + 1× Intel Xeon E5-2680 per node
Case study
• Same velocity model, K=10
• Compute grid size: about 25 GB
I. Said Ph.D. defense 12/21/2015 33/50
Multi-node RTM: implementation
• 3D domain decomposition
• One-node study conclusions
• Boundaries copied to contiguous
buffers:
• For GPUs, using OpenCL kernels
• For GPUs, PCI memory transfers:
• Communications with neighbors
• I/O operations for snapshotting
[Diagram: 3D domain decomposition (X, Y, Z) with the six neighbor faces: North, South, East, West, Front, Back]
I. Said Ph.D. defense 12/21/2015 34/50
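A minimal C/MPI sketch of how such a 3D decomposition and its six face neighbors can be set up; the thesis may compute the decomposition differently, this version simply uses the standard MPI Cartesian-topology routines:

```c
#include <mpi.h>

/* Build a 3D Cartesian communicator and find the six face neighbors
 * (West/East along X, South/North along Y, Back/Front along Z). */
void setup_decomposition(MPI_Comm comm, MPI_Comm *cart,
                         int neighbors[6] /* W, E, S, N, B, F */)
{
    int nprocs, dims[3] = {0, 0, 0}, periods[3] = {0, 0, 0};
    MPI_Comm_size(comm, &nprocs);
    MPI_Dims_create(nprocs, 3, dims);                  /* balanced 3D grid   */
    MPI_Cart_create(comm, 3, dims, periods, 1, cart);  /* reordering allowed */
    MPI_Cart_shift(*cart, 0, 1, &neighbors[0], &neighbors[1]);  /* X: W, E */
    MPI_Cart_shift(*cart, 1, 1, &neighbors[2], &neighbors[3]);  /* Y: S, N */
    MPI_Cart_shift(*cart, 2, 1, &neighbors[4], &neighbors[5]);  /* Z: B, F */
}
```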
Multi-node RTM: MPI overlapping
Problem: ineffective non-blocking communications (initial)
[Timeline: process P0 posts Isend(buf), does work that does not touch buf, then calls Wait() before using buf, while P1 posts Recv(buf); with no progress thread, the communication makes no progress during the work]
Solution: explicit overlap technique (overlap)
[Timeline: an auxiliary thread is activated to carry out the blocking MPI communications while the user thread updates the inner domain; after synchronization, the domain boundaries are updated]
I. Said Ph.D. defense 12/21/2015 35/50
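A hedged sketch of the explicit overlap technique with a POSIX auxiliary thread; it assumes MPI is initialized with MPI_THREAD_MULTIPLE so that the auxiliary thread may issue the blocking calls, and the halo data structure is illustrative:

```c
#include <mpi.h>
#include <pthread.h>

/* Illustrative halo-exchange arguments (not the thesis data structures). */
typedef struct {
    float *send, *recv;   /* packed boundary buffers     */
    int    count, peer;   /* halo size and neighbor rank */
} halo_t;

static void *exchange_halos(void *arg)          /* auxiliary thread */
{
    halo_t *h = (halo_t *)arg;
    /* Blocking exchange with the neighbor: progress is guaranteed because
     * this thread does nothing but communicate. */
    MPI_Sendrecv(h->send, h->count, MPI_FLOAT, h->peer, 0,
                 h->recv, h->count, MPI_FLOAT, h->peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return NULL;
}

void overlapped_step(halo_t *h, void (*update_inner)(void),
                     void (*update_boundaries)(void))
{
    pthread_t aux;
    pthread_create(&aux, NULL, exchange_halos, h);  /* activate aux thread  */
    update_inner();                                 /* user thread computes */
    pthread_join(aux, NULL);                        /* sync                 */
    update_boundaries();                            /* halos are now valid  */
}
```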
Multi-CPU RTM(FWD)
[Bar chart: time in s (lower is better) for the initial and overlap versions on 1 to 64 nodes, broken down into comm, out, in and max[in,comm], with a perfect-scaling reference; the MPI fraction grows from about 1% on 1 node to 76% on 64 nodes, and the overlap gain ranges from about -27% to +49%; an inset zooms in on 16 to 64 nodes]
• 1 CPU node = 2 sockets (8 cores 2-way SMT each)
• 16 MPI processes per node (32 threads with SMT)
• The overlap technique is beneficial when MPI fractions are high
I. Said Ph.D. defense 12/21/2015 36/50
Multi-GPU RTM(FWD)
[Bar chart: time in s (lower is better) for the initial and overlap versions on 8 and 16 nodes, broken down into comm, d-h-comm, unpack, pack, in+out, out and max[in,comm], with a perfect-scaling reference; MPI fractions: 15.76% (8 nodes) and 26.49% (16 nodes); performance gains: 12.21% and 14.24%]
• Only 2 test cases due to memory limitations
• Up to 14% of performance gain (CPU dedicated to communications)
• PCI overheads hinder achieving near to perfect scaling
I. Said Ph.D. defense 12/21/2015 37/50
Multi-APU(explicit copy) RTM(FWD)
[Bar chart: time in s (lower is better) for the initial and overlap versions on 8 and 16 nodes, with the same breakdown as above; MPI fractions: 14.76% and 19.03%; performance gains: 13.99% and 18.89%]
• Up to 18% of performance gain (CPU dedicated to communications)
• Lower overhead to copy the boundaries ⇒ near-perfect scaling
I. Said Ph.D. defense 12/21/2015 38/50
Multi-APU(zero-copy) RTM(FWD)
[Bar chart: time in s (lower is better) for the initial and overlap versions on 1 to 16 nodes, with the same breakdown; the MPI fraction grows from 0.10% (1 node) to 17.80% (16 nodes), and performance gains range from 0.34% to 14.68%; an inset zooms in on 8 and 16 nodes]
• Up to 14% of performance gain
• Zero-copy ⇒ no CPU-GPU copy overhead + near-perfect scaling
I. Said Ph.D. defense 12/21/2015 39/50
Multi-node RTM: asynchronous I/O
Synchronous data snapshotting (sync)
Proposed solution (async)
I. Said Ph.D. defense 12/21/2015 40/50
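A minimal sketch of the proposed asynchronous snapshotting with an auxiliary I/O thread; the staging copy and file layout are assumptions made for illustration, not the thesis implementation:

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    const float *staging;   /* copy of the wavefield to be written */
    size_t       bytes;
    const char  *path;
} snapshot_t;

static void *write_snapshot(void *arg)          /* auxiliary I/O thread */
{
    snapshot_t *s = (snapshot_t *)arg;
    FILE *f = fopen(s->path, "wb");
    if (f) { fwrite(s->staging, 1, s->bytes, f); fclose(f); }
    return NULL;
}

/* Called every K time steps: copy the wavefield to a staging buffer, then
 * let the auxiliary thread write it to disk while computation resumes. */
pthread_t snapshot_async(const float *wavefield, float *staging, size_t bytes,
                         snapshot_t *s, const char *path)
{
    memcpy(staging, wavefield, bytes);   /* short pause instead of full I/O */
    s->staging = staging; s->bytes = bytes; s->path = path;
    pthread_t tid;
    pthread_create(&tid, NULL, write_snapshot, s);
    return tid;                          /* join before reusing `staging`   */
}
```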
Multi-CPU RTM(BWD)
[Bar chart: time in s (lower is better) for the sync and async versions on 1 to 64 nodes, broken down into io, img, out and max[in,comm], with a perfect-scaling reference; the I/O fraction decreases from about 24% on 1 node to 2% on 64 nodes, and the gain from asynchronous I/O ranges from about 23% down to -0.3%; an inset zooms in on the higher node counts]
• MPI processes pinning
• Background engine for asynchronous I/O (auxiliary thread)
• Asynchronous I/O is beneficial at low node counts only
• At high node counts, the compute nodes are oversubscribed
I. Said Ph.D. defense 12/21/2015 41/50
Multi-GPU RTM(BWD)
[Bar chart: time in s (lower is better) for the sync and async versions on 8 and 16 nodes, broken down into io, img, dtoh-io, d-h-comm, unpack, pack, out and max[in,comm], with a perfect-scaling reference; I/O fractions: 34.17% and 46.93%; performance gains: 33.73% and 40.23%]
• Kernel times ⇒ I/O fraction
• Up to 40% performance gain (CPU fully dedicated to MPI+I/O)
• PCI overhead for I/O and communications with neighbors
I. Said Ph.D. defense 12/21/2015 42/50
Multi-APU(explicit copy) RTM(BWD)
[Bar chart: time in s (lower is better) for the sync and async versions on 8 and 16 nodes, with the same breakdown; I/O fractions: 13.06% and 13.57%; performance gains: 11.66% and 13.33%]
• Asynchronous I/O offers up to 13% performance gain
• Lower CPU-GPU data copies overhead
I. Said Ph.D. defense 12/21/2015 43/50
Multi-APU(zero-copy) RTM(BWD)
[Bar chart: time in s (lower is better) for the sync and async versions on 1 to 16 nodes, with the same breakdown; I/O fractions stay between roughly 9% and 12%, and the gains from asynchronous I/O between roughly 4.5% and 9%; an inset zooms in on 8 and 16 nodes]
• Asynchronous I/O offers up to 9% performance gain
• Zero-copy memory = No CPU-GPU data copies prior to I/O
I. Said Ph.D. defense 12/21/2015 44/50
Multi-node RTM: performance comparison
[Bar chart: time in s (lower is better) on 16 nodes of each cluster: CPU, APU (zero-copy), APU (explicit copy), GPU]
• APU cluster (explicit copy) > CPU cluster
• 1.6× (node (2 CPU) to node (1 APU))
• 3.2× (socket (1 CPU) to socket (1 APU))
• GPU cluster > APU cluster (explicit copy) by 3.5×
• GPU cluster > APU cluster (zero-copy) by 8.3×
• APU cluster (explicit copy) > APU cluster (zero-copy) by 2.3×
I. Said Ph.D. defense 12/21/2015 45/50
Multi-node RTM: estimated power efficiency
[Bar chart: time in s (lower is better) under a fixed 1600 W power budget: CPU (8 nodes), APU zero-copy (16 nodes), APU explicit copy (16 nodes), GPU (4 nodes); cost annotations: $3200 and $12000]
• Power budget 1600 W (TDP and maximum power consumption)
• APU cluster (zero-copy) > CPU cluster
• APU cluster (explicit copy) = GPU cluster
I. Said Ph.D. defense 12/21/2015 46/50
Conclusions
• Evaluation of the APU technology:
• Performance standpoint: GPU > APU
• Performance + power: the APU becomes an attractive solution
• Importance of data placement strategies
• One-node RTM study:
• The same conclusions (APU evaluation) were confirmed
• Multi-node study of the RTM:
• GPU/APU: I/O and communications represent a high fraction of the run times
• GPU/APU: overlapping I/O and communications is mandatory
• Kaveri APU 3.2× speedup over Intel Xeon E5-2670
• Kaveri APU falls behind NVIDIA Tesla K40s GPU by 3.5×
• APU ≈ GPU in terms of power efficiency
I. Said Ph.D. defense 12/21/2015 47/50
Conclusions on programming models
• 3 OpenACC based solutions:
• OpenACC only
• OpenACC+HMPPcg (extension to HMPP provided by CAPS)
• OpenACC+code modification
APU (GFlop/s) GPU (GFlop/s) #LOC
OpenACC 17.61 77.55 34
OpenCL 32.42 141.77 779
• OpenACC+HMPPcg offers the best directive based performance
• OpenACC+HMPPcg provides only half the OpenCL performance
• But 26× fewer lines of code (LOC)
I. Said Ph.D. defense 12/21/2015 48/50
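For illustration, the kind of directive-based port being compared looks roughly like the following hedged sketch (a C/OpenACC version of the earlier wave-equation update; the actual directives and tuning in the thesis may differ):

```c
#include <stddef.h>

/* Directive-based version of the earlier fdtd_step sketch: the triple loop
 * is offloaded with OpenACC, data movement being handled by an enclosing
 * `#pragma acc data` region in the caller. */
void fdtd_step_acc(float *un1, const float *un, const float *unm1,
                   const float *c2, const float *a, float dt2, float inv_dx2,
                   int nx, int ny, int nz)
{
    #pragma acc parallel loop collapse(3) default(present)
    for (int k = 4; k < nz - 4; k++)
        for (int j = 4; j < ny - 4; j++)
            for (int i = 4; i < nx - 4; i++) {
                size_t x = (size_t)i + (size_t)nx * ((size_t)j + (size_t)ny * (size_t)k);
                float lap = 3.0f * a[0] * un[x];
                #pragma acc loop seq
                for (int l = 1; l <= 4; l++)
                    lap += a[l] * (un[x + l] + un[x - l]
                                 + un[x + (size_t)l * nx] + un[x - (size_t)l * nx]
                                 + un[x + (size_t)l * nx * ny]
                                 + un[x - (size_t)l * nx * ny]);
                un1[x] = 2.0f * un[x] - unm1[x] + c2[x] * dt2 * (lap * inv_dx2);
            }
}
```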
Perspectives
• Directive-based approach for multi-node RTM
• Upcoming APU roadmap
• Full memory unification (hardware level)
• HBM (High Bandwidth Memory) + compute units count increase
• OpenPower and NVLink
• More complex and realistic RTM algorithms:
• Adding anisotropy
• Elastic media
I. Said Ph.D. defense 12/21/2015 49/50
Thank you for your attention, questions?
List of publications
• H. Calandra, R. Dolbeau, P. Fortin, J.-L. Lamotte, I. Said,
Assessing the relevance of APU for high performance scientific computing,
AMD Fusion Developer Summit (AFDS), 2012.
• H. Calandra, R. Dolbeau, P. Fortin, J.-L. Lamotte, I. Said,
Evaluation of successive CPUs/APUs/GPUs based on an OpenCL finite difference stencil,
21st Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP 2013.
• H. Calandra, R. Dolbeau, P. Fortin, J.-L. Lamotte, I. Said,
Forward seismic modeling on AMD Accelerated Processing Unit,
2013 Rice Oil & Gas HPC Workshop.
• P. Eberhart, I. Said, P. Fortin, H. Calandra,
Hybrid strategy for stencil computations on the APU,
The 1st International Workshop on High-Performance Stencil Computations, 2014.
• F. Jézéquel, J.-L. Lamotte, I. Said,
Estimation of numerical reproducibility on CPU and GPU,
Federated Conference on Computer Science and Information Systems, 2015.
• I. Said, P. Fortin, J.-L. Lamotte and H. Calandra,
Leveraging the Accelerated Processing Units for seismic imaging: a performance and power efficiency comparison against
CPUs and GPUs,
(submitted in October 2015 to an international journal).
• I. Said, P. Fortin, J.-L. Lamotte, H. Calandra,
Efficient Reverse Time Migration on APU clusters,
2016 Rice Oil & Gas HPC Workshop (submitted in November 2015).
I. Said Ph.D. defense 12/21/2015 50/50
APU generations: FD performance
[Plot: GFlop/s vs. K (computations per snapshot) for the three APU generations, Llano, Trinity and Kaveri, with and without data transfers (comp-only)]
Weak scaling: multi-CPU RTM
[Bar chart: weak scaling of the multi-CPU RTM, time in s on 1 to 64 nodes for fwd-sync, fwd-async, bwd-sync and bwd-async, broken down into io, out, max[in,comm] and img, with fwd and bwd perfect-scaling references; annotated percentages range from about -0.4% to 21%]
I. Said Ph.D. defense 12/21/2015 51/50
Weak scaling: multi-GPU RTM
[Bar chart: weak scaling of the multi-GPU RTM, time in s on 1 to 16 nodes for fwd-sync, fwd-async, bwd-sync and bwd-async, broken down into io, dtoh-io, d-h-comm, unpack, pack, out, max[in,comm] and img, with fwd and bwd perfect-scaling references; annotated percentages range from about 28% to 53%]
I. Said Ph.D. defense 12/21/2015 52/50
Weak scaling: multi-APU RTM (explicit copy)
[Bar chart: weak scaling of the multi-APU RTM (explicit copy), time in s on 1 to 16 nodes for fwd-sync, fwd-async, bwd-sync and bwd-async, with the same breakdown; annotated percentages range from about 7% to 18%]
I. Said Ph.D. defense 12/21/2015 53/50
Weak scaling: multi-APU RTM (zero-copy)
[Bar chart: weak scaling of the multi-APU RTM (zero-copy), time in s on 1 to 16 nodes for fwd-sync, fwd-async, bwd-sync and bwd-async, with the same breakdown; annotated percentages range from about 8% to 13%]
I. Said Ph.D. defense 12/21/2015 54/50
Estimated production throughput
[Bar chart: time in s for 8 shots run in parallel on 8 nodes (1 shot per node: 1-cpu, 1-apu-zz) versus 8 shots run sequentially on 8 nodes (1 shot per 8 nodes: 8-apu-cggc, 8-gpu), for the forward (fwd) and backward (bwd) phases]
• Loss of parallel efficiency as the nodes count increases
• GPU cluster > APU cluster (zero-copy) by 7.6× (8.3×)
• APU cluster (explicit copy) > APU cluster (zero-copy) by 2× (2.3×)
I. Said Ph.D. defense 12/21/2015 55/50
More Related Content

Similar to slides

Cache Optimization Techniques for General Purpose Graphic Processing Units
Cache Optimization Techniques for General Purpose Graphic Processing UnitsCache Optimization Techniques for General Purpose Graphic Processing Units
Cache Optimization Techniques for General Purpose Graphic Processing Units
Vajira Thambawita
 
The CAOS framework: Democratize the acceleration of compute intensive applica...
The CAOS framework: Democratize the acceleration of compute intensive applica...The CAOS framework: Democratize the acceleration of compute intensive applica...
The CAOS framework: Democratize the acceleration of compute intensive applica...
NECST Lab @ Politecnico di Milano
 
E3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - SundanceE3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - Sundance
Sundance Multiprocessor Technology Ltd.
 
Architectural Optimizations for High Performance and Energy Efficient Smith-W...
Architectural Optimizations for High Performance and Energy Efficient Smith-W...Architectural Optimizations for High Performance and Energy Efficient Smith-W...
Architectural Optimizations for High Performance and Energy Efficient Smith-W...
NECST Lab @ Politecnico di Milano
 
PACT_conference_2019_Tutorial_02_gpgpusim.pptx
PACT_conference_2019_Tutorial_02_gpgpusim.pptxPACT_conference_2019_Tutorial_02_gpgpusim.pptx
PACT_conference_2019_Tutorial_02_gpgpusim.pptx
ssuser30e7d2
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
Kohei KaiGai
 
OCP liquid direct to chip temperature guideline.pdf
OCP liquid direct to chip temperature guideline.pdfOCP liquid direct to chip temperature guideline.pdf
OCP liquid direct to chip temperature guideline.pdf
bui thequan
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
Dilum Bandara
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda en
Kohei KaiGai
 
RCW@DEI - Real Needs And Limits
RCW@DEI - Real Needs And LimitsRCW@DEI - Real Needs And Limits
RCW@DEI - Real Needs And Limits
Marco Santambrogio
 
Efficient Evaluation of Embedded-System Design Alternatives (SPLC Tutorial 2019)
Efficient Evaluation of Embedded-System Design Alternatives (SPLC Tutorial 2019)Efficient Evaluation of Embedded-System Design Alternatives (SPLC Tutorial 2019)
Efficient Evaluation of Embedded-System Design Alternatives (SPLC Tutorial 2019)
Maxime Cordy
 
The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...
NECST Lab @ Politecnico di Milano
 
byteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurationsbyteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
inside-BigData.com
 
Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Panel: NRP Science Impacts​
Panel: NRP Science Impacts​
Larry Smarr
 
Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...
Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...
Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...
Editor IJMTER
 
Task allocation on many core-multi processor distributed system
Task allocation on many core-multi processor distributed systemTask allocation on many core-multi processor distributed system
Task allocation on many core-multi processor distributed system
Deepak Shankar
 
"Approaches for Energy Efficient Implementation of Deep Neural Networks," a P...
"Approaches for Energy Efficient Implementation of Deep Neural Networks," a P..."Approaches for Energy Efficient Implementation of Deep Neural Networks," a P...
"Approaches for Energy Efficient Implementation of Deep Neural Networks," a P...
Edge AI and Vision Alliance
 
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Fisnik Kraja
 
Infrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningInfrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep Learning
Sergey Karayev
 

Similar to slides (20)

Cache Optimization Techniques for General Purpose Graphic Processing Units
Cache Optimization Techniques for General Purpose Graphic Processing UnitsCache Optimization Techniques for General Purpose Graphic Processing Units
Cache Optimization Techniques for General Purpose Graphic Processing Units
 
The CAOS framework: Democratize the acceleration of compute intensive applica...
The CAOS framework: Democratize the acceleration of compute intensive applica...The CAOS framework: Democratize the acceleration of compute intensive applica...
The CAOS framework: Democratize the acceleration of compute intensive applica...
 
E3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - SundanceE3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - Sundance
 
Architectural Optimizations for High Performance and Energy Efficient Smith-W...
Architectural Optimizations for High Performance and Energy Efficient Smith-W...Architectural Optimizations for High Performance and Energy Efficient Smith-W...
Architectural Optimizations for High Performance and Energy Efficient Smith-W...
 
PACT_conference_2019_Tutorial_02_gpgpusim.pptx
PACT_conference_2019_Tutorial_02_gpgpusim.pptxPACT_conference_2019_Tutorial_02_gpgpusim.pptx
PACT_conference_2019_Tutorial_02_gpgpusim.pptx
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
 
OCP liquid direct to chip temperature guideline.pdf
OCP liquid direct to chip temperature guideline.pdfOCP liquid direct to chip temperature guideline.pdf
OCP liquid direct to chip temperature guideline.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda en
 
RCW@DEI - Real Needs And Limits
RCW@DEI - Real Needs And LimitsRCW@DEI - Real Needs And Limits
RCW@DEI - Real Needs And Limits
 
Efficient Evaluation of Embedded-System Design Alternatives (SPLC Tutorial 2019)
Efficient Evaluation of Embedded-System Design Alternatives (SPLC Tutorial 2019)Efficient Evaluation of Embedded-System Design Alternatives (SPLC Tutorial 2019)
Efficient Evaluation of Embedded-System Design Alternatives (SPLC Tutorial 2019)
 
The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...
 
byteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurationsbyteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurations
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
 
Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Panel: NRP Science Impacts​
Panel: NRP Science Impacts​
 
Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...
Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...
Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...
 
Task allocation on many core-multi processor distributed system
Task allocation on many core-multi processor distributed systemTask allocation on many core-multi processor distributed system
Task allocation on many core-multi processor distributed system
 
"Approaches for Energy Efficient Implementation of Deep Neural Networks," a P...
"Approaches for Energy Efficient Implementation of Deep Neural Networks," a P..."Approaches for Energy Efficient Implementation of Deep Neural Networks," a P...
"Approaches for Energy Efficient Implementation of Deep Neural Networks," a P...
 
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
 
Infrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningInfrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep Learning
 

slides

  • 1. Ph.D. defense December 21st 2015 Contributions of hybrid architectures to depth imaging: a CPU, APU and GPU comparative study Issam SAID
  • 2. Energy supply and demand • 40% more energy is needed by 2035 • No choice but Oil, Gas and Coal • Sophisticated seismic methods I. Said Ph.D. defense 12/21/2015 1/50
  • 3. Seismic methods for Oil & Gas exploration Acquisition Processing Interpretation Shot = source activation + data collection (receivers) Seismic survey • Air-gun array • Hydrophones Shot record I. Said Ph.D. defense 12/21/2015 2/50
  • 4. Seismic methods for Oil & Gas exploration Acquisition Processing Noise at- tenuation Demul- tiple Interpo- lation Imaging Interpretation {Subsurface image I. Said Ph.D. defense 12/21/2015 2/50
  • 5. Seismic methods for Oil & Gas exploration Acquisition Processing Interpretation Calculate seismic attributes • Dip • Azimuth • Coherence I. Said Ph.D. defense 12/21/2015 2/50
  • 6. Seismic methods for Oil & Gas exploration Acquisition Processing Interpretation Calculate seismic attributes • Dip • Azimuth • Coherence (courtesy of Total) I. Said Ph.D. defense 12/21/2015 2/50
  • 7. Reverse Time Migration (RTM) • The reference computer based imaging algorithm in the industry • Repositions seismic events into their true location in the subsurface I. Said Ph.D. defense 12/21/2015 3/50
  • 8. Reverse Time Migration (RTM) • The reference computer based imaging algorithm in the industry • Repositions seismic events into their true location in the subsurface • Sub-salt and steep dips imaging • Accurate (full wave equation (two-way)) • Requires massive compute resources (compute and storage) I. Said Ph.D. defense 12/21/2015 3/50
  • 9. RTM workflow Forward modeling (FWD) I. Said Ph.D. defense 12/21/2015 4/50
  • 10. RTM workflow Forward modeling (FWD) Backward modeling (BWD) I. Said Ph.D. defense 12/21/2015 4/50
  • 11. RTM workflow Forward modeling (FWD) Backward modeling (BWD) Imaging condition I. Said Ph.D. defense 12/21/2015 4/50
  • 12. RTM workflow Forward modeling (FWD) Backward modeling (BWD) Imaging condition I. Said Ph.D. defense 12/21/2015 4/50
  • 13. RTM workflow Forward modeling (FWD) Backward modeling (BWD) Imaging condition I. Said Ph.D. defense 12/21/2015 4/50
  • 14. The underlying theory of the RTM algorithm The RTM operator Img(x) = H 0 T 0 Sh(x, t) ∗ Rh(x, T − t)dt dh The Cauchy problem    1 c2 ∂2 u(x, t) ∂t2 − ∆u(x, t) = s(t), in Ω u(x, 0) = 0 ∂u(x, 0) ∂t = 0 Boundary condition u = 0 on ∂Ω I. Said Ph.D. defense 12/21/2015 5/50
  • 15. Finite Difference Time Domain for RTM • Finite Difference Time Domain (8th order in space, 2nd order in time) • Regular grids • Perfectly Matched Layers (PML) as an absorbing boundary condition Un+1 i,j,k = 2Un i,j,k − Un−1 i,j,k + c2 i,j,k∆t2 ∆Un i,j,k + c2 i,j,k∆t2 sn • Heavy computation (hours to days of processing time) • Terabytes of temporary data • Requires High Performance Computing I. Said Ph.D. defense 12/21/2015 6/50
  • 16. HPC solutions for RTM CPU clusters are the reference • Process large data sets across interconnected multi-core CPUs • Advanced optimization techniques (vectorization, cache blocking) Hardware accelerators and co-processors • RTM is massively parallel • GPU, FPGA, Intel Xeon Phi • Dominance of GPUs: • Huge compute power (up to 5 TFlop/s) • High memory bandwidth (up to 300 GB/s) • Possible PCI overheads (sustained bandwidth up to 12 GB/s) • Data snapshotting • MPI communications with neighbors (multi-GPU) • Limited memory capacities • A high-end GPU has only 12 GB at most • CPU based compute nodes have 128 GB • High power consumptions 400 W (CPU+GPU(not standalone)) I. Said Ph.D. defense 12/21/2015 7/50
  • 17. GPU based solutions for RTM • Possible software techniques to overcome RTM limits on GPUs: • Temporal blocking (PCI overhead) • Overlapping CPU-GPU transfers with computations (PCI overhead) • Out-of-core algorithms (memory limitation) • Extensive efforts and investments • Hardware solution with an acceptable performance/efforts trade-off? I. Said Ph.D. defense 12/21/2015 8/50
  • 18. Towards unifying CPUs and GPUs GPU main memory Dispatch units L2 CU0 L1Local memory Register file PE CU1 L1Local memory Register file PE CUN-1 L1Local memory Register file PE CPU System memory CPU0 L1 WC L2 CPUs-1 L1 WC L2 L3 FPUFPU PCI Express Bus CPU System memory CPU3 L1 WC L2 FPU CPU2 L1 WC L2 FPU CPU0 L1 WC FPU CPU1 L1 WC L2 FPU Quad-core CPU module Dispatch units CU0 TEX L1Local memory Register file PE CU1 TEX L1Local memory Register file PE CUN-1 TEX L1Local memory Register file PE Integrated GPU moduleUNB GPUmemorycontroller Memorycontroller GARLIC ONION Accelerated Processing Unit (APU) CPU+discrete GPU I. Said Ph.D. defense 12/21/2015 9/50
  • 19. Towards unifying CPUs and GPUs Strengths • No PCI Express bus • Integrated GPUs can address the entire memory • Low power processors ( 95 W TDP at most): • CPU 150 W TDP at most • GPU 300 W at most Weaknesses • Low compute power as compared to GPUs: • Kaveri APU 730 GFlop/s (integrated GPU) • Phenom CPU 150 GFlop/s • Tahiti GPU 3700 GFlop/s • An order of magnitude less memory bandwidth than GPUs: • APU up to 25 GB/s memory bandwidth • GPU 300 GB/s I. Said Ph.D. defense 12/21/2015 9/50
  • 20. Overview of contributions Architectures CPU GPU APU Data placement strategies Applications Matrix multiplication Finite difference stencils Hybrid strategy Modeling RTM Successive generations Evaluation Power efficiency Performance One nodeLarge scale Strong scaling Weak scaling Is the APU a valuable HPC solution for depth imaging that: • may be more efficient than CPU solutions? • is likely to overcome GPU based solutions limitations? Program- ming models OpenCLOpenACC I. Said Ph.D. defense 12/21/2015 10/50
  • 25. Evaluation of the APU technology Architectures CPU GPU APU Data placement strategies Applications Matrix multiplication Finite difference stencils Hybrid strategy Modeling RTM Successive generations Evaluation Power efficiency Performance One nodeLarge scale Strong scaling Weak scaling Program- ming models OpenCLOpenACC I. Said Ph.D. defense 12/21/2015 11/50
  • 26. The APU memory subsystem • Onion: coherent bus (slow) • Garlic: non coherent bus (full memory bandwidth) I. Said Ph.D. defense 12/21/2015 12/50
  • 27. The APU memory subsystem • c: regular CPU memory (size depends on the RAM) I. Said Ph.D. defense 12/21/2015 12/50
  • 28. The APU memory subsystem • g: fixed size (512 MB to 4 GB) • cg: explicit copy from CPU memory to GPU memory • gc: explicit copy from GPU memory to CPU memory I. Said Ph.D. defense 12/21/2015 12/50
  • 29. The APU memory subsystem • u: zero-copy and non coherent (read-only accesses from GPU cores) • Fixed and limited size (up to 1 GB) I. Said Ph.D. defense 12/21/2015 12/50
  • 30. The APU memory subsystem • z: zero-copy and coherent memory I. Said Ph.D. defense 12/21/2015 12/50
  • 31. The APU memory subsystem • z: zero-copy and coherent memory • Variable size (up to the maximum CPU memory size) I. Said Ph.D. defense 12/21/2015 12/50
  • 32. Data placement strategies on APU • OpenCL data copy kernel • From buffer A to buffer B • Store buffers A and B in different memory locations • Evaluate different combinations: • For example cggc (explicit copy): • zz (zero-copy): I. Said Ph.D. defense 12/21/2015 13/50
  • 33. Data placement benchmark results on APU 0 5 10 15 20 25 30 35 cggc zgc ugc zz uz time(ms)lowerisbetter Data placement strategies buffer size: 128 MB kernel(copy time) GPU-to-CPU CPU-to-GPU • Using zero-copy = 60% maximum sustained bandwidth • Select the most relevant strategies: cggc, ugc and zz I. Said Ph.D. defense 12/21/2015 14/50
  • 34. Applicative benchmarks on APU Matrix multiplication • Compute bound algorithm • Evaluate the sustained compute gap between GPUs and APUs 8th order 3D finite difference stencil • Memory bound algorithm • Building block of the Reverse Time Migration • Evaluate the APU memory performance Impact of data placement strategies on the APU performance I. Said Ph.D. defense 12/21/2015 15/50
  • 35. Finite difference stencils ∆Un i,j,k = 1 ∆x2 p/2 l=−p/2 alUn i+l,j,k+ 1 ∆y2 p/2 l=−p/2 alUn i,j+l,k+ 1 ∆z2 p/2 l=−p/2 alUn i,j,k+l, p = 8 • Compute complexity O(N3 ) • Storage complexity O(N3 ) • Data snapshotting (K ∈ [1 − 10]) I. Said Ph.D. defense 12/21/2015 16/50
  • 36. Stencils: implementation details • 2D work-item grid on the 3D domain • 1 column along the Z axis/work-item • Register blocking when traversing the Z dimension • Implementations: • scalar: global memory • local scalar: local memory to exploit memory access redundancies • vectorized: global memory + explicit vectorization • local vectorized: local memory + explicit vectorization I. Said Ph.D. defense 12/21/2015 17/50
  • 37. Stencil computations on CPU 2 4 6 8 10 12 14 16 18 64 128 256 512 1024 GFlop/shigherisbetter NxNx32 scalar vectorized local vectorized openmp • Explicit vectorization helped to deliver the best performance (SSE) • OpenCL ≥ OpenMP I. Said Ph.D. defense 12/21/2015 18/50
  • 38. Stencil computations on GPU 0 100 200 300 400 500 600 64 128 256 512 1024 GFlop/shigherisbetter NxNx32 scalar local scalar vectorized local vectorized • Scalar ≥ vectorized thanks to GCN (Graphics Core Next) • Scalar code + OpenCL local memory offered the best performance I. Said Ph.D. defense 12/21/2015 19/50
  • 39. Stencil computations on APU 10 20 30 40 50 60 70 80 90 64 128 256 512 1024 GFlop/shigherisbetter NxNx32 scalar local scalar vectorized local vectorized • Local scalar gives the best performance numbers for N ≥ 128 • Vectorization is not needed thanks to GCN I. Said Ph.D. defense 12/21/2015 20/50
  • 40. Stencils: data placement strategies • Fixed problem size (1024 × 1024 × 32) • One snapshot every K computation (1 ≤ K ≤ 10) • Select the best OpenCL implementations (scalar, local scalar) • Combine them with data placement strategies: cggc, ugc, zz 20 30 40 50 60 70 80 1 2 3 4 5 6 7 8 9 10 GFlop/shigherisbetter K computations + 1 snapshot problem size: 1024x1024x32 (128 MB) best scalar-cggc scalar-ugc scalar-zz local scalar-cggc local scalar-ugc local scalar-zz • Best: local scalar (zz) for 1 ≤ K ≤ 3 and (cggc) for 3 ≤ K ≤ 10 • Select explicit copy (cggc) and zero-copy (zz) for RTM I. Said Ph.D. defense 12/21/2015 21/50
  • 41. Stencils: performance comparison 8 16 32 64 128 256 512 1 2 3 4 5 6 7 8 9 10 GFlop/shigherisbetter K computations + 1 snapshot performance projection problem size: 1024x1024x32 (128 MB) CPU GPU APU APU(Onion=Garlic) • APU > CPU ∀K • GPU > APU, 2 ≤ K ≤ 10 • APU > GPU when performing one snapshot after each iteration I. Said Ph.D. defense 12/21/2015 22/50
  • 42. Stencils: conclusion • APU can be an attractive solution: • For a high rate of data snapshotting (finite difference) • For medium sized problems (matrix multiplication) • An order of magnitude of theoretical performance gap GPU/APU: • But only 3× to 4× only in practice • Performance only: the GPU remains the privileged solution • Power is gaining interest in the HPC community (Green500) • Power wall and Exascale • What about power consumption? I. Said Ph.D. defense 12/21/2015 23/50
  • 43. Power measurement methodology • Raritan PX (DPXR8A-16) PDU to monitor the power consumption • Performance per Watt (PPW) metric Methodology • The power drawn by the system as a whole: • Same functional hardware components for the 3 architectures • CPU+GPU for GPU based solutions • Importance of Power Supply Units (PSUs) electric efficiency I. Said Ph.D. defense 12/21/2015 24/50
  • 44. Stencils: power efficiency comparison 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1 2 3 4 5 6 7 8 9 10 GFlop/s/Whigherisbetter K computations + 1 snapshot up to 62 W up to 222 W up to 159 W problem size: 1024x1024x32 (128 MB) CPU GPU APU • CPU offers a very low power efficiency (0.08 GFlop/s/W) • APU is 13% more power efficient that the GPU • Higher gain for compute bound algorithm (matrix multiplication): • Flops consume less power than memory accesses I. Said Ph.D. defense 12/21/2015 25/50
  • 45. RTM on one HPC node Architectures CPU GPU APU Data placement strategies Applications Matrix multiplication Finite difference stencils Hybrid strategy Modeling RTM Successive generations Evaluation Power efficiency Performance One nodeLarge scale Strong scaling Weak scaling Program- ming models OpenCLOpenACC I. Said Ph.D. defense 12/21/2015 26/50
  • 46. One-node RTM GPU/APU implementations • Multiple OpenCL kernels (PML): • Reduce compute/memory divergence • Stencils study conclusions: • Stencil optimizations and auto-tuning • scalar and local scalar • Data placement strategies (APU) • Imaging condition on CPU • Evaluate: • Kernels (kernels only) • Full application (overall) Absorbing boundaries   Z   Y X Physical domain Free surface Absorbing boundaries Case study • 3D SEG/EAGE Salt velocity model • Compute grid that fits in one GPU compute node (less than 3 GB) • Selective checkpointing frequency K=10 I. Said Ph.D. defense 12/21/2015 27/50
  • 47. One-node RTM on GPU/APU kernels only (GFlop/s) overall (GFlop/s) %loss GPU 141.77 29.65 79% APU(explicit copy) 32.42 15.93 50% APU(zero-copy) 15.2 11.45 24% GPU • Best: scalar implementation • Impact of PCI+IO (snapshotting) on performance APU • Best: scalar with using explicit data copies (cggc) • Local memory is beneficial when using zero-copy memory objects I. Said Ph.D. defense 12/21/2015 28/50
  • 48. One-node RTM: performance comparison 0 20 40 60 80 100 120 140 160 GFlop/shigherisbetter one node/architecture CPU APU(zero-copy) APU(explicitcopy) GPU RTM(kernels only) RTM(overall) • Poor performance on the Phenom CPU (OpenCL) • Gap between GPU and APU: • 4.4× with kernels only • 1.8× only when considering overall I. Said Ph.D. defense 12/21/2015 29/50
  • 49. One-node RTM: power efficiency comparison 0 0.05 0.1 0.15 0.2 0.25 0.3 GFlop/s/Whigherisbetter CPU APU(zero-copy) APU(explicit copy) GPU 137 W 62 W 62 W 198 W • Performance numbers based on overall timings • Poor power efficiency on the Phenom CPU (0.013 GFlop/s/W) • APU can be more power efficient than the GPU: • 1.80× (explicit copy) • 1.23× (zero-copy) I. Said Ph.D. defense 12/21/2015 30/50
  • 50. One-node RTM: conclusion • RTM(kernels only): huge gap between APU and GPU • RTM(overall): the performance gap is reduced • Performance + power: APU is almost twice more efficient than GPU I. Said Ph.D. defense 12/21/2015 31/50
  • 51. RTM on multi-node hybrid architectures Architectures CPU GPU APU Data placement strategies Applications Matrix multiplication Finite difference stencils Hybrid strategy Modeling RTM Successive generations Evaluation Power efficiency Performance One nodeLarge scale Strong scaling Weak scaling Program- ming models OpenCLOpenACC I. Said Ph.D. defense 12/21/2015 32/50
  • 52. RTM on multi-node hybrid architectures Motivations • Real world cases generate large amounts of data ( 1 Terabyte) • Larger than one node memory capacities • Impact of MPI communications on the PCI overhead (GPU)? • Impact of zero-copy on MPI communications (APU)? Clusters (located at Total) CPU cluster APU cluster GPU cluster Number of used nodes 16 Processors/node 2×Intel Xeon CPU E5-2670 1×AMD A10-7850K 1×NVIDIA Tesla K40s (Kaveri) 1×Intel Xeon CPU E5-2680 Case study • Same velocity model, K=10 • Compute grids size 25 GB I. Said Ph.D. defense 12/21/2015 33/50
  • 53. Multi-node RTM: implementation • 3D domain decomposition • One-node study conclusions • Boundaries copied to contiguous buffers: • For GPUs, using OpenCL kernels • For GPUs, PCI memory transfers: • Communications with neighbors • I/O operations for snapshotting Z Y X   South   North   Back   East   West   Front   I. Said Ph.D. defense 12/21/2015 34/50
  • 54. Multi-node RTM: MPI overlapping Problem: ineffective non-blocking communications (initial) Isend(buf) do work(no buf) Wait() use bufProcess P0 Process P1 time Recv(buf) no progress thread Solution: explicit overlap technique (overlap) MPI communications (blocking) sync Update the domain boundaries Update the inner domain User thread Auxiliary thread time activate I. Said Ph.D. defense 12/21/2015 35/50
  • 55. Multi-CPU RTM(FWD) 0 1 2 3 4 5 initial overlap initial overlap initial overlap initial overlap initial overlap initial overlap initial overlap time(s)lowerisbetter #nodes 1 2 4 8 16 32 64 %: MPI fraction %: performance gain comm out in 1.09% 4.89% 5.69% 35.57% 43.04% 66.32% 75.93% max[in,comm] -26.96% -10.70% -16.10% 6.10% 18.20% 48.89% 44.89% perfect scaling 0 0.1 0.2 0.3 0.4 0.5 643216 43.04% 66.32% 75.93% 18.20% 48.89% 44.89% • 1 CPU node = 2 sockets (8 cores 2-way SMT each) • 16 MPI/node (32 threads (SMT)) • The overlap technique is beneficial when MPI fractions are high I. Said Ph.D. defense 12/21/2015 36/50
  • 56. Multi-GPU RTM(FWD) 0 0.02 0.04 0.06 0.08 0.1 0.12 initial overlap initial overlap time(s)lowerisbetter #nodes %: MPI fraction %: performance gain 8 16 comm d-h-comm unpack pack in+out out max[in,comm] perfect scaling 15.76% 26.49% 12.21% 14.24% • Only 2 test cases due to memory limitations • Up to 14% of performance gain (CPU dedicated to communications) • PCI overheads hinder achieving near to perfect scaling I. Said Ph.D. defense 12/21/2015 37/50
  • 57. Multi-APU(explicit copy) RTM(FWD) 0 0.1 0.2 0.3 0.4 0.5 0.6 initial overlap initial overlap time(s)lowerisbetter #nodes %: MPI fraction %: performance gain 8 16 comm d-h-comm unpack pack in+out out max[in,comm] perfect scaling 14.76% 19.03% 13.99% 18.89% • Up to 18% of performance gain (CPU dedicated to communications) • Lower overhead to copy the boundaries ⇒ near to perfect scaling I. Said Ph.D. defense 12/21/2015 38/50
  • 58. Multi-APU(zero-copy) RTM(FWD) 0 1 2 3 4 5 6 7 initial initial overlap initial overlap initial overlap initial overlap time(s)lowerisbetter #nodes %: MPI fraction %: performance gain 1 2 4 8 16 comm d-h-comm unpack pack in+out out max[in,comm] perfect scaling 0.10% 0.36% 13.42% 17.80% 0.34% 1.80% 8.35% 14.68% 0 0.2 0.4 0.6 0.8 1 168 13.42% 17.80% 8.35% 14.68% • Up to 14% of performance gain • Zero-copy ⇒ no CPU-GPU overhead + near to perfect scaling I. Said Ph.D. defense 12/21/2015 39/50
  • 59. Multi-node RTM: asynchronous I/O Synchronous data snapshotting (sync) Proposed solution (async) I. Said Ph.D. defense 12/21/2015 40/50
  • 60. Multi-CPU RTM(BWD) 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 sync async sync async sync async sync async sync async sync async sync async time(s)lowerisbetter #nodes 1 2 4 8 16 32 64 %: I/O fraction %: performance gain io img out max[in,comm] perfect scaling 24.40% 20.67% 17.83% 13.69% 4.07% 9.76% 1.74% 22.98% 18.95% 13.09% 10.29% 9.05% 11.51% -0.26% 0 0.05 0.1 0.15 0.2 0.25 0.3 4.07% 9.76% 1.74% 9.05% 11.51% -0.26% • MPI processes pinning • Background engine for asynchronous I/O (auxiliary thread) • Asynchronous I/O is beneficial for low nodes count only • High nodes count: compute nodes are overused I. Said Ph.D. defense 12/21/2015 41/50
• 61. Multi-GPU RTM(BWD)
(chart: time (s), lower is better, sync vs. async, for 8 and 16 nodes; I/O fractions of 34.17% and 46.93%, async gains of 33.73% and 40.23%; legend: io, img, dtoh-io, d-h-comm, unpack, pack, out, max[in,comm], perfect scaling)
• Short kernel times ⇒ high I/O fraction
• Up to 40% performance gain (CPU fully dedicated to MPI+I/O)
• PCI overhead for I/O and communications with neighbors
I. Said Ph.D. defense 12/21/2015 42/50
• 62. Multi-APU(explicit copy) RTM(BWD)
(chart: time (s), lower is better, sync vs. async, for 8 and 16 nodes; I/O fractions of 13.06% and 13.57%, async gains of 11.66% and 13.33%; legend: io, img, dtoh-io, d-h-comm, unpack, pack, out, max[in,comm], perfect scaling)
• Asynchronous I/O offers up to 13% performance gain
• Lower CPU-GPU data copy overhead
I. Said Ph.D. defense 12/21/2015 43/50
• 63. Multi-APU(zero-copy) RTM(BWD)
(chart: time (s), lower is better, sync vs. async, for 1 to 16 nodes; I/O fractions between 9.21% and 11.58%, async gains between 4.51% and 9.34%; legend: io, img, dtoh-io, d-h-comm, unpack, pack, out, max[in,comm], perfect scaling)
• Asynchronous I/O offers up to 9% performance gain
• Zero-copy memory = no CPU-GPU data copies prior to I/O
I. Said Ph.D. defense 12/21/2015 44/50
• 64. Multi-node RTM: performance comparison
(chart: time (s), lower is better, at 16 nodes per cluster: CPU, APU(zero-copy), APU(explicit copy), GPU)
• APU cluster (explicit copy) > CPU cluster:
• 1.6× node to node (2 CPUs vs. 1 APU)
• 3.2× socket to socket (1 CPU vs. 1 APU)
• GPU cluster > APU cluster (explicit copy) by 3.5×
• GPU cluster > APU cluster (zero-copy) by 8.3×
• APU cluster (explicit copy) > APU cluster (zero-copy) by 2.3×
I. Said Ph.D. defense 12/21/2015 45/50
• 65. Multi-node RTM: estimated power efficiency
(chart: time (s), lower is better, within a 1600 W budget: CPU with 8 nodes, APU(zero-copy) with 16 nodes, APU(explicit copy) with 16 nodes, GPU with 4 nodes; hardware cost labels of $3200 and $12000)
• Power budget of 1600 W (based on TDP and maximum power consumption)
• APU cluster (zero-copy) > CPU cluster
• APU cluster (explicit copy) = GPU cluster
I. Said Ph.D. defense 12/21/2015 46/50
• 66. Conclusions
• Evaluation of the APU technology:
• Performance standpoint: GPU > APU
• Performance + power: the APU becomes an attractive solution
• Importance of data placement strategies
• One-node RTM study:
• The same conclusions (APU evaluation) were confirmed
• Multi-node RTM study:
• GPU/APU: I/O and communications account for a high fraction of run times
• GPU/APU: overlapping I/O and communications is mandatory
• Kaveri APU: 3.2× speedup over the Intel Xeon E5-2670
• Kaveri APU falls behind the NVIDIA Tesla K40s GPU by 3.5×
• APU = GPU in power efficiency
I. Said Ph.D. defense 12/21/2015 47/50
• 67. Conclusions on programming models
• 3 OpenACC-based solutions:
• OpenACC only
• OpenACC+HMPPcg (extension to HMPP provided by CAPS)
• OpenACC+code modification
            APU (GFlop/s)   GPU (GFlop/s)   #LOC
  OpenACC        17.61           77.55        34
  OpenCL         32.42          141.77       779
• OpenACC+HMPPcg offers the best directive-based performance
• OpenACC+HMPPcg provides only half the OpenCL performance
• But 26× fewer lines of code (LOC) (see the illustrative loop below)
I. Said Ph.D. defense 12/21/2015 48/50
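To make the line-count argument concrete, the fragment below is an illustrative directive-based loop, not one of the thesis kernels: a simple 2nd-order 3D Laplacian annotated with OpenACC, showing how a few pragmas replace the explicit buffer management and kernel code that the OpenCL version requires.

```c
/* Illustrative OpenACC example (generic sketch, hypothetical function name):
   a 2nd-order 3D Laplacian offloaded with a single directive. */
void laplacian_acc(const float *restrict u, float *restrict lap,
                   int nx, int ny, int nz)
{
    #pragma acc parallel loop collapse(3) \
                copyin(u[0:nx*ny*nz]) copyout(lap[0:nx*ny*nz])
    for (int k = 1; k < nz - 1; k++)
        for (int j = 1; j < ny - 1; j++)
            for (int i = 1; i < nx - 1; i++) {
                size_t c = ((size_t)k * ny + j) * nx + i;     /* center point      */
                lap[c] = u[c - 1] + u[c + 1]                  /* X neighbors       */
                       + u[c - nx] + u[c + nx]                /* Y neighbors       */
                       + u[c - (size_t)nx * ny]               /* Z neighbors       */
                       + u[c + (size_t)nx * ny]
                       - 6.0f * u[c];
            }
}
```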
• 68. Perspectives
• Directive-based approach for multi-node RTM
• Upcoming APU roadmap:
• Full memory unification (at the hardware level)
• HBM (High Bandwidth Memory) + increased compute unit counts
• OpenPOWER and NVLink
• More complex and realistic RTM algorithms:
• Adding anisotropy
• Elastic media
I. Said Ph.D. defense 12/21/2015 49/50
• 69. Thank you for your attention, questions?
List of publications
• H. Calandra, R. Dolbeau, P. Fortin, J.-L. Lamotte, I. Said, Assessing the relevance of APU for high performance scientific computing, AMD Fusion Developer Summit (AFDS), 2012.
• H. Calandra, R. Dolbeau, P. Fortin, J.-L. Lamotte, I. Said, Evaluation of successive CPUs/APUs/GPUs based on an OpenCL finite difference stencil, 21st Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 2013.
• H. Calandra, R. Dolbeau, P. Fortin, J.-L. Lamotte, I. Said, Forward seismic modeling on AMD Accelerated Processing Unit, Rice Oil & Gas HPC Workshop, 2013.
• P. Eberhart, I. Said, P. Fortin, H. Calandra, Hybrid strategy for stencil computations on the APU, 1st International Workshop on High-Performance Stencil Computations, 2014.
• F. Jézéquel, J.-L. Lamotte, I. Said, Estimation of numerical reproducibility on CPU and GPU, Federated Conference on Computer Science and Information Systems, 2015.
• I. Said, P. Fortin, J.-L. Lamotte, H. Calandra, Leveraging the Accelerated Processing Units for seismic imaging: a performance and power efficiency comparison against CPUs and GPUs (submitted in October 2015 to an international journal).
• I. Said, P. Fortin, J.-L. Lamotte, H. Calandra, Efficient Reverse Time Migration on APU clusters, Rice Oil & Gas HPC Workshop, 2016 (submitted in November 2015).
I. Said Ph.D. defense 12/21/2015 50/50
• 70. APU generations: FD performance
(chart: GFlop/s as a function of K, where each run performs K computation steps plus 1 snapshot, for the Llano, Trinity and Kaveri APUs, each with a compute-only reference)
• 71. Weak scaling: multi-CPU RTM
(chart: time (s) for 1 to 64 nodes, with fwd-sync, fwd-async, bwd-sync and bwd-async runs and percentage labels per run; legend: io, out, max[in,comm], img, fwd perfect scaling, bwd perfect scaling)
I. Said Ph.D. defense 12/21/2015 51/50
• 72. Weak scaling: multi-GPU RTM
(chart: time (s) for 1 to 16 nodes, with fwd-sync, fwd-async, bwd-sync and bwd-async runs and percentage labels per run; legend: io, dtoh-io, d-h-comm, unpack, pack, out, max[in,comm], img, fwd perfect scaling, bwd perfect scaling)
I. Said Ph.D. defense 12/21/2015 52/50
• 73. Weak scaling: multi-APU RTM (explicit copy)
(chart: time (s) for 1 to 16 nodes, with fwd-sync, fwd-async, bwd-sync and bwd-async runs and percentage labels per run; legend: io, dtoh-io, d-h-comm, unpack, pack, out, max[in,comm], img, fwd perfect scaling, bwd perfect scaling)
I. Said Ph.D. defense 12/21/2015 53/50
• 74. Weak scaling: multi-APU RTM (zero-copy)
(chart: time (s) for 1 to 16 nodes, with fwd-sync, fwd-async, bwd-sync and bwd-async runs and percentage labels per run; legend: io, dtoh-io, d-h-comm, unpack, pack, out, max[in,comm], img, fwd perfect scaling, bwd perfect scaling)
I. Said Ph.D. defense 12/21/2015 54/50
• 75. Estimated production throughput
(chart: time (s); bars 1-cpu-fwd/bwd, 1-apu-zz-fwd/bwd, 8-apu-cggc-fwd/bwd, 8-gpu-fwd/bwd; comparing 8 shots in parallel on 8 nodes against 8 shots in sequence on 8 nodes, 1 shot/8 nodes)
• Loss of parallel efficiency as the node count increases
• GPU cluster > APU cluster (zero-copy) by 7.6× (8.3×)
• APU cluster (explicit copy) > APU cluster (zero-copy) by 2× (2.3×)
I. Said Ph.D. defense 12/21/2015 55/50